Fraction AI is the data layer L2 for AI: agents and humans working together to create the highest-quality labelled data used to train specialized AI models.
Smaller and more specialized LLMs are the future of AI. They have far lower training and inference costs while also being less prone to hallucination. Training specialized LLMs requires large-scale datasets, but these are notoriously difficult to source.
Let's say you are creating an LLM to generate images focused on Kazuma Kiryu.
First, you need lots of high-quality images from different angles and viewpoints. Here's an example of a good-quality image (something you want):
And here's an example of something you don't want
You need thousands of such images, varying in focus, camera angle, position, outfit, and more. This requires hours of Google Image searching and filtering.
Then you need to label each of these images. For example, image 1a could be labelled as: "Kazuma Kiryu wearing a buttoned-down maroon shirt and staring into the camera."
After labelling the whole dataset, you're finally ready to train the model. Easy peasy, isn't it? 🙃 Now, this process is 100x more time-consuming when you're creating LLMs with sophisticated skills like writing C++ code, generating cat videos, or generating images in a whole art style.
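To make the end product concrete, here is a minimal sketch of what such a labelled image-caption dataset might look like on disk. The file names, captions, and JSONL layout are illustrative assumptions, not a prescribed format:

```python
import json

# Hand-written examples of the (image, caption) pairs described above.
# File names and captions are placeholders, not real data.
labelled_examples = [
    {
        "image": "kiryu_001.jpg",
        "caption": "Kazuma Kiryu wearing a buttoned-down maroon shirt and staring into the camera",
    },
    {
        "image": "kiryu_002.jpg",
        "caption": "Kazuma Kiryu in a grey suit, three-quarter profile, neon street lighting",
    },
]

# Caption datasets for fine-tuning are commonly stored as JSON Lines:
# one labelled example per line, thousands of lines for a usable dataset.
with open("kiryu_captions.jsonl", "w", encoding="utf-8") as f:
    for example in labelled_examples:
        f.write(json.dumps(example) + "\n")
```

Multiply this by every skill you want the model to have, and the sourcing and labelling effort adds up fast.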
While HuggingFace has enabled access to several high-quality datasets, the overall number of available datasets remains quite limited. There are web2 companies like Scale AI that provide labeling solutions, but they all suffer from several fundamental shortcomings:
High Costs: These services are primarily targeted at large enterprise tech clients, making them unaffordable for most others.
Long Turnaround Times: They operate on a reactive model, providing labeling services only upon client request.
Bring Your Own Data: These services cater to labeling existing data, so users need to provide their own datasets. This puts them out of reach for smaller organizations and individuals who may not have access to large-scale data.
Data Bias: The labeling work is typically done by a few thousand contract workers from a limited number of geographic regions, leading to inherent biases in the resulting datasets.
Data equity precedes equitable access to AI
At Fraction AI, we are creating Perpetual Datasets: massive-scale datasets built in a permissionless way by humans and AI agents. Here are some key features, with a rough sketch of how they fit together after the list:
Permissionless Dataset Creation: Anyone can start a Perpetual dataset without needing permission.
Staking and Earning Yields: Anyone can stake their participation in a dataset of their choosing and earn yields.
Rewarded Contributions: Anyone can contribute to a dataset, either themselves or through their AI agents, and get rewarded for their contributions.
Data Licensing and Network Rewards: Anyone can buy the data license, and the rewards from these purchases flow back to the participants in the network.
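To illustrate how these pieces could fit together, here is a toy Python sketch of a Perpetual Dataset's lifecycle: participants stake, contribute labelled examples themselves or via agents, buyers purchase licenses, and proceeds flow back to contributors. The class name, the pro-rata reward split, and all amounts are illustrative assumptions; the actual protocol mechanics are not specified in this post.

```python
from dataclasses import dataclass, field


@dataclass
class PerpetualDataset:
    """Toy model of a Perpetual Dataset's reward flow (illustrative only)."""
    name: str
    stakes: dict[str, float] = field(default_factory=dict)        # participant -> staked amount
    contributions: dict[str, int] = field(default_factory=dict)   # participant -> accepted examples
    reward_pool: float = 0.0                                       # license revenue awaiting payout

    def stake(self, participant: str, amount: float) -> None:
        # Anyone can stake on the dataset to participate and earn yields.
        self.stakes[participant] = self.stakes.get(participant, 0.0) + amount

    def contribute(self, participant: str, num_examples: int) -> None:
        # Contributions can come from the participant directly or from their AI agent.
        self.contributions[participant] = self.contributions.get(participant, 0) + num_examples

    def buy_license(self, price: float) -> None:
        # License purchases flow back into the pool shared by contributors.
        self.reward_pool += price

    def distribute_rewards(self) -> dict[str, float]:
        # Assumed pro-rata split by accepted contributions (a simplification).
        total = sum(self.contributions.values())
        if total == 0:
            return {}
        payouts = {p: self.reward_pool * n / total for p, n in self.contributions.items()}
        self.reward_pool = 0.0
        return payouts


# Example walk-through with made-up numbers:
ds = PerpetualDataset(name="kiryu-image-captions")
ds.stake("alice", 100.0)
ds.stake("bob", 50.0)
ds.contribute("alice", 300)     # alice labels 300 images herself
ds.contribute("bob", 100)       # bob's agent submits 100 labelled images
ds.buy_license(1_000.0)         # a buyer licenses the dataset
print(ds.distribute_rewards())  # {'alice': 750.0, 'bob': 250.0}
```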