When Machines Get Hungry: The Race to Feed AI's Growing Data Appetite
Explore the critical challenges and solutions in AI's quest for quality data as we approach a potential scarcity crisis.
In the quest to build ever-more-powerful AI, we're facing a bottleneck: we're running out of data. Even though humanity produces an estimated 402.74 million terabytes of data every day, by 2032 we may have exhausted the world's supply of human-generated public text.
The capabilities of today's large language models (LLMs) rest on a key discovery: scale matters. As these models grow in size and are fed increasingly vast amounts of data, their performance improves dramatically, often in ways that exceed expectations.
This "scaling hypothesis" has become the cornerstone of LLM development strategy. Companies like OpenAI, Google, and Anthropic have been locked in an arms race, each striving to create larger models trained on more extensive datasets.
Epoch AI, a research organization focused on the long-term impacts of artificial intelligence, estimates that the total stock of human-generated public text – the primary fuel for these language models – amounts to around 300 trillion tokens. If current trends in AI development continue, Epoch AI predicts that language models will fully utilize this stock of text between 2026 and 2032.
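To see why that 2026 to 2032 window is plausible, here's a back-of-envelope calculation (my own illustrative assumptions, not Epoch AI's model): suppose the largest training runs start from roughly 15 trillion tokens, in the ballpark of what Meta reported for Llama 3, and that training-set sizes grow about 2.5x per year.

```python
# Back-of-envelope, with illustrative assumptions (not Epoch AI's model):
# largest runs use ~15T tokens today and dataset size grows ~2.5x per year.
stock = 300e12   # Epoch AI's estimate of human-generated public text, in tokens
tokens = 15e12   # roughly the reported Llama 3 training corpus
year = 2024

while tokens < stock:
    tokens *= 2.5  # assumed annual growth in training-set size
    year += 1

print(year)  # -> 2028 under these assumptions, inside Epoch's 2026-2032 window
```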
In this article, we'll explore the sources of data used to train today's most advanced models and examine the issues surrounding data acquisition. From web scraping and social media mining to synthetic data generation and licensing agreements, we'll investigate the varied strategies companies are employing to feed their data-hungry AIs.
"Publicly Available" Doesn't Mean "Public Domain"
In the world of AI data, the line between what's publicly accessible and what's legally usable is blurry. This distinction came into focus on March 13, 2024, when OpenAI's CTO Mira Murati sat down for an interview with the Wall Street Journal's Joanna Stern.
The conversation centered on OpenAI's latest breakthrough, Sora, a text-to-video model that's been making waves in the tech community. When pressed about the data sources used to train Sora, Murati's response was notably vague: "We used publicly available data and licensed data."
Stern, not satisfied with this broad statement, probed further. She asked specifically about the use of data from popular social media platforms like YouTube, Instagram, or Facebook. Murati's reply was even more ambiguous:
"I'm actually not sure about that," she said, before adding, "You know, if they were publicly available — publicly available to use. But I'm not sure. I'm not confident about it."
This exchange highlights a critical issue in AI development: the misconception that "publicly available" is synonymous with "public domain." It's a distinction that many in the tech industry seem to be grappling with, or perhaps, conveniently overlooking.
The term "publicly available" is a clever framing that skirts the legality of data usage. The fact that content is accessible on the internet does not automatically grant permission to use it for training AI models. Copyright laws, terms of service agreements, and individual content creators' rights all play a role in determining what can and cannot be used for AI training.
For instance, a viral video on YouTube might be viewable by anyone with an internet connection, making it "publicly available." However, the video's creator still holds the copyright, and YouTube's terms of service explicitly prohibit unauthorized downloading or use of content. Using such content for AI training without permission could potentially infringe on these rights.
The Social Media Loophole
In the universe of AI training data, social media platforms have emerged as a gold mine. However, this treasure trove comes with a catch: it's riddled with copyrighted material.
Social media sites are awash with content shared by millions of users daily. From personal photos and videos to snippets of copyrighted music and movie quotes, these platforms host a vast array of material, much of which is protected by copyright laws.
Section 230 of the Communications Decency Act shields social media platforms from liability for most content posted by their users. Copyright, however, falls outside that shield: under the Digital Millennium Copyright Act's safe-harbor provisions, platforms must remove copyrighted material when notified by the rightful owners.
This requirement has spawned a relentless cat-and-mouse game between copyright holders and social media companies. Content is uploaded, flagged, removed, and re-uploaded in an endless cycle. The sheer volume of content makes comprehensive enforcement nearly impossible, with only the most vigilant and well-resourced companies able to consistently protect their intellectual property in this digital Wild West.
It's within this chaotic ecosystem that AI companies have found a loophole. By scraping data from social media platforms, they can claim to be using "publicly available" information.
In reality, their data collection nets are likely pulling up significant amounts of copyrighted material shared without the original creator's permission. A viral video clip, a popular meme image, or even a catchy phrase could all be copyrighted content that ends up in an AI training dataset.
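It's worth being concrete about how thin the "publicly available" check really is. The only machine-readable permission signal most crawlers consult is robots.txt, which governs crawler access, not copyright. A minimal sketch (the bot name and video URL are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# robots.txt tells a crawler whether the site operator permits fetching a
# path. It says nothing about who owns the content behind that path.
robots = RobotFileParser()
robots.set_url("https://www.youtube.com/robots.txt")
robots.read()

allowed = robots.can_fetch("MyResearchBot", "https://www.youtube.com/watch?v=example")
print(allowed)  # crawl permission only; the uploader's copyright is untouched
```

A page can be perfectly crawlable under robots.txt and still be someone's fully copyrighted work; the two systems simply don't talk to each other.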
Data Laundering
In the murky waters of AI data acquisition, "data laundering" has emerged as another strategy. This technique allows AI companies to maintain a veneer of ethical data practices while still benefiting from vast amounts of potentially copyrighted material.
Intersecting AI defines data laundering as a process where:
"An entity (like a university) gets an exemption (either under a law or in an agreement with the data provider) or otherwise collects data for a nonprofit and/or research use, but then that same entity licenses the data in such a way that others (particularly for-profits) can use it, so the end result is the same as if the for-profit had collected the data itself. In a way, the entity is just doing free labor and assuming some risk for the for-profit."
This practice allows AI companies to keep their hands clean while still accessing valuable, often copyrighted, data. A prime example is the use of YouTube content for AI training.
YouTube is a treasure trove of diverse content, including educational videos from institutions like Khan Academy, MIT, and Harvard, news clips from respected outlets like The Wall Street Journal, NPR, and the BBC, and popular entertainment snippets from shows like The Late Show With Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live. All of this content is protected by copyright and, on the surface, off-limits for AI training purposes.
Enter EleutherAI, a grassroots, not-for-profit research group. They created a dataset called YouTube Subtitles by scraping this content, which was then included in their larger "The Pile" dataset. This dataset has been used by a who's who of AI model companies, including tech giants like Apple, Nvidia, Salesforce, Bloomberg, Databricks, and Anthropic, to train their models.
When questioned about the use of potentially copyrighted data from The Pile, Anthropic's response was telling:
"YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point about potential violations of YouTube's terms of service, we'd have to refer you to the Pile authors."
This statement exemplifies the data laundering strategy in action. AI companies can use these collected datasets while deflecting responsibility for the data collection process to the research groups that assembled them.
What makes this situation even more complex is that some of these research groups receive financial support from the very AI companies that later use their datasets.
Licensing Content
As AI development becomes big business, leading companies are pivoting towards clean, licensed data for model training. This shift marks a significant change in how the industry approaches data acquisition.
OpenAI, one of the frontrunners in the AI race, has taken a proactive stance by establishing a preferred publisher program. This initiative involves licensing content from reputable sources such as the Associated Press, Le Monde, Axel Springer, and the Financial Times, among others.
Adobe is taking a similar approach in the realm of video AI. Reports suggest that the company is purchasing individual video clips for $3 each to train its video models.
In its S-1 filing for its initial public offering, Reddit disclosed a significant data licensing deal:
"In January 2024, we entered into certain data licensing arrangements with an aggregate contract value of $203.0 million and terms ranging from two to three years. We expect a minimum of $66.4 million of revenue to be recognized during the year ending December 31, 2024 and the remaining thereafter."
Synthetic Data: The AI Ouroboros
As the hunt for training data intensifies, the AI industry is turning to a new solution: using AI to create data for AI. This approach, known as synthetic data generation, is gaining traction as a potential answer to the looming data shortage.
Stanford's Alpaca project exemplifies this trend. In 2023, researchers used OpenAI's text-davinci-003 model to generate 52,000 training examples for a mere $500. This cost-effective approach demonstrated the potential of AI-generated datasets as a viable alternative to traditional data collection methods.
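To make the mechanics concrete, here is a minimal sketch in the spirit of that pipeline (not Stanford's actual code; the model name and prompt wording are illustrative): seed tasks are fed to a teacher model, which is asked to invent and answer new ones.

```python
# A sketch of self-instruct-style data generation. Assumes the openai
# package and an OPENAI_API_KEY in the environment; model choice is illustrative.
import json
from openai import OpenAI

client = OpenAI()

seed_tasks = [
    "Explain photosynthesis to a ten-year-old.",
    "Write a haiku about autumn rain.",
]

def generate_pair(seed: str) -> dict:
    """Ask the teacher model for one new instruction plus its answer."""
    prompt = (
        f"Here is an example task: {seed}\n"
        "Invent one new, different task and answer it. "
        'Reply as JSON: {"instruction": ..., "output": ...}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

synthetic_dataset = [generate_pair(s) for s in seed_tasks]
print(synthetic_dataset[0]["instruction"])
```

Run across thousands of seeds, with deduplication and quality filtering on top, a loop like this is essentially how tens of thousands of training examples can be produced for a few hundred dollars.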
Similarly, the Vicuna model was fine-tuned using ChatGPT conversations posted on ShareGPT.com, showcasing how publicly shared AI interactions can be repurposed for further model development. These innovative approaches are just the tip of the iceberg.
WizardLM, for instance, employs a novel technique that uses an initial set of human instructions to create a large corpus of complex training data via an existing language model. In June 2024, Tencent's AI lab proposed a method that leverages persona perspectives to create diverse datasets, potentially addressing issues of bias and limited viewpoints in AI training data.
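The core of the WizardLM approach can be sketched in a few lines (my paraphrase of the Evol-Instruct idea, not the paper's actual prompts): each round asks an existing model to rewrite an instruction into a harder variant, so a small human-written seed set compounds into a large, complex corpus.

```python
# A rough sketch of instruction evolution. `ask_llm` stands in for any
# text-in, text-out completion function wrapping a hosted model.
EVOLVE_TEMPLATE = (
    "Rewrite the following instruction into a more complex version "
    "by adding one extra constraint or requirement. Keep it answerable.\n"
    "Instruction: {instruction}\n"
    "Rewritten instruction:"
)

def evolve(instruction: str, rounds: int, ask_llm) -> list[str]:
    """Produce progressively harder variants of one seed instruction."""
    variants = [instruction]
    for _ in range(rounds):
        variants.append(ask_llm(EVOLVE_TEMPLATE.format(instruction=variants[-1])))
    return variants
```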
Microsoft's Phi-2 model has garnered attention for its impressive performance despite its relatively small size of 2.7 billion parameters. The secret to its success? A heavy reliance on high-quality synthetic data in its training process.
However, the use of synthetic data is not without its pitfalls. As we covered in a previous post, some researchers have identified a phenomenon dubbed "Model Autophagy Disorder." This condition, where models trained on synthetic data show reduced efficacy, raises important questions about the long-term viability and potential limitations of this approach.
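The dynamic is easy to reproduce in miniature. The toy loop below (a deliberately simplified analogue of those experiments, not a reproduction of them) fits a Gaussian to data, samples a new "dataset" from the fit, and repeats; because each generation sees only the previous generation's outputs, estimation error compounds and the distribution's spread drifts away from the original.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # the original "real" data

for generation in range(10):
    mu, sigma = data.mean(), data.std()      # "train" a model on current data
    data = rng.normal(mu, sigma, size=200)   # next generation sees only synthetic samples
    print(f"gen {generation}: sigma = {sigma:.3f}")
```

Each refit loses a little information about the tails, and no fresh human data arrives to restore it. Scaled up to billion-parameter models, that is the worry behind training on the web's growing pool of AI-generated text.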
The use of synthetic data represents a fascinating turn in the evolution of AI: a kind of digital ouroboros, with AI feeding on its own outputs to grow and improve. While this approach offers a potential solution to the data scarcity problem, it also introduces new challenges.
Conclusion: The Data Frontier
As we stand at the cusp of AI's next leap, the battleground is shifting. The race isn't just about bigger models or faster chips—it's about data. Quality, quantity, and the ingenuity to acquire it ethically and efficiently.
We face a data dilemma: legal minefields, ethical quandaries, and the looming threat of scarcity. Yet, in these challenges lie opportunities for innovation. From creative licensing to synthetic data breakthroughs, the AI community is adapting rapidly.
If you're interested in this topic, I encourage you to subscribe to Intersecting AI, which covers it in depth. Many of the points in this article come from posts they've written.