Last week, we explored how AI systems learned to understand the world by reading about it—how "stochastic parrots" built internal models of reality from text descriptions alone. But we've reached the limits of what text can teach us.
The easy data is gone. AI models have consumed most of the world's books, articles, and web pages. To build AI systems that can truly interact with the world—that can drive cars, control robots, or understand human behavior—we need data about how the world actually works, not just how people describe it.
This shift is creating a new competitive landscape. The companies that will dominate aren't necessarily those with the most compute or smartest researchers alone. They're the companies that can combine sophisticated infrastructure, algorithmic breakthroughs, and behavioral data collection at massive scale—turning every customer interaction into training data for increasingly sophisticated AI systems.
We're learning from our exhaust—the behavioral traces we leave behind as we use technology. And the companies that master this approach are building unbeatable competitive advantages.
The Platform Advantage Precedent
To understand where AI is heading, we need to look at how Google Search became unbeatable. It wasn't because Google had better algorithms or more servers—though they did. It was because they figured out how to turn every search into training data for better search.
When you search for "best pizza in Austin" and click on the third result, you're not just finding information—you're teaching Google's algorithm that the third result was more relevant than the first two. When you immediately hit the back button and try a different result, you're signaling that your initial choice was wrong. When millions of people perform similar searches and make similar choices, those behavioral signals become the foundation for understanding what "relevance" actually means.
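The mechanism above can be sketched as a toy aggregation over click logs (this is an illustration of implicit-feedback scoring in general, not Google's actual ranking system): results the user scanned past count as impressions, and the clicked result earns credit, so click-through rate among examined results becomes a rough relevance estimate.

```python
from collections import defaultdict

def relevance_from_clicks(click_log):
    """Estimate per-(query, result) relevance from click behavior.

    click_log: list of (query, ranked_results, clicked_index) tuples.
    A result shown at or above the clicked one counts as an impression:
    the user saw it and either chose it or passed it over.
    """
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for query, results, clicked in click_log:
        for rank, doc in enumerate(results):
            if rank <= clicked:  # user scanned at least this far down
                impressions[(query, doc)] += 1
        clicks[(query, results[clicked])] += 1
    # Click-through rate among examined results approximates relevance.
    return {key: clicks[key] / impressions[key] for key in impressions}

# Two users click the third result, one clicks the first.
log = [
    ("best pizza austin", ["a.com", "b.com", "c.com"], 2),
    ("best pizza austin", ["a.com", "b.com", "c.com"], 2),
    ("best pizza austin", ["a.com", "b.com", "c.com"], 0),
]
scores = relevance_from_clicks(log)
```

At scale, this kind of aggregate signal is exactly why volume matters: the estimates only become reliable once millions of users have voted with their clicks.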
This created a self-reinforcing feedback loop: the product got better with every use. More users meant more behavioral data. More behavioral data meant better search results. Better search results attracted more users. It was a flywheel effect powered by human behavior.
But here's what made this truly unbeatable: the data Google needed couldn't be purchased, scraped, or replicated by competitors. You can't buy "user behavior data" from a vendor. You can't simulate the complex patterns of how millions of people actually search for information. The only way to get this data is to have millions of people actually using your search engine.
This is the platform advantage in its purest form: distribution advantage becomes data advantage, which becomes product advantage, which reinforces distribution advantage. Once this cycle reaches sufficient scale, it becomes nearly impossible for competitors to break in.
The numbers tell the story. Google processes approximately 8.5 billion searches per day—that's 8.5 billion behavioral data points every single day. Microsoft's Bing, despite being backed by one of the world's most technically sophisticated companies, handles a fraction of that volume with just 3.4% global market share.
Microsoft has spent over a decade and billions of dollars trying to compete with Google Search. They've hired world-class engineers, built sophisticated algorithms, and even integrated cutting-edge AI through ChatGPT. Yet Bing's market share has barely budged. The reason isn't technical capability—it's data volume.
Google's search dominance isn't just about having a better algorithm—it's about having access to behavioral data that no competitor can match. Every query, every click, every refinement teaches the system something about human information-seeking behavior that can't be learned any other way. With 8.5 billion daily opportunities to learn versus Bing's few hundred million, Google's advantage compounds every single day.
Now, this same dynamic is playing out across AI development. But instead of search queries, companies are collecting the behavioral data needed to train AI systems that can understand and interact with the physical world. And just like with search, the companies with the largest user bases have an almost insurmountable advantage.
The question isn't whether this approach will work—Google already proved it does. The question is which companies will successfully apply this model to AI development, and whether anyone can compete with them once they do.
The Jump-Start and Scale Pattern
The most successful AI companies have discovered a two-phase strategy for building behavioral data advantages: subsidize the expensive jump-start phase, then leverage existing platforms for massive scale.
Tesla pioneered this approach with Autopilot. In the early days, Tesla engineers drove test vehicles to collect initial training data—an expensive, labor-intensive process that most startups couldn't afford. But Tesla didn't stop there. They embedded data collection directly into their product, turning every Tesla owner into a data collector.
Today, Tesla vehicles have logged over 3 billion miles of real-world driving data. But here's the genius: every Tesla runs two Full Self-Driving systems simultaneously. One controls the car while a second runs in "shadow mode," making its own decisions and comparing them to human driver behavior. When a driver takes over from Autopilot or makes a different choice than the AI would have made, that becomes training data.
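The shadow-mode pattern described above can be sketched in a few lines. Everything here is illustrative (function names, the steering threshold, the log format are assumptions, not Tesla's implementation): a candidate model predicts in parallel with the human driver, and moments of disagreement are captured as supervised training examples.

```python
DISAGREEMENT_THRESHOLD = 0.15  # steering difference, radians (assumed value)

def shadow_step(model_predict, sensor_frame, human_steering, log):
    """Run the shadow model on one frame and harvest disagreements.

    The model never controls the car here; it only predicts. When the
    human's action diverges from the prediction, the frame plus the
    human's action is saved as a training example.
    """
    predicted = model_predict(sensor_frame)
    if abs(predicted - human_steering) > DISAGREEMENT_THRESHOLD:
        log.append({"frame": sensor_frame, "label": human_steering})
    return predicted

# Usage: a stand-in model that always predicts "straight ahead".
log = []
straight = lambda frame: 0.0
shadow_step(straight, {"cam": "frame-1"}, human_steering=0.4, log=log)   # disagreement -> logged
shadow_step(straight, {"cam": "frame-2"}, human_steering=0.05, log=log)  # agreement -> skipped
```

The design choice worth noticing: only disagreements are logged, so the fleet automatically filters billions of boring miles down to the hard cases where the model still differs from human judgment.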
Tesla customers pay tens of thousands of dollars for the vehicle and an additional premium for the Full Self-Driving software, essentially paying for the privilege of generating training data that makes Tesla's AI systems smarter. The more Tesla vehicles on the road, the better the AI becomes. The better the AI becomes, the more attractive Tesla vehicles become. It's the search engine flywheel applied to autonomous driving.
Amazon followed a similar pattern with Alexa. They started by building test homes and paying participants to live in them, capturing how voices actually sound in real environments—how "turn on the lights" sounds different when shouted from the kitchen versus whispered from the bedroom. This initial data collection was expensive and didn't scale.
But once Alexa launched, Amazon leveraged their platform advantage. Millions of customers began generating behavioral data about how people actually interact with voice assistants. Every question asked, every command given, every misunderstood request became training data for better natural language processing.
The pattern is clear: Big Tech companies can afford to subsidize the expensive initial data collection phase that would bankrupt most startups. Then they use their existing customer bases and distribution advantages to scale data collection to levels that competitors simply cannot match.
Startups might innovate faster or build better initial prototypes, but they can't compete with the scale advantages that come from having millions of paying customers generating training data.
The First-Person Data Race
The next frontier in AI development is first-person data—understanding the world from a human perspective. And the companies positioning themselves to win this race are already following the jump-start and scale playbook.
Meta's Ego4D project represents the jump-start phase. They recruited over 700 participants across 74 locations to wear cameras and live their lives while being recorded, generating 3,700+ hours of first-person video. This data captures something no static dataset can: how humans actually navigate spaces, reach for objects, and coordinate multiple tasks simultaneously.
But 700 participants generating a few thousand hours of data is just the beginning. Meta and Google are now preparing to scale this approach through consumer AR and VR glasses. These devices will almost certainly be heavily subsidized—not because the companies want to lose money on hardware, but because the first-person data streams they generate are worth far more than the cost of the devices.
Think about the business model: instead of paying research participants to wear cameras for a few hours, these companies will soon have millions of people wearing their devices throughout their daily lives. Every glance, every gesture, every interaction with the physical world becomes training data for the next generation of AI systems.
The competitive dynamic is stark. Whoever convinces millions of people to wear their cameras will have access to embodied data that no competitor can match. Startups can't compete because they can't subsidize hardware at the scale needed to build meaningful datasets. A startup might build better AR glasses, but they can't afford to sell them at a loss to millions of users just to collect training data.
This is the platform advantage taken to its logical conclusion: the companies with the deepest pockets and largest distribution networks will capture the data needed to build AI systems that truly understand the physical world.
The Coding Agent Hierarchy
Nowhere is the power of subsidization clearer than in AI coding assistants, where the economic hierarchy reveals who can afford to compete on price.
At the top sits Cursor, a startup that wraps Anthropic's models in an excellent developer interface. But Cursor faces a fundamental problem: they pay Anthropic for every API call, then must pass those costs on to users with a markup to survive. Their Pro plan costs $20/month and includes "at least $20 of agent model inference at API prices"—essentially charging users for API costs plus overhead.
One level down, Anthropic's Claude Code can undercut Cursor entirely. They offer unlimited coding assistance for $20/month as part of their Pro plan, making Cursor's per-usage model unsustainable. Why pay Cursor's premium when you can get unlimited access directly from the model creator?
But even Anthropic gets undercut by Google's Gemini CLI, which offers 1,000 free coding requests per day. Google's AI may not match Claude's sophistication today, but free beats $20/month for most developers. Google calls this the "industry's largest usage allowance" and requires only a personal Google account.
Google isn't offering this service out of altruism. They're collecting something far more valuable than subscription fees: real-world data about how developers actually use AI tools. Every query reveals patterns that no amount of synthetic data could capture—how developers iterate on code, what problems they encounter, how they refine prompts when the first attempt fails.
This hierarchy illustrates the brutal economics of competing against platform companies. Cursor must charge premium prices to cover API costs and survive as a business. Anthropic can undercut them with unlimited plans since they control the underlying models. Google can undercut everyone with free access, regardless of current quality, because they're not trying to make money from subscriptions—they're investing in data collection that will make their next-generation systems unbeatable.
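The tiered economics above reduce to back-of-the-envelope arithmetic. A minimal sketch follows; the request volume, Cursor's included-request count, and the overage rate are illustrative assumptions, not published figures (only the $20/month price points come from the text above).

```python
REQUESTS_PER_DAY = 200
DAYS = 30

def monthly_cost(plan, requests):
    """Rough monthly cost to a developer under each pricing tier."""
    if plan == "cursor":
        included = 400       # assumed: $20 of inference covers ~400 requests
        overage_rate = 0.05  # assumed: $0.05/request at API prices beyond that
        return 20.0 + max(0, requests - included) * overage_rate
    if plan == "claude_pro":
        return 20.0          # flat $20/month, unlimited usage
    if plan == "gemini_cli":
        return 0.0           # free up to 1,000 requests/day
    raise ValueError(plan)

requests = REQUESTS_PER_DAY * DAYS  # 6,000 requests/month
costs = {p: monthly_cost(p, requests)
         for p in ("cursor", "claude_pro", "gemini_cli")}
```

Whatever the exact numbers, the ordering is structural: a wrapper that pays for API calls must charge more than the model owner's flat plan, which in turn cannot beat free.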
The Subsidization Signal
The pattern is now unmistakable across every domain of AI development. When you see Big Tech companies offering AI services for free or at unsustainably low prices, they're not being generous—they're collecting behavioral data that will make their next-generation systems unbeatable.
Google's 8.5 billion daily searches, Tesla's 3 billion miles of driving data, Amazon's millions of Alexa interactions—these aren't just products serving customers. They're data engines that get smarter with every use, creating competitive advantages that compound daily.
The companies with existing platforms can subsidize the expensive jump-start phase, then leverage their distribution advantages to collect behavioral data at scales that would bankrupt any competitor. Startups may out-innovate them early, but they cannot match the volume of behavioral data that millions of paying customers generate.
This is why size has a quality of its own in AI development. It's not just about having more resources—it's about having access to the behavioral traces that only massive user bases can generate. We're learning from our exhaust, and the companies that best capture these traces are building the unbeatable AI systems of tomorrow.