The Three Laws Driving the AI Revolution: From Moore to Mosaic
Moore's Law set the stage for the semiconductor revolution; now Kaplan, Chinchilla, and Mosaic Laws are shaping the AI boom, but will they hold?
In 1965, a young engineer named Gordon Moore penned a paper that would become the cornerstone of the digital revolution. Titled "Cramming More Components onto Integrated Circuits," this seemingly modest publication in the trade journal Electronics would go on to shape the trajectory of the semiconductor industry for nearly six decades. Moore's prediction, later dubbed "Moore's Law," kickstarted a technological arms race that has brought us from room-sized computers to pocket-sized supercomputers. Now, as we stand on the cusp of another technological revolution - the era of generative AI - three new laws are emerging as the guiding principles of this brave new world. Let's explore how these laws are shaping the future of AI, from the labs of OpenAI to the boardrooms of tech giants.
The Foundation: Moore's Law and the Semiconductor Revolution
Gordon Moore, then director of R&D at Fairchild Semiconductor and later co-founder and CEO of Intel, made a bold prediction in his 1965 paper: the number of components on an integrated circuit would double every year for the next decade. A decade later, in 1975, Moore revised his forecast to a doubling every two years. This simple observation became the driving force behind the explosive growth of the semiconductor industry.
Moore's Law wasn't just a prediction; it became a self-fulfilling prophecy. Companies raced to keep up with this exponential growth, leading to rapid advancements in computing power, miniaturization, and cost reduction. This relentless pursuit of improvement gave us personal computers, smartphones, and the internet as we know it today.
As we enter the age of AI, three new laws are emerging that could have a similar impact on the development of large language models (LLMs) and generative AI. While it's too early to say if these laws will hold true in the long term, they provide a lens through which we can view the current AI boom.
Kaplan Scaling Laws: The Blueprint for Bigger, Better Models
In 2020, researchers at OpenAI were grappling with a fundamental question: How would a language model behave as it got "larger"? Their quest for answers led to the publication of a seminal paper, "Scaling Laws for Neural Language Models," which has since become known as the Kaplan Scaling Laws, named after lead author Jared Kaplan.
The Kaplan Scaling Laws identified three key variables that determine the performance of a language model:
Number of parameters in the machine learning model
Size of the training dataset
Compute required for the final training run (training compute)
In essence, the Kaplan Scaling Laws suggested that for every parameter in a model, you need about 1.7 text tokens in your training data. This insight provided a roadmap for creating increasingly powerful language models. For example, it indicated that a model with 175 billion parameters (like GPT-3) would require around 300 billion tokens of training data.
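To make that arithmetic concrete, here is a minimal sketch of the back-of-the-envelope calculation in Python. The 1.7 tokens-per-parameter figure is the rough ratio described above, not the paper's exact power-law fit, and the helper function is purely illustrative.

```python
# Rough data requirement under the ~1.7 tokens-per-parameter ratio
# described above (an approximation, not the paper's exact power-law fit).

KAPLAN_TOKENS_PER_PARAM = 1.7

def kaplan_tokens_needed(num_params: float) -> float:
    """Estimate the training tokens implied for a model with num_params parameters."""
    return num_params * KAPLAN_TOKENS_PER_PARAM

gpt3_params = 175e9  # GPT-3: 175 billion parameters
print(f"~{kaplan_tokens_needed(gpt3_params) / 1e9:.0f}B tokens")  # prints "~298B tokens"
```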
These laws helped explain why larger models like GPT-3 were so much more capable than their predecessors. They also set the stage for the next generation of even more massive models.
Chinchilla or Hoffmann Scaling Laws: Optimizing the Balance
While the Kaplan Scaling Laws suggested that bigger was always better, researchers at DeepMind uncovered a more nuanced picture in 2022. Their paper, "Training Compute-Optimal Large Language Models," became known as the Chinchilla paper (named after the model it introduced) or the Hoffmann Scaling Laws (after lead author Jordan Hoffmann).
The Chinchilla paper revealed that many existing models were actually "undertrained" - they had too many parameters relative to the amount of training data used. The researchers found that a smaller model trained on more data could outperform a larger model trained on less data, even when using the same amount of compute.
This insight led to a reformulation of the optimal ratio between model size and training data. The Chinchilla laws suggested that for optimal performance, you should use about 20 tokens of training data per parameter - a significant increase from the 1.7 suggested by the Kaplan laws.
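As a sketch of what compute-optimal sizing looks like in practice, the snippet below combines the ~20 tokens-per-parameter rule of thumb with the commonly used approximation that training takes about 6 FLOPs per parameter per token. The 6·N·D figure is an assumption on my part, not something stated in this article; the snippet simply solves for the balanced model and dataset size given a fixed compute budget.

```python
import math

# Compute-optimal split under two assumptions: D ≈ 20 * N (the rule of thumb
# described above) and C ≈ 6 * N * D training FLOPs (a common approximation).

CHINCHILLA_TOKENS_PER_PARAM = 20
FLOPS_PER_PARAM_TOKEN = 6  # rough forward + backward cost per parameter per token

def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Return (params, tokens) that roughly balance model size and data for a budget."""
    # C ≈ 6 * N * D and D ≈ 20 * N  =>  N ≈ sqrt(C / 120)
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * CHINCHILLA_TOKENS_PER_PARAM))
    tokens = CHINCHILLA_TOKENS_PER_PARAM * params
    return params, tokens

# Example: a 1e24 FLOP budget lands near a ~90B-parameter model on ~1.8T tokens.
n, d = compute_optimal_split(1e24)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```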
The implications of these findings are profound. Building ever-larger models is extremely expensive - estimates suggest that training GPT-3 cost around $10 million, while GPT-4 may have cost $100 million or more. Future "frontier" models like GPT-5 could cost billions to train. This escalating cost seemed to imply that only the largest tech companies would be able to compete in the race for more advanced AI.
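For a sense of where such cost estimates come from, here is a deliberately crude sketch: training FLOPs divided by an assumed effective accelerator throughput, multiplied by an assumed hourly price. Every number in it (30 TFLOP/s effective throughput, $3 per GPU-hour) is a placeholder chosen to be roughly 2020-era, not a figure reported by OpenAI.

```python
# Illustrative training-cost estimate: FLOPs / assumed effective throughput,
# then multiplied by an assumed $/GPU-hour. All hardware and price numbers
# below are placeholder assumptions, not reported figures.

def rough_training_cost(flops: float,
                        effective_flops_per_sec: float = 30e12,  # assumed effective throughput
                        gpu_hourly_cost: float = 3.0) -> float:  # assumed $/GPU-hour
    gpu_hours = flops / effective_flops_per_sec / 3600
    return gpu_hours * gpu_hourly_cost

# GPT-3-scale run: ~6 FLOPs/param/token * 175e9 params * 300e9 tokens ≈ 3.15e23 FLOPs
print(f"~${rough_training_cost(6 * 175e9 * 300e9) / 1e6:.0f}M")  # ≈ $9M, same ballpark as the ~$10M above
```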
Mosaic's Law: The Democratization of AI
Just when it seemed that AI development might become the exclusive domain of tech giants, a third law emerged to challenge this narrative. Naveen Rao, former CEO of MosaicML (now VP of GenAI at Databricks), observed a trend that he called "Mosaic's Law":
"A model of a certain capability will require 1/4 the [money] every year from [technological] advances. This means something that is $100m today goes to $25m next year goes to $6m in 2 yrs goes to $1.5m in 3 yrs."
This observation is supported by research published in the paper "Algorithmic progress in language models," which found that the compute required to reach a set performance threshold in language models has halved approximately every 8 months since 2012.
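A quick projection makes the two rates easy to compare. The sketch below simply applies Rao's quarter-per-year rule of thumb; for reference, a halving every 8 months works out to roughly a 2.8x reduction per year from algorithmic progress alone, so Rao's 4x figure presumably also folds in hardware and systems gains.

```python
# Projecting Rao's rule of thumb: the cost of a fixed-capability model
# falls to roughly one quarter of its previous value each year.

def mosaic_cost(initial_cost: float, years: int, yearly_factor: float = 0.25) -> float:
    """Projected cost of a fixed-capability model after `years` years."""
    return initial_cost * (yearly_factor ** years)

for year in range(4):
    print(f"Year {year}: ${mosaic_cost(100e6, year) / 1e6:.1f}M")
# Year 0: $100.0M, Year 1: $25.0M, Year 2: $6.2M, Year 3: $1.6M

# For comparison, a halving every 8 months is a factor of 2**(12 / 8) ≈ 2.83 per year.
```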
Mosaic's Law suggests that while the cost of cutting-edge AI models may be skyrocketing, the cost of producing models with a given level of capability is rapidly decreasing. This trend could democratize AI development, allowing smaller players and open-source projects to remain competitive.
The AI Landscape in Flux
As we navigate the rapidly evolving landscape of AI, these three laws - Kaplan, Chinchilla, and Mosaic - provide valuable guideposts. The Kaplan Scaling Laws showed us the power of scale in AI development. The Chinchilla Laws refined our understanding, emphasizing the importance of balancing model size with training data. And Mosaic's Law offers hope for a more diverse and competitive AI ecosystem.
Together, these laws paint a picture of an AI future that is both exciting and uncertain. While the race for bigger and more powerful models continues, rapid algorithmic improvements are simultaneously making AI more accessible. As with Moore's Law before them, these new principles are not immutable laws of nature, but rather observations of current trends that could shape the future of AI development.