From Golden Records to Large Language Models: The Quest to Compress Human Knowledge
Exploring the Frontiers of Knowledge Compression and the Quest to Encode Human Understanding
In 1977, as the Voyager 1 and 2 spacecraft prepared to embark on their journey, Carl Sagan and his team faced a daunting challenge: how to encapsulate the essence of human knowledge and creativity in a compact, durable format for any extraterrestrial civilization that might encounter the spacecraft in the distant future. Their answer was the Golden Record, a cosmic "message in a bottle" that aimed to compress the vast expanse of life on Earth into a golden phonograph record.
The record's contents were carefully curated: spoken greetings in 55 languages, musical selections from cultures around the world, and a treasure trove of scientific information.
The quest to efficiently encode and compress information has been a fundamental pursuit of information technology since its inception. Just as the Golden Record sought to encapsulate the essence of human knowledge, modern compression techniques aim to reduce data size while retaining essential information. This article traces the evolution of compression from its theoretical foundations in the 1940s to cutting-edge large language models (LLMs) like GPT-3, which has seemingly compressed the vast content of the web into 175 billion parameters.
Claude Shannon's Information Revolution
In the early days of telecommunications, companies like Bell Labs grappled with the challenge of efficiently transmitting information over noisy telephone lines. The goal was to maximize the amount of data that could be sent while minimizing the impact of noise.
Enter Claude Shannon, a brilliant mathematician and engineer at Bell Labs in the 1940s. Shannon approached these problems from a theoretical standpoint, seeking to understand the fundamental properties of information and its relationship to probability and uncertainty. He believed that by developing a mathematical framework for communication, he could pave the way for more efficient transmission systems.
At the core of Shannon's work were two key concepts: entropy and compression. Entropy, in this context, refers to the measure of uncertainty or randomness in a message. The more unpredictable a message is, the higher its entropy, and the more information it contains. Conversely, the more predictable a message is, the more it can be compressed without losing information.
To illustrate this idea, consider the following two sequences:
"ABCABCABCABCABCABC"
"QWERTYUIOPASDFGHJK"
The first sequence is highly predictable, repeating the pattern "ABC." As a result, it has low entropy and can be easily compressed by replacing the repeated instances with a shorthand notation, such as "6(ABC)." On the other hand, the second sequence is much more random and unpredictable, containing no pattern. This sequence has high entropy and is more difficult to compress without losing information.
Shannon realized that by calculating the entropy of a message, one could determine the minimum number of bits needed to encode it without any loss of information. This insight laid the foundation for the development of information theory, providing a mathematical framework for the design of efficient communication systems.
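As a rough illustration (a toy estimate, not how any production compressor works), here is a minimal Python sketch that computes the per-character entropy of the two sequences above from their character frequencies:

```python
import math
from collections import Counter

def empirical_entropy(message: str) -> float:
    """Estimate Shannon entropy in bits per character from character frequencies."""
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

predictable = "ABCABCABCABCABCABC"   # repeats "ABC" six times
random_like = "QWERTYUIOPASDFGHJK"   # 18 distinct characters, no repetition

print(empirical_entropy(predictable))  # ~1.58 bits/char (three equally likely symbols)
print(empirical_entropy(random_like))  # ~4.17 bits/char (eighteen equally likely symbols)
```

Note that this simple frequency count ignores the ordering of the characters; a model that notices the repeating "ABC" pattern, like the "6(ABC)" shorthand above, can compress the first sequence far below even this bound, while no model can do much with the second.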
The Evolution of Compression: From Lossless to Lossy
In the early days of personal computing, compression programs like WinZip and PKZIP were essential tools for saving disk space and speeding up file transfers. These programs allowed users to compress large files into smaller, more manageable archives, which could then be easily shared or stored. Today, file compression is so ubiquitous that it's built directly into our operating systems and web browsers, with data being seamlessly zipped and unzipped as it's transmitted across the internet.
However, the techniques used by these programs, which build on Shannon's ideas about entropy coding, are examples of lossless compression. When a file is compressed using a lossless method, the original data can be perfectly reconstructed from the compressed version. This is crucial for preserving the integrity of text documents, executables, and other files where every bit is important.
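As a quick sanity check, the sketch below uses Python's standard-library zlib module (which implements the same DEFLATE scheme used by ZIP tools) to demonstrate the defining property of lossless compression: a perfect round trip.

```python
import zlib

original = b"ABCABCABCABCABCABC" * 1000  # highly repetitive, low-entropy data

compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

print(f"{len(original)} bytes -> {len(compressed)} bytes")  # 18000 -> a few dozen bytes
assert restored == original  # lossless: every single bit is recovered
```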
But not all use cases require perfect fidelity. In many situations, it's acceptable to lose some data in exchange for greater compression ratios. This is where lossy compression comes into play. Lossy compression algorithms, such as those used for audio and video, discard some of the original data in order to achieve smaller file sizes.
Take a Zoom video call, for example. During a call, the video feed is constantly being compressed to reduce the amount of data that needs to be transmitted over the internet. This compression, typically handled by a codec such as H.264/AVC, is lossy, meaning that some visual information is discarded in the process. However, as long as the compression level is well-tuned, the loss in quality is hardly noticeable to the human eye. The result is a video stream that maintains a high level of visual fidelity while consuming far less bandwidth than an uncompressed feed.
In other words, you can lose data and still retain the information that matters.
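Here is a toy Python sketch of the same trade-off (a crude stand-in for what real audio and video codecs do far more cleverly): rounding away fine detail makes the data dramatically more compressible, but the discarded detail can never be recovered.

```python
import math
import zlib

# A smooth "signal", serialized once at full precision and once rounded to 2 decimals.
signal = [math.sin(i / 10) for i in range(2000)]

exact = ",".join(f"{x:.17f}" for x in signal).encode()
lossy = ",".join(f"{x:.2f}" for x in signal).encode()  # fine detail discarded for good

print(len(zlib.compress(exact)))  # larger: every digit of precision must be preserved
print(len(zlib.compress(lossy)))  # much smaller: the approximation is "good enough"
```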
The Compression-Learning Connection: David MacKay's Vision for Machine Learning
But what do compression and information theory have to do with machine learning and language models? According to David MacKay, a renowned British physicist and mathematician, they are "two sides of the same coin." MacKay was a polymath who made significant contributions to information theory and machine learning, emphasizing the deep connections between the principles of information processing and learning algorithms.
Why unify information theory and machine learning? Because they are two sides of the same coin. In the 1960s, a single field, cybernetics, was populated by information theorists, computer scientists, and neuroscientists, all studying common problems. Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning.
David MacKay, Information Theory, Inference, and Learning Algorithms, 2003
MacKay's vision was rooted in the recognition that both information theory and machine learning are fundamentally concerned with extracting meaningful signals from noisy data. In information theory, this is the "noisy channel" problem, which aims to accurately decode a message corrupted by errors or distortions during transmission. In machine learning, the challenge is to discern the underlying patterns or structures in real-world data, often cluttered with irrelevant or misleading information.
MacKay observed that the strategies for overcoming these challenges are remarkably similar. In both cases, the key is to develop methods for quantifying and managing uncertainty and maximizing the data's signal-to-noise ratio.
One of MacKay's key insights was that learning itself can be viewed as a form of information processing. When a machine learning algorithm is trained on a dataset, it essentially tries to compress that data into a more compact and generalizable representation. This is not unlike the process of lossy compression that we discussed in the previous section, where some information is discarded to achieve a more efficient encoding.
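As a deliberately simplified illustration of "learning as compression" (a toy, not how neural networks are trained): fitting a straight line to a thousand noisy points replaces a thousand stored values with just two learned parameters, an approximate but generalizable summary of the data.

```python
import random

# 1,000 noisy observations of an underlying linear relationship y ~ 2x + 1.
random.seed(0)
xs = [i / 100 for i in range(1000)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]

# Least-squares fit: the "learned model" is just two numbers, a slope and an intercept.
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(slope, intercept)  # roughly 2.0 and 1.0: two parameters standing in for 1,000 points
```

The fit cannot reproduce any individual point exactly, but it captures the pattern that generated them, which is the part worth keeping.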
As we'll see in the next section, these ideas are now being pushed to new frontiers with the development of large language models, which are in essence a powerful new tool for compressing and encoding the vast amounts of information contained in human language.
Language Models: Compressing Meaning, Not Just Data
So far, we've looked at compression techniques that deal with the representation of information at a fundamental level. When we compress a poem stored as a text file, we're essentially just rearranging the bits that represent the characters in the poem. The compressed version doesn't "understand" the poem any better than the original file - it's just a more compact representation of the same information.
In contrast, large language models (LLMs) like GPT-3 take a radically different approach to compression. Rather than focusing on the raw data itself, LLMs aim to compress the meaning and structure of data.
During pre-training, an LLM is fed a huge corpus of text. However, instead of storing this text verbatim, the model learns to predict the probability of the next word in a sequence based on the words that come before it. In doing so, the model builds up a representation of the patterns that characterize the source data.
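To make the idea concrete, here is a deliberately tiny sketch in the same spirit (a word-bigram counter, orders of magnitude simpler than a real LLM): the "model" ends up storing statistics about which words follow which, not the text itself.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Count how often each word follows each other word. These counts, not the
# original sentence, are what the model keeps.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_probs(prev: str) -> dict:
    counts = following[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```

An LLM does the same kind of thing at an incomparably larger scale, replacing explicit counts with billions of learned weights and conditioning on far longer contexts, which is why those weights behave like a compressed, lossy encoding of the training corpus.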
You're basically zipping the world knowledge. It's not much more than that, actually.
Arthur Mensch, CEO of Mistral AI, on training a model.
The goal of an LLM is not to perfectly reconstruct its training data but rather to capture the underlying semantics in a way that allows it to generate new output.
Put another way, imagine compressing a book with a traditional compression algorithm. The compressed file is an exact, if denser, encoding of the book's contents, allowing you to reconstruct the original text perfectly. An LLM, on the other hand, is more like a knowledgeable scholar who has read and internalized the book: they might not be able to recite it verbatim, but they can engage in deep, insightful discussions about its themes and ideas.
Because this compression is lossy, these models can sometimes generate inconsistent, biased, or factually incorrect outputs - a phenomenon known as "hallucination." However, labeling this as a mere flaw overlooks the true significance of what language models are achieving. These models capture knowledge in an unprecedented way, offering a glimpse into a new realm of possibilities.
Not Just a Message, But a Messenger
Consider the Voyager Golden Record, the inspiration behind this article. The record contains a static collection of sounds, images, and music, carefully curated to represent the essence of Earth and its inhabitants. Now, imagine if instead of sending this data in its raw form, we had trained a language model on a vast corpus of human knowledge and sent the weights of that model into space. This hypothetical "Galactic Language Model" (GLM) would not merely be a repository of information but a dynamic, interactive system capable of engaging with and responding to queries about the contents of its training data.
The GLM would be more than just a message; it would be a messenger, an ambassador of Earth's collective knowledge. It would provide context and interpretation, and even generate original insights based on the patterns and relationships it has learned. Such a model would represent a new form of knowledge compression and transmission, one that transcends the boundaries of traditional data storage and communication. I don't know what to call it, but "language model" doesn't quite capture it.