The Fine-Tuned Funhouse: Navigating the Carnival of Custom AI Models
Step right up and discover the thrills and chills of selecting the perfect custom AI model from the dizzying array of fine-tuned language models on offer.
In the race to develop AI models that can revolutionize healthcare, Google fired a powerful salvo in March 2023 with the unveiling of Med-PaLM 2. This AI language model, born from over a year of intensive training by Google's Healthcare and Life Sciences team, blew past the competition with a staggering 86.5% score on the MedQA benchmark, which approximates the grueling U.S. Medical Licensing Exam. Its predecessor, Med-PaLM, had already turned heads with a 67.6% score, but Med-PaLM 2 left it in the dust, sending waves through the AI medical community.
The announcement credited a massive team of around 97 contributors, hinting at the scale of Google's Med-PaLM endeavor. Rumor has it that the tech giant poured over $10 million into training the base PaLM model alone, with millions more invested in fine-tuning it for the medical domain.
But Google's reign at the top was short-lived. A few days ago, an underdog called OpenBioLLM-70B came out of nowhere to dethrone Med-PaLM 2 on Hugging Face's Medical LLM leaderboard. In a true David vs. Goliath story, this upstart model wasn't the product of a deep-pocketed tech titan but the work of a single prospective PhD student, Aaditya Ura.
Armed with Meta's newly released 70B Llama 3 model and a cache of domain-specific training data, Ura fine-tuned the open-source foundation into a Medical LLM. By standing on the shoulders of Meta's model instead of building from scratch, Ura claims to have produced an LLM comparable to Med-PaLM 2 without the investment put in by Google.
Note: Rankings and leaderboards aren't always truly reflective of real-world performance. I've written about the gaming of leaderboards before.
This pattern of fine-tuning models with domain-specific data to handle specialized use cases is becoming increasingly common as the number of high-quality open models grows. However, it raises an important question: When there are hundreds of potential models that could fit a particular use case, can any single model truly be considered a foundational model?
In this article, we'll explore this question, delving into the implications of the proliferation of fine-tuned models and what they could mean for the adoption of LLMs across various industries and applications.
The Rise of Open Models: A New "Best" Every Week
In the early history of Large Language Models (LLMs), it was assumed that only a few truly large language models could exist due to the immense computational resources and data required to build them. This assumption led many to believe that the development of these models would be limited to the largest companies and state actors. However, as with many aspects of the AI journey, this assumption hasn't held.
In just the first few months of 2024, we have witnessed a proliferation of advanced models available for anyone to host and run. It's important to clarify that most of these models are open-weights models, meaning that the publisher is making the model weights available but not necessarily the data and code used to produce the model. I'll explore this distinction between open-weights models and open-source models in a future article.
The chart below demonstrates that open models have not only caught up with but also exceeded GPT-3.5.
Starting with Google's Gemma on February 21, we've seen a new model released every few weeks, each one claiming to be the most advanced and capable. Databricks followed suit with DBRX on March 27, and just a week later, on April 4, Cohere released Command R+. The pace accelerated further with Mistral's Mixtral 8x22B on April 17, Meta's Llama 3 on April 18, and finally, Apple's OpenELM (Yes! Apple actually released the source code, data, and weights) on April 25. This rapid succession of releases, with each model touting impressive performance stats, has made it challenging to determine which model truly stands out as the best. However, it's clear that these open models are quickly catching up to their closed-source counterparts, and the competition among them is driving innovation at an unprecedented rate.
Data, Data Everywhere, and Lots of Models to Fine-tune
In a previous article, we discussed how foundational models can be thought of as fresh college graduates who can be molded into professional employees with education and training. In this analogy, data plays the crucial role of the curriculum that trains our models to become highly skilled and specialized.
Fortunately, the landscape of accessible data has been rapidly expanding, fueled by both commercial efforts and open-source contributions.
Platforms like Hugging Face have become treasure troves of datasets, currently hosting an impressive 138,000 resources spanning various domains, from text and images to audio and video. The diversity of these datasets is remarkable, with offerings like the T5-recipe-generation dataset containing 2 million neatly organized recipes in JSON format, making it a goldmine for training models in the culinary domain.
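For readers who want to kick the tires, here is a minimal sketch of pulling one of these datasets with Hugging Face's `datasets` library. The dataset identifier below is a hypothetical placeholder, not a specific recommendation; swap in whichever Hub dataset fits your domain.

```python
# Minimal sketch: download and inspect a dataset from the Hugging Face Hub.
# "your-org/recipe-dataset" is a placeholder identifier, not a real dataset.
from datasets import load_dataset

dataset = load_dataset("your-org/recipe-dataset", split="train")

print(dataset)      # summary of features and row count
print(dataset[0])   # first example as a plain Python dict
```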
On the commercial front, large IT companies are not shying away from using their checkbooks to gain access to private data. OpenAI has been negotiating deals with companies like Business Insider, Apple is in talks with Conde Nast, and Adobe is paying $3 per minute for video clips to train its generative AI models.
The open-source community has also been actively contributing to the data landscape. One notable effort is the RedPajama dataset by together.ai, a highly cleansed version of the Common Crawl dataset. Common Crawl is a massive collection of web page data, sourced from monthly web crawls and made available for public use. By refining this data, together.ai has created a valuable resource for model training.
But what if you don't have enough data for your specific use case? It turns out that creating synthetic data to train models is becoming increasingly easy. The Alpaca model popularized this approach, using an OpenAI GPT model to generate a synthetic instruction dataset. WizardLM took this a step further with its Evol-Instruct mechanism, which evolves simple seed prompts into more complex, higher-quality training data. Companies like Gretel are also providing tools to generate large volumes of synthetic data, making it more accessible than ever before. Even Meta leaned on the technique: in building Llama 3, it used Llama 2 to help generate training data for the text-quality classifiers that filtered Llama 3's training corpus.
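To make that concrete, here is a minimal sketch of the Alpaca-style recipe: prompt a strong "teacher" model for question/answer pairs and save them as training examples. It assumes the OpenAI Python client; the model name, prompt, and seed topics are illustrative, not a prescription.

```python
# Minimal sketch of synthetic data generation with a "teacher" model:
# ask it for Q&A pairs on seed topics, then save them as JSONL training data.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SEED_TOPICS = ["explaining lab results", "medication side effects", "sleep hygiene"]

examples = []
for topic in SEED_TOPICS:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                f"Write one realistic patient question about {topic} "
                "and a careful, plain-language answer. "
                'Return JSON with keys "instruction" and "response".'
            ),
        }],
        response_format={"type": "json_object"},
    )
    examples.append(json.loads(response.choices[0].message.content))

# Write one JSON object per line, the usual format for fine-tuning data.
with open("synthetic_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```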
Fine-tuned Frenzy: Separating the AI Wheat from the Chaff
As the tooling for data extraction and synthetic data generation continues to improve, we find ourselves on the cusp of an explosion in fine-tuned, mission-specific models. Take, for instance, the Mental Health Counseling Conversations dataset on Hugging Face, comprising 3,500 Q&A responses scraped from various mental health sites and forums. It has spawned over 17 fine-tuned models built on quality open models like Llama, Mistral, and Falcon. The creator of this dataset went on to create 'Connor,' a mental health assistant focused on answering user questions based on responses from a psychologist.
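For a sense of what one of these 'Connor'-style fine-tunes involves, here is a minimal sketch using Hugging Face's transformers, peft, and datasets libraries. The model and dataset identifiers, column names, and hyperparameters are assumptions for illustration; a real 70B-scale run would need multi-GPU infrastructure well beyond this snippet.

```python
# Minimal sketch of LoRA fine-tuning an open model on a domain dataset.
# Model/dataset names, column names, and hyperparameters are illustrative.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Meta-Llama-3-8B"   # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token   # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(base_model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Domain data: rows are assumed to expose "Context" (question) and "Response" columns.
dataset = load_dataset("Amod/mental_health_counseling_conversations", split="train")

def tokenize(batch):
    text = [f"Question: {q}\nAnswer: {a}"
            for q, a in zip(batch["Context"], batch["Response"])]
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="connor-lora", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("connor-lora")  # saves only the small LoRA adapter weights
```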
But what comes of having so many different 'Connors' out there? Doesn't this create an issue when you have a variety of models with varying approaches and training data sources? How do you decide which Connor to choose?
Well, honestly, this is a challenge we already face in the world of human professionals. As the joke goes,
Question: "What do you call a medical student who graduates at the bottom of their class?"
Answer: "Doctor."
We already have more therapists, doctors, and experts than we could ever vet individually, and yet we navigate that landscape through signals of trust: reputation, credentialing, reviews, and referrals. In the realm of fine-tuned models, we'll likely follow a similar path. Would you opt for a generic market-research LLM, or would you pay to use the specialized LLM from a reputable firm like CB Insights?
I think it'll come down, as it usually does, to the question posed by a16z partner Alex Rampell: "Will the startup get distribution before the incumbent gets innovation?"
PS: Shortly after I wrote this article, Google announced that its newly optimized medical LLM, Med-Gemini, achieved 91.1% accuracy on the MedQA test, surpassing Ura’s model, which had led the leaderboard for just under a week.
Upcoming Talks, Panels, & Workshops
May 8th: How AI Is Changing The Marketing Game: Panel at Forrester’s B2B NA Summit.
May 14 & May 21: Generative AI Workshop presented with Eminence Strategies and McGinnis Lochridge.