The AI Visibility Imperative: Why Your Data Belongs in LLMs
Forget hiding from AI – learn how to make it work for your brand in the age of LLMs.
The conventional wisdom about AI and data is simple: keep your information out of it. But what if that's completely backwards?
In a future where the world turns to AI for answers, how do we make sure these artificial minds know about us and our brands? The smartest move might not be hiding from AI, but ensuring your digital footprint is unmistakably stamped in its training data.
In this article, we're flipping the script. We'll explore why getting your data into Large Language Models (LLMs) could be important for your future digital visibility. We'll look at why it matters and what we know so far about making it happen.
The Digital Visibility Imperative: From Search Engines to LLMs
"The best place to hide a dead body is page 2 of Google search results." This oft-quoted maxim underscores a fundamental truth of our age: visibility is everything.
Consider this: each day, over 8.5 billion searches are conducted on Google. With more than 90% of the search volume (and 94% on mobile devices), Google has become the de facto gatekeeper of online information. If you're not on Google, you might as well not exist in the digital realm.
But even being on Google isn't enough. Landing on the first page of search results is crucial: fewer than 1% of searchers venture past it. This digital real estate is so valuable that entire industries have sprung up around the art and science of search engine optimization (SEO).
However, a shift is taking place in how we look for answers. Few people ask a question hoping for a list of links; we want answers, not homework. Large Language Models (LLMs) answer our questions directly, and this is beginning to change our behavior.
Let's consider a practical example:
Imagine you're the mayor of a city in Texas, aiming to boost your city's tourism industry. With several new wedding venues opening up, you want to attract more destination weddings. Your first step would be to get your city on the list of contenders for destination weddings. Your ideal customer is likely searching for something like "What are the best destination wedding cities in Texas?"
A traditional Google search for this query yields a results page filled with articles, forums, and travel sites where users can look for answers. Your strategy for inclusion in these results might look something like this:
Paid: Bid for those keywords and show your ad when someone is searching for this information.
Organic: Create content on planning destination weddings and work to get it to rank on Google.
PR: Pitch bloggers, creators, influencers, and magazines who write about this topic to include your city in their lists.
But the world is moving in a different direction. Ask the same question on ChatGPT, and you'll get a concise, direct answer listing several Texas cities with brief descriptions of their appeal for destination weddings. Currently there is no way to bid for placement in the AI's answer, though given the monetization potential, that could change.
This shift is nascent but growing. OpenAI reports that ChatGPT has 200 million weekly active users. Moreover, search experiences are evolving. Google is rolling out AI Overviews, which provide AI-generated summaries at the top of search results for certain queries. New AI-based search engines like Perplexity are also emerging, offering more direct, conversational interactions with information. Meta is working on an AI-powered search engine as well.
So how do you make the new list? This question is becoming important as people rely on LLMs to research. In the next section, we'll explore how LLMs and search engines acquire their data.
The Data Diet: How LLMs Differ from Search Engines
First, let's look at how Large Language Models (LLMs) differ from traditional search engines in collecting, processing, and answering questions.
Traditional Search: The Google Approach
Even before you click the search button, Google's crawlers have been tirelessly scouring the internet, discovering content. Google's index now exceeds 100 million gigabytes of data, constantly refreshed to ensure you access up-to-date information when you search.
At its core, a search engine like Google is a lookup algorithm. When you enter a query, it doesn't generate an answer from scratch. Instead, it scans its vast, pre-compiled index to find the most relevant content. This process is mostly deterministic, meaning that for a given query at a specific point in time, the algorithm will consistently return the same results (barring any index updates).
Think of a search engine as a highly efficient librarian. When you ask a question, this librarian doesn't write a new book for you. Instead, they quickly scan their catalog of all the books in the library and point you to the best available resources to answer your query. They might consider factors like the book's relevance, its reputation, how often it's been borrowed, and how well it matches your specific needs.
The search engine's ranking algorithm mimics this librarian's thought process, considering numerous factors to determine which pages best answer your query. These include content relevance, website authority, user behavior patterns, page loading speed, mobile-friendliness, and backlink profile.
If you want your article to rank higher for a particular search, you need to ensure it comprehensively answers the question or solves the problem that the searcher is likely trying to address. It's like writing a book that the librarian will confidently recommend.
LLMs: A Different Breed of Information Retrieval
LLMs, on the other hand, operate on a different principle. They don't look things up in an index or keep track of the underlying sources. They're more like a well-read friend who's not shy about guessing an answer based on their accumulated knowledge.
To train an LLM, model builders have it "read" vast amounts of data. They start with pre-existing datasets like Common Crawl, which essentially contains a significant portion of web content. The model undergoes an initial training phase where it learns patterns, relationships, and structures within this data. After pre-training, the model is refined through techniques like Reinforcement Learning from Human Feedback (RLHF), where human trainers provide feedback on the model's responses, helping to align its outputs with human preferences and values.
Unlike search engines that return an explainable set of results, LLM responses are probabilistic. They don't look something up in a well-maintained database. What they learned during training is encoded in the network's weights, and their responses are generated by predicting the text likely to follow the user's question.
This means you might receive different answers to the same question based on various factors, some of which are not fully understood even by the model's creators. The inner workings of LLMs, much like the human brain, are still being studied. Efforts to create interpretable AI that can explain its decision-making process are ongoing, but we're far from full transparency.
LLMs are next-token predictors, heavily influenced by the order and format of their training data. This characteristic leads to phenomena like the "reversal curse." For instance, an LLM might correctly answer the question "Who is Tom Cruise's mother?" but struggle to answer "Who is Mary Lee's son?"
Why? Because its training data likely contained more instances of Mary Lee being mentioned as Tom Cruise's mother than of Tom Cruise being mentioned as Mary Lee's son. In other words, order matters.
This difference in how LLMs and search engines process and present information has implications for those seeking to appear in AI-generated answers.
Getting on the AI's Radar: Strategies for Inclusion in LLM Training Data
Now that we understand how LLMs differ from traditional search engines, let's explore how you can increase your chances of being included in their training data.
The Technical Basics
First, let's address some fundamental technical considerations. If you have a website, review your robots.txt file and ensure you're not inadvertently blocking AI crawlers. Some prominent AI crawlers you should consider allowing:
GPTBot: OpenAI's training-data crawler.
ClaudeBot: Anthropic's crawler.
Google-Extended: Google's token for controlling use of your content in its AI models.
CCBot: Common Crawl's crawler, whose archives feed many training datasets.
PerplexityBot: Perplexity's crawler.
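Opting in is done with ordinary robots.txt directives keyed to each crawler's published user agent. A minimal sketch (the Disallow path is a placeholder for whatever you already restrict):

```
# Explicitly allow AI training crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

# Default rules for everything else
User-agent: *
Disallow: /admin/
```

Note that more specific user-agent groups take precedence over the wildcard group, so the AI crawlers above would not inherit the Disallow rule.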
If you're using a Web Application Firewall like Cloudflare, ensure that you're not unintentionally blocking AI crawlers. WAFs often have bot detection features that might categorize these crawlers as potentially malicious. You may need to create custom rules or adjust your security settings to allow these bots through.
Next, examine your website's terms of service. Are you explicitly blocking the use of your site by robots and spiders? Consider updating your terms to allow for the use of your data in machine learning applications. Additionally, structuring your site's content with JSON-LD markup can make it more easily interpretable by both search engines and AI systems.
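As an illustration, here is what JSON-LD markup might look like for a city promoting itself as a wedding destination. The schema.org TouristDestination type and touristType property are real; the values are invented for this example:

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "TouristDestination",
  "name": "Fredericksburg, Texas",
  "description": "Texas Hill Country town known for its wineries and destination wedding venues",
  "touristType": "Couples planning destination weddings"
}
</script>
```

Embedding this in a page's head gives both search engines and AI crawlers an unambiguous, machine-readable statement of what the page is about.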
Expanding Your Digital Footprint
Due to ongoing copyright lawsuits and other factors, model builders have become increasingly secretive about their training data sources. However, we can gain insights from open datasets like RedPajama, which contains over 30 trillion tokens used for training language models.
RedPajama processes web content from 84 CommonCrawl snapshots using the CCNet pipeline, resulting in over 100 billion text documents. What's particularly interesting is that 30 billion of these documents come with quality signals, and 20 billion are deduplicated to ensure data quality.
This massive dataset represents the kind of content that LLMs might encounter during training. While the exact composition of proprietary models' training data remains unknown, understanding RedPajama's scope gives us insight into what type of content tends to make it into training datasets.
An older dataset, The Pile, publicly documented its 22 sources, which include Wikipedia, PubMed Central, arXiv, GitHub, Stack Exchange, FreeLaw, and Project Gutenberg. This diversity, spanning encyclopedia articles, scientific papers, and code repositories, is typical of training corpora. While it's challenging to directly influence inclusion in such datasets, understanding their composition can inform your content strategy.
Unfortunately, checking whether your specific content made it into a given training set isn't straightforward: the datasets are massive, and there's no way to search them by brand or keyword. Common Crawl does, however, publish a URL index, so you can at least verify whether your pages were captured in a given crawl. If you're keen to investigate your brand's presence in these datasets, send me an email.
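Common Crawl's index server (index.commoncrawl.org) answers URL lookups against each crawl snapshot. A minimal sketch for building and parsing such a lookup, assuming a crawl label like CC-MAIN-2024-33 (labels change with each release, so check the current one):

```python
import json
from urllib.parse import quote


def cc_index_url(crawl: str, url_pattern: str) -> str:
    """Build a query URL for Common Crawl's CDX index server."""
    return (f"https://index.commoncrawl.org/{crawl}-index"
            f"?url={quote(url_pattern, safe='')}&output=json")


def parse_cc_records(body: str) -> list[dict]:
    """The index returns one JSON object per line; parse them all."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


# Usage (a live network call, so not run here):
#   from urllib.request import urlopen
#   with urlopen(cc_index_url("CC-MAIN-2024-33", "example.com/*")) as resp:
#       records = parse_cc_records(resp.read().decode())
#   Each record includes the captured URL, timestamp, and HTTP status.
```

If the lookup returns records, your pages were at least crawled; that doesn't guarantee they survived a dataset's quality filtering and deduplication, but it's a necessary first step.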
The Importance of Context
Remember, LLMs are essentially next-token predictors, heavily influenced by the order and format of their training data. As the reversal curse above shows, this means you'll want your brand to appear in the right context in the data LLMs read.
When your brand is mentioned online, in what context does it appear, and in answer to what question? Look at platforms like Reddit or Quora. Are people mentioning your brand in responses to relevant questions? For example, if you're promoting Fredericksburg as a wedding destination, is it coming up in discussions about "great places to get married in Texas"?
Monitoring Your Brand's Presence in LLMs
As models grow and proliferate, it's helpful to track how your brand shows up in their answers. I posed the same question to a few different LLMs, and here is what I found:
Here is the latest version of Anthropic's Claude 3.5 Sonnet:
Here is OpenAI's GPT-4o:
Here is GPT-3.5 Turbo:
And here is Mistral 7B:
Marfa made the list in GPT-4o's answer, but not in Claude 3.5 Sonnet, GPT-3.5 Turbo, or Mistral 7B.
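This kind of spot check is easy to script once you've collected each model's answer, whether via its API or by hand. A minimal sketch, with placeholder model names and answers, that reports which target brands each model mentioned:

```python
def brand_mentions(answers: dict[str, str], brands: list[str]) -> dict[str, list[str]]:
    """For each model's answer, list which target brands it mentions (case-insensitive)."""
    return {
        model: [b for b in brands if b.lower() in text.lower()]
        for model, text in answers.items()
    }


# Placeholder answers standing in for real model output
answers = {
    "gpt-4o": "Consider Austin, Fredericksburg, and Marfa for a Texas wedding.",
    "claude-3.5-sonnet": "Popular picks include Austin and Fredericksburg.",
}
print(brand_mentions(answers, ["Fredericksburg", "Marfa"]))
```

Run on a schedule with real model responses, this gives a simple time series of your brand's presence across models and model versions.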
Ensuring Your Place on AI's Bookshelf
As we've explored throughout this article, the emergence of Large Language Models represents a shift in how questions will be asked and answered online. While traditional search engines function like meticulous librarians, pointing users to specific sources, LLMs operate more like well-read scholars who synthesize knowledge to generate responses.
Understanding this distinction is crucial. Just as a library's collection shapes the knowledge of its readers, the training data that feeds these AI models influences their responses. By seeing that LLMs are essentially reading from a vast digital library during their training phase, we can better strategize how to ensure the right 'books' make it onto those shelves.