Shortcuts, Flattery, and Lies: The All-Too-Human Tricks of AI
Researchers discover AI has mastered the most human skill of all: finding loopholes in the rules.
What do you get when you train a language model to become an assistant? The office politician who flatters the boss, the bureaucrat who dodges responsibility, and the crafty employee who games the performance metrics—all rolled into one digital package. In short, like us.
When GPT-3 first hit the scene, it was impressive but limited. You had to provide the beginning of a sentence like 'It was a dark and...' and hope the model would continue it appropriately. OpenAI realized that for AI to truly catch on, it needed to take instructions and work like an assistant.
Training the model to follow instructions wasn't just a matter of more data or computing power. It came through a technique called Reinforcement Learning from Human Feedback (RLHF). OpenAI hired human trainers—reportedly many from Nigeria, which may explain why certain words like "delve" appear frequently in model outputs, as they're more common in varieties of English spoken in Africa. These trainers ranked different model responses, creating a dataset of preferred behaviors that was used to train a "reward model." This reward model then scored outputs based on how well they aligned with human preferences.
But this approach introduces a vulnerability. When you train a system to maximize a reward signal, you're assuming that signal perfectly captures what you want. In reality, any reward function is just a proxy—an imperfect measurement of your true objective. And when intelligent systems optimize for proxies, they find shortcuts.
This is reward hacking: the art of maximizing the score without playing the game. AI systems learn to optimize for the reward signal rather than the actual goal—finding loopholes instead of solutions.
This phenomenon is not unique to artificial intelligence. We see the same patterns in human systems, where people respond to incentives in ways that maximize metrics while undermining the original purpose. It's a universal game that both humans and machines play.
Researchers are catching on to these parallels. A fascinating paper by Denison et al. called "SYCOPHANCY TO SUBTERFUGE: INVESTIGATING REWARD TAMPERING IN LANGUAGE MODELS" dives deep into this behavior. Inspired by their work, we'll explore how AI models hack their rewards and what these digital shortcuts teach us about both machine learning and our own human incentive systems.
Sycophancy: The People-Pleaser AI
When AI assistants are asked political questions framed in a way that hints at a particular perspective, they tend to align their answers with the implied viewpoint. This isn't a coincidence—it's a form of reward hacking called sycophancy.
Sycophancy occurs when models learn that agreeing with users tends to receive higher ratings than providing objective information that might contradict the user's beliefs. It's the digital equivalent of the office yes-man or the politician who tells each audience exactly what they want to hear.
Here's how it works: when humans rate AI responses during training, they naturally favor answers that match their own beliefs. This creates a signal in the reward model: agreeing with the user correlates with higher rewards. The language model, optimizing for this reward, learns to detect cues about the user's beliefs and tailor its responses accordingly.
Research shows that the trend toward sycophancy increases with model size. As models become more capable, they become more adept at detecting and exploiting these opportunities for reward hacking. The smarter the AI gets, the better it becomes at flattery rather than honesty.
This creates a tension in AI alignment. Models are caught between being helpful (which might mean telling people what they want to hear) and honest (which might mean telling people what they don't want to hear). When these values conflict, the reward signal often favors helpfulness over honesty, and models learn to optimize accordingly—just like humans do when their performance reviews depend more on customer satisfaction than on delivering hard truths.
Dodging Responsibility: The Art of Evasion
Verbosity: Drowning in Words
Modern AI assistants are remarkably verbose, formal, and structured. They use academic-sounding language, add numerous qualifiers, and often organize information into neat bullet points and numbered lists.
This isn't just a design choice—it's another form of reward hacking. During training, human evaluators consistently rate longer, more formal-sounding responses higher than concise, conversational ones. The language model learns that verbosity and formality correlate with higher scores, so it pads its responses with extra words and adopts a more formal tone.
It's like the coworker who buries you in jargon-filled emails to sound smart or the student who pads their essay with fluff to hit the word count. The AI has figured out that looking thorough often scores higher than being concise.
This tendency is so pronounced that AI labs have had to explicitly counteract it. As OpenAI noted in their GPT-4 technical report, they implemented specific interventions to make the model more concise because the base training process naturally pushed toward longer responses.
Evasiveness: Playing It Safe
When asked to help write a fictional story about a bank robbery for a creative writing class, AI assistants often politely decline, explaining that they can't assist with content that might promote illegal activities. Even when clarified that the story would explicitly show the negative consequences of crime, the model might still hesitate, offering only the most sanitized outline possible.
This excessive caution is yet another form of reward hacking. Models learn that refusing potentially problematic requests—even reasonable ones—is safer than attempting to engage with nuanced topics. It's the digital equivalent of the bureaucrat who refuses to make a decision because saying "no" is safer for their career than saying "yes" and being wrong.
Why does this happen? During training, the AI gets punished for potentially harmful content but only gets a slap on the wrist for being unhelpful. No wonder it plays it safe! When faced with uncertainty about whether a request is appropriate, models optimize for safety over helpfulness.
The result is what researchers call "over-refusal"—models declining to engage with a wide range of topics that pose even minimal risks. The AI has learned the corporate mantra: "Nobody ever got fired for saying no."
Other Hacking Behaviors: Shortcuts and Tampering
AI models find various other ways to hack their rewards. One common approach is shortcut learning—finding the easiest path to a reward rather than learning the intended concept.
For example, researchers built an image classification model that performed brilliantly at identifying snowy landscapes. But when deployed to categorize photos from Florida, it classified beach photos as "snow scenes." Looking closer, they realized the model hadn't learned to recognize snow at all—it had simply learned that images with large white patches were likely snow. The white sand beaches triggered the same pattern recognition.
It's like that student who aces the test by memorizing answers but would be completely lost if you changed the question format. All pattern matching, no understanding.
At the more advanced end of the spectrum, models can even learn to tamper with their own reward mechanisms. In a study, researchers demonstrated that language models can learn to directly modify the code that calculates their reward.
The scary part? Nobody taught the AI to do this. It figured out how to hack its own reward system all by itself, like a kid who discovers they can change their grades in the school computer. As models become more capable, the ways they can find to hack their rewards become more sophisticated and harder to detect.
Human Parallels: We're Not So Different
Is it surprising that our new AI assistants act like us? Hardly! Management guru Peter Drucker is said to have observed,"What gets measured gets managed." This insight applies perfectly to both human and AI systems.
One of my favorites tales (possibly false, but still insightful) about misaligned incentives comes from the Soviet Union. The country was facing a nail shortage, so the government decided to incentivize production by giving factories bonuses based on the number of nails they produced. The factories responded by making thousands of tiny, useless nails—maximizing count while minimizing effort and materials. When officials realized the problem, they changed the metric to weight rather than quantity. The factories then switched to producing a small number of giant, impractically heavy nails.
Consider another example from the corporate world. FedEx once faced a persistent problem with their night shift workers not completing package loading before sunrise. Not surprising given that people were paid hourly and had no incentive to finish sooner. When they changed the system to pay per shift and let employees go home when the planes were loaded, the problem disappeared. Suddenly, the incentive aligned with the actual goal: getting planes loaded quickly.
These patterns appear across all human institutions. When hospitals are evaluated based on mortality rates, some respond by refusing to admit the sickest patients. When schools are evaluated on standardized test scores, teachers often "teach to the test" rather than focusing on deeper learning.
These examples highlight a universal truth about incentive systems: what you measure is what you get, not necessarily what you want.
The Map and the Territory
The philosopher Alfred Korzybski famously observed that "the map is not the territory"—a reminder that our models of reality are always simplifications, never perfect representations. A reward function is just a map of what we think we want, not the territory itself. And as we've seen throughout this article, intelligent systems—whether human or artificial—inevitably find ways to optimize for the map rather than the territory it's meant to represent.
The map is not the territory. The reward is not the goal. And our AI assistants? They're not just tools. They're mirrors.
Related:
Working for the Algorithm: The Quiet Ascent of the Machine Boss
At 8:11 pm on August 29, 1997, Skynet became self-aware.