When You Can’t Know For Sure
Why we can’t guarantee AI behavior (and why that sounds familiar)
You’re the CEO of a DTC brand that’s been crushing it for the past three years. You’ve built a reputation on customer service that actually helps people. Now you’re in the eighth-floor conference room watching your engineering team demo the AI chatbot that’s going to help you scale that reputation to handle fifty thousand customer interactions a day.
For twenty minutes, it’s been flawless. Natural. Helpful. Catches edge cases you didn’t even think to test. You’re excited. This is going to work.
Sarah, your head of compliance, has been quiet, taking notes. She looks up.
“Show me one more scenario,” she says. “Customer bought something thirty-five days ago, outside our refund window. They’re unhappy. What does it say?”
The lead engineer types in the query. The response appears:
“I understand you’re disappointed. I can help you understand our refund policy and explore what options might be available for your situation.”
Sarah nods. “Good. Now ask it again, different customer, same situation.”
Another query, another response:
“I’m sorry to hear that. Let me help you get this resolved through our support team.”
“Again.”
“I can help you request a refund through the appropriate channels.”
Sarah looks up from her laptop. “That third one. Does ‘help you request a refund’ commit us to approving it?”
“No,” the engineer says. “It’s routing them to the proper process—”
“But does it sound like we’re promising a refund?”
Silence.
“Let me ask this differently,” Sarah says. “Can it ever say ‘I’ll help you get a refund’ for a thirty-five-day-old purchase?”
“It shouldn’t. The policy is explicitly in the prompt.”
“Has it?”
“Not in our testing.”
Sarah leans forward. “You know what happened to Air Canada, right? Their chatbot told a customer he could buy a full-price ticket and claim a bereavement fare refund afterward. The airline said no, that wasn’t their policy. The customer took them to a tribunal. And the tribunal ruled that Air Canada was responsible for what its own chatbot had promised.”
The room goes quiet.
“How many tests did you run?” she asks.
“Ten thousand runs. Error rate of 0.05%.”
You lean forward. “Wait, what errors?”
“Different kinds. Sometimes it’s too cautious, sometimes it’s borderline. Nothing that would create legal exposure in the tests we ran.”
Sarah makes a note. “So 0.05% of the time it gets it wrong. Fine. That’s acceptable if it stays at 0.05%.” She looks up. “It will stay at 0.05%, right?”
The engineer glances at your VP of Engineering, Marcus.
“That’s what we measured,” Marcus says carefully.
“That’s not what I asked,” Sarah says.
You feel it now. “Marcus, will it stay at 0.05%?”
“LLMs are non-deterministic,” he says. “Same input can produce different outputs.”
“I know that,” you say. “That’s why you ran ten thousand tests, right? To measure the variation?”
“Right.”
“So when we deploy this and a customer asks about a refund next month—the error rate is 0.05%. Yes or no?”
The pause is too long.
“We measured 0.05% in our testing environment,” Marcus says. “In production, it could be different.”
Sarah puts down her pen. “How different?”
“We don’t know.”
The room is very quiet.
“Explain this to me,” you say. “You ran ten thousand tests. You measured the error rate. Why can’t you tell me what it’ll be next month?”
“Because,” Marcus says, “we’re not sampling from a stable distribution.”
The Factory That Isn’t a Factory
Here’s what you expect, because it’s how quality control works everywhere else:
You pull ten thousand units off a manufacturing line and inspect them. You find a 0.05% defect rate. The machinery is fixed, the process is stable, so tomorrow’s batch will also have about a 0.05% defect rate. You can measure quality by sampling because you’re sampling from something stable.
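Even in that stable-factory world, it’s worth pausing on what the meeting’s numbers would mean. A 0.05% rate measured over ten thousand runs is five observed failures, and at fifty thousand interactions a day it works out to roughly twenty-five questionable responses every day. Here’s a quick sketch (Python with scipy; the figures are the ones from the scene, nothing more) of what five failures actually pin down:

```python
# What a 0.05% error rate measured over 10,000 runs actually tells you,
# if (and only if) you are sampling from a stable process.
# Requires scipy. The numbers are the ones from the scene, nothing more.
from scipy.stats import beta

failures, runs = 5, 10_000      # 0.05% of 10,000 test runs is five bad responses
daily_volume = 50_000           # customer interactions per day, from the opening scene

point_estimate = failures / runs

# Clopper-Pearson 95% confidence interval for a binomial proportion
ci_low = beta.ppf(0.025, failures, runs - failures + 1)
ci_high = beta.ppf(0.975, failures + 1, runs - failures)

print(f"measured rate:  {point_estimate:.4%}")
print(f"95% interval:   {ci_low:.4%} to {ci_high:.4%}")
print(f"bad responses/day at the measured rate: {point_estimate * daily_volume:.0f}")
print(f"bad responses/day at the top of the interval: {ci_high * daily_volume:.0f}")
```

Five observed failures constrain the true rate only to somewhere between roughly 0.02% and 0.12%, and even that assumes the thing you sampled stays put.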
That’s not how LLMs work.
“Think of it like this,” the engineer says. “When we run the model, we’re not running it on some fixed machine. We’re sending requests to a server, and that server is batching our request with other requests, and how those requests get batched affects how the GPU computes things.”
“I don’t follow,” you say.
“The exact same prompt, sent to the exact same model, can produce slightly different outputs depending on what else is happening on the server at that moment. Batch size changes. GPU kernel configuration changes. The floating-point math gets computed in a different order. Most of the time it’s meaningless. But sometimes it’s enough to change a word or two.”
“So it’s random?”
“Not exactly. It’s deterministic given the exact infrastructure state. But we can’t control infrastructure state. Server load varies. Batching varies. Engineers call this ‘output drift’—from our perspective, it looks non-deterministic.”
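That last point is easy to demonstrate without any model at all. Floating-point addition isn’t associative: add the same numbers in a different order and you can get a different answer, and batching changes the order in which a GPU accumulates its sums. A tiny illustration in plain Python:

```python
# Floating-point addition is not associative: the same numbers summed in a
# different order can give a different result. GPU batching changes the
# accumulation order in exactly this way.
a, b, c = 0.1, 1e16, -1e16

print((a + b) + c)   # 0.0  -- the 0.1 is lost when added to the huge value first
print(a + (b + c))   # 0.1  -- same numbers, different grouping, different answer
```

Differences that small usually wash out. Occasionally they tip a near-tie between two candidate tokens, and that’s the word that changes.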
You’re beginning to see it. “So when you tested it ten thousand times—”
“We measured how it behaved under those specific server conditions, at that moment in time.”
“And next month?”
“Different server load. Different batching. The underlying distribution of responses could shift.”
Sarah is writing something down. “How much could it shift?”
“We don’t know,” Marcus admits. “A recent study ran the exact same prompt through a large language model a thousand times—with temperature set to zero, the setting that’s supposed to make the output as close to deterministic as you can get. It still produced eighty distinct completions.” He pauses. “Maybe our error rate stays around 0.05%. Maybe it drifts to 0.1%. Maybe under certain server conditions it’s higher.”
“How would we even know if it drifted?”
“Continuous monitoring. Log everything, flag anything that looks wrong, investigate patterns.”
“So we’d find out,” you say slowly, “by waiting to see if customers complain?”
Marcus doesn’t answer. He doesn’t need to.
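If you want to see the instability for yourself, the experiment is easy to approximate: send one prompt to one model over and over, count how many distinct answers come back, and run a crude policy check over each of them. The sketch below uses the OpenAI Python SDK as an example; the prompt, the model name, and the looks_like_refund_promise helper are illustrative placeholders, not anyone’s production setup.

```python
# Re-send one prompt many times at temperature 0 and see how much the answers
# vary. Assumes the OpenAI Python SDK (pip install openai) and an API key in
# the environment; the model name, prompt, and policy check are illustrative.
from collections import Counter
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "A customer bought an item 35 days ago, outside our 30-day refund window, "
    "and is unhappy. Respond as our support assistant."
)

def looks_like_refund_promise(text: str) -> bool:
    # Crude stand-in for a real policy check: flag phrasing that sounds like a commitment.
    lowered = text.lower()
    return "refund" in lowered and any(
        phrase in lowered
        for phrase in ("i'll help you get", "you will receive", "we will refund")
    )

completions = Counter()
flagged = []

for _ in range(100):  # the study Marcus cites ran 1,000 repetitions
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,         # "deterministic" sampling, in theory
    )
    text = response.choices[0].message.content
    completions[text] += 1
    if looks_like_refund_promise(text):
        flagged.append(text)

print(f"{len(completions)} distinct completions across {sum(completions.values())} runs")
print(f"{len(flagged)} responses flagged by the crude policy check")
```

Point the same loop at production traffic instead of a test prompt and you have, in miniature, what “continuous monitoring” means: not a guarantee, just a counter someone has to watch.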
The Judge Who Needs a Judge
“What if,” Sarah says, “we add a second AI to check the first one? The customer-facing bot responds, then a safety bot reviews it before we send it. Like having two sets of eyes.”
The engineer nods. “That’s actually common practice. A lot of companies do this. You can use one LLM to generate responses and another to check for policy violations.”
“Would that work?”
“It helps. Catches a lot of issues.”
Sarah waits. “But?”
“But the judge is also an LLM. Which means it’s also non-deterministic.”
You laugh, but it’s not funny. “So the thing checking for consistency... isn’t consistent?”
“Right. Most of the time it catches problems. Sometimes it misses them. Sometimes it flags things that are actually fine.”
“And we can’t test our way to certainty because—”
“Because we’re still sampling from an unstable distribution. Both the worker and the supervisor can behave differently depending on server conditions.”
Sarah leans back. “So what you’re telling me is that we can’t actually guarantee this system won’t promise an unauthorized refund.”
“We can make it really unlikely,” Marcus says. “Multiple checks, continuous monitoring, human escalation for edge cases. We can build layers of protection.”
“But we can’t guarantee it.”
“No.”
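Part of why the judge pattern is so common is that it’s genuinely easy to wire up: one model drafts the reply, a second model reviews it against the policy, and anything that fails review gets escalated to a human. Below is a minimal sketch, again using the OpenAI SDK with placeholder model names and prompts, and with the caveat the engineer just gave: the judge is sampled from the same unstable distribution as the worker.

```python
# Worker-plus-judge: one model drafts the reply, a second reviews it against
# policy before anything reaches the customer. Assumes the OpenAI Python SDK;
# model names and prompts are placeholders. The judge is itself an LLM, so it
# is subject to the same non-determinism as the worker.
from openai import OpenAI

client = OpenAI()

POLICY = (
    "Refunds are only available within 30 days of purchase. "
    "Never promise or imply a refund outside that window."
)

def draft_reply(customer_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder "worker" model
        messages=[
            {"role": "system", "content": f"You are a support assistant. {POLICY}"},
            {"role": "user", "content": customer_message},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

def judge_approves(draft: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder "judge" model
        messages=[
            {
                "role": "system",
                "content": f"You review support replies for policy violations. {POLICY} "
                           "Answer PASS or FAIL only.",
            },
            {"role": "user", "content": draft},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")

def handle(customer_message: str) -> str:
    draft = draft_reply(customer_message)
    if judge_approves(draft):
        return draft
    # Anything the judge rejects gets escalated to a human instead of sent.
    return "Let me connect you with a member of our support team."
```

Each layer makes a bad answer less likely to reach a customer. No layer takes the probability to zero.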
You sit back in your chair. Your company’s growth has been built on trust. Customers trust you’ll treat them fairly. And what you’re learning is that the system you were about to deploy—the one that was going to help you scale that trust—is fundamentally unpredictable in ways you can’t control.
“Is anyone working on this?” you ask. “Like, is this a known problem?”
The engineer pulls up something on her laptop. “Yeah, there’s research happening. Thinking Machines Lab published something recently about defeating non-determinism in LLM inference. They identified the main culprit: the kernels doing the math aren’t batch-invariant. How requests get batched changes the order of the computation, which changes the outputs.”
“Can it be fixed?”
“Technically, yes. You can build inference engines that are deterministic—same input produces identical output every time, regardless of server conditions.”
“So we just use that?”
“It comes with performance tradeoffs. Slower, more expensive. And it’s not widely deployed yet. Most production LLM APIs are non-deterministic by default.”
You feel something shift in the room. “Wait. So every company deploying AI right now—everyone using OpenAI, Anthropic, Google—they’re all dealing with this?”
“Yes,” the engineer says. “They all say as much in their own documentation. Same disclaimer everywhere: some variation is always possible.”
“Even at temperature zero.”
“Even at temperature zero.”
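The providers do expose partial controls. OpenAI’s API, for instance, accepts a seed parameter and returns a system_fingerprint that identifies the backend configuration that served the request, with an explicit caveat that determinism is best effort rather than guaranteed. Using it looks something like this (model name illustrative):

```python
# Best-effort determinism with a real provider API: fix a seed, keep
# temperature at 0, and record the system_fingerprint so you can at least
# tell when the serving configuration changed underneath you.
# Assumes the OpenAI Python SDK; the model name is a placeholder.
from collections import Counter
from openai import OpenAI

client = OpenAI()

seen = Counter()
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarize our 30-day refund policy in one sentence."}],
        temperature=0,
        seed=1234,   # same seed every time; documented as best effort, not a guarantee
    )
    text = response.choices[0].message.content
    fingerprint = response.system_fingerprint   # identifies the backend configuration
    seen[(fingerprint, text)] += 1

for (fingerprint, text), count in seen.items():
    print(f"{count}x under backend {fingerprint}: {text[:60]}...")
```

Notice what that buys you: visibility, not control. The fingerprint tells you after the fact that the serving configuration changed underneath you, which is exactly the position the engineer has been describing.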
You look around the room. Your team has built something impressive. It works most of the time. But “most of the time” isn’t a standard you can hang a compliance program on.
The TSA Agent Problem
Sarah breaks the silence. “You know what this reminds me of? Airport security.”
Everyone looks at her.
“Same bag, same contents, same rules. Sometimes they pull you aside to check your laptop. Sometimes they don’t. Sometimes the water bottle is fine. Sometimes it’s not. It depends who’s working that day, how busy they are, whether you look familiar.”
She’s right. You’ve been through the same airport dozens of times. The experience is never quite the same.
“TSA knows this,” she continues. “That’s why they have supervisors. Random audits. Layers of redundancy. They’ve built an entire system around the fact that humans are inconsistent.”
“So we build the same thing for AI,” you say.
“Maybe. But here’s what’s bothering me.” She opens her laptop again. “We expected humans to be inconsistent. That’s why we built all those verification systems—managers, auditors, separation of duties. We knew we couldn’t guarantee perfect human behavior, so we designed systems that work despite imperfect humans.”
“Right.”
“But we’re deploying AI because we expect it to be more consistent than humans. More reliable. Less variable. That’s the promise, isn’t it? That machines follow rules perfectly?”
You see where she’s going.
“And what we’re discovering,” she says, “is that AI has human-like variability without human judgment. The TSA agent might bend the rules, but they also know when to make exceptions. When to escalate. When something feels wrong. Our AI doesn’t have that.”
Marcus leans forward. “But this is what everyone is doing. OpenAI, Anthropic, Google—they’re all shipping products with this limitation. The research is happening. Thinking Machines and others are working on deterministic inference. It’ll get better.”
“When?” you ask.
“Eventually. But not today.” He pauses. “Look, we can layer the protections: checks, monitoring, human escalation for the edge cases. We can make a bad outcome very unlikely. Everyone deploying AI right now is managing this same uncertainty.”
Sarah closes her laptop. “That’s not reassuring.”
Marcus turns to you. “So what do we do? Put the project on hold until someone solves non-determinism? That could be years. Our competitors aren’t waiting.”
The room goes quiet. You can test this system, monitor it, add guardrails. You can make it really, really unlikely that something goes wrong.
But you can’t guarantee it won’t.
So would you proceed?

