Hearing the reach

What the transcript erased, and what the next class of voice models doesn’t.

Jun 04, 2026

An abstract image of a purple and blue flower — Photo by Yuriy Vertikov on Unsplash

I had lunch with an old college roommate a few weeks ago, and somewhere in the middle of catching up he asked the question I have heard versions of from a lot of people this year. What happens if AI just stops improving? If the next model is no better than the last one, what then?

The question imagines AI’s progress as a single thing, one number going up or flat. But AI’s capabilities don’t move as one number. They move as a ragged edge, a boundary in many directions at once. Some directions have shot far out: coding, image generation, math. Others have barely moved. Researchers at Harvard and Wharton — Dell’Acqua, Mollick, and colleagues — call this shape the jagged frontier. The loud debate is a debate about its leading edge. Behind that edge, every other part of the boundary is moving at its own pace, mostly out of view. Even if the leading edge stops tomorrow, the rest has years of catching up to do. This is the story of one stretch of it, watched up close.

A corner of the frontier

A couple of years ago my mom had a stroke and developed aphasia. As I spent time sitting in on her speech therapy sessions, I got curious about the limits of what AI and software can do today.

To be honest, I only had a passing understanding of aphasia until those sessions. In a nutshell, words you have known your whole life stop being available to you. You look at a fork, reach for the name, and the word does not come. Sometimes a close cousin arrives instead, spoon or knife, and you can hear yourself register that it is wrong without being able to fix it. Letters that were intimately familiar, so familiar that words used to appear in your mind fully formed, now seem like strangers.

Like a stroke that impairs movement, aphasia is something patients can work back from. Speech therapists guide them through exercises that, over months, teach a different part of the brain to do what the old part used to do. Some words come back. Others stay puzzlingly lost.

I started building apps and games to make aphasia practice easier between sessions. I quickly ran into what software alone could not do. One of the techniques therapists use is called semantic feature analysis.

The therapist shows a picture, say a fork, and waits for the patient to reach for the word. If the patient says spoon, the therapist does not say wrong. She says something like, yes, also for eating, but think about what goes with it. If the patient says ladle, she pivots again, not quite, but you are in the kitchen, what would you pair it with. Which wrong word came out tells the therapist where the patient was reaching, and where to nudge next.

The first version of the app I built had pre-recorded audio hints. If the picture was a fork, the app could play a clue. “This is something you eat with.” “This rhymes with pork.”

But pre-recorded cues miss what the therapist actually does in a session. It is closer to improv than to a script. She is bridging, continuing, flowing with the patient, because the hint a patient needs in the middle of reaching for a word is the one that responds to what they just said. You cannot pre-record that. You have to hear them. And then you have to actually hear them, not just transcribe them.

Bringing in AI

Traditional software has its control flow baked in — if the user says x, then say y. AI doesn’t. Chatbots already react to whatever was just said, and an AI agent, I figured, could do for therapy what pre-recorded audio could not.

So I started experimenting with AI and voice. The first thing I noticed was that most voice apps today are not really voice apps. They are a text pipeline. The patient speaks. A speech-to-text model transcribes the audio into text. The text goes to an LLM, which produces text back. A text-to-speech model converts that text into audio and plays it through the speaker.
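A minimal sketch of that pipeline shape, in Python. The three stages are passed in as plain functions, and the `fake_*` stand-ins below are hypothetical placeholders, not any vendor's API. The point is only the hand-offs: audio becomes a string, the string becomes another string, and that string becomes audio again.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    transcript: str     # what the speech-to-text stage heard
    reply_text: str     # what the LLM wrote back
    reply_audio: bytes  # what the text-to-speech stage synthesized


def run_turn(
    audio_in: bytes,
    speech_to_text: Callable[[bytes], str],
    llm: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> VoiceTurn:
    """One conversational turn through a cascaded voice pipeline."""
    transcript = speech_to_text(audio_in)    # audio -> text
    reply_text = llm(transcript)             # text -> text (the LLM only sees the string)
    reply_audio = text_to_speech(reply_text) # text -> audio
    return VoiceTurn(transcript, reply_text, reply_audio)


# Hypothetical stand-ins so the sketch runs without any real models.
def fake_stt(audio: bytes) -> str:
    return "spoon"


def fake_llm(text: str) -> str:
    return f"I heard: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = run_turn(b"\x00\x01\x02", fake_stt, fake_llm, fake_tts)
print(turn.transcript, "|", turn.reply_text)
```

Notice that `llm` takes a `str` and nothing else. Whatever was in the audio besides the words has nowhere to go.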

For a lot of use cases, this is fine. If you are asking Siri to set a kitchen timer for twelve minutes, or summarizing a long lecture, the transcript is all anyone needs. The substance is in the words transcribed, not in the spaces between them.

But in conversations between humans, what the transcript leaves out can be a chasm.

What the transcript loses

Take a question like this: “Do you love your wife?”

The answer might be:

Yes.

Or:

...yes.

Or:

Umm. Yes.

To the transcript these are all the same word: yes. To the audio they are completely different answers. A clear Yes. is a vow. A hesitant ...yes. could be a marriage in trouble. The pipeline I described in the last section reduces every one of these to yes before the LLM ever sees it. Whatever the model says next is built on a sentence stripped of everything that made it meaningful.
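To make the collapse concrete, here is a small sketch in Python. It assumes the speech-to-text stage produces word-level timings and then hands the LLM a cleaned string with punctuation and fillers removed. The `Word` type, the filler list, and every timing value are invented for illustration.

```python
import string
from dataclasses import dataclass

# Hypothetical filler list; real systems vary in what they drop.
FILLERS = {"um", "umm", "uh", "er"}


@dataclass
class Word:
    text: str
    start: float  # seconds after the question ended
    end: float


def normalize(token: str) -> str:
    return token.lower().strip(string.punctuation)


def clean_transcript(words):
    """The string a text-only LLM would receive."""
    tokens = [normalize(w.text) for w in words]
    return " ".join(t for t in tokens if t and t not in FILLERS)


def delay_before_answer(words):
    """Seconds until the first non-filler word; None if there is none."""
    for w in words:
        if normalize(w.text) not in FILLERS:
            return w.start
    return None


# Three ways of answering the same question (timings are made up).
answers = {
    "clear": [Word("Yes.", 0.2, 0.5)],
    "hesitant": [Word("yes.", 1.6, 1.9)],
    "filled": [Word("Umm.", 0.3, 0.9), Word("Yes.", 1.4, 1.7)],
}

transcripts = {name: clean_transcript(ws) for name, ws in answers.items()}
delays = {name: delay_before_answer(ws) for name, ws in answers.items()}

print(transcripts)
print(delays)
```

All three entries in `transcripts` come out identical, while `delays` still tells them apart. Only the first of those two dictionaries ever reaches the LLM in a text pipeline.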

Speech therapy depends on hearing the reach. The patient trying to recall fork might say spoon. They might also say ...spoon, with a half-second of hesitation that tells the therapist the word is close, that the patient knows it is close, that they just cannot pull it out. The therapist responds to the hesitation. A transcript erases it.

Anywhere the work is the conversation — not the relaying of information but the conversation itself — the transcript is not enough.

If you have used an AI voice app and felt something quietly off, this is likely what you have experienced. The words were right. Something else was missing.

Past the transcript

What if the audio never had to become text at all?

Text language models work by predicting the next word, or piece of a word. They can do this because text already arrives in chunks. Audio does not. Audio arrives as a wave, thousands of points per second, none of them individually meaningful.

A different class of models skips the transcript entirely. They break the audio directly into tokens of its own, each one capturing a tiny slice of how the speech actually sounded: the pitch, the timing, the breath, the hesitation. Then they predict the next audio token the way a text language model predicts the next word. Same recipe. Different alphabet.
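A toy version of "breaking audio into tokens" can be sketched in a few lines of Python. This is a deliberate simplification and not how a neural audio codec works: real codecs learn both the encoder and the codebooks from data, while this sketch measures one hand-picked feature (loudness per 10 ms frame) and snaps it to a small, made-up codebook. The sample rate, frame size, and codebook values are all assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 8000                   # samples per second (assumed)
FRAME_SIZE = 80                      # 10 ms of audio per token
CODEBOOK = [0.0, 0.25, 0.5, 0.75]    # hypothetical loudness levels


def split_frames(samples, frame_size=FRAME_SIZE):
    """Cut the waveform into equal, non-overlapping frames."""
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]


def rms(frame):
    """Root-mean-square amplitude, a rough measure of loudness."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def tokenize(samples):
    """Map each frame to the index of the nearest codebook entry."""
    tokens = []
    for frame in split_frames(samples):
        level = rms(frame)
        nearest = min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - level))
        tokens.append(nearest)
    return tokens


# 0.1 s of silence (a hesitation) followed by 0.1 s of a 200 Hz tone (a "word").
silence = [0.0] * 800
tone = [0.7 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]

tokens = tokenize(silence + tone)
print(tokens)
```

The pause survives as a run of tokens at the start of the sequence. A model that predicts the next token in a sequence like this has the hesitation in front of it, which a model reading a transcript does not.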

What this new alphabet keeps is what the transcript threw away. It keeps the hesitation. It keeps the warmth or the flatness. It keeps the sigh that the transcript would have dropped on the floor.

Two things had to be true before this could work. The first was a way to break audio into tokens. The systems that do this are called neural audio codecs, and they took years to figure out. The second was data. Predicting the next audio token requires millions of hours of recordings of real people talking, and podcasts and YouTube made that possible in a way it was not twenty years ago.

Once both of those were in place, the same transformer architectures that work on text could work on audio. The models being built today by OpenAI, Google, and labs like Kyutai are exactly this: language models, but with sound as the alphabet.

Six months apart

I have been trying to build my aphasia app with these voice-native models for about a year. The change between the version I tried in late 2025 and the version I tried in May has been the most encouraging thing I have seen in this corner of AI.

The first real-time voice model I built with could hear me. It could be interrupted. It could pick up on hesitation and respond to it. For the first time, the gap between what the therapist did naturally and what the software could do felt smaller than it had ever been.

The problem was that the model could not really do anything with what it heard. When a patient got a word right, the app would not reliably advance to the next picture. The model would sometimes hallucinate an object that was not part of the lesson. The conversation worked. The app around the conversation did not.
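One way to picture the gap between the conversation and the app: the lesson state can live in ordinary code, with the model only proposing actions that the app checks before applying. The sketch below is an illustration of that idea in Python, not a description of how the prototype was built. The `Lesson` class and the `mark_correct` action shape are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lesson:
    targets: list   # the words in this lesson, in order
    index: int = 0  # which picture is currently on screen

    @property
    def current(self):
        """The word for the picture on screen, or None when the lesson is done."""
        return self.targets[self.index] if self.index < len(self.targets) else None

    def apply(self, action: dict) -> str:
        """Apply an action proposed by the model and report what the app did."""
        if action.get("type") != "mark_correct":
            return "ignored: unknown action"
        if self.current is None:
            return "ignored: lesson finished"
        if action.get("word") != self.current:
            # Covers a word that is not on screen, or not in the lesson at all.
            return "rejected: not the current picture"
        self.index += 1
        return "advanced"


lesson = Lesson(["fork", "cup"])
results = [
    lesson.apply({"type": "mark_correct", "word": "ladle"}),  # not on screen
    lesson.apply({"type": "mark_correct", "word": "fork"}),   # correct target
]
print(results, lesson.current)
```

A guard like this does not make the model more reliable. It only keeps a stray proposal from moving the lesson to the wrong place.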

Last month, OpenAI released a new version. I rebuilt the prototype to see. It is meaningfully better. The model can now hear the reach and also drive the app. It feels less like a tech demo and more like a thing.

It is still awkward in places. Longer sessions get foggy as the context window fills, audio costs an order of magnitude more than text, and the model still drifts off-script in small ways.

Honestly, this is more or less what it felt like to use ChatGPT in 2023. Slow, expensive, prone to hallucination, awkward in ways that felt charming when you were not relying on it for anything important. This story could go the same way.

The jagged frontier

Most of the headlines this year have been about coding. But the same techniques are quietly being applied to other kinds of data. Voice is the one I have been thinking about. Other corners, like time series and numerical prediction, are being worked on by other people, for their own reasons. The voice corner specifically has a real commercial engine behind it. Every consumer voice assistant on the planet needs to learn to hear a tired voice, an angry voice, a confused voice, and not just a transcript.

Even within voice, progress will not arrive everywhere at the same time. The models are trained overwhelmingly on fluent, neurotypical speech. Aphasic speech, dysarthric speech, heavily accented speech: all of it is underrepresented in the training data, and all of it sits on a slower part of the edge than the one the rest of us are riding. So even when the next chapter arrives for most people, the version that reaches people with aphasia will be behind it.

That is life on the jagged frontier. Something is always ahead, something is always behind, and regardless of what happens at the leading edge, the rest of the boundary keeps moving anyway.

BoxCars AI
