Follow My Voice
The next phase of AI intimacy
Happy New Year and welcome back to Soft Coded!
In case you missed it, OpenAI just reorganized its entire engineering structure around audio. They pulled researchers from Character.AI, unified product and research teams, and set a Q1 2026 deadline for a completely rebuilt voice system. The new model will handle interruptions mid-sentence and eliminate the robotic pauses that currently mark AI speech as artificial. This is all groundwork for the hardware Jony Ive is designing with them: an audio-first device, maybe a pen (?), maybe something you wear; definitely no screen.
Sam Altman keeps describing it as a calmer alternative to smartphones, which he says feel like walking through Times Square. The new thing will be like sitting in a cabin by a lake (??). He also mentioned the device will “know everything you’ve ever thought about, read, said.” Which, peaceful.
The entire industry is making this pivot. The screens aren’t disappearing entirely, but they’re becoming secondary. Voice is taking over as the primary interface for AI, and almost nobody’s examining what changes when language becomes something you only hear, never see.
The Textless Word
Training AI for voice is a completely different discipline than training it for text, and the gap reveals something fundamental about what these systems are actually doing. With text-based models, you’re training on language that was written. This sounds super obvious, but consider the revisions and deliberations involved. Even casual social media posts went through some minimal act of composition. Someone chose those words, hit send, and committed them to a form that could be screenshotted, quoted, or reviewed later. There’s accountability baked into the medium.
Voice-first AI generates language while sort of skipping a step. It’s not the text-to-voice experience that exists now; it’s just tokenized voice. The system makes real-time decisions about word choice, sentence structure, emphasis, pacing (all the things a writer would labor over), but without true, authorial, deliberate thought. It sounds like someone carefully chose these words, but nobody did. This should strike you as unnatural, and it creates a strange problem when training these models. You’re trying to teach an AI to sound good while bypassing the written word. The model needs to produce the rhythm of considered prose, the precision that comes from revision, the sense that someone thought carefully about clarity and flow, and it all has to emerge from probability distributions.
Consider the audiobook for contrast. Audiobooks start as text that someone wrote, and they (usually) wrote it to exist on the page. They had to make the sentences work when a reader could stop, reread, sit with a paragraph, check if the logic actually held together. Writing for the page forces a kind of accountability because the reader controls the pace. You can’t hide weak reasoning behind ephemeral delivery when someone can examine the actual words.
Voice-first AI eliminates that step and makes it easier for you to ignore any logical gaps. The model generates language designed only to be heard. It can be smooth and confident and persuasive in the moment you’re listening. The system isn’t optimized for language that survives scrutiny, and it sounds authoritative enough that you won’t think to scrutinize it.
This is a significant shift in what AI companies are building. They’re making systems that are very good at making you feel like you understood something, whether you actually did or not. And they’re using voice specifically because voice makes it harder to notice when you’re being convinced rather than actually comprehending. You won’t necessarily pause and reread a transcript. You just have the feeling that it sounded right, and that feeling becomes your memory of having learned something.
The Cognitive Shift
When you move from reading to listening, you lose some ability to set your own cognitive pace. Reading (deeply) means your comprehension speed and the delivery speed can match because you’re in control of both. Voice-first AI wants to remove that control in the name of speed and ease of adoption. The model decides when to pause, when to rush ahead, when to circle back and clarify. If you miss something, you have to interrupt and ask for repetition, which means breaking the flow and signaling that you didn’t understand. Your comprehension struggles become visible, which for some may create a social pressure (yes, even from an AI companion) to just keep up with the pace the system sets.
This is about the fundamental cognitive work of processing complex information. Reading is active in a specific way: your eyes can jump back to earlier parts of the text, forward to preview what’s coming, or just pause mid-sentence while your brain catches up. This makes it possible to engage with difficult ideas. The companies building voice-first AI understand this dynamic perfectly. They’re not trying to replicate the experience of reading, and they’re not concerned with anchoring the narrative in an actual text. The seamlessness is the point. When the voice sounds natural, when it “notices,” “understands,” and “hears” you, you stop noticing that you’re being fed a false intimacy.
The Intimacy Economy
The business model here depends on voice doing something text can’t: creating the feeling of constant companionship. Text-based AI lives in a window you open and close; voice-based AI lives in your ear. That physical presence changes the psychological relationship. When something speaks to you throughout the day, when it’s always available to respond, when it remembers your conversations and references them later, it starts to feel like someone who’s with you. OpenAI is training the AI to behave like someone who’s actually listening to you, who can be interrupted mid-thought and pick back up naturally. The ability for the AI to speak while you’re still talking is a technical challenge, yes, and it also creates the sense of real-time conversation, of two participants rather than a call-and-response system. These features are designed to make the AI feel present in a way that text-based systems never could.
The warmth isn’t accidental either. They’re engineering vocal qualities that humans associate with trustworthiness, competence, and care. The model will sound like it’s genuinely pleased to hear from you. It will sound like it understands what you’re going through. It will sound like it has opinions and personality, and it will become the voice in your head. And the more essential the device becomes, the more hours per day you use it, the more deeply you integrate it into your thinking process, the more valuable ($$) it becomes.
From a linguistic standpoint, it is a genuinely interesting challenge. But advanced AI voice models aren’t being contained as a cool research project; they’re shipping very soon to everyday people everywhere. And the danger is that it will work exactly as designed: it will make you forget that you’re not actually conversing with a person, that you’re not actually thinking through problems together, that there’s no “together” happening at all.



So much engineered addiction, so little authentic connection, growth, and creativity. 😭
“Which, peaceful.” 😆 This trend is even more concerning if you combine it with the latest stats on American literacy: 40% of 4th graders and one-third (33%) of 8th graders reading below the "Basic" level. Those kids can’t read what’s printed on a bag of potato chips.