Text Intelligence, Edge TTS
Let the edge become more natural over time. Do not let the edge become the mind.
The most practical architecture for serious AI audio is not the most glamorous one.
It is text-native intelligence with speech at the edge.
That sounds less magical than a fully native speech-to-speech model. It does not produce the same demo feeling. It does not instantly suggest a robot companion that hears, sees, interrupts, and speaks like a person. But it preserves the thing that matters most: intelligence.
Text remains the control plane for serious AI work. Search is text-heavy. Citations are text-heavy. Code is text-heavy. Logs are text-heavy. Tool calls are structured text. Claims, sources, transcripts, diffs, revisions, and verifiers all become easier to inspect when represented textually. Long-context reasoning systems are strongest when they can work over artifacts, not just vibes.
Audio should not replace that layer. Audio should render it.
A good system can still accept voice input. Speech-to-text converts the user’s utterance into an event. The system may preserve timing, confidence, emphasis, and original audio, but the reasoning layer receives something structured enough to cite, search, route, and remember.
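A rough sketch of what that structured event could look like, in TypeScript. The field names are illustrative, not a fixed schema: the point is only that timing, confidence, emphasis, and a pointer to the original audio ride along with text the reasoning layer can actually use.

```ts
// Hypothetical shape of a transcribed utterance event.
// The reasoning layer works over the text; the raw audio is kept as a reference, not as the primary artifact.
interface UtteranceEvent {
  id: string;
  text: string;                    // the transcript the intelligence layer can cite, search, and route
  words?: Array<{
    token: string;
    startMs: number;               // timing, preserved for alignment and later clipping
    endMs: number;
    confidence: number;            // per-word recognition confidence
  }>;
  emphasis?: Array<{ startMs: number; endMs: number }>; // stressed spans, if detected
  audioRef?: string;               // pointer to the original recording, kept for playback and audit
  speakerId?: string;
  receivedAt: string;              // ISO timestamp
}
```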
Then the intelligence layer does the real work: retrieve sources, consult the artifact graph, run agents, write or revise vtexts, compare claims, inspect code, launch background tasks, and produce a radio script.
At the output edge, text-to-speech renders the script into audio.
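Put together, the pipeline has a simple shape. The sketch below reuses the UtteranceEvent shape from above; transcribe, reason, and synthesize are placeholders for whatever models or services fill those roles, and the radio script is the one artifact treated as durable.

```ts
// Sketch of the three-stage pipeline: speech in at the edge, text in the middle, speech out at the edge.
// All three function parameters are placeholders; only the script persists as an inspectable artifact.
type RadioScript = { segments: Array<{ text: string; sources: string[] }> };

async function handleUtterance(
  audio: ArrayBuffer,
  transcribe: (audio: ArrayBuffer) => Promise<UtteranceEvent>,
  reason: (event: UtteranceEvent) => Promise<RadioScript>,
  synthesize: (segmentText: string) => Promise<ArrayBuffer>,
): Promise<ArrayBuffer[]> {
  const event = await transcribe(audio);   // input edge: audio becomes a citable, searchable event
  const script = await reason(event);      // intelligence layer: retrieval, agents, revision, scripting
  return Promise.all(
    script.segments.map((s) => synthesize(s.text)), // output edge: render the script, segment by segment
  );
}
```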
This gives the platform control. If the audio is generated by a realtime persona model, the product inherits that model’s conversational habits. It will want to be an assistant. It will say “I.” It will address the user constantly. It will aim for smoothness. It will compress answers because it thinks it is in a dialogue. It will optimize for turn-taking rather than sustained thought.
Edge TTS avoids that trap. The system can generate a proper radio script, then render it with a voice chosen for the medium: calm, flat, clear; not faux-emotional, not artificially intimate, not pretending to be a friend.
It also enables buffering. A live voice stream is fragile. Anyone who walks, drives, or moves through spotty coverage knows this. If the system depends on uninterrupted realtime media, the experience breaks exactly where audio should be most useful: in motion. A radio system should pre-render chunks, cache them locally, and keep a queue ahead of the listener. If connectivity drops, playback continues. Future branching may wait, but the current stream should not disintegrate.
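A minimal sketch of that lookahead behavior, assuming a renderChunk function that produces the next pre-rendered segment. The class name and the default lookahead are illustrative; the essential move is that a failed fetch stalls prefetching, not playback of what is already cached.

```ts
// Sketch of a lookahead buffer: keep a few pre-rendered chunks ahead of the listener,
// so a dropped connection delays future branching rather than breaking the current stream.
class PlaybackBuffer {
  private queue: ArrayBuffer[] = [];
  private nextToRender = 0;

  constructor(
    private renderChunk: (index: number) => Promise<ArrayBuffer>, // placeholder for the TTS render call
    private lookahead = 3,          // how many chunks to keep ahead of the play head
  ) {}

  // Top up the queue; failures are swallowed so already-buffered audio keeps playing offline.
  async prefetch(): Promise<void> {
    while (this.queue.length < this.lookahead) {
      try {
        this.queue.push(await this.renderChunk(this.nextToRender));
        this.nextToRender++;
      } catch {
        return; // connectivity dropped; try again later, keep playing what we have
      }
    }
  }

  next(): ArrayBuffer | undefined {
    return this.queue.shift();      // hand the next cached chunk to the audio player
  }
}
```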
Edge TTS also supports cost control. Not every audio segment needs the most expensive voice model. Some can be rendered locally. Some can be cached. Some can be regenerated only when the underlying artifact changes. Frequently used summaries, intros, definitions, and transitions can become reusable audio objects.
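One way to express that regeneration rule, sketched with an in-memory map and an illustrative cache key. A real system would presumably key on whatever versioning the artifact graph already provides; the names here are assumptions.

```ts
// Sketch of a cache keyed by artifact version and voice,
// so a segment is re-synthesized only when the text it renders has actually changed.
const audioCache = new Map<string, ArrayBuffer>();

function cacheKey(artifactId: string, artifactVersion: string, voice: string): string {
  return `${artifactId}@${artifactVersion}:${voice}`;
}

async function getOrRender(
  artifactId: string,
  artifactVersion: string,
  voice: string,
  render: () => Promise<ArrayBuffer>,   // placeholder for a local or hosted TTS call
): Promise<ArrayBuffer> {
  const key = cacheKey(artifactId, artifactVersion, voice);
  const hit = audioCache.get(key);
  if (hit) return hit;                  // intros, definitions, transitions become reusable audio objects
  const rendered = await render();
  audioCache.set(key, rendered);
  return rendered;
}
```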
More importantly, edge TTS preserves the hierarchy between narration and evidence.
Choir Radio should weave AI narration with real human voice clips. When a person has actually spoken a relevant claim, the system should be able to play that original audio. Not a cloned voice. Not a generated imitation. The actual recorded speech, aligned with transcript and citation.
The AI narrator then frames the clip: here is how she described the problem in March. Then the real voice plays. Then the narrator resumes: that matters because the later policy debate adopted the same frame, but stripped out the labor implications.
This works because the narrator is not trying to be the star. The narrator is an organizer.
A fully realtime voice model tends to absorb everything into one continuous conversational persona. Edge TTS allows the product to maintain separations: narrator, source, user, critic, background agent, clip, quote, update. This is not merely aesthetic. It is epistemic. The listener should know what kind of speech they are hearing.
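One way to keep those separations explicit is to type every playable segment by the kind of speech it is. The variants below are illustrative, not a complete catalog; the point is that narration can never silently absorb a source clip or an agent update.

```ts
// Illustrative discriminated union: every item in the radio queue declares what kind of speech it is.
type RadioSegment =
  | { kind: 'narration'; script: string; voice: string }                              // AI narrator, rendered by edge TTS
  | { kind: 'source-clip'; audioRef: string; transcript: string; citation: string }   // the actual recorded human voice
  | { kind: 'quote'; text: string; citation: string }                                 // quoted text read aloud, attributed
  | { kind: 'agent-update'; agentId: string; script: string }                         // background task reporting in
  | { kind: 'critic-note'; script: string; targetSegmentId: string };                 // dissent, kept distinct from narration
```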
Text intelligence with edge TTS also makes the system easier to audit. Every spoken segment can correspond to a script. Every script can correspond to sources, citations, and artifact states. If a user asks where something came from, the system can answer. If a speaker disputes a clip, the platform can inspect the source. If a generated claim is wrong, the error can be traced.
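The audit path can be as simple as two lookups, sketched here with placeholder stores: one from spoken segment to script, one from script to the sources and artifact state it was generated against.

```ts
// Sketch of the audit path: spoken segment -> script -> sources and artifact state.
interface Provenance {
  scriptId: string;
  sources: string[];        // citations behind the claims in the script
  artifactState: string;    // version of the artifact graph the script was generated against
}

function traceSegment(
  segmentId: string,
  segmentToScript: Map<string, string>,          // placeholder store: spoken segment -> script
  scriptToProvenance: Map<string, Provenance>,   // placeholder store: script -> provenance
): Provenance | undefined {
  const scriptId = segmentToScript.get(segmentId);
  return scriptId ? scriptToProvenance.get(scriptId) : undefined;
}
```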
That is the difference between voice as performance and voice as interface.
The goal is not to make AI sound maximally human. The goal is to make intelligence listenable without losing provenance.
This architecture is not anti-voice. It is pro-audio. It takes audio seriously enough not to reduce it to a talking face over a shallow brain.
The future product may eventually use better realtime interaction models at the edge. Fine. If a model can handle interruption, timing, video cues, and speech rhythm better than a pipeline, use it. But the deeper system should remain text/artifact-native.
Let the edge become more natural over time.
Do not let the edge become the mind.