Most production voice agents in 2026 still feel robotic. Not because the language models are bad, but because the pipeline is serial. Speech-to-text waits for the user to finish, the LLM waits for the full transcript, text-to-speech waits for the LLM, and playback waits for TTS. The cumulative tail latency lands somewhere between 2 and 4 seconds — and humans notice conversational lag at roughly 200ms.
The architectural breakthrough that voice teams are converging on in 2026 is not a smarter model. It is pipeline overlap: every stage starts working before the previous one finishes. This article walks through the architecture, the reference stack, the failure modes, and what it takes to ship a voice agent that actually feels conversational.
The serial pipeline trap
The naive voice agent looks like this:
[mic capture] → [upload] → [ASR] → [LLM] → [TTS] → [download] → [playback]
Each arrow is a wait. Each stage holds the next one hostage. If your ASR takes 400ms, your LLM time-to-first-token (TTFT) is 600ms, and your TTS takes another 400ms before audio starts streaming, you are already at 1.4 seconds of minimum latency — before counting any network round trips, voice activity detection (VAD), or buffering.
Worse, every stage has a long tail. P99 ASR latency on a noisy mobile uplink can easily double the P50. Stack three serial stages, multiply the tails, and your conversation collapses into awkward pauses.
The serial model also makes barge-in (interrupting the agent mid-sentence) almost impossible. Playback has to drain, TTS has to be cancelled, and ASR has to be re-armed — all while the user is already mid-thought.
Pipeline overlap, explained
In an overlapped pipeline, the stages are not a queue. They are co-routines. Each one emits partial output that the next stage starts consuming immediately.
mic ──▶ streaming ASR ──▶ streaming LLM ──▶ streaming TTS ──▶ playback
(partial (begins generating (synthesizes the
transcripts on the first stable first audio chunk
every 100ms) words) on the first tokens)
Three principles make it work:
- Streaming ASR with partial hypotheses. Modern speech models emit a stabilizing transcript every 80–150ms, not a single result at end-of-utterance. The LLM does not wait for "end-of-turn"; it watches the transcript stream and starts forming intent.
- Speculative generation on stable partials. When the partial transcript stops mutating for a short window (or a punctuator like "." or "?" lands), the LLM begins generation. If the user adds more words, the speculative output is discarded and regenerated — cheap with small models, more expensive with frontier ones.
- Streaming TTS keyed to LLM token stream. TTS does not wait for a full sentence. As soon as the LLM emits the first few tokens, TTS synthesizes the opening audio chunk, and playback starts. The user hears the response begin while the LLM is still generating the rest.
Done correctly, perceived latency collapses from roughly 2.5 seconds to under 500ms, often closer to 300ms on the best stacks.
The reference 2026 voice stack
Across production voice teams, a few combinations keep showing up:
- Transport: LiveKit Agents, Twilio Media Streams, or Daily — all support WebRTC for sub-100ms uplink and full-duplex audio.
- VAD: Silero VAD or the transport's built-in turn detector — runs locally, decides when the user has actually stopped talking.
- ASR: Deepgram Nova-2, AssemblyAI Universal-Streaming, or Cartesia Ink — all stream partial hypotheses under 200ms.
- LLM: Groq (Llama 3.x / Kimi) for sub-500ms TTFT, Cerebras for similar latency profiles, or OpenAI gpt-realtime when you want speech-in / speech-out collapsed into one model.
- TTS: Deepgram Aura, Cartesia Sonic, or ElevenLabs Turbo — all support streaming synthesis with first-chunk latency under 250ms.
The exact picks matter less than the shape: every component must support streaming on both input and output, or the pipeline will degenerate back into serial waits at the slowest link.
A minimal overlapped agent loop
Here is a stripped-down Python sketch using LiveKit Agents — the structure is what matters, not the specific SDK:
from livekit.agents import Agent, JobContext, llm, stt, tts
from livekit.plugins import deepgram, groq, cartesia, silero
async def entrypoint(ctx: JobContext):
agent = Agent(
vad=silero.VAD.load(),
stt=deepgram.STT(model="nova-2", interim_results=True),
llm=groq.LLM(model="llama-3.3-70b-versatile"),
tts=cartesia.TTS(model="sonic-english"),
chat_ctx=llm.ChatContext().append(
role="system",
text="You are a concise voice assistant. Reply in one or two short sentences.",
),
)
@agent.on("user_speech_committed")
async def on_user_speech(msg: llm.ChatMessage):
await agent.say(agent.llm.chat(chat_ctx=agent.chat_ctx))
await agent.start(ctx.room)Two things are worth noting:
interim_results=Trueon the STT is the difference between a serial and overlapped pipeline. Without it, the agent waits for end-of-utterance before it sees anything.agent.say(...)accepts an async LLM stream. The TTS plugin reads tokens as they arrive and pushes the first audio chunk before the LLM has finished generating.
The system prompt asking for "one or two short sentences" is not stylistic. Short responses are a latency strategy. The first audio chunk arrives faster when there is less to generate, and barge-in is more forgiving when the agent finishes quickly anyway.
Handling barge-in (full-duplex)
A real conversation has interruptions. The agent has to detect when the user starts talking while it is speaking, then immediately:
- Stop playback.
- Cancel the in-flight TTS request.
- Cancel the in-flight LLM generation.
- Re-arm ASR with the user's new utterance.
In WebRTC stacks, this is usually triggered by the VAD on the inbound audio firing while playback is active. The cancellation has to be fast — every millisecond between "user starts talking" and "agent goes silent" feels like the agent is talking over the user.
A simple guardrail: if the LLM has already streamed more than N tokens before barge-in, mark the partial response as "delivered" in the chat context so the next turn does not assume the user heard a clean response.
Measuring what matters
Three latencies matter, and most teams measure them wrong:
- End-of-speech to first audible response (EOS→TTFA). This is what users feel. It is not the time from the start of generation; it is from the moment the user stops talking.
- Time-to-first-token (TTFT) on the LLM. Streaming TTS cannot start audio until at least one token arrives, so this is the floor on TTFA.
- Tail latency (P95/P99) per stage. Averages lie. A pipeline with a 250ms P50 and 1.2s P95 will feel broken to one user in twenty.
Instrument every stage transition. The most common surprise is that the network, not the model, dominates tail latency — especially on mobile uplinks.
Failure modes you will hit
- Echo and self-listening. If the TTS audio bleeds back into the microphone, ASR will transcribe the agent's own voice and the LLM will respond to itself. Use a transport with proper echo cancellation (WebRTC's AEC, not naive HTTP streaming) or VAD gating during playback.
- Aggressive turn detection. If your VAD triggers end-of-turn too quickly, the agent interrupts the user mid-sentence. Too slow, and the agent feels sluggish. Tunable silence thresholds in the 400–800ms range are typical; pair with semantic turn detection (the LLM voting on whether the utterance is complete) for the best results.
- Accent and code-switching collapse. Streaming ASR models often degrade on Arabic-French code-switching common in MENA contexts. Test with real customers, not benchmark audio.
- TTS first-chunk warmup. Some TTS providers have a cold-start penalty on the first chunk. Pre-warm the connection at agent start, not when the first response arrives.
When voice agents are actually worth shipping
Voice is the right interface when the user's hands or eyes are busy (driving, walking, hands-on labor), when typing is slow (long-form data entry over a phone call), or when the channel is already voice (inbound support lines, outbound collections, appointment confirmations). It is the wrong interface when the response is a long list, a table, or anything the user needs to scan and re-read.
For most MENA SMEs, the highest-ROI voice agent in 2026 is still the inbound support deflection use case — answer common questions in Arabic, French, or English on the first ring, hand off to a human only when the intent is unclear or sensitive. The pipeline overlap architecture is what makes that handoff feel natural instead of robotic.
Bottom line
Voice AI in 2026 is not a model engineering problem. It is a real-time distributed systems problem. The teams shipping agents that feel conversational are the ones treating ASR, the LLM, and TTS as overlapping streams rather than a serial queue — and instrumenting EOS→TTFA as the single number that matters. If your current voice agent feels like a walkie-talkie, the fix is almost never a better model. It is the pipeline.
Related reading: