May 29, 2026 · 11 min read

Speech AI in 2026: The Infrastructure Layer for Voice Agents

Voice is the default UI for agentic systems in 2026. Not chat. Not forms. Not buttons. If you are building an AI agent that interacts with humans — scheduling, support, intake, triage — voice is where the majority of production deployments are landing. The $22.5B voice AI market in 2026 is not analysts inflating a number. It is the actual surface area where agentic AI is meeting people.

The infrastructure problem is not the LLM. Every team has solved the LLM selection question by now. The problem is the plumbing: getting audio from a human's mouth to a transcript, through the LLM, back out as speech, in under 600ms, reliably, in multiple languages, without garbling barge-ins or freezing on silence.

This post is about that plumbing.

The Latency Budget

Before you pick any component, you need to understand where the milliseconds go. A voice agent pipeline has six phases, and every one of them has a non-negotiable minimum.

[Figure: voice agent end-to-end latency budget waterfall]

The full budget at realistic 2026 numbers:

VAD/Capture: 20ms — Voice Activity Detection flags that the user has stopped speaking. This is running locally via WebRTC's built-in VAD or a library like Silero. 20ms is essentially fixed.
STT (Deepgram Nova-3): 150ms — Streaming transcription. The audio is being transcribed in real-time as the user speaks, so this is the time from end-of-speech to final transcript. Nova-3 at P50 is 150ms.
Endpointing: 75ms — After VAD says "speech ended," the system waits to confirm the user is actually done (not just pausing mid-sentence). 50–100ms is the standard window. Too short: interrupts users. Too long: adds dead air.
LLM TTFT: 250ms — Time to first token from the language model. This is the dominant variable. A small 3B streaming model on dedicated hardware can do 80–120ms. Claude Sonnet on API is 200–400ms. GPT-4o on Azure is 150–300ms. This is where the optimization ceiling lives.
TTS TTFB (Cartesia): 40ms — Time to first audio byte from the TTS provider. Cartesia Sonic Turbo is 40ms. ElevenLabs Flash v2.5 is 75ms. The LLM and TTS stream in parallel: TTS begins processing tokens as the LLM generates them, so TTFB is measured from when TTS starts receiving tokens, not from when the LLM finishes.
Network RTT: 35ms — Round-trip latency. On a Mumbai EC2 instance to a user in the same city, 20–30ms. International: 80–200ms.

Total: 570ms.

The "human feels natural" threshold is 300ms. Achievable only with: a local or edge LLM (or heavily cached responses), the fastest TTS provider, and a user on good Wi-Fi. At 570ms with the optimized stack, most users perceive it as slightly delayed but not robotic. Above 800ms is where it starts to feel like a bad phone call.

The critical insight: LLM TTFT is 44% of the budget. Switching your STT from Whisper (200ms) to Deepgram (150ms) saves 50ms. Switching your TTS from ElevenLabs (75ms) to Cartesia (40ms) saves 35ms. Switching your LLM from a 70B model to a 3B streaming model for the first token saves 150ms. These are not equivalent optimizations. Know where the ceiling is before you spend engineering time on it.
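To sanity-check the arithmetic, here is a small sketch that totals the budget and shows what a single swap buys you. The phase names are just labels chosen for this example; the millisecond values are the ones listed above.

```python
# Latency budget from the list above, in milliseconds.
BUDGET_MS = {
    "vad_capture": 20,
    "stt_final": 150,
    "endpointing": 75,
    "llm_ttft": 250,
    "tts_ttfb": 40,
    "network_rtt": 35,
}


def total_ms(budget):
    """Sum of all phases: the end-to-end latency."""
    return sum(budget.values())


def share(budget, phase):
    """Fraction of the total budget spent in one phase."""
    return budget[phase] / total_ms(budget)


def with_swap(budget, phase, new_ms):
    """Copy of the budget with one phase replaced by a new value."""
    swapped = dict(budget)
    swapped[phase] = new_ms
    return swapped


if __name__ == "__main__":
    print(f"total: {total_ms(BUDGET_MS)} ms")
    print(f"LLM share: {share(BUDGET_MS, 'llm_ttft'):.0%}")
    # A 3B streaming model at roughly 100ms TTFT instead of 250ms.
    small_llm = with_swap(BUDGET_MS, "llm_ttft", 100)
    print(f"with small LLM: {total_ms(small_llm)} ms")
```

Run the same swap for STT or TTS and the difference in payoff is obvious before you write any integration code.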


STT Engines in 2026

The STT market consolidated fast. Three years ago, every team was running Whisper locally. Today, the managed streaming APIs have latency and accuracy that Whisper on any reasonable hardware cannot match for real-time use cases.

Deepgram Nova-3 Multilingual (March 2026) is the current production default for English and most European languages. 4.2% WER on English, 150ms P50 streaming latency, $0.0043/min. The March 2026 release cut batch WER by 34% over Nova-2. The multilingual model handles 36 languages without a language-detection step. For a 10-minute call at 10,000 calls/day, that is $0.043/call or $430/day (not cheap, but predictable).

AssemblyAI Universal-2 sits at 5.1% WER and 180ms P50. The streaming API is solid and the developer experience (Python SDK, webhook support) is excellent. At $0.0064/min it is 49% more expensive than Deepgram for the same audio. The difference is worth it if you need their speaker diarization or PII redaction in the same pipeline — those are genuinely good and tightly integrated. For pure streaming transcription, Deepgram is ahead on both latency and cost.

Whisper v3 Turbo is the open-source baseline. 8.6% WER on English on a self-hosted A10G, ~200ms latency in streaming mode (via faster-whisper or whisper.cpp). Near-zero cost at scale if you own the hardware. The catch: real-time streaming with Whisper requires chunking the audio into overlapping segments and stitching transcripts, which introduces artifacts at chunk boundaries and complicates endpointing. For async transcription (post-call processing), Whisper is excellent. For real-time voice agents, the managed APIs are ahead.

The India Angle: Sarvam Saaras v3

If you are building for India — and if you are in Mumbai, you probably are — the standard STT benchmarks are misleading. Nova-3's 4.2% WER on English becomes 24% WER on Hindi. That is not a small degradation; it breaks the user experience for the majority of Indian users who mix English and Hindi (Hinglish) in natural speech.

Sarvam Saaras v3 is purpose-built for Indian languages: Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, Gujarati, Marathi, Punjabi, and Odia. 12% WER on Hindi vs Nova-3's 24%. ~170ms P50 streaming latency. Priced at ₹0.25/min (~$0.0025/min) — roughly half of Deepgram's international pricing, and the model is hosted in India (low latency from Mumbai).

For a production India-facing voice agent, the practical architecture is a language detection step at the start of the call, then route to Sarvam for Indian-language sessions and Deepgram for English-dominant sessions. This is one extra API call (~30ms) and saves substantial accuracy on the majority of calls.
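The routing decision itself is a few lines. This is a minimal sketch, assuming your language-ID step returns an ISO 639-1 code plus a confidence score; `choose_stt`, `SttRoute`, the 0.6 threshold, and the model labels are all hypothetical names for illustration, not any provider's API. Falling back to Deepgram on low confidence is also an assumption you should validate against your own traffic.

```python
from dataclasses import dataclass

# ISO 639-1 codes for the ten Indian languages listed above.
INDIC_LANGS = {"hi", "ta", "te", "kn", "ml", "bn", "gu", "mr", "pa", "or"}


@dataclass(frozen=True)
class SttRoute:
    provider: str  # label only; wire this to your real client
    model: str     # placeholder model label, not a verified API value


def choose_stt(detected_lang: str, confidence: float,
               min_confidence: float = 0.6) -> SttRoute:
    """Pick an STT backend from a call-start language-ID result."""
    if confidence >= min_confidence and detected_lang in INDIC_LANGS:
        return SttRoute(provider="sarvam", model="saaras-v3")
    # English-dominant or uncertain: use the multilingual default.
    return SttRoute(provider="deepgram", model="nova-3-multilingual")


if __name__ == "__main__":
    print(choose_stt("hi", 0.92))
    print(choose_stt("en", 0.88))
    print(choose_stt("ta", 0.35))  # low confidence falls back
```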

TTS in 2026: Cartesia vs ElevenLabs

The TTS market has a clear cost/quality tradeoff in 2026. The two streaming providers worth knowing:

ElevenLabs Flash v2.5: ~75ms TTFB, $0.30/1,000 characters. The quality ceiling for voice cloning and emotional expressiveness. If the voice is a brand asset — a recognizable persona that users will hear repeatedly — ElevenLabs is worth the premium. The voice cloning from 30 seconds of audio is production-quality and handles emotion, pace, and style. At $0.30/1K chars for a 500-character average response: $0.15/call.

Cartesia Sonic Turbo: ~40ms TTFB, $0.05/1,000 characters. Six times cheaper than ElevenLabs. The quality is not identical — less emotional range, voice cloning requires more source audio — but for utility voice agents (support bots, scheduling assistants, intake flows), it is indistinguishable to most users. At $0.05/1K chars for the same 500-character response: $0.025/call.

At 10,000 calls/day with one 500-character response per call, the TTS cost is ElevenLabs $1,500/day vs Cartesia $250/day. That $1,250/day delta is about $456,000/year on one component. The 35ms latency improvement compounds this: faster first audio byte means users perceive the agent as more responsive even when total call length is the same.
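If you want to plug in your own volumes, the cost model is short. This sketch uses the per-1,000-character prices quoted above and assumes one 500-character response per call; the function names are made up for this example.

```python
def tts_cost(chars: int, usd_per_1k_chars: float) -> float:
    """Cost in USD to synthesize `chars` characters."""
    return chars / 1000 * usd_per_1k_chars


def daily_tts_cost(calls_per_day: int, chars_per_call: int,
                   usd_per_1k_chars: float) -> float:
    """Daily TTS spend for a given call volume."""
    return calls_per_day * tts_cost(chars_per_call, usd_per_1k_chars)


if __name__ == "__main__":
    elevenlabs = daily_tts_cost(10_000, 500, 0.30)
    cartesia = daily_tts_cost(10_000, 500, 0.05)
    annual_delta = (elevenlabs - cartesia) * 365
    print(f"ElevenLabs: ${elevenlabs:,.0f}/day")
    print(f"Cartesia:   ${cartesia:,.0f}/day")
    print(f"Delta:      ${annual_delta:,.0f}/year")
```

Real calls usually contain more than one agent response, so treat `chars_per_call` as the number to measure first.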

Voice cloning note: Both providers support voice cloning, but ElevenLabs' Instant Voice Cloning from 30–60 seconds of clean audio is still the best in class for brand voice. Cartesia's cloning is more sensitive to audio quality and requires 5–10 minutes of source material for comparable results.

Voice Agent Frameworks

You need more than STT and TTS. A voice agent requires: WebRTC handling, VAD, turn-taking logic, barge-in detection, session state, and the orchestration layer that connects audio streams to your LLM pipeline. Three frameworks dominate, and a fourth covers a specialized niche:

FRAMEWORK	ARCHITECTURE	COST OVERHEAD	BEST FOR
LiveKit Agents	WebRTC-native, open-source Python SDK	Infra only	Full control, custom pipeline, production
Vapi	Managed WebRTC + orchestration	+$0.05/min	Fast prototyping, less infra work
Retell AI	No-code builder + REST API	+$0.07/min	Non-technical builders, demos
Hume AI EVI	Emotion-aware, multimodal voice	Custom pricing	Empathy-forward, mental health use cases

LiveKit Agents is the production choice for teams building custom voice products. It is WebRTC-native (not WebSocket-based), which matters more than most teams realize.

WebRTC vs WebSockets for Voice

WebSockets are a reasonable transport for text and structured data. For real-time audio, they are the wrong tool:

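Jitter buffer: WebRTC has a built-in adaptive jitter buffer that handles network packet reordering and drops. WebSockets run over TCP, which delivers bytes strictly in order: one late or lost packet stalls everything behind it (head-of-line blocking), and the audio stutters.
SRTP encryption: WebRTC mandates media encryption (DTLS-SRTP), so there is no unencrypted mode to misconfigure. WebSockets are encrypted only when served over TLS (wss://), and that protects the connection rather than the individual media packets. For compliance use cases (HIPAA, financial services), encryption that is on by default matters.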
Adaptive bitrate: WebRTC's congestion control adjusts audio quality dynamically based on network conditions. WebSockets have no such mechanism — you either buffer (adding latency) or drop (causing garbling).

The LiveKit engineering team has benchmarked this extensively. At 50ms average jitter (typical mobile network), WebRTC's jitter buffer absorbs the variance without perceptible quality degradation. A WebSocket-based transport at the same jitter produces audible artifacts every few seconds.
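To make the jitter buffer idea concrete, here is a deliberately simplified sketch. It is not LiveKit's or libwebrtc's implementation: real jitter buffers adapt their depth and work on RTP timestamps, while this one uses a fixed depth and bare sequence numbers. The class and method names are invented for this example.

```python
import heapq


class JitterBuffer:
    """Fixed-depth reordering buffer keyed by packet sequence number."""

    def __init__(self, depth: int = 3):
        self.depth = depth    # packets we will hold while waiting for a gap
        self._heap = []       # min-heap of (seq, payload)
        self._next_seq = 0    # next sequence number due for playout

    def push(self, seq: int, payload: bytes) -> None:
        # Packets arriving after their slot has played are useless; drop them.
        if seq >= self._next_seq:
            heapq.heappush(self._heap, (seq, payload))

    def pop_ready(self) -> list:
        """Return packets playable now, in order; None marks a lost packet."""
        out = []
        while self._heap:
            seq, payload = self._heap[0]
            if seq < self._next_seq:
                heapq.heappop(self._heap)        # stale duplicate
            elif seq == self._next_seq:
                heapq.heappop(self._heap)
                out.append((seq, payload))
                self._next_seq += 1
            elif len(self._heap) > self.depth:
                # The gap outlasted the buffer: give up and conceal the loss.
                out.append((self._next_seq, None))
                self._next_seq += 1
            else:
                break                            # keep waiting for the gap
        return out


if __name__ == "__main__":
    buf = JitterBuffer(depth=2)
    for seq in [0, 2, 1, 4, 5, 6]:               # 2 arrives early, 3 is lost
        buf.push(seq, b"frame")
        print(seq, "->", [s for s, _ in buf.pop_ready()])
```

A plain in-order byte stream gives you none of this: there is no way to skip the missing packet and keep playing.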

LiveKit Agents gives you the WebRTC infrastructure and a Python SDK that exposes sensible abstractions: VoicePipelineAgent, built-in VAD via silero, interrupt handling, and plugin architecture for swapping STT/LLM/TTS components.

Here is a minimal LiveKit agent with interrupt handling and VAD:

from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, cartesia, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # Initial context — persona and rules
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are a voice assistant for GT Bharat. "
            "Keep responses under 2 sentences for voice. "
            "Never say 'certainly' or 'absolutely'."
        ),
    )

    assistant = VoicePipelineAgent(
        vad=silero.VAD.load(),                          # local VAD, ~20ms
        stt=deepgram.STT(model="nova-3", language="multi"),  # Nova-3 multilingual, 150ms P50
        llm=openai.LLM(model="gpt-4o-mini"),            # fast, cheap first token
        tts=cartesia.TTS(model="sonic-turbo"),          # 40ms TTFB; pass voice=<voice ID> to choose a voice
        chat_ctx=initial_ctx,
        # Interrupt handling: if user speaks while agent is talking,
        # stop the agent and re-run the pipeline with the new transcript.
        allow_interruptions=True,
        interrupt_speech_duration=0.3,   # 300ms of user speech triggers interrupt
        interrupt_min_words=1,           # even a single word interrupts
        min_endpointing_delay=0.5,       # 500ms silence before end-of-turn
    )

    assistant.start(ctx.room)
    await assistant.say("How can I help you today?", allow_interruptions=True)

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

The allow_interruptions=True flag is doing real work here. When the user speaks while the agent is talking, LiveKit's VAD detects speech onset, the agent stops the current TTS playback mid-stream, discards buffered audio, and re-runs the full STT → LLM → TTS pipeline with the new input. Without this, the agent sounds like a bad IVR that ignores you when it's talking.
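The decision behind a barge-in can be sketched in plain Python. This is an illustration of the logic described above, not LiveKit's internal code; `should_interrupt` and `Playback` are hypothetical names, and the defaults simply mirror the 0.3-second and one-word settings from the agent config.

```python
from collections import deque


def should_interrupt(agent_speaking: bool, user_speech_s: float,
                     transcript: str, min_speech_s: float = 0.3,
                     min_words: int = 1) -> bool:
    """Decide whether detected user speech should cut off the agent."""
    if not agent_speaking:
        return False    # nothing to interrupt
    if user_speech_s < min_speech_s:
        return False    # too short: likely a cough or an echo blip
    return len(transcript.split()) >= min_words


class Playback:
    """Holds TTS chunks that are synthesized but not yet played."""

    def __init__(self):
        self.queue = deque()
        self.speaking = False

    def enqueue(self, chunk: bytes) -> None:
        self.queue.append(chunk)
        self.speaking = True

    def interrupt(self) -> int:
        """Stop speaking and discard buffered audio; returns chunks dropped."""
        dropped = len(self.queue)
        self.queue.clear()
        self.speaking = False
        return dropped


if __name__ == "__main__":
    playback = Playback()
    playback.enqueue(b"chunk-1")
    playback.enqueue(b"chunk-2")
    if should_interrupt(playback.speaking, 0.4, "wait"):
        print("dropped", playback.interrupt(), "chunks")
```

The important detail is the discard: if buffered audio keeps playing after the interrupt fires, the user still hears the agent talk over them.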

Vapi abstracts all of this. You define your agent in JSON (STT provider, LLM, TTS provider, system prompt, functions), and Vapi handles the WebRTC connection, VAD, turn-taking, and streaming. The developer experience for a proof-of-concept is excellent — you can have a working voice agent in a day. The cost is $0.05/min on top of underlying provider costs, and the abstractions become constraints when you need custom turn-taking logic or non-standard VAD behavior.

Retell AI is further up the abstraction stack: a no-code builder with a visual flow editor. Useful for demos and non-technical stakeholders. Not the right choice for production systems that need custom behavior.

The Hard Parts

The benchmark numbers are clean. Production is not.

Barge-in detection is harder than the VAD tutorial makes it sound. The challenge: the user's microphone picks up the agent's TTS output through the speakers (acoustic coupling), which triggers false VAD activations. The fix is acoustic echo cancellation (AEC), which WebRTC includes but which requires tuning. LiveKit's AEC works well on desktop browsers; mobile WebView implementations vary significantly. Test this explicitly on Android Chrome — it is the most common failure point.

Silence detection and turn-taking in noisy environments: a user calling from a Mumbai street has ambient noise that triggers VAD continuously. Silero VAD with aggressive thresholds cuts off the user mid-sentence. Permissive thresholds mean the agent waits indefinitely. The practical solution is a hybrid: Silero VAD for speech onset detection, then an energy-threshold model to distinguish speech from background noise for endpointing. This is not documented anywhere clearly; it is the kind of thing you discover after the first demo in a loud environment fails.
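Here is one way the energy-threshold half of that hybrid could look. It is a sketch under assumptions: your VAD has already flagged speech onset, frames arrive as lists of float samples in [-1, 1], and every parameter (`margin`, `hangover_frames`, `floor_alpha`) is a hypothetical starting point that needs tuning on real recordings.

```python
import math


def rms(frame) -> float:
    """Root-mean-square level of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class EnergyEndpointer:
    """Signals end-of-turn after a run of frames near the noise floor."""

    def __init__(self, margin: float = 3.0, hangover_frames: int = 15,
                 floor_alpha: float = 0.05, initial_floor: float = 0.01,
                 min_floor: float = 1e-4):
        self.margin = margin                  # speech must exceed floor * margin
        self.hangover_frames = hangover_frames
        self.floor_alpha = floor_alpha        # how fast the floor tracks ambience
        self.noise_floor = initial_floor
        self.min_floor = min_floor            # keeps the floor from collapsing to 0
        self.quiet_run = 0

    def update(self, frame) -> bool:
        """Feed one frame; returns True once the turn looks finished."""
        level = rms(frame)
        if level > self.noise_floor * self.margin:
            self.quiet_run = 0                # still speech
        else:
            # Adapt the floor only on non-speech frames.
            adapted = self.noise_floor + self.floor_alpha * (level - self.noise_floor)
            self.noise_floor = max(self.min_floor, adapted)
            self.quiet_run += 1
        return self.quiet_run >= self.hangover_frames


if __name__ == "__main__":
    endpointer = EnergyEndpointer(hangover_frames=3, initial_floor=0.02)
    street_noise = [0.02] * 160
    speech = [0.5] * 160
    frames = [speech] * 5 + [street_noise] * 3
    print([endpointer.update(f) for f in frames])
```

Because the floor adapts to the ambient level, steady street noise stops counting as speech, which is what a fixed VAD threshold gets wrong.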

Multilingual mid-call switching is the underdocumented hard problem for Indian deployments. A user starts in English ("Hi, I want to book an appointment"), switches to Hindi mid-call ("mujhe kal ka time chahiye"), and then code-switches back. A language detection step at call start does not handle this. The solution is continuous language detection on each transcribed segment (Deepgram Nova-3 Multilingual does this natively) and routing the response through a TTS voice that handles both languages. Sarvam's TTS handles Hinglish; ElevenLabs does not handle it well.
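Per-segment detection is noisy, so you probably do not want to flip the TTS voice on every single tag. One possible approach, sketched below, is to smooth the per-segment language over a short window and switch only on a strict majority. `LanguageTracker` and its window size are assumptions for illustration; the language tag per segment is whatever your STT returns.

```python
from collections import Counter, deque


class LanguageTracker:
    """Smooths per-segment language tags to choose the response language."""

    def __init__(self, window: int = 3, default: str = "en"):
        self.recent = deque(maxlen=window)   # last few segment languages
        self.current = default

    def observe(self, lang: str) -> str:
        """Record one segment's language; return the language to respond in."""
        self.recent.append(lang)
        top, count = Counter(self.recent).most_common(1)[0]
        # Switch only when one language holds a strict majority of the window.
        if count * 2 > len(self.recent):
            self.current = top
        return self.current


if __name__ == "__main__":
    tracker = LanguageTracker(window=3)
    for lang in ["en", "hi", "hi", "en", "en"]:
        print(lang, "->", tracker.observe(lang))
```

A window of three trades one segment of lag for stability; a user who switches to Hindi and stays there gets a Hindi response by their second Hindi segment.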

Echo and latency on mobile networks (3G/4G in tier-2/tier-3 cities): WebRTC's adaptive bitrate helps, but calls over congested networks with 150ms+ RTT expose every weakness in the endpointing logic. Users who pause between words to think get cut off. Users on slow connections have their audio arrive out-of-order and the jitter buffer absorbs it at the cost of added latency. Test on actual mobile networks before assuming Wi-Fi benchmarks are representative.

India-Specific Infrastructure

If you are building a voice agent for the Indian market — which from Mumbai is the obvious market — the infrastructure choices are different from the default global stack.

Sarvam is the anchor piece. Their Saaras v3 ASR handles 10+ Indian languages with models trained on actual Indian speech patterns (not just English recordings from Indian speakers, which most global models use). The API is hosted in India; round-trip from Mumbai EC2 is under 20ms. At ₹0.25/min, it is the cheapest production-grade option for Indian languages by a significant margin.

Bhashini (Government of India's National Language Translation Mission) provides free API access to ASR and TTS for 22 scheduled Indian languages. Quality is lower than Sarvam and latency is higher (~300ms), but for languages that Sarvam does not cover (Konkani, Dogri, Bodo), it is the only production option. The API reliability has improved substantially in 2025–2026.

Network routing: EC2 ap-south-1 (Mumbai) is the right anchor point. Cloudflare's Indian PoPs (Mumbai, Chennai, Delhi) reduce TLS handshake latency to 5–10ms for most Indian users. Combine with LiveKit's India relay servers and you get sub-20ms WebRTC signaling from most major Indian cities.

Language identification at call start: Run a 2-second audio sample through a fast language ID model (Whisper's detect_language() is adequate for this use case at ~50ms) before routing to the appropriate STT pipeline. This single 50ms decision saves significant downstream accuracy.

What Production Voice Agents Actually Need

A benchmark-passing demo and a production voice agent are different things. Here is the actual checklist:

End-to-end latency under 700ms at P95 — not P50. P95 matters because users notice the slow calls, not the average.
Barge-in that works in real environments — test with speakers playing back TTS audio (the actual echo cancellation scenario).
Graceful degradation on network issues — the agent should not freeze or error silently when a packet is lost. WebRTC handles this better than WebSockets by default.
Multilingual support from day one — retrofitting language support into a monolingual pipeline is expensive. Start with Nova-3 Multilingual + Sarvam routing if India is a target market.
Call session state — context window management across a 5-minute call. Most teams underestimate how many tokens accumulate in a long voice conversation.
Observability — log every segment: STT transcript, LLM input/output, TTS character count, segment latencies. This is the only way to debug why a specific call went wrong.
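Cost modeling before scale — at 10,000 calls/day at 5 minutes average (about 1.5 million minutes a month), your monthly STT cost is roughly $6,450 (Deepgram) or $9,600 (AssemblyAI) at the per-minute rates above. Your TTS cost is $75,000 (Cartesia) or $450,000 (ElevenLabs), which corresponds to about 5,000 synthesized characters per call. TTS, not STT, dominates the bill, and these numbers change the architecture decision.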
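The P95 and observability items go together: you cannot report a P95 you never logged. Below is a minimal sketch of a per-segment log record and a nearest-rank percentile. The `SegmentLog` fields are an assumed schema for this example, not a standard, and `total_ms` here sums only the three provider phases.

```python
import json
import math
from dataclasses import asdict, dataclass


@dataclass
class SegmentLog:
    """One user turn: what was heard, what was said, and how long it took."""
    call_id: str
    transcript: str
    llm_output: str
    stt_ms: float
    llm_ttft_ms: float
    tts_ttfb_ms: float
    tts_chars: int

    @property
    def total_ms(self) -> float:
        return self.stt_ms + self.llm_ttft_ms + self.tts_ttfb_ms

    def to_json(self) -> str:
        """Serialize to one JSON line for your log pipeline."""
        record = asdict(self)
        record["total_ms"] = self.total_ms
        return json.dumps(record)


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    logs = [
        SegmentLog("call-1", "book a slot", "Sure, which day?", 150, 250, 40, 16),
        SegmentLog("call-1", "tomorrow", "Morning or evening?", 160, 900, 45, 19),
    ]
    for entry in logs:
        print(entry.to_json())
    print("P95 total:", percentile([entry.total_ms for entry in logs], 95))
```

With only two segments the P95 is simply the slow one, which is the point: the outlier is what the user remembers.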

Voice AI infrastructure in 2026 is a solved problem in the sense that every component exists and is production-grade. It is not solved in the sense that assembling those components into a reliable, low-latency, multilingual agent still requires significant engineering. The tools are better than they were 18 months ago. The hard parts are still hard.