End-to-end speech models cut voice latency by up to 70%, Deepgram says
The AI speech company says unified architectures bring response times below 250 milliseconds
Deepgram, the artificial intelligence speech platform, says end-to-end speech-to-speech models can reduce voice agent latency by 50% to 70% compared with traditional pipelined systems.
The company said unified models typically deliver response times in the 200 to 250 millisecond range, against 450 to 750 milliseconds for cascaded architectures that chain together separate components.
In conventional stacks, speech-to-text processing adds 100 to 300 milliseconds, large language model (LLM) inference contributes 200 to 800 milliseconds and text-to-speech conversion adds a further 150 to 400 milliseconds. Orchestration and network overhead frequently push total latency beyond the conversational target of 300 to 500 milliseconds.
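As a rough illustration rather than a Deepgram calculation, the Python sketch below sums the low and high ends of those component ranges against the conversational target; the orchestration and network overhead figure is an assumed value added for the example.

```python
# Illustrative arithmetic only: component ranges quoted in the article.
STT_MS = (100, 300)        # speech-to-text
LLM_MS = (200, 800)        # large language model inference
TTS_MS = (150, 400)        # text-to-speech
OVERHEAD_MS = (50, 150)    # assumed orchestration + network overhead (not a Deepgram figure)

TARGET_MS = (300, 500)     # conversational latency target cited above

best_case = sum(low for low, _ in (STT_MS, LLM_MS, TTS_MS, OVERHEAD_MS))
worst_case = sum(high for _, high in (STT_MS, LLM_MS, TTS_MS, OVERHEAD_MS))

print(f"Cascaded pipeline: {best_case}-{worst_case} ms "
      f"vs. target {TARGET_MS[0]}-{TARGET_MS[1]} ms")
# Even the best case lands at the top of the target band, which is the gap
# unified models are claimed to close.
```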
Deepgram said end-to-end models eliminate intermediate text handoffs, per-stage queuing and format conversions, keeping latency stable even under concurrent load.
The choice of transport protocol also has a significant effect, the company said, with WebRTC paths delivering 60 to 150 milliseconds compared with 220 to 400 milliseconds for TCP or WebSocket connections.
Deepgram added that concurrent usage commonly causes 30% to 50% latency degradation relative to single-user testing.
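One way a developer might probe both effects is to time time-to-first-byte over a WebSocket path for a single connection and again under concurrent connections, as in the sketch below. The endpoint URL, payload and concurrency level are placeholders for illustration, not details of any Deepgram interface.

```python
"""Rough probe for transport and concurrency effects: time-to-first-byte over a
WebSocket path, measured for one connection and again under concurrent load.
The URL and payload are placeholders, not a real Deepgram endpoint."""
import asyncio
import time

import websockets  # pip install websockets

URL = "wss://example.invalid/voice"   # hypothetical endpoint
PAYLOAD = b"\x00" * 3200              # placeholder audio chunk (~100 ms of 16 kHz PCM)

async def ttfb_ms() -> float:
    """Open a connection, send one chunk, and time the first reply."""
    async with websockets.connect(URL) as ws:
        start = time.perf_counter()
        await ws.send(PAYLOAD)
        await ws.recv()               # first bytes back
        return (time.perf_counter() - start) * 1000

async def main(concurrency: int = 25) -> None:
    single = await ttfb_ms()
    loaded = await asyncio.gather(*(ttfb_ms() for _ in range(concurrency)))
    print(f"single connection: {single:.0f} ms")
    print(f"median of {concurrency} concurrent: {sorted(loaded)[len(loaded) // 2]:.0f} ms")

if __name__ == "__main__":
    asyncio.run(main())
```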
The company said its Voice Agent API bundles speech-to-text, LLM orchestration and text-to-speech into a single product with per-minute pricing, designed to avoid the complexity of multi-vendor billing.
Deepgram recommended that developers validate 95th percentile time-to-first-byte below 250 milliseconds, load-test at two to three times expected peak traffic, and document quota and failover plans before launching voice applications.
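A minimal sketch of that pre-launch check follows, with placeholder latency samples standing in for measurements gathered during a load test.

```python
"""Sketch of the pre-launch check described above: collect time-to-first-byte
(TTFB) samples while driving 2-3x expected peak traffic, then verify that the
95th percentile stays under 250 ms. The sample values here are placeholders."""
import math

P95_BUDGET_MS = 250
LOAD_MULTIPLIER = 3  # drive 2-3x expected peak traffic during the test

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    k = math.ceil(0.95 * len(ordered)) - 1
    return ordered[k]

ttfb_samples_ms = [180, 195, 210, 205, 190, 230, 245, 215, 200, 240]  # placeholder data

observed = p95(ttfb_samples_ms)
print(f"P95 TTFB at {LOAD_MULTIPLIER}x peak load: {observed:.0f} ms (budget {P95_BUDGET_MS} ms)")
assert observed <= P95_BUDGET_MS, "P95 time-to-first-byte exceeds the 250 ms budget"
```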
The Recap
- End-to-end speech models cut voice agent latency by 50% to 70%.
- Unified stacks deliver responses in 200 to 250 milliseconds.
- Validate P95 time-to-first-byte under 250 milliseconds before launch.