Deepgram, the audio intelligence company, has published guidance warning developers that their choice of streaming protocol for text-to-speech (TTS) can materially affect the conversational latency of real-time voice agents.
The company contrasts three delivery patterns: REST request-response, which requires the client to wait for a complete audio file before playback; HTTP chunked streaming, which enables progressive one-way playback; and WebSocket streaming, which delivers audio as it is generated and allows mid-utterance interruption.
Deepgram says WebSocket connections can save 50 to 100 milliseconds per request compared with REST in multi-turn conversations, a gap that compounds significantly in high-concurrency environments such as contact centres.
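To put the per-request figure in context, the following is a back-of-envelope sketch rather than anything published by Deepgram. It assumes one TTS request per agent turn; the turn count and call volume are hypothetical inputs, and only the 50 to 100 millisecond range comes from the guidance.

```python
# Back-of-envelope arithmetic for the 50-100 ms per-request saving.
# Assumption: one TTS request per agent turn. The turn and call counts
# used below are illustrative, not figures from Deepgram.

def cumulative_savings_ms(turns, calls=1, low_ms=50, high_ms=100):
    """Return (low, high) total milliseconds saved across all requests."""
    requests = turns * calls
    return requests * low_ms, requests * high_ms

# A single 10-turn conversation.
low, high = cumulative_savings_ms(turns=10)
print(f"One 10-turn call: {low}-{high} ms saved")

# The same conversation across 1,000 concurrent calls, in aggregate.
low, high = cumulative_savings_ms(turns=10, calls=1000)
print(f"1,000 such calls: {low / 1000:.0f}-{high / 1000:.0f} s saved in aggregate")
```

On these assumptions a ten-turn call recovers between half a second and a full second of waiting over its length.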
The guidance places particular emphasis on human perception thresholds, citing research suggesting users begin to notice delays at around 300 milliseconds, with 500 milliseconds commonly perceived as unresponsive.
Deepgram says teams targeting a conversational feel should aim for well under 200 milliseconds end-to-end.
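A team could encode these thresholds as a simple check on measured end-to-end latency. The sketch below is illustrative: the 200, 300 and 500 millisecond boundaries come from the figures above, while the function name and band labels are assumptions made for the example.

```python
# Classify a measured end-to-end latency against the thresholds cited above.
# The band labels and function name are hypothetical; only the numeric
# boundaries are taken from the guidance.

TARGET_MS = 200        # conversational target ("well under" this figure)
NOTICEABLE_MS = 300    # roughly where users begin to notice delay
UNRESPONSIVE_MS = 500  # commonly perceived as unresponsive

def classify_latency(latency_ms: float) -> str:
    """Return a coarse label for one end-to-end latency measurement."""
    if latency_ms < TARGET_MS:
        return "within target"
    if latency_ms < NOTICEABLE_MS:
        return "above target"
    if latency_ms < UNRESPONSIVE_MS:
        return "noticeable"
    return "unresponsive"

for sample_ms in (150, 250, 350, 600):
    print(sample_ms, "ms:", classify_latency(sample_ms))
```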
The company also warns that telephony infrastructure can undermine transport-level gains. The public switched telephone network's 8 kHz sampling rate and carrier jitter buffers often absorb the latency advantage of a faster protocol, it says, making interruption control and pacing more important than raw protocol speed in those environments.
Deepgram recommends that developers test both REST and WebSocket delivery against their actual traffic patterns and playback targets rather than relying on general benchmarks.
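One way to structure such a test is to time how long the first audio chunk takes to arrive, separately from the full response. The harness below is a minimal sketch that makes no calls to Deepgram's API: `fake_buffered` and `fake_streamed` are hypothetical stand-ins that simulate a buffered response and a streamed one with `time.sleep`, and a real test would replace them with functions that open actual requests and yield audio bytes.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

def measure_stream(open_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, float]:
    """Return (time_to_first_chunk_ms, total_ms) for one request."""
    start = time.perf_counter()
    first_ms = None
    for _chunk in open_stream():
        if first_ms is None:
            first_ms = (time.perf_counter() - start) * 1000
    total_ms = (time.perf_counter() - start) * 1000
    if first_ms is None:  # the stream produced no audio at all
        first_ms = total_ms
    return first_ms, total_ms

# Hypothetical stand-ins for real requests; the timings are simulated.
def fake_buffered(n_chunks: int = 5, per_chunk_s: float = 0.01) -> Iterator[bytes]:
    """Simulate request-response: all audio arrives once generation ends."""
    time.sleep(n_chunks * per_chunk_s)
    yield b"\x00" * 320 * n_chunks

def fake_streamed(n_chunks: int = 5, per_chunk_s: float = 0.01) -> Iterator[bytes]:
    """Simulate streaming: each chunk arrives as soon as it is generated."""
    for _ in range(n_chunks):
        time.sleep(per_chunk_s)
        yield b"\x00" * 320

if __name__ == "__main__":
    for name, source in (("buffered", fake_buffered), ("streamed", fake_streamed)):
        first_ms, total_ms = measure_stream(source)
        print(f"{name}: first chunk {first_ms:.1f} ms, complete {total_ms:.1f} ms")
```

Time to first chunk is the figure that governs when playback can begin, so it is reported separately from total time. Running the harness against production-like traffic and the real playback target, rather than simulated sources, is what the guidance recommends.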
The company's TTS application programming interface supports all three delivery methods, and Deepgram is offering new users $200 in free credits to run comparative tests via its developer console.
The guidance is aimed primarily at teams building voice agents for customer service, telephony and other latency-sensitive applications.
The recap
- Deepgram publishes guidance on choosing WebSocket or REST for TTS.
- WebSocket can save 50–100ms per request in multi-turn conversations.
- Developers can try both protocols with $200 free credits.