Scale is testing the top AI models against "the messiness of real speech"
The dataset and benchmark measure conversational robustness in native speech-to-speech models.
Scale AI has released Audio MultiChallenge, a benchmark designed to stress-test conversational robustness in some of the top speech-to-speech (S2S) models.
The benchmark is intended to measure progress toward voice agents that can handle the messiness of real speech, according to Scale.
In its announcement, the company said the benchmark isolates four capabilities that distinguish robust voice agents from brittle ones: voice editing, audio-cue gated memory, instruction retention and self-coherence.
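Scale does not spell out its scoring formula in the announcement, but a minimal sketch of how per-axis rubric checks could roll up into an overall score might look like the following. The four axis names come from the announcement; the pass/fail structure, field names and aggregation are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical per-conversation grades: each rubric criterion is tagged with one
# of the four axes Scale names and marked pass (True) or fail (False).
grades = [
    {"axis": "voice editing",          "passed": True},
    {"axis": "audio-cue gated memory", "passed": False},
    {"axis": "instruction retention",  "passed": True},
    {"axis": "self-coherence",         "passed": True},
]

def average_rubric_score(grades: list[dict]) -> float:
    """Mean pass rate across all rubric criteria, as a percentage."""
    return 100 * mean(1.0 if g["passed"] else 0.0 for g in grades)

def per_axis_scores(grades: list[dict]) -> dict[str, float]:
    """Pass rate broken out by capability axis."""
    axes = {g["axis"] for g in grades}
    return {
        axis: 100 * mean(1.0 if g["passed"] else 0.0 for g in grades if g["axis"] == axis)
        for axis in axes
    }

print(average_rubric_score(grades))  # 75.0 for this toy example
```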
Audio MultiChallenge focuses on real conversational patterns such as mid-utterance corrections, barge-ins and paralinguistic cues. Scale reports that models show a relative performance drop of 36.5% when tasks require attending to audio cues alongside semantic retention.
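The 36.5% figure is a relative drop, i.e. the loss measured against the easier condition's score rather than in absolute points. A minimal illustration, with hypothetical scores chosen only to reproduce that number:

```python
def relative_drop(score_without_audio_cue: float, score_with_audio_cue: float) -> float:
    """Relative performance drop, in percent, when audio cues are also required."""
    return 100 * (score_without_audio_cue - score_with_audio_cue) / score_without_audio_cue

# Hypothetical scores; 52.0 and 33.0 are placeholders, not reported results.
print(round(relative_drop(52.0, 33.0), 1))  # 36.5
```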
Testers were instructed to "break" the systems to surface failures, the company noted.
Scale published a leaderboard showing Gemini 3 Pro Preview with an Average Rubric Score of 54.7%, followed by Gemini 2.5 Pro at 46.9%, Gemini 2.5 Flash at 40% and OpenAI’s GPT-4o Audio Preview at 25.4%.
It also noted that cascaded systems using advanced LLM reasoning scored higher on semantic tasks, listing GPT-5 at 51.2% and Claude Opus 4.5 at 39.22%.
Scale described its data collection as a human-in-the-loop adversarial protocol: an automated Planner Agent generated conversation blueprints, human testers acted out the scenarios with a model in the loop, and grading used an LLM-as-a-judge approach with o4-mini, which the company said showed high agreement with human raters.
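Scale has not shared its judge prompts, so the following is only a rough sketch of what an LLM-as-a-judge check with o4-mini could look like. The prompt wording, the YES/NO protocol and the `judge_turn` helper are assumptions, not Scale's pipeline; the call itself uses the standard OpenAI Python client.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_turn(transcript: str, criterion: str) -> bool:
    """Ask an o4-mini judge whether the model's turn satisfies one rubric criterion.

    Illustrative only: the prompt and YES/NO convention are not Scale's actual rubric.
    """
    prompt = (
        "You are grading a voice assistant's final turn in a conversation.\n"
        f"Conversation transcript:\n{transcript}\n\n"
        f"Criterion: {criterion}\n"
        "Answer with exactly YES if the criterion is satisfied, otherwise NO."
    )
    response = client.chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```

In practice such a judge is validated against human raters, which is the agreement Scale reports for its setup.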
The Recap
- Scale releases Audio MultiChallenge to test conversational robustness.
- Gemini 3 Pro Preview leads the leaderboard with an Average Rubric Score of 54.7%.
- The dataset was built with a human-in-the-loop adversarial data collection protocol.