Deepgram sets out real-world framework to benchmark medical speech recognition accuracy

The company argues that clinical safety requires testing speech-to-text systems on live hospital audio, not clean vendor samples, and proposes stricter, use-case-specific accuracy thresholds.

by Defused News Writer
Photo by National Cancer Institute / Unsplash

Deepgram, the audio agent specialist, has published a framework for benchmarking medical speech recognition systems using audio captured from real clinical encounters, warning that accuracy measured on controlled datasets can collapse in live healthcare environments.

The company said research shows Word Error Rate (WER) can more than double when speech-to-text systems move from studio-quality recordings to noisy, multi-speaker clinical settings. Benchmark datasets such as LibriSpeech, it said, lack clinical terminology, realistic background noise, speaker overlap and the diversity of microphones and room acoustics found in hospitals and clinics.
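For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not Deepgram's own tooling; the function names and the lowercase, whitespace-based tokenisation are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference length."""
    # Assumption: lowercase and split on whitespace; real pipelines
    # usually apply a more careful text normalisation first.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

With this definition, a single wrong word in a four-word reference such as "start insulin five units" gives a WER of 0.25, whether the wrong word is a filler or a dose.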

To surface clinical safety risks, Deepgram’s framework recommends five core evaluation metrics. These include a weighted WER that assigns higher penalties to errors involving medications and diagnoses; Keyword Error Rate for high-stakes clinical terms, with single-digit percentage targets for critical drugs; and Character Error Rate to catch fine-grained mistakes such as decimal points or single-character differences that can change dosages or lab values.
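The article does not give Deepgram's exact formulas, so the sketch below shows one plausible reading of the three metrics. The weighting scheme (a fixed multiplier for critical terms), the position-agnostic keyword count and all function names are assumptions chosen for illustration.

```python
from collections import Counter


def weighted_edit_distance(ref, hyp, weight=lambda token: 1.0):
    """Edit distance where each token carries its own error cost."""
    prev = [0.0]
    for h in hyp:
        prev.append(prev[-1] + weight(h))          # leading insertions
    for r in ref:
        curr = [prev[0] + weight(r)]               # leading deletions
        for j, h in enumerate(hyp, start=1):
            sub = 0.0 if r == h else max(weight(r), weight(h))
            curr.append(min(prev[j] + weight(r),       # delete r
                            curr[j - 1] + weight(h),   # insert h
                            prev[j - 1] + sub))        # substitute or match
        prev = curr
    return prev[-1]


def weighted_wer(reference, hypothesis, critical_terms, critical_weight=3.0):
    """WER in which errors on critical terms cost `critical_weight` times more."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    def weight(word):
        return critical_weight if word in critical_terms else 1.0

    total = sum(weight(w) for w in ref)
    return weighted_edit_distance(ref, hyp, weight) / total


def keyword_error_rate(reference, hypothesis, keywords):
    """Share of keyword occurrences in the reference missing from the output.

    Simplification: occurrences are counted, not aligned by position.
    """
    ref_counts = Counter(w for w in reference.lower().split() if w in keywords)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in keywords)
    total = sum(ref_counts.values())
    if total == 0:
        raise ValueError("reference contains none of the keywords")
    missed = sum(max(0, n - hyp_counts[k]) for k, n in ref_counts.items())
    return missed / total


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return weighted_edit_distance(reference, hypothesis) / len(reference)
```

Under these assumptions, transcribing "warfarin" as "heparin" in a four-word order scores 0.5 on the weighted metric against 0.25 on plain WER, and a dropped decimal point ("2.5" heard as "25") is invisible to neither metric but shows most clearly in CER, where it accounts for a third of the characters.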

The framework also calls for scoring medical entity extraction using the F1 metric, which Deepgram said typically falls between 0.8 and 0.9 in clinical settings, and for separate tracking of Institute for Safe Medication Practices (ISMP) high-alert medications. These include anticoagulants, insulin, opioids, neuromuscular blocking agents and chemotherapy drugs, where transcription errors can carry disproportionate patient risk.
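The F1 score is the harmonic mean of precision and recall. A minimal sketch of entity-level scoring follows, assuming entities are represented as (label, text) pairs and matched exactly; the labels, the example entities and the function name are hypothetical and not taken from Deepgram's framework.

```python
def entity_f1(gold, predicted, label=None):
    """Exact-match F1 over (label, text) entity pairs.

    If `label` is given, only entities with that label are scored,
    which allows a category such as medications to be tracked separately.
    """
    gold = {e for e in gold if label is None or e[0] == label}
    predicted = {e for e in predicted if label is None or e[0] == label}
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical entities from a reference transcript and from system output.
gold = [("MEDICATION", "warfarin"), ("DOSE", "5 mg"),
        ("DIAGNOSIS", "atrial fibrillation"), ("MEDICATION", "insulin")]
predicted = [("MEDICATION", "warfarin"), ("DOSE", "5 mg"),
             ("DIAGNOSIS", "atrial fibrillation"), ("MEDICATION", "inulin")]

overall = entity_f1(gold, predicted)                       # all entity types
medications = entity_f1(gold, predicted, "MEDICATION")     # medications only
```

Scoring a single label separately is one simple way to keep a high-alert category from being averaged away by easier entity types.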

On dataset construction, Deepgram said teams should record spontaneous clinician-patient conversations under a Health Insurance Portability and Accountability Act (HIPAA)-compliant protocol, with a minimum of 15 hours of audio. Recordings should span signal-to-noise ratios from 2 to 14 decibels and be stratified by medical specialty, speaker demographics, device type and conversation style.
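Signal-to-noise ratio in decibels is ten times the base-10 logarithm of the ratio of signal power to noise power. The sketch below assumes a speech segment and a noise-only segment have already been separated from each recording; how that separation is done is outside its scope, and the function names are illustrative.

```python
import math


def mean_power(samples):
    """Average power of a sequence of audio samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_samples, noise_samples):
    """Signal-to-noise ratio in decibels from separated segments."""
    noise_power = mean_power(noise_samples)
    if noise_power == 0:
        raise ValueError("noise segment has zero power")
    return 10 * math.log10(mean_power(speech_samples) / noise_power)


def in_target_range(value_db, low=2.0, high=14.0):
    """True if a recording falls in the 2 to 14 dB span cited in the article."""
    return low <= value_db <= high
```

A recording whose speech power is ten times its noise power comes out at 10 dB, inside the recommended span; one at 20 dB would be cleaner than the conditions the framework asks teams to cover.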

Reference transcripts should be produced by certified medical transcriptionists or clinicians and standardised using clinical vocabularies such as SNOMED CT, Logical Observation Identifiers Names and Codes (LOINC) and RxNorm. The company said ambiguous segments should be flagged, inter-annotator agreement measured and all transcripts verified by a clinician to ensure clinical validity.
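The article does not say which agreement statistic Deepgram has in mind. Cohen's kappa is one common choice for two annotators, and the sketch below uses it as an assumption; the segment labels in the example are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same segments."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    # Observed agreement: share of segments given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical per-segment judgements from two transcriptionists.
annotator_1 = ["ok", "ok", "ambiguous", "ok"]
annotator_2 = ["ok", "ambiguous", "ambiguous", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four segments, but because half of that agreement would be expected by chance, kappa is 0.5 rather than 0.75.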

Deepgram also outlined suggested acceptance thresholds by use case. It proposed a conservative target of no more than 5% WER for electronic health record direct input, paired with full physician review; no more than 1.5% WER for final clinical documentation; 5% to 7% WER for medical transcription; up to 10% WER for ambient clinical scribe workflows; and 10% to 15% WER for analytics applications with periodic validation.
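These thresholds lend themselves to a simple acceptance check. The sketch below encodes the figures reported above; treating the upper end of each quoted range as the ceiling is an assumption, as are the dictionary keys and the function name.

```python
# Maximum acceptable WER per use case, as fractions.
# Assumption: where the article quotes a range, the upper bound is the ceiling.
MAX_WER = {
    "ehr_direct_input": 0.05,               # paired with full physician review
    "final_clinical_documentation": 0.015,
    "medical_transcription": 0.07,          # quoted as 5% to 7%
    "ambient_clinical_scribe": 0.10,
    "analytics": 0.15,                      # quoted as 10% to 15%
}


def meets_threshold(use_case: str, measured_wer: float) -> bool:
    """True if the measured WER is within the ceiling for the use case."""
    if use_case not in MAX_WER:
        raise KeyError(f"unknown use case: {use_case}")
    return measured_wer <= MAX_WER[use_case]
```

A system measured at 4% WER would pass for EHR direct input under these figures, while one at 2% would still fall short of the 1.5% bar for final clinical documentation.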

For ongoing governance, the company advised systematic monitoring, automated regression testing, quarterly revalidation of test sets, specialty-specific performance tracking and full audit trails with six-year retention in line with 45 CFR § 164.316(b)(2).
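As a rough illustration of what automated regression testing with specialty-specific tracking could look like, the sketch below compares per-specialty WER against a stored baseline. The tolerance value, the specialty names and the function name are hypothetical; the article does not describe Deepgram's own procedure at this level of detail.

```python
def find_regressions(baseline, current, tolerance=0.005):
    """Return specialties whose WER rose by more than `tolerance`.

    `baseline` and `current` map specialty names to WER as fractions.
    A specialty missing from `current` is also reported, since it can
    no longer be shown to meet its baseline.
    """
    regressions = []
    for specialty, old_wer in sorted(baseline.items()):
        new_wer = current.get(specialty)
        if new_wer is None or new_wer - old_wer > tolerance:
            regressions.append(specialty)
    return regressions


# Hypothetical results from two evaluation runs on the same test set.
baseline = {"cardiology": 0.060, "oncology": 0.080}
current = {"cardiology": 0.061, "oncology": 0.095}
flagged = find_regressions(baseline, current)
```

In this example only oncology is flagged: its WER rose by 1.5 percentage points, while the change in cardiology stays within the tolerance.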

Deepgram said it offers healthcare-ready speech-to-text models and provides $200 in free credits for organisations that want to run benchmarks using the framework.

The Recap

  • Framework for benchmarking automatic speech recognition (ASR) with real clinical audio.
  • Target ≤5% WER for electronic health record (EHR) direct input deployments.
  • Deepgram offers $200 in free credits for benchmarks.