Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Researchers show that decoder-based LLMs can track human judgments of speech recognition quality far more closely than traditional metrics, reaching 92-94% agreement with human judges on the HATS dataset versus 63% for Word Error Rate. The finding suggests generative models offer a practical alternative to semantic embeddings for ASR evaluation.

Modelwire context

The headline number (92-94% human agreement) is striking, but the more consequential finding is what it reveals about Word Error Rate's long-standing blind spots. WER penalizes transcription errors uniformly, treating a missed filler word the same as a missed proper noun. That uniform weighting is why a metric rooted in 1960s edit-distance work has been quietly misleading ASR benchmarks for decades.
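The uniform-penalty point can be made concrete with a minimal sketch of WER as word-level edit distance divided by reference length. The sentences below are invented for illustration and do not come from the HATS dataset, and the function names `edit_distance` and `wer` are our own rather than anything from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a free match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: edit distance over the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example sentences (assumptions, not HATS data).
reference = "um please call doctor nguyen tomorrow"
missed_filler = "please call doctor nguyen tomorrow"  # drops the filler "um"
missed_name = "um please call doctor tomorrow"        # drops the proper noun

print(wer(reference, missed_filler))
print(wer(reference, missed_name))
```

Both hypotheses differ from the reference by a single deleted word, so they receive the same WER, even though only the second loses information a listener would need.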

None of the recent Modelwire coverage connects directly to ASR evaluation methodology. The closest thematic thread is the broader question of whether AI outputs can be reliably judged at all, a tension that surfaced in the WIRED piece on AI-assisted newsroom writing from mid-April, where editorial quality and automated productivity metrics were already pulling in opposite directions. That story was about text generation, not speech recognition, but the underlying problem is the same: automated proxies for quality tend to measure what is easy to count rather than what humans actually care about. This paper is essentially an argument that generative LLMs can close that gap for audio transcription, a job previously attempted with semantic embeddings.

Watch whether ASR benchmark leaderboards (particularly on Common Voice and LibriSpeech) begin reporting LLM-judge scores alongside WER within the next two conference cycles. If major labs adopt this as a secondary metric before the end of 2026, the HATS findings have real traction; if WER remains the sole reported figure, this stays a methods paper.

Coverage we drew on

AI Drafting My Stories? Over My Dead Body · WIRED — AI

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: HATS dataset · Large Language Models · Word Error Rate

Read full story at arXiv cs.CL (arxiv.org)
