Where can I submit my own ASR model to the FFASR leaderboard?

Open the Submit tab on the FFASR Leaderboard space at huggingface.co/spaces/treble-technologies/ffasr and paste your Hugging Face model ID. Evaluation then runs server-side against the held-out dataset. The audio itself is not exposed to submitters, which avoids test-set contamination.

Does FFASR support languages other than English?

The current FFASR tracks focus on far-field acoustic robustness rather than multilingual evaluation. For non-English ASR benchmarking, you'll want to look at FLEURS or Common Voice multilingual splits instead.

Can I run the FFASR evaluation locally on my own model?

Not against the official test set. The held-out audio is never exposed to submitters, so submissions are evaluated server-side after you submit a Hugging Face model ID. WER and RTFx are measured on an NVIDIA L4 GPU under identical conditions across all submissions.

How often is the FFASR leaderboard updated with new models?

New model submissions are evaluated on a rolling basis as they're submitted by the community. There's no fixed cadence, but the FFASR forum on Hugging Face is the easiest way to follow updates and discussion.

What's the difference between FFASR and the existing Open ASR Leaderboard?

The Open ASR Leaderboard focuses largely on clean and semi-clean audio datasets where many models have converged to near-identical scores. FFASR specifically targets far-field acoustic conditions (reverberation and noise across 14 simulated rooms at three SNR tiers) that better reflect smart speaker and meeting room deployment.

FFASR Leaderboard: ASR Benchmarked on Real-World Audio

Clean-audio benchmarks have been lying to you for years. The shiny word error rates you see on LibriSpeech or Common Voice? Those numbers come from close-microphone read speech: single speakers, little background noise, and almost no room acoustics. Real life sounds nothing like that.

The new FFASR Leaderboard, a collaboration between Treble Technologies and Hugging Face, is a direct shot at that mismatch. It benchmarks automatic speech recognition (ASR) systems on the kind of audio you actually encounter in the wild: meeting rooms with HVAC hum, far-field microphones picking up reverberation, and SNR conditions that range from clean to genuinely noisy.

And the early pattern in the rankings is brutal for models that look great on clean benchmarks.

What's the FFASR leaderboard?

FFASR (Far-Field ASR) is the first open, community-driven evaluation suite designed to measure how speech recognition models perform under realistic far-field acoustic conditions rather than clean studio recordings. It measures word error rate (WER) across reverberant, noisy, far-field audio, paired with RTFx (the inverse real-time factor: seconds of audio processed per second of inference) to capture both accuracy and inference speed.

It is the SWE-bench moment for far-field speech recognition. For years, the field optimized for benchmarks that didn't reflect deployment reality. FFASR forces models to prove they work when the microphone is meters away, an HVAC system is humming in the background, and someone coughs mid-sentence.

Why clean ASR benchmarks stopped being useful

LibriSpeech, the dominant ASR benchmark since 2015, uses audiobook recordings: trained narrators, professional mics, near-perfect signal-to-noise ratios. By 2026, every serious model crushes it. Whisper Large v3, NVIDIA's Parakeet and Canary, and a slew of community fine-tunes all sit at very low WER on LibriSpeech test-clean.

So what's the problem? The numbers don't transfer. A model with near-perfect WER on LibriSpeech can degrade dramatically once room acoustics enter the picture. The gap is enormous, and it has real consequences for anyone shipping voice products.

The Treble + Hugging Face team designed FFASR specifically to surface this gap. According to the official announcement, the leaderboard evaluates models across nine acoustic conditions, with four of them feeding the primary ranking score.

How the methodology works

FFASR doesn't reinvent the metric. Word error rate remains the standard, calculated as the sum of substitutions, insertions, and deletions divided by the total number of reference words. What's different is the audio.
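For reference, that WER definition fits in a few lines of Python. This is a minimal sketch, not the leaderboard's own scoring code: it assumes both strings are already normalized and whitespace-tokenized, and the function name `word_error_rate` is ours. It scores a single utterance; aggregate scores are usually computed by summing errors and reference words over the whole set, and we are assuming rather than confirming that FFASR does the same.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words consumed so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") and one substitution ("lights" -> "light")
# against a five-word reference.
score = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
```

Because insertions count as errors, WER can exceed 1.0 when a model hallucinates extra words, which is worth remembering when reading low-SNR scores.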

The held-out evaluation set uses 2,000 anechoic speech samples convolved with room impulse responses from 14 simulated rooms at three SNR tiers, roughly 8 hours of audio per condition. Whisper-style text normalization is applied consistently before WER calculation, and the audio itself is never exposed to submitters to avoid test-set contamination.

Key things to know about the methodology:

Simulated acoustic spaces: Audio is generated with Treble's hybrid simulation engine (wave-based solver at low/mid frequencies, geometrical acoustics higher up), then validated against real lab measurements
14 rooms: A representative range of room geometries instead of a single curated dataset
Three SNR tiers: Each scene includes both a transient noise source (such as coughing) and a continuous source (such as HVAC), at three SNR levels
Lab Measured vs Lab Simulated: The leaderboard reports both, so you can see how closely simulation tracks reality
WER + RTFx: Inference speed is reported alongside accuracy, evaluated on an NVIDIA L4 GPU under identical conditions
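To make the simulation idea concrete, here is a rough sketch of the two signal-processing steps behind any far-field test set of this kind: convolving dry speech with a room impulse response, then adding noise at a target SNR. It is not Treble's pipeline. The function names, the toy four-tap impulse response, and the choice to measure SNR against the reverberant speech are all our assumptions, and random noise stands in for real speech and HVAC recordings.

```python
import numpy as np


def apply_room(dry, rir):
    """Add reverberation by convolving dry speech with a room impulse response."""
    dry = np.asarray(dry, dtype=float)
    rir = np.asarray(rir, dtype=float)
    # Full linear convolution, trimmed so the output matches the input length.
    return np.convolve(dry, rir)[: len(dry)]


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then mix."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Loop the noise if it is shorter than the speech, then trim to length.
    repeats = -(-len(speech) // len(noise))  # ceiling division
    noise = np.tile(noise, repeats)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


rng = np.random.default_rng(0)
dry = rng.standard_normal(16000)       # stand-in for 1 s of anechoic speech at 16 kHz
rir = np.array([1.0, 0.0, 0.5, 0.25])  # toy impulse response, not a real room
noisy = mix_at_snr(apply_room(dry, rir), rng.standard_normal(4000), snr_db=5.0)
```

A real impulse response runs to thousands of taps and carries the room's geometry and materials, which is where the simulation engine earns its keep.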

The three far-field SNR tiers, per the official announcement, cover these ranges:

Far-field tier	SNR range
High SNR	above 14 dB
Mid SNR	8 to 12 dB
Low SNR	below 6 dB

Each model runs inference on the same audio files server-side once submitted. The leaderboard publishes per-condition scores so you can see where each model wins and loses on the gradient from dry to low-SNR far-field.

What the data actually shows

According to the FFASR team, a consistent pattern is emerging across all submitted models: near-field WER on clean dry speech looks comparable to what the same models achieve on established benchmarks, but far-field WER at low SNR is often several times higher.

The takeaway is simple. Models that look nearly identical on clean benchmarks diverge wildly once reverberation and noise enter the picture. The Pareto front of average WER against RTFx, the leaderboard team notes, paints a materially different picture of where the real differences between systems lie than clean-audio rankings ever did.
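If you want to build that Pareto view from your own shortlist, the logic is short. The sketch below is an assumption-laden illustration: the `Entry` type and `pareto_front` function are ours, and the model names and numbers in the example are invented placeholders, not leaderboard results. A model is on the front if no other model is both faster and more accurate.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    wer: float   # average word error rate, lower is better
    rtfx: float  # audio seconds per inference second, higher is better


def pareto_front(entries):
    """Return the entries that no other entry beats on both WER and RTFx."""
    front = []
    best_wer = float("inf")
    # Walk from fastest to slowest. A slower model only earns its place
    # if it is strictly more accurate than everything faster than it.
    for entry in sorted(entries, key=lambda e: (-e.rtfx, e.wer)):
        if entry.wer < best_wer:
            front.append(entry)
            best_wer = entry.wer
    return front


# Invented placeholder values, for illustration only.
candidates = [
    Entry("model-a", wer=0.10, rtfx=50.0),
    Entry("model-b", wer=0.08, rtfx=20.0),
    Entry("model-c", wer=0.12, rtfx=30.0),   # slower and less accurate than model-a
    Entry("model-d", wer=0.20, rtfx=200.0),
]
front_names = [entry.name for entry in pareto_front(candidates)]
```

Exact ties on both axes are collapsed to a single entry, which is a simplification you may want to change if you care about listing every equivalent model.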

Figure: word error rates across FFASR audio categories.

And that's the whole point of the leaderboard.

What the leaderboard exposes

A few patterns worth pulling out from how FFASR is designed.

The near-field vs far-field gap is large and grows with noise. The leaderboard reports both side by side, which lets you distinguish between a model that is genuinely accurate and one that is accurate on clean speech but brittle under harder acoustic conditions. That matters for deciding whether to invest in far-field fine-tuning, speech enhancement preprocessing, or a different architecture altogether.

Accuracy isn't the only axis. RTFx (audio seconds processed per inference second) shows up alongside WER, so a slightly less accurate but much faster model can be the right pick for streaming or on-device deployment.
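RTFx is easy to measure on your own hardware, which is useful because the leaderboard's L4 numbers won't match an embedded target. Here is a minimal sketch, assuming a hypothetical `transcribe` callable and a list of `(audio, duration_seconds)` pairs that you supply; neither name comes from FFASR, and the leaderboard's own timing harness may differ.

```python
import time


def rtfx(audio_seconds: float, inference_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of inference."""
    if inference_seconds <= 0:
        raise ValueError("inference_seconds must be positive")
    return audio_seconds / inference_seconds


def measure_rtfx(transcribe, clips):
    """Time `transcribe` over (audio, duration_seconds) pairs and return RTFx."""
    total_audio = 0.0
    start = time.perf_counter()
    for audio, duration_seconds in clips:
        transcribe(audio)  # hypothetical callable wrapping your model
        total_audio += duration_seconds
    return rtfx(total_audio, time.perf_counter() - start)


# One hour of audio transcribed in 30 seconds of inference.
one_hour_in_thirty_seconds = rtfx(3600.0, 30.0)
```

On a GPU you would also want a warm-up pass and to make sure asynchronous work has finished before stopping the clock, otherwise the figure flatters the model.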

Simulation validation is in the open. The Lab Measured and Lab Simulated columns let you directly check how well Treble's simulated acoustics track real recordings. If you don't trust simulated benchmarks, the data to evaluate that trust is right there.

What this means if you're building voice products

The practical takeaway depends on what you're shipping.

For smart speakers and voice assistants, far-field is the only category that matters. Multi-meter microphone distance with room reverb is your baseline, and the leaderboard's far-field WER numbers should set your expectations honestly.

If you're shipping meeting transcription, the FFASR conditions with longer microphone distances and HVAC-style continuous noise are a closer match to your reality than LibriSpeech ever was. The gap between best-on-clean and best-under-reverb is significant.

For embedded and on-device voice, the RTFx column matters as much as WER. A big model with better accuracy but unworkable latency on your target hardware isn't actually better.

The leaderboard is currently focused on far-field robustness. Multi-talker scenarios, microphone array evaluation, and echo cancellation are on the team's roadmap for future tracks but aren't part of the current evaluation, so plan accordingly if those are your conditions.

How FFASR fits the broader benchmark trend

This leaderboard is part of a wider shift in AI evaluation. SWE-bench did the same thing for coding models, replacing toy programming puzzles with real GitHub issues. ARC-AGI keeps pushing reasoning benchmarks toward problems that resist pattern matching. The pattern is clear: every domain eventually realizes its benchmarks have been gamed and needs a harder, more realistic replacement.

For far-field ASR, that moment is now. LibriSpeech has been solved. Many clean-speech leaderboards have converged to near-identical scores. FFASR resets the difficulty bar to match what production systems with real microphones in real rooms actually face.

The team is open about wanting community contributions. You can submit by pasting a Hugging Face model ID into the Submit tab on the FFASR Leaderboard space, and evaluation runs server-side against the held-out dataset.

Limitations worth acknowledging

No benchmark is perfect, and FFASR has its own gaps.

The current scope is far-field robustness specifically, meaning reverberation and noise, not the full universe of ASR failure modes. Telephony, multi-speaker overlap, and code-switching are not part of the present tracks. Other efforts like CHiME, URGENT, and NOIZEUS cover overlapping ground for those scenarios.

The audio is simulated, not recorded. Treble's engine validates well against measured rooms (the Lab Measured vs Lab Simulated columns make this checkable), but it still relies on simulation as the data source rather than physical recordings at scale.

WER as a metric has its own issues. It treats every error equally, but missing a key entity (a name, a number) is usually worse than missing a filler word. FFASR doesn't replace entity-aware or downstream task evaluations; it complements them.

And benchmark gaming is always a risk. The audio is held out from submitters, which helps prevent overfitting, but the leaderboard team will still need to keep rotating evaluation conditions as the field improves.

The bottom line

FFASR is the far-field ASR benchmark the field actually needed. Clean-audio leaderboards stopped telling us anything useful when every model converged to near-identical scores, and the gap between benchmark numbers and production performance has been growing ever since.

If you ship voice, our advice is simple: stop quoting LibriSpeech WER as if it predicts production behavior. It hasn't for years.

If you're picking an ASR model in 2026 and your deployment involves any kind of microphone distance, start with FFASR results, then validate on your own audio. The leaderboard is a much better starting point than LibriSpeech ever was for real far-field applications.

For researchers and model builders, the message is even clearer. Optimizing for clean-audio WER is no longer impressive. The next generation of ASR models needs to prove itself on far-field, reverberant, noisy audio. That's where the actual frontier is.

Go check the leaderboard. Pick the model that matches your acoustic reality. Stop trusting LibriSpeech scores.
