Benchmarking OpenASR against whisper.cpp: one host, one clip, an honest number
How the bench-suite harness measures OpenASR against whisper.cpp on the same model — leading with a committed 7.9% on a single host, presenting the spread, and treating the result as a regression guard rather than a standing win over open source.
Benchmark claims are cheap, and defensible ones are not. It is easy to publish a speedup; it is hard to publish one you can still stand behind a year later, on a machine that is not the one you tuned on. So instead of a headline number, OpenASR keeps a committed performance “ruler” in the repo — a suite definition, a baseline JSON, and a doc — and the whisper.cpp comparison is just one gated entry inside it. This post is about what that entry actually measures, and what it deliberately does not.
The comparison lives in a single bench-suite entry, whisper-turbo-q8-vs-cpp. It runs OpenASR and whisper.cpp on the same model — whisper-large-v3-turbo at q8_0 — and records both as best-of-N wall clock so background load on the host cannot quietly decide the winner.
Same substrate, not just the same model
Worth stating plainly: this is a same-substrate comparison. OpenASR’s inference backend is a thin Rust layer built on a fork of ggml (third_party/openasr-ggml), the MIT C library that originates from the llama.cpp / whisper.cpp projects. Both sides of this benchmark ride the same computational core, so a faster wall clock here is a tuning-and-integration difference, not a different engine. We gratefully credit Georgi Gerganov and the ggml / llama.cpp / whisper.cpp communities whose work makes this project possible.
The committed number
On a single macos/aarch64 host, the committed baseline records OpenASR at 2370 ms against whisper.cpp at 2573 ms on that turbo q8_0 pack. That is (2573 − 2370) / 2573 ≈ 7.9% faster. The baseline JSON — not this paragraph, and not any rerun — is the authoritative source, and 7.9% is the number to lead with.
$ openasr bench-suite --family whisper
# best-of-3 wall clock; each entry runs in its own subprocess The spread, not the headline
One run is not a result. The authors report the advantage as a ~8–11% range — 7.9% / 9.5% / 10.9% across three differently-configured runs on one host — with the committed baseline sitting at the 7.9% low end. The matched-thread (-t 4) configuration, for example, measured 2790 ms vs 3083 ms ≈ 9.5%. Presenting the low end as the headline is deliberate: if the number ever has to move, it should move toward the figure already in the repo.
Treat it as a measured same-model comparison plus a regression guard — not a standing claim of having finally beaten open source across the board.
paraphrased from perf/PERFORMANCE.md
This is the load-bearing caveat, and it travels with every number above. The comparison is single-model, single-clip, single-host. It is not a general statement about OpenASR versus whisper.cpp, and it is certainly not a claim about open-source runtimes in general.
A guard, not a trophy
The point of committing the entry is not to brag — it is to fail loudly if OpenASR ever regresses. The gate (gate_vs_cpp = true) is configured with cpp_slack = 0.05, meaning it only fires if OpenASR slips to more than 5% slower than whisper.cpp. With OpenASR currently ahead, that is a large amount of headroom; the gate exists to catch a future regression, not to assert a present victory.
Why the slack is one-sided
The vs-cpp gate is a floor, not a target. It does nothing while OpenASR leads, and fails the suite only if a change pushes the same-model wall clock past whisper.cpp by more than the slack. A guard, not a scoreboard.
What makes it comparable
Comparability comes from holding everything else still. Every family in the suite runs the same fixed input: a 6.13 s LibriSpeech test-clean clip (237-134500-0000.wav) with a shared 17-word reference transcript. That second number matters more than it looks. With 17 reference words, a single wrong word is worth about 5.88% of word error rate, so WER on this clip moves in coarse steps — you should not over-read a small WER delta as a quality signal. The clip is a comparability fixture, not a quality benchmark, and OpenASR makes no public WER guarantee from it. The same fixture turns up elsewhere — it is also where we look at what quantization trades on that clip.
Why you can trust the method
A benchmark is only as honest as the code path it exercises. The harness drives the real production call path — transcribe_with_backend → NativeBackend, the same one the transcribe --benchmark path uses — rather than a stripped-down re-implementation tuned to look fast. And each entry runs in a fresh subprocess via --run-single-entry, so a process-level peak-RSS high-water mark is not contaminated by whatever the previous entry allocated. Wall clock and WER are deterministic across that boundary; the isolation is there to keep the memory numbers clean.
None of this turns 7.9% into a marketing line, and that is the point. It is a measured number, on one host and one clip, wired into a gate that fails closed if the runtime ever gives it back. The honest version is less quotable than “faster than whisper.cpp” — and a lot easier to defend.