What quantization actually trades: size, speed, and the long tail of quality

Quantization gets sold as a free lunch: shrink the weights, keep the accuracy, ship a smaller model. The honest version is a trade, and the only way to know what you traded is to measure it. OpenASR’s perf suite does exactly that: on a single macos/aarch64 host, best-of-N, against one 6.13-second, 17-word LibriSpeech clip. That reference is narrow enough that one wrong word moves the word-error rate by about 5.88 percentage points (1 of 17), so the numbers below describe this build on this host, not a guarantee about your audio.

The runtime exposes three quantization profiles across families — fp16, q8_0, and q4_k — and Qwen3-ASR adds a fourth, q3_k. Choosing one is usually framed as a size decision. The suite shows it is really three decisions at once: size, speed, and the long tail of quality, and they do not always move in the same direction.

Speed tracks precision, not compression

Take qwen3-asr-0.6b on that clip. Every profile transcribes it at WER 0.00%, so quality is held constant and only the real-time factor (lower is faster) moves:

q4_k — RTF 0.295
q3_k — RTF 0.340
q8_0 — RTF 0.411
fp16 — RTF 0.585

The intuition that “more compression is always faster” breaks at q3_k: it is a smaller pack than q4_k, yet it decodes slower. At batch size one the decode is bandwidth-bound, and unpacking 3-bit weights costs more arithmetic per byte than 4-bit, so the extra compression buys size, not speed.
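To make the ladder above concrete, here is a minimal sketch that turns each RTF into an implied processing time for the 6.13-second clip and a speedup relative to fp16. It assumes the usual definition of real-time factor (processing time divided by audio duration); the text does not spell out the suite's exact formula, and the helper names are illustrative only.

```python
# Assumption: RTF = processing time / audio duration (the common definition).
CLIP_SECONDS = 6.13  # clip length stated in the text

# RTF figures quoted above for qwen3-asr-0.6b
rtf = {"q4_k": 0.295, "q3_k": 0.340, "q8_0": 0.411, "fp16": 0.585}


def implied_seconds(rtf_value, audio_seconds=CLIP_SECONDS):
    """Processing time implied by an RTF under the assumed definition."""
    return rtf_value * audio_seconds


def speedup_vs(baseline, profile, table=rtf):
    """How many times faster `profile` is than `baseline` (ratio of RTFs)."""
    return table[baseline] / table[profile]


# Print the ladder from fastest to slowest.
for name, value in sorted(rtf.items(), key=lambda kv: kv[1]):
    print(f"{name}: {implied_seconds(value):.2f} s, "
          f"{speedup_vs('fp16', name):.2f}x vs fp16")
```

Under that assumption, q4_k works out to roughly 1.8 seconds of processing for the clip, about twice as fast as fp16, while q3_k sits behind q4_k despite being the smaller pack.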

One clip, one host

These figures are best-of-N on a single macos/aarch64 machine against one short LibriSpeech utterance — the same perf-suite clip and whisper.cpp methodology used elsewhere in the suite. WER granularity on a 17-word reference is coarse — a single substitution is about 5.88% — so read the RTF ladder, which is stable, more confidently than tiny WER deltas.
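The coarseness is easy to see with a small word-level WER function. This is a sketch, not the suite's scorer: it assumes plain whitespace tokenization with no case or punctuation normalization, and the 17-word sentence is made up for illustration (it is not the LibriSpeech clip).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical 17-word reference, used only to show the step size.
REF = ("the quick brown fox jumps over the lazy dog "
       "and then runs back home before the rain")
HYP = REF.replace("rain", "train")  # exactly one substituted word

print(f"{100 * wer(REF, HYP):.2f}%")
# The only WER values a 17-word reference can take for 0, 1, or 2 errors:
print([round(100 * k / 17, 2) for k in range(3)])
```

With 17 reference words, every score is a multiple of 1/17, so there is nothing between 0.00% and about 5.88%.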

Compression that’s almost free

Sometimes quantization really is mostly a size lever. The perf suite’s comment for cohere-transcribe records its q4_k mode shrinking the pack by roughly 39% at WER 0.00% — a round figure the suite notes to justify cutting cold-load I/O. Worth stating plainly: the word-error rate stayed at zero, but the transcript text is not byte-for-byte identical to q8_0. So this is “−39% at WER 0.00%”, not “the same bytes for free.”

A size win is not always a speed win

parakeet-ctc-0.6b goes the other way. Its q4_k pack is about 55% smaller than fp16, which looks like an unambiguous win — until you time it. On this small, compute-bound conformer, q4_k decodes measurably slower than q8_0, because the 4-bit k-quant unpack is heavier than q8’s; the depthwise convolutions even stay in f16 to protect quality. So q4_k here is deliberately a size choice, not a speed choice.

The long tail is the model, not the quantizer

The case that matters most for honesty is moonshine-tiny. Its q4_k pack lands at WER 11.76% on the same reference (2 of the 17 words wrong). That is easy to blame on aggressive quantization, and it would be wrong: the transcript is byte-for-byte identical to Hugging Face Transformers’ own Moonshine reference, with the two errors being model quirks (“greed/read” and “angryer/angrier”). This is the genuine quality ceiling of a 27M-parameter model, and quantization did not introduce it. Conflating the two is exactly the mistake the suite is built to prevent.

The OpenASR perf suite puts it this way: “WER is not universally 0%. Most levels hold; a few small models are genuinely lossy, and that loss is the model, not the quantizer.”

What we don’t claim

All of this is measured on internal smoke lanes. OpenASR makes no public quality or WER guarantee, and there is no multilingual or accent-quality gate behind these numbers — those are deferred, not proven. The suite’s job is narrower and more useful: to keep size, speed, and quality as three separate, measured axes, so that a smaller pack ships because someone looked at what it traded, not because compression was assumed to be free.

Time a quantized pack on your own host:

$ openasr transcribe clip.wav --benchmark --model-pack ./qwen3-asr-0.6b.q4_k.oasr --format text
# reports elapsed time and real-time factor for one pack
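The command above times a single run. If you want something closer to the best-of-N approach described earlier, a small wrapper can do it. The sketch below assumes that "best-of-N" means taking the fastest of N wall-clock runs and that RTF is elapsed time over audio duration; `best_of_n` and `real_time_factor` are hypothetical helper names, not part of OpenASR.

```python
import time


def best_of_n(run, n=5):
    """Call `run` n times and return the fastest wall-clock time in seconds."""
    if n < 1:
        raise ValueError("n must be at least 1")
    best = float("inf")
    for _ in range(n):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


def real_time_factor(elapsed_seconds, audio_seconds):
    """RTF under the assumed definition: lower is faster."""
    return elapsed_seconds / audio_seconds


# Possible usage, reusing only the flags shown in the command above:
#
#   import subprocess
#   cmd = ["openasr", "transcribe", "clip.wav",
#          "--model-pack", "./qwen3-asr-0.6b.q4_k.oasr", "--format", "text"]
#   elapsed = best_of_n(
#       lambda: subprocess.run(cmd, check=True, capture_output=True), n=5)
#   print(real_time_factor(elapsed, audio_seconds=6.13))
```

Timing the whole process this way includes startup and model load, so the result will not match the figure that `--benchmark` reports; treat it as a rough cross-check between packs on the same host.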
