Batched serve-mode decode: GPU-only throughput that leaves single-stream byte-identical

During decode, a single ASR stream barely touches the GPU. Each token is one memory-bandwidth-bound batch=1 GEMV: the kernel reads the entire weight matrix to produce one column of output, then does it again for the next token. The arithmetic is trivial; the cost is moving the weights. So while one request decodes, the device sits mostly idle, waiting on bandwidth rather than compute. Speeding up that single stream itself is a separate line of work — see moving the KV cache onto the GPU.

The fix is the classic one: run several concurrent streams as a single batched GEMM per step instead of N separate GEMVs. One weight read now amortizes across N sequences, which is exactly the work the idle bandwidth was leaving on the table. In OpenASR this lands as one n_seq parameter threaded through the shared decode seam in nn/decoder.rs, so qwen, cohere, moonshine, and whisper all inherit batching from the same change rather than a per-model fork. The batched GEMM and the n_seq-widened ops themselves execute through OpenASR’s vendored ggml fork (third_party/openasr-ggml) — the MIT C library from llama.cpp / whisper.cpp — so they run on the same substrate that drives single-stream decode. The whole thing is opt-in and default-off — unset, you get today’s N=1 path unchanged.
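The amortization argument can be made concrete with a toy sketch. This is not OpenASR's decoder or ggml's kernel: `Weights`, `gemv`, `gemm_batched`, and `weight_reads` are hypothetical names invented here, and the read count is a simple model that ignores caches. It shows only that one batched pass yields the same outputs as N separate GEMVs while touching the weights once instead of N times.

```rust
/// Row-major weight matrix (hypothetical toy type, not an OpenASR struct).
struct Weights {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

/// One GEMV: y = W x. Every weight is read once to produce one output vector.
fn gemv(w: &Weights, x: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), w.cols);
    (0..w.rows)
        .map(|r| {
            let row = &w.data[r * w.cols..(r + 1) * w.cols];
            row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>()
        })
        .collect()
}

/// One batched GEMM over n_seq inputs: each weight row is loaded once and
/// reused for every sequence in the batch.
fn gemm_batched(w: &Weights, xs: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let mut out = vec![vec![0.0f32; w.rows]; xs.len()];
    for r in 0..w.rows {
        let row = &w.data[r * w.cols..(r + 1) * w.cols];
        for (s, x) in xs.iter().enumerate() {
            assert_eq!(x.len(), w.cols);
            out[s][r] = row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>();
        }
    }
    out
}

/// Weight elements read per decode step under this simple model.
fn weight_reads(w: &Weights, n_seq: usize, batched: bool) -> usize {
    let one_pass = w.rows * w.cols;
    if batched { one_pass } else { one_pass * n_seq }
}

fn main() {
    let w = Weights { rows: 2, cols: 3, data: vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0] };
    let xs = vec![vec![1.0, 0.0, 0.0], vec![0.0, 1.0, 1.0]];

    // N separate GEMVs versus one batched pass: same results.
    let serial: Vec<Vec<f32>> = xs.iter().map(|x| gemv(&w, x)).collect();
    let batched = gemm_batched(&w, &xs);
    assert_eq!(serial, batched);

    println!(
        "weight reads: serial = {}, batched = {}",
        weight_reads(&w, xs.len(), false),
        weight_reads(&w, xs.len(), true)
    );
}
```

In this model the serial path's weight traffic grows linearly with the number of streams, while the batched path's stays flat. That gap is the headroom batching targets.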

What the curve actually says

The throughput-vs-batch-width benchmark is a manual diagnostic (#[ignore]), not a CI gate, and it was run once. On a single AMD RX 9060 XT with the moonshine-tiny q8_0 pack, HIP scaled aggregate decode throughput to 4.28x at N=8 (370 to 1588 tok/s), near-monotonic across the ladder. That is a batching multiplier measured against N=1 on the same GPU — not a comparison against any other engine, and not an absolute win on a production-size model.

The backend shape matters. Vulkan did not scale the same way: it peaked at 2.85x at N=4 and then plateaued, reaching 2.95x at N=8. Read that as a plateau value, not near-linear scaling — the extra slots past N=4 bought almost nothing on that backend.

On CPU, batching was actively counterproductive — 0.10 to 0.21x, i.e. slower than running the streams one at a time. That is the useful negative result: it confirms the win is a GPU-bandwidth phenomenon, and it’s why the engine gates strictly to the is_gpu_class() lane and fails closed on CPU.
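One way to read these multipliers is as a fraction of ideal linear scaling, that is, speedup divided by batch width. The sketch below uses only the HIP and Vulkan figures quoted above; `scaling_efficiency` is a name introduced here for illustration and is not part of the benchmark harness.

```rust
/// Fraction of ideal linear scaling: speedup over N=1 divided by batch width.
fn scaling_efficiency(speedup: f64, n_seq: u32) -> f64 {
    speedup / f64::from(n_seq)
}

fn main() {
    // (backend, batch width, measured multiplier vs N=1 on the same backend)
    let points = [("hip", 8, 4.28), ("vulkan", 4, 2.85), ("vulkan", 8, 2.95)];
    for (backend, n, speedup) in points {
        let pct = 100.0 * scaling_efficiency(speedup, n);
        println!("{backend} N={n}: {speedup:.2}x, {pct:.1}% of linear");
    }
}
```

By this measure HIP at N=8 delivers a little over half of linear scaling, while Vulkan drops from roughly 71% at N=4 to roughly 37% at N=8, which is the plateau expressed as a number.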

"4.28x is a batching multiplier on one GPU versus N=1 on that same GPU: not a cross-engine speedup, and not yet an absolute win on a production-size model." (from the #35 serve-mode batched-decode design)

Default-off, fail-closed

The engine is enabled by OPENASR_SERVE_BATCH, accepting a value in the 2..8 range with a default MAX_BATCH of 8. Leave it unset and serve behaves exactly as it does today: a single in-flight request runs the degenerate N=1 step and is never padded up to the batch width. On a CPU or scheduler-backed path the batched engine declines and falls through to the existing single-request decode.

Enable batched serve-mode decode:

```bash
$ OPENASR_SERVE_BATCH=8 OPENASR_GGML_BACKEND=hip openasr serve --model moonshine-tiny
# unset OPENASR_SERVE_BATCH keeps today's N=1 path
```

Because the lever is opt-in and collapses to the current shapes at N=1, turning it off is the safe default and turning it on is a deliberate, GPU-only choice.
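The gating logic can be sketched as a single pure function. This is a hypothetical illustration, not OpenASR's code: `resolve_batch_width` and `MIN_BATCH` are names invented here, the 2..8 range is assumed to include both ends (the example above passes 8), and sending unparsable or out-of-range values back to N=1 is an assumption, since the text does not say how the engine treats them.

```rust
/// Hypothetical resolver: maps the raw env value and the backend class to a
/// batch width. Returns 1 (the single-stream path) whenever batching must
/// not engage.
fn resolve_batch_width(raw: Option<&str>, is_gpu_class: bool) -> usize {
    const MIN_BATCH: usize = 2; // assumed inclusive lower bound
    const MAX_BATCH: usize = 8; // upper bound named in the text

    // Fail closed: a non-GPU backend never gets the batched engine.
    if !is_gpu_class {
        return 1;
    }
    match raw.and_then(|s| s.trim().parse::<usize>().ok()) {
        Some(n) if (MIN_BATCH..=MAX_BATCH).contains(&n) => n,
        // Unset, unparsable, or out of range: assumed to fall back to N=1.
        _ => 1,
    }
}

fn main() {
    let raw = std::env::var("OPENASR_SERVE_BATCH").ok();
    // `true` stands in for whatever is_gpu_class() reports at runtime.
    let width = resolve_batch_width(raw.as_deref(), true);
    println!("batch width = {width}");
}
```

Keeping the decision in one function with a single fallback value makes the fail-closed property easy to test: every path that is not an explicit, valid, GPU-backed opt-in returns 1.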

N=1 stays byte-identical

The reason this can ship default-off without risk is that the single-request path is provably untouched. At N=1 every widened tensor collapses to today’s exact shape, so the op sequence and kernel selection are identical to the existing run_step_reused driver — the output is byte-identical, not merely close.

That invariant is enforced, not asserted. The real-pack parity workflow (serve-batch-parity.yml) runs 6 tests against the public moonshine-tiny pack on a CPU backend and gates merges into main — and those tests were previously all #[ignore], run only by hand. There are two distinct invariants behind it: N=1 batched output equals the single-stream run, and N genuinely different sequences batched together produce the same tokens as N serial single-stream runs. The second is the one single-stream parity can’t catch — a ragged-mask bug yields plausible-but-wrong text, so it gets its own check.
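The shape of those two checks can be sketched with a toy decoder. Nothing below is taken from the real parity tests: `next_token`, `decode_single`, and `decode_batched` are hypothetical stand-ins, and the toy step has no cross-slot interaction, so it cannot reproduce a real ragged-mask bug. It only shows how the assertions are structured when sequences finish at different lengths.

```rust
const EOS: u32 = 0;
const VOCAB: u32 = 17;

/// Toy stand-in for one decode step of a real model: a deterministic next token.
fn next_token(last: u32, pos: usize) -> u32 {
    (last * 5 + pos as u32 + 3) % VOCAB
}

/// Single-stream reference: decode until EOS or max_len tokens.
fn decode_single(start: u32, max_len: usize) -> Vec<u32> {
    let mut out = vec![start];
    while out.len() < max_len {
        let t = next_token(*out.last().unwrap(), out.len());
        out.push(t);
        if t == EOS {
            break;
        }
    }
    out
}

/// Batched decode: each outer iteration is one step over all still-active slots.
fn decode_batched(starts: &[u32], max_len: usize) -> Vec<Vec<u32>> {
    let mut outs: Vec<Vec<u32>> = starts.iter().map(|&s| vec![s]).collect();
    let mut active = vec![true; starts.len()];
    while active.iter().any(|&a| a) {
        for (slot, out) in outs.iter_mut().enumerate() {
            if !active[slot] {
                continue; // ragged mask: finished slots stay frozen
            }
            if out.len() >= max_len {
                active[slot] = false;
                continue;
            }
            let t = next_token(*out.last().unwrap(), out.len());
            out.push(t);
            if t == EOS {
                active[slot] = false;
            }
        }
    }
    outs
}

fn main() {
    let starts = [1, 2, 7, 11];

    // Invariant 1: a batch of one equals the single-stream run.
    assert_eq!(decode_batched(&starts[..1], 16)[0], decode_single(starts[0], 16));

    // Invariant 2: N different sequences batched equal N serial runs,
    // even though they stop at different lengths.
    let serial: Vec<Vec<u32>> = starts.iter().map(|&s| decode_single(s, 16)).collect();
    assert_eq!(decode_batched(&starts, 16), serial);
    println!("parity holds for {} sequences", starts.len());
}
```

The second assertion is the one that matters for raggedness: it compares full token sequences slot by slot, so a slot that kept decoding after EOS, or one that was frozen too early, would fail the comparison rather than slip through as plausible text.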

What we have not measured

The 4.28x curve is relative tokens/s on moonshine-tiny only. An earlier figure of 5–7x at N=8 was an estimate against an RTX 3060 baseline, not a result, and the measured wins came in below it. The absolute, end-to-end serve-mode win on a production-size model (qwen or whisper) is explicitly not yet measured; for a model this tiny, GPU single-stream is launch-bound, so the absolute win needs a larger model on a GPU box. The harness is family-portable, and that measurement is the next step.

The honest summary

Batched serve-mode decode is a real, default-off throughput lever for the case it targets: concurrent same-model requests on a GPU. The measured multiplier is 4.28x at N=8 on one GPU with one tiny model, the backend shape varies (Vulkan plateaus near 2.9x), and CPU is excluded by design. The single-stream path it sits beside is byte-identical and CI-gated. What’s left is the part that needs hardware we haven’t pointed at it yet: the absolute win on a production-size model. We’d rather state that limit than round it away.
