Moving the KV cache onto the GPU: turning a transfer-bound decoder compute-bound

Every number in this post comes from a single host: one AMD RX 9060 XT (RDNA4, gfx1200) running the qwen3-asr-0.6b q8 pack. None of it generalizes to “AMD” or to OpenASR as a whole — it is one GPU, one model, one set of recorded measurements, and the author flags all of them as empirical figures still pending re-verification against a release benchmark. With that scope fixed, the story is a clean one: a decoder that looked compute-heavy turned out to be transfer-heavy, and the fix was to stop shipping the KV cache back and forth across the bus.

The honest version of an optimization story starts with the measurement that proved the first guess wrong. Partway through this work the encoder looked like the culprit — an incomplete profile pinned roughly 86% of the time on it. Adding the missing instrumentation reversed that completely.

Measure first, then cut

On a 61-second clip the full decode profile (HIP, with graph reuse already in place) broke down like this:

decode total: 95.4s (94.2 ms/token)
real GPU compute: 13.8s (13.6 ms/token)
KV/IO — the host-to-GPU round-trip: 81.6s (80.6 ms/token)
the audio encoder, for contrast: 4.0s

So the host-to-GPU KV round-trip cost roughly 6x as much as the actual GPU compute, and the encoder, the thing that briefly looked like the bottleneck, was only 4.0s against 95.4s of decode. The decoder wasn’t slow because the math was slow. It was slow because on every step it re-uploaded the full key/value prefix to the GPU and read the projected K/V back out again.
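The breakdown can be checked with a few lines of arithmetic. The sketch below is illustrative Python, not part of OpenASR; the function and variable names are assumptions, and the token count is only implied by dividing a total by its per-token figure.

```python
# Figures copied from the 61-second clip profile: (total seconds, ms/token).
PROFILE = {
    "gpu_compute": (13.8, 13.6),
    "kv_io": (81.6, 80.6),
}
DECODE_TOTAL_S = 95.4


def share_of_decode(component_s: float, total_s: float = DECODE_TOTAL_S) -> float:
    """Fraction of decode wall time spent in one component."""
    return component_s / total_s


def implied_tokens(total_s: float, ms_per_token: float) -> float:
    """Token count implied by a total and its per-token cost."""
    return total_s * 1000.0 / ms_per_token


if __name__ == "__main__":
    compute_s, _ = PROFILE["gpu_compute"]
    kv_s, kv_ms = PROFILE["kv_io"]
    print(f"KV/IO vs compute: {kv_s / compute_s:.1f}x")
    print(f"KV/IO share of decode: {share_of_decode(kv_s):.1%}")
    print(f"implied tokens: {implied_tokens(kv_s, kv_ms):.0f}")
```

The two components sum to the decode total (13.8 + 81.6 = 95.4), which is a quick sanity check that nothing in the decode is left unattributed.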

The encoder was a red herring

An earlier, incomplete profile put about 86% of the time on the audio encoder. Re-running with the KV/IO accounted for separately showed the encoder was only 4.0s of a 95.4s decode. The lesson is the boring one: instrument and re-measure before you refactor the wrong thing.
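One way to keep transfer time from hiding inside another bucket is to give every phase its own named timer. The following is a minimal Python sketch of that idea, not the profiler used for these measurements; `PhaseTimer` and the phase names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall time per named phase so no cost is left unattributed."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timer can be tested
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Charge the elapsed time to this phase even if the body raised.
            self.totals[name] += self._clock() - start

    def shares(self):
        """Fraction of the total recorded time spent in each phase."""
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: t / total for name, t in self.totals.items()}


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.phase("encoder"):
        time.sleep(0.01)  # stand-in for encoder work
    with timer.phase("kv_io"):
        time.sleep(0.03)  # stand-in for host/device transfers
    with timer.phase("gpu_compute"):
        time.sleep(0.01)  # stand-in for kernel time
    for name, frac in timer.shares().items():
        print(f"{name}: {frac:.0%}")
```

GPU work is usually asynchronous, so on a real backend each phase would need a device synchronization before the clock is read. The sketch ignores that.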

Making the KV cache device-resident

The change (the step labelled S5 in the design notes) keeps the KV cache resident in a GPU arena and accumulates new tokens in place, rather than rebuilding the history on the host and re-uploading it each step. The host-side upload of the prefix and the device-to-host readback of projected K/V both go away.
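To see why the round-trip dominates, it helps to count bytes. The sketch below is a toy Python model, not ggml or OpenASR code: `bytes_per_token`, the function name, and the `KvArena` class are all assumptions made for illustration. It compares re-uploading the whole prefix on each step with appending into a preallocated buffer that never leaves the device.

```python
def round_trip_bytes(num_tokens: int, bytes_per_token: int) -> int:
    """Bus traffic when each step re-uploads the prefix and reads new K/V back."""
    total = 0
    for step in range(num_tokens):
        total += step * bytes_per_token  # upload the existing prefix
        total += bytes_per_token         # read back the newly projected K/V
    return total


class KvArena:
    """Toy stand-in for a preallocated device buffer that grows in place."""

    def __init__(self, capacity_tokens: int, bytes_per_token: int):
        self.bytes_per_token = bytes_per_token
        self.buf = bytearray(capacity_tokens * bytes_per_token)
        self.len_tokens = 0
        self.bus_bytes = 0  # K/V bytes that crossed the host/device bus

    def append(self, kv: bytes) -> None:
        """Write one token's K/V at the tail; nothing is copied to the host."""
        if len(kv) != self.bytes_per_token:
            raise ValueError("unexpected K/V size")
        start = self.len_tokens * self.bytes_per_token
        if start + len(kv) > len(self.buf):
            raise MemoryError("arena full")
        self.buf[start:start + len(kv)] = kv
        self.len_tokens += 1


if __name__ == "__main__":
    n, bpt = 1000, 4096  # hypothetical sizes, not measured values
    arena = KvArena(n, bpt)
    for _ in range(n):
        arena.append(bytes(bpt))
    print("round-trip bytes:", round_trip_bytes(n, bpt))
    print("resident bus bytes:", arena.bus_bytes)
```

In this model the round-trip traffic grows quadratically with sequence length, while the resident arena moves no K/V over the bus at all. The recorded profile still shows a small residual KV/IO cost per token after the change, which this toy does not model.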

The effect on that per-token component was direct: KV/IO dropped from 80.6 to 1.1 ms/token. That is a change in one per-token component, not an end-to-end speedup; the two are different units and don’t belong on the same arrow. Put another way, nearly all of the ~81.6s of KV/IO on the 61s clip is what got eliminated, and compute now dominates the decode.
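Converting the per-token figure back into seconds makes the size of the change concrete. This is back-of-envelope Python under one assumption: that the token count for the 61s clip is the one implied by dividing 81.6 s by 80.6 ms/token. The result is a derived estimate, not a recorded measurement.

```python
KV_IO_BEFORE_S = 81.6       # recorded total on the 61s clip
KV_IO_BEFORE_MS_TOK = 80.6  # recorded per-token cost before the change
KV_IO_AFTER_MS_TOK = 1.1    # recorded per-token cost after the change


def estimated_kv_io_after_s() -> float:
    """Scale the 'after' per-token cost by the token count implied 'before'."""
    tokens = KV_IO_BEFORE_S * 1000.0 / KV_IO_BEFORE_MS_TOK
    return tokens * KV_IO_AFTER_MS_TOK / 1000.0


if __name__ == "__main__":
    after_s = estimated_kv_io_after_s()
    print(f"estimated KV/IO after: {after_s:.2f} s")
    print(f"per-token reduction: {KV_IO_BEFORE_MS_TOK / KV_IO_AFTER_MS_TOK:.0f}x")
```

Under that assumption only about a second of KV/IO remains on this clip, against 13.8 s of GPU compute, which is why compute becomes the dominant term.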

The decoder wasn’t compute-slow. It was transfer-slow — and you only see that once the KV round-trip is on its own line in the profile.
from the x86 GPU architecture notes

What it did end to end

Across all the steps in the sequence (S0 through S5, cumulative — not the KV step alone), the cumulative HIP result on this GPU was:

jfk clip: RTF 2.50 → 0.717 (a cumulative 3.5x, now faster than realtime)
61s clip: RTF 2.56 → 0.356 (a cumulative 7.2x, also faster than realtime)

The 7.2x is the cumulative figure for the whole sequence, so it would be wrong to attribute it to the KV change by itself. The device-resident KV step on its own moved the 61s clip from RTF 1.64 to 0.356, a roughly 4.6x step that accounts for most of the cumulative 7.2x. Both numbers are scoped to the RX 9060 XT and qwen3-asr-0.6b q8; treat them as recorded measurements, not a guarantee.
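The relationship between these figures is plain arithmetic on real-time factor (RTF: processing time divided by audio duration, so anything below 1.0 is faster than realtime). A short Python sketch with hypothetical helper names follows; splitting the 61s result into "earlier steps" and "KV step" assumes 1.64 is the RTF immediately before the KV step, as the text states.

```python
def speedup(rtf_before: float, rtf_after: float) -> float:
    """How many times faster the 'after' run is than the 'before' run."""
    return rtf_before / rtf_after


def processing_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall time implied by an RTF for a clip of the given length."""
    return rtf * audio_seconds


if __name__ == "__main__":
    print(f"jfk cumulative: {speedup(2.50, 0.717):.1f}x")
    print(f"61s cumulative: {speedup(2.56, 0.356):.1f}x")
    print(f"61s, earlier steps: {speedup(2.56, 1.64):.2f}x")
    print(f"61s, KV step alone: {speedup(1.64, 0.356):.1f}x")
    print(f"61s wall time at RTF 0.356: {processing_seconds(0.356, 61):.1f} s")
```

Step speedups multiply rather than add: about 1.56x from the earlier steps times about 4.6x from the KV step gives the cumulative 7.2x.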

One code path, two GPU backends

The same source, with no backend-specific code, also runs under Vulkan. Built with --features vulkan, qwen3-asr produced fully correct output on Vulkan (RADV, gfx1200) at jfk RTF 0.932 — also faster than realtime. On this host HIP (jfk 0.717) came out ahead of Vulkan (0.932), which matches the expectation that HIP is the more mature path on AMD. There is no Vulkan 61s figure to report.

That backend-portability is the point of the exercise as much as the speedup is: device-resident KV and the GPU-class gating are backend-agnostic, so one execution path covers HIP and Vulkan (and Metal on Apple) instead of forking into a separate implementation per backend.
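The shape of that design can be sketched in a few lines. This is a hypothetical Python illustration, not OpenASR's or ggml's actual interface: the `Backend` protocol, its two methods, and the `FakeBackend` class are invented here to show a decode loop written once against a small backend surface.

```python
from typing import List, Protocol


class Backend(Protocol):
    """Hypothetical minimal surface a GPU backend would need to expose."""

    name: str

    def append_kv(self, kv: bytes) -> None: ...
    def attend(self) -> int: ...


class FakeBackend:
    """Stand-in backend; a real one would hold device memory, not a list."""

    def __init__(self, name: str):
        self.name = name
        self._cache: List[bytes] = []  # plays the role of the device arena

    def append_kv(self, kv: bytes) -> None:
        self._cache.append(kv)  # grows in place, never sent back to the host

    def attend(self) -> int:
        return len(self._cache)  # toy "attention": number of cached tokens


def decode(backend: Backend, steps: int) -> List[int]:
    """One decode loop shared by every backend; no per-backend branches."""
    seen = []
    for step in range(steps):
        backend.append_kv(step.to_bytes(4, "little"))
        seen.append(backend.attend())
    return seen


if __name__ == "__main__":
    for name in ("hip", "vulkan", "metal"):
        print(name, decode(FakeBackend(name), 3))
```

The point of the sketch is only that `decode` never inspects which backend it was given, so adding a backend means implementing the small surface rather than forking the loop.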

Credit where it is due: those HIP, Vulkan, and Metal backends — and the KV and flash-attention kernels underneath them — are ggml’s, the MIT C library from llama.cpp and whisper.cpp. The device-residency layer described here is built on top of that substrate, not in place of it.

Check which GPU backend and quant types are detected:

$ openasr doctor
# reports the detected GPU device and supported quantized matmul types

To keep the caveats attached: this is one AMD RX 9060 XT (RDNA4, gfx1200) and one qwen3-asr-0.6b q8 pack, the Vulkan figure covers the jfk clip only, and every figure here is a recorded empirical measurement pending release-benchmark re-verification. Within that scope the conclusion is sturdy: a decoder that read as compute-bound was transfer-bound, and moving the KV cache onto the device is what flipped it. For the paired GPU-decode angle (scaling throughput by batching many concurrent decodes on one device), see the batched serve-mode decode deep dive.
