Deep dive

Who-said-what, the dependency-light way: a pure-Rust late-fusion diarization substrate

OpenASR's diarization runs as WhisperX-style sibling stages around an untouched ASR core, with the project's first pure-Rust forward passes. Each one is validated bit-close to its ONNX reference — a numerical match, not a diarization-accuracy claim.

Speaker diarization answers a deceptively simple question — who said what — and the obvious way to add it is to reach inside each ASR model and teach it about speakers. OpenASR took the opposite route. Diarization here is a set of sibling stages bolted around an ASR core that never learns they exist, joined back to the transcript by nothing more than overlapping time intervals.

Late fusion: siblings around an untouched core

The architecture is WhisperX-style late fusion. The ASR executor stays exactly as it is; voice-activity detection and diarization run as pre- and post-stages that talk to the rest of the engine through one narrow vocabulary: time ranges. Attribution, the actual who-said-what step, is a pure interval max-overlap function with zero model coupling. For each word it accumulates the intersection with every overlapping speaker turn and assigns the speaker with the largest total overlap; a word with no overlapping turn is either left unassigned or snapped to the nearest turn, depending on policy.

Because that join is arithmetic on intervals rather than anything learned, the same assign_speakers code attributes speakers for whisper, qwen, cohere, parakeet, moonshine, and the wav2vec2 CTC path — with no per-family fork.
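To make the join concrete, here is a minimal sketch of interval max-overlap attribution. Only the assign_speakers name comes from the text above; the Word, Turn, and NoOverlap types, the field names, and the signature are assumptions made for illustration, not the project's actual API.

```rust
use std::collections::BTreeMap;

/// A timed word from the transcript (hypothetical shape, times in seconds).
#[derive(Debug, Clone, PartialEq)]
struct Word {
    start: f32,
    end: f32,
    speaker: Option<usize>,
}

/// One diarization turn: a single speaker active over [start, end).
#[derive(Debug, Clone, Copy)]
struct Turn {
    start: f32,
    end: f32,
    speaker: usize,
}

/// What to do with a word that overlaps no turn (hypothetical policy enum).
#[derive(Debug, Clone, Copy)]
enum NoOverlap {
    LeaveUnassigned,
    SnapToNearest,
}

/// Gap between a word and a turn; zero if they touch or overlap.
fn gap(start: f32, end: f32, turn: &Turn) -> f32 {
    (turn.start - end).max(start - turn.end).max(0.0)
}

/// Writes only the `speaker` field; text and timestamps are never touched.
fn assign_speakers(words: &mut [Word], turns: &[Turn], policy: NoOverlap) {
    for word in words.iter_mut() {
        let (ws, we) = (word.start, word.end);

        // Accumulate the total intersection per speaker across all turns.
        let mut totals: BTreeMap<usize, f32> = BTreeMap::new();
        for turn in turns {
            let overlap = we.min(turn.end) - ws.max(turn.start);
            if overlap > 0.0 {
                *totals.entry(turn.speaker).or_insert(0.0) += overlap;
            }
        }

        // The speaker with the largest accumulated overlap wins.
        let mut best = totals
            .iter()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(speaker, _)| *speaker);

        // No overlapping turn: optionally fall back to the nearest turn.
        if best.is_none() {
            if let NoOverlap::SnapToNearest = policy {
                best = turns
                    .iter()
                    .min_by(|a, b| gap(ws, we, a).total_cmp(&gap(ws, we, b)))
                    .map(|turn| turn.speaker);
            }
        }
        word.speaker = best;
    }
}
```

Nothing in this function knows which model produced the words, which is why one implementation can serve every family.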

Why late fusion

Keeping the ASR core untouched is the whole point: every accuracy and parity guarantee the engine already has survives intact, and diarization is a thing you bolt on rather than a rewrite of each model.

License-clean by default

The defaults are chosen so they can ship inside an Apache-2.0 repo without dragging in a viral or non-commercial obligation. VAD is Silero v6 (MIT). Speaker-change segmentation is pyannote segmentation-3.0 — MIT, but only via the un-gated onnx-community mirror; the official repo is gated. The speaker embedder is WeSpeaker’s ResNet34 (VoxCeleb) — CC-BY-4.0, permissive and commercial-friendly, with an attribution requirement rather than Apache-2.0’s grant. Clustering is roughly 150 lines of in-house Rust doing agglomerative clustering on cosine distance, with no model weights at all.
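For a sense of what weight-free clustering looks like, the following is a rough sketch of agglomerative clustering on cosine distance. The text does not say which linkage the in-house code uses, so average linkage is an assumption here, as are the function names and the naive pairwise search; treat it as an illustration of the technique rather than the project's implementation.

```rust
/// Cosine distance = 1 - cosine similarity.
/// Assumes non-zero vectors of equal length.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Average-linkage distance between two clusters of embedding indices
/// (the linkage choice is an assumption, not taken from the source).
fn linkage(embs: &[Vec<f32>], a: &[usize], b: &[usize]) -> f32 {
    let mut sum = 0.0;
    for &i in a {
        for &j in b {
            sum += cosine_distance(&embs[i], &embs[j]);
        }
    }
    sum / (a.len() * b.len()) as f32
}

/// Merges the closest pair of clusters until no pair is closer than
/// `threshold`, then returns one cluster label per embedding.
fn cluster(embs: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    // Start with every embedding in its own cluster.
    let mut clusters: Vec<Vec<usize>> = (0..embs.len()).map(|i| vec![i]).collect();

    loop {
        // Find the closest pair of clusters.
        let mut best: Option<(usize, usize, f32)> = None;
        for i in 0..clusters.len() {
            for j in (i + 1)..clusters.len() {
                let d = linkage(embs, &clusters[i], &clusters[j]);
                if best.map_or(true, |(_, _, best_d)| d < best_d) {
                    best = Some((i, j, d));
                }
            }
        }

        match best {
            Some((i, j, d)) if d < threshold => {
                // j > i, so removing j leaves index i valid.
                let merged = clusters.remove(j);
                clusters[i].extend(merged);
            }
            _ => break,
        }
    }

    let mut labels = vec![0; embs.len()];
    for (label, members) in clusters.iter().enumerate() {
        for &m in members {
            labels[m] = label;
        }
    }
    labels
}
```

The merge threshold is the only knob: lower values split speakers more aggressively, higher values merge them.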

A couple of well-known options were left out on purpose: TEN VAD is excluded because its license carries a non-compete clause that an open repo can’t satisfy, and NVIDIA’s Sortformer is opt-in pull-only, never bundled.

The first pure-Rust forward passes

These are the project’s first pure-Rust forward passes, and each was checked against its ONNX reference before being trusted. Silero’s VAD probabilities sit within 1e-3 max absolute error of the upstream model. The WeSpeaker front-end matches torchaudio’s filterbank to 1.3e-3 max error, and the full audio-to-embedding path lands at cosine 1.000000. The pyannote PyanNet forward agrees to 6e-5 max absolute error over 293 frames.

Read those numbers narrowly. They are bit-closeness measurements — how faithfully the Rust port reproduces the reference network’s outputs — and nothing more. They are not diarization accuracy, not DER, not a claim about how often the speaker label is right. They say the math is the same; how good the who-said-what actually is remains a separate question this work does not yet answer.
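The two metrics quoted above are simple to state in code. This is a minimal sketch of how such a parity check could be computed over a pair of output buffers; the function names are hypothetical, and the project's actual test harness is not shown in this post.

```rust
/// Largest element-wise absolute difference between a port's output
/// and the reference output.
fn max_abs_error(port: &[f32], reference: &[f32]) -> f32 {
    assert_eq!(port.len(), reference.len(), "outputs must have equal length");
    port.iter()
        .zip(reference)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0, f32::max)
}

/// Cosine similarity between two embeddings.
/// Assumes non-zero vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```

Both functions compare one set of outputs against another. Neither takes a ground-truth speaker label as input, which is exactly why they cannot say anything about diarization accuracy.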

A cosine of 1.000000 means the port reproduces the reference network — not that the speaker label is correct.

from the VAD & diarization design

It’s worth being precise about the “pure Rust” part, too. These small VAD and diarization models are the only pure-Rust inference in the tree. The ASR engine itself still runs entirely through ggml over FFI, which is C all the way down, and the nn/* modules are ggml graph builders, not Rust math. Shedding C is a long-term direction, not today’s state.

Privacy is the default

Diarization here means telling speakers apart, not naming them. Labels are anonymous and session-relative — SPEAKER_00, SPEAKER_01 — and the embeddings that produced them are discarded when the session ends. The single identity feature is an off-by-default enrolled primary user: one local centroid that can relabel a cluster as you. In remote mode the mapping is designed to stay on the client; the server is intended to return only anonymous labels and not receive the enrolled voiceprint — the same local-by-default privacy posture the rest of the project follows.
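As an illustration of how small that identity step can be, here is a hedged sketch of client-side relabelling. Everything in it is hypothetical: the label_clusters name, the PRIMARY_USER label, and the min_similarity cutoff are assumptions for the example, and only the SPEAKER_NN format comes from the text.

```rust
/// Cosine similarity; assumes non-zero vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Produces anonymous, session-relative labels for each cluster centroid.
/// If an enrolled centroid is supplied (off by default, hence the Option),
/// the best-matching cluster is relabelled when it clears `min_similarity`.
fn label_clusters(
    centroids: &[Vec<f32>],
    enrolled: Option<&[f32]>,
    min_similarity: f32, // hypothetical cutoff, not a value from the source
) -> Vec<String> {
    let mut labels: Vec<String> = (0..centroids.len())
        .map(|i| format!("SPEAKER_{:02}", i))
        .collect();

    if let Some(voice) = enrolled {
        // Find the cluster closest to the enrolled voiceprint.
        let best = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, cosine_similarity(c, voice)))
            .max_by(|a, b| a.1.total_cmp(&b.1));

        if let Some((i, similarity)) = best {
            if similarity >= min_similarity {
                labels[i] = "PRIMARY_USER".to_string();
            }
        }
    }
    labels
}
```

Because the function needs only centroids and one local vector, it can run entirely on the client after a server has returned its anonymous labels.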

Diarize, don't identify

By design, the enrolled voiceprint stays on the device; in remote-compute mode identity is meant to remain a client-side post-step layered on top of the server’s anonymous labels — not something the engine ships by default.

What’s wired, and what honestly isn’t

The substrate is additive and parity-gated. With diarization off, output is byte-identical to the engine’s existing output, which is the golden-diff / WER-0 oracle the whole engine is held to. With it on, diarization writes only a speaker field, never text or timestamps. The batch path (a BatchDiarizer behind a per-request diarize flag) is wired for any model, but it currently emits one speaker per VAD segment, and the agglomerative merge threshold of 0.35 is an untuned, per-domain knob rather than an optimum. Late fusion also collapses overlap: it assigns one speaker per word, so two people talking at once become a single label.

So the honest framing is this: --diarize is a shipped, opt-in feature, not just engine-side substrate — it attributes anonymous SPEAKER_NN labels onto any model family’s transcript via the pure-Rust WeSpeaker and pyannote capability packs, pulled on demand. Cohere packs can additionally emit inline speaker tokens on top of that. The work here is tightening the same who-said-what path everywhere, measured against its references rather than marketed ahead of them.