![Whisper encoder-decoder, tasks, chunked inference, fine-tune](../assets/whisper.svg) **Architecture.** Standard transformer encoder-decoder. - Input: 30-second log-mel spectrogram, 80 mels, 10 ms hop → 3000 frames. Clips shorter are zero-padded, clips longer are chunked. - Encoder: conv-downsample (stride 2) + `N` transformer blocks. For Large-v3: 32 layers, 1280-dim, 20 heads. - Decoder: `N` transformer blocks with causal self-attn + cross-attn to encoder output. Same size as encoder. - Output: BPE tokens over a 51,865-token vocab. Large-v3 has 1.55B params. Turbo uses a 4-layer decoder (from 32), cutting latency 8× with a <1% WER hit. **The prompt format.** Whisper is a multitask model steered by special tokens in the decoder prompt: ``` <|startoftranscript|><|en|><|transcribe|><|notimestamps|> Hello world.<|endoftext|> ``` - `<|en|>` — language tag; forces translation-vs-transcription behavior. - `<|transcribe|>` or `<|translate|>` — translate English output from any-language input, or verbatim. - `<|notimestamps|>` — skip word-level timestamps (faster). The prompt is what lets one model do many tasks. Change `<|en|>` to `<|fr|>` and it transcribes French. **30-second window.** Everything is pinned to 30 seconds. Longer clips need chunking; shorter clips are padded. Windows are not streamed natively — this is why WhisperX, Whisper-Streaming, and faster-whisper exist. **Log-mel normalization.** `(log_mel - mean) / std` where the stats come from Whisper's own training corpus. You *must* use Whisper's preprocessing (`whisper.audio.log_mel_spectrogram`), not `librosa.feature.melspectrogram`. ### Variants in 2026 | Variant | Params | Latency (A100) | WER (LibriSpeech-clean) | |---------|--------|----------------|------------------------| | Tiny | 39M | 1× realtime | 5.4% | | Base | 74M | 1× | 4.1% | | Small | 244M | 1× | 3.0% | | Medium | 769M | 1× | 2.7% | | Large-v3 | 1.55B | 2× | 1.8% | | Large-v3-turbo | 809M | 8× | 1.58% | | Whisper-Streaming (2024) | 1.55B | streaming | 2.0% | ### Fine-tuning Canonical workflow in 2026: 1. Collect 10–100 hours of target-domain audio with aligned transcripts. 2. Run `transformers.Seq2SeqTrainer` with `generate_with_loss` callback. 3. Parameter-efficient: LoRA on `q_proj`, `k_proj`, `v_proj` of attention layers reduces GPU memory 4× with <0.3 WER cost. 4. Freeze the encoder if you have <10 hours. Only tune the decoder. 5. Use Whisper's own tokenizer and prompt format; never swap tokenizers. Community results: fine-tuning Medium on 20 hours of medical dictation drops WER from 12% to 4.5% on medical vocabulary. Fine-tuning Turbo on 4 hours of Icelandic drops WER from 18% to 6%.