AI From Scratch

Phase 06 / 17 lessons / ~18 hours

Speech & Audio

The other half of human communication. Hear, understand, speak.

Lessons

01 Audio Fundamentals — Waveforms, Sampling, Fourier Transform
Waveforms are the raw signal. Spectrograms are the representation. Mel features are the ML-friendly form. Every modern ASR and TTS pipeline walks this ladder, and the first rung is understanding sampling and Fourier.
Learn / ~45 minutes / Python

02 Spectrograms, Mel Scale & Audio Features
Neural nets do not consume raw waveforms well. They consume spectrograms. They consume mel spectrograms even better. Every ASR, TTS, and audio classifier in 2026 lives or dies by this single preprocessing choice.
Build / ~45 minutes / Python

03 Audio Classification — From k-NN on MFCCs to AST and BEATs
Everything from "dog barking vs siren" to "which language is this" is audio classification. The features are mels. The architecture moves each decade. The evaluation stays AUC, F1, and per-class recall.
Build / ~75 minutes / Python

04 Speech Recognition (ASR) — CTC, RNN-T, Attention
Speech recognition is audio classification at every timestep, glued together by a sequence model that knows English and silence. CTC, RNN-T, and attention are the three ways to do it. Pick one and understand why.
Build / ~45 minutes / Python

05 Whisper — Architecture & Fine-Tuning
Whisper is a 30-second-window transformer encoder-decoder, trained on 680k hours of multilingual weakly-supervised audio-text pairs. One architecture, multiple tasks, robust across 99 languages. The 2026 reference ASR.
Build / ~75 minutes / Python

06 Speaker Recognition & Verification
ASR asks "what did they say?" Speaker recognition asks "who said it?" The math looks the same — embeddings plus cosine — but every production decision hinges on a single EER number.
Build / ~45 minutes / Python

07 Text-to-Speech (TTS) — From Tacotron to F5 and Kokoro
ASR inverts speech to text; TTS inverts text to speech. The 2026 stack is three parts: text → tokens, tokens → mel, mel → waveform. Each part has a default model that fits in a laptop.
Build / ~75 minutes / Python

08 Voice Cloning & Voice Conversion
Voice cloning reads your text in someone else's voice. Voice conversion rewrites your voice into someone else's while preserving what you said. Both hang on the same decomposition: separate speaker identity from content.
Build / ~75 minutes / Python

09 Music Generation — MusicGen, Stable Audio, Suno, and the Licensing Earthquake
2026 music generation: Suno v5 and Udio v4 dominate commercial; MusicGen, Stable Audio Open, and ACE-Step lead open-source. The technical problem is mostly solved. The legal problem (Warner Music $500M settlement, UMG settlement) reshaped...
Build / ~75 minutes / Python

10 Audio-Language Models — Qwen2.5-Omni, Audio Flamingo, GPT-4o Audio
2026 audio-language models reason over speech + environmental sound + music. Qwen2.5-Omni-7B matches GPT-4o Audio on MMAU-Pro. Audio Flamingo Next beats Gemini 2.5 Pro on LongAudioBench. The gap between open and closed is essentially close...
Learn / ~45 minutes / Python

11 Real-Time Audio Processing
Batch pipelines process a file. Real-time pipelines process the next 20 milliseconds before the next 20 arrive. Every conversational AI, broadcast studio, and telephony bot lives and dies by this latency budget.
Build / ~75 minutes / Python

12 Build a Voice Assistant Pipeline — The Phase 6 Capstone
Everything from lessons 01-11, stitched together. Build a voice assistant that listens, reasons, and talks back. In 2026 that is a solved engineering problem, not a research problem — but the integration details decide whether it ships.
Build / ~120 minutes / Python

13 Neural Audio Codecs — EnCodec, SNAC, Mimi, DAC and the Semantic-Acoustic Split
2026 audio generation is almost all tokens. EnCodec, SNAC, Mimi, and DAC turn continuous waveforms into discrete sequences that a transformer can predict. The semantic-vs-acoustic token split — first-codebook as semantic, rest as acoustic...
Learn / ~60 minutes / Python

14 Voice Activity Detection & Turn-Taking — Silero, Cobra, and the Flush Trick
Every voice agent lives or dies on two decisions: is the user speaking now, and are they done? VAD answers the first. Turn-detection (VAD + silence-hangover + semantic endpoint model) answers the second. Get either wrong and your assistant...
Build / ~45 minutes / Python

15 Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue
2024-2026 redefined voice AI. Moshi ships a single model that listens and speaks simultaneously at 200 ms latency. Hibiki does speech-to-speech translation chunk-by-chunk. Both abandon the ASR → LLM → TTS pipeline for a unified full-duplex...
Learn / ~75 minutes / Python

16 Voice Anti-Spoofing & Audio Watermarking — ASVspoof 5, AudioSeal, WaveVerify
Voice cloning shipped faster than defenses. 2026 production voice systems need two things: a detector (AASIST, RawNet2) that classifies real vs fake speech, and a watermark (AudioSeal) that survives compression and editing. Ship both or do...
Build / ~75 minutes / Python

17 Audio Evaluation — WER, MOS, UTMOS, MMAU, FAD, and the Open Leaderboards
You cannot ship what you cannot measure. This lesson names the 2026 metrics for every audio task: ASR (WER, CER, RTFx), TTS (MOS, UTMOS, SECS, WER-on-ASR-round-trip), audio-language (MMAU, LongAudioBench), music (FAD, CLAP), and speaker (E...
Learn / ~60 minutes / Python
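
WER, the first ASR metric named in lesson 17, is small enough to preview here. The following is a minimal sketch, not the course's own implementation: the function name `word_error_rate` is hypothetical, words are split on whitespace only, and no text normalization (lowercasing, punctuation stripping) is applied, although real evaluations usually normalize first. It computes the word-level edit distance (substitutions, deletions, insertions) and divides by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words.

    Illustrative helper (hypothetical name). Assumes whitespace tokenization
    and no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion: reference word missing
                curr[j - 1] + 1,                  # insertion: extra hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word ("the" -> "a") out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.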