AI From Scratch/Lesson 15/~75 minutes

Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue

Between 2024 and 2026, voice AI was redefined. Moshi ships a single model that listens and speaks simultaneously at about 200 ms latency. Hibiki performs speech-to-speech translation chunk by chunk. Both abandon the ASR → LLM → TTS pipeline for a unified full-duplex...

Learn · Python · No prerequisites
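
To make the latency claim in the summary concrete, here is a minimal back-of-the-envelope sketch in Python. It contrasts a cascaded ASR → LLM → TTS pipeline, where stage delays add up, with a frame-by-frame streaming model, where latency is bounded by the frame duration plus a small fixed delay. The function names and the cascaded stage delays are hypothetical and chosen only for illustration. The 12.5 Hz frame rate and one-frame delay are assumptions based on the figures published for Moshi's codec; they give 160 ms in theory, in line with the roughly 200 ms practical figure quoted in the summary.

```python
def cascaded_latency_ms(asr_ms: float, llm_ms: float, tts_ms: float) -> float:
    """Stages run one after another, so their delays simply add up."""
    return asr_ms + llm_ms + tts_ms


def frame_latency_ms(frame_rate_hz: float, delay_frames: int) -> float:
    """A streaming model consumes and emits one audio frame per step.

    Latency is one frame duration (time to collect the input frame)
    plus any extra frames of delay the model imposes before speaking.
    """
    frame_ms = 1000.0 / frame_rate_hz
    return frame_ms * (1 + delay_frames)


# Hypothetical stage delays for a cascaded pipeline (illustrative only).
cascaded = cascaded_latency_ms(asr_ms=300, llm_ms=500, tts_ms=200)

# Assumed streaming setup: 12.5 Hz frames (80 ms each) and one frame of delay.
streaming = frame_latency_ms(frame_rate_hz=12.5, delay_frames=1)

print(f"cascaded:  {cascaded:.0f} ms")
print(f"streaming: {streaming:.0f} ms")
```

The point of the comparison is structural rather than numerical: in the cascaded design each stage waits on the previous one, while in the streaming design the budget is set by the frame clock, so a listener hears a response within a few frames regardless of utterance length.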
