AI From Scratch/Lesson 17/~75 minutes

Disaggregated Prefill/Decode — NVIDIA Dynamo and llm-d

Prefill is compute-bound; decode is memory-bandwidth-bound. Running both phases on the same GPU leaves one of those two resources underused at any given moment. Disaggregation splits the phases into separate GPU pools and transfers the KV cache from the prefill pool to the decode pool over NIXL (RDMA/InfiniBand, with a TCP fallback). NVIDIA Dyna...
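The compute-bound versus memory-bound split can be made concrete with a rough roofline estimate. The sketch below is illustrative only: the function names, the GPU figures (1e15 FLOP/s, 3e12 bytes/s), and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical values chosen for the arithmetic, not specifications of any real hardware or model, and nothing here calls NIXL, Dynamo, or llm-d. It assumes about 2 FLOPs per parameter per token and that each weight is read once per forward pass, ignoring attention and KV-cache traffic.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs (multiply + add) per parameter per token, and every
    weight is read from GPU memory exactly once per pass.
    """
    return (2 * tokens_per_pass) / bytes_per_param


def bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload against the roofline ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV cache that prefill must hand to decode.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical GPU: 1e15 FLOP/s peak compute, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

# Prefill processes the whole prompt in one pass; decode emits one token per pass.
prefill = bound(arithmetic_intensity(2048), PEAK_FLOPS, MEM_BANDWIDTH)
decode = bound(arithmetic_intensity(1), PEAK_FLOPS, MEM_BANDWIDTH)
print("prefill (2048-token prompt):", prefill)
print("decode (1 token per step):", decode)

# KV cache for the same 2048-token prompt on the hypothetical model shape.
size = kv_cache_bytes(tokens=2048, layers=32, kv_heads=8, head_dim=128)
print("KV cache to transfer:", size / 2**20, "MiB")
```

Under these assumptions the prompt pass sits far above the ridge point while single-token decode sits far below it, which is the mismatch disaggregation targets. The KV cache size is the price paid for the split: it is the payload that has to cross the interconnect for every request.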

