Phase 17: Infrastructure & Production
AI From Scratch / Lesson 11 / ~60 minutes

Multi-Region LLM Serving and KV Cache Locality

Round-robin load balancing is actively harmful for cached LLM inference. A request that does not land on the node holding its prefix pays full prefill cost — roughly 800 ms at P50 on a long prompt versus ~80 ms with a cache hit. In 2026 th...
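
To make the cost of cache-blind routing concrete, here is a minimal sketch that compares round-robin with prefix-affinity routing on a toy workload. Everything in it is an assumption for illustration: the node names, the 64-character routing prefix, the choice of rendezvous hashing as the affinity scheme, and the model in which a node keeps every prefix it has ever served. The 800 ms and 80 ms figures are the ones quoted above, used as fixed per-request costs rather than measured values.

```python
import hashlib
from itertools import cycle

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
PREFIX_CHARS = 64          # assumption: route on the first 64 characters
MISS_MS, HIT_MS = 800, 80  # illustrative costs taken from the text above


def prefix_key(prompt: str) -> str:
    """Hash the leading part of the prompt so shared prefixes share a key."""
    return hashlib.sha256(prompt[:PREFIX_CHARS].encode("utf-8")).hexdigest()


def affinity_node(prompt: str, nodes=NODES) -> str:
    """Rendezvous hashing: score every node against the key, take the highest."""
    key = prefix_key(prompt)
    return max(
        nodes,
        key=lambda n: hashlib.sha256(f"{n}:{key}".encode("utf-8")).hexdigest(),
    )


def simulate(prompts, route) -> int:
    """Total latency in ms, assuming a node caches every prefix it has served."""
    cached = {n: set() for n in NODES}
    total = 0
    for p in prompts:
        node, key = route(p), prefix_key(p)
        total += HIT_MS if key in cached[node] else MISS_MS
        cached[node].add(key)
    return total


# Eight requests that share one long system prompt and differ only in the tail.
system_prompt = "You are a helpful assistant for the billing team. " * 4
prompts = [system_prompt + f"Question {i}" for i in range(8)]

rr = cycle(NODES)
round_robin_ms = simulate(prompts, lambda _p: next(rr))
affinity_ms = simulate(prompts, affinity_node)

print(f"round-robin:     {round_robin_ms} ms")
print(f"prefix affinity: {affinity_ms} ms")
```

Under these assumptions round-robin pays one cold prefill per node before any request hits a warm cache, while affinity routing pays it once in total. A real router would typically hash token blocks rather than characters and would also have to account for cache eviction and load imbalance, which this sketch ignores.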
