Phase 12: Multimodal AI
AI From Scratch/Lesson 21/~180 minutes

Embodied VLAs: RT-2, OpenVLA, π0, GR00T

RT-2 (Google DeepMind, July 2023) was the first model to read a recipe off a website and execute it on a kitchen robot. It discretized actions as text tokens, co-fine-tuned a VLM on web data plus robot-action data, and proved that w...
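To make "actions as text tokens" concrete, here is a minimal stdlib sketch of uniform action binning. It is an illustration under stated assumptions, not RT-2's actual code: the 256-bin count, the per-dimension `low`/`high` ranges, and the helper names `encode_action`, `decode_action`, `to_text`, and `from_text` are all hypothetical choices made for this example.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumption: uniform bins per action dimension


def encode_action(action: Sequence[float], low: Sequence[float],
                  high: Sequence[float], num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action dimension to a bin id in [0, num_bins - 1]."""
    if not (len(action) == len(low) == len(high)):
        raise ValueError("action, low and high must have the same length")
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)        # clamp out-of-range commands
        frac = (clipped - lo) / (hi - lo)        # normalise to [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def decode_action(tokens: Sequence[int], low: Sequence[float],
                  high: Sequence[float], num_bins: int = NUM_BINS) -> List[float]:
    """Map bin ids back to the centre of each bin."""
    return [lo + (t + 0.5) / num_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


def to_text(tokens: Sequence[int]) -> str:
    """Serialise bin ids as a space-separated string a language model can emit."""
    return " ".join(str(t) for t in tokens)


def from_text(text: str) -> List[int]:
    """Parse a space-separated string of bin ids back into integers."""
    return [int(part) for part in text.split()]


if __name__ == "__main__":
    low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
    text = to_text(encode_action([-1.0, 0.0, 1.0], low, high))
    print(text)
    print(decode_action(from_text(text), low, high))
```

Decoding returns bin centres, so the round-trip error per dimension is at most half a bin width, `(high - low) / (2 * num_bins)`. Finer bins reduce that error at the cost of a larger action vocabulary.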

Learn | Python (stdlib): action tokenizer + VLA inference skeleton