Phase 12: Multimodal AI
AI From Scratch/Lesson 10/~120 minutes

InternVL3: Native Multimodal Pretraining

Most open VLMs before InternVL3 followed the same three-step recipe: take a text LLM trained on trillions of text tokens, bolt on a vision encoder, then fine-tune the seams. This works but carries alignment debt: the text LLM has spent its fu...
