### The ViT-MLP-LLM architecture

```mermaid
flowchart LR
  IMG["Image<br/>(H x W x 3)"] --> ViT["Vision encoder<br/>(ViT, CLIP-L,<br/>SigLIP, DINOv3)"]
  ViT --> FEATS["Image tokens<br/>(N, d_vit)"]
  FEATS --> PROJ["Projector<br/>(2-4 layer MLP<br/>or Q-former)"]
  PROJ --> VTOK["Image tokens<br/>in LLM space<br/>(N, d_llm)"]
  TXT["Text prompt"] --> TOK["LLM tokenizer"]
  TOK --> TTOK["Text tokens<br/>(M, d_llm)"]
  VTOK --> CONCAT["Interleave<br/>or concat"]
  TTOK --> CONCAT
  CONCAT --> LLM["Decoder LLM<br/>(Qwen3, LLaMA, etc.)"]
  LLM --> OUT["Text answer"]
  style ViT fill:#dbeafe,stroke:#2563eb
  style PROJ fill:#fef3c7,stroke:#d97706
  style LLM fill:#dcfce7,stroke:#16a34a
```

1. **Vision encoder**: a pretrained ViT (CLIP-L/14, SigLIP, DINOv3, or a fine-tuned variant). Produces patch tokens.
2. **Projector**: a small module (2-4 layer MLP, or a Q-former) that maps vision tokens into the LLM's embedding dimension. This is where most of the fine-tuning happens.
3. **LLM**: a decoder-only language model (Qwen3, Llama, Mistral, GLM, InternLM). Reads the vision and text tokens in sequence and generates text.

All three pieces are trainable in principle. In practice, the vision encoder and LLM stay mostly frozen while the projector trains, so the billions of pretrained parameters in both are reused at a small training cost.

### DeepStack

Vanilla projection uses only the last ViT layer. DeepStack (Qwen3-VL) samples features from multiple ViT depths and stacks them. Deeper layers carry high-level semantics; shallower layers carry fine-grained spatial and textural information. Feeding both into the LLM closes the gap between "what does the image contain" (semantics) and "where exactly" (spatial grounding).

### Three training stages

Modern VLMs train in stages:

1. **Alignment**: freeze the ViT and the LLM. Train only the projector on image-caption pairs. This teaches the projector to map vision space into language space.
2. **Pre-training**: unfreeze everything. Train on large-scale interleaved image-text data (500M+ pairs). This builds the model's visual knowledge.
3. **Instruction tuning**: fine-tune on curated (image, question, answer) triples. This teaches conversational behaviour and task formats, and it is what turns a "vision-aware LM" into a usable assistant.

Most LoRA fine-tunes target stage 3 with a small labelled dataset.

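
The staged freezing described above can be sketched as a small schedule that records which components receive gradient updates in each stage. This is a minimal illustration, not any particular model's recipe: the component sizes, the projector widths (1024 to 4096), and the helper names are assumptions made up for the example, and stage 3 is shown as a full fine-tune even though a LoRA run would train far fewer parameters.

```python
# Illustrative sketch: all sizes below are hypothetical round numbers.

def mlp_projector_params(d_vit: int, d_llm: int, layers: int = 2) -> int:
    """Weights plus biases of an MLP whose hidden and output widths are d_llm."""
    total = d_vit * d_llm + d_llm                      # first layer: d_vit -> d_llm
    total += (layers - 1) * (d_llm * d_llm + d_llm)    # later layers: d_llm -> d_llm
    return total


PARAMS = {
    "vision_encoder": 300_000_000,                     # hypothetical size
    "projector": mlp_projector_params(1024, 4096),     # hypothetical widths
    "llm": 7_000_000_000,                              # hypothetical size
}

# Components that are trainable in each stage.
TRAINABLE = {
    "alignment": {"projector"},                                # ViT and LLM frozen
    "pretraining": {"vision_encoder", "projector", "llm"},     # everything unfrozen
    # Assumption: shown as a full fine-tune; LoRA would add small adapters instead.
    "instruction_tuning": {"vision_encoder", "projector", "llm"},
}


def trainable_params(stage: str) -> int:
    """Number of parameters updated during the given stage."""
    return sum(PARAMS[name] for name in TRAINABLE[stage])


def trainable_fraction(stage: str) -> float:
    """Share of all parameters that are updated during the given stage."""
    return trainable_params(stage) / sum(PARAMS.values())


for stage in TRAINABLE:
    print(f"{stage:20s} {trainable_params(stage):>15,d} ({trainable_fraction(stage):.2%})")
```

With these made-up sizes the projector holds about 21M parameters, well under 1% of the total, which is why the alignment stage is cheap compared with the stages that unfreeze everything.
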
### Model family comparison (early 2026)

| Model | Params | Vision encoder | LLM | Context | Strengths |
|-------|--------|----------------|-----|---------|-----------|
| Qwen3-VL-235B-A22B (MoE) | 235B (22B active) | custom ViT + DeepStack | Qwen3 | 256K | General SOTA, GUI agent |
| Qwen3-VL-30B-A3B (MoE) | 30B (3B active) | custom ViT + DeepStack | Qwen3 | 256K | Smaller MoE alternative |
| Qwen3-VL-8B (dense) | 8B | custom ViT | Qwen3 | 128K | Production dense default |
| InternVL3.5-38B | 38B | InternViT-6B | Qwen3 + GPT-OSS | 128K | Strong MMBench / MMVet |
| InternVL3.5-241B-A28B | 241B (28B active) | InternViT-6B | Qwen3 | 128K | Competitive with GPT-4o |
| LLaVA-Next 72B | 72B | SigLIP | Llama-3 | 32K | Open, easy to fine-tune |
| GLM-4.6V | ~70B | custom | GLM | 64K | Open-source, strong OCR |
| MiniCPM-V-2.6 | 8B | SigLIP | MiniCPM | 32K | Edge-friendly |

### Visual agents

Qwen3-VL-235B reaches top global performance on OSWorld, a benchmark for **visual agents** that operate GUIs (desktop, mobile, web). The model sees a screenshot, understands the UI, and emits actions (click, type, scroll). Combined with tools, it closes the loop on common desktop tasks. This is what most 2026 "AI PC" demos run under the hood.

### Agentic capabilities + RoPE variants

VLMs need to know **when** a frame occurs in a video. Qwen3-VL evolved from T-RoPE (temporal rotary position embeddings) to **text-based time alignment**: explicit timestamp text tokens interleaved with the video frames. The model sees each frame preceded by its timestamp written as text, followed by the prompt, and can reason about temporal relationships.

### The alignment problem

In a crawled dataset, 12% of image-text pairs contain descriptions that are not fully grounded in the image. A VLM trained on such data silently learns to hallucinate: it fabricates objects, misreads numbers, and invents relationships. In production this is the dominant failure mode.
Skywork.ai introduced the **Cross-Modal Error Rate (CMER)** to track it:

```
CMER = fraction of outputs where the text confidence is high
       but the image-text similarity (via a CLIP-family checker) is low
```

High CMER means the model is confidently saying things that are not grounded in the image. Monitoring CMER and treating it as a production KPI cut the hallucination rate by ~35% in their deployment. The trick is not "fix the model" but "route high-CMER outputs to human review."

### Fine-tuning with LoRA / QLoRA

Full fine-tuning of a 70B VLM is out of reach for most teams. LoRA (rank 16-64) on the attention and projector layers, or QLoRA with 4-bit base weights when the base model is too large to hold in 16-bit precision, fits on a single A100 / H100. Cost: 5,000-50,000 examples, $100-$5,000 in compute, 2-10 hours of training.

### Spatial reasoning is still weak

Current VLMs score 50-60% on spatial reasoning benchmarks (above-below, left-right, counting, distance). If your use case depends on "which object is on top of which," validate heavily, because generic VLM performance is below human level. Better-than-VLM alternatives for pure spatial tasks: a specialised keypoint / pose estimator, a depth model, or a detection model whose box geometry is post-processed.
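
As a rough illustration of the last option, the sketch below post-processes detector boxes to answer "which object is on top of which" and "which is to the left". It is a 2D image-plane heuristic under stated assumptions: boxes are in pixel coordinates with the origin at the top-left and y growing downward, and the `Box` type, function names, and thresholds (`min_overlap`, `max_gap`) are hypothetical choices for the example rather than part of any detection library.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned detection box in pixels; origin top-left, y grows downward."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def horizontal_overlap(a: Box, b: Box) -> float:
    """Length of the shared x-extent of the two boxes (0 if they do not overlap)."""
    return max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))


def is_on_top_of(a: Box, b: Box, min_overlap: float = 0.5, max_gap: float = 10.0) -> bool:
    """Heuristic: a rests on b if a's bottom edge is near b's top edge
    and the boxes share enough horizontal extent."""
    narrower = min(a.x2 - a.x1, b.x2 - b.x1)
    if narrower <= 0:
        return False
    overlap_ratio = horizontal_overlap(a, b) / narrower
    gap = b.y1 - a.y2  # > 0: empty space between them; < 0: slight box overlap
    return overlap_ratio >= min_overlap and abs(gap) <= max_gap


def is_left_of(a: Box, b: Box) -> bool:
    """Compare box centres along the x-axis."""
    return (a.x1 + a.x2) / 2 < (b.x1 + b.x2) / 2


# Hypothetical detections for a cup standing on a table, with a lamp to the right.
cup = Box("cup", 100, 50, 160, 120)
table = Box("table", 20, 118, 400, 300)
lamp = Box("lamp", 300, 10, 340, 119)

print(is_on_top_of(cup, table))   # cup's bottom edge meets the table's top edge
print(is_on_top_of(table, cup))   # the reverse relation should not hold
print(is_left_of(cup, lamp))
```

Because the rule only looks at image-plane geometry, it can be fooled by perspective (an object behind the table may appear to rest on it), so thresholds need tuning on real detections, and a depth model is the natural complement when that matters.
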