AI From Scratch

Phase 05 / 29 lessons / ~30 hours

NLP — Foundations to Advanced

Language is the interface to intelligence. Master every layer.

Lessons

01 Text Processing — Tokenization, Stemming, Lemmatization
Language is continuous. Models are discrete. Preprocessing is the bridge.
Build / ~45 minutes / Python

02 Bag of Words, TF-IDF, and Text Representation
Count first, think later. TF-IDF still beats embeddings on well-defined tasks in 2026.
Build / ~75 minutes / Python

03 Word Embeddings — Word2Vec from Scratch
A word is the company it keeps. Train a shallow net on that idea and geometry falls out.
Build / ~75 minutes / Python

04 GloVe, FastText, and Subword Embeddings
Word2Vec trained one embedding per word. GloVe factorized the co-occurrence matrix. FastText embedded the pieces. BPE bridged to transformers.
Build / ~45 minutes / Python

05 Sentiment Analysis
The canonical NLP task. Most of what you need to know about classical text classification shows up here.
Build / ~75 minutes / Python

06 Named Entity Recognition
Pull the names out. Sounds easy until you deal with ambiguous boundaries, nested entities, and domain jargon.
Build / ~75 minutes / Python

07 POS Tagging and Syntactic Parsing
Grammar was unfashionable for a while. Then every LLM pipeline needed to validate structured extraction, and it came back.
Build / ~45 minutes / Python

08 CNNs and RNNs for Text
Convolutions learn n-grams. Recurrences remember. Both are superseded by attention. Both still matter on constrained hardware.
Build / ~75 minutes / Python

09 Sequence-to-Sequence Models
Two RNNs pretending to be a translator. The bottleneck they hit is the reason attention exists.
Build / ~75 minutes / Python

10 Attention Mechanism — The Breakthrough
The decoder stops squinting at a compressed summary and starts looking at the whole source. Everything after this is attention plus engineering.
Build / ~45 minutes / Python

11 Machine Translation
Translation is the task that paid for NLP research for thirty years and keeps paying now.
Build / ~75 minutes / Python

12 Text Summarization
Extractive systems tell you what the document said. Abstractive systems tell you what the author meant. Different tasks, different pitfalls.
Build / ~75 minutes / Python

13 Question Answering Systems
Three systems shaped modern QA. Extractive found spans. Retrieval-augmented grounded them in documents. Generative produced answers. Every modern AI assistant is a mix of the three.
Build / ~75 minutes / Python

14 Information Retrieval and Search
BM25 is precise but brittle. Dense casts a wide net but misses keywords. Hybrid is the 2026 default. Everything else is tuning.
Build / ~75 minutes / Python

15 Topic Modeling — LDA and BERTopic
LDA: documents are mixtures of topics, topics are distributions over words. BERTopic: documents cluster in embedding space, clusters are topics. Same goal, different decompositions.
Learn / ~45 minutes / Python

16 Text Generation Before Transformers — N-gram Language Models
If a word is surprising, the model is bad. Perplexity makes surprise a number. Smoothing keeps it finite.
Build / ~45 minutes / Python

17 Chatbots — Rule-Based to Neural to LLM Agents
ELIZA replied with pattern matches. DialogFlow mapped intents. GPT answered from weights. Claude runs tools and verifies. Each era solved the previous one's worst failure.
Learn / ~75 minutes / Python

18 Multilingual NLP
One model, 100+ languages, zero training data for most of them. Cross-lingual transfer is the practical miracle of the 2020s.
Learn / ~45 minutes / Python

19 Subword Tokenization — BPE, WordPiece, Unigram, SentencePiece
Word tokenizers choke on unseen words. Character tokenizers blow up sequence length. Subword tokenizers split the difference. Every modern LLM ships on one.
Learn / ~60 minutes / Python

20 Structured Outputs & Constrained Decoding
Ask an LLM for JSON. Get JSON most of the time. In production, "most" is the problem. Constrained decoding turns "most" into "always" by editing the logits before sampling.
Build / ~60 minutes / Python

21 Natural Language Inference — Textual Entailment
"t entails h" means a human reading t would conclude h is true. NLI is the task of predicting entailment / contradiction / neutral. Boring on the surface, load-bearing in production.
Learn / ~60 minutes / Python

22 Embedding Models — The 2026 Deep Dive
Word2Vec gave you a vector per word. Modern embedding models give you a vector per passage, cross-lingual, with sparse, dense, and multi-vector views, sized to fit your index. Pick wrong and your RAG retrieves the wrong thing.
Learn / ~60 minutes / Python

23 Chunking Strategies for RAG
Chunking configuration influences retrieval quality as much as the choice of embedding model (Vectara NAACL 2025). Get chunking wrong and no amount of reranking saves you.
Build / ~60 minutes / Python

24 Coreference Resolution
"She called him. He did not answer. The doctor was at lunch." Three references to two people and nobody is named. Coreference resolution figures out who is who.
Learn / ~60 minutes / Python

25 Entity Linking & Disambiguation
NER found "Paris." Entity linking decides: Paris, France? Paris Hilton? Paris, Texas? Paris (the Trojan prince)? Without linking, your knowledge graph stays ambiguous.
Build / ~60 minutes / Python

26 Relation Extraction & Knowledge Graph Construction
NER found the entities. Entity linking anchored them. Relation extraction finds the edges between them. A knowledge graph is the sum of nodes, edges, and their provenance.
Build / ~60 minutes / Python

27 LLM Evaluation — RAGAS, DeepEval, G-Eval
Exact-match and F1 miss semantic equivalence. Human review does not scale. LLM-as-judge is the production answer — with enough calibration to trust the number.
Build / ~75 minutes / Python

28 Long-Context Evaluation — NIAH, RULER, LongBench, MRCR
Gemini 3 Pro advertises 10M tokens of context. At 1M tokens, 8-needle MRCR drops to 26.3%. Advertised ≠ usable. Long-context evaluation tells you the actual capacity of the model you are shipping on.
Learn / ~60 minutes / Python

29 Dialogue State Tracking
"I want a cheap restaurant in the north... actually make it moderate... and add Italian." Three turns, three state updates. DST keeps the slot-value dict in sync so the booking works.
Build / ~75 minutes / Python