Schedule
Introduction to Language Models and Inference (Aug 26)
Content:
- What is a language model? (see the factorization sketched below)
- What is an inference algorithm?
- What will we not cover?
- What are transformers?
- How do modern LMs work?
- Modeling errors and search errors
- Prompting as a means of model control
- Instruction following behavior
Code: (link)
Reading Material
- Reference: Sections 1 and 2 of From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (arXiv)
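As quick background for the first bullet above and for the distinction between modeling and search errors (not a substitute for the lecture itself), the standard autoregressive factorization and the MAP decoding objective are:

```latex
% Autoregressive factorization: the model scores each token given its prefix (and optional input x).
p_\theta(y_{1:T} \mid x) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, x)

% MAP decoding searches for the highest-scoring sequence; a search error means the decoder
% fails to find y*, a modeling error means y* itself is not a good output.
y^{*} = \arg\max_{y} \; \log p_\theta(y \mid x)
```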
Probability Review (Aug 28)
Content:
- Probability review (chain-rule sketch below)
- Transformer implementation
- Generation and evaluation
- Meta-generation
Code: (link)
Reading Material
None
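A minimal sketch (not the course's released notebook) of how the chain rule turns per-token log-probabilities into a sequence score; the toy bigram table is a hypothetical stand-in for a transformer's next-token distribution:

```python
import math

# Hypothetical toy "model": log-probabilities of the next token given the previous one.
# In practice this role is played by a transformer's softmax output.
bigram_logprobs = {
    ("<s>", "the"): math.log(0.6), ("<s>", "a"): math.log(0.4),
    ("the", "cat"): math.log(0.5), ("the", "dog"): math.log(0.5),
    ("a",   "cat"): math.log(0.3), ("a",   "dog"): math.log(0.7),
    ("cat", "</s>"): math.log(1.0), ("dog", "</s>"): math.log(1.0),
}

def sequence_logprob(tokens):
    """Chain rule: log p(y_1..y_T) = sum_t log p(y_t | y_{<t})."""
    total, prev = 0.0, "<s>"
    for tok in tokens:
        total += bigram_logprobs[(prev, tok)]
        prev = tok
    return total

print(sequence_logprob(["the", "cat", "</s>"]))  # log(0.6) + log(0.5) + log(1.0)
```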
Common Sampling Methods for Modern NLP (Sep 2)
Content:
- Common sampling methods for modern NLP (illustrative sketch below)
- Diversity-quality tradeoffs
Slides: Google slides
Code: n/a
Reading Material
- Reference: A Thorough Examination of Decoding Methods in the Era of LLMs
- Reference: Trading Off Diversity and Quality in Natural Language Generation
- Optional: Calibration of Pre-trained Transformers
- Optional: Locally Typical Sampling
- Optional: Forking Paths in Neural Text Generation
- Optional: Truncation Sampling as Language Model Desmoothing
- Optional: Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity
- Optional: The Curious Case of Neural Text Degeneration
- Optional: Calibrated Language Models Must Hallucinate
- Optional: An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search
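As a preview of the methods above, a minimal, illustrative sketch of temperature, top-k, and top-p (nucleus) sampling over a single logits vector; the function name and toy logits are hypothetical, and production implementations operate on batched tensors:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Illustrative temperature / top-k / top-p (nucleus) sampling for one step."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:
        # Keep only the k most probable tokens.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative probability reaches top_p.
        order = np.argsort(-probs)
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask

    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_p=0.9))
```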
Beam Search and Variants (Sep 4)
Content:
- Beam search and variants (illustrative sketch below)
- Inadequacies of the mode
Slides: Google slides
Code: TBA
Reading Material
- Reference: Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
- Reference: Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement
- Optional: Breaking the Beam Search Curse: A Study of (Re-)Scoring Methods and Stopping Criteria for Neural Machine Translation
- Optional: If beam search is the answer, what was the question?
- Optional: Gumbel-max trick and weighted reservoir sampling
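A minimal illustrative beam search over a toy next-token model; `next_logprobs` is a hypothetical stand-in for a language model, and real implementations add length normalization and batching:

```python
import math

def beam_search(next_logprobs, beam_size=2, max_len=5, eos="</s>"):
    """Keep the `beam_size` highest-scoring prefixes at each step.

    `next_logprobs(prefix)` returns a dict {token: log p(token | prefix)}.
    """
    beams = [((), 0.0)]  # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_logprobs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:
            # Completed hypotheses leave the beam; stop once beam_size live prefixes remain.
            (finished if prefix[-1] == eos else beams).append((prefix, score))
            if len(beams) == beam_size:
                break
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy model: always prefers "b", ends after two tokens.
toy = lambda prefix: {"a": math.log(0.4), "b": math.log(0.6)} if len(prefix) < 2 else {"</s>": 0.0}
print(beam_search(toy, beam_size=2))
```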
Intro to A* and Best First Search (Sep 9)
Content:
- Introduction to A* and best-first search (illustrative sketch below)
- A* methods for controlled generation
Slides: TBA
Code: TBA
Reading Material
- Reference: Best-First Beam Search (arXiv)
- Reference: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics (arXiv)
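A minimal illustrative best-first (A*-style) search sketch: unlike beam search, it always expands whichever prefix has the best score plus heuristic. `next_logprobs` and `heuristic` are hypothetical stand-ins; a zero heuristic reduces this to uniform-cost search:

```python
import heapq
import math

def best_first_search(next_logprobs, heuristic, max_len=5, eos="</s>"):
    """Expand the prefix with the best (score + heuristic) priority until EOS is reached."""
    # heapq is a min-heap, so priorities are negated.
    frontier = [(-heuristic(()), 0.0, ())]  # (-(score + h), score, prefix)
    while frontier:
        _, score, prefix = heapq.heappop(frontier)
        if prefix and prefix[-1] == eos:
            return prefix, score
        if len(prefix) >= max_len:
            continue
        for tok, lp in next_logprobs(prefix).items():
            new_prefix = prefix + (tok,)
            new_score = score + lp
            heapq.heappush(frontier, (-(new_score + heuristic(new_prefix)), new_score, new_prefix))
    return None

toy = lambda prefix: {"a": math.log(0.4), "b": math.log(0.6)} if len(prefix) < 2 else {"</s>": 0.0}
print(best_first_search(toy, heuristic=lambda prefix: 0.0))
```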
Other Controlled Generation Methods (Sep 11)
Content:
- Other controlled generation methods
- Decoding-time distributional modifiers (illustrative sketch below)
Slides: TBA
Code: TBA
Reading Material
- Reference: FUDGE: Controlled Text Generation With Future Discriminators (arXiv)
- Reference: Contrastive Search Is What You Need For Neural Text Generation (arXiv)
- Reference: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model (arXiv)
Assignments
- Homework 2 out: implementation of beam search, Mirostat, and temperature, top-p, and top-k sampling, with a comparison on the shared tasks
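A minimal sketch of a decoding-time distributional modifier in the spirit of FUDGE or reward-augmented decoding (not their exact implementations): the base model's next-token log-probabilities are shifted by an attribute model's per-token scores and renormalized. All numbers below are hypothetical:

```python
import numpy as np

def controlled_next_distribution(base_logprobs, attribute_logprobs, weight=1.0):
    """Reweight the base next-token distribution by an attribute/reward score, then renormalize."""
    combined = np.asarray(base_logprobs) + weight * np.asarray(attribute_logprobs)
    combined -= combined.max()
    probs = np.exp(combined)
    return probs / probs.sum()

# Hypothetical numbers: the base LM prefers token 0, the attribute model prefers token 2.
base = np.log([0.5, 0.3, 0.2])
attr = np.log([0.1, 0.2, 0.7])
print(controlled_next_distribution(base, attr, weight=1.0))
```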
Chain of Thought and Intermediate Steps (Sep 16)
Content:
- Chain of thought / scratchpad, intermediate steps
- Why does chain of thought work?
- Tree of thoughts
Slides: TBA
Code: TBA
Reading Material
Self-Refine and Self-Correction Methods (Sep 18)
Reasoning Models (Sep 23)
Incorporating Tools (Sep 25)
Content:
- Incorporating tools: math/verification based, search, etc.
Slides: TBA
Code: TBA
Reading Material
Agents and Multi-Agent Communication (Sep 30)
Reward Models and Best-of-N (Oct 2)
Content:
- Reward models, best-of-n theory and practice (best-of-n sketch below)
- Monte Carlo Tree Search
Slides: TBA
Code: TBA
Reading Material
- Reference: Why reward models are key for alignment (by Nathan Lambert)
- Reference: Theoretical guarantees on the best-of-n alignment policy (arXiv)
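A minimal illustrative best-of-n sketch: draw n candidates from the model and keep the one a reward model scores highest. `sample_fn` and `reward_fn` are hypothetical stand-ins for an LM sampler and a learned reward model:

```python
import random

def best_of_n(sample_fn, reward_fn, n=8, rng=None):
    """Sample n candidates and return the one with the highest reward."""
    rng = rng or random.Random(0)
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy example: the "model" emits a random integer, the "reward" prefers values close to 10.
sample_fn = lambda rng: rng.randint(0, 20)
reward_fn = lambda y: -abs(y - 10)
print(best_of_n(sample_fn, reward_fn, n=8))
```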
Minimum Bayes Risk and Multi-Sample Strategies (Oct 7)
Content:
- What do we get when we sample more?
- Minimum Bayes Risk and similar methods (illustrative sketch below)
Slides: TBA
Code: TBA
Reading Material
- Reference: It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk (arXiv)
- Reference: Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation (arXiv)
Assignments
- Homework 3 out: build an LLM system that has a code interpreter and small reward model and visualize the system; benchmark a set of variants of this method on the shared tasks
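A minimal illustrative Minimum Bayes Risk sketch: among sampled candidates, pick the one with the highest average utility against the rest of the pool. The Jaccard-overlap utility below is a hypothetical toy choice; real systems typically use metrics such as BLEURT or COMET:

```python
def mbr_decode(candidates, utility):
    """Return the candidate with the highest mean utility against all other candidates."""
    best, best_score = None, float("-inf")
    for i, y in enumerate(candidates):
        score = sum(utility(y, candidates[j]) for j in range(len(candidates)) if j != i)
        score /= len(candidates) - 1
        if score > best_score:
            best, best_score = y, score
    return best

def jaccard(a, b):
    """Toy utility: word-set overlap between two whitespace-tokenized strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

samples = ["the cat sat", "the cat sat down", "a dog barked", "the cat sat"]
print(mbr_decode(samples, jaccard))
```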
Systems not Models (Oct 9)
Guest Lecturer: Omar Khattab
Content:
- Parallels to older “pipeline NLP”
- Ensembling
- Visualizing and evaluating systems
- Human-in-the-loop decoding
- Brief discussion of HCI perspectives
Slides: TBA
Code: TBA
Reading Material
No Class - Fall Break (Oct 14)
No Class - Fall Break (Oct 16)
Inference Scaling vs Model Size (Oct 21)
Content:
- Inference scaling versus scaling model size
- Differences in cost and latency considerations
- Modeling scaling behavior
Slides: TBA
Code: TBA
Reading Material
- Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)
Token Budgets and Training-Time Distillation (Oct 23)
Content:
- Token budgets
- Training-time distillation of inference algorithms
- Draft CoT
- Early exit voting
Slides: TBA
Code: TBA
Reading Material
Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)
Reference: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)
Reference: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding (arXiv)
Reference: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods (arXiv)
Diffusion Models (Oct 28)
Content:
- Introduction to diffusion models
- Denoising diffusion probabilistic models (DDPM) (reverse-step equation below)
- Score-based generative models
- Diffusion models for text generation
- Comparison with autoregressive models
- Inference techniques for diffusion models
- Applications in multimodal generation
Slides: TBA
Code: TBA
Reading Material
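For the DDPM bullet above, the standard ancestral-sampling (reverse) step is reproduced here as background rather than course material; here epsilon_theta is the learned noise predictor and alpha_t, sigma_t come from the noise schedule:

```latex
% DDPM reverse step (Ho et al., 2020): starting from x_T ~ N(0, I), repeatedly denoise for t = T, ..., 1.
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, I), \quad \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s
```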
Defining Efficiency (Oct 30)
Content:
- How do we define efficiency?
- Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
- Brief review of hardware for inference
Slides: TBA
Code: TBA
Reading Material
No Class - Democracy Day (Nov 4)
Inference and Hardware (Nov 6)
Content:
- Overview of hardware relevant to LLM inference (GPUs, TPUs, accelerators)
- Memory bandwidth, compute, and latency considerations
- Parallelism strategies and deployment tradeoffs
Slides: TBA
Code: TBA
Reading Material
Library Implementations and Optimizations (Nov 11)
Content:
- Library implementations
- Lazy softmax (streaming-softmax sketch below)
- Flash attention
- How do vLLM, SGLang, and similar libraries speed up generation?
Slides: TBA
Code: TBA
Reading Material
- Reference: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv)
- Reference: Self-attention Does Not Need O(n²) Memory (arXiv)
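A minimal illustrative sketch of the "lazy"/online softmax trick behind memory-efficient attention and FlashAttention: accumulate the softmax-weighted sum in one pass using a running max and normalizer, never materializing the full probability vector. Real kernels process tiles of keys/values rather than single items:

```python
import numpy as np

def streaming_softmax_weighted_sum(scores, values):
    """Compute softmax(scores) @ values in one pass with a running max and normalizer."""
    running_max = float("-inf")
    normalizer = 0.0
    acc = np.zeros_like(values[0], dtype=np.float64)
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        scale = np.exp(running_max - new_max)  # rescale previous accumulations
        acc = acc * scale + np.exp(s - new_max) * v
        normalizer = normalizer * scale + np.exp(s - new_max)
        running_max = new_max
    return acc / normalizer

scores = np.array([0.5, 2.0, -1.0])
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(streaming_softmax_weighted_sum(scores, values))
# Reference computation with the full probability vector:
print(np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum() @ values)
```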
Prefix Sharing and KV Cache Optimizations (Nov 13)
Content:
- Prefix sharing
- KV cache reuse (illustrative sketch below)
- Key-value cache compression
- Model compression
- Brief quantization overview
Slides: TBA
Code: TBA
Reading Material
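A minimal illustrative KV-cache step: the newest query attends over the cached keys/values plus the new token's key/value, and the cache grows by one entry; with prefix sharing, requests that begin with the same prompt can reuse the same cached prefix. Shapes and values below are hypothetical:

```python
import numpy as np

def attention_with_cache(q, new_k, new_v, cache_k, cache_v):
    """Single-head, single-query attention step that appends to and reuses a KV cache."""
    k = np.concatenate([cache_k, new_k[None, :]], axis=0)
    v = np.concatenate([cache_v, new_v[None, :]], axis=0)
    scores = k @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v, k, v

# Hypothetical tiny example: a 3-token shared prefix cached once, then one new token.
d = 4
rng = np.random.default_rng(0)
cache_k, cache_v = rng.normal(size=(3, d)), rng.normal(size=(3, d))
out, cache_k, cache_v = attention_with_cache(
    rng.normal(size=d), rng.normal(size=d), rng.normal(size=d), cache_k, cache_v
)
print(out.shape, cache_k.shape)  # (4,) (4, 4)
```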
Draft Models and Speculative Decoding (Nov 18)
Content:
- Draft models
- Speculative decoding (acceptance-rule sketch below)
- Other latency improving methods
Slides: TBA
Code: TBA
Reading Material
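A minimal sketch of the speculative-decoding acceptance rule for a single drafted token: accept the draft with probability min(1, p/q) and otherwise resample from the residual max(p - q, 0), which leaves the output distributed exactly according to the target model. The target (p) and draft (q) distributions below are hypothetical:

```python
import numpy as np

def speculative_accept_step(p, q, proposed, rng=None):
    """Accept/reject one drafted token; on rejection, resample from the residual distribution."""
    rng = rng or np.random.default_rng()
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if rng.random() < min(1.0, p[proposed] / q[proposed]):
        return proposed, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

# Hypothetical target (p) and draft (q) next-token distributions over a 4-token vocabulary.
p = [0.5, 0.2, 0.2, 0.1]
q = [0.3, 0.4, 0.2, 0.1]
print(speculative_accept_step(p, q, proposed=1))
```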
Linearizing Attention and Sparse Models (Nov 20)
Transformer Alternatives (Nov 25)
Content:
- Transformer alternatives (state-space recurrence sketch below)
Slides: TBA
Code: TBA
Reading Material
- Reference: The Annotated S4
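A minimal illustrative sketch of the discretized linear state-space recurrence at the heart of S4-style transformer alternatives; the matrices below are hypothetical toy values, whereas a real S4 layer uses structured, learned parameters and a convolutional view for training:

```python
import numpy as np

def ssm_scan(A, B, C, inputs):
    """Linear state-space recurrence: x_k = A x_{k-1} + B u_k, y_k = C x_k.

    Because the recurrence is linear and time-invariant, it supports O(1)-per-token
    recurrent inference (and an equivalent convolutional form for training).
    """
    state = np.zeros(A.shape[0])
    outputs = []
    for u in inputs:
        state = A @ state + B * u
        outputs.append(C @ state)
    return np.array(outputs)

# Hypothetical toy parameters.
A = np.array([[0.9, 0.0], [0.1, 0.8]])
B = np.array([1.0, 0.5])
C = np.array([0.3, 0.7])
print(ssm_scan(A, B, C, inputs=[1.0, 0.0, 0.0, 1.0]))
```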