Schedule
Introduction to Language Models and Inference (Aug 26)
Content:
- What is a language model?
- What is an inference algorithm?
- What will we not cover?
- What are transformers?
- How do modern LMs work?
- Modeling errors and search errors
- Prompting as a means of model control
- Instruction following behavior
Slides: TBA
Code: TBA
Reading Material
- Required: The Illustrated Transformer (Jay Alammar)
- Required: Sections 1 and 2 of From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Probability Review and Shared Task Introduction (Aug 28)
Content:
- Probability review
- Review of using models: APIs, local models, etc.
- Introduction to the shared task: 2-3 tasks with multiple evaluations
- Given a fixed model and token/compute budgets, get the best possible outputs on the unseen test set
Slides: TBA
Code: TBA
Reading Material
- Required: Probability Review from Rob Hall (or similar; we will make our own slides)
Assignments
- Homework 1 out: primarily a math homework, with a small coding section running inference on a few models with Hugging Face / vLLM to confirm the setup works (a minimal Hugging Face example follows this entry)
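As a minimal sketch of the kind of Hugging Face inference the setup check asks for, the snippet below scores a sentence under a small causal LM; the checkpoint name (gpt2) and helper name are illustrative choices, not course requirements.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM works for the setup check.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sequence_logprob(text):
    """Sum of per-token log P(x_t | x_<t) under the model
    (the first token is treated as given)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                          # [1, seq_len, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)    # position t predicts token t+1
    target = ids[:, 1:]
    token_scores = logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_scores.sum().item()

print(sequence_logprob("The capital of France is Paris."))
```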
Common Sampling Methods for Modern NLP (Sep 2)
Content:
- Common sampling methods for modern NLP (see the sampling sketch at the end of this entry)
- Diversity-quality tradeoffs
Slides: TBA
Code: TBA
Reading Material
- Required: A Thorough Examination of Decoding Methods in the Era of LLMs
- Required: Trading Off Diversity and Quality in Natural Language Generation
- Optional: Truncation Sampling as Language Model Desmoothing
- Optional: Calibration of Pre-trained Transformers
- Optional: Locally Typical Sampling
- Optional: Forking Paths in Neural Text Generation
- Optional: An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search
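A minimal NumPy sketch of the sampling methods covered here: temperature scaling plus optional top-k and top-p (nucleus) truncation of a single next-token logit vector. The interface is illustrative, not taken from any assigned reading.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample one token id from next-token logits with temperature scaling,
    then optional top-k and top-p (nucleus) truncation."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:
        # Zero out everything outside the k most probable tokens.
        kth = np.sort(probs)[-min(top_k, probs.size)]
        probs = np.where(probs >= kth, probs, 0.0)

    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative mass reaches top_p.
        order = np.argsort(-probs)
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        mask = np.zeros_like(probs)
        mask[order[:cutoff]] = 1.0
        probs *= mask

    probs /= probs.sum()
    return int(rng.choice(probs.size, p=probs))
```

With top_k and top_p set to None this reduces to plain temperature sampling; tightening either knob is one concrete face of the diversity-quality tradeoff above.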
Beam Search and Variants (Sep 4)
Content:
- Beam search and variants (see the sketch at the end of this entry)
- Inadequacies of the mode
Slides: TBA
Code: TBA
Reading Material
- TBA
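A minimal beam search sketch for this entry, written against an assumed `next_token_logprobs(prefix)` callable rather than a particular library; the length normalization at the end is one common variant.

```python
import heapq

def beam_search(next_token_logprobs, bos_id, eos_id, beam_size=4, max_len=50):
    """Keep the `beam_size` highest-scoring open prefixes at each step.
    `next_token_logprobs(prefix)` is assumed to return (token_id, logprob) pairs."""
    beams = [(0.0, [bos_id])]               # (cumulative logprob, token prefix)
    finished = []

    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token_id, logprob in next_token_logprobs(prefix):
                new = (score + logprob, prefix + [token_id])
                (finished if token_id == eos_id else candidates).append(new)
        if not candidates:
            break
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])

    finished.extend(beams)                  # include any still-open hypotheses
    # Length-normalize so longer hypotheses are not unfairly penalized.
    return max(finished, key=lambda c: c[0] / max(len(c[1]) - 1, 1))
```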
Intro to A* and Best-First Search (Sep 9)
Content:
- Introduction to A* and best-first search (see the sketch at the end of this entry)
- A* methods for controlled generation
Slides: TBA
Code: TBA
Reading Material
- Required: Best-First Beam Search
- Required: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics
Assignments
- Homework 1 due
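A minimal best-first search sketch for this entry: prefixes are expanded in order of cumulative log-probability plus a pluggable heuristic, which is the slot that A*-style decoders (including constrained, lookahead-based variants) fill. The `next_token_logprobs` and `heuristic` interfaces are assumptions made for illustration.

```python
import heapq

def best_first_search(next_token_logprobs, heuristic, bos_id, eos_id,
                      max_expansions=1000, top_k=5):
    """Expand the highest-priority open prefix first, where
    priority = cumulative logprob + heuristic(prefix).
    With heuristic == 0 this is plain best-first decoding."""
    frontier = [(-heuristic([bos_id]), 0.0, [bos_id])]   # min-heap on negated priority
    for _ in range(max_expansions):
        if not frontier:
            break
        _, score, prefix = heapq.heappop(frontier)
        if prefix[-1] == eos_id:
            return score, prefix                          # first completed hypothesis
        best = sorted(next_token_logprobs(prefix), key=lambda x: -x[1])[:top_k]
        for token_id, logprob in best:
            new_prefix = prefix + [token_id]
            new_score = score + logprob
            heapq.heappush(frontier, (-(new_score + heuristic(new_prefix)),
                                      new_score, new_prefix))
    return None
```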
Other Controlled Generation Methods (Sep 11)
Content:
- Other controlled generation methods
- Decoding-time distributional modifiers (see the sketch at the end of this entry)
Slides: TBA
Code: TBA
Reading Material
- Required: FUDGE: Controlled Text Generation With Future Discriminators
- Required: Contrastive Search Is What You Need For Neural Text Generation
- Required: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
Assignments
- Homework 2 out: implementation of beam search, Mirostat, temperature, top-p, and top-k sampling, with a comparison on the shared tasks
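A minimal sketch of decoding-time distribution modification in the spirit of FUDGE and reward-augmented decoding: an external attribute scorer adjusts next-token logits before sampling. The `attribute_logprob` interface and the restriction to a candidate set are illustrative assumptions, not any paper's exact API.

```python
import numpy as np

def guided_next_token_logits(lm_logits, prefix_ids, candidate_ids,
                             attribute_logprob, weight=1.0):
    """Add a weighted external score to each candidate next token.
    `attribute_logprob(ids)` is assumed to return log P(attribute | ids)."""
    adjusted = np.array(lm_logits, dtype=np.float64)
    for token_id in candidate_ids:
        adjusted[token_id] += weight * attribute_logprob(prefix_ids + [token_id])
    return adjusted
```

In practice the external scorer is evaluated only on the top-k candidates per step, since each candidate costs an extra forward pass of the attribute model.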
Chain of Thought and Intermediate Steps (Sep 16)
Content:
- Chain of thought / scratchpad, intermediate steps
- Why does chain of thought work?
- Tree of thoughts
Slides: TBA
Code: TBA
Reading Material
- TBA
Self-Refine and Self-Correction Methods (Sep 18)
Content:
- Self-refine and self-correction methods
Slides: TBA
Code: TBA
Reading Material
- TBA
Reasoning Models (Sep 23)
Content:
- Reasoning models part one
Slides: TBA
Code: TBA
Reading Material
- TBA
Reasoning Models Part Two (Sep 25)
Content:
- Reasoning models part two: efficiency and attribution
Slides: TBA
Code: TBA
Reading Material
- TBA
Incorporating Tools (Sep 30)
Content:
- Incorporating tools: math/verification-based tools, search, etc. (see the sketch at the end of this entry)
Slides: TBA
Code: TBA
Reading Material
- TBA
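A toy sketch of the tool-incorporation loop this lecture discusses: the model emits a marker requesting a calculation, the system runs it and appends the result, and generation continues. The CALC[...] convention and the `generate` interface are invented for illustration.

```python
import re

def run_with_calculator(generate, prompt, max_rounds=4):
    """Alternate between model generation and a toy calculator tool.
    `generate(text)` is assumed to return a text continuation."""
    transcript = prompt
    for _ in range(max_rounds):
        continuation = generate(transcript)
        transcript += continuation
        match = re.search(r"CALC\[([^\]]+)\]", continuation)
        if match is None:
            break                                           # no tool call: done
        try:
            # Toy arithmetic evaluation only; a real system would sandbox this.
            result = eval(match.group(1), {"__builtins__": {}})
        except Exception:
            result = "error"
        transcript += f"\nRESULT: {result}\n"
    return transcript
```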
Agents and Multi-Agent Communication (Oct 2)
Content:
- Agents and multi-agent communication
Slides: TBA
Code: TBA
Reading Material
- TBA
Minimum Bayes Risk and Multi-Sample Strategies (Oct 7)
Content:
- What do we get when we sample more?
- Minimum Bayes Risk and similar methods (see the sketch at the end of this entry)
Slides: TBA
Code: TBA
Reading Material
- Required: It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk
- Required: Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
Assignments
- Homework 2 due
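A minimal sketch of sampling-based minimum Bayes risk decoding: each sampled candidate is scored by its average utility against the full sample set (a Monte Carlo estimate of expected utility under the model), and the highest-scoring candidate wins. The unigram-F1 utility is a toy stand-in for whatever metric the task uses.

```python
def mbr_select(candidates, utility):
    """Return the candidate with the highest estimated expected utility."""
    def expected_utility(hyp):
        return sum(utility(hyp, ref) for ref in candidates) / len(candidates)
    return max(candidates, key=expected_utility)

def unigram_f1(hyp, ref):
    """Toy utility: unigram F1 over whitespace tokens."""
    h, r = hyp.split(), ref.split()
    common = len(set(h) & set(r))
    if not h or not r or common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)
```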
Reward Models and Best-of-N (Oct 9)
Content:
- Reward models and best-of-n in theory and practice (see the sketch at the end of this entry)
- Monte Carlo Tree Search
Slides: TBA
Code: TBA
Reading Material
- Required: Why reward models are key for alignment (by Nathan Lambert)
- Required: Theoretical guarantees on the best-of-n alignment policy
Assignments
- Homework 3 out: build an LLM system with a code interpreter and a small reward model, visualize the system, and benchmark a set of variants of this method on the shared tasks
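A minimal best-of-n sketch: draw n independent samples and keep the one an external reward model prefers. `sample` and `reward_model` are assumed interfaces, roughly the pieces Homework 3 asks you to wire together.

```python
def best_of_n(prompt, sample, reward_model, n=16):
    """Best-of-n (rejection sampling against a reward model).
    `sample(prompt)` draws one completion; `reward_model(prompt, completion)`
    returns a scalar score."""
    completions = [sample(prompt) for _ in range(n)]
    return max(completions, key=lambda c: reward_model(prompt, c))
```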
Systems not Models (Oct 21)
Content:
- Parallels to older “pipeline NLP”
- Ensembling
- Visualizing and evaluating systems
- Human-in-the-loop decoding
- Brief discussion of HCI perspectives
Slides: TBA
Code: TBA
Reading Material
- TBA
Inference Scaling vs Model Size (Oct 23)
Content:
- Inference scaling versus scaling model size
- Differences in cost and latency considerations
- Modeling scaling behavior
Slides: TBA
Code: TBA
Reading Material
- TBA
Using External Verifiers (Oct 28)
Content:
- Using external verifiers
- Can reasoning be a verifiable game?
- Can LLMs learn to plan?
- o1 / DeepSeek-R1 / similar models
Slides: TBA
Code: TBA
Reading Material
- TBA
Token Budgets and Training-Time Distillation (Oct 30)
Content:
- Token budgets
- Training-time distillation of inference algorithms
- Draft CoT
- Early exit voting
Slides: TBA
Code: TBA
Reading Material
- Required: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Required: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
- Required: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding
- Required: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods
No Class - Democracy Day (Nov 4)
Defining Efficiency (Nov 6)
Content:
- How do we define efficiency?
- Different places where a method can be efficient (e.g., memory, latency, token cost for APIs); a worked KV-cache example follows this entry
- Brief review of hardware for inference
Slides: TBA
Code: TBA
Reading Material
- Required: Transformer Inference Arithmetic
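A worked example in the spirit of the inference-arithmetic reading: the key/value cache holds two tensors per layer of shape [batch, kv_heads, seq_len, head_dim], so its size is easy to compute by hand. The model shape below is an illustrative 7B-class configuration, not a specific released checkpoint.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size: 2 tensors (K and V) per layer, each
    [batch, n_kv_heads, seq_len, head_dim], at bytes_per_value (2 for fp16/bf16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, 4k context, fp16.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.2f} GiB per sequence")   # 2.00 GiB
```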
Library Implementations and Optimizations (Nov 11)
Content:
- Library implementations
- Lazy softmax
- Flash attention (see the online-softmax sketch at the end of this entry)
- How do vLLM, SGLang, and similar engines speed up generation?
Slides: TBA
Code: TBA
Reading Material
- Required: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Required: Self-attention Does Not Need O(n²) Memory
Assignments
- Homework 3 due
- Homework 4 out: transformer hardware math, implement speculative decoding and KV caching
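A NumPy sketch of the online ("lazy") softmax recurrence that memory-efficient attention kernels such as FlashAttention build on: scores are processed block by block with a running max and normalizer, so the full attention matrix is never materialized. Single-query illustration, not a fused kernel.

```python
import numpy as np

def streaming_attention(q, K, V, block=128):
    """Attention output for one query `q` against K/V processed in blocks.
    Shapes: q [d], K [n, d], V [n, d_v]."""
    d = q.shape[0]
    running_max, denom = -np.inf, 0.0
    acc = np.zeros(V.shape[1], dtype=np.float64)

    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        scores = k_blk @ q / np.sqrt(d)
        new_max = max(running_max, scores.max())
        # Rescale the previous accumulator and normalizer to the new max.
        correction = np.exp(running_max - new_max) if np.isfinite(running_max) else 0.0
        weights = np.exp(scores - new_max)
        acc = acc * correction + weights @ v_blk
        denom = denom * correction + weights.sum()
        running_max = new_max

    return acc / denom
```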
Prefix Sharing and KV Cache Optimizations (Nov 13)
Content:
- Prefix sharing (see the sketch at the end of this entry)
- KV cache reuse
- Key-value cache compression
- Model compression
- Brief quantization overview
Slides: TBA
Code: TBA
Reading Material
- TBA
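A toy sketch of prefix sharing: store per-prefix KV states and reuse the longest cached prefix of an incoming request. Real engines share fixed-size blocks or pages of the KV cache rather than whole prefixes; the class below is purely illustrative.

```python
class PrefixKVCache:
    """Map token-id prefixes to their cached KV states (toy version)."""

    def __init__(self):
        self._store = {}                           # tuple(token_ids) -> kv_state

    def put(self, token_ids, kv_state):
        self._store[tuple(token_ids)] = kv_state

    def longest_match(self, token_ids):
        """Return (matched_length, kv_state) for the longest cached prefix,
        so only the remaining tokens need a fresh forward pass."""
        best_len, best_state = 0, None
        for prefix, state in self._store.items():
            n = len(prefix)
            if n > best_len and tuple(token_ids[:n]) == prefix:
                best_len, best_state = n, state
        return best_len, best_state
```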
Draft Models and Speculative Decoding (Nov 18)
Content:
- Draft models
- Speculative decoding (see the sketch at the end of this entry)
- Other latency-improving methods
Slides: TBA
Code: TBA
Reading Material
- TBA
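A simplified sketch of one speculative decoding step: the draft model proposes k tokens, the target model verifies them, each proposal is accepted with probability min(1, p_target / p_draft), and the first rejection triggers a resample from the residual distribution. The `draft_probs`/`target_probs` interfaces are assumptions, and the bonus token drawn when every proposal is accepted is omitted for brevity.

```python
import numpy as np

def speculative_step(prefix, draft_probs, target_probs, k=4, rng=None):
    """One draft-and-verify step. `draft_probs(ids)` / `target_probs(ids)`
    are assumed to return the next-token distribution as a 1-D array."""
    rng = rng or np.random.default_rng()

    proposed, draft_dists = [], []
    for _ in range(k):
        p_d = draft_probs(prefix + proposed)
        tok = int(rng.choice(p_d.size, p=p_d))
        proposed.append(tok)
        draft_dists.append(p_d)

    accepted = []
    for i, tok in enumerate(proposed):
        p_t = target_probs(prefix + accepted)    # in practice: one batched target pass
        p_d = draft_dists[i]
        if rng.random() < min(1.0, p_t[tok] / p_d[tok]):
            accepted.append(tok)                 # proposal kept
        else:
            residual = np.maximum(p_t - p_d, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(residual.size, p=residual)))
            break                                # stop at the first rejection
    return accepted
```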
Linearizing Attention and Sparse Models (Nov 20)
Content:
- Linearizing attention (see the sketch at the end of this entry)
- Sparse models
Slides: TBA
Code: TBA
Reading Material
- TBA
Assignments
- Test set released for shared task (without labels)
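A minimal (non-causal) linearized attention sketch: replacing softmax(QK^T)V with phi(Q)(phi(K)^T V) drops the cost from O(n² d) to O(n d²). The feature map is an illustrative choice, not the one from any specific paper.

```python
import numpy as np

def linear_attention(Q, K, V, feature_map=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Linearized attention without softmax. Shapes: Q, K [n, d]; V [n, d_v]."""
    Qf, Kf = feature_map(Q), feature_map(K)      # [n, d]
    KV = Kf.T @ V                                # [d, d_v], computed once
    normalizer = Qf @ Kf.sum(axis=0)             # [n]
    return (Qf @ KV) / normalizer[:, None]
```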
Transformer Alternatives (Nov 25)
Content:
- Transformer alternatives
Slides: TBA
Code: TBA
Reading Material
- Required: The Annotated S4
Assignments
- Shared task systems due: code plus validation and test outputs
No Class - Thanksgiving (Nov 27)
Wrapup and Historical Perspective (Dec 2)
Content:
- Wrapup/catchup time
- A bit of a historical view: what was decoding like five or ten years ago?
- What does the future hold for decoding?
Slides: TBA
Code: TBA
Reading Material
- Required: Learning to Reason with LLMs
- Required: Lattice-based Viterbi decoding techniques for speech translation
Assignments
- Homework 4 due
Shared Task Results and Poster Sessions (Dec 4)
Content:
- Shared task results
- Poster sessions
Slides: N/A
Code: N/A
Assignments
- Shared task report due