Schedule
Introduction to Language Models and Inference (Aug 26)
Content:
- What is a language model?
- What is an inference algorithm?
- What will we not cover?
- What are transformers?
- How do modern LMs work?
Slides: TBA
Code: TBA
Reading Material
- Required: The Illustrated Transformer (Jay Alammar)
- Required: Sections 1 and 2 of From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Assignments
- Homework 1 out: math homework with a small coding section to run inference on models with Hugging Face / vLLM
Probability Review and Shared Task Introduction (Aug 28)
Content:
- Probability review
- Review of using models: APIs, local models, etc.
- Introduction to shared task: 2-3 tasks with multiple evaluations
- Goal: given a fixed model and token/compute budgets, get the best possible outputs on the unseen test set
Slides: TBA
Code: TBA
Reading Material
- Required: Probability Review from Rob Hall (or similar)
Assignments
- Homework 1 out: primarily math homework, with a small coding section to run inference on a few models with Hugging Face / vLLM to confirm the setup is working
Common Sampling Methods for Modern NLP (Sep 2)
Content:
- Common sampling methods for modern NLP
- Diversity-quality tradeoffs
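
As a rough illustration of the kind of methods this session covers (not course-provided code), the sketch below applies temperature scaling, top-k truncation, and top-p (nucleus) truncation to a toy list of logits. The function names and the example logits are assumptions made for illustration; real libraries operate on batched tensors rather than Python lists.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalize(probs, keep)


def sample(probs, rng=random):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical next-token logits
    probs = softmax(toy_logits, temperature=0.7)
    print(sample(top_p_filter(probs, p=0.9)))
```

Lower temperatures and smaller k or p concentrate probability on fewer tokens, which tends to trade diversity for quality.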
Slides: TBA
Code: TBA
Reading Material
Beam Search and Variants (Sep 4)
Content:
- Beam search and variants
- Inadequacies of the mode
Slides: TBA
Code: TBA
Reading Material
Intro to A* and Best-First Search (Sep 9)
Content:
- Introduction to A* and best-first search
- A* methods for controlled generation
Slides: TBA
Code: TBA
Reading Material
- Required: Best-First Beam Search
- Required: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics
Assignments
- Homework 1 due
Other Controlled Generation Methods (Sep 11)
Content:
- Other controlled generation methods
- Decoding-time distributional modifiers
Slides: TBA
Code: TBA
Reading Material
- Required: FUDGE: Controlled Text Generation With Future Discriminators
- Required: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
Assignments
- Homework 2 out: implement beam search, Mirostat, and temperature, top-p, and top-k sampling, and compare them on the shared tasks
Prompting as a Means of Model Control (Sep 16)
Content:
- Prompting as a means of model control
- Instruction following behavior
Slides: TBA
Code: TBA
Reading Material
- TBA
Chain of Thought and Intermediate Steps (Sep 18)
Content:
- Chain of thought / scratchpad
- Intermediate steps
- Why does chain of thought work?
Slides: TBA
Code: TBA
Reading Material
Self-Refine and Self-Correction Methods (Sep 23)
Content:
- Self-refine and self-correction methods
Slides: TBA
Code: TBA
Reading Material
Monte Carlo Tree Search (Sep 25)
Content:
- Monte Carlo Tree Search
- Tree of thoughts
Slides: TBA
Code: TBA
Reading Material
Minimum Bayes Risk (Sep 30)
Content:
- Minimum Bayes Risk and similar methods
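
As a rough illustration of the idea behind Minimum Bayes Risk decoding (not course-provided code), the sketch below picks, from a set of sampled candidates, the one with the highest average utility when the other samples are treated as pseudo-references. The unigram-F1 utility, the function names, and the example strings are assumptions chosen for simplicity; in practice the utility is usually a task metric or a learned one.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Toy utility: F1 over whitespace-separated token overlap."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=unigram_f1):
    """Return the candidate with the highest mean utility against the others."""
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best


if __name__ == "__main__":
    samples = [  # hypothetical model samples for one input
        "the cat sat on the mat",
        "a cat sat on the mat",
        "the cat is on the mat",
        "dogs bark loudly",
    ]
    print(mbr_select(samples))
```

The selected output is the one that agrees most with the rest of the samples, which need not be the single most probable sequence.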
Slides: TBA
Code: TBA
Reading Material
Reward Models (Oct 2)
Content:
- Reward models
- Best-of-n theory and practice
Slides: TBA
Code: TBA
Reading Material
- Required: Why reward models are key for alignment (by Nathan Lambert)
- Required: Theoretical guarantees on the best-of-n alignment policy
Incorporating Tools (Oct 7)
Content:
- Incorporating tools: math/verification based, search, etc.
Slides: TBA
Code: TBA
Reading Material
- Required: Toolformer: Language Models Can Teach Themselves to Use Tools
- Required: What Are Tools Anyway? A Survey from the Language Model Perspective
Assignments
- Homework 2 due
Agents and Multi-Agent Communication (Oct 9)
Content:
- Agents and multi-agent communication
Slides: TBA
Code: TBA
Reading Material
- Required: DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
- Required: A Survey on LLM-based Multi-Agent Systems: Workflow, Infrastructure, and Applications
Assignments
- Homework 3 out: build an LLM system with a code interpreter and a small reward model, and visualize the system; benchmark several variants of this system on the shared tasks
Systems not Models (Oct 21)
Content:
- Parallels to older “pipeline NLP”
- Ensembling
- Visualizing and evaluating systems
- Human-in-the-loop decoding
- Brief discussion of HCI perspectives
Slides: TBA
Code: TBA
Reading Material
Inference Scaling vs Model Size (Oct 23)
Content:
- Inference scaling versus scaling model size
- Differences in cost and latency considerations
- Modeling scaling behavior
Slides: TBA
Code: TBA
Reading Material
Using External Verifiers (Oct 28)
Content:
- Using external verifiers
- Can reasoning be a verifiable game?
- Can LLMs learn to plan?
- o1 / DeepSeek-R1 / similar
Slides: TBA
Code: TBA
Reading Material
Token Budgets (Oct 30)
Content:
- Token budgets
- Training-time distillation of inference algorithms
- Draft CoT
- Early exit voting
Slides: TBA
Code: TBA
Reading Material
- Required: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Required: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
- Required: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding
- Required: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods
Defining Efficiency (Nov 6)
Content:
- How do we define efficiency?
- Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
- Brief review of hardware for inference
Slides: TBA
Code: TBA
Reading Material
- Required: Transformer Inference Arithmetic
Library Implementations and Optimizations (Nov 11)
Content:
- Library implementations
- Lazy softmax
- Flash attention
- How do vLLM/SGLang/similar speed up generation?
Slides: TBA
Code: TBA
Reading Material
- Required: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Required: Self-attention Does Not Need O(n^2) Memory
Assignments
- Homework 3 due
- Homework 4 out: transformer hardware math, implement speculative decoding and KV caching
Prefix Sharing and KV Cache Optimizations (Nov 13)
Content:
- Prefix sharing
- KV cache reuse
- Key-value cache compression
- Model compression
- Brief quantization overview
Slides: TBA
Code: TBA
Reading Material
Draft Models and Speculative Decoding (Nov 18)
Content:
- Draft models
- Speculative decoding
- Other latency-improving methods
Slides: TBA
Code: TBA
Reading Material
Linearizing Attention and Sparse Models (Nov 20)
Content:
- Linearizing attention
- Sparse models
Slides: TBA
Code: TBA
Reading Material
- TBA
Assignments
- Test set released for shared task (without labels)
Transformer Alternatives (Nov 25)
Content:
- Transformer alternatives
Slides: TBA
Code: TBA
Reading Material
- Required: The Annotated S4
Assignments
- Shared task systems due: code plus validation and test outputs
Wrapup and Historical Perspective (Dec 2)
Content:
- Wrapup/catchup time
- A brief historical view: what was decoding like five or ten years ago?
- What does the future hold for decoding?
Slides: TBA
Code: TBA
Reading Material
- Required: Learning to Reason with LLMs
- Required: Lattice-based Viterbi decoding techniques for speech translation
Assignments
- Homework 4 due
Shared Task Results and Poster Sessions (Dec 4)
Content:
- Shared task results
- Poster sessions
Slides: N/A
Code: N/A
Assignments
- Shared task report due