Schedule
Introduction to Language Models and Inference (Aug 26)
Content:
- What is a language model?
- What is an inference algorithm?
- What will we not cover?
- What are transformers?
- How do modern LMs work?
- Modeling errors and search errors
- Prompting as a means of model control
- Instruction following behavior
Slides: TBA
Code: TBA
Reading Material
- Required: The Illustrated Transformer (Jay Alammar)
- Required: Sections 1 and 2 of From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Probability Review and Shared Task Introduction (Aug 28)
Content:
- Probability review
- Review of using models: APIs, local models, etc.
- Introduction to the shared task: 2-3 tasks with multiple evaluations
- Given a fixed model and token/compute budgets, get the best possible outputs on the unseen test set
Slides: TBA
Code: TBA
Reading Material
- Required: Probability Review from Rob Hall (or similar; we will make our own slides)
Assignments
- Homework 1 out: primarily a math homework, with a small coding section running inference on a few models with Hugging Face / vLLM to confirm the setup works (a minimal Hugging Face example follows this entry)
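As a minimal sketch of the kind of Hugging Face inference the setup check asks for, the snippet below scores a sentence under a small causal LM; the checkpoint name (gpt2) and helper name are illustrative choices, not course requirements.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM works for the setup check.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sequence_logprob(text):
    """Sum of per-token log P(x_t | x_<t) under the model
    (the first token is treated as given)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                          # [1, seq_len, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)    # position t predicts token t+1
    target = ids[:, 1:]
    token_scores = logprobs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_scores.sum().item()

print(sequence_logprob("The capital of France is Paris."))
```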
Common Sampling Methods for Modern NLP (Sep 2)
Content:
- Common sampling methods for modern NLP (see the sampling sketch at the end of this entry)
- Diversity-quality tradeoffs
Slides: TBA
Code: TBA
Reading Material
- Required: A Thorough Examination of Decoding Methods in the Era of LLMs
- Required: Trading Off Diversity and Quality in Natural Language Generation
- Optional: Truncation Sampling as Language Model Desmoothing
- Optional: Calibration of Pre-trained Transformers
- Optional: Locally Typical Sampling
- Optional: Forking Paths in Neural Text Generation
- Optional: An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search
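A minimal NumPy sketch of the sampling methods covered here: temperature scaling plus optional top-k and top-p (nucleus) truncation of a single next-token logit vector. The interface is illustrative, not taken from any assigned reading.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample one token id from next-token logits with temperature scaling,
    then optional top-k and top-p (nucleus) truncation."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:
        # Zero out everything outside the k most probable tokens.
        kth = np.sort(probs)[-min(top_k, probs.size)]
        probs = np.where(probs >= kth, probs, 0.0)

    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative mass reaches top_p.
        order = np.argsort(-probs)
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        mask = np.zeros_like(probs)
        mask[order[:cutoff]] = 1.0
        probs *= mask

    probs /= probs.sum()
    return int(rng.choice(probs.size, p=probs))
```

With top_k and top_p set to None this reduces to plain temperature sampling; tightening either knob is one concrete face of the diversity-quality tradeoff above.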
Beam Search and Variants (Sep 4)
Content:
- Beam search and variants (see the sketch at the end of this entry)
- Inadequacies of the mode
Slides: TBA
Code: TBA
Reading Material
- TBA
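A minimal beam search sketch for this entry, written against an assumed `next_token_logprobs(prefix)` callable rather than a particular library; the length normalization at the end is one common variant.

```python
import heapq

def beam_search(next_token_logprobs, bos_id, eos_id, beam_size=4, max_len=50):
    """Keep the `beam_size` highest-scoring open prefixes at each step.
    `next_token_logprobs(prefix)` is assumed to return (token_id, logprob) pairs."""
    beams = [(0.0, [bos_id])]               # (cumulative logprob, token prefix)
    finished = []

    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token_id, logprob in next_token_logprobs(prefix):
                new = (score + logprob, prefix + [token_id])
                (finished if token_id == eos_id else candidates).append(new)
        if not candidates:
            break
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])

    finished.extend(beams)                  # include any still-open hypotheses
    # Length-normalize so longer hypotheses are not unfairly penalized.
    return max(finished, key=lambda c: c[0] / max(len(c[1]) - 1, 1))
```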
Intro to A* and Best-First Search (Sep 9)
Content:
- Introduction to A* and best-first search (see the sketch at the end of this entry)
- A* methods for controlled generation
Slides: TBA
Code: TBA
Reading Material
- Required: Best-First Beam Search
- Required: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics
Assignments
- Homework 1 due
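A minimal best-first search sketch for this entry: prefixes are expanded in order of cumulative log-probability plus a pluggable heuristic, which is the slot that A*-style decoders (including constrained, lookahead-based variants) fill. The `next_token_logprobs` and `heuristic` interfaces are assumptions made for illustration.

```python
import heapq

def best_first_search(next_token_logprobs, heuristic, bos_id, eos_id,
                      max_expansions=1000, top_k=5):
    """Expand the highest-priority open prefix first, where
    priority = cumulative logprob + heuristic(prefix).
    With heuristic == 0 this is plain best-first decoding."""
    frontier = [(-heuristic([bos_id]), 0.0, [bos_id])]   # min-heap on negated priority
    for _ in range(max_expansions):
        if not frontier:
            break
        _, score, prefix = heapq.heappop(frontier)
        if prefix[-1] == eos_id:
            return score, prefix                          # first completed hypothesis
        best = sorted(next_token_logprobs(prefix), key=lambda x: -x[1])[:top_k]
        for token_id, logprob in best:
            new_prefix = prefix + [token_id]
            new_score = score + logprob
            heapq.heappush(frontier, (-(new_score + heuristic(new_prefix)),
                                      new_score, new_prefix))
    return None
```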
Other Controlled Generation Methods (Sep 11)
Content:
- Other controlled generation methods
- Decoding-time distributional modifiers (see the sketch at the end of this entry)
Slides: TBA
Code: TBA
Reading Material
- Required: FUDGE: Controlled Text Generation With Future Discriminators
- Required: Contrastive Search Is What You Need For Neural Text Generation
- Required: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
Assignments
- Homework 2 out: implementation of beam search, Mirostat, temperature, top-p, and top-k sampling, with a comparison on the shared tasks
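A minimal sketch of decoding-time distribution modification in the spirit of FUDGE and reward-augmented decoding: an external attribute scorer adjusts next-token logits before sampling. The `attribute_logprob` interface and the restriction to a candidate set are illustrative assumptions, not any paper's exact API.

```python
import numpy as np

def guided_next_token_logits(lm_logits, prefix_ids, candidate_ids,
                             attribute_logprob, weight=1.0):
    """Add a weighted external score to each candidate next token.
    `attribute_logprob(ids)` is assumed to return log P(attribute | ids)."""
    adjusted = np.array(lm_logits, dtype=np.float64)
    for token_id in candidate_ids:
        adjusted[token_id] += weight * attribute_logprob(prefix_ids + [token_id])
    return adjusted
```

In practice the external scorer is evaluated only on the top-k candidates per step, since each candidate costs an extra forward pass of the attribute model.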
Chain of Thought and Intermediate Steps (Sep 16)
Content:
- Chain of thought / scratchpad, intermediate steps
- Why does chain of thought work?
- Tree of thoughts
Slides: TBA
Code: TBA
Reading Material
- TBA
Self-Refine and Self-Correction Methods (Sep 18)
Content:
- Self-refine and self-correction methods
Slides: TBA
Code: TBA
Reading Material
- TBA
Reasoning Models (Sep 23)
Content:
- Reasoning models part one
Slides: TBA
Code: TBA
Reading Material
- TBA
Reasoning Models Part Two (Sep 25)
Content:
- Reasoning models part two: efficiency and attribution
Slides: TBA
Code: TBA
Reading Material
- TBA
Incorporating Tools (Sep 30)
Content:
- Incorporating tools: math/verification-based tools, search, etc. (see the sketch at the end of this entry)
Slides: TBA
Code: TBA
Reading Material
- TBA
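A toy sketch of the tool-incorporation loop this lecture discusses: the model emits a marker requesting a calculation, the system runs it and appends the result, and generation continues. The CALC[...] convention and the `generate` interface are invented for illustration.

```python
import re

def run_with_calculator(generate, prompt, max_rounds=4):
    """Alternate between model generation and a toy calculator tool.
    `generate(text)` is assumed to return a text continuation."""
    transcript = prompt
    for _ in range(max_rounds):
        continuation = generate(transcript)
        transcript += continuation
        match = re.search(r"CALC\[([^\]]+)\]", continuation)
        if match is None:
            break                                           # no tool call: done
        try:
            # Toy arithmetic evaluation only; a real system would sandbox this.
            result = eval(match.group(1), {"__builtins__": {}})
        except Exception:
            result = "error"
        transcript += f"\nRESULT: {result}\n"
    return transcript
```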
Agents and Multi-Agent Communication (Oct 2)
Content:
- Agents and multi-agent communication
Slides: TBA
Code: TBA
Reading Material
- TBA
Minimum Bayes Risk and Multi-Sample Strategies (Oct 7)
Content:
- What do we get when we sample more?
- Minimum Bayes Risk and similar methods (see the sketch at the end of this entry)
Slides: TBA
Code: TBA
Reading Material
- Required: It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk
- Required: Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
Assignments
- Homework 2 due
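A minimal sketch of sampling-based minimum Bayes risk decoding: each sampled candidate is scored by its average utility against the full sample set (a Monte Carlo estimate of expected utility under the model), and the highest-scoring candidate wins. The unigram-F1 utility is a toy stand-in for whatever metric the task uses.

```python
def mbr_select(candidates, utility):
    """Return the candidate with the highest estimated expected utility."""
    def expected_utility(hyp):
        return sum(utility(hyp, ref) for ref in candidates) / len(candidates)
    return max(candidates, key=expected_utility)

def unigram_f1(hyp, ref):
    """Toy utility: unigram F1 over whitespace tokens."""
    h, r = hyp.split(), ref.split()
    common = len(set(h) & set(r))
    if not h or not r or common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)
```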
Reward Models and Best-of-N (Oct 9)
Content:
- Reward models and best-of-n in theory and practice (see the sketch at the end of this entry)
- Monte Carlo Tree Search
Slides: TBA
Code: TBA
Reading Material
- Required: Why reward models are key for alignment (by Nathan Lambert)
- Required: Theoretical guarantees on the best-of-n alignment policy
Assignments
- Homework 3 out: build an LLM system with a code interpreter and a small reward model, visualize the system, and benchmark a set of variants of this method on the shared tasks
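A minimal best-of-n sketch: draw n independent samples and keep the one an external reward model prefers. `sample` and `reward_model` are assumed interfaces, roughly the pieces Homework 3 asks you to wire together.

```python
def best_of_n(prompt, sample, reward_model, n=16):
    """Best-of-n (rejection sampling against a reward model).
    `sample(prompt)` draws one completion; `reward_model(prompt, completion)`
    returns a scalar score."""
    completions = [sample(prompt) for _ in range(n)]
    return max(completions, key=lambda c: reward_model(prompt, c))
```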
Systems not Models (Oct 21)
Content:
- Parallels to older “pipeline NLP”
- Ensembling
- Visualizing and evaluating systems
- Human-in-the-loop decoding
- Brief discussion of HCI perspectives
Slides: TBA
Code: TBA
Reading Material
- TBA
Inference Scaling vs Model Size (Oct 23)
Content:
- Inference scaling versus scaling model size
- Differences in cost and latency considerations
- Modeling scaling behavior
Slides: TBA
Code: TBA
Reading Material
- TBA
Using External Verifiers (Oct 28)
Content:
- Using external verifiers
- Can reasoning be a verifiable game?
- Can LLMs learn to plan?
- o1 / DeepSeek-R1 / similar models
Slides: TBA
Code: TBA
Reading Material
- TBA
Token Budgets and Training-Time Distillation (Oct 30)
Content:
- Token budgets
- Training-time distillation of inference algorithms
- Draft CoT
- Early exit voting
Slides: TBA
Code: TBA
Reading Material
- Required: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Required: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
- Required: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding
- Required: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods
No Class - Democracy Day (Nov 4)
Defining Efficiency (Nov 6)
Content:
- How do we define efficiency?
- Different places where a method can be efficient (e.g., memory, latency, token cost for APIs); a worked KV-cache example follows this entry
- Brief review of hardware for inference
Slides: TBA
Code: TBA
Reading Material
- Required: Transformer Inference Arithmetic
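A worked example in the spirit of the inference-arithmetic reading: the key/value cache holds two tensors per layer of shape [batch, kv_heads, seq_len, head_dim], so its size is easy to compute by hand. The model shape below is an illustrative 7B-class configuration, not a specific released checkpoint.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size: 2 tensors (K and V) per layer, each
    [batch, n_kv_heads, seq_len, head_dim], at bytes_per_value (2 for fp16/bf16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, 4k context, fp16.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.2f} GiB per sequence")   # 2.00 GiB
```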
Library Implementations and Optimizations (Nov 11)
Content:
- Library implementations
- Lazy softmax
- Flash attention (see the online-softmax sketch at the end of this entry)
- How do vLLM, SGLang, and similar engines speed up generation?
Slides: TBA
Code: TBA
Reading Material
- Required: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Required: Self-attention Does Not Need O(n²) Memory
Assignments
- Homework 3 due
- Homework 4 out: transformer hardware math, implement speculative decoding and KV caching
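A NumPy sketch of the online ("lazy") softmax recurrence that memory-efficient attention kernels such as FlashAttention build on: scores are processed block by block with a running max and normalizer, so the full attention matrix is never materialized. Single-query illustration, not a fused kernel.

```python
import numpy as np

def streaming_attention(q, K, V, block=128):
    """Attention output for one query `q` against K/V processed in blocks.
    Shapes: q [d], K [n, d], V [n, d_v]."""
    d = q.shape[0]
    running_max, denom = -np.inf, 0.0
    acc = np.zeros(V.shape[1], dtype=np.float64)

    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        scores = k_blk @ q / np.sqrt(d)
        new_max = max(running_max, scores.max())
        # Rescale the previous accumulator and normalizer to the new max.
        correction = np.exp(running_max - new_max) if np.isfinite(running_max) else 0.0
        weights = np.exp(scores - new_max)
        acc = acc * correction + weights @ v_blk
        denom = denom * correction + weights.sum()
        running_max = new_max

    return acc / denom
```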
Prefix Sharing and KV Cache Optimizations (Nov 13)
Content:
- Prefix sharing (see the sketch at the end of this entry)
- KV cache reuse
- Key-value cache compression
- Model compression
- Brief quantization overview
Slides: TBA
Code: TBA
Reading Material
- TBA
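A toy sketch of prefix sharing: store per-prefix KV states and reuse the longest cached prefix of an incoming request. Real engines share fixed-size blocks or pages of the KV cache rather than whole prefixes; the class below is purely illustrative.

```python
class PrefixKVCache:
    """Map token-id prefixes to their cached KV states (toy version)."""

    def __init__(self):
        self._store = {}                           # tuple(token_ids) -> kv_state

    def put(self, token_ids, kv_state):
        self._store[tuple(token_ids)] = kv_state

    def longest_match(self, token_ids):
        """Return (matched_length, kv_state) for the longest cached prefix,
        so only the remaining tokens need a fresh forward pass."""
        best_len, best_state = 0, None
        for prefix, state in self._store.items():
            n = len(prefix)
            if n > best_len and tuple(token_ids[:n]) == prefix:
                best_len, best_state = n, state
        return best_len, best_state
```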
Draft Models and Speculative Decoding (Nov 18)
Content:
- Draft models
- Speculative decoding (see the sketch at the end of this entry)
- Other latency-improving methods
Slides: TBA
Code: TBA
Reading Material
- TBA
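A simplified sketch of one speculative decoding step: the draft model proposes k tokens, the target model verifies them, each proposal is accepted with probability min(1, p_target / p_draft), and the first rejection triggers a resample from the residual distribution. The `draft_probs`/`target_probs` interfaces are assumptions, and the bonus token drawn when every proposal is accepted is omitted for brevity.

```python
import numpy as np

def speculative_step(prefix, draft_probs, target_probs, k=4, rng=None):
    """One draft-and-verify step. `draft_probs(ids)` / `target_probs(ids)`
    are assumed to return the next-token distribution as a 1-D array."""
    rng = rng or np.random.default_rng()

    proposed, draft_dists = [], []
    for _ in range(k):
        p_d = draft_probs(prefix + proposed)
        tok = int(rng.choice(p_d.size, p=p_d))
        proposed.append(tok)
        draft_dists.append(p_d)

    accepted = []
    for i, tok in enumerate(proposed):
        p_t = target_probs(prefix + accepted)    # in practice: one batched target pass
        p_d = draft_dists[i]
        if rng.random() < min(1.0, p_t[tok] / p_d[tok]):
            accepted.append(tok)                 # proposal kept
        else:
            residual = np.maximum(p_t - p_d, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(residual.size, p=residual)))
            break                                # stop at the first rejection
    return accepted
```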
Linearizing Attention and Sparse Models (Nov 20)
Content:
- Linearizing attention (see the sketch at the end of this entry)
- Sparse models
Slides: TBA
Code: TBA
Reading Material
- TBA
Assignments
- Test set released for shared task (without labels)
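A minimal (non-causal) linearized attention sketch: replacing softmax(QK^T)V with phi(Q)(phi(K)^T V) drops the cost from O(n² d) to O(n d²). The feature map is an illustrative choice, not the one from any specific paper.

```python
import numpy as np

def linear_attention(Q, K, V, feature_map=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Linearized attention without softmax. Shapes: Q, K [n, d]; V [n, d_v]."""
    Qf, Kf = feature_map(Q), feature_map(K)      # [n, d]
    KV = Kf.T @ V                                # [d, d_v], computed once
    normalizer = Qf @ Kf.sum(axis=0)             # [n]
    return (Qf @ KV) / normalizer[:, None]
```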
Transformer Alternatives (Nov 25)
Content:
- Transformer alternatives
Slides: TBA
Code: TBA
Reading Material
- Required: The Annotated S4
Assignments
- Shared task systems due: code plus validation and test outputs
No Class - Thanksgiving (Nov 27)
Wrapup and Historical Perspective (Dec 2)
Content:
- Wrapup/catchup time
- A bit of a historical view: what was decoding like five or ten years ago?
- What does the future hold for decoding?
Slides: TBA
Code: TBA
Reading Material
- Required: Learning to Reason with LLMs
- Required: Lattice-based Viterbi decoding techniques for speech translation
Assignments
- Homework 4 due
Shared Task Results and Poster Sessions (Dec 4)
Content:
- Shared task results
- Poster sessions
Slides: N/A
Code: N/A
Assignments
- Shared task report due