Schedule

Date	Topic
26-Aug	Introduction to Language Models and Inference HW1 Released
28-Aug	Probability Review and Shared Task Introduction
02-Sep	Common Sampling Methods for Modern NLP
04-Sep	Beam Search and Variants
09-Sep	Intro to A* and Best First Search HW1 Due
11-Sep	Other Controlled Generation Methods HW2 Released
16-Sep	Prompting as a Means of Model Control
18-Sep	Chain of Thought and Intermediate Steps
23-Sep	Self-Refine and Self-Correction Methods
25-Sep	Monte Carlo Tree Search
30-Sep	Minimum Bayes Risk
02-Oct	Reward Models
07-Oct	Incorporating Tools HW2 Due
09-Oct	Agents and Multi-Agent Communication HW3 Released
21-Oct	Systems not Models
23-Oct	Inference Scaling vs Model Size
28-Oct	Using External Verifiers
30-Oct	Token Budgets
06-Nov	Defining Efficiency
11-Nov	Library Implementations and Optimizations HW3 Due HW4 Released
13-Nov	Prefix Sharing and KV Cache Optimizations
18-Nov	Draft Models and Speculative Decoding
20-Nov	Linearizing Attention and Sparse Models
25-Nov	Transformer Alternatives Shared Task Systems Due
02-Dec	Wrapup and Historical Perspective HW4 Due
04-Dec	Shared Task Results and Poster Sessions Shared Task Report Due

Introduction to Language Models and Inference (Aug 26)

Content:

What is a language model?
What is an inference algorithm?
What will we not cover?
What are transformers?
How do modern LMs work?

Slides: TBA

Code: TBA

Reading Material

Required: The Illustrated Transformer (Jay Alammar)
Required: Sections 1+2 from From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

Assignments

HW1 Released: Math homework with small coding section to run inference on models with Hugging Face / VLLM

Probability Review and Shared Task Introduction (Aug 28)

Content:

Probability review
Using models review – APIs, local models, etc.
Introduction to shared task: 2-3 tasks with multiple evaluations
Given a fixed model and token/compute budgets, get the best possible outputs on the unseen test set

Slides: TBA

Code: TBA

Reading Material

Required: Probability Review from Rob Hall (or similar)

Assignments

Homework 1 out: primarily math homework, with some small coding section to run inference on a few models with Hugging Face / VLLM to show setup is working

Common Sampling Methods for Modern NLP (Sep 2)

Content:

Common sampling methods for modern NLP
Diversity-quality tradeoffs

Slides: TBA

Code: TBA

Reading Material

Required: A Thorough Examination of Decoding Methods in the Era of LLMs
Required: Trading Off Diversity and Quality in Natural Language Generation

Beam Search and Variants (Sep 4)

Content:

Beam search and variants
Inadequacies of the mode

Slides: TBA

Code: TBA

Reading Material

Required: Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
Required: If beam search is the answer, what was the question?

Intro to A* and Best First Search (Sep 9)

Content:

Introduction to A* and best first search
A* methods for controlled generation

Slides: TBA

Code: TBA

Reading Material

Required: Best-First Beam Search
Required: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics

Assignments

Homework 1 due

Other Controlled Generation Methods (Sep 11)

Content:

Other controlled generation methods
Decoding-time distributional modifiers

Slides: TBA

Code: TBA

Reading Material

Assignments

Homework 2 out: implementation of beam search, mirostat, temperature, top-p, top-k sampling and comparison on the shared tasks

Prompting as a Means of Model Control (Sep 16)

Content:

Prompting as a means of model control
Instruction following behavior

Slides: TBA

Code: TBA

Reading Material

Chain of Thought and Intermediate Steps (Sep 18)

Content:

Chain of thought / scratchpad
Intermediate steps
Why does chain of thought work?

Slides: TBA

Code: TBA

Reading Material

Required: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Self-Refine and Self-Correction Methods (Sep 23)

Content:

Self-refine and self-correction methods

Slides: TBA

Code: TBA

Reading Material

Required: PAL: Program-aided Language Models
Required: Self-Refine: Iterative Refinement with Self-Feedback

Monte Carlo Tree Search (Sep 25)

Content:

Monte Carlo Tree Search
Tree of thoughts

Slides: TBA

Code: TBA

Reading Material

Required: Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Minimum Bayes Risk (Sep 30)

Content:

Minimum Bayes Risk and similar methods

Slides: TBA

Code: TBA

Reading Material

Reward Models (Oct 2)

Content:

Reward models
Best-of-n theory and practice

Slides: TBA

Code: TBA

Reading Material

Required: Why reward models are key for alignment (by Nathan Lambert)
Required: Theoretical guarantees on the best-of-n alignment policy

Incorporating Tools (Oct 7)

Content:

Incorporating tools: math/verification based, search, etc.

Slides: TBA

Code: TBA

Reading Material

Required: Toolformer: Language Models Can Teach Themselves to Use Tools
Required: What Are Tools Anyway? A Survey from the Language Model Perspective

Assignments

Homework 2 due

Agents and Multi-Agent Communication (Oct 9)

Content:

Agents and multi-agent communication

Slides: TBA

Code: TBA

Reading Material

Assignments

Homework 3 out: build an LLM system that has a code interpreter and small reward model and visualize the system; benchmark a set of variants of this method on the shared tasks

Systems not Models (Oct 21)

Content:

Parallels to older “pipeline NLP”
Ensembling
Visualizing and evaluating systems
Human-in-the-loop decoding
Brief discussion of HCI perspectives

Slides: TBA

Code: TBA

Reading Material

Required: The Shift from Models to Compound AI Systems

Inference Scaling vs Model Size (Oct 23)

Content:

Inference scaling versus scaling model size
Differences in cost and latency considerations
Modeling scaling behavior

Slides: TBA

Code: TBA

Reading Material

Required: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Using External Verifiers (Oct 28)

Content:

Using external verifiers
Can reasoning be a verifiable game?
Can LLMs learn to plan?
O1 / DeepSeek-R1 / similar

Slides: TBA

Code: TBA

Reading Material

Required: Stream of Search (SoS): Learning to Search in Language
Required: Awesome-o1
Required: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Token Budgets (Oct 30)

Content:

Token budgets
Training-time distillation of inference algorithms
Draft CoT
Early exit voting

Slides: TBA

Code: TBA

Reading Material

Defining Efficiency (Nov 6)

Content:

How do we define efficiency?
Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
Brief review of hardware for inference

Slides: TBA

Code: TBA

Reading Material

Required: Transformer Inference Arithmetic

Library Implementations and Optimizations (Nov 11)

Content:

Library implementations
Lazy softmax
Flash attention
How do vLLM/SGLang/similar speed up generation?

Slides: TBA

Code: TBA

Reading Material

Required: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Required: SELF-ATTENTION DOES NOT NEED O(n2) MEMORY

Assignments

Homework 3 due
Homework 4 out: transformer hardware math, implement speculative decoding and KV caching

Content:

Prefix sharing
KV cache reuse
Key-value cache compression
Model compression
Brief quantization overview

Slides: TBA

Code: TBA

Reading Material

Draft Models and Speculative Decoding (Nov 18)

Content:

Draft models
Speculative decoding
Other latency improving methods

Slides: TBA

Code: TBA

Reading Material

Required: Fast Inference from Transformers via Speculative Decoding
Required: A Hitchhiker’s Guide to Speculative Decoding

Linearizing Attention and Sparse Models (Nov 20)

Content:

Linearizing attention
Sparse models

Slides: TBA

Code: TBA

Reading Material

Assignments

Test set released for shared task (without labels)

Transformer Alternatives (Nov 25)

Content:

Transformer alternatives

Slides: TBA

Code: TBA

Reading Material

Required: The Annotated S4

Assignments

Shared task systems due: code plus validation and test outputs

Wrapup and Historical Perspective (Dec 2)

Content:

Wrapup/catchup time
A bit of a historical view: what was decoding like five, ten years ago?
What does the future hold for decoding?

Slides: TBA

Code: TBA

Reading Material

Required: Learning to Reason with LLMs
Required: Lattice-based Viterbi decoding techniques for speech translation

Assignments

Homework 4 due

Shared Task Results and Poster Sessions (Dec 4)

Content:

Shared task results
Poster sessions

Slides: N/A

Code: N/A

Assignments

Shared task report due

Schedule

Introduction to Language Models and Inference (Aug 26)

Probability Review and Shared Task Introduction (Aug 28)

Common Sampling Methods for Modern NLP (Sep 2)

Beam Search and Variants (Sep 4)

Intro to A* and Best First Search (Sep 9)

Other Controlled Generation Methods (Sep 11)

Prompting as a Means of Model Control (Sep 16)

Chain of Thought and Intermediate Steps (Sep 18)

Self-Refine and Self-Correction Methods (Sep 23)

Monte Carlo Tree Search (Sep 25)

Minimum Bayes Risk (Sep 30)

Reward Models (Oct 2)

Incorporating Tools (Oct 7)

Agents and Multi-Agent Communication (Oct 9)

Systems not Models (Oct 21)

Inference Scaling vs Model Size (Oct 23)

Using External Verifiers (Oct 28)

Token Budgets (Oct 30)

Defining Efficiency (Nov 6)

Library Implementations and Optimizations (Nov 11)

Prefix Sharing and KV Cache Optimizations (Nov 13)

Draft Models and Speculative Decoding (Nov 18)

Linearizing Attention and Sparse Models (Nov 20)

Transformer Alternatives (Nov 25)

Wrapup and Historical Perspective (Dec 2)

Shared Task Results and Poster Sessions (Dec 4)