Schedule

Introduction to Language Models and Inference (Aug 26)

Content:

  • What is a language model?
  • What is an inference algorithm?
  • What will we not cover?
  • What are transformers?
  • How do modern LMs work?

Slides: TBA

Code: TBA

Reading Material

Assignments

  • Homework 1 released: math homework with a small coding section to run inference on models with Hugging Face / vLLM

Probability Review and Shared Task Introduction (Aug 28)

Content:

  • Probability review
  • Review of using models: APIs, local models, etc.
  • Introduction to shared task: 2-3 tasks with multiple evaluations
  • Shared task goal: given a fixed model and token/compute budgets, get the best possible outputs on the unseen test set

Slides: TBA

Code: TBA

Reading Material

  • Required: Probability Review from Rob Hall (or similar)

Assignments

  • Homework 1 out: primarily math homework, with a small coding section to run inference on a few models with Hugging Face / vLLM to show that your setup is working

Common Sampling Methods for Modern NLP (Sep 2)

Content:

  • Common sampling methods for modern NLP
  • Diversity-quality tradeoffs

Slides: TBA

Code: TBA

Reading Material

Beam Search and Variants (Sep 4)

Content:

  • Beam search and variants
  • Inadequacies of the mode

Slides: TBA

Code: TBA

Reading Material

Intro to A* and Best First Search (Sep 9)

Content:

  • Introduction to A* and best-first search
  • A* methods for controlled generation

Slides: TBA

Code: TBA

Reading Material

Assignments

  • Homework 1 due

Other Controlled Generation Methods (Sep 11)

Content:

  • Other controlled generation methods
  • Decoding-time distributional modifiers

Slides: TBA

Code: TBA

Reading Material

Assignments

  • Homework 2 out: implementation of beam search, Mirostat, temperature, top-p, and top-k sampling, and a comparison of these methods on the shared tasks

Prompting as a Means of Model Control (Sep 16)

Content:

  • Prompting as a means of model control
  • Instruction following behavior

Slides: TBA

Code: TBA

Reading Material

  • TBA

Chain of Thought and Intermediate Steps (Sep 18)

Content:

  • Chain of thought / scratchpad
  • Intermediate steps
  • Why does chain of thought work?

Slides: TBA

Code: TBA

Reading Material

Self-Refine and Self-Correction Methods (Sep 23)

Content:

  • Self-refine and self-correction methods

Slides: TBA

Code: TBA

Reading Material

Monte Carlo Tree Search (Sep 25)

Content:

  • Monte Carlo Tree Search
  • Tree of thoughts

Slides: TBA

Code: TBA

Reading Material

Minimum Bayes Risk (Sep 30)

Reward Models (Oct 2)

Content:

  • Reward models
  • Best-of-n theory and practice

Slides: TBA

Code: TBA

Reading Material

Incorporating Tools (Oct 7)

Content:

  • Incorporating tools: math/verification based, search, etc.

Slides: TBA

Code: TBA

Reading Material

Assignments

  • Homework 2 due

Agents and Multi-Agent Communication (Oct 9)

Content:

  • Agents and multi-agent communication

Slides: TBA

Code: TBA

Reading Material

Assignments

  • Homework 3 out: build an LLM system with a code interpreter and a small reward model, visualize the system, and benchmark a set of variants of this system on the shared tasks

Systems not Models (Oct 21)

Content:

  • Parallels to older “pipeline NLP”
  • Ensembling
  • Visualizing and evaluating systems
  • Human-in-the-loop decoding
  • Brief discussion of HCI perspectives

Slides: TBA

Code: TBA

Reading Material

Inference Scaling vs Model Size (Oct 23)

Content:

  • Inference scaling versus scaling model size
  • Differences in cost and latency considerations
  • Modeling scaling behavior

Slides: TBA

Code: TBA

Reading Material

Using External Verifiers (Oct 28)

Content:

  • Using external verifiers
  • Can reasoning be a verifiable game?
  • Can LLMs learn to plan?
  • o1 / DeepSeek-R1 / similar

Slides: TBA

Code: TBA

Reading Material

Token Budgets (Oct 30)

Defining Efficiency (Nov 6)

Content:

  • How do we define efficiency?
  • Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
  • Brief review of hardware for inference

Slides: TBA

Code: TBA

Reading Material

Library Implementations and Optimizations (Nov 11)

Content:

  • Library implementations
  • Lazy softmax
  • Flash attention
  • How do vLLM/SGLang/similar speed up generation?

Slides: TBA

Code: TBA

Reading Material

Assignments

  • Homework 3 due
  • Homework 4 out: transformer hardware math; implementation of speculative decoding and KV caching

Prefix Sharing and KV Cache Optimizations (Nov 13)

Content:

  • Prefix sharing
  • KV cache reuse
  • KV cache compression
  • Model compression
  • Brief quantization overview

Slides: TBA

Code: TBA

Reading Material

Draft Models and Speculative Decoding (Nov 18)

Content:

  • Draft models
  • Speculative decoding
  • Other latency-improving methods

Slides: TBA

Code: TBA

Reading Material

Linearizing Attention and Sparse Models (Nov 20)

Content:

  • Linearizing attention
  • Sparse models

Slides: TBA

Code: TBA

Reading Material

  • TBA

Assignments

  • Test set released for shared task (without labels)

Transformer Alternatives (Nov 25)

Content:

  • Transformer alternatives

Slides: TBA

Code: TBA

Reading Material

Assignments

  • Shared task systems due: code plus validation and test outputs

Wrapup and Historical Perspective (Dec 2)

Content:

  • Wrapup/catchup time
  • A brief historical view: what was decoding like five or ten years ago?
  • What does the future hold for decoding?

Slides: TBA

Code: TBA

Reading Material

Assignments

  • Homework 4 due

Shared Task Results and Poster Sessions (Dec 4)

Content:

  • Shared task results
  • Poster sessions

Slides: N/A

Code: N/A

Assignments

  • Shared task report due