Schedule

Introduction to Language Models and Inference (Aug 26)

Content:

  • What is a language model?
  • What is an inference algorithm?
  • What will we not cover?
  • What are transformers?
  • How do modern LMs work? (see the sketch below)
  • Modeling errors and search errors
  • Prompting as a means of model control
  • Instruction following behavior
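
As a preview of the "How do modern LMs work?" bullet above, here is a minimal sketch of autoregressive (left-to-right) generation. The bigram table and names below are hypothetical stand-ins for a real neural LM, not course code:

```python
import random

# Toy "language model": next-token distributions conditioned only on the previous token.
# A real LM conditions on the full prefix via a neural network; this table is a stand-in.
NEXT_TOKEN_PROBS = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.3, "end": 0.2},
    "a":     {"cat": 0.4, "dog": 0.4, "end": 0.2},
    "cat":   {"sat": 0.7, "end": 0.3},
    "dog":   {"sat": 0.6, "end": 0.4},
    "sat":   {"end": 1.0},
}

def sample_sequence(max_len=10, seed=0):
    """Ancestral sampling: draw one token at a time from p(x_t | x_<t)."""
    rng = random.Random(seed)
    tokens = ["<bos>"]
    for _ in range(max_len):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        next_tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_tok == "end":
            break
        tokens.append(next_tok)
    return tokens[1:]

print(sample_sequence())
```
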
Slides: HTML / PDF

Code: Code

Reading Material

  • Reference: Sections 1-2 of From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models (arXiv)

Probability Review (Aug 28)

Content:

  • Probability review
  • Transformer implementation
  • Generation and evaluation
  • Meta-generation
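
Tying the probability-review and generation/evaluation bullets together, a minimal sketch of scoring a sequence with the chain rule, log p(y) = sum_t log p(y_t | y_<t); the toy conditional table is a hypothetical stand-in for a trained model:

```python
import math

# Toy conditional distributions p(y_t | previous token); a real model would
# condition on the whole prefix. Purely illustrative.
NEXT_TOKEN_PROBS = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
    "cat":   {"sat": 0.7, "<eos>": 0.3},
    "sat":   {"<eos>": 1.0},
}

def sequence_log_prob(tokens):
    """Chain rule: log p(y) = sum_t log p(y_t | y_<t)."""
    prev, total = "<bos>", 0.0
    for tok in tokens:
        total += math.log(NEXT_TOKEN_PROBS[prev][tok])
        prev = tok
    return total

print(sequence_log_prob(["the", "cat", "sat", "<eos>"]))  # log(0.6 * 0.5 * 0.7 * 1.0)
```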

Code: Code

Reading Material

None

Common Sampling Methods for Modern NLP (Sep 2)

Beam Search and Variants (Sep 4)

Intro to A* and Best First Search (Sep 9)

Content:

  • Introduction to A* and best first search
  • A* methods for controlled generation
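
A minimal sketch of best-first decoding with a priority queue, in the spirit of both bullets above; the toy log-probability table stands in for a real model, and no heuristic term is included (adding an estimate of the best achievable continuation score would turn this into A*):

```python
import heapq

# Toy next-token log-probabilities; a real decoder would query an LM here.
LOGPROBS = {
    (): {"the": -0.5, "a": -1.0},
    ("the",): {"cat": -0.7, "dog": -1.2, "<eos>": -2.0},
    ("a",): {"dog": -0.9, "<eos>": -1.5},
    ("the", "cat"): {"<eos>": -0.3},
    ("the", "dog"): {"<eos>": -0.4},
    ("a", "dog"): {"<eos>": -0.6},
}

def best_first_search(max_expansions=100):
    """Pop the highest-scoring partial hypothesis, expand it, repeat.
    Score = log p(prefix); adding a heuristic for the remaining cost gives A*."""
    frontier = [(-0.0, ())]  # (negative score, prefix); heapq is a min-heap
    for _ in range(max_expansions):
        neg_score, prefix = heapq.heappop(frontier)
        if prefix and prefix[-1] == "<eos>":
            return prefix, -neg_score
        for tok, lp in LOGPROBS.get(prefix, {}).items():
            heapq.heappush(frontier, (neg_score - lp, prefix + (tok,)))
    return None, float("-inf")

print(best_first_search())  # (('the', 'cat', '<eos>'), -1.5)
```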

Slides: TBA

Code: TBA

Reading Material

  • Reference: Best-First Beam Search (arXiv)
  • Reference: NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics (arXiv)

Assignments

Other Controlled Generation Methods (Sep 11)

Content:

  • Other controlled generation methods
  • Decoding-time distributional modifiers
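
For the "decoding-time distributional modifiers" bullet, a minimal sketch of FUDGE-style reweighting: each candidate token's log-probability is shifted by the log-probability an attribute classifier assigns to that continuation. The base distribution and classifier values below are hypothetical toy numbers:

```python
import math

# Toy next-token distribution from the base LM.
base_probs = {"happy": 0.2, "sad": 0.5, "neutral": 0.3}

# Toy attribute classifier: p(attribute = "positive" | prefix + candidate token).
# In FUDGE this is a lightweight discriminator run on each candidate continuation.
attr_probs = {"happy": 0.9, "sad": 0.05, "neutral": 0.4}

def modified_distribution(base, attr):
    """p(x_t | x_<t, attribute) is proportional to p(x_t | x_<t) * p(attribute | x_<=t)."""
    scores = {tok: math.log(base[tok]) + math.log(attr[tok]) for tok in base}
    z = sum(math.exp(s) for s in scores.values())
    return {tok: math.exp(s) / z for tok, s in scores.items()}

print(modified_distribution(base_probs, attr_probs))
# The "happy" token is boosted relative to the base distribution.
```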

Slides: TBA

Code: TBA

Reading Material

  • Reference: FUDGE: Controlled Text Generation With Future Discriminators (arXiv)
  • Reference: Contrastive Search Is What You Need For Neural Text Generation (arXiv)
  • Reference: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model (arXiv)

Assignments

  • Homework 2 out: implementation of beam search, Mirostat, temperature, top-p, and top-k sampling, with a comparison on the shared tasks
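
A minimal sketch of the sampling transforms named in Homework 2 (temperature, top-k, top-p) applied to a raw logit vector; numpy only, with hypothetical toy logits, and not the homework starter code:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def temperature_probs(logits, t=0.7):
    """Sharpen (t < 1) or flatten (t > 1) the distribution."""
    return softmax(logits / t)

def top_k_probs(logits, k=2):
    """Keep only the k highest-probability tokens, then renormalize."""
    probs = softmax(logits)
    cutoff = np.sort(probs)[-k]
    probs = np.where(probs >= cutoff, probs, 0.0)
    return probs / probs.sum()

def top_p_probs(logits, p=0.9):
    """Nucleus sampling: keep the smallest set of tokens whose mass reaches p."""
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, p)) + 1]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])          # hypothetical toy logits
for fn in (temperature_probs, top_k_probs, top_p_probs):
    probs = fn(logits)
    print(fn.__name__, probs, "sampled:", rng.choice(len(probs), p=probs))
```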

Chain of Thought and Intermediate Steps (Sep 16)

Content:

  • Chain of thought / scratchpad, intermediate steps
  • Why does chain of thought work?
  • Tree of thoughts
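
For the "Tree of thoughts" bullet, a minimal sketch of breadth-limited search over partial reasoning steps; the propose and value functions are placeholder stubs where a real system would call the LM:

```python
# Hypothetical stubs: in a real Tree-of-Thoughts system, both would be LM calls.
def propose_next_steps(partial_solution):
    """Propose candidate next reasoning steps for a partial solution."""
    return [partial_solution + [len(partial_solution) + delta] for delta in (1, 2, 3)]

def value(partial_solution):
    """Score how promising a partial solution looks (higher is better)."""
    return -abs(sum(partial_solution) - 10)  # toy target: steps summing to 10

def tree_of_thoughts(depth=3, breadth=2):
    """At each level, expand every kept state and retain the `breadth` best by value()."""
    frontier = [[]]
    for _ in range(depth):
        candidates = [nxt for state in frontier for nxt in propose_next_steps(state)]
        frontier = sorted(candidates, key=value, reverse=True)[:breadth]
    return max(frontier, key=value)

print(tree_of_thoughts())
```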

Slides: TBA

Code: TBA

Reading Material

  • Reference: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (arXiv)
  • Reference: Tree of Thoughts: Deliberate Problem Solving with Large Language Models (arXiv)

Self-Refine and Self-Correction Methods (Sep 18)

Content:

  • Self-refine and self-correction methods

Slides: TBA

Code: TBA

Reading Material

Reasoning Models (Sep 23)

Content:

  • Reasoning models

Slides: TBA

Code: TBA

Reading Material

Incorporating Tools (Sep 25)

Content:

  • Incorporating tools: math/verification-based tools, search, etc.

Slides: TBA

Code: TBA

Reading Material

Agents and Multi-Agent Communication (Sep 30)

Content:

  • Agents and multi-agent communication

Slides: TBA

Code: TBA

Reading Material

  • Reference: DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (arXiv)

  • Reference: A Survey on LLM-based Multi-Agent Systems: Workflow, Infrastructure, and Applications (Springer)

Reward Models and Best-of-N (Oct 2)

Content:

  • Reward models, best-of-n theory and practice
  • Monte Carlo Tree Search
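
For the best-of-n bullet, a minimal sketch: sample n candidates, score each with a reward model, and return the argmax. The generator and reward model below are hypothetical stubs, not a trained scorer:

```python
import random

rng = random.Random(0)

def generate_candidate(prompt):
    """Stub generator: a real system would sample a completion from the LM."""
    return prompt + " ... candidate " + str(rng.randint(0, 1000))

def reward_model(prompt, completion):
    """Stub reward model: a real one is a learned scorer; here, a random score."""
    return rng.random()

def best_of_n(prompt, n=8):
    candidates = [generate_candidate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

print(best_of_n("Explain KV caching in one sentence."))
```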

Slides: TBA

Code: TBA

Reading Material

  • Reference: Why reward models are key for alignment (by Nathan Lambert)

  • Reference: Theoretical guarantees on the best-of-n alignment policy (arXiv)

Assignments

Minimum Bayes Risk and Multi-Sample Strategies (Oct 7)

Content:

  • What do we get when we sample more?
  • Minimum Bayes Risk and similar methods
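
For the MBR bullet, a minimal sketch of sampling-based minimum Bayes risk: draw candidates, treat the same pool as pseudo-references, and keep the candidate with the highest average utility against the rest. Word overlap stands in for a real utility such as BLEU or COMET, and the sample list is hypothetical:

```python
# Hypothetical samples; a real system would draw these from the model.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran in the park",
    "the cat is on the mat",
]

def utility(hypothesis, reference):
    """Toy utility: word overlap (stand-in for BLEU, COMET, etc.)."""
    h, r = set(hypothesis.split()), set(reference.split())
    return len(h & r) / len(h | r)

def mbr_select(candidates):
    """Pick the candidate maximizing expected utility over pseudo-references."""
    def expected_utility(c):
        others = [r for r in candidates if r is not c]
        return sum(utility(c, r) for r in others) / len(others)
    return max(candidates, key=expected_utility)

print(mbr_select(samples))
```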

Slides: TBA

Code: TBA

Reading Material

  • Reference: It’s MBR All the Way Down: Modern Generation Techniques Through the Lens of Minimum Bayes Risk (arXiv)

  • Reference: Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation (arXiv)

Assignments

  • Homework 3 out: build an LLM system that has a code interpreter and small reward model and visualize the system; benchmark a set of variants of this method on the shared tasks

Systems not Models (Oct 9)

Guest Lecturer: Omar Khattab

Content:

  • Parallels to older “pipeline NLP”
  • Ensembling
  • Visualizing and evaluating systems
  • Human-in-the-loop decoding
  • Brief discussion of HCI perspectives

Slides: TBA

Code: TBA

Reading Material

Assignments


No Class - Fall Break (Oct 14)


No Class - Fall Break (Oct 16)


Inference Scaling vs Model Size (Oct 21)

Content:

  • Inference scaling versus scaling model size
  • Differences in cost and latency considerations
  • Modeling scaling behavior

Slides: TBA

Code: TBA

Reading Material

  • Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)

Token Budgets and Training-Time Distillation (Oct 23)

Content:

  • Token budgets
  • Training-time distillation of inference algorithms
  • Draft CoT
  • Early exit voting

Slides: TBA

Code: TBA

Reading Material

  • Reference: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv)

  • Reference: Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models (arXiv)

  • Reference: Direct Preference Optimization for Neural Machine Translation with Minimum Bayes Risk Decoding (arXiv)

  • Reference: MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods (arXiv)

Diffusion Models (Oct 28)

Content:

  • Introduction to diffusion models
  • Denoising diffusion probabilistic models (DDPM); see the sampling sketch below
  • Score-based generative models
  • Diffusion models for text generation
  • Comparison with autoregressive models
  • Inference techniques for diffusion models
  • Applications in multimodal generation
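
Expanding the DDPM bullet above, a minimal numpy sketch of the reverse-process (ancestral) sampling loop on toy 1-D data; the noise-prediction network is a placeholder stub, so this illustrates only the update rule, not a trained diffusion model:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(1e-4, 0.02, T)          # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x, t):
    """Placeholder for the learned noise-prediction network eps_theta(x_t, t).
    A trained model would go here; returning zeros just exercises the loop."""
    return np.zeros_like(x)

def ddpm_sample(shape=(4,)):
    """Ancestral sampling: start from pure noise, apply the reverse update T times."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps = predict_noise(x, t)
        coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

print(ddpm_sample())
```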

Slides: TBA

Code: TBA

Reading Material

  • Reference: Denoising Diffusion Probabilistic Models (Ho et al., 2020) (arXiv)
  • Reference: Diffusion Models Beat GANs on Image Synthesis (Dhariwal & Nichol, 2021) (arXiv)

  • Optional: Diffusion Models: A Comprehensive Survey of Methods and Applications (Yang et al., 2022) (arXiv)

Defining Efficiency (Oct 30)

Content:

  • How do we define efficiency?
  • Different places where a method can be efficient (e.g. memory, latency, token cost for APIs)
  • Brief review of hardware for inference

Slides: TBA

Code: TBA

Reading Material

No Class - Democracy Day (Nov 4)


Inference and Hardware (Nov 6)

Content:

  • Overview of hardware relevant to LLM inference (GPUs, TPUs, accelerators)
  • Memory bandwidth, compute, and latency considerations
  • Parallelism strategies and deployment tradeoffs
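
As a back-of-the-envelope companion to the memory-bandwidth bullet: single-stream decoding is typically memory-bound, so per-token latency is roughly the bytes of weights (plus KV cache) read per step divided by memory bandwidth. The numbers below are illustrative assumptions, not measurements:

```python
# Illustrative assumptions (not measured values).
params = 7e9                 # 7B-parameter model
bytes_per_param = 2          # fp16/bf16 weights
kv_cache_bytes = 2e9         # rough KV cache footprint for a long prefix, in bytes
bandwidth = 1.0e12           # ~1 TB/s HBM bandwidth

bytes_per_token = params * bytes_per_param + kv_cache_bytes
latency_s = bytes_per_token / bandwidth
print(f"~{latency_s * 1e3:.1f} ms/token, ~{1 / latency_s:.0f} tokens/s (memory-bound estimate)")
```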

Slides: TBA

Code: TBA

Reading Material

Library Implementations and Optimizations (Nov 11)

Content:

  • Library implementations
  • Lazy softmax
  • Flash attention
  • How do vLLM, SGLang, and similar engines speed up generation?
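
For the lazy-softmax and FlashAttention bullets, a minimal sketch of the online (streaming) softmax trick both rely on: keep a running max and normalizer so the attention-weighted sum can be accumulated in one pass without materializing the full score vector. Numpy only, with random toy scores and values:

```python
import numpy as np

def online_softmax_attention(scores, values):
    """Accumulate softmax(scores) @ values in one streaming pass.
    m = running max, l = running normalizer, acc = running weighted sum."""
    m, l = -np.inf, 0.0
    acc = np.zeros(values.shape[1])
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        l = l * scale + np.exp(s - m_new)
        acc = acc * scale + np.exp(s - m_new) * v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
scores = rng.standard_normal(16)        # q @ K^T for one query (toy)
values = rng.standard_normal((16, 4))   # V rows (toy)

reference = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ values
print(np.allclose(online_softmax_attention(scores, values), reference))  # True
```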

Slides: TBA

Code: TBA

Reading Material

  • Reference: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv)

  • Reference: Self-attention Does Not Need O(n²) Memory (arXiv)

Assignments

Prefix Sharing and KV Cache Optimizations (Nov 13)

Content:

  • Prefix sharing
  • KV cache reuse
  • Key-value cache compression
  • Model compression
  • Brief quantization overview
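
For the prefix-sharing and KV-cache-reuse bullets, a minimal numpy sketch of single-head decoding with a growing key/value cache: each step projects only the newest token and attends over the cached keys and values rather than recomputing them. The projection matrices and inputs are random stand-ins for a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(x_new, cache):
    """Project only the new token; append its K,V to the cache; attend over the cache."""
    q = x_new @ Wq
    cache["k"].append(x_new @ Wk)
    cache["v"].append(x_new @ Wv)
    K = np.stack(cache["k"])          # (t, d): all keys so far, never recomputed
    V = np.stack(cache["v"])
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V

cache = {"k": [], "v": []}
for t in range(5):                     # toy "token embeddings" for 5 decode steps
    out = decode_step(rng.standard_normal(d), cache)
print(out.shape, len(cache["k"]))      # (8,) 5
```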

Slides: TBA

Code: TBA

Reading Material

  • Reference: Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption (arXiv)

  • Reference: Model Compression and Efficient Inference for Large Language Models: A Survey (arXiv)

Draft Models and Speculative Decoding (Nov 18)

Content:

  • Draft models
  • Speculative decoding
  • Other latency improving methods
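
For the speculative-decoding bullet, a minimal sketch of the standard accept/reject rule on a toy vocabulary: the draft model proposes a token, it is accepted with probability min(1, p_target/p_draft), and on rejection a token is resampled from the normalized residual max(p_target - p_draft, 0). Both distributions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

p_target = np.array([0.5, 0.3, 0.15, 0.05])   # target model's next-token distribution (toy)
p_draft  = np.array([0.3, 0.4, 0.2, 0.1])     # cheap draft model's distribution (toy)

def speculative_step(p_target, p_draft):
    """One draft-then-verify step; the returned token is distributed as p_target."""
    proposal = rng.choice(len(p_draft), p=p_draft)
    if rng.random() < min(1.0, p_target[proposal] / p_draft[proposal]):
        return proposal, True                  # accepted draft token
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False   # resampled on rejection

# Empirically, accepted-or-resampled tokens follow the target distribution.
counts = np.zeros(4)
for _ in range(20000):
    tok, _ = speculative_step(p_target, p_draft)
    counts[tok] += 1
print(counts / counts.sum())   # approximately p_target
```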

Slides: TBA

Code: TBA

Reading Material

Linearizing Attention and Sparse Models (Nov 20)

Content:

  • Linearizing attention
  • Sparse models
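
For the linearizing-attention bullet, a minimal sketch of kernelized attention: replacing softmax with a feature map phi lets (phi(Q) phi(K)^T) V be re-associated as phi(Q) (phi(K)^T V), reducing cost from O(n²·d) to O(n·d²). Numpy only, with phi(x) = elu(x) + 1 as one common choice and random toy inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

def phi(x):
    """Feature map phi(x) = elu(x) + 1 (keeps values positive)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

# Quadratic ordering: (phi(Q) phi(K)^T) V   -- O(n^2 d)
weights = phi(Q) @ phi(K).T
quadratic = (weights @ V) / weights.sum(axis=1, keepdims=True)

# Linear ordering: phi(Q) (phi(K)^T V)      -- O(n d^2)
kv = phi(K).T @ V                       # (d, d)
z = phi(K).sum(axis=0)                  # (d,)
linear = (phi(Q) @ kv) / (phi(Q) @ z)[:, None]

print(np.allclose(quadratic, linear))   # True: same outputs, different cost
```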

Slides: TBA

Code: TBA

Reading Material

  • TBA

Assignments

Transformer Alternatives (Nov 25)

Content:

  • Transformer alternatives
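
As one concrete example of a transformer alternative (cf. The Annotated S4 below), a minimal sketch of the linear state-space recurrence that S4-style models discretize and parallelize: h_t = A h_{t-1} + B x_t, y_t = C h_t. The matrices here are random toy values, not a real S4 parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, seq_len = 4, 10

A = np.diag(rng.uniform(0.5, 0.95, state_dim))   # stable diagonal transition (toy)
B = rng.standard_normal(state_dim)
C = rng.standard_normal(state_dim)

def ssm_scan(x):
    """Sequential recurrence h_t = A h_{t-1} + B x_t, y_t = C . h_t.
    S4-style models use a structured A and compute this scan in parallel."""
    h = np.zeros(state_dim)
    ys = []
    for x_t in x:
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

x = rng.standard_normal(seq_len)     # toy 1-D input sequence
print(ssm_scan(x))
```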

Slides: TBA

Code: TBA

Reading Material

  • Reference: The Annotated S4

Assignments

No Class - Thanksgiving (Nov 27)


Shared Task Results and Poster Sessions (Dec 2)

Content:

  • Shared task results
  • Poster sessions

Slides: N/A

Code: N/A

Assignments