Example: Unconditioned Completions
Generated from BOS token (Qwen3-1.7B-Base):
- Highest Probability:
- “3.” (-4.22)
- “4.” (-4.50)
- “0)” (-7.01)
- Middle Probability:
- “8 \text{ dollars} - 5 \text{ dollars} = 3 \text{ dollars}” (-20.31)
- “3 \) because \( 2^3 = 8 \) and \( 2^{-3} = \frac{1}{8} \” (-22.53)
- “1) = (x + 2) - (x + 1) = 1” (-23.03)
- Lowest Probability:
- “2015—2016 学年度下学期期末统考 高三文科数学 2016.06” (-35.62) (translation: “2015–2016 school year, second-semester final unified exam, Grade 12 liberal-arts mathematics, 2016.06”)
- “为深入贯彻落实关于党史学习教育"办实事、开新局"的要求,按照公司党委关于党史学习教育的部署,7月6日,公司” (-47.32) (translation: “To thoroughly implement the ‘do practical things, open a new chapter’ requirements of Party history study and education, and in accordance with the company Party committee’s arrangements, on July 6 the company…”)
- “本篇博文主要针对在使用MySQL数据库进行数据库操作时遇到的问题进行记录和总结。包括:” (-56.30) (translation: “This blog post records and summarizes problems encountered when performing database operations with MySQL, including:”)
Example: Conditional Completions
Prompt: “The best thing about Carnegie Mellon University is”
- Highest Probability:
- " its diversity." (-5.00)
- " that it’s a community." (-9.87)
- Middle Probability:
- " its location and the vibrant community it has built." (-18.13)
- " that there’s always something else to learn." (-18.50)
- " that it gives you the ability to take your education to the next level." (-20.02)
- Lowest Probability:
- " its focus on the interdisciplinary approach, with students majoring in computer science,
mathematics and engineering." (-38.92)
- " the community that you get in it, with so many people doing so many different things." (-39.69)
- " not just its academics and research in engineering, computer science and business —
it’s the amazing, out-of-the ordinary experiences your students create in the classroom" (-68.35)
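The scores in parentheses are sums of per-token log-probabilities under the model. Below is a minimal sketch of how such a score can be computed with Hugging Face transformers; the hub id is assumed from the model name in the example above, the sketch assumes the prompt's tokenization is a prefix of the full sequence's tokenization, and exact values will depend on the setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen3-1.7B-Base"   # assumed hub id for the base model used above
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "The best thing about Carnegie Mellon University is"
completion = " its diversity."
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
full_ids = tok(prompt + completion, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(full_ids).logits                 # (1, seq_len, vocab_size)
log_probs = torch.log_softmax(logits, dim=-1)

# position t predicts token t + 1, so shift the targets left by one
targets = full_ids[:, 1:]
token_lp = log_probs[:, :-1].gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# sum log-probabilities over the completion tokens only
score = token_lp[:, prompt_len - 1:].sum().item()
print(score)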
FLOPs per Layer (GQA Transformer)
- Variables: \(L\) (sequence length in tokens), \(d\) (hidden size), \(H\) (query heads), \(H_{kv}\) (KV heads), \(d_h = d/H\) (head dimension), \(d_{ff}\) (FFN size)
- Self-attention FLOPs per layer (with GQA):
\[\underbrace{2L\,d^2}_{\text{Q proj}}
+\underbrace{2L\,d \cdot H_{kv} d_h}_{\text{K proj}}
+\underbrace{2L\,d \cdot H_{kv} d_h}_{\text{V proj}}
+\underbrace{4H\,L^2\,d_h}_{QK^T + AV}
+\underbrace{2L\,d^2}_{\text{output proj}}\]
- Feed-forward FLOPs per layer (SwiGLU):
\[\underbrace{2L\,d\,d_{ff}}_{\text{gate}}
+\underbrace{2L\,d\,d_{ff}}_{\text{up}}
+\underbrace{2L\,d_{ff}\,d}_{\text{down}}\]
e.g. Llama-3.1 Series
| Model | Parameters | Layers | Hidden Size | Attention Heads | KV Heads | MLP Size |
|---|---|---|---|---|---|---|
| Llama-3.1-8B | 8.0B | 32 | 4,096 | 32 | 8 | 14,336 |
| Llama-3.1-70B | 70.6B | 80 | 8,192 | 64 | 8 | 28,672 |
| Llama-3.1-405B | 405.9B | 126 | 16,384 | 128 | 8 | 53,248 |
(source: Hugging Face model cards for Llama-3.1-8B, 70B, and 405B)
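As a worked example (a minimal sketch, not a full cost model), the snippet below plugs the Llama-3.1-8B row into the per-layer formulas above; the sequence length L = 4,096 is an assumed value for illustration, not part of the model configuration.

def attention_flops(L: int, d: int, H: int, H_kv: int) -> int:
    d_h = d // H
    return (2 * L * d * d                    # Q projection
            + 2 * (2 * L * d * H_kv * d_h)   # K and V projections (GQA)
            + 4 * H * L * L * d_h            # QK^T and AV
            + 2 * L * d * d)                 # output projection

def ffn_flops(L: int, d: int, d_ff: int) -> int:
    return 3 * (2 * L * d * d_ff)            # gate, up, and down projections (SwiGLU)

# Llama-3.1-8B: d = 4096, H = 32, H_kv = 8, d_ff = 14336; assumed L = 4096 tokens
attn = attention_flops(4096, 4096, 32, 8)
ffn = ffn_flops(4096, 4096, 14336)
print(f"per layer: attention ~ {attn / 1e9:.0f} GFLOPs, FFN ~ {ffn / 1e9:.0f} GFLOPs")

At these values the feed-forward block accounts for roughly 70% of per-layer compute; the quadratic \(4H L^2 d_h\) attention-score term only begins to dominate at substantially longer sequence lengths.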
Basic Generation
- Also called “decoding” (a term with information-theoretic roots).
- Often done one token at a time, autoregressively. Simplified view:
def generate(model: Model, x: list[Token]) -> list[Token]:
    y = []
    while not done(y):       # while the termination condition is not reached
        p = model(x + y)     # distribution over the next token, P(y_t | x, y_<t)
        y.append(sample(p))  # sample or search for the next token (one concrete sample is sketched below)
    return y
- We will cover many variants in this course!
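One concrete (hypothetical) instantiation of the sample subroutine above is temperature sampling; the sketch below assumes p is a probability distribution over the vocabulary, indexed by token id.

import random

def sample(p: list[float], temperature: float = 1.0) -> int:
    # temperature-scale the distribution (T = 1 leaves it unchanged, T -> 0 approaches greedy);
    # equivalent to softmax(logits / T) when p = softmax(logits)
    weights = [p_i ** (1.0 / temperature) for p_i in p]
    total = sum(weights)
    return random.choices(range(len(p)), weights=[w / total for w in weights], k=1)[0]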
Meta-generation
- A class of algorithms that call sequence generation as a subroutine.
- Most common: reranking (a simple example reranker is sketched after this list).
def generate_and_rerank(model: Model, reranker: Reranker, candidate_count: int, x: list[Token]) -> list[Token]:
    # generate candidate_count candidate completions
    candidates = [generate(model, x) for _ in range(candidate_count)]
    # score each candidate with the reranker
    scores = [reranker(x, y) for y in candidates]
    # return the highest-scoring candidate
    return max(zip(scores, candidates), key=lambda pair: pair[0])[1]
- We will cover many other algorithms!
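For example, the reranker in generate_and_rerank above could simply be the model's own length-normalized log-probability of each candidate. This is a hypothetical sketch against the same abstract Model/Token interface as the pseudocode above; it assumes model(tokens) returns a next-token distribution that can be indexed by token id.

import math

def mean_logprob_reranker(model: Model, x: list[Token], y: list[Token]) -> float:
    # score candidate y by its average per-token log-probability under the model
    total = 0.0
    for t in range(len(y)):
        p = model(x + y[:t])          # distribution over the next token given the prefix so far
        total += math.log(p[y[t]])    # log-probability assigned to the actual next token in y
    return total / max(len(y), 1)

Plugged in as reranker=lambda x, y: mean_logprob_reranker(model, x, y), this turns generate_and_rerank into a simple best-of-N selection by model likelihood.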