Understanding Transformers from the Inside Out
This book teaches you how fairly modern AI systems work by building miniature versions of them yourself. I don’t want to hand-wave anything, because I’m learning this as we go too. Real math, straightforward code: that’s the goal.
Understanding Gradients¶
Using only basic Python (no NumPy, no PyTorch), we’ll compute every matrix multiplication, every activation function, every gradient. If you want to be pragmatic, you can skip this section and go to the next one. But if you want to reach for glory, and glory here means meticulous mathematical matrix multiplications, then get ready to calculate!
How text becomes vectors
What Query, Key, Value actually mean
The softmax-weighted sum that made transformers possible
Running parallel attention operations
The MLP that processes attended information
Stabilizing activations for training
Cross-entropy and the backward pass (sketched below)
How weights actually get updated
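To preview the flavor of this section, here is a minimal pure-Python sketch (no NumPy, no PyTorch, as promised) of the softmax, the cross-entropy loss, and the gradient of that loss with respect to the logits. The function names are mine, not necessarily the ones used in the chapter:

```python
import math

def softmax(logits):
    # shift by the max logit for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    # negative log-likelihood of the correct class
    return -math.log(probs[target])

def grad_wrt_logits(probs, target):
    # the classic result: d(loss)/d(logit_i) = p_i - 1[i == target]
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

probs = softmax([2.0, 1.0, 0.1])
print(cross_entropy(probs, target=0))    # the loss
print(grad_wrt_logits(probs, target=0))  # the backward pass, by hand
```

Every step of the transformer reduces to operations like these, just at larger scale.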
Building a Transformer¶
This is our transformer. There are many like it, but this one is ours. This section shows you how to build a complete GPT-style transformer in PyTorch. All the heavy lifting we did in the last section is now hidden behind simple calls like backward(). It covers the architecture that powers modern language models (circa 2023), from embeddings to interpretability tools. In the end, you’ll have a new toy.
Token embeddings, ALiBi, RoPE
Scaled dot-product attention with causal masking (sketched below)
Parallel attention heads
Pre-LN, residuals, and all components combined
Gradient accumulation, validation splits
Fast inference through KV caching (sketched below)
Logit lens, attention analysis, induction heads
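As a preview of the attention line above, here is a minimal PyTorch sketch of scaled dot-product attention with a causal mask. It is roughly the operation this section builds up to; the book’s version differs in details like multi-head reshaping:

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, seq, d_head) tensors
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, seq, seq)
    seq = q.size(-2)
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # hide future tokens
    weights = F.softmax(scores, dim=-1)               # attention weights
    return weights @ v                                # weighted sum of values

out = causal_attention(torch.randn(1, 4, 8), torch.randn(1, 4, 8), torch.randn(1, 4, 8))
```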
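And for the KV caching line: the idea is to keep the keys and values from earlier decoding steps around, so each new token only computes its own k and v instead of recomputing the whole prefix. A rough sketch, with a class name of my own invention:

```python
import torch

class KVCache:
    """Append-only cache of keys and values along the sequence dimension."""
    def __init__(self):
        self.k = None  # (batch, seq_so_far, d_head)
        self.v = None

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, 1, d_head) for the single newest token
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=1)
            self.v = torch.cat([self.v, v_new], dim=1)
        return self.k, self.v
```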
Fine-Tuning a Transformer¶
Fine-tuning really should be called “necessary tuning,” because the output of the previous section doesn’t look anything like the GPT-style assistants we are used to. As such, this section teaches a baseline pre-trained model to follow instructions. We go into detail on SFT, reward modeling, RLHF with PPO, DPO, and other acronyms we will explain later: the techniques that turn base models into safer assistants.
Instruction formatting, loss masking, LoRA (loss masking sketched below)
Preference data and training reward models
PPO algorithm, KL penalty, training dynamics
Direct preference optimization without RL
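The loss-masking line above deserves a concrete picture. In supervised fine-tuning you typically train only on the response tokens, not the prompt; with PyTorch’s CrossEntropyLoss this is done by setting the prompt positions in the labels to the ignore_index (-100 by default). A minimal sketch with made-up token IDs:

```python
import torch

# a tokenized (prompt + response) pair; the IDs here are made up
input_ids = torch.tensor([[11, 42, 7, 99, 3, 18, 2]])
prompt_len = 4  # the first 4 tokens are the instruction

labels = input_ids.clone()
labels[:, :prompt_len] = -100  # CrossEntropyLoss skips ignore_index positions
```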
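And since DPO fits in a few lines, here is a sketch of its loss, assuming you already have summed per-response log-probabilities under the policy and under a frozen reference model (the argument names are mine):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # each argument: summed token log-probs of a whole response, shape (batch,)
    chosen_margin = pi_chosen - ref_chosen        # how much the policy prefers chosen
    rejected_margin = pi_rejected - ref_rejected  # ...versus the rejected response
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

No reward model, no rollouts: the preference data is the supervision, which is the whole appeal.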
Reasoning with Transformers¶
How do models like o1 and DeepSeek-R1 “think”? This section covers the techniques that make transformers reason, from simple prompting tricks to full reinforcement learning pipelines. We’ll build chain-of-thought prompting and tree search, and we’ll train our own reasoning models.
The simple prompt that started it all
Sample many reasoning paths, vote on the answer (sketched below)
Explore and backtrack through reasoning trees
Score each reasoning step, not just the answer
Generate many solutions, pick the best
The algorithm that powered AlphaGo, for language
Control how long the model “thinks”
RL for reasoning without a critic
Transfer reasoning to smaller models
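To make the “sample many paths, vote” line concrete: self-consistency samples several chain-of-thought completions and takes the majority answer. A sketch, where sample_completion and extract_answer are hypothetical stand-ins for whatever model and parser you use:

```python
from collections import Counter

def self_consistency(prompt, sample_completion, extract_answer, n=16):
    """sample_completion(prompt) -> one sampled chain-of-thought string,
    extract_answer(text) -> the final answer that chain arrives at."""
    answers = [extract_answer(sample_completion(prompt)) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n  # the answer plus its vote share
```

Most of the other bullets above are refinements of this idea: score the paths, search them as a tree, or train the model so the good paths become more likely.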
From Noise to Images¶
But what if we aren’t generating text? Here we will learn how AI generates images from text prompts. This section builds from flow matching fundamentals to a working latent diffusion model (you’ll know what that means later).
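As a first taste of flow matching, here is a minimal PyTorch training step under the common linear-interpolation formulation: sample noise x0 and data x1, interpolate between them, and regress the model’s predicted velocity onto x1 - x0. The tiny model here is a toy stand-in, not the architecture the section builds:

```python
import torch
import torch.nn as nn

# toy stand-in: maps (position, time) -> velocity, for 2-D data
model = nn.Sequential(nn.Linear(2 + 1, 64), nn.ReLU(), nn.Linear(64, 2))

def flow_matching_step(x1, optimizer):
    # x1: a batch of data points, shape (batch, 2)
    x0 = torch.randn_like(x1)                   # pure noise
    t = torch.rand(x1.size(0), 1)               # one timestep per example
    xt = (1 - t) * x0 + t * x1                  # straight-line interpolation
    target_v = x1 - x0                          # the velocity field to learn
    pred_v = model(torch.cat([xt, t], dim=-1))  # condition on position and time
    loss = ((pred_v - target_v) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = flow_matching_step(torch.randn(32, 2), opt)
```

Swap the toy network for a U-Net and the 2-D points for image latents, and you are most of the way to the latent diffusion model this section builds.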