Understanding Transformers from the Inside Out
This book teaches you how fairly modern AI systems work by building miniature versions of them yourself. I don’t want to hand-wave anything, because I’m learning this as we go too. Real math, straightforward code: that’s the goal.
Understanding Gradients¶
Using only basic Python (no NumPy, no PyTorch), we’ll compute every matrix multiplication, every activation function, every gradient. If you want to be pragmatic, you can skip this section and go to the next one. But if you want to reach for glory, and glory here means meticulous mathematical matrix multiplications, then get ready to calculate!
How text becomes vectors
What Query, Key, Value actually mean
The softmax-weighted sum that made transformers possible
Running parallel attention operations
The MLP that processes attended information
Stabilizing activations for training
Cross-entropy and the backward pass (sketched below)
How weights actually get updated
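To preview the flavor of this section, here is a minimal pure-Python sketch (no NumPy, no PyTorch, as promised) of the softmax, the cross-entropy loss, and the gradient of that loss with respect to the logits. The function names are mine, not necessarily the ones used in the chapter:

```python
import math

def softmax(logits):
    # shift by the max logit for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    # negative log-likelihood of the correct class
    return -math.log(probs[target])

def grad_wrt_logits(probs, target):
    # the classic result: d(loss)/d(logit_i) = p_i - 1[i == target]
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

probs = softmax([2.0, 1.0, 0.1])
print(cross_entropy(probs, target=0))    # the loss
print(grad_wrt_logits(probs, target=0))  # the backward pass, by hand
```

Every step of the transformer reduces to operations like these, just at larger scale.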
Building a Transformer¶
This is our transformer. There are many like it, but this one is ours. This section shows you how to build a complete GPT-style transformer in PyTorch. All the heavy lifting we did in the last section is now hidden behind simple calls like backward(). It covers the architecture that powers modern language models (circa 2023), from embeddings to interpretability tools. In the end, you’ll have a new toy.
Token embeddings, ALiBi, RoPE
Scaled dot-product attention with causal masking (sketched below)
Parallel attention heads
Pre-LN, residuals, and all components combined
Gradient accumulation, validation splits
Fast inference through KV caching (sketched below)
Logit lens, attention analysis, induction heads
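As a preview of the attention line above, here is a minimal PyTorch sketch of scaled dot-product attention with a causal mask. It is roughly the operation this section builds up to; the book’s version differs in details like multi-head reshaping:

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, seq, d_head) tensors
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, seq, seq)
    seq = q.size(-2)
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # hide future tokens
    weights = F.softmax(scores, dim=-1)               # attention weights
    return weights @ v                                # weighted sum of values

out = causal_attention(torch.randn(1, 4, 8), torch.randn(1, 4, 8), torch.randn(1, 4, 8))
```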
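And for the KV caching line: the idea is to keep the keys and values from earlier decoding steps around, so each new token only computes its own k and v instead of recomputing the whole prefix. A rough sketch, with a class name of my own invention:

```python
import torch

class KVCache:
    """Append-only cache of keys and values along the sequence dimension."""
    def __init__(self):
        self.k = None  # (batch, seq_so_far, d_head)
        self.v = None

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, 1, d_head) for the single newest token
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=1)
            self.v = torch.cat([self.v, v_new], dim=1)
        return self.k, self.v
```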
Fine-Tuning a Transformer¶
Fine-tuning really should be called “necessary tuning,” because the output of the previous section doesn’t look anything like the GPT-style assistants we are used to. As such, this section teaches a baseline pre-trained model to follow instructions. We go into detail on SFT, reward modeling, RLHF with PPO, DPO, and other acronyms we will explain later: the techniques that turn base models into safer assistants.
Instruction formatting, loss masking, LoRA (loss masking sketched below)
Preference data and training reward models
PPO algorithm, KL penalty, training dynamics
Direct preference optimization without RL
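The loss-masking line above deserves a concrete picture. In supervised fine-tuning you typically train only on the response tokens, not the prompt; with PyTorch’s CrossEntropyLoss this is done by setting the prompt positions in the labels to the ignore_index (-100 by default). A minimal sketch with made-up token IDs:

```python
import torch

# a tokenized (prompt + response) pair; the IDs here are made up
input_ids = torch.tensor([[11, 42, 7, 99, 3, 18, 2]])
prompt_len = 4  # the first 4 tokens are the instruction

labels = input_ids.clone()
labels[:, :prompt_len] = -100  # CrossEntropyLoss skips ignore_index positions
```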
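And since DPO fits in a few lines, here is a sketch of its loss, assuming you already have summed per-response log-probabilities under the policy and under a frozen reference model (the argument names are mine):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # each argument: summed token log-probs of a whole response, shape (batch,)
    chosen_margin = pi_chosen - ref_chosen        # how much the policy prefers chosen
    rejected_margin = pi_rejected - ref_rejected  # ...versus the rejected response
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

No reward model, no rollouts: the preference data is the supervision, which is the whole appeal.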
Reasoning with Transformers¶
How do models like o1 and DeepSeek-R1 “think”? This section covers the techniques that make transformers reason, from simple prompting tricks to full reinforcement learning pipelines. We’ll build chain-of-thought prompting and tree search, and we’ll train our own reasoning models.
The simple prompt that started it all
Sample many reasoning paths, vote on the answer (sketched below)
Explore and backtrack through reasoning trees
Score each reasoning step, not just the answer
Generate many solutions, pick the best
The algorithm that powered AlphaGo, for language
Control how long the model “thinks”
RL for reasoning without a critic
Transfer reasoning to smaller models
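To make the “sample many paths, vote” line concrete: self-consistency samples several chain-of-thought completions and takes the majority answer. A sketch, where sample_completion and extract_answer are hypothetical stand-ins for whatever model and parser you use:

```python
from collections import Counter

def self_consistency(prompt, sample_completion, extract_answer, n=16):
    """sample_completion(prompt) -> one sampled chain-of-thought string,
    extract_answer(text) -> the final answer that chain arrives at."""
    answers = [extract_answer(sample_completion(prompt)) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n  # the answer plus its vote share
```

Most of the other bullets above are refinements of this idea: score the paths, search them as a tree, or train the model so the good paths become more likely.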
From Noise to Images¶
But what if we aren’t generating text? Here we will learn how AI generates images from text prompts. This section builds from flow matching fundamentals to a working latent diffusion model (you’ll know what that means later).
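As a first taste of flow matching, here is a minimal PyTorch training step under the common linear-interpolation formulation: sample noise x0 and data x1, interpolate between them, and regress the model’s predicted velocity onto x1 - x0. The tiny model here is a toy stand-in, not the architecture the section builds:

```python
import torch
import torch.nn as nn

# toy stand-in: maps (position, time) -> velocity, for 2-D data
model = nn.Sequential(nn.Linear(2 + 1, 64), nn.ReLU(), nn.Linear(64, 2))

def flow_matching_step(x1, optimizer):
    # x1: a batch of data points, shape (batch, 2)
    x0 = torch.randn_like(x1)                   # pure noise
    t = torch.rand(x1.size(0), 1)               # one timestep per example
    xt = (1 - t) * x0 + t * x1                  # straight-line interpolation
    target_v = x1 - x0                          # the velocity field to learn
    pred_v = model(torch.cat([xt, t], dim=-1))  # condition on position and time
    loss = ((pred_v - target_v) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = flow_matching_step(torch.randn(32, 2), opt)
```

Swap the toy network for a U-Net and the 2-D points for image latents, and you are most of the way to the latent diffusion model this section builds.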