Ever wonder how transformers actually work under the hood? I mean really work, at the level of matrices and gradients and actual numbers?
You can read about attention mechanisms and backpropagation in a textbook. You can use PyTorch and watch the loss go down. But there’s something different about seeing every single calculation laid out in front of you - watching how a 5×16 embedding matrix multiplies with a 16×16 query weight matrix, seeing exactly how the chain rule propagates gradients through layer normalization, understanding why AdamW needs bias correction terms.
This project works through one complete training step of a transformer, calculated entirely by hand.
We’re going to take the sentence “I like transformers” (3 tokens, plus BOS and EOS markers for 5 total) through a tiny GPT-style model:
- **Forward pass**: From raw text → embeddings → attention → feed-forward → loss (7 pages). The first step is sketched in NumPy just after this list.
- **Backward pass**: Computing gradients for every single parameter via backpropagation (5 pages)
- **Optimization**: Applying AdamW updates with momentum and bias correction (1 page)
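To make the shapes concrete before we start, here is a minimal NumPy sketch of that very first step. The token IDs and the random initialization below are illustrative placeholders, not the values used in the actual pages.

```python
import numpy as np

rng = np.random.default_rng(42)        # fixed seed, like the project's deterministic scripts

# Hypothetical token IDs for "<bos> I like transformers <eos>" in a 10-word vocabulary.
# The real pages define their own vocabulary and IDs; these are placeholders.
token_ids = np.array([0, 3, 4, 5, 1])

vocab_size, d_model, seq_len = 10, 16, 5

W_emb = rng.normal(0.0, 0.02, (vocab_size, d_model))   # token embedding table: 10x16
W_pos = rng.normal(0.0, 0.02, (seq_len, d_model))      # learned positional embeddings: 5x16

X = W_emb[token_ids] + W_pos           # embedded input to the transformer block: 5x16
print(X.shape)                         # (5, 16)
```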
Every matrix multiplication is shown step-by-step. Every gradient derivation is complete. Every dimension is tracked. Nothing is hidden behind library abstractions or handwaved as “trivial.”
By the end, you’ll have a deep, visceral understanding of transformer mathematics - the kind that only comes from doing the calculations yourself.
Forward Pass (7 pages)
Watch the input flow through each layer with complete matrix operations. See how attention weights emerge from scaled dot products and how GELU activation transforms hidden states.
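As a rough preview of what those pages spell out number by number, here is a NumPy sketch of single-head scaled dot-product attention followed by a GELU feed-forward layer. The weights are random placeholders, residual connections and layer norm are omitted, and the real model uses two heads of size 8 (so it scales by √8 rather than √16).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5
X = rng.normal(0.0, 1.0, (seq_len, d_model))    # stand-in for the embedded input (5x16)

W_Q = rng.normal(0.0, 0.02, (d_model, d_model))
W_K = rng.normal(0.0, 0.02, (d_model, d_model))
W_V = rng.normal(0.0, 0.02, (d_model, d_model))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V             # each 5x16
scores = Q @ K.T / np.sqrt(d_model)             # scaled dot products: 5x5 (single head here)
causal_mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)   # decoder-only masking
weights = np.exp(scores + causal_mask)
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax -> attention weights
attn_out = weights @ V                          # 5x16

# Feed-forward: 16 -> 64 -> 16 with the tanh approximation of GELU
W1 = rng.normal(0.0, 0.02, (d_model, d_ff))
W2 = rng.normal(0.0, 0.02, (d_ff, d_model))
h = attn_out @ W1
gelu = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))
ffn_out = gelu @ W2                             # back to 5x16
print(attn_out.shape, ffn_out.shape)            # (5, 16) (5, 16)
```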
Backward Pass (5 pages)
Trace gradients backward through the network using the chain rule. Derive Jacobian matrices for softmax and layer normalization. Compute gradients for every weight, bias, and embedding.
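For a taste of the machinery involved, here is a generic NumPy sketch of two pieces those pages derive by hand: the Jacobian of softmax for one row of attention scores, and the gradient of layer normalization with respect to its input (affine scale/shift and epsilon left out for brevity). It follows the standard textbook formulas rather than copying the project's own derivation.

```python
import numpy as np

# Softmax Jacobian for one row of attention scores: J[i, j] = s_i * (delta_ij - s_j)
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([0.2, -1.0, 0.5, 0.1, 0.0])    # illustrative scores for one query position
s = softmax(scores)
J = np.diag(s) - np.outer(s, s)                  # 5x5 Jacobian of softmax w.r.t. the scores

# Layer norm backward for one 16-dim vector (no affine params, epsilon omitted for clarity)
def layernorm_backward(dy, x):
    mu, var = x.mean(), x.var()
    x_hat = (x - mu) / np.sqrt(var)
    # standard result of pushing dy through (x - mean) / std with the chain rule
    return (dy - dy.mean() - x_hat * (dy * x_hat).mean()) / np.sqrt(var)

x = np.linspace(-1.0, 1.0, 16)                   # stand-in hidden vector
dy = np.ones(16) / 16                            # stand-in upstream gradient
dx = layernorm_backward(dy, x)
print(J.shape, dx.shape)                         # (5, 5) (16,)
```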
Optimization (1 page)
AdamW Weight Updates with Momentum & Bias Correction
Apply the complete AdamW optimizer algorithm with first and second moment estimates, bias correction terms, and weight decay. See how each parameter moves toward better values.
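For reference, here is a minimal NumPy sketch of a single AdamW update on one parameter tensor. The learning rate, betas, epsilon, and weight decay below are common defaults and may differ from the values used in the docs.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single parameter tensor; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first moment estimate (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Toy usage: one update on a 16x16 weight matrix with a random placeholder gradient
rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.02, (16, 16))
g = rng.normal(0.0, 0.01, (16, 16))
m, v = np.zeros_like(W), np.zeros_like(W)
W, m, v = adamw_step(W, g, m, v, t=1)
```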
Python Scripts (12 files)
Reproducible Calculations with NumPy
Every page is backed by a Python script that performs the exact calculations shown in the docs. Deterministic initialization, intermediate value saving, and step-by-step verification.
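The project's actual file layout is its own; as a generic pattern, the reproducibility boils down to seeding NumPy and dumping intermediates to disk, roughly like this (file name hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=42)     # fixed seed -> identical numbers on every run

W_Q = rng.normal(0.0, 0.02, (16, 16))    # deterministic initialization
X = rng.normal(0.0, 1.0, (5, 16))
Q = X @ W_Q

np.save("q_projection.npy", Q)           # hypothetical intermediate dump
assert np.allclose(Q, np.load("q_projection.npy"))   # later steps can verify against it
```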
We’re using a GPT-style decoder-only transformer - the same architecture family as ChatGPT, Claude, and Llama, just scaled down to be humanly tractable:
| Component | Value | Why This Size? |
| --- | --- | --- |
| `d_model` | 16 | Small enough to write out full matrices, large enough to be realistic |
| `num_heads` | 2 | Multiple heads to show how multi-head attention combines (d_k = d_v = 8) |
| `d_ff` | 64 | Standard 4× expansion in the feed-forward layer |
| `vocab_size` | 10 | Our tiny vocabulary: “the”, “cat”, “sat”, “on”, “mat”, etc. |
| `num_layers` | 1 | One complete transformer block (you can extrapolate to N layers) |
| `max_len` | 5 | Length of our sequence with BOS and EOS tokens |
Total parameters: ~2,600 (versus 175 billion for GPT-3)
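If you want to poke at the configuration yourself, the table above boils down to a handful of constants. The dictionary below is just an illustrative restatement (variable names assumed), with the per-head dimension derived from d_model and num_heads:

```python
# Illustrative restatement of the configuration table (names assumed, not the project's)
config = {
    "d_model": 16,     # embedding / hidden size
    "num_heads": 2,    # attention heads
    "d_ff": 64,        # feed-forward inner size (4 x d_model)
    "vocab_size": 10,  # tiny vocabulary
    "num_layers": 1,   # one transformer block
    "max_len": 5,      # BOS + "I like transformers" + EOS
}

d_k = config["d_model"] // config["num_heads"]   # per-head dimension: 16 // 2 = 8
assert d_k == 8
```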
The math is identical whether you have 16 dimensions or 4096. We’re just keeping things small enough that you can actually see what’s happening in every matrix multiplication, understand every gradient, and verify every calculation.
- **Understanding vs. Using**: You can drive a car without knowing how the engine works. But if you want to design cars, you need to understand combustion, torque, and thermodynamics. Same with transformers.
- **Debugging intuition**: When your transformer isn’t training properly, understanding what’s happening in each gradient helps you diagnose whether it’s vanishing gradients, attention collapse, or something else.
- **No magic**: Libraries like PyTorch make training easy but hide the details. This project reveals what `loss.backward()` actually does - all 5 pages of chain rule applications.
- **Deep learning**: The best way to learn is to do. We’re doing every calculation, so you’ll learn it deeply.
Click the button at the top to begin with tokenization, or jump to any section using the sidebar.
Every page builds on the previous one, so following in order is recommended for your first read-through. All calculations are verified by Python scripts in the scripts/ directory.