Forem: Gary Jackson

Chapter 12: Inference - Generating New Text

Gary Jackson — Sat, 02 May 2026 02:28:39 +0000

What You'll Build

A sampling loop that generates new names from the trained model.

Depends On

Chapter 11 (the trained model).

How Generation Works

After training, the parameters are frozen. We start with the BOS token, feed it through the model, get a probability distribution over next tokens, sample one, feed it back in, and repeat until the model produces BOS again ("I'm done") or we hit the maximum length. Same generation loop from the bigram chapter. Only the source of the probabilities has changed.

// --- FullTraining.cs (add below the training loop from Chapter 11) ---

const double Temperature = 0.5;

Console.WriteLine("\n--- inference (new, hallucinated names) ---");
for (int sampleIdx = 0; sampleIdx < 20; sampleIdx++)
{
    List<List<Value>>[] keys = model.CreateKvCache();
    List<List<Value>>[] values = model.CreateKvCache();

    int tokenId = tokenizer.Bos;
    var sample = new StringBuilder();

    for (int posId = 0; posId < maxSequenceLength; posId++)
    {
        List<Value> logits = model.Forward(tokenId, posId, keys, values);

        var scaledLogits = logits.Select(l => l / Temperature).ToList();
        List<Value> probabilities = Softmax(scaledLogits);

        double r = random.NextDouble();
        double sum = 0;
        int nextToken = -1;
        var probabilityValues = probabilities.Select(p => p.Data).ToList();
        // Softmax probabilities can sum to slightly less/more than 1 due to floating point.
        // Rescale r into the actual total so we never fall off the end of the loop.
        double totalProb = probabilityValues.Sum();
        r *= totalProb;

        for (int i = 0; i < probabilityValues.Count; i++)
        {
            sum += probabilityValues[i];
            if (r <= sum)
            {
                nextToken = i;
                break;
            }
        }
        if (nextToken == -1)
        {
            nextToken = probabilityValues.Count - 1;
        }

        tokenId = nextToken;
        if (tokenId == tokenizer.Bos)
        {
            break;
        }

        sample.Append(tokenizer.Decode(tokenId));
    }

    Console.WriteLine($"sample {sampleIdx + 1, 2}: {sample}");
}

Notice how model.CreateKvCache() replaces the manual array-initialisation loop we would have needed. The model knows how many layers it has; the caller doesn't need to.

Temperature

Temperature controls the "creativity" of generation. Before softmax, we divide each logit by the temperature:

Temperature = 1.0. Sample directly from the model's learned distribution. Normal randomness.
Temperature < 1.0 (e.g. 0.5). Sharpens the distribution. The model becomes more conservative, more likely to pick its top choices. Names will be more "typical".
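Temperature -> 0. Approaches greedy decoding: the model almost always picks the single most likely token, with effectively no randomness. (The code divides by the temperature, so it can't be exactly 0.)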
Temperature > 1.0. Flattens the distribution. More diverse but potentially less coherent output.
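To make the effect concrete, here is a minimal sketch of temperature-scaled softmax on plain doubles instead of Value objects. The class name TemperatureSketch and the three example logits are made up for illustration; they are not part of the course code.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the course repo.
public static class TemperatureSketch
{
    // Divides each logit by the temperature, then applies softmax.
    // Subtracting the largest scaled logit first keeps Math.Exp from overflowing.
    public static double[] SoftmaxWithTemperature(double[] logits, double temperature)
    {
        if (temperature <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(temperature), "Temperature must be positive.");
        }

        double[] scaled = logits.Select(l => l / temperature).ToArray();
        double max = scaled.Max();
        double[] exps = scaled.Select(s => Math.Exp(s - max)).ToArray();
        double total = exps.Sum();
        return exps.Select(e => e / total).ToArray();
    }

    // Prints the distribution over three made-up logits at several temperatures.
    public static void Run()
    {
        double[] logits = { 2.0, 1.0, 0.0 };
        foreach (double temperature in new[] { 0.1, 0.5, 1.0, 2.0 })
        {
            double[] p = SoftmaxWithTemperature(logits, temperature);
            Console.WriteLine($"T={temperature:F1}: {string.Join(", ", p.Select(x => x.ToString("F3")))}");
        }
    }
}
```

With these logits, the top token's probability is about 0.665 at temperature 1.0 and about 0.867 at temperature 0.5, which is the sharpening described above.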

What You Should See

After training for 10,000 steps, the model generates plausible-sounding names like:

sample  1: kamon
sample  2: ann
sample  3: karai
sample  4: jaire
sample  5: vialan

The model isn't copying these names out of the training data (though a short, common one like ann may coincide with a real entry). It has learned the statistical patterns of names (consonant-vowel patterns, common endings, typical lengths) and is generating new examples from that learned distribution.

The Connection to ChatGPT

From the model's perspective, your conversation with ChatGPT is just a funny-looking "document". When you type your prompt, you're initialising the document. The model's response is a statistical document completion: the same next-token prediction we've built here, just at enormously larger scale with post-training layered on top to make it conversational.

Running the Final Model

dotnet run -c Release -- full

(Or just dotnet run -c Release. The dispatcher defaults to full when no chapter is given.)

The full training run (10,000 steps) typically takes 5-15 minutes on a modern CPU in Release mode, and much longer in Debug mode. Always use -c Release for training. The per-step loss bounces around, but watch the avg column. That's the running average, and it should trend downward from ~3.3 toward ~2.37. Every 1000 steps, a [milestone] line prints the current avg alongside its value at the previous milestone. The running average is smooth but not monotonic, so expect the occasional milestone-to-milestone wobble even while the overall trend is down. After training, you'll see 20 generated names.

What to Try Next

Now that you have a working model, here are some experiments worth running. Each one isolates a single variable so you can see its effect clearly.

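Add a second transformer block. Change layerCount from 1 to 2 in FullTraining.cs. With the default settings (27-token vocabulary, embeddingSize = 16) the parameter count grows from 4,064 to 7,136, since each block adds 3,072 parameters while the embeddings and output projection stay the same size. Watch the loss: a second block lets the model refine its representations further. The comment in the code already hints at this.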
Increase the sequence length. Change maxSequenceLength from 8 to 16. This lets the model see full-length names during training instead of truncating them at 7 characters. Training takes roughly twice as long, but you should see longer and more varied names during inference.
Experiment with temperature. Try temperature values of 0.1, 1.0, and 2.0 in the inference loop. At 0.1 the model plays it very safe (repetitive but well-formed). At 2.0 it gets creative (diverse but sometimes incoherent). Compare the outputs side by side to build intuition for how temperature shapes generation.
Remove RMSNorm. Comment out the RmsNorm calls in Model.cs and retrain. Watch what happens to the loss. Does it still converge? Does it converge more slowly, or does it blow up entirely? This shows you what normalisation is actually doing for training stability.
Swap the nonlinearity. Replace xi.Relu() in the MLP block with something else. Try xi * xi (squaring), or even just remove the nonlinearity entirely (delete the ReLU line). The loss will tell you how much the choice of activation function matters at this scale.
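Before running the first two experiments, it can help to predict the parameter count and compare it with the "num params" line the program prints. The sketch below adds up the matrix shapes created in the GptModel constructor. ParameterCountSketch is a hypothetical helper, not part of the course code, and the vocabulary size of 27 is taken from the comparison table later in this chapter.

```csharp
// Hypothetical helper, not part of the course repo.
public static class ParameterCountSketch
{
    // Sums rows * columns for every matrix the GptModel constructor creates.
    public static int Count(int vocabSize, int embeddingSize, int layerCount, int maxSequenceLength)
    {
        int tokenEmbeddings = vocabSize * embeddingSize;            // "wte"
        int positionEmbeddings = maxSequenceLength * embeddingSize; // "wpe"
        int outputProjection = vocabSize * embeddingSize;           // "lm_head"
        int attentionPerLayer = 4 * embeddingSize * embeddingSize;  // attn_wq, attn_wk, attn_wv, attn_wo
        int mlpPerLayer = 2 * (4 * embeddingSize) * embeddingSize;  // mlp_fc1, mlp_fc2

        return tokenEmbeddings
            + positionEmbeddings
            + outputProjection
            + layerCount * (attentionPerLayer + mlpPerLayer);
    }
}
```

Count(27, 16, 1, 8) gives 4064 for the default model, Count(27, 16, 2, 8) gives 7136 with a second block, and Count(27, 16, 1, 16) gives 4192, because a longer sequence only adds rows to the position embedding table.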

Performance Optimisation Notes

The course code above prioritises clarity over speed. Martin Skuta's C# MicroGPT repo (linked in Credits) includes several C#-specific optimisations worth understanding once the concepts are solid:

Replace LINQ in Hot Paths. The course code uses .Select(...).ToList() and .Sum() throughout (for example, the ReLU step in the MLP block and the temperature-scaling and sampling code in inference). LINQ typically allocates an enumerator on every call, plus a closure whenever the lambda captures a local variable, which adds up quickly in a training loop running millions of operations. Rewriting these as plain for loops that append to a pre-sized List<Value> is the first optimisation to reach for - it's mechanical, preserves readability, and typically gives a noticeable speedup before you ever touch SIMD.
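As a rough illustration of that rewrite, here is the temperature-scaling step in both styles. It uses double instead of Value so it stands alone, and HotPathSketch is a hypothetical name rather than anything in the course code or the repo.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the course repo.
public static class HotPathSketch
{
    // LINQ version, in the style of the course code.
    // The lambda captures `temperature`, so a closure is allocated on each call.
    public static List<double> ScaleWithLinq(List<double> logits, double temperature) =>
        logits.Select(l => l / temperature).ToList();

    // Loop version: one list allocated at its final size, no enumerator or delegate.
    public static List<double> ScaleWithLoop(List<double> logits, double temperature)
    {
        var scaled = new List<double>(logits.Count);
        for (int i = 0; i < logits.Count; i++)
        {
            scaled.Add(logits[i] / temperature);
        }

        return scaled;
    }
}
```

Both methods return the same values. Whether the difference is measurable depends on how often the code runs, so it is worth timing before and after rather than assuming a speedup.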

SIMD Vectorisation. The Value.Dot method in the repo uses System.Numerics.Vector<double> to process multiple elements per CPU instruction. This gives a significant speedup for the dot products that dominate the computation.
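The general shape of a vectorised dot product looks something like the sketch below. This is not the repo's Value.Dot: it works on raw double arrays and records nothing for the backward pass, and SimdDotSketch is a hypothetical name. It only shows how System.Numerics.Vector<double> processes several elements per iteration, with a scalar loop for the leftovers.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not part of the course repo.
public static class SimdDotSketch
{
    public static double Dot(double[] a, double[] b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("Vectors must have the same length.");
        }

        int width = Vector<double>.Count; // elements per SIMD register, hardware dependent
        Vector<double> accumulator = Vector<double>.Zero;

        int i = 0;
        for (; i <= a.Length - width; i += width)
        {
            // Multiply `width` pairs at once and add them lane by lane.
            accumulator += new Vector<double>(a, i) * new Vector<double>(b, i);
        }

        // Collapse the lanes into one number.
        double sum = Vector.Dot(accumulator, Vector<double>.One);

        // Scalar tail for elements that didn't fill a whole vector.
        for (; i < a.Length; i++)
        {
            sum += a[i] * b[i];
        }

        return sum;
    }
}
```

With embeddingSize = 16 the vectors are short, so the gain per call is small; it matters because the dot product runs so many times per training step.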

Iterative Backward Pass. We already used this: the explicit Stack<T> instead of recursion. This avoids stack overflow on deep graphs and eliminates function call overhead.
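For reference, the stack-based traversal can be sketched on a bare graph with no gradients attached. GraphNode and TopoSketch are hypothetical names, and pairing each node with the index of its next unvisited input is an assumption about what the (Value, int) tuples in the training loop's backwardStack hold, not a copy of the course's Backward method.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for Value: just a name and the nodes it was computed from.
public sealed class GraphNode
{
    public string Name { get; }
    public List<GraphNode> Inputs { get; } = new List<GraphNode>();

    public GraphNode(string name, params GraphNode[] inputs)
    {
        Name = name;
        Inputs.AddRange(inputs);
    }
}

public static class TopoSketch
{
    // Returns the nodes so that every node appears after all of its inputs,
    // using an explicit stack instead of recursion.
    public static List<GraphNode> TopologicalOrder(GraphNode root)
    {
        var order = new List<GraphNode>();
        var visited = new HashSet<GraphNode> { root };
        var stack = new Stack<(GraphNode node, int nextInput)>();
        stack.Push((root, 0));

        while (stack.Count > 0)
        {
            var (node, nextInput) = stack.Pop();
            if (nextInput < node.Inputs.Count)
            {
                // Come back to this node later, then descend into one input.
                stack.Push((node, nextInput + 1));
                GraphNode child = node.Inputs[nextInput];
                if (visited.Add(child))
                {
                    stack.Push((child, 0));
                }
            }
            else
            {
                // All inputs handled: the node can be emitted.
                order.Add(node);
            }
        }

        return order;
    }
}
```

A backward pass would then walk this list in reverse, from the loss back to the parameters. The depth of the traversal is limited by heap memory for the stack rather than by the thread's call stack.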

Zero-Allocation Hot Paths. The repo's Value.Dot pre-allocates the _inputs and _localGrads arrays once per node instead of creating intermediate Value objects for each multiply-and-add. This keeps garbage collection pressure low during training.

Backward Loop Unrolling. The Backward method can special-case nodes with 1 or 2 inputs (which covers ~99% of the graph: Add, Mul, ReLU, Pow) to avoid loop setup overhead.

Parallel Gradient Reset. Parallel.ForEach(paramsList, p => p.Grad = 0) uses multiple cores to zero out gradients.

These optimisations don't change the algorithm. They just make the same computation run faster. When studying the code, understand the clean version first, then read the optimised version as "the same thing, but faster".

From MicroGPT to ChatGPT

Everything in this course is the algorithmic essence of how LLMs work. Between this and a production model, nothing changes in the core algorithm. What changes is scale and engineering:

Aspect	MicroGPT	Production LLM
Data	32K short names	Trillions of tokens of internet text
Tokenizer	1 char = 1 token (27 tokens)	BPE subwords (~100K tokens)
Computation recorder	Scalar `Value` objects	Tensor operations on GPUs
Parameters	~4,000	Hundreds of billions
Training	1 document per step, CPU	Millions of tokens per step, thousands of GPUs
Architecture	ReLU, learned position embeddings	GeLU/SwiGLU, RoPE, GQA, MoE
Post-training	None	SFT + RLHF to make it conversational

If you understand what you've built in this course, you understand the algorithmic essence. Everything else is efficiency.

Glossary

Term	First Appears	Definition
Value	Ch. 1	A wrapper around a `double` that records how it was computed, enabling automatic gradient computation
Computation recorder	Ch. 1	What we call the `Value` class and its `Backward` method collectively. The ML community typically calls this an "autograd engine" (short for automatic gradient computation) or "automatic differentiation". Same thing, different name
Gradient	Ch. 2	How much the final loss would change if you nudged a particular value. Stored in `Value.Grad`
Backward / Backpropagation	Ch. 2	The algorithm that computes gradients by walking the computation graph in reverse
Chain Rule	Ch. 2	The calculus rule that lets you multiply rates of change along a path
BOS	Ch. 3	Beginning of Sequence token, a delimiter marking where documents start and end
Token	Ch. 3	A discrete symbol (in our case, a character) that the model processes
Bigram	Ch. 4	A model that predicts the next token using only the current token
Logits	Ch. 5	Raw, unnormalised scores output by the model, one per vocabulary token
Softmax	Ch. 5	A function that converts logits into a probability distribution
Linear	Ch. 5	A matrix-vector multiplication, the fundamental learned transformation
Embedding	Ch. 6	A learned vector associated with each token or position
Cross-Entropy Loss	Ch. 6	The specific formula for computing the loss: -log(probability of correct token). This is the "loss" from the Big Picture
Adam	Ch. 7	An optimiser that uses momentum and adaptive learning rates
RMSNorm	Ch. 8	Normalisation that rescales a vector to unit root-mean-square
Residual Connection	Ch. 8	Adding a layer's input to its output, creating a gradient highway
Attention (self-attention)	Ch. 9	The mechanism where tokens compute relevance scores and exchange information with other tokens in the same sequence
Causal attention	Ch. 9	Attention where each token can only look at itself and the positions before it, not ahead
Query / Key / Value (Q/K/V)	Ch. 9	Three projections of each token used in the attention computation
KV Cache	Ch. 9	Stored keys and values from previous positions, enabling efficient sequential processing
Head	Ch. 10	One independent attention computation operating on a slice of the embedding
MLP	Ch. 10	A two-layer feed-forward network for per-position computation
Transformer Block	Ch. 10	Attention + MLP, each with RMSNorm and residual connections
Temperature	Ch. 12	A scaling factor that controls the randomness of generated text

References

These are the primary sources behind the claims and concepts in this course. If you want to verify something, dig deeper on a topic, or just see where the ideas originally came from, start here.

The Transformer Architecture
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need."
https://arxiv.org/abs/1706.03762
The paper that introduced the transformer. Our model uses the same core structure: multi-head self-attention and feed-forward layers on a residual stream.

The Adam Optimiser (Ch. 7)
Kingma, D. P., & Ba, J. (2014). "Adam: A Method for Stochastic Optimization."
https://arxiv.org/abs/1412.6980
The original paper describing the momentum, adaptive learning rate, and bias correction that our training loop uses.

RMSNorm (Ch. 8)
Zhang, B., & Sennrich, R. (2019). "Root Mean Square Layer Normalization." Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
https://arxiv.org/abs/1910.07467
The paper proposing RMSNorm as a simpler alternative to LayerNorm. Our RmsNorm function implements the core idea from this paper.

GPT-2 (parameter count reference)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners."
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
The GPT-2 paper. The largest GPT-2 variant had 1.5 billion parameters.

Numerical Gradient Checking (Ch. 1)
PyTorch's torch.autograd.gradcheck function, which verifies analytical gradients against numerical finite differences:
https://docs.pytorch.org/docs/stable/generated/torch.autograd.gradcheck.gradcheck.html
This is the same nudge-and-measure technique used in our GradientCheck.cs.

Karpathy's micrograd video (Ch. 1-2)
Karpathy, A. (2022). "The spelled-out intro to neural networks and backpropagation: building micrograd."
https://www.youtube.com/watch?v=VMj-3S1tku0
A 2.5-hour walkthrough of the Value class and backpropagation. If you want a video companion to Chapters 1 and 2, this is it.

Karpathy's microgpt blog post
Karpathy, A. (2026). "microgpt."
https://karpathy.github.io/2026/02/12/microgpt/
The blog post that accompanies the original Python implementation. Covers the same progression as this course with additional mathematical detail.

Credits and Acknowledgements

This course is built on the work of others. It wouldn't exist without them.

Andrej Karpathy created the original microgpt - a 200-line Python implementation that distills a GPT into its bare algorithmic essence. The pedagogical progression used in this course (bigram -> MLP -> attention -> full transformer) follows the approach Karpathy developed across multiple projects including micrograd, makemore, and nanoGPT. His blog post and accompanying YouTube videos were invaluable references for the explanatory content throughout this course.

Martin Skuta (@martinskuta) wrote the C# implementation of microgpt that this course is based on. His translation from Python to C# - including the SIMD-vectorised dot product, the iterative backward pass, and the zero-allocation optimisations - demonstrated that the algorithm translates cleanly to .NET with no external dependencies. The Value class, the Gpt function structure, and the parameter dictionary layout in this course all derive from his work.

Jonas Ara (@jonas1ara) contributed the F# translation of the C# implementation to the same repository.

The training dataset (names.txt) is from Karpathy's makemore project.

This course was created and refined by Gary Jackson with the assistance of Claude (Anthropic). Gary provided the creative direction, pedagogical priorities, and iterative feedback that shaped the course structure, while Claude drafted and revised the content, code examples, and explanations.

Chapter 11: The Full GPT - Assembling the Model

Gary Jackson — Thu, 30 Apr 2026 21:07:25 +0000

What You'll Build

Four files that together make the project complete:

Model.cs - the GptModel class that holds all parameters and implements the full forward pass (replacing the simplified Forward function from Chapters 6-7)
AdamOptimiser.cs - a reusable class wrapping the Adam state and update from Chapter 7
FullTraining.cs - the real training loop that uses GptModel across 10,000 steps
Program.cs - the finalised dispatcher with the full case wired up

Depends On

All previous chapters.

The GptModel Class

A few design notes before the code.

The Forward method takes a single token at a time, not the whole sequence at once. The KV cache (passed in as parameters) holds the context from previous positions. This is the same one-token-at-a-time approach from Chapter 9: we process tokens sequentially during both training and inference.

Each document or sample needs its own fresh KV cache. The model provides CreateKvCache() for that, and the caller passes it back into every Forward call for that sequence.

The parameter dictionary uses string keys like "wte", "wpe", "lm_head", and "layer0.attn_wq". That means a typo in a key would be a runtime error, not a compile error, but it's the most direct mapping to how PyTorch stores model weights: a dictionary from names to tensors. The short names wte, wpe, and lm_head are borrowed from GPT-2-style PyTorch code, so the layout will look familiar if you ever read a real checkpoint, although the per-layer keys there are named differently and would need translating. To keep the C# code readable inside Forward, we add private property aliases (TokenEmbeddings, PositionEmbeddings, OutputProjection) over the most-used dict entries. The cryptic short names live in the dict, and the descriptive C# names live in the code that uses them.

Forward itself is short because it delegates to two private methods (AttentionBlock and MlpBlock) that mirror the "communicate, compute" framing from Chapter 10. Each block contains a pre-norm, the transformation, and the residual add.

// --- Model.cs ---

using static MicroGPT.Helpers;

namespace MicroGPT;

public class GptModel
{
    // The state dict keys borrow GPT-2 / PyTorch naming (wte = weight token embedding,
    // wpe = weight position embedding, etc.) so the parameters are easy to relate to
    // PyTorch checkpoints, though real GPT-2 keys would still need translating. The aliased
    // properties below give us readable C# names to use inside Forward without losing that bridge.
    private readonly Dictionary<string, List<List<Value>>> _stateDict;
    private readonly int _embeddingSize;
    private readonly int _headCount;
    private readonly int _layerCount;
    private readonly int _headDimension;

    private List<List<Value>> TokenEmbeddings => _stateDict["wte"];
    private List<List<Value>> PositionEmbeddings => _stateDict["wpe"];
    private List<List<Value>> OutputProjection => _stateDict["lm_head"];

    /// <summary>All trainable parameters, flattened into a single list for the optimiser.</summary>
    public List<Value> Parameters { get; }

    public GptModel(
        int vocabSize,
        int embeddingSize,
        int headCount,
        int layerCount,
        int maxSequenceLength,
        Random random
    )
    {
        _embeddingSize = embeddingSize;
        _headCount = headCount;
        _layerCount = layerCount;
        _headDimension = embeddingSize / headCount;

        _stateDict = new Dictionary<string, List<List<Value>>>
        {
            ["wte"] = CreateMatrix(random, vocabSize, embeddingSize),
            ["wpe"] = CreateMatrix(random, maxSequenceLength, embeddingSize),
            ["lm_head"] = CreateMatrix(random, vocabSize, embeddingSize),
        };

        for (int i = 0; i < layerCount; i++)
        {
            _stateDict[$"layer{i}.attn_wq"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.attn_wk"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.attn_wv"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.attn_wo"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.mlp_fc1"] = CreateMatrix(
                random,
                4 * embeddingSize,
                embeddingSize
            );
            _stateDict[$"layer{i}.mlp_fc2"] = CreateMatrix(
                random,
                embeddingSize,
                4 * embeddingSize
            );
        }

        // Dictionary<TKey,TValue> enumeration order is not guaranteed by the spec.
        // In .NET Core+ it preserves insertion order in practice, so Adam's momentum[]/squaredGradAvg[]
        // line up across runs - but if that implementation detail ever changes, switch
        // to a List<KeyValuePair<string, ...>> to make the order explicit.
        Parameters = [.. _stateDict.Values.SelectMany(mat => mat).SelectMany(row => row)];
    }

    public List<Value> Forward(
        int tokenId,
        int posId,
        List<List<Value>>[] keys,
        List<List<Value>>[] values
    )
    {
        List<Value> tokenEmbedding = TokenEmbeddings[tokenId];
        List<Value> positionEmbedding = PositionEmbeddings[posId];

        var x = new List<Value>();
        for (int i = 0; i < _embeddingSize; i++)
        {
            x.Add(tokenEmbedding[i] + positionEmbedding[i]);
        }

        // Initial RmsNorm: stabilises the embeddings before entering the first block.
        // This isn't standard in all transformer implementations, but gives the
        // residual stream a stable starting magnitude.
        x = RmsNorm(x);

        for (int layerIndex = 0; layerIndex < _layerCount; layerIndex++)
        {
            x = AttentionBlock(x, layerIndex, keys, values);
            x = MlpBlock(x, layerIndex);
        }

        // Note: production transformers typically apply a final RmsNorm here
        // before the output projection. We omit it for simplicity.
        return Linear(x, OutputProjection);
    }

    // Attention wrapped with pre-norm and a residual connection.
    // Mutates keys[layerIndex] and values[layerIndex] by appending the current position's K and V.
    private List<Value> AttentionBlock(
        List<Value> x,
        int layerIndex,
        List<List<Value>>[] keys,
        List<List<Value>>[] values
    )
    {
        var xResidual = new List<Value>(x);
        x = RmsNorm(x);

        List<Value> query = Linear(x, _stateDict[$"layer{layerIndex}.attn_wq"]);
        List<Value> key = Linear(x, _stateDict[$"layer{layerIndex}.attn_wk"]);
        List<Value> value = Linear(x, _stateDict[$"layer{layerIndex}.attn_wv"]);

        keys[layerIndex].Add(key);
        values[layerIndex].Add(value);

        var concatenatedHeads = new List<Value>();
        for (int h = 0; h < _headCount; h++)
        {
            int headStart = h * _headDimension;
            List<Value> queryForHead = query.GetRange(headStart, _headDimension);

            var attentionLogits = new List<Value>();
            int cachedCount = keys[layerIndex].Count;
            for (int t = 0; t < cachedCount; t++)
            {
                List<Value> keyForHead = keys[layerIndex][t].GetRange(headStart, _headDimension);
                var dot = new Value(0);
                for (int j = 0; j < _headDimension; j++)
                {
                    dot += queryForHead[j] * keyForHead[j];
                }

                attentionLogits.Add(dot / Math.Sqrt(_headDimension));
            }

            List<Value> attentionWeights = Softmax(attentionLogits);

            var headOutput = new List<Value>();
            for (int j = 0; j < _headDimension; j++)
            {
                headOutput.Add(new Value(0));
            }

            for (int t = 0; t < cachedCount; t++)
            {
                List<Value> valueForHead = values[layerIndex]
                    [t]
                    .GetRange(headStart, _headDimension);
                Value w = attentionWeights[t];
                for (int j = 0; j < _headDimension; j++)
                {
                    headOutput[j] += w * valueForHead[j];
                }
            }
            concatenatedHeads.AddRange(headOutput);
        }

        x = Linear(concatenatedHeads, _stateDict[$"layer{layerIndex}.attn_wo"]);
        for (int i = 0; i < _embeddingSize; i++)
        {
            x[i] += xResidual[i];
        }

        return x;
    }

    // Two-layer feed-forward with ReLU, wrapped with pre-norm and a residual connection.
    private List<Value> MlpBlock(List<Value> x, int layerIndex)
    {
        var xResidual = new List<Value>(x);
        x = RmsNorm(x);
        x = Linear(x, _stateDict[$"layer{layerIndex}.mlp_fc1"]);
        x = [.. x.Select(xi => xi.Relu())];
        x = Linear(x, _stateDict[$"layer{layerIndex}.mlp_fc2"]);
        for (int i = 0; i < _embeddingSize; i++)
        {
            x[i] += xResidual[i];
        }

        return x;
    }

    /// <summary>Creates a fresh KV cache for a new document/sample.</summary>
    public List<List<Value>>[] CreateKvCache()
    {
        var cache = new List<List<Value>>[_layerCount];
        for (int i = 0; i < _layerCount; i++)
        {
            cache[i] = [];
        }

        return cache;
    }
}

Extracting the Adam Optimiser

The Adam state and update from Chapter 7 were inline so the mechanics were visible. Now that we've seen them, it's worth packaging them into a reusable class.

AdamOptimiser owns the per-parameter state (momentum, squaredGradAvg) and the hyperparameters, and exposes two methods: ZeroGrad() and Step(int step). The maths inside Step is identical to the inline version in Chapter 7.

// --- AdamOptimiser.cs ---

namespace MicroGPT;

// Encapsulates the Adam update from Chapter 7 (momentum, squared-gradient
// average, bias correction) along with the linear learning-rate decay used
// across the course. See Chapter 7 for the underlying maths.
public class AdamOptimiser
{
    private const double MomentumSmoothing = 0.85;
    private const double SquaredGradSmoothing = 0.99;
    private const double Epsilon = 1e-8;

    private readonly IReadOnlyList<Value> _parameters;
    private readonly double[] _momentum;
    private readonly double[] _squaredGradAvg;
    private readonly double _baseLearningRate;
    private readonly int _totalSteps;

    public AdamOptimiser(IReadOnlyList<Value> parameters, double learningRate, int totalSteps)
    {
        _parameters = parameters;
        _momentum = new double[parameters.Count];
        _squaredGradAvg = new double[parameters.Count];
        _baseLearningRate = learningRate;
        _totalSteps = totalSteps;
    }

    // Reset every parameter's gradient to zero. Call before each Backward.
    public void ZeroGrad()
    {
        foreach (Value p in _parameters)
        {
            p.Grad = 0;
        }
    }

    // Apply one Adam update to every parameter using its current Grad.
    public void Step(int step)
    {
        double currentLearningRate = _baseLearningRate * (1 - (double)step / _totalSteps);
        for (int i = 0; i < _parameters.Count; i++)
        {
            Value p = _parameters[i];
            _momentum[i] = MomentumSmoothing * _momentum[i] + (1 - MomentumSmoothing) * p.Grad;
            _squaredGradAvg[i] =
                SquaredGradSmoothing * _squaredGradAvg[i]
                + (1 - SquaredGradSmoothing) * Math.Pow(p.Grad, 2);
            double correctedMomentum = _momentum[i] / (1 - Math.Pow(MomentumSmoothing, step + 1));
            double correctedSquaredGrad =
                _squaredGradAvg[i] / (1 - Math.Pow(SquaredGradSmoothing, step + 1));
            p.Data -=
                currentLearningRate
                * correctedMomentum
                / (Math.Sqrt(correctedSquaredGrad) + Epsilon);
        }
    }
}
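One property of the Step method above is easy to check by hand: on the very first step, bias correction cancels the smoothing factors, so the update size is roughly the learning rate regardless of how large the gradient is. The sketch below replays that first update for a single scalar, using the same constants. AdamFirstStepSketch is a hypothetical name, not part of the course code.

```csharp
using System;

// Hypothetical helper, not part of the course repo.
public static class AdamFirstStepSketch
{
    private const double MomentumSmoothing = 0.85;
    private const double SquaredGradSmoothing = 0.99;
    private const double Epsilon = 1e-8;

    // Returns the amount subtracted from a parameter on step 0,
    // starting from zeroed momentum and squared-gradient state.
    public static double FirstUpdate(double grad, double learningRate)
    {
        double momentum = (1 - MomentumSmoothing) * grad;
        double squaredGradAvg = (1 - SquaredGradSmoothing) * grad * grad;

        // With step = 0 the exponent is step + 1 = 1, so these divisions undo the factors above.
        double correctedMomentum = momentum / (1 - Math.Pow(MomentumSmoothing, 1));
        double correctedSquaredGrad = squaredGradAvg / (1 - Math.Pow(SquaredGradSmoothing, 1));

        return learningRate * correctedMomentum / (Math.Sqrt(correctedSquaredGrad) + Epsilon);
    }
}
```

FirstUpdate(5.0, 0.01) and FirstUpdate(0.001, 0.01) both come out very close to 0.01, and a negative gradient gives about -0.01. Only the sign of the gradient survives on the first step; the adaptive scaling starts to differ between parameters as history accumulates.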

The Training Loop

Now that GptModel and AdamOptimiser exist, we need a training loop that puts them through the same forward-backward-update cycle from Chapter 7, using the full model instead of three loose matrices. The training code (along with the inference loop we'll add in Chapter 12) goes into a new file, FullTraining.cs, with a single static Run() method.

The structure mirrors Chapter7Exercise.cs closely. The key differences: we create a GptModel instead of three loose matrices, we call model.Forward instead of a local function, we use model.CreateKvCache() to get a fresh cache per document, and the Adam step collapses to two optimiser calls.

There's also a new "milestone" print every 1000 steps that shows the current running-average loss alongside its value at the previous milestone. This is useful because per-step loss is noisy and an avg column alone can be hard to eyeball. The running average smooths things out but still wobbles, so over 1000 steps it can drift either way. Printing the previous value lets you judge the trend yourself.
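To get a feel for how slowly the running average reacts, the sketch below applies the same update rule (99% old average, 1% new loss) to a made-up scenario where the loss drops suddenly and then stays flat. RunningAverageSketch and the scenario are hypothetical; real training loss is noisy rather than flat.

```csharp
// Hypothetical helper, not part of the course repo.
public static class RunningAverageSketch
{
    // Same rule as the training loop: seed with the first loss, then blend.
    public static double Update(double average, double loss, bool isFirstStep) =>
        isFirstStep ? loss : 0.99 * average + 0.01 * loss;

    // The first step reports startLoss; every later step reports newLoss.
    public static double AverageAfterDrop(double startLoss, double newLoss, int stepsAfterDrop)
    {
        double average = Update(0.0, startLoss, isFirstStep: true);
        for (int i = 0; i < stepsAfterDrop; i++)
        {
            average = Update(average, newLoss, isFirstStep: false);
        }

        return average;
    }
}
```

AverageAfterDrop(3.3, 2.3, 100) is about 2.67, so after 100 steps the average has covered roughly 63% of the gap. After 1000 steps it is within 0.001 of 2.3. That lag is why the milestone interval is long enough to show a trend while individual 100-step prints can look stalled.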

Performance note: Run with dotnet run -c Release -- full. Debug mode is significantly slower and 10,000 steps could take a very long time without optimisation.

// --- FullTraining.cs ---

using System.Text;
using static MicroGPT.Helpers;

namespace MicroGPT;

public static class FullTraining
{
    public static void Run()
    {
        // ── Hyperparameters ──────────────────────────────────────
        int embeddingSize = 16;
        int layerCount = 1; // just one transformer block for speed - try layerCount=2 to see improvement
        int maxSequenceLength = 8;
        int numSteps = 10000;
        int headCount = 4;
        double learningRate = 1e-2;
        var random = new Random(42);

        // ── Dataset and Tokenizer ────────────────────────────────
        List<string> docs = Tokenizer.LoadDocs("input.txt", random);
        var tokenizer = new Tokenizer(docs);
        Console.WriteLine($"num docs: {docs.Count}");
        Console.WriteLine($"vocab size: {tokenizer.VocabSize}");

        // ── Model ────────────────────────────────────────────────
        var model = new GptModel(
            tokenizer.VocabSize,
            embeddingSize,
            headCount,
            layerCount,
            maxSequenceLength,
            random
        );
        Console.WriteLine($"num params: {model.Parameters.Count}");

        // ── Training Loop ────────────────────────────────────────
        var optimiser = new AdamOptimiser(model.Parameters, learningRate, numSteps);

        // Reusable buffers for Backward (see Chapter 2's convenience overload for the
        // simpler allocating version - here we hoist them out of the hot loop so 10,000
        // training steps don't allocate 10,000 fresh sets).
        var topo = new List<Value>();
        var visited = new HashSet<Value>();
        var backwardStack = new Stack<(Value, int)>();

        // Running average to smooth out the noisy per-step loss.
        double avgLoss = 0.0;
        // Milestone tracking so we can report the previous milestone's avg loss
        // alongside the current one every 1000 steps.
        double lastMilestoneLoss = 0.0;

        for (int step = 0; step < numSteps; step++)
        {
            string doc = docs[step % docs.Count];
            var tokens = new List<int> { tokenizer.Bos };
            tokens.AddRange(doc.Select(tokenizer.Encode));
            tokens.Add(tokenizer.Bos);
            // Any name longer than maxSequenceLength - 1 is silently truncated here.
            int tokenCount = Math.Min(maxSequenceLength, tokens.Count - 1);

            List<List<Value>>[] keys = model.CreateKvCache();
            List<List<Value>>[] values = model.CreateKvCache();

            var losses = new List<Value>();
            for (int posId = 0; posId < tokenCount; posId++)
            {
                List<Value> logits = model.Forward(tokens[posId], posId, keys, values);
                List<Value> probabilities = Softmax(logits);
                losses.Add(-probabilities[tokens[posId + 1]].Log());
            }

            var loss = new Value(0);
            foreach (Value l in losses)
            {
                loss += l;
            }

            loss *= 1.0 / tokenCount;

            // Track running average (exponential moving average with alpha = 0.01)
            avgLoss = step == 0 ? loss.Data : 0.99 * avgLoss + 0.01 * loss.Data;
            if (step == 0)
            {
                lastMilestoneLoss = avgLoss;
            }

            optimiser.ZeroGrad();

            topo.Clear();
            visited.Clear();
            backwardStack.Clear();
            loss.Backward(topo, visited, backwardStack);

            optimiser.Step(step);

            if (step == 0 || (step + 1) % 100 == 0)
            {
                Console.WriteLine(
                    $"step {step + 1, 5} / {numSteps, 5} | loss {loss.Data:F4} | avg {avgLoss:F4}"
                );
            }

            // Every 1000 steps, print a milestone showing overall progress.
            if ((step + 1) % 1000 == 0)
            {
                Console.WriteLine(
                    $"  [milestone] avg loss: {avgLoss:F4} (was {lastMilestoneLoss:F4})"
                );
                lastMilestoneLoss = avgLoss;
            }
        }

        // Chapter 12's inference loop lives here too - see the next chapter.
    }
}

A reminder about name length. As in Chapter 7, maxSequenceLength = 8 means any name longer than 7 characters is silently truncated during training. This is why the generated samples in Chapter 12 lean toward shorter names: the model simply hasn't seen the tails of longer ones. Raising maxSequenceLength to 16 removes the truncation for names of up to 15 characters, at roughly 2x the training cost.
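
To make the truncation rule concrete, here is a minimal sketch that mirrors the tokenCount line from the training loop using string lengths only. TrainedPositions is a hypothetical helper written for this illustration, not part of the project, and the sample names are arbitrary.

```csharp
using System;

// Assumption: maxSequenceLength = 8, as used throughout this series.
const int maxSequenceLength = 8;

// Number of (input, target) pairs the training loop actually uses for a name.
// Hypothetical helper: it mirrors Math.Min(maxSequenceLength, tokens.Count - 1).
static int TrainedPositions(string name, int maxLen)
{
    int tokens = name.Length + 2;        // BOS + characters + BOS
    return Math.Min(maxLen, tokens - 1);
}

foreach (string name in new[] { "emma", "jackson", "isabella", "christopher" })
{
    int used = TrainedPositions(name, maxSequenceLength);
    int needed = name.Length + 1;        // every character plus the closing BOS is a target
    string status = used < needed ? "truncated" : "complete";
    Console.WriteLine($"{name,-12} length {name.Length,2}: {used} of {needed} targets ({status})");
}
```

A 7-character name such as "jackson" still fits (8 of 8 targets), while an 8-character name loses its closing BOS target, so the model never learns where such names end.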

Finishing the Dispatcher

With FullTraining.cs in place, we can finalise Program.cs. You've been uncommenting one case in the dispatcher at the end of every chapter since Chapter 1. Now uncomment the final case "full" line and replace the temporary sanity-check default from Chapter 0 with the usage message. After this edit your Program.cs should look like this:

// --- Program.cs ---
//
// Dispatcher: `dotnet run -- chN` runs a specific chapter exercise,
// `dotnet run -- full` runs the full training + inference,
// `dotnet run` (no args) defaults to the full run.

namespace MicroGPT;

public static class Program
{
    public static void Main(string[] args)
    {
        string chapter = args.Length > 0 ? args[0].ToLowerInvariant() : "full";

        switch (chapter)
        {
            case "gradcheck":
                GradientCheck.RunAll();
                break;
            case "ch1":
                Chapter1Exercise.Run();
                break;
            case "ch2":
                Chapter2Exercise.Run();
                break;
            case "ch3":
                Chapter3Exercise.Run();
                break;
            case "ch4":
                Chapter4Exercise.Run();
                break;
            case "ch5":
                Chapter5Exercise.Run();
                break;
            case "ch6":
                Chapter6Exercise.Run();
                break;
            case "ch7":
                Chapter7Exercise.Run();
                break;
            case "ch8":
                Chapter8Exercise.Run();
                break;
            case "ch9":
                Chapter9Exercise.Run();
                break;
            case "ch10":
                Chapter10Exercise.Run();
                break;
            case "full":
                FullTraining.Run();
                break;
            default:
                Console.WriteLine($"Unknown chapter: {chapter}");
                Console.WriteLine("Usage: dotnet run -- [gradcheck|ch1..ch10|full]");
                break;
        }
    }
}

Two things changed from the Chapter 0 skeleton: the "full" case is now wired to FullTraining.Run(), and the default no-args value is "full" instead of "" so that dotnet run with no arguments runs the full training and inference.

With everything wired up, you can invoke any part of the project uniformly:

dotnet run -- ch1     # Chapter 1 exercise
dotnet run -- ch10    # Chapter 10 exercise
dotnet run -- full    # full training + inference (Chapters 11-12)
dotnet run            # same as "full"

The Parameter Count

With embeddingSize=16, headCount=4, layerCount=1, maxSequenceLength=8, and vocabSize=27:

Token embeddings (wte): 27 x 16 = 432
Position embeddings (wpe): 8 x 16 = 128
Output projection (lm_head): 27 x 16 = 432
Per layer: Q(256) + K(256) + V(256) + O(256) + FC1(1024) + FC2(1024) = 3,072
Total: 432 + 128 + 432 + 3,072 = 4,064
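
If you want to check the arithmetic, or see how the total moves when you change a hyperparameter, the tally above can be reproduced in a few lines. This is a standalone sketch with its own variable names rather than code from the project. Note that headCount does not appear: splitting the embedding into heads changes how the Q/K/V matrices are sliced, not how many weights they hold.

```csharp
using System;

// Hyperparameters as stated above.
const int embeddingSize = 16, layerCount = 1, maxSequenceLength = 8, vocabSize = 27;

int tokenEmbeddings = vocabSize * embeddingSize;             // wte
int positionEmbeddings = maxSequenceLength * embeddingSize;  // wpe
int outputProjection = vocabSize * embeddingSize;            // lm_head
int attentionPerLayer = 4 * embeddingSize * embeddingSize;   // Q, K, V, O
int mlpPerLayer = 2 * (4 * embeddingSize) * embeddingSize;   // FC1 + FC2
int perLayer = attentionPerLayer + mlpPerLayer;

int total = tokenEmbeddings + positionEmbeddings + outputProjection + layerCount * perLayer;
Console.WriteLine($"per layer: {perLayer}"); // per layer: 3072
Console.WriteLine($"total: {total}");        // total: 4064
```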

For perspective: GPT-2's largest variant had 1.5 billion parameters. GPT-4 class models are believed to have hundreds of billions or more, though exact figures have not been published. The core architecture is the same, just much wider and deeper.

Optional: Try Training Now

Training is fully wired up. If you want to confirm everything works before adding inference, run it now:

dotnet run -c Release -- full

You'll see per-step loss lines and a [milestone] line every 1000 steps. The running average should drop from ~3.3 toward ~2.37 over 10,000 steps (5-15 minutes on a modern CPU in Release mode). Per-step loss is noisy, so watch the avg column for the trend.
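
The ~3.3 starting point is not arbitrary. A model that has learned nothing spreads its probability evenly over all 27 tokens, and the cross-entropy of that guess is ln(27). The sketch below computes that value and then applies the same smoothing rule as the training loop to a few made-up losses (the loss values are hypothetical and only illustrate how little a single outlier moves the average).

```csharp
using System;

// Uniform guess over 27 tokens: loss = -ln(1/27) = ln(27).
const int vocabSize = 27;
double uniformLoss = -Math.Log(1.0 / vocabSize);
Console.WriteLine($"uniform-guess loss: {uniformLoss:F4}"); // about 3.2958

// Same rule as the training loop: 99% old average, 1% new loss.
static double UpdateAverage(double avg, double loss, int step) =>
    step == 0 ? loss : 0.99 * avg + 0.01 * loss;

// Hypothetical per-step losses with one unusually low value.
double avg = 0.0;
double[] losses = { 3.3, 3.3, 1.0, 3.3 };
for (int step = 0; step < losses.Length; step++)
{
    avg = UpdateAverage(avg, losses[step], step);
    Console.WriteLine($"step {step}: loss {losses[step]:F2} | avg {avg:F4}");
}
```

The single 1.0 pulls the average down by only about 0.02, which is why the avg column is the one to watch.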

Generation is the next chapter, so the run will end after training without producing any names yet. If you'd rather wait and run once with both training and inference in place, skip this and head straight into Chapter 12.

Chapter 10: Multi-Head Attention and the MLP Block

Gary Jackson — Wed, 29 Apr 2026 21:10:41 +0000

What You'll Build

Multi-head attention (running several attention computations in parallel, each on its own slice of the per-token embedding vector) and the MLP block (a two-layer feed-forward network for per-position "thinking"). Both concepts are introduced here and implemented in Model.cs in Chapter 11.

Depends On

Chapters 5, 8, 9 (Helpers, RmsNorm, residual connections, single-head attention).

Why Multiple Heads?

A single attention head can only learn one kind of "what am I looking for?" pattern. With multiple heads, the model can look for different kinds of relationships at the same time. In larger models with bigger embedding dimensions, individual heads often specialise in distinct patterns (one might track syntax, another semantics). At our small scale (headDimension = 4), the specialisation is fuzzier, but the mechanism is the same.

The trick: instead of running 4 full-size attention computations, we split the embedding dimension into 4 slices. If embeddingSize = 16 and headCount = 4, each head operates on 4 dimensions (headDimension = 4). This doesn't lose information because the projections (queryWeights, keyWeights, valueWeights) can learn to put related information into the same slice. The heads compute independently and their outputs are concatenated (not averaged or summed) back to the full embedding size. Concatenation keeps all the per-head information in distinct dimensions, so nothing is lost before the next step.

Multi-Head Attention

// Shape reference - integrated into GptModel.Forward in Chapter 11.
// The for loop is sequential, but conceptually each head is independent.
// In production, all heads are computed in a single matrix multiply on a GPU.

var concatenatedHeads = new List<Value>();

for (int h = 0; h < headCount; h++)
{
    int headStart = h * headDimension;
    List<Value> queryForHead = q.GetRange(headStart, headDimension);

    var attentionLogits = new List<Value>();
    for (int t = 0; t < cachedKeys.Count; t++)
    {
        List<Value> keyForHead = cachedKeys[t].GetRange(headStart, headDimension);
        var dot = new Value(0);
        for (int j = 0; j < headDimension; j++)
        {
            dot += queryForHead[j] * keyForHead[j];
        }

        attentionLogits.Add(dot / Math.Sqrt(headDimension));
    }

    List<Value> attentionWeights = Helpers.Softmax(attentionLogits);

    var headOutput = new List<Value>();
    for (int j = 0; j < headDimension; j++)
    {
        headOutput.Add(new Value(0));
    }

    for (int t = 0; t < cachedValues.Count; t++)
    {
        List<Value> valueForHead = cachedValues[t].GetRange(headStart, headDimension);
        Value w = attentionWeights[t];
        for (int j = 0; j < headDimension; j++)
        {
            headOutput[j] += w * valueForHead[j];
        }
    }

    concatenatedHeads.AddRange(headOutput); // concatenate this head's output
}

// After concatenation, project through outputWeights to mix information across heads
x = Helpers.Linear(concatenatedHeads, outputWeights);

The final Linear(concatenatedHeads, outputWeights) is important. After concatenation, each dimension still belongs to a single head. The outputWeights projection mixes information across heads, letting the model combine what different heads found.

The MLP Block

MLP stands for Multi-Layer Perceptron, a generic term for a stack of linear layers with nonlinearities between them. In transformers it's specifically a two-layer feed-forward network.

Attention is the communication mechanism (tokens talk to each other). The MLP is the computation mechanism (each position "thinks" independently). Concretely, it projects up to 4x the embedding dimension, applies ReLU, then projects back down.

// Shape reference - integrated into GptModel.Forward in Chapter 11.

x = Helpers.Linear(x, mlpUpWeights); // project up: embeddingSize -> 4*embeddingSize
x = [.. x.Select(xi => xi.Relu())]; // nonlinearity
x = Helpers.Linear(x, mlpDownWeights); // project down: 4*embeddingSize -> embeddingSize

Why project up and then back down? The wider intermediate layer gives the model more "room to think" (more dimensions to combine features in) before compressing back to the residual stream size.

We use ReLU here for simplicity. Production transformers typically use smoother variants like GELU or SwiGLU, but the role is the same: introduce a nonlinearity between the two linear projections.

The Transformer Block

A single transformer layer combines attention and MLP, each wrapped with RMSNorm and a residual connection:

// Shape reference - integrated into GptModel.Forward in Chapter 11.

// Attention with residual
var xResidual = new List<Value>(x);
x = Helpers.RmsNorm(x);
x = /* multi-head attention + outputWeights projection */;
for (int i = 0; i < embeddingSize; i++)
{
    x[i] += xResidual[i];
}

// MLP with residual
xResidual = new List<Value>(x);
x = Helpers.RmsNorm(x);
x = /* MLP block */;
for (int i = 0; i < embeddingSize; i++)
{
    x[i] += xResidual[i];
}
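
The two halves of the block follow one pattern: normalise, run a sublayer, add the result back onto the input. Here is a minimal sketch of that pattern with plain doubles instead of Value objects. WithResidual, fakeAttention and fakeMlp are hypothetical names invented for this illustration, and the RmsNorm stand-in assumes the common form x / sqrt(mean(x^2) + epsilon), which may differ in detail from Helpers.RmsNorm.

```csharp
using System;
using System.Linq;

// Plain-double stand-in for Helpers.RmsNorm (assumed form, see lead-in).
static double[] RmsNorm(double[] input)
{
    double meanSquare = input.Sum(e => e * e) / input.Length;
    double scale = 1.0 / Math.Sqrt(meanSquare + 1e-5);
    return input.Select(e => e * scale).ToArray();
}

// Pre-norm residual wrapper: output = input + sublayer(RmsNorm(input)).
static double[] WithResidual(double[] input, Func<double[], double[]> sublayer)
{
    double[] branch = sublayer(RmsNorm(input));
    return input.Zip(branch, (original, update) => original + update).ToArray();
}

// Hypothetical sublayers standing in for attention and the MLP.
Func<double[], double[]> fakeAttention = v => v.Select(e => 0.1 * e).ToArray();
Func<double[], double[]> fakeMlp = v => v.Select(e => Math.Max(0.0, e)).ToArray();

double[] stream = { 1.0, -2.0, 3.0, -4.0 };
stream = WithResidual(stream, fakeAttention); // attention half of the block
stream = WithResidual(stream, fakeMlp);       // MLP half of the block
Console.WriteLine(string.Join(", ", stream.Select(e => e.ToString("F3"))));

// A sublayer that outputs zeros leaves the residual stream untouched.
double[] unchanged = WithResidual(new[] { 1.0, 2.0 }, v => new double[v.Length]);
Console.WriteLine(string.Join(", ", unchanged)); // 1, 2
```

The last two lines show why the residual connection is safe to stack: a sublayer that contributes nothing passes the input through unchanged.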

Stacking Blocks

Our model uses layerCount = 1 (a single block), but the architecture supports stacking multiple blocks in sequence. Each block reads from and writes to the same residual stream:

Embeddings
    ↓
┌─ Block 1 ──┐
│ Attention  │
│ MLP        │
└────────────┘
    ↓
┌─ Block 2 ──┐
│ Attention  │
│ MLP        │
└────────────┘
    ↓
   ...
    ↓
┌─ Block N ──┐
│ Attention  │
│ MLP        │
└────────────┘
    ↓
Output projection (lmHead)

Deeper models (more blocks) can learn more complex patterns because each block refines the representation further. GPT-2's largest variant used 48 blocks.

Exercise: Multi-Head Attention + MLP

Like Chapter 9, this exercise uses hand-crafted Q/K/V so you can see the behaviour rather than waiting for training to discover it. The setup: embeddingSize = 8, headCount = 2, headDimension = 4, three cached positions. Head 0's Q slice is aligned with K[1], and head 1's Q slice is aligned with K[2], so the two heads should attend to different positions. After the demo, a second pass runs an MLP block on a fixed input to show the up-project → ReLU → down-project shape change.

Create Chapter10Exercise.cs:

// --- Chapter10Exercise.cs ---

using static MicroGPT.Helpers;

namespace MicroGPT;

public static class Chapter10Exercise
{
    public static void Run()
    {
        MultiHeadAttentionDemo();
        Console.WriteLine();
        MlpBlockDemo();
    }

    // Hand-crafted multi-head attention on a 3-position sequence.
    // embeddingSize = 8, headCount = 2, headDimension = 4. Head 0 and Head 1 are set up to
    // attend to *different* positions so we can see the independence.
    private static void MultiHeadAttentionDemo()
    {
        const int EmbeddingSize = 8;
        const int HeadCount = 2;
        const int HeadDimension = EmbeddingSize / HeadCount;

        // Each cached key has two halves: the first 4 dims serve head 0,
        // the last 4 dims serve head 1. Both halves happen to match here,
        // but they could be completely different - each head only reads its slice.
        var cachedKeys = new List<List<Value>>
        {
            new() { new(1), new(0), new(0), new(0), new(1), new(0), new(0), new(0) }, // K[0]
            new() { new(0), new(1), new(0), new(0), new(0), new(1), new(0), new(0) }, // K[1]
            new() { new(0), new(0), new(1), new(0), new(0), new(0), new(1), new(0) }, // K[2]
        };

        var cachedValues = new List<List<Value>>
        {
            new() { new(10), new(0), new(0), new(0), new(100), new(0), new(0), new(0) }, // V[0]
            new() { new(0), new(20), new(0), new(0), new(0), new(200), new(0), new(0) }, // V[1]
            new() { new(0), new(0), new(30), new(0), new(0), new(0), new(300), new(0) }, // V[2]
        };

        // Q is designed so head 0 matches K[1] and head 1 matches K[2].
        var query = new List<Value>
        {
            new(0), new(5), new(0), new(0), // head 0 slice (dims 0..3)
            new(0), new(0), new(5), new(0), // head 1 slice (dims 4..7)
        };

        var concatenatedHeads = new List<Value>();

        for (int h = 0; h < HeadCount; h++)
        {
            int headStart = h * HeadDimension;
            List<Value> queryForHead = query.GetRange(headStart, HeadDimension);

            var attentionLogits = new List<Value>();
            for (int t = 0; t < cachedKeys.Count; t++)
            {
                List<Value> keyForHead = cachedKeys[t].GetRange(headStart, HeadDimension);
                var dot = new Value(0);
                for (int j = 0; j < HeadDimension; j++)
                {
                    dot += queryForHead[j] * keyForHead[j];
                }

                attentionLogits.Add(dot / Math.Sqrt(HeadDimension));
            }

            List<Value> attentionWeights = Softmax(attentionLogits);

            var headOutput = new List<Value>();
            for (int j = 0; j < HeadDimension; j++)
            {
                headOutput.Add(new Value(0));
            }

            for (int t = 0; t < cachedValues.Count; t++)
            {
                List<Value> valueForHead = cachedValues[t].GetRange(headStart, HeadDimension);
                Value w = attentionWeights[t];
                for (int j = 0; j < HeadDimension; j++)
                {
                    headOutput[j] += w * valueForHead[j];
                }
            }

            concatenatedHeads.AddRange(headOutput); // concatenate this head's output

            Console.WriteLine(
                $"--- Head {h} (dims {headStart}..{headStart + HeadDimension - 1}) ---"
            );
            Console.WriteLine(
                $"  Q slice = [{string.Join(", ", queryForHead.Select(v => v.Data))}]"
            );
            for (int t = 0; t < attentionWeights.Count; t++)
            {
                Console.WriteLine($"  attn weight[{t}] = {attentionWeights[t].Data:F4}");
            }

            Console.WriteLine(
                $"  head output = [{string.Join(", ", headOutput.Select(v => v.Data.ToString("F2")))}]"
            );
        }

        Console.WriteLine();
        Console.WriteLine("Concatenated multi-head output (length embeddingSize = 8):");
        Console.WriteLine(
            $"  [{string.Join(", ", concatenatedHeads.Select(v => v.Data.ToString("F2")))}]"
        );
        Console.WriteLine(
            "Note how the first 4 dims are dominated by V[1] and the last 4 by V[2] -"
        );
        Console.WriteLine(
            "the two heads attended to different positions and both contributions survived."
        );
    }

    // Shows the MLP block: up-project -> ReLU -> down-project.
    // We don't train anything here; we just run a fixed input through random weights
    // to show that the shape goes embeddingSize -> 4*embeddingSize -> embeddingSize.
    private static void MlpBlockDemo()
    {
        const int EmbeddingSize = 4;
        var random = new Random(42);

        List<List<Value>> mlpUpWeights = CreateMatrix(random, 4 * EmbeddingSize, EmbeddingSize);
        List<List<Value>> mlpDownWeights = CreateMatrix(random, EmbeddingSize, 4 * EmbeddingSize);

        var x = new List<Value> { new(0.5), new(-0.3), new(1.0), new(-0.8) };
        Console.WriteLine($"--- MLP Block (embeddingSize = {EmbeddingSize}) ---");
        Console.WriteLine($"  input           ({x.Count, 2} dims): [{Format(x)}]");

        List<Value> hidden = Linear(x, mlpUpWeights);
        Console.WriteLine($"  after up-proj   ({hidden.Count, 2} dims): [{Format(hidden)}]");

        var activated = hidden.Select(v => v.Relu()).ToList();
        int negBefore = hidden.Count(v => v.Data < 0);
        Console.WriteLine(
            $"  after ReLU      ({activated.Count, 2} dims): [{Format(activated)}]  (zeroed {negBefore} negatives)"
        );

        List<Value> output = Linear(activated, mlpDownWeights);
        Console.WriteLine($"  after down-proj ({output.Count, 2} dims): [{Format(output)}]");

        static string Format(IEnumerable<Value> vs) =>
            string.Join(", ", vs.Select(v => v.Data.ToString("F3")));
    }
}

Uncomment the Chapter 10 case in the dispatcher in Program.cs:

case "ch10":
    Chapter10Exercise.Run();
    break;

Then run it:

dotnet run -- ch10

You should see head 0's attention peak at position 1, head 1's peak at position 2, and the concatenated output with distinct contributions in each half. The MLP demo shows the dimensionality change: 4 → 16 → 4, with ReLU zeroing out 6 of the 16 intermediate entries (the exact count is deterministic with Random(42)).

This exercise lives in Chapter10Exercise.cs so you can come back to it any time.

Key Distinction: Communication vs. Computation

The transformer alternates between two fundamentally different operations:

Attention is communication across time. The token at position t looks at tokens 0..t (itself and everything before it).
MLP is computation at a single position. No cross-position information flow.

That's the design pattern of the entire transformer: communicate, compute, communicate, compute, on a residual stream that carries information forward.

Chapter 9: Single-Head Attention - Tokens Looking at Each Other

Gary Jackson — Tue, 28 Apr 2026 21:30:29 +0000

What You'll Build

The attention mechanism: the only place in a transformer where a token at position t gets to look at the tokens before it (positions 0..t-1), as well as at itself. This is specifically self-attention, where the token attends to tokens in the same sequence. (You might encounter "cross-attention" in other materials, which is used in encoder-decoder models where tokens attend to a different sequence. We don't use cross-attention here.)

Depends On

Chapters 1-2, 5 (Value, Helpers).

The Core Idea

Until now, each token has been processed independently. The token at position 3 has no idea what's at positions 0, 1, or 2. Attention fixes this by letting each token ask: "what earlier tokens are relevant to me?" Because each token can only look backward (at positions before it, never ahead), this is called causal attention. The past can influence the future, but not the other way around.

It works through three separate projections of the same input:

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What information do I offer if selected?"

Where the Names Come From

If these descriptions feel a bit hand-wavy, that's because they are. The Query/Key/Value names are borrowed from database lookup, which pre-dates transformers by decades. In that world you have a query (what you want to find), the database has keys to match against, and each key has an associated value that gets returned when the key matches. Attention works the same way: Q and K dot-product together to measure "match", and V is the payload that flows through for the matches that win.

What's actually load-bearing isn't the names, it's the math. The attention formula needs two vectors to dot-product together and one to weight-and-sum, so three projections are required. You could rename them Alice, Bob, and Carol and the arithmetic would be identical. The query/key/value descriptions are a database metaphor humans use to reason about what the projections are for. The model doesn't know or care. That's why those "what am I looking for?" descriptions feel a little forced: the projections don't have to encode anything human-interpretable, they just have to play their mathematical roles.

Why Three Separate Projections?

You might wonder why we need three separate projections rather than just using the embedding directly as Q, K, and V. Each projection lets the model learn a different aspect of the same token. The query might learn to represent "what kind of character should come next", while the key learns "what kind of character am I", and the value learns "what information should I pass forward if selected". Three separate learned projections give the model the flexibility to use the same input in three different ways.

Why the Dot Product Measures Matching

The dot product multiplies matching elements of two vectors and sums them. When two vectors are aligned (point the same way), matching elements tend to share signs and contribute positive values, so the sum gets large. When they're perpendicular (unrelated), each dimension's activity in one doesn't overlap with the other, so contributions cancel or stay zero. When they're opposite, the signs disagree everywhere and the sum is large and negative.

Concrete examples with a 4-dimensional Q asking a question, and three different K candidates:

Q = [3, 2, -1, 0]                       (the query)

K_similar   = [2, 1, -1, 0]     Q·K = 3*2 + 2*1 + (-1)*(-1) + 0*0   =  9   large positive - matches
K_unrelated = [0, 0,  0, 5]     Q·K = 3*0 + 2*0 + (-1)*0   + 0*5   =  0   zero - no relationship
K_opposite  = [-3, -2, 1, 0]    Q·K = 3*-3 + 2*-2 + (-1)*1 + 0*0   = -14  large negative - anti-match

For attention, only the first case is what the model wants: a Q asking a question, and an earlier token's K offering an answer that aligns. The higher the dot product, the more relevant that earlier token is.
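
The three cases above are easy to verify yourself. This sketch uses plain double arrays rather than Value objects, and Dot is a small helper written for the illustration.

```csharp
using System;
using System.Linq;

// Dot product: multiply matching elements, then sum.
static double Dot(double[] a, double[] b) => a.Zip(b, (x, y) => x * y).Sum();

double[] q = { 3, 2, -1, 0 };
double[] kSimilar = { 2, 1, -1, 0 };
double[] kUnrelated = { 0, 0, 0, 5 };
double[] kOpposite = { -3, -2, 1, 0 };

Console.WriteLine(Dot(q, kSimilar));   // 9
Console.WriteLine(Dot(q, kUnrelated)); // 0
Console.WriteLine(Dot(q, kOpposite));  // -14
```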

The Attention Pipeline

The math flows in stages:

Q · K[t]          → score for position t  (how well Q matches K[t])
scores / sqrt(d)  → scaled logits (prevents big numbers)
softmax(logits)   → attention weights that sum to 1
weights × V[t]    → weighted contribution from position t
sum of those      → final output

Q and K never touch V directly. Q·K decides who's relevant, and V is what the relevant ones pass along. Each ingredient has one job:

K answers "do I match?" (used in the dot product)
V answers "if I'm relevant, here's what I have to share"
Q is the question being asked

The original transformer paper ("Attention Is All You Need") calls this scaled dot-product attention because of the / sqrt(d) scaling step.
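
To see what the scaling step buys you, compare the softmax of the raw scores with the softmax of the scaled ones. The sketch below uses plain doubles and its own numerically stable Softmax as a stand-in for Helpers.Softmax (whether the real helper subtracts the maximum is an assumption on my part). The scores are those produced by Q = [0, 5, 0, 0] against three basis-vector keys with d = 4.

```csharp
using System;
using System.Linq;

// Stand-in softmax: subtract the max for stability, exponentiate, normalise.
static double[] Softmax(double[] logits)
{
    double max = logits.Max();
    double[] exps = logits.Select(l => Math.Exp(l - max)).ToArray();
    double sum = exps.Sum();
    return exps.Select(e => e / sum).ToArray();
}

double[] scores = { 0.0, 5.0, 0.0 }; // raw Q·K[t] for positions 0, 1, 2
const int d = 4;                     // dimension of Q and K

double[] unscaled = Softmax(scores);
double[] scaled = Softmax(scores.Select(s => s / Math.Sqrt(d)).ToArray());

Console.WriteLine("unscaled: " + string.Join(", ", unscaled.Select(w => w.ToString("F4"))));
Console.WriteLine("scaled:   " + string.Join(", ", scaled.Select(w => w.ToString("F4"))));
// unscaled is roughly 0.0066, 0.9867, 0.0066
// scaled   is roughly 0.0705, 0.8590, 0.0705
```

Without scaling, position 1 takes almost 99% of the weight. With scaling it still wins clearly, but the other positions keep enough weight for gradients to reach them.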

A Concrete Example

Suppose we're processing the name "emma" and we're at the second 'm' (position 3, because the BOS token occupies position 0), trying to predict what comes next.

The model might learn a query like "what vowels appeared recently?". The earlier 'e' at position 1 would have a key that matches this query well, giving it a high attention weight. Its value (encoding information about being a vowel) flows into the current position. The first 'm' might get a lower weight because its key doesn't match the query as well.

This is how the model learns long-range patterns like "after two consonants, a vowel is likely".

The KV Cache

A key design detail: during both training and inference, we process one token at a time. After computing K and V for the current token, we append them to a cache. When computing attention for position 5, the cache already holds K and V from positions 0-4; we append position 5's own K and V, then compute dot products between the current Q and every cached K (positions 0-5).

"Doesn't the KV cache make training different from inference?" Not algorithmically. In production systems, the KV cache is usually only used at inference, because during training all positions are processed in parallel using matrix operations. But the math is identical. MicroGPT processes one token at a time during both training and inference, making the KV cache explicit in both cases.

"Where is the causal mask?" If you've read nanoGPT or other batched transformer code, you've probably seen a lower-triangular tril matrix used to mask the attention scores: scores for future positions are set to negative infinity before the softmax, so their attention weights come out as zero. MicroGPT has no such mask, and it doesn't need one. Because we build the sequence token-by-token and append to cachedKeys and cachedValues as we go, the only keys and values in scope when computing attention at position t are the ones from positions 0 through t. The future tokens physically aren't in the cache yet, so there is nothing to mask. Sequential KV caching replaces matrix masking - same causality, different shape.

A subtle but important point: the cached keys and values are not frozen numbers. They're live Value objects that are part of the computation graph. When Backward runs, gradients flow through the cached values just like any other Value. That's what makes attention learnable. The model adjusts the weight projections (queryWeights, keyWeights, valueWeights) based on how the cached keys and values contributed to the final loss.
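
Returning to the "where is the causal mask?" question: the claim that a growing cache and an explicit mask give the same result can be checked numerically. This is a sketch with plain doubles and hypothetical scores, not code from the project. The batched style keeps the full row of scores and pushes future positions to negative infinity before the softmax. The cache style simply never sees them.

```csharp
using System;
using System.Linq;

// Stand-in softmax over plain doubles (max-subtracted for stability).
static double[] Softmax(double[] logits)
{
    double max = logits.Max();
    double[] exps = logits.Select(l => Math.Exp(l - max)).ToArray();
    double sum = exps.Sum();
    return exps.Select(e => e / sum).ToArray();
}

// Hypothetical scaled scores of one query against four positions.
double[] scores = { 1.0, 2.0, 3.0, 4.0 };
int t = 1; // the query sits at position 1, so positions 2 and 3 are the future

// Batched style: mask future scores with -infinity, then softmax the whole row.
double[] masked = scores.Select((s, i) => i <= t ? s : double.NegativeInfinity).ToArray();
double[] maskedWeights = Softmax(masked);

// Cache style: only positions 0..t exist yet.
double[] cacheWeights = Softmax(scores.Take(t + 1).ToArray());

Console.WriteLine("masked: " + string.Join(", ", maskedWeights.Select(w => w.ToString("F4"))));
Console.WriteLine("cache:  " + string.Join(", ", cacheWeights.Select(w => w.ToString("F4"))));
// masked is roughly 0.2689, 0.7311, 0.0000, 0.0000
// cache  is roughly 0.2689, 0.7311
```

The weights on positions 0 and 1 match, and the masked version just carries two extra zeros.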

Code

Here's the shape of scaled dot-product attention, extracted from the full GptModel.Forward you'll build in Chapter 11. The runnable exercise below walks through the same score, softmax, and weighted-sum steps with hand-built vectors, and Chapter 10 generalises them to multiple heads.

The four weight matrices. queryWeights, keyWeights, valueWeights, and outputWeights below are the learned matrices for this attention layer. Three of them turn the input into Q, K, and V projections. The fourth (outputWeights) is applied at the very end to mix the attention result back into the model's internal representation. This becomes important in Chapter 10 when multi-head attention needs to mix information across heads. In Chapter 11's GptModel you'll see these stored under GPT-2-style state dict keys (attn_wq, attn_wk, attn_wv, attn_wo) so PyTorch checkpoints could map directly, but the descriptive parameter names used here make the roles clearer.

// Shape reference - Chapter 11 integrates this into GptModel.Forward.
// embeddingSize is the embedding dimension (16 in our model, set in Chapter 6)

List<Value> SingleHeadAttention(
    List<Value> x,
    List<List<Value>> cachedKeys,
    List<List<Value>> cachedValues,
    List<List<Value>> queryWeights,
    List<List<Value>> keyWeights,
    List<List<Value>> valueWeights,
    List<List<Value>> outputWeights
)
{
    List<Value> query = Helpers.Linear(x, queryWeights);
    List<Value> key = Helpers.Linear(x, keyWeights);
    List<Value> value = Helpers.Linear(x, valueWeights);

    cachedKeys.Add(key);
    cachedValues.Add(value);

    var attentionLogits = new List<Value>();
    for (int t = 0; t < cachedKeys.Count; t++)
    {
        var dot = new Value(0);
        for (int j = 0; j < embeddingSize; j++)
        {
            dot += query[j] * cachedKeys[t][j];
        }

        // Scale by sqrt(embeddingSize) to keep the dot products in a reasonable range.
        // Without this, larger embedding dimensions produce larger dot products,
        // which push Softmax toward extreme values (all weight on one token).
        attentionLogits.Add(dot / Math.Sqrt(embeddingSize));
    }

    List<Value> attentionWeights = Helpers.Softmax(attentionLogits);

    var output = new List<Value>();
    for (int j = 0; j < embeddingSize; j++)
    {
        output.Add(new Value(0));
    }

    for (int t = 0; t < cachedValues.Count; t++)
    {
        Value w = attentionWeights[t];
        for (int j = 0; j < embeddingSize; j++)
        {
            output[j] += w * cachedValues[t][j];
        }
    }

    return Helpers.Linear(output, outputWeights);
}

Exercise: Hand-Crafted Single-Head Attention

The code above is what GptModel will use, but it takes learned projections (queryWeights, keyWeights, valueWeights, outputWeights), which means you can't run it meaningfully until training starts in Chapter 11. To actually see attention working, the exercise below skips the projections and constructs Q, K, and V by hand, picking directions on purpose so you can predict which position should win.

The setup: three cached positions with embeddingSize = 4, where each K[t] points in a different basis direction, each V[t] carries a distinct payload, and Q is aligned with K[1]. If the math is right, position 1 should receive nearly all the attention weight, and the output vector should look mostly like V[1].

Create Chapter9Exercise.cs:

// --- Chapter9Exercise.cs ---

using static MicroGPT.Helpers;

namespace MicroGPT;

public static class Chapter9Exercise
{
    public static void Run()
    {
        // Hand-crafted single-head attention on a 3-position sequence with embeddingSize=4.
        // We skip the Q/K/V projections and just build K, V, and Q directly so we can
        // see exactly which position the query attends to.
        const int EmbeddingSize = 4;

        // Three cached positions. Each key points in a different basis direction.
        var cachedKeys = new List<List<Value>>
        {
            new() { new(1), new(0), new(0), new(0) }, // K[0]
            new() { new(0), new(1), new(0), new(0) }, // K[1]
            new() { new(0), new(0), new(1), new(0) }, // K[2]
        };

        // Each value carries a distinct "payload" so we can see whose information
        // flows through to the output.
        var cachedValues = new List<List<Value>>
        {
            new() { new(10), new(0), new(0), new(0) }, // V[0] = 10 in slot 0
            new() { new(0), new(20), new(0), new(0) }, // V[1] = 20 in slot 1
            new() { new(0), new(0), new(30), new(0) }, // V[2] = 30 in slot 2
        };

        // The current query points strongly in the same direction as K[1].
        // Expectation: position 1 gets the highest attention weight, and
        // the output should look mostly like V[1] = [0, 20, 0, 0].
        var query = new List<Value> { new(0), new(5), new(0), new(0) };

        List<Value> attentionLogits = ComputeAttentionLogits(query, cachedKeys, EmbeddingSize);
        List<Value> attentionWeights = Softmax(attentionLogits);
        List<Value> output = ComputeAttentionOutput(attentionWeights, cachedValues, EmbeddingSize);

        Console.WriteLine("--- Single-Head Attention ---");
        Console.WriteLine($"Q = [{string.Join(", ", query.Select(v => v.Data))}]");
        Console.WriteLine();

        Console.WriteLine("Attention weights (should peak at position 1):");
        for (int t = 0; t < attentionWeights.Count; t++)
        {
            Console.WriteLine($"  position {t}: {attentionWeights[t].Data:F4}");
        }

        Console.WriteLine();
        Console.WriteLine("Output vector (should look mostly like V[1] = [0, 20, 0, 0]):");
        Console.WriteLine($"  [{string.Join(", ", output.Select(v => v.Data.ToString("F3")))}]");

        // Sanity check: position 1 should have the highest weight.
        int topPosition = 0;
        for (int t = 1; t < attentionWeights.Count; t++)
        {
            if (attentionWeights[t].Data > attentionWeights[topPosition].Data)
            {
                topPosition = t;
            }
        }

        Console.WriteLine();
        Console.WriteLine(
            $"Top-attended position: {topPosition} ({(topPosition == 1 ? "PASS" : "FAIL")})"
        );
    }

    // Scaled dot-product attention scores: score[t] = (query . keys[t]) / sqrt(embeddingSize)
    private static List<Value> ComputeAttentionLogits(
        List<Value> query,
        List<List<Value>> cachedKeys,
        int embeddingSize
    )
    {
        var attentionLogits = new List<Value>();
        for (int t = 0; t < cachedKeys.Count; t++)
        {
            var dot = new Value(0);
            for (int j = 0; j < embeddingSize; j++)
            {
                dot += query[j] * cachedKeys[t][j];
            }

            attentionLogits.Add(dot / Math.Sqrt(embeddingSize));
        }
        return attentionLogits;
    }

    // Weighted sum of value vectors, using the attention weights.
    private static List<Value> ComputeAttentionOutput(
        List<Value> attentionWeights,
        List<List<Value>> cachedValues,
        int embeddingSize
    )
    {
        var output = new List<Value>();
        for (int j = 0; j < embeddingSize; j++)
        {
            output.Add(new Value(0));
        }

        for (int t = 0; t < cachedValues.Count; t++)
        {
            Value w = attentionWeights[t];
            for (int j = 0; j < embeddingSize; j++)
            {
                output[j] += w * cachedValues[t][j];
            }
        }
        return output;
    }
}

Uncomment the Chapter 9 case in the dispatcher in Program.cs:

case "ch9":
    Chapter9Exercise.Run();
    break;

Then run it:

dotnet run -- ch9

You should see attention weights of approximately [0.07, 0.86, 0.07] and an output vector dominated by the 20 from slot 1. If you change query to point at K[0] or K[2], the peak moves accordingly. Try it. That's the whole attention mechanism in ~40 lines of arithmetic.

Chapter 8: RMS Normalisation and Residual Connections

Gary Jackson — Mon, 27 Apr 2026 20:46:56 +0000

What You'll Build

Two architectural patterns that make deep networks trainable: RMSNorm (keeps activations from exploding or vanishing) and residual connections (give gradients a highway to flow through).

Depends On

Chapters 1-2 (Value), Chapter 5 (Helpers).

The Problem They Solve

As data flows through many Linear operations and activation functions like ReLU (both of which you've already seen), the magnitude of the numbers can drift. They grow huge, or shrink to near-zero. Both are catastrophic for training. RMSNorm rescales the numbers after each layer to keep them in a stable range, and residual connections let the original signal bypass each layer entirely.

RMSNorm

Imagine a vector of numbers flowing through the network. After a few Linear operations, those numbers might have drifted to very large values like [500, 800, 300] or very small ones like [0.001, 0.002, 0.001]. RMSNorm fixes this by measuring the overall "size" of the vector (using the root mean square: the square root of the average of the squared values) and then dividing each element by that size. The result is a vector whose overall magnitude is always close to 1, regardless of what happened in previous layers.

Why root-mean-square specifically? This is the same RMS pattern we saw in Adam's squared gradient average in Chapter 7, and for the same two reasons:

Makes values positive. We care about overall magnitude, not direction. A vector [-5, 5] has the same "size" as [5, -5], and squaring makes the calculation agree.
Emphasises larger values. A value of 10 contributes 100 to the sum; a value of 1 contributes just 1. So the measure is dominated by the biggest elements rather than being smeared across all of them.

Squaring on the way in and square-rooting on the way out gives us a single number that represents the vector's "typical size". Dividing by that scale leaves a vector whose overall magnitude is ~1.

Add it to Helpers.cs:

// --- Helpers.cs (add inside the Helpers class) ---

/// <summary>
/// Rescales a vector so its overall magnitude is close to 1, using the root mean
/// square of its values. Keeps activations stable across deep networks.
/// </summary>
public static List<Value> RmsNorm(List<Value> x)
{
    var sumSq = new Value(0);
    foreach (Value xi in x)
    {
        sumSq += xi * xi;
    }

    Value ms = sumSq / x.Count;
    Value scale = (ms + 1e-5).Pow(-0.5);
    return [.. x.Select(xi => xi * scale)];
}

The 1e-5 prevents division by zero if all values happen to be zero. RMSNorm was introduced by Zhang & Sennrich (2019) as a simpler alternative to LayerNorm (used in the original GPT-2). It drops LayerNorm's mean-subtraction step and its learned shift, making it faster while achieving similar results. The paper's version keeps a learned per-element gain; the RmsNorm above leaves that out too, so it has no parameters at all. See the References section for the paper.
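
To see the "magnitude is always close to 1" claim in action, here is a minimal sketch that repeats the same arithmetic on plain doubles instead of Value objects. RmsNormPlain is a hypothetical stand-in written for this illustration, not part of Helpers.cs:

```csharp
using System;
using System.Linq;

// Plain-double version of RmsNorm, mirroring the helper above without Value.
// (Hypothetical name, for illustration only.)
double[] RmsNormPlain(double[] x)
{
    double meanSquare = x.Sum(xi => xi * xi) / x.Length;
    double scale = 1.0 / Math.Sqrt(meanSquare + 1e-5);
    return x.Select(xi => xi * scale).ToArray();
}

double[] small = RmsNormPlain(new[] { 3.0, 4.0 });
double[] large = RmsNormPlain(new[] { 300.0, 400.0 });

// Both inputs point in the same direction, so both lines show the same two numbers.
Console.WriteLine(string.Join(", ", small.Select(v => v.ToString("F3"))));
Console.WriteLine(string.Join(", ", large.Select(v => v.ToString("F3"))));
```

The second vector is 100 times larger than the first, yet both normalise to the same result. That is the property the rest of the network relies on: whatever scale the previous layer produced, the next layer sees values of a familiar size.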

Residual Connections

A residual connection simply adds a layer's input back to its output. It isn't a separate function, it's a pattern applied inline wherever a transformation occurs:

// Pattern - not a standalone function, used inside Model.cs in Chapter 11

var xResidual = new List<Value>(x);
x = SomeTransformation(x);
for (int i = 0; i < x.Count; i++)
{
    x[i] += xResidual[i];
}

This has a profound effect on gradient flow. During backpropagation, the gradient at the residual addition is just copied to both branches (local gradient of addition is 1). This means gradients can flow directly from the loss to early layers without being diminished by intermediate transformations.

This is the value-reuse pattern from Chapter 2 in its most important form: the Value objects in xResidual are the same objects that SomeTransformation(x) was built from, so Backward() reaches them via two paths and accumulates both contributions onto their .Grad via the += we flagged back then. Without that accumulation, the skip path's contribution would silently overwrite the layer path's, and the "gradient highway" would collapse.
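
A rough way to feel the difference is to multiply the local gradients through a stack of layers by hand. The sketch below assumes a made-up toy layer f(x) = 0.1 * x (so its local gradient is 0.1) and ten layers; neither number comes from the real model:

```csharp
using System;

// Hypothetical toy layer: f(x) = 0.1 * x, so its local gradient is 0.1.
const double LayerGradient = 0.1;
const int Depth = 10;

double plainGradient = 1.0;     // d(output)/d(input) when each layer is x = f(x)
double residualGradient = 1.0;  // d(output)/d(input) when each layer is x = f(x) + x

for (int layer = 0; layer < Depth; layer++)
{
    // Chain rule: multiply by each layer's local gradient.
    plainGradient *= LayerGradient;
    // With the skip path, the layer contributes LayerGradient and the skip contributes 1.
    residualGradient *= LayerGradient + 1.0;
}

Console.WriteLine($"without residuals: {plainGradient:E2}");
Console.WriteLine($"with residuals:    {residualGradient:F3}");
```

Without the skip path, the gradient reaching the first layer has been multiplied by 0.1 ten times and has all but vanished. With it, every factor includes the skip path's 1, so in this toy case the signal survives the whole stack.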

Why This Chapter Exists Separately

You might wonder why we don't just introduce these inside the attention chapter. The reason is that RMSNorm and residual connections are independent concepts that show up in many architectures beyond transformers. Understanding them in isolation makes it clear they aren't "part of attention". They're stabilisation techniques that wrap around any layer.

Exercise

Create Chapter8Exercise.cs:

// --- Chapter8Exercise.cs ---

using static MicroGPT.Helpers;

namespace MicroGPT;

public static class Chapter8Exercise
{
    public static void Run()
    {
        // ── Test RmsNorm ──
        // RMS of [3, 4] = sqrt((9+16)/2) = sqrt(12.5) ~ 3.536
        // Normed ~ [3/3.536, 4/3.536] ~ [0.849, 1.131]
        var testVec = new List<Value> { new(3.0), new(4.0) };
        List<Value> normed = RmsNorm(testVec);
        Console.WriteLine("--- RmsNorm ---");
        Console.WriteLine("Expected: 0.849 1.131");
        Console.Write("Got:      ");
        foreach (Value v in normed)
        {
            Console.Write($"{v.Data:F3} ");
        }
        Console.WriteLine();

        // Try it with a "drifted" vector - large values get scaled down
        // RMS of [500, 800, 300] ~ 571.548; normed ~ [0.875, 1.400, 0.525]
        // Values are now close to 1.0 in magnitude, regardless of the original scale.
        var bigVec = new List<Value> { new(500.0), new(800.0), new(300.0) };
        List<Value> bigNormed = RmsNorm(bigVec);
        Console.WriteLine("--- RmsNorm on large values ---");
        Console.WriteLine("Expected: 0.875 1.400 0.525");
        Console.Write("Got:      ");
        foreach (Value v in bigNormed)
        {
            Console.Write($"{v.Data:F3} ");
        }
        Console.WriteLine();

        // ── Test Residual Connection ──
        // Start with [1, 2], apply a transformation (double each value),
        // then add the original back: [2+1, 4+2] = [3, 6]
        var x = new List<Value> { new(1.0), new(2.0) };
        var xResidual = new List<Value>(x);

        // "Transformation": double each value
        x = [.. x.Select(xi => xi * 2.0)];

        // Residual: add original back
        for (int i = 0; i < x.Count; i++)
        {
            x[i] += xResidual[i];
        }

        Console.WriteLine("--- Residual Connection ---");
        Console.WriteLine("Expected: 3.0 6.0  (transformation output + original input)");
        Console.Write("Got:      ");
        foreach (Value v in x)
        {
            Console.Write($"{v.Data:F1} ");
        }
        Console.WriteLine();
    }
}

Uncomment the Chapter 8 case in the dispatcher in Program.cs:

case "ch8":
    Chapter8Exercise.Run();
    break;

Then run it:

dotnet run -- ch8

Chapter 7: The Training Loop and Adam Optimiser

Gary Jackson — Sun, 26 Apr 2026 21:06:46 +0000

What You'll Build

A complete training loop that processes documents, computes loss, backpropagates gradients, and updates parameters using the Adam optimiser.

Depends On

All previous chapters.

The Training Loop

A training step is just five things in a row:

Pick a document and tokenize it
Forward pass for each token, building up the loss
Backward pass to fill in every gradient
Nudge the parameters using those gradients
Zero the gradients out before the next step

Step 4 is where Adam lives. Before we look at the code, it's worth slowing down on what Adam actually does and why we use it.

Understanding Adam

You could update parameters with simple gradient descent: p.Data -= learningRate * p.Grad. Adam is smarter in two ways.

Momentum (momentum). Instead of reacting to each individual gradient, Adam tracks a running average of recent gradients. This smooths out noisy updates, like a rolling ball that doesn't reverse direction every time it hits a bump.

Squared gradient average (squaredGradAvg). Adam also tracks the running average of each parameter's squared gradient. Squaring serves two purposes:

Makes values positive. We want to track how large gradients have been, not their direction. A gradient of -5 and +5 should both count as "large".
Emphasises larger gradients. A gradient of 10 contributes 100 to the average, a gradient of 1 contributes just 1. So a parameter with occasional huge gradients gets dampened more than one with steady moderate gradients.

When the update happens, Adam divides by the square root of this number. Parameters with consistently large gradients get a smaller effective step size and vice versa, so each parameter ends up with its own adapted learning rate. The squaring-then-square-rooting gives us what's called the RMS (root mean square) of the gradient, effectively a rolling "typical size" of recent gradients.

Bias correction (correctedMomentum, correctedSquaredGrad). Because momentum and squaredGradAvg start at zero, they're biased toward zero in early steps. The correction factors 1 / (1 - beta^(step+1)), where beta is the matching smoothing constant, compensate for this warm-up period.
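
Here is a small sketch of the very first Adam step for a single parameter, using the same constants as the code later in this chapter. FirstAdamStep is a hypothetical helper written just for this illustration, and the two gradients (100 and 0.01) are made-up values:

```csharp
using System;

const double MomentumSmoothing = 0.85;
const double SquaredGradSmoothing = 0.99;
const double Epsilon = 1e-8;
const double LearningRate = 0.01;

// First Adam step (step = 0) for one parameter whose state starts at zero.
// Returns the raw momentum, the bias-corrected momentum, and the size of the update.
(double Raw, double Corrected, double Update) FirstAdamStep(double grad)
{
    int step = 0;
    double momentum = (1 - MomentumSmoothing) * grad;
    double squaredGradAvg = (1 - SquaredGradSmoothing) * grad * grad;
    double correctedMomentum = momentum / (1 - Math.Pow(MomentumSmoothing, step + 1));
    double correctedSquaredGrad =
        squaredGradAvg / (1 - Math.Pow(SquaredGradSmoothing, step + 1));
    double update =
        LearningRate * correctedMomentum / (Math.Sqrt(correctedSquaredGrad) + Epsilon);
    return (momentum, correctedMomentum, update);
}

var big = FirstAdamStep(100.0);
var tiny = FirstAdamStep(0.01);

Console.WriteLine($"grad 100:  raw {big.Raw:F4}, corrected {big.Corrected:F4}, update {big.Update:F6}");
Console.WriteLine($"grad 0.01: raw {tiny.Raw:F4}, corrected {tiny.Corrected:F4}, update {tiny.Update:F6}");
```

Two things are worth noticing. The raw momentum is only 15% of the gradient, and the correction restores it to the full value. And despite the gradients differing by a factor of 10,000, both updates come out at roughly the learning rate, which is the "own adapted learning rate" idea in miniature.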

Learning rate decay. The learning rate decreases linearly over training: currentLearningRate = learningRate * (1 - step/numSteps). This allows large steps early on (when parameters are far from good values) and smaller, more precise steps later. Because step runs from 0 to numSteps - 1, the rate never quite reaches zero: the final step still uses learningRate / numSteps. The model makes progressively smaller adjustments as training continues, effectively locking in what it has learned.
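
A quick sketch of the schedule at a few steps, using this chapter's values of learningRate = 0.01 and numSteps = 1000. DecayedLearningRate is a hypothetical helper name used only here:

```csharp
using System;

const double LearningRate = 0.01;
const int NumSteps = 1000;

// Same formula as currentLearningRate in the training loop.
double DecayedLearningRate(int step) => LearningRate * (1 - (double)step / NumSteps);

foreach (int step in new[] { 0, 250, 500, 750, 999 })
{
    Console.WriteLine($"step {step,4}: {DecayedLearningRate(step):F5}");
}
```

The cast to double matters: step / NumSteps on two ints would be integer division and give 0 for every step, which would silently switch the decay off.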

The constants MomentumSmoothing, SquaredGradSmoothing, and Epsilon in the code are Adam's hyperparameters. MomentumSmoothing controls how much smoothing is applied to the momentum (higher = more smoothing, more memory of past gradients), SquaredGradSmoothing does the same for the squared gradient average, and Epsilon is a tiny number that prevents division by zero. Standard defaults are MomentumSmoothing=0.9, SquaredGradSmoothing=0.999, but we use 0.85, 0.99 for faster training on this small problem.

Code

// --- Chapter7Exercise.cs ---

using static MicroGPT.Helpers;

namespace MicroGPT;

public static class Chapter7Exercise
{
    public static void Run()
    {
        var random = new Random(42);
        List<string> docs = Tokenizer.LoadDocs("input.txt", random);
        var tokenizer = new Tokenizer(docs);
        Console.WriteLine($"num docs: {docs.Count}");
        Console.WriteLine($"vocab size: {tokenizer.VocabSize}");

        // ── Simplified model (replaced by GptModel in Chapter 11) ──
        int embeddingSize = 16;
        int maxSequenceLength = 8;
        int numSteps = 1000;
        double learningRate = 0.01;

        List<List<Value>> tokenEmbeddings = CreateMatrix(
            random,
            tokenizer.VocabSize,
            embeddingSize
        );
        List<List<Value>> positionEmbeddings = CreateMatrix(
            random,
            maxSequenceLength,
            embeddingSize
        );
        List<List<Value>> outputProjection = CreateMatrix(
            random,
            tokenizer.VocabSize,
            embeddingSize
        );

        // Collect all parameters into a flat list for the optimiser
        var paramsList = new List<Value>();
        foreach (List<Value> row in tokenEmbeddings)
        {
            paramsList.AddRange(row);
        }

        foreach (List<Value> row in positionEmbeddings)
        {
            paramsList.AddRange(row);
        }

        foreach (List<Value> row in outputProjection)
        {
            paramsList.AddRange(row);
        }

        Console.WriteLine($"num params: {paramsList.Count}");

        // ── Adam optimiser ──
        // Note: the standard Adam defaults are MomentumSmoothing=0.9, SquaredGradSmoothing=0.999.
        // We use more aggressive values here to train faster on this small problem.
        const double MomentumSmoothing = 0.85,
            SquaredGradSmoothing = 0.99,
            Epsilon = 1e-8;
        double[] momentum = new double[paramsList.Count];
        double[] squaredGradAvg = new double[paramsList.Count];

        // Reusable buffers for Backward. These are what the parameterless Backward()
        // overload from Chapter 2 allocates internally on every call. Here we hoist
        // them out of the hot loop so 1,000 training steps don't allocate 1,000
        // fresh copies.
        var topo = new List<Value>();
        var visited = new HashSet<Value>();
        var backwardStack = new Stack<(Value, int)>();

        for (int step = 0; step < numSteps; step++)
        {
            string doc = docs[step % docs.Count];
            var tokens = new List<int> { tokenizer.Bos };
            tokens.AddRange(doc.Select(tokenizer.Encode));
            tokens.Add(tokenizer.Bos);
            // Any name longer than maxSequenceLength - 1 is silently truncated here.
            int tokenCount = Math.Min(maxSequenceLength, tokens.Count - 1);

            var losses = new List<Value>();
            for (int posId = 0; posId < tokenCount; posId++)
            {
                List<Value> logits = Forward(
                    tokens[posId],
                    posId,
                    tokenEmbeddings,
                    positionEmbeddings,
                    outputProjection,
                    embeddingSize
                );
                List<Value> probabilities = Softmax(logits);
                losses.Add(-probabilities[tokens[posId + 1]].Log());
            }

            // Average the per-position losses into a single scalar
            var loss = new Value(0);
            foreach (Value l in losses)
            {
                loss += l;
            }

            loss *= 1.0 / tokenCount;

            foreach (Value p in paramsList)
            {
                p.Grad = 0;
            }

            topo.Clear();
            visited.Clear();
            backwardStack.Clear();
            loss.Backward(topo, visited, backwardStack);

            double currentLearningRate = learningRate * (1 - (double)step / numSteps);
            for (int i = 0; i < paramsList.Count; i++)
            {
                Value p = paramsList[i];
                momentum[i] =
                    MomentumSmoothing * momentum[i] + (1 - MomentumSmoothing) * p.Grad;
                squaredGradAvg[i] =
                    SquaredGradSmoothing * squaredGradAvg[i]
                    + (1 - SquaredGradSmoothing) * Math.Pow(p.Grad, 2);
                double correctedMomentum =
                    momentum[i] / (1 - Math.Pow(MomentumSmoothing, step + 1));
                double correctedSquaredGrad =
                    squaredGradAvg[i] / (1 - Math.Pow(SquaredGradSmoothing, step + 1));
                p.Data -=
                    currentLearningRate
                    * correctedMomentum
                    / (Math.Sqrt(correctedSquaredGrad) + Epsilon);
            }

            if (step == 0 || (step + 1) % 100 == 0)
            {
                Console.WriteLine(
                    $"step {step + 1, 4} / {numSteps, 4} | loss {loss.Data:F4}"
                );
            }
        }
    }

    private static List<Value> Forward(
        int tokenId,
        int posId,
        List<List<Value>> tokenEmbeddings,
        List<List<Value>> positionEmbeddings,
        List<List<Value>> outputProjection,
        int embeddingSize
    )
    {
        List<Value> tokenEmbedding = tokenEmbeddings[tokenId];
        List<Value> positionEmbedding = positionEmbeddings[posId];
        var x = new List<Value>();
        for (int i = 0; i < embeddingSize; i++)
        {
            x.Add(tokenEmbedding[i] + positionEmbedding[i]);
        }

        return Linear(x, outputProjection);
    }
}

Code Walkthrough

Breaking the code down section by section:

Setup (docs through tokenizer). Load the dataset, build the tokenizer, print stats. Same as Chapter 6.

Hyperparameters (embeddingSize through learningRate). Two we've seen (embeddingSize, maxSequenceLength), and two new ones:

numSteps = 1000 - how many training iterations to run
learningRate = 0.01 - the starting size of each parameter update

Model setup (tokenEmbeddings through outputProjection). Three embedding/projection matrices. The whole setup will be replaced by GptModel in Chapter 11.

Parameter list (paramsList). Adam needs to update every learnable number, so we flatten all three matrices into one big list of Value objects. For our sizes that's 27*16 + 8*16 + 27*16 = 992 parameters.

Adam state (MomentumSmoothing through squaredGradAvg). The Adam constants (smoothing factors and Epsilon), plus two double arrays, one for momentum and one for squared gradient averages. Each array has one entry per parameter, all starting at zero.

Backward buffers (topo, visited, backwardStack). Pre-allocated lists/sets/stacks for Backward() to reuse across all 1000 steps. This is the 3-argument Backward() overload we built in Chapter 2. The parameterless version would allocate fresh buffers every call.

Training loop (for (int step ...). This is the heart of training. Each iteration is one step:

Pick and tokenize a doc. Cycle through docs with step % docs.Count, wrap with BOS on both sides, cap length at maxSequenceLength. The modulo is defensive: if numSteps ever exceeds docs.Count, the loop wraps back to the start.
Forward pass (the posId loop). For each position in the name, run Forward to get logits, softmax to get probabilities, then collect -log(probability of correct next token) as the loss. A 5-character name produces 6 losses: one for each of its 5 characters, plus one for the closing BOS.
Average losses (loss *= 1.0 / tokenCount). Sum the per-position losses and divide by count to get one scalar loss for the whole document.
Backward (loss.Backward(...)). Reset all gradients to zero, clear the buffers, then call Backward(), which fills in .Grad on every Value using the algorithm from Chapter 2.
Adam update (the paramsList loop). The five lines from the Adam explanation above:
1. Compute the decayed learning rate for this step
2. Update momentum (running avg of gradients)
3. Update squared gradient average
4. Apply bias correction to both
5. Take the step: p.Data -= currentLearningRate * correctedMomentum / (sqrt(correctedSquaredGrad) + epsilon)
Print progress. Log every 100 steps so you can watch the loss go down.

Forward method. Look up token embedding, add position embedding, project to vocab size. Takes the three matrices and embeddingSize as explicit parameters so the forward pass's dependencies are visible at the call site. This is what gets replaced by GptModel.Forward in Chapter 11.

The loop runs 1,000 times, each time nudging the 992 parameters slightly toward something that gives lower loss. By the end, you'll see loss drop from ~3.3 (random guessing) down toward the bigram baseline of ~2.45.

A note on name length. maxSequenceLength = 8 means we train on at most the first 7 characters of each name plus a BOS token. Longer names like "alexandra" or "elizabeth" are silently truncated by the Math.Min on the tokenCount line above. If later on you see the model under-generating long names during inference, this is why. Raising maxSequenceLength to 16 covers ~100% of the dataset but roughly doubles training time, because every position still runs a forward pass. We keep it at 8 for course speed.
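
To make the truncation concrete, this sketch repeats the tokenCount arithmetic for three names. TokenCount is a hypothetical helper, and "emma" and "rebecca" are just illustrative inputs:

```csharp
using System;

const int MaxSequenceLength = 8;

// Mirrors the tokenCount line in the training loop: BOS + name + BOS gives
// name.Length + 2 tokens, and each position predicts the token after it.
int TokenCount(string name) => Math.Min(MaxSequenceLength, name.Length + 1);

foreach (string name in new[] { "emma", "rebecca", "alexandra" })
{
    int tokenCount = TokenCount(name);
    // The closing BOS is only a training target when the whole name fits.
    bool learnsToStop = tokenCount == name.Length + 1;
    Console.WriteLine(
        $"{name,-10} losses: {tokenCount}, closing BOS predicted: {learnsToStop}"
    );
}
```

A 7-character name is the longest that fits completely. For anything longer, the model never sees the closing BOS as a target, so those names teach it nothing about where long names end.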

Uncomment the Chapter 7 case in the dispatcher in Program.cs:

case "ch7":
    Chapter7Exercise.Run();
    break;

Then run it:

dotnet run -c Release -- ch7

(Release mode matters here. Debug mode is significantly slower because Value allocations dominate. On a modern CPU this runs in under a minute in Release mode.)

What You Should See

The first step prints a loss around 3.3 (random guessing). Over 1,000 steps, the trend moves downward. Don't worry if individual steps bounce around. Each step trains on a single document, so the loss is noisy. One step might land at 1.7, the next at 2.8. What matters is the overall trend, not any single value.

By the end of training, the loss lands in a similar range to the bigram baseline from Chapter 4 (~2.45). That might feel disappointing. Why build a neural network to match a counting table? The answer is that this model still processes each position independently. It has no way for tokens to look at each other, so it's effectively a neural version of the bigram. The components that let the model use longer context (attention in Chapter 9, then multi-head attention and the MLP in Chapter 10) are what will push the loss well below the bigram baseline when we assemble the full model in Chapter 11.

A note on evaluation. We're computing the loss on the same data we train on. In a production setting, you'd hold out a portion of the data for validation, to detect overfitting (the model memorising training examples rather than learning general patterns). For the purpose of understanding the architecture, this simplification is fine.
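
If you want to try a hold-out split yourself, a minimal sketch looks like this. It assumes the list is already shuffled (as the one from Tokenizer.LoadDocs is taken to be), and the ten names and the 10% figure are placeholders rather than anything this course uses:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the shuffled list returned by Tokenizer.LoadDocs (hypothetical data).
List<string> docs = new()
{
    "emma", "olivia", "ava", "isabella", "sophia",
    "mia", "charlotte", "amelia", "harper", "evelyn",
};

// Hold out the last 10% (at least one document) for validation.
int validationCount = Math.Max(1, docs.Count / 10);
List<string> trainDocs = docs.Take(docs.Count - validationCount).ToList();
List<string> validationDocs = docs.Skip(docs.Count - validationCount).ToList();

Console.WriteLine($"train: {trainDocs.Count}, validation: {validationDocs.Count}");
```

You would then train only on trainDocs and periodically compute the loss on validationDocs without calling Backward. If the training loss keeps falling while the validation loss starts rising, the model is memorising rather than generalising.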

Chapter 6: Embeddings, the Forward Pass, and the Loss Function

Gary Jackson — Sat, 25 Apr 2026 22:12:26 +0000

What You'll Build

Embedding tables that give each token and each position a learned vector, a minimal forward pass that produces logits, and the loss function that measures how wrong the predictions are.

Depends On

Chapters 1-3, 5 (Value, Tokenizer, Helpers).

Embeddings: Giving Tokens an Identity

The model needs two pieces of information about each token: what the token is, and where it appears in the sequence. Each piece gets its own embedding. We'll start with the first one (token embeddings) and cover position embeddings in the next section.

So far, each token is just an integer: a is 0, b is 1, z is 25. A neural network can't do anything useful with a raw integer. It needs a richer representation, a list of numbers that captures something meaningful about each token. Maybe the first number captures "how often this letter starts a name" and the second captures "how vowel-like it is". We don't hand-pick these meanings. The network discovers them during training.

This list of numbers for each token is called an embedding. An embedding table is just a matrix where row i is the embedding for token i:

Letter   Token ID   Token Embedding (4 numbers in this example)
─────    ────────   ───────────────────────────────────────────
  a         0       [ 0.02, -0.05,  0.11,  0.03]
  b         1       [-0.07,  0.04,  0.01, -0.09]
  c         2       [ 0.06,  0.08, -0.03,  0.05]
 ...       ...       ...
  .        26       [ 0.01, -0.02,  0.04, -0.01]   ← BOS

At the start, every embedding is random (like the small numbers above). By the end of training, tokens that behave similarly in names will end up with similar embeddings.

Every parameter in the network starts as a random number, but the range matters. If values are too large, numbers flowing through the network can explode in size, and gradients stop being useful. We want values centred around zero (both positive and negative) and mostly very small. A bell curve distribution with a small standard deviation (0.08 in the code) gives us exactly that: about 95% of values land between -0.16 and 0.16 (two standard deviations either side of zero), clustering tightly near zero.

The 0.08 value is tuned for this model's dimensions. It's not a universal constant. If you scale the model up (for example by changing embeddingSize to 256 or 512), the right standard deviation shrinks with the layer width. A common rule of thumb is 1/sqrt(fan_in), the same scaling at the heart of Xavier and Kaiming initialisation, two standard schemes you'll meet elsewhere in ML. Keeping 0.08 at larger widths tends to produce exploding gradients with no obvious cause, so flag this as the first knob to revisit if you start experimenting with size.

Add these helpers to Helpers.cs:

// --- Helpers.cs (add inside the Helpers class) ---

/// <summary>
/// Generates a random number from a bell curve (Gaussian/normal distribution)
/// centered on the mean, with most values falling within 'std' of it.
/// Uses the Box-Muller transform - turns two uniform random numbers into
/// one bell-curve random number.
/// </summary>
public static double RandomBellCurve(Random rng, double mean, double std)
{
    double u1 = 1.0 - rng.NextDouble();
    double u2 = 1.0 - rng.NextDouble();
    return mean + std * Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Sin(2.0 * Math.PI * u2);
}

/// <summary>
/// Creates a matrix of Value objects initialized to small random numbers.
/// </summary>
public static List<List<Value>> CreateMatrix(Random rng, int rows, int cols, double std = 0.08)
{
    var mat = new List<List<Value>>();
    for (int i = 0; i < rows; i++)
    {
        var row = new List<Value>();
        for (int j = 0; j < cols; j++)
        {
            row.Add(new Value(RandomBellCurve(rng, 0, std)));
        }

        mat.Add(row);
    }
    return mat;
}
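
If you'd like to convince yourself that the generator behaves as described, this sketch draws 100,000 samples and checks where they land. The function body is copied from above so the snippet runs on its own; the sample count is an arbitrary choice:

```csharp
using System;

// Same Box-Muller formula as RandomBellCurve above, repeated so the sketch runs standalone.
double RandomBellCurve(Random rng, double mean, double std)
{
    double u1 = 1.0 - rng.NextDouble();
    double u2 = 1.0 - rng.NextDouble();
    return mean + std * Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Sin(2.0 * Math.PI * u2);
}

var random = new Random(42);
const int SampleCount = 100_000;
double sum = 0;
int withinTwoStd = 0;

for (int i = 0; i < SampleCount; i++)
{
    double sample = RandomBellCurve(random, 0, 0.08);
    sum += sample;
    if (Math.Abs(sample) <= 0.16)
    {
        withinTwoStd++;
    }
}

double sampleMean = sum / SampleCount;
double fractionWithinTwoStd = (double)withinTwoStd / SampleCount;
Console.WriteLine($"mean: {sampleMean:F4}");
Console.WriteLine($"fraction within +/-0.16: {fractionWithinTwoStd:P1}");
```

The mean should sit very close to zero and roughly 95% of samples should fall inside the two-standard-deviation band. The 1.0 - rng.NextDouble() trick is what keeps u1 away from zero, since Math.Log(0) would give negative infinity.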

Position Embeddings: Why Position Matters

That covers the first piece. The model now knows what each token is, but it doesn't know where that token appears in the sequence. Consider the letter a in these two names:

anna - the first a is at position 0 (starting the name)
anna - the last a is at position 3 (ending the name)

The same letter behaves very differently depending on where it sits. A model that only knows "the current token is a" can't tell these two situations apart. It needs position information too.

You might think: "just pass the position as a number, 0, 1, 2, 3." But that's the same problem we had with token IDs. A single integer doesn't give the network enough to work with. The network needs a rich representation, a list of numbers, so it can learn complex patterns about what "being at position 3" means. Maybe position 0 tends to hold certain consonants, or position 3 is where doubled letters often appear. Those patterns need room to be encoded.

So we create a second embedding table, positionEmbeddings, where row i is the learned embedding for position i:

Position   Position Embedding
────────   ───────────────────────────────────────────
  0        [ 0.01,  0.03, -0.02,  0.05]
  1        [-0.04,  0.01,  0.07, -0.03]
  2        [ 0.03, -0.06,  0.02,  0.08]
 ...        ...

To combine the two, we simply add them element-by-element. For example, token a (ID 0) appearing at position 1:

  Token embedding (a):     [ 0.02, -0.05,  0.11,  0.03]
+ Position embedding (1):  [-0.04,  0.01,  0.07, -0.03]
= Combined:                [-0.02, -0.04,  0.18,  0.00]

The result is a single vector that encodes both "what token is this?" and "where does it appear?"
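
The same addition as a runnable sketch on plain doubles, using the example rows from the two tables above (the real code does this with Value objects so gradients can flow back into both tables):

```csharp
using System;
using System.Linq;

double[] tokenEmbedding = { 0.02, -0.05, 0.11, 0.03 };     // row for token 'a' (ID 0)
double[] positionEmbedding = { -0.04, 0.01, 0.07, -0.03 }; // row for position 1

// Element-by-element addition: combined[i] = token[i] + position[i].
double[] combined = tokenEmbedding.Zip(positionEmbedding, (t, p) => t + p).ToArray();

Console.WriteLine(string.Join(", ", combined.Select(v => v.ToString("F2"))));
```

Note that the vector stays the same length. Adding (rather than concatenating) keeps every later layer working on embeddingSize numbers, whichever position the token is at.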

From Embeddings to Predictions

After combining embeddings, we have a vector of 16 numbers (the internal representation). But we need 27 scores, one per possible next token, so we can pick the most likely one. This is where outputProjection comes in: it's a weight matrix that converts the 16-number vector into 27 scores using Linear, exactly as we built in Chapter 5.

Notice outputProjection has the same dimensions as tokenEmbeddings (27 x 16) but does the reverse job. tokenEmbeddings goes from a token ID to a 16-number vector, outputProjection goes from a 16-number vector back to 27 token scores.

Putting It Together

In the broader ML world (PyTorch, GPT-2, nanoGPT) you'll see these matrices called wte (weight token embedding), wpe (weight position embedding), and lm_head (language model head). When we put the model into a class in Chapter 11, the dictionary keys will use the GPT-2 names so the code maps directly to PyTorch references. The C# variables themselves will stay descriptive.

For now, our "model" is just: look up token embedding, look up position embedding, add them, project to vocabulary size. Chapter 11 replaces this with the full GptModel class, which adds layers between the embeddings and the projection so the model can consider relationships between tokens rather than looking at each token in isolation.

Cross-Entropy Loss - Naming the Loss Function

In the Big Picture, we said the loss is "a single number that measures how wrong the prediction was". Cross-entropy loss is the specific formula for computing that number. It works like this: look at the probability the model assigned to the correct next token, and take the negative log of it.

If the model assigns probability 1.0 to the correct token, -log(1.0) = 0 (no surprise, zero loss). If it assigns probability near 0, the negative log goes to +infinity (maximum surprise, huge loss). The formula is just -probabilities[correctToken].Log().

At initialisation with random weights, the model assigns roughly equal probability to all 27 tokens, so each token has probability 1/27, and the loss is -log(1/27) ≈ 3.296. That is exactly the "~3.3 starting loss" you'll see printed throughout Chapters 7-11. It isn't a rough heuristic, it's the arithmetic of a uniform distribution over a 27-token vocabulary. Anything below that during training means the model has learned something beyond guessing.
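
The arithmetic is easy to check for yourself. This sketch evaluates the formula on plain doubles for a few made-up probabilities and for the uniform 1/27 case; CrossEntropy is a hypothetical helper name used only here:

```csharp
using System;

// Cross-entropy loss for one prediction: negative log of the probability
// the model assigned to the correct token.
double CrossEntropy(double probabilityOfCorrectToken) =>
    -Math.Log(probabilityOfCorrectToken);

Console.WriteLine($"p = 0.90  -> loss {CrossEntropy(0.9):F3}");
Console.WriteLine($"p = 0.50  -> loss {CrossEntropy(0.5):F3}");
Console.WriteLine($"p = 0.01  -> loss {CrossEntropy(0.01):F3}");
Console.WriteLine($"p = 1/27  -> loss {CrossEntropy(1.0 / 27):F3}");
```

The loss climbs slowly while the model is roughly right and steeply once it becomes confidently wrong, which is exactly the pressure we want during training.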

Exercise: Run a Single Forward + Loss

Two numbers control the shape of everything that follows:

embeddingSize = 16 - the size of each embedding vector. Every token and position is represented as a list of 16 numbers. Larger means the model can capture more nuance, but trains slower.
maxSequenceLength = 8 - the maximum number of tokens the model can look at in one sequence. For our dataset of names, this covers most names (the first 7 characters plus BOS). Chapter 7 talks about what happens to longer names.

Create Chapter6Exercise.cs. This pulls everything together: embeddings, a forward pass, and a loss computation:

// --- Chapter6Exercise.cs ---

using static MicroGPT.Helpers;

namespace MicroGPT;

public static class Chapter6Exercise
{
    public static void Run()
    {
        var random = new Random(42);
        List<string> docs = Tokenizer.LoadDocs("input.txt", random);
        var tokenizer = new Tokenizer(docs);

        int embeddingSize = 16;
        int maxSequenceLength = 8;

        // Each row is a learned vector for one token (27 rows x 16 columns)
        List<List<Value>> tokenEmbeddings = CreateMatrix(
            random,
            tokenizer.VocabSize,
            embeddingSize
        );

        // Each row is a learned vector for one position (8 rows x 16 columns)
        List<List<Value>> positionEmbeddings = CreateMatrix(
            random,
            maxSequenceLength,
            embeddingSize
        );

        // Projects the embedding back to vocabulary size for prediction
        List<List<Value>> outputProjection = CreateMatrix(
            random,
            tokenizer.VocabSize,
            embeddingSize
        );

        // Run a single forward pass and compute the loss.
        // We're pretending the correct next character after BOS is 'e' - the choice
        // is arbitrary, just to demonstrate the loss formula.
        List<Value> logits = Forward(
            tokenizer.Bos,
            0,
            tokenEmbeddings,
            positionEmbeddings,
            outputProjection,
            embeddingSize
        );
        List<Value> probabilities = Softmax(logits);
        Value loss = -probabilities[tokenizer.Encode('e')].Log();

        Console.WriteLine($"Loss: {loss.Data:F4}");
        Console.WriteLine(
            $"Predicted probability of 'e': {probabilities[tokenizer.Encode('e')].Data:F4}"
        );
        // With Random(42), these values are deterministic - you should see the same
        // numbers every time. The loss should be around 3.3 (close to -log(1/27)),
        // confirming the model is effectively guessing randomly at this point.
    }

    // Minimal forward pass: look up embeddings, add them, project to vocab size
    private static List<Value> Forward(
        int tokenId,
        int posId,
        List<List<Value>> tokenEmbeddings,
        List<List<Value>> positionEmbeddings,
        List<List<Value>> outputProjection,
        int embeddingSize
    )
    {
        List<Value> tokenEmbedding = tokenEmbeddings[tokenId];
        List<Value> positionEmbedding = positionEmbeddings[posId];

        var x = new List<Value>();
        for (int i = 0; i < embeddingSize; i++)
        {
            x.Add(tokenEmbedding[i] + positionEmbedding[i]);
        }

        return Linear(x, outputProjection);
    }
}

This returns 27 logits - one score per token in the vocabulary. Higher logit = the model thinks that token is more likely to come next. The loss is the negative log probability of the correct next token.
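To see where the "around 3.3" figure comes from, here is a minimal sketch using plain doubles instead of Value. It assumes a stand-in "model" whose 27 logits are all identical, which is roughly what randomly initialised weights behave like. SoftmaxPlain and the all-zero logits are hypothetical names and values for illustration, not part of the MicroGPT code.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for an untrained model: every logit is identical,
// so no token is preferred over any other.
const int VocabSize = 27;
double[] logits = new double[VocabSize]; // all zeros

double[] probabilities = SoftmaxPlain(logits);
int correctToken = 4; // 'e' when a-z map to 0-25, as in the tokenizer chapter
double loss = -Math.Log(probabilities[correctToken]);

Console.WriteLine($"Probability of the correct token: {probabilities[correctToken]:F4}");
Console.WriteLine($"Loss: {loss:F4}");

// Plain-double softmax with no gradient tracking.
static double[] SoftmaxPlain(double[] scores)
{
    double max = scores.Max();
    double[] exps = scores.Select(s => Math.Exp(s - max)).ToArray();
    double total = exps.Sum();
    return exps.Select(e => e / total).ToArray();
}
```

Every token gets probability 1/27, so the loss is -log(1/27), which is about 3.30. A real untrained model lands close to this rather than exactly on it, because its random weights make the logits slightly unequal.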

Uncomment the Chapter 6 case in the dispatcher in Program.cs:

case "ch6":
    Chapter6Exercise.Run();
    break;

Then run it:

dotnet run -- ch6

Chapter 5: Linear Transformation and Softmax

Gary Jackson — Fri, 24 Apr 2026 23:29:37 +0000

What You'll Build

Two helper functions that show up in nearly every layer of a neural network:

Linear takes an input vector and a weight matrix, multiplies each row of weights element-by-element with the input, and sums each row into a single output value:

  input:   [1, 2, 3]
  weights: [[0.1, 0.2, 0.3],    row 0: 0.1*1 + 0.2*2 + 0.3*3 = 1.4
            [0.4, 0.5, 0.6]]    row 1: 0.4*1 + 0.5*2 + 0.6*3 = 3.2
  output:  [1.4, 3.2]

Two rows of weights means two output values. This is how neural networks change the size of data as it flows through layers.

Softmax takes a list of raw numbers and turns them into probabilities that add up to 1. For example, [2.0, 1.0, 0.1] becomes roughly [0.66, 0.24, 0.10]. The largest input gets the highest probability.

They live in their own file because they're pure math utilities, independent of the model architecture.

Depends On

Chapters 1-2 (the Value class - our computation recorder).

Code

Linear needs a way to compute a dot product between two lists of Value objects. Add this method to Value.cs:

// --- Value.cs (add inside the Value class) ---

public static Value Dot(List<Value> a, List<Value> b)
{
    var result = new Value(0);
    for (int i = 0; i < a.Count; i++)
    {
        result += a[i] * b[i];
    }

    return result;
}

Debug vs Release matters from here on. Each += allocates a fresh Value, and a typical training step does tens of thousands of them. In Debug mode the JIT skips inlining and the GC churns, so the same run that takes ~30 seconds in Release can take 5+ minutes in Debug. Once the code is working, always run training with -c Release. The Performance Optimisation Notes section at the end of the course covers the rest of the speedups.

Now add Linear and Softmax to Helpers.cs:

// --- Helpers.cs ---

namespace MicroGPT;

public static class Helpers
{
    /// <summary>
    /// Matrix-vector multiply. Each row of weights is multiplied element-by-element
    /// with input and summed into a single value.
    /// </summary>
    public static List<Value> Linear(List<Value> input, List<List<Value>> weights) =>
        [.. weights.Select(row => Value.Dot(row, input))];

    /// <summary>
    /// Converts raw scores (logits) into a probability distribution.
    /// </summary>
    public static List<Value> Softmax(List<Value> logits)
    {
        double maxVal = logits.Max(v => v.Data);
        var exponentials = logits.Select(v => (v - maxVal).Exp()).ToList();
        var total = new Value(0);
        foreach (Value? e in exponentials)
        {
            total += e;
        }

        return [.. exponentials.Select(e => e / total)];
    }
}

Each element of Linear's output is a weighted sum of input, where weights contains the learned parameters. If input has 16 elements and weights has 64 rows of 16 elements each, the output has 64 elements. This is how neural networks change the dimensionality of data.

Notice there's no bias term added after the dot product. Production models typically include one (output = weights * input + bias), but we omit it for simplicity. Some modern architectures like LLaMA also drop biases.
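For comparison, here is a rough sketch of what a bias term would add. It uses plain doubles rather than Value, and LinearPlain and the bias values are made up for illustration. MicroGPT itself has no bias.

```csharp
using System;
using System.Linq;

// The same 2x3 weights and length-3 input used in the Linear example.
double[][] weights =
{
    new[] { 0.1, 0.2, 0.3 },
    new[] { 0.4, 0.5, 0.6 },
};
double[] input = { 1.0, 2.0, 3.0 };

// Hypothetical bias values, one per output row, chosen only for illustration.
double[] bias = { 0.5, -0.2 };
double[] noBias = new double[weights.Length]; // all zeros: same as omitting the bias

double[] withoutBias = LinearPlain(input, weights, noBias);
double[] withBias = LinearPlain(input, weights, bias);

Console.WriteLine($"Without bias: {string.Join(" ", withoutBias.Select(v => v.ToString("F1")))}");
Console.WriteLine($"With bias:    {string.Join(" ", withBias.Select(v => v.ToString("F1")))}");

// output[row] = b[row] + sum over col of w[row][col] * x[col]
static double[] LinearPlain(double[] x, double[][] w, double[] b)
{
    var output = new double[w.Length];
    for (int row = 0; row < w.Length; row++)
    {
        double sum = b[row];
        for (int col = 0; col < x.Length; col++)
        {
            sum += w[row][col] * x[col];
        }
        output[row] = sum;
    }
    return output;
}
```

The output length still equals the number of weight rows. The bias only shifts each output by a constant, which is why dropping it costs little in a model this small.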

The raw numbers that come out of a model before they're turned into probabilities are called logits. You'll see the term everywhere in ML. They can be any value (positive, negative, large, small), and on their own they don't mean much. They need to be converted into probabilities first, which is where Softmax comes in.

Softmax takes a list of logits and turns them into a probability distribution where all values are in [0, 1] and sum to 1. We subtract the max value before taking exp for numerical stability. Mathematically it doesn't change the result (the max cancels out in the division), but without it, exp of large numbers can overflow to infinity. The backward pass is unaffected too: shifting every logit by the same constant cancels in the ratio, so the gradients through Softmax come out identical to the unshifted version.
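The overflow is easy to reproduce with plain doubles. The sketch below is illustrative only: SoftmaxNaive, SoftmaxStable, and Format are hypothetical helpers, and the large logits are just the chapter's example values shifted up by 1000.

```csharp
using System;
using System.Linq;

double[] smallLogits = { 2.0, 1.0, 0.1 };
double[] largeLogits = { 1002.0, 1001.0, 1000.1 }; // same logits shifted by +1000

double[] naiveLarge = SoftmaxNaive(largeLogits);
double[] stableSmall = SoftmaxStable(smallLogits);
double[] stableLarge = SoftmaxStable(largeLogits);

Console.WriteLine($"naive, large logits:  {Format(naiveLarge)}");
Console.WriteLine($"stable, small logits: {Format(stableSmall)}");
Console.WriteLine($"stable, large logits: {Format(stableLarge)}");

// No max subtraction: Math.Exp overflows to infinity for inputs above roughly 709.
static double[] SoftmaxNaive(double[] logits)
{
    double[] exps = logits.Select(x => Math.Exp(x)).ToArray();
    double total = exps.Sum();
    return exps.Select(e => e / total).ToArray();
}

// Subtract the max first, so the largest exponent is exactly 0.
static double[] SoftmaxStable(double[] logits)
{
    double max = logits.Max();
    double[] exps = logits.Select(x => Math.Exp(x - max)).ToArray();
    double total = exps.Sum();
    return exps.Select(e => e / total).ToArray();
}

static string Format(double[] values) =>
    string.Join(" ", values.Select(v => v.ToString("F3")));
```

The naive version divides infinity by infinity and returns NaN for every entry. The stable version gives the same distribution for both inputs, up to floating-point rounding, which is the "max cancels out" claim in action.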

Because both Linear and Softmax are built entirely from Value operations (add, multiply, exp, divide), gradients flow through them automatically during the backward pass. They aren't "frozen" math. They're part of the computation graph, just like any other chain of Value operations.

Exercise

Create Chapter5Exercise.cs:

// --- Chapter5Exercise.cs ---

using static MicroGPT.Helpers;

namespace MicroGPT;

public static class Chapter5Exercise
{
    public static void Run()
    {
        // Test Linear: a 2x3 weight matrix times a length-3 input vector
        var input = new List<Value> { new(1.0), new(2.0), new(3.0) };
        var weights = new List<List<Value>>
        {
            new() { new(0.1), new(0.2), new(0.3) }, // row 0: 0.1*1 + 0.2*2 + 0.3*3 = 1.4
            new() { new(0.4), new(0.5), new(0.6) }, // row 1: 0.4*1 + 0.5*2 + 0.6*3 = 3.2
        };
        List<Value> output = Linear(input, weights);

        Console.WriteLine("--- Linear ---");
        Console.WriteLine("Expected: 1.4 3.2");
        Console.Write("Got:      ");
        foreach (Value v in output)
        {
            Console.Write($"{v.Data:F1} ");
        }
        Console.WriteLine();

        // Test Softmax: converts raw logits into probabilities that sum to 1
        var logits = new List<Value> { new(2.0), new(1.0), new(0.1) };
        List<Value> probabilities = Softmax(logits);

        Console.WriteLine("--- Softmax ---");
        Console.WriteLine(
            "Expected: 0.659 0.242 0.099  (sum to 1.0, largest logit gets highest prob)"
        );
        Console.Write("Got:      ");
        foreach (Value p in probabilities)
        {
            Console.Write($"{p.Data:F3} ");
        }
        Console.WriteLine();
    }
}

Uncomment the Chapter 5 case in the dispatcher in Program.cs:

case "ch5":
    Chapter5Exercise.Run();
    break;

Then run it:

dotnet run -- ch5

Common Misconception

"Why not just divide each logit by the sum of all logits?" Because logits can be negative, and a probability distribution needs all non-negative values. The exp function makes sure everything is positive before normalising.
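A quick sketch with plain doubles makes the problem concrete. The logits here are illustrative values picked to include one negative number.

```csharp
using System;
using System.Linq;

double[] logits = { 2.0, -1.0, 0.5 }; // illustrative values with one negative logit

// Naive idea: divide each logit by the sum of all logits.
double logitSum = logits.Sum();
double[] naive = logits.Select(l => l / logitSum).ToArray();

// Softmax: exponentiate first so every term is positive, then normalise.
double max = logits.Max();
double[] exps = logits.Select(l => Math.Exp(l - max)).ToArray();
double expSum = exps.Sum();
double[] softmax = exps.Select(e => e / expSum).ToArray();

Console.WriteLine($"naive:   {string.Join(" ", naive.Select(v => v.ToString("F3")))}");
Console.WriteLine($"softmax: {string.Join(" ", softmax.Select(v => v.ToString("F3")))}");
```

The naive values do sum to 1, but one of them is negative and another is greater than 1, so they can't be read as probabilities. The naive approach also breaks outright whenever the logits happen to sum to zero.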

Chapter 4: The Bigram Model - Simplest Possible Language Model

Gary Jackson — Thu, 23 Apr 2026 23:06:04 +0000

What You'll Build

A character-level language model that predicts the next character based only on the current character. No neural network, no gradients, just counting. A "bigram" is a pair of consecutive tokens (like 'e' followed by 'm'), so a bigram model is one that learns which pairs occur most often.

Notice that this model uses plain double arrays, not Value objects. That's because there's nothing to train here. We're just counting occurrences in the data. Value only starts pulling its weight when we need to compute gradients and update parameters, which kicks in from Chapter 6.

This file isn't used by the final GPT model. It's a baseline you can come back to later to see how much better the neural network does.

Depends On

Chapter 3 (tokenizer + dataset).

Why Start Here?

This model pins down the task clearly before introducing any complexity. The task is: given a sequence of tokens so far, predict what comes next. The bigram model does this in the simplest way imaginable. It counts how often each pair of consecutive characters appears, and uses those counts as probabilities.

Code

// --- BigramModel.cs ---

using System.Text;

namespace MicroGPT;

public class BigramModel
{
    private readonly double[,] _nextTokenProbs;
    private readonly Tokenizer _tokenizer;

    public BigramModel(List<string> docs, Tokenizer tokenizer)
    {
        _tokenizer = tokenizer;
        int vocabSize = tokenizer.VocabSize;

        // Count transitions: counts[i,j] = how often token j follows token i
        int[,] counts = new int[vocabSize, vocabSize];

        foreach (string doc in docs)
        {
            List<int> tokens = Tokenize(doc);
            for (int i = 0; i < tokens.Count - 1; i++)
            {
                counts[tokens[i], tokens[i + 1]]++;
            }
        }

        // Convert counts to probabilities
        _nextTokenProbs = new double[vocabSize, vocabSize];
        for (int i = 0; i < vocabSize; i++)
        {
            double rowSum = 0.0;
            for (int j = 0; j < vocabSize; j++)
            {
                rowSum += counts[i, j];
            }

            if (rowSum > 0)
            {
                for (int j = 0; j < vocabSize; j++)
                {
                    _nextTokenProbs[i, j] = counts[i, j] / rowSum;
                }
            }
        }
    }

    public string GenerateName(Random random, int maxLength = 20)
    {
        int token = _tokenizer.Bos;
        var name = new StringBuilder();
        for (int step = 0; step < maxLength; step++)
        {
            double r = random.NextDouble();
            double cumulative = 0;
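            // Fallback if no slice is hit (floating-point rounding, or an all-zero row):
            // VocabSize - 1 is the BOS ID, so the name simply ends.
            int next = _tokenizer.VocabSize - 1;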
            for (int j = 0; j < _tokenizer.VocabSize; j++)
            {
                cumulative += _nextTokenProbs[token, j];
                if (r <= cumulative)
                {
                    next = j;
                    break;
                }
            }
            if (next == _tokenizer.Bos)
            {
                break;
            }

            name.Append(_tokenizer.Decode(next));
            token = next;
        }
        return name.ToString();
    }

    public void PrintSamples(int count, Random random)
    {
        Console.WriteLine("\n--- bigram samples ---");
        for (int s = 0; s < count; s++)
        {
            Console.WriteLine($"sample {s + 1, 2}: {GenerateName(random)}");
        }
    }

    /// <summary>
    /// Computes the average negative log probability across all documents.
    /// This is the loss baseline that our neural network should beat.
    /// </summary>
    public double ComputeLoss(List<string> docs)
    {
        double totalLoss = 0.0;
        int totalTokens = 0;
        foreach (string doc in docs)
        {
            List<int> tokens = Tokenize(doc);
            for (int i = 0; i < tokens.Count - 1; i++)
            {
                double p = _nextTokenProbs[tokens[i], tokens[i + 1]];
                // A pair never seen during training has p == 0, which would give
                // -log(0) = +infinity. We skip the loss contribution (but still count
                // the token in the denominator), which slightly flatters the baseline.
                if (p > 0)
                {
                    totalLoss += -Math.Log(p);
                }

                totalTokens++;
            }
        }
        return totalLoss / totalTokens;
    }

    private List<int> Tokenize(string doc)
    {
        var tokens = new List<int> { _tokenizer.Bos };
        tokens.AddRange(doc.Select(_tokenizer.Encode));
        tokens.Add(_tokenizer.Bos);
        return tokens;
    }
}

Why `-log` for the Loss?

ComputeLoss uses -Math.Log(p) where p is the probability the model assigned to the correct next token. Why that formula?

If the model assigns probability 1.0 to the correct token (perfect prediction), -log(1.0) = 0. Zero loss.
If it assigns 0.5 (coin flip), -log(0.5) = 0.69. Moderate loss.
If it assigns 0.01 (the model had nearly ruled out the correct token), -log(0.01) = 4.6. High loss.
If it assigns 0 (completely wrong), -log(0) is infinity.

So -log converts a probability into a "how surprised was the model" score, where lower is better. The average of this score across all tokens in the dataset is the loss. Chapter 6 covers this in more detail when the neural network needs to compute it.
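As a small illustration of "the average of this score", the sketch below averages -log over three positions. The probabilities are hypothetical, not taken from a real run.

```csharp
using System;
using System.Linq;

// Hypothetical probabilities a model assigned to the correct next token
// at three positions. The values are illustrative only.
double[] correctTokenProbs = { 0.9, 0.5, 0.01 };

double[] losses = correctTokenProbs.Select(p => -Math.Log(p)).ToArray();
double averageLoss = losses.Average();

for (int i = 0; i < losses.Length; i++)
{
    Console.WriteLine($"p = {correctTokenProbs[i]:F2} -> loss = {losses[i]:F4}");
}
Console.WriteLine($"Average loss: {averageLoss:F4}");
```

One badly predicted token (p = 0.01) dominates the average, which is the point: the metric punishes confident mistakes much harder than it rewards confident hits.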

How Generation Works

The GenerateName method is worth understanding because the transformer model in Chapters 11-12 generates names the same way. The only thing that changes is where the probabilities come from.

The strategy is:

1. Start with BOS. Just like training data wraps each name with BOS on both sides, generation starts from BOS and asks "what's the first character?"
2. Look up the probability row. _nextTokenProbs[token, ...] is a row of 27 numbers that sum to 1.0. Each number says how likely that token is to come next. For example, after BOS the row might say a=0.08, b=0.04, ..., z=0.01, reflecting how often each letter starts a name in the training data.
3. Pick a token using weighted random sampling. Roll a random number between 0 and 1. Think of it as a "dart throw" onto a number line. Each token's probability determines how wide its section of the line is. The loop walks the line by accumulating probabilities until the running total crosses the dart.

Here's a worked example. Say BOS is the current token, and the next-token probabilities for the first three tokens are [a=0.3, b=0.5, c=0.2]:

   |-------- a --------|--------------- b ---------------|------ c ------|
   0                   0.3                               0.8            1.0

If the random number is r = 0.6, the loop accumulates:

After a: cumulative = 0.3. Is 0.6 <= 0.3? No.
After b: cumulative = 0.8. Is 0.6 <= 0.8? Yes, pick b.

Token b gets picked 50% of the time because its slice covers half the line. Token a gets picked 30% of the time, c gets 20%. The loop always starts from the first token, but the random number decides how far along the line we go before stopping.

4. Repeat or stop. If the picked token is BOS, the model is saying "this name is done" and we stop. Otherwise we append the character, make it the new current token, and go back to step 2.

The bigram model's probabilities come from counting pairs in the training data. The transformer's probabilities will come from a neural network's forward pass. But the generation loop (start at BOS, sample from probabilities, stop at BOS) stays exactly the same.
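The dart-throw step can be pulled out and tested on its own. This is a minimal sketch using the three-token example from above. SampleIndex is a hypothetical helper that mirrors the inner loop of GenerateName, not a method on BigramModel.

```csharp
using System;

double[] probs = { 0.3, 0.5, 0.2 }; // a, b, c from the worked example

// The worked example: r = 0.6 falls inside b's slice (0.3 to 0.8).
int picked = SampleIndex(probs, 0.6);
Console.WriteLine($"r = 0.6 picks index {picked}");

// Empirical check: over many draws, each token should come up
// roughly in proportion to its probability.
var random = new Random(42);
const int Draws = 100_000;
int[] counts = new int[probs.Length];
for (int draw = 0; draw < Draws; draw++)
{
    counts[SampleIndex(probs, random.NextDouble())]++;
}
for (int token = 0; token < counts.Length; token++)
{
    Console.WriteLine($"index {token}: {(double)counts[token] / Draws:F3}");
}

// Walks the cumulative distribution until the running total reaches r.
static int SampleIndex(double[] probabilities, double r)
{
    double cumulative = 0.0;
    for (int i = 0; i < probabilities.Length; i++)
    {
        cumulative += probabilities[i];
        if (r <= cumulative)
        {
            return i;
        }
    }
    // Fallback in case rounding leaves the total slightly below r.
    return probabilities.Length - 1;
}
```

Index 1 (b) is picked for r = 0.6, and the observed frequencies should come out close to 0.3, 0.5, and 0.2.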

Exercise: Run the Bigram Model

Create Chapter4Exercise.cs:

// --- Chapter4Exercise.cs ---

namespace MicroGPT;

public static class Chapter4Exercise
{
    public static void Run()
    {
        var random = new Random(42);
        var docs = Tokenizer.LoadDocs("input.txt", random);
        var tokenizer = new Tokenizer(docs);

        Console.WriteLine($"num docs: {docs.Count}");
        Console.WriteLine($"vocab size: {tokenizer.VocabSize}");

        var bigram = new BigramModel(docs, tokenizer);
        bigram.PrintSamples(20, random);
        Console.WriteLine($"Bigram loss: {bigram.ComputeLoss(docs):F4}");
        // Expect something around 2.45 - our neural network should beat this.
    }
}

Uncomment the Chapter 4 case in the dispatcher in Program.cs:

case "ch4":
    Chapter4Exercise.Run();
    break;

Then run it:

dotnet run -- ch4

What You'll See

The output will be names, but they won't be very good. The model knows that certain characters tend to follow others (like 'a' often follows 'k'), but it has no concept of what happened two characters ago. No memory beyond the immediate predecessor. Run it and see for yourself. The results are deterministic with the seeded random, so you'll get the same output every time.

Key Takeaway

The bigram model pins down the task (next-token prediction) and the metric (negative log probability of the correct next token). Everything from Chapter 5 onwards is about doing this same task better, using a neural network that can consider more context.

Chapter 3: The Tokenizer - Text to Numbers and Back

Gary Jackson — Wed, 22 Apr 2026 20:44:17 +0000

What You'll Build

A Tokenizer class that converts between characters and integer IDs, plus a special BOS (Beginning of Sequence) token.

Depends On

Nothing. This chapter is independent of Chapters 1-2. It sits here because the next chapter (Bigram) needs it.

Why Characters?

Production LLMs use subword tokenizers (like BPE) that merge frequent character sequences into single tokens, giving them a vocabulary of ~100K tokens. MicroGPT uses the simplest possible tokenizer: one token per unique character. For a dataset of names, that gives us 26 lowercase letters plus one special token, for a vocabulary of 27.

The integer values themselves are arbitrary. Token 0 being 'a' and token 25 being 'z' is just convention. They might as well be emoji. Each token is simply a distinct symbol.

Code

// --- Tokenizer.cs ---

namespace MicroGPT;

public class Tokenizer
{
    private readonly List<char> _allChars;

    public int Bos { get; } // Beginning of Sequence token ID
    public int VocabSize { get; } // total number of unique tokens

    public Tokenizer(List<string> docs)
    {
        _allChars = [.. string.Join("", docs).Distinct().OrderBy(c => c)];
        Bos = _allChars.Count; // e.g., 26 if a-z are 0-25
        VocabSize = _allChars.Count + 1; // 27 total
    }

    public int Encode(char c) => _allChars.IndexOf(c);

    public char Decode(int i) => i == Bos ? '.' : _allChars[i]; // display BOS as '.'

    /// <summary>
    /// Loads documents from a text file, one per line, shuffled.
    /// </summary>
    public static List<string> LoadDocs(string path, Random random)
    {
        return
        [
            .. File.ReadAllLines(path)
                .Select(l => l.Trim())
                .Where(l => !string.IsNullOrEmpty(l))
                .OrderBy(_ => random.Next()),
        ];
    }
}

The BOS Token

BOS stands for "Beginning of Sequence". It's token ID 26 (one past the last letter) and it doesn't exist as a character in the training data. It's purely synthetic, acting as a delimiter that tells the model "a new document starts here" and "the current document ends here". During training, each document (e.g. the name "emma") gets wrapped with BOS on both sides:

[BOS, e, m, m, a, BOS]

The model learns that BOS initiates a new name, and that producing BOS means "I'm done". When we need to display it, Decode renders it as '.' (a dot). The choice of dot is arbitrary. It just needs to be something visually distinct from the letters. You'll see it in the model's output later when a generated name ends: emma.

Exercise: Verify Tokenization

Create Chapter3Exercise.cs:

// --- Chapter3Exercise.cs ---

namespace MicroGPT;

public static class Chapter3Exercise
{
    public static void Run()
    {
        var random = new Random(42);
        List<string> docs = Tokenizer.LoadDocs("input.txt", random);
        var tokenizer = new Tokenizer(docs);

        Console.WriteLine($"num docs: {docs.Count}");
        Console.WriteLine($"vocab size: {tokenizer.VocabSize}");

        // Encode a name
        string name = "emma";
        List<int> encoded = name.Select(tokenizer.Encode).ToList();
        Console.WriteLine($"Encoded: [{string.Join(", ", encoded)}]"); // [4, 12, 12, 0]

        // Decode it back
        string decoded = string.Join("", encoded.Select(tokenizer.Decode));
        Console.WriteLine($"Decoded: {decoded}"); // emma

        // A full document with BOS on both sides
        var tokens = new List<int> { tokenizer.Bos };
        tokens.AddRange(name.Select(tokenizer.Encode));
        tokens.Add(tokenizer.Bos);
        Console.WriteLine($"With BOS: [{string.Join(", ", tokens)}]"); // [26, 4, 12, 12, 0, 26]
    }
}

Uncomment the Chapter 3 case in the dispatcher in Program.cs:

case "ch3":
    Chapter3Exercise.Run();
    break;

Then run it:

dotnet run -- ch3

Chapter 2: Backward - Automatic Gradient Computation

Gary Jackson — Tue, 21 Apr 2026 21:53:24 +0000

What You'll Build

The Backward() method on Value. Starting from a final output (like the loss), it walks the computation graph in reverse and fills in the .Grad field on every Value, answering "if I nudged this number, how much would the final output change?"

Depends On

Chapter 1 (the Value class with forward operations).

Why This Matters

Training a neural network comes down to: "nudge each parameter a tiny bit in the direction that lowers the loss." But with thousands of parameters and a complex computation graph, you can't compute those nudges by hand. Backward() does it automatically using the chain rule.

The Chain Rule in Plain English

If a car goes twice as fast as a bicycle, and a bicycle goes four times as fast as a walking person, then the car goes 2 x 4 = 8 times as fast as the person. That's the chain rule. You multiply the rates of change along the path.

In the computation graph: if changing a causes c to change by 3x, and changing c causes loss to change by 2x, then changing a causes loss to change by 3 x 2 = 6x.
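Here is a small numerical check of that multiplication, using plain doubles. ComputeC and ComputeLoss are hypothetical functions, chosen so that the local rates at a = 1.5 match the 3x and 2x in the text.

```csharp
using System;

// Hypothetical functions chosen so the local rates at a = 1.5 match the text.
static double ComputeC(double x) => x * x;               // rate of c w.r.t. a is 2a = 3 at a = 1.5
static double ComputeLoss(double cValue) => 2.0 * cValue; // rate of loss w.r.t. c is 2

double a = 1.5;
double rateCWrtA = 2.0 * a;
double rateLossWrtC = 2.0;

// Chain rule: multiply the rates along the path a -> c -> loss.
double chainRule = rateCWrtA * rateLossWrtC;

// Independent check: nudge a by a tiny amount and measure how loss responds.
double epsilon = 1e-6;
double nudged = (ComputeLoss(ComputeC(a + epsilon)) - ComputeLoss(ComputeC(a))) / epsilon;

Console.WriteLine($"Chain rule: {chainRule:F4}");
Console.WriteLine($"Nudge test: {nudged:F4}");
```

Both numbers come out at about 6. The nudge test is only approximate because epsilon is small but not infinitely small.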

The Algorithm

The backward pass works in two stages.

Stage 1 - Topological sort. Walk the graph from the output and order all nodes so that a node always appears after all the nodes it depends on (its inputs come first).

Stage 2 - Propagate gradients. Start from the output (where gradient = 1.0, because "if I nudge the loss, the loss changes by the same amount" - the ratio is trivially 1). Visit each node in reverse topological order. For each node, multiply its gradient by each local gradient and add the result to the corresponding input's gradient.

Notice that Stage 2 adds to each input's gradient rather than overwriting it (+= rather than = in the code). This is important, and the trace below shows why.

Code

A note on reading the code. The backward pass can look confusing at first because everything is a Value. The node you're currently processing is a Value, its inputs are also Value objects, and they all have a .Grad field.

When you see v._inputs[j].Grad += ..., remember that v is the current node (the operation's output), and v._inputs[j] is one of the nodes that was fed into that operation. The code is reaching back from an output to its inputs and telling each one how much it contributed to the final loss.

Add these methods to the Value class:

// --- Value.cs (add inside the Value class) ---

// Convenience overload: allocates fresh buffers on each call.
// Good for one-off use in the early chapters. The 3-argument version below
// lets the training loop in Chapter 7 reuse buffers across thousands of steps.
public void Backward() => Backward([], [], new Stack<(Value, int)>());

public void Backward(
    List<Value> topo,
    HashSet<Value> visited,
    Stack<(Value node, int inputIndex)> stack
)
{
    // Stage 1: Iterative topological sort
    stack.Push((this, 0));
    while (stack.Count > 0)
    {
        (Value? current, int inputIndex) = stack.Pop();
        Value[]? inputs = current._inputs;

        if (inputs != null && inputIndex < inputs.Length)
        {
            stack.Push((current, inputIndex + 1));
            Value input = inputs[inputIndex];
            if (visited.Add(input))
            {
                stack.Push((input, 0));
            }
        }
        else
        {
            topo.Add(current);
        }
    }

    // Stage 2: Propagate gradients in reverse topological order
    Grad = 1.0;
    for (int i = topo.Count - 1; i >= 0; i--)
    {
        Value v = topo[i];
        if (v.Grad == 0)
        {
            continue; // optimisation: nothing to propagate
        }

        if (v._inputs == null)
        {
            continue;
        }

        for (int j = 0; j < v._inputs.Length; j++)
        {
            v._inputs[j].Grad += v._localGrads![j] * v.Grad;
        }
    }
}

There are two overloads. The parameterless one is what we'll use in Chapters 2-6: it allocates its own buffers, you call L.Backward(), done. The three-argument version takes pre-allocated topo, visited, and stack buffers so the Chapter 7 training loop can reuse them across thousands of steps instead of reallocating on every call. Same algorithm either way.

Why Iterative Instead of Recursive?

Karpathy's Python version uses a recursive topological sort. That works fine for short sequences, but deep computation graphs can blow the call stack. The C# version replaces recursion with an explicit Stack<T>, which is both faster and safe for arbitrary graph depth. The algorithm is identical, only the mechanics differ.

Walking Through Both Stages

If the code feels opaque, here's a step-by-step trace using the same example from the exercise below:

a = Value(2.0)       ← leaf, no inputs
b = Value(3.0)       ← leaf, no inputs
c = a * b            ← inputs: [a, b], localGrads: [b.Data, a.Data] = [3.0, 2.0]
L = c + a            ← inputs: [c, a], localGrads: [1.0, 1.0]
L.Backward()

Stage 1 - Topological sort. The stack holds (node, inputIndex) pairs. Each pop says "I'm visiting node, and I'm up to input #inputIndex." When inputIndex has walked past every input, the node is done and gets added to sortedNodes (the list named topo in the code).

Step  Pop        What happens                              Stack after               sortedNodes
────  ─────────  ────────────────────────────────────────   ───────────────────────    ───────────
      (start)    Push (L, 0)                               [(L,0)]                   []

1     (L, 0)     L has inputs, index 0 < 2:                [(L,1), (c,0)]            []
                   push (L,1) to come back later
                   c is new, push (c,0) to explore it

2     (c, 0)     c has inputs, index 0 < 2:                [(L,1), (c,1), (a,0)]     []
                   push (c,1) to come back later
                   a is new, push (a,0) to explore it

3     (a, 0)     a has no inputs (leaf):                   [(L,1), (c,1)]            [a]
                   add a to sortedNodes

4     (c, 1)     c has inputs, index 1 < 2:                [(L,1), (c,2), (b,0)]     [a]
                   push (c,2) to come back later
                   b is new, push (b,0) to explore it

5     (b, 0)     b has no inputs (leaf):                   [(L,1), (c,2)]            [a, b]
                   add b to sortedNodes

6     (c, 2)     c has inputs, but index 2 = 2:            [(L,1)]                   [a, b, c]
                   all inputs done, add c to sortedNodes

7     (L, 1)     L has inputs, index 1 < 2:                [(L,2)]                   [a, b, c]
                   push (L,2) to come back later
                   a already visited, skip

8     (L, 2)     L has inputs, but index 2 = 2:            []                        [a, b, c, L]
                   all inputs done, add L to sortedNodes

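Step 7 is the interesting one: a is L's second input, but we've already visited it through c, so we skip it. Without the visited check, a would appear twice in sortedNodes. For a leaf like a that would be harmless, since a leaf has no inputs to propagate to, but a non-leaf node listed twice would push its gradient to its inputs twice and over-count them.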

Stage 2 - Propagate gradients. Walk sortedNodes in reverse (L → c → b → a), seeding L.Grad = 1.0:

The formula at each step is:  input.Grad += localGrad * node.Grad

Step  i   Node   .Grad   What happens
────  ──  ────   ─────   ──────────────────────────────────────────────────────
1     3   L      1.0     L's inputs [c, a] with localGrads [1.0, 1.0]:
                           c.Grad += 1.0 (localGrad) * 1.0 (L.Grad) = 1.0
                           a.Grad += 1.0 (localGrad) * 1.0 (L.Grad) = 1.0

2     2   c      1.0     c's inputs [a, b] with localGrads [3.0, 2.0]:
                           a.Grad += 3.0 (localGrad) * 1.0 (c.Grad) = 3.0    (a.Grad is now 4.0)
                           b.Grad += 2.0 (localGrad) * 1.0 (c.Grad) = 2.0

3     1   b      2.0     b has no inputs (leaf), skip

4     0   a      4.0     a has no inputs (leaf), skip

Result:  a.Grad = 4.0    (1.0 from the addition path + 3.0 from the multiplication path)
         b.Grad = 2.0

Notice a.Grad gets written to twice: once in step 1 (through L = c + a) and again in step 2 (through c = a * b). The two contributions are summed with +=, which is why a.Grad ends up at 4.0. Without the += accumulation, the second write would overwrite the first and we'd get 3.0 instead of 4.0.
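The same arithmetic can be replayed by hand with plain doubles to compare accumulating against overwriting. This is only a sketch of the Stage 2 bookkeeping for this one graph, with made-up variable names, not a replacement for Backward().

```csharp
using System;

double a = 2.0;
double b = 3.0;

// Local gradients recorded at forward time.
double dC_dA = b;   // c = a * b, so wiggling a scales by b
double dC_dB = a;   // and wiggling b scales by a
double dL_dC = 1.0; // L = c + a, addition passes gradients through unchanged
double dL_dA = 1.0;

double lGrad = 1.0; // seed: the loss with respect to itself

// Correct version: accumulate with +=.
double cGrad = 0.0, bGrad = 0.0, aGradAccumulated = 0.0;
cGrad += dL_dC * lGrad;            // step 1: L -> c
aGradAccumulated += dL_dA * lGrad; // step 1: L -> a (addition path)
aGradAccumulated += dC_dA * cGrad; // step 2: c -> a (multiplication path)
bGrad += dC_dB * cGrad;            // step 2: c -> b

// Broken version: overwrite with =, so the addition path is lost.
double aGradOverwritten = 0.0;
aGradOverwritten = dL_dA * lGrad;
aGradOverwritten = dC_dA * cGrad;

Console.WriteLine($"a.Grad with +=: {aGradAccumulated}");
Console.WriteLine($"a.Grad with = : {aGradOverwritten}");
Console.WriteLine($"b.Grad:         {bGrad}");
```

With += the two paths add up to 4. With = only the last write survives and a.Grad comes out as 3.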

Exercise: Verify Gradients

Create Chapter2Exercise.cs:

// --- Chapter2Exercise.cs ---

namespace MicroGPT;

public static class Chapter2Exercise
{
    public static void Run()
    {
        var a = new Value(2.0);
        var b = new Value(3.0);
        Value c = a * b; // c = 6.0
        Value loss = c + a; // L = 8.0

        loss.Backward();

        Console.WriteLine($"a.Grad: expected 4, got {a.Grad}");
        Console.WriteLine($"b.Grad: expected 2, got {b.Grad}");
    }
}

Uncomment the Chapter 2 case in the dispatcher in Program.cs:

case "ch2":
    Chapter2Exercise.Run();
    break;

Then run it:

dotnet run -- ch2

These are the same results we traced through in the tables above: a.Grad = 4.0 (1.0 from the addition path + 3.0 from the multiplication path) and b.Grad = 2.0.

What This Means

a.Grad = 4.0 tells you: "if you increase a by a tiny amount e, then L increases by approximately 4e." This is the information the optimiser will use to update parameters.
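That claim can be checked without any autograd machinery. The sketch below recomputes L = a * b + a with plain doubles and nudges each input in turn. Loss is a hypothetical helper that restates the exercise's formula, and epsilon is an arbitrary small step.

```csharp
using System;

// The exercise's computation, written as a plain function of its two inputs.
static double Loss(double aValue, double bValue) => aValue * bValue + aValue;

double a = 2.0;
double b = 3.0;
double epsilon = 1e-6; // arbitrary small nudge

double baseline = Loss(a, b);

// Nudge one input at a time and measure how much L moves per unit of nudge.
double slopeA = (Loss(a + epsilon, b) - baseline) / epsilon;
double slopeB = (Loss(a, b + epsilon) - baseline) / epsilon;

Console.WriteLine($"L = {baseline}");
Console.WriteLine($"Slope with respect to a: {slopeA:F4}");
Console.WriteLine($"Slope with respect to b: {slopeB:F4}");
```

The slopes come out at about 4 and 2, matching a.Grad and b.Grad. Backward() gets the same answers in a single pass over the graph, where this nudge test needs one extra forward pass per input. That difference is what matters once there are thousands of parameters.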

Architecture Diagrams vs. the Computation Graph

Now that you've seen a.Grad accumulate to 4.0 from two different paths, there's a distinction worth pinning down. Diagrams of neural networks usually show a neat chain of boxes:

Input → [Layer 1] → [Layer 2] → Output

That's the architecture diagram. Think of it as a building's floor plan: it shows you the rooms and which rooms connect to which. Clean and linear.

The computation graph is what Value actually tracks. It's more like the building's electrical wiring diagram, showing every individual operation inside those boxes. And when you zoom in, you discover that values get reused. A single Value can feed into multiple operations.

Here's a simple example, the same one from the exercise:

a = 2.0
b = 3.0
c = a * b       // a is used here
L = c + a       // a is used here too

If you drew this as a tree, a would appear in two separate places, as if there were two copies. But there's only one a. The computation graph reflects this: a is a single node with two arrows coming out of it, one going to the multiplication and one going to the addition.

This means the computation graph isn't a tree, it's a web (technically a "directed acyclic graph"). That's exactly why a.Grad ended up at 4.0 in the exercise: Backward() accumulated contributions from both paths. The clearest payoff comes in Chapter 8, where residual connections (x = layer(x) + xResidual) reuse the same Value objects on both sides of the addition, and the += accumulation here is what makes their "gradient highway" actually work.

Chapter 1: The Value Class - Recording the Forward Pass

Gary Jackson — Tue, 21 Apr 2026 05:07:40 +0000

What You'll Build

A class called Value that wraps a double and remembers how it was created. Think of it as a number that keeps a receipt of every operation it went through.

Why This Comes First

In the Big Picture, Step 1 (the forward pass) chains together thousands of small operations, and Step 3 (the backward pass) walks those operations in reverse.

For that to work, every operation has to leave a record behind: what were the inputs, and how sensitive was the output to each input? The Value class is that record-keeping wrapper. Every number in our neural network is going to be a Value.

The Core Idea

A Value holds three things:

The number itself (Data)
A reference to the values that produced it (_inputs)
The local gradient (_localGrads) - how much the output of that specific operation would change if you wiggled each input

You don't need to understand the calculus behind these local gradients right now. Each operation has a known, fixed rule for them (listed in the table below), and the backward pass in Chapter 2 uses them mechanically.

The Grad field just holds its default of 0 for now. It gets filled in during the backward pass (Chapter 2) with the answer to: "how much does the final loss change if I wiggle this specific value?"

A naming distinction worth pinning down now. There are two things on a Value that both include the word "gradient", and they do different jobs:

Local gradient (_localGrads) - stored per operation (+, *, Exp, etc.), frozen at forward time. For each input to the op, it records: "if only that input changed by a tiny amount, how much would this op's output change?" It's a property of one operation in isolation.
Gradient (Grad) - filled in during the backward pass. Every Value has its own Grad, which records: "if only this Value's Data changed by a tiny amount, how much would the final loss change?" It's a property of the whole path from this Value to the loss.

The backward pass in Chapter 2 walks the graph in reverse, multiplying the two together via the chain rule to fill in every Grad.
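To make the distinction concrete, here is a hand-worked sketch of how local gradients turn into gradients along a single path, using loss = (a * b)² with plain doubles. It is a standalone scratch program, not the Chapter 2 implementation, and every variable name in it is an illustrative assumption.

```csharp
using System;

double a = 2.0, b = 3.0;

// Forward pass: each operation produces its output and freezes its local gradient.
double c = a * b;                            // 6
double localCWrtA = b;                       // multiplication rule: the other input
double loss = Math.Pow(c, 2);                // 36
double localLossWrtC = 2 * Math.Pow(c, 1);   // power rule n * c^(n-1), here 12

// Backward pass: start at the loss and multiply by one local gradient per step.
double lossGrad = 1.0;                       // the loss changes 1:1 with itself
double cGrad = localLossWrtC * lossGrad;     // 12
double aGrad = localCWrtA * cGrad;           // 3 * 12 = 36

Console.WriteLine($"loss = {loss}, c.Grad = {cGrad}, a.Grad = {aGrad}");
// prints "loss = 36, c.Grad = 12, a.Grad = 36"
```

The local gradients (3 and 12) each describe one operation in isolation; the gradients (12 and 36) describe the effect on the final loss.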

Code

// --- Value.cs ---

namespace MicroGPT;

public class Value(double data, Value[]? inputs = null, double[]? localGrads = null)
{
    public double Data = data;
    public double Grad; // filled in during the backward pass (Chapter 2)

    private readonly Value[]? _inputs = inputs;
    private readonly double[]? _localGrads = localGrads;

    // --- Arithmetic operations ---
    // Each operation records three things: the result, the inputs, and the local gradients.
    // The local gradients are explained in the "Verifying Local Gradients" section below.

    public static Value operator +(Value a, Value b) => new(a.Data + b.Data, [a, b], [1.0, 1.0]);

    public static Value operator *(Value a, Value b) =>
        new(a.Data * b.Data, [a, b], [b.Data, a.Data]);

    // NaN if Data is negative and n is fractional.
    public Value Pow(double n) => new(Math.Pow(Data, n), [this], [n * Math.Pow(Data, n - 1)]);

    // -Infinity if Data == 0, NaN if Data < 0. If you see NaN propagating through
    // training, a softmax probability collapsed to 0 and this is usually the entry point.
    public Value Log() => new(Math.Log(Data), [this], [1.0 / Data]);

    public Value Exp() => new(Math.Exp(Data), [this], [Math.Exp(Data)]);

    // ReLU: passes positive values through unchanged, blocks negatives entirely.
    public Value Relu() => new(Math.Max(0, Data), [this], [Data > 0 ? 1.0 : 0.0]);

    // --- Convenience overloads ---
    public static Value operator +(Value a, double b) => a + new Value(b);

    public static Value operator *(Value a, double b) => a * new Value(b);

    public static Value operator -(Value a) => a * -1;

    public static Value operator -(Value a, double b) => a + (-b);

    public static Value operator /(Value a, Value b) => a * b.Pow(-1);

    public static Value operator /(Value a, double b) => a * Math.Pow(b, -1);

    public override string ToString() => $"Value(data={Data})";
}

For quick reference, here are the local gradients each operation records:

Operation	Local gradient(s)
`a + b`	`1`, `1`
`a * b`	`b`, `a`
`a.Pow(n)`	`n * aⁿ⁻¹`
`a.Log()`	`1 / a`
`a.Exp()`	`eᵃ`
`a.Relu()`	`1` if `a > 0`, else `0`
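Division has no row in the table because Value.cs builds it from two recorded operations, a * b.Pow(-1). The sketch below composes the table's rules for those two operations by hand and compares the result with a direct measurement on plain division (nudging each input slightly, the technique the next section explains). It is a standalone scratch program using doubles only, and its variable names are assumptions for illustration.

```csharp
using System;

double a = 6.0, b = 3.0;

// a / b is recorded as a * b.Pow(-1), so two table rows are involved.
double inv = Math.Pow(b, -1);                  // b.Pow(-1)
double localInvWrtB = -1 * Math.Pow(b, -2);    // Pow rule: n * b^(n-1) with n = -1
double localQuotWrtA = inv;                    // multiplication rule: the other input
double localQuotWrtInv = a;                    // multiplication rule: the other input

// Compose the local gradients along each path.
double formulaA = localQuotWrtA;                   // equals 1 / b
double formulaB = localQuotWrtInv * localInvWrtB;  // equals -a / b^2

// Measure the same two rates directly on ordinary division.
double nudge = 0.0001;
double measuredA = ((a + nudge) / b - a / b) / nudge;
double measuredB = (a / (b + nudge) - a / b) / nudge;

Console.WriteLine($"wrt a: formula {formulaA:F4}, measured {measuredA:F4}");
Console.WriteLine($"wrt b: formula {formulaB:F4}, measured {measuredB:F4}");
```

The two columns should agree to roughly three or four decimals; division is not a straight function of b, so a small difference in the last printed digit is expected there.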

Verifying Local Gradients - The Nudge Test

Have a look at the addition operator:

public static Value operator +(Value a, Value b)
    => new(a.Data + b.Data, [a, b], [1.0, 1.0]);

The second argument, [a, b], records the two inputs. The third argument, [1.0, 1.0], records the local gradient for each input, in the same order. So:

The local gradient for input a is 1.0
The local gradient for input b is 1.0

But what does that actually mean, and why should you believe those are the right numbers?

You can answer both questions without any calculus. The technique is simple: nudge one input by a tiny amount, run the operation again, and see how much the output changed. The ratio of output-change to input-change is the local gradient.

Addition: why is the local gradient 1.0 for both inputs?

Let's say a = 2 and b = 3. The output is 2 + 3 = 5.

Now nudge a up by a tiny amount - say, 0.001:

New output: 2.001 + 3 = 5.001
The output changed by: 5.001 - 5.0 = 0.001
You nudged by 0.001, the output moved by 0.001
Ratio: 0.001 / 0.001 = 1.0 - that's the local gradient for a

Now nudge b instead:

New output: 2 + 3.001 = 5.001
Same result: ratio = 1.0 - that's the local gradient for b

Addition passes changes through at a 1:1 rate for both inputs, so the local gradients array is [1.0, 1.0].

Multiplication: why are the local gradients [b.Data, a.Data]?

Have a look at the multiplication operator:

public static Value operator *(Value a, Value b)
    => new(a.Data * b.Data, [a, b], [b.Data, a.Data]);

The local gradients are [b.Data, a.Data], meaning:

The local gradient for input a is b's value
The local gradient for input b is a's value

Let's verify with a = 2, b = 3. The output is 2 * 3 = 6.

Nudge a by 0.001:

New output: 2.001 * 3 = 6.003
The output changed by: 6.003 - 6.0 = 0.003
Ratio: 0.003 / 0.001 = 3.0 - that's the local gradient for a, and it equals b's value

Nudge b by 0.001:

New output: 2 * 3.001 = 6.002
The output changed by: 6.002 - 6.0 = 0.002
Ratio: 0.002 / 0.001 = 2.0 - that's the local gradient for b, and it equals a's value

This makes intuitive sense: the bigger b is, the more a small change to a gets amplified, and vice versa.

Power: the first curved function.

Have a look at the power operator:

public Value Pow(double n)
    => new(Math.Pow(Data, n), [this], [n * Math.Pow(Data, n - 1)]);

For a² (so n = 2), the local gradient n * Math.Pow(a, n - 1) simplifies to 2 * a. That's the first formula we've seen where the gradient depends on a itself, and the reason is that a² behaves differently from addition and multiplication. Let's see how.

Line up some input/output pairs for a²:

a = 1  →  a² =  1
a = 2  →  a² =  4   (jumped by 3)
a = 3  →  a² =  9   (jumped by 5)
a = 4  →  a² = 16   (jumped by 7)

Each step in a produces a bigger jump in a² than the last. Compare that to a + 5, which goes 6, 7, 8, 9 - the jump is exactly 1 every single step. We'll call a + 5 a straight function (same rate of change everywhere) and a² a curved function (rate of change grows as a gets bigger).

Multiplication is straight too, from a's perspective. a * 3 goes 3, 6, 9, 12 as a goes 1, 2, 3, 4 - the jump is always exactly 3. That's why the local gradient for multiplication is a fixed number (b.Data): no matter where you nudge a, the rate of change is the same.

a² is different. The rate of change at a = 3 isn't the same as the rate at a = 4. The formula 2 * a tells us: at a = 3, the rate is 6; at a = 4, the rate is 8. There's no single number that describes a²'s rate - you have to ask "rate at which value of a?".

Let's nudge-test at a = 3, using a small nudge (0.0001) to match the default in GradientCheck.cs below:

Original: 3² = 9
Nudged: 3.0001² = 9.00060001
Change in output: 0.00060001
Ratio: 0.00060001 / 0.0001 = 6.0001

The formula says the rate at a = 3 is exactly 6, and we measured 6.0001. The extra 0.0001 isn't a bug - it's the curvature leaking in. When we nudged from 3 to 3.0001, we technically measured something between "the rate at 3" and "the rate at 3.0001" (which is very slightly steeper), so we overshoot the true answer at 3 by a tiny amount. Halve the nudge and the error halves with it.

This is a general fact about the nudge test: it's exact for straight functions (apart from floating-point rounding, which is far too small to show at four decimals), and slightly off for curved ones by an amount roughly proportional to the nudge size. Keep that in mind when we run the full check in a moment - you'll see 6.0001 for Power and a similar drift for Exp (also curved), while Addition and Multiplication come out perfectly.
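The claim that the error shrinks along with the nudge is easy to check. The sketch below measures the rate of a² at a = 3 with three nudge sizes, each half the previous one. It is a standalone scratch program; the local MeasureGradient mirrors the idea of the nudge test but is defined here only so the snippet runs on its own.

```csharp
using System;
using System.Collections.Generic;

// Forward nudge: (f(at + nudge) - f(at)) / nudge.
double MeasureGradient(Func<double, double> f, double at, double nudge) =>
    (f(at + nudge) - f(at)) / nudge;

double exact = 2 * 3.0;            // the formula 2 * a at a = 3
var errors = new List<double>();

foreach (double nudge in new[] { 0.0001, 0.00005, 0.000025 })
{
    double measured = MeasureGradient(x => x * x, 3.0, nudge);
    double error = measured - exact;
    errors.Add(error);
    Console.WriteLine($"nudge {nudge}  measured {measured:F8}  error {error:E2}");
}

// Each error should be about half the one before it.
Console.WriteLine($"error ratios: {errors[1] / errors[0]:F3}, {errors[2] / errors[1]:F3}");
```

For a² the overshoot works out to the nudge itself, since ((3 + h)² - 9) / h = 6 + h, which is why the error column tracks the nudge column so closely.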

You can verify any operation this way. You don't need to trust the formulas. You don't need calculus. You just need subtraction and division.
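For instance, Relu is in the table but has no worked example above. A minimal sketch of nudge-testing it, written as a standalone scratch program with its own local helper functions (both are stand-ins defined here, not the course's classes):

```csharp
using System;

// Forward nudge: (f(at + nudge) - f(at)) / nudge.
double MeasureGradient(Func<double, double> f, double at, double nudge = 0.0001) =>
    (f(at + nudge) - f(at)) / nudge;

// Same forward rule as Value.Relu: pass positives through, clamp negatives to 0.
double Relu(double x) => Math.Max(0, x);

double measuredPositive = MeasureGradient(Relu, 2.0);   // slope on the positive side
double measuredNegative = MeasureGradient(Relu, -2.0);  // flat on the negative side
double measuredAtZero = MeasureGradient(Relu, 0.0);     // right at the kink

Console.WriteLine($"at  2: measured {measuredPositive:F4}, table says 1");
Console.WriteLine($"at -2: measured {measuredNegative:F4}, table says 0");
Console.WriteLine($"at  0: measured {measuredAtZero:F4}, Value.cs records 0");
```

Away from zero the measurement agrees with the table. Exactly at zero, an upward nudge measures 1 while Value.cs records 0 (its test is Data > 0). Relu has a corner there, so no single rate exists at that point and the code simply picks one.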

Heads up: Log and Pow can blow up on certain inputs. Log(0) gives -Infinity, Log of any negative number gives NaN, and Pow gives NaN when you raise a negative number to a non-whole power like 0.5. If NaN ever starts spreading through training later in the course, one of these two operations is almost always where it started. Come back to this section and nudge-test the suspect values.
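These edge cases can be confirmed with the raw Math calls that Value wraps. The sketch below is a standalone scratch program; it prints booleans rather than the values themselves because the text .NET uses for infinity can vary with culture settings.

```csharp
using System;

double zero = 0.0;

double logOfZero = Math.Log(zero);           // negative infinity
double logOfNegative = Math.Log(-1.0);       // NaN
double fractionalPow = Math.Pow(-2.0, 0.5);  // NaN: negative base, non-whole exponent
double logLocalGrad = 1.0 / zero;            // Log's local gradient 1 / Data at Data == 0

Console.WriteLine(double.IsNegativeInfinity(logOfZero));     // True
Console.WriteLine(double.IsNaN(logOfNegative));              // True
Console.WriteLine(double.IsNaN(fractionalPow));              // True
Console.WriteLine(double.IsPositiveInfinity(logLocalGrad));  // True

// NaN is contagious: arithmetic that touches it produces NaN again.
double poisoned = logOfNegative * 0.0 + 1.0;
Console.WriteLine(double.IsNaN(poisoned));                   // True
```

The last line is why one bad value can take over a whole training run: once a NaN enters the graph, every result computed from it is NaN as well.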

Running the Nudge Test in Code

Before trusting the local-gradient formulas from the table above, let's verify them by direct measurement. The helper class below runs the nudge test against raw math operations (no Value objects involved) to prove the formulas are correct independently of the C# implementation. Put it in GradientCheck.cs:

// --- GradientCheck.cs ---

namespace MicroGPT;

public static class GradientCheck
{
    /// <summary>
    /// Measures the gradient of a function at a specific input value by nudging
    /// and observing. Works for any function that takes a double and returns a double.
    /// </summary>
    public static double MeasureGradient(Func<double, double> f, double at, double nudge = 0.0001)
    {
        double before = f(at);
        double after = f(at + nudge);
        return (after - before) / nudge;
    }

    /// <summary>
    /// Runs the nudge test for addition, multiplication, Pow, Exp, and Log and prints the results.
    /// </summary>
    public static void RunAll()
    {
        // The "expected" column for each check is computed by applying the local
        // gradient formula from Value.cs directly. The "measured" column comes from
        // the nudge test. If the formula is right, the two columns should agree.
        static void Row(string name, double measured, double expected) =>
            Console.WriteLine(
                $"  {name, -16}  measured {measured, 8:F4}   expected {expected, 8:F4}"
            );

        Console.WriteLine("=== Straight functions (measurement should be exact) ===");
        Console.WriteLine();

        // Addition: local gradient formula from Value.cs is [1.0, 1.0]
        {
            double a = 2,
                b = 3;
            Console.WriteLine($"--- Addition: a + b where a={a}, b={b} ---");
            Row("Gradient for a", MeasureGradient(x => x + b, a), 1.0);
            Row("Gradient for b", MeasureGradient(x => a + x, b), 1.0);
            Console.WriteLine();
        }

        // Multiplication: local gradient formula from Value.cs is [b.Data, a.Data]
        {
            double a = 2,
                b = 3;
            Console.WriteLine($"--- Multiplication: a * b where a={a}, b={b} ---");
            Row("Gradient for a", MeasureGradient(x => x * b, a), b);
            Row("Gradient for b", MeasureGradient(x => a * x, b), a);
            Console.WriteLine();
        }

        Console.WriteLine("=== Curved functions (tiny drift proportional to nudge size) ===");
        Console.WriteLine();

        // Power: local gradient formula from Value.cs is n * Math.Pow(Data, n - 1)
        {
            double a = 3,
                n = 2;
            Console.WriteLine($"--- Power: a^n where a={a}, n={n} ---");
            Row("Gradient for a", MeasureGradient(x => Math.Pow(x, n), a), n * Math.Pow(a, n - 1));
            Console.WriteLine();
        }

        // Exp: local gradient formula from Value.cs is Math.Exp(Data)
        {
            double a = 1;
            Console.WriteLine($"--- Exp: e^a where a={a} ---");
            Row("Gradient for a", MeasureGradient(x => Math.Exp(x), a), Math.Exp(a));
            Console.WriteLine();
        }

        // Log: local gradient formula from Value.cs is 1.0 / Data
        {
            double a = 4;
            Console.WriteLine($"--- Log: ln(a) where a={a} ---");
            Row("Gradient for a", MeasureGradient(x => Math.Log(x), a), 1.0 / a);
        }
    }
}

Wire it into the dispatcher by uncommenting the gradcheck line in the switch in Program.cs:

case "gradcheck":
    GradientCheck.RunAll();
    break;

Then run it:

dotnet run -- gradcheck

Every measured gradient matches its formula from the table (Relu is the one table entry RunAll doesn't check), including the curvature tilt we predicted. Power shows 6.0001 (exactly the number we worked out by hand earlier), and Exp shows a similar small drift because e^a is also curved. Addition and Multiplication come out perfectly because they're straight from each input's perspective. Log at a = 4 is curved but so gently that the error hides below the fourth decimal.

Exercise: Verify Value Operations

Create Chapter1Exercise.cs. This verifies that Value computes correct forward results:

// --- Chapter1Exercise.cs ---

namespace MicroGPT;

public static class Chapter1Exercise
{
    public static void Run()
    {
        // Verify forward pass - chained operations produce correct results
        var a = new Value(2.0);
        var b = new Value(3.0);
        Value c = a * b;
        Value d = c + a;
        Value e = d.Pow(2);

        Console.WriteLine("--- Forward Pass ---");
        Console.WriteLine($"c: expected 6,  got {c.Data}");
        Console.WriteLine($"d: expected 8,  got {d.Data}");
        Console.WriteLine($"e: expected 64, got {e.Data}");
    }
}

Wire it into the dispatcher by uncommenting this line in the switch in Program.cs:

case "ch1":
    Chapter1Exercise.Run();
    break;

Then run it:

dotnet run -- ch1

A Design Choice Worth Noticing

If you look at the Value operators, the local gradient values are computed immediately during the forward pass. When a * b runs, the resulting Value already contains [b.Data, a.Data] as concrete numbers. The backward pass then just multiplies and accumulates - it never computes a local gradient itself.

Production frameworks like PyTorch do this differently. They store the inputs during the forward pass, then compute the local gradient values during the backward pass using those stored inputs. For a * b, PyTorch saves references to a and b, then during backward computes b * upstream_gradient and a * upstream_gradient.

The final numbers are identical - it's the same math, just performed at a different time.

Our Value class precomputes because it makes the code simpler to understand: you can see the local gradients right there in the operator. PyTorch defers the computation because at scale (tensors with millions of numbers), precomputing and storing all the local gradients would use a lot of memory. It's cheaper to store just the inputs and recompute when needed. But for a scalar Value, storing two doubles per operation is trivial.
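The two timings can be put side by side for a single scalar multiplication. The sketch below is an illustration of the idea only: it is not PyTorch's actual machinery and not the course's Value class, and MulEager, MulDeferred, and upstreamGrad are hypothetical names. It runs as a standalone scratch program.

```csharp
using System;

// Eager (this chapter's approach): local gradients are computed during the
// forward call and captured as finished numbers.
(double Output, Func<double, (double GradA, double GradB)> Backward) MulEager(double x, double y)
{
    double localX = y;   // computed now, at forward time
    double localY = x;
    return (x * y, up => (localX * up, localY * up));
}

// Deferred (the approach described for production frameworks): only the inputs
// are captured, and the local gradients are derived when backward runs.
(double Output, Func<double, (double GradA, double GradB)> Backward) MulDeferred(double x, double y)
{
    return (x * y, up => (y * up, x * up));
}

double a = 2.0, b = 3.0;
double upstreamGrad = 5.0;   // hypothetical gradient arriving from later operations

var eagerGrads = MulEager(a, b).Backward(upstreamGrad);
var deferredGrads = MulDeferred(a, b).Backward(upstreamGrad);

Console.WriteLine($"eager:    gradA = {eagerGrads.GradA}, gradB = {eagerGrads.GradB}");
Console.WriteLine($"deferred: gradA = {deferredGrads.GradA}, gradB = {deferredGrads.GradB}");
// both lines report gradA = 15, gradB = 10
```

For one scalar multiplication the two versions hold the same two numbers either way, which is the point made above: the memory argument for deferring only starts to matter when the stored items are large tensors.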