Forem: Kyle Pena

When ‘Magic’ Works: Type-Level Tricks in TypeScript

Kyle Pena — Fri, 19 Sep 2025 21:38:40 +0000

Sufficiently Advanced Technology

The received wisdom among developers is to avoid magic: magic numbers, magic strings, and especially framework-flavored "auto-magic". It's brittle, it's unpredictable, and it's not explicit.

On the other hand, Arthur C. Clarke stated that any sufficiently advanced technology is indistinguishable from magic.

I think the operative difference between harmful magic and sufficiently advanced technology is how well it works. And I think that TypeScript type tricks are an excellent platform for building the latter.

Bad Magic: An Anecdote

I once worked with a team that was building an auto-instrumentation framework for ES2020. The goal was to provide a low-friction mechanism (i.e., an attribute) to instrument a wide variety of language features - method calls, property access, yields from a generator, and so on.

After two months, we were still struggling with edge cases and memory leaks. We ultimately pulled the plug.

Why wasn't this project successful? Without itemizing the individual technical shortcomings that made up the failure, I think the root cause was that the scope was too large.

The ES2020 runtime is very complicated. For example, check out these obscure corners. But a foolproof auto-instrumentation tool must handle all of these language features correctly.

Here's my takeaway: for "magic" to work reliably and without enormous engineering effort, a restricted scope seems necessary.

Type Tricks Have Limited Scope

Type systems are declarative and finite; runtimes are dynamic and stateful. This makes type-based auto-magic much simpler to pull off. This is where TypeScript really shines.

Zod, Prisma, and Hono all use extensive auto-magic typing tricks and are resounding successes. Zod, for example, is widely adopted.

Indeed, TypeScript's advanced type system seems designed to encourage magic.

Good Magic: Quick Case Study

Suppose we have some unmanaged resource that we retrieve by scoped ID:

employee/543293 represents Employee 543293.
department/55 represents Department 55.
building/3 represents Building 3.

The resource is ingested into the runtime as some kind of untyped "thing" (bytes, JSON, database record, etc.).

So we have to cast with as.

For example, we might write:

const employee = (await retrieve(`employee/543293`)) as Employee;

Or, if we don't want to do that, we could write a special-built method per entity:

function retrieveEmployee(identifier : string) : Promise<Employee> {
   return retrieve(identifier) as Promise<Employee>;
}

const employee : Employee = await retrieveEmployee(`employee/543293`);

Neither pattern scales well with the number of entities - we end up with either a proliferation of per-entity methods or a proliferation of as casts.

Is there a better way? Thankfully yes. It is possible to write the code very cleanly:

const employee : Employee = await retrieve(`employee/${key}`);

const department : Department = await retrieve(`department/${employee.departmentId}`);

const building : Building = await retrieve(`building/${department.buildingId}`);

Notice that the call-site signature for retrieve is the same on every line (no generic types supplied in angle brackets, for example).

TypeScript is automatically inferring the return type based on the contents of a template literal!!!

Here's the source, followed by an explanation:

type Employee = {
    name : string
    departmentId : string;
}

type Department = {
    id : string
    name : string
    budget : number
    buildingId: string
}

type Building = {
    id : string
    name : string
}

type KeywordToObjectType = {
    employee : Employee,
    department : Department,
    building : Building
}

// Keyword ::= employee, department, building
type Keyword = keyof KeywordToObjectType;

// Format of the string literal
type ObjectIdentifier = `${Keyword}/${string}`;

// Utility: Extracts the keyword out of the string literal
type ExtractKeyword<T> = T extends `${infer K extends Keyword}/${string}` ? 
    K : 
    never;

// Utility: Maps ObjectIdentifier to appropriate type
type TypeForObjectIdentifier<T> = KeywordToObjectType[ExtractKeyword<T>];

// And this is the retrieve function, which handles the cast
export async function retrieve<T extends ObjectIdentifier>(identifier: T) : 
    Promise<TypeForObjectIdentifier<T>> {
    const retrievedValue = {}; // stub -> fetch / parse / etc.
    return retrievedValue as TypeForObjectIdentifier<T>;
}

Generic Types As "Functions On Types"

Note the return type of retrieve: TypeForObjectIdentifier<T>.

It's helpful to think of TypeForObjectIdentifier<T> as a "function on type T", rather than a generic type in the traditional sense.

Therefore the declaration of TypeForObjectIdentifier<T> amounts to a simple "type program" which accepts an object identifier T and returns the appropriate object type. For example, it maps building/3 to the object type Building.

Template Literal Parsing

Type-level parsing of string template literals is the heart of the entire trick. It is what allows TypeScript to infer the return type based on the first portion of the object identifier.

TypeScript can perform type inference on template literals - even when some portions have values that are only known at runtime!

In our declaration of ExtractKeyword<T>, we ask TypeScript to match T against the template literal type ${infer K extends Keyword}/${string}. If the match succeeds, the result is K; otherwise it is never (which can trigger a compiler failure).

type ExtractKeyword<T> = T extends `${infer K extends Keyword}/${string}` ? 
    K : 
    never;

When TypeScript matches building/${department.buildingId} to ${infer K extends Keyword}/${string}, department.buildingId is statically known to be a string.

Likewise, building is bound to K by way of the keyword infer: we can think of K as if it is a named capture group in a regex. Its value is a side effect of matching the object identifier to the template literal type.

Indexed Access Types

In TypeScript's "type language", the syntax Type['property'] produces the type corresponding to the property property on Type. This is called an indexed access type. Therefore, KeywordToObjectType['building'] produces the type Building.

Putting It All Together

type TypeForObjectIdentifier<T> = KeywordToObjectType[ExtractKeyword<T>];

Now it should be clearer how TypeForObjectIdentifier<T> works: Extract the Keyword portion of the string template via ExtractKeyword<T>, and then map it to the object type.

Adding Runtime Safety

This trick does not guarantee runtime safety. If the shape of the retrieved object does not conform to the expected type given its identifier, we're past the limits of what compile-time safety can guarantee. We can remedy this with some clever use of Zod schemas.

Here is the code in its entirety:

import z from "zod"

const Employee = z.object({
    id: z.string(),
    name: z.string(),
    departmentId: z.string() // needed by the employee.departmentId call site
});

const Department = z.object({
    id : z.string(),
    name : z.string(),
    budget : z.number(),
    buildingId: z.string()
});

const Building = z.object({
    id : z.string(),
    name : z.string()
});

const KeywordToZodSchema = {
    employee: Employee,
    department: Department,
    building: Building
};

type Keyword = keyof typeof KeywordToZodSchema;

type KeywordToObjectType = {
    [key in Keyword]: z.infer<(typeof KeywordToZodSchema)[key]>
}

// Format of the string literal
type ObjectIdentifier = `${Keyword}/${string}`;

// Utility: Extracts the keyword out of the string literal
type ExtractKeyword<T> = T extends `${infer K extends Keyword}/${string}` ? 
    K : 
    never;

// Utility: Maps ObjectIdentifier to appropriate type
type TypeForObjectIdentifier<T> = KeywordToObjectType[ExtractKeyword<T>];

function extractKeyword(identifier: string) : Keyword {
    const keyword = identifier.split('/')[0];
    if (!(keyword in KeywordToZodSchema)) {
        throw new Error("Invalid identifier");
    }
    return keyword as Keyword;
}

export async function retrieve<T extends ObjectIdentifier>(identifier: T) : 
    Promise<TypeForObjectIdentifier<T>> {
    const retrievedValue = {}; // fetch / parse / etc.
    const keyword = extractKeyword(identifier);
    const parsed = KeywordToZodSchema[keyword].parse(retrievedValue);
    return parsed as TypeForObjectIdentifier<T>;
}

The Runtime Validation Refactor - Step by Step

First, we redefined our types as Zod schemas.

import z from "zod"

const Employee = z.object({
    id: z.string(),
    name: z.string(),
    departmentId: z.string() // needed by the employee.departmentId call site
});

const Department = z.object({
    id : z.string(),
    name : z.string(),
    budget : z.number(),
    buildingId: z.string()
});

const Building = z.object({
    id : z.string(),
    name : z.string()
});

Then, we defined a mapping from keyword to schema as a const map.

const KeywordToZodSchema = {
    employee: Employee,
    department: Department,
    building: Building
};

Then, we recovered the KeywordToObjectType from our original implementation using a mapped type. This keeps the type definition automatically in sync with the const mapping from keyword to schema, which is available at runtime.

type Keyword = keyof typeof KeywordToZodSchema;

type KeywordToObjectType = {
    [key in Keyword]: z.infer<(typeof KeywordToZodSchema)[key]>
}

At runtime, we need to be able to map object identifier values to keyword values, so we defined a runtime counterpart to ExtractKeyword<T> called extractKeyword:

function extractKeyword(identifier: string) : Keyword {
    const keyword = identifier.split('/')[0];
    if (!(keyword in KeywordToZodSchema)) {
        throw new Error("Invalid identifier");
    }
    return keyword as Keyword;
}

And finally, we re-implemented retrieve to perform runtime validation with Zod. First, we extract the keyword from the identifier. Then, we index into KeywordToZodSchema to select the appropriate Zod schema object. Finally, we parse and return the result.

export async function retrieve<T extends ObjectIdentifier>(identifier: T) : 
    Promise<TypeForObjectIdentifier<T>> {
    const retrievedValue = {}; // fetch / parse / etc.
    const keyword = extractKeyword(identifier);
    const schema = KeywordToZodSchema[keyword]
    const parsed = schema.parse(retrievedValue);
    return parsed as TypeForObjectIdentifier<T>;
}

There's an unavoidable as because we have to narrow parsed from a type union covering all possible object types into the specific type as represented by TypeForObjectIdentifier<T>.

Conclusion

It is possible, helpful, and practical to implement "magic type tricks" for cleaner TypeScript code. In fact, many popular frameworks do it, and you can do remarkable things with template literals.

Rather than create trouble and confusion, these tricks increase type safety and make for a better developer experience.

Visualizing Decoder Layer Gradients

Kyle Pena — Sat, 12 Jul 2025 16:59:16 +0000

In this short post, I'll explain a practical problem you'll encounter when visualizing the gradients of decoder layers, and how to resolve it.

The Llama 3.2-1b model consists of a token input embedding layer, 16 stacked decoder layers, and a token prediction head.

Each decoder layer takes as input a hidden state tensor of dimension (B,N,2048), where B is the batch dimension, N is the number of tokens, and 2048 is the model dimension. Therefore H[0,1,10] represents the activation of the 11th "neuron" of the second token in a batch size of one.

PyTorch's autograd allows us to compute the partial derivative of any activation (call it $\alpha$ ) with respect to some earlier layer of the network - say, for layer j, $H^j$ :

tokenized = tokenizer(input_text, return_tensors="pt", padding=True).to(device)
result = model(**tokenized, output_hidden_states=True)
hidden_states = result.hidden_states
gradient = torch.autograd.grad(
    outputs=alpha,
    inputs=hidden_states[j],
)[0]

torch.autograd.grad computes the gradient of one scalar activation per call, so for the complete picture we need to loop over every activation in the next layer $H^{j+1}$ :

gradients = {}
token_idx = ...
for i in range(2048):
    alpha = hidden_states[j+1][:, token_idx, i]
    gradients[i] = torch.autograd.grad(
        outputs=alpha,
        inputs=hidden_states[j],
        retain_graph=True,  # keep the graph alive for the next iteration
    )[0]

As a heatmap, the result looks something like this (this is called the Jacobian):

Notice the strong diagonal? This is not a bug. It's because of skip connections.

Skip Connections

Skip connections are ubiquitous in LLM architectures, as they are one of the primary reasons that it is possible to train very deep networks. If we are fitting a layer $f (x) = H (x) + x$ , where $H (x)$ is the main function (e.g., self-attention), then $H (x)$ can be interpreted as the residual $f (x) - x$ . Thus each layer is responsible for learning an additive update to the input x instead of a wholesale transformation. This makes deep networks easier to train, as in some sense fitting the residual is "easier". It also appears to have some kind of strange smoothing effect on the loss landscape that is no doubt related. You can read the ResNet paper for more details.

However, this is problematic for our purposes because we can't read the "real" gradients along the diagonal - they are obscured by the overwhelming effect of the skip connection.

The Fix

This is tricky to fix if we take the decoder layer as a whole. However, if we focus on the self-attention layer (or the fully connected layer) within the decoder block, it's very manageable.

Let's take a look at this (slightly simplified) implementation of the decoder layer.

        # Self Attention
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states, self_attn_weights = self.self_attn(...)
        hidden_states = residual + hidden_states # Skip connection

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states # Skip connection

        return hidden_states

Notice that the implementation is cleanly separated into two "blocks": Self-Attention and Fully Connected.

To make it a bit clearer, we might write it like this:

    def forward(self, hidden_states):

        residual = hidden_states
        hidden_states = self.self_attention(hidden_states)
        hidden_states = residual + hidden_states # <- Skip Connection


        residual = hidden_states
        hidden_states = self.fully_connected(hidden_states)
        hidden_states = residual + hidden_states # <- Skip Connection

        return hidden_states

First, we'll need a few hook helper methods to store a reference to the relevant tensors.


import logging

layer_ins = {}
layer_outs = {}

def _save_layer_output(tag):
    def hook(_mod, args, kwargs, out):
        logging.debug(f"Saving output from {tag}, type: {type(out)}")
        if isinstance(out, tuple):
            out = out[0]
        layer_outs[tag] = out
    return hook

def _save_layer_input(tag):
    def hook(_mod, args, kwargs):
        logging.debug(f"Saving input to {tag}, type: {type(args)}")
        if isinstance(args, tuple):
            args = args[0]
        layer_ins[tag] = args
    return hook

Then, we can register the hooks. register_forward_pre_hook captures the input to an operation and register_forward_hook captures the output. If you refer back to the complete decoder layer forward pass implementation, input_layernorm and post_attention_layernorm are the cutpoints in the computation graph we need to isolate the self-attention portion of the decoder layer.

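def setup_hooks_attn():
    for idx, layer in enumerate(model.model.layers):
        # with_kwargs=True makes PyTorch call the hooks with (module, args, kwargs[, output]),
        # which is the signature the helpers above expect.
        layer.input_layernorm.register_forward_pre_hook(
            _save_layer_input(f"input_layernorm_{idx}"), with_kwargs=True)
        layer.post_attention_layernorm.register_forward_hook(
            _save_layer_output(f"post_attention_layernorm_{idx}"), with_kwargs=True)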

The gradient visualization for self-attention still shows the same diagonal dominance phenomenon:

But it disappears if we simply subtract the identity matrix I (note the updated colormap scale).

Why It Works

The explanation is simple. The self-attention layer with skip-connection $f (x) = H (x) + x$ has a Jacobian (read: the matrix of partial derivatives, i.e. the autograd gradients) with respect to x of $J_f = J_{H(x)} + I$ .

Therefore, the Jacobian of just the main branch $H (x)$ is $J_{H(x)} = J_f - I$ . In plain English, this simply means we should subtract the identity matrix from the gradient matrix.

This fix works equally well for the fully connected portion of the decoder layer.
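As a quick sanity check of the identity above, here is a minimal NumPy sketch. It is not the Llama code: a toy branch `tanh(Wx)` stands in for self-attention, and the names `W`, `branch`, `layer_with_skip`, and `numerical_jacobian` are assumptions made purely for illustration. The Jacobians are estimated with finite differences rather than autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W = rng.normal(size=(d, d)) * 0.1  # toy weights standing in for the main branch

def branch(x):
    """Toy main branch H(x); a stand-in for self-attention."""
    return np.tanh(W @ x)

def layer_with_skip(x):
    """f(x) = H(x) + x, i.e. the branch plus its skip connection."""
    return branch(x) + x

def numerical_jacobian(fn, x, eps=1e-6):
    """Central finite-difference Jacobian: J[i, k] = d fn(x)[i] / d x[k]."""
    J = np.zeros((fn(x).size, x.size))
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        J[:, k] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return J

x = rng.normal(size=d)
J_f = numerical_jacobian(layer_with_skip, x)
J_H = numerical_jacobian(branch, x)

# Subtracting the identity from the full Jacobian recovers the branch Jacobian.
recovered = J_f - np.eye(d)
print(np.allclose(recovered, J_H, atol=1e-6))
```

The diagonal of `J_f` sits close to 1 because of the skip connection, which is the same effect that dominates the heatmaps.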

Gradient Descent on Token Input Embeddings

Kyle Pena — Mon, 23 Jun 2025 19:34:29 +0000

Input Embedding Space Gradients

This is the first in a series of posts on the question:

"Can we extract meaningful information or interesting behavior from gradients on 'input embedding space'?"

I'm defining 'input embedding space' as the token embeddings prior to positional encoding.

The basic procedure for obtaining input space gradients is as follows:

Transform tokens into embeddings (but do not apply positional embedding).
Run an ordinary forward pass on the input embeddings to obtain a predicted token distribution.
Measure cross-entropy of the predicted distribution with a target token distribution.
Use autograd to calculate the gradient of the cross-entropy with respect to the input embeddings.

The result is a tensor of the same shape as the input embeddings; stepping against it moves the embeddings in the direction that reduces the difference between the predicted and target distributions.
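The four steps can be sketched end to end with a toy model. This is a NumPy illustration under heavy assumptions, not the ModernBERT pipeline: the "model" is just mean-pooling followed by a linear readout (`W_out`), and the gradient is written out analytically instead of using autograd. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_tokens = 10, 4, 3

embedding_table = rng.normal(size=(vocab_size, d_model))  # toy token embeddings
W_out = rng.normal(size=(vocab_size, d_model))            # toy "model": mean-pool, then linear readout

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(inputs_embeds):
    """Toy forward pass: mean-pool the token embeddings, then project to a vocab distribution."""
    return softmax(W_out @ inputs_embeds.mean(axis=0))

def cross_entropy(target, predicted):
    return -np.sum(target * np.log(predicted))

# Step 1: tokens -> embeddings (no positional information is added).
token_ids = np.array([1, 5, 7])
inputs_embeds = embedding_table[token_ids]

# Steps 2-3: forward pass, then cross-entropy against a target distribution.
target = np.zeros(vocab_size)
target[2] = 1.0
loss = cross_entropy(target, predict(inputs_embeds))

# Step 4: gradient of the loss with respect to the input embeddings.
# For softmax + cross-entropy, dL/dlogits = predicted - target; the mean-pool
# spreads that gradient equally over the tokens.
dlogits = predict(inputs_embeds) - target
grad = np.tile((W_out.T @ dlogits) / n_tokens, (n_tokens, 1))

print(grad.shape == inputs_embeds.shape)
```

In this toy setting, a small step against `grad` lowers the loss, which is the behavior the later optimization experiments rely on.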

Choice Of Model

These experiments were performed with HuggingFace's transformers library and the ModernBERT-large model (released Dec 2024).

ModernBERT-large was chosen because:

Despite being "large" it is lightweight enough for rapid experimentation.
It has a strong and ready-made visualization suite.
Bidirectional encoders are easy to reason about.
The mask in-filling capabilities were attractive for experimentation purposes (for example: ablation studies).

Follow-up experiments with meta-llama/Llama-3.2-1B are included later in this post.

Implementation

Obtaining input embeddings prior to positional embeddings required reaching into the internals of the network architecture:

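from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
tokenized = tokenizer(sentences, return_tensors="pt", padding=True)
# Look up raw token embeddings directly, before any further processing
inputs_embeds = model.model.embeddings.tok_embeddings(tokenized['input_ids'])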

Fortunately, we can pass inputs_embeds directly into the model's forward pass. This bypasses the embedding lookup and uses the embeddings we supply instead.


tokenized_no_input_ids = {
    key: value
    for (key, value) in tokenized.items()
    if key != "input_ids"
}
model_result = model(**tokenized_no_input_ids,
                     inputs_embeds=inputs_embeds)

Finally, we can use torch's built-in autograd capabilities to get the gradient in input embedding space:

    inputs_embeds_grad = torch.autograd.grad(
        outputs=loss,
        inputs=inputs_embeds,
        create_graph=False,
        retain_graph=False,
        allow_unused=False    
    )

Case Study: Horses and Dogs, Neighs and Barks

To make things more concrete, let's start with two prompts:

"The animal that says bark is a ____"
"The animal that says neigh is a ____"

The token distributions as predicted by ModernBERT-large are, respectively:

Representing the left distribution as 🐶 and the right distribution as 🐴, we are computing the gradient of:

with respect to cross_entropy(🐶,🐴).

Which means:

"Figure out which direction each token wants to go in order to fill in the blank with 'horse' instead of 'dog'".

As a gut-check, let's measure the L2 norm of the gradients for each token to give us a rough sense of the "impulse" given by cross entropy on each token:

The tokens with the top 3 gradient L2 norms are "says", "dog" and "animal".

This is encouraging. But are the gradient directions meaningful?

Let's see if any of the gradients point in a neigh-like direction by finding the vocab token with the largest cosine similarity to our gradient: argmax(cosine_sim(gradient, vocabulary))

However, perhaps this is the wrong question to ask. We want to understand if the gradient is heading towards any vocab token starting from the initial embedding:

argmax(vocab, cosine_sim(gradient, vocab - bark))

Sadly, this yields the same set of tokens because the gradient vectors are mostly orthogonal to the original embedding (indeed, they all have a cosine similarity of about -0.01):
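Both nearest-token queries above can be sketched in a few lines of NumPy. The vocabulary, the gradient, and the index `bark_id` below are random stand-ins invented for illustration; only the two argmax expressions mirror the ones in the text.

```python
import numpy as np

def cosine_sim(vector, matrix):
    """Cosine similarity between one vector and every row of a matrix."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return (matrix @ vector) / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
vocab = rng.normal(size=(50, 8))      # toy vocabulary embeddings (hypothetical)
bark_id = 3                           # hypothetical index of the " bark" token
bark = vocab[bark_id]
gradient = rng.normal(size=8)         # stand-in for the gradient at " bark"

# Query 1: which vocab embedding is most aligned with the gradient itself?
nearest_by_direction = int(np.argmax(cosine_sim(gradient, vocab)))

# Query 2: which vocab token is the gradient heading towards, starting from " bark"?
headings = vocab - bark               # displacement from " bark" to every token
scores = cosine_sim(gradient, headings)
scores[bark_id] = -np.inf             # the zero displacement to itself is meaningless
nearest_by_heading = int(np.argmax(scores))

print(nearest_by_direction, nearest_by_heading)
```

With random toy data the two queries will usually disagree. They return the same tokens in the real experiment only because the gradients are nearly orthogonal to the starting embedding.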

ADAM on Input Embeddings

Although the early indications are mixed, it would be interesting to try to ADAM optimize the input embeddings.

It does converge (quite rapidly):

Animating the top token probabilities illustrates the convergence quite nicely:

And most encouragingly, " bark" seems to be on the move!

While " bark" is moving, I should point out that the new embedding (we can call it bark') is still firmly in " bark" territory. No other vocab token is closer by cosine similarity or Euclidean distance.

The Euclidean distance between " neigh" and " bark" is around 2.5, and after 500 training steps we have barely traveled 0.8. An extended training run of 10,000 steps still lands bark' firmly in bark world.

But has " bark" traveled towards anything in particular?

Indeed - "bark" has traveled more towards neigh than any other token in the vocabulary.

While this is encouraging, the cosine similarity of the heading towards neigh is nothing astonishing: about 0.3.

Repeating this exercise over 64 examples, we can see that 'bark' is a bit of an outlier (it was a contrived example). The total L2 token embedding distances per sequence typically level off, while the KL-divergence approaches zero.

Is there any kind of structure about which dimensions are affected? By inspecting histograms and cumulative density plots of per-dimension movement in input embedding space, it doesn't appear that any particular token was "favored" - all tokens had a roughly equal distribution of embedding dimension displacement. The following histogram from our 64 test examples is typical.

Some Hypotheses

I conjecture that performing gradient descent on input space embeddings places us in the "overparameterized regime".

This has some implications for where and how we minimize to nearly zero loss.

Specifically:

The global minima manifold is "close to everywhere".
There are almost no local minima - which means that a global minimum is reachable from every starting point by gradient descent.
The "global minima manifold" is conjectured to be vast and interconnected.

The first point is uncontroversial - it is a well known property of high dimensional Euclidean space that all points become "close".

The second point helps explain why loss in the overparameterized regime almost always converges to nearly zero.

The third point explains why we should have no expectation that the point we converge to is in any way interpretable: The global minima manifold is itself quite high dimensional, and only a tiny fraction of the points on it have sensible back-projections.

TLDR; our consistent ability to converge to zero loss, the lack of interpretability of the results, and the relatively short distance our embeddings travel all lend support to the claim that we are seeing a classic overparameterized loss landscape.

More Validation - Randomized Input Embeddings

But, to further validate our hypotheses about a vast and everywhere-close global minima manifold, we will conduct a final experiment:

Prior to gradient descent, replace the input embeddings with a random point sampled from a hyper-ellipse fitted to the ModernBERT-large input embeddings.
Run gradient descent as usual.
Inspect loss for convergence and input embedding L2 distances per sequence.

If loss converges and we again observe that the input embeddings do not move "very far" and "level off", this is good evidence for our hypothesis.
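The post does not spell out how the hyper-ellipse was fitted or sampled, so the following is only one plausible reading. It fits a mean and covariance to the embeddings, then draws points on the surface of the ellipsoid at a chosen Mahalanobis radius. The function name `sample_on_fitted_ellipsoid`, the `radius` parameter, and the toy data are all assumptions.

```python
import numpy as np

def sample_on_fitted_ellipsoid(embeddings, n_samples, radius=1.0, seed=0):
    """Draw points on the ellipsoid (x - mean)^T cov^-1 (x - mean) = radius^2."""
    rng = np.random.default_rng(seed)
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    chol = np.linalg.cholesky(cov)                      # cov = chol @ chol.T
    # Uniform random directions on the unit sphere...
    directions = rng.normal(size=(n_samples, embeddings.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # ...stretched and rotated onto the fitted ellipsoid.
    return mean + radius * directions @ chol.T

rng = np.random.default_rng(1)
# Toy stand-in for a vocabulary embedding matrix: 500 tokens, 8 dimensions.
toy_embeddings = rng.normal(size=(500, 8)) * np.linspace(0.5, 2.0, 8)
samples = sample_on_fitted_ellipsoid(toy_embeddings, n_samples=16)
print(samples.shape)
```

A different choice, such as sampling from a Gaussian with the same mean and covariance, would fill the interior rather than the surface. The text does not say which variant was used.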

Here are the results:

Again - we consistently converge, and not a single token moved enough to back-project to a new token.

This is strong evidence for our hypothesis.

If we can take random input embeddings and move a relatively small distance in input embeddings space and converge to an arbitrary distribution... this fits the description of a loss landscape in the over-parameterized regime perfectly.

Update: Follow-up with Meta-Llama/Llama-3.2-1B

The architecture of ModernBERT is divergent enough from "real" LLMs that it is worth seeing if these observations hold for a model in the Llama family.

Llama-3.2-1B was the obvious choice for me to keep things local.

Even so, this model is 7x the size of ModernBERT. Without any effort put into optimization, a batch of 64 took something like 11 seconds per step on my MacBook Air.

Optimization 1: No Hand-Rolled ADAM

I rolled my own ADAM optimizer - this was a mistake. I had good reasons at the time. But I should have found a way to make it work with the out-of-the-box optimizer.

The goal was to only apply updates to non-special tokens. The solution ended up being simple - selectively zero out the gradient after loss.backward() but before opt.step():

loss.backward()
inputs_embeds.grad.mul_(gradient_masks)  
opt.step()

Optimization 2: Ensuring All Tensors Live on MPS

Macs have a high performance backend for PyTorch called MPS.

My assumption was that PyTorch was automatically placing tensors on this "device". This assumption was incorrect.

This was a straightforward fix - define a device and ensure every tensor is assigned to it.

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Elsewhere... as an example
torch.tensor(args.inputs_embeds,device=device).requires_grad_(True)

Optimization 3: Using torch.amp.GradScaler and FP16 Precision

We can run the gradient calculation in FP16, but FP16 has more severe underflow issues than FP32. PyTorch provides a nice wrapper (torch.amp.GradScaler) that automatically scales up the loss to avoid gradient underflow:

        with torch.autocast("mps", dtype=torch.float16):
            logits = model(**args.tokenized_no_input_ids,
                        inputs_embeds=inputs_embeds).logits[:, -1, :]
            loss = F.kl_div(F.log_softmax(logits, -1), target_probabilities, reduction='batchmean')
            reg_loss = args.L1_embed_loss * inputs_embeds.abs().mean()
            loss += reg_loss
        scaler.scale(loss).backward()
        inputs_embeds.grad.mul_(gradient_masks)           # mask out specials
        scaler.step(opt)
        scaler.update()

Optimization 4: Turning off Gradients / Model Freezing

Since we are optimizing input embeddings, there's no need to store gradient calculations for the model weights. We can "freeze" the model after constructing the graph. This had the largest impact on training time.

    model.eval()                              # inference mode
    model.requires_grad_(False)               # turn off grads for every param

Setup

I sampled random input embeddings from a hyper-ellipse fitted to the vocabulary embeddings and assembled 64 "sentences" of varying lengths. I set up an ADAM optimizer to optimize token input embeddings to minimize KL divergence from the distribution produced by completing the sentence "The animal that says 'neigh' is a ". I masked out special tokens to prevent gradient updates on tokens like <|begin_of_text|>.

Results (Random Input Embeddings)

Just like with ModernBERT, training over 64 examples shows mean KL divergence approaching zero quickly - under 1e-2 in less than 1,000 iterations.

As with ModernBERT, the input embeddings only traveled a "short" distance. Here is a smoothed histogram of L2 distance traveled per token:

Compare this with a smoothed histogram of distances between 1000 pairs of randomly selected token input embeddings from the vocabulary:

The histogram of per-embedding-dimension displacement demonstrates just how little the individual dimensions "wiggled" (nearly all token dimensions moved less than 0.02 units) per token.

However, unlike our ModernBERT experiments, this was enough to move our random input embeddings to back-project to new tokens, as nonsensical as they were:

The reason for the back-projections changing is unclear - we did not see this behavior with ModernBERT, and the distance our embeddings travel here is demonstrably "small" compared to typical inter-token distance. Maybe tokens are distributed differently in the regions where "junk" tokens live?

Conclusions

The evidence we collected seems to support the hypothesis that the global loss minima manifold for Meta-Llama/Llama-3.2-1B is "close to everywhere" and easy to find through gradient descent.

My current belief is that the intelligence that arises in large-scale models through gradient descent is a combination of three factors:

A bias towards simplicity (created at least in part via regularization).
The properties of high dimensional space make it possible to "always improve".
Sequential layers lend themselves towards the creation of abstractions.

Our experiment only has property (2), and hence it is unsurprising that the input embeddings do not morph in "intelligent" ways - say, from "bark" to "neigh".

What we did recover was additional evidence in line with current thinking on loss landscapes, which is in itself valuable.

Thank you for coming on this journey with me!

Investigating the performance of np.einsum

Kyle Pena — Tue, 05 Nov 2024 19:27:25 +0000

TLDR; tensordot mode (and therefore BLAS) is only activated for two-argument einsum when optimize=True

As a reader of my last blog post pointed out to me, np.einsum is considerably slower than np.matmul unless you turn on the optimize flag in the parameters list: np.einsum(..., optimize = True).

Being somewhat skeptical, I fired up a Jupyter notebook and did some preliminary tests.

For matrix multiplication of two C-order (aka row-major order) matrices of varying dimensions, np.matmul is consistently about twenty times faster than np.einsum with optimize=False.

M1	M2	np.einsum	np.matmul	np.einsum / np.matmul
(100, 500)	(500, 100)	0.765	0.045	17.055
(100, 1000)	(1000, 100)	1.495	0.073	20.554
(100, 10000)	(10000, 100)	15.148	0.896	16.899

When optimize=True, np.einsum is much closer to np.matmul:

M1	M2	np.einsum	np.matmul	np.einsum / np.matmul
(100, 500)	(500, 100)	0.063	0.043	1.474
(100, 1000)	(1000, 100)	0.086	0.067	1.284
(100, 10000)	(10000, 100)	1.000	0.936	1.068
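For readers who want to rerun this kind of comparison, here is a minimal timing harness. It is a sketch, not the notebook used for the tables above: the `bench` helper, the repeat count, and the millisecond units are my own choices, and absolute numbers will depend on the machine and the BLAS library NumPy is linked against.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
m1 = rng.normal(size=(100, 500))
m2 = rng.normal(size=(500, 100))

def bench(fn, repeats=20):
    """Best-of-N wall-clock time in milliseconds for a zero-argument callable."""
    return min(timeit.repeat(fn, number=1, repeat=repeats)) * 1e3

timings_ms = {
    "einsum (optimize=False)": bench(lambda: np.einsum('ij,jk->ik', m1, m2, optimize=False)),
    "einsum (optimize=True)": bench(lambda: np.einsum('ij,jk->ik', m1, m2, optimize=True)),
    "matmul": bench(lambda: np.matmul(m1, m2)),
}
for name, ms in timings_ms.items():
    print(f"{name:<26} {ms:8.3f} ms")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the individual calls take well under a millisecond.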

Why?

This is mysterious - optimize=True is about finding an optimal contraction order of indices. That only matters when there are more than two operands. Yet our tests show a substantial speedup in the two-operand case.

By way of contrast, here's an example where optimize=True should make a difference - the chained dot product:

X = np.einsum('ij,jk,kl,lm->im',a,b,c,d)

We have four operands. Due to associativity, all of these expressions are mathematically equivalent:

<a,<b,<c,d>>>
<<a,b>,<c,d>>
<<<a,b>,c>,d>
<a,<<b,c>,d>>
<<a,<b,c>>,d>

...and optimize picks the one that is likely to result in a speedup.
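To see which contraction order gets picked, np.einsum_path reports the plan without running the contraction. The shapes below are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 64))
b = rng.normal(size=(64, 4))
c = rng.normal(size=(4, 32))
d = rng.normal(size=(32, 8))

# einsum_path returns the chosen contraction order plus a printable report.
path, report = np.einsum_path('ij,jk,kl,lm->im', a, b, c, d, optimize='optimal')
print(path)    # starts with 'einsum_path', followed by the operand pairs to contract
print(report)  # includes estimated FLOP counts for the naive and optimized orders

# The precomputed path can be passed back in; the result matches the naive chain.
X = np.einsum('ij,jk,kl,lm->im', a, b, c, d, optimize=path)
reference = a @ b @ c @ d
print(np.allclose(X, reference))
```

Reusing a precomputed path is handy when the same expression is evaluated many times with operands of the same shapes, since the search only has to happen once.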

Maybe optimize is doing more than optimizing contraction order? Perhaps it is somehow aware of row-major vs column-major layout of the matrices in memory, and somehow being proactive about it? Like, picking a different matrix multiplication algorithm entirely if the second argument is in row-major order?

But no - np.einsum is at least 1.5 times slower with a column major second argument.

M1	M2	np.einsum	np.matmul	np.einsum / np.matmul
(100, 500)	(500, 100)	1.486	0.056	26.541
(100, 1000)	(1000, 100)	3.885	0.125	31.070
(100, 10000)	(10000, 100)	49.669	1.047	47.444

Going Deeper

Here are the release notes of NumPy 1.12.0:

np.einsum now supports the optimize argument which will optimize the order of contraction. For example, np.einsum would complete the chain dot example np.einsum('ij,jk,kl->il', a, b, c) in a single pass which would scale like N^4; however, when optimize=True np.einsum will create an intermediate array to reduce this scaling to N^3 or effectively np.dot(a, b).dot(c). Usage of intermediate tensors to reduce scaling has been applied to the general einsum summation notation. See np.einsum_path for more details.

Later release notes indicate that np.einsum was upgraded to use tensordot (which itself uses BLAS where appropriate). Aha!

If we read def einsum(*operands, out=None, optimize=False, **kwargs) in numpy/numpy/_core/einsumfunc.py, we'll come to this early-out logic almost immediately:

    # If no optimization, run pure einsum
    if optimize is False:
        if specified_out:
            kwargs['out'] = out
        return c_einsum(*operands, **kwargs)

Does c_einsum utilize tensordot? I doubt it. Later on in the code, we see the tensordot call that the 1.14 notes seem to be referencing:

    # Start contraction loop
    for num, contraction in enumerate(contraction_list):

        ...

        # Call tensordot if still possible
        if blas:

            ...

            # Contract!
            new_view = tensordot(
                *tmp_operands, axes=(tuple(left_pos), tuple(right_pos))
            )

So, here's what's happening:

The contraction_list loop is executed if optimize is True - even in the trivial two operand case.
tensordot is only invoked in the contraction_list loop.
Therefore, we invoke tensordot (and therefore BLAS) when optimize is True.
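For the two-operand matrix product, the equivalent direct `np.tensordot` call is easy to write by hand. This is a small sketch with arbitrary shapes; it shows only that the results agree, not which internal code path NumPy takes.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 6))
B = rng.random((6, 3))

# Contract axis 1 of A with axis 0 of B, which is ordinary matrix multiplication.
via_einsum = np.einsum('ij,jk->ik', A, B, optimize=True)
via_tensordot = np.tensordot(A, B, axes=([1], [0]))

print(np.allclose(via_einsum, via_tensordot))
```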

To me, this seems like a bug. IMHO, the "early-out" at the beginning of np.einsum should still detect if the operands are tensordot-compatible and call out to tensordot if possible. Then, we would get the obvious BLAS speedups even when optimize is False. After all, the semantics of optimize pertain to contraction order, not to usage of BLAS, which I think should be a given.

The boon here is that persons invoking np.einsum for operations that are equivalent to a tensordot invocation would get the appropriate speedups, which makes np.einsum a bit less dangerous from a performance point of view.

How does c_einsum actually work?

I spelunked some C code to check it out. The heart of the implementation is here.

After a great deal of argument parsing and parameter preparation, the axis iteration order is determined and a special-purpose iterator is prepared. Each yield from the iterator represents a different way to stride over all operands simultaneously.

    /* Allocate the iterator */
    iter = NpyIter_AdvancedNew(nop+1, op, iter_flags, order, casting, op_flags,
                               op_dtypes, ndim_iter, op_axes, NULL, 0);

Assuming certain special-case optimizations don't apply, an appropriate sum-of-products (sop) function is determined based on the datatypes involved:

    sop = get_sum_of_products_function(nop,
                        NpyIter_GetDescrArray(iter)[0]->type_num,
                        NpyIter_GetDescrArray(iter)[0]->elsize,
                        fixed_strides);

And then, this sum-of-products (sop) operation is invoked on each multi-operand stride returned from the iterator, as seen here:

        iternext = NpyIter_GetIterNext(iter, NULL);
        if (iternext == NULL) {
            goto fail;
        }
        dataptr = NpyIter_GetDataPtrArray(iter);
        stride = NpyIter_GetInnerStrideArray(iter);
        countptr = NpyIter_GetInnerLoopSizePtr(iter);
        needs_api = NpyIter_IterationNeedsAPI(iter);

        NPY_BEGIN_THREADS_NDITER(iter);
        NPY_EINSUM_DBG_PRINT("Einsum loop\n");
        do {
            sop(nop, dataptr, stride, *countptr);
        } while (!(needs_api && PyErr_Occurred()) && iternext(iter));

That's my understanding of how einsum works, which is admittedly still a little thin - it really deserves more than the hour I've given it.

It does, however, confirm my suspicion that it acts like a generalized, gigabrain version of the grade-school method of matrix multiplication. Ultimately, it delegates out to a series of "sum of product" operations which rely on "striders" moving through the operands - not too different from what you do with your fingers when you learn matrix multiplication in school.
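As a mental model only (this is not NumPy's C code, and `naive_matmul` is a made-up name), the grade-school method can be written as one sum of products per output entry:

```python
import numpy as np

def naive_matmul(A, B):
    """Grade-school matrix multiplication: one sum of products per output entry."""
    I, J = A.shape
    J2, K = B.shape
    if J != J2:
        raise ValueError("inner dimensions must match")
    C = np.zeros((I, K))
    for i in range(I):
        for k in range(K):
            acc = 0.0
            # Stride along row i of A and column k of B at the same time.
            for j in range(J):
                acc += A[i, j] * B[j, k]
            C[i, k] = acc
    return C

rng = np.random.default_rng(0)
A = rng.random((3, 4))
B = rng.random((4, 2))
print(np.allclose(naive_matmul(A, B), np.einsum('ij,jk->ik', A, B)))
```

The real implementation generalizes the inner loop to arbitrary operands and strides, but the shape of the computation is the same.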

Summary

So why is np.einsum generally faster when you call it with optimize=True? There are two reasons.

The first (and original) reason is it tries to find an optimal contraction path. However, as I pointed out, that shouldn't matter when we have just two operands, as we do in our performance tests.

The second (and newer) reason is that when optimize=True, even in the two operand case it activates a codepath that calls tensordot where possible, which in turn tries to use BLAS. And BLAS is as optimized as matrix multiplication gets!

So, the two-operands speedup mystery is solved!

The Unreasonable Usefulness of numpy's einsum

Kyle Pena — Sun, 03 Nov 2024 03:27:34 +0000

Introduction

I'd like to introduce you to the most useful method in Python, np.einsum.

With np.einsum (and its counterparts in TensorFlow, PyTorch and JAX), you can write complicated matrix and tensor operations in an extremely clear and succinct way. I've also found that its clarity and succinctness relieve a lot of the mental overload that comes with working with tensors.

And it's actually fairly simple to learn and use. Here's how it works:

In np.einsum, you have a subscripts string argument and you have one or more operands:

numpy.einsum(subscripts : string, *operands : List[np.ndarray])

The subscripts argument is a "mini-language" that tells numpy how to manipulate and combine the axes of the operands. It's a little difficult to read at first, but it's not bad when you get the hang of it.

Single Operands

For a first example, let's use np.einsum to swap the axes of a matrix A (a.k.a. take its transpose):

M = np.einsum('ij->ji', A)

The letters i and j are bound to the first and second axes of A. Numpy binds letters to axes in the order they appear, but numpy doesn't care what letters you use if you are explicit. We could have used a and b, for example, and it works the same way:

M = np.einsum('ab->ba', A)

However, you must supply as many letters as there are axes in the operand. There are two axes in A, so you must supply two distinct letters. The next example won't work because the subscripts formula only has one letter to bind, i:

# broken
M = np.einsum('i->i', A)

On the other hand, if the operand does indeed have only one axis (i.o.w., it is a vector), then the single-letter subscript formula works just fine, although it isn't very useful because it leaves the vector a as-is:

m = np.einsum('i->i', a)
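Here's a quick check of the rules so far, using a small made-up 2x3 matrix: the transpose formula matches `A.T`, and a formula with too few letters is rejected with a `ValueError`.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # example matrix (assumption): [[0, 1, 2], [3, 4, 5]]

M = np.einsum('ij->ji', A)
print(M.shape)                   # the axes are swapped

# One letter cannot bind the two axes of A, so einsum refuses the formula.
try:
    np.einsum('i->i', A)
    raised = False
except ValueError as err:
    raised = True
    print("einsum rejected the formula:", err)
```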

Summing Over Axes

But what about this operation? There's no i on the right-hand-side. Is this valid?

c = np.einsum('i->', a)

Surprisingly, yes!

Here is the first key to understanding the essence of np.einsum: If an axis is omitted from the right-hand-side, then the axis is summed over.

Code:

c = 0
I = len(a)
for i in range(I):
   c += a[i]

The summing-over behavior isn't limited to a single axis. For example, you can sum over two axes at once by using this subscript formula: c = np.einsum('ij->', A).

Here is the corresponding Python code:

c = 0
I,J = A.shape
for i in range(I):
   for j in range(J):
      c += A[i,j]

But it doesn't stop there - we can get creative and sum some axes and leave others alone. For example: np.einsum('ij->i', A) sums over j, adding up the entries in each row of matrix A and leaving a vector of row sums of length I (the number of rows).

Code:

I,J = A.shape
r = np.zeros(I)
for i in range(I):
   for j in range(J):
      r[i] += A[i,j]

Likewise, np.einsum('ij->j', A) sums columns in A.

Code:

I,J = A.shape
r = np.zeros(J)
for i in range(I):
   for j in range(J):
      r[j] += A[i,j]
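Each of the loops above matches a familiar `sum` call with an `axis` argument. A small sketch using a made-up 3x4 matrix:

```python
import numpy as np

A = np.arange(12.0).reshape(3, 4)   # example matrix (assumption)

total = np.einsum('ij->', A)        # sum over both axes
row_sums = np.einsum('ij->i', A)    # sum over j: one value per row
col_sums = np.einsum('ij->j', A)    # sum over i: one value per column

print(np.isclose(total, A.sum()))
print(np.allclose(row_sums, A.sum(axis=1)))
print(np.allclose(col_sums, A.sum(axis=0)))
```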

Two Operands

There's a limit to what we can do with a single operand. Things get a lot more interesting (and useful) with two operands.

Let's suppose you have two vectors a = [a_1, a_2, ... ] and b = [b_1, b_2, ...].

If len(a) == len(b), we can compute the inner product (also called the dot product) like this:

a = np.asarray([4,5,6])
b = np.asarray([1,2,3])
c = np.einsum('i,i->', a, b)
>> c := 32

Two things are happening here simultaneously:

(1) Because i is bound to both a and b, a and b are "lined up" and then multiplied together: a[i] * b[i].
(2) Because the index i is excluded from the right-hand-side, axis i is summed over in order to eliminate it.

If you put (1) and (2) together, you get the classic inner product.

Code:

c = 0
I = len(a)
for i in range(I):
   c += a[i] * b[i]

Now, let's suppose that we didn't omit i from the right-hand-side of the subscript formula. Then we would multiply each a[i] by b[i], but not sum over i:

a = np.asarray([4,5,6])
b = np.asarray([1,2,3])
c = np.einsum('i,i->i', a, b)
>> c := np.asarray([4,10,18])

Code:

I = len(a)
c = np.zeros(I)
for i in range(I):
   c[i] = a[i] * b[i]

This is also called element-wise multiplication (or the Hadamard Product for matrices), and is typically done via the numpy method np.multiply.

Finally, let's suppose we included all the axes in the output - both i and j. This is called the outer product.

a = np.asarray([4,5,6])
b = np.asarray([1,2,3])
C = np.einsum('i,j->ij', a, b)
>> C := np.asarray([[4,8,12],[5,10,15],[6,12,18]])

In this subscript formula, the axes of a and b are bound to separate letters, and thus are treated as separate "loop variables". Therefore C has entries a[i] * b[j] for all i and j, arranged into a matrix.

Code:

I = len(a)
J = len(b)
C = np.zeros((I,J))
for i in range(I):
   for j in range(J):
      C[i,j] = a[i] * b[j]
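The three two-operand formulas each have a dedicated NumPy counterpart, which makes for an easy sanity check on the same vectors:

```python
import numpy as np

a = np.asarray([4, 5, 6])
b = np.asarray([1, 2, 3])

inner = np.einsum('i,i->', a, b)       # same letter, summed away
hadamard = np.einsum('i,i->i', a, b)   # same letter, kept
outer = np.einsum('i,j->ij', a, b)     # different letters, both kept

print(inner == np.dot(a, b))
print(np.array_equal(hadamard, np.multiply(a, b)))
print(np.array_equal(outer, np.outer(a, b)))
```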

Three Operands

Taking the outer product a step further, here's a three-operand version:

M = np.einsum('i,j,k->ijk', a, b, c)

The equivalent Python code for our three-operand outer product is:

I = len(a)
J = len(b)
K = len(c)
M = np.zeros((I,J,K))
for i in range(I):
   for j in range(J):
      for k in range(K):
         M[i,j,k] = a[i] * b[j] * c[k]

Going even further, there's nothing stopping us from omitting axes to sum over them in addition to transposing the result by writing ki instead of ik on the right-hand-side of ->:

M = np.einsum('i,j,k->ki', a, b, c)

The equivalent Python code would read:

I = len(a)
J = len(b)
K = len(c)
M = np.zeros((K,I))
for i in range(I):
   for j in range(J):
      for k in range(K):
         M[k,i] += a[i] * b[j] * c[k]
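Because b[j] doesn't depend on i or k, summing over j just multiplies every entry by the sum of b. That gives an independent check of the formula. The vectors below are made-up examples with deliberately different lengths, so the shape of the result is unambiguous.

```python
import numpy as np

a = np.asarray([4.0, 5.0, 6.0])          # length I = 3
b = np.asarray([1.0, 2.0])               # length J = 2
c = np.asarray([7.0, 8.0, 9.0, 10.0])    # length K = 4

M = np.einsum('i,j,k->ki', a, b, c)

# M[k, i] = sum_j a[i] * b[j] * c[k] = (sum of b) * c[k] * a[i]
expected = b.sum() * np.outer(c, a)

print(M.shape)
print(np.allclose(M, expected))
```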

Now I hope you can begin to see how you can specify complicated tensor operations rather easily. What's more, I can readily read off the above operation straight from the subscripts: "The outer product of three vectors, with the middle axis summed over, and the final result transposed". Pretty neat, but is this just academic? I don't think so.

A Practical Example

For a practical example, let's implement the equation at the heart of LLMs, from the classic paper "Attention is All You Need".

Eq. 1 describes the Attention Mechanism:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

We'll focus our attention on the term $QK^T$, because softmax isn't computable by np.einsum and the scaling factor $\frac{1}{\sqrt{d_k}}$ is trivial to apply.

The $QK^T$ term represents the dot products of m queries with n keys. Q is a collection of m d-dimensional row vectors stacked into a matrix, so Q has the shape md. Likewise, K is a collection of n d-dimensional row vectors stacked into a matrix, so K has the shape nd.

The product between a single Q and K would be written as:

np.einsum('md,nd->mn', Q, K)

Note that because of the way we wrote our subscripts equation, we avoided having to transpose K prior to matrix multiplication!
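A quick check that the subscripts do what we want, using made-up sizes for m, n and d: the einsum result should equal the explicit product of Q with the transpose of K.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 7, 16             # illustrative sizes (assumptions)
Q = rng.random((m, d))         # m queries, each d-dimensional
K = rng.random((n, d))         # n keys, each d-dimensional

scores = np.einsum('md,nd->mn', Q, K)

print(scores.shape)
print(np.allclose(scores, Q @ K.T))
```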

So, that seems pretty straightforward - in fact, it's just a traditional matrix multiplication. However, we're not done yet. Attention Is All You Need uses multi-head attention, which means we really have k such matrix multiplies happening simultaneously over an indexed collection of Q matrices and K matrices.

To make things a bit clearer, we might rewrite the product as $Q_iK_i^T$ .

That means we have an additional axis i for both Q and K.

And what's more, if we are in a training setting, we are probably executing a batch of such multi-headed attention operations.

So presumably we would want to perform the operation over a batch of examples along a batch axis b. Thus, the complete product would be something like:

batch_multihead_QKt = np.einsum('bimd,bind->bimn', Q, K, optimize = True)

I'm going to skip the diagram here because we're dealing with 4-axis tensors. But you might be able to picture "stacking" the earlier diagram to get our multi-head axis i, and then "stacking" those "stacks" to get our batch axis b.

It's difficult for me to see how we would implement such an operation with any combination of the other numpy methods. Yet, with a little bit of inspection, it's clear what's happening: Over a batch, over a collection of matrices Q and K, perform the matrix multiplication $QK^T$.

Now, isn't that wonderful?

Small Update

It was pointed out to me that the Attention example boils down to slicewise matmul, and that is supported by matmul out-of-the-box. Fair point! Originally, this post was using a different example that doesn't boil down to slicewise matmul, but it was far less topical than LLMs.
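For the record, here is a sketch of that slicewise-matmul equivalence. The batch size, head count and other dimensions are made-up values; `np.matmul` broadcasts over the leading batch and head axes, and `np.swapaxes` transposes the last two axes of K.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, m, n, d = 2, 3, 5, 7, 16        # illustrative sizes (assumptions)
Q = rng.random((B, H, m, d))
K = rng.random((B, H, n, d))

via_einsum = np.einsum('bimd,bind->bimn', Q, K, optimize=True)

# matmul treats the last two axes as matrices and broadcasts over the rest.
via_matmul = np.matmul(Q, np.swapaxes(K, -1, -2))

print(via_einsum.shape)
print(np.allclose(via_einsum, via_matmul))
```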

Also, it was pointed out to me that the performance of np.einsum greatly suffers unless you specifically set optimize=True in the parameter list. I investigated this topic in detail in my follow-up blog posts here.

Use Hardware Security Modules And Browser Integration To Fight Deepfakes

Kyle Pena — Fri, 01 Nov 2024 20:57:57 +0000

Introduction

Deepfakes are a massive problem without an effective solution.

Current approaches seem to fall into two categories: watermarking AI-generated media, and using AI models to detect deepfakes.

Both approaches have merits, but also some obvious drawbacks. Watermarks rely on deepfake purveyors acting in good faith and are easily removed. On the other hand, AI deepfake detectors are never going to be foolproof - in fact, adversarial training specifically seeks to defeat this modality.

We may yet end up in some kind of end-game where only the most well-resourced producers can make deep-fakes that are undetectable by the best detectors, but this is a risky gamble.

The scheme I propose is relatively simple:

Integrate Hardware Security Modules directly with camera sensors. HSMs are hardware cryptographic modules capable of directly capturing sensor data and generating a digital signature without ever going through a software layer. They are generally designed to be tamper resistant, and use a per-device certificate issued from a PKI root certificate for all similar devices.
Include the HSM-generated digital signature in the image metadata.
Develop and release a first-party, built-in browser feature which displays a tamper-proof mark of authenticity to the user when an image is authentic.
As the inevitable need arises, restrict the distribution of HSM-integrated cameras (or special, more tamper-resistant HSMs) to trusted and licensed organizations.

The problem with implementing the scheme - as you might see - isn't the technology.

It's the cooperation required between manufacturers, browsers, and the public key infrastructure. Like so many of the problems of the ever-expanding polycrisis, we currently lack the proper incentives to compel the necessary action.

I'll spend some time outlining a gradualist's approach to making it happen. In any case, I think it's possible, and will shortly become necessary anyway.

The Solution - In More Detail

There are three legs to the stool.

Trusted Devices With Integrated Hardware Security Modules (HSMs)
Public Key Infrastructure
Browser Integration

I am defining Trusted Devices as digital cameras with a Hardware Security Module (HSM) directly integrated with the camera sensor. When an image is captured by the camera sensor, the HSM uses a certificate from a "Trusted Device Root Certificate" (TDRC) to digitally sign the image.

In order to authenticate an image from a Trusted Device, the image's digital signature must be verified as having been produced by a non-revoked certificate deriving from the TDRC. This can be easily accomplished by leveraging Public Key Infrastructure and utilizing cryptographic standards already available in modern browsers.

Verifying an image establishes a chain of trust from the browser to the camera sensor itself, provided the user is willing to make the following assumptions:

The Trusted Device Root Certificate is not compromised.
The Trusted Device's individual certificate is not compromised.
The HSM of the Trusted Device has not been tampered with (HSM manufacturers already include a number of tamper-resistant designs).

Browser Iconography

Here's a tricky problem: How do we present a tamper-proof (i.e., CSS and browser extension resistant) "indicator of authenticity" to the user, that is immediately understandable and also un-spoofable?

The address bar seems like a good place to put it - that area is generally off limits to browser extensions. Perhaps when you hover over a verified image, display a checkmark and an icon-sized representation of the image (so you can't be fooled by overlapped images in CSS space).

I am tempted to propose more specific solutions, but I feel I would be out of my lane. Browsers employ talented UX people, and I don't want to poison the well with a bad suggestion. Suffice it to say that I think the problem is solvable, and would involve some combination of within-page indicators and out-of-page indicators. But perhaps there is another obvious solution which involves just one or neither? I would love to hear some of your ideas in the comments.

The Case For Limited Distribution

Should Trusted Devices be available for purchase by the consumer? My argument is that at first - absolutely. Having every device produce authentic images would familiarize the public with the concept and shove most of the toothpaste back into the tube for a short while.

However, the problems with consumer-grade authentic images will rapidly become apparent.

HSM-enabled Trusted Devices have an obvious griefing vector: Print out a deepfake, and take a photo of it with a Trusted Device. The photo is authentic, but the content is not.

Unfortunately, there's no way around it: Extending the chain of trust from the camera sensor to the depicted content itself requires trust in the operator. The trust in the operator has to come from somewhere and be justified by something.

To me, the only reasonable solution is to rely on the power of institutions. Human institutions have successfully controlled human behavior and provided reasonable guarantees for thousands of years - in a sense they are our most "reliable" technology.

As the problems of consumer-grade trusted devices become more obvious to the public at large, the natural need will arise for a different "class" of trusted device, and even a different "class" of trusted device user - all for critical purposes.

The Trusted Device Licensing Board

Acting largely as a bureaucratic extension of the technical capabilities of the Public Key Infrastructure, we could envision a board that would have the following capabilities:

The granting of a certificate derived from the Trusted Device Root Certificate.
The revocation of the certificate (and therefore the repudiation of all further images created by the Trusted Device owned by the licensee).
The licensing of a different grade of trusted device with special authority (not unlike the president's pen).

The board would derive its income through licensing fees, and would operate as an independent entity. It would have its own investigative powers to resolve claims of misconduct (similar to how the bar might investigate and remedy claims of legal malfeasance), which would act as a source of tension in the system to encourage honest behavior amongst the operators.

So Who Gets Licensed And Who Do They Work For?

Choosing who should get the devices is tricky (and fraught with danger). But I think it's easy to imagine a partial list of who ideally should be able to use them:

Clerks of congresses, governments and courts recording official proceedings.
Notaries, coroners, surveyors and other local officials.
Trusted journalists documenting proof of disasters, atrocities and war-crimes.
Licensed and accredited 3rd party agencies who provide photographic evidence for insurance claims and lawsuits as a service.

A follow-on question is whether these persons belong to the organizations they service, or are acting as external, 3rd party service vendors.

In my opinion, it is better for Trusted Device Licensees to act as independent, third party service vendors. The alternative comes with risks:

An internal Trusted Device Owner may feel pressured to misuse the device in order to keep their job, especially when it is in the institution's interest to do so (this is the classic multiple principal problem).
The existence of an independent body of licensees would reinforce the notion of the 'Trusted Device Owner' as a distinct role and entity, and therefore give the trustworthiness of the system a concrete referent to attach to.

Plus, the economics just work out better with an independent board and 3rd party licensees. Persons licensed by the board would derive their livelihood from providing verifiable photographic evidence as a third party service. Therefore, license holders would have a vested interest in acting honestly due to the threat of license revocation (which would come with a loss of income).

A Dose of Realism

To be transparent, I don't think it's reasonable to expect the creation of a licensing board ex nihilo.

You have to sell people on the tech first, allow the nature of the problem of misuse to be revealed over time, and then sell people on the idea of more restricted distribution from a more selective root certificate.

Here's how it would go:

1. Integrate HSMs into devices purchasable by the consumer.
2. Wait for the inevitable abuse of the system.
3. Create a second "class" of devices.
4. Repeat 1-3 until a licensing board becomes inevitable.

Multi-Level Certificates for Scalable Enforcement Against Misuse

Supposing we do end up with a selective licensing board, there's also the problem of scale.

In order for the benefits of Trusted Devices to be felt at multiple levels in society (down to the local level, for example), there has to be a certain level of hierarchical distribution and ownership, and therefore hierarchical levels of accountability for the misuse of devices.

Therefore, it may be worthwhile to create sub-certificates under the TDRC corresponding to the hierarchy of organizations owning the Trusted Devices. This would be paired with a policy which revokes the higher level certificate if enough abuses occur at the lower level. This aligns the interests of all the individual actors in the Trusted Device ecosystem with the greater public good of having a trustworthy system.

Quis custodiet ipsos custodes?

Old-fashioned corruption is also a possibility, as it is in any institution.

But if that is going to be our reason to say "no", what's our alternative here? I can easily envision a world in the not-too-distant future where:

Insurance adjustors will routinely 'deepfake' photographic evidence to reduce payouts on insurance claims
Photographic evidence will be inadmissible in court because any image could be a deepfake (and frequently is)
Despotic leaders will commit war crimes with impunity because any photographic evidence can be hand-waved away as a deepfake

I don't want to present a false dichotomy between this system and a complete breakdown of the value of digital media, but I have a strong intuition that I'm not too far off the mark.

My sincere hope is that this post might serve as a blueprint for when we inevitably will need to do something drastic about it anyway.

Capturing The Statistics of Streaming Data

Kyle Pena — Tue, 29 Oct 2024 01:54:12 +0000

The Rise (Return?) Of Specialized Computing Environments

Isn't it interesting that the key advancements in computing over the last 20 years have all been driven by some kind of specialized and limited computing environments? I'm thinking specifically of:

GPU computing
Blockchain VMs
Stateless functions in cloud computing environments
Distributed computing

There have been attempts to do "general purpose programming" in some of these environments (like CUDA for GPUs and using Rust to write programs for Solana), but the seams show. While it's getting better, there are still important limitations.

The reality is that ever since CPUs stopped getting appreciably better each year (and ever since society developed a keen interest in "world computers"), the cutting edge has been in these limited, special purpose computing environments. But it's not all bad news.

Here's the bright side for people that like math and algorithms. When you don't have GBs of RAM and fast and cheap disk access at your fingertips, you have to be careful, and algorithms matter again!

The "Streaming Context"

The common denominator between these limited computing environments is what I would call the "streaming context". In a streaming context, you see each observation in a dataset once, and never again. Keeping the entire dataset in memory or re-fetching from storage is out of the question, because the number of observations is potentially very large, in-process memory may be limited, and data storage / retrieval may be prohibitively costly.

What kinds of algorithms are useful in the streaming context? Almost by definition, any algorithm that can compute meaningful results in "one-pass".

A running total might be the simplest example of a one-pass algorithm. It's not too much more difficult to compute the mean in one pass. But I'd like to take the data-description direction as far as it can possibly go.

What if we could capture the entire data distribution in just one pass through the data, without actually storing the data?

The Statistics Bucket

Our goal is to create a "stats bucket" that gathers all sorts of useful information about a dataset - not just the usual sample statistics (mean, variance, skewness, ...), but also a detailed description of the shape of the distribution. It must gather this information in one pass over the data.

But the Stats Bucket should be update-able. When new data is streamed to the bucket, the sample statistics and the shape of the distribution are updated.

Let's Play A Game

To make the limitations involved a bit clearer and analogize the "streaming context", let's talk about it as if it were a game, called the "Number Reading Game". Here's how you play:

I begin reading a list of numbers to you. I'll read them to you one by one. You don't know how long the list of numbers is in advance, but it's going to be long. When I'm done, you have to tell me the mean. But here's the catch: You can't write down the numbers I've read.

No problem, right? All you have to do is keep track of the total and the count. There's no rules against that!

total = 0 
N = 0
for y in ys:
   total += y
   N += 1
μ = total / N

If you're not a fan of keeping track of a large sum in total, there's an alternative formula you can use. We'll call it the "mean update formula".

$$\mu^\prime = \mu + \frac{y - \mu}{n}$$

(Hint: To see why this works, multiply both sides by n and rewrite the terms with summations)

Did you notice something interesting about the mean update formula? It expresses the updated mean in terms of the current mean and the next observation. In mathematics, that's called a recurrence relation. Make a mental note of the idea, because we're going to be using oodles of those.
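Here is a minimal sketch of the mean update formula in code. The function name `running_mean` and the sample data are illustrative; n counts the observations seen so far, including the new one.

```python
def running_mean(ys):
    """One-pass mean using the update mean' = mean + (y - mean) / n."""
    mean = 0.0
    n = 0
    for y in ys:
        n += 1
        mean += (y - mean) / n
    return mean

ys = [2.0, 4.0, 4.0, 5.0, 9.0]        # example data (assumption)
print(running_mean(ys), sum(ys) / len(ys))
```

The two printed values should agree up to floating-point rounding, and the update never has to hold a large running total.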

Let's Play Another Game

Okay, maybe that was a bit too easy. Let's try something else. When I'm done reading the numbers, you have to tell me the variance $\sigma^2$ - a measure of how spread out the numbers are.

For a refresher, here's the formula for the (sample size corrected) variance $\sigma^2$, where μ is the mean of the data:

$$\sigma^2 = \frac{1}{N-1} \sum_i^N (y_i - \mu)^2$$

Here's some pseudo-code for computing the variance in the most straightforward way possible:

# First pass through `ys` - compute the mean
total = 0 
N = 0
for y in ys:
   total += y
   N += 1
μ = total / N


# Second pass through `ys` - calculate "2nd central moment"*
M2 = 0
for y in ys:
   M2 += (y - μ) ** 2

# Compute variance from `M2`
variance = M2  / (N - 1)

Uh-oh! There's two loops here - the first loop is computing the mean (μ), and the second loop computes M2*. That's two passes through the data, and that's against the rules!

* Please note that here and elsewhere I am calling quantities like $M_p$ the "central moments". Technically the central moments are $\frac{M_p}{N}$, not $M_p$. It's just less of a mouthful than saying 'the sum of the p'th power of the differences from the mean'.

Anyway, remember our recurrence relation for the mean? Wouldn't it be nice if there was a recurrence relation for the variance? Well lucky for us, there is! That means we can compute the variance in one pass.

The formula is as follows:

$$M_2^\prime = M_2 + (y - \mu)(y - \mu^\prime)$$

$$\sigma^{\prime 2} = \frac{M_2^\prime}{N-1}$$

(Psst... here's a derivation)

In the above formula, we use a recurrence relation to compute the updated second central moment $M_2^\prime$ based on the current second central moment $M_2$, and then divide the updated second central moment by (N-1) to get the updated variance $\sigma^{\prime 2}$.

You'll notice that the formula makes use of both the current and updated mean, μ and μ', so a complete code implementation would also need to include an implementation of the mean update formula from earlier.

Even so, the implementation is pretty straightforward:

M2 = 0
mean = 0
old_mean = 0
N = 0
for y in ys:
   N = N + 1
   old_mean = mean
   mean = old_mean + (y - mean) / N
   M2 = M2 + (y - mean) * (y - old_mean)
variance = M2 / (N-1)
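To gain some confidence in the one-pass version, it can be wrapped in a function and compared against the standard library's `statistics.variance`, which also uses the N-1 correction. The function name and sample data below are illustrative.

```python
import statistics

def one_pass_variance(ys):
    """Sample variance in a single pass, using the mean and M2 recurrences."""
    M2 = 0.0
    mean = 0.0
    N = 0
    for y in ys:
        N += 1
        old_mean = mean
        mean = old_mean + (y - old_mean) / N
        M2 += (y - mean) * (y - old_mean)
    return M2 / (N - 1)   # needs at least two observations

ys = [2.0, 4.0, 4.0, 5.0, 9.0]        # example data (assumption)
print(one_pass_variance(ys), statistics.variance(ys))
```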

Central Moments

As you might have guessed, if there is a second central moment there are first, third, fourth, fifth and so on central moments ( $M_1, M_3, M_4, M_5, ...$ ).

The general formula for central moments is:

$$M_p = \sum_i (y_i-\mu)^p$$

where p is the order of the moment. Again, I'm diverging from convention and calling these quantities the moments, even though the stats types will rightly point out I'm missing the division by N.

One common use for the third and second central moments is calculating the "skewness" of a distribution. Somewhat counterintuitively, the below distribution is "right skewed" even though the lump "leans left". That's because the bulk of the data (or its "mass") is located to the right of the peak.

In terms of the second $M_2$ and third central moment $M_3$ , the skewness is:

$$\frac{\sqrt{N} M_3}{M_2^{\frac{3}{2}}}$$
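Before worrying about doing this in one pass, here is a plain two-pass sketch of the skewness formula, assuming the whole dataset fits in memory. The function name and test data are made up; a symmetric dataset should give a skewness of zero, and one with a long right tail should give a positive value.

```python
from math import sqrt

def skewness(ys):
    """Skewness from the (un-normalized) second and third central moments."""
    N = len(ys)
    mu = sum(ys) / N
    M2 = sum((y - mu) ** 2 for y in ys)
    M3 = sum((y - mu) ** 3 for y in ys)
    return sqrt(N) * M3 / M2 ** 1.5

print(skewness([1.0, 2.0, 3.0, 4.0, 5.0]))   # symmetric data
print(skewness([1.0, 1.0, 1.0, 10.0]))       # long tail to the right
```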

But if we are going to play our number reading game with skewness, we would need a recurrence relation for the third central moment. And just to future-proof ourselves, while we're at it, why don't we work out a recurrence relation for arbitrary central moments?

Recurrence Relations for Arbitrary Central Moments

It's difficult to calculate the central moments of order p in a single pass for the same reason that p=2 and p=3 were difficult - the presence of μ in the middle of the equation:

$$M_p = \sum_i^N (y_i - \mu)^p$$

However, there have been several research papers dedicated to the incremental (or recursive) computation of central moments. More recent papers have adapted older approaches to address numerical foot-guns like catastrophic cancellation. In this post I am largely adapting this paper from the mid 2000's.

The authors describe pairwise update formulas for means and central moments. The idea is that in order to compute the central moments of a dataset $A \cup B$, you can combine the moments from datasets A and B according to their formula. And to compute the moments of, say, dataset A, you could divide the dataset further, and so on, until we reach the base case of single elements which have known moments.

This isn't the approach we want to take, but we can adapt it rather easily by taking A to be the current dataset, and B to be the set containing the single observation y, B = {y}.

Here is the pairwise formula, adapted from the paper:

$$M_p^{A \cup B} = \sum_{k=0}^p {p \choose k} \left[ M_{p-k}^A \left(\frac{-N_B}{N}\mu_{B,A}\right)^k + M_{p-k}^B \left(\frac{N_A}{N}\mu_{B,A}\right)^k \right]$$

Where $\mu_{B,A} = \mu_B - \mu_A$ and $N = N_A + N_B$ .

Now, let's adapt this formula to the incremental case, where B={y}. The formula simplifies quite a bit because all central moments of a singleton B={y} are zero except for $M_0 = 1$ .

So, in other words, the central moments of B when B is just the single element {y} are:

[1, 0, 0, 0, ...]

With that in mind, the term beginning with $M_{p-k}^B$ drops out of the summation except when p = k, and we are left with:

$$\text{Almost } M_p^\prime = \sum_{k=0}^p {p \choose k} { M_{p-k} \left(\frac{-1}{N}(y-\mu)\right)^k }$$

"Almost" because we are missing one final term when k = p (and therefore $M_{p-k}^B = M_0^B = 1$ ).

That simplifies to:

$$\text{Rest of } M_p^\prime = \left(\frac{N-1}{N}(y-\mu)\right)^p$$

Add "Almost" and "Rest of", and we've got the complete equation.
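Written out in one piece, with N the count after adding y and μ the mean before adding it, the update looks like the first line below. The p = 2 check underneath is my own working rather than something taken from the paper; it uses $M_0 = N-1$ and $M_1 = 0$ for the moments before the update.

```latex
M_p^\prime = \sum_{k=0}^p {p \choose k} M_{p-k} \left(\frac{-1}{N}(y-\mu)\right)^k + \left(\frac{N-1}{N}(y-\mu)\right)^p

% Sanity check for p = 2 (the k = 1 term vanishes because M_1 = 0):
M_2^\prime = M_2 + (N-1)\frac{(y-\mu)^2}{N^2} + \frac{(N-1)^2 (y-\mu)^2}{N^2}
           = M_2 + \frac{N-1}{N}(y-\mu)^2
```

Since $y - \mu^\prime = \frac{N-1}{N}(y-\mu)$ , that last expression is the same as $M_2 + (y-\mu^\prime)(y-\mu)$ , so the general formula reduces to the usual second-moment update.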
And now we're ready to start coding!

Yea! Enough Math, Show Me Some Code!

I've been a bit math-heavy, and that's because there's so much ground to cover. But the implementation is interesting as well - and not completely straightforward!

We would like to create a data structure that stores the bounds (minimum and maximum) as well as the mean and first 10 central moments. We will provide convenience methods that give the variance, standard deviation, skewness, and kurtosis.

An important note: In this implementation I am storing the "zero'th" moment (p = 0) in index 0 of self.moments. This makes the indexing of the array line up with the 1-based indexing in the papers, which is nice. For example, the "second central moment" is stored in self.moments[2] as opposed to the somewhat confusing self.moments[1]. What's more, because of how central moments are defined, self.moments[0] is always N, the count of the dataset. So, we get a "count tracker" for N for free.

import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StatsBucket:

    # The highest order of central moment to compute
    n_moments : int = 0

    # The central moments: index 0 holds the count N, index 1 holds the first central moment (always ~0)
    moments : List[float] = field(default_factory = list)

    # The mean - must be separately updated because we are storing central moments
    mean : Optional[float] = None

    # minimum and maximum - separately updated and useful for bounds
    minimum : Optional[float] = None
    maximum : Optional[float] = None

    def __init__(self, 
                 n_moments : int, 
                 moments : Optional[List[float]] = None, 
                 mean : Optional[float] = None):
        self.n_moments = n_moments
        if moments is not None:
            self.moments = moments
            self.mean = mean
        else:
            self.moments = [0] + [0]*(self.n_moments)

The implementations for the summary statistics are all straightforward and based on definitions you can find on Wikipedia:

    def n(self):
        return self.moments[0]

    def sample_mean(self):
        return self.mean

    def sample_variance(self):
        return self.moments[2]/self.n()

    def corrected_sample_variance(self):
        return self.moments[2]/(self.n()-1)

    def corrected_sample_stdev(self):
        return self.corrected_sample_variance()**0.5

    def sample_stdev(self):
        return self.sample_variance()**0.5

    def sample_skewness(self):
        return (self.moments[3]/self.n()) / ((self.moments[2]/self.n())**1.5)

    def sample_excess_kurtosis(self):
        return (self.moments[4]/self.n()) / (self.moments[2]/self.n())**2.0 - 3

Now, we'll need a few methods:

initialize(ys : List[float]) for initializing an empty StatsBucket with the observations ys and computing all the central moments and statistics in one pass.
update(self, y : float) for incorporating a new observation into an initialized StatsBucket instance and updating all relevant statistics and central moments.
combine(self, other) for computing the central moments of $A \cup B$ from the StatsBuckets representing sets A and B. The existing StatsBucket is updated.

Here is the formula for updating the moment of order p, where all current central moments are stored in the array Ms:

    @staticmethod
    def calculate_updated_moment(p : int, 
                                 Ms: List[float], 
                                 mean : float, 
                                 y : float) -> float:

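        s21 = y - mean   # new observation minus the current mean
        n = Ms[0] + 1    # count after adding y
        n1 = Ms[0]       # count before adding y (size of set A)
        n2 = 1           # size of set B = {y}

        Σ = 0
        for k in range(0,p+1):
            # the "Almost" part: contributions from the existing moments
            res = math.comb(p,k) * ((Ms[p-k] * (s21*(-n2/n))**k )) 
            Σ += res
        # the "Rest of" part: contribution from the single new observation
        Σ+= ((s21*n1/n)**p )
        return Σ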

Now, we would like to update all moments of order 0 through P. And presumably we will be doing this in a loop over a large number of observations, so we would like to avoid list allocations if possible.

So, it would be a good idea to update self.moments in place, through something like:

for i in range(p+1):
   self.moments[i] = ...calculate updated moment i...

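However, if you take a look at the recurrence relation, the formula for central moment p uses all central moments of order 0 through p. So if we update self.moments in-place in ascending order, the lower-order moments will already have been overwritten by the time the higher-order moments need their old values. We'll "clobber" our own inputs and get an incorrect result.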

So, we have to update the moments in reverse order. The correct loop looks like this:

for i in range(p,-1,-1):
   self.moments[i] = ...calculate updated moment i...
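To see the clobbering in action, here is a small self-contained sketch that runs both loop orders on the same starting moments. The helper `updated_moment` is my own condensed restatement of the update formula for this demo, and the tiny dataset is made up.

```python
import math
from typing import List

def updated_moment(p: int, Ms: List[float], mean: float, y: float) -> float:
    """Moment of order p after adding y; Ms must hold the OLD moments."""
    delta = y - mean
    n_old = Ms[0]
    n_new = n_old + 1
    total = sum(math.comb(p, k) * Ms[p - k] * (-delta / n_new) ** k
                for k in range(p + 1))
    return total + (delta * n_old / n_new) ** p

# Moments of the dataset [1.0, 3.0]: count 2, M1 = 0, M2 = 2, mean 2.0
old = [2.0, 0.0, 2.0]
mean, y = 2.0, 5.0

ascending = list(old)
for i in range(len(ascending)):                # wrong: the count is overwritten first
    ascending[i] = updated_moment(i, ascending, mean, y)

descending = list(old)
for i in range(len(descending) - 1, -1, -1):   # right: highest order first
    descending[i] = updated_moment(i, descending, mean, y)

print(ascending)    # M2 comes out wrong
print(descending)   # M2 of [1, 3, 5] should be 8
```

In the ascending loop, the count in index 0 has already been bumped by the time order 2 is computed, so the formula sees the wrong N and produces the wrong second moment.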

Here is the Python code, which updates Ms in-place and returns an updated mean, min_, and max_ back to the caller:

    @staticmethod
    def update_stats(Ms : List[float], 
                     mean : float, 
                     min_ : float, 
                     max_ : float, 
                     y : float):

        # backwards iteration here super important to avoid self-clobbering during computation
        P = len(Ms)-1
        for p in range(P,-1,-1):
            Ms[p] = StatsBucket.calculate_updated_moment(p, 
                                                         Ms, 
                                                         mean, 
                                                         y)

        # update mean
        n = Ms[0]
        mean = mean + (y-mean) / (n)

        # update bounds
        min_ = min(y,min_)
        max_ = max(y,max_)

        return mean, min_, max_
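For illustration, here is one way an initialize-style function could be layered on top of this update. This is a sketch under my own assumptions, not the class's actual method: the free functions `updated_moment`, `update_stats` and `initialize` are condensed stand-ins, and I am assuming an empty bucket starts with all-zero moments, a mean of 0.0, and infinite bounds so that the first observation needs no special case.

```python
import math
from typing import List, Sequence, Tuple

def updated_moment(p: int, Ms: List[float], mean: float, y: float) -> float:
    """Moment of order p after adding y; Ms and mean describe the data so far."""
    delta = y - mean
    n_old = Ms[0]
    n_new = n_old + 1
    total = sum(math.comb(p, k) * Ms[p - k] * (-delta / n_new) ** k
                for k in range(p + 1))
    return total + (delta * n_old / n_new) ** p

def update_stats(Ms: List[float], mean: float, min_: float, max_: float,
                 y: float) -> Tuple[float, float, float]:
    # Highest order first, so lower-order entries still hold their old values
    for p in range(len(Ms) - 1, -1, -1):
        Ms[p] = updated_moment(p, Ms, mean, y)
    mean = mean + (y - mean) / Ms[0]   # Ms[0] is now the updated count
    return mean, min(y, min_), max(y, max_)

def initialize(ys: Sequence[float],
               n_moments: int) -> Tuple[List[float], float, float, float]:
    # Assumed empty state: zero moments, zero mean, infinite bounds
    Ms = [0.0] * (n_moments + 1)
    mean, min_, max_ = 0.0, math.inf, -math.inf
    for y in ys:
        mean, min_, max_ = update_stats(Ms, mean, min_, max_, y)
    return Ms, mean, min_, max_

Ms, mean, lo, hi = initialize([1.0, 3.0, 5.0, 11.0], n_moments=3)
print(Ms, mean, lo, hi)
```

With the all-zero starting state, the first observation produces the singleton moments [1, 0, 0, ...] and a mean equal to that observation, which matches the base case described earlier.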

The implementation of initialize is essentially that of update. The combine method calls a static method which uses the pairwise formulas from the paper. M1s are the central moments from set A, M2s are the central moments from set B, and s21 is the mean of set B minus the mean of set A (the $\mu_{B,A}$ of the pairwise formula).

    @staticmethod
    def calculate_combined_moment(p : int, 
                                  M1s : List[float], 
                                  M2s: List[float], 
                                  s21 : float) -> float:

        n,n1,n2 = M1s[0] + M2s[0], M1s[0], M2s[0]

        Σ = 0
        for k in range(0,p+1):
            res = math.comb(p,k) * ((M1s[p-k] * (s21*(-n2/n))**k ) + (M2s[p-k] * (s21*n1/n)**k ))
            Σ += res

        return Σ
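Here is a hedged sketch of how a combine step could use this formula end to end. The free functions `moments_of`, `combined_moment` and `combine` are illustrative names of my own, and unlike the in-place method described above, this version returns a new list so there is no clobbering to worry about. I am also assuming the merged mean is the count-weighted average of the two means.

```python
import math
from typing import List, Sequence, Tuple

def moments_of(ys: Sequence[float], P: int) -> Tuple[List[float], float]:
    """Two-pass reference: un-normalized central moments 0..P and the mean."""
    mu = sum(ys) / len(ys)
    return [sum((y - mu) ** p for y in ys) for p in range(P + 1)], mu

def combined_moment(p: int, M1s: List[float], M2s: List[float],
                    s21: float) -> float:
    n1, n2 = M1s[0], M2s[0]
    n = n1 + n2
    return sum(math.comb(p, k) * (M1s[p - k] * (s21 * (-n2 / n)) ** k
                                  + M2s[p - k] * (s21 * n1 / n) ** k)
               for k in range(p + 1))

def combine(M1s: List[float], mean1: float,
            M2s: List[float], mean2: float) -> Tuple[List[float], float]:
    s21 = mean2 - mean1                      # mean of B minus mean of A
    n1, n2 = M1s[0], M2s[0]
    merged = [combined_moment(p, M1s, M2s, s21) for p in range(len(M1s))]
    mean = mean1 + s21 * n2 / (n1 + n2)      # count-weighted mean of A and B
    return merged, mean

A, B = [1.0, 3.0], [5.0, 7.0, 9.0]
MA, mean_a = moments_of(A, 3)
MB, mean_b = moments_of(B, 3)
merged, mean = combine(MA, mean_a, MB, mean_b)
expected, expected_mean = moments_of(A + B, 3)
print(merged, mean)
print(expected, expected_mean)
```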

Does It Work?

You bet it does. Let's create a StatsBucket that computes the first 10 central moments, and initialize it with 100 observations from a normal distribution:

bucket = StatsBucket(n_moments = 10)
ys = np.random.randn(100)
bucket.initialize(ys)

Then we'll print out various sample statistics and compare them to the corresponding calculations from numpy:

print(np.mean(ys),                '=', bucket.sample_mean())
print(np.var(ys, correction = 1), '=', bucket.corrected_sample_variance())
print(np.std(ys, correction = 1), '=', bucket.corrected_sample_stdev())
print(scipy.stats.skew(ys),       '=', bucket.sample_skewness())
print(scipy.stats.kurtosis(ys),   '=', bucket.sample_excess_kurtosis())

Here are the results - they all agree to at least the 11th decimal place:

0.04460728719077849 = 0.04460728719077859
0.9989105114779977 = 0.9989105114780001
0.9994551072849635 = 0.9994551072849647
0.04379407058692495 = 0.04379407058692474
-0.022496024676501136 = -0.022496024676511794

Now, let's take a look at the central moments. First, we can compute them in the traditional way:

expected_moments = []
for m in range(10+1):
    u = np.mean(ys)
    expected_moments.append((sum((y-u)**m for y in ys)))

And then we can test to see if they all agree to within 6 decimal places, which they do:

expected_moments = np.asarray(expected_moments)
stats_bucket_moments = np.asarray(bucket.moments)
abs_diff = np.abs(expected_moments - stats_bucket_moments)
print(np.all(abs_diff < 1e-6))

Result:

> True

Summary

In Part 1 of this blog post series, we showed how we can incrementally update arbitrary order central moments using a one-pass, in-place algorithm. In the next part we'll use the central moments to approximate the shape of the data distribution. We are well on our way to having a powerful one-pass data analyzer.

Stay tuned!

Final Note

A shoutout is deserved for John D. Cook, for making me aware that it was even possible to compute central moments through recurrence relations.

Derivation of Welford's Algorithm

Kyle Pena — Mon, 28 Oct 2024 18:07:06 +0000

This will be somewhat out of context if you're coming here first. It's really a footnote in a much longer series of blog posts on summarizing data distributions in computing environments where storage is at a premium.

In that blog post, I wanted to explain how to derive Welford's Algorithm for the recurrence relation for the second central moment, and found the existing explanations a little lacking (at least for me).

I found this helpful post to be a great starting point, but the algebra part of the post skipped over so many steps that I couldn't follow it.

So I worked it out in greater detail. I'm hoping this is helpful to other mere mortals who, like myself, couldn't quite connect the dots.

Notation:

y represents a new observation.
μ' is the updated mean (the mean after incorporating y).
$M_2$ is the second central moment - well, not quite. Technically you'd have to divide by N to get the central moment. But we'll call it the central moment here.
N is the number of observations including y.

First, some work on the mean update formula:

$$\mu^\prime = \mu + \frac{y - \mu}{N}$$

$$N\mu^\prime = N\mu + y - \mu$$

$$(N-1)\mu^\prime + \mu^\prime = (N-1)\mu + y$$

$$(N-1)\mu^\prime - (N-1)\mu = y - \mu^\prime$$

$$(N-1)\mu - (N-1)\mu^\prime = \mu^\prime - y$$
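A quick numeric spot check of those identities, using made-up numbers chosen so that the arithmetic is exact in floating point:

```python
ys = [2.0, 4.0, 9.0]

y = ys[-1]                           # the new observation
N = len(ys)                          # count including y
mu = sum(ys[:-1]) / (N - 1)          # mean before adding y
mu_new = mu + (y - mu) / N           # the mean update formula

print(mu_new, sum(ys) / N)           # both are the mean of all three values

lhs = (N - 1) * mu - (N - 1) * mu_new
rhs = mu_new - y
print(lhs, rhs)                      # the last identity: both sides agree
```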

Now, the main derivation:

$$M_2^\prime - M_2 = \sum_1^N{(y_i - \mu^\prime)^2} - \sum_1^{N-1}{(y_i - \mu)^2}$$

$$= (y - \mu^\prime)^2 + \sum_1^{N-1}{(y_i - \mu^\prime)^2 - (y_i - \mu)^2}$$

$$= (y - \mu^\prime)^2 + \sum_1^{N-1}{(y_i^2 - 2y_i\mu^\prime +\mu^{\prime 2}) - (y_i^2 - 2y_i\mu + \mu^2)}$$

$$= (y - \mu^\prime)^2 + \sum_1^{N-1}{-2y_i\mu^\prime + 2y_i\mu + (\mu^{\prime 2} - \mu^2)}$$

$$= (y - \mu^\prime)^2 + \sum_1^{N-1}{-2y_i(\mu^\prime - \mu) + (\mu^\prime - \mu)(\mu^\prime + \mu)}$$

$$= (y - \mu^\prime)^2 + \sum_1^{N-1}{(\mu^\prime - \mu)(-2y_i + \mu^\prime + \mu)}$$

$$= (y - \mu^\prime)^2 + (\mu - \mu^\prime) \sum_1^{N-1}{(2y_i - \mu^\prime - \mu)}$$

$$= (y - \mu^\prime)^2 + (\mu - \mu^\prime) [ 2(N-1)\mu - (N-1)\mu^\prime - (N-1)\mu ]$$

$$= (y - \mu^\prime)^2 + (\mu - \mu^\prime) [ (N-1)\mu - (N-1)\mu^\prime ]$$

Substitute the mean update term we worked out earlier.

$$= (y - \mu^\prime)^2 + (\mu - \mu^\prime) (\mu^\prime - y)$$

$$= (y - \mu^\prime)^2 + (\mu^\prime - \mu) (y - \mu^\prime)$$

$$= (y - \mu^\prime)(y - \mu^\prime + \mu^\prime - \mu)$$

$$= (y - \mu^\prime)(y - \mu)$$

This equals the difference $M_2^\prime - M_2$ , and so the complete recurrence relation for the second central moments is:

$$M_2^\prime = M_2 + (y - \mu^\prime)(y - \mu)$$

And thus the recurrence relation for the corrected sample variance based on the second central moment is:

$$\sigma^{\prime 2} = \frac{1}{N-1} [ M_2 + (y - \mu^\prime)(y - \mu) ]$$
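Translated into code, the two recurrences (mean update, then second-moment update) give a short single-pass routine. This is a minimal sketch; the function name `welford` and the sample data are my own choices for illustration.

```python
from typing import Iterable, Tuple

def welford(ys: Iterable[float]) -> Tuple[int, float, float]:
    """Single pass over ys, returning (N, mean, M2) with M2 un-normalized."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for y in ys:
        n += 1
        delta = y - mean          # y - mu   (uses the old mean)
        mean += delta / n         # mu' = mu + (y - mu) / N
        m2 += (y - mean) * delta  # M2' = M2 + (y - mu')(y - mu)
    return n, mean, m2

n, mean, m2 = welford([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean)            # the sample mean
print(m2 / (n - 1))    # the corrected sample variance
```

Starting from a mean of 0.0 and an $M_2$ of 0.0 works because the first observation sets the mean to itself and adds nothing to $M_2$ .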