From Basic to Fancy Indexing

Kurt — Mon, 27 May 2024 09:36:01 +0000

This post describes the myriad ways of indexing tensors in libraries like NumPy and PyTorch. We start from the basics: indexing with integers, identical to multi-dimensional array indexing in everyday programming languages. We add slicing, a terse language to select regular subsets of a tensor without copying its underlying data. Finally, advanced or fancy indexing is a NumPy feature that allows indexing tensors with other tensors. All these can be combined to slice and dice tensors in every way imaginable.

You’ll be slicing, dicing and chopping tensors like these veggies in no time. Photo by Sanket Shah on Unsplash

To explain, I'll give examples using Tensorken, the tensor library I'm developing in Rust. All code here works as of the v0.4 tag. I'll also dive into implementation details after the indexing semantics are clear.

Don't worry if the examples don't make sense yet - this is just to whet your appetite!

>>> let t = TrI::new(&[4, 3, 2], &(1..25).collect::<Vec<_>>())
┌                                                     ┐
│ ┌       ┐   ┌         ┐   ┌         ┐   ┌         ┐ │
│ │ 1   2 │   │ 7    8  │   │ 13   14 │   │ 19   20 │ │
│ │ 3   4 │   │ 9    10 │   │ 15   16 │   │ 21   22 │ │
│ │ 5   6 │   │ 11   12 │   │ 17   18 │   │ 23   24 │ │
│ └       ┘   └         ┘   └         ┘   └         ┘ │
└                                                     ┘

// Slicing allows taking regular subsets of a tensor
>>> t.vix3(1..3, 2.., ..)
┌                       ┐
│ [ 11  12]   [ 17  18] │
└                       ┘

// Fancy indexing allows indexing a tensor with another int tensor
>>> let i = TrI::new(&[2], &[2, 1])
[ 2  1]

>>> t.vix3(&i, &i, 1)
[ 18  10]

// Masking allows indexing a tensor with another bool tensor
>>> let b = TrB::new(&[4], &[false, false, true, false])
[ false  false  true  false]

>>> t.vix3(&b, &i, 1..)
┌               ┐
│ [ 18]   [ 16] │
└               ┘

Previously in Tensors From Scratch: tensor basics, GPU acceleration, and automatic differentiation

This post is the fourth in the Tensors from Scratch series. I try to make each as self-contained as possible, but if you have the time I recommend reading the first post in the series. It lays the necessary groundwork for this one. I'll recap the salient parts throughout. Here's an overview:

Fun and Hackable Tensors in Rust, From Scratch: Basic implementation of tensors on the CPU. Explains concepts like shape broadcasting, and describes essential tensor operations.
Massively Parallel Fun with GPUs: Accelerating Tensors in Rust: An implementation of the essential tensor operations from part 1, but this time on the GPU. Read this if you're interested in GPU computation and how it's different from working with the CPU.
Beyond Backpropagation - Higher Order, Forward and Reverse-mode Automatic Differentiation for Tensorken: Adding automatic differentiation to Tensorken, in a particularly flexible way that allows arbitrary combinations of forward and reverse AD up to any order. Read this if you're interested in how AD works.

Basic indexing

Let's start with the basics: indexing with a single integer index per dimension, exactly like n-dimensional array indexing in nearly all programming languages. Let's use the following 3-dimensional tensor as a running example:

>>> let t = TrI::new(&[4, 3, 2], &(1..25).collect::<Vec<_>>())
┌                                                     ┐
│ ┌       ┐   ┌         ┐   ┌         ┐   ┌         ┐ │
│ │ 1   2 │   │ 7    8  │   │ 13   14 │   │ 19   20 │ │
│ │ 3   4 │   │ 9    10 │   │ 15   16 │   │ 21   22 │ │
│ │ 5   6 │   │ 11   12 │   │ 17   18 │   │ 23   24 │ │
│ └       ┘   └         ┘   └         ┘   └         ┘ │
└                                                     ┘

The tensor t has shape [4, 3, 2]. We'll see the effect on the shape and result of each indexing example.

If we provide a non-negative integer index for all dimensions, we get the value back at that index:

>>> t.ix3(0, 1, 1).to_scalar()
[4, 3, 2] -> []
4

In the examples, the first line always shows how the input shape of the tensor t has changed. The following lines show the result. For technical reasons, Tensorken does not use actual indexing notation in Rust, but various functions. ix is one such function. Since Rust does not support variadic arguments, Tensorken's indexing functions are post-fixed with the number of arguments they take: ix1, ix2, and so on.

At the risk of being obvious, it's worth stating what has happened here: an integer index selects a row in the index's corresponding dimension, and the dimension is removed from the tensor.

I use the term row loosely: in a two-dimensional tensor, it could be either a row or a column. In more dimensions, it is a tensor with one fewer dimension than the original tensor.

We have three integer indexes here, one for each dimension, so the result is a scalar with (somewhat artificially) shape [], and the value is the element of t at the "coordinate" (0, 1, 1).
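To make the "coordinate" idea concrete, here's a minimal sketch of integer indexing, assuming the elements live in a flat buffer in row-major order. The helper names row_major_strides and element_at are made up for this illustration and are not Tensorken's API.

```rust
/// Strides for a row-major (C order) layout: the last dimension is contiguous.
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Look up one element by giving an integer index for every dimension.
/// (Hypothetical helper, not part of Tensorken.)
fn element_at(data: &[i32], shape: &[usize], index: &[usize]) -> i32 {
    assert_eq!(shape.len(), index.len(), "need one index per dimension");
    let strides = row_major_strides(shape);
    let mut offset = 0;
    for ((&i, &size), &stride) in index.iter().zip(shape).zip(&strides) {
        assert!(i < size, "index {} out of bounds for size {}", i, size);
        offset += i * stride;
    }
    data[offset]
}
```

For shape [4, 3, 2] the strides are [6, 2, 1], so the index (0, 1, 1) lands on offset 0*6 + 1*2 + 1*1 = 3 in the buffer 1, 2, ..., 24, which holds the value 4.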

And that's where some programming languages stop.

The first novelty of tensor libraries is that we don't have to provide an index for every dimension:

>>> t.ix1(1)
[4, 3, 2] -> [3, 2]
┌         ┐
│ 7    8  │
│ 9    10 │
│ 11   12 │
└         ┘

That selects index 1 in the first dimension and leaves the other dimensions unchanged. The result is a tensor with shape [3, 2]. We see that an integer index removes the dimension it selects from - we went from three to two dimensions.

A second novelty is that we don't need to index by counting from the start. Sometimes it's handy to start counting from the end of the dimension. Compare:

>>> t.ix3(t.shape()[0] - 1, t.shape()[1] - 1, t.shape()[2] - 1).to_scalar()
[4, 3, 2] -> []
24
>>> t.ix3(tl(0), tl(0), tl(0)).to_scalar()
[4, 3, 2] -> []
24

You may recognize the arr[arr.length - 1] pattern to get the last element of an array. The second example uses tl(0) instead, where tl stands for tail: tl(0) is like 0 but starts counting at the end of the dimension. So tl(0) selects the last element in a dimension, tl(1) the second-to-last, and so on. In fact, 0 is shorthand for hd(0) which stands for head. I could have written the first example as t.ix3(hd(0), hd(1), hd(1)).

In Python (and thus NumPy and PyTorch), tl(0), tl(1) are written as -1 and -2, which is shorter but has a displeasing asymmetry: the first element is 0, the last element is -1. Essentially counting from the start of the dimension is zero-based, and counting from the end is one-based. I'm biased, but in Python, I remember times when I got confused by the asymmetry. (See also https://github.com/rust-ndarray/ndarray/issues/435.)
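Here's a small sketch of how head- and tail-based single indexes could be resolved to an absolute position, next to the Python convention. The Pos enum and both functions are hypothetical stand-ins for illustration, not Tensorken's actual types, and they only cover single indexes, not range bounds.

```rust
/// Where an index counts from. Hypothetical stand-in for hd and tl.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Pos {
    Head(usize), // count from the start: Head(0) is the first element
    Tail(usize), // count from the end: Tail(0) is the last element
}

/// Resolve a position to an absolute zero-based index, or None if out of bounds.
fn resolve(pos: Pos, size: usize) -> Option<usize> {
    match pos {
        Pos::Head(i) if i < size => Some(i),
        Pos::Tail(i) if i < size => Some(size - 1 - i),
        _ => None,
    }
}

/// The same for a Python-style signed index, where -1 is the last element.
fn resolve_python(i: isize, size: usize) -> Option<usize> {
    let n = size as isize;
    let abs = if i < 0 { n + i } else { i };
    if (0..n).contains(&abs) {
        Some(abs as usize)
    } else {
        None
    }
}
```

Both conventions agree on the result: resolve(Pos::Tail(0), 4) and resolve_python(-1, 4) both give Some(3). The difference is only in the notation, where both Head and Tail start counting at zero.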

On to the next feature: we can leave dimensions unchanged by using .. (or : in Python). The earlier example t.ix1(1) was shorthand for t.ix3(1, .., ..). We can now index any subset of dimensions:

>>> t.ix3(.., .., 1)
[4, 3, 2] -> [4, 3]
┌              ┐
│ 2    4    6  │
│ 8    10   12 │
│ 14   16   18 │
│ 20   22   24 │
└              ┘

Which has shape [4, 3] because the first two dimensions are left unchanged, and the last dimension is removed by the index.

Additionally, there's a shorthand for consecutive ..: Ellipsis. In Python, this is represented by ..., but Rust doesn't have that symbol. The examples t.ix1(1) and t.ix3(.., .., 1) could also have been written as t.ix2(1, Ellipsis) and t.ix2(Ellipsis, 1).

An Ellipsis leaves all dimensions covered by the ellipsis unchanged. An ellipsis can cover zero or more dimensions, and there can be at most one in a given index operation. Otherwise, it would be ambiguous: in t.ix3(Ellipsis, 5, Ellipsis) it's unclear which dimension 5 applies to. In that case, you have to spell out the .. indexes explicitly. Besides being a bit easier to type, ellipses have the advantage that they make your tensor program more resilient to shape changes.
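The ellipsis rule can be sketched as a small normalization step that rewrites an index expression into one entry per dimension. This is an illustration under simplifying assumptions: the Ix enum and expand_ellipsis are hypothetical names, and NewAxis and ranges are left out.

```rust
#[derive(Clone, Debug, PartialEq)]
enum Ix {
    Int(usize), // a single integer index
    Full,       // `..`: keep the whole dimension
    Ellipsis,   // stands for zero or more `Full`
}

/// Replace the (at most one) Ellipsis by as many Full entries as needed
/// to cover `ndim` dimensions. Hypothetical helper, not Tensorken's API.
fn expand_ellipsis(index: &[Ix], ndim: usize) -> Result<Vec<Ix>, String> {
    let ellipses = index.iter().filter(|ix| **ix == Ix::Ellipsis).count();
    if ellipses > 1 {
        return Err("at most one Ellipsis is allowed".to_string());
    }
    let explicit = index.len() - ellipses;
    if explicit > ndim {
        return Err(format!("{} indexes for {} dimensions", explicit, ndim));
    }
    let mut out = Vec::with_capacity(ndim);
    for ix in index {
        if *ix == Ix::Ellipsis {
            out.extend(std::iter::repeat(Ix::Full).take(ndim - explicit));
        } else {
            out.push(ix.clone());
        }
    }
    // Missing trailing indexes are implicit `..`, as in t.ix1(1).
    while out.len() < ndim {
        out.push(Ix::Full);
    }
    Ok(out)
}
```

With ndim = 3, both [Int(1), Ellipsis] and [Int(1)] expand to [Int(1), Full, Full], while two ellipses in one expression are rejected. Because the expansion depends on ndim, the same expression keeps working when a dimension is added in the middle of the tensor's shape.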

So far, we can only select an entire dimension or a single row in the dimension. Via slicing, we can also express taking a "rectangular" (in n dimensions) subset:

>>> t.ix3(1..3, 1..2, ..)
[4, 3, 2] -> [2, 1, 2]
┌                      ┐
│ [ 9  10]   [ 15  16] │
└                      ┘

The .. ranges we saw earlier were slices. You can also give a lower bound (inclusive) and an upper bound (exclusive) to select just those rows. Such a range index selects a contiguous subset of the rows in its dimension.

A range index never removes dimensions, which means the only difference between a range index 1..2 and the integer index 1 is that the latter additionally removes the dimension of size 1.

You can omit the start or end from a range. If you do, they'll default to 0 and the dimension's size respectively:

>>> t.ix3(..3, 1.., ..)
[4, 3, 2] -> [3, 2, 2]
┌                                       ┐
│ ┌       ┐   ┌         ┐   ┌         ┐ │
│ │ 3   4 │   │ 9    10 │   │ 15   16 │ │
│ │ 5   6 │   │ 11   12 │   │ 17   18 │ │
│ └       ┘   └         ┘   └         ┘ │
└                                       ┘

You can also use the hd and tl functions to indicate whether you're counting from the start or the end:

>>> t.ix3(..tl(0), ..tl(1), ..)
[4, 3, 2] -> [4, 2, 2]
┌                                                    ┐
│ ┌       ┐   ┌        ┐   ┌         ┐   ┌         ┐ │
│ │ 1   2 │   │ 7   8  │   │ 13   14 │   │ 19   20 │ │
│ │ 3   4 │   │ 9   10 │   │ 15   16 │   │ 21   22 │ │
│ └       ┘   └        ┘   └         ┘   └         ┘ │
└                                                    ┘

A final index we can use is NewAxis. NewAxis inserts a new dimension with size 1:

>>> t.ix4(.., .., NewAxis, ..)
[4, 3, 2] -> [4, 3, 1, 2]
┌                                   ┐
│ [ 1  2]     [ 3  4]     [ 5  6]   │
│ [ 7  8]     [ 9  10]    [ 11  12] │
│ [ 13  14]   [ 15  16]   [ 17  18] │
│ [ 19  20]   [ 21  22]   [ 23  24] │
└                                   ┘

We can use as many NewAxis as we like.

Of course, we can combine all these features:

>>> t.ix4(0, 1.., NewAxis, ..tl(0))
[4, 3, 2] -> [2, 1, 1]
┌             ┐
│ [ 3]   [ 5] │
└             ┘

That concludes the tour of basic indexing. Basic indexing is relatively intuitive, albeit with some off-by-one madness in the mix, especially with inclusive-exclusive ranges and tl-based indexing.

The happy property of basic indexing is that it never needs to copy the tensor. Only a small section of memory containing the shape and strides needs to be updated, and the much larger buffer that contains the numbers is shared among the views you create with basic indexing.
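As a rough sketch of why no copy is needed, consider a view made of a shared buffer plus shape, strides and an offset. The View type below is an assumption for illustration only (Tensorken's real representation is discussed in the implementation part and may differ); each basic indexing operation only rewrites the metadata.

```rust
use std::rc::Rc;

/// A view: shared data plus the metadata that basic indexing rewrites.
#[derive(Clone, Debug)]
struct View {
    data: Rc<Vec<i32>>, // shared, never copied by the operations below
    shape: Vec<usize>,
    strides: Vec<usize>,
    offset: usize,
}

impl View {
    fn new(shape: &[usize], data: Vec<i32>) -> View {
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        View { data: Rc::new(data), shape: shape.to_vec(), strides, offset: 0 }
    }

    /// Integer index: move the offset and drop the dimension.
    fn index(&self, dim: usize, i: usize) -> View {
        assert!(i < self.shape[dim]);
        let mut v = self.clone(); // clones the Rc pointer and metadata only
        v.offset += i * v.strides[dim];
        v.shape.remove(dim);
        v.strides.remove(dim);
        v
    }

    /// Range index start..end: move the offset and shrink the dimension.
    fn slice(&self, dim: usize, start: usize, end: usize) -> View {
        assert!(start <= end && end <= self.shape[dim]);
        let mut v = self.clone();
        v.offset += start * v.strides[dim];
        v.shape[dim] = end - start;
        v
    }

    /// NewAxis: insert a dimension of size 1 (its stride is never used).
    fn new_axis(&self, dim: usize) -> View {
        let mut v = self.clone();
        v.shape.insert(dim, 1);
        v.strides.insert(dim, 0);
        v
    }

    /// Read all elements of the view in row-major order.
    fn to_vec(&self) -> Vec<i32> {
        let mut out = Vec::new();
        let mut idx = vec![0; self.shape.len()];
        let total: usize = self.shape.iter().product();
        for _ in 0..total {
            let off: usize = idx.iter().zip(&self.strides).map(|(i, s)| i * s).sum();
            out.push(self.data[self.offset + off]);
            // Advance the multi-index like an odometer.
            for d in (0..idx.len()).rev() {
                idx[d] += 1;
                if idx[d] < self.shape[d] {
                    break;
                }
                idx[d] = 0;
            }
        }
        out
    }
}
```

With t = View::new(&[4, 3, 2], (1..25).collect()), the call t.slice(0, 1, 3).slice(1, 1, 2) has shape [2, 1, 2] and reads 9, 10, 15, 16, mirroring t.ix3(1..3, 1..2, ..), while still pointing at the same buffer as t.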

The real brain-fuckery begins with so-called fancy indexing.

Fancy indexing

Fancy (or advanced) indexing allows you to use int or bool tensors as indexes.

Indexing with int tensors

Let's start with a one-dimensional int tensor as an index:

>>> let i = TrI::new(&[2], &[1, 2])
[ 1  2]

>>> t.oix1(&i)
[4, 3, 2] -> [2, 3, 2]
┌                           ┐
│ ┌         ┐   ┌         ┐ │
│ │ 7    8  │   │ 13   14 │ │
│ │ 9    10 │   │ 15   16 │ │
│ │ 11   12 │   │ 17   18 │ │
│ └         ┘   └         ┘ │
└                           ┘

To enable fancy indexing in addition to basic indexing, Tensorken makes you use the oix function, which stands for outer indexing. Because fancy indexing always copies the indexed tensor, and basic indexing never copies, it made sense to make the difference apparent in the API. (Python does not do this. It has more important things to worry about, I suppose.)

In this example, we can see that the elements of the indexing tensor i are interpreted as indexes in the first dimension of the indexed tensor t. It's like slicing, except we're not limited to taking contiguous sets of elements. Since i has size 2, the first dimension of the result also has size 2, and indexes 1 and 2 are selected.

This kind of indexing is like selection and permutation - we can change the order of the elements of t:

>>> let i = TrI::new(&[4], &[3, 0, 2, 1])
[ 3  0  2  1]

>>> t.oix1(&i)
[4, 3, 2] -> [4, 3, 2]
┌                                                     ┐
│ ┌         ┐   ┌       ┐   ┌         ┐   ┌         ┐ │
│ │ 19   20 │   │ 1   2 │   │ 13   14 │   │ 7    8  │ │
│ │ 21   22 │   │ 3   4 │   │ 15   16 │   │ 9    10 │ │
│ │ 23   24 │   │ 5   6 │   │ 17   18 │   │ 11   12 │ │
│ └         ┘   └       ┘   └         ┘   └         ┘ │
└                                                     ┘

The elements of i do not need to be unique - we can duplicate elements of t:

>>> let i = TrI::new(&[4], &[0, 0, 1, 1])
[ 0  0  1  1]

>>> t.oix1(&i)
[4, 3, 2] -> [4, 3, 2]
┌                                                   ┐
│ ┌       ┐   ┌       ┐   ┌         ┐   ┌         ┐ │
│ │ 1   2 │   │ 1   2 │   │ 7    8  │   │ 7    8  │ │
│ │ 3   4 │   │ 3   4 │   │ 9    10 │   │ 9    10 │ │
│ │ 5   6 │   │ 5   6 │   │ 11   12 │   │ 11   12 │ │
│ └       ┘   └       ┘   └         ┘   └         ┘ │
└                                                   ┘

Or increase the size of the indexed dimension:

>>> let i = TrI::new(&[5], &[1; 5])
[ 1  1  1  1  1]

>>> t.oix1(&i)
[4, 3, 2] -> [5, 3, 2]
┌                                                                     ┐
│ ┌         ┐   ┌         ┐   ┌         ┐   ┌         ┐   ┌         ┐ │
│ │ 7    8  │   │ 7    8  │   │ 7    8  │   │ 7    8  │   │ 7    8  │ │
│ │ 9    10 │   │ 9    10 │   │ 9    10 │   │ 9    10 │   │ 9    10 │ │
│ │ 11   12 │   │ 11   12 │   │ 11   12 │   │ 11   12 │   │ 11   12 │ │
│ └         ┘   └         ┘   └         ┘   └         ┘   └         ┘ │
└                                                                     ┘

We're not restricted to indexing with a 1-dimensional indexing tensor. Re-arranging things a bit:

>>> let i = TrI::new(&[2, 2], &[0, 1, 1, 0])
┌       ┐
│ 0   1 │
│ 1   0 │
└       ┘

>>> t.oix1(&i)
[4, 3, 2] -> [2, 2, 3, 2]
┌                           ┐
│ ┌       ┐     ┌         ┐ │
│ │ 1   2 │     │ 7    8  │ │
│ │ 3   4 │     │ 9    10 │ │
│ │ 5   6 │     │ 11   12 │ │
│ └       ┘     └         ┘ │
│ ┌         ┐   ┌       ┐   │
│ │ 7    8  │   │ 1   2 │   │
│ │ 9    10 │   │ 3   4 │   │
│ │ 11   12 │   │ 5   6 │   │
│ └         ┘   └       ┘   │
└                           ┘

In terms of the shape, the indexing tensor's shape replaces the indexed dimension, so we can increase the number of dimensions of t.

Finally, we can combine fancy indexing with slicing to rearrange any dimension:

>>> let i = TrI::new(&[2, 2], &[0, 1, 1, 0])
┌       ┐
│ 0   1 │
│ 1   0 │
└       ┘

>>> t.oix3(.., .., &i)
[4, 3, 2] -> [4, 3, 2, 2]
┌                                         ┐
│ ┌       ┐     ┌       ┐     ┌       ┐   │
│ │ 1   2 │     │ 3   4 │     │ 5   6 │   │
│ │ 2   1 │     │ 4   3 │     │ 6   5 │   │
│ └       ┘     └       ┘     └       ┘   │
│ ┌       ┐     ┌         ┐   ┌         ┐ │
│ │ 7   8 │     │ 9    10 │   │ 11   12 │ │
│ │ 8   7 │     │ 10   9  │   │ 12   11 │ │
│ └       ┘     └         ┘   └         ┘ │
│ ┌         ┐   ┌         ┐   ┌         ┐ │
│ │ 13   14 │   │ 15   16 │   │ 17   18 │ │
│ │ 14   13 │   │ 16   15 │   │ 18   17 │ │
│ └         ┘   └         ┘   └         ┘ │
│ ┌         ┐   ┌         ┐   ┌         ┐ │
│ │ 19   20 │   │ 21   22 │   │ 23   24 │ │
│ │ 20   19 │   │ 22   21 │   │ 24   23 │ │
│ └         ┘   └         ┘   └         ┘ │
└                                         ┘

To summarize our observations so far:

Based on the position of the indexing tensor i in the index expression, it is matched with a dimension in the indexed tensor t in exactly the same way as any other index expression in basic indexing.
The non-negative integer values in the indexing tensor are interpreted as indexes in the indexed tensor's corresponding dimension. These indexes select rows in that dimension.
The indexing tensor i can have an arbitrary shape, and this shape replaces the indexed tensor t's dimension.

Int tensor indexes create considerably more expressive power. Where basic indexing only reduces the size of the tensor and only selects regular parts of it, now we can almost arbitrarily extend or rearrange the tensor.
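The observations above can be sketched for the simplest case, an int tensor indexing the first dimension of a row-major buffer. The function index_first_dim is a hypothetical helper written for this illustration, not how Tensorken implements oix.

```rust
/// Index the first dimension of a row-major tensor with an int tensor.
/// `index` holds the index tensor's values in row-major order and
/// `index_shape` its shape. Assumes the indexed tensor has at least one dimension.
fn index_first_dim(
    data: &[i32],
    shape: &[usize],
    index: &[usize],
    index_shape: &[usize],
) -> (Vec<i32>, Vec<usize>) {
    assert_eq!(index.len(), index_shape.iter().product::<usize>());
    // Number of elements in one "row" of the first dimension.
    let row_len: usize = shape[1..].iter().product();
    let mut out = Vec::with_capacity(index.len() * row_len);
    for &i in index {
        assert!(i < shape[0], "index out of bounds");
        // Fancy indexing copies: each selected row is appended to a new buffer.
        out.extend_from_slice(&data[i * row_len..(i + 1) * row_len]);
    }
    // The index tensor's shape replaces the indexed dimension.
    let mut out_shape = index_shape.to_vec();
    out_shape.extend_from_slice(&shape[1..]);
    (out, out_shape)
}
```

For the running example, indexing with values [0, 1, 1, 0] and index shape [2, 2] returns shape [2, 2, 3, 2], and the copied buffer contains the rows 0, 1, 1, 0 of t back to back.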

Multiple fancy indexes

So far, we've only used a single int tensor as an index. What happens when we use multiple fancy indexers in the same indexing expression?

This is where things get mind-blowing. Tensorken implements two distinct ways of handling such cases, called outer indexing via oix and vectorized indexing via vix. The latter is the more powerful. Vectorized indexing can express everything outer indexing can and more. But it is also harder to understand, and the use cases where outer indexing is applicable are often easier to express with outer indexes than with vectorized indexes.

Since outer indexing is a relatively straightforward generalization of basic indexing with slices, we'll start with that and then work our way through vectorized indexing.

Outer indexing - an extension of slicing

Slicing in our running example:

>>> let t = TrI::new(&[4, 3, 2], &(1..25).collect::<Vec<_>>())
┌                                                     ┐
│ ┌       ┐   ┌         ┐   ┌         ┐   ┌         ┐ │
│ │ 1   2 │   │ 7    8  │   │ 13   14 │   │ 19   20 │ │
│ │ 3   4 │   │ 9    10 │   │ 15   16 │   │ 21   22 │ │
│ │ 5   6 │   │ 11   12 │   │ 17   18 │   │ 23   24 │ │
│ └       ┘   └         ┘   └         ┘   └         ┘ │
└                                                     ┘

>>> t.ix3(1..3, 1..2, ..)
[4, 3, 2] -> [2, 1, 2]
┌                      ┐
│ [ 9  10]   [ 15  16] │
└                      ┘

Now, if what we said above about fancy indexing makes sense, expanding the slices to int tensors should give the same result:

>>> let i1 = TrI::new(&[2], &[1, 2])
[ 1  2] // equivalent to slice 1..3

>>> let i2 = TrI::new(&[1], &[1])
[ 1]    // equivalent to slice 1..2

>>> let i3 = TrI::new(&[2], &[0, 1])
[ 0  1] // equivalent to slice ..

And that works!

>>> t.oix3(&i1, &i2, &i3)
[4, 3, 2] -> [2, 1, 2]
┌                      ┐
│ [ 9  10]   [ 15  16] │
└                      ┘

As before, the advantage of int tensors is that we are not restricted to regular slices - we can arbitrarily duplicate or rearrange elements. The following example keeps just the first and last element in each dimension, and reverses their order:

>>> let i1 = TrI::new(&[2], &[3, 0])
[ 3  0]

>>> let i2 = TrI::new(&[2], &[2, 0])
[ 2  0]

>>> let i3 = TrI::new(&[2], &[1, 0])
[ 1  0]

>>> t.oix3(&i1, &i2, &i3)
[4, 3, 2] -> [2, 2, 2]
┌                         ┐
│ ┌         ┐   ┌       ┐ │
│ │ 24   23 │   │ 6   5 │ │
│ │ 20   19 │   │ 2   1 │ │
│ └         ┘   └       ┘ │
└                         ┘

And we can still use multi-dimensional indexers to change the shape:

>>> let i1 = TrI::new(&[2, 2], &[3, 3, 0, 0])
┌       ┐
│ 3   3 │
│ 0   0 │
└       ┘

>>> let i2 = TrI::new(&[2], &[2, 0])
[ 2  0]

>>> let i3 = TrI::new(&[2, 2], &[1, 0, 1, 0])
┌       ┐
│ 1   0 │
│ 1   0 │
└       ┘

>>> t.oix3(&i1, &i2, &i3)
[4, 3, 2] -> [2, 2, 2, 2, 2]
┌                                                           ┐
│ ┌                           ┐   ┌                       ┐ │
│ │ ┌         ┐   ┌         ┐ │   │ ┌       ┐   ┌       ┐ │ │
│ │ │ 24   23 │   │ 20   19 │ │   │ │ 6   5 │   │ 2   1 │ │ │
│ │ │ 24   23 │   │ 20   19 │ │   │ │ 6   5 │   │ 2   1 │ │ │
│ │ └         ┘   └         ┘ │   │ └       ┘   └       ┘ │ │
│ │ ┌         ┐   ┌         ┐ │   │ ┌       ┐   ┌       ┐ │ │
│ │ │ 24   23 │   │ 20   19 │ │   │ │ 6   5 │   │ 2   1 │ │ │
│ │ │ 24   23 │   │ 20   19 │ │   │ │ 6   5 │   │ 2   1 │ │ │
│ │ └         ┘   └         ┘ │   │ └       ┘   └       ┘ │ │
│ └                           ┘   └                       ┘ │
└                                                           ┘

Pretty crazy stuff.

What's crazier is that this is not what NumPy does. From the NumPy docs:

Advanced (fancy -ed) indices always are broadcast and iterated as one

What the hell does that mean?

Vectorized indexing - the NumPy way

I'll first give a quick refresher on broadcasting, which you can skip if you're familiar with it, and then move on to vectorized indexing.

Broadcasting refresher

Broadcasting refers to what tensor libraries do when they apply binary element-wise operations to two tensors with different shapes.

For example, it's clear what to do if you want to element-wise multiply a tensor with shape [4, 3] with another tensor of the same shape: you multiply each element in the left tensor with the corresponding element in the right tensor. The shape of the result is unchanged at [4, 3]. Likewise, it's intuitive what happens when you want to multiply a singleton tensor of shape [1] with any other tensor: multiply the single element on the left with every element in the tensor on the right. The resulting shape is whatever the shape of the tensor on the right is.

Broadcasting formalizes and generalizes these cases, based on two rules that are applied when two tensors are not the same shape:

If the number of dimensions differs, add dimensions of size 1 at the start of the shape of the tensor with fewer dimensions.
Looking at the dimension sizes in pairs, if the sizes are the same or one of them is size 1, then the tensors can be broadcasted together. The resulting shape is the pairwise maximum of the input shapes.

A nice way to write this is to right-align the shapes:

lhs [3, 4, 5]
rhs [4, 5]

// rule 1
lhs [3, 4, 5]
rhs [1, 4, 5]

// rule 2
result [3, 4, 5]

Align the input tensors on the right, and add 1s to the front. Now in each column, there must be a 1, or the sizes must be equal. If so, the resulting shape is just the maximum of each pair. Otherwise, the shapes are not broadcast-able.
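The two rules translate almost directly into code. Here's a minimal sketch of a shape-only broadcast function, written for this illustration; it is not taken from Tensorken.

```rust
/// Broadcast two shapes together, or return None if they are incompatible.
fn broadcast_shape(lhs: &[usize], rhs: &[usize]) -> Option<Vec<usize>> {
    let ndim = lhs.len().max(rhs.len());
    let mut result = Vec::with_capacity(ndim);
    for i in 0..ndim {
        // Rule 1: right-align, padding the shorter shape with 1s at the front.
        let l = if i + lhs.len() >= ndim { lhs[i + lhs.len() - ndim] } else { 1 };
        let r = if i + rhs.len() >= ndim { rhs[i + rhs.len() - ndim] } else { 1 };
        // Rule 2: sizes must match, or one of them must be 1.
        if l != r && l != 1 && r != 1 {
            return None;
        }
        // Take the size that is not 1, which for non-empty dimensions
        // is the same as the pairwise maximum.
        result.push(if l == 1 { r } else { l });
    }
    Some(result)
}
```

For the example above, broadcast_shape(&[3, 4, 5], &[4, 5]) gives Some([3, 4, 5]), and broadcast_shape(&[3], &[4]) gives None because neither size is 1.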

From outer to vectorized indexing

Let's compare the outputs of vectorized with outer indexing.

>>> let t = TrI::new(&[3, 3], &(1..10).collect::<Vec<_>>())
┌           ┐
│ 1   2   3 │
│ 4   5   6 │
│ 7   8   9 │
└           ┘

>>> let i1 = TrI::new(&[2], &[0, 2])
[ 0  2]

>>> let i2 = TrI::new(&[2], &[0, 2])
[ 0  2]

We'll index the 3-by-3 matrix t with i1 and i2. Both index tensors select the first and last element in their respective dimension. With oix, we get the four corners of the matrix:

>>> t.oix2(&i1, &i2)
[3, 3] -> [2, 2]
┌       ┐
│ 1   3 │
│ 7   9 │
└       ┘

With vix however:

>>> t.vix2(&i1, &i2)
[3, 3] -> [2]
[ 1  9]

What gives?

One way to think about this is in terms of coordinates. For outer indexing, we're selecting the coordinates in the matrix t as follows:

┌                 ┐
│ (0, 0)   (0, 2) │
│ (2, 0)   (2, 2) │
└                 ┘

Which is, in row-major order, the cartesian product of the indexes in both tensors [0, 2] × [0, 2].

For vectorized indexing, the coordinates end up being:

[ (0, 0)  (2, 2)]

That's what "broadcast and iterated as one" means: the indexes are zipped together one element at a time, subject to the rules of broadcasting, and then used as if they were one combined index.
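The difference between the two schemes comes down to how coordinates are generated. The sketch below covers only two 1-dimensional indexes into a row-major matrix; the function names are hypothetical and this is not how Tensorken implements oix or vix.

```rust
/// Outer indexing: every combination of the two index lists (cartesian product).
fn outer_coords(i1: &[usize], i2: &[usize]) -> Vec<(usize, usize)> {
    let mut coords = Vec::new();
    for &a in i1 {
        for &b in i2 {
            coords.push((a, b));
        }
    }
    coords
}

/// Vectorized indexing for two 1-dimensional indexes: zip them element by element.
/// A length-1 index is broadcast against the other; otherwise lengths must match.
fn vectorized_coords(i1: &[usize], i2: &[usize]) -> Option<Vec<(usize, usize)>> {
    let n = match (i1.len(), i2.len()) {
        (a, b) if a == b => a,
        (1, b) => b,
        (a, 1) => a,
        _ => return None,
    };
    let pick = |ix: &[usize], k: usize| if ix.len() == 1 { ix[0] } else { ix[k] };
    Some((0..n).map(|k| (pick(i1, k), pick(i2, k))).collect())
}

/// Gather the selected elements of a row-major matrix with `cols` columns.
fn gather(data: &[i32], cols: usize, coords: &[(usize, usize)]) -> Vec<i32> {
    coords.iter().map(|&(r, c)| data[r * cols + c]).collect()
}
```

On the 3-by-3 matrix holding 1 to 9, the outer coordinates for [0, 2] and [0, 2] pick 1, 3, 7, 9, while the vectorized coordinates pick only 1 and 9.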

Note that this also illustrates the greater expressive power of vectorized indexing: it's not possible to get just the upper left and lower right corners of a matrix using outer indexing. You have to get all four of them. Using vectorized indexing, you can still get the four corners, but you have to work a bit harder for it:

>>> let i1 = TrI::new(&[2, 2], &[0, 0, 2, 2])
┌       ┐
│ 0   0 │
│ 2   2 │
└       ┘

>>> let i2 = TrI::new(&[2, 2], &[0, 2, 0, 2])
┌       ┐
│ 0   2 │
│ 0   2 │
└       ┘

>>> t.vix2(&i1, &i2)
[3, 3] -> [2, 2]
┌       ┐
│ 1   3 │
│ 7   9 │
└       ┘

If you use vectorized indexing with two or more tensors that can't be broadcasted together, you'll get a panic in Tensorken and an exception in NumPy.

One ambiguity with vectorized indexing remains. Let's return to our original running example with 3 dimensions.

>>> let t = TrI::new(&[4, 3, 2], &(1..25).collect::<Vec<_>>())
┌                                                     ┐
│ ┌       ┐   ┌         ┐   ┌         ┐   ┌         ┐ │
│ │ 1   2 │   │ 7    8  │   │ 13   14 │   │ 19   20 │ │
│ │ 3   4 │   │ 9    10 │   │ 15   16 │   │ 21   22 │ │
│ │ 5   6 │   │ 11   12 │   │ 17   18 │   │ 23   24 │ │
│ └       ┘   └         ┘   └         ┘   └         ┘ │
└                                                     ┘

>>> let i1 = TrI::new(&[2], &[0, 2])
[ 0  2]

>>> let i2 = TrI::new(&[2], &[0, 1])
[ 0  1]

>>> t.vix3(&i1, .., &i2)
[4, 3, 2] -> ???

If we index the first and the third dimensions, and with vectorized indexing we broadcast the tensors together, how do we construct the shape of the resulting tensor? With outer indexing, because all the indexers are independent, we can insert the indexing tensor's shape in the original shape at the position where the indexing tensor appears:

>>> t.oix3(&i1, .., &i2)
[4, 3, 2] -> [2, 3, 2]

The first and last dimensions are indexed by a tensor with size 2, so both their sizes become 2 in the result. The second dimension is untouched so its size is unchanged.

But for vectorized indexing we end up with a shape [2], because broadcast_shape([2], [2]) == [2]. We still need to leave the second dimension untouched...so do we end up with shape [2, 3] or shape [3, 2]?

Here's where Tensorken diverges from NumPy. Tensorken always inserts the indexed dimensions at the front. So the result is:

>>> t.vix3(&i1, .., &i2)
[4, 3, 2] -> [2, 3]
┌              ┐
│ 1    3    5  │
│ 14   16   18 │
└              ┘

NumPy, on the other hand, has some complicated rules around this - if all the indexed dimensions are consecutive, then it inserts the broadcasted dimensions there. Otherwise, it inserts them at the front like Tensorken. I didn't think this adds much value - it's easy to transpose the result if you need a different shape, and the NumPy rules are often confusing.

Quite a lot to get your head around initially - I recommend starting with only thinking about the shapes first. Try to predict the shape of the result based on the indexed tensor, the kind of indexing, and the indexing tensors. Here are a few examples with minimal explanation to get you going:

let arr = CpuI32::ones(&[5, 6, 7, 8]);
let i1 = &CpuI32::new(&[1], &[0]);
let i2 = &CpuI32::new(&[2], &[0, 1]);

let r = arr.oix4(.., i1, i2, ..);
assert_eq!(r.shape(), &[5, 1, 2, 8]);

let r = arr.oix4(.., i1, .., i2);
assert_eq!(r.shape(), &[5, 1, 7, 2]);

let r = arr.vix4(.., i1, .., i2);
assert_eq!(r.shape(), &[2, 5, 7]);

let r = arr.vix4(.., i1, 0, ..);
assert_eq!(r.shape(), &[1, 5, 8]);

let r = arr.vix4(.., i1, .., 0);
assert_eq!(r.shape(), &[1, 5, 7]);
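If you want to check your predictions, the shape rules described in this post can be sketched as two small functions that only look at shapes. The ShapeIx enum and the function names are assumptions for this illustration; ranges and NewAxis are left out, and the vectorized rule is Tensorken's "always at the front" rule, not NumPy's.

```rust
/// One entry of an index expression, reduced to what matters for the shape.
enum ShapeIx {
    Full,               // `..`: dimension kept as is
    Int,                // integer index: dimension removed
    Tensor(Vec<usize>), // int tensor index with the given shape
}

/// Shape-only broadcasting (right-align, sizes must match or be 1).
fn broadcast(lhs: &[usize], rhs: &[usize]) -> Option<Vec<usize>> {
    let ndim = lhs.len().max(rhs.len());
    let mut out = Vec::with_capacity(ndim);
    for i in 0..ndim {
        let l = if i + lhs.len() >= ndim { lhs[i + lhs.len() - ndim] } else { 1 };
        let r = if i + rhs.len() >= ndim { rhs[i + rhs.len() - ndim] } else { 1 };
        if l != r && l != 1 && r != 1 {
            return None;
        }
        out.push(if l == 1 { r } else { l });
    }
    Some(out)
}

/// Outer indexing: each index rewrites its own dimension in place.
fn oix_shape(shape: &[usize], index: &[ShapeIx]) -> Vec<usize> {
    assert_eq!(shape.len(), index.len());
    let mut out = Vec::new();
    for (&size, ix) in shape.iter().zip(index) {
        match ix {
            ShapeIx::Full => out.push(size),
            ShapeIx::Int => {}
            ShapeIx::Tensor(s) => out.extend_from_slice(s),
        }
    }
    out
}

/// Vectorized indexing as described in this post: broadcast all tensor
/// indexes together, put that shape at the front, then append the
/// untouched dimensions. Returns None if the indexes don't broadcast.
fn vix_shape(shape: &[usize], index: &[ShapeIx]) -> Option<Vec<usize>> {
    assert_eq!(shape.len(), index.len());
    let mut fancy: Option<Vec<usize>> = None;
    let mut rest = Vec::new();
    for (&size, ix) in shape.iter().zip(index) {
        match ix {
            ShapeIx::Full => rest.push(size),
            ShapeIx::Int => {}
            ShapeIx::Tensor(s) => {
                fancy = Some(match fancy {
                    None => s.clone(),
                    Some(f) => broadcast(&f, s)?,
                });
            }
        }
    }
    let mut out = fancy.unwrap_or_default();
    out.extend(rest);
    Some(out)
}
```

For the [5, 6, 7, 8] tensor above, oix_shape with [Full, Tensor([1]), Tensor([2]), Full] gives [5, 1, 2, 8], and vix_shape with [Full, Tensor([1]), Int, Full] gives Some([1, 5, 8]).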

Indexing with masks

The last possible indexing type is with boolean tensors, commonly called masks.

Starting with one-dimensional indexers:

>>> let t = TrI::new(&[4, 3, 2], &(1..25).collect::<Vec<_>>())
┌                                                     ┐
│ ┌       ┐   ┌         ┐   ┌         ┐   ┌         ┐ │
│ │ 1   2 │   │ 7    8  │   │ 13   14 │   │ 19   20 │ │
│ │ 3   4 │   │ 9    10 │   │ 15   16 │   │ 21   22 │ │
│ │ 5   6 │   │ 11   12 │   │ 17   18 │   │ 23   24 │ │
│ └       ┘   └         ┘   └         ┘   └         ┘ │
└                                                     ┘

>>> let i = TrB::new(&[4], &[false, false, true, false])
[ false  false  true  false]

>>> t.oix1(&i)
[4, 3, 2] -> [1, 3, 2]
┌             ┐
│ ┌         ┐ │
│ │ 13   14 │ │
│ │ 15   16 │ │
│ │ 17   18 │ │
│ └         ┘ │
└             ┘

The effect is unsurprising in this case: the elements with a true mask are kept, the others are discarded. The size of the indexed dimension in the result is equal to or smaller than the size in the original tensor t. In this example, we select only one row, because there is only a single true in i.

An important difference with int tensor indexers is that the size of the indexed dimensions can only stay the same - if all mask values are true - or decrease.

Another difference becomes apparent when we index with multi-dimensional masks.

>>> let i = TrB::new(&[3, 2], &[false, false, true, false, true, true])
┌               ┐
│ false   false │
│ true    false │
│ true    true  │
└               ┘

>>> t.oix2(.., &i)
[4, 3, 2] -> [4, 3]
┌              ┐
│ 3    5    6  │
│ 9    11   12 │
│ 15   17   18 │
│ 21   23   24 │
└              ┘

We indexed the last two dimensions of t with a mask of shape [3, 2] - the same shape as those two dimensions. Since the mask contains three true values, those two dimensions are removed and replaced with a single dimension of size 3. In fact, we have to flatten the selected elements into one dimension: with int indexing the indexing shape is always a "rectangle" (an n-dimensional box in more dimensions), but with bool tensors the true values can make any jagged shape we like. In the example, we select the three elements in the lower left corner:

┌         ┐
│ true    │
│ true    true  │
└               ┘

While this makes for some interesting ASCII art, that is not a valid tensor.
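One way to picture the flattening is to turn the mask into a list of coordinates of its true entries, in row-major order, and gather those. The sketch below does this for a 2-dimensional mask applied to the last two dimensions of a 3-dimensional row-major buffer; the function names are hypothetical and this is not necessarily how Tensorken implements masks.

```rust
/// Row-major coordinates of the `true` entries of a 2-dimensional mask.
fn true_coords(mask: &[bool], rows: usize, cols: usize) -> Vec<(usize, usize)> {
    assert_eq!(mask.len(), rows * cols);
    (0..rows * cols)
        .filter(|&k| mask[k])
        .map(|k| (k / cols, k % cols))
        .collect()
}

/// Apply a [rows, cols] mask to the last two dimensions of a row-major tensor
/// of shape [n, rows, cols]. Returns the copied data and the new shape [n, count].
/// Assumes rows * cols > 0.
fn mask_last_two(
    data: &[i32],
    n: usize,
    mask: &[bool],
    rows: usize,
    cols: usize,
) -> (Vec<i32>, Vec<usize>) {
    assert_eq!(data.len(), n * rows * cols);
    let coords = true_coords(mask, rows, cols);
    let mut out = Vec::with_capacity(n * coords.len());
    for block in data.chunks(rows * cols) {
        // The jagged selection is flattened into one dimension per block.
        for &(r, c) in &coords {
            out.push(block[r * cols + c]);
        }
    }
    (out, vec![n, coords.len()])
}
```

For the mask in the example, the coordinates are (1, 0), (2, 0) and (2, 1), so the first block 1..6 contributes 3, 5, 6 and the result has shape [4, 3].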

It's perhaps easier to understand the point of bool indexing if you know that NumPy has loads of operations to construct masks, i.e. bool tensors from int or float tensors. Tensorken really has only one such operation at the moment: eq. But it's sufficient to illustrate the principle.

>>> let i = t.eq(&TrI::new(&[2], &[1, 2]))
┌                                               ┐
│ ┌       ┐   ┌       ┐   ┌       ┐   ┌       ┐ │
│ │ t   t │   │ f   f │   │ f   f │   │ f   f │ │
│ │ f   f │   │ f   f │   │ f   f │   │ f   f │ │
│ │ f   f │   │ f   f │   │ f   f │   │ f   f │ │
│ └       ┘   └       ┘   └       ┘   └       ┘ │
└                                               ┘

>>> t.oix1(&i)
[4, 3, 2] -> [2]
[ 1  2]

The first operation creates the mask i to identify all rows that contain [1, 2]. (I abbreviated true and false so it fits the screen better.) Then the indexing operation pulls out the values corresponding to the mask (which is just [1, 2] again, of course).

One difference with NumPy is that Tensorken makes no distinction between vix and oix for masks. NumPy does the same "broadcasted together" thing for masks and int tensors. It seems that this is, generally speaking, lightly used and not well understood, so I chose not to support it.

In conclusion, let's compare all the different forms of indexing in terms of what they can do to the indexed tensor's shape.

Basic indexing can remove a dimension by selecting a single element via a scalar index.
Basic indexing via slicing can make any dimension smaller, but not change the number of dimensions, except by adding dimensions of size 1 with NewAxis.
Outer indexing can keep the number of dimensions the same or increase them. It can decrease or increase the size of dimensions.
Vectorized indexing can change both the number and size of dimensions, but all new dimensions are added at the front.
Masks or boolean indexers can only decrease or keep the number of dimensions, never increase. Likewise, they can only decrease or keep the size of dimensions the same.

Implementation

The second half of this post describes the implementation of indexing operations in Tensorken, my from-scratch implementation of a tensor library in Rust. If you're only interested in the semantics of indexing operations, you might as well stop reading right now. If you're up for some Rust to gain a deeper understanding, read on.

Recap - where Tensorken is at

Tensorken has really grown up in the past year. With the addition of indexing, it's a functionally fully-featured tensor library now. It's unlikely to be fast enough for more than toy models, but we'll cross that bridge when we come to it.

Tensorken has:

Implementations for CPU and GPU of about 20 fundamental tensor operations such as exp, add, sum, and reshape. These operations are defined in a Rust trait RawTensorOps (formerly RawTensor, see the next section). If we want to support a new target for Tensorken, say CUDA, those are the 20 fundamental operations we have to implement.
An automatic differentiation layer built on about 15 fundamental differentiable operations defined in DiffableOps (formerly Diffable, see the next section). All the top-level operations on tensors, including the indexing operations, are built on these differentiable operations. That means that the composed operations are themselves differentiable.

That's where we left it in the last part. To add indexing operations, I first needed to allow bool and int tensors in Tensorken. Previously, I had only worked with tensors containing floating point numbers, using 1.0 and 0.0 as boolean values when needed. For example, the eq operation compared two tensors for equality by returning a float tensor with ones and zeros.

The Big Refactor: Bringing int and bool tensors to Tensorken

I figured the original definitions of RawTensor and Diffable would be easy to extend with other element types, given they had an associated type Elem:

pub trait RawTensor {
    type Elem;
    // fns ...
}

When I actually tried to add types besides float I realized I was wrong. The main problem is apparent when we look at eq. The original signature of eq was:

pub trait RawTensor {
    type Elem;
    fn eq(&self, other: &Self) -> Self;
}

If we also have bool tensors, we'd like to make the result of eq a bool tensor. The ideal signature would be:

pub trait RawTensor {
    type Elem;
    fn eq(&self, other: &Self) -> TRes where TRes=Self<Elem=bool>;
}

Which is not a valid Rust type signature. We can get close with:

pub trait RawTensor {
    type Elem;
    fn eq<TRes>(&self, other: &Self) -> TRes where TRes: RawTensor<Elem=bool>;
}

But this is not accurate: we have several implementations of RawTensor, one for CPU and another for GPU, and this signature of eq would allow the application of eq to two CPU RawTensors to return a GPU RawTensor or vice versa. That's not what we want.

Seemingly small inaccuracies like these compound. After weeks of trying to get everything to work and fighting a long list of compiler errors, I gave up and changed tack.

I re-read my own post on typed tagless final interpreters, and realized what I should have done from the get-go: introduce a higher-order representation of the RawTensor and Diffable traits, with generic associated types.

For the occasion, I renamed RawTensor to RawTensorOps:

pub trait RawTensorOps {
    type Repr<E: Clone>: Clone;
    // fns omitted
}

We now have a generic associated type Repr<E>. This type is the concrete representation of the particular raw tensor we'll implement - for CpuRawTensor for example, it's a buffer in memory with some information on shape and strides. The associated type is generic on E, the tensor's element type - this can be f32, i32, bool, or any other type we care to support.

The raw tensor operations, represented by functions on the RawTensorOps trait, then also become generic on the element type E:

fn exp<E: Float>(t: &Self::Repr<E>) -> Self::Repr<E>;

fn add<E: Num>(lhs: &Self::Repr<E>, rhs: &Self::Repr<E>) -> Self::Repr<E>;

Each method now chooses its element types independently, and the element type of the result can differ from that of the arguments. Thus, eq:

fn eq<E: PartialEq + Elem>(lhs: &Self::Repr<E>, rhs: &Self::Repr<E>) -> Self::Repr<bool>;

That's exactly the type signature we wanted.

Once you have tensors with different element types, you want to cast between them. Hence the new addition:

fn cast<EFro: Elem, ETo: CastFrom<EFro> + Elem>(t: &Self::Repr<EFro>) -> Self::Repr<ETo>;

Elem, Num, and Float are relatively uninteresting traits that enable successively more operations on the element types. Those traits ensure we can't call exp on a bool tensor and other nonsense.
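To illustrate how a cast between element types can be expressed, here is a small sketch on plain slices. The CastFrom trait below is my own stand-in and not necessarily how Tensorken defines it. The cast function is equally hypothetical and works on a slice instead of a tensor representation.

```rust
/// Hypothetical conversion trait: "Self can be produced from a T".
pub trait CastFrom<T> {
    fn cast_from(value: T) -> Self;
}

impl CastFrom<bool> for i32 {
    fn cast_from(value: bool) -> Self {
        i32::from(value) // true becomes 1, false becomes 0
    }
}

impl CastFrom<bool> for f32 {
    fn cast_from(value: bool) -> Self {
        if value { 1.0 } else { 0.0 }
    }
}

impl CastFrom<i32> for f32 {
    fn cast_from(value: i32) -> Self {
        value as f32
    }
}

/// Cast every element of a buffer from EFro to ETo.
pub fn cast<EFro: Copy, ETo: CastFrom<EFro>>(t: &[EFro]) -> Vec<ETo> {
    t.iter().map(|&x| ETo::cast_from(x)).collect()
}
```

Because there is no impl CastFrom<f32> for bool in this sketch, cast::<f32, bool> is rejected at compile time, which is the same kind of protection the element traits give us.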

As a result of this refactor, every implementation consists of two parts: the representation, what we'd think of as the tensor type, and the implementation, typically a singleton or even void type (meaning it has no instances, it's just a type) which implements the RawTensorOps trait. For example, let's look at the types involved in implementing RawTensorOps on the CPU.

First, the representation type is unchanged from before the refactor. It keeps the buffer with the tensor data, and some extra data that stores shape and other information:

pub struct CpuRawTensor<E> {
    buffer: Arc<Buffer<E>>,
    strider: ShapeStrider,
}

E is the element type, for example bool, i32, or f32. I've tried to use the name E consistently to mean the element type.

Then we have the implementation type:

pub enum CpuRawTensorImpl {}

Rust guarantees this type can't be instantiated, which is great because we don't need any instances:

impl RawTensorOps for CpuRawTensorImpl {
    type Repr<E: Clone> = CpuRawTensor<E>;
    // fns omitted
}

This pattern repeats for all other RawTensorOps implementations, and for all DiffableOps implementations.
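To see the whole pattern in one place, here is a self-contained toy version. ToyOps, VecTensor, VecImpl, and count_equal are names I made up for this sketch, and the representation is a flat Vec rather than a strided buffer. The point is only to show a generic associated type, an uninhabited implementation type, and an eq that returns a bool tensor of the same backend.

```rust
/// Toy stand-in for RawTensorOps: the backend picks the representation,
/// each operation picks its own element types.
pub trait ToyOps {
    type Repr<E: Clone>: Clone;

    fn from_slice<E: Clone>(data: &[E]) -> Self::Repr<E>;
    fn to_vec<E: Clone>(t: &Self::Repr<E>) -> Vec<E>;
    fn eq<E: PartialEq + Clone>(lhs: &Self::Repr<E>, rhs: &Self::Repr<E>) -> Self::Repr<bool>;
}

/// The representation type: what we'd think of as "the tensor".
#[derive(Clone)]
pub struct VecTensor<E>(Vec<E>);

/// The implementation type: an enum without variants, so it has no instances.
pub enum VecImpl {}

impl ToyOps for VecImpl {
    type Repr<E: Clone> = VecTensor<E>;

    fn from_slice<E: Clone>(data: &[E]) -> Self::Repr<E> {
        VecTensor(data.to_vec())
    }

    fn to_vec<E: Clone>(t: &Self::Repr<E>) -> Vec<E> {
        t.0.clone()
    }

    fn eq<E: PartialEq + Clone>(lhs: &Self::Repr<E>, rhs: &Self::Repr<E>) -> Self::Repr<bool> {
        VecTensor(lhs.0.iter().zip(rhs.0.iter()).map(|(a, b)| a == b).collect())
    }
}

/// Code written against the trait works for any backend I, and the bool
/// result of eq is guaranteed to belong to that same backend.
pub fn count_equal<I: ToyOps>(a: &I::Repr<i32>, b: &I::Repr<i32>) -> usize {
    let mask = I::eq::<i32>(a, b);
    I::to_vec::<bool>(&mask).into_iter().filter(|&x| x).count()
}
```

Calling VecImpl::eq on two VecTensor<i32> values yields a VecTensor<bool>. There is no way for it to return a tensor of another backend, which was the problem with the earlier attempt.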

Finally, we need to tie everything together and provide the user-facing API, where we put handy utility methods built out of the base DiffableOps operations. The Tensor struct is where that happens:

pub struct Tensor<T, E: Clone, I: DiffableOps<Repr<E> = T>>(
    pub(crate) T,
    pub(crate) PhantomData<(E, I)>,
);

Tensor ties the representation type T to the implementation type I via the constraint Repr<E> = T. E is again the element type. Here are some valid instantiations of Tensor with concrete types:

pub type Cpu<E> = Tensor<CpuRawTensor<E>, E, CpuRawTensorImpl>;
pub type Wgpu<E> = Tensor<WgpuRawTensor<'static, E>, E, WgpuRawTensorImpl>;

The actual types Tensorken uses are slightly more complicated because simple operation fusing is done via a RawTensorOps implementation in FuseImpl, but the principle remains the same.

And that is it as far as the big refactor is concerned. Conceptually not much has changed compared to v0.3 - but looking at the diff I had to touch pretty much every single line of code. I tried every trick in the book to make the edit manageable - there was no simple path through refactoring, so I had to break the code hard and work through hundreds of compiler errors. The way the Rust compiler works makes this especially disheartening. For example, the compiler doesn't check borrowing rules until it's happy all your trait types are correct. That leads to several iterations of reducing errors from 100 to 0, only to have another 100 errors pop up.

I did most of the editing manually or using search and replace. I also tried to get Gemini to rewrite some code by giving it an example rewrite and asking it to do the rest, one function at a time. That approach was somewhat successful, but I quickly grew tired of copy-pasting back and forth. I have GitHub Copilot, but it seems incapable of rewriting code - it only generates new code.

Implementing slicing

To keep Tensorken small, the higher-level tensor operations like matrix multiplication are implemented in terms of primitive differentiable operations defined on the DiffableOps trait. DiffableOps has basic operations like exp, add, and mul for calculating, but also a set of slicing and dicing operations to change the shape of tensors. To implement basic indexing, we'll use two in particular:

/// Crop the tensor according to the given limits. Limits are inclusive-exclusive.
pub fn crop(&self, limits: &[(usize, usize)]) -> Self;

/// Reshape the tensor to the given shape.
/// The number of elements must remain the same.
pub fn reshape(&self, shape: &[usize]) -> Self;

For details on how these work, see the first part in the Tensorken series. Here are a few quick examples:

>>> let t = &Tr::new(&[3, 2], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
┌       ┐
│ 2   1 │
│ 4   2 │
│ 8   4 │
└       ┘

>>> t.crop(&[(0, 2), (1, 2)])
┌   ┐
│ 1 │
│ 2 │
└   ┘

>>> t.reshape(&[1, 6])
[ 2  1  4  2  8  4]

crop reduces the size of a tensor to a rectangular subset. reshape changes the shape, but does not change the size. Neither operation copies the tensor - they only change the view on the underlying data buffer.
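If you're wondering how crop can avoid a copy, here is a rough sketch of a strided view. It is a simplification I wrote for this explanation, not Tensorken's ShapeStrider: the View type and its methods are hypothetical, and reshape is left out because it needs extra care on non-contiguous views.

```rust
use std::sync::Arc;

/// A view on a shared buffer: shape, strides, and a start offset.
#[derive(Clone)]
pub struct View {
    buffer: Arc<Vec<f32>>,
    shape: Vec<usize>,
    strides: Vec<usize>,
    offset: usize,
}

impl View {
    pub fn new(shape: &[usize], data: Vec<f32>) -> Self {
        assert_eq!(shape.iter().product::<usize>(), data.len());
        // Row-major strides: the last dimension is contiguous.
        let mut strides = vec![1; shape.len()];
        for axis in (0..shape.len().saturating_sub(1)).rev() {
            strides[axis] = strides[axis + 1] * shape[axis + 1];
        }
        View { buffer: Arc::new(data), shape: shape.to_vec(), strides, offset: 0 }
    }

    /// Crop to the given inclusive-exclusive limits by moving the offset
    /// and shrinking the shape. The buffer is shared, not copied.
    pub fn crop(&self, limits: &[(usize, usize)]) -> Self {
        assert_eq!(limits.len(), self.shape.len());
        let mut offset = self.offset;
        let mut shape = Vec::with_capacity(limits.len());
        for (axis, &(start, end)) in limits.iter().enumerate() {
            assert!(start <= end && end <= self.shape[axis]);
            offset += start * self.strides[axis];
            shape.push(end - start);
        }
        View { buffer: Arc::clone(&self.buffer), shape, strides: self.strides.clone(), offset }
    }

    pub fn shape(&self) -> &[usize] {
        &self.shape
    }

    pub fn get(&self, index: &[usize]) -> f32 {
        let pos: usize = index.iter().zip(&self.strides).map(|(i, s)| i * s).sum();
        self.buffer[self.offset + pos]
    }

    pub fn shares_buffer_with(&self, other: &View) -> bool {
        Arc::ptr_eq(&self.buffer, &other.buffer)
    }
}
```

Cropping the [3, 2] example above with limits (0, 2) and (1, 2) only changes the offset to 1 and the shape to [2, 1]. The elements 1 and 2 are then read from the original buffer.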

The advantage of implementing all operations based on the available operations in DiffableOps is that the resulting indexing operations become differentiable too. That means we can use them in programs that use gradient descent for learning.

We proceed by translating the indexing operations as detailed above, to a few datatypes we'll interpret later on. To begin with, we'll define an IndexSpec which contains a number of IndexElements. We split an index expression like t.ix[..4, 2, NewAxis] into three parts. Each part is an index element:

pub struct IndexSpec {
    axes: Vec<IndexElement>,
}

pub enum IndexElement {
    // A single element in an axis.
    Single(SingleIndex),
    // A range of elements in an axis. The second element is inclusive if the bool is true.
    Slice(SingleIndex, SingleIndex, bool),
    // Create a new axis with size 1.
    NewAxis,
    // Keep the remaining dimensions as is.
    Ellipsis,
}

The different cases should be somewhat clear - they map one-to-one to the different slicing possibilities. SingleIndex is a simple enum that specifies if we start counting from the start or the end:

pub enum SingleIndex {
    Head(usize),
    Tail(usize),
}

Even if we would allow negative i32 like Python does, I'd still translate those to this type: all indexing in Rust is done with usize, so a usize-based representation is easier to work with.

The implementation translates an IndexSpec to a BasicIndexResolution:

struct BasicIndexResolution {
    limits: Vec<(usize, usize)>,
    shape: Vec<usize>,
}

We apply a BasicIndexResolution to a tensor like this:

&tensor.crop(&resolution.limits).reshape(&resolution.shape);

In principle, we could also crop and reshape as we interpret an index element dimension per dimension, but it is cleaner to only use two operations.

The implementation itself is not very interesting. It is off-by-one hell - dealing with counting from the start, counting from the end, the inclusive-exclusive nature of ranges, and other such shenanigans caused me quite a few headaches.

In broad strokes we iterate over the various elements in the index spec, and append limits and update the shape in BasicIndexResolution as we go along.

The four possible cases of IndexElement are handled as follows:

Single: a single index, e.g. t.ix1(3). Limits are updated to only keep the given index. The shape is updated to remove this dimension of size 1.
Slice: a range, e.g. t.ix1(1..20). Limits are updated to only keep the range. The shape is updated to the resulting size of the dimension.
NewAxis: a new axis of size 1. Limits are unchanged. The shape is updated with a new dimension of size 1. We can always add such a dimension because it doesn't change the overall size of the tensor.
Ellipsis: keep remaining dimensions as is. Since ellipsis can occur at most once, but anywhere in an index, this needs to figure out how many dimensions the ellipsis spans, and then add the original limits and shape unchanged.
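The four cases above can be sketched roughly as follows. This is a simplified reconstruction, not the actual Tensorken code. I'm assuming that Tail(0) means the last element of an axis, that dimensions not mentioned in the spec are kept whole, and that out-of-bounds indexes simply panic. The resolve, position, and keep_whole functions are hypothetical names.

```rust
#[derive(Clone, Copy, Debug)]
pub enum SingleIndex {
    Head(usize),
    Tail(usize),
}

#[derive(Clone, Copy, Debug)]
pub enum IndexElement {
    Single(SingleIndex),
    Slice(SingleIndex, SingleIndex, bool),
    NewAxis,
    Ellipsis,
}

#[derive(Debug, Default, PartialEq)]
pub struct BasicIndexResolution {
    pub limits: Vec<(usize, usize)>,
    pub shape: Vec<usize>,
}

/// Assumption: Tail(0) is the last element of the axis, Tail(1) the one before it.
fn position(index: SingleIndex, size: usize) -> usize {
    match index {
        SingleIndex::Head(i) => i,
        SingleIndex::Tail(i) => size.checked_sub(i + 1).expect("index out of bounds"),
    }
}

/// Keep the given dimensions unchanged: full limits, same size.
fn keep_whole(res: &mut BasicIndexResolution, sizes: &[usize]) {
    for &size in sizes {
        res.limits.push((0, size));
        res.shape.push(size);
    }
}

pub fn resolve(spec: &[IndexElement], shape: &[usize]) -> BasicIndexResolution {
    // Only Single and Slice consume a dimension of the indexed tensor.
    let consumed = spec
        .iter()
        .filter(|e| matches!(e, IndexElement::Single(_) | IndexElement::Slice(..)))
        .count();
    let ellipses = spec.iter().filter(|e| matches!(e, IndexElement::Ellipsis)).count();
    assert!(consumed <= shape.len() && ellipses <= 1, "invalid index");

    let mut res = BasicIndexResolution::default();
    let mut axis = 0;
    for &element in spec {
        match element {
            IndexElement::Single(i) => {
                // Crop to one element; the dimension is dropped from the shape.
                let p = position(i, shape[axis]);
                assert!(p < shape[axis], "index out of bounds");
                res.limits.push((p, p + 1));
                axis += 1;
            }
            IndexElement::Slice(start, end, inclusive) => {
                let s = position(start, shape[axis]);
                let e = position(end, shape[axis]) + usize::from(inclusive);
                assert!(s <= e && e <= shape[axis], "slice out of bounds");
                res.limits.push((s, e));
                res.shape.push(e - s);
                axis += 1;
            }
            // No limits: a size 1 dimension only shows up in the shape.
            IndexElement::NewAxis => res.shape.push(1),
            IndexElement::Ellipsis => {
                let spanned = shape.len() - consumed;
                keep_whole(&mut res, &shape[axis..axis + spanned]);
                axis += spanned;
            }
        }
    }
    keep_whole(&mut res, &shape[axis..]);
    res
}
```

The resulting limits and shape are exactly what crop and reshape need. Note that limits always has one entry per dimension of the original tensor, while shape describes the result.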

See the full implementation if you're interested in more detail. It also handles fancy indexing, which we'll discuss next.

Adding fancy indexing

We now know how to resolve basic indexes, but we can use both basic indexing and fancy indexes in the same index expression. Before we start on the implementation of fancy indexes, we have to figure out how to interleave the two.

As it turns out, resolving basic and fancy indexes can be split into a basic indexing phase and a fancy indexing phase. We can first extract and apply the basic indexing expressions, leaving any dimensions with fancy indexes intact as if the user had specified .. for them. Thanks to the magic of basic indexing, this never needs a copy. Then, we apply any fancy indexing expressions to the result.

>>> let t = TrI::new(&[4, 3, 2], &(1..25).collect::<Vec<_>>())
┌                                                     ┐
│ ┌       ┐   ┌         ┐   ┌         ┐   ┌         ┐ │
│ │ 1   2 │   │ 7    8  │   │ 13   14 │   │ 19   20 │ │
│ │ 3   4 │   │ 9    10 │   │ 15   16 │   │ 21   22 │ │
│ │ 5   6 │   │ 11   12 │   │ 17   18 │   │ 23   24 │ │
│ └       ┘   └         ┘   └         ┘   └         ┘ │
└                                                     ┘

>>> let i1 = TrI::new(&[2], &[0, 2])
[ 0  2]

>>> let i2 = TrI::new(&[2], &[0, 1])
[ 0  1]

>>> t.vix3(&i1, ..2, &i2)
[4, 3, 2] -> [2, 2]
┌         ┐
│ 1    3  │
│ 14   16 │
└         ┘

>>> t.vix3(.., ..2, ..).vix3(&i1, .., &i2)
[4, 3, 2] -> [2, 2]
┌         ┐
│ 1    3  │
│ 14   16 │
└         ┘

This example splits vix3(&i1, ..2, &i2) into vix3(.., ..2, ..), which has only basic indexing expressions, and vix3(&i1, .., &i2) which has only fancy indexing expressions. Happily, we arrived at the same result, so we can now focus on implementing fancy indexing without worrying about how it interacts with basic indexing.
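The split itself is mechanical. Here is a minimal sketch that handles only slices and int tensor indexes. The Index, BasicPhase, and FancyPhase types and the split function are invented for this illustration, and index tensors are represented as plain vectors.

```rust
/// A mixed index expression, one element per dimension. (Hypothetical types.)
#[derive(Clone, Debug, PartialEq)]
pub enum Index {
    Slice(usize, usize),
    Fancy(Vec<usize>),
}

#[derive(Clone, Debug, PartialEq)]
pub enum BasicPhase {
    Slice(usize, usize),
    Full,
}

#[derive(Clone, Debug, PartialEq)]
pub enum FancyPhase {
    Full,
    Index(Vec<usize>),
}

/// Split a mixed expression into a basic-only and a fancy-only expression.
pub fn split(spec: &[Index]) -> (Vec<BasicPhase>, Vec<FancyPhase>) {
    let mut basic = Vec::new();
    let mut fancy = Vec::new();
    for element in spec {
        match element {
            // Slices are applied in phase one; phase two leaves that dimension alone.
            Index::Slice(start, end) => {
                basic.push(BasicPhase::Slice(*start, *end));
                fancy.push(FancyPhase::Full);
            }
            // Fancy indexes are skipped in phase one and applied in phase two.
            Index::Fancy(ix) => {
                basic.push(BasicPhase::Full);
                fancy.push(FancyPhase::Index(ix.clone()));
            }
        }
    }
    (basic, fancy)
}
```

Applied to the expression from the example, this yields the two expressions we applied by hand: a full slice, ..2, and a full slice for phase one, then i1, a full slice, and i2 for phase two.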

In the implementation, we extend the IndexElement enum with an additional case:

pub enum IndexElement<I: DiffableOps> {
    // as before
    // Fancy index - mask or int tensor.
    Fancy(Fancy<I>),
}

pub enum Fancy<I: DiffableOps> {
    Full,
    IntTensor(Tensor<I::Repr<i32>, i32, I>),
    BoolTensor(Tensor<I::Repr<bool>, bool, I>),
}

The Full case indicates that the corresponding dimension was handled in the basic indexing phase, which happens first. The fancy indexing phase keeps the dimension unchanged.

Implementing outer indexing

Let's start with outer indexing. It's not obvious how to implement it with just the operations in DiffableOps, yet there is a way. It relies on the observation that if we can convert the integer indexes to one-hot vectors - vectors that have a 1 in the position they index, and are 0 otherwise - and then multiply the original tensor with these one-hot vectors, we keep exclusively the elements we want.

Let's look at an example to clarify. We start with the following one-dimensional tensor, and index its only dimension with the tensor i:

>>> let t = TrI::new(&[4], &(1..5).collect::<Vec<_>>())
[ 1  2  3  4]

>>> let i = TrI::new(&[2], &[2, 0])
[ 2  0]

The expected result is [ 3 1].

First, we convert i = [ 2 0] to its corresponding one-hot representation manually (we'll see how to do this automatically later on):


>>> let i_one_hot = // coming soon
┌       ┐
│ 0   1 │
│ 0   0 │
│ 1   0 │
│ 0   0 │
└       ┘

We have two indexes, 2 and 0, so there are two column vectors. The first index is 2, and the one-hot vector for 2 is [0 0 1 0], so that's the first column. The one-hot vector for 0 is [1 0 0 0], so that's the second column. We replaced the numbers in each column with one-hot vectors representing that number.

Now, after reshaping t to a column vector:

>>> let t = t.reshape(&[4, 1])
┌   ┐
│ 1 │
│ 2 │
│ 3 │
│ 4 │
└   ┘

We can see that if we multiply i_one_hot with the reshaped t, thanks to broadcasting, we multiply each column in i_one_hot with the column vector t. This results in a tensor that contains only one non-zero entry per column, and this entry is the entry in the original t we want to keep:

>>> let mul_result = t.mul(&i_one_hot)
┌       ┐
│ 0   1 │
│ 0   0 │
│ 3   0 │
│ 0   0 │
└       ┘

We can smell victory now! All that's left to do is get rid of the zeros. Zero is a neutral element for addition, so we can just sum the columns:

>>> let sum_result = mul_result.sum(&[0]).squeeze(&Axes::Axis(0))
[ 3  1]

And that's exactly what we wanted.

Now, how do we turn a vector of non-negative indexes into a one-hot representation? It is simple - once you know the trick! Create a range vector as long as the size of the dimension:

>>> let i_range = TrI::new(&[4, 1], (0..4).collect::<Vec<_>>().as_slice())
┌   ┐
│ 0 │
│ 1 │
│ 2 │
│ 3 │
└   ┘

And compare it to the index vector using eq, casting true to 1 and false to 0:

>>> let i_one_hot = i.eq(&i_range).cast::<i32>()
┌       ┐
│ 0   1 │
│ 0   0 │
│ 1   0 │
│ 0   0 │
└       ┘

Voila! We can repeat this procedure dimension by dimension for every outer index.
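For readers who prefer plain loops over tensor operations, here is the same one-dimensional procedure as a sketch on Vecs. The one_hot and index_1d functions are illustrative names, and nested Vecs stand in for tensors.

```rust
/// One column per index: entry [r][c] is 1 when indexes[c] == r.
/// This is the "compare with a range and cast" step.
pub fn one_hot(indexes: &[usize], size: usize) -> Vec<Vec<i32>> {
    (0..size)
        .map(|r| indexes.iter().map(|&i| i32::from(i == r)).collect())
        .collect()
}

/// Index a one-dimensional tensor by multiplying with the one-hot matrix
/// and summing each column.
pub fn index_1d(t: &[i32], indexes: &[usize]) -> Vec<i32> {
    let hot = one_hot(indexes, t.len());
    (0..indexes.len())
        .map(|c| (0..t.len()).map(|r| t[r] * hot[r][c]).sum::<i32>())
        .collect()
}
```

With t = [1, 2, 3, 4] and indexes [2, 0], one_hot produces the matrix shown above and index_1d returns [3, 1].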

To understand what happens in more dimensions, it's easier to think in terms of the shapes. For example, let's say we have a tensor t with shape [4, 6] and we index its first dimension with a tensor i of shape [2]. Here's what happens with the shapes:

t [4, 6]
i [2]
t[i, ..]       [2, 6] // expected result shape

// First create the one-hot representation of the index.
range          [4, 1]
// Its shape is the size of the indexed dimension, by the size of i.
one_hot        [4, 2]

// Reshape t - add a size 1 dimension...
t           [4, 1, 6]
// Reshape one_hot - add a size one dimension...
one_hot     [4, 2, 1]
// ...so that the multiplication lines up nicely for broadcasting.
t * one_hot [4, 2, 6]

// Sum and remove the dimension we don't need.
sum         [1, 2, 6]
squeeze        [2, 6]

Pretty neat! Say we wanted to index the second dimension of t with another tensor j, we'd continue from the intermediate result [2, 6] as follows:

t[i, ..]       [2, 6]
j [3]
t[i, j]        [2, 3]  // expected result shape

// First create the one-hot representation of the index.
range          [6, 1]
// Its shape is the size of the indexed dimension of t, by the size of j.
one_hot        [6, 3]

// Reshape t - add a size 1 dimension after the indexed dimension
t           [2, 6, 1]
// Reshape one_hot not necessary for the last dimension - 
// broadcasting adds dimensions at the front automatically.
one_hot        [6, 3]
// This multiplication lines up nicely.
t * one_hot [2, 6, 3]

// Sum and remove the dimension we don't need.
sum         [2, 1, 3]
squeeze        [2, 3]

You can equivalently think of indexing with an int tensor as matrix multiplication with a one-hot representation of the indexing tensor.

In summary, to outer index with an int tensor:

convert the int tensor to a one-hot representation
reshape the tensor to make room for the shape of the indexing tensor by adding dimensions of size 1
multiply with the one hot representation
reduce the original dimension by summing it.

The Rust implementation is here.
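As a sanity check on the shape walkthrough, here is a loop-based sketch of outer indexing on a two-dimensional tensor. It is not the Tensorken implementation: outer_rows and outer_cols are hypothetical helpers on nested Vecs, and the one-hot entries are computed on the fly instead of being materialized.

```rust
/// Outer-index the first dimension:
/// out[k][c] = sum over r of one_hot[r][k] * t[r][c], where one_hot[r][k] = (indexes[k] == r).
pub fn outer_rows(t: &[Vec<i32>], indexes: &[usize]) -> Vec<Vec<i32>> {
    let cols = t.first().map_or(0, |row| row.len());
    indexes
        .iter()
        .map(|&i| {
            (0..cols)
                .map(|c| {
                    t.iter()
                        .enumerate()
                        .map(|(r, row)| i32::from(r == i) * row[c])
                        .sum::<i32>()
                })
                .collect()
        })
        .collect()
}

/// Outer-index the second dimension:
/// out[r][k] = sum over c of t[r][c] * one_hot[c][k].
pub fn outer_cols(t: &[Vec<i32>], indexes: &[usize]) -> Vec<Vec<i32>> {
    t.iter()
        .map(|row| {
            indexes
                .iter()
                .map(|&j| {
                    row.iter()
                        .enumerate()
                        .map(|(c, &v)| v * i32::from(c == j))
                        .sum::<i32>()
                })
                .collect()
        })
        .collect()
}
```

Applying outer_rows with an index of size 2 and then outer_cols with an index of size 3 to a [4, 6] tensor gives a [2, 3] result, as in the shape walkthrough.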

Implementing vectorized indexing

Vectorized indexing works via the same principle as outer indexing, but in vectorized indexing the indexing tensors are broadcasted together and the new dimensions are added at the front. We can still play the one hot, multiply, and sum game, but we adjust the shapes differently.

Let's go through an example again. We'll use a tensor t of the same shape as before, but with new indexing tensors i and j this time.

>>> let t = TrI::new(&[4, 6], &(1..25).collect::<Vec<_>>())
┌                             ┐
│ 1    2    3    4    5    6  │
│ 7    8    9    10   11   12 │
│ 13   14   15   16   17   18 │
│ 19   20   21   22   23   24 │
└                             ┘

>>> let i = TrI::new(&[2], &[2, 0])
[ 2  0]

>>> let j = TrI::new(&[2], &[1, 0])
[ 1  0]

Using the same i and j as in the previous section won't work: their shapes can't be broadcasted together.

We again proceed iteratively, starting with i. Let's first create the one hot representation:

>>> let i_range = TrI::new(&[4], (0..4).collect::<Vec<_>>().as_slice())
[ 0  1  2  3]

>>> let i = i.reshape(&[2, 1])
┌   ┐
│ 2 │
│ 0 │
└   ┘

>>> let i_one_hot = i.eq(&i_range).cast::<i32>()
┌               ┐
│ 0   0   1   0 │
│ 1   0   0   0 │
└               ┘

The representation is transposed from the oix implementation: we now have two row vectors stacked on top of each other. That's because the new dimensions added by indexing now always go to the front of the resulting tensor, and as we'll see in the next step this transposed one hot representation works out better:

>>> let i_one_hot = i_one_hot.reshape(&[2, 4, 1])
┌               ┐
│ ┌   ┐   ┌   ┐ │
│ │ 0 │   │ 1 │ │
│ │ 0 │   │ 0 │ │
│ │ 1 │   │ 0 │ │
│ │ 0 │   │ 0 │ │
│ └   ┘   └   ┘ │
└               ┘

>>> let mul_result = t.mul(&i_one_hot)
┌                                                             ┐
│ ┌                             ┐   ┌                       ┐ │
│ │ 0    0    0    0    0    0  │   │ 1   2   3   4   5   6 │ │
│ │ 0    0    0    0    0    0  │   │ 0   0   0   0   0   0 │ │
│ │ 13   14   15   16   17   18 │   │ 0   0   0   0   0   0 │ │
│ │ 0    0    0    0    0    0  │   │ 0   0   0   0   0   0 │ │
│ └                             ┘   └                       ┘ │
└                                                             ┘

We've reshaped the one hot representation to [2, 4, 1] and multiply that with t of shape [4, 6]. That multiplication results in a [2, 4, 6] tensor. The dimension of size 4 is the dimension we're indexing, so that's the one we need to get rid of:

>>> let t = mul_result.sum(&[1]).squeeze(&Axes::Axis(1))
┌                             ┐
│ 13   14   15   16   17   18 │
│ 1    2    3    4    5    6  │
└                             ┘

And that's the first index done. We're left with an intermediate result t of shape [2, 6]. So far, the result is the same as if we'd used oix - the difference becomes visible when we handle the second index. We make the one hot representation of j, again transposed when compared to outer indexing:

>>> let j_range = TrI::new(&[6], (0..6).collect::<Vec<_>>().as_slice())
[ 0  1  2  3  4  5]

>>> let j = j.reshape(&[2, 1])
┌   ┐
│ 1 │
│ 0 │
└   ┘

>>> let j_one_hot = j.eq(&j_range).cast::<i32>()
┌                       ┐
│ 0   1   0   0   0   0 │
│ 1   0   0   0   0   0 │
└                       ┘

Now we have a one-hot shape of [2, 6] and a tensor of shape [2, 6]. We can multiply them without any further reshaping. This step is where we use the requirement that the indexing tensors must be broadcast-able: if not, we'd fail because the one-hot shape (determined by the second indexing tensor) and the intermediate result shape (determined by the first indexing tensor) wouldn't line up.

>>> let mul_result = t.mul(&j_one_hot)
┌                        ┐
│ 0   14   0   0   0   0 │
│ 1   0    0   0   0   0 │
└                        ┘

Finally, we reduce axis 1, the indexed dimension, to get the result:

>>> let sum_result = mul_result.sum(&[1]).squeeze(&Axes::Axis(1))
[ 14  1]

Putting everything together in terms of shapes:

t [4, 6]
i [2]
t[i, ..]       [2, 6]  // expected result shape

// First create the one-hot representation of the index.
range             [4]
// Its shape is the size of i, by the size of the indexed dimension of t.
one_hot        [2, 4]

// Reshape one_hot...
one_hot     [2, 4, 1]

// multiplication lines up nicely.
t * one_hot [2, 4, 6]

// Sum and remove the dimension we don't need.
sum         [2, 1, 6]
squeeze        [2, 6]

// Continue with the second index
j                 [2]
t[i, j]           [2]  // expected result shape

// First create the one-hot representation of the index.
range             [6]
// Its shape is the size of j, by the size of the indexed dimension of t.
one_hot        [2, 6]

// multiplication lines up nicely.
t * one_hot    [2, 6]

// Sum and remove the dimension we don't need.
sum            [2, 1]
squeeze           [2]

The Rust implementation is here.
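Here is the same computation as a loop-based sketch, for a two-dimensional tensor and two one-dimensional index tensors of equal length. The vectorized function is a hypothetical helper on nested Vecs, and it does not handle broadcasting an index of length 1.

```rust
/// Vectorized indexing of a matrix with two index vectors of the same length:
/// out[k] = sum over r and c of one_hot_i[k][r] * t[r][c] * one_hot_j[k][c].
pub fn vectorized(t: &[Vec<i32>], i: &[usize], j: &[usize]) -> Vec<i32> {
    assert_eq!(i.len(), j.len(), "index tensors must have the same length");
    i.iter()
        .zip(j)
        .map(|(&row_ix, &col_ix)| {
            t.iter()
                .enumerate()
                .map(|(r, row)| {
                    row.iter()
                        .enumerate()
                        .map(|(c, &v)| i32::from(r == row_ix) * v * i32::from(c == col_ix))
                        .sum::<i32>()
                })
                .sum::<i32>()
        })
        .collect()
}
```

Unlike outer indexing, each position k pairs up i[k] with j[k], so the result has a single dimension of size 2 instead of a [2, 2] grid.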

Implementing masking

The final piece is masking, or indexing with a boolean tensor. There are no new tricks here - the best we can do (as far as I can tell) is manually convert the bool tensor to a one-dimensional vector, convert that to the equivalent int tensor, and then index using the int tensor.

For example, a bool tensor:

>>> let b = TrB::new(&[2, 3], &[false, false, true, false, true, false])
┌                       ┐
│ false   false   true  │
│ false   true    false │
└                       ┘

Is turned into:

>>> let i_b = TrI::new(&[2], &[2, 4])
[ 2  4]

Since the b mask had two dimensions, it spans two dimensions when used as an index, and those two dimensions reduce to one. To achieve this, we reshape the dimensions to a single one, with a size equal to the product of the size of the original dimensions. Then we use outer indexing to index with the equivalent int index.
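The conversion from a mask to an int index is simple enough to sketch on plain slices. Both functions below are illustrative names, and they work on the already flattened mask and tensor dimensions.

```rust
/// Positions of the true entries in a flattened mask.
pub fn mask_to_indexes(mask: &[bool]) -> Vec<usize> {
    mask.iter()
        .enumerate()
        .filter_map(|(i, &m)| m.then_some(i))
        .collect()
}

/// Select the elements of a flattened tensor where the mask is true.
pub fn mask_select(flat: &[i32], mask: &[bool]) -> Vec<i32> {
    assert_eq!(flat.len(), mask.len(), "mask must cover the flattened dimensions");
    mask_to_indexes(mask).into_iter().map(|i| flat[i]).collect()
}
```

For the mask b above, flattened in row-major order, mask_to_indexes returns [2, 4], the int index i_b.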

The implementation is here.

Conclusion

I hope that demystified tensor indexing. There's a lot more to it than meets the eye. Check out the references for useful additional material if you want to learn more.


Indexing on ndarrays: NumPy's indexing documentation.
NEP 21 — Simplified and explicit advanced indexing: A (deferred) NumPy enhancement proposal, with a good analysis of how NumPy's indexing currently works and some issues with it. Tensorken's split between outer and vectorized indexing follows this proposal.
How does advanced indexing work when I combine a tensor and a single index: PyTorch maintainer explains how basic and fancy indexing compose.

Beyond Backpropagation - Higher Order, Forward and Reverse-mode Automatic Differentiation for Tensorken

Kurt — Sat, 09 Dec 2023 22:00:04 +0000

This post describes how I added automatic differentiation to Tensorken. Tensorken is my attempt to build a fully featured yet easy-to-understand and hackable implementation of a deep learning library in Rust. It takes inspiration from the likes of PyTorch, Tinygrad, and JAX.

Tensorken's approach to automatic differentiation (or AD) is heavily inspired by JAX. Like JAX, Tensorken supports higher-order derivatives - besides the first derivative, it can calculate the second, third, and so on. Tensorken supports both forward and reverse-mode AD, and can arbitrarily compose the two. Finally, thanks to good fundamentals explained in the previous two posts (part 1 and part 2), Tensorken can compute derivatives on the CPU or GPU.

All code for this post is in the Tensorken repository, tagged v0.3.

There's a "it's turtles all the way down" reference somewhere in this post, and only then will this image make sense. Generated by Lexica's Aperture model.

Previously in Tensors From Scratch: neural networks, matrix multiplication, and GPU acceleration

Modern neural networks, for example, large language models (LLMs) like OpenAI's ChatGPT and GPT-4, Microsoft's Bing, Google's Bard, and Anthropic's Claude, are powered by tensors. Tensors are multi-dimensional arrays augmented with operations that execute efficiently on modern hardware, most notably GPUs.

To understand all that, I am building a neural network library like PyTorch or JAX, from the ground up in Rust. These libraries consist of:

A tensor library, to provide efficient operations to slice, dice, and apply bulk operations to tensors.
Accelerators, to accelerate tensor operations on the GPU.
Automatic differentiation, to train neural networks via gradient descent.
Neural network building blocks, to simplify using common activation functions and layers.

In the first post, I focussed on the tensor library. I described almost twenty fundamental tensor operations and abstracted them in a Rust trait called RawTensor. RawTensor had a single implementation, CpuRawTensor which executes tensor computations on the CPU. In the second post, I implemented RawTensor again in WgpuRawTensor to execute on the GPU using wgpu, Rust's implementation of WebGPU. We dove into the nitty-gritty of GPU programming in general and wgpu in particular.

This third part of the series describes how to add automatic differentiation to Tensorken. Automatic differentiation (AD) is a technique to compute derivatives of tensor computations, without programmer intervention. AD is crucial because neural networks are trained via gradient descent, which relies on the efficient calculation of derivatives.

How to train your neural network

Let's sketch how to train a neural network to emphasize how important AD is for deep learning.

First, gather training data, and lots of it. Training data are lots of input-expected output pairs. The input examples are encoded as numbers and aggregated in a tensor 𝚇. The outputs go in a tensor 𝚈. Think of 𝚈 as the correct predictions for the inputs 𝚇. For a language model, each example in 𝚇 could be a sequence of words, and 𝚈 the next word, encoded as numbers. (How to encode text as numbers is an interesting problem that's not relevant to this story.)

Second, decide on the architecture of your neural network. A neural network consists of tensors 𝚆ᵢ that contain the parameters of the network. That's what you download when you get a model's weights. The architecture determines how many parameters we have and how we combine the input 𝚇 with parameters 𝚆ᵢ to obtain an output 𝚈'. Whatever the architecture is, we can execute it to predict 𝚈':

𝚈' = 𝚏(𝚇, 𝚆ᵢ).

I'm simplifying - researchers distinguish weights 𝚆 and biases b, but in the end, they're both part of the trainable parameters so I'm just lumping them together in the 𝚆s.

Third, using the expected 𝚈 and the prediction 𝚈', calculate the loss 𝙻. The loss is a single number that is high when the prediction is bad, and low when it is good. The loss is calculated by comparing the network's prediction 𝚈' with the expected output 𝚈:

𝙻 = 𝚕(𝚈, 𝚈').

Fourth, calculate the gradient 𝙶 of the loss. The scalar loss value 𝙻 is a function of 𝚇, 𝚆ᵢ, and 𝚈. Imagine the loss function as describing a (highly dimensional!) landscape. Training the network to improve its predictions means changing the parameters 𝚆ᵢ to make the loss small. We'd like to know how we should change the parameters to achieve that.

Now is when the gradient comes in. Going back to the landscape analogy, to make the loss smaller we'd like to know the best direction to "move" in to go "down" - that is, from the current value of the parameters, find the direction with the highest slope. If you remember some calculus, the derivative of a function at a point is that slope. So, to calculate the gradient, we calculate the loss function's derivative with respect to each parameter 𝚆ᵢ. In other words, we'll have a number for each parameter that tells us how to change that parameter to make the loss smaller.

Fifth, update the parameters using the gradient. There are many ways of doing this. The simplest is to multiply the gradient with a small number ϵ and subtract it from the parameters:

𝚆ᵢ <- 𝚆ᵢ − ϵ𝙶ᵢ.

That's one training step done! Your neural network just got a tiny bit better. Now repeat from step 2 until you've had enough. You can stop when the loss becomes small enough, when it stops changing for some number of iterations, or when your AWS bill exceeds the budget.
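To make the loop concrete, here is a tiny sketch in plain Rust without any tensor library. It assumes the simplest possible model, a single weight w with prediction w * x and a mean squared error loss. The gradient is derived by hand, which is exactly the step automatic differentiation takes over. The train function and its parameters are made up for this example.

```rust
/// Fit y = w * x by gradient descent on the mean squared error.
pub fn train(xs: &[f64], ys: &[f64], epsilon: f64, steps: usize) -> f64 {
    let n = xs.len() as f64;
    let mut w: f64 = 0.0;
    for _ in 0..steps {
        // Loss: L = mean((w * x - y)^2).
        // Hand-derived gradient: dL/dw = mean(2 * (w * x - y) * x).
        let g = xs
            .iter()
            .zip(ys)
            .map(|(&x, &y)| 2.0 * (w * x - y) * x)
            .sum::<f64>()
            / n;
        // The update step: w <- w - epsilon * g.
        w -= epsilon * g;
    }
    w
}
```

With inputs [1, 2, 3], expected outputs [2, 4, 6], and ϵ = 0.05, the weight converges to 2 within a few dozen steps.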

Tensorken can already do almost all of those steps. Running a network, calculating a loss, and updating the parameters amounts to applying tensor operations. What's missing is calculating the gradient via the loss function's derivative. In the olden days, people would calculate the derivative of the network by hand, symbolically, and then implement it manually. Clearly tedious and error-prone, not to mention limiting the complexity and size of the networks. Modern neural network libraries calculate a function's output and its derivative without programmer intervention using the miracle of automatic differentiation.

AD is a vast and intricate topic. For a (much) longer primer on the basics, see my earlier post. If you are unfamiliar with AD I encourage you to read it or any of the AD primers in the links.

The following section demonstrates Tensorken's AD capabilities and interface via small examples. Then we'll dive into implementation details, but I'll stay away from the detailed mechanics of AD since that's already covered elsewhere. Instead, I'll focus on how Tensorken implements higher-order, mixed-mode, JAX-style AD as an elegant and minimal Rust library.

The Autodiff Cookbook in Tensorken

To demonstrate Tensorken's AD capabilities, I translated a significant part of JAX's Autodiff Cookbook to Tensorken. I reproduced and edited part of the original text here. JAX's license is Apache 2.0, so I hope this does not incur the wrath of Google. The titles in this section are similar to the ones in JAX's cookbook, in case you want to compare. The full example code is in jax_autodiff_cookbook.rs.

Before we begin - Tensorken runs on the CPU if you create tensors via the Cpu32 type alias and on the GPU via Wgpu32. (The 32 is because they work with 32-bit floating point numbers.) To make it easy to switch, I'll use the Tr type alias throughout:

type Tr = Cpu32; // or Wgpu32

Gradients

You can differentiate a function using grad1. The 1 indicates the number of arguments of the function - a poor man's variadic arguments. In the text, I'll sometimes refer to the family of grad1, grad2 functions as grad. In the code, I'll use the function with the correct number of arguments.

To start with, we'll use a simple scalar function - a function that takes a single number and returns a single number:

let p = Tr::scalar(2.0);
let df = grad1(|x| x.tanh(), &p);

> df: [0.07065082]

In Tensorken, all arguments must be a tensor Tr - it doesn't support mixed tensors and scalar numbers. To turn a number into a tensor we first use Tr::scalar. It makes a tensor with shape [1].

grad1 takes a function of one argument 𝚏 and evaluates ∇𝚏(𝚙), the derivative of 𝚏 at a given point 𝚙. You can think of ∇ as a higher-order function that takes a differentiable function 𝚏 and produces a function that evaluates the derivative.

Pronouncing ∇: I say "grad", I've heard people say "del", and the symbol's Unicode name is "nabla".

Similarly, if you have a Rust function f that evaluates the mathematical function 𝚏, then grad(f, p) computes the value ∇𝚏(𝚙).

Unlike JAX, Tensorken does not directly expose ∇𝚏 as a first-class function, mostly because I had a hard time accomplishing that in Rust and staying sane! It required returning a closure from grad(f) so you can write grad(f)(p), but satisfying the compiler proved difficult. So far this hasn't been a constraint in practice.

Like JAX, Tensorken does support applying grad to functions that themselves call grad to calculate higher-order derivatives:

let ddf = grad1(|x| grad1(|x| x.tanh(), x), &p);
let dddf = grad1(|x| grad1(|x| grad1(|x| x.tanh(), x), x), &p);

> ddf: [-0.13621868]
> dddf: [0.25265408]

Let’s try computing gradients with grad in a linear logistic regression model. In other words, a simple neural network with one neuron. First, the setup:

// Outputs probability of a label being true.
fn predict<'t, T>(w: &'t T, b: &'t T, inputs: &T) -> T
where
    T: TensorLike<'t>,
{
    (inputs.dot(w) + b).sigmoid()
}

The function predict encodes the architecture of our toy model. Its parameters are a vector w and a scalar b, for weights and bias. As you can see, we're multiplying the weights with the inputs and adding the bias. Then we use sigmoid to squish the output values in the [0, 1] interval. This model predicts the probability of an outcome based on some input measurements.

Why is this equivalent to a neural network with a single neuron? Say the vector w has three elements - three weights. We thus have three inputs as well, in inputs. The function dot multiplies each input with its corresponding weight, and then adds them up. The bias b in the neuron analogy is typically a negative number, which represents a threshold that inputs.dot(w) must exceed to "activate" the neuron.

All arguments are tensors, but are represented by a generic argument T. The type needs to be generic so automatic differentiation can work. We'll see later why. T: TensorLike is a handy constraint to make tensor operations like dot, +, and sigmoid available on T. You'll see the TensorLike constraint often when using Tensorken's AD: to make functions differentiable, replace concrete Tensor types with T: TensorLike.

Let's run the model.

// Build a toy dataset.
// These are four measurements of some unspecified variable, one in each row.
let inputs = Tr::new(
    &[4, 3],
    &[
        0.52, 1.12, 0.77, //
        0.88, -1.08, 0.15, //
        0.52, 0.06, -1.30, //
        0.74, -2.49, 1.39,
    ],
);
// These are four observed outcomes, one for each row in the input.
let targets = Tr::new(&[4], &[1.0, 1.0, 0.0, 1.0]);

// Initialize the parameters w and b randomly
let key = 0;
let mut rng = StdRng::seed_from_u64(key);
let w = Tr::randn(&[3], &mut rng);
let b = Tr::randn(&[1], &mut rng);

let prediction = predict(&w, &b, &inputs);

> prediction: [0.4059896 0.37711427 0.9770815 0.007901279]

The inputs could be "changes in temperature observed on three consecutive days" and the targets could be "temperature went up or down on the next day". We're then training a model that predicts the probability of the temperature going up given three days' changes in temperature.

Since we initialized the model randomly, its prediction is random. We got unlucky: if you compare targets (what we want) with prediction (what we have) there is a big difference. The 3rd and 4th predictions are especially bad, almost the exact opposite of the training data.

To improve our model, we first need to quantify how crap the model is via a loss function.

// Training loss is the negative log-likelihood of the training examples.
fn loss<'t, T>(w: &'t T, b: &'t T, inputs: &T, targets: &'t T) -> T
where
    T: TensorLike<'t>,
    for<'s> &'s T: TensorLikeRef<T>,
{
    let prediction = predict(w, b, inputs);
    // ones_like makes a tensor of the same shape with all values equal to 1.
    let label_probs = &prediction * targets
        + (&prediction.ones_like() - &prediction) * (targets.ones_like() - targets);
    -label_probs.log().sum(&[0])
}

let l = loss(&w, &b, &inputs, &targets);

> loss: [10.4931755]

This loss function is negative log-likelihood. You can intuit why it works: prediction is "compared" with targets in label_probs. It contains a high value for predictions that are close to the target. We then take the log of each, which exaggerates its value: the logarithm is -infinity when label_probs is zero. Since the logarithm is negative, we negate it to get a positive number. Then we take the sum of the vector so we have a single positive loss number that is high when the model is doing badly, and low when it's making good predictions.
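
To make the arithmetic concrete, here is a small sketch in plain Rust (no Tensorken) that recomputes the loss from the prediction and targets printed above. The helper name nll and the use of f64 slices are illustrative choices, not part of Tensorken.

```rust
/// Negative log-likelihood for binary targets.
/// Illustrative helper on plain slices, not Tensorken API.
fn nll(prediction: &[f64], targets: &[f64]) -> f64 {
    prediction
        .iter()
        .zip(targets)
        .map(|(p, t)| {
            // Probability the model assigned to the observed label:
            // p when the target is 1, (1 - p) when the target is 0.
            let label_prob = p * t + (1.0 - p) * (1.0 - t);
            -label_prob.ln()
        })
        .sum()
}

fn main() {
    // Values copied from the prediction and targets printed earlier.
    let prediction = [0.4059896, 0.37711427, 0.9770815, 0.007901279];
    let targets = [1.0, 1.0, 0.0, 1.0];
    println!("loss: {:.4}", nll(&prediction, &targets));
}
```

The two bad predictions dominate: the 3rd and 4th terms contribute roughly 3.8 and 4.8 of the total of about 10.49.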

Now we can improve the model by adjusting its weights and biases. We use grad to differentiate the loss function with respect to the parameters w and b:

// Differentiate loss wrt weights
let w_grad = grad1(
    |w| {
        loss(
            w,
            &Reverse::lift(&b),
            &Reverse::lift(&inputs),
            &Reverse::lift(&targets),
        )
    },
    &w,
);
print!("w_grad: {w_grad}");

// Differentiate loss wrt bias
let b_grad = grad1(
    |b| {
        loss(
            &Reverse::lift(&w),
            b,
            &Reverse::lift(&inputs),
            &Reverse::lift(&targets),
        )
    },
    &b,
);

> w_grad: [-1.0830948 2.5363755 -3.2000453]
> b_grad: [-1.2319121]

To make the types work out, we need to Reverse::lift all the arguments to loss we do NOT want to differentiate. They are treated as constants. The type is called Reverse because Tensorken uses reverse mode AD in this case. The Reverse type reveals why we need to make the arguments to loss and predict generic: the grad function, while taking a plain Tr type as the second argument, passes Reverse<Tr> to the closure. So the function f can be called with Tr, Reverse<Tr>, or other types we'll see later.

Here's the simplified signature for grad1. We'll get to the full signature later:

pub fn grad1<F>(f: F, at: &Tr) -> Tr where F: Fn(&Reverse<Tr>) -> Reverse<Tr>

Briefly, Reverse is a wrapper to interpret tensor operations so they calculate the derivative along with the main result. In this example, it'll run a different dot, +, and sigmoid compared to calling loss with plain tensors of type Tr.

Calling the loss function twice is not ideal - we're doing twice the work. We can also calculate the gradients with respect to both w and b at the same time, using grad2.

let (w_grad, b_grad) = grad2(
    |w, b| loss(w, b, &Reverse::lift(&inputs), &Reverse::lift(&targets)),
    &w,
    &b,
);

> w_grad: [-1.0830948 2.5363755 -3.2000453]
> b_grad: [-1.2319121]

Finally, let's do a single training iteration and check if that improves our model.

// Update parameters
let new_w = &w - &w_grad;
let new_b = &b - &b_grad;

// Predict
let new_prediction = predict(&new_w, &new_b, &inputs);
let new_loss = loss(&new_w, &new_b, &inputs, &targets);

> new_prediction: [0.7384342 0.99262685 0.7747804 0.9996524]
> new_loss: [1.8016509]

A massive improvement - we're now only 1.8 crap, down from 10.5!

Evaluate a function and its gradient using `value_and_grad`

In a real training run, we'd do the above in a loop while keeping an eye on the loss to see when to stop. That again means calling loss twice: once inside grad and once outside. Luckily, that isn't necessary. The value_and_grad family of functions efficiently computes a function's value and its gradient together.

let (loss_value, (w_grad, b_grad)) = value_and_grad2(
    |w, b| loss(w, b, &Reverse::lift(&inputs), &Reverse::lift(&targets)),
    &w,
    &b,
);

> loss: [10.4931755]
> w_grad: [-1.0830948 2.5363755 -3.2000453]
> b_grad: [-1.2319121]

Checking against numerical differences

Our loss improved, which is a good indication that things work. To gain confidence we can compare Tensorken's derivatives with finite differences.

// step size for finite difference
let eps = Tr::scalar(1e-4);
let half_eps = &eps / Tr::scalar(2.);
let b_grad_numerical = (loss(&w, &(&b + &half_eps), &inputs, &targets)
    - loss(&w, &(&b - &half_eps), &inputs, &targets))
    / &eps;

> b_grad_numerical [-1.2207031]
> b_grad_autodiff [-1.2319121]

Close enough.
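
The same check is easy to reproduce on a scalar function without Tensorken. The sketch below uses a hypothetical central_diff helper in f64 and compares it with the analytic derivative of tanh. The Tensorken check above agrees to only about two decimal places, which is probably down to 32-bit floats: near a loss of 10.5, adjacent f32 values are roughly 1e-6 apart, and dividing by eps = 1e-4 magnifies that to about 1e-2.

```rust
/// Central finite difference of `f` at `x` with total step `eps`.
/// Illustrative helper, not part of Tensorken.
fn central_diff(f: impl Fn(f64) -> f64, x: f64, eps: f64) -> f64 {
    (f(x + eps / 2.0) - f(x - eps / 2.0)) / eps
}

fn main() {
    let x = 2.0_f64;
    let numerical = central_diff(f64::tanh, x, 1e-4);
    // Analytic derivative of tanh: 1 - tanh(x)^2.
    let analytic = 1.0 - x.tanh().powi(2);
    println!("numerical: {numerical:.8}, analytic: {analytic:.8}");
}
```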

Jacobians using `jacfwd` and `jacrev`

Ignoring bias b for now, the loss function is a function of three parameters, represented as a single tensor w with three elements. It has a single scalar output, represented as a tensor with a single element. Taking the gradient of this function results in a vector of three elements, the sensitivity of the loss to each parameter. This picture becomes more complicated if there is more than one output parameter. grad still gives an answer, but what does it mean?

let deriv = grad1(
    |w| predict(w, &Reverse::lift(&b), &Reverse::lift(&inputs)),
    &w,
);

> deriv: [0.34956074 -0.0017646346 0.20271438]

Remember that predict returns a vector with four elements, and the input w is a vector with three elements. We get a vector with three sensitivities - one for each input. But the sensitivity of which output? There are four. As we'll check below, grad returns the sum of the sensitivity of all outputs. That's typically not what we want: we'd like to disaggregate the sensitivities.

The usual approach is to represent the sensitivity of each output with respect to each input as a matrix, called the Jacobian. In this case, a 4 by 3 matrix - number of outputs by number of inputs. Tensorken can compute Jacobians, in forward and reverse mode using jacfwd and jacrev:

let J = jacfwd(
    |w| predict(w, &Forward::lift(&b), &Forward::lift(&inputs)),
    &w,
);

> jacfwd result, with shape [4, 3]
┌                                            ┐
│ 0.12540425     0.2701015      0.18569478   │
│ 0.20671119     -0.25369102    0.03523486   │
│ 0.01164451     0.0013435973   -0.029111274 │
│ 0.0058007482   -0.019518733   0.010895999  │
└                                            ┘

let J = jacrev(
    |w| predict(w, &Reverse::lift(&b), &Reverse::lift(&inputs)),
    &w,
);

> jacrev result, with shape [4, 3]
┌                                            ┐
│ 0.12540427     0.27010152     0.18569478   │
│ 0.20671119     -0.25369102    0.03523486   │
│ 0.01164451     0.0013435973   -0.029111274 │
│ 0.005800748    -0.019518731   0.010895998  │
└                                            ┘

These two functions compute the same values (up to machine precision), but differ in their implementation: jacfwd uses forward-mode automatic differentiation, which is more efficient for "tall" Jacobian matrices, while jacrev uses reverse-mode, which is more efficient for "wide" Jacobian matrices. For matrices that are near-square, jacfwd probably has an edge over jacrev.

We can now check that grad computed the sum of the sensitivity of all four outputs:

&J.sum(&[0])

> [0.34956074 -0.0017646346 0.20271438]
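
To see where the columns of the Jacobian come from, here is a scalar sketch of the forward-mode idea in plain Rust: one pass per weight, each seeded with tangent 1 for that weight. Everything here (Dual, predict_row, this jacfwd, sum_rows) is a hand-rolled stand-in rather than Tensorken code, and the weights and bias in main are made up since the random values above aren't printed.

```rust
/// A dual number: primal value `p` and tangent `t` (illustrative, not a Tensorken type).
#[derive(Clone, Copy)]
struct Dual {
    p: f64,
    t: f64,
}

/// `predict` for one input row, with dual-number weights and a constant bias.
fn predict_row(w: &[Dual], b: f64, row: &[f64]) -> Dual {
    // Dot product plus bias; row entries and bias are constants (tangent 0).
    let mut z = Dual { p: b, t: 0.0 };
    for (wi, xi) in w.iter().zip(row) {
        z = Dual { p: z.p + wi.p * xi, t: z.t + wi.t * xi };
    }
    // sigmoid(z), whose derivative is s * (1 - s).
    let s = 1.0 / (1.0 + (-z.p).exp());
    Dual { p: s, t: s * (1.0 - s) * z.t }
}

/// Forward-mode Jacobian: one pass per weight, each pass fills one column.
fn jacfwd(w: &[f64], b: f64, inputs: &[[f64; 3]]) -> Vec<Vec<f64>> {
    let mut jac = vec![vec![0.0; w.len()]; inputs.len()];
    for col in 0..w.len() {
        // Seed tangent 1 for weight `col` and 0 for the others.
        let seeded: Vec<Dual> = w
            .iter()
            .enumerate()
            .map(|(i, &p)| Dual { p, t: if i == col { 1.0 } else { 0.0 } })
            .collect();
        for (r, row) in inputs.iter().enumerate() {
            jac[r][col] = predict_row(&seeded, b, row).t;
        }
    }
    jac
}

/// Sum the Jacobian over its rows (axis 0), as in `J.sum(&[0])`.
fn sum_rows(jac: &[Vec<f64>]) -> Vec<f64> {
    let mut out = vec![0.0; jac[0].len()];
    for row in jac {
        for (o, v) in out.iter_mut().zip(row) {
            *o += v;
        }
    }
    out
}

fn main() {
    let inputs = [
        [0.52, 1.12, 0.77],
        [0.88, -1.08, 0.15],
        [0.52, 0.06, -1.30],
        [0.74, -2.49, 1.39],
    ];
    // Hypothetical parameters, chosen only for illustration.
    let (w, b) = ([0.1, -0.2, 0.3], 0.5);
    let jac = jacfwd(&w, b, &inputs);
    println!("jacobian: {jac:?}");
    println!("summed over outputs: {:?}", sum_rows(&jac));
}
```

With three weights this takes three forward passes. A reverse-mode version would instead need one backward pass per output, so four here.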

Using a composition of jacfwd and jacrev gives us a way to compute dense Hessian matrices. Hessian matrices contain all the second derivatives.

let hessian = jacfwd(
    |w| {
        jacrev(
            |w| {
                predict(
                    w,
                    &Reverse::lift(&Forward::lift(&b)),
                    &Reverse::lift(&Forward::lift(&inputs)),
                )
            },
            w,
        )
    },
    &w,
);
println!("hessian with shape {:?}", hessian.shape());

> hessian shape [4, 3, 3]

Why this shape? We start with a function f:𝙽→𝙼. Traditionally, we'd write 𝚏:ℝⁿ→ℝᵐ, but that there are 𝙽 inputs and 𝙼 outputs is more important than that we're talking about real numbers, so I'll omit the ℝ from now on.

At a point 𝚡 ∈ 𝙽 we expect to get the shapes

𝚏(𝚡) ∈ 𝙼, the value of 𝚏 at 𝚡,
∂𝚏(𝚡) ∈ 𝙼 × 𝙽, the Jacobian matrix at 𝚡,
∂²𝚏(𝚡) ∈ 𝙼 × 𝙽 × 𝙽, the Hessian at 𝚡,

and so on.

To implement hessian we could have used jacfwd(jacrev(f)) or jacrev(jacfwd(f)) or any other composition of the two. But forward-over-reverse is typically the most efficient. That’s because, in the inner Jacobian computation, we’re often differentiating a function with a wide Jacobian (maybe like a loss function 𝚏:𝙽→1), while in the outer Jacobian computation, we’re differentiating a function with a square Jacobian (since ∇𝚏:𝙽→𝙽), which is where forward-mode wins out.

Note we now need to lift the inputs twice to make Rust's type checker happy.

That concludes the tour of Tensorken's AD capabilities. It packs a lot of punch - now let's see how to fit it in a small package.

A tale of two functions

All AD functions like jacfwd, jacrev, and grad are implemented in terms of two function-type pairs: jvp with Forward, and vjp with Reverse. JVP stands for Jacobian-vector product, and VJP stands for Vector-Jacobian product. These functions are directly inspired by JAX. To explain their names, we need some math background that deserves a standalone post. If you can't wait, refer to this section in JAX's Autodiff Cookbook.

I'll now introduce jvp and vjp, and the beginnings of how AD works in Tensorken. I assume some background knowledge about AD, in particular AD for scalar functions. See my earlier post for a primer.

From scalars to tensors

Forward AD on scalar functions works by replacing operators and functions on numbers with versions that operate on a dual number - a (f32, f32) tuple. The first element is the primal, which the function computes without AD. The second is the derivative, or tangent. Operations on dual numbers are straightforward:

apply the operation to the primal(s), and
apply differentiation rules to the tangent(s).

For example, multiplication on dual numbers is:

(p₁, t₁) . (p₂, t₂) = (p₁.p₂, p₁.t₂ + p₂.t₁)
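
As a minimal sketch of that rule, here is a scalar dual number in plain Rust with overloaded + and *. The Dual struct and its field names are illustrative and not Tensorken types.

```rust
use std::ops::{Add, Mul};

/// A scalar dual number: the primal value and its tangent.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Dual {
    primal: f64,
    tangent: f64,
}

impl Add for Dual {
    type Output = Dual;
    fn add(self, rhs: Dual) -> Dual {
        // Sum rule: tangents add.
        Dual { primal: self.primal + rhs.primal, tangent: self.tangent + rhs.tangent }
    }
}

impl Mul for Dual {
    type Output = Dual;
    fn mul(self, rhs: Dual) -> Dual {
        // Product rule: (p1, t1) . (p2, t2) = (p1.p2, p1.t2 + p2.t1)
        Dual {
            primal: self.primal * rhs.primal,
            tangent: self.primal * rhs.tangent + rhs.primal * self.tangent,
        }
    }
}

fn main() {
    // f(x) = x * x + x at x = 3, seeded with tangent 1 to get df/dx.
    let x = Dual { primal: 3.0, tangent: 1.0 };
    let y = x * x + x;
    println!("f(3) = {}, f'(3) = {}", y.primal, y.tangent);
}
```

The primal comes out as 12 and the tangent as 7, matching f'(x) = 2x + 1.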

Reverse mode is more involved. The primal computation is identical, but instead of calculating the tangent alongside the primal, we collect a trace - essentially a stack of operations. A reverse pass through the trace calculates the derivatives.

Exactly how these operations are replaced is a concern for the implementation. Common methods are code transformation in the compiler, code generation, and operator overloading. Tensorken uses trait-based overloading.

Forward mode needs little extra memory beyond bringing the tangent along for the ride, while reverse mode needs to keep a trace that's as long as the computation is deep. As a result, for scalar-to-scalar functions, forward mode is more efficient.

That situation changes if we consider functions from many scalars to one, or vice versa. One extreme is a function that takes a single input and computes n outputs. That's great for forward mode: in one execution of the function on dual numbers, we'll have both the primal result and the derivative - or in other words the sensitivity of each output to a small change in the single input.

However, a function that takes many inputs and has a single output is efficient only in reverse mode. In forward mode, we'd need as many executions of the function as there are inputs - we'd have to pass 1 as tangent for each input separately. In reverse mode, we still need the extra memory for the trace, but one forward pass for the primal and one backward pass for the partial derivatives is all we need.
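
The "one forward pass, one backward pass" claim can be sketched with a tiny scalar tape. This is a generic textbook-style construction, not how Tensorken's Reverse is implemented; Tape, Node, and all method names are made up for the example.

```rust
/// One recorded operation: for each of two parents, its index on the tape
/// and the local partial derivative of this node with respect to it.
struct Node {
    parents: [(usize, f64); 2],
}

#[derive(Default)]
struct Tape {
    nodes: Vec<Node>,
    values: Vec<f64>,
}

impl Tape {
    fn push(&mut self, value: f64, parents: [(usize, f64); 2]) -> usize {
        self.nodes.push(Node { parents });
        self.values.push(value);
        self.nodes.len() - 1
    }

    /// An input has no real parents; it points at itself with weight 0.
    fn input(&mut self, value: f64) -> usize {
        let idx = self.nodes.len();
        self.push(value, [(idx, 0.0), (idx, 0.0)])
    }

    fn add(&mut self, a: usize, b: usize) -> usize {
        let v = self.values[a] + self.values[b];
        self.push(v, [(a, 1.0), (b, 1.0)])
    }

    fn mul(&mut self, a: usize, b: usize) -> usize {
        let (va, vb) = (self.values[a], self.values[b]);
        self.push(va * vb, [(a, vb), (b, va)])
    }

    /// Backward pass: walk the trace in reverse, accumulating
    /// d(output)/d(node) for every node on the tape.
    fn grad(&self, output: usize) -> Vec<f64> {
        let mut adjoint = vec![0.0; self.nodes.len()];
        adjoint[output] = 1.0;
        for i in (0..=output).rev() {
            for (parent, local) in self.nodes[i].parents {
                adjoint[parent] += adjoint[i] * local;
            }
        }
        adjoint
    }
}

fn main() {
    // f(x, y, z) = x * y + z: three inputs, one output.
    let mut t = Tape::default();
    let (x, y, z) = (t.input(2.0), t.input(3.0), t.input(4.0));
    let xy = t.mul(x, y);
    let out = t.add(xy, z);
    // A single backward pass yields all three partial derivatives.
    let adjoint = t.grad(out);
    println!("df/dx = {}, df/dy = {}, df/dz = {}", adjoint[x], adjoint[y], adjoint[z]);
}
```

Forward mode would need three separate passes to get the same three numbers, one per input.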

The good news is that if you understand this, nothing much changes if we allow tensors instead of scalars. After all, a tensor is a container of scalars, and operations on tensors can be broken down into operations on scalars. That's not how we want to implement them though! Bulk operations are where the performance is at.

For forward mode, we'll overload tensor operations to propagate a "dual tensor", a tuple of a primal and a tangent tensor. For reverse mode, we'll build up a trace of tensor operations in the forward pass and get the tangent tensors from a backward pass.

One difference with scalar AD is that we need to take the shape of tensors into account. Besides arithmetic operations like addition and multiplication, we also need to figure out differentiation rules for sum, reshape, and others, which affect the shape of both primal and derivative tensors.

JVP for forward-mode AD

Here's the signature of jvp:

pub fn jvp1<T: Diffable + Clone, F>(f: F, at: &T, tangent: &T) -> (T, T)
where
    for<'a> F: Fn(&'a Forward<T>) -> Forward<T>,

As the 1 suffix indicates, this version takes a single primal tensor at and a single tangent tensor tangent. It evaluates the primal and tangent of the function f and returns them as a tuple.

To understand why this signature makes sense, think of AD as a program transformation. Without AD we'd write programs that boil down to:

let p1 = f1(x);
let p2 = f2(p1);
...

With forward AD and jvp we can rewrite them as:

let (p1,t1) = jvp(f1, x, x.ones_like());
let (p2,t2) = jvp(f2, p1, t1);
...

That illustrates how programs that compose functions can be transformed into programs that compose jvp-wrapped functions.
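
A scalar version of that composition, assuming hand-written derivative rules instead of Tensorken's jvp, might look like this. The function names jvp_square and jvp_sin are hypothetical.

```rust
/// Scalar stand-in for jvp on f(x) = x^2: returns (f(at), f'(at) * tangent).
fn jvp_square(at: f64, tangent: f64) -> (f64, f64) {
    (at * at, 2.0 * at * tangent)
}

/// Scalar stand-in for jvp on f(x) = sin(x).
fn jvp_sin(at: f64, tangent: f64) -> (f64, f64) {
    (at.sin(), at.cos() * tangent)
}

fn main() {
    let x = 1.5_f64;
    // Seed with tangent 1, then feed each (primal, tangent) pair to the next step.
    let (p1, t1) = jvp_square(x, 1.0);
    let (p2, t2) = jvp_sin(p1, t1);
    // By the chain rule, t2 is d/dx sin(x^2) = cos(x^2) * 2x.
    println!("sin(x^2) = {p2}, derivative = {t2}");
}
```

Threading the tangent through each step is the chain rule in action: every step multiplies the incoming tangent by its own local derivative.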

Importantly, at and tangent must have the same shape, and the two tensors in the output tuple have the same shape. f computes out.0 from at, and jvp additionally computes out.1 from at and tangent.

jvp works for any tensor-like type that implements Diffable. Diffable is the foundational trait that defines Tensorken's primitive, differentiable tensor operations. Higher-level operations like matmul are built on these. Keeping the tensor generic in jvp allows it to work with any Diffable implementation - something we'll use when doing higher-order AD. We'll come back to the details soon.

VJP for reverse-mode AD

Here is the signature for vjp:

pub fn vjp1<'b, 't, T: Diffable + Clone + 't, F>(f: F, at: &T) -> (T, PullBack<'t, T>)
where
    for<'a> F: Fn(&'a Reverse<'a, 't, T>) -> Reverse<'a, 't, T>,

It looks a bit different because reverse mode has a backward pass. What's the same are the differentiable function f: F and the primal input at. vjp calls f with Reverse wrapping the input T. Since reverse mode needs two passes, vjp only returns the primal directly.

PullBack (a term from differential geometry) is a named struct that executes the backward pass. It takes a cotangent, a tensor in the shape of the output of f, and calculates the tangent, a tensor in the shape of the input of f.

impl<T: Diffable + Clone> PullBack<'_, T> {
    pub fn call(&self, cotangent: &T) -> T
}

It's all backward! But that's why reverse mode AD is more efficient if you have the right tensor shape.
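
To get a feel for the shape of this API, here is a scalar sketch where the pullback is just a closure that captures what the backward pass needs. It is an analogy only: vjp_mul is a made-up function, and Tensorken's PullBack is a struct holding a trace, not a closure.

```rust
/// Scalar stand-in for vjp on f(a, b) = a * b.
/// Returns the primal and a pullback from an output cotangent
/// to one value per input.
fn vjp_mul(a: f64, b: f64) -> (f64, impl Fn(f64) -> (f64, f64)) {
    // The closure captures a and b, playing the role of the trace.
    (a * b, move |cotangent: f64| (cotangent * b, cotangent * a))
}

fn main() {
    let (primal, pullback) = vjp_mul(2.0, 3.0);
    println!("primal: {primal}");
    // The pullback can be called repeatedly without re-running the forward pass.
    println!("cotangent 1:  {:?}", pullback(1.0));
    println!("cotangent 10: {:?}", pullback(10.0));
}
```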

A short note on why jvp and vjp have different signatures. On the one hand, we could rewrite jvp to return a PushForward struct with a call function that works similarly to vjp's PullBack. However, that would require keeping a trace of the operations around so users can call it multiple times with different tangents. That jeopardizes the memory efficiency of forward mode. The ability to re-execute the differentiating pass with different tangents does not offset the added memory usage.

We could also write vjp with a signature like jvp by making the PullBack internal and calling at the end. In reverse mode, we have to expend the memory anyway, so we might as well make it available to the user for potential reuse.

Interpreters all the way down

We're now at the point where we can dive into the code, and it's interpreters all the way down.

Before AD, Tensorken's core was the RawTensor trait, with implementations for the CPU and the GPU. It's useful to think of this trait as the definition of a language for primitive tensor operations, and implementations of the trait as interpreters of that language. Interpreters don't necessarily have to produce a tensor - for debugging and testing a pretty-printing interpreter for RawTensor is useful:

impl RawTensor for String {
    type Elem = f32;

    fn exp(&self) -> Self {
        format!("{self}.exp()")
    }

    fn add(&self, other: &Self) -> Self {
        format!("({self} + {other})")
    }

    // etc
}

We can use it as follows:

let t1: String = RawTensor::new(&[2, 2], &[1., 2., 3., 4.]);
let t2: String = RawTensor::new(&[2, 2], &[5., 6., 7., 8.]);
let r = t1.exp().add(&t2.log());

> r: "(new([2, 2], [1.0, 2.0, 3.0, 4.0]).exp() + new([2, 2], [5.0, 6.0, 7.0, 8.0]).log())"

Or even:

let t1: String = "A".to_string();
let t2: String = "B".to_string();
let r = t1.exp().add(&t2.log());

> r: "(A.exp() + B.log())"

We could generate source code or an abstract syntax tree this way, turning the interpreter into a compiler of sorts. That is the essence of the final tagless approach I described in depth in an earlier post. It has many extensibility benefits, which we'll exploit soon.
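
For readers who want to try the idea without Tensorken, here is a self-contained miniature: one tiny "language" defined as a trait, two interpreters, and one program written once against the trait. The Lang trait and the program function are invented for this sketch and are much smaller than RawTensor.

```rust
/// A tiny language of operations, defined as a trait.
trait Lang {
    fn lit(v: f64) -> Self;
    fn add(&self, other: &Self) -> Self;
    fn exp(&self) -> Self;
}

/// Interpreter 1: evaluate to a number.
impl Lang for f64 {
    fn lit(v: f64) -> Self {
        v
    }
    fn add(&self, other: &Self) -> Self {
        self + other
    }
    fn exp(&self) -> Self {
        f64::exp(*self)
    }
}

/// Interpreter 2: pretty-print the expression.
impl Lang for String {
    fn lit(v: f64) -> Self {
        format!("{v:?}")
    }
    fn add(&self, other: &Self) -> Self {
        format!("({self} + {other})")
    }
    fn exp(&self) -> Self {
        format!("{self}.exp()")
    }
}

/// One program, written once, runnable by any interpreter.
fn program<T: Lang>() -> T {
    T::lit(0.0).exp().add(&T::lit(2.0))
}

fn main() {
    let value: f64 = program();
    let text: String = program();
    println!("{text} = {value}");
}
```

The choice of interpreter is made purely by the type annotation at the call site, which is the same mechanism that later lets one function run on plain tensors, Forward, or Reverse.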

What does this have to do with automatic differentiation? AD is achieved by hard-coding how to differentiate primitive operations like addition and multiplication, and composing those primitive rules via the chain rule. The primitive operations define a language of tensor operations which we can interpret in a few ways - in particular, as straightforward tensor operations without differentiation via Tensor, as a forward mode differentiated program via Forward, or as a reverse mode differentiated program via Reverse. As with RawTensor, we represent the primitive operations of the differentiable language as a trait, Diffable, and then implement this trait for each interpreter.

Let's start with the trait definition:

pub trait Diffable {
    type Elem: Num;

    fn log(&self) -> Self;
    fn exp(&self) -> Self;

    fn elementwise_add(&self, other: &Self) -> Self;
    fn elementwise_sub(&self, other: &Self) -> Self;
    fn elementwise_mul(&self, other: &Self) -> Self;
    fn elementwise_div(&self, other: &Self) -> Self;
    fn elementwise_pow(&self, exp: &Self) -> Self;
    fn elementwise_eq(&self, other: &Self) -> Self;

    fn sum(&self, axes: &[usize]) -> Self;
    fn max(&self, axes: &[usize]) -> Self;

    fn reshape(&self, shape: &[usize]) -> Self;
    fn permute(&self, dims: &[usize]) -> Self;
    fn expand(&self, shape: &[usize]) -> Self;
    fn pad(&self, padding: &[(usize, usize)]) -> Self;
    fn crop(&self, limits: &[(usize, usize)]) -> Self;

    fn new(shape: &[usize], data: &[Self::Elem]) -> Self;
    fn shape(&self) -> &[usize];
}

Diffable's operations are similar to RawTensor's, and we can categorize them in much the same way - unary operations, binary operations, reduce-like operations, and shape-changing operations. Missing is the optimized fused multiply-add in RawTensor, which illustrates the difference in intent between RawTensor and Diffable. While we could make RawTensor differentiable, I'll now try to convince you we don't want to.

Fused multiply-add is an optimized operation that we need on the lowest level to have some hope of efficiency. It is likely that to make Tensorken more efficient, we'll need to add more special-purpose operations to better exploit hardware primitives, reduce memory usage, and so on.

We don't (necessarily) want to figure out how to differentiate those special-purpose operations - we'd like a small set of primitive operations, define their derivatives, and then compose those into higher-level operations. We then get derivatives of those higher-level operations for free, because differentiation is so beautifully composable. Separating Diffable from RawTensor allows us to add efficient, special-purpose operations to RawTensor without figuring out their derivatives. Vice versa, we can add operations to Diffable without having to change RawTensor and its implementations.

Before Diffable, we translated user-facing operations like matrix multiplication to RawTensor operations, which were interpreted by a concrete RawTensor like CpuRawTensor. Now we add another interpreter, Diffable, between the user-facing operations and RawTensor, which not only calculates the primal results but also derivatives. Diffable interpreters execute both primal and derivative calculations as RawTensor operations. That means we can combine all implementations of Diffable with all implementations of RawTensor. So we can do forward AD on the GPU, reverse AD on the CPU, or any other combination.

Let's make our way down the interpreter layers to see how this works in practice. We'll start with matrix multiplication and end up at CpuRawTensor.

Each of the sections that follow is one layer of the interpreter lasagne:

High-level tensor operations like matmul are translated to Diffable operations like sum and elementwise_mul.
A Diffable interpreter like Forward and Reverse translates primitive operations like sum and elementwise_mul to RawTensor operations, adding calculation of derivatives.
A RawTensor interpreter like CpuRawTensor executes the operations on a particular device.

A nicely layered design. Is it lunchtime yet? Photo by Parnis Azimi on Unsplash

User-facing layer: matrix multiplication in terms of `Diffable`

Here is a sketch of matmul, omitting everything that is not an operation on Diffable:

pub trait DiffableExt: Diffable
{
    fn matmul(&self, other: &Self) -> Self {
        // preconditions, shape manipulation omitted
        // special cases omitted

        let l = self.reshape(&l_shape);
        // shape manipulation omitted
        let r = other
            .reshape(&r_shape)
            .transpose(r_shape.ndims() - 1, r_shape.ndims() - 2);

        // after multiply: [..., m, o, n]
        let prod = l.mul(&r);

        // after sum: [..., m, o, 1]
        let sum = prod.sum(&[prod.shape().ndims() - 1]);

        // after reshape: [..., m, o]
        let s = sum.shape();
        sum.reshape(&s[..s.ndims() - 1])
    }
}

Tensorken has three implementations of Diffable: Tensor, Forward, and Reverse. Tensor doesn't do any differentiation at all - it translates Diffable to RawTensor operations. Forward and Reverse augment the operations with their respective mode of AD. We'll come back to these later - first, we need to find a Rust vehicle to put the user-facing operations that are not in Diffable. We could re-implement them on each implementation of Diffable, but that is redundant. Instead, I've defined DiffableExt, a sub-trait of Diffable with a blanket implementation:

pub trait DiffableExt: Diffable
{
    // all the fns we want, like matmul, go here.
    // They'll need to be defined in terms of Diffable,
    // because that's all that's available.

    fn matmul(&self, other: &Self) -> Self { ... }
}

impl<T: Diffable> DiffableExt for T {}

The advantage is that we only have to implement Diffable on a concrete type. Anything defined on DiffableExt is then available too, as long as DiffableExt is in scope.
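
The extension-trait-plus-blanket-impl pattern is plain Rust and works outside Tensorken too. Here is a minimal sketch with made-up traits Prim and PrimExt standing in for Diffable and DiffableExt.

```rust
/// Primitive operations (a stand-in for Diffable).
trait Prim {
    fn mul(&self, other: &Self) -> Self;
    fn add(&self, other: &Self) -> Self;
}

/// Derived operations, written once in terms of the primitives
/// (a stand-in for DiffableExt).
trait PrimExt: Prim + Sized {
    fn square(&self) -> Self {
        self.mul(self)
    }

    fn sum_of_squares(&self, other: &Self) -> Self {
        self.square().add(&other.square())
    }
}

/// Blanket implementation: every type that is Prim gets PrimExt for free.
impl<T: Prim> PrimExt for T {}

/// Only the primitives need implementing for a concrete type.
impl Prim for f64 {
    fn mul(&self, other: &Self) -> Self {
        self * other
    }
    fn add(&self, other: &Self) -> Self {
        self + other
    }
}

fn main() {
    // sum_of_squares was never implemented for f64 directly.
    println!("{}", 3.0_f64.sum_of_squares(&4.0));
}
```

Because the blanket impl covers every Prim, nobody can override square for one particular type. That is a trade-off of this pattern: the derived operations behave the same for every implementation.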

The first `Diffable` implementation: Tensor

We now need a concrete type to present to users. Tensor is that type. Its definition is mysteriously simple:

pub struct Tensor<T>(T);

The idea is that the generic type argument T is a Diffable. Why not add the type constraint here? Because it's unnecessary - for all interesting implementations, T is Diffable. Constraining T here adds nothing new.

We can now make Tensor<T> implement Diffable for any T that's Diffable:

impl<T: Diffable> Diffable for Tensor<T> {
    type Elem = T::Elem;

    fn log(&self) -> Self {
        Tensor(self.0.log())
    }

    // etc
}

All operations delegate to T. Full implementation here.

From `Diffable` to `RawTensor`

That gets us nowhere - we can have a differentiable Tensor<T> if we have a differentiable T. To execute tensor operations we need to get to a RawTensor. We can do that by interpreting Diffable operations as RawTensor operations. In Rust, this means creating a blanket implementation of Diffable for any RawTensor:

impl<T: Num, TTensor: RawTensor<Elem = T>> Diffable for TTensor {
    type Elem = T;

    fn log(&self) -> Self {
        self.log()
    }

    // etc
}

Since Diffable is a subset of RawTensor, the implementation is again straightforward. A type like Tensor<CpuRawTensor> now works, and we can apply all operations in Diffable and DiffableExt to it.

It seems like we went around in a big circle. After Tensorken parts 1 and 2, we had a Tensor<T: RawTensor> with high-level operations like matmul and primitive operations on RawTensor. Now we have Tensor<T: Diffable> with high-level operations like matmul moved to DiffableExt, differentiable primitive operations on Diffable, and primitive executable operations still on RawTensor.

What we gained is the ability to have other Diffable implementations. We're going to use that ability now.

Forward-mode AD with `Forward`

The Forward type wraps T with extra stuff so we can transform and trace the computation to calculate the derivative. In interpreter terms, Forward is an interpreter for the Diffable language that calculates the derivative alongside the primal result using forward-mode AD. It does that by applying all tensor operations on a dual tensor.

The Forward type:

pub enum Forward<T> {
    Lift(T),
    Forward(T, T),
}

As with Tensor<T>, the T here is a Diffable tensor. The Forward case should make sense - it's the primal and the tangent tensors. We use the Lift case if we're not interested in computing the derivative of a tensor. Lifted tensors are treated as constants for the derivative computation. Another way of saying this is that their derivative is zero. We avoid many multiplications with zero by having a dedicated case instead of using the functionally equivalent Forward(t, zero).

We can understand jvp1's implementation now:

pub fn jvp1<T: Diffable + Clone, F>(f: F, at: &T, tangent: &T) -> (T, T)
where
    for<'a> F: Fn(&'a Forward<T>) -> Forward<T>,
{
    let forward = Forward::Forward(at.clone(), tangent.clone());
    let result = f(&forward);

    match result {
        Forward::Lift(p) => (p.clone(), p.zeros_like()),
        Forward::Forward(p, t) => (p, t),
    }
}

We wrap the at and tangent arguments in Forward, then call f with them and unwrap the Forward from the result.

Forward must implement Diffable for this to work. Finally, we come to the implementation of differentiation rules for the primitive operations:

impl<T: Clone + Diffable> Diffable for Forward<T> {
    type Elem = T::Elem;

    fn elementwise_mul(&self, rhs: &Self) -> Self {
        self.binary::<MulOp<T>>(rhs)
    }

    fn sum(&self, axes: &[usize]) -> Self {
        self.unary::<SumOp, _>(axes)
    }

    // etc
}

Full implementation here.

The calculation of the primal and the derivatives is encapsulated in Op structs. The unary and binary functions handle the Lift and Forward enum cases in one place, and delegate to a given Op struct for the calculation:

impl<T: Diffable> Forward<T> {
    fn unary<Op: UnaryOp<T, Args = TArgs> + UnaryDiffOp<T>, TArgs: ?Sized>(
        &self,
        args: &TArgs,
    ) -> Self {
        let (primal, op) = Op::f(self.primal(), args);
        match self {
            Forward::Lift(_) => Forward::Lift(primal),
            Forward::Forward(_, tan) => Self::Forward(primal, op.dfda(tan)),
        }
    }
}

binary is similar but more involved because it has 4 combinations of Lift and Forward.
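
To illustrate the four combinations, here is a scalar sketch of multiplication on a Lift/Forward enum. It is not Tensorken's binary function, which is generic over Op structs and works on tensors. The Fwd type below is a simplified stand-in with the multiplication rule written inline.

```rust
/// Scalar stand-in for Forward<T>: a constant, or a (primal, tangent) pair.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Fwd {
    Lift(f64),
    Forward(f64, f64),
}

impl Fwd {
    fn primal(&self) -> f64 {
        match self {
            Fwd::Lift(p) | Fwd::Forward(p, _) => *p,
        }
    }

    fn mul(&self, rhs: &Fwd) -> Fwd {
        let primal = self.primal() * rhs.primal();
        match (self, rhs) {
            // Constant times constant stays a constant: no tangent to track.
            (Fwd::Lift(_), Fwd::Lift(_)) => Fwd::Lift(primal),
            // Only the left side varies: da * b.
            (Fwd::Forward(_, ta), Fwd::Lift(b)) => Fwd::Forward(primal, ta * b),
            // Only the right side varies: db * a.
            (Fwd::Lift(a), Fwd::Forward(_, tb)) => Fwd::Forward(primal, tb * a),
            // Both vary: the full product rule.
            (Fwd::Forward(a, ta), Fwd::Forward(b, tb)) => {
                Fwd::Forward(primal, ta * b + tb * a)
            }
        }
    }
}

fn main() {
    let x = Fwd::Forward(3.0, 1.0);
    let c = Fwd::Lift(2.0);
    println!("{:?}", x.mul(&c)); // derivative of 2x
    println!("{:?}", x.mul(&x)); // derivative of x^2
    println!("{:?}", c.mul(&c)); // still a constant
}
```

The Lift cases skip the multiplications by a zero tangent entirely, which is the saving a dedicated Lift case buys.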

Here's MulOp:

pub(crate) struct MulOp<TTensor>(TTensor, TTensor);

impl<TTensor: Clone + Diffable> BinaryOp<TTensor> for MulOp<TTensor> {
    fn f(a: &TTensor, b: &TTensor) -> (TTensor, Self) {
        (a.elementwise_mul(b), MulOp(a.clone(), b.clone()))
    }
}

impl<TTensor: Diffable> BinaryDiffOp<TTensor> for MulOp<TTensor> {
    fn dfda(&self, d: &TTensor) -> TTensor {
        d.elementwise_mul(&self.1) // da * b
    }

    fn dfdb(&self, d: &TTensor) -> TTensor {
        d.elementwise_mul(&self.0) // db * a
    }
}

Differentiation rules often capture intermediate results or arguments of the primal computation. So f returns not only the result of the primal computation but also a struct to store whatever data is needed for the derivative computation. For MulOp, it captures the input tensors a and b.

dfda and dfdb define how to compute the derivative with respect to the first and second argument, given d, the derivative of downstream functions. The differentiation rule for elementwise tensor multiplication is essentially the same as for scalar multiplication.

Unary operations are similar but don't define dfdb:

pub(crate) struct SumOp(Vec<usize>);

impl<TTensor: Diffable> UnaryOp<TTensor> for SumOp {
    type Args = [usize];
    fn f(a: &TTensor, axes: &Self::Args) -> (TTensor, Self) {
        let r = a.sum(axes);
        (r, SumOp(axes.to_vec()))
    }
}

impl<TTensor: Diffable> UnaryDiffOp<TTensor> for SumOp {
    fn dfda(&self, d: &TTensor) -> TTensor {
        d.sum(&self.0)
    }
}

SumOp only needs the reduced axes from the primal computation to calculate dfda. The derivative of the sum is the sum of the derivatives, so we can apply the same sum to primal and tangent.

You can find all the ops here and here.

`Forward<Forward<T>>` for higher order derivatives

Reiterating this signature:

pub fn jvp1<T: Diffable + Clone, F>(f: F, at: &T, tangent: &T) -> (T, T)
where
    for<'a> F: Fn(&'a Forward<T>) -> Forward<T>

Since the only requirement on T is that it's Diffable and Forward is Diffable, besides a Tensor<T> we can pass a Forward<T> to jvp1 to calculate higher-order derivatives.

let p: Tensor<CpuRawTensor<f32>> = Tr::scalar(2.0);
let ddf = diff1(|x: &Forward<Tensor<_>>| 
            diff1(|x: &Forward<Forward<Tensor<_>>>| x.tanh(), x), 
            &p
        );

As we'll see soon, this same design will allow us to combine forward and reverse modes up to arbitrary depth, by building up types like Forward<Reverse<Tensor<..>>>.

This scheme works because the Diffable operations are implemented in terms of Diffable. That looks circular, but it's not: it's a stack of interpreters with a Tensor at the bottom, which translates Diffable operations to RawTensor operations:

Forward<Forward<...>>: Diffable ->... -> Tensor: Diffable -> RawTensor

Somewhat surprisingly, differentiating a differentiated program gets us the second derivative. One way to make sense of that is that if you compute second or third derivatives symbolically, that's exactly what you do: you apply the differentiation rules multiple times. If you want to do a "fun" exercise, you can work out that stacking Forward types amounts to operating on duals-of-duals up to the desired order. If you work out the differentiation rules by hand, you'll find that it yields the correct higher-order derivative.
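If you'd rather let the compiler do that exercise, here is a small self-contained sketch of duals-of-duals on scalars. The Num trait and the Dual struct are hypothetical, cut-down stand-ins for Diffable and Forward (there is no Lift case and there are only two operations), so treat it as an illustration of the idea rather than as Tensorken code:

```rust
/// Hypothetical mini version of Diffable: just the two operations this sketch needs.
trait Num: Clone {
    fn add(&self, rhs: &Self) -> Self;
    fn mul(&self, rhs: &Self) -> Self;
}

impl Num for f64 {
    fn add(&self, rhs: &Self) -> Self {
        self + rhs
    }
    fn mul(&self, rhs: &Self) -> Self {
        self * rhs
    }
}

/// A dual number: (primal, tangent). It works over any Num, including another Dual.
#[derive(Clone)]
struct Dual<T>(T, T);

impl<T: Num> Num for Dual<T> {
    fn add(&self, rhs: &Self) -> Self {
        Dual(self.0.add(&rhs.0), self.1.add(&rhs.1))
    }
    fn mul(&self, rhs: &Self) -> Self {
        // Product rule: (a * b)' = a * b' + a' * b
        Dual(
            self.0.mul(&rhs.0),
            self.0.mul(&rhs.1).add(&self.1.mul(&rhs.0)),
        )
    }
}

/// f(x) = x^3, written once against the trait.
fn cube<T: Num>(x: &T) -> T {
    x.mul(x).mul(x)
}

/// Returns (f(x), f'(x), f''(x)) by differentiating a differentiated program.
fn cube_with_derivatives(x: f64) -> (f64, f64, f64) {
    // The inner dual seeds dx = 1 for the first derivative.
    let inner = Dual(x, 1.0);
    // The outer dual seeds dx = 1 again, on top of the inner dual.
    let outer = Dual(inner, Dual(1.0, 0.0));
    let Dual(Dual(f, df), Dual(_, ddf)) = cube(&outer);
    (f, df, ddf)
}
```

For x = 3 this yields (27, 27, 18), matching x³, 3x² and 6x. The second derivative sits in the tangent of the tangent.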

Reverse-mode AD with `Reverse`

The implementation of Diffable for Reverse follows the same pattern as Forward but is more involved. Because reverse mode accumulates the derivative in a separate backward pass, we can no longer compute everything on the fly when we compute the primal. Instead, Reverse builds a trace of operations in a forward pass while calculating the primal result, then accumulates derivatives in the backward pass.

The difference with forward mode is visible in the signature of vjp:

pub fn vjp1<'b, 't, T: Diffable + Clone + 't, F>(f: F, at: &T) -> (T, PullBack<'t, T>)
where
    for<'a> F: Fn(&'a Reverse<'a, 't, T>) -> Reverse<'a, 't, T>

Like jvp, it returns the primal result. Unlike jvp, it doesn't return the tangent, but instead a PullBack struct. The only available operation on that is call:

pub fn call(&self, cotangent: &T) -> Vec<T>
    where
        T: Diffable + Clone,

This takes a cotangent tensor - in other words, a tensor with the same shape as the result of f - and returns the tangents of all the arguments of f. Here's how vjp is used:

pub fn value_and_gradn<'t, T: Diffable + Clone + 't, F>(f: F, at: &[&T]) -> (T, Vec<T>)
where
    for<'a> F: Fn(&'a [Reverse<'a, 't, T>]) -> Reverse<'a, 't, T>,
{
    // one forward pass, tracing
    let (primal, pullback) = vjpn(f, at);
    // one backward pass, accumulating derivatives
    let tangents = pullback.call(&primal.ones_like());
    // but we get multiple tangents in one go
    (primal, tangents)
}

Other implementations for grad functions follow a similar pattern.

The details of how this is implemented (via a Trace type) are explained in my post on AD, so I won't repeat them here. It is not substantially different from the scalar case. Briefly, here is the Reverse type:

pub enum Reverse<'a, 't, T> {
    Lift(T),
    Reverse(&'a Trace<'t, T>, T, usize),
}

Like Forward, it has a Lift case for tensors we don't want to differentiate. The Reverse case contains the primal T, and some administrative data to record the trace and do the backward pass.
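As a rough illustration of what that administrative data does, here is a scalar-only sketch of tracing and the backward pass. Everything in it (Node, Trace, Var and their methods) is a hypothetical simplification written for this example; Tensorken's Trace works on tensors and has a Lift case, which this leaves out:

```rust
use std::cell::RefCell;

/// One traced operation: up to two parents, each with its local partial derivative.
struct Node {
    parents: [(usize, f64); 2],
}

/// Hypothetical scalar trace, filled in during the forward pass.
struct Trace {
    nodes: RefCell<Vec<Node>>,
}

/// Scalar counterpart of the Reverse case: trace reference, primal, index into the trace.
#[derive(Clone, Copy)]
struct Var<'a> {
    trace: &'a Trace,
    primal: f64,
    index: usize,
}

impl Trace {
    fn new() -> Self {
        Trace { nodes: RefCell::new(Vec::new()) }
    }

    fn push(&self, parents: [(usize, f64); 2]) -> usize {
        let mut nodes = self.nodes.borrow_mut();
        nodes.push(Node { parents });
        nodes.len() - 1
    }

    /// Registers an input. Inputs have no parents, encoded here as zero-weight self-edges.
    fn var(&self, primal: f64) -> Var<'_> {
        let index = self.nodes.borrow().len();
        self.push([(index, 0.0), (index, 0.0)]);
        Var { trace: self, primal, index }
    }
}

impl<'a> Var<'a> {
    fn add(self, rhs: Var<'a>) -> Var<'a> {
        let index = self.trace.push([(self.index, 1.0), (rhs.index, 1.0)]);
        Var { trace: self.trace, primal: self.primal + rhs.primal, index }
    }

    fn mul(self, rhs: Var<'a>) -> Var<'a> {
        let index = self.trace.push([(self.index, rhs.primal), (rhs.index, self.primal)]);
        Var { trace: self.trace, primal: self.primal * rhs.primal, index }
    }

    /// Backward pass: walk the trace in reverse, accumulating derivatives into the parents.
    fn grad(self) -> Vec<f64> {
        let nodes = self.trace.nodes.borrow();
        let mut adjoints = vec![0.0; nodes.len()];
        adjoints[self.index] = 1.0;
        for i in (0..nodes.len()).rev() {
            let d = adjoints[i];
            for &(parent, weight) in nodes[i].parents.iter() {
                adjoints[parent] += weight * d;
            }
        }
        adjoints
    }
}
```

For f(x, y) = x * y + x at (3, 4), the forward pass records four nodes and produces 15. A single call to grad then returns the derivative with respect to every node, including 5 for x and 3 for y.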

The implementation of Diffable looks similar to Forward:

impl<T: Clone + Diffable> Diffable for Reverse<'_, '_, T> {
    type Elem = T::Elem;

    fn elementwise_mul(&self, rhs: &Self) -> Self {
        self.binary::<MulOp<T>>(rhs)
    }

    fn sum(&self, axes: &[usize]) -> Self {
        self.unary::<SumOp, _>(axes)
    }

    // other omitted
}

Again we have unary and binary helper methods to deal with Lift and call the appropriate functions on the Op structs.

Interestingly, even though reverse mode calculates derivatives backward, from the output to the input, MulOp is identical for forward and reverse mode. This is true for all elementwise operations.

sum however is different from forward mode. In the backward pass, we get a d in the shape of the result of the sum (i.e. with fewer elements) and we need to produce a tensor in the shape of the input of sum. To do that, we need expand:

pub(crate) struct SumOp(Vec<usize>);

impl<TTensor: Diffable> UnaryOp<TTensor> for SumOp {
    type Args = [usize];
    fn f(a: &TTensor, axes: &Self::Args) -> (TTensor, Self) {
        let r = a.sum(axes);
        (r, SumOp(a.shape().to_vec()))
    }
}

impl<TTensor: Diffable> UnaryDiffOp<TTensor> for SumOp {
    fn dfda(&self, d: &TTensor) -> TTensor {
        d.expand(&self.0)
    }
}
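To see why expand is the right rule, it may help to strip away the tensor machinery. The sketch below uses a plain row-major slice as a [rows, cols] matrix; both function names are made up for this example, and cols is assumed to be non-zero:

```rust
/// Sums the last axis of a row-major [rows, cols] matrix, giving shape [rows, 1].
fn sum_last_axis(a: &[f32], cols: usize) -> Vec<f32> {
    a.chunks(cols).map(|row| row.iter().sum::<f32>()).collect()
}

/// Reverse-mode rule for that sum: expand d from [rows, 1] back to [rows, cols].
/// Every input element contributed to its row's sum with weight 1,
/// so each one receives a copy of that row's incoming derivative.
fn sum_last_axis_backward(d: &[f32], cols: usize) -> Vec<f32> {
    d.iter()
        .flat_map(|&di| std::iter::repeat(di).take(cols))
        .collect()
}
```

A real expand doesn't copy anything, it only changes strides, but the values it exposes are the same as the ones materialized here.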

Full implementation for reverse mode is in ad_reverse.rs and the reverse operations are in ad_ops_reverse.rs.

After all that, we can run all the examples in the demo section. However, there is one remaining issue.

Un-blowing up matmul, again

The problem is serious. Repeating the (pseudo-code) implementation of matmul in DiffableExt:

pub trait DiffableExt: Diffable
{
    fn matmul(&self, other: &Self) -> Self {
        // preconditions, shape manipulation omitted
        // special cases omitted

        let l = self.reshape(&l_shape);
        // shape manipulation omitted
        let r = other
            .reshape(&r_shape)
            .transpose(r_shape.ndims() - 1, r_shape.ndims() - 2);

        // TROUBLE BEGINS HERE
        // after multiply: [..., m, o, n]
        let prod = l.mul(&r);

        // after sum: [..., m, o, 1]
        let sum = prod.sum(&[prod.shape().ndims() - 1]);

        // after reshape: [..., m, o]
        let s = sum.shape();
        sum.reshape(&s[..s.ndims() - 1])
    }
}

See that mul followed by sum? In the second part of Tensors from Scratch, I explained that this blows up memory, to the point where this approach is utterly unscalable. The fused multiply-add function in RawTensor came to the rescue - we rewrote the separate sum and mul calls into one l.fused_multiply_add(&r, dims), which made it efficient. Now we've regressed to the previous bad situation. What gives?

First, Diffable doesn't have fused_multiply_add, so we can't write the optimized version directly. We could add fused_multiply_add to Diffable as a primitive operation, but then we have to define a differentiation rule for it in the various modes. One of the main reasons for Diffable's existence is to avoid that.

Second, while manually fusing mul and sum worked for this particular case, users may inadvertently write a mul followed by a sum, and fall into this trap themselves. Worse, while we're calculating derivatives by composing operations in forward or reverse mode, Tensorken itself may introduce a mul followed by a sum. Manually fusing all cases is not going to work. We need a better solution.

If we were writing a compiler, it'd be straightforward to go through the abstract syntax tree of tensor operations and transform any l.mul(r).sum(axes) into l.fused_multiply_add(r, axes). Can we do a similar optimization here?

Let's think about what's happening from the perspective of interpreters. We have defined a language for writing differentiable programs using the trait Diffable. Everything we do with tensors - matmul, crop, max, sigmoid as well as getting derivatives, is eventually a program in terms of the operations on Diffable. We have three interpreters for Diffable - one that translates the differentiable program to RawTensor operations, and two that augment the differentiable program with forward or reverse mode AD. No matter how many times we stack Diffable on top of Diffable, eventually the program gets run via a RawTensor interpreter.

We only have concrete RawTensor interpreters so far - that calculate the results on CPU or GPU, or that print a string representing the result. But we can also write an interpreter that spits out a new, optimized RawTensor interpreter, with all mul + sum fused into fused_multiply_add.

This technique - which I didn't invent at all, to be clear - is introduced more gradually and gracefully in my post on typed tagless final interpreters. Here I'll give a whirlwind tour of the implementation.

We'll use a type called Fuse<T>. T is the target optimized RawTensor. Whenever mul followed by sum is detected in the unoptimized, original RawTensor, Fuse rewrites the two operations to a fused equivalent.

enum FuseCtx {
    Sum(Vec<usize>),
    NotSum,
}

pub struct Fuse<T>(Rc<dyn Fn(&FuseCtx) -> T>);

The function from FuseCtx to the fused T: RawTensor is a factory function we'll build up while interpreting the original RawTensor as Fuse<T>. In other words, Fuse<T> interprets RawTensor as a function that, given a FuseCtx, produces an optimized RawTensor. It works in two passes: a first pass builds up the factory function, then a second pass runs that function to get a new RawTensor.

Since Fuse only needs to fuse multiply and sum operations, it delays the application of sum, and instead passes Sum(axes) to the continuation via FuseCtx. The continuation calls the delayed sum if it can't fuse or fused_multiply_add if it can. Here's the implementation of mul where fusing happens:

impl<TRaw: RawTensor + Clone + 'static> RawTensor for Fuse<TRaw> {
    type Elem = TRaw::Elem;

    fn mul(&self, other: &Self) -> Self {
        let f_lhs = Rc::clone(&self.0);
        let f_rhs = Rc::clone(&other.0);
        let nextctx = FuseCtx::NotSum;
        Fuse::new(move |ctx| match ctx {
            FuseCtx::Sum(axes) => f_lhs(&nextctx).fused_multiply_add(&f_rhs(&nextctx), axes),
            FuseCtx::NotSum => f_lhs(&nextctx).mul(&f_rhs(&nextctx)),
        })
    }
}

The context passed in the closure represents what the next operation is, from the perspective of the current operation. If it's sum, the Sum enum case, we fuse. If it's anything else, represented by NotSum, we know the operation has already been applied and we can't fuse. Since mul is not a sum, we pass NotSum as the next context.

Here is the implementation of sum:

fn sum(&self, axes: &[usize]) -> Self {
    let f = Rc::clone(&self.0);
    let my_axes = axes.to_vec();
    Fuse::new(move |ctx| match ctx {
        FuseCtx::Sum(sum_axes) => f(&FuseCtx::Sum(combine_axes(&my_axes, sum_axes))),
        FuseCtx::NotSum => f(&FuseCtx::Sum(my_axes.clone())),
    })
}

We do not apply sum straight away to the resulting interpreter. Instead, we pass Sum through to the next operation, so it gets a chance to fuse it. Any operation that doesn't fuse needs to apply the delayed sum if it gets the Sum enum case. We might as well fuse consecutive sum calls into one by combining axes, hence the first match arm.

Fusing happens in two passes: the first pass builds the FuseCtx -> RawTensor function. The second pass creates the optimized RawTensor by calling the function:

impl<T> Fuse<T> {
    fn run(&self) -> T {
        (self.0)(&FuseCtx::NotSum)
    }
}
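To watch the two passes at work end to end, here is a toy version that fits in one file. It is a sketch under heavy assumptions: the Raw trait is a made-up three-method stand-in for RawTensor, sum reduces over everything so there are no axes to combine, and the target interpreter Show merely prints the program it receives:

```rust
use std::rc::Rc;

/// Hypothetical, cut-down stand-in for RawTensor.
trait Raw {
    fn mul(&self, other: &Self) -> Self;
    fn sum(&self) -> Self;
    fn fused_multiply_add(&self, other: &Self) -> Self;
}

/// A target interpreter that renders the operations it receives as a string.
#[derive(Clone)]
struct Show(String);

impl Raw for Show {
    fn mul(&self, other: &Self) -> Self {
        Show(format!("mul({}, {})", self.0, other.0))
    }
    fn sum(&self) -> Self {
        Show(format!("sum({})", self.0))
    }
    fn fused_multiply_add(&self, other: &Self) -> Self {
        Show(format!("fma({}, {})", self.0, other.0))
    }
}

/// What the next operation is, from the perspective of the current one.
enum Ctx {
    Sum,
    NotSum,
}

struct Fuse<T>(Rc<dyn Fn(&Ctx) -> T>);

impl<T: Raw + Clone + 'static> Fuse<T> {
    fn new(f: impl Fn(&Ctx) -> T + 'static) -> Self {
        Fuse(Rc::new(f))
    }

    /// Wraps an existing target tensor. It can't fuse, so it applies a delayed sum itself.
    fn leaf(t: T) -> Self {
        Fuse::new(move |ctx: &Ctx| match ctx {
            Ctx::Sum => t.sum(),
            Ctx::NotSum => t.clone(),
        })
    }

    /// Second pass: run the factory function to get the optimized target.
    fn run(&self) -> T {
        (self.0)(&Ctx::NotSum)
    }
}

impl<T: Raw + Clone + 'static> Raw for Fuse<T> {
    fn mul(&self, other: &Self) -> Self {
        let (l, r) = (Rc::clone(&self.0), Rc::clone(&other.0));
        Fuse::new(move |ctx: &Ctx| match ctx {
            Ctx::Sum => l(&Ctx::NotSum).fused_multiply_add(&r(&Ctx::NotSum)),
            Ctx::NotSum => l(&Ctx::NotSum).mul(&r(&Ctx::NotSum)),
        })
    }

    fn sum(&self) -> Self {
        let f = Rc::clone(&self.0);
        // Don't sum yet: tell the previous operation that a sum follows.
        // Summing everything twice equals summing once, so the incoming context can be ignored.
        Fuse::new(move |_ctx: &Ctx| f(&Ctx::Sum))
    }

    fn fused_multiply_add(&self, other: &Self) -> Self {
        let (l, r) = (Rc::clone(&self.0), Rc::clone(&other.0));
        Fuse::new(move |ctx: &Ctx| {
            let t = l(&Ctx::NotSum).fused_multiply_add(&r(&Ctx::NotSum));
            match ctx {
                Ctx::Sum => t.sum(),
                Ctx::NotSum => t,
            }
        })
    }
}
```

With a and b as leaves, a.mul(&b).sum().run() renders as fma(a, b), whereas a.sum().mul(&b).run() stays mul(sum(a), b) because there is nothing to fuse.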

Link to full implementation of fusing.

Now I can finally reveal the full Cpu32 and Wgpu32 types:

pub type Cpu32 = Tensor<ShapeTracker<Fuse<CpuRawTensor<f32>>>>;
pub type Wgpu32<'d> = Tensor<ShapeTracker<Fuse<WgpuRawTensor<'d, f32>>>>;

The remaining unknown there is ShapeTracker. ShapeTracker is a RawTensor implementation that abstractly interprets the operations by only tracking tensor shapes. It delegates all operations to the RawTensor it wraps, except shape:

pub struct ShapeTracker<T>(ShapeStrider, T);

/// This implementation passes every operation through
/// to self.1, except for shape.
impl<TRaw: RawTensor> RawTensor for ShapeTracker<TRaw> {
    type Elem = TRaw::Elem;

    fn exp(&self) -> Self {
        Self(self.0.clone(), self.1.exp())
    }

    // etc

    fn shape(&self) -> &[usize] {
        self.0.shape()
    }
}

Because Fuse does not track shapes but does need to implement RawTensor::shape, it can only return its shape by running the delayed computation. We don't want that - some derivative operations require access to the shape of tensors, and it would be bad if we had to run the tensor computation at that point.

ShapeTracker solves this for us - it can answer shape queries without executing tensor operations. The order is important here. ShapeTracker needs to wrap Fuse which needs to wrap the concrete CpuRawTensor or WgpuRawTensor.

I love it when a plan comes together

Thanks to the power of interpreters aka final tagless encoding, Tensorken gained a capable yet small and extensible AD implementation. So far, I'm really happy with how Tensorken turned out. I started seriously researching deep learning from an implementation perspective at the beginning of 2023 with only some prior exposure to automatic differentiation. I randomly ran into the typed tagless final interpreters paper while I was studying TinyGrad, and figured that TinyGrad's style would lend itself well to the tagless final style. I could not have hoped for a better outcome.

After that, I saw a post on Reddit praising JAX and immediately preferred the functional style over PyTorch's imperative AD interface. It was much more challenging to implement in Rust though! Those signatures look straightforward now, but it took a lot of struggling with closures, lifetimes and lifetimes and closures and then lifetimes some more before everything came together. All this to say - I got lucky trying an implementation style I hadn't ever used and struggled for a long time. When AD finally worked, it felt almost magical. Persistence is worth some IQ points.

Now that Tensorken has all the pieces of a full-fledged deep learning library, it's time to put it to the test. I intend to follow along with Andrej Karpathy's Zero to Hero neural networks course, translating it from PyTorch to Tensorken. At the end of that, we should have a home-grown, walking and talking nanoGPT. Without a doubt, there'll be many interesting problems in Tensorken itself to solve along the way.

Many thanks for reading!

References

JAX: The Autodiff Cookbook. I adapted some parts for this post.
JAX: Autodidax - JAX core from scratch. Great insight into how JAX works under the hood, using a small implementation of JAX.
[Video: Automatic Differentiation by Matthew Johnson]. Matthew Johnson is one of the authors of JAX. Here he talks about the principles underlying Autograd, another Python-based AD library.
TinyGrad. A small but powerful tensor library with reverse-mode AD in Python.
Swift's Differentiable Programming Manifesto. Swift has a powerful differentiable programming component, integrated with the compiler.
Swift For Tensorflow (Google Drive). A great overview of Swift's approach to AD.
DiffSharp. A tensor library with support for differentiable programming for .NET.

Massively Parallel Fun with GPUs: Accelerating Tensors in Rust

Kurt — Sat, 08 Jul 2023 21:26:05 +0000

In this second installment of Tensors from Scratch, I'll walk through an implementation of tensor operations on the GPU using wgpu. Wgpu is a Rust implementation of the WebGPU working draft standard, which aims to make GPUs accessible to browsers.

I'll build on the CPU implementation of tensors from the first post, in the aspirational tensor library I'm building called Tensorken. All code for Tensorken is on GitHub. The version this post discusses is tagged v0.2, and all links are to that version.

I'm not assuming any knowledge of GPU programming, which is the state I started from before I wrote all this. I do assume proficiency with programming on the CPU, aka typical software engineering experience. As usual, some Rust experience is helpful but not strictly necessary.

First, I'll recap some tensor-related terms and basic tensor operations introduced in the last post. This post implements those same operations on the GPU, parallelizing them. As a result, they'll execute much quicker than my earlier, admittedly naïve CPU implementations. Feel free to skip the recap if you don't need the refresher.

This post introduces a new set of GPU-related terms, explains how GPUs work on a high level and meanders to a well-known parallel programming building block, the parallel prefix sum. Brace yourself for a long read!

Previously in Tensors From Scratch: neural networks and matrix multiplication

Last time, I argued that neural networks don't have much to do with either neurons or networks. Modern neural networks, for example large language models (LLMs) like OpenAI's ChatGPT and GPT-4, Microsoft's Bing, Google's Bard, and Anthropic's Claude, are powered by tensors. Tensors are multi-dimensional arrays augmented with powerful operations that execute remarkably efficiently on modern hardware, most notably GPUs.

To understand all that, I intend to build a neural network library like PyTorch or JAX, in Rust, from the ground up. While these libraries are gazillions of lines of code each, they consist of the following parts:

A tensor library, to provide efficient operations to slice, dice, and apply bulk operations to tensors.
Accelerators, to execute tensor operations on the GPU, accelerating them greatly.
Automatic differentiation, to efficiently train neural networks via gradient descent.
Neural network building blocks, to easily reuse commonly used activation functions and layers.

In the first post, I focussed on the tensor library. I described almost twenty fundamental tensor operations and abstracted them in a Rust trait called RawTensor. This trait had a single implementation, CpuRawTensor.

The fundamental tensor operations can be categorized as follows:

Unary operations take a single tensor as input and produce a new tensor of the same shape, by applying a function to each element: log and exp.
Binary operations take two tensors as input and produce a new tensor of the same shape, by applying a binary function element-wise: add, sub, mul, div, pow and eq.
Reduce operations take a tensor as input, and produce a new tensor of a smaller shape, by reducing one or more dimensions to length 1. These are sum and max.
Movement operations take a tensor as input, and produce a new tensor with a changed shape, without changing its elements. These are reshape, permute, expand, pad, and crop.
Creation and elimination operations make and destroy tensors. These are new, shape and ravel.

To prove that these operations are necessary and sufficient, we implemented matrix multiplication using only RawTensor operations:

massage the left tensor with shape (m, n) to shape (m, 1, n), using reshape.
massage the right tensor with shape (n, o) to shape (1, o, n), using reshape and permute.
broadcast-multiply left and right to make a tensor of shape (m, o, n) using expand, reshape, and mul.
reduce the last dimension to make a tensor of shape (m, o, 1) using sum.
reshape to (m, o).
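Stripped of reshape, permute and expand, the steps above amount to the following sketch on row-major slices. The function name is made up for this example, it assumes n is non-zero, and it materializes the (m, o, n) intermediate on purpose to mirror the recipe:

```rust
/// Multiplies a row-major (m, n) matrix by a row-major (n, o) matrix,
/// deliberately going through the full (m, o, n) intermediate product.
fn matmul_via_mul_and_sum(l: &[f32], r: &[f32], m: usize, n: usize, o: usize) -> Vec<f32> {
    assert_eq!(l.len(), m * n);
    assert_eq!(r.len(), n * o);
    // Steps 1-3: the broadcast product of shape (m, o, n).
    // Element (i, j, k) is l[i, k] * r[k, j].
    let mut prod = vec![0.0_f32; m * o * n];
    for i in 0..m {
        for j in 0..o {
            for k in 0..n {
                prod[(i * o + j) * n + k] = l[i * n + k] * r[k * o + j];
            }
        }
    }
    // Steps 4-5: sum over the last dimension, leaving shape (m, o).
    prod.chunks(n).map(|c| c.iter().sum::<f32>()).collect()
}
```

The intermediate holds m * o * n elements while the result only has m * o, which is why this formulation is so hungry for memory.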

So far the recap, on to the main event.

GPU and me

You only have to look at NVIDIA's stock price to see that GPUs and neural networks go hand in hand. Not surprising in hindsight: the main computational load of neural networks is a bunch of linear algebra operations, and the main computational load of "showing 3D shapes on a 2D screen" is definitely in the same ballpark, if not the same ball.

Think of your GPU as the supercomputer you never knew you had. Did you ever wonder why you can play Fortnite at 60 fps, moving around fluidly in a multiplayer 3D world with a UI overlay, while online banking takes more than 30 seconds to display a few static numbers? That's because your GPU is amazingly good at its job.

To drive this point home, let's compare my CPU with my GPU in units of peak floating point operations per second (FLOPS). The comparison is not necessarily apples to apples, and there are various problems with calculating such peak performance and how realistic it is. I'm just going to ignore that.

I bought a mid-to-high-end laptop 2.5 years ago. It has an Intel Core i7-10750H, which Intel reports has 249.6 GFLOPS (that's Giga FLOPS) peak performance. It's unclear whether that's for single (f32) or double (f64) precision. I suspect it's f32, but let's be generous and say it's f64, and assume single precision is twice as fast. Then my CPU can add f32s at a cool 500 GFLOPS.

My laptop has an NVIDIA GeForce RTX 2070 (released Oct 2018) GPU with a peak performance of 7.465 TFLOPS for f32. Yes, that's "T" for Tera.

Here, let me put that in a pie chart for you:

"But but but!" I hear you say. "According to that page, your GPU only has a peak performance of 233 GFLOPS for f64, which is LESS than your CPU." That's right. GPUs are optimized for f16 and f32 operations. You'll also notice that my GPU's f16 performance is twice as fast as f32 performance, while f64 is 32 times as slow. That's because f32 is plenty for graphics and neural network applications. People are getting good results with stuffing the numbers in 8 or even 4 bits. While that needs clever engineering, floating point precision does not seem to be a limiting factor.

You want the threads? You can't handle the threads

GPUs achieve such high FLOPS by massive parallelization.

On the most fundamental level, they have built-in instructions that execute vector or matrix operations in a single cycle. For example, NVIDIA's Turing architecture GPUs include so-called tensor cores. My RTX 2070 has 288 of them. These cores are specialized to execute General Matrix Multiplication or GEMM. GEMM computes the result of 𝙰×𝙱+𝙲, a fused multiply-add operation. Each tensor core executes 64 GEMMs per clock cycle on 4×4 matrices containing f16s. NVIDIA writes: "Tensor Cores are specialized execution units designed specifically for performing the tensor/matrix operations that are the core compute function used in Deep Learning." This is from a post written in 2018. Their recent stock price explosion isn't entirely out of the blue.

On top of that, GPUs expose more parallelism using threads. GPU threads are more limited than the threads you are probably thinking about. For example, small groups of threads may share the same stack and so must all execute the same code. Threads execute in workgroups, groups of threads that share a fast cache. Using this shared cache well is often critical for performance. On modern GPUs, 100s or 1000s of threads are available in execution units like tensor cores.

Launching workgroups for execution on the GPU is called dispatching or a dispatch. The GPU driver schedules workgroups for execution, similar to how an operating system's kernel schedules processes. With WebGPU's default limits, a single dispatch can launch up to 65,535³ workgroups of up to 256 threads each.

The technical term for this is a fucktonofthreads.

Everything, everywhere, all at once

Like any subfield, GPU programming has a set of terms that takes some time to get used to. I've introduced thread, workgroup, and dispatch already. Unfortunately, different hardware vendors or graphics APIs use other terms for essentially the same thing or the same terms for subtly different things! Since I'll use WebGPU, I use their terminology exclusively. Raph Levien has put together a useful Compute Shader Glossary.

I will focus on concepts, and leave the details of APIs to others. Also, I will not explain and frankly do not know how to do graphics programming on the GPU.

GPUs, like CPUs, execute code, so the first order of business is writing a program that compiles to whatever GPUs execute. For historical reasons, these programs are called compute shaders. In the original graphics context, they were simply called shaders. The compute emphasizes that the shader is a general computation, that doesn't show things on the screen.

Shaders are written in a domain-specific shading language. Most shading languages have a C or C++-like syntax. WebGPU's Shading Language, or WGSL, has a Rust-like syntax. To give you a flavor, here's a shader entry point, like a main function, in WGSL. Don't worry about what this does for now.

@compute
@workgroup_size(64)
fn call(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let fro = global_id.x * strides_and_shape[2];
    let to = fro + strides_and_shape[2];
    for (var gidx = fro; gidx < to; gidx = gidx + 1u) {
        let index = input_index_of(gidx);
        output_0[gidx] = log(input_0[index]);
    }
}

For CPUs, compilers like gcc take program text and compile it to a well-defined instruction set like x86. For GPUs, the situation is more complicated. Even GPUs produced by the same vendor, like NVIDIA, don't all have the same instruction set. To avoid application programmers having to learn the interface of all GPU cards out there, a graphics API mediates between the programmer and the hardware. You've almost certainly heard about graphics APIs: Direct3D 12 on Windows, Metal on Apple, CUDA for NVIDIA chips, Vulkan, and WebGPU. The GPU driver, written by the manufacturer, bridges the gap between the hardware and the graphics API, and the graphics API is what programmers use.

As you can see from the examples, this doesn't make the situation comparable to CPUs, as even on the graphics API level you still have to pick a platform or manufacturer. Wgpu addresses this to some extent: it's not a standalone graphics API, but a layer that interfaces with actual graphics APIs. You can take a shader written in WebGPU's WGSL and run it on a Vulkan backend or a DirectX backend, all from the wgpu API, without changing any code. Wgpu uses a library called naga to translate shaders in WGSL and other supported shading languages to the desired shading language.

Compared to CPUs, the graphics API is in some sense the GPU's operating system and compiler in one. It is responsible both for compiling shader code and scheduling the resulting program for execution on the GPU.

The final piece of the puzzle is memory. Discrete GPUs, which come on a separate card, have dedicated memory onboard. Since GPUs are so highly parallelized, there must be enough memory bandwidth available, and having to copy over a PCI bus from main memory just doesn't cut it. GPU memory is not accessible by the CPU, but like other devices, the GPU can access shared memory on the CPU. Shared memory is used as a staging area: you fill buffers in shared memory and instruct the GPU to copy to its memory or do the same in reverse to get data from GPU to shared memory.

That's pretty much it. You start a GPU program by setting up the buffers it needs to read and write, pointing at the shader code that needs to run, and dispatch it by giving the number of workgroups you'd like to run. The graphics API does the rest and notifies you when the GPU is done with your dispatch. From the main program's perspective, which runs on the CPU, this all happens asynchronously.

So much for the high-level overview. As we'll find out, the details matter a lot!

In the rest of this post, I'll describe an implementation of raw tensor operations on the GPU using WebGPU. I picked WebGPU because it sounded like the easiest-to-understand API while remaining low-level and cross-platform.

As the name indicates, in principle WebGPU is executable entirely in the browser via WASM. However, WebGPU is an emerging standard, not an established implementation. While WebGPU is supported by all the important browser creators (Apple, Mozilla, Google, Microsoft), as of June 2023 it's not available by default in any major browser. The implementation in Chrome, called Dawn, is the closest. You can enable it using a special flag. The implementation for Firefox is called wgpu, and it's available behind a flag in nightly builds.

However, both Dawn and wgpu are available as standalone libraries. Dawn is written in C++ and wgpu in Rust. So, wgpu is a natural target for this series.

There are already good quality posts out there that tell you how to get started with wgpu, and explain the details of the wgpu API. I'll gloss over those. Instead, I'll try to explain the underlying concepts which also transfer to other GPU programming APIs.

Wgpu 101: Accelerating unary operations

Let's dip our toes in GPU programming by implementing the unary tensor operations, as defined in RawTensor:

fn exp(&self) -> Self;
fn log(&self) -> Self;

These two operations apply the exp and log functions to each element in the tensor.

Recall that a tensor is backed by a 1-dimensional array. On the CPU, the implementation of exp and log is straightforward: allocate a new result buffer, loop over each element in the original buffer, and store the result in the result buffer. Unary operations can be optimized easily, for example via multithreading and SIMD instructions. I've done no optimization whatsoever. The idea is to compare a naïve CPU implementation with a naïve GPU implementation, and pretend that's apples-to-apples.

CpuRawTensor is a struct defined as:

pub struct CpuRawTensor<T> {
    buffer: Arc<Buffer<T>>,
    strider: ShapeStrider,
}

It contains a reference-counted buffer, and a ShapeStrider which is responsible for translating multi-dimensional tensor indices to a one-dimensional buffer index. The buffer is reference-counted because it's immutable and can be shared: operations like reshape and permute only change the shape or the strides of the tensor.
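As a reminder of what that translation looks like, here is a minimal sketch. The two free functions are hypothetical and ignore details the real ShapeStrider has to deal with; they only show the core arithmetic:

```rust
/// Strides of a contiguous, row-major tensor with the given shape.
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Translates a multi-dimensional index to an index in the 1-dimensional buffer.
fn buffer_index(index: &[usize], strides: &[usize]) -> usize {
    index.iter().zip(strides).map(|(i, s)| i * s).sum()
}
```

A tensor of shape [4, 3, 2] has strides [6, 2, 1], so index [2, 1, 1] lands on buffer position 15. Reversing the axes with a permute only reverses the strides to [1, 2, 6]: index [1, 1, 2] in the permuted tensor maps to the same position 15, and the buffer is never touched.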

The definition of WgpuRawTensor is remarkably similar:

pub struct WgpuRawTensor<'a, T> {
    buffer: Arc<wgpu::Buffer>,
    strider: ShapeStrider,
    context: &'a WgpuContext,
    phantom: std::marker::PhantomData<T>,
}

Again we have a buffer, but here it lives in GPU memory. We also have ShapeStrider for the same reasons we have one on the CPU: movement operations don't usually touch the buffer, and they are almost identical to the CPU implementation. ShapeStrider lets us share that code.

The other fields are a reference to WgpuContext, which facilitates interaction with Wgpu's API and keeps some state that's shared among all tensors, and PhantomData<T>. Since Wgpu's buffer is not typed, PhantomData<T> keeps Rust's type system happy.

Now onto implementing our first shader. Let's simplify further and assume we're writing a shader for exp only.

@group(0) @binding(0)
var<storage, read> input_0: array<f32>;

@group(0) @binding(1)
var<storage, read_write> output_0: array<f32>;

@compute
@workgroup_size(64)
fn call(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let gidx = global_id.x;
    output_0[gidx] = exp(input_0[gidx]);
}

The most familiar pieces here are:

Definitions of input_0 and output_0 buffers for the shader to read from and write to.
A function call as the entry point of the shader.
The last line of the shader reads a value from the input buffer at index gidx, applies the built-in exp function, and writes the result to the output buffer at gidx.

One mystifying aspect is that there's no loop: this program only updates a single index. The secret sauce here is threads. We'll dispatch as many threads as there are elements in the output buffer, so each thread reads a single element of the input and writes a single element to the output.

How does a thread know which index it should write to the output? That's where the @builtin(global_invocation_id) attribute comes in. A shader entry point like call is limited in the arguments it can accept. To my knowledge, a compute shader can only accept built-in inputs such as invocation ids and workgroup ids, which are set by the graphics API. The shader entry point is called N times, on N different threads, and each of these threads gets a unique invocation id, with 0 <= id < N. The shader above uses this invocation id to figure out which index a thread should process. It's important to keep different threads from writing to the same location because that creates a race condition.

In principle, that's the end of the story, but for a mix of historical and performance reasons, it is complicated further. First, the invocation id is not a single number: it's a vec3, a coordinate (x, y, z). I imagine this has its roots in 3D graphics. Some material also implies that threads that are nearby in coordinate space are located close together (e.g. on neighboring GPU cores or the same core), while other material implies that this is less the case on modern GPUs. I'm only using the x-coordinate anyway, so you can think of the invocation id as a single number.

A further complication comes with the organization of GPU threads into workgroups. The wgpu API doesn't let you specify how many threads you want. Instead, you specify how many workgroups you want:

/// Dispatches compute work operations.
///
/// `x`, `y`, and `z` denote the number of work groups to dispatch in each dimension.
pub fn dispatch_workgroups(&mut self, x: u32, y: u32, z: u32)

The @workgroup_size attribute in the shader specifies how many threads each workgroup has. You may wonder why everything has an (x, y, z) coordinate except @workgroup_size. It does, but any omitted dimensions default to 1. So I could also have written @workgroup_size(64, 1, 1).

The idea is that threads in the same workgroup share fast cache memory and can coordinate, for example via barriers that wait for all threads in a workgroup to reach a certain point. Workgroups on the other hand may or may not run concurrently, depending on how they are scheduled by the GPU driver. There are no WGSL operations that allow coordination between threads in different workgroups.

What this means in practice is that a shader with an attribute @workgroup_size(wx, wy, wz) and dispatched using dispatch_workgroups(cx, cy, cz) executes the entry point wx × wy × wz × cx × cy × cz times, and each of those threads gets a unique invocation id (x, y, z).
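As a quick check on that arithmetic, here is a minimal Rust sketch. The helper names `total_invocations` and `workgroups_needed` are made up for illustration and are not part of Tensorken or wgpu.

```rust
/// Number of times the entry point runs for @workgroup_size(wx, wy, wz)
/// dispatched with dispatch_workgroups(cx, cy, cz).
fn total_invocations(workgroup_size: (u32, u32, u32), workgroup_count: (u32, u32, u32)) -> u64 {
    let (wx, wy, wz) = workgroup_size;
    let (cx, cy, cz) = workgroup_count;
    [wx, wy, wz, cx, cy, cz].iter().map(|&v| u64::from(v)).product()
}

/// Smallest workgroup count along x that gives every element at least one
/// thread (hypothetical helper; rounds up).
fn workgroups_needed(elements: u32, threads_per_workgroup: u32) -> u32 {
    (elements + threads_per_workgroup - 1) / threads_per_workgroup
}
```

For 100 elements and @workgroup_size(64), two workgroups are needed, which launches 128 threads. The 28 surplus threads must not write anything.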

The remaining bit I haven't explained yet is the declaration of the storage buffers:

@group(0) @binding(0)
var<storage, read> input_0: array<f32>;

@group(0) @binding(1)
var<storage, read_write> output_0: array<f32>;

They're familiar arrays of f32. But what's the stuff between the angle brackets after var? WGSL partitions memory into address spaces. An address space is like a type of memory with specific properties. These properties determine mutability, visibility, the type of values that may be stored, and how to use the variables. For now, we'll only use the storage address space, which is for buffers provided when dispatching the computation. These buffers are visible to all threads in the dispatch and can be read-only or read-and-write. The other address space we'll use later is workgroup, declared as var<workgroup>, which is for buffers that are shared between threads in the same workgroup.

The @group and @binding attributes identify which buffers to bind before dispatching. On the Rust side, we need to specify a bind group that matches the definition in the shader:

// get bind group 0 = all the bindings with @group(0)
let bind_group_layout = pipeline.get_bind_group_layout(0);
// index 0 and 1 within group 0, for the input and output buffer (which are wgpu::Buffer types)
self.device().create_bind_group(&wgpu::BindGroupDescriptor {
    label: None,
    layout: &bind_group_layout,
    entries: &[
        wgpu::BindGroupEntry {
            binding: 0,
            resource: self.buffer.as_entire_binding(),
        },
        wgpu::BindGroupEntry {
            binding: 1,
            resource: output_buffer.as_entire_binding(),
        },
    ],
})

The last bit of admin in the shader is the @compute annotation. I've mostly pretended there's only one type of shader, the compute shader. There are other kinds of shaders specific to graphics computations, like @vertex and @fragment. You can think of those as more specialized compute shaders, although historically they predate compute shaders. I'll continue to ignore these other shader types. You may want to look into them if you're interested in graphics programming.

Can we run something now?

Pretty much! There is a bunch of wgpu-specific admin I cover briefly here. Feel free to skip this section.

Wgpu's entry points are Device and Queue. The device represents the logical GPU and has functions for creating compute resources, like buffers and compiled shaders. The queue enqueues commands for the GPU to execute. The only commands I've used are for dispatching a shader and copying buffers from shared memory to GPU memory. Once submitted, the GPU works through them, and you can poll for completion asynchronously.

I create a Device and Queue lazily, once per process, and pass a singleton WgpuContext to every WgpuRawTensor instance:

pub(crate) struct WgpuContext {
    pub(crate) device: wgpu::Device,
    pub(crate) queue: wgpu::Queue,
    pipelines: RwLock<HashMap<(&'static str, WorkgroupSize), Arc<wgpu::ComputePipeline>>>,
}

This brings us to ComputePipeline. A compute pipeline contains a compiled and validated shader, with some information like the name of its entry point. You create it via wgpu::Device.create_compute_pipeline. Since parsing, compiling, and validating the shader takes time, WgpuContext contains a cache of created pipelines. A pipeline can be re-executed as many times as desired.
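To illustrate the caching idea without pulling in wgpu, here is a minimal sketch of a get-or-create cache with the same key shape as the pipelines field. `FakePipeline`, `PipelineCache` and `get_or_create` are stand-ins invented for this example, and the workgroup size is simplified to a plain tuple.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for wgpu::ComputePipeline; it only records the shader source.
struct FakePipeline {
    source: String,
}

/// Cache keyed by (operation name, workgroup size).
struct PipelineCache {
    pipelines: RwLock<HashMap<(&'static str, (u32, u32)), Arc<FakePipeline>>>,
}

impl PipelineCache {
    fn new() -> Self {
        Self { pipelines: RwLock::new(HashMap::new()) }
    }

    /// Returns the cached pipeline, or builds one with `create` and stores it.
    fn get_or_create(
        &self,
        operation: &'static str,
        workgroup_size: (u32, u32),
        create: impl FnOnce() -> FakePipeline,
    ) -> Arc<FakePipeline> {
        let key = (operation, workgroup_size);
        // Fast path: a read lock is enough when the pipeline already exists.
        if let Some(pipeline) = self.pipelines.read().unwrap().get(&key) {
            return Arc::clone(pipeline);
        }
        // Slow path: take the write lock. `entry` keeps an existing value if
        // another thread inserted one between the two locks.
        let mut map = self.pipelines.write().unwrap();
        let entry = map.entry(key).or_insert_with(|| Arc::new(create()));
        Arc::clone(&*entry)
    }
}
```

The expensive `create` closure only runs on a cache miss, which is the point of caching compiled shaders.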

Once you have your shader code, executing the shader proceeds as follows:

1. Create any necessary buffers, and copy data to them via functions on wgpu::Device.
2. Create the ComputePipeline for the shader.
3. Bind buffers from step 1 to corresponding variables defined in the shader via bind groups. Bind groups are created via wgpu::Device.
4. Dispatch the shader with a given number of workgroups. This step is somewhat tedious, and needs a few intermediate objects like a "command encoder" and a "compute pass". The gist is you submit a list of commands to the Queue, and get a submission index back.
5. Poll the device using this submission index, to learn when execution finishes.

256 by 256 is enough for everyone

Now can we run something? Why yes, I thought you'd never ask.

To compare my naïve CPU tensor implementation with my naïve GPU implementation, I set up a few benchmarks using criterion.rs. The benchmark creates random square tensors of various sizes (from 64x64 to 1024x1024) and then compares calling exp on the CPU (CpuRawTensor) with the GPU (WgpuRawTensor):

let mut rng = StdRng::seed_from_u64(12345u64);

for size in [64, 128, 256, 512, 1024] {
    let t1s = &[size, size];
    let t1_gpu = Wgpu32::randn(t1s, &mut rng);
    let t1_cpu = t1_gpu.to_cpu(); // copies the tensor from GPU to CPU memory

    group.bench_with_input(BenchmarkId::new("cpu contiguous", size), &size, |b, _| {
        b.iter(|| t1_cpu.exp())
    });
    group.bench_with_input(BenchmarkId::new("gpu contiguous", size), &size, |b, _| {
        b.iter(|| t1_gpu.exp())
    });
}

I've removed some criterion-related boilerplate to make it more readable.

When I run this, it fails. A single dispatch is limited to 65,535 along each dimension (strictly speaking, WebGPU's default limit counts workgroups rather than threads), and a 256x256 tensor has 65,536 elements, so we hit that limit right at size 256x256. Tensors used in neural networks today are bigger than that.

The solution is straightforward: instead of only a single element, each thread processes a few elements using a loop. Our shader becomes:

@group(0) @binding(0)
var<storage, read> input_0: array<f32>;

@group(0) @binding(1)
var<storage, read_write> output_0: array<f32>;

@group(0) @binding(3)
var<storage, read> chunk_size: u32;

@compute
@workgroup_size(64)
fn call(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let fro = global_id.x * chunk_size;
    let to = fro + chunk_size;
    for (var gidx = fro; gidx < to; gidx = gidx + 1u) {
        if(gidx >= arrayLength(&output_0)) {
            return;
        }
        output_0[gidx] = exp(input_0[gidx]);
    }
}

I added another binding to pass the chunk size to the shader. We'll expand it with more parameters soon. The bounds check inside the loop matters: if the length of the output buffer is not evenly divisible by the chunk size, the last thread would otherwise access out-of-bounds array indices. WebGPU clamps array indices, so that wouldn't cause an error, but it would lead to wrong results.
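The chunking logic is easy to try out on the CPU. Below is a minimal sketch that emulates the shader's threads with a plain loop. `run_thread` and `exp_chunked` are names made up for this example, and rounding the thread count up is an assumption about how the dispatch size would be chosen.

```rust
/// CPU emulation of one GPU thread running the chunked exp shader.
/// `thread_id` plays the role of global_id.x.
fn run_thread(thread_id: usize, chunk_size: usize, input: &[f32], output: &mut [f32]) {
    let fro = thread_id * chunk_size;
    let to = fro + chunk_size;
    for gidx in fro..to {
        // Same guard as in the shader: the last chunk may extend past the end.
        if gidx >= output.len() {
            return;
        }
        output[gidx] = input[gidx].exp();
    }
}

/// Runs all threads one after the other; on the GPU they would run in parallel.
fn exp_chunked(input: &[f32], chunk_size: usize) -> Vec<f32> {
    let mut output = vec![0.0; input.len()];
    // Assumption: round up so a partial last chunk still gets a thread.
    let threads = (input.len() + chunk_size - 1) / chunk_size;
    for thread_id in 0..threads {
        run_thread(thread_id, chunk_size, input, &mut output);
    }
    output
}
```

With 10 elements and a chunk size of 4, three threads run and the third one stops early after two elements.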

With that change, we have a working shader!

Comparing GPU vs CPU performance gives:

Making shaders generic

Besides exp, we'd like to apply log. We could copy-paste the shader and replace one word, but that doesn't work well if there are more parameters. In particular, we'd also like to modify the workgroup size: no use starting 64 threads per workgroup if a tensor has only 16 elements. (As a reminder, workgroup size is the number of threads per workgroup, not to be confused with the number of workgroups.) Because workgroup size is defined in the shader using the workgroup_size attribute, we'll need to munge some shader text if we want to make this parametrizable.

Simplicity is my main objective, so instead of string templating, I've gone for parlor tricks and string replacement. Let's update the shader again:

fn replace_me_with_actual_operation(in: f32) -> f32 { discard; }

@compute
@workgroup_size(64)
fn call(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let fro = global_id.x * strides_and_shape[2];
    let to = fro + strides_and_shape[2];
    for (var gidx = fro; gidx < to; gidx = gidx + 1u) {
        if(gidx >= arrayLength(&output_0)) {
            return;
        }
        output_0[gidx] = replace_me_with_actual_operation(input_0[gidx]);
    }
}

The explicit call to exp is now a custom function we must replace before giving the shader to wgpu. The discard operation is a built-in not applicable to compute shaders. Leaving it in would cause an error, which is the point: it's an assert of sorts to check if string replacement has worked correctly. One advantage of not using templating is that syntax highlighting still works. VSCode has a decent language service addon for WGSL. It catches many errors beyond syntactic ones, so I was keen to keep it functional.

The Rust side now has to do a few string operations before passing the shader text to wgpu:

// include the shader text from a file
const MAP_SHADER: &'static str = include_str!("shaders/map.wgsl");
const REPLACE_OP_NAME: &'static str = "replace_me_with_actual_operation";
const REPLACE_UNARY_OP_DEF: &'static str =
    r"fn replace_me_with_actual_operation(in: f32) -> f32 { discard; }";
const REPLACE_WORKGROUP_SIZE: &'static str = "@workgroup_size(64)";

pub(crate) fn pipeline_for(
    &self,
    operation: &'static str,
    workgroup_size: WorkgroupSize,
) -> Arc<wgpu::ComputePipeline> {
    // snip
    &Self::MAP_SHADER
        .replace(Self::REPLACE_UNARY_OP_DEF, "")
        .replace(Self::REPLACE_OP_NAME, operation)
        .replace(
            Self::REPLACE_WORKGROUP_SIZE,
            &format!(
                "@workgroup_size({}, {})",
                workgroup_size.0, workgroup_size.1
            ),
        ),
    // snip
}

The first replace removes the replace_me_with_actual_operation function definition, the second replaces the call with the actual operation like exp or log, and the third puts in a given workgroup size. Note the workgroup size is in two dimensions - the second dimension is only used for reduce operations, as I'll describe below.
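Here is a self-contained sketch of the same three replacements that can be run without wgpu. The shader text is a cut-down stand-in for shaders/map.wgsl, and `specialize` is a hypothetical free function rather than the actual pipeline_for method.

```rust
// Cut-down stand-in for the real shader file, just enough to show the idea.
const MAP_SHADER: &str = r"fn replace_me_with_actual_operation(in: f32) -> f32 { discard; }

@compute
@workgroup_size(64)
fn call(@builtin(global_invocation_id) global_id: vec3<u32>) {
    output_0[global_id.x] = replace_me_with_actual_operation(input_0[global_id.x]);
}";
const REPLACE_OP_NAME: &str = "replace_me_with_actual_operation";
const REPLACE_UNARY_OP_DEF: &str =
    r"fn replace_me_with_actual_operation(in: f32) -> f32 { discard; }";
const REPLACE_WORKGROUP_SIZE: &str = "@workgroup_size(64)";

/// Specializes the template for one operation and workgroup size.
fn specialize(operation: &str, workgroup_size: (u32, u32)) -> String {
    MAP_SHADER
        // 1. Drop the placeholder definition, including its `discard`.
        .replace(REPLACE_UNARY_OP_DEF, "")
        // 2. Point the remaining call at the real built-in.
        .replace(REPLACE_OP_NAME, operation)
        // 3. Patch in the requested workgroup size.
        .replace(
            REPLACE_WORKGROUP_SIZE,
            &format!("@workgroup_size({}, {})", workgroup_size.0, workgroup_size.1),
        )
}
```

The order of the first two replacements matters: renaming first would also rename the placeholder definition, which would then no longer match and would stay in the shader.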

It's not the most beautiful code, but that comes with the territory.

Give yourself a pat on the back, you’ve reached the half-way point in the post!

Photo by Reiseuhu on Unsplash

Non-contiguous unary operations

Unary operations don't care if the tensor is non-contiguous: the result is correct either way. They don’t need to take the shape or strides of the tensor into account.

Interlude: Shape and strides reminder. Skip if you're familiar, or check out the last post if nothing makes sense.

A tensor can have any number of dimensions or axes. Each axis is indexable, conventionally zero-based, and has a length.

The shape of an n-dimensional tensor is an n-tuple of lengths for each dimension.

The product of the lengths is the number of elements in the tensor - called the tensor's size.

The strides of an n-dimensional tensor is an n-tuple that represents how many elements to skip in the underlying (one-dimensional) buffer to get to the next element in that dimension. Many movement operations, like reshape and permute, only need changes to shape and strides, which saves a copy of the underlying buffer.

In general, the tensor's shape and strides represent a coordinate transformation from a many-dimensional tensor index to a one-dimensional buffer index.
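As a small worked example of that coordinate transformation, here is a sketch in plain Rust. `contiguous_strides` and `buffer_index` are illustrative helpers, not Tensorken's ShapeStrider API.

```rust
/// Row-major (contiguous) strides for a shape: the last axis has stride 1.
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Maps a tensor index to a buffer index: the sum of stride * index per axis.
fn buffer_index(strides: &[usize], tensor_index: &[usize]) -> usize {
    strides.iter().zip(tensor_index).map(|(s, t)| s * t).sum()
}
```

A contiguous tensor of shape (4, 3, 2) has strides (6, 2, 1), so tensor index (2, 1, 1) lives at buffer index 15. Transposing a contiguous (2, 3) tensor only swaps shape and strides to (3, 2) and (1, 3): its element (2, 1) is found at buffer index 5, the same place as element (1, 2) of the original.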

In contrast, for non-unary operations, we'll have to take shape and strides into account. It's incorrect to add two tensors elementwise by adding pairs of elements in their underlying buffers. Unless they are both contiguous, we need to pair up elements according to the tensor index, not the buffer index.

For consistency, I decided to make the result of all operations contiguous, including unary operations. We'll have to modify the shader one last time. We'll no longer assume the input buffer is contiguous, and we must make the output buffer contiguous. All the rest remains the same: we'll divide the output buffer into chunks, and each thread writes only to the chunk it is responsible for.

The problem becomes: given an index in a contiguous output buffer, which is given via the invocation id, what is the corresponding index in the potentially non-contiguous input buffer? In the contiguous case, the index in the input is the same. In the non-contiguous case it depends on the shape and strides of the input tensor.

Here's the final code of the shader's entry point, call:

@compute
@workgroup_size(64)
fn call(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let fro = global_id.x * strides_and_shape[2];
    let to = fro + strides_and_shape[2];
    for (var gidx = fro; gidx < to; gidx = gidx + 1u) {
        if(gidx >= arrayLength(&output_0)) {
            return;
        }
        let index = input_index_of(gidx);
        output_0[gidx] = replace_me_with_actual_operation(input_0[index]);
    }
}

This code first transforms gidx into the input buffer index by calling input_index_of, which I have yet to show.

To implement input_index_of, let's make the problem more precise. We have a mapping 𝚋 from an n-dimensional tensor index (𝚝₀, 𝚝₁, ..., 𝚝ₙ₋₁) in the input to its buffer index. Given n-dimensional shape (𝚕₀, 𝚕₁, ..., 𝚕ₙ₋₁) and strides (𝚜₀, 𝚜₁, ..., 𝚜ₙ₋₁):

𝚋(𝚝₀,𝚝₁,...,𝚝ₙ₋₁) = ∑ᵢ 𝚜ᵢ⋅𝚝ᵢ

This mapping is defined by the input tensor's shape and strides. Similarly, we have the shape and strides of the output tensor: the shape is identical to the input tensor, and the strides are chosen to make the tensor contiguous.

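The mapping 𝚋 is reversible, at least when the strides are contiguous, as they are for the output tensor. From a buffer index 𝚎, we can back out the corresponding tensor indices (𝚝₀, 𝚝₁, ..., 𝚝ₙ₋₁), where ÷ is integer division:

𝚋⁻¹(𝚎) = (..., (𝚎 ÷ 𝚜ᵢ) % 𝚕ᵢ, ...)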

The reverse mapping is what we need because we have the output buffer index as a given: it's gidx. We can back out the output buffer's tensor index by applying 𝚋⁻¹ with the shape and strides of the output buffer. The input tensor index is identical to the output tensor index because unary operations map element by element. So to get the input buffer index we apply 𝚋 with the input tensor's shape and strides to the tensor index we obtained from the reverse mapping. In short, input_index_of calculates:

𝚋ᵢₙ (𝚋ₒᵤₜ⁻¹ (𝚎))

In code:

// ndims, input_offset, chunk_size, input_strides, output_strides, shape
@group(0) @binding(2)
var<storage, read> strides_and_shape: array<u32>;

const preamble: u32 = 3u;

fn input_strides(i: u32) -> u32 {
    return strides_and_shape[i + preamble];
}

fn output_strides(i: u32) -> u32 {
    return strides_and_shape[i + preamble + strides_and_shape[0] ];
}

fn shape(i: u32) -> u32 {
    return strides_and_shape[i + preamble + strides_and_shape[0] * 2u];
}

fn input_index_of(output_i: u32) -> u32 {
    let ndims = strides_and_shape[0];
    let offset = strides_and_shape[1];

    var input_i: u32 = offset;
    for (var i: u32 = 0u; i < ndims; i = i + 1u) {
        let len = shape(i);
        let stride = output_strides(i);
        let coord_i: u32 = output_i / stride % len;

        input_i += coord_i * input_strides(i);
    }

    return input_i;
}

I've added a third binding with an array that contains the number of dimensions, the shape, and the strides needed for the coordinate transformations. It also contains the chunk size which we added earlier. WGSL has structs, but a struct can only contain a dynamically sized array as its last element, so it was little help here. Instead, I've just plonked everything in a single array and created a few helper functions to keep things civilized.

The implementation avoids an intermediate tensor index representation by calculating dimension by dimension. I don't think it's possible in WGSL to have an explicit representation because local, dynamically sized array variables are not allowed.
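The same index computation can be checked on the CPU. This is a sketch of a Rust port of input_index_of that takes the shape and strides as slices instead of reading them from a packed storage buffer. The parameter layout is an assumption made for readability.

```rust
/// Decodes `output_i` with the contiguous output strides, then re-encodes
/// the resulting coordinates with the input strides.
fn input_index_of(
    output_i: usize,
    shape: &[usize],
    output_strides: &[usize],
    input_strides: &[usize],
    input_offset: usize,
) -> usize {
    let mut input_i = input_offset;
    for i in 0..shape.len() {
        // Coordinate along axis i; the full tensor index is never materialized.
        let coord = output_i / output_strides[i] % shape[i];
        input_i += coord * input_strides[i];
    }
    input_i
}
```

For a contiguous (2, 3) tensor transposed to shape (3, 2), the input strides are (1, 3) and the contiguous output strides are (2, 1). Output buffer indices 0 to 5 then read from input buffer indices 0, 3, 1, 4, 2, 5.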

Now that we can produce contiguous tensors from non-contiguous tensors, let's benchmark if this slows us down. I added a few reshaped and transposed tensors to the benchmark:

let t1_gpu = Wgpu32::randn(t1s, &mut rng);
let t1_gpu_nc = t1_gpu.reshape(&[size / 2, size * 2]).transpose(0, 1);
let t1_cpu = t1_gpu.to_cpu();
let t1_cpu_nc = t1_gpu_nc.to_cpu();

I ran the exp operation on these four tensors, for the same sizes I showed earlier.

Not sure what I was expecting, but no significant difference works for me!

Binary operations

Implementing a shader for binary operations is straightforward now that we know how to implement the coordinate mapping. The entry point for the shader looks almost identical, except that the binary operation takes two arguments. It also takes an extra input buffer for the second input tensor. What's cute is that input_index_of is also nearly identical. We just need to use a vec2<u32>, a pair of u32 numbers, instead of a simple u32 to calculate the buffer indices in both input buffers at the same time:

fn input_index_of(output_i: u32) -> vec2<u32> {
    let ndims = strides_and_shape[0];
    let offset = vec2(strides_and_shape[1],strides_and_shape[2]);

    var input_i: vec2<u32> = offset;
    for (var i: u32 = 0u; i < ndims; i = i + 1u) {
        let len = shape(i);
        let stride = output_strides(i);
        let coord: u32 = output_i / stride % len;

        input_i += coord * vec2(input_0_strides(i), input_1_strides(i));
    }
    return input_i;
}

Neat! Here are the benchmark results for elementwise multiplication.
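To see the paired index mapping at work, here is a CPU sketch of elementwise multiplication where the second input is a transposed, non-contiguous view. A tuple stands in for vec2<u32>, offsets are assumed to be zero, and the function names are illustrative only.

```rust
/// Buffer indices in both inputs for one output buffer index.
fn input_indices_of(
    output_i: usize,
    shape: &[usize],
    output_strides: &[usize],
    strides_0: &[usize],
    strides_1: &[usize],
) -> (usize, usize) {
    let (mut i0, mut i1) = (0, 0);
    for d in 0..shape.len() {
        let coord = output_i / output_strides[d] % shape[d];
        i0 += coord * strides_0[d];
        i1 += coord * strides_1[d];
    }
    (i0, i1)
}

/// Elementwise multiplication into a new contiguous buffer.
fn mul(
    a: &[f32],
    strides_a: &[usize],
    b: &[f32],
    strides_b: &[usize],
    shape: &[usize],
    output_strides: &[usize],
) -> Vec<f32> {
    let size: usize = shape.iter().product();
    (0..size)
        .map(|gidx| {
            let (i0, i1) = input_indices_of(gidx, shape, output_strides, strides_a, strides_b);
            a[i0] * b[i1]
        })
        .collect()
}
```

Multiplying the contiguous 2x2 tensor [1 2; 3 4] with the transposed view of buffer [10, 20, 30, 40], which reads as [10 30; 20 40], pairs elements by tensor index rather than by buffer index.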

Movement operations

Movement operations (reshape, permute, expand, pad, and crop) are straightforward, as they don't need to touch the buffer. Their implementation looks remarkably similar to CpuRawTensor thanks to the ShapeStrider abstraction. However, if the operation can't be expressed by changing the tensor's shape or strides, we do have to create a new contiguous result buffer. Luckily, we already have a shader that can take a non-contiguous tensor and create a contiguous tensor: the shader for unary operations! All we need to do is change it so we just copy from the appropriate input buffer index to the output, without applying an operation.

One slight bump in the road is the shader for padding a tensor by adding zeros at its edges. The pad shader is a variation of the shader for unary operations. It's not particularly illuminating, so I'm not describing it in more detail.

Reduce operations

Reduce operations (sum and max) are a different can of worms. Unary and binary ops don't change the shape of their input tensors. Movement ops do change the shape, but they are implemented on the CPU for the most part. In contrast, reduce operations change the shape of the input tensor in a particularly intricate way.

The signature of the relevant operations in RawTensor is:

fn sum(&self, axes: &[usize]) -> Self;
fn max(&self, axes: &[usize]) -> Self;

Tensors can be reduced in any axis or multiple axes. It is easy to figure out the resulting shape, starting from the shape of the input tensor and the axes to reduce, by simply replacing all the lengths of the reduced dimensions with 1. What's not so easy is figuring out which elements of the input buffer reduce to a given element of the output buffer. We need to put our parallel thinking hat on. The first thing I tried is to chunk the output buffer and let each thread compute the necessary reduction for the elements in its chunk.

Let's work through an example. An input tensor with shape (4, 5) reduced in the first dimension yields a tensor with shape (1, 5):

>>> let t = Tr::linspace(1.0, 20.0, 20).reshape(&[4, 5])
[1 2 3 4 5]
[6 7 8 9 10]
[11 12 13 14 15]
[16 17 18 19 20]

>>> t.sum(&[0])
[34 38 42 46 50]

The first element of the result, 34, is the result of adding up a slice of t. This slice is a tensor of shape (4, 1):

>>> t.crop(&[(0, 4), (0, 1)])
[1]
[6]
[11]
[16]

For this example, we have five such reduced slices (a term I made up), one for each element in the output. The number of reduced slices can vary between 1 if all axes are reduced and the size of the input tensor if no axes are reduced.

The shape of the output tensor is the shape of the input tensor with all reduced axes replaced by 1, while the shape of each reduced slice has all non-reduced axes replaced by 1. In the example:

input: (4, 5) 
output: (1, 5)
reduced: (4, 1)
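This rule is short enough to write down directly. A minimal sketch, with `reduce_shapes` as a made-up helper name:

```rust
/// Returns (output shape, reduced slice shape) for reducing `shape` over `axes`.
fn reduce_shapes(shape: &[usize], axes: &[usize]) -> (Vec<usize>, Vec<usize>) {
    let mut output = shape.to_vec();
    let mut reduced = vec![1; shape.len()];
    for &axis in axes {
        output[axis] = 1; // reduced axes collapse to length 1 in the output
        reduced[axis] = shape[axis]; // and keep their length in the slice
    }
    (output, reduced)
}
```

The product of the output shape is the number of reduced slices, and the product of the reduced shape is the number of elements in each slice. For shape (4, 5) reduced over axis 0 that gives 5 slices of 4 elements.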

Now let's start implementing the shader. As with unary and binary operations, each thread gets the buffer indices in the output it needs to calculate via its invocation id. With the above insights, we can apply similar index transformations as before to obtain the buffer indices of the reduced slice. The key idea is to figure out the first element of the reduced slice in the input and then reduce the elements of the reduced slice to that same index in the output. Confused? Let's look at the example again.

To reduce the tensor t, we'll dispatch 5 threads, one for each element in the output. The first thread's reduced slice is the column vector:

[1]
[6]
[11]
[16]

which reduces to 34. The result lands at tensor index (0, 0) in the output, which is also the tensor index of the slice's first element in the input. Each thread gets an invocation id from 0 to 4. From its invocation id, the thread must figure out which elements in the input to reduce to which element in the output. In the example, an invocation id corresponds to a column number: thread 0 reduces the elements in column 0 to buffer index 0, and so on.

Generally, we know the thread's invocation id, which corresponds to a buffer index in the output tensor. As before, we're not starting a thread per element, but one per chunk, so we can reduce to tensors with more than 65k elements. We need an outer loop and a chunk_size parameter which is set in a storage buffer at dispatch time:

fn call(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let chunk_size = strides_and_shape[2];
    let fro = global_id.x * chunk_size;
    let to = fro + chunk_size;

    // Loop over the chunk of output elements this thread is responsible for.
    for (var gidx = fro; gidx < to; gidx = gidx + 1u) {
        if(gidx >= arrayLength(&output_0)) {
            return;
        }

        //TODO: reduce the slice starting at input_index_of(gidx) to acc
        output_0[gidx] = acc;
    }
}

Each gidx corresponds to an output buffer index. In place of the TODO, we write something like the following. I've specialized it to sum for clarity.

let reduced_slice_offset = input_index_of(gidx);

var acc = 0.0;
for (var reduced_slice_i = 0u; reduced_slice_i < reduced_slice_size; reduced_slice_i += 1u) {
    var input_i = reduced_slice_index_of(reduced_slice_offset, reduced_slice_i);
    acc += input_0[input_i];
}

output_0[gidx] = acc;

reduced_slice_offset is the buffer index of the first element in the reduced slice for this thread. It works in the same way as the index transformations in unary and binary operations: the output buffer index is reverse-mapped to an output tensor index, which is then mapped to an input buffer index. That works because the tensor index in the output is the same tensor index as the first element of each reduced slice in the input. Here is the example with the elements replaced by their tensor index.

[(0, 0) (0, 1) (0, 2) (0, 3) (0, 4)]
[(1, 0) ... ... ... ...]
[(2, 0) ... ... ... ...]
[(3, 0) ... ... ... ...]

reduces to:

[(0, 0) (0, 1) (0, 2) (0, 3) (0, 4)]

input_index_of is identical to the implementation we had for unary operations: it interprets the given output buffer index as a tensor index, and then interprets the tensor index as an input buffer index.

Next, let's try to understand the reduced_slice_i loop. Figuring out how many elements there are in each reduced slice is a job for the CPU, and reduced_slice_size is a value we pluck out of a storage buffer. In our example, this value is 4.

Each reduced_slice_i is a virtual buffer index in the reduced slice. It's virtual because there is no separate buffer backing the reduced slices, they're just a part of the input buffer. But we can still apply the principle of index mapping: all we're trying to do is iterate over the elements of the reduced slice, which we can do by giving them all a virtual buffer index and looping over the buffer indices. Again, this is an instance of the problem we had before: we need to reverse-map the reduced_slice_i buffer index to a tensor index in the reduced slice, and then map that to a buffer index in the input. The implementation is almost identical to input_index_of, except with a different shape and strides:

fn reduced_slice_index_of(offset: u32, reduced_slice_i: u32) -> u32 {
    let ndims = strides_and_shape[0];

    var input_i = offset;
    for (var i = 0u; i < ndims; i = i + 1u) {
        let len = reduced_shape(i);
        let stride = reduced_strides(i);
        let coord = reduced_slice_i / stride % len;

        input_i += coord * input_strides(i);
    }

    return input_i;
}
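Putting the pieces together, this first approach can be emulated on the CPU. In the sketch below each iteration of the outer map plays the role of one output element handled by a thread. `map_index` folds input_index_of and reduced_slice_index_of into one hypothetical helper, since they only differ in the shape and strides used for decoding.

```rust
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Decodes `i` using `shape` and `strides`, then re-encodes with `input_strides`.
fn map_index(i: usize, shape: &[usize], strides: &[usize], input_strides: &[usize], offset: usize) -> usize {
    let mut input_i = offset;
    for d in 0..shape.len() {
        input_i += (i / strides[d] % shape[d]) * input_strides[d];
    }
    input_i
}

/// Sums `input` over `axes`, one sequential reduction per output element.
fn sum(input: &[f32], shape: &[usize], input_strides: &[usize], axes: &[usize]) -> Vec<f32> {
    let mut output_shape = shape.to_vec();
    let mut reduced_shape = vec![1; shape.len()];
    for &axis in axes {
        output_shape[axis] = 1;
        reduced_shape[axis] = shape[axis];
    }
    let output_strides = contiguous_strides(&output_shape);
    let reduced_strides = contiguous_strides(&reduced_shape);
    let output_size: usize = output_shape.iter().product();
    let reduced_slice_size: usize = reduced_shape.iter().product();

    (0..output_size)
        .map(|gidx| {
            // First element of this output element's reduced slice.
            let slice_offset = map_index(gidx, &output_shape, &output_strides, input_strides, 0);
            (0..reduced_slice_size)
                .map(|i| input[map_index(i, &reduced_shape, &reduced_strides, input_strides, slice_offset)])
                .sum::<f32>()
        })
        .collect()
}
```

Note how the amount of parallelism available is `output_size`: reducing over all axes leaves a single outer iteration.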

And that was my first attempt at implementing reduce on the GPU. It is functionally correct, but it has a problem. Can you figure out what it is? Hint: what's the parallelism it achieves when a tensor is reduced to a single value?

Parallelizing reduce

In our approach so far, the maximum number of threads is limited by the size of the output tensor. As a result, it only works well if the output tensor is big. In the worst case, when reducing to a single number, a single GPU thread does the entire reduction.

// assuming t has two dimensions, this reduction uses only one thread.
>>> t.sum(&[0, 1])
[6]

That's slower than on the CPU! This case is important because when training a neural network the last step involves a reduction of the output tensor to a single value. This value, the loss, measures how well the neural network is doing, and is the value that training aims to minimize.

We turn to a well-known parallel programming building block: the prefix sum. Blelloch (1990) describes all-prefix-sums as an example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm. He defines the all-prefix-sums operation as follows:

The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements

[a₀, a₁,..., aₙ₋₁],

and returns the array

[a₀, (a₀ ⊕ a₁),..., (a₀ ⊕ a₁ ⊕ ... ⊕ aₙ₋₁)]

For example, if ⊕ is addition, then all-prefix-sums on the array

[3 1 7 0 4 1 6 3]

returns

[3 4 11 11 15 16 22 25].

The applications of parallel prefix sum are surprisingly wide-ranging. They include evaluating polynomials, implementing quicksort, and lexically comparing strings.

The idea behind parallel prefix sum is to exploit the associativity of the ⊕ operator. Since we can apply the operation in any order, we can sum pairs of elements in parallel. After a first iteration, this results in an array that is half the length of the original, which can be further reduced in parallel using half the number of threads. We keep doing that until the array contains a single element, the result.

Blelloch shows this illustration:

Reduction proceeds bottom up in the tree.

The full parallel prefix sum has an extra step to gather the intermediate sums, but we only need to reduce. The reduced result we want is the last element of the prefix sum, or the top of the tree in the picture.
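To make the relation between the two concrete, here is a sequential sketch of both. `tree_reduce` walks the tree level by level the way a parallel implementation would, except that the pairs within a level are combined one after the other. Both function names are made up for this example.

```rust
/// Reduces `values` level by level, combining neighbouring pairs.
/// The pairs within a level are independent, so they could run in parallel.
fn tree_reduce(values: &[i32], op: impl Fn(i32, i32) -> i32) -> Option<i32> {
    let mut level = values.to_vec();
    while level.len() > 1 {
        level = level
            .chunks(2)
            // An odd element at the end is carried up to the next level as-is.
            .map(|pair| if pair.len() == 2 { op(pair[0], pair[1]) } else { pair[0] })
            .collect();
    }
    level.first().copied()
}

/// Sequential all-prefix-sums with addition, for comparison.
fn prefix_sums(values: &[i32]) -> Vec<i32> {
    values
        .iter()
        .scan(0i32, |acc, &v| {
            *acc += v;
            Some(*acc)
        })
        .collect()
}
```

On Blelloch's example array [3 1 7 0 4 1 6 3], the tree reduction with addition yields 25, the last element of the prefix sums. Because only associativity is required, the same function works for max.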

For parallel reduction, we need to know when the threads reducing a level in the tree have finished, so they can reduce the next level. We accomplish that using a synchronization primitive called a barrier. As explained in the introduction, WebGPU only supports barriers within a workgroup, so we can only parallelize the reduction step within a workgroup. That limits us to at most 256 threads. The restriction could be lifted if we're prepared to dispatch for each reduction level separately, which would also solve the barrier problem: we'd wait for the current dispatch to finish before moving on to the next.

Staying within a workgroup has advantages too. We can store the intermediate results in a fast, workgroup-shared cache:

// replaced with the actual size at shader creation stage.
const INTERMEDIATE_SIZE: u32 = 64u;
var<workgroup> intermediate: array<f32, INTERMEDIATE_SIZE>;

This may help explain the var<storage, read> you saw before. workgroup vs storage indicates the address space in which the variable lives. The workgroup address space is for memory that is fast to access and shared between threads in a workgroup. It's comparable to an L1 CPU cache. Workgroup memory is not bound and can't be filled with data from the CPU at dispatch time. All of which makes it ideal for the kind of scratch space we need.

Since we're now parallelizing within a workgroup, there's another useful built-in called the local invocation id. We can access it by adding @builtin(local_invocation_id) on an entry point parameter:

@compute
@workgroup_size(64)
fn call(@builtin(global_invocation_id) global_id: vec3<u32>, @builtin(local_invocation_id) local_id: vec3<u32>)

It's similar to the global invocation id, except it only gives us unique ids within a workgroup. We'll make good use of it in the parallelized reduce loop. We start with what we had before:

let reduced_slice_size = strides_and_shape[3];

let lidx = local_id.x;
let lidy = local_id.y;

for (var gidx = fro; gidx < to; gidx = gidx + 1u) {
    if(gidx >= arrayLength(&output_0)) {
        return;
    }

    let reduced_slice_offset = input_index_of(gidx);

    // TODO parallel reduce loop
}

Note that we are working with two workgroup size dimensions: lidx and lidy. The x dimension parallelizes over reduced slices, like before. The y dimension further parallelizes each reduction.

Going back to the earlier example:

>>> let t = Tr::linspace(1.0, 20.0, 20).reshape(&[4, 5])
[1 2 3 4 5]
[6 7 8 9 10]
[11 12 13 14 15]
[16 17 18 19 20]

>>> t.sum(&[0])
[34 38 42 46 50]

This could be reduced with @workgroup_size(5, 2), for a total of 10 threads: two threads per column, each thread reducing two elements. Instead of a single acc variable which is only visible to a single thread, we now have a workgroup-visible intermediate buffer, which needs 10 elements, one per thread. Here is one reduction step, parallelized:

let intermediate_i = lidx * REDUCE_THREADS + lidy;
intermediate[intermediate_i] = 0.0;
for (var reduced_slice_i = lidy; reduced_slice_i < reduced_slice_size; reduced_slice_i += REDUCE_THREADS) {
    var input_i = reduced_slice_index_of(reduced_slice_offset, reduced_slice_i);
    intermediate[intermediate_i] += input_0[input_i];
}

First, intermediate_i is determined - this is the executing thread's place in the workgroup-visible intermediate buffer. I could not find a built-in to get the number of threads in the workgroup, so I used a constant REDUCE_THREADS to get that information. It is filled in using string replacement at shader dispatch time.

Then the loop reduces a subset of the elements in the reduced slice and writes the result to the intermediate buffer. Depending on the size of the reduced slice, a thread may reduce more than two elements. Given the limit of 256 threads, it wasn't tenable to limit each thread to just a pair of elements.

In the example, this is the state of the intermediate buffer after all threads have executed the loop:

[1+11 6+16 2+12 7+17 3+13 8+18 4+14 9+19 5+15 10+20]

That's one step away from the final result. To keep things manageable, I opted to implement just two levels of reduction. So instead of going through the reduction again with half the number of threads, a single thread does the final reduction.

workgroupBarrier();

if (lidy == 0u) {
    var acc = intermediate[lidx * REDUCE_THREADS];
    for (var i = 1u; i < REDUCE_THREADS; i += 1u) {
        acc += intermediate[lidx * REDUCE_THREADS + i];
    }
    output_0[gidx] = acc;
}

Before the final reduction, we make sure all threads have written to the intermediate buffer, using the workgroupBarrier built-in. A workgroup barrier marks a place in the shader where threads wait until all other threads in the workgroup have reached it too. After that, just one of the reduction threads calculates the final accumulated value and writes it to the storage buffer.

In the example, 5 threads in the x dimension with lidy==0 would write the following to the output buffer:

[(1+11)+(6+16) (2+12)+(7+17) (3+13)+(8+18) (4+14)+(9+19) (5+15)+(10+20)]
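To check the arithmetic above, here is a minimal CPU-side sketch that replays the two-level reduction sequentially. It is not the shader: the function name simulate_column_sum and its parameters are made up for illustration, and the nested loops stand in for the threads of a workgroup.

```rust
/// Sequentially simulates the two-level workgroup reduction for a column sum.
/// `input` is row-major with shape (rows, cols); `reduce_threads` plays the
/// role of REDUCE_THREADS. Returns the intermediate buffer and the output.
fn simulate_column_sum(
    input: &[f32],
    rows: usize,
    cols: usize,
    reduce_threads: usize,
) -> (Vec<f32>, Vec<f32>) {
    // One slot per simulated thread, like the workgroup-visible buffer.
    let mut intermediate = vec![0.0_f32; cols * reduce_threads];

    // Level 1: thread (lidx, lidy) reduces every `reduce_threads`-th element
    // of column `lidx`, starting at row `lidy`.
    for lidx in 0..cols {
        for lidy in 0..reduce_threads {
            let intermediate_i = lidx * reduce_threads + lidy;
            let mut row = lidy;
            while row < rows {
                intermediate[intermediate_i] += input[row * cols + lidx];
                row += reduce_threads;
            }
        }
    }

    // On the GPU, workgroupBarrier() sits here. This simulation is sequential,
    // so every "thread" has already finished level 1.

    // Level 2: only the threads with lidy == 0 combine the partial sums.
    let mut output = vec![0.0_f32; cols];
    for lidx in 0..cols {
        let start = lidx * reduce_threads;
        output[lidx] = intermediate[start..start + reduce_threads]
            .iter()
            .sum::<f32>();
    }
    (intermediate, output)
}
```

Called with the 4x5 example and reduce_threads set to 2, the intermediate buffer holds the ten partial sums listed earlier, and the output equals t.sum(&[0]).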

And we are done! Have a look at the full shader code here.

Let's run another CPU vs GPU shootout, again for square tensors of 64x64 up to 2048x2048, and reducing to a single scalar, i.e. t.sum(&[0, 1]). The speedup we achieve is due to the parallel reduction.

Towards matrix multiplication

Since we now have all of RawTensor's operations covered, and they all seem pretty efficient, it's time to give matmul a go.

All seems fine until we run a matrix multiplication of two 512x512 tensors.

wgpu error: Validation Error

Caused by:
    In Device::create_buffer
      note: label = `Tensor (mul)`
    Buffer size 536870912 is greater than the maximum buffer size (268435456)

The matrix multiply of two 512x512 tensors exceeds the maximum allowed buffer size of 256 MiB. But that doesn't make sense. A tensor of 512x512 is 1MiB. Where does the 512MiB buffer come from?

The answer lies in the implementation of matmul in tensor.rs. Let's recap how that worked.

1. Massage the left input tensor of shape (m, n) to a tensor of shape (m, 1, n) using reshape.
2. Massage the right input tensor of shape (n, o) to shape (1, o, n) using reshape and transpose.
3. Broadcast-multiply left and right to make a tensor of shape (m, o, n).
4. Sum the last dimension to make a tensor of shape (m, o, 1).
5. Reshape to (m, o).

The problem is step 3, where an intermediate tensor is created of size 512 x 512 x 512, exactly 512MiB! The intermediate allocation scales very poorly with tensor size.
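The numbers are easy to verify with a bit of arithmetic. The two helpers below are hypothetical and only assume 4-byte f32 elements, which is what the tensors in this post contain.

```rust
/// Bytes needed for the (m, o, n) intermediate produced by the
/// broadcast-multiply step, assuming 4-byte f32 elements.
fn broadcast_matmul_intermediate_bytes(m: usize, n: usize, o: usize) -> usize {
    m * o * n * std::mem::size_of::<f32>()
}

/// Bytes needed for a single (rows, cols) f32 matrix.
fn matrix_bytes(rows: usize, cols: usize) -> usize {
    rows * cols * std::mem::size_of::<f32>()
}
```

For 512x512 inputs, matrix_bytes gives 1048576 (1 MiB) while the intermediate needs 536870912 bytes, the exact number in the error message. Doubling the side length multiplies the inputs by 4 but the intermediate by 8.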

Finally efficient with fused multiply-add

We're not breaking new ground here. The problem of matrix multiplication has been solved before. The solution is to avoid the intermediate buffer by fusing the multiply and the sum. Practically, this means adding an extra operation to RawTensor:

/// Multiply self with other element-wise, and sum-reduce the given dimensions, 
/// in one fused operation.
fn fused_multiply_add(&self, other: &Self, axes: &[usize]) -> Self;

In other words, left.fused_multiply_add(right, axes) is functionally equivalent to left.mul(right).sum(axes) while avoiding the intermediate allocation, and allowing other optimizations as well.
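To make the difference concrete, here is a small CPU sketch of both approaches on plain row-major slices. It is not Tensorken's implementation: matmul_unfused and matmul_fused are illustrative names, and the fused version uses f32::mul_add from the Rust standard library as a stand-in for a fused multiply-add.

```rust
/// Matrix multiply that materializes the (m, o, n) intermediate and then sums
/// the last dimension, mirroring the broadcast-multiply approach.
fn matmul_unfused(left: &[f32], right: &[f32], m: usize, n: usize, o: usize) -> Vec<f32> {
    let mut intermediate = Vec::with_capacity(m * o * n);
    for i in 0..m {
        for j in 0..o {
            for k in 0..n {
                intermediate.push(left[i * n + k] * right[k * o + j]);
            }
        }
    }
    // Each chunk of n products collapses into one output element.
    intermediate
        .chunks(n)
        .map(|chunk| chunk.iter().sum::<f32>())
        .collect()
}

/// Same result, but multiply and sum are fused: one accumulator per output
/// element and no intermediate allocation.
fn matmul_fused(left: &[f32], right: &[f32], m: usize, n: usize, o: usize) -> Vec<f32> {
    let mut out = vec![0.0_f32; m * o];
    for i in 0..m {
        for j in 0..o {
            let mut acc = 0.0_f32;
            for k in 0..n {
                // mul_add computes l * r + acc with a single rounding step.
                acc = left[i * n + k].mul_add(right[k * o + j], acc);
            }
            out[i * o + j] = acc;
        }
    }
    out
}
```

Both functions return the same (m, o) matrix, but the unfused one allocates m * o * n floats along the way, while the fused one only ever allocates the output.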

After understanding the shaders for binary and reduce operations, the implementation of fused_multiply_add on the CPU or on the GPU doesn't offer many new insights. On the GPU, I decided not to parallelize the sum, but use a simple loop instead. For now, this limits parallelism to a single thread when multiply-adding two big vectors, because then the result is a scalar.

One interesting note is that WGSL has a special-purpose fma built-in: the result of fma(l, r, s) is l * r + s, which GPUs can typically execute as a single fused instruction. The shader uses this built-in to good effect.

Now let's see the results of our matrix multiply benchmark.

I had to turn off testing on the CPU for tensors greater than 256x256 because it was taking too long. That just goes to show how abysmally bad my CPU implementation is - for a competent implementation please use something like matrixmultiply or BLAS. What's great is that a GPU implementation at the same level of incompetence performs and scales remarkably well. That said, undoubtedly the GPU implementation also leaves significant performance gains on the table.

Conclusion

Thanks for reading to the end! I hope you enjoyed this meandering exploration of GPU programming, parallel algorithms, and tensor operations. I certainly had a blast researching and building all this. Despite the already encyclopedic length of this post, there is much left to do! I encourage you to fork the repo and run experiments.

You can reach me on Twitter or Mastodon with comments and questions. I'm not so active there anymore, but I do check in once in a while. For longer form consider opening an issue on the tensorken repository.

In the next post in the “Tensors from Scratch” series, I plan to add automatic differentiation.

Subscribe over at Substack to get it in your inbox.

References

Blelloch, Guy E. 1990. "Prefix Sums and Their Applications." Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University.
GPU Gems 3: Parallel Prefix Sum (Scan) with CUDA
All of the cores, none of the canvas: A getting started tutorial for writing WebGPU compute shaders in JavaScript. Very helpful to learn about the concepts.
Get started with GPU Compute on the web: Another WebGPU tutorial for JavaScript.
Learn Wgpu: an excellent tutorial on getting started with wgpu in Rust. Focused on graphics pipelines, but helpful to set up Rust scaffolding and get an overview of wgpu.
WebGPU fundamentals: An (unfinished) website to explain some WebGPU concepts.
Compute Shader Glossary: One barrier to learning and talking about GPU compute is the bewildering terminology. This is an annotated glossary of some of these terms.
A trip through the Graphics Pipeline 2011: Deeper dive in compute shaders down to the hardware.
Slides: A deep dive in GPU architecture from the Chromium WebGPU lead.
WebGPU Shading Language Specification (W3C Working Draft)
WebGPU limits for workgroup and dispatch: A reference on WebGPU's limits on workgroup sizes and counts.
WebGPU best practices + slide deck: Useful practices and performance tips.
StackOverflow: A good explanation of how the maximum total number of invocations within a single dispatch call works. For Vulkan but concepts translate to WebGPU.
StackOverflow: What does storageBarrier in WebGPU actually do?
Wonnx is a GPU-accelerated ONNX inference run-time written 100% in Rust, ready for the web. I looked at their shaders, and how they figure out workgroup size and count.

Fun and Hackable Tensors in Rust, From Scratch

Kurt — Mon, 01 May 2023 14:39:15 +0000

Over the past year, deep learning has been a hot topic with the release of new language and generative models like ChatGPT, Llama, Stable Diffusion, and Midjourney. These models are impressive, and I recommend trying them out if you haven't already. I use GitHub Copilot daily, and now I feel lost when it's not accessible, despite my initial hesitation towards it.

But how does deep learning work? Let's find out in the craziest way possible: by building a PyTorch from the ground up.

Image generated by Stable Diffusion.

My imposter syndrome compels me to inform you that I have little real-world experience with deep learning or programming GPUs, or working with tensor libraries like NumPy or PyTorch. I hope to provide a useful entry point for your explorations if you're in a similar spot. The library I'm trying to build, which I've called Tensorken, is not finished at the time of this writing. It's all about the journey!

In this post, we will explore a crucial aspect of deep learning frameworks, which is executing tensor operations. Along the way, I'll provide an overview of these operations and how they function, giving you a tour of a subset of NumPy.

Prerequisites: intermediate programming experience. Some Rust knowledge is useful, especially if you want to play around with Tensorken! The concepts themselves are independent of Rust, and I won't be using advanced Rust features.

No knowledge about tensors as used in NumPy or PyTorch is expected: we'll find out about tensors, broadcasting, shapes, dimensions, and other strange beasts as we go along.

All code for Tensorken is on GitHub. The version this post discusses is tagged v0.1, and all further links are that version.

What does it take to build a neural network?

The three main contenders among deep learning frameworks at the time of this writing are TensorFlow (Google), PyTorch (Meta), and JAX (Google). If you're wondering why Google produced two libraries, they seem to be phasing out TensorFlow for JAX, after TensorFlow lost the popularity battle with PyTorch.

While these libraries are gazillions of lines of code each, my understanding is they consist of the following main parts:

A tensor library. A tensor is a multi-dimensional array - something programmers are pretty familiar with. However, as we'll see, multi-dimensional arrays in programming languages are pretty boring: you can just index in them and get a number out. Tensor libraries have many more higher-level (and frankly bewildering) operations to slice, dice, and apply bulk operations to tensors.
One or more accelerators. While it's possible to execute tensor operations on the CPU, operating on huge sets of floating point numbers is what GPUs and specialized hardware like TPUs (Tensor Processing Units) excel at. All deep learning libraries can transparently execute tensor operations on the GPU, accelerating them greatly.
Automatic differentiation. Neural networks are trained via an optimization process called gradient descent. Efficiently calculating gradients, i.e. derivatives, is of critical importance. Every library has a component that automatically calculates the gradient of a program expressed in terms of tensor operations.
Neural network building blocks. Interestingly, this is not where the meat is in terms of implementation: once you have the three things above, you're 90% there.

This post focuses on implementing a tensor library on the CPU and laying the groundwork for extending it with accelerators and automatic differentiation. We'll become familiar with commonly used operations on tensors, and how they are implemented.

I'll introduce what a tensor is, and what it's used for in the context of neural networks. I'll give a brief tour of some commonly used tensor operations. Then I'll describe a minimal yet powerful set of fundamental tensor operations. The idea is to compose useful user-facing tensor operations from these simple, lower-level operations. I'll then move on to describe an implementation strategy for representing tensors, and I'll implement the operations on the CPU. Finally, to convince you that the minimal operations are indeed powerful, I'll show how to implement matrix multiplication and create an "eye" matrix (a matrix with ones on the diagonal and zeros elsewhere).

The design of Tensorken is greatly inspired by tinygrad. Tinygrad is much more advanced and runs real models like Stable Diffusion.

Ten-what?

One tensor library you've probably heard about is NumPy, which describes itself as "...a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays ...". From a programmer's perspective, tensors are the most boring of data structures: an n-dimensional array of homogeneous primitive type, with a size fixed at creation. They typically contain floating point numbers. The power of tensor libraries is the efficient implementation of a collection of useful tensor operations.

In the context of neural networks, tensor operations are everywhere. Neural networks don't have much to do with networks or neurons. A better name would be "tensor multiplication layers", but that's decidedly less catchy. (I was fired from the Naming Things committee ages ago).

My mental model of neural networks is as programs that take a tensor of floating point numbers, typically called X, as input. They spit out another tensor of floating point numbers Y as output. Y is produced by applying various operations to X, mostly matrix multiplications and elementwise function applications. Neural networks are organized in layers - "deep" in deep learning refers to the number of "hidden", or intermediate layers that are involved in eventually producing the Y. So we have something like X -> hidden layer -> H1 -> hidden layer -> H2 ... -> Y. Each hidden layer typically consists of a linear operation - i.e. a bunch of matrix multiplications - followed by a non-linear "activation function". The term activation again seems to refer to the idea that each output of a layer represents activating neurons, but I would say that has about as much to do with real neurons as my wedding ring with the Crown Jewels.
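As a rough sketch of that mental model, here is one hidden layer on plain slices. Everything in it is an illustrative assumption rather than Tensorken code: the name hidden_layer, the row-major (inputs, outputs) weight layout, and the choice of ReLU (clamping negative values to zero) as the activation function.

```rust
/// One hidden layer: a linear operation followed by a non-linear activation.
/// `weights` is a row-major (inputs, outputs) matrix, `x` has `inputs`
/// elements and `bias` has `outputs` elements. ReLU is an assumed activation.
fn hidden_layer(x: &[f32], weights: &[f32], bias: &[f32]) -> Vec<f32> {
    let outputs = bias.len();
    (0..outputs)
        .map(|j| {
            // Linear part: dot product of x with column j, plus the bias.
            let linear: f32 = x
                .iter()
                .enumerate()
                .map(|(i, xi)| xi * weights[i * outputs + j])
                .sum::<f32>()
                + bias[j];
            // Non-linear part: ReLU.
            linear.max(0.0)
        })
        .collect()
}
```

Feeding the output of one call into the next gives the X -> H1 -> H2 ... -> Y chain: each call takes a tensor in and hands a tensor out.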

As an aside, a fascinating take on how this came to be the dominant architecture for AI is Sara Hooker's Hardware Lottery.

If you were completely unfamiliar with neural networks before reading this I'm fully expecting you to have all kinds of questions. I'm only hoping to convey that neural networks are short programs that mainly manipulate tensors. This allows them to run very efficiently on specialized, highly parallel hardware.

Don't just stand there, tensor, operate

Hopefully, that justifies spending some time learning how tensors work.

There are a few terms you need to be aware of. A tensor can have any number of dimensions or axes. Each axis is indexable - conventionally zero-based - and has a length. The shape of an n-dimensional tensor is an n-tuple that represents the length of each dimension. The product of the lengths of the shape is the total number of elements in the tensor - called the tensor's size.

Here are a few examples:

A tensor t with shape (37) is a one-dimensional vector with 37 elements. You get the first element using t[0]. It is usually interpreted the same way as shape (1, 37), a row vector or 1×37 matrix, except that with the latter shape you'd have to use t[0, 0] to get the first element.
A tensor t with shape (37, 1) is interpreted as a column vector or a 37x1 matrix with 37 elements. Its first element is t[0, 0].
A tensor t with shape (5, 5) is a square 5 by 5 matrix, which contains 25 elements.
A tensor t with shape (2, 3, 4) is a three-dimensional tensor with 24 elements. Its first element is t[0, 0, 0] and its last t[1, 2, 3].

Conventionally people think of the last two dimensions as rows and columns.

I'll illustrate further by showing some usages of Tensorken. The names and operations - and occasionally mind-boggling semantics - are similar to NumPy's. To follow along, clone the repo and check out the tour.rs example.

One difference with NumPy (but not PyTorch) is that tensors in Tensorken can live on the CPU or the GPU, and for the moment this is tracked in their type.

Further, tensors in NumPy have a dtype, short for data type, which is the type of their elements. In Tensorken, this is tracked in the static type of the tensor.

Here's the workhorse of tensor creation, new. It accepts a shape and a 1-dimensional Rust slice (if you're not familiar with Rust slices, you can think of them as read-only views on arrays):

>>> Tensor::<CpuRawTensor<f32>>::new(&[3, 2], &[0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
[0 1]
[2 3]
[4 5]

Due to the given shape [3,2] Tensorken interprets the slice as a 3×2 matrix and arranges the slice's contents conventionally in row-major order. In other words, it maps the indices of the slice to indices of the matrix as follows:

slice[0] -> matrix[0, 0]
slice[1] -> matrix[0, 1]
slice[2] -> matrix[1, 0]
slice[3] -> matrix[1, 1]
slice[4] -> matrix[2, 0]
slice[5] -> matrix[2, 1]

This concept is important to understand, and I'll revisit it several times throughout this post, so don't worry if it doesn't fully click yet. It may be helpful to think of the index in the tensor as "counting up" in a number system with bases given by the shape. From that perspective, indexing in a tensor with shape [10, 10, 10] is like counting in decimal up to 999, and should feel very familiar.
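The "counting up" idea translates almost literally to code. The two functions below are a sketch on plain slices, with names borrowed loosely from NumPy's unravel_index and ravel_multi_index. They are not part of Tensorken.

```rust
/// Converts a flat, row-major buffer index into a multi-dimensional index.
fn unravel_index(mut flat: usize, shape: &[usize]) -> Vec<usize> {
    let mut index = vec![0; shape.len()];
    // Peel off "digits", starting from the last (fastest-changing) dimension.
    for (slot, &len) in index.iter_mut().zip(shape).rev() {
        *slot = flat % len;
        flat /= len;
    }
    index
}

/// The inverse: converts a multi-dimensional index into a flat buffer index.
fn ravel_index(index: &[usize], shape: &[usize]) -> usize {
    index
        .iter()
        .zip(shape)
        .fold(0, |flat, (&i, &len)| flat * len + i)
}
```

With shape [3, 2], unravel_index(3, ...) returns [1, 1], matching the table above. With shape [10, 10, 10], the "digits" of 999 are simply [9, 9, 9].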

Tensor::<CpuRawTensor<f32>> is quite a mouthful. Tensorken has a few type aliases to make it shorter, like Cpu32. Since we'll only work with tensors on the CPU containing 32-bit floats, I've aliased that again to Tr:

type Tr = Cpu32;

For people familiar with NumPy, you can think of this as the counterpart for import numpy as np.

Now that we can create tensors, let's crack on with examples.

Unary operations

Tensor operations can be broadly categorized into a handful of classes. The easiest to understand are unary operations, which take a tensor and produce a new tensor by applying a function to each element.

>>> let t = Tr::new(&[3, 2], &[0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
[0 1]
[2 3]
[4 5]

>>> t.exp()
[1 2.7182817]
[7.389056 20.085537]
[54.59815 148.41316]

>>> t.log()
[-inf 0]
[0.69314724 1.0986124]
[1.3862945 1.6094381]

There is no in-place mutation in Tensorken - all operations return a new tensor.

In functional programming terms, these operations are maps: they apply a function to each element of an input tensor, returning a new tensor while preserving the input tensor's shape.

Binary operations and broadcasting

Next, a few binary operations, which take two tensors as input.

>>> let t1 = &Tr::new(&[2, 2], &[0.0, 1.0, 2.0, 3.0])
[0 1]
[2 3]

>>> let t2 = &Tr::new(&[2, 2], &[6.0, 7.0, 8.0, 9.0])
[6 7]
[8 9]

>>> t1 + t2
[6 8]
[10 12]

>>> t1 * t2
[0 7]
[16 27]

The basic binary operations are addition, subtraction, multiplication, and division. They operate element-wise - in particular t1 * t2 is not matrix multiplication. Since they operate element-wise, the shapes of their inputs must match.

In functional programming terms, these operations are like map2 or zip: they apply a binary function elementwise to two tensors.

I said the input tensors' shape must match, but I didn't explain what that means. Trivially, two shapes match if they are identical. However, tensor operations support broadcasting which allows different shapes to match under certain constraints. The simplest example is adding a scalar to a row vector:

>>> let t1 = &Tr::new(&[6], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[2 1 4 2 8 4]

>>> let s1 = &Tr::scalar(2.0)
[2]

>>> t1 + s1
[4 3 6 4 10 6]

Tensorken has done the sensible thing of adding the scalar to each element of the row vector t1. However, the shape of t1 is [6], while the shape of s1 is [1]. These shapes match according to the first rule of broadcasting:

Two shapes match if:

they have the same number of dimensions, and
their dimensions are pairwise of equal length or either is of length 1.

More mechanically, a dimension of length 1 is "broadcasted" as many times as necessary to match the greater length.

Broadcasting generalizes to more than one dimension:

>>> let t1 = &Tr::new(&[3, 2], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[2 1]
[4 2]
[8 4]

>>> let t2 = &Tr::new(&[1, 2], &[10.0, 100.0])
[10 100]

>>> t1 + t2 // broadcast row [10 100] 3 times
[12 101]
[14 102]
[18 104]

We can also broadcast along the last dimension:

>>> let t3 = &Tr::new(&[3, 1], &[10.0, 100.0, 1000.])
[10]
[100]
[1000]

>>> t1 + t3 // broadcast column 2 times
[12 11]
[104 102]
[1008 1004]

Pretty confusing, and we're not done yet.

We can not only add a scalar to a vector, but to tensors with any shape:

>>> let t1 = &Tr::new(&[2, 3], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[2 1 4]
[2 8 4]

>>> let s1 = &Tr::scalar(2.0)
[2]

>>> (t1.shape(), s1.shape())
([2, 3], [1])

>>> t1 + s1
[4 3 6]
[4 10 6]

Again, the scalar value is added to each element of the matrix, which seems sensible. But how can Tensorken add two tensors whose shapes have a different number of dimensions?

First realize that you can add any number of dimensions of length 1 anywhere in a shape, without substantially affecting the tensor. Row vectors may become column vectors or vice versa, but that's just a matter of interpretation. The number of elements and their order remains the same. So we can think of the shape [1] as [1, 1], and the latter shape does match with [2, 3] according to the first rule of broadcasting. The question remains, where do we add these dimensions of length 1? The second rule of broadcasting clarifies that:

If two shapes have a different number of dimensions, add dimensions of length 1 to the front of the shortest shape.

Let's see another example of that in action.

>>> let t1 = &Tr::new(&[3, 2], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[2 1]
[4 2]
[8 4]

>>> let t2 = &Tr::new(&[2], &[10.0, 100.0])
[10 100]

>>> (t1.shape(), t2.shape())
([3, 2], [2])

>>> t1 + t2
[12 101]
[14 102]
[18 104]

Here we add a shape [3, 2] to shape [2]. If we align them, and apply the broadcast rules:

3 2
  2

=> Apply rule 2: add dimensions of length 1 to the front until the number of dimensions matches

3 2
1 2

=> Check rule 1: these shapes match. Broadcast row [10 100] 3 times.

Resulting shape:

3 2
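The two rules fit in a short function. This is a sketch of the rules as stated here, working on shapes only. The name broadcast_shapes is borrowed from NumPy and is not a Tensorken function.

```rust
/// Returns the broadcasted shape of `a` and `b`, or None if they don't match.
fn broadcast_shapes(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let ndims = a.len().max(b.len());
    // Rule 2: pad the shorter shape with leading dimensions of length 1.
    let pad = |shape: &[usize]| -> Vec<usize> {
        let mut padded = vec![1; ndims - shape.len()];
        padded.extend_from_slice(shape);
        padded
    };
    // Rule 1: lengths must be pairwise equal, or one of them must be 1.
    pad(a)
        .into_iter()
        .zip(pad(b))
        .map(|(x, y)| match (x, y) {
            _ if x == y => Some(x),
            (1, other) | (other, 1) => Some(other),
            _ => None,
        })
        .collect()
}
```

For the example above, broadcast_shapes(&[3, 2], &[2]) returns Some([3, 2]). Shapes [3, 2] and [3] don't match: after padding, the last dimensions are 2 and 3, and neither is 1.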

Looks like a massive foot gun, right? I agree. Even Andrej Karpathy thinks so.

Broadcasting is implemented efficiently: it does not work by copying data. I'll come back to this in a later section. For now, rest assured that neither input tensor allocates memory during broadcasting.

Reduction operations

The third category is reduction operations. These take a single tensor as input, but change its shape by accumulating elements along one or more dimensions. The dimensions that are accumulated become of length 1.

Here's a simple example:

>>> let t = &Tr::new(&[4], &[0.0, 1.0, 2.0, 3.0])
[0 1 2 3]

>>> t.sum(&[0])
[6]

We can also accumulate along more than one dimension:

>>> let t = &Tr::new(&[2, 2], &[0.0, 1.0, 2.0, 3.0])
[0 1]
[2 3]

>>> t.sum(&[0, 1])
[6]

The foot gun comes out again when we don't accumulate all dimensions - using the same tensor t:

>>> t.sum(&[0])
[2 4]

>>> t.sum(&[1])
[1]
[5]

In the first sum we accumulate over dimension 0, which is rows. So we add rows together, and end up with a tensor with one row. In the second, we accumulate over dimension 1, which is columns, hence we add columns together and end up with a tensor with one column. All dimensions that you reduce over become length 1 in the result: the first sum has shape [1, 2], and the second has shape [2, 1].
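If the direction still feels slippery, spelling out the loops may help. Below is a sketch for the two-dimensional case only, on a row-major slice. sum_axis_2d is a made-up helper, not how Tensorken implements sum.

```rust
/// Sums a row-major (rows, cols) matrix along `axis`, keeping the reduced
/// dimension with length 1. Returns the result buffer and its shape.
fn sum_axis_2d(data: &[f32], rows: usize, cols: usize, axis: usize) -> (Vec<f32>, [usize; 2]) {
    match axis {
        0 => {
            // Add the rows together: one accumulator per column.
            let mut out = vec![0.0_f32; cols];
            for r in 0..rows {
                for c in 0..cols {
                    out[c] += data[r * cols + c];
                }
            }
            (out, [1, cols])
        }
        1 => {
            // Add the columns together: one accumulator per row.
            let out: Vec<f32> = data
                .chunks(cols)
                .map(|row| row.iter().sum::<f32>())
                .collect();
            (out, [rows, 1])
        }
        _ => panic!("a 2D tensor only has axes 0 and 1"),
    }
}
```

For the 2x2 tensor t above, axis 0 yields [2, 4] with shape [1, 2], and axis 1 yields [1, 5] with shape [2, 1].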

The foot guns of broadcasting and reduce explosively combine to a truly impressive BFG (Big Foot Gun. What were you thinking?). For example, say you have a tensor P of shape [26, 26] that counts how many times each letter in the alphabet is followed by another, in a particular dataset (these are called bigrams). For example, the element at index [0, 0] indicates how many times the letter 'a' was followed by another 'a'. Now we'd like to normalize this tensor, by dividing each element in each row by the sum of counts of that row. The result should give us transition probabilities per row: how likely is it that 'a' is followed by 'a', 'b', 'c',... etc. Now there are a couple of possibilities: P / P.sum(&[0]) or P / P.sum(&[1]). Note that thanks to broadcasting both of these work, but have different effects. Can you work out which one has the desired effect?

Movement operations

The fourth and last category is movement operations. These don't change the content of tensors, but allow changing their shape.

A useful operation is taking slices, to make smaller tensors from existing ones:

>>> let t = &Tr::new(&[2, 2], &[0.0, 1.0, 2.0, 3.0])
[0 1]
[2 3]

>>> t.at(1)
[2 3]

>>> t.at(&[1, 0])
2

These should be pretty self-explanatory. at with a single argument slices the first (leftmost) dimension at the given index out of the tensor. at with a slice argument gets a single element. A more practical tensor library would have many more options here, all of which Tensorken in principle supports, but I haven't put in the effort of exposing that on Tensor yet. For Rust-specific reasons, it's some work to get everything working with a nice slicing syntax.

Slicing does not copy the underlying data - the underlying buffer is shared.

Another useful movement operation is reshape:

>>> let t = Tr::linspace(0.0, 23.0, 24)
[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]

>>> let t6x4 = t.reshape(&[6, 4])
[0 1 2 3]
[4 5 6 7]
[8 9 10 11]
[12 13 14 15]
[16 17 18 19]
[20 21 22 23]

>>> let t3x8 = t6x4.reshape(&[3, 8])
[0 1 2 3 4 5 6 7]
[8 9 10 11 12 13 14 15]
[16 17 18 19 20 21 22 23]

First, we've created a row vector with 24 elements using linspace, another name blatantly stolen from NumPy. linspace takes a range, the desired number of steps in that range, and returns a row vector with evenly spaced numbers in the range.

Then we reshape to a 6×4 matrix, and again to a 3×8 matrix. Neither of these reshapes copies the data; they all share a single buffer. In some cases reshape does have to copy. Unfortunately, it's not easy to explain when without going into implementation details, so I'll come back to this in the next section.

The final operation we'll consider is permute, which is a generalized matrix transposition. It applies a permutation to a tensor's dimensions. Permute re-orders dimensions but keeps the order of elements in each dimension the same. Illustrated by an example:

>>> t3x8.permute(&[1, 0])
[0 8 16]
[1 9 17]
[2 10 18]
[3 11 19]
[4 12 20]
[5 13 21]
[6 14 22]
[7 15 23]

In the original t3x8 tensor, keeping the first index constant at 0 and increasing the second index, you get elements 0, 1, 2, 3, ... 7. In the permuted tensor, you get the same elements if you keep the second index constant and increase the first index. Since permute is a re-ordering of the dimensions, it never needs to copy the underlying data.
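One way to convince yourself that no copy is needed: reading from a permuted tensor only requires re-ordering the index before looking it up in the original buffer. The sketch below assumes the NumPy convention that perm[d] names the original dimension that ends up at position d. permuted_at is a hypothetical helper, not Tensorken's API.

```rust
/// Reads an element of a permuted view without copying the row-major `data`.
/// `shape` is the original shape; `perm[d]` names the original dimension
/// that becomes dimension `d` of the permuted view.
fn permuted_at(data: &[f32], shape: &[usize], perm: &[usize], index: &[usize]) -> f32 {
    // Undo the permutation to recover the index in the original tensor.
    let mut original = vec![0; shape.len()];
    for (d, &p) in perm.iter().enumerate() {
        original[p] = index[d];
    }
    // Plain row-major lookup in the original buffer.
    let flat = original
        .iter()
        .zip(shape)
        .fold(0, |acc, (&i, &len)| acc * len + i);
    data[flat]
}
```

For t3x8 and the permutation [1, 0], index [0, 1] in the permuted tensor reads element [1, 0] of the original, which is 8, just like in the output above.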

Phew, you're about halfway through! This is a good time to take a break. The next section dives into Tensorken's internals.

Photo by Rumman Amin on Unsplash

It's a tensor, Jim, but not as we know it

After that whirlwind tour, let's move on to Tensorken's design. Tensorken is split into two layers:

A set of fundamental tensor operations, embodied in the RawTensor trait.
User-facing tensor operations, built out of the fundamental operations, in the Tensor type. Tensor is parametrized (via a generic type parameter) with a RawTensor implementation, and translates higher-level operations to the fundamental RawTensor operations. In addition, Tensor implements all the necessary traits like Add, Neg, etc. so tensors work with all the usual Rust operators.

The idea is that to implement a new kind of accelerator - whether it's via WebGPU, CUDA, torch, XLA, or any other - we only need to implement the operations in RawTensor. The backend then becomes immediately available to use with all the high-level tensor operations in Tensor! This is quite powerful and is borne out in Tensor's tests. You can write code that's generic on RawTensor, and so works with any backend:

fn fun<'t, RT: RawTensor>(t1: &'t Tensor<RT>, t2: &'t Tensor<RT>) -> Tensor<RT> {
    let r1 = t1.exp();
    let r2 = t2.log();
    let r3 = t1 + r1;
    // ...more tensor operations
    r6 + r5 + r7
}

And then run it on the CPU or GPU:

let t_cpu = &Cpu32::linspace(1.0, 6.0, 6).reshape(shape);
let r_cpu = fun(t_cpu, t_cpu);

let t_wgpu = &Wgpu32::linspace(1.0, 6.0, 6).reshape(shape);
let r_gpu = fun(t_wgpu, t_wgpu);

assert_tensor_eq(&r_cpu, &r_gpu);

Another way to think of this is that RawTensor is a final-style embedded DSL for accelerators, and enjoys all its advantages. In particular, it's relatively easy to extend RawTensor with optimized operations on a particular backend, and then make those available on Tensor only if you select that backend. If that doesn't make sense to you, check out my previous post.

I'll leave the WebGPU-based backend for another post, but you can of course look at the code. I'll first explain the RawTensor operations and why they are there, and then go into some more implementation details.

Tensors, raw

The operations on RawTensor can be classified in the same way as the high-level operations explained above.

The trait itself is simple:

pub trait RawTensor {
    type Elem: Num;

    // ...fns...
}

It has an associated type Elem, a placeholder for whatever you can put into tensors. Currently only f32 is supported, as there's only a single implementation for Num. Num is a simple trait that groups a handful of existing traits like Add and Mul, and adds a few other operations that the Rust standard library doesn't have a trait for, like log and pow.

I'll start with the straightforward unary and binary operations:

fn exp(&self) -> Self;
fn log(&self) -> Self;

fn add(&self, other: &Self) -> Self;
fn sub(&self, other: &Self) -> Self;
// + mul, div, pow, eq

All these operations borrow their inputs and return a new RawTensor. As a result, the Rust compiler allows no mutation (except via unsafe code or interior mutability, neither of which is used).

The unary operations do what they say on the tin. For a tensor library that's more widely targeted than neural networks, there'd probably be more here, but this is pretty much what we need for activation functions.

The same goes for the binary operations. Note that these operations are all elementwise, and don't need to support broadcasting - they can assume both their arguments have the same shape.

Next are the reduce operations:

fn sum(&self, axes: &[usize]) -> Self;
fn max(&self, axes: &[usize]) -> Self;

Next, creation and elimination operations:

fn new(shape: &[usize], data: &[Self::Elem]) -> Self;
fn shape(&self) -> &[usize];
fn ravel(&self) -> Vec<Self::Elem>;

new and shape we've already seen. ravel (again, the name is the same as in NumPy) returns a contiguous copy of the tensor's buffer as a Vec. Remember how I wrote in the previous section that it'd be important to understand how to "count up" the indices in a tensor? The symmetry is that new creates a tensor from a shape and a contiguous buffer, mapping each position in the buffer to an index in the tensor. ravel goes the opposite way: it creates a contiguous buffer by "counting up" the indices in the tensor and placing the elements into the buffer in that order. We thus have the invariant new(shape, buffer).ravel() === buffer.

Finally, movement operations:

fn reshape(&self, shape: &[usize]) -> Self;
fn permute(&self, permutation: &[usize]) -> Self;
fn expand(&self, shape: &[usize]) -> Self;
fn crop(&self, limits: &[(usize, usize)]) -> Self;
fn pad(&self, padding: &[(usize, usize)]) -> Self;

reshape and permute we've already seen in action earlier. expand is a primitive to allow broadcasting: it can replicate dimensions of length 1 to any desired length, but doesn't add any dimensions of length 1:

>>> let t = &Tr::new(&[1, 2, 2], &[0.0, 1.0, 2.0, 3.0])
+---------+
|         |
|  [0 1]  |
|  [2 3]  |
|         |
+---------+

>>> t.expand(&[5, 2, 2])
+---------------------------------------------+
|                                             |
|  [0 1]    [0 1]    [0 1]    [0 1]    [0 1]  |
|  [2 3]    [2 3]    [2 3]    [2 3]    [2 3]  |
|                                             |
+---------------------------------------------+

To implement broadcasted binary operations on the Tensor level, the process is roughly as follows:

Use reshape to add dimensions of length 1 to the front of the shortest shape.
Use expand on both shapes to increase any dimensions of length 1 to the length needed of the corresponding dimension in the other tensor.
Use the desired binary operator on the result.

Full code here.
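The shape bookkeeping behind these steps can be sketched without any tensor type at all. The function below is a hypothetical helper, not Tensorken's actual code: it pads the shorter shape with leading 1s (the reshape step) and then works out the common shape both operands would be expanded to, returning None when the shapes are incompatible.

```rust
/// Hypothetical helper (not part of Tensorken's API): computes the shape
/// two tensors would be expanded to before an elementwise operation.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    // Step 1: pad the shorter shape with leading 1s (what reshape would do).
    let pad = |s: &[usize]| -> Vec<usize> {
        let mut padded = vec![1; n - s.len()];
        padded.extend_from_slice(s);
        padded
    };
    let (a, b) = (pad(a), pad(b));
    // Step 2: per dimension, the lengths must match or one of them must be 1.
    a.iter()
        .zip(b.iter())
        .map(|(&la, &lb)| match (la, lb) {
            (x, y) if x == y => Some(x),
            (1, y) => Some(y),
            (x, 1) => Some(x),
            _ => None,
        })
        .collect()
}
```

For example, shapes [2, 2] and [5, 2, 2] broadcast to [5, 2, 2], while [3] and [4] have no common shape.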

Finally, we have crop and pad. crop is a limited slice operator: it creates a new tensor with a given sub-range of indexes in each dimension of the source tensor. crop is what underpins the implementation of at.

>>> let t = &Tr::new(&[3, 2], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[2 1]
[4 2]
[8 4]

>>> t.crop(&[(0, 2), (1, 2)])
[1]
[2]

pad is the opposite of crop: it adds a given number of 0.0 values before and after each dimension, increasing the tensor's shape:

>>> let t = &Tr::new(&[3, 2], &[2.0, 1.0, 4.0, 2.0, 8.0, 4.0])
[2 1]
[4 2]
[8 4]

>>> t.pad(&[(1, 2), (1, 3)])
[0 0 0 0 0 0]
[0 2 1 0 0 0]
[0 4 2 0 0 0]
[0 8 4 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]

pad is currently unused, but it will be once I introduce reverse-mode automatic differentiation, as all the operations will then need a suitable reverse.

Strides, shapes, and shenanigans

Now we come to CpuRawTensor, an implementation of RawTensor on the CPU. This part goes quite deep into implementation details - you may want to skim it on first reading.

The most constraining aspect of the implementation is the movement operations, and in particular, a desire to minimize copying of the data buffer. The idea is to have a shared set of read-only, contiguous buffers:

struct Buffer<T> {
  data: Vec<T>,
}

The CpuRawTensor struct has a reference counted pointer to a buffer and a mysterious ShapeStrider:

pub struct CpuRawTensor<T> {
  buffer: Rc<Buffer<T>>,
  strider: ShapeStrider,
}

The buffer is reference counted so we can share it among many tensors: this allows Tensorken to do certain operations without copying (part of) the buffer.

To understand what ShapeStrider does, we must get to the essence of what tensors are. Their backing data is stored in a contiguous buffer in memory, while tensors are addressable as dynamically-shaped multi-dimensional arrays. ShapeStrider translates between these two representations, by mapping indices in the tensor to an index in the buffer. It's a more malleable and extended version of what compilers do when you access an index in an array: they find the address in memory by multiplying the index with the size of the array items (and possibly add an offset, if the array index is not zero-based).

To a first approximation, ShapeStrider is defined as:

pub(crate) struct ShapeStrider {
  shape: Vec<usize>,
  strides: Vec<usize>,
}

The shape is a vector that has as many elements as there are dimensions in the tensor, and stores the length of each dimension. The strides vector also has as many elements as there are dimensions and stores the offset in the buffer between successive elements in that dimension.

Here's an illustration for a tensor of shape (3, 3) with strides (3, 1):

As we go left to right within a row, the stride 𝚜₁ is 1. As we go top to bottom along a column, the stride 𝚜₀ is 3.

Let's be more precise. We need to define a mapping 𝚋 from an n-dimensional index (𝚝₀, 𝚝₁, ..., 𝚝ₙ₋₁) in the tensor to a single index in the buffer. Given n-dimensional shape lengths (l₀, l₁, ..., lₙ₋₁) and strides (𝚜₀ , 𝚜₁ , ..., 𝚜ₙ₋₁), that's:

𝚋(𝚝₀,𝚝₁,...,𝚝ₙ₋₁) = ∑ᵢ 𝚜ᵢ⋅𝚝ᵢ

Or in code:

fn buffer_index(&self, index: &[usize]) -> usize {
  index.iter().zip(self.strides.iter()).map(|(&t, &s)| t * s).sum::<usize>()
}
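To make the formula concrete, here is the same mapping as a free function (a sketch that takes the strides as an argument instead of reading them from self), applied to the (3, 3) tensor with strides (3, 1) from the illustration.

```rust
/// Free-function version of `buffer_index`: the dot product of the
/// tensor index with the strides.
fn buffer_index(strides: &[usize], index: &[usize]) -> usize {
    index.iter().zip(strides.iter()).map(|(&t, &s)| t * s).sum()
}
```

With strides (3, 1), index (1, 2) maps to 1⋅3 + 2⋅1 = 5, and index (2, 1) maps to 2⋅3 + 1⋅1 = 7.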

Now let's turn to the problem of how to choose strides. The basic tensor constructor takes a shape and a contiguous buffer as arguments, so what we're missing are the strides. We're interpreting the buffer in row-major order, which means that consecutive elements in the tensor's rows are consecutive in the buffer. As a result, we know that the last (rightmost) stride sₙ₋₁ = 1.

We can now figure out the rest of the strides, right to left:

sₙ₋₁ = 1
sₙ₋₂ = sₙ₋₁⋅lₙ₋₁
...
s₀ = s₁⋅l₁

If the shape is well-formed with all lengths strictly positive, then all strides are strictly positive: ∀ᵢ sᵢ > 0.

In code:

let mut strides = vec![1; shape.len()];
for i in (0..shape.len() - 1).rev() {
    strides[i] = strides[i + 1] * shape[i + 1];
}

This code is kind of kludgy - to my slight surprise, Rust's standard library appears to be missing rscan (a reverse fold that returns a sequence of the intermediate results), while it does have scan, fold, and rfold. Also, I find (0..shape.len() - 1).rev() mentally taxing to read; I'd prefer a C-style loop here.
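One way to approximate the missing rscan is to reverse the iterator, scan, and reverse the result. This is just a sketch of an alternative, with row_major_strides as a made-up name; it isn't what Tensorken does.

```rust
/// Row-major strides for `shape`, computed right to left.
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides: Vec<usize> = shape
        .iter()
        .rev()
        // Running product of the lengths seen so far, starting at 1.
        .scan(1, |acc, &len| {
            let stride = *acc;
            *acc *= len;
            Some(stride)
        })
        .collect();
    // The scan produced the strides last-dimension-first, so flip them.
    strides.reverse();
    strides
}
```

A side benefit is that there is no shape.len() - 1 subtraction, which would underflow for an empty shape.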

With this, it turns out we can implement reshape, expand, and permute without copying the underlying buffer (spoiler: for the most part...). pad always copies the underlying buffer to a new buffer with space for the padding, and updates the shape accordingly. It's possible to pad virtually by keeping an additional Vec of paddings in ShapeStrider, but I've not implemented that. We can now also implement all unary, binary, and reduce operations.

Crop and pad

Interestingly, we get stuck with crop - remember that crop reduces the length of dimensions by removing a given number of elements at the beginning and end of each dimension. To crop the end of a dimension, there's no issue: we just change the length for that axis by subtracting the crop amount.

A naïve way to crop at the beginning is to keep an additional Vec<usize> of cropped offsets for each of the dimensions. We don't need to do that, because this vector corresponds to a unique buffer index, and we can just calculate and store that single buffer index. The full ShapeStrider struct is then:

pub(crate) struct ShapeStrider {
    shape: Vec<usize>,
    strides: Vec<usize>,
    offset: usize,
}

And we need to update our function 𝚋 accordingly to take offset 𝚘 into account:

𝚋(𝚝₀,𝚝₁,...,𝚝ₙ₋₁) = 𝚘 + ∑ᵢ 𝚜ᵢ⋅𝚝ᵢ

And in buffer_index:

self.offset 
+ index.iter().zip(self.strides.iter()).map(|(&i, &s)| i * s).sum::<usize>()

Now, to crop we keep strides the same, and update shape and offset:

fn crop(&self, limits: &[(usize, usize)]) -> ShapeStrider {
  let offset = self.buffer_index(&limits.iter().map(|&(start, _)| start).collect::<Vec<_>>());
  let shape = limits
      .iter()
      .map(|&(start, end)| end - start)
      .collect::<Vec<_>>();
  // ... return new ShapeStrider with shape and offset
}

Note how we can use buffer_index to calculate the new offset - it will automatically take the current offset into account. The new shape is simply the difference between the given limits. I've put validation in place separately, so at this point it's safe to assume this gives sensible results.
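Here is a self-contained sketch that puts the struct, buffer_index, and crop together. It leaves out validation entirely and fills in the elided return value in the obvious way, so read it as an illustration and not as the actual Tensorken source.

```rust
#[derive(Debug, Clone, PartialEq)]
struct ShapeStrider {
    shape: Vec<usize>,
    strides: Vec<usize>,
    offset: usize,
}

impl ShapeStrider {
    fn buffer_index(&self, index: &[usize]) -> usize {
        self.offset
            + index.iter().zip(self.strides.iter()).map(|(&i, &s)| i * s).sum::<usize>()
    }

    /// No-copy crop: strides stay the same, shape and offset change.
    /// Assumes `limits` has one in-bounds (start, end) pair per dimension.
    fn crop(&self, limits: &[(usize, usize)]) -> ShapeStrider {
        let starts: Vec<usize> = limits.iter().map(|&(start, _)| start).collect();
        ShapeStrider {
            shape: limits.iter().map(|&(start, end)| end - start).collect(),
            strides: self.strides.clone(),
            // buffer_index already includes the current offset.
            offset: self.buffer_index(&starts),
        }
    }
}
```

Cropping the (3, 2) tensor from the earlier crop example with limits (0, 2) and (1, 2) gives shape (2, 1) and offset 1. Indexes (0, 0) and (1, 0) then land on buffer positions 1 and 3, which hold the values 1 and 2.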

crop's counterpart pad is relatively uninteresting - it calculates the new shape by adding before and after padding to each dimension length, allocates a new buffer of the new size, and then copies the relevant elements from the existing buffer.

Expand

An interesting development occurs when implementing expand. Recall that expand changes the shape by repeating dimensions of length 1. For example, expanding a row vector of shape (1, 3) to shape (10, 3) yields 10 rows with copies of the original row. We can do that without actually copying out data to a new buffer by allowing strides to become zero, i.e. we relax our earlier constraint to ∀ᵢ sᵢ ≥ 0. expand is then a no-copy operation: the shape is expanded to the desired size, and strides for the expanded dimensions are set to 0. Even when the original strides for dimensions of length 1 were non-zero, the only index they could have been multiplied with is 0 (otherwise the index would be out of bounds), so changing the stride to 0 does not affect the resulting buffer index - it just allows passing any index to that dimension, where the 0 stride will now absorb it.
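The zero-stride trick is small enough to sketch with free functions. expand_strides is a hypothetical name, and the function only computes the new strides; the new shape is simply the requested one.

```rust
/// Sketch of a no-copy expand: returns the strides for `new_shape`.
/// Dimensions that grow from length 1 get stride 0; others keep theirs.
fn expand_strides(shape: &[usize], strides: &[usize], new_shape: &[usize]) -> Vec<usize> {
    assert_eq!(shape.len(), new_shape.len(), "expand doesn't add dimensions");
    shape
        .iter()
        .zip(strides)
        .zip(new_shape)
        .map(|((&len, &stride), &new_len)| {
            if len == new_len {
                stride
            } else {
                assert_eq!(len, 1, "only dimensions of length 1 can be expanded");
                0
            }
        })
        .collect()
}

fn buffer_index(strides: &[usize], index: &[usize]) -> usize {
    index.iter().zip(strides).map(|(&t, &s)| t * s).sum()
}
```

Expanding the (1, 3) row vector with strides (3, 1) to (10, 3) yields strides (0, 1), so index (7, 2) maps to the same buffer position as (0, 2).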

Permute

As for permute, it simply changes the shape and the strides according to the given permutation. You may be able to convince yourself that this is correct by thinking of the two-dimensional case - transposing a matrix means "flipping" the order of the indices. We can leave the offset alone since it was calculated by the addition of strides times indexes, which is commutative.

Here's an illustration of a 3×3 matrix that can share the same buffer with the earlier illustration, but the tensor is transposed thanks to the clever use of strides.
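As a sketch of the idea (again with free functions, not the actual ShapeStrider method), permute just re-orders shape and strides by the same permutation:

```rust
/// Sketch of a no-copy permute: dimension i of the result is
/// dimension permutation[i] of the input.
fn permute(
    shape: &[usize],
    strides: &[usize],
    permutation: &[usize],
) -> (Vec<usize>, Vec<usize>) {
    let new_shape: Vec<usize> = permutation.iter().map(|&p| shape[p]).collect();
    let new_strides: Vec<usize> = permutation.iter().map(|&p| strides[p]).collect();
    (new_shape, new_strides)
}

fn buffer_index(strides: &[usize], index: &[usize]) -> usize {
    index.iter().zip(strides).map(|(&t, &s)| t * s).sum()
}
```

For a 3×3 matrix with strides (3, 1), the permutation [1, 0] gives strides (1, 3). Index (0, 1) in the transposed tensor then maps to buffer index 3, the same position as index (1, 0) in the original.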

Reshape

The final movement operation is reshape, which is subtle and has a surprising interplay with permute. First, let's introduce the concept of a contiguous tensor. You can guess what a contiguous buffer is: a buffer whose elements are laid out contiguously in memory, i.e. all elements are next to each other. A contiguous tensor is similar, in that the tensor elements as enumerated in increasing row-major index order (0, ..., 0, 0), (0, ..., 0, 1), (0, ..., 0, 2), (0, ..., 1, 0), and so on, are stored in the buffer contiguously.
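This definition translates directly into a brute-force check: count up through all tensor indexes and verify the buffer indexes are consecutive. The function below is a sketch for building intuition; a real implementation would more likely compare strides than visit every element.

```rust
/// Checks contiguity by definition: visit every tensor index in row-major
/// order and verify that the buffer indexes are consecutive.
fn is_contiguous(shape: &[usize], strides: &[usize], offset: usize) -> bool {
    let total: usize = shape.iter().product();
    let mut index = vec![0usize; shape.len()];
    for expected in offset..offset + total {
        let actual = offset + index.iter().zip(strides).map(|(&t, &s)| t * s).sum::<usize>();
        if actual != expected {
            return false;
        }
        // "Count up" the index: increment the last dimension, carrying leftwards.
        for d in (0..shape.len()).rev() {
            index[d] += 1;
            if index[d] < shape[d] {
                break;
            }
            index[d] = 0;
        }
    }
    true
}
```

A tensor of shape (6, 2) with strides (2, 1) passes this check; the same buffer viewed with shape (2, 6) and strides (1, 2) does not.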

Why does this matter? Because it turns out that, if a tensor is contiguous, we can reshape it without copying, and the resulting tensor is also contiguous. That's because reshape is unable to alter the order or the total number of elements - it can only split or join dimensions. Remember that the product of the shape's lengths is the total number of elements, and reshape cannot change this product. From a different perspective, the shape of a tensor is a factorization of the total number of elements. So, reshape either:

Splits ("factors") a dimension into two or more dimensions, e.g. changing a shape (..., 6, ...) to (..., 3, 2, ...); or
Joins ("multiplies") two or more dimensions into one, e.g. changing a shape (..., 3, 6, ...) to (..., 18, ...).

If you accept that, then it should be understandable why reshaping a contiguous tensor doesn't need a copy: neither joining nor splitting dimensions affects the order of the elements. So we "simply" need to update the shape and strides to effect the reshape. Simply is in scare quotes there because the actual algorithm is relatively subtle - we're only given the new expected shape, and we have to figure out appropriate joins and splits from the old and new shape.
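For the fully contiguous case there is a shortcut worth sketching: since the element order is unchanged, the new strides are just the row-major strides of the new shape, and the offset stays put. This is an assumption-laden illustration (reshape_contiguous is a made-up name) and sidesteps the join/split analysis the general algorithm needs.

```rust
/// Sketch: strides after reshaping a *contiguous* tensor to `new_shape`.
/// Returns None if the total number of elements would change.
fn reshape_contiguous(shape: &[usize], new_shape: &[usize]) -> Option<Vec<usize>> {
    // reshape can't change the product of the lengths.
    if shape.iter().product::<usize>() != new_shape.iter().product::<usize>() {
        return None;
    }
    // Row-major strides of the new shape, computed right to left.
    let mut strides = vec![1; new_shape.len()];
    for i in (0..new_shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * new_shape[i + 1];
    }
    Some(strides)
}
```

Joining (3, 6) into (18) gives strides (1), and splitting (6) into (3, 2) gives strides (2, 1).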

At this point, it might be worth independently trying to think of the one movement operation that can make a tensor non-contiguous. Your options are reshape, permute, pad, crop, or expand.

(spoiler alert)

It is permute. This should become obvious if you think of a transposed matrix. We know that calling permute on a tensor re-orders the shape and strides. For example, if we start with a tensor of shape (6, 2), then its strides are (2, 1). As we increase the last (column) index, we increase the buffer index by 1. As we increase the first (row) index, we increase the buffer index by 2, because each row contains 2 elements. This tensor is contiguous because the mapping from tensor to buffer index is nice and clean:

(0, 0) -> 0
(0, 1) -> 1
(1, 0) -> 2
(1, 1) -> 3
(2, 0) -> 4
...
(5, 1) -> 11

Now let's transpose the matrix by calling permute(&[1, 0]). The resulting tensor has shape (2, 6) and strides (1, 2). Let's check if this is contiguous:

(0, 0) -> 0
(0, 1) -> 2
(0, 2) -> 4
(0, 3) -> 6
(0, 4) -> 8
(0, 5) -> 10
(1, 0) -> 1
...
(1, 5) -> 11

It's not! Now let's reshape this transposed tensor to shape (3, 4), and try to work out what the strides would be. Reshaping does not change the order, so we can duplicate the second column of our schema, and update the tensor indexes:

(0, 0) -> 0 <--
(0, 1) -> 2
(0, 2) -> 4
(0, 3) -> 6
(1, 0) -> 8 <--
(1, 1) -> 10
(1, 2) -> 1
(1, 3) -> 3
(2, 0) -> 5 <--
...
(2, 3) -> 11

It appears that for the last stride, we can choose 2 - so far so good, at least within the first row. But for the first stride, indicated with the arrows, there is a problem: there isn't a single stride that gets us through the sequence 0, 8, 5. The last stride doesn't survive the second row either, where 10 is followed by 1. Hence, in this case, reshape first copies the original tensor to a contiguous tensor, and then does a no-copy reshape.

Unary, binary, and reduce

Unary and binary operations are pretty straightforward. They always allocate a new buffer for the result. Reduce operations are more involved - the best way to think about reduce is that it's the opposite of expand. expand takes dimensions of length 1 and replicates them to dimensions of any length, while the reduce operations max and sum collapse some or all dimensions to length 1 by applying the appropriate accumulation function. The details are tricky but don't reveal any additional insights.
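To give a flavor of reduce without the tricky parts, here is a sketch that sums along a single axis. It assumes a contiguous row-major buffer with offset 0, which the real implementation can't assume, and sum_axis is a hypothetical name.

```rust
/// Sums a contiguous row-major tensor along `axis`, keeping that
/// dimension with length 1 (the reverse of what expand does to the shape).
fn sum_axis(shape: &[usize], data: &[f64], axis: usize) -> (Vec<usize>, Vec<f64>) {
    // View the tensor as [outer, len, inner] and accumulate over `len`.
    let outer: usize = shape[..axis].iter().product();
    let len = shape[axis];
    let inner: usize = shape[axis + 1..].iter().product();
    let mut out = vec![0.0; outer * inner];
    for o in 0..outer {
        for k in 0..len {
            for i in 0..inner {
                out[o * inner + i] += data[(o * len + k) * inner + i];
            }
        }
    }
    let mut out_shape = shape.to_vec();
    out_shape[axis] = 1;
    (out_shape, out)
}
```

Summing the (3, 2) tensor [2 1], [4 2], [8 4] along axis 0 gives shape (1, 2) with values 14 and 7; along axis 1 it gives shape (3, 1) with values 3, 6, and 12.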

Phew. That was quite the ride.

Eying matrix multiplication

Here are two examples of how to build interesting operations out of the basic operations.

Creating an eye matrix

An eye matrix of a given dimension d is a d×d matrix with 1s on the diagonal and zeros elsewhere. You can follow along with this via the eye.rs example.

>>> let dim = 3
>>> &Tr::eye(dim)
[1 0 0]
[0 1 0]
[0 0 1]

Here's how it is created - this approach is copied from tinygrad. eye could also be easily created using new, but this is more fun:

>>> let t = &Tr::scalar(1.0)
[1]

>>> let t = t.pad(&[(0, dim)])
[1 0 0 0]

>>> let t = t.reshape(&[1, dim + 1])
[1 0 0 0]

>>> let t = t.expand(&[dim, dim + 1])
[1 0 0 0]
[1 0 0 0]
[1 0 0 0]

>>> let t = t.reshape(&[dim * (dim + 1)])
[1 0 0 0 1 0 0 0 1 0 0 0]

>>> let t = t.crop(&[(0, dim * dim)])
[1 0 0 0 1 0 0 0 1]

>>> let t = t.reshape(&[dim, dim])
[1 0 0]
[0 1 0]
[0 0 1]
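The same trick works on a plain Vec, which may make it easier to see why it works. This sketch (eye_buffer is a made-up name) produces the row-major buffer you could pass to new together with the shape [dim, dim]:

```rust
/// Builds the row-major buffer of a dim x dim identity matrix with the
/// same idea as the pad/expand/crop sequence: repeat [1, 0, ..., 0]
/// (length dim + 1) and keep only the first dim * dim elements.
fn eye_buffer(dim: usize) -> Vec<f32> {
    let mut row = vec![0.0_f32; dim + 1];
    row[0] = 1.0;
    row.iter().copied().cycle().take(dim * dim).collect()
}
```

Because each repetition is one element longer than a matrix row, the 1 shifts one column to the right on every row, which is exactly the diagonal.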

Matrix multiplication

You may have been surprised that matrix multiplication is not part of the basic operations in RawTensor. For efficiency, it probably should be, but again inspired by tinygrad, Tensorken favors small over fast: we don't have to implement matrix multiplication separately for each accelerator.

As a secondary reason, showing how to create matrix multiplication out of the existing elementwise operations produces some interesting insights.

Here's the high-level interface to matrix multiplication, via matmul. Follow along via the matmul.rs example

>>> let l = Tr::linspace(0.0, 11.0, 12).reshape(&[3, 4])
[0 1 2 3]
[4 5 6 7]
[8 9 10 11]

>>> let r = Tr::linspace(12.0, 23.0, 12).reshape(&[4, 3])
[12 13 14]
[15 16 17]
[18 19 20]
[21 22 23]

>>> &l.matmul(&r)
[114 120 126]
[378 400 422]
[642 680 718]

Buckle up, we're going to use the combined firepower of reshape, permute, and broadcasting. Hopefully, aiming away from our feet.

The idea is to implement matrix multiplication using broadcasted elementwise multiplication, and then sum the appropriate dimensions.

First, let's massage the left matrix's shape:

>>> let s = l.shape()
[3, 4]
>>> let l_shape = [&s[..s.ndims() - 1], &[1, s[s.ndims() - 1]]].concat()
[3, 1, 4]
>>> let l = l.reshape(&l_shape)
+-----------------------------------------------+
| |
| [0 1 2 3] [4 5 6 7] [8 9 10 11] |
| |
+-----------------------------------------------+

Then, the right matrix:

>>> let s = r.shape()
[4, 3]
>>> let r_shape = [&s[..s.ndims() - 2], &[1], &s[s.ndims() - 2..]].concat()
[1, 4, 3]
>>> let r = r.reshape(&r_shape).transpose(r_shape.ndims() - 1, r_shape.ndims() - 2)
+-------------------+
| |
| [12 15 18 21] |
| [13 16 19 22] |
| [14 17 20 23] |
| |
+-------------------+

We now have shapes as follows:

l's shape: [3, 1, 4]
r's shape: [1, 3, 4]

These two shapes match per the broadcast rules: they have the same number of dimensions, and where the lengths differ one of them is 1.

As if by magic, if we elementwise broadcast-multiply l and r, we're multiplying the correct elements:

>>> let prod = &l * &r
+--------------------------------------------------------------+
| |
| [0 15 36 63] [48 75 108 147] [96 135 180 231] |
| [0 16 38 66] [52 80 114 154] [104 144 190 242] |
| [0 17 40 69] [56 85 120 161] [112 153 200 253] |
| |
+--------------------------------------------------------------+

This tensor has shape [3, 3, 4]. We're multiplying a 3×4 matrix with a 4×3 matrix, so we're expecting a 3×3 matrix. We sum over the last dimension to get us closer to that shape:

>>> let sum = prod.sum(&[prod.shape().ndims() - 1])
+------------------------+
| |
| [114] [378] [642] |
| [120] [400] [680] |
| [126] [422] [718] |
| |
+------------------------+
>>> let s = sum.shape()
[3, 3, 1]

This matrix has the right numbers! But its shape [3, 3, 1] is still slightly wrong. That part at least is easy - we "squeeze" out the last dimension:

>>> sum.reshape(&s[..s.ndims() - 1])
[114 120 126]
[378 400 422]
[642 680 718]

Mind blown yet?
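If the tensor version feels too magical, the same computation can be spelled out on plain slices. This is a sketch, not Tensorken code: it never materializes the [m, n, k] product, but the index arithmetic uses the strides that the reshape, transpose, and broadcast steps would produce.

```rust
/// Multiplies an (m x k) matrix by a (k x n) matrix, both row-major,
/// by summing the broadcast elementwise product over the last dimension.
fn matmul(l: &[f64], r: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    let mut out = Vec::with_capacity(m * n);
    for i in 0..m {
        for j in 0..n {
            // l viewed as [m, 1, k] and broadcast has strides (k, 0, 1);
            // r viewed as [1, n, k] after the transpose has strides (0, 1, n).
            let sum: f64 = (0..k).map(|x| l[i * k + x] * r[j + x * n]).sum();
            out.push(sum);
        }
    }
    out
}
```

Called with the buffers of l and r from the example and m = 3, k = 4, n = 3, this returns the nine values of the result matrix in row-major order, starting with 114, 120, 126.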

Improvements

I've hopefully qualified my explanations enough that by now you understand that Tensorken has many possible improvements - without even mentioning extending it to run on GPU or adding automatic differentiation.

I'll list a few I have thought of, but the main idea of Tensorken is that it's a testbed for your ideas. Clone it and hack away!

Functionality

One RawTensor operation I haven't implemented is strides which allows setting the strides directly. In addition, strides can be negative numbers as well. Both of these are needed to implement operations like flip that re-order axes back-to-front. Also, it allows slicing using a step, e.g. slicing every second element out of a row vector. I don't think this comes with any deep changes, but it is probably somewhat fiddly. For example, if a stride is negative you'd have to complexify the index to buffer mapping further by adding offsets for the inverted axes.
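To get a feel for what negative strides involve, here is one possible sketch of flip along a single axis. It is not implemented in Tensorken; the signed isize strides and the function names are assumptions. It folds the adjustment for the inverted axis into the single offset we already have.

```rust
/// Sketch of flip along `axis` with signed strides: negate the stride and
/// move the offset to the last element of that axis.
fn flip(shape: &[usize], strides: &[isize], offset: isize, axis: usize) -> (Vec<isize>, isize) {
    let mut new_strides = strides.to_vec();
    let new_offset = offset + (shape[axis] as isize - 1) * strides[axis];
    new_strides[axis] = -strides[axis];
    (new_strides, new_offset)
}

/// Buffer index with signed strides; the result must be non-negative
/// for any in-bounds tensor index.
fn buffer_index(strides: &[isize], offset: isize, index: &[usize]) -> usize {
    let i = offset + index.iter().zip(strides).map(|(&t, &s)| t as isize * s).sum::<isize>();
    i as usize
}
```

Flipping a row vector of length 4 with stride 1 gives stride -1 and offset 3, so tensor index 0 reads buffer position 3 and tensor index 3 reads buffer position 0.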

Ergonomics

Tensorken now panics whenever something is wrong, for example, if you try to add two tensors with incompatible shapes. I initially took this approach because it seemed simpler (fine fine, it was laziness). But it makes Tensorken pretty hard to use in an interactive setting, like a Jupyter notebook, because every panic takes out the kernel and you lose all state. It's worth trying to return Result instead from tensor operations that may fail.

Secondly, the slicing API is limiting. On the one hand, there are some issues with using Rust's built-in indexer, because the trait to implement for that returns a reference to the result, while all tensor operations return a new struct by value. Additionally and annoyingly, quite a few movement operations on tensors are relatively nice to express in Python's more expressive slicing syntax. The lack of a "reverse index", like Python's -1 to indicate the last element of a list, is especially limiting. Rust's ndarray has a purpose-built macro for slicing, which I think is overkill for Tensorken. Perhaps there is a nice compromise to be found.

Efficiency

As the previous section makes clear, Tensorken's high-level API is stitched together from the low-level RawTensor operations. Ideally, we should be able to do this with reckless abandon, performance-wise. This is currently certainly not the case: for example, in matmul there is at least one buffer allocation too many (the allocation of the intermediate elementwise multiplication, which is immediately reduced).

In a better world, Tensorken would fuse operations in a single compute kernel, allocating a single new buffer (if necessary) for the result. Fusing is even more important for GPU operations, as it avoids allocations as well as the launch of multiple compute kernels. GPUs are heavily optimized for bandwidth, so it's extra important to have chunky data and compute.

This could be done by writing a virtualized RawTensor implementation that lazily executes operations: we only need to compute the result if someone calls ravel or to_cpu. Once we have a graph of operations, we can normalize and fuse it however we wish, at least in principle. For example, if we can figure out that the result of a unary operation is used only by a subsequent unary operation, we could only allocate a single buffer and fuse the two operations.

This is a good chunk of work to be sure, but perhaps not as insurmountable as may seem.

One key idea is that it's straightforward to track shapes and strides through all operations, without executing them. When we know the expected shape and strides of the result, we can then work backward to find the elements from the input tensors to combine. We can work backward because the tensor to buffer index function 𝚋 is invertible - i.e. given a buffer index 𝚎, we can figure out which tensor index or indices map to it:

𝚋⁻¹(𝚎) = (..., ((𝚎-𝚘) ÷ sᵢ) % 𝚕ᵢ, ... )

Recall 𝚘 is the offset, and sᵢ and 𝚕ᵢ are the stride and length for dimension 𝚒 respectively.
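In code, the inverse could look like the sketch below. It assumes every stride is non-zero, so it doesn't cover expanded dimensions, where one buffer index maps to many tensor indices. I've only convinced myself it works for plain row-major tensors and their permutations.

```rust
/// Recovers the tensor index from buffer index `e`, one dimension at a time.
/// Assumes non-zero strides and that `e` belongs to the tensor.
fn tensor_index(shape: &[usize], strides: &[usize], offset: usize, e: usize) -> Vec<usize> {
    shape
        .iter()
        .zip(strides)
        .map(|(&l, &s)| ((e - offset) / s) % l)
        .collect()
}
```

For the transposed tensor of shape (2, 6) with strides (1, 2), buffer index 5 maps back to tensor index (1, 2), and indeed 1⋅1 + 2⋅2 = 5.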

In the implementation, it'd be nice to do this all in the final style, somewhat similar to how we pushed down negations in the interpreters from my last post.

You made it

My sincere congratulations for making it this far.

In future - hopefully shorter! - posts, I plan to give an overview of the GPU support via WebGPU and more precisely wgpu, which is Mozilla's implementation of the emerging WebGPU standard. I also want to add automatic differentiation to Tensorken - I have an advanced prototype in a branch, but it's not ready for prime time.

I hopefully made clear I'm not an expert in any of this, but I've made a significant effort to learn how existing libraries work. If you know better, please let me know any comments, suggestions, or improvements!

Thanks for reading! For more, subscribe to my newsletter and follow me on Twitter or Mastodon.

References

tinygrad: a wonderful, small deep learning framework in Python. Currently the main inspiration and guide for Tensorken. The fundamental operations in RawTensor are tinygrad's, as well as the separation in unary, binary, reduce, and movement operations.
micrograd: an even tinier implementation of a toy neural net library by Andrej Karpathy. Only works on scalars on the CPU, but regardless very readable and hackable.
Neural Networks: Zero to Hero: A course by Andrej Karpathy on building neural networks, from scratch, in code. Not "from scratch" enough for me apparently, because for the real work he uses PyTorch. I've begun reproducing this course with Tensorken, and it drives features for the moment.
An Illustrated Guide to Shapes and Strides: a detailed, in-depth, and beautifully illustrated look at shapes and strides in NumPy arrays.
MiniTorch: Yet another small neural network library implementation, in Python, but this one focuses on the internals so is truly from scratch. The lesson notes have a decent overview of tensors and broadcasting. I didn't look at the code for this much, but the explanations were useful.
Online NumPy playground from w3schools: for quickly running some numpy functions.
NumPy: Tensors in Python.
ndarray: Tensors in Rust.

Efficient, Extensible, Expressive: Typed Tagless Final Interpreters in Rust

Kurt — Wed, 22 Mar 2023 10:38:43 +0000

An explanation of typed tagless final interpreters by Carette et al, with examples in Rust. The main contribution of this post is to explain what "typed tagless final" means, and show that Rust with generic associated types (GATs) is a good fit for writing interpreters in the final style.

A cool looking machine without obvious purpose. Image generated by Midjourney.

Programming involves using and creating APIs: for iterators (map, filter, collect, ...), plotting, data frames, testing, validation or simply formatting strings. When we're creating a new API, we typically think in terms of defining functions to allow users to work with our library. Our job as library implementers is to implement the functions and define the abstractions.

A different perspective is to think of these APIs as little languages, or domain-specific languages. As opposed to general-purpose languages, domain-specific languages or DSLs focus on solving one particular problem well.

From the DSL perspective, we're thinking in terms of defining language constructs, instead of functions. And our job as language implementers is to write an interpreter for the language. More specifically, we can think of it as embedding a DSL, say for formatting strings, into a general-purpose host language like Rust. With this perspective, the first argument to format!("Hello, {}!", "world"); is a small language to specify how to format values of various types: it even has a syntax.

In this post we'll focus extensively on the embedded language perspective - just keep in the back of your head that the applicability of these ideas is broader than interpreters, strictly speaking. As illustration, in the last part we describe a safe, type checked implementation of string formatting: just like with Rust's format strings, the arguments to the format specification are type checked, but we don't use macros or compiler extensions. Additionally, the same format specification can be used to do scanf, or parsing.

Full Rust code available in the companion repo.

Before we begin

This post contains dense Rust code, and not much in terms of visualizations. I assume proficiency with Rust, and familiarity with enums, generics, traits, and associated types.

The build-up and code examples follow Oleg Kiselyov's lecture notes on typed tagless final interpreters closely. All I did was translate the examples to Rust. As it turns out since Rust gained GATs (generic associated types) it's a pretty good vehicle for embedding languages in the final style.

This post has three parts.

Warming up: An introduction to writing interpreters for a simple language with addition, negation, and integers. We build up to the final style, address some of its apparent shortcomings, and show its extensibility advantages.
Higher-order fun: Writing an interpreter for a higher-order language, with functions, increases the complexity but also shows an important advantage of the final style.
Applying the final style: A translation of Kiselyov's type-safe formatting to Rust.

Part 1: Warming up

Consider first the typical way of defining a simple expression language as an embedded DSL. For reasons that will become clear later, the language we'll look at is just slightly more expressive than Hutton's razor: we'll write a calculator that can only add or negate integers. We'll introduce the language using the familiar style which Kiselyov calls "initial". Then, we'll build up to the final style, and discuss its trade-offs.

Expressions, initial style

We start with defining the abstract syntax tree for our language using a Rust enum:

enum Expr {
    Lit(i32), // integer literal
    Neg(Box<Expr>), // negation
    Add(Box<Expr>, Box<Expr>), // addition
}

With some helpful constructors:

impl Expr {
    fn lit(i: i32) -> Expr {
        Expr::Lit(i)
    }

    fn neg(r: Expr) -> Expr {
        Expr::Neg(Box::new(r))
    }

    fn add(r1: Expr, r2: Expr) -> Expr {
        Expr::Add(Box::new(r1), Box::new(r2))
    }
}

We can now define simple expressions in our language.

Expr::add(
    Expr::lit(8),
    Expr::neg(Expr::add(Expr::lit(1), Expr::lit(2))),
)

We can't do anything with those expressions yet. Let's try evaluating them by writing an interpreter.

impl Expr {
    fn eval(&self) -> i32 {
        match self {
            Expr::Lit(i) => *i,
            Expr::Neg(r) => -r.eval(),
            Expr::Add(r1, r2) => r1.eval() + r2.eval(),
        }
    }
}

Easy enough - we pattern match on Expr and recursively call eval. Calling eval on the example yields 5, as expected.

It's also straightforward to interpret the same expression differently. For example, we can view it:

impl Expr {
    fn view(&self) -> String {
        match self {
            Expr::Lit(i) => i.to_string(),
            Expr::Neg(r) => format!("(-{})", r.view()),
            Expr::Add(r1, r2) => format!("({} + {})", r1.view(), r2.view()),
        }
    }
}

Again, pattern matching is very useful here. Calling view on the example yields "(8 + (-(1 + 2)))".

Kiselyov calls this style of writing EDSLs the "initial" style. Loosely speaking, the language is defined as an abstract syntax tree using an enum and interpreted using pattern matching on expressions.

Expressions, final style

Suppose we just want to evaluate expressions. We could do that more simply by working with a representation of integers in the host language Rust directly.

type Repr = i32;

fn lit(i: i32) -> Repr {
    i
}

fn neg(r: Repr) -> Repr {
    -r
}

fn add(r1: Repr, r2: Repr) -> Repr {
    r1 + r2
}

Compared with the initial style, we now have a function corresponding to each enum variant. Writing expressions in this style looks very similar to writing expressions in the initial style:

add(lit(8), neg(add(lit(1), lit(2))))

Calling this function immediately evaluates the expression. The expression is the evaluator is the interpreter. There is no intermediate representation for us to pattern match. The interpreter uses the host language operations (addition, negation) and data types (integers) directly.

However, it is not flexible enough. We can't pretty-print expressions, or manipulate them in any way except evaluate them. What we'd like to do is overload these functions so we can define new implementations, and choose a different representation. That's exactly the kind of thing traits are for.

trait ExprSym {
    type Repr;

    fn lit(i: i32) -> Self::Repr;
    fn neg(r: Self::Repr) -> Self::Repr;
    fn add(r1: Self::Repr, r2: Self::Repr) -> Self::Repr;
}

All we've done is moved the top-level functions to a trait and added an associated type for the representation.

We can now re-implement the evaluator, as an implementation of this trait:

struct Eval;
impl ExprSym for Eval {
    type Repr = i32;

    fn lit(i: i32) -> Self::Repr {
        i
    }

    fn neg(r: Self::Repr) -> Self::Repr {
        -r
    }

    fn add(r1: Self::Repr, r2: Self::Repr) -> Self::Repr {
        r1 + r2
    }
}

Building an expression now needs access to a type parameter E: ExprSym - the counterpart for Expr in the initial style - so I'll show the full function. Other than that, it looks remarkably similar to the initial approach:

fn tf1<E: ExprSym>() -> E::Repr {
    E::add(E::lit(8), E::neg(E::add(E::lit(1), E::lit(2))))
}

And we can implement a viewer now:

struct View;
impl ExprSym for View {
    type Repr = String;

    fn lit(i: i32) -> Self::Repr {
        i.to_string()
    }

    fn neg(r: Self::Repr) -> Self::Repr {
        format!("(-{})", r)
    }

    fn add(r1: Self::Repr, r2: Self::Repr) -> Self::Repr {
        format!("({} + {})", r1, r2)
    }
}

Again, compared to the initial style, there is no intermediate representation. Integers in the embedded language are Rust integers. Addition is Rust addition. The final style certainly looks like a zero-cost abstraction, while the initial style may not be - or at least it'd need much more compiler optimizations to get rid of the enum variants and pattern matching.

A Rust-specific snag

The eagle-eyed among you may have noticed I haven't defined a final counterpart for eval and view yet. With the code above, you don't need it - evaluate using tf1::<Eval>() and view using tf1::<View>(). The type argument selects the interpreter.

Still, it'd be nice to be able to write eval(tf1()) or view(tf1()) instead. Let's try:

fn exprsym_eval(e: i32) -> i32 {
    e
}

fn exprsym_view(e: String) -> String {
    e
}

These trivial implementations reveal the lack of intermediate representation in the final style. The idea is to tell the compiler which interpreter (Eval or View) we want by constraining the type of representation. Once we do that, the act of calling tf1 with the right interpreter has already done all the work, and we can just return the result. Alas:

exprsym_eval(tf1());

> error[E0282]: type annotations needed
> cannot infer type of the type parameter `E` declared on the function `tf1`

Rust can't figure out E from an E::Repr - in the type signature fn tf1<E: ExprSym>() -> E::Repr, the function exprsym_eval constrains E::Repr to be i32, but the Rust compiler is not smart enough to see that the only possible E for E::Repr == i32 is Eval.

It would be cool if the compiler could do this. In the general case, there isn't always a one-to-one mapping of an associated type E::Repr to an E: ExprSym, because there may be multiple implementations of ExprSym that have the same Repr, in which case the compiler would have to ask for a type annotation to disambiguate.

We can give the compiler a hand by making the reverse mapping from Repr to ExprSym ourselves. Unfortunately, that means some boilerplate:

// map from E::Repr to E
trait HasExprSym {
    type ES: ExprSym;
}

// i32 -> Eval
impl HasExprSym for i32 {
    type ES = Eval;
}

// String -> View
impl HasExprSym for String {
    type ES = View;
}

// Introduce new type argument T = E::Repr, and wire everything up.
fn tf1<E: ExprSym<Repr = T>, T: HasExprSym<ES = E>>() -> T {
    E::add(E::lit(8), E::neg(E::add(E::lit(1), E::lit(2))))
}

With that, exprsym_eval(tf1()) compiles and runs without trouble.

Finally solving the expression problem

The expression problem is an (in)famous extensibility problem in statically typed programming languages. There are a few subtleties, but the long and the short of it is that it's hard to come up with an abstraction that is easily extensible with both behaviors and representations. To illustrate, let's consider the initial style once again.

Adding new behaviors in the initial style corresponds to adding new interpreters of our language. Easy! We can for example write a new function Expr.count that counts the number of sub-expressions, similar to Expr.view and Expr.eval. However, suppose we want to support multiplication in the EDSL. Now we have to rewrite our enum to add a new variant and modify all existing interpreters to deal with the new variant. If we can't access the source code for the enum or the interpreters, we're stuck. Object-oriented approaches to this problem have a complementary problem: they are easy to extend with new representations, like multiplication, but adding new behaviors requires modification of all classes.
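As a minimal sketch of such a new behavior, here is what counting sub-expressions could look like. The Expr definition is an assumption based on the variants used in this post (Lit, Neg, Add with boxed children); the real enum may differ in details.

```rust
// Assumed shape of the initial-style enum.
enum Expr {
    Lit(i32),
    Neg(Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
}

impl Expr {
    // A new interpreter: counts every node in the expression tree.
    // No existing code has to change to add it.
    fn count(&self) -> usize {
        match self {
            Expr::Lit(_) => 1,
            Expr::Neg(e) => 1 + e.count(),
            Expr::Add(l, r) => 1 + l.count() + r.count(),
        }
    }
}
```

Note that the match is exhaustive over today's variants. Adding an Expr::Mul variant later would make this function (and every other interpreter) fail to compile until it is updated, which is exactly the extensibility problem described above.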

In the final style, this tension does not exist. We've already shown new interpreters are easy to add, by implementing the ExprSym trait. Adding multiplication can be done without changing existing code, and in a type-safe way.

First, define a sub-trait of ExprSym which adds the new operations, and implement it for all interpreters that need to support the new operation:

trait MulExprSym: ExprSym {
    fn mul(r1: Self::Repr, r2: Self::Repr) -> Self::Repr;
}

impl MulExprSym for Eval {
    fn mul(r1: Self::Repr, r2: Self::Repr) -> Self::Repr {
        r1 * r2
    }
}

That's it. We can now use the new operation and interpreter:

fn tfm1<E: MulExprSym<Repr = T>, T: HasExprSym<ES = E>>() -> T {
    E::add(E::lit(7), E::neg(E::mul(E::lit(1), E::lit(2))))
}

let final_style = exprsym_eval(tfm1());
assert_eq!(5, final_style);

We haven't implemented MulExprSym for View yet. Crucially, the compiler will tell us as much:

exprsym_view(tfm1())

> error[E0277]: the trait bound `View: MulExprSym` is not satisfied

This is an equivalent of pattern match exhaustiveness checking that comes with the final style for free, i.e. without any special-purpose compiler check or linter.
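Making the error go away takes one more impl. The following is a sketch that assumes View represents expressions as a String and formats them the way the output shown in this post suggests; the traits are restated so the block stands alone, and tfm1 is written without the HasExprSym wiring to keep it short.

```rust
trait ExprSym {
    type Repr;
    fn lit(i: i32) -> Self::Repr;
    fn neg(r: Self::Repr) -> Self::Repr;
    fn add(r1: Self::Repr, r2: Self::Repr) -> Self::Repr;
}

trait MulExprSym: ExprSym {
    fn mul(r1: Self::Repr, r2: Self::Repr) -> Self::Repr;
}

// Assumed pretty-printer, modeled on the output shown in the post.
struct View;
impl ExprSym for View {
    type Repr = String;
    fn lit(i: i32) -> Self::Repr {
        i.to_string()
    }
    fn neg(r: Self::Repr) -> Self::Repr {
        format!("(-{})", r)
    }
    fn add(r1: Self::Repr, r2: Self::Repr) -> Self::Repr {
        format!("({} + {})", r1, r2)
    }
}

// The missing piece: View learns about multiplication.
impl MulExprSym for View {
    fn mul(r1: Self::Repr, r2: Self::Repr) -> Self::Repr {
        format!("({} * {})", r1, r2)
    }
}

// Simplified tfm1: the interpreter is selected with a type argument.
fn tfm1<E: MulExprSym>() -> E::Repr {
    E::add(E::lit(7), E::neg(E::mul(E::lit(1), E::lit(2))))
}
```

With this impl in place, tfm1::<View>() produces (7 + (-(1 * 2))), and neither ExprSym nor the existing View impl had to be touched.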

We never have to modify the original trait definition or implementations. They could live in another crate, and we could still extend them without modifying the source. This is one of the main advantages of the final style over the initial style. Soon, we'll move on to higher-order languages, where we'll see another major advantage of the final style.

But what about pattern matching

Before we do that though, it's worth exploring whether the final style can express all of the things that the initial style can. We're especially worried about the loss of pattern matching in the final style.

For example, let's consider the case where we want to push down negation to literals, getting rid of double negation. You can think of this as an example of an optimization or rewriting pass.

impl Expr {
    fn push_neg(self) -> Expr {
        match &self {
            Expr::Lit(_) => self,
            Expr::Neg(content) => match content.as_ref() {
                Expr::Lit(_) => self,
                Expr::Neg(c) => c.to_owned().push_neg(),
                Expr::Add(r1, r2) => Expr::add(
                    Expr::Neg(r1.clone()).push_neg(),
                    Expr::Neg(r2.clone()).push_neg(),
                ),
            },
            Expr::Add(r1, r2) => Expr::add(r1.to_owned().push_neg(), r2.to_owned().push_neg()),
        }
    }
}

push_neg transforms an expression like (8 + (-(1 + 2))) into (8 + ((-1) + (-2))). The nested pattern match reveals a context dependence: we can no longer use the simple compositional structure of eval, by calling eval on sub-expressions and putting the results back together.

A compounding problem is that the result of push_neg is a new expression, which we can then interpret further in many ways, like view or eval. Can we write push_neg in the final style, without pattern matching, and produce a new expression that can then be further interpreted?

The only thing we can do in the final style is writing an interpreter by implementing the ExprSym trait, so that's what we'll do.

The key to this seemingly unsolvable problem is to make context dependence explicit. In this case, the interpreter must keep track of whether an expression occurs as part of another negation. I'll show the code first, then discuss it.

enum Ctx {
    Pos,
    Neg,
}

struct CtxFun<TRepr>(Box<dyn Fn(&Ctx) -> TRepr>);

struct PushNeg<T>(PhantomData<T>);
impl<T: ExprSym> ExprSym for PushNeg<T>
where
    for<'a> T: 'a,
{
    type Repr = CtxFun<T::Repr>;

    fn lit(i: i32) -> Self::Repr {
        CtxFun::new(move |ctx| match ctx {
            Ctx::Pos => T::lit(i),
            Ctx::Neg => T::neg(T::lit(i)),
        })
    }

    fn neg(r: Self::Repr) -> Self::Repr {
        CtxFun::new(move |ctx| match ctx {
            Ctx::Pos => r.0(&Ctx::Neg),
            Ctx::Neg => r.0(&Ctx::Pos),
        })
    }

    fn add(r1: Self::Repr, r2: Self::Repr) -> Self::Repr {
        CtxFun::new(move |ctx| T::add(r1.0(ctx), r2.0(ctx)))
    }
}

fn exprsym_push_neg<S: ExprSym>(e: CtxFun<S::Repr>) -> S::Repr {
    e.0(&Ctx::Pos)
}

We can verify the result:

exprsym_view(tf1())
> (8 + (-(1 + 2)))

exprsym_view(exprsym_push_neg(tf1()))
> (8 + ((-1) + (-2)))

One way to think about it is that PushNeg interprets an expression into a Rust function. The Rust function takes a context and produces a new expression, which can then be further interpreted. This fact is visible in the associated type definition type Repr = CtxFun<T::Repr>, as well as in the implementation of lit. To get the new expression out in the final style, exprsym_push_neg calls the function with Ctx::Pos.

As an intermediate step, it might help to write a version of the initial style push_neg that similarly takes a Ctx argument - the structure is then similar to the final style, with pattern matching on enum variants taking the place of implementing lit, neg and add trait methods.
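One possible shape for that intermediate step is sketched below. The Expr enum is an assumption based on the variants used in this post, and push_neg_ctx is a hypothetical name; the point is only that each match arm lines up with one arm of the lit, neg and add methods in PushNeg.

```rust
// Assumed shape of the initial-style enum.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Lit(i32),
    Neg(Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
}

// Whether the current sub-expression sits under an odd number of negations.
#[derive(Clone, Copy)]
enum Ctx {
    Pos,
    Neg,
}

// Context-passing version of push_neg: no nested pattern match needed.
fn push_neg_ctx(e: &Expr, ctx: Ctx) -> Expr {
    match (e, ctx) {
        // Literals absorb the pending negation, if any.
        (Expr::Lit(i), Ctx::Pos) => Expr::Lit(*i),
        (Expr::Lit(i), Ctx::Neg) => Expr::Neg(Box::new(Expr::Lit(*i))),
        // A negation node disappears and flips the context instead.
        (Expr::Neg(inner), Ctx::Pos) => push_neg_ctx(inner, Ctx::Neg),
        (Expr::Neg(inner), Ctx::Neg) => push_neg_ctx(inner, Ctx::Pos),
        // Addition passes the context down to both operands.
        (Expr::Add(l, r), ctx) => Expr::Add(
            Box::new(push_neg_ctx(l, ctx)),
            Box::new(push_neg_ctx(r, ctx)),
        ),
    }
}
```

Calling push_neg_ctx(&e, Ctx::Pos) on (8 + (-(1 + 2))) yields (8 + ((-1) + (-2))), the same result as the nested-match version.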

The implementation type PushNeg<T> includes a type parameter. Eval and View have no such parameter. This type parameter is necessarily an ExprSym and is needed to produce a new final style expression that we can re-interpret. We gain another complementary insight by expanding a few functions:

// note this is a tf1 version that does not use HasExprSym.
// ctx_fun: Fn(&Ctx) -> String
let ctx_fun: CtxFun<String> = tf1::<PushNeg<View>>();
let view: String = ctx_fun.0(&Ctx::Pos);

The type arguments to tf1 can be understood as specifying a stack of interpreters: first, push negations down, then view the result. What tf1 returns is a function of those type arguments - in this case it's a CtxFun because of PushNeg, which returns a String, because of View. At run time, just like in the initial style, there are two passes: first tf1 is executed, producing a CtxFun. Second, the CtxFun is executed, which produces a String via the View interpreter.

Arguably, the context-dependence in the final style is much more visible - Ctx shows exactly the bit of context we need, not more, not less. It's also clearly total: it doesn't fail and terminates - in fact, there are no recursive calls at all.

Side note: it is once again annoying to help Rust infer the ExprSym from the representation. The trick with HasExprSym I described earlier results in much more boilerplate still, because Rust does not allow "unconstrained type parameter T" in trait implementations. The full code is in the repo but is otherwise not very illuminating.

Time for a recap

This section introduced the mind-boggling but wonderful final style. It will become truly mind-boggling and more wonderful in the next sections.

For now, let's summarize the salient points.

The final style is more direct than the initial style in using the host language, Rust. There is no intermediate representation.
Final style has significant extensibility advantages over the initial style. It effectively solves the expression problem.
The loss of pattern matching is not the restriction it may first appear. We can make the necessary context explicit, and write a correct, total interpreter.

Homework: it is straightforward to implement a final style interpreter that produces an initial style Expr, and vice versa. That shows that they are essentially equivalent. If this is not immediately clear to you, I recommend trying it - you will learn much more than by reading how it's done.

Part 2: Higher-order fun

All in all, the final style doesn't look so different from approaches like overloading numerical operators (e.g. addition and multiplication) using the Add and Mul traits. One syntactic difference, that I was using an associated type instead of just implementing a trait on the representation type directly, seems trivial. However, for higher-order languages, which we'll discuss now, we need associated types.

What makes a language higher-order is that it treats functions as values. This makes it possible to pass functions as arguments to other functions and other fun stuff. Higher-order operations are quite common in embedded languages - for example, any data frame library worth its salt will have functions to map a given column, filter values using a predicate, and so on. Those are all higher-order operations.

Revisiting the initial style

Once again, let's start by adding functions to our little language in the initial style. I've removed addition and negation because they don't add anything new - as a result, we have a very minimalistic language that just has literals, functions, and function applications.

enum Expr {
    Var(VarId), // variable - to refer to function arguments in function body
    Int(i32), // literal integer
    Lam(Rc<Expr>), // function with given body
    App(Rc<Expr>, Rc<Expr>), // apply the given function to the given argument
}

Some people may recognize this as the lambda calculus. With a few helper constructor functions, which I've omitted, here's a simple expression in our language.

// (\x -> x) 1
Expr::app(
    Expr::lam(Expr::var(0)),
    Expr::int(1)
)

This applies the identity function to argument 1, so we'd expect it to evaluate to 1.
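For completeness, the omitted helper constructors could look roughly like this. This is a sketch under assumptions: VarId is taken to be usize (the evaluator later indexes a Vec with it), and the helpers simply wrap their arguments in Rc.

```rust
use std::rc::Rc;

// Assumption: variables are identified by their index in the environment.
type VarId = usize;

#[derive(Debug, Clone)]
enum Expr {
    Var(VarId),
    Int(i32),
    Lam(Rc<Expr>),
    App(Rc<Expr>, Rc<Expr>),
}

impl Expr {
    fn var(id: VarId) -> Expr {
        Expr::Var(id)
    }

    fn int(i: i32) -> Expr {
        Expr::Int(i)
    }

    // Wraps the body so callers don't have to write Rc::new themselves.
    fn lam(body: Expr) -> Expr {
        Expr::Lam(Rc::new(body))
    }

    fn app(f: Expr, arg: Expr) -> Expr {
        Expr::App(Rc::new(f), Rc::new(arg))
    }
}
```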

(I'm not sure if the reference counting Rcs are needed, but after struggling with lifetimes, ownership, and the borrow checker in the evaluator I caved. Since this is irrelevant to the topic of this post, I won't come back to it. I'd be interested if you can get rid of them, though.)

Now, if you think about writing an evaluator for this language, there are a few complications that we didn't have with the simple language from the previous section. First of all, it's possible to write nonsensical expressions:

// 1 2
Expr::app(
    Expr::int(1),
    Expr::int(2)
)

You can only apply functions; it's an error to apply an integer to anything. Even assuming the user only writes correct expressions, what is the result type of eval(Expr) -> ...? Since we've introduced a new type of value - a function - users can write:

// \x -> x
Expr::lam(Expr::var(0))

We have no choice but to introduce a new enum for eval's return type. Evaluation can produce an integer or a function:

enum Val {
    Int(i32),
    Fun(Rc<dyn Fn(Val) -> Val>),
}

type Env = Vec<Val>;

impl Expr {
    fn eval(env: Env, expr: Expr) -> Val {
        todo!()
    }
}

Just like evaluating an expression to an integer means evaluating to a Rust i32, so does evaluating to a function mean evaluating to a Rust Fn. The evaluator also needs an environment, to look up the values of any variables captured in closures. I'll just dump the code here without explanation - the details are not all that relevant:

fn eval(env: Env, expr: Expr) -> Val {
    match expr {
        Expr::Var(id) => env[id].clone(),
        Expr::Int(i) => Val::Int(i),
        Expr::Lam(e) => Val::Fun(Rc::new(move |x| {
            let mut envr = env.clone();
            envr.push(x);
            Expr::eval(envr, e.as_ref().clone())
        })),

        Expr::App(e1, e2) => {
            let eval_e1 = Expr::eval(env.clone(), e1.as_ref().clone());
            let eval_e2 = Expr::eval(env, e2.as_ref().clone());
            match eval_e1 {
                Val::Fun(f) => f(eval_e2),
                _ => panic!("Expected function"),
            }
        }
    }
}

What is important to note is that there are two ways this evaluator can fail:

There is an obvious panic when trying to apply a non-function.
Lookup of variables in the environment may fail - they must be numbered correctly.

If we'd like to make our language type-safe, we'd have to implement a type checker ourselves, turning our language into the simply typed lambda calculus.

This is all very annoying - especially given that Rust itself, our host language, already has a type system, functions, and all that jazz, which prevent all these errors at compile time, on the Rust level. Is there no way to leverage Rust's type system to make our embedded language type safe too?

Indeed there is - using the final style.

Finally higher-order

We'll proceed much as in the first-order case - instead of using an enum to define our language, we'll use a trait. The extra (and last) ingredient in the final style is that we'll use a generic associated type (GAT), a feature that has been stable since Rust 1.65. In Haskell, Kiselyov uses higher-kinded types instead, but as we'll see, they're pretty straightforward to translate. (And yes, this is the reason I used associated types for the first-order case too, although it wasn't necessarily the right approach there.)

type Fun<A, B> = Box<dyn Fn(A) -> B>;

trait ExprSym {
    type Repr<T>;

    fn int(i: i32) -> Self::Repr<i32>;
    fn add(a: &Self::Repr<i32>, b: &Self::Repr<i32>) -> Self::Repr<i32>;
    fn lam<A, B, F: Fn(Self::Repr<A>) -> Self::Repr<B>>(f: F) -> Self::Repr<Fun<A, B>>
    where
        for<'a> F: 'a;
    fn app<F: Fn(A) -> B, A, B>(f: Self::Repr<F>, arg: Self::Repr<A>) -> Self::Repr<B>;
}

(I had a few shenanigans with functions, which make this a bit more complicated than I would like. My first approach was to use impl Fn(A) -> B everywhere, instead of the Boxed function and the F type arguments, but Rust doesn't (yet?) support those in the places I need them.)

You'll probably want to stare at that for a bit. The salient point is that the Repr type now takes a generic parameter T. This effectively co-opts Rust's type system to work in the embedded language as well. add is an operation that takes two integers and returns an integer. app takes a function and an argument and returns the result of the function. Adding two functions, or applying an integer is disallowed by Rust's type system.

fn tf2a<T, E: ExprSym>() -> E::Repr<T> {
    // error[E0277]: expected a `std::ops::Fn<(_,)>` closure, found `i32`
     E::app(E::int(2), E::int(3))
}

At the risk of belaboring the point, in the final approach, there's no intermediate representation. Integers are Rust integers, variables are Rust variables, functions are Rust functions, and types are Rust types. The evaluator looks like the identity function.

struct Eval;
impl ExprSym for Eval {
    type Repr<T> = T;

    fn int(i: i32) -> Self::Repr<i32> {
        i
    }

    fn add(a: &Self::Repr<i32>, b: &Self::Repr<i32>) -> Self::Repr<i32> {
        a + b
    }

    fn lam<A, B, F: Fn(Self::Repr<A>) -> Self::Repr<B>>(f: F) -> Self::Repr<Box<dyn Fn(A) -> B>>
    where
        for<'a> F: 'a,
    {
        Box::new(f)
    }

    fn app<F: Fn(A) -> B, A, B>(f: Self::Repr<F>, arg: Self::Repr<A>) -> Self::Repr<B> {
        f(arg)
    }
}

A few examples of the evaluator in action:

fn th1<E: ExprSym>() -> E::Repr<i32> {
    E::add(&E::int(1), &E::int(2))
}

th1::<Eval>()
> 3

fn th2<E: ExprSym>() -> E::Repr<Fun<i32, i32>> {
    E::lam(|x| E::add(&x, &x))
}

th2::<Eval>()(2)
> 4

In th2, the interpreter produces a Rust function that we can call.
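Function application composes in the same way. The sketch below is a trimmed, self-contained variant of the trait and evaluator above, with one assumption to flag: it uses a plain 'static bound on F where the original uses for<'a> F: 'a. The term th3 is a hypothetical example that applies the doubling function to 21.

```rust
type Fun<A, B> = Box<dyn Fn(A) -> B>;

trait ExprSym {
    type Repr<T>;

    fn int(i: i32) -> Self::Repr<i32>;
    fn add(a: &Self::Repr<i32>, b: &Self::Repr<i32>) -> Self::Repr<i32>;
    // Assumption: 'static stands in for the original `for<'a> F: 'a` bound.
    fn lam<A, B, F>(f: F) -> Self::Repr<Fun<A, B>>
    where
        F: Fn(Self::Repr<A>) -> Self::Repr<B> + 'static;
    fn app<F: Fn(A) -> B, A, B>(f: Self::Repr<F>, arg: Self::Repr<A>) -> Self::Repr<B>;
}

struct Eval;
impl ExprSym for Eval {
    // The representation of a T is just a T.
    type Repr<T> = T;

    fn int(i: i32) -> Self::Repr<i32> {
        i
    }

    fn add(a: &Self::Repr<i32>, b: &Self::Repr<i32>) -> Self::Repr<i32> {
        a + b
    }

    fn lam<A, B, F>(f: F) -> Self::Repr<Fun<A, B>>
    where
        F: Fn(Self::Repr<A>) -> Self::Repr<B> + 'static,
    {
        Box::new(f)
    }

    fn app<F: Fn(A) -> B, A, B>(f: Self::Repr<F>, arg: Self::Repr<A>) -> Self::Repr<B> {
        f(arg)
    }
}

// (\x -> x + x) 21
fn th3<E: ExprSym>() -> E::Repr<i32> {
    E::app(E::lam(|x| E::add(&x, &x)), E::int(21))
}
```

Here th3::<Eval>() evaluates to 42, and the types of the lambda's argument and result are inferred from the embedded term itself.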

The implementation of a viewer is similar. The representation we choose is Fn(VarCounter) -> String so that we can generate unique names for variables. The full implementation is in the repo; here are some examples:

th1::<View>()(0)
> "(1 + 2)"

th2::<View>()(0)
> "(\\x0 -> (x0 + x0))"
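To give a rough idea of how such a viewer might be put together, here is a sketch. It is not the repo's implementation: it assumes VarCounter is a usize, stores the representation behind an Rc so that add can share its borrowed arguments, uses a 'static bound on F, and leaves out app to stay short.

```rust
use std::rc::Rc;

type Fun<A, B> = Box<dyn Fn(A) -> B>;
// Assumption: the counter is simply the next free variable index.
type VarCounter = usize;

// Trimmed-down trait: app is omitted in this sketch.
trait ExprSym {
    type Repr<T>;

    fn int(i: i32) -> Self::Repr<i32>;
    fn add(a: &Self::Repr<i32>, b: &Self::Repr<i32>) -> Self::Repr<i32>;
    fn lam<A, B, F>(f: F) -> Self::Repr<Fun<A, B>>
    where
        F: Fn(Self::Repr<A>) -> Self::Repr<B> + 'static;
}

struct View;
impl ExprSym for View {
    // Every expression, whatever its embedded type T, is viewed as a
    // function from the variable counter to its textual form.
    type Repr<T> = Rc<dyn Fn(VarCounter) -> String>;

    fn int(i: i32) -> Self::Repr<i32> {
        Rc::new(move |_: VarCounter| i.to_string())
    }

    fn add(a: &Self::Repr<i32>, b: &Self::Repr<i32>) -> Self::Repr<i32> {
        let (a, b) = (Rc::clone(a), Rc::clone(b));
        Rc::new(move |n: VarCounter| format!("({} + {})", a(n), b(n)))
    }

    fn lam<A, B, F>(f: F) -> Self::Repr<Fun<A, B>>
    where
        F: Fn(Self::Repr<A>) -> Self::Repr<B> + 'static,
    {
        Rc::new(move |n: VarCounter| {
            // Name the bound variable after the current counter value.
            let name = format!("x{}", n);
            let var_name = name.clone();
            let var: Rc<dyn Fn(VarCounter) -> String> =
                Rc::new(move |_: VarCounter| var_name.clone());
            // View the body with the counter bumped, so nested lambdas
            // get fresh names.
            let body = f(var);
            format!("(\\{} -> {})", name, body(n + 1))
        })
    }
}

fn th1<E: ExprSym>() -> E::Repr<i32> {
    E::add(&E::int(1), &E::int(2))
}

fn th2<E: ExprSym>() -> E::Repr<Fun<i32, i32>> {
    E::lam(|x| E::add(&x, &x))
}
```

The type parameter T of Repr is deliberately unused on the right-hand side: the viewer only needs it so that the embedded language stays typed while every term is rendered to the same kind of function.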

Higher-order recap

This second part shows how the final style maximizes the effectiveness of the host language in the embedded language. It succeeds in co-opting most if not all of the constructs of the host language. That's great because it's exactly why we are embedding the DSL, instead of writing a new, separate language.

Furthermore, all the advantages of the previous section like extensibility carry over. Try adding multiplication as an exercise, by defining a new MulExprSym sub-trait of ExprSym.

I can also now explain what tagless means. Tagging refers to how, in the initial style, the enum "tags" different cases at runtime, and pattern matches on these tags at runtime. In the final style, there are no tags, we never introduce them. Instead, we rely on overloads via traits that are resolved at compile time.

As a result, the final code is easier to optimize and should compile as if the additions and functions were written in Rust directly. It's conceivable that Rust/LLVM is smart enough to compile to similar code from the initial style - though that seems to require more advanced optimizations.

Whatever the case, tagging has the annoying side-effect that types in the embedded language are erased in the host language - every expression becomes of type Expr, and so when writing expressions in the embedded language, the host language's type system becomes useless. In the final style, in contrast, the host language's type system is leveraged directly.

Part 3: Typed final tagless formatting

For the final part, we'll apply the final style to a more realistic problem: type-safe printing and parsing based on a format specification. I'll just quote Oleg Kiselyov's lecture notes here:

The typed formatting problem is to write type-safe versions of the familiar C functions printf and scanf. The polyvariadic function sprintf should take the formatting specification (the formatting pattern) and the values to format, and return the formatted string. The types and the number of sprintf’s arguments have to match the formatting specification. The typed sscanf takes the input string, the format specification, and the consumer function. It parses data from the string according to the formatting specification, passing them to the consumer. The number and the types of the arguments for the consumer function have to match the formatting specification. Since parsing is necessarily partial, sscanf should return the result of the consumer function in the Maybe monad.

I translated Kiselyov's Haskell implementation to safe Rust. In contrast with the remark in the documentation of std::fmt::Arguments:

This structure represents a safely precompiled version of a format string and its arguments. This cannot be generated at runtime because it cannot safely be done, so no constructors are given and the fields are private to prevent modification.

A format specification and its arguments can be safely generated at runtime.

I'll first show a few examples of type-safe printing and scanning in Rust, and then explain how it works. The interface is not ergonomic at all and needs a macro or two to be fit for actual use. However, the macro would be thin - there'd be no need to validate the format specification. That the format pattern and arguments match is statically checked by the Rust type system. It doesn't need compiler plugins or elaborate macro-level checks.

The simple formatting language supports four constructs:

lit for strings,
char for characters,
int for integers,
compose to append two format specs.

It's bare-bones but implemented in the final style so easily extensible, without changing existing interpreters.

As another direct consequence of using the final style, the same pattern can be used for formatted printing or scanning - we write one interpreter for printing, and a separate one for scanning.

Here's the simplest pattern, just a string literal:

fn fmt1<F: FormattingSpec, A>() -> F::Repr<A, A> {
    F::lit("Hello, ")
}

sprintf(fmt1::<Print, _>())
> "Hello, "

sscanf("Hello, ", fmt1::<Scan, _>(), ())
> Some(())

sscanf("World", fmt1::<Scan, _>(), ())
> None

Taking it up a notch:

fn fmt2<F: FormattingSpec, A>() -> F::Repr<A, Fun<char, A>>
where
    for<'a> A: 'a,
{
    F::compose(F::lit("Hello, world"), F::char())
}

sprintf(fmt2::<Print, _>())('!')
> "Hello, world!"

sscanf("Hello, world?", fmt2::<Scan, _>(), new_fun(|x| x))
> Some('?')

And finally:

fn fmt3<F: FormattingSpec, A>() -> F::Repr<A, Fun<char, Fun<i32, A>>>
where
    for<'a> A: 'a,
{
    F::compose(
        F::lit("The value of "),
        F::compose(F::char(), F::compose(F::lit(" is "), F::int())),
    )
}

sprintf(fmt3::<Print, _>())('C')(67)
> "The value of C is 67"

sscanf(
    "The value of C is 67",
    fmt3::<Scan, _>(),
    new_fun(|c| new_fun(move |i| (c, i))),
);
> Some(('C', 67))

Note how the compiler can infer from the pattern fmt3 that sprintf takes a character and an integer - as expected, trying to pass anything else is an error. Similarly, the result of sscanf is an option containing a typed char and integer. You can also see those types in the return type of fmt3.

If you've never seen this before I bet it is quite surprising that it's possible to do type-safe formatting at all. Almost all languages either use untyped format strings or have special compiler plugins to make the format strings type-safe. The insight there is not related to the tagless final style - it has a rather long history, and you don't need advanced type system tricks to achieve it. Kiselyov's lecture notes have the necessary references. I'll give the highlights of the implementation here.

Here is the final style specification of the format specification:

// NOTE: lifetime constraints omitted for brevity
trait FormattingSpec {
    type Repr<A, B>;

    fn lit<A>(s: &str) -> Self::Repr<A, A>;
    fn int<A>() -> Self::Repr<A, Fun<i32, A>>;
    fn char<A>() -> Self::Repr<A, Fun<char, A>>;
    fn compose<A, B, C>(f: Self::Repr<B, C>, g: Self::Repr<A, B>) -> Self::Repr<A, C>;
}

One way to think of the types for the primitives lit, int, and char is that they take an "input" type A and if necessary append the type they represent. The crucial function compose then combines two such functions by "appending" function arguments from both its arguments. The formatting specification builds a nested list of functions - the function arguments are a polyvariadic list of sorts. (A polyvariadic list is a heterogeneous list: it combines the heterogeneity of tuples with Vec's ability to hold as many elements as you like.)

While the details of the implementation are certainly mind-expanding, they are not related to the final style per se. See the companion repo for more details and examples.
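As a taste of what a printing interpreter might look like, here is a continuation-passing sketch in the spirit of the lecture notes. It is not the code from the companion repo, and it rests on several assumptions: the Cont and PrintRepr aliases are hypothetical names, continuations are shared through Rc, and plain 'static bounds stand in for the lifetime constraints that were omitted above.

```rust
use std::rc::Rc;

type Fun<A, B> = Box<dyn Fn(A) -> B>;
// Continuation that receives the text formatted so far.
type Cont<A> = Rc<dyn Fn(String) -> A>;
// Hypothetical printer representation: given a continuation, produce B.
type PrintRepr<A, B> = Box<dyn Fn(Cont<A>) -> B>;

trait FormattingSpec {
    type Repr<A, B>;

    fn lit<A: 'static>(s: &str) -> Self::Repr<A, A>;
    fn int<A: 'static>() -> Self::Repr<A, Fun<i32, A>>;
    fn char<A: 'static>() -> Self::Repr<A, Fun<char, A>>;
    fn compose<A: 'static, B: 'static, C: 'static>(
        f: Self::Repr<B, C>,
        g: Self::Repr<A, B>,
    ) -> Self::Repr<A, C>;
}

struct Print;
impl FormattingSpec for Print {
    type Repr<A, B> = PrintRepr<A, B>;

    // A literal adds no argument: it hands its text to the continuation.
    fn lit<A: 'static>(s: &str) -> Self::Repr<A, A> {
        let s = s.to_string();
        Box::new(move |k: Cont<A>| k(s.clone()))
    }

    // int and char each add one function argument in front of A.
    fn int<A: 'static>() -> Self::Repr<A, Fun<i32, A>> {
        Box::new(|k: Cont<A>| -> Fun<i32, A> { Box::new(move |i: i32| k(i.to_string())) })
    }

    fn char<A: 'static>() -> Self::Repr<A, Fun<char, A>> {
        Box::new(|k: Cont<A>| -> Fun<char, A> { Box::new(move |c: char| k(c.to_string())) })
    }

    // Run f first, then g, and pass the concatenated text on to k.
    fn compose<A: 'static, B: 'static, C: 'static>(
        f: Self::Repr<B, C>,
        g: Self::Repr<A, B>,
    ) -> Self::Repr<A, C> {
        let g = Rc::new(g);
        Box::new(move |k: Cont<A>| {
            let g = Rc::clone(&g);
            f(Rc::new(move |sa: String| {
                let k = Rc::clone(&k);
                g(Rc::new(move |sb: String| k(format!("{}{}", sa, sb))))
            }))
        })
    }
}

// Start the chain with the identity continuation.
fn sprintf<B>(fmt: PrintRepr<String, B>) -> B {
    fmt(Rc::new(|s: String| s))
}

fn fmt3<F: FormattingSpec, A: 'static>() -> F::Repr<A, Fun<char, Fun<i32, A>>> {
    F::compose(
        F::lit("The value of "),
        F::compose(F::char(), F::compose(F::lit(" is "), F::int())),
    )
}
```

With these definitions, sprintf(fmt3::<Print, String>())('C')(67) builds the string "The value of C is 67". A scanner would be a second impl of the same trait with a different Repr, which is not attempted here.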

While typed sprintf was a solved problem, casting it to the final style allowed Kiselyov to implement typed scanf as well, sharing the same format specification.

The formatting specification is extensible with width specifiers, other primitives, and so on. We can also straightforwardly write new interpreters, for example, to output a more traditional format string from the specification. Finally, unlike Rust's format strings, the format specifications are first-class values: they can be combined at runtime, and even loaded from a file.

Thanks for reading! For more, subscribe to my newsletter and follow me on Twitter or Mastodon.

References

Oleg Kiselyov's website: https://okmij.org/ftp/tagless-final/ is a treasure trove.
I recommend starting with the lecture notes, on which pretty much this entire post is based. I couldn't find all the code examples for the lecture notes, but a Very Nice Person put them all on GitHub: https://github.com/michaelt/tagless.
Companion repo: https://github.com/kurtschelfthout/finally-tagless

A Nibble of Quadtrees in Rust

Kurt — Wed, 22 Feb 2023 18:33:24 +0000

Nibble: a small piece of food bitten off. In computing: half a byte of information. Every nibble explains a computing science or software engineering idea in five minutes.

That’s right, I just typed quad and tree in the stock image search. Photo by Appic on Unsplash

Quadtrees are a tree data structure in which each non-leaf node has exactly four children. They are related to binary trees and frequently represent properties of two-dimensional space, such as point locations or regions.

All code for this post is here: https://github.com/kurtschelfthout/quadtrees.

The basics of quadtrees

As a data structure, quadtrees are straightforward trees where each branch node has exactly four children.

We can store a variety of information in such a tree, typically related to two-dimensional data, such as the locations of objects in a plane, a summary of a part of an image like its average color, or information about lines or shapes in a region of a drawing. Why are binary trees, quadtrees, and octrees used more often than 5-ary trees or 12-ary trees? Because of the relation to one, two, and three-dimensional space. It takes two line segments to partition a (one-dimensional) line segment, four squares to partition a square, and eight cubes to partition a cube in three-dimensional space.

Like all data structures, quadtrees are used primarily for performance reasons. For example, when simulating a large number of moving objects, naively checking for collisions means comparing the position of each object with every other. If we store the objects' positions in a quadtree, we can limit collision detection to a few regions of interest, greatly reducing the work needed.

In this post, we'll look specifically at region quadtrees. A region quadtree partitions a 2D space into regions - each node in a region quadtree represents a rectangular region. Regions can be split further into four sub-regions, which can be split in turn, until some desired resolution is reached.

Representing images with quadtrees

I'll be using quadtrees to generate lower-resolution versions of an image. The idea is to repeatedly divide the image into four regions. Each region, which can contain many pixels, is approximated by a single value - a "big pixel". I'll take the mean of the RGB values in the region as the color of that big pixel.

As a first attempt at this, I used a complete quadtree - meaning that each level of the tree is fully populated and the tree is perfectly balanced. The leaf nodes represent one pixel each, their parents represent a two-by-two pixel region, and so on. To get a lower-resolution version of the image, we can cut off the tree at the desired level.

Here's an illustration. The original four-by-four image is in the bottom left, and the lowest level of the tree represents it exactly.

Color-challenged people will have to trust me that the colors match up.

Further up in the tree, in the middle level, each region is two by two pixels and is only an approximation of the original. In the top left and bottom right corners, the approximation happens to be exact. At the root level, there is just a single big pixel.

Thanks to the magic of WebAssembly, which compiled my Rust code below to something that runs in the browser, you can play around with quadtrees on a more realistic image:

Playground for complete quadtrees applied to HAL 9000 image

Here's a screenshot of the playground in action. The original image on the left, the image as represented by an adjustable level in the quadtree on the right.

Do these look like those robo-boobs from Austin Powers? Asking for a friend.

That works reasonably well for symmetrical images like the one above, but most images have parts we care less about. It'd be nice if we could focus the quadtree to divide just the interesting parts of the image into smaller regions. To do that, we can no longer have a complete quadtree - interesting regions will gain more levels than uninteresting ones, and the tree will be unbalanced.

Assuming that's fine, how should we determine which region to focus on? The idea is to work iteratively. To start with, we approximate the entire image with a single region of the image's mean color. At each step, we calculate an error value for each leaf region in the quadtree. The error indicates how well the region represents the corresponding region in the original image. I used the mean squared difference between the mean color for that region and the actual colors.

While the error for a particular region is greater than we tolerate, keep subdividing it into four more sub-regions. We'll end up with an approximation of the original image that creates more, smaller regions in interesting parts of the image, and includes fewer, larger regions in less interesting parts.
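The mean color and the error of a region can be sketched as follows. The Image type here is a hypothetical stand-in (row-major RGB pixels, no alpha), and averaging the squared differences over all pixels and channels is one reasonable reading of "mean squared difference"; the demo's exact metric may differ. Regions are assumed to be non-empty and inside the image.

```rust
struct Region {
    x: usize,
    y: usize,
    width: usize,
    height: usize,
}

// Hypothetical image type: row-major RGB pixels.
struct Image {
    width: usize,
    pixels: Vec<[u8; 3]>,
}

impl Image {
    fn pixel(&self, x: usize, y: usize) -> [u8; 3] {
        self.pixels[y * self.width + x]
    }
}

// Mean of each color channel over the region: the "big pixel" color.
fn mean_color(image: &Image, r: &Region) -> [f32; 3] {
    let mut sum = [0.0f32; 3];
    for y in r.y..r.y + r.height {
        for x in r.x..r.x + r.width {
            let p = image.pixel(x, y);
            for c in 0..3 {
                sum[c] += p[c] as f32;
            }
        }
    }
    let n = (r.width * r.height) as f32;
    [sum[0] / n, sum[1] / n, sum[2] / n]
}

// Mean squared difference between the region's pixels and its mean color.
fn region_error(image: &Image, r: &Region) -> f32 {
    let mean = mean_color(image, r);
    let mut total = 0.0f32;
    for y in r.y..r.y + r.height {
        for x in r.x..r.x + r.width {
            let p = image.pixel(x, y);
            for c in 0..3 {
                let d = p[c] as f32 - mean[c];
                total += d * d;
            }
        }
    }
    total / (r.width * r.height * 3) as f32
}
```

A region of uniform color has an error of zero and is never subdivided, while a region mixing very different colors has a large error and keeps splitting until it falls below the threshold.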

Here is a demo of region quadtrees in action. You can interactively adjust the desired error and the minimum region length.

Playground for region quadtrees applied to owl image

On the left is the original image and on the right is the image represented by a quadtree. With a higher error tolerance, you'll see fewer and bigger regions, concentrated on busier parts of the image:

Quadtrees laugh at your thousands of years of natural selection for camouflage.

Implementing region quadtrees in Rust

That's it for the demos, let's look at some code for region quadtrees - the second approach.

The representation of a region quadtree is straightforward:

struct Region {
    x: usize,
    y: usize,
    width: usize,
    height: usize,
}

enum RegionQuadTree {
    Leaf(Region, Rgba),
    Branch([Box<RegionQuadTree>; 4]),
}

A quadtree is either a leaf or a branch. A leaf stores the region it represents and its color as an RGBA (Red+Green+Blue+Alpha) value. I chose RGBA simply because that's how HTML's canvas represents a pixel of image data. You can store any information about a region in leaf nodes. A branch only has references to its four children. It usually makes sense to store more information on branches, such as the region they apply to, but I didn't need it here.

The driver for the demo is the following function:

pub fn subdivide_until(&mut self, error_threshold: f32, min_region_length: usize) {
    loop {
        let new_quadtree =
            self.quadtree
                .subdivide(&self.image, error_threshold, min_region_length);
        match new_quadtree {
            Some(qt) => self.quadtree = qt,
            None => break,
        }
    }
}

The function subdivides a quadtree until it reaches a small enough error, or until the length of its regions' sides becomes too small. The latter condition prevents the loop from continuing indefinitely and avoids underflow problems. The subdivide function has the following signature:

fn subdivide(
    &self,
    image: &Image,
    error_threshold: f32,
    min_region_length: usize,
) -> Option<RegionQuadTree>

It returns a new, subdivided quadtree, or None if either the error threshold or the min region length is met:

fn subdivide(
    &self,
    image: &Image,
    error_threshold: f32,
    min_region_length: usize,
) -> Option<RegionQuadTree> {

    let region = self.region();
    if self.get_error(image) < error_threshold
        || region.height <= min_region_length
        || region.width <= min_region_length
    {
        return None;
    }

    match self {
        RegionQuadTree::Leaf(region, _) => {
            // ...
        }
        RegionQuadTree::Branch(children) => {
            // ...
        }
    }
}

The implementation for the cases is straightforward, if verbose.

Subdividing a leaf means splitting it into four regions, taking care not to miss any pixels, and then creating a new branch to hold the four new children:

match self {
    RegionQuadTree::Leaf(region, _) => {
        let half_width_l = (region.width as f64 / 2.0).floor() as usize;
        let half_width_r = (region.width as f64 / 2.0).ceil() as usize;
        let half_height_up = (region.height as f64 / 2.0).floor() as usize;
        let half_height_dwn = (region.height as f64 / 2.0).ceil() as usize;

        let children = [
            Box::new(RegionQuadTree::leaf(
                region.x,
                region.y,
                half_width_l,
                half_height_up,
                image,
            )),
            Box::new(RegionQuadTree::leaf(
                region.x,
                region.y + half_height_up,
                half_width_l,
                half_height_dwn,
                image,
            )),
            Box::new(RegionQuadTree::leaf(
                region.x + half_width_l,
                region.y,
                half_width_r,
                half_height_up,
                image,
            )),
            Box::new(RegionQuadTree::leaf(
                region.x + half_width_l,
                region.y + half_height_up,
                half_width_r,
                half_height_dwn,
                image,
            )),
        ];
        Some(RegionQuadTree::Branch(children))
    }
    RegionQuadTree::Branch(children) => {
        // ...
    }
}
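To check that rounding one half down and the other up really misses no pixels, here is a small Python sketch of the same arithmetic. The names split_region and pixels, and the (x, y, width, height) tuple layout, are assumptions made for illustration; they are not part of the repo's code.

```python
def split_region(x, y, width, height):
    """Split a region into four children, in the same order as the Rust code:
    top-left, bottom-left, top-right, bottom-right."""
    half_w_l = width // 2             # floor(width / 2)
    half_w_r = width - half_w_l       # ceil(width / 2)
    half_h_up = height // 2           # floor(height / 2)
    half_h_dwn = height - half_h_up   # ceil(height / 2)
    return [
        (x, y, half_w_l, half_h_up),
        (x, y + half_h_up, half_w_l, half_h_dwn),
        (x + half_w_l, y, half_w_r, half_h_up),
        (x + half_w_l, y + half_h_up, half_w_r, half_h_dwn),
    ]


def pixels(region):
    """All (column, row) pixel coordinates covered by a region."""
    x, y, width, height = region
    return {(px, py) for px in range(x, x + width) for py in range(y, y + height)}


# A 5x3 region has odd sides, so the halves differ in size.
children = split_region(0, 0, 5, 3)
covered = [p for child in children for p in pixels(child)]
print(children)  # [(0, 0, 2, 1), (0, 1, 2, 2), (2, 0, 3, 1), (2, 1, 3, 2)]
# Every pixel is covered exactly once: no gaps and no overlaps.
print(len(covered) == len(set(covered)) == 5 * 3)  # True
```

Because the right and bottom halves take whatever the left and top halves leave over, the four children always tile the parent exactly, whatever its size.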

Subdividing a branch means subdividing each of its children. We avoid creating a new tree if all children report they don't need to be subdivided further (by returning None). We also have to put the tree back together in case some children subdivide while others do not.

match self {
    RegionQuadTree::Leaf(region, _) => {
        // ...
    }
    RegionQuadTree::Branch(children) => {
        // call subdivide on all children - zip with existing children to use 
        // as default later.
        let sub_children: Vec<_> = children
            .iter()
            .map(|child| child.subdivide(image, error_threshold, min_region_length))
            .zip(children)
            .collect();

        // all children returned None - avoid returning a superfluous new tree.
        if sub_children.iter().all(|c| c.0.is_none()) {
            return None;
        }

        // replace any None result on children with the "old" child.
        let children = sub_children
            .into_iter()
            .map(|(nc, oc)| Box::new(nc.unwrap_or(*oc.clone())))
            .collect::<Vec<_>>()
            .try_into()
            .unwrap();
        Some(RegionQuadTree::Branch(children))
    }
}

Those are the most important bits - you can look at the rest of the code in the repo.

There's also an implementation of a complete quadtree. The main difference with a region quadtree is that we can figure out the number of nodes beforehand. For example, a complete quadtree with three levels (not counting the root) has 1+4+16+64 = 85 nodes. Like with a binary tree, this can be leveraged to store a complete quadtree in an array, and use indices to represent the parent-child structure: children of the node at index i are stored at indices 4*i + 1 to 4*i + 4.
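The index arithmetic can be sketched in a few lines of Python. The function names are assumptions for illustration; the repo's complete quadtree may organise this differently.

```python
def node_count(levels):
    """Number of nodes in a complete quadtree with `levels` levels below the root."""
    return sum(4 ** depth for depth in range(levels + 1))


def children(i):
    """Indices of the four children of the node stored at index i."""
    return [4 * i + k for k in range(1, 5)]


def parent(i):
    """Index of the parent of the node at index i (the root, index 0, has none)."""
    if i == 0:
        raise ValueError("the root has no parent")
    return (i - 1) // 4


print(node_count(3))  # 85, i.e. 1 + 4 + 16 + 64
print(children(0))    # [1, 2, 3, 4]
print(children(2))    # [9, 10, 11, 12]
print(parent(12))     # 2
```

No pointers are stored at all: the tree structure lives entirely in the index calculation.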

What does this have to do with z-order?

In the last nibble I said Z-order curves are useful to create quadtrees. I haven't used them here, but it's worth exploring the connection. If you squint at the creation of four new leaf nodes in the RegionQuadTree::Leaf case of subdivide above, you'll see that the leaves are created in z-order: top-left, bottom-left, top-right, bottom-right. That's true recursively, as creating new branches keeps the same order as the existing children. It's slightly clearer in a complete quadtree because all its regions have the same size.

If the regions to insert are known, you can avoid overhead while creating the tree if you feed them to the tree creation procedure in z-order. It's also possible to parallelize quadtree creation that way.

In conclusion

Wikipedia lists interesting applications of quadtrees, including image representation and processing, collision detection, and spatial indexing for location queries. At its heart, a quadtree is simply an extension of binary trees, with each node having four children instead of two. However, just like binary trees, this simple structure allows many interesting uses.

Thanks for reading! I write one nibble every month. For more, subscribe to my newsletter and follow me on Twitter or Mastodon.

References

Code for this post: https://github.com/kurtschelfthout/quadtrees.
Segmenting images with quadtrees: https://jrtechs.net/photography/segmenting-images-with-quadtrees
Computer art based on quadtrees: https://github.com/fogleman/Quads

A Nibble of Geohashes in Go

Kurt — Wed, 28 Dec 2022 11:13:51 +0000

Latitude and longitude as a locality-preserving string

Nibble: a small piece of food bitten off. In computing: half a byte of information. Every nibble explains a computing science or software engineering idea or system in five minutes.

Photo by June on Unsplash

Geohashes encode longitude and latitude in a simple URL-safe string for geohash.org. They're interesting because two geohashes that share the same prefix are pretty close together. For example, the Alexanderplatz in Berlin has geohash u33dc1r4. Nearby Brandenburger Tor has geohash u33db2jx - note the shared prefix u33d.

As an aside, I have not programmed in Go before and just ran hello world. I wasn't going to leave that alliteration on the table!

What is a geohash?

A geohash encodes a latitude and longitude into a string of digits and letters. It’s base-32 encoded so uses 32 possible characters: 10 digits 0-9 and 22 letters a-z excluding a, i, l, and o.

For example, the geohash sunny decodes to latitude 23.7, longitude 42.5, or N 23°42.000' E 42°30.000', in Saudi Arabia. Most geohashes aren't such memorable words, but if you're bored you can check where words or names are geolocated. My name is near the coast of Madagascar.

For readers who are geographically challenged, like me: when expressed as decimals, positive latitude is north of the Equator, and negative to the south. Latitude ranges from -90° to 90° at the south and north pole respectively. Positive longitude is east of the prime meridian (through Greenwich, UK), and negative is west. Longitude ranges from -180° to 180°. Good ol' Encyclopaedia Britannica has some nice visualizations.

As we'll see, by construction geohashes have a few interesting properties.

Two geohashes that share a common prefix have coordinates that are close together.
The longer the geohash, the more accurate it is. Longer geohashes describe smaller regions.
A geohash that extends another geohash with extra characters describes a sub-region of that geohash.

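On the other hand, it is NOT necessarily the case that two nearby locations share a geohash prefix. Two locations on either side of the Equator or the Prime Meridian can be very close together, yet their geohashes differ from the first character on. Near the Poles and the 180° meridian, nearby locations can even have completely different longitudes, which again results in different geohashes.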

Z-order curves

Geohashes map latitude and longitude on a z-order curve. A z-order curve is a general mapping from a 2-dimensional coordinate to a one-dimensional coordinate - a point on a line. The mapping mostly preserves locality: two points that are near each other on the 1D curve are usually also near each other in 2D, although the converse does not always hold. Z-order curves are a particular type of space-filling curve: a curve whose range contains an entire 2-dimensional square. In simple terms: a contiguous line through a table that passes through all the table's cells once.

Z-order curves are simple to construct: interleave the bits of the x and y coordinates to get the 1D coordinate. If you lay out your axes right, that creates a recursive Z-shaped curve in the 2-dimensional plane:

The table has the x coordinates 0-3 horizontally at the top, and the y coordinates vertically on the left. The dashed line shows the resulting z-order, starting from the top left.
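Bit interleaving is short enough to sketch in Python. Which coordinate gets the lower bit of each pair is a convention; this sketch assumes x does, so the curve first moves right and then down. The function name interleave and the bits parameter are made up for this example.

```python
def interleave(x, y, bits=16):
    """Z-order index of (x, y): bit i of x goes to bit 2*i, bit i of y to bit 2*i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Print the z-order index of every cell in a 4x4 table, row by row.
for y in range(4):
    print([interleave(x, y) for x in range(4)])
# [0, 1, 4, 5]
# [2, 3, 6, 7]
# [8, 9, 12, 13]
# [10, 11, 14, 15]
```

Following the numbers 0 to 15 through the table traces one small Z per 2x2 block, and the four blocks are themselves visited in a Z.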

Z-order curves can be generalized to more than two dimensions and are useful when multidimensional data needs to be laid out sequentially, typically for performance reasons. Examples are:

Database indexes for locations. Allows searching for nearby locations by looking for similar prefixes.
In-memory layout of texture maps in GPUs. Optimizes memory layout to reduce cache misses.
Efficient matrix multiplication, to reduce cache misses.

Decoding a geohash

First, let's try to decode a given geohash to latitude and longitude.

The easy part is finding the bit representation of a geohash string like "gbsuv". Since geohash uses base-32 encoding, each character corresponds to 5 bits. All we need is a constant containing the allowable characters. The index of a character in that constant is the 5-bit value we're interested in:

const base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

c := strings.IndexRune(base32, letter)

Now how to process those bits into latitude and longitude?

First, we know that a geohash is a z-order curve, so its bits are longitude and latitude interleaved. Counting from zero at the leftmost bit, the even bits in a geohash are longitude and the odd bits are latitude.

Second, a geohash does not describe an exact location. The coordinate geohash.org returns is really the mid-point of a "rectangular" region. Rectangular is in scare quotes because it's a rectangle projected on a sphere - at the poles, it's a triangle! The region can be described by a minimum and maximum longitude and latitude, for the two corners of the “rectangle”.

Here's an animation of how 10 bits converge to a smaller and smaller region (in orange) - each bit moves one of the corners:

The empty geohash represents the entire Earth. The first bit determines whether the location is in the eastern or western hemisphere. The second bit determines whether the region is in the northern or southern hemisphere - after two bits we know in which quadrant of the Earth the location is. Each bit approximately halves the region further, along east-west or north-south.

The longer the geohash, the smaller the region. To calibrate your intuition, a geohash of 8 characters describes a region of roughly 38m by 19m at the Equator.
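To get a feel for these sizes, the cell dimensions follow directly from the number of bits. The sketch below is a back-of-the-envelope calculation, not part of the post's code: the function name is made up, and the conversion of about 111.32 km per degree is an assumption that only holds for longitude at the Equator.

```python
def cell_size_degrees(num_chars):
    """Width (longitude) and height (latitude) in degrees of a geohash cell."""
    total_bits = 5 * num_chars
    lon_bits = (total_bits + 1) // 2  # longitude takes the first bit, so it gets the extra one
    lat_bits = total_bits // 2
    return 360 / 2 ** lon_bits, 180 / 2 ** lat_bits


# Assumption: about 111.32 km per degree (exact only along the Equator).
METERS_PER_DEGREE = 111_320

for n in (1, 5, 8, 12):
    width, height = cell_size_degrees(n)
    print(n, round(width * METERS_PER_DEGREE, 2), round(height * METERS_PER_DEGREE, 2))
```

Cells get narrower towards the poles, since a degree of longitude covers less ground there.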

Here's the Go code to decode a geohash:

const latMin, latMax = -90.0, 90.0
const lonMin, lonMax = -180.0, 180.0

func decode(geohash string) (lat, lon float64) {
    curMinLat, curMaxLat := latMin, latMax
    curMinLon, curMaxLon := lonMin, lonMax
    bitIsLongitude := true

    // loop through each character in the geohash from left to right
    for _, character := range geohash {

        // decode the base32 representation
        nibblebit := strings.IndexRune(base32, character)

        // each character represents 5 bits
        for b := 4; b >= 0; b-- {
            // get the bit at position b
            bit := nibblebit & (1 << uint(b))
            if bitIsLongitude {
                if bit == 0 {
                    curMaxLon = (curMinLon + curMaxLon) / 2
                } else {
                    curMinLon = (curMinLon + curMaxLon) / 2
                }
            } else {
                if bit == 0 {
                    curMaxLat = (curMinLat + curMaxLat) / 2
                } else {
                    curMinLat = (curMinLat + curMaxLat) / 2
                }
            }
            bitIsLongitude = !bitIsLongitude
        }
    }

    // we now have a region - return the midpoint
    return (curMinLat + curMaxLat) / 2, (curMinLon + curMaxLon) / 2
}

Encoding a geohash

Encoding a geohash from a latitude and longitude works similarly.

Again we start with a region that encompasses the entire Earth. First, we determine whether the longitude is in the western or eastern hemisphere, and add a 0 or 1, respectively. Then we look at latitude to determine the next bit, halving the region again.

func encode(lat, lon float64, precision int) string {
    var geohash []rune

    if precision > 12 {
        precision = 12
    }

    curMinLat, curMaxLat := latMin, latMax
    curMinLon, curMaxLon := lonMin, lonMax
    bitIsLongitude := true
    nibblebit_idx := 4
    nibblebit := 0

    for len(geohash) < precision {

        if bitIsLongitude {
            mid := (curMinLon + curMaxLon) / 2
            if lon > mid {
                nibblebit |= 1 << uint(nibblebit_idx)
                curMinLon = mid
            } else {
                curMaxLon = mid
            }
        } else {
            mid := (curMinLat + curMaxLat) / 2
            if lat > mid {
                nibblebit |= 1 << uint(nibblebit_idx)
                curMinLat = mid
            } else {
                curMaxLat = mid
            }
        }
        bitIsLongitude = !bitIsLongitude

        nibblebit_idx--
        if nibblebit_idx == -1 {
            // we have a full base32 character
            geohash = append(geohash, rune(base32[nibblebit]))
            nibblebit_idx = 4
            nibblebit = 0
        }
    }

    return string(geohash)
}

A speed bump was that I didn't realize when to stop - every region can be divided further in half forever (until you hit the limit of floating point precision). To fix that, I've added a precision argument to specify the number of characters the resulting geohash should have.
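For readers who would rather experiment in Python, here is a port of the two Go functions. It is a sketch, not a reference implementation: the names and the error behaviour on invalid characters are my own choices. It also makes it easy to try out the geohash properties listed earlier.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode(geohash):
    """Return the (lat, lon) midpoint of the region a geohash describes."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bit_is_longitude = True
    for character in geohash:
        value = BASE32.index(character)  # raises ValueError on an invalid character
        for b in range(4, -1, -1):
            bit = (value >> b) & 1
            if bit_is_longitude:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            bit_is_longitude = not bit_is_longitude
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


def encode(lat, lon, precision):
    """Return the geohash of `precision` characters that contains (lat, lon)."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bit_is_longitude = True
    chars = []
    value, bit_count = 0, 0
    while len(chars) < precision:
        if bit_is_longitude:
            mid = (lon_lo + lon_hi) / 2
            bit = lon > mid
            if bit:
                lon_lo = mid
            else:
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat > mid
            if bit:
                lat_lo = mid
            else:
                lat_hi = mid
        value = (value << 1) | int(bit)
        bit_count += 1
        bit_is_longitude = not bit_is_longitude
        if bit_count == 5:  # we have a full base-32 character
            chars.append(BASE32[value])
            value, bit_count = 0, 0
    return "".join(chars)


lat, lon = decode("sunny")
print(round(lat, 1), round(lon, 1))  # 23.7 42.5
print(encode(lat, lon, 5))           # sunny
print(encode(lat, lon, 7)[:5])       # sunny - a longer geohash keeps the prefix
# Two points about 20 metres apart, on either side of the Equator, share no prefix.
print(encode(0.0001, 10.0, 5)[0], encode(-0.0001, 10.0, 5)[0])  # s k
```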

That's a wrap

That concludes geohashes and z-order curves. Z-order curves will be useful to create quadtrees, a data structure with further geographical and computational geometry applications. I'll discuss quadtrees in the next nibble, so stay tuned.

A note about Go: I understand why it's popular. It immediately felt familiar and avoids the ridiculous explosion of features some other languages have accumulated. I also like that it's garbage collected. GC is perfect for loads of applications, take it from John Carmack!

Thanks for reading! I write one nibble every month. For more, subscribe to my newsletter and follow me on Twitter or Mastodon.

References and Acknowledgements

Geohash.org
Encode and decode geohashes in the browser and reference JavaScript code
GitHub Copilot generated a lot of the Go code. Memorable instances: I typed base32 and it filled in the constants. I typed decode(geohash and it filled in quite a bit of the decode function, including the outline of the algorithm. The code did have several bugs and issues.
Illustrations created with Manim community.

A Nibble of Git's Object Store

Kurt — Sat, 10 Dec 2022 15:04:39 +0000

Power and efficiency through content-addressable storage and delta compression

Nibble: a small piece of food bitten off. In computing: half a byte of information. Every nibble explains a computing science or software engineering idea or system in five minutes.

Git, created by Linus Torvalds in 2005, has become ubiquitous. This nibble describes the architecture of Git's object store, where Git stores your files, directories, and commits, right there in the .git folder. The underlying ideas work together beautifully.

Image by author via Stable Diffusion

A content-addressable object store

Git's object store keeps arbitrary pieces of data, called objects. Objects are just bytes - the store doesn't care about the format of the data.

The store has two operations. You can store objects using git hash-object -w <filename>. This command calculates the SHA-1 hash of the file's content (prefixed with a short header that records the object's type and size) and stores the compressed object in the object store using the hash as the key. You can retrieve the object using git cat-file -p <hash>.

For example:

$ echo "some text" > file.txt

$ git hash-object -w file.txt
7b57bd29ea8afbdeb9bac64cf7074f4b531492a8

$ git cat-file -p 7b57bd29ea8afbdeb9bac64cf7074f4b531492a8
some text

Even if you use Git every day, you probably haven't used these commands. You can use them to store pretty much anything in a Git object store.

The idea of storing values by their hash is called content-addressable storage, which also made an appearance in A Nibble of Content-Defined Chunking. It's a powerful idea because it allows deduplication. If you add two identical files to Git, or rename or move an existing file, the file's content is only stored once.
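To make the hashing step concrete, here is a small Python sketch that computes the key Git uses for a blob. Git hashes a short header, "blob", the content size and a NUL byte, followed by the content itself. The function name git_blob_hash is made up for this example, and the sketch assumes a repository that uses SHA-1.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash content the way Git hashes a blob: SHA-1 over a small header plus the bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known hash.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# echo appends a newline, so file.txt above contains b"some text\n".
print(git_blob_hash(b"some text\n"))
```

Since the key depends only on the content, two files with the same bytes always map to the same object, whatever their names.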

Blob, tree, and commit objects

Git's more familiar commands, like git add and git commit, support common source control operations, like committing files and reverting files to a previous state. They all work by reading and writing to the object store. Git uses a few different types of objects to track file and directory contents, and metadata like commit messages.

First, blob objects hold the contents of files. The content of file.txt above was stored as a blob. Blobs do not store the name of the file, or any other information about it.

Second, tree objects represent a directory. A tree object lists the file and directory names it contains. For each file it additionally lists the hash of the blob object where the file's content is stored. For each folder it lists the hash of another tree object. Similar to a blob, which is a snapshot of a file, a tree object allows Git to recreate a snapshot of a directory.

Third, commit objects store information such as author, date, and commit message, and the hash of a tree object that represents the snapshot of the directory at that commit.

When committing a single file to an empty repository, Git creates one object of each type in the store:

$ git add -A

$ git commit -m "first commit"
[master (root-commit) 212c57e] first commit
 1 file changed, 1 insertion(+)
 create mode 100644 file.txt

Note Git shows the hash of the commit object. The commit object points to a tree object, which points to a blob object. We can inspect the objects using git cat-file:

$ git cat-file -p 212c57e
tree fc5a436ddf54bd82f2da31dde8898cc56b51ee7a
author Kurt <email> 1670074483 +0000
committer Kurt <email> 1670074483 +0000

first commit

$ git cat-file -p fc5a
100644 blob 7b57bd29ea8afbdeb9bac64cf7074f4b531492a8 file.txt

$ git cat-file -p 7b57
some text

As you add commits, files and directories over time, the objects create a directed acyclic graph, or DAG, which holds the history in commit objects, directories in tree objects, and file contents in blob objects.

Here's an animation to illustrate. On the left is a checked out directory, which contains just a README initially. Then two changes are committed. On the right is the commit graph. Every vertex in the graph corresponds to one object in the store - red vertices are commits, green are trees, and blue are blobs. Thanks to content-based hashing, each commit reuses as many objects as possible from previous commits.

On-disk layout of the store

Objects in the object store are kept in the .git/objects directory. Each object is stored as a separate file, and the SHA-1 hash is encoded in the path. The .git/objects directory is a shallow tree structure:

$ tree .git/objects
.git/objects
├── 21
│ └── 2c57e34401cc2c6a36191bdda3d8d3a71de4ec
├── 7b
│ └── 57bd29ea8afbdeb9bac64cf7074f4b531492a8
├── fc
│ └── 5a436ddf54bd82f2da31dde8898cc56b51ee7a

The first two characters of each key are a directory name, and the rest make up the file name. Concatenate them to get the hash of the object.

Why not store the objects as a flat list of files in .git/objects? That spells trouble if there are a large number of objects. Some file systems are slow to list directories with large numbers of files. Many file systems have a hard limit on the number of files in one directory. To sidestep these problems, Git uses a shallow tree instead.
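The mapping from hash to file path is a one-liner. Here is a sketch in Python; the helper name loose_object_path is hypothetical.

```python
from pathlib import PurePosixPath


def loose_object_path(object_hash: str) -> PurePosixPath:
    """Path of a loose object, relative to the repository root."""
    return PurePosixPath(".git/objects") / object_hash[:2] / object_hash[2:]


print(loose_object_path("7b57bd29ea8afbdeb9bac64cf7074f4b531492a8"))
# .git/objects/7b/57bd29ea8afbdeb9bac64cf7074f4b531492a8
```

With two hexadecimal characters for the directory name, there are at most 256 subdirectories, so each holds roughly 1/256th of the loose objects.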

Delta compression in pack files

As explained, git hash-object -w stores the content of a whole file into an object. Thanks to content-based addressing, files with identical content are stored only once, regardless of their name or location in the directory tree. But what about nearly identical files, such as successive versions of the same file? It's likely that large parts remain unchanged. As a result, the object store will amass lots of duplicate data over time. Git has one last trick up its sleeve to deal with this.

When you first commit files that are nearly identical to existing files (or use git hash-object -w), they are simply added as new objects. Git uses the term loose objects for objects stored in .git/objects as separate files.

Here's a demonstration - first, let's add a larger file file.txt to a new Git repository, to make it easier to see what's going on.

$ head -c 10K </dev/urandom > file.txt

$ git add file.txt

$ git commit -m "Initial commit"
[master (root-commit) d946bf7] Initial commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 file.txt

$ tree -h .git/objects/
.git/objects/
├── [4.0K] 7e
│   └── [53] 83c181b79bb609bae262bb5fcd2a12407d9a32
├── [4.0K] c2
│   └── [10K] d27d0d659b876b59ca77023b063f9cdf736dbd
├── [4.0K] d9
│   └── [136] 46bf71a38c30d03e088f2ddfe320cdd0850a85
├── [4.0K] info
└── [4.0K] pack

As before, the new file adds three objects: a commit, a tree and a blob (c2d27). After a small change to the file, Git stores three entirely new objects:

$ echo "new line" >> file.txt

$ git add file.txt

$ git commit -m "Appended a line"
[master f70c00e] Appended a line
 1 file changed, 0 insertions(+), 0 deletions(-)

$ tree -h .git/objects/
.git/objects/
├── [4.0K] 01
│   └── [10K] d35f8a800ada8dfccfd54a5615624614bf9656
├── [4.0K] 7e
│   └── [53] 83c181b79bb609bae262bb5fcd2a12407d9a32
├── [4.0K] c2
│   └── [10K] d27d0d659b876b59ca77023b063f9cdf736dbd
├── [4.0K] c9
│   └── [53] 73906cc48da970575264bc9725a1d80042f82f
├── [4.0K] d9
│   └── [136] 46bf71a38c30d03e088f2ddfe320cdd0850a85
├── [4.0K] f7
│   └── [167] 0c00eec3a5741d5f04e31e5274597b2b39bc0c
├── [4.0K] info
└── [4.0K] pack

Even though the new version of file.txt is almost the same as the previous, there are two 10KB blob objects in the store (c2d27 from before and 01d35).

To deal with this, Git periodically packs objects together in a pack file. Pack files end up in .git/objects/pack. The loose object files are deleted after packing, as they are no longer necessary.

When packing Git searches for nearly identical objects, and stores them using delta compression in the pack file. Delta compression stores differences, or deltas, between objects instead of complete snapshots. If the difference between two objects is smaller than their size, this saves space.

Git uses delta compression by picking a base object, which is stored in its entirety. Then nearly identical objects are stored as a series of “copy bytes from the base object” and “insert new bytes” instructions on top of the base object. Git tries various combinations of base and derived objects, and keeps the combination that results in the least amount of storage space.
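Here is a toy Python sketch of the idea. It is not Git's binary pack format: the tuple encoding and the instruction names "copy" and "insert" are assumptions made for illustration.

```python
def apply_delta(base: bytes, delta) -> bytes:
    """Rebuild an object from a base object and a list of delta instructions.

    Each instruction is either ("copy", offset, size), which copies bytes from
    the base, or ("insert", data), which adds new literal bytes.
    """
    out = bytearray()
    for instruction in delta:
        if instruction[0] == "copy":
            _, offset, size = instruction
            out += base[offset:offset + size]
        elif instruction[0] == "insert":
            out += instruction[1]
        else:
            raise ValueError(f"unknown instruction: {instruction[0]!r}")
    return bytes(out)


old_version = b"x" * 10_000  # stand-in for the first version of a file
delta = [("copy", 0, len(old_version)), ("insert", b"new line\n")]
new_version = apply_delta(old_version, delta)

print(len(new_version))                             # 10009
print(new_version == old_version + b"new line\n")   # True

# The delta can just as well go the other way: the old version is a prefix of the new one.
print(apply_delta(new_version, [("copy", 0, 10_000)]) == old_version)  # True
```

Either way, the delta is a handful of bytes, while each full version is about 10KB.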

Git packs:

before you push, to make data transfer efficient;
when the number of loose objects in the .git/objects directory reaches a threshold;
when you trigger it manually using git gc.

Let's run git gc:

$ git gc
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 12 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0)

$ tree -h .git/objects/
.git/objects/
└── [4.0K] pack
    ├── [1.2K] pack-f2a79ca1900a5c8130748c363392c6cb74c9898d.idx
    └── [10K] pack-f2a79ca1900a5c8130748c363392c6cb74c9898d.pack

All the loose objects are gone, and we now have some files under .git/objects/pack. Since the pack file is only 10KB, Git did a good job of detecting and removing the duplicate content. You now understand what the output of git gc means: Git found 6 objects in the store, used up to 12 threads to delta compress them, and stored one object as a delta to an existing object.

Let's see what's in the pack file:

$ git verify-pack -v .git/objects/pack/pack-f2a79ca1900a5c8130748c363392c6cb74c9898d.pack
f70c00eec3a5741d5f04e31e5274597b2b39bc0c commit 254 160 12
d946bf71a38c30d03e088f2ddfe320cdd0850a85 commit 205 131 172
01d35f8a800ada8dfccfd54a5615624614bf9656 blob 10249 10263 303
c973906cc48da970575264bc9725a1d80042f82f tree 36 47 10566
7e83c181b79bb609bae262bb5fcd2a12407d9a32 tree 36 47 10613
c2d27d0d659b876b59ca77023b063f9cdf736dbd blob 6 17 10660 1 01d35f8a800ada8dfccfd54a5615624614bf9656
non delta: 5 objects
chain length = 1: 1 object
.git/objects/pack/pack-f2a79ca1900a5c8130748c363392c6cb74c9898d.pack: ok

Great, we still have 6 objects, except now they're stored more efficiently, both in the number of files and the file size. The line for blob c2d27, the previous version of file.txt, shows that it's stored as a delta to 01d35, the latest version of file.txt. The third column indicates object size, making it clear that only 01d35 is stored in its entirety.

Recap

Git's object storage system uses the following key ideas:

Content-addressable storage to deduplicate identical content.
Delta compression to deduplicate nearly identical content.
A DAG of commit, tree and blob objects to store history of directories and files.

Thanks for reading! I write one nibble every month. For more, subscribe to my newsletter and follow me on Twitter or Mastodon.

References

Git internals: Git Objects and Pack files

Git compression of Blobs and Pack files

A Nibble of Lazy Evaluation

Kurt — Sun, 13 Nov 2022 18:47:19 +0000

Nibble: a small piece of food bitten off. In computing: half a byte of information. Every nibble explains one computing science or software engineering idea in five minutes.

Every programming language needs to choose in which order to evaluate expressions. Almost all use strict evaluation: before evaluating an expression, all sub-expressions are evaluated first. For example, arguments to a function are evaluated before the function. Many people use eager evaluation as a synonym for strict evaluation.

A minority, most prominently Haskell, uses non-strict evaluation. Non-strict evaluation is defined by what it is not: any evaluation strategy that doesn't evaluate all sub-expressions first, or at all, is non-strict. Lazy evaluation is a particular strategy of non-strict evaluation.

Let's try to clear up the confusion with a nibble of lazy evaluation.

I envy koalas for their ability to be supremely comfortable in the most awkward of places. Photo by David Clode on Unsplash

Strict evaluation

Like the majority of languages, Python uses strict evaluation for function applications:

def log(value):
    if log_level == INFO:
        print(value)

log(42 + 33)

First, the sum is calculated, then it's logged. While intuitive to understand, the example shows a potential downside of strict evaluation: if the argument ends up being unused, for example because log_level is WARN, then it's calculated needlessly.

A simple way to spot strict evaluation is to imagine what would happen if one sub-expression gets into an infinite loop. If the top-level expression never gets evaluated, then evaluation is strict.

Strict evaluation doesn't restrict in which order sub-expressions are evaluated, just that they're all evaluated before the top-level expression. It can be left to right, like in Python; right to left, as OCaml does in practice; or left unspecified, like in C.

Non-strict evaluation

Non-strict evaluation is not as unusual as it may appear at first - if...else... for example is non-strict.

Let's apply our trick to spot strict evaluation. The following Python program terminates even though the else branch loops forever:

if True:
    print("I am True")
else:
    while True:
        print("I never terminate")

The else branch is not evaluated at all, so if...else... is not strict in Python.

Short-circuited boolean operations are also non-strict. The following or expression terminates.

True or infinite_loop()

Non-strict DIY

That's nice for built-ins, but if you want something done right...

We can simulate non-strict evaluation using thunks: functions without arguments. Thunks delay evaluation until needed.

def log(value_thunk):
    if log_level == INFO:
        print(value_thunk())

log(lambda: 42 + 33)

Thunks avoid unnecessary work - if the value to log is expensive to compute, the log function only evaluates it when logging is enabled.

Lazy evaluation

An annoying problem with thunks is that they are re-evaluated each time they are called.

def log(value_thunk):
    if log_level == INFO:
        print(value_thunk())
        send_to_log_aggregator(value_thunk())

To solve that, we can keep the thunk's result the first time it's evaluated, and re-use it. This evaluation strategy is called lazy evaluation. It combines two optimizations:

It never performs unnecessary work, by delaying evaluation until needed.
It never repeats work, by caching the first result.

There is a catch: if the thunk has side effects, such as writing to disk, it becomes hard to understand when or if the side effect occurs. For this reason, lazy evaluation assumes thunks have no side effects.

The functional programming language Haskell gets away with using lazy evaluation by default because it is pure: Haskell functions do not have side effects.

Strict languages like OCaml, F#, and C# support lazy evaluation using a lazy keyword or Lazy object.

Here's an illustrative implementation of a Lazy object in Python:

class Lazy:
    def __init__(self, thunk):
        self._thunk = thunk
        self._is_cached = False
        self._value = None

    @property
    def value(self):
        if not self._is_cached:
            self._value = self._thunk()
            self._is_cached = True
        return self._value
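A quick usage sketch shows both optimizations at work. The class is repeated so the example runs on its own; the expensive function and the calls list are made up for this example.

```python
class Lazy:
    def __init__(self, thunk):
        self._thunk = thunk
        self._is_cached = False
        self._value = None

    @property
    def value(self):
        if not self._is_cached:
            self._value = self._thunk()
            self._is_cached = True
        return self._value


calls = []  # records every time the thunk actually runs


def expensive():
    calls.append("called")
    return 42 + 33


lazy_value = Lazy(expensive)
print(len(calls))        # 0 - nothing has been evaluated yet
print(lazy_value.value)  # 75
print(lazy_value.value)  # 75
print(len(calls))        # 1 - the thunk ran only once
```

Note that this sketch is not thread-safe: two threads reading value at the same time could both run the thunk.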

Why lazy evaluation matters

Strict evaluation is more common because it's easier to understand, and is more straightforward to debug. That said, non-strict evaluation is indispensable in some situations.

Doug McIlroy's Power Serious illustrates this beautifully: a complete, 10-line implementation of infinite power series in Haskell.

Here's one line from it:

series f = f : repeat 0

In Haskell, one of the fundamental data structures is a lazy list. The function series creates such a list. The first element of the list is the number f, and the rest are zeroes. The list constructor : prepends one element to a list. The standard function repeat s creates an infinite list of s.

Concretely, series 13 represents the infinite list [13, 0, 0, 0,...]. But since it's lazy, the elements are only calculated when needed. One way to ask for elements is to use the function take n, which takes n elements from the front of a list. In other words, take 4 (series 13) yields [13, 0, 0, 0].

At this point, you may think that lazy lists are just iterators. Like lazy lists, iterators are computed on demand and can be used to represent infinite data structures. Here's series as a Python generator:

from typing import Iterable

def series(f: int) -> Iterable[int]:
    yield f
    while True:
        yield 0

The difference is that lazy lists cache elements of the list as they are evaluated. A program like:

myList = series 10 --list thunk is created
some = take 5 myList --eval first 5 elements
somemore = take 20 myList --eval 15 more elements

evaluates only 20 values in total. Iterators sometimes can't be reset, and if they can the whole computation is redone from scratch. Not so with lazy lists!
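To get the same behaviour in Python, an iterator can be wrapped in a small caching layer. This is only a sketch of the idea: the LazyList class and its take method are made up for this example, and the evaluated list exists purely to count how many elements get computed.

```python
class LazyList:
    """Wraps an iterator and caches every element it has produced so far."""

    def __init__(self, iterator):
        self._iterator = iterator
        self._cache = []

    def take(self, n):
        # Pull only the elements that are not in the cache yet.
        while len(self._cache) < n:
            try:
                self._cache.append(next(self._iterator))
            except StopIteration:
                break
        return self._cache[:n]


evaluated = []  # records every element the generator computes


def series(f):
    """Generator version of series that logs each element it computes."""
    evaluated.append(f)
    yield f
    while True:
        evaluated.append(0)
        yield 0


my_list = LazyList(series(10))
some = my_list.take(5)        # evaluates the first 5 elements
some_more = my_list.take(20)  # evaluates 15 more elements
print(some)            # [10, 0, 0, 0, 0]
print(len(some_more))  # 20
print(len(evaluated))  # 20
```

Calling take(5) and take(20) on two fresh generators instead would compute 25 elements.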

Back to Power Serious. Doug McIlroy uses lazy lists to represent the coefficients of an infinite power series. In other words, the list [1, 2, 3, 4, 0, ...] represents the power series 1 + 2𝑥 + 3𝑥² + 4𝑥³ + ...

Here's addition of two power series:

 (f:ft) + (g:gt) = f+g : ft+gt

Haskell's syntax is terse. In this one line both the : and + symbols are overloaded. The : symbol on the left of the equals sign means list deconstruction. On the right, it means list construction. Similarly, the + on the left and the one in ft+gt add two power series, while the + in f+g adds two numbers. Here's an image that color-codes the different meanings of each symbol to clarify:

Hold on - a recursive definition without a base case? That's only possible thanks to lazy evaluation. When the first element of a sum of two series is needed, the expression f+g in turn needs f and g, which triggers evaluation of the first elements of the left and right list via the lazy pattern matches (f:ft) and (g:gt). Further evaluation of ft+gt is delayed until more elements are needed. This process repeats when a second element is asked for, and so on.

Lazy infinite lists simplify everything: no need to check when the list ends because it doesn't, and no recursive base case is needed because we only recurse when needed.
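For readers who don't speak Haskell, here's a rough Rust analogue using iterators. As discussed, iterators are lazy but don't cache, so this is only an approximation of the Haskell definition; add_series is a name I made up for the sketch.

```rust
/// Coefficients of the power series f + 0x + 0x² + ..., as an infinite lazy iterator.
fn series(f: i64) -> impl Iterator<Item = i64> {
    std::iter::once(f).chain(std::iter::repeat(0))
}

/// Point-wise sum of two coefficient streams.
/// There is no base case: neither stream ends, and nothing is computed
/// until a consumer asks for elements.
fn add_series(
    a: impl Iterator<Item = i64>,
    b: impl Iterator<Item = i64>,
) -> impl Iterator<Item = i64> {
    a.zip(b).map(|(f, g)| f + g)
}
```

For example, add_series(series(13), series(4)).take(4) yields 17, 0, 0, 0 and never touches the rest of either stream.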

Recap

This nibble discussed a bunch of terms.

I divided evaluation strategies into two categories: strict and non-strict.

In strict evaluation, all of an expression's sub-expressions are evaluated first. Non-strict evaluation encompasses all strategies that do anything else, including not evaluating some arguments at all.

Strict evaluation is sometimes called eager evaluation, which sounds like it's the counterpart of lazy evaluation. However, lazy evaluation is just one of many possible non-strict strategies. Lazy evaluation stands out because it additionally caches results, but that's only possible if the expressions don't have side effects.

Finally, strict and non-strict evaluation are commonly used together in a single programming language, and even in the same expression. Hence the intersection in the image. An example is if...else...: evaluation of the condition is strict, but evaluation of the two clauses is non-strict.
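The same mix shows up in library APIs of strict languages. As a small illustration in Rust: Option::unwrap_or takes an already evaluated argument (strict), while Option::unwrap_or_else takes a closure that is only called when needed (non-strict). The counting function below is my own scaffolding to make the difference observable.

```rust
use std::cell::Cell;

/// Returns how many times the fallback expression ran under each strategy,
/// as (strict, non_strict).
fn count_fallback_evaluations() -> (u32, u32) {
    let strict_calls = Cell::new(0);
    let lazy_calls = Cell::new(0);
    let present: Option<i32> = Some(1);

    // Strict: the argument block runs before `unwrap_or` is called,
    // even though the fallback value ends up unused.
    let _ = present.unwrap_or({
        strict_calls.set(strict_calls.get() + 1);
        0
    });

    // Non-strict: the closure only runs if the option is `None`.
    let _ = present.unwrap_or_else(|| {
        lazy_calls.set(lazy_calls.get() + 1);
        0
    });

    (strict_calls.get(), lazy_calls.get())
}
```

count_fallback_evaluations() returns (1, 0): the strict fallback was computed once for nothing, the non-strict one was never computed.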

Thanks for reading! I intend to write one nibble every month. For more, subscribe to my newsletter and follow me on Twitter.

A Nibble of Content-Defined Chunking

Kurt — Fri, 14 Oct 2022 11:01:02 +0000

Nibble: a small piece of food bitten off. In computing: half a byte of information. In every nibble, I explain one idea from computing science or software engineering in five minutes of reading time.

Consider the problem of backing up large binary files, like virtual machine images, zip files, images, or videos. Because typically only a small fraction of the data changes, it is terribly slow and expensive to transfer each and every file again and again, with every backup. Ideally, we'd transfer only the changes since the last backup, and nothing else. Of course, we can't afford to miss something either!

Photo by Mae Mu on Unsplash

A naïve approach would be to find the changed parts by comparing the current version of the files with the previously backed-up version. However, that requires a complete transfer of each file, which defeats the purpose.

Fixed-size chunking

A better solution is to make the backup content addressable. Split each file into parts of fixed size called chunks and compute a hash of each chunk's content. Initially, we transfer each chunk to the backup store keyed by its hash and store a file with the chunk names that make up every file in the backup.

Afterward, if disaster strikes, we can recover files by downloading their list of chunk names, downloading the chunks, and then simply joining them together.

This approach is called fixed-size chunking.

As a bonus, fixed-size chunking de-duplicates: if one or more chunks are identical by hash, they're transferred only once. It doesn't matter if the duplicate chunk is in the same or another file.
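Here's a minimal in-memory sketch of fixed-size chunking with de-duplication. ChunkStore, backup and restore are hypothetical names, and DefaultHasher is a stand-in to keep the example dependency-free; a real backup tool would use a cryptographic hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content hash (assumption: real systems use a cryptographic hash).
fn content_hash(chunk: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    chunk.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical in-memory backup store: chunk hash -> chunk bytes.
#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<u64, Vec<u8>>,
    transferred: usize, // number of chunks actually uploaded
}

impl ChunkStore {
    /// Splits `file` into fixed-size chunks, uploads chunks not seen before,
    /// and returns the list of chunk names needed to restore the file.
    fn backup(&mut self, file: &[u8], chunk_size: usize) -> Vec<u64> {
        let mut names = Vec::new();
        for chunk in file.chunks(chunk_size) {
            let name = content_hash(chunk);
            if !self.chunks.contains_key(&name) {
                self.chunks.insert(name, chunk.to_vec());
                self.transferred += 1;
            }
            names.push(name);
        }
        names
    }

    /// Restores a file by joining its chunks in order.
    fn restore(&self, names: &[u64]) -> Vec<u8> {
        names
            .iter()
            .flat_map(|name| self.chunks[name].iter().copied())
            .collect()
    }
}
```

Backing up the 16 bytes "aaaabbbbaaaacccc" with a chunk size of 4 produces four chunk names but transfers only three chunks, because "aaaa" occurs twice. Changing a single byte and backing up again transfers just one more chunk.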

Chunking addresses incremental transfer nicely. If a chunk changes, only that chunk needs to be transferred:

Are we done? Not quite. With fixed-size chunking, a few bytes inserted or removed at the beginning of a file means all of the following chunks get a different hash, even though they haven't changed! That significantly affects our ability to de-duplicate and transfer incrementally.

Content-defined chunking

The solution is to use content-defined chunking. Instead of splitting the file into fixed chunks, we let the file content determine where to split. We can do this by computing a rolling hash and splitting when the rolling hash satisfies a condition - typically, when some number of bits in the hash is zero.

First, let's understand what a rolling hash is. A rolling hash has a fixed window size, say 64 bytes. For each window in the file, the hash is computed, so the first hash is bytes 0 to 63, then 1 to 64, then 2 to 65, and so on. A rolling hash is efficient because it can be incrementally updated, by removing the contribution of the oldest byte and adding the new byte.
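As a sketch of the incremental update, here's a simple polynomial rolling hash in Rust. The hash function and the BASE constant are my choice for illustration; production chunkers typically use other rolling hashes, but the remove-oldest, add-newest structure is the same.

```rust
const BASE: u64 = 1_000_003; // arbitrary odd multiplier, chosen for illustration

/// Hash of one window computed from scratch:
/// the sum of each byte times BASE^(distance from the end of the window).
fn window_hash(window: &[u8]) -> u64 {
    window
        .iter()
        .fold(0u64, |h, &b| h.wrapping_mul(BASE).wrapping_add(u64::from(b)))
}

/// Hashes of every full window of `window_size` bytes, using the incremental update.
fn rolling_hashes(data: &[u8], window_size: usize) -> Vec<u64> {
    if window_size == 0 || data.len() < window_size {
        return Vec::new();
    }
    // BASE^window_size: the weight of the oldest byte after one more multiplication.
    let oldest_weight = (0..window_size).fold(1u64, |p, _| p.wrapping_mul(BASE));

    let mut hash = window_hash(&data[..window_size]);
    let mut hashes = vec![hash];
    for i in window_size..data.len() {
        let oldest = u64::from(data[i - window_size]);
        let newest = u64::from(data[i]);
        // Slide the window: scale everything, drop the oldest byte, add the newest.
        hash = hash
            .wrapping_mul(BASE)
            .wrapping_sub(oldest.wrapping_mul(oldest_weight))
            .wrapping_add(newest);
        hashes.push(hash);
    }
    hashes
}
```

Each step costs a handful of arithmetic operations regardless of the window size, and the result is identical to calling window_hash on every window separately.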

Now, to find chunk boundaries, we check whether the lowest N bits of each hash are all zeros. If so, we have a new chunk boundary. The all-zeroes condition is arbitrary - it may as well be all ones, or the number two, or spell your name. More importantly, by choosing N we can target a chunk size of 2ᴺ bytes on average. That's because the hash's bits are uniformly distributed, so any particular pattern of N bits occurs with a probability of 1 in 2ᴺ.

The rest is like fixed-size chunking: hash the content of each chunk, and use that to transfer and de-duplicate.

Content-defined chunking neatly solves the problem of inserting or removing data, because the chunk boundaries are no longer offset-based. Now, if data is inserted anywhere outside the chunk boundaries, the boundaries "move" with the data, and most chunks stay the same.
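To see the boundaries move with the data, here's a toy content-defined chunker in Rust. The window size, multiplier and mask are arbitrary illustrative values, and the polynomial hash is a simple stand-in rather than what a real backup tool would use. There are also no minimum or maximum chunk sizes, which real implementations usually add.

```rust
const WINDOW: usize = 16;       // rolling hash window in bytes (illustrative)
const BASE: u64 = 1_000_003;    // arbitrary odd multiplier
const MASK: u64 = (1 << 4) - 1; // N = 4 low bits, so roughly 16-byte chunks

/// Splits `data` after every position where the hash of the last WINDOW bytes
/// has its lowest N bits all zero.
fn chunk(data: &[u8]) -> Vec<&[u8]> {
    // BASE^WINDOW: the weight of the byte that is leaving the window.
    let oldest_weight = (0..WINDOW).fold(1u64, |p, _| p.wrapping_mul(BASE));
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut hash = 0u64;
    for (i, &byte) in data.iter().enumerate() {
        hash = hash.wrapping_mul(BASE).wrapping_add(u64::from(byte));
        if i >= WINDOW {
            // Remove the contribution of the byte that just left the window.
            let oldest = u64::from(data[i - WINDOW]);
            hash = hash.wrapping_sub(oldest.wrapping_mul(oldest_weight));
        }
        let window_full = i + 1 >= WINDOW;
        if window_full && (hash & MASK) == 0 {
            chunks.push(&data[start..=i]);
            start = i + 1;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]); // trailing bytes form the last chunk
    }
    chunks
}
```

Because a boundary depends only on the bytes inside the window, prepending a few bytes to a file can only affect the first chunk: every later chunk of the original file shows up unchanged in the chunking of the modified file, so its hash is already in the backup store.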

Thanks for reading the first-ever nibble! I intend to write one nibble every month. Think of them as the amuse-bouches of Get Code.

For more, subscribe to my newsletter and follow me on Twitter.

Automatic Differentiation: From Forward to Reverse in Small Steps

Kurt — Fri, 23 Sep 2022 12:40:43 +0000

An in-depth explanation of differentiable programming, for programmers

I associate derivatives with practicing differentiation rules on increasingly implausible looking functions. If you have similar memories, you'll remember it's a mechanical, rather boring process. Boring for humans is good for computers though, because that means it's amenable to automation. So automate it we will, using a technique called Automatic Differentiation (AD). What's surprising about AD is that it generalizes so well: AD not only calculates the derivative of mathematical functions, but also of programs using constructs like conditionals, loops, recursion and higher-order functions. To highlight this, people have coined differentiable programming as an umbrella term for the various uses and applications AD enables.

What a neuron looks like according to Stable Diffusion.

AD has become a widely used technique, as it plays a pivotal role in deep learning. Gradient descent-based optimization - how the billions of weights in neural networks are trained - needs derivatives, and AD calculates them extremely efficiently. It's up to 6 or 7 orders of magnitude faster than naïve approaches. That's the difference between one day to train a model and 2,738 years! With the current explosion of interest in deep learning, AD is a hot research topic as well. That said, it has many other applications, notably computer graphics, and risk analysis in finance.

This post explains how AD works in-depth, both its forward and somewhat baffling reverse mode. It is aimed at programmers - I'll assume you remember some notions of what differentiation is and some rules, but will link to further information to explore at your leisure. Examples are in Rust, so some experience with typed, functional languages is helpful but not necessary. The implementation does not rely on Rust-specific features.

All code is on GitHub.

What are derivatives, really?

Derivatives are typically explained in visual terms using graphs of functions. For a scalar function ℝ→ℝ, the derivative at a particular point is the slope of the tangent line to the function at that point. This interpretation builds useful intuition, and I recommend 3blue1brown's excellent visual series if you're unfamiliar or need a refresher. The visual interpretation works for functions that are two or three dimensional - for example a function 𝑓:ℝ²→ℝ describes the height of an imagined terrain as a function of (x,y) coordinates. The derivative of such a function has two components, one for each input: ∂𝑓/∂𝑥 and ∂𝑓/∂𝑦. These are called the partial derivatives, and together describe the slope of the terrain at a particular (x,y) coordinate.

A tangent line at x=3

However, no one can imagine a four-dimensional space, let alone GPT-3's 175 billion-dimensional space of weights. To deal with that, another perspective is that the derivative is a measure of the sensitivity of a function to its inputs. (I'll use the terms inputs and outputs of functions rather than parameters or domain, because it's more familiar to programmers.)

This view is common in risk analysis. Say you have a function that, based on various observable values, calculates today's value of an investment. The observables could be interest rates, inflation rates, foreign exchange rates, anything that affects the value of the investment. To gauge how risky the investment is, we'd like to know how its value changes if the observables change. If the value of the investment changes significantly for a small change in say interest rate, then that indicates it is particularly sensitive to interest rates. If we don't like such a high exposure to interest rates, we may want to hedge this risk.

It turns out that the derivative is exactly a measure for the sensitivity of the output(s) of a function to each of its inputs. More accurately, for a function 𝑓: ℝ³→ℝ, i.e. a function with three inputs and one output, the derivative function 𝑑𝑓: ℝ³→ℝ³ has three partial derivatives at each input value. The partial derivative is a measure of how the value of 𝑓 changes, if the corresponding input value changes - in other words, how sensitive 𝑓 is to changes in the value of each of its inputs. What's neat about partial derivatives is that you can linearly combine them to get the sensitivity to several inputs as well (for well-behaved functions, at least). The partial derivatives are exactly the quantities you need to figure out everything there is to know about the sensitivity of the function to input changes.

But how is this measure interpreted and where does it break down? It turns out that the derivative is just an approximation. To be exact, the derivative is the best linear approximation of the function.

In practice linear approximations of non-linear functions are only good close to the point where they are evaluated. If our investment is worth $1000 today, at an interest rate of 1%, and sensitivity to interest rate is $10 per 0.01% at 1% interest rate, then a reasonable estimate of the value of the investment at 1.01% interest rate is $1010. However it is just an estimate, and the further away from the interest rate of 1%, the more inaccurate the estimate will be. The exception to this is if the function is itself linear - in that case a linear approximation is exact.
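Here's a small numeric sketch of that effect in Rust. The function value and its hand-written derivative sensitivity are hypothetical stand-ins for the investment example; any smooth non-linear function would do.

```rust
/// A hypothetical non-linear function of one input, used only for illustration.
fn value(t: f64) -> f64 {
    t * t + t + 1.0
}

/// Its derivative, written by hand: the sensitivity of `value` to `t`.
fn sensitivity(t: f64) -> f64 {
    2.0 * t + 1.0
}

/// Linear estimate of `value(t + delta)` using only information at `t`.
fn linear_estimate(t: f64, delta: f64) -> f64 {
    value(t) + sensitivity(t) * delta
}

/// Absolute error of the linear estimate for a given step away from `t`.
fn estimate_error(t: f64, delta: f64) -> f64 {
    (value(t + delta) - linear_estimate(t, delta)).abs()
}
```

Around t = 5, a step of 0.01 gives an error of about 0.0001, while a step of 1.0 gives an error of 1.0: the estimate degrades quickly as we move away from the point where the derivative was taken.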

To change tack, having rid ourselves of the necessity to imagine 15-dimensional landscapes, we can now also understand how deep learning networks are trained through an optimization process called gradient descent. A neural network can be understood as a machine that has potentially billions of parameters, called weights, which need to be adjusted so that the machine produces the desired outputs. During training, we give the machine an input and it produces an output. We compute the error - the difference between the machine's output, and the desired output. We'd like to make this error as small as possible. Because the error is a function of the weights of the machine, we can calculate how sensitive the error is to small changes in each of the weights by calculating the partial derivatives of the error function with respect to all the (potentially billions of) weights. Once we know how the error changes with the weights, we can adjust the weights to make the error smaller - if the sensitivity is positive, we'll make that weight a bit smaller, if it's negative, we'll make it a bit bigger. Lather, rinse, repeat, and before you know it you can generate some amazing images.
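The lather-rinse-repeat loop is short enough to sketch in full for a "machine" with a single weight. The error function, its target of 3.0 and the learning rate are arbitrary choices for illustration; real networks have many weights and get their sensitivities from AD instead of a hand-written derivative.

```rust
/// Error of a one-weight "machine": smallest when the weight equals 3.
/// The target value 3.0 is an arbitrary choice for this illustration.
fn error(weight: f64) -> f64 {
    (weight - 3.0).powi(2)
}

/// Sensitivity of the error to the weight (its derivative), written by hand.
fn error_sensitivity(weight: f64) -> f64 {
    2.0 * (weight - 3.0)
}

/// Repeatedly nudges the weight against the sign of the sensitivity:
/// positive sensitivity makes the weight smaller, negative makes it bigger.
fn gradient_descent(mut weight: f64, learning_rate: f64, steps: usize) -> f64 {
    for _ in 0..steps {
        weight -= learning_rate * error_sensitivity(weight);
    }
    weight
}
```

Starting from a weight of 0.0 with a learning rate of 0.1, a hundred steps bring the weight to within a millionth of 3.0.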

What differentiable programming means

Let's start with a simple example - a function that takes and returns a single floating point number.

fn f(t: f64) -> f64 {
    t.powi(2) + t + 1.0
}

Let's say we're interested in having a function df which represents the derivative of f.

How can we figure out df from f automatically? Besides automatic differentiation, there are two other approaches: symbolic and numeric differentiation. I'll briefly discuss them to better understand what makes AD so effective.

Symbolic differentiation

An exhaustive set of rules exists to calculate df from f, so by applying these rules or asking something smarter than me, we can write df directly:

fn df_sym(t: f64) -> f64 {
    2.0 * t + 1.0
}

This approach is called symbolic differentiation. It's easy to do for this simple example, but with common programming constructs like shared variables, iteration, recursion and conditionals, it can be very hard to come up with an efficient closed formula for the derivative.

Numeric differentiation

An alternative is numeric approximation, which follows from the definition of derivative:

fn df_num(t: f64, h: f64) -> f64 {
    (f(t + h) - f(t)) / h
}

Here h is some small number. Numeric differentiation, while simple to implement, is not without problems: choosing an h that is just right and doesn't lead to numerical instabilities is tricky. There are approaches that solve some of these problems, but simplicity is lost. Also, this needs two evaluations of f, which is not ideal.
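To illustrate both points, here's a sketch comparing the forward difference above with the central difference, a commonly used alternative that I'm adding for comparison. It also shows what happens when h is too small.

```rust
fn f(t: f64) -> f64 {
    t.powi(2) + t + 1.0
}

/// Forward difference, as in `df_num`: the error shrinks proportionally to `h`.
fn df_forward(t: f64, h: f64) -> f64 {
    (f(t + h) - f(t)) / h
}

/// Central difference: the error shrinks proportionally to `h * h`,
/// still at the cost of two evaluations of `f`.
fn df_central(t: f64, h: f64) -> f64 {
    (f(t + h) - f(t - h)) / (2.0 * h)
}
```

At t = 5 and h = 0.001, the forward difference is off by about 0.001, while the central difference is accurate to many more digits. But make h too small and everything falls apart: with h = 1e-17, t + h rounds to exactly t, so df_forward(5.0, 1e-17) returns 0.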

Comparing symbolic and numeric differentiation:

let t = 5.0;
println!("Symbolic: {}", df_sym(t));

let h = 0.000001;
println!("Numeric: {}", df_num(t, h))

Results in:

Symbolic: 11
Numeric: 11.000001002514637

Automatic differentiation

The idea of automatic differentiation is to compute the derivative at the same time as the function itself, by applying differentiation rules one operation at a time. Let's do that manually first.

We're aiming for a function f_ad(t_dual: (f64, f64)) -> (f64, f64). The argument t_dual is a tuple with the value of t as the first element, and its derivative as the second element. The combination of these two is typically called a dual number. We need t_dual to be a dual number because the input to f_ad might be the result of another automatically differentiated function. The result of f_ad is also a dual number: it returns the value f would normally compute, as well as its derivative.

Here's the implementation in AD style:

fn f_ad(t_dual: (f64, f64)) -> (f64, f64) {
    let (primalt, tangentt) = t_dual;
    let (primal0, tangent0) = (primalt.powi(2), 2.0 * primalt * tangentt);
    let (primal1, tangent1) = (primal0 + primalt, tangent0 + tangentt);
    let (primal2, tangent2) = (primal1 + 1.0, tangent1);
    (primal2, tangent2)
}

Each operation in the original f is on a separate line. On each line, the regular value, the primal, is updated, along with the tangent, which represents the derivative. The calculation of the tangent is the result of applying differentiation rules, most importantly the chain rule. It's worth dwelling on the chain rule for a bit, because it's essential to AD.

The chain rule tells us how to calculate the derivative of function composition - a fancy name for passing the result of a function e into a function f, i.e. f(e(t)). This is written as 𝑓 ∘ 𝑒, where ∘ denotes function composition. (I used e here instead of the more typical g, because e comes before f). The chain rule is used when calculating tangent0.

The first component of tangent0 is 2.0 * primalt, because of the power rule. Then that's multiplied by tangentt because of the chain rule, which comes into play because t_dual may be the result of applying another function e. t_dual's tangent is the derivative of e(t). tangent0 then needs to determine the derivative of e(t).powi(2), in other words the composition of e and powi. Hence the chain rule tells us to derive powi, and then multiply that by the derivative of e, which is tangentt.

Note that calculating the primal at each step only needs previously calculated primal values - this seems obvious but is not true for the tangent, which needs both primal and previous tangent values. tangent0 for example uses the factor 2.0, which comes from the exponent in its own primal powi(2), as well as primalt and tangentt.

Finally, here's how to call f_ad:

println!("Automatic: {}", f_ad((t, 1.0)).1);

Why is the initial tangent 1.0? Because that's the derivative of the identity function 𝑖𝑑(𝑥)=𝑥.

This outputs:

Automatic: 11

Have a go yourself at the Rust playground.
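The chain rule discussion above can be made concrete by feeding f_ad the result of another AD-style function instead of a plain variable. In this sketch, e_ad is a hypothetical inner function e(t) = 3t that I added for illustration; f_ad is the function from above, lightly condensed.

```rust
/// e(t) = 3t in AD style. `e_ad` is a hypothetical inner function.
fn e_ad(t_dual: (f64, f64)) -> (f64, f64) {
    let (primal, tangent) = t_dual;
    (3.0 * primal, 3.0 * tangent)
}

/// f(t) = t² + t + 1 in AD style, as before.
fn f_ad(t_dual: (f64, f64)) -> (f64, f64) {
    let (primalt, tangentt) = t_dual;
    // Power rule times the incoming tangent: this is the chain rule at work.
    let (primal0, tangent0) = (primalt.powi(2), 2.0 * primalt * tangentt);
    let (primal1, tangent1) = (primal0 + primalt, tangent0 + tangentt);
    (primal1 + 1.0, tangent1)
}

/// Value and derivative of the composition f(e(t)) at `t`.
fn f_after_e(t: f64) -> (f64, f64) {
    f_ad(e_ad((t, 1.0)))
}
```

By hand, f(e(t)) = 9t² + 3t + 1 with derivative 18t + 3. At t = 2 that's a value of 43 and a derivative of 39, which is exactly what f_after_e(2.0) returns, without f_ad knowing anything about e.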

Towards automatic differentiation

The next sections develop a minimal implementation of AD. It proceeds in steps: first, we make the manual implementation of AD actually automatic. Then, we'll investigate the efficiency of the implementation, and iteratively improve it. Each step reveals an insight into how and why AD works.

Generally, AD can be implemented in two ways: source to source transformation, or via operator overloading. Source to source transformation analyses the source of the original program f and generates the code for f_ad. Depending on the constructs the transformation supports, it is entirely transparent to the programmer. However, it needs a lot of machinery for parsing and producing efficient code, which is not relevant to the core ideas of AD. I'll use operator overloading here, but the ideas are transferable.

Actually automatic through operator overloading

First let's refactor the tuples into a proper type of dual numbers: Dual. The idea is to define operators like sum and product on Dual using operator overloading, so that we can write programs that work with dual numbers as if we're dealing with floating point numbers. Except now functions compute the value and derivative at the same time! This process is somewhat similar to introducing a library for complex or rational numbers. Here's the type for dual numbers:

struct Dual {
    primal: f64,
    tangent: f64,
}

Primitive Dual values are constant and var:

impl Dual {
    fn constant(value: f64) -> Self {
        Dual {
            primal: value,
            tangent: 0.0,
        }
    }

    fn var(value: f64) -> Self {
        Dual {
            primal: value,
            tangent: 1.0,
        }
    }
}

Both constant and var have a user-defined primal value, and only differ in their derivative. constant represents the constant function f(x) = C. Constant functions have a derivative that's 0 everywhere. var represents the identity function f(x) = x, with a derivative that's 1 everywhere. (Why call it var and not id? Because typically it's used to define new variables we'd like to calculate the derivative of.)

Next is overloading addition and multiplication for Dual by using the sum and product rules in tangent:

// Add trait omitted
fn add(self, rhs: Self) -> Self::Output {
    Dual {
        primal: self.primal + rhs.primal,
        tangent: self.tangent + rhs.tangent,
    }
}

// Mul trait omitted
fn mul(self, rhs: Self) -> Self::Output {
    Dual {
        primal: self.primal * rhs.primal,
        tangent: rhs.tangent * self.primal + self.tangent * rhs.primal,
    }
}

And also define powi for Dual:

impl Dual {
    fn powi(self, exp: i32) -> Self {
        Dual {
            primal: self.primal.powi(exp),
            tangent: f64::from(exp) * self.primal.powi(exp - 1) * self.tangent,
        }
    }
}

The example function f now becomes:

fn f_ad_dual(t: Dual) -> Dual {
    t.powi(2) + t + Dual::constant(1.0)
}

// call site:
f_ad_dual(Dual::var(3.0))

It looks like the original function f, with a bit of syntactic noise. Some languages allow defining new literals, or implicit conversions, which allows getting rid of the call to Dual::constant. Another way to achieve that would be to define overloads for all operators for Dual and f64 in all possible combinations. In unityped languages with operator overloading, there isn't even a type signature to change, so f may not need to change at all.
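For completeness, here's one way the pieces above could fit together into something that compiles, including the Add and Mul trait implementations I omitted. The derive attributes and the f_with_mul variant are additions for this sketch.

```rust
use std::ops::{Add, Mul};

#[derive(Clone, Copy, Debug, PartialEq)]
struct Dual {
    primal: f64,
    tangent: f64,
}

impl Dual {
    fn constant(value: f64) -> Self {
        Dual { primal: value, tangent: 0.0 }
    }

    fn var(value: f64) -> Self {
        Dual { primal: value, tangent: 1.0 }
    }

    fn powi(self, exp: i32) -> Self {
        Dual {
            primal: self.primal.powi(exp),
            // Power rule, multiplied by the incoming tangent (chain rule).
            tangent: f64::from(exp) * self.primal.powi(exp - 1) * self.tangent,
        }
    }
}

impl Add for Dual {
    type Output = Dual;
    // Sum rule: tangents add.
    fn add(self, rhs: Self) -> Self::Output {
        Dual {
            primal: self.primal + rhs.primal,
            tangent: self.tangent + rhs.tangent,
        }
    }
}

impl Mul for Dual {
    type Output = Dual;
    // Product rule: (xy)' = x y' + x' y.
    fn mul(self, rhs: Self) -> Self::Output {
        Dual {
            primal: self.primal * rhs.primal,
            tangent: rhs.tangent * self.primal + self.tangent * rhs.primal,
        }
    }
}

fn f_ad_dual(t: Dual) -> Dual {
    t.powi(2) + t + Dual::constant(1.0)
}

/// Same function, written with multiplication instead of `powi`.
fn f_with_mul(t: Dual) -> Dual {
    t * t + t + Dual::constant(1.0)
}
```

Both f_ad_dual(Dual::var(3.0)) and f_with_mul(Dual::var(3.0)) evaluate to a primal of 13 and a tangent of 7, matching 2t + 1 at t = 3.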

I'll be using graphs throughout this post to illustrate the calculation of the tangent, which is straightforward now but will become more involved. Here's a graph representing the calculation of df(3), i.e. calculation of Dual.tangent at t=3:

On the left is the tangent of the input t, shown as Δ𝑡. Edges scale the tangent with a factor, in this case because of the chain rule. If no factor appears on the edge, it is equal to one. At each + node, tangents are added.

Here's how the calculation proceeds - from left to right, a detail that will become important.

Intermezzo: Dual numbers

Mathematically, Dual numbers are expressions of the form a+bϵ where a and b are real numbers, and ϵ is a "small quantity" such that ϵ²=0. Think of complex numbers but ϵ instead of 𝑖.

What's interesting about dual numbers is that we can think of the first part of the dual number as the primal, and the second part as the derivative, and end up with rules for differentiation that are the same as the rules for symbolic differentiation. For sums and products this needs only simple arithmetic:

Sum: (a+bϵ) + (c+dϵ) = (a+c) + (b+d)ϵ.
Product: (a+bϵ)(c+dϵ) = ac + adϵ + bcϵ + bdϵ² = ac + (ad + bc)ϵ since ϵ²=0.

Compare this with the implementation of Dual for sum and product: the first part of the dual number corresponds to the primal, the second part to the tangent. A chain rule with a similar shape can be derived as well.

Generalizing dual numbers

We left the previous section with a Dual type that consists of two floating point numbers. However, the derivative of a function with multiple inputs has multiple components: the partial derivative with respect to each input. This means that a single float may not be enough for the tangent field. To address that, let's refactor Dual. At this point some of this might seem abstraction for the sake of abstraction, but it will come in handy later.

First redefine Dual as follows:

struct Dual<T> {
    primal: f64,
    tangent: T,
}

The primal is still a f64 value, but tangent is left generic. We can recover our old Dual by instantiating T as f64, so this is strictly a generalization. However, just any T won't do - we must be able to apply differentiation rules to it.

It turns out that to calculate derivatives, T only needs a zero element, an addition and scale operation - which corresponds to T forming a vector space. I don't know how to explain intuitively why this is true! But you can check the differentiation rules to see that it is. An accurate statement and proof can be found on the Linearity of differentiation Wikipedia page.

Anyway, trusting the math hasn't failed me yet, so we can represent this requirement on T by a Rust trait:

trait VectorSpace {
    fn zero() -> Self;
    fn add(&self, rhs: &Self) -> Self;
    fn scale(&self, factor: f64) -> Self;
}

And implement VectorSpace for floats:

impl VectorSpace for f64 {
    fn zero() -> Self {
        0.0
    }

    fn add(&self, rhs: &Self) -> Self {
        self + rhs
    }

    fn scale(&self, factor: f64) -> Self {
        self * factor
    }
}

Now we can overload operators on Dual<T> for any T which is a VectorSpace:

impl<T: VectorSpace> Dual<T> {
    fn constant(primal: f64) -> Self {
        Dual {
            primal,
            tangent: T::zero(),
        }
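    }

    // Constructor used in the examples below: sets the tangent explicitly.
    fn new(primal: f64, tangent: T) -> Self {
        Dual { primal, tangent }
    }
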
    fn add_impl(&self, rhs: &Dual<T>) -> Dual<T> {
        Dual::new(self.primal + rhs.primal, self.tangent.add(&rhs.tangent))
    }

    fn mul_impl(&self, rhs: &Dual<T>) -> Dual<T> {
        Dual::new(
            self.primal * rhs.primal,
            rhs.tangent
                .scale(self.primal)
                .add(&self.tangent.scale(rhs.primal)),
        )
    }
}

I'm omitting the ceremonial implementations of the Add and Mul traits. It's now possible to write f with our new Dual<T> type:

fn f<T: VectorSpace>(t: &Dual<T>) -> Dual<T> {
    t.powi(2) + t + Dual::constant(1.0)
}

fn df(t: f64) -> f64 {
    let res = f(&Dual::new(t, 1.0));
    res.tangent
}

Dual<T> can now be used like Dual could, and we seem to be back where we started. We have gained some power though - let's actually leverage it next.

Forward mode and Reverse mode

AD comes in two flavours, called modes: forward mode and reverse mode. These two modes can also be mixed. So far we've implemented forward mode, which is the easiest to understand. It's called forward mode because the computation of the derivative flows in the same "forward direction" as the evaluation of the main program. In reverse mode, part of the calculation will happen in the opposite direction - we'll build up to how exactly that works. For the moment, let's consider our current implementation more in depth, to understand what its limits are.

For functions that have a single input, but many outputs, forward mode works efficiently. More concretely, calculating the derivative only adds a constant factor overhead to each operation - a few more additions and multiplications. Here's an example of a function with multiple outputs, and how we can get its derivatives:

fn f_2out<T: VectorSpace>(x: &Dual<T>) -> (Dual<T>, Dual<T>) {
    let ref y = x.sin() * x.sin();
    (y + x * Dual::constant(10.0), x + y * Dual::constant(20.0))
}

fn df_2out(x: f64) -> (f64, f64) {
    let (dx1, dx2) = f_2out(&Dual::new(x, 1.0));
    (dx1.tangent, dx2.tangent)
}

df_2out returns the derivative of each of its outputs with respect to x, and needed just one call to f_2out to achieve that. There are two derivative values now, but forward mode AD can calculate both of them with a constant factor overhead.

This is great news, however a function with multiple inputs tells a different story.

fn f<T: VectorSpace>(x: &Dual<T>, y: &Dual<T>) -> Dual<T> {
    x * y
}

fn df_v1(x: f64, y: f64) -> (f64, f64) {
    (
        f(&Dual::new(x, 1.0), &Dual::constant(y)).tangent,
        f(&Dual::constant(x), &Dual::new(y, 1.0)).tangent,
    )
}

Because we need to calculate the partial derivative with respect to each of the two inputs separately, we need two calls to f. This is bad for efficiency: for gradient optimization problems in deep learning with billions of inputs, we'd need billions of calls to the function! This approach will certainly not do.

Let's visualize what's going on. Here's a graph representing the calculation of df(3, 2), i.e. calculation of Dual.tangent at x=3 and y=2:

On the left are the tangents of x and y, shown as Δ𝑥 and Δ𝑦. The factors on the edges from Δ𝑥 and Δ𝑦 are because of the product rule: (𝑥𝑦)′ = 𝑥𝑦′ + 𝑦𝑥′. This shows visually why we had to evaluate twice, once with Δ𝑥=1 and once with Δ𝑦=1: a single evaluation of f with both Δ𝑥=1 and Δ𝑦=1 would give the total derivative. The total derivative is the sensitivity of the function when all its inputs change at once; but we're not interested in that. We need the partial derivatives, so we get the sensitivity to each input individually.
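Here's a small sketch to check this numerically, using the simple non-generic Dual with an f64 tangent. The tangent_with_seeds helper is a name I made up; it runs f once for a given choice of Δ𝑥 and Δ𝑦.

```rust
use std::ops::Mul;

#[derive(Clone, Copy)]
struct Dual {
    primal: f64,
    tangent: f64,
}

impl Mul for Dual {
    type Output = Dual;
    // Product rule: (xy)' = x y' + y x'.
    fn mul(self, rhs: Self) -> Self::Output {
        Dual {
            primal: self.primal * rhs.primal,
            tangent: rhs.tangent * self.primal + self.tangent * rhs.primal,
        }
    }
}

fn f(x: Dual, y: Dual) -> Dual {
    x * y
}

/// Tangent of f at (x, y) for the given input tangents dx and dy.
fn tangent_with_seeds(x: f64, y: f64, dx: f64, dy: f64) -> f64 {
    f(
        Dual { primal: x, tangent: dx },
        Dual { primal: y, tangent: dy },
    )
    .tangent
}
```

At x=3 and y=2, seeding with (Δ𝑥, Δ𝑦) = (1, 0) gives 2, seeding with (0, 1) gives 3, and seeding with (1, 1) gives 5: the sum of the two partial derivatives, from which the individual values can't be recovered.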

First fix: many calls to one call

If you've ever dealt with a function that has significant overhead on each call - say, it accesses a database - then you'll have learnt that it's not a good idea to call that function multiple times in a loop. It's much better to pass all the inputs at once, and re-implement the function to deal with multiple values more efficiently by accessing the database just once.

We can apply a similar idea here. We're calling f multiple times, each time with different tangent values. Instead we can make the tangent T an array, to pass them all at once into a single call.

This is where the Dual<T> abstraction pays off. No need to change the Dual type - just implement VectorSpace for arrays:

impl<T: Copy + VectorSpace, const N: usize> VectorSpace for [T; N] {
    fn zero() -> Self {
        [T::zero(); N]
    }

    fn add(&self, rhs: &Self) -> Self {
        let mut result = [T::zero(); N];
        for (i, (l, r)) in self.iter().zip(rhs).enumerate() {
            result[i] = l.add(r);
        }
        result
    }

    fn scale(&self, factor: f64) -> Self {
        let mut result = [T::zero(); N];
        for (i, v) in self.iter().enumerate() {
            result[i] = v.scale(factor);
        }
        result
    }
}

Arrays are added element by element (point-wise), and scaling applies to each element.

Keeping f the same, but changing df:

fn df_v2(x: f64, y: f64) -> [f64; 2] {
    f(&Dual::new(x, [1.0, 0.0]), &Dual::new(y, [0.0, 1.0])).tangent
}

This works out better: the function is only called once. The following animation visualizes the computation now:

Evaluation still proceeds forwards through the graph, as before. The array representation allows us to evaluate the function only once, and still get all the partial derivatives.

Alas: memory usage for the arrays is quadratic in the number of inputs! That's workable for a low number of inputs, but is intractable for billions.
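To run the array version end to end, here's a condensed, self-contained sketch. It implements VectorSpace for arrays of f64 only and uses an inherent mul method instead of the Mul trait to stay short; both are simplifications for this example.

```rust
trait VectorSpace {
    fn zero() -> Self;
    fn add(&self, rhs: &Self) -> Self;
    fn scale(&self, factor: f64) -> Self;
}

// Simplification: arrays of f64 only, instead of arrays of any VectorSpace.
impl<const N: usize> VectorSpace for [f64; N] {
    fn zero() -> Self {
        [0.0; N]
    }

    fn add(&self, rhs: &Self) -> Self {
        std::array::from_fn(|i| self[i] + rhs[i])
    }

    fn scale(&self, factor: f64) -> Self {
        std::array::from_fn(|i| self[i] * factor)
    }
}

struct Dual<T> {
    primal: f64,
    tangent: T,
}

impl<T: VectorSpace> Dual<T> {
    fn new(primal: f64, tangent: T) -> Self {
        Dual { primal, tangent }
    }

    // Product rule, expressed with VectorSpace operations only.
    fn mul(&self, rhs: &Dual<T>) -> Dual<T> {
        Dual::new(
            self.primal * rhs.primal,
            rhs.tangent
                .scale(self.primal)
                .add(&self.tangent.scale(rhs.primal)),
        )
    }
}

fn f<T: VectorSpace>(x: &Dual<T>, y: &Dual<T>) -> Dual<T> {
    x.mul(y)
}

/// Both partial derivatives from a single call to f.
fn df_v2(x: f64, y: f64) -> [f64; 2] {
    f(&Dual::new(x, [1.0, 0.0]), &Dual::new(y, [0.0, 1.0])).tangent
}
```

df_v2(3.0, 2.0) returns [2.0, 3.0]. Note that each of the two inputs carries an array of length two: with N inputs that's N arrays of N floats, which is where the quadratic memory use comes from.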

Second fix: introduce an intermediate representation

Those arrays certainly look sparse. Perhaps a sparse encoding is a way forward? Unfortunately, that won't work, because as the computation proceeds they don't remain sparse for long, as the visualization shows. Instead, let's use a related idea, and replace the arrays with an intermediate representation. This splits the calculation in two passes: the first builds an intermediate representation, the second uses the intermediate representation to compute the derivatives.

For the intermediate representation, we introduce a new Delta type:

enum Delta {
    Zero,
    Var(usize),
    Scale(f64, Rc<Delta>),
    Add(Rc<Delta>, Rc<Delta>),
}

If you're not familiar with Rust: Delta is a union type with four possible cases. Rc is a reference counted smart pointer - for this post it's safe to just squint and pretend the Rcs are not there.

Delta is a reified representation of VectorSpace operations - and captures the structure of the derivative computation as a directed acyclic graph. Var represents an input variable like x or y; the argument is a unique number to identify the variable.

To use Dual<Rc<Delta>>, we need to make Rc<Delta> a VectorSpace:

impl VectorSpace for Rc<Delta> {
    fn zero() -> Self {
        Rc::new(Delta::Zero)
    }

    fn add(&self, rhs: &Self) -> Self {
        Rc::new(Delta::Add(Rc::clone(self), Rc::clone(rhs)))
    }

    fn scale(&self, factor: f64) -> Self {
        Rc::new(Delta::Scale(factor, Rc::clone(self)))
    }
}

The last piece of the puzzle is evaluating the intermediate representation to the partial derivatives.

pub fn eval_delta(scale_acc: f64, delta: &Delta, result: &mut Vec<f64>) {
    match *delta {
        Delta::Zero => (),
        Delta::Var(i) => result[i] += scale_acc,
        Delta::Scale(factor, ref d2) => eval_delta(scale_acc * factor, d2, result),
        Delta::Add(ref l, ref r) => {
            eval_delta(scale_acc, l, result);
            eval_delta(scale_acc, r, result);
        }
    }
}

eval_delta is memory-efficient: it updates a single array in place. It iterates depth-first through the graph described by Delta, and accumulates scale factors along each path from the root to the Var or Zero leaves. When it encounters a Var case, which is necessarily a leaf, it just adds the accumulated scale factor to the right place in the result array.

Putting all of that together, the third version of df becomes:

fn df_v3(x: f64, y: f64) -> Vec<f64> {
    let dual_delta = f(
        &Dual::new(x, Rc::new(Delta::Var(0))),
        &Dual::new(y, Rc::new(Delta::Var(1))),
    );
    let mut result = vec![0.0, 0.0];
    eval_delta(1.0, &dual_delta.tangent, &mut result);
    result
}

This looks much better! The first pass, executing f, returns a Delta intermediate representation, and then in a second pass, eval_delta accumulates the scale factors "in reverse" with respect to the usual evaluation order of the program. This reversal is visible as the scale_acc parameter of eval_delta. The evaluation process can be visualized as follows, as before for df(3,2). Note that we now have an explicit representation of this graph as a result of the forward pass, for this example Add(Scale(2.0, Var(0)), Scale(3.0, Var(1))).

In other words, this is a "beta version" of reverse mode AD - there is still an ingredient missing, as we'll see - but it explains where the name comes from. Using a single depth-first traversal of the Delta graph, it's possible to get the partial derivatives with respect to all the inputs in one reverse sweep. And importantly, we don't need an amount of memory that's the square of the number of inputs anymore.
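To play with the second pass in isolation, here's a self-contained sketch that builds the Delta graph for x * y by hand, in the shape quoted above, and evaluates it. The gradient_of_product helper is my own scaffolding; in the real implementation the graph comes out of the forward pass.

```rust
use std::rc::Rc;

enum Delta {
    Zero,
    Var(usize),
    Scale(f64, Rc<Delta>),
    Add(Rc<Delta>, Rc<Delta>),
}

/// Depth-first traversal that accumulates scale factors from the root down
/// and adds them into `result` at each Var leaf.
fn eval_delta(scale_acc: f64, delta: &Delta, result: &mut Vec<f64>) {
    match delta {
        Delta::Zero => (),
        Delta::Var(i) => result[*i] += scale_acc,
        Delta::Scale(factor, inner) => eval_delta(scale_acc * factor, inner, result),
        Delta::Add(l, r) => {
            eval_delta(scale_acc, l, result);
            eval_delta(scale_acc, r, result);
        }
    }
}

/// Hand-built Delta for f(x, y) = x * y, evaluated to both partial derivatives.
fn gradient_of_product(x: f64, y: f64) -> Vec<f64> {
    let dx = Rc::new(Delta::Var(0));
    let dy = Rc::new(Delta::Var(1));
    // Product rule: d(xy) = y * dx + x * dy.
    let delta = Delta::Add(
        Rc::new(Delta::Scale(y, dx)),
        Rc::new(Delta::Scale(x, dy)),
    );
    let mut result = vec![0.0, 0.0];
    eval_delta(1.0, &delta, &mut result);
    result
}
```

gradient_of_product(3.0, 2.0) returns [2.0, 3.0], using one result vector with one slot per input.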

Finally efficient: dealing with sharing

Now there is one remaining problem - if the original program contains shared variables, eval_delta does too much work. Here's a problematic example:

fn f_sharing_bad<T: VectorSpace>(x: &Dual<T>, y: &Dual<T>) -> Dual<T> {
    let ref s = x * y;
    s + s
}

The s variable is used twice in the final result, and as expected the primal of s is only calculated once and then reused in the addition. But, the calculation of the tangent of s is done twice in eval_delta. Here's the Delta graph for x=3 and y=2:

We want our derived program to have asymptotically the same complexity as the main program, and not respecting sharing can easily blow up the complexity.

This smells like a dynamic programming problem: parts of the graph are shared, but they're being evaluated repeatedly. To achieve the same asymptotic performance as the main program, we have to make sure that each node in the Delta graph is evaluated only once. As usual in dynamic programming, part of the answer is memoization: as we calculate scale factors, we'll need to store for each node what the factor is up to that point. To guarantee we visit every node only once, and we have accumulated the necessary factors, we have to visit the graph in reverse topological order. Luckily, we already have such an order: it's just the reverse order of the forward pass!

To accomplish all this, we need to make two changes:

Introduce a Trace type to record the evaluation order of the operations in the forward pass; and
Use Delta::Var to represent intermediate variables explicitly. Since we can't detect which variables are shared and which ones are not, we assign the result of each operation to a temporary Var in the trace. Put another way, each node in the graph is represented by a Var.

Here are the salient parts of the implementation. First, the Trace type.

struct Trace {
    trace: RefCell<Vec<Rc<Delta>>>,
}

Ignoring the RefCell and Rc, this is just a resizable array (Vec in Rust) of Delta values. We're going to treat this Trace as a stack: in the forward sweep, each tangent operation gets pushed on it, and in the reverse sweep, we'll be popping operations to calculate derivatives. The stack ensures we are visiting each node just once and in reverse topological order.

Then the Dual type needs to be augmented with a Trace, so we can record each operation in it:

struct DualTrace<'a> {
    trace: &'a Trace,
    dual: Dual<Rc<Delta>>,
}

This means we need to redo the addition and multiplication overloads we did earlier for DualTrace. For each operation we can delegate to the overload for Dual, but need to do two extra things:

Push the operation on the Trace;
Return a new Var to represent this operation. This intermediate variable will then be used to store and share the result, so it is only evaluated once.

Remember way back in the introduction, we rewrote f to look like this:

fn f_ad(t_dual: (f64, f64)) -> (f64, f64) {
    let (primalt, tangentt) = t_dual;
    let (primal0, tangent0) = (primalt.powi(2), 2.0 * primalt * tangentt);
    let (primal1, tangent1) = (primal0 + primalt, tangent0 + tangentt);
    let (primal2, tangent2) = (primal1 + 1.0, tangent1);
    (primal2, tangent2)
}

That's very similar to how we're using Var now: each operation becomes an implicit let Var_i = Add(...) in the trace.

Here's what that looks like in code. First, pushing an operation onto the trace:

impl Trace {
    fn push(&self, op: Dual<Rc<Delta>>) -> Dual<Rc<Delta>> {
        let mut trace = self.trace.borrow_mut();
        let var = Dual {
            primal: op.primal,
            tangent: Rc::new(Delta::Var(trace.len())),
        };
        trace.push(op.tangent);
        var
    }
}

Trace.push adds a new operation (Scale, Add or Zero) to the end of the trace, and returns a new Delta::Var to represent the result of that operation. Now addition:

impl<'a> DualTrace<'a> {

    fn delta_push(&self, op: Dual<Rc<Delta>>) -> DualTrace<'a> {
        let dual = self.trace.push(op);
        DualTrace {
            trace: self.trace,
            dual,
        }
    }

    fn add_impl(&self, rhs: &DualTrace<'a>) -> DualTrace<'a> {
        let op = &self.dual + &rhs.dual;
        self.delta_push(op)
    }
}

And the same for other operations. This maintains the invariant that the result of any operation always has a Var as tangent. We'll leverage this fact when doing the reverse sweep, as it allows us to detect sharing explicitly.

Here's the latest and greatest eval. It uses eval_delta which is unchanged from before.

pub fn eval(inputs: usize, dual_trace: &DualTrace) -> Vec<f64> {
    let mut result = vec![0.0; dual_trace.trace_len()];
    let mut trace = dual_trace.trace_mut();

    eval_delta(1.0, &dual_trace.dual.tangent, &mut result);

    while trace.len() > inputs {
        let deltavar = trace.pop().unwrap();
        let idx = trace.len();
        if result[idx] != 0.0 {
            eval_delta(result.pop().unwrap(), &deltavar, &mut result);
        }
    }
    result
}

Let's see how to use all this, then dig into some details.

fn f_sharing_fixed<'a>(x: &DualTrace<'a>, y: &DualTrace<'a>) -> DualTrace<'a> {
    let ref s = x * y;
    s + s
}

fn df_sharing_fixed(x: f64, y: f64) -> Vec<f64> {
    let trace = Trace::new();
    let x = &trace.var(x);
    let y = &trace.var(y);
    let dual_trace = f_sharing_fixed(x, y);
    eval(2, &dual_trace)
}

For this example with x=3 and y=2, the trace after the forward pass is an array of 5 elements:

[ Var 0, // x
  Var 1, // y
  Add(Scale(2.0, Var 0), Scale(3.0, Var 1)), // let s = 2x + 3y
  Add(Var 2, Var 2), // let r = s + s
  Var 3 ] // r

(I've omitted some parentheses to make it easier to read)

The trace makes sharing explicit. Each Var's id is a valid index in the array, which refers to where that "variable" is defined. In eval, a result array is allocated for each element of the trace, and this array is where the derivative for the corresponding variable is accumulated. The trace is also used to make sure each node is visited only once. Here eval's reverse sweep is animated, always for x=3 and y=2. The result array is shown at the top.

And there you have it! From forward to efficient, reverse mode AD in four steps.

If you read this far, you may as well consider subscribing to my newsletter, and/or follow me on Twitter.

Acknowledgements and further references

The main inspiration and origin of this post is the paper "Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation" by Faustyna Krawiec, Neel Krishnaswami, Simon Peyton Jones, Tom Ellis, Andrew Fitzgibbon, and Richard Eisenberg, along with Simon Peyton Jones' presentation at Haskell Exchange 2021. You can find paper and presentation here.

Conal Elliott's papers were also very helpful and influential: Beautiful differentiation and The simple essence of automatic differentiation.

It helps having things explained from different perspectives. Here are a few other good links on the topic:

Calculus on Computational Graphs: Backpropagation by Christopher Olah has more of a deep learning focus.
Differentiable Programming from Scratch by Max Slater has a more mathematical focus, has inline JavaScript graphs you can play with, and shows how AD is used to de-blur images.

Any errors and omissions are entirely my responsibility.

Property-based Testing #6: Random All the Way Down

Kurt — Sun, 31 Jul 2022 17:01:14 +0000

This is the sixth post in a series about property-based testing. This post describes "random shrinking", the third and last implementation of shrinking I know of. It keeps all the advantages of internal shrinking, which we discussed in part 5, and is much simpler.

The complete code for this post can be found on GitHub - in particular example.py and random_based.py.

Photo by Free Walking Tour Salzburg on Unsplash

The last posts discussed two different approaches to shrinking failing test cases, aiming to make the failing example easier to understand and debug. The first approach, direct shrinking, makes values smaller by directly changing them. The second approach, internal shrinking, instead tries to change the sequence of choices that are made during random generation. These choices are like the DNA of a generated value, and choices can be edited to make the resulting value smaller.

When we discussed the advantages of internal shrinking vs direct shrinking, we also noted that editing the choices can be tricky. Significant engineering effort is needed to make it work well. What if we could have our cake and eat it too - an approach to shrinking that keeps all the advantages of internal shrinking, but is much simpler to implement?

The unreasonable effectiveness of randomness

The idea is simple: instead of coming up with a deliberate algorithm to make values smaller, we just - in technical terms - throw shit to the wall and see what sticks. That is, we're already counting on random generation to find bugs, let's also randomly try to find smaller values that still make our test fail.

Unlike the shrinking strategies we've seen so far, which produce a limited number of smaller values to try, this new approach can keep trying forever. If the random generator is "random enough", a good shrink should show up at some point.

In practice, that can take too long, especially if the test itself is slow. However, we can avoid running a test, or even generating an actual value, by calculating the size of the value that we're about to generate beforehand. We'll keep the size of the smallest value that fails the test so far, and then randomly generate another and estimate its size while we generate. If that size is greater than that of the current smallest value, we're not interested in running the test with it, or creating the rest of the value. Instead we'll make the generator short-circuit - basically abort and move on to the next value. This avoids a lot of useless work.

I learned of this beautiful idea via CsCheck by Anthony Lloyd, and to the best of my knowledge he is also the inventor of it. As we'll see it's simple to implement - the shortest reference implementation of all - and it's effective. As with previous implementations, it's worth noting that CsCheck's approach is slightly more complex than what I'll describe here - I'm just capturing the main idea.

Matters of size

The implementation contains all the same concepts as the previous posts.

We make two modifications to Random. In addition to generating a random value, we also return the size of the generated value. For our purposes, the size is simply represented by a non-negative integer. The smaller the integer, the smaller the size. Additionally, to enable short-circuiting, the generator takes in the current minimal size, min_size.

Size = int

class SizeExceeded(Exception):
    pass

class Random(Generic[T]):
    def __init__ (self, 
        generator: Callable[[Optional[Size]], Tuple[T, Size]]):
        self._generator = generator

    def generate(self, min_size: Optional[Size] = None) -> Tuple[T, Size]:
        return self._generator(min_size)

If min_size is None, we're in the random generation phase of the test run - meaning we haven't found a failing test yet and so the generator should never short-circuit. If min_size is a Size value, we are shrinking, and min_size is the size of the smallest value found so far for which the test fails. If a generator exceeds that size, there's no point in going on, and the generator should throw SizeExceeded to short-circuit.

Don't worry if that doesn't make sense yet - examples below.

Beginning with constant, as usual:

def constant(value: T) -> Random[T]:
    # The generator always receives min_size; constant ignores it.
    return Random(lambda _: (value, 0))

A constant value has size 0 - no amount of random generation is going to make the constant value change, so if we have found a value with size 0 we can just stop shrinking.

Our old friend int_between is a bit more tricky.

def int_between(low: int, high: int) -> Random[int]:
    def zig_zag(i: int):
        if i < 0:
            return -2*i - 1
        else:
            return 2*i
    def generator(min_size: Optional[Size]):
        value = random.randint(low, high)
        size = zig_zag(value)
        # short circuit - implementation below
        dec_size(min_size, size)
        return value, size
    return Random(generator)

int_between uses so-called ZigZag encoding to calculate the size of an int. ZigZag encoding is also used in Protocol Buffers. The gist of it is that each signed integer gets a non-negative size, by zigzagging between negative and positive integers: 0, -1, 1, -2, 2 and so on map to sizes 0, 1, 2, 3, 4. Unlike the implementation here, with a limited type like int32 or int64, you only need a few bit shifts and an exclusive or to compute the size, which makes it pretty fast in practice. The idea is that integers with greater absolute value have greater size.
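
To make the mapping concrete, here is a small sketch that lists the sizes of a few integers and compares them with the shift-and-xor form for fixed-width integers. The name zig_zag_64 is my own for this illustration, and it assumes the input fits in a signed 64-bit integer.

```python
def zig_zag(i: int) -> int:
    # Same mapping as inside int_between.
    if i < 0:
        return -2 * i - 1
    else:
        return 2 * i

def zig_zag_64(i: int) -> int:
    # Hypothetical fixed-width variant; assumes -2**63 <= i < 2**63.
    # i >> 63 is 0 for non-negative i and -1 (all ones) for negative i,
    # so the xor either leaves 2*i alone or flips all of its bits.
    return (i << 1) ^ (i >> 63)

sizes = [(i, zig_zag(i)) for i in [0, -1, 1, -2, 2]]
print(sizes)  # [(0, 0), (-1, 1), (1, 2), (-2, 3), (2, 4)]
```

Both functions agree on the whole int64 range; the branching version is just easier to read.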

I've also introduced a dec_size method (for "decrease size"), which makes sure we short-circuit generation as soon as possible.

def dec_size(min_size: Optional[Size],
             decrease: Size) -> Optional[Size]:

    if min_size is None:
        return None
    smaller = min_size - decrease
    if smaller < 0:
        raise SizeExceeded()
    return smaller

If min_size is not None, meaning we're currently shrinking, dec_size subtracts the size of the currently generated value from the current minimum size, and checks if the result is still non-negative. If it's not, it throws SizeExceeded to indicate that the value we're about to generate is already greater than the best, minimum size - thereby short-circuiting. The less work we do, the more values we can generate in shorter time, and the more opportunities we have for finding smaller values. We're not yet using the result of dec_size but will do soon.
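
One way to read dec_size is as a shrinking budget. The sketch below repeats the definitions from above so it runs on its own; the budget of 10 and the part sizes 4 and 7 are made-up numbers for illustration.

```python
from typing import Optional

Size = int

class SizeExceeded(Exception):
    pass

def dec_size(min_size: Optional[Size], decrease: Size) -> Optional[Size]:
    if min_size is None:
        return None
    smaller = min_size - decrease
    if smaller < 0:
        raise SizeExceeded()
    return smaller

# Not shrinking: there is no budget, so nothing ever short-circuits.
no_budget = dec_size(None, 1_000)

# Shrinking with a budget of 10: a part of size 4 leaves 6 ...
remaining = dec_size(10, 4)

# ... and a further part of size 7 overshoots, which aborts generation.
try:
    dec_size(remaining, 7)
    short_circuited = False
except SizeExceeded:
    short_circuited = True

print(no_budget, remaining, short_circuited)  # None 6 True
```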

map is uninteresting - it just maintains the size:

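def map(func: Callable[[T], U],
        gen: Random[T]) -> Random[U]:

    # Like every generator, this one receives min_size and passes it on.
    def generator(min_size: Optional[Size]):
        result, size = gen.generate(min_size)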
        return func(result), size

    return Random(generator)

The size of mapN is the sum of the sizes of all its inputs:

def mapN(func: Callable[..., T],
         gens: Iterable[Random[Any]]) -> Random[T]:

    def generator(min_size: Optional[Size]):
        results: list[Any] = []
        size_acc = 0
        for gen in gens:
            result, size = gen.generate(min_size)
            min_size = dec_size(min_size, size)
            results.append(result)
            size_acc += size
        return func(*results), size_acc

    return Random(generator)

Both map and mapN assume that the size of the output value decreases as the size of the input values decreases. Reasonable, though it's easy to imagine a counterexample. Here's where the result of dec_size is used: as values for the inputs are generated, mapN subtracts their size from min_size. This ensures that, if say we have 10 inputs, and we already exceed the min_size by the third generator, we short-circuit as soon as possible.
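
The ten-inputs scenario can be checked with a small experiment. This is a sketch, not part of the original code: fixed is a hypothetical helper that always reports the same size and logs when it runs, and the budget of 10 with parts of size 4 is chosen so that the third generator overshoots.

```python
from typing import Any, Callable, Generic, Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")
Size = int

class SizeExceeded(Exception):
    pass

class Random(Generic[T]):
    def __init__(self, generator: Callable[[Optional[Size]], Tuple[T, Size]]):
        self._generator = generator

    def generate(self, min_size: Optional[Size] = None) -> Tuple[T, Size]:
        return self._generator(min_size)

def dec_size(min_size: Optional[Size], decrease: Size) -> Optional[Size]:
    if min_size is None:
        return None
    smaller = min_size - decrease
    if smaller < 0:
        raise SizeExceeded()
    return smaller

def mapN(func: Callable[..., T], gens: Iterable[Random[Any]]) -> Random[T]:
    def generator(min_size: Optional[Size]):
        results: List[Any] = []
        size_acc = 0
        for gen in gens:
            result, size = gen.generate(min_size)
            min_size = dec_size(min_size, size)
            results.append(result)
            size_acc += size
        return func(*results), size_acc
    return Random(generator)

calls: List[str] = []

def fixed(name: str, size: Size) -> Random[str]:
    # Hypothetical helper: yields `name` with a fixed size and logs the call.
    def generator(min_size: Optional[Size]):
        calls.append(name)
        dec_size(min_size, size)
        return name, size
    return Random(generator)

gens = [fixed(f"g{i}", 4) for i in range(10)]
joined = mapN(lambda *parts: "-".join(parts), gens)

# Shrinking with a budget of 10: 4 + 4 fits, the third 4 does not.
try:
    joined.generate(10)
except SizeExceeded:
    pass
calls_while_shrinking = list(calls)

# Plain generation: no budget, so all ten generators run.
calls.clear()
value, size = joined.generate()
print(calls_while_shrinking, len(calls), size)  # ['g0', 'g1', 'g2'] 10 40
```

Generators g3 through g9 never run in the shrinking case, which is exactly the work the short-circuit is meant to save.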

Finally, bind:

def bind(func: Callable[[T], Random[U]],
         gen: Random[T]) -> Random[U]:

    def generator(min_size: Optional[Size]):
        result,size_outer = gen.generate(min_size)
        min_size = dec_size(min_size, size_outer)
        result,size_inner = func(result).generate(min_size)
        size = size_inner+size_outer
        return result, size

    return Random(generator)

The size of bind is interpreted as the sum of the size of both inner and outer generators.
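
As a quick check of that rule, the sketch below binds a fixed "length" generator to a generator that builds a string of that length. The fixed helper and the sizes are assumptions made for this example; bind itself follows the definition above.

```python
from typing import Callable, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")
Size = int

class SizeExceeded(Exception):
    pass

class Random(Generic[T]):
    def __init__(self, generator: Callable[[Optional[Size]], Tuple[T, Size]]):
        self._generator = generator

    def generate(self, min_size: Optional[Size] = None) -> Tuple[T, Size]:
        return self._generator(min_size)

def dec_size(min_size: Optional[Size], decrease: Size) -> Optional[Size]:
    if min_size is None:
        return None
    smaller = min_size - decrease
    if smaller < 0:
        raise SizeExceeded()
    return smaller

def bind(func: Callable[[T], Random[U]], gen: Random[T]) -> Random[U]:
    def generator(min_size: Optional[Size]):
        result, size_outer = gen.generate(min_size)
        min_size = dec_size(min_size, size_outer)
        result, size_inner = func(result).generate(min_size)
        return result, size_inner + size_outer
    return Random(generator)

def fixed(value: T, size: Size) -> Random[T]:
    # Hypothetical helper: always yields `value` and reports `size`.
    def generator(min_size: Optional[Size]):
        dec_size(min_size, size)
        return value, size
    return Random(generator)

# Outer: a length of 3 (size 3). Inner: a string of that length (size 3).
length_then_text = bind(lambda n: fixed("x" * n, n), fixed(3, 3))

value, size = length_then_text.generate()   # no budget: ('xxx', 6)
fits = length_then_text.generate(6)         # total size 6 fits a budget of 6

try:
    # The outer generator uses 3 of the 5, leaving 2 for an inner part of size 3.
    length_then_text.generate(5)
    aborted = False
except SizeExceeded:
    aborted = True

print(value, size, aborted)  # xxx 6 True
```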

And that is pretty much it! We can literally reuse our implementation of Property and for_all, which I've repeated here as a reminder.

@dataclass(frozen=True)
class TestResult:
    is_success: bool
    arguments: Tuple[Any,...]

Property = Gen[TestResult]

def for_all(gen: Gen[T],
            property: Callable[[T], Union[Property,bool]]) -> Property:

    def property_wrapper(value: T) -> Property:
        outcome = property(value)
        if isinstance(outcome, bool):
            return constant(
                TestResult(
                    is_success=outcome, 
                    arguments=(value,)
                )
            )
        else:
            return map(
                lambda inner_out: replace(
                    inner_out, 
                    arguments=(value,) + inner_out.arguments
                ),
                outcome
            )
    return bind(property_wrapper, gen)

def test(property: Property):
    def find_smaller(min_result: TestResult, min_size: Size):
        ... # the only new bit

    for test_number in range(100):
        result, size = property.generate()
        if not result.is_success:
            print(f"Fail: at test {test_number} with arguments {result.arguments}.")
            find_smaller(result, size)
            return
    print("Success: 100 tests passed.")

Now, find_smaller is new but is hardly difficult. The idea is to keep generating random values until we find one that is both smaller than the current smallest known size, and that still fails the test:

def find_smaller(min_result: TestResult, min_size: Size):
    skipped, not_shrunk, shrunk = 0, 0, 0
    while skipped + not_shrunk + shrunk <= 100_000 and min_size > 0:
        try:
            result, size = property.generate(min_size)
            if size >= min_size:
                skipped += 1
            elif not result.is_success:
                shrunk += 1
                min_result, min_size = result, size
            else:
                not_shrunk += 1
        except SizeExceeded:
            skipped += 1

    print(f"Shrinking: gave up at {min_result.arguments}")
    print(f"{skipped=} {not_shrunk=} {shrunk=} {min_size=}")

We stop shrinking after trying 100,000 values, or if we somehow found the smallest possible size of 0. For each try, there are three possible outcomes:

"skipped". This means that during or after generation of the value, we detected that its size is not smaller than our current minimum size.
"shrunk". Successful shrink - we've found a smaller value that still fails the test.
"not_shrunk". Unsuccessful shrink - we've found a smaller value, but it passes the test.
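
To watch the three outcomes in isolation, here is a stripped-down, self-contained sketch of the same loop. It is not the code from the repository: the toy property, the seed, the number of tries and the starting value of 1000 are all assumptions for this example, and generate stands in for int_between(0, 1000).

```python
import random
from typing import Optional, Tuple

Size = int

class SizeExceeded(Exception):
    pass

rng = random.Random(0)  # seeded so the run is reproducible

def generate(min_size: Optional[Size] = None) -> Tuple[int, Size]:
    # Stand-in for int_between(0, 1000): for non-negative ints the
    # ZigZag size is simply 2 * value.
    value = rng.randint(0, 1000)
    size = 2 * value
    if min_size is not None and size > min_size:
        raise SizeExceeded()
    return value, size

def passes(value: int) -> bool:
    # Toy property under test: it fails for every value >= 10.
    return value < 10

def find_smaller(min_value: int, min_size: Size, tries: int = 10_000):
    skipped, not_shrunk, shrunk = 0, 0, 0
    while skipped + not_shrunk + shrunk < tries and min_size > 0:
        try:
            value, size = generate(min_size)
            if size >= min_size:
                skipped += 1      # no smaller than the current best
            elif not passes(value):
                shrunk += 1       # smaller and still failing: new best
                min_value, min_size = value, size
            else:
                not_shrunk += 1   # smaller, but the test passes
        except SizeExceeded:
            skipped += 1          # the generator aborted early
    return min_value, min_size, (skipped, not_shrunk, shrunk)

# Pretend the first failing value found was 1000 (size 2000).
min_value, min_size, counts = find_smaller(1000, 2000)
print(min_value, min_size, counts)
```

The smallest failing value for this property is 10. Whether a given run reaches it depends on the seed and the number of tries, but every accepted shrink is guaranteed to be smaller than the last and to still fail the test.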

Before we check how that works out in practice, let's take a break and have a healthy snack.

Photo by Robson Melo on Unsplash

Putting it to the test

Now the question is - does this even work? It seems naively simple. Only one way to find out - using our canonical prop_wrong_sort_by_age example. Note once again we didn't have to change generators or test code, and so I won't repeat that code here.

> test(prop_wrong_sort_by_age)
Fail: at test 0 with arguments ([Person(name='pjahgc', age=27), Person(name='gndrlt', age=70), Person(name='qukflk', age=79), Person(name='dknczu', age=2), Person(name='jqtqgr', age=8), Person(name='xlcxhk', age=22), Person(name='wotanl', age=5), Person(name='nxkupy', age=99), Person(name='pxngky', age=31)],).

Shrinking: gave up at arguments ([Person(name='etkdgb', age=8), Person(name='haagjh', age=5)],)

skipped=81616 not_shrunk=18373 shrunk=12 min_size=2502

> test(prop_wrong_sort_by_age)
Fail: at test 0 with arguments ([Person(name='cigepg', age=69), Person(name='lqkqmp', age=100), Person(name='nlgbbl', age=33), Person(name='xrnzfq', age=76), Person(name='ujjnfz', age=34), Person(name='xlcvxf', age=19)],).

Shrinking: gave up at arguments ([Person(name='adcyxh', age=1), Person(name='hrclbb', age=0)],)

skipped=81931 not_shrunk=18055 shrunk=15 min_size=2530

Each of these examples took a couple of seconds to generate. The second example is one of the best results I've seen, the first is more typical. A few notes:

It's alive! I've only tried a dozen or so times, but it always manages to shrink to a list of two Persons, with relatively low ages.
This approach has more problems with shrinking lists of letters - it's not even clear that there is a pseudo-random seed that generates the letters "aaaaaa" and "aaaaab" in succession, which is what would have to happen for those "minimal" names to show up.
The stats in the last line of each example are typical. skipped happens about 80% of the time, not_shrunk about 20% of the time. The number of shrunk values is almost insignificant! These numbers are similar if I try 1,000,000 times instead of 100,000 times. In fact, the number of successful shrinks remains around 10-15 even with 10x more tries. This indicates that for this case at least, it's unlikely to help much if we keep shrinking for longer.

Picking a winner

This third and last approach to shrinking sounds almost too simple to work. But it does, and pretty well at that. We've removed all human cleverness from the shrinking process, and as a result we gain a straightforward implementation, while keeping an integrated generation and shrinking API. Shrinking works just as well for mutable as for immutable values, and we can reproduce any run with just a single seed value for the pseudo-random number generator (typically a couple of int64 values).

Now, I've emphasized that this approach is by far the simplest to implement, and simplicity is its own reward. But does it also buy us anything? Yes it does - besides having all the advantages of internal shrinking, it's also trivial to parallelize. All you need is a pseudo-random number generator that can generate several separate streams of random numbers. Secondly, if you have the code and the random seed, it's easy to pause and resume the shrinking process at a later time.
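
Pausing and resuming can be sketched with Python's standard random module, whose generators expose their internal state through getstate and setstate. The seed and the ranges below are arbitrary; the point is only that a run continued from a saved state produces the same values as an uninterrupted one.

```python
import random

rng = random.Random(2022)  # arbitrary seed chosen for this sketch

# Do some work, e.g. the first few shrink attempts.
first_batch = [rng.randint(0, 100) for _ in range(5)]

# "Pause": snapshot the generator state. The state is a plain tuple,
# so it can be pickled and stored alongside the current best value.
saved_state = rng.getstate()

# What an uninterrupted run would generate next.
expected_next = [rng.randint(0, 100) for _ in range(5)]

# "Resume" later, possibly in another process, from the snapshot.
resumed = random.Random()
resumed.setstate(saved_state)
actual_next = [resumed.randint(0, 100) for _ in range(5)]

print(actual_next == expected_next)  # True
```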

After all that, you may be left with a question: what's the best shrinking strategy? As usual, I would say "it depends", and rephrase the question: if you're writing a property-based testing library from scratch, which approach should you choose?

The answer to that question is random shrinking, in my view. Faced with such decisions, it's tempting to find the "best" solution, by some measure. Instead, I think it's better to find the solution that gets people something useful, in the least amount of time, and with the highest optionality - meaning, while keeping as many options open as possible.

In this case, random shrinking is certainly going to produce something useful (i.e. it will make counterexamples smaller). It is by far the simplest and thus fastest solution to implement. And finally, we can move to another approach without bothering our users much, which keeps our options open. Also, it's feasible to tack on any of the other shrinking methods after random shrinking: for example, we can try to find the smallest example using random shrinking, and then try to make that value even smaller by either of the two other methods.

Finally

Phew! When I started this series I didn't think I'd get to six posts on property-based testing, and the worst is that I'm not even done - for example, there are approaches that allow exhaustive generation of values as input to a property-based test. That said, I'm going to take a break (of currently unknown duration) from writing about property-based testing, because I have a bunch of ideas for other topics I'm excited to write about.

Until next time, and as always, happy to hear from you in the comments or on Twitter @kurt2001.
