<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: fab2s</title>
    <description>The latest articles on Forem by fab2s (@fab2s).</description>
    <link>https://forem.com/fab2s</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2587317%2F5c0af3c0-8a66-4069-b585-0016e46fb193.png</url>
      <title>Forem: fab2s</title>
      <link>https://forem.com/fab2s</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/fab2s"/>
    <language>en</language>
    <item>
      <title>I wanted to describe a network, not assemble it: FlowBuilder in flodl</title>
      <dc:creator>fab2s</dc:creator>
      <pubDate>Mon, 04 May 2026 19:56:48 +0000</pubDate>
      <link>https://forem.com/fab2s/i-wanted-to-describe-a-network-not-assemble-it-flowbuilder-in-flodl-4a74</link>
      <guid>https://forem.com/fab2s/i-wanted-to-describe-a-network-not-assemble-it-flowbuilder-in-flodl-4a74</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/fab2s/why-i-built-a-rust-deep-learning-framework-and-what-i-got-wrong-twice-first-3747"&gt;Last post&lt;/a&gt; was why. This one is what it looks like.&lt;/p&gt;

&lt;p&gt;The thing I said at the end of the last post was: &lt;em&gt;with flodl I don't rewrite when I pivot. I add or remove a graph member.&lt;/em&gt; This post is about the primitive that makes that sentence true. Meet FlowBuilder: a declarative graph DSL for neural networks, and the API I'd find hardest to give up.&lt;/p&gt;

&lt;h2&gt;The gap&lt;/h2&gt;

&lt;p&gt;By my third Python pivot, the wiring code outweighed the model. Freezing submodules, loading partial checkpoints, rerouting a tensor through a newly-inserted path, unfreezing for a finetune: each of these was three to ten lines of procedural glue that had nothing to do with the architecture. The model was in there somewhere, but finding it meant reading past everything else first.&lt;/p&gt;

&lt;p&gt;What I wanted was simple. I wanted to &lt;em&gt;describe&lt;/em&gt; the network. What's its structure? What's tagged? What's frozen? What loads from where? And then I wanted the framework to handle the wiring.&lt;/p&gt;

&lt;p&gt;Procedurally assembling a network from module instances and class hierarchies is fine when the shape stays stable. Mine wasn't. A shape that pivots every two days and nests frozen subgraphs inside other frozen subgraphs doesn't want to be a script. It wants to be a graph.&lt;/p&gt;

&lt;h2&gt;What FlowBuilder looks like&lt;/h2&gt;

&lt;p&gt;Here's a small model with a tagged hidden activation and a residual connection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;FlowBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;784&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.through&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GELU&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hidden"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.through&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;LayerNorm&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.also&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"residual"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.through&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Top to bottom, the architecture is visible in the code. No construction state to hold in your head; the structure &lt;em&gt;is&lt;/em&gt; the text.&lt;/p&gt;

&lt;p&gt;The method names carry the intent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;from(...)&lt;/code&gt; starts the flow with an entry module.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;through(...)&lt;/code&gt; chains a module in series. Stream in, stream out.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tag("name")&lt;/code&gt; marks the current stream position for later reference: observation, freezing, checkpoint loading.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;also(Linear::new(...))&lt;/code&gt; adds a residual: &lt;code&gt;output = stream + module(stream)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;build()&lt;/code&gt; finalizes and validates. Unmerged streams and cycles surface as errors at build time, not at forward time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's more in the vocabulary. &lt;code&gt;fork&lt;/code&gt; for side-branches that don't disturb the main stream. &lt;code&gt;split&lt;/code&gt; with &lt;code&gt;merge(MergeOp::Add)&lt;/code&gt; or &lt;code&gt;merge(MergeOp::Mean)&lt;/code&gt; for parallel branches that recombine. &lt;code&gt;switch&lt;/code&gt; and &lt;code&gt;gate&lt;/code&gt; for routing. &lt;code&gt;loop_body&lt;/code&gt; for iteration. &lt;code&gt;map&lt;/code&gt; for applying a body across slices or tagged collections. The thing I care about is that the builder stays flat. A complex graph is more lines, not more indentation.&lt;/p&gt;
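
&lt;p&gt;A minimal sketch of the split/merge form, assuming a closure-based &lt;code&gt;branch&lt;/code&gt; method (&lt;code&gt;split&lt;/code&gt; and &lt;code&gt;merge(MergeOp::Mean)&lt;/code&gt; are named above; the rest of the shape is my guess, not confirmed API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Hypothetical sketch: two parallel branches recombined by element-wise mean.
// `split` and `merge(MergeOp::Mean)` come from the post; the `branch` closure
// form is an assumption for illustration.
let branch_a = Linear::new(128, 128)?;
let model = FlowBuilder::from(Linear::new(784, 128)?)
    .split()
    .branch(|b| b.through(branch_a))   // branch 1: learned projection
    .branch(|b| b.through(GELU))       // branch 2: plain nonlinearity
    .merge(MergeOp::Mean)              // recombine: mean of branch outputs
    .through(Linear::new(128, 10)?)
    .build()?;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;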

&lt;p&gt;When I pivot a shape, I add or remove lines. The rest of the build doesn't move.&lt;/p&gt;

&lt;h2&gt;The graph renders itself&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;Graph&lt;/code&gt; carries enough structural information to draw itself. One method call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="nf"&gt;.svg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"model.svg"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That writes an SVG with modules as nodes and stream connections as edges. Tags appear as annotations on the nodes they mark, and parallel-execution levels are grouped into clusters. For training loops, &lt;code&gt;svg_with_profile(...)&lt;/code&gt; colors nodes by measured forward-pass time, so the hot path is visible instead of guessed at.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczqdezan6ta9sgi1o9wn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczqdezan6ta9sgi1o9wn.png" alt="FlowBuilder SVG output: tagged residual block" width="307" height="833"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Layout runs through graphviz. It works well up to a few dozen nodes. Past that the visualization gets dense and I start squinting. That's one of the edges I'm still working on. The DOT output is available raw for people who want to pipe it through their own tooling.&lt;/p&gt;
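
&lt;p&gt;Both escape hatches, sketched with assumed signatures (the post names &lt;code&gt;svg_with_profile(...)&lt;/code&gt; but elides its arguments; the &lt;code&gt;dot()&lt;/code&gt; accessor is my guess at how the raw DOT is exposed):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Profiled render: nodes colored by measured forward-pass time.
// The argument shape is assumed; only the method name comes from the post.
graph.svg_with_profile(Some("model_profiled.svg"))?;

// Raw DOT for external tooling. The accessor name is hypothetical.
let dot: String = graph.dot()?;
std::fs::write("model.dot", dot)?;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;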

&lt;h2&gt;Graph trees&lt;/h2&gt;

&lt;p&gt;Here's the part I think matters most, because no other DL framework I know has it in this shape.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;Graph&lt;/code&gt; implements &lt;code&gt;Module&lt;/code&gt;. That means a Graph can be fed into another FlowBuilder anywhere a module is expected:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;FlowBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.through&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GELU&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.tag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hidden"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"encoder"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;FlowBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.through&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.label("encoder")&lt;/code&gt; registers the inner graph as a named child of the outer. Once composed, the inner graph's structure is addressable from the outer scope via label paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="nf"&gt;.freeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"encoder"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                              &lt;span class="c1"&gt;// freeze every parameter in the subgraph&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="nf"&gt;.load_subgraph_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"encoder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// load weights into just that subgraph&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="nf"&gt;.tagged_at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"encoder.hidden"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                    &lt;span class="c1"&gt;// read the tagged activation across the boundary&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="nf"&gt;.subgraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"encoder"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                            &lt;span class="c1"&gt;// recover the child Graph&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nesting composes. An encoder inside a classifier inside a multi-head pipeline gives you paths like &lt;code&gt;head.classifier.encoder.hidden&lt;/code&gt;, and everything addressable by label keeps working at any depth. Freeze, thaw, load, observe, swap.&lt;/p&gt;
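
&lt;p&gt;Concretely, a doubly-nested composition might look like this (a sketch: the &lt;code&gt;classifier&lt;/code&gt; and &lt;code&gt;head&lt;/code&gt; labels are hypothetical, but every method used appears above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Hypothetical nesting: "classifier" and "head" are made-up labels;
// the dotted-path addressing is the mechanism described above.
let classifier = FlowBuilder::from(encoder)   // encoder from the earlier listing
    .through(Linear::new(64, 10)?)
    .label("classifier")
    .build()?;

let head = FlowBuilder::from(classifier)
    .label("head")
    .build()?;

let model = FlowBuilder::from(head).build()?;

// Address the innermost pieces through every level of nesting.
model.freeze("head.classifier.encoder")?;
model.tagged_at("head.classifier.encoder.hidden")?;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;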

&lt;p&gt;This is the primitive FBRL needed. A trained letter reader is a frozen Graph inside the word reader. A trained word reader is a frozen Graph inside the line reader. Each level addressable by name, each level's checkpoint loadable independently, gradients cleanly blocked at the frozen boundary.&lt;/p&gt;

&lt;p&gt;Transfer learning. Multi-phase pretraining. Anywhere you're stitching trained components into larger architectures and want the composition to stay legible when you come back to it in a year.&lt;/p&gt;

&lt;h2&gt;What it isn't yet&lt;/h2&gt;

&lt;p&gt;One real edge: when the wiring is wrong, the error messages are functional but not great. If you merge two branches with mismatched shapes, you get a shape-mismatch error. You do not get told which branch of which split produced the offender. For a short graph you eyeball it. For a deep graph you add prints. I have a list of places where the errors need to carry more structural context back out to the user. That's next-round work.&lt;/p&gt;

&lt;p&gt;I flag this one because it's the rough edge I touch most often. The shape of the API is right. The ergonomics of the error path are what need sharpening.&lt;/p&gt;

&lt;h2&gt;What hooked me (again)&lt;/h2&gt;

&lt;p&gt;I started flodl because FBRL needed a composition primitive Python didn't give me cleanly. By the time FlowBuilder was working, I'd noticed I was solving a framework problem I cared about for its own sake. Ergonomics pulled me in first.&lt;/p&gt;

&lt;p&gt;Then performance. Then distributed training. Then convergence under heterogeneous compute.&lt;/p&gt;

&lt;p&gt;This is the part of the journey I didn't expect. I'll walk through the rest of it post by post.&lt;/p&gt;

&lt;p&gt;Previous post: &lt;a href="https://dev.to/fab2s/why-i-built-a-rust-deep-learning-framework-and-what-i-got-wrong-twice-first-3747"&gt;Why I built a Rust deep learning framework (and what I got wrong twice first)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Next post: how flodl benchmarks against PyTorch on real architectures, and what the libtorch FFI bet from the last post actually buys you.&lt;/p&gt;

&lt;p&gt;flodl: &lt;a href="https://flodl.dev" rel="noopener noreferrer"&gt;flodl.dev&lt;/a&gt; · &lt;a href="https://github.com/flodl-labs/flodl" rel="noopener noreferrer"&gt;github.com/flodl-labs/flodl&lt;/a&gt; · &lt;a href="https://x.com/flodl_dev" rel="noopener noreferrer"&gt;@flodl_dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>machinelearning</category>
      <category>showdev</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Why I built a Rust deep learning framework (and what I got wrong twice first)</title>
      <dc:creator>fab2s</dc:creator>
      <pubDate>Thu, 23 Apr 2026 17:03:40 +0000</pubDate>
      <link>https://forem.com/fab2s/why-i-built-a-rust-deep-learning-framework-and-what-i-got-wrong-twice-first-3747</link>
      <guid>https://forem.com/fab2s/why-i-built-a-rust-deep-learning-framework-and-what-i-got-wrong-twice-first-3747</guid>
      <description>&lt;p&gt;The Python script that made me give up. It had more boilerplate for freezing and re-composing submodules than it did for the actual model. I'd already pivoted three times. The next pivot was going to cost another rewrite. That's the day I decided to build this in Rust.&lt;/p&gt;

&lt;p&gt;I should say upfront: before this, I had never trained a deep learning model. The path to here was unusual. A theoretical physics degree (with a PhD grant I turned down), then a long detour through documentary film and independent cinema, then self-taught software engineering and twenty years architecting scalable data systems through the startup-scaling era. Pattern recognition across domains is the thing I trust most in my own thinking. A wide-focus lens.&lt;/p&gt;

&lt;p&gt;What I'm building &lt;a href="https://flodl.dev" rel="noopener noreferrer"&gt;flodl&lt;/a&gt; for is research called FBRL, Feedback Recursive Loops. It started as a hobby to explore the field. For now the shape is what matters. Mixing modalities. Feedback loops: images read, classified, and reproduced to force honest attention. Composition that goes letter, then word, then line, then paragraph. Each level frozen and used as an oracle for the level above. The vision was always nested, always partially-frozen, always graph-shaped. That shape is what broke Python for me.&lt;/p&gt;

&lt;h2&gt;The Python dead end&lt;/h2&gt;

&lt;p&gt;I started with sound and vision mixed together. That failed. I reframed to a foveal approach: the model reads letters by attention, and at each step it also tries to reproduce what it read. The reproduction forces more abstract latent representations. The letter model is the most developed part of this work so far.&lt;/p&gt;

&lt;p&gt;Composition was always the next step. Read a letter. Then read a word that reuses the frozen letter reader. Then read a line that reuses both. Each level adds capability while the frozen levels below stay reliable.&lt;/p&gt;

&lt;p&gt;Before I even tried to write the composition code, the Python script for the letter model alone had exploded in complexity. Every architectural pivot, and there were many, added more boilerplate than it removed. Per-op dispatch overhead was biting on top of that, especially in the recurrent attention loop where I was making hundreds of small kernel calls per training step.&lt;/p&gt;

&lt;p&gt;What I actually wanted was a way to &lt;em&gt;describe&lt;/em&gt; the network. Not procedurally assemble it from instances and module hierarchies. Describe it. What's its structure? What's tagged? What's frozen? What loads from what checkpoint? Looking at the kind of object I needed to express, the answer was obvious. This is a graph, not a script.&lt;/p&gt;

&lt;p&gt;I had a prior on this. A few years back I'd built a graph library for a data-processing project, and I knew what the graph-shaped headspace felt like. The itch was familiar.&lt;/p&gt;

&lt;h2&gt;Two false starts&lt;/h2&gt;

&lt;p&gt;Python failed me first, Go second. The Go attempt was called goDL. It taught me more than it shipped.&lt;/p&gt;

&lt;p&gt;The thing that killed it was the GC trap. Garbage collection plus GPU memory ownership do not compose. You end up with tensors the garbage collector thinks are dead but that the GPU is still using, or the inverse, tensors the GPU is done with but that the GC won't clean for another generation. You can layer manual lifetime management on top, but at that point you've reinvented Rust ownership in a language that's actively fighting you about it.&lt;/p&gt;
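
&lt;p&gt;In Rust the fix falls out of ownership. A deliberately simplified sketch, not flodl's actual tensor type, of the RAII shape that replaces the GC guesswork:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Simplified illustration (not flodl's real Tensor): device memory is freed
// exactly when the owning value goes out of scope. No GC collecting a tensor
// the GPU still reads, no dead tensor waiting out a GC generation.
struct GpuTensor {
    ptr: *mut std::ffi::c_void, // device allocation handle
}

impl Drop for GpuTensor {
    fn drop(&amp;mut self) {
        // Deterministic release point, known at compile time from scopes.
        unsafe { device_free(self.ptr) }; // stand-in for cudaFree via FFI
    }
}

unsafe fn device_free(_ptr: *mut std::ffi::c_void) {
    // hypothetical FFI shim; the real call is the backend's deallocator
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;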

&lt;p&gt;So: Rust. Then flodl.&lt;/p&gt;

&lt;h2&gt;The libtorch FFI bet&lt;/h2&gt;

&lt;p&gt;Several Rust deep learning frameworks exist already. Pure-Rust GPU paths are real and the people building them are doing serious work. None of them, when I started, gave me what I wanted.&lt;/p&gt;

&lt;p&gt;The bet I made for flodl was libtorch FFI through a thin C++ shim. It is not pure Rust. It does not run on every backend. It inherits libtorch's memory footprint. What I get in exchange is CUDA parity today. NCCL today. Tensor Cores today. Mixed precision, CUDA Graphs, fused multi-tensor optimizers. Not in six months. Now.&lt;/p&gt;

&lt;p&gt;I came to programming through the startup-scaling era, where the daily question was how to architect systems that hold up at volume. Shipping production-grade systems is what I do know. The deep learning math I'm still learning as I build. When I chose libtorch FFI, it was the shipping instinct talking: stand on a battle-tested C++ library, and you get production-grade performance today rather than hoping a pure-Rust kernel path catches up over the next few release cycles.&lt;/p&gt;

&lt;p&gt;That bet pays off in measurable ways. There's a benchmark suite that compares flodl to PyTorch on ten architectures; more on that later. For now the point is just: libtorch FFI was a deliberate choice with known costs, not a shortcut.&lt;/p&gt;

&lt;h2&gt;What flodl looks like today&lt;/h2&gt;

&lt;p&gt;flodl has, today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tensor and autograd backed by libtorch. 100+ tensor operations, 90+ differentiable.&lt;/li&gt;
&lt;li&gt;nn modules at rough PyTorch parity: activations, losses, optimizers (SGD, Adam, AdamW, RMSprop, RAdam, NAdam, with fused CUDA kernels), conv (1d/2d/3d, transposed), recurrent (GRU and LSTM cells and full sequences), attention, normalization (batch, layer, group, instance, RMS), pooling, dropout variants, embedding.&lt;/li&gt;
&lt;li&gt;A declarative graph DSL called FlowBuilder, with visualization.&lt;/li&gt;
&lt;li&gt;Hierarchical graph composition with selective freeze and partial checkpoint loading. The thing the FBRL composition shape needs.&lt;/li&gt;
&lt;li&gt;Transparent multi-GPU training on heterogeneous hardware. One training loop, one or N GPUs.&lt;/li&gt;
&lt;li&gt;Production niceties: mixed precision, CUDA Graphs, fused optimizers, async data prefetch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A list like that undersells the work. The point is that flodl has crossed the line from "can I express what I want" to "does the framework hold up under real workloads I care about." It does.&lt;/p&gt;

&lt;p&gt;flodl is two months old. The velocity comes from AI collaboration on implementation. I'm the architect and the decision-maker. The bets, the API shape, the priorities, the truth-discipline this series will hold itself to: those are mine. Many of the lines of code: not. The pace is what AI partnership makes possible, and I'll be explicit about that throughout the series.&lt;/p&gt;

&lt;h2&gt;What hooked me&lt;/h2&gt;

&lt;p&gt;I started flodl because FBRL needed it. Building it pulled me into questions I didn't expect: ergonomics, performance, distributed training, convergence under heterogeneous compute. That is the rest of this series. Walking through what got built and why.&lt;/p&gt;

&lt;p&gt;For now the simplest thing I can say about why flodl exists is the thing that has stayed true through three Python rewrites, one failed Go attempt, and the Rust work that became flodl:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;With flodl I don't rewrite when I pivot. I add or remove a graph member.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Next post: &lt;a href="https://dev.to/fab2s/i-wanted-to-describe-a-network-not-assemble-it-flowbuilder-in-flodl-4a74"&gt;FlowBuilder, and what a declarative graph DSL actually looks like in Rust&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;flodl: &lt;a href="https://flodl.dev" rel="noopener noreferrer"&gt;flodl.dev&lt;/a&gt; · &lt;a href="https://github.com/fab2s/flodl" rel="noopener noreferrer"&gt;github.com/fab2s/flodl&lt;/a&gt; · &lt;a href="https://x.com/flodl_dev" rel="noopener noreferrer"&gt;@flodl_dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>machinelearning</category>
      <category>showdev</category>
      <category>deeplearning</category>
    </item>
  </channel>
</rss>
