Forem: David Aronchick

For Londoners, a Roman Bridge Still Determines Your Commute

David Aronchick — Fri, 08 May 2026 00:16:16 +0000

Around 50 CE, give or take a few years, a group of Roman military engineers picked a spot on the River Thames and bridged it. They picked the place they did because it was the narrowest practical crossing downstream of the marshes that lined the river's lower reach, with banks workable enough to anchor timber piers and high enough not to wash out at high tide. The bridge they built ran roughly 280 metres across the water on at least nineteen wooden pilings. On the dry, slightly elevated north bank where it landed, an opportunistic trading settlement took root, attracted to the only place for miles where you could reliably get from the south side of the river to the north on foot. They called the settlement Londinium.

The bridge has been rebuilt many times. The Roman timber structure was patched and replaced for centuries, then collapsed into disrepair after the Romans withdrew from Britain in 410 CE. The city itself was largely abandoned for the next two and a half centuries, until Alfred the Great refounded London in 886. The first stone bridge was begun in 1176 by a priest and architect named Peter de Colechurch and finished in 1209, three years after his death. That bridge, the famous Old London Bridge with shops and houses built along its length, eventually rising several stories tall, with severed traitors' heads on spikes at the south gatehouse, stood for over six hundred years. It was replaced in 1831 with John Rennie's stone arch design, which was itself replaced in 1973 with the present concrete-and-steel span that most Londoners walk across without thinking about it.

Each rebuild was on essentially the same site. Once you have built a city around a bridge, the bridge isn't where the bridge is. The city is where the bridge is.

The bridge story is binding even when nobody knows it

Almost every long-arc fact about modern London cascades from that one Roman engineering decision. The City of London, the financial district, the square mile with the bowler-hat-and-pinstripe stereotypes, sits where it sits because that's where the original bridge landed and where Roman commerce piled up around the crossing. Westminster grew separately, two miles upstream, because the medieval kings wanted physical distance between their royal seat and the merchants. The Tower of London was built next to the bridge specifically to control it. The East End and West End divide settled where it did because the prevailing wind blew industrial smoke eastward over centuries, depressing property values on the downwind side. The London Underground, when it was built starting in 1863, inherited the medieval street grid. Your commute today routes through that grid. The grid exists because the city's center settled where the bridge crossed. The bridge crossed where it did because of a tide table from two thousand years ago.

You can describe London accurately without knowing any of this. The city works fine. The Northern line still runs. Property prices in Hampstead are still higher than property prices in Stratford, and you can look up by how much without ever asking why. Most of the people who walk across London Bridge twice a day on their way to and from Bank station have never thought about the Romans, the marshes, or Peter de Colechurch. They don't need to. The city's shape is taken as given.

You only run into the bridge story when you try to change something. Try to propose a new river crossing in central London and you discover that every site you could pick is constrained by infrastructure that was placed because of the original crossing. Try to redirect a Tube line and discover that the geology of the City rules out anything but the routes that already exist, because the ground was tunneled and stabilised for the routes that already exist, because the routes were drawn to serve the medieval street pattern, because the medieval street pattern formed around the crossing. Every constraint you hit is downstream of a decision nobody remembers being made.

That is what a millennium of accumulated technical context does to a city. Most of it is invisible from the street, and all of it is binding the moment you try to do anything new.

Every dataset has a Roman bridge

Every dataset you have ever worked with has the same structure. There is a column on a table somewhere in your enterprise that exists because of a regulation that was passed in 2017 because of an incident at a competitor in 2014 because of a Senate hearing in 2013 because of a complaint from a single advocacy group in 2011. Three job-changes later, nobody at your company remembers any of that. The column is still there. Every model trained on the table inherits the column. The model has no idea why the column exists, but it learns to weight it anyway, because the column is in the data.

A schema you work with today was probably designed by a team that no longer exists, to support a use case that no longer matters, against a constraint that has since been repealed. The schema is internally consistent. The data flowing through it is internally consistent with the schema. An LLM that reads the data produces internally consistent answers about it. None of that consistency tells you whether the conditions that justified the original design still hold, because the conditions are not in the data. They are in the meeting notes nobody saved, the email thread that got archived in 2018, the SOC 2 attestation that was renewed three times before anyone questioned the assumption underneath it.

Most of the time the cascading context doesn't matter. Like the Northern line, the system works. The model returns reasonable answers, the dashboards load, the forecast goes out to investors, and nothing breaks. Until something changes. A regulator asks why the model is treating two customer segments differently and the answer is that the segmentation was built in 2019 against demographic categories that have since been ruled discriminatory. A new product manager asks why the recommendation engine biases toward older content and the answer is that the original training data was filtered by an engineer who wanted to remove a category of spam in a way that incidentally also removed everything posted after a certain date. Every constraint you hit is downstream of a decision nobody remembers being made.

Stanford's 2026 AI Index puts hallucination rates across the leading frontier models in a band stretching from 22% to 94%, depending on the domain and the test. The reflexive industry response is to point at the model, ask for better fine-tuning, better RAG, better evals, and that response is wrong in roughly the same way as proposing a better Tube line without acknowledging the geology. The hallucinations aren't really coming from the model. They're coming from the fact that enterprise data is a city with no bridge story, and the model is being asked to interpret the city without the history. It does its best. Its best is wrong about a quarter of the time and sometimes very wrong. The fix isn't a smarter tourist. The fix is to keep the bridge story attached to the data while the data is moving through the pipeline, instead of stripping it at every hop.

The pipeline strips context aggressively

Consider what a normal AI pipeline does to historical context.

Source data is extracted from a system of record into a staging area. Whatever source-system-specific metadata existed (the user who created the row, the version of the upstream schema, the policy that required the field to be populated) is dropped or reduced to a vendor-neutral representation that loses most of the meaning. The data is transformed and joined with other extracts in the warehouse, where any conflicts between the sources are resolved silently by whichever ETL job runs last. The result lands in a feature table. The feature table is consumed by a model, or chunked and embedded for retrieval, or sometimes both. By the time the embeddings are sitting in the vector index, the only remaining pointer back to the original source is a row ID in a Parquet file in object storage. The historical conditions, the authority chain, the regulatory rationale, the schema evolution, all of it is gone.

When a user asks the model a question, the retriever fetches chunks based on geometric similarity in the embedding space. Geometric similarity does not preserve provenance. The retriever has no way to surface the fact that one chunk was authored by a regulator and another by an intern, or that the regulator's chunk supersedes the intern's chunk, or that the intern's chunk was written under a policy that was repealed three months ago. The model reads both, treats them as roughly equivalent inputs, produces an answer that averages them, and cites both chunks. The citation looks rigorous because the citation has nothing to check itself against.

This is the failure mode I wrote about in The Only Guarantee Is Your Catalog Will Be Wrong. Eventually. and again in The Missing Part of the Pipeline. The structural answer is to wrap the bridge story onto the data at the moment of ingest, with claim-level granularity, signed and immutable, and let it ride with the data through every downstream transform. Provenance has to be a property of the artifact, not a layer reconstructed afterward by a catalog crawling artifacts that have already lost their context. Every downstream consumer inherits the wrap for free. The model reading the data can tell that the regulator's chunk supersedes the intern's chunk because that fact is in the manifest the chunk carries with it. The SLSA specification defines this primitive for software builds. The same primitive is what the data world has been missing.

Cities are easier to read than data because they are physical

Cities have one big advantage over datasets, which is that they are physical. London Bridge is still there. You can see it. You can stand on it. You can look at it from the river and notice that the modern span is in suspiciously the same place as the medieval one and ask why, and the answer is right there for anyone who wants to follow the chain. Even if nobody bothered to write the bridge story down, the city wears it as a physical fact.

Data has the same kind of inheritance, but invisible. The schema does not announce that it was designed in 2017 against a regulation that no longer exists, the model weights do not announce that they were trained on a corpus a since-departed engineer happened to filter according to his strong opinions about spam, and the retrieval index does not announce that one of its chunks is six years stale and authored by somebody whose role got eliminated in the last reorg. The cascading historical decisions are still in there, still doing the work of constraining the system, but you cannot see any of it by looking at the system from the outside.

The only way to make the inheritance legible is to refuse to lose it in the first place. Wrap the bridge story onto the data while the data is being born. Sign the manifest. Carry it forward. When the data is consumed by an LLM, hand it the manifest along with the data, so the model can tell the difference between a current authoritative source and a stale auxiliary one. When the model produces an answer, have the answer cite not just which chunk it came from but which version of which source under which authority at which point in time. This is not abstract. The components for it exist as discrete primitives in modern data infrastructure. What's missing is the integrated layer that combines them into a continuous bridge story for every claim the system makes.

Brian Arthur's work on path dependence showed decades ago that systems with increasing returns tend to lock in early choices for centuries, sometimes longer. The Davis-Weinstein analysis of Japanese cities after WWII bombing showed that even when you flatten a city to rubble, it tends to grow back in roughly the same places, because the underlying locational logic that put the city there in the first place is still in force. London's bridge is that kind of artifact. The Thames is the width it is, the banks are the shape they are, the tides behave the way they behave, and the Romans noticed in 50 CE that all of those things together made one specific spot the only sensible place to bridge. That fact has been load-bearing ever since.

Your enterprise data has the same kind of underlying logic, except none of it is visible from the outside. The schemas, the tables, the dashboards, and the trained models were all designed for reasons that were correct at the time, by people who understood the constraints they were operating against, with a specific regulation in mind and a specific customer expectation in mind that made sense in the year the decision got made. The decisions persist long after the reasons stop applying, and the model trained on the result is operating on a city plan that does not include the bridge.

You can fix this. The components are sitting on the shelf in modern data infrastructure. Wrap the data at ingest, sign the manifest, carry the bridge story through every downstream transform, and the LLM finally reads a substrate that knows where it came from. Stop making AI guess at a city it cannot see.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost. I'd love to hear your thoughts!

Originally published at For Londoners, a Roman Bridge Still Determines Your Commute.

The Permission Problem

David Aronchick — Wed, 06 May 2026 18:27:58 +0000

Let's talk about Loudoun County, Virginia.

If you don't work in infrastructure, you've never heard of it. If you do, this is THE place. Somewhere between 60% and 70% of the world's internet traffic routes through facilities sitting in this one Northern Virginia county, and for most of the last decade Loudoun was also the most permissive jurisdiction in the country for hyperscale construction. The Board of Supervisors approved campuses on what was effectively a rubber-stamp cadence, the substation costs got socialized into the broader Dominion Energy rate base, the county collected the property tax revenue, and most of the AI buildout from 2022 through 2024 was financed against the assumption that Loudoun was the median, not the outlier.

This quarter, Loudoun has more active moratorium proposals on its docket than any comparable jurisdiction in the country. That reversal happened in eighteen months. It is happening for reasons that are going to keep happening, in places nobody had on their 2026 risk register.

The bottleneck nobody priced into a 2024 financial model was never going to be technical, even though most of the discourse acted like it would be. Compute is fine, NVIDIA shipped on time, the networking gear arrived, the cooling works in the lab and at scale, and not one of those things is what is going to keep your 2026 workload from coming online. The thing that is going to keep it from coming online is whether a county commissioner, a state public utility commission, or a regional grid operator will let you turn the racks on in the calendar year you actually need them turned on. That answer is moving in real time, and it is moving in the wrong direction for the buildout that just got financed.

A 2 GW campus is a financeable thing on paper. It is a five-year fight in dirt, and the fight is what is changing.

The grid-skeptics had the diagnosis right

The folks who have spent the last two years warning that centralized AI was going to slam into grid limits got the diagnosis exactly right, and at this point I would just like to say sorry to anyone I argued with about it in 2024. Data centers now account for roughly half of all new US electricity demand. Global AI-related electricity consumption is on track for around 1,000 TWh by the end of 2026, which is a midsize industrialized country's worth of electricity. Treating any of this as a footnote, which most of the industry did until about a year ago, was a category error.

But then the same crowd keeps handing me a prescription that does not survive contact with the actual political-economy environment, and that is where the conversation falls apart.

Will isn't the problem. Coalition is.

The fix-the-grid argument assumes that the engineering exists, the financing exists, and the only thing missing is political will. That is half right, in the way that "all I need to climb Everest is a good attitude" is half right. The engineering is mature, DOE has the modernization roadmap, FERC has the interconnection reform proposal, the transmission corridors are mapped, the storage technology is production-ready, demand-response is running in pilot, and none of this is mysterious. What is actually missing is not will. It is coalition, and coalition is the thing that does not show up in an engineering roadmap.

Transmission siting is a state-level fight in most US jurisdictions, and most of the relevant states do not have a clean coalition through it. New generation siting requires winning a NIMBY fight, an environmental review, in a lot of places a tribal sovereignty fight, and increasingly a ratepayer revolt, all stacked on the same project. Substation siting plays out at the county and municipal level, where the constituency that benefits from a hyperscale data center load (a hyperscaler in another time zone) has roughly zero votes in the relevant elections, and the constituency that absorbs the cost (the residential ratepayer, the school district worried about its assessment, the homeowner two miles upwind) has roughly all of them. That is not a failure of imagination. It is a vote count.

I have watched a lot of my friends in this industry get frustrated with voters about this, which I understand and which I also think is short-sighted. The voters are not being unreasonable. They are being asked to underwrite a buildout whose direct cost they pay, whose direct benefit they do not see, and whose externalities (the noise, the water, the visual blight, the higher bills) they live next to. Of course they are saying no. The surprising thing is not that they finally noticed, it is that we expected them not to.

State PUCs noticed. Several have moved hyperscale data center loads into separate rate classes specifically to insulate residential ratepayers from the cost of expansion, and almost no AI press picked it up, which is a miss because once the rate base splits, the financing model that made hyperscale campuses cheap collapses, the hurdle rate goes up, and the build cadence slows. None of this is anyone being unreasonable. It is the cost of asking anyone to be reasonable getting priced in.

"But hyperscalers will scale through it." Are you sure?

The other prescription, the one coming out of the labs and the hyperscalers themselves, is that announced capacity will roughly track delivered capacity, the way it did in 2018 and 2019 and 2020 and 2021. That was a defensible extrapolation through the end of 2024, stopped working visibly in 2025, and is now in open contradiction with the data.

Loudoun is not a one-off. Suburban Phoenix and several Texas exurbs have either passed moratoria or surfaced moratorium proposals into active hearings, and Northern Virginia substation projects that were routine approvals five years ago are now multi-year political fights. Two stories from the last couple weeks tipped me off that the political turn just crossed a threshold I had not seen yet. CNN ran a piece on April 23 titled "There are fixes for AI's toll on the power grid. Here's why they're not happening." Three days earlier, Fortune published polling showing Americans now associate AI infrastructure with rising electricity bills and have soured on AI as a category as a result. Read the bylines. Notice where those pieces ran, because CNN and Fortune are not Common Dreams, and the political turn against centralized AI just migrated out of the activist fringe and into the centrist business and political press, which is the part of the spectrum that most reliably foreshadows where actual zoning votes are going to land in the next election cycle.

So which side is right? Both, kind of, and also neither in the way that actually matters for the next twenty-four months. The fix-the-grid argument resolves on a presidential-term timeline. The hyperscaler-scale-out argument is hitting the wall right now. The actual workload, the stuff being trained and served and shipped by product teams who do not care about either argument, has to run somewhere, and where it runs is wherever the substation is already humming.

Telco edge: the buildout that already happened

There is already a huge pile of capacity sitting in places nobody is having moratorium fights about. Cell tower compounds. Regional colo footprints. Retired industrial facilities with their substation tie-ins still intact. Telco central offices that were sized in 1996 for a voice and broadband workload that has since contracted by an order of magnitude. The substations are paid for, the easements are paid for, the cooling is roughly fine, and the community license, which is the hard part of all of this, was granted thirty years ago by a public that has long since moved on to caring about other things.

NVIDIA put a number on this at GTC in March: roughly 100,000 telco data center sites globally, with 100 GW of spare capacity already energized. HPE's distributed AI factory rollout approaches the same problem from the enterprise side. The two of them are converging on the answer that has been sitting there the whole time, which the discourse has been politely ignoring because it is not as exciting as building a new 2 GW campus in cornfield Wisconsin.

There is a reflex in this industry, especially among people who learned their economics on the AWS scale curve, to dismiss the whole telco-edge category as too small, too inefficient, too logistically annoying to be the actual answer. That dismissal was defensible when the hyperscale alternative was clearing on a four-year cadence. It is not defensible when the alternative is a 2028 FERC study cycle and a Loudoun moratorium hearing on the same docket. A workload that runs at slightly worse unit economics on permitted infrastructure beats a workload that does not run at all, and the economics that looked bad against a 2024 spreadsheet look fine the moment the comparison case becomes zero.

I made the structural case for this in Six Million Cell Towers Walk Into a Data Center, and the incumbents have started making the argument for me, which has historically been the moment a position quietly graduates from contrarian to obvious.

So what do you actually do about any of this

For three years the AI infrastructure debate ran at the engineering layer, in some combination of NVIDIA versus AMD, hyperscaler versus neocloud, and centralized versus federated, in roughly that chronological order. None of those framings was wrong on its own terms. They were operating one layer above the constraint that is now actually binding.

The constraint that is now actually binding is whether your load gets to come online in the calendar year you need it, and the answer is unevenly distributed across US geography in ways the planning process has not caught up to. Some sites have the permission. Most do not. The architecture that wins the next two years is the one that pays attention to which is which and routes the workload accordingly.

If your 2026 inference is sitting in a FERC interconnection queue with a 2029 study cycle, that capacity does not exist for you, and the same is true if your training run is waiting on a substation transformer with an 18-month lead time. Anything else you have planned that depends on Loudoun voting yes is on a similar timeline, and the timeline is worse than your roadmap admits.

Loudoun is not voting yes.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

Originally published at The Permission Problem.

The Brownian Ratchet for Data

David Aronchick — Fri, 23 Jan 2026 00:35:31 +0000

Monday I wrote about how multiclaude and GasTown converged on nearly identical primitives for multi-agent orchestration. The key insight wasn't about prompts or models or agent personas. It was about infrastructure: CI is the ratchet. Let chaos reign. Multiple agents, overlapping work, duplicated effort, whatever. As long as you have a mechanism that only captures forward progress, you're good.

That phrase has been rattling around my head ever since. Because here's the thing: we have this for code. What's the equivalent for data?
The Missing Ratchet
CI transformed software development by giving us a one-way gate. Code either passes or it doesn't. No negotiations, no exceptions, no "we'll fix it later." The ratchet clicks forward, and it never clicks back.

Data has no such mechanism.

Oh, we have tools. We have great expectations (pun intended). We have dbt tests and schema validators and anomaly detectors. But none of them function as the arbiter-the single, uncompromising source of truth that says "this data is real now, and we're never going backward."

Instead, we have... hope? Process? Tickets that say "data quality issue" that sit in someone's backlog for three sprints while the dashboard keeps serving numbers that everyone knows are wrong but nobody can prove?
What Would a Data Ratchet Look Like?
Let's steal the multiclaude architecture and apply it to data:

Code Ratchet
Data Ratchet

CI tests
Schema validation + semantic checks

Passing tests
Data meeting quality thresholds

Merged PRs
Verified, immutable records

Git history
Data lineage with provenance

Multiple agents
Multiple validators / transformation paths

The principle is the same: chaos is fine, as long as we ratchet forward.

Multiple data sources can feed into your system. They can be messy, inconsistent, formatted in ways that make you question whether the upstream team has ever heard of ISO 8601. That's the Brownian motion: the random thermal energy of the real world generating data in a thousand incompatible ways.

But the ratchet, the verification layer, only lets validated data through. And once it's through, it's permanent. Immutable. Part of the record.
The Four Components
I think a data ratchet needs four things:

The Pawl: Schema as Contract JSON Schema (or Avro, or Protobuf, whatever floats your boat) isn't just documentation. It's the pawl that prevents backward motion. Data either conforms or it doesn't. No partial credit.

Here's what a schema-as-pawl actually looks like:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "SensorReading",
"type": "object",
"required": ["device_id", "timestamp", "value", "unit"],
"properties": {
"device_id": {
"type": "string",
"pattern": "^[A-Z]{2}-[0-9]{6}$"
},
"timestamp": {
"type": "string",
"format": "date-time"
},
"value": {
"type": "number",
"minimum": -273.15
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit", "kelvin"]
}
},
"additionalProperties": false
}

Notice additionalProperties: false. That's the pawl. You can't sneak extra fields through. You can't send "value": "hot" instead of a number. You can't omit the timestamp and promise to fill it in later.

But here's where most systems fail: they treat schema validation as a warning, not a wall. "Schema violation detected, logging and continuing." That's not a ratchet. That's a turnstile with a broken lock.

A real data ratchet rejects non-conforming data. Full stop. The data can go back to the source, get transformed, get remediated, whatever it needs to do. But it doesn't get through until it conforms.

The Wheel: Idempotent Checkpoints In multiclaude, git worktrees give each agent isolation. If an agent's work fails, it fails in its own branch. The main branch (the ratcheted progress) stays untouched.

Data pipelines need the same thing: checkpoints that are idempotent and isolated. If a transformation fails, you can retry from the last checkpoint without corrupting the verified data downstream.
class CheckpointedPipeline:
def init(self, checkpoint_store: str):
self.checkpoint_store = checkpoint_store

def process_batch(self, batch_id: str, records: list[dict]) -&gt; str:
    # Check if we already processed this batch
    checkpoint = self.load_checkpoint(batch_id)
    if checkpoint and checkpoint[&quot;status&quot;] == &quot;completed&quot;:
        return checkpoint[&quot;output_path&quot;]  # Idempotent: return existing result

    # Process in isolation (write to temp location)
    temp_path = f&quot;{self.checkpoint_store}/pending/{batch_id}&quot;
    validated = []
    for record in records:
        if self.validate(record):
            validated.append(record)
        else:
            self.quarantine(record, batch_id)  # Don&apos;t lose it, just don&apos;t let it through

    self.write_records(temp_path, validated)

    # Only after success: commit the checkpoint
    final_path = f&quot;{self.checkpoint_store}/verified/{batch_id}&quot;
    self.atomic_move(temp_path, final_path)
    self.save_checkpoint(batch_id, {&quot;status&quot;: &quot;completed&quot;, &quot;output_path&quot;: final_path})

    return final_path

The key moves: write to a temp location first, only move to the verified path after success, and the checkpoint makes retries safe. If the process dies mid-batch, we start over. No partial state leaking into the verified dataset.

Most pipelines I've seen treat state as something that happens to them rather than something they manage. They're stateless in theory and stateful in practice, which is the worst of both worlds.

The Arbiter: Automated Verification with Teeth Here's the multiclaude rule that matters: agents are forbidden from weakening CI to make their work pass.

Translate that to data: no one can weaken the validation rules to make bad data pass. Not the data team, not the business stakeholder with a deadline, not the executive who needs the dashboard updated yesterday.

What does "CI for data" actually look like? Something like this:

data-ci.yaml

name: Data Quality Gate

on:
data_ingestion:
sources: ["sensor-feed", "partner-api", "user-uploads"]

jobs:
validate:
steps:
- name: Schema Validation
run: |
jsonschema --instance ${{ inputs.data_path }} \
--schema schemas/${{ inputs.source }}.json
fail_on_error: true # This is the ratchet. No exceptions.

  - name: Semantic Checks
    run: |
      python checks/semantic_validator.py \
        --data ${{ inputs.data_path }} \
        --rules rules/${{ inputs.source }}.yaml
    # Example rules:
    # - timestamp must be within last 24 hours
    # - device_id must exist in device registry
    # - value must be within 3 std devs of rolling mean

  - name: Lineage Recording
    if: success()
    run: |
      record-lineage \
        --input ${{ inputs.data_path }} \
        --schema-version ${{ inputs.schema_hash }} \
        --validator-version ${{ github.sha }} \
        --output verified/${{ inputs.batch_id }}

on_failure:
steps:
- name: Quarantine Bad Data
run: |
move-to-quarantine ${{ inputs.data_path }} \
--reason "${{ job.failure_reason }}"
- name: Alert Source System
run: |
notify-upstream ${{ inputs.source }} \
--batch ${{ inputs.batch_id }} \
--errors ${{ job.validation_errors }}

The critical bit is fail_on_error: true with no escape hatch. No continue-on-error. No "warn and proceed." The data either passes or it goes to quarantine.

This is culturally difficult. It requires the same organizational commitment that "we don't ship if tests fail" required for software teams. But it's the only way the ratchet works.

Reproducibility: The Secret Ingredient There's one more piece that makes the code ratchet work: reproducibility. When CI fails, you can reproduce the failure. When it passes, you can reproduce the pass. Same inputs, same outputs, every time.

Data systems are notoriously bad at this. The pipeline that worked yesterday fails today because someone changed an upstream schema. Or because the source system had a hiccup. Or because Mercury is in retrograde. (I've debugged all three. The Mercury one was actually a timezone issue in a system named "Mercury." I wish I was kidding.)

A real data ratchet needs what I'd call a "usability signature":
{
"batch_id": "2026-01-22-sensor-feed-042",
"verified_at": "2026-01-22T14:32:01Z",
"input_hash": "sha256:a1b2c3d4...",
"schema": {
"name": "SensorReading",
"version": "2.1.0",
"hash": "sha256:e5f6g7h8..."
},
"validators": {
"semantic_checks": "v1.4.2",
"anomaly_detector": "v0.9.1"
},
"result": {
"status": "passed",
"records_in": 10482,
"records_verified": 10479,
"records_quarantined": 3
},
"output_path": "verified/2026-01-22/sensor-feed-042.parquet",
"output_hash": "sha256:i9j0k1l2..."
}

This signature is an artifact, not just a log line. You can take this signature, grab the input data by its hash, run the exact versions of the validators, and you'll get the same result. If you can't do that, you don't have a ratchet. You have a coin flip.
The Uncomfortable Implication
Here's what this means in practice: a lot of data that's currently flowing through your systems wouldn't make it through a real ratchet.

That's not a bug. That's the point.

The Brownian ratchet works because it's uncompromising. The pawl doesn't care that you really need this data for a quarterly review. It doesn't care that the source system "usually" sends valid records. It doesn't care about your deadline.

CI transformed software quality not by being smart, but by being stubborn. It created a culture where "works on my machine" stopped being an excuse because there was an objective arbiter that didn't care about your machine.

Data needs the same stubbornness. The same willingness to say "no" and mean it.
What This Looks Like in Practice
I've been thinking about this in the context of what we're building at Expanso: intelligent data pipelines that can process data at the edge. The edge is where the Brownian motion is strongest. Sensors, devices, user inputs, all generating data in a thousand formats with a thousand failure modes.

The traditional answer is to centralize. Pull everything to a data lake, clean it up, validate it there. But that's expensive, slow, and loses context. By the time you've moved the data, you've lost the ability to remediate at the source.

What if the ratchet lived at the edge? Validation happens where data is generated. Non-conforming data gets rejected immediately, while there's still context to fix it. Only verified data propagates upstream.

That's the vision. Not a single central ratchet, but a distributed network of ratchets. Each one small and stubborn. Each one clicking forward, never back.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.*

Originally published at Distributed Thoughts.

The Brownian Ratchet and the Chimpanzee Factory

David Aronchick — Tue, 20 Jan 2026 00:34:00 +0000

Two weeks ago, Steve Yegge released GasTown, a multi-agent orchestrator he describes as "an industrialized coding factory manned by superintelligent chimpanzees." A few days later, Dan Lorenc quietly pushed multiclaude, built on what he calls the "Brownian Ratchet" principle: chaos is fine, as long as we ratchet forward.

While the projects are separate, Dan says he was deeply inspired by Gas Town. However, they, and many others like them, landed on almost identical foundational architecture: detached UI for observability, git worktrees for isolation, external state persistence, and CI as the final arbiter. That convergence tells us something important about where agent tooling is heading.
The Problem Both Solve
Running one Claude Code instance is straightforward, but running twenty in parallel on the same codebase is a distributed systems problem. The challenges are familiar to anyone who's operated infrastructure at scale: agent sessions crash, work gets duplicated, changes conflict, and state disappears when processes restart. Without proper isolation, a single runaway agent can corrupt shared resources, and without observability you can't debug what's happening. If state doesn't persist, progress evaporates the moment something fails.

In source code, people saw this same problem (lots of people working on the same thing), and solved it incrementally with things like SVN, and then Git (with many others as well).

Every multi-agent orchestration system has to answer these questions about multiple things working on a single thing, and what's interesting is watching how different systems answer them.
Two Philosophies, Same Primitives
GasTown takes the comprehensive approach, with seven distinct agent roles that divide responsibilities across the system. The Mayor coordinates overall work, Polecats handle ephemeral tasks, the Refinery manages merge queues, and so on through Witness, Deacon, Dogs, and Crew. Work flows through what Yegge calls the MEOW stack (Molecules, Epics, Work orders), with state persisting through git-backed "hooks" and the Beads memory framework. Agents maintain persistent identities that survive session crashes via the GUPP principle: "If there is work on your hook, YOU MUST RUN IT." This is ... confusing ... but i have to respect explicitly naming everything. Gives you an unequivocal way to know what (and who) is doing what, at the cost of a lot of UX affordance up front.

multiclaude takes the minimalist path with just three roles: a supervisor that coordinates, workers that execute tasks, and a merge-queue agent that handles CI. State lives in a JSON file and the filesystem, communication happens through simple message passing, and the philosophy is explicit: "Trying to perfectly coordinate agent work is both expensive and fragile. Instead, we let chaos happen and use CI as the ratchet that captures forward progress."

The rhetoric differs dramatically between the two projects. Yegge's documentation reads like a manifesto, complete with warnings that you shouldn't use GasTown if you "care about money" or are "more than 4 feet tall." Lorenc's README is Unix-philosophy spare, with clean diagrams and matter-of-fact explanations. But underneath the different personalities, you find the same primitives.

Primitive
GasTown
multiclaude

Process isolation
tmux windows
tmux windows

Code isolation
git worktrees
git worktrees

State persistence
Git-backed hooks
JSON + filesystem

Quality gate
CI (automated merging)
CI ("the ratchet")

Observability
Attach to any session
Attach to any session

Both folks recognized that you can't just spawn Claude instances and hope for the best. You need boundaries.
What the Convergence Reveals
When two experienced engineers arrive at the same architectural primitives, that's signal worth paying attention to. It suggests these aren't arbitrary choices but structural requirements for the problem space.

Isolation requires more than process boundaries. Both projects use git worktrees rather than just separate directories because a worktree gives each agent its own branch, its own working copy, and its own commit history. Conflicts become merge conflicts, which git already knows how to surface, and the blast radius of any single agent is bounded by what it can do to its own worktree.

Observability can't be an afterthought. Both chose tmux as the primary interface rather than a web dashboard or log aggregator. A terminal multiplexer lets you attach to any agent's session, watch it work, and intervene if needed. This is distinctly different from how most "AI agent frameworks" approach the problem, with their emphasis on structured outputs and API-driven orchestration.

State must survive failures. GasTown invests heavily in crash recovery through git-backed hooks while multiclaude keeps it simpler with filesystem persistence, but both reject the idea of ephemeral agent state. When a session dies, the work shouldn't die with it.

CI becomes the coordination mechanism. In both systems, CI isn't just a quality check but the arbiter of what counts as progress. Lorenc is explicit: "If it passes, the code goes in. If it fails, it doesn't. The automation decides." Yegge's Refinery agent serves the same function, and this approach shifts coordination from real-time synchronization (expensive, fragile) to asynchronous validation (robust, scalable).
The Deeper Pattern: Scoped Autonomy
Strip away the specific implementations and you find a design pattern emerging for AI agent systems: scoped autonomy with external persistence. Give agents freedom to act within clear boundaries, let them fail without cascading damage, capture successful outcomes permanently, and accept that coordination is expensive and often unnecessary if your ratchet mechanism is good enough.

This isn't a new idea. It's how we've learned to build reliable distributed systems over the past two decades, and the insight here is that agent orchestration is distributed systems with the same principles applying. Kubernetes asks "Is it running?" and reconciles toward desired state while GasTown asks "Is it done?" and persists completed work. Both are control loops operating over unreliable workers, and both accept that perfect coordination is impossible and design around it.
Where They Diverge: Single-Player vs. MMORPG
The most interesting philosophical difference isn't about orchestration complexity but about the human model.

Lorenc frames multiclaude explicitly: "Gastown treats agents as NPCs in a single-player game. You're the player, agents are your minions. multiclaude treats software engineering as an MMORPG. You're one player among many."

In multiclaude, your workspace persists. You spawn workers, go to lunch, come back, and check what merged while you were away. Other humans can have their own workspaces on the same repo, and the system keeps running when you're not watching.

GasTown is designed around intensive engagement. Yegge describes watching 20-30 agents in parallel, making $100/hour decisions about what work to greenlight, experiencing "palpable stress" as the system runs at speeds too fast to comprehend. It's a powerful multiplier for an engaged operator rather than a fire-and-forget system.

Neither model is wrong since they're optimizing for different workflows, but the MMORPG framing points toward something important: these systems need to work when humans aren't actively supervising.
What This Means for the Industry
We're watching the orchestration layer crystallize in real time, and the patterns that emerge now will shape how agent systems get built for years.

The "19-agent trap" (simulating an org chart with Analyst → PM → Architect → Dev → QA handoffs) is giving way to operational models where agents have specific, bounded roles. The emphasis shifts from elaborate prompting frameworks to infrastructure primitives: isolation, persistence, observability.

The tooling will mature as costs drop. Right now, GasTown burns $100/hour in tokens, partly because the models are expensive and partly because the coordination overhead is high. Both factors will improve, and the architectural patterns being established now will outlast the current pricing structure.

For teams thinking about agent infrastructure, the lesson isn't "adopt GasTown" or "adopt multiclaude" since both are weeks old and explicitly experimental. The lesson is to watch what primitives they converged on, because if you're building agent systems you'll probably need them too: git worktrees for isolation, something tmux-like for observability, persistent state that survives session failures, and CI or some equivalent as the ratchet that captures forward progress.

The chimpanzee factory and the Brownian ratchet arrived at the same answer. That's worth paying attention to.

Repos:
- GasTown: github.com/steveyegge/gastown
- multiclaude: github.com/dlorenc/multiclaude

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.*

NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost. [I'd love to hear your thoughts**](https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org)!**

Originally published at Distributed Thoughts.

The Upstream Problem: Why Context Graphs Are Starving

David Aronchick — Sat, 17 Jan 2026 00:33:27 +0000

Foundation Capital just published what they're calling AI's trillion-dollar opportunity: context graphs. They argue that enterprise value is shifting from systems of record (Salesforce, Workday, SAP) to systems of agents. The new crown jewel isn't the data itself. It's the context graph: a living record of decision traces stitched across entities and time, where precedent becomes searchable.

They're right about the destination. But Greg Ceccarelli's response on LinkedIn caught something important that their framing misses. Foundation Capital focuses on capturing decisions at execution time. That matters, but it's the last mile. The first mile is still bleeding out.
The Telephone Game (But With Developers)
Decisions don't originate at execution time. They originate in conversations.

A PM pattern-matches across customer interviews. Engineering debates constraints in Slack. A VP makes a call on a Zoom that nobody documents. By the time any of this hits a system of record, the context has been compressed, lossy-encoded, and re-interpreted three times. It's a game of telephone where the prize is a barely articulated card in your Kanban roadmap.

Recording meetings is table stakes now. The raw material exists. But most of it vanishes. It's searchable in theory and useless in practice. You can find that the decision was made, but you can't find why it made sense given everything else that was happening at the time.

Jamin Ball's piece "Long Live Systems of Record" pushed back on the "agents kill everything" narrative, arguing that agents don't replace systems of record, they raise the bar for what a good one looks like. I think he’s right, but the problem is no one is the voice for the downstream consumers. The reasoning, the exceptions, the context that justified a decision in the moment isn’t in any form that a human (let alone an agent) can find or consume. That's what's missing.

Context graphs need to be fed. The feed is conversations, and, right now, conversations evaporate.
The Same Problem, Worse, in Data
Greg's framing focuses on software development, and that's where his company SpecStory and their new Intent product are building. I think these are awesome, and deserve a lot of attention. In fact, so much so that I want to take it further. Software development is just one domain where decisions get lost upstream.

Data pipelines, our world, are another, and arguably worse.

When a data engineer decides which fields to drop during transformation, how to handle null values in a critical column, why a particular join strategy was chosen over another, what "clean" means for this specific dataset... where does that reasoning live? In a PR comment that gets archived. A Slack thread that disappears. Someone's head who leaves the company.

The data observability market has exploded. Gartner estimates data observability will be a $2.5B+ market by 2027. But all of it focuses on detecting problems after they happen. The upstream intent, why the pipeline was designed this way, what tradeoffs were considered, what the original constraints were, remains uncaptured.

Another favorite company of mine,Great Expectations, does a great job capturing what should be true. And dbt moves documentation closer to the code. And we have standards

, for example, captures the what of transformations. But almost nothing captures the WHY.

When an ML model makes a bad prediction, you can trace back to the training data. But can you trace back to why the training data was prepared that way? Who decided to impute missing values with medians instead of dropping rows? What was the conversation that led to that feature engineering choice? What did the team know at the time that isn't written down anywhere?

The decision trace doesn't exist because nobody captured it when it happened.
Intent Has Locality
This connects to something I've been thinking about for years. Intent has locality, just like data.

The richest context about a decision exists at the moment it's made, in the place it's made. Move it somewhere else, a different system, a different time, a summary written later, and you lose fidelity. This is true whether you're moving bytes across a network or moving reasoning into documentation.

Think about what happens when you try to document a decision after the fact. You're reconstructing. You remember the outcome but not the three alternatives you considered. You remember the constraint that mattered most but not the secondary factors that shaped the final call. You remember that someone raised an objection but not exactly what shifted the conversation.

The further you get from the moment of decision, the more context you lose. And unlike data, you can't just store a copy closer to where it's needed. The moment passes. The reasoning evaporates. What remains is the artifact without the intent.
What SpecStory Is Building
This is why what Greg and the SpecStory team are building with Intent matters. They started where decisions turn into code: the conversation between developers and coding agents. Intent records every exchange with Claude Code, Cursor, GitHub Copilot, Codex, Gemini. The transcript of how software actually gets built.

But as they asked where the intent came from, the answer kept pointing upstream. Team calls. Architecture discussions. Pairing sessions. The decisions that happen before anyone opens an IDE.

Their solution has three layers:

Capture: Every agent prompt and IDE session, recorded automatically. Not just the code that got written, but the back-and-forth that produced it.

Arena: Real-time collaboration with automatic decision extraction. Not verbose summaries nobody will read. The actual decision linked to the exact moment in the conversation.

Repo: Decisions versioned alongside your source code. Consumable by humans and agents. Searchable forever.

Full context lineage: Team discusses → Decision extracted → Agent builds → Session reasoning preserved → Code ships. Every line of code traceable back to the exchange where the decision was made.

That's the upstream feed layer context graphs need. The bridge from conversation to context to code.
The Parallel Problem Nobody's Solving for Data
The same pattern applies to data infrastructure, and the gap is even wider.

Here's an example I come back to constantly. You're looking at point-of-sale data from a retail chain, and one store shows zero transactions for six hours. What happened?

Maybe the system wasn't connected. Maybe there was a hurricane. Maybe it was midnight and the store was closed. Maybe there was a police action in the area. Maybe the pipeline is connected but stopped running. Maybe someone unplugged the wrong cable during a renovation.

The data looks identical in every case: zeros. But the appropriate response is completely different. If it's a hurricane, you adjust your forecasts and check on your employees. If it's a pipeline failure, you fix the pipeline and backfill the data. If it's midnight, you do nothing because everything is working correctly.

The "what" is the same. The "why" determines everything that matters.

This is the context gap in data infrastructure. Data lineage tools tell you what transformations happened. They don't tell you why someone chose that approach over alternatives. Data catalogs describe what datasets contain. They don't capture the discussions that shaped how those datasets were structured. Data quality tools flag when something looks wrong. They can't explain what "right" was supposed to mean based on the original requirements conversation.

Every data team has experienced this: you inherit a pipeline, something breaks, and you spend days reverse-engineering decisions that took the original author five minutes to make. The 2024 Stack Overflow survey found developers spend 30%+ of their time understanding existing code. For data engineers working with inherited pipelines, I'd bet that number is higher.

A few teams are starting to explore how to capture intent at the data layer, not just the code layer. The ones who figure out how to preserve decision context where data actually lives, at the edge, in pipelines, across distributed infrastructure, might be building something important. But right now, the tooling barely exists.
Why This Matters for AI Agents
Foundation Capital is right that agents need decision traces to exercise judgment. But consider what happens when we only capture traces at execution time.

An agent can follow a rule. It can look up a policy. But it can't understand why an exception was made last quarter unless someone captured the reasoning when it happened. It can see that a certain transformation was applied to a dataset but not why that approach was chosen over three alternatives that were discussed and rejected.

Research on AI decision-making keeps surfacing the same challenge: agents struggle with edge cases because they lack the contextual reasoning that humans use to navigate ambiguity. We've been trying to solve this with better prompts, more examples, refined guardrails. But the fundamental problem is upstream. The reasoning that would help agents handle edge cases was never captured in the first place.

Agents inherit our documentation debt. Every undocumented decision, every lost conversation, every piece of reasoning that exists only in someone's memory becomes a gap in the context graph. And agents can't exercise judgment across gaps.
The Compounding Problem
Context loss compounds in ways that aren't obvious until you're deep in a system you didn't build.

Every undocumented decision becomes a landmine for the next person (or agent) who encounters that code, that pipeline, that system. They see what was built but not why. So they either preserve it blindly (accumulating technical debt they don't understand) or change it without understanding the original constraints (breaking things the original author anticipated but never wrote down).

I've seen this pattern repeatedly. A team inherits a data pipeline with a seemingly arbitrary filter. They remove it because it doesn't match current requirements. Three months later, they discover it was preventing a subtle data quality issue that only surfaces under specific conditions. The original author knew about this. They even discussed it extensively with the team. But that conversation happened on a Zoom call that was never transcribed, and the person who made the decision left the company two years ago.

Multiply this across every team, every pipeline, every codebase. The DORA research shows that elite teams ship faster partly because they spend less time reverse-engineering past decisions. They've somehow preserved more context. Usually through heroic documentation efforts that don't scale.
The Path Forward
Foundation Capital's context graph thesis is right about the destination. Greg Ceccarelli and the SpecStory team are right about the first mile.

The platforms that win won't just capture decisions at execution time. They'll capture intent upstream, in the conversations, the debates, the reasoning that happens before anyone writes a line of code or builds a pipeline.

And they'll keep that intent close to where it matters. Versioned with the code. Traveling with the data. Available when someone (or something) needs to understand not just what happened, but why it was allowed to happen.

We're good at storing what happened. We're terrible at capturing why. The next trillion-dollar platforms will be the ones that figure out how to close that gap, not at execution time, but upstream, where the decisions actually get made.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.*

Originally published at Distributed Thoughts.

The $1B AI Drug Lab That Can't Touch Its Own Data

David Aronchick — Tue, 13 Jan 2026 06:12:34 +0000

Nvidia and Eli Lilly announced a $1 billion AI drug discovery lab today at the J.P. Morgan Healthcare Conference. The press releases are full of the expected language: "reinvent drug discovery," "accelerate medicine development," "foundation models for biology." Lilly's CEO David Ricks said they're "combining our volumes of data and scientific knowledge with Nvidia's computational power."

But, MAN, there is a phrase in there that is doing an INSANE amount of hard word: Combining our volumes of data.

How, exactly?
The Missing Paragraph
The coverage has something conspicuously absent: any discussion of how pharma data actually moves. The lab will be in South San Francisco. Lilly's clinical trial data, compound libraries, and patient information live in facilities scattered across Indiana, Ireland, and dozens of research sites worldwide. The announcement talks about "lab-in-the-loop" systems linking wet labs and dry labs in "24/7 AI-assisted experimentation."

That's a lovely vision. It also assumes data flows like water between these locations. In pharma, it doesn't.
Why Pharma Data Is Different
Clinical trial data contains protected health information under HIPAA. Proprietary compound structures represent billions in R&D investment and competitive advantage. Manufacturing process data falls under FDA's 21 CFR Part 11, which mandates complete audit trails for every electronic record: who touched it, when, why, and what changed.

These aren't bureaucratic inconveniences that clever engineering can route around. They're structural constraints that exist because the consequences of failure are measured in patient safety and billion-dollar regulatory actions.

I've been talking to teams that operate in these environments. The pattern is consistent: they don't lack compute. They lack the ability to make their data usable without making it movable.
The Air Gap Paradox
Traditional security thinking offers two options. Lock data down completely in air-gapped environments where nothing gets out. Or open it up for analysis and accept the exfiltration risk. Pharma has mostly chosen the first option, which is why so much valuable data sits in protected directories that researchers can barely access.

The promise of AI drug discovery assumes you can train models on this data. But training requires moving data to compute, or moving compute to data. The first option triggers every compliance alarm in the building. The second option is what the press releases hand-wave past.

Security teams need something in between: protected environments where data scientists can actually work, but where every attempted data movement gets logged, analyzed, and blocked if it violates policy. Not just access controls (who can log in) but egress controls (what can leave). The ability to process data, transform it, analyze it, without ever letting raw records escape the protected perimeter.

This is remarkably hard to build. It's also not a GPU problem.
The Audit Trail Problem
21 CFR Part 11 requires that regulated companies maintain computer-generated, time-stamped audit trails recording every modification to electronic records.

Let’s say that again. Every Modification

The trail must include the operator identity, date/time, and the nature of the change.

Now imagine training a foundation model on clinical trial data. The model sees millions of records. It learns patterns. It generates new molecular structures based on those patterns. What's the audit trail for that? When the model suggests a compound, which training records influenced that suggestion? When a researcher uses an AI-generated insight to make a decision, how do you document the provenance?

These aren't hypothetical concerns. The FDA released draft guidance on AI in drug development in January 2025, outlining a risk-based credibility assessment framework for AI models across nonclinical, clinical, and manufacturing phases. Regulators are actively figuring out how to apply existing frameworks to machine learning systems. Companies that can demonstrate clean data lineage from source through model to output will have a structural advantage in regulatory discussions.
What Nvidia's Billion Dollars Actually Buys
Nvidia and Lilly aren't naive about these challenges. The announcement mentions that researchers will "generate large-scale data" in the lab itself, creating new datasets specifically designed for AI training. That sidesteps some of the legacy data problems.

The collaboration will (likely) use Nvidia's BioNeMo platform, an open framework for building and training deep learning models for drug discovery that's been adopted by over 200 techbios and large pharma companies. They're also focusing initial efforts on applications where data constraints are less severe: manufacturing optimization, process simulation, early-stage compound screening. These are real opportunities where GPU compute genuinely is the bottleneck.

But the highest-value problems in drug discovery involve the data that's hardest to access: longitudinal patient records from clinical trials, real-world evidence from treatment outcomes, proprietary biological assay results accumulated over decades of R&D. That data can't just be copied to a shiny new lab in South San Francisco. And the estimated $1-3 billion cost to develop a single new drug happens largely because of failures that better data access might prevent.
The Actual Hard Problem
The companies that figure out "compute over data" for regulated industries will eat this market. Not by building bigger GPU clusters, but by solving the governance layer that lets valuable data become usable without becoming vulnerable.

What does that look like in practice? Tagging data at the source with cryptographic fingerprints so you can always verify provenance. Processing pipelines that run inside protected perimeters with whitelist-only egress. Audit systems that log not just access but every transformation, every query, every attempted export. The ability to prove, at any point, exactly what happened to every record.

This is boring infrastructure work. It doesn't make for exciting keynote demos. But it's the actual constraint on AI-driven drug discovery, and throwing more GPUs at it doesn't help.
What I'd Watch For
If you're evaluating pharma AI investments, look past the compute announcements. Ask instead:
How does the company handle data that can't leave its current location?What's their approach to federated learning or on-premises model training?How do they maintain audit trails through AI-assisted workflows?What's their story for regulatory submissions that involve AI-generated insights?
The GPU buildout is the visible part of the iceberg. The governance layer underneath is where the actual differentiation happens.

Nvidia's bet will work for some use cases. Public datasets, synthetic data, newly generated experimental results. But the highest-value pharma AI problems live behind walls that compute power alone can't scale. The billion dollars is impressive. The missing paragraph about data governance is the real story.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.*

Originally published at Distributed Thoughts.

Your 2026 Resolution: Add Context to Your Data (Before It Breaks You)

David Aronchick — Sun, 11 Jan 2026 06:11:42 +0000

Last week I sat in an executive review where two teams spent forty minutes arguing about "active users." Not about strategy. Not about growth. About what the number meant.

One team counted anyone who logged in. The other excluded users who bounced in under 30 seconds. Neither knew which experiment flags were active when the data was pulled. The dashboard just showed a number. No definition. No lineage. No context.

This happens constantly. And it's about to get significantly worse.

Gartner predicts that 60% of AI projects will be abandoned by 2026 because organizations lack "AI-ready data." Not because models failed. Not because compute was too expensive. Because the data traveling through these systems carries no meaning beyond the raw values.

The models can't tell the difference between a deprecated pricing page and current policy. They can't distinguish a test account from a real customer. They retrieve answers confidently, cite sources correctly, and still get everything wrong.

This is the year we stop treating context as optional documentation and start treating it as infrastructure.
The Context Engineering Pivot
Something shifted in 2025. The industry stopped talking about "prompt engineering" and started talking about "context engineering."

Andrej Karpathy called it "the delicate art and science of filling the context window with just the right information for each step." MIT Technology Review documented the transition from "vibe coding" to systematic context management. Google's December release of their Agent Development Kit was entirely focused on context architecture.

The terminology change matters. "Prompt" implies a single instruction you craft carefully. "Context" implies an entire information environment you engineer deliberately.

And it turns out most organizations have been engineering that environment with all the care of a teenager cleaning their room by shoving everything under the bed.

David Lanstein, CEO of Atolio, put it bluntly in IBM's 2026 predictions: "The solution isn't bigger models, but smarter data. True value will come from feeding models high-quality, permission-aware structured data to generate intelligent, relevant and trustworthy answers."

The race for bigger context windows missed the point. A 200K token context window filled with undifferentiated garbage produces undifferentiated garbage outputs, just with more confident citations.
What Context Actually Means
When I talk about context, I don't mean adding a few comments to your SQL. I mean four distinct layers that most data systems ignore entirely.

Semantic context is what a value actually represents. Not just "this column is called revenue" but "this is recognized revenue under ASC 606, calculated monthly, excluding deferred amounts, as defined by the finance team's Q3 2025 policy update." When the definition changes, the context changes with it. When someone queries the data six months from now, they see what it meant then, not what it means today.

Operational context is the health and provenance of the data at query time. Is this number fresh? Did the upstream pipeline fail overnight? Is there an active incident affecting the source system? A dashboard that shows revenue without showing "by the way, the payment processor had a three-hour outage last night" is lying by omission.

Experimental context is which flags and tests were active when the data was generated. Your MAU number is meaningless if you don't know that 40% of users were in an onboarding experiment that changed the activation flow. The number isn't wrong. It's just uninterpretable without the experiment metadata traveling alongside it.

Human context is ownership and decision history. Who defined this metric? What decisions have been made based on it? Where's the design doc? When someone has a question, they shouldn't have to play archeologist in Slack to figure out who to ask.

Most data systems capture maybe one of these. Usually the semantic layer, poorly, in a data catalog that nobody updates and fewer people read.
The Kubernetes Lesson I Should Have Learned Sooner
When I was the first product manager on Kubernetes at Google, we thought we'd solved the orchestration problem. Pods, services, deployments. State reconciliation. Declarative configuration. Ship your containers and let the scheduler figure out the rest.

What we hadn't solved was context.

A customer came to us wanting to run a global footprint of clusters, one per region, with synchronized jobs. Low-latency application, workloads coordinated across continents. We had an internal project called "Ubernetes" that was supposed to handle this, but the complexity was brutal. We ended up helping them build a custom solution.

The problem wasn't deploying the workloads. GitOps handles that fine now. The problem was that when data crossed cluster boundaries, all the context about that data evaporated. Each cluster was internally consistent. The global system was broken because nobody knew what the data meant in aggregate.

I've watched the same pattern repeat across every data problem I've worked on since. The compute orchestration is largely solved. The data orchestration is still a mess, and it's a mess because context doesn't travel with the data. This is actually why I'm writing a book on exactly this: the hidden complexity of data preparation that causes 80% of AI projects to fail.
Why RAG Doesn't Fix This
The popular assumption has been that Retrieval-Augmented Generation solves the context problem. Point your model at your documents, let it retrieve what it needs, problem solved.

InfoWorld's analysis last week explains why this breaks at scale: "RAG breaks at scale because organizations treat it like a feature of LLMs rather than a platform discipline. Models generate confidently incorrect answers because the retrieval layer returns ambiguous or outdated knowledge."

The failure mode is insidious. RAG with good retrieval but no context governance produces what I've started calling "hallucination with citations." The model cites a real document. The citation is accurate. The document is from 2023 and contradicts current policy. The answer is wrong, but it looks impeccably sourced.

CX Today reported on exactly this pattern: "If the knowledge base is outdated, RAG just retrieves the wrong answer faster. If content is unstructured, like PDFs, duplicate docs, or inconsistent schemas, the model struggles to pull reliable context."

The problem isn't retrieval. The problem is that the documents themselves carry no context about their validity, scope, or temporal bounds. A PDF is just a PDF. It doesn't know that it was superseded by a newer version, that it only applies to EMEA customers, or that the pricing section was invalidated by a board decision last quarter.

When VentureBeat declared "RAG is dead" in their 2026 predictions, they were being provocative. But the underlying point stands: RAG without context governance is dying. The organizations that will succeed with retrieval-augmented systems are the ones treating their knowledge bases as living, context-rich assets rather than static document dumps.
The Toll Booth Is Coming
There's a harder version of this problem emerging, and most organizations haven't noticed it yet.

Constellation Research warns that "enterprise data tolls and API economics are going to be a headache" in 2026. Celonis is suing SAP over data access. The Information reported that Salesforce is raising prices on apps that tap into its data. Connector fees are trickling down to IT budgets.

"Connection fees are going to be the new cloud egress," Constellation writes.

Here's what this means: if you don't own the context layer for your own data, you'll rent it from someone else. Every vendor building "AI-ready" connectors is essentially building a context layer on top of your data and charging you for access to the meaning of information you already own.

The Solutions Review predictions roundup makes this explicit: "By the end of 2026, connectivity, governance, and context provisioning for AI agents will be built into every serious data platform."

Built in. Not optional. Not a nice-to-have catalog project. Core infrastructure.

The organizations that treat context as someone else's problem will find themselves paying tolls to access the semantic meaning of their own customer data. The ones that invest now will own that layer.
Resolution #1: Ship Context With Every Event
The practical version of this starts at ingestion.

Every event entering your system should carry enough metadata that a reader (human or machine) can interpret it without external lookups. Not "user_id, timestamp, action" but "user_id, timestamp, action, schema_version, experiment_flags, source_system, data_classification."

This isn't aspirational. Anthropic's context engineering guide describes exactly this pattern: maintaining lightweight identifiers that allow systems to "dynamically load data into context at runtime using tools."

A transformation editor should show, live, which downstream dashboards and models will break if you drop a column. A query should surface its lineage alongside its results. A dashboard shouldn't just display a number; hovering over it should reveal the definition, the upstream tables, the freshness, and the last incident that affected it.

This requires tooling changes, yes. But mostly it requires treating context as a first-class output of every pipeline stage rather than an afterthought someone might add later.
Resolution #2: Make Context the Default in AI and Agents
TechCrunch's 2026 analysis identifies the Model Context Protocol as "quickly becoming the standard" for agent interoperability. OpenAI and Microsoft have embraced it. Google is standing up managed MCP servers.

The infrastructure for context-aware agents is arriving. The question is whether your data is ready to participate.

That means storing valid_from/valid_to timestamps on policy documents. It means tagging content with scope limitations (region, customer tier, product line). It means encoding data classification and retention rules at the source, not in a compliance spreadsheet nobody maintains.

Stanford HAI's predictions note that "2026 will hear more companies say that AI hasn't yet shown productivity increases." The organizations that do show productivity increases will be the ones whose agents can distinguish current reality from historical noise without human intervention.

An agent that refuses to take high-impact actions without verifying the environment, cohort, and guardrails is not cautious. It's correctly engineered. An agent that charges ahead on stale data with high confidence is the expensive kind of wrong.
Resolution #3: Measure Time to Trustworthy Insight
I wrote about Nicole Forsgren's new book last month. Her frameworks for developer productivity apply directly to data work, but with a crucial modification.

For data teams, the north star isn't deployment frequency or cycle time. It's time to trustworthy insight: from raw logs or events to a result you would put in front of an executive with confidence.

Most organizations can't measure this because they don't know when insight becomes trustworthy. The data arrives, transformations run, dashboards update, but confidence accrues informally. Someone senior enough eventually blesses the number based on vibes and experience.

Context infrastructure makes this measurable. If every metric carries its lineage, freshness, and incident history, you can ask: how long did it take from data landing to a metric with full provenance, no upstream incidents, and a defined owner? That's the number that matters.

When that number shrinks, you're actually improving. When people are just shipping dashboards faster without context, you're accumulating debt.
The Year We Stop Arguing About Definitions
Most New Year's resolutions fail by February. The gym membership lapses. The meditation app goes unused. The ambitious reading list gathers dust.

Data resolutions fail for the same reason: they're framed as one-time efforts rather than infrastructure changes. "We'll document our metrics" becomes a Q1 project that never gets maintained. "We'll improve data quality" becomes a dashboard that nobody checks.

Context isn't a project. It's a property of how data moves through your organization. It either travels with its story or it doesn't.

The organizations that treat 2026 as the year context becomes default will stop having the same arguments in every meeting. The exec review becomes a discussion of strategy instead of a debate about what "active users" means. The AI agent produces answers that come with their own credibility assessment. The data team ships products instead of debugging why downstream consumers don't trust the numbers.

Gartner says 60% of AI projects will fail for lack of AI-ready data. The projects that succeed will be the ones that stopped treating data as numbers and started treating it as knowledge.

That's the resolution. Make the data know what it is.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.

NOTE: I'm currently writing a book called "Zen and the Art of Data Maintenance" based on what I've seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost. I'd love to hear your thoughts!

Originally published at Distributed Thoughts.

The Natasha Problem: Why Your Data Pipeline Only Fits One Person

David Aronchick — Sun, 11 Jan 2026 06:11:42 +0000

For most folks, you probably don’t think about clothing sizes. There’s a number, you pick it, you try on the clothes, and if they fit, then congrats, you’re that number. But how’d they pick that number? And why does every style/line/person fit slightly differently?

Turns out there's a woman named Natasha in Los Angeles whose body is used to design jeans for seven or eight major clothing brands. She's what the industry calls a "fit model." When Levi's or H&M or whoever designs a new pair of jeans, they don't start with measurements of the general population. They design them on a mannequin, then bring in Natasha, and adjust everything until the jeans fit her perfectly.

She's the only person in the design process with an actual human body.

Every other size is mathematical extrapolation. The size 2 isn't designed for any real person. Neither is the 4, the 8, or the 12. They're all proportional adjustments from Natasha's measurements, calculated by formula. If you're not built exactly like Natasha, your clothes don't actually fit you. They fit a mathematical projection of you derived from someone else's body.

This is, as Radiolab's Heather Radke discovered, why the dressing room feels like a personal judgment. We've internalized the idea that clothes should fit, and when they don't, we assume the problem is our body. But the clothes were never designed for our bodies. They were designed for one body, then scaled with arithmetic.

I kept thinking about this because it's exactly how we build our IT infrastructure.
The Ruth O'Brien Problem
The Natasha situation has a history. In the 1930s, a woman named Ruth O'Brien at the Bureau of Home Economics tried to solve the sizing problem scientifically. She hired "measuring squads" through the WPA to travel the country and measure women's bodies across twenty-six different dimensions: elbow to wrist, thigh girth, heel length, and on and on.

She was going to create the definitive dataset of American women's bodies.

The resulting dataset became the basis for women's clothing sizes for decades. But there never was an “average” person. It was a statistical fiction.
Your Pipeline Has a Fit Model Too
Every data pipeline has its own Natasha. Usually it's the data from headquarters, or the first customer deployment, or whatever clean dataset was available when the system was designed.

I've watched this pattern play out dozens of times. A team builds an ETL pipeline that works beautifully on their development data, where the schema is clean, the timestamps are consistent, and the sensor readings arrive in perfect intervals. They deploy to production, and 40% of their edge sites start throwing errors.

The problem isn't that the edge data is wrong. The problem is that the pipeline was designed for one shape of data, then mathematically extrapolated to handle everything else.

Consider what happens with ML training data. ImageNet became the de facto standard for computer vision benchmarks. Models trained on ImageNet achieve remarkable accuracy on ImageNet test sets. Deploy those same models to a factory floor, and they struggle with the lighting, the angles, the dust on camera lenses, the specific way that this particular production line differs from the curated images in the training set.

The model was fit to ImageNet. Everything else is extrapolation.

Or look at timestamp handling. A pipeline designed on data from a single timezone assumes UTC normalization is someone else's problem. Works fine until you're ingesting from devices across twelve timezones, some of which handle daylight saving time transitions differently, some of which have clock drift, and one of which is running firmware from 2019 that uses a different epoch.

The pipeline wasn't wrong. It was designed for one reality and scaled mathematically to others.
The Measurement Squads Never Left
Ruth O'Brien's original sin was the belief that you could measure a population once, derive a standard, and apply it forever.

We do this constantly with data.

A team defines a schema based on current requirements. They encode assumptions about data types, nullable fields, value ranges, and relationships. Then they treat that schema as ground truth, and any data that doesn't conform is "dirty" and needs to be "cleaned."

But the data isn't dirty. The data is reality. The schema is the idealized projection that reality keeps refusing to match.

I saw a project once where the sensor format had been standardized across the organization with beautiful documentation and clear specifications. Every new deployment was supposed to conform. In practice, about 60% of edge sites had made local modifications: different firmware versions, custom calibration routines, integration with legacy equipment that predated the standard by a decade.

The central data team spent enormous effort "fixing" the non-conforming data with transformations to coerce it into the standard shape, imputation for missing fields, and interpolation for different sampling rates.

By the time the data reached the analytics layer, it had been mathematically adjusted to fit a shape it never had. The "cleaned" data was a fiction, no more real than a size 2 extrapolated from Natasha's measurements.
Bodies Resist Standardization. So Does Data.
The reality is that bodies cannot be forced into interchangeable parts. The entire apparatus of industrial manufacturing assumes standardization. You take raw materials, process them into uniform components, and assemble them into identical products. It works for cars and electronics, but it fundamentally doesn't work for human bodies.

The closer you get to where data is generated, the more specific and contextual it becomes. A sensor on a drilling rig in the Permian Basin produces readings shaped by the specific geology, equipment age, and operational patterns of that site. A sensor in the North Sea produces data shaped by entirely different conditions. Both might be "pressure readings," but they're not interchangeable.

The centralization assumption says: bring all the data to one place, normalize it, and then analyze. This works if the data is genuinely similar. It falls apart when the normalization process destroys the very information you needed.
The Dressing Room Moment
Radke describes the experience of trying on clothes that don't fit as a moment of internalized judgment. The clothes were never designed for your body, but you blame yourself anyway. The sizing system has convinced you that "normal" is a real thing, and you're the deviation.

Data teams have the same experience. The pipeline breaks on edge cases, and the team treats it as a data quality problem to be solved with more transformation logic, more coercion, more normalization, more effort to force reality into the shape the system expected.

But the pipeline was designed for one type of data. Everything else is extrapolation. When the extrapolation fails, that's information. It's telling you that your model of reality was incomplete.

The question isn't "how do we clean this data to fit our schema?" The question is "why did we assume all data would look like our development dataset?"

Martha Skidmore was the closest match to Norma out of 3,864 women, and she still wasn't Norma. She was a real person, with a real body, that happened to approximate a statistical fiction slightly better than 3,863 others.

Your edge data isn't defective. It's real. The pipeline is the fiction.

Originally published at Distributed Thoughts.

A Picture Is Worth Ten Thousand Tokens

David Aronchick — Tue, 30 Dec 2025 00:34:11 +0000

"A picture is worth a thousand words" has been greeting-card wisdom for a century, the kind of thing we nod along to while understanding it metaphorically because images convey emotion, show rather than tell, and bypass the limitations of language. What we didn't expect was for this to become literally, computationally true.

DeepSeek released a paper this fall that made a lot of people rethink what we know about LLM efficiency, At the core, one of the findings seems obviously wrong until you work through the math: if you render text onto an image and have a vision-language model decode it, you can achieve 97% accuracy while using one-tenth the tokens. Take a document with 1,000 text tokens, turn it into an image, and the model can reconstruct that text using just 100 vision tokens. For those that aren’t researchers, this seems insane: You’re telling me that words, representing by simple characters, are HARDER to make sense of than the same words but represented by pixels? Nuts.

What this possibly reveals under the hood could be pretty foundational. If something computationally "heavy" modality turns out to be the efficient one while the computationally "light" modality is actually the expensive one, then we’ve been thinking about a lot of things wrong. And, according to the page, it looks like we have been.
The Architecture That Makes This Work
DeepSeek-OCR pairs two components: an encoder called DeepEncoder (about 380M parameters) and a decoder based on their 3B MoE model with 570M active parameters. The encoder is the interesting part because it chains together a SAM-base model for local perception using window attention, a 16x convolutional compressor, and a CLIP-large model for global understanding in a sequence that exploits how these different attention mechanisms scale.

The key insight is that window attention (that’s the thing the model “looks” at to predict the right answer) processes lots of tokens cheaply because it only looks at local neighborhoods, which means you can afford to have thousands of tokens at that stage without blowing up your compute budget. Then the compressor aggressively reduces token count before the expensive global attention (where it compares its predictions with everything else) kicks in so you're only paying the quadratic attention cost on the compressed representation rather than the full input.

Feed it a 1024x1024 image and you get 4,096 patch tokens from the initial segmentation, but after compression that becomes just 256 vision tokens entering the decoder, and a 640x640 image yields only 100 tokens.

The magic isn't in any single component but in recognizing that compression can happen inside the pipeline rather than fighting against the text representation after the fact. And because the computational characteristics of vision encoders (cheap local processing followed by expensive global processing on a much smaller token set) happen to be more favorable than the characteristics of text transformers (expensive global processing on the full token count from the start), it gives DeepEncoder the ability to specifically to exploit that gap.
The Numbers That Matter
On their Fox benchmark testing documents with 600-1,300 text tokens, the results show a graceful degradation curve that suggests we're not just getting lucky on easy cases: at 100 vision tokens (compression ratio around 7-10x) they hit 87-98% OCR precision depending on document complexity, and at 64 vision tokens (pushing toward 20x compression) precision drops to 59-96%, which is still surprisingly usable for applications where you need the gist rather than perfect fidelity.

On OmniDocBench, a practical document parsing benchmark, DeepSeek-OCR with 100 vision tokens beats GOT-OCR2.0 which uses 256 tokens, and with fewer than 800 vision tokens it outperforms MinerU2.0 which averages over 6,000 tokens per page. In production they're processing 200,000+ pages per day on a single A100-40G, which isn't a research demo but a training data pipeline running at scale.
Why This Works (And Why It's Counterintuitive)
We've built our mental models around text as the native format for language understanding, and for good reason: text is what LLMs were designed for, text is lightweight, text is structured, and vision is the bolt-on capability we added later for multimodal tasks. But attention mechanisms don't care about our intuitions because they care about sequence length, and attention scales quadratically with sequence length, which means a document with 5,000 tokens pays the O(n²) cost across all 5,000 tokens regardless of how "lightweight" we think text ought to be.

Vision encoders, particularly modern ones with the window-then-global architecture DeepSeek uses, have fundamentally different computational characteristics because you're essentially buying cheap local processing at the window attention stage and only paying the quadratic cost on a much smaller token count after compression, which means the image isn't a burden you're adding to the model but a compression layer that happens to be more efficient than operating directly on text tokens. This is counterintuitive until you remember that efficiency depends on the compute path, not just the data size.
The Memory Decay Proposal
The paper includes a fascinating speculation in their discussion section that I think deserves more attention than the OCR results themselves. They draw a parallel between human memory decay over time, visual perception degradation over distance, and text compression at different resolutions, and their proposal is elegant: for multi-turn conversations, render older dialogue turns as images at progressively lower resolutions.

Recent context stays high-resolution (their "Gundam" mode, 800+ tokens) while older context gets progressively downscaled to Large mode for yesterday's conversation, Base mode for last week, and Small or Tiny for anything older.

The information doesn't disappear but compresses into something lossy and gist-preserving, which mirrors something real about how memory actually works: you don't remember conversations from a year ago at the same fidelity as conversations from an hour ago, but the information is still there in some form, accessible if you need it but not consuming the same cognitive resources as recent experience.

The engineering implication is that instead of choosing between "keep full context" and "truncate old context," you get a spectrum where the context window becomes a memory system with natural decay characteristics built into the architecture itself, giving you a gradient rather than a cliff.
The Distributed Systems Angle
There's a pattern here that feels familiar from data infrastructure because the optimal representation for processing isn't always the optimal representation for storage or transmission. We compress video for streaming then decompress for playback, we convert data to columnar formats for analytics even though row-oriented formats are more natural for transactional workloads, we build materialized views that trade storage for query performance, and we ship compute to data when moving the data would be more expensive than moving the code.

What DeepSeek is demonstrating is that text-to-image-to-text can be a legitimate processing pipeline, not because images are somehow "better" but because the computational characteristics of the vision encoder path happen to be more favorable for certain workloads and the transformation overhead pays for itself in reduced attention costs. This is compute-over-data thinking applied to tokens themselves: instead of asking "how do we process this text efficiently," you ask "what representation makes the compute most efficient, and is the transformation cost worth it?"

For long documents, the answer might genuinely be to render to image, process visually, and reconstruct text.
What This Means
DeepSeek is careful to call this "early-stage work that requires further investigation," and they're right to be cautious because OCR is a specific task where you have ground-truth text to measure against and general language understanding is considerably harder to evaluate. But the direction is suggestive in ways that go beyond document processing.

If you're building systems that process long documents, analyze historical conversations, or maintain persistent context across sessions, the architecture you're probably using (longer context windows, sliding attention, memory banks, retrieval augmentation) might be competing with an approach that seems absurd on its face: just turn the text into pictures. The optimal representation for text processing in an LLM might not be text, which is either a profound insight about the nature of these systems or a temporary artifact of current architectures that will disappear when we build better text processing.

I genuinely don't know which, but when the counterintuitive result has solid empirical backing, that's usually where the interesting questions live.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.*

Originally published at Distributed Thoughts.

NVIDIA Bought the Bouncer: SchedMD and Where Lock-In Actually Lives

David Aronchick — Mon, 29 Dec 2025 00:37:14 +0000

On December 15, 2025, NVIDIA acquired SchedMD, a 40-person company based in Lehi, Utah. The price wasn't disclosed, the press release emphasized a commitment to open source, and most coverage focused on NVIDIA’s expanding software portfolio, thereby missing the point entirely. Most folks missed how huge this was.

SchedMD maintains Slurm, the workload manager running on 65% of the TOP500 supercomputers, including more than half of the top 10 and more than half of the top 100. Every time a researcher submits a training job, every time an ML engineer queues a batch inference run, every time a national lab allocates compute for a simulation, there's a decent chance Slurm is deciding which GPUs actually run it.

Everyone's been watching the CUDA moat. Judah Taub's recent Substack piece frames it perfectly: the programming model as the source of lock-in, with five potential escape routes ranging from OpenAI's Triton to Google's TPUs to AMD's ROCm to Modular's Mojo to Tenstorrent's RISC-V approach. All of which are valid competitive threats.

But NVIDIA, to their credit, saw through the programming model debates and identified one of the key ways to accelerate the scale-out. They bought the bouncer.
What Slurm Actually Does
If you've never submitted a job to an HPC cluster, Slurm is invisible infrastructure, and that's intentional. Researchers type sbatch my_training_job.sh and their code runs on GPUs. Still, how those GPUs get allocated, when the job actually starts, which nodes handle which portions of distributed training, how competing jobs get prioritized, whether your experiment runs tonight or next Tuesday—that's all Slurm.

The formal description sounds almost TOO basic: "allocating exclusive and/or non-exclusive access to resources, providing a framework for starting, executing, and monitoring work, and arbitrating contention for resources by managing a queue of pending jobs."

The reality is that Slurm is the layer that translates organizational policy into compute allocation. This includes things like: * Fair-share scheduling across research groups * Priority overrides for deadline-sensitive projects * Resource limits that prevent any single user from monopolizing a cluster * Preemption policies that balance throughput * Responsiveness to Hilbert curve scheduling that optimizes for network topology

And lots more. Or just launching a job without requiring SSH!

Every organization running Slurm has encoded its resource management philosophy into its configuration over years of tuning, with institutional knowledge baked into partition definitions and quality of service policies, accounting systems tied to grants and budgets, and user training built around Slurm commands. This isn't a program you swap out over a weekend.
Why Slurm Won
Slurm wasn’t the obvious choice. When development began at Lawrence Livermore National Laboratory in 2001, the HPC world ran on proprietary schedulers: PBS (Portable Batch System) had variants everywhere, IBM's LoadLeveler dominated their ecosystem, Quadrics RMS handled specialized clusters, and Platform Computing's LSF (Load Sharing Facility) served enterprise HPC.

LLNL wanted something different because they were moving from proprietary supercomputers to commodity Linux clusters and needed a resource manager that could scale to tens of thousands of nodes, remain highly portable across architectures, and stay open source. The 2002 first release was deliberately simple, and the name originally stood for "Simple Linux Utility for Resource Management" (the acronym was later dropped, though the Futurama reference remained).

What happened next is a case study in how open source wins infrastructure markets.

PBS fragmented into OpenPBS, Torque, and PBS Pro (now Altair), with each fork diluting the community and scattering innovation, leaving organizations that chose PBS to pick which one, and none had the whole community behind them. THEN LSF went commercial when IBM acquired Platform Computing in 2012. While enterprise support is valuable, licensing costs matter when you're scaling to thousands of nodes, which makes the open-source alternative increasingly attractive. THEN Grid Engine's ownership bounced between Sun Microsystems, Oracle, and Univa, with each transition eroding community trust as development priorities shifted with corporate strategy.

Slurm stayed focused on one codebase with GPLv2 licensing that couldn't be closed and a plugin architecture that let organizations customize without forking. And in 2010, Morris Jette and Danny Auble, the lead developers, left LLNL to form SchedMD (https://en.wikipedia.org/wiki/SchedMD), creating a commercial support model that kept the software free while funding continued development—the Red Hat playbook, applied to HPC scheduling.

Hyperion Research data from 2023 shows that 50% of HPC sites use Slurm, while the next closest, OpenPBS, sits at 18.9%, PBS Pro at 13.9%, and LSF at 10.6%. The gap isn't closing, and it's widening.
The Two-Door Strategy
In parallel with ALL that noise, NVIDIA wasn’t sitting around flat-footed.

In April 2024, NVIDIA acquired Run:ai for approximately $700 million. Run:ai builds Kubernetes-based GPU orchestration, and if Slurm is how supercomputers and traditional HPC clusters manage GPU workloads, Run:ai is how cloud-native organizations do the same thing on Kubernetes—different paradigms serving the same function, and NVIDIA now owns the scheduling layer for both.

Run:ai handles the world that emerged from containers and microservices: * Organizations running on GKE, EKS, or on-prem Kubernetes clusters * Data science teams whose workflows are built around Jupyter notebooks, Kubeflow, and MLflow * And companies that think in pods and deployments rather than batch queues and node allocations.

Slurm handles the world that emerged from supercomputing: national labs, research universities, pharmaceutical companies running molecular dynamics, financial firms running risk simulations, organizations where HPC predates the cloud, and where "scale" means dedicated clusters with thousands of nodes.

Both roads lead to GPUs, and NVIDIA now controls traffic on both.
What Lock-In Actually Looks Like
Judah Taub's CUDA analysis is correct that the programming model creates real lock-in, because rewriting GPU kernels for a different platform is expensive, and the ecosystem of libraries, tools, and community knowledge around CUDA represents decades of accumulated investment.

But programming models can be abstracted, compilers translate, and compatibility layers exist. PyTorch runs on AMD GPUs via ROCm, JAX runs on TPUs, and the code you write doesn't have to be tied permanently to CUDA, even if the transition has friction.

Orchestration creates a different kind of stickiness, because your workflows are encoded in Slurm through every batch script, every job array definition, every dependency chain that says "run step B only after step A completes successfully,” and that's not just code but institutional memory. Your accounting systems integrate with Slurm through reports that show department heads how their GPU allocation was used, chargeback systems that bill internal projects, and compliance logs that verify your government-funded research ran on approved infrastructure. Your users know Slurm through the commands they type without thinking, the debugging instincts for when jobs hang or fail, the training materials your HPC team developed, and the Stack Overflow answers they Google at 2 AM. Your cluster topology is optimized for Slurm's algorithms through a network configuration that aligns with Slurm’s understanding of a fat-tree topology, a partition structure that reflects your organizational hierarchy, and node groupings that balance locality and fairness.

Switching schedulers isn't a recompile; it's a reorganization.
The Promise and the Pattern
NVIDIA says Slurm will remain open source and vendor-neutral, and the GPLv2 license makes closing the source legally problematic anyway, so SchedMD's existing customers aren't about to get cut off.

But control of the roadmap is different from control of the code.

When NVIDIA prioritizes features, which hardware gets first-class Slurm support? When performance optimizations ship, which GPUs benefit most? When integrations between Slurm and the rest of NVIDIA’s software stack tighten, does the "vendor-neutral" promise mean equal optimization for AMD and Intel accelerators?

The pattern exists in enterprise software: Oracle doesn't prevent you from running MySQL, Microsoft doesn't prevent you from using GitHub with non-Azure clouds, but the integration points, the polish, and the performance optimizations flow toward the owner's products.

NVIDIA's official line emphasizes that Slurm "forms the essential infrastructure used by global developers, research institutions, and cloud service providers to run massive scale training infrastructure," which is true—and now NVIDIA owns that essential infrastructure.
The Distributed Gap
There's a less-discussed implication in all this.

Traditional HPC scheduling, whether Slurm or its competitors, assumes a particular architecture: a big, centralized cluster where jobs are scheduled across nodes, making the optimization problem one of matching jobs to resources within a unified system.

This architecture works well when data and compute are co-located, with training runs pulling from high-speed parallel file systems and simulations operating on datasets staged to local storage, making the cluster a world unto itself.

But increasingly, that's not the world in which organizations operate.

Data sovereignty requirements mean datasets can't always move to where the GPUs are; edge deployments generate data that shouldn’t traverse networks just to run inference; federated learning needs to coordinate training across institutions without centralizing sensitive information; and multi-cloud strategies mean compute is scattered across providers, regions, and architectures.

Run:ai helps with Kubernetes-based orchestration but assumes Kubernetes, while Slurm helps with HPC workloads but assumes a traditional cluster architecture. Neither solves the problem of "I have data in 50 locations, compute in 12 different configurations, and regulatory constraints that prevent me from pretending this is one big cluster."

NVIDIA's acquisitions reinforce the gravitational pull toward centralization: bigger clusters, more GPUs, bring your data to us. That's a valid architecture for many workloads, and for foundation model training at hyperscale, it might be the only architecture.

But it's not the only architecture that matters, and the orchestration gap for truly distributed computing remains wide open. (We have some thoughts if you’re interested :))
What NVIDIA Actually Understood
Credit where it's due: NVIDIA read the landscape correctly.

The hardware competition gets the attention, with AMD's MI300X, Intel's Gaudi, Google's TPUs, and startups raising hundreds of millions to build custom silicon, keeping everyone focused on the chip.

NVIDIA looked one layer up and recognized that whoever owns the orchestration layer owns the decision about which chips run which workloads, because the scheduler doesn't just allocate resources; it also encodes assumptions about what resources exist and how they should be used.

By acquiring both Slurm and Run:ai, NVIDIA ensures that, regardless of which paradigm you use (traditional HPC or cloud-native Kubernetes), the software layer that schedules your GPU workloads comes from NVIDIA, meaning alternatives to CUDA still need to run through NVIDIA's orchestration. It's like owning both the road and the traffic lights: the cars might be different, but they all stop at the same intersections.
Where This Leaves Everyone Else
For organizations already running Slurm, not much changes immediately because the software remains open source, SchedMD's support contracts presumably continue, and the 40 employees who built their careers around making Slurm work are now NVIDIA employees with presumably NVIDIA resources.

For organizations building alternatives to NVIDIA's hardware dominance, the landscape has grown harder: your new accelerator needs software ecosystem support, which now means either convincing NVIDIA-owned Slurm to treat your hardware as a first-class citizen or building your own orchestration layer from scratch.

For anyone thinking about distributed computing that doesn't fit the cluster model, the message is clear: the major players aren't building for you, and the orchestration layer for truly distributed, heterogeneous, data-gravity-respecting deployments doesn't exist in their portfolio.

That's both a challenge and an opportunity.

The CUDA moat is real, but it was always visible, always discussed, always the focus of competitive energy. The orchestration moat is quieter because Slurm doesn't make headlines like GPUs do, and scheduling software isn't sexy; it's just where the actual decisions get made.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.*

Originally published at Distributed Thoughts.

Edge ML Has a Size Obsession

David Aronchick — Tue, 16 Dec 2025 00:34:14 +0000

UPS could deliver your Amazon package on a cargo e-bike. In most cities, for most packages, this would actually be faster. No parking. No traffic. Straight to your door.

Instead, a 16,000-pound truck idles outside your apartment building while the driver walks up three flights of stairs with an envelope containing a phone charger.

It's not that UPS is stupid. The truck handles the complicated cases: bulk deliveries, heavy items, and commercial routes with 200 stops. Once you've built infrastructure for complex cases, running it for easy cases feels free. Same truck, same driver, same route. Why optimize?

But "feels free" isn't free. The truck burns diesel at idle. It needs a commercial parking spot that doesn't exist. The driver spends 30% of their day not delivering packages but managing the logistics of operating a vehicle designed for a more complex problem than most stops actually present.

Edge ML has the same problem. We built the infrastructure for complex cases (language models, multimodal reasoning, generative AI), and now we're using it for everything. Sensor classification? Deploy a neural network. Anomaly detection? Fine-tune a transformer. Predictive maintenance? Surely this needs deep learning.

A quantized Llama 3B takes 2GB on disk and 4GB in memory. A 4-bit quantized 7B model still needs roughly 4GB. Want to run a 70B model? Even with aggressive quantization, you're looking at 35GB minimum.

A scikit-learn random forest for the same classification task takes 50KB.

The industry spent three years figuring out how to squeeze the truck into tighter parking spaces. Most deliveries never needed the truck.
Two Mistakes, Not One
The size obsession hides two distinct problems. First: teams often reach for the wrong vehicle entirely. Second: even with the right vehicle, the route planning determines whether packages arrive.

Most edge deployments handle predictive maintenance, anomaly detection, sensor classification, and quality control. These are tabular data problems. A NeurIPS 2022 paper confirmed what practitioners already suspected: tree-based models such as XGBoost and Random Forests outperform deep learning on tabular data across 45 benchmark datasets. A study on industrial IoT found that XGBoost achieved 96% accuracy in predicting factory equipment, reducing downtime by 45%. Random forests hit 98.5% on equipment failure classification.

TensorFlow Lite Micro fits in 16KB. TinyML gesture recognition runs at 138KB and 30 FPS. These aren't compromised models. They're right-sized for their problems. The e-bike, not the truck.

But here's what matters more: whether you're dispatching e-bikes or trucks, you need routes that work. And route planning is why 70% of Industry 4.0 AI projects stall in pilot.

The models work in demos. Deployment breaks them.
The Orchestration Gap
In the early days of Kubrenetes, we made a version of this mistake. We thought container scheduling was the hard part. The hard part was everything after scheduling: networking, storage, observability, updates, rollbacks. The entire operational lifecycle.

Edge ML is learning this lesson now. Where MLOps ends with a packaged model, orchestration begins. And orchestration is where edge ML goes to die.

Think about what makes delivery logistics hard. It's not the vehicles. It's coordinating thousands of them across changing conditions. And when you're operating with vehicles/artifacts that are too big for the use case, you're just making everything harder.

Model staleness hits regardless of model size. Edge models, once deployed, might not be frequently updated. A classifier trained on 2024 patterns doesn't recognize 2025 anomalies. Rolling out updates across thousands of devices is nontrivial, whether you're pushing 50KB or 5GB. This is the equivalent of route maps that don't know about new neighborhoods. Your drivers show up, but they can't find the addresses.

Fleet heterogeneity compounds everything. Devices don't update uniformly. You end up managing fragmented fleets in which different nodes run different model versions with varying capabilities. Cloud deployments update in minutes. Edge deployments take weeks, sometimes months. Some devices never update at all. Imagine dispatching trucks, vans, and bikes from the same warehouse without a unified system that tracks which vehicle has which capabilities. Version skew creates subtle bugs that only manifest at scale.

Energy constraints create hard limits that benchmarks ignore. Thermal throttling kicks in when you stress mobile CPUs. Even a small model running continuous inference drains batteries and generates heat. Academic papers report Joules per prediction. Users report their phone dying by 2pm. The diesel cost that accumulates invisibly on every route, whether you're carrying one envelope or a hundred boxes.

Network variability breaks every cloud-native assumption. Traditional MLOps assumes stable, high-bandwidth connections. That assumption doesn't hold when inference pipelines need to survive outages, intermittent connectivity, or bandwidth that costs real money. What happens when your edge device goes offline for a week? When does it reconnect with stale models and queued data? It's like planning routes that assume every road is always open. The moment a bridge closes, your whole system breaks.
The Data Pipeline Problem
This is where "feels free" really isn't free. Edge ML isn't failing because models are too big. It's failing because data pipelines weren't designed for bidirectional flow.

Delivery networks learned this decades ago. Packages flow out from warehouses, but returns flow back. Damage reports flow back. Delivery confirmations flow back. Reverse logistics are just as important as forward logistics, and often harder.

The traditional ML assumption:
Edge Device → Cloud → Inference → Response

What edge ML actually needs:
Edge Device ↔ Local Inference ↔ Selective Sync ↔ Model Updates ↔ Back to Edge

This bidirectionality creates problems most teams don't anticipate. As IBM's edge deployment guide notes: "In an edge deployment scenario, there is no direct reason to send production data to the cloud. This may create the issue that you'll never receive it, and you can check the accuracy of your training data. Generally, your training data will not grow."

Your model improves based on the data it sees. If that data never leaves the edge, your model never improves. But if all data goes to the cloud, you've rebuilt the centralized architecture you were trying to escape, with extra latency and bandwidth costs. It's like routing every return through your main distribution center instead of handling them at local hubs. Technically correct, but operationally a nightmare.

Edge-to-cloud ETL pipelines are emerging as critical infrastructure. They need real-time ingestion, adaptive transformation, graceful degradation when connectivity fails, and respect for data sovereignty constraints. A 50KB model and a 5GB model face identical challenges here. The pipeline doesn't care about parameter count, just like the route doesn't care whether you're driving a truck or riding a bike.
What Actually Works
The teams succeeding with edge ML have stopped optimizing vehicles and started optimizing routes.

Tiered inference separates quick decisions from complex reasoning. Vector search at the edge runs in 5-10ms using in-memory indexes. No GPU required. Simple classifications and caching happen locally. Complex reasoning routes selectively when network allows. This is the e-bike for last-mile delivery, the truck for bulk warehouse transfers. Match the vehicle to the delivery, not the other way around.

Edge MLOps mirrors replicate minimal cloud capabilities locally. When the network disappears, edge nodes still manage model lifecycle, handle updates from local cache, and queue telemetry for later sync. This approach acknowledges what cloud-native architectures ignore: networks fail. Devices go offline. The question isn't whether your deployment loses connectivity. It's whether it keeps working when it does. Local dispatch centers that function when headquarters goes dark.

Data locality as first principle means processing where data lives, not where servers are convenient. By 2025, over 50% of enterprise data will be processed at the edge, up from 10% in 2021. This shift is already happening in manufacturing, retail, healthcare, and logistics. Organizations adapting successfully treat edge deployment as first-class infrastructure, building intelligent data orchestration that moves compute to data rather than data to compute. Deliver from the nearest warehouse, not the central hub.

Selective synchronization solves the training data problem. Not all edge data needs to reach the cloud, but representative samples do. Anomalies do. Edge cases that challenged local models do. Smart filtering at the edge, with policies that adapt based on model confidence and data novelty, keeps training pipelines fed without overwhelming bandwidth or centralized storage. Send back the damage reports. Don't send back confirmation that every package arrived fine.

This is exactly why we built Expanso around data orchestration rather than model serving. The model isn't the bottleneck, whether it's a 50KB decision tree or a 4GB quantized LLM. The bottleneck is getting the right data to the right place at the right time, coordinating updates across heterogeneous fleets, and maintaining observability when half your nodes are intermittently connected. Our approach treats edge nodes as first-class participants in data pipelines, not afterthoughts bolted onto cloud architectures. Route planning, not vehicle engineering.
Where This Is Heading
$378 billion in projected edge computing spending by 2028. IDC expects edge AI deployments to grow at 35% CAGR over the next three years. That investment isn't going into building better trucks. The quantization problem is largely solved. The money is going into the logistics layer that makes edge deployment actually work.

Federated learning is moving from research curiosity to production requirement. It's the only practical way to improve models from edge data without centralizing that data, solving the training feedback loop that IBM's guide warned about. Standardized edge-cloud orchestration protocols are emerging to simplify deployment across heterogeneous environments. The security surface is expanding dramatically as AI distributes across thousands of devices rather than sitting in secured data centers.

The companies navigating this successfully aren't the ones with the smallest vehicles or the fastest engines. They're the ones who recognized early that vehicle optimization was table stakes, not competitive advantage. The hard problems were always about fleet management, route planning, package tracking, and graceful degradation when conditions change.
The Right Questions
Not "how do I compress my neural network to fit on edge hardware?"

Start with "what's the simplest model that solves my actual problem?" For sensor data, that's often a decision tree. Kilobytes, not gigabytes. Proven to outperform neural networks on tabular data. For language tasks, yes, you need transformers. Adobe's SlimLM shows what's possible: 125M-1B parameters, document assistance on smartphones.

Then ask "can my infrastructure actually deploy and maintain this?" Can you push updates to a fragmented fleet? Can your edge nodes operate when disconnected? Does your data pipeline support bidirectional flow? Can you monitor inference quality across thousands of distributed nodes?

The size obsession missed the point twice: once by reaching for complex models when simple ones work better, and again by focusing on compression when deployment was the actual bottleneck.

UPS isn't going to start delivering envelopes on e-bikes anytime soon. The truck infrastructure exists. The routes are planned. The drivers are trained. Switching has costs.

But if you're building edge ML from scratch, you get to choose. You can build the truck fleet because trucks are what serious logistics companies use. Or you can look at what you're actually delivering and pick the right vehicle for the job.

A 50KB model that deploys beats a 50MB model that doesn't. But even an e-bike needs a route that works.

The edge isn't where ML projects go to die. It's where the logistics need to grow up.

Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.*

Originally published at Distributed Thoughts.

Emergence vs. Engineering: The Industry Just Bet Against the God Model

David Aronchick — Sat, 13 Dec 2025 18:09:33 +0000

Monday, OpenAI, Anthropic, Google, Microsoft, and AWS jointly donated their agent infrastructure to the Linux Foundation. If any of them actually believed a single model would achieve AGI in 2-3 years, this would be the dumbest move in corporate history.

You don't standardize the plumbing when you're about to build God.

The Agentic AI Foundation launched with three projects: Anthropic's Model Context Protocol (MCP) for connectivity, Block's goose for execution, and OpenAI's AGENTS.md for instructions. Together they form a complete stack for building composable AI systems, many specialized tools working through standard interfaces.

This isn't a technical footnote. It IS a recognition that no one is going to be able to do it all themselves.

For many MANY years, we've tried to engineer general intelligence from first principles. The results are impressive but bounded. This week, you could argue, the AI industry quietly bet on a different approach: letting intelligence emerge from simpler components.
The Physics Problem
Tim Dettmers published "Why AGI Will Not Happen" the day after the MCP announcement. His argument is remarkably clear.

"Computation isn't abstract. It happens in silicon, constrained by the speed of light, thermodynamics, and the square-cube law. Moving global information to local neighborhoods scales quadratically with distance. Memory becomes more expensive relative to compute as transistors shrink. "If you want to produce 10 exaflops on a chip, you can do that easily," Dettmers writes, "but you will not be able to service it with memory."

GPUs maxed out their performance-per-dollar around 2018. The gains since then came from one-off features: 16-bit precision, Tensor Cores, HBM, 8-bit quantization, 4-bit inference. Those tricks are exhausted. Dettmers estimates maybe one or two more years of meaningful scaling improvements before we hit the wall.

The transformer architecture itself is already near physically optimal. There doesn't appear (BUT I HAVE BEEN WRONG MANY TIMES BEFORE) to be a clever redesign waiting in the wings to unlock another order of magnitude.

Superintelligence? Fantasy. Recursive self-improvement still obeys scaling laws. An AI improving itself faces the same diminishing returns as engineers improving it externally. You're filling gaps in capability, not extending the frontier.

If you can't engineer your way to general intelligence through scale, what's the alternative?

The same thing that produced intelligence in nature: emergence.
More Is Different
In 1972, physicist Philip Anderson published "More is Different" in Science. It became one of the most cited papers in complexity research and helped establish the Santa Fe Institute.

Anderson's argument was profound: reductionism doesn't imply constructionism. You can break a system down into its fundamental parts, but you cannot rebuild complex behavior by assembling those parts. "At each new level of complexity," he wrote, "entirely new properties appear."

Consciousness isn't hiding in neurons. Traffic patterns don't exist in individual cars. The economy isn't a property of any single transaction. These phenomena emerge from interactions between simpler component, and they can't be predicted or engineered from first principles.

This isn't mysticism. It's how complex systems actually work.

The Santa Fe Institute defines emergence as "properties at one scale that are not present at another scale." Complex adaptive systems share common features: many agents, each intelligent and adaptive within their domain, none possessing complete information about the whole. Global patterns arise from local interactions without central control.

You don't engineer emergence. You create conditions for it.
The Ant Colony Test
Deborah Gordon at Stanford has spent decades studying ant colonies. Her description of individual ants is memorable: "I probably wouldn't hire them."

And yet collectively, ants build complex nests, find food sources efficiently, coordinate defense, and adapt to changing environments. Zero central control. The queen doesn't manage; she reproduces. As Gordon puts it, "Tasks allocate workers, rather than a manager allocating tasks to workers."

The mechanism is stigmergy: coordination through environmental modification. Ants leave pheromone trails that influence other ants' behavior. Simple rules at the individual level (follow strong trails, lay pheromones when successful)produce sophisticated collective intelligence.

Gordon draws the parallel explicitly: "In many ways, understanding the behavior of ant colonies could teach us about the way billions of relatively simple neurons work together in our brains."

The brain follows the same pattern. Neurons aren't conscious. They fire or don't fire based on local inputs. Consciousness emerges from billions of these simple interactions. There's no central "intelligence unit" directing traffic, no homunculus watching the show.

Decentralized control. Simple rules. Local interactions producing global behavior. Resilience through redundancy. Adaptation without central planning.

This is how nature builds intelligence. Not by engineering a god, but by enabling a swarm.
The Pattern Repeats
The internet works the same way.

David Clark's 1988 paper on DARPA's design philosophy reveals remarkably minimal assumptions: the network can transport a datagram with reasonable, not perfect, reliability. That's it. Everything else emerges from endpoints following simple protocols.

TCP/IP split responsibility deliberately. Keep IP simple and flexible. Push complexity to the edges. "Fate-sharing" means intelligent endpoints, dumb pipes. The result: a decentralized system that scaled beyond anyone's imagination and survives failures that would destroy centralized alternatives.

Unix philosophy follows the same template. Ken Thompson and Doug McIlroy: "Make each program do one thing well. Expect the output of every program to become the input to another." Small tools, standard interfaces, emergent capability from composition.

Nobody said "let's build one giant program that does everything." That was the mainframe mentality, and it lost.

I watched this pattern win with Kubernetes. We didn't build bigger VMs. We built smaller containers with standard interfaces and let orchestration handle the complexity. The sophisticated behavior emerged from composition, not from engineering a monolith.
What MCP Actually Means
The MCP donation makes sense through this lens.

With 97 million monthly SDK downloads and adoption by Claude, ChatGPT, Gemini, Microsoft Copilot, Cursor, and VS Code, MCP has become TCP/IP for AI agents: the standard protocol for connecting models to tools, data, and services.

David Soria Parra, MCP's lead maintainer: "The main goal is to have enough adoption in the world that it's the de facto standard."

Nick Cooper from OpenAI: "We need multiple protocols to negotiate, communicate, and work together to deliver value for people, and that sort of openness and communication is why it's not ever going to be one provider, one host, one company."

Read that again. OpenAI's own engineer saying it's not ever going to be one company.

When your fiercest competitors agree on a protocol, they're hedging. They're building for a world where no single system wins. They're betting on emergence over engineering.
The Honest Assessment
This doesn't mean AI won't be transformative. It means the path isn't "scale until AGI."

It's: build composable tools, let emergence do the heavy lifting.

Dettmers contrasts the US "winner-take-all" philosophy (that is betting everything on frontier models) with China's "economic diffusion" approach, integrating AI capabilities throughout the economy. The diffusion strategy doesn't require AGI. It requires useful, composable tools that produce emergent value when combined.

The MCP ecosystem is infrastructure for exactly this. Specialized agents handling narrow tasks, connected through standard protocols, producing collective intelligence that no individual component possesses.

Ant colonies. Neural networks. The internet. Unix. Kubernetes. Now AI agents.

The pattern keeps winning because it's how complexity actually works.
The Kicker
Fifty years ago, Philip Anderson argued that you can't construct complexity from simple parts through pure engineering. Emergence requires different tools, different thinking. You don't build intelligence; you create conditions for it to arise.

This week, the AI industry admitted he was right.

When OpenAI, Anthropic, Google, Microsoft, and AWS all agree on something, pay attention. They're not building for a world where one model solves everything. They're building for emergence.

The god model was always a fantasy.

The swarm is real.

Originally published at Distributed Thoughts.