<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: David Aronchick</title>
    <description>The latest articles on Forem by David Aronchick (@aronchick).</description>
    <link>https://forem.com/aronchick</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1294202%2Fe7ab50ef-66a0-4ab1-b75f-30006ae9a811.jpeg</url>
      <title>Forem: David Aronchick</title>
      <link>https://forem.com/aronchick</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aronchick"/>
    <language>en</language>
    <item>
      <title>The Brownian Ratchet for Data</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Fri, 23 Jan 2026 00:35:31 +0000</pubDate>
      <link>https://forem.com/aronchick/the-brownian-ratchet-for-data-1ngc</link>
      <guid>https://forem.com/aronchick/the-brownian-ratchet-for-data-1ngc</guid>
      <description>&lt;p&gt;&lt;a href="https://www.expanso.io/blog/brownian-ratchet-for-data?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Monday&lt;/a&gt; I wrote about how &lt;a href="https://github.com/dlorenc/multiclaude?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;multiclaude&lt;/a&gt; and &lt;a href="https://steve-yegge.medium.com/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;GasTown&lt;/a&gt; converged on nearly identical primitives for multi-agent orchestration. The key insight wasn't about prompts or models or agent personas. It was about infrastructure: CI is the ratchet. Let chaos reign. Multiple agents, overlapping work, duplicated effort, whatever. As long as you have a mechanism that only captures forward progress, you're good.&lt;/p&gt;

&lt;p&gt;That phrase has been rattling around my head ever since. Because here's the thing: we have this for code. &lt;strong&gt;What's the equivalent for data?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;The Missing Ratchet&lt;/h2&gt;

&lt;p&gt;CI transformed software development by giving us a one-way gate. Code either passes or it doesn't. No negotiations, no exceptions, no "we'll fix it later." The ratchet clicks forward, and it never clicks back.&lt;/p&gt;

&lt;p&gt;Data has no such mechanism.&lt;/p&gt;

&lt;p&gt;Oh, we have tools. We have &lt;a href="https://greatexpectations.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Great Expectations&lt;/a&gt; (pun intended). We have dbt tests and schema validators and anomaly detectors. But none of them function as &lt;em&gt;the arbiter&lt;/em&gt;: the single, uncompromising source of truth that says "this data is real now, and we're never going backward."&lt;/p&gt;

&lt;p&gt;Instead, we have... hope? Process? Tickets that say "data quality issue" that sit in someone's backlog for three sprints while the dashboard keeps serving numbers that everyone knows are wrong but nobody can prove?&lt;/p&gt;

&lt;h2&gt;What Would a Data Ratchet Look Like?&lt;/h2&gt;

&lt;p&gt;Let's steal the multiclaude architecture and apply it to data:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Code Ratchet&lt;/th&gt;&lt;th&gt;Data Ratchet&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;CI tests&lt;/td&gt;&lt;td&gt;Schema validation + semantic checks&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Passing tests&lt;/td&gt;&lt;td&gt;Data meeting quality thresholds&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Merged PRs&lt;/td&gt;&lt;td&gt;Verified, immutable records&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Git history&lt;/td&gt;&lt;td&gt;Data lineage with provenance&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Multiple agents&lt;/td&gt;&lt;td&gt;Multiple validators / transformation paths&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The principle is the same: &lt;strong&gt;chaos is fine, as long as we ratchet forward.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multiple data sources can feed into your system. They can be messy, inconsistent, formatted in ways that make you question whether the upstream team has ever heard of ISO 8601. That's the Brownian motion: the random thermal energy of the real world generating data in a thousand incompatible ways.&lt;/p&gt;

&lt;p&gt;But the ratchet, the verification layer, only lets validated data through. And once it's through, it's permanent. Immutable. Part of the record.&lt;/p&gt;

&lt;h2&gt;The Four Components&lt;/h2&gt;

&lt;p&gt;I think a data ratchet needs four things:&lt;/p&gt;

&lt;h3&gt;1. The Pawl: Schema as Contract&lt;/h3&gt;

&lt;p&gt;JSON Schema (or Avro, or Protobuf, whatever floats your boat) isn't just documentation. It's the pawl that prevents backward motion. Data either conforms or it doesn't. No partial credit.&lt;/p&gt;

&lt;p&gt;Here's what a schema-as-pawl actually looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "SensorReading",
  "type": "object",
  "required": ["device_id", "timestamp", "value", "unit"],
  "properties": {
    "device_id": {
      "type": "string",
      "pattern": "^[A-Z]{2}-[0-9]{6}$"
    },
    "timestamp": {
      "type": "string",
      "format": "date-time"
    },
    "value": {
      "type": "number",
      "minimum": -273.15
    },
    "unit": {
      "type": "string",
      "enum": ["celsius", "fahrenheit", "kelvin"]
    }
  },
  "additionalProperties": false
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Notice &lt;code&gt;additionalProperties: false&lt;/code&gt;. That's the pawl. You can't sneak extra fields through. You can't send &lt;code&gt;"value": "hot"&lt;/code&gt; instead of a number. You can't omit the timestamp and promise to fill it in later.&lt;/p&gt;

&lt;p&gt;But here's where most systems fail: they treat schema validation as a warning, not a wall. "Schema violation detected, logging and continuing." That's not a ratchet. That's a turnstile with a broken lock.&lt;/p&gt;

&lt;p&gt;A real data ratchet &lt;em&gt;rejects&lt;/em&gt; non-conforming data. Full stop. The data can go back to the source, get transformed, get remediated, whatever it needs to do. But it doesn't get through until it conforms.&lt;/p&gt;
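&lt;p&gt;To make that "full stop" concrete, here is a minimal, hand-rolled sketch of such a gate for the SensorReading shape, using only the standard library. A real pipeline would use a proper JSON Schema validator; the names &lt;code&gt;admit&lt;/code&gt; and &lt;code&gt;RatchetReject&lt;/code&gt; are illustrative, not from any existing tool.&lt;/p&gt;

```python
import re

# Hand-rolled "pawl" for the SensorReading shape: reject, never warn-and-continue.
REQUIRED = {"device_id", "timestamp", "value", "unit"}
UNITS = {"celsius", "fahrenheit", "kelvin"}
DEVICE_ID = re.compile(r"^[A-Z]{2}-[0-9]{6}$")

class RatchetReject(Exception):
    """Non-conforming data goes back to the source; it does not pass."""

def admit(record: dict) -> dict:
    extra = set(record) - REQUIRED
    if extra:  # additionalProperties: false
        raise RatchetReject(f"additional properties: {sorted(extra)}")
    missing = REQUIRED - set(record)
    if missing:
        raise RatchetReject(f"missing required fields: {sorted(missing)}")
    if not DEVICE_ID.match(str(record["device_id"])):
        raise RatchetReject("device_id fails pattern ^[A-Z]{2}-[0-9]{6}$")
    if not isinstance(record["value"], (int, float)):
        raise RatchetReject("value must be a number")
    if -273.15 > record["value"]:
        raise RatchetReject("value below absolute zero")
    if record["unit"] not in UNITS:
        raise RatchetReject("unknown unit")
    return record  # conforming: the ratchet clicks forward
```

&lt;p&gt;The important property is that there is no code path that admits a partially valid record.&lt;/p&gt;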

&lt;h3&gt;2. The Wheel: Idempotent Checkpoints&lt;/h3&gt;

&lt;p&gt;In multiclaude, git worktrees give each agent isolation. If an agent's work fails, it fails in its own branch. The main branch (the ratcheted progress) stays untouched.&lt;/p&gt;

&lt;p&gt;Data pipelines need the same thing: checkpoints that are idempotent and isolated. If a transformation fails, you can retry from the last checkpoint without corrupting the verified data downstream.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class CheckpointedPipeline:
    def __init__(self, checkpoint_store: str):
        self.checkpoint_store = checkpoint_store

    def process_batch(self, batch_id: str, records: list[dict]) -&gt; str:
        # Check if we already processed this batch
        checkpoint = self.load_checkpoint(batch_id)
        if checkpoint and checkpoint["status"] == "completed":
            return checkpoint["output_path"]  # Idempotent: return existing result

        # Process in isolation (write to temp location)
        temp_path = f"{self.checkpoint_store}/pending/{batch_id}"
        validated = []
        for record in records:
            if self.validate(record):
                validated.append(record)
            else:
                self.quarantine(record, batch_id)  # Don't lose it, just don't let it through

        self.write_records(temp_path, validated)

        # Only after success: commit the checkpoint
        final_path = f"{self.checkpoint_store}/verified/{batch_id}"
        self.atomic_move(temp_path, final_path)
        self.save_checkpoint(batch_id, {"status": "completed", "output_path": final_path})

        return final_path
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The key moves: write to a temp location first, only move to the verified path after success, and the checkpoint makes retries safe. If the process dies mid-batch, we start over. No partial state leaking into the verified dataset.&lt;/p&gt;
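&lt;p&gt;The atomic-move step deserves a closer look, since it's what makes the commit all-or-nothing. Here is a stdlib-only sketch (the helper name &lt;code&gt;publish_batch&lt;/code&gt; is mine): write to a temp file in the destination directory, fsync, then &lt;code&gt;os.replace&lt;/code&gt;, which is an atomic rename on POSIX filesystems within a single filesystem.&lt;/p&gt;

```python
import json
import os
import tempfile

def publish_batch(records: list, verified_path: str) -> str:
    """Write to a temp file first, then atomically rename into the verified path.

    Readers either see the old state or the complete new file, never a partial write.
    """
    os.makedirs(os.path.dirname(verified_path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(verified_path), suffix=".pending")
    try:
        with os.fdopen(fd, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())        # durable before we publish
        os.replace(tmp, verified_path)  # atomic: the ratchet clicks here
    except BaseException:
        os.unlink(tmp)                  # failed batch leaves no partial state behind
        raise
    return verified_path
```

&lt;p&gt;Because the rename is the only publish step, a crash at any earlier point leaves the verified path untouched.&lt;/p&gt;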

&lt;p&gt;Most pipelines I've seen treat state as something that happens &lt;em&gt;to&lt;/em&gt; them rather than something they &lt;em&gt;manage&lt;/em&gt;. They're stateless in theory and stateful in practice, which is the worst of both worlds.&lt;/p&gt;

&lt;h3&gt;3. The Arbiter: Automated Verification with Teeth&lt;/h3&gt;

&lt;p&gt;Here's the multiclaude rule that matters: &lt;em&gt;agents are forbidden from weakening CI to make their work pass.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Translate that to data: no one can weaken the validation rules to make bad data pass. Not the data team, not the business stakeholder with a deadline, not the executive who needs the dashboard updated yesterday.&lt;/p&gt;

&lt;p&gt;What does "CI for data" actually look like? Something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# data-ci.yaml
name: Data Quality Gate

on:
  data_ingestion:
    sources: ["sensor-feed", "partner-api", "user-uploads"]

jobs:
  validate:
    steps:
      - name: Schema Validation
        run: |
          jsonschema --instance ${{ inputs.data_path }} \
                     --schema schemas/${{ inputs.source }}.json
        fail_on_error: true  # This is the ratchet. No exceptions.

      - name: Semantic Checks
        run: |
          python checks/semantic_validator.py \
            --data ${{ inputs.data_path }} \
            --rules rules/${{ inputs.source }}.yaml
        # Example rules:
        # - timestamp must be within last 24 hours
        # - device_id must exist in device registry
        # - value must be within 3 std devs of rolling mean

      - name: Lineage Recording
        if: success()
        run: |
          record-lineage \
            --input ${{ inputs.data_path }} \
            --schema-version ${{ inputs.schema_hash }} \
            --validator-version ${{ github.sha }} \
            --output verified/${{ inputs.batch_id }}

on_failure:
  steps:
    - name: Quarantine Bad Data
      run: |
        move-to-quarantine ${{ inputs.data_path }} \
          --reason "${{ job.failure_reason }}"
    - name: Alert Source System
      run: |
        notify-upstream ${{ inputs.source }} \
          --batch ${{ inputs.batch_id }} \
          --errors ${{ job.validation_errors }}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
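&lt;p&gt;The semantic rules are just as automatable as the schema checks. For instance, the "within 3 std devs of a rolling mean" rule might look like this in stdlib Python (the class name is mine, and a production check would handle warm-up and seasonality far more carefully):&lt;/p&gt;

```python
from collections import deque
import statistics

class RollingAnomalyGate:
    """Admit a value only if it sits within max_sigma of the rolling mean."""

    def __init__(self, window: int = 100, max_sigma: float = 3.0):
        self.window = deque(maxlen=window)
        self.max_sigma = max_sigma

    def admit(self, value: float) -> bool:
        if len(self.window) >= 2:
            mean = statistics.fmean(self.window)
            stdev = statistics.pstdev(self.window)
            if stdev > 0 and abs(value - mean) > self.max_sigma * stdev:
                return False  # reject: caller quarantines, the window stays clean
        self.window.append(value)
        return True
```

&lt;p&gt;Note that rejected values never enter the window, so a burst of garbage can't drag the baseline toward itself.&lt;/p&gt;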

&lt;p&gt;The critical bit is &lt;code&gt;fail_on_error: true&lt;/code&gt; with no escape hatch. No &lt;code&gt;continue-on-error&lt;/code&gt;. No "warn and proceed." The data either passes or it goes to quarantine.&lt;/p&gt;

&lt;p&gt;This is culturally difficult. It requires the same organizational commitment that "we don't ship if tests fail" required for software teams. But it's the only way the ratchet works.&lt;/p&gt;

&lt;h3&gt;4. Reproducibility: The Secret Ingredient&lt;/h3&gt;

&lt;p&gt;There's one more piece that makes the code ratchet work: reproducibility. When CI fails, you can reproduce the failure. When it passes, you can reproduce the pass. Same inputs, same outputs, every time.&lt;/p&gt;

&lt;p&gt;Data systems are notoriously bad at this. The pipeline that worked yesterday fails today because someone changed an upstream schema. Or because the source system had a hiccup. Or because Mercury is in retrograde. (I've debugged all three. The Mercury one was actually a timezone issue in a system named "Mercury." I wish I was kidding.)&lt;/p&gt;

&lt;p&gt;A real data ratchet needs what I'd call a "usability signature":&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "batch_id": "2026-01-22-sensor-feed-042",
  "verified_at": "2026-01-22T14:32:01Z",
  "input_hash": "sha256:a1b2c3d4...",
  "schema": {
    "name": "SensorReading",
    "version": "2.1.0",
    "hash": "sha256:e5f6g7h8..."
  },
  "validators": {
    "semantic_checks": "v1.4.2",
    "anomaly_detector": "v0.9.1"
  },
  "result": {
    "status": "passed",
    "records_in": 10482,
    "records_verified": 10479,
    "records_quarantined": 3
  },
  "output_path": "verified/2026-01-22/sensor-feed-042.parquet",
  "output_hash": "sha256:i9j0k1l2..."
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
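&lt;p&gt;Checking a signature like this is mechanical: recompute the hashes and compare. A minimal stdlib sketch (the function names are mine, not from the post):&lt;/p&gt;

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Hash in the same 'sha256:...' format used by the signature."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def check_signature(signature: dict, input_bytes: bytes, output_bytes: bytes) -> bool:
    """A batch is reproducible only if both sides of the ratchet still hash the same."""
    return (signature["input_hash"] == content_hash(input_bytes)
            and signature["output_hash"] == content_hash(output_bytes))
```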

&lt;p&gt;This signature is an artifact, not just a log line. You can take this signature, grab the input data by its hash, run the exact versions of the validators, and you'll get the same result. If you can't do that, you don't have a ratchet. You have a coin flip.&lt;/p&gt;

&lt;h2&gt;The Uncomfortable Implication&lt;/h2&gt;

&lt;p&gt;Here's what this means in practice: a lot of data that's currently flowing through your systems wouldn't make it through a real ratchet.&lt;/p&gt;

&lt;p&gt;That's not a bug. That's the point.&lt;/p&gt;

&lt;p&gt;The Brownian ratchet works because it's &lt;em&gt;uncompromising&lt;/em&gt;. The pawl doesn't care that you really need this data for a quarterly review. It doesn't care that the source system "usually" sends valid records. It doesn't care about your deadline.&lt;/p&gt;

&lt;p&gt;CI transformed software quality not by being smart, but by being stubborn. It created a culture where "works on my machine" stopped being an excuse because there was an objective arbiter that didn't care about your machine.&lt;/p&gt;

&lt;p&gt;Data needs the same stubbornness. The same willingness to say "no" and mean it.&lt;/p&gt;

&lt;h2&gt;What This Looks Like in Practice&lt;/h2&gt;

&lt;p&gt;I've been thinking about this in the context of what we're building at Expanso: intelligent data pipelines that can process data at the edge. The edge is where the Brownian motion is strongest. Sensors, devices, user inputs, all generating data in a thousand formats with a thousand failure modes.&lt;/p&gt;

&lt;p&gt;The traditional answer is to centralize. Pull everything to a data lake, clean it up, validate it there. But that's expensive, slow, and loses context. By the time you've moved the data, you've lost the ability to remediate at the source.&lt;/p&gt;

&lt;p&gt;What if the ratchet lived at the edge? Validation happens where data is generated. Non-conforming data gets rejected &lt;em&gt;immediately&lt;/em&gt;, while there's still context to fix it. Only verified data propagates upstream.&lt;/p&gt;

&lt;p&gt;That's the vision. Not a single central ratchet, but a distributed network of ratchets. Each one small and stubborn. Each one clicking forward, never back.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to learn how intelligent data pipelines can reduce your AI costs?&lt;/em&gt; &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Check out Expanso&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. Or don't. Who am I to tell you what to do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE: I'm currently writing a book about the real-world challenges of data preparation for machine learning, based on what I have seen in the field, focusing on operations, compliance, and cost.&lt;/strong&gt; &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;I'd love to hear your thoughts&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/brownian-ratchet-for-data/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devto</category>
    </item>
    <item>
      <title>The Brownian Ratchet and the Chimpanzee Factory</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Tue, 20 Jan 2026 00:34:00 +0000</pubDate>
      <link>https://forem.com/aronchick/the-brownian-ratchet-and-the-chimpanzee-factory-583n</link>
      <guid>https://forem.com/aronchick/the-brownian-ratchet-and-the-chimpanzee-factory-583n</guid>
      <description>&lt;p&gt;Two weeks ago, Steve Yegge released &lt;a href="https://github.com/steveyegge/gastown?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;GasTown&lt;/a&gt;, a multi-agent orchestrator he describes as "an industrialized coding factory manned by superintelligent chimpanzees." A few days later, Dan Lorenc quietly pushed &lt;a href="https://github.com/dlorenc/multiclaude?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;multiclaude&lt;/a&gt;, built on what he calls the "Brownian Ratchet" principle: chaos is fine, as long as we ratchet forward.&lt;/p&gt;

&lt;p&gt;While the projects are separate, Dan says he was deeply inspired by GasTown. Yet they, and many others like them, landed on almost identical foundational architecture: detached UI for observability, git worktrees for isolation, external state persistence, and CI as the final arbiter. That convergence tells us something important about where agent tooling is heading.&lt;/p&gt;

&lt;h2&gt;The Problem Both Solve&lt;/h2&gt;

&lt;p&gt;Running one Claude Code instance is straightforward, but running twenty in parallel on the same codebase is a distributed systems problem. The challenges are familiar to anyone who's operated infrastructure at scale: agent sessions crash, work gets duplicated, changes conflict, and state disappears when processes restart. Without proper isolation, a single runaway agent can corrupt shared resources, and without observability you can't debug what's happening. If state doesn't persist, progress evaporates the moment something fails.&lt;/p&gt;

&lt;p&gt;In source code, people saw this same problem (lots of people working on the same thing), and solved it incrementally with things like &lt;a href="https://subversion.apache.org/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;SVN&lt;/a&gt;, and then &lt;a href="https://git-scm.com/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Git&lt;/a&gt; (with many others as well).&lt;/p&gt;

&lt;p&gt;Every multi-agent orchestration system has to answer these questions about multiple &lt;em&gt;things&lt;/em&gt; working on a single &lt;em&gt;thing&lt;/em&gt;, and what's interesting is watching &lt;em&gt;how&lt;/em&gt; different systems answer them.&lt;/p&gt;

&lt;h2&gt;Two Philosophies, Same Primitives&lt;/h2&gt;

&lt;p&gt;GasTown takes the comprehensive approach, with seven distinct agent roles that divide responsibilities across the system. The Mayor coordinates overall work, Polecats handle ephemeral tasks, the Refinery manages merge queues, and so on through Witness, Deacon, Dogs, and Crew. Work flows through what Yegge calls the MEOW stack (Molecules, Epics, Work orders), with state persisting through git-backed "hooks" and the Beads memory framework. Agents maintain persistent identities that survive session crashes via the GUPP principle: "If there is work on your hook, YOU MUST RUN IT." This is ... confusing ... but I have to respect explicitly naming everything. It gives you an unequivocal way to know what (and who) is doing what, at the cost of a lot of UX affordance up front.&lt;/p&gt;

&lt;p&gt;multiclaude takes the minimalist path with just three roles: a supervisor that coordinates, workers that execute tasks, and a merge-queue agent that handles CI. State lives in a JSON file and the filesystem, communication happens through simple message passing, and the philosophy is explicit: "Trying to perfectly coordinate agent work is both expensive and fragile. Instead, we let chaos happen and use CI as the ratchet that captures forward progress."&lt;/p&gt;

&lt;p&gt;The rhetoric differs dramatically between the two projects. Yegge's documentation reads like a manifesto, complete with warnings that you shouldn't use GasTown if you "care about money" or are "more than 4 feet tall." Lorenc's README is Unix-philosophy spare, with clean diagrams and matter-of-fact explanations. But underneath the different personalities, you find the same primitives.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Primitive&lt;/th&gt;&lt;th&gt;GasTown&lt;/th&gt;&lt;th&gt;multiclaude&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Process isolation&lt;/td&gt;&lt;td&gt;tmux windows&lt;/td&gt;&lt;td&gt;tmux windows&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Code isolation&lt;/td&gt;&lt;td&gt;git worktrees&lt;/td&gt;&lt;td&gt;git worktrees&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;State persistence&lt;/td&gt;&lt;td&gt;Git-backed hooks&lt;/td&gt;&lt;td&gt;JSON + filesystem&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Quality gate&lt;/td&gt;&lt;td&gt;CI (automated merging)&lt;/td&gt;&lt;td&gt;CI ("the ratchet")&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Observability&lt;/td&gt;&lt;td&gt;Attach to any session&lt;/td&gt;&lt;td&gt;Attach to any session&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Both folks recognized that you can't just spawn Claude instances and hope for the best. You need boundaries.&lt;/p&gt;

&lt;h2&gt;What the Convergence Reveals&lt;/h2&gt;

&lt;p&gt;When two experienced engineers arrive at the same architectural primitives, that's signal worth paying attention to. It suggests these aren't arbitrary choices but structural requirements for the problem space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolation requires more than process boundaries.&lt;/strong&gt; Both projects use git worktrees rather than just separate directories because a worktree gives each agent its own branch, its own working copy, and its own commit history. Conflicts become merge conflicts, which git already knows how to surface, and the blast radius of any single agent is bounded by what it can do to its own worktree.&lt;/p&gt;
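&lt;p&gt;The worktree-per-agent pattern is easy to reproduce yourself. Here's a sketch of what both systems effectively do under the hood (the helper name and directory layout are mine, not from either project):&lt;/p&gt;

```python
import subprocess

def add_agent_worktree(repo: str, agent_id: str, base: str = "main") -> str:
    """Give an agent its own branch and working copy; its blast radius ends there."""
    branch = f"agent/{agent_id}"
    path = f"{repo}-worktrees/{agent_id}"
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, path, base],
        check=True, capture_output=True,
    )
    return path  # conflicts surface later as ordinary merge conflicts
```

&lt;p&gt;Each call gives one agent a private branch and checkout; the shared history only moves when something merges.&lt;/p&gt;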

&lt;p&gt;&lt;strong&gt;Observability can't be an afterthought.&lt;/strong&gt; Both chose tmux as the primary interface rather than a web dashboard or log aggregator. A terminal multiplexer lets you attach to any agent's session, watch it work, and intervene if needed. This is distinctly different from how most "AI agent frameworks" approach the problem, with their emphasis on structured outputs and API-driven orchestration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State must survive failures.&lt;/strong&gt; GasTown invests heavily in crash recovery through git-backed hooks while multiclaude keeps it simpler with filesystem persistence, but both reject the idea of ephemeral agent state. When a session dies, the work shouldn't die with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI becomes the coordination mechanism.&lt;/strong&gt; In both systems, CI isn't just a quality check but the arbiter of what counts as progress. Lorenc is explicit: "If it passes, the code goes in. If it fails, it doesn't. The automation decides." Yegge's Refinery agent serves the same function, and this approach shifts coordination from real-time synchronization (expensive, fragile) to asynchronous validation (robust, scalable).&lt;/p&gt;

&lt;h2&gt;The Deeper Pattern: Scoped Autonomy&lt;/h2&gt;

&lt;p&gt;Strip away the specific implementations and you find a design pattern emerging for AI agent systems: &lt;strong&gt;scoped autonomy with external persistence&lt;/strong&gt;. Give agents freedom to act within clear boundaries, let them fail without cascading damage, capture successful outcomes permanently, and accept that coordination is expensive and often unnecessary if your ratchet mechanism is good enough.&lt;/p&gt;

&lt;p&gt;This isn't a new idea. It's how we've learned to build reliable distributed systems over the past two decades, and the insight here is that agent orchestration &lt;em&gt;is&lt;/em&gt; distributed systems, with the same principles applying. Kubernetes asks "Is it running?" and reconciles toward desired state while GasTown asks "Is it done?" and persists completed work. Both are control loops operating over unreliable workers, and both accept that perfect coordination is impossible and design around it.&lt;/p&gt;

&lt;h2&gt;Where They Diverge: Single-Player vs. MMORPG&lt;/h2&gt;

&lt;p&gt;The most interesting philosophical difference isn't about orchestration complexity but about the human model.&lt;/p&gt;

&lt;p&gt;Lorenc frames multiclaude explicitly: "Gastown treats agents as NPCs in a single-player game. You're the player, agents are your minions. multiclaude treats software engineering as an MMORPG. You're one player among many."&lt;/p&gt;

&lt;p&gt;In multiclaude, your workspace persists. You spawn workers, go to lunch, come back, and check what merged while you were away. Other humans can have their own workspaces on the same repo, and the system keeps running when you're not watching.&lt;/p&gt;

&lt;p&gt;GasTown is designed around intensive engagement. Yegge describes watching 20-30 agents in parallel, making $100/hour decisions about what work to greenlight, experiencing "palpable stress" as the system runs at speeds too fast to comprehend. It's a powerful multiplier for an engaged operator rather than a fire-and-forget system.&lt;/p&gt;

&lt;p&gt;Neither model is wrong since they're optimizing for different workflows, but the MMORPG framing points toward something important: these systems need to work when humans aren't actively supervising.&lt;/p&gt;

&lt;h2&gt;What This Means for the Industry&lt;/h2&gt;

&lt;p&gt;We're watching the orchestration layer crystallize in real time, and the patterns that emerge now will shape how agent systems get built for years.&lt;/p&gt;

&lt;p&gt;The "19-agent trap" (simulating an org chart with Analyst → PM → Architect → Dev → QA handoffs) is giving way to operational models where agents have specific, bounded roles. The emphasis shifts from elaborate prompting frameworks to infrastructure primitives: isolation, persistence, observability.&lt;/p&gt;

&lt;p&gt;The tooling will mature as costs drop. Right now, GasTown burns $100/hour in tokens, partly because the models are expensive and partly because the coordination overhead is high. Both factors will improve, and the architectural patterns being established now will outlast the current pricing structure.&lt;/p&gt;

&lt;p&gt;For teams thinking about agent infrastructure, the lesson isn't "adopt GasTown" or "adopt multiclaude" since both are weeks old and explicitly experimental. The lesson is to watch what primitives they converged on, because if you're building agent systems you'll probably need them too: git worktrees for isolation, something tmux-like for observability, persistent state that survives session failures, and CI or some equivalent as the ratchet that captures forward progress.&lt;/p&gt;

&lt;p&gt;The chimpanzee factory and the Brownian ratchet arrived at the same answer. That's worth paying attention to.&lt;/p&gt;

&lt;p&gt;Repos:&lt;br&gt;
&lt;em&gt;- GasTown:&lt;/em&gt; &lt;a href="https://github.com/steveyegge/gastown?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;em&gt;github.com/steveyegge/gastown&lt;/em&gt;&lt;/a&gt; &lt;br&gt;
&lt;em&gt;- multiclaude:&lt;/em&gt; &lt;a href="https://github.com/dlorenc/multiclaude?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;em&gt;github.com/dlorenc/multiclaude&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to learn how intelligent data pipelines can reduce your AI costs?&lt;/em&gt; &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Check out Expanso&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. Or don't. Who am I to tell you what to do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE: I'm currently writing a book about the real-world challenges of data preparation for machine learning, based on what I have seen in the field, focusing on operations, compliance, and cost.&lt;/strong&gt; &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;I'd love to hear your thoughts&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/brownian-ratchet-chimpanzee-factory/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiinfrastructure</category>
      <category>agents</category>
      <category>orchestration</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>The Upstream Problem: Why Context Graphs Are Starving</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Sat, 17 Jan 2026 00:33:27 +0000</pubDate>
      <link>https://forem.com/aronchick/the-upstream-problem-why-context-graphs-are-starving-79j</link>
      <guid>https://forem.com/aronchick/the-upstream-problem-why-context-graphs-are-starving-79j</guid>
      <description>&lt;p&gt;Foundation Capital just published what &lt;a href="https://foundationcapital.com/context-graphs-ais-trillion-dollar-opportunity/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;they're calling AI's trillion-dollar opportunity&lt;/a&gt;: context graphs. They argue that enterprise value is shifting from systems of record (Salesforce, Workday, SAP) to systems of agents. The new crown jewel isn't the data itself. It's the context graph: a living record of decision traces stitched across entities and time, where precedent becomes searchable.&lt;/p&gt;

&lt;p&gt;They're right about the destination. But &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7417626837616500736/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Greg Ceccarelli's response on LinkedIn&lt;/a&gt; caught something important that their framing misses. Foundation Capital focuses on capturing decisions at execution time. That matters, but it's the last mile. The first mile is still bleeding out.&lt;/p&gt;

&lt;h2&gt;The Telephone Game (But With Developers)&lt;/h2&gt;

&lt;p&gt;Decisions don't originate at execution time. They originate in conversations.&lt;/p&gt;

&lt;p&gt;A PM pattern-matches across customer interviews. Engineering debates constraints in Slack. A VP makes a call on a Zoom that nobody documents. By the time any of this hits a system of record, the context has been compressed, lossy-encoded, and re-interpreted three times. It's a game of telephone where the prize is a barely articulated card in your Kanban roadmap.&lt;/p&gt;

&lt;p&gt;Recording meetings is table stakes now. The raw material exists. But most of it vanishes. It's searchable in theory and useless in practice. You can find that the decision was made, but you can't find why it made sense given everything else that was happening at the time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloudedjudgement.substack.com/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Jamin Ball's piece "Long Live Systems of Record"&lt;/a&gt; pushed back on the "agents kill everything" narrative, arguing that agents don't replace systems of record, they raise the bar for what a good one looks like. I think he’s right, but the problem is no one is the voice for the downstream consumers. The reasoning, the exceptions, the context that justified a decision in the moment isn’t in any form that a human (let alone an agent) can find or consume. That's what's missing.&lt;/p&gt;

&lt;p&gt;Context graphs need to be fed. The feed is conversations, and, right now, conversations evaporate.&lt;/p&gt;

&lt;h2&gt;The Same Problem, Worse, in Data&lt;/h2&gt;

&lt;p&gt;Greg's framing focuses on software development, and that's where his company &lt;a href="https://specstory.com/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;SpecStory&lt;/a&gt; and their new &lt;a href="https://www.intent.build/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Intent&lt;/a&gt; product are building. I think these are awesome and deserve a lot of attention. So much so, in fact, that I want to take the idea further: software development is just one domain where decisions get lost upstream.&lt;/p&gt;

&lt;p&gt;Data pipelines, &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;our world&lt;/a&gt;, are another, and arguably a worse one.&lt;/p&gt;

&lt;p&gt;When a data engineer decides which fields to drop during transformation, how to handle null values in a critical column, why a particular join strategy was chosen over another, what "clean" means for this specific dataset... where does that reasoning live? In a PR comment that gets archived. A Slack thread that disappears. Someone's head who leaves the company.&lt;/p&gt;
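
&lt;p&gt;One lightweight way to keep that reasoning from evaporating is to version a small decision record right next to the pipeline code. A minimal sketch, assuming nothing about any existing tool; the &lt;code&gt;DecisionRecord&lt;/code&gt; shape, field names, and the example link are all my own invention:&lt;/p&gt;

```python
import json
import os
from dataclasses import dataclass, field, asdict

@dataclass
class DecisionRecord:
    """A pipeline decision captured at the moment it was made.
    Hypothetical schema for illustration, not any tool's format."""
    decision: str                                      # what was decided
    rationale: str                                     # why, in the author's own words
    alternatives: list = field(default_factory=list)   # considered and rejected
    author: str = ""
    source: str = ""                                   # link to the thread / PR / call

# Example: the null-handling choice from the paragraph above
record = DecisionRecord(
    decision="Impute missing order_total with column median",
    rationale="Dropping rows would bias toward stores with reliable POS uplinks",
    alternatives=["drop rows", "impute zero", "forward-fill"],
    author="data-eng@example.com",
    source="https://example.com/slack/C123/p456",  # placeholder link
)

# Version the record alongside the transformation it explains
os.makedirs("decisions", exist_ok=True)
with open("decisions/order_total_nulls.json", "w") as f:
    json.dump(asdict(record), f, indent=2)
```

&lt;p&gt;The point isn't the schema; it's that the "why" gets committed in the same repo as the "what," so the next person (or agent) finds it where they're already looking.&lt;/p&gt;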

&lt;p&gt;The data observability market has exploded. &lt;a href="https://www.gartner.com/en/information-technology?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Gartner estimates&lt;/a&gt; data observability will be a $2.5B+ market by 2027. But all of it focuses on detecting problems after they happen. The upstream intent, why the pipeline was designed this way, what tradeoffs were considered, what the original constraints were, remains uncaptured.&lt;/p&gt;

&lt;p&gt;Another favorite company of mine, &lt;a href="https://greatexpectations.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Great Expectations&lt;/a&gt;, does a great job capturing what should be true. And &lt;a href="https://docs.getdbt.com/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;dbt&lt;/a&gt; moves documentation closer to the code. We even have standards that capture the what of transformations. But almost nothing captures the WHY.&lt;/p&gt;

&lt;p&gt;When an ML model makes a bad prediction, you can trace back to the training data. But can you trace back to why the training data was prepared that way? Who decided to impute missing values with medians instead of dropping rows? What was the conversation that led to that feature engineering choice? What did the team know at the time that isn't written down anywhere?&lt;/p&gt;

&lt;p&gt;The decision trace doesn't exist because nobody captured it when it happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intent Has Locality&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
This connects to something I've been thinking about for years. Intent has locality, just like data.&lt;/p&gt;

&lt;p&gt;The richest context about a decision exists at the moment it's made, in the place it's made. Move it somewhere else, a different system, a different time, a summary written later, and you lose fidelity. This is true whether you're moving bytes across a network or moving reasoning into documentation.&lt;/p&gt;

&lt;p&gt;Think about what happens when you try to document a decision after the fact. You're reconstructing. You remember the outcome but not the three alternatives you considered. You remember the constraint that mattered most but not the secondary factors that shaped the final call. You remember that someone raised an objection but not exactly what shifted the conversation.&lt;/p&gt;

&lt;p&gt;The further you get from the moment of decision, the more context you lose. And unlike data, you can't just store a copy closer to where it's needed. The moment passes. The reasoning evaporates. What remains is the artifact without the intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What SpecStory Is Building&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
This is why what &lt;a href="https://www.intent.build/design-partner?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Greg and the SpecStory team are building with Intent&lt;/a&gt; matters. They started where decisions turn into code: the conversation between developers and coding agents. Intent records every exchange with &lt;a href="https://www.anthropic.com/claude-code?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, &lt;a href="https://cursor.sh/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, &lt;a href="https://github.com/features/copilot?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, &lt;a href="https://openai.com/index/openai-codex/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Codex&lt;/a&gt;, Gemini. The transcript of how software actually gets built.&lt;/p&gt;

&lt;p&gt;But as they asked where the intent came from, the answer kept pointing upstream. Team calls. Architecture discussions. Pairing sessions. The decisions that happen before anyone opens an IDE.&lt;/p&gt;

&lt;p&gt;Their solution has three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capture&lt;/strong&gt;: Every agent prompt and IDE session, recorded automatically. Not just the code that got written, but the back-and-forth that produced it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Arena&lt;/strong&gt;: Real-time collaboration with automatic decision extraction. Not verbose summaries nobody will read. The actual decision linked to the exact moment in the conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo&lt;/strong&gt;: Decisions versioned alongside your source code. Consumable by humans and agents. Searchable forever.&lt;/p&gt;

&lt;p&gt;Full context lineage: Team discusses → Decision extracted → Agent builds → Session reasoning preserved → Code ships. Every line of code traceable back to the exchange where the decision was made.&lt;/p&gt;

&lt;p&gt;That's the upstream feed layer context graphs need. The bridge from conversation to context to code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Parallel Problem Nobody's Solving for Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
The same pattern applies to data infrastructure, and the gap is even wider.&lt;/p&gt;

&lt;p&gt;Here's an example I come back to constantly. You're looking at point-of-sale data from a retail chain, and one store shows zero transactions for six hours. What happened?&lt;/p&gt;

&lt;p&gt;Maybe the system wasn't connected. Maybe there was a hurricane. Maybe it was midnight and the store was closed. Maybe there was a police action in the area. Maybe the pipeline is connected but stopped running. Maybe someone unplugged the wrong cable during a renovation.&lt;/p&gt;

&lt;p&gt;The data looks identical in every case: zeros. But the appropriate response is completely different. If it's a hurricane, you adjust your forecasts and check on your employees. If it's a pipeline failure, you fix the pipeline and backfill the data. If it's midnight, you do nothing because everything is working correctly.&lt;/p&gt;

&lt;p&gt;The "what" is the same. The "why" determines everything that matters.&lt;/p&gt;

&lt;p&gt;This is the context gap in data infrastructure. Data lineage tools tell you what transformations happened. They don't tell you why someone chose that approach over alternatives. Data catalogs describe what datasets contain. They don't capture the discussions that shaped how those datasets were structured. Data quality tools flag when something looks wrong. They can't explain what "right" was supposed to mean based on the original requirements conversation.&lt;/p&gt;

&lt;p&gt;Every data team has experienced this: you inherit a pipeline, something breaks, and you spend days reverse-engineering decisions that took the original author five minutes to make. The &lt;a href="https://survey.stackoverflow.co/2024/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;2024 Stack Overflow survey&lt;/a&gt; found developers spend 30%+ of their time understanding existing code. For data engineers working with inherited pipelines, I'd bet that number is higher.&lt;/p&gt;

&lt;p&gt;A few teams are starting to explore how to capture intent at the data layer, not just the code layer. The ones who figure out how to preserve decision context where data actually lives, at the edge, in pipelines, across distributed infrastructure, might be building something important. But right now, the tooling barely exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters for AI Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
Foundation Capital is right that agents need decision traces to exercise judgment. But consider what happens when we only capture traces at execution time.&lt;/p&gt;

&lt;p&gt;An agent can follow a rule. It can look up a policy. But it can't understand why an exception was made last quarter unless someone captured the reasoning when it happened. It can see that a certain transformation was applied to a dataset but not why that approach was chosen over three alternatives that were discussed and rejected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/research?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Research on AI decision-making&lt;/a&gt; keeps surfacing the same challenge: agents struggle with edge cases because they lack the contextual reasoning that humans use to navigate ambiguity. We've been trying to solve this with better prompts, more examples, refined guardrails. But the fundamental problem is upstream. The reasoning that would help agents handle edge cases was never captured in the first place.&lt;/p&gt;

&lt;p&gt;Agents inherit our documentation debt. Every undocumented decision, every lost conversation, every piece of reasoning that exists only in someone's memory becomes a gap in the context graph. And agents can't exercise judgment across gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Compounding Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
Context loss compounds in ways that aren't obvious until you're deep in a system you didn't build.&lt;/p&gt;

&lt;p&gt;Every undocumented decision becomes a landmine for the next person (or agent) who encounters that code, that pipeline, that system. They see what was built but not why. So they either preserve it blindly (accumulating technical debt they don't understand) or change it without understanding the original constraints (breaking things the original author anticipated but never wrote down).&lt;/p&gt;

&lt;p&gt;I've seen this pattern repeatedly. A team inherits a data pipeline with a seemingly arbitrary filter. They remove it because it doesn't match current requirements. Three months later, they discover it was preventing a subtle data quality issue that only surfaces under specific conditions. The original author knew about this. They even discussed it extensively with the team. But that conversation happened on a Zoom call that was never transcribed, and the person who made the decision left the company two years ago.&lt;/p&gt;

&lt;p&gt;Multiply this across every team, every pipeline, every codebase. The &lt;a href="https://dora.dev/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;DORA research&lt;/a&gt; shows that elite teams ship faster partly because they spend less time reverse-engineering past decisions. They've somehow preserved more context. Usually through heroic documentation efforts that don't scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Path Forward&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
Foundation Capital's context graph thesis is right about the destination. Greg Ceccarelli and the SpecStory team are right about the first mile.&lt;/p&gt;

&lt;p&gt;The platforms that win won't just capture decisions at execution time. They'll capture intent upstream, in the conversations, the debates, the reasoning that happens before anyone writes a line of code or builds a pipeline.&lt;/p&gt;

&lt;p&gt;And they'll keep that intent close to where it matters. Versioned with the code. Traveling with the data. Available when someone (or something) needs to understand not just what happened, but why it was allowed to happen.&lt;/p&gt;

&lt;p&gt;We're good at storing what happened. We're terrible at capturing why. The next trillion-dollar platforms will be the ones that figure out how to close that gap, not at execution time, but upstream, where the decisions actually get made.&lt;/p&gt;

&lt;p&gt;Want to learn how intelligent data pipelines can reduce your AI costs? &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;Check out Expanso&lt;/strong&gt;&lt;/a&gt;. Or don't. Who am I to tell you what to do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operations, compliance, and cost.&lt;/strong&gt; &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;I'd love to hear your thoughts&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/2026-01-15-upstream-problem-context-graphs-starving/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>contextgraphs</category>
      <category>datapipelines</category>
      <category>decisiontraces</category>
    </item>
    <item>
      <title>The $1B AI Drug Lab That Can't Touch Its Own Data</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Tue, 13 Jan 2026 06:12:34 +0000</pubDate>
      <link>https://forem.com/aronchick/the-1b-ai-drug-lab-that-cant-touch-its-own-data-bep</link>
      <guid>https://forem.com/aronchick/the-1b-ai-drug-lab-that-cant-touch-its-own-data-bep</guid>
      <description>&lt;p&gt;Nvidia and Eli Lilly &lt;a href="https://www.globenewswire.com/news-release/2026/01/12/3217075/0/en/NVIDIA-and-Lilly-Announce-Co-Innovation-AI-Lab-to-Reinvent-Drug-Discovery-in-the-Age-of-AI.html?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;announced a $1 billion AI drug discovery lab&lt;/a&gt; today at the J.P. Morgan Healthcare Conference. The press releases are full of the expected language: "reinvent drug discovery," "accelerate medicine development," "foundation models for biology." Lilly's CEO David Ricks said they're "combining our volumes of data and scientific knowledge with Nvidia's computational power."&lt;/p&gt;

&lt;p&gt;But, MAN, there is a phrase in there that is doing an INSANE amount of hard work: &lt;em&gt;Combining our volumes of data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;How, exactly?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Missing Paragraph&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
The &lt;a href="https://finance.yahoo.com/news/nvidia-eli-lilly-announce-1-billion-investment-in-ai-drug-discovery-lab-163446796.html?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;coverage&lt;/a&gt; has something conspicuously absent: any discussion of how pharma data actually moves. The lab will be in South San Francisco. Lilly's clinical trial data, compound libraries, and patient information live in facilities scattered across Indiana, Ireland, and dozens of research sites worldwide. The announcement talks about "lab-in-the-loop" systems linking wet labs and dry labs in "24/7 AI-assisted experimentation."&lt;/p&gt;

&lt;p&gt;That's a lovely vision. It also assumes data flows like water between these locations. In pharma, it doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Pharma Data Is Different&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
Clinical trial data contains protected health information under HIPAA. Proprietary compound structures represent billions in R&amp;amp;D investment and competitive advantage. Manufacturing process data falls under FDA's &lt;a href="https://www.ecfr.gov/current/title-21/chapter-I/subchapter-A/part-11?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;21 CFR Part 11&lt;/a&gt;, which mandates complete audit trails for every electronic record: who touched it, when, why, and what changed.&lt;/p&gt;

&lt;p&gt;These aren't bureaucratic inconveniences that clever engineering can route around. They're structural constraints that exist because the consequences of failure are measured in patient safety and billion-dollar regulatory actions.&lt;/p&gt;

&lt;p&gt;I've been talking to teams that operate in these environments. The pattern is consistent: they don't lack compute. They lack the ability to make their data &lt;em&gt;usable&lt;/em&gt; without making it &lt;em&gt;movable&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Air Gap Paradox&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
Traditional security thinking offers two options. Lock data down completely in air-gapped environments where nothing gets out. Or open it up for analysis and accept the exfiltration risk. Pharma has mostly chosen the first option, which is why so much valuable data sits in protected directories that researchers can barely access.&lt;/p&gt;

&lt;p&gt;The promise of AI drug discovery assumes you can train models on this data. But training requires moving data to compute, or moving compute to data. The first option triggers every compliance alarm in the building. The second option is what the press releases hand-wave past.&lt;/p&gt;

&lt;p&gt;Security teams need something in between: protected environments where data scientists can actually work, but where every attempted data movement gets logged, analyzed, and blocked if it violates policy. Not just access controls (who can log in) but egress controls (what can leave). The ability to process data, transform it, analyze it, without ever letting raw records escape the protected perimeter.&lt;/p&gt;

&lt;p&gt;This is remarkably hard to build. It's also not a GPU problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Audit Trail Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://www.fda.gov/regulatory-information/search-fda-guidance-documents/part-11-electronic-records-electronic-signatures-scope-and-application?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;21 CFR Part 11&lt;/a&gt; requires that regulated companies maintain computer-generated, time-stamped audit trails recording every modification to electronic records.&lt;/p&gt;

&lt;p&gt;Let’s say that again. &lt;em&gt;Every modification.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The trail must include the operator identity, date/time, and the nature of the change.&lt;/p&gt;

&lt;p&gt;Now imagine training a foundation model on clinical trial data. The model sees millions of records. It learns patterns. It generates new molecular structures based on those patterns. What's the audit trail for that? When the model suggests a compound, which training records influenced that suggestion? When a researcher uses an AI-generated insight to make a decision, how do you document the provenance?&lt;/p&gt;

&lt;p&gt;These aren't hypothetical concerns. The FDA released &lt;a href="https://www.fda.gov/regulatory-information/search-fda-guidance-documents/considerations-use-artificial-intelligence-support-regulatory-decision-making-drug-and-biological?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;draft guidance on AI in drug development&lt;/a&gt; in January 2025, outlining a risk-based credibility assessment framework for AI models across nonclinical, clinical, and manufacturing phases. Regulators are actively figuring out how to apply existing frameworks to machine learning systems. Companies that can demonstrate clean data lineage from source through model to output will have a structural advantage in regulatory discussions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Nvidia's Billion Dollars Actually Buys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
Nvidia and Lilly aren't naive about these challenges. The announcement mentions that researchers will "generate large-scale data" in the lab itself, creating new datasets specifically designed for AI training. That sidesteps some of the legacy data problems.&lt;/p&gt;

&lt;p&gt;The collaboration will (likely) use Nvidia's &lt;a href="https://www.nvidia.com/en-us/industries/healthcare-life-sciences/biopharma/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;BioNeMo platform&lt;/a&gt;, an open framework for building and training deep learning models for drug discovery that's been &lt;a href="https://nvidianews.nvidia.com/news/nvidia-bionemo-platform-adopted-by-life-sciences-leaders-to-accelerate-ai-driven-drug-discovery?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;adopted by over 200 techbios and large pharma companies&lt;/a&gt;. They're also focusing initial efforts on applications where data constraints are less severe: manufacturing optimization, process simulation, early-stage compound screening. These are real opportunities where GPU compute genuinely is the bottleneck.&lt;/p&gt;

&lt;p&gt;But the highest-value problems in drug discovery involve the data that's hardest to access: longitudinal patient records from clinical trials, real-world evidence from treatment outcomes, proprietary biological assay results accumulated over decades of R&amp;amp;D. That data can't just be copied to a shiny new lab in South San Francisco. And the &lt;a href="https://www.cbo.gov/publication/57126?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;estimated $1-3 billion cost to develop a single new drug&lt;/a&gt; is driven largely by failures that better data access might prevent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Actual Hard Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
The companies that figure out "compute over data" for regulated industries will eat this market. Not by building bigger GPU clusters, but by solving the governance layer that lets valuable data become &lt;em&gt;usable&lt;/em&gt; without becoming &lt;em&gt;vulnerable&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;What does that look like in practice? Tagging data at the source with cryptographic fingerprints so you can always verify provenance. Processing pipelines that run inside protected perimeters with whitelist-only egress. Audit systems that log not just access but every transformation, every query, every attempted export. The ability to prove, at any point, exactly what happened to every record.&lt;/p&gt;
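
&lt;p&gt;To make "fingerprint at the source, prove what happened later" concrete, here's a minimal sketch: a SHA-256 fingerprint per record plus a hash-chained audit log, so editing any earlier entry breaks every later one. The record shapes and field names are illustrative assumptions, not a compliance implementation:&lt;/p&gt;

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(record: dict) -> str:
    """Stable SHA-256 fingerprint of a record, computed at the source."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def append_audit(log: list, operator: str, action: str, record: dict) -> None:
    """Append a tamper-evident audit entry. Each entry hashes its
    predecessor, so altering history invalidates the rest of the chain."""
    prev = log[-1]["entry_hash"] if log else "0" * 64
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        "action": action,
        "record_fp": fingerprint(record),
        "prev": prev,
    }
    entry["entry_hash"] = hashlib.sha256(
        (prev + json.dumps(entry, sort_keys=True)).encode()
    ).hexdigest()
    log.append(entry)

log = []
sample = {"assay_id": "A-17", "result": 0.93}           # hypothetical record
append_audit(log, "ops@example.com", "ingest", sample)
append_audit(log, "ops@example.com", "transform", {**sample, "result_norm": 1.0})
```

&lt;p&gt;This is the boring part: no model, no GPU, just provenance you can verify at any point in the pipeline.&lt;/p&gt;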

&lt;p&gt;This is boring infrastructure work. It doesn't make for exciting keynote demos. But it's the actual constraint on AI-driven drug discovery, and throwing more GPUs at it doesn't help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd Watch For&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're evaluating pharma AI investments, look past the compute announcements. Ask instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How does the company handle data that can't leave its current location?&lt;/li&gt;
&lt;li&gt;What's their approach to federated learning or on-premises model training?&lt;/li&gt;
&lt;li&gt;How do they maintain audit trails through AI-assisted workflows?&lt;/li&gt;
&lt;li&gt;What's their story for regulatory submissions that involve AI-generated insights?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The GPU buildout is the visible part of the iceberg. The governance layer underneath is where the actual differentiation happens.&lt;/p&gt;

&lt;p&gt;Nvidia's bet will work for some use cases. Public datasets, synthetic data, newly generated experimental results. But the highest-value pharma AI problems live behind walls that compute power alone can't scale. The billion dollars is impressive. The missing paragraph about data governance is the real story.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to learn how intelligent data pipelines can reduce your AI costs?&lt;/em&gt; &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Check out Expanso&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. Or don't. Who am I to tell you what to do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operations, compliance, and cost.&lt;/strong&gt; &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;I'd love to hear your thoughts&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/billion-dollar-ai-drug-lab-cant-touch-data/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datainfrastructure</category>
      <category>ai</category>
      <category>pharma</category>
      <category>datagovernance</category>
    </item>
    <item>
      <title>Your 2026 Resolution: Add Context to Your Data (Before It Breaks You)</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Sun, 11 Jan 2026 06:11:42 +0000</pubDate>
      <link>https://forem.com/aronchick/your-2026-resolution-add-context-to-your-data-before-it-breaks-you-2k5n</link>
      <guid>https://forem.com/aronchick/your-2026-resolution-add-context-to-your-data-before-it-breaks-you-2k5n</guid>
      <description>&lt;p&gt;Last week I sat in an executive review where two teams spent forty minutes arguing about "active users." Not about strategy. Not about growth. About what the number meant.&lt;/p&gt;

&lt;p&gt;One team counted anyone who logged in. The other excluded users who bounced in under 30 seconds. Neither knew which experiment flags were active when the data was pulled. The dashboard just showed a number. No definition. No lineage. No context.&lt;/p&gt;

&lt;p&gt;This happens constantly. And it's about to get significantly worse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://backendnews.net/gartner-lack-of-ai-ready-data-threatens-success-of-ai-projects/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Gartner predicts&lt;/a&gt; that 60% of AI projects will be abandoned by 2026 because organizations lack "AI-ready data." Not because models failed. Not because compute was too expensive. Because the data traveling through these systems carries no meaning beyond the raw values.&lt;/p&gt;

&lt;p&gt;The models can't tell the difference between a deprecated pricing page and current policy. They can't distinguish a test account from a real customer. They retrieve answers confidently, cite sources correctly, and still get everything wrong.&lt;/p&gt;

&lt;p&gt;This is the year we stop treating context as optional documentation and start treating it as infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Context Engineering Pivot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
Something shifted in 2025. The industry stopped talking about "prompt engineering" and started talking about "&lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;context engineering&lt;/a&gt;."&lt;/p&gt;

&lt;p&gt;Andrej Karpathy &lt;a href="https://x.com/karpathy/status/1937902205765607626?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;called it&lt;/a&gt; "the delicate art and science of filling the context window with just the right information for each step." &lt;a href="https://www.technologyreview.com/2025/11/05/1127477/from-vibe-coding-to-context-engineering-2025-in-software-development/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;MIT Technology Review&lt;/a&gt; documented the transition from "vibe coding" to systematic context management. &lt;a href="https://developers.googleblog.com/architecting-efficient-context-aware-multi-agent-framework-for-production/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Google's December release&lt;/a&gt; of their Agent Development Kit was entirely focused on context architecture.&lt;/p&gt;

&lt;p&gt;The terminology change matters. "Prompt" implies a single instruction you craft carefully. "Context" implies an entire information environment you engineer deliberately.&lt;/p&gt;

&lt;p&gt;And it turns out most organizations have been engineering that environment with all the care of a teenager cleaning their room by shoving everything under the bed.&lt;/p&gt;

&lt;p&gt;David Lanstein, CEO of Atolio, put it bluntly in &lt;a href="https://www.ibm.com/think/news/ai-tech-trends-predictions-2026?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;IBM's 2026 predictions&lt;/a&gt;: "The solution isn't bigger models, but smarter data. True value will come from feeding models high-quality, permission-aware structured data to generate intelligent, relevant and trustworthy answers."&lt;/p&gt;

&lt;p&gt;The race for bigger context windows missed the point. A 200K token context window filled with undifferentiated garbage produces undifferentiated garbage outputs, just with more confident citations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Context Actually Means&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
When I talk about context, I don't mean adding a few comments to your SQL. I mean four distinct layers that most data systems ignore entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic context&lt;/strong&gt; is what a value actually represents. Not just "this column is called revenue" but "this is recognized revenue under ASC 606, calculated monthly, excluding deferred amounts, as defined by the finance team's Q3 2025 policy update." When the definition changes, the context changes with it. When someone queries the data six months from now, they see what it meant then, not what it means today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational context&lt;/strong&gt; is the health and provenance of the data at query time. Is this number fresh? Did the upstream pipeline fail overnight? Is there an active incident affecting the source system? A dashboard that shows revenue without showing "by the way, the payment processor had a three-hour outage last night" is lying by omission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experimental context&lt;/strong&gt; is which flags and tests were active when the data was generated. Your MAU number is meaningless if you don't know that 40% of users were in an onboarding experiment that changed the activation flow. The number isn't wrong. It's just uninterpretable without the experiment metadata traveling alongside it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human context&lt;/strong&gt; is ownership and decision history. Who defined this metric? What decisions have been made based on it? Where's the design doc? When someone has a question, they shouldn't have to play archeologist in Slack to figure out who to ask.&lt;/p&gt;
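
&lt;p&gt;A useful mental model is a metric value that refuses to travel without its four layers. A minimal sketch; the class and field names here are my own, not an existing standard:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class MetricWithContext:
    """A value plus the four context layers described above.
    Illustrative shape only."""
    value: float
    semantic: dict       # what the value means (definition, policy version)
    operational: dict    # health and provenance at query time
    experimental: dict   # flags and tests active when it was generated
    human: dict          # ownership and decision history

revenue = MetricWithContext(
    value=1_240_000.0,
    semantic={"definition": "recognized revenue under ASC 606, monthly, excl. deferred"},
    operational={"fresh_as_of": "2026-01-10T23:59Z", "upstream_incidents": []},
    experimental={"active_flags": ["pricing_test_q1"]},  # hypothetical flag name
    human={"owner": "finance-data@example.com"},
)
```

&lt;p&gt;The dashboard from the opening anecdote showed only &lt;code&gt;value&lt;/code&gt;. The forty-minute argument lived entirely in the other four fields.&lt;/p&gt;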

&lt;p&gt;Most data systems capture maybe one of these. Usually the semantic layer, poorly, in a data catalog that nobody updates and fewer people read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Kubernetes Lesson I Should Have Learned Sooner&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
When I was the first product manager on Kubernetes at Google, we thought we'd solved the orchestration problem. Pods, services, deployments. State reconciliation. Declarative configuration. Ship your containers and let the scheduler figure out the rest.&lt;/p&gt;

&lt;p&gt;What we hadn't solved was context.&lt;/p&gt;

&lt;p&gt;A customer came to us wanting to run a global footprint of clusters, one per region, with synchronized jobs. Low-latency application, workloads coordinated across continents. We had an internal project called "Ubernetes" that was supposed to handle this, but the complexity was brutal. We ended up helping them build a custom solution.&lt;/p&gt;

&lt;p&gt;The problem wasn't deploying the workloads. GitOps handles that fine now. The problem was that when data crossed cluster boundaries, all the context about that data evaporated. Each cluster was internally consistent. The global system was broken because nobody knew what the data &lt;em&gt;meant&lt;/em&gt; in aggregate.&lt;/p&gt;

&lt;p&gt;I've watched the same pattern repeat across every data problem I've worked on since. The compute orchestration is largely solved. The data orchestration is still a mess, and it's a mess because context doesn't travel with the data. This is actually why I'm &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;writing a book&lt;/a&gt; on exactly this: the hidden complexity of data preparation that causes 80% of AI projects to fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why RAG Doesn't Fix This&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
The popular assumption has been that Retrieval-Augmented Generation solves the context problem. Point your model at your documents, let it retrieve what it needs, problem solved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.infoworld.com/article/4108159/how-to-build-rag-at-scale.html?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;InfoWorld's analysis last week&lt;/a&gt; explains why this breaks at scale: "RAG breaks at scale because organizations treat it like a feature of LLMs rather than a platform discipline. Models generate confidently incorrect answers because the retrieval layer returns ambiguous or outdated knowledge."&lt;/p&gt;

&lt;p&gt;The failure mode is insidious. RAG with good retrieval but no context governance produces what I've started calling "hallucination with citations." The model cites a real document. The citation is accurate. The document is from 2023 and contradicts current policy. The answer is wrong, but it looks impeccably sourced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cxtoday.com/customer-analytics-intelligence/ai-hallucinations-start-with-dirty-data-governing-knowledge-for-rag-agents/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;CX Today reported&lt;/a&gt; on exactly this pattern: "If the knowledge base is outdated, RAG just retrieves the wrong answer faster. If content is unstructured, like PDFs, duplicate docs, or inconsistent schemas, the model struggles to pull reliable context."&lt;/p&gt;

&lt;p&gt;The problem isn't retrieval. The problem is that the documents themselves carry no context about their validity, scope, or temporal bounds. A PDF is just a PDF. It doesn't know that it was superseded by a newer version, that it only applies to EMEA customers, or that the pricing section was invalidated by a board decision last quarter.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://venturebeat.com/data/six-data-shifts-that-will-shape-enterprise-ai-in-2026/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;VentureBeat declared "RAG is dead"&lt;/a&gt; in their 2026 predictions, they were being provocative. But the underlying point stands: RAG without context governance is dying. The organizations that will succeed with retrieval-augmented systems are the ones treating their knowledge bases as living, context-rich assets rather than static document dumps.&lt;br&gt;
&lt;strong&gt;The Toll Booth Is Coming&lt;/strong&gt;&lt;br&gt;
There's a harder version of this problem emerging, and most organizations haven't noticed it yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.constellationr.com/blog-news/insights/enterprise-technology-2026-15-ai-saas-data-business-trends-watch?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Constellation Research warns&lt;/a&gt; that "enterprise data tolls and API economics are going to be a headache" in 2026. Celonis is suing SAP over data access. The Information reported that Salesforce is raising prices on apps that tap into its data. Connector fees are trickling down to IT budgets.&lt;/p&gt;

&lt;p&gt;"Connection fees are going to be the new cloud egress," Constellation writes.&lt;/p&gt;

&lt;p&gt;Here's what this means: if you don't own the context layer for your own data, you'll rent it from someone else. Every vendor building "AI-ready" connectors is essentially building a context layer on top of your data and charging you for access to the meaning of information you already own.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://solutionsreview.com/ai-and-enterprise-technology-predictions-from-industry-experts-for-2026/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Solutions Review predictions roundup&lt;/a&gt; makes this explicit: "By the end of 2026, connectivity, governance, and context provisioning for AI agents will be built into every serious data platform."&lt;/p&gt;

&lt;p&gt;Built in. Not optional. Not a nice-to-have catalog project. Core infrastructure.&lt;/p&gt;

&lt;p&gt;The organizations that treat context as someone else's problem will find themselves paying tolls to access the semantic meaning of their own customer data. The ones that invest now will own that layer.&lt;br&gt;
&lt;strong&gt;Resolution #1: Ship Context With Every Event&lt;/strong&gt;&lt;br&gt;
The practical version of this starts at ingestion.&lt;/p&gt;

&lt;p&gt;Every event entering your system should carry enough metadata that a reader (human or machine) can interpret it without external lookups. Not "user_id, timestamp, action" but "user_id, timestamp, action, schema_version, experiment_flags, source_system, data_classification."&lt;/p&gt;
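&lt;p&gt;As a minimal sketch of what "ship context with every event" can look like in practice (the field names below are illustrative, not a standard):&lt;/p&gt;

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch: an event that carries its own context, so a reader
# (human or machine) can interpret it without external lookups.
@dataclass(frozen=True)
class ContextualEvent:
    user_id: str
    timestamp: str            # ISO 8601, UTC
    action: str
    schema_version: str       # which contract this event conforms to
    experiment_flags: tuple   # experiments active when the event fired
    source_system: str        # where the event originated
    data_classification: str  # e.g. "pii", "internal", "public"

event = ContextualEvent(
    user_id="u-123",
    timestamp="2026-01-06T12:00:00Z",
    action="checkout",
    schema_version="v3",
    experiment_flags=("new-pricing-page",),
    source_system="web-frontend",
    data_classification="pii",
)

# The event is self-describing: no join against an external catalog is
# needed to know its schema, provenance, or handling requirements.
print(asdict(event)["schema_version"])  # v3
```

&lt;p&gt;The point isn't the exact fields; it's that the consumer never has to guess.&lt;/p&gt;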

&lt;p&gt;This isn't aspirational. &lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Anthropic's context engineering guide&lt;/a&gt; describes exactly this pattern: maintaining lightweight identifiers that allow systems to "dynamically load data into context at runtime using tools."&lt;/p&gt;

&lt;p&gt;A transformation editor should show, live, which downstream dashboards and models will break if you drop a column. A query should surface its lineage alongside its results. A dashboard shouldn't just display a number; hovering over it should reveal the definition, the upstream tables, the freshness, and the last incident that affected it.&lt;/p&gt;
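&lt;p&gt;The "what breaks if I drop this column" check is just a walk over a lineage graph. A toy sketch, with made-up asset names:&lt;/p&gt;

```python
from collections import deque

# Toy lineage graph mapping each asset to its direct consumers.
# Asset names are invented for illustration.
lineage = {
    "raw.orders.discount": ["mart.revenue"],
    "mart.revenue": ["dash.exec_weekly", "model.churn_v2"],
    "dash.exec_weekly": [],
    "model.churn_v2": [],
}

def downstream(asset: str) -> set:
    """Breadth-first walk: everything transitively fed by this asset."""
    seen, queue = set(), deque([asset])
    while queue:
        for consumer in lineage.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# Dropping raw.orders.discount would break a mart, a dashboard, and a model.
print(sorted(downstream("raw.orders.discount")))
```

&lt;p&gt;A transformation editor that runs this query live is the difference between a warning before the deploy and an incident after it.&lt;/p&gt;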

&lt;p&gt;This requires tooling changes, yes. But mostly it requires treating context as a first-class output of every pipeline stage rather than an afterthought someone might add later.&lt;br&gt;
&lt;strong&gt;Resolution #2: Make Context the Default in AI and Agents&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://techcrunch.com/2026/01/02/in-2026-ai-will-move-from-hype-to-pragmatism/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;TechCrunch's 2026 analysis&lt;/a&gt; identifies the Model Context Protocol as "quickly becoming the standard" for agent interoperability. OpenAI and Microsoft have embraced it. Google is standing up managed MCP servers.&lt;/p&gt;

&lt;p&gt;The infrastructure for context-aware agents is arriving. The question is whether your data is ready to participate.&lt;/p&gt;

&lt;p&gt;That means storing valid_from/valid_to timestamps on policy documents. It means tagging content with scope limitations (region, customer tier, product line). It means encoding data classification and retention rules at the source, not in a compliance spreadsheet nobody maintains.&lt;/p&gt;
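&lt;p&gt;A hedged sketch of what temporal validity and scope tagging buys you, using invented document records:&lt;/p&gt;

```python
from datetime import date

# Illustrative policy documents tagged at the source with temporal
# validity and scope. Field names are assumptions, not a standard.
policies = [
    {"id": "pricing-2023", "valid_from": date(2023, 1, 1),
     "valid_to": date(2024, 12, 31), "scope": {"region": "EMEA"}},
    {"id": "pricing-2025", "valid_from": date(2025, 1, 1),
     "valid_to": None, "scope": {"region": "EMEA"}},
]

def current_policies(docs, on, region):
    """Return only documents valid on a given date for a given region."""
    return [
        d for d in docs
        if d["valid_from"] <= on
        and (d["valid_to"] is None or on <= d["valid_to"])
        and d["scope"]["region"] == region
    ]

# A retrieval layer that filters on these fields cannot cite the
# superseded 2023 document as if it were current.
print([d["id"] for d in current_policies(policies, date(2026, 1, 6), "EMEA")])
```

&lt;p&gt;Two extra columns at the source replace an entire class of "hallucination with citations" failures downstream.&lt;/p&gt;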

&lt;p&gt;&lt;a href="https://hai.stanford.edu/news/stanford-ai-experts-predict-what-will-happen-in-2026?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Stanford HAI's predictions&lt;/a&gt; note that "2026 will hear more companies say that AI hasn't yet shown productivity increases." The organizations that do show productivity increases will be the ones whose agents can distinguish current reality from historical noise without human intervention.&lt;/p&gt;

&lt;p&gt;An agent that refuses to take high-impact actions without verifying the environment, cohort, and guardrails is not cautious. It's correctly engineered. An agent that charges ahead on stale data with high confidence is the expensive kind of wrong.&lt;br&gt;
&lt;strong&gt;Resolution #3: Measure Time to Trustworthy Insight&lt;/strong&gt;&lt;br&gt;
I wrote about Nicole Forsgren's new book &lt;a href="https://www.distributedthoughts.org/data-engineer-productivity-forsgren/" rel="noopener noreferrer"&gt;last month&lt;/a&gt;. Her frameworks for developer productivity apply directly to data work, but with a crucial modification.&lt;/p&gt;

&lt;p&gt;For data teams, the north star isn't deployment frequency or cycle time. It's time to trustworthy insight: from raw logs or events to a result you would put in front of an executive with confidence.&lt;/p&gt;

&lt;p&gt;Most organizations can't measure this because they don't know when insight becomes trustworthy. The data arrives, transformations run, dashboards update, but confidence accrues informally. Someone senior enough eventually blesses the number based on vibes and experience.&lt;/p&gt;

&lt;p&gt;Context infrastructure makes this measurable. If every metric carries its lineage, freshness, and incident history, you can ask: how long did it take from data landing to a metric with full provenance, no upstream incidents, and a defined owner? That's the number that matters.&lt;/p&gt;
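&lt;p&gt;One possible way to operationalize the metric, under assumed definitions of "trustworthy" (full lineage, an assigned owner, no open upstream incidents):&lt;/p&gt;

```python
from datetime import datetime

def time_to_trustworthy_insight(landed_at, events):
    """Events are (timestamp, kind) pairs emitted by the platform.
    Returns hours from data landing to the first moment all three
    trust conditions hold, or None if they never do."""
    needed = {"lineage_complete", "owner_assigned", "incidents_clear"}
    seen = set()
    for ts, kind in sorted(events):
        seen.add(kind)
        if needed <= seen:  # all conditions satisfied
            return (ts - landed_at).total_seconds() / 3600
    return None

landed = datetime(2026, 1, 5, 8, 0)
hours = time_to_trustworthy_insight(landed, [
    (datetime(2026, 1, 5, 9, 0), "lineage_complete"),
    (datetime(2026, 1, 5, 10, 0), "owner_assigned"),
    (datetime(2026, 1, 6, 8, 0), "incidents_clear"),
])
print(hours)  # 24.0
```

&lt;p&gt;The specific conditions are a judgment call per organization; the point is that once context is machine-readable, the number is computable at all.&lt;/p&gt;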

&lt;p&gt;When that number shrinks, you're actually improving. When people are just shipping dashboards faster without context, you're accumulating debt.&lt;br&gt;
&lt;strong&gt;The Year We Stop Arguing About Definitions&lt;/strong&gt;&lt;br&gt;
Most New Year's resolutions fail by February. The gym membership lapses. The meditation app goes unused. The ambitious reading list gathers dust.&lt;/p&gt;

&lt;p&gt;Data resolutions fail for the same reason: they're framed as one-time efforts rather than infrastructure changes. "We'll document our metrics" becomes a Q1 project that never gets maintained. "We'll improve data quality" becomes a dashboard that nobody checks.&lt;/p&gt;

&lt;p&gt;Context isn't a project. It's a property of how data moves through your organization. It either travels with its story or it doesn't.&lt;/p&gt;

&lt;p&gt;The organizations that treat 2026 as the year context becomes default will stop having the same arguments in every meeting. The exec review becomes a discussion of strategy instead of a debate about what "active users" means. The AI agent produces answers that come with their own credibility assessment. The data team ships products instead of debugging why downstream consumers don't trust the numbers.&lt;/p&gt;

&lt;p&gt;Gartner says 60% of AI projects will fail for lack of AI-ready data. The projects that succeed will be the ones that stopped treating data as numbers and started treating it as knowledge.&lt;/p&gt;

&lt;p&gt;That's the resolution. Make the data know what it is.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to learn how intelligent data pipelines can reduce your AI costs?&lt;/em&gt; &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Check out Expanso&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. Or don't. Who am I to tell you what to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE: I'm currently writing a book called "Zen and the Art of Data Maintenance" based on what I've seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost.&lt;/strong&gt; &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;I'd love to hear your thoughts&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/2026-01-06-resolution-add-context-to-your-data/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>aiinfrastructure</category>
      <category>contextengineering</category>
      <category>dataquality</category>
    </item>
    <item>
      <title>The Natasha Problem: Why Your Data Pipeline Only Fits One Person</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Sun, 11 Jan 2026 06:11:42 +0000</pubDate>
      <link>https://forem.com/aronchick/the-natasha-problem-why-your-data-pipeline-only-fits-one-person-2gli</link>
      <guid>https://forem.com/aronchick/the-natasha-problem-why-your-data-pipeline-only-fits-one-person-2gli</guid>
      <description>&lt;p&gt;For most folks, you probably don’t think about clothing sizes. There’s a number, you pick it, you try on the clothes, and if they fit, then congrats, you’re that number. But how’d they pick that number? And why does every style/line/person fit slightly differently?&lt;/p&gt;

&lt;p&gt;Turns out there's a woman named Natasha in Los Angeles whose body is used to design jeans for seven or eight major clothing brands. She's what the industry calls a "fit model." When Levi's or H&amp;amp;M or whoever designs a new pair of jeans, they don't start with measurements of the general population. They design them on a mannequin, then bring in Natasha, and adjust everything until the jeans fit her perfectly.&lt;/p&gt;

&lt;p&gt;She's the only person in the design process with an actual human body.&lt;/p&gt;

&lt;p&gt;Every other size is mathematical extrapolation. The size 2 isn't designed for any real person. Neither is the 4, the 8, or the 12. They're all proportional adjustments from Natasha's measurements, calculated by formula. If you're not built exactly like Natasha, your clothes don't actually fit you. They fit a mathematical projection of you derived from someone else's body.&lt;/p&gt;

&lt;p&gt;This is, &lt;a href="https://www.youtube.com/watch?v=zvbwSV6dz9c&amp;amp;ref=distributedthoughts.org" rel="noopener noreferrer"&gt;as Radiolab's Heather Radke discovered&lt;/a&gt;, why the dressing room feels like a personal judgment. We've internalized the idea that clothes should fit, and when they don't, we assume the problem is our body. But the clothes were never designed for our bodies. They were designed for one body, then scaled with arithmetic.&lt;/p&gt;

&lt;p&gt;I kept thinking about this because it's exactly how we build our IT infrastructure.&lt;br&gt;
&lt;strong&gt;The Ruth O'Brien Problem&lt;/strong&gt;&lt;br&gt;
The Natasha situation has a history. In the 1930s, a woman named Ruth O'Brien at the Bureau of Home Economics tried to solve the sizing problem scientifically. She hired "measuring squads" through the WPA to travel the country and measure women's bodies across twenty-six different dimensions: elbow to wrist, thigh girth, heel length, and on and on.&lt;/p&gt;

&lt;p&gt;She was going to create the definitive dataset of American women's bodies.&lt;/p&gt;

&lt;p&gt;The resulting dataset became the basis for women's clothing sizes for decades. But there never was an “average” person. It was a statistical fiction.&lt;br&gt;
&lt;strong&gt;Your Pipeline Has a Fit Model Too&lt;/strong&gt;&lt;br&gt;
Every data pipeline has its own Natasha. Usually it's the data from headquarters, or the first customer deployment, or whatever clean dataset was available when the system was designed.&lt;/p&gt;

&lt;p&gt;I've watched this pattern play out dozens of times. A team builds an ETL pipeline that works beautifully on their development data, where the schema is clean, the timestamps are consistent, and the sensor readings arrive in perfect intervals. They deploy to production, and 40% of their edge sites start throwing errors.&lt;/p&gt;

&lt;p&gt;The problem isn't that the edge data is wrong. The problem is that the pipeline was designed for one shape of data, then mathematically extrapolated to handle everything else.&lt;/p&gt;

&lt;p&gt;Consider what happens with ML training data. &lt;a href="https://paperswithcode.com/dataset/imagenet?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;ImageNet&lt;/a&gt; became the de facto standard for computer vision benchmarks. Models trained on ImageNet achieve remarkable accuracy on ImageNet test sets. Deploy those same models to a factory floor, and they struggle with the lighting, the angles, the dust on camera lenses, the specific way that this particular production line differs from the curated images in the training set.&lt;/p&gt;

&lt;p&gt;The model was fit to ImageNet. Everything else is extrapolation.&lt;/p&gt;

&lt;p&gt;Or look at timestamp handling. A pipeline designed on data from a single timezone assumes UTC normalization is someone else's problem. Works fine until you're ingesting from devices across twelve timezones, some of which handle daylight saving time transitions differently, some of which have clock drift, and one of which is running firmware from 2019 that uses a different epoch.&lt;/p&gt;
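&lt;p&gt;A minimal sketch of doing that normalization at ingestion, assuming each device reports its UTC offset (many don't, which is the point):&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

def to_utc(ts: str, utc_offset_minutes: int) -> datetime:
    """Parse a naive device timestamp and pin it to UTC using the
    device's reported offset. Offsets here are illustrative."""
    naive = datetime.fromisoformat(ts)
    tz = timezone(timedelta(minutes=utc_offset_minutes))
    return naive.replace(tzinfo=tz).astimezone(timezone.utc)

# Two devices, same wall-clock string, different realities:
a = to_utc("2026-01-06T12:00:00", 0)     # a UTC-configured device
b = to_utc("2026-01-06T12:00:00", -480)  # a device on UTC-8
print((b - a) == timedelta(hours=8))  # True: eight hours apart in reality
```

&lt;p&gt;A pipeline that assumes the string alone is enough silently merges those two readings into one timeline.&lt;/p&gt;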

&lt;p&gt;The pipeline wasn't wrong. It was designed for one reality and scaled mathematically to others.&lt;br&gt;
&lt;strong&gt;The Measurement Squads Never Left&lt;/strong&gt;&lt;br&gt;
Ruth O'Brien's original sin was the belief that you could measure a population once, derive a standard, and apply it forever.&lt;/p&gt;

&lt;p&gt;We do this constantly with data.&lt;/p&gt;

&lt;p&gt;A team defines a schema based on current requirements. They encode assumptions about data types, nullable fields, value ranges, and relationships. Then they treat that schema as ground truth, and any data that doesn't conform is "dirty" and needs to be "cleaned."&lt;/p&gt;

&lt;p&gt;But the data isn't dirty. The data is reality. The schema is the idealized projection that reality keeps refusing to match.&lt;/p&gt;

&lt;p&gt;I saw a project once where the sensor format had been standardized across the organization with beautiful documentation and clear specifications. Every new deployment was supposed to conform. In practice, about 60% of edge sites had made local modifications: different firmware versions, custom calibration routines, integration with legacy equipment that predated the standard by a decade.&lt;/p&gt;

&lt;p&gt;The central data team spent enormous effort "fixing" the non-conforming data with transformations to coerce it into the standard shape, imputation for missing fields, and interpolation for different sampling rates.&lt;/p&gt;

&lt;p&gt;By the time the data reached the analytics layer, it had been mathematically adjusted to fit a shape it never had. The "cleaned" data was a fiction, no more real than a size 2 extrapolated from Natasha's measurements.&lt;br&gt;
&lt;strong&gt;Bodies Resist Standardization. So Does Data.&lt;/strong&gt;&lt;br&gt;
The reality is that bodies cannot be forced into interchangeable parts. The entire apparatus of industrial manufacturing assumes standardization. You take raw materials, process them into uniform components, and assemble them into identical products. It works for cars and electronics, but it fundamentally doesn't work for human bodies.&lt;/p&gt;

&lt;p&gt;The closer you get to where data is generated, the more specific and contextual it becomes. A sensor on a drilling rig in the Permian Basin produces readings shaped by the specific geology, equipment age, and operational patterns of that site. A sensor in the North Sea produces data shaped by entirely different conditions. Both might be "pressure readings," but they're not interchangeable.&lt;/p&gt;

&lt;p&gt;The centralization assumption says: bring all the data to one place, normalize it, and then analyze. This works if the data is genuinely similar. It falls apart when the normalization process destroys the very information you needed.&lt;br&gt;
&lt;strong&gt;The Dressing Room Moment&lt;/strong&gt;&lt;br&gt;
Radke describes the experience of trying on clothes that don't fit as a moment of internalized judgment. The clothes were never designed for your body, but you blame yourself anyway. The sizing system has convinced you that "normal" is a real thing, and you're the deviation.&lt;/p&gt;

&lt;p&gt;Data teams have the same experience. The pipeline breaks on edge cases, and the team treats it as a data quality problem to be solved with more transformation logic, more coercion, more normalization, more effort to force reality into the shape the system expected.&lt;/p&gt;

&lt;p&gt;But the pipeline was designed for one type of data. Everything else is extrapolation. When the extrapolation fails, that's information. It's telling you that your model of reality was incomplete.&lt;/p&gt;

&lt;p&gt;The question isn't "how do we clean this data to fit our schema?" The question is "why did we assume all data would look like our development dataset?"&lt;/p&gt;

&lt;p&gt;When "Norma," the 1940s statue of the statistically average American woman, was put to the test in a look-alike contest, Martha Skidmore was the closest match out of 3,864 women, and she still wasn't Norma. She was a real person, with a real body, that happened to approximate a statistical fiction slightly better than 3,863 others.&lt;/p&gt;

&lt;p&gt;Your edge data isn't defective. It's real. The pipeline is the fiction.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/2026-01-08-the-natasha-problem/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>dataquality</category>
      <category>distributedsystems</category>
      <category>mlinfrastructure</category>
    </item>
    <item>
      <title>A Picture Is Worth Ten Thousand Tokens</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Tue, 30 Dec 2025 00:34:11 +0000</pubDate>
      <link>https://forem.com/aronchick/a-picture-is-worth-ten-thousand-tokens-11ng</link>
      <guid>https://forem.com/aronchick/a-picture-is-worth-ten-thousand-tokens-11ng</guid>
      <description>&lt;p&gt;"A picture is worth a thousand words" has been greeting-card wisdom for a century, the kind of thing we nod along to while understanding it metaphorically because images convey emotion, show rather than tell, and bypass the limitations of language. What we didn't expect was for this to become literally, computationally true.&lt;/p&gt;

&lt;p&gt;DeepSeek released a &lt;a href="https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;paper this fall&lt;/a&gt; that made a lot of people rethink what we know about LLM efficiency. At its core, one of the findings seems obviously wrong until you work through the math: if you render text onto an image and have a vision-language model decode it, you can achieve 97% accuracy while using one-tenth the tokens. Take a document with 1,000 text tokens, turn it into an image, and the model can reconstruct that text using just 100 vision tokens. For those who aren’t researchers, this seems insane: you’re telling me that words, represented by simple characters, are HARDER to make sense of than the same words represented by pixels? Nuts.&lt;/p&gt;

&lt;p&gt;What this reveals under the hood could be pretty foundational. If a computationally "heavy" modality turns out to be the efficient one while the computationally "light" modality is actually the expensive one, then we’ve been thinking about a lot of things wrong. And, according to the paper, it looks like we have been.&lt;br&gt;
&lt;strong&gt;The Architecture That Makes This Work&lt;/strong&gt;&lt;br&gt;
DeepSeek-OCR pairs two components: an encoder called DeepEncoder (about 380M parameters) and a decoder based on their 3B MoE model with 570M active parameters. The encoder is the interesting part because it chains together a SAM-base model for local perception using window attention, a 16x convolutional compressor, and a CLIP-large model for global understanding in a sequence that exploits how these different attention mechanisms scale.&lt;/p&gt;

&lt;p&gt;The key insight is that window attention (that’s the thing the model “looks” at to predict the right answer) processes lots of tokens cheaply because it only looks at local neighborhoods, which means you can afford to have thousands of tokens at that stage without blowing up your compute budget. Then the compressor aggressively reduces token count before the expensive global attention (where it compares its predictions with everything else) kicks in so you're only paying the quadratic attention cost on the compressed representation rather than the full input.&lt;/p&gt;

&lt;p&gt;Feed it a 1024x1024 image and you get 4,096 patch tokens from the initial segmentation, but after compression that becomes just 256 vision tokens entering the decoder, and a 640x640 image yields only 100 tokens.&lt;/p&gt;
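&lt;p&gt;The arithmetic checks out if you assume 16x16 pixel patches and the 16x compression stage described above:&lt;/p&gt;

```python
# Back-of-envelope check of the token counts reported in the paper,
# assuming 16x16 pixel patches and 16x convolutional compression.
def vision_tokens(side_px, patch=16, compression=16):
    patches = (side_px // patch) ** 2  # patch tokens from segmentation
    return patches, patches // compression  # tokens entering the decoder

patches_1024, tokens_1024 = vision_tokens(1024)
patches_640, tokens_640 = vision_tokens(640)
print(patches_1024, tokens_1024)  # 4096 patch tokens -> 256 vision tokens
print(patches_640, tokens_640)    # 1600 patch tokens -> 100 vision tokens
```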

&lt;p&gt;The magic isn't in any single component but in recognizing that compression can happen inside the pipeline rather than fighting against the text representation after the fact. And because the computational characteristics of vision encoders (cheap local processing followed by expensive global processing on a much smaller token set) happen to be more favorable than the characteristics of text transformers (expensive global processing on the full token count from the start), DeepEncoder is built specifically to exploit that gap.&lt;br&gt;
&lt;strong&gt;The Numbers That Matter&lt;/strong&gt;&lt;br&gt;
On their Fox benchmark testing documents with 600-1,300 text tokens, the results show a graceful degradation curve that suggests we're not just getting lucky on easy cases: at 100 vision tokens (compression ratio around 7-10x) they hit 87-98% OCR precision depending on document complexity, and at 64 vision tokens (pushing toward 20x compression) precision drops to 59-96%, which is still surprisingly usable for applications where you need the gist rather than perfect fidelity.&lt;/p&gt;

&lt;p&gt;On OmniDocBench, a practical document parsing benchmark, DeepSeek-OCR with 100 vision tokens beats GOT-OCR2.0 which uses 256 tokens, and with fewer than 800 vision tokens it outperforms MinerU2.0 which averages over 6,000 tokens per page. In production they're processing 200,000+ pages per day on a single A100-40G, which isn't a research demo but a training data pipeline running at scale.&lt;br&gt;
&lt;strong&gt;Why This Works (And Why It's Counterintuitive)&lt;/strong&gt;&lt;br&gt;
We've built our mental models around text as the native format for language understanding, and for good reason: text is what LLMs were designed for, text is lightweight, text is structured, and vision is the bolt-on capability we added later for multimodal tasks. But attention mechanisms don't care about our intuitions because they care about sequence length, and attention scales quadratically with sequence length, which means a document with 5,000 tokens pays the O(n²) cost across all 5,000 tokens regardless of how "lightweight" we think text ought to be.&lt;/p&gt;
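&lt;p&gt;A back-of-envelope comparison of pairwise attention interactions makes the gap concrete; the window size here is illustrative, not the paper's exact configuration:&lt;/p&gt;

```python
# Pairwise attention interactions scale as n^2 with sequence length.
# Numbers are illustrative only, not measured FLOPs.
def pairwise(n):
    return n * n

full_text = pairwise(5000)  # global attention over 5,000 text tokens

# Windowed stage: 4,096 patch tokens processed in local windows of 256
# tokens each, then global attention over a 256-token compressed
# representation (the assumed window size is for illustration).
windows = 4096 // 256
windowed_stage = windows * pairwise(256)
global_stage = pairwise(256)
vision_path = windowed_stage + global_stage

# The quadratic cost is only ever paid on small sequences.
print(full_text, vision_path, full_text // vision_path)
```

&lt;p&gt;Under these toy numbers the text path pays over 20x more pairwise interactions, which is the shape of the argument even if the constants differ in practice.&lt;/p&gt;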

&lt;p&gt;Vision encoders, particularly modern ones with the window-then-global architecture DeepSeek uses, have fundamentally different computational characteristics because you're essentially buying cheap local processing at the window attention stage and only paying the quadratic cost on a much smaller token count after compression, which means the image isn't a burden you're adding to the model but a compression layer that happens to be more efficient than operating directly on text tokens. This is counterintuitive until you remember that efficiency depends on the compute path, not just the data size.&lt;br&gt;
&lt;strong&gt;The Memory Decay Proposal&lt;/strong&gt;&lt;br&gt;
The paper includes a fascinating speculation in their discussion section that I think deserves more attention than the OCR results themselves. They draw a parallel between human memory decay over time, visual perception degradation over distance, and text compression at different resolutions, and their proposal is elegant: for multi-turn conversations, render older dialogue turns as images at progressively lower resolutions.&lt;/p&gt;

&lt;p&gt;Recent context stays high-resolution (their "Gundam" mode, 800+ tokens) while older context gets progressively downscaled to Large mode for yesterday's conversation, Base mode for last week, and Small or Tiny for anything older.&lt;/p&gt;

&lt;p&gt;The information doesn't disappear but compresses into something lossy and gist-preserving, which mirrors something real about how memory actually works: you don't remember conversations from a year ago at the same fidelity as conversations from an hour ago, but the information is still there in some form, accessible if you need it but not consuming the same cognitive resources as recent experience.&lt;/p&gt;
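&lt;p&gt;A sketch of the decay policy: the mode names follow the paper, but the age thresholds are invented for illustration:&lt;/p&gt;

```python
# Hypothetical sketch of the paper's memory-decay idea: older context
# gets re-rendered at progressively lower resolution. Mode names follow
# the paper; the age thresholds here are assumptions.
def mode_for_age(age_days: float) -> str:
    if age_days < 1:
        return "Gundam"   # recent context, highest resolution (800+ tokens)
    if age_days < 2:
        return "Large"    # yesterday's conversation
    if age_days < 7:
        return "Base"     # last week
    return "Small/Tiny"   # older: lossy but gist-preserving

history = [0.1, 1.5, 3, 30]  # conversation ages in days
print([mode_for_age(d) for d in history])
```

&lt;p&gt;The result is a gradient of fidelity rather than a hard truncation cliff.&lt;/p&gt;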

&lt;p&gt;The engineering implication is that instead of choosing between "keep full context" and "truncate old context," you get a spectrum where the context window becomes a memory system with natural decay characteristics built into the architecture itself, giving you a gradient rather than a cliff.&lt;br&gt;
&lt;strong&gt;The Distributed Systems Angle&lt;/strong&gt;&lt;br&gt;
There's a pattern here that feels familiar from data infrastructure because the optimal representation for processing isn't always the optimal representation for storage or transmission. We compress video for streaming then decompress for playback, we convert data to columnar formats for analytics even though row-oriented formats are more natural for transactional workloads, we build materialized views that trade storage for query performance, and we ship compute to data when moving the data would be more expensive than moving the code.&lt;/p&gt;

&lt;p&gt;What DeepSeek is demonstrating is that text-to-image-to-text can be a legitimate processing pipeline, not because images are somehow "better" but because the computational characteristics of the vision encoder path happen to be more favorable for certain workloads and the transformation overhead pays for itself in reduced attention costs. This is compute-over-data thinking applied to tokens themselves: instead of asking "how do we process this text efficiently," you ask "what representation makes the compute most efficient, and is the transformation cost worth it?"&lt;/p&gt;

&lt;p&gt;For long documents, the answer might genuinely be to render to image, process visually, and reconstruct text.&lt;br&gt;
&lt;strong&gt;What This Means&lt;/strong&gt;&lt;br&gt;
DeepSeek is careful to call this "early-stage work that requires further investigation," and they're right to be cautious because OCR is a specific task where you have ground-truth text to measure against and general language understanding is considerably harder to evaluate. But the direction is suggestive in ways that go beyond document processing.&lt;/p&gt;

&lt;p&gt;If you're building systems that process long documents, analyze historical conversations, or maintain persistent context across sessions, the architecture you're probably using (longer context windows, sliding attention, memory banks, retrieval augmentation) might be competing with an approach that seems absurd on its face: just turn the text into pictures. The optimal representation for text processing in an LLM might not be text, which is either a profound insight about the nature of these systems or a temporary artifact of current architectures that will disappear when we build better text processing.&lt;/p&gt;

&lt;p&gt;I genuinely don't know which, but when the counterintuitive result has solid empirical backing, that's usually where the interesting questions live.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to learn how intelligent data pipelines can reduce your AI costs?&lt;/em&gt; &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Check out Expanso&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. Or don't. Who am I to tell you what to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost.&lt;/strong&gt; &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;I'd love to hear your thoughts&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/a-picture-is-worth-ten-thousand-tokens/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiinfrastructure</category>
      <category>llm</category>
      <category>visionmodels</category>
      <category>efficiency</category>
    </item>
    <item>
      <title>NVIDIA Bought the Bouncer: SchedMD and Where Lock-In Actually Lives</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Mon, 29 Dec 2025 00:37:14 +0000</pubDate>
      <link>https://forem.com/aronchick/nvidia-bought-the-bouncer-schedmd-and-where-lock-in-actually-lives-2a0i</link>
      <guid>https://forem.com/aronchick/nvidia-bought-the-bouncer-schedmd-and-where-lock-in-actually-lives-2a0i</guid>
      <description>&lt;p&gt;On December 15, 2025, &lt;a href="https://blogs.nvidia.com/blog/nvidia-acquires-schedmd/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;NVIDIA acquired SchedMD&lt;/a&gt;, a 40-person company based in Lehi, Utah. The price wasn't disclosed, the press release emphasized a commitment to open source, and most coverage focused on NVIDIA’s expanding software portfolio, thereby missing the point entirely. Most folks missed how huge this was.&lt;/p&gt;

&lt;p&gt;SchedMD maintains &lt;a href="https://slurm.schedmd.com/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Slurm&lt;/a&gt;, the workload manager running on 65% of the TOP500 supercomputers, including more than half of the top 10 and more than half of the top 100. Every time a researcher submits a training job, every time an ML engineer queues a batch inference run, every time a national lab allocates compute for a simulation, there's a decent chance Slurm is deciding which GPUs actually run it.&lt;/p&gt;

&lt;p&gt;Everyone's been watching the CUDA moat. Judah Taub's recent &lt;a href="https://judahtaub.substack.com/p/the-startup-escape-plan-for-cuda?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Substack piece&lt;/a&gt; frames it perfectly: the programming model as the source of lock-in, with five potential escape routes ranging from OpenAI's Triton to Google's TPUs to AMD's ROCm to Modular's Mojo to Tenstorrent's RISC-V approach. All of which are valid competitive threats.&lt;/p&gt;

&lt;p&gt;But NVIDIA, to their credit, saw through the programming model debates and identified one of the key ways to accelerate the scale-out. They bought the bouncer.&lt;br&gt;
&lt;strong&gt;What Slurm Actually Does&lt;/strong&gt;&lt;br&gt;
If you've never submitted a job to an HPC cluster, Slurm is invisible infrastructure, and that's intentional. Researchers type &lt;code&gt;sbatch my_training_job.sh&lt;/code&gt; and their code runs on GPUs. Still, how those GPUs get allocated, when the job actually starts, which nodes handle which portions of distributed training, how competing jobs get prioritized, whether your experiment runs tonight or next Tuesday—that's all Slurm.&lt;/p&gt;
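&lt;p&gt;For anyone who hasn't seen one, a batch script is mostly declarations. A minimal, hypothetical example (the job name, partition, and resource requests are all illustrative):&lt;/p&gt;

```shell
#!/bin/bash
# Hypothetical Slurm batch script: resource needs are declared up front
# in #SBATCH directives, and Slurm decides when and where it all runs.
#SBATCH --job-name=train-model      # label that shows up in squeue
#SBATCH --partition=gpu             # which queue (partition) to submit to
#SBATCH --nodes=2                   # distributed training across two nodes
#SBATCH --gres=gpu:4                # four GPUs per node
#SBATCH --time=12:00:00             # wall-clock limit before the job is killed
#SBATCH --output=train_%j.log       # %j expands to the assigned job ID

# srun launches the command on every node Slurm allocated to this job.
srun python train.py --epochs 50
```

&lt;p&gt;Submit it with &lt;code&gt;sbatch&lt;/code&gt;, and everything after that, including queueing, placement, and priority, is Slurm's problem.&lt;/p&gt;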

&lt;p&gt;The &lt;a href="https://slurm.schedmd.com/overview.html?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;formal description&lt;/a&gt; sounds almost TOO basic: "allocating exclusive and/or non-exclusive access to resources, providing a framework for starting, executing, and monitoring work, and arbitrating contention for resources by managing a queue of pending jobs."&lt;/p&gt;

&lt;p&gt;The reality is that Slurm is the layer that translates organizational policy into compute allocation. This includes things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fair-share scheduling across research groups&lt;/li&gt;
&lt;li&gt;Priority overrides for deadline-sensitive projects&lt;/li&gt;
&lt;li&gt;Resource limits that prevent any single user from monopolizing a cluster&lt;/li&gt;
&lt;li&gt;Preemption policies that balance throughput and responsiveness&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Slurm_Workload_Manager?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Hilbert curve scheduling&lt;/a&gt; that optimizes for network topology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And lots more. Or just launching a job without requiring SSH!&lt;/p&gt;

&lt;p&gt;Every organization running Slurm has encoded its resource management philosophy into its configuration over years of tuning, with institutional knowledge baked into partition definitions and quality of service policies, accounting systems tied to grants and budgets, and user training built around Slurm commands. This isn't a program you swap out over a weekend.&lt;br&gt;
&lt;strong&gt;Why Slurm Won&lt;/strong&gt;&lt;br&gt;
Slurm wasn’t the obvious choice. When &lt;a href="https://slurm.schedmd.com/slurm_ug_2010/01-keynote.pdf?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;development began at Lawrence Livermore National Laboratory in 2001&lt;/a&gt;, the HPC world ran on proprietary schedulers: PBS (Portable Batch System) had variants everywhere, IBM's LoadLeveler dominated their ecosystem, Quadrics RMS handled specialized clusters, and Platform Computing's LSF (Load Sharing Facility) served enterprise HPC.&lt;/p&gt;

&lt;p&gt;LLNL wanted something different because they were moving from proprietary supercomputers to commodity Linux clusters and needed a resource manager that could scale to tens of thousands of nodes, remain highly portable across architectures, and stay open source. The &lt;a href="https://www.schedmd.com/about-schedmd/slurm-history/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;2002 first release&lt;/a&gt; was deliberately simple, and the name originally stood for "Simple Linux Utility for Resource Management" (the acronym was later dropped, though the Futurama reference remained).&lt;/p&gt;

&lt;p&gt;What happened next is a case study in how open source wins infrastructure markets.&lt;/p&gt;

&lt;p&gt;PBS fragmented into OpenPBS, Torque, and PBS Pro (now Altair), with each fork diluting the community and scattering innovation; organizations that chose PBS had to pick a variant, and none had the whole community behind it. Then LSF went commercial when IBM acquired Platform Computing in 2012. While enterprise support is valuable, licensing costs matter when you're scaling to thousands of nodes, which made the open-source alternative increasingly attractive. Then Grid Engine's ownership bounced between Sun Microsystems, Oracle, and Univa, with each transition eroding community trust as development priorities shifted with corporate strategy.&lt;/p&gt;

&lt;p&gt;Slurm stayed focused on one codebase with GPLv2 licensing that couldn't be closed and a plugin architecture that let organizations customize without forking. And in 2010, Morris Jette and Danny Auble, the lead developers, left LLNL to form SchedMD (&lt;a href="https://en.wikipedia.org/wiki/SchedMD" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/SchedMD&lt;/a&gt;), creating a commercial support model that kept the software free while funding continued development—the Red Hat playbook, applied to HPC scheduling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hyperionresearch.com/product/slurm-remains-top-resource-manager/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Hyperion Research data&lt;/a&gt; from 2023 shows that 50% of HPC sites use Slurm, while the next closest, OpenPBS, sits at 18.9%, PBS Pro at 13.9%, and LSF at 10.6%. The gap isn't closing, and it's widening.&lt;br&gt;
&lt;strong&gt;The Two-Door Strategy&lt;/strong&gt;&lt;br&gt;
In parallel with all that noise, NVIDIA wasn’t sitting around flat-footed.&lt;/p&gt;

&lt;p&gt;In April 2024, NVIDIA &lt;a href="https://blogs.nvidia.com/blog/runai/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;acquired Run:ai&lt;/a&gt; for approximately $700 million. Run:ai builds Kubernetes-based GPU orchestration, and if Slurm is how supercomputers and traditional HPC clusters manage GPU workloads, Run:ai is how cloud-native organizations do the same thing on Kubernetes—different paradigms serving the same function, and NVIDIA now owns the scheduling layer for both.&lt;/p&gt;

&lt;p&gt;Run:ai handles the world that emerged from containers and microservices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organizations running on GKE, EKS, or on-prem Kubernetes clusters&lt;/li&gt;
&lt;li&gt;Data science teams whose workflows are built around &lt;a href="https://jupyter.org/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Jupyter notebooks&lt;/a&gt;, &lt;a href="https://www.kubeflow.org/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Kubeflow&lt;/a&gt;, and &lt;a href="https://mlflow.org/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Companies that think in pods and deployments rather than batch queues and node allocations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Slurm handles the world that emerged from supercomputing: national labs, research universities, pharmaceutical companies running molecular dynamics, financial firms running risk simulations, organizations where HPC predates the cloud, and where "scale" means dedicated clusters with thousands of nodes.&lt;/p&gt;

&lt;p&gt;Both roads lead to GPUs, and NVIDIA now controls traffic on both.&lt;br&gt;
&lt;strong&gt;What Lock-In Actually Looks Like&lt;/strong&gt;&lt;br&gt;
Judah Taub's &lt;a href="https://judahtaub.substack.com/p/the-startup-escape-plan-for-cuda?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;CUDA analysis&lt;/a&gt; is correct that the programming model creates real lock-in, because rewriting GPU kernels for a different platform is expensive, and the ecosystem of libraries, tools, and community knowledge around CUDA represents decades of accumulated investment.&lt;/p&gt;

&lt;p&gt;But programming models can be abstracted, compilers translate, and compatibility layers exist. PyTorch runs on AMD GPUs via ROCm, JAX runs on TPUs, and the code you write doesn't have to be tied permanently to CUDA, even if the transition has friction.&lt;/p&gt;

&lt;p&gt;Orchestration creates a different kind of stickiness. Your workflows are encoded in Slurm through every batch script, every job array definition, every dependency chain that says "run step B only after step A completes successfully," and that's not just code but institutional memory.&lt;/p&gt;

&lt;p&gt;Your accounting systems integrate with Slurm through reports that show department heads how their GPU allocation was used, chargeback systems that bill internal projects, and compliance logs that verify your government-funded research ran on approved infrastructure.&lt;/p&gt;

&lt;p&gt;Your users know Slurm through the commands they type without thinking, the debugging instincts for when jobs hang or fail, the training materials your HPC team developed, and the Stack Overflow answers they Google at 2 AM.&lt;/p&gt;

&lt;p&gt;Your cluster topology is optimized for Slurm's algorithms through a network configuration that aligns with Slurm’s understanding of a fat-tree topology, a partition structure that reflects your organizational hierarchy, and node groupings that balance locality and fairness.&lt;/p&gt;

&lt;p&gt;Switching schedulers isn't a recompile; it's a reorganization.&lt;br&gt;
&lt;strong&gt;The Promise and the Pattern&lt;/strong&gt;&lt;br&gt;
NVIDIA says Slurm will remain open source and vendor-neutral, and the GPLv2 license makes closing the source legally problematic anyway, so SchedMD's existing customers aren't about to get cut off.&lt;/p&gt;

&lt;p&gt;But control of the roadmap is different from control of the code.&lt;/p&gt;

&lt;p&gt;When NVIDIA prioritizes features, which hardware gets first-class Slurm support? When performance optimizations ship, which GPUs benefit most? When integrations between Slurm and the rest of NVIDIA’s software stack tighten, does the "vendor-neutral" promise mean equal optimization for AMD and Intel accelerators?&lt;/p&gt;

&lt;p&gt;The pattern exists in enterprise software: Oracle doesn't prevent you from running MySQL, Microsoft doesn't prevent you from using GitHub with non-Azure clouds, but the integration points, the polish, and the performance optimizations flow toward the owner's products.&lt;/p&gt;

&lt;p&gt;NVIDIA's official line emphasizes that Slurm "forms the essential infrastructure used by global developers, research institutions, and cloud service providers to run massive scale training infrastructure," which is true—and now NVIDIA owns that essential infrastructure.&lt;br&gt;
&lt;strong&gt;The Distributed Gap&lt;/strong&gt;&lt;br&gt;
There's a less-discussed implication in all this.&lt;/p&gt;

&lt;p&gt;Traditional HPC scheduling, whether Slurm or its competitors, assumes a particular architecture: a big, centralized cluster where jobs are scheduled across nodes, making the optimization problem one of matching jobs to resources within a unified system.&lt;/p&gt;

&lt;p&gt;This architecture works well when data and compute are co-located, with training runs pulling from high-speed parallel file systems and simulations operating on datasets staged to local storage, making the cluster a world unto itself.&lt;/p&gt;

&lt;p&gt;But increasingly, that's not the world in which organizations operate.&lt;/p&gt;

&lt;p&gt;Data sovereignty requirements mean datasets can't always move to where the GPUs are; edge deployments generate data that shouldn’t traverse networks just to run inference; federated learning needs to coordinate training across institutions without centralizing sensitive information; and multi-cloud strategies mean compute is scattered across providers, regions, and architectures.&lt;/p&gt;

&lt;p&gt;Run:ai helps with Kubernetes-based orchestration but assumes Kubernetes, while Slurm helps with HPC workloads but assumes a traditional cluster architecture. Neither solves the problem of "I have data in 50 locations, compute in 12 different configurations, and regulatory constraints that prevent me from pretending this is one big cluster."&lt;/p&gt;

&lt;p&gt;NVIDIA's acquisitions reinforce the gravitational pull toward centralization: bigger clusters, more GPUs, bring your data to us. That's a valid architecture for many workloads, and for foundation model training at hyperscale, it might be the only architecture.&lt;/p&gt;

&lt;p&gt;But it's not the only architecture that matters, and the orchestration gap for truly distributed computing remains wide open. (&lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;We&lt;/a&gt; have some thoughts if you’re interested :))&lt;br&gt;
&lt;strong&gt;What NVIDIA Actually Understood&lt;/strong&gt;&lt;br&gt;
Credit where it's due: NVIDIA read the landscape correctly.&lt;/p&gt;

&lt;p&gt;The hardware competition gets the attention, with AMD's MI300X, Intel's Gaudi, Google's TPUs, and startups raising hundreds of millions to build custom silicon, keeping everyone focused on the chip.&lt;/p&gt;

&lt;p&gt;NVIDIA looked one layer up and recognized that whoever owns the orchestration layer owns the decision about which chips run which workloads, because the scheduler doesn't just allocate resources; it also encodes assumptions about what resources exist and how they should be used.&lt;/p&gt;

&lt;p&gt;By acquiring both Slurm and Run:ai, NVIDIA ensures that, regardless of which paradigm you use (traditional HPC or cloud-native Kubernetes), the software layer that schedules your GPU workloads comes from NVIDIA, meaning alternatives to CUDA still need to run through NVIDIA's orchestration. It's like owning both the road and the traffic lights: the cars might be different, but they all stop at the same intersections.&lt;br&gt;
&lt;strong&gt;Where This Leaves Everyone Else&lt;/strong&gt;&lt;br&gt;
For organizations already running Slurm, not much changes immediately because the software remains open source, SchedMD's support contracts presumably continue, and the 40 employees who built their careers around making Slurm work are now NVIDIA employees with presumably NVIDIA resources.&lt;/p&gt;

&lt;p&gt;For organizations building alternatives to NVIDIA's hardware dominance, the landscape has grown harder: your new accelerator needs software ecosystem support, which now means either convincing NVIDIA-owned Slurm to treat your hardware as a first-class citizen or building your own orchestration layer from scratch.&lt;/p&gt;

&lt;p&gt;For anyone thinking about distributed computing that doesn't fit the cluster model, the message is clear: the major players aren't building for you, and the orchestration layer for truly distributed, heterogeneous, data-gravity-respecting deployments doesn't exist in their portfolio.&lt;/p&gt;

&lt;p&gt;That's both a challenge and an opportunity.&lt;/p&gt;

&lt;p&gt;The CUDA moat is real, but it was always visible, always discussed, always the focus of competitive energy. The orchestration moat is quieter because Slurm doesn't make headlines like GPUs do, and scheduling software isn't sexy; it's just where the actual decisions get made.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to learn how intelligent data pipelines can reduce your AI costs?&lt;/em&gt; &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Check out Expanso&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. Or don't. Who am I to tell you what to do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operations, compliance, and cost.&lt;/strong&gt; &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;I'd love to hear your thoughts&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/nvidia-bought-the-bouncer/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiinfrastructure</category>
      <category>devto</category>
    </item>
    <item>
      <title>Edge ML Has a Size Obsession</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Tue, 16 Dec 2025 00:34:14 +0000</pubDate>
      <link>https://forem.com/aronchick/edge-ml-has-a-size-obsession-34fb</link>
      <guid>https://forem.com/aronchick/edge-ml-has-a-size-obsession-34fb</guid>
      <description>&lt;p&gt;UPS could deliver your Amazon package on a cargo e-bike. In most cities, for most packages, this would actually be faster. No parking. No traffic. Straight to your door.&lt;/p&gt;

&lt;p&gt;Instead, a 16,000-pound truck idles outside your apartment building while the driver walks up three flights of stairs with an envelope containing a phone charger.&lt;/p&gt;

&lt;p&gt;It's not that UPS is stupid. The truck handles the complicated cases: bulk deliveries, heavy items, and commercial routes with 200 stops. Once you've built infrastructure for complex cases, running it for easy cases feels free. Same truck, same driver, same route. Why optimize?&lt;/p&gt;

&lt;p&gt;But "feels free" isn't free. The truck burns diesel at idle. It needs a commercial parking spot that doesn't exist. The driver spends 30% of their day not delivering packages but managing the logistics of operating a vehicle designed for a more complex problem than most stops actually present.&lt;/p&gt;

&lt;p&gt;Edge ML has the same problem. We built the infrastructure for complex cases (language models, multimodal reasoning, generative AI), and now we're using it for everything. Sensor classification? Deploy a neural network. Anomaly detection? Fine-tune a transformer. Predictive maintenance? Surely this needs deep learning.&lt;/p&gt;

&lt;p&gt;A quantized Llama 3B takes 2GB on disk and 4GB in memory. A &lt;a href="https://huggingface.co/unsloth/mistral-7b-bnb-4bit?ref=distributedthoughts.org" rel="noreferrer noopener"&gt;4-bit quantized 7B model&lt;/a&gt; still needs roughly 4GB. Want to run a 70B model? Even with aggressive quantization, you're looking at &lt;a href="https://intuitionlabs.ai/articles/local-llm-deployment-24gb-gpu-optimization?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;35GB minimum&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A scikit-learn random forest for the same classification task takes 50KB.&lt;/p&gt;
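&lt;p&gt;That gap is easy to sanity-check. A rough sketch, assuming scikit-learn is installed, with synthetic data and illustrative hyperparameters:&lt;/p&gt;

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a tabular sensor-classification problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# A deliberately small forest: a few shallow trees go a long way on
# tabular data, and the serialized artifact stays tiny.
model = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
model.fit(X, y)

blob = pickle.dumps(model)
print(f"serialized size: {len(blob) / 1024:.1f} KB")
print(f"training accuracy: {model.score(X, y):.3f}")
```

&lt;p&gt;Exact numbers vary with tree count and depth, but the artifact typically lands in the tens of kilobytes, not gigabytes.&lt;/p&gt;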

&lt;p&gt;The industry spent three years figuring out how to squeeze the truck into tighter parking spaces. Most deliveries never needed the truck.&lt;br&gt;
&lt;strong&gt;Two Mistakes, Not One&lt;/strong&gt;&lt;br&gt;
The size obsession hides two distinct problems. First: teams often reach for the wrong vehicle entirely. Second: even with the right vehicle, the route planning determines whether packages arrive.&lt;/p&gt;

&lt;p&gt;Most edge deployments handle predictive maintenance, anomaly detection, sensor classification, and quality control. These are tabular data problems. A &lt;a href="https://arxiv.org/abs/2207.08815?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;NeurIPS 2022 paper&lt;/a&gt; confirmed what practitioners already suspected: tree-based models such as XGBoost and Random Forests outperform deep learning on tabular data across 45 benchmark datasets. A study on industrial IoT found that XGBoost achieved 96% accuracy in predicting factory equipment failures, reducing downtime by 45%. &lt;a href="https://link.springer.com/chapter/10.1007/978-981-97-0975-5_20?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Random forests hit 98.5%&lt;/a&gt; on equipment failure classification.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dfrobot.com/blog-13921.html?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;TensorFlow Lite Micro&lt;/a&gt; fits in 16KB. &lt;a href="https://www.mdpi.com/2072-666X/13/6/851?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;TinyML gesture recognition&lt;/a&gt; runs at 138KB and 30 FPS. These aren't compromised models. They're right-sized for their problems. The e-bike, not the truck.&lt;/p&gt;

&lt;p&gt;But here's what matters more: whether you're dispatching e-bikes or trucks, you need routes that work. And route planning is why &lt;a href="https://www.edge-ai-vision.com/2025/12/why-edge-ai-struggles-towards-production-the-deployment-problem/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;70% of Industry 4.0 AI projects stall in pilot&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The models work in demos. Deployment breaks them.&lt;br&gt;
&lt;strong&gt;The Orchestration Gap&lt;/strong&gt;&lt;br&gt;
In the early days of Kubrenetes, we made a version of this mistake. We thought container scheduling was the hard part. The hard part was everything after scheduling: networking, storage, observability, updates, rollbacks. The entire operational lifecycle.&lt;/p&gt;

&lt;p&gt;Edge ML is learning this lesson now. Where MLOps ends with a packaged model, orchestration begins. And orchestration is where edge ML goes to die.&lt;/p&gt;

&lt;p&gt;Think about what makes delivery logistics hard. It's not the vehicles. It's coordinating thousands of them across changing conditions. And when your vehicles (or, in our case, your model artifacts) are too big for the job, everything else gets harder too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model staleness&lt;/strong&gt; hits regardless of model size. &lt;a href="https://queue.acm.org/detail.cfm?id=3733702&amp;amp;ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Edge models, once deployed, might not be frequently updated&lt;/a&gt;. A classifier trained on 2024 patterns doesn't recognize 2025 anomalies. Rolling out updates across thousands of devices is nontrivial, whether you're pushing 50KB or 5GB. This is the equivalent of route maps that don't know about new neighborhoods. Your drivers show up, but they can't find the addresses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fleet heterogeneity&lt;/strong&gt; compounds everything. Devices don't update uniformly. You end up managing &lt;a href="https://ibm.github.io/data-science-best-practices/edge_deployment.html?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;fragmented fleets&lt;/a&gt; in which different nodes run different model versions with varying capabilities. Cloud deployments update in minutes. Edge deployments take weeks, sometimes months. Some devices never update at all. Imagine dispatching trucks, vans, and bikes from the same warehouse without a unified system that tracks which vehicle has which capabilities. Version skew creates subtle bugs that only manifest at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Energy constraints&lt;/strong&gt; create hard limits that benchmarks ignore. &lt;a href="https://queue.acm.org/detail.cfm?id=3733702&amp;amp;ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Thermal throttling&lt;/a&gt; kicks in when you stress mobile CPUs. Even a small model running continuous inference drains batteries and generates heat. Academic papers report Joules per prediction. Users report their phone dying by 2pm. The diesel cost that accumulates invisibly on every route, whether you're carrying one envelope or a hundred boxes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network variability&lt;/strong&gt; breaks every cloud-native assumption. Traditional MLOps assumes stable, high-bandwidth connections. That assumption doesn't hold when inference pipelines need to survive outages, intermittent connectivity, or bandwidth that costs real money. What happens when your edge device goes offline for a week, then reconnects with stale models and queued data? It's like planning routes that assume every road is always open. The moment a bridge closes, your whole system breaks.&lt;br&gt;
&lt;strong&gt;The Data Pipeline Problem&lt;/strong&gt;&lt;br&gt;
This is where "feels free" really isn't free. Edge ML isn't failing because models are too big. It's failing because &lt;a href="https://www.researchgate.net/publication/396921005_Edge-to-Cloud_ETL_Pipelines_Integrating_IoT_and_Enterprise_Data_Streams?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;data pipelines weren't designed for bidirectional flow&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Delivery networks learned this decades ago. Packages flow out from warehouses, but returns flow back. Damage reports flow back. Delivery confirmations flow back. Reverse logistics are just as important as forward logistics, and often harder.&lt;/p&gt;

&lt;p&gt;The traditional ML assumption:&lt;br&gt;
&lt;code&gt;Edge Device → Cloud → Inference → Response&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;What edge ML actually needs:&lt;br&gt;
&lt;code&gt;Edge Device ↔ Local Inference ↔ Selective Sync ↔ Model Updates ↔ Back to Edge&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This bidirectionality creates problems most teams don't anticipate. As &lt;a href="https://ibm.github.io/data-science-best-practices/edge_deployment.html?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;IBM's edge deployment guide&lt;/a&gt; notes: "In an edge deployment scenario, there is no direct reason to send production data to the cloud. This may create the issue that you'll never receive it, and you cannot check the accuracy of your training data. Generally, your training data will not grow."&lt;/p&gt;

&lt;p&gt;Your model improves based on the data it sees. If that data never leaves the edge, your model never improves. But if all data goes to the cloud, you've rebuilt the centralized architecture you were trying to escape, with extra latency and bandwidth costs. It's like routing every return through your main distribution center instead of handling them at local hubs. Technically correct, but operationally a nightmare.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.researchgate.net/publication/396921005_Edge-to-Cloud_ETL_Pipelines_Integrating_IoT_and_Enterprise_Data_Streams?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Edge-to-cloud ETL pipelines&lt;/a&gt; are emerging as critical infrastructure. They need real-time ingestion, adaptive transformation, graceful degradation when connectivity fails, and respect for data sovereignty constraints. A 50KB model and a 5GB model face identical challenges here. The pipeline doesn't care about parameter count, just like the route doesn't care whether you're driving a truck or riding a bike.&lt;br&gt;
&lt;strong&gt;What Actually Works&lt;/strong&gt;&lt;br&gt;
The teams succeeding with edge ML have stopped optimizing vehicles and started optimizing routes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tiered inference&lt;/strong&gt; separates quick decisions from complex reasoning. &lt;a href="https://www.rtinsights.com/closing-the-latency-gap-real-time-decision-making-at-the-point-of-data-creation/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Vector search at the edge runs in 5-10ms&lt;/a&gt; using in-memory indexes. No GPU required. Simple classifications and caching happen locally. Complex reasoning routes selectively when network allows. This is the e-bike for last-mile delivery, the truck for bulk warehouse transfers. Match the vehicle to the delivery, not the other way around.&lt;/p&gt;
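&lt;p&gt;The routing logic itself can be almost trivially simple. A toy sketch, where the threshold and both "models" are illustrative stand-ins:&lt;/p&gt;

```python
# Sketch of tiered inference: answer locally when the cheap model is
# confident, and escalate to the heavy remote model only when it isn't.
CONFIDENCE_THRESHOLD = 0.85   # illustrative cutoff

def local_model(features):
    """Tiny on-device classifier stub: returns (label, confidence)."""
    score = sum(features) / len(features)
    if score >= 0.5:
        return "anomaly", score
    return "normal", 1.0 - score

def remote_model(features):
    """Stand-in for an expensive cloud call (large model, big ensemble)."""
    return "anomaly" if sum(features) >= 2.0 else "normal"

def classify(features):
    label, confidence = local_model(features)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "edge"                 # fast path: never leaves the device
    return remote_model(features), "cloud"   # slow path: escalate

print(classify([0.99, 0.98, 0.97]))  # confident, so handled at the edge
print(classify([0.5, 0.6, 0.4]))     # ambiguous, so escalated to the cloud
```

&lt;p&gt;The design point is that the threshold, not the model, decides how much traffic ever touches the network.&lt;/p&gt;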

&lt;p&gt;&lt;strong&gt;Edge MLOps mirrors&lt;/strong&gt; replicate minimal cloud capabilities locally. When the network disappears, edge nodes still manage model lifecycle, handle updates from local cache, and queue telemetry for later sync. This approach acknowledges what cloud-native architectures ignore: networks fail. Devices go offline. The question isn't whether your deployment loses connectivity. It's whether it keeps working when it does. Local dispatch centers that function when headquarters goes dark.&lt;/p&gt;
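&lt;p&gt;The core of such a mirror is just a disciplined buffer. A minimal sketch, where the class and method names are hypothetical:&lt;/p&gt;

```python
from collections import deque

class EdgeSyncBuffer:
    """Queues telemetry while offline; flushes everything on reconnect."""

    def __init__(self, capacity=1000):
        # Bounded queue: the oldest records drop first if an outage runs long.
        self.pending = deque(maxlen=capacity)
        self.online = False

    def record(self, event):
        self.pending.append(event)
        if self.online:
            self.flush()

    def flush(self):
        sent = []
        while self.pending:
            sent.append(self.pending.popleft())  # a real node would POST here
        return sent

    def reconnect(self):
        self.online = True
        return self.flush()  # drain everything queued during the outage

buf = EdgeSyncBuffer()
buf.record({"temp": 71})   # offline: queued locally
buf.record({"temp": 93})
print(len(buf.pending))    # two events waiting
print(buf.reconnect())     # network is back: both events sync
```

&lt;p&gt;The same pattern applies to model updates flowing the other way: cache them, apply what you have, reconcile when connectivity returns.&lt;/p&gt;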

&lt;p&gt;&lt;strong&gt;Data locality as first principle&lt;/strong&gt; means processing where data lives, not where servers are convenient. &lt;a href="https://medium.com/@sanjay.mohindroo66/edge-computing-unleashed-how-it-leaders-must-gear-up-for-2025s-real-time-revolution-896c7e980774?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;By 2025, over 50% of enterprise data will be processed at the edge&lt;/a&gt;, up from 10% in 2021. This shift is already happening in manufacturing, retail, healthcare, and logistics. Organizations adapting successfully treat edge deployment as first-class infrastructure, building &lt;a href="https://telefonicatech.com/en/blog/edge-computing-and-the-future-of-distributed-ai?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;intelligent data orchestration&lt;/a&gt; that moves compute to data rather than data to compute. Deliver from the nearest warehouse, not the central hub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Selective synchronization&lt;/strong&gt; solves the training data problem. Not all edge data needs to reach the cloud, but representative samples do. Anomalies do. Edge cases that challenged local models do. Smart filtering at the edge, with policies that adapt based on model confidence and data novelty, keeps training pipelines fed without overwhelming bandwidth or centralized storage. Send back the damage reports. Don't send back confirmation that every package arrived fine.&lt;/p&gt;
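&lt;p&gt;A filtering policy like that fits in a few lines; the thresholds and record fields here are illustrative:&lt;/p&gt;

```python
# Sketch of selective synchronization: anomalies, low-confidence
# predictions, and occasional representative samples go upstream;
# routine confident traffic stays on the device.
UNCERTAINTY_FLOOR = 0.10   # sync anything the model was unsure about
SAMPLE_EVERY = 100         # plus one-in-N representative samples

def should_sync(record, index):
    uncertainty = 1.0 - record["confidence"]
    if record.get("anomaly"):
        return True                       # always send the damage reports
    if uncertainty >= UNCERTAINTY_FLOOR:
        return True                       # model struggled: useful for retraining
    return index % SAMPLE_EVERY == 0      # occasional representative sample

records = [
    {"confidence": 0.99, "anomaly": False},  # routine: stays local
    {"confidence": 0.62, "anomaly": False},  # uncertain: goes upstream
    {"confidence": 0.97, "anomaly": True},   # anomaly: goes upstream
]
synced = [r for i, r in enumerate(records, start=1) if should_sync(r, i)]
print(f"{len(synced)} of {len(records)} records sent upstream")
```

&lt;p&gt;In practice the thresholds would adapt over time, but even this static version keeps the training pipeline fed without shipping every confirmation that a package arrived fine.&lt;/p&gt;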

&lt;p&gt;This is exactly why we built &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Expanso&lt;/a&gt; around data orchestration rather than model serving. The model isn't the bottleneck, whether it's a 50KB decision tree or a 4GB quantized LLM. The bottleneck is getting the right data to the right place at the right time, coordinating updates across heterogeneous fleets, and maintaining observability when half your nodes are intermittently connected. Our approach treats edge nodes as first-class participants in data pipelines, not afterthoughts bolted onto cloud architectures. Route planning, not vehicle engineering.&lt;br&gt;
&lt;strong&gt;Where This Is Heading&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.itprotoday.com/cloud-computing/cloud-edge-computing-trends-and-predictions-2025-from-industry-insiders?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;$378 billion&lt;/a&gt; in projected edge computing spending by 2028. IDC expects edge AI deployments to grow at 35% CAGR over the next three years. That investment isn't going into building better trucks. The quantization problem is largely solved. The money is going into the logistics layer that makes edge deployment actually work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stlpartners.com/articles/edge-computing/50-edge-computing-companies-2025/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Federated learning&lt;/a&gt; is moving from research curiosity to production requirement. It's the only practical way to improve models from edge data without centralizing that data, solving the training feedback loop that IBM's guide warned about. Standardized edge-cloud orchestration protocols are emerging to simplify deployment across heterogeneous environments. The security surface is expanding dramatically as AI distributes across thousands of devices rather than sitting in secured data centers.&lt;/p&gt;

&lt;p&gt;The companies navigating this successfully aren't the ones with the smallest vehicles or the fastest engines. They're the ones who recognized early that vehicle optimization was table stakes, not competitive advantage. The hard problems were always about fleet management, route planning, package tracking, and graceful degradation when conditions change.&lt;br&gt;
&lt;strong&gt;The Right Questions&lt;/strong&gt;&lt;br&gt;
Not "how do I compress my neural network to fit on edge hardware?"&lt;/p&gt;

&lt;p&gt;Start with "what's the simplest model that solves my actual problem?" For sensor data, that's often a decision tree. Kilobytes, not gigabytes. &lt;a href="https://arxiv.org/abs/2207.08815?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Proven to outperform neural networks&lt;/a&gt; on tabular data. For language tasks, yes, you need transformers. &lt;a href="https://arxiv.org/html/2411.09944v1?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Adobe's SlimLM&lt;/a&gt; shows what's possible: 125M-1B parameters, document assistance on smartphones.&lt;/p&gt;

&lt;p&gt;Then ask "can my infrastructure actually deploy and maintain this?" Can you push updates to a fragmented fleet? Can your edge nodes operate when disconnected? Does your data pipeline support bidirectional flow? Can you monitor inference quality across thousands of distributed nodes?&lt;/p&gt;

&lt;p&gt;The size obsession missed the point twice: once by reaching for complex models when simple ones work better, and again by focusing on compression when deployment was the actual bottleneck.&lt;/p&gt;

&lt;p&gt;UPS isn't going to start delivering envelopes on e-bikes anytime soon. The truck infrastructure exists. The routes are planned. The drivers are trained. Switching has costs.&lt;/p&gt;

&lt;p&gt;But if you're building edge ML from scratch, you get to choose. You can build the truck fleet because trucks are what serious logistics companies use. Or you can look at what you're actually delivering and pick the right vehicle for the job.&lt;/p&gt;

&lt;p&gt;A 50KB model that deploys beats a 50MB model that doesn't. But even an e-bike needs a route that works.&lt;/p&gt;

&lt;p&gt;The edge isn't where ML projects go to die. It's where the logistics need to grow up.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to learn how intelligent data pipelines can reduce your AI costs?&lt;/em&gt; &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Check out Expanso&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. Or don't. Who am I to tell you what to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operational, compliance, and cost concerns. &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;I'd love to hear your thoughts&lt;/a&gt;!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/edge-ml-size-obsession/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>edgecomputing</category>
      <category>machinelearning</category>
      <category>tinyml</category>
      <category>orchestration</category>
    </item>
    <item>
      <title>Emergence vs. Engineering: The Industry Just Bet Against the God Model</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Sat, 13 Dec 2025 18:09:33 +0000</pubDate>
      <link>https://forem.com/aronchick/emergence-vs-engineering-the-industry-just-bet-against-the-god-model-1oo9</link>
      <guid>https://forem.com/aronchick/emergence-vs-engineering-the-industry-just-bet-against-the-god-model-1oo9</guid>
      <description>&lt;p&gt;Monday, OpenAI, Anthropic, Google, Microsoft, and AWS jointly donated their agent infrastructure to the Linux Foundation. If any of them actually believed a single model would achieve AGI in 2-3 years, this would be the dumbest move in corporate history.&lt;/p&gt;

&lt;p&gt;You don't standardize the plumbing when you're about to build God.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Agentic AI Foundation&lt;/a&gt; launched with three projects: Anthropic's Model Context Protocol (MCP) for connectivity, Block's goose for execution, and OpenAI's AGENTS.md for instructions. Together they form a complete stack for building &lt;em&gt;composable&lt;/em&gt; AI systems, many specialized tools working through standard interfaces.&lt;/p&gt;

&lt;p&gt;This isn't a technical footnote. It IS a recognition that no one is going to be able to do it all themselves.&lt;/p&gt;

&lt;p&gt;For many MANY years, we've tried to engineer general intelligence from first principles. The results are impressive but bounded. This week, you could argue, the AI industry quietly bet on a different approach: letting intelligence emerge from simpler components.&lt;br&gt;
The Physics Problem&lt;br&gt;
Tim Dettmers published "&lt;a href="https://timdettmers.com/2025/12/10/why-agi-will-not-happen/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Why AGI Will Not Happen&lt;/a&gt;" the day after the MCP announcement. His argument is remarkably clear.&lt;/p&gt;

&lt;p&gt;"Computation isn't abstract. It happens in silicon, constrained by the speed of light, thermodynamics, and the square-cube law. Moving global information to local neighborhoods scales quadratically with distance. Memory becomes more expensive relative to compute as transistors shrink. "If you want to produce 10 exaflops on a chip, you can do that easily," Dettmers writes, "but you will not be able to service it with memory."&lt;/p&gt;

&lt;p&gt;GPUs maxed out their performance-per-dollar around 2018. The gains since then came from one-off features: 16-bit precision, Tensor Cores, HBM, 8-bit quantization, 4-bit inference. Those tricks are exhausted. Dettmers estimates maybe one or two more years of meaningful scaling improvements before we hit the wall.&lt;/p&gt;

&lt;p&gt;The transformer architecture itself is already near physically optimal. There doesn't appear (BUT I HAVE BEEN WRONG MANY TIMES BEFORE) to be a clever redesign waiting in the wings to unlock another order of magnitude.&lt;/p&gt;

&lt;p&gt;Superintelligence? Fantasy. Recursive self-improvement still obeys scaling laws. An AI improving itself faces the same diminishing returns as engineers improving it externally. You're filling gaps in capability, not extending the frontier.&lt;/p&gt;

&lt;p&gt;If you can't engineer your way to general intelligence through scale, what's the alternative?&lt;/p&gt;

&lt;p&gt;The same thing that produced intelligence in nature: emergence.&lt;br&gt;
More Is Different&lt;br&gt;
In 1972, physicist Philip Anderson published "&lt;a href="https://www.science.org/doi/10.1126/science.177.4047.393?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;More is Different&lt;/a&gt;" in Science. It became one of the most cited papers in complexity research and helped establish the Santa Fe Institute.&lt;/p&gt;

&lt;p&gt;Anderson's argument was profound: reductionism doesn't imply constructionism. You can break a system down into its fundamental parts, but you cannot rebuild complex behavior by assembling those parts. "At each new level of complexity," he wrote, "entirely new properties appear."&lt;/p&gt;

&lt;p&gt;Consciousness isn't hiding in neurons. Traffic patterns don't exist in individual cars. The economy isn't a property of any single transaction. These phenomena &lt;em&gt;emerge&lt;/em&gt; from interactions between simpler components, and they can't be predicted or engineered from first principles.&lt;/p&gt;

&lt;p&gt;This isn't mysticism. It's how complex systems actually work.&lt;/p&gt;

&lt;p&gt;The Santa Fe Institute defines emergence as "properties at one scale that are not present at another scale." Complex adaptive systems share common features: many agents, each intelligent and adaptive within their domain, none possessing complete information about the whole. Global patterns arise from local interactions without central control.&lt;/p&gt;

&lt;p&gt;You don't engineer emergence. You create conditions for it.&lt;br&gt;
The Ant Colony Test&lt;br&gt;
Deborah Gordon at Stanford has spent decades studying ant colonies. Her description of individual ants is memorable: "I probably wouldn't hire them."&lt;/p&gt;

&lt;p&gt;And yet collectively, ants build complex nests, find food sources efficiently, coordinate defense, and adapt to changing environments. Zero central control. The queen doesn't manage; she reproduces. As Gordon puts it, "Tasks allocate workers, rather than a manager allocating tasks to workers."&lt;/p&gt;

&lt;p&gt;The mechanism is stigmergy: coordination through environmental modification. Ants leave pheromone trails that influence other ants' behavior. Simple rules at the individual level (follow strong trails, lay pheromones when successful) produce sophisticated collective intelligence.&lt;/p&gt;
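&lt;p&gt;The loop is small enough to sketch. Here's a deterministic toy (invented parameters, not a model of real ants): two trails start equal, the shorter trail completes more trips per step and so gets more pheromone, and evaporation forgets stale information. The colony "decides" without any ant deciding.&lt;/p&gt;

```python
# Toy stigmergy: reinforcement proportional to pheromone share,
# scaled by trip rate (1/length), with evaporation each step.
pheromone = {"short": 1.0, "long": 1.0}
length = {"short": 1, "long": 3}  # the short trail allows 3x the round trips

for step in range(200):
    total = sum(pheromone.values())
    for trail in pheromone:
        # ants choose trails in proportion to pheromone; successful
        # trips lay more pheromone on the trail they used
        pheromone[trail] += (pheromone[trail] / total) * (1.0 / length[trail])
        pheromone[trail] *= 0.99  # evaporation forgets stale trails

print(pheromone["short"] > pheromone["long"])  # the colony converged on "short"
```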

&lt;p&gt;Gordon draws the parallel explicitly: "In many ways, understanding the behavior of ant colonies could teach us about the way billions of relatively simple neurons work together in our brains."&lt;/p&gt;

&lt;p&gt;The brain follows the same pattern. Neurons aren't conscious. They fire or don't fire based on local inputs. Consciousness emerges from billions of these simple interactions. There's no central "intelligence unit" directing traffic, no homunculus watching the show.&lt;/p&gt;

&lt;p&gt;Decentralized control. Simple rules. Local interactions producing global behavior. Resilience through redundancy. Adaptation without central planning.&lt;/p&gt;

&lt;p&gt;This is how nature builds intelligence. Not by engineering a god, but by enabling a swarm.&lt;br&gt;
The Pattern Repeats&lt;br&gt;
The internet works the same way.&lt;/p&gt;

&lt;p&gt;David Clark's 1988 paper on &lt;a href="https://dl.acm.org/doi/10.1145/52325.52336?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;DARPA's design philosophy&lt;/a&gt; reveals remarkably minimal assumptions: the network can transport a datagram with reasonable, not perfect, reliability. That's it. Everything else emerges from endpoints following simple protocols.&lt;/p&gt;

&lt;p&gt;TCP/IP split responsibility deliberately. Keep IP simple and flexible. Push complexity to the edges. "Fate-sharing" means intelligent endpoints, dumb pipes. The result: a decentralized system that scaled beyond anyone's imagination and survives failures that would destroy centralized alternatives.&lt;/p&gt;

&lt;p&gt;Unix philosophy follows the same template. Ken Thompson and Doug McIlroy: "Make each program do one thing well. Expect the output of every program to become the input to another." Small tools, standard interfaces, emergent capability from composition.&lt;/p&gt;

&lt;p&gt;Nobody said "let's build one giant program that does everything." That was the mainframe mentality, and it lost.&lt;/p&gt;

&lt;p&gt;I watched this pattern win with Kubernetes. We didn't build bigger VMs. We built smaller containers with standard interfaces and let orchestration handle the complexity. The sophisticated behavior emerged from composition, not from engineering a monolith.&lt;br&gt;
What MCP Actually Means&lt;br&gt;
The MCP donation makes sense through this lens.&lt;/p&gt;

&lt;p&gt;With 97 million monthly SDK downloads and adoption by Claude, ChatGPT, Gemini, Microsoft Copilot, Cursor, and VS Code, MCP has become TCP/IP for AI agents: the standard protocol for connecting models to tools, data, and services.&lt;/p&gt;

&lt;p&gt;David Soria Parra, MCP's lead maintainer: "The main goal is to have enough adoption in the world that it's the de facto standard."&lt;/p&gt;

&lt;p&gt;Nick Cooper from OpenAI: "We need multiple protocols to negotiate, communicate, and work together to deliver value for people, and that sort of openness and communication is why it's not ever going to be one provider, one host, one company."&lt;/p&gt;

&lt;p&gt;Read that again. OpenAI's own engineer saying it's not ever going to be one company.&lt;/p&gt;

&lt;p&gt;When your fiercest competitors agree on a protocol, they're hedging. They're building for a world where no single system wins. They're betting on emergence over engineering.&lt;br&gt;
The Honest Assessment&lt;br&gt;
This doesn't mean AI won't be transformative. It means the path isn't "scale until AGI."&lt;/p&gt;

&lt;p&gt;It's: build composable tools, let emergence do the heavy lifting.&lt;/p&gt;

&lt;p&gt;Dettmers contrasts the US "winner-take-all" philosophy (betting everything on frontier models) with China's "economic diffusion" approach, which integrates AI capabilities throughout the economy. The diffusion strategy doesn't require AGI. It requires useful, composable tools that produce emergent value when combined.&lt;/p&gt;

&lt;p&gt;The MCP ecosystem is infrastructure for exactly this. Specialized agents handling narrow tasks, connected through standard protocols, producing collective intelligence that no individual component possesses.&lt;/p&gt;

&lt;p&gt;Ant colonies. Neural networks. The internet. Unix. Kubernetes. Now AI agents.&lt;/p&gt;

&lt;p&gt;The pattern keeps winning because it's how complexity actually works.&lt;br&gt;
The Kicker&lt;br&gt;
Fifty years ago, Philip Anderson argued that you can't construct complexity from simple parts through pure engineering. Emergence requires different tools, different thinking. You don't build intelligence; you create conditions for it to arise.&lt;/p&gt;

&lt;p&gt;This week, the AI industry admitted he was right.&lt;/p&gt;

&lt;p&gt;When OpenAI, Anthropic, Google, Microsoft, and AWS all agree on something, pay attention. They're not building for a world where one model solves everything. They're building for emergence.&lt;/p&gt;

&lt;p&gt;The god model was always a fantasy.&lt;/p&gt;

&lt;p&gt;The swarm is real.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/emergence-vs-engineering/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aiinfrastructure</category>
      <category>ai</category>
      <category>emergence</category>
      <category>mcp</category>
    </item>
    <item>
      <title>🎄 On the First Day of Debugging: The Twelve Characters of Christmas</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Fri, 12 Dec 2025 06:12:14 +0000</pubDate>
      <link>https://forem.com/aronchick/on-the-first-day-of-debugging-the-twelve-characters-of-christmas-4gem</link>
      <guid>https://forem.com/aronchick/on-the-first-day-of-debugging-the-twelve-characters-of-christmas-4gem</guid>
      <description>&lt;p&gt;🎵 On the first day of debugging, production gave to me:&lt;br&gt;
An emoji that broke awk's field count tree 🎵&lt;br&gt;
A Holiday Horror Story&lt;br&gt;
Friday morning. Coffee in hand. You commit a documentation change before the holiday break. You've decided to get EXTRA cool, nothing fancy, just adding some friendly emoji to your metrics. The GitHub Actions workflow fails:&lt;br&gt;
Error: Invalid format '100'&lt;br&gt;
Expected: number&lt;br&gt;
Got: string '100'&lt;/p&gt;

&lt;p&gt;Four hours later (goodbye, early weekend), you've discovered that &lt;code&gt;&amp;amp;#x1f4ca;&lt;/code&gt; breaks &lt;code&gt;awk &amp;amp;apos;{print $3}&amp;amp;apos;&lt;/code&gt; because emoji count as fields and your clever metric extraction just imploded.&lt;/p&gt;

&lt;p&gt;Welcome to production, where every character is a potential landmine wrapped in festive paper.&lt;/p&gt;

&lt;p&gt;This holiday season, let me gift you the knowledge of &lt;strong&gt;The Twelve Characters of Christmas&lt;/strong&gt;. Twelve special characters that will ruin your week, test your patience, and teach you why pipelines are terrible at telling you what's actually wrong.&lt;/p&gt;

&lt;p&gt;Think of this as your technical advent calendar. Behind each door: a character that breaks things in fascinating ways.&lt;br&gt;
🎁 The First Gift: The Emoji 📊&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; Emojis have variable width.&lt;/p&gt;

&lt;p&gt;The problem wasn't the emoji itself; it was the assumption that field extraction works on visual spacing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# What we see:
📊 Metric: 100%

# What awk sees with '{print $3}':
Field 1: 📊
Field 2: Metric:
Field 3: 100%

# After adding emoji:
📊 📈 Metric: 100%

# Now awk sees:
Field 1: 📊
Field 2: 📈
Field 3: Metric:
Field 4: 100%  # Wrong field! 🎁
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;br&gt;
String length lies (&lt;code&gt;&amp;amp;quot;&amp;amp;#x1f4ca;&amp;amp;quot;.length&lt;/code&gt; = 2 in JavaScript, not 1)&lt;br&gt;
Byte count ≠ character count&lt;br&gt;
Sorting alphabetically becomes... interesting&lt;br&gt;
Every string operation you thought you understood is now probabilistic&lt;br&gt;
&lt;strong&gt;The fix:&lt;/strong&gt; &lt;code&gt;awk &amp;amp;apos;{print $NF}&amp;amp;apos;&lt;/code&gt; always grabs the last field.&lt;br&gt;
🎁 The Second Gift: The Zero-Width Joiner ‍&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; Authentication bypass wrapped in invisibility.&lt;/p&gt;

&lt;p&gt;These characters are invisible glue between emoji, but they work anywhere:&lt;br&gt;
"hello" vs "he‍llo"&lt;/p&gt;

&lt;p&gt;Those look identical. They're not. The second has U+200D (zero-width joiner) between the 'e' and 'l'.&lt;/p&gt;
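&lt;p&gt;You can see the mismatch, and the fix, in a few lines. Unicode puts format characters like the zero-width joiner in category "Cf", so stripping that category at trust boundaries is one defensible sketch:&lt;/p&gt;

```python
import unicodedata

a = "hello"
b = "he\u200dllo"  # zero-width joiner hiding between 'e' and 'l'

print(a == b)          # False, though both render as "hello"
print(len(a), len(b))  # 5 6

# Format characters (Unicode category "Cf") are invisible; remove them
# before comparing identifiers such as usernames.
def strip_format_chars(s):
    return "".join(c for c in s if unicodedata.category(c) != "Cf")

print(strip_format_chars(b) == a)  # True
```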

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user_input = "admin‍"  # has ZWJ at end
if user_input == "admin":  # nope! 🎁
    grant_access()

# Also fails:
db.query("SELECT * FROM users WHERE username = ?", user_input)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Authentication bypass via invisible character. Your WAF can't see it. Your logs look fine. Your security audit finds nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The discovery:&lt;/strong&gt; Someone's "username not found" ticket escalates to a database investigation revealing two "identical" usernames with different bytes.&lt;br&gt;
🎁 The Third Gift: Right-to-Left Override ‮&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; A trojan horse with a bow on top.&lt;/p&gt;

&lt;p&gt;Unicode includes directionality controls. U+202E reverses text rendering:&lt;br&gt;
Filename: "image.txt‮gpj.evil"&lt;br&gt;
Displays as: "image.txtlive.jpg"&lt;br&gt;
Actual bytes: "image.txt[RLO]gpj.evil"&lt;/p&gt;
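&lt;p&gt;Defensively, you can flag any filename containing a bidi control character before trusting how it renders. A minimal sketch:&lt;/p&gt;

```python
import unicodedata

name = "image.txt\u202egpj.evil"  # contains a right-to-left override

# The Unicode bidi control characters; any of these in a filename
# means what you see is not what you get.
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
                 "\u2066", "\u2067", "\u2068", "\u2069"}

suspicious = [f"U+{ord(c):04X} ({unicodedata.name(c)})"
              for c in name if c in BIDI_CONTROLS]
print(suspicious)  # ['U+202E (RIGHT-TO-LEFT OVERRIDE)']
```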

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your file viewer shows a JPG. Your security scanner checks JPG extensions. Your pipeline processes a JPG. What actually executes? A file whose real extension is &lt;code&gt;.evil&lt;/code&gt; 🎁&lt;/p&gt;

&lt;p&gt;This isn't theoretical—&lt;a href="https://trojansource.codes/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Trojan Source attacks&lt;/a&gt; use this for malicious code injection that passes code review because it &lt;strong&gt;looks&lt;/strong&gt; fine on screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The discovery:&lt;/strong&gt; Your image processor starts executing arbitrary code. Merry Christmas!&lt;br&gt;
🎁 The Fourth Gift: The Null Byte \0&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; The gift of dual realities.&lt;/p&gt;

&lt;p&gt;In C, &lt;code&gt;\0&lt;/code&gt; terminates strings. In everything built on C (which is everything), null bytes create two parallel universes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# What you check:
filename = "safe.txt\0../../etc/passwd"
if filename.endswith(".txt"):  # True! 🎁
    process_file(filename)

# What actually happens:
# String terminates at \0
# Opens: "safe.txt"... or does it?
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- What you think you're querying:
SELECT * FROM files WHERE name = 'safe.txt\0'

-- What the SQL parser sees:
-- String terminated, rest ignored 🎁
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;SQL injection's sneaky cousin. Your input validation passes. Your database dies. Your logs show "safe.txt" and nothing suspicious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The discovery:&lt;/strong&gt; After your security audit, when the penetration tester shows you what they did.&lt;br&gt;
🎁 The Fifth Gift: The BOM ﻿&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; The invisible file prefix that breaks everything.&lt;/p&gt;

&lt;p&gt;UTF-8 files sometimes start with U+FEFF (byte order mark). It's invisible in most editors. It destroys your scripts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash
echo "Hello World"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Looks fine. Won't execute:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./script.sh
bash: ./script.sh: /bin/bash: bad interpreter: No such file or directory 🎁
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The shebang is actually &lt;code&gt;#&amp;amp;#xfeff;!/bin/bash&lt;/code&gt;. Different bytes. Bash doesn't recognize it.&lt;/p&gt;

&lt;p&gt;CSV files with BOM? Your parser thinks the first column is named &lt;code&gt;&amp;amp;quot;&amp;amp;#xfeff;id&amp;amp;quot;&lt;/code&gt; instead of &lt;code&gt;&amp;amp;quot;id&amp;amp;quot;&lt;/code&gt;. Joins fail mysteriously. Someone suggests "just trim whitespace" and it still doesn't work because &lt;strong&gt;there is no whitespace&lt;/strong&gt;.&lt;/p&gt;
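&lt;p&gt;Python's standard library has a direct answer here: the &lt;code&gt;utf-8-sig&lt;/code&gt; codec strips a leading BOM if present and is a no-op otherwise. A short demonstration:&lt;/p&gt;

```python
data = b"\xef\xbb\xbfid,name\n1,alice\n"  # a CSV that starts with a UTF-8 BOM

header = data.decode("utf-8").split("\n")[0].split(",")[0]
print(repr(header))   # '\ufeffid' -- not 'id', so joins on "id" fail

# 'utf-8-sig' removes a leading BOM if present, leaves everything else alone.
header = data.decode("utf-8-sig").split("\n")[0].split(",")[0]
print(repr(header))   # 'id'
```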

&lt;p&gt;&lt;strong&gt;The discovery:&lt;/strong&gt; After you've checked file permissions, reinstalled bash, rebooted the server, and finally run &lt;code&gt;xxd script.sh&lt;/code&gt; to see the bytes.&lt;br&gt;
🎁 The Sixth Gift: The Soft Hyphen ­&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; The invisible line-break suggestion.&lt;/p&gt;

&lt;p&gt;U+00AD is a "suggestion to break here if needed." Invisible until line-wrapping occurs:&lt;br&gt;
"super­cali­fragi­listic" appears as "supercalifragilistic"&lt;/p&gt;

&lt;p&gt;But:&lt;br&gt;
"supercalifragilistic".includes("cali")  // true&lt;br&gt;
"super­cali­fragi­listic".includes("cali")  // false! 🎁&lt;/p&gt;
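&lt;p&gt;The same trap is easy to reproduce in Python. Note that the search only fails when the needle spans a soft hyphen; removing U+00AD before matching fixes it:&lt;/p&gt;

```python
word = "super\u00adcali\u00adfragi\u00adlistic"  # soft hyphens between syllables

print("percali" in word)                         # False: U+00AD sits between "per" and "cali"
print("percali" in word.replace("\u00ad", ""))   # True once soft hyphens are stripped
```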

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Copy-paste from a website into your pipeline, and suddenly:&lt;br&gt;
Searches failLog aggregation misses matchesDeduplication creates duplicatesYour debugging session: "I can literally SEE the string matches. Why isn't it finding it?"&lt;br&gt;
&lt;strong&gt;The discovery:&lt;/strong&gt; After you paste the same text directly into your code, and THAT works.&lt;br&gt;
🎁 The Seventh Gift: Turkish İ&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; The gift of internationalization nightmares.&lt;/p&gt;

&lt;p&gt;Turkish has four 'i' letters: i, ı, İ, I. Case conversion depends on locale:&lt;br&gt;
"file".upcase  # "FILE" in English&lt;br&gt;
"file".upcase  # "FİLE" in Turkish (tr_TR locale) 🎁&lt;/p&gt;

&lt;p&gt;"FİLE".downcase  # "fi̇le" (with combining dot)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your case-insensitive comparison just became locale-sensitive. Germans debate whether ß.upcase should be "SS" or "ẞ". Greeks have three lowercase sigmas (σ, ς, Σ).&lt;/p&gt;
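&lt;p&gt;Python's &lt;code&gt;str&lt;/code&gt; methods are locale-independent, so you can watch these edge cases directly; &lt;code&gt;casefold()&lt;/code&gt; is the aggressive form intended for comparisons:&lt;/p&gt;

```python
print("ß".upper())        # 'SS' -- one character becomes two
print("İ".lower())        # 'i' plus U+0307 combining dot above
print(len("İ".lower()))   # 2, not 1

# casefold() is the locale-independent form meant for case-insensitive matching:
print("STRASSE".casefold() == "straße".casefold())  # True
```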

&lt;p&gt;File systems differ:&lt;br&gt;
Windows: case-insensitive, case-preserving&lt;br&gt;
macOS: optionally case-sensitive&lt;br&gt;
Linux: case-sensitive always&lt;br&gt;
&lt;strong&gt;The discovery:&lt;/strong&gt; Your pipeline works in dev (macOS), breaks in staging (Linux), works differently in production (Windows containers).&lt;br&gt;
🎁 The Eighth Gift: The Combining Accent ́&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; Two ways to write the same letter.&lt;/p&gt;

&lt;p&gt;Unicode has two ways to write é:&lt;br&gt;
é  # U+00E9 (single character)&lt;br&gt;
é  # U+0065 + U+0301 (e + combining acute accent) 🎁&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Visually identical. Different bytes. macOS normalizes to decomposed (NFD). Linux doesn't (NFC).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create file on macOS:
touch café.txt  # stored as cafe\u0301.txt

# Access from Linux:
ls -l café.txt  # File not found 🎁

# Why?
$ ls -lb
-rw-r--r--  1 user  staff  0 Nov 24 10:00 cafe\314\201.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
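&lt;p&gt;The defense is to normalize to one canonical form at every boundary. With the standard library's &lt;code&gt;unicodedata.normalize&lt;/code&gt;:&lt;/p&gt;

```python
import unicodedata

nfc = "caf\u00e9"    # é as a single code point (NFC, what Linux typically stores)
nfd = "cafe\u0301"   # e + combining acute accent (NFD, what macOS stores)

print(nfc == nfd)                                # False: same glyphs, different bytes
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalizing
print(len(nfc), len(nfd))                        # 4 5
```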

&lt;p&gt;&lt;strong&gt;The discovery:&lt;/strong&gt; Your cross-platform pipeline mysteriously loses files between systems.&lt;br&gt;
🎁 The Ninth Gift: The Surrogate Pair 💩&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; Emoji that need TWO UTF-16 code units.&lt;br&gt;
"💩".length  // 2, not 1 🎁&lt;br&gt;
"💩"[0]      // � (invalid Unicode)&lt;br&gt;
"💩".substring(0, 1)  // � (broken character)&lt;/p&gt;

&lt;p&gt;// Counting is hard:&lt;br&gt;
[..."Hello 💩 World"].length  // 13 (correct)&lt;br&gt;
"Hello 💩 World".length       // 14 (wrong)&lt;/p&gt;
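&lt;p&gt;The JavaScript numbers above come from UTF-16 code units. Python counts code points instead, so the same string yields three different "lengths" depending on what you measure:&lt;/p&gt;

```python
s = "Hello 💩 World"

print(len(s))                            # 13: Python counts code points
print(len(s.encode("utf-16-le")) // 2)   # 14: UTF-16 code units (the JS .length)
print(len(s.encode("utf-8")))            # 16: bytes on the wire
```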

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your pipeline truncates strings at byte boundaries. Emoji get cut in half. JSON becomes invalid. Logs show question marks. Everyone blames the database collation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Reversing by UTF-16 code unit destroys emoji (JavaScript):
"Hello 💩 World".split("").reverse().join("")  // "dlroW �� olleH" 🎁
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The discovery:&lt;/strong&gt; When a customer complains that their emoji-filled messages look like gibberish.&lt;br&gt;
🎁 The Tenth Gift: The Homoglyph а&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; Letters that look identical but aren't.&lt;/p&gt;

&lt;p&gt;Latin 'a' (U+0061) and Cyrillic 'а' (U+0430) look identical. Different bytes:&lt;br&gt;
twitter.com  # real&lt;br&gt;
twіtter.com  # Cyrillic 'і' (U+0456) 🎁&lt;/p&gt;
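&lt;p&gt;One mitigation is a mixed-script check. This rough sketch (illustrative only, using Unicode character names rather than a full script database) flags identifiers that blend Latin and Cyrillic:&lt;/p&gt;

```python
import unicodedata

def scripts(s):
    """Rough script detection via Unicode character names (illustrative only)."""
    found = set()
    for c in s:
        name = unicodedata.name(c, "")
        if name.startswith("CYRILLIC"):
            found.add("Cyrillic")
        elif name.startswith("LATIN"):
            found.add("Latin")
    return found

print(scripts("admin"))        # {'Latin'}
print(scripts("\u0430dmin"))   # mixed Latin + Cyrillic: a homoglyph red flag
```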

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;expected = "admin"
actual = "аdmin"  # First character is Cyrillic

if actual == expected:  # False! 🎁
    grant_access()

# But to humans reading logs:
print(f"Login attempt: {actual}")  # looks like "admin"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Your domain validation passes. Your email verification passes. Your phishing attack succeeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The discovery:&lt;/strong&gt; During your security incident post-mortem.&lt;br&gt;
🎁 The Eleventh Gift: The Newline \n&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; Three standards for ending a line.&lt;br&gt;
Unix:    \n   (LF)&lt;br&gt;
Windows: \r\n (CRLF) 🎁&lt;br&gt;
Old Mac: \r   (CR)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;br&gt;
"line1\r\nline2\r\nline3".count('\n')        # 2 (wrong line count)&lt;br&gt;
len("line1\r\nline2\r\nline3".splitlines())  # 3 (right) 🎁&lt;/p&gt;

&lt;p&gt;Git helpfully converts line endings based on &lt;code&gt;.gitattributes&lt;/code&gt;. Now every line in your diff is "changed." Your PR is 10,000 lines. The actual change was one word.&lt;/p&gt;

&lt;p&gt;Your pipeline hash-checks files for integrity. Same content, different line endings, different hashes. False positive failure alert at 3am.&lt;/p&gt;
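&lt;p&gt;That hash failure is easy to demonstrate, and the fix is to normalize line endings before you hash, if "same content" is what you actually mean:&lt;/p&gt;

```python
import hashlib

unix = "line1\nline2\n"
windows = "line1\r\nline2\r\n"

# Same text to a human, different digests to your integrity check:
print(hashlib.sha256(unix.encode()).hexdigest() ==
      hashlib.sha256(windows.encode()).hexdigest())   # False

# Normalize CRLF and lone CR to LF before hashing:
def canon(s):
    return s.replace("\r\n", "\n").replace("\r", "\n")

print(hashlib.sha256(canon(windows).encode()).hexdigest() ==
      hashlib.sha256(canon(unix).encode()).hexdigest())  # True
```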

&lt;p&gt;&lt;strong&gt;The discovery:&lt;/strong&gt; When your coworker opens your file and their editor "fixes" all the line endings.&lt;br&gt;
🎁 The Twelfth Gift: The Tab \t&lt;br&gt;
&lt;strong&gt;What you unwrapped:&lt;/strong&gt; The final gift—visual width that lies.&lt;/p&gt;

&lt;h1&gt;
  
  
  These look identical:
&lt;/h1&gt;

&lt;p&gt;"key:    value"  # 4 spaces&lt;br&gt;
"key:\tvalue"    # 1 tab character 🎁&lt;/p&gt;

&lt;h1&gt;
  
  
  But:
&lt;/h1&gt;

&lt;p&gt;"key:    value".split('\t')  # ['key:    value']&lt;br&gt;
"key:\tvalue".split('\t')     # ['key:', 'value']&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your gift that keeps giving:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;YAML treats tabs and spaces differently. One is indentation. One is death:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config:
    key: value    # spaces, valid
config:
    key: value    # tab, invalid YAML 🎁
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Your config looks fine in your editor (which converts tabs to spaces). Your pipeline reads the raw file (which has tabs). Your deployment fails with "invalid YAML" and the error message points to a line that looks perfectly fine.&lt;/p&gt;
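&lt;p&gt;You can catch this before the deploy with a few lines of plain Python (no YAML library needed): scan each line's leading whitespace for tabs and report the line numbers.&lt;/p&gt;

```python
def find_tab_indentation(text):
    """Report 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), 1):
        # slice off the leading run of spaces/tabs and inspect it
        indent = line[:len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(lineno)
    return bad

config = 'config:\n    key: value\nother:\n\tkey: value\n'
print(find_tab_indentation(config))  # [4]
```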

&lt;p&gt;&lt;strong&gt;The discovery:&lt;/strong&gt; After you copy-paste the "broken" YAML into a validator and it works fine.&lt;br&gt;
🎄 What All Twelve Gifts Have in Common&lt;br&gt;
Notice what these twelve characters share:&lt;br&gt;
&lt;strong&gt;Invisible to humans&lt;/strong&gt;: Your eyes can't distinguish them&lt;br&gt;
&lt;strong&gt;ASCII-land works fine&lt;/strong&gt;: English text with basic punctuation passes every time&lt;br&gt;
&lt;strong&gt;Deterministic but unpredictable&lt;/strong&gt;: Given the character, it always fails the same way, but you have no idea which characters your pipeline will encounter&lt;br&gt;
&lt;strong&gt;Production discovery&lt;/strong&gt;: Test data is sanitized, user input is chaos&lt;br&gt;
&lt;strong&gt;Binary debugging&lt;/strong&gt;: Remove pieces until something changes, like unwrapping boxes to find which gift is broken&lt;br&gt;
This isn't fundamentally about Unicode complexity. It's about &lt;strong&gt;pipeline observability&lt;/strong&gt;.&lt;br&gt;
🎁 The Real Problem (Unwrapped)&lt;br&gt;
Your pipeline has stages:&lt;br&gt;
[Input] → [Parse] → [Transform] → [Validate] → [Store]&lt;/p&gt;

&lt;p&gt;Each stage has opinions about text encoding. None agree:&lt;br&gt;
Input accepts UTF-8&lt;br&gt;
Parse assumes ASCII&lt;br&gt;
Transform uses locale-sensitive operations&lt;br&gt;
Validate checks byte length&lt;br&gt;
Store expects UTF-8 but doesn't verify&lt;br&gt;
When something breaks, you get: &lt;code&gt;Error: Invalid format &amp;amp;apos;100&amp;amp;apos;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;What you need:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage: Parse
Input bytes: [F0 9F 93 8A 20 4D 65 74 72 69 63 3A 20 31 30 30 25]
Encoding: UTF-8
Character count: 14
Byte count: 17
Field extraction: Expected 3 fields, got 4
Problem character: U+1F4CA (📊) at position 0
Suggestion: Use byte-based parsing or normalize input
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
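&lt;p&gt;Nothing about that report is exotic. A sketch of a per-stage diagnostic (whitespace field-splitting assumed for illustration) fits in one small function:&lt;/p&gt;

```python
import unicodedata

def diagnose(raw: bytes, expected_fields: int):
    """Per-stage text diagnostics sketch (fields split on whitespace)."""
    text = raw.decode("utf-8")
    fields = text.split()
    # every non-ASCII character, with position, code point, and name
    non_ascii = [(i, f"U+{ord(c):04X}", unicodedata.name(c, "?"))
                 for i, c in enumerate(text) if ord(c) > 127]
    return {
        "byte_count": len(raw),
        "char_count": len(text),
        "field_count": len(fields),
        "field_mismatch": len(fields) != expected_fields,
        "non_ascii": non_ascii,
    }

report = diagnose("📊 Metric: 100%".encode("utf-8"), expected_fields=2)
print(report["byte_count"], report["char_count"])  # 17 14
print(report["non_ascii"][0])                      # (0, 'U+1F4CA', 'BAR CHART')
```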

&lt;p&gt;But you don't get that. You get silence until complete failure.&lt;br&gt;
🎁 Three Gifts for Better Pipelines&lt;br&gt;
Gift 1: Declare Encoding Everywhere&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Not this:
with open('file.txt') as f:
    data = f.read()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# This:
with open('file.txt', encoding='utf-8', errors='strict') as f:
    data = f.read()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
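&lt;p&gt;What strict decoding buys you is the exact byte offset of the failure. A small illustration with a deliberately corrupted record:&lt;/p&gt;

```python
bad = b"Metric: 100\xff%"   # an invalid UTF-8 byte mid-record

try:
    bad.decode("utf-8", errors="strict")
    pos = None
except UnicodeDecodeError as e:
    pos = e.start            # exact byte offset of the bad byte

print(pos, repr(bad[pos:pos + 1]))   # 11 b'\xff'

# errors="replace" hides the problem: the data corrupts silently
# into U+FFFD and nothing fails until much later.
print(repr(bad.decode("utf-8", errors="replace")))
```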

&lt;p&gt;&lt;code&gt;errors=&amp;amp;apos;strict&amp;amp;apos;&lt;/code&gt; means fail immediately on invalid bytes. Don't guess. Don't substitute. Fail with the exact byte position.&lt;br&gt;
Gift 2: Normalize at Boundaries&lt;br&gt;
import unicodedata&lt;/p&gt;

&lt;p&gt;def sanitize_input(text):&lt;br&gt;
    # Pick ONE canonical form and enforce it&lt;br&gt;
    normalized = unicodedata.normalize('NFC', text)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Remove invisibles
visible = &amp;amp;apos;&amp;amp;apos;.join(c for c in normalized 
                  if unicodedata.category(c)[0] != &amp;amp;apos;C&amp;amp;apos;)

# Verify what&amp;amp;apos;s left
try:
    visible.encode(&amp;amp;apos;ascii&amp;amp;apos;)  # Or &amp;amp;apos;utf-8&amp;amp;apos;, be explicit
except UnicodeEncodeError as e:
    raise ValueError(
        f&amp;amp;quot;Invalid character at position {e.start}: &amp;amp;quot;
        f&amp;amp;quot;{repr(visible[e.start])}&amp;amp;quot;
    )

return visible
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Gift 3: Make Intermediate State Visible&lt;br&gt;
Between pipeline stages, log:&lt;br&gt;
Byte count vs character count&lt;br&gt;
Encoding declaration&lt;br&gt;
Character categories present&lt;br&gt;
Sample of problematic characters&lt;br&gt;
Your monitoring should show:&lt;br&gt;
"Stage 2 processed 1M records, 47 contained non-ASCII, 3 contained control characters, 0 validation failures."&lt;/p&gt;

&lt;p&gt;Not: "Stage 2 complete."&lt;br&gt;
🎄 The Holiday Season Lesson&lt;br&gt;
Every data pipeline has two modes:&lt;br&gt;
&lt;strong&gt;Development&lt;/strong&gt;: All data is well-formed (like your holiday wish list)&lt;br&gt;
&lt;strong&gt;Production&lt;/strong&gt;: The real world exists (like your actual gifts)&lt;br&gt;
There's no smooth transition. Your pipeline either handles emoji, zero-width joiners, null bytes, and right-to-left overrides, or it silently corrupts data until someone notices.&lt;/p&gt;

&lt;p&gt;These twelve characters didn't cost you hours of debugging because character encoding is hard. They cost you those hours because your pipeline's feedback loop is binary: success or mysterious failure, nothing in between.&lt;/p&gt;

&lt;p&gt;Design-time testing uses sanitized data. Production sends you the real world. The gap between them is where you spend your Friday afternoon debugging why &lt;code&gt;&amp;amp;#x1f4ca;&lt;/code&gt; appears in your error message instead of going to your holiday party.&lt;/p&gt;

&lt;p&gt;🎵 &lt;em&gt;And a Pipeline That Actually Works Reliably&lt;/em&gt; 🎵&lt;br&gt;
&lt;em&gt;Happy Holidays from everyone at Expanso.io. May your deployments be clean, your pipelines be observable, and your error messages be specific.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P.S. If you're reading this during the December code freeze, during a holiday week, or while everyone else is drunk on eggnog: I'm sorry. The Further Reading section below might help. Or at least commiserate.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Further Reading (Stocking Stuffers)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://trojansource.codes/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Trojan Source: Invisible Vulnerabilities&lt;/a&gt; - RLO attacks in real code&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;The Absolute Minimum Every Software Developer Must Know About Unicode&lt;/a&gt; - Joel Spolsky&lt;/li&gt;
&lt;li&gt;&lt;a href="http://utf8everywhere.org/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;UTF-8 Everywhere&lt;/a&gt; - Why UTF-8 should be your default&lt;/li&gt;
&lt;li&gt;&lt;a href="https://unicode.org/reports/tr36/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Unicode Security Considerations&lt;/a&gt; - Official security implications&lt;/li&gt;
&lt;li&gt;&lt;a href="https://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;The Turkey Test&lt;/a&gt; - Case conversion nightmare&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Want to learn how intelligent data pipelines can reduce your AI costs?&lt;/em&gt; &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Check out Expanso&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. Or don't. Who am I to tell you what to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; I'm currently writing a book based on what I've seen of the real-world challenges of data preparation for machine learning, focusing on operations, compliance, and cost. &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;I'd love to hear your thoughts&lt;/strong&gt;&lt;/a&gt;!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/twelve-characters-of-christmas/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datapipelines</category>
      <category>debugging</category>
      <category>productionfailures</category>
      <category>observability</category>
    </item>
    <item>
      <title>Multi-Everything: Why Your Data Strategy Is Harder Than Your Cloud Strategy</title>
      <dc:creator>David Aronchick</dc:creator>
      <pubDate>Tue, 09 Dec 2025 12:13:43 +0000</pubDate>
      <link>https://forem.com/aronchick/multi-everything-why-your-data-strategy-is-harder-than-your-cloud-strategy-15hm</link>
      <guid>https://forem.com/aronchick/multi-everything-why-your-data-strategy-is-harder-than-your-cloud-strategy-15hm</guid>
      <description>&lt;p&gt;An Uber engineer gave &lt;a href="https://thenewstack.io/inside-ubers-multicloud-ai-reality-the-gap-between-data-and-compute/?ref=distributedthoughts.org" rel="noreferrer noopener"&gt;a great talk at Kubecon&lt;/a&gt; I have wanted to write about: “...we end up having to think about use cases that can either reside entirely within one cloud provider so that I can put training and serving together, or I need to think about the use cases where it makes sense to actually pull the data from one provider to another, in order to facilitate being able to leverage that compute. It doesn’t make it quite as seamless as it could be, and you have to be purposeful in how you think about what workloads you’re going to be converging together.”&lt;/p&gt;

&lt;p&gt;Two sentences. They explain why multicloud is complicated. It works! But it's not just "spread stuff everywhere and balance."&lt;/p&gt;

&lt;p&gt;And when you're thinking about it for yourself, put this in context. Uber has &lt;a href="https://www.uber.com/blog/engineering/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;dedicated platform engineering teams&lt;/a&gt;, specialized GPU infrastructure groups, and the budget to build custom observability solutions. They still struggle with this. If you're not dedicating just as many resources to these challenges, the reality of the complexity is going to hit you even harder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Conversation We've Been Having&lt;/strong&gt;&lt;br&gt;
For the past decade, the infrastructure industry focused relentlessly on making compute portable. Kubernetes runs anywhere. Containers abstract away the underlying machine. We've built elaborate systems to ensure that a workload running in AWS can, in theory, run identically in GCP or Azure.&lt;/p&gt;

&lt;p&gt;And it worked! The compute layer genuinely is more portable than it was in 2015.&lt;/p&gt;

&lt;p&gt;But what Uber was talking about here reveals an immutable truth. Compute can move PRETTY READILY between clouds because the commodity layer is portable. But when it comes to running in multiple locations, it's not the containers you need to worry about; it's the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Data Gravity Wins&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.computerweekly.com/feature/Data-gravity-What-is-it-and-how-to-manage-it?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Data gravity&lt;/a&gt;, the phenomenon where data accumulates mass and attracts applications toward it, isn't a new concept. &lt;a href="https://datagravitas.com/?ref=distributedthoughts.org" rel="noreferrer noopener"&gt;Dave McCrory&lt;/a&gt; &lt;a href="https://datagravitas.com/2010/12/07/data-gravity-in-the-clouds/?ref=distributedthoughts.org" rel="noreferrer noopener"&gt;coined the term&lt;/a&gt; back in 2010, and there have been many other great &lt;a href="https://blog.debedb.com/2016/01/05/rethinking-data-gravity/?ref=distributedthoughts.org" rel="noreferrer noopener"&gt;pieces&lt;/a&gt; covering it. But the implications have become dramatically more severe as AI workloads have grown.&lt;/p&gt;

&lt;p&gt;Uber's engineering teams maintain a data lake on one cloud provider. They run inference workloads on a different provider. Training happens somewhere else entirely. Each choice was rational in isolation; optimize for the best GPU availability here, the best storage economics there.&lt;/p&gt;

&lt;p&gt;The result? "You have to be purposeful in how you think about what workloads you're going to be converging together."&lt;/p&gt;

&lt;p&gt;That's a diplomatic way of describing a constraint that dominates every architectural decision. When considering whether a use case can leverage GPUs efficiently, the first question isn't "do we have the compute?" It's "where does the data live, and what does it cost to move it?"&lt;/p&gt;

&lt;p&gt;This isn't a Kubernetes problem. It's not even really a cloud provider problem. It's physics meeting economics.&lt;/p&gt;

&lt;p&gt;Moving a petabyte of training data from one cloud to another isn't JUST a technical challenge, it's a business calculation. &lt;a href="https://www.cloudzero.com/blog/aws-egress-costs/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Egress fees&lt;/a&gt; alone can run into six figures. AWS charges &lt;a href="https://www.nops.io/blog/aws-egress-costs-and-how-to-avoid/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;$0.09 per GB&lt;/a&gt; for the first 10 TB transferred to the internet. Do the math on a 50TB training dataset and you're looking at $4,500 just in network transfer, before you've stored anything, processed anything, or extracted a single insight.&lt;/p&gt;
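&lt;p&gt;The back-of-envelope math is easy to reproduce (a flat-rate sketch; real AWS egress pricing is tiered and changes over time):&lt;/p&gt;

```python
def egress_cost_usd(dataset_tb, rate_per_gb=0.09):
    # Flat-rate estimate: 1 TB = 1,000 GB at the published per-GB price.
    # Ignores tiered discounts above 10 TB, so treat it as an upper bound
    # on the first tier only.
    return round(dataset_tb * 1000 * rate_per_gb, 2)

egress_cost_usd(50)  # 50 TB at $0.09/GB -> 4500.0
```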

&lt;p&gt;Add latency considerations for real-time inference. Add &lt;a href="https://www.kiteworks.com/gdpr-compliance/understand-and-adhere-to-gdpr-data-residency-requirements/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;compliance requirements&lt;/a&gt; that may prohibit certain data from crossing certain boundaries. Add the simple fact that a model optimized with &lt;a href="https://developer.nvidia.com/tensorrt?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;TensorRT&lt;/a&gt; for a specific NVIDIA GPU configuration doesn't just "run" on different hardware.&lt;/p&gt;

&lt;p&gt;The container is portable. Everything the container needs is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You Don't Have Uber's Teams&lt;/strong&gt;&lt;br&gt;
Uber's response to these challenges involves building custom metrics APIs to abstract away GPU vendor differences. Their teams revamped their entire observability stack when they discovered that &lt;a href="https://github.com/google/cadvisor?ref=distributedthoughts.org" rel="noreferrer noopener"&gt;cAdvisor&lt;/a&gt;-based GPU metrics didn't support newer models. They're actively working on making GPU capacity more fungible across clusters.&lt;/p&gt;

&lt;p&gt;They have the engineering headcount to do this. &lt;a href="https://www.informatica.com/blogs/the-surprising-reason-most-ai-projects-fail-and-how-to-avoid-it-at-your-enterprise.html?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Data quality issues&lt;/a&gt; are consistently cited as a primary cause of AI-project failure, and when your data lives in multiple places with different access patterns, different compliance requirements, and different cost structures, maintaining quality becomes exponentially harder.&lt;/p&gt;

&lt;p&gt;Uber has spent 15+ years building dedicated teams to solve this problem. Most organizations are asking their existing platform engineers to figure it out alongside everything else they're already doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cluster Was the Wrong Abstraction&lt;/strong&gt;&lt;br&gt;
Uber made an observation that cuts to the heart of how we've been thinking about infrastructure: "We ended up with a Kubernetes infrastructure focused on batch and a Kubernetes infrastructure focused on microservices... The distinction has been that the hardware was segregated at the cluster level."&lt;/p&gt;

&lt;p&gt;Their teams built dedicated GPU clusters for GPU workloads. Dedicated CPU clusters for CPU workloads. The cluster became the organizing principle for hardware allocation.&lt;/p&gt;

&lt;p&gt;This made sense when clusters were the primary unit of deployment. But it created silos. GPU capacity sat isolated from CPU capacity. When GPU clusters were underutilized, that capacity couldn't easily flow to other workloads. When CPU-bound services accidentally landed on GPU nodes, expensive hardware sat wasted running authentication checks.&lt;/p&gt;

&lt;p&gt;"We've been over-indexing on a Kubernetes cluster as an abstraction for hardware," Uber noted, "rather than leveraging a lot of what we can do internally from Kubernetes itself."&lt;/p&gt;

&lt;p&gt;The cluster was supposed to abstract away infrastructure complexity. Instead, it became another boundary; another wall between resources that could, in principle, be fungible but in practice are not.&lt;/p&gt;

&lt;p&gt;If this is happening at Uber, with their dedicated platform teams and infrastructure budgets, what does your cluster architecture actually look like?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPUs Make Everything Harder&lt;/strong&gt;&lt;br&gt;
The challenges with multi-cloud compute get significantly worse when GPUs enter the picture. CPUs are MOSTLY fungible (the entire stack may not be, but it's fairly close). An x86-compatible chip in AWS behaves essentially the same as an x86-compatible chip in Azure (MOSTLY). You can pack multiple workloads onto a single CPU. Failovers are straightforward.&lt;/p&gt;

&lt;p&gt;None of this applies to GPUs.&lt;/p&gt;

&lt;p&gt;"GPU workloads aren't quite as fungible as CPU workloads. I can't as easily just dynamically pack eight workloads onto one GPU now, where I could have just squeezed things onto a single CPU."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.runpod.io/articles/comparison/choosing-a-gpu-for-training-vs-inference?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Choosing the right GPU&lt;/a&gt; for training versus inference is already complex enough when you're working with a single provider. Training requires massive compute throughput and high memory bandwidth. Inference optimizes for latency and cost-per-query. The hardware choices are fundamentally different.&lt;/p&gt;

&lt;p&gt;Now multiply that complexity across providers. &lt;a href="https://northflank.com/blog/12-best-gpu-cloud-providers?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Different cloud GPU offerings&lt;/a&gt; have different availability, different pricing models, different networking characteristics. An H100 on AWS isn't quite the same as an H100 on GCP when you factor in interconnect speeds, memory configurations, and the software stack surrounding it.&lt;/p&gt;

&lt;p&gt;And worse, disaster recovery math changes even further. With CPUs, you might provision 20% overhead for failover capacity. With GPUs, given their cost, their scarcity, and the fact that models are often optimized for specific hardware configurations, that overhead becomes genuinely painful to justify.&lt;/p&gt;

&lt;p&gt;If the workload was optimized for this GPU, with this memory configuration, using this specific NVIDIA architecture, then moving it isn't just a scheduling decision; it's potentially a retraining decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability in a Multi-Vendor World&lt;/strong&gt;&lt;br&gt;
One thing that PARTICULARLY stood out for me was observability. In Uber's case, they are exploring building their own metrics API to abstract away GPU vendor differences.&lt;/p&gt;

&lt;p&gt;Why? Because they use NVIDIA hardware but are also evaluating AMD. Each vendor exposes different metrics. Teams built dashboards around low-level cAdvisor metrics that don't even support newer GPU models. When they tried to migrate to updated metrics, they discovered the entire organization had built dependencies on the old metric set.&lt;/p&gt;

&lt;p&gt;"You're going to end up with a mix of a variety of different metrics and with nuances about what each of them means." They're now trying to build "metrics almost as an API" =&amp;gt; a platform-level abstraction that can source data from vendor-specific implementations without requiring every team to understand GPU model differences.&lt;/p&gt;

&lt;p&gt;This is the kind of problem that doesn't show up in multicloud architecture diagrams. It's the accumulated weight of real decisions made by real teams trying to get actual work done.&lt;/p&gt;

&lt;p&gt;And again: Uber has dedicated teams building custom solutions for this. What's your plan?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Real Problem: Multi-Everything&lt;/strong&gt;&lt;br&gt;
What Uber's experience reveals is that "multicloud" was always the wrong frame for this conversation.&lt;/p&gt;

&lt;p&gt;The challenge isn't running compute across multiple cloud providers. Kubernetes solved that. The challenge is that modern AI workloads exist in a multi-everything environment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-region:&lt;/strong&gt; Data generated in Europe may have &lt;a href="https://www.techtarget.com/searchcloudcomputing/definition/data-residency?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;different residency requirements&lt;/a&gt; than data generated in Asia. Training might happen in a region with GPU availability. Inference might need to happen close to users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-provider:&lt;/strong&gt; Not just AWS vs. GCP vs. Azure, but also on-premises data centers that still hold sensitive datasets, edge locations that generate real-time data, and &lt;a href="https://www.runpod.io/articles/guides/top-cloud-gpu-providers?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;specialized AI clouds&lt;/a&gt; that offer unique hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-compliance-zone:&lt;/strong&gt; Regulatory boundaries don't align with cloud provider boundaries. &lt;a href="https://www.sysdig.com/learn-cloud-native/a-guide-to-gdpr-compliance-for-containers-and-the-cloud?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;GDPR&lt;/a&gt;, HIPAA, financial regulations, and industry-specific requirements create a patchwork of constraints that have nothing to do with where your Kubernetes clusters run. &lt;a href="https://www.kiteworks.com/gdpr-compliance/understand-and-adhere-to-gdpr-data-residency-requirements/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Some EU member states&lt;/a&gt; have enacted additional residency requirements beyond GDPR for specific sectors like healthcare and public services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-format:&lt;/strong&gt; Data lakes, data warehouses, streaming platforms, feature stores, vector databases. Each optimized for different access patterns. Each with its own replication and consistency guarantees.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://odsc.medium.com/3-data-management-challenges-in-multicloud-environments-b44ac0c768db?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;Over 80% of enterprises&lt;/a&gt; with multicloud environments experience interoperability and connectivity problems. The compute layer was the easiest part of this puzzle to solve. We've been celebrating that victory while the harder problems compound.&lt;br&gt;
What Comes Next&lt;br&gt;
Uber's agentic AI workflows are still experimental, representing a minority of GPU usage. But they noted that if they "unlock some agentic workflow and put it everywhere," it would represent "a considerable increase in what they need to support with GPUs."&lt;/p&gt;

&lt;p&gt;That's the trajectory the entire industry is on. More AI workloads. More models. More demand for training and inference capacity. And every one of those workloads will inherit all the multi-everything constraints that already make enterprise architecture so complex.&lt;/p&gt;

&lt;p&gt;The industry spent a decade making compute portable. The next decade's problem is fundamentally different: making data &lt;em&gt;accessible&lt;/em&gt; without necessarily making it &lt;em&gt;mobile&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That's not a Kubernetes upgrade. It's not a new cloud service. It's a rethinking of how we architect systems when the data—not the compute—is the constraint that matters.&lt;/p&gt;

&lt;p&gt;Uber's engineers are living in that future right now, with dedicated teams and substantial budgets to figure it out. The rest of us need to start thinking about how we'll solve the same problems with a fraction of the resources.&lt;/p&gt;

&lt;p&gt;Because the data isn't going to move itself. And frankly, given the economics, you probably don't want it to.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to learn how intelligent data pipelines can reduce your AI costs?&lt;/em&gt; &lt;a href="https://expanso.io/?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Check out Expanso&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;. Or don't. Who am I to tell you what to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; I'm currently writing a book based on what I've seen of the real-world challenges of data preparation for machine learning, focusing on operations, compliance, and cost. &lt;a href="https://github.com/aronchick/Project-Zen-and-the-Art-of-Data-Maintenance?ref=distributedthoughts.org" rel="noopener noreferrer"&gt;&lt;strong&gt;I'd love to hear your thoughts&lt;/strong&gt;&lt;/a&gt;!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.distributedthoughts.org/multi-everything-data-strategy/" rel="noopener noreferrer"&gt;Distributed Thoughts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>multicloud</category>
      <category>dataarchitecture</category>
      <category>kubernetes</category>
      <category>aiinfrastructure</category>
    </item>
  </channel>
</rss>
