Forem: Thomas John

When the Cloud Fails, the Browser Still Thinks

Thomas John — Wed, 22 Apr 2026 00:12:23 +0000

Browser-native LLMs are the most underrated shift in edge AI. Here's why.

3:17 AM. North Sea. 200 kilometers from the nearest coastline.

The satellite uplink has been down since midnight. The drilling platform runs on skeleton watch. At exactly 3:17, a pressure sensor on mud pump P-3 starts drifting.

Marcus, the on-call engineer, pulls up the asset interface on his tablet. Types what he sees:

"mud pump P-3 pressure readings drifting high since 0200, vibration also slightly elevated"

Two seconds later:

Probable cause: partial blockage or liner wear
Action: reduce RPM by 15%, schedule inspection at next safe window
Escalate if: pressure exceeds 420 PSI or vibration crosses 2.4g

No spinner. No server. The satellite is still down.

The model that just assessed that fault is running on Marcus's tablet — cached since the last port call, running on the tablet's GPU, no internet required.

What's Actually Happening

Modern browsers ship with direct GPU access through an API called WebGPU. The WebLLM project uses it to run large language models — real ones, billions of parameters — entirely inside a browser tab.

Download once. Cache locally. Run on the GPU. Zero network calls per query.

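import * as webllm from "@mlc-ai/web-llm";

// Downloads the weights on first run; later loads come from the browser cache.
const engine = await webllm.CreateMLCEngine(
  "Qwen2.5-3B-Instruct-q4f32_1-MLC"
);

// engineerDescription (the operator's free-text report) and DIAGNOSTIC_TOOLS
// (the tool schemas) are assumed to be defined elsewhere in the app.
const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a drilling equipment diagnostic assistant." },
    { role: "user",   content: engineerDescription }
  ],
  tools: DIAGNOSTIC_TOOLS,
  tool_choice: "required",
  temperature: 0.1
});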

The API is OpenAI-compatible. It runs offline and ships inside your web app: no server to provision, no API key to manage, no usage bill.

"Why not run an AI server on the ship itself?" Valid question. A ship-side GPU server costs $15,000–$30,000 in hardware, 400W continuous power, dedicated cooling, and someone to maintain it. When the server room floods — exactly when you need it most — every device on the ship loses AI simultaneously. With browser LLM, each device is independent. Nothing to lose because there's no single point of failure.

The Edge Gets Smart

The oil platform story is bigger than one engineer and one pump.

In production telemetry systems, the standard monitoring pattern is threshold rules — a value crosses a line, an alarm fires. We've shipped these pipelines at scale. They work. They also cannot reason. They cannot synthesize across signals. They tell you that something happened, never why.

Pressure drifting high plus vibration elevated plus flow rate slightly reduced — an experienced engineer reads that combination as liner wear, not a sudden blockage. A threshold rule sees three independent events.
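To make that contrast concrete, here is a minimal sketch of threshold-rule evaluation, written in Python for brevity. The signal names and limits are hypothetical, chosen only for illustration. Each rule is checked on its own, so the output is three unrelated alarms with nothing that connects them to a common cause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    signal: str
    limit: float
    above: bool  # True: alarm when value > limit; False: alarm when value < limit


# Hypothetical limits, for illustration only.
RULES = [
    Rule("pressure_psi", 400.0, above=True),
    Rule("vibration_g", 2.0, above=True),
    Rule("flow_lpm", 1800.0, above=False),
]


def evaluate(reading: dict[str, float]) -> list[str]:
    """Check each rule in isolation; nothing relates one alarm to another."""
    alarms = []
    for rule in RULES:
        value = reading[rule.signal]
        breached = value > rule.limit if rule.above else value < rule.limit
        if breached:
            alarms.append(f"{rule.signal} alarm")
    return alarms


print(evaluate({"pressure_psi": 405.0, "vibration_g": 2.1, "flow_lpm": 1750.0}))
# ['pressure_psi alarm', 'vibration_g alarm', 'flow_lpm alarm']
```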

A browser-resident model interprets the combination the way the engineer would. In plain language. On the device. With no operational data leaving the platform network.

The asset stops being a data source. It becomes a narrator of its own condition.

This is what edge AI actually looks like in distributed sensor environments — not a GPU server in a rack requiring its own reliability engineering, but inference embedded in the devices already in the field. Hardware that exists. Zero marginal cost per query. Available when the network isn't.

OpenWrt routers, industrial HMIs running embedded Chromium on ARM, ruggedized tablets — all valid targets today. As sub-1B models compiled to WASM mature, the hardware floor drops further.

0430 Hours. Somewhere That Doesn't Appear on Maps.

The forward operating base has been in communications blackout for six hours. Electronic warfare — the enemy is jamming everything. The field medic has two casualties. No medevac window.

She types vitals into her laptop. The local model returns in under three seconds:

{
  "probable": ["tension-pneumothorax", "hemothorax"],
  "priority": "immediate",
  "interventions": ["needle-decompression-right-2nd-ICS", "large-bore-IV-x2"]
}

No connectivity. No cloud. No PHI transmitted.

This makes explicit what the oil rig only implies: sometimes the network being down is not a failure. It is an attack. Cloud-dependent AI fails the moment the adversary succeeds. A browser-resident model doesn't.

Privacy Is Architecture, Not Policy

Most networks are up most of the time. And still — there are environments where sending data out is not a technical problem. It's a legal one.

Portable glucometers. Handheld ECG readers. Spirometers in field screening programs. These devices increasingly run browser companion apps. The data they handle is among the most protected in existence.

When a patient reading goes to a cloud LLM, it triggers a cascade: Business Associate Agreement, retention audit, training data policy review, ongoing compliance monitoring. In high-availability healthcare systems, we've seen this compliance surface grow with every model update the vendor ships.

With browser LLM, the reading never leaves the device. Not because of policy. Because transmission is architecturally impossible.

// Reading interpreted locally — never transmitted
const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "Provide plain-language context for diagnostic readings. Do not diagnose." },
    { role: "user",   content: `Blood glucose: ${reading} mg/dL. Fasting: ${fasting}.` }
  ],
  temperature: 0.2
});
// "This reading is above the normal fasting range. Please consult your healthcare provider."

Rural clinic. Mobile screening unit. Offline. Private. Instant.

Where This Breaks

This architecture is not for every application. Be honest about the constraints before you commit.

Cold start is real. The Qwen2.5-3B model is 1.5GB. On a corporate network that's a 2-minute first load. On mobile broadband it's longer. Plan for it — pre-cache via service worker at install time, not on first user interaction.
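As a back-of-envelope check on those numbers, here is a small Python sketch. The link speeds are assumptions picked for illustration, not measurements, and the estimate ignores protocol overhead. A 2-minute load for 1.5 GB implies a link of roughly 100 Mbit/s.

```python
def first_load_seconds(model_bytes: float, link_mbit_per_s: float) -> float:
    """Seconds to download model_bytes at the given link speed, ignoring overhead."""
    bits = model_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)


MODEL_BYTES = 1.5e9  # the 1.5 GB figure quoted above, in decimal gigabytes

# Assumed link speeds, for illustration only.
for label, mbit in [("office LAN", 100), ("mobile broadband", 20)]:
    seconds = first_load_seconds(MODEL_BYTES, mbit)
    print(f"{label}: {seconds / 60:.1f} min")
# office LAN: 2.0 min
# mobile broadband: 10.0 min
```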

The GPU floor matters. Below a modern integrated GPU (Intel Iris Xe or better), inference drops to CPU fallback at ~1 tok/s. That's not interactive. Detect WebGPU availability and route to a server-side fallback for unsupported devices — don't leave users with a broken experience.

Model quality has a ceiling. A 3B parameter model handles structured output, classification, and short reasoning reliably. It hallucinates on complex multi-hop logic. It degrades on inputs above ~4K tokens. For tasks that need frontier reasoning, escalate to cloud; don't try to replace GPT-4 with Qwen-3B.

iOS Safari is still constrained. WebGPU landed in Safari 18 with a 256MB buffer limit that restricts which models run. Android Chrome is solid. Desktop Chrome and Edge are solid. iOS is improving but not there yet for larger models.

Pick Your Model

Model	Size	Min GPU VRAM	Typical Speed
Qwen2.5-0.5B	300 MB	1 GB	~90 tok/s
Qwen2.5-1.5B	900 MB	1 GB	~65 tok/s
Qwen2.5-3B	1.5 GB	2 GB	~38–52 tok/s
Phi-3.5-mini	2.2 GB	3 GB	~28 tok/s
Llama-3.1-8B	4.5 GB	6 GB	~12–18 tok/s

For structured output — filtering, diagnostics, classification, form-fill — Qwen2.5-3B is the sweet spot. Fast enough to feel instant. Capable enough for production use on real tasks.

Who Should Be Paying Attention

Industrial and field operations — oil and gas, maritime, logistics, manufacturing. Anywhere operators work in connectivity-constrained environments with operationally sensitive data.

Defense and government — air-gapped networks, EMCON operations, ITAR-controlled systems. Cloud AI is often forbidden. Browser LLM works within those constraints without additional infrastructure.

Healthcare at the point of care — portable diagnostics, rural medicine, field triage. PHI stays on device by architecture, not by agreement.

Enterprise SaaS in regulated industries — legal, financial, HR. Any product where "add an AI feature" currently means "add an OpenAI dependency and all the compliance overhead."

What Comes Next

Models are getting smaller. Sub-1B parameter models capable enough for structured tasks are close, and when they arrive the hardware floor drops to a $50 device.

In-browser vector search is maturing. Local LLM plus local vector store equals a fully offline RAG system — a knowledge base that lives on the device, reasons over local documents, never sends a query anywhere.

A field medic with a tablet: local model for clinical reasoning, local vector store for medical guidelines, full capability with zero connectivity.

An engineer on a platform between satellite windows: local model interpreting equipment telemetry, local knowledge base of fault histories, full diagnostic capability when the uplink is down.

The browser became a valid AI runtime quietly, while everyone was watching the cloud.

It runs where the work happens. It works when the network doesn't. It keeps data where it belongs.

Built with @mlc-ai/web-llm. Model specs and browser support: webllm.mlc.ai. For native mobile and embedded targets: MLC-LLM.

Designing Zero-Downtime Behavioral Migrations in Distributed Systems

Thomas John — Thu, 12 Feb 2026 03:24:32 +0000

Formalizing safe, deterministic migration workflows for production environments

Modern distributed systems evolve continuously. Configuration models
change, abstractions are redesigned, and legacy structures must
eventually be replaced.

However, when a system is live and high availability is mandatory,
migration becomes far more than a data transformation exercise.

It becomes a behavioral transition problem.

Unlike schema migration, behavioral migration modifies how a system
executes in production. The system must remain available, correct, and
consistent while its underlying configuration model changes. This
introduces failure modes that traditional migration literature does not fully address.

Through repeated architectural refinement, I formalized a reusable framework for safe, resumable, zero-downtime behavioral migration in
distributed systems.

This article outlines that framework.

Why Behavioral Migration Is Harder Than It Looks

Behavioral migration differs from simple data movement in several important ways:

The system continues executing while migration runs
Partial activation can cause duplicate execution
Missing relationships can cause silent non-execution
Crashes must not require a full rollback
Re-running migration must be safe and deterministic

The risk is not visible downtime.

The risk is inconsistent behavior.

In high-availability systems, "almost correct" is unacceptable.

The Behavioral Migration Framework

The framework is structured around five architectural principles.

1. Idempotent Step Isolation

Migration should not be implemented as a monolithic script. Instead, it
should be decomposed into deterministic, independently verifiable steps.

Each step must:

Detect prior completion
Cache its output
Skip safely if already executed

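async def step(job, name, func):
    # Skip work that a previous run already finished.
    if await job.completed(name):
        return await job.cached(name)

    result = await func()
    # Cache before marking complete, so a crash between the two writes
    # re-runs the step instead of leaving a completed step with no output.
    await job.cache(name, result)
    await job.mark_completed(name)
    return result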

This guarantees:

Safe restarts
Deterministic outcomes
Protection against duplicate writes
Operational resilience under failure

Without idempotent step isolation, migration reliability depends on
process stability, which is never guaranteed in distributed systems.
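A small usage sketch shows the skip-on-rerun behavior. InMemoryJob and create_groups are hypothetical stand-ins; a real job store would persist to durable storage so that state survives a crash. This version of step stores the result before recording completion.

```python
import asyncio


class InMemoryJob:
    """Hypothetical job store; a real one would write to durable storage."""

    def __init__(self):
        self._results = {}
        self._done = set()

    async def completed(self, name):
        return name in self._done

    async def cached(self, name):
        return self._results[name]

    async def cache(self, name, result):
        self._results[name] = result

    async def mark_completed(self, name):
        self._done.add(name)


async def step(job, name, func):
    if await job.completed(name):
        return await job.cached(name)
    result = await func()
    await job.cache(name, result)    # store the output first
    await job.mark_completed(name)   # then record completion
    return result


async def demo():
    job = InMemoryJob()
    calls = []

    async def create_groups():
        calls.append("create_groups")
        return ["group-a", "group-b"]

    first = await step(job, "create_groups", create_groups)
    second = await step(job, "create_groups", create_groups)  # simulated re-run
    return first, second, len(calls)


print(asyncio.run(demo()))
# (['group-a', 'group-b'], ['group-a', 'group-b'], 1)
```

The second call returns the cached value and the underlying function runs only once, which is what makes a restart safe.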

2. Atomic Activation Boundary

One of the most dangerous migration mistakes is partial activation.

If new entities are created and activated incrementally, the system may
begin executing against an incomplete state.

The solution is strict separation:

Create all new entities in an inert state
Establish all relationships
Validate structural completeness
Activate everything in one atomic boundary

This eliminates:

Partial behavior shifts
Duplicate execution
Inconsistent state windows

The activation boundary becomes the single, well-defined moment when
execution transitions from legacy logic to the new model.

In distributed environments, activation control is more important than
creation logic.
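One way to sketch the boundary is with a database transaction. The example below uses SQLite from the Python standard library purely for illustration; the table names (rules, legacy_rules) and columns are hypothetical, and a real system would use whatever store holds its configuration.

```python
import sqlite3


def fresh_db():
    """Hypothetical schema with one active legacy rule."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE legacy_rules (name TEXT, active INTEGER)")
    conn.execute("CREATE TABLE rules (name TEXT, target TEXT, active INTEGER)")
    conn.execute("INSERT INTO legacy_rules VALUES ('old-alert', 1)")
    conn.commit()
    return conn


def migrate(conn, new_rules):
    # Phase 1: create every new entity inert, so nothing executes against it yet.
    with conn:
        conn.executemany(
            "INSERT INTO rules (name, target, active) VALUES (?, ?, 0)", new_rules
        )
    # Phase 2: validate structural completeness before any behavior changes.
    missing = conn.execute(
        "SELECT COUNT(*) FROM rules WHERE active = 0 AND target IS NULL"
    ).fetchone()[0]
    if missing:
        raise ValueError(f"{missing} rule(s) have no target; nothing was activated")
    # Phase 3: the activation boundary. Both updates commit together or not at all.
    with conn:
        conn.execute("UPDATE legacy_rules SET active = 0")
        conn.execute("UPDATE rules SET active = 1 WHERE active = 0")


conn = fresh_db()
migrate(conn, [("alert-a", "team-1"), ("alert-b", "team-2")])
print(conn.execute("SELECT name, active FROM rules ORDER BY name").fetchall())
# [('alert-a', 1), ('alert-b', 1)]
```

If validation fails, the new rows stay inert and the legacy rule keeps running, so the system never executes against a half-built state.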

3. Deterministic Configuration Normalization

Legacy systems accumulate structural redundancy. Equivalent
configurations may exist under slightly different wrappers.

Migration provides an opportunity to normalize equivalent logic without
altering behavior.

Using deterministic grouping keys such as:

key = (type, priority, schedule)

key = frozenset(attributes)

ensures consistent consolidation.

Normalization during migration produces a cleaner target model and
reduces long-term technical debt. It transforms migration from
replication into architectural refinement.
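As an illustration, the sketch below groups legacy configurations by such a key. The record fields (id, type, priority, schedule, attributes) are assumptions for the example, not a real schema. Two records that differ only in attribute order land in the same group.

```python
from collections import defaultdict


def grouping_key(config):
    """Order-insensitive key: equivalent configs map to the same value."""
    return (
        config["type"],
        config["priority"],
        config["schedule"],
        frozenset(config["attributes"]),
    )


def normalize(configs):
    groups = defaultdict(list)
    for config in configs:
        groups[grouping_key(config)].append(config["id"])
    # Sort so the result does not depend on input order.
    return sorted(sorted(ids) for ids in groups.values())


legacy = [
    {"id": "c1", "type": "alert", "priority": 1, "schedule": "24x7",
     "attributes": ["email", "sms"]},
    {"id": "c2", "type": "alert", "priority": 1, "schedule": "24x7",
     "attributes": ["sms", "email"]},
    {"id": "c3", "type": "alert", "priority": 2, "schedule": "24x7",
     "attributes": ["email"]},
]

print(normalize(legacy))
# [['c1', 'c2'], ['c3']]
```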

4. Bounded Concurrent Retrieval

Behavioral migration frequently requires retrieving configuration from
distributed sources.

Sequential retrieval is inefficient at scale.
Unbounded concurrency risks overwhelming upstream systems.

Bounded concurrency provides balance:

semaphore = asyncio.Semaphore(N)

When combined with exponential backoff retries, this approach maintains
throughput while preserving system stability.

Migration logic must scale without destabilizing the environment it is attempting to modernize.
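A minimal sketch of the two ideas together, assuming a caller-supplied async fetch function that raises ConnectionError on transient failure. The function names, retry count, and delays are illustrative defaults rather than recommendations.

```python
import asyncio
import random


async def fetch_with_retry(fetch, source, semaphore, retries=4, base_delay=0.1):
    """Call fetch(source) under the semaphore, backing off exponentially on failure."""
    for attempt in range(retries):
        try:
            async with semaphore:  # at most `limit` fetches in flight
                return await fetch(source)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            # Sleep outside the semaphore so a waiting retry does not hold a slot.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))


async def fetch_all(fetch, sources, limit=5, **kwargs):
    semaphore = asyncio.Semaphore(limit)
    tasks = [fetch_with_retry(fetch, s, semaphore, **kwargs) for s in sources]
    return await asyncio.gather(*tasks)
```

Calling asyncio.run(fetch_all(fetch_config, sources, limit=5)) with a hypothetical fetch_config coroutine returns results in the same order as sources. The jitter added to each delay keeps retries from arriving upstream in lockstep.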

5. Pre-Mutation Observability

Before modifying production state, a read-only analysis mode should
exist.

This mode should answer:

What would be created?
What would be grouped?
What anomalies exist?
What would be skipped?

Observation precedes mutation.

Pre-mutation observability reduces uncertainty and surfaces structural
inconsistencies before they become runtime failures.

In complex distributed systems, analysis tooling is often more valuable
than mutation tooling.
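A read-only analysis pass can be as simple as building a report object and writing nothing. In this sketch the Plan fields mirror the four questions above; the record fields (id, type, priority, target) and the "missing target" anomaly are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Plan:
    created: list = field(default_factory=list)
    grouped: dict = field(default_factory=dict)
    anomalies: list = field(default_factory=list)
    skipped: list = field(default_factory=list)


def analyze(legacy_configs, already_migrated):
    """Read-only pass: builds a report and mutates nothing."""
    plan = Plan()
    groups = defaultdict(list)
    for config in legacy_configs:
        if config["id"] in already_migrated:
            plan.skipped.append(config["id"])
        elif not config.get("target"):
            plan.anomalies.append((config["id"], "missing target"))
        else:
            groups[(config["type"], config["priority"])].append(config["id"])
    for key, ids in sorted(groups.items()):
        plan.created.append(key)
        if len(ids) > 1:
            plan.grouped[key] = sorted(ids)
    return plan


plan = analyze(
    [
        {"id": "c1", "type": "alert", "priority": 1, "target": "team-1"},
        {"id": "c2", "type": "alert", "priority": 1, "target": "team-1"},
        {"id": "c3", "type": "alert", "priority": 2, "target": None},
        {"id": "c4", "type": "report", "priority": 1, "target": "team-2"},
    ],
    already_migrated={"c4"},
)
print(plan)
```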

The Hidden Risk: Data Path Integrity

Many migration failures are not caused by flawed algorithms.

They are caused by incomplete data propagation.

Conditional logic may be correct while upstream parsing silently fails, resulting in entire configuration segments being omitted.

Therefore, validation must extend beyond logical correctness to end-to-end data path verification.

Integration-level validation is critical for behavioral migration
safety.
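One simple form of data path verification is to reconcile record identities at each stage. The sketch below assumes every record has a stable ID that can be collected at the source, after parsing, and after migration; the stage names are illustrative.

```python
def verify_data_path(source_ids, parsed_ids, migrated_ids):
    """Compare record identity at each stage; report what went missing and where."""
    source, parsed, migrated = set(source_ids), set(parsed_ids), set(migrated_ids)
    return {
        "lost_in_parsing": sorted(source - parsed),
        "lost_in_migration": sorted(parsed - migrated),
        "unexpected": sorted(migrated - source),
    }


report = verify_data_path(
    source_ids=["c1", "c2", "c3", "c4"],
    parsed_ids=["c1", "c2", "c4"],  # c3 was silently dropped by the parser
    migrated_ids=["c1", "c2", "c4"],
)
print(report)
# {'lost_in_parsing': ['c3'], 'lost_in_migration': [], 'unexpected': []}
```

The migration logic here did nothing wrong with the records it received. Only a comparison against the source reveals that c3 never arrived.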

Conclusion

Zero-downtime migration is not about moving data.

It is about moving behavior — without breaking operational guarantees.

That requires:

Determinism
Isolation
Explicit transition boundaries
Controlled execution
Observability before change

In high-availability systems, migration safety cannot be delegated to a deployment checklist.

It must be embedded into the architecture itself.

A migration should never be an ad-hoc script.

It should be a designed workflow — predictable, resumable, and activation-safe — treated as a first-class architectural concern.