<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: System Rationale</title>
    <description>The latest articles on Forem by System Rationale (@system_rationale).</description>
    <link>https://forem.com/system_rationale</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F404282%2F25c84903-9f4f-4a6c-a886-c324eebf901c.png</url>
      <title>Forem: System Rationale</title>
      <link>https://forem.com/system_rationale</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/system_rationale"/>
    <language>en</language>
    <item>
      <title>Part 3 — Making Gemma 4 Agents Production-Ready: Guardrails, Structured Outputs, and Self-Healing Systems</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Fri, 10 Apr 2026 02:58:00 +0000</pubDate>
      <link>https://forem.com/system_rationale/part-3-making-gemma-4-agents-production-ready-guardrails-structured-outputs-and-self-healing-575n</link>
      <guid>https://forem.com/system_rationale/part-3-making-gemma-4-agents-production-ready-guardrails-structured-outputs-and-self-healing-575n</guid>
      <description>&lt;p&gt;The uncomfortable truth about AI agents&lt;/p&gt;

&lt;p&gt;By the time most teams reach this stage, they’ve already built:&lt;br&gt;
    • a multi-step workflow&lt;br&gt;
    • a supervisor + worker setup&lt;br&gt;
    • integration with tools and APIs&lt;/p&gt;

&lt;p&gt;And yet, the system still fails in production.&lt;/p&gt;

&lt;p&gt;Not because the model is weak.&lt;/p&gt;

&lt;p&gt;But because the system is non-deterministic.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Where reliability actually breaks&lt;/p&gt;

&lt;p&gt;In real deployments, failures don’t come from “bad reasoning”.&lt;/p&gt;

&lt;p&gt;They come from:&lt;br&gt;
    • malformed outputs (invalid JSON, missing fields)&lt;br&gt;
    • inconsistent decisions across steps&lt;br&gt;
    • uncontrolled retries and loops&lt;br&gt;
    • unsafe or duplicated side effects&lt;/p&gt;

&lt;p&gt;You can’t patch these with better prompts.&lt;/p&gt;

&lt;p&gt;You need contracts, validation, and control layers.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;From probabilistic outputs → deterministic contracts&lt;/p&gt;

&lt;p&gt;The first shift is simple but critical:&lt;/p&gt;

&lt;p&gt;Treat every model output as untrusted input&lt;/p&gt;

&lt;p&gt;Instead of accepting free-form text, define strict schemas using&lt;br&gt;
Pydantic or&lt;br&gt;
PydanticAI.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Example: Root Cause Contract&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Literal

from pydantic import BaseModel


class RootCause(BaseModel):
    service: str
    confidence: float
    error_type: Literal["OOM", "MemoryLeak", "Config", "Network"]
    evidence: list[str]
    next_steps: list[str]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This does three things:&lt;br&gt;
    1.  Forces the model into a structured format&lt;br&gt;
    2.  Enables automatic validation&lt;br&gt;
    3.  Creates a stable interface between system components&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What this looks like in practice&lt;/p&gt;

&lt;p&gt;A production pipeline becomes:&lt;/p&gt;

&lt;p&gt;LLM Output → Schema Validation → Accept / Reject → Retry / Escalate&lt;/p&gt;

&lt;p&gt;This is no longer “AI responding”.&lt;/p&gt;

&lt;p&gt;It’s a controlled data pipeline.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The self-healing loop&lt;/p&gt;

&lt;p&gt;Validation is only half the system.&lt;/p&gt;

&lt;p&gt;The real reliability comes from how you handle failure.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Controlled retry pattern&lt;br&gt;
    1.  Generate output&lt;br&gt;
    2.  Validate against schema&lt;br&gt;
    3.  Capture validation error&lt;br&gt;
    4.  Feed error back into model&lt;br&gt;
    5.  Retry with constraints&lt;br&gt;
    6.  Stop after N attempts&lt;/p&gt;
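
&lt;p&gt;A minimal sketch of that loop in plain Python. The validator and call_model below are stand-ins for your Pydantic schema and LLM client, not a real API:&lt;/p&gt;

```python
# Sketch of the controlled retry pattern above. validate() stands in for
# Pydantic schema validation and call_model() for the LLM client -- both
# are placeholders, not a real library API.
import json


def validate(raw: str) -> dict:
    """Stand-in for schema validation: parse JSON, check fields, or raise."""
    data = json.loads(raw)
    c = float(data["confidence"])
    if c > 1.0 or 0.0 > c:
        raise ValueError("Field confidence must be a float between 0 and 1.")
    if data["error_type"] not in ("OOM", "MemoryLeak", "Config", "Network"):
        raise ValueError("error_type must be one of [OOM, MemoryLeak, Config, Network].")
    return data


def generate_validated(call_model, prompt: str, max_attempts: int = 3) -> dict:
    feedback = ""
    for _ in range(max_attempts):            # 6. hard stop after N attempts
        raw = call_model(prompt + feedback)  # 1. generate output
        try:
            return validate(raw)             # 2. validate against schema
        except (ValueError, KeyError) as exc:
            # 3-4. capture the validation error and feed it back
            feedback = f"\nYour last output was invalid: {exc}. Fix the JSON."
    raise RuntimeError("escalate: still invalid after retries")
```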

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Example failure feedback&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;“Try again”&lt;/p&gt;

&lt;p&gt;You send:&lt;/p&gt;

&lt;p&gt;“Field confidence must be a float between 0 and 1.&lt;br&gt;
error_type must be one of [OOM, MemoryLeak, Config, Network].&lt;br&gt;
Fix the JSON.”&lt;/p&gt;

&lt;p&gt;This transforms the model into a self-correcting system.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why Gemma 4 fits this model well&lt;/p&gt;

&lt;p&gt;With Gemma 4, this loop becomes practical at scale.&lt;/p&gt;

&lt;p&gt;Because:&lt;br&gt;
    • thinking mode improves structured reasoning&lt;br&gt;
    • MoE architecture reduces cost per retry&lt;br&gt;
    • long context allows passing validation history&lt;br&gt;
    • tool calling aligns with structured outputs&lt;/p&gt;

&lt;p&gt;This is critical.&lt;/p&gt;

&lt;p&gt;Self-healing systems require multiple attempts.&lt;br&gt;
Cost-efficient inference makes that viable.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Guardrails are not optional&lt;/p&gt;

&lt;p&gt;Without guardrails, your system will eventually:&lt;br&gt;
    • loop indefinitely&lt;br&gt;
    • call the wrong tools&lt;br&gt;
    • execute unsafe actions&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Minimum guardrail layer&lt;/p&gt;

&lt;p&gt;You should implement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Step limits&lt;br&gt;
• Hard cap on number of node executions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Error classification&lt;br&gt;
• Retry: timeouts, rate limits&lt;br&gt;
• Fail: schema errors, auth issues&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Circuit breakers&lt;br&gt;
• Stop calling failing dependencies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Human-in-the-loop&lt;br&gt;
• Required for destructive actions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
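
&lt;p&gt;A rough sketch of those four guardrails in plain Python. Class names, thresholds, and error kinds are invented for illustration, not taken from a specific library:&lt;/p&gt;

```python
# Rough sketch of the minimum guardrail layer above. Names, thresholds,
# and error kinds are invented for illustration.
RETRYABLE = {"timeout", "rate_limit"}   # 2. transient: retry
FATAL = {"schema_error", "auth_error"}  # 2. permanent: fail fast


class Guardrails:
    def __init__(self, max_steps: int = 20, breaker_threshold: int = 3):
        self.steps = 0
        self.max_steps = max_steps        # 1. hard cap on node executions
        self.failures: dict[str, int] = {}
        self.breaker_threshold = breaker_threshold

    def before_step(self, dependency: str) -> None:
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError("step limit exceeded: halting workflow")
        if self.failures.get(dependency, 0) >= self.breaker_threshold:
            # 3. circuit breaker: stop calling a failing dependency
            raise RuntimeError(f"circuit open for {dependency}")

    def on_error(self, dependency: str, kind: str) -> str:
        self.failures[dependency] = self.failures.get(dependency, 0) + 1
        if kind in FATAL:
            return "fail"
        if kind in RETRYABLE:
            return "retry"
        return "escalate"                 # 4. unknown errors go to a human
```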

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Visualizing guardrails in the system&lt;/p&gt;

&lt;p&gt;Think of your system as:&lt;/p&gt;

&lt;p&gt;State Machine&lt;br&gt;
   ↓&lt;br&gt;
Validation Layer&lt;br&gt;
   ↓&lt;br&gt;
Guardrails&lt;br&gt;
   ↓&lt;br&gt;
Execution&lt;/p&gt;

&lt;p&gt;Each layer reduces uncertainty.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Going beyond validation: adaptive systems with DSPy&lt;/p&gt;

&lt;p&gt;Validation ensures correctness.&lt;/p&gt;

&lt;p&gt;But how do you improve the system over time?&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Enter DSPy&lt;/p&gt;

&lt;p&gt;DSPy treats your pipeline as a program:&lt;br&gt;
    • inputs → outputs&lt;br&gt;
    • defined signatures&lt;br&gt;
    • measurable metrics&lt;/p&gt;

&lt;p&gt;It allows you to:&lt;br&gt;
    • run evaluation datasets&lt;br&gt;
    • measure output quality&lt;br&gt;
    • optimize prompts automatically&lt;/p&gt;
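
&lt;p&gt;What that boils down to can be sketched in plain Python. This is a hand-rolled stand-in for the loop DSPy automates, not the DSPy API:&lt;/p&gt;

```python
# Hand-rolled stand-in for the loop DSPy automates: run the pipeline over
# an evaluation set, score each output, aggregate a metric. The dataset
# and pipeline here are toy placeholders, not DSPy objects.
def exact_match(predicted: str, expected: str) -> float:
    return 1.0 if predicted.strip() == expected.strip() else 0.0


def evaluate(pipeline, dataset: list) -> float:
    """Average metric over (input, expected_output) pairs."""
    scores = [exact_match(pipeline(x), y) for x, y in dataset]
    return sum(scores) / len(scores)
```

&lt;p&gt;An optimizer then keeps whichever prompts or few-shot examples raise this score; that selection step is the part DSPy automates.&lt;/p&gt;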

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What this unlocks&lt;/p&gt;

&lt;p&gt;Instead of manual tuning:&lt;br&gt;
    • the system detects failures&lt;br&gt;
    • adjusts prompts / examples&lt;br&gt;
    • improves over time&lt;/p&gt;

&lt;p&gt;This is the missing layer in most agent systems.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Combining everything: the deterministic stack&lt;/p&gt;

&lt;p&gt;A production-ready Gemma 4 system looks like:&lt;/p&gt;

&lt;p&gt;State Graph (LangGraph)&lt;br&gt;
      ↓&lt;br&gt;
Supervisor (Gemma 4 thinking mode)&lt;br&gt;
      ↓&lt;br&gt;
Workers (task-specific agents)&lt;br&gt;
      ↓&lt;br&gt;
Pydantic Validation&lt;br&gt;
      ↓&lt;br&gt;
Guardrails&lt;br&gt;
      ↓&lt;br&gt;
DSPy Evaluation + Optimization&lt;/p&gt;

&lt;p&gt;Each layer solves a specific failure mode.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Real-world application: autonomous DevOps agent&lt;/p&gt;

&lt;p&gt;Example workflow:&lt;/p&gt;

&lt;p&gt;Trace&lt;br&gt;
    • collect logs, metrics, events&lt;/p&gt;

&lt;p&gt;RootCause&lt;br&gt;
    • detect anomalies (OOMKilled, memory leaks)&lt;/p&gt;

&lt;p&gt;Plan&lt;br&gt;
    • decide corrective action&lt;/p&gt;

&lt;p&gt;Fix&lt;br&gt;
    • restart pods, scale services, or open PR&lt;/p&gt;

&lt;p&gt;Verify&lt;br&gt;
    • confirm system recovery&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why this works&lt;/p&gt;

&lt;p&gt;Because:&lt;br&gt;
    • every step is validated&lt;br&gt;
    • every action is controlled&lt;br&gt;
    • every failure is recoverable&lt;/p&gt;

&lt;p&gt;This is not an “AI agent”.&lt;/p&gt;

&lt;p&gt;It’s a deterministic system with AI inside it.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Practical implementation stack&lt;/p&gt;

&lt;p&gt;If you’re building this today:&lt;br&gt;
    • Model: Gemma 4 (26B MoE)&lt;br&gt;
    • Orchestration: LangGraph&lt;br&gt;
    • Validation: Pydantic / PydanticAI&lt;br&gt;
    • Guardrails: custom + middleware&lt;br&gt;
    • Evaluation: DSPy&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Resources&lt;/p&gt;

&lt;p&gt;Core&lt;br&gt;
    • &lt;a href="https://github.com/google-deepmind/gemma" rel="noopener noreferrer"&gt;https://github.com/google-deepmind/gemma&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/google/gemma_pytorch" rel="noopener noreferrer"&gt;https://github.com/google/gemma_pytorch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Orchestration&lt;br&gt;
    • &lt;a href="https://github.com/langchain-ai/langgraph" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/langgraph&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/langchain-ai/langgraph-example" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/langgraph-example&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Validation &amp;amp; Guardrails&lt;br&gt;
    • &lt;a href="https://github.com/pydantic/pydantic-ai" rel="noopener noreferrer"&gt;https://github.com/pydantic/pydantic-ai&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/jagreehal/pydantic-ai-guardrails" rel="noopener noreferrer"&gt;https://github.com/jagreehal/pydantic-ai-guardrails&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Evaluation &amp;amp; Optimization&lt;br&gt;
    • &lt;a href="https://github.com/stanfordnlp/dspy" rel="noopener noreferrer"&gt;https://github.com/stanfordnlp/dspy&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/Scale3-Labs/dspy-examples" rel="noopener noreferrer"&gt;https://github.com/Scale3-Labs/dspy-examples&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Real-world systems&lt;br&gt;
    • &lt;a href="https://github.com/qicesun/SRE-Agent-App" rel="noopener noreferrer"&gt;https://github.com/qicesun/SRE-Agent-App&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Final perspective&lt;/p&gt;

&lt;p&gt;Most teams are still chasing:&lt;br&gt;
    • better prompts&lt;br&gt;
    • better models&lt;br&gt;
    • better outputs&lt;/p&gt;

&lt;p&gt;That’s not where reliability comes from.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Reliability comes from:&lt;br&gt;
    • explicit state&lt;br&gt;
    • strict contracts&lt;br&gt;
    • controlled execution&lt;br&gt;
    • continuous evaluation&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productionagent</category>
      <category>agentdesign</category>
      <category>multiagent</category>
    </item>
    <item>
      <title>Designing Multi-Agent Systems with Gemma 4: Supervisor and Worker Pattern</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Wed, 08 Apr 2026 02:49:00 +0000</pubDate>
      <link>https://forem.com/system_rationale/designing-multi-agent-systems-with-gemma-4-supervisor-and-worker-pattern-2ckh</link>
      <guid>https://forem.com/system_rationale/designing-multi-agent-systems-with-gemma-4-supervisor-and-worker-pattern-2ckh</guid>
      <description>&lt;p&gt;Most agent implementations fail for a simple reason:&lt;/p&gt;

&lt;p&gt;They try to make one model do everything.&lt;/p&gt;

&lt;p&gt;That approach does not scale.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The limitation of single-agent systems&lt;/p&gt;

&lt;p&gt;When one agent is responsible for:&lt;br&gt;
    • understanding context&lt;br&gt;
    • making decisions&lt;br&gt;
    • calling tools&lt;br&gt;
    • validating outputs&lt;br&gt;
    • executing actions&lt;/p&gt;

&lt;p&gt;you introduce uncontrolled complexity.&lt;/p&gt;

&lt;p&gt;The result is:&lt;br&gt;
    • inconsistent behavior&lt;br&gt;
    • hallucinated decisions&lt;br&gt;
    • poor failure recovery&lt;/p&gt;

&lt;p&gt;This is not a model limitation. It’s a design issue.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The correct pattern: separation of responsibilities&lt;/p&gt;

&lt;p&gt;A more stable architecture separates concerns into two layers:&lt;/p&gt;

&lt;p&gt;Worker agents&lt;/p&gt;

&lt;p&gt;Each worker is narrowly scoped:&lt;br&gt;
    • log analysis&lt;br&gt;
    • root cause detection&lt;br&gt;
    • code or PR generation&lt;br&gt;
    • infrastructure interaction&lt;/p&gt;

&lt;p&gt;Workers should be predictable and task-specific.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Supervisor agent&lt;/p&gt;

&lt;p&gt;The supervisor coordinates the system.&lt;/p&gt;

&lt;p&gt;With Gemma 4, this becomes significantly more powerful due to its thinking mode.&lt;/p&gt;

&lt;p&gt;The supervisor:&lt;br&gt;
    • reads the global system state&lt;br&gt;
    • decides which worker to invoke&lt;br&gt;
    • validates outputs before progressing&lt;br&gt;
    • handles retries and escalation&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why thinking mode matters&lt;/p&gt;

&lt;p&gt;Gemma 4 introduces structured reasoning behavior, often referred to as a “thinking” phase.&lt;/p&gt;

&lt;p&gt;In practice, this allows the supervisor to:&lt;br&gt;
    1.  evaluate multiple possible actions&lt;br&gt;
    2.  internally reason about risks and outcomes&lt;br&gt;
    3.  select the next state transition&lt;/p&gt;

&lt;p&gt;This creates a separation between:&lt;br&gt;
    • internal reasoning&lt;br&gt;
    • external actions&lt;/p&gt;

&lt;p&gt;That separation is critical for reliability.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Putting it together: state-driven execution&lt;/p&gt;

&lt;p&gt;A typical flow looks like this:&lt;br&gt;
    • Trace — collect logs, metrics, events&lt;br&gt;
    • RootCause — identify likely issue&lt;br&gt;
    • Plan — decide next action&lt;br&gt;
    • Fix / Escalate — execute or request approval&lt;br&gt;
    • Verify — confirm resolution&lt;/p&gt;

&lt;p&gt;Each step is a node in a state machine.&lt;/p&gt;

&lt;p&gt;The supervisor controls transitions between nodes.&lt;/p&gt;
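
&lt;p&gt;A toy version of this flow in plain Python. Node behavior and state keys are invented for illustration; this is not the LangGraph API:&lt;/p&gt;

```python
# Toy version of the Trace -> RootCause -> Plan -> Fix -> Verify flow.
# route() plays the supervisor: it reads the shared state and picks the
# next node. Node behavior and state keys are invented for illustration.
def trace(state):      state["logs"] = ["OOMKilled"]
def root_cause(state): state["cause"] = "OOM"
def plan(state):       state["action"] = "restart_pod"
def fix(state):        state["fixed"] = True
def verify(state):     state["done"] = state.get("fixed", False)

NODES = {"Trace": trace, "RootCause": root_cause, "Plan": plan,
         "Fix": fix, "Verify": verify}

def route(state):
    """Supervisor: decide the next transition from global state."""
    for key, node in [("logs", "Trace"), ("cause", "RootCause"),
                      ("action", "Plan"), ("fixed", "Fix"), ("done", "Verify")]:
        if key not in state:
            return node
    return None  # terminal: nothing left to do

def run(state, max_steps=10):
    for _ in range(max_steps):       # bounded: no uncontrolled loops
        nxt = route(state)
        if nxt is None:
            return state
        NODES[nxt](state)
    raise RuntimeError("step limit reached: escalate")
```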

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What this architecture fixes&lt;/p&gt;

&lt;p&gt;This approach eliminates common issues:&lt;br&gt;
    • uncontrolled loops → bounded by state transitions&lt;br&gt;
    • inconsistent decisions → centralized in supervisor&lt;br&gt;
    • retry chaos → handled explicitly in graph&lt;br&gt;
    • unclear execution → traceable at each node&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What most teams still get wrong&lt;/p&gt;

&lt;p&gt;Even with this architecture, many implementations fail because they:&lt;br&gt;
    • skip output validation&lt;br&gt;
    • allow unlimited retries&lt;br&gt;
    • treat tool calls as always safe&lt;br&gt;
    • don’t distinguish between reversible and irreversible actions&lt;/p&gt;

&lt;p&gt;These are not optional concerns.&lt;/p&gt;

&lt;p&gt;They define whether your system is production-ready.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Resources&lt;br&gt;
    • &lt;a href="https://github.com/langchain-ai/langgraph" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/langgraph&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/emarco177/langgraph-course" rel="noopener noreferrer"&gt;https://github.com/emarco177/langgraph-course&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://codelabs.developers.google.com/aidemy-multi-agent/instructions" rel="noopener noreferrer"&gt;https://codelabs.developers.google.com/aidemy-multi-agent/instructions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Next&lt;/p&gt;

&lt;p&gt;In the final part:&lt;/p&gt;

&lt;p&gt;How to make Gemma 4 agents deterministic using structured outputs, guardrails, and self-healing pipelines&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>agentdesign</category>
      <category>agentworker</category>
      <category>multiagent</category>
    </item>
    <item>
      <title>Gemma 4 MoE: frontier quality at 1/10th the API cost</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Tue, 07 Apr 2026 02:43:00 +0000</pubDate>
      <link>https://forem.com/system_rationale/gemma-4-moe-frontier-quality-at-110th-the-api-cost-2oan</link>
      <guid>https://forem.com/system_rationale/gemma-4-moe-frontier-quality-at-110th-the-api-cost-2oan</guid>
      <description>&lt;p&gt;Gemma 4 MoE: frontier quality at 1/10th the API cost&lt;/p&gt;

&lt;p&gt;#gemma4 #moe #llm #openweights #aiinfra&lt;/p&gt;

&lt;p&gt;Continuing from Part 1 — once you have a proper state machine architecture, the next question is: which model runs inside it?&lt;/p&gt;

&lt;p&gt;For high-volume agent workloads, my pick is Gemma 4 26B MoE.&lt;/p&gt;

&lt;p&gt;Here's the actual reasoning.&lt;/p&gt;




&lt;h2&gt;
  
  
  What MoE means (no marketing)
&lt;/h2&gt;

&lt;p&gt;Most LLMs are dense. A 30B dense model activates 30B parameters per token — every single one, every single call.&lt;/p&gt;

&lt;p&gt;Mixture-of-Experts works differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total parameters: ~25B&lt;/li&gt;
&lt;li&gt;Active parameters per token: ~3.8B&lt;/li&gt;
&lt;li&gt;A router picks 8 experts out of 128 per token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Near-30B quality. ~4B compute per token.&lt;/p&gt;

&lt;p&gt;Not a trick. Just a better architecture for inference-heavy workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real cost math
&lt;/h2&gt;

&lt;p&gt;GPT-4o: $2.50 per 1M input tokens, $10 per 1M output tokens.&lt;/p&gt;

&lt;p&gt;Gemma 4 is open-weight. Host it yourself on an A100. At volume — thousands of agent runs per day — the math flips hard in your favor.&lt;/p&gt;

&lt;p&gt;This matters specifically for agents because agents are token-heavy. One agent run might involve 5–20 LLM calls, each with a full context window. At GPT-4o pricing, that adds up fast. On self-hosted Gemma 4, it stays manageable.&lt;/p&gt;
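
&lt;p&gt;A quick back-of-envelope at the list prices above. Call counts and token sizes per run are illustrative assumptions, not measurements:&lt;/p&gt;

```python
# Back-of-envelope cost of a token-heavy agent workload at the GPT-4o
# list prices quoted above ($2.50 / 1M input, $10 / 1M output tokens).
# Call counts and token sizes per run are illustrative assumptions.
INPUT_PRICE = 2.50 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 10.0 / 1_000_000  # dollars per output token

def run_cost(calls=10, input_tokens=8_000, output_tokens=1_000):
    """One agent run: several LLM calls, each re-sending a full context."""
    return calls * (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE)

daily = 5_000 * run_cost()  # thousands of agent runs per day
# one run: 10 * (8,000 in + 1,000 out) tokens = $0.30
# 5,000 runs/day is $1,500/day, roughly $45k/month on API pricing
```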




&lt;h2&gt;
  
  
  What Gemma 4 gives you specifically for agents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;256K context window — feed full log files, traces, conversation history in one shot&lt;/li&gt;
&lt;li&gt;Native function calling — no wrapper hacks for tool use&lt;/li&gt;
&lt;li&gt;Thinking mode — model reasons privately before acting (critical for Supervisor agents — Part 3)&lt;/li&gt;
&lt;li&gt;Multimodal input — pass Grafana screenshots directly to it&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When GPT-4o still wins
&lt;/h2&gt;

&lt;p&gt;Being honest here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need sub-second latency, don't control infra → GPT-4o&lt;/li&gt;
&lt;li&gt;Need best reasoning with zero setup → GPT-4o&lt;/li&gt;
&lt;li&gt;Running under 10k tokens/day → pricing doesn't matter, use anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gemma 4 wins when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need cost control at volume&lt;/li&gt;
&lt;li&gt;Data can't leave your infra (regulated, private)&lt;/li&gt;
&lt;li&gt;You're comfortable with GPU infra or a cloud GPU provider&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4:26b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Local testing done. For production throughput, pair with vLLM.&lt;/p&gt;




&lt;p&gt;Part 3 is the architecture — Supervisor + Worker agents using Gemma 4's thinking mode inside a LangGraph state machine. That's where 99.9% reliability actually becomes achievable.&lt;/p&gt;

&lt;p&gt;— System Rationale&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Gemma 4 on Mobile: Which Model to Load (E2B vs E4B) + Real Implementation Guide</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Mon, 06 Apr 2026 18:09:21 +0000</pubDate>
      <link>https://forem.com/system_rationale/gemma-4-on-mobile-which-model-to-load-e2b-vs-e4b-real-implementation-guide-2i1k</link>
      <guid>https://forem.com/system_rationale/gemma-4-on-mobile-which-model-to-load-e2b-vs-e4b-real-implementation-guide-2i1k</guid>
      <description>&lt;p&gt;Hey devs 👋&lt;br&gt;
I’ve been hands-on with Gemma 4 since it dropped 4 days ago and honestly — the E2B and E4B variants are the first models that actually feel practical for real mobile apps.&lt;br&gt;
Here’s the no-BS guide I wish I had: which model to load for your use case + exactly how to load it on Android, iOS, React Native, and web.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which Gemma 4 model should you actually load?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;E2B (≈5.1B total params, only 2.3B active thanks to Per-Layer Embeddings)&lt;br&gt;
→ Your default for phones.&lt;br&gt;
Use cases: offline tutor, smart replies, chat rephrasing, note summarization, safety filters, anything battery/RAM sensitive.&lt;br&gt;
Cold start is fast, and it runs smoothly on mid-range devices.&lt;/p&gt;

&lt;p&gt;E4B (≈8B total, 4.5B effective)&lt;br&gt;
→ Sweet spot for flagship phones, or when you need noticeably better reasoning + native audio + image understanding.&lt;br&gt;
Use cases: multimodal (photo → description), longer context tasks, or when E2B feels a bit “light”.&lt;/p&gt;

&lt;p&gt;26B A4B MoE or 31B&lt;br&gt;
→ Skip these on mobile. They’re only for laptops, desktops, or server-side heavy lifting.&lt;/p&gt;

&lt;p&gt;Rule of thumb I use: start with E2B. Only bump to E4B if users complain about quality or you need audio/image input.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;How to actually load the model (the part that matters)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Android&lt;/p&gt;

&lt;p&gt;Easiest path: AICore Developer Preview (system-wide Gemma 4, zero weights to ship).&lt;br&gt;
Just call the ML Kit GenAI Prompt API — Google handles hardware delegation (NPU/GPU).&lt;/p&gt;

&lt;p&gt;For full control in your app: LiteRT-LM&lt;br&gt;
    • Download the quantized .task file (4-bit) from HF&lt;br&gt;
    • Use on-demand Play Asset Delivery so your APK stays &amp;lt;100 MB&lt;br&gt;
    • Load in background with Coroutines → never block UI&lt;br&gt;
    • Use streaming callback so tokens appear live&lt;/p&gt;

&lt;p&gt;iOS&lt;/p&gt;

&lt;p&gt;MediaPipe LLM Inference API is the official way.&lt;br&gt;
Convert to MediaPipe task format → memory-map the weights → Metal/MPS acceleration.&lt;br&gt;
Warm up the model during app idle time so first token feels instant.&lt;/p&gt;

&lt;p&gt;React Native&lt;/p&gt;

&lt;p&gt;Native TurboModule (Kotlin + Swift) is non-negotiable.&lt;br&gt;
Keep the entire model + inference in native code.&lt;br&gt;
Expose only generateResponse(prompt, options) and onToken events back to JS.&lt;br&gt;
Never run inference on the JS thread — you will OOM and crash.&lt;/p&gt;

&lt;p&gt;Web&lt;/p&gt;

&lt;p&gt;MediaPipe + WebGPU (works surprisingly well in Chrome).&lt;/p&gt;

&lt;p&gt;Universal tips that saved my ass:&lt;/p&gt;

&lt;p&gt;    • Always use the 4-bit quantized version (Q4_K_M or LiteRT equivalent)&lt;br&gt;
    • Never bundle the full model in the APK/IPA — download on first user opt-in&lt;br&gt;
    • Cap context at 4K–8K for mobile (128K is possible but eats RAM)&lt;br&gt;
    • Stream tokens. Always. Users hate staring at a blank screen.&lt;/p&gt;

&lt;p&gt;Security bonus: because E2B/E4B run 100% offline, user data (exam answers, private notes, photos) never touches your servers. Huge privacy win.&lt;/p&gt;

&lt;p&gt;I’m using this exact stack right now for an offline-first tutor app and it’s buttery smooth.&lt;br&gt;
Drop your use case below and I’ll tell you which variant + exact loading path I’d pick for it.&lt;/p&gt;

&lt;p&gt;Useful resources (all fresh as of April 2026):&lt;/p&gt;

&lt;p&gt;Official Gemma 4 announcement: &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/&lt;/a&gt;&lt;br&gt;
Model card + sizes: &lt;a href="https://ai.google.dev/gemma/docs/core/model_card_4" rel="noopener noreferrer"&gt;https://ai.google.dev/gemma/docs/core/model_card_4&lt;/a&gt;&lt;br&gt;
Full model overview (E2B/E4B details): &lt;a href="https://ai.google.dev/gemma/docs/core" rel="noopener noreferrer"&gt;https://ai.google.dev/gemma/docs/core&lt;/a&gt;&lt;br&gt;
Android AICore + ML Kit guide: &lt;a href="https://android-developers.googleblog.com/2026/04/AI-Core-Developer-Preview.html" rel="noopener noreferrer"&gt;https://android-developers.googleblog.com/2026/04/AI-Core-Developer-Preview.html&lt;/a&gt;&lt;br&gt;
LiteRT-LM mobile deployment: &lt;a href="https://ai.google.dev/edge/litert-lm" rel="noopener noreferrer"&gt;https://ai.google.dev/edge/litert-lm&lt;/a&gt;&lt;br&gt;
Hugging Face E2B/E4B quantized models: &lt;a href="https://huggingface.co/google/gemma-4-E2B-it" rel="noopener noreferrer"&gt;https://huggingface.co/google/gemma-4-E2B-it&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Who’s actually shipping Gemma 4 on device right now? Show me your stack 🙌&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>ondeviceai</category>
      <category>mobile</category>
      <category>offlineai</category>
    </item>
    <item>
      <title>Why your LLM agent fails at 3 AM (and how state machines fix it)</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Mon, 06 Apr 2026 09:35:26 +0000</pubDate>
      <link>https://forem.com/system_rationale/why-your-llm-agent-fails-at-3-am-and-how-state-machines-fix-it-3691</link>
      <guid>https://forem.com/system_rationale/why-your-llm-agent-fails-at-3-am-and-how-state-machines-fix-it-3691</guid>
      <description>&lt;p&gt;Why your LLM agent fails at 3 AM (and how state machines fix it)&lt;/p&gt;

&lt;p&gt;#agents #llm #langgraph #systemdesign #aiinfra&lt;/p&gt;

&lt;p&gt;I've been reading postmortems from teams running LLM agents in production.&lt;/p&gt;

&lt;p&gt;Same failure every time.&lt;/p&gt;

&lt;p&gt;Not model quality. Not prompt engineering. The architecture.&lt;/p&gt;

&lt;p&gt;Most AI agents today still look like this:&lt;/p&gt;

&lt;p&gt;User Input → LLM Call → Tool Call → LLM Call → Output&lt;/p&gt;

&lt;p&gt;A chain. Linear. Stateless. Hopeful.&lt;/p&gt;

&lt;p&gt;Works great in a notebook. Breaks under real load.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 4 ways chains die in production
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Infinite loops&lt;/strong&gt;&lt;br&gt;
Agent calls a tool → tool fails → agent retries → tool fails → agent retries.&lt;br&gt;
No exit condition. You're burning tokens at 3 AM while sleeping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. No checkpoint on failure&lt;/strong&gt;&lt;br&gt;
Step 7 of 10 fails. You restart from step 1. Every. Single. Time.&lt;br&gt;
Duplicate side effects — emails, API writes, deploys — retried blindly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Opaque debugging&lt;/strong&gt;&lt;br&gt;
You see the final error. Not which step poisoned the state.&lt;br&gt;
No trace. No replay. Just vibes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Mixed mutation semantics&lt;/strong&gt;&lt;br&gt;
Read-only and write steps treated identically.&lt;br&gt;
A retry re-applies a deployment or a payment. You've now deployed twice.&lt;/p&gt;




&lt;h2&gt;
  
  
  The mental model shift
&lt;/h2&gt;

&lt;p&gt;Stop thinking: "prompt chain"&lt;br&gt;
Start thinking: "distributed system with state"&lt;/p&gt;

&lt;p&gt;A state machine models your workflow as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;States — Idle, Planning, Executing, Validating, Recovering&lt;/li&gt;
&lt;li&gt;Transitions — conditional, guarded, audited&lt;/li&gt;
&lt;li&gt;Persisted state — survives crashes, enables checkpointing, replay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LangGraph made this practical. Every node writes to a shared state object. Every edge is conditional.&lt;/p&gt;

&lt;p&gt;If a node fails → resume from the last checkpoint. Not from scratch.&lt;/p&gt;
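
&lt;p&gt;A toy version of checkpointed execution in plain Python, hand-rolled for illustration (not the LangGraph API):&lt;/p&gt;

```python
# Toy checkpointed execution: every node writes to a shared state dict
# that is persisted after each successful step, so a crash resumes at the
# failed node instead of node A. Hand-rolled, not the LangGraph API.
import json
import os

def run_graph(nodes, state_path):
    state = {"next": 0}
    if os.path.exists(state_path):          # resume from last checkpoint
        with open(state_path) as f:
            state = json.load(f)
    while len(nodes) > state["next"]:
        name, fn = nodes[state["next"]]
        fn(state)                           # may raise: checkpoint survives
        state["next"] += 1
        with open(state_path, "w") as f:
            json.dump(state, f)             # persist after every node
    return state
```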




&lt;h2&gt;
  
  
  What this actually looks like
&lt;/h2&gt;

&lt;p&gt;Chain:  A → B → C → D → Error (restart from A)&lt;/p&gt;

&lt;p&gt;Graph:  A → B → C → Error → Retry(C) → D&lt;br&gt;
                    ↓&lt;br&gt;
               HumanApproval → D&lt;/p&gt;

&lt;p&gt;The graph knows where it failed. It knows what to do next.&lt;br&gt;
The chain just panics.&lt;/p&gt;




&lt;p&gt;This is Part 1 of a series on building deterministic, production-grade multi-agent systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt; Why I'm using Gemma 4 26B MoE as the reasoning engine — and how it compares to GPT-4o on real cost.&lt;/p&gt;

&lt;p&gt;If you're building AI systems that need to work under an SLA — follow along.&lt;/p&gt;

&lt;p&gt;— System Rationale&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
