<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: System Rationale</title>
    <description>The latest articles on Forem by System Rationale (@system_rationale).</description>
    <link>https://forem.com/system_rationale</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F404282%2F25c84903-9f4f-4a6c-a886-c324eebf901c.png</url>
      <title>Forem: System Rationale</title>
      <link>https://forem.com/system_rationale</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/system_rationale"/>
    <language>en</language>
    <item>
      <title>Part 3 — Making Gemma 4 Agents Production-Ready: Guardrails, Structured Outputs, and Self-Healing Systems</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Fri, 10 Apr 2026 02:58:00 +0000</pubDate>
      <link>https://forem.com/system_rationale/part-3-making-gemma-4-agents-production-ready-guardrails-structured-outputs-and-self-healing-575n</link>
      <guid>https://forem.com/system_rationale/part-3-making-gemma-4-agents-production-ready-guardrails-structured-outputs-and-self-healing-575n</guid>
      <description>&lt;p&gt;The uncomfortable truth about AI agents&lt;/p&gt;

&lt;p&gt;By the time most teams reach this stage, they’ve already built:&lt;br&gt;
    • a multi-step workflow&lt;br&gt;
    • a supervisor + worker setup&lt;br&gt;
    • integration with tools and APIs&lt;/p&gt;

&lt;p&gt;And yet, the system still fails in production.&lt;/p&gt;

&lt;p&gt;Not because the model is weak.&lt;/p&gt;

&lt;p&gt;But because the system is non-deterministic.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Where reliability actually breaks&lt;/p&gt;

&lt;p&gt;In real deployments, failures don’t come from “bad reasoning”.&lt;/p&gt;

&lt;p&gt;They come from:&lt;br&gt;
    • malformed outputs (invalid JSON, missing fields)&lt;br&gt;
    • inconsistent decisions across steps&lt;br&gt;
    • uncontrolled retries and loops&lt;br&gt;
    • unsafe or duplicated side effects&lt;/p&gt;

&lt;p&gt;You can’t patch these with better prompts.&lt;/p&gt;

&lt;p&gt;You need contracts, validation, and control layers.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;From probabilistic outputs → deterministic contracts&lt;/p&gt;

&lt;p&gt;The first shift is simple but critical:&lt;/p&gt;

&lt;p&gt;Treat every model output as untrusted input&lt;/p&gt;

&lt;p&gt;Instead of accepting free-form text, define strict schemas using&lt;br&gt;
Pydantic or&lt;br&gt;
PydanticAI.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Example: Root Cause Contract&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Literal

from pydantic import BaseModel


class RootCause(BaseModel):
    service: str
    confidence: float
    error_type: Literal["OOM", "MemoryLeak", "Config", "Network"]
    evidence: list[str]
    next_steps: list[str]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This does three things:&lt;br&gt;
    1.  Forces the model into a structured format&lt;br&gt;
    2.  Enables automatic validation&lt;br&gt;
    3.  Creates a stable interface between system components&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What this looks like in practice&lt;/p&gt;

&lt;p&gt;A production pipeline becomes:&lt;/p&gt;

&lt;p&gt;LLM Output → Schema Validation → Accept / Reject → Retry / Escalate&lt;/p&gt;

&lt;p&gt;This is no longer “AI responding”.&lt;/p&gt;

&lt;p&gt;It’s a controlled data pipeline.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The self-healing loop&lt;/p&gt;

&lt;p&gt;Validation is only half the system.&lt;/p&gt;

&lt;p&gt;The real reliability comes from how you handle failure.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Controlled retry pattern&lt;br&gt;
    1.  Generate output&lt;br&gt;
    2.  Validate against schema&lt;br&gt;
    3.  Capture validation error&lt;br&gt;
    4.  Feed error back into model&lt;br&gt;
    5.  Retry with constraints&lt;br&gt;
    6.  Stop after N attempts&lt;/p&gt;
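
&lt;p&gt;A minimal sketch of that loop in plain Python. The validator and call_model below are stand-ins for your Pydantic schema and LLM client, not a real API:&lt;/p&gt;

```python
# Sketch of the controlled retry pattern above. validate() stands in for
# Pydantic schema validation and call_model() for the LLM client -- both
# are placeholders, not a real library API.
import json


def validate(raw: str) -> dict:
    """Stand-in for schema validation: parse JSON, check fields, or raise."""
    data = json.loads(raw)
    c = float(data["confidence"])
    if c > 1.0 or 0.0 > c:
        raise ValueError("Field confidence must be a float between 0 and 1.")
    if data["error_type"] not in ("OOM", "MemoryLeak", "Config", "Network"):
        raise ValueError("error_type must be one of [OOM, MemoryLeak, Config, Network].")
    return data


def generate_validated(call_model, prompt: str, max_attempts: int = 3) -> dict:
    feedback = ""
    for _ in range(max_attempts):            # 6. hard stop after N attempts
        raw = call_model(prompt + feedback)  # 1. generate output
        try:
            return validate(raw)             # 2. validate against schema
        except (ValueError, KeyError) as exc:
            # 3-4. capture the validation error and feed it back
            feedback = f"\nYour last output was invalid: {exc}. Fix the JSON."
    raise RuntimeError("escalate: still invalid after retries")
```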

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Example failure feedback&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;“Try again”&lt;/p&gt;

&lt;p&gt;You send:&lt;/p&gt;

&lt;p&gt;“Field confidence must be a float between 0 and 1.&lt;br&gt;
error_type must be one of [OOM, MemoryLeak, Config, Network].&lt;br&gt;
Fix the JSON.”&lt;/p&gt;

&lt;p&gt;This transforms the model into a self-correcting system.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why Gemma 4 fits this model well&lt;/p&gt;

&lt;p&gt;With Gemma 4, this loop becomes practical at scale.&lt;/p&gt;

&lt;p&gt;Because:&lt;br&gt;
    • thinking mode improves structured reasoning&lt;br&gt;
    • MoE architecture reduces cost per retry&lt;br&gt;
    • long context allows passing validation history&lt;br&gt;
    • tool calling aligns with structured outputs&lt;/p&gt;

&lt;p&gt;This is critical.&lt;/p&gt;

&lt;p&gt;Self-healing systems require multiple attempts.&lt;br&gt;
Cost-efficient inference makes that viable.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Guardrails are not optional&lt;/p&gt;

&lt;p&gt;Without guardrails, your system will eventually:&lt;br&gt;
    • loop indefinitely&lt;br&gt;
    • call the wrong tools&lt;br&gt;
    • execute unsafe actions&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Minimum guardrail layer&lt;/p&gt;

&lt;p&gt;You should implement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Step limits&lt;br&gt;
• Hard cap on number of node executions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Error classification&lt;br&gt;
• Retry: timeouts, rate limits&lt;br&gt;
• Fail: schema errors, auth issues&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Circuit breakers&lt;br&gt;
• Stop calling failing dependencies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Human-in-the-loop&lt;br&gt;
• Required for destructive actions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
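
&lt;p&gt;A rough sketch of those four guardrails in plain Python. Class names, thresholds, and error kinds are invented for illustration, not taken from a specific library:&lt;/p&gt;

```python
# Rough sketch of the minimum guardrail layer above. Names, thresholds,
# and error kinds are invented for illustration.
RETRYABLE = {"timeout", "rate_limit"}   # 2. transient: retry
FATAL = {"schema_error", "auth_error"}  # 2. permanent: fail fast


class Guardrails:
    def __init__(self, max_steps: int = 20, breaker_threshold: int = 3):
        self.steps = 0
        self.max_steps = max_steps        # 1. hard cap on node executions
        self.failures: dict[str, int] = {}
        self.breaker_threshold = breaker_threshold

    def before_step(self, dependency: str) -> None:
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError("step limit exceeded: halting workflow")
        if self.failures.get(dependency, 0) >= self.breaker_threshold:
            # 3. circuit breaker: stop calling a failing dependency
            raise RuntimeError(f"circuit open for {dependency}")

    def on_error(self, dependency: str, kind: str) -> str:
        self.failures[dependency] = self.failures.get(dependency, 0) + 1
        if kind in FATAL:
            return "fail"
        if kind in RETRYABLE:
            return "retry"
        return "escalate"                 # 4. unknown errors go to a human
```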

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Visualizing guardrails in the system&lt;/p&gt;

&lt;p&gt;Think of your system as:&lt;/p&gt;

&lt;p&gt;State Machine&lt;br&gt;
   ↓&lt;br&gt;
Validation Layer&lt;br&gt;
   ↓&lt;br&gt;
Guardrails&lt;br&gt;
   ↓&lt;br&gt;
Execution&lt;/p&gt;

&lt;p&gt;Each layer reduces uncertainty.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Going beyond validation: adaptive systems with DSPy&lt;/p&gt;

&lt;p&gt;Validation ensures correctness.&lt;/p&gt;

&lt;p&gt;But how do you improve the system over time?&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Enter DSPy&lt;/p&gt;

&lt;p&gt;DSPy treats your pipeline as a program:&lt;br&gt;
    • inputs → outputs&lt;br&gt;
    • defined signatures&lt;br&gt;
    • measurable metrics&lt;/p&gt;

&lt;p&gt;It allows you to:&lt;br&gt;
    • run evaluation datasets&lt;br&gt;
    • measure output quality&lt;br&gt;
    • optimize prompts automatically&lt;/p&gt;
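
&lt;p&gt;What that boils down to can be sketched in plain Python. This is a hand-rolled stand-in for the loop DSPy automates, not the DSPy API:&lt;/p&gt;

```python
# Hand-rolled stand-in for the loop DSPy automates: run the pipeline over
# an evaluation set, score each output, aggregate a metric. The dataset
# and pipeline here are toy placeholders, not DSPy objects.
def exact_match(predicted: str, expected: str) -> float:
    return 1.0 if predicted.strip() == expected.strip() else 0.0


def evaluate(pipeline, dataset: list) -> float:
    """Average metric over (input, expected_output) pairs."""
    scores = [exact_match(pipeline(x), y) for x, y in dataset]
    return sum(scores) / len(scores)
```

&lt;p&gt;An optimizer then keeps whichever prompts or few-shot examples raise this score; that selection step is the part DSPy automates.&lt;/p&gt;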

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What this unlocks&lt;/p&gt;

&lt;p&gt;Instead of manual tuning:&lt;br&gt;
    • the system detects failures&lt;br&gt;
    • adjusts prompts / examples&lt;br&gt;
    • improves over time&lt;/p&gt;

&lt;p&gt;This is the missing layer in most agent systems.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Combining everything: the deterministic stack&lt;/p&gt;

&lt;p&gt;A production-ready Gemma 4 system looks like:&lt;/p&gt;

&lt;p&gt;State Graph (LangGraph)&lt;br&gt;
      ↓&lt;br&gt;
Supervisor (Gemma 4 thinking mode)&lt;br&gt;
      ↓&lt;br&gt;
Workers (task-specific agents)&lt;br&gt;
      ↓&lt;br&gt;
Pydantic Validation&lt;br&gt;
      ↓&lt;br&gt;
Guardrails&lt;br&gt;
      ↓&lt;br&gt;
DSPy Evaluation + Optimization&lt;/p&gt;

&lt;p&gt;Each layer solves a specific failure mode.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Real-world application: autonomous DevOps agent&lt;/p&gt;

&lt;p&gt;Example workflow:&lt;/p&gt;

&lt;p&gt;Trace&lt;br&gt;
    • collect logs, metrics, events&lt;/p&gt;

&lt;p&gt;RootCause&lt;br&gt;
    • detect anomalies (OOMKilled, memory leaks)&lt;/p&gt;

&lt;p&gt;Plan&lt;br&gt;
    • decide corrective action&lt;/p&gt;

&lt;p&gt;Fix&lt;br&gt;
    • restart pods, scale services, or open PR&lt;/p&gt;

&lt;p&gt;Verify&lt;br&gt;
    • confirm system recovery&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why this works&lt;/p&gt;

&lt;p&gt;Because:&lt;br&gt;
    • every step is validated&lt;br&gt;
    • every action is controlled&lt;br&gt;
    • every failure is recoverable&lt;/p&gt;

&lt;p&gt;This is not an “AI agent”.&lt;/p&gt;

&lt;p&gt;It’s a deterministic system with AI inside it.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Practical implementation stack&lt;/p&gt;

&lt;p&gt;If you’re building this today:&lt;br&gt;
    • Model: Gemma 4 (26B MoE)&lt;br&gt;
    • Orchestration: LangGraph&lt;br&gt;
    • Validation: Pydantic / PydanticAI&lt;br&gt;
    • Guardrails: custom + middleware&lt;br&gt;
    • Evaluation: DSPy&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Resources&lt;/p&gt;

&lt;p&gt;Core&lt;br&gt;
    • &lt;a href="https://github.com/google-deepmind/gemma" rel="noopener noreferrer"&gt;https://github.com/google-deepmind/gemma&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/google/gemma_pytorch" rel="noopener noreferrer"&gt;https://github.com/google/gemma_pytorch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Orchestration&lt;br&gt;
    • &lt;a href="https://github.com/langchain-ai/langgraph" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/langgraph&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/langchain-ai/langgraph-example" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/langgraph-example&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Validation &amp;amp; Guardrails&lt;br&gt;
    • &lt;a href="https://github.com/pydantic/pydantic-ai" rel="noopener noreferrer"&gt;https://github.com/pydantic/pydantic-ai&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/jagreehal/pydantic-ai-guardrails" rel="noopener noreferrer"&gt;https://github.com/jagreehal/pydantic-ai-guardrails&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Evaluation &amp;amp; Optimization&lt;br&gt;
    • &lt;a href="https://github.com/stanfordnlp/dspy" rel="noopener noreferrer"&gt;https://github.com/stanfordnlp/dspy&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/Scale3-Labs/dspy-examples" rel="noopener noreferrer"&gt;https://github.com/Scale3-Labs/dspy-examples&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Real-world systems&lt;br&gt;
    • &lt;a href="https://github.com/qicesun/SRE-Agent-App" rel="noopener noreferrer"&gt;https://github.com/qicesun/SRE-Agent-App&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Final perspective&lt;/p&gt;

&lt;p&gt;Most teams are still chasing:&lt;br&gt;
    • better prompts&lt;br&gt;
    • better models&lt;br&gt;
    • better outputs&lt;/p&gt;

&lt;p&gt;That’s not where reliability comes from.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Reliability comes from:&lt;br&gt;
    • explicit state&lt;br&gt;
    • strict contracts&lt;br&gt;
    • controlled execution&lt;br&gt;
    • continuous evaluation&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productionagent</category>
      <category>agentdesign</category>
      <category>multiagent</category>
    </item>
    <item>
      <title>Designing Multi-Agent Systems with Gemma 4: Supervisor and Worker Pattern</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Wed, 08 Apr 2026 02:49:00 +0000</pubDate>
      <link>https://forem.com/system_rationale/designing-multi-agent-systems-with-gemma-4-supervisor-and-worker-pattern-2ckh</link>
      <guid>https://forem.com/system_rationale/designing-multi-agent-systems-with-gemma-4-supervisor-and-worker-pattern-2ckh</guid>
      <description>&lt;p&gt;Most agent implementations fail for a simple reason:&lt;/p&gt;

&lt;p&gt;They try to make one model do everything.&lt;/p&gt;

&lt;p&gt;That approach does not scale.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The limitation of single-agent systems&lt;/p&gt;

&lt;p&gt;When one agent is responsible for:&lt;br&gt;
    • understanding context&lt;br&gt;
    • making decisions&lt;br&gt;
    • calling tools&lt;br&gt;
    • validating outputs&lt;br&gt;
    • executing actions&lt;/p&gt;

&lt;p&gt;you introduce uncontrolled complexity.&lt;/p&gt;

&lt;p&gt;The result is:&lt;br&gt;
    • inconsistent behavior&lt;br&gt;
    • hallucinated decisions&lt;br&gt;
    • poor failure recovery&lt;/p&gt;

&lt;p&gt;This is not a model limitation. It’s a design issue.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The correct pattern: separation of responsibilities&lt;/p&gt;

&lt;p&gt;A more stable architecture separates concerns into two layers:&lt;/p&gt;

&lt;p&gt;Worker agents&lt;/p&gt;

&lt;p&gt;Each worker is narrowly scoped:&lt;br&gt;
    • log analysis&lt;br&gt;
    • root cause detection&lt;br&gt;
    • code or PR generation&lt;br&gt;
    • infrastructure interaction&lt;/p&gt;

&lt;p&gt;Workers should be predictable and task-specific.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Supervisor agent&lt;/p&gt;

&lt;p&gt;The supervisor coordinates the system.&lt;/p&gt;

&lt;p&gt;With Gemma 4, this becomes significantly more powerful due to its thinking mode.&lt;/p&gt;

&lt;p&gt;The supervisor:&lt;br&gt;
    • reads the global system state&lt;br&gt;
    • decides which worker to invoke&lt;br&gt;
    • validates outputs before progressing&lt;br&gt;
    • handles retries and escalation&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why thinking mode matters&lt;/p&gt;

&lt;p&gt;Gemma 4 introduces structured reasoning behavior, often referred to as a “thinking” phase.&lt;/p&gt;

&lt;p&gt;In practice, this allows the supervisor to:&lt;br&gt;
    1.  evaluate multiple possible actions&lt;br&gt;
    2.  internally reason about risks and outcomes&lt;br&gt;
    3.  select the next state transition&lt;/p&gt;

&lt;p&gt;This creates a separation between:&lt;br&gt;
    • internal reasoning&lt;br&gt;
    • external actions&lt;/p&gt;

&lt;p&gt;That separation is critical for reliability.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Putting it together: state-driven execution&lt;/p&gt;

&lt;p&gt;A typical flow looks like this:&lt;br&gt;
    • Trace — collect logs, metrics, events&lt;br&gt;
    • RootCause — identify likely issue&lt;br&gt;
    • Plan — decide next action&lt;br&gt;
    • Fix / Escalate — execute or request approval&lt;br&gt;
    • Verify — confirm resolution&lt;/p&gt;

&lt;p&gt;Each step is a node in a state machine.&lt;/p&gt;

&lt;p&gt;The supervisor controls transitions between nodes.&lt;/p&gt;
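
&lt;p&gt;A toy version of this flow in plain Python. Node behavior and state keys are invented for illustration; this is not the LangGraph API:&lt;/p&gt;

```python
# Toy version of the Trace -> RootCause -> Plan -> Fix -> Verify flow.
# route() plays the supervisor: it reads the shared state and picks the
# next node. Node behavior and state keys are invented for illustration.
def trace(state):      state["logs"] = ["OOMKilled"]
def root_cause(state): state["cause"] = "OOM"
def plan(state):       state["action"] = "restart_pod"
def fix(state):        state["fixed"] = True
def verify(state):     state["done"] = state.get("fixed", False)

NODES = {"Trace": trace, "RootCause": root_cause, "Plan": plan,
         "Fix": fix, "Verify": verify}

def route(state):
    """Supervisor: decide the next transition from global state."""
    for key, node in [("logs", "Trace"), ("cause", "RootCause"),
                      ("action", "Plan"), ("fixed", "Fix"), ("done", "Verify")]:
        if key not in state:
            return node
    return None  # terminal: nothing left to do

def run(state, max_steps=10):
    for _ in range(max_steps):       # bounded: no uncontrolled loops
        nxt = route(state)
        if nxt is None:
            return state
        NODES[nxt](state)
    raise RuntimeError("step limit reached: escalate")
```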

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What this architecture fixes&lt;/p&gt;

&lt;p&gt;This approach eliminates common issues:&lt;br&gt;
    • uncontrolled loops → bounded by state transitions&lt;br&gt;
    • inconsistent decisions → centralized in supervisor&lt;br&gt;
    • retry chaos → handled explicitly in graph&lt;br&gt;
    • unclear execution → traceable at each node&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What most teams still get wrong&lt;/p&gt;

&lt;p&gt;Even with this architecture, many implementations fail because they:&lt;br&gt;
    • skip output validation&lt;br&gt;
    • allow unlimited retries&lt;br&gt;
    • treat tool calls as always safe&lt;br&gt;
    • don’t distinguish between reversible and irreversible actions&lt;/p&gt;

&lt;p&gt;These are not optional concerns.&lt;/p&gt;

&lt;p&gt;They define whether your system is production-ready.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Resources&lt;br&gt;
    • &lt;a href="https://github.com/langchain-ai/langgraph" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/langgraph&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/emarco177/langgraph-course" rel="noopener noreferrer"&gt;https://github.com/emarco177/langgraph-course&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://codelabs.developers.google.com/aidemy-multi-agent/instructions" rel="noopener noreferrer"&gt;https://codelabs.developers.google.com/aidemy-multi-agent/instructions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Next&lt;/p&gt;

&lt;p&gt;In the final part:&lt;/p&gt;

&lt;p&gt;How to make Gemma 4 agents deterministic using structured outputs, guardrails, and self-healing pipelines&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>agentdesign</category>
      <category>agentworker</category>
      <category>multiagent</category>
    </item>
    <item>
      <title>Gemma 4 MoE: frontier quality at 1/10th the API cost</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Tue, 07 Apr 2026 02:43:00 +0000</pubDate>
      <link>https://forem.com/system_rationale/gemma-4-moe-frontier-quality-at-110th-the-api-cost-2oan</link>
      <guid>https://forem.com/system_rationale/gemma-4-moe-frontier-quality-at-110th-the-api-cost-2oan</guid>
      <description>&lt;p&gt;Gemma 4 MoE: frontier quality at 1/10th the API cost&lt;/p&gt;

&lt;p&gt;#gemma4 #moe #llm #openweights #aiinfra&lt;/p&gt;

&lt;p&gt;Continuing from Part 1 — once you have a proper state machine architecture, the next question is: which model runs inside it?&lt;/p&gt;

&lt;p&gt;For high-volume agent workloads, my pick is Gemma 4 26B MoE.&lt;/p&gt;

&lt;p&gt;Here's the actual reasoning.&lt;/p&gt;




&lt;h2&gt;
  
  
  What MoE means (no marketing)
&lt;/h2&gt;

&lt;p&gt;Most LLMs are dense. A 30B dense model activates 30B parameters per token — every single one, every single call.&lt;/p&gt;

&lt;p&gt;Mixture-of-Experts works differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total parameters: ~25B&lt;/li&gt;
&lt;li&gt;Active parameters per token: ~3.8B&lt;/li&gt;
&lt;li&gt;A router picks 8 experts out of 128 per token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Near-30B quality. ~4B compute per token.&lt;/p&gt;

&lt;p&gt;Not a trick. Just a better architecture for inference-heavy workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real cost math
&lt;/h2&gt;

&lt;p&gt;GPT-4o: $2.50 per 1M input tokens, $10 per 1M output tokens.&lt;/p&gt;

&lt;p&gt;Gemma 4 is open-weight. Host it yourself on an A100. At volume — thousands of agent runs per day — the math flips hard in your favor.&lt;/p&gt;

&lt;p&gt;This matters specifically for agents because agents are token-heavy. One agent run might involve 5–20 LLM calls, each with a full context window. At GPT-4o pricing, that adds up fast. On self-hosted Gemma 4, it stays manageable.&lt;/p&gt;
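
&lt;p&gt;A quick back-of-envelope at the list prices above. Call counts and token sizes per run are illustrative assumptions, not measurements:&lt;/p&gt;

```python
# Back-of-envelope cost of a token-heavy agent workload at the GPT-4o
# list prices quoted above ($2.50 / 1M input, $10 / 1M output tokens).
# Call counts and token sizes per run are illustrative assumptions.
INPUT_PRICE = 2.50 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 10.0 / 1_000_000  # dollars per output token

def run_cost(calls=10, input_tokens=8_000, output_tokens=1_000):
    """One agent run: several LLM calls, each re-sending a full context."""
    return calls * (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE)

daily = 5_000 * run_cost()  # thousands of agent runs per day
# one run: 10 * (8,000 in + 1,000 out) tokens = $0.30
# 5,000 runs/day is $1,500/day, roughly $45k/month on API pricing
```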




&lt;h2&gt;
  
  
  What Gemma 4 gives you specifically for agents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;256K context window — feed full log files, traces, conversation history in one shot&lt;/li&gt;
&lt;li&gt;Native function calling — no wrapper hacks for tool use&lt;/li&gt;
&lt;li&gt;Thinking mode — model reasons privately before acting (critical for Supervisor agents — Part 3)&lt;/li&gt;
&lt;li&gt;Multimodal input — pass Grafana screenshots directly to it&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When GPT-4o still wins
&lt;/h2&gt;

&lt;p&gt;Being honest here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need sub-second latency, don't control infra → GPT-4o&lt;/li&gt;
&lt;li&gt;Need best reasoning with zero setup → GPT-4o&lt;/li&gt;
&lt;li&gt;Running under 10k tokens/day → pricing doesn't matter, use anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gemma 4 wins when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need cost control at volume&lt;/li&gt;
&lt;li&gt;Data can't leave your infra (regulated, private)&lt;/li&gt;
&lt;li&gt;You're comfortable with GPU infra or a cloud GPU provider&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4:26b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Local testing done. For production throughput, pair with vLLM.&lt;/p&gt;




&lt;p&gt;Part 3 is the architecture — Supervisor + Worker agents using Gemma 4's thinking mode inside a LangGraph state machine. That's where 99.9% reliability actually becomes achievable.&lt;/p&gt;

&lt;p&gt;— System Rationale&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Gemma 4 on Mobile: Which Model to Load (E2B vs E4B) + Real Implementation Guide</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Mon, 06 Apr 2026 18:09:21 +0000</pubDate>
      <link>https://forem.com/system_rationale/gemma-4-on-mobile-which-model-to-load-e2b-vs-e4b-real-implementation-guide-2i1k</link>
      <guid>https://forem.com/system_rationale/gemma-4-on-mobile-which-model-to-load-e2b-vs-e4b-real-implementation-guide-2i1k</guid>
      <description>&lt;p&gt;Hey devs 👋&lt;br&gt;
I’ve been hands-on with Gemma 4 since it dropped 4 days ago and honestly — the E2B and E4B variants are the first models that actually feel practical for real mobile apps.&lt;br&gt;
Here’s the no-BS guide I wish I had: which model to load for your use case + exactly how to load it on Android, iOS, React Native, and web.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which Gemma 4 model should you actually load?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;E2B (≈5.1B total params, only 2.3B active thanks to Per-Layer Embeddings)&lt;br&gt;
→ Your default for phones.&lt;br&gt;
Use cases: offline tutor, smart replies, chat rephrasing, note summarization, safety filters, anything battery/RAM sensitive.&lt;br&gt;
Cold start is fast, and it runs smoothly on mid-range devices.&lt;/p&gt;

&lt;p&gt;E4B (≈8B total, 4.5B effective)&lt;br&gt;
→ Sweet spot for flagship phones, or when you need noticeably better reasoning + native audio + image understanding.&lt;br&gt;
Use cases: multimodal (photo → description), longer context tasks, or when E2B feels a bit “light”.&lt;/p&gt;

&lt;p&gt;26B A4B MoE or 31B&lt;br&gt;
→ Skip these on mobile. They’re only for laptops, desktops, or server-side heavy lifting.&lt;/p&gt;

&lt;p&gt;Rule of thumb I use: start with E2B. Only bump to E4B if users complain about quality or you need audio/image input.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;How to actually load the model (the part that matters)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Android&lt;/p&gt;

&lt;p&gt;Easiest path: AICore Developer Preview (system-wide Gemma 4, zero weights to ship).&lt;br&gt;
Just call the ML Kit GenAI Prompt API — Google handles hardware delegation (NPU/GPU).&lt;/p&gt;

&lt;p&gt;For full control in your app: LiteRT-LM&lt;br&gt;
    • Download the quantized .task file (4-bit) from HF&lt;br&gt;
    • Use on-demand Play Asset Delivery so your APK stays &amp;lt;100 MB&lt;br&gt;
    • Load in background with Coroutines → never block UI&lt;br&gt;
    • Use streaming callback so tokens appear live&lt;/p&gt;

&lt;p&gt;iOS&lt;/p&gt;

&lt;p&gt;MediaPipe LLM Inference API is the official way.&lt;br&gt;
Convert to MediaPipe task format → memory-map the weights → Metal/MPS acceleration.&lt;br&gt;
Warm up the model during app idle time so first token feels instant.&lt;/p&gt;

&lt;p&gt;React Native&lt;/p&gt;

&lt;p&gt;Native TurboModule (Kotlin + Swift) is non-negotiable.&lt;br&gt;
Keep the entire model + inference in native code.&lt;br&gt;
Expose only generateResponse(prompt, options) and onToken events back to JS.&lt;br&gt;
Never run inference on the JS thread — you will OOM and crash.&lt;/p&gt;

&lt;p&gt;Web&lt;/p&gt;

&lt;p&gt;MediaPipe + WebGPU (works surprisingly well in Chrome).&lt;/p&gt;

&lt;p&gt;Universal tips that saved my ass:&lt;/p&gt;

&lt;p&gt;    • Always use the 4-bit quantized version (Q4_K_M or LiteRT equivalent)&lt;br&gt;
    • Never bundle the full model in the APK/IPA — download on first user opt-in&lt;br&gt;
    • Cap context at 4K–8K for mobile (128K is possible but eats RAM)&lt;br&gt;
    • Stream tokens. Always. Users hate staring at a blank screen.&lt;/p&gt;

&lt;p&gt;Security bonus: because E2B/E4B run 100% offline, user data (exam answers, private notes, photos) never touches your servers. Huge privacy win.&lt;/p&gt;

&lt;p&gt;I’m using this exact stack right now for an offline-first tutor app and it’s buttery smooth.&lt;br&gt;
Drop your use case below and I’ll tell you which variant + exact loading path I’d pick for it.&lt;/p&gt;

&lt;p&gt;Useful resources (all fresh as of April 2026):&lt;/p&gt;

&lt;p&gt;Official Gemma 4 announcement: &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/&lt;/a&gt;&lt;br&gt;
Model card + sizes: &lt;a href="https://ai.google.dev/gemma/docs/core/model_card_4" rel="noopener noreferrer"&gt;https://ai.google.dev/gemma/docs/core/model_card_4&lt;/a&gt;&lt;br&gt;
Full model overview (E2B/E4B details): &lt;a href="https://ai.google.dev/gemma/docs/core" rel="noopener noreferrer"&gt;https://ai.google.dev/gemma/docs/core&lt;/a&gt;&lt;br&gt;
Android AICore + ML Kit guide: &lt;a href="https://android-developers.googleblog.com/2026/04/AI-Core-Developer-Preview.html" rel="noopener noreferrer"&gt;https://android-developers.googleblog.com/2026/04/AI-Core-Developer-Preview.html&lt;/a&gt;&lt;br&gt;
LiteRT-LM mobile deployment: &lt;a href="https://ai.google.dev/edge/litert-lm" rel="noopener noreferrer"&gt;https://ai.google.dev/edge/litert-lm&lt;/a&gt;&lt;br&gt;
Hugging Face E2B/E4B quantized models: &lt;a href="https://huggingface.co/google/gemma-4-E2B-it" rel="noopener noreferrer"&gt;https://huggingface.co/google/gemma-4-E2B-it&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Who’s actually shipping Gemma 4 on device right now? Show me your stack 🙌&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>ondeviceai</category>
      <category>mobile</category>
      <category>offlineai</category>
    </item>
    <item>
      <title>Why your LLM agent fails at 3 AM (and how state machines fix it)</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Mon, 06 Apr 2026 09:35:26 +0000</pubDate>
      <link>https://forem.com/system_rationale/why-your-llm-agent-fails-at-3-am-and-how-state-machines-fix-it-3691</link>
      <guid>https://forem.com/system_rationale/why-your-llm-agent-fails-at-3-am-and-how-state-machines-fix-it-3691</guid>
      <description>&lt;p&gt;Why your LLM agent fails at 3 AM (and how state machines fix it)&lt;/p&gt;

&lt;p&gt;#agents #llm #langgraph #systemdesign #aiinfra&lt;/p&gt;

&lt;p&gt;I've been reading postmortems from teams running LLM agents in production.&lt;/p&gt;

&lt;p&gt;Same failure every time.&lt;/p&gt;

&lt;p&gt;Not model quality. Not prompt engineering. The architecture.&lt;/p&gt;

&lt;p&gt;Most AI agents today still look like this:&lt;/p&gt;

&lt;p&gt;User Input → LLM Call → Tool Call → LLM Call → Output&lt;/p&gt;

&lt;p&gt;A chain. Linear. Stateless. Hopeful.&lt;/p&gt;

&lt;p&gt;Works great in a notebook. Breaks under real load.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 4 ways chains die in production
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Infinite loops&lt;/strong&gt;&lt;br&gt;
Agent calls a tool → tool fails → agent retries → tool fails → agent retries.&lt;br&gt;
No exit condition. You're burning tokens at 3 AM while sleeping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. No checkpoint on failure&lt;/strong&gt;&lt;br&gt;
Step 7 of 10 fails. You restart from step 1. Every. Single. Time.&lt;br&gt;
Duplicate side effects — emails, API writes, deploys — retried blindly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Opaque debugging&lt;/strong&gt;&lt;br&gt;
You see the final error. Not which step poisoned the state.&lt;br&gt;
No trace. No replay. Just vibes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Mixed mutation semantics&lt;/strong&gt;&lt;br&gt;
Read-only and write steps treated identically.&lt;br&gt;
A retry re-applies a deployment or a payment. You've now deployed twice.&lt;/p&gt;




&lt;h2&gt;
  
  
  The mental model shift
&lt;/h2&gt;

&lt;p&gt;Stop thinking: "prompt chain"&lt;br&gt;
Start thinking: "distributed system with state"&lt;/p&gt;

&lt;p&gt;A state machine models your workflow as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;States — Idle, Planning, Executing, Validating, Recovering&lt;/li&gt;
&lt;li&gt;Transitions — conditional, guarded, audited&lt;/li&gt;
&lt;li&gt;Persisted state — survives crashes, enables checkpointing, replay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LangGraph made this practical. Every node writes to a shared state object. Every edge is conditional.&lt;/p&gt;

&lt;p&gt;If a node fails → resume from the last checkpoint. Not from scratch.&lt;/p&gt;
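
&lt;p&gt;A toy version of checkpointed execution in plain Python, hand-rolled for illustration (not the LangGraph API):&lt;/p&gt;

```python
# Toy checkpointed execution: every node writes to a shared state dict
# that is persisted after each successful step, so a crash resumes at the
# failed node instead of node A. Hand-rolled, not the LangGraph API.
import json
import os

def run_graph(nodes, state_path):
    state = {"next": 0}
    if os.path.exists(state_path):          # resume from last checkpoint
        with open(state_path) as f:
            state = json.load(f)
    while len(nodes) > state["next"]:
        name, fn = nodes[state["next"]]
        fn(state)                           # may raise: checkpoint survives
        state["next"] += 1
        with open(state_path, "w") as f:
            json.dump(state, f)             # persist after every node
    return state
```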




&lt;h2&gt;
  
  
  What this actually looks like
&lt;/h2&gt;

&lt;p&gt;Chain:  A → B → C → D → Error (restart from A)&lt;/p&gt;

&lt;p&gt;Graph:  A → B → C → Error → Retry(C) → D&lt;br&gt;
                    ↓&lt;br&gt;
               HumanApproval → D&lt;/p&gt;

&lt;p&gt;The graph knows where it failed. It knows what to do next.&lt;br&gt;
The chain just panics.&lt;/p&gt;




&lt;p&gt;This is Part 1 of a series on building deterministic, production-grade multi-agent systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt; Why I'm using Gemma 4 26B MoE as the reasoning engine — and how it compares to GPT-4o on real cost.&lt;/p&gt;

&lt;p&gt;If you're building AI systems that need to work under an SLA — follow along.&lt;/p&gt;

&lt;p&gt;— System Rationale&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
