Forem: Hady Walied

Pre-Workflow Conversations: The Controller Pattern in CodeMachine

Hady Walied — Sun, 18 Jan 2026 16:00:00 +0000

When building agentic workflows, a common pattern emerges: before the autonomous pipeline runs, someone needs to gather requirements. CodeMachine v0.8.0 introduces the Controller pattern, a first-class way to have a conversation with an AI product owner before your workflow executes.

This walkthrough demonstrates building a spec-driven development workflow where a PO agent gathers requirements, then hands off to specialized agents for analysis, architecture, and implementation.

The Problem

Multi-agent workflows typically start with a prompt and run autonomously. But real development doesn't work that way. Requirements need clarification. Scope needs calibration. Assumptions need validation.

The naive solution: prepend a "requirements gathering" step. But this has issues:

Session management: The PO conversation should persist across interruptions
Handoff clarity: When does conversation end and execution begin?
Return capability: What if you need to talk to the PO mid-workflow?

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    Controller Phase                         │
│              (Interactive PO conversation)                  │
│                                                             │
│   User ←→ spec-po agent (session persisted)                 │
│                                                             │
│   [Enter with empty input] → Confirmation Dialog            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼ workflow:controller-continue
┌─────────────────────────────────────────────────────────────┐
│                   Execution Phase                           │
│              (Autonomous agent pipeline)                    │
│                                                             │
│   spec-analyst → spec-architect → spec-api-designer → ...   │
│                                                             │
│   [Press R] → Return to Controller Dialog                   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼ workflow:return-to-controller
┌─────────────────────────────────────────────────────────────┐
│                 Return to Controller                        │
│          (Pause workflow, resume PO session)                │
│                                                             │
│   User ←→ spec-po agent (same session)                      │
│                                                             │
│   [Enter with empty input] → Resume Workflow                │
└─────────────────────────────────────────────────────────────┘
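
The three phases in the diagram can be read as a small state machine. The sketch below is an illustration in Python, not CodeMachine's implementation: the event names `workflow:controller-continue` and `workflow:return-to-controller` come from the diagram, while the `Phase` enum, the `workflow:resume` event name, and the `next_phase` helper are hypothetical.

```python
from enum import Enum

class Phase(Enum):
    CONTROLLER = "controller"            # interactive PO conversation
    EXECUTION = "execution"              # autonomous agent pipeline
    RETURNED = "return-to-controller"    # workflow paused, PO session resumed

# (current phase, event) -> next phase.
# "workflow:resume" is a hypothetical name for the empty-Enter resume action.
TRANSITIONS = {
    (Phase.CONTROLLER, "workflow:controller-continue"): Phase.EXECUTION,
    (Phase.EXECUTION, "workflow:return-to-controller"): Phase.RETURNED,
    (Phase.RETURNED, "workflow:resume"): Phase.EXECUTION,
}

def next_phase(current: Phase, event: str) -> Phase:
    """Return the phase that follows `event`, or stay put if it does not apply."""
    return TRANSITIONS.get((current, event), current)

phase = Phase.CONTROLLER
for event in ("workflow:controller-continue",
              "workflow:return-to-controller",
              "workflow:resume"):
    phase = next_phase(phase, event)
    print(event, "->", phase.value)
```

Keeping the transitions in one table makes the handoff question ("when does conversation end and execution begin?") explicit: only the listed events change the phase, and anything else leaves it as is.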

Defining a Controller

The controller() function declares which agent handles pre-workflow conversation:

export default {
    name: 'Spec-Driven Development',
    controller: controller('spec-po', {}),
    specification: false,

    steps: [
        separator("∴ Discovery Phase ∴"),
        resolveStep('spec-po', {}),
        resolveStep('spec-analyst', {}),

        separator("∴ Design Phase ∴"),
        resolveStep('spec-architect', {}),
        resolveStep('spec-api-designer', {}),

        separator("∴ Implementation Phase ∴"),
        resolveStep('spec-setup', {}),
        resolveStep('spec-impl-orchestrator', {}),

        separator("⟲ Review Loop ⟲"),
        resolveStep('spec-tester', { interactive: false }),
        resolveModule('spec-review', { interactive: false, loopSteps: 2 }),
    ],
};

Key points:

controller('spec-po', {}) - Declares the PO agent for pre-workflow conversation
Same agent in steps - The spec-po agent appears both as controller AND as step 1
Automatic skip - When controller phase runs, step 1 is auto-completed (no redundant execution)

Controller Options

The controller() function accepts options for engine and model overrides:

controller('spec-po', {
    engine: 'claude',           // Override engine
    model: 'claude-4.5-sonnet'  // Override model
})

This is useful when your PO conversation requires different capabilities than the workflow steps.

The Controller Conversation Flow

When you start a workflow with a controller:

┌──────────────────────────────────────────────────────────────┐
│  CodeMachine v0.8.0                                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ▸ spec-po (Running)                                         │
│                                                              │
│  What would you like to build today?                         │
│──────────────────────────────────────────────────────────────│
│                                                              │
│  > A todo list app with Next.js and SQLite                   │
│                                                              │
└──────────────────────────────────────────────────────────────┘

The conversation continues until you press Enter with an empty input. A confirmation dialog appears:

┌────────────────────────────────────────┐
│   Ready to start the workflow?         │
│                                        │
│   [Start]  [Continue Chat]             │
└────────────────────────────────────────┘

Selecting "Start" triggers workflow:controller-continue, transitioning to the execution phase.

Session Persistence

The controller session persists to disk:

{
  "controllerConfig": {
    "agentId": "spec-po",
    "sessionId": "ses_44785e25dffeDZqs8kVN7KbfIx",
    "monitoringId": 1
  },
  "autonomousMode": "true"
}

This enables:

Resume on crash: If the CLI crashes mid-conversation, the session resumes
Return to controller: Mid-workflow return uses the same session
Log viewing: Clicking the completed PO step shows the full conversation
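
To make the resume-on-crash idea concrete, here is a minimal Python sketch of saving and reloading a controller config shaped like the JSON above. It is not CodeMachine's code: the helper names, the atomic-rename strategy, and the `ses_example` ID are assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

def save_controller_session(path: Path, agent_id: str, session_id: str,
                            monitoring_id: int) -> None:
    """Write the controller config atomically so a crash cannot leave a half-written file."""
    state = {"controllerConfig": {"agentId": agent_id,
                                  "sessionId": session_id,
                                  "monitoringId": monitoring_id}}
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    os.replace(tmp, path)  # atomic rename on the same filesystem

def load_session_id(path: Path, agent_id: str) -> Optional[str]:
    """Return the stored session ID for `agent_id`, or None to start a fresh session."""
    if not path.exists():
        return None
    config = json.loads(path.read_text()).get("controllerConfig", {})
    return config.get("sessionId") if config.get("agentId") == agent_id else None

state_file = Path(tempfile.mkdtemp()) / "template.json"
print(load_session_id(state_file, "spec-po"))   # no file yet, so a new session is needed
save_controller_session(state_file, "spec-po", "ses_example", 1)
print(load_session_id(state_file, "spec-po"))   # the saved ID, ready to resume
```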

Returning to Controller Mid-Workflow

Press R during workflow execution to pause and return to the controller:

┌──────────────────────────────────────────────────────────────┐
│  Workflow Pipeline (8 items)                                 │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ✓ Hady [PO] (completed)                                     │
│  ▸ Moaz [Analyst] (running)         ← Paused                 │
│  ○ Atef [Architect]                                          │
│  ○ Essam [API]                                               │
│                                                              │
├──────────────────────────────────────────────────────────────┤
│  [R] Controller  [Esc] Stop                                  │
└──────────────────────────────────────────────────────────────┘

When R is pressed:

Current step is aborted
Workflow state machine enters PAUSE
Phase switches to "onboarding"
Controller session resumes

After the conversation, pressing Enter resumes the workflow from the paused step.

Complete Workflow Example

Here's the full spec-driven workflow with controller:

export default {
    name: 'Spec-Driven Development',
    controller: controller('spec-po', {}),
    specification: false,

    tracks: {
        question: 'What are we working on?',
        options: {
            new_project: { label: 'New Project' },
            existing_app: { label: 'Existing App' },
            refactor: { label: 'Refactor' },
        },
    },

    conditionGroups: [
        {
            id: 'features',
            question: 'What features does your project have?',
            multiSelect: true,
            conditions: {
                has_ui: { label: 'Has UI' },
                has_auth: { label: 'Has Authentication' },
                has_db: { label: 'Has Database' },
            },
        },
    ],

    steps: [
        separator("∴ Discovery Phase ∴"),
        resolveStep('spec-po', {}),
        resolveStep('spec-analyst', {}),

        separator("∴ Design Phase ∴"),
        resolveStep('spec-architect', {}),
        resolveStep('spec-api-designer', {}),

        separator("∴ Implementation Phase ∴"),
        resolveStep('spec-setup', {}),
        resolveStep('spec-impl-orchestrator', {}),

        separator("⟲ Review Loop ⟲"),
        resolveStep('spec-tester', { interactive: false }),
        resolveModule('spec-review', { interactive: false, loopSteps: 2 }),
    ],

    subAgentIds: [
        'spec-dev-data',
        'spec-dev-api',
        'spec-dev-ui',
        'spec-dev-tests',
    ],
};

Key Benefits

Requirement	Solution
Requirements clarification	Controller conversation before execution
Session persistence	SessionID stored in template.json
Clear handoff	Confirmation dialog before transition
Mid-workflow return	`R` key pauses and resumes controller session
Log viewing	MonitoringID registered for completed step
Same agent, no duplication	Controller step auto-completed after phase

Conclusion

The Controller pattern addresses a fundamental gap in agentic workflows: the need for human-AI conversation before autonomous execution. By treating this as a first-class primitive, with session persistence, clear handoff UX, and mid-workflow return capability, CodeMachine enables workflows that match how real development actually works.

CodeMachine on GitHub: moazbuilds/codemachinecli

Workflow Examples: github.com/hadywalied/codemachine_example_workflows

Documentation: codemachine.co

Enterprise AI Agents: Traceability, Atomicity, and Memory Persistence with AgentHelm

Hady Walied — Thu, 25 Dec 2025 22:52:55 +0000

When deploying AI agents in enterprise environments, three requirements typically surface: every action must be traceable, multi-step operations must be atomic, and context must persist across sessions. AgentHelm v0.3.0 addresses all three.

This walkthrough demonstrates building a contract processing system with two specialized agents and full observability.

The Enterprise Requirements

Before diving into code, let's establish what enterprise deployments demand:

Traceability: Every tool call must be logged with inputs, outputs, timing, and cost
Atomicity: If step 3 of 5 fails, steps 1-2 must be rolled back
Memory Persistence: Agent context survives restarts and can be audited
Cost Visibility: Know exactly what each operation costs before the invoice arrives

Architecture Overview

┌────────────────────────────────────────────────────────────┐
│                      PlannerAgent                          │
│              (Generates execution blueprint)               │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
              ┌───────────────┴───────────────┐
              │         Orchestrator          │
              │   (Saga pattern execution)    │
              └───────────────┬───────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
┌─────────────────────┐         ┌─────────────────────┐
│   AnalysisAgent     │         │   DocumentAgent     │
│ (Extract & analyze) │         │  (Generate & save)  │
└─────────────────────┘         └─────────────────────┘
              │                               │
              └───────────────┬───────────────┘
                              ▼
              ┌───────────────────────────────┐
              │          MemoryHub            │
              │  (SQLite short-term + Qdrant) │
              └───────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │       ExecutionTracer         │
              │  (SQLite + OpenTelemetry)     │
              └───────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │           Jaeger              │
              │   (Trace visualization)       │
              └───────────────────────────────┘

Setup

pip install agenthelm

import dspy
from agenthelm import (
    ToolAgent, PlannerAgent, Orchestrator, AgentRegistry,
    tool, MemoryHub, ExecutionTracer
)
from agenthelm.core.storage import SqliteStorage
from agenthelm.tracing import init_tracing

lm = dspy.LM("mistral/mistral-large-latest")

Configure Observability First

In enterprise deployments, observability is not an afterthought.

# Initialize OpenTelemetry with Jaeger
init_tracing(
    service_name="contract-processor",
    otlp_endpoint="http://jaeger:4317",
    enabled=True
)

# Create execution tracer with persistent storage
tracer = ExecutionTracer(
    storage=SqliteStorage("/var/log/agenthelm/traces.db"),
    session_id="contract-batch-2025-12-26"
)

Every tool execution is now:

Logged to SQLite with full inputs/outputs
Exported to Jaeger for distributed tracing
Tagged with session ID for batch correlation
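
For intuition about what "logged to SQLite with full inputs/outputs" involves, here is a standard-library sketch of a tracing decorator. It is an illustration only, not AgentHelm's ExecutionTracer: the `traced` decorator, the table layout, and the keyword-only calling convention are assumptions.

```python
import functools
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # a real deployment would use a file path
conn.execute(
    "CREATE TABLE traces (session_id TEXT, tool_name TEXT, inputs TEXT, "
    "output TEXT, status TEXT, execution_time REAL)"
)

def traced(session_id: str):
    """Decorator that records every call to the wrapped function as one row."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            start = time.perf_counter()
            status, output = "success", None
            try:
                output = fn(**kwargs)
                return output
            except Exception as exc:
                status, output = "failed", repr(exc)
                raise
            finally:
                # Runs on success and on failure, so no call goes unrecorded
                conn.execute(
                    "INSERT INTO traces VALUES (?, ?, ?, ?, ?, ?)",
                    (session_id, fn.__name__, json.dumps(kwargs),
                     json.dumps(output, default=str), status,
                     time.perf_counter() - start),
                )
        return wrapper
    return decorator

@traced(session_id="contract-batch-2025-12-26")
def extract_contract_data(document_path: str) -> dict:
    return {"parties": ["Acme Corp", "Widget Inc"], "value": 150000}

extract_contract_data(document_path="contract.pdf")
for row in conn.execute("SELECT tool_name, status, inputs FROM traces"):
    print(row)
```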

Configure Memory Persistence

# Production memory configuration
memory = MemoryHub(
    data_dir="/var/data/agenthelm",  # Local persistence
    # Or for network mode:
    # redis_url="redis://cache.internal:6379",
    # qdrant_url="http://qdrant.internal:6333"
)

MemoryHub provides:

Short-term memory: Session state with TTL (SQLite locally, Redis for distributed)
Semantic memory: Vector search for context retrieval (Qdrant with FastEmbed)
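
The short-term half is essentially a key-value table with an expiry column. The sketch below shows that idea with the standard library; it is not MemoryHub's implementation, and the `ShortTermStore` class and its `now` parameter (added so expiry can be demonstrated without sleeping) are hypothetical.

```python
import json
import sqlite3
import time
from typing import Any, Optional

class ShortTermStore:
    """Key-value store where each entry expires after `ttl` seconds."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv "
            "(key TEXT PRIMARY KEY, value TEXT, expires_at REAL)"
        )

    def set(self, key: str, value: Any, ttl: float, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO kv VALUES (?, ?, ?)",
            (key, json.dumps(value), now + ttl),
        )
        self.conn.commit()

    def get(self, key: str, now: Optional[float] = None) -> Any:
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT value, expires_at FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None or row[1] <= now:
            return None  # missing or expired
        return json.loads(row[0])

store = ShortTermStore()
store.set("batch_status", {"processed": 1, "failed": 0}, ttl=60, now=1000.0)
print(store.get("batch_status", now=1030.0))  # still valid
print(store.get("batch_status", now=1061.0))  # expired
```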

Define Tools with Compensation

Atomicity requires compensating actions for rollback.

@tool()
def extract_contract_data(document_path: str) -> dict:
    """Extract structured data from a contract document."""
    # Simulated extraction
    return {
        "parties": ["Acme Corp", "Widget Inc"],
        "value": 150000,
        "terms": "12 months",
        "effective_date": "2025-01-01"
    }

@tool()
def validate_compliance(contract_data: dict) -> dict:
    """Validate contract against compliance rules."""
    issues = []
    if contract_data.get("value", 0) > 100000:
        issues.append("Requires senior approval for contracts > $100k")
    return {"valid": len(issues) == 0, "issues": issues}

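@tool(compensating_tool="delete_record")
def create_record(contract_data: dict, record_type: str) -> str:
    """Create a record in the system."""
    # Built-in hash() is salted per process for strings, so the same contract
    # would get a different ID on every run; a digest keeps the ID stable.
    import hashlib
    digest = hashlib.sha256(str(contract_data).encode()).hexdigest()
    record_id = f"REC-{int(digest, 16) % 10000:04d}"
    # In production: database insert
    return record_id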

@tool()
def delete_record(record_id: str) -> str:
    """Delete a record (compensation action)."""
    # In production: database delete
    return f"Deleted {record_id}"

@tool(compensating_tool="archive_document")
def generate_summary(contract_data: dict, output_path: str) -> str:
    """Generate and save a contract summary document."""
    content = f"Contract Summary: {contract_data}"
    with open(output_path, "w") as f:
        f.write(content)
    return output_path

@tool()
def archive_document(output_path: str) -> str:
    """Archive a document (compensation action)."""
    import os
    if os.path.exists(output_path):
        os.rename(output_path, f"{output_path}.archived")
    return f"Archived {output_path}"

Note the compensating_tool parameter. When the Orchestrator detects a failure, it automatically calls these in reverse order.

Create Specialized Agents

# Agent 1: Analysis specialist
analysis_agent = ToolAgent(
    name="analyst",
    lm=lm,
    tools=[extract_contract_data, validate_compliance],
    tracer=tracer,
    memory=memory,
    role="You are a contract analysis specialist. Extract data accurately."
)

# Agent 2: Document specialist  
document_agent = ToolAgent(
    name="documenter",
    lm=lm,
    tools=[create_record, generate_summary],
    tracer=tracer,
    memory=memory,
    role="You are a document management specialist. Create records precisely."
)

# Register agents
registry = AgentRegistry()
registry.register(analysis_agent)
registry.register(document_agent)

Both agents share:

The same tracer (unified trace storage)
The same memory hub (shared context)

Generate the Execution Plan

# Planner agent has visibility into all tools
planner = PlannerAgent(
    name="planner",
    lm=lm,
    tools=[
        extract_contract_data, validate_compliance,
        create_record, generate_summary
    ]
)

plan = planner.plan(
    "Process contract.pdf: extract data, validate compliance, "
    "create a system record, and generate a summary document"
)

print(plan.to_yaml())

Generated plan:

goal: Process contract and create records
reasoning: |
  Sequential process: extract first, then validate, then create
  record, finally generate summary. Each step depends on previous.
steps:
  - id: extract
    agent: analyst
    tool: extract_contract_data
    description: Extract structured data from contract
    args:
      document_path: "contract.pdf"

  - id: validate
    agent: analyst
    tool: validate_compliance
    description: Check against compliance rules
    depends_on: [extract]
    args:
      contract_data: "${extract.result}"

  - id: record
    agent: documenter
    tool: create_record
    description: Create system record
    depends_on: [validate]
    args:
      contract_data: "${extract.result}"
      record_type: "contract"

  - id: summary
    agent: documenter
    tool: generate_summary
    description: Generate summary document
    depends_on: [record]
    args:
      contract_data: "${extract.result}"
      output_path: "/output/contract_summary.md"
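
The `${extract.result}` placeholders are what tie the steps together. As a rough sketch of how such a plan could be interpreted (this is not AgentHelm's Orchestrator; `resolve_args`, `run_plan`, and the lambda tools are hypothetical), each step's arguments are resolved against the results of earlier steps before the tool is called:

```python
import re
from typing import Any, Callable, Dict, List

PLACEHOLDER = re.compile(r"^\$\{(\w+)\.result\}$")

def resolve_args(args: Dict[str, Any], results: Dict[str, Any]) -> Dict[str, Any]:
    """Replace "${step_id.result}" strings with the stored result of that step."""
    resolved = {}
    for name, value in args.items():
        match = PLACEHOLDER.match(value) if isinstance(value, str) else None
        resolved[name] = results[match.group(1)] if match else value
    return resolved

def run_plan(steps: List[dict], tools: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Run steps in listed order, which the plan above already sorts by dependency."""
    results: Dict[str, Any] = {}
    for step in steps:
        missing = [d for d in step.get("depends_on", []) if d not in results]
        if missing:
            raise RuntimeError(f"step {step['id']} ran before {missing}")
        results[step["id"]] = tools[step["tool"]](**resolve_args(step["args"], results))
    return results

tools = {
    "extract_contract_data": lambda document_path: {"value": 150000},
    "validate_compliance": lambda contract_data: {"valid": contract_data["value"] <= 100000},
}
steps = [
    {"id": "extract", "tool": "extract_contract_data",
     "args": {"document_path": "contract.pdf"}},
    {"id": "validate", "tool": "validate_compliance", "depends_on": ["extract"],
     "args": {"contract_data": "${extract.result}"}},
]
print(run_plan(steps, tools))
```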

Review and Edit the Plan

Before execution, the plan can be reviewed and modified.

# Save plan for review
plan_path = "/reviews/contract_plan.yaml"
with open(plan_path, "w") as f:
    f.write(plan.to_yaml())

# Manual review happens here...
# Reviewer can edit the YAML, add steps, modify args

# Load reviewed plan (the context manager closes the file handle)
from agenthelm import Plan
with open(plan_path) as f:
    reviewed_plan = Plan.from_yaml(f.read())

# Approve for execution
reviewed_plan.approved = True

In production, this review step integrates with your approval workflow: Slack notifications, PR-based reviews, or manual sign-off.

Execute with Saga Rollback

orchestrator = Orchestrator(
    registry=registry,
    enable_rollback=True  # Saga pattern enabled
)

# execute() is awaited, so run this inside an async function (or a notebook cell)
result = await orchestrator.execute(reviewed_plan)

If generate_summary fails after create_record succeeds:

generate_summary marked as FAILED
Orchestrator triggers rollback
delete_record called automatically (compensating action for create_record)
System returns to consistent state
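
The rollback sequence above is the core of the Saga pattern, and it fits in a few lines. This is a generic standard-library sketch, not the Orchestrator's code: `run_saga`, the in-memory `records` set, and the `REC-0001` ID are made up for the demonstration.

```python
from typing import Callable, List, Optional, Tuple

# (name, action, compensation); the compensation receives the action's result
Step = Tuple[str, Callable[[], str], Optional[Callable[[str], None]]]

def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Tuple[str, Optional[Callable[[str], None]], str]] = []
    for name, action, compensate in steps:
        try:
            result = action()
        except Exception as exc:
            log.append(f"{name} FAILED: {exc}")
            for done_name, undo, done_result in reversed(completed):
                if undo is not None:
                    undo(done_result)
                    log.append(f"compensated {done_name}")
            return False
        completed.append((name, compensate, result))
        log.append(f"{name} ok")
    return True

records = set()

def create_record() -> str:
    records.add("REC-0001")
    return "REC-0001"

def delete_record(record_id: str) -> None:
    records.discard(record_id)

def generate_summary() -> str:
    raise IOError("disk full")  # simulate the failing step

log: List[str] = []
ok = run_saga(
    [("create_record", create_record, delete_record),
     ("generate_summary", generate_summary, None)],  # the failed step is never compensated
    log,
)
print(ok, records, log)
```

Only steps that completed are compensated, and they are undone newest-first, which is why the record is deleted even though the failure happened in a later step.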

Inspect Traces

After execution, every tool call is available for inspection:

# Programmatic access
for event in result.events:
    print(f"{event.tool_name}: {event.execution_time:.3f}s, ${event.estimated_cost_usd:.4f}")

# Summary
print(f"Total cost: ${result.total_cost_usd:.4f}")
print(f"Total tokens: {result.token_usage.total_tokens}")

Via CLI:

# List recent traces
agenthelm traces list -s /var/log/agenthelm/traces.db

# Filter by tool
agenthelm traces filter --tool create_record --status success

# Export for audit
agenthelm traces export -o audit_report.json -f json
agenthelm traces export -o audit_report.csv -f csv

Memory for Context Continuity

Store and retrieve context across sessions:

from agenthelm.memory import MemoryContext

async with MemoryContext(memory, session_id="contract-batch-2025-12-26") as ctx:
    # Store processing context
    await ctx.set("last_processed_contract", "contract.pdf")
    await ctx.set("batch_status", {"processed": 1, "failed": 0})

    # Store semantic memory for future retrieval
    await ctx.store_memory(
        "Contract with Acme Corp processed successfully. Value: $150k, 12 months.",
        metadata={"contract_id": "contract.pdf", "status": "complete"}
    )

# Later, in another session
async with MemoryContext(memory, session_id="new-session") as ctx:
    # Recall relevant past contracts
    results = await ctx.recall("Acme Corp contracts", top_k=5)
    for r in results:
        print(f"Score: {r.score:.2f} - {r.text}")

Jaeger Integration

With OpenTelemetry configured, view traces in Jaeger:

# Start Jaeger
docker run -d -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one

# Run your agent workflow
python process_contracts.py

# Open Jaeger UI
open http://localhost:16686

Each agent execution appears as a span with:

Tool name and arguments
Execution duration
Success/failure status
Cost attribution

Key Enterprise Benefits

Requirement	AgentHelm Feature
Audit trail	ExecutionTracer + SQLite storage
Distributed tracing	OpenTelemetry + Jaeger
Atomic operations	Saga pattern with compensating tools
Session persistence	MemoryHub with Redis/SQLite
Context search	Semantic memory with Qdrant
Cost control	Built-in pricing for 20+ LLM providers
Human review	Plan YAML export/import workflow

Conclusion

Enterprise AI agent deployments require more than just "an agent that works." They require:

Traceability for compliance and debugging
Atomicity for data consistency
Memory persistence for context continuity

AgentHelm v0.3.0 provides these as first-class features, not afterthoughts.

Documentation: hadywalied.github.io/agenthelm

GitHub: github.com/hadywalied/agenthelm

Announcing AgentHelm v0.3.0: Production-Ready AI Agent Orchestration

Hady Walied — Wed, 24 Dec 2025 15:49:07 +0000

After months of iteration, I'm excited to release AgentHelm v0.3.0, a significant step toward making AI agents production-ready.

The Problem

Building AI agents is easy. Running them reliably in production is hard.

You need to handle:

Multi-step execution with failure recovery
Memory persistence across sessions
Cost and token tracking
Observability and debugging
Tool orchestration at scale

Most agent frameworks focus on the first mile (getting an agent to work) but neglect the last mile (keeping it running reliably).

What's New in v0.3.0

Plan-Driven Execution

Instead of letting agents run wild, AgentHelm introduces a plan-first approach:

Task → Plan Generation → Human Review → Execution

The PlannerAgent generates a structured plan with steps and dependencies. You review it. Then the Orchestrator executes it with parallel execution where possible.
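
"Parallel execution where possible" means running every step whose dependencies are already satisfied at the same time. A minimal sketch of that scheduling idea, using Python's `graphlib` and `asyncio` (Python 3.9+); the `execute` function and the three-step plan are hypothetical, not the Orchestrator's internals:

```python
import asyncio
from graphlib import TopologicalSorter
from typing import Dict, List, Set

async def run_step(step_id: str, order: List[str]) -> None:
    await asyncio.sleep(0)  # stand-in for a real tool call
    order.append(step_id)

async def execute(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Run steps wave by wave; steps in the same wave run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the plan has a dependency cycle
    waves: List[List[str]] = []
    order: List[str] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        await asyncio.gather(*(run_step(s, order) for s in ready))
        sorter.done(*ready)
        waves.append(ready)
    return waves

# Hypothetical plan: two independent extractions feed one comparison step
plan = {"extract_a": set(), "extract_b": set(), "compare": {"extract_a", "extract_b"}}
print(asyncio.run(execute(plan)))
```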

If something fails mid-execution, the Saga pattern kicks in: compensating actions roll back completed steps automatically.

Unified Memory Hub

Memory is no longer an afterthought. MemoryHub provides:

Short-term memory: Key-value storage with TTL (InMemory, SQLite, Redis)
Semantic memory: Vector search with Qdrant and FastEmbed

Zero-config by default, but scales to production with Redis and network Qdrant.

Full CLI

Everything works from the command line:

agenthelm run "Analyze this quarter's sales data"
agenthelm plan "Build a customer support bot" -o plan.yaml
agenthelm execute plan.yaml --dry-run
agenthelm chat
agenthelm traces list

OpenTelemetry Integration

Every tool execution is traced. Export to Jaeger for visualization. Track costs across 20+ LLM providers with built-in pricing.

MCP Support

Connect to Model Context Protocol servers and use their tools directly in your agents.

Getting Started

pip install agenthelm
agenthelm init
agenthelm chat

What's Next

v0.4.0 will focus on:

Web dashboard for trace visualization
Advanced conflict resolution in multi-agent workflows
Policy engine for budget and constraint enforcement
Webhook integrations

AgentHelm is open source. Contributions welcome.

From 40% to 100% SQL Generation Accuracy: Why Local AI Needs Self-Correction, Not Perfect Prompts

Hady Walied — Tue, 02 Dec 2025 22:14:13 +0000

I spent 12 hours fighting a local AI model to generate valid SQL queries. My success rate went from 40% to 100%, not by prompt engineering, but by teaching the model to learn from its own mistakes.

Key Takeaways (TL;DR):

• Self-correction loops beat perfect-first-time approaches for local AI
• DSPy optimization improved SQL accuracy from 40% to 100% without manual prompt tuning
• Defensive parsing is critical: LLM output is probabilistic, your code isn't
• When to use this: privacy-critical data, edge deployment, vendor lock-in avoidance

The Problem: Local Models Are "Creative" (and that's bad for SQL)

Building a Retail Analytics Copilot that runs entirely on a laptop (using a quantized 24B model) sounds great for privacy, but it's a nightmare for reliability. Unlike GPT models, which follow instructions like a senior engineer, local models are like enthusiastic interns: they try hard, but they hallucinate syntax, forget schema details, and love to chat when they should be coding.

My initial baseline was dismal: only 40% of generated SQL queries actually executed. The rest were plagued by syntax errors, hallucinated columns, or conversational fluff ("Here is your query...").

Here are the three counterintuitive lessons I learned getting that number to 100%.

Lesson 1: Self-Correction > Perfect Prompts

I started where everyone starts: Prompt Engineering. I spent hours crafting the "perfect" system prompt, begging the model to "ONLY output SQL" and "CHECK the schema." It didn't work. The model would obey for 5 queries and fail on the 6th.

The Shift: Instead of trying to prevent errors, I built a system that expects them.

I implemented a LangGraph workflow with a dedicated Repair Loop.

Generate: The model writes a query.
Execute: We try to run it against SQLite.
Catch: If it fails, we catch the error (e.g., no such column: 'Price').
Feedback: We feed the exact error message back to the model: "The previous query failed with error X. Fix it."

This pattern generalizes beyond SQL. Anytime you're working with probabilistic systems, external APIs that can fail, or ambiguous user input, you should design for graceful degradation.
Amazon's mantra: "Everything fails all the time." Your architecture should assume failure and route around it.

The Hidden Benefit: This loop generates training data. Every (failed_query, error_message, corrected_query) triple becomes a potential few-shot example for optimization. You're not just fixing bugs; you're building a self-improving system.

The Code

Here is the actual logic for the repair loop:

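import sqlite3

# `cursor` (a sqlite3 cursor) and `AgentState` (the graph's state dict) are defined elsewhere

def sql_execution_node(state: AgentState) -> AgentState:
    """Execute SQL and handle errors gracefully."""
    query = state["sql_query"]

    try:
        cursor.execute(query)
        state["sql_results"] = cursor.fetchall()
        state["errors"] = []
    except sqlite3.Error as e:
        # Don't crash: capture the error and route to repair.
        # sqlite3.Error also covers non-operational failures (e.g. multiple statements).
        state["sql_results"] = []
        # .get() avoids a KeyError when the very first execution fails
        state["errors"] = state.get("errors", []) + [str(e)]
        state["feedback"] = f"SQL execution failed: {e}. Fix the query."
        state["repair_count"] = state.get("repair_count", 0) + 1

    return state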

def should_repair(state: AgentState) -> str:
    """Conditional edge: repair or continue?"""
    if state["errors"] and state["repair_count"] < 2:
        return "sql_generator"  # Loop back
    return "synthesizer"  # Give up or continue
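
If you want to try the pattern without LangGraph, the same loop fits in one function. This is a self-contained sketch, not the project's code: `run_with_repair` is a hypothetical helper, and `fake_model` stands in for the LLM by returning a bad column name first and a corrected query once it receives feedback.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

def run_with_repair(
    conn: sqlite3.Connection,
    generate: Callable[[str, Optional[str]], str],
    question: str,
    max_repairs: int = 2,
) -> Tuple[List[tuple], List[str]]:
    """Generate SQL, execute it, and feed any error back for up to `max_repairs` retries."""
    errors: List[str] = []
    feedback: Optional[str] = None
    for _ in range(max_repairs + 1):
        query = generate(question, feedback)
        try:
            return conn.execute(query).fetchall(), errors
        except sqlite3.Error as exc:
            errors.append(str(exc))
            feedback = f"The previous query failed with error: {exc}. Fix it."
    return [], errors  # give up after the retry budget is spent

def fake_model(question: str, feedback: Optional[str]) -> str:
    # Stand-in for the LLM: wrong column first, corrected once it sees feedback
    if feedback is None:
        return "SELECT Price FROM orders"
    return "SELECT unit_price FROM orders"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, unit_price REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.5)")

rows, errors = run_with_repair(conn, fake_model, "What are the prices?")
print(rows)    # rows from the corrected query
print(errors)  # the error message that triggered the repair
```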

Lesson 2: The "ELECT" Bug (Defensive Parsing)

Local models generate creative variations: SELECT, Select, SQL: SELECT. I wrote defensive code to strip prefixes using lstrip("SQL:"). Seems reasonable, right?

Wrong.

lstrip removes any character in the set from the left. Since SELECT starts with S, and S is in SQL:, it got stripped. My code turned SELECT * FROM... into ELECT * FROM....

The model was doing its job; my parsing broke it.

The Lesson: LLMs are probabilistic. Your parsing must be deterministic.

Don't use character-level stripping (lstrip).
Do use Regex extraction (re.search(r'(SELECT\s+.*)', ...)).
Better yet: Enforce structured output (JSON mode) so you aren't parsing freeform text at all.
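
Here is the bug and one possible fix side by side. `naive_clean` reproduces the lstrip mistake; `clean_sql_output` is a sketch of the regex approach that assumes the model's answer contains a single SELECT statement (it does not handle WITH clauses or other statement types).

```python
import re

def naive_clean(text: str) -> str:
    # Buggy on purpose: lstrip treats "SQL:" as a set of characters, not a prefix
    return text.lstrip("SQL:").strip()

_SELECT = re.compile(r"\bSELECT\b.*", re.IGNORECASE | re.DOTALL)

def clean_sql_output(text: str) -> str:
    """Return the text from the first SELECT keyword onward, minus trailing backticks."""
    match = _SELECT.search(text)
    if match is None:
        raise ValueError(f"no SELECT statement found in: {text!r}")
    return match.group(0).strip().strip("`").strip()

print(naive_clean("SELECT * FROM orders"))        # ELECT * FROM orders
print(clean_sql_output("SQL: SELECT * FROM orders"))
print(clean_sql_output("Here is your query:\nSelect id FROM orders"))
```

When the prefix is known exactly, `str.removeprefix("SQL:")` (Python 3.9+) is the direct replacement for the lstrip call, since it removes a whole prefix rather than a character set.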

Lesson 3: Optimization Is Code, Not Magic

I stopped hand-writing prompts. Instead, I used DSPy.

Think of traditional prompt engineering like manually tuning hyperparameters in a neural network by running experiments and eyeballing loss curves. DSPy is like having backpropagation: it automates the search for optimal prompts using gradient-free optimization over a metric you define.

The mental shift: You're not writing prompts anymore. You're writing loss functions.

I defined a metric: "A query is good if it executes AND returns non-empty results." Then, I used the BootstrapFewShot optimizer. It acted like a teacher, generating multiple potential SQL queries, running them, and keeping only the ones that passed my metric to use as "few-shot examples".
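
That metric is short enough to show in full. The sketch below follows the `(example, pred, trace=None)` shape that DSPy metrics conventionally take, but it runs without DSPy installed: the `sql_query` field name, the in-memory table, and the `SimpleNamespace` predictions are assumptions for the demonstration.

```python
import sqlite3
from types import SimpleNamespace

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, unit_price REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.5)")

def sql_metric(example, pred, trace=None) -> bool:
    """Pass only if the predicted query executes and returns at least one row."""
    try:
        rows = conn.execute(pred.sql_query).fetchall()
    except sqlite3.Error:
        return False
    return len(rows) > 0

print(sql_metric(None, SimpleNamespace(sql_query="SELECT unit_price FROM orders")))
print(sql_metric(None, SimpleNamespace(sql_query="SELECT Price FROM orders")))
print(sql_metric(None, SimpleNamespace(sql_query="SELECT * FROM orders WHERE id = 99")))
```

The third call is the interesting one: the query is valid SQL but returns nothing, so the metric rejects it as a few-shot example.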

The Results

Metric	Baseline	After Optimization	After Repair Loop
Valid SQL (%)	40%	85%	100%
Correct Format (%)	30%	60%	95%
End-to-End Success (%)	12%	51%	66%*

Why 100% SQL but only 66% end-to-end?
Because SQL accuracy is necessary but not sufficient. The failures now happen at different layers: the router misclassifies 15% of hybrid questions as pure SQL, the retriever returns irrelevant docs 10% of the time, and the synthesizer occasionally combines SQL results with RAG context incorrectly. This reveals a critical insight: optimizing one component creates new bottlenecks elsewhere. The system is now limited by orchestration logic, not model quality.

3 Things You Can Steal Today

The Repair Pattern: Add a feedback field to your LLM state. On failure, inject the error message and retry. Each repair costs one extra LLM call; in this project it lifted valid SQL from 85% to 100%.

The ELECT Test: Run this on your parsing logic:

assert clean_sql_output("SQL: SELECT * FROM orders") == "SELECT * FROM orders"
assert clean_sql_output("SELECT * FROM orders") == "SELECT * FROM orders"

If it fails, you're using string methods wrong.

The DSPy Starter Template: Define your task as (inputs, output, metric). Let the optimizer find examples instead of writing them manually.

Why This Matters (Beyond This Project)

The AI landscape is bifurcating:

Cloud-first (GPT-5.1, Claude, etc.): powerful but expensive and privacy-risky.
Edge-first (local models): cheaper and private, but harder to wrangle.

Companies that master local AI will own regulated industries (healthcare, finance, government) where cloud LLMs are non-starters. The bottleneck isn't model weights, it's reliability engineering.

If you can build systems that make 7B models behave like 70B models through clever orchestration, you're solving a $100B problem. This isn't just a technical exercise, it's a strategic moat.

The skills that matter:

Systems thinking: Understanding failure modes and designing around them.
Optimization: Treating prompts as learnable parameters, not art.
Defensive engineering: Building for the probabilistic world.

Master these, and you're not competing with prompt engineers. You're competing with infrastructure engineers at AI-native companies.

You can find this project on GitHub: hadywalied/RACopilot

The Production AI Agent Checklist

Hady Walied — Mon, 10 Nov 2025 08:23:30 +0000

Why This Checklist Exists

AI agents are moving from demos to production. But most frameworks are optimized for prototyping, not reliability.

This checklist comes from real production deployments, including their failures and incidents. It's what I wish I had before deploying my first agent to a live system.

Use this before deploying any AI agent that:

Modifies state (databases, APIs, files)
Handles money (payments, refunds, billing)
Sends communications (emails, SMS, notifications)
Makes decisions with business impact

If you can answer YES to everything here, your agent is probably production-ready. If not, you know exactly what needs fixing.

1. Safety & Guardrails

Can your agent safely fail?

[ ] Every state-modifying operation has a rollback procedure
- Example: charge_customer has corresponding refund_customer
- Rollbacks execute automatically on downstream failures
- Rollbacks are tested and verified to work
[ ] High-risk operations require human approval
- Financial transactions above threshold
- Data deletion or modification
- External communications to customers
- Clear approval workflow with full context shown to approver
[ ] Agent has hard budget limits
- Maximum API tokens per execution
- Maximum execution time (timeout)
- Maximum API calls per minute/hour
- Budget violations halt execution, don't just log
[ ] Agent cannot access forbidden resources
- Production databases are read-only or require approval
- Customer PII is redacted in logs
- Secrets are not exposed to LLM prompts
- File system access is sandboxed to specific directories
[ ] Critical operations are idempotent
- Retrying the same operation produces the same result
- No duplicate charges, emails, or state changes
- Request IDs or transaction IDs prevent duplicates
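
The idempotency item can be sketched as a request-ID lookup placed in front of the side effect. This is a minimal illustration, not AgentHelm's API: `IdempotentExecutor`, `charge`, and the in-memory dict are hypothetical names, and a real deployment would likely back the store with a database that enforces a unique constraint on the request ID.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs an operation at most once per request ID (hypothetical helper)."""

    def __init__(self) -> None:
        # In-memory store for illustration only; use a durable store in production.
        self._results: Dict[str, Any] = {}

    def run(self, request_id: str, operation: Callable[[], Any]) -> Any:
        # A repeated request ID returns the stored result instead of re-executing.
        if request_id in self._results:
            return self._results[request_id]
        result = operation()
        self._results[request_id] = result
        return result


charges = []  # stands in for a payment provider's ledger


def charge(customer_id: str, amount: float) -> dict:
    charges.append((customer_id, amount))
    return {"customer_id": customer_id, "amount": amount, "charge_no": len(charges)}


executor = IdempotentExecutor()
first = executor.run("req-123", lambda: charge("cust-1", 25.0))
retry = executor.run("req-123", lambda: charge("cust-1", 25.0))  # retried call
```

The retried call returns the stored result, so the ledger holds a single charge even though the operation was requested twice.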

2. Observability & Debugging

Can you debug a failure in under 60 seconds?

[ ] Every LLM call is logged with full context
- Complete prompt sent to LLM
- Complete response received
- Model name and parameters
- Timestamp and execution time
- Cost in tokens and dollars
[ ] Every tool execution is traced
- Tool name and parameters
- Return value or error
- Execution time
- Success/failure status
- Stack trace on errors
[ ] You can replay any execution from logs
- Stored traces include enough information to reproduce the execution
- You can step through agent reasoning (ReAct pattern)
- You can see why the LLM chose each action
[ ] You can query failures efficiently
- Filter by: status, tool name, date range, execution time
- Find patterns: "Which tools fail most often?"
- Export for analysis: CSV, JSON, or database format
[ ] Logs are structured, not just text
- JSON or JSONL format
- Consistent schema across executions
- Queryable with standard tools (jq, SQL, Elasticsearch)
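
As a rough sketch of the structured-logging items above, each LLM call can be written as one JSON object per line and filtered later. The field names and the `log_llm_call` / `failed_calls` helpers are assumptions made for illustration; they are not AgentHelm's trace schema.

```python
import io
import json
import time


def log_llm_call(stream, *, model: str, prompt: str, response: str,
                 tokens: int, duration_s: float, status: str) -> dict:
    """Append one LLM call as a single JSON line (hypothetical schema)."""
    record = {
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "response": response,
        "tokens": tokens,
        "duration_s": duration_s,
        "status": status,
    }
    stream.write(json.dumps(record) + "\n")
    return record


def failed_calls(stream_text: str) -> list:
    """Parse JSONL text and keep only records whose status is 'failed'."""
    records = [json.loads(line) for line in stream_text.splitlines() if line.strip()]
    return [r for r in records if r["status"] == "failed"]


log = io.StringIO()  # stands in for a log file opened in append mode
log_llm_call(log, model="example-model", prompt="Refund order 42?",
             response="call process_refund", tokens=120, duration_s=0.8,
             status="success")
log_llm_call(log, model="example-model", prompt="Send email",
             response="", tokens=15, duration_s=30.0, status="failed")
failures = failed_calls(log.getvalue())
```

Because every record shares one schema, the same file can also be queried with jq or loaded into a database without custom parsing.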

3. Reliability & Resilience

Does your agent recover gracefully from failures?

[ ] Transient failures trigger automatic retries
- Network timeouts retry with exponential backoff
- Rate limit errors wait and retry
- Configurable max retries per operation
- Different retry policies for different tools
[ ] Partial failures trigger automatic rollback
- If step 3 fails, steps 2 and 1 are undone
- Compensating transactions restore consistent state
- Rollback failures are logged and alerted
[ ] Critical paths have timeout protection
- No operation runs indefinitely
- Timeout values are tuned per operation type
- Timeouts trigger rollback, not just errors
[ ] System degrades gracefully when dependencies fail
- Non-critical tools can fail without stopping workflow
- Agent provides partial results when possible
- Clear error messages explain what failed and why
[ ] Concurrent executions don't interfere
- Multiple agent runs don't corrupt shared state
- Lock mechanisms prevent race conditions
- Workflow isolation is tested
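
The retry items above can be illustrated with a small helper. This is a generic sketch of exponential backoff, not AgentHelm's retry implementation; `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    *,
    max_retries: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient errors with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


calls = {"count": 0}
delays = []


def flaky() -> str:
    # Fails twice with a transient error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated network timeout")
    return "ok"


result = retry_with_backoff(flaky, max_retries=3, base_delay=0.5, sleep=delays.append)
```

Only the exception types listed in `retry_on` are retried; anything else propagates immediately, which keeps non-transient bugs from being masked by retries.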

4. Compliance & Audit

Can you prove what your agent did and why?

[ ] Full audit trail of agent decisions
- Who/what triggered the execution
- Every decision point with LLM reasoning
- Every action taken and its result
- Timestamps for entire chain of events
[ ] Human-in-the-loop for regulated operations
- GDPR data deletion requires approval
- Financial transactions require review
- Approval records stored permanently
- Approver identity and timestamp logged
[ ] Cost tracking per execution
- Total API cost (tokens × price per token)
- Per-model cost breakdown
- Cost alerts when budget is exceeded
- Historical cost trends
[ ] Reproducible execution from stored traces
- Exact LLM version and parameters recorded
- Tool versions and dependencies captured
- Can reproduce execution for compliance review
- Traces stored for required retention period
[ ] Sensitive data is handled correctly
- PII is redacted in logs
- Secrets never appear in traces
- Compliance with GDPR, CCPA, etc.
- Data retention policies enforced
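
The cost-tracking item (tokens × price per token, with a per-model breakdown) might look like the sketch below. `CostTracker` and `BudgetExceeded` are hypothetical names, and the prices are placeholders rather than real provider pricing. Consistent with the budget item in section 1, exceeding the budget raises instead of only logging.

```python
from dataclasses import dataclass, field
from typing import Dict


class BudgetExceeded(RuntimeError):
    """Raised to halt execution when the cost budget is exceeded."""


@dataclass
class CostTracker:
    # Prices are per 1,000 tokens; the numbers used below are placeholders.
    price_per_1k_tokens: Dict[str, float]
    budget_usd: float
    spent_by_model: Dict[str, float] = field(default_factory=dict)

    @property
    def total_usd(self) -> float:
        return sum(self.spent_by_model.values())

    def record(self, model: str, tokens: int) -> float:
        """Add the cost of one LLM call and halt if the budget is exceeded."""
        cost = tokens / 1000 * self.price_per_1k_tokens[model]
        self.spent_by_model[model] = self.spent_by_model.get(model, 0.0) + cost
        if self.total_usd > self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.total_usd:.4f} of ${self.budget_usd:.4f}"
            )
        return cost


tracker = CostTracker(price_per_1k_tokens={"small": 0.5, "large": 2.0}, budget_usd=5.0)
tracker.record("small", 2000)  # 2 * 0.5 = 1.0
tracker.record("large", 1000)  # 1 * 2.0 = 2.0

try:
    tracker.record("large", 2000)  # 4.0 more would bring the total to 7.0
    halted = False
except BudgetExceeded:
    halted = True
```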

5. Testing & Validation

Have you tested what happens when things break?

[ ] Unit tests for every tool
- Test success cases
- Test error cases
- Test edge cases (empty inputs, null values)
- Mock external dependencies
[ ] Integration tests for multi-step workflows
- Test complete end-to-end flows
- Test with real LLM (not mocked)
- Test with production-like data
- Verify rollback works correctly
[ ] Failure injection tests
- What if step 3 fails? Does rollback work?
- What if external API is down?
- What if LLM returns malformed response?
- What if network timeout occurs?
[ ] Cost estimation before production
- Run workflow with test data
- Measure token usage and API costs
- Project costs at production scale
- Set up cost alerts
[ ] Load testing for expected traffic
- Can system handle peak load?
- Are rate limits respected?
- Does performance degrade gracefully?
- Are bottlenecks identified?
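
A failure injection test can be as small as the sketch below, which uses the standard library's `unittest.mock` to make the email dependency fail and then checks that the compensating action ran. `refund_workflow` and the mocked `payments` / `email` clients are hypothetical stand-ins for your own workflow and services.

```python
from unittest.mock import Mock


def refund_workflow(payments, email) -> str:
    """Toy two-step workflow: refund, then notify; reverse the refund if notify fails."""
    txn = payments.refund("order-1", 20.0)
    try:
        email.send("customer@example.com", "Refund processed")
    except Exception:
        payments.reverse(txn)  # compensating action
        return "rolled_back"
    return "completed"


# Inject a failure into the email dependency.
payments = Mock()
payments.refund.return_value = "txn-1"
email = Mock()
email.send.side_effect = ConnectionError("injected: email API down")

status = refund_workflow(payments, email)
```

The same shape works for the other questions in this list: swap the `side_effect` for a timeout or a malformed response and assert on the resulting state.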

6. Monitoring & Alerting

Will you know immediately when something goes wrong?

[ ] Real-time alerts for critical failures
- Agent execution failures
- Budget overruns
- Rollback failures
- Suspicious patterns (unusual cost, repeated failures)
[ ] Dashboards for operational visibility
- Success/failure rates
- Average execution time
- Cost trends over time
- Most common failure modes
[ ] On-call runbooks for common issues
- What to do when agent fails
- How to manually trigger rollback
- How to disable agent in emergency
- Escalation paths for different failure types
[ ] Regular review of agent behavior
- Weekly review of failure patterns
- Monthly cost analysis
- Quarterly security audit
- Continuous improvement based on incidents

7. Human Oversight

Can humans intervene when needed?

[ ] Clear mechanism to pause/stop execution
- Emergency kill switch
- Graceful shutdown (finish current step, then stop)
- Preserve state for later resumption
[ ] Ability to manually override decisions
- Human can approve/reject specific actions
- Override decisions are logged
- Override doesn't break workflow
[ ] Escalation paths for edge cases
- Agent can ask for human help
- Clear SLAs for human response time
- Workflow pauses until human responds
[ ] Regular human review of agent outputs
- Spot checks of decisions
- Review of edge cases
- Validation that agent behavior matches intent
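
One way to sketch the pause/stop item is a stop flag that is checked between steps, so the current step finishes and the state needed to resume is preserved. `run_steps` and the three step functions are hypothetical; a real agent would persist the returned state rather than keep it in memory.

```python
import threading
from typing import Callable, List


def run_steps(steps: List[Callable[[], str]], stop: threading.Event) -> dict:
    """Run steps in order, checking the stop flag between steps (graceful shutdown)."""
    completed = []
    for index, step in enumerate(steps):
        if stop.is_set():
            # Preserve enough state to resume later from the first unfinished step.
            return {"status": "paused", "completed": completed, "resume_from": index}
        completed.append(step())
    return {"status": "done", "completed": completed, "resume_from": len(steps)}


stop_flag = threading.Event()


def step_one() -> str:
    return "verified"


def step_two() -> str:
    stop_flag.set()  # simulates an operator hitting the kill switch mid-step
    return "refunded"


def step_three() -> str:
    return "emailed"


state = run_steps([step_one, step_two, step_three], stop_flag)
```

Here the second step completes before the workflow pauses, and `resume_from` points at the third step, which never ran.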

8. Deployment & Operations

Is your deployment process safe and repeatable?

[ ] Staging environment for testing
- Separate from production
- Representative data (but not real customer data)
- Test all changes in staging first
[ ] Gradual rollout strategy
- Deploy to 1% of traffic first
- Monitor for issues before full rollout
- Easy rollback if problems detected
[ ] Version control for prompts and tools
- Prompts are versioned
- Tool definitions are versioned
- Can roll back to previous versions
[ ] Clear deployment documentation
- Step-by-step deployment guide
- Rollback procedures
- Contact information for incidents
[ ] Post-deployment validation
- Smoke tests run automatically
- Verify agent is working as expected
- Alert if smoke tests fail
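
For the gradual rollout item, a common approach is deterministic hash bucketing, so the same user always gets the same decision and raising the percentage only adds users. The sketch below is an assumption about how you might route traffic; `in_rollout` is a hypothetical helper, not part of AgentHelm.

```python
import hashlib


def in_rollout(user_id: str, percent: float) -> bool:
    """Deterministically place user_id in one of 10,000 buckets and compare to percent."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999
    return bucket < percent * 100  # e.g. 1% -> buckets 0..99


users = [f"user-{i}" for i in range(10_000)]
share_at_1 = sum(in_rollout(u, 1) for u in users) / len(users)
share_at_50 = sum(in_rollout(u, 50) for u in users) / len(users)
```

Because the bucket depends only on the user ID, everyone included at 1% is still included at 10%, 50%, and 100%.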

Implementation Guide

Using AgentHelm for Production Readiness

AgentHelm was built specifically to help you check these boxes. Here's how it maps to the checklist:

Safety & Guardrails:

@tool(
    requires_approval=True,  # Human approval for high-risk ops
    compensate_with=rollback_action,  # Automatic rollback
    max_retries=3,  # Retry transient failures
    timeout=30.0  # Hard timeout
)
def risky_operation(data: dict):
    return api.modify(data)

Observability:

# Query failures
agenthelm traces filter --status failed --date-from 2025-11-01

# Export for analysis
agenthelm traces export --output report.csv --format csv

# Replay execution
agenthelm traces show <trace_id>

Reliability:

Compensating transactions for rollback
Automatic retry with exponential backoff
Timeout protection per tool
SQLite/JSON storage for traces

Compliance:

Full audit trail with LLM reasoning
Approval records with timestamps
Cost tracking per execution
Reproducible from stored traces

Real-World Example: Customer Refund Agent

Here's how this checklist applies to a real production agent:

from agenthelm import tool

# Assumed to be defined elsewhere: orders_api, payments_api, email_api,
# log_verification_failure, and send_cancellation_email.
# Compensating tools are defined before the tools that reference them,
# because decorator arguments are evaluated when the function is defined.

# ✅ Safety: Rollback defined
# ✅ Observability: Automatically logged
# ✅ Reliability: Retries configured
@tool(
    compensate_with=log_verification_failure,
    max_retries=2
)
def verify_order(order_id: str) -> dict:
    """Step 1: Verify order exists and is refundable"""
    return orders_api.verify(order_id)

@tool()
def reverse_refund(order_id: str, amount: float) -> dict:
    """Rollback: Re-charge customer if needed"""
    return payments_api.charge(order_id, amount)

# ✅ Safety: Requires approval, has rollback
# ✅ Compliance: Approval logged with identity
@tool(
    requires_approval=True,  # Manager approves refunds >$100
    compensate_with=reverse_refund,
    timeout=30.0
)
def process_refund(order_id: str, amount: float) -> dict:
    """Step 2: Issue refund to customer"""
    return payments_api.refund(order_id, amount)

# ✅ Reliability: Has rollback for email failure
@tool(compensate_with=send_cancellation_email)
def send_refund_email(customer_email: str, amount: float):
    """Step 3: Notify customer"""
    return email_api.send(customer_email, f"Refund of ${amount} processed")

What this checklist ensures:

If email fails → refund reversed → customer re-charged → consistent state
If refund >$100 → human approval required → logged with timestamp
Every LLM decision → stored in SQLite → queryable for compliance
Network timeout → automatic retry → exponential backoff
Complete failure → full rollback → audit trail preserved

Your Next Steps

Print this checklist
Go through it for your agent - Honestly mark YES/NO for each item
Fix the NOs - Start with safety, then observability, then reliability
Test failure scenarios - Inject failures and verify rollback works
Deploy to staging - Run for 1 week, monitor closely
Gradual production rollout - 1%, 10%, 50%, 100%

If you can't check all these boxes, you're not ready for production. That's not a judgment; it's a fact. Production agents that modify real state need the same rigor as traditional software.

Contributing to This Checklist

This checklist is a living document based on real production experience. If you've deployed agents in production and learned lessons not captured here, please contribute:

Add items: What's missing from this checklist?
Share war stories: What failures have you seen?
Improve examples: How can this be more practical?

GitHub: https://github.com/hadywalied/agenthelm/blob/main/CHECKLIST.md
GitHub Discussion: https://github.com/hadywalied/agenthelm/discussions/12

Conclusion

Production AI agents are not demos. They need:

Transactional safety (rollback on failure)
Observability (debug in seconds, not hours)
Human oversight (approval for high-risk operations)
Operational discipline (monitoring, alerting, incident response)

This checklist captures what I've learned deploying agents. Use it to avoid the mistakes I made.

The goal isn't perfection. The goal is production-readiness.

GitHub: github.com/hadywalied
LinkedIn: linkedin.com/in/hadywalied
AgentHelm: github.com/hadywalied/agenthelm

Why Your AI Agent Needs a Ctrl+Z (And How I Built It)

Hady Walied — Thu, 06 Nov 2025 02:10:31 +0000

The 3 AM incident that taught me production agents need database-style transactions

The Problem Nobody Talks About

We're all building AI agents now. LangChain makes it easy to prototype. Claude and GPT-5 are surprisingly capable. Everyone's shipping agents to production.

But here's what the tutorials don't tell you: AI agents fail in ways traditional software doesn't.

A bug in your web app? Roll back the deployment. A database migration fails? Transactions ensure consistency. But what about an AI agent that:

Charges a customer the wrong amount
Deletes production data
Sends 1,000 emails to the wrong list
Makes an irreversible API call

There's no Ctrl+Z. You're left manually reversing actions, apologizing to customers, and trying to piece together what the LLM was "thinking" from scattered logs.

I learned this the hard way while building automation for e-commerce systems. An agent processed refunds slightly wrong, and we spent hours manually correcting transactions. That's when I realized: AI agents need the same safety guarantees as databases.

What Production Agents Actually Need

After deploying agents in real systems, here's what I found you absolutely need:

1. Automatic Rollback (Like Database Transactions)

When something fails, every previous step should automatically reverse. Not with manual cleanup scripts. Not with post-incident fire drills. Automatically.

# If step 3 fails, steps 2 and 1 automatically roll back.
# cancel_hotel_booking and refund_customer are compensating tools, assumed to be defined above.
@tool(compensate_with=cancel_hotel_booking)
def book_hotel(hotel_id: str, dates: dict):
    return hotel_api.book(hotel_id, dates)

@tool(compensate_with=refund_customer)  
def charge_customer(amount: float, customer_id: str):
    return payment_api.charge(amount, customer_id)

This is called the Saga pattern (or compensating transactions). It's battle-tested in distributed systems. But nobody's applying it to AI agents.

2. Human Approval for High-Risk Actions

Some operations are too risky to let an LLM decide alone:

@tool(requires_approval=True)
def charge_credit_card(amount: float, card_id: str):
    """Requires human approval before executing"""
    return payment_api.charge(amount, card_id)

One decorator. That's it. The framework handles the approval flow, shows you full context, and waits for your decision.
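
To make the idea concrete, here is a minimal sketch of what an approval gate could do under the hood. It is not AgentHelm's implementation: `requires_approval_above`, `ApprovalDenied`, and the `manager` callback are hypothetical, and the $100 threshold and $500 approval limit are chosen only for the example.

```python
from functools import wraps
from typing import Callable


class ApprovalDenied(Exception):
    """Raised when the approver rejects the action."""


def requires_approval_above(threshold: float, approver: Callable[[str, dict], bool]):
    """Gate calls whose 'amount' keyword exceeds threshold behind an approver callback."""
    def decorator(func):
        @wraps(func)
        def wrapper(**kwargs):
            if kwargs.get("amount", 0) > threshold:
                # Show the approver the full call context before anything executes.
                if not approver(func.__name__, kwargs):
                    raise ApprovalDenied(f"{func.__name__} rejected: {kwargs}")
            return func(**kwargs)
        return wrapper
    return decorator


audit_log = []


def manager(tool_name: str, context: dict) -> bool:
    # Stand-in for a real approval UI: approve up to $500 and record the decision.
    approved = context["amount"] <= 500
    audit_log.append({"tool": tool_name, "context": context, "approved": approved})
    return approved


@requires_approval_above(100.0, approver=manager)
def process_refund(order_id: str, amount: float) -> dict:
    return {"order_id": order_id, "refunded": amount}


small = process_refund(order_id="A1", amount=40.0)   # below threshold: no approval
large = process_refund(order_id="A2", amount=250.0)  # approved by manager
try:
    process_refund(order_id="A3", amount=900.0)      # rejected
    rejected = False
except ApprovalDenied:
    rejected = True
```

The key property is ordering: the approver sees the tool name and arguments before the side effect runs, and every decision lands in an audit record.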

3. Forensic-Grade Observability

When an agent fails (and they will), you need to answer:

Which LLM call made the wrong decision?
What was the exact input that triggered the error?
How much did this cost in API tokens?
What did the agent do before failing?

You need this information in 60 seconds, not 60 minutes of log spelunking.

# Find the bug in seconds, not hours
agenthelm traces show <trace_id>
agenthelm traces filter --status failed --tool-name process_payment
agenthelm traces export --output incident_report.csv

How I Built AgentHelm

I started with a simple question: What would a database-style transaction look like for an AI agent?

The Core Idea: Compensating Transactions

For every tool that modifies state, you define a compensating action:

from agenthelm import tool

# Defined first so process_refund can reference it in its decorator
@tool()
def reverse_refund(order_id: str, amount: float):
    """Rollback: re-charge if something fails"""
    return payment_api.charge(order_id, amount)

@tool(compensate_with=reverse_refund)
def process_refund(order_id: str, amount: float):
    """Issue a refund to customer"""
    return payment_api.refund(order_id, amount)

When the workflow executes:

Agent calls process_refund → succeeds ✓
Agent calls next tool → fails ✗
Framework automatically calls reverse_refund ↩️
System returns to consistent state ✓

Real Example: Customer Refund Agent

Here's a realistic scenario—processing customer refunds with validation, approval, and notifications:

# Compensating tools come first: decorator arguments are evaluated at definition
# time, so each compensator must exist before the tool that references it.
# (log_verification_failure and the *_api clients are assumed to be defined elsewhere.)
@tool()
def reverse_refund(order_id: str, amount: float) -> dict:
    return payments_api.charge(order_id, amount)

@tool()
def send_cancellation_email(customer_email: str, amount: float):
    return email_api.send(customer_email, "Refund was reversed due to error")

# Step 1: Verify order exists
@tool(compensate_with=log_verification_failure)
def verify_order(order_id: str) -> dict:
    return orders_api.verify(order_id)

# Step 2: Process refund (requires approval if >$100)
@tool(requires_approval=True, compensate_with=reverse_refund)
def process_refund(order_id: str, amount: float) -> dict:
    return payments_api.refund(order_id, amount)

# Step 3: Notify customer
@tool(compensate_with=send_cancellation_email)
def send_refund_email(customer_email: str, amount: float):
    return email_api.send(customer_email, f"Refund of ${amount} processed")

What AgentHelm guarantees:

If verification fails → nothing happens (atomic)
If refund >$100 → blocks for approval before executing
If refund succeeds but email fails → refund is reversed, customer is re-charged
Every step is logged with full LLM reasoning for compliance

The agent either completes all steps or none. Just like a database transaction.

The Architecture

Agent Task
    ↓
Execute Step 1 → Success
    ↓
Execute Step 2 → Success  
    ↓
Execute Step 3 → FAILURE
    ↓
Compensate Step 2 (automatic)
    ↓
Compensate Step 1 (automatic)
    ↓
Return to Consistent State

The orchestrator tracks every step and its compensation function. On failure, it executes compensations in reverse order to safely unwind the workflow.
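
The unwind logic described above can be sketched in a few lines. This is a simplified illustration of the Saga pattern, not AgentHelm's orchestrator: `run_saga`, the `Step` tuple, and the toy refund functions are hypothetical, and a real implementation would also need to handle a compensation that itself fails.

```python
from typing import Callable, List, Optional, Tuple

# (name, action, optional compensating action)
Step = Tuple[str, Callable[[], None], Optional[Callable[[], None]]]


def run_saga(steps: List[Step]) -> dict:
    """Run steps in order; on failure, run recorded compensations in reverse order."""
    done: List[Step] = []
    log: List[str] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"run:{name}")
            done.append(step)
        except Exception as exc:
            log.append(f"fail:{name}:{exc}")
            for prev_name, _, compensate in reversed(done):
                if compensate is not None:
                    compensate()
                    log.append(f"undo:{prev_name}")
            return {"ok": False, "log": log}
    return {"ok": True, "log": log}


balance = {"refunded": 0.0}


def refund() -> None:
    balance["refunded"] += 50.0


def undo_refund() -> None:
    balance["refunded"] -= 50.0


def send_email() -> None:
    raise ConnectionError("email service down")


outcome = run_saga([
    ("verify_order", lambda: None, None),
    ("process_refund", refund, undo_refund),
    ("send_refund_email", send_email, None),
])
```

In this run the email step fails, the refund is compensated, and the read-only verification step is skipped during unwinding because it has no compensator.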

What I Learned Building This

1. Most Agent Failures Are Predictable

After analyzing dozens of agent failures, patterns emerged:

Rate limits: External APIs hit rate limits
Partial failures: Step 2 succeeds but step 3 fails
Bad LLM decisions: Model misunderstands instructions
Transient errors: Network timeouts, service unavailability

All of these can be handled with: retries, compensations, and structured traces.

2. Developers Want Boring Reliability Over Clever Features

Early feedback: developers didn't ask for more LLM providers or fancier orchestration. They asked:

"Can I see exactly why it failed?"
"Can I prevent it from making this mistake again?"
"Can I undo what it did?"

Production engineers want boring, predictable, debuggable systems.

3. The Trace CLI Is Surprisingly Powerful

Being able to run agenthelm traces filter --status failed and see every failure with full context is a game-changer for debugging. It's like having a time-travel debugger for your agent.

The SQLite storage backend means you can use standard SQL tools to analyze patterns:

SELECT tool_name, COUNT(*) as failures 
FROM traces 
WHERE status = 'failed' 
GROUP BY tool_name 
ORDER BY failures DESC;
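
The same query can be run from Python with the standard library's `sqlite3` module. The sketch below builds a throwaway in-memory table containing only the two columns the query relies on; the real trace database's schema and file name are not shown here and may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema: only the two columns the query above relies on.
conn.execute("CREATE TABLE traces (tool_name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO traces (tool_name, status) VALUES (?, ?)",
    [
        ("send_refund_email", "failed"),
        ("send_refund_email", "failed"),
        ("process_refund", "failed"),
        ("verify_order", "success"),
    ],
)

rows = conn.execute(
    """
    SELECT tool_name, COUNT(*) AS failures
    FROM traces
    WHERE status = 'failed'
    GROUP BY tool_name
    ORDER BY failures DESC
    """
).fetchall()
conn.close()
```

Each row is a `(tool_name, failures)` tuple, ordered from the most failure-prone tool down.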

What's Next

I'm working on v0.3.0 with:

Plan-driven execution: LLM generates a plan, you approve it, then execution is deterministic
Enhanced safety: Budget enforcement (token limits, time limits, I/O constraints)
Multi-LLM support: Adding Claude and Gemini (both excel at tool use)

But I'm not shipping any of this until I validate that transactional safety is actually what production teams need.

Try It Yourself

If you're deploying agents that touch real systems—payments, databases, APIs—I'd love your feedback.

Install:

pip install agenthelm

Quick example:

from agenthelm import tool

# Define the compensating tool first so risky_action can reference it
@tool()
def undo_action(data: dict):
    return api.revert(data)

@tool(compensate_with=undo_action)
def risky_action(data: dict):
    return api.modify(data)

Run:

agenthelm run --agent-file agent.py --task "Your task here"

GitHub: github.com/hadywalied/agenthelm

Docs: hadywalied.github.io/agenthelm

The Bigger Picture

AI agents are moving from demos to production. But most frameworks are still optimized for prototyping, not reliability.

We need to bring software engineering discipline to AI agents:

Transactional guarantees
Structured observability
Human oversight for high-risk actions
Reproducible execution

AgentHelm is my attempt at this. It's not the most feature-rich framework. It's the framework designed to prevent 3 AM production incidents.

If you're building production agents, I want to hear from you:

What breaks in production?
What's missing from existing frameworks?
What would make you trust an agent with real business logic?

Drop a comment or open an issue on GitHub. Let's figure this out together.

Announcing AgentHelm v0.2.0: Enhanced Observability and Developer Experience

Hady Walied — Mon, 03 Nov 2025 22:43:29 +0000

Have you ever deployed an AI agent only to wonder what it's actually doing? Or struggled to debug why it made a particular decision? You're not alone.

We are excited to announce the release of AgentHelm v0.2.0! This release tackles the observability challenge head-on with powerful new features to help you build, debug, and monitor your agents with confidence.

What's New in v0.2.0?

This release brings a host of new features designed to give you deeper insights into your agent's execution and streamline your development workflow.

🚀 Storage Abstraction Layer

We've introduced a flexible storage abstraction layer that allows you to choose how your agent's execution traces are persisted. Whether you prefer simple JSON files for local development or a more robust SQLite database for efficient querying, AgentHelm has you covered.

JsonStorage: The default, file-based storage for traces.
SqliteStorage: A new backend that stores traces in a SQLite database, enabling powerful querying capabilities.

# Using JSON storage (default)
from orchestrator.core.storage import JsonStorage
storage = JsonStorage("my_traces.json")

# Or switch to SQLite for better querying
from orchestrator.core.storage import SqliteStorage
storage = SqliteStorage("my_traces.db")

🔍 CLI Trace Explorer

Debugging and analyzing agent behavior is now easier than ever with the new agenthelm traces command. This powerful CLI tool allows you to inspect, filter, and export your agent's execution traces directly from the command line.

Here's a quick look at what you can do:

List Traces: Get a quick overview of recent traces.

```
$ agenthelm traces list --trace-file examples/observability_example/observability_trace.db

+------+---------------------+------------------+----------+----------------------+
|   ID | Timestamp           | Tool Name        | Status   |   Execution Time (s) |
+======+=====================+==================+==========+======================+
|    0 | 2025-11-03 21:23:00 | undo_create_task | SUCCESS  |                    0 |
+------+---------------------+------------------+----------+----------------------+
|    1 | 2025-11-03 21:23:00 | create_task      | SUCCESS  |                    0 |
+------+---------------------+------------------+----------+----------------------+
|    2 | 2025-11-03 21:23:00 | search_web       | SUCCESS  |                    0 |
+------+---------------------+------------------+----------+----------------------+
|    3 | 2025-11-03 21:23:00 | get_current_time | SUCCESS  |                    0 |
+------+---------------------+------------------+----------+----------------------+
```

Show Trace Details: Dive deep into a specific trace to understand the agent's reasoning and actions.
```
$ agenthelm traces show 1 --trace-file examples/observability_example/observability_trace.db
```
Filter Traces: Find exactly what you're looking for with powerful filtering options.
```
$ agenthelm traces filter --tool-name create_task --status SUCCESS
```
Export Traces: Export your traces to JSON, CSV, or Markdown for further analysis or reporting.
```
$ agenthelm traces export --output failed_traces.csv --format csv --status FAILED
```

✨ New Observability Example

To help you get started with these new features, we've added a new example in examples/observability_example/. This example demonstrates how to use the new storage backends and provides a hands-on way to explore the CLI trace explorer.

Check out the observability_agent.py file to see how you can easily switch between JsonStorage and SqliteStorage and generate traces for different scenarios.

# from examples/observability_example/observability_agent.py

def run_example_agent(storage_type: str = "json"):
    logging.info(f"\n--- Running AgentHelm Observability Example with {storage_type.upper()} Storage ---")

    # 1. Setup Storage
    if storage_type == "json":
        trace_file = "observability_trace.json"
        # Clean up previous trace file if it exists
        if os.path.exists(trace_file):
            os.remove(trace_file)
        storage = JsonStorage(trace_file)
    elif storage_type == "sqlite":
        trace_file = "observability_trace.db"
        # Clean up previous trace file if it exists
        if os.path.exists(trace_file):
            os.remove(trace_file)
        storage = SqliteStorage(trace_file)
    else:
        logging.error("Invalid storage type. Use 'json' or 'sqlite'.")
        return

    # ... rest of the agent setup and execution

Why SQLite?

For production deployments with thousands of traces:

JSON: ⏱️ ~2.5s to filter 10,000 traces
SQLite: ⚡ ~0.05s to filter 10,000 traces (50x faster!)

Plus, SQLite enables powerful queries like finding all failed traces with low confidence scores in under a second.

Upgrading from v0.1.0

Good news! v0.2.0 is fully backward compatible. Upgrading is a single pip command, shown in the next section.

Get Started Today!

We believe these new observability features will significantly improve your experience building and deploying reliable AI agents. To get started with AgentHelm v0.2.0, simply upgrade your installation:

pip install --upgrade agenthelm

What's Next?

This is just the beginning! We're already working on:

🌐 OpenTelemetry integration for distributed tracing
🔌 Model Context Protocol (MCP) support
📊 Web-based trace visualization dashboard
🎯 Advanced constraint validation system

Want to influence the roadmap? Join the discussion on GitHub!

Join the AgentHelm Community

Built something cool with AgentHelm? We'd love to hear about it! Share your story in our Show and Tell section.

Building a Production-Ready Refund Agent That Won't Break Your Business

Hady Walied — Mon, 27 Oct 2025 01:58:31 +0000

AI agents can automate business processes. But most demos ignore a critical question: What happens when something goes wrong mid-workflow?

Consider a customer refund process:

Process refund via payment gateway → succeeds
Send confirmation email → fails

Your customer now has money refunded but no notification. Your support team has no record of what happened. Your compliance team can't audit the decision.

This is the production reality that demos skip over.

Today I'm showing you how to build a customer refund agent that handles these failure modes correctly using AgentHelm, an open-source framework I built for production-ready agent orchestration.

What Makes a Refund Agent "Production-Ready"?

A toy demo refund agent calls a few tools and returns a result. A production refund agent needs:

Transactional safety: If step 3 fails, steps 1 and 2 are automatically undone
Human approval: High-value refunds require manager sign-off
Audit trails: Every decision is logged for compliance
Error recovery: Failures don't leave the system in an inconsistent state

Most agent frameworks (LangChain, AutoGPT) handle the first part—calling tools. None of them handle the second part—making it safe for production.

That's what AgentHelm solves.

The Refund Workflow

Here's what our agent needs to do:

1. Verify order is eligible for refund
2. Check customer account status
3. Validate refund amount (requires approval if >$100)
4. Process refund via payment gateway
5. Send confirmation email
6. Log audit record

The critical part: If step 5 (email) fails, we need to automatically undo step 4 (refund). We can't leave a customer refunded without notification.

Building the Agent: Tool by Tool

Step 1: Define the Refund Tool with Rollback

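import logging
from datetime import datetime

from agenthelm.orchestrator.core.tool import tool

logger = logging.getLogger(__name__)

# order_db, customer_db, refund_db, payment_processor, and email_service are the
# example's backing services; they are assumed to be defined elsewhere.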

@tool(requires_approval=True, compensating_tool="reverse_refund_transaction")
def process_refund(order_id: str, customer_id: str, refund_amount: float, reason: str) -> dict:
    """Process a refund for an order.
    This is the main action that processes the payment refund.
    Requires approval if the refund amount exceeds $100.
    """
    logger.info(f"Processing refund of ${refund_amount:.2f} for order {order_id}")

    # Get order and customer details
    order = order_db.get_order(order_id)
    customer = customer_db.get_customer(customer_id)

    if not order or not customer:
        return {
            "success": False,
            "error": "Invalid order or customer"
        }

    # Process the refund through the payment processor
    payment_result = payment_processor.process_refund(
        order_id=order_id,
        amount=refund_amount,
        payment_method=order["payment_method"]
    )

    if not payment_result["success"]:
        return {
            "success": False,
            "error": f"Payment processing failed: {payment_result.get('error', 'Unknown error')}"
        }

    # Create refund record
    refund_data = {
        "order_id": order_id,
        "customer_id": customer_id,
        "amount": refund_amount,
        "reason": reason,
        "payment_transaction": payment_result,
        "status": "completed"
    }

    refund_id = refund_db.create_refund(refund_data)

    if not refund_id:
        # If refund record creation fails, we should reverse the payment transaction
        payment_processor.reverse_transaction(payment_result["transaction_id"])
        return {
            "success": False,
            "error": "Failed to create refund record"
        }

    # Update customer's refund history
    customer["refund_history"].append({
        "order_id": order_id,
        "amount": refund_amount,
        "date": datetime.now().isoformat(),
        "reason": reason,
        "refund_id": refund_id
    })

    customer_db.update_customer(customer_id, customer)

    # Update order status
    order["refund_status"] = "refunded"
    order["refund_amount"] = refund_amount
    order["refund_date"] = datetime.now().isoformat()

    order_db.update_order(order_id, order)

    return {
        "success": True,
        "refund_id": refund_id,
        "transaction_id": payment_result["transaction_id"],
        "customer_email": customer["email"]
    }

@tool()
def reverse_refund_transaction(transaction_id: str) -> dict:
    """
    Compensating action for process_refund.
    Reverses a refund transaction if something goes wrong after the payment processing.
    """
    logger.info(f"Reversing refund transaction {transaction_id}")

    result = payment_processor.reverse_transaction(transaction_id)

    if not result:
        return {
            "success": False,
            "error": "Failed to reverse transaction"
        }

    return {
        "success": True,
        "transaction_reversed": True,
        "transaction_id": transaction_id
    }

Key features:

requires_approval=True → Agent pauses and asks for human confirmation
compensating_tool="reverse_refund_transaction" → Automatic rollback if later steps fail

Step 2: Add the Notification Tool (Also with Rollback)

@tool(retries=2, compensating_tool="send_correction_email")
def send_refund_confirmation(customer_email: str, order_id: str, refund_amount: float, refund_id: str) -> dict:
    """Send a confirmation email to the customer about their refund."""
    logger.info(f"Sending refund confirmation email to {customer_email}")

    subject = f"Your Refund for Order {order_id} Has Been Processed"

    body = f"""
Dear Customer,

We're writing to confirm that your refund for Order {order_id} has been processed.

Refund Details:
- Refund ID: {refund_id}
- Amount: ${refund_amount:.2f}
- Date: {datetime.now().strftime('%Y-%m-%d')}

The refund has been issued to your original payment method. Please allow 3-5 business days for the funds to appear in your account.

If you have any questions about this refund, please contact our customer service team and reference your Refund ID.

Thank you for your business.

Best regards,
The Customer Service Team
"""

    email_result = email_service.send_email(customer_email, subject, body)

    if not email_result:
        return {
            "success": False,
            "error": "Failed to send confirmation email"
        }

    return {
        "success": True,
        "email_sent": True,
        "recipient": customer_email
    }

@tool()
def send_correction_email(customer_email: str, order_id: str) -> dict:
    """
    Compensating action for send_refund_confirmation.
    Sends a correction email if the original confirmation had issues.
    """
    logger.info(f"Sending correction email to {customer_email}")

    subject = f"Important Update About Your Refund for Order {order_id}"

    body = f"""
Dear Customer,

We're writing to inform you about an important update regarding your recent refund for Order {order_id}.

There was a technical issue with our refund processing system. Our team is working to resolve this issue as quickly as possible.

Please disregard any previous communication about this refund. We will send you a new confirmation once the refund has been properly processed.

We apologize for any inconvenience this may cause.

If you have any questions, please contact our customer service team.

Best regards,
The Customer Service Team
"""

    email_result = email_service.send_email(customer_email, subject, body)

    if not email_result:
        return {
            "success": False,
            "error": "Failed to send correction email"
        }

    return {
        "success": True,
        "email_sent": True,
        "recipient": customer_email
    }

Key features:

retries=2 → Automatically retries if the email service has a transient failure
Compensator sends a "we had an issue" email if rollback happens

Step 3: Add Supporting Tools

@tool()
def verify_order_status(order_id: str) -> dict:
    """Verify that an order exists and is eligible for refund."""
    order = order_db.get_order(order_id)

    if not order:
        return {"success": False, "error": f"Order {order_id} not found"}

    if order["status"] not in ["delivered", "shipped"]:
        return {
            "success": False, 
            "error": f"Order {order_id} has status '{order['status']}' which is not eligible for refund"
        }

    return {"success": True, "order_details": order}

@tool()
def verify_customer_eligibility(customer_id: str) -> dict:
    """Verify that a customer is eligible for refunds."""
    customer = customer_db.get_customer(customer_id)

    if not customer:
        return {"success": False, "error": f"Customer {customer_id} not found"}

    if customer["account_status"] != "active":
        return {
            "success": False,
            "error": f"Customer {customer_id} has account status '{customer['account_status']}' which is not eligible for refund"
        }

    return {"success": True, "customer_details": customer}

@tool()
def validate_refund_amount(order_id: str, refund_amount: float) -> dict:
    """Validate that the refund amount is valid for the given order."""
    order = order_db.get_order(order_id)

    if not order:
        return {
            "success": False,
            "error": f"Order {order_id} not found"
        }

    if refund_amount <= 0:
        return {"success": False, "error": "Refund amount must be greater than zero"}

    if refund_amount > order["total_amount"]:
        return {
            "success": False,
            "error": f"Refund amount ${refund_amount:.2f} exceeds order total ${order['total_amount']:.2f}"
        }

    return {
        "success": True,
        "valid_amount": True,
        "requires_approval": refund_amount > 100,
        "order_total": order["total_amount"]
    }

@tool()
def log_audit_record(action: str, details: dict) -> dict:
    """Log action for compliance."""
    audit_record = {
        "timestamp": datetime.now().isoformat(),
        "action": action,
        "details": details
    }

    # For simulation, we read the list, append, and write back.
    try:
        with open("audit_log.json", 'r') as f:
            log = json.load(f)
    except FileNotFoundError:
        log = []

    log.append(audit_record)

    with open("audit_log.json", 'w') as f:
        json.dump(log, f, indent=2)

    return {"logged": True, "action": action}

Step 4: Create the Agent

import os

from agenthelm.orchestrator.core.storage import FileStorage
from agenthelm.orchestrator.core.tracer import ExecutionTracer
from agenthelm.orchestrator.agent import Agent
from agenthelm.orchestrator.llm.mistral_client import MistralClient
from agenthelm.orchestrator.core.handlers import ApprovalHandler

# Setup storage and tracer
storage = FileStorage('refund_agent_trace.json')
tracer = ExecutionTracer(storage)

# Get API key from environment variable
api_key = os.environ.get("MISTRAL_API_KEY")
if not api_key:
    raise ValueError("MISTRAL_API_KEY environment variable not set.")

# Initialize LLM client
client = MistralClient(model_name="mistral-small-latest", api_key=api_key)

# Define the list of tools for the agent
agent_tools = [
        verify_order_status,
        verify_customer_eligibility,
        validate_refund_amount,
        process_refund,
        send_refund_confirmation,
        reverse_refund_transaction,
        send_correction_email,
        log_audit_record
]

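# Set up approval handler for tools that require approval
# (EmailApprovalHandler is not among the imports above; it is assumed to be
# a custom ApprovalHandler subclass defined elsewhere in the example)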
approval_handler = EmailApprovalHandler()
tracer.approval_handler = approval_handler

# Instantiate the Agent
agent = Agent(tools=agent_tools, tracer=tracer, client=client)

Step 5: Run It

result = agent.run(
    "Process a $450 refund for order ORD-1001, customer CUST-001. "
    "Reason: product not as described. Customer email: customer@example.com"
)

How the Agent Thinks: The ReAct Framework

Running the agent is simple, but how does it decide what to do? It uses a reasoning process called ReAct (for Reason and Act).

At each step, the agent doesn't just blindly pick a tool. Instead, it follows a think-act loop:

Reason: The LLM first thinks about the overall goal, what it has done so far, and what the next logical step should be. It generates a short internal monologue explaining its reasoning.
Act: Based on its reasoning, it chooses a tool and executes it.

This reasoning is not hidden. AgentHelm captures it in the trace file, giving you an incredible tool for debugging and understanding the agent's behavior.

For example, here is the agent's first thought when given the refund task:

Agent's Thought: To process a refund, we first need to verify the customer's eligibility. This is a prerequisite step before processing the refund.

Based on this thought, it correctly chooses the verify_customer_eligibility tool. After that tool succeeds, the agent thinks again:

Agent's Thought: The customer eligibility has been verified. The next step is to verify the order status to ensure that the order is eligible for a refund.

And so it calls verify_order_status. This step-by-step reasoning makes the agent's behavior predictable and auditable.
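
The think-act loop described above can be sketched in a few lines. This is a minimal illustration, not AgentHelm's implementation: `fake_llm`, `TOOLS`, `react_loop`, and the trace format are hypothetical stand-ins, and a scripted function plays the role of the LLM so the example runs offline.

```python
from typing import Callable

# Hypothetical tools; real ones would query databases and payment APIs.
def verify_customer_eligibility(customer_id: str) -> dict:
    return {"success": True, "customer_id": customer_id}

def verify_order_status(order_id: str) -> dict:
    return {"success": True, "order_id": order_id}

TOOLS: dict[str, Callable[..., dict]] = {
    "verify_customer_eligibility": verify_customer_eligibility,
    "verify_order_status": verify_order_status,
}

def fake_llm(history: list[dict]) -> dict:
    """Scripted stand-in for the LLM: returns a thought plus the next action."""
    script = [
        {"thought": "First verify the customer's eligibility.",
         "tool": "verify_customer_eligibility",
         "args": {"customer_id": "CUST-001"}},
        {"thought": "Customer verified; now check the order status.",
         "tool": "verify_order_status",
         "args": {"order_id": "ORD-1001"}},
        {"thought": "All checks passed; nothing left to do.",
         "tool": None, "args": {}},
    ]
    return script[len(history)]

def react_loop(max_steps: int = 10) -> list[dict]:
    trace: list[dict] = []
    for _ in range(max_steps):
        decision = fake_llm(trace)                    # Reason
        if decision["tool"] is None:                  # the model decided it is done
            break
        result = TOOLS[decision["tool"]](**decision["args"])  # Act
        trace.append({**decision, "result": result})  # keep thought and outcome together
    return trace

trace = react_loop()
for step in trace:
    print(f"{step['thought']} -> {step['tool']}")
```

Storing the thought next to the tool call and its result is what makes the trace useful later: each action can be read alongside the reasoning that led to it.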

What Happens: Three Scenarios

Scenario 1: Happy Path (Everything Works)

Verifying order ORD-1001 status... ✓
Checking customer eligibility... ✓
Validating refund amount $450... ✓

Approval Required for Tool: process_refund
   Order ID: ORD-1001
   Amount: $450.00
   Reason: product not as described

   Do you approve this action? [y/N]: y

Processing $450 refund for order ORD-1001... ✓
Sending confirmation to customer@example.com... ✓
AUDIT: refund_completed

Refund processed successfully

Key point: The agent paused and asked for approval because the amount was >$100.

Scenario 2: Email Fails → Automatic Rollback

Verifying order ORD-1001 status... ✓
Checking customer eligibility... ✓
Validating refund amount $450... ✓

Approval Required for Tool: process_refund
Do you approve this action? [y/N]: y

Processing $450 refund for order ORD-1001... ✓
Transaction ID: TXN-12345

Sending confirmation to customer@example.com... ✗ FAILED
Error: Email service unavailable

Workflow failed at step: send_refund_confirmation
Triggering compensating actions...

Calling reverse_refund_transaction(transaction_id='TXN-12345')
Transaction TXN-12345 reversed

Calling send_correction_email(customer_email='customer@example.com')
Correction notice sent

Refund reversed due to email delivery failure

This is the killer feature. The refund was processed, then automatically reversed when the email failed. The system is never in an inconsistent state.

Scenario 3: Approval Denied

Approval Required for Tool: process_refund
Order ID: ORD-1001
Amount: $450.00

Do you approve this action? [y/N]: n

Approval denied by user
Workflow terminated

The refund never happened. Human oversight prevented the transaction.

The Audit Trail

Every action is automatically logged. Here's what the trace file looks like:

[
    {
      "tool_name": "verify_order_status",
      "timestamp": "2025-10-28T10:15:30Z",
      "inputs": {"order_id": "ORD-1001"},
      "outputs": {"result": {"success": true, "order_details": {}}},
      "execution_time": 0.045,
      "error_state": null
    },
    {
      "tool_name": "process_refund",
      "timestamp": "2025-10-28T10:16:12Z",
      "inputs": {
        "order_id": "ORD-1001", 
        "refund_amount": 450.0,
        "reason": "product not as described"
      },
      "outputs": {
        "result": {
            "refund_id": "REF-1001",
            "transaction_id": "TXN-12345"
        }
      },
      "execution_time": 0.230,
      "error_state": null
    },
    {
      "tool_name": "reverse_refund_transaction",
      "timestamp": "2025-10-28T10:16:45Z",
      "inputs": {"transaction_id": "TXN-12345"},
      "outputs": {"result": {"status": "reversed"}},
      "execution_time": 0.180,
      "error_state": "Compensating action for failed step: send_refund_confirmation"
    }
]

For compliance teams, this is gold. You can prove:

Who approved what
When each action occurred
Why rollbacks happened
Exact inputs/outputs for every step
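
Because the trace is plain JSON, answering those questions takes only a few lines of code. The sketch below uses the field names from the excerpt above (`tool_name`, `timestamp`, `execution_time`, `error_state`); the helper functions are hypothetical, and the records are inlined in abbreviated form so the example is self-contained rather than reading `refund_agent_trace.json`.

```python
import json

# Abbreviated copy of the trace excerpt above.
TRACE_JSON = """
[
  {"tool_name": "verify_order_status", "timestamp": "2025-10-28T10:15:30Z",
   "execution_time": 0.045, "error_state": null},
  {"tool_name": "process_refund", "timestamp": "2025-10-28T10:16:12Z",
   "execution_time": 0.230, "error_state": null},
  {"tool_name": "reverse_refund_transaction", "timestamp": "2025-10-28T10:16:45Z",
   "execution_time": 0.180,
   "error_state": "Compensating action for failed step: send_refund_confirmation"}
]
"""

def flagged_steps(trace: list[dict]) -> list[dict]:
    """Return every step whose error_state is set (failures and compensations)."""
    return [step for step in trace if step["error_state"] is not None]

def total_execution_time(trace: list[dict]) -> float:
    """Sum the time spent inside tools, in seconds."""
    return sum(step["execution_time"] for step in trace)

trace = json.loads(TRACE_JSON)
for step in flagged_steps(trace):
    print(step["timestamp"], step["tool_name"], "->", step["error_state"])
print(f"Total tool time: {total_execution_time(trace):.3f}s")
```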

Lessons Learned Building This

1. Test Rollbacks in Development

Don't wait until production to find out your compensating actions don't work.

Solution: Add a flag to simulate failures:

@tool()
def send_refund_confirmation(customer_email: str, simulate_failure: bool = False):
    if simulate_failure:
        raise Exception("Simulated failure for testing")
    # Normal logic

Run your workflows with simulate_failure=True during development.

2. Separate Validation from Action

My first process_refund tool both validated the amount and processed it. This made approvals tricky.

Solution: A dedicated validate_refund_amount tool checks the business rules (e.g., the amount must not exceed the order total, and amounts over $100 require approval). The process_refund tool, which requires approval, can then focus only on the transaction. The agent uses the output of the validation to know when to proceed.

3. Atomic Tools Are Easier to Debug

My first version had a single process_refund_workflow tool that did everything. When it failed, I couldn't tell which step broke.

Solution: Each tool does ONE thing. Easier to test, debug, and reuse.

4. Retries Save You From Flaky APIs

Email services, payment gateways, and external APIs fail randomly. retries=2 saved me countless debugging sessions.

5. The Trace File is Your Best Friend

When something goes wrong in production, the trace file tells you EXACTLY what happened. Invest time making it readable and queryable.

Why This Matters

Most companies won't deploy AI agents because they can't trust them.

They've seen demos. They know agents CAN work. But they don't know:

What happens when the agent makes a mistake
How to audit agent decisions for compliance
How to prevent catastrophic failures

AgentHelm solves these problems by bringing distributed systems reliability patterns to AI agents:

Transactional semantics (like databases)
Structured observability (like OpenTelemetry)
Policy enforcement (like API gateways)

This isn't new technology. It's applying 20 years of production systems engineering to agents.

Try It Yourself

The full code for this refund agent is on GitHub:

Repo: https://github.com/hadywalied/agenthelm
Example: https://github.com/hadywalied/agenthelm/blob/main/examples/customer_refund_agent/refund_agent.py
Docs: https://hadywalied.github.io/agenthelm/

Install it:

pip install agenthelm

Run the example:

export MISTRAL_API_KEY='your_key_here'
cd examples/customer_refund_agent
python refund_agent.py

If you're deploying agents in production, I'd love your feedback. What's the biggest blocker you're facing: observability, safety, or reliability?

Open an issue on GitHub or comment below. Let's build reliable AI together.

Introducing AgentHelm: Production-Ready Orchestration for AI Agents

Hady Walied — Sat, 25 Oct 2025 23:16:36 +0000

The infrastructure layer that every production agent system needs but nobody's building.

A few months ago, I wrote a deep dive analyzing why most AI agent deployments fail. The thesis was simple: the bottleneck isn't model capability; it's orchestration.

Agents fail in ways that are catastrophically expensive and impossible to debug. They leave systems in inconsistent states. They make decisions you can't audit. They perform sensitive actions without approval mechanisms.

The response was overwhelming. CTOs, engineering leads, and developers building production agent systems all said the same thing: "Where's the framework that solves this?"

It didn't exist. So, I built it.

Today, I'm open-sourcing AgentHelm, a lightweight Python framework that brings production-grade orchestration to AI agents.

The Problem: Production Agents Need Infrastructure

Here's the disconnect: We have excellent frameworks for building agents (LangChain, LlamaIndex, AutoGPT). We have powerful models (GPT-4, Claude, Mistral). But we have nothing that makes agents safe to deploy at scale.

Try deploying an agent in an environment where:

A failed workflow costs real money
You need to explain to auditors why the agent made a specific decision
Compliance requires you to roll back transactions if any step fails
Certain actions (like deleting data or charging cards) need human approval

The existing frameworks can't handle this. They're optimized for prototyping and demos, not production reliability.

Most agent failures follow this pattern:

Agent calls Tool A (succeeds)
Agent calls Tool B (succeeds)
Agent calls Tool C (fails due to network timeout)
Your system is now in an inconsistent state
You have no structured logs showing what the agent was thinking
You spend hours manually debugging and fixing

This isn't theoretical. This is what every team deploying production agents hits immediately.

The Solution: Production-Grade Orchestration

AgentHelm provides the infrastructure layer that production agents require. It's not another agent-building framework—it's the orchestration harness that makes any agent reliable.

Core thesis: You should be able to deploy AI agents with the same confidence you deploy microservices.

1. Automatic Execution Tracing

Every tool call is automatically logged with structured data:

Inputs and outputs (sanitized for PII)
Execution time and timestamps
Success/failure state
The agent's reasoning (chain-of-thought)
Correlation IDs for distributed tracing

This gives you complete audit trails for compliance and debugging.

from agenthelm.orchestration.tool import tool

@tool
def charge_customer(amount: float, customer_id: str) -> dict:
    """Charge a customer's card via Stripe."""
    # Your payment logic here
    return {"transaction_id": "txn_123", "status": "success"}

When this executes, AgentHelm automatically creates a structured log:

{
  "tool": "charge_customer",
  "inputs": {"amount": 50.0, "customer_id": "cust_abc"},
  "output": {"transaction_id": "txn_123", "status": "success"},
  "execution_time_ms": 245,
  "timestamp": "2025-10-26T14:32:01Z"
}

No extra code. No manual logging. Just add the @tool decorator.
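
To make the idea concrete, here is a rough sketch of how a tracing decorator of this kind can work. It is not AgentHelm's source: `traced` and `TRACE_LOG` are hypothetical names, and the in-memory list stands in for whatever storage a real tracer would write to.

```python
import functools
import time
from datetime import datetime, timezone

TRACE_LOG: list[dict] = []  # hypothetical in-memory sink; a real system would persist this

def traced(func):
    """Record inputs, output, duration, and errors for every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {
            "tool": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        start = time.perf_counter()
        try:
            record["output"] = func(*args, **kwargs)
            record["error"] = None
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)  # failures are logged too, then re-raised
            raise
        finally:
            elapsed = time.perf_counter() - start
            record["execution_time_ms"] = round(elapsed * 1000, 3)
            TRACE_LOG.append(record)
    return wrapper

@traced
def charge_customer(amount: float, customer_id: str) -> dict:
    return {"transaction_id": "txn_123", "status": "success"}

charge_customer(50.0, customer_id="cust_abc")
print(TRACE_LOG[-1]["tool"], TRACE_LOG[-1]["output"])
```

The `finally` block is the important design choice: the record is appended whether the tool returns or raises, so failed calls never disappear from the log.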

2. Human-in-the-Loop Safety

For high-stakes operations, you can require manual confirmation before execution:

@tool(requires_approval=True)
def delete_user_data(user_id: str) -> dict:
    """Permanently delete all user data."""
    # Deletion logic here
    pass

When the agent attempts to call this tool, AgentHelm pauses and prompts for approval:

   Approval Required for Tool: delete_user_data
   User ID: user_12345

   Do you approve this action? [y/N]:

The workflow doesn't proceed until confirmed. No surprise deletions. No accidental charges.
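
An approval gate like this can be approximated with an ordinary decorator. The sketch below is an assumption about the general shape, not AgentHelm's code: `requires_approval`, `cli_approver`, and `ApprovalDenied` are hypothetical names. The approver is injectable so the demo can run without blocking on `input()`.

```python
import functools
from typing import Callable

class ApprovalDenied(Exception):
    """Raised when the approver rejects a tool call."""

def cli_approver(tool_name: str, kwargs: dict) -> bool:
    """Ask on the terminal; anything other than 'y' counts as a denial."""
    print(f"Approval Required for Tool: {tool_name}")
    for key, value in kwargs.items():
        print(f"   {key}: {value}")
    return input("   Do you approve this action? [y/N]: ").strip().lower() == "y"

def requires_approval(approver: Callable[[str, dict], bool] = cli_approver):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):  # keyword-only so the prompt can name each argument
            if not approver(func.__name__, kwargs):
                raise ApprovalDenied(f"{func.__name__} was not approved")
            return func(**kwargs)
        return wrapper
    return decorator

# Scripted approver for the demo: deny one specific user, approve everyone else.
@requires_approval(approver=lambda name, kwargs: kwargs["user_id"] != "user_12345")
def delete_user_data(user_id: str) -> dict:
    return {"deleted": user_id}

print(delete_user_data(user_id="user_999"))
try:
    delete_user_data(user_id="user_12345")
except ApprovalDenied as exc:
    print("Blocked:", exc)
```

Defaulting to denial (only an explicit "y" approves) matches the `[y/N]` prompt shown above: an empty or mistyped answer should never trigger a destructive action.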

3. Resilience Through Retries

APIs fail. Networks time out. AgentHelm handles this automatically:

@tool(retries=3, retry_delay=2.0)
def fetch_user_data(user_id: str) -> dict:
    """Fetch user data from external API."""
    # API call that might fail transiently
    pass

If the call fails, AgentHelm retries up to 3 times with exponential backoff. Transient failures no longer kill your workflows.
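
For readers who have not implemented this before, retry with exponential backoff fits in a small decorator. This is a generic sketch, not AgentHelm's internals: `with_retries` is a hypothetical name, and two things are assumptions on my part, namely that `retries=3` means three extra attempts after the first and that the delay doubles each time. The `sleep` parameter is injectable so the demo records delays instead of actually waiting.

```python
import functools
import time

def with_retries(retries: int = 3, retry_delay: float = 2.0, sleep=time.sleep):
    """Retry func up to `retries` extra times, doubling the delay after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = retry_delay
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == retries:  # out of retries: surface the error
                        raise
                    sleep(delay)
                    delay *= 2              # exponential backoff
        return wrapper
    return decorator

calls = {"count": 0}
waits: list[float] = []

# Record the delays in a list instead of sleeping, to keep the demo instant.
@with_retries(retries=3, retry_delay=2.0, sleep=waits.append)
def fetch_user_data(user_id: str) -> dict:
    calls["count"] += 1
    if calls["count"] < 3:  # fail twice, then succeed
        raise ConnectionError("transient failure")
    return {"user_id": user_id}

print(fetch_user_data("user_1"), calls["count"], waits)
```

In production you would usually retry only on exception types known to be transient (timeouts, connection resets) rather than on every `Exception`, so that genuine bugs fail fast.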

4. Transactional Rollbacks

The most critical feature: compensating transactions.

You can link any tool to a "compensating action" that reverses its effects:

@tool
def charge_customer(amount: float, customer_id: str) -> dict:
    # Charge the card
    return {"transaction_id": "txn_123"}

@tool
def refund_customer(transaction_id: str) -> dict:
    # Refund the transaction
    return {"status": "refunded"}

# Link them together
charge_customer.set_compensator(refund_customer)

Now, if your workflow is:

charge_customer() → succeeds
provision_server() → fails

AgentHelm automatically calls refund_customer() to undo the charge. Your system stays consistent.

This is transactional semantics for AI agents. It's what makes them safe for production.
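
Under the hood this is the saga pattern: remember how to undo each completed step, and on failure run the undo actions in reverse order. The sketch below shows the mechanism in isolation. The `Saga` class is hypothetical and much simpler than a real orchestrator; it is not how `set_compensator` is implemented.

```python
from typing import Callable, Optional

class Saga:
    """Run steps in order; if one fails, undo the completed ones in reverse."""

    def __init__(self) -> None:
        self._undo_stack: list[Callable[[], object]] = []
        self.log: list[str] = []

    def run(self, name: str, action: Callable[[], dict],
            compensator: Optional[Callable[[dict], object]] = None) -> dict:
        try:
            result = action()
        except Exception:
            self.log.append(f"{name} failed")
            self.rollback()
            raise
        self.log.append(f"{name} ok")
        if compensator is not None:
            # Capture the result so the compensator can see e.g. the transaction ID.
            self._undo_stack.append(lambda: compensator(result))
        return result

    def rollback(self) -> None:
        while self._undo_stack:
            self._undo_stack.pop()()  # last completed step is undone first

saga = Saga()

def charge_customer() -> dict:
    return {"transaction_id": "txn_123"}

def refund_customer(charge: dict) -> None:
    saga.log.append(f"refunded {charge['transaction_id']}")

def provision_server() -> dict:
    raise TimeoutError("provisioning timed out")

try:
    saga.run("charge_customer", charge_customer, compensator=refund_customer)
    saga.run("provision_server", provision_server)
except TimeoutError:
    pass
print(saga.log)
```

One caveat worth keeping in mind: compensators can fail too, so a production version needs its own retry and alerting path for rollbacks that do not complete.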

Getting Started in 60 Seconds

Install AgentHelm from PyPI:

pip install agenthelm

Define your tools with type hints (AgentHelm automatically generates contracts):

# my_tools.py
from agenthelm.orchestration.tool import tool

@tool(requires_approval=True)
def post_tweet(message: str) -> dict:
    """Post a message to Twitter."""
    print(f"POSTING: {message}")
    return {"status": "posted"}

Run your agent from the command line:

export MISTRAL_API_KEY='your_key_here'
agenthelm run my_tools.py "Post a tweet announcing AgentHelm!"

AgentHelm will:

Parse your request using the configured LLM
Identify the right tool to call
Pause and ask for your approval
Execute the tool
Log everything to cli_trace.json (database storage is planned for a future update).

That's it. No complex setup. No configuration files. Just reliable agent execution.

Why This Matters Now

The agent market is bifurcating:

Consumer agents (ChatGPT, Siri, Alexa) can tolerate failures because the stakes are low. Users just try again.

Enterprise agents require guarantees that existing frameworks don't provide:

Observability: Can you debug what went wrong?
Safety: Can you prevent catastrophic mistakes?
Compliance: Can you prove the agent followed policies?
Reliability: Can you trust the agent won't leave systems inconsistent?

AgentHelm is built specifically for the enterprise use case—agents where failure has consequences.

The Architecture Philosophy

I'm an optimization engineer working in electronics automation. In my domain, systems need to be observable, debuggable, and reliable. When I started working with AI agents, I was struck by how fragile they are compared to traditional distributed systems.

AgentHelm applies the lessons from decades of distributed systems engineering to the agent paradigm:

Structured logging (like OpenTelemetry)
Transactional semantics (like databases)
Circuit breakers and retries (like service meshes)
Policy enforcement (like API gateways)

These aren't new concepts. We just haven't applied them to agents yet.

What's Next

This is v0.1.0, the foundation. The roadmap includes:

Web-based Observability Dashboard: Visualize agent traces, compare failed vs. successful executions, identify failure patterns
Policy Engine: Define complex constraints that agents cannot violate
Multi-Agent Coordination: Enable multiple agents to collaborate with conflict resolution and resource locking

But I'm shipping the core functionality now because teams are deploying agents today and hitting these problems immediately.

This Is Open Source

AgentHelm is MIT-licensed and available today:

Install: pip install agenthelm
GitHub: https://github.com/hadywalied/agenthelm
Documentation: https://hadywalied.github.io/agenthelm/

I'd love your feedback, bug reports, and contributions. If you're deploying agents in production, I want to hear about your challenges.

Star us on GitHub if this solves a problem you're facing. Better yet—try it and tell me what breaks.

The Honest Pitch

If you're building toy projects or weekend demos, you don't need AgentHelm. Existing frameworks are great for prototyping.

But if you're deploying agents where failure has consequences, where you need audit trails, approval workflows, and transactional guarantees, AgentHelm is built for you.

We're not the most feature-rich framework. We're not the easiest to learn. But we're a framework designed from the ground up to make agents production-ready.

Try it. Break it. Tell me what's missing.

Let's build reliable AI together.

Hady Walied

Software Engineer


Why Most AI Agents Will Fail: The Orchestration Problem Nobody's Solving

Hady Walied — Thu, 23 Oct 2025 23:11:18 +0000

ChatGPT Atlas launched last week to predictable fanfare. Tech Twitter celebrated. LinkedIn exploded with hot takes about "the future of work." But everyone's missing the real story.

The question isn't whether AI agents can click buttons in your browser. They can. The question is: why hasn't anyone built a production-grade agent system that enterprises actually trust?

The answer reveals the hardest unsolved problem in AI agents—and it's not what you think.

The Capability Illusion

Here's what the hype cycle wants you to believe: agents are held back by model intelligence. Just wait for GPT-5.1, Claude 5, or whatever comes next, and suddenly agents will be reliable.

This is wrong.

Today's frontier models can already solve most agent tasks in isolation. The failure mode isn't "the model can't figure out what to do." It's that agents fail in ways that are catastrophically expensive and impossible to debug.

Consider a real scenario: You deploy an agent to handle customer refunds. It needs to:

Read the support ticket
Verify the purchase in your database
Check refund eligibility against your policy
Process the refund via Stripe
Send a confirmation email

On paper, this is trivial. In production, here's what actually happens:

The agent misreads ambiguous ticket language and processes a refund outside policy (cost: $200)
A Stripe API timeout occurs mid-transaction, but the agent doesn't implement retry logic (cost: angry customer, manual intervention)
The confirmation email sends before the refund completes due to race conditions (cost: trust erosion)
You have no audit trail showing why the agent made each decision (cost: compliance failure)

The core problem: agents are probabilistic systems operating in deterministic environments that require guarantees.

The Three Layers of Agent Failure

Most agent frameworks fail because they're solving the wrong problem. They optimize for "can it work?" instead of "can it work reliably at scale?"

Here's the actual architecture challenge:

Layer 1: Tool Execution (Everyone focuses here)

Can the agent call the right API with the right parameters? This is table stakes. Every framework from LangChain to AutoGPT handles this.

Layer 2: State Management (Some people think about this)

Can the agent maintain coherent state across a multi-step workflow? Can it recover from failures? Can it handle partial completions?

This is where most "demos" break. They work in happy-path scenarios and catastrophically fail when:

Network calls time out
APIs return unexpected responses
The agent encounters states not in its training data

Layer 3: Observability & Controllability (Almost nobody solves this)

Can you audit why the agent made each decision? Can you roll back a sequence of actions? Can you set hard constraints the agent cannot violate?

This is the real bottleneck. Enterprises won't deploy agents they can't observe, debug, and control.

Why ChatGPT Atlas Doesn't Matter (Yet)

ChatGPT Atlas is impressive UX. It's not impressive architecture.

It's a consumer product designed for low-stakes tasks: "Research this topic," "Fill out this form," "Summarize these emails." The failure mode is "the user redoes it manually." Annoying, but not existential.

This works because OpenAI can get away with:

No formal verification of agent behavior
No rollback mechanisms
No audit trails
No compliance guarantees

Try deploying this in healthcare, finance, or government. You can't. The liability profile is unacceptable.

The companies that win the agent era won't be the ones with the best chat interface. They'll be the ones who solve industrial-grade orchestration.

What Winning Looks Like: The Orchestration Stack

Here's what a production-grade agent system actually requires:

1. Verifiable Tool Chains

Every tool the agent uses needs formal contracts:

Inputs: typed, validated, with clear bounds
Outputs: typed, with error states explicitly modeled
Side effects: declaratively specified (this writes to DB, this charges money)

Think of it like static typing for agent behavior. You can't deploy code to production without type checking—why would you deploy agents without tool checking?
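
As a rough illustration of what such a contract could look like, the sketch below models inputs, outputs, and side effects with plain dataclasses. Every name in it (`ToolContract`, `SideEffect`, `RefundRequest`, `RefundResult`, `call_tool`) and the bounds it enforces are hypothetical choices for the example, not an existing framework's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class SideEffect(Enum):
    WRITES_DB = "writes_db"
    MOVES_MONEY = "moves_money"

@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount: float

    def __post_init__(self) -> None:
        # Inputs: typed, validated, with clear bounds (the limits here are made up).
        if not isinstance(self.order_id, str) or not self.order_id.startswith("ORD-"):
            raise ValueError(f"malformed order_id: {self.order_id!r}")
        if not 0 < self.amount <= 10_000:
            raise ValueError(f"amount out of bounds: {self.amount}")

@dataclass(frozen=True)
class RefundResult:
    # Outputs: the error state is part of the type.
    ok: bool
    refund_id: Optional[str] = None
    error: Optional[str] = None

@dataclass(frozen=True)
class ToolContract:
    name: str
    input_type: type
    output_type: type
    side_effects: tuple  # declared up front so a policy layer can inspect them

def call_tool(contract: ToolContract, func: Callable, **raw_args):
    request = contract.input_type(**raw_args)  # rejects bad input before any side effect
    result = func(request)
    if not isinstance(result, contract.output_type):
        raise TypeError(f"{contract.name} returned {type(result).__name__}")
    return result

def process_refund(request: RefundRequest) -> RefundResult:
    return RefundResult(ok=True, refund_id="REF-1001")

PROCESS_REFUND = ToolContract(
    "process_refund", RefundRequest, RefundResult, (SideEffect.MOVES_MONEY,)
)

print(call_tool(PROCESS_REFUND, process_refund, order_id="ORD-1001", amount=450.0))
```

The point of routing every call through `call_tool` is that a malformed argument from the model fails at the boundary, before the tool has touched a database or a payment API.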

2. Transactional Semantics

Agents need database-style guarantees:

Atomicity: Either the entire workflow completes or it's cleanly rolled back
Idempotency: Running the same workflow twice produces the same result
Consistency: Agents cannot leave systems in invalid states

This isn't theoretical. Stripe's API is built on these principles. If agents interact with systems of record, they need the same guarantees.
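
Idempotency is the easiest of the three to show in code. A common approach is to derive a key from the request and return the stored result when the same key shows up again. The sketch below assumes an in-memory dictionary as the store and uses hypothetical names (`idempotency_key`, `charge_once`); a real system would keep the keys in durable storage.

```python
import hashlib
import json

_completed: dict[str, dict] = {}        # hypothetical store; production needs a durable table
charges_made: list[tuple[str, int]] = []  # stands in for the real payment side effect

def idempotency_key(operation: str, params: dict) -> str:
    """Derive a stable key from the operation name and its parameters."""
    payload = json.dumps({"op": operation, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def charge_once(customer_id: str, amount_cents: int, request_id: str) -> dict:
    key = idempotency_key("charge", {
        "customer_id": customer_id,
        "amount_cents": amount_cents,
        "request_id": request_id,
    })
    if key in _completed:               # seen before: return the original result
        return _completed[key]
    charges_made.append((customer_id, amount_cents))
    result = {"transaction_id": f"txn_{len(charges_made)}", "status": "success"}
    _completed[key] = result
    return result

first = charge_once("cust_abc", 5000, request_id="req-1")
second = charge_once("cust_abc", 5000, request_id="req-1")  # e.g. a retried workflow
print(first == second, len(charges_made))
```

Including a caller-supplied `request_id` in the key matters: without it, two legitimate charges of the same amount to the same customer would be collapsed into one.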

3. Observability Infrastructure

Every agent decision needs to be:

Traceable: What tool did it call, with what inputs, and why?
Replayable: Can you reconstruct the exact sequence of decisions?
Explainable: Can a human understand the reasoning chain?

This is the hard part. LLMs are black boxes. But you can build observability around them:

Log every tool call with full context
Store the reasoning trace (chain-of-thought)
Implement decision provenance (why did the agent choose this action over alternatives?)

4. Constraint Systems

Agents need guardrails that cannot be bypassed:

Hard limits (max transaction value, allowed API endpoints)
Approval workflows (certain actions require human sign-off)
Escape hatches (abort if confidence drops below threshold)

This is where formal verification, sandboxing, and policy enforcement intersect.

The Real Technical Bottleneck

If you forced me to name the single biggest blocker to agent adoption, it's this:

We have no standard framework for agent observability.

Datadog, New Relic, and similar observability tools are built for deterministic systems. They can't handle:

Non-deterministic decision trees
Probabilistic reasoning chains
Context windows that change behavior

The company that builds "Datadog for Agents"—a system that makes agent behavior transparent, debuggable, and auditable—will be worth billions.

What Happens Next

The agent market will bifurcate:

Consumer agents (ChatGPT Atlas, Perplexity Comet) will proliferate. They'll be useful for low-stakes tasks. They'll fail often. Users will tolerate it because the cost of failure is low.

Enterprise agents will remain niche until someone solves orchestration. The first company to ship a production-grade agent framework with:

Verifiable tool execution
Transactional guarantees
Full observability
Constraint enforcement

...will capture the enterprise market entirely.

My bet? It won't be OpenAI or Anthropic. They're infrastructure providers. It'll be a new company that sits on top of LLMs and provides the orchestration layer—the way Stripe sits on top of payment processors.

The Question You Should Be Asking

Not "can agents do X?" but "how do I make agent failures survivable?"

Because agents will fail. The model will hallucinate. APIs will time out. Users will input garbage. The question is: can your system handle it gracefully?

That's the problem worth solving.

What's missing from this analysis? Where am I wrong? I'm specifically interested in technical counterarguments—not philosophical debates about whether agents will replace jobs. We're past that. The engineering problem is what matters now.

Building a Simple Modern RAG Application with Asyncio and Chainlit

Hady Walied — Mon, 13 Oct 2025 23:10:31 +0000

Welcome to the next chapter of our RAG journey! In the previous tutorial series, we built a complete command-line RAG system from the ground up. Today, we're taking that foundation and transforming it into a modern, user-friendly web application using Chainlit and asyncio.

Why This Evolution Matters

Our original total_rag_app.py was powerful but limited to command-line interactions. The new chainlit_app.py brings three game-changing improvements:

Web-Based UI: A beautiful, chat-like interface that anyone can use
Dynamic Document Upload: Users can upload their own PDFs on the fly
Asynchronous Processing: Blazing-fast performance through concurrent operations

Full Implementation: Github Repo

The Power of Asyncio in RAG Systems

The biggest technical leap in this update is the shift from synchronous to asynchronous code. Let's understand why this matters.

The Problem with Blocking Operations

In our original RAG pipeline, operations happened sequentially:

Wait for the keyword search to complete
Wait for the semantic search to complete
Wait for re-ranking to complete
Wait for LLM to generate a response

Total time = Sum of all operations

The Asyncio Solution

With async/await, we can run independent operations concurrently:

Keyword search and semantic search happen simultaneously
While we wait for the LLM, the UI remains responsive
Streaming responses appear token-by-token in real-time

Total time ≈ Longest operation + coordination overhead
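
The difference between the two formulas is easy to measure. In this small sketch, `fake_search` is a hypothetical stand-in that just sleeps, with durations chosen arbitrarily; the only real API involved is `asyncio.gather`, which runs the awaitables concurrently.

```python
import asyncio
import time

async def fake_search(name: str, seconds: float) -> str:
    """Stand-in for an I/O-bound search call."""
    await asyncio.sleep(seconds)
    return name

async def sequential() -> float:
    start = time.perf_counter()
    await fake_search("keyword", 0.2)
    await fake_search("semantic", 0.3)   # starts only after the first finishes
    return time.perf_counter() - start

async def concurrent() -> float:
    start = time.perf_counter()
    await asyncio.gather(                # both searches are in flight together
        fake_search("keyword", 0.2),
        fake_search("semantic", 0.3),
    )
    return time.perf_counter() - start

seq_time = asyncio.run(sequential())
con_time = asyncio.run(concurrent())
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential version takes roughly the sum of the two sleeps, while the concurrent version takes roughly as long as the slower one.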

Key Architectural Changes

1. Async Retrieval

The Retriever class now features an aretrieve() method:

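async def aretrieve(self, query: str, top_n_hybrid=10) -> List[dict]:
    # Start both searches together: BM25 (CPU-bound) runs in a worker thread,
    # while the vector query uses LangChain's native async API
    keyword_docs, semantic_docs_langchain = await asyncio.gather(
        asyncio.to_thread(self._get_keyword_docs, query, top_n_hybrid),
        self.semantic_retriever.ainvoke(query),
    )

    # Convert LangChain Documents to the dict shape of the keyword results
    # (the 'metadata' key is an assumption; only 'content' is used below)
    semantic_docs = [
        {"content": doc.page_content, "metadata": doc.metadata}
        for doc in semantic_docs_langchain
    ]

    # Fast: merge results, de-duplicating on content
    combined_docs = {doc['content']: doc for doc in keyword_docs + semantic_docs}.values()
    return list(combined_docs)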

What's happening here?

asyncio.to_thread() runs the CPU-intensive BM25 calculation in a background thread
ainvoke() leverages LangChain's native async support for the vector database query
Both operations run concurrently, cutting retrieval time nearly in half

2. Async Re-Ranking

Similarly, the cross-encoder prediction (a heavy ML operation) runs in the background:

async def arerank(self, query: str, docs: List[dict], top_n=3) -> List[dict]:
    pairs = [[query, doc['content']] for doc in docs]

    # Offload the blocking model prediction
    scores = await asyncio.to_thread(
        self.cross_encoder.predict, pairs
    )

    scored_docs = list(zip(scores, docs))
    scored_docs.sort(key=lambda x: x[0], reverse=True)
    return [doc for score, doc in scored_docs[:top_n]]

3. Streaming Responses

The crown jewel is astream_answer(), which yields LLM tokens as they're generated:

async def astream_answer(self, query: str) -> AsyncIterator[str]:
    retrieved_docs = await self.retriever.aretrieve(query)
    reranked_docs = await self.reranker.arerank(query, retrieved_docs)
    context_str = self._format_context(reranked_docs)

    chain = (self.prompt | self.llm | StrOutputParser())

    # Stream tokens one by one
    async for chunk in chain.astream({
        "context": context_str,
        "question": query
    }):
        yield chunk

This creates the "ChatGPT-like" effect where text appears progressively, keeping users engaged.

The Chainlit Integration

Chainlit provides the web framework, but we've enhanced it with thoughtful UX touches.

Dynamic File Upload

@cl.on_chat_start
async def on_chat_start():
    # Ask for files with generous timeout
    files = await cl.AskFileMessage(
        content="Please upload up to 3 text or PDF files to begin!",
        accept=["text/plain", "application/pdf"],
        max_files=3,
        timeout=300
    ).send()

    # Process uploads in background threads
    for file in files:
        await asyncio.to_thread(shutil.copy, file.path, dest_path)

Users don't need to pre-populate a source_documents folder. They simply drag and drop PDFs into the chat.

Visual Progress with Steps

async with cl.Step(name="Processing Documents", show_input=False) as step:
    step.output = "Chunking and preparing your documents..."
    await asyncio.to_thread(doc_processor.process)

async with cl.Step(name="Initializing RAG Pipeline", show_input=False) as step:
    step.output = "Loading models and building the vector database..."
    pipeline = await asyncio.to_thread(initialize_pipeline)

Chainlit's Step API shows users exactly what's happening during setup, transforming what used to be mysterious loading time into a transparent process.

Real-Time Streaming

@cl.on_message
async def main(message: cl.Message):
    chain = cl.user_session.get("chain")
    msg = cl.Message(content="")

    # Stream each token to the UI
    async for chunk in chain.astream_answer(message.content):
        await msg.stream_token(chunk)

    await msg.send()

Each token is rendered as soon as the LLM produces it, making the interaction feel conversational and responsive.

How to Use Your New RAG App

Installation

pip install -r requirements.txt

Running the Application

chainlit run chainlit_app.py -w

The -w flag enables watch mode for development (auto-reloads on code changes).

User Workflow

Open your browser to http://localhost:8000
Upload documents: Drag and drop up to 3 PDFs into the chat
Wait for initialization: Watch the progress steps (typically 30-60 seconds)
Start asking questions: Type naturally, as you would in any chat interface
Watch answers stream in: See the response appear word by word, with source citations

Under the Hood: Asyncio Best Practices

When to Use `asyncio.to_thread()`

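Use it for blocking calls that have no native async API, so they run in a worker thread instead of stalling the event loop. This covers blocking I/O and heavy library calls; pure-Python CPU work still shares the GIL with the event loop thread, so the gain there is responsiveness rather than a parallel speedup: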

BM25 calculations
Cross-encoder predictions
File operations (copying, reading)
Document processing

When to Use Native Async

Use it for inherently asynchronous operations:

LangChain's ainvoke() and astream() methods
Chainlit's UI updates
Network requests (when using aiohttp, etc.)
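As a rough illustration of why native async operations are worth keeping async, the sketch below awaits two coroutines with `asyncio.gather`. `fetch` is a hypothetical placeholder for something like a network call; the recorded events show that both operations start before either one finishes.

```python
import asyncio
from typing import List

events: List[str] = []


async def fetch(name: str, delay: float) -> str:
    """Hypothetical native-async operation (for example, a network call)."""
    events.append(f"start {name}")
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    events.append(f"end {name}")
    return f"{name} done"


async def main() -> List[str]:
    # Both coroutines are in flight at once; results keep the argument order.
    return await asyncio.gather(fetch("keyword", 0.05), fetch("semantic", 0.02))


results = asyncio.run(main())
print(results)
print(events)
```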

The Golden Rule

Never call a blocking synchronous function directly inside a coroutine: it stalls the event loop until it returns. Either wrap it with asyncio.to_thread() or use a truly async alternative.
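The effect of breaking this rule is easy to measure. In the sketch below, a hypothetical `heartbeat` task stands in for UI updates, and `time.sleep` stands in for any blocking call; we count how many heartbeats fire while the call is running.

```python
import asyncio
import time


async def heartbeat(ticks: list) -> None:
    """Stands in for UI updates that need the event loop to stay responsive."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def ticks_during(call) -> int:
    ticks: list = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    before = len(ticks)
    await call()
    during = len(ticks) - before
    task.cancel()
    return during


async def blocking_directly() -> None:
    time.sleep(0.2)  # blocks the event loop: nothing else can run


async def offloaded() -> None:
    await asyncio.to_thread(time.sleep, 0.2)  # the event loop keeps running


async def main():
    return await ticks_during(blocking_directly), await ticks_during(offloaded)


frozen_ticks, responsive_ticks = asyncio.run(main())
print(frozen_ticks, responsive_ticks)
```

The direct call records no heartbeats at all while it runs, which is what a frozen UI looks like from the inside. The offloaded call lets the heartbeat keep ticking; the exact count depends on the machine.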

Performance Comparison*

Operation	Synchronous (old)	Asynchronous (new)	Speedup
Hybrid retrieval	~2.5s	~1.3s	1.9x
Full query + answer	~8s	~4s (to first token)	2x
User experience	Frozen UI	Responsive streaming	∞

The actual wall-clock time improvement is impressive, but the perceived performance is transformational. Users see progress immediately instead of staring at a blank screen.

* Numbers depend on machine specs and document size.

Conclusion

By combining the robust RAG pipeline from our original tutorial with asyncio's concurrency model and Chainlit's elegant UI framework, we've created something special: a RAG system that's both powerful and delightful to use.

The async patterns we've implemented here (offloading blocking operations, streaming responses, and running independent tasks concurrently) apply far beyond RAG. They're fundamental techniques for building any modern Python application that needs to handle multiple operations efficiently.

Ready to build your own async-powered RAG app? The complete code is in chainlit_app.py.

Happy coding! 🚀

From Documents to Dialogue: A step-by-step RAG Journey

Hady Walied — Fri, 10 Oct 2025 14:00:00 +0000

Welcome to this complete guide on building an advanced Retrieval-Augmented Generation (RAG) system from scratch. In this tutorial series, we'll go from raw PDF documents to a sophisticated chatbot that can answer questions about them, citing its sources.

We'll be using Python, LangChain, and a local LLM (powered by LM Studio) to build our project. Let's get started!

Part 1: The Foundation - Processing Your Documents

Before we can ask questions about our documents, we need to prepare them. Large documents are too big to fit into the context window of most LLMs. The solution is to break them down into smaller, manageable chunks.

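The Concept: We'll load PDF files, split them into overlapping text chunks, and save them to a JSON file. The overlap matters because a sentence that straddles a chunk boundary would otherwise be cut in two; repeating the end of each chunk at the start of the next keeps most of that context together.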

The Code: This is phase1_process_docs.py:

# phase1_process_docs.py
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import json

def process_documents(source_dir="source_documents", output_file="chunks.json"):
    all_chunks = []
    for filename in os.listdir(source_dir):
        if filename.endswith(".pdf"):
            loader = PyPDFLoader(os.path.join(source_dir, filename))
            documents = loader.load()

            text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
            chunks = text_splitter.split_documents(documents)

            for chunk in chunks:
                all_chunks.append({
                    "content": chunk.page_content,
                    "metadata": chunk.metadata
                })

    with open(output_file, 'w') as f:
        json.dump(all_chunks, f, indent=2)
    print(f"Successfully processed {len(all_chunks)} chunks.")

if __name__ == "__main__":
    process_documents()

Outcome: Run this script, and you'll have a chunks.json file. This is the knowledge base for our RAG system.
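To see what chunk_size and chunk_overlap mean in practice, here is a deliberately simplified splitter that slides a fixed-size character window over the text. It is an illustration only: RecursiveCharacterTextSplitter also tries to break on paragraph and sentence separators, which this sketch ignores, and the function name is made up for the example.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Fixed-size character windows; a simplification, not LangChain's algorithm."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
# → ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk begins with the last four characters of the previous one, so anything that falls on a boundary still appears whole in a neighbouring chunk as long as it is shorter than the overlap.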

Part 2: The Classic Approach - Keyword Search with BM25

The simplest way to find relevant information is through keyword search. BM25 is a ranking function that scores each document by how often the query terms occur in it, giving more weight to rare terms and normalizing for document length.

The Concept: We'll use the rank_bm25 library to create a search index from our document chunks. This will allow us to find chunks that contain specific keywords from our query.

The Code: This is phase2_keyword_search.py:

# phase2_keyword_search.py
import json
from rank_bm25 import BM25Okapi

# Load the processed chunks
with open("chunks.json", 'r') as f:
    chunks_data = json.load(f)

# Get the content of each chunk
corpus = [chunk['content'] for chunk in chunks_data]
tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

def search_keyword(query, top_n=3):
    tokenized_query = query.split(" ")
    doc_scores = bm25.get_scores(tokenized_query)

    top_indexes = sorted(range(len(doc_scores)), key=lambda i: doc_scores[i], reverse=True)[:top_n]

    results = [chunks_data[i] for i in top_indexes]
    return results

Outcome: You can now perform basic keyword searches. However, you'll notice that this approach fails to capture the meaning behind the words: a query for "car" will not match a chunk that only says "automobile".
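For intuition about what the library computes, here is a from-scratch sketch of a textbook BM25 scoring function. It uses a smoothed IDF and the common defaults k1=1.5 and b=0.75 as assumptions; rank_bm25's BM25Okapi may differ in details, so treat this as an illustration rather than a drop-in replacement.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Textbook BM25 with a smoothed IDF; library implementations may differ."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates (k1) and is normalized by document length (b).
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


corpus = [
    "the car is fast".split(),
    "the automobile is fast".split(),
    "bananas are yellow".split(),
]
scores = bm25_scores("fast car".split(), corpus)
print(scores)
```

The first document matches both query terms and scores highest. The second matches only "fast", because BM25 has no way of knowing that "automobile" and "car" mean the same thing.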

Part 3: A Leap in Understanding - Semantic Search with Vector Embeddings

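To search by meaning, we need to enter the world of vector embeddings. An embedding is a numerical representation of a piece of text: a vector of floating-point numbers. Texts with similar meanings map to nearby vectors, so by measuring the distance between vectors we can find text that is semantically similar.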

The Concept: We'll use a local embedding model served through LM Studio to generate embeddings for our chunks. Then, we'll store these embeddings in a vector database (ChromaDB) for efficient searching.

The Code: This is phase3_semantic_search.py:

# phase3_semantic_search.py
import json
import os
from typing import List

import requests
from langchain.docstore.document import Document
from langchain_community.vectorstores import Chroma
from langchain_core.embeddings import Embeddings

DB_PATH = "chroma_db"
api_base = "http://26.186.178.211:1234/v1"

class LMStudioEmbeddings(Embeddings):
    # ... (embedding generation logic) ...

def build_or_load_db():
    if os.path.exists(DB_PATH):
        return Chroma(persist_directory=DB_PATH, embedding_function=LMStudioEmbeddings(api_base))

    with open("chunks.json", 'r') as f:
        chunks_data = json.load(f)
    documents = [Document(page_content=chunk['content'], metadata=chunk['metadata']) for chunk in chunks_data]

    db = Chroma.from_documents(
        documents,
        LMStudioEmbeddings(api_base),
        persist_directory=DB_PATH
    )
    return db

Outcome: You now have a powerful semantic search engine. You can query for concepts and ideas, and get much more relevant results than with keyword search alone.
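What "comparing embeddings" means can be shown with a toy example. The three-dimensional vectors below are hand-made for illustration (real embeddings have hundreds of dimensions and come from the model), and cosine similarity is just one common metric; the vector store may be configured to use a different one.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec: List[float], index: Dict[str, List[float]],
            top_n: int = 2) -> List[Tuple[str, float]]:
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Hand-made 3-dimensional "embeddings" (illustrative values, not model output).
index = {
    "the car is fast": [0.9, 0.1, 0.0],
    "the automobile is quick": [0.8, 0.2, 0.1],
    "bananas are yellow": [0.0, 0.1, 0.9],
}
hits = nearest([0.85, 0.15, 0.05], index)
print(hits)
```

Both vehicle sentences land close to the query vector even though they share almost no words, which is exactly what keyword search could not do.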

Part 4: The Generative Leap - Building Your First RAG Chatbot

Now it's time to bring in the "G" in RAG: Generation. We'll use an LLM to generate human-like answers based on the documents our retriever finds.

The Concept: We'll create a simple chain using LangChain. The chain will first retrieve relevant documents (using our semantic search from Part 3), then "stuff" them into a prompt for the LLM, and finally, get the answer.

The Code: This is phase4_rag_chat.py:

# phase4_rag_chat.py
# ... (imports and setup) ...

template = """
Answer the question based ONLY on the following context.
If you don't know the answer, just say that you don't know. Do not make up an answer.
Cite the sources used in your answer.

Context:
{context}

Question:
{question}
"""
prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n".join([f"Source: {d.metadata['source']}\n{d.page_content}" for d in docs])

# RAG Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Outcome: You have a working chatbot! It can answer questions about your documents and point to the sources it used.
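Stripped of LangChain, the retrieve, stuff, and generate steps look roughly like the sketch below. `fake_retrieve` and `echo_llm` are hypothetical stand-ins so the example runs without a model or vector store, and the chunks are plain dicts rather than LangChain Document objects.

```python
from typing import Callable, List

TEMPLATE = """Answer the question based ONLY on the following context.
If you don't know the answer, just say that you don't know.

Context:
{context}

Question:
{question}
"""


def format_docs(docs: List[dict]) -> str:
    # Prefix each chunk with its source so the model can cite it.
    return "\n\n".join(f"Source: {d['metadata']['source']}\n{d['content']}" for d in docs)


def answer(question: str, retrieve: Callable[[str], List[dict]],
           llm: Callable[[str], str]) -> str:
    docs = retrieve(question)  # 1. retrieve relevant chunks
    prompt = TEMPLATE.format(context=format_docs(docs), question=question)  # 2. stuff
    return llm(prompt)  # 3. generate


# Hypothetical stand-ins so the sketch runs without a model or vector store.
def fake_retrieve(question: str) -> List[dict]:
    return [{"content": "Attention is a weighted sum of values.",
             "metadata": {"source": "paper.pdf"}}]


def echo_llm(prompt: str) -> str:
    return prompt  # returns the prompt so we can inspect what the LLM would see


seen_prompt = answer("What is attention?", fake_retrieve, echo_llm)
print(seen_prompt)
```

Printing the prompt this way is a handy debugging habit: if the sources are missing from the context, the model has nothing to cite.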

Part 5: The Best of Both Worlds - Advanced Retrieval

Keyword search is good at finding specific terms, while semantic search is good at finding related concepts. Why not use both? This is called hybrid search. We'll also add a re-ranking step to further improve the relevance of our retrieved documents.

The Concept: We'll first retrieve documents using both BM25 and semantic search. Then, we'll use a special type of model called a cross-encoder to re-rank the combined results before passing them to the LLM.

The Code: This is from phase5_advanced_rag_chat.py:

# phase5_advanced_rag.py
# ... (imports and setup) ...

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def advanced_retriever(query, top_n_hybrid=10, top_n_rerank=3):
    # ... (hybrid search logic) ...

    # Re-ranking
    pairs = [[query, doc['content']] for doc in combined_docs]
    scores = cross_encoder.predict(pairs)

    scored_docs = list(zip(scores, combined_docs))
    scored_docs.sort(key=lambda x: x[0], reverse=True)

    reranked_docs = [doc for score, doc in scored_docs[:top_n_rerank]]
    return reranked_docs

Outcome: Your RAG system should now retrieve noticeably better context. Hybrid search reduces the chance of missing a relevant document, and the re-ranker puts the most relevant candidates in front of the LLM.
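The hybrid search logic is elided in the snippet above, so the following is only one plausible way to combine the two result lists, not necessarily what the full script does: take the union, drop duplicate chunks, and re-rank what is left. `merge_results` and `overlap_score` are hypothetical names, and the scorer is a trivial stand-in for the cross-encoder.

```python
from typing import Callable, List


def merge_results(keyword_hits: List[dict], semantic_hits: List[dict]) -> List[dict]:
    """Union of both result lists, de-duplicated by chunk text (an assumed strategy)."""
    seen = set()
    combined = []
    for doc in keyword_hits + semantic_hits:
        if doc["content"] not in seen:
            seen.add(doc["content"])
            combined.append(doc)
    return combined


def rerank(query: str, docs: List[dict],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> List[dict]:
    # score_fn plays the role of cross_encoder.predict for a single pair.
    ranked = sorted(docs, key=lambda d: score_fn(query, d["content"]), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, text: str) -> float:
    """Hypothetical scorer: fraction of query words present in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)


keyword_hits = [{"content": "attention is all you need"},
                {"content": "bm25 ranks keywords"}]
semantic_hits = [{"content": "self-attention relates positions of a sequence"},
                 {"content": "attention is all you need"}]
combined = merge_results(keyword_hits, semantic_hits)
best = rerank("what is attention", combined, overlap_score, top_n=2)
print(len(combined), best[0]["content"])
```

De-duplicating before re-ranking avoids scoring the same chunk twice, which matters because the cross-encoder is the most expensive step in the retrieval path.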

The Final Product - A Complete RAG Application

We've built all the components. Now, let's put them together into a single, robust application.

The Concept: We'll refactor our code into classes, making it more organized and reusable. We'll have a DocumentProcessor, a Retriever, a ReRanker, and a main RAGPipeline that orchestrates everything.

The Code: This is the structure of our final total_rag_app.py:

# total_rag_app.py

# --- 1. Configuration ---
# ...

# --- 2. Embedding Model ---
class LMStudioEmbeddings(Embeddings):
    # ...

# --- 3. Document Processing ---
class DocumentProcessor:
    # ...

# --- 4. Retrieval System ---
class Retriever:
    # ...

# --- 5. Re-ranking System ---
class ReRanker:
    # ...

# --- 6. RAG Pipeline ---
class RAGPipeline:
    # ...

# --- 7. Main Execution ---
if __name__ == "__main__":
    # ... (Initialize and run the pipeline) ...

Outcome: You have a complete, command-line RAG application. It's well-structured, easy to run, and represents the culmination of all our work.

Running the Final Application

Full implementation: hadywalied/Total_RAG

Install dependencies:
```
pip install -r requirements.txt
```
Run the app:
```
python total_rag_app.py
```

Example:

RAG Application Ready. Ask a question about your documents.
> what's attention ?
--- Answer ---
Answer:
Attention is a function that maps a query and a set of key-value pairs to an output,
where the query, keys, values, and output are all vectors. The output is computed as a weighted sum
of the values, where the weight assigned to each value is determined by the compatibility between the
query and the corresponding key.

Source(s):
source_documents\1706.03762v7.pdf
--- End of Answer ---
>

Conclusion

Congratulations! You've built an advanced RAG system from the ground up. You've learned how to process documents, use both keyword and semantic search, re-rank results for relevance, and generate answers with a local LLM.

From here, you can explore many improvements, such as a web interface, more advanced retrieval strategies, or support for different document types. The possibilities are endless. Happy coding!