Forem: Abhishek Chatterjee

My Friend Is 40 and Drowning in Job Applications. So I Built Him an AI Agent.

Abhishek Chatterjee — Wed, 13 May 2026 08:43:55 +0000

Part 1 of 5: The problem, the mental model, and what an autonomous AI agent actually is under the hood.

A few weeks ago, a friend of mine — let’s call him R — sent me a voice note at an unreasonable hour.

He’s 40. Smart guy. Spent the last decade running operations at two different companies, one of which exited. Legitimately good at what he does. But the kind of person who built his career through referrals and relationships and had never really needed to job-hunt the traditional way. Until now.

The note was about seven minutes long. I’ll spare you the full transcript, but the gist: he’d been applying for three months. Targeting AI-adjacent product and operations roles. Had a strong profile on paper. Was getting almost nowhere.

“I’m sending 15 applications a week,” he said. “Every resume looks basically the same because I don’t have time to rewrite it properly each time. I don’t know which ones to prioritize. I’m tracking everything in a spreadsheet that’s become a nightmare. And I feel like I’m just shouting into a void.”

Seven minutes of this. He ended with: “You build AI stuff. Is there something that could help?”

I sat with that for a day.

The honest answer was: nothing good enough exists yet. The tools in the space are either janky resume spinners with no real intelligence, or “AI job search” features bolted onto job boards that mostly just mean keyword matching with extra steps.

And I could see the actual problem clearly — not just “job searching is hard” but the specific mechanical reasons why his approach wasn’t working and why no amount of extra effort would fix it.

So I said: let me just build it.

This is Part 1 of a 5-part series documenting that build. JobFlow AI — an autonomous AI agent that finds roles, scores them against your profile, tailors your resume for each one, writes outreach emails, tracks follow-ups, and preps you for interviews. Built in public. Will be open sourced.

I’m sharing the full engineering process: the architecture decisions, the agent design patterns, the prompt engineering, the orchestration, the deployment, the things that went wrong. Everything.

But before we get into any of that — let’s talk about what’s actually broken with job searching right now. Because understanding the problem is what made the design obvious.

Part 1: The Actual Problem (It’s Not What You Think)

R’s instinct was that he needed to apply to more jobs, or write better cover letters, or optimize his LinkedIn profile.

All wrong. Not because those things don’t matter, but because he was trying to win a game whose rules had changed without anyone sending the memo.

Here’s what happened.

For about two decades, job applications were filtered by ATS software — Applicant Tracking Systems — that parsed resumes and matched them against keywords in job descriptions. The career advice was: “get the right keywords in.” Reasonable. Played the game. Rules were clear.

Then the rules changed.

Most serious ATS platforms now run semantic search — vector embeddings and similarity scoring, not keyword matching. The system understands meaning. “Led revenue operations overhaul” and “built CRM pipeline from scratch” can map to the same concept. The system knows.
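To make the difference concrete, here is a small sketch comparing keyword overlap with vector similarity for the two phrases above. The three-dimensional vectors are hand-made stand-ins, not output from any real embedding model, and the function names are mine for illustration only.

```python
import math


def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets: the old keyword-matching view."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: the semantic-matching view."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


resume_line = "Led revenue operations overhaul"
job_line = "Built CRM pipeline from scratch"

# ASSUMPTION: toy vectors chosen by hand to illustrate the idea.
# A real system would obtain these from an embedding model.
TOY_EMBEDDINGS = {
    resume_line: [0.9, 0.4, 0.1],
    job_line: [0.8, 0.5, 0.2],
}

keyword_score = keyword_overlap(resume_line, job_line)
semantic_score = cosine_similarity(
    TOY_EMBEDDINGS[resume_line], TOY_EMBEDDINGS[job_line]
)
```

The two phrases share no words, so the keyword score is zero, while the toy vectors point in nearly the same direction. That gap is the whole argument against keyword stuffing.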

The problem: the career advice industry hasn’t caught up.

Most resume coaches and LinkedIn experts are still giving keyword-stuffing advice. For an algorithm that doesn’t use keywords anymore. You’re being told to optimize for a game that no longer exists. And a resume that’s been obviously keyword-engineered can actively hurt you — it reads as manipulative to both the AI doing first-pass screening and the human reviewing finalists.

Simultaneously: AI tools made it trivially easy to spray-apply to 300 jobs. Which means recruiters are drowning in volume. Which means the actual human review window for your application — before someone moves on — might be 15 seconds.

You’re competing with 400 applicants. In 15 seconds. For a system using semantic matching that most of the advice you’ve received doesn’t account for.

R wasn’t doing anything wrong. He was doing the right things for the wrong era.

Solving this properly means: the right information, in the right format, for the right role, without spending three hours per application. A system that reads the job, understands your full profile, picks the right angle, and generates exactly what that specific opportunity requires.

Not one generic resume. Not manual tailoring that eats your evenings. An agent that handles the work so you can handle the judgment calls.

Part 2: The Mental Model That Makes Everything Else Click

I’ve been building agent systems for a while now: Praxiom, our AI product manager at Einstein Labs, runs 36 agent tools in production. Polaris is a spec-driven AI coding agent. agent-stream is a protocol we open sourced for streaming agent events. The agent architecture patterns in this series aren’t new territory for me, but I want to explain them from first principles because the framing matters.

The mental model that makes everything else make sense:

An AI agent = a brain + a harness.

Full stop. That’s the whole thing.

The brain is the Large Language Model — Claude, GPT-4, whatever you’re using. Text goes in, text comes out. Extraordinary at reasoning, synthesis, generation when given proper context.

The harness is everything you build around it. The database that gives it memory. The scrapers and APIs that feed it information. The orchestration layer that decides what runs and in what order. The validation layer that checks its output. The task queue that lets it run in the background. The UI that makes it usable. The retry logic. The logging. The error handling.

The brain: maybe 5% of what you build.

The harness: the other 95%.

This is the thing that gets underestimated, consistently, by everyone writing about “building AI products.” The content focuses on prompts — because prompts are visible and legible and feel like the interesting part. But the prompt is downstream of everything the harness has to do. If the harness fails to assemble the right context, the best prompt in the world produces noise.

Here’s why.

The LLM is stateless. Completely. Every call starts with no memory of anything that happened before. It doesn’t know what job you looked at yesterday. Doesn’t know what your resume says. Doesn’t know which companies you’ve already applied to. Doesn’t know anything about you at all.

Clean slate. Every time.

So when you want an agent to tailor a resume for a specific role, you can’t just say “tailor my resume for this job.” Before that call can produce anything useful, the harness needs to have assembled:

Your full work history, projects, and skills
Which of your resume variants best fits this role type
The specific job requirements, extracted and structured
Company context — size, stage, culture signals from the listing
Your positioning for this type of role
Any relevant achievements that map to what this company is looking for

All of that, formatted correctly, labeled clearly, delivered in a form the LLM can work with efficiently.

The harness does that. Not the LLM. The harness.
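A minimal sketch of that division of labor, assuming a stand-in `fake_llm` function in place of a real model call. The `Harness` class and its field names are hypothetical and only illustrate the idea that state lives outside the model and gets re-sent on every call.

```python
from dataclasses import dataclass, field


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call: it can only use what is inside `prompt`."""
    if "Resume:" in prompt and "Job:" in prompt:
        return "tailored resume draft"
    return "I have no resume or job to work with."


@dataclass
class Harness:
    """Holds state between calls and rebuilds the full context each time."""

    profile: dict[str, str] = field(default_factory=dict)

    def build_context(self, job_text: str) -> str:
        # Every section is labeled so the model can tell the pieces apart.
        sections = [f"{label}:\n{text}" for label, text in self.profile.items()]
        sections.append(f"Job:\n{job_text}")
        return "\n\n".join(sections)

    def tailor(self, job_text: str) -> str:
        prompt = self.build_context(job_text) + "\n\nTask: tailor the resume."
        return fake_llm(prompt)


# Without a harness, the model has nothing to work from.
bare_reply = fake_llm("Tailor my resume for this job.")

# With a harness, the same stateless function receives a full briefing.
harness = Harness(
    profile={
        "Resume": "10 years running operations at two companies",
        "Positioning": "Operations lead moving into AI-adjacent roles",
    }
)
briefed_reply = harness.tailor("Head of Operations at an AI startup")
```

The model function is identical in both calls. Only the briefing changed.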

The analogy I keep coming back to: you hire the sharpest consultant you’ve ever worked with. Extraordinary. But she has complete amnesia between every session. Comes in fresh every time, knows nothing about your company, your product, your history.

She’s still extraordinary. Brief her properly and she produces work that justifies every rupee. Show up and say “pick up where we left off” — she can’t. She literally cannot.

So you build systems around her. Structured briefing documents. Careful context curation. Preparation that translates accumulated knowledge into something she can act on in the session.

That entire preparation system? The briefing infrastructure?

That’s the harness. That’s the 95%.

Part 3: Pre-flight and Post-flight

Once this model is clear, the agent design pattern becomes obvious.

Every LLM call has two distinct phases that have nothing to do with the LLM. I call them pre-flight and post-flight.

Pre-flight is everything that happens before you call the LLM.

For the resume tailoring agent, pre-flight looks like:

Fetch the job record from the database
Scrape the live job listing for fresh, complete text
Retrieve the user’s full profile — experiences, projects, skills, positioning
Determine which of 5 resume variants is the right starting point for this role type
Pull that variant’s positioning statement and summary
Compute skill overlap between job requirements and user background
Format all of this into a structured context document
Load the agent’s system prompt from its dedicated markdown file

Zero LLM involvement. All deterministic code. Testable. Debuggable. Fast.
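Here is a rough sketch of the deterministic part of that pre-flight: skill overlap, variant selection, and context formatting. The dictionary keys (`skills`, `role_type`, `variants`) and the section labels are assumptions for illustration, not the schema used in the actual repo.

```python
def skill_overlap(required: set[str], have: set[str]) -> float:
    """Fraction of required skills present in the profile (0.0 if none required)."""
    if not required:
        return 0.0
    return len(required & have) / len(required)


def preflight(job: dict, profile: dict) -> str:
    """Assemble a labeled context document with no LLM involvement."""
    required = {s.lower() for s in job["skills"]}
    have = {s.lower() for s in profile["skills"]}
    # Fall back to a general variant when nothing matches the role type.
    role = job["role_type"]
    variant_key = role if role in profile["variants"] else "general"
    lines = [
        "## JOB",
        f"Title: {job['title']}",
        f"Required skills: {', '.join(sorted(required))}",
        "",
        "## CANDIDATE",
        f"Resume variant: {variant_key}",
        f"Positioning: {profile['variants'][variant_key]}",
        "",
        "## FIT",
        f"Skill overlap: {skill_overlap(required, have):.0%}",
        f"Missing skills: {', '.join(sorted(required - have)) or 'none'}",
    ]
    return "\n".join(lines)


job = {
    "title": "Head of Operations",
    "role_type": "operations",
    "skills": ["SQL", "Process design", "Vendor management", "Python"],
}
profile = {
    "skills": ["sql", "process design", "vendor management", "budgeting"],
    "variants": {
        "operations": "Operator who scales teams and systems",
        "general": "Experienced operations and product generalist",
    },
}
context = preflight(job, profile)
```

Because nothing here touches a model, each function can be unit tested with plain assertions.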

Post-flight is everything that happens after the LLM responds.

Parse and validate output against the expected Pydantic schema
If validation fails: retry with adjusted parameters
Store the tailored resume in the database, linked to the correct job and profile
Update the job’s pipeline status
Trigger the next workflow step
Log model used, tokens consumed, latency — for cost tracking and debugging

Again — not the LLM. Code.
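The validate-then-retry step can be sketched with the standard library alone. The real system uses a Pydantic schema. Here a dataclass and manual checks stand in for it so the example runs without dependencies, and `TailoredResume`, `parse_resume`, and `call_with_validation` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class TailoredResume:
    summary: str
    bullets: list[str]


def parse_resume(raw: str) -> TailoredResume:
    """Validate raw model output; raise ValueError on any schema problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    summary, bullets = data.get("summary"), data.get("bullets")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("summary must be a non-empty string")
    if not isinstance(bullets, list) or not all(isinstance(b, str) for b in bullets):
        raise ValueError("bullets must be a list of strings")
    return TailoredResume(summary=summary, bullets=bullets)


def call_with_validation(
    call_model: Callable[[int], str], max_attempts: int = 3
) -> TailoredResume:
    """Call the model until its output validates or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = call_model(attempt)  # the attempt number lets the caller adjust parameters
        try:
            return parse_resume(raw)
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"validation failed after {max_attempts} attempts: {last_error}")


def flaky_model(attempt: int) -> str:
    """Fake model: chatty prose on the first attempt, valid JSON afterwards."""
    if attempt == 1:
        return "Sure! Here is your resume..."
    return json.dumps(
        {"summary": "Operations leader", "bullets": ["Ran ops at two companies"]}
    )


result = call_with_validation(flaky_model)
```

Persisting the result, updating pipeline status, and logging tokens would follow the successful return.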

Here’s the thing that genuinely shifts how you design these systems: 80% of what makes an agent good is in the pre-flight and post-flight, not the prompt.

The prompt matters. Prompt engineering is real and worth doing well. But if the pre-flight is handing the LLM poorly structured, incomplete, or mislabeled context — no prompt technique saves you. And if the post-flight isn’t validating and correctly persisting the output — you’ll have brilliant LLM responses that produce nothing downstream.

The prompt is the 5 minutes with the consultant. The pre-flight is the two hours assembling her briefing. The post-flight is the three hours turning her output into action.

Optimize for the full process.

Part 4: What JobFlow AI Actually Looks Like

The system runs four services:

FastAPI backend — ~40 endpoints, all business logic, 13 database tables
Celery worker — picks up background tasks from Redis queue, runs the actual agents
Redis — message broker and task state tracking
Next.js 14 frontend — TanStack Query, Tailwind, shadcn/ui

The interaction flow:

Browser
↓
Next.js frontend
↓
FastAPI [accepts request, queues task, returns immediately]
↓
Redis queue
↓
Celery worker [picks up task]
↓
Agent orchestrator [determines what runs, in what sequence]
↓
Agent modules [10 specialized agents, each with one job]
↓
Services layer [Claude API, web scraper, PDF generator]
↓
SQLite [stores everything]
↑
Frontend polls for status updates

The reason the API returns immediately and the agent runs in the background: LLM calls take 3–15 seconds, sometimes longer. Synchronous request-response for AI workloads is the wrong pattern. Queue it, run it, poll for completion.
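The queue, run, poll shape can be shown with nothing but the standard library. In this sketch a thread and an in-memory queue stand in for Celery and Redis, and the function names and status dictionary are illustrative assumptions rather than the project's actual API.

```python
import queue
import threading
import time
import uuid

task_queue: "queue.Queue[tuple[str, str]]" = queue.Queue()
task_status: dict[str, dict] = {}


def submit(job_id: str) -> str:
    """API side: record the task, enqueue it, and return right away."""
    task_id = str(uuid.uuid4())
    task_status[task_id] = {"state": "queued", "result": None}
    task_queue.put((task_id, job_id))
    return task_id


def worker() -> None:
    """Worker side: pull tasks off the queue and run the slow part."""
    while True:
        task_id, job_id = task_queue.get()
        task_status[task_id]["state"] = "running"
        time.sleep(0.05)  # placeholder for a multi-second agent run
        task_status[task_id] = {
            "state": "done",
            "result": f"tailored resume for {job_id}",
        }
        task_queue.task_done()


def poll_until_done(task_id: str, interval: float = 0.01, timeout: float = 2.0) -> dict:
    """Frontend side: check status repeatedly until the task finishes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = task_status[task_id]
        if status["state"] == "done":
            return status
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish in {timeout}s")


threading.Thread(target=worker, daemon=True).start()
task_id = submit("job-42")  # returns immediately
final = poll_until_done(task_id)
```

`submit` never waits on the agent, which is why the HTTP request can return in milliseconds even when the model takes many seconds.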

The 10 agent modules:
Profile Ingest — Takes raw LinkedIn text, resume, GitHub URL, writing samples — extracts structured profile data
Profile Synthesis — Turns structured profile into positioning, career narrative, and 5 resume variants
Job Parser — Reads a listing, extracts requirements, skills, company signals, red flags
Job Scorer — Scores the role against your profile on 7 weighted dimensions
Resume Tailor — Tailored resume using the right variant for this specific role
Cover Letter — Actually specific to this company and this role — not generic
Outreach Writer — Cold outreach draft to the hiring manager
Follow-up Writer — Follow-up email if no response after the right window
Interview Prep — Role-specific questions with suggested answers
Contact Finder — Attempts to identify the hiring manager’s name and direct contact

Each agent does exactly one thing. One input type. One output type. No agent has any knowledge of the others.

This is deliberate. When output is wrong, I know which agent produced it. When I want to improve resume tailoring, I change one thing without touching cover letter generation. When I want to test a new prompt approach, I test it against one agent’s behavior in isolation. Specialization makes everything — debugging, iteration, testing, prompt engineering — tractable.
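One way to express "one input type, one output type" is a small generic base class. The sketch below uses a weighted job scorer as the example agent. The three dimensions and their weights are made up for illustration: the real scorer uses 7 dimensions that this post does not name, and the real base class may look different.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar

InputT = TypeVar("InputT")
OutputT = TypeVar("OutputT")


class Agent(ABC, Generic[InputT, OutputT]):
    """One agent, one job: a single typed input and a single typed output."""

    @abstractmethod
    def run(self, payload: InputT) -> OutputT: ...


@dataclass(frozen=True)
class ScoreInput:
    dimension_scores: dict[str, float]  # each value between 0.0 and 1.0


@dataclass(frozen=True)
class ScoreOutput:
    total: float


class JobScorer(Agent[ScoreInput, ScoreOutput]):
    # ASSUMPTION: hypothetical dimensions and weights, summing to 1.0.
    WEIGHTS = {"skills": 0.5, "seniority": 0.3, "location": 0.2}

    def run(self, payload: ScoreInput) -> ScoreOutput:
        # Weighted sum; a dimension missing from the input counts as 0.0.
        total = sum(
            weight * payload.dimension_scores.get(name, 0.0)
            for name, weight in self.WEIGHTS.items()
        )
        return ScoreOutput(total=round(total, 3))


score = JobScorer().run(
    ScoreInput({"skills": 0.8, "seniority": 0.5, "location": 1.0})
)
```

Since the scorer knows nothing about resumes or cover letters, changing its weights cannot break any other agent.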

One thing worth flagging: outreach emails and follow-ups sit in a human approval queue before they go anywhere. The agent drafts. You read, edit if needed, then approve or discard.

This isn’t a gap I’m planning to close by automating further.

An outreach email is a message from one real person to another about a real career opportunity. That’s categorically different from generating a tailored resume. The agent should handle the work. The human should make the call on whether to send it and what it says.

Automate the labor. Keep the judgment.
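The approval gate can be enforced in code rather than by convention. This is a minimal sketch with hypothetical names (`OutreachDraft`, `DraftState`, `send`), where a plain list stands in for whatever actually delivers the email.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DraftState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DISCARDED = "discarded"


@dataclass
class OutreachDraft:
    recipient: str
    body: str
    state: DraftState = DraftState.PENDING

    def approve(self, edited_body: Optional[str] = None) -> None:
        """Human decision: optionally edit the text, then approve."""
        if self.state is not DraftState.PENDING:
            raise ValueError(f"cannot approve a {self.state.value} draft")
        if edited_body is not None:
            self.body = edited_body
        self.state = DraftState.APPROVED

    def discard(self) -> None:
        """Human decision: this draft never goes out."""
        if self.state is not DraftState.PENDING:
            raise ValueError(f"cannot discard a {self.state.value} draft")
        self.state = DraftState.DISCARDED


def send(draft: OutreachDraft, outbox: list) -> None:
    """Refuse to send anything a human has not approved."""
    if draft.state is not DraftState.APPROVED:
        raise PermissionError("draft has not been approved by a human")
    outbox.append((draft.recipient, draft.body))


outbox: list = []
draft = OutreachDraft(recipient="hiring.manager@example.com", body="Agent draft")

try:
    send(draft, outbox)  # blocked: still pending
    blocked = False
except PermissionError:
    blocked = True

draft.approve(edited_body="Edited by a human")
send(draft, outbox)
```

The agent can create as many drafts as it likes. Only `approve` moves one into a sendable state, and only a person calls it.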

Part 5: Stack Choices and Why

The stack is boring on purpose. When something breaks in a production agent system at 2am, “boring and well-understood” is exactly what you want.

FastAPI — async Python, type-safe, auto-generated docs I can actually test
SQLAlchemy + Alembic — mature ORM with migration support; schema evolves without destroying data
SQLite — zero infrastructure overhead, works perfectly for this use case, one config line from Postgres when needed
Celery + Redis — industry-tested async task processing, battle-hardened tooling
LangGraph — the one AI-specific tool in the stack; genuinely useful for defining state machines across multi-step agent pipelines
Anthropic Claude API directly — raw SDK, no wrappers

On the last point: I use the raw Anthropic SDK across all our agent work at Einstein Labs. Wrappers add abstraction that costs you debugging clarity. When a call fails or produces unexpected output, I want to see exactly what request went out and exactly what came back. Every layer of indirection between me and the API makes production debugging slower. Not worth it.

The interesting deliberate choice: different Claude models for different tasks.

Extraction tasks — parsing a job listing, pulling structured data from a resume — use Sonnet. Fast, economical, more than capable for well-defined extraction with clear schemas.

Synthesis tasks — generating career narrative, building positioning statements, creating the 5 resume variants — use Opus. Slower and more expensive. The quality difference is real and it compounds. Everything downstream of the profile synthesis — every tailored resume, every cover letter, every outreach draft — inherits from that synthesis. Getting it right at the foundation matters.

This is a general principle in agent system design: identify the load-bearing calls and resource them appropriately. The profile synthesis runs once per user setup. The job scorer runs on every job added. Model economics should reflect that asymmetry.
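A simple way to encode that split is a routing table keyed by task kind. In this sketch the tier labels `"sonnet"` and `"opus"` are placeholders rather than real API model identifiers, and the agent keys are my own naming for the agents whose task kind the text above makes clear.

```python
from enum import Enum


class TaskKind(Enum):
    EXTRACTION = "extraction"  # well-defined schema, runs often
    SYNTHESIS = "synthesis"    # load-bearing, runs rarely


# Placeholder tier labels; swap in real model identifiers in configuration.
MODEL_TIER = {
    TaskKind.EXTRACTION: "sonnet",
    TaskKind.SYNTHESIS: "opus",
}

# Only agents whose kind is stated in the text are listed here.
AGENT_TASK_KIND = {
    "job_parser": TaskKind.EXTRACTION,
    "profile_ingest": TaskKind.EXTRACTION,
    "profile_synthesis": TaskKind.SYNTHESIS,
}


def model_for(agent_name: str) -> str:
    """Look up the model tier for an agent; unknown agents fail loudly."""
    try:
        return MODEL_TIER[AGENT_TASK_KIND[agent_name]]
    except KeyError:
        raise KeyError(f"no model routing configured for agent {agent_name!r}") from None


parser_model = model_for("job_parser")
synthesis_model = model_for("profile_synthesis")
```

Failing loudly on an unrouted agent keeps a new agent from silently defaulting to the most expensive model.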

What This Series Covers
Part 2: The Brain — Engineering LLM Integration That Actually Works in Production

The pattern that makes or breaks production LLM integration: structured output via tool-use forcing. The BaseAgent[InputT, OutputT] class. Prompt files as first-class artifacts. Context window management. Why I call different models for different tasks and how I decided which is which.

Part 3: The Swarm — Why 10 Specialized Agents Instead of One

Why a monolithic agent is worse than a swarm of small ones. The design principles behind each agent boundary. The two hardest agents I built (Profile Ingest + Synthesis). The job scoring formula. Human-in-the-loop as intentional design.

Part 4: The Harness — Orchestration, Persistence, and Async Processing

Database design for AI applications — it’s different from regular app databases in ways that matter. LangGraph for workflow state machines. Celery + Redis for background processing. Error handling when LLMs fail in production, which they do.

Part 5: Ship It — Frontend, Deployment, and What I’d Do Differently

Building a UI that makes AI output feel trustworthy. One-command deployment, genuinely no manual steps. Actual token cost numbers. What worked better than expected. What I over-engineered. What I’d do completely differently if I started today.

The full codebase is on GitHub: github.com/abhichat85/jobflow-ai

Every code snippet in this series links to the actual file in the actual repo. Not tutorial code written for the blog. The real thing.

The plan: build this fully, document it in public, then open source it. R gets a working job search agent. Everyone else gets the full architectural blueprint and the code behind it.

One last thing about R.

He’s spent two decades building real things — teams, operations, companies. He’s not behind. He’s just playing an old game on a board that’s been redesigned.

The people who are going to win this job market aren’t the ones who apply to the most jobs. They’re the ones who show up with the right signal for the right role — specific, positioned, clearly relevant — and do it consistently without burning themselves out in the process.

That’s what this agent is designed to do. Give the right people the infrastructure to compete on signal, not volume.

I’ll see you in Part 2. That’s where we get into the engineering.

Building something in the agent space? Disagree with a decision I made here? I’m genuinely interested in both — the agreements are nice, the disagreements are useful.

How I test AI agent frontends without calling the API once

Abhishek Chatterjee — Fri, 03 Apr 2026 14:27:51 +0000

Testing AI agent applications is broken.

Not the model calls — those you can mock. What nobody knows how to test is the streaming layer: the event sequence your frontend actually receives, the state transitions that happen across a multi-turn agent loop, the subtle timing between a tool_use and its tool_result. Most teams either skip this entirely or write flaky integration tests that hit the real API on every CI run.

There's a better way, and it comes from a realization that took us longer to arrive at than it should have.

A .jsonl recording is just a test fixture in disguise.

Once you see it that way, your production streams become a regression test suite you're building automatically, whether you meant to or not.

The problem with testing streaming frontends

Consider what you're actually testing when you write a test for an AI agent UI. You care about:

Does isStreaming flip to false after done?
Does activeTools clear when tool_result arrives?
Does a progress event at 60% actually update the progress bar?
If the server crashes mid-stream, does the UI recover?
If two tools run in parallel and the faster one resolves first, does the state machine handle out-of-order events correctly?

None of these are about the model's output. They're about your event handling logic — the state machine that consumes the stream. And yet, to test it, you either call the real API (slow, expensive, non-deterministic) or you mock fetch at the HTTP level (brittle, doesn't test real event sequences) or you just don't test it (most common).

The core insight: what you actually need is a deterministic source of real event sequences. Not mocked ones you invented — real ones that came from production, captured exactly as they happened, down to the millisecond timing between events.

That's what AgentStreamRecorder gives you.

Part 1: The Python side

Recording a stream

You're already recording every production stream with AgentStreamRecorder. Each session in production.jsonl looks like this:

{"session": "f3a2c1b0-...", "started_at": "2026-04-01T02:14:00+00:00", "t": 0}
{"t": 0.0,   "event": "token",      "data": {"text": "Here is what I found"}}
{"t": 0.052, "event": "tool_use",   "data": {"tool_name": "web_search", "tool_use_id": "tu_1", "status": "running"}}
{"t": 0.891, "event": "tool_result","data": {"tool_name": "web_search", "tool_use_id": "tu_1", "duration_ms": 839, "status": "done"}}
{"t": 1.204, "event": "done",       "data": {"num_turns": 1, "tool_count": 1, "duration_ms": 1204}}
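For readers who want to see the shape without installing anything, here is a sketch of how a recording in this format could be grouped into sessions with the standard library. `parse_sessions` is a hypothetical helper, not the library's own `load_sessions`. It assumes only what the sample shows: a header line carrying a `session` key, followed by that session's event lines.

```python
import json


def parse_sessions(jsonl_text: str) -> list[dict]:
    """Group JSONL lines into sessions: a header dict plus an `events` list."""
    sessions: list[dict] = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "session" in record:
            # A header line starts a new session.
            sessions.append({**record, "events": []})
        elif sessions:
            # Any other line is an event belonging to the latest session.
            sessions[-1]["events"].append(record)
    return sessions


SAMPLE = "\n".join(
    [
        '{"session": "s-1", "started_at": "2026-04-01T00:00:00+00:00", "t": 0}',
        '{"t": 0.0, "event": "token", "data": {"text": "hi"}}',
        '{"t": 0.5, "event": "done", "data": {"num_turns": 1}}',
        '{"session": "s-2", "started_at": "2026-04-01T00:01:00+00:00", "t": 0}',
        '{"t": 0.0, "event": "done", "data": {"num_turns": 1}}',
    ]
)
sessions = parse_sessions(SAMPLE)
event_names = [[e["event"] for e in s["events"]] for s in sessions]
```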

Turning a recording into a pytest fixture

The load_sessions function from the CLI module reads any .jsonl file into a list of session dicts. From there, you can replay it as an async generator:

# tests/utils.py
import asyncio
import json
from pathlib import Path
from agent_stream.cli import load_sessions


async def replay_as_stream(path: Path, speed: float = 10_000.0):
    """Replay a .jsonl recording as an async SSE generator.

    At speed=10_000 the 1.2s recording replays in ~0.12ms.
    Use speed=1.0 to replay at real timing (rare in tests).
    """
    sessions = load_sessions(path)
    session = sessions[-1]
    prev_t = 0.0

    for record in session["events"]:
        gap = (record["t"] - prev_t) / speed
        if gap > 0:
            await asyncio.sleep(gap)
        prev_t = record["t"]
        yield f"event: {record['event']}\ndata: {json.dumps(record['data'])}\n\n"

Now your test fixture is just a file path:

# tests/conftest.py
import pytest
from pathlib import Path

@pytest.fixture
def tool_stream():
    return Path("tests/fixtures/tool_stream.jsonl")

@pytest.fixture
def multi_turn_stream():
    return Path("tests/fixtures/multi_turn.jsonl")

@pytest.fixture
def error_stream():
    return Path("tests/fixtures/connection_drop.jsonl")

Testing your agent endpoint

Here's what this looks like against a real FastAPI endpoint:

# tests/test_chat_endpoint.py
import pytest
from httpx import ASGITransport, AsyncClient
from agent_stream.cli import load_sessions
from tests.utils import replay_as_stream


@pytest.mark.asyncio  # assumes pytest-asyncio; redundant with asyncio_mode = "auto"
async def test_tool_stream_returns_all_events(app, tool_stream):
    """Every event in the recording should appear in the response."""
    sessions = load_sessions(tool_stream)
    expected_events = [e["event"] for e in sessions[-1]["events"]]

    # Patch the agent to replay the recording instead of calling Anthropic
    async def mock_agent(message: str):
        async for sse in replay_as_stream(tool_stream):
            yield sse

    app.state.agent = mock_agent

    # httpx 0.28 removed the app= shortcut; route requests through an ASGI transport
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        response_events = []
        async with client.stream("POST", "/chat", json={"message": "test"}) as resp:
            async for line in resp.aiter_lines():
                if line.startswith("event: "):
                    response_events.append(line.removeprefix("event: "))

    assert response_events == expected_events
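The test above compares event names only. If you also want to compare payloads, a small SSE frame parser helps. This is a sketch that handles just the `event:` and `data:` fields that `replay_as_stream` emits, and `parse_sse` is a hypothetical helper name.

```python
import json


def parse_sse(body: str) -> list[tuple[str, dict]]:
    """Split an SSE body into (event, data) pairs; frames end with a blank line."""
    frames: list[tuple[str, dict]] = []
    for block in body.split("\n\n"):
        event = None
        data_lines: list[str] = []
        for line in block.splitlines():
            if line.startswith("event: "):
                event = line[len("event: "):]
            elif line.startswith("data: "):
                data_lines.append(line[len("data: "):])
        if event is not None:
            # Multi-line data fields are joined with newlines before decoding.
            payload = json.loads("\n".join(data_lines)) if data_lines else {}
            frames.append((event, payload))
    return frames


body = (
    'event: token\ndata: {"text": "hi"}\n\n'
    'event: done\ndata: {"num_turns": 1}\n\n'
)
frames = parse_sse(body)
```

With this, the assertion can become a comparison of full `(event, data)` pairs against the recording.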

The regression test workflow

Here's where it gets powerful. We had a bug in our activeTools tracking: when two tools ran in parallel, the faster one's tool_result could arrive before the slower one's tool_use, and the hook would fail to remove the right name from the array.

The bug was reported by a user. We couldn't reproduce it locally. We didn't know when it had first appeared.

With AgentStreamRecorder already running in production, we had the exact session. We extracted it:

# Find the session
agent-stream replay production.jsonl --list

# SESSION                                STARTED                   EVENTS   DURATION  TYPES
# f3a2c1b0-...                           2026-03-28T14:22:11        14        6.2s     token tool_use tool_use tool_result tool_result done

# Pull it to a fixture file (header line + 14 event lines)
grep -m1 -A14 "f3a2c1b0" production.jsonl > tests/fixtures/parallel_tools_bug.jsonl

Then we wrote the regression test before touching the code:

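import pytest
from agent_stream.cli import load_sessions


@pytest.mark.asyncio  # assumes pytest-asyncio; redundant with asyncio_mode = "auto"
async def test_parallel_tools_clear_correctly(parallel_tools_bug):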
    """Regression: tool_result for faster tool arriving before slower tool's
    tool_use should not corrupt activeTools state."""
    sessions = load_sessions(parallel_tools_bug)
    events = sessions[-1]["events"]

    # Verify the bug condition exists in our fixture
    tool_uses    = [e for e in events if e["event"] == "tool_use"]
    tool_results = [e for e in events if e["event"] == "tool_result"]
    assert len(tool_uses) == 2
    assert len(tool_results) == 2

    # The fixture should have result arriving before second use
    first_result_t = tool_results[0]["t"]
    second_use_t   = tool_uses[1]["t"]
    assert first_result_t < second_use_t, "fixture doesn't reproduce the bug condition"

    # After full replay, activeTools should be empty
    active = []
    for record in events:
        if record["event"] == "tool_use" and record["data"].get("status") == "running":
            active.append(record["data"]["tool_name"])
        elif record["event"] == "tool_result":
            name = record["data"]["tool_name"]
            if name in active:
                active.remove(name)

    assert active == [], f"activeTools not cleared: {active}"

Test failed. We fixed the bug. Test passed. The .jsonl file stays in tests/fixtures/ forever as a regression guard.
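For illustration, one way to make this kind of tracking robust is to key on `tool_use_id` instead of the tool name, and to remember results that show up before their matching `tool_use`. This is a sketch of that approach, not necessarily the fix we shipped, and `active_tools` is a hypothetical function name.

```python
def active_tools(events: list[dict]) -> list[str]:
    """Return names of tools still running after replaying `events`."""
    running: dict[str, str] = {}  # tool_use_id -> tool_name
    finished: set[str] = set()    # ids whose result has already arrived
    for record in events:
        data = record["data"]
        tool_id = data.get("tool_use_id")
        if record["event"] == "tool_use":
            # Ignore a tool_use whose result was already seen (out-of-order).
            if tool_id not in finished:
                running[tool_id] = data["tool_name"]
        elif record["event"] == "tool_result":
            finished.add(tool_id)
            running.pop(tool_id, None)
    return list(running.values())


# Two tools with the same name, one result arriving before its own tool_use.
events = [
    {"t": 0.0, "event": "tool_use",
     "data": {"tool_name": "web_search", "tool_use_id": "tu_1", "status": "running"}},
    {"t": 0.3, "event": "tool_result",
     "data": {"tool_name": "web_search", "tool_use_id": "tu_2", "status": "done"}},
    {"t": 0.4, "event": "tool_use",
     "data": {"tool_name": "web_search", "tool_use_id": "tu_2", "status": "running"}},
    {"t": 0.9, "event": "tool_result",
     "data": {"tool_name": "web_search", "tool_use_id": "tu_1", "status": "done"}},
]

mid_stream = active_tools(events[:3])
after_stream = active_tools(events)
```

Keying on the id also avoids confusion when two parallel calls share a tool name, which a name-based array cannot tell apart.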

Part 2: The React side

The Python side is straightforward because async generators are easy to swap. The React side is harder: useAgentStream calls fetch internally and reads a ReadableStream. You need to give it a fake fetch that produces a real SSE stream from your recording.

Building a mock SSE stream

// tests/utils/mockStream.ts
import { vi } from "vitest";

interface EventRecord {
  t: number;
  event: string;
  data: Record<string, unknown>;
}

interface RecordedSession {
  session: string;
  started_at: string;
  events: EventRecord[];
}

export function parseRecording(jsonl: string): RecordedSession {
  const lines = jsonl.trim().split("\n").map(l => JSON.parse(l));
  const header = lines[0];
  const events = lines.slice(1);
  return { ...header, events };
}

export function createMockSSEStream(
  events: EventRecord[],
  speed = 10_000
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();

  return new ReadableStream({
    async start(controller) {
      let prevT = 0;

      for (const record of events) {
        const gapMs = ((record.t - prevT) / speed) * 1000;
        if (gapMs > 0) {
          await new Promise(resolve => setTimeout(resolve, gapMs));
        }
        prevT = record.t;

        const sse = `event: ${record.event}\ndata: ${JSON.stringify(record.data)}\n\n`;
        controller.enqueue(encoder.encode(sse));
      }

      controller.close();
    },
  });
}

export function mockFetchWithRecording(jsonl: string): void {
  const session = parseRecording(jsonl);

  vi.stubGlobal("fetch", vi.fn().mockResolvedValue({
    ok: true,
    status: 200,
    body: createMockSSEStream(session.events),
  } as unknown as Response));
}

Writing hook tests against recordings

// tests/useAgentStream.recording.test.ts
import { beforeEach, describe, expect, it, vi } from "vitest";
import { renderHook, act, waitFor } from "@testing-library/react";
import { readFileSync } from "fs";
import { useAgentStream } from "../src/useAgentStream";
import { mockFetchWithRecording } from "./utils/mockStream";

// Load fixture files once
const TOOL_STREAM    = readFileSync("fixtures/tool_stream.jsonl", "utf-8");
const MULTI_TURN     = readFileSync("fixtures/multi_turn.jsonl", "utf-8");
const ERROR_STREAM   = readFileSync("fixtures/connection_drop.jsonl", "utf-8");
const PARALLEL_TOOLS = readFileSync("fixtures/parallel_tools_bug.jsonl", "utf-8");

describe("useAgentStream — recording-based tests", () => {

  beforeEach(() => {
    vi.restoreAllMocks();
    vi.unstubAllGlobals(); // restoreAllMocks alone does not undo vi.stubGlobal
  });

  it("sets isDone after done event", async () => {
    mockFetchWithRecording(TOOL_STREAM);

    const { result } = renderHook(() => useAgentStream());

    await act(async () => {
      result.current.startStream("/chat", { message: "test" });
    });

    await waitFor(() => expect(result.current.isDone).toBe(true));
    expect(result.current.isStreaming).toBe(false);
  });

  it("accumulates text from token events", async () => {
    mockFetchWithRecording(TOOL_STREAM);

    const { result } = renderHook(() => useAgentStream());

    await act(async () => {
      result.current.startStream("/chat", { message: "test" });
    });

    await waitFor(() => expect(result.current.isDone).toBe(true));

    // Text should be non-empty and match the tokens in the recording
    expect(result.current.text.length).toBeGreaterThan(0);
  });

  it("clears activeTools when tool_result arrives", async () => {
    mockFetchWithRecording(TOOL_STREAM);

    const { result } = renderHook(() => useAgentStream());
    const toolSnapshots: string[][] = [];

    await act(async () => {
      result.current.startStream("/chat", { message: "test" }, {
        onToolUse: () => {
          toolSnapshots.push([...result.current.activeTools]);
        },
      });
    });

    await waitFor(() => expect(result.current.isDone).toBe(true));

    // After stream, no tools should be active
    expect(result.current.activeTools).toEqual([]);
  });

  it("regression: parallel tools clear correctly", async () => {
    // This fixture contains tool_result arriving before second tool_use —
    // the exact sequence that caused the activeTools corruption bug
    mockFetchWithRecording(PARALLEL_TOOLS);

    const { result } = renderHook(() => useAgentStream());

    await act(async () => {
      result.current.startStream("/chat", { message: "test" });
    });

    await waitFor(() => expect(result.current.isDone).toBe(true));

    expect(result.current.activeTools).toEqual([]);
  });

  it("recovers from connection drop without done event", async () => {
    // This fixture was recorded from a session where the server crashed —
    // no done event, just connection close
    mockFetchWithRecording(ERROR_STREAM);

    const { result } = renderHook(() => useAgentStream());

    await act(async () => {
      result.current.startStream("/chat", { message: "test" });
    });

    // Should still resolve, not hang
    await waitFor(() => expect(result.current.isStreaming).toBe(false), {
      timeout: 2000,
    });

    // isDone should be true via synthetic done
    expect(result.current.isDone).toBe(true);
  });

  it("progress updates to correct percentage", async () => {
    mockFetchWithRecording(MULTI_TURN);

    const { result } = renderHook(() => useAgentStream());
    const progressValues: number[] = [];

    await act(async () => {
      result.current.startStream("/chat", { message: "test" }, {
        onProgress: (e) => progressValues.push(e.percentage),
      });
    });

    await waitFor(() => expect(result.current.isDone).toBe(true));

    // Progress should be monotonically increasing
    for (let i = 1; i < progressValues.length; i++) {
      expect(progressValues[i]).toBeGreaterThanOrEqual(progressValues[i - 1]);
    }

    // Final progress should be 100
    expect(progressValues[progressValues.length - 1]).toBe(100);
  });

});

Creating fixtures when you don't have production recordings yet

If you're starting fresh and don't have recordings yet, build inline fixtures manually — they're just JSONL strings:

// tests/fixtures/inline.ts

export const SIMPLE_TOOL_SESSION = `
{"session":"test-001","started_at":"2026-04-01T00:00:00+00:00","t":0}
{"t":0.0,"event":"token","data":{"text":"Searching for that"}}
{"t":0.05,"event":"tool_use","data":{"tool_name":"web_search","tool_use_id":"tu_1","input_summary":"query=test","status":"running"}}
{"t":0.89,"event":"tool_result","data":{"tool_name":"web_search","tool_use_id":"tu_1","output_summary":"3 results","duration_ms":839,"status":"done"}}
{"t":1.02,"event":"token","data":{"text":" — here are the results"}}
{"t":1.20,"event":"done","data":{"num_turns":1,"tool_count":1,"duration_ms":1200,"model":"claude-sonnet-4-6","total_cost_usd":0.004}}
`.trim();

export const CONNECTION_DROP_SESSION = `
{"session":"test-002","started_at":"2026-04-01T00:00:00+00:00","t":0}
{"t":0.0,"event":"token","data":{"text":"Let me check that"}}
{"t":0.05,"event":"tool_use","data":{"tool_name":"web_search","tool_use_id":"tu_1","input_summary":"query=test","status":"running"}}
`.trim();
// Note: no done event — simulates server crash

The CONNECTION_DROP_SESSION fixture tests your synthetic-done fallback every single CI run, for free, deterministically.
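The same fallback idea can be sketched on the Python side as a wrapper around any async event stream. The `synthetic` flag in the payload is a hypothetical marker. The real hook may signal a synthetic done differently.

```python
import asyncio
from typing import AsyncIterator


async def with_synthetic_done(events: AsyncIterator[dict]) -> AsyncIterator[dict]:
    """Pass events through; append a done event if the stream ends without one."""
    saw_done = False
    async for record in events:
        saw_done = saw_done or record["event"] == "done"
        yield record
    if not saw_done:
        # ASSUMPTION: payload shape chosen for illustration only.
        yield {"event": "done", "data": {"synthetic": True}}


async def dropped_stream() -> AsyncIterator[dict]:
    """Mirrors CONNECTION_DROP_SESSION: a token, a tool_use, then nothing."""
    yield {"event": "token", "data": {"text": "Let me check that"}}
    yield {
        "event": "tool_use",
        "data": {"tool_name": "web_search", "tool_use_id": "tu_1", "status": "running"},
    }


async def collect() -> list[dict]:
    return [record async for record in with_synthetic_done(dropped_stream())]


received = asyncio.run(collect())
received_names = [record["event"] for record in received]
```

A consumer wrapped this way always sees a terminal event, so UI state such as a streaming flag has a guaranteed point at which to reset.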

What this unlocks

Once you have recordings as test fixtures, a few things change:

CI requires zero API keys. Your entire test suite runs without touching any LLM provider. No rate limits, no cost, no flakiness from model behavior changes. The recordings are checked into git, so they run identically on every developer's machine and in every CI job.

Bugs arrive pre-packaged as regression tests. When a user reports something broke, you ask for the session ID (or pull it from your recording file by timestamp). The .jsonl is the reproduction case. You don't need to reproduce it — it's already reproduced. You write the failing assertion, fix the code, and the recording stays in tests/fixtures/ as a permanent guard.

You can test timing-sensitive behavior. Real recordings capture the actual millisecond intervals between events. If you need to test what happens when a tool call takes 4 seconds (timeout behavior, spinner state, etc.), you don't need to sleep(4) in your test — you have a real recording where it took 4 seconds, and you replay it at 1x speed.

Behavioral coverage grows automatically. Every unusual production session — rare event sequences, edge case tool combinations, partial streams — becomes a potential test fixture. The set of behaviors you're testing grows as your product is used, without anyone writing new tests.

The production → fixture workflow

To make this routine:

# Morning after a bug report
agent-stream replay production.jsonl --list

# Find the timestamp that matches the report
# SESSION                      STARTED               EVENTS  DURATION  TYPES
# f3a2c1b0-...                 2026-04-01T02:14:11   14      6.2s      token tool_use tool_use tool_result tool_result done

# Extract to a named fixture
grep -m1 -A100 "f3a2c1b0" production.jsonl | head -15 > tests/fixtures/parallel_tools_bug.jsonl

# Replay it locally to confirm you have the right session
agent-stream replay tests/fixtures/parallel_tools_bug.jsonl --list

# Write the test (red), fix the code (green), commit both

The fixture file goes in version control alongside the test. Anyone who clones the repo gets the reproduction case. The CI run that first passes on this test is the proof of fix.
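
The grep and head combination above depends on knowing the session's line count in advance. As an alternative, here is a small Python sketch that cuts a session out by ID prefix. It assumes the format described in this series (a header line carrying a "session" key, followed by that session's events) and that sessions are not interleaved in the file. `extract_session` is a hypothetical helper, not a command the agent-stream CLI provides.

```python
import json


def extract_session(lines, session_prefix):
    """Return the lines of the first session whose ID starts with session_prefix.

    A session runs from its header line (the one with a "session" key)
    up to, but not including, the next header line.
    """
    out = []
    capturing = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        record = json.loads(line)
        if "session" in record:
            if capturing:
                break  # the next session has started, so we are done
            capturing = record["session"].startswith(session_prefix)
        if capturing:
            out.append(line)
    return out
```

Writing the returned lines to tests/fixtures/ gives the same result as the shell pipeline, without counting events first.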

Getting started

pip install agent-event-stream
npm install @agent-stream/react

Add two lines to your FastAPI endpoint to start recording:

from agent_stream.recorder import AgentStreamRecorder

recorder = AgentStreamRecorder("production.jsonl")

@app.post("/chat")
async def chat(req: ChatRequest):
    async def generate():
        async for sse_str in recorder.record(run_agent(req.message)):
            yield sse_str
    return agent_stream_response(generate())

After a day in production, you'll have enough sessions to build your first fixture set. After a month, you'll have coverage for behaviors you never thought to test manually.

→ github.com/abhichat85/agent-stream

I can now replay any AI agent stream from production. Here's how.

Abhishek Chatterjee — Mon, 30 Mar 2026 20:25:11 +0000

In my last post, I wrote about the four SSE bugs that break AI agent UIs at 2am — chunk boundary splits, missing token batching, hanging done states, and retry logic that retries the wrong things.

There's a fifth problem I didn't cover, because the fix didn't exist yet.

What do you do the morning after something broke?

The stream is gone. The event sequence that caused the bug evaporated the moment the connection closed. You have a user complaint, maybe a generic error log, and zero ability to reproduce the issue locally because local dev doesn't have real network conditions, real token rates, or the specific sequence of tool calls that triggered the failure.

Today I shipped AgentStreamRecorder to the agent-stream library to solve exactly this.

The debugging gap nobody talks about
When a REST API fails, you have the request and response in your logs. You can replay it with curl. You can write a regression test. The failure is reproducible.

When an AI agent stream fails mid-flight, you have nothing. The SSE connection is stateful and ephemeral. The events exist in a buffer, get consumed by the client, and are gone. If the frontend shows wrong state — tools that didn't clear, progress that froze at 60%, text that truncated — you're debugging from memory and screenshots.

We hit this at Praxiom repeatedly while building 36 production agent tools. The failure pattern was always the same:

User reports the stream "felt wrong" or the UI got stuck
You try to reproduce — works fine in dev
You add more logging — doesn't help, the issue is in the event sequence, not individual events
You fix something and hope — no regression test possible
Issue reappears three weeks later in a different context

The root problem: we had full observability on every other layer of the stack except the stream itself.

What AgentStreamRecorder does
It's a drop-in async wrapper. You add two lines to your existing endpoint and every stream gets saved to a .jsonl file automatically.

Before:

@app.post("/chat")
async def chat(req: ChatRequest):
    return agent_stream_response(run_agent(req.message))

After:

from agent_stream.recorder import AgentStreamRecorder

recorder = AgentStreamRecorder("streams/production.jsonl")

@app.post("/chat")
async def chat(req: ChatRequest):
    async def generate():
        async for sse_str in recorder.record(run_agent(req.message)):
            yield sse_str  # passes through unchanged

    return agent_stream_response(generate())

That's it. The recorder wraps the async generator, tees each SSE event to the file, and re-yields the string unchanged. Your StreamingResponse sees nothing different. The client sees nothing different. The only added cost per event is one flushed line written to the file.
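
To make the tee idea concrete, here is a rough Python sketch of what such a wrapper could look like. It is not the library's implementation: `record_stream` and `parse_sse` are hypothetical names, and the sketch assumes each yielded string is one complete SSE frame with an `event:` line and a JSON `data:` line.

```python
import json
import time
import uuid
from datetime import datetime, timezone


def parse_sse(sse_str):
    """Pull the event name and JSON payload out of one SSE frame."""
    event, data = "message", None
    for line in sse_str.splitlines():
        if line.startswith("event: "):
            event = line[len("event: "):]
        elif line.startswith("data: "):
            data = json.loads(line[len("data: "):])
    return event, data


async def record_stream(source, fh, clock=time.monotonic):
    """Tee every SSE string from `source` into `fh` as JSONL, then re-yield it."""
    start = clock()
    header = {
        "session": str(uuid.uuid4()),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "t": 0,
    }
    fh.write(json.dumps(header) + "\n")
    fh.flush()
    async for sse_str in source:
        event, data = parse_sse(sse_str)
        row = {"t": round(clock() - start, 3), "event": event, "data": data}
        fh.write(json.dumps(row) + "\n")
        fh.flush()  # keep everything up to the last event if the process dies
        yield sse_str  # passed through unchanged
```

The wrapper only reads the frame to log it; the string handed to the response is the exact object it received.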

What gets recorded
Each session in the .jsonl file starts with a header line, then one line per event:

{"session": "f3a2c1b0-...", "started_at": "2026-03-31T02:14:00+00:00", "t": 0}
{"t": 0.0,   "event": "token",      "data": {"text": "Hello, here"}}
{"t": 0.041, "event": "token",      "data": {"text": " is what I found"}}
{"t": 0.052, "event": "tool_use",   "data": {"tool_name": "web_search", "status": "running"}}
{"t": 0.894, "event": "tool_result","data": {"tool_name": "web_search", "duration_ms": 842, "status": "done"}}
{"t": 1.204, "event": "done",       "data": {"num_turns": 1, "tool_count": 1, "duration_ms": 1204}}

The t field is seconds since stream start — relative, not absolute — so files are portable across machines and time zones. The format is append-safe: each new session appends to the file with its own UUID and t=0 baseline, so you can keep one file per day and scan it with grep.

Three design decisions worth explaining:

.jsonl not binary. You can grep it. grep '"event": "error"' production.jsonl instantly shows every stream that hit an error event, with timing. Binary formats are faster to write but terrible to investigate.

Relative timestamps. Absolute wall-clock timestamps tell you when something happened. Relative timestamps tell you how long it took. Relative is almost always what you need for debugging — "tool_result came back 842ms after tool_use" is more useful than two UTC timestamps you have to subtract.

Flush after every write. The file is flushed after every line, not buffered. If the process crashes mid-stream, you still have everything up to the last event. This matters — crashes are exactly when you most need the recording.
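
As an illustration of why relative timestamps are convenient, the sketch below derives tool latencies directly from the `t` values in a recording. `tool_latencies_ms` is a hypothetical helper. It pairs events by `tool_name` in first-in, first-out order, which is an assumption that only holds when calls to the same tool do not overlap.

```python
def tool_latencies_ms(events):
    """Map tool_name -> list of milliseconds between tool_use and tool_result.

    Pairs by tool_name in FIFO order (an assumption, see the note above).
    """
    started = {}
    latencies = {}
    for ev in events:
        name = ev.get("data", {}).get("tool_name")
        if ev["event"] == "tool_use":
            started.setdefault(name, []).append(ev["t"])
        elif ev["event"] == "tool_result" and started.get(name):
            t0 = started[name].pop(0)
            latencies.setdefault(name, []).append(round((ev["t"] - t0) * 1000))
    return latencies
```

Run against the sample session above, the `web_search` call comes out at 842 ms, matching the recorded `duration_ms`.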

Replaying it
The agent-stream CLI (installed with the package) plays back recordings:

# What sessions are in this file?

agent-stream replay production.jsonl --list

SESSION                                STARTED                   EVENTS   DURATION  TYPES
--------------------------------------------------------------------------------------------------
f3a2c1b0-4e5d-...                     2026-03-31T02:14:00       8         4.21s     token tool_use tool_result done
a8b3c2d1-5f6e-...                     2026-03-31T02:31:45       12        7.83s     token thinking tool_use tool_result turn done

# Replay the most recent session at original speed
agent-stream replay production.jsonl

# Replay at 2× speed (50ms gaps become 25ms)
agent-stream replay production.jsonl --speed 2

# Replay at 0.1× speed to watch a fast tool sequence in slow motion
agent-stream replay production.jsonl --speed 0.1

The output is valid SSE piped to stdout. You can pipe it to a local dev server, feed it into a test harness, or just watch it in the terminal to see exactly what the client received.
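
The speed flag boils down to scaling the gaps between recorded `t` values. Here is a short Python sketch of that arithmetic, under the assumption that events are stored as shown earlier. `to_sse` and `replay_schedule` are hypothetical names and say nothing about how the CLI is actually written.

```python
import json


def to_sse(event):
    """Render one recorded event back into an SSE frame."""
    return f"event: {event['event']}\ndata: {json.dumps(event['data'])}\n\n"


def replay_schedule(events, speed=1.0):
    """Return [(delay_seconds, sse_string), ...] for a recorded session.

    Each delay is the gap since the previous event divided by `speed`,
    so speed=2 halves every gap and speed=0.1 stretches it tenfold.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    schedule = []
    prev_t = 0.0
    for ev in events:
        delay = max(0.0, (ev["t"] - prev_t) / speed)
        schedule.append((delay, to_sse(ev)))
        prev_t = ev["t"]
    return schedule
```

A replayer would sleep for each delay and then write the frame to stdout or to a test harness.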

The use case that made this necessary
Here's the specific failure pattern that pushed us to build this. We had a multi-turn agent that called several tools in sequence. In production with certain inputs, the activeTools array in the React hook wasn't clearing properly — a tool would finish but its name would stay in the "currently running" UI indefinitely.

Couldn't reproduce it locally. The tool sequence was always different. Adding console logs to the hook showed correct state at each individual event, but the sequence mattered.

With AgentStreamRecorder, we recorded the failing session from production, replayed it through a local frontend, and watched the hook state update in real time. Spotted the issue immediately: in one specific sequence, two tools were running in parallel and the faster one resolved first, so the tool_result events arrived in a different order than their tool_use events. The hook was looking for tool names in order; the order wasn't guaranteed.

Fifteen-minute fix. Would have taken days without the recording.
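
One way to make this class of bug impossible is to track running tools by ID instead of by arrival order. The sketch below shows the idea in Python. It is an illustration, not the actual patch to the hook, and it assumes every tool event carries a `tool_use_id`. `apply_tool_event` is a hypothetical name.

```python
def apply_tool_event(active, event):
    """Return a new dict of running tools after applying one event.

    `active` maps tool_use_id -> tool_name. Results remove their entry
    by ID, so it does not matter which parallel tool finishes first.
    """
    data = event.get("data", {})
    tool_id = data.get("tool_use_id")
    updated = dict(active)
    if event["event"] == "tool_use":
        updated[tool_id] = data["tool_name"]
    elif event["event"] == "tool_result":
        updated.pop(tool_id, None)
    return updated
```

Feeding a replayed session through a reducer like this and asserting that the final dict is empty makes a compact regression test.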

The bigger pattern: AI agents need a different kind of observability
Standard application observability — request logs, error rates, latency percentiles — doesn't map cleanly onto AI agent streams. The unit of interest isn't a request, it's a session. The signal isn't an error code, it's an event sequence. The failure mode isn't a stack trace, it's a state machine reaching an unexpected state.

AgentStreamRecorder is a small step toward stream-native observability. Each .jsonl session is a complete, replayable trace of exactly what the agent did and how long each step took. You can diff two sessions to understand why one succeeded and one failed. You can build a test suite from real production recordings. You can grep across thousands of sessions to understand behavioral patterns at scale.

None of this is revolutionary. It's the kind of thing you'd take for granted in any mature system — request tracing, structured logging, replay-from-log. We just haven't had it for agent streams.

Getting it
pip install agent-event-stream

from agent_stream.recorder import AgentStreamRecorder

Full source, spec, and React client hook at github.com/abhichat85/agent-stream.

Extracted from Praxiom — Product Intelligence that compounds. Think Cursor for Product Managers.

Why SSE for AI agents keeps breaking at 2am

Abhishek Chatterjee — Sun, 29 Mar 2026 21:18:40 +0000

Every team building AI agent UIs writes their own SSE client. And every team hits the same four bugs.

I know because we shipped 36 agent tools at Praxiom before we sat down and wrote a real protocol instead of patching the same streaming code for the fifteenth time. This is a post-mortem on the four bugs. At the end I'll show you what we extracted.

The setup
You're building a chat-style UI backed by an LLM agent. The agent calls tools, thinks for a few seconds, maybe runs multiple turns. You want the frontend to stream tokens in real-time, show "running web search..." while a tool is active, and display a progress bar for longer operations.

SSE seems like the obvious choice. It's simple. You've used it before. You write the server in an afternoon.

Then you go to production.

Bug #1: The chunk boundary
Here's the hand-rolled SSE parser most people write:

for await (const chunk of stream) {
  let currentEvent = ''; // declared per chunk: this is the bug
  const text = decoder.decode(chunk);
  const lines = text.split('\n');

  for (const line of lines) {
    if (line.startsWith('event: ')) {
      currentEvent = line.slice(7);
    } else if (line.startsWith('data: ')) {
      dispatch(currentEvent, JSON.parse(line.slice(6)));
      currentEvent = ''; // reset
    }
  }
}

This works in local dev. The event: and data: lines arrive in the same chunk because there's no network latency.

In production, under load, with a real network, a proxy, or nginx in the path — they don't.

Chunk 1 arrives: "event: token\n"

Chunk 2 arrives: "data: {\"text\":\"Hello\"}\n\n"

Your parser resets currentEvent after chunk 1. When chunk 2 arrives, currentEvent is "". The event is dropped silently. Your tokens disappear in production but never in staging.

The fix: currentEvent must survive across reader.read() calls. It's not a per-chunk variable — it's a per-stream variable. Reset it only after the data: line is dispatched, not at any chunk boundary.

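// Outside the chunk loop: both survive across reads
let currentEvent = '';
let pending = ''; // trailing partial line, completed by the next chunk

for await (const chunk of stream) {
  pending += decoder.decode(chunk, { stream: true });
  const lines = pending.split('\n');
  pending = lines.pop() ?? ''; // last piece may be an incomplete line

  for (const line of lines) {
    if (line.startsWith('event: ')) {
      currentEvent = line.slice(7);
    } else if (line.startsWith('data: ') && currentEvent) {
      // Reset ONLY after data: is dispatched
      dispatch(currentEvent, JSON.parse(line.slice(6)));
      currentEvent = ''; // reset HERE, not at chunk boundary
    }
  }
}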
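
The same rule applies on any client, not only in the browser. As a language-neutral illustration, here is a small Python sketch of an incremental parser that keeps both the pending event type and any partial line across chunks. `SSEParser` is a hypothetical class written for this post's simplified `event:` plus `data:` framing, not a full implementation of the SSE specification.

```python
import json


class SSEParser:
    """Incremental SSE parser whose state survives across chunks."""

    def __init__(self):
        self._pending = ""     # trailing partial line from the last chunk
        self._event_type = ""  # event: line seen, data: line not yet seen

    def feed(self, chunk):
        """Consume one decoded chunk; return the (event, data) pairs it completed."""
        self._pending += chunk
        # Everything before the last newline is complete; the rest waits.
        *lines, self._pending = self._pending.split("\n")
        completed = []
        for line in lines:
            if line.startswith("event: "):
                self._event_type = line[len("event: "):]
            elif line.startswith("data: ") and self._event_type:
                completed.append((self._event_type, json.loads(line[len("data: "):])))
                self._event_type = ""  # reset only after dispatch
        return completed
```

Feeding the two chunks from the example above, one at a time, yields nothing for the first call and the complete token event for the second.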

Bug #2: 30 React renders per second
Claude 3.5 Sonnet emits roughly 25–35 tokens per second. Without any batching, each token event directly updates state:

onToken: (e) => setText(prev => prev + e.text)

That's roughly 30 setState calls per second. React 18 batches updates that happen in the same tick, but token events arrive in separate ticks, so each one can trigger its own render. What you get is visible jank: the text renders choppily, other UI elements freeze, and on slower devices the whole component tree starts missing frames.

The fix isn't complicated. Accumulate tokens into a buffer and flush on an interval:

let buffer = '';
let lastFlush = Date.now();
const INTERVAL_MS = 50;

const callbacks = {
  onToken: (e) => {
    buffer += e.text;
    const now = Date.now();
    if (now - lastFlush >= INTERVAL_MS) {
      // Capture first: React may run the updater after buffer is cleared
      const pending = buffer;
      buffer = '';
      lastFlush = now;
      setText(prev => prev + pending);
    }
  },

  // On stream end, flush remainder
  onDone: () => {
    const pending = buffer;
    buffer = '';
    if (pending) setText(prev => prev + pending);
  },
};

50ms caps you at 20 renders per second: smooth to the eye at a fraction of the CPU cost. The only subtlety: make sure you flush the remainder on stream end, or the last few tokens never appear.
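
The batching logic is easy to test in isolation once the clock is injectable. Below is a Python sketch of the same accumulate-and-flush pattern. `TokenBatcher` is a hypothetical class for illustration; the 50ms default mirrors the interval used above.

```python
import time


class TokenBatcher:
    """Accumulate tokens and hand them to `on_flush` at most once per interval."""

    def __init__(self, on_flush, interval_s=0.05, clock=time.monotonic):
        self._on_flush = on_flush
        self._interval = interval_s
        self._clock = clock
        self._buffer = []
        self._last_flush = clock()

    def add(self, text):
        self._buffer.append(text)
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self):
        """Emit whatever is buffered; call this on stream end as well."""
        if self._buffer:
            self._on_flush("".join(self._buffer))
            self._buffer = []
        self._last_flush = self._clock()
```

Passing a fake clock lets a test step time forward deterministically and assert exactly which batches were emitted.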

Bug #3: The loading state that never resolves
Your server looks like this:

from fastapi.responses import StreamingResponse

async def stream_agent(request):
    async def generate():
        async for event in agent.run():
            yield emitter.token(event.text)
        yield emitter.done()  # <-- this line

    return StreamingResponse(generate(), media_type="text/event-stream")

That done event is what tells the frontend to set isStreaming = false. But what happens when the server crashes mid-stream? An unhandled exception in your agent loop. A memory error. An upstream API timeout that your error handling missed.

The done event is never emitted. The SSE connection closes. Your frontend detects the closure... and does nothing, because "connection closed" and "stream finished" look the same from the client side.

The spinner keeps spinning. The user stares at it. Eventually they reload.

The fix: synthesize a done event client-side when the connection closes without one:

// After the read loop exits normally or via error
if (!receivedDone) {
  callbacks.onDone?.({ synthetic: true });
  setState(prev => ({ ...prev, isStreaming: false, isDone: true }));
}

The UI recovers cleanly. Report the synthetic done event to your logs as a signal that something went wrong upstream.
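
The same fallback can be expressed as a thin wrapper around any event iterator. This Python sketch is one possible shape, assuming events are dicts with an "event" key and that a dropped connection surfaces as ConnectionError. `with_synthetic_done` is a hypothetical name.

```python
def with_synthetic_done(events):
    """Yield events unchanged; append a synthetic done if none was received.

    A ConnectionError from the underlying iterator is treated the same as
    a stream that simply ended without a done event.
    """
    received_done = False
    try:
        for ev in events:
            if ev["event"] == "done":
                received_done = True
            yield ev
    except ConnectionError:
        pass  # fall through to the synthetic done below
    if not received_done:
        yield {"event": "done", "data": {"synthetic": True}}
```

Downstream code then only ever has to handle one terminal event, and the `synthetic` flag tells it which kind it got.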

Bug #4: Retry logic that makes things worse
The standard reconnect implementation retries on any connection failure. But there are two very different kinds of failures:

HTTP errors (4xx/5xx): The request reached your server. The server said no — bad auth token, rate limit, your request body was malformed, the endpoint changed. Retrying the exact same request will get the exact same error. You're just hammering your own server.

Network drops: TCP connection closed mid-stream. The client never got a response, or got a partial one. This should retry — it's likely transient (user's wifi dropped, proxy timeout, load balancer cycle).

Most hand-rolled retry logic doesn't distinguish between them:

// ❌ Wrong — retries on 403, hammers server, wastes tokens
catch (error) {
  setTimeout(retry, 1000);
}

The correct split:

const response = await fetch(endpoint, options);

if (!response.ok) {
  // HTTP error: throw immediately, no retry
  throw new HttpError(response.status, await response.text());
}

// Past this point: we have a 2xx and are reading the stream
// Any failure here is a network drop → retry with backoff
try {
  await readStream(response.body);
} catch (networkError) {
  if (attempt < MAX_RETRIES) {
    await sleep(Math.pow(2, attempt) * 1000); // 1s, 2s, 4s
    return retry(attempt + 1);
  }
  throw networkError; // retries exhausted: surface it instead of swallowing it
}

HTTP errors surface immediately to the user. Network drops retry silently up to 3 times. Your error handling for a 403 Forbidden is fundamentally different from your handling for a dropped connection.
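
For comparison, here is the same policy as a small Python sketch with the sleep function injected so it can be tested without waiting. `stream_with_retry` and this `HttpError` class are hypothetical. `open_and_read` stands for any callable that opens the request, raises `HttpError` on a non-2xx status, and reads the stream to completion.

```python
import time


class HttpError(Exception):
    """The server answered, and the answer was an error status."""

    def __init__(self, status, body=""):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.body = body


def stream_with_retry(open_and_read, max_retries=3, sleep=time.sleep):
    """Run open_and_read(), retrying only dropped connections with backoff."""
    attempt = 0
    while True:
        try:
            return open_and_read()
        except HttpError:
            raise  # retrying the same request would repeat the same error
        except ConnectionError:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the failure
            sleep(2 ** attempt)  # 1s, 2s, 4s
            attempt += 1
```

With the defaults, a persistent network failure produces four attempts in total and then raises, while a 403 raises on the first attempt.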

The same five events, every time
After shipping 36 agent tools at Praxiom, we noticed something. Every tool needed to emit:

Tokens accumulating into the response text
Tool calls and their status (running → done / error)
Thinking blocks (for extended thinking models)
Progress for multi-step pipelines
A clean end signal with metadata

And every frontend needed to consume them with the same state shape: text, isStreaming, activeTools, progress, error, isDone.

We were rediscovering the same edge cases on every new tool. The token batching tweak happened three separate times before someone documented it. The chunk boundary bug was fixed in four different files.

So we extracted it.

agent-stream
A typed SSE event protocol for AI agents. Nine event types. Python emitter. React hook. JSON Schema spec.

pip install agent-event-stream
npm install @agent-stream/react

Python — emit from any async generator:

from agent_stream import AgentStreamEmitter
from agent_stream.fastapi import agent_stream_response

emitter = AgentStreamEmitter()

async def run_agent(message: str):
    # anthropic_client.stream stands in for your LLM client's streaming call
    async for chunk in anthropic_client.stream(message):
        yield emitter.token(chunk.text)

    tool_id = "tu_1"  # example ID; use one that is unique per tool call
    yield emitter.tool_use("web_search", tool_id, "searching...")
    # ... run tool ...
    yield emitter.tool_result("web_search", tool_id, "found 3 results", duration_ms=850)

    yield emitter.done(num_turns=2, tool_count=1, duration_ms=3200)

@app.post("/chat")
async def chat(req: ChatRequest):
    return agent_stream_response(run_agent(req.message))

React — full state from one hook:

const { text, isStreaming, activeTools, progress, error, isDone, startStream } =
  useAgentStream();

return (
  <div>
    <p>{text}{isStreaming && <Cursor />}</p>
    {activeTools.map(tool => <ToolBadge key={tool} name={tool} />)}
    {progress && <ProgressBar value={progress.percentage} label={progress.message} />}
    <button onClick={() => startStream('/chat', { message })}>
      Send
    </button>
  </div>
);

All four bugs above are handled in the library. Cross-chunk parsing is correct by construction. Token batching is on by default (50ms). Synthetic done fires when the server drops the connection. Retry logic distinguishes HTTP errors from network drops.

The JSON Schema spec (spec/events.schema.json) means you can implement the protocol in any language. It's not a React-only thing — we have a FastAPI server and the client is a plain TypeScript class that works in any framework.

What's next
We're building more of these extracts out of Praxiom's infrastructure — the parts that turn out to be the same across every AI product. agent-stream is the first.

If you're hitting these bugs, or if you've hit others we haven't documented — open an issue. The hard-won production details are the most valuable thing we can contribute.

→ github.com/abhichat85/agent-stream

Extracted from Praxiom - www.praxiomai.xyz