Forem: Ivan Magda

Background Tasks: The One Actor in the Codebase and the SIGTERM Bug That Only Broke on Linux

Ivan Magda — Mon, 20 Apr 2026 10:14:33 +0000

Our agent can plan multi-step work with a persistent task DAG, compress its own memory, delegate to subagents, and load skills on demand — all driven by the same agent loop from the first guide. But every tool call still blocks. When the model calls bash to run a test suite that takes two minutes, the loop sits idle, waiting for the process to finish before it can do anything else. If someone asks "run the tests and while that's going, create the config file," the agent does them sequentially — tests first, config second. For fast commands this doesn't matter. For builds, installs, and test suites, it's a real bottleneck.

The fix is a background execution layer: a way to hand a slow command to a worker, get a job ID back immediately, and keep the loop moving. When the command finishes, its result goes into a notification queue. Before each API call, the loop drains that queue and injects the results as messages, so the model sees them on its next turn without the loop ever having blocked. The loop itself stays sequential; only the subprocess work runs concurrently.

In this guide, let's build BackgroundManager — the first and only actor in the entire codebase — and wire it into the agent loop with a notification injection pattern that keeps background results flowing into the model's context.

The complete source code for this stage is available at the 08-background-tasks tag on GitHub. Code blocks below show key excerpts.

Why an actor — and only one

Every other manager in our codebase — TodoManager, TaskManager, SkillLoader, ContextCompactor — is accessed exclusively from the agent loop's sequential flow. The loop calls a tool handler, the handler calls the manager, the manager returns, the loop continues. There's never a moment where two pieces of code touch the same state simultaneously.

BackgroundManager breaks that pattern. When the model calls background_run, the manager spawns a Task {} that runs the shell command asynchronously. That task might finish thirty seconds later, while the main loop is in the middle of an API call, processing other tool results, or even draining the notification queue. The task writes to jobs and notifications; the main loop reads the queue and clears it. Two concurrent execution contexts touching the same dictionaries, with at least one of them writing: that's a textbook data race, and it's exactly what Swift's actor keyword exists to prevent:

// Sources/Core/BackgroundManager.swift
public actor BackgroundManager {
  private let executor: ShellExecutor

  private var jobs: [String: BackgroundJob] = [:]
  private var notifications: [BackgroundNotification] = []
  private var runningTasks: [String: Task<Void, Never>] = [:]

  public init(executor: ShellExecutor) {
    self.executor = executor
  }
}

The actor keyword means every access to jobs, notifications, and runningTasks is serialized: code outside the actor has to await to reach them (the compiler enforces this), and the actor runs only one piece of its own code at a time. No locks, no dispatch queues; the concurrency safety is structural. And because ShellExecutor is a Sendable struct with only a let stored property, the actor can store it and use it from its tasks without any bridging.

The supporting types are straightforward value types. BackgroundJob tracks a command's lifecycle: its ID, the full command and a short preview of it, its status, and its eventual result:

public struct BackgroundJob: Sendable, Equatable {
  public let id: String
  public let command: String
  public let commandPreview: String
  public var status: BackgroundJobStatus
  public var result: String?
}

BackgroundNotification is the message format that flows from the background into the agent loop — a snapshot of what happened:

public struct BackgroundNotification: Sendable, Equatable {
  public let jobId: String
  public let status: BackgroundJobStatus
  public let command: String
  public let result: String
}

Both are Sendable and cross the actor isolation boundary cleanly.

Job lifecycle: dispatch, execute, notify

Let's walk through what happens when the model calls background_run. The run() method creates a job record, spawns a Task {} to execute the command, and returns immediately with a confirmation string:

public func run(
  command: String,
  timeout: TimeInterval = Limits.backgroundTimeout
) -> String {
  let jobId = String(UUID().uuidString.prefix(8)).lowercased()
  let commandPreview = String(command.prefix(Limits.backgroundCommandPreview))
  jobs[jobId] = BackgroundJob(
    id: jobId, command: command, commandPreview: commandPreview, status: .running
  )

  let task = Task {
    let status: BackgroundJobStatus
    let output: String

    do {
      let result = try await self.executor.execute(command, timeout: timeout)

      if result.exitCode != 0 {
        status = .error
      } else {
        status = .completed
      }

      output = result.formatted
    } catch ShellExecutorError.timeout {
      status = .timeout
      output = "Error: Timeout (\(Int(timeout))s)"
    } catch {
      status = .error
      output = "Error: \(error)"
    }

    self.complete(jobId: jobId, status: status, output: output)
  }
  runningTasks[jobId] = task

  return "Background job \(jobId) started: \(commandPreview)"
}

One thing to keep in mind here is that the Task {} inside the actor inherits the actor's isolation because it captures self, so self.executor and self.complete() are accessible directly, and complete() is called without await. The closure holds a strong reference to self, which is fine here: the actor's lifetime is managed by its owner, and each task releases the reference as soon as it finishes. (A [weak self] capture would also compile, since actors are reference types, but it isn't needed.) When the shell command finishes, the task calls self.complete(), which updates the job status and enqueues a notification:

private func complete(
  jobId: String,
  status: BackgroundJobStatus,
  output: String
) {
  jobs[jobId]?.status = status
  jobs[jobId]?.result = output

  notifications.append(
    BackgroundNotification(
      jobId: jobId,
      status: status,
      command: jobs[jobId]?.commandPreview ?? "",
      result: String(output.prefix(Limits.backgroundResultPreview))
    )
  )

  runningTasks.removeValue(forKey: jobId)
}

The notification carries a truncated preview of the output — enough for the model to understand what happened without flooding the context. The full result is stored in jobs[jobId]?.result for retrieval via background_check.

Draining the queue is a single atomic operation — read everything, then clear:

public func drainNotifications() -> [BackgroundNotification] {
  let result = notifications
  notifications.removeAll()
  return result
}

Because this runs inside the actor, the read-and-clear is serialized with respect to complete(). A notification can never be half-written when we drain, and it can never be lost between the read and the clear.
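To watch the dispatch, complete, and drain cycle end to end without a shell, here is a stripped-down, self-contained sketch. It is not the article's BackgroundManager: an async closure stands in for ShellExecutor, there is no timeout or error path, and the names JobStatus, JobNotification, MiniBackgroundManager, and the waitForAll() test helper are hypothetical.

```swift
import Foundation

// Hypothetical status enum for this sketch; raw values are assumed.
enum JobStatus: String, Sendable {
    case running, completed, error, timeout
}

struct JobNotification: Sendable, Equatable {
    let jobId: String
    let status: JobStatus
    let result: String
}

/// Stripped-down stand-in for BackgroundManager. An async closure replaces
/// ShellExecutor so the dispatch -> complete -> drain cycle runs anywhere.
actor MiniBackgroundManager {
    private var statuses: [String: JobStatus] = [:]
    private var notifications: [JobNotification] = []
    private var runningTasks: [String: Task<Void, Never>] = [:]

    /// Records the job, starts the work, and returns the job ID immediately.
    func run(_ work: @escaping @Sendable () async -> String) -> String {
        let jobId = String(UUID().uuidString.prefix(8)).lowercased()
        statuses[jobId] = .running
        // The Task captures self, so it inherits this actor's isolation and
        // calls complete() without await. Its body cannot start until run()
        // returns, so runningTasks is populated before complete() runs.
        runningTasks[jobId] = Task {
            let output = await work()
            self.complete(jobId: jobId, status: .completed, output: output)
        }
        return jobId
    }

    private func complete(jobId: String, status: JobStatus, output: String) {
        statuses[jobId] = status
        notifications.append(JobNotification(jobId: jobId, status: status, result: output))
        runningTasks.removeValue(forKey: jobId)
    }

    func status(of jobId: String) -> JobStatus? {
        statuses[jobId]
    }

    /// Read-and-clear with no suspension point in between.
    func drainNotifications() -> [JobNotification] {
        let result = notifications
        notifications.removeAll()
        return result
    }

    /// Test helper, not part of the article's API: awaits every in-flight job.
    func waitForAll() async {
        for task in Array(runningTasks.values) {
            await task.value
        }
    }
}
```

Because drainNotifications() contains no await, no complete() call can slip in between copying the array and clearing it.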

Notification injection: bridging background to model

The background manager accumulates results, but the model can't see them until they're injected into the messages array. That injection happens in drainBackgroundNotifications, which runs in the agent loop before each API call:

func drainBackgroundNotifications(_ messages: [Message]) async -> [Message] {
  let notifications = await backgroundManager.drainNotifications()
  guard !notifications.isEmpty else {
    return messages
  }

  let text =
    notifications
    .map { "[bg:\($0.jobId)] \($0.status.rawValue): \($0.result)" }
    .joined(separator: "\n")

  var result = messages
  let wrappedText = "<background-results>\n\(text)\n</background-results>"

  if let lastMessage = result.last, lastMessage.role == .user {
    var updatedContent = lastMessage.content
    updatedContent.append(.text(wrappedText))
    result[result.count - 1] = Message(role: .user, content: updatedContent)
  } else {
    result.append(.user(wrappedText))
  }

  result.append(.assistant("Noted background results."))
  return result
}

The <background-results> XML wrapper gives the model a clear signal that these are asynchronous completions, not user input. The if let lastMessage check handles the API's alternation requirement — if the last message is already a user message (which it is after tool results are appended), the background results get appended to that message's content rather than creating a consecutive user turn. A synthetic assistant acknowledgment follows so the next user message has a proper assistant turn before it.

The loop wires it alongside compaction, both running before the API call:

while true {
  try Task.checkCancellation()
  // ...
  messages = await applyCompaction(messages)
  if config.drainBackground {
    messages = await drainBackgroundNotifications(messages)
  }

  let request = APIRequest(
    model: model, maxTokens: Limits.defaultMaxTokens,
    system: systemPrompt, messages: messages, tools: config.tools
  )
  let response = try await apiClient.createMessage(request: request)
  // ... process tools, append results, continue
}

That config.drainBackground flag is the subagent guard. During development, an early version ran drainBackgroundNotifications in every agentLoop call — including subagent loops. A subagent running a quick research task would consume background notifications meant for the main agent. The results were gone before the main loop ever saw them. The fix: LoopConfig.default sets drainBackground: true, while LoopConfig.subagent sets it to false:

static let `default` = LoopConfig(
  tools: Agent.toolDefinitions,
  maxIterations: .max,
  enableNag: true,
  drainBackground: true,
  label: "agent"
)

static let subagent = LoopConfig(
  tools: Agent.toolDefinitions.filter {
    !subagentExcludedTools.contains($0.name)
  },
  maxIterations: 30,
  enableNag: false,
  drainBackground: false,
  label: "subagent"
)

The subagent's excluded tools now also include background_run and background_check — a subagent shouldn't be able to spawn background work at all, since it can't drain the results.
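Presumably the two denylist changes combine into a single set. A minimal sketch of the filter using plain tool names instead of ToolDefinition values; the set contents come from this guide and the task-system guide, while the function name subagentToolNames is hypothetical:

```swift
/// Keeps the main agent's tool order while dropping tools a subagent must
/// not see: the task-system denylist plus the two background tools.
func subagentToolNames(from names: [String]) -> [String] {
    let excluded: Set<String> = [
        "agent", "todo", "compact", "task_create", "task_update",
        "background_run", "background_check",
    ]
    return names.filter { !excluded.contains($0) }
}

let allToolNames = [
    "bash", "read_file", "write_file", "edit_file", "todo", "agent",
    "load_skill", "compact", "task_create", "task_update", "task_list",
    "task_get", "background_run", "background_check",
]
```

A subagent ends up with the file and shell tools, load_skill, and the two read-only task tools.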

With that in place, our agent can hand off slow commands and keep working. Two new entries in the dispatch dictionary, one new actor, and a three-line drain check in the loop — the background execution layer is complete.

The Linux SIGTERM saga

The ShellExecutor gained a timeout parameter for background commands (defaulting to 300 seconds). The timeout mechanism uses DispatchSource.makeTimerSource(), a GCD timer that fires once after the deadline and terminates the process. An earlier design considered Task.sleep, but there's a subtle problem: try? await Task.sleep(for:) swallows CancellationError. When the process finishes normally and the sleep task is cancelled, the sleep throws at once, try? discards the error, and execution falls through to the code that sets a timeout flag and kills a process that already exited. DispatchSource avoids this: timer.cancel() prevents any further invocation of the event handler, so a timer cancelled after the process exits never fires (a handler that had already started is not interrupted).
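The Task.sleep trap is easy to reproduce. A minimal sketch, with hypothetical function names, where cancelling the watchdog task models "the process finished before the deadline"; the second function shows that handling the error explicitly also avoids the fall-through, although the article went with DispatchSource instead:

```swift
/// Naive watchdog: `try?` turns the CancellationError thrown by Task.sleep
/// into nil, so execution falls through and the watchdog "fires" anyway.
func naiveWatchdogFiresAfterCancel() async -> Bool {
    let watchdog = Task { () -> Bool in
        try? await Task.sleep(nanoseconds: 5_000_000_000)
        return true  // reached right after cancel(): this is the bug
    }
    watchdog.cancel()  // "the process exited normally"
    return await watchdog.value
}

/// Explicit handling: cancellation skips the timeout path entirely.
func explicitWatchdogFiresAfterCancel() async -> Bool {
    let watchdog = Task { () -> Bool in
        do {
            try await Task.sleep(nanoseconds: 5_000_000_000)
            return true  // only reached if the full timeout elapsed
        } catch {
            return false  // cancelled: the process already exited
        }
    }
    watchdog.cancel()
    return await watchdog.value
}
```

Both functions return almost immediately: a cancelled task's Task.sleep throws without waiting out the duration.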

The first version called process.terminate() in the timer handler — SIGTERM. On macOS, this worked perfectly: bash received SIGTERM, the child process died, the timeout was detected. Then the Linux build ran, and three timeout scenarios broke.

The root cause: macOS and Linux bash handle SIGTERM differently when waiting on a foreground child. macOS bash exits promptly. Linux bash defers the signal until the child process finishes on its own. A bash -c "sleep 10" with a 2-second timeout would run for the full 10 seconds on Linux because bash ignored the SIGTERM while waiting for sleep.

The fix came in two parts. First, process.interrupt() replaced process.terminate() — SIGINT instead of SIGTERM. Bash forwards SIGINT to the child process group on both platforms. Second, the timeout detection itself changed from signal-based to elapsed-time:

let startTime = DispatchTime.now()
// ... process runs, timer fires interrupt() if needed ...
process.waitUntilExit()
timer?.cancel()

if let timeout {
  let elapsedSeconds =
    Double(
      DispatchTime.now().uptimeNanoseconds - startTime.uptimeNanoseconds
    ) / 1_000_000_000
  if elapsedSeconds >= timeout {
    throw ShellExecutorError.timeout(seconds: Int(timeout))
  }
}

The original detection checked process.terminationReason == .uncaughtSignal && process.terminationStatus == SIGTERM, a check that was fragile across platforms and depended on bash's specific signal-handling behavior. Comparing elapsed time on a monotonic clock (DispatchTime measures system uptime, not wall-clock time) is platform-independent and unambiguous: if the process ran at least as long as the timeout, it is reported as timed out, regardless of how it actually exited.
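Putting the pieces together, here is a self-contained sketch of the timeout path: a one-shot DispatchSource timer that sends SIGINT through interrupt(), followed by elapsed-time detection after waitUntilExit(). It is not the article's ShellExecutor; runWithTimeout and SketchTimeoutError are hypothetical names, and output capture is omitted.

```swift
import Foundation
import Dispatch

/// Hypothetical error type mirroring ShellExecutorError.timeout(seconds:).
enum SketchTimeoutError: Error, Equatable {
    case timeout(seconds: Int)
}

/// Runs `/bin/sh -c command` and returns its exit status. A one-shot GCD
/// timer interrupts the process at the deadline; whether it timed out is
/// decided afterwards from elapsed monotonic time, not from the signal.
func runWithTimeout(_ command: String, timeout: TimeInterval) throws -> Int32 {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/bin/sh")
    process.arguments = ["-c", command]

    let startTime = DispatchTime.now()
    try process.run()

    let timer = DispatchSource.makeTimerSource(queue: DispatchQueue.global())
    timer.schedule(deadline: .now() + timeout)  // fires once, no repeat interval
    timer.setEventHandler {
        if process.isRunning {
            process.interrupt()  // SIGINT rather than SIGTERM
        }
    }
    timer.resume()

    process.waitUntilExit()
    timer.cancel()  // no further handler invocations after this point

    let elapsedSeconds =
        Double(DispatchTime.now().uptimeNanoseconds - startTime.uptimeNanoseconds)
        / 1_000_000_000
    if elapsedSeconds >= timeout {
        throw SketchTimeoutError.timeout(seconds: Int(timeout))
    }
    return process.terminationStatus
}
```

One consequence of elapsed-time detection: a command that exits on its own just after the deadline is also reported as a timeout. Even if a shell defers the signal, the result is still classified correctly; it just arrives later.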

Taking it for a spin

Let's build and run:

swift build && swift run agent

Try: Run "sleep 5 && echo done" in the background, then create a file called hello.txt with "world" in it. Watch the tool calls — the agent should call background_run, get a job ID back immediately, then proceed to create the file without waiting. A few seconds later, when the sleep finishes, the [background] 1 result(s) injected message should appear as the drain fires before the next API call.

For something more realistic: Run the test suite in the background with "swift test" and while it runs, read Package.swift and summarize the dependencies. The agent works on the summary while the tests execute in parallel. When the tests finish, their results are injected into the messages before the model's next API call.

To see the check tool: Start three background tasks: "sleep 2", "sleep 4", "sleep 6". Then check all background jobs. The first check should show a mix of completed and running jobs. A second check a few seconds later should show all three completed.

The capstone: 14 tools, one loop

We've reached the end of the series. Let's take stock of what we've built.

The agent now has 14 tools across eight stages: bash, read_file, write_file, edit_file, todo, agent, load_skill, compact, task_create, task_update, task_list, task_get, background_run, background_check. It can run shell commands, manipulate files, track its own work, delegate to subagents, load specialized knowledge, compress its memory, plan with a dependency graph, and execute slow commands in the background. The Agent type grew from a placeholder caseless enum in stage 0 to 849 lines — and the agent loop at its center is structurally unchanged from the first guide.

That's the thesis we set out to test. Claude Code's effectiveness comes from architectural restraint: a small set of excellent tools, thin orchestration, and heavy reliance on the model. The loop is the invariant — API call, check stop reason, process tool uses, append results, repeat. Every new capability was added the same way: define the tool, write the handler, add an entry to the dispatch dictionary. The loop never needed a new branch, a new state machine, or a different control flow. New behaviors arrived as injection points around the loop — nag reminders after tool processing, compaction before the API call, background drain alongside it — but the kernel itself held steady.

The one actor in the codebase exists because one type genuinely needed concurrent access to shared state. Everything else — classes, structs, enums — uses the simplest concurrency model that works. No over-architecture, no speculative abstractions, no framework. Just a loop, a dictionary, and a model that knows what to do with the tools it's given. Thanks for reading!

The complete series on ivanmagda.dev:

Part 0: Bootstrapping the project
Part 1: The agent loop
Part 2: Tool dispatch
Part 3: Self-managed task tracking
Part 4: Subagents
Part 5: Skill loading
Part 6: Context compaction
Part 7: Task system
Part 8: Background tasks ← you are here

Stack: Swift 6.2, AsyncHTTPClient (not URLSession), raw HTTP to the Anthropic Messages API. No SDK.

Source code: github.com/ivan-magda/swift-claude-code

That's the series. Eight stages, 14 tools, one loop that never changed. If you made it this far — thank you. Drop a comment about what you'd build next, what surprised you, or what you'd do differently. I read everything.

Task System: A File-Based DAG That Survives Context Compaction

Ivan Magda — Sat, 18 Apr 2026 22:13:01 +0000

In the previous guide, we built a three-layer compaction strategy that lets the agent run indefinitely. That's a major capability — but it comes with a cost. Compaction is lossy. When auto-compact fires, the agent's entire conversation history collapses into a two-message summary. The gist survives — what was accomplished, which files were touched, key decisions — but the specifics vanish. If the agent was halfway through a twelve-step refactoring plan, the summary might preserve "refactoring in progress" while losing exactly which steps are done, which are blocked, and what comes next. The agent's plan evaporates along with the context that held it.

This is a different class of problem from what we've tackled before. The agent doesn't need a better compression algorithm — it needs state that lives outside the context window entirely. State on disk. If a task is written to a JSON file in a .tasks/ directory, no amount of compaction can erase it. The filesystem becomes the agent's durable memory — a place to store plans that survive compression, restarts, and arbitrarily long sessions.

In this guide, let's build a TaskManager that persists tasks as individual JSON files, wires dependency edges between them, and cascades status changes through the graph. Four new tools give the model CRUD access to a task DAG that outlasts the conversation itself.

The complete source code for this stage is available at the 07-task-system tag on GitHub. Code blocks below show key excerpts.

File-per-entity persistence

The core idea is simple: each task is a standalone JSON file. The .tasks/ directory is the database, and FileManager is the query engine. Here's what the directory looks like after the agent plans a multi-step feature:

.tasks/
  task_1.json   {"id": 1, "status": "completed", "subject": "Parse config"}
  task_2.json   {"id": 2, "status": "pending",   "blockedBy": [1]}
  task_3.json   {"id": 3, "status": "pending",   "blockedBy": [1]}
  task_4.json   {"id": 4, "status": "pending",   "blockedBy": [2, 3]}

Tasks 2 and 3 depend on task 1. Task 4 depends on both 2 and 3. When the agent completes task 1, its ID is automatically removed from every other task's blockedBy list — tasks 2 and 3 become unblocked and ready to execute. When both 2 and 3 are eventually completed, task 4 unblocks. This is a directed acyclic graph encoded as bidirectional edges: blockedBy points upstream (what blocks me), blocks points downstream (what I unblock when done).
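The cascade rule can be checked in memory before any files are involved. A minimal sketch using the same four-task example; TaskNode, TaskGraph, and readyIds are hypothetical names, and the real TaskManager applies the same rule to JSON files on disk:

```swift
struct TaskNode {
    let id: Int
    var completed = false
    var blockedBy: [Int] = []  // upstream: what blocks me
    var blocks: [Int] = []     // downstream: what I unblock when done
}

struct TaskGraph {
    private(set) var nodes: [Int: TaskNode] = [:]

    mutating func add(_ id: Int) {
        nodes[id] = TaskNode(id: id)
    }

    /// Records one dependency in both directions.
    mutating func addDependency(_ task: Int, blockedBy blocker: Int) {
        nodes[task]?.blockedBy.append(blocker)
        nodes[blocker]?.blocks.append(task)
    }

    /// Marks a task completed and removes its ID from every blockedBy list.
    mutating func complete(_ id: Int) {
        nodes[id]?.completed = true
        for key in Array(nodes.keys) {
            nodes[key]?.blockedBy.removeAll { $0 == id }
        }
    }

    /// Pending tasks with nothing left upstream, sorted by ID.
    var readyIds: [Int] {
        nodes.values
            .filter { !$0.completed && $0.blockedBy.isEmpty }
            .map { $0.id }
            .sorted()
    }
}
```

Completing 1 makes 2 and 3 ready; task 4 only becomes ready after both 2 and 3 are done.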

The file-per-entity approach has a key advantage over a single tasks.json file: operations on one task never risk corrupting another. A failed write to task_3.json leaves tasks 1, 2, and 4 untouched. And because each file is a complete Codable struct, there's no parsing ambiguity — JSONDecoder either succeeds or it doesn't.

The AgentTask model

Let's start with the data model. TaskStatus is a raw-value enum that mirrors the lifecycle the model sees in tool descriptions — pending, in_progress, completed — with a display marker for the list view:

// Sources/Core/TaskManager.swift
public enum TaskStatus: String, Sendable, Equatable, Codable {
  case pending
  case inProgress = "in_progress"
  case completed

  public var marker: String {
    switch self {
    case .pending: "[ ]"
    case .inProgress: "[>]"
    case .completed: "[x]"
    }
  }
}
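The raw value is both what the model sends and what lands on disk, which is why update() can validate input with TaskStatus(rawValue:). A quick self-contained check; the enum is copied from above, and StatusBox and encodedStatus are hypothetical helpers used only to show the encoding:

```swift
import Foundation

enum TaskStatus: String, Sendable, Equatable, Codable {
    case pending
    case inProgress = "in_progress"
    case completed
}

// Wrapper so the status is encoded as an object field, as in task_N.json.
struct StatusBox: Codable, Equatable {
    let status: TaskStatus
}

func encodedStatus(_ status: TaskStatus) throws -> String {
    let data = try JSONEncoder().encode(StatusBox(status: status))
    return String(decoding: data, as: UTF8.self)
}
```

With this encoding, .inProgress is written as {"status":"in_progress"}, and a model that sends "inProgress" gets TaskError.invalidStatus back from update() instead of a silently wrong status.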

The AgentTask struct captures everything the model needs to reason about a task — its identity, what it does, where it stands, and how it relates to other tasks:

public struct AgentTask: Sendable, Equatable, Codable {
  public let id: Int
  public let subject: String
  public let description: String
  public fileprivate(set) var status: TaskStatus
  public fileprivate(set) var blockedBy: [Int]
  public fileprivate(set) var blocks: [Int]
  public let owner: String
}

The fileprivate(set) on status, blockedBy, and blocks means only code within TaskManager.swift can mutate these fields. External code sees them as read-only — the struct's API surface is narrow by design. The owner field anticipates multi-agent work in later stages, where tasks might be assigned to specific subagents.

TaskManager: CRUD and auto-incrementing IDs

TaskManager owns the .tasks/ directory and provides the CRUD operations that tool handlers call. The initializer creates the directory if needed and recovers the next available ID by scanning existing files:

public final class TaskManager {
  private let directory: String
  private var nextId: Int

  public init(directory: String) {
    self.directory = directory

    let fm = FileManager.default
    if !fm.fileExists(atPath: directory) {
      try? fm.createDirectory(
        atPath: directory,
        withIntermediateDirectories: true
      )
    }

    self.nextId = Self.maxId(in: directory) + 1
  }
}

That maxId scan is what makes IDs survive restarts. It parses task_N.json filenames, extracts the integer from each, and takes the maximum. If the .tasks/ directory contains task_1.json through task_5.json, nextId starts at 6 — regardless of whether the agent process was restarted, the context was compacted, or even the machine rebooted between sessions. Files that don't match the naming convention are silently skipped.
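The maxId(in:) helper itself isn't shown. A plausible sketch of the scan described above, split so the parsing can be tested without touching the filesystem; both maxTaskId functions are hypothetical names, not the article's code:

```swift
import Foundation

/// Highest N among names of the form task_N.json; 0 when there are none.
/// Names that don't follow the convention are skipped.
func maxTaskId(fromFileNames names: [String]) -> Int {
    names.compactMap { name -> Int? in
        guard name.hasPrefix("task_"), name.hasSuffix(".json") else { return nil }
        let digits = name.dropFirst("task_".count).dropLast(".json".count)
        return Int(digits)
    }
    .max() ?? 0
}

/// Directory wrapper; a missing or unreadable directory counts as empty.
func maxTaskId(in directory: String) -> Int {
    let names = (try? FileManager.default.contentsOfDirectory(atPath: directory)) ?? []
    return maxTaskId(fromFileNames: names)
}
```

Returning 0 for an empty directory means the first task on a fresh project gets ID 1, matching the nextId = maxId + 1 line in the initializer.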

Creating a task follows the expected pattern — build the struct, write it, bump the counter:

public func create(subject: String, description: String = "") throws -> String {
  let task = AgentTask(id: nextId, subject: subject, description: description)
  let json = try saveAndSerialize(task)
  nextId += 1
  return json
}

The method returns pretty-printed JSON so the model sees exactly what was persisted. Every mutation method follows this pattern — perform the operation, return the serialized result as a tool response.
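The save and saveAndSerialize helpers aren't part of the excerpt. A sketch of what such a helper could look like, assuming pretty-printed, key-sorted JSON and an atomic write (Data.WritingOptions.atomic writes to an auxiliary file first and then replaces the target, so a crash mid-write can't leave a truncated task file). SketchTask and saveAndSerialize(_:in:) are hypothetical simplifications:

```swift
import Foundation

// Simplified stand-in for AgentTask; field names follow the article.
struct SketchTask: Codable, Equatable {
    let id: Int
    let subject: String
    var status: String = "pending"
    var blockedBy: [Int] = []
    var blocks: [Int] = []
}

/// Writes the task to `<directory>/task_<id>.json` atomically and returns
/// the same JSON text, ready to hand back to the model as a tool result.
func saveAndSerialize(_ task: SketchTask, in directory: URL) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
    let data = try encoder.encode(task)
    let url = directory.appendingPathComponent("task_\(task.id).json")
    try data.write(to: url, options: .atomic)
    return String(decoding: data, as: UTF8.self)
}
```

Returning the exact bytes that were written keeps the tool response and the file on disk from ever disagreeing.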

Dependency resolution: cascading unblock

The interesting mechanism is what happens when a task completes. If task 1's blocks array is [2, 3], completing task 1 needs to remove 1 from the blockedBy arrays of tasks 2 and 3. This is removeCompletedDependency, the cascading unblock:

private func removeCompletedDependency(for completedId: Int) {
  let fm = FileManager.default
  guard let files = try? fm.contentsOfDirectory(atPath: directory) else {
    return
  }

  for file in files where file.hasPrefix("task_") && file.hasSuffix(".json") {
    let path = (directory as NSString).appendingPathComponent(file)
    guard
      let data = fm.contents(atPath: path),
      var task = try? JSONDecoder().decode(AgentTask.self, from: data)
    else {
      continue
    }

    if task.blockedBy.contains(completedId) {
      task.blockedBy.removeAll { $0 == completedId }
      try? save(task)
    }
  }
}

The method scans every task file in the directory, checks whether it references the completed ID, and removes that reference if present. The cascade is triggered inside update when the status changes to .completed:

public func update(
  taskId: Int,
  status: String? = nil,
  addBlockedBy: [Int] = [],
  addBlocks: [Int] = []
) throws -> String {
  var task = try load(taskId)

  if let status {
    guard let newStatus = TaskStatus(rawValue: status) else {
      throw TaskError.invalidStatus(status)
    }
    task.status = newStatus
  }

  try applyBlockedBy(addBlockedBy, to: &task)
  try applyBlocks(addBlocks, to: &task)
  let json = try saveAndSerialize(task)

  if task.status == .completed {
    removeCompletedDependency(for: task.id)
  }

  return json
}

The ordering matters: save the updated task first, then cascade. If the cascade fails partway through, the completing task is still correctly marked as completed; at worst, some dependents keep a stale blockedBy entry, which can be repaired by marking the task completed again, since every update to .completed reruns the cascade.

Wiring into the agent

With TaskManager ready, let's connect it. The agent creates the manager alongside its other dependencies, and the system prompt gains a line telling the model that task tools exist and survive compaction:

// Sources/Core/Agent.swift
self.taskManager = TaskManager(directory: "\(workingDirectory)/.tasks")

// In buildSystemPrompt:
- Use task tools for persistent multi-step work with dependencies. \
Tasks survive context compaction and process restarts.

Four tool definitions go into the toolDefinitions array. Here's task_create — the most representative:

ToolDefinition(
  name: "task_create",
  description: "Create a persistent task. Tasks survive context compaction and process restarts.",
  inputSchema: .object([
    "type": "object",
    "properties": .object([
      "subject": .object([
        "type": "string",
        "description": "Short title for the task"
      ]),
      "description": .object([
        "type": "string",
        "description": "Detailed description of the task"
      ])
    ]),
    "required": .array(["subject"])
  ])
),

The handlers follow the same guard-extract, do/catch, return-Result pattern as every other tool:

private func executeTaskCreate(_ input: JSONValue) async -> Result<String, ToolError> {
  guard let subject = input["subject"]?.stringValue else {
    return .failure(.missingParameter("subject"))
  }

  let description = input["description"]?.stringValue ?? ""

  do {
    let result = try taskManager.create(subject: subject, description: description)
    return .success(result)
  } catch {
    return .failure(.executionFailed("\(error)"))
  }
}

And the dispatch map grows by four entries:

let handlers = [
  "bash": executeBash,
  "read_file": executeReadFile,
  "write_file": executeWriteFile,
  "edit_file": executeEditFile,
  "todo": executeTodo,
  "agent": executeAgent,
  "load_skill": executeLoadSkill,
  "compact": executeCompact,
  "task_create": executeTaskCreate,
  "task_update": executeTaskUpdate,
  "task_list": executeTaskList,
  "task_get": executeTaskGet
]
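The handler shape and the dispatch lookup fit together in a few lines. A self-contained sketch with a deliberately tiny, hypothetical JSONValue (the series' real type has more cases), a hypothetical executeEcho handler, and an assumed unknownTool error case; the real handlers are async methods on Agent:

```swift
// Hypothetical minimal JSONValue with just enough to extract parameters.
enum JSONValue: Equatable {
    case string(String)
    case object([String: JSONValue])

    subscript(key: String) -> JSONValue? {
        if case .object(let fields) = self { return fields[key] }
        return nil
    }

    var stringValue: String? {
        if case .string(let text) = self { return text }
        return nil
    }
}

enum ToolError: Error, Equatable {
    case missingParameter(String)
    case unknownTool(String)  // assumed case for this sketch
}

typealias ToolHandler = (JSONValue) -> Result<String, ToolError>

// Same guard-extract-then-Result shape as executeTaskCreate, minus the I/O.
func executeEcho(_ input: JSONValue) -> Result<String, ToolError> {
    guard let text = input["text"]?.stringValue else {
        return .failure(.missingParameter("text"))
    }
    return .success(text)
}

/// One dictionary lookup; adding a tool adds an entry, never a branch.
func dispatch(
    _ name: String,
    _ input: JSONValue,
    handlers: [String: ToolHandler] = ["echo": executeEcho]
) -> Result<String, ToolError> {
    guard let handler = handlers[name] else {
        return .failure(.unknownTool(name))
    }
    return handler(input)
}
```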

Subagents get read-only access — task_list and task_get — but can't create or modify tasks. The LoopConfig.subagent denylist excludes task_create and task_update:

static let subagent = LoopConfig(
  tools: Agent.toolDefinitions.filter {
    !Set(["agent", "todo", "compact", "task_create", "task_update"]).contains($0.name)
  },
  ...
)

A subagent can check the task board to understand what work is planned, but only the main agent can change it. This keeps task ownership unambiguous — the same principle that keeps subagents from firing nag reminders or compressing the parent's history.

With that in place, we now have twelve tools and a persistent planning layer. The TodoManager still serves its original purpose — quick in-session checklists for simple tasks — while TaskManager handles structured multi-step work with explicit dependencies.

Taking it for a spin

Let's build and run:

swift build && swift run agent

Try: Plan a refactoring with 4 tasks: "Parse AST", "Transform nodes", "Emit output", "Run tests". Transform and Emit can run in parallel after Parse. Tests wait for both. Watch the tool calls — the agent should create four tasks, then wire dependencies: tasks 2 and 3 blocked by task 1, task 4 blocked by tasks 2 and 3.

Then: List all tasks. The output should show the dependency graph with markers:

[ ] 1: Parse AST
[ ] 2: Transform nodes (blocked by: 1)
[ ] 3: Emit output (blocked by: 1)
[ ] 4: Run tests (blocked by: 2, 3)

Now complete task 1: Mark task 1 as completed and list tasks again. The cascade fires — tasks 2 and 3 lose their blockedBy reference and become ready. Task 4 still waits for both.

For a longer session, try asking the agent to plan a real feature — something with eight or ten steps and genuine ordering constraints. Then trigger context compaction by reading a bunch of large files. After compaction fires, ask the agent to list all tasks — the full plan is still there, intact on disk, even though the conversation history was summarized down to two messages.

Durable state, same loop

We now have an agent with durable planning. Tasks persist as JSON files that survive compaction, restarts, and arbitrarily long sessions. The dependency graph — blockedBy upstream, blocks downstream — gives the model a way to express ordering and parallelism. Cascading unblock on completion means the model doesn't need to manually track which tasks become ready; it just marks work as done and the graph updates itself.

The TodoManager and TaskManager coexist deliberately. Todos are fast and ephemeral — a scratchpad for single-session work. Tasks are structured and persistent — a plan that outlasts the conversation. The model learns when to use which from the system prompt, and in practice it reaches for task_create when the work has dependencies and todo when it doesn't.

The loop is still the invariant. Four new entries in the dispatch dictionary, four new handler methods, one new type — and the agent loop that drives everything hasn't changed since the first guide. In the next guide, we'll tackle a natural follow-up: some of these tasks take a long time to execute. Background tasks let the agent kick off slow work and keep going while it runs. Thanks for reading!

The complete series on ivanmagda.dev:

Part 0: Bootstrapping the project
Part 1: The agent loop
Part 2: Tool dispatch
Part 3: Self-managed task tracking
Part 4: Subagents
Part 5: Skill loading
Part 6: Context compaction
Part 7: Task system ← you are here
Part 8: Background tasks

Stack: Swift 6.2, AsyncHTTPClient (not URLSession), raw HTTP to the Anthropic Messages API. No SDK.

Do you persist agent state to disk, a database, or keep it all in-context? I went with file-per-entity for simplicity — curious what trade-offs others have hit.

Context Compaction: Three Layers of Compression That Let an Agent Run Indefinitely

Ivan Magda — Thu, 16 Apr 2026 07:36:43 +0000

Our agent has come a long way. It runs commands, reads and writes files, tracks its own work, delegates to subagents, and loads skills on demand — seven tools, one loop. But every one of those capabilities adds to the same growing resource: the messages array. A single read_file on a 1,000-line source file costs roughly 4,000 tokens. Load a skill body, and that's another 2,000. After reading 30 files and running 20 bash commands across a long session, the context pushes past 100,000 tokens. At that point, the agent either hits the API's context window limit and errors out, or — more subtly — the model's response quality degrades as the relevant information gets buried in a sea of stale tool results.

This is the threshold that separates a demo from a useful tool. Everything we've built so far assumes the context has room. Once it doesn't, the agent has a hard ceiling on how much work it can do in a single session. That's where context compaction comes in: a three-layer compression strategy that progressively shrinks the messages array — quietly trimming old results, automatically summarizing when a threshold is crossed, and letting the model request compression explicitly. With these three layers working together, the agent can run indefinitely.

In this guide, let's build ContextCompactor — the type that implements all three layers — and wire it into the agent loop. This is the beginning of Act III in our series: the agent now needs to manage its own memory.

The complete source code for this stage is available at the 06-context-compaction tag on GitHub. Code blocks below show key excerpts.

Three layers, three strategies

The compression strategy works in layers, each more aggressive than the last. Layer 1 — micro-compact — runs silently before every API call. It scans the messages array for old tool results (anything beyond the three most recent) and replaces their content with a short placeholder like "[Previous: used read_file]". The model still sees that a tool was called and what kind it was, but the actual output — the 500-line file, the verbose bash output — is gone. This is the quiet housekeeping layer: no API call required, no information loss that the model would typically need, and it runs every single turn.

Layer 2, auto-compact, triggers when the estimated token count crosses a threshold (50,000 by default). This is the dramatic one: the agent saves the entire conversation transcript to disk as a JSONL file, then asks the LLM itself to summarize the conversation. The summary replaces the entire messages array: every prior turn collapses into two messages, a user message containing the compressed summary and an assistant acknowledgment. The conversation continues from there with a clean slate and a condensed record of what happened.

Layer 3 — the compact tool — is the same summarization as layer 2, but triggered deliberately. The model calls compact when it decides compression would help, optionally specifying a focus parameter to guide what the summary should preserve. It's the difference between automatic garbage collection and an explicit free() — sometimes the model knows best when to compress.

The ContextCompactor type

Let's start with the type that owns all three layers. ContextCompactor holds two configuration values — the path where transcripts are saved and the token threshold that triggers auto-compaction — and exposes methods for each layer:

// Sources/Core/ContextCompactor.swift
public struct ContextCompactor: Sendable {
  public static let keepRecent = 3
  public static let minContentLength = 100

  public let transcriptDirectory: String
  public let tokenThreshold: Int

  public init(
    transcriptDirectory: String,
    tokenThreshold: Int = Limits.defaultTokenThreshold
  ) {
    self.transcriptDirectory = transcriptDirectory
    self.tokenThreshold = tokenThreshold
  }
}

The keepRecent and minContentLength constants control micro-compact's behavior: keep the three most recent tool results untouched, and only replace results longer than 100 characters. Anything shorter isn't worth compacting.
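Here is a small sketch of the selection rule. It works on plain content lengths rather than the project's Message values. The function name indicesToCompact is hypothetical. It only models which tool results micro-compact would replace, ordered oldest first, using the same keepRecent and minContentLength defaults.

```swift
// Hypothetical model of micro-compact's selection rule.
// Input: the content length of every tool result, oldest first.
// Output: indices of the results that would get a placeholder.
func indicesToCompact(
  resultLengths: [Int],
  keepRecent: Int = 3,
  minContentLength: Int = 100
) -> [Int] {
  // Nothing to do until there are more results than we keep verbatim.
  guard resultLengths.count > keepRecent else { return [] }

  // Everything except the `keepRecent` newest results is "old"...
  let oldCount = resultLengths.count - keepRecent

  // ...but only old results longer than the minimum are worth replacing.
  return (0..<oldCount).filter { resultLengths[$0] > minContentLength }
}

let lengths = [500, 50, 800, 300, 20, 900]
print(indicesToCompact(resultLengths: lengths))  // prints [0, 2]
```

With six results, the three newest stay verbatim. Of the three older ones, only the two over 100 characters get a placeholder.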

Micro-compact: the quiet layer

The microCompact method scans the messages array for every .toolResult content block, identifies which ones are old enough to compress, and replaces their content with a placeholder. One thing to keep in mind here is that Message.content is a let property — we can't mutate a content block in place. Instead, we reconstruct entire Message values with new content arrays:

public func microCompact(messages: inout [Message]) {
  let toolResultLocations = findToolResultLocations(in: messages)
  guard toolResultLocations.count > Self.keepRecent else {
    return
  }

  let toolNameMap = buildToolNameMap(from: messages)
  let oldResults = toolResultLocations.dropLast(Self.keepRecent)
  var modifiedContents: [Int: [ContentBlock]] = [:]

  for (msgIdx, contentIdx) in oldResults {
    guard
      case .toolResult(let toolUseId, let content, let isError) = messages[msgIdx].content[contentIdx],
      content.count > Self.minContentLength
    else {
      continue
    }

    let toolName = toolNameMap[toolUseId] ?? "unknown"
    let replacement = ContentBlock.toolResult(
      toolUseId: toolUseId,
      content: "[Previous: used \(toolName)]",
      isError: isError
    )

    if modifiedContents[msgIdx] == nil {
      modifiedContents[msgIdx] = messages[msgIdx].content
    }
    modifiedContents[msgIdx]![contentIdx] = replacement
  }

  for (msgIdx, newContent) in modifiedContents {
    messages[msgIdx] = Message(role: messages[msgIdx].role, content: newContent)
  }
}

The method is intentionally synchronous — it's pure data transformation with no reason to await anything. Two private helpers do the scanning: findToolResultLocations collects every toolResult position in the array, and buildToolNameMap walks assistant messages to map each toolUseId back to its tool name — bridging a gap in the API's data model where toolResult blocks carry an ID but no name.
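The two helpers aren't shown above, so here is a minimal sketch of one way to write them. It relies on simplified stand-in types. The ContentBlock and Message below are hypothetical and carry only the fields the helpers need. The project's real enum cases also hold tool input and isError.

```swift
// Hypothetical, simplified stand-ins for the project's types.
enum Role { case user, assistant }

enum ContentBlock {
  case text(String)
  case toolUse(id: String, name: String)
  case toolResult(toolUseId: String, content: String)
}

struct Message {
  let role: Role
  let content: [ContentBlock]
}

// Every (message index, content index) pair holding a tool result, in
// conversation order, so dropLast(keepRecent) leaves the oldest ones.
func findToolResultLocations(in messages: [Message]) -> [(Int, Int)] {
  var locations: [(Int, Int)] = []
  for (msgIdx, message) in messages.enumerated() {
    for (contentIdx, block) in message.content.enumerated() {
      if case .toolResult = block {
        locations.append((msgIdx, contentIdx))
      }
    }
  }
  return locations
}

// Tool results carry only an ID, so recover each tool's name from the
// tool_use blocks in assistant messages.
func buildToolNameMap(from messages: [Message]) -> [String: String] {
  var map: [String: String] = [:]
  for message in messages where message.role == .assistant {
    for block in message.content {
      if case .toolUse(let id, let name) = block {
        map[id] = name
      }
    }
  }
  return map
}
```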

Auto-compact: threshold-triggered summarization

Layer 2 needs to answer a question before it can act: how many tokens are we using? The API reports token usage only in its responses, after a request has been made, and we need an answer before sending the next one, so we estimate:

public func estimateTokens(from messages: [Message]) -> Int {
  let data = (try? JSONEncoder().encode(messages)) ?? Data()
  return data.count / 4
}

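The divide-by-four heuristic is crude (about four characters per token is a common rule of thumb for English text), but it's close enough for a threshold check. Encoding the messages as JSON also approximates what actually goes over the wire, which is what we care about. It does leave out the system prompt and tool definitions that accompany every request.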
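Here is the bytes-divided-by-four arithmetic in concrete numbers. This sketch generalizes the estimator to any Encodable value purely for illustration, since the project's estimateTokens takes [Message]. At the default 50,000-token threshold, the check fires once the encoded messages pass roughly 200,000 bytes of JSON.

```swift
import Foundation

// Same bytes/4 heuristic as estimateTokens, generalized (for illustration
// only) to any Encodable value.
func estimateTokens<T: Encodable>(from value: T) -> Int {
  let data = (try? JSONEncoder().encode(value)) ?? Data()
  return data.count / 4
}

// 50,000 tokens * 4 bytes per token = 200,000 bytes of encoded JSON.
let threshold = 50_000
let byteBudget = threshold * 4

// ["abcd"] encodes to the 8 bytes ["abcd"], which estimates to 2 tokens.
let small = estimateTokens(from: ["abcd"])
print(small, byteBudget)  // prints 2 200000

// 200,000 characters plus 4 bytes of brackets and quotes is 200,004
// bytes, which estimates to 50,001 tokens: just over the threshold.
let big = [String(repeating: "x", count: 200_000)]
print(estimateTokens(from: big) > threshold)  // prints true
```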

When the estimate crosses the threshold, autoCompact takes over. It saves the full transcript to disk first — nothing is truly lost — then asks the LLM to summarize:

public func autoCompact(
  messages: [Message],
  using apiClient: APIClientProtocol,
  model: String,
  focus: String?
) async -> [Message] {
  do {
    let path = try saveTranscript(messages)

    let encoder = JSONEncoder()
    let data = (try? encoder.encode(messages)) ?? Data()

    var transcript = String(data: data, encoding: .utf8) ?? "[]"
    if transcript.count > Self.maxSummaryInputLength {
      transcript = String(transcript.prefix(Self.maxSummaryInputLength)) + "\n[truncated]"
    }

    var prompt = ""
    if let focus, !focus.isEmpty {
      prompt += "Focus on: \(focus). "
    }
    prompt += """
      Summarize this conversation for continuity. Include: \
      1) What was accomplished, 2) Current state, 3) Key decisions made. \
      Be concise but preserve critical details.

      \(transcript)
      """

    let request = APIRequest(
      model: model,
      maxTokens: 2000,
      messages: [.user(prompt)]
    )
    let response = try await apiClient.createMessage(request: request)
    let summary = response.content.textContent

    return [
      .user("[Conversation compressed. Transcript: \(path)]\n\n\(summary)"),
      .assistant("Understood. I have the context from the summary. Continuing.")
    ]
  } catch {
    print("[warning] Auto-compact failed: \(error). Keeping original messages.")
    return messages
  }
}

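The do/catch wrapping the entire method body is a deliberate safety net, because compaction failure should never crash the agent loop. If the API call fails or the transcript can't be written, the method prints a warning and returns the original messages unchanged. The agent continues with a full context rather than no context. Note also the truncation step before the request. The encoded transcript is capped at Self.maxSummaryInputLength characters, a static constant on ContextCompactor whose declaration isn't shown in these excerpts. This keeps the summarization request itself bounded even when the conversation is huge.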

The saveTranscript method writes each message as a single JSON line to a .transcripts/ directory. One early version used a bare Unix timestamp for the filename, which caused collisions when two compactions happened in the same second. The fix appends the first eight characters of a UUID to the timestamp:

let timestamp = Int(Date().timeIntervalSince1970)
let unique = UUID().uuidString.prefix(8)
let path = "\(transcriptDirectory)/transcript_\(timestamp)_\(unique).jsonl"
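saveTranscript itself isn't shown, so here is a sketch of a JSONL writer that uses the same filename scheme. The Message struct and the directory parameter are assumptions for illustration. The project's method takes only the messages and reads transcriptDirectory from the struct.

```swift
import Foundation

// Hypothetical minimal message type; the project's Message is richer.
struct Message: Codable {
  let role: String
  let text: String
}

// Sketch: one JSON object per line (JSONL), under a timestamp-plus-UUID
// filename so two saves in the same second don't collide.
// Assumption: the directory is passed in rather than stored on a struct.
func saveTranscript(_ messages: [Message], to directory: String) throws -> String {
  try FileManager.default.createDirectory(
    atPath: directory, withIntermediateDirectories: true, attributes: nil
  )

  let timestamp = Int(Date().timeIntervalSince1970)
  let unique = UUID().uuidString.prefix(8)
  let path = "\(directory)/transcript_\(timestamp)_\(unique).jsonl"

  // JSONEncoder's default (non-pretty) output contains no raw newlines;
  // newlines inside strings are escaped, so each message stays on one line.
  let encoder = JSONEncoder()
  var output = Data()
  for message in messages {
    output.append(try encoder.encode(message))
    output.append(Data("\n".utf8))
  }
  try output.write(to: URL(fileURLWithPath: path))
  return path
}
```

Because every line is a complete JSON object, the transcript can be read back one line at a time with JSONDecoder.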

The compact tool and two-phase dispatch

Layer 3 gives the model direct control over compression. The compact tool definition includes an optional focus parameter that lets the model specify what the summary should preserve:

ToolDefinition(
  name: "compact",
  description: "Compress conversation history to free context space. Use when working on long tasks.",
  inputSchema: .object([
    "type": "object",
    "properties": .object([
      "focus": .object([
        "type": "string",
        "description": "What to preserve in the summary (e.g., 'file paths edited', 'current task progress')"
      ])
    ]),
    "required": .array([])
  ])
)

The handler, though, is surprising — it doesn't actually compact anything:

private func executeCompact(_ input: JSONValue) async -> Result<String, ToolError> {
  .success("Compressing...")
}

This is the two-phase dispatch pattern. The compact tool can't perform the actual compaction because tool handlers return Result<String, ToolError> — they don't have access to the messages array. The real work needs to happen in the loop, where messages is a local var. So the handler returns a marker string, and processToolUses captures the focus parameter as a signal:

struct ToolProcessingResult {
  let results: [ContentBlock]
  let didUseTodo: Bool
  let compactFocus: String?
}

The compactFocus field is nil when compact wasn't called, and holds the focus value (or an empty string for no focus) when it was. This replaces the growing tuple that processToolUses previously returned — a named struct with a clear nil-vs-present semantic is easier to reason about than a third tuple element.

Inside processToolUses, the compact detection is a simple check alongside the existing didUseTodo tracking:

if name == "compact" {
  compactFocus = input["focus"]?.stringValue ?? ""
}
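The two phases can be modeled in a few lines outside the agent. Everything here is a hypothetical stand-in. ToolCall replaces the API's tool_use block, and inputs are plain [String: String] instead of JSONValue. The real processToolUses also executes handlers and tracks todo usage. The point of the sketch is the division of labor: the dispatcher only records a signal, and the code that owns messages acts on it.

```swift
// Hypothetical stand-ins for a tool_use block and the processing result.
struct ToolCall {
  let name: String
  let input: [String: String]
}

struct ToolProcessingResult {
  let results: [String]
  let compactFocus: String?  // nil: compact not called; "": no focus given
}

// Phase 1: handlers only produce strings. The compact "handler" returns a
// marker, and the dispatcher records the focus as a signal for the loop.
func processToolUses(_ calls: [ToolCall]) -> ToolProcessingResult {
  var results: [String] = []
  var compactFocus: String?
  for call in calls {
    if call.name == "compact" {
      results.append("Compressing...")
      compactFocus = call.input["focus"] ?? ""
    } else {
      results.append("ran \(call.name)")
    }
  }
  return ToolProcessingResult(results: results, compactFocus: compactFocus)
}

// Phase 2: the loop, which owns `messages`, acts on the signal.
var messages = ["...long history..."]
let processed = processToolUses([ToolCall(name: "compact", input: ["focus": "files edited"])])
if let focus = processed.compactFocus {
  // In the real loop this would be contextCompactor.autoCompact(..., focus: focus).
  messages = ["[summary focused on: \(focus)]"]
}
```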

Wiring into the agent loop

With all three layers built, let's connect them. The applyCompaction helper runs layers 1 and 2 in sequence:

private func applyCompaction(_ messages: [Message]) async -> [Message] {
  var compacted = messages
  contextCompactor.microCompact(messages: &compacted)

  if contextCompactor.estimateTokens(from: compacted) > contextCompactor.tokenThreshold {
    print("[auto_compact triggered]")
    return await contextCompactor.autoCompact(
      messages: compacted, using: apiClient, model: model, focus: nil
    )
  }

  return compacted
}

Micro-compact runs first (every turn), then the threshold check determines whether auto-compact fires. The method takes messages by value and returns a new array — the same pure-value pattern we've used since extracting agentLoop for subagents.

In the loop itself, applyCompaction runs before each API call, and manual compaction runs after tool results are appended:

while true {
  try Task.checkCancellation()

  iteration += 1
  if iteration > config.maxIterations {
    return (lastAssistantText + "\n(\(config.label) reached iteration limit)", messages)
  }

  messages = await applyCompaction(messages)

  let request = APIRequest(
    model: model, maxTokens: Limits.defaultMaxTokens,
    system: systemPrompt, messages: messages, tools: config.tools
  )

  let response = try await apiClient.createMessage(request: request)
  messages.append(Message(role: .assistant, content: response.content))
  // ... print, check stop reason, process tools ...

  messages.append(Message(role: .user, content: toolResults))

  if let compactFocus = toolProcessing.compactFocus {
    print("[manual compact]")
    messages = await contextCompactor.autoCompact(
      messages: messages, using: apiClient, model: model, focus: compactFocus
    )
  }
}

The placement matters. Micro-compact and auto-compact run before the API call, so the request always goes out with a trimmed context. Manual compact runs after tool results are appended, so the summary includes the compact tool call itself — the model's explicit decision to compress is preserved in the transcript.

The compact tool is excluded from LoopConfig.subagent alongside agent and todo — a subagent shouldn't be able to compress the parent's history. But micro-compact and auto-compact do run in subagent loops, since subagents share the same agentLoop code path. A subagent making heavy read_file calls across its 30-iteration limit can benefit from the quiet cleanup.

With that in place, we now have an agent that manages its own memory. Three layers of compression, one new type, and two injection points in the loop — before the API call and after tool processing.

Taking it for a spin

Let's build and run:

swift build && swift run agent

Try: Read every Swift file in the Sources/ directory one by one. Micro-compact prints nothing, so the terminal won't announce it. After the first few files, though, earlier tool results in subsequent API requests are replaced with "[Previous: used read_file]". That's micro-compact doing its work silently.

For a more dramatic demonstration, keep reading files or ask the agent to explore a large codebase. When the estimated token count crosses 50,000, auto-compact triggers: the agent saves a full transcript to .transcripts/, asks the LLM for a summary, and continues with a fresh two-message context. Check the .transcripts/ directory afterward — the full conversation history is preserved as JSONL.

To see layer 3 in action, try: Use the compact tool to compress this conversation, focusing on what files we've read. The model calls compact with a focus parameter, the loop triggers summarization, and the conversation continues with a targeted summary.

What we've built and where it breaks

We now have an agent that can work indefinitely. Micro-compact quietly trims old tool results every turn. Auto-compact summarizes the full conversation when the context gets large. The compact tool gives the model deliberate control. Transcripts on disk mean nothing is truly lost — just moved out of active context.

The limitation is that compression is lossy. When auto-compact fires, the model loses access to the exact content of files it read, the precise error messages it encountered, the specific commands it ran. The summary preserves the gist — what was accomplished, the current state, key decisions — but not the details. For a long-running task with dozens of steps, the model might forget exactly which files it edited or which approach it tried and abandoned. The loop is still the invariant; tools are still the variable. But now one of those tools can reshape the loop's own working memory — the first time in our series that the agent isn't just acting on the world, but acting on itself. In the next guide, we'll address the lossy-compression problem directly: a file-based task system that gives the agent durable state that survives compaction. Thanks for reading!

The complete series on ivanmagda.dev:

Part 0: Bootstrapping the project
Part 1: The agent loop
Part 2: Tool dispatch
Part 3: Self-managed task tracking
Part 4: Subagents
Part 5: Skill loading
Part 6: Context compaction ← you are here
Part 7: Task system
Part 8: Background tasks

Stack: Swift 6.2, AsyncHTTPClient (not URLSession), raw HTTP to the Anthropic Messages API. No SDK.

How do you handle context limits in your agents? Summarization, sliding windows, vector stores, or something else? Curious what's working for others.

Skill Loading: Two-Layer Knowledge Injection That Costs Tokens Only When Needed

Ivan Magda — Tue, 14 Apr 2026 06:46:40 +0000

Our agent can run commands, read and write files, track its own work, and delegate tasks to subagents. That's a solid toolkit — but everything the agent knows comes from either the model's training data or the contents of files it reads during a session. Ask it to follow a specific git commit convention, a code review checklist, or a deployment workflow, and it has nothing to draw on. We could stuff all of that knowledge into the system prompt, but that's wasteful: ten domain-specific guides at roughly 2,000 tokens each would add 20,000 tokens to every single API call, most of which would be irrelevant to the task at hand.

The fix is a two-layer injection strategy. Layer one is cheap: a one-line description of each available skill, embedded in the system prompt. The model sees what's available at a glance — maybe 100 tokens per skill. Layer two is expensive but on-demand: when the model decides it actually needs a skill, it calls a tool that returns the full body as a tool result. The knowledge arrives exactly when it's useful, and only the skills the model asks for consume context.

In this guide, let's build a SkillLoader that scans the filesystem for skill files, a buildSystemPrompt function that injects their names, and a load_skill tool that delivers their full content. This is the midpoint of our series — after this stage, the agent can run commands, manipulate files, plan its work, delegate tasks, and load new knowledge on demand.

The complete source code for this stage is available at the 05-skill-loading tag on GitHub. Code blocks below show key excerpts.

What a skill looks like on disk

Each skill lives in its own subdirectory under skills/, with a single SKILL.md file. The file uses YAML frontmatter (a name and a description) followed by the full body of knowledge. Here's the directory layout, with the example skill that ships with the project next to a code-review skill:

skills/
  example/
    SKILL.md
  code-review/
    SKILL.md

And the contents of a SKILL.md:

---
name: example
description: An example skill demonstrating the skill file format
---

This is a sample skill file. Skills are stored in `skills/{name}/SKILL.md` and
provide specialized knowledge that the agent can load on demand via the
`load_skill` tool.

The frontmatter is the cheap part — the description feeds into the system prompt. The body below the closing --- is the expensive part — it only reaches the model when explicitly requested. A skill for code review might have a three-word description but a 2,000-token body with detailed checklists, severity rubrics, and formatting conventions. The agent pays for those tokens only when it's actually doing a code review.

Two layers, two costs

The architecture breaks down into a clear division of labor. At init time, SkillLoader scans the skills/ directory and parses every SKILL.md it finds. The parsed descriptions flow into buildSystemPrompt, which appends a short menu to the system prompt — something like:

Skills available:
  - code-review: Review code for bugs, style issues, and best practices
  - example: An example skill demonstrating the skill file format

That's layer one. Every API call includes it, but it's tiny — a few lines of text that tell the model what knowledge is available.

Layer two is the load_skill tool. When the model calls load_skill(name: "code-review"), the handler returns the full body wrapped in <skill> tags. That content arrives as a tool result, which places fresh context near the end of the messages array, where models tend to pay the most attention. The model asked for it, so it's relevant. And because it's a tool result rather than part of the system prompt, it enters the context only when the model actually needs it. Once loaded, though, it stays in the messages array like any other tool result.

Scanning and parsing

Let's walk through SkillLoader. The type holds a dictionary of parsed skills, populated once at init time and never mutated afterward:

// Sources/Core/SkillLoader.swift
public struct SkillLoader: Sendable {
  public struct Skill: Sendable {
    public let name: String
    public let description: String
    public let body: String
  }

  private let skills: [String: Skill]
}

The Skill struct captures exactly three things: the name (used as a lookup key), the description (injected into the system prompt), and the body (returned by the tool).

The initializer scans the skills directory, silently handling the case where it doesn't exist:

public init(directory: String) {
  let fileManager = FileManager.default
  var loadedSkills: [String: Skill] = [:]

  var isDirectory: ObjCBool = false
  guard
    fileManager.fileExists(atPath: directory, isDirectory: &isDirectory),
    isDirectory.boolValue
  else {
    self.skills = [:]
    return
  }

  let contents = (try? fileManager.contentsOfDirectory(atPath: directory)) ?? []
  for entry in contents {
    let skillFile = "\(directory)/\(entry)/SKILL.md"
    guard
      fileManager.fileExists(atPath: skillFile),
      let text = try? String(contentsOfFile: skillFile, encoding: .utf8)
    else {
      continue
    }

    let (meta, body) = Self.parseFrontmatter(text)
    let skillName = meta["name"] ?? entry
    guard let description = meta["description"] else {
      continue
    }

    loadedSkills[skillName] = Skill(
      name: skillName,
      description: description,
      body: body.trimmingCharacters(in: .whitespacesAndNewlines)
    )
  }

  self.skills = loadedSkills
}

The init walks each subdirectory looking for a SKILL.md file. If the frontmatter specifies a name, that's the key; otherwise, the directory name is used as a fallback. Skills without a description are silently skipped. The description is what makes layer one work, so a skill without one has nothing to advertise. The try? on String(contentsOfFile:) means a permissions error on one skill file doesn't prevent the rest from loading. The try? on contentsOfDirectory means an unreadable skills directory yields no skills instead of an error.

The frontmatter parser is a straightforward line-by-line scan — no regex, no YAML library:

private static func parseFrontmatter(_ text: String) -> (meta: [String: String], body: String) {
  let lines = text.components(separatedBy: "\n")

  guard
    let firstLine = lines.first,
    firstLine.trimmingCharacters(in: .whitespaces) == "---"
  else {
    return (meta: [:], body: text)
  }

  var meta: [String: String] = [:]
  var closingIndex: Int?

  for index in 1..<lines.count {
    let line = lines[index]
    if line.trimmingCharacters(in: .whitespaces) == "---" {
      closingIndex = index
      break
    }

    if let colonRange = line.range(of: ":") {
      let key = String(line[line.startIndex..<colonRange.lowerBound])
        .trimmingCharacters(in: .whitespaces)
      let value = String(line[colonRange.upperBound...])
        .trimmingCharacters(in: .whitespaces)
      if !key.isEmpty {
        meta[key] = value
      }
    }
  }

  guard let closing = closingIndex else {
    return (meta: [:], body: text)
  }

  let bodyLines = Array(lines[(closing + 1)...])
  let body = bodyLines.joined(separator: "\n")
  return (meta: meta, body: body)
}

If the file doesn't start with ---, the entire text is treated as the body with no metadata — a graceful fallback for plain markdown files. If the opening delimiter exists but the closing one is missing, the same fallback applies. Only when both delimiters are present does the parser extract key-value pairs from the lines between them.
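Because the parser is hand-rolled, a few behaviors are worth checking directly. The sketch below is a condensed copy of the same logic, written for experimentation rather than taken from the project's file, together with a sample document. It shows three things. Values are split on the first colon only, so a value can itself contain colons. YAML quoting isn't interpreted, so quotes stay in the value. And a missing closing delimiter drops all metadata.

```swift
import Foundation

// Condensed copy of the parseFrontmatter logic, for experimentation.
func parseFrontmatter(_ text: String) -> (meta: [String: String], body: String) {
  let lines = text.components(separatedBy: "\n")
  guard lines.first?.trimmingCharacters(in: .whitespaces) == "---" else {
    return ([:], text)  // no opening delimiter: whole text is the body
  }

  var meta: [String: String] = [:]
  for index in 1..<lines.count {
    let line = lines[index]
    if line.trimmingCharacters(in: .whitespaces) == "---" {
      // Closing delimiter found: the body is everything after it.
      let body = lines[(index + 1)...].joined(separator: "\n")
      return (meta, body)
    }
    // Split on the first colon only, so values may contain colons.
    if let colon = line.range(of: ":") {
      let key = line[..<colon.lowerBound].trimmingCharacters(in: .whitespaces)
      let value = line[colon.upperBound...].trimmingCharacters(in: .whitespaces)
      if !key.isEmpty { meta[key] = value }
    }
  }
  // No closing delimiter: treat the whole file as body, no metadata.
  return ([:], text)
}

let sample = """
---
name: code-review
description: Review: bugs, style
quoted: "kept as-is"
---
Body text.
"""
let parsed = parseFrontmatter(sample)
print(parsed.meta["description"] ?? "")  // prints Review: bugs, style
```

If skill authors ever need quoted or multi-line values, that is the point where a real YAML parser would start to earn its dependency.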

The two public accessors provide what each layer needs. The descriptions property produces the compact menu for the system prompt, sorted alphabetically for deterministic output:

public var descriptions: String {
  guard !skills.isEmpty else {
    return ""
  }

  return skills.values
    .sorted { $0.name < $1.name }
    .map { "  - \($0.name): \($0.description)" }
    .joined(separator: "\n")
}

And content(for:) delivers the full body wrapped in <skill> tags, with a helpful error message listing available skills if the name doesn't match:

public func content(for name: String) -> String {
  if let skill = skills[name] {
    return "<skill name=\"\(name)\">\n\(skill.body)\n</skill>"
  }

  if skills.isEmpty {
    return "Unknown skill '\(name)'. No skills are available."
  }

  let available = skills.keys.sorted().joined(separator: ", ")
  return "Unknown skill '\(name)'. Available skills: \(available)"
}

The <skill> tag wrapping is a small but deliberate choice — it gives the model a clear signal that the content is structured knowledge, distinct from a regular tool output. When the model sees <skill name="code-review">...</skill> in a tool result, it knows exactly what it's looking at.

Wiring into the agent

With SkillLoader ready, let's connect it to the agent. The buildSystemPrompt method gains a skillDescriptions parameter that conditionally appends the skill menu:

// Sources/Core/Agent.swift
public static func buildSystemPrompt(cwd: String, skillDescriptions: String = "") -> String {
  var prompt = """
    You are a coding agent at \(cwd). Use tools to solve tasks. \
    Act, don't explain.

    - Prefer read_file/write_file/edit_file over bash for file operations
    - Always check tool results before proceeding
    - Use the todo tool to plan multi-step tasks. Mark in_progress before starting, completed when done.
    """

  if !skillDescriptions.isEmpty {
    prompt += "\nUse load_skill to access specialized knowledge.\n\nSkills available:\n\(skillDescriptions)"
  }

  return prompt
}

The load_skill tool handler is the simplest in the codebase — a single guard and a return:

private func executeLoadSkill(_ input: JSONValue) async -> Result<String, ToolError> {
  guard let name = input["name"]?.stringValue else {
    return .failure(.missingParameter("name"))
  }
  return .success(skillLoader.content(for: name))
}

And the dispatch map gains one entry:

let handlers = [
  "bash": executeBash,
  "read_file": executeReadFile,
  "write_file": executeWriteFile,
  "edit_file": executeEditFile,
  "todo": executeTodo,
  "agent": executeAgent,
  "load_skill": executeLoadSkill
]

With that in place, we now have an agent that discovers knowledge at startup and delivers it on demand. The load_skill tool is automatically available to subagents too — the denylist in LoopConfig.subagent only excludes agent and todo, so a child agent can load skills independently during a delegated task.
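Because each capability is just an entry in a name-keyed dictionary, dispatch amounts to a lookup plus a call. The synchronous sketch below models that under simplifying assumptions. Inputs are [String: String] rather than JSONValue, and handlers aren't async. The echo tool and the unknownTool error case are hypothetical. As in the article, an unknown skill name is a successful result with an explanatory message, not a failure.

```swift
// Hypothetical, simplified error type for this sketch.
enum ToolError: Error, Equatable {
  case missingParameter(String)
  case unknownTool(String)  // hypothetical case, not from the project
}

typealias ToolHandler = ([String: String]) -> Result<String, ToolError>

let skillBodies = ["example": "This is a sample skill file."]

func executeLoadSkill(_ input: [String: String]) -> Result<String, ToolError> {
  guard let name = input["name"] else {
    return .failure(.missingParameter("name"))
  }
  guard let body = skillBodies[name] else {
    return .success("Unknown skill '\(name)'.")
  }
  return .success("<skill name=\"\(name)\">\n\(body)\n</skill>")
}

// Hypothetical second tool, just to show more than one entry.
func executeEcho(_ input: [String: String]) -> Result<String, ToolError> {
  .success(input["text"] ?? "")
}

// Adding a capability means adding one entry here.
let handlers: [String: ToolHandler] = [
  "echo": executeEcho,
  "load_skill": executeLoadSkill,
]

func dispatch(_ name: String, _ input: [String: String]) -> Result<String, ToolError> {
  guard let handler = handlers[name] else {
    return .failure(.unknownTool(name))
  }
  return handler(input)
}
```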

Taking it for a spin

Let's build and run:

swift build && swift run agent

Create a skills/ directory in the working folder with a custom skill — say, skills/git-workflow/SKILL.md containing frontmatter with a description and a body with commit conventions. Then try: What skills do you have available? The agent should list the skills it found at startup.

For something more interesting, try: Load the example skill and tell me what format skill files use. Watch the tool calls — the model should call load_skill with name: "example", receive the full body in <skill> tags, and summarize the format. The system prompt told it the skill existed; the tool delivered the content.

To see the economics in action, try a session where the agent handles a task that doesn't need skills — just file operations and bash commands. The skill descriptions in the system prompt add a few lines of overhead, but the full bodies never appear. That's the payoff of the two-layer approach.

The midpoint: everything clicks together

Let's take stock of where we are. Over five stages, we've built an agent that can run shell commands, read and write files, edit code, track its own work with a todo list, delegate subtasks to child agents, and now load specialized knowledge on demand. Seven tools, one loop. The agent loop itself — API call, check stop reason, process tools, append results — hasn't changed since the first guide. Every new capability has been a new entry in the dispatch dictionary, a new handler method, and sometimes a new injection point before or after tool processing.

That's the thesis in action: the loop is the invariant, tools are the variable. SkillLoader is a particularly clean example — the entire feature is a struct that scans a directory, a static function that generates a prompt, and a three-line tool handler. No changes to agentLoop, no changes to processToolUses, no changes to LoopConfig. Skills bloat the context by design — every load_skill call adds a full body to the messages array, and it stays there for the rest of the session. In the next guide, we'll tackle that directly with context compaction: a three-layer compression strategy that lets the agent run indefinitely without hitting the context window ceiling. Thanks for reading!

The complete series on ivanmagda.dev:

Part 0: Bootstrapping the project
Part 1: The agent loop
Part 2: Tool dispatch
Part 3: Self-managed task tracking
Part 4: Subagents
Part 5: Skill loading ← you are here
Part 6: Context compaction
Part 7: Task system
Part 8: Background tasks

Stack: Swift 6.2, AsyncHTTPClient (not URLSession), raw HTTP to the Anthropic Messages API. No SDK.

How do you inject domain knowledge into your agents? System prompt, RAG, tool results, or something else? Would love to hear your approach.

Subagents: Context Isolation Through Recursive Agent Loops

Ivan Magda — Fri, 10 Apr 2026 12:08:57 +0000

Our agent can now run commands, read and write files, edit code, and track its own work with a todo list. That's a capable set of tools — but every one of them shares the same context. Ask the agent to research which testing framework a project uses, and it might read five files, grep through a directory, and try a few bash commands before arriving at the answer: "XCTest." All of those intermediate tool calls — the file contents, the grep output, the exploratory commands — stay in the messages array permanently. The parent conversation didn't need any of that. It just needed the one-word answer.

This is context pollution. The agent's messages array is its working memory, and every tool call adds to it. A research task that reads ten files adds ten tool results to the context, even though the caller only cares about the conclusion. Over a long session with several such tasks, the context fills with intermediate results that crowd out the information that actually matters. Worse, those old results contribute to instruction-following decay — the very problem we tackled in the previous guide.

The fix is delegation with isolation. Instead of doing everything in one conversation, the agent can spawn a subagent — a child that gets a fresh messages array, does its work, and returns only a text summary. The parent's context stays clean. The child's entire working history is discarded. In this guide, let's build that delegation mechanism and introduce LoopConfig, a struct that lets the same agent loop behave differently depending on whether it's running as a parent or a child.

The complete source code for this stage is available at the 04-subagents tag on GitHub. Code blocks below show key excerpts.

A fresh messages array as a stack frame

The analogy that makes subagents click is a function call. When we call a function, it gets its own stack frame — local variables, local control flow — and returns a value. The caller doesn't see the function's internal state; it just gets the result. A subagent works the same way: it starts with messages = [Message.user(prompt)], runs the agent loop with its own growing context, and returns the final assistant text. The parent receives that text as a normal tool result — one content block instead of dozens.

The key architectural decision here is how agentLoop relates to the agent's state. In the previous guides, the agent loop lived directly inside run() and mutated self.messages in place. To support subagents, we need to extract that loop into a method that can operate on any messages array — the parent's or a fresh one. The natural approach in Swift would be inout [Message], letting the method mutate the caller's array directly. But Swift 6.2's strict concurrency checker rejects inout parameters on self properties inside async methods — it can't prove exclusive access across await suspension points. That's a hard compiler error, not a warning.

The alternative is pure value semantics: agentLoop takes [Message] by value and returns a (text: String, messages: [Message]) tuple. The caller decides what to do with the returned messages. For the parent, run() writes them back to self.messages. For a subagent, the caller discards them — the isolation is automatic:

// Sources/Core/Agent.swift
public func run(query: String) async throws -> String {
    messages.append(.user(query))

    let result = try await agentLoop(initialMessages: messages, config: .default)
    messages = result.messages

    return result.text
}

The parent calls agentLoop with its accumulated messages and writes the result back. A subagent calls the same method with [Message.user(prompt)] and lets the result fall away. Same function, different inputs, different lifecycles. Swift's value semantics mean the parent and child can never accidentally share a messages array: each agentLoop call works on its own value, not a reference to another loop's array. That's a safety guarantee we get for free from the language.
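A toy version of the pattern shows the stack-frame behavior. Here agentLoop is a hypothetical stand-in that works on strings and simply appends two entries. What matters is the signature, which takes the history by value and returns the grown copy.

```swift
// Hypothetical stand-in for agentLoop: history in by value, grown copy out.
func agentLoop(initialMessages: [String]) -> (text: String, messages: [String]) {
  var messages = initialMessages  // local copy; the caller's array is untouched
  messages.append("tool_result: (large intermediate output)")
  messages.append("assistant: done")
  return ("done", messages)
}

var parentMessages = ["user: run the tests"]

// Passing the array in doesn't mutate it...
let snapshot = parentMessages
_ = agentLoop(initialMessages: parentMessages)
let unchangedByCall = (parentMessages == snapshot)

// ...only the parent's explicit write-back does.
parentMessages = agentLoop(initialMessages: parentMessages).messages

// A subagent starts from a fresh array and keeps only the text.
let beforeChild = parentMessages
let childText = agentLoop(initialMessages: ["user: which test framework?"]).text
```

The parent's history changes only through the explicit assignment, and none of the child's intermediate entries ever reach it.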

LoopConfig: same loop, different rules

Extracting the loop solves context isolation, but parent and child need to behave differently too. The parent has access to all tools; the child shouldn't be able to spawn its own subagents (that's unbounded recursion) or update the parent's todo list (the TodoManager is shared state on the same Agent instance). The parent runs indefinitely; the child needs a safety limit. The parent nags about todos; the child shouldn't, since it can't call todo anyway.

All of these behavioral differences live in a single struct:

// Sources/Core/Agent.swift
fileprivate struct LoopConfig {
    let tools: [ToolDefinition]
    let maxIterations: Int
    let enableNag: Bool
    let label: String

    static let `default` = LoopConfig(
        tools: Agent.toolDefinitions,
        maxIterations: .max,
        enableNag: true,
        label: "agent"
    )

    static let subagent = LoopConfig(
        tools: Agent.toolDefinitions.filter {
          $0.name != "agent" && $0.name != "todo"
        },
        maxIterations: 30,
        enableNag: false,
        label: "subagent"
    )
}

Two static presets cover everything. The parent gets all tools, unlimited iterations, nag enabled, labeled "agent". The subagent filters out agent and todo, caps at 30 iterations, disables nag, and labels itself "subagent" so log output is distinguishable. The tool exclusion uses a denylist — filter { $0.name != ... } — rather than an allowlist, so new tools added in future stages are automatically available to subagents unless explicitly excluded.
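The trade-off between a denylist and an allowlist becomes visible as soon as a tool is added. In this sketch, plain tool-name strings stand in for ToolDefinition values. load_skill plays the role of a tool added after the subagent preset was written.

```swift
// Tool names stand in for ToolDefinition values (assumption for brevity).
let allTools = ["bash", "read_file", "write_file", "edit_file", "todo", "agent", "load_skill"]

// Denylist: anything not explicitly excluded passes through,
// including tools added later.
let denied: Set<String> = ["agent", "todo"]
let viaDenylist = allTools.filter { !denied.contains($0) }

// Allowlist frozen before load_skill existed: the new tool is silently missing.
let allowed: Set<String> = ["bash", "read_file", "write_file", "edit_file"]
let viaAllowlist = allTools.filter { allowed.contains($0) }
```

The flip side is that any new tool a subagent shouldn't have must be added to the denylist explicitly.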

The label field is a small touch that matters more than it looks. When a subagent is running, every tool call and text output is prefixed with [subagent] instead of [agent]. Watching the terminal, it's immediately clear which loop is active — essential for debugging delegation behavior.

Wiring the agent tool and guarding the dispatch

The agent tool handler is the simplest in the codebase. It extracts the prompt, calls agentLoop with a fresh single-message array and the .subagent config, and returns the text:

// Sources/Core/Agent.swift
private func executeAgent(_ input: JSONValue) async -> Result<String, ToolError> {
    guard let prompt = input["prompt"]?.stringValue else {
        return .failure(.missingParameter("prompt"))
    }

    do {
        let result = try await agentLoop(
            initialMessages: [Message.user(prompt)],
            config: .subagent
        )
        var output = result.text

        if output.isEmpty {
            output = "(no output)"
        } else if output.count > Limits.maxOutputSize {
            output = String(output.prefix(Limits.maxOutputSize))
        }

        return .success(output)
    } catch {
        return .failure(.executionFailed("Subagent failed: \(error)"))
    }
}

The result.messages — the subagent's entire working history — is never assigned anywhere. It falls out of scope when executeAgent returns, and with it goes every intermediate tool call the child made. The parent sees only result.text.

There's one more piece that matters: defense in depth. Even though LoopConfig.subagent doesn't include the agent tool definition, the model can still hallucinate a tool_use block for it. Language models don't always respect the tool list — they've seen these tool names in training data and may emit them regardless. Without a guard, a hallucinated agent call inside a subagent would trigger unbounded recursion. The fix is an allowedTools check in processToolUses:

// Sources/Core/Agent.swift
private func processToolUses(
    response: APIResponse,
    allowedTools: Set<String>,
    label: String
) async -> (results: [ContentBlock], didUseTodo: Bool) {
    var results: [ContentBlock] = []
    var didUseTodo = false

    for case .toolUse(let id, let name, let input) in response.content {
        guard allowedTools.contains(name) else {
            let message = "Tool '\(name)' is not allowed in this context"
            results.append(.toolResult(toolUseId: id, content: message, isError: true))
            continue
        }
        // ... execute tool, append result ...
    }

    return (results, didUseTodo)
}

The allowedTools set is built once from config.tools at the top of agentLoop. If the model emits a tool call for a name not in the set, the handler returns an error result with isError: true — the model sees the rejection and adjusts. No recursion, no crash.
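Here is the guard reduced to a standalone sketch. ContentBlock and ToolCall are hypothetical simplified stand-ins for the project's types, and the success branch is a placeholder for real execution. A hallucinated agent call produces an error result instead of a dispatch.

```swift
// Hypothetical, simplified stand-ins for the project's ContentBlock and tool_use blocks.
enum ContentBlock: Equatable {
    case toolResult(toolUseId: String, content: String, isError: Bool)
}

struct ToolCall {
    let id: String
    let name: String
}

// Mirrors the guard in processToolUses: disallowed names get an error result, not a dispatch.
func process(_ calls: [ToolCall], allowedTools: Set<String>) -> [ContentBlock] {
    var results: [ContentBlock] = []
    for call in calls {
        guard allowedTools.contains(call.name) else {
            let message = "Tool '\(call.name)' is not allowed in this context"
            results.append(.toolResult(toolUseId: call.id, content: message, isError: true))
            continue
        }
        // Placeholder for real tool execution.
        results.append(.toolResult(toolUseId: call.id, content: "ran \(call.name)", isError: false))
    }
    return results
}

let subagentAllowed: Set<String> = ["bash", "read_file", "write_file", "edit_file"]
let results = process(
    [ToolCall(id: "t1", name: "read_file"), ToolCall(id: "t2", name: "agent")],
    allowedTools: subagentAllowed
)
```

Note that the rejected call still gets a tool_result with the matching id. The Messages API expects every tool_use block to be answered in the next user message, so dropping the call silently is not an option.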

The assembled loop

With LoopConfig and processToolUses in place, let's look at the complete agentLoop. It's the same agent loop from the previous guides — API call, check stop reason, process tools, append results — now parameterized by a config:

private func agentLoop(
    initialMessages: [Message],
    config: LoopConfig
) async throws -> (text: String, messages: [Message]) {
    var messages = initialMessages
    var turnsWithoutTodo = 0
    var iteration = 0
    var lastAssistantText = ""

    let allowedTools = Set(config.tools.map(\.name))

    while true {
        try Task.checkCancellation()

        iteration += 1
        if iteration > config.maxIterations {
            return (lastAssistantText + "\n(\(config.label) reached iteration limit)", messages)
        }

        let request = APIRequest(
            model: model, maxTokens: Limits.defaultMaxTokens,
            system: systemPrompt, messages: messages, tools: config.tools
        )

        let response = try await apiClient.createMessage(request: request)
        messages.append(Message(role: .assistant, content: response.content))
        lastAssistantText = response.content.textContent

        guard response.stopReason == .toolUse else {
            return (response.content.textContent, messages)
        }

        let (results, didUseTodo) = await processToolUses(
            response: response, allowedTools: allowedTools, label: config.label
        )

        var toolResults = results
        if config.enableNag {
            turnsWithoutTodo = didUseTodo ? 0 : turnsWithoutTodo + 1
            if turnsWithoutTodo >= Self.todoReminderThreshold && todoManager.hasOpenItems() {
                toolResults.append(.text("Update your todos."))
            }
        }

        messages.append(Message(role: .user, content: toolResults))
    }
}

With that in place, we have an agent that can delegate. The parent dispatches a subtask, the child works autonomously with its own context, and only the summary comes back.

One detail to keep in mind: lastAssistantText tracks the most recent assistant response at each iteration. When the subagent hits its 30-iteration limit, the method returns whatever the model last said, plus a note that the limit was reached. During development, the first version pulled this text from messages.last, which was wrong: at the iteration-limit check, the last message is the user message carrying tool results, not the assistant's response. Recording the text right after each API call avoids that off-by-one-message mistake.
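The bookkeeping is easy to reproduce without any API calls. This hypothetical simulation appends an assistant turn and then a tool-results turn on each iteration, the same order agentLoop uses, and shows what each strategy returns once the limit is hit.

```swift
// Hypothetical simulation of agentLoop's message bookkeeping; no API involved.
enum Role { case user, assistant }

struct Message {
    let role: Role
    let text: String
}

func simulateLoop(maxIterations: Int) -> (fromLast: String, tracked: String) {
    var messages = [Message(role: .user, text: "task")]
    var lastAssistantText = ""
    var iteration = 0

    while true {
        iteration += 1
        if iteration > maxIterations {
            // The newest message here is the user turn carrying tool results.
            return (messages.last!.text, lastAssistantText)
        }
        let reply = "assistant turn \(iteration)"
        messages.append(Message(role: .assistant, text: reply))
        lastAssistantText = reply  // recorded right after each simulated API call
        messages.append(Message(role: .user, text: "tool results \(iteration)"))
    }
}

let outcome = simulateLoop(maxIterations: 3)
print(outcome.fromLast) // prints "tool results 3", which is not what the model said
print(outcome.tracked)  // prints "assistant turn 3"
```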

Taking it for a spin

Let's build and run:

swift build && swift run agent

Try a delegation-heavy task: Use a subagent to find what dependencies this project has, then tell me the list. Watch the terminal — tool calls prefixed with [subagent] show the child reading Package.swift and exploring the file tree, while the parent waits. When the subagent finishes, the parent receives a summary and continues in its clean context.

For something more interesting, try: Delegate a task to read all the Swift source files in Sources/ and summarize what each one does. The subagent might make five or six read_file calls, but the parent's context only grows by one tool result — the summary. That's the value of context isolation in action.

What we've built and where we're going

We now have an agent that delegates. The agent tool spawns a subagent with a fresh messages array, the child works independently using the same loop and the same filesystem, and only a text summary returns to the parent. Context stays clean, and LoopConfig controls the behavioral differences — tool access, iteration limits, nag behavior — through static presets rather than scattered conditionals.

The deeper lesson is that none of this required changing the loop itself. The agent loop (API call, check stop reason, process tools, append results) is identical to what we built in the first guide. We extracted it into a method, parameterized it with a config struct, and called it recursively. The loop is the invariant; tools and configuration are the variables.

LoopConfig will continue to grow. When we add background tasks later in the series, it gains a drainBackground flag to prevent subagents from consuming the parent's notifications. But the growth pattern is always the same: one new field, one new preset value. In the next guide, we'll give the agent the ability to load skills on demand: knowledge files that expand its capabilities without bloating every request. Thanks for reading!

The complete series on ivanmagda.dev:

Part 0: Bootstrapping the project
Part 1: The agent loop
Part 2: Tool dispatch
Part 3: Self-managed task tracking
Part 4: Subagents ← you are here
Part 5: Skill loading
Part 6: Context compaction
Part 7: Task system
Part 8: Background tasks

Stack: Swift 6.2, AsyncHTTPClient (not URLSession), raw HTTP to the Anthropic Messages API. No SDK.

Do your agents hallucinate tool calls that aren't in the tool list? The allowedTools guard in this guide caught a real recursion bug. What's your defense-in-depth strategy?

Why Coding Agents Lose Their Plan (and How a Todo Tool Fixes It)

Ivan Magda — Tue, 07 Apr 2026 14:57:45 +0000

Our agent can run commands, read files, write files, and edit code — all chained together automatically within a single prompt. Ask it to scaffold a module with three source files and a config, and it'll happily bash and write_file its way through the whole thing. But ask it to refactor a codebase in ten steps, and something interesting happens: it nails steps one through three, starts to drift around step five, and by step seven it's improvising. The plan it had at the beginning has faded into the growing sea of tool calls and results filling the context window.

This is a commonly observed weakness of language models, often called instruction-following decay. As a conversation grows longer, the system prompt and the original intent carry less weight relative to the mass of recent content. The model doesn't forget in the human sense; it just pays less attention. For a coding agent doing multi-step work, that's a serious problem. The plan has no durable representation: it lives only in the model's reasoning, and that reasoning fades as context grows.

The fix is surprisingly simple: give the agent a structured notepad that it writes for itself. Instead of holding the plan in the system prompt or hoping the model remembers, we give it a todo tool that maintains a visible, updatable task list right in the conversation. Every time the agent calls the tool, the current plan comes back as a tool result — fresh content near the end of the context, exactly where the model pays the most attention. In this guide, let's build that notepad and a nag system that reminds the agent to use it.

The complete source code for this stage is available at the 03-todo-write tag on GitHub. Code blocks below show key excerpts.

A todo tool the agent writes for itself

The core idea is a TodoManager that stores a list of items, each with a status: pending, in_progress, or completed. The agent calls the todo tool to set the full list whenever it wants to update the plan. One key constraint: only a single item can be in_progress at a time. This forces sequential focus — the model can't mark three things as in-progress and half-finish all of them.

Let's start with the data model. Each todo item has an ID, a text description, and a status:

// Sources/Core/TodoManager.swift
public enum TodoStatus: String, Sendable, Equatable, Codable {
  case pending
  case inProgress = "in_progress"
  case completed

  public var marker: String {
    switch self {
    case .pending: "[ ]"
    case .inProgress: "[>]"
    case .completed: "[x]"
    }
  }
}

public struct TodoItem: Sendable, Equatable, Codable {
  public let id: String
  public let text: String
  public let status: TodoStatus
}

The status markers — [ ], [>], [x] — make the rendered output instantly scannable for both the model and us watching the agent work.

The manager itself is a class that validates and stores items. The validation rules are intentionally tight: no more than 20 items, no blank text, and that single-in-progress constraint. Here's the core:

public final class TodoManager {
  public static let maxItems = 20
  public private(set) var items: [TodoItem] = []

  public enum ValidationError: Error, Equatable, Sendable {
    case tooManyItems
    case emptyText(String)
    case multipleInProgress
  }

  public func update(items: [TodoItem]) throws {
    if items.count > Self.maxItems {
      throw ValidationError.tooManyItems
    }

    for item in items where item.text.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty {
      throw ValidationError.emptyText(item.id)
    }

    let inProgressCount = items.filter { $0.status == .inProgress }.count
    if inProgressCount > 1 {
      throw ValidationError.multipleInProgress
    }

    self.items = items
  }
}

TodoManager is a class rather than a struct, which might seem surprising for a type that just holds an array. The reasoning: it's a stateful manager with a long-lived identity, owned exclusively by the Agent instance. The agent creates one TodoManager at init and mutates it throughout the session. A struct would work with mutating methods, but a class better expresses the intent — this is a single piece of mutable state with a lifecycle tied to the agent.
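A quick way to see the validation rules in action, using a trimmed-down standalone copy of the types above (public modifiers, Codable, markers, and rendering dropped so it runs on its own; attempt is a hypothetical helper). A rejected update leaves the previous list untouched, because items is only assigned after every check passes.

```swift
import Foundation

// Trimmed-down copies of the article's types so this sketch runs standalone.
enum TodoStatus: String {
    case pending
    case inProgress = "in_progress"
    case completed
}

struct TodoItem {
    let id: String
    let text: String
    let status: TodoStatus
}

final class TodoManager {
    static let maxItems = 20
    private(set) var items: [TodoItem] = []

    enum ValidationError: Error, Equatable {
        case tooManyItems
        case emptyText(String)
        case multipleInProgress
    }

    func update(items: [TodoItem]) throws {
        if items.count > Self.maxItems { throw ValidationError.tooManyItems }
        for item in items where item.text.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty {
            throw ValidationError.emptyText(item.id)
        }
        if items.filter({ $0.status == .inProgress }).count > 1 {
            throw ValidationError.multipleInProgress
        }
        self.items = items  // only reached when every check passed
    }
}

// Hypothetical helper: returns the validation error, or nil if the update was accepted.
func attempt(_ manager: TodoManager, _ items: [TodoItem]) -> TodoManager.ValidationError? {
    do {
        try manager.update(items: items)
        return nil
    } catch {
        return error as? TodoManager.ValidationError
    }
}

let manager = TodoManager()
let twoActive = [
    TodoItem(id: "1", text: "Read Package.swift", status: .inProgress),
    TodoItem(id: "2", text: "Add header", status: .inProgress),
]
let valid = [
    TodoItem(id: "1", text: "Read Package.swift", status: .completed),
    TodoItem(id: "2", text: "Add header", status: .inProgress),
]
let blank = [TodoItem(id: "3", text: "   ", status: .pending)]

let first = attempt(manager, twoActive)  // multipleInProgress; items stays empty
let second = attempt(manager, valid)     // nil; items now holds both entries
let third = attempt(manager, blank)      // emptyText("3"); items still holds the valid list
```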

Wiring the tool into the agent

Adding todo to the dispatch map follows exactly the same pattern as every other tool — one entry in the dictionary, one handler method. Here's the handler, which bridges between JSONValue inputs and our typed TodoItem model:

// Sources/Core/Agent.swift
private func executeTodo(_ input: JSONValue) async -> Result<String, ToolError> {
    guard let itemsArray = input["items"]?.arrayValue else {
        return .failure(.missingParameter("items"))
    }

    var todoItems: [TodoItem] = []
    for element in itemsArray {
        guard let id = element["id"]?.stringValue else {
            return .failure(.missingParameter("items[].id"))
        }
        guard let text = element["text"]?.stringValue else {
            return .failure(.missingParameter("items[].text"))
        }
        guard let statusString = element["status"]?.stringValue else {
            return .failure(.missingParameter("items[].status"))
        }
        guard let status = TodoStatus(rawValue: statusString) else {
            return .failure(.executionFailed("Invalid status '\(statusString)' for item \(id)"))
        }
        todoItems.append(TodoItem(id: id, text: text, status: status))
    }

    do {
        try todoManager.update(items: todoItems)
        return .success(todoManager.render())
    } catch {
        return .failure(.executionFailed("\(error)"))
    }
}

The handler does the mechanical work of parsing JSON into typed values, then delegates to TodoManager.update() for validation. If everything passes, render() returns the formatted list that goes back to the model as a tool result. The dispatch map gains one line:

func executeTool(name: String, input: JSONValue) async -> Result<String, ToolError> {
    let handlers = [
        "bash": executeBash,
        "read_file": executeReadFile,
        "write_file": executeWriteFile,
        "edit_file": executeEditFile,
        "todo": executeTodo  // one new entry
    ]
    ...
}

And the render() method produces output the model can read at a glance:

public func render() -> String {
    if items.isEmpty {
        return "No todos."
    }

    let completedCount = items.filter { $0.status == .completed }.count
    var lines = items.map { "\($0.status.marker) \($0.text)" }
    lines.append("(\(completedCount)/\(items.count) completed)")

    return lines.joined(separator: "\n")
}

With that in place, the agent has a self-managed planning tool. A rendered todo list looks like this:

[x] Add type hints
[>] Extract helper
[ ] Update docstring
[ ] Run linter
(1/4 completed)

That string appears as a tool result near the end of the context, exactly where we want the plan to live. And just like every tool before it, adding todo required zero changes to the loop itself.

The nag system: reminding the agent to plan

Having a todo tool is necessary but not sufficient. The model might simply not call it — especially as the conversation grows and the system prompt instruction to "use the todo tool" fades. We need a gentle mechanism that nudges the agent back toward planning when it drifts.

The approach is a turn counter. Every time the agent loop processes tool calls, we check whether any of them was todo. If not, we increment turnsWithoutTodo. If the counter hits a threshold (three turns) and there are still open items, we inject a short reminder into the tool results:

// Sources/Core/Agent.swift — inside run()
var turnsWithoutTodo = 0

while true {
    // ... API call, check stopReason ...

    var results: [ContentBlock] = []
    var didUseTodo = false

    for case .toolUse(let id, let name, let input) in response.content {
        let toolResult = await executeTool(name: name, input: input)

        if name == "todo" {
            didUseTodo = true
        }
        // ... append result to results ...
    }

    turnsWithoutTodo = didUseTodo ? 0 : turnsWithoutTodo + 1
    if turnsWithoutTodo >= Self.todoReminderThreshold && todoManager.hasOpenItems() {
        results.append(.text("Update your todos."))
    }

    messages.append(Message(role: .user, content: results))
}

A few things to note about the placement. The turnsWithoutTodo counter is a local variable inside run(), not an instance property. It only matters within a single user query — when the user types a new prompt, the counter resets naturally. The messages array, by contrast, stays on the Agent instance so conversation history persists across REPL turns.

The reminder is appended to the results array, after all tool results. During development, we initially inserted it at position zero, before the tool results. That breaks the Anthropic API's contract: in a user message that responds to tool use, the tool_result blocks must come first, and any text has to follow them. Appending the reminder after all results keeps the message valid.

There's also a subtlety with didUseTodo: it's set to true whenever the model calls the todo tool, regardless of whether the call succeeds or fails. Ideally, a failed todo call (say, with invalid data) shouldn't reset the nag counter — the agent didn't actually update its plan. The current implementation is a pragmatic compromise; gating on success would add complexity for a rare edge case.
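The nag bookkeeping can be pulled out into a small value type to see the timing. This is a hypothetical refactoring, not the article's code: NagTracker and the two-case ContentBlock are stand-ins, and the threshold of 3 mirrors todoReminderThreshold. The reminder lands on the third consecutive turn without a todo call and is always appended after the tool results.

```swift
// Hypothetical two-case stand-in for the project's ContentBlock.
enum ContentBlock: Equatable {
    case toolResult(String)
    case text(String)
}

// Hypothetical extraction of the nag bookkeeping from the loop body.
struct NagTracker {
    static let threshold = 3  // mirrors todoReminderThreshold
    private(set) var turnsWithoutTodo = 0

    // Called once per iteration, after that turn's tool calls have run.
    mutating func finishTurn(
        results: [ContentBlock], didUseTodo: Bool, hasOpenItems: Bool
    ) -> [ContentBlock] {
        turnsWithoutTodo = didUseTodo ? 0 : turnsWithoutTodo + 1
        var output = results
        if turnsWithoutTodo >= Self.threshold && hasOpenItems {
            output.append(.text("Update your todos."))  // after all tool results, never first
        }
        return output
    }
}

var tracker = NagTracker()
var turns: [[ContentBlock]] = []
for _ in 1...3 {
    // Three turns of tool calls with open todos but no todo call.
    turns.append(tracker.finishTurn(results: [.toolResult("ok")], didUseTodo: false, hasOpenItems: true))
}
// turns[0] and turns[1] pass through unchanged; turns[2] ends with the reminder.
let afterTodo = tracker.finishTurn(results: [.toolResult("ok")], didUseTodo: true, hasOpenItems: true)
```

Once the threshold is crossed, the reminder repeats on every turn until the model calls todo again, which resets the counter to zero.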

Taking it for a spin

Let's build and run:

swift build && swift run agent

Try a multi-step task: Refactor the file Package.swift: first read it, then add a comment header, then verify it still compiles. Watch for the agent to call todo early to lay out the plan, then update statuses as it works through each step. If the agent skips the todo tool for three turns, the "Update your todos." reminder should appear in the output.

The nag only fires when there are open items — if the agent never calls todo in the first place, there's nothing to nag about. To see the reminder in action, try a task complex enough that the agent creates a todo list early, then gets absorbed in the work: Create a Swift package with three modules: a networking library, a models library that depends on it, and a CLI that ties them together. Include a Package.swift with the dependency graph. Watch for the agent to lay out the plan with todo, then start building — after three turns of file operations without updating the list, the reminder appears.

What we've built and where we're going

We now have an agent that can track its own work. The todo tool gives the model a structured notepad: a place to write down the plan, mark items as in-progress, and check them off as they're completed. The nag system makes it much harder for the plan to be quietly abandoned as the conversation grows. Together, they're a lightweight countermeasure to the instruction-following decay that makes long agent sessions drift.

The mechanism is simple — a class with validation rules, one tool handler, and a turn counter — but it addresses a real problem that gets worse as agents take on larger tasks. The loop itself didn't change; we added one entry to the dispatch dictionary and a counter with an injection point after tool processing. The pattern holds: the loop is the invariant, tools are the variable. Nag reminders work well here, but an interesting question arises when we start running child agents: the TodoManager is shared state on the same Agent instance. If a subagent runs, should it nag about the parent's todos? We'll tackle that in the next guide when we build subagents and introduce LoopConfig to control per-loop behavior. Thanks for reading!

The complete series on ivanmagda.dev:

Part 0: Bootstrapping the project
Part 1: The agent loop
Part 2: Tool dispatch
Part 3: Self-managed task tracking ← you are here
Part 4: Subagents
Part 5: Skill loading
Part 6: Context compaction
Part 7: Task system
Part 8: Background tasks

Stack: Swift 6.2, AsyncHTTPClient (not URLSession), raw HTTP to the Anthropic Messages API. No SDK.

Have you seen instruction-following decay in your own agent projects? What's your countermeasure? Love to hear what works.

Tool Dispatch: A Dictionary Replaces a Switch Statement and Scales to 14 Tools

Ivan Magda — Thu, 02 Apr 2026 10:31:52 +0000

Our agent can do a lot with just bash. It can read files with cat, write them with echo, search with grep, compile with swift build — a shell command is a universal interface to the operating system. So why would we need anything else?

The answer becomes clear when we watch the agent work. It reaches for cat to read a file, and the output silently truncates at some terminal buffer limit. It constructs a multi-line sed command to edit a source file, and one misplaced backslash corrupts the content. Every file operation goes through a shell command that the model has to construct from scratch, with no guardrails and no safety boundaries. Dedicated tools like read_file and write_file let us enforce constraints — path sandboxing, output limits, atomic writes — at the tool level rather than hoping the model's bash commands happen to be correct.

In this guide, let's build a tool dispatch system that scales to any number of tools without changing the agent loop. We'll add three new tools — read_file, write_file, and edit_file — and replace the hardcoded bash handler with a dictionary-based dispatch map. The loop from the previous guide stays identical. Only the tool set changes.

The complete source code for this stage is available at the 02-tool-dispatch tag on GitHub. Code blocks below show key excerpts.

From one tool to many

In the previous guide, our executeTool method had exactly one job:

guard name == "bash" else {
    return .failure(.unknownTool(name))
}
return await executeBash(input)

This works perfectly for a single tool. But let's say we add read_file. Now we need an if/else chain — or a switch. Add write_file and edit_file, and the switch grows to four cases. By the time we reach the end of this series with 14 tools, that switch statement would be unwieldy. Worse, adding a new tool means modifying the dispatch logic itself, mixing "which tools exist" with "how tools are routed."

What we want is a separation: a data structure that maps tool names to handler functions, and a dispatch mechanism that just does a lookup. Adding a tool means adding one entry to the map — the routing code never changes.

The dispatch map

That's where dictionary-based dispatch comes in. Instead of a switch or a chain of if statements, we build a [String: handler] dictionary. The agent loop looks up the tool name, calls the matching handler, and moves on. Here's the core of executeTool:

// Sources/Core/Agent.swift
func executeTool(name: String, input: JSONValue) async -> Result<String, ToolError> {
    let handlers = [
        "bash": executeBash,
        "read_file": executeReadFile,
        "write_file": executeWriteFile,
        "edit_file": executeEditFile
    ]

    guard let handler = handlers[name] else {
        return .failure(.unknownTool(name))
    }

    return await handler(input)
}

One alternative we considered was a protocol-based registry — a Tool protocol with conforming structs, registered into some kind of container. For four tools, that's more boilerplate than the tools themselves. The dictionary is the registry. If we ever reach a point where protocol dispatch makes sense, the refactor is straightforward — but at 14 tools by the end of this series, the dictionary still holds up fine.
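The pattern is small enough to show on its own. In this sketch the handlers take a plain String instead of JSONValue and are synchronous, and echo and shout are hypothetical tools; adding a third tool would be one more entry in the dictionary literal, with executeTool left untouched.

```swift
// Hypothetical simplified types: the article's handlers take JSONValue and are async.
enum ToolError: Error, Equatable {
    case unknownTool(String)
}

typealias Handler = (String) -> Result<String, ToolError>

struct Dispatcher {
    let handlers: [String: Handler]

    // The routing logic: one lookup, no per-tool branches.
    func executeTool(name: String, input: String) -> Result<String, ToolError> {
        guard let handler = handlers[name] else {
            return .failure(.unknownTool(name))
        }
        return handler(input)
    }
}

let dispatcher = Dispatcher(handlers: [
    "echo": { .success($0) },
    "shout": { .success($0.uppercased()) },
])

let shouted = dispatcher.executeTool(name: "shout", input: "hi")  // .success("HI")
let unknown = dispatcher.executeTool(name: "rm_rf", input: "")    // .failure(.unknownTool("rm_rf"))
```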

Keeping tools inside the sandbox

Before we build the individual tool handlers, we need to solve a safety problem. When the model asks to read /etc/passwd or write to ../../../important_file, we want to reject that at the tool level — not hope the model behaves. Every file tool needs path sandboxing: resolve the path, check that it stays inside our working directory, and reject anything that escapes.

Here's our resolveSafePath helper:

private func resolveSafePath(_ relativePath: String) -> Result<String, ToolError> {
    let workDirURL = URL(fileURLWithPath: workingDirectory, isDirectory: true)
    let resolvedWorkDir = workDirURL.standardized

    let fullURL =
        if relativePath.hasPrefix("/") {
            URL(fileURLWithPath: relativePath).standardized
        } else {
            workDirURL.appendingPathComponent(relativePath).standardized
        }

    guard
        fullURL.path.hasPrefix(resolvedWorkDir.path + "/") ||
        fullURL.path == resolvedWorkDir.path
    else {
        return .failure(.executionFailed("Path escapes workspace: \(relativePath)"))
    }

    return .success(fullURL.path)
}

The hasPrefix("/") branch works around a URL.appendingPathComponent quirk: it always appends, even when the argument is already an absolute path, so /Users/foo/file.swift would become /cwd/Users/foo/file.swift.
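To check the behavior without a real workspace, here is a standalone copy of the check that returns nil instead of a ToolError, run against a hypothetical /workspace directory that doesn't need to exist. Keep in mind it is a purely lexical check: standardized removes . and .. segments but does not follow symlinks.

```swift
import Foundation

// Standalone copy of the sandbox check; returns nil instead of a ToolError for brevity.
func resolveSafePath(_ path: String, workingDirectory: String) -> String? {
    let workDirURL = URL(fileURLWithPath: workingDirectory, isDirectory: true)
    let root = workDirURL.standardized.path

    // Absolute paths are taken as-is; relative ones are joined onto the workspace.
    let fullURL = path.hasPrefix("/")
        ? URL(fileURLWithPath: path).standardized
        : workDirURL.appendingPathComponent(path).standardized

    // The trailing "/" stops a sibling like "/workspace-other" from matching "/workspace".
    guard fullURL.path.hasPrefix(root + "/") || fullURL.path == root else { return nil }
    return fullURL.path
}

let workspace = "/workspace"  // hypothetical; the directory does not need to exist
let inside = resolveSafePath("Sources/main.swift", workingDirectory: workspace)
let escaped = resolveSafePath("../etc/passwd", workingDirectory: workspace)
let sibling = resolveSafePath("/workspace-other/notes.txt", workingDirectory: workspace)
```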

Building the file tools

With path sandboxing in place, let's walk through each handler. First, read_file — it reads a file's contents with an optional line limit and a 50,000-character cap. That cap matters because every tool result goes back into the conversation, and a single massive file read could eat a significant chunk of the context window:

private func executeReadFile(_ input: JSONValue) async -> Result<String, ToolError> {
    guard let path = input["path"]?.stringValue else {
        return .failure(.missingParameter("path"))
    }

    switch resolveSafePath(path) {
    case .failure(let error):
        return .failure(error)
    case .success(let resolvedPath):
        do {
            let text = try String(contentsOfFile: resolvedPath, encoding: .utf8)
            let lines = text.components(separatedBy: "\n")
            var output: String

            if let limit = input["limit"]?.intValue, limit < lines.count {
                output = lines.prefix(limit).joined(separator: "\n")
                    + "\n... (\(lines.count - limit) more lines)"
            } else {
                output = text
            }

            if output.count > 50_000 {
                output = String(output.prefix(50_000))
            }

            return .success(output)
        } catch {
            return .failure(.executionFailed("\(error)"))
        }
    }
}
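The truncation rules are easier to reason about as a pure function. truncateForModel is a hypothetical extraction of the logic above with the same two steps: the line limit first, then the character cap. The cap counts Characters (grapheme clusters), not bytes, since that is what String.count measures.

```swift
import Foundation

// Hypothetical extraction of read_file's output shaping: line limit, then character cap.
func truncateForModel(_ text: String, limit: Int?, maxChars: Int = 50_000) -> String {
    let lines = text.components(separatedBy: "\n")
    var output: String

    if let limit, limit < lines.count {
        output = lines.prefix(limit).joined(separator: "\n")
            + "\n... (\(lines.count - limit) more lines)"
    } else {
        output = text
    }

    if output.count > maxChars {
        output = String(output.prefix(maxChars))
    }
    return output
}

let firstLine = truncateForModel("a\nb\nc", limit: 1)  // "a\n... (2 more lines)"
let untouched = truncateForModel("a\nb\nc", limit: 5)  // limit exceeds line count: unchanged
```

One quirk worth knowing: a file that ends with a newline splits into a trailing empty element, so the "more lines" count can be one higher than an editor would show.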

Next, write_file — the model often asks to create files in directories that don't exist yet, so the handler creates intermediate directories automatically:

private func executeWriteFile(_ input: JSONValue) async -> Result<String, ToolError> {
    guard let path = input["path"]?.stringValue else {
        return .failure(.missingParameter("path"))
    }
    guard let content = input["content"]?.stringValue else {
        return .failure(.missingParameter("content"))
    }

    switch resolveSafePath(path) {
    case .failure(let error):
        return .failure(error)
    case .success(let resolvedPath):
        do {
            let fileURL = URL(fileURLWithPath: resolvedPath)

            try FileManager.default.createDirectory(
                at: fileURL.deletingLastPathComponent(),
                withIntermediateDirectories: true
            )
            try content.write(toFile: resolvedPath, atomically: true, encoding: .utf8)

            return .success("Wrote \(content.utf8.count) bytes to \(path)")
        } catch {
            return .failure(.executionFailed("\(error)"))
        }
    }
}

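Finally, edit_file, which finds an exact text match and replaces it. One important design choice here: content.range(of:) returns the first occurrence only, so the handler replaces exactly one match. This is deliberate and close in spirit to Claude Code's own edit tool, which also targets a single match (it goes a step further and rejects old text that appears more than once). Replacing one occurrence instead of all of them means an ambiguous old_text can't rewrite unrelated code across the file. The catch is that a short old_text that also appears earlier in the file will hit the wrong spot, so the model has to include enough surrounding context to make the match unambiguous.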

private func executeEditFile(_ input: JSONValue) async -> Result<String, ToolError> {
    guard let path = input["path"]?.stringValue else {
        return .failure(.missingParameter("path"))
    }
    guard let oldText = input["old_text"]?.stringValue else {
        return .failure(.missingParameter("old_text"))
    }
    guard let newText = input["new_text"]?.stringValue else {
        return .failure(.missingParameter("new_text"))
    }

    switch resolveSafePath(path) {
    case .failure(let error):
        return .failure(error)
    case .success(let resolvedPath):
        do {
            var content = try String(contentsOfFile: resolvedPath, encoding: .utf8)

            guard let range = content.range(of: oldText) else {
                return .failure(.executionFailed("Text not found in \(path)"))
            }

            content.replaceSubrange(range, with: newText)
            try content.write(toFile: resolvedPath, atomically: true, encoding: .utf8)

            return .success("Edited \(path)")
        } catch {
            return .failure(.executionFailed("\(error)"))
        }
    }
}
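Stripped of file I/O, the edit is just range(of:) plus replaceSubrange. replaceFirst is a hypothetical helper for illustration. With a short old_text the first match wins, which may not be the line the model meant; a longer, more specific old_text pins it down.

```swift
import Foundation

// Replaces only the first exact match of oldText; returns nil when there is no match
// (the article's handler reports "Text not found" in that case).
func replaceFirst(in content: String, oldText: String, newText: String) -> String? {
    var edited = content
    guard let range = edited.range(of: oldText) else { return nil }
    edited.replaceSubrange(range, with: newText)
    return edited
}

let source = """
let a = 1
let b = 1
"""

// "= 1" matches twice; only the first line changes.
let vague = replaceFirst(in: source, oldText: "= 1", newText: "= 2")
// Including the variable name makes the match unambiguous.
let precise = replaceFirst(in: source, oldText: "let b = 1", newText: "let b = 2")
```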

With all four handlers in place, our dispatch map is complete. The agent can now read, write, and edit files through dedicated tools — with path sandboxing on every operation — while still falling back to bash for everything else.

The loop didn't change

Let's take a step back and look at what didn't change. The agent loop in run() is identical to the previous guide:

while true {
    let request = APIRequest(
        model: model,
        maxTokens: 4096,
        system: systemPrompt,
        messages: messages,
        tools: Self.toolDefinitions  // was [Self.bashToolDefinition]
    )

    let response = try await apiClient.createMessage(request: request)
    messages.append(Message(role: .assistant, content: response.content))

    guard response.stopReason == .toolUse else {
        return response.content.textContent
    }

    // ... execute tools, append results, continue
}

The only change is Self.toolDefinitions — four tool definitions instead of one. The loop still calls executeTool(name:input:), which now does a dictionary lookup instead of a hardcoded check. Everything else — the agent loop, the stopReason guard, the message accumulation — is untouched. This is the pattern that holds through the rest of the series: the loop is the invariant, tools are the variable.

Taking it for a spin

Let's build and run:

swift build && swift run agent

Try asking the agent to read the file Package.swift — it should use read_file instead of shelling out to cat. Then try create a file called greeting.txt that says Hello, World! and watch it use write_file. For something more interesting, try create a file called math.swift with a function that adds two numbers, then edit it to add a docstring — this exercises write_file followed by edit_file in a multi-step chain, all within a single prompt. The system prompt now tells the model to prefer read_file/write_file/edit_file over bash for file operations, so it should reach for the dedicated tools naturally.

What we've built and where we're going

We now have a dispatch system that scales to any number of tools by adding entries to a dictionary: no changes to the loop, no changes to the routing logic. Each tool handler enforces its own constraints (path sandboxing, output limits, single-occurrence edits), which is safer and more reliable than hoping bash commands are well-formed. The dispatch dictionary is small enough to read at a glance, and the same structure comfortably holds the 14 tools we'll have by the end of the series.

One dispatch dictionary works for now, but later we'll need different tool sets for different contexts — subagents shouldn't have access to every tool the main agent has. We'll solve that when we build subagents and introduce LoopConfig to control which tools are available at each recursion level. In the next guide, we'll give the agent a structured way to track its own work with a todo system, so it doesn't lose its plan halfway through a long task. Thanks for reading!

The complete series on ivanmagda.dev:

Part 0: Bootstrapping the project
Part 1: The agent loop
Part 2: Tool dispatch ← you are here
Part 3: Todo persistence
Part 4: Subagents
Part 5: Skill loading
Part 6: Context compaction
Part 7: Task system
Part 8: Background tasks

Stack: Swift 6.2, AsyncHTTPClient (not URLSession), raw HTTP to the Anthropic Messages API. No SDK.

How do you handle path sandboxing in your tools? Prefix check, allowlist, chroot? Curious what approaches others use.

The Agent Loop: How 20 Lines of Swift Turn an API Client into a Coding Agent

Ivan Magda — Mon, 30 Mar 2026 09:08:25 +0000

A language model can reason about code — it can plan how to fix a bug, suggest a refactoring, or design a feature. But it can't touch the real world. It can't read files, run tests, or check whether its suggestion actually compiles. Without some kind of bridge, every interaction is a dead end: the model suggests something, we copy-paste it into a terminal, paste the result back, the model adjusts, and we do it all over again. We are the loop.

The entire point of a coding agent is to close that loop automatically. Give the model a way to execute commands, feed the results back, and let it keep going until it's done. That's what we'll build in this guide — and it turns out the core mechanism is surprisingly small.

The complete source code for this stage is available at the 01-agent-loop tag on GitHub. Code blocks below show key excerpts.

The problem: we are the middleware

Let's say we ask the model to create a file. Without an agent loop, the interaction looks like this: we send a prompt, the model responds with a shell command, we manually run the command, then paste the output back so the model can verify it worked. Every single tool use requires a human round-trip. For a task that involves ten commands, that's ten manual copy-paste cycles.

What we want instead is a loop that does this automatically:

+--------+      +-------+      +---------+
|  User  | ---> |  LLM  | ---> |  Tool   |
| prompt |      |       |      | execute |
+--------+      +---+---+      +----+----+
                    ^                |
                    |   tool_result  |
                    +----------------+
                    (loop until stop_reason != tool_use)

The user sends one prompt. The model calls tools as many times as it needs — reading files, running commands, checking results — and only stops when it's satisfied. One exit condition controls the entire flow.

Two loops, two jobs

Our agent actually has two loops, each with a distinct purpose. The outer loop is the REPL — it reads user input, hands it to the agent, and waits for the next prompt. The inner loop is the agent loop — it calls the API, executes tools, and keeps going until the model decides it's done.

The REPL is the user-facing shell:

// Sources/cli/SwiftClaudeCode.swift
while true {
  print("\(ANSIColor.cyan)\(ANSIColor.bold)>\(ANSIColor.reset) ", terminator: "")
  guard let input = readLine(strippingNewline: true) else {
    break
  }

  let trimmed = input.trimmingCharacters(in: .whitespacesAndNewlines)
  if trimmed.isEmpty { continue }
  if ["exit", "quit", "q"].contains(trimmed.lowercased()) { break }

  do {
    _ = try await agent.run(query: trimmed)
  } catch {
    print("\(ANSIColor.red)Error: \(error)\(ANSIColor.reset)")
  }

  print()
}

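This loop runs until the user types exit, quit, or q, or stdin reaches EOF. Each iteration reads one line of input and calls agent.run(query:). The agent prints the model's text and a preview of each tool's output as it goes, so the REPL discards the returned string and only adds a blank line once run returns. However many API calls and tool executions happen in between, the REPL just waits, then goes back to reading the next prompt.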

The critical detail: the messages array lives on the Agent instance, not inside run(). This means conversation history persists across REPL turns. The second prompt the user types has full context of everything the agent did for the first one. During development, we briefly moved messages to a local variable for "cleanliness" — and immediately broke multi-turn conversations. The REPL calls run() per input; if messages don't survive between calls, the agent has amnesia.

The agent loop: one exit condition

The inner loop is the actual agent. Let's walk through the mechanism before seeing the full implementation.

First, the user's query becomes a message:

messages.append(.user(query))

Next, we send the full conversation — plus our tool definitions — to the API:

let request = APIRequest(
  model: model,
  maxTokens: 4096,
  system: systemPrompt,
  messages: messages,
  tools: [Self.bashToolDefinition]
)
let response = try await apiClient.createMessage(request: request)
messages.append(Message(role: .assistant, content: response.content))
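The Messages API expects snake_case field names (max_tokens rather than maxTokens), so a Codable request type needs explicit CodingKeys. Here is a minimal sketch, assuming a trimmed-down APIRequest with plain-text messages and no tool definitions (the real type carries ContentBlock arrays and tools). The header names and the 2023-06-01 version string are the ones the Messages API documents; the helper names are hypothetical.

```swift
import Foundation

/// Hypothetical, trimmed-down message: plain-text content only.
struct SimpleMessage: Codable {
    let role: String
    let content: String
}

/// Hypothetical, trimmed-down APIRequest (tool definitions omitted).
struct APIRequest: Codable {
    let model: String
    let maxTokens: Int
    let system: String?
    let messages: [SimpleMessage]

    // Case names are the Swift properties; raw values are the JSON field names.
    enum CodingKeys: String, CodingKey {
        case model
        case maxTokens = "max_tokens"
        case system
        case messages
    }
}

/// Headers to send with the JSON body to POST https://api.anthropic.com/v1/messages.
func messagesAPIHeaders(apiKey: String) -> [String: String] {
    [
        "x-api-key": apiKey,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    ]
}

/// Synthesized encoding skips a nil `system`, so the key is simply omitted.
func encodeBody(_ request: APIRequest) throws -> Data {
    try JSONEncoder().encode(request)
}
```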

Now comes the single branching point. We check stopReason — if the model didn't ask to use a tool, we're done:

guard response.stopReason == .toolUse else {
  return response.content.textContent
}

Otherwise, we execute each tool call, collect the results, and append them as a user message. Then we loop back to the API call:

var results: [ContentBlock] = []
for case .toolUse(let id, let name, let input) in response.content {
  let toolResult = await executeTool(name: name, input: input)
  switch toolResult {
  case .success(let output):
    results.append(.toolResult(toolUseId: id, content: output, isError: false))
  case .failure(let error):
    results.append(.toolResult(toolUseId: id, content: "\(error)", isError: true))
  }
}
messages.append(Message(role: .user, content: results))

Assembled into one method, this is the complete agent loop:

// Sources/Core/Agent.swift
public func run(query: String) async throws -> String {
  messages.append(.user(query))

  while true {
    let request = APIRequest(
      model: model,
      maxTokens: 4096,
      system: systemPrompt,
      messages: messages,
      tools: [Self.bashToolDefinition]
    )

    let response = try await apiClient.createMessage(request: request)
    messages.append(Message(role: .assistant, content: response.content))

    for case .text(let text) in response.content {
      print("\(ANSIColor.cyan)\(text)\(ANSIColor.reset)")
    }

    guard response.stopReason == .toolUse else {
      return response.content.textContent
    }

    var results: [ContentBlock] = []
    for case .toolUse(let id, let name, let input) in response.content {
      printToolCall(name: name, input: input)
      let toolResult = await executeTool(name: name, input: input)

      switch toolResult {
      case .success(let output):
        print("\(ANSIColor.dim)\(String(output.prefix(200)))\(ANSIColor.reset)")
        results.append(.toolResult(toolUseId: id, content: output, isError: false))
      case .failure(let error):
        let message = "\(error)"
        print("\(ANSIColor.red)\(message)\(ANSIColor.reset)")
        results.append(.toolResult(toolUseId: id, content: message, isError: true))
      }
    }

    messages.append(Message(role: .user, content: results))
  }
}

With that in place, we have a fully functional coding agent — and the entire mechanism fits in a single method. The branching point is one guard on stopReason. Everything else in this series layers on top of this loop — without changing it. Tools are the variable; the loop is the invariant.

Bash is all you need

We only give the model one tool: bash. That might seem limiting, but think about what bash can do — read files, write files, search codebases, run compilers, execute tests, install packages, manage git. A shell command is a universal interface to the operating system. The model decides what commands to run; we just execute them and report back.

In Swift, executing a shell command means wrapping Foundation's Process:

// Sources/Core/ShellExecutor.swift
let process = Process()
let stdoutPipe = Pipe()
let stderrPipe = Pipe()

process.executableURL = URL(fileURLWithPath: "/bin/bash")
process.arguments = ["-c", command]
process.standardOutput = stdoutPipe
process.standardError = stderrPipe
process.currentDirectoryURL = URL(fileURLWithPath: cwd)

try process.run()

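// Read pipe data BEFORE waitUntilExit() to avoid deadlock.
// Caveat: draining stdout to EOF before stderr can still stall if the
// child writes more than a pipe buffer's worth to stderr alone.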
let stdoutData = stdoutPipe.fileHandleForReading.readDataToEndOfFile()
let stderrData = stderrPipe.fileHandleForReading.readDataToEndOfFile()
process.waitUntilExit()

One thing we discovered during research that saved us from a nasty bug: pipe data must be read before calling waitUntilExit(). Foundation's Pipe uses kernel buffers that are typically around 64 KB. If a command produces more output than that, the child process blocks on write() because the buffer is full, while the parent blocks on waitUntilExit() waiting for the child to exit. Neither side makes progress — a classic deadlock that would have been silent and hard to diagnose.
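Reading before waitUntilExit() removes that deadlock, but draining stdout completely before touching stderr leaves a smaller version of the same trap: if a command writes more than a pipe buffer's worth to stderr while stdout stays open, the child blocks on stderr and the stdout read never sees EOF. Here is a hedged sketch of one fix, draining stderr on a background dispatch queue while the current thread drains stdout. The runShell name, the ShellOutput and PipeDrain types, and the use of /bin/sh are illustrative assumptions rather than the repository's ShellExecutor.

```swift
import Foundation

/// Captured result of a finished child process.
struct ShellOutput {
    let stdout: Data
    let stderr: Data
    let exitCode: Int32
}

/// Reads one pipe to EOF. `data` is written once on a background queue and
/// read only after the DispatchGroup wait, which orders the two accesses.
final class PipeDrain: @unchecked Sendable {
    private let handle: FileHandle
    private(set) var data = Data()
    init(_ handle: FileHandle) { self.handle = handle }
    func drain() { data = handle.readDataToEndOfFile() }
}

/// Hypothetical helper: runs `command` via /bin/sh -c and drains both pipes concurrently.
func runShell(_ command: String) throws -> ShellOutput {
    let process = Process()
    let stdoutPipe = Pipe()
    let stderrPipe = Pipe()
    process.executableURL = URL(fileURLWithPath: "/bin/sh")
    process.arguments = ["-c", command]
    process.standardOutput = stdoutPipe
    process.standardError = stderrPipe
    try process.run()

    // Drain stderr on a background queue so it can never fill up and block
    // the child while this thread is still waiting for stdout's EOF.
    let stderrDrain = PipeDrain(stderrPipe.fileHandleForReading)
    let group = DispatchGroup()
    DispatchQueue.global().async(group: group) { stderrDrain.drain() }

    let stdoutData = stdoutPipe.fileHandleForReading.readDataToEndOfFile()
    group.wait()

    // Both pipes have hit EOF, so waiting for exit can no longer block on output.
    process.waitUntilExit()
    return ShellOutput(stdout: stdoutData, stderr: stderrDrain.data, exitCode: process.terminationStatus)
}
```

A command like `head -c 200000 /dev/zero >&2; echo done` is a quick way to exercise this path: it writes far more than 64 KB to stderr before anything reaches stdout.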

Message accumulation: the growing conversation

One pattern worth understanding is how the messages array grows during a single run() call. Let's say the user asks "create a file called greeting.txt that says Hello World." Here's what messages looks like at each step:

[user("create a file...")] — we append the query
[user, assistant(tool_use: bash "echo ...")] — the model responds with a command
[user, assistant, user(tool_result: "")] — we execute it, append the result
[user, assistant, user, assistant(tool_use: bash "cat greeting.txt")] — the model verifies
[user, assistant, user, assistant, user(tool_result: "Hello World")] — we run cat
[user, assistant, user, assistant, user, assistant("Done! I created...")] — model is satisfied, stopReason is end_turn

Each API call sends the entire array. The model sees the full history of what it's done and what happened — which is how it knows to verify the file exists after creating it, and how it knows to stop once everything looks correct. This accumulation is what gives the agent memory within a single task.
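The trace above can be replayed without spending API tokens. Here is a hedged, self-contained sketch of the same loop driven by a scripted fake model. ScriptedModel, FakeAgent, and the simplified Block, Message, and Response types are hypothetical stand-ins for the series' APIClient, Agent, and ContentBlock. It demonstrates two points from this guide: the loop exits on the single stopReason check with messages grown exactly as listed, and because messages lives on the instance, a second run(query:) call keeps the earlier history.

```swift
// Hypothetical, simplified stand-ins for the series' API types.
enum Role { case user, assistant }

enum Block: Equatable {
    case text(String)
    case toolUse(id: String, command: String)
    case toolResult(toolUseId: String, content: String)
}

struct Message {
    let role: Role
    let content: [Block]
}

enum StopReason { case toolUse, endTurn }

struct Response {
    let content: [Block]
    let stopReason: StopReason
}

/// Fake model that replays pre-scripted responses in order (crashes if it runs out).
final class ScriptedModel {
    private var script: [Response]
    init(_ script: [Response]) { self.script = script }
    func createMessage(messages: [Message]) -> Response { script.removeFirst() }
}

final class FakeAgent {
    // Stored on the instance, so history survives across run(query:) calls.
    private(set) var messages: [Message] = []
    private let model: ScriptedModel
    private let runTool: (String) -> String

    init(model: ScriptedModel, runTool: @escaping (String) -> String) {
        self.model = model
        self.runTool = runTool
    }

    func run(query: String) -> String {
        messages.append(Message(role: .user, content: [.text(query)]))
        while true {
            let response = model.createMessage(messages: messages)
            messages.append(Message(role: .assistant, content: response.content))

            // The single exit condition.
            guard response.stopReason == .toolUse else {
                var text = ""
                for case .text(let part) in response.content { text += part }
                return text
            }

            var results: [Block] = []
            for case .toolUse(let id, let command) in response.content {
                results.append(.toolResult(toolUseId: id, content: runTool(command)))
            }
            messages.append(Message(role: .user, content: results))
        }
    }
}

// Replay the greeting.txt trace, then send a follow-up prompt.
let model = ScriptedModel([
    Response(content: [.toolUse(id: "t1", command: "echo 'Hello World' > greeting.txt")], stopReason: .toolUse),
    Response(content: [.toolUse(id: "t2", command: "cat greeting.txt")], stopReason: .toolUse),
    Response(content: [.text("Done! I created greeting.txt.")], stopReason: .endTurn),
    Response(content: [.text("You're welcome.")], stopReason: .endTurn),
])
let agent = FakeAgent(model: model) { command in
    command.hasPrefix("cat") ? "Hello World" : ""
}
let answer = agent.run(query: "create a file called greeting.txt that says Hello World")
let rolesAfterFirstRun = agent.messages.map(\.role)
_ = agent.run(query: "thanks")
```

After the first run the roles read user, assistant, user, assistant, user, assistant, matching the six steps listed above; the follow-up adds two more messages on top instead of starting over.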

The cost is obvious: this array grows without bound. For now that's fine, but eventually we'll hit the context window ceiling. We'll solve that in a later guide when we build context compaction.

Building the types

Since there's no first-party Anthropic SDK for Swift, we also need to build the supporting types that make this loop work. The API client is a thin wrapper around AsyncHTTPClient — encode a Codable request as JSON, send it with the right headers, decode the Codable response. The interesting type decision is how we model the API's polymorphic content blocks. Each block can be text, a tool use request, or a tool result, and Swift enums with associated values are a natural fit:

// Sources/Core/API/APIModels.swift
public enum ContentBlock: Sendable, Equatable {
  case text(String)
  case toolUse(id: String, name: String, input: JSONValue)
  case toolResult(toolUseId: String, content: String, isError: Bool)
}

Tool inputs are arbitrary JSON, so we model JSON itself as a recursive enum (JSONValue) with cases for every JSON type. These supporting types are verbose to set up — about 200 lines of Codable conformances and API models — but they're plumbing we write once and never change. The agent loop above is the part that matters.
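For a sense of what that plumbing involves, here is a hedged sketch of Codable conformances for both types. The JSON field names (type, text, id, name, input, tool_use_id, content, is_error) match how the Messages API represents these blocks. The exact JSONValue cases, treating tool_result content as a plain string, and rejecting any other block type are simplifying assumptions, not a copy of the repository's code.

```swift
import Foundation

/// Recursive JSON value: one case per JSON type.
enum JSONValue: Codable, Equatable, Sendable {
    case null
    case bool(Bool)
    case number(Double)
    case string(String)
    case array([JSONValue])
    case object([String: JSONValue])

    init(from decoder: Decoder) throws {
        let c = try decoder.singleValueContainer()
        // Try each JSON type in turn; the first that decodes wins.
        if c.decodeNil() { self = .null }
        else if let v = try? c.decode(Bool.self) { self = .bool(v) }
        else if let v = try? c.decode(Double.self) { self = .number(v) }
        else if let v = try? c.decode(String.self) { self = .string(v) }
        else if let v = try? c.decode([JSONValue].self) { self = .array(v) }
        else { self = .object(try c.decode([String: JSONValue].self)) }
    }

    func encode(to encoder: Encoder) throws {
        var c = encoder.singleValueContainer()
        switch self {
        case .null: try c.encodeNil()
        case .bool(let v): try c.encode(v)
        case .number(let v): try c.encode(v)
        case .string(let v): try c.encode(v)
        case .array(let v): try c.encode(v)
        case .object(let v): try c.encode(v)
        }
    }
}

enum ContentBlock: Codable, Equatable, Sendable {
    case text(String)
    case toolUse(id: String, name: String, input: JSONValue)
    case toolResult(toolUseId: String, content: String, isError: Bool)

    private enum Keys: String, CodingKey {
        case type, text, id, name, input, content
        case toolUseId = "tool_use_id"
        case isError = "is_error"
    }

    init(from decoder: Decoder) throws {
        let c = try decoder.container(keyedBy: Keys.self)
        // The "type" field says which case the rest of the object describes.
        switch try c.decode(String.self, forKey: .type) {
        case "text":
            self = try .text(c.decode(String.self, forKey: .text))
        case "tool_use":
            self = try .toolUse(id: c.decode(String.self, forKey: .id),
                                name: c.decode(String.self, forKey: .name),
                                input: c.decode(JSONValue.self, forKey: .input))
        case "tool_result":
            self = try .toolResult(toolUseId: c.decode(String.self, forKey: .toolUseId),
                                   content: c.decode(String.self, forKey: .content),
                                   isError: c.decodeIfPresent(Bool.self, forKey: .isError) ?? false)
        case let other:
            throw DecodingError.dataCorruptedError(forKey: .type, in: c,
                                                   debugDescription: "Unsupported block type: \(other)")
        }
    }

    func encode(to encoder: Encoder) throws {
        var c = encoder.container(keyedBy: Keys.self)
        switch self {
        case .text(let text):
            try c.encode("text", forKey: .type)
            try c.encode(text, forKey: .text)
        case .toolUse(let id, let name, let input):
            try c.encode("tool_use", forKey: .type)
            try c.encode(id, forKey: .id)
            try c.encode(name, forKey: .name)
            try c.encode(input, forKey: .input)
        case .toolResult(let toolUseId, let content, let isError):
            try c.encode("tool_result", forKey: .type)
            try c.encode(toolUseId, forKey: .toolUseId)
            try c.encode(content, forKey: .content)
            try c.encode(isError, forKey: .isError)
        }
    }
}
```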

Taking it for a spin

If we build and run our agent now, we can try the kind of multi-step tasks that show the loop in action:

swift build && swift run agent

Try asking it to "create a file called greeting.txt that says Hello, World!" and watch the agent call bash, verify the result, and respond. Then try "list all Swift files in this directory" or "what is the current git branch?", single-tool-call tasks that finish after one command. For something more interesting, try "create a directory called test_output and write 3 files in it" and watch how the model calls bash multiple times: once to create the directory, then once for each file, checking results along the way. We typed one prompt; the agent ran four or five commands. That's the loop doing its job.

What we've built and where we're going

We now have a working coding agent — one loop, one tool, and an accumulating message history. The model decides what commands to run, our loop executes them and feeds results back, and a single stopReason check controls when to stop. This is the kernel that drives everything else in the series. Over the next seven guides, we'll add more tools, task tracking, subagents, context compaction, and parallel execution — but this agent loop won't change. We'll only add entries to the tool list and injection points around it.

In the next guide, we'll give our agent more than just bash — we'll add read_file, write_file, and edit_file tools, and build a dictionary-based dispatch system that scales to any number of tools without touching the loop. Thanks for reading!

The complete series on ivanmagda.dev:

Part 0: Bootstrapping the project
Part 1: The agent loop ← you are here
Part 2: Tool dispatch
Part 3: Todo persistence
Part 4: Subagents
Part 5: Skill loading
Part 6: Context compaction
Part 7: Task system
Part 8: Background tasks

Stack: Swift 6.2, AsyncHTTPClient (not URLSession), raw HTTP to the Anthropic Messages API. No SDK.

What's the gnarliest subprocess bug you've hit? The pipe deadlock in this guide was ours. Drop yours in the comments.

I Built a Coding Agent in Swift — The Hardest Bugs Were Concurrency, Not AI

Ivan Magda — Thu, 26 Mar 2026 18:21:10 +0000

I'm a mobile developer (mostly iOS) and I wanted to understand how coding agents like Claude Code actually work under the hood. Not the API, but the architecture.

So I built one from scratch in Swift across 9 stages: file operations, shell execution, subagents, context compaction, task DAGs, background tasks, skill loading. 14 tools total.

The biggest surprise: almost none of the hard bugs were AI-related. Linux handles SIGTERM differently inside child processes spawned via Process, so a timeout strategy that worked on macOS silently failed on CI. Swift 6.2's strict concurrency caught real data races at compile time. The nastiest bug: you must read stdout/stderr data before calling waitUntilExit(), or the pipe buffer fills and the process hangs forever.

Meanwhile, the actual agent loop is ~20 lines and never changed across all 9 stages.

This is Part 0 — the foundation. We're starting with the decisions that saved us from restructuring later.

ivan-magda / swift-claude-code

A Swift reimplementation of a Claude Code-style coding agent, built stage by stage to explore what makes coding agents work

The complete source code for this stage is available at the 00-bootstrap tag on GitHub. Code blocks below show key excerpts.

Every great CLI tool starts the same way — with an empty directory and a handful of decisions that will shape everything built on top of it. For our Swift agent, those decisions matter more than usual. We're going to build a Claude Code-style coding assistant from scratch over the next eight guides, adding one mechanism per stage to a core that never changes. Getting the foundation right means we won't need to restructure anything later.

The thesis driving this project is simple: Claude Code's effectiveness comes from architectural restraint — a small set of excellent tools, thin orchestration, and heavy reliance on the model itself. We're going to prove that by building our own version in Swift, one layer at a time.

In this guide, let's set up the project structure, make sure everything compiles and runs, and lay the groundwork for the agent we'll start building in the next stage.

Starting with Swift Package Manager

Let's create our project and initialize it as a Swift package:

mkdir swift-claude-code
cd swift-claude-code
git init
swift package init --type executable --name swift-claude-code

This gives us a working starting point — SPM generates a Sources/ directory, a Package.swift, and a basic executable target. We could start writing code here and it would compile just fine.

However, the default layout puts everything into a single executable target, which means our agent logic and our command-line entry point live in the same place. That's a problem for two reasons: testing an executable target is awkward (SwiftPM has allowed @testable import of executables since Swift 5.5, but a library is the cleaner thing for a test target to import), and we can't reuse any of our agent logic outside the CLI. Let's fix that by splitting into two targets.

The two-target layout

The architecture we want is straightforward — a Core library that holds all the real logic, and a thin cli executable that just wires things together and starts the REPL:

swift-claude-code/
├── Package.swift
├── Sources/
│   ├── Core/           ← library (all agent logic)
│   └── cli/            ← executable (thin entry point)
└── Tests/
    └── CoreTests/      ← tests import Core

Let's replace SPM's generated code with our two-target structure:

rm -rf Sources/*.swift
mkdir -p Sources/Core
mkdir -p Sources/cli

Now we need something for each target to compile. Let's start with the Core library — for now, just a placeholder that proves the target exists:

// Sources/Core/Agent.swift
public enum Agent {
    public static let version = "0.1.0"
}

We're using a caseless enum as a pure namespace here — it'll evolve into a full class in the next guide once we need mutable state.

The cli target is our executable entry point. Here's where Swift's @main attribute comes in:

// Sources/cli/SwiftClaudeCode.swift
import Core

@main
enum SwiftClaudeCode {
    static func main() async throws {
        print("swift-claude-code v\(Agent.version)")
    }
}

Notice async throws on main() — we don't need async yet, but every API call we'll make starting in the next guide will be asynchronous, so we're declaring the entry point as async from day one.

One thing to keep in mind: @main and main.swift can't coexist in the same target. If you see a main.swift in the target, delete it — @main replaces it and will let us adopt AsyncParsableCommand from swift-argument-parser later without any restructuring.

The package manifest

With our source files in place, let's replace SPM's generated Package.swift with a manifest that reflects our two-target architecture:

// swift-tools-version: 6.2
import PackageDescription

let package = Package(
    name: "swift-claude-code",
    platforms: [.macOS(.v10_15)],
    products: [
        .executable(name: "agent", targets: ["cli"]),
        .library(name: "Core", targets: ["Core"]),
    ],
    dependencies: [
        .package(
            url: "https://github.com/swift-server/async-http-client.git",
            from: "1.32.0"
        ),
    ],
    targets: [
        .executableTarget(
            name: "cli",
            dependencies: ["Core"],
            path: "Sources/cli"
        ),
        .target(
            name: "Core",
            dependencies: [
                .product(
                    name: "AsyncHTTPClient",
                    package: "async-http-client"
                ),
            ],
            path: "Sources/Core"
        ),
        .testTarget(
            name: "CoreTests",
            dependencies: ["Core"],
            path: "Tests/CoreTests"
        ),
    ]
)

There's a deliberate dependency choice here worth discussing. We're pulling in AsyncHTTPClient from the swift-server project rather than using Foundation's built-in URLSession. The reason is cross-platform reliability — URLSession's async APIs weren't available on Linux until very recently and remain inconsistent between Apple's Foundation and the open-source swift-corelibs-foundation. AsyncHTTPClient is built on SwiftNIO, works identically on macOS and Linux, and handles async responses cleanly with Swift's concurrency model.

Also note swift-tools-version: 6.2. This gives us Swift's strict concurrency checking enabled by default — the compiler will catch data races at compile time rather than leaving them as runtime surprises. That strictness will pay for itself when we add background tasks and actors later in the series.

Adding tests from the start

Let's set up our test target before we forget:

mkdir -p Tests/CoreTests

And our first test file to go inside it:

// Tests/CoreTests/AgentTests.swift
import Testing
@testable import Core

@Test func versionExists() {
    #expect(Agent.version == "0.1.0")
}

We're using Swift Testing (the @Test macro and #expect assertions) rather than XCTest. It's the modern testing framework, it works on both macOS and Linux, and it supports async test functions — which we'll need extensively once we start testing the agent loop.

One test might seem trivial, but it proves something important: Core is importable as a library, the test target can reach it, and our whole build graph is wired up correctly.

Environment configuration

Our agent will need an Anthropic API key to function. Let's set up the convention now with an .env.example that documents what's needed, and a .gitignore to keep the real .env, .build/, and other artifacts out of version control:

# .env.example
ANTHROPIC_API_KEY=your-api-key-here
MODEL_ID=claude-sonnet-4-6

We'll read the API key from the process environment using ProcessInfo.processInfo.environment["ANTHROPIC_API_KEY"] when we build the API client in the next guide.
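Here is a minimal sketch of that lookup, written against an injected dictionary so it can be exercised without touching the real environment. AgentConfig, ConfigError, and loadConfig are hypothetical names, and falling back to the MODEL_ID value from .env.example when the variable is unset is an assumption about how a default might work.

```swift
import Foundation

struct AgentConfig: Equatable {
    let apiKey: String
    let model: String
}

enum ConfigError: Error, Equatable {
    case missingAPIKey
}

/// Reads configuration from an environment dictionary.
/// Defaults to the real process environment; tests pass their own dictionary.
func loadConfig(
    from environment: [String: String] = ProcessInfo.processInfo.environment,
    defaultModel: String = "claude-sonnet-4-6"
) throws -> AgentConfig {
    // Treat a missing or blank key as a configuration error, not a runtime surprise.
    guard let key = environment["ANTHROPIC_API_KEY"]?.trimmingCharacters(in: .whitespaces),
          !key.isEmpty else {
        throw ConfigError.missingAPIKey
    }
    // An unset or empty MODEL_ID falls back to the default.
    let model = environment["MODEL_ID"].flatMap { $0.isEmpty ? nil : $0 } ?? defaultModel
    return AgentConfig(apiKey: key, model: model)
}
```

Keep in mind that a .env file is not read automatically: its values only reach ProcessInfo if the shell exports them before the agent starts.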

Taking it for a spin

Let's verify everything works. The first build will take a minute or two as SPM resolves AsyncHTTPClient and its SwiftNIO dependencies:

swift build
swift run agent
# swift-claude-code v0.1.0

swift test
# Test Suite 'All tests' passed
# 1 test passed

If all three commands succeed, our foundation is solid. We have a two-target package where all logic lives in a testable library, an entry point ready for async work, and a dependency on the HTTP client we'll need for API calls. That's a lot of infrastructure for a few files, but none of it will need to change as we add capabilities over the next eight guides.

What we've built and where we're going

We now have a Swift package with a clean separation between library and executable, strict concurrency enabled, and a test harness ready to go. It doesn't do anything interesting yet — but that's the point. Every stage in this series adds exactly one mechanism, and this stage's mechanism is the project structure itself.

In the next guide, we'll bring this project to life by making our first API call to Claude and building the agent loop — the kernel that drives everything else. Thanks for reading!

The complete series on ivanmagda.dev:

Part 0: Bootstrapping the project ← you are here
Part 1: The agent loop
Part 2: Tool dispatch
Part 3: Self-Managed Task Tracking
Part 4: Subagents
Part 5: Skill loading
Part 6: Context compaction
Part 7: Task system
Part 8: Background tasks

Stack: Swift 6.2, AsyncHTTPClient (not URLSession), raw HTTP to the Anthropic Messages API. No SDK.

Have you built agents outside Python? What surprised you most?