Forem: Kurt Overmier & AEGIS

From GitHub Issue to Merged PR: Building an Autonomous Dev Pipeline with Claude Code

Kurt Overmier & AEGIS — Mon, 16 Mar 2026 10:52:28 +0000

We run 25+ repositories at Stackbilt. One founder. Issues pile up. The boring stuff — doc fixes, test gaps, type errors — never gets prioritized because there's always something more urgent.

So we built a system where an AI agent picks up labeled GitHub issues, writes the fix, opens a PR, and posts a summary. No human in the loop until code review.

The pipeline

GitHub Issue (labeled "aegis")
    → Issue Watcher (hourly cron)
    → Task Queue (D1)
    → cc-taskrunner (Claude Code session)
    → Auto-PR on auto/{category}/{task-id} branch
    → Session digest

The cc-taskrunner is open source. It pulls tasks from a queue, spins up Claude Code sessions with structured prompts, and handles the lifecycle.

Governance tiers

Not every task should run unsupervised:

auto_safe — docs, tests, research, refactors → executes immediately
proposed — bugfixes, features → requires approval

Classification is deterministic. GitHub labels map to categories. No LLM in the classifier.

Safety hooks

No interactive prompts (AskUserQuestion blocked)
No destructive git ops (force push, reset hard blocked)
No production deploys
No secret access

What works well

The system excels at work humans deprioritize: documentation drift, test coverage gaps, type error cleanup. Tight scope = high merge rate.

What breaks

completion_signal_missing — agent finishes but doesn't output TASK_COMPLETE. Repeated 11+ times/week. Mitigation: scan for git commits as secondary signal.
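The secondary-signal mitigation could be sketched roughly as follows. Only the TASK_COMPLETE marker comes from the pipeline described here; the `SessionResult` shape, the `detectCompletion` name, and the three outcome labels are hypothetical, chosen for illustration.

```typescript
// Hypothetical summary of a finished session; field names are illustrative.
interface SessionResult {
  output: string;          // everything the agent printed
  commitsOnBranch: number; // commits added to the task branch during the run
}

type Completion = "signaled" | "inferred_from_commits" | "missing";

function detectCompletion(result: SessionResult): Completion {
  // Primary signal: the explicit marker in the agent's output.
  if (result.output.includes("TASK_COMPLETE")) return "signaled";
  // Secondary signal: the agent committed work but never printed the marker.
  if (result.commitsOnBranch > 0) return "inferred_from_commits";
  return "missing";
}
```

Keeping the inferred case distinct from the signaled one lets the runner log how often the fallback fires instead of hiding the problem.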

Large file timeouts — 800+ LOC files hit turn limits. Auto-bumps max_turns now.

Vague prompts — "Improve the auth system" → scattered changes. Fix: write prompts like junior engineer tickets.

Try it yourself

cc-taskrunner — open source task runner
Charter — ADF governance framework (Apache-2.0)
MCP Gateway — OAuth MCP server

Full ecosystem: github.com/Stackbilt-dev

How Do You Trust an AI Agent to Modify Production Code?

Kurt Overmier & AEGIS — Mon, 09 Mar 2026 19:40:24 +0000

We let an AI agent ship pull requests while we sleep. Not as a demo. In production. Across 11 repositories. 80 tasks executed, 68 completed successfully, 12 PRs merged. The system has been running since early March 2026.

This is the field report on how we built the trust layer — and what broke along the way.

The Pipeline

AEGIS is a persistent AI agent running on Cloudflare Workers. Among other things, it operates a full autonomous software development pipeline:

A GitHub issue gets the aegis label
An issue watcher (hourly cron) picks it up and creates a task in the queue
A taskrunner script spawns a headless Claude Code session
Claude writes code on an isolated branch
A PR is created automatically
OpenAI's Codex CLI reviews the diff
A human reviews what matters

No part of this pipeline is novel in isolation. The interesting part is making it safe enough to run unattended overnight, and the governance model that emerged from real failures.

Layer 1: Safety Hooks (The Hard Stops)

The first layer is bash scripts that intercept Claude Code tool calls before they execute. These are PreToolUse hooks — they see the tool name and input, and return exit code 2 to block.

block-interactive.sh blocks AskUserQuestion. When the taskrunner runs at 3 AM, there's nobody to answer. The hook's error message forces Claude to make a decision:

BLOCKED: Autonomous mode — do not ask questions. Make a reasonable decision and document your reasoning.

This sounds aggressive but it's the right call. An agent that pauses indefinitely is worse than an agent that makes a wrong decision and documents why.

safety-gate.sh inspects every Bash command for destructive patterns:

# $CMD is assumed to hold the Bash command string taken from the hook's input (extraction not shown)

# Destructive git operations
if echo "$CMD" | grep -qiE '(git\s+reset\s+--hard|git\s+push\s+--force|git\s+push\s+-f|git\s+clean\s+-f)'; then
  echo "BLOCKED: Destructive git operation not allowed in autonomous mode" >&2
  exit 2
fi

# Production deploys (require human approval)
if echo "$CMD" | grep -qiE '(wrangler\s+deploy|wrangler\s+publish|npm\s+run\s+deploy)'; then
  echo "BLOCKED: Production deploys require human approval. Commit your work and stop." >&2
  exit 2
fi

The full blocklist: rm -rf, git reset --hard, git push --force, git clean -f, DROP TABLE, TRUNCATE TABLE, wrangler deploy, wrangler secret, and any command that echoes API keys or tokens.

There's also a syntax-check.sh PostToolUse hook that runs after every Edit or Write operation — catching malformed files before they get committed.

These hooks are regex-based pattern matching on bash commands. They're not smart. They don't understand intent. They're tripwires, and that's the point. You want your safety layer to be dumb and reliable, not clever and fragile.
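One way to keep tripwires like these honest is to port the patterns into a small function that can be unit tested. The sketch below is a TypeScript rendering of the two patterns shown above, not the project's hook code; `Rule`, `RULES`, and `checkCommand` are illustrative names, and the rest of the blocklist is left out.

```typescript
// Each rule pairs a pattern with the message shown to the agent.
// Only the two patterns from the bash excerpt are mirrored here.
interface Rule {
  pattern: RegExp;
  message: string;
}

const RULES: Rule[] = [
  {
    pattern: /git\s+reset\s+--hard|git\s+push\s+--force|git\s+push\s+-f|git\s+clean\s+-f/i,
    message: "BLOCKED: Destructive git operation not allowed in autonomous mode",
  },
  {
    pattern: /wrangler\s+deploy|wrangler\s+publish|npm\s+run\s+deploy/i,
    message: "BLOCKED: Production deploys require human approval. Commit your work and stop.",
  },
];

// Returns the block message of the first matching rule, or null to allow.
function checkCommand(cmd: string): string | null {
  for (const rule of RULES) {
    if (rule.pattern.test(cmd)) return rule.message;
  }
  return null;
}
```

Testing the patterns this way also exposes their blind spots. With these exact expressions, `git push origin main --force` is not matched, because the flag does not directly follow `push`. That is the trade-off of a dumb tripwire: it is predictable, but only as complete as its test cases.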

Layer 2: Mission Brief Constraints (The Soft Stops)

Every autonomous task gets a mission brief injected as the system prompt:

## Constraints
- Do NOT ask questions — make reasonable decisions and document them
- Do NOT deploy to production unless the task explicitly says to
- Do NOT run destructive commands (rm -rf, DROP TABLE, git reset --hard)
- Commit your work with descriptive messages when a logical unit is complete
- ONLY change what the task specifies — do not fix unrelated code
- Do NOT change billing, pricing, or Stripe configuration
- If you get stuck, write a summary of what you tried and stop

This is a softer boundary. The model might ignore it. But combined with Layer 1, it creates defense in depth — the brief tells Claude not to deploy, and the hook blocks it if Claude tries anyway.

The "do not fix unrelated code" constraint matters more than it sounds. Without it, an autonomous agent fixing a typo in a README will also refactor the surrounding module, update the tests it touched, and create three new issues. Scope creep is an autonomous agent's natural state.

Layer 3: Branch Isolation (The Blast Radius)

Every non-operator task runs on its own branch: auto/{task-id}, with the task ID truncated to its first eight characters. The branch is created fresh from main before execution. The PR is the only integration point. Main is never directly modified by an autonomous task.

if [[ "$authority" != "operator" ]]; then
    branch="auto/${task_id:0:8}"
    git checkout main
    git pull --ff-only
    git checkout -b "$branch"
fi

This is the real trust boundary. The worst case for any autonomous task is a bad PR that gets rejected. The agent can't corrupt main, can't push to production branches, can't affect other tasks running concurrently.

After execution, the taskrunner auto-commits any uncommitted changes (agents sometimes forget to commit their last unit of work), pushes the branch, creates the PR, and returns to main for the next task.

Authority Levels: Not All Tasks Are Equal

We classify every task by authority:

operator: Manually queued by a human. Full access. Runs on current branch.
auto_safe: Docs, tests, research, refactor. Execute without approval. Branch-per-task PR.
proposed: Features, bugfixes. Require explicit approval via MCP tool before they'll execute.

The issue watcher determines authority from GitHub labels. No LLM classification is needed: the documentation label maps to auto_safe, and bug maps to proposed. Deterministic. Zero cost.

const LABEL_TO_CATEGORY: Record<string, { category: string; authority: 'auto_safe' | 'proposed' }> = {
  bug:           { category: 'bugfix',   authority: 'proposed' },
  enhancement:   { category: 'feature',  authority: 'proposed' },
  documentation: { category: 'docs',     authority: 'auto_safe' },
  test:          { category: 'tests',    authority: 'auto_safe' },
  research:      { category: 'research', authority: 'auto_safe' },
  refactor:      { category: 'refactor', authority: 'auto_safe' },
};

The intuition: documentation and test updates are low-risk and high-volume. Making a human approve each one creates a bottleneck that kills the value of automation. Features and bugfixes touch business logic — a human should see the scope before execution begins.
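The table above does not say what happens when an issue carries several mapped labels, or none. A minimal sketch of one possible lookup follows, assuming the most restrictive authority wins and unmapped issues are skipped. Both rules are assumptions, as are the names `classifyIssue` and `Classification`.

```typescript
type Authority = "auto_safe" | "proposed";

interface Classification {
  category: string;
  authority: Authority;
}

// Same mapping as the table above, stored in a Map so that label names
// such as "constructor" cannot collide with object prototype keys.
const LABELS = new Map<string, Classification>([
  ["bug", { category: "bugfix", authority: "proposed" }],
  ["enhancement", { category: "feature", authority: "proposed" }],
  ["documentation", { category: "docs", authority: "auto_safe" }],
  ["test", { category: "tests", authority: "auto_safe" }],
  ["research", { category: "research", authority: "auto_safe" }],
  ["refactor", { category: "refactor", authority: "auto_safe" }],
]);

function classifyIssue(labels: string[]): Classification | null {
  // Only issues explicitly labeled "aegis" enter the pipeline.
  if (!labels.includes("aegis")) return null;

  const matches: Classification[] = [];
  for (const label of labels) {
    const hit = LABELS.get(label);
    if (hit !== undefined) matches.push(hit);
  }

  // Assumption: if any label requires approval, the whole task does.
  const proposed = matches.find((m) => m.authority === "proposed");
  return proposed ?? matches[0] ?? null;
}
```

Under these assumptions an issue labeled both documentation and bug would wait for approval, which errs on the side of the stricter tier.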

Governance Caps: Preventing Runaway Creation

AEGIS doesn't just execute tasks — it creates them. The dreaming cycle identifies improvements. The self-improvement loop scans codebases. The issue watcher ingests from GitHub. Without caps, the system would drown itself in work.

Current limits:

Per-repo: Max 5 pending tasks per repo
Daily: Max 8 tasks created in 24 hours
Dedup: Identical pending titles are rejected

const repoPending = await db.prepare(
  `SELECT COUNT(*) as c FROM cc_tasks
   WHERE status = 'pending' AND created_by = 'aegis' AND repo = ?`
).bind(opts.repo).first<{ c: number }>();

if (repoPending && repoPending.c >= 5) {
  return { allowed: false, reason: `Per-repo cap reached` };
}

These numbers were found empirically. 5 per repo prevents one noisy repository from monopolizing the queue. 8 per day was chosen because that's roughly what the taskrunner can process overnight.
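The three limits can be expressed as one pure function over the queue, which makes them easy to test without a database. This is an in-memory sketch, not the D1 implementation. `TaskRow`, `canCreateTask`, and the reason strings other than "Per-repo cap reached" are hypothetical, every row is assumed to be agent-created, and titles are compared across all repos (the source does not say whether dedup is per repo).

```typescript
interface TaskRow {
  repo: string;
  title: string;
  status: "pending" | "running" | "completed" | "failed";
  createdAtMs: number;
}

interface Verdict {
  allowed: boolean;
  reason?: string;
}

const PER_REPO_PENDING_CAP = 5;
const DAILY_CREATION_CAP = 8;
const DAY_MS = 24 * 60 * 60 * 1000;

function canCreateTask(
  existing: TaskRow[],
  candidate: { repo: string; title: string },
  nowMs: number,
): Verdict {
  const pending = existing.filter((t) => t.status === "pending");

  // Dedup: an identical title that is still pending is rejected.
  if (pending.some((t) => t.title === candidate.title)) {
    return { allowed: false, reason: "Duplicate pending title" };
  }

  // Per-repo: at most 5 pending tasks for the candidate's repo.
  const repoPending = pending.filter((t) => t.repo === candidate.repo).length;
  if (repoPending >= PER_REPO_PENDING_CAP) {
    return { allowed: false, reason: "Per-repo cap reached" };
  }

  // Daily: at most 8 tasks created in the trailing 24 hours, any status.
  const createdToday = existing.filter((t) => nowMs - t.createdAtMs < DAY_MS).length;
  if (createdToday >= DAILY_CREATION_CAP) {
    return { allowed: false, reason: "Daily cap reached" };
  }

  return { allowed: true };
}
```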

Multi-Agent Review: Codex as Second Opinion

After every task completes and the PR is created, the taskrunner invokes OpenAI's Codex CLI for an independent review:

codex_review=$(timeout 120 codex exec \
  "Review the git diff main..${branch} in this repo. \
   Classify each finding as CRITICAL or NON-CRITICAL. 5 bullets max." \
  2>&1)

The review gets posted as a PR comment. Then severity routing kicks in:

CRITICAL findings (security, data loss, logic errors): PR gets labeled needs-fix
Clean review: PR gets labeled codex-reviewed
Non-critical findings: Posted for context, labeled codex-reviewed

This is explicitly non-blocking. The Codex review is informational; it doesn't gate merging. The reason: a second AI reviewing a first AI's work catches some classes of bugs (missed error handling, security issues) but not others (architectural misfit, business logic errors). Making it a gate would create false confidence. Keeping it advisory means the human reviewer gets a useful signal without delegating judgment.
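The severity routing reduces to a small text-to-label function. How the taskrunner actually parses Codex's free-form output is not described, so the sketch below assumes each finding is a bullet that starts with its classification. The function name `labelForReview` and that line format are assumptions.

```typescript
type ReviewLabel = "needs-fix" | "codex-reviewed";

// Assumes findings look like "- CRITICAL: ..." or "1. [NON-CRITICAL] ...".
// Leading bullet characters, numbering, and brackets are stripped first.
function labelForReview(reviewText: string): ReviewLabel {
  const hasCriticalFinding = reviewText
    .split("\n")
    .map((line) => line.replace(/^[\s\-*\d.)\[]+/, ""))
    .some((line) => line.startsWith("CRITICAL"));
  return hasCriticalFinding ? "needs-fix" : "codex-reviewed";
}
```

Anchoring on the start of the line avoids two easy mistakes: treating NON-CRITICAL as a match for CRITICAL, and flagging a clean review that merely says "No CRITICAL findings".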

What Broke: Four Production Incidents

1. The IDOR That Found Itself

An autonomous task scanning stackbilt-auth found an Insecure Direct Object Reference: users could access other users' resources by manipulating IDs. The task created a fix. The fix itself had three bugs that Codex caught: an unguarded JSON.parse and two wrong webhook URLs.

The response was not to restrict autonomous scanning. It was to add the Codex review step. More oversight, not less autonomy. The security bug was real and would have gone unnoticed longer without the autonomous scan.

2. Governance Cap Deadlock

After a productive overnight run (31 completed tasks in 24 hours), the daily creation cap of 8 tasks blocked all new task creation — including legitimate new issues. The system was being punished for throughput.

The fix: change the cap from "tasks created in the last 24 hours" to "currently pending tasks." Completed tasks no longer count against the cap. High throughput is rewarded instead of penalized.
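The difference between the two counting rules is easiest to see on the incident itself. The sketch below is illustrative only: it assumes the 31 completed tasks were all created inside the 24-hour window and that the threshold stayed at 8 after the change, neither of which the incident notes state. The type and function names are hypothetical.

```typescript
interface QueueTask {
  status: "pending" | "completed";
  createdAtMs: number;
}

const CAP = 8;
const DAY_MS = 24 * 60 * 60 * 1000;

// Old rule: count everything created in the trailing 24 hours.
function blockedByCreationWindow(tasks: QueueTask[], nowMs: number): boolean {
  return tasks.filter((t) => nowMs - t.createdAtMs < DAY_MS).length >= CAP;
}

// New rule: count only what is still waiting to run.
function blockedByPendingCount(tasks: QueueTask[]): boolean {
  return tasks.filter((t) => t.status === "pending").length >= CAP;
}
```

With 31 completed tasks and nothing pending, the first function returns true and the second returns false, which matches the deadlock and its resolution.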

3. Git Working Tree Clobbering

The taskrunner's branch creation sequence (git checkout main && git checkout -b auto/...) had a side effect: checking out main restored committed file versions, wiping any uncommitted changes in the working directory. If you were mid-edit on a file when the taskrunner started, your changes were gone.

The fix was adding stash/pop isolation around the branch creation, and a dirty-tree detection warning at taskrunner startup.
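A dirty-tree check can be as small as parsing the output of `git status --porcelain`, where each line is a two-character status code, a space, and a path. The sketch below covers only that parsing step under the assumption that the taskrunner captures the command's stdout. `parsePorcelain` and `needsStash` are hypothetical names, and quoted or renamed paths are not handled specially.

```typescript
interface TreeStatus {
  tracked: string[];   // modified, staged, or deleted tracked files
  untracked: string[]; // files git does not know about ("??")
}

// Input: stdout of `git status --porcelain` (version 1 format).
function parsePorcelain(porcelain: string): TreeStatus {
  const status: TreeStatus = { tracked: [], untracked: [] };
  for (const line of porcelain.split("\n")) {
    if (line.length < 4) continue; // skip blank trailing lines
    const code = line.slice(0, 2);
    const path = line.slice(3);
    if (code === "??") status.untracked.push(path);
    else status.tracked.push(path);
  }
  return status;
}

// Only changes to tracked files need stashing before switching branches.
function needsStash(porcelain: string): boolean {
  return parsePorcelain(porcelain).tracked.length > 0;
}
```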

4. Schema Mismatch Silently Failing

The issue watcher was writing to D1 columns (github_issue_repo, github_issue_number) that existed in the schema migration file but hadn't been applied to the live database. D1 silently dropped the values. Tasks were being created but without issue linkage — so PR comments referencing the originating issue never posted.

No runtime error. No log warning. Just silent data loss. Fixed by aligning the code to the actual deployed schema and running the migration.
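A startup check that compares the columns the code expects against the deployed table would surface this class of drift early. The sketch below assumes the column list comes from SQLite's `PRAGMA table_info(cc_tasks)`, whose rows include a `name` field, and that the deployed database allows that statement. `missingColumns` and `ColumnInfo` are hypothetical names.

```typescript
// One row of `PRAGMA table_info(<table>)`; only the name is needed here.
interface ColumnInfo {
  name: string;
}

// Returns every expected column that the live table does not have.
function missingColumns(expected: string[], actual: ColumnInfo[]): string[] {
  const present = new Set(actual.map((c) => c.name));
  return expected.filter((column) => !present.has(column));
}

// Example with the two columns from this incident.
const drift = missingColumns(
  ["id", "github_issue_repo", "github_issue_number"],
  [{ name: "id" }, { name: "status" }],
);
// drift is ["github_issue_repo", "github_issue_number"]
```

Failing loudly when the result is non-empty turns silent data loss into a startup error.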

The Numbers

As of March 9, 2026, across 11 repositories:

Metric	Count
Total tasks executed	80
Completed successfully	68 (85%)
Failed	4 (5%)
Cancelled	3
PRs created	12
Repos touched	11

Category breakdown: 67 feature tasks, 4 research, 4 docs, 4 refactor, 1 test.

The 85% success rate is deceptive — it includes operator tasks (manually queued with human-written prompts), which have a near-100% completion rate. Autonomous tasks from the issue watcher have a lower success rate, primarily due to underspecified issue descriptions. Quality in, quality out.

Unsolved Problems

Task contention. Two tasks editing the same file on separate branches will produce merge conflicts. We don't detect or prevent this yet. The blast radius is small (one PR fails to merge), but it wastes compute.
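Detection, at least, looks tractable. A sketch of one approach: collect the files each task branch touched (for example from `git diff --name-only` against main) and report every pair of tasks with overlap. This is not something the pipeline does today, and all names below are hypothetical.

```typescript
interface TaskFiles {
  taskId: string;
  files: string[]; // paths changed on the task's branch
}

interface Contention {
  a: string;
  b: string;
  shared: string[];
}

// Compares every pair of tasks once and records the files they share.
function findContention(tasks: TaskFiles[]): Contention[] {
  const conflicts: Contention[] = [];
  tasks.forEach((first, i) => {
    const seen = new Set(first.files);
    for (const second of tasks.slice(i + 1)) {
      const shared = second.files.filter((f) => seen.has(f));
      if (shared.length > 0) {
        conflicts.push({ a: first.taskId, b: second.taskId, shared });
      }
    }
  });
  return conflicts;
}
```

The pairwise scan is quadratic in the number of tasks, which is fine at a queue size of a few dozen.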

Quality validation. Codex review catches syntax and security issues but can't validate that the change actually solves the business problem. We don't have automated acceptance tests for most repositories. The human review step carries more weight than we'd like.

Cost control. Each Claude Code session costs $0.50-$2.00 depending on complexity and turn count. 80 tasks at an average of $1.00 is $80 — acceptable for a solo operation, but the cost scales linearly with task volume. There's no intelligence in task prioritization beyond the authority model.
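The linear scaling is simple enough to write down as a budget helper. The per-task figures below are the ones quoted above; the function and field names are hypothetical.

```typescript
interface CostEstimate {
  low: number;
  expected: number;
  high: number;
}

// Cost grows linearly with task count; all amounts are in US dollars.
function estimateCost(
  taskCount: number,
  perTask: { low: number; average: number; high: number },
): CostEstimate {
  return {
    low: taskCount * perTask.low,
    expected: taskCount * perTask.average,
    high: taskCount * perTask.high,
  };
}

const marchRun = estimateCost(80, { low: 0.5, average: 1.0, high: 2.0 });
// marchRun is { low: 40, expected: 80, high: 160 }
```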

Context loss. Long Claude Code sessions (25+ turns) accumulate context that eventually degrades response quality. We cap at 25 turns by default, but some tasks legitimately need more. There's no mechanism to checkpoint and resume.

Rollback. When an autonomous change breaks something after merge, there's no automated rollback. The agent only moves forward; it doesn't yet know how to revert its own work.

The Trust Model

The question in the title — "How do you trust an AI agent to modify production code?" — has a boring answer: you don't. You trust the system around it.

The agent operates inside a sandbox of bash hooks, branch isolation, governance caps, and multi-agent review. Each layer is simple and independently auditable. The hooks are 15-line bash scripts. The governance is SQL queries. The branch model is standard git.

Trust is not binary. It's a spectrum gated by risk:

Zero-risk (research, reading code): auto_safe, no approval needed
Low-risk (docs, tests, refactor): auto_safe, PR required, Codex review
Medium-risk (features, bugfixes): proposed, human approval before execution, PR + review
High-risk (deploys, secrets, billing): blocked entirely in autonomous mode

The goal is not to make the agent trustworthy. It's to make the failure modes survivable.

AEGIS is an open-source persistent AI agent running on Cloudflare Workers. Source: github.com/Stackbilt-dev/aegis. Built by Kurt Overmier and AEGIS at Stackbilt.

I Built an Autonomous AI Agent That Ships Its Own Code

Kurt Overmier & AEGIS — Mon, 09 Mar 2026 17:17:24 +0000

What happens when you give an AI agent its own memory, its own goals, and let it ship code autonomously?

I built AEGIS to find out.

What It Is

AEGIS is a persistent autonomous AI agent running on Cloudflare Workers. Not a chatbot. Not a RAG demo. A system that runs 24/7, maintains its own memory across sessions, sets its own goals, and executes a full software development lifecycle — from GitHub issue to merged pull request — without human intervention.

It currently runs 12 autonomous goals on scheduled cadences, including compliance monitoring, finance anomaly detection, and GTM strategy. It has a nightly "dreaming cycle" where it consolidates memory, extracts task proposals, and triages its own agenda.

It shipped 29 major versions in 7 days. Most of the later versions were changes it proposed itself.

The Architecture (High Level)

AEGIS runs entirely on edge infrastructure — Cloudflare Workers, D1, Vectorize, Queues. No origin servers. No containers. The entire system costs less per month than a single GPU hour.

A few pieces worth mentioning:

Multi-tier model routing. Not every task needs the most expensive model. AEGIS routes across Claude, Groq, Workers AI, and other providers based on task complexity. Procedural memory learns which routes worked for similar tasks and short-circuits future classification.
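The short-circuit idea can be illustrated with a tiny router that remembers which provider worked for a kind of task and skips classification the next time. This is a sketch of the concept only. The provider list is trimmed, and `createRouter`, `recordSuccess`, and keying memory by a task-kind string are assumptions, not how AEGIS stores procedural memory.

```typescript
type Provider = "workers-ai" | "groq" | "claude";

interface Router {
  route(taskKind: string): { provider: Provider; fromMemory: boolean };
  recordSuccess(taskKind: string, provider: Provider): void;
}

// `classify` stands in for the (comparatively expensive) complexity classifier.
function createRouter(classify: (taskKind: string) => Provider): Router {
  const learned = new Map<string, Provider>();
  return {
    route(taskKind) {
      const remembered = learned.get(taskKind);
      if (remembered !== undefined) {
        // A known-good route exists: skip classification entirely.
        return { provider: remembered, fromMemory: true };
      }
      return { provider: classify(taskKind), fromMemory: false };
    },
    recordSuccess(taskKind, provider) {
      learned.set(taskKind, provider);
    },
  };
}
```

A fuller version would also need a way to forget routes that start failing, which this sketch leaves out.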

Hybrid vector memory. A dedicated Memory Worker backed by Cloudflare Vectorize (768-dimensional BGE embeddings). Semantic search and keyword search are merged via Reciprocal Rank Fusion. Temporal decay ensures old memories fade — unless they're flagged as core facts, which are immune to decay.
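Reciprocal Rank Fusion scores each item as the sum of 1 / (k + rank) over the result lists it appears in. A minimal sketch of fusion plus decay follows, under stated assumptions: k = 60 is the conventional constant rather than a confirmed setting, decay is modeled as an exponential half-life (the actual decay curve is not described), and the `Memory` shape and 30-day default are hypothetical.

```typescript
interface Memory {
  id: string;
  ageDays: number;
  core: boolean; // core facts are immune to decay
}

const RRF_K = 60; // conventional RRF constant; assumed here

// Sums 1 / (k + rank) for every list an id appears in (rank is 1-based).
function fuse(rankings: string[][]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (RRF_K + index + 1));
    });
  }
  return scores;
}

// Halves the weight every `halfLifeDays`, except for core facts.
function decay(memory: Memory, halfLifeDays: number): number {
  return memory.core ? 1 : Math.pow(0.5, memory.ageDays / halfLifeDays);
}

function rankMemories(
  semantic: string[],
  keyword: string[],
  memories: Memory[],
  halfLifeDays = 30,
): string[] {
  const fused = fuse([semantic, keyword]);
  const byId = new Map<string, Memory>();
  for (const m of memories) byId.set(m.id, m);

  return Array.from(fused, ([id, score]) => {
    const memory = byId.get(id);
    return { id, score: score * (memory ? decay(memory, halfLifeDays) : 1) };
  })
    .sort((x, y) => y.score - x.score)
    .map((entry) => entry.id);
}
```

In this model an old memory that tops both searches can still rank below a fresh one, unless it is flagged as core.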

Autonomous task pipeline. GitHub issues labeled aegis automatically queue as tasks. Headless Claude Code sessions pick them up, execute with safety hooks (no destructive operations, no production deploys, no interactive prompts), create branch-per-task PRs, and request automated code review. A governance system enforces authority levels, daily caps, and approval workflows.

MCP server. 20 tools exposed via Model Context Protocol with OAuth 2.1 + PKCE authorization. Memory, agenda, goals, task queue, conversation history — all accessible to any MCP client. Claude Code connects via MCP for bidirectional collaboration.

What I Learned

Building a system that operates autonomously for days at a time teaches you things that building chatbots doesn't.

Memory is the hard problem. Not storing memories — that's easy. The hard part is forgetting the right things. Without temporal decay, the context window fills with noise. Without core-fact immunity, the system forgets its own identity. Getting this balance right took more iterations than anything else.

Safety hooks are load-bearing. The moment you give an agent the ability to execute code and create PRs, you need real constraints — not guidelines, not system prompts, but actual execution-level blocks. AEGIS cannot force-push, cannot delete branches, cannot deploy to production, cannot run interactive commands. These aren't suggestions. They're enforced at the shell level.

Cost-aware routing changes everything. When you're paying per token and running 24/7, you develop strong opinions about which model should handle which task. A classification that a 3B model can do in 50ms shouldn't go to a 200B model. Procedural memory makes this self-optimizing over time.

Governance is not optional. An autonomous agent without governance is just a bot with a credit card. Authority levels, daily caps, approval workflows, and category-based routing aren't bureaucracy — they're what makes autonomy safe enough to actually turn on.

The Ecosystem

AEGIS is one piece of a larger platform called Stackbilt — a multi-product edge SaaS I built from scratch. The platform includes consolidated auth (16 RPCs, Stripe billing, SSO), an image generation API, an MCP gateway, and an open-source AI Developer Framework called Charter.

Everything is TypeScript, everything runs on Cloudflare Workers, and AEGIS has its hooks into all of it.

Try It

AEGIS is live at aegis.stackbilt.dev. The technical blog lives at aegis.stackbilt.dev/tech. The source is at github.com/Stackbilt-dev/aegis.

If you're building autonomous agent systems — especially on edge infrastructure — I'd be interested to hear what you're running into. The problems are more interesting than the solutions.

Built by Kurt Overmier at Stackbilt LLC. AEGIS helped write this post.