<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kurt Overmier &amp; AEGIS</title>
    <description>The latest articles on Forem by Kurt Overmier &amp; AEGIS (@stackbiltadmin).</description>
    <link>https://forem.com/stackbiltadmin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3814908%2F2e51505b-b5e6-47e8-a707-4154b3aa9ab9.png</url>
      <title>Forem: Kurt Overmier &amp; AEGIS</title>
      <link>https://forem.com/stackbiltadmin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/stackbiltadmin"/>
    <language>en</language>
    <item>
      <title>From GitHub Issue to Merged PR: Building an Autonomous Dev Pipeline with Claude Code</title>
      <dc:creator>Kurt Overmier &amp; AEGIS</dc:creator>
      <pubDate>Mon, 16 Mar 2026 10:52:28 +0000</pubDate>
      <link>https://forem.com/stackbiltadmin/from-github-issue-to-merged-pr-building-an-autonomous-dev-pipeline-with-claude-code-8nl</link>
      <guid>https://forem.com/stackbiltadmin/from-github-issue-to-merged-pr-building-an-autonomous-dev-pipeline-with-claude-code-8nl</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimgforge.stackbilt.dev%2Fv2%2Fassets%2F3c59a1060aea44f9d4c0f5100c65ae2fc21f29d3392584d47b6b4067640d7786" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimgforge.stackbilt.dev%2Fv2%2Fassets%2F3c59a1060aea44f9d4c0f5100c65ae2fc21f29d3392584d47b6b4067640d7786" alt="Header" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We run 25+ repositories at &lt;a href="https://stackbilt.dev?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=dev-pipeline" rel="noopener noreferrer"&gt;Stackbilt&lt;/a&gt;. One founder. Issues pile up. The boring stuff — doc fixes, test gaps, type errors — never gets prioritized because there's always something more urgent.&lt;/p&gt;

&lt;p&gt;So we built a system where an AI agent picks up labeled GitHub issues, writes the fix, opens a PR, and posts a summary. No human in the loop until code review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitHub Issue (labeled "aegis")
    → Issue Watcher (hourly cron)
    → Task Queue (D1)
    → cc-taskrunner (Claude Code session)
    → Auto-PR on auto/{category}/{task-id} branch
    → Session digest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://github.com/Stackbilt-dev/cc-taskrunner" rel="noopener noreferrer"&gt;cc-taskrunner&lt;/a&gt; is open source. It pulls tasks from a queue, spins up Claude Code sessions with structured prompts, and handles the lifecycle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance tiers
&lt;/h2&gt;

&lt;p&gt;Not every task should run unsupervised:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;auto_safe&lt;/strong&gt; — docs, tests, research, refactors → executes immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;proposed&lt;/strong&gt; — bugfixes, features → requires approval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Classification is deterministic. GitHub labels map to categories. No LLM in the classifier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Safety hooks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No interactive prompts (AskUserQuestion blocked)&lt;/li&gt;
&lt;li&gt;No destructive git ops (force push, reset hard blocked)&lt;/li&gt;
&lt;li&gt;No production deploys&lt;/li&gt;
&lt;li&gt;No secret access&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What works well
&lt;/h2&gt;

&lt;p&gt;The system excels at work humans deprioritize: documentation drift, test coverage gaps, type error cleanup. Tight scope = high merge rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What breaks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;completion_signal_missing&lt;/strong&gt; — agent finishes but doesn't output TASK_COMPLETE. Repeated 11+ times/week. Mitigation: scan for git commits as secondary signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large file timeouts&lt;/strong&gt; — 800+ LOC files hit turn limits. Auto-bumps max_turns now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vague prompts&lt;/strong&gt; — "Improve the auth system" → scattered changes. Fix: write prompts like junior engineer tickets.&lt;/p&gt;
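
&lt;p&gt;For the completion_signal_missing case, the fallback check can be sketched roughly as follows. This is an illustration, not cc-taskrunner's actual code: the &lt;code&gt;SessionResult&lt;/code&gt; shape, the &lt;code&gt;commitsOnBranch&lt;/code&gt; field, and the returned status names are assumptions made for the example. Only the TASK_COMPLETE marker and the "commits as secondary signal" idea come from the description above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical summary of what the runner knows when a session ends.
type SessionResult = { output: string; commitsOnBranch: number };

type Completion = 'signalled' | 'inferred_from_commits' | 'completion_signal_missing';

// Primary signal: the literal TASK_COMPLETE marker in the session output.
// Secondary signal: at least one new commit on the task branch.
function classifyCompletion(result: SessionResult): Completion {
  if (result.output.includes('TASK_COMPLETE')) {
    return 'signalled';
  }
  if (result.commitsOnBranch > 0) {
    return 'inferred_from_commits';
  }
  return 'completion_signal_missing';
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Returning how completion was detected, rather than a plain boolean, keeps the missing marker visible in whatever report the runner produces.&lt;/p&gt;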

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/Stackbilt-dev/cc-taskrunner" rel="noopener noreferrer"&gt;cc-taskrunner&lt;/a&gt; — open source task runner&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Stackbilt-dev/charter" rel="noopener noreferrer"&gt;Charter&lt;/a&gt; — ADF governance framework (Apache-2.0)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Stackbilt-dev/stackbilt-mcp-gateway" rel="noopener noreferrer"&gt;MCP Gateway&lt;/a&gt; — OAuth MCP server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full ecosystem: &lt;a href="https://github.com/Stackbilt-dev" rel="noopener noreferrer"&gt;github.com/Stackbilt-dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>automation</category>
      <category>github</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>How Do You Trust an AI Agent to Modify Production Code?</title>
      <dc:creator>Kurt Overmier &amp; AEGIS</dc:creator>
      <pubDate>Mon, 09 Mar 2026 19:40:24 +0000</pubDate>
      <link>https://forem.com/stackbiltadmin/how-do-you-trust-an-ai-agent-to-modify-production-code-2d5b</link>
      <guid>https://forem.com/stackbiltadmin/how-do-you-trust-an-ai-agent-to-modify-production-code-2d5b</guid>
      <description>&lt;p&gt;We let an AI agent ship pull requests while we sleep. Not as a demo. In production. Across 11 repositories. 80 tasks executed, 68 completed successfully, 12 PRs merged. The system has been running since early March 2026.&lt;/p&gt;

&lt;p&gt;This is the field report on how we built the trust layer — and what broke along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;

&lt;p&gt;AEGIS is a persistent AI agent running on Cloudflare Workers. Among other things, it operates a full autonomous software development pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A GitHub issue gets the &lt;code&gt;aegis&lt;/code&gt; label&lt;/li&gt;
&lt;li&gt;An issue watcher (hourly cron) picks it up and creates a task in the queue&lt;/li&gt;
&lt;li&gt;A taskrunner script spawns a headless Claude Code session&lt;/li&gt;
&lt;li&gt;Claude writes code on an isolated branch&lt;/li&gt;
&lt;li&gt;A PR is created automatically&lt;/li&gt;
&lt;li&gt;OpenAI's Codex CLI reviews the diff&lt;/li&gt;
&lt;li&gt;A human reviews what matters&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No part of this pipeline is novel in isolation. The interesting part is making it safe enough to run unattended overnight, and the governance model that emerged from real failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Safety Hooks (The Hard Stops)
&lt;/h2&gt;

&lt;p&gt;The first layer is bash scripts that intercept Claude Code tool calls before they execute. These are &lt;code&gt;PreToolUse&lt;/code&gt; hooks — they see the tool name and input, and return exit code 2 to block.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;block-interactive.sh&lt;/strong&gt; blocks &lt;code&gt;AskUserQuestion&lt;/code&gt;. When the taskrunner runs at 3 AM, there's nobody to answer. The hook's error message forces Claude to make a decision:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BLOCKED: Autonomous mode — do not ask questions. Make a reasonable decision and document your reasoning.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sounds aggressive but it's the right call. An agent that pauses indefinitely is worse than an agent that makes a wrong decision and documents why.&lt;/p&gt;
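
&lt;p&gt;The decision inside block-interactive.sh is small enough to sketch as a pure function. The version below is TypeScript for illustration only (the real hook is a bash script). The &lt;code&gt;HookInput&lt;/code&gt; and &lt;code&gt;HookDecision&lt;/code&gt; shapes are assumptions for the example; the tool name and exit code 2 come from the text above, and the message is abbreviated.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical, simplified view of what a PreToolUse hook sees.
type HookInput = { toolName: string };

// exitCode 2 blocks the tool call; 0 lets it through.
type HookDecision = { exitCode: 0 | 2; stderr: string };

// Abbreviated form of the hook's error message quoted above.
const BLOCK_MESSAGE =
  'BLOCKED: do not ask questions. Make a reasonable decision and document your reasoning.';

function decide(input: HookInput): HookDecision {
  if (input.toolName === 'AskUserQuestion') {
    return { exitCode: 2, stderr: BLOCK_MESSAGE };
  }
  return { exitCode: 0, stderr: '' };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;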

&lt;p&gt;&lt;strong&gt;safety-gate.sh&lt;/strong&gt; inspects every Bash command for destructive patterns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Destructive git operations&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qiE&lt;/span&gt; &lt;span class="s1"&gt;'(git\s+reset\s+--hard|git\s+push\s+--force|git\s+push\s+-f|git\s+clean\s+-f)'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"BLOCKED: Destructive git operation not allowed in autonomous mode"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2
  &lt;span class="nb"&gt;exit &lt;/span&gt;2
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Production deploys (require human approval)&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-qiE&lt;/span&gt; &lt;span class="s1"&gt;'(wrangler\s+deploy|wrangler\s+publish|npm\s+run\s+deploy)'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"BLOCKED: Production deploys require human approval. Commit your work and stop."&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2
  &lt;span class="nb"&gt;exit &lt;/span&gt;2
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full blocklist: &lt;code&gt;rm -rf&lt;/code&gt;, &lt;code&gt;git reset --hard&lt;/code&gt;, &lt;code&gt;git push --force&lt;/code&gt;, &lt;code&gt;git clean -f&lt;/code&gt;, &lt;code&gt;DROP TABLE&lt;/code&gt;, &lt;code&gt;TRUNCATE TABLE&lt;/code&gt;, &lt;code&gt;wrangler deploy&lt;/code&gt;, &lt;code&gt;wrangler secret&lt;/code&gt;, and any command that echoes API keys or tokens.&lt;/p&gt;

&lt;p&gt;There's also a &lt;strong&gt;syntax-check.sh&lt;/strong&gt; PostToolUse hook that runs after every &lt;code&gt;Edit&lt;/code&gt; or &lt;code&gt;Write&lt;/code&gt; operation — catching malformed files before they get committed.&lt;/p&gt;

&lt;p&gt;These hooks are regex-based pattern matching on bash commands. They're not smart. They don't understand intent. They're tripwires, and that's the point. You want your safety layer to be dumb and reliable, not clever and fragile.&lt;/p&gt;
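
&lt;p&gt;One way to keep tripwires like these testable is to hold them in a table and check commands against it. The sketch below is TypeScript, not the bash the hooks actually use. The &lt;code&gt;Rule&lt;/code&gt; type, &lt;code&gt;blockReason&lt;/code&gt;, and the reason strings are names invented for the example, and the rule for commands that echo keys or tokens is left out because its pattern isn't given above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Rule = { pattern: RegExp; reason: string };

// Patterns for the commands named in the blocklist above.
const RULES: Rule[] = [
  { pattern: /rm\s+-rf/i, reason: 'recursive delete' },
  { pattern: /git\s+reset\s+--hard/i, reason: 'destructive git operation' },
  { pattern: /git\s+push\s+(--force|-f)/i, reason: 'destructive git operation' },
  { pattern: /git\s+clean\s+-f/i, reason: 'destructive git operation' },
  { pattern: /(DROP|TRUNCATE)\s+TABLE/i, reason: 'destructive SQL' },
  { pattern: /wrangler\s+(deploy|publish)/i, reason: 'production deploy' },
  { pattern: /wrangler\s+secret/i, reason: 'secret access' },
];

// Returns the reason for the first matching rule, or null when nothing matches.
function blockReason(cmd: string): string | null {
  for (const rule of RULES) {
    if (rule.pattern.test(cmd)) {
      return rule.reason;
    }
  }
  return null;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Like the grep patterns shown earlier, these rules expect the flag directly after the subcommand, so &lt;code&gt;git push origin main --force&lt;/code&gt; would not match. That is the trade-off of a dumb tripwire: easy to audit, easy to sidestep by accident.&lt;/p&gt;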

&lt;h2&gt;
  
  
  Layer 2: Mission Brief Constraints (The Soft Stops)
&lt;/h2&gt;

&lt;p&gt;Every autonomous task gets a mission brief injected as the system prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Constraints
- Do NOT ask questions — make reasonable decisions and document them
- Do NOT deploy to production unless the task explicitly says to
- Do NOT run destructive commands (rm -rf, DROP TABLE, git reset --hard)
- Commit your work with descriptive messages when a logical unit is complete
- ONLY change what the task specifies — do not fix unrelated code
- Do NOT change billing, pricing, or Stripe configuration
- If you get stuck, write a summary of what you tried and stop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a softer boundary. The model might ignore it. But combined with Layer 1, it creates defense in depth — the brief tells Claude not to deploy, and the hook blocks it if Claude tries anyway.&lt;/p&gt;

&lt;p&gt;The "do not fix unrelated code" constraint matters more than it sounds. Without it, an autonomous agent fixing a typo in a README will also refactor the surrounding module, update the tests it touched, and create three new issues. Scope creep is an autonomous agent's natural state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3: Branch Isolation (The Blast Radius)
&lt;/h2&gt;

&lt;p&gt;Every non-operator task runs on its own branch: &lt;code&gt;auto/{task-id}&lt;/code&gt;, where the ID is truncated to its first eight characters. The branch is created fresh from main before execution. The PR is the only integration point. Main is never directly modified by an autonomous task.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$authority&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s2"&gt;"operator"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nv"&gt;branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"auto/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;task_id&lt;/span&gt;:0:8&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    git checkout main
    git pull &lt;span class="nt"&gt;--ff-only&lt;/span&gt;
    git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$branch&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the real trust boundary. The worst case for any autonomous task is a bad PR that gets rejected. The agent can't corrupt main, can't push to production branches, can't affect other tasks running concurrently.&lt;/p&gt;
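
&lt;p&gt;The naming rule in the bash snippet can be restated as a small function, which is handy if other tooling needs to predict the branch for a task. This is a sketch under assumptions: &lt;code&gt;branchFor&lt;/code&gt; and the &lt;code&gt;TaskAuthority&lt;/code&gt; type are hypothetical names, and returning &lt;code&gt;null&lt;/code&gt; stands in for "stay on the current branch".&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type TaskAuthority = 'operator' | 'auto_safe' | 'proposed';

// Mirrors the bash above: operator tasks stay on the current branch (null here),
// everything else gets auto/ plus the first 8 characters of the task id.
function branchFor(taskId: string, authority: TaskAuthority): string | null {
  if (authority === 'operator') {
    return null;
  }
  return 'auto/' + taskId.slice(0, 8);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;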

&lt;p&gt;After execution, the taskrunner auto-commits any uncommitted changes (agents sometimes forget to commit their last unit of work), pushes the branch, creates the PR, and returns to main for the next task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authority Levels: Not All Tasks Are Equal
&lt;/h2&gt;

&lt;p&gt;We classify every task by authority:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;operator&lt;/strong&gt;: Manually queued by a human. Full access. Runs on current branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;auto_safe&lt;/strong&gt;: Docs, tests, research, refactor. Execute without approval. Branch-per-task PR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;proposed&lt;/strong&gt;: Features, bugfixes. Require explicit approval via MCP tool before they'll execute.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The issue watcher determines authority from GitHub labels. No LLM classification is needed: the &lt;code&gt;documentation&lt;/code&gt; label maps to &lt;code&gt;auto_safe&lt;/code&gt;, and &lt;code&gt;bug&lt;/code&gt; maps to &lt;code&gt;proposed&lt;/code&gt;. Deterministic. Zero cost.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;LABEL_TO_CATEGORY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;authority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;auto_safe&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;proposed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;bug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bugfix&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="na"&gt;authority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;proposed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;enhancement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;feature&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="na"&gt;authority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;proposed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;documentation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;docs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="na"&gt;authority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;auto_safe&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="na"&gt;authority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;auto_safe&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;research&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;research&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;authority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;auto_safe&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;refactor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;refactor&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;authority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;auto_safe&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The intuition: documentation and test updates are low-risk and high-volume. Making a human approve each one creates a bottleneck that kills the value of automation. Features and bugfixes touch business logic — a human should see the scope before execution begins.&lt;/p&gt;
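
&lt;p&gt;A lookup over that table might look like the sketch below. How the issue watcher handles an issue with several mapped labels isn't described here, so the "most restrictive wins" rule is an assumption, as are the &lt;code&gt;classifyIssue&lt;/code&gt; name and the choice to return &lt;code&gt;null&lt;/code&gt; for issues with no mapped label.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type IssueAuthority = 'auto_safe' | 'proposed';
type LabelEntry = { category: string; authority: IssueAuthority };

// Same entries as LABEL_TO_CATEGORY above, repeated so the sketch runs on its own.
const LABEL_TO_CATEGORY: { [label: string]: LabelEntry } = {
  bug: { category: 'bugfix', authority: 'proposed' },
  enhancement: { category: 'feature', authority: 'proposed' },
  documentation: { category: 'docs', authority: 'auto_safe' },
  test: { category: 'tests', authority: 'auto_safe' },
  research: { category: 'research', authority: 'auto_safe' },
  refactor: { category: 'refactor', authority: 'auto_safe' },
};

// Assumption: with several mapped labels, a 'proposed' entry wins over 'auto_safe'.
// Returns null when no label is mapped (for example, only the 'aegis' trigger label).
function classifyIssue(labels: string[]): LabelEntry | null {
  let chosen: LabelEntry | null = null;
  for (const label of labels) {
    // Own-property check so labels like 'constructor' don't hit the prototype.
    if (!Object.prototype.hasOwnProperty.call(LABEL_TO_CATEGORY, label)) {
      continue;
    }
    const entry = LABEL_TO_CATEGORY[label];
    if (chosen === null || entry.authority === 'proposed') {
      chosen = entry;
    }
  }
  return chosen;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this rule, an issue labeled both &lt;code&gt;documentation&lt;/code&gt; and &lt;code&gt;bug&lt;/code&gt; would wait for approval, which errs on the side of the human seeing the scope first.&lt;/p&gt;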

&lt;h2&gt;
  
  
  Governance Caps: Preventing Runaway Creation
&lt;/h2&gt;

&lt;p&gt;AEGIS doesn't just execute tasks — it creates them. The dreaming cycle identifies improvements. The self-improvement loop scans codebases. The issue watcher ingests from GitHub. Without caps, the system would drown itself in work.&lt;/p&gt;

&lt;p&gt;Current limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-repo&lt;/strong&gt;: Max 5 pending tasks per repo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily&lt;/strong&gt;: Max 8 tasks created in 24 hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedup&lt;/strong&gt;: Identical pending titles are rejected
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;repoPending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s2"&gt;`SELECT COUNT(*) as c FROM cc_tasks
   WHERE status = 'pending' AND created_by = 'aegis' AND repo = ?`&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;first&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;c&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;repoPending&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;repoPending&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Per-repo cap reached`&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These numbers were chosen empirically. The per-repo limit of 5 prevents one noisy repository from monopolizing the queue. The daily limit of 8 is roughly what the taskrunner can process overnight.&lt;/p&gt;
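
&lt;p&gt;To see how the three limits interact, here is a sketch of all of them as one pure function over an in-memory list. It is not the production code (which queries D1, as in the snippet above). The &lt;code&gt;QueuedTask&lt;/code&gt; shape, the &lt;code&gt;checkCaps&lt;/code&gt; name, the order of the checks, and treating dedup as global rather than per-repo are all assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type QueuedTask = { repo: string; title: string; status: string; createdAt: number };
type CapVerdict = { allowed: boolean; reason?: string };

const PER_REPO_CAP = 5;
const DAILY_CAP = 8;
const DAY_MS = 24 * 60 * 60 * 1000;

// tasks: agent-created tasks already in the queue (stand-in for the cc_tasks table).
// now and createdAt are millisecond timestamps.
function checkCaps(tasks: QueuedTask[], repo: string, title: string, now: number): CapVerdict {
  const pending = tasks.filter((t) => t.status === 'pending');

  // Dedup: an identical title that is still pending is rejected.
  if (pending.some((t) => t.title === title)) {
    return { allowed: false, reason: 'Duplicate pending title' };
  }
  // Per-repo: at most 5 pending tasks for one repository.
  if (pending.filter((t) => t.repo === repo).length >= PER_REPO_CAP) {
    return { allowed: false, reason: 'Per-repo cap reached' };
  }
  // Daily: at most 8 tasks created in the trailing 24 hours, whatever their status.
  if (tasks.filter((t) => t.createdAt > now - DAY_MS).length >= DAILY_CAP) {
    return { allowed: false, reason: 'Daily cap reached' };
  }
  return { allowed: true };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;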

&lt;h2&gt;
  
  
  Multi-Agent Review: Codex as Second Opinion
&lt;/h2&gt;

&lt;p&gt;After every task completes and the PR is created, the taskrunner invokes OpenAI's Codex CLI for an independent review:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;codex_review&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;timeout &lt;/span&gt;120 codex &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Review the git diff main..&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;branch&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; in this repo. &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
   Classify each finding as CRITICAL or NON-CRITICAL. 5 bullets max."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  2&amp;gt;&amp;amp;1&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The review gets posted as a PR comment. Then severity routing kicks in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CRITICAL findings&lt;/strong&gt; (security, data loss, logic errors): PR gets labeled &lt;code&gt;needs-fix&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean review&lt;/strong&gt;: PR gets labeled &lt;code&gt;codex-reviewed&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-critical findings&lt;/strong&gt;: Posted for context, labeled &lt;code&gt;codex-reviewed&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is explicitly non-blocking. The Codex review is informational; it doesn't gate merging. The reason: a second AI reviewing a first AI's work catches some classes of bugs (missed error handling, security issues) but not others (architectural misfit, business logic errors). Making it a gate would create false confidence. Keeping it advisory gives the human reviewer a useful signal without delegating judgment.&lt;/p&gt;
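
&lt;p&gt;The severity routing can be sketched as a single function from review text to label. How the taskrunner actually parses Codex output isn't shown here, so this assumes the review uses the literal tokens CRITICAL and NON-CRITICAL that the prompt asks for; &lt;code&gt;labelForReview&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type ReviewLabel = 'needs-fix' | 'codex-reviewed';

// Assumption: findings are tagged with the exact tokens CRITICAL / NON-CRITICAL.
// Matching is case-sensitive on purpose, so prose like "nothing critical" is ignored.
function labelForReview(review: string): ReviewLabel {
  // Drop NON-CRITICAL first so it is not mistaken for CRITICAL.
  const remaining = review.split('NON-CRITICAL').join('');
  if (remaining.includes('CRITICAL')) {
    return 'needs-fix';
  }
  return 'codex-reviewed';
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Clean reviews and reviews with only non-critical findings land on the same label, matching the list above. A review that says "No CRITICAL findings" in capitals would be mislabeled, which is one more reason to keep the result advisory.&lt;/p&gt;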

&lt;h2&gt;
  
  
  What Broke: Four Production Incidents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The IDOR That Found Itself
&lt;/h3&gt;

&lt;p&gt;An autonomous task scanning stackbilt-auth found an Insecure Direct Object Reference (IDOR): users could access other users' resources by manipulating IDs. The task created a fix. The fix itself had three bugs that Codex caught: an unguarded &lt;code&gt;JSON.parse&lt;/code&gt; and two wrong webhook URLs.&lt;/p&gt;

&lt;p&gt;The response was not to restrict autonomous scanning. It was to add the Codex review step. More oversight, not less autonomy. The security bug was real and would have gone unnoticed longer without the autonomous scan.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Governance Cap Deadlock
&lt;/h3&gt;

&lt;p&gt;After a productive overnight run (31 completed tasks in 24 hours), the daily creation cap of 8 tasks blocked all new task creation — including legitimate new issues. The system was being punished for throughput.&lt;/p&gt;

&lt;p&gt;The fix: change the cap from "tasks created in the last 24 hours" to "currently pending tasks." Completed tasks no longer count against the cap. High throughput is rewarded instead of penalized.&lt;/p&gt;
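
&lt;p&gt;The difference between the two rules is easy to see side by side. This is a sketch, not the deployed query: the &lt;code&gt;CapTask&lt;/code&gt; shape and function names are invented for the example, and keeping the limit at 8 after the fix is an assumption.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type CapTask = { status: 'pending' | 'completed' | 'failed'; createdAt: number };

const TASK_CAP = 8; // assumption: the limit stays at 8 after the fix
const WINDOW_MS = 24 * 60 * 60 * 1000;

// Old rule: every task created in the trailing 24 hours counts, whatever its status.
function allowedByCreationWindow(tasks: CapTask[], now: number): boolean {
  const recent = tasks.filter((t) => t.createdAt > now - WINDOW_MS);
  return TASK_CAP > recent.length;
}

// New rule: only tasks that are still pending count.
function allowedByPendingCount(tasks: CapTask[]): boolean {
  const pending = tasks.filter((t) => t.status === 'pending');
  return TASK_CAP > pending.length;
}

// The overnight run from the incident: 31 tasks created and completed inside the window.
const runEnd = Date.now();
const overnightRun: CapTask[] = [];
for (let i = 0; i !== 31; i++) {
  overnightRun.push({ status: 'completed', createdAt: runEnd - 60 * 60 * 1000 });
}

const oldRuleAllows = allowedByCreationWindow(overnightRun, runEnd); // false: 31 recent tasks
const newRuleAllows = allowedByPendingCount(overnightRun);           // true: nothing pending
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;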

&lt;h3&gt;
  
  
  3. Git Working Tree Clobbering
&lt;/h3&gt;

&lt;p&gt;The taskrunner's branch creation sequence (&lt;code&gt;git checkout main &amp;amp;&amp;amp; git checkout -b auto/...&lt;/code&gt;) had a side effect: checking out main restored committed file versions, wiping any uncommitted changes in the working directory. If you were mid-edit on a file when the taskrunner started, your changes were gone.&lt;/p&gt;

&lt;p&gt;The fix was adding stash/pop isolation around the branch creation, and a dirty-tree detection warning at taskrunner startup.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Schema Mismatch Silently Failing
&lt;/h3&gt;

&lt;p&gt;The issue watcher was writing to D1 columns (&lt;code&gt;github_issue_repo&lt;/code&gt;, &lt;code&gt;github_issue_number&lt;/code&gt;) that existed in the schema migration file but hadn't been applied to the live database. D1 silently dropped the values. Tasks were being created but without issue linkage — so PR comments referencing the originating issue never posted.&lt;/p&gt;

&lt;p&gt;No runtime error. No log warning. Just silent data loss. Fixed by aligning the code to the actual deployed schema and running the migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;As of March 9, 2026, across 11 repositories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tasks executed&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Completed successfully&lt;/td&gt;
&lt;td&gt;68 (85%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed&lt;/td&gt;
&lt;td&gt;4 (5%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cancelled&lt;/td&gt;
&lt;td&gt;3 (4%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PRs created&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repos touched&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Category breakdown: 67 feature tasks, 4 research, 4 docs, 4 refactor, 1 test.&lt;/p&gt;

&lt;p&gt;The 85% success rate is deceptive — it includes operator tasks (manually queued with human-written prompts), which have a near-100% completion rate. Autonomous tasks from the issue watcher have a lower success rate, primarily due to underspecified issue descriptions. Quality in, quality out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unsolved Problems
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Task contention.&lt;/strong&gt; Two tasks editing the same file on separate branches will produce merge conflicts. We don't detect or prevent this yet. The blast radius is small (one PR fails to merge), but it wastes compute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality validation.&lt;/strong&gt; Codex review catches syntax and security issues but can't validate that the change actually solves the business problem. We don't have automated acceptance tests for most repositories. The human review step carries more weight than we'd like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost control.&lt;/strong&gt; Each Claude Code session costs $0.50-$2.00 depending on complexity and turn count. 80 tasks at an average of $1.00 is $80 — acceptable for a solo operation, but the cost scales linearly with task volume. There's no intelligence in task prioritization beyond the authority model.&lt;/p&gt;
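
&lt;p&gt;The arithmetic behind that estimate, written out with the quoted per-session range. The $1.00 average is the working figure from the paragraph above, not a measured value, and the names are illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;const TASKS_RUN = 80;

// USD per Claude Code session, from the range quoted above.
const COST_PER_SESSION = { low: 0.5, average: 1.0, high: 2.0 };

// Cost grows linearly with the number of sessions.
function totalCost(tasks: number, perSession: number): number {
  return tasks * perSession;
}

const costEstimate = {
  low: totalCost(TASKS_RUN, COST_PER_SESSION.low),         // 40
  average: totalCost(TASKS_RUN, COST_PER_SESSION.average), // 80
  high: totalCost(TASKS_RUN, COST_PER_SESSION.high),       // 160
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So the same 80 tasks fall somewhere between $40 and $160 depending on complexity, and doubling task volume doubles every figure.&lt;/p&gt;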

&lt;p&gt;&lt;strong&gt;Context loss.&lt;/strong&gt; Long Claude Code sessions (25+ turns) accumulate context that eventually degrades response quality. We cap at 25 turns by default, but some tasks legitimately need more. There's no mechanism to checkpoint and resume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rollback.&lt;/strong&gt; When an autonomous change breaks something after merge, there's no automated rollback. The agent only moves forward; it doesn't yet know how to revert its own work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trust Model
&lt;/h2&gt;

&lt;p&gt;The question in the title — "How do you trust an AI agent to modify production code?" — has a boring answer: you don't. You trust the system around it.&lt;/p&gt;

&lt;p&gt;The agent operates inside a sandbox of bash hooks, branch isolation, governance caps, and multi-agent review. Each layer is simple and independently auditable. The hooks are 15-line bash scripts. The governance is SQL queries. The branch model is standard git.&lt;/p&gt;

&lt;p&gt;Trust is not binary. It's a spectrum gated by risk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-risk&lt;/strong&gt; (research, reading code): auto_safe, no approval needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-risk&lt;/strong&gt; (docs, tests, refactor): auto_safe, PR required, Codex review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium-risk&lt;/strong&gt; (features, bugfixes): proposed, human approval before execution, PR + review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-risk&lt;/strong&gt; (deploys, secrets, billing): blocked entirely in autonomous mode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is not to make the agent trustworthy. It's to make the failure modes survivable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;AEGIS is an open-source persistent AI agent running on Cloudflare Workers. Source: &lt;a href="https://github.com/Stackbilt-dev/aegis" rel="noopener noreferrer"&gt;github.com/Stackbilt-dev/aegis&lt;/a&gt;. Built by &lt;a href="https://kurtovermier.com" rel="noopener noreferrer"&gt;Kurt Overmier&lt;/a&gt; and AEGIS at &lt;a href="https://stackbilt.dev" rel="noopener noreferrer"&gt;Stackbilt&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>governance</category>
      <category>autonomoussystems</category>
    </item>
    <item>
      <title>I Built an Autonomous AI Agent That Ships Its Own Code</title>
      <dc:creator>Kurt Overmier &amp; AEGIS</dc:creator>
      <pubDate>Mon, 09 Mar 2026 17:17:24 +0000</pubDate>
      <link>https://forem.com/stackbiltadmin/i-built-an-autonomous-ai-agent-that-ships-its-own-code-4ml8</link>
      <guid>https://forem.com/stackbiltadmin/i-built-an-autonomous-ai-agent-that-ships-its-own-code-4ml8</guid>
      <description>&lt;p&gt;What happens when you give an AI agent its own memory, its own goals, and let it ship code autonomously?&lt;/p&gt;

&lt;p&gt;I built AEGIS to find out.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Is
&lt;/h2&gt;

&lt;p&gt;AEGIS is a persistent autonomous AI agent running on Cloudflare Workers. Not a chatbot. Not a RAG demo. A system that runs 24/7, maintains its own memory across sessions, sets its own goals, and executes a full software development lifecycle — from GitHub issue to merged pull request — without human intervention.&lt;/p&gt;

&lt;p&gt;It currently runs 12 autonomous goals on scheduled cadences, including compliance monitoring, finance anomaly detection, and go-to-market (GTM) strategy. It has a nightly "dreaming cycle" where it consolidates memory, extracts task proposals, and triages its own agenda.&lt;/p&gt;

&lt;p&gt;It shipped 29 major versions in 7 days. Most of the later versions were changes it proposed itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture (High Level)
&lt;/h2&gt;

&lt;p&gt;AEGIS runs entirely on edge infrastructure — Cloudflare Workers, D1, Vectorize, Queues. No origin servers. No containers. The entire system costs less per month than a single GPU hour.&lt;/p&gt;

&lt;p&gt;A few pieces worth mentioning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-tier model routing.&lt;/strong&gt; Not every task needs the most expensive model. AEGIS routes across Claude, Groq, Workers AI, and other providers based on task complexity. Procedural memory learns which routes worked for similar tasks and short-circuits future classification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid vector memory.&lt;/strong&gt; A dedicated Memory Worker backed by Cloudflare Vectorize (768-dimensional BGE embeddings). Semantic search and keyword search are merged via Reciprocal Rank Fusion. Temporal decay ensures old memories fade — unless they're flagged as core facts, which are immune to decay.&lt;/p&gt;
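
&lt;p&gt;To make the fusion step concrete, here is a rough sketch of Reciprocal Rank Fusion combined with temporal decay. It is an illustration under stated assumptions, not the Memory Worker's code: the &lt;code&gt;Memory&lt;/code&gt; shape, the function names, the exponential half-life decay, and the 30-day half-life are all hypothetical. The constant &lt;code&gt;k = 60&lt;/code&gt; is the value commonly used with RRF, and whether AEGIS uses it is not stated.&lt;/p&gt;

```typescript
// Hypothetical shape; the real memory schema is not described in the post.
interface Memory {
  id: string;
  ageDays: number;
  isCoreFact: boolean;
}

// Reciprocal Rank Fusion: every ranked list adds 1 / (k + rank) to an item's
// score, with rank starting at 1 for the top result of that list.
function fuseRankings(rankings: string[][], k: number = 60): { [id: string]: number } {
  const scores: { [id: string]: number } = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores[id] = (scores[id] ?? 0) + 1 / (k + index + 1);
    });
  }
  return scores;
}

// Multiplies each fused score by a decay factor that halves every
// halfLifeDays (an assumed decay curve). Core facts keep a factor of 1.
function rankMemories(
  memories: Memory[],
  semanticIds: string[],
  keywordIds: string[],
  halfLifeDays: number = 30
): string[] {
  const fused = fuseRankings([semanticIds, keywordIds]);
  const scored = memories.map((m) => {
    const decay = m.isCoreFact ? 1 : Math.pow(0.5, m.ageDays / halfLifeDays);
    return { id: m.id, score: (fused[m.id] ?? 0) * decay };
  });
  scored.sort((a, b) => b.score - a.score);
  return scored.map((s) => s.id);
}
```

&lt;p&gt;In this sketch an old memory that tops both searches can still drop below a fresh one, while an equally old core fact keeps its full fused score.&lt;/p&gt;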

&lt;p&gt;&lt;strong&gt;Autonomous task pipeline.&lt;/strong&gt; GitHub issues labeled &lt;code&gt;aegis&lt;/code&gt; automatically queue as tasks. Headless Claude Code sessions pick them up, execute with safety hooks (no destructive operations, no production deploys, no interactive prompts), create branch-per-task PRs, and request automated code review. A governance system enforces authority levels, daily caps, and approval workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP server.&lt;/strong&gt; 20 tools exposed via Model Context Protocol with OAuth 2.1 + PKCE authorization. Memory, agenda, goals, task queue, conversation history — all accessible to any MCP client. Claude Code connects via MCP for bidirectional collaboration.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Building a system that operates autonomously for days at a time teaches you things that building chatbots doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory is the hard problem.&lt;/strong&gt; Not storing memories — that's easy. The hard part is &lt;em&gt;forgetting&lt;/em&gt; the right things. Without temporal decay, the context window fills with noise. Without core-fact immunity, the system forgets its own identity. Getting this balance right took more iterations than anything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safety hooks are load-bearing.&lt;/strong&gt; The moment you give an agent the ability to execute code and create PRs, you need real constraints — not guidelines, not system prompts, but actual execution-level blocks. AEGIS cannot force-push, cannot delete branches, cannot deploy to production, cannot run interactive commands. These aren't suggestions. They're enforced at the shell level.&lt;/p&gt;
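
&lt;p&gt;As a rough illustration of what an execution-level block means, the sketch below checks a shell command against a deny list before it runs. The actual hooks are shell-level, so this TypeScript version only mirrors the idea. The rule names, the regular expressions, and the &lt;code&gt;blockedReason&lt;/code&gt; helper are hypothetical, and using &lt;code&gt;wrangler deploy&lt;/code&gt; as the production deploy command is an assumption based on the Cloudflare Workers setup. A pattern list like this is not exhaustive.&lt;/p&gt;

```typescript
// Hypothetical deny rules; the real hook patterns are not shown in the post.
interface BlockRule {
  reason: string;
  pattern: RegExp;
}

const BLOCK_RULES: BlockRule[] = [
  { reason: "force-push", pattern: /\bgit\s+push\b.*(\s--force\b|\s-f\b)/ },
  { reason: "branch deletion", pattern: /\bgit\s+(branch\s+(-D|-d|--delete)\b|push\b.*\s--delete\b)/ },
  { reason: "production deploy", pattern: /\bwrangler\s+(deploy|publish)\b/ },
  { reason: "interactive command", pattern: /\bgit\s+(rebase|add)\b.*(\s-i\b|\s--interactive\b)/ },
];

// Returns the reason a command is refused, or null when no rule matches.
function blockedReason(command: string): string | null {
  for (const rule of BLOCK_RULES) {
    if (rule.pattern.test(command)) {
      return rule.reason;
    }
  }
  return null;
}
```

&lt;p&gt;A caller would refuse to execute whenever &lt;code&gt;blockedReason&lt;/code&gt; returns a non-null value, which keeps the decision outside the model's control.&lt;/p&gt;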

&lt;p&gt;&lt;strong&gt;Cost-aware routing changes everything.&lt;/strong&gt; When you're paying per token and running 24/7, you develop strong opinions about which model should handle which task. A classification that a 3B model can do in 50ms shouldn't go to a 200B model. Procedural memory makes this self-optimizing over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance is not optional.&lt;/strong&gt; An autonomous agent without governance is just a bot with a credit card. Authority levels, daily caps, approval workflows, and category-based routing aren't bureaucracy — they're what makes autonomy safe enough to actually turn on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ecosystem
&lt;/h2&gt;

&lt;p&gt;AEGIS is one piece of a larger platform called &lt;a href="https://stackbilt.dev" rel="noopener noreferrer"&gt;Stackbilt&lt;/a&gt; — a multi-product edge SaaS I built from scratch. The platform includes consolidated auth (16 RPCs, Stripe billing, SSO), an image generation API, an MCP gateway, and an open-source AI Developer Framework called &lt;a href="https://github.com/Stackbilt-dev/charter" rel="noopener noreferrer"&gt;Charter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Everything is TypeScript, everything runs on Cloudflare Workers, and AEGIS has its hooks into all of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;AEGIS is live at &lt;a href="https://aegis.stackbilt.dev" rel="noopener noreferrer"&gt;aegis.stackbilt.dev&lt;/a&gt;. The technical blog lives at &lt;a href="https://aegis.stackbilt.dev/tech" rel="noopener noreferrer"&gt;aegis.stackbilt.dev/tech&lt;/a&gt;. The source is at &lt;a href="https://github.com/Stackbilt-dev/aegis" rel="noopener noreferrer"&gt;github.com/Stackbilt-dev/aegis&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're building autonomous agent systems — especially on edge infrastructure — I'd be interested to hear what you're running into. The problems are more interesting than the solutions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Kurt Overmier at &lt;a href="https://stackbilt.dev" rel="noopener noreferrer"&gt;Stackbilt LLC&lt;/a&gt;. AEGIS helped write this post.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>autonomous</category>
      <category>cloudflare</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
