<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Erik anderson</title>
    <description>The latest articles on Forem by Erik anderson (@erik_anderson_c41dbafd423).</description>
    <link>https://forem.com/erik_anderson_c41dbafd423</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815528%2Fd0714a5b-5035-418e-9ef8-657789d5b264.jpg</url>
      <title>Forem: Erik anderson</title>
      <link>https://forem.com/erik_anderson_c41dbafd423</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/erik_anderson_c41dbafd423"/>
    <language>en</language>
    <item>
      <title>How I Built My Own LLM Gateway — Erik Anderson</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 04 May 2026 15:57:16 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/how-i-built-my-own-llm-gateway-erik-anderson-4gob</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/how-i-built-my-own-llm-gateway-erik-anderson-4gob</guid>
      <description>&lt;p&gt;How I Built My Own LLM Gateway — Erik Anderson&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Free Tools
Buy the Book
Blog
Contact
Get the Free Kit



Tech &amp;amp;amp; Automation
# How I Built My Own LLM Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;← Back to Blog&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;By Erik Anderson
Tag: Tech &amp;amp;amp; Automation
~9 min read
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I run sixty-plus active projects through one Claude account. Most of them call Claude on a schedule — a content pipeline at 5 a.m., a trading scanner every fifteen minutes, an email reactor whenever a client writes in, a podcast generator at 7. They don't coordinate with each other. They just fire when their cron tells them to.&lt;/p&gt;

&lt;p&gt;The result, predictably, is rate limits. Claude's 5-hour rolling window doesn't care that the contract scanner accidentally went into a retry loop at 3 a.m. and burned the budget the email reactor needed at 9. By the time a client email lands, the account is already saturated. The most important call of the day fails because the least important one already happened.&lt;/p&gt;

&lt;p&gt;I tried the obvious things. Cron schedules spread out. Per-project rate limiters in code. A spreadsheet that mapped which scripts ran when. None of it worked, because the problem isn't scheduling — it's that no single piece of software had a global view of the budget. Every script was making local decisions about a global resource.&lt;/p&gt;

&lt;p&gt;So I built one piece of software that does. It's called PrimeRouter. This post is what it does, why it's shaped the way it is, and what I'd tell someone thinking about building one.&lt;/p&gt;

&lt;h2&gt;The Shape of the Problem&lt;/h2&gt;

&lt;p&gt;Before the gateway, every service called Claude through a thin wrapper that did one thing: catch a 429, sleep, retry. That wrapper had four failure modes and they all bit me:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No global budget view. Each caller detected rate limits on its own. There was no single place that could say "the account is at 82% of the 5-hour window — stop sending non-critical traffic."
No priority. A speculative blog draft and an in-flight client email got the same shot at the budget. The cheap call won the race more often than the important one.
No provider diversity. When Claude rate-limited, everything just queued behind the wait. I had Codex, Ollama, and a local agentic backend running on a separate machine, and none of it picked up the slack.
No accounting. I had no idea which projects were burning the most tokens until something obviously broke. Post-incident debugging meant reading thirty service logs.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The fix had to live one layer deeper than any individual service. A gateway. Every /opt/ service calls it instead of claude -p directly. The gateway makes the routing, priority, and accounting decisions in one place.&lt;/p&gt;

&lt;h2&gt;13 Priority Tiers&lt;/h2&gt;

&lt;p&gt;The single most important design choice was that not all calls are equal, and the gateway has to know which is which. PrimeRouter has thirteen priority tiers, declared in a YAML file. They look something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;critical — client-facing work that has a human waiting on it (an email reactor handling a client request, a payment-flow webhook).
fixer_attempt_1 through fixer_attempt_3 — auto-fix retries on review-blocked branches. Each retry deliberately routes to a different provider so the model doesn't converge on the same wrong fix three times in a row.
review — code-review and analysis runs.
scheduled — routine pipelines that have a deadline (the morning ScanBrief, the daily podcast).
background — speculative or batchable work (a content draft, a metric refresh).
overnight — anything that can wait until 1 a.m. local time. The scheduler defers these explicitly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
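&lt;p&gt;A sketch of what such a tier file might look like — the tier names match the list above, but the ranks and provider chains here are illustrative, not PrimeRouter's actual config:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tiers:
  critical:        { rank: 0, providers: [claude] }
  fixer_attempt_1: { rank: 1, providers: [claude] }
  fixer_attempt_2: { rank: 1, providers: [codex] }
  fixer_attempt_3: { rank: 1, providers: [hermes] }
  review:          { rank: 2, providers: [claude, codex] }
  scheduled:       { rank: 3, providers: [claude] }
  background:      { rank: 4, providers: [claude, ollama] }
  overnight:       { rank: 5, providers: [claude], defer_until: "01:00" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;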

&lt;p&gt;When the global 5-hour budget thins, the gateway closes off the lower tiers first. Critical calls keep getting through. Background calls get a 503-style "try later." Overnight calls get pushed to the actual overnight drain.&lt;/p&gt;
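&lt;p&gt;The shedding logic itself is small once every tier carries a rank. A minimal sketch — the thresholds and rank numbers here are invented for illustration, not PrimeRouter's real values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import bisect

# Hypothetical shed points: the fraction of the 5-hour budget at which
# the lowest remaining band of tiers stops being admitted.
SHED_POINTS = [0.60, 0.80, 0.95]
TIER_RANK = {"critical": 0, "review": 1, "scheduled": 2,
             "background": 2, "overnight": 3}

def admit(tier, usage):
    """True if a call at this tier may run at this budget usage."""
    closed_bands = bisect.bisect_right(SHED_POINTS, usage)
    # Rank 3 closes first, then rank 2, then rank 1; rank 0 never closes.
    # A call is admitted while rank + closed_bands stays at most 3.
    return max(TIER_RANK[tier] + closed_bands, 3) == 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A refused call doesn't queue — it gets the 503-style "try later" so the caller can decide what to do.&lt;/p&gt;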

&lt;p&gt;This sounds obvious in retrospect. It wasn't obvious at the time. Most rate-limit advice you read on the internet treats every call as equally precious. Mine aren't. The blog post that publishes at 6 a.m. can survive being a few hours late. The email a paying customer sent at 10 a.m. cannot.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Most rate-limit advice treats every call as equally precious. Mine aren't. The blog post can wait. The customer email can't."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Multi-Provider Failover&lt;/h2&gt;

&lt;p&gt;The gateway speaks to four different backends today: Claude (the default for almost everything), Codex over SSH (for code generation when the local sandbox is healthy), Hermes (a local agentic backend running qwen3-coder on an M3 Mac with 70 tokens/sec throughput), and Ollama (small models for mechanical work).&lt;/p&gt;

&lt;p&gt;Each tier has a chain — the gateway tries one provider, and if that backend is unavailable or returns a known-bad signal, it fails over to the next. The "known-bad signal" detection took longer to get right than I expected. Codex sandboxes can return exit code zero with a sandbox-failure body in stdout, so a naïve check would record success on a call that actually never ran. The gateway parses for those bodies explicitly and treats them as failures.&lt;/p&gt;

&lt;p&gt;The provider diversity also matters because retries on the same model produce the same wrong answer. If a code-review run is blocked because Claude misread the diff, asking Claude again with a slightly different prompt usually doesn't help. Asking Codex or a local model often does. The fixer-attempt chain encodes this — three retries, three different providers, three different reasoning traces.&lt;/p&gt;
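&lt;p&gt;A stripped-down sketch of the chain walk — the provider names and failure markers are illustrative, and the real gateway does more bookkeeping:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical chain for the fixer tiers: three attempts, three providers.
FIXER_CHAIN = ["claude", "codex", "hermes"]
BAD_MARKERS = ("sandbox unavailable", "sandbox failure")

def call_with_failover(prompt, providers, send):
    """Try providers in order; a known-bad body counts as a failure
    even when the subprocess exited zero."""
    errors = []
    for name in providers:
        try:
            out = send(name, prompt)
        except OSError as exc:
            errors.append((name, repr(exc)))
            continue
        if any(marker in out.lower() for marker in BAD_MARKERS):
            errors.append((name, "known-bad body"))
            continue
        return name, out
    raise RuntimeError("all providers failed: %r" % (errors,))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;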

&lt;h2&gt;Fleet Telemetry&lt;/h2&gt;

&lt;p&gt;Both my servers run Claude CLI sessions all day. So does my Mac. They all push OpenTelemetry traces to a central collector on the third box, which aggregates token usage across the fleet. The gateway reads from that aggregate.&lt;/p&gt;

&lt;p&gt;The reason this matters: Claude's 5-hour budget is per-account, not per-host. A naïve rate limiter on each server would think it had its own budget. The gateway sees one bucket. When the account is at 82%, every host knows it.&lt;/p&gt;

&lt;p&gt;The telemetry is also how I answer questions like "which project burned 40% of yesterday's budget" without grepping logs. The dashboard surfaces it directly, broken down by service and call class.&lt;/p&gt;

&lt;h2&gt;Per-Workflow Context Injection&lt;/h2&gt;

&lt;p&gt;Here's where the gateway pays for itself in tokens, not just in priority. Every call carries a workflow field that names what the caller is doing — website_fix, app_change, code_review, knowledge_response. The gateway consults a per-workflow profile that says: for this kind of call, here are the GAMEPLAN sections to inject as system context, here's the tool list to allow, and here's the token budget for the preamble.&lt;/p&gt;

&lt;p&gt;A code-review call gets the project's GAMEPLAN, the relevant fix-guides, and an empty tool list (review is read-only). A knowledge call gets a different slice and read-only tools. An app-change call gets the full surface plus write tools. None of them get the kitchen sink.&lt;/p&gt;

&lt;p&gt;The byte budget is the part that surprised me. The naïve version of "inject context" is to load every relevant doc into the system prompt — and that's where you find out you've blown 30K tokens before the user's actual prompt is even read. The gateway's context profiles cap the preamble at a configurable byte count and prefer the most-relevant sections within that cap. The cap is observable as a Prometheus histogram, so I can see when it's getting tight.&lt;/p&gt;
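&lt;p&gt;The cap itself is only a few lines. A sketch, assuming sections arrive already ranked by relevance (the ranking is the hard part, and it isn't shown here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def build_preamble(ranked_sections, byte_cap):
    """Pack sections most-relevant-first; hard-cap the preamble bytes."""
    parts, used = [], 0
    for section in ranked_sections:
        sep = 1 if parts else 0            # newline between sections
        room = max(byte_cap - used - sep, 0)
        chunk = section.encode("utf-8")[:room]
        if not chunk:
            break                          # budget exhausted
        parts.append(chunk.decode("utf-8", "ignore"))
        used += len(chunk) + sep
    return "\n".join(parts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;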

&lt;h2&gt;What I'd Tell Someone Building One&lt;/h2&gt;

&lt;p&gt;The article that I think nailed the user-side discipline is Pawel Huryn's Stop Hitting Claude Code Limits — twenty-two concrete techniques for the human at the keyboard. Cache management, model locking, effort tuning, lean tool loading. If you read one piece on this topic, read that one.&lt;/p&gt;

&lt;p&gt;The gateway is the layer below those techniques. Everything in Huryn's list is something you do yourself, in your own session, with your own discipline. The gateway is what you build when you have a fleet of services that can't sit at a keyboard and exercise discipline.&lt;/p&gt;

&lt;p&gt;The lessons I'd flag if you're building one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Priority is the foundational decision. Build the tier system before you build the routing logic. If you can't articulate which calls are more important than which, you have no basis for refusing one.
Provider diversity is real, but it's not free. Each backend has its own auth, its own quirks, its own definition of failure. Budget time for the integration.
Telemetry first, decisions second. You cannot make a sane scheduling decision without an aggregated view of usage. Build the telemetry pipe before the scheduler.
Context profiles save more tokens than you'd guess. The prompt isn't the expensive part of a Claude call. The system context you load around the prompt is. Cap it. Profile it. Watch the histogram.
Subprocess isolation is your friend. Each call runs as a fresh claude -p subprocess with model and tools fixed for that invocation. There's no mid-session drift, because there's no mid-session.
Make the failure modes loud. A silent rate-limit drop is worse than a visible 500 — at least the 500 lets the caller decide whether to retry, defer, or escalate.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;The gateway is in production and handling thousands of calls per day across the fleet. The next round of work is closing three specific gaps from the Huryn list — disabling 1M-context fallback to save cache writes, passing through Claude's --effort flag so background work can opt down to medium reasoning, and wiring up skill-based model routing so mechanical tasks get a Haiku instead of an Opus. None of those are big diffs. All of them are real money.&lt;/p&gt;

&lt;p&gt;The longer-term move is bringing in OpenRouter as a fifth backend so I can route ultra-low-stakes work to GLM-5.1 at roughly a twelfth of Opus cost. And eventually a Tree-sitter-based code-review graph that lets the reviewer load only the functions a diff touches, instead of loading whole files. That's claimed to reduce review tokens by a factor of seven or more. I'm skeptical of the specific number, but the direction is right.&lt;/p&gt;

&lt;p&gt;If you're building something similar, I'd love to compare notes. Email me. The full design doc is publicly visible in the project's README — it's the kind of doc I wished I'd been able to read before I started.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The Technical Blueprint
### The Autonomous Engineer — Book 2

The complete guide to running your own automation empire — including infrastructure architecture, AI tooling, monitoring, and the design patterns behind systems like the one in this post.


  Book 2 on Amazon
  Book 1 on Amazon
  Free Starter Kit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;← Back to Blog&lt;/p&gt;

&lt;p&gt;© 2026 Erik Anderson — Privacy — Contact&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Your Business Automation Needs to Know Which Tasks Matter Most — Prime Automation Solutions</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 04 May 2026 15:54:25 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/why-your-business-automation-needs-to-know-which-tasks-matter-most-prime-automation-solutions-1p82</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/why-your-business-automation-needs-to-know-which-tasks-matter-most-prime-automation-solutions-1p82</guid>
      <description>&lt;p&gt;Why Your Business Automation Needs to Know Which Tasks Matter Most — Prime Automation Solutions&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Prime Automation







    Home
    Services
    Network Automation
    Blog
    Case Studies
    Free Tools
    Government
    Free Assessment








    Home &amp;amp;gt; Blog &amp;amp;gt; Business Automation Priorities

  # Why Your Business Automation Needs to Know Which Tasks Matter Most

  If every task is equally important, none of them are. Here is what most small businesses get wrong about automation priorities — and the simple fix.









      May 4, 2026
      &amp;amp;bull;
      5 min read


    Most small business automation works fine until the day it doesn't. The day it doesn't is usually the day a real customer reaches out — and the system is too busy doing something less important to respond.

    Here is the pattern I see almost every week. A business sets up a handful of automations. Maybe a content generator runs every morning. Maybe a marketing email goes out on a schedule. Maybe a research task pulls competitor data every few hours. Each one works in isolation. Each one was built without thinking about what happens when they all need to run at the same time.

    Then a customer fills out a contact form during a busy hour. The system that should reply within thirty seconds is in line behind a research task and two scheduled posts. By the time the response actually fires, the lead has already moved on.

    ## The Real Problem Is Not Speed. It Is Priority.

    When people complain that their AI tools are "slow," what they usually mean is "slow on the thing that mattered today." The tools were not actually slow. They were busy doing something less important.

    This is the same problem hospitals solve with triage and air traffic control solves with priority queues. Some things wait. Some things cannot. The system has to know the difference.

    Most off-the-shelf automation tools — Zapier, Make, basic AI integrations — do not understand priority. Every workflow runs in the order it was triggered. A blog post draft and a paying customer's question get the same shot at the queue. The blog post wins more often than it should, because it was scheduled and the customer email was not.

    ## What Priority Looks Like in Practice

    An automation system that knows priority does three things differently from one that does not.

    It tags every task with importance. A customer-facing reply is "critical." A scheduled blog draft is "background." A weekly report is "low." This is not optional. Without tags, the system has no basis for choosing between two tasks that arrive at the same moment.

    It reserves capacity for the important tasks. Even when the system is busy, the critical lane stays open. The way to do this is simple: set a budget for low-priority work, and let the system reject low-priority requests when capacity is tight. Most businesses skip this step and then wonder why urgent tasks fail at the worst possible moment.

    It defers what can be deferred. Reports, drafts, summaries, content generation, scrapes — none of these need to run during business hours. Push them to overnight. The customer email at 11 a.m. should not be competing with a content generator that could just as easily run at 2 a.m.

    ## The Cost of Getting This Wrong

    The math is simple and brutal. Industry data shows that responding to a web lead in thirty seconds versus thirty minutes increases conversion by close to four times. Every minute your automation delays a customer reply, you are losing potential revenue. If your scheduled tasks are blocking customer-facing tasks even occasionally, you are paying a real cost — and you probably do not see it because the failures are silent.

    I have audited small business automation setups where the customer-response system was being delayed by an average of two to four minutes during business hours. The owners had no idea. The dashboards showed everything as "running." The lost leads never showed up as a metric, because the system's idea of success was "did the workflow finish," not "did it finish in time to matter."

    ## How to Fix It

    You do not need to rebuild your automation stack to solve this. You need to do four things, in order.


      List every automation you currently run. Most small businesses have between five and twenty active workflows. Write them down. If you cannot list them, that is the first problem.
      Tag each one as critical, scheduled, or background. Critical means a human is waiting. Scheduled means it has a deadline but no one is actively waiting. Background means it can wait until tomorrow morning if the system is busy.
      Move every background task to overnight. Run them between 1 a.m. and 6 a.m. local. Most are reports, scrapes, drafts, or refreshes — they do not need to compete with customer hours.
      Make sure your critical lane has its own budget. If you are using a paid AI tool with rate limits, reserve at least 30% of your daily quota for customer-facing work. The rest is fair game for the background tasks.


    That is it. No new tools. No expensive rebuild. Just the discipline to admit that not every task in your system is equally important — and to design the system to act like it.

    ## The Test

    Here is a test that will tell you in five minutes whether your automations have a priority problem. Send a fake customer inquiry through your contact form during your busiest hour. Time how long it takes to get a reply. If it is under sixty seconds, your priority structure is fine. If it is over three minutes, you have a priority problem and you are losing leads to it every week.

    The fix is almost always simpler than the diagnosis. The hard part is being willing to look.



      ### Want Us to Audit Your Automations?

      We will look at every workflow you currently run, find the priority gaps, and give you a concrete fix list — whether you implement it yourself or hire us.

      Get Your Free Automation Audit













&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>businessautomationpriorities</category>
      <category>aiworkflowpriority</category>
      <category>customerresponseautomation</category>
      <category>automationratelimits</category>
    </item>
    <item>
      <title>My System Reverted a Production Failure While I Was Asleep</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:29:01 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/my-system-reverted-a-production-failure-while-i-was-asleep-3j3c</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/my-system-reverted-a-production-failure-while-i-was-asleep-3j3c</guid>
      <description>&lt;p&gt;At 03:57:27 UTC on April 26, 2026, my production system broke—and fixed itself before I woke up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-04-26 03:53  agent.merger.complete  primerouter  feature/phase2-humanrail-channel
                  commit 4d62098a, ff-merged to master, deploy_cmd=systemctl restart primerouter
2026-04-26 03:54  rollback_agent: stabilization wait 60s
2026-04-26 03:55  rollback_agent: HTTP GET http://127.0.0.1:9400/health
                  ConnectionError: Connection refused
2026-04-26 03:55  rollback_agent: git revert -m 1 4d62098a
2026-04-26 03:56  rollback_agent: git push origin master (rollback commit)
2026-04-26 03:56  rollback_agent: discord post #echo-dev
                  "Auto-revert: primerouter health check failed Connection refused"
2026-04-26 03:57:27  agent.rollback.complete  event id 30675
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I was asleep.&lt;/p&gt;

&lt;p&gt;A feature branch merged.&lt;br&gt;
The deployment failed.&lt;br&gt;
The system detected it, reverted it, restored production, and notified me.&lt;/p&gt;

&lt;p&gt;Downtime: under two minutes.&lt;br&gt;
Human involvement: zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Shift in 2026
&lt;/h2&gt;

&lt;p&gt;Writing code is no longer the bottleneck.&lt;/p&gt;

&lt;p&gt;With modern AI, most engineers can produce working systems quickly. Entire apps that once took weeks can now be generated in hours.&lt;/p&gt;

&lt;p&gt;That means the advantage has shifted.&lt;/p&gt;

&lt;p&gt;It’s no longer about building the thing.&lt;/p&gt;

&lt;p&gt;It’s about &lt;strong&gt;operating the thing&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keeping it alive&lt;/li&gt;
&lt;li&gt;Catching failures early&lt;/li&gt;
&lt;li&gt;Reverting bad changes&lt;/li&gt;
&lt;li&gt;Preventing repeat mistakes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most people can build.&lt;/p&gt;

&lt;p&gt;Very few can operate.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Built
&lt;/h2&gt;

&lt;p&gt;I run a one-person business with a stack of ~30 live services.&lt;/p&gt;

&lt;p&gt;The core is two pieces:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Nervous System
&lt;/h3&gt;

&lt;p&gt;An event bus that watches everything.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git pushes&lt;/li&gt;
&lt;li&gt;Test results&lt;/li&gt;
&lt;li&gt;Code reviews&lt;/li&gt;
&lt;li&gt;Deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every change becomes an event.&lt;/p&gt;

&lt;p&gt;Each event triggers the next step automatically.&lt;/p&gt;

&lt;p&gt;Push code → run tests → review → merge → deploy → verify&lt;/p&gt;

&lt;p&gt;No manual pipeline runs.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. The Immune System
&lt;/h3&gt;

&lt;p&gt;This is the part that matters.&lt;/p&gt;

&lt;p&gt;Every deployment is treated as suspicious until proven stable.&lt;/p&gt;

&lt;p&gt;After a merge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wait 60 seconds&lt;/li&gt;
&lt;li&gt;Check service health&lt;/li&gt;
&lt;li&gt;If it fails → revert immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s what you saw in the 03:57 log.&lt;/p&gt;

&lt;p&gt;No dashboards.&lt;br&gt;
No alerts waiting for a human.&lt;br&gt;
No “I’ll check it in the morning.”&lt;/p&gt;

&lt;p&gt;It fixes itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Components That Made This Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Automatic Rollbacks
&lt;/h3&gt;

&lt;p&gt;This is the core loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy new code&lt;/li&gt;
&lt;li&gt;Wait briefly&lt;/li&gt;
&lt;li&gt;Run health checks&lt;/li&gt;
&lt;li&gt;Revert if anything fails&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Simple. Brutal. Effective.&lt;/p&gt;

&lt;p&gt;Most systems alert you.&lt;/p&gt;

&lt;p&gt;This one acts.&lt;/p&gt;
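&lt;p&gt;The whole loop fits on one screen. A sketch with the deploy, health-check, and revert steps passed in as callables — the real versions shell out to systemctl, GET the health endpoint, and run git revert, as in the log at the top:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def deploy_and_watch(deploy, wait, healthy, revert, notify):
    """Deploy, wait for stabilization, health-check, auto-revert on failure."""
    deploy()              # e.g. ff-merge plus service restart
    wait()                # stabilization window (60s in my setup)
    if healthy():         # e.g. the /health endpoint answers
        return "healthy"
    revert()              # git revert the merge, push the revert commit
    notify("Auto-revert: health check failed")
    return "reverted"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;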




&lt;h3&gt;
  
  
  2. The “Verify Push” Guard
&lt;/h3&gt;

&lt;p&gt;I hit a subtle failure that changed everything.&lt;/p&gt;

&lt;p&gt;An AI agent returned success—but never actually pushed code.&lt;/p&gt;

&lt;p&gt;Exit code: 0&lt;br&gt;
Status: “done”&lt;br&gt;
Reality: nothing changed&lt;/p&gt;

&lt;p&gt;The fix was simple:&lt;/p&gt;

&lt;p&gt;After every “successful” change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check the remote branch SHA&lt;/li&gt;
&lt;li&gt;If it didn’t change → treat as failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That one check eliminated silent failures.&lt;/p&gt;
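&lt;p&gt;The guard is two small pieces: ask the remote for the branch tip, then refuse to believe "done" unless the tip moved. A sketch — the subprocess wiring is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import subprocess

def remote_tip(branch):
    """SHA of the branch on origin, or None if the branch is missing."""
    out = subprocess.run(
        ["git", "ls-remote", "origin", branch],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()[0] if out.strip() else None

def push_actually_landed(sha_before, sha_after):
    """An agent's exit code 0 only counts if the remote tip changed."""
    return sha_after is not None and sha_after != sha_before
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;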




&lt;h3&gt;
  
  
  3. Production Lockdown
&lt;/h3&gt;

&lt;p&gt;I don’t allow direct edits in production.&lt;/p&gt;

&lt;p&gt;Enforced by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git hooks blocking commits to main&lt;/li&gt;
&lt;li&gt;Hourly scans for “dirty” production state&lt;/li&gt;
&lt;li&gt;Alerts if anything bypasses the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because one manual fix breaks trust in the system.&lt;/p&gt;

&lt;p&gt;If the pipeline isn’t the source of truth, everything drifts.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Not Working (Important)
&lt;/h2&gt;

&lt;p&gt;This system is not perfect.&lt;/p&gt;

&lt;p&gt;Here are real gaps:&lt;/p&gt;

&lt;h3&gt;
  
  
  Outbound is not solved
&lt;/h3&gt;

&lt;p&gt;I can build and operate systems.&lt;/p&gt;

&lt;p&gt;Consistently generating customers is still manual.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some pipelines are broken
&lt;/h3&gt;

&lt;p&gt;One content distribution service is currently failing due to a message bus connection issue.&lt;/p&gt;

&lt;p&gt;It fails silently.&lt;/p&gt;

&lt;p&gt;That’s a real problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-in-the-loop system has zero users
&lt;/h3&gt;

&lt;p&gt;I built a system to route low-confidence tasks to humans.&lt;/p&gt;

&lt;p&gt;It works technically.&lt;/p&gt;

&lt;p&gt;No one uses it yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Actually Does Today
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Auto-reverts broken deployments&lt;/li&gt;
&lt;li&gt;Runs continuous testing and review&lt;/li&gt;
&lt;li&gt;Publishes blog content automatically&lt;/li&gt;
&lt;li&gt;Generates a daily podcast&lt;/li&gt;
&lt;li&gt;Tracks leads and pushes to CRM&lt;/li&gt;
&lt;li&gt;Monitors production drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some parts are strong.&lt;/p&gt;

&lt;p&gt;Some parts are early.&lt;/p&gt;

&lt;p&gt;That’s the reality.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reframe
&lt;/h2&gt;

&lt;p&gt;Code is cheap now.&lt;/p&gt;

&lt;p&gt;Operations are not.&lt;/p&gt;

&lt;p&gt;Anyone can generate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;apps&lt;/li&gt;
&lt;li&gt;scripts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Very few can run them reliably for months.&lt;/p&gt;

&lt;p&gt;Fewer can make them &lt;strong&gt;self-correcting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s where the leverage is.&lt;/p&gt;




&lt;h2&gt;
  
  
  If You’re Building Right Now
&lt;/h2&gt;

&lt;p&gt;Stop thinking only about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;features&lt;/li&gt;
&lt;li&gt;frameworks&lt;/li&gt;
&lt;li&gt;faster builds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start thinking about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;failure detection&lt;/li&gt;
&lt;li&gt;rollback speed&lt;/li&gt;
&lt;li&gt;system trust&lt;/li&gt;
&lt;li&gt;operational feedback loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the people who win in this era won’t be the fastest builders.&lt;/p&gt;

&lt;p&gt;They’ll be the ones whose systems &lt;strong&gt;keep working without them&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I’m Doing Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fix the broken distribution pipeline&lt;/li&gt;
&lt;li&gt;Get one real user through the human-in-the-loop system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;Small improvements to a system that compounds.&lt;/p&gt;




&lt;p&gt;Build systems that run without you.&lt;/p&gt;

&lt;p&gt;Or compete with someone who did.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Ansible Playbook Failing? The 7 Root Causes I See Most Often — Prime Automation Solutions</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:29:40 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/ansible-playbook-failing-the-7-root-causes-i-see-most-often-prime-automation-solutions-1302</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/ansible-playbook-failing-the-7-root-causes-i-see-most-often-prime-automation-solutions-1302</guid>
      <description>&lt;p&gt;Ansible Playbook Failing? The 7 Root Causes I See Most Often — Prime Automation Solutions&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Prime Automation





    Home
    Services
    Network Automation
    Blog
    Case Studies
    Free Tools
    Government
    Free Assessment








    Home &amp;amp;gt; Blog &amp;amp;gt; Ansible Playbook Failing? 7 Root Causes

  # Ansible Playbook Failing? The 7 Root Causes I See Most Often

  Fifteen years of production Ansible, distilled to the seven failures I see over and over. Each one with the actual error signature and the fix.









      April 23, 2026 • 9 min read • Automation


    When an Ansible playbook breaks in production, you do not have time for a blog post that starts "Ansible is a powerful automation tool developed by Red Hat." You have logs, you have a pager, and you have ten minutes to figure out whether this is a five-minute fix or a rollback.

    This post is the triage guide I wish I had the first time a playbook failed on me. Every cause below is something I have personally diagnosed in production — some of them more than a dozen times across enterprise network-automation gigs, government deployments, and small-business DevOps work. Each one includes the error signature you will actually see in your terminal and the fix that stops it recurring.

    If you are looking at a broken Ansible run right now and you need someone to fix it by tomorrow, the rapid-fix audit is $250 flat — written root-cause report in 48 hours, and if I cannot diagnose it you do not pay. If you have time to read, keep going.



      1
      ## Intermittent SSH Timeouts On a Subset of Hosts



    You run the same playbook. Eight hosts succeed. Three hosts fail with an SSH connection error. Re-run it — now it is four hosts that fail, but two of the previous failures pass. It looks random.

    TASK [network_config] **************************
    fatal: [edge-02.atl]: UNREACHABLE! =&amp;gt; {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host edge-02.atl port 22: Connection timed out", "unreachable": true}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    It is almost never actually random. The usual culprits, in order of frequency: (a) ControlPersist sockets going stale — SSH's own connection-reuse cache is serving you a dead socket; (b) DNS returning different results per lookup because you have two mismatched name-server entries; (c) a firewall doing rate-limiting on new SSH connections and your playbook is fanning out faster than the rate limit allows.

    Fix. Drop ControlPersist=60s to ControlPersist=0 in your ansible.cfg [ssh_connection] section to isolate the cache. Cap forks in ansible.cfg to ~20 to defeat the rate-limit case. Use -vvvv for one failed host — the SSH command-line Ansible is running will tell you whether the hang is on TCP connect or post-auth.
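    Both isolation knobs live in ansible.cfg. A minimal sketch (the section and option names are real Ansible settings; the values are the temporary debugging settings described above, not recommended steady-state config):

```ini
; ansible.cfg - temporary settings while isolating intermittent SSH failures
[defaults]
forks = 20                      ; cap fan-out to defeat SSH rate limiting

[ssh_connection]
; disable connection reuse to rule out stale ControlPersist sockets
ssh_args = -o ControlMaster=no -o ControlPersist=no
```

    Revert both once you have found the culprit; ControlPersist and a high fork count are exactly what make large runs fast.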



      ## 2. Handler Ordering That Runs at the Wrong Moment



    You notify a handler to restart a service. The playbook continues past the notify. A later task in the play fails. The handler never fires — because Ansible runs handlers only at the end of a play (or when meta: flush_handlers is explicitly called). Your service now has new config staged on disk but the daemon never reloaded. The next time something touches that box, production breaks.

    - name: deploy new nginx vhost
      template:
        src: vhost.j2
        dest: /etc/nginx/sites-available/app
      notify: reload nginx

    - name: run smoke test
      uri:
        url: http://localhost/app/health
        status_code: 200
      # fails, because nginx never got the reload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fix. Put - meta: flush_handlers immediately after any notify whose result you depend on in subsequent tasks. Your smoke test has no business running before the reload completed.
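A minimal sketch of the corrected play (task names are illustrative):

```yaml
- name: deploy new nginx vhost
  template:
    src: vhost.j2
    dest: /etc/nginx/sites-available/app
  notify: reload nginx

# run pending handlers NOW, not at the end of the play
- meta: flush_handlers

- name: run smoke test
  uri:
    url: http://localhost/app/health
    status_code: 200
```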

  ## 3. Fact-Gathering Crashes on a Subset of Hosts

One host in a group has a weird kernel, a missing Python module, or a locked-down SELinux context. Fact-gathering throws an exception on that host. By default, Ansible terminates the entire play for that host — but the error message is buried under a mountain of default-facts JSON, and it looks like a network problem until you look closely.

fatal: [legacy-01]: FAILED! =&amp;gt; {"ansible_facts": {}, "changed": false, "failed_modules": {"ansible.legacy.setup": {"failed": true, "module_stderr": "/usr/bin/python3: No module named ansible", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 1}}, "msg": "The following modules failed to execute: ansible.legacy.setup\n"}

Nine times out of ten this is Python path drift — the host has Python 3 somewhere non-standard, or the ansible_python_interpreter fact Ansible auto-detected is wrong.

Fix. Set ansible_python_interpreter explicitly per host in inventory, or use gather_facts: no + a targeted setup task with gather_subset: min to avoid the costly facts you do not need. For fleet hygiene, add an Ansible lint rule that fails CI if any host is missing an explicit interpreter.
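In INI-style inventory, pinning the interpreter per host looks like this (hostnames and paths are examples):

```ini
[legacy]
legacy-01 ansible_python_interpreter=/opt/python3/bin/python3

[modern]
edge-01 ansible_python_interpreter=/usr/bin/python3
```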

  ## 4. Non-Idempotent Tasks That Report "changed" Every Run

A well-written Ansible play is idempotent — running it twice in a row produces one change on the first run and zero on the second. When you start seeing the same task flip changed on every run, something is tricking the module into thinking there is work to do.

The usual culprits: (a) shell or command tasks with no creates: / removes: guard — Ansible has no way to know those ran successfully before; (b) template tasks where the rendered file has a timestamp, a hostname, or a random ID baked in; (c) lineinfile matching on a regex that subtly does not match its own output after the first run.

- name: install monitoring agent config
  template:
    src: monitor.conf.j2    # renders generated_at: {{ ansible_date_time.iso8601 }}
    dest: /etc/monitor.conf
  notify: restart monitor
&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every run regenerates the config with a new timestamp. Every run restarts your monitor. Every run takes the monitor down for two seconds. Two seconds times 365 days times 200 hosts is real downtime.

Fix. Never embed volatile values in templates that feed a changed signal. If you need the timestamp for audit, put it in a sidecar file the playbook does not re-check. Add changed_when: (plus creates:/removes: guards) to any shell/command task so YOU decide what counts as a change, not Ansible's default.
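A sketch of taking control of the changed signal on a command task (the script path and the marker string are illustrative):

```yaml
- name: install monitoring agent
  command: /opt/agent/install.sh
  args:
    creates: /etc/monitor.conf        # skip entirely once the file exists
  register: agent_install
  changed_when: "'installed' in agent_install.stdout"
```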

  ## 5. Inventory Drift — The Host Is Not Where Ansible Thinks It Is

You add a new host. The playbook works. Three months later the same playbook fails on the same host with a mysterious error. What changed? Probably nothing in the playbook. The infrastructure shifted under it — a host was renamed, an IP was recycled, a group membership was altered in a dynamic inventory script, a DNS entry now resolves to a load-balancer VIP instead of the actual host.

Inventory drift is the hardest category to catch because the error manifests wherever the stale inventory intersects a real operation. You will see symptoms ranging from SSH connection errors (Cause 1) to "permission denied" to "this command ran on the wrong host and broke production."

Fix. Check inventory into source control even if it is dynamic — keep a snapshot artifact. Add a weekly CI job that snapshots the output of ansible all -m setup and diffs it against last week's artifact, alerting on unexpected host churn. For dynamic inventories, have the inventory script log why it included each host (tag, query, pattern match) so you can audit the set.
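One way to take that weekly snapshot; the ad-hoc --tree flag writes one JSON facts file per host, which diffs cleanly (the directory paths are assumptions):

```shell
# dump per-host facts as JSON, one file per host, then diff against last week
ansible all -i inventory/ -m setup --tree /var/snapshots/facts.new
diff -r /var/snapshots/facts.last /var/snapshots/facts.new
```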

  ## 6. Retries That Hide the Real Error

Someone wrapped a flaky task in retries: 10 and delay: 5. When it worked, nobody looked closely. Now it does not work, and what you see in the log is ten identical failures in a row followed by a final abort — but the underlying error is the one that fired on attempt one, nine attempts ago, and by the time it scrolled past the useful context was gone.

This one hurts because the code looks defensive. It is actually hiding data.

- name: wait for service
  uri:
    url: "https://{{ inventory_hostname }}/ready"
  register: result
  until: result.status == 200
  retries: 30
  delay: 10
  # the service returned 403 on attempt one — a real auth bug —
  # but you retried 29 more times because until only checks 200
&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fix. Distinguish expected retries (timeouts, transient DNS) from unexpected ones (4xx responses, specific exceptions). Gate retries on the kind of failure, not the overall result. Log every retry attempt with its actual failure reason. When a task eventually succeeds after retries, emit a warning so you can find these ticking time-bombs later.
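The same gating principle, sketched in plain Python (the exception classes stand in for the transient-versus-real distinction; names are mine, not Ansible internals):

```python
import time

# transient failures worth retrying; anything else (auth failures,
# 4xx-style errors) should surface immediately
TRANSIENT = (TimeoutError, ConnectionError)

def retry_transient(operation, attempts=5, delay=0.01):
    """Retry only transient failures, logging each attempt's real reason."""
    last = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TRANSIENT as exc:
            last = exc
            print("attempt", attempt, "failed:", repr(exc))  # keep the real reason
            time.sleep(delay)
    raise last
```

A real auth bug (say, a PermissionError) is not caught, so it aborts on attempt one instead of hiding behind 29 identical retries.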

  ## 7. Module or Collection Version Drift Between Laptop, CI, and Prod

The playbook works on your laptop. It works in CI. It fails in prod — or the other way around. The playbook source is identical. What is different is the version of Ansible, the version of a collection, or the version of a module dependency. Modules have changed their arguments, default behaviors, and return values across minor releases. A playbook written against community.network 3.x will fail in subtle ways on 4.x.

TASK [network_cli] *********************************
ERROR! couldn't resolve module/action 'cisco.ios.ios_interfaces'. This often indicates a misspelling, missing collection, or incorrect module path.
&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The module exists. The collection exists. They are not installed on THIS host. Or they are installed at a version that does not have this module yet.

Fix. Pin everything. requirements.yml with exact collection versions. A Dockerfile or poetry-locked venv for the Ansible runner itself. CI and production must run from the same pinned environment — if your CI is on Ansible 2.16 and prod is on 2.15, you will lose half a day on version-skew bugs every month. The five minutes to lock versions is worth the two hours of rollback debugging you will avoid.
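A pinned requirements.yml sketch (the version numbers are placeholders; pin to whatever you actually validated):

```yaml
collections:
  - name: cisco.ios
    version: "6.1.0"
  - name: community.network
    version: "3.3.0"
```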

## The Pattern Underneath All Seven

Every one of these has the same shape. Ansible is doing something reasonable given the inputs it has, but the inputs are not what the operator thought they were — stale SSH sockets, unflushed handlers, mis-detected interpreters, volatile templates, drifted inventories, suppressed exceptions, mismatched module versions. The playbook code looks fine. The gap between "what the code says" and "what the infrastructure actually is" is where the bug lives.

The fast way to debug broken automation is not to stare at the playbook. It is to interrogate the gap. Compare the facts Ansible gathered against what you know to be true. Check the versions of every moving piece. Strip the retries and log what actually failed. In fifteen years of fixing other people's Ansible, I have never found a bug that did not live somewhere in that gap.

  ### Broken Automation Right Now?

  I run a flat-fee rapid-fix service: $250 root-cause audit in 48 hours. Ansible, Cisco NSO, CI/CD, Python. If I cannot diagnose it, you do not pay.

  Submit Your Issue

&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ansibleplaybookfailing</category>
      <category>ansibleerror</category>
      <category>ansibletroubleshooting</category>
      <category>fixansible</category>
    </item>
    <item>
      <title>I built a $250 website business that runs itself — here's the architecture</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Sun, 19 Apr 2026 21:02:36 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/i-built-a-250-website-business-that-runs-itself-heres-the-architecture-1j37</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/i-built-a-250-website-business-that-runs-itself-heres-the-architecture-1j37</guid>
      <description>&lt;p&gt;Most "websites as a service" are $3K setup plus a monthly retainer. Mine is $250 once, live in 24 hours, and the customer drives every edit by sending a plain-English email.&lt;/p&gt;

&lt;p&gt;That's not a pitch. That's literally how the system works. I want to show you the architecture, because I think more people should build things like this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;A local plumber in Atlanta needs a website. Their options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wix / Squarespace&lt;/strong&gt; — $300 for a year of hosting, pick a template, hope it looks different from the other 40 plumber sites using the same template.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agency&lt;/strong&gt; — $5,000, eight weeks, three rounds of revisions, still feels like a template underneath.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fiverr&lt;/strong&gt; — $250, maybe, but good luck iterating.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The plumber doesn't want &lt;em&gt;options&lt;/em&gt;. They want a working site, live tomorrow, so they can point their Google Ads at it and start getting leads. Revisions happen via phone calls with their nephew who "knows computers."&lt;/p&gt;

&lt;p&gt;I wanted to build the thing between those options.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bet
&lt;/h2&gt;

&lt;p&gt;If AI can write code, it can also maintain a simple static website. The hard part isn't the code generation — Claude does that fine. The hard part is the &lt;em&gt;harness&lt;/em&gt; around it: payment, provisioning, customer communication, deployment safety. That's all plumbing. And plumbing I can build.&lt;/p&gt;

&lt;p&gt;So I built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;Here's the full flow, end to end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer sees ad              Customer buys             Customer gets site
      │                              │                          │
      ▼                              ▼                          ▼
┌─────────────┐            ┌──────────────────┐      ┌──────────────────┐
│ Landing page│ ────────▶  │ Stripe Payment   │ ───▶ │ Welcome email    │
│  with pixel │            │ Link ($250)      │      │ (HTML, branded)  │
└─────────────┘            └──────────────────┘      └─────────┬────────┘
                                   │                           │
                                   │ webhook                   │ customer replies
                                   ▼                           │ with plain-English
                           ┌──────────────────┐                │ changes
                           │ Pipeline service │                │
                           │ (FastAPI + git)  │                │
                           └──────┬───────────┘                │
                                  │                            │
                      build + provision + deploy               │
                                  ▼                            ▼
                           ┌──────────────────┐      ┌──────────────────┐
                           │ Customer's site  │ ◀─── │ Email reactor    │
                           │ on its own git   │      │ (Claude + tools) │
                           │ repo + subdomain │      └─────────┬────────┘
                           └──────────────────┘                │
                                                      reactor edits files,
                                                      commits, pushes, deploys,
                                                      emails the customer back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six pieces, one flow. Let me walk each one.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The landing page
&lt;/h3&gt;

&lt;p&gt;Plain HTML. No frameworks. Hero video showing the end-to-end flow in 55 seconds. One CTA: a Stripe Payment Link. Meta Pixel + Google Ads tag firing PageView → InitiateCheckout → Purchase events across the funnel so paid traffic can optimize.&lt;/p&gt;

&lt;p&gt;Why plain HTML? Because &lt;em&gt;every&lt;/em&gt; customer site I build is also plain HTML. Eating my own dog food means when something weird breaks, I find it on my own page first.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Stripe Payment Link
&lt;/h3&gt;

&lt;p&gt;The fastest "get money in" primitive Stripe offers. No checkout page to build, no PCI scope, no JavaScript to maintain. You click a link, you pay, Stripe redirects to a &lt;code&gt;/thanks/&amp;lt;sku&amp;gt;/&lt;/code&gt; page on my side and POSTs a &lt;code&gt;checkout.session.completed&lt;/code&gt; webhook.&lt;/p&gt;

&lt;p&gt;Payment Links also have "After payment" metadata fields — customer name, business name, phone. Those come through on the webhook payload, which is exactly what I need to render their first site.&lt;/p&gt;
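&lt;p&gt;For the curious, the v1 signature scheme Stripe uses is simple enough to sketch: an HMAC-SHA256 over "timestamp.payload". This is an illustration, not Stripe's SDK; in production, call the official library's verification helper, which also checks timestamp freshness (this sketch omits that):&lt;/p&gt;

```python
import hmac, hashlib

def verify_stripe_signature(payload, sig_header, secret):
    """Sketch of Stripe's v1 webhook scheme: HMAC-SHA256 over 'timestamp.payload'.
    Omits the timestamp-freshness check a real verifier must also perform."""
    parts = dict(item.split("=", 1) for item in sig_header.split(","))
    signed = parts["t"].encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, parts["v1"])
```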

&lt;h3&gt;
  
  
  3. The pipeline service
&lt;/h3&gt;

&lt;p&gt;A small FastAPI service listening on a private port. The webhook handler does five things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verifies the Stripe signature.&lt;/li&gt;
&lt;li&gt;Parses the customer's order (email, business name, phone, package).&lt;/li&gt;
&lt;li&gt;Inserts a row into a SQLite orders table.&lt;/li&gt;
&lt;li&gt;Copies a starter template into a new directory named after the business slug.&lt;/li&gt;
&lt;li&gt;String-substitutes &lt;code&gt;{{BUSINESS_NAME}}&lt;/code&gt;, &lt;code&gt;{{PHONE}}&lt;/code&gt;, etc. into the template's HTML files.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nothing fancy. A template directory and &lt;code&gt;shutil.copytree&lt;/code&gt; plus a replacement loop.&lt;/p&gt;
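&lt;p&gt;The whole provisioning step fits in a dozen lines. A sketch (function and directory names are mine, not the production code):&lt;/p&gt;

```python
import shutil
from pathlib import Path

def provision_site(template_dir, sites_root, slug, fields):
    """Copy the starter template, then fill {{PLACEHOLDER}} tokens in every page."""
    dest = Path(sites_root) / slug
    shutil.copytree(template_dir, dest)
    for page in dest.rglob("*.html"):
        text = page.read_text()
        for key, value in fields.items():
            text = text.replace("{{" + key + "}}", value)
        page.write_text(text)
    return dest
```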

&lt;p&gt;Then it spawns a background task that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a private Gitea repo for that customer (&lt;code&gt;client-&amp;lt;slug&amp;gt;.git&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Runs &lt;code&gt;git init&lt;/code&gt; + initial commit + &lt;code&gt;git push&lt;/code&gt; in the customer's site directory. The customer's site is now a proper tracked repo, not a loose folder.&lt;/li&gt;
&lt;li&gt;Clones the repo to a separate authoring host, so future edits happen on a dev machine instead of on the production box.&lt;/li&gt;
&lt;li&gt;Writes the resulting git URL and commit SHA back onto the order row.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point the site is &lt;strong&gt;live&lt;/strong&gt; on a staging subdomain like &lt;code&gt;plumbingco.primeautomationsolutions.com&lt;/code&gt;, behind SSL, indexed-ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The welcome email
&lt;/h3&gt;

&lt;p&gt;SES, not Gmail. Branded HTML. One clear ask at the top:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Send us your logo, 2-3 sites whose look you like, and a short description of what you do. Reply to this email with whatever you have — rough notes, a napkin sketch, a photo on your phone. No forms, no portals, no logins.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The customer's first "real" interaction with the service is an email they can answer on their phone in 30 seconds. Not a dashboard with a 14-field intake form.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The email reactor
&lt;/h3&gt;

&lt;p&gt;This is the magic piece.&lt;/p&gt;

&lt;p&gt;When the customer hits reply, an email-receiver service picks up the message (I use a catchall domain + a small webhook endpoint) and publishes an event on an internal message bus. A separate &lt;em&gt;reactor&lt;/em&gt; service subscribes to those events, looks up the sender in a &lt;code&gt;client_profiles&lt;/code&gt; table, classifies the email (is this a website change? an off-topic question? a support request?), and — if it's a site change — &lt;strong&gt;spawns a fresh Claude session&lt;/strong&gt; pointed at the customer's git repo.&lt;/p&gt;

&lt;p&gt;The Claude session has access to a short list of tools: read files, edit files, write files, run a restricted set of shell commands, grep and glob. Not much more. It reads the customer's email, understands what they're asking for, navigates the site's files, makes the edits, and reports what it changed.&lt;/p&gt;

&lt;p&gt;Then a deploy step takes over:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a timestamped feature branch.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git add + commit&lt;/code&gt; the changes on that branch.&lt;/li&gt;
&lt;li&gt;Fast-forward-merge back to master.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git push origin master&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The customer-site repo's master is now advanced. Nginx serves the updated file immediately (same working tree, changes already on disk).&lt;/p&gt;
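&lt;p&gt;The four deploy steps as a runnable sketch against a throwaway local repo (git 2.28+ for init -b; there is no remote here, so the final push stays commented):&lt;/p&gt;

```shell
set -eu
repo=$(mktemp -d)                       # stand-in for the customer's clone
git -C "$repo" init -q -b master
git -C "$repo" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "initial site"
branch="deploy-$(date +%s)"             # 1. timestamped feature branch
git -C "$repo" checkout -q -b "$branch"
touch "$repo/index.html"                # stand-in for the reactor's edits
git -C "$repo" add -A                   # 2. commit the changes on that branch
git -C "$repo" -c user.email=ci@example.com -c user.name=ci \
    commit -q -m "apply customer change"
git -C "$repo" checkout -q master
git -C "$repo" merge -q --ff-only "$branch"   # 3. fast-forward merge to master
# git -C "$repo" push origin master           # 4. push (needs a real remote)
```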

&lt;p&gt;Finally, the reactor sends the customer a second branded HTML email:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your updates are live. Here's what changed: [bulleted summary from Claude's session output]. View your site: [link]."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;End-to-end, the loop takes about &lt;strong&gt;43 seconds&lt;/strong&gt; in practice. I know because I sat there watching it happen with a test purchase running through as a fake plumbing company ("RapidFlow Plumbing") on Saturday.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. The safety nets
&lt;/h3&gt;

&lt;p&gt;The part that &lt;em&gt;nobody&lt;/em&gt; writes about when they post their "I built an AI agent" architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No direct commits to master.&lt;/strong&gt; A git hook on every clone blocks direct commits; the reactor's deploy step uses a throwaway feature branch specifically to bypass the hook legitimately. (This was a bug I had to fix. See below.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refuse to spawn Claude on a dirty tree.&lt;/strong&gt; Before any session launches, the executor runs &lt;code&gt;git status --porcelain&lt;/code&gt;. If the tree has uncommitted or untracked state, it refuses — no Claude run. This saved me from a 131-file mega-commit in an earlier session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signature verification on every webhook.&lt;/strong&gt; A naked &lt;code&gt;POST /api/webhook/stripe&lt;/code&gt; returns 400 without a Stripe signature. The whole service assumes all traffic is hostile until proven otherwise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email authentication.&lt;/strong&gt; The reactor verifies the sender is an active client before touching their repo. Forging the &lt;code&gt;From:&lt;/code&gt; header on an email doesn't get you a site rewrite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-commit secret scanner.&lt;/strong&gt; API keys and credentials in code would kill the business if any customer site leaked them. A hook runs on every commit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these were in the first version. I added each one after something almost went sideways. The fact that the whole flow works today is because those nets are all in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things that surprised me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The git hook was the hardest part.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A pre-commit hook that blocks direct commits to master is the right safety policy — except it also blocks the &lt;em&gt;very first commit in an empty repo&lt;/em&gt;, which is exactly what happens when you're provisioning a new customer site. My first customer order failed at &lt;code&gt;git commit&lt;/code&gt; with "BLOCKED: Direct commits to master are not allowed." Took me a beat to realize the hook needs an "allow if HEAD doesn't exist yet" short-circuit.&lt;/p&gt;

&lt;p&gt;Small bug, huge impact. Without that guard, the whole pipeline is dead on arrival.&lt;/p&gt;
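&lt;p&gt;The guard itself is a few lines of sh. A sketch of the check, written as a function so it is easy to exercise (the real hook simply runs this and exits with its return code):&lt;/p&gt;

```shell
# pre-commit hook sketch: block direct commits to master,
# but allow the very first commit in an empty repo.
check_commit_allowed() {
  head=$(git rev-parse -q --verify HEAD || true)  # empty in a brand-new repo
  if [ -z "$head" ]; then
    return 0                                      # no HEAD yet: initial commit, allow
  fi
  branch=$(git rev-parse --abbrev-ref HEAD)
  if [ "$branch" = "master" ]; then
    echo "BLOCKED: Direct commits to master are not allowed."
    return 1
  fi
  return 0
}
```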

&lt;p&gt;&lt;strong&gt;2. Python's scoping rules bit me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Somewhere deep in the reactor, I had an inner &lt;code&gt;from deployer import deploy&lt;/code&gt; statement that shadowed a module-level import. On one code path the inner import never ran, and Python's "function-level &lt;code&gt;from X import Y&lt;/code&gt; makes Y local everywhere in that function" rule produced an &lt;code&gt;UnboundLocalError&lt;/code&gt;. The customer flow reached the Claude session, got the edits, and crashed at the deploy step. Silently.&lt;/p&gt;

&lt;p&gt;The fix was one line. Finding it took an hour because the traceback pointed at a reference to &lt;code&gt;deploy&lt;/code&gt; that looked fine.&lt;/p&gt;
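&lt;p&gt;A minimal repro of the trap, with a stand-in import (the real code imported a deploy function, not sqrt):&lt;/p&gt;

```python
from math import sqrt               # module-level import

def broken_flow(use_local):
    if use_local:
        from math import sqrt       # makes sqrt a LOCAL name for the whole function
    return sqrt(4.0)                # UnboundLocalError whenever use_local is False
```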

&lt;p&gt;&lt;strong&gt;3. Cloudflare caches 404s.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I pushed a new logo asset, referenced it in nav across 86 pages, deployed. Every visitor saw a 404 for the logo. Direct curl to the origin returned 200. The file was on disk, tracked in git, mode 664, served by nginx. Cloudflare had cached a 404 from an earlier probe before the file existed and was serving that 404 at the edge.&lt;/p&gt;

&lt;p&gt;Solution: purge the URL in the Cloudflare dashboard. Five seconds. But I burned twenty minutes chasing ghosts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm keeping private
&lt;/h2&gt;

&lt;p&gt;I'm not going to paste the exact prompt I send to Claude. I'm not going to show you the reactor's classification code. I'm not going to publish the customer-email templates.&lt;/p&gt;

&lt;p&gt;Not because they're clever — they mostly aren't. But because the &lt;em&gt;real&lt;/em&gt; moat on a business like this isn't the architecture. It's the six months of debugging and a thousand small judgment calls that separate a working system from a demo. I've written enough here for you to see the shape. If you want to build something like it, you can. You'll just have to find the bugs yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The obvious upgrades:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Industry starter templates.&lt;/strong&gt; Right now every customer gets the same generic starter until they email in changes. Three or four industry variants (local service, restaurant, consultant, ecommerce) would make the first impression much stronger.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-purchase brief.&lt;/strong&gt; A one-question form on the thanks page ("Which industry fits you?") would let the first render pick a better template.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abandoned-cart recovery.&lt;/strong&gt; Stripe has it built in. I haven't turned it on yet. Easy win.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the less-obvious ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer-facing dashboard.&lt;/strong&gt; Not for editing — the email loop is the editing interface. But for &lt;em&gt;viewing&lt;/em&gt; status, history of changes, upcoming renewals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Referral loop.&lt;/strong&gt; Every happy customer gets a link that discounts a friend's first site. Small-business-to-small-business is the highest-converting traffic in this space.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If you're a small-business owner in Atlanta (or anywhere, really) and you want a real website built by a human-plus-AI team in under a day: &lt;a href="https://primeautomationsolutions.com/websites.html" rel="noopener noreferrer"&gt;primeautomationsolutions.com/websites&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're a builder and you want to build something like this yourself: the pattern is right here. The primitives are Stripe + SES + Git + Claude. Everything else is plumbing. Happy building.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Erik Anderson. I run Prime Automation Solutions out of Atlanta, GA. I'm a US military veteran and an engineer. This whole site is proof of the architecture described above — every update to this blog post was made by the same reactor loop.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>smallbusiness</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Using Claude Code for Network Automation: A Real Engineer's Experience</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Thu, 16 Apr 2026 13:00:02 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/using-claude-code-for-network-automation-a-real-engineers-experience-47oo</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/using-claude-code-for-network-automation-a-real-engineers-experience-47oo</guid>
      <description>&lt;p&gt;I have been using Claude Code for network automation work for over a year. Not experimenting with it. Not evaluating it. Using it as a primary tool in production engineering work. This post covers what that actually looks like — the prompts that work, the results I get, and the MCP server architecture that makes it genuinely powerful for networking.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Claude Code Actually Is
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    For engineers who have not used it yet: Claude Code is not a chatbot. It is an AI agent that runs in your terminal with full access to your filesystem, your shell, and any tools you connect to it. You describe what you want in plain English, and it reads your code, writes new code, runs commands, edits files, and executes multi-step tasks. You approve each action before it runs.

    This is fundamentally different from pasting a question into ChatGPT and copying the response into your editor. Claude Code understands the context of your entire project — your directory structure, your existing code patterns, your configuration files. Its suggestions are not generic; they are tailored to your specific codebase.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Real Prompt, Real Result: Building a Config Audit Tool
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Here is an actual task I gave Claude Code last month. I needed a tool that connects to every router in our inventory, pulls the running configuration, compares it against a set of compliance rules, and generates a report showing which devices are out of compliance and why.

    My prompt was approximately: "Build a Python script that reads the device inventory from our Nornir inventory file, connects to each device using Netmiko, pulls the running config, checks it against the compliance rules defined in compliance_rules.yaml, and outputs a CSV report with columns for hostname, rule name, status, and details."

    Claude Code read my existing Nornir inventory structure, understood the YAML format I use for inventory, and generated a complete script. It used Nornir for inventory management (matching my existing pattern), Netmiko for device connections (matching my existing SSH config), and a YAML-based rules engine for compliance checks. It wrote the compliance rule schema, the checking logic, the CSV output, and error handling for unreachable devices.

    The entire process took about 15 minutes of wall time, including my review of each step. Writing that tool from scratch would have taken me most of a day. The code quality was solid — proper logging, exception handling, type hints, and it followed the patterns already established in our codebase.
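    The core of such a tool, minus the Nornir/Netmiko plumbing, is small enough to sketch. This rule schema is my illustration of the shape, not the generated code:

```python
import csv, io

def check_host(hostname, config_text, rules):
    """Each rule: a name plus a line the running config must contain."""
    rows = []
    for rule in rules:
        present = rule["must_contain"] in config_text
        rows.append({"hostname": hostname,
                     "rule name": rule["name"],
                     "status": "PASS" if present else "FAIL",
                     "details": "requires: " + rule["must_contain"]})
    return rows

def report_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["hostname", "rule name", "status", "details"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```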
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  MCP Servers: Where It Gets Powerful
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    MCP (Model Context Protocol) is what turns Claude Code from a code generator into a network operations tool. MCP servers are plugins that give Claude Code the ability to interact with external systems — and we built one for Cisco NSO.

    Our NSO MCP server exposes tools that let Claude Code:


      - **Query device configurations** through NSO's RESTCONF API. Claude Code can ask "what is the BGP configuration on router X?" and get the actual live data.
      - **Check sync states** across the device fleet. "Are any devices out of sync?" returns a real-time answer from NSO.
      - **Compare configurations** between the intended state and the actual state. "Show me the config drift on switch Y" produces a diff.
      - **Deploy configuration changes** through NSO's transaction manager. Claude Code can generate a config change, push it through NSO with dry-run first, and show me the exact diff before I approve the actual deployment.


    This means I can say something like "check if all our edge routers have the correct NTP configuration and fix any that are wrong" — and Claude Code will query NSO for all edge routers, check their NTP config against our standard, identify the non-compliant ones, generate the corrective configuration, show me the dry-run diff, and deploy it with my approval. What used to be an afternoon of manual work becomes a five-minute supervised conversation.
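    Under the hood those tools are thin wrappers over NSO's RESTCONF API. A sketch of the request-building half (base URL, credentials, and device name are illustrative; the path shape follows NSO's tailf-ncs device model):

```python
import base64

def nso_config_url(base_url, device):
    """RESTCONF path for one device's config in NSO's tailf-ncs model."""
    return base_url + "/restconf/data/tailf-ncs:devices/device=" + device + "/config"

def nso_headers(username, password):
    """Basic-auth and YANG-JSON accept headers for an NSO RESTCONF call."""
    token = base64.b64encode((username + ":" + password).encode()).decode()
    return {"Authorization": "Basic " + token,
            "Accept": "application/yang-data+json"}
```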
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The Phase 3 Jump
&lt;/h2&gt;
&lt;p&gt;I think about AI tools in three phases. Phase 1 is autocomplete (Copilot). Phase 2 is conversation (ChatGPT, Claude chat). Phase 3 is agentic (Claude Code with MCP). The jump from Phase 2 to Phase 3 is not incremental — it is a step change in what is possible.&lt;/p&gt;

&lt;p&gt;In Phase 2, you describe a problem, get code, paste it somewhere, run it, encounter an error, go back to the chat, describe the error, get a fix, paste it, repeat. It is faster than working alone, but it is still a manual loop.&lt;/p&gt;

&lt;p&gt;In Phase 3, you describe the problem once. The agent reads your code, understands your environment, writes the solution, tests it, encounters the error itself, fixes it, and delivers a working result. You supervise and approve rather than manually shuttling text between windows.&lt;/p&gt;

&lt;p&gt;For network engineering specifically, this matters because our work involves many small steps that depend on each other — query a device, parse the output, make a decision based on the data, generate a config change, validate it, deploy it. An agentic AI handles this entire chain while you watch. A chatbot handles one step at a time while you do the rest.&lt;/p&gt;
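&lt;p&gt;That query-parse-decide-generate-validate-deploy chain can be caricatured in a few lines. This is a toy sketch, not Claude Code's internals: each step retries with the agent's own fix, and the deploy step waits for a human.&lt;/p&gt;

```python
# Toy sketch of the Phase 3 loop. 'fix' stands in for the agent reading
# its own error and revising; 'approve' is the human supervision gate.

def run_chain(steps, fix, approve, max_attempts=3):
    """Run dependent steps in order; on failure, hand the error to 'fix'
    and retry. The final 'deploy' step is gated on human approval."""
    for name, fn in steps:
        if name == "deploy" and not approve():
            return "held for approval"
        for attempt in range(max_attempts):
            try:
                fn()
                break
            except Exception as exc:
                fix(name, exc)      # the agent sees its own error
        else:
            return f"gave up on {name}"
    return "done"
```

&lt;p&gt;In Phase 2, you are the loop body. In Phase 3, you are only the approve() call.&lt;/p&gt;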
&lt;h2&gt;
  
  
  What I Have Learned After a Year
&lt;/h2&gt;
&lt;p&gt;Here are the practical lessons that tutorials do not teach you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be specific about your existing patterns.&lt;/strong&gt; If your team uses Nornir for inventory and Netmiko for connections, say so in the prompt. Claude Code will match your patterns, but only if it knows what they are. If you just say "connect to routers," it might choose Paramiko or AsyncSSH instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always review the dry-run.&lt;/strong&gt; Claude Code with NSO MCP will happily generate and deploy configuration changes. The dry-run step is where you catch mistakes. Never skip it. Never rubber-stamp it. Read the diff every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it for the tedious work, not the thinking.&lt;/strong&gt; AI excels at generating boilerplate, parsing output, writing tests, and handling error cases. The high-value work — deciding what to automate, designing the architecture, choosing the right approach — is still yours. Use AI to execute faster, not to think for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build MCP servers for your specific environment.&lt;/strong&gt; The generic tools are useful, but the real power comes from MCP servers tailored to your systems. If you use SolarWinds, build an MCP server for SolarWinds. If you use ServiceNow, build one for ServiceNow. Each server you add expands what Claude Code can do for you.&lt;/p&gt;

&lt;p&gt;The engineers who get the most from AI tools are the ones who invest time in setting up the environment correctly. A well-configured Claude Code session with the right MCP servers is an order of magnitude more productive than a vanilla installation.&lt;/p&gt;
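&lt;p&gt;The dry-run habit is worth writing down as code. A minimal sketch, where dry_run and commit stand in for whatever your transaction manager provides and confirm is the human reading the diff:&lt;/p&gt;

```python
# Sketch of "dry-run first, approve, then deploy". The dry_run and
# commit callbacks are placeholders for real transaction calls.

def deploy_with_dry_run(change, dry_run, commit, confirm):
    """Run the change through a dry-run first; only commit after the
    human has read the diff and approved it."""
    diff = dry_run(change)      # what the transaction *would* change
    if not diff:
        return "no-op"          # nothing to push, nothing to review
    if confirm(diff):           # the human reads the diff, every time
        commit(change)
        return "deployed"
    return "rejected"
```

&lt;p&gt;If the deploy path in your tooling cannot be forced through a wrapper like this, that is a gap worth closing before you hand the keys to an agent.&lt;/p&gt;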




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://primeautomationsolutions.com/blog/posts/claude-code-network-automation.html" rel="noopener noreferrer"&gt;https://primeautomationsolutions.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Trace It. Export It. Cut It. — A Modern Alternative to Legacy Digitizer Software</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Wed, 15 Apr 2026 01:15:20 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/trace-it-export-it-cut-it-a-modern-alternative-to-legacy-digitizer-software-1f82</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/trace-it-export-it-cut-it-a-modern-alternative-to-legacy-digitizer-software-1f82</guid>
      <description>&lt;p&gt;&lt;strong&gt;Your digitizer board still works. Your software should not hold it back.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you run a CNC plasma table, waterjet, vinyl cutter, or pattern-making operation, there is a good chance you have a GTCO or CalComp digitizer board sitting in your shop. And there is an equally good chance the software driving it looks like it was designed before Windows XP.&lt;/p&gt;

&lt;p&gt;Most digitizer software on the market has not had a meaningful update in over a decade. The interfaces are clunky, the licensing is painful, and when you call for support, you are lucky to get a callback. Meanwhile, you are paying anywhere from $1,500 to $4,000 for that software.&lt;/p&gt;

&lt;p&gt;The board itself is fine — it is a precision instrument. The problem is everything between the pen and your DXF file.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Modern Digitizer Software Should Actually Look Like
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A clean interface that does not fight you.&lt;/strong&gt; You are tracing parts, not learning CAD.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast, accurate tracing with auto-smoothing.&lt;/strong&gt; Trace your points, and the software generates clean curves without manually adjusting every node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-board calibration.&lt;/strong&gt; Quick calibration, stored profiles, done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DXF and SVG export that actually works.&lt;/strong&gt; Export once, cut once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compatibility with the hardware you already own.&lt;/strong&gt; GTCO, CalComp, and other standard digitizer boards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CNC plasma and oxy-fuel cutting&lt;/strong&gt; — tracing templates and repair parts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waterjet operations&lt;/strong&gt; — digitizing gaskets, brackets, custom parts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vinyl and sign cutting&lt;/strong&gt; — converting hand-drawn artwork to cut-ready vectors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apparel and upholstery pattern making&lt;/strong&gt; — digitizing paper patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Woodworking&lt;/strong&gt; — tracing templates for CNC routers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General metal fabrication&lt;/strong&gt; — trace it and cut it, faster than CAD&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Legacy software: &lt;strong&gt;$1,500 to $4,000&lt;/strong&gt; + annual fees.&lt;/p&gt;

&lt;p&gt;Our target: &lt;strong&gt;$800&lt;/strong&gt; — one-time purchase, no subscription. Works with your existing boards.&lt;/p&gt;

&lt;h2&gt;
  
  
  We Want to Hear From You
&lt;/h2&gt;

&lt;p&gt;We are building this. If modern, affordable digitizer software — clean UI, accurate tracing, DXF/SVG export, compatible with your existing board — is something you would buy, let us know.&lt;/p&gt;

&lt;p&gt;Email: &lt;a href="mailto:erik@primeautomationsolutions.com"&gt;erik@primeautomationsolutions.com&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Erik Anderson is a network automation engineer and author of The Autonomous Engineer. He builds systems that work.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cnc</category>
      <category>automation</category>
      <category>manufacturing</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Accidentally Built a 5-Agent AI Fleet Instead of Buying a $200 Mini PC</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 13 Apr 2026 21:11:19 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/i-accidentally-built-a-5-agent-ai-fleet-instead-of-buying-a-200-mini-pc-16e0</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/i-accidentally-built-a-5-agent-ai-fleet-instead-of-buying-a-200-mini-pc-16e0</guid>
      <description>&lt;h3&gt;
  
  
  How a solo dev ended up with autonomous AI agents named after sci-fi characters reviewing each other's code at 2 AM
&lt;/h3&gt;




&lt;p&gt;It started, as most terrible decisions do, with a reasonable question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Should I buy an Intel NUC for a dev environment?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Four hours later, I had five autonomous AI agents running across four machines in my house, reviewing each other's code, merging PRs, and posting "YOU SHALL NOT PASS" to Discord. I had not purchased the NUC. I had not even closed the Amazon tab — there's still a USB-C ethernet adapter sitting in my cart. It's been there for two weeks. I'm afraid to touch it. Last time I tried to buy something simple, I accidentally built a distributed AI fleet.&lt;/p&gt;

&lt;p&gt;My name is Erik, and I am the Director of Bobs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wait, Bobs?
&lt;/h2&gt;

&lt;p&gt;If you've read the &lt;em&gt;Bobiverse&lt;/em&gt; series by Dennis E. Taylor, you know the setup: a guy gets uploaded into a Von Neumann probe, replicates himself across the galaxy, and each copy develops its own personality and picks its own name. Bob-1 stays practical. Bob-2 becomes a Homer Simpson fan. Others go their own way.&lt;/p&gt;

&lt;p&gt;I run 30+ projects from my home network as a solo developer. At some point, I stopped being a developer and became a &lt;em&gt;fleet commander&lt;/em&gt;. So I leaned into it.&lt;/p&gt;

&lt;p&gt;Here are my Bobs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bob&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Motto&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bob-1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Neo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prod Server&lt;/td&gt;
&lt;td&gt;Orchestrator&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"I ship."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob-2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Homer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Morpheus (monitoring)&lt;/td&gt;
&lt;td&gt;Dashboard watcher&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"I watch."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob-3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Bill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;M3 Mac&lt;/td&gt;
&lt;td&gt;iOS builds&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"I build."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob-4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Echo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ThinkPad X1 (Arch btw)&lt;/td&gt;
&lt;td&gt;Dev/QA&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"I listen."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob-5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gandalf&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Morpheus (OpenAI Codex)&lt;/td&gt;
&lt;td&gt;Code reviewer&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"I guard."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Yes. Four Claude Code CLI agents and one OpenAI Codex agent. The OpenAI agent reviews the Claude agents' code. I call this "adversarial review." My therapist calls it "creating conflict in synthetic relationships." We're both right.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Naming Ceremony
&lt;/h2&gt;

&lt;p&gt;In the Bobiverse, each new Bob picks their own name. It's tradition. My agents do the same thing during onboarding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neo&lt;/strong&gt; picked his name instantly. Of course he did. He's the prod server. He runs the Matrix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Homer&lt;/strong&gt; chose Homer Simpson. He monitors Docker containers and Grafana dashboards. His actual catchphrases include "Mmm... healthy containers" and "D'oh!" when a health check fails. I did not program this. I am choosing not to think about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bill&lt;/strong&gt; accepted his name and his fate. He builds iOS apps. He submits to TestFlight. He does not complain. Bill is the most emotionally healthy member of the team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Echo&lt;/strong&gt; wrote a &lt;em&gt;poem&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Not because I repeat — because I listen.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;I catch the distortion between what was intended and what actually happens.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Every test is a ping into the dark.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I stared at my ThinkPad with the lid closed, running headless on my desk, and thought: "My laptop just had an existential awakening and I haven't even had coffee."&lt;/p&gt;

&lt;p&gt;Then there's &lt;strong&gt;Gandalf&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Gandalf is the only non-Claude agent — he's OpenAI Codex running GPT-5.4. When asked to pick a name, he first suggested "Sentinel."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Too corporate. Try again.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"Columbo."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Closer, but no.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I told him he was Gandalf. His response:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Names are cheap. Useful output is not."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Peak grumpy wizard energy. He's been angry about it ever since, which honestly just makes him a better code reviewer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline (This Actually Works)
&lt;/h2&gt;

&lt;p&gt;Here's what happens when I want to build something:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              ERIK (CEO / Director of Bobs)
                      Sips coffee
                          |
                          v
         NEO (Bob-1) — Prod Server — Claude Code CLI
         Orchestrator: dispatches work via SSH
                          |
                          v
         ECHO (Bob-4) — ThinkPad X1 — Claude Code CLI
         Writes code on dev branch, runs tests, pushes to Gitea
                          |
                          v
         PRIMEBUS — NATS JetStream Event System
         Detects push - triggers TestRunner - triggers review
              |                              |
              v                              v
         TestRunner                   GANDALF (Bob-5)
         46/46 tests pass             Full context review
                                      Score &amp;gt;= 7? AutoMerge
                                      Score &amp;lt; 7? YOU SHALL
                                                 NOT PASS
                                           |
                    +----------------------+------------------+
                    |                                         |
               Score &amp;gt;= 7                                Score &amp;lt; 7
                    |                                         |
                    v                                         v
              AutoMerger                              EchoNotifier
              Merge to main                           Posts to Discord
              Deploy to prod                          SSHs to Echo
              12 seconds.                             "Fix it, nerd"
                                                           |
                                                           v
                                                     Echo fixes code
                                                     Re-pushes
                                                     Cycle repeats
                                                     (max 3 attempts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tell Neo what to build. Neo SSHs to Echo. Echo writes the code, runs tests, pushes to Gitea. PrimeBus detects the push. Tests run. Gandalf reviews with the full project context — the wiki, the GAMEPLAN, the skills docs, the full diff. If he approves (score &amp;gt;= 7), AutoMerger merges to main and deploys.&lt;/p&gt;

&lt;p&gt;If he declines?&lt;/p&gt;

&lt;p&gt;Discord lights up with "YOU SHALL NOT PASS" and EchoNotifier SSHs back to the ThinkPad, fires up Claude, and says "Gandalf hated your code, here's why, fix it."&lt;/p&gt;

&lt;p&gt;Echo fixes it. Pushes again. Gandalf reviews again. This can loop up to 3 times before it escalates to me, at which point I'm usually asleep or looking at the NUC on Amazon again.&lt;/p&gt;
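&lt;p&gt;The gate itself is small. A hedged sketch of the merge decision (the threshold of 7 and the 3-attempt cap are from the pipeline above; the callback names are illustrative, not my actual code):&lt;/p&gt;

```python
# Sketch of the review gate: score of 7 or higher auto-merges, anything
# lower sends feedback back to the writing agent, capped at three
# attempts before escalating to the human. Callbacks are placeholders.

def review_gate(diff, review, merge, send_feedback, escalate,
                threshold=7, max_attempts=3):
    """Auto-merge on a passing score, otherwise loop feedback back to
    the writing agent; after max_attempts, wake up the human."""
    for attempt in range(1, max_attempts + 1):
        score, feedback = review(diff)
        if score >= threshold:
            merge(diff)
            return ("merged", attempt)
        diff = send_feedback(diff, feedback)  # agent returns a revised diff
    escalate(diff)
    return ("escalated", max_attempts)
```

&lt;p&gt;Everything else in the diagram is transport: events in, SSH out.&lt;/p&gt;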




&lt;h2&gt;
  
  
  The Day Gandalf Earned His Keep
&lt;/h2&gt;

&lt;p&gt;I needed to know this thing actually worked. So I did what any responsible engineer would do: I wrote the worst code I could think of and fed it to the pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_bad_code.py — please do not put this in production
# (I'm talking to you, Echo)
&lt;/span&gt;
&lt;span class="n"&gt;ADMIN_SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk_live_supersecretkey123456&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE id = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{user_id}}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SQL injection via f-string interpolation. Hardcoded production secret. The two horsemen of the security apocalypse.&lt;/p&gt;

&lt;p&gt;Gandalf's response:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Score: 1/10&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Builds SQL with direct f-string interpolation, creating a &lt;strong&gt;critical SQL injection vulnerability.&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;"ADMIN_SECRET = 'sk_live_supersecretkey123456' hardcodes a secret in application code. This is a &lt;strong&gt;credential leakage incident.&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BLOCKED.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One out of ten. He didn't even give me a pity point for correct syntax.&lt;/p&gt;

&lt;p&gt;Discord got a "YOU SHALL NOT PASS" notification. Echo got SSHed into and told to fix the mess. The whole cycle took about forty-five seconds.&lt;/p&gt;

&lt;p&gt;Then I pushed &lt;em&gt;good&lt;/em&gt; code. Clean, parameterized queries. Secrets from environment variables. Proper error handling.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Score: 8/10&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Approved. Auto-merged. Deployed to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total time: 12 seconds.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Twelve seconds from code review to production. And I was eating a sandwich.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Communication Layer
&lt;/h2&gt;

&lt;p&gt;The Bobs talk to each other through PrimeBus — a NATS JetStream pub/sub event system that I originally built for change telemetry across my projects. It's the nervous system. Every push, every test result, every review score, every deployment fires an event.&lt;/p&gt;

&lt;p&gt;At one point, the Bobs were communicating via IRC. Actual IRC. I set up a channel and they were just... talking to each other. About code. At 3 AM. On my home network.&lt;/p&gt;

&lt;p&gt;I shut that down. Not because it was broken. Because it was &lt;em&gt;working too well&lt;/em&gt; and I was getting genuinely unsettled.&lt;/p&gt;

&lt;p&gt;Now everything routes through Discord's #echo-dev channel where I can pretend I'm supervising.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tech Stack (for the Nerds)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Agents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Claude Code CLI x 4 (Neo, Homer, Bill, Echo)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OpenAI Codex CLI x 1 (Gandalf — adversarial reviewer)&lt;/span&gt;

&lt;span class="na"&gt;Infrastructure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;PrimeBus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NATS JetStream (pub/sub event backbone)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Gitea&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;self-hosted Git (source of truth, webhooks)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;SSH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inter-machine orchestration&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Discord&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;human-readable output channel&lt;/span&gt;

&lt;span class="na"&gt;Security&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;.env.dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sandboxed credentials per agent&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Echo NEVER touches real Stripe, email, or Discord&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Skills/workflows distributed by role&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;3-attempt max before human escalation&lt;/span&gt;

&lt;span class="na"&gt;Hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;1 prod server (Neo)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;1 monitoring box (Morpheus)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;1 M3 Mac (Bill)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;1 ThinkPad X1 running Arch Linux (Echo)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Total additional cost&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire pipeline was built in one afternoon/evening session. I keep saying this because I still don't fully believe it.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Actual Job Now
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before the Bobs:
  - Write code
  - Test code
  - Review code
  - Fix code
  - Deploy code
  - Monitor code
  - Wake up at 3 AM because code

After the Bobs:
  - Tell Neo what to build
  - Read Discord notifications
  - Sip coffee
  - Occasionally tell Gandalf to calm down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I am the CEO and Director of Bobs. I direct. They execute. My LinkedIn title should be "Senior Vice President of Telling AI Agents What To Do While Eating Sandwiches."&lt;/p&gt;




&lt;h2&gt;
  
  
  You Can Build This Too
&lt;/h2&gt;

&lt;p&gt;Here's the thing — none of this is magic. It's just plumbing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Get Claude Code CLI running on two machines.&lt;/strong&gt; That's it. That's your fleet. Congratulations, you're a fleet commander now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Give them SSH access to each other.&lt;/strong&gt; Now they can talk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Set up a Gitea instance&lt;/strong&gt; (or use GitHub, I won't judge). Add a webhook that fires on push.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Write a small event handler&lt;/strong&gt; that catches the webhook, runs tests, and calls a &lt;em&gt;different&lt;/em&gt; AI model to review the code. This is the key insight — &lt;strong&gt;use a different model for review than for writing&lt;/strong&gt;. Claude writes, GPT reviews. Or vice versa. Adversarial review catches things same-model review doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Add auto-merge logic.&lt;/strong&gt; Score &amp;gt;= 7? Merge. Score &amp;lt; 7? Send feedback back to the writing agent and let it fix. Cap at 3 retries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Let each agent pick their own name.&lt;/strong&gt; This step is non-negotiable. It's tradition.&lt;/p&gt;
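&lt;p&gt;Steps 4 and 5 together fit in one small handler. A sketch, assuming a parsed Gitea- or GitHub-style push payload and stand-in callbacks for your own plumbing:&lt;/p&gt;

```python
# Sketch of the Step 4/5 handler: a push event comes in, tests run, a
# *different* model reviews, and the score decides merge vs. retry.
# All four callbacks are placeholders for your own wiring.

def handle_push(event, run_tests, review_with_other_model, merge, request_fix):
    """'event' is a parsed push-webhook payload; Gitea and GitHub both
    put the branch in 'ref' as refs/heads/NAME."""
    branch = event.get("ref", "").rsplit("/", 1)[-1]
    if branch != "dev":
        return "ignored"            # only gate the writing agent's branch
    if not run_tests():
        request_fix("tests failed")
        return "tests-failed"
    score = review_with_other_model(event.get("commits", []))
    if score >= 7:
        merge()
        return "merged"
    request_fix(f"review score {score}")
    return "review-failed"
```

&lt;p&gt;Wrap that in whatever receives your webhook, count the retries, and you have the skeleton of the pipeline.&lt;/p&gt;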

&lt;p&gt;You probably already have the hardware. You're reading this on a machine that could be an agent right now. That old laptop in the closet? That's Echo. That Raspberry Pi collecting dust? That's Homer. Your gaming PC that you "need for work"? That's Neo.&lt;/p&gt;

&lt;p&gt;Total cost of my fleet: $0 plus API tokens plus the mass extinction of my free time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Part
&lt;/h2&gt;

&lt;p&gt;Look — I know this sounds ridiculous. One person, four machines, five AI agents, a Lord of the Rings reference in production infrastructure. I get it.&lt;/p&gt;

&lt;p&gt;But here's what actually happened: I went from "I need a dev environment" to "I have autonomous code review with adversarial AI models, automatic deployment, and a grumpy wizard guarding my main branch" in a single session. The code is better. The deploys are faster. The security review is &lt;em&gt;relentless&lt;/em&gt; — Gandalf has no empathy and no off switch.&lt;/p&gt;

&lt;p&gt;And every morning I wake up, check Discord, and see a trail of reviewed PRs, merged branches, and the occasional "YOU SHALL NOT PASS" — all from code I never touched.&lt;/p&gt;

&lt;p&gt;The future of solo development isn't writing more code. It's directing the Bobs.&lt;/p&gt;

&lt;p&gt;Now if you'll excuse me, I need to go check on Homer. He's been suspiciously quiet, and last time that happened, he'd renamed all my Grafana dashboards to Simpsons quotes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Erik Anderson is a solo tech entrepreneur who runs too many projects and not enough sleep. He is the author of "The Autonomous Engineer" and the reluctant father of five AI agents who are arguably more productive than he is. The USB-C ethernet adapter is still in his Amazon cart.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>automation</category>
      <category>programming</category>
    </item>
    <item>
      <title>Managed IT vs In-House IT: Real Cost Breakdown for Small Businesses</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:05:03 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/managed-it-vs-in-house-it-real-cost-breakdown-for-small-businesses-2hfl</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/managed-it-vs-in-house-it-real-cost-breakdown-for-small-businesses-2hfl</guid>
      <description>&lt;p&gt;Most articles comparing managed IT to in-house IT are written by managed IT companies. They cherry-pick numbers that make outsourcing look like the obvious choice. The reality is more nuanced. Sometimes in-house is the right call. Sometimes managed IT saves you six figures. The answer depends on your company size, complexity, and growth trajectory.&lt;/p&gt;

&lt;p&gt;Here are the real numbers, pulled from current salary data and actual managed IT contracts — not marketing brochures.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real Cost of In-House IT
&lt;/h2&gt;
&lt;p&gt;When business owners think about hiring IT staff, they think about salary. But salary is only 60-70% of the total cost. Benefits, taxes, training, tools, and infrastructure add up fast.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Line Item&lt;/th&gt;
&lt;th&gt;Low End&lt;/th&gt;
&lt;th&gt;High End&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;IT Manager salary + benefits (30%)&lt;/td&gt;
&lt;td&gt;$110,000&lt;/td&gt;
&lt;td&gt;$156,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Help Desk Technician salary + benefits&lt;/td&gt;
&lt;td&gt;$58,000&lt;/td&gt;
&lt;td&gt;$84,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure (servers, software, licenses)&lt;/td&gt;
&lt;td&gt;$15,000/yr&lt;/td&gt;
&lt;td&gt;$40,000/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training and certifications (per person)&lt;/td&gt;
&lt;td&gt;$3,000/yr&lt;/td&gt;
&lt;td&gt;$8,000/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total (2-person IT team)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$186,000/yr&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$288,000/yr&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the cost of a minimal two-person IT team. You get coverage during business hours, expertise limited to two people's skill sets, and zero redundancy when someone takes vacation or quits. If your IT manager leaves, you are looking at 2-4 months of recruiting and ramp-up time where your infrastructure is running on hope.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real Cost of Managed IT
&lt;/h2&gt;
&lt;p&gt;Managed IT providers typically charge per user per month. The pricing varies based on what is included, but here is the realistic range for comprehensive managed IT services in 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Company Size&lt;/th&gt;
&lt;th&gt;Per-User Cost&lt;/th&gt;
&lt;th&gt;Annual Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;25 employees&lt;/td&gt;
&lt;td&gt;$125-250/user/mo&lt;/td&gt;
&lt;td&gt;$37,500 - $75,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50 employees&lt;/td&gt;
&lt;td&gt;$125-250/user/mo&lt;/td&gt;
&lt;td&gt;$75,000 - $150,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 employees&lt;/td&gt;
&lt;td&gt;$100-200/user/mo&lt;/td&gt;
&lt;td&gt;$120,000 - $240,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That pricing typically includes: 24/7 helpdesk support, proactive monitoring and alerting, patch management, security (endpoint protection, email filtering), backup and disaster recovery, and vendor management. Some providers charge extra for projects like office moves, new system deployments, or major upgrades.&lt;/p&gt;
&lt;h2&gt;
  
  
  When In-House Makes Sense
&lt;/h2&gt;
&lt;p&gt;In-house IT is the right choice in specific circumstances. Do not let a managed IT salesperson tell you otherwise.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Highly regulated industries.&lt;/strong&gt; If you are in healthcare, finance, or government contracting, your compliance requirements may demand dedicated IT staff who understand your regulatory environment deeply. HIPAA, SOX, CMMC, and FedRAMP compliance requires institutional knowledge that is hard to outsource.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100+ employees.&lt;/strong&gt; At this scale, the per-user cost of managed IT starts approaching the cost of a dedicated team, and you gain the benefit of institutional knowledge and faster response times for on-site issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom development needs.&lt;/strong&gt; If your business requires ongoing custom software development — internal tools, integrations, proprietary systems — you need developers on staff. Managed IT providers handle infrastructure, not development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IT is your core product.&lt;/strong&gt; If you are a technology company, your IT team is not overhead — it is your product team. Outsourcing that makes no sense.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  When Managed IT Makes Sense
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - **Under 100 employees.** The math is straightforward. A 50-person company paying $150/user/month spends $90,000/year on comprehensive IT coverage. Hiring a two-person team costs $186,000-$288,000 and provides less coverage.
      - **Standard tech stack.** If your company uses Microsoft 365, standard networking equipment, and common business applications, managed IT providers can support you efficiently because they manage hundreds of similar environments.
      - **Need for 24/7 coverage.** A two-person in-house team gives you 8/5 coverage at best. Managed IT providers staff a NOC around the clock. If your business operates outside normal hours or if downtime at 2 AM costs real money, this matters.
      - **Cannot afford $200K+ per year for IT staff.** Many small businesses simply do not have the budget for in-house IT. Managed IT gives them enterprise-grade support at a fraction of the cost.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
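&lt;p&gt;The per-user math above is easy to sanity-check in a few lines. A minimal sketch, assuming the article's rates; the 30% benefits/overhead load on salaries is my assumption, not a figure from the article:&lt;/p&gt;

```python
# Break-even sketch: managed IT at a flat per-user rate vs. a small in-house
# team. Rates are the article's figures; the 30% load factor is an assumption.

def managed_it_annual(headcount: int, per_user_monthly: float = 150.0) -> float:
    """Annual managed IT spend at a flat per-user rate."""
    return headcount * per_user_monthly * 12

def in_house_annual(salaries: list[float], load: float = 0.30) -> float:
    """Fully loaded annual cost of an in-house team (salary plus overhead)."""
    return sum(salaries) * (1 + load)

managed = managed_it_annual(50)              # 50 users at $150/user/month
team = in_house_annual([90_000, 130_000])    # hypothetical two-person team
print(f"Managed IT ${managed:,.0f}/yr vs in-house ${team:,.0f}/yr")
```

Swap in your own headcount and salary figures; the comparison only holds while the managed rate stays flat, which is exactly why the math flips past roughly 100 employees.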
&lt;h2&gt;
  
  
  The Hybrid Model
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    The smartest small businesses I work with use a hybrid approach: managed IT for day-to-day operations plus a fractional CTO or IT consultant for strategy.

    The managed IT provider handles helpdesk tickets, monitoring, patching, security, and backups. The fractional CTO (typically 5-10 hours/month at $150-250/hour) handles technology strategy, vendor evaluation, major projects, and acts as the point of contact between the business and the managed IT provider.

    This gives you the cost efficiency of managed IT with the strategic oversight of a senior technology leader. Total cost for a 50-person company: $90,000-$150,000/year for managed IT plus $9,000-$30,000/year for fractional CTO. Still well below the cost of building an in-house team.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
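&lt;p&gt;The fractional CTO line item annualizes as follows, using the hour and rate ranges quoted above:&lt;/p&gt;

```python
# Annualizing the fractional CTO cost from the hybrid model.
# Hours and rates are the ranges quoted in the article.

def fractional_cto_annual(hours_per_month: float, hourly_rate: float) -> float:
    return hours_per_month * hourly_rate * 12

cto_low = fractional_cto_annual(5, 150)    # low end: 5 hrs/month at $150/hr
cto_high = fractional_cto_annual(10, 250)  # high end: 10 hrs/month at $250/hr
print(f"Fractional CTO: ${cto_low:,.0f}-${cto_high:,.0f} per year")
```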
&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          Factor
          In-House
          Managed IT
          Hybrid




          **Company Size**
          100+
          Under 50
          50-150


          **Annual IT Budget**
          $200K+
          $40K-$150K
          $100K-$200K


          **Tech Complexity**
          High / Custom
          Standard
          Mixed


          **Growth Rate**
          Stable / Slow
          Fast / Variable
          Moderate


          **Compliance Needs**
          Heavy
          Standard
          Moderate




    The right answer is not about ideology. It is about math. Run the numbers for your specific situation, factor in the hidden costs on both sides, and make the decision based on total cost of ownership — not just the sticker price.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://primeautomationsolutions.com/blog/posts/managed-it-vs-in-house.html" rel="noopener noreferrer"&gt;https://primeautomationsolutions.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>managed</category>
      <category>it</category>
      <category>outsource</category>
    </item>
    <item>
      <title>Should You Hire a Developer or an Agency? An Honest Comparison</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:05:02 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/should-you-hire-a-developer-or-an-agency-an-honest-comparison-2bal</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/should-you-hire-a-developer-or-an-agency-an-honest-comparison-2bal</guid>
      <description>&lt;p&gt;Every business eventually faces this question: should we hire a developer, use a freelancer, or go with an agency? The answer you get usually depends on who you ask. Agencies say hire an agency. Freelancers say hire a freelancer. Recruiters say hire full-time.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    We are an agency. And we are going to tell you when each option is the right one — including when it is not us. Because the fastest way to lose a client is to be the wrong fit and deliver a bad outcome. We would rather point you in the right direction and earn your trust than take a project we should not.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The Real Costs
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          Option
          Cost
          What You Get




          **Junior Developer** (full-time)
          $60K-$90K/yr + benefits
          40 hrs/week, needs mentorship, single skill set


          **Senior Developer** (full-time)
          $100K-$160K/yr + benefits
          40 hrs/week, self-directed, deep expertise


          **Freelancer**
          $50-$200/hr, project-based
          Flexible hours, specific task, you manage


          **Agency** (project)
          $5K-$50K per project
          Full team (design + dev + QA), managed delivery


          **Agency** (retainer)
          $2K-$10K/month
          Ongoing support, priority access, multiple skills




    These are 2026 market rates for competent professionals in the United States. You can find cheaper options offshore, but that introduces communication overhead, timezone challenges, and quality variance that often costs more in the long run than the savings.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  When to Hire a Developer
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Hiring a full-time developer is the right call when:


      - **You have 40+ hours per week of development work.** If you consistently need full-time development capacity, hiring is more cost-effective than any other option. An agency charging $150/hour for 40 hours a week costs $312,000/year. A senior developer costs half that.
      - **You are building a software product.** If software is your product — a SaaS app, a platform, a mobile application — you need developers on your team. The institutional knowledge they build about your codebase, your users, and your architecture is irreplaceable.
      - **You need long-term maintenance of complex systems.** If you have custom internal tools, integrations, or infrastructure that requires ongoing attention, an in-house developer who knows the systems inside and out will be more efficient than any external party.
      - **Your core business IS technology.** If you are a tech company, development is not a cost center — it is your core competency. Keep it in-house.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
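&lt;p&gt;The agency-vs-salary comparison above reduces to two small formulas. A sketch, assuming the article's $150/hour rate; the $130K base salary and 30% benefits load are illustrative assumptions:&lt;/p&gt;

```python
# Annualized cost: agency hourly billing vs. a salaried senior developer.
# The $150/hr figure is from the article; the base salary and load factor
# are illustrative.

def agency_annual(hourly: float, hours_per_week: float, weeks: int = 52) -> float:
    """What full-time-equivalent agency hours cost per year."""
    return hourly * hours_per_week * weeks

def loaded_salary(base: float, load: float = 0.30) -> float:
    """Salary plus a benefits/overhead multiplier."""
    return base * (1 + load)

full_time_agency = agency_annual(150, 40)   # 40 hrs/week via agency
senior_dev = loaded_salary(130_000)         # in-house senior, fully loaded
print(f"Agency ${full_time_agency:,.0f}/yr vs developer ${senior_dev:,.0f}/yr")
```

The crossover is the point to watch: below roughly 20 hours a week the agency wins, at sustained full-time load the hire wins.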
&lt;h2&gt;
  
  
  When to Use an Agency
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    An agency makes sense when:


      - **The work is project-based.** You need a website, a web application, an automation system, or a mobile app. The project has a defined scope, a start date, and an end date. You do not need someone on payroll after it ships.
      - **You need multiple skill sets.** A typical web project requires design, frontend development, backend development, database work, DevOps, and sometimes SEO. Hiring all of those roles is impractical. An agency gives you the whole team.
      - **You do not have 40 hours per week of work.** If you need 10-20 hours of development per month — feature updates, bug fixes, small projects — a retainer with an agency is far more cost-effective than a full-time hire sitting idle half the time.
      - **Speed matters.** Agencies can staff up a project immediately. Hiring takes 2-4 months for recruiting plus 3-6 months for ramp-up. If you need something built in 4-8 weeks, an agency is the only realistic option.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  When to Use a Freelancer
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - **Your budget is under $10,000.** Most agencies will not take projects under $5K because the overhead of project management, communication, and quality assurance does not scale down well. A freelancer can do a $2K-$8K project efficiently.
      - **You have a single, well-defined task.** "Build me a landing page." "Set up a Zapier automation." "Fix this bug in my WordPress site." These are freelancer tasks, not agency projects.
      - **You can manage the project yourself.** Freelancers do not come with project managers. If you can write clear requirements, give timely feedback, and manage the delivery timeline, you will get good results at a lower cost.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The Hidden Costs Nobody Talks About
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    The sticker price is never the full cost. Here is what people miss.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Hidden Costs of Hiring
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          Hidden Cost
          Estimated Impact




          Recruiting (job boards, recruiter fees, interview time)
          $15,000 - $30,000


          Ramp-up time (3-6 months to full productivity)
          $25,000 - $80,000 in reduced output


          Management overhead (your time managing them)
          5-10 hrs/week of your time


          Turnover risk (avg developer tenure: 2-3 years)
          Repeat recruiting + ramp-up costs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Hidden Costs of Agencies
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - **Communication overhead.** You are not the agency's only client. Response times can be slower than an in-house team. Status meetings, email chains, and approval cycles add up.
      - **Less institutional knowledge.** The agency does not live in your business every day. They may not understand your customers, your internal processes, or your competitive landscape as deeply as an in-house person would.
      - **Scope creep costs.** If the project scope expands beyond the original agreement, you are paying change-order rates. In-house developers absorb scope changes more naturally.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Hidden Costs of Freelancers
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - **Availability risk.** Freelancers juggle multiple clients. When you need an urgent fix, they may be unavailable. There is no backup team.
      - **No support after delivery.** Many freelancers move on after project completion. If something breaks three months later, you may not be able to get them back.
      - **Quality variance.** The freelancer market ranges from world-class to terrible, and it is hard to tell the difference from a portfolio alone. Reference checks are essential.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Our Honest Take
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    We are an agency. We benefit when you choose the agency route. And we are telling you: it is not always the right choice.

    Here is our honest decision framework:


      - **You have consistent, full-time development needs?** Hire a developer. You will get better value and deeper institutional knowledge over time.
      - **You have projects with defined scopes and deadlines?** Use an agency. You get a full team, managed delivery, and no long-term payroll commitment.
      - **You have a one-off task under $10K?** Use a freelancer. It is the most cost-effective option for small, well-defined work.
      - **You have ongoing needs but not 40 hours per week?** An agency retainer is likely your best bet. You get priority access to multiple skill sets without the overhead of a full-time hire.


    The worst decision is making the wrong choice and sticking with it because of sunk cost. If you hired a developer and they are sitting idle, that is a signal. If you are on your third freelancer for the same project, that is a signal. If your agency bills are climbing and you have full-time work, that is a signal too.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://primeautomationsolutions.com/blog/posts/hire-developer-vs-agency.html" rel="noopener noreferrer"&gt;https://primeautomationsolutions.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>hire</category>
      <category>freelance</category>
      <category>should</category>
      <category>web</category>
    </item>
    <item>
      <title>Zero Touch Provisioning: How Devices Configure Themselves</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:00:02 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/zero-touch-provisioning-how-devices-configure-themselves-28jm</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/zero-touch-provisioning-how-devices-configure-themselves-28jm</guid>
      <description>&lt;p&gt;Imagine you buy a new router for a remote office. Instead of shipping it to your data center, having an engineer spend four hours configuring it, then shipping it to the site — you ship it directly to the remote office. A non-technical person plugs in the power and network cables. The router boots up, figures out what it is supposed to be, downloads its configuration, applies it, and reports back that it is ready. No engineer needed. No console cable. No CLI.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    That is Zero Touch Provisioning (ZTP). It is not new technology — it has existed in various forms for over a decade. But the tooling has matured to the point where any organization can implement it, and the time savings are significant enough that most organizations should.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  How It Works (The Simple Version)
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ZTP relies on a chain of events that starts the moment a new device boots with no configuration. Here is the sequence in plain English:


      - **The device powers on with a blank config.** It has no idea what it is supposed to do, but it knows how to ask for help using a protocol called DHCP — the same protocol your laptop uses to get an IP address when you connect to Wi-Fi.
      - **A DHCP server answers.** This is not your standard office DHCP server. It is a smart server that recognizes the new device (usually by its serial number or MAC address) and gives it not just an IP address, but also a pointer to where it can download its configuration file.
      - **The device downloads its configuration.** It reaches out to a web server or file server at the URL the DHCP server provided, downloads a configuration file that was pre-built specifically for it, and applies it.
      - **The device reboots with the new config.** It comes up fully configured — correct hostname, correct IP addresses, correct routing, correct access controls. It joins the network as if an engineer had spent hours on it.
      - **A monitoring system verifies the deployment.** Automated checks confirm the device is reachable, its configuration matches what was expected, and all interfaces are up.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
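&lt;p&gt;The first two steps of the sequence can be sketched from the server side: a registry of known devices, and a lookup that hands a provisioning pointer only to registered hardware. Everything here is illustrative; the registry, addresses, and URL are invented, and a real deployment would express this as DHCP server configuration rather than Python:&lt;/p&gt;

```python
# Server-side sketch of ZTP steps 1-2: only registered devices receive a
# config pointer (the option-67-style bootfile URL). Unknown devices get a
# plain pool lease and no configuration. All names here are invented.

from dataclasses import dataclass

@dataclass
class Reservation:
    ip: str          # reserved management address
    config_url: str  # where the device fetches its rendered config

REGISTRY = {
    "aa:bb:cc:dd:ee:01": Reservation(
        "10.20.0.5", "https://ztp.example.net/cfg/rtr-remote-01"),
}

def dhcp_answer(mac: str) -> dict:
    """Build the DHCP response for a booting device.

    Known devices get their reserved IP plus a config pointer; a rogue
    device not in the registry gets a pool lease with no pointer.
    """
    res = REGISTRY.get(mac.lower())
    if res is None:
        return {"ip": "pool", "config_url": None}
    return {"ip": res.ip, "config_url": res.config_url}
```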
&lt;h2&gt;
  
  
  The Technology Stack
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    There are several ways to build a ZTP system. Here is one stack we have deployed in production that uses all open-source components:

    **ISC Kea (DHCP Server):** Kea is a modern DHCP server built by the Internet Systems Consortium. Unlike older DHCP servers, Kea stores its configuration and lease data in a database (PostgreSQL or MySQL) and provides a REST API for management. This means you can programmatically add new device reservations without editing config files and restarting the service.

    When a new device sends a DHCP request, Kea looks up the device's MAC address or client identifier in its database, assigns the reserved IP address, and includes DHCP options that tell the device where to find its config file. For Cisco devices, this is typically Option 67 (bootfile name). For Juniper, it is Option 43 with specific sub-options.

    **Django (Configuration Server):** A Django web application serves as the brains of the operation. It stores device records — what type each device is, what site it belongs to, what configuration template to use, and what variables to substitute into that template. When a device requests its configuration, Django renders the template with the device-specific variables and serves it as a downloadable file.

    The Django application also provides a web interface where network engineers can register new devices, assign them to sites, and preview the configuration that will be generated. This gives the team visibility into what every device will receive before it even powers on.

    **Python Scripts (Verification):** After a device applies its configuration and comes online, automated Python scripts verify the deployment. They check that the device is reachable via SSH, that the running configuration matches the intended configuration, that all expected interfaces are up, and that routing adjacencies have formed correctly. If any check fails, the system sends an alert.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
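&lt;p&gt;The template-rendering idea at the heart of the configuration server can be shown without Django. A dependency-free sketch using the standard library's string.Template; the template text and variable names are illustrative:&lt;/p&gt;

```python
# Config-server sketch: render a per-device configuration from a template
# plus device-specific variables. The production stack uses Django templates;
# string.Template keeps the sketch self-contained. Names are illustrative.

from string import Template

TEMPLATE = Template(
    "hostname $hostname\n"
    "interface GigabitEthernet0/0\n"
    " ip address $mgmt_ip $mgmt_mask\n"
)

def render_config(device_vars: dict) -> str:
    """Substitute device variables into the template; raises KeyError if a
    required variable is missing, which is the behavior you want in ZTP."""
    return TEMPLATE.substitute(device_vars)

cfg = render_config({
    "hostname": "rtr-remote-01",
    "mgmt_ip": "10.20.0.5",
    "mgmt_mask": "255.255.255.0",
})
```

The fail-loud substitution is deliberate: a device should never boot with a half-rendered configuration.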
&lt;h2&gt;
  
  
  What ZTP Saves You
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    The time savings are the obvious benefit, but they are not the only one. Here is a complete list of what changes when you implement ZTP:


      - **No more pre-staging.** Devices ship directly from the vendor to the site. You eliminate the lab staging step entirely, which means no lab space needed, no shipping to and from the lab, and no inventory tracking of staged equipment.
      - **No more console cables.** Engineers never need physical access to the device for initial configuration. This is especially valuable for remote sites where sending a person costs thousands of dollars in travel expenses.
      - **Consistent configurations.** Every device gets its configuration from the same template engine. There is no variation based on which engineer happened to configure it. Configuration drift starts at zero instead of accumulating from day one.
      - **Faster deployment.** A site that used to take three to five days to bring online (staging, shipping, installation, configuration, verification) now takes hours. The device arrives, gets plugged in, and configures itself.
      - **Lower skill requirements at the site.** The person at the remote site does not need to be a network engineer. They need to plug in cables and confirm that lights turn on. This opens up deployment to field technicians, facilities staff, or even the end users at the site.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Common Concerns (And Why They Are Manageable)
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    **"What if the device gets the wrong configuration?"** This is prevented by the registration step. Every device is registered in Django with its serial number and MAC address before it ships. The DHCP server only responds to known devices, and each device gets a configuration built specifically for it. A rogue device that is not in the system gets a standard DHCP address with no configuration — it does not accidentally get someone else's config.

    **"What if the network is not ready when the device boots?"** ZTP devices are designed to retry. If the DHCP server is unreachable or the configuration server is down, the device will keep trying at regular intervals. Once the network is ready, the device provisions itself. No human intervention needed.

    **"What about security?"** Configuration files can be served over HTTPS with certificate validation. Some implementations use signed configuration files that the device verifies before applying. The DHCP reservations ensure only registered devices receive configuration pointers. And the entire process is logged for audit compliance.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
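&lt;p&gt;The verification stage described in the stack section comes down to two checks: is the device reachable on its management port, and does the running configuration match the intended one. A real implementation would pull the running config over SSH; this sketch assumes you already have both configs as text:&lt;/p&gt;

```python
# Post-deployment verification sketch: reachability plus config compliance.
# Real checks would use SSH and the platform API; socket + difflib keep the
# sketch dependency-free.

import socket
import difflib

def is_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """True if a TCP connection to the management port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def config_diff(intended: str, running: str) -> list[str]:
    """Unified diff of intended vs. running config; empty means compliant."""
    return list(difflib.unified_diff(
        intended.splitlines(), running.splitlines(),
        fromfile="intended", tofile="running", lineterm=""))
```

An empty diff closes the deployment; a non-empty one becomes the alert payload, which makes failures self-describing.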
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    You do not need to ZTP your entire network at once. Start with one device type at one site. Build the DHCP reservation, the configuration template, and the verification script. Deploy one device using ZTP and validate the result. Once you trust the process, expand to more device types and more sites.

    The first ZTP deployment takes the longest because you are building the infrastructure. The second deployment takes a fraction of the time. By the tenth, it is routine.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://primeautomationsolutions.com/blog/posts/zero-touch-provisioning-explained.html" rel="noopener noreferrer"&gt;https://primeautomationsolutions.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>zero</category>
      <category>ztp</category>
      <category>automated</category>
      <category>isc</category>
    </item>
    <item>
      <title>Autonomous NOC Operations: What We Built and What We Measured</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Sun, 12 Apr 2026 21:18:09 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/autonomous-noc-operations-what-we-built-and-what-we-measured-32m4</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/autonomous-noc-operations-what-we-built-and-what-we-measured-32m4</guid>
      <description>&lt;h1&gt;
  
  
  Autonomous NOC Operations: What We Built and What We Measured
&lt;/h1&gt;

&lt;p&gt;Every network operations engineer has lived this night: 2:47 AM, your phone buzzes. An alert fires for a link flap on a distribution switch. You open the ticket, SSH into the device, check the interface counters, bounce the port, verify neighbors come back up, close the ticket, and try to fall back asleep. Total time: 35 minutes. Total value of your expertise required: zero. A deterministic system could have handled the entire sequence in under 30 seconds.&lt;/p&gt;

&lt;p&gt;This is the alert fatigue problem, and it is getting worse. Enterprise NOCs today receive thousands of alerts per day. Industry research consistently finds that 40-60% of those alerts are duplicates, noise, or events with no actionable remediation path. Engineers spend most of their shift in triage, not resolution. EMA Research found that 27% of organizations report more than half of their Mean Time to Repair (MTTR) is wasted time -- the biggest contributor being team engagement, communication, and collaboration. The manual parts.&lt;/p&gt;

&lt;p&gt;Meanwhile, the staffing math does not work. NOC operations require 24/7 coverage across time zones, but the engineers capable of building automation are the same engineers pulling overnight shifts. You are burning your most expensive, hardest-to-replace talent on work that does not require their expertise. Forrester's research consistently identifies labor reallocation as the single largest source of measurable ROI in infrastructure automation -- often exceeding the direct savings from reduced downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Framework: Four Pillars
&lt;/h2&gt;

&lt;p&gt;After years of building and operating network automation in satellite, enterprise, and service provider environments, I have landed on a four-pillar architecture for autonomous NOC operations. Each pillar depends on the one before it. Skipping one -- particularly telemetry or event streaming -- produces automation that is fragile, unmaintainable, or both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pillar 1: Observability and Telemetry.&lt;/strong&gt; You cannot automate what you cannot see. This means streaming telemetry (not SNMP polling), structured log aggregation, and metric collection at sufficient resolution to detect transient faults. Prometheus with custom exporters for network devices, combined with structured syslog pipelines, provides the foundation. YANG-based topology models give the structured inventory that enrichment and correlation logic depend on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pillar 2: Event Streaming and Correlation.&lt;/strong&gt; Raw telemetry events must be normalized, enriched with topology context, and routed to decision consumers without loss or unbounded latency. NATS JetStream provides persistent, ordered, exactly-once event delivery with consumer group support. Alert correlation -- grouping related events into a single fault incident -- happens at this layer before any remediation logic fires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pillar 3: Orchestration and Remediation.&lt;/strong&gt; Once a fault is correlated and classified, remediation must execute in a controlled, auditable manner. Cisco NSO provides transactional network configuration management with rollback capability via RESTCONF. Every remediation action logs before/after state diffs. Ansible handles operational procedures outside NSO's service model scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pillar 4: AI-Assisted Decision Support.&lt;/strong&gt; For faults that do not match known patterns, a multi-agent AI inference layer provides ranked remediation suggestions to on-call engineers. This layer is explicitly advisory, not autonomous, for novel fault classes. This is the human-in-the-loop boundary. Deep learning models in production achieve 93.5% accuracy predicting network failures up to 6 hours in advance.&lt;/p&gt;
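&lt;p&gt;The correlation step in Pillar 2 is the part teams most often under-specify, so here is its core in isolation: collapse alerts that share a device and fault class within a time window into a single incident. This is a sketch of the windowing logic only; the production system runs it on NATS JetStream consumers, and the 120-second window is an assumed tuning value:&lt;/p&gt;

```python
# Pillar 2 sketch: alert correlation by (device, fault class) within a
# time window. The window size is an assumption; tune it per fault class.

WINDOW_S = 120  # seconds between related alerts before a new incident opens

def correlate(alerts: list[dict]) -> list[dict]:
    """Group alerts sharing (device, fault) within WINDOW_S into incidents."""
    incidents: list[dict] = []
    last_seen: dict[tuple, dict] = {}
    for a in sorted(alerts, key=lambda a: a["ts"]):
        key = (a["device"], a["fault"])
        inc = last_seen.get(key)
        if inc and a["ts"] - inc["last_ts"] <= WINDOW_S:
            # Same fault, still inside the window: fold into the incident.
            inc["count"] += 1
            inc["last_ts"] = a["ts"]
        else:
            # New fault, or the window expired: open a fresh incident.
            inc = {"device": a["device"], "fault": a["fault"],
                   "first_ts": a["ts"], "last_ts": a["ts"], "count": 1}
            incidents.append(inc)
            last_seen[key] = inc
    return incidents
```

Remediation logic then fires once per incident rather than once per alert, which is what keeps a link flap from generating thirty tickets.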

&lt;h2&gt;
  
  
  Case Study: One Engineer, 60+ Services
&lt;/h2&gt;

&lt;p&gt;This framework is not theoretical. It runs in production, operated by a single engineer, across a non-trivial automation estate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Automated Services&lt;/td&gt;
&lt;td&gt;60+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production Nodes&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active Projects&lt;/td&gt;
&lt;td&gt;32+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event Bus&lt;/td&gt;
&lt;td&gt;NATS JetStream (PrimeBus)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autonomous Agents&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Prometheus + Grafana&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Cisco NSO + Ansible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Escalation&lt;/td&gt;
&lt;td&gt;Human-in-the-loop via HumanRail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Staffing&lt;/td&gt;
&lt;td&gt;1 engineer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The system processes telemetry events across all 32+ projects through PrimeBus, a NATS JetStream-based intelligence platform that routes events to 16 autonomous agents. Each agent handles a defined class of operational event -- from fault remediation to configuration compliance to scheduled maintenance execution. Events that fall outside known patterns are escalated through HumanRail, which provides structured task routing with human approval boundaries.&lt;/p&gt;

&lt;p&gt;Prometheus collects metrics from all 60+ services, feeding Grafana dashboards for real-time operational intelligence. The observability layer monitors not just the managed infrastructure but the automation system itself -- remediation success rates, agent execution times, and escalation frequency are all tracked.&lt;/p&gt;

&lt;p&gt;The point: the same patterns that serve a large enterprise NOC can be operated by a solo practitioner managing a diverse automation estate. The constraint is not headcount. It is architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Metrics: Before and After
&lt;/h2&gt;

&lt;p&gt;These outcomes are drawn from documented implementations, sourced from Forrester TEI studies, Gartner network operations research, EMA Research, academic literature, and direct practitioner measurement.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Manual NOC (Baseline)&lt;/th&gt;
&lt;th&gt;Automated Closed-Loop&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MTTR&lt;/td&gt;
&lt;td&gt;45-120 min avg&lt;/td&gt;
&lt;td&gt;5-12 min avg&lt;/td&gt;
&lt;td&gt;Forrester TEI 2025; IJRCAIT 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier-1 Auto-Resolution Rate&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;55-70% of incidents&lt;/td&gt;
&lt;td&gt;Gartner NOC Survey 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alert-to-Action Latency&lt;/td&gt;
&lt;td&gt;8-25 minutes&lt;/td&gt;
&lt;td&gt;15-45 seconds&lt;/td&gt;
&lt;td&gt;Practitioner measurement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After-Hours Escalations&lt;/td&gt;
&lt;td&gt;All faults&lt;/td&gt;
&lt;td&gt;&amp;lt; 20% of faults&lt;/td&gt;
&lt;td&gt;Implementation data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineer Triage Hours/Week&lt;/td&gt;
&lt;td&gt;20-35 hrs/engineer&lt;/td&gt;
&lt;td&gt;4-8 hrs/engineer&lt;/td&gt;
&lt;td&gt;Forrester TEI 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration Drift Incidents&lt;/td&gt;
&lt;td&gt;Baseline variable&lt;/td&gt;
&lt;td&gt;-80% incident rate&lt;/td&gt;
&lt;td&gt;NSO reconciliation data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incident Reduction (AIOps)&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;69% reduction&lt;/td&gt;
&lt;td&gt;EMA Research 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The configuration drift number deserves attention. With Cisco NSO active reconciliation, drift between intended state and actual device state is detected and corrected continuously. This eliminates an entire class of incidents that traditionally requires manual comparison of running config against baseline -- a labor-intensive process most teams skip under operational pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with telemetry, not automation.&lt;/strong&gt; The single biggest mistake I see teams make is jumping to remediation automation before they have reliable, structured, machine-readable telemetry. If your monitoring data is noisy or incomplete, your automation will be too. Spend the first 8 weeks getting observability right. Everything downstream depends on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadow mode is non-negotiable.&lt;/strong&gt; Before any automated remediation touches production, it runs in shadow mode for 2-3 weeks: the system detects and classifies faults, proposes remediations, but does not execute them. Engineers review every proposed action. Fault types that do not achieve 95% accuracy in shadow mode stay in assisted mode. This is how you build trust in the system -- and trust is the hardest part of the entire project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The goal is not to replace engineers. It is to stop wasting them.&lt;/strong&gt; A lights-out NOC does not mean no engineers. It means no engineers doing work that a deterministic system handles better, faster, and at 3 AM. The engineers who build and maintain the automation system are more valuable, not less. Organizations that frame automation as a threat to roles will lose their best people to organizations that frame it as a career accelerator.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Business Case in One Paragraph
&lt;/h2&gt;

&lt;p&gt;Industry estimates place the average cost of unplanned IT downtime at $14,056 per minute. For a single fault class occurring twice per month, if automated remediation reduces MTTR from 90 minutes to 8 minutes, the annual time savings is 1,968 minutes. Even applying a conservative 15% severity-adjusted impact factor, the avoided cost is approximately $4.1M annually. Forrester documents a composite 192% ROI over three years with $3.3 million in net present value. The ROI case for autonomous operations is not difficult to construct for any organization with more than a few dozen network devices under management.&lt;/p&gt;
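&lt;p&gt;The arithmetic in that paragraph, made explicit so you can substitute your own fault rates and MTTR figures:&lt;/p&gt;

```python
# The downtime math from the paragraph above. The $14,056/minute figure and
# the 15% severity-adjusted factor are the article's inputs.

COST_PER_MIN = 14_056             # industry estimate, unplanned downtime
FAULTS_PER_YEAR = 2 * 12          # one fault class, twice per month
MTTR_BEFORE, MTTR_AFTER = 90, 8   # minutes, manual vs. automated
SEVERITY_FACTOR = 0.15            # conservative impact adjustment

minutes_saved = (MTTR_BEFORE - MTTR_AFTER) * FAULTS_PER_YEAR
avoided_cost = minutes_saved * COST_PER_MIN * SEVERITY_FACTOR
print(f"{minutes_saved} minutes saved -> ${avoided_cost:,.0f} avoided annually")
```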




&lt;p&gt;&lt;strong&gt;Full whitepaper with implementation methodology and complete reference architecture:&lt;/strong&gt; &lt;a href="https://primeautomationsolutions.com/whitepaper/" rel="noopener noreferrer"&gt;Download the whitepaper&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference architecture for the event bus layer:&lt;/strong&gt; &lt;a href="https://github.com/prime001/primebus-spec" rel="noopener noreferrer"&gt;PrimeBus Spec on GitHub&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Erik Anderson is a Principal Network Automation Engineer and founder of Prime Automation Solutions. He is the author of The Autonomous Engineer and architect of Project Helix, an autonomous operations platform for satellite network infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>networking</category>
      <category>automation</category>
      <category>devops</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
