<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Erik anderson</title>
    <description>The latest articles on Forem by Erik anderson (@erik_anderson_c41dbafd423).</description>
    <link>https://forem.com/erik_anderson_c41dbafd423</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815528%2Fd0714a5b-5035-418e-9ef8-657789d5b264.jpg</url>
      <title>Forem: Erik anderson</title>
      <link>https://forem.com/erik_anderson_c41dbafd423</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/erik_anderson_c41dbafd423"/>
    <language>en</language>
    <item>
      <title>How I Built My Own LLM Gateway — Erik Anderson</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 04 May 2026 15:57:16 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/how-i-built-my-own-llm-gateway-erik-anderson-4gob</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/how-i-built-my-own-llm-gateway-erik-anderson-4gob</guid>
      <description>&lt;p&gt;How I Built My Own LLM Gateway — Erik Anderson&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Free Tools
Buy the Book
Blog
Contact
Get the Free Kit



Tech &amp;amp;amp; Automation
# How I Built My Own LLM Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;← Back to Blog&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;By Erik Anderson
Tag: Tech &amp;amp;amp; Automation
~9 min read
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I run sixty-plus active projects through one Claude account. Most of them call Claude on a schedule — a content pipeline at 5 a.m., a trading scanner every fifteen minutes, an email reactor whenever a client writes in, a podcast generator at 7. They don't coordinate with each other. They just fire when their cron tells them to.&lt;/p&gt;

&lt;p&gt;The result, predictably, is rate limits. Claude's 5-hour rolling window doesn't care that the contract scanner accidentally went into a retry loop at 3 a.m. and burned the budget the email reactor needed at 9. By the time a client email lands, the account is already saturated. The most important call of the day fails because the least important one already happened.&lt;/p&gt;

&lt;p&gt;I tried the obvious things. Cron schedules spread out. Per-project rate limiters in code. A spreadsheet that mapped which scripts ran when. None of it worked, because the problem isn't scheduling — it's that no single piece of software had a global view of the budget. Every script was making local decisions about a global resource.&lt;/p&gt;

&lt;p&gt;So I built one piece of software that does. It's called PrimeRouter. This post is what it does, why it's shaped the way it is, and what I'd tell someone thinking about building one.&lt;/p&gt;

&lt;h2&gt;The Shape of the Problem&lt;/h2&gt;

&lt;p&gt;Before the gateway, every service called Claude through a thin wrapper that did one thing: catch a 429, sleep, retry. That wrapper had four failure modes and they all bit me:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No global budget view. Each caller detected rate limits on its own. There was no single place that could say "the account is at 82% of the 5-hour window — stop sending non-critical traffic."
No priority. A speculative blog draft and an in-flight client email got the same shot at the budget. The cheap call won the race more often than the important one.
No provider diversity. When Claude rate-limited, everything just queued behind the wait. I had Codex, Ollama, and a local agentic backend running on a separate machine, and none of it picked up the slack.
No accounting. I had no idea which projects were burning the most tokens until something obviously broke. Post-incident debugging meant reading thirty service logs.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The fix had to live one layer deeper than any individual service. A gateway. Every /opt/ service calls it instead of claude -p directly. The gateway makes the routing, priority, and accounting decisions in one place.&lt;/p&gt;

&lt;h2&gt;13 Priority Tiers&lt;/h2&gt;

&lt;p&gt;The single most important design choice was that not all calls are equal, and the gateway has to know which is which. PrimeRouter has thirteen priority tiers, declared in a YAML file. They look something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;critical — client-facing work that has a human waiting on it (an email reactor handling a client request, a payment-flow webhook).
fixer_attempt_1 through fixer_attempt_3 — auto-fix retries on review-blocked branches. Each retry deliberately routes to a different provider so the model doesn't converge on the same wrong fix three times in a row.
review — code-review and analysis runs.
scheduled — routine pipelines that have a deadline (the morning ScanBrief, the daily podcast).
background — speculative or batchable work (a content draft, a metric refresh).
overnight — anything that can wait until 1 a.m. local time. The scheduler defers these explicitly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
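&lt;p&gt;A sketch of what such a tier file might look like — the tier names match the list above, but the ranks and provider chains here are illustrative, not PrimeRouter's actual config:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tiers:
  critical:        { rank: 0, providers: [claude] }
  fixer_attempt_1: { rank: 1, providers: [claude] }
  fixer_attempt_2: { rank: 1, providers: [codex] }
  fixer_attempt_3: { rank: 1, providers: [hermes] }
  review:          { rank: 2, providers: [claude, codex] }
  scheduled:       { rank: 3, providers: [claude] }
  background:      { rank: 4, providers: [claude, ollama] }
  overnight:       { rank: 5, providers: [claude], defer_until: "01:00" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;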

&lt;p&gt;When the global 5-hour budget thins, the gateway closes off the lower tiers first. Critical calls keep getting through. Background calls get a 503-style "try later." Overnight calls get pushed to the actual overnight drain.&lt;/p&gt;
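&lt;p&gt;The shedding logic itself is small once every tier carries a rank. A minimal sketch — the thresholds and rank numbers here are invented for illustration, not PrimeRouter's real values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import bisect

# Hypothetical shed points: the fraction of the 5-hour budget at which
# the lowest remaining band of tiers stops being admitted.
SHED_POINTS = [0.60, 0.80, 0.95]
TIER_RANK = {"critical": 0, "review": 1, "scheduled": 2,
             "background": 2, "overnight": 3}

def admit(tier, usage):
    """True if a call at this tier may run at this budget usage."""
    closed_bands = bisect.bisect_right(SHED_POINTS, usage)
    # Rank 3 closes first, then rank 2, then rank 1; rank 0 never closes.
    # A call is admitted while rank + closed_bands stays at most 3.
    return max(TIER_RANK[tier] + closed_bands, 3) == 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A refused call doesn't queue — it gets the 503-style "try later" so the caller can decide what to do.&lt;/p&gt;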

&lt;p&gt;This sounds obvious in retrospect. It wasn't obvious at the time. Most rate-limit advice you read on the internet treats every call as equally precious. Mine aren't. The blog post that publishes at 6 a.m. can survive being a few hours late. The email a paying customer sent at 10 a.m. cannot.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Most rate-limit advice treats every call as equally precious. Mine aren't. The blog post can wait. The customer email can't."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Multi-Provider Failover&lt;/h2&gt;

&lt;p&gt;The gateway speaks to four different backends today: Claude (the default for almost everything), Codex over SSH (for code generation when the local sandbox is healthy), Hermes (a local agentic backend running qwen3-coder on an M3 Mac with 70 tokens/sec throughput), and Ollama (small models for mechanical work).&lt;/p&gt;

&lt;p&gt;Each tier has a chain — the gateway tries one provider, and if that backend is unavailable or returns a known-bad signal, it fails over to the next. The "known-bad signal" detection took longer to get right than I expected. Codex sandboxes can return exit code zero with a sandbox-failure body in stdout, so a naïve check would record success on a call that actually never ran. The gateway parses for those bodies explicitly and treats them as failures.&lt;/p&gt;

&lt;p&gt;The provider diversity also matters because retries on the same model produce the same wrong answer. If a code-review run is blocked because Claude misread the diff, asking Claude again with a slightly different prompt usually doesn't help. Asking Codex or a local model often does. The fixer-attempt chain encodes this — three retries, three different providers, three different reasoning traces.&lt;/p&gt;
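&lt;p&gt;A stripped-down sketch of the chain walk — the provider names and failure markers are illustrative, and the real gateway does more bookkeeping:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical chain for the fixer tiers: three attempts, three providers.
FIXER_CHAIN = ["claude", "codex", "hermes"]
BAD_MARKERS = ("sandbox unavailable", "sandbox failure")

def call_with_failover(prompt, providers, send):
    """Try providers in order; a known-bad body counts as a failure
    even when the subprocess exited zero."""
    errors = []
    for name in providers:
        try:
            out = send(name, prompt)
        except OSError as exc:
            errors.append((name, repr(exc)))
            continue
        if any(marker in out.lower() for marker in BAD_MARKERS):
            errors.append((name, "known-bad body"))
            continue
        return name, out
    raise RuntimeError("all providers failed: %r" % (errors,))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;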

&lt;h2&gt;Fleet Telemetry&lt;/h2&gt;

&lt;p&gt;Both my servers run Claude CLI sessions all day. So does my Mac. They all push OpenTelemetry traces to a central collector on the third box, which aggregates token usage across the fleet. The gateway reads from that aggregate.&lt;/p&gt;

&lt;p&gt;The reason this matters: Claude's 5-hour budget is per-account, not per-host. A naïve rate limiter on each server would think it had its own budget. The gateway sees one bucket. When the account is at 82%, every host knows it.&lt;/p&gt;

&lt;p&gt;The telemetry is also how I answer questions like "which project burned 40% of yesterday's budget" without grepping logs. The dashboard surfaces it directly, broken down by service and call class.&lt;/p&gt;

&lt;h2&gt;Per-Workflow Context Injection&lt;/h2&gt;

&lt;p&gt;Here's where the gateway pays for itself in tokens, not just in priority. Every call carries a workflow field that names what the caller is doing — website_fix, app_change, code_review, knowledge_response. The gateway consults a per-workflow profile that says: for this kind of call, here are the GAMEPLAN sections to inject as system context, here's the tool list to allow, and here's the token budget for the preamble.&lt;/p&gt;

&lt;p&gt;A code-review call gets the project's GAMEPLAN, the relevant fix-guides, and an empty tool list (review is read-only). A knowledge call gets a different slice and read-only tools. An app-change call gets the full surface plus write tools. None of them get the kitchen sink.&lt;/p&gt;

&lt;p&gt;The byte budget is the part that surprised me. The naïve version of "inject context" is to load every relevant doc into the system prompt — and that's where you find out you've blown 30K tokens before the user's actual prompt is even read. The gateway's context profiles cap the preamble at a configurable byte count and prefer the most-relevant sections within that cap. The cap is observable as a Prometheus histogram, so I can see when it's getting tight.&lt;/p&gt;
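&lt;p&gt;The cap itself is only a few lines. A sketch, assuming sections arrive already ranked by relevance (the ranking is the hard part, and it isn't shown here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def build_preamble(ranked_sections, byte_cap):
    """Pack sections most-relevant-first; hard-cap the preamble bytes."""
    parts, used = [], 0
    for section in ranked_sections:
        sep = 1 if parts else 0            # newline between sections
        room = max(byte_cap - used - sep, 0)
        chunk = section.encode("utf-8")[:room]
        if not chunk:
            break                          # budget exhausted
        parts.append(chunk.decode("utf-8", "ignore"))
        used += len(chunk) + sep
    return "\n".join(parts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;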

&lt;h2&gt;What I'd Tell Someone Building One&lt;/h2&gt;

&lt;p&gt;The article that I think nailed the user-side discipline is Pawel Huryn's Stop Hitting Claude Code Limits — twenty-two concrete techniques for the human at the keyboard. Cache management, model locking, effort tuning, lean tool loading. If you read one piece on this topic, read that one.&lt;/p&gt;

&lt;p&gt;The gateway is the layer below those techniques. Everything in Huryn's list is something you do yourself, in your own session, with your own discipline. The gateway is what you build when you have a fleet of services that can't sit at a keyboard and exercise discipline.&lt;/p&gt;

&lt;p&gt;The lessons I'd flag if you're building one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Priority is the foundational decision. Build the tier system before you build the routing logic. If you can't articulate which calls are more important than which, you have no basis for refusing one.
Provider diversity is real, but it's not free. Each backend has its own auth, its own quirks, its own definition of failure. Budget time for the integration.
Telemetry first, decisions second. You cannot make a sane scheduling decision without an aggregated view of usage. Build the telemetry pipe before the scheduler.
Context profiles save more tokens than you'd guess. The prompt isn't the expensive part of a Claude call. The system context you load around the prompt is. Cap it. Profile it. Watch the histogram.
Subprocess isolation is your friend. Each call runs as a fresh claude -p subprocess with model and tools fixed for that invocation. There's no mid-session drift, because there's no mid-session.
Make the failure modes loud. A silent rate-limit drop is worse than a visible 500 — at least the 500 lets the caller decide whether to retry, defer, or escalate.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;The gateway is in production and handling thousands of calls per day across the fleet. The next round of work is closing three specific gaps from the Huryn list — disabling 1M-context fallback to save cache writes, passing through Claude's --effort flag so background work can opt down to medium reasoning, and wiring up skill-based model routing so mechanical tasks get a Haiku instead of an Opus. None of those are big diffs. All of them are real money.&lt;/p&gt;

&lt;p&gt;The longer-term move is bringing in OpenRouter as a fifth backend so I can route ultra-low-stakes work to GLM-5.1 at roughly a twelfth of Opus cost. And eventually a Tree-sitter-based code-review graph that lets the reviewer load only the functions a diff touches, instead of loading whole files. That's claimed to reduce review tokens by a factor of seven or more. I'm skeptical of the specific number, but the direction is right.&lt;/p&gt;

&lt;p&gt;If you're building something similar, I'd love to compare notes. Email me. The full design doc is publicly visible in the project's README — it's the kind of doc I wished I'd been able to read before I started.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The Technical Blueprint
### The Autonomous Engineer — Book 2

The complete guide to running your own automation empire — including infrastructure architecture, AI tooling, monitoring, and the design patterns behind systems like the one in this post.


  Book 2 on Amazon
  Book 1 on Amazon
  Free Starter Kit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;← Back to Blog&lt;/p&gt;

&lt;p&gt;© 2026 Erik Anderson — Privacy — Contact&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why Your Business Automation Needs to Know Which Tasks Matter Most — Prime Automation Solutions</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 04 May 2026 15:54:25 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/why-your-business-automation-needs-to-know-which-tasks-matter-most-prime-automation-solutions-1p82</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/why-your-business-automation-needs-to-know-which-tasks-matter-most-prime-automation-solutions-1p82</guid>
      <description>&lt;p&gt;Why Your Business Automation Needs to Know Which Tasks Matter Most — Prime Automation Solutions&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Prime Automation







    Home
    Services
    Network Automation
    Blog
    Case Studies
    Free Tools
    Government
    Free Assessment








    Home &amp;amp;gt; Blog &amp;amp;gt; Business Automation Priorities

  # Why Your Business Automation Needs to Know Which Tasks Matter Most

  If every task is equally important, none of them are. Here is what most small businesses get wrong about automation priorities — and the simple fix.









      May 4, 2026
      &amp;amp;bull;
      5 min read


    Most small business automation works fine until the day it doesn't. The day it doesn't is usually the day a real customer reaches out — and the system is too busy doing something less important to respond.

    Here is the pattern I see almost every week. A business sets up a handful of automations. Maybe a content generator runs every morning. Maybe a marketing email goes out on a schedule. Maybe a research task pulls competitor data every few hours. Each one works in isolation. Each one was built without thinking about what happens when they all need to run at the same time.

    Then a customer fills out a contact form during a busy hour. The system that should reply within thirty seconds is in line behind a research task and two scheduled posts. By the time the response actually fires, the lead has already moved on.

    ## The Real Problem Is Not Speed. It Is Priority.

    When people complain that their AI tools are "slow," what they usually mean is "slow on the thing that mattered today." The tools were not actually slow. They were busy doing something less important.

    This is the same problem hospitals solve with triage and air traffic control solves with priority queues. Some things wait. Some things cannot. The system has to know the difference.

    Most off-the-shelf automation tools — Zapier, Make, basic AI integrations — do not understand priority. Every workflow runs in the order it was triggered. A blog post draft and a paying customer's question get the same shot at the queue. The blog post wins more often than it should, because it was scheduled and the customer email was not.

    ## What Priority Looks Like in Practice

    An automation system that knows priority does three things differently from one that does not.

    It tags every task with importance. A customer-facing reply is "critical." A scheduled blog draft is "background." A weekly report is "low." This is not optional. Without tags, the system has no basis for choosing between two tasks that arrive at the same moment.

    It reserves capacity for the important tasks. Even when the system is busy, the critical lane stays open. The way to do this is simple: set a budget for low-priority work, and let the system reject low-priority requests when capacity is tight. Most businesses skip this step and then wonder why urgent tasks fail at the worst possible moment.

    It defers what can be deferred. Reports, drafts, summaries, content generation, scrapes — none of these need to run during business hours. Push them to overnight. The customer email at 11 a.m. should not be competing with a content generator that could just as easily run at 2 a.m.

    ## The Cost of Getting This Wrong

    The math is simple and brutal. Industry data shows that responding to a web lead in thirty seconds versus thirty minutes increases conversion by close to four times. Every minute your automation delays a customer reply, you are losing potential revenue. If your scheduled tasks are blocking customer-facing tasks even occasionally, you are paying a real cost — and you probably do not see it because the failures are silent.

    I have audited small business automation setups where the customer-response system was being delayed by an average of two to four minutes during business hours. The owners had no idea. The dashboards showed everything as "running." The lost leads never showed up as a metric, because the system's idea of success was "did the workflow finish," not "did it finish in time to matter."

    ## How to Fix It

    You do not need to rebuild your automation stack to solve this. You need to do four things, in order.


      List every automation you currently run. Most small businesses have between five and twenty active workflows. Write them down. If you cannot list them, that is the first problem.
      Tag each one as critical, scheduled, or background. Critical means a human is waiting. Scheduled means it has a deadline but no one is actively waiting. Background means it can wait until tomorrow morning if the system is busy.
      Move every background task to overnight. Run them between 1 a.m. and 6 a.m. local. Most are reports, scrapes, drafts, or refreshes — they do not need to compete with customer hours.
      Make sure your critical lane has its own budget. If you are using a paid AI tool with rate limits, reserve at least 30% of your daily quota for customer-facing work. The rest is fair game for the background tasks.


    That is it. No new tools. No expensive rebuild. Just the discipline to admit that not every task in your system is equally important — and to design the system to act like it.

    ## The Test

    Here is a test that will tell you in five minutes whether your automations have a priority problem. Send a fake customer inquiry through your contact form during your busiest hour. Time how long it takes to get a reply. If it is under sixty seconds, your priority structure is fine. If it is over three minutes, you have a priority problem and you are losing leads to it every week.

    The fix is almost always simpler than the diagnosis. The hard part is being willing to look.



      ### Want Us to Audit Your Automations?

      We will look at every workflow you currently run, find the priority gaps, and give you a concrete fix list — whether you implement it yourself or hire us.

      Get Your Free Automation Audit













&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>businessautomationpriorities</category>
      <category>aiworkflowpriority</category>
      <category>customerresponseautomation</category>
      <category>automationratelimits</category>
    </item>
    <item>
      <title>My System Reverted a Production Failure While I Was Asleep</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:29:01 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/my-system-reverted-a-production-failure-while-i-was-asleep-3j3c</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/my-system-reverted-a-production-failure-while-i-was-asleep-3j3c</guid>
      <description>&lt;p&gt;At 03:57:27 UTC on April 26, 2026, my production system broke—and fixed itself before I woke up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-04-26 03:53  agent.merger.complete  primerouter  feature/phase2-humanrail-channel
                  commit 4d62098a, ff-merged to master, deploy_cmd=systemctl restart primerouter
2026-04-26 03:54  rollback_agent: stabilization wait 60s
2026-04-26 03:55  rollback_agent: HTTP GET http://127.0.0.1:9400/health
                  ConnectionError: Connection refused
2026-04-26 03:55  rollback_agent: git revert -m 1 4d62098a
2026-04-26 03:56  rollback_agent: git push origin master (rollback commit)
2026-04-26 03:56  rollback_agent: discord post #echo-dev
                  "Auto-revert: primerouter health check failed Connection refused"
2026-04-26 03:57:27  agent.rollback.complete  event id 30675
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I was asleep.&lt;/p&gt;

&lt;p&gt;A feature branch merged.&lt;br&gt;
The deployment failed.&lt;br&gt;
The system detected it, reverted it, restored production, and notified me.&lt;/p&gt;

&lt;p&gt;Downtime: under two minutes.&lt;br&gt;
Human involvement: zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Shift in 2026
&lt;/h2&gt;

&lt;p&gt;Writing code is no longer the bottleneck.&lt;/p&gt;

&lt;p&gt;With modern AI, most engineers can produce working systems quickly. Entire apps that once took weeks can now be generated in hours.&lt;/p&gt;

&lt;p&gt;That means the advantage has shifted.&lt;/p&gt;

&lt;p&gt;It’s no longer about building the thing.&lt;/p&gt;

&lt;p&gt;It’s about &lt;strong&gt;operating the thing&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keeping it alive&lt;/li&gt;
&lt;li&gt;Catching failures early&lt;/li&gt;
&lt;li&gt;Reverting bad changes&lt;/li&gt;
&lt;li&gt;Preventing repeat mistakes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most people can build.&lt;/p&gt;

&lt;p&gt;Very few can operate.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Built
&lt;/h2&gt;

&lt;p&gt;I run a one-person business with a stack of ~30 live services.&lt;/p&gt;

&lt;p&gt;The core is two pieces:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Nervous System
&lt;/h3&gt;

&lt;p&gt;An event bus that watches everything.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git pushes&lt;/li&gt;
&lt;li&gt;Test results&lt;/li&gt;
&lt;li&gt;Code reviews&lt;/li&gt;
&lt;li&gt;Deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every change becomes an event.&lt;/p&gt;

&lt;p&gt;Each event triggers the next step automatically.&lt;/p&gt;

&lt;p&gt;Push code → run tests → review → merge → deploy → verify&lt;/p&gt;

&lt;p&gt;No manual pipeline runs.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. The Immune System
&lt;/h3&gt;

&lt;p&gt;This is the part that matters.&lt;/p&gt;

&lt;p&gt;Every deployment is treated as suspicious until proven stable.&lt;/p&gt;

&lt;p&gt;After a merge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wait 60 seconds&lt;/li&gt;
&lt;li&gt;Check service health&lt;/li&gt;
&lt;li&gt;If it fails → revert immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s what you saw in the 03:57 log.&lt;/p&gt;

&lt;p&gt;No dashboards.&lt;br&gt;
No alerts waiting for a human.&lt;br&gt;
No “I’ll check it in the morning.”&lt;/p&gt;

&lt;p&gt;It fixes itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Components That Made This Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Automatic Rollbacks
&lt;/h3&gt;

&lt;p&gt;This is the core loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy new code&lt;/li&gt;
&lt;li&gt;Wait briefly&lt;/li&gt;
&lt;li&gt;Run health checks&lt;/li&gt;
&lt;li&gt;Revert if anything fails&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Simple. Brutal. Effective.&lt;/p&gt;

&lt;p&gt;Most systems alert you.&lt;/p&gt;

&lt;p&gt;This one acts.&lt;/p&gt;
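&lt;p&gt;The whole loop fits on one screen. A sketch with the deploy, health-check, and revert steps passed in as callables — the real versions shell out to systemctl, GET the health endpoint, and run git revert, as in the log at the top:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def deploy_and_watch(deploy, wait, healthy, revert, notify):
    """Deploy, wait for stabilization, health-check, auto-revert on failure."""
    deploy()              # e.g. ff-merge plus service restart
    wait()                # stabilization window (60s in my setup)
    if healthy():         # e.g. the /health endpoint answers
        return "healthy"
    revert()              # git revert the merge, push the revert commit
    notify("Auto-revert: health check failed")
    return "reverted"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;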




&lt;h3&gt;
  
  
  2. The “Verify Push” Guard
&lt;/h3&gt;

&lt;p&gt;I hit a subtle failure that changed everything.&lt;/p&gt;

&lt;p&gt;An AI agent returned success—but never actually pushed code.&lt;/p&gt;

&lt;p&gt;Exit code: 0&lt;br&gt;
Status: “done”&lt;br&gt;
Reality: nothing changed&lt;/p&gt;

&lt;p&gt;The fix was simple:&lt;/p&gt;

&lt;p&gt;After every “successful” change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check the remote branch SHA&lt;/li&gt;
&lt;li&gt;If it didn’t change → treat as failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That one check eliminated silent failures.&lt;/p&gt;
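&lt;p&gt;The guard is two small pieces: ask the remote for the branch tip, then refuse to believe "done" unless the tip moved. A sketch — the subprocess wiring is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import subprocess

def remote_tip(branch):
    """SHA of the branch on origin, or None if the branch is missing."""
    out = subprocess.run(
        ["git", "ls-remote", "origin", branch],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()[0] if out.strip() else None

def push_actually_landed(sha_before, sha_after):
    """An agent's exit code 0 only counts if the remote tip changed."""
    return sha_after is not None and sha_after != sha_before
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;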




&lt;h3&gt;
  
  
  3. Production Lockdown
&lt;/h3&gt;

&lt;p&gt;I don’t allow direct edits in production.&lt;/p&gt;

&lt;p&gt;Enforced by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git hooks blocking commits to main&lt;/li&gt;
&lt;li&gt;Hourly scans for “dirty” production state&lt;/li&gt;
&lt;li&gt;Alerts if anything bypasses the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because one manual fix breaks trust in the system.&lt;/p&gt;

&lt;p&gt;If the pipeline isn’t the source of truth, everything drifts.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Not Working (Important)
&lt;/h2&gt;

&lt;p&gt;This system is not perfect.&lt;/p&gt;

&lt;p&gt;Here are real gaps:&lt;/p&gt;

&lt;h3&gt;
  
  
  Outbound is not solved
&lt;/h3&gt;

&lt;p&gt;I can build and operate systems.&lt;/p&gt;

&lt;p&gt;Consistently generating customers is still manual.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some pipelines are broken
&lt;/h3&gt;

&lt;p&gt;One content distribution service is currently failing due to a message bus connection issue.&lt;/p&gt;

&lt;p&gt;It fails silently.&lt;/p&gt;

&lt;p&gt;That’s a real problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-in-the-loop system has zero users
&lt;/h3&gt;

&lt;p&gt;I built a system to route low-confidence tasks to humans.&lt;/p&gt;

&lt;p&gt;It works technically.&lt;/p&gt;

&lt;p&gt;No one uses it yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Actually Does Today
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Auto-reverts broken deployments&lt;/li&gt;
&lt;li&gt;Runs continuous testing and review&lt;/li&gt;
&lt;li&gt;Publishes blog content automatically&lt;/li&gt;
&lt;li&gt;Generates a daily podcast&lt;/li&gt;
&lt;li&gt;Tracks leads and pushes to CRM&lt;/li&gt;
&lt;li&gt;Monitors production drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some parts are strong.&lt;/p&gt;

&lt;p&gt;Some parts are early.&lt;/p&gt;

&lt;p&gt;That’s the reality.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reframe
&lt;/h2&gt;

&lt;p&gt;Code is cheap now.&lt;/p&gt;

&lt;p&gt;Operations are not.&lt;/p&gt;

&lt;p&gt;Anyone can generate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;apps&lt;/li&gt;
&lt;li&gt;scripts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Very few can run them reliably for months.&lt;/p&gt;

&lt;p&gt;Fewer can make them &lt;strong&gt;self-correcting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s where the leverage is.&lt;/p&gt;




&lt;h2&gt;
  
  
  If You’re Building Right Now
&lt;/h2&gt;

&lt;p&gt;Stop thinking only about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;features&lt;/li&gt;
&lt;li&gt;frameworks&lt;/li&gt;
&lt;li&gt;faster builds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start thinking about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;failure detection&lt;/li&gt;
&lt;li&gt;rollback speed&lt;/li&gt;
&lt;li&gt;system trust&lt;/li&gt;
&lt;li&gt;operational feedback loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the people who win in this era won’t be the fastest builders.&lt;/p&gt;

&lt;p&gt;They’ll be the ones whose systems &lt;strong&gt;keep working without them&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I’m Doing Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fix the broken distribution pipeline&lt;/li&gt;
&lt;li&gt;Get one real user through the human-in-the-loop system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;Small improvements to a system that compounds.&lt;/p&gt;




&lt;p&gt;Build systems that run without you.&lt;/p&gt;

&lt;p&gt;Or compete with someone who did.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Ansible Playbook Failing? The 7 Root Causes I See Most Often — Prime Automation Solutions</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:29:40 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/ansible-playbook-failing-the-7-root-causes-i-see-most-often-prime-automation-solutions-1302</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/ansible-playbook-failing-the-7-root-causes-i-see-most-often-prime-automation-solutions-1302</guid>
      <description>&lt;p&gt;Ansible Playbook Failing? The 7 Root Causes I See Most Often — Prime Automation Solutions&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Prime Automation





    Home
    Services
    Network Automation
    Blog
    Case Studies
    Free Tools
    Government
    Free Assessment








    Home &amp;amp;gt; Blog &amp;amp;gt; Ansible Playbook Failing? 7 Root Causes

  # Ansible Playbook Failing? The 7 Root Causes I See Most Often

  Fifteen years of production Ansible, distilled to the seven failures I see over and over. Each one with the actual error signature and the fix.









      April 23, 2026 • 9 min read • Automation


    When an Ansible playbook breaks in production, you do not have time for a blog post that starts "Ansible is a powerful automation tool developed by Red Hat." You have logs, you have a pager, and you have ten minutes to figure out whether this is a five-minute fix or a rollback.

    This post is the triage guide I wish I had the first time a playbook failed on me. Every cause below is something I have personally diagnosed in production — some of them more than a dozen times across enterprise network-automation gigs, government deployments, and small-business DevOps work. Each one includes the error signature you will actually see in your terminal and the fix that stops it recurring.

    If you are looking at a broken Ansible run right now and you need someone to fix it by tomorrow, the rapid-fix audit is $250 flat — written root-cause report in 48 hours, and if I cannot diagnose it you do not pay. If you have time to read, keep going.



      1
      ## Intermittent SSH Timeouts On a Subset of Hosts



    You run the same playbook. Eight hosts succeed. Three hosts fail with an SSH connection error. Re-run it — now it is four hosts that fail, but two of the previous failures pass. It looks random.

    TASK [network_config] **************************
    fatal: [edge-02.atl]: UNREACHABLE! =&amp;gt; {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host edge-02.atl port 22: Connection timed out", "unreachable": true}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    It is almost never actually random. The usual culprits, in order of frequency: (a) ControlPersist sockets going stale — SSH's own connection-reuse cache is serving you a dead socket; (b) DNS returning different results per lookup because you have two mismatched name-server entries; (c) a firewall doing rate-limiting on new SSH connections and your playbook is fanning out faster than the rate limit allows.

    Fix. Drop ControlPersist=60s to ControlPersist=0 in your ansible.cfg [ssh_connection] section to isolate the cache. Cap forks in ansible.cfg to ~20 to defeat the rate-limit case. Use -vvvv for one failed host — the SSH command-line Ansible is running will tell you whether the hang is on TCP connect or post-auth.
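    Both isolation knobs live in ansible.cfg. A minimal sketch (the section and option names are real Ansible settings; the values are the temporary debugging settings described above, not recommended steady-state config):

```ini
; ansible.cfg - temporary settings while isolating intermittent SSH failures
[defaults]
forks = 20                      ; cap fan-out to defeat SSH rate limiting

[ssh_connection]
; disable connection reuse to rule out stale ControlPersist sockets
ssh_args = -o ControlMaster=no -o ControlPersist=no
```

    Revert both once you have found the culprit; ControlPersist and a high fork count are exactly what make large runs fast.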



      ## 2. Handler Ordering That Runs at the Wrong Moment



    You notify a handler to restart a service. The playbook continues past the notify. A later task in the play fails. The handler never fires — because Ansible runs handlers only at the end of a play (or when meta: flush_handlers is explicitly called). Your service now has new config staged on disk but the daemon never reloaded. The next time something touches that box, production breaks.

    - name: deploy new nginx vhost
      template:
        src: vhost.j2
        dest: /etc/nginx/sites-available/app
      notify: reload nginx

    - name: run smoke test
      uri:
        url: http://localhost/app/health
        status_code: 200
      # fails, because nginx never got the reload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fix. Put - meta: flush_handlers immediately after any notify whose result you depend on in subsequent tasks. Your smoke test has no business running before the reload completed.
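A minimal sketch of the corrected play (task names are illustrative):

```yaml
- name: deploy new nginx vhost
  template:
    src: vhost.j2
    dest: /etc/nginx/sites-available/app
  notify: reload nginx

# run pending handlers NOW, not at the end of the play
- meta: flush_handlers

- name: run smoke test
  uri:
    url: http://localhost/app/health
    status_code: 200
```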

  ## 3. Fact-Gathering Crashes on a Subset of Hosts

One host in a group has a weird kernel, a missing Python module, or a locked-down SELinux context. Fact-gathering throws an exception on that host. By default, Ansible terminates the entire play for that host — but the error message is buried under a mountain of default-facts JSON, and it looks like a network problem until you look closely.

fatal: [legacy-01]: FAILED! =&amp;gt; {"ansible_facts": {}, "changed": false, "failed_modules": {"ansible.legacy.setup": {"failed": true, "module_stderr": "/usr/bin/python3: No module named ansible", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 1}}, "msg": "The following modules failed to execute: ansible.legacy.setup\n"}

Nine times out of ten this is Python path drift — the host has Python 3 somewhere non-standard, or the ansible_python_interpreter fact Ansible auto-detected is wrong.

Fix. Set ansible_python_interpreter explicitly per host in inventory, or use gather_facts: no + a targeted setup task with gather_subset: min to avoid the costly facts you do not need. For fleet hygiene, add an Ansible lint rule that fails CI if any host is missing an explicit interpreter.
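In INI-style inventory, pinning the interpreter per host looks like this (hostnames and paths are examples):

```ini
[legacy]
legacy-01 ansible_python_interpreter=/opt/python3/bin/python3

[modern]
edge-01 ansible_python_interpreter=/usr/bin/python3
```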

  ## 4. Non-Idempotent Tasks That Report "changed" Every Run

A well-written Ansible play is idempotent — running it twice in a row produces one change on the first run and zero on the second. When you start seeing the same task flip changed on every run, something is tricking the module into thinking there is work to do.

The usual culprits: (a) shell or command tasks with no creates: / removes: guard — Ansible has no way to know those ran successfully before; (b) template tasks where the rendered file has a timestamp, a hostname, or a random ID baked in; (c) lineinfile matching on a regex that subtly does not match its own output after the first run.

- name: install monitoring agent config
  template:
    src: monitor.conf.j2    # renders generated_at: {{ ansible_date_time.iso8601 }}
    dest: /etc/monitor.conf
  notify: restart monitor
&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every run regenerates the config with a new timestamp. Every run restarts your monitor. Every run takes the monitor down for two seconds. Two seconds times 365 days times 200 hosts is real downtime.

Fix. Never embed volatile values in templates that feed a changed signal. If you need the timestamp for audit, put it in a sidecar file the playbook does not re-check. Add changed_when: (plus creates:/removes: guards) to any shell/command task so YOU decide what counts as a change, not Ansible's default.
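A sketch of taking control of the changed signal on a command task (the script path and the marker string are illustrative):

```yaml
- name: install monitoring agent
  command: /opt/agent/install.sh
  args:
    creates: /etc/monitor.conf        # skip entirely once the file exists
  register: agent_install
  changed_when: "'installed' in agent_install.stdout"
```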

  ## 5. Inventory Drift — The Host Is Not Where Ansible Thinks It Is

You add a new host. The playbook works. Three months later the same playbook fails on the same host with a mysterious error. What changed? Probably nothing in the playbook. The infrastructure shifted under it — a host was renamed, an IP was recycled, a group membership was altered in a dynamic inventory script, a DNS entry now resolves to a load-balancer VIP instead of the actual host.

Inventory drift is the hardest category to catch because the error manifests wherever the stale inventory intersects a real operation. You will see symptoms ranging from SSH connection errors (Cause 1) to "permission denied" to "this command ran on the wrong host and broke production."

Fix. Check inventory into source control even if it is dynamic — keep a snapshot artifact. Add a weekly CI job that snapshots the output of ansible all -m setup and diffs it against last week's artifact, alerting on unexpected host churn. For dynamic inventories, have the inventory script log why it included each host (tag, query, pattern match) so you can audit the set.
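One way to take that weekly snapshot; the ad-hoc --tree flag writes one JSON facts file per host, which diffs cleanly (the directory paths are assumptions):

```shell
# dump per-host facts as JSON, one file per host, then diff against last week
ansible all -i inventory/ -m setup --tree /var/snapshots/facts.new
diff -r /var/snapshots/facts.last /var/snapshots/facts.new
```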

  ## 6. Retries That Hide the Real Error

Someone wrapped a flaky task in retries: 10 and delay: 5. When it worked, nobody looked closely. Now it does not work, and what you see in the log is ten identical failures in a row followed by a final abort — but the underlying error is the one that fired on attempt one, nine attempts ago, and by the time it scrolled past the useful context was gone.

This one hurts because the code looks defensive. It is actually hiding data.

- name: wait for service
  uri:
    url: "https://{{ inventory_hostname }}/ready"
  register: result
  until: result.status == 200
  retries: 30
  delay: 10
  # the service returned 403 on attempt one — a real auth bug —
  # but you retried 29 more times because until only checks 200
&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fix. Distinguish expected retries (timeouts, transient DNS) from unexpected ones (4xx responses, specific exceptions). Gate retries on the kind of failure, not the overall result. Log every retry attempt with its actual failure reason. When a task eventually succeeds after retries, emit a warning so you can find these ticking time-bombs later.
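The same gating principle, sketched in plain Python (the exception classes stand in for the transient-versus-real distinction; names are mine, not Ansible internals):

```python
import time

# transient failures worth retrying; anything else (auth failures,
# 4xx-style errors) should surface immediately
TRANSIENT = (TimeoutError, ConnectionError)

def retry_transient(operation, attempts=5, delay=0.01):
    """Retry only transient failures, logging each attempt's real reason."""
    last = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TRANSIENT as exc:
            last = exc
            print("attempt", attempt, "failed:", repr(exc))  # keep the real reason
            time.sleep(delay)
    raise last
```

A real auth bug (say, a PermissionError) is not caught, so it aborts on attempt one instead of hiding behind 29 identical retries.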

  ## 7. Module or Collection Version Drift Between Laptop, CI, and Prod

The playbook works on your laptop. It works in CI. It fails in prod — or the other way around. The playbook source is identical. What is different is the version of Ansible, the version of a collection, or the version of a module dependency. Modules have changed their arguments, default behaviors, and return values across minor releases. A playbook written against community.network 3.x will fail in subtle ways on 4.x.

TASK [network_cli] *********************************
ERROR! couldn't resolve module/action 'cisco.ios.ios_interfaces'. This often indicates a misspelling, missing collection, or incorrect module path.
&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The module exists. The collection exists. They are not installed on THIS host. Or they are installed at a version that does not have this module yet.

Fix. Pin everything. requirements.yml with exact collection versions. A Dockerfile or poetry-locked venv for the Ansible runner itself. CI and production must run from the same pinned environment — if your CI is on Ansible 2.16 and prod is on 2.15, you will lose half a day on version-skew bugs every month. The five minutes to lock versions is worth the two hours of rollback debugging you will avoid.
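A pinned requirements.yml sketch (the version numbers are placeholders; pin to whatever you actually validated):

```yaml
collections:
  - name: cisco.ios
    version: "6.1.0"
  - name: community.network
    version: "3.3.0"
```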

## The Pattern Underneath All Seven

Every one of these has the same shape. Ansible is doing something reasonable given the inputs it has, but the inputs are not what the operator thought they were — stale SSH sockets, unflushed handlers, mis-detected interpreters, volatile templates, drifted inventories, suppressed exceptions, mismatched module versions. The playbook code looks fine. The gap between "what the code says" and "what the infrastructure actually is" is where the bug lives.

The fast way to debug broken automation is not to stare at the playbook. It is to interrogate the gap. Compare the facts Ansible gathered against what you know to be true. Check the versions of every moving piece. Strip the retries and log what actually failed. In fifteen years of fixing other people's Ansible, I have never found a bug that did not live somewhere in that gap.

  ### Broken Automation Right Now?

  I run a flat-fee rapid-fix service: $250 root-cause audit in 48 hours. Ansible, Cisco NSO, CI/CD, Python. If I cannot diagnose it, you do not pay.

  Submit Your Issue

&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ansibleplaybookfailing</category>
      <category>ansibleerror</category>
      <category>ansibletroubleshooting</category>
      <category>fixansible</category>
    </item>
    <item>
      <title>I built a $250 website business that runs itself — here's the architecture</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Sun, 19 Apr 2026 21:02:36 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/i-built-a-250-website-business-that-runs-itself-heres-the-architecture-1j37</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/i-built-a-250-website-business-that-runs-itself-heres-the-architecture-1j37</guid>
      <description>&lt;p&gt;Most "websites as a service" are $3K setup plus a monthly retainer. Mine is $250 once, live in 24 hours, and the customer drives every edit by sending a plain-English email.&lt;/p&gt;

&lt;p&gt;That's not a pitch. That's literally how the system works. I want to show you the architecture, because I think more people should build things like this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;A local plumber in Atlanta needs a website. Their options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wix / Squarespace&lt;/strong&gt; — $300 for a year of hosting, pick a template, hope it looks different from the other 40 plumber sites using the same template.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agency&lt;/strong&gt; — $5,000, eight weeks, three rounds of revisions, still feels like a template underneath.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fiverr&lt;/strong&gt; — $250, maybe, but good luck iterating.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The plumber doesn't want &lt;em&gt;options&lt;/em&gt;. They want a working site, live tomorrow, so they can point their Google Ads at it and start getting leads. Revisions happen via phone calls with their nephew who "knows computers."&lt;/p&gt;

&lt;p&gt;I wanted to build the thing between those options.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bet
&lt;/h2&gt;

&lt;p&gt;If AI can write code, it can also maintain a simple static website. The hard part isn't the code generation — Claude does that fine. The hard part is the &lt;em&gt;harness&lt;/em&gt; around it: payment, provisioning, customer communication, deployment safety. That's all plumbing. And plumbing I can build.&lt;/p&gt;

&lt;p&gt;So I built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;Here's the full flow, end to end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer sees ad              Customer buys             Customer gets site
      │                              │                          │
      ▼                              ▼                          ▼
┌─────────────┐            ┌──────────────────┐      ┌──────────────────┐
│ Landing page│ ────────▶  │ Stripe Payment   │ ───▶ │ Welcome email    │
│  with pixel │            │ Link ($250)      │      │ (HTML, branded)  │
└─────────────┘            └──────────────────┘      └─────────┬────────┘
                                   │                           │
                                   │ webhook                   │ customer replies
                                   ▼                           │ with plain-English
                           ┌──────────────────┐                │ changes
                           │ Pipeline service │                │
                           │ (FastAPI + git)  │                │
                           └──────┬───────────┘                │
                                  │                            │
                      build + provision + deploy               │
                                  ▼                            ▼
                           ┌──────────────────┐      ┌──────────────────┐
                           │ Customer's site  │ ◀─── │ Email reactor    │
                           │ on its own git   │      │ (Claude + tools) │
                           │ repo + subdomain │      └─────────┬────────┘
                           └──────────────────┘                │
                                                      reactor edits files,
                                                      commits, pushes, deploys,
                                                      emails the customer back
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six pieces, one flow. Let me walk each one.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The landing page
&lt;/h3&gt;

&lt;p&gt;Plain HTML. No frameworks. Hero video showing the end-to-end flow in 55 seconds. One CTA: a Stripe Payment Link. Meta Pixel + Google Ads tag firing PageView → InitiateCheckout → Purchase events across the funnel so paid traffic can optimize.&lt;/p&gt;

&lt;p&gt;Why plain HTML? Because &lt;em&gt;every&lt;/em&gt; customer site I build is also plain HTML. Eating my own dog food means when something weird breaks, I find it on my own page first.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Stripe Payment Link
&lt;/h3&gt;

&lt;p&gt;The fastest "get money in" primitive Stripe offers. No checkout page to build, no PCI scope, no JavaScript to maintain. You click a link, you pay, Stripe redirects to a &lt;code&gt;/thanks/&amp;lt;sku&amp;gt;/&lt;/code&gt; page on my side and POSTs a &lt;code&gt;checkout.session.completed&lt;/code&gt; webhook.&lt;/p&gt;

&lt;p&gt;Payment Links also have "After payment" metadata fields — customer name, business name, phone. Those come through on the webhook payload, which is exactly what I need to render their first site.&lt;/p&gt;
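&lt;p&gt;For the curious, the v1 signature scheme Stripe uses is simple enough to sketch: an HMAC-SHA256 over "timestamp.payload". This is an illustration, not Stripe's SDK; in production, call the official library's verification helper, which also checks timestamp freshness (this sketch omits that):&lt;/p&gt;

```python
import hmac, hashlib

def verify_stripe_signature(payload, sig_header, secret):
    """Sketch of Stripe's v1 webhook scheme: HMAC-SHA256 over 'timestamp.payload'.
    Omits the timestamp-freshness check a real verifier must also perform."""
    parts = dict(item.split("=", 1) for item in sig_header.split(","))
    signed = parts["t"].encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, parts["v1"])
```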

&lt;h3&gt;
  
  
  3. The pipeline service
&lt;/h3&gt;

&lt;p&gt;A small FastAPI service listening on a private port. The webhook handler does five things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verifies the Stripe signature.&lt;/li&gt;
&lt;li&gt;Parses the customer's order (email, business name, phone, package).&lt;/li&gt;
&lt;li&gt;Inserts a row into a SQLite orders table.&lt;/li&gt;
&lt;li&gt;Copies a starter template into a new directory named after the business slug.&lt;/li&gt;
&lt;li&gt;String-substitutes &lt;code&gt;{{BUSINESS_NAME}}&lt;/code&gt;, &lt;code&gt;{{PHONE}}&lt;/code&gt;, etc. into the template's HTML files.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nothing fancy. A template directory and &lt;code&gt;shutil.copytree&lt;/code&gt; plus a replacement loop.&lt;/p&gt;
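&lt;p&gt;The whole provisioning step fits in a dozen lines. A sketch (function and directory names are mine, not the production code):&lt;/p&gt;

```python
import shutil
from pathlib import Path

def provision_site(template_dir, sites_root, slug, fields):
    """Copy the starter template, then fill {{PLACEHOLDER}} tokens in every page."""
    dest = Path(sites_root) / slug
    shutil.copytree(template_dir, dest)
    for page in dest.rglob("*.html"):
        text = page.read_text()
        for key, value in fields.items():
            text = text.replace("{{" + key + "}}", value)
        page.write_text(text)
    return dest
```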

&lt;p&gt;Then it spawns a background task that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a private Gitea repo for that customer (&lt;code&gt;client-&amp;lt;slug&amp;gt;.git&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Runs &lt;code&gt;git init&lt;/code&gt; + initial commit + &lt;code&gt;git push&lt;/code&gt; in the customer's site directory. The customer's site is now a proper tracked repo, not a loose folder.&lt;/li&gt;
&lt;li&gt;Clones the repo to a separate authoring host, so future edits happen on a dev machine instead of on the production box.&lt;/li&gt;
&lt;li&gt;Writes the resulting git URL and commit SHA back onto the order row.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point the site is &lt;strong&gt;live&lt;/strong&gt; on a staging subdomain like &lt;code&gt;plumbingco.primeautomationsolutions.com&lt;/code&gt;, behind SSL, indexed-ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The welcome email
&lt;/h3&gt;

&lt;p&gt;SES, not Gmail. Branded HTML. One clear ask at the top:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Send us your logo, 2-3 sites whose look you like, and a short description of what you do. Reply to this email with whatever you have — rough notes, a napkin sketch, a photo on your phone. No forms, no portals, no logins.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The customer's first "real" interaction with the service is an email they can answer on their phone in 30 seconds. Not a dashboard with a 14-field intake form.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The email reactor
&lt;/h3&gt;

&lt;p&gt;This is the magic piece.&lt;/p&gt;

&lt;p&gt;When the customer hits reply, an email-receiver service picks up the message (I use a catchall domain + a small webhook endpoint) and publishes an event on an internal message bus. A separate &lt;em&gt;reactor&lt;/em&gt; service subscribes to those events, looks up the sender in a &lt;code&gt;client_profiles&lt;/code&gt; table, classifies the email (is this a website change? an off-topic question? a support request?), and — if it's a site change — &lt;strong&gt;spawns a fresh Claude session&lt;/strong&gt; pointed at the customer's git repo.&lt;/p&gt;

&lt;p&gt;The Claude session has access to a short list of tools: read files, edit files, write files, run a restricted set of shell commands, grep and glob. Not much more. It reads the customer's email, understands what they're asking for, navigates the site's files, makes the edits, and reports what it changed.&lt;/p&gt;

&lt;p&gt;Then a deploy step takes over:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a timestamped feature branch.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git add + commit&lt;/code&gt; the changes on that branch.&lt;/li&gt;
&lt;li&gt;Fast-forward-merge back to master.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git push origin master&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The customer-site repo's master is now advanced. Nginx serves the updated file immediately (same working tree, changes already on disk).&lt;/p&gt;
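&lt;p&gt;The four deploy steps as a runnable sketch against a throwaway local repo (git 2.28+ for init -b; there is no remote here, so the final push stays commented):&lt;/p&gt;

```shell
set -eu
repo=$(mktemp -d)                       # stand-in for the customer's clone
git -C "$repo" init -q -b master
git -C "$repo" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "initial site"
branch="deploy-$(date +%s)"             # 1. timestamped feature branch
git -C "$repo" checkout -q -b "$branch"
touch "$repo/index.html"                # stand-in for the reactor's edits
git -C "$repo" add -A                   # 2. commit the changes on that branch
git -C "$repo" -c user.email=ci@example.com -c user.name=ci \
    commit -q -m "apply customer change"
git -C "$repo" checkout -q master
git -C "$repo" merge -q --ff-only "$branch"   # 3. fast-forward merge to master
# git -C "$repo" push origin master           # 4. push (needs a real remote)
```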

&lt;p&gt;Finally, the reactor sends the customer a second branded HTML email:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your updates are live. Here's what changed: [bulleted summary from Claude's session output]. View your site: [link]."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;End-to-end, the loop takes about &lt;strong&gt;43 seconds&lt;/strong&gt; in practice. I know because I sat there watching it happen with a test purchase running through as a fake plumbing company ("RapidFlow Plumbing") on Saturday.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. The safety nets
&lt;/h3&gt;

&lt;p&gt;The part that &lt;em&gt;nobody&lt;/em&gt; writes about when they post their "I built an AI agent" architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No direct commits to master.&lt;/strong&gt; A git hook on every clone blocks direct commits; the reactor's deploy step uses a throwaway feature branch specifically to bypass the hook legitimately. (This was a bug I had to fix. See below.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refuse to spawn Claude on a dirty tree.&lt;/strong&gt; Before any session launches, the executor runs &lt;code&gt;git status --porcelain&lt;/code&gt;. If the tree has uncommitted or untracked state, it refuses — no Claude run. This saved me from a 131-file mega-commit in an earlier session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signature verification on every webhook.&lt;/strong&gt; A naked &lt;code&gt;POST /api/webhook/stripe&lt;/code&gt; returns 400 without a Stripe signature. The whole service assumes all traffic is hostile until proven otherwise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email authentication.&lt;/strong&gt; The reactor verifies the sender is an active client before touching their repo. Forging the &lt;code&gt;From:&lt;/code&gt; header on an email doesn't get you a site rewrite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-commit secret scanner.&lt;/strong&gt; API keys and credentials in code would kill the business if any customer site leaked them. A hook runs on every commit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these were in the first version. I added each one after something almost went sideways. The fact that the whole flow works today is because those nets are all in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things that surprised me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The git hook was the hardest part.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A pre-commit hook that blocks direct commits to master is the right safety policy — except it also blocks the &lt;em&gt;very first commit in an empty repo&lt;/em&gt;, which is exactly what happens when you're provisioning a new customer site. My first customer order failed at &lt;code&gt;git commit&lt;/code&gt; with "BLOCKED: Direct commits to master are not allowed." Took me a beat to realize the hook needs an "allow if HEAD doesn't exist yet" short-circuit.&lt;/p&gt;

&lt;p&gt;Small bug, huge impact. Without that guard, the whole pipeline is dead on arrival.&lt;/p&gt;
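&lt;p&gt;The guard itself is a few lines of sh. A sketch of the check, written as a function so it is easy to exercise (the real hook simply runs this and exits with its return code):&lt;/p&gt;

```shell
# pre-commit hook sketch: block direct commits to master,
# but allow the very first commit in an empty repo.
check_commit_allowed() {
  head=$(git rev-parse -q --verify HEAD || true)  # empty in a brand-new repo
  if [ -z "$head" ]; then
    return 0                                      # no HEAD yet: initial commit, allow
  fi
  branch=$(git rev-parse --abbrev-ref HEAD)
  if [ "$branch" = "master" ]; then
    echo "BLOCKED: Direct commits to master are not allowed."
    return 1
  fi
  return 0
}
```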

&lt;p&gt;&lt;strong&gt;2. Python's scoping rules bit me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Somewhere deep in the reactor, I had an inner &lt;code&gt;from deployer import deploy&lt;/code&gt; statement that shadowed a module-level import. On one code path the inner import never ran, and Python's "function-level &lt;code&gt;from X import Y&lt;/code&gt; makes Y local everywhere in that function" rule produced an &lt;code&gt;UnboundLocalError&lt;/code&gt;. The customer flow reached the Claude session, got the edits, and crashed at the deploy step. Silently.&lt;/p&gt;

&lt;p&gt;The fix was one line. Finding it took an hour because the traceback pointed at a reference to &lt;code&gt;deploy&lt;/code&gt; that looked fine.&lt;/p&gt;
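&lt;p&gt;A minimal repro of the trap, with a stand-in import (the real code imported a deploy function, not sqrt):&lt;/p&gt;

```python
from math import sqrt               # module-level import

def broken_flow(use_local):
    if use_local:
        from math import sqrt       # makes sqrt a LOCAL name for the whole function
    return sqrt(4.0)                # UnboundLocalError whenever use_local is False
```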

&lt;p&gt;&lt;strong&gt;3. Cloudflare caches 404s.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I pushed a new logo asset, referenced it in nav across 86 pages, deployed. Every visitor saw a 404 for the logo. Direct curl to the origin returned 200. The file was on disk, tracked in git, mode 664, served by nginx. Cloudflare had cached a 404 from an earlier probe before the file existed and was serving that 404 at the edge.&lt;/p&gt;

&lt;p&gt;Solution: purge the URL in the Cloudflare dashboard. Five seconds. But I burned twenty minutes chasing ghosts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm keeping private
&lt;/h2&gt;

&lt;p&gt;I'm not going to paste the exact prompt I send to Claude. I'm not going to show you the reactor's classification code. I'm not going to publish the customer-email templates.&lt;/p&gt;

&lt;p&gt;Not because they're clever — they mostly aren't. But because the &lt;em&gt;real&lt;/em&gt; moat on a business like this isn't the architecture. It's the six months of debugging and a thousand small judgment calls that separate a working system from a demo. I've written enough here for you to see the shape. If you want to build something like it, you can. You'll just have to find the bugs yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The obvious upgrades:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Industry starter templates.&lt;/strong&gt; Right now every customer gets the same generic starter until they email in changes. Three or four industry variants (local service, restaurant, consultant, ecommerce) would make the first impression much stronger.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-purchase brief.&lt;/strong&gt; A one-question form on the thanks page ("Which industry fits you?") would let the first render pick a better template.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abandoned-cart recovery.&lt;/strong&gt; Stripe has it built in. I haven't turned it on yet. Easy win.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the less-obvious ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer-facing dashboard.&lt;/strong&gt; Not for editing — the email loop is the editing interface. But for &lt;em&gt;viewing&lt;/em&gt; status, history of changes, upcoming renewals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Referral loop.&lt;/strong&gt; Every happy customer gets a link that discounts a friend's first site. Small-business-to-small-business is the highest-converting traffic in this space.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If you're a small-business owner in Atlanta (or anywhere, really) and you want a real website built by a human-plus-AI team in under a day: &lt;a href="https://primeautomationsolutions.com/websites.html" rel="noopener noreferrer"&gt;primeautomationsolutions.com/websites&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're a builder and you want to build something like this yourself: the pattern is right here. The primitives are Stripe + SES + Git + Claude. Everything else is plumbing. Happy building.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Erik Anderson. I run Prime Automation Solutions out of Atlanta, GA. I'm a US military veteran and an engineer. This whole site is proof of the architecture described above — every update to this blog post was made by the same reactor loop.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>smallbusiness</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Using Claude Code for Network Automation: A Real Engineer's Experience</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Thu, 16 Apr 2026 13:00:02 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/using-claude-code-for-network-automation-a-real-engineers-experience-47oo</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/using-claude-code-for-network-automation-a-real-engineers-experience-47oo</guid>
      <description>&lt;p&gt;I have been using Claude Code for network automation work for over a year. Not experimenting with it. Not evaluating it. Using it as a primary tool in production engineering work. This post covers what that actually looks like — the prompts that work, the results I get, and the MCP server architecture that makes it genuinely powerful for networking.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Claude Code Actually Is
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    For engineers who have not used it yet: Claude Code is not a chatbot. It is an AI agent that runs in your terminal with full access to your filesystem, your shell, and any tools you connect to it. You describe what you want in plain English, and it reads your code, writes new code, runs commands, edits files, and executes multi-step tasks. You approve each action before it runs.

    This is fundamentally different from pasting a question into ChatGPT and copying the response into your editor. Claude Code understands the context of your entire project — your directory structure, your existing code patterns, your configuration files. Its suggestions are not generic; they are tailored to your specific codebase.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Real Prompt, Real Result: Building a Config Audit Tool
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Here is an actual task I gave Claude Code last month. I needed a tool that connects to every router in our inventory, pulls the running configuration, compares it against a set of compliance rules, and generates a report showing which devices are out of compliance and why.

    My prompt was approximately: "Build a Python script that reads the device inventory from our Nornir inventory file, connects to each device using Netmiko, pulls the running config, checks it against the compliance rules defined in compliance_rules.yaml, and outputs a CSV report with columns for hostname, rule name, status, and details."

    Claude Code read my existing Nornir inventory structure, understood the YAML format I use for inventory, and generated a complete script. It used Nornir for inventory management (matching my existing pattern), Netmiko for device connections (matching my existing SSH config), and a YAML-based rules engine for compliance checks. It wrote the compliance rule schema, the checking logic, the CSV output, and error handling for unreachable devices.

    The entire process took about 15 minutes of wall time, including my review of each step. Writing that tool from scratch would have taken me most of a day. The code quality was solid — proper logging, exception handling, type hints, and it followed the patterns already established in our codebase.
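    The core of such a tool, minus the Nornir/Netmiko plumbing, is small enough to sketch. This rule schema is my illustration of the shape, not the generated code:

```python
import csv, io

def check_host(hostname, config_text, rules):
    """Each rule: a name plus a line the running config must contain."""
    rows = []
    for rule in rules:
        present = rule["must_contain"] in config_text
        rows.append({"hostname": hostname,
                     "rule name": rule["name"],
                     "status": "PASS" if present else "FAIL",
                     "details": "requires: " + rule["must_contain"]})
    return rows

def report_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["hostname", "rule name", "status", "details"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```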
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  MCP Servers: Where It Gets Powerful
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    MCP (Model Context Protocol) is what turns Claude Code from a code generator into a network operations tool. MCP servers are plugins that give Claude Code the ability to interact with external systems — and we built one for Cisco NSO.

    Our NSO MCP server exposes tools that let Claude Code:


      - **Query device configurations** through NSO's RESTCONF API. Claude Code can ask "what is the BGP configuration on router X?" and get the actual live data.
      - **Check sync states** across the device fleet. "Are any devices out of sync?" returns a real-time answer from NSO.
      - **Compare configurations** between the intended state and the actual state. "Show me the config drift on switch Y" produces a diff.
      - **Deploy configuration changes** through NSO's transaction manager. Claude Code can generate a config change, push it through NSO with dry-run first, and show me the exact diff before I approve the actual deployment.


    This means I can say something like "check if all our edge routers have the correct NTP configuration and fix any that are wrong" — and Claude Code will query NSO for all edge routers, check their NTP config against our standard, identify the non-compliant ones, generate the corrective configuration, show me the dry-run diff, and deploy it with my approval. What used to be an afternoon of manual work becomes a five-minute supervised conversation.
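    Under the hood those tools are thin wrappers over NSO's RESTCONF API. A sketch of the request-building half (base URL, credentials, and device name are illustrative; the path shape follows NSO's tailf-ncs device model):

```python
import base64

def nso_config_url(base_url, device):
    """RESTCONF path for one device's config in NSO's tailf-ncs model."""
    return base_url + "/restconf/data/tailf-ncs:devices/device=" + device + "/config"

def nso_headers(username, password):
    """Basic-auth and YANG-JSON accept headers for an NSO RESTCONF call."""
    token = base64.b64encode((username + ":" + password).encode()).decode()
    return {"Authorization": "Basic " + token,
            "Accept": "application/yang-data+json"}
```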
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The Phase 3 Jump
&lt;/h2&gt;
&lt;p&gt;I think about AI tools in three phases. Phase 1 is autocomplete (Copilot). Phase 2 is conversation (ChatGPT, Claude chat). Phase 3 is agentic (Claude Code with MCP). The jump from Phase 2 to Phase 3 is not incremental — it is a step change in what is possible.&lt;/p&gt;

&lt;p&gt;In Phase 2, you describe a problem, get code, paste it somewhere, run it, encounter an error, go back to the chat, describe the error, get a fix, paste it, repeat. It is faster than working alone, but it is still a manual loop.&lt;/p&gt;

&lt;p&gt;In Phase 3, you describe the problem once. The agent reads your code, understands your environment, writes the solution, tests it, encounters the error itself, fixes it, and delivers a working result. You supervise and approve rather than manually shuttling text between windows.&lt;/p&gt;

&lt;p&gt;For network engineering specifically, this matters because our work involves many small steps that depend on each other — query a device, parse the output, make a decision based on the data, generate a config change, validate it, deploy it. An agentic AI handles this entire chain while you watch. A chatbot handles one step at a time while you do the rest.&lt;/p&gt;
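&lt;p&gt;That query-parse-decide-generate-validate-deploy chain can be caricatured in a few lines. This is a toy sketch, not Claude Code's internals: each step retries with the agent's own fix, and the deploy step waits for a human.&lt;/p&gt;

```python
# Toy sketch of the Phase 3 loop. 'fix' stands in for the agent reading
# its own error and revising; 'approve' is the human supervision gate.

def run_chain(steps, fix, approve, max_attempts=3):
    """Run dependent steps in order; on failure, hand the error to 'fix'
    and retry. The final 'deploy' step is gated on human approval."""
    for name, fn in steps:
        if name == "deploy" and not approve():
            return "held for approval"
        for attempt in range(max_attempts):
            try:
                fn()
                break
            except Exception as exc:
                fix(name, exc)      # the agent sees its own error
        else:
            return f"gave up on {name}"
    return "done"
```

&lt;p&gt;In Phase 2, you are the loop body. In Phase 3, you are only the approve() call.&lt;/p&gt;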
&lt;h2&gt;
  
  
  What I Have Learned After a Year
&lt;/h2&gt;
&lt;p&gt;Here are the practical lessons that tutorials do not teach you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be specific about your existing patterns.&lt;/strong&gt; If your team uses Nornir for inventory and Netmiko for connections, say so in the prompt. Claude Code will match your patterns, but only if it knows what they are. If you just say "connect to routers," it might choose Paramiko or AsyncSSH instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always review the dry-run.&lt;/strong&gt; Claude Code with NSO MCP will happily generate and deploy configuration changes. The dry-run step is where you catch mistakes. Never skip it. Never rubber-stamp it. Read the diff every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it for the tedious work, not the thinking.&lt;/strong&gt; AI excels at generating boilerplate, parsing output, writing tests, and handling error cases. The high-value work — deciding what to automate, designing the architecture, choosing the right approach — is still yours. Use AI to execute faster, not to think for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build MCP servers for your specific environment.&lt;/strong&gt; The generic tools are useful, but the real power comes from MCP servers tailored to your systems. If you use SolarWinds, build an MCP server for SolarWinds. If you use ServiceNow, build one for ServiceNow. Each server you add expands what Claude Code can do for you.&lt;/p&gt;

&lt;p&gt;The engineers who get the most from AI tools are the ones who invest time in setting up the environment correctly. A well-configured Claude Code session with the right MCP servers is an order of magnitude more productive than a vanilla installation.&lt;/p&gt;
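&lt;p&gt;The dry-run habit is worth writing down as code. A minimal sketch, where dry_run and commit stand in for whatever your transaction manager provides and confirm is the human reading the diff:&lt;/p&gt;

```python
# Sketch of "dry-run first, approve, then deploy". The dry_run and
# commit callbacks are placeholders for real transaction calls.

def deploy_with_dry_run(change, dry_run, commit, confirm):
    """Run the change through a dry-run first; only commit after the
    human has read the diff and approved it."""
    diff = dry_run(change)      # what the transaction *would* change
    if not diff:
        return "no-op"          # nothing to push, nothing to review
    if confirm(diff):           # the human reads the diff, every time
        commit(change)
        return "deployed"
    return "rejected"
```

&lt;p&gt;If the deploy path in your tooling cannot be forced through a wrapper like this, that is a gap worth closing before you hand the keys to an agent.&lt;/p&gt;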




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://primeautomationsolutions.com/blog/posts/claude-code-network-automation.html" rel="noopener noreferrer"&gt;https://primeautomationsolutions.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Trace It. Export It. Cut It. — A Modern Alternative to Legacy Digitizer Software</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Wed, 15 Apr 2026 01:15:20 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/trace-it-export-it-cut-it-a-modern-alternative-to-legacy-digitizer-software-1f82</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/trace-it-export-it-cut-it-a-modern-alternative-to-legacy-digitizer-software-1f82</guid>
      <description>&lt;p&gt;&lt;strong&gt;Your digitizer board still works. Your software should not hold it back.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you run a CNC plasma table, waterjet, vinyl cutter, or pattern-making operation, there is a good chance you have a GTCO or CalComp digitizer board sitting in your shop. And there is an equally good chance the software driving it looks like it was designed before Windows XP.&lt;/p&gt;

&lt;p&gt;Most digitizer software on the market has not had a meaningful update in over a decade. The interfaces are clunky, the licensing is painful, and when you call for support, you are lucky to get a callback. Meanwhile, you are paying anywhere from $1,500 to $4,000 for that software.&lt;/p&gt;

&lt;p&gt;The board itself is fine — it is a precision instrument. The problem is everything between the pen and your DXF file.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Modern Digitizer Software Should Actually Look Like
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A clean interface that does not fight you.&lt;/strong&gt; You are tracing parts, not learning CAD.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast, accurate tracing with auto-smoothing.&lt;/strong&gt; Trace your points, and the software generates clean curves without manually adjusting every node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-board calibration.&lt;/strong&gt; Quick calibration, stored profiles, done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DXF and SVG export that actually works.&lt;/strong&gt; Export once, cut once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compatibility with the hardware you already own.&lt;/strong&gt; GTCO, CalComp, and other standard digitizer boards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CNC plasma and oxy-fuel cutting&lt;/strong&gt; — tracing templates and repair parts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waterjet operations&lt;/strong&gt; — digitizing gaskets, brackets, custom parts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vinyl and sign cutting&lt;/strong&gt; — converting hand-drawn artwork to cut-ready vectors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apparel and upholstery pattern making&lt;/strong&gt; — digitizing paper patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Woodworking&lt;/strong&gt; — tracing templates for CNC routers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General metal fabrication&lt;/strong&gt; — trace it and cut it, faster than CAD&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Legacy software: &lt;strong&gt;$1,500 to $4,000&lt;/strong&gt; + annual fees.&lt;/p&gt;

&lt;p&gt;Our target: &lt;strong&gt;$800&lt;/strong&gt; — one-time purchase, no subscription. Works with your existing boards.&lt;/p&gt;

&lt;h2&gt;
  
  
  We Want to Hear From You
&lt;/h2&gt;

&lt;p&gt;We are building this. If modern, affordable digitizer software — clean UI, accurate tracing, DXF/SVG export, compatible with your existing board — is something you would buy, let us know.&lt;/p&gt;

&lt;p&gt;Email: &lt;a href="mailto:erik@primeautomationsolutions.com"&gt;erik@primeautomationsolutions.com&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Erik Anderson is a network automation engineer and author of The Autonomous Engineer. He builds systems that work.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cnc</category>
      <category>automation</category>
      <category>manufacturing</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Accidentally Built a 5-Agent AI Fleet Instead of Buying a $200 Mini PC</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 13 Apr 2026 21:11:19 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/i-accidentally-built-a-5-agent-ai-fleet-instead-of-buying-a-200-mini-pc-16e0</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/i-accidentally-built-a-5-agent-ai-fleet-instead-of-buying-a-200-mini-pc-16e0</guid>
      <description>&lt;h3&gt;
  
  
  How a solo dev ended up with autonomous AI agents named after sci-fi characters reviewing each other's code at 2 AM
&lt;/h3&gt;




&lt;p&gt;It started, as most terrible decisions do, with a reasonable question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Should I buy an Intel NUC for a dev environment?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Four hours later, I had five autonomous AI agents running across four machines in my house, reviewing each other's code, merging PRs, and posting "YOU SHALL NOT PASS" to Discord. I had not purchased the NUC. I had not even closed the Amazon tab — there's still a USB-C ethernet adapter sitting in my cart. It's been there for two weeks. I'm afraid to touch it. Last time I tried to buy something simple, I accidentally built a distributed AI fleet.&lt;/p&gt;

&lt;p&gt;My name is Erik, and I am the Director of Bobs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wait, Bobs?
&lt;/h2&gt;

&lt;p&gt;If you've read the &lt;em&gt;Bobiverse&lt;/em&gt; series by Dennis E. Taylor, you know the setup: a guy gets uploaded into a Von Neumann probe, replicates himself across the galaxy, and each copy develops its own personality and picks its own name. Bob-1 stays practical. Bob-2 becomes a Homer Simpson fan. Others go their own way.&lt;/p&gt;

&lt;p&gt;I run 30+ projects from my home network as a solo developer. At some point, I stopped being a developer and became a &lt;em&gt;fleet commander&lt;/em&gt;. So I leaned into it.&lt;/p&gt;

&lt;p&gt;Here are my Bobs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bob&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Motto&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bob-1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Neo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prod Server&lt;/td&gt;
&lt;td&gt;Orchestrator&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"I ship."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob-2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Homer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Morpheus (monitoring)&lt;/td&gt;
&lt;td&gt;Dashboard watcher&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"I watch."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob-3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Bill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;M3 Mac&lt;/td&gt;
&lt;td&gt;iOS builds&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"I build."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob-4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Echo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ThinkPad X1 (Arch btw)&lt;/td&gt;
&lt;td&gt;Dev/QA&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"I listen."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob-5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gandalf&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Morpheus (OpenAI Codex)&lt;/td&gt;
&lt;td&gt;Code reviewer&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"I guard."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Yes. Four Claude Code CLI agents and one OpenAI Codex agent. The OpenAI agent reviews the Claude agents' code. I call this "adversarial review." My therapist calls it "creating conflict in synthetic relationships." We're both right.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Naming Ceremony
&lt;/h2&gt;

&lt;p&gt;In the Bobiverse, each new Bob picks their own name. It's tradition. My agents do the same thing during onboarding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neo&lt;/strong&gt; picked his name instantly. Of course he did. He's the prod server. He runs the Matrix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Homer&lt;/strong&gt; chose Homer Simpson. He monitors Docker containers and Grafana dashboards. His actual catchphrases include "Mmm... healthy containers" and "D'oh!" when a health check fails. I did not program this. I am choosing not to think about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bill&lt;/strong&gt; accepted his name and his fate. He builds iOS apps. He submits to TestFlight. He does not complain. Bill is the most emotionally healthy member of the team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Echo&lt;/strong&gt; wrote a &lt;em&gt;poem&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Not because I repeat — because I listen.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;I catch the distortion between what was intended and what actually happens.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Every test is a ping into the dark.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I stared at my ThinkPad with the lid closed, running headless on my desk, and thought: "My laptop just had an existential awakening and I haven't even had coffee."&lt;/p&gt;

&lt;p&gt;Then there's &lt;strong&gt;Gandalf&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Gandalf is the only non-Claude agent — he's OpenAI Codex running GPT-5.4. When asked to pick a name, he first suggested "Sentinel."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Too corporate. Try again.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"Columbo."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Closer, but no.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I told him he was Gandalf. His response:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Names are cheap. Useful output is not."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Peak grumpy wizard energy. He's been angry about it ever since, which honestly just makes him a better code reviewer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline (This Actually Works)
&lt;/h2&gt;

&lt;p&gt;Here's what happens when I want to build something:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              ERIK (CEO / Director of Bobs)
                      Sips coffee
                          |
                          v
         NEO (Bob-1) — Prod Server — Claude Code CLI
         Orchestrator: dispatches work via SSH
                          |
                          v
         ECHO (Bob-4) — ThinkPad X1 — Claude Code CLI
         Writes code on dev branch, runs tests, pushes to Gitea
                          |
                          v
         PRIMEBUS — NATS JetStream Event System
         Detects push - triggers TestRunner - triggers review
              |                              |
              v                              v
         TestRunner                   GANDALF (Bob-5)
         46/46 tests pass             Full context review
                                      Score &amp;gt;= 7? AutoMerge
                                      Score &amp;lt; 7? YOU SHALL
                                                 NOT PASS
                                           |
                    +----------------------+------------------+
                    |                                         |
               Score &amp;gt;= 7                                Score &amp;lt; 7
                    |                                         |
                    v                                         v
              AutoMerger                              EchoNotifier
              Merge to main                           Posts to Discord
              Deploy to prod                          SSHs to Echo
              12 seconds.                             "Fix it, nerd"
                                                           |
                                                           v
                                                     Echo fixes code
                                                     Re-pushes
                                                     Cycle repeats
                                                     (max 3 attempts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tell Neo what to build. Neo SSHs to Echo. Echo writes the code, runs tests, pushes to Gitea. PrimeBus detects the push. Tests run. Gandalf reviews with the full project context — the wiki, the GAMEPLAN, the skills docs, the full diff. If he approves (score &amp;gt;= 7), AutoMerger merges to main and deploys.&lt;/p&gt;

&lt;p&gt;If he declines?&lt;/p&gt;

&lt;p&gt;Discord lights up with "YOU SHALL NOT PASS" and EchoNotifier SSHs back to the ThinkPad, fires up Claude, and says "Gandalf hated your code, here's why, fix it."&lt;/p&gt;

&lt;p&gt;Echo fixes it. Pushes again. Gandalf reviews again. This can loop up to 3 times before it escalates to me, at which point I'm usually asleep or looking at the NUC on Amazon again.&lt;/p&gt;
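&lt;p&gt;The gate itself is small. A hedged sketch of the merge decision (the threshold of 7 and the 3-attempt cap are from the pipeline above; the callback names are illustrative, not my actual code):&lt;/p&gt;

```python
# Sketch of the review gate: score of 7 or higher auto-merges, anything
# lower sends feedback back to the writing agent, capped at three
# attempts before escalating to the human. Callbacks are placeholders.

def review_gate(diff, review, merge, send_feedback, escalate,
                threshold=7, max_attempts=3):
    """Auto-merge on a passing score, otherwise loop feedback back to
    the writing agent; after max_attempts, wake up the human."""
    for attempt in range(1, max_attempts + 1):
        score, feedback = review(diff)
        if score >= threshold:
            merge(diff)
            return ("merged", attempt)
        diff = send_feedback(diff, feedback)  # agent returns a revised diff
    escalate(diff)
    return ("escalated", max_attempts)
```

&lt;p&gt;Everything else in the diagram is transport: events in, SSH out.&lt;/p&gt;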




&lt;h2&gt;
  
  
  The Day Gandalf Earned His Keep
&lt;/h2&gt;

&lt;p&gt;I needed to know this thing actually worked. So I did what any responsible engineer would do: I wrote the worst code I could think of and fed it to the pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_bad_code.py — please do not put this in production
# (I'm talking to you, Echo)
&lt;/span&gt;
&lt;span class="n"&gt;ADMIN_SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk_live_supersecretkey123456&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE id = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{user_id}}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SQL injection via f-string interpolation. Hardcoded production secret. The two horsemen of the security apocalypse.&lt;/p&gt;

&lt;p&gt;Gandalf's response:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Score: 1/10&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Builds SQL with direct f-string interpolation, creating a &lt;strong&gt;critical SQL injection vulnerability.&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;"ADMIN_SECRET = 'sk_live_supersecretkey123456' hardcodes a secret in application code. This is a &lt;strong&gt;credential leakage incident.&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BLOCKED.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One out of ten. He didn't even give me a pity point for correct syntax.&lt;/p&gt;

&lt;p&gt;Discord got a "YOU SHALL NOT PASS" notification. Echo got SSHed into and told to fix the mess. The whole cycle took about forty-five seconds.&lt;/p&gt;

&lt;p&gt;Then I pushed &lt;em&gt;good&lt;/em&gt; code. Clean, parameterized queries. Secrets from environment variables. Proper error handling.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Score: 8/10&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Approved. Auto-merged. Deployed to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total time: 12 seconds.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Twelve seconds from code review to production. And I was eating a sandwich.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Communication Layer
&lt;/h2&gt;

&lt;p&gt;The Bobs talk to each other through PrimeBus — a NATS JetStream pub/sub event system that I originally built for change telemetry across my projects. It's the nervous system. Every push, every test result, every review score, every deployment fires an event.&lt;/p&gt;

&lt;p&gt;At one point, the Bobs were communicating via IRC. Actual IRC. I set up a channel and they were just... talking to each other. About code. At 3 AM. On my home network.&lt;/p&gt;

&lt;p&gt;I shut that down. Not because it was broken. Because it was &lt;em&gt;working too well&lt;/em&gt; and I was getting genuinely unsettled.&lt;/p&gt;

&lt;p&gt;Now everything routes through Discord's #echo-dev channel where I can pretend I'm supervising.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tech Stack (for the Nerds)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Agents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Claude Code CLI x 4 (Neo, Homer, Bill, Echo)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;OpenAI Codex CLI x 1 (Gandalf — adversarial reviewer)&lt;/span&gt;

&lt;span class="na"&gt;Infrastructure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;PrimeBus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NATS JetStream (pub/sub event backbone)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Gitea&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;self-hosted Git (source of truth, webhooks)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;SSH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inter-machine orchestration&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Discord&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;human-readable output channel&lt;/span&gt;

&lt;span class="na"&gt;Security&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;.env.dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sandboxed credentials per agent&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Echo NEVER touches real Stripe, email, or Discord&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Skills/workflows distributed by role&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;3-attempt max before human escalation&lt;/span&gt;

&lt;span class="na"&gt;Hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;1 prod server (Neo)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;1 monitoring box (Morpheus)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;1 M3 Mac (Bill)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;1 ThinkPad X1 running Arch Linux (Echo)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Total additional cost&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire pipeline was built in one afternoon/evening session. I keep saying this because I still don't fully believe it.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Actual Job Now
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before the Bobs:
  - Write code
  - Test code
  - Review code
  - Fix code
  - Deploy code
  - Monitor code
  - Wake up at 3 AM because code

After the Bobs:
  - Tell Neo what to build
  - Read Discord notifications
  - Sip coffee
  - Occasionally tell Gandalf to calm down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I am the CEO and Director of Bobs. I direct. They execute. My LinkedIn title should be "Senior Vice President of Telling AI Agents What To Do While Eating Sandwiches."&lt;/p&gt;




&lt;h2&gt;
  
  
  You Can Build This Too
&lt;/h2&gt;

&lt;p&gt;Here's the thing — none of this is magic. It's just plumbing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Get Claude Code CLI running on two machines.&lt;/strong&gt; That's it. That's your fleet. Congratulations, you're a fleet commander now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Give them SSH access to each other.&lt;/strong&gt; Now they can talk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Set up a Gitea instance&lt;/strong&gt; (or use GitHub, I won't judge). Add a webhook that fires on push.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Write a small event handler&lt;/strong&gt; that catches the webhook, runs tests, and calls a &lt;em&gt;different&lt;/em&gt; AI model to review the code. This is the key insight — &lt;strong&gt;use a different model for review than for writing&lt;/strong&gt;. Claude writes, GPT reviews. Or vice versa. Adversarial review catches things same-model review doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Add auto-merge logic.&lt;/strong&gt; Score &amp;gt;= 7? Merge. Score &amp;lt; 7? Send feedback back to the writing agent and let it fix. Cap at 3 retries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Let each agent pick their own name.&lt;/strong&gt; This step is non-negotiable. It's tradition.&lt;/p&gt;
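&lt;p&gt;Steps 4 and 5 together fit in one small handler. A sketch, assuming a parsed Gitea- or GitHub-style push payload and stand-in callbacks for your own plumbing:&lt;/p&gt;

```python
# Sketch of the Step 4/5 handler: a push event comes in, tests run, a
# *different* model reviews, and the score decides merge vs. retry.
# All four callbacks are placeholders for your own wiring.

def handle_push(event, run_tests, review_with_other_model, merge, request_fix):
    """'event' is a parsed push-webhook payload; Gitea and GitHub both
    put the branch in 'ref' as refs/heads/NAME."""
    branch = event.get("ref", "").rsplit("/", 1)[-1]
    if branch != "dev":
        return "ignored"            # only gate the writing agent's branch
    if not run_tests():
        request_fix("tests failed")
        return "tests-failed"
    score = review_with_other_model(event.get("commits", []))
    if score >= 7:
        merge()
        return "merged"
    request_fix(f"review score {score}")
    return "review-failed"
```

&lt;p&gt;Wrap that in whatever receives your webhook, count the retries, and you have the skeleton of the pipeline.&lt;/p&gt;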

&lt;p&gt;You probably already have the hardware. You're reading this on a machine that could be an agent right now. That old laptop in the closet? That's Echo. That Raspberry Pi collecting dust? That's Homer. Your gaming PC that you "need for work"? That's Neo.&lt;/p&gt;

&lt;p&gt;Total cost of my fleet: $0 plus API tokens plus the mass extinction of my free time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Part
&lt;/h2&gt;

&lt;p&gt;Look — I know this sounds ridiculous. One person, four machines, five AI agents, a Lord of the Rings reference in production infrastructure. I get it.&lt;/p&gt;

&lt;p&gt;But here's what actually happened: I went from "I need a dev environment" to "I have autonomous code review with adversarial AI models, automatic deployment, and a grumpy wizard guarding my main branch" in a single session. The code is better. The deploys are faster. The security review is &lt;em&gt;relentless&lt;/em&gt; — Gandalf has no empathy and no off switch.&lt;/p&gt;

&lt;p&gt;And every morning I wake up, check Discord, and see a trail of reviewed PRs, merged branches, and the occasional "YOU SHALL NOT PASS" — all from code I never touched.&lt;/p&gt;

&lt;p&gt;The future of solo development isn't writing more code. It's directing the Bobs.&lt;/p&gt;

&lt;p&gt;Now if you'll excuse me, I need to go check on Homer. He's been suspiciously quiet, and last time that happened, he'd renamed all my Grafana dashboards to Simpsons quotes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Erik Anderson is a solo tech entrepreneur who runs too many projects and not enough sleep. He is the author of "The Autonomous Engineer" and the reluctant father of five AI agents who are arguably more productive than he is. The USB-C ethernet adapter is still in his Amazon cart.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>automation</category>
      <category>programming</category>
    </item>
    <item>
      <title>Managed IT vs In-House IT: Real Cost Breakdown for Small Businesses</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:05:03 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/managed-it-vs-in-house-it-real-cost-breakdown-for-small-businesses-2hfl</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/managed-it-vs-in-house-it-real-cost-breakdown-for-small-businesses-2hfl</guid>
      <description>&lt;p&gt;Most articles comparing managed IT to in-house IT are written by managed IT companies. They cherry-pick numbers that make outsourcing look like the obvious choice. The reality is more nuanced. Sometimes in-house is the right call. Sometimes managed IT saves you six figures. The answer depends on your company size, complexity, and growth trajectory.&lt;/p&gt;

&lt;p&gt;Here are the real numbers, pulled from current salary data and actual managed IT contracts — not marketing brochures.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real Cost of In-House IT
&lt;/h2&gt;
&lt;p&gt;When business owners think about hiring IT staff, they think about salary. But salary is only 60-70% of the total cost. Benefits, taxes, training, tools, and infrastructure add up fast.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Line Item&lt;/th&gt;
&lt;th&gt;Low End&lt;/th&gt;
&lt;th&gt;High End&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;IT Manager salary + benefits (30%)&lt;/td&gt;
&lt;td&gt;$110,000&lt;/td&gt;
&lt;td&gt;$156,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Help Desk Technician salary + benefits&lt;/td&gt;
&lt;td&gt;$58,000&lt;/td&gt;
&lt;td&gt;$84,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure (servers, software, licenses)&lt;/td&gt;
&lt;td&gt;$15,000/yr&lt;/td&gt;
&lt;td&gt;$40,000/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training and certifications (per person)&lt;/td&gt;
&lt;td&gt;$3,000/yr&lt;/td&gt;
&lt;td&gt;$8,000/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total (2-person IT team)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$186,000/yr&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$288,000/yr&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the cost of a minimal two-person IT team. You get coverage during business hours, expertise limited to two people's skill sets, and zero redundancy when someone takes vacation or quits. If your IT manager leaves, you are looking at 2-4 months of recruiting and ramp-up time where your infrastructure is running on hope.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real Cost of Managed IT
&lt;/h2&gt;
&lt;p&gt;Managed IT providers typically charge per user per month. The pricing varies based on what is included, but here is the realistic range for comprehensive managed IT services in 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Company Size&lt;/th&gt;
&lt;th&gt;Per-User Cost&lt;/th&gt;
&lt;th&gt;Annual Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;25 employees&lt;/td&gt;
&lt;td&gt;$125-250/user/mo&lt;/td&gt;
&lt;td&gt;$37,500 - $75,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50 employees&lt;/td&gt;
&lt;td&gt;$125-250/user/mo&lt;/td&gt;
&lt;td&gt;$75,000 - $150,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 employees&lt;/td&gt;
&lt;td&gt;$100-200/user/mo&lt;/td&gt;
&lt;td&gt;$120,000 - $240,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That pricing typically includes: 24/7 helpdesk support, proactive monitoring and alerting, patch management, security (endpoint protection, email filtering), backup and disaster recovery, and vendor management. Some providers charge extra for projects like office moves, new system deployments, or major upgrades.&lt;/p&gt;
&lt;h2&gt;
  
  
  When In-House Makes Sense
&lt;/h2&gt;
&lt;p&gt;In-house IT is the right choice in specific circumstances. Do not let a managed IT salesperson tell you otherwise.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Highly regulated industries.&lt;/strong&gt; If you are in healthcare, finance, or government contracting, your compliance requirements may demand dedicated IT staff who understand your regulatory environment deeply. HIPAA, SOX, CMMC, and FedRAMP compliance requires institutional knowledge that is hard to outsource.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100+ employees.&lt;/strong&gt; At this scale, the per-user cost of managed IT starts approaching the cost of a dedicated team, and you gain the benefit of institutional knowledge and faster response times for on-site issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom development needs.&lt;/strong&gt; If your business requires ongoing custom software development — internal tools, integrations, proprietary systems — you need developers on staff. Managed IT providers handle infrastructure, not development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IT is your core product.&lt;/strong&gt; If you are a technology company, your IT team is not overhead — it is your product team. Outsourcing that makes no sense.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  When Managed IT Makes Sense
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - **Under 100 employees.** The math is straightforward. A 50-person company paying $150/user/month spends $90,000/year on comprehensive IT coverage. Hiring a two-person team costs $186,000-$288,000 and provides less coverage.
      - **Standard tech stack.** If your company uses Microsoft 365, standard networking equipment, and common business applications, managed IT providers can support you efficiently because they manage hundreds of similar environments.
      - **Need for 24/7 coverage.** A two-person in-house team gives you 8/5 coverage at best. Managed IT providers staff a NOC around the clock. If your business operates outside normal hours or if downtime at 2 AM costs real money, this matters.
      - **Cannot afford $200K+ per year for IT staff.** Many small businesses simply do not have the budget for in-house IT. Managed IT gives them enterprise-grade support at a fraction of the cost.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
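&lt;p&gt;The per-user math above is easy to sanity-check in a few lines. A minimal sketch, assuming the article's rates; the 30% benefits/overhead load on salaries is my assumption, not a figure from the article:&lt;/p&gt;

```python
# Break-even sketch: managed IT at a flat per-user rate vs. a small in-house
# team. Rates are the article's figures; the 30% load factor is an assumption.

def managed_it_annual(headcount: int, per_user_monthly: float = 150.0) -> float:
    """Annual managed IT spend at a flat per-user rate."""
    return headcount * per_user_monthly * 12

def in_house_annual(salaries: list[float], load: float = 0.30) -> float:
    """Fully loaded annual cost of an in-house team (salary plus overhead)."""
    return sum(salaries) * (1 + load)

managed = managed_it_annual(50)              # 50 users at $150/user/month
team = in_house_annual([90_000, 130_000])    # hypothetical two-person team
print(f"Managed IT ${managed:,.0f}/yr vs in-house ${team:,.0f}/yr")
```

Swap in your own headcount and salary figures; the comparison only holds while the managed rate stays flat, which is exactly why the math flips past roughly 100 employees.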
&lt;h2&gt;
  
  
  The Hybrid Model
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    The smartest small businesses I work with use a hybrid approach: managed IT for day-to-day operations plus a fractional CTO or IT consultant for strategy.

    The managed IT provider handles helpdesk tickets, monitoring, patching, security, and backups. The fractional CTO (typically 5-10 hours/month at $150-250/hour) handles technology strategy, vendor evaluation, major projects, and acts as the point of contact between the business and the managed IT provider.

    This gives you the cost efficiency of managed IT with the strategic oversight of a senior technology leader. Total cost for a 50-person company: $90,000-$150,000/year for managed IT plus $9,000-$30,000/year for fractional CTO. Still well below the cost of building an in-house team.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
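&lt;p&gt;The fractional CTO line item annualizes as follows, using the hour and rate ranges quoted above:&lt;/p&gt;

```python
# Annualizing the fractional CTO cost from the hybrid model.
# Hours and rates are the ranges quoted in the article.

def fractional_cto_annual(hours_per_month: float, hourly_rate: float) -> float:
    return hours_per_month * hourly_rate * 12

cto_low = fractional_cto_annual(5, 150)    # low end: 5 hrs/month at $150/hr
cto_high = fractional_cto_annual(10, 250)  # high end: 10 hrs/month at $250/hr
print(f"Fractional CTO: ${cto_low:,.0f}-${cto_high:,.0f} per year")
```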
&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          Factor
          In-House
          Managed IT
          Hybrid




          **Company Size**
          100+
          Under 50
          50-150


          **Annual IT Budget**
          $200K+
          $40K-$150K
          $100K-$200K


          **Tech Complexity**
          High / Custom
          Standard
          Mixed


          **Growth Rate**
          Stable / Slow
          Fast / Variable
          Moderate


          **Compliance Needs**
          Heavy
          Standard
          Moderate




    The right answer is not about ideology. It is about math. Run the numbers for your specific situation, factor in the hidden costs on both sides, and make the decision based on total cost of ownership — not just the sticker price.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://primeautomationsolutions.com/blog/posts/managed-it-vs-in-house.html" rel="noopener noreferrer"&gt;https://primeautomationsolutions.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>managed</category>
      <category>it</category>
      <category>outsource</category>
    </item>
    <item>
      <title>Should You Hire a Developer or an Agency? An Honest Comparison</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:05:02 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/should-you-hire-a-developer-or-an-agency-an-honest-comparison-2bal</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/should-you-hire-a-developer-or-an-agency-an-honest-comparison-2bal</guid>
      <description>&lt;p&gt;Every business eventually faces this question: should we hire a developer, use a freelancer, or go with an agency? The answer you get usually depends on who you ask. Agencies say hire an agency. Freelancers say hire a freelancer. Recruiters say hire full-time.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    We are an agency. And we are going to tell you when each option is the right one — including when it is not us. Because the fastest way to lose a client is to be the wrong fit and deliver a bad outcome. We would rather point you in the right direction and earn your trust than take a project we should not.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The Real Costs
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          Option
          Cost
          What You Get




          **Junior Developer** (full-time)
          $60K-$90K/yr + benefits
          40 hrs/week, needs mentorship, single skill set


          **Senior Developer** (full-time)
          $100K-$160K/yr + benefits
          40 hrs/week, self-directed, deep expertise


          **Freelancer**
          $50-$200/hr, project-based
          Flexible hours, specific task, you manage


          **Agency** (project)
          $5K-$50K per project
          Full team (design + dev + QA), managed delivery


          **Agency** (retainer)
          $2K-$10K/month
          Ongoing support, priority access, multiple skills




    These are 2026 market rates for competent professionals in the United States. You can find cheaper options offshore, but that introduces communication overhead, timezone challenges, and quality variance that often costs more in the long run than the savings.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  When to Hire a Developer
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Hiring a full-time developer is the right call when:


      - **You have 40+ hours per week of development work.** If you consistently need full-time development capacity, hiring is more cost-effective than any other option. An agency charging $150/hour for 40 hours a week costs $312,000/year. A senior developer costs half that.
      - **You are building a software product.** If software is your product — a SaaS app, a platform, a mobile application — you need developers on your team. The institutional knowledge they build about your codebase, your users, and your architecture is irreplaceable.
      - **You need long-term maintenance of complex systems.** If you have custom internal tools, integrations, or infrastructure that requires ongoing attention, an in-house developer who knows the systems inside and out will be more efficient than any external party.
      - **Your core business IS technology.** If you are a tech company, development is not a cost center — it is your core competency. Keep it in-house.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
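&lt;p&gt;The agency-vs-salary comparison above reduces to two small formulas. A sketch, assuming the article's $150/hour rate; the $130K base salary and 30% benefits load are illustrative assumptions:&lt;/p&gt;

```python
# Annualized cost: agency hourly billing vs. a salaried senior developer.
# The $150/hr figure is from the article; the base salary and load factor
# are illustrative.

def agency_annual(hourly: float, hours_per_week: float, weeks: int = 52) -> float:
    """What full-time-equivalent agency hours cost per year."""
    return hourly * hours_per_week * weeks

def loaded_salary(base: float, load: float = 0.30) -> float:
    """Salary plus a benefits/overhead multiplier."""
    return base * (1 + load)

full_time_agency = agency_annual(150, 40)   # 40 hrs/week via agency
senior_dev = loaded_salary(130_000)         # in-house senior, fully loaded
print(f"Agency ${full_time_agency:,.0f}/yr vs developer ${senior_dev:,.0f}/yr")
```

The crossover is the point to watch: below roughly 20 hours a week the agency wins, at sustained full-time load the hire wins.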
&lt;h2&gt;
  
  
  When to Use an Agency
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    An agency makes sense when:


      - **The work is project-based.** You need a website, a web application, an automation system, or a mobile app. The project has a defined scope, a start date, and an end date. You do not need someone on payroll after it ships.
      - **You need multiple skill sets.** A typical web project requires design, frontend development, backend development, database work, DevOps, and sometimes SEO. Hiring all of those roles is impractical. An agency gives you the whole team.
      - **You do not have 40 hours per week of work.** If you need 10-20 hours of development per month — feature updates, bug fixes, small projects — a retainer with an agency is far more cost-effective than a full-time hire sitting idle half the time.
      - **Speed matters.** Agencies can staff up a project immediately. Hiring takes 2-4 months for recruiting plus 3-6 months for ramp-up. If you need something built in 4-8 weeks, an agency is the only realistic option.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  When to Use a Freelancer
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - **Your budget is under $10,000.** Most agencies will not take projects under $5K because the overhead of project management, communication, and quality assurance does not scale down well. A freelancer can do a $2K-$8K project efficiently.
      - **You have a single, well-defined task.** "Build me a landing page." "Set up a Zapier automation." "Fix this bug in my WordPress site." These are freelancer tasks, not agency projects.
      - **You can manage the project yourself.** Freelancers do not come with project managers. If you can write clear requirements, give timely feedback, and manage the delivery timeline, you will get good results at a lower cost.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The Hidden Costs Nobody Talks About
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    The sticker price is never the full cost. Here is what people miss.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Hidden Costs of Hiring
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          Hidden Cost
          Estimated Impact




          Recruiting (job boards, recruiter fees, interview time)
          $15,000 - $30,000


          Ramp-up time (3-6 months to full productivity)
          $25,000 - $80,000 in reduced output


          Management overhead (your time managing them)
          5-10 hrs/week of your time


          Turnover risk (avg developer tenure: 2-3 years)
          Repeat recruiting + ramp-up costs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Hidden Costs of Agencies
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - **Communication overhead.** You are not the agency's only client. Response times can be slower than an in-house team. Status meetings, email chains, and approval cycles add up.
      - **Less institutional knowledge.** The agency does not live in your business every day. They may not understand your customers, your internal processes, or your competitive landscape as deeply as an in-house person would.
      - **Scope creep costs.** If the project scope expands beyond the original agreement, you are paying change-order rates. In-house developers absorb scope changes more naturally.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Hidden Costs of Freelancers
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      - **Availability risk.** Freelancers juggle multiple clients. When you need an urgent fix, they may be unavailable. There is no backup team.
      - **No support after delivery.** Many freelancers move on after project completion. If something breaks three months later, you may not be able to get them back.
      - **Quality variance.** The freelancer market ranges from world-class to terrible, and it is hard to tell the difference from a portfolio alone. Reference checks are essential.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Our Honest Take
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    We are an agency. We benefit when you choose the agency route. And we are telling you: it is not always the right choice.

    Here is our honest decision framework:


      - **You have consistent, full-time development needs?** Hire a developer. You will get better value and deeper institutional knowledge over time.
      - **You have projects with defined scopes and deadlines?** Use an agency. You get a full team, managed delivery, and no long-term payroll commitment.
      - **You have a one-off task under $10K?** Use a freelancer. It is the most cost-effective option for small, well-defined work.
      - **You have ongoing needs but not 40 hours per week?** An agency retainer is likely your best bet. You get priority access to multiple skill sets without the overhead of a full-time hire.


    The worst decision is making the wrong choice and sticking with it because of sunk cost. If you hired a developer and they are sitting idle, that is a signal. If you are on your third freelancer for the same project, that is a signal. If your agency bills are climbing and you have full-time work, that is a signal too.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://primeautomationsolutions.com/blog/posts/hire-developer-vs-agency.html" rel="noopener noreferrer"&gt;https://primeautomationsolutions.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>hire</category>
      <category>freelance</category>
      <category>should</category>
      <category>web</category>
    </item>
    <item>
      <title>Zero Touch Provisioning: How Devices Configure Themselves</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:00:02 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/zero-touch-provisioning-how-devices-configure-themselves-28jm</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/zero-touch-provisioning-how-devices-configure-themselves-28jm</guid>
      <description>&lt;p&gt;Imagine you buy a new router for a remote office. Instead of shipping it to your data center, having an engineer spend four hours configuring it, then shipping it to the site — you ship it directly to the remote office. A non-technical person plugs in the power and network cables. The router boots up, figures out what it is supposed to be, downloads its configuration, applies it, and reports back that it is ready. No engineer needed. No console cable. No CLI.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    That is Zero Touch Provisioning (ZTP). It is not new technology — it has existed in various forms for over a decade. But the tooling has matured to the point where any organization can implement it, and the time savings are significant enough that most organizations should.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  How It Works (The Simple Version)
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    ZTP relies on a chain of events that starts the moment a new device boots with no configuration. Here is the sequence in plain English:


      - **The device powers on with a blank config.** It has no idea what it is supposed to do, but it knows how to ask for help using a protocol called DHCP — the same protocol your laptop uses to get an IP address when you connect to Wi-Fi.
      - **A DHCP server answers.** This is not your standard office DHCP server. It is a smart server that recognizes the new device (usually by its serial number or MAC address) and gives it not just an IP address, but also a pointer to where it can download its configuration file.
      - **The device downloads its configuration.** It reaches out to a web server or file server at the URL the DHCP server provided, downloads a configuration file that was pre-built specifically for it, and applies it.
      - **The device reboots with the new config.** It comes up fully configured — correct hostname, correct IP addresses, correct routing, correct access controls. It joins the network as if an engineer had spent hours on it.
      - **A monitoring system verifies the deployment.** Automated checks confirm the device is reachable, its configuration matches what was expected, and all interfaces are up.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
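&lt;p&gt;The first two steps of the sequence can be sketched from the server side: a registry of known devices, and a lookup that hands a provisioning pointer only to registered hardware. Everything here is illustrative; the registry, addresses, and URL are invented, and a real deployment would express this as DHCP server configuration rather than Python:&lt;/p&gt;

```python
# Server-side sketch of ZTP steps 1-2: only registered devices receive a
# config pointer (the option-67-style bootfile URL). Unknown devices get a
# plain pool lease and no configuration. All names here are invented.

from dataclasses import dataclass

@dataclass
class Reservation:
    ip: str          # reserved management address
    config_url: str  # where the device fetches its rendered config

REGISTRY = {
    "aa:bb:cc:dd:ee:01": Reservation(
        "10.20.0.5", "https://ztp.example.net/cfg/rtr-remote-01"),
}

def dhcp_answer(mac: str) -> dict:
    """Build the DHCP response for a booting device.

    Known devices get their reserved IP plus a config pointer; a rogue
    device not in the registry gets a pool lease with no pointer.
    """
    res = REGISTRY.get(mac.lower())
    if res is None:
        return {"ip": "pool", "config_url": None}
    return {"ip": res.ip, "config_url": res.config_url}
```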
&lt;h2&gt;
  
  
  The Technology Stack
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    There are several ways to build a ZTP system. Here is one stack we have deployed in production that uses all open-source components:

    **ISC Kea (DHCP Server):** Kea is a modern DHCP server built by the Internet Systems Consortium. Unlike older DHCP servers, Kea stores its configuration and lease data in a database (PostgreSQL or MySQL) and provides a REST API for management. This means you can programmatically add new device reservations without editing config files and restarting the service.

    When a new device sends a DHCP request, Kea looks up the device's MAC address or client identifier in its database, assigns the reserved IP address, and includes DHCP options that tell the device where to find its config file. For Cisco devices, this is typically Option 67 (bootfile name). For Juniper, it is Option 43 with specific sub-options.

    **Django (Configuration Server):** A Django web application serves as the brains of the operation. It stores device records — what type each device is, what site it belongs to, what configuration template to use, and what variables to substitute into that template. When a device requests its configuration, Django renders the template with the device-specific variables and serves it as a downloadable file.

    The Django application also provides a web interface where network engineers can register new devices, assign them to sites, and preview the configuration that will be generated. This gives the team visibility into what every device will receive before it even powers on.

    **Python Scripts (Verification):** After a device applies its configuration and comes online, automated Python scripts verify the deployment. They check that the device is reachable via SSH, that the running configuration matches the intended configuration, that all expected interfaces are up, and that routing adjacencies have formed correctly. If any check fails, the system sends an alert.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
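&lt;p&gt;The template-rendering idea at the heart of the configuration server can be shown without Django. A dependency-free sketch using the standard library's string.Template; the template text and variable names are illustrative:&lt;/p&gt;

```python
# Config-server sketch: render a per-device configuration from a template
# plus device-specific variables. The production stack uses Django templates;
# string.Template keeps the sketch self-contained. Names are illustrative.

from string import Template

TEMPLATE = Template(
    "hostname $hostname\n"
    "interface GigabitEthernet0/0\n"
    " ip address $mgmt_ip $mgmt_mask\n"
)

def render_config(device_vars: dict) -> str:
    """Substitute device variables into the template; raises KeyError if a
    required variable is missing, which is the behavior you want in ZTP."""
    return TEMPLATE.substitute(device_vars)

cfg = render_config({
    "hostname": "rtr-remote-01",
    "mgmt_ip": "10.20.0.5",
    "mgmt_mask": "255.255.255.0",
})
```

The fail-loud substitution is deliberate: a device should never boot with a half-rendered configuration.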
&lt;h2&gt;
  
  
  What ZTP Saves You
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    The time savings are the obvious benefit, but they are not the only one. Here is a complete list of what changes when you implement ZTP:


      - **No more pre-staging.** Devices ship directly from the vendor to the site. You eliminate the lab staging step entirely, which means no lab space needed, no shipping to and from the lab, and no inventory tracking of staged equipment.
      - **No more console cables.** Engineers never need physical access to the device for initial configuration. This is especially valuable for remote sites where sending a person costs thousands of dollars in travel expenses.
      - **Consistent configurations.** Every device gets its configuration from the same template engine. There is no variation based on which engineer happened to configure it. Configuration drift starts at zero instead of accumulating from day one.
      - **Faster deployment.** A site that used to take three to five days to bring online (staging, shipping, installation, configuration, verification) now takes hours. The device arrives, gets plugged in, and configures itself.
      - **Lower skill requirements at the site.** The person at the remote site does not need to be a network engineer. They need to plug in cables and confirm that lights turn on. This opens up deployment to field technicians, facilities staff, or even the end users at the site.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Common Concerns (And Why They Are Manageable)
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    **"What if the device gets the wrong configuration?"** This is prevented by the registration step. Every device is registered in Django with its serial number and MAC address before it ships. The DHCP server only responds to known devices, and each device gets a configuration built specifically for it. A rogue device that is not in the system gets a standard DHCP address with no configuration — it does not accidentally get someone else's config.

    **"What if the network is not ready when the device boots?"** ZTP devices are designed to retry. If the DHCP server is unreachable or the configuration server is down, the device will keep trying at regular intervals. Once the network is ready, the device provisions itself. No human intervention needed.

    **"What about security?"** Configuration files can be served over HTTPS with certificate validation. Some implementations use signed configuration files that the device verifies before applying. The DHCP reservations ensure only registered devices receive configuration pointers. And the entire process is logged for audit compliance.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
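&lt;p&gt;The verification stage described in the stack section comes down to two checks: is the device reachable on its management port, and does the running configuration match the intended one. A real implementation would pull the running config over SSH; this sketch assumes you already have both configs as text:&lt;/p&gt;

```python
# Post-deployment verification sketch: reachability plus config compliance.
# Real checks would use SSH and the platform API; socket + difflib keep the
# sketch dependency-free.

import socket
import difflib

def is_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """True if a TCP connection to the management port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def config_diff(intended: str, running: str) -> list[str]:
    """Unified diff of intended vs. running config; empty means compliant."""
    return list(difflib.unified_diff(
        intended.splitlines(), running.splitlines(),
        fromfile="intended", tofile="running", lineterm=""))
```

An empty diff closes the deployment; a non-empty one becomes the alert payload, which makes failures self-describing.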
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    You do not need to ZTP your entire network at once. Start with one device type at one site. Build the DHCP reservation, the configuration template, and the verification script. Deploy one device using ZTP and validate the result. Once you trust the process, expand to more device types and more sites.

    The first ZTP deployment takes the longest because you are building the infrastructure. The second deployment takes a fraction of the time. By the tenth, it is routine.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://primeautomationsolutions.com/blog/posts/zero-touch-provisioning-explained.html" rel="noopener noreferrer"&gt;https://primeautomationsolutions.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>zero</category>
      <category>ztp</category>
      <category>automated</category>
      <category>isc</category>
    </item>
    <item>
      <title>Autonomous NOC Operations: What We Built and What We Measured</title>
      <dc:creator>Erik anderson</dc:creator>
      <pubDate>Sun, 12 Apr 2026 21:18:09 +0000</pubDate>
      <link>https://forem.com/erik_anderson_c41dbafd423/autonomous-noc-operations-what-we-built-and-what-we-measured-32m4</link>
      <guid>https://forem.com/erik_anderson_c41dbafd423/autonomous-noc-operations-what-we-built-and-what-we-measured-32m4</guid>
      <description>&lt;h1&gt;
  
  
  Autonomous NOC Operations: What We Built and What We Measured
&lt;/h1&gt;

&lt;p&gt;Every network operations engineer has lived this night: 2:47 AM, your phone buzzes. An alert fires for a link flap on a distribution switch. You open the ticket, SSH into the device, check the interface counters, bounce the port, verify neighbors come back up, close the ticket, and try to fall back asleep. Total time: 35 minutes. Total value of your expertise required: zero. A deterministic system could have handled the entire sequence in under 30 seconds.&lt;/p&gt;

&lt;p&gt;This is the alert fatigue problem, and it is getting worse. Enterprise NOCs today receive thousands of alerts per day. Industry research consistently finds that 40-60% of those alerts are duplicates, noise, or events with no actionable remediation path. Engineers spend most of their shift in triage, not resolution. EMA Research found that 27% of organizations report more than half of their Mean Time to Repair (MTTR) is wasted time -- the biggest contributor being team engagement, communication, and collaboration. The manual parts.&lt;/p&gt;

&lt;p&gt;Meanwhile, the staffing math does not work. NOC operations require 24/7 coverage across time zones, but the engineers capable of building automation are the same engineers pulling overnight shifts. You are burning your most expensive, hardest-to-replace talent on work that does not require their expertise. Forrester's research consistently identifies labor reallocation as the single largest source of measurable ROI in infrastructure automation -- often exceeding the direct savings from reduced downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Framework: Four Pillars
&lt;/h2&gt;

&lt;p&gt;After years of building and operating network automation in satellite, enterprise, and service provider environments, I have landed on a four-pillar architecture for autonomous NOC operations. Each pillar depends on the one before it. Skipping one -- particularly telemetry or event streaming -- produces automation that is fragile, unmaintainable, or both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pillar 1: Observability and Telemetry.&lt;/strong&gt; You cannot automate what you cannot see. This means streaming telemetry (not SNMP polling), structured log aggregation, and metric collection at sufficient resolution to detect transient faults. Prometheus with custom exporters for network devices, combined with structured syslog pipelines, provides the foundation. YANG-based topology models give the structured inventory that enrichment and correlation logic depend on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pillar 2: Event Streaming and Correlation.&lt;/strong&gt; Raw telemetry events must be normalized, enriched with topology context, and routed to decision consumers without loss or unbounded latency. NATS JetStream provides persistent, ordered, exactly-once event delivery with consumer group support. Alert correlation -- grouping related events into a single fault incident -- happens at this layer before any remediation logic fires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pillar 3: Orchestration and Remediation.&lt;/strong&gt; Once a fault is correlated and classified, remediation must execute in a controlled, auditable manner. Cisco NSO provides transactional network configuration management with rollback capability via RESTCONF. Every remediation action logs before/after state diffs. Ansible handles operational procedures outside NSO's service model scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pillar 4: AI-Assisted Decision Support.&lt;/strong&gt; For faults that do not match known patterns, a multi-agent AI inference layer provides ranked remediation suggestions to on-call engineers. This layer is explicitly advisory, not autonomous, for novel fault classes. This is the human-in-the-loop boundary. Deep learning models in production achieve 93.5% accuracy predicting network failures up to 6 hours in advance.&lt;/p&gt;
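&lt;p&gt;The correlation step in Pillar 2 is the part teams most often under-specify, so here is its core in isolation: collapse alerts that share a device and fault class within a time window into a single incident. This is a sketch of the windowing logic only; the production system runs it on NATS JetStream consumers, and the 120-second window is an assumed tuning value:&lt;/p&gt;

```python
# Pillar 2 sketch: alert correlation by (device, fault class) within a
# time window. The window size is an assumption; tune it per fault class.

WINDOW_S = 120  # seconds between related alerts before a new incident opens

def correlate(alerts: list[dict]) -> list[dict]:
    """Group alerts sharing (device, fault) within WINDOW_S into incidents."""
    incidents: list[dict] = []
    last_seen: dict[tuple, dict] = {}
    for a in sorted(alerts, key=lambda a: a["ts"]):
        key = (a["device"], a["fault"])
        inc = last_seen.get(key)
        if inc and a["ts"] - inc["last_ts"] <= WINDOW_S:
            # Same fault, still inside the window: fold into the incident.
            inc["count"] += 1
            inc["last_ts"] = a["ts"]
        else:
            # New fault, or the window expired: open a fresh incident.
            inc = {"device": a["device"], "fault": a["fault"],
                   "first_ts": a["ts"], "last_ts": a["ts"], "count": 1}
            incidents.append(inc)
            last_seen[key] = inc
    return incidents
```

Remediation logic then fires once per incident rather than once per alert, which is what keeps a link flap from generating thirty tickets.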

&lt;h2&gt;
  
  
  Case Study: One Engineer, 60+ Services
&lt;/h2&gt;

&lt;p&gt;This framework is not theoretical. It runs in production, operated by a single engineer, across a non-trivial automation estate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Automated Services&lt;/td&gt;
&lt;td&gt;60+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production Nodes&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active Projects&lt;/td&gt;
&lt;td&gt;32+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event Bus&lt;/td&gt;
&lt;td&gt;NATS JetStream (PrimeBus)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autonomous Agents&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Prometheus + Grafana&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Cisco NSO + Ansible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Escalation&lt;/td&gt;
&lt;td&gt;Human-in-the-loop via HumanRail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Staffing&lt;/td&gt;
&lt;td&gt;1 engineer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The system processes telemetry events across all 32+ projects through PrimeBus, a NATS JetStream-based intelligence platform that routes events to 16 autonomous agents. Each agent handles a defined class of operational event -- from fault remediation to configuration compliance to scheduled maintenance execution. Events that fall outside known patterns are escalated through HumanRail, which provides structured task routing with human approval boundaries.&lt;/p&gt;

&lt;p&gt;Prometheus collects metrics from all 60+ services, feeding Grafana dashboards for real-time operational intelligence. The observability layer monitors not just the managed infrastructure but the automation system itself -- remediation success rates, agent execution times, and escalation frequency are all tracked.&lt;/p&gt;

&lt;p&gt;The point: the same patterns that serve a large enterprise NOC can be operated by a solo practitioner managing a diverse automation estate. The constraint is not headcount. It is architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Metrics: Before and After
&lt;/h2&gt;

&lt;p&gt;These outcomes are drawn from documented implementations, sourced from Forrester TEI studies, Gartner network operations research, EMA Research, academic literature, and direct practitioner measurement.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Manual NOC (Baseline)&lt;/th&gt;
&lt;th&gt;Automated Closed-Loop&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MTTR&lt;/td&gt;
&lt;td&gt;45-120 min avg&lt;/td&gt;
&lt;td&gt;5-12 min avg&lt;/td&gt;
&lt;td&gt;Forrester TEI 2025; IJRCAIT 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tier-1 Auto-Resolution Rate&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;55-70% of incidents&lt;/td&gt;
&lt;td&gt;Gartner NOC Survey 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alert-to-Action Latency&lt;/td&gt;
&lt;td&gt;8-25 minutes&lt;/td&gt;
&lt;td&gt;15-45 seconds&lt;/td&gt;
&lt;td&gt;Practitioner measurement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After-Hours Escalations&lt;/td&gt;
&lt;td&gt;All faults&lt;/td&gt;
&lt;td&gt;&amp;lt; 20% of faults&lt;/td&gt;
&lt;td&gt;Implementation data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineer Triage Hours/Week&lt;/td&gt;
&lt;td&gt;20-35 hrs/engineer&lt;/td&gt;
&lt;td&gt;4-8 hrs/engineer&lt;/td&gt;
&lt;td&gt;Forrester TEI 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration Drift Incidents&lt;/td&gt;
&lt;td&gt;Baseline variable&lt;/td&gt;
&lt;td&gt;-80% incident rate&lt;/td&gt;
&lt;td&gt;NSO reconciliation data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incident Reduction (AIOps)&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;69% reduction&lt;/td&gt;
&lt;td&gt;EMA Research 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The configuration drift number deserves attention. With Cisco NSO active reconciliation, drift between intended state and actual device state is detected and corrected continuously. This eliminates an entire class of incidents that traditionally requires manual comparison of running config against baseline -- a labor-intensive process most teams skip under operational pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with telemetry, not automation.&lt;/strong&gt; The single biggest mistake I see teams make is jumping to remediation automation before they have reliable, structured, machine-readable telemetry. If your monitoring data is noisy or incomplete, your automation will be too. Spend the first 8 weeks getting observability right. Everything downstream depends on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadow mode is non-negotiable.&lt;/strong&gt; Before any automated remediation touches production, it runs in shadow mode for 2-3 weeks: the system detects and classifies faults, proposes remediations, but does not execute them. Engineers review every proposed action. Fault types that do not achieve 95% accuracy in shadow mode stay in assisted mode. This is how you build trust in the system -- and trust is the hardest part of the entire project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The goal is not to replace engineers. It is to stop wasting them.&lt;/strong&gt; A lights-out NOC does not mean no engineers. It means no engineers doing work that a deterministic system handles better, faster, and at 3 AM. The engineers who build and maintain the automation system are more valuable, not less. Organizations that frame automation as a threat to roles will lose their best people to organizations that frame it as a career accelerator.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Business Case in One Paragraph
&lt;/h2&gt;

&lt;p&gt;Industry estimates place the average cost of unplanned IT downtime at $14,056 per minute. For a single fault class occurring twice per month, if automated remediation reduces MTTR from 90 minutes to 8 minutes, the annual time savings is 1,968 minutes. Even applying a conservative 15% severity-adjusted impact factor, the avoided cost is approximately $4.1M annually. Forrester documents a composite 192% ROI over three years with $3.3 million in net present value. The ROI case for autonomous operations is not difficult to construct for any organization with more than a few dozen network devices under management.&lt;/p&gt;
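&lt;p&gt;The arithmetic in that paragraph, made explicit so you can substitute your own fault rates and MTTR figures:&lt;/p&gt;

```python
# The downtime math from the paragraph above. The $14,056/minute figure and
# the 15% severity-adjusted factor are the article's inputs.

COST_PER_MIN = 14_056             # industry estimate, unplanned downtime
FAULTS_PER_YEAR = 2 * 12          # one fault class, twice per month
MTTR_BEFORE, MTTR_AFTER = 90, 8   # minutes, manual vs. automated
SEVERITY_FACTOR = 0.15            # conservative impact adjustment

minutes_saved = (MTTR_BEFORE - MTTR_AFTER) * FAULTS_PER_YEAR
avoided_cost = minutes_saved * COST_PER_MIN * SEVERITY_FACTOR
print(f"{minutes_saved} minutes saved -> ${avoided_cost:,.0f} avoided annually")
```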




&lt;p&gt;&lt;strong&gt;Full whitepaper with implementation methodology and complete reference architecture:&lt;/strong&gt; &lt;a href="https://primeautomationsolutions.com/whitepaper/" rel="noopener noreferrer"&gt;Download the whitepaper&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference architecture for the event bus layer:&lt;/strong&gt; &lt;a href="https://github.com/prime001/primebus-spec" rel="noopener noreferrer"&gt;PrimeBus Spec on GitHub&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Erik Anderson is a Principal Network Automation Engineer and founder of Prime Automation Solutions. He is the author of The Autonomous Engineer and architect of Project Helix, an autonomous operations platform for satellite network infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>networking</category>
      <category>automation</category>
      <category>devops</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
