Forem: Mike Falkenberg

How I Built an AI-Powered Error Triage System for SaaS at Scale — And What It Actually Costs

Mike Falkenberg — Mon, 23 Mar 2026 13:59:30 +0000

We had a monitoring problem that wasn't really a monitoring problem.

We had Datadog. We had alerts. We had dashboards. What we didn't have was signal. On any given morning, an engineer opening the console might see a large volume of errors aggregated across many customer environments — with no fast way to know if that was one cascading timeout firing repeatedly, or a dozen distinct failures quietly spreading across the fleet.

I built an internal production dashboard to surface that signal. Then I added AI-powered error analysis to it. The pipeline runs on a schedule throughout the day. Here's the architecture, the reasoning, and illustrative code for each layer — patterns you can adapt; they are not copy-pasted from a private repo — including the part many AI monitoring write-ups skip: who owns the problem once the AI summarizes it.

The Problem With Raw Error Counts

The product is SaaS, but it is not the classic “everyone on one shared multi-tenant stack” shape: customers run in separate environments, and observability still rolls up into one place. When something breaks, you want three answers quickly:

Is this one error happening repeatedly, or many different errors?
Which customers are affected, and how badly?
Does this go to the product engineering team or the platform team?

Raw error counts answer none of those questions. A single database deadlock in one busy environment can generate many log lines. Without normalization, that looks like many separate incidents. With normalization, it's one pattern, one API call, one analysis.

The Architecture: Five Layers

Layer 1: Signature Extraction

Before any AI touches the data, errors get normalized. The goal is to strip everything variable — timestamps, customer or environment identifiers, GUIDs, session tokens — and reduce each error to its structural "shape." Many near-duplicate entries collapse to one signature.

Only send redacted, normalized text to a third-party model. Treat log lines like untrusted input: strip or hash anything that could be PII, secrets, or customer-identifying before it leaves your network.

import re
import hashlib

def extract_error_signature(message: str) -> tuple[str, str]:
    """
    Normalize an error message to its structural shape,
    then hash it for consistent grouping.
    """
    normalized = message

    # Strip customer / environment / user identifiers (extend for your log formats)
    normalized = re.sub(
        r'(customer|account|tenant)[_-]?id[:\s]+\S+',
        '[CUSTOMER_SCOPE]',
        normalized,
        flags=re.IGNORECASE,
    )
    normalized = re.sub(r'user[_-]?id[:\s]+\d+', '[USER_ID]', normalized, flags=re.IGNORECASE)

    # Strip timestamps
    normalized = re.sub(
        r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[\.\d]*Z?',
        '[TIMESTAMP]',
        normalized
    )

    # Strip GUIDs
    normalized = re.sub(
        r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
        '[GUID]',
        normalized,
        flags=re.IGNORECASE
    )

    # Strip long numeric IDs
    normalized = re.sub(r'\b\d{5,}\b', '[ID]', normalized)

    # Normalize whitespace
    normalized = re.sub(r'\s+', ' ', normalized).strip()

    # Hash the normalized shape for use as a cache/grouping key
    signature_hash = hashlib.md5(normalized.encode()).hexdigest()[:16]

    return signature_hash, normalized

The deduplication ratio is what this buys you. If hundreds of raw lines normalize to a handful of unique signatures, you make a handful of API calls — not one per line. On a noisy day that is the difference between a cheap run and an expensive one.
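
To make the deduplication ratio concrete, here is a minimal sketch that groups raw lines by signature and reports how far they collapse. The `simple_signature` helper is a cut-down, hypothetical stand-in for `extract_error_signature` above (it only strips GUIDs, timestamps, and long numbers), and the sample log lines are invented for illustration.

```python
import hashlib
import re
from collections import Counter


def simple_signature(message: str) -> str:
    """Cut-down, hypothetical stand-in for extract_error_signature()."""
    text = re.sub(
        r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
        '[GUID]', message, flags=re.IGNORECASE)
    text = re.sub(r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}', '[TIMESTAMP]', text)
    text = re.sub(r'\b\d{5,}\b', '[ID]', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return hashlib.md5(text.encode()).hexdigest()[:16]


def dedup_report(raw_lines: list[str]) -> dict:
    """Group raw lines by signature and report how much they collapse."""
    counts = Counter(simple_signature(line) for line in raw_lines)
    unique = len(counts)
    return {
        'raw_lines': len(raw_lines),
        'unique_signatures': unique,
        # Raw lines per unique signature; higher means fewer API calls per line
        'dedup_ratio': len(raw_lines) / unique if unique else 0.0,
        'counts': counts,
    }


# Invented sample data: three deadlocks that differ only in variable parts,
# plus one unrelated timeout.
sample = [
    '2026-03-23 06:14:01 Deadlock on order 1234567',
    '2026-03-23 06:14:02 Deadlock on order 7654321',
    '2026-03-23 06:14:09 Deadlock on order 5550001',
    '2026-03-23 06:15:30 Timeout calling billing service',
]
report = dedup_report(sample)
```

Here four raw lines collapse to two signatures, so only two analyses would be requested. Tracking this ratio per run is also a cheap way to notice when a new log format is slipping past your normalization rules.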

Layer 2: Cache With a 6-Hour TTL

The cache is what makes this economical over time. Once a signature is analyzed, that result is reused until it expires. The pipeline runs often — on most runs, the API does not fire for recurring known patterns.

import json
import hashlib
from datetime import datetime, timedelta
from pathlib import Path

class AnalysisCache:

    def __init__(self, cache_dir: str = '.cache/error-analysis'):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _cache_path(self, signature: str, analysis_type: str) -> Path:
        key = hashlib.md5(f"{signature}:{analysis_type}".encode()).hexdigest()
        return self.cache_dir / f"{key}.json"

    def get(self, signature: str, analysis_type: str = 'recent') -> dict | None:
        path = self._cache_path(signature, analysis_type)
        if not path.exists():
            return None

        cached = json.loads(path.read_text())
        cached_at = datetime.fromisoformat(cached['cached_at'])

        # Recent error analysis: 6-hour TTL
        # Long-term pattern analysis: 7-day TTL
        ttl = timedelta(hours=6) if analysis_type == 'recent' else timedelta(days=7)

        if datetime.now() - cached_at > ttl:
            return None  # Expired

        return cached['analysis']

    def set(self, signature: str, analysis_type: str, result: dict) -> None:
        path = self._cache_path(signature, analysis_type)
        path.write_text(json.dumps({
            'cached_at': datetime.now().isoformat(),
            'analysis': result
        }, indent=2))

The 6-hour TTL is a deliberate tradeoff. It is short enough that a genuinely new error variant surfaces within a typical business window. It is long enough that a stable recurring pattern does not burn tokens re-analyzing the same shape on every run.

Layer 3: LLM Analysis — Structured for Multiple Audiences

This is where the most important design decision lives. The prompt requests output in a specific JSON schema that serves several audiences simultaneously — support, operations, platform engineering, and leadership — without requiring separate reports.

The examples below use the Anthropic Python SDK; the same idea applies to any provider that accepts structured prompts and returns text you parse as JSON.

import anthropic
import json
import re

class AIErrorAnalyzer:

    # The default model name is a placeholder; pass a model ID your account supports.
    def __init__(self, api_key: str, model: str = 'claude-sonnet-latest'):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model
        self.total_tokens = 0
        self.total_cost = 0.0

    def analyze(self, signature: str, error_type: str,
                occurrences: int, customers_affected: int,
                normalized_message: str) -> dict:

        prompt = f"""Analyze this production error pattern and return JSON only.

Error type: {error_type}
Occurrences: {occurrences}
Customers affected: {customers_affected}
Normalized message: {normalized_message[:400]}

Return this exact structure:
{{
  "summary": "One sentence for the dashboard",
  "explanation": "Plain English for non-technical staff",
  "severity": "Critical|High|Medium|Low",
  "user_impact": "What the end user experiences",
  "root_cause": {{
    "likely_cause": "Most probable cause",
    "confidence": 0.0
  }},
  "recommendations": {{
    "immediate_actions": [],
    "resolution_priority": "Urgent|High|Medium|Low"
  }},
  "customer_communication": "Suggested response if customer asks",
  "technical_details": {{
    "error_category": "Application|Infrastructure|Database|Network|Configuration",
    "real_application_bug": false,
    "affects_critical_operation": false
  }}
}}"""

        response = self.client.messages.create(
            model=self.model,
            max_tokens=1500,
            system="You are a production error analyst. Return only valid JSON.",
            messages=[{"role": "user", "content": prompt}]
        )

        # Replace rates with your provider's current list price (they change).
        usage = response.usage
        input_rate_per_mtok = 3.0   # example: USD per 1M input tokens
        output_rate_per_mtok = 15.0  # example: USD per 1M output tokens
        cost = (usage.input_tokens / 1_000_000 * input_rate_per_mtok) + \
               (usage.output_tokens / 1_000_000 * output_rate_per_mtok)
        self.total_tokens += usage.input_tokens + usage.output_tokens
        self.total_cost += cost

        return self._parse(response.content[0].text)

    def _parse(self, text: str) -> dict:
        # Try a fenced markdown code block first
        match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL)
        candidates = [match.group(1)] if match else []
        # Fall back to the outermost braces in the raw text
        start = text.find('{')
        end = text.rfind('}')
        if start != -1 and end > start:
            candidates.append(text[start:end + 1])
        for candidate in candidates:
            try:
                return json.loads(candidate)
            except json.JSONDecodeError:
                continue  # Malformed JSON: try the next candidate
        return {"summary": text[:200], "fallback": True}

The key fields are summary (dashboard card), explanation (support guidance), error_category and real_application_bug (routing signals). Getting those right means one analysis object can serve both someone answering a ticket and someone triaging an alert.
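
Because routing depends on those fields, it can help to validate the parsed object before trusting it. The checker below is a sketch that assumes the model returns the schema from the prompt above, and that `confidence` is meant as a 0 to 1 score (the prompt implies this but does not state it). `validate_analysis` and its error strings are illustrative names, not part of the original pipeline.

```python
ALLOWED_SEVERITIES = {'Critical', 'High', 'Medium', 'Low'}
ALLOWED_CATEGORIES = {
    'Application', 'Infrastructure', 'Database', 'Network', 'Configuration',
}


def validate_analysis(analysis: dict) -> list[str]:
    """Return a list of problems; an empty list means the object is usable."""
    problems = []

    for field in ('summary', 'explanation'):
        if not isinstance(analysis.get(field), str) or not analysis[field].strip():
            problems.append(f'missing or empty {field}')

    if analysis.get('severity') not in ALLOWED_SEVERITIES:
        problems.append('severity not in allowed set')

    technical = analysis.get('technical_details')
    if not isinstance(technical, dict):
        problems.append('technical_details missing')
    else:
        if technical.get('error_category') not in ALLOWED_CATEGORIES:
            problems.append('error_category not in allowed set')
        if not isinstance(technical.get('real_application_bug'), bool):
            problems.append('real_application_bug is not a boolean')

    root_cause = analysis.get('root_cause')
    confidence = root_cause.get('confidence') if isinstance(root_cause, dict) else None
    # bool is a subclass of int, so exclude it explicitly
    if (isinstance(confidence, bool)
            or not isinstance(confidence, (int, float))
            or not 0.0 <= confidence <= 1.0):
        problems.append('root_cause.confidence not a number in [0, 1]')

    return problems


good = {
    'summary': 'Deadlock on order writes',
    'explanation': 'Two saves collided; the second one was retried.',
    'severity': 'Medium',
    'root_cause': {'likely_cause': 'Lock ordering', 'confidence': 0.7},
    'technical_details': {'error_category': 'Database',
                          'real_application_bug': False},
}
bad = {'summary': 'x', 'severity': 'Severe'}
```

An object that fails validation is a reasonable candidate for the `needs_review` bucket instead of automatic routing.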

Ballpark cost (illustrative): Per-call totals depend on model, prompt size, and output length. With aggressive caching, many teams land in the rough range of a few dollars per month for periodic batch triage at moderate error volume — always recompute from your own token meters and current provider pricing.
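
A quick way to sanity-check that range is to do the arithmetic yourself. The sketch below reuses the example rates from the analyzer above (3 and 15 USD per million tokens); the call volume and token counts are made-up inputs, so substitute your own meter readings.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate_per_mtok: float = 3.0,
              output_rate_per_mtok: float = 15.0) -> float:
    """Cost in USD of one call at per-million-token rates."""
    return (input_tokens / 1_000_000 * input_rate_per_mtok
            + output_tokens / 1_000_000 * output_rate_per_mtok)


def monthly_cost(cache_misses_per_day: float, input_tokens: int,
                 output_tokens: int, days: int = 30) -> float:
    """Only cache misses reach the API, so they are what you pay for."""
    return cache_misses_per_day * days * call_cost(input_tokens, output_tokens)


# Hypothetical workload: 10 uncached signatures a day,
# 500 input tokens and 600 output tokens per call.
per_call = call_cost(500, 600)
per_month = monthly_cost(10, 500, 600)
```

With these inputs a call costs about 0.0105 USD and a month about 3.15 USD. Output tokens dominate the bill at these rates, so trimming the requested JSON schema moves the number more than trimming the prompt.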

Layer 4: Anomaly Detection Against a Rolling Baseline

A fresh error and a known recurring error need different responses. The anomaly detector compares each signature against N days of stored history, flagging three conditions: NEW (never seen before), SPIKE (volume far above baseline), and SPREAD (appearing for customers who have not seen it in the baseline window).

from dataclasses import dataclass
from typing import Any

@dataclass
class BaselineStats:
    days_present: int
    mean_occurrences: float
    max_occurrences: int
    max_customers: int  # peak distinct customers in baseline window
    customers_seen: set[str]

def classify_anomaly(
    signature: str,
    current: dict[str, Any],
    baseline: dict[str, BaselineStats]
) -> dict[str, Any]:

    occurrences = current.get('occurrence_count', 0)
    current_customers = set(current.get('customers', []))
    b = baseline.get(signature)

    # Never seen before
    if not b:
        return {
            'new_signature': True,
            'spike': occurrences >= 10,
            'spread': len(current_customers) >= 3,
            'new_customers': sorted(current_customers),
        }

    # Spike: meaningfully above both max and mean from baseline
    spike = (
        occurrences >= 10 and
        occurrences > max(2 * b.max_occurrences,
                          3 * max(1.0, b.mean_occurrences))
    ) or (
        occurrences >= 25 and occurrences > b.max_occurrences
    )

    # Spread: affecting customers who haven't seen this before,
    # or many more distinct customers than the baseline peak
    new_customers = sorted(c for c in current_customers
                           if c not in b.customers_seen)
    spread = len(new_customers) >= 2 or (
        len(current_customers) >= 3 and
        len(current_customers) > max(1, 2 * b.max_customers)
    )

    return {
        'new_signature': False,
        'spike': spike,
        'spread': spread,
        'new_customers': new_customers,
        'baseline_days_present': b.days_present,
        'baseline_mean': round(b.mean_occurrences, 2),
        'baseline_max': b.max_occurrences,
    }

The heuristics are deliberately simple: an explainable approach beats heavy statistics when the goal is action, not false precision. An anomaly flag you cannot explain to a stakeholder in half a minute is not operationally useful.
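
The baseline itself can be built by folding stored daily snapshots into per-signature stats. This sketch assumes each day is stored as a dict mapping signature to `occurrence_count` and `customers` (the same keys `classify_anomaly` reads). The storage shape and the `build_baseline` name are assumptions, and the mean is taken over the days on which a signature appeared.

```python
from dataclasses import dataclass, field


@dataclass
class BaselineStats:
    days_present: int = 0
    mean_occurrences: float = 0.0
    max_occurrences: int = 0
    max_customers: int = 0
    customers_seen: set[str] = field(default_factory=set)


def build_baseline(history: list[dict[str, dict]]) -> dict[str, BaselineStats]:
    """Fold N daily snapshots into per-signature baseline stats."""
    totals: dict[str, int] = {}
    baseline: dict[str, BaselineStats] = {}

    for day in history:
        for signature, record in day.items():
            count = record.get('occurrence_count', 0)
            customers = set(record.get('customers', []))
            stats = baseline.setdefault(signature, BaselineStats())
            stats.days_present += 1
            stats.max_occurrences = max(stats.max_occurrences, count)
            stats.max_customers = max(stats.max_customers, len(customers))
            stats.customers_seen |= customers
            totals[signature] = totals.get(signature, 0) + count

    for signature, stats in baseline.items():
        # Mean over days present, not over the whole window
        stats.mean_occurrences = totals[signature] / stats.days_present

    return baseline


# Invented three-day history
history = [
    {'sig_a': {'occurrence_count': 4, 'customers': ['c1']}},
    {'sig_a': {'occurrence_count': 8, 'customers': ['c1', 'c2']}},
    {'sig_b': {'occurrence_count': 1, 'customers': ['c3']}},
]
baseline = build_baseline(history)
```

Averaging over days present keeps a pattern that fires twice a week from looking quieter than it is; averaging over the full window is the other defensible choice and makes the spike rule more sensitive.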

Layer 5: Triage Routing — Ownership, Not Just Summaries

This is what many AI monitoring articles leave out. Finding the error is half the job. Knowing who owns it is the other half — and getting that wrong is expensive. A platform issue routed to application engineering wastes time. An application bug routed to platform may never get the right fix.

The triage layer maps the model's error_category and real_application_bug fields into a stable owner bucket. When error_category is one of the known labels, it wins — even if real_application_bug is also set — so category is the primary routing signal; the bug flag mainly breaks ties when category is ambiguous.

def triage(analysis: dict) -> dict:
    """
    Route an analyzed error to the correct owner bucket.
    Returns: bucket, owner, reason.
    """
    technical = analysis.get('technical_details', {})
    error_category = (technical.get('error_category') or '').lower()
    real_bug = technical.get('real_application_bug', False)
    # error_type is not part of the model's JSON; the caller must merge it in
    error_type = (analysis.get('error_type') or '').lower()

    # Explicit model-supplied category takes priority
    routing = {
        'application':    ('application', 'dev'),
        'infrastructure': ('platform',    'platform'),
        'network':        ('platform',    'platform'),
        'database':       ('platform',    'platform'),
        'configuration':  ('platform',    'platform'),
    }
    if error_category in routing:
        bucket, owner = routing[error_category]
        return {'bucket': bucket, 'owner': owner,
                'reason': f'Categorized as {error_category}'}

    # Heuristic fallback on error type
    if error_type in ('timeout', 'connection'):
        return {'bucket': 'platform', 'owner': 'platform',
                'reason': 'Connectivity errors route to platform first'}

    if error_type == 'sql':
        return {'bucket': 'platform', 'owner': 'platform',
                'reason': 'Database errors route to platform first'}

    if real_bug is True:
        return {'bucket': 'application', 'owner': 'dev',
                'reason': 'Flagged as application bug'}

    return {'bucket': 'needs_review', 'owner': 'review',
            'reason': 'Insufficient signal to auto-route'}

Below this, a known-noise list helps: signatures you have classified as benign (for example, expected churn during deploys or maintenance) can be suppressed or down-ranked. A novel signature that SPREADs to new customer environments still escalates. That distinction is what turns a monitoring view into a triage workflow: not just something is wrong, but this is new, this team owns it, and here is suggested wording for support.
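
The suppress-unless-escalating rule can be sketched as a small filter that runs after anomaly classification. `KNOWN_NOISE`, `display_decision`, and the return labels are hypothetical names; the flags dict matches what `classify_anomaly` returns. Letting a spike or spread override the noise list is an assumption of this sketch, not something the pipeline description specifies.

```python
KNOWN_NOISE = {'a1b2c3d4e5f60718'}  # hypothetical signature hashes marked benign


def display_decision(signature: str, flags: dict) -> str:
    """Return 'escalate', 'show', or 'suppress' for a dashboard card."""
    anomalous = flags.get('spike', False) or flags.get('spread', False)

    if flags.get('new_signature', False):
        # Novel patterns are never hidden; spreading ones jump the queue
        return 'escalate' if flags.get('spread', False) else 'show'

    if signature in KNOWN_NOISE:
        # Assumption: benign patterns resurface if they spike or spread
        return 'show' if anomalous else 'suppress'

    return 'escalate' if anomalous else 'show'


quiet = {'new_signature': False, 'spike': False, 'spread': False}
spiking = {'new_signature': False, 'spike': True, 'spread': False}
novel_spread = {'new_signature': True, 'spike': False, 'spread': True}
```

Keeping the noise list keyed by signature hash means a benign pattern stays suppressed only while its normalized shape is unchanged; a new variant gets a new hash and shows up again.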

What the Pipeline Actually Looks Like

Each scheduled run is roughly:

[06:15 UTC] Starting error analysis pipeline...
  Step 1: Pull errors from monitoring API
  Step 2: Extract signatures — many raw lines → few unique patterns
  Step 3: Cache check — most patterns hit, one miss
  Step 4: LLM API call for the new signature
          (token count and cost from your meter)
  Step 5: Anomaly detection against rolling baseline
          Pattern A: KNOWN (stable)
          Pattern B: KNOWN (stable)
          Pattern C: NEW SIGNATURE — flagged for review
  Step 6: Triage routing
          Pattern A: platform / database
          Pattern B: non_issue (expected noise, suppressed)
          Pattern C: needs_review (new, insufficient signal)
  Step 7: Write results to storage

[06:15 UTC] Pipeline complete in tens of seconds

Few patterns, one fresh analysis call, short wall time. The dashboard shows the cards that matter; expected noise stays out of the way.

The Actual Value

Spend is usually modest next to overall infra budget. The larger win is the morning triage ritual.

Before: pull errors, group manually, read stack traces, decide who to wake up — a long block if you are thorough.

After: open the dashboard, scan a short list of cards. The model did the grouping, drafted support-facing language, and highlighted what needs a human decision.

That time compounds across a team and across a year. That is the leverage case — not the per-token line item.

If this was useful, leave a comment below — I like comparing notes with people building similar systems.

Find me: LinkedIn | GitLab

Mike Falkenberg is a technologist with 20+ years leading development, operations, and security teams. He shares practical insights from building technology organizations. Connect on LinkedIn and follow GitLab for code.

The Hardest Part of AI Isn't the AI

Mike Falkenberg — Sun, 01 Mar 2026 18:30:36 +0000

After 6 months of building, shipping, and leading with AI tools every day, I can tell you the technology was the easy part.

A Quick Rewind

Last fall, I wrote about AI being the first thing in 20 years that genuinely changed how I work, and then about the workflow shifts that followed: builder to architect, the "worth doing" threshold dropping, parallel execution changing everything.

Then I went quiet. Not because I lost interest. Because I went deep—building infrastructure, integrating tools, navigating the organizational reality of AI adoption. Living it instead of writing about it.

Here's what I learned that I didn't expect.

The Technology Figured Itself Out

Let's get this out of the way: the tools are incredible now.

I run a stack that would've sounded fictional two years ago. Cursor for deep coding context. CodeRabbit scanning every PR before I look at it. Claude for the kind of architectural reasoning that used to require a whiteboard and three senior engineers. GitLab Duo woven into the platform workflow.

These tools work. They work well. They're getting better every month.

But here's what I've realized after months of using them in production: the tools were never the hard part.

Getting Cursor set up takes an afternoon. Integrating CodeRabbit takes a few hours. The technology adoption curve is the flattest I've seen in my career.

The hard part is everything that happens after the tools are running.

The Leadership Shift Nobody Talks About

When I wrote about moving from builder to architect, I thought I understood the shift. I understood maybe a third of it.

The real shift isn't in what you do. It's in what you decide.

Every day now, I make judgment calls that didn't exist before:

When AI generates a solution that works but doesn't match our patterns, do I accept the velocity or enforce the standards? When a junior engineer ships twice as much code because AI is writing most of it, how do I evaluate their growth? When I can prototype three approaches in the time it used to take to spec one, how do I decide which to invest in?

These aren't technology problems. They're leadership problems. And my 20 years of experience matter more for these decisions than they ever did for writing code.

That's the part nobody warned me about. AI doesn't reduce the need for experienced judgment. It concentrates it.

What "Leading by Example" Means Now

I've always believed leaders should be hands-on. Build what you ask others to build. Understand the work at the level you're asking people to do it.

AI changed what that means.

Leading by example used to mean I could sit down and write the code myself. Now it means I can sit down and orchestrate the solution myself—and more importantly, that I can show my team how I think through the orchestration.

The valuable demonstration isn't "watch me use Cursor." It's "watch me decide what to point Cursor at, what context to give it, and what to reject from the output."

I've started doing something I never did before: I walk through my AI-assisted problem-solving process out loud with my team. Not the tool mechanics. The judgment. Why I gave it this context and not that context. Why I rejected a technically correct solution because it didn't fit our operational reality. Why I chose to do something manually when AI could have done it faster.

That's the new version of leading by example. And it might be the most important thing I do now.

The Judgment Gap

Here's something uncomfortable I've learned: the gap between "I use AI effectively" and "my organization uses AI effectively" is enormous. And it's not a training gap.

Everyone on my team has access to the same tools I do. They can all prompt an LLM.

The gap is in knowing what problems to solve.

After 20 years, I carry a mental model of what matters—which architectural decisions will haunt us, which shortcuts are fine, which edge cases will wake someone up at 3 AM. AI amplifies that mental model. I point AI at the right problems, give it the right context, and validate the output against real operational experience.

Without that kind of judgment, AI is incredibly productive at building the wrong things very fast.

This isn't an argument that junior engineers can't use AI. They absolutely can, and they should. It's an observation that AI makes experience more valuable, not less. The people who've been around long enough to know where the landmines are buried? They're the ones who get the most leverage from AI.

That's a leadership insight that matters right now, because a lot of organizations are treating AI adoption as a training problem when it's really a mentorship problem.

What I Got Wrong (And What It Taught My Team)

In the spirit of honesty that started this series, here's what I got wrong in my earlier posts—and what we learned from it:

I underestimated the security complexity. I wrote about security implications, but I didn't appreciate how much the attack surface changes when AI tools have context about your systems. Context engineering isn't just about making AI more effective—it's about controlling what AI knows. That's a fundamentally different security model than most organizations are built for. We had to rethink our entire approach to data classification—not because of a breach, but because we realized we were one lazy prompt away from one.

I overestimated how fast teams adopt new patterns. My personal workflow transformation happened in weeks. Organizational transformation takes months. Not because people resist change—because the coordination cost is real. Everyone needs to learn new judgment patterns, not just new tools. The fix wasn't more training sessions. It was pairing—experienced people working alongside less experienced people, making the invisible judgment visible.

I thought the "worth doing" threshold drop was purely positive. It mostly is. But when everything becomes worth doing, prioritization gets harder, not easier. A backlog that grows because you can do more is a different kind of problem than a backlog that grows because you can't. We caught ourselves three months in with too many things in flight. The discipline of saying "not now" is harder when "now" is so cheap.

AI Beyond Engineering

Here's where this gets bigger than dev teams.

Everything I've described so far happened inside engineering. But the patterns aren't engineering-specific. The judgment gap, the mentorship problem, the "worth doing" threshold—those exist in every department.

I'm now building a plan to take AI adoption org-wide. Operations. Security. Project management. Not by handing everyone a ChatGPT login and calling it transformation. By applying the same approach that worked in engineering: start with the people who have the deepest domain judgment, give them AI tools, let them demonstrate what's possible, and build the infrastructure that lets it scale.

The insight from engineering applies everywhere: AI doesn't replace domain expertise. It gives domain experts leverage they've never had. A security engineer with 15 years of experience and AI tools isn't just faster at writing policies—they're solving problems that weren't feasible before. An operations lead who knows where every process bottleneck lives can use AI to finally fix the ones that were never "worth the effort."

That's the real unlock. Not AI for engineering. AI for the entire organization, led by the people who know the work best.

The Question That Matters

If you're a leader thinking about AI adoption—or in the middle of it—here's the question I'd push you to answer:

Who in your organization has the judgment to direct AI effectively, and how are you scaling that judgment to others?

Not: what tools should we buy. Not: how do we train people on prompting. Not: what's the ROI.

Who has the mental model, and how does it spread—across teams, across departments, across the org?

Because the tools are easy. The technology is the easy part. The leadership challenge of developing AI-ready judgment across an organization—that's the work that separates companies that get real value from companies that just get faster at building the wrong things.

What's Next

I've been building the infrastructure that makes all of this work at scale—context systems, security boundaries, observability for AI-assisted workflows. I'll share the technical details in the next post, including the code. All of it is public at gitlab.com/mikefalk.

But I wanted to start here. With the human part. Because after 6 months of living with AI every day, I'm more convinced than ever that the technology will keep getting better on its own.

The leadership? That's on us.

Let's compare notes. If you're navigating AI adoption in your organization—especially if you're a hands-on leader who refuses to just delegate it—I want to hear what you're learning. The best insights I've had came from conversations, not documentation.


The Workflow of the Future Is Already Here (And It's Nothing Like You Think)

Mike Falkenberg — Sat, 08 Nov 2025 14:42:14 +0000

After 20 Years in Technology, AI Changed How I Work - Part 2

Three weeks of AI-integrated work taught me more about the future of technology work than 20 years of experience. This isn't about tools—it's about a fundamental shift in how ALL work gets done.

A few weeks ago, I wrote about AI being genuinely different after 20 years in technology—the organizational challenges, the security implications, the honest uncertainties.

I was writing from experimentation and curiosity. I'd seen enough to know AI wasn't hype, but I was still testing, still exploring, still skeptical about the real-world impact.

Three weeks later, something fundamental shifted.

Now I'm writing from the other side of that shift, and it has changed how I work.

In the last few weeks, I've built more than I built in the previous six months. Not because I'm working longer hours or cutting corners. Because I'm working differently.

Projects that sat on my "someday" list for years are done. Automation I thought would take weeks took hours. Tools I'd mentally shelved as "not worth the time investment" exist now and are running in production.

This isn't about specific tools. Tools will change. New ones will emerge. Better ones will replace what I'm using today.

This is about the workflow pattern I discovered that I believe represents the future of technical work.

Let me show you what changed.

The Three-Week Transformation

Week 1: Integration

I spent the first week building what I now think of as an "AI-integrated work environment"—not just for coding, but for everything. Strategic thinking. Technical execution. Content creation. Problem exploration. Planning. Analysis.

The setup was tedious. Lots of experimentation. Lots of "does this actually work?" testing across different domains.

I wasn't sure it would be worth it. Spoiler: it was.

Week 2: The Breakthrough

Somewhere in week two, something clicked.

The breakthrough wasn't about one type of work. It was about how AI integrated into my entire workflow—not just writing code, but thinking through problems, exploring solutions, creating content, planning architecture, analyzing tradeoffs.

I started completing work that had been shelved for months or years. Technical projects. Strategic analysis. Documentation. Content. Things that would have taken weeks happened in hours.

That's when I realized: This isn't about AI making me faster at specific tasks. This is about AI as an integrated assistant across everything I do.

Week 3: The New Normal

By week three, I'd shifted into what I now think of as the new way of working.

My backlog started shrinking across all categories. Technical work. Strategic planning. Content creation. Analysis. Documentation. The "nice to have" items that never quite justified the time investment.

They were all suddenly worth doing. Not because I lowered my standards—because the time-to-value ratio changed fundamentally.

The Four Workflow Shifts

Let me be specific about what actually changed. These aren't incremental improvements. These are fundamental shifts in how technical work gets done.

Shift 1: From Sequential to Parallel

The old workflow:

Think → Research → Build → Test → Document → Review → Deploy

Everything sequential. One step at a time. Each step blocking the next. My time was the bottleneck for everything.

The new workflow:

Think → [Multiple parallel streams] → Orchestrate → Integrate → Review

Now multiple things happen simultaneously. While AI is generating one component, it's also writing tests for another, documenting a third, and researching implementation patterns for a fourth.

My role shifted from executor to orchestrator.

It's not about any one task moving faster. My job now is to orchestrate parallel streams of work and integrate the results into something coherent.

That's a fundamentally different job.

Shift 2: From Context-Free to Context-Aware

This is the breakthrough most people miss.

Before: Every interaction with AI started from scratch. "Here's my generic problem, give me a generic solution."

After: AI has context about my actual systems. My infrastructure. My data sources. My patterns. My constraints.

When I ask it to connect the dots across systems—operational metrics, upcoming releases, policy constraints—it doesn't respond with a generic tutorial. It understands the landscape I'm working in, pulls the signals that matter, and surfaces insights that would have taken days of manual context gathering.

The difference isn't speed. It's relevance and depth.

Instead of spending hours adapting generic examples to my specific environment, AI generates solutions that fit my environment from the start.

Context-aware AI doesn't just help me code. It helps me think through problems in the context of my actual systems.

This isn't prompt engineering—it's context engineering. It's the deliberate work of designing the systems, guardrails, and data pathways that give AI relevant situational awareness across every part of my job, not just in an IDE.

That's the shift that makes everything else possible.

But context-awareness introduces security risk.

This is where most organizations make their biggest mistakes.

When AI has access to your systems—through APIs, monitoring data, infrastructure context—you're exposing potentially sensitive information. System architectures. Data patterns. Security configurations.

The security model shifts from "AI doesn't know anything" to "AI knows what I explicitly allow it to know."

What this means in practice:

API access requires authentication controls - Not all AI services should access all systems
Context data needs filtering - Don't feed AI sensitive credentials, customer data, or proprietary algorithms
Audit logs matter - Track what context AI accesses and when
Organizational policies are essential - Clear rules about what context AI can access
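
The filtering and audit-log items above can be sketched in a few lines. This assumes context is passed around as a flat dict; the allowlist keys, the secret pattern, and the `filter_context` name are illustrative, and a real deployment would need a far more thorough secret scanner.

```python
import re
from datetime import datetime, timezone

ALLOWED_KEYS = {'service', 'region', 'error_rate', 'release'}  # hypothetical allowlist
SECRET_PATTERN = re.compile(
    r'(api[_-]?key|password|token|secret)\s*[=:]\s*\S+', re.IGNORECASE)

audit_log: list[dict] = []


def filter_context(context: dict, requester: str) -> dict:
    """Keep allowlisted keys, redact secret-looking values, record the access."""
    filtered = {}
    for key, value in context.items():
        if key not in ALLOWED_KEYS:
            continue  # Default deny: unknown keys never reach the model
        if isinstance(value, str):
            value = SECRET_PATTERN.sub('[REDACTED]', value)
        filtered[key] = value

    audit_log.append({
        'at': datetime.now(timezone.utc).isoformat(),
        'requester': requester,
        'keys_shared': sorted(filtered),
        'keys_dropped': sorted(set(context) - set(filtered)),
    })
    return filtered


# Invented example context
raw = {
    'service': 'billing',
    'release': 'v2 deploy with token=abc123',
    'db_password': 'hunter2',
}
safe = filter_context(raw, requester='triage-bot')
```

The allowlist is the main control here; the regex is a second line of defense for secrets that end up inside an otherwise permitted field.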

The more context-aware your AI workflow becomes, the more critical your security boundaries are.

I manage this tension daily as Security Officer: context-awareness is transformative, but it's not a free pass to bypass security controls.

Shift 3: From Building to Reviewing

For twenty years in technology, my primary role has been builder.

In the last three weeks, my primary role became architect and reviewer.

The old workflow:

Me: Build the thing (80% of time)
Me: Review the thing (20% of time)

The new workflow:

Me: Design and architect (30% of time)
AI: Build the mechanical parts (happens in parallel)
Me: Review, integrate, refine (70% of time)

This isn't about AI "taking my job." It's about AI handling the parts I'm overqualified for anyway.

I don't need 20 years of experience to write boilerplate error handling. I do need 20 years of experience to know what error conditions matter, how they should be handled in the broader system, and what the architectural implications are.

AI is really good at the first part. I'm still essential for the second part.

The shift is: I now spend most of my time on the parts that actually require experience and judgment.

That's appropriate. That's where my value is.

Shift 4: From "Worth It" to "Done"

This is the shift that's changing my backlog math.

The old calculation:

Project value: Medium
Time required: 40 hours
Decision: Not worth it right now, backlog it
Result: Never gets built

The new calculation:

Project value: Medium (same value)
Time required: 4 hours (AI-assisted)
Decision: Worth doing this week
Result: Built, tested, deployed

The threshold for "worth doing" dropped dramatically.

When context engineering cuts the time-to-value across everything, the backlog math flips: "maybe someday" becomes "worth doing now."

Projects that would never have justified three weeks of my time suddenly justify four hours. That's not a 10x productivity increase. That's a fundamental change in what problems are worth solving.

My backlog isn't getting reprioritized. It's getting completed.

The Uncomfortable Productivity Math

I know how this sounds. "Weeks to hours" is the kind of claim that makes people roll their eyes.

But here's why it's real:

A typical project breaks down roughly like this:

40% Strategic work (architecture, design, integration, judgment)
60% Mechanical work (boilerplate, standard patterns, documentation)

Before AI: I did all 100% myself. Time: 40 hours.

With AI: I do the 40% strategic. AI does the 60% mechanical in parallel.

My share at the old pace: ~16 hours (40% of 40). Actual total with AI-assisted iteration: ~8-10 hours.

That's 4-5x faster. Sometimes 10x on boilerplate-heavy work.

But here's what matters: I don't need 20 years of experience to write standard patterns. I need it to know which patterns to use, how they integrate, and what the trade-offs are.

That's where AI can't help. That's where experience matters.

The Leadership Implications

For Knowledge Workers

Your role is shifting from executor to strategist/orchestrator.

If your value is "I execute tasks," you're replaceable. If it's "I think strategically, make judgment calls, and integrate complex work," you're more valuable than ever.

For Technology Leaders

Traditional productivity metrics are breaking. Output volume? Task completion? Velocity? All measuring the wrong thing.

The better question: "What problems did we solve that weren't worth solving before?"

When your team can produce 5-10x more with the same headcount, the hard part isn't execution—it's knowing what's worth doing.

For Organizations

The bottleneck shifts from execution capacity to strategic direction.

When you can do 10x more, strategy matters more than ever.

What's Not Solved (The Honest Limitations)

Let me be clear about what AI-integrated workflows do and do NOT solve:

What's Working

Mechanical execution (research, drafting, standard patterns)
Exploration and iteration
Documentation and synthesis
Analysis of known patterns
Parallel workstreams

What's NOT Working Yet

Strategic decisions: AI can't tell you what to do. It can help you execute faster once you know what you want.

Complex integration: AI struggles with integration across multiple complex domains with implicit dependencies and organizational context.

Trade-off judgment: AI can present options, but you still need experience to evaluate trade-offs in the context of your specific constraints.

Organizational context: AI doesn't understand your team dynamics, your company's risk tolerance, your customers' unspoken needs, your political landscape.

What's Still Hard

Knowing what problems are worth solving
Understanding system-wide and organizational implications
Making decisions with long-term consequences
Integrating across organizational boundaries
Managing technical and organizational complexity simultaneously
True strategic thinking and vision

The point: AI augments judgment, it doesn't replace it.

The workflow shift makes experienced professionals MORE valuable, not less—because the parts that require experience are now the majority of the work.

What's Next

Three weeks ago, I thought I understood AI's impact. I was wrong.

This isn't about tools getting incrementally better. It's about a fundamentally different way of working.

Am I 10x more productive? Wrong metric. The right questions:

What's now worth doing that wasn't before?
What quality improvements can I now afford?
What problems can I solve that I was ignoring?

For me: Almost everything on my backlog. More thorough work. All the strategic projects I'd been deferring.

That's not a productivity increase. That's a fundamental shift in what's possible.

Is this the workflow of the future? Maybe. Or maybe in another three weeks I'll discover something even better.

But right now, after 20 years in technology, this is the biggest shift in how I work that I've ever experienced.

The backlog is shrinking. Excellence is scaling. The "not worth the time" work is getting done.

And the best part? I'm spending more time on strategy, judgment, and integration—the parts that actually require 20 years of experience.

That's the workflow of the future: AI handling mechanical parts so humans can focus on expertise.

We're still early. But the direction is clear.

And if you're an experienced professional, this shift makes you more valuable—not less.

Call it context engineering if you want. The industry is starting to formalize it with standards like MCP, but the pattern is the same: treat context like infrastructure, keep the guardrails tight, and the tools can change without breaking the workflow.

Connect

I'm documenting this journey in real-time. If you're exploring similar patterns or have discovered different approaches, I'd love to hear about it.

LinkedIn: linkedin.com/in/mikefalkenberg

Dev.to: dev.to/mikefalk

Code: gitlab.com/mikefalk

All code from my experiments is publicly available. Use it, adapt it, improve it.

Mike Falkenberg is a technology leader with 20+ years of experience building scalable systems and leading engineering teams. He shares practical insights on infrastructure, security, and organizational transformation.

After 20 Years in Technology, AI is the First Thing That Actually Changed How I Work

Mike Falkenberg — Tue, 28 Oct 2025 12:27:27 +0000

The Perspective of Two Decades

I've been in technology for 20 years. I've lived through:

XML web services ("the future of integration")
Cloud migration ("everything will be in the cloud")
Containers ("Docker changes everything")
Microservices ("monoliths are dead")
DevOps transformation ("break down the silos")

Each promised to revolutionize how we work. Most were incremental improvements with new vocabulary.

AI is different.

Not because it writes code faster—that's impressive but tactical. Because it fundamentally changes the economics of what's possible. Tasks that took teams weeks now take individuals days. Problems that required specialists are now approachable by generalists. Knowledge that took years to accumulate can be accessed in seconds.

That's not incremental improvement. That's structural change.

And if you're leading a technology organization, AI isn't a tool decision—it's a strategic imperative. The question isn't whether to integrate AI. It's how to do it thoughtfully.

What Actually Changed

Let me be specific. Here's what transformed in my daily work:

Infrastructure as Code: Boilerplate to Starting Point

Before AI:
Writing infrastructure code meant starting from blank files. Research documentation, figure out syntax, handle edge cases, write examples, test. Time-consuming even for experienced engineers.

After AI:
Describe what I need, AI generates a starting point. I review for security, refine for organization standards, test.

Impact: Noticeably faster on routine work. The interesting part? I spend more time on architecture decisions and security review—higher-value work AI can't do yet.

Code Review: Still Learning This

I'm experimenting with AI-assisted code review, but haven't fully integrated it yet. The promise is faster initial screening so humans focus on architecture and business logic.

Early observations: AI is good at catching common patterns. Less good at understanding organization-specific security requirements.

Still figuring out: How to balance AI pre-screening with maintaining review quality.

Finding Information: AI Search vs. Traditional Search

Before AI:
Google search, read Stack Overflow, piece together answers from multiple sources, adapt to your context.

After AI:
Ask AI directly, get contextual answer, ask follow-up questions, iterate until you understand.

Impact: This might be the biggest change. The way I find and learn information is fundamentally different. Less time searching, more time understanding and applying.

Monitoring and Observability: AI-Enhanced Insights

Modern monitoring tools now include AI-powered features:

Anomaly detection that learns normal patterns
Intelligent alerting that reduces noise
Log analysis that surfaces unusual patterns automatically
Correlation across metrics that humans would miss

Impact: I'm catching issues I wouldn't have noticed manually. But I'm also learning to trust (and validate) AI-flagged anomalies.

Troubleshooting: Pattern Recognition

The change:
AI can analyze log volumes humans can't. Feed it symptoms, it suggests patterns and correlations.

The reality:
Still need to validate AI suggestions. Sometimes it's brilliant. Sometimes it's confidently wrong about context it doesn't have.

Still learning: When to trust AI pattern recognition vs. when to rely on experience.

The Strategic Reality: It's Not About Tools

Here's what most AI articles miss: The technology is easy. The organizational transformation is hard.

Every team can start using GitHub Copilot tomorrow. That doesn't mean they'll be more effective. In fact, without thoughtful leadership, AI can make organizations worse—faster at building the wrong things, more confident in flawed code, creating technical debt at unprecedented speed.

After leading teams through this transformation, here are the challenges that actually matter:

Challenge 1: The Skill Gap is Unpredictable

AI adoption doesn't follow seniority. I've seen senior engineers resist AI ("I know how to do it properly myself") and junior engineers embrace it faster than veterans. I've also seen the opposite.

The challenge: How do you ensure quality when skill levels and AI adoption vary widely?

What I'm exploring:

Pair programming (XP practices): Teams work together regardless of who's using AI
Explicit validation: "How do you know this suggestion is correct?"
Focus on fundamentals: Understanding WHY, not just WHAT

This is a leadership challenge, not a technology one.

Challenge 2: Security Blind Spots

As someone responsible for both development velocity and security, I see the problem: AI-generated code looks professional but can be subtly insecure.

Example: AI suggested infrastructure code that was technically valid but created overly permissive access. Traditional linters passed it.
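
To make that concrete, here is a sketch of the kind of check a security review could automate, assuming the generated code ends up as an IAM policy document in JSON form. The `find_overly_permissive` helper is hypothetical and only looks for wildcard actions and resources; it is not a substitute for a full policy analyzer.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM accepts a single value or a list for Statement, Action, and Resource."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_overly_permissive(policy: dict) -> list[str]:
    """Return findings for Allow statements that use wildcard actions or resources."""
    findings = []
    for index, stmt in enumerate(_as_list(policy.get("Statement"))):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append(f"statement {index}: wildcard action {actions}")
        if "*" in resources:
            findings.append(f"statement {index}: wildcard resource")
    return findings


if __name__ == "__main__":
    # Syntactically valid policy that a linter would accept.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
            {"Effect": "Allow", "Action": ["s3:GetObject"],
             "Resource": "arn:aws:s3:::example-bucket/*"},
        ],
    }
    for finding in find_overly_permissive(policy):
        print(finding)
```

The first statement passes syntax checks yet grants every S3 action on every resource, which is exactly the "technically valid, overly permissive" shape described above.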

What I'm doing:

All AI-generated code gets security review
Focus on architectural security, not just syntax
Training teams to question AI's security assumptions

AI makes us faster. It can also make us faster at building vulnerable systems.

Challenge 3: Knowledge Transfer Breakdown

When AI writes code for a junior engineer, they solve today's problem but don't build tomorrow's expertise. Six months later, you have engineers who can prompt AI but can't debug without it.

What I'm doing:

Requiring explanation: "AI generated this, now explain why it works"
Code review includes: "What did you learn?"
Balancing AI-assisted speed with manual learning

Fast today, incompetent tomorrow is not a winning strategy.

The Security Officer's Perspective

Wearing my security hat, AI introduces risks most organizations aren't addressing:

Data Exposure Through Prompts

Every time a developer pastes code into an AI tool, they might expose proprietary logic, internal APIs, or security patterns. Many consumer AI tools' default terms allow training on submitted data unless you opt out.

Our policy: Approved enterprise AI tools only. No proprietary code in public AI services.

AI-Generated Vulnerabilities

AI doesn't understand YOUR threat model. It might suggest logging that captures sensitive data, error messages revealing system internals, or authentication patterns inappropriate for regulated data.

Our approach: Security review explicitly checks "Is this AI-generated?" with a different focus than traditional review.

Compliance Implications

When AI writes code that processes regulated data, who's responsible? Always the organization—not the AI vendor, tool, or developer.

Our stance: AI is a coding assistant, not a compliance consultant. Same standards apply regardless of how code was written.

The Paradox

Here's the tension I manage daily:

Productivity pressure: "AI makes us noticeably faster. We should use it everywhere."

Security responsibility: "AI introduces risks we haven't fully characterized."

Both are true. The key is thoughtful policies, not blanket approval or prohibition.

Building AI-Ready Organizations

As I work through AI integration in my organization, here's my approach:

1. Clear Policies Before Widespread Adoption

Define early:

Which AI tools are approved (and for what)
What data can be shared with AI services
Who reviews AI-generated decisions
How we measure AI effectiveness

Cleaning up after uncontrolled AI adoption is harder than setting guardrails upfront.

2. AI Literacy Across ALL Teams

Not just developers. Operations using AI for troubleshooting. Security teams for threat analysis. Product teams for research.

The goal: Everyone understands what AI can do, what it can't, and when to trust it.

3. Hybrid Skill Development

Teach both:

How to use AI effectively (speed)
Core fundamentals without AI (sustainability)

Engineers who only work with AI are fragile. Engineers who refuse AI are inefficient. The target: engineers who use AI to amplify expertise, not replace it.

4. Maintain Core Competencies

AI might go down. Terms might change. Your team still needs to function.

In practice:

Core skills development continues
Documentation assumes AI might not be available
Regular validation: Can we operate without AI?

Over-dependence on any tool is organizational risk.

5. Culture of Honest Sharing

Create a safe environment to share both wins and failures:

"AI helped me solve this in minutes"
"AI suggested something dangerously wrong"
"I don't know when to trust AI on this"

Best learning comes from honest experience sharing, not success theater.

What I'm Still Learning

Full transparency: After 20 years in tech and recent months deeply exploring AI integration, I don't have all the answers.

Questions I'm still working through:

How do you balance AI speed with knowledge transfer?
What's the right level of AI assistance before it becomes a crutch?
What are the long-term implications of AI-heavy development?
How do you maintain deep technical skills in an AI-assisted world?
What organizational structure works best with AI?

If you've solved any of these, I'd genuinely love to hear about it. The best insights come from shared experience, not lone genius.

What's Coming: Near-Term Reality

I'm cautious about long-term predictions—AI is moving too fast. But here's what I'm seeing emerge in the next 6-18 months:

Agentic AI: The Shift Nobody's Talking About

The next wave isn't better code generation. It's AI agents that can execute complex multi-step tasks autonomously.

What this means for infrastructure:

AI doesn't just suggest a fix—it researches the problem, proposes solutions, tests them, and implements the best one (with human approval)
Not "here's a Terraform module" but "I analyzed your requirements, designed the architecture, wrote the code, tested it, and here's why this approach is best"
Multi-step troubleshooting: AI investigates logs, correlates across systems, identifies root cause, proposes fix, tests in staging

What I'm watching: Tools like AutoGPT, LangChain agents, and infrastructure-specific agentic systems. Early, but moving fast.

The leadership question: How do you manage teams when AI can execute entire workflows? What's the human role?

What's Actually Emerging (6-12 months):

Specialized Models: AI trained specifically on Terraform, Kubernetes, CloudFormation. Not general-purpose models trying to understand infrastructure—purpose-built for it.

Better Context Understanding: AI that knows your organization's patterns, not just generic best practices. Learns from your infrastructure decisions over time.

Improved Security Detection: Models that understand infrastructure attack patterns and your specific threat model, not just code syntax.

What I'm Experimenting With (12-18 months):

Semi-Autonomous Remediation: AI identifies issue, proposes fix with confidence score, human approves, AI implements. Not fully autonomous, but much faster than manual.

Predictive Capabilities: Pattern recognition that warns "this will fail" before it does, based on degrading metrics a human wouldn't catch.

Cross-System Intelligence: AI that understands how changes in one system impact others across your entire infrastructure.

What I'm NOT Predicting:

Beyond 18 months, it's speculation. The pace of change in AI makes 2-5 year predictions meaningless.

But the trajectory is clear: More autonomous, more context-aware, more proactive. The question isn't "will this happen" but "how do we prepare for it."

My Rules for AI in Organizations

As I experiment with AI across development, operations, and security, here's what I'm learning to follow:

Rule 1: AI Suggests, Humans Decide

Never auto-apply AI recommendations without review. Context, business requirements, and risk tolerance matter. AI doesn't know these.
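
One way to make this rule hard to bypass is to put the approval step in the code path itself. The sketch below assumes a recommendation arrives with a model-reported confidence score; the `Recommendation` type and the `approve` and `apply` callbacks are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Recommendation:
    summary: str
    confidence: float  # model-reported score in [0, 1]; hypothetical field


def apply_with_approval(
    rec: Recommendation,
    approve: Callable[[Recommendation], bool],
    apply: Callable[[Recommendation], None],
) -> str:
    """Apply a recommendation only after a human approval callback returns True.

    Confidence is shown to the reviewer but never used to skip the review.
    """
    if not approve(rec):
        return "rejected"
    apply(rec)
    return "applied"


if __name__ == "__main__":
    applied = []
    rec = Recommendation("Restart the ingest worker", confidence=0.97)
    # In a real tool the approver would be a ticket, chat prompt, or CLI confirmation.
    print(apply_with_approval(rec, approve=lambda r: False, apply=applied.append))
    print(apply_with_approval(rec, approve=lambda r: True, apply=applied.append))
```

Note that there is deliberately no confidence threshold that auto-applies: a high score informs the reviewer, it does not replace one.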

Rule 2: Verify Everything

AI-generated code, security recommendations, architecture suggestions—all get the same scrutiny as human work.

Rule 3: Start Small, Prove Value, Then Scale

Experiment in development. Measure results. If it works, expand to QA, then staging, then production. Don't go all-in until you've proven it works in your environment.

Rule 4: Measure ROI Ruthlessly

Track time saved, quality maintained, issues introduced, costs incurred. If ROI isn't clearly positive, stop using that AI application.

Rule 5: Keep Humans in the Loop

AI amplifies human expertise. It doesn't replace judgment, accountability, or responsibility. The most effective organizations use AI to make their humans better, not to use fewer humans.

What I'm Building

I'm actively working on AI-integrated tools exploring these concepts:

Predictive cost optimization that learns from usage patterns
Security anomaly detection for specific infrastructure
Intelligent alerting that reduces noise and surfaces real issues

These are experiments, not products. When they mature you'll find them at gitlab.com/mikefalk.

Why share this? Because the best way to learn is to build. And the best way to improve is to share what you build.

The Bottom Line for Technology Leaders

After 20 years in technology and recent deep exploration of AI integration into organizational workflows, here's what I believe:

AI is real. Not someday. Not in five years. Today.

But it's not magic. It's a powerful tool that requires thoughtful leadership.

The organizations winning with AI aren't replacing humans with AI. They're using AI to make their humans dramatically more effective.

That requires:

Clear policies and guardrails
Investment in verification skills
Balanced approach to speed and learning
Security awareness alongside productivity
Measurement, not faith
Cultural honesty about what works and what doesn't

The technology is the easy part. Building teams that use AI effectively while maintaining security, quality, and core competencies—that's the leadership challenge.

And that's always been true in technology. New tools, same fundamental leadership principles. AI just raises the stakes and accelerates everything.

What I'm Still Figuring Out

I'm sharing what I've learned so far. But I'm also still learning.

Open questions:

Optimal balance between AI assistance and skill development
Long-term career implications for engineers in AI-heavy environments
Best organizational structures for AI-first development
How to maintain innovation when AI makes execution so much faster
The competitive advantage beyond "we use AI too"

If you're leading teams through similar transformations, I'd love to compare notes. Not because I have answers, but because the best solutions come from shared learning.

Let's Discuss

What's your experience leading teams in the AI era? What's working in your organization? What challenges are you facing?

Reach out: linkedin.com/in/mikefalkenberg

The best insights come from practitioners sharing honest experiences. If you're building AI-ready organizations, let's learn from each other.

Mike Falkenberg is a technologist with 20+ years leading development, operations, and security teams. He shares practical code and organizational insights from building world-class technology organizations. Follow on GitLab for code and Dev.to for articles.

The $200K Mistake: Why Your Dev Environments Cost as Much as Production (And how a simple automation pattern can fix it)

Mike Falkenberg — Sun, 26 Oct 2025 20:23:35 +0000

The Wake-Up Call

Let me tell you about a conversation I've had more times than I can count:

Finance: "Our AWS bill is $45,000 this month. Why is it so high?"

Engineering: "We need resources to develop and test. It's the cost of doing business."

Finance: "But your dev environment costs $18,000. That's 40% of the total. For testing?"

Engineering: "Well… it has to be available when we need it."

Here's what nobody says out loud: That dev environment is idle nearly 60% of the time.

The Math Nobody Wants to Do

Let's break down a typical dev/test environment:

Running 24/7 (US-East-1 pricing):

3× t3.large EC2 instances: ~$61/month each = $183
1× db.t3.large RDS (SQL Server Web): ~$109/month
1× Application Load Balancer: ~$23/month
Supporting resources (EBS, data transfer, backups): ~$50/month

Monthly cost: ~$365/month

Annual cost: ~$4,380

But here's the reality:

Business hours: Monday-Friday, 6 AM - 8 PM = 70 hours/week
Total hours in a week: 168 hours
Actual usage: 42% of the time

You're paying 100% for 42% utilization.
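
The utilization figure is plain arithmetic, sketched here so you can plug in your own hours. The `utilization` helper is a hypothetical name; the 14 hours and 5 days come from the business-hours window above.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def utilization(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of the week an environment is actually in use."""
    return hours_per_day * days_per_week / HOURS_PER_WEEK


# 6 AM to 8 PM is 14 hours, Monday through Friday.
in_use = utilization(14, 5)
print(f"in use: {in_use:.0%}, idle: {1 - in_use:.0%}")
# in use: 42%, idle: 58%
```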

The $200K Mistake (Real Numbers)

Now multiply that across a typical organization with multiple non-production environments:

Example organization with 6 environments:

Dev environment: $4,380/year
QA environment: $6,500/year
Staging environment: $8,200/year
Performance testing: $12,000/year
Integration environment: $5,500/year
Demo environment: $3,800/year

Total cost running 24/7: $40,380/year

With shutdown automation (running 14 hours/day, weekdays only):

Compute savings: ~58% of EC2 + RDS compute costs
Storage costs unchanged (EBS, RDS storage)
Realistic annual savings: ~$16,800/year
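
A rough way to estimate this for a single environment, assuming only EC2 and RDS instance hours stop billing while storage, the load balancer, and backups keep running. The `monthly_savings` helper is hypothetical, and the inputs are the dev environment figures from the breakdown above with a 70-hour running week.

```python
def monthly_savings(compute_cost: float, hours_on_per_week: float) -> float:
    """Savings from stopping compute outside the running window.

    Storage, load balancers, and backups are assumed to keep billing.
    """
    off_fraction = 1 - hours_on_per_week / 168
    return compute_cost * off_fraction


# Dev environment: 3 EC2 instances at ~$61 plus one RDS instance at ~$109.
dev_compute = 3 * 61 + 109  # $292/month of stoppable compute
saving = monthly_savings(dev_compute, hours_on_per_week=70)
print(f"~${saving:.0f}/month, ~${saving * 12:.0f}/year")
```

Treat the result as a rough ceiling rather than a forecast: manual overrides, late-running jobs, and any compute left untagged all pull the real number down.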

Scale this across different org sizes:

Small (3-4 environments): ~$10K-15K/year saved
Medium (6-8 environments): ~$25K-35K/year saved
Large (10-15 environments): ~$50K-75K/year saved
Enterprise (20+ environments): $100K-200K+/year saved

That's where the $200K comes from - organizations with extensive non-production infrastructure.

Why Smart People Keep Making This Mistake

It's not ignorance. Every engineering leader knows this. But they don't fix it because:

Reason 1: "It's Too Complex"

"We'd need to coordinate shutdowns, handle stateful applications, manage startup sequences…"

Reason 2: "Someone Might Need It"

"What if a developer needs to test something at 10 PM?"

Reason 3: "We'll Get to It Later"

"We have more important priorities right now."

Reason 4: "The Savings Aren't Worth the Risk"

"What if something breaks and we can't start it back up?"

The truth? All of these are solvable. And the ROI is massive.

The Simple Solution

Here's what works (and I've built it multiple times):

The Pattern:

Tag resources with AutoShutdown=true
Lambda function triggered by EventBridge at 8 PM → stops tagged resources
Lambda function triggered by EventBridge at 6 AM → starts tagged resources
CloudWatch Logs capture everything for debugging
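
The shutdown half of that pattern can be sketched in a few lines with boto3. This is an illustration under stated assumptions, not the code from the repository: the `DRY_RUN` environment variable and the helper names are hypothetical, and only EC2 is covered.

```python
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Only running instances that carry the AutoShutdown=true tag.
TAG_FILTERS = [
    {"Name": "tag:AutoShutdown", "Values": ["true"]},
    {"Name": "instance-state-name", "Values": ["running"]},
]


def find_tagged_running(ec2) -> list[str]:
    """Return IDs of running instances tagged AutoShutdown=true."""
    ids = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=TAG_FILTERS):
        for reservation in page["Reservations"]:
            ids.extend(i["InstanceId"] for i in reservation["Instances"])
    return ids


def stop_tagged(ec2, dry_run: bool = True) -> list[str]:
    """Stop tagged instances, or only log them when dry_run is set."""
    ids = find_tagged_running(ec2)
    if not ids:
        logger.info("No running instances tagged AutoShutdown=true")
    elif dry_run:
        logger.info("DRY RUN: would stop %s", ids)
    else:
        ec2.stop_instances(InstanceIds=ids)
        logger.info("Stopped %s", ids)
    return ids


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be tested without AWS

    # DRY_RUN is a hypothetical environment variable; anything but "false" stays safe.
    dry_run = os.environ.get("DRY_RUN", "true").lower() != "false"
    return {"instances": stop_tagged(boto3.client("ec2"), dry_run=dry_run)}
```

RDS needs its own loop, since database instances are stopped one at a time with the RDS client's `stop_db_instance` call. Defaulting to dry-run means a misconfigured deployment logs instead of stopping anything.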

Total development time: 4-6 hours

Total maintenance time: ~1 hour/year

The Results:

Dev environment runs 14 hours/day instead of 24
Cost: $365/month → $215/month = $150/month savings
Annual savings: ~$1,800 per environment
Payback: Less than 2 weeks of engineering time

Five environments? ~$9,000/year savings. Every year.

Ten environments? ~$18,000/year savings.

Real-World Implementation

I've implemented this pattern across multiple organizations. Here's what actually happens:

Month 1: Skepticism

"This won't work because [various concerns]."

Month 2: Testing

Enable dry-run mode, validate the automation, address edge cases.

Month 3: Small Scale

Apply to 1-2 non-critical environments.

Month 4: Realization

"Wait, this actually works and we haven't had issues?"

Month 6: Full Deployment

All non-production environments automated.

Month 12: Finance is Happy

Non-production cloud spend down 30-40% with zero impact on development velocity.

Common Objections (And Answers)

"What if someone needs it after hours?"

Answer: Manual override takes 30 seconds:

aws ec2 start-instances --instance-ids i-xxxxx

Or keep a single "always-on" environment for emergencies.

"What about stateful applications?"

Answer: That's what graceful shutdown scripts are for. And honestly, if your dev environment can't handle a restart, you have bigger problems.

"What if startup fails?"

Answer: CloudWatch alarms notify you. But in 3+ years of running this, startup failures are vanishingly rare (<0.1% of attempts).
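
For the alarm to have something to fire on, the startup function has to notice when instances do not come up. A minimal sketch using the EC2 `instance_running` waiter, assuming a boto3 EC2 client is passed in; the `start_and_verify` name and the timing defaults are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def start_and_verify(ec2, instance_ids, delay_seconds=15, max_attempts=20):
    """Start instances and wait until EC2 reports them as running.

    Returns False when the waiter gives up, so the caller can raise an alert.
    """
    if not instance_ids:
        return True
    ec2.start_instances(InstanceIds=instance_ids)
    waiter = ec2.get_waiter("instance_running")
    try:
        waiter.wait(
            InstanceIds=instance_ids,
            WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
        )
    except Exception:  # botocore raises WaiterError on timeout or a failed state
        logger.exception("Instances did not reach running state: %s", instance_ids)
        return False
    return True
```

With these defaults the waiter can block for up to five minutes, so the Lambda timeout has to be set longer than that. The logged error is what a CloudWatch metric filter or alarm would key on.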

"This seems risky."

Answer: You know what's risky? Explaining to the CEO why you're spending $200K/year on environments that sit idle 60% of the time.

The Business Case

When presenting this to leadership:

Investment:

Development: 6-8 hours
Testing: 4 hours
Deployment: 2 hours

Total cost: ~$2,000 in engineering time

Return:

Monthly savings: $750 - $3,000 (depending on environment count)
Annual savings: $9,000 - $36,000 (for 5-10 environments)
Payback: one to three months
Year 1 ROI: roughly 350-1,700% net of the ~$2,000 investment

What executive turns down that kind of ROI?
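
If leadership wants to see the working, the arithmetic fits in a few lines. This sketch uses the net definition of ROI (first-year savings minus the investment, divided by the investment), so the percentages come out a little lower than a gross savings-to-cost ratio. The function names are hypothetical; the $2,000 cost and the $750 to $3,000 monthly savings are the figures from this section.

```python
def payback_months(cost: float, monthly_savings: float) -> float:
    """Months of savings needed to cover the one-time engineering cost."""
    return cost / monthly_savings


def first_year_roi(cost: float, monthly_savings: float) -> float:
    """Net first-year return as a fraction of the investment."""
    return (monthly_savings * 12 - cost) / cost


for monthly in (750, 3000):
    print(
        f"${monthly}/month: payback {payback_months(2000, monthly):.1f} months, "
        f"ROI {first_year_roi(2000, monthly):.0%}"
    )
# $750/month: payback 2.7 months, ROI 350%
# $3000/month: payback 0.7 months, ROI 1700%
```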

Implementation Guide

Phase 1: Pilot (Week 1)

Choose non-critical dev environment
Tag resources with AutoShutdown=true
Deploy Lambda functions in dry-run mode
Verify it detects the right resources
Review logs daily

Phase 2: Live Test (Week 2-3)

Enable actual shutdown/startup for pilot environment
Monitor for issues
Survey developers for impact
Measure actual savings

Phase 3: Expand (Week 4-6)

Apply to QA, staging, other dev environments
Refine schedules based on actual usage
Add manual override documentation
Train team on override procedures
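
Refining schedules is easier when they live in one place as data. The sketch below builds arguments for EventBridge Scheduler's `create_schedule` call, which accepts a time zone so the 8 PM and 6 AM times do not need converting to UTC by hand. The schedule names and the `America/New_York` default are assumptions for illustration, and the weekday-only window is one reading of the business hours described earlier.

```python
def schedule_definitions(timezone_name: str = "America/New_York") -> list[dict]:
    """Build partial keyword arguments for EventBridge Scheduler's create_schedule.

    Each definition still needs a Target (Lambda ARN plus an invocation role ARN)
    before it can be passed to boto3.client("scheduler").create_schedule(**definition).
    """
    common = {
        "ScheduleExpressionTimezone": timezone_name,
        "FlexibleTimeWindow": {"Mode": "OFF"},
    }
    return [
        # Cron fields: minutes hours day-of-month month day-of-week year
        {"Name": "dev-stop-weekdays", "ScheduleExpression": "cron(0 20 ? * MON-FRI *)", **common},
        {"Name": "dev-start-weekdays", "ScheduleExpression": "cron(0 6 ? * MON-FRI *)", **common},
    ]


if __name__ == "__main__":
    for definition in schedule_definitions():
        print(definition["Name"], definition["ScheduleExpression"])
```

Classic EventBridge rules evaluate cron expressions in UTC, so if you use rules instead of Scheduler, shift the hours accordingly.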

Phase 4: Monitor (Ongoing)

Monthly cost review
Quarterly automation health check
Adjust schedules as teams grow/change

The Code

I've made the complete solution publicly available: cloud-cost-optimizer

What's included:

Python Lambda functions (startup + shutdown)
Terraform deployment modules
EventBridge scheduling
CloudWatch logging
Dry-run testing mode
Complete documentation

Deploy it: 30 minutes

Start saving: Immediately

Beyond the Savings

Here's what I've learned implementing this across different organizations:

The Hidden Benefits:

1. Forces Infrastructure as Code
If you can't recreate your environment from code, you can't safely shut it down. This automation forces good IaC practices.

2. Identifies Zombie Resources
When you start tagging for shutdown, you find resources nobody remembers creating. Decommission those and save even more.

3. Improves Disaster Recovery
Regular shutdown/startup cycles are basically DR testing. You'll catch startup failures in dev, not during an actual outage.

4. Changes Team Behavior
When environments shut down daily, teams get better at quick provisioning and stateless design.
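
For the zombie-resource hunt in particular, a small report of instances nobody has tagged is a useful starting list. This sketch assumes you already have the `Reservations` list from an EC2 `describe_instances` response; the `untagged_instances` helper is a hypothetical name.

```python
def untagged_instances(reservations: list[dict], tag_key: str = "AutoShutdown") -> list[str]:
    """Return IDs of instances that carry no tag named tag_key at all."""
    missing = []
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            # EC2 omits the Tags key entirely when an instance has no tags.
            keys = {tag["Key"] for tag in instance.get("Tags", [])}
            if tag_key not in keys:
                missing.append(instance["InstanceId"])
    return missing


if __name__ == "__main__":
    sample = [{"Instances": [
        {"InstanceId": "i-aaa", "Tags": [{"Key": "AutoShutdown", "Value": "true"}]},
        {"InstanceId": "i-bbb", "Tags": [{"Key": "Name", "Value": "old-poc"}]},
        {"InstanceId": "i-ccc"},
    ]}]
    print(untagged_instances(sample))
```

Anything on that list needs an owner to decide between tagging it, exempting it explicitly, or decommissioning it.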

The Bottom Line

The $200K mistake isn't technical—it's organizational. The solution exists. The ROI is proven. The risk is minimal.

What's stopping you is inertia, not engineering.

If finance is asking questions about your cloud bill, this is the easiest win you'll get all year. Six hours of work, $50K-$200K in annual savings, and you look like a hero.

Or keep paying full price for idle resources. Your call.

A Note on Pricing

AWS pricing is based on US-East-1 rates as of October 2025. Your actual costs will vary based on region, instance types, reserved instances, and specific usage patterns. Use the AWS Pricing Calculator for your exact scenario. The savings percentage depends mostly on how many hours you shut down and how much of the bill is stoppable compute, so it holds up reasonably well across regions and rates.

Try It Yourself

Calculate your current dev/test environment costs
Multiply by 0.4 for a conservative estimate (typical savings run 40-60%)
Clone the cloud-cost-optimizer
Deploy to one environment in dry-run mode
Watch the logs for a week
Enable it for real
Watch your costs drop

What do you have to lose? (Besides $200K/year.)

Let's Discuss

Have you implemented cost optimization automation? What worked? What didn't?

Reach out: linkedin.com/in/mikefalkenberg

Or better yet, try the code and open an issue if you hit snags. That's what it's there for.