Forem: Benard Otieno

Pick Boring Technology. Yes, Especially for AI.

Benard Otieno — Fri, 15 May 2026 11:50:52 +0000

What "Boring" Actually Means

Boring technology does not mean old technology. It does not mean slow, limited, or low-quality. It means technology that has been in production long enough that its failure modes are documented, its operational characteristics are well understood, and the person debugging it at 2am has a reasonable chance of finding a Stack Overflow answer rather than a lone beta-forum post from 2024.

Postgres is boring. Redis is boring. S3 is boring. A plain HTTP API with JSON is boring. SQLite, for things that fit in SQLite, is boring. None of these things are slow, limited, or embarrassing to use. They are boring because they have been deployed by enough people, at enough scale, for long enough that the surprises have mostly been found. The surface area of "things that can go wrong that nobody has written about" is small.

When Dan McKinley wrote the Choose Boring Technology essay in 2015, he framed it as a budget: you get a limited number of new technologies per project, and you should spend that budget intentionally. That framing is still correct. What's changed is that AI products have a non-negotiable budget item now: the model and the scaffolding around it. That item is expensive. It is genuinely new. It has failure modes that nobody fully understands yet. That is the place you are choosing to spend your novelty budget. Everywhere else, the argument for boring is stronger than it has ever been.

The Vector Database Problem

The most common place I see this play out is in the retrieval layer. A team is building RAG — retrieval-augmented generation, some form of semantic search over a corpus. They need to store embeddings and query them by similarity. There are purpose-built vector databases for this: Pinecone, Weaviate, Qdrant, Chroma. They have impressive benchmarks, polished SDKs, and marketing copy that makes Postgres look like a horse and buggy.

So teams reach for them. Then six months later they are managing two separate databases — Postgres for everything else, Pinecone for vectors — running two separate migration workflows, debugging sync issues between them, and paying for an additional managed service. The team that wanted to move fast has added an operational surface area that requires dedicated attention.

pgvector exists. It is a Postgres extension. It is boring. It stores vectors in Postgres, queries them in Postgres, and updates them in the same Postgres transactions as the rest of your data. You run one database. You use the migration tooling you already have. You query it with SQL you already know. The performance ceiling is lower than a dedicated system optimised for nothing but ANN search, but the teams I've talked to who hit that ceiling with pgvector are building at a scale where infrastructure complexity is genuinely their problem to manage. Most teams are not those teams.

The right question is not "what is the best vector database." It is "what is the simplest thing that handles my actual query volume, that I can operate with my existing knowledge, that does not require me to manage data consistency across two systems." The answer to that question, for most products, is Postgres.

-- pgvector: you already know how to do this
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  content    text NOT NULL,
  embedding  vector(1536),
  metadata   jsonb,
  created_at timestamptz DEFAULT now()
);

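-- ivfflat builds its lists from existing rows, so create this index after loading data;
-- lists = 100 is a starting point to tune, not a recommendation
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);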

-- Retrieval: pure SQL, same connection pool as everything else
SELECT
  id,
  content,
  metadata,
  1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 10;

That is the retrieval layer for a production RAG system. It is a Postgres query. You already know how to read it, index it, back it up, and monitor it.
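
If the operator is unfamiliar: `<=>` is pgvector's cosine distance, so `1 - distance` in the query above is cosine similarity. Here is a rough pure-Python sketch of the same ranking, to show what the query computes rather than how pgvector implements it. The helper names (`cosine_distance`, `top_k`) and the three-dimensional toy vectors are made up for illustration.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance between two non-zero vectors (what <=> returns)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query: list[float], rows: dict[str, list[float]], k: int = 10) -> list[tuple[str, float]]:
    """Brute-force stand-in for ORDER BY embedding <=> query LIMIT k.

    Returns (id, similarity) pairs, most similar first.
    """
    ranked = sorted(rows.items(), key=lambda item: cosine_distance(item[1], query))
    return [(doc_id, 1.0 - cosine_distance(vec, query)) for doc_id, vec in ranked[:k]]

# Toy data: illustrative values only
rows = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [1.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], rows, k=2)
```

The brute-force scan is exact but linear in the number of rows; the index in the SQL above trades a little recall for not having to scan everything.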

Agent Frameworks Are the Same Problem, Bigger

The vector database situation is a contained example. Agent frameworks are the same problem, scaled up.

There are now a meaningful number of agent frameworks in active development: LangChain, LangGraph, AutoGen, CrewAI, Pydantic AI, and several more depending on when you are reading this. They differ in their abstractions for tool calling, memory management, multi-agent coordination, and state persistence. Some of them are good. Some of them are in the process of becoming good. All of them are new enough that you are, to some degree, a beta tester.

The alternative is to not use a framework for the parts that don't require one. The model's tool-calling API is not complicated. You define tools as JSON schemas. The model returns a function call. You route it and return the result. That is the loop. You can implement the core of it in a hundred lines of Python that you wrote, that you understand completely, that has no transitive dependencies you didn't choose.

import anthropic
import json

client = anthropic.Anthropic()

def run_agent(system: str, user_message: str, tools: list, tool_handlers: dict) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=messages,
        )

        # Anything other than a tool request (end_turn, max_tokens, ...) ends the loop
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            handler = tool_handlers.get(block.name)
            if not handler:
                raise ValueError(f"No handler for tool: {block.name}")
            result = handler(**block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result),
            })

        # Extend conversation with model turn and tool results
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

That is a complete agentic loop. No framework. No magic. Every line of it is readable by someone who has never seen it before. When it breaks, you know where to look. When you need to add a checkpoint before a destructive operation, you know exactly where to put it. When a framework update ships a breaking change to how tool results are structured, you are unaffected because you wrote the tool result handling yourself.
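
To make the `tools` and `tool_handlers` arguments concrete, and to show one place a checkpoint could sit, here is a sketch. The tool names (`read_file`, `delete_file`), the in-memory `FILES` store, the `DESTRUCTIVE` set, and the `approve` callback are all hypothetical. Only the schema shape (`name`, `description`, `input_schema`) comes from Anthropic's tool-use format.

```python
from typing import Any, Callable

# Hypothetical in-memory "filesystem" so the sketch runs without touching disk
FILES: dict[str, str] = {"notes.txt": "hello"}

def _path_schema() -> dict:
    return {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    }

# Tools are plain JSON schemas
tools = [
    {"name": "read_file", "description": "Return a file's contents.", "input_schema": _path_schema()},
    {"name": "delete_file", "description": "Delete a file.", "input_schema": _path_schema()},
]

def read_file(path: str) -> dict:
    return {"content": FILES[path]}

def delete_file(path: str) -> dict:
    FILES.pop(path)
    return {"deleted": path}

# Tools that must pass a checkpoint before they run (an assumption of this sketch)
DESTRUCTIVE = {"delete_file"}

def guarded(name: str, handler: Callable[..., Any],
            approve: Callable[[str, dict], bool]) -> Callable[..., Any]:
    """Wrap a handler so destructive tools run only when approve() says yes."""
    def wrapper(**kwargs: Any) -> Any:
        if name in DESTRUCTIVE and not approve(name, kwargs):
            # Returned to the model as an ordinary tool result
            return {"error": f"{name} was not approved"}
        return handler(**kwargs)
    return wrapper

def deny_all(name: str, args: dict) -> bool:
    """Placeholder policy; a real one might prompt a human or check an allowlist."""
    return False

tool_handlers = {
    "read_file": guarded("read_file", read_file, deny_all),
    "delete_file": guarded("delete_file", delete_file, deny_all),
}
```

Because the checkpoint lives in the handler table, the loop itself does not change: the model sees a refusal as one more tool result and can decide what to do next.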

Frameworks earn their keep when they solve problems you genuinely have: complex multi-agent coordination, built-in state persistence, graph-based execution flows where you need cycle detection and conditional edges. If you have those problems, use a framework. But reach for your own loop first, and upgrade to a framework when you have a reason, not because the README has a compelling architecture diagram.

The Counterargument Is Real

I want to be honest about the case against this position, because it is not trivial.

Boring technology is not always available in the form you need. pgvector has a performance ceiling. If you are running similarity search across a hundred million vectors with sub-10ms latency requirements, you need a dedicated ANN index and the purpose-built databases are probably worth their operational cost. If your agent coordination is genuinely complex (multiple agents with heterogeneous capabilities, conditional routing based on intermediate state, nested tool calls), a framework that has already solved those problems is better than reinventing the solutions yourself.

The real trap is not "using new technology." It is using new technology as the default rather than as the exception. When you reach for Pinecone before asking whether pgvector handles your actual query volume, you have made a choice you probably did not mean to make. The question is whether you made it consciously.

What Changes When AI Is Involved

The argument for boring technology is not new. What AI changes is the urgency of it, for a specific reason: the model is already the source of novel, hard-to-predict behavior in your system. The model hallucinates. The model handles edge cases in ways you did not anticipate. The model's output quality varies with context length, with phrasing, with temperature settings you forgot you changed. The model is a continuous source of surprises, and managing those surprises is the actual engineering work.

When the model is already the unpredictable component, adding unpredictable infrastructure around it is compounding risk. A flaky external API call in your tool chain plus a model that sometimes decides to call that tool three times in a row plus a vector database that occasionally returns inconsistent results under concurrent load is not three small problems. It is three small problems that interact in ways you cannot enumerate in advance.

Boring infrastructure shrinks the problem space. When the retrieval layer is Postgres and the queue is Redis and the API is plain HTTP, the list of things that can behave unexpectedly in hard-to-reproduce ways is shorter. You are not eliminating surprises — the model will still surprise you — but you are constraining where they can come from.

The system that is easiest to debug is not the one with the fewest components. It is the one where the largest number of components have predictable, documented behavior. Build toward that, and let the model be the interesting part.

The Heuristic I Actually Use

When evaluating a new technology for an AI product, I ask three questions before I let it into the stack:

1. What happens when this fails? Not "can it fail" — everything can fail. What does failure look like? Is it a clean error or silent corruption? Is it recoverable without data loss? Is there a runbook for it, or will I be writing one?

2. Can I replace it in a weekend? This is not about whether I will replace it. It is about whether the abstraction is thin enough that swapping the implementation does not require a rewrite. If replacing the vector store requires touching thirty files, the abstraction is wrong.

3. Does my boring alternative exist and have I ruled it out? Postgres, Redis, S3, plain HTTP. If one of these handles the problem, I need a specific reason not to use it — not just a feeling that the new thing is more purpose-built.

If a technology passes all three, it can earn its place in the stack. If it fails any of them, the burden of proof is high.
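
The second question is the easiest to sketch in code. Below is one way a thin retrieval abstraction might look. The names (`VectorStore`, `Hit`, `InMemoryStore`, `retrieve`) are hypothetical, and the in-memory class is only a stand-in: a pgvector-backed class would expose the same two methods.

```python
import math
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Hit:
    doc_id: str
    score: float

class VectorStore(Protocol):
    """The only surface application code is allowed to depend on."""
    def upsert(self, doc_id: str, embedding: list[float]) -> None: ...
    def search(self, query: list[float], k: int) -> list[Hit]: ...

class InMemoryStore:
    """Stand-in implementation using brute-force cosine similarity."""
    def __init__(self) -> None:
        self._rows: dict[str, list[float]] = {}

    def upsert(self, doc_id: str, embedding: list[float]) -> None:
        self._rows[doc_id] = embedding

    def search(self, query: list[float], k: int) -> list[Hit]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))
        hits = [Hit(doc_id, cosine(vec, query)) for doc_id, vec in self._rows.items()]
        return sorted(hits, key=lambda h: h.score, reverse=True)[:k]

def retrieve(store: VectorStore, query: list[float], k: int = 3) -> list[str]:
    """Application code: knows the protocol, not the vendor."""
    return [hit.doc_id for hit in store.search(query, k)]

store = InMemoryStore()
store.upsert("a", [1.0, 0.0])
store.upsert("b", [0.0, 1.0])
doc_ids = retrieve(store, [1.0, 0.1], k=2)
```

If swapping the store means writing one new class with `upsert` and `search`, the replacement fits in a weekend. If `retrieve` and its callers import a vendor SDK directly, it probably does not.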

The teams that ship boring AI products are not the teams that lack ambition. They are the teams that understand where the ambition should go. The model is where the novel bets live. The model is where you spend the engineering attention on failure modes you have never seen before, on evaluation strategies that do not exist in textbooks yet, on product decisions that require genuine taste about AI behavior. That is the hard, interesting work.

Letting the infrastructure be interesting too is not ambitious. It is just expensive.

Make the retrieval layer boring. Make the queue boring. Make the API boring. Let Postgres handle the things Postgres is good at, which turn out to be most things. And spend the attention you just freed up on the part of the system that actually requires it.

Your Observability Is Looking at the Wrong Things

Benard Otieno — Thu, 14 May 2026 16:01:35 +0000

I've been in incident calls where every dashboard was green. Latency nominal. Error rate under 0.1%. CPU humming along at a comfortable 40%. And somewhere downstream, a critical workflow had been silently producing wrong results for six hours.

Nobody had an alert for "the thing is doing something, just not the right thing."

This is the gap most observability setups never close: they're watching the infrastructure, not the behavior. They'll tell you the system is alive. They won't tell you it's lying.

The Three Dials Everyone Watches

The default observability stack for most teams converges on the same three signals: uptime, latency, and error rate. These show up in every runbook, every SLA, every on-call rotation. They're not useless — a spike in error rate is real signal, a latency cliff is real signal — but they share a critical property: they're all lagging indicators of failure that's already happened.

More importantly, they only fire when the system is explicitly misbehaving. They say nothing about a system that's doing exactly what you told it to do, but where what you told it to do was wrong.

I had a recommendation service that returned results within 50ms, with a 0.02% error rate, and near-perfect uptime. It was also returning the same stale set of recommendations to every user because a cache invalidation job had silently stopped running four days earlier. The system was technically flawless. It had completely stopped serving its purpose.

The dashboard gave it a clean bill of health.

Logs Are Not a Narrative

The second failure mode is subtler. Most teams log well, in the sense that they log a lot. Request in. Response out. Exceptions caught and written somewhere. Database queries above a threshold. Auth events.

What they don't have is a narrative — a way to reconstruct what actually happened during a user's session, a job's execution, a transaction's lifecycle. Individual log lines are breadcrumbs. What you need is the trail.

The difference shows up immediately when something goes wrong. With breadcrumbs, you spend the first hour of an incident correlating timestamps across three different log streams, mentally assembling a sequence of events that should have been assembled for you. With a trail — structured traces with a shared correlation ID flowing through every service that touched a request — you open one query and see the story.

import uuid
import logging
import functools
from contextvars import ContextVar

logger = logging.getLogger(__name__)
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        cid = correlation_id.get()
        logger.info(
            "enter",
            extra={"fn": fn.__name__, "correlation_id": cid}
        )
        try:
            result = fn(*args, **kwargs)
            logger.info(
                "exit",
                extra={"fn": fn.__name__, "correlation_id": cid, "status": "ok"}
            )
            return result
        except Exception as e:
            logger.error(
                "error",
                extra={"fn": fn.__name__, "correlation_id": cid, "error": str(e)}
            )
            raise
    return wrapper

# At the edge: set once, propagate everywhere.
# process() stands in for your own request handler, decorated with @traced.
def handle_request(request):
    correlation_id.set(request.headers.get("X-Correlation-ID") or str(uuid.uuid4()))
    return process(request)

This is not complicated. It's not expensive. The reason most teams don't have it is that they added logging incrementally — one print statement at a time — and never stepped back to ask whether the sum of those statements could tell a story.
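
One practical detail: fields passed through `extra` only appear in output if the formatter emits them. A minimal sketch of that wiring with the standard library follows. The `CorrelationFilter` and `JsonFormatter` classes and the `traced_demo` logger name are assumptions of this example, and it redefines `correlation_id` so that it runs without the code above.

```python
import io
import json
import logging
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

class CorrelationFilter(logging.Filter):
    """Stamp every record with the current correlation ID if it lacks one."""
    def filter(self, record: logging.LogRecord) -> bool:
        if not hasattr(record, "correlation_id"):
            record.correlation_id = correlation_id.get()
        return True

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a log store can query by field."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", ""),
            "fn": getattr(record, "fn", None),
        })

# Write to an in-memory stream here; a real service would use stdout or a file
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
handler.addFilter(CorrelationFilter())

logger = logging.getLogger("traced_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

correlation_id.set("req-123")
logger.info("enter", extra={"fn": "process"})
line = json.loads(stream.getvalue())
```

With the filter on the handler, even log calls that forget to pass `correlation_id` still carry it, which keeps the trail intact.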

Metrics Without a Baseline Are Just Numbers

Here's a metric: your API is returning responses in 340ms.

Is that good? Bad? Degraded from yesterday? Normal for this time of week? You cannot answer without a baseline, and most teams don't have one that's precise enough to be useful.

What typically exists is a static threshold: alert if latency exceeds 500ms. That threshold was set during initial deployment, when load was a tenth of what it is now, and hasn't been revisited since. It's not a baseline — it's a guess that calcified into a rule.

A real baseline is dynamic. It accounts for time of day, day of week, and recent trend. It flags when you're 30% above your own normal, not when you cross an arbitrary line someone set two years ago.

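from collections import deque
from statistics import mean, stdev
from datetime import datetime, timezone

class AdaptiveBaseline:
    # Rolling baseline over the most recent samples. A single window tracks
    # recent trend only; it does not model time-of-day or day-of-week patterns.
    def __init__(self, window_size=1440):  # 24h of per-minute samples
        self.samples = deque(maxlen=window_size)

    def record(self, value: float):
        # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
        self.samples.append((datetime.now(timezone.utc), value))

    def is_anomalous(self, value: float, threshold_stdev: float = 2.5) -> bool:
        if len(self.samples) < 60:
            return False  # not enough data to have an opinion
        recent = [v for _, v in self.samples]
        m = mean(recent)
        s = stdev(recent)
        if s == 0:
            return False  # flat history: no spread to measure deviation against
        return abs(value - m) > threshold_stdev * s

    def summary(self) -> dict:
        if not self.samples:
            return {}
        values = [v for _, v in self.samples]
        # stdev() raises StatisticsError with fewer than two samples
        spread = stdev(values) if len(values) > 1 else 0.0
        return {"mean": mean(values), "stdev": spread, "n": len(values)}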

Static thresholds are a lazy stand-in for understanding your system's normal. They exist because setting them takes five minutes, and building real baselines takes an afternoon. That tradeoff looks different at 2am when an alert fires on a load pattern that's been there for three weeks.
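
A single rolling window captures recent trend but not the time-of-day and day-of-week effects mentioned earlier. One possible extension, sketched under assumptions: keep a separate window per (weekday, hour) bucket. The class name `SeasonalBaseline`, the bucket sizes, and the sample values are all illustrative.

```python
from collections import defaultdict, deque
from datetime import datetime, timezone
from statistics import mean, stdev

class SeasonalBaseline:
    """Rolling baseline kept separately for each (weekday, hour) bucket."""

    def __init__(self, per_bucket: int = 240, min_samples: int = 30):
        self.min_samples = min_samples
        self.buckets: dict[tuple[int, int], deque] = defaultdict(
            lambda: deque(maxlen=per_bucket)
        )

    @staticmethod
    def _key(ts: datetime) -> tuple[int, int]:
        return (ts.weekday(), ts.hour)  # Monday is 0, Sunday is 6

    def record(self, ts: datetime, value: float) -> None:
        self.buckets[self._key(ts)].append(value)

    def is_anomalous(self, ts: datetime, value: float, threshold_stdev: float = 2.5) -> bool:
        samples = self.buckets.get(self._key(ts), ())
        if len(samples) < self.min_samples:
            return False  # no opinion until this bucket has enough history
        m, s = mean(samples), stdev(samples)
        if s == 0:
            return False
        return abs(value - m) > threshold_stdev * s

# Illustrative history: busy Monday mornings, quiet Sunday nights
monday_9am = datetime(2026, 5, 11, 9, 0, tzinfo=timezone.utc)
sunday_3am = datetime(2026, 5, 10, 3, 0, tzinfo=timezone.utc)

baseline = SeasonalBaseline()
for i in range(40):
    baseline.record(monday_9am, 290.0 if i % 2 else 310.0)  # latency in ms
    baseline.record(sunday_3am, 95.0 if i % 2 else 105.0)

normal_on_monday = baseline.is_anomalous(monday_9am, 300.0)
odd_on_sunday = baseline.is_anomalous(sunday_3am, 300.0)
```

The same 300ms reading is unremarkable on Monday morning and anomalous on Sunday night, which is exactly the distinction a single static threshold cannot make.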

What Actually Belongs in Your Dashboards

The signals that matter fall into a different category than infrastructure health. They're about whether the system is doing its job, measured in terms the business cares about.

Throughput on the critical path. Not "requests per second" in aggregate — the specific count of the transactions that matter. Orders placed. Reports generated. Messages delivered. If that number is lower than expected, something is wrong, even if all your infra metrics are green.

Queue depth and processing age. If you have async workers, the age of the oldest unprocessed item is a more honest health signal than worker CPU. A queue that's growing is a system falling behind, regardless of what the workers themselves are reporting.

Business-level error rates, not HTTP error rates. A 200 response that returns an empty result set is not a success. A job that completes without exception but produces zero output has failed. You need to define success in terms of what the system was supposed to produce, then measure whether it produced it.

Derivative metrics. If your checkout conversion rate drops from 68% to 51%, that's a signal — even if no individual service is throwing errors. Tracking rates and ratios, not just raw counts, catches the class of failures where something is working but working worse.

# Prometheus recording rules — compute these, don't query them live
groups:
  - name: business_health
    interval: 60s
    rules:
      - record: job:orders_per_minute:rate
        expr: rate(orders_completed_total[5m]) * 60

      - record: job:checkout_conversion:ratio
        expr: |
          rate(checkouts_completed_total[10m])
          / rate(checkout_initiated_total[10m])

      - record: job:queue_age_seconds:max
        expr: time() - min(job_enqueued_timestamp_seconds)

  - name: alerts
    rules:
      - alert: ConversionRateDrop
        expr: job:checkout_conversion:ratio < 0.55
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Checkout conversion below 55% for 5+ minutes"

      - alert: QueueProcessingStalled
        expr: job:queue_age_seconds:max > 300
        for: 2m
        labels:
          severity: warning
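
The recording rules assume the application already exports outcome-level counters. As a rough sketch of what "a 200 with an empty result set is not a success" could look like in code: the `Response` type, the outcome labels, and the in-process `Counter` are hypothetical, and in a real service the counter would more likely be a metrics-library counter with an `outcome` label.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Response:
    status: int
    items: list = field(default_factory=list)

# In-process tally, used here only to keep the sketch self-contained
outcomes: Counter = Counter()

def classify(resp: Response) -> str:
    """Judge the response by what it was supposed to produce, not by its status code."""
    if resp.status >= 400:
        return "error"
    if not resp.items:
        return "empty"  # HTTP success, business failure
    return "ok"

def observe(resp: Response) -> str:
    outcome = classify(resp)
    outcomes[outcome] += 1
    return outcome

def business_error_rate() -> float:
    """Share of responses that did not deliver a usable result."""
    total = sum(outcomes.values())
    if total == 0:
        return 0.0
    return (total - outcomes["ok"]) / total

# Four requests, all HTTP 200, one of them empty
for items in (["rec-1"], ["rec-2"], [], ["rec-3"]):
    observe(Response(status=200, items=items))

rate = business_error_rate()
```

The HTTP error rate for those four requests is zero, yet one in four users got nothing.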

Alerts Should Be Harder to Silence Than to Fix

The last thing most teams get wrong is the incentive structure around noise. When alerts fire too often on non-issues, engineers start ignoring them — or worse, start routing around them. The standard fix is to raise thresholds and add retry logic so the alert doesn't fire. This is treating the symptom. The alert was lying because the metric was wrong, and the right fix is to measure something that's actually meaningful.

There's a useful rule here: if an alert fired and the on-call engineer's first instinct was to check whether it was a false positive, the alert is already broken. A good alert should produce a specific, directed response — not a "let me see if this is real" investigation. If you find yourself constantly confirming that real alerts are real, your signal-to-noise ratio is telling you something.

Flaky alerts are the observability equivalent of flaky tests. You know you have them. You've learned to distrust them. And every week they stay in the rotation makes you slightly less responsive to the ones that actually matter.

Track your alert false-positive rate like you track your error rate. Alert on your alerts. Set a rule that any alert firing more than twice without a corresponding incident review gets flagged for audit. This sounds bureaucratic until the first time you catch that a critical alert has been misfiring for three weeks and nobody noticed because the team had learned to dismiss it.
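
A sketch of that audit rule, assuming you can export alert firings together with whether each one was tied to an incident review. The `Firing` record, its field names, and the sample alert names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Firing:
    alert: str
    incident_review_id: Optional[str] = None  # None means nobody reviewed it
    false_positive: bool = False

def alerts_to_audit(firings: list[Firing], max_unreviewed: int = 2) -> list[str]:
    """Alerts that fired more than max_unreviewed times without an incident review."""
    unreviewed = Counter(f.alert for f in firings if f.incident_review_id is None)
    return sorted(name for name, count in unreviewed.items() if count > max_unreviewed)

def false_positive_rate(firings: list[Firing]) -> float:
    """Fraction of firings that were judged not to be real problems."""
    if not firings:
        return 0.0
    return sum(f.false_positive for f in firings) / len(firings)

# Illustrative history
history = [
    Firing("DiskFull", false_positive=True),
    Firing("DiskFull", false_positive=True),
    Firing("DiskFull", false_positive=True),
    Firing("ConversionRateDrop", incident_review_id="IR-1"),
    Firing("QueueProcessingStalled"),
    Firing("QueueProcessingStalled"),
]

flagged = alerts_to_audit(history)
fp_rate = false_positive_rate(history)
```

Run weekly, a report like this turns "we have learned to ignore that one" into a visible number someone has to act on.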

What You're Actually Missing

Most observability stacks are built to answer one question: is the system up? That's a fine question. It's just not the most important one.

The more useful questions are: is the system doing what users need? Is it doing it as well as it was yesterday? Is anything changing that I should know about before it becomes a problem?

Those questions require measuring at the level of behavior and outcome, not infrastructure and response codes. They require traces that tell a story instead of logs that record events. They require baselines instead of thresholds, and business metrics instead of system metrics.

None of this is exotic. The tooling exists — OpenTelemetry, Prometheus recording rules, structured logging with correlation IDs. The gap isn't tooling. It's the habit of reaching for the infrastructure dashboard first and calling it observability.

Start with one question: if your system silently started doing the wrong thing at 3am, how long would it take you to find out? If the answer is "until a user complained," your dashboards are watching the machine, not the work.

That's the thing worth fixing.

blog.bennerdo.org

Benard Otieno — Fri, 08 May 2026 09:09:21 +0000