Forem: Alexandre Amado de Castro

Stop Architecture Drift: Operationalizing ADRs with Automated Fitness Functions

Alexandre Amado de Castro — Tue, 07 Apr 2026 02:43:57 +0000

Six months after we standardized on OpenSearch, a pull request introduced Datadog into a service. The ADR existed. It had been discussed, approved, and stored in the repo. The PR was still green.

That is architecture drift. Not because engineers are careless. Because memory does not scale across hundreds of people and dozens of repositories.

After we started checking ADRs in CI, we caught several violations like this in the first month and dozens more in the first quarter before they reached production.

This is the difference between documenting architecture and operationalizing it. Documentation tells people what the decision was. Operationalization puts that decision in the path of delivery.

This post is the system behind that: take an ADR, turn it into an automated check, and run it against every PR diff. I went this direction because I needed one approach that worked across Go, Python, Terraform, and Kubernetes without maintaining a separate rule engine for each.

The Problem: Architecture Decisions Nobody Remembers

Most architecture teams do good work. They write ADRs, join design reviews, publish standards, and explain trade-offs. But documents do not participate in delivery. A decision can be technically correct and still have zero enforcement once it leaves the meeting.

At scale, that becomes a latency problem. You only discover drift after merge, during an incident, or when a migration turns into a quarter-long cleanup project. The issue is not that architects need more authority. The issue is that documentation does not stop drift. Running checks do.

The fix is to move architecture into the same path as tests, policy, and build checks. Make the decision executable.

A Quick Primer on Architecture Decision Records (ADRs)

If you already use Architecture Decision Records, this part will feel familiar. But the template matters here because bad ADRs are hard to operationalize. If the decision is vague, the automation will be vague too.

I still like Michael Nygard's original format, with a few practical tweaks, because it keeps the decision grounded in context instead of turning it into policy theater.

Here is the template I use:

# Architecture Decision Record

Date: YYYY-MM-DD

Status: proposed | accepted | deprecated | superseded (add reference)

## Context

This section describes the forces at play, including technological, political,
social, and project local. These forces are probably in tension and should be
called out as such. The language in this section is value-neutral. It is simply
describing facts.

## Considered Alternatives

This section describes the alternatives evaluated, together with their pros,
cons, and the reasoning for adopting or rejecting each one.

* Alternative 1: [description, pros, cons, reasoning]
* Alternative 2: [description, pros, cons, reasoning]

## Decision

This section describes our response to these forces. It is stated in full
sentences, with an active voice. "We will ..."

## Consequences

This section describes the resulting context, after applying the decision.
All consequences should be listed here, not just the "positive" ones. A
particular decision may have positive, negative, and neutral consequences,
but all of them affect the team and project in the future.

The two sections that matter most for operationalization are Considered Alternatives and Consequences. The first tells future readers what you rejected. The second tells you what kind of drift to watch for. If those sections are vague, your fitness function will be vague too.

This is the part most teams skip. The ADR is not the enforcement mechanism; it is the source material for one.

Context tells you where the rule applies.
Decision tells you what must or must not happen.
Consequences tell you what kinds of drift, exceptions, and migration warnings the check should account for.

If I cannot turn those sections into a concrete detection rule, the ADR is probably not specific enough yet.

[!TIP]
Store your ADRs in version control, not a wiki. Put them in a /docs/adr folder in your main repo or a dedicated architecture repo. This gives you history, blame, and—critically—the ability to reference them from CI pipelines.

From Documentation to Enforcement: Operationalizing ADRs

So you've got a folder full of ADRs. Now what?

This is where fitness functions come in. A fitness function is any automated check that verifies an architectural constraint. The term comes from Building Evolutionary Architectures, but the concept is just "automated architecture tests." Some examples:

Dependency fitness function: Does go.mod contain any database drivers other than our standard (PostgreSQL)?
Layer fitness function: Does the domain package import from infrastructure? (It shouldn't in hexagonal architecture.)
Security fitness function: Are all external API calls using our approved HTTP client with circuit breakers?
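To make the dependency example concrete, here is a minimal sketch of a code-based fitness function in Go. It assumes the check runs over the text of go.mod, and the deny-list is hypothetical: the module paths are real driver packages, but which ones you ban depends on your own ADR.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// disallowedDrivers is an illustrative deny-list (an assumption, not a
// recommendation). Adjust it to whatever your ADR actually rules out.
var disallowedDrivers = []string{
	"github.com/go-sql-driver/mysql",
	"github.com/mattn/go-sqlite3",
	"go.mongodb.org/mongo-driver",
}

// findDisallowedDrivers scans go.mod content line by line and returns every
// module path whose first token matches a deny-listed driver (or a versioned
// subpath of one, such as ".../v2").
func findDisallowedDrivers(goMod string) []string {
	var found []string
	scanner := bufio.NewScanner(strings.NewReader(goMod))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Handle both `require x v1` and entries inside a `require ( ... )` block.
		line = strings.TrimPrefix(line, "require ")
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue
		}
		for _, banned := range disallowedDrivers {
			if fields[0] == banned || strings.HasPrefix(fields[0], banned+"/") {
				found = append(found, fields[0])
			}
		}
	}
	return found
}

func main() {
	goMod := "module example.com/svc\n\nrequire (\n" +
		"\tgithub.com/jackc/pgx/v5 v5.5.0\n" +
		"\tgithub.com/go-sql-driver/mysql v1.7.1\n)\n"
	for _, m := range findDisallowedDrivers(goMod) {
		fmt.Printf("violation: %s is not the approved PostgreSQL driver\n", m)
	}
}
```

A check like this is fast and deterministic, but it only covers one file format in one language, which is the limitation the rest of this section is about.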

The key insight is that every ADR that isn't explicitly deprecated or superseded should map to at least one fitness function. That is the operationalizing step. In practice, if you only check ADRs marked "accepted," you'll miss most of them — plenty of ADRs sit at "proposed" or have no status at all. The filter should be inclusive: check everything that hasn't been explicitly retired.

Traditional fitness functions are written in code. In Go, you might use a tool like go-arch-lint or write custom tests that inspect the AST. In Java, ArchUnit is the gold standard. In TypeScript, you might use dependency-cruiser.

But here's the problem: most organizations are not monolingual.

Let's say you've decided to standardize on OpenSearch for log aggregation. That decision affects:

Your Go services (which might import a Datadog or Splunk client)
Your Python scripts (which might use different logging libraries)
Your Terraform modules (which might provision the wrong resources)
Your Kubernetes manifests (which might include sidecars for the wrong vendor)

Writing a fitness function in each language sounds manageable until you count the surfaces: Go dependencies, Python imports, Terraform resources, Helm values, raw YAML, and environment variables. Half the drift is in configuration, not application code.

That is why I started treating fitness functions as prompts instead of parsers. I already had an AI reviewer reading PR diffs through Archbot. Adding architecture checks meant adding more instructions, not shipping another language-specific rule engine.

The idea is simple. Instead of writing code that parses each language's dependency file, you describe what a violation looks like and let an LLM analyze the diff.

Operationalizing the ADR: Two Approaches

The flow is straightforward: start with an ADR, express the violation patterns so an LLM can check them, and run that check against the PR diff. If the model finds a violation, return a structured comment.

There are two ways to do this, and I've tried both.

Approach 1: Separate Fitness Function Prompts

The more structured option: for each ADR, you write a corresponding "fitness function prompt" — a standalone markdown file that describes what constitutes a violation and how to identify it. When a PR comes in, the AI analyzes the diff against all active prompts and flags any violations.

This works well when you want fine-grained control over detection rules, or when the people writing fitness functions are different from the people writing ADRs.

Let me show you a complete example.

The ADR

# ADR-042: Standardize on OpenSearch for Log Aggregation

Date: 2025-09-15

Status: accepted

## Context

We currently have three different log aggregation solutions in production:
- Datadog (legacy, from acquisition)
- Splunk (used by compliance team)
- OpenSearch (adopted by new services)

This fragmentation causes operational overhead (three dashboards, three query
languages, three billing relationships) and makes cross-service debugging
difficult. Engineers waste time context-switching between tools.

## Considered Alternatives

* **Keep multi-vendor approach:** Pros: no migration effort, teams keep familiar
  tools. Cons: ongoing operational overhead, cross-service debugging remains
  painful, costs continue to grow.

* **Standardize on Datadog:** Pros: strong APM integration, existing team
  familiarity. Cons: highest cost option, vendor lock-in concerns, no
  self-hosted option.

* **Standardize on OpenSearch:** Pros: open-source, self-hosted option available,
  good cost/feature balance, already adopted by new services. Cons: migration
  effort required, less mature APM story.

## Decision

We will standardize on OpenSearch for log aggregation. Specifically:
- Application logs ship to OpenSearch via Fluent Bit sidecars
- No direct integration with Datadog, Splunk, or other log vendors in
  application code
- Existing Datadog/Splunk integrations should be migrated by Q2 2026

## Consequences

- Single pane of glass for log queries (positive)
- Reduced vendor costs, ~40% savings projected (positive)
- Consistent logging patterns across all services (positive)
- Migration effort required for legacy services (negative)
- Team retraining on OpenSearch query syntax (negative)
- Some Datadog-specific features like APM correlation need alternatives (negative)
- Compliance team will need updated runbooks for OpenSearch (neutral)

The Fitness Function Prompt

Now here's the corresponding fitness function, written as a prompt:

## Fitness Function: ADR-042 OpenSearch Standardization

**ADR Reference:** ADR-042
**Enforcement:** Block new integrations; warn on modifications to legacy

### What to Check
- Direct Datadog/Splunk SDK imports or dependencies (Go, Python, Node, Terraform)
- Vendor-specific logging configuration (DD_*, SPLUNK_* env vars)
- Sidecar configurations for non-approved vendors

### How to Respond
- NEW integrations: Flag as BLOCKING with ADR reference and migration guide link
- MODIFICATIONS to existing: Flag as WARNING with migration deadline

### What to Ignore
- Documentation, test fixtures, removal of vendor code

You get the idea. Each ADR gets a prompt that describes what to look for, how to respond, and what to ignore. You can tune the prompt for each ADR independently.
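Wiring this up can be as small as concatenating the prompt files with the diff. The sketch below is one possible shape, not Archbot's actual code: `buildReviewPrompt`, the section delimiters, and the instruction sentence are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildReviewPrompt joins every fitness function prompt with the PR diff.
// prompts maps a file name (for example "adr-042.md") to its markdown content.
func buildReviewPrompt(prompts map[string]string, diff string) string {
	names := make([]string, 0, len(prompts))
	for name := range prompts {
		names = append(names, name)
	}
	sort.Strings(names) // stable order, so the same inputs give the same prompt

	var b strings.Builder
	b.WriteString("Check the PR diff against EACH fitness function below.\n\n")
	for _, name := range names {
		fmt.Fprintf(&b, "--- %s ---\n%s\n\n", name, strings.TrimSpace(prompts[name]))
	}
	b.WriteString("--- PR DIFF ---\n")
	b.WriteString(diff)
	return b.String()
}

func main() {
	prompts := map[string]string{
		"adr-042.md": "## Fitness Function: ADR-042 OpenSearch Standardization\n...",
	}
	fmt.Println(buildReviewPrompt(prompts, "+    \"dd-trace\": \"^5.4.0\","))
}
```

Every active prompt lands in the context on every PR, so the prompt grows linearly with the number of ADRs.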

But after running this in production, I found a simpler way.

Approach 2: Let the Agent Read ADRs Directly

Here's what I discovered after building the system: you don't need separate fitness function files at all. The ADR itself — specifically the Decision and Consequences sections — already contains everything the agent needs to check compliance.

Think about what a well-written ADR tells you:

Decision says "We will use X" and "No direct integration with Y" — that's your violation definition right there.
Consequences describes what drift looks like — "migration effort required for legacy services" tells the agent what existing code might look like versus what new code should look like.

When you write a separate fitness function file, you're translating the ADR into a second document. That creates a sync problem: the ADR gets updated, but the fitness function prompt doesn't. Or the ADR gets nuanced ("except for the compliance team's audit logs"), but the prompt still flags those cases.

The simpler approach: give the agent tools to read ADRs directly, and let it figure out what's relevant.

The agent gets two tools instead of a library of prompt files:

list_adrs — returns a summary table for triage:

ID	Title	Status	Summary
ADR-042	OpenSearch for Log Aggregation	accepted	Mandates OpenSearch, forbids Datadog/Splunk integrations...
ADR-043	API Versioning Standard	proposed	Requires URL path versioning for all public APIs...

read_adr — fetches the full content of a specific ADR by file path.

The system prompt instructs the agent to call list_adrs, identify which ADRs are relevant to the PR's changes based on summaries, then call read_adr for each relevant one. It checks the diff against each ADR's Decision and Consequences sections and reports violations with specific quotes.
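As a rough sketch of what could sit behind list_adrs, the code below parses ADRs written in the template from earlier and renders a triage table. It is an assumption-laden illustration: `parseADR`, `isActive`, and `listADRs` are hypothetical names, the files are passed in as an in-memory map, and the Summary column is left out because generating summaries is a separate step.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type adrSummary struct {
	Path, Title, Status string
}

// parseADR pulls the title (first "# " heading) and the "Status:" line out of
// an ADR. A missing status stays empty rather than being guessed.
func parseADR(path, content string) adrSummary {
	s := adrSummary{Path: path}
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case s.Title == "" && strings.HasPrefix(line, "# "):
			s.Title = strings.TrimPrefix(line, "# ")
		case s.Status == "" && strings.HasPrefix(line, "Status:"):
			s.Status = strings.ToLower(strings.TrimSpace(strings.TrimPrefix(line, "Status:")))
		}
	}
	return s
}

// isActive is the inclusive filter: everything counts unless it has been
// explicitly retired.
func isActive(status string) bool {
	return !strings.HasPrefix(status, "deprecated") && !strings.HasPrefix(status, "superseded")
}

// listADRs renders the table the agent sees before it reads any ADR in full.
func listADRs(files map[string]string) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths) // deterministic output regardless of map order

	var b strings.Builder
	b.WriteString("Path\tTitle\tStatus\n")
	for _, p := range paths {
		s := parseADR(p, files[p])
		if !isActive(s.Status) {
			continue
		}
		status := s.Status
		if status == "" {
			status = "(none)"
		}
		fmt.Fprintf(&b, "%s\t%s\t%s\n", s.Path, s.Title, status)
	}
	return b.String()
}

func main() {
	fmt.Print(listADRs(map[string]string{
		"docs/adr/042.md": "# ADR-042: Standardize on OpenSearch for Log Aggregation\n\nStatus: accepted\n",
		"docs/adr/043.md": "# ADR-043: API Versioning Standard\n\nStatus: proposed\n",
	}))
}
```

Returning the file path in the table gives the agent exactly what it needs to pass to read_adr on the next turn.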

The payoff is even bigger than the separate-files approach: adding a new architectural check means writing an ADR. That's it. The people writing ADRs don't need to write a second file, learn a prompt format, or coordinate with the platform team. The agent picks it up automatically.

The Kind of Diff It Catches

Here is the sort of change this catches without writing a Node-specific rule:

diff --git a/package.json b/package.json
@@ -18,6 +18,7 @@
   "dependencies": {
+    "dd-trace": "^5.4.0",
     "pino": "^9.7.0"
   }
 }

That is enough context for the model to flag a new vendor-specific logging dependency and tie it back to ADR-042.

BLOCKING: This PR introduces a new Datadog integration. Per ADR-042, we've standardized on OpenSearch for log aggregation. Please use Fluent Bit sidecars instead.

How it works in practice

One production choice matters here: decide whether model failures should fail open or fail closed. For something like ADR-042, failing open makes sense: a missed Datadog import is annoying but not dangerous. For a security ADR banning unencrypted secrets in config, you probably want to fail closed. Match the failure mode to the cost of drift for the decision you're enforcing.
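The failure-mode choice can be expressed as a small policy function. This is a sketch under assumptions: `failureMode`, `checkResult`, and `shouldBlock` are hypothetical names, and a real system would likely carry the mode as per-ADR configuration.

```go
package main

import (
	"errors"
	"fmt"
)

type failureMode int

const (
	failOpen   failureMode = iota // model error: let the PR through
	failClosed                    // model error: block the PR
)

// checkResult is a hypothetical shape for what the reviewer returns.
type checkResult struct {
	Violations []string
}

// errModel stands in for any model-side failure (timeout, bad response).
var errModel = errors.New("model request timed out")

// shouldBlock decides the CI outcome for one ADR check. Violations always
// block; errors block only when the ADR is configured to fail closed.
func shouldBlock(res checkResult, err error, mode failureMode) bool {
	if err != nil {
		return mode == failClosed
	}
	return len(res.Violations) > 0
}

func main() {
	fmt.Println(shouldBlock(checkResult{}, errModel, failOpen))   // false
	fmt.Println(shouldBlock(checkResult{}, errModel, failClosed)) // true
}
```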

Production Lessons: What I Learned After Shipping This

The conceptual model above is clean. Production was messier. Here are the three lessons that would have saved me the most time if someone had told me upfront.

Why LLMs Excuse Architecture Violations

This one surprised me. The agent saw a clear ADR mandating a specific SDK for Go services. The codebase used a different SDK. The agent's reasoning? "This codebase predates the ADR and might be an exception."

It dismissed the violation without flagging it.

LLMs are trained to be helpful, and "helpful" often means finding reasons why something is probably fine. That is exactly the wrong behavior for compliance checking.

The fix: explicit anti-rationalization rules in the system prompt.

Report what you see. Do NOT rationalize away violations.

If an ADR mandates technology X and the code uses technology Y, that is a
violation — even if you believe the codebase predates the ADR, might have an
exception, or has a reasonable justification.

Your job is to flag deviations from the ADR text, not to judge whether
exceptions apply. Humans will decide if an exception is warranted.

This is a universal lesson for anyone using LLMs for compliance, policy enforcement, or audit. If the model can find a reason to say "looks fine," it will. You have to explicitly tell it not to.

Summary-First Triage Is Required at Scale

I mentioned this above, but it deserves emphasis because the failure mode is invisible. With a handful of ADRs, you can dump everything into context and it works great. At 20+ ADRs, the API starts timing out or the model loses focus in the middle of a massive context window.

The two-step workflow — summaries for triage, full reads for relevant ADRs — is not an optimization. It's a requirement. Plan for it from the start if you have more than 10 ADRs.

Prompt Iteration Is the Real Work

Here's a timeline from my production rollout:

Iteration 1: Instructions sounded optional ("you may want to check ADRs"). The agent skipped them entirely.
Iteration 2: Made instructions mandatory. Agent read all ADR content at once. API timeout.
Iteration 3: Added summary-based triage. Agent read summaries but never called the tool to read full content.
Iteration 4: Fixed tool instructions. Agent read full content but rationalized away violations.
Iteration 5: Added anti-rationalization rules. Agent started flagging violations but was too aggressive — flagging test fixtures and migration documentation.
Iteration 6: Added "What to Ignore" guidance. Getting closer.
Iteration 7+: Tuned severity levels and response formatting.

Seven iterations. Each one taught me something about how LLMs interpret instructions.

The biggest takeaway: if an instruction sounds optional, the agent will skip it. Don't say "you may want to check." Say "You MUST call read_adr for EACH potentially relevant ADR." Mandatory language is not rude — it's how you get consistent behavior from an LLM.

Plan for at least five iterations on your prompt. Test against real PRs, not synthetic examples. The edge cases in real diffs will teach you things you can't anticipate.

Tradeoffs: AI vs Static Analysis

Let me be direct about the trade-offs.

Traditional code-based fitness functions give you:

Deterministic results (same input = same output)
Fast execution (no API calls)
No token costs

AI-based fitness functions give you:

Cross-language support with one implementation
Natural language rules anyone can write
Fuzzy matching (catches variations you didn't anticipate)
Contextual understanding (knows the difference between adding and removing a dependency)

For most organizations, especially those with polyglot stacks, the AI approach wins on maintainability. The cost? You're trading determinism for flexibility. Sometimes the LLM will miss something. Sometimes it'll flag a false positive.

But here's the thing: manual code review has the same problems. Reviewers miss things. Reviewers have false positives ("I don't like this pattern" when it's actually fine). At least with AI fitness functions, the rules are written down and consistently applied.

That said, AI fitness functions are not a replacement for linters, type systems, or targeted static analysis. I still reach for traditional code-based checks when:

The rule must be deterministic and always blocking
The violation is semantic rather than structural
A false positive would be expensive or noisy
The check needs to run with near-zero latency and no external dependency

The sweet spot for AI is cross-language policy enforcement and structural drift. The sweet spot for static analysis is precise, language-local invariants. In practice, the best system uses both.

How to Roll This Out Without Overengineering It

If you want to try this approach:

Audit your ADRs. Which ones aren't deprecated or superseded? List them.

[!TIP]
Don't filter by "accepted" only — in most repos, half the ADRs that matter are still marked "proposed" or have no status at all. Be inclusive.

Classify by verifiability. For each ADR, ask: "Can I describe what a violation looks like?" If yes, it's a candidate for automation.
Start with one high-value check. Pick an ADR that gets violated frequently. If you're using the agent-reads-ADRs approach, make sure the Decision section is specific enough. If you're writing separate prompts, write the first one. Hook it into your review pipeline.
Plan for iteration. Your first prompt will be wrong. The agent will skip checks, time out on too much context, or rationalize away violations. Budget for at least five iterations against real PRs. This is the actual work.
Add anti-rationalization rules early. Don't wait until you see the agent excusing violations. Add explicit "report, don't judge" instructions from the start.
Build the library. Once you have a system checking 5-10 ADRs, the flywheel kicks in. New ADRs automatically get checked. Engineers start writing better ADRs because they know the Decision section will actually be enforced.

Start with the ADR that already creates expensive cleanup when it gets ignored. Do not start with your most abstract principle. Start with the one that repeatedly costs your team time.

The Datadog PR from the opening was not a one-off. It was what happens when important decisions live in documentation and delivery lives somewhere else.

Once we put the ADR in the review path, the feedback loop collapsed from months to minutes. In the first month, it caught several violations we would have otherwise discovered later. By the end of the first quarter, it had caught dozens more across repositories and file types that no single linter could have covered.

Operationalizing architecture is not magic. It is just architecture that finally made it into code.

What's Next

This is part 1 of a three-part series on operationalizing architecture:

Part 1: Operationalizing ADRs with Automated Fitness Functions (you are here)
Part 2: Spotting the Big Changes: Automating Impactful PR Detection for Architects (coming soon) — Not all PRs are equal. Learn how to automatically flag the ones that need architect review.
Part 3: Navigating Tech Choices: How to Turn Your Tech Radar into an Active Architecture Guide (coming soon) — Your Tech Radar shouldn't be a PDF. Make it a dynamic fitness function.

I'm spending more time building architecture systems that turn standards into working guardrails. If that's the kind of platform problem you're working on too, you can grab time here.

Agentic AI Code Review: From Confidently Wrong to Evidence-Based

Alexandre Amado de Castro — Sun, 15 Mar 2026 17:39:43 +0000

Archbot flagged a "blocker" on a PR. It cited the diff, built a plausible chain of reasoning, and suggested a fix.

It was completely wrong. Not "LLMs are sometimes wrong" wrong — more like convincing enough that a senior engineer spent 20 minutes disproving it.

The missing detail wasn't subtle. It was a guard clause sitting in a helper two files away.
Archbot just didn't have that file.

That failure mode wasn't a prompt problem.
It was a context problem.

So I stopped trying to predict what context the model would need up-front, and switched to an agentic loop: give the model tools to fetch evidence as it goes, and require it to end with a structured "submit review" action.

This post is the architectural why and how (and the reliability plumbing that made it work). There are good hosted AI review tools now. This post is about the pattern underneath them.

[!NOTE]
Names, repos, and examples are intentionally generalized. This is about the design patterns, not a particular company.

If you want the backstory on what Archbot is and why I built it, start with the original post: Building Archbot: AI Code Review for GitHub Enterprise.

The fixed pipeline (and where it breaks)

My original design was what a lot of "LLM workflow" systems converge to:

Give the model the PR diff.
Give it some representation of the repo.
Ask it to pick the files that matter.
Feed only those files back in for the real review.
Optionally run a second pass to critique the first pass.

That critique pass is helpful for catching obvious nonsense and reducing overconfident tone.
But it can't solve the core issue: if both passes are reasoning over the same clipped context, they're both still guessing.

On paper, it's elegant:

The model is great at prioritizing.
Keeping context small should reduce hallucinations.
A critique pass should catch the worst bad takes.

In practice, it fails in a very specific way.

The core failure: you can't pre-select the missing piece

Code review isn't "read these files".
It's "follow the chain of evidence until you understand the behavior".

And chains are not predictable.

You start in handler.go, notice it calls validate(), jump to validate.go, see it wraps a client, jump to client.go, and only then realize the bug is a timeout default in config/defaults.go.

The fixed pipeline made that exploration impossible.

Once Phase 1 picked the "important" files, Phase 2 was stuck. If the model realized it needed one more file to confirm a claim, it had no way to fetch it.

That led to two bad outcomes:

False positives with confidence. The model would infer behavior from a partial call chain and present it as fact.
Missed risk. The model would never see the one file that made a change dangerous, because that file didn't look important from the diff.

I could tune prompts.
I could add more heuristics to file selection.
I could increase the file budget.

But that's all the same bet: that I can guess the right context ahead of time.

That bet is wrong more often than it feels.

The shift: stop guessing context, let the model fetch it

The agentic version flips the contract:

The system does not attempt to build a perfect context.
It gives the model a toolset for finding evidence.
It loops until the model either submits a review (terminal tool) or hits a budget/timeout.

Instead of a fixed pipeline, it's an exploration loop.

You can summarize the new rule as:

"If you can't cite it, go fetch it."

Under the hood, this ended up being three pieces:

A loop that alternates between "model turn" and "tool turn", with budgets and context hygiene.
A tool interface that can mark certain tools as terminal.
Different toolsets per mode (full review vs chat vs command), reusing the same loop.
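The first two pieces (the loop and terminal tools) fit in a few dozen lines. What follows is a minimal hand-rolled sketch, not Archbot's implementation: the `model` interface, `scriptedModel`, and `stubTool` are hypothetical stand-ins so the example runs without an LLM, and `tool` is a reduced form of the fuller contract shown later in this post. Context hygiene is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// toolCall is what the model asks for on its turn (hypothetical shape).
type toolCall struct {
	Name  string
	Input map[string]interface{}
}

// tool is a reduced contract: run with input, and say whether it ends the run.
type tool interface {
	Execute(input map[string]interface{}) (string, error)
	IsTerminal() bool
}

// model stands in for the LLM client: given the transcript so far, it
// returns the next tool call.
type model interface {
	Next(transcript []string) (toolCall, error)
}

var errNoSubmission = errors.New("budget exhausted without a terminal tool call")

// runLoop alternates model turns and tool turns until a terminal tool is
// called or maxTurns is spent. The terminal call's input is returned as the
// result; the terminal tool itself is never executed.
func runLoop(m model, tools map[string]tool, maxTurns int) (map[string]interface{}, error) {
	var transcript []string
	for turn := 0; turn < maxTurns; turn++ {
		call, err := m.Next(transcript)
		if err != nil {
			return nil, fmt.Errorf("model turn %d: %w", turn, err)
		}
		t, ok := tools[call.Name]
		if !ok {
			transcript = append(transcript, fmt.Sprintf("error: unknown tool %q", call.Name))
			continue
		}
		if t.IsTerminal() {
			return call.Input, nil
		}
		out, err := t.Execute(call.Input)
		if err != nil {
			out = "error: " + err.Error() // feed failures back so the model can recover
		}
		transcript = append(transcript, fmt.Sprintf("%s -> %s", call.Name, out))
	}
	return nil, errNoSubmission // "no submission" is treated as failure
}

// scriptedModel replays a fixed list of calls; it exists only for the demo.
type scriptedModel struct {
	calls []toolCall
	i     int
}

func (s *scriptedModel) Next(_ []string) (toolCall, error) {
	if s.i >= len(s.calls) {
		return toolCall{}, errors.New("script exhausted")
	}
	c := s.calls[s.i]
	s.i++
	return c, nil
}

// stubTool returns canned output; terminal marks it as the end-of-run action.
type stubTool struct {
	terminal bool
	output   string
}

func (t stubTool) Execute(map[string]interface{}) (string, error) { return t.output, nil }
func (t stubTool) IsTerminal() bool                               { return t.terminal }

func main() {
	tools := map[string]tool{
		"get_file_content":   stubTool{output: "1: package foo"},
		"submit_code_review": stubTool{terminal: true},
	}
	m := &scriptedModel{calls: []toolCall{
		{Name: "get_file_content", Input: map[string]interface{}{"path": "foo/bar.go"}},
		{Name: "submit_code_review", Input: map[string]interface{}{"summary": "looks fine"}},
	}}
	review, err := runLoop(m, tools, 8)
	fmt.Println(review["summary"], err)
}
```

Swapping the tools map is all it takes to reuse the same loop for a different mode, which is the third piece in the list above.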

The biggest reliability improvement wasn't "more tools".
It was making the model end by calling a terminal tool.

In a fixed prompt pipeline, the model ends by printing Markdown.
That makes it tempting to optimize for sounding right.

In an agentic loop, the model ends by submitting an object.
That changes the incentives:

It's harder to hand-wave; you have to populate fields.
You can enforce structure (severity, inline comments, evidence links).
The loop can treat "no submission" as failure.

So what does the terminal tool actually look like? And how do you wire all this together without it becoming a mess? Let's build it.

What changed in review quality

The qualitative shift after going agentic wasn't "more comments".
It was more explainable comments.

The best code review feedback has three properties:

It points at a specific behavior.
It cites the relevant code.
It proposes a fix (or at least a direction) with trade-offs.

Tools make that possible.

Instead of:

"This might break retries."

You get:

"In foo/bar.go:123, the new call bypasses withRetry(...). All other call sites use withRetry(...) (see matches in search_code). If that's intentional, we should document why; otherwise, wrap it."

When the model has the ability (and expectation) to fetch evidence, it stops guessing.

The way I measure that shift: does it cite exact locations, or hedge? Does it go fetch more evidence when uncertain, or just keep talking? How often do I have to spend time disproving a comment?

In other words: does it behave like a cautious reviewer, or a persuasive one?

Build it yourself: the pieces you need

Here's how the pieces fit together — the tool interface, a few representative tools, the terminal tool, and how they wire into the loop.

The examples below are simplified Go to show the minimum shape. They're not production code, but they're structurally complete — you could extend these into a working system.

I didn't rebuild Archbot from scratch to go agentic. I took the existing review pipeline and wrapped it in a small loop: model turn → tool turn → repeat, plus a single terminal action (submit_code_review) to end the run. A hand-rolled loop is the fastest way to prove the product behavior (does it fetch evidence? does it stop guessing?) before you commit to a framework.

[!TIP]
Starting greenfield? Google's ADK is a solid default — it provides a similar tool interface with built-in orchestration, tracing, and callback hooks. The patterns below (tool contracts, terminal actions, structured output, context hygiene) apply either way.

Step 1: Define the tool contract

Every tool implements the same contract. The key method is IsTerminal() — it's what lets the loop know when to stop.

type Tool interface {
    Name() string
    Description() string
    InputSchema() map[string]interface{}
    Execute(ctx context.Context, input map[string]interface{}) (string, error)
    IsTerminal() bool
}

Step 2: Build your context-fetching tools

Most tools follow the same pattern: parse input, call an API, return a string. Keep them boring and deterministic.

One thing I didn't appreciate early: these tools are part of your product surface area. Treat them like APIs — stable contracts, deterministic outputs, and tests that lock in behavior. If search_code returns garbage, the model will reason over garbage. The shortest path from "I suspect X" to "here's the evidence" should be one tool call.

Here's get_file_content:

type GetFileContent struct {
    ghClient *github.Client
    owner    string
    repo     string
    headSHA  string
}

func (t *GetFileContent) Name() string { return "get_file_content" }
func (t *GetFileContent) IsTerminal() bool { return false }
func (t *GetFileContent) Description() string {
    return "Fetch the full content of a file at the PR's head commit."
}

func (t *GetFileContent) InputSchema() map[string]interface{} {
    return map[string]interface{}{
        "type": "object",
        "properties": map[string]interface{}{
            "path": map[string]interface{}{
                "type":        "string",
                "description": "File path relative to the repo root",
            },
        },
        "required": []string{"path"},
    }
}

func (t *GetFileContent) Execute(
    ctx context.Context, input map[string]interface{},
) (string, error) {
    path, _ := input["path"].(string)
    if path == "" {
        return "", fmt.Errorf("path is required")
    }

    content, err := t.ghClient.GetFileContent(ctx, t.owner, t.repo, path, t.headSHA)
    if err != nil {
        return "", fmt.Errorf("fetching %s: %w", path, err)
    }

    return addLineNumbers(content), nil // prefix each line with its number for precise citations
}
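get_file_content calls addLineNumbers without showing it. One possible version is below; the right-aligned "N: " prefix is an assumption, and any stable format the model can quote back would work.

```go
package main

import (
	"fmt"
	"strings"
)

// addLineNumbers prefixes every line with its 1-based number so the model can
// cite "path:line" exactly. Numbers are right-aligned to the widest one.
func addLineNumbers(content string) string {
	// Drop one trailing newline so a normal file does not gain a phantom last line.
	lines := strings.Split(strings.TrimSuffix(content, "\n"), "\n")
	width := len(fmt.Sprint(len(lines)))
	var b strings.Builder
	for i, line := range lines {
		fmt.Fprintf(&b, "%*d: %s\n", width, i+1, line)
	}
	return b.String()
}

func main() {
	fmt.Print(addLineNumbers("package foo\n\nfunc Bar() {}\n"))
}
```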

And here's search_code — the tool the model reaches for when it sees a function call in a diff and wants to know "where else is this used?"

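type SearchCode struct {
    repoFiles map[string]string // path -> content (from repomix or local clone)
    diffText  string
}

func (t *SearchCode) Name() string { return "search_code" }
func (t *SearchCode) IsTerminal() bool { return false }
func (t *SearchCode) Description() string {
    return "Case-insensitive substring search across the repo files and the PR diff."
}

func (t *SearchCode) InputSchema() map[string]interface{} {
    return map[string]interface{}{
        "type": "object",
        "properties": map[string]interface{}{
            "query": map[string]interface{}{
                "type":        "string",
                "description": "Text to search for (case-insensitive substring)",
            },
        },
        "required": []string{"query"},
    }
}

func (t *SearchCode) Execute(
    ctx context.Context, input map[string]interface{},
) (string, error) {
    query, _ := input["query"].(string)
    if query == "" {
        return "", fmt.Errorf("query is required")
    }
    queryLower := strings.ToLower(query)

    // Go map iteration order is random, so sort the paths first to keep the
    // output deterministic across runs (this needs the "sort" package).
    paths := make([]string, 0, len(t.repoFiles))
    for path := range t.repoFiles {
        paths = append(paths, path)
    }
    sort.Strings(paths)

    var matches []string

    // search repo files (production version uses ripgrep or tree-sitter for precision)
    for _, path := range paths {
        lines := strings.Split(t.repoFiles[path], "\n")
        for i, line := range lines {
            if strings.Contains(strings.ToLower(line), queryLower) {
                matches = append(matches, fmt.Sprintf("%s:%d: %s", path, i+1, strings.TrimSpace(line)))
            }
        }
    }

    // search the PR diff too
    for i, line := range strings.Split(t.diffText, "\n") {
        if strings.Contains(strings.ToLower(line), queryLower) {
            matches = append(matches, fmt.Sprintf("(diff):%d: %s", i+1, strings.TrimSpace(line)))
        }
    }

    if len(matches) == 0 {
        return fmt.Sprintf("No matches found for %q", query), nil
    }
    return strings.Join(matches, "\n"), nil
}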

Step 3: Define the terminal tool

This is what makes the loop an agent instead of a chatbot. The model doesn't print freeform text — it calls submit_code_review with structured data, and the loop ends.

One subtle but important implementation detail: treat the terminal tool as non-executable. When the model emits submit_code_review, the loop captures the structured payload and ends immediately. Your application code (outside the loop) turns that payload into an actual GitHub review.

That also gives you a schema you can validate. Here's what the model submits:

{
  "summary": "What changed + why it matters",
  "inline_comments": [
    {
      "path": "foo/bar.go",
      "line": 123,
      "severity": "blocker",
      "comment": "Specific issue, evidence, and a suggested fix"
    }
  ],
  "high_level_feedback": [
    "Design-level note that isn't tied to a single line"
  ]
}

If the model puts "blocker" in the severity field but can't provide a path + line, that's the model telling you it doesn't have evidence.

Here's the implementation:

type SubmitCodeReview struct{}

func (t *SubmitCodeReview) Name() string    { return "submit_code_review" }
func (t *SubmitCodeReview) IsTerminal() bool { return true }
func (t *SubmitCodeReview) Description() string {
    return `Submit your final code review. Every blocker and should-fix MUST
include a file path and line number. If you cannot cite evidence, downgrade
to nice-to-have or omit the finding.`
}

func (t *SubmitCodeReview) InputSchema() map[string]interface{} {
    return map[string]interface{}{
        "type": "object",
        "properties": map[string]interface{}{
            "summary":            map[string]interface{}{"type": "string"},
            "inline_comments":    map[string]interface{}{
                "type": "array",
                "items": map[string]interface{}{
                    "type": "object",
                    "properties": map[string]interface{}{
                        "path":     map[string]interface{}{"type": "string"},
                        "line":     map[string]interface{}{"type": "integer"},
                        "severity": map[string]interface{}{"type": "string", "enum": []string{"blocker", "should-fix", "nice-to-have"}},
                        "comment":  map[string]interface{}{"type": "string"},
                    },
                    "required": []string{"path", "line", "severity", "comment"},
                },
            },
            "high_level_feedback": map[string]interface{}{
                "type": "array",
                "items": map[string]interface{}{"type": "string"},
            },
        },
        "required": []string{"summary", "inline_comments"},
    }
}

func (t *SubmitCodeReview) Execute(
    ctx context.Context, input map[string]interface{},
) (string, error) {
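    // In practice, the loop intercepts this before Execute runs.
    // If it does run, echo the payload back and surface any marshal error.
    out, err := json.MarshalIndent(input, "", "  ")
    if err != nil {
        return "", fmt.Errorf("marshal review payload: %w", err)
    }
    return string(out), nil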
}
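
Because the terminal payload has a schema, the application code that receives it can reject a bad review before anything is posted. Here's a minimal sketch of that check, not the production code. The ReviewPayload and InlineComment structs and the validateReview helper are hypothetical names for illustration, and the rules simply mirror the schema above: a known severity, plus a path and a line on every inline comment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InlineComment mirrors one entry of "inline_comments" in the payload.
type InlineComment struct {
	Path     string `json:"path"`
	Line     int    `json:"line"`
	Severity string `json:"severity"`
	Comment  string `json:"comment"`
}

// ReviewPayload mirrors the submit_code_review input.
type ReviewPayload struct {
	Summary           string          `json:"summary"`
	InlineComments    []InlineComment `json:"inline_comments"`
	HighLevelFeedback []string        `json:"high_level_feedback"`
}

var allowedSeverities = map[string]bool{
	"blocker": true, "should-fix": true, "nice-to-have": true,
}

// validateReview (hypothetical helper) decodes the payload and returns one
// problem string per rule violation. An empty slice means the review is usable.
func validateReview(raw []byte) (*ReviewPayload, []string) {
	var p ReviewPayload
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, []string{"invalid JSON: " + err.Error()}
	}
	var problems []string
	if p.Summary == "" {
		problems = append(problems, "summary is required")
	}
	for i, c := range p.InlineComments {
		if !allowedSeverities[c.Severity] {
			problems = append(problems, fmt.Sprintf("inline_comments[%d]: unknown severity %q", i, c.Severity))
		}
		if c.Path == "" || c.Line < 1 {
			problems = append(problems, fmt.Sprintf("inline_comments[%d]: missing path or line", i))
		}
		if c.Comment == "" {
			problems = append(problems, fmt.Sprintf("inline_comments[%d]: empty comment", i))
		}
	}
	return &p, problems
}

func main() {
	// A blocker with no location: the schema shape is right, the evidence is not.
	raw := []byte(`{"summary":"Adds retry logic","inline_comments":[{"path":"","line":0,"severity":"blocker","comment":"nil deref"}]}`)
	_, problems := validateReview(raw)
	for _, p := range problems {
		fmt.Println(p) // prints: inline_comments[0]: missing path or line
	}
}
```

What you do with the problems is a design choice. You could drop the offending comment, downgrade it, or send the list back to the model as a tool error and let it try again.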

Step 4: Write the loop

This is the core of the agent. Conceptually, it's tiny:

// Simplified: the real code has retries, token accounting, and timeouts.
for iteration := 1; iteration <= maxIterations; iteration++ {
    resp, err := model.Converse(ctx, messages, tools)
    if err != nil {
        return nil, err
    }

    if resp.HasToolCall("submit_code_review") {
        return resp.ToolInput("submit_code_review"), nil
    }

    toolMessages, err := executeTools(ctx, resp.ToolCalls)
    if err != nil {
        return nil, err
    }

    messages = append(messages, toolMessages...)
    messages = shrinkStaleToolResults(messages, maxToolResultChars)
}
return nil, fmt.Errorf("agent loop ended without submit_code_review")

That shrinkStaleToolResults line is doing more work than it looks. The loop manages context aggressively:

It keeps the diff and the most recent evidence.
It truncates older tool results (the giant file dumps you needed 3 turns ago).
If the model hits a context overflow anyway, it retries the same iteration after shrinking tool results harder.
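
To make those three rules concrete, here's a rough sketch of the shrinking step. It is not the real shrinkStaleToolResults: the Message type, the Pinned flag (my stand-in for "keep the diff"), and the keepRecent parameter are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical, simplified conversation entry.
type Message struct {
	Role    string // "user", "assistant", or "tool"
	Content string
	Pinned  bool // never shrink this one (for example, the PR diff)
}

const truncationNote = "\n[truncated: stale tool result]"

// truncate cuts s to at most maxChars runes and marks the cut.
func truncate(s string, maxChars int) string {
	r := []rune(s)
	if len(r) <= maxChars {
		return s
	}
	return string(r[:maxChars]) + truncationNote
}

// shrinkStale leaves the newest keepRecent tool results untouched and
// truncates every older, unpinned tool result to maxChars runes.
// It returns a copy, so the caller's slice is not modified.
func shrinkStale(messages []Message, keepRecent, maxChars int) []Message {
	out := make([]Message, len(messages))
	copy(out, messages)
	seen := 0
	for i := len(out) - 1; i >= 0; i-- { // walk newest to oldest
		if out[i].Role != "tool" || out[i].Pinned {
			continue
		}
		seen++
		if seen <= keepRecent {
			continue // recent evidence stays intact
		}
		out[i].Content = truncate(out[i].Content, maxChars)
	}
	return out
}

func main() {
	msgs := []Message{
		{Role: "tool", Content: "diff --git a/x.go b/x.go", Pinned: true},
		{Role: "tool", Content: strings.Repeat("x", 50)}, // old file dump
		{Role: "assistant", Content: "checking the caller"},
		{Role: "tool", Content: strings.Repeat("y", 50)}, // latest evidence
	}
	for _, m := range shrinkStale(msgs, 1, 10) {
		fmt.Printf("%-9s %d chars\n", m.Role, len(m.Content))
	}
}
```

The overflow retry falls out of the same function: call it again with a smaller maxChars (and, if needed, a smaller keepRecent) and resend the same turn.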

Why bother with all this shrinking? Why not just stuff the entire repo into the prompt and skip the tool calls? Because long context has its own failure modes.

Even if your model can accept huge inputs, performance doesn't scale linearly with tokens. Two patterns show up in research and in practice:

"Lost in the middle." Models tend to over-weight the beginning and end of long contexts, and under-use relevant info buried in the middle.
Distraction. Irrelevant context measurably reduces accuracy; the model starts pattern-matching on junk.

If you want to go deep on the evidence:

"Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., 2024)
"Large Language Models Can Be Easily Distracted by Irrelevant Context" (Shi et al., 2023)
"Context Rot: How Increasing Input Tokens Impacts LLM Performance" (Chroma Research, 2025)

The practical takeaway isn't "never use long context". It's this:

Treat tokens like budget, not storage.

This is boring plumbing. It is also the difference between a system that works on small PRs and a system that survives the messy ones.

Step 5: Wire it together

This is the entry point. You assemble your tools, set your budgets, run the loop, and handle the structured output.

func RunReview(ctx context.Context, model Model, pr PRContext) (*ReviewResult, error) {
    tools := []Tool{
        &GetPRInfo{pr: pr},
        &GetPRDiff{pr: pr},
        &GetFileContent{ghClient: pr.Client, owner: pr.Owner, repo: pr.Repo, headSHA: pr.HeadSHA},
        &SearchCode{repoFiles: pr.RepoFiles, diffText: pr.Diff},
        &SubmitCodeReview{},
    }

    config := LoopConfig{
        SystemPrompt: reviewSystemPrompt,
        InitialMessage: fmt.Sprintf(
            "Review PR #%d: %s\nAuthor: %s\nBase: %s ← Head: %s",
            pr.Number, pr.Title, pr.Author, pr.Base, pr.Head,
        ),
        Tools:              tools,
        MaxIterations:      15,
        MaxToolResultChars: 120_000,
    }

    result, err := RunLoop(ctx, model, config)
    if err != nil {
        return nil, err
    }

    if result.TerminalToolName != "submit_code_review" {
        return nil, fmt.Errorf("loop ended without submitting a review")
    }

    return parseReviewResult(result.TerminalToolInput)
}

The tools are simple. The loop is simple. The power comes from letting the model decide which tools to call, and in what order.

That terminal tool pattern also gives you a clean way to run the same loop in different modes:

Full PR review: includes submit_code_review.
Interactive chat: no terminal tool; the loop returns assistant text.
Command mode: terminal tool is "submit result" for that command.

One loop. Different endings.
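
One way to read "different endings" in code: the loop never hard-codes submit_code_review, it just asks the registered tools which one is terminal. The sketch below is an illustration under assumptions. Tool is trimmed down to two methods, and simpleTool, Response, and isDone are hypothetical names, not the real loop.

```go
package main

import "fmt"

// Tool is a trimmed-down version of the tool interface used in this post.
type Tool interface {
	Name() string
	IsTerminal() bool
}

// simpleTool is a stand-in implementation for the demo.
type simpleTool struct {
	name     string
	terminal bool
}

func (t simpleTool) Name() string     { return t.name }
func (t simpleTool) IsTerminal() bool { return t.terminal }

// Response is a hypothetical model turn: free text plus the names of any tools it called.
type Response struct {
	Text      string
	ToolCalls []string
}

// isDone reports whether the loop should stop after this turn, and what ended it.
func isDone(tools []Tool, resp Response) (endedBy string, done bool) {
	terminal := map[string]bool{}
	for _, t := range tools {
		if t.IsTerminal() {
			terminal[t.Name()] = true
		}
	}
	for _, call := range resp.ToolCalls {
		if terminal[call] {
			return call, true // review or command mode: the terminal tool ends the loop
		}
	}
	// Chat mode: no terminal tool is registered, so plain text with no tool calls ends the loop.
	if len(terminal) == 0 && len(resp.ToolCalls) == 0 {
		return "text", true
	}
	return "", false
}

func main() {
	review := []Tool{simpleTool{name: "search_code"}, simpleTool{name: "submit_code_review", terminal: true}}
	chat := []Tool{simpleTool{name: "search_code"}}

	cases := []struct {
		mode  string
		tools []Tool
		resp  Response
	}{
		{"review, still searching", review, Response{ToolCalls: []string{"search_code"}}},
		{"review, submitting", review, Response{ToolCalls: []string{"submit_code_review"}}},
		{"chat, answering", chat, Response{Text: "Here is what I found."}},
	}
	for _, tc := range cases {
		endedBy, done := isDone(tc.tools, tc.resp)
		fmt.Printf("%s: done=%v endedBy=%q\n", tc.mode, done, endedBy)
	}
}
```

Switching modes then means changing the tool list you pass in, not the loop.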

Step 6: Add guardrails

Agentic doesn't mean "let it run forever". It means you give the model freedom inside a box:

Iteration caps. Hard limit on round-trips.
Timeouts. Total wall-clock budget.
Tool result limits. Max characters per tool output, and stricter limits for stale outputs.
Self-critique before submission. The terminal tool instructions include a checklist: cite evidence, downgrade uncertain claims, avoid bikeshedding, don't invent runtime failures.

If you're building your own version, don't copy my numbers. Copy the pattern: budgets are part of the interface.

Step 7: Evaluate on real PRs

Don't trust vibes. Pick 5-10 PRs where you already know what the real risks were — a missed nil check, a broken migration, a security issue someone caught in manual review. Run your loop against those PRs and check:

Did it find the real issue?
Did it cite the right file and line?
Did it hallucinate problems that don't exist?
When it was uncertain, did it fetch more evidence or just assert?

That gives you a ground-truth baseline. From there, iterate on your tools (not your prompts).
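
Scoring this by hand gets old fast, so here's a small sketch of how the first three questions could be turned into numbers. It assumes you've recorded each known issue as a file and line. Finding, Score, scoreReview, and the line tolerance are hypothetical, and the paths in main are made up.

```go
package main

import "fmt"

// Finding is a file/line location: either a known issue or one the reviewer reported.
type Finding struct {
	Path string
	Line int
}

// Score summarizes one PR.
type Score struct {
	Hits      int // known issues the review found
	Misses    int // known issues the review did not find
	Unmatched int // reported findings that match no known issue
}

// scoreReview compares reported findings against known issues. A report
// counts as a hit when the path matches and the line is within tolerance
// lines of the known issue. Each report can match at most one known issue.
func scoreReview(known, reported []Finding, tolerance int) Score {
	var s Score
	used := make([]bool, len(reported))
	for _, k := range known {
		found := false
		for i, r := range reported {
			if used[i] || r.Path != k.Path {
				continue
			}
			d := r.Line - k.Line
			if d < 0 {
				d = -d
			}
			if d <= tolerance {
				used[i] = true
				found = true
				break
			}
		}
		if found {
			s.Hits++
		} else {
			s.Misses++
		}
	}
	for _, u := range used {
		if !u {
			s.Unmatched++
		}
	}
	return s
}

func main() {
	known := []Finding{{Path: "db/migrate.go", Line: 42}}
	reported := []Finding{
		{Path: "db/migrate.go", Line: 44},  // close enough to the known issue
		{Path: "api/handler.go", Line: 10}, // matches nothing we knew about
	}
	fmt.Printf("%+v\n", scoreReview(known, reported, 3)) // prints: {Hits:1 Misses:0 Unmatched:1}
}
```

Unmatched findings are not automatically hallucinations. They're the ones a human still has to read, because sometimes the reviewer found something the original review missed.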

The trade-offs (because there are always trade-offs)

Agentic review is not a free lunch.

Latency: tool calls add round-trips. You have to make tools fast and cacheable.
Cost: more turns can mean more tokens. Budgeting is mandatory.
Tool design: bad tools produce bad behavior. (If search_code returns garbage, the model will reason over garbage.)
Security: tools are an exfiltration surface. Your tool layer needs authorization and redaction.

But for code review, the win is that the model starts behaving like a cautious reviewer: it looks things up.

Don't build a bigger prompt. Build a loop where the model can fetch evidence, update its hypothesis, and only finish when it submits a structured, checkable result. When you stop forcing the model to guess context, you stop debugging prompt vibes and start debugging actual interfaces.


Challenging Assumptions in Technology: From Being Right to Getting It Right

Alexandre Amado de Castro — Thu, 12 Feb 2026 02:34:30 +0000

Here's a story I tell my mentees often—and it's one where I almost made an expensive mistake.

A few years ago, my team was tasked with rebuilding the authentication system for one of our core products. We walked into that room with a massive assumption already locked and loaded: "Authentication isn't our core business. We should buy, not build."

It made perfect sense on paper. We assumed buying a vendor solution like Auth0 would be faster, safer, and cheaper. We assumed building it ourselves was risky—authentication is hard, right? If you mess it up, you leak credentials, you destroy trust, you crater the business. It felt like the "smart senior engineer" decision to make.

But when we looked at the price tag, we hesitated. It was expensive. We were about to commit to substantial annual licensing fees, multi-year contracts, and integration work that everyone assumed was "just configuration."

So, we did something uncomfortable: we challenged our own "smart" assumption.

We looked deeper. We found that OAuth 2.0 is actually very standardized and prescriptive. We realized that integrating an external vendor with our complex web of internal legacy identity providers was going to be a nightmare regardless of who wrote the auth server. The hard part wasn't the OAuth protocol—it was mapping our Byzantine internal user hierarchies and permission structures.

So we decided to be naive. We decided to build it.

The result? It took us one quarter. It cost us a fraction of the vendor price—saving substantial licensing fees annually. It's been running in production for years, scaling to multiple brands.

Don't get me wrong—buying isn't bad. We don't build Git from scratch; we don't build VMware from scratch. But in this specific case, the "obvious" choice was the wrong one. We nearly let a safe assumption cost us six figures and lock us into a decision we'd regret for years.

This experience taught me a lesson that defines how I lead today: Engineering isn't about being right. It's about getting it right. And the only way to get it right is to relentlessly challenge your assumptions.

Assumptions Are Hidden Technical Debt

We talk a lot about technical debt in code—variables named temp2, lack of tests, hardcoded values. But there is a much more dangerous kind of debt: Decision Inertia.

Every time you say, "Well, we just use Git because that's what we use," or "We can't rewrite this legacy app, it's too big," or "We can't move off this bare-metal MySQL cluster, it's too stateful," you are taking out a loan. You are deferring the work of thinking. You are snowballing unverified beliefs into a mountain of constraints that eventually makes it impossible to move.

Why do we do this?

Survivorship Bias: We see that legacy system running and assume it's just "old code"—but forget it solved real business problems that still exist today. That spaghetti monolith? It handles edge cases you don't even know about yet.
Vendor Demo Trap: We've all seen the slick demo where the vendor solution handles every use case with one click. But we forget that the integration is always harder than the demo, and our snowflake infrastructure never fits their assumptions.
Migration PTSD: We've been burned by past migrations that went sideways—databases that didn't replicate correctly, services that couldn't handle production load. So we assume this one will fail too, even when the circumstances are completely different.

If you want to grow from a senior engineer to a principal or staff level, you have to stop accepting "that's just how it is" as an answer.

Assumptions are the hidden technical debt of architecture.

The ACT Framework

Over the years, I've leaned on a simple mental model to help teams break out of this inertia. I call it the ACT Framework.

Full disclosure: This isn't some new magic. It's the scientific method, simplified and repackaged for the chaos of daily engineering. I use it because it's memorable enough to apply in the heat of the moment, without needing a textbook to recall the steps.

1. Awareness

You can't fix what you don't see. Most of the time, we aren't even aware we're making an assumption.

Recognize the trap: You assume the train knows the way to the destination until you end up in the wrong city.
The Daily Question: Ask yourself, "What is one thing I am taking for granted today?"
Challenge "Facts": Maybe it's that your database needs to run on bare metal. Maybe it's that a service needs to be written in Go. Be aware of the "facts" that are actually just habits.

2. Challenge

Once you spot an assumption, poke it. Even if you think it's right, challenge it anyway.

"What if we didn't use Kubernetes for this?"
"What if we did rewrite this 10-year-old monolith?"
"What if the 'industry standard' doesn't apply to our specific constraints?"

This fosters psychological safety. When you normalize challenging ideas, you create an environment where people aren't afraid to be wrong. You shift the focus from "Who has the best idea?" to "What is the best idea?"

3. Test (Evidence is King)

This is the most critical step. Challenging without testing is just intellectual debate—it's just noise.

Start small: We didn't build the whole thing first; we spun up a quick POC and load-tested it.
Gather data: We ran benchmarks and proved that modern virtualized instances could match or exceed the bare metal performance we thought we needed.
Decide with evidence: Opinions are cheap. Senior engineers have opinions; Staff engineers have data.

Scaling to the Team: The Assumption Audit

The ACT framework is great for checking your own biases, but challenging assumptions is ultimately a team sport. If you're the only one doing it, you'll quickly become "that person" who blocks every PR with philosophical questions. You need to bring the team along.

I like to use a technique called Inversion during design reviews.

Instead of asking "Is this the right choice?", I ask the room:

"What conditions would have to be true for the opposite of this decision to be the right choice?"

This forces the team to articulate the hidden constraints they are assuming exist.

For example, if the team assumes "We must use Microservices," the inversion asks: "What would have to be true for a Monolith to be the better choice here?"

Usually, the answer is something like, "Well, we'd need to not care about independent deployments." And then you can check: "Do we actually care about independent deployments for this specific internal tool?"

Here's the irony: we platform teams spend all day promoting microservices to product engineers—"decouple your domains!" "independent deployability!"—but then we turn around and build monolithic internal tools because "it's just for us" and "we don't need to scale that." When you invert your own assumptions, you often find you're giving advice you don't follow.

Suddenly, the assumption is visible. Now you can debate the constraint, not the solution.

The Power of Naive Questions

There is a trap that experienced engineers fall into. We get cynical. We've seen projects fail, we've watched technologies ride the hype cycle and die, and we start to assume that "it's complicated" means "it's impossible."

The most valuable engineers I know retain a degree of "naivety"—the willingness to ask "How hard can it be?" even when the "smart" answer is "don't even try."

I saw this play out recently when someone asked if we could make our database infrastructure fully self-service using Ansible and Terraform. The room went quiet. Databases are the third rail—stateful, critical, full of subtle operational constraints.

I could have said no. I could have cited all the reasons it wouldn't work—stateful workloads are different, replication is hard, compliance adds complexity, we'd need a whole platform team just for this.

But instead, I said: "I don't know. How hard can it be? Let's try and find out."

We gave ourselves one quarter to prove it could work. We defined clear success criteria: spin up a production-ready, compliant Postgres instance in under an hour. The first attempt took three weeks and broke in spectacular ways. But we learned. We iterated. We automated the pain points one by one.

Nine months later, you can spin up a production-ready, compliant Postgres instance in minutes with a single PR. We unlocked velocity that we assumed was impossible because we were willing to try.

The lesson? Sometimes "it's complicated" is true. But sometimes it's just a story we tell ourselves to avoid the work of finding out.

How to Sell the Risk

Now, I can hear you saying: "This sounds great, Alex, but my PM wants this feature shipped yesterday. I don't have time to play scientist."

I get it. The pressure is real.

When you challenge an assumption—especially a "safe" one like buying a vendor or using a standard library—you are introducing risk. You are trading certainty (even if it's expensive or slow) for uncertainty.

Management hates uncertainty. So, don't sell them uncertainty. Sell them a Timeboxed Spike.

Here is the script I use:

"I have a hunch this assumption might be wrong. Give me 2 days to verify it. If I'm right, we save weeks of complexity and significant budget. If I'm wrong, we lose 2 days. Is that a trade you're willing to make?"

Why does this work?

Capped Downside: You explicitly state the maximum loss (2 days).
Asymmetric Upside: The potential gain (weeks of time, budget) massively outweighs the cost.
Business Language: You aren't asking to "refactor" or "explore"; you are proposing a calculated bet with positive expected value.

Most reasonable leaders will take that bet every single time.

The Goal: The Best Answer Wins

If you take one thing from this post, let it be this: Check your ego at the door.

Challenging assumptions isn't about being the smartest person in the room who points out logical fallacies. It's about vulnerability. It's about saying, "I might be wrong about this, and I want to find out."

It's uncomfortable. It requires you to talk to other teams, to push back on managers, to question legacy decisions made by people who are still at the company.

But that is the job.

The best engineers I know are the ones who are happiest when their own assumptions are proven wrong—because that means they just learned something, and the team is about to build something better.

So, here is my challenge to you: What is one assumption you are holding onto right now?

Document it. Challenge it. Test it.

You might just save your company a significant amount of budget, or better yet, you might build something you didn't know was possible.

You are a Senior Engineer, Mastering Communication & Influence (Part 3)

Alexandre Amado de Castro — Tue, 13 Jan 2026 15:30:00 +0000

I once watched a brilliant engineer lose a promotion to someone with half his technical depth. The difference? The other guy could explain his work in a way that made executives actually care.

This is a hard truth, but it's one we need to face. As we continue our journey through your professional development, it's time to zero in on the "soft skills"—communication, influence, and leadership—that often determine your career trajectory and impact more than your code does.

As a Principal Engineer, I have the great opportunity to mentor many seniors, and I can tell you right now: the difference between a senior who stays at that level and one who advances to staff, principal, or tech lead positions almost always comes down to these soft skills. So, grab your favorite brew, and let's dive into the essentials that will set you apart.

The Art of Communication

Communication isn't just about conveying information; it's about making connections, inspiring your team, and presenting your ideas effectively. Here's the truth: you could design the most elegant system in the world, but if you can't explain why it matters, it won't get built. Whether you're explaining complex technical details to stakeholders or mentoring a newcomer, the way you communicate can make all the difference.

Adaptability: Tailoring Your Message

In the fast-paced world of tech, your audience can range from highly technical peers to business-oriented stakeholders. The key to effective communication lies in your ability to adapt, and this is something I had to learn the hard way early in my career.

Imagine you're explaining a new architectural decision. To your engineering team, you'll dive deep into the technical specifics—maybe you're discussing the trade-offs between using a message queue versus direct API calls, the implications for eventual consistency, or the specifics of your chosen design pattern. You might talk about how this aligns with SOLID principles or how it improves your system's observability.

However, when presenting that same decision to executives, your focus should shift entirely to the business impact. They don't care about your elegant use of the Strategy pattern. What they care about is: how does this decision enhance scalability? Will it reduce our cloud costs? Does it help us ship features faster? Will it improve our uptime and reduce those 3 AM incidents that wake up the on-call team?

Here's a practical example: I once had to justify migrating our monolith to microservices. To the engineering team, I talked about service boundaries, deployment independence, and technology diversity. To the executive team, I framed it as: "This will allow us to deploy new features 10x faster without risking the entire platform, and it will reduce our incident recovery time from hours to minutes." Same decision, completely different framing.

The best part? You don't need to dumb anything down. You're just translating between contexts. Think of it like being bilingual—you're fluent in both technical and business language, and you know when to use each.

Patience and Repetition: Reinforcing Concepts

Here's something nobody tells you about being a senior: you're going to explain the same thing over and over again. And that's okay! In fact, it's part of the job.

Explaining the same concept multiple times, from different angles, until it clicks isn't a sign that you're bad at explaining or that your team isn't listening. It's simply how humans learn complex topics. Embrace this repetition—it's an opportunity to refine your explanations and ensure everyone's on the same page.

Think of it as an iterative process, much like software development itself. Each explanation is a new iteration, improving understanding and clarity. Your first explanation might be too technical. Your second might be too abstract. By the third or fourth time, you'll find that sweet spot where it finally clicks for everyone.

I remember when I was introducing my team to event-driven architecture. The first time I explained it, I lost them in the technical details of event sourcing and CQRS. The second time, I used a restaurant ordering system as an analogy—the waiter takes your order (event), the kitchen prepares it (event handler), and the waiter brings it out (another event). That's when I saw the lightbulbs go off. Sometimes you just need to find the right metaphor or example.

This patience not only helps in solidifying concepts but also in building a collaborative and inclusive environment where people feel safe asking questions and admitting when they don't understand something.

Embracing Questions and Uncertainty

Questions are inevitable, and how you handle them sets you apart. Here's something I've learned: the moment you pretend to know everything is the moment you lose credibility. Embrace curiosity, admit when you don't know something, and commit to finding the answer. This openness builds trust and fosters a learning culture.

For instance, in a meeting, if a junior developer asks a question that stumps you, instead of deflecting or making something up, acknowledge it honestly. Say something like, "That's a great question, and honestly, I don't have the answer right now. Let me look into it and get back to you by end of day." This not only shows humility but also models the behavior you want to see in your team.

I've had moments where a junior engineer asked about a specific edge case in our distributed system that I hadn't considered. Instead of brushing it off, I said, "You know what? That's a scenario I haven't thought through. Let's whiteboard this together and figure out what would happen." We ended up discovering a real bug that could have caused data inconsistency. That junior engineer's question saved us from a production incident, and they felt empowered to keep asking tough questions.

The best engineers I know are the ones who aren't afraid to say "I don't know, but let's find out together."

Feedback and Suggestions: Two-Way Communication

Constructive feedback is a two-way street, and getting good at both giving and receiving it will transform your team dynamics. For example, when reviewing a colleague's code, focus on providing specific, actionable feedback rather than vague criticisms. Instead of "This code is messy," try "I noticed this function is doing three different things. What if we extracted the validation logic into a separate function? It would make this easier to test and understand."

On the flip side, when receiving feedback, approach it with an open mind. This is hard—nobody likes hearing their work could be better. But here's the thing: even seemingly minor suggestions can spark innovation. I've had colleagues point out patterns in my code reviews that led to us establishing new team standards. I've had junior developers ask "why" questions that made me reconsider assumptions I'd been making for years.

The best feedback loop I've seen worked like this: during code reviews, we'd frame suggestions as questions. "Have you considered...?" or "What was the reasoning behind...?" instead of "You should..." This small shift made feedback feel collaborative rather than critical.

Diagrams: The Blueprint of Understanding

As a Senior Engineer, explaining complex systems clearly is a big part of your role, and this is where diagrams become your best friend. I can't tell you how many times a 10-minute whiteboard session with a diagram has solved a problem that hours of Slack messages couldn't.

Mastering Diagramming Techniques

There are three main diagramming approaches you should have in your toolkit, and each serves a different purpose:

UML (Unified Modeling Language): This is your go-to for detailing class structures and interactions. Use UML when you need to show how objects relate to each other, their methods, and their relationships. It's particularly useful when discussing design patterns with your team or planning out a new module's structure. For example, when I'm introducing the Factory pattern to my team, I'll use a UML class diagram to show the relationship between the factory, the interface, and the concrete implementations. It makes the abstract concept concrete.

C4 (Context, Containers, Components, Code): This is ideal for contextualizing software architecture within its environment, and honestly, it's become my favorite for most architecture discussions. The beauty of C4 is that it works at multiple levels of zoom. At the Context level, you're showing how your system fits into the broader ecosystem—what external systems does it talk to? Who are the users? Then you zoom into Containers to show the high-level technical building blocks (web app, database, message queue), then Components to show how the container is structured, and finally Code if you need that level of detail.

I recently used C4 diagrams to explain our platform architecture to new team members. The Context diagram showed how we integrate with payment providers and our mobile apps. The Container diagram revealed our microservices, databases, and message queues. By the time we got to the Component diagram, they understood exactly where they'd be working and how their code would fit into the bigger picture.

ArchiMate: This is your enterprise architecture tool. It's more formal and structured than C4, making it perfect when you're dealing with complex business domains and need to show not just technical architecture but also business processes, organizational structure, and how they all interconnect. I'll be honest, I don't reach for ArchiMate often, but when I'm working with enterprise architects or documenting large-scale organizational changes, it's invaluable.

Diagram Guidelines: Clarity and Consistency

Here's the thing about diagrams: a bad diagram is worse than no diagram at all. A confusing diagram will derail your meeting and leave everyone more confused than when you started.

Creating effective diagrams is both an art and a science. Here are the rules I follow:

Every element must earn its place. If a box, arrow, or label doesn't add clarity, remove it. I've seen diagrams with 50 boxes trying to show everything at once. Instead, create multiple diagrams at different levels of detail.

Clear titles and labels are non-negotiable. Your diagram should be understandable without you explaining it. I test this by showing diagrams to someone unfamiliar with the project and asking them to explain it back to me. If they can't, I simplify.

Consistency in visual language is crucial. Pick a color scheme and stick to it. For example, in my C4 diagrams, I always use blue for internal services, gray for external systems, and orange for databases. This consistency means that when someone sees a blue box, they immediately know it's something we own and control.

Use arrows deliberately. Show direction of data flow, not just "these things are related." A bidirectional arrow should mean something different than a unidirectional one.

I learned this the hard way after creating a beautiful, detailed diagram that nobody understood. Now I follow the principle: start simple, add complexity only when necessary. Your first diagram should be something you could sketch on a napkin.

Here's what this looks like in practice. Let's say you're documenting a payment processing flow. Here's what happens when you ignore these guidelines:

Before: Cluttered and Confusing

This diagram is technically accurate—all these components exist. But does it help you understand the payment flow? Probably not. You're drowning in infrastructure noise. Which path does a payment actually take? What's essential vs. what's just support infrastructure?

Now, here's the same flow following the principles I just covered:

After: Clear and Purposeful

See the difference? Every element earns its place. The numbered arrows show you exactly how data flows through the system. The colors are consistent: blue for our internal services, orange for data stores, green for external systems. You could sketch this on a napkin in 30 seconds, and a new team member could understand the payment flow immediately.

The logging service, monitoring service, and config service? They're important, but they don't help explain the payment flow, so they don't belong here. If you need to document how monitoring works, create a separate diagram for that.

The Art of Presentation

Presenting effectively is crucial, whether in a formal setting or an informal team meeting. I've seen great ideas die because they were poorly presented, and mediocre ideas gain traction because they were presented well. Your technical skills got you to senior, but your presentation skills will determine how far you go beyond that.

Here's something I wish someone had taught me earlier: know your audience and format before you build your presentation. A live presentation where you're persuading stakeholders needs minimal slides and compelling visuals—you are the presentation, and your slides just support you. An RFC or architecture decision record that people will read asynchronously needs comprehensive detail and full context, because you won't be there to explain it.

The worst mistake? Creating a dense document and trying to present it live, or creating a sparse slide deck and sending it to people expecting them to understand it without you. Match the format to the medium, and you'll save yourself hours of rework.

Leadership: Influence Over Authority

Here's something that surprised me when I became a senior: leadership in tech isn't about wielding authority; it's about inspiring and guiding your team. You don't have direct reports (unless you transition to management), but you're expected to lead. This is what we call "influence without authority," and it's one of the most important skills you'll develop.

Leading by Example

This sounds cliché, but it's true: demonstrate the behaviors and practices you want to see in your team. Leading by example builds trust and respect faster than any amount of talking.

For instance, if you advocate for clean code, ensure your contributions to the codebase reflect those principles. I learned this when I was pushing my team to write better tests. I realized that some of my recent PRs had skimped on test coverage because I was rushing. How could I expect them to prioritize testing if I wasn't? So I made a point of writing comprehensive tests, adding thoughtful comments explaining my testing strategy, and during code reviews, I'd reference my own PRs as examples. The shift in team behavior was noticeable within weeks.

If continuous learning is a value, share resources and lead knowledge-sharing sessions. I started a bi-weekly "Tech Talk Tuesday" where anyone could present something they learned—a new tool, a design pattern, a debugging technique. I made sure to present first to set the tone and show it was okay to present on simple topics or to say "I don't fully understand this yet, but here's what I've learned so far."

Another true test of leadership is how you handle failure. When a production incident happens, it's easy to point fingers. But as a Senior Engineer, I learned to lead blameless post-mortems. Instead of asking "Who broke this?", I asked "How did our system allow this to happen?" and "What information were we missing?" By shifting the focus from blame to process improvement, you don't just fix the bug—you build psychological safety for the whole team. That's leadership.

Here's the key: your team is watching you. If you say documentation is important but never update the docs, they won't either. If you say work-life balance matters but you're sending Slack messages at midnight, that's the culture you're creating. Lead by example, not by decree.

Pushing Boundaries

Growth comes from stepping out of your comfort zone, and as a senior, part of your job is to push both yourself and your team to grow. Whether it's learning a new technology, leading a challenging project, or advocating for a controversial architectural decision, continuous personal and professional development is key.

Don't shy away from opportunities that stretch your abilities—embrace them as a chance to grow. When my manager asked if I wanted to lead the migration from our monolith to microservices, my first thought was "I've never done this at this scale before." My second thought was "This is exactly why I should say yes." That project pushed me harder than anything in my career, and I learned more in those six months than in the previous two years.

But here's the thing: pushing boundaries doesn't mean being reckless. It means calculated risks. It means doing your homework, building proof of concepts, getting feedback from trusted peers, and then going for it. Some of the best innovations I've seen came from engineers who said, "I know this isn't how we've always done it, but what if we tried..."

Encourage your teammates to do the same. When a mid-level engineer proposes something ambitious, instead of listing all the reasons it might fail, ask them "What would you need to make this work?" and help them get there.

Speaking Business

Understanding and speaking the language of business is crucial, and this is where many technical folks struggle. You need to master negotiation, understand risk, and use business terminology to align technical solutions with business objectives.

Here's what I mean: when proposing a technical initiative, frame it in terms of business benefits—how it will drive revenue, reduce costs, or improve customer satisfaction. Don't just say "We need to refactor this service." Instead, say "Refactoring this service will reduce our cloud costs by 30% and cut our time to ship new features from two weeks to two days, which means we can respond to customer requests faster than our competitors."

Learn to speak in terms of ROI (Return on Investment), TCO (Total Cost of Ownership), and OKRs (Objectives and Key Results). When you're discussing technical debt, frame it as risk: "This legacy system represents significant business risk. If it fails, we lose $50K per hour of downtime. Modernizing it is an investment in business continuity."

I literally treat it like an equation: (Current Pain $$ + Risk $$) > (Migration Cost $$) = Green Light. If I can't quantify the 'Current Pain' in dollars or time, I know I'm not ready to pitch it to leadership.

I'll be honest: this felt uncomfortable at first. I became an engineer because I loved code, not because I wanted to think about quarterly revenue targets. But here's the reality: the work you do has to serve business goals. The better you are at connecting your technical work to business outcomes, the more autonomy you'll have, the more resources you'll get, and the more impact you'll make.

Some practical ways to develop this skill: sit in on product planning meetings, read your company's quarterly reports, talk to your product managers about what customers are asking for, and most importantly, ask "why" about business decisions until you understand the reasoning.

Keep Growing, Keep Leading

Your role as a Senior Engineer is complex and rewarding. The technical skills got you here, but these soft skills—communication, presentation, and leadership—will determine where you go next. By mastering these skills, you'll elevate not only your career but also those around you.

Remember, every staff engineer, every principal engineer, every tech lead you admire got there not just because they were brilliant coders, but because they could communicate their brilliance, lead their teams, and create impact beyond their individual contributions.

Stay curious, stay communicative, and keep leading the way. You've got this ❤️


Building a Custom AI Code Reviewer for GitHub Enterprise with Bedrock and Go

Alexandre Amado de Castro — Tue, 09 Dec 2025 14:44:05 +0000

Two weeks after deploying our custom AI code reviewer, it caught a bug that would have crashed our payment processing. A senior engineer's PR looked clean—it compiled, passed linting, followed Go idioms. But the AI spotted a subtle pattern: we'd accidentally turned a soft dependency into a hard one. If our config service blinked, our main service would die.

The engineer's response? "Oh."

That's when the team stopped treating Archbot like a toy.

Here's the problem we were solving: we run GitHub Enterprise behind a VPN. Modern AI code review tools like Cursor, GitHub Copilot, and CodeRabbit expect cloud-hosted repos. We couldn't pipe proprietary code into them. But we still wanted AI code review on every PR.

So I built it myself.

This is the story of Archbot, the custom AI code reviewer I built to run securely inside enterprise networks. It started as a weekend hack ("how hard could it be?"). Spoiler alert: it didn't start well.

What I learned (and want to teach you):

Why naive git diff + LLM approaches fail
How to manage context windows for large repos
Using Bedrock Tool schemas for structured LLM output
Two-phase AI architecture for intelligent filtering
Production patterns for running AI tools in CI/CD

Why Git Diff Alone Fails for AI Code Review

My first iteration was painfully simple. I wrote a Go program that fetched the git diff of a PR and sent it to the LLM with a prompt like: "Find bugs in this code."

I felt like a genius for about five minutes. Then the results came in.

The AI would confidently claim:

Variable userRepository is undefined.

I’d look at the code. It was imported right there at the top of the file. But because the git diff only shows changed lines (and a few lines of context), the AI couldn't see the imports.

It was the equivalent of trying to review a book by reading only the sentences that were edited in the second draft. You lose the plot immediately.

Lesson learned: Context is king. Without the file definitions, imports, and surrounding scope, the AI is just guessing.

So I went back to the drawing board.

GitHub API Rate Limits and the Sequential Fetch Problem

My next attempt was obvious: don't just send the diff—fetch the full files from GitHub's API.

I updated the bot to parse the diff, identify which files were changed, and then use the GitHub API to fetch the full content of those files.

This worked... until someone opened a PR that touched 40 files.

My bot fired off 40 sequential API calls to GitHub.
GitHub fired back a 429 Too Many Requests.
The bot crashed.
The developer waited 5 minutes for a review that never came.

And even when it did work, it was slow. We were burning network time just shuttling strings around. I realized I was trying to rebuild git clone over HTTP, poorly.
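The "parse the diff, identify which files were changed" step from this attempt can be sketched in a few lines of Go. This is a minimal illustration, not Archbot's actual code: changedFiles is a hypothetical helper, and it assumes the standard headers that git writes into a unified diff.

```go
package main

import (
	"fmt"
	"strings"
)

// changedFiles returns the post-change paths named in a git unified diff.
// It reads the "+++ b/<path>" header git emits for each file. Deleted files
// are skipped automatically because git marks them with "+++ /dev/null".
// Caveat: an added content line that itself starts with "++ b/" would be
// mistaken for a header; a stricter parser would track hunk boundaries.
func changedFiles(diff string) []string {
	const header = "+++ b/"
	var files []string
	seen := make(map[string]bool)
	for _, line := range strings.Split(diff, "\n") {
		if !strings.HasPrefix(line, header) {
			continue
		}
		path := strings.TrimSpace(strings.TrimPrefix(line, header))
		if path == "" || seen[path] {
			continue
		}
		seen[path] = true
		files = append(files, path)
	}
	return files
}

func main() {
	diff := "diff --git a/user_handler.go b/user_handler.go\n" +
		"--- a/user_handler.go\n" +
		"+++ b/user_handler.go\n" +
		"@@ -1 +1 @@\n-old\n+new\n"
	fmt.Println(changedFiles(diff))
}
```

Every path this returns became one more GitHub API call in the sequential version, which is why a 40-file PR was enough to trip the rate limit.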

Using Repomix to Pack Repository Context for AI

I needed a way to pack the repository context efficiently. That's when I found Repomix (formerly repopack).

Repomix solves a subtle problem: AI models don't understand file systems. You can't just tar a repo and send it—you need structure. Repomix walks your tree and outputs either XML or Markdown with clear file boundaries, respecting .gitignore, and including metadata like file sizes and token counts. It's like pandoc for codebases.

Perfect, right?

There was just one catch: Repomix is a Node.js tool.
Archbot is written in Go.

Why didn't I just write the bot in Python/TypeScript? Two reasons: Concurrency and Deployment. We needed to handle bursts of webhook events without spinning up heavy worker processes, and shipping a single static binary to our distroless containers keeps our security team happy.

I didn't want to rewrite my entire bot in TypeScript. I didn't want to manage two separate deployment pipelines. And I definitely didn't want to write a Go port of Repomix (that's a two-month side quest).

So I did what platform engineers do best: I made the tools work together.

The Multi-Stage Build Solution

The solution was a standard Docker multi-stage build: compile the Go binary in one stage, run it in a Node runtime in the second.

# Stage 1: Build the Go binary
FROM golang:latest AS builder
WORKDIR /app
COPY . .
RUN go mod download
RUN CGO_ENABLED=0 GOOS=linux go build -a -o /app/archbot ./cmd/.

# Stage 2: Runtime with Node for Repomix
FROM node:lts-bullseye
RUN apt-get update && \
    apt-get install -y ca-certificates tzdata git && \
    rm -rf /var/lib/apt/lists/*
RUN npm install -g repomix

# Copy the pre-built Go binary from stage 1
COPY --from=builder /app/archbot /app/

EXPOSE 9000
CMD ["/app/archbot"]

We simply copy the compiled Go binary into a Node.js base image. The Go binary runs as the main process and uses exec.Command to shell out to the globally-installed repomix command when needed.
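For readers who have not shelled out from Go before, here is a rough sketch of what that exec.Command call could look like. The function names are hypothetical, and the --style and --output flags are my assumption about the installed Repomix CLI, so check repomix --help against your version before copying this.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// repomixArgs returns the CLI arguments for packing a repo as XML.
// Assumption: the installed Repomix supports --style and --output.
func repomixArgs(outFile string) []string {
	return []string{"--style", "xml", "--output", outFile}
}

// packRepo runs Repomix inside repoDir with a deadline, so a hung child
// process cannot stall the caller indefinitely. A relative outFile is
// resolved against repoDir because the child runs with that working dir.
func packRepo(ctx context.Context, repoDir, outFile string) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()

	cmd := exec.CommandContext(ctx, "repomix", repomixArgs(outFile)...)
	cmd.Dir = repoDir // pack the already-cloned working tree

	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("repomix failed: %w (output: %s)", err, out)
	}
	return nil
}

func main() {
	if err := packRepo(context.Background(), ".", "repomix-output.xml"); err != nil {
		fmt.Println(err)
	}
}
```

Using exec.CommandContext rather than plain exec.Command means the child process is killed when the context expires, which matters when this runs on every webhook.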

It's not the purest architectural pattern, but it shipped the feature in 2 hours instead of 2 weeks. Sometimes, platform engineering is about making the tools work for you, not being a purist.

The Performance Win

Here's what this actually improved: we stopped hammering the GitHub API.

When a PR comes in, Archbot clones the repo once (network call), runs repomix locally (disk I/O), and gets a nice XML string of the codebase in ~2 seconds. Compare that to the previous attempt, where we made 40+ sequential API calls per PR.

Git cloning is fast. S3 caching (we cache the repomix output) handles way more load than the on-prem GitHub API could. The bottleneck shifted from network to disk, which is exactly what you want.

Context Window Management for Large Repositories

We deployed the Repomix version. It worked great for small microservices.

Then we tried it on The Monolith™.

You know the one. Millions of lines of code. Hundreds of directories. When Repomix packed that repo, the resulting text file was massive—way larger than the context window of even the generous 200k token models.

And even if it did fit, sending 150k tokens for every simple bug fix is a great way to burn through your budget.

I realized I couldn't just "feed the repo" to the AI. I had to replicate how I review code.

When I review a PR, I don't read the entire codebase.

I look at the diff.
I check the file tree to see where the changes live.
I open only the files that are relevant to the change to verify imports and usage.

We needed a Two-Phase Architecture.

Instead of blindly packing the entire repo, we split the review into two phases:

Phase 1: Intelligent File Selection

Instead of sending the code immediately, we first send the AI the Directory Structure and the Diff.

We ask it a simple question: "Given these changes, which files do you need to read to verify correctness?"

The system prompt for this phase is carefully engineered to guide the AI's selection:

You are a senior code reviewer. You must use the select_files tool to identify
which files are needed to verify the correctness of these changes. Consider:
imports, interface definitions, test coverage, and related business logic.

Be selective—choose only files directly relevant to understanding the change.
If a PR modifies a handler, you likely need the service it calls and any
interfaces it implements. You don't need unrelated files in the same directory.

The AI is surprisingly good at this. It will say:

"I see you modified user_handler.go. I need to see user_service.go to check the interface and user_handler_test.go to verify coverage."

Forcing Structured Responses with Bedrock Tools

Cool, but how do we actually tell Bedrock what format to return?

This is where AWS Bedrock Tools become critical. Instead of hoping the LLM returns parseable text, you define a JSON schema that forces the model to return structured data in exactly the format you specify.

With the tool choice forced, the model answers with a tool call whose input is JSON shaped by that schema. No regex parsing. No "I hope the AI formatted this correctly." You still check the unmarshal error, but what you unmarshal is typed data that maps directly onto a Go struct.

We use this pattern for both phases: Phase 1 uses a select_files tool schema (returns an array of file paths), and Phase 2 uses a perform_code_review tool schema (returns structured review comments with severity levels and inline suggestions). The AI can't return freeform markdown even if it wants to. The schema is the contract.

We'll see the actual code for these schemas in the implementation section below.

Phase 2: The Focused Review

Now that we have the list of 5-10 relevant files, we use Repomix to pack them into a focused context. But here's the thing—even those selected files come with baggage.

Generated code. Vendored dependencies. Lock files. Binary artifacts. Test fixtures.

This is where Repomix's exclusion list feature becomes critical. Instead of naively packing everything, Repomix lets you define exclusion patterns that filter out files that don't make sense for AI review. Think .gitignore, but specifically tuned for context packing.

We configure Repomix to exclude:

Generated proto files and mocks
Third-party vendored code
Large JSON fixtures and test data
Binary assets and compiled artifacts
Package lock files (sorry package-lock.json, you're 50k lines of noise)

The result? I take the selected files and pack them cleanly—stripping out the noise that would confuse the AI or burn tokens unnecessarily.
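To make the filtering idea concrete, here is a small Go sketch of path-based exclusion. Repomix does this itself through its ignore patterns, so this is not how Archbot is wired. The directory names and globs below are hypothetical examples of the categories in the list above.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Hypothetical exclusion rules mirroring the categories listed above.
var (
	excludedDirs  = []string{"vendor", "node_modules", "testdata"}
	excludedGlobs = []string{"*.pb.go", "mock_*.go", "package-lock.json", "*.png", "*.bin"}
)

// isExcluded reports whether a slash-separated repo path should be left
// out of the packed context.
func isExcluded(p string) bool {
	// Reject anything that lives under an excluded directory.
	for _, segment := range strings.Split(path.Dir(p), "/") {
		for _, dir := range excludedDirs {
			if segment == dir {
				return true
			}
		}
	}
	// Reject file names matching an excluded glob. path.Match only fails
	// on a malformed pattern, and the patterns above are fixed.
	for _, glob := range excludedGlobs {
		if ok, _ := path.Match(glob, path.Base(p)); ok {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{
		"internal/user_handler.go",
		"api/user.pb.go",
		"vendor/github.com/acme/lib/lib.go",
		"package-lock.json",
	} {
		fmt.Printf("%-40s excluded=%v\n", p, isExcluded(p))
	}
}
```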

This reduces the context size from ~100k tokens (full repo) to ~5k tokens (filtered, focused context). It's faster, cheaper, and because the noise is removed, the AI hallucinates less.

Let's talk money. The naive approach of sending 150k input tokens plus ~5k output tokens to Claude Haiku 4.5 via Bedrock would cost roughly $0.18-0.20 per review (based on direct API pricing of $1/$5 per million input/output tokens—Bedrock pricing may vary). With the two-phase approach and Repomix's exclusion filtering, I'm averaging 8k input tokens and ~2.5k output tokens per review—about $0.02 per review. At 50 PRs/day, that's the difference between $300/month (naive) vs $30/month (optimized). The two-phase architecture pays for itself immediately.
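If you want to rerun that arithmetic with your own numbers, the calculation is short enough to script. The sketch below uses only the figures quoted above ($1 and $5 per million input and output tokens, 50 PRs a day) and assumes a 30-day month; the function names are mine.

```go
package main

import "fmt"

// Prices in USD per million tokens, taken from the figures quoted above.
const (
	inputPricePerM  = 1.0
	outputPricePerM = 5.0
)

// reviewCost returns the USD cost of a single review.
func reviewCost(inputTokens, outputTokens float64) float64 {
	return inputTokens/1e6*inputPricePerM + outputTokens/1e6*outputPricePerM
}

// monthlyCost scales a per-review cost to a month, assuming 30 days.
func monthlyCost(perReview float64, prsPerDay int) float64 {
	return perReview * float64(prsPerDay) * 30
}

func main() {
	naive := reviewCost(150_000, 5_000)
	focused := reviewCost(8_000, 2_500)

	fmt.Printf("naive:   $%.4f per review, $%.2f per month\n", naive, monthlyCost(naive, 50))
	fmt.Printf("focused: $%.4f per review, $%.2f per month\n", focused, monthlyCost(focused, 50))
}
```

With these inputs the naive path works out to about $0.175 per review (roughly $262 a month) and the focused path to about $0.0205 per review (roughly $31 a month). The $300 figure above corresponds to the upper end of the $0.18-0.20 range.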

Implementation: Bedrock Tools and Structured LLM Output

Okay, enough story time. Let's look at the code.

The secret sauce isn't just the prompting. It's forcing the LLM to return structured data. Instead of parsing freeform markdown (nightmare fuel), we use AWS Bedrock's "Tool" feature: essentially a JSON schema that constrains the model's output. You define the shape you want, and the model answers with a tool call whose JSON input follows that schema, which the Go code still validates when it unmarshals.

One quick note: we're using Bedrock's Converse API, not the older InvokeModel approach that required raw JSON payload construction. The Converse API handles message formatting, tool schemas, and multi-turn conversations for you. If you're still using InvokeModel, upgrade—it'll save you hours of JSON wrangling.

Step 1: The Contract (Bedrock Tools)

First, we define the schema for our "tools". In AWS Bedrock (or OpenAI), this tells the model exactly what JSON format we expect back.

Here is the select_files tool. Note how we explicitly ask for an array of file paths and a reasoning field.

// Imports:
// "github.com/aws/aws-sdk-go-v2/aws"
// "github.com/aws/aws-sdk-go-v2/service/bedrockruntime/document"
// bedrockTypes "github.com/aws/aws-sdk-go-v2/service/bedrockruntime/types"

// BuildSelectFilesTool creates a tool for selecting relevant files from a PR
func BuildSelectFilesTool() *bedrockTypes.ToolConfiguration {
    schemaMap := map[string]interface{}{
        "type": "object",
        "properties": map[string]interface{}{
            "files": map[string]interface{}{
                "type": "array",
                "items": map[string]interface{}{
                    "type":        "string",
                    "description": "File path relative to repo root (e.g., 'src/main.go')",
                },
                "description": "List of relevant file paths to analyze.",
            },
            "reasoning": map[string]interface{}{
                "type":        "string",
                "description": "Explanation of why these files were selected",
            },
        },
        "required": []string{"files", "reasoning"},
    }

    // Wrap the schema in Bedrock's ToolConfiguration
    toolSpec := &bedrockTypes.ToolMemberToolSpec{
        Value: bedrockTypes.ToolSpecification{
            Name:        aws.String("select_files"),
            Description: aws.String("Select the most relevant files for analysis based on the PR diff"),
            InputSchema: &bedrockTypes.ToolInputSchemaMemberJson{
                Value: document.NewLazyDocument(schemaMap),
            },
        },
    }

    return &bedrockTypes.ToolConfiguration{
        Tools: []bedrockTypes.Tool{toolSpec},
        ToolChoice: &bedrockTypes.ToolChoiceMemberTool{
            Value: bedrockTypes.SpecificToolChoice{
                Name: aws.String("select_files"),
            },
        },
    }
}

This schema tells Bedrock: "You must return a JSON object with a files array of strings and a reasoning string. No other format is acceptable."

When you pass this tool configuration to the Converse API with the tool choice forced, the model has to answer by calling select_files, and its input comes back as JSON shaped by that schema. The same pattern applies to Phase 2: we define a perform_code_review tool with fields for severity levels, inline comments, and code suggestions.

Step 2: Phase 1 - File Selection

When a PR comes in, we grab the diff and the file tree (Repomix gives us a nice "Directory Structure" summary we can reuse). We send that to the model.

Here's the core logic from selectFilesForReview. This is production code with the defensive patterns you need when running this on every PR across 30 repositories.

func (r *CodeReviewAnalyzer) selectFilesForReview(ctx context.Context, prDiff string) ([]string, error) {
    // Add timeout for production resilience
    ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    // 1. Get directory structure from Repomix (cached)
    repoStructure := r.ChangeContext.Repomix.DirectoryStructure

    // 2. Build the prompt asking "Which files do you need?"
    phase1Prompt := fmt.Sprintf(SelectFilesForReviewPrompt, repoStructure, prDiff)

    // 3. Call Bedrock with the Tool Configuration
    toolConfig := bedrock.BuildSelectFilesTool()

    response, err := r.BedrockClient.Converse(ctx, SystemPrompt, phase1Prompt, toolConfig)
    if err != nil {
        return nil, fmt.Errorf("bedrock converse failed: %w", err)
    }

    // 4. Extract structured tool response
    toolJSON, err := bedrock.ExtractToolJSON(response)
    if err != nil {
        return nil, fmt.Errorf("failed to extract tool response: %w", err)
    }

    // 5. Unmarshal into typed struct
    var selectResponse struct {
        Files     []string `json:"files"`
        Reasoning string   `json:"reasoning"`
    }

    if err := json.Unmarshal(toolJSON, &selectResponse); err != nil {
        return nil, fmt.Errorf("invalid json structure: %w", err)
    }

    // 6. Guard against runaway file selection
    if len(selectResponse.Files) > 20 {
        slog.Warn("AI selected too many files, truncating",
            "requested", len(selectResponse.Files),
            "limit", 20)
        selectResponse.Files = selectResponse.Files[:20]
    }

    return selectResponse.Files, nil
}

Notice the defensive programming: timeouts with context.WithTimeout, explicit error wrapping with fmt.Errorf, structured logging with slog.Warn, and the guardrail cap that truncates oversized file selections. When you're running this on every PR across 30 repositories, you need to assume the AI will occasionally hallucinate, the network will flake, or someone will open a 500-file refactor. The first week we deployed this, the AI tried to select 47 files for a 3-line CSS change. Without the cap, we would have blown our context window budget. Production systems need guardrails.
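The cap is one guardrail. Another that could sit next to it is sanitizing the paths themselves before looking them up. The sketch below is hypothetical and is not part of the function above: parseSelection and its rules are my own illustration of treating model output as untrusted input.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
	"strings"
)

const maxSelectedFiles = 20 // same cap as in selectFilesForReview

type fileSelection struct {
	Files     []string `json:"files"`
	Reasoning string   `json:"reasoning"`
}

// parseSelection decodes the tool payload and keeps only paths that are
// safe to look up: relative, inside the repo, and not repeated.
func parseSelection(toolJSON []byte) ([]string, error) {
	var sel fileSelection
	if err := json.Unmarshal(toolJSON, &sel); err != nil {
		return nil, fmt.Errorf("invalid tool payload: %w", err)
	}

	seen := make(map[string]bool)
	var files []string
	for _, f := range sel.Files {
		clean := path.Clean(f) // "./a.go" becomes "a.go", "" becomes "."
		escapesRepo := clean == ".." || strings.HasPrefix(clean, "../")
		if clean == "." || path.IsAbs(clean) || escapesRepo || seen[clean] {
			continue
		}
		seen[clean] = true
		files = append(files, clean)
		if len(files) == maxSelectedFiles {
			break
		}
	}
	if len(files) == 0 {
		return nil, fmt.Errorf("model selected no usable files")
	}
	return files, nil
}

func main() {
	payload := []byte(`{"files":["./user_handler.go","user_handler.go","../secrets.env","user_service.go"],"reasoning":"handler calls service"}`)
	files, err := parseSelection(payload)
	fmt.Println(files, err)
}
```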

Step 3: Packing Context (The Repomix Magic)

Now that we have the selectedFiles list, we don't need to re-clone anything. We have the Repomix data structure in memory (or cached). We just filter it to extract the context we need.

// BuildFileContext filters RepomixData to selected files and returns a packed string.
// Note: This example shows XML format for illustration. The production version outputs
// JSON using structured serialization for better type safety and easier parsing.
func (r *Runner) BuildFileContext(data *RepomixData, selectedFiles []string) (string, error) {
    // 1. Create a lookup map for O(1) checks
    keep := make(map[string]bool)
    for _, f := range selectedFiles {
        keep[f] = true
    }

    // 2. Filter the pre-packed files from Repomix
    // We assume data.Files contains the parsed output from Repomix
    var filteredFiles []repomix.File
    for _, file := range data.Files {
        if keep[file.Path] {
            filteredFiles = append(filteredFiles, file)
        }
    }

    if len(filteredFiles) == 0 {
        return "", fmt.Errorf("no relevant files found in context")
    }

    // 3. Reconstruct the XML/Context string
    // This replicates Repomix's output format but only for the subset
    var sb strings.Builder
    sb.WriteString("<repository>\n")
    for _, f := range filteredFiles {
        fmt.Fprintf(&sb, "<file path=%q>\n%s\n</file>\n", f.Path, f.Content)
    }
    sb.WriteString("</repository>")

    return sb.String(), nil
}

This transforms a 50MB repo into a precise 10KB context window containing only the files that matter.

Step 4: Phase 2 - The Review

Finally, I run the actual review. I feed it the focused context and ask for specific actionable feedback.

Crucially, we map the response to a Go struct that matches GitHub's comment API structure. Errors are handled explicitly here as well: no _ assignments allowed in production!

func (r *CodeReviewAnalyzer) performCodeReview(ctx context.Context, fileContext string, prDiff string) (*CodeReviewResult, error) {
    // 1. Build the prompt with the focused context
    phase2Prompt := fmt.Sprintf(ReviewPrompt, fileContext, prDiff)

    // 2. Call Bedrock with the "perform_code_review" tool
    toolConfig := bedrock.BuildCodeReviewTool()

    response, err := r.BedrockClient.Converse(ctx, SystemPrompt, phase2Prompt, toolConfig)
    if err != nil {
        return nil, fmt.Errorf("bedrock converse failed: %w", err)
    }

    // 3. Unmarshal the structured response
    var reviewData struct {
        Summary        string       `json:"summary"`
        KeyConcerns    []KeyConcern `json:"key_concerns"`
        InlineComments []struct {
            FilePath      string      `json:"file_path"`
            Line          interface{} `json:"line"`
            Body          string      `json:"body"`
            SuggestedCode string      `json:"suggested_code"`
        } `json:"inline_comments"`
    }

    toolJSON, err := bedrock.ExtractToolJSON(response)
    if err != nil {
        return nil, fmt.Errorf("failed to extract tool json: %w", err)
    }

    if err := json.Unmarshal(toolJSON, &reviewData); err != nil {
        return nil, fmt.Errorf("failed to unmarshal review data: %w", err)
    }

    // Now we have structured data to send to GitHub API!
    return processReviewData(reviewData), nil
}

The result? I get a list of InlineComments that I can loop over and POST directly to the GitHub Pull Request API. No regex parsing, no "I hope the AI formatted the markdown block correctly." Just typed data.
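Here is a rough sketch of that last hop, turning the decoded comments into the "comments" array of a GitHub pull request review. The type and function names are hypothetical. I am also assuming that Line is an interface{} because the model sometimes returns the line as a number and sometimes as a string, which the post does not state.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// inlineComment mirrors the fields decoded in performCodeReview.
type inlineComment struct {
	FilePath      string
	Line          interface{} // assumed to arrive as 42 or "42"
	Body          string
	SuggestedCode string
}

// reviewComment is one entry in the "comments" array of a GitHub review.
type reviewComment struct {
	Path string `json:"path"`
	Line int    `json:"line"`
	Side string `json:"side"`
	Body string `json:"body"`
}

// lineNumber accepts a JSON number (decoded as float64) or a numeric string.
func lineNumber(v interface{}) (int, bool) {
	switch n := v.(type) {
	case float64:
		return int(n), n >= 1
	case string:
		i, err := strconv.Atoi(strings.TrimSpace(n))
		return i, err == nil && i >= 1
	}
	return 0, false
}

// toReviewComments drops comments without a usable location and appends
// a GitHub suggestion block when the model proposed replacement code.
func toReviewComments(in []inlineComment) []reviewComment {
	fence := strings.Repeat("`", 3)
	var out []reviewComment
	for _, c := range in {
		line, ok := lineNumber(c.Line)
		if !ok || c.FilePath == "" {
			continue
		}
		body := c.Body
		if c.SuggestedCode != "" {
			body += "\n\n" + fence + "suggestion\n" + c.SuggestedCode + "\n" + fence
		}
		out = append(out, reviewComment{Path: c.FilePath, Line: line, Side: "RIGHT", Body: body})
	}
	return out
}

func main() {
	comments := toReviewComments([]inlineComment{
		{FilePath: "config/loader.go", Line: float64(12), Body: "Fallback is unreachable when Fetch fails."},
	})
	payload, err := json.MarshalIndent(map[string]interface{}{"event": "COMMENT", "comments": comments}, "", "  ")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(payload))
}
```

The resulting JSON is the body you would send to the endpoint that creates a review on a pull request. Skipping comments with no valid line avoids a rejected request when the model invents a location.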

Why Not Use an Existing Tool?

Fair question. Tools like CodeRabbit and Codeium Enterprise exist and they are fantastic. But here's the deal: none of them worked with our specific constraints.

The primary driver was infrastructure. We run GitHub Enterprise on-premise behind a strict VPN. While many SaaS tools claim enterprise support, they typically expect your git instance to be reachable from the public internet (or require complex tunnel setups that our security team vetoed immediately). We simply couldn't pipe our repositories into a cloud SaaS.

Once we accepted we had to build, we realized four massive benefits that validated the decision:

Control: We could integrate deeply with our internal knowledge base. I'm talking about feeding the AI our specific architecture docs and Go style guides.
Cost: SaaS tools usually charge per-seat ($20-50/user/month). I wanted per-PR pricing I could control. If we have a quiet month, I pay less.
Data residency: This was non-negotiable. With a custom build using Bedrock in our VPC, zero code ever leaves our controlled environment.
Customization: We have specific architectural rules (like "no direct DB calls in handlers"). Phase 2 adds these as custom fitness functions—something no off-the-shelf tool offered.

If you can use a SaaS tool, use it. This post is for when you can't.

The First Save (And Why This Matters)

Catching syntax errors is cute. Catching logic bombs is why we built this.

Two weeks after deploying Archbot, it caught a bug that would have turned a minor network hiccup into a SEV-1 outage.

A Senior Engineer was implementing a "Feature Flag" loader. The pattern is common: fetch configuration from a remote service, but keep a local fallback for safety. The goal is resilience.

Here is a simplified version of the code:

func (c *ConfigLoader) GetFeatureFlag(ctx context.Context, flag string) (bool, error) {
    // 1. Try to fetch from the remote config service
    val, err := c.remote.Fetch(ctx, flag)
    if err != nil {
         // The "Safe" move? Fail fast.
         return false, fmt.Errorf("remote config failed: %w", err)
    }

    // 2. If remote returns nothing (e.g. 404), fallback to local defaults
    if val == nil {
        return c.local.Get(flag), nil
    }

    return *val, nil
}

Do you see it? It looks clean. It compiles. It passes the linter. It follows the standard Go idiom: if err != nil { return err }.

But Archbot saw the architectural implication:

The engineer paused. "Oh."

We had inadvertently built a "kill switch" into our startup sequence. If the config service blinked, our main application would crash. The fallback code—written specifically for that scenario—was unreachable code during an outage.
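One way to restore the intended behavior is to treat a fetch error the same way as a missing value and answer from local defaults. The sketch below is my own reconstruction under assumptions, not the fix that was actually merged: RemoteConfig, LocalDefaults, and downRemote are hypothetical stand-ins for dependencies the snippet above does not show.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
)

// RemoteConfig and LocalDefaults are hypothetical stand-ins for the
// loader's real dependencies.
type RemoteConfig interface {
	Fetch(ctx context.Context, flag string) (*bool, error)
}

type LocalDefaults map[string]bool

func (l LocalDefaults) Get(flag string) bool { return l[flag] }

type ConfigLoader struct {
	remote RemoteConfig
	local  LocalDefaults
}

// GetFeatureFlag treats the remote service as a soft dependency: a fetch
// error is logged and answered from local defaults instead of returned.
func (c *ConfigLoader) GetFeatureFlag(ctx context.Context, flag string) (bool, error) {
	val, err := c.remote.Fetch(ctx, flag)
	if err != nil {
		slog.Warn("remote config unavailable, using local default",
			"flag", flag, "error", err)
		return c.local.Get(flag), nil
	}
	if val == nil {
		return c.local.Get(flag), nil
	}
	return *val, nil
}

// downRemote simulates the config service being unreachable.
type downRemote struct{}

func (downRemote) Fetch(context.Context, string) (*bool, error) {
	return nil, errors.New("connection refused")
}

// flagDuringOutage resolves a flag while the remote service is down.
func flagDuringOutage(flag string) (bool, error) {
	loader := &ConfigLoader{
		remote: downRemote{},
		local:  LocalDefaults{"new_checkout": true},
	}
	return loader.GetFeatureFlag(context.Background(), flag)
}

func main() {
	enabled, err := flagDuringOutage("new_checkout")
	fmt.Println(enabled, err)
}
```

The signature is kept identical so callers do not change. Whether an outage should stay silent or surface as a metric is a design choice for your own system.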

That's when the team stopped treating Archbot like a toy.

The Real Impact: Shifting Code Review from "Find Problems" to "Verify Intent"

Now, don't get me wrong—catching production bugs is the headline reason we built this. That's what gets management buy-in. That's what makes the case for the infrastructure investment.

But here's what actually changed day-to-day: Archbot became the first filter on every PR, freeing human reviewers to focus on architecture instead of syntax.

Think about the traditional code review process. You open a PR. A senior engineer context-switches from their deep work, loads your branch, scans 15 files looking for imports, error handling, edge cases, architectural violations—all while trying to remember what you were trying to accomplish in the first place. It's exhausting. By the time they get to the interesting architectural question ("Should this live in the service layer or the handler?"), their brain is already fried from scanning for nil checks.

Archbot inverts that. It handles the cognitive load of the "scan for obvious issues" phase. It's not replacing human review—it's augmenting it.

When Archbot leaves a comment, it's doing one of two things:

Spotlighting complexity: "Hey, this function has three nested error paths and I can't tell if all branches are covered." Even when Archbot is wrong, it's usually highlighting code that deserves a second look.
Catching the mundane: Missing error handling, unreachable fallback logic, off-by-one bugs in loops—the stuff that's easy to miss when you're mentally juggling architectural intent.

The result? Human reviewers spend less time hunting for bugs and more time asking the questions that matter: "Is this the right design? Does this scale? Is there a simpler way?"

And yes, Archbot nitpicks sometimes. It'll flag a variable name it doesn't like. It'll suggest a refactor that doesn't make sense in context. But the team learned quickly: if Archbot comments on it, it's probably worth at least reading twice. Complex code attracts AI attention. That's actually a feature, not a bug.

The shift we saw was subtle but real. Code review went from "find all the problems" to "verify the architectural intent." Archbot doesn't have architectural taste—but it frees up the humans who do to actually apply it.

The Result

The difference was night and day.

Naive Approach: Hallucinations, missed context.
Hammering: Slow, rate-limited.
Monolith (Full Repo): Context overflow, expensive.
Two-Phase Archbot: Fast, accurate, cheap.

What we're seeing:

Catching production-impacting logic bugs before merge
Identifying missing test coverage in critical paths
High acceptance rate for comments with minimal false positives
Drastically lower API costs compared to per-seat SaaS pricing
Zero data leaked outside our VPC

But catching bugs is only half the battle. The next evolution is Architecture Fitness Functions — teaching the AI to enforce our specific architectural rules (like "no direct DB calls in handlers" or "use circuit breakers for external APIs"). That's a topic for another day.

You are a Senior Engineer, Mastering Software Architecture and Design (Part 2)

Alexandre Amado de Castro — Wed, 15 May 2024 03:24:05 +0000

Hey there, Senior Engineer! You've reached a significant milestone in your career journey, and now it's time to chart the course for what's next. In this blog post, we're diving deep into a crucial aspect of seniority: mastering software architecture and design.

These skills aren't just nice-to-haves; they're essential for staying sharp and competitive in today's tech landscape. So, grab your coffee and get ready to explore the ins and outs of software architecture and design.

Picking up new skills: Software architecture and design

To continue evolving your software design and architecture skills, there are two key pillars you need to focus on, and one helpful "shortcut." First and foremost, you should always make it a priority to continue studying and learning in this field. Technology is constantly evolving, and staying up-to-date with the latest tools, techniques, and approaches is essential to remaining competitive in your career.

Additionally, it's important to understand that design and architecture go hand in hand. You can't excel at one without a good understanding of the other. If you only focus on design, you might miss critical architecture considerations. On the other hand, if you only focus on architecture, you might end up over-architecting and neglecting the importance of good design.

Learning software architecture and design can be a challenge, especially since it's not as straightforward as learning a new coding language. As the saying goes, software architecture and design are the things you can't simply Google. This means that you'll need to rely on two main sources for learning:

📕 Books

When it comes to these topics, there's no substitute for diving deep into the subject matter. While video tutorials can be helpful, books are the raw source that will give you the depth of knowledge you need to truly understand these complex topics. If you're looking for a good place to start, I highly recommend checking out these books:

"Head First Design Patterns" by Eric Freeman and Elisabeth Robson is a friendly and engaging introduction to design patterns and object-oriented programming. This book presents the material in a fun and approachable way, making it easy to understand and retain.
"Fundamentals of Software Architecture: An Engineering Approach" by Mark Richards and Neal Ford covers the essential principles of software architecture and provides practical guidance for designing scalable and maintainable systems. The authors break down complex topics into easy-to-understand explanations and provide real-world examples to illustrate their points.
"Refactoring: Improving the Design of Existing Code" by Martin Fowler is a classic book that every software developer should read. It teaches you how to improve the design of existing code by making small, focused changes. While this book is a challenging read, it's well worth the effort if you want to improve your design skills.

🗞️ News

Keeping up-to-date with the latest technologies is critical, as they will undoubtedly shape the future of software engineering. For example, Kubernetes, which was introduced less than a decade ago, has become the default choice for many in the industry. Therefore, it's crucial to stay informed about the latest tech news to remain relevant and competitive.

When seeking out news, focus on topics that are specific to the technologies you want to learn, such as Golang, Kubernetes, and Microservices. Additionally, broad technology trends will give you a better understanding of the bigger picture, enabling you to anticipate and prepare for future developments. By staying on top of tech news, you'll be exposed to innovative ideas and emerging trends that will help you think outside the box and stay ahead of the curve. Some of my personal sources for news are:

Newsletters:
Coding Blogs:
- Like mine!
- Martin Fowler
- Dev.to
YouTube Channels:
- Fireship
- TechLinked

Practice, practice, and more practice

The other pillar, as you probably already guessed, is experience. You must expose yourself to an ever-growing list of new things.

If you're not feeling uncomfortable, then you're doing it wrong. Every day should be a challenge, where you feel like you have no idea how to do what you need to do.

But with time, you will realize that you can do it and be proud of yourself. Then, you'll be scared of the next day's challenge, and the cycle continues. A few good ways to gain experience in software architecture and design include:

If you have the bandwidth, work on side projects. Even if they never make it into production or never become a thing, building something outside of your main job can help you explore technologies that you wouldn't have the opportunity to work with otherwise. For instance, you could play with Rust if you're into that.
Ask your manager to move you to different projects within your team. This will enable you to work on different parts of the system and expose you to different types of problems and solutions.
Participate in tech exchange programs. You'll learn a lot from your peers, and it's an excellent opportunity to share your knowledge and skills.
Write documentation about systems you don't own but that your team uses. This will help you gain a deeper understanding of how the system works and how different components interact with one another.
Work on a lot of proofs-of-concept (POCs). This will allow you to try out different approaches and experiment with new technologies and methodologies.
Try to solve problems that don't need solving, just to learn something new. For example, you could explore how to rewrite an application using a NoSQL database instead of a relational one, even if it's not necessary. This will help you learn new things and expand your skill set.
Contribute to Open Source!

By embracing these experiences, you'll be able to learn new things, expand your skill set, and build the confidence to take on more significant challenges.

Lowering the problem level

Finally, let's discuss the "shortcut" for tackling complex problems and concepts. By breaking them down into smaller, more manageable concepts, you can better understand and build upon them. For example:

Let's say you are feeling lost in a project that requires you to use Kafka but have no idea what Kafka is or what it's for. Don't worry, you're not alone. Kafka is a distributed system consisting of servers and clients communicating through a high-performance TCP network protocol.

But if that sounds too technical, let's simplify it further: think of Kafka as a topic that functions like a queue, with multiple consumers. Okay, now we're getting somewhere.

Further simplifying: Kafka is essentially a database full of records called messages that you can consume in order. With this understanding, you can now start building on top of it. For instance, knowing that Kafka is a database, you can begin exploring key concepts that databases have and use that to further your understanding of Kafka. Don't be intimidated by Kafka or any other technology.

Break it down into smaller, more understandable pieces, and you'll be well on your way to becoming a master.

Keep learning, keep growing, keep updated

Becoming a senior engineer is an outstanding achievement that you should be proud of. However, it's not the end of the road. The journey continues, and there's always more to learn and explore. As a senior engineer, you should focus on developing both technical depth and breadth to stay adaptable and versatile.

Additionally, understanding that software design and architecture go hand in hand is crucial for staying competitive in your career. To achieve this, you can rely on books and other sources to continue learning and deepening your understanding of these complex topics. Remember, staying up-to-date with the latest tools, techniques, and approaches is essential to remain competitive in your career.

If you're seeking more personalized guidance or mentoring, I’m here to help. You can easily book a time with me here or through the menu on the left — let’s tackle your challenges together!

You are a Senior Engineer, now what? (Part 1)

Alexandre Amado de Castro — Wed, 08 May 2024 03:15:39 +0000

First of all, congratulations! You’ve earned the rank of Senior Engineer. That's no small feat, and you should be proud of your hard work and dedication.

Now, you might be wondering, "What's next?" This is not only a common question but also one that I frequently get asked. As a Principal Engineer, I have the great opportunity to mentor many seniors and tech leads. That's precisely why I'm writing this blog post—to extend my personal observations to more people who are trying to navigate their next steps in their technology career.

But the truth is, becoming a senior engineer is only the beginning of a new chapter in your professional journey. So, there's no need to be afraid! I'm here to help, and I will share some insights on how you can continue to thrive and achieve even greater success as a Senior Engineer. Let's dive in! 👏

Now that you are a senior and the junior days are over, does that mean you should know everything? Absolutely not. Being a senior is about knowing how much you don't know. However, it also means you should constantly seek more technical knowledge in both depth and breadth. But what does that mean?

Technical Depth and Breadth

Technical Depth

Technical depth refers to a software engineer's deep understanding of a specific area or technology. This means that they have extensive knowledge and expertise in a particular programming language, framework, or tool, which allows them to optimize it, solve complex problems related to it, and work efficiently in that area.

Your technical depth includes, for example, your main programming language, the packages you use the most, your package manager, your frameworks, your databases, your cloud provider, and so on. This is the area where you are the most comfortable and where you can solve problems the fastest!

For example, a software engineer who specializes in Java may have in-depth knowledge of the language, the Spring framework, and deployment on Tomcat in various configuration scenarios. This enables them to quickly and efficiently troubleshoot issues and optimize not only their own work, but also the team's work, and sometimes even improve the general developer experience company-wide.

Technical Breadth

On the other hand, technical breadth refers to a software engineer's broad understanding of multiple areas or technologies. They have a good understanding of various programming languages, frameworks, and tools, and can work on different projects that require different skills and knowledge.

I like to explain technical breadth as the technologies you know exist, you know how they work, what they are good for, but you never had the opportunity to use them directly on your day-to-day.

For example, if you are a JavaScript engineer, you know there are a lot of different libraries and frameworks, each with its own specific use case. You know what they are good for and what problems they solve. You have never had the opportunity to use them, but you could pick any of them up and start using it in a short time frame if the right opportunity arises.

While technical depth and breadth are distinct, they're not mutually exclusive. In fact, the most successful senior engineers have both. They have a strong foundation of technical depth in one or more areas, which enables them to tackle complex problems and optimize their work.

At the same time, they also have technical breadth, which allows them to adapt to new technologies and projects, and collaborate effectively with colleagues from different technical backgrounds. So, strive to develop both technical depth and breadth, and become a well-rounded and versatile senior engineer.

Besides technical skills, there are also plenty of soft skills that you'll need to master over time. But, I'll save that conversation for another blog post. For now, let's focus on how you can continue to improve your technical skills and stay up-to-date throughout your career.

Keep grinding

Okay, so here's the deal: once you hit that senior level, the path to seniority is pretty similar across the board, unless you decide to transition into a management role. It's like leveling up in a video game - you're always striving to get better and better, but the gameplay stays pretty consistent.

Now, don't get me wrong - when you start taking on tech lead, staff engineer, or principal engineer roles, you're going to need to add some new soft skills to your toolbelt (I will write another blog post on this). Things like leadership, mentorship, and influence become increasingly important. But when it comes to hard skills, the path stays pretty consistent. So let's say that reaching principal engineer status just makes you a "more senior senior" - you're still building on the same foundation of technical skills that got you to where you are.

Stay tuned for Part 2 of our blog series, where we'll dive into one of the most crucial topics for advancing your seniority: software architecture and design.

Kafka Retries: Implementing Consumer Retry with Go

Alexandre Amado de Castro — Thu, 02 Feb 2023 00:30:58 +0000

You don’t “need retries in Kafka” until the day one of your handlers starts failing and you’re forced into a choice: block consumption (and watch lag climb) or keep consuming and retry somewhere else.

This post is about one very pragmatic approach: commit the Kafka offset even when processing fails, then push the failed message into a Go retry queue. Kafka keeps moving, and your application owns the retry policy.

Quick context (assuming you already speak Kafka): consumer groups split partitions across consumers for parallelism. Offsets are committed per partition. In this approach, an offset commit means “Kafka can move on,” not necessarily “side effects succeeded.”

If you’re looking for the other style of “consumer retry” (don’t advance the offset until the handler succeeds), that’s a different design with different tradeoffs. This post is explicitly about keeping consumption moving and absorbing failures in a retry pipeline.

Working with examples

Let's create an example. Imagine you work for a company named ACME, and you have a Kafka topic that receives a new message every time a new customer is created. This customer needs to receive an email saying that their account is fully created and that they need to verify their email.

When you read something like "when something happens, do something else," always think about events! Events, in Kafka, are messages!

We're going to create a microservice that will:

Receive this new customer message from Kafka.
Compose a nice email.
Send it to the customer.
Add a new message to another topic saying that the message was sent!

Let's diagram that for more visibility:

Where are the retries?

Cool, but this is a blog post about retries, right? Where are the retries in all of this?

Well, any part of the processing can go wrong, and we don't want our customers to miss out on their account creation emails, do we? So, let's talk about a few things that could go wrong and how we can handle them with retries:

The email template database might be down. No problem, we'll just keep trying to connect until it's back up.
The template persisted inside of it might be invalid. We'll check for this and alert the team to update it if needed.
The SMTP server might be down or drop a message. We'll keep trying to send the email until it goes through.
The application might compose the SMTP message badly and the SMTP server rejects it. We'll catch this error and fix the composition before trying again.
Producing the message to the other Kafka topic might return an error on Kafka's server side. We'll keep trying until it's successfully sent.

Some of the errors mentioned may require code or data modifications to resolve, such as updating a template or correcting the way the application composes an SMTP message. These types of errors may not be suitable for retries as they may require manual intervention.

On the other hand, other errors such as temporary database downtime, flaky message production, or overloaded SMTP servers can potentially be resolved by retrying the operation.
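
To make that split concrete, here is a minimal sketch of how a handler could tag errors so the retry logic can tell the two groups apart. The PermanentError type, the sentinel errors, and IsRetryable are hypothetical names I am introducing for illustration, not part of any Kafka client.

```go
package main

import (
	"errors"
	"fmt"
)

// PermanentError marks a failure that retrying cannot fix
// (for example an invalid template or a malformed SMTP message).
type PermanentError struct{ Err error }

func (e *PermanentError) Error() string { return "permanent: " + e.Err.Error() }
func (e *PermanentError) Unwrap() error { return e.Err }

// Hypothetical sentinel errors standing in for the failures listed above.
var (
	ErrTemplateDBDown  = errors.New("template database unavailable")
	ErrInvalidTemplate = errors.New("template is invalid")
)

// IsRetryable reports whether an error is worth retrying.
// Anything wrapped in PermanentError should skip retries entirely.
func IsRetryable(err error) bool {
	if err == nil {
		return false
	}
	var perm *PermanentError
	return !errors.As(err, &perm)
}

func main() {
	transient := fmt.Errorf("load template: %w", ErrTemplateDBDown)
	permanent := fmt.Errorf("render: %w", &PermanentError{Err: ErrInvalidTemplate})
	fmt.Println(IsRetryable(transient)) // true
	fmt.Println(IsRetryable(permanent)) // false
}
```

Defaulting to "retryable" is a design choice here: an unknown error gets a few attempts before it is given up on, while errors the handler knows are hopeless are flagged explicitly.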

Unlocking Kafka Consumer Retries

This comes up constantly in event-driven systems: Kafka gives you a great log, but it doesn’t give you a “retry policy” button.

I’ve tried a bunch of architectures for retrying message processing. None is perfect, but one is consistently the easiest to operate when your top priority is “keep consuming,” as long as you’re explicit about the semantics trade.

Before diving into the solution, let me share some background on my previous attempts.

AWS SQS: A Great Option, But With Limitations

As a strong advocate for AWS, my first thought when considering retries was AWS SQS.

Amazon Simple Queue Service (SQS) is a fully managed message queuing service that enables the decoupling and scaling of microservices, distributed systems, and serverless applications. It comes with built-in retry behavior (visibility timeout + redrive policies + DLQ) and supports a maximum message size of 256 KB.

Now, don’t get me wrong: when a payload is bigger than that limit, the grown-up solution is the pointer pattern. Put the real payload in object storage (like S3) and send a small pointer through SQS.

That pattern works, but it adds moving parts:

You’ve introduced a second persistence system.
You need retention/lifecycle for blobs.
You need consistency between “pointer message” and “payload object”.
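
For reference, the pointer pattern looks roughly like the sketch below. The BlobStore map is an in-memory stand-in for object storage, and the Pointer fields, bucket name, and key are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlobStore is an in-memory stand-in for object storage such as S3.
type BlobStore map[string][]byte

// Pointer is the small message that travels through the queue.
type Pointer struct {
	Bucket string `json:"bucket"`
	Key    string `json:"key"`
	Size   int    `json:"size"`
}

// Offload stores the payload and returns the JSON pointer to enqueue.
func Offload(store BlobStore, bucket, key string, payload []byte) ([]byte, error) {
	store[bucket+"/"+key] = payload
	return json.Marshal(Pointer{Bucket: bucket, Key: key, Size: len(payload)})
}

// Resolve reads a pointer message and fetches the payload it refers to.
func Resolve(store BlobStore, msg []byte) ([]byte, error) {
	var p Pointer
	if err := json.Unmarshal(msg, &p); err != nil {
		return nil, err
	}
	payload, ok := store[p.Bucket+"/"+p.Key]
	if !ok {
		return nil, fmt.Errorf("payload %s/%s not found", p.Bucket, p.Key)
	}
	return payload, nil
}

func main() {
	store := BlobStore{}
	big := make([]byte, 300*1024) // larger than the 256 KB limit
	msg, _ := Offload(store, "retries", "customer-42", big)
	payload, err := Resolve(store, msg)
	fmt.Println(len(msg) < 256*1024, len(payload), err) // true 307200 <nil>
}
```

The "not found" branch in Resolve is exactly the consistency problem from the list above: the pointer can outlive the object it points to.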

Also, quick correction on Kafka: Kafka message size is configurable (defaults vary, but ~1MB is common). You can tune limits, but you still need an explicit payload contract, or producers will eventually surprise you in production.

So while SQS is a great option, it wasn’t the best fit for what I wanted here: keep the entire pipeline “Kafka-native” without introducing a second storage path just to make retries easier.

Kafka

Yes, that's exactly it: since SQS isn't a suitable solution, why not use Kafka itself?

I attempted to create an internal Kafka topic that would only be used by the application. The application would push messages to it and, in case of a failure, enqueue the message again with an updated count attribute in the header. Let me diagram that to make it clearer:

While this approach does work, it has some significant downsides. Firstly, setting up Kafka topics is not as straightforward as setting up SQS queues. Creating multiple topics can be cumbersome, and dealing with all the consumers and producers within the code can become quite messy.

Kafka also doesn’t give you a first-class “retry this message N times” feature at the broker level, and that’s mostly a good thing: Kafka can’t know whether your side effects are safe to repeat. Retries are an application-level decision tied to offsets, commits, backpressure, and idempotency.
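
The count-in-the-header idea from the retry-topic attempt can be sketched like this. Header and Message are simplified stand-ins for a Kafka client's types, and the x-retry-count header name is a hypothetical choice, not a Kafka convention.

```go
package main

import (
	"fmt"
	"strconv"
)

// Header and Message are simplified stand-ins for a Kafka client's types.
type Header struct {
	Key   string
	Value []byte
}

type Message struct {
	Headers []Header
	Value   []byte
}

const retryCountHeader = "x-retry-count" // hypothetical header name

// RetryCount returns the attempt counter stored in the headers,
// or 0 if the header is absent or malformed.
func RetryCount(m Message) int {
	for _, h := range m.Headers {
		if h.Key == retryCountHeader {
			n, err := strconv.Atoi(string(h.Value))
			if err != nil {
				return 0
			}
			return n
		}
	}
	return 0
}

// WithIncrementedRetry returns a copy of m whose counter is one higher.
func WithIncrementedRetry(m Message) Message {
	next := strconv.Itoa(RetryCount(m) + 1)
	out := Message{Value: m.Value}
	for _, h := range m.Headers {
		if h.Key != retryCountHeader {
			out.Headers = append(out.Headers, h)
		}
	}
	out.Headers = append(out.Headers, Header{Key: retryCountHeader, Value: []byte(next)})
	return out
}

func main() {
	const maxRetries = 3
	msg := Message{Value: []byte("customer-created")}
	for RetryCount(msg) < maxRetries {
		msg = WithIncrementedRetry(msg) // in the real flow: produce msg to the retry topic
	}
	fmt.Println(RetryCount(msg)) // 3
}
```

Once the counter reaches the limit, the application stops re-enqueueing and has to decide what "giving up" means, which is the application-level decision described above.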

Databases

Desperate Times Call for Desperate Measures, right?

As a last resort, I turned to databases for message retries. Modern SQL databases have row-locking mechanisms that can be leveraged to use a table as a queue.

To my surprise, this approach worked! However, it's not a perfect solution. Databases are not designed to function as queues, and using them in this way can be a stretch. There are a few downsides to this approach:

Databases are built for consistency through transactions. They can be fast, but as a queue they will never perform as well as a specialized tool like Kafka. This can easily become a bottleneck in the system.
Managing all of this code can easily become problematic, and it is not an easy pattern to share across an entire company.

So, what now? Memory

After trying all of the above solutions and being unsatisfied with the results, I decided to look at how other companies and frameworks handle retries.

One framework that stood out to me is Spring, an open-source Java framework that provides a comprehensive programming and configuration model for building modern enterprise applications.

The Spring way

One of the modules within Spring is Spring Kafka, which provides several ways to handle retries when consuming messages. These include using the RetryTemplate and RetryCallback interfaces from the Spring Retry library to define a retry policy, using the @KafkaListener annotation to configure retry behavior on a per-method basis, or using AOP to handle retries by using a RetryInterceptor.

Spring Kafka allows developers to choose the approach that best fits their use case, whether it is a global retry policy or a more fine-grained per-method retry configuration.

It's been a long time since I worked with Java, and Spring code is far from easy to understand (it's pretty advanced stuff), but after diving deeper into the Spring Kafka framework, I discovered that it implements an error handler using a strategy for handling exceptions thrown by the consumers.

When the listener throws a BatchListenerFailedException, the error handler knows which record in the batch failed. When any other exception is thrown, the error handler retries the whole batch of records from memory. This prevents a consumer rebalance during an extended retry sequence and allows for a more elegant solution.

In my honest opinion, the Spring Kafka framework offers a robust and flexible approach to handling retries that is well worth considering for any enterprise application.

It provides a range of options for developers to choose from and allows for more fine-grained control over retry behavior, making it a great choice for any organization looking to implement a retry mechanism for their messaging system.

Inspired by Spring Kafka (but different semantics) in Go

I currently work with (and love) Go. As far as my research went, I didn't find any port of the Spring Kafka and Spring Retry frameworks to Go, so let's build our own!

Here’s the key difference in my implementation compared to Spring Kafka’s approach: I’m not trying to pause a partition and “hold the offset line” until processing succeeds.

Instead, I commit the offset and move on, then retry in the background.

My approach includes:

Creating a consumer that reads from the topic.
Creating a Go channel that handles retries.
If a message fails to process, it is sent to the retry channel.
The main consumer continues to process the main topic while the other messages are retried.
If retries are exhausted, the message is sent to a persisted DLQ for later investigation.

⚠️ Warning:
This pattern intentionally decouples Kafka offsets from processing success.

If you commit a message and your process crashes before the retry succeeds, Kafka will not redeliver that message. With an in-memory Go channel, that means a restart can drop “in-flight retries.”

If you need stronger guarantees than that, the retry queue must be durable (retry topic, DB table, external queue) and your handler must be idempotent.

Also note that retries can happen later than subsequent messages, so you’re trading away strict ordering for throughput.

When not to use this pattern

If you need strict ordering guarantees, or you must be able to restart at any time without losing “in-flight retries,” don’t use an in-memory retry queue behind an early commit. In those cases, you either need a durable retry queue, or you need an offset-holding strategy (pause/seek and commit only after success).

Here’s the full example

This example uses an in-memory Go channel as the retry queue. That keeps Kafka consumption moving, but it also means retries are lost on process restart. In real systems, I usually evolve this into a durable retry queue.

One more practical note: your retry queue is part of your backpressure story. If it fills up, enqueueing can block and slow consumption. The "keep consuming no matter what" version of this pattern needs an explicit overflow strategy (drop, spill to durable storage, or fail fast into DLQ). Also consider adding jitter to your backoff to avoid retry storms when multiple messages fail simultaneously.
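
As a rough illustration of those two points, here is a sketch of a non-blocking enqueue and a jittered backoff. TryEnqueue and BackoffWithJitter are hypothetical helper names, and the "full jitter" formula (a random delay between zero and the capped exponential delay) is one common choice among several.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// TryEnqueue attempts a non-blocking send. It returns false when the queue is
// full so the caller can apply its overflow strategy (drop, spill, or DLQ).
func TryEnqueue(queue chan<- string, msg string) bool {
	select {
	case queue <- msg:
		return true
	default:
		return false
	}
}

// BackoffWithJitter doubles base once per attempt, caps the result at max,
// and returns a random duration in [0, capped delay]. base must be positive.
func BackoffWithJitter(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return time.Duration(rand.Int63n(int64(d) + 1))
}

func main() {
	queue := make(chan string, 1)
	fmt.Println(TryEnqueue(queue, "a")) // true
	fmt.Println(TryEnqueue(queue, "b")) // false: queue is full, apply the overflow strategy

	d := BackoffWithJitter(3, 100*time.Millisecond, 2*time.Second)
	fmt.Println(d <= 800*time.Millisecond) // true
}
```

The select with a default branch is what turns "block the consumer" into an explicit decision: when TryEnqueue returns false, the caller has to pick drop, spill, or DLQ on purpose.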

Quick orientation: ProcessRetryHandler is the interface your handler implements — Process does the work, MoveToDLQ handles exhausted retries. ConsumerWithRetryOptions wires everything together. The main loop reads from Kafka and enqueues failures; the goroutine retries them with backoff.
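
A minimal, standard-library-only sketch of that shape follows. The method signatures, the Message type, and the option fields are assumptions for illustration, and a plain slice stands in for the Kafka reader loop so the example runs without a broker.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Message stands in for a Kafka record.
type Message struct{ Key, Value string }

// ProcessRetryHandler is implemented by the application (signatures assumed).
type ProcessRetryHandler interface {
	Process(ctx context.Context, msg Message) error
	MoveToDLQ(ctx context.Context, msg Message)
}

// ConsumerWithRetryOptions wires the pieces together (field names assumed).
type ConsumerWithRetryOptions struct {
	Handler    ProcessRetryHandler
	MaxRetries int
	Backoff    time.Duration
	QueueSize  int
}

// Run processes every message in source, enqueues failures, and returns once
// every retry has either succeeded or been moved to the DLQ.
func Run(ctx context.Context, source []Message, opts ConsumerWithRetryOptions) {
	retries := make(chan Message, opts.QueueSize)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // retry worker
		defer wg.Done()
		for msg := range retries {
			retry(ctx, msg, opts)
		}
	}()
	for _, msg := range source {
		// The offset is treated as committed here: consumption moves on either way.
		if err := opts.Handler.Process(ctx, msg); err != nil {
			retries <- msg // blocks when the queue is full (backpressure)
		}
	}
	close(retries)
	wg.Wait()
}

// retry re-runs Process with a linearly growing delay, then gives up to the DLQ.
func retry(ctx context.Context, msg Message, opts ConsumerWithRetryOptions) {
	for attempt := 1; attempt <= opts.MaxRetries; attempt++ {
		time.Sleep(opts.Backoff * time.Duration(attempt))
		if opts.Handler.Process(ctx, msg) == nil {
			return
		}
	}
	opts.Handler.MoveToDLQ(ctx, msg)
}

// flakyHandler fails each key a configured number of times before succeeding.
type flakyHandler struct {
	mu             sync.Mutex
	failuresLeft   map[string]int
	processed, dlq []string
}

func (h *flakyHandler) Process(_ context.Context, m Message) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.failuresLeft[m.Key] > 0 {
		h.failuresLeft[m.Key]--
		return fmt.Errorf("transient failure for %s", m.Key)
	}
	h.processed = append(h.processed, m.Key)
	return nil
}

func (h *flakyHandler) MoveToDLQ(_ context.Context, m Message) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.dlq = append(h.dlq, m.Key)
}

// runDemo feeds three messages: "a" succeeds, "b" fails once, "c" never succeeds.
func runDemo() (processed, dlq []string) {
	h := &flakyHandler{failuresLeft: map[string]int{"b": 1, "c": 99}}
	opts := ConsumerWithRetryOptions{Handler: h, MaxRetries: 2, Backoff: time.Millisecond, QueueSize: 8}
	Run(context.Background(), []Message{{Key: "a"}, {Key: "b"}, {Key: "c"}}, opts)
	return h.processed, h.dlq
}

func main() {
	processed, dlq := runDemo()
	fmt.Println(processed, dlq) // [a b] [c]
}
```

In this sketch the main loop never waits on a failing message: "b" is recovered by the worker and "c" ends up in the DLQ after two retries. Everything in the retries channel lives only in memory, so the restart caveat from the warning above applies as written.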

Semantics (don’t skip this)

This pattern is “Kafka keeps moving,” which means you’re making a deliberate trade: Kafka offsets can advance even when processing hasn’t succeeded yet.

What Kafka guarantees (in this pattern)

If you commit before success (or your client auto-commits on read), Kafka won’t redeliver on restart. That’s effectively at-most-once for your side effects.

If you’re following along with kafka-go, note that Reader.ReadMessage with a GroupID advances committed offsets independently of whether your handler succeeds.

How to get stronger guarantees

If you want at-least-once for the side effects, the retry queue can’t be in-memory. It needs to be durable (retry topic, DB table, external queue) and your handler needs to be idempotent.

Idempotency (practical version)

If processing record X twice causes two emails, two charges, or two rows, you’re going to have a bad week. The fix is usually an idempotency key + write-once constraint (or upsert).
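
Here is a small sketch of the idea, with an in-memory map standing in for the write-once constraint. SentStore, SendOnce, ErrSMTPDown, and the key format are hypothetical names; a real system would usually enforce this with a unique index or upsert in its database.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrSMTPDown is a hypothetical transient failure used in the demo.
var ErrSMTPDown = errors.New("smtp server unavailable")

// SentStore remembers which idempotency keys already produced a side effect.
// In production this would be a unique constraint or upsert in a database.
type SentStore struct{ seen sync.Map }

// SendOnce runs send at most once per key. If send fails, the key is released
// so a later retry can try again.
func (s *SentStore) SendOnce(key string, send func() error) error {
	if _, loaded := s.seen.LoadOrStore(key, struct{}{}); loaded {
		return nil // duplicate delivery: the side effect was already claimed
	}
	if err := send(); err != nil {
		s.seen.Delete(key)
		return err
	}
	return nil
}

func main() {
	store := &SentStore{}
	sent, attempts := 0, 0
	send := func() error {
		attempts++
		if attempts == 1 {
			return ErrSMTPDown // first attempt fails, so the key is released
		}
		sent++
		return nil
	}
	for i := 0; i < 4; i++ { // the same record handled four times
		_ = store.SendOnce("welcome-email:customer-42", send)
	}
	fmt.Println(attempts, sent) // 2 1
}
```

One simplification to be aware of: a duplicate that arrives while the first send is still in flight is treated as already done. That is fine for a sketch, but a durable implementation would track a pending state as well.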

Poison pills

A poison pill is a record that will never succeed without code/data changes. This pattern is nice because poison pills don’t block Kafka consumption — they show up as retry storms and DLQ events instead.

Decision table

In my head, this “commit early + retry elsewhere” idea has a few maturity levels:

Option	Kafka consumption	Failure semantics	Operational complexity	When I’d pick it
In-memory retry channel	Never blocks consumption	Retries are lost on restart	Low	Quick wins, low blast radius workflows
Durable retry queue (Kafka retry topic / external queue)	Never blocks consumption	At-least-once is achievable (with idempotency)	Medium	Most production systems
DB-backed state machine / outbox	Never blocks consumption	Strongest auditability and control	High	Complex workflows, compliance, long retries

In conclusion

Kafka retries are not a broker feature; they’re a semantics decision.

This “commit early + retry elsewhere” approach is great when your top priority is keeping Kafka consumption unimpaired. The cost is that you’re now running a second system: your retry pipeline. Treat it like a first-class dependency.

If I were shipping this pattern, I’d page on:

Retry queue depth (and time-in-queue).
Retry exhaustion rate (messages going to DLQ).
DLQ age (DLQ isn’t success; it’s a promise you’ll look).
Processing success latency (p95/p99) for the retry worker.
Consumer lag as a sanity check (it should stay flat even during downstream outages).

Dealing with distributed systems challenges or building reliable event-driven architectures? Let's talk.