<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mohan Krishna Alavala</title>
    <description>The latest articles on Forem by Mohan Krishna Alavala (@mohankrishnaalavala).</description>
    <link>https://forem.com/mohankrishnaalavala</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3424070%2Fe7522bff-b6a0-4e5f-8ea1-7af8bb4f72f2.png</url>
      <title>Forem: Mohan Krishna Alavala</title>
      <link>https://forem.com/mohankrishnaalavala</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mohankrishnaalavala"/>
    <language>en</language>
    <item>
      <title>I corrected my own benchmark claim from 91.5% to 88%. Here's what changed.</title>
      <dc:creator>Mohan Krishna Alavala</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:03:41 +0000</pubDate>
      <link>https://forem.com/mohankrishnaalavala/i-corrected-my-own-benchmark-claim-from-915-to-88-heres-what-changed-3i1</link>
      <guid>https://forem.com/mohankrishnaalavala/i-corrected-my-own-benchmark-claim-from-915-to-88-heres-what-changed-3i1</guid>
      <description>&lt;p&gt;A week ago I shipped v4.4.3 of &lt;a href="https://github.com/mohankrishnaalavala/context-router" rel="noopener noreferrer"&gt;context-router&lt;/a&gt; with a number on the README: &lt;em&gt;"91.5% fewer tokens than code-review-graph."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It was true in the narrow sense that both numbers came from real benchmark runs. It was also wrong in every way that matters. The two tools were running on different repos, on different tasks, with different inputs. I was comparing my best-case workload to their best-case workload and putting a percent sign between them.&lt;/p&gt;

&lt;p&gt;This post is about the redo. v4.4.4 ships a workload-matched run on the same SHAs and the same diffs as input, on the same machine. The new headline is &lt;strong&gt;~88% fewer tokens, 2/3 rank-1 hits vs 0/3&lt;/strong&gt; on the kubernetes commits I picked. That's a number I'll defend.&lt;/p&gt;

&lt;h2&gt;What context-router does&lt;/h2&gt;

&lt;p&gt;context-router is a small Python project for routing AI coding agents to the minimum useful context. You point it at your repo, give it a task type (&lt;code&gt;review&lt;/code&gt;, &lt;code&gt;debug&lt;/code&gt;, &lt;code&gt;implement&lt;/code&gt;, &lt;code&gt;handover&lt;/code&gt;), and it returns a ranked pack of files and snippets sized to fit a token budget.&lt;/p&gt;

&lt;p&gt;The way you benchmark something like this is straightforward: pick a real bug-fix commit, hide the fix, hand the tool the parent state plus the diff, and check whether the file the human eventually changed shows up in the tool's top-N output. If yes, the tool would have routed the agent to the right place.&lt;/p&gt;
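
&lt;p&gt;The check in that last step is mechanical enough to sketch. A minimal version in Python (hypothetical helper names and paths, not the project's actual harness):&lt;/p&gt;

```python
# Hypothetical sketch of the hit@N check described above: given the file a
# human actually changed in the fix commit, ask whether the tool's ranked
# output would have routed an agent there.

def hit_at_n(ranked_files, changed_file, n=3):
    """True if the file the human changed appears in the tool's top-N."""
    return changed_file in ranked_files[:n]

def rank_of(ranked_files, changed_file):
    """1-based rank of the changed file, or None if it wasn't surfaced."""
    try:
        return ranked_files.index(changed_file) + 1
    except ValueError:
        return None

# Example with made-up paths:
ranked = ["pkg/kubelet/status/status_manager.go", "pkg/kubelet/kubelet.go"]
assert hit_at_n(ranked, "pkg/kubelet/status/status_manager.go", n=3)
assert rank_of(ranked, "pkg/kubelet/status/status_manager.go") == 1
```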

&lt;h2&gt;How I got the wrong number&lt;/h2&gt;

&lt;p&gt;For v4.4.3 I ran context-router across six OSS repos (gin, actix-web, django, gson, requests, zod). Separately, I ran &lt;code&gt;code-review-graph&lt;/code&gt; on a different set of repos and grabbed its average tokens per output. Then I divided.&lt;/p&gt;

&lt;p&gt;That isn't a comparison. That's two unrelated measurements with a percent sign glued between them. If &lt;code&gt;code-review-graph&lt;/code&gt; happened to be running on repos where it had to emit more boilerplate, or where its scorer was less confident, my number would be flattering for reasons that had nothing to do with my tool.&lt;/p&gt;

&lt;p&gt;Someone pointed this out. They were right. I pulled the claim and rebuilt the test.&lt;/p&gt;

&lt;h2&gt;Workload-matching in one sentence&lt;/h2&gt;

&lt;p&gt;Both tools see the same SHAs and the same diff as input.&lt;/p&gt;

&lt;p&gt;That's the rule. If you can't say that sentence about a benchmark, the percent at the bottom isn't really pointing at anything.&lt;/p&gt;

&lt;p&gt;Concretely, here's what v4.4.4's run looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I picked three single-source-file bug-fix commits in &lt;code&gt;kubernetes/kubernetes&lt;/code&gt;: kubelet &lt;code&gt;status_manager&lt;/code&gt;, client-go &lt;code&gt;clientcmd&lt;/code&gt; loader, and kube-proxy &lt;code&gt;winkernel&lt;/code&gt; proxier. SHAs are pinned in &lt;code&gt;benchmark/holdout/kubernetes/tasks.yaml&lt;/code&gt; so anyone can reproduce.&lt;/li&gt;
&lt;li&gt;For each commit both tools get the same input: the parent tree, with the parent→fix diff handed in. Neither tool gets to "see the answer" in the working copy.&lt;/li&gt;
&lt;li&gt;context-router: &lt;code&gt;pack --mode review --pre-fix &amp;lt;fix-sha&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;code-review-graph: &lt;code&gt;detect-changes --base &amp;lt;fix-sha&amp;gt;^&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The diff each tool consumes is &lt;code&gt;git diff &amp;lt;fix-sha&amp;gt;^..&amp;lt;fix-sha&amp;gt;&lt;/code&gt;. Identical bytes.&lt;/li&gt;
&lt;/ul&gt;
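
&lt;p&gt;The "identical bytes" claim is checkable, not just asserted. Here's a sketch of how you might fingerprint the shared input once and compare what each tool actually consumed (assumed layout; this is not the repo's actual reproducer script):&lt;/p&gt;

```python
# Sketch: verify both tools really do consume byte-identical diff input by
# hashing the parent->fix diff once. If the two tools' recorded input hashes
# ever differ, the run is not workload-matched.
import hashlib
import subprocess

def diff_bytes(repo_dir, fix_sha):
    """The parent->fix diff, exactly as both tools should receive it."""
    return subprocess.run(
        ["git", "-C", repo_dir, "diff", f"{fix_sha}^..{fix_sha}"],
        check=True, capture_output=True,
    ).stdout

def fingerprint(data):
    return hashlib.sha256(data).hexdigest()

# Same bytes must fingerprint identically; different bytes must not.
payload = b"diff --git a/x b/x\n"
assert fingerprint(payload) == fingerprint(payload)
assert fingerprint(payload) != fingerprint(payload + b"extra")
```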

&lt;p&gt;Then I report what each tool predicts in its top-3, what its rank-1 was, and how many tokens it emitted.&lt;/p&gt;

&lt;h2&gt;The numbers&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;context-router&lt;/th&gt;
&lt;th&gt;code-review-graph&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rank-1 hits&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2/3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall-at-3&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;406&lt;/td&gt;
&lt;td&gt;3,478&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg tokens / task&lt;/td&gt;
&lt;td&gt;135&lt;/td&gt;
&lt;td&gt;1,159&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Errors&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Token delta on this workload: &lt;strong&gt;-88.3%&lt;/strong&gt;.&lt;/p&gt;
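
&lt;p&gt;For anyone who wants to check the arithmetic, the headline is just the relative token reduction computed from the totals in the table:&lt;/p&gt;

```python
# Relative token reduction on this workload, from the totals above
# (406 tokens for context-router vs 3,478 for code-review-graph).
def token_delta(ours, theirs):
    """Percent change in tokens relative to the baseline tool."""
    return (ours - theirs) / theirs * 100

delta = token_delta(406, 3478)
assert round(delta, 1) == -88.3
```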

&lt;p&gt;A few honest things to note before anyone gets too excited:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three tasks is a small N.&lt;/strong&gt; The direction is what I'm reporting with confidence; the precise percentage could easily shift on a different task mix. If you put more weight on the single number than that, you're reading too much into it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recall-at-3 is tied.&lt;/strong&gt; Both tools surfaced the right file in their top three on every task. The useful gap is at rank-1, and at cost. If your agent only reads the top hit, context-router takes you to the right file two times out of three; the other tool zero. If your agent reads the top three, both tools work, but one costs roughly 9× more tokens to do it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Both tools were tripped by the same fixture noise.&lt;/strong&gt; I had to reconstruct the kubernetes repo from per-commit GitHub tarballs because depth-50000 clones throttled badly on my network and a full clone is more bandwidth than I had at the time. GitHub's tarball generator stamps the source SHA into a couple of &lt;code&gt;version.sh&lt;/code&gt; and &lt;code&gt;version/base.go&lt;/code&gt; files at archive time. Those files appear in the synthetic parent→fix diff, but were not in the real upstream commit. Both tools' rank-1 picks on the two missed cases were one of those stamped files. On a real working-tree-diff workflow that noise wouldn't exist. I'll re-run this on a full clone once I have the bandwidth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;code-review-graph indexes faster.&lt;/strong&gt; Roughly 80 seconds to build its graph + FTS for the full kubernetes tree. context-router takes 4–5 minutes on the same checkout because it's collecting richer call/symbol metadata. That's a real cost you pay; the precision and token economy at query time are what you get for it.&lt;/p&gt;

&lt;p&gt;The full report with per-task tables, predicted top-3 lists, and the reproducer is at &lt;a href="https://github.com/mohankrishnaalavala/context-router/blob/main/benchmarks/comparison-code-review-graph.md" rel="noopener noreferrer"&gt;&lt;code&gt;benchmarks/comparison-code-review-graph.md&lt;/code&gt;&lt;/a&gt;. The caveats are in the report itself, not in a corner where nobody looks.&lt;/p&gt;

&lt;h2&gt;What else shipped in v4.4.4&lt;/h2&gt;

&lt;p&gt;The benchmark redo wasn't the only thing in this release. The other piece worth mentioning, because it's load-bearing for the 2/3 rank-1 number, is an FTS5 anchor for &lt;code&gt;implement&lt;/code&gt;-mode candidate retrieval.&lt;/p&gt;

&lt;p&gt;v4.4.3 had a quiet regression on repos with more than 10,000 symbols: implement-mode's candidate set came from a &lt;code&gt;get_all&lt;/code&gt; query capped at the first 10K rows with no &lt;code&gt;ORDER BY&lt;/code&gt;. If the file you cared about lived past row 10,000 (say, in a 197K-symbol kubernetes graph), it was invisible. The bug was masked on every smaller repo I tested against.&lt;/p&gt;

&lt;p&gt;The v4.4.4 fix is a SQLite FTS5 virtual table over &lt;code&gt;(name, signature, file_path)&lt;/code&gt; with porter + unicode61 tokenization, kept live by three triggers. &lt;code&gt;SymbolRepository.search_fts(query, repo, limit=200)&lt;/code&gt; returns BM25-ranked symbol rows; the orchestrator unions those with the existing 10K slice, FTS first so they survive top-N capping. When FTS returns zero hits &lt;em&gt;and&lt;/em&gt; &lt;code&gt;get_all&lt;/code&gt; returned ≥10K rows, a stderr warning fires naming the case. No silent degradation.&lt;/p&gt;
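
&lt;p&gt;For readers unfamiliar with FTS5 external-content tables, here is a minimal stdlib-only sketch of the pattern (illustrative table, column, and trigger names; this is not the project's actual schema):&lt;/p&gt;

```python
# Minimal sketch of an FTS5 anchor kept live by a trigger, using stdlib
# sqlite3. The real implementation uses three triggers (insert/update/delete);
# only the insert trigger is shown here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE symbols (
    id INTEGER PRIMARY KEY, name TEXT, signature TEXT, file_path TEXT);
CREATE VIRTUAL TABLE symbols_fts USING fts5(
    name, signature, file_path,
    content='symbols', content_rowid='id',
    tokenize='porter unicode61'
);
-- trigger keeps the index live as rows are inserted
CREATE TRIGGER symbols_ai AFTER INSERT ON symbols BEGIN
  INSERT INTO symbols_fts(rowid, name, signature, file_path)
  VALUES (new.id, new.name, new.signature, new.file_path);
END;
""")
conn.execute(
    "INSERT INTO symbols(name, signature, file_path) VALUES (?, ?, ?)",
    ("SyncPod", "func (m *manager) SyncPod(...)",
     "pkg/kubelet/status/status_manager.go"),
)
# BM25-ranked lookup; in SQLite, a lower bm25() score is a better match.
rows = conn.execute(
    "SELECT symbols.file_path FROM symbols_fts "
    "JOIN symbols ON symbols.id = symbols_fts.rowid "
    "WHERE symbols_fts MATCH ? ORDER BY bm25(symbols_fts) LIMIT 200",
    ("SyncPod",),
).fetchall()
assert rows[0][0] == "pkg/kubelet/status/status_manager.go"
```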

&lt;h2&gt;Three things I'd like you to take from this&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Workload-matched or it doesn't count.&lt;/strong&gt; If you read a tool benchmark and can't tell whether both systems saw the same input, treat the result as marketing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Show the misses.&lt;/strong&gt; "2/3" with the failed case explained is more credible than "100%" with no commentary. The fixture noise that tripped both tools on this run is right there in the report. Hiding it would have made the rank-1 number look better and the project less trustworthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A correction isn't a defeat.&lt;/strong&gt; v4.4.3 had a claim that didn't hold up. v4.4.4 has one that does. The repo is in better shape than it would have been if nobody had pushed back.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want to reproduce the run yourself, the commands are at the bottom of the comparison report. If you find a workload where the numbers don't hold, open an issue with the raw &lt;code&gt;comparison_*.json&lt;/code&gt; attached and I'll either fix it or update the README to match what's true.&lt;/p&gt;

&lt;p&gt;context-router is on &lt;a href="https://github.com/mohankrishnaalavala/context-router" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;; v4.4.4 is on PyPI as &lt;code&gt;context-router-cli&lt;/code&gt; and on Homebrew as &lt;code&gt;mohankrishnaalavala/context-router/context-router&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>claude</category>
      <category>context</category>
    </item>
    <item>
      <title>Release &amp; Deprecation Sentinel — Autonomous SRE Copilot with Bright Data + n8n</title>
      <dc:creator>Mohan Krishna Alavala</dc:creator>
      <pubDate>Wed, 27 Aug 2025 19:18:27 +0000</pubDate>
      <link>https://forem.com/mohankrishnaalavala/release-deprecation-sentinel-autonomous-sre-copilot-with-bright-data-n8n-3f50</link>
      <guid>https://forem.com/mohankrishnaalavala/release-deprecation-sentinel-autonomous-sre-copilot-with-bright-data-n8n-3f50</guid>
      <description>&lt;p&gt;🚀 &lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As a &lt;strong&gt;DevOps/SRE engineer&lt;/strong&gt;, one of the most painful problems I face is &lt;strong&gt;keeping up with release notes and deprecation notices&lt;/strong&gt; from dozens of vendors (Kubernetes, Docker, HashiCorp, Elastic, CNCF, etc.).&lt;/p&gt;

&lt;p&gt;Deprecations can break clusters, APIs, and infrastructure if missed.&lt;br&gt;
Releases bring fixes/features, but it’s overwhelming to track everything manually.&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;Release &amp;amp; Deprecation Sentinel&lt;/strong&gt; comes in:&lt;br&gt;
An &lt;strong&gt;autonomous, n8n-powered bot&lt;/strong&gt; that continuously fetches release &amp;amp; deprecation notes using &lt;strong&gt;Bright Data&lt;/strong&gt;, normalizes them, deduplicates alerts, assigns owners, and publishes structured events to Slack + API.&lt;br&gt;
&lt;a href="https://github.com/mohankrishnaalavala/release-deprecation-sentinel/tree/main" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://drive.google.com/file/d/1P34gODWcpveCq8D66P78iTp9w-GTUayK/view?usp=sharing" rel="noopener noreferrer"&gt;Demo&lt;/a&gt; &lt;/p&gt;




&lt;p&gt;🎯 &lt;strong&gt;Problem Statement&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprises rely on a &lt;strong&gt;multi-vendor DevOps stack&lt;/strong&gt; (Kubernetes, Istio, Terraform, Prometheus, Vault, Loki, etc.).&lt;/li&gt;
&lt;li&gt;Each vendor publishes &lt;strong&gt;release notes in a different format&lt;/strong&gt; (HTML, Markdown, blog pages, docs).&lt;/li&gt;
&lt;li&gt;Checking them manually wastes hours and still risks &lt;strong&gt;missing critical deprecation notices&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Teams lack &lt;strong&gt;automation to normalize, classify, and distribute&lt;/strong&gt; this information to the right stakeholders.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;💡 &lt;strong&gt;Solution — Release &amp;amp; Deprecation Sentinel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;autonomous monitoring agent&lt;/strong&gt; built in &lt;strong&gt;n8n&lt;/strong&gt; with the &lt;strong&gt;Bright Data proxy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automated Fetching&lt;/strong&gt; — every 6 hours, fetch release/deprecation notes from 18+ vendors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bright Data Integration&lt;/strong&gt; — ensure scraping works reliably even behind rate limits/CDNs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Normalization &amp;amp; Deduplication&lt;/strong&gt; — unify different vendor formats into a single JSON schema; dedupe by vendor/product/version.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ownership Assignment&lt;/strong&gt; — map products to owning teams (e.g., SRE, Platform, Security).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Severity Tagging&lt;/strong&gt; — classify releases (low) vs. deprecations (high).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Due Dates&lt;/strong&gt; — add default remediation windows (e.g., 30 days for releases, 60 for deprecations).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slack Notifications&lt;/strong&gt; — auto-publish structured alerts into Slack channels.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web UI + API&lt;/strong&gt; — present a real-time searchable dashboard (filter by severity/vendor/type).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;History &amp;amp; Retention&lt;/strong&gt; — keep the last 500 events in memory for audit/debug.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fung5c081s79qhjbmgw7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fung5c081s79qhjbmgw7w.png" alt="n8n Workflow" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;n8n&lt;/strong&gt; (workflow automation backbone)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bright Data Proxy&lt;/strong&gt; (reliable data extraction from vendor docs)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JavaScript Code Nodes&lt;/strong&gt; (JSON normalization, dedupe, schema enforcement)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slack Integration&lt;/strong&gt; (real-time alerts to teams)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Webhook API + Custom UI&lt;/strong&gt; (live dashboard for browsing events)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Agent (OpenAI)&lt;/strong&gt; — attempted summarization (skipped due to token limits; the architecture is ready for future LLM integration)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Why Bright Data?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Many vendor docs (e.g., Kubernetes, Docker) sit behind &lt;strong&gt;Cloudflare/CDN protections&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Standard HTTP requests often fail with &lt;strong&gt;SSL/self-signed-certificate errors&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Bright Data’s proxy ensures &lt;strong&gt;always-on, reliable fetching&lt;/strong&gt; with geo/IP rotation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without it, half of these vendors would fail in automation; Bright Data is the &lt;strong&gt;core enabler&lt;/strong&gt; of this project.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Workflow Highlights&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Key Nodes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seed Vendors&lt;/strong&gt; → list of 18+ vendors (Kubernetes, Docker, HashiCorp, Elastic, CNCF, Red Hat, Grafana, Istio, Argo, Prometheus).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bright Data HTTP Request&lt;/strong&gt; → fetch vendor release notes reliably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trim &amp;amp; Normalize&lt;/strong&gt; → enforce schema {vendor, product, type, version, severity, url}.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication Node&lt;/strong&gt; → avoid duplicate alerts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assign Owners&lt;/strong&gt; → map vendor/product to correct Slack/Jira team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack Node&lt;/strong&gt; → send structured alert.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webhook + Dashboard&lt;/strong&gt; → expose API &amp;amp; UI with filters.&lt;/li&gt;
&lt;/ul&gt;
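
&lt;p&gt;To make the normalize → dedupe → severity steps concrete, here is an illustrative sketch in Python (the workflow itself implements this in n8n JavaScript code nodes; field names follow the schema above, but the logic shown is my assumption, not the exact node code):&lt;/p&gt;

```python
# Illustrative sketch of the Trim & Normalize and Deduplication nodes:
# coerce raw vendor events into one schema, tag severity, drop repeats.
def normalize(event):
    """Coerce a raw vendor event into the single JSON schema described above."""
    etype = event.get("type", "release").lower()
    return {
        "vendor": event["vendor"],
        "product": event["product"],
        "type": etype,
        "version": event["version"],
        # per the severity-tagging rule: deprecations high, releases low
        "severity": "high" if etype == "deprecation" else "low",
        "url": event.get("url", ""),
    }

def dedupe(events):
    """Drop repeats keyed on vendor/product/version, keeping first-seen."""
    seen, out = set(), []
    for e in events:
        key = (e["vendor"], e["product"], e["version"])
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out

raw = [
    {"vendor": "Kubernetes", "product": "kubelet",
     "type": "deprecation", "version": "1.31"},
    {"vendor": "Kubernetes", "product": "kubelet",
     "type": "deprecation", "version": "1.31"},
]
events = dedupe([normalize(e) for e in raw])
assert len(events) == 1 and events[0]["severity"] == "high"
```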




&lt;p&gt;&lt;strong&gt;Closing Pitch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Release &amp;amp; Deprecation Sentinel transforms &lt;strong&gt;vendor chaos into structured insights&lt;/strong&gt;.&lt;br&gt;
With &lt;strong&gt;Bright Data ensuring reliable fetching&lt;/strong&gt; and &lt;strong&gt;n8n&lt;/strong&gt; orchestrating the flow, teams never miss critical deprecations or releases.&lt;/p&gt;

&lt;p&gt;This project is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Useful for any &lt;strong&gt;DevOps/SRE team.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Scalable across &lt;strong&gt;dozens of vendors.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Fully &lt;strong&gt;autonomous&lt;/strong&gt; — no human babysitting required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s not just a workflow; it’s a &lt;strong&gt;copilot for release management&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>n8nbrightdatachallenge</category>
      <category>ai</category>
      <category>devchallenge</category>
    </item>
    <item>
      <title>GPT-5 Meets DevOps: Why It’s the AI Sidekick You Didn’t Know You Needed</title>
      <dc:creator>Mohan Krishna Alavala</dc:creator>
      <pubDate>Sat, 09 Aug 2025 20:33:27 +0000</pubDate>
      <link>https://forem.com/mohankrishnaalavala/gpt-5-meets-devops-why-its-the-ai-sidekick-you-didnt-know-you-needed-1c82</link>
      <guid>https://forem.com/mohankrishnaalavala/gpt-5-meets-devops-why-its-the-ai-sidekick-you-didnt-know-you-needed-1c82</guid>
      <description>&lt;p&gt;When OpenAI unveiled GPT-5 on August 7, 2025, it wasn’t just another upgrade — it marked a leap toward what Sam Altman described as interacting with a “PhD-level expert” on any subject — coding included.&lt;/p&gt;

&lt;p&gt;If you’re deep in pipelines, infrastructure, and uptime, here’s how GPT-5 transforms DevOps from grunt work to smart orchestration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3x2n4otgx7sqz9wr582.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3x2n4otgx7sqz9wr582.png" alt="A digital illustration in a futuristic and technological style, showcasing the advanced capabilities of GPT-5.(Dall.E)" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6se1mvy8iwjpo3uij29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6se1mvy8iwjpo3uij29.png" alt="GPT-5 vs GPT-4: What’s New?" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World DevOps Use Cases with GPT-5&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;“Software-On-Demand” for Infrastructure as Code&lt;br&gt;
Generate Terraform, Helm charts, Ansible playbooks, and full CI/CD workflows from plain English descriptions — cleaner, faster, and more compliant than GPT-4.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Smarter Debugging &amp;amp; Troubleshooting&lt;br&gt;
Reads CI logs, spots the failing step, and proposes fixes — pulling references from Stack Overflow, CVE databases, or your internal docs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integrated Incident Response&lt;br&gt;
Runs log parsing, dashboard checks, and Jira ticket creation in one shot thanks to parallel tool-calling.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Hands-On Prompt Examples for DevOps Engineers&lt;/strong&gt;&lt;br&gt;
Here are GPT-5 prompts you can try today to see the difference from GPT-4:&lt;br&gt;
Prompt 1: One-Shot Multi-Cloud Deployment&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generate a Terraform configuration to deploy a 3-tier application (frontend in AWS S3 + CloudFront, backend in Azure App Service, database in GCP Cloud SQL). Include networking, IAM roles, and monitoring integrations for each cloud. Ensure the code is modular, reusable, and passes terraform validate without errors.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompt 2: Blue-Green Kubernetes Pipeline&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create a GitHub Actions workflow that deploys a Spring Boot app to AKS using a blue-green strategy. Add a job that runs kubectl top pods every 10 minutes post-deployment and triggers an alert to Grafana if CPU &amp;gt; 80% for 5 minutes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompt 3: Automated Incident Report&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Analyze the attached Kubernetes logs and Prometheus metrics. Identify the root cause of the crash, suggest a fix, and draft a Jira ticket with reproduction steps and the fix plan.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompt 4: Security Patch Integration&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Check my Helm chart for outdated dependencies. Cross-reference the vulnerable packages with the latest CVE database, suggest patched versions, and update the Chart.yaml and templates accordingly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompt 5: Multi-Modal Runbook Helper&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here’s a screenshot of my Grafana dashboard and a PDF of the last incident report. Suggest improvements to alert thresholds and propose runbook updates to avoid similar issues.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why GPT-5 Is a Game-Changer for DevOps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The leap from GPT-4 to GPT-5 feels less like a version upgrade and more like adding a DevOps superbrain that never sleeps.&lt;br&gt;
It delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-shot automation for complex deployments.&lt;/li&gt;
&lt;li&gt;Lint-clean YAML/JSON outputs on first try.&lt;/li&gt;
&lt;li&gt;Massive context memory for persistent workflows.&lt;/li&gt;
&lt;li&gt;Multimodal reasoning across logs, dashboards, PDFs, and images.&lt;/li&gt;
&lt;li&gt;Fewer hallucinations, more production-ready code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With GPT-5, you’re not just automating tasks — you’re automating the thinking.&lt;/p&gt;

</description>
      <category>aiops</category>
      <category>devops</category>
      <category>chatgpt</category>
      <category>powerautomate</category>
    </item>
  </channel>
</rss>
