Forem: Nex Tools

Claude Code for Chaos Engineering: How I Stopped Hoping My System Was Resilient and Started Proving It

Nex Tools — Tue, 26 May 2026 01:24:27 +0000

For years I told myself my production system was resilient. I had retries. I had circuit breakers. I had timeouts. I had a runbook. I had survived enough incidents that I knew the system mostly held together under stress. What I did not have was any real evidence that the next incident would not be the one that broke it for good. Resilience was a story I was telling myself, not a property I had measured.

Chaos engineering is the practice of replacing that story with measurements. Instead of hoping the system survives a database failover, you trigger a database failover in production during a quiet hour and watch what actually happens. Instead of hoping the timeouts you set are right, you inject latency into a dependency and see whether the symptoms match what your timeouts were supposed to prevent. The practice is uncomfortable the first few times you do it. After a while, it becomes the only honest way to know whether your system actually works the way you think it does.

The hard part of chaos engineering is not the experiments themselves. The experiments are mostly small scripts that introduce a specific failure. The hard part is the workflow around the experiments: choosing which failures to test, defining what success looks like, scheduling the experiments without disrupting the business, capturing the results, and turning the results into changes that make the system more resilient. That workflow is where Claude Code reshaped my practice. Here is what the workflow looks like.

Why Most Teams Skip Chaos Engineering

Most teams I have worked with have heard of chaos engineering, agree it is a good idea, and have never run a single experiment. The reasons are predictable. The experiments feel scary. The setup work is invisible to the rest of the business. The first few experiments often produce surprising results that lead to uncomfortable conversations. The path of least resistance is to keep telling the resilience story and hope no one runs the test.

The cost of skipping the practice does not show up on any single day. It shows up in the incident postmortem six months later, when the failure mode that took the system down was something the team would have caught with a fifteen-minute experiment. The cost is invisible until the moment it is enormous.

The other reason teams skip the practice is that the work feels like it has no clear owner. Chaos engineering does not fit neatly into any existing role. It is not feature work, not bug fixing, not on-call response, not platform engineering in the traditional sense. The work needs someone to push for it, and most teams do not have that person.

The teams that get chaos engineering right are the teams that make it boring. The first few experiments are dramatic, but the goal is to get to the point where running an experiment is as routine as deploying a small change. Boring chaos engineering is the kind that actually happens.

The workflow I describe below is the workflow that made chaos engineering boring for me. The Claude Code skills handle the parts that would otherwise be tedious, which leaves the interesting parts to humans.

If you want the broader picture of how I think about production safety, the Claude Code for Incident Response workflow covers what happens when something does break, and chaos engineering is the practice that surfaces those failures before customers do.

The Hypothesis Skill

The first skill in the workflow handles experiment design. Given a service and a failure mode, the skill produces a structured hypothesis that the experiment will test.

A hypothesis is not a vague intention to break something. It is a specific claim that the experiment is designed to confirm or refute. A good hypothesis says something like, "When the primary database becomes unreachable for thirty seconds, the API continues to serve cached reads, write requests return a 503 with a Retry-After header, and the p99 latency on read endpoints stays below 800 milliseconds." A bad hypothesis says, "The system should handle database failures."

The skill enforces the structure. Every hypothesis has a failure scenario, an expected behavior across multiple dimensions, a blast radius, and a rollback condition. The structure forces the person designing the experiment to think through what they actually expect to happen, which is often the most valuable part of the entire workflow.
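The four-part structure can be sketched as a small data type. This is a minimal illustration, not the skill itself: the class names, the metric names, the 0.99 success ratio, and the blast radius text are hypothetical, and predictions are simplified to numeric thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One measurable claim, e.g. 'read p99 latency stays at or below 800 ms'."""
    metric: str        # hypothetical metric name
    comparator: str    # "<=" or ">="
    threshold: float


@dataclass(frozen=True)
class Hypothesis:
    failure_scenario: str
    predictions: tuple[Prediction, ...]
    blast_radius: str
    rollback_condition: str

    def problems(self) -> list[str]:
        """Return the reasons this hypothesis is too vague to run."""
        issues = []
        if not self.failure_scenario.strip():
            issues.append("missing failure scenario")
        if not self.predictions:
            issues.append("no measurable predictions")
        if not self.blast_radius.strip():
            issues.append("missing blast radius")
        if not self.rollback_condition.strip():
            issues.append("missing rollback condition")
        return issues


# Illustrative values only; the thresholds are assumptions.
db_outage = Hypothesis(
    failure_scenario="primary database unreachable for 30 seconds",
    predictions=(
        Prediction("read_p99_latency_ms", "<=", 800.0),
        Prediction("cached_read_success_ratio", ">=", 0.99),
    ),
    blast_radius="one API replica in staging",
    rollback_condition="read p99 latency above 800 ms",
)
vague = Hypothesis("the system should handle database failures", (), "", "")
```

Here `db_outage.problems()` returns an empty list, while `vague.problems()` reports the three missing parts. Rejecting the vague version before review is the point of enforcing the structure.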

The skill also catalogs the failure modes worth testing. The catalog draws from the real incidents the team has seen, the dependencies the system has, and the common failure modes for the service type. The catalog grows over time as new failure modes are discovered, which means the workflow gets more thorough as the team learns more about the system.

The output of the skill is a written experiment plan that goes into a shared document. The plan is reviewable by humans before any experiment runs. Most experiments start as a draft that goes through one or two rounds of review before they are approved for execution.

The Safety Skill

The second skill handles the safety envelope around the experiment. Before any failure is injected, the skill checks a set of conditions that must be true.

The conditions vary by experiment, but the core set is consistent. The system has to be healthy at baseline, which the skill verifies by comparing recent metrics to historical norms. The time has to be appropriate, which the skill checks against the deployment calendar and the on-call schedule. The blast radius has to be bounded, which the skill verifies by checking the configuration of the failure injection. The rollback mechanism has to be ready, which the skill confirms by running a dry test of the rollback.

If any condition fails, the experiment does not run. The skill produces a structured report explaining why the experiment was blocked, and the person running the experiment decides whether to fix the underlying issue or postpone the experiment.
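The all-or-nothing gate described above might look like the following sketch. The check names mirror the four core conditions, but the check bodies are stand-ins: a real version would query a metrics store, a deployment calendar, and so on, none of which are shown here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str


def run_preflight(
    checks: dict[str, Callable[[], tuple[bool, str]]]
) -> list[CheckResult]:
    """Run every check, even after a failure, so the report is complete."""
    results = []
    for name, check in checks.items():
        try:
            passed, detail = check()
        except Exception as exc:  # a crashing check counts as a failed check
            passed, detail = False, f"check raised {exc!r}"
        results.append(CheckResult(name, passed, detail))
    return results


def may_run(results: list[CheckResult]) -> bool:
    """The experiment runs only if at least one check ran and all passed."""
    return bool(results) and all(r.passed for r in results)


# Stand-in checks with hard-coded answers, for illustration only.
report = run_preflight({
    "baseline_healthy": lambda: (True, "error rate within historical norm"),
    "time_appropriate": lambda: (False, "deploy freeze in effect"),
    "blast_radius_bounded": lambda: (True, "injection limited to one host"),
    "rollback_ready": lambda: (True, "dry run of rollback succeeded"),
})
blocked_by = [r.name for r in report if not r.passed]
```

In this run `may_run(report)` is `False` and `blocked_by` is `["time_appropriate"]`. Running every check instead of stopping at the first failure is a deliberate choice: the blocked-experiment report then lists everything that needs fixing in one pass.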

The skill also handles the human approval requirement. For low-risk experiments in lower environments, no approval is needed. For experiments in production or experiments with a large blast radius, the skill requires explicit sign-off from a named approver before it will allow the experiment to start. The approval is logged with the experiment record.

The safety layer is what makes the practice sustainable. Without it, every experiment carries enough perceived risk that people will postpone them indefinitely. With it, the experiments become routine because the safety properties are enforced by code rather than by hope.

If you want to see how this connects to deployment safety more broadly, the workflow I described in Claude Code for Canary Deployments uses a similar approach to keep risk bounded during rollouts. Canary deployments and chaos experiments are the two halves of the same safety practice.

The Injection Skill

The third skill handles the actual failure injection. Given an approved experiment plan, the skill executes the failure in the specified system component for the specified duration.

The injection mechanisms are varied. For network failures, the skill manipulates the routing layer to drop or delay traffic between specific services. For resource exhaustion, the skill consumes a controlled amount of CPU, memory, or disk on a specific host. For dependency failures, the skill substitutes a misbehaving mock for the real dependency on a subset of requests. For data failures, the skill injects malformed or unexpected data into a specific path.

The injection is reversible. Every mechanism has a corresponding stop action that returns the system to its baseline. The skill verifies that the stop action works before the injection starts, and it has a hard timeout that triggers the stop action regardless of any other state.
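One way to get the "stop always runs, and a hard timeout forces it" property is a context manager backed by a timer. This is a sketch under assumptions: `start` and `stop` are hypothetical callables supplied by the caller, and the sketch does not cover verifying the stop action before the injection starts.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def injected_failure(start: Callable[[], None],
                     stop: Callable[[], None],
                     hard_timeout_s: float) -> Iterator[None]:
    """Run `start`, and make sure `stop` is invoked exactly once.

    `stop` fires when the block exits (normally or by exception) or when
    the hard timeout elapses, whichever comes first.
    """
    lock = threading.Lock()
    stopped = False

    def stop_once() -> None:
        nonlocal stopped
        with lock:
            if stopped:
                return
            stopped = True
            stop()

    timer = threading.Timer(hard_timeout_s, stop_once)
    timer.daemon = True
    timer.start()              # arm the timeout before injecting anything
    try:
        start()
        yield
    finally:
        timer.cancel()
        stop_once()            # no-op if the timer already fired


events = []
with injected_failure(lambda: events.append("start"),
                      lambda: events.append("stop"),
                      hard_timeout_s=5.0):
    events.append("observe")
```

After the block, `events` is `["start", "observe", "stop"]`. The timer is armed before `start()` so that a hang inside the injection itself still hits the timeout.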

The injection also captures evidence. While the failure is active, the skill records metrics, logs, and traces from the affected components and from the components downstream of them. The evidence is timestamped and tagged with the experiment identifier, which makes it easy to find later. The evidence is the basis for the analysis in the next step.

The skill is also conservative by default. If the captured metrics start to look meaningfully worse than the hypothesis predicted, the skill triggers the rollback automatically rather than waiting for the experiment duration to complete. The early rollback prevents an experiment from turning into an incident.
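The early-rollback rule can be sketched as a pure function that the injection loop calls on each metrics sample. The metric names, the limits, and the 25% margin that stands in for "meaningfully worse" are assumptions for illustration.

```python
def breached_limits(observed: dict[str, float],
                    upper_limits: dict[str, float],
                    margin: float = 1.25) -> list[str]:
    """Return metrics that exceed their predicted upper limit by `margin`.

    A metric missing from `observed` is treated as a breach: losing
    visibility during an experiment is itself a reason to stop.
    """
    breached = []
    for name, limit in upper_limits.items():
        value = observed.get(name)
        if value is None or value > limit * margin:
            breached.append(name)
    return breached


# Hypothetical limits taken from an experiment's hypothesis.
limits = {"read_p99_latency_ms": 800.0, "read_error_ratio": 0.05}

calm = breached_limits(
    {"read_p99_latency_ms": 640.0, "read_error_ratio": 0.01}, limits)
hot = breached_limits(
    {"read_p99_latency_ms": 1300.0, "read_error_ratio": 0.01}, limits)
blind = breached_limits({"read_p99_latency_ms": 640.0}, limits)
```

Here `calm` is empty, `hot` is `["read_p99_latency_ms"]`, and `blind` is `["read_error_ratio"]`. Any non-empty result would trigger the stop action immediately instead of waiting out the experiment duration.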

The Analysis Skill

When the injection is complete, the analysis skill compares the captured evidence to the original hypothesis. The output is a verdict on whether the hypothesis held.

The verdict has more nuance than pass or fail. For each prediction in the hypothesis, the skill reports whether the actual behavior matched the predicted behavior. The verdict can be that all predictions held, that some held and some did not, or that the actual behavior was meaningfully different from what was predicted on every dimension.
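The three-way verdict can be sketched as a per-prediction comparison followed by a roll-up. To keep the example short, every prediction is assumed to be an upper bound on a numeric metric, and the metric names and values are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    ALL_HELD = "all predictions held"
    PARTIAL = "some predictions held"
    NONE_HELD = "no prediction held"


def judge(predicted_max: dict[str, float],
          observed: dict[str, float]) -> tuple[Verdict, dict[str, bool]]:
    """Compare each predicted upper bound with the observed value."""
    if not predicted_max:
        raise ValueError("a hypothesis needs at least one prediction")
    # A metric with no observation counts as a prediction that did not hold.
    per_prediction = {
        name: name in observed and observed[name] <= limit
        for name, limit in predicted_max.items()
    }
    held = sum(per_prediction.values())
    if held == len(per_prediction):
        verdict = Verdict.ALL_HELD
    elif held == 0:
        verdict = Verdict.NONE_HELD
    else:
        verdict = Verdict.PARTIAL
    return verdict, per_prediction


predicted = {"read_p99_latency_ms": 800.0, "failover_seconds": 30.0}
measured = {"read_p99_latency_ms": 710.0, "failover_seconds": 48.0}
verdict, detail = judge(predicted, measured)
```

With these numbers `verdict` is `Verdict.PARTIAL` and `detail` shows that latency held while failover time did not. Keeping the per-prediction map next to the roll-up means the report can say exactly which expectation was wrong.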

The interesting cases are the ones where some predictions held and some did not. Those cases are the highest signal output of the entire workflow. They surface the gap between how the team thinks the system behaves and how it actually behaves, which is the gap the practice exists to close.

The analysis output is a structured report that goes into the experiment record. The report includes the original hypothesis, the captured evidence, the comparison verdict, and a set of follow-up actions. The follow-up actions are the changes that need to happen to bring the system behavior into line with the hypothesis, or to update the hypothesis to match the actual behavior.

The report also feeds the catalog of known behaviors. Over time, the catalog becomes a high-fidelity description of how the system actually responds to various failures, which is more valuable than any architecture diagram.

How the Workflow Runs in Practice

A typical experiment cycle takes between thirty minutes and a few hours, depending on how complex the failure is and how long the observation window needs to be. The cycle is the same regardless of the experiment type.

Someone proposes a new experiment, usually because of a recent incident, a recent architectural change, or a gap in the catalog. The hypothesis skill produces a structured draft. The draft goes through review, and the people who own the affected systems sign off on the plan and the safety envelope.

When the experiment is scheduled, the safety skill runs the preflight checks. If everything passes, the injection skill starts the failure. The team watches the metrics in real time, ready to intervene if anything looks worse than expected.

The injection runs for its configured duration, or stops early if the safety boundaries are crossed. The analysis skill produces the verdict. The verdict goes into a follow-up queue, and the team triages the actions in the same way they would triage any other engineering work.

Over weeks and months, the catalog of known behaviors grows. Each new experiment either confirms an existing entry, refines an existing entry, or adds a new one. The catalog becomes a living description of the system, and the description is grounded in evidence rather than in assumption.

What This Workflow Did to My Practice

The most visible change is in the kind of bugs that reach customers. The class of bug where a dependency failure cascades into a customer-visible outage has nearly disappeared. The reason is that the experiments surface those failure modes before they happen in real conditions, which gives the team time to fix the cascade before it matters.

The second change is in how the team writes new code. When you know that a chaos experiment will eventually test the failure handling on every dependency, you write the failure handling more carefully the first time around. The cost of careless failure handling becomes visible during the experiment instead of during the incident, which is a much cheaper place to find out.

The third change is in how the team thinks about system documentation. The old documentation described how the system was supposed to work. The new documentation, grounded in the experiment catalog, describes how the system actually works. The two are not always the same, and the differences are where most of the operational learning lives.

The fourth change is in confidence. Before the workflow, every on-call rotation carried a low background anxiety because no one knew how the system would behave under stress. After the workflow, the on-call rotations carry less anxiety because the answer to most of those questions is known.

For the full picture of how I run production systems with Claude Code, the complete series on DEV.to covers every workflow I rely on, from incident response to dependency management to chaos engineering.

FAQ

Do I have to run experiments in production?

Not at first. The early experiments should run in a non-production environment that closely mirrors production. The point of running in production eventually is that some failure modes only appear under real traffic and real data. The path is to start in staging, build confidence in the workflow, and then move the lower-risk experiments to production one at a time.

What if the safety envelope blocks every experiment I want to run?

That is a useful signal. Either the safety envelope is too tight and needs to be relaxed for the kind of experiments you want, or the system is not yet in a state where those experiments would be safe. Both outcomes are worth knowing. Adjust the envelope or fix the underlying issues, but do not bypass the envelope to force an experiment through.

How do I get organizational buy-in for this practice?

Start with the cheapest experiment that produces a surprising result. A surprising result, well documented, is the most persuasive case for the practice. Most teams overestimate how resilient their systems are, and a single concrete demonstration is worth more than any number of theoretical arguments.

What if my system is too small to need this?

The workflow scales down. A small system has fewer failure modes worth testing, but it also has less budget for incidents. The cost of an outage in a small system can easily exceed the cost of an outage in a large system as a fraction of the team's capacity. Run the experiments that match your system's size.

The chaos engineering workflow is one of the few practices I know that has paid back the time invested in it multiple times over. The serious incidents that would have hit production reach me first as experiment verdicts, in the form of a written analysis I can read with coffee instead of a phone call I have to answer at 2 a.m. The trade is one of the best ones I have made in my career as an engineer, and the workflow above is the version of it I would rebuild from scratch if I started over today.

Claude Code for Canary Deployments: How I Ship to 1% of Users Before Breaking Everything

Nex Tools — Tue, 26 May 2026 01:18:41 +0000

I used to ship by faith. The change passed code review, the tests went green, the deploy button was right there, and I pressed it. Most of the time it was fine. The handful of times it was not fine cost me weekends, customer trust, and a real amount of money. The worst incident I can remember was a single line change that took down checkout for forty minutes during a marketing push. The change had passed every test we had. The bug only showed up under real traffic patterns.

After that incident, I built a canary deployment workflow. Every risky change now ships to one percent of traffic first, sits there for a defined observation window, and gets promoted to the full population only when the metrics from the canary cohort look identical to the metrics from the control cohort. It works. The serious incidents I used to ship have been replaced by canary failures that get caught and rolled back before they reach the majority of users.

The hard part of canary deployment is not the routing layer. The routing layer is a solved problem. The hard part is everything around the routing layer: choosing the right metrics to watch, deciding what counts as a regression, building the decision logic that promotes or rolls back, and connecting it all to the deployment pipeline. That hard part is where Claude Code reshaped how I work. Here is the workflow.

Why Canary Deployments Are Underused

Most teams I have worked with talk about canary deployments more than they actually do them. The reason is almost always the same. Setting up the infrastructure is more work than people initially expect, and the work is spread across several systems that each have their own conventions.

You need a routing layer that can split traffic by percentage and by user cohort. You need a metrics pipeline that can compare the canary cohort to the control cohort on the dimensions that matter. You need a decision policy that knows when to promote, when to hold, and when to roll back. You need a control plane that ties it all together and gives humans visibility. And you need all of it to be reliable enough that people trust it.

Most teams end up with two or three of these pieces but not the full set. The result is a canary system that exists in name only. Deployments still go to everyone at once, with a vague intention to "watch the dashboards for a few minutes" that no one ever has time to follow through on.

The gap between a real canary system and the vague intention of one is the gap between "we caught it before it shipped" and "we caught it because customers complained." From a distance the two look similar. Up close, they are completely different worlds.

Once you have a real canary system, you also discover that you start writing different kinds of changes. Changes that would have been considered too risky become routine because you have a safety net for them. The cost of every individual change goes up slightly because you have to wait for the canary window, but the cost of failed changes drops to nearly zero. The expected value calculation flips, and the team ships more aggressively.

The workflow I describe below is the workflow that closed the gap for me. The Claude Code skills do the work that humans were not doing because the work was tedious and the payoff was abstract.

The Cohort Skill

The first skill in the workflow handles cohort assignment. Given a user identifier, the skill returns whether the user belongs to the canary cohort or the control cohort for a particular deployment.

The assignment is stable. The same user identifier always returns the same answer for the same deployment. The stability matters because it means a user who hits the canary on their first request continues to hit the canary on subsequent requests within the same session. Without stability, half a user's requests would go to the canary and half to the control, which would distort the metrics and could also create user-visible inconsistencies.

The assignment is also fast. The skill produces a deterministic hash of the user identifier and the deployment identifier, takes the result modulo 100, and compares it to the canary percentage. The computation takes single-digit microseconds, so it can run in the request hot path without measurably affecting latency.
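The hash-and-modulo assignment can be sketched in a few lines. SHA-256 and the `deployment:user` key format are assumptions here, since the text only calls for a deterministic hash. Python's built-in `hash()` would not work for strings because it is salted per process.

```python
import hashlib


def in_canary(user_id: str, deployment_id: str, canary_percent: int) -> bool:
    """Deterministically place a user in the canary or control cohort."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Mixing in the deployment ID gives each deployment its own shuffle,
    # so the same users are not always the first to see new code.
    key = f"{deployment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # 0..99
    return bucket < canary_percent


first = in_canary("user-42", "deploy-123", 5)
again = in_canary("user-42", "deploy-123", 5)   # same answer as `first`
```

Because the bucket is compared with `<`, a user who is in the canary at 5% stays in it at 25% and 50%. Nobody flips back to the old version as the percentage grows.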

The skill also handles cohort segmentation. For some deployments, the canary should be limited to specific user populations. The skill accepts a population filter and respects it. The most useful filter I have is internal users only, which lets me canary internal-facing changes to employees before they reach customers.

If you want to see how this cohort approach connects to a broader feature flag system, the workflow I described in Claude Code for Feature Flags is the layer that sits one level up from canary assignment. Canaries are a specialized use of feature flags where the cohort is randomized by user identifier rather than chosen explicitly.

The Metrics Skill

The second skill handles metrics comparison. Given a deployment, a canary cohort, a control cohort, and a time window, the skill produces a comparison of every tracked metric between the two cohorts.

The metrics are dimensional. The skill does not just compare the average error rate across the cohorts. It compares the latency distribution at p50, p90, p99, and p99.9, because a regression in the tail can hide behind an unchanged average. It also compares the error rate, the throughput, the success rate, the cache hit rate, and any custom metric the deployment opts into.

The comparison is statistical. The skill knows the difference between a real change and noise. A two percent jump in error rate on a small sample is probably noise. A two percent jump on a large sample is probably real. The skill reports both the point estimate and the confidence interval, and it flags differences that are unlikely to be noise.
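The sample-size point can be made concrete with a two-proportion z-test. This is one common way to separate signal from noise, chosen here for illustration and not necessarily what the skill uses. The normal approximation behind it is weak when error counts are very small.

```python
import math


def compare_error_rates(canary_errors: int, canary_total: int,
                        control_errors: int, control_total: int,
                        z_crit: float = 1.96) -> dict:
    """Two-proportion z-test plus a Wald interval for the difference."""
    if canary_total <= 0 or control_total <= 0:
        raise ValueError("both cohorts need at least one request")
    p_canary = canary_errors / canary_total
    p_control = control_errors / control_total
    diff = p_canary - p_control

    # Interval: unpooled standard error of the difference.
    se = math.sqrt(p_canary * (1 - p_canary) / canary_total
                   + p_control * (1 - p_control) / control_total)
    interval = (diff - z_crit * se, diff + z_crit * se)

    # Test: pooled standard error under the null of equal error rates.
    pooled = (canary_errors + control_errors) / (canary_total + control_total)
    se_null = math.sqrt(pooled * (1 - pooled)
                        * (1 / canary_total + 1 / control_total))
    z = diff / se_null if se_null > 0 else 0.0
    return {"diff": diff, "interval": interval, "z": z,
            "significant": abs(z) > z_crit}


# The same two-point gap (6% vs 4%) on a small and a large sample.
small = compare_error_rates(3, 50, 2, 50)
large = compare_error_rates(3000, 50_000, 2000, 50_000)
```

Both calls see the same point estimate, but only `large["significant"]` is true. On 50 requests per cohort the interval comfortably includes zero.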

The output is a structured comparison report. Each metric has a row showing the canary value, the control value, the absolute difference, the relative difference, and the statistical significance. Rows where the canary is meaningfully worse than the control are at the top. Rows where the canary is meaningfully better are also surfaced, because improvements are interesting too.

The Decision Skill

The third skill turns the comparison report into a deployment decision. Given a comparison and a deployment policy, the skill produces one of three outcomes: promote, hold, or roll back.

The policy is the interesting part. The policy specifies which metrics matter and what regressions are tolerable. For a payment service, the policy might say that any increase in checkout error rate is a roll back, but small latency regressions are tolerable. For a search service, the policy might say that small error rate increases are tolerable but latency regressions over 50 ms are a roll back.

The policy also specifies the observation window. Some changes need a short canary because the signals appear quickly. Others need a long canary because the relevant signals only appear during certain traffic patterns. The skill respects the configured window and does not declare a promote verdict until the window has elapsed.
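A policy of this shape can be sketched as a list of per-metric tolerances plus a window. The rule type, the metric names, and the 0.005 error-rate tolerance are hypothetical. The sketch also assumes that a regression triggers a roll back immediately while a promote has to wait for the full window, and that the deltas passed in have already been filtered for statistical significance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str
    max_regression: float   # largest tolerated (canary - control)


def decide(rules: list[Rule], deltas: dict[str, float],
           elapsed_min: float, window_min: float) -> str:
    """Return 'roll back', 'hold', or 'promote' for one comparison."""
    for rule in rules:
        delta = deltas.get(rule.metric)
        if delta is not None and delta > rule.max_regression:
            return "roll back"
    # Never promote on incomplete data or before the window has elapsed.
    missing = any(rule.metric not in deltas for rule in rules)
    if missing or elapsed_min < window_min:
        return "hold"
    return "promote"


search_policy = [Rule("p99_latency_ms", 50.0), Rule("error_rate", 0.005)]
payment_policy = [Rule("checkout_error_rate", 0.0)]

healthy = {"p99_latency_ms": 12.0, "error_rate": 0.001}
early = decide(search_policy, healthy, elapsed_min=10, window_min=30)
done = decide(search_policy, healthy, elapsed_min=30, window_min=30)
slow = decide(search_policy, {"p99_latency_ms": 75.0, "error_rate": 0.0},
              elapsed_min=5, window_min=30)
pay = decide(payment_policy, {"checkout_error_rate": 0.0001},
             elapsed_min=5, window_min=30)
```

The four results are `hold`, `promote`, `roll back`, and `roll back`. Setting `max_regression` to zero is how the payment policy expresses "any increase in checkout error rate is a roll back".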

The decision is auditable. The skill produces a structured record of the decision, the metrics that drove it, the policy that was applied, and the timestamp of the verdict. The record goes to a deployment log. If a decision is later questioned, the record is the evidence for what was known at the time.

The decision is also overridable. A human with appropriate permissions can override a decision in either direction. An override is logged and requires a reason. In practice, the overrides are rare. Most of the time, the skill's decision is the right one, and the policy is what would need to change if it is not.

The Promotion Skill

When the decision is promote, the promotion skill handles the rollout. The promotion is not a single step from 1% to 100%. It is a series of steps with observation windows between them.

A typical promotion ladder goes 1%, 5%, 25%, 50%, 100%. Each step has its own observation window and its own comparison. The skill executes each step, runs the comparison, applies the policy, and decides whether to proceed to the next step or hold or roll back. The ladder gives multiple chances to catch a regression that did not show up at lower traffic levels.
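The ladder can be sketched as a loop over traffic percentages. `set_traffic` and `evaluate` are hypothetical callables: the first stands in for the routing layer, the second for "wait out the observation window, run the comparison, apply the policy".

```python
from typing import Callable, Iterable

LADDER = (1, 5, 25, 50, 100)


def run_ladder(set_traffic: Callable[[int], None],
               evaluate: Callable[[int], str],
               ladder: Iterable[int] = LADDER) -> tuple[str, int]:
    """Walk the ladder; return (final outcome, highest percentage reached)."""
    reached = 0
    for percent in ladder:
        set_traffic(percent)
        reached = percent
        outcome = evaluate(percent)   # blocks for the step's window
        if outcome == "roll back":
            set_traffic(0)            # send everyone back to the old version
            return "rolled back", reached
        if outcome == "hold":
            return "held", reached    # stay at this step for a human to review
    return "promoted", reached


# Simulated run: the regression only becomes visible at 25% traffic.
history: list[int] = []
result = run_ladder(history.append,
                    lambda pct: "roll back" if pct >= 25 else "promote")
```

In the simulated run `history` ends as `[1, 5, 25, 0]` and `result` is `("rolled back", 25)`. The regression is caught while three quarters of users are still on the old version.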

The promotion also handles communication. Each step posts a status update to the deployment channel. The update includes the current traffic percentage, the metrics from the most recent comparison, and the time until the next step. Humans can follow along without having to query the system.

The full promotion typically takes one to three hours. The duration sounds long compared to a traditional deployment that ships in minutes, but the duration is the price of safety. The bugs that get caught at the 5% step would otherwise be in front of every customer by the time anyone noticed.

How the Workflow Runs in Practice

The deployment pipeline integrates with the canary workflow at the deploy step. Instead of pushing the new version to the full fleet, the pipeline pushes it to a canary subset and registers the deployment with the cohort skill.

The metrics skill starts collecting comparisons immediately. The first comparison usually runs after fifteen minutes of canary traffic. The skill emits a structured report that the decision skill consumes.

If the decision is hold, the comparison continues. The metrics skill produces a new comparison every fifteen minutes, and the decision skill re-evaluates each time. The hold continues until either the observation window expires with a promote decision or a regression appears and triggers a roll back.

If the decision is promote, the promotion skill takes over. It steps through the promotion ladder, running comparisons at each step, until the deployment reaches 100% traffic. At that point, the canary is done and the change is live for everyone.

If the decision is roll back, the routing layer reverts the canary cohort to the previous version. The metrics that triggered the roll back are attached to the deployment record. The author of the change gets a notification with the comparison data, which is usually enough information to identify the bug.

What This Workflow Did to My Practice

The most visible change is in incident frequency. The category of incident I used to see most often, where a code change went to 100% of traffic and broke something, has nearly disappeared. The category that replaced it is canary roll backs, which catch the same class of bugs without the customer impact.

The second change is in deployment speed for safe changes. Because the workflow is automated, deployments that would have required careful human attention now run in the background. I can deploy a low-risk change at any time and the workflow handles the promotion without me having to be present. The result is that risky changes get more attention and safe changes get less, which is the right allocation.

The third change is cultural. The team writes different code now. The kinds of changes that would have been postponed or batched are now shipped continuously, because the cost of a small risky change is much lower than it used to be. The cycle time on individual changes has dropped, even though each individual deployment takes longer than it used to.

If you want to see how this connects to the broader picture, Claude Code for Incident Response covers what happens when something does break despite the canary. The two workflows together form most of the safety net I rely on in production.

For the rest of my practical workflows around shipping software with Claude Code, the full series is on DEV.to.

FAQ

Does this require a service mesh?

No. The cohort skill can run anywhere a routing decision is made. A service mesh makes it easier, but a load balancer, an API gateway, or even an application-level router works.

What if my service has too little traffic for statistical significance at 1%?

Increase the initial canary percentage. The workflow does not require 1%. It requires that the canary cohort is small enough that a regression does not affect most users and large enough to produce a meaningful statistical signal. The right percentage depends on your traffic volume.

What about changes that affect every request the same way?

For uniform changes, the per-cohort comparison still works because the metrics are computed independently for each cohort. The skill will detect differences even when the change affects every request, as long as the change produces a measurable signal.

How do I write the policy?

Start conservative. List the metrics that matter most for your service. For each one, choose a regression threshold that is large enough to be unambiguous. Tighten the policy over time as you learn what false positives look like.

The canary deployment workflow is not glamorous. It does not produce the kind of architectural diagrams that get applauded at conferences. What it does is take an entire category of operational pain and make it disappear quietly. The change to the team's day-to-day experience is huge, even though the surface change to the system is small. That ratio of impact to visible complexity is exactly what I look for when I decide where to invest engineering time, and it is why I would build this workflow first if I were starting a new production service today.

Claude Code for Error Budgets: How I Stopped Arguing About Reliability and Started Measuring It

Nex Tools — Mon, 18 May 2026 11:07:35 +0000

For the first three years of running production systems I had the same fight with the same people about the same thing. The product team wanted to ship faster. The on-call team wanted to ship slower. Both sides had data. Neither side could prove the other was wrong. The arguments would end in compromise that nobody felt good about, and the next incident would restart the cycle.

The fix was not better arguments. The fix was an error budget. Once we had a budget, the question stopped being "should we ship this risky change" and started being "do we have budget left to spend." That is a much smaller question with a much clearer answer, and it changes the entire conversation between the people who build features and the people who keep them running.

Setting up an error budget program sounded simple in the SRE book. In practice it took me eighteen months and three failed attempts before I had something that actually worked. The thing that finally made it work was using Claude Code to handle the parts of the program that humans were too inconsistent to handle on their own. Here is the workflow I built and what it taught me about reliability as an engineering discipline rather than a debate topic.

Why Error Budgets Are Hard to Run in Practice

The theory of error budgets is straightforward. You pick a service level objective, usually expressed as a percentage of successful requests over some window. The difference between 100% success and your objective is your budget. When you have budget left, you can ship risky things. When you have burned through your budget, you slow down and focus on reliability work until the budget recovers.
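The arithmetic behind the budget is short enough to write down. The function name and the example numbers below are illustrative.

```python
def error_budget(slo: float, total_requests: int,
                 failed_requests: int) -> dict:
    """Budget allowed by an SLO and how much of it has been spent."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    allowed = (1.0 - slo) * total_requests   # failures the SLO tolerates
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "fraction_spent": (failed_requests / allowed
                           if allowed else float("inf")),
    }


# A 99.9% objective over 10 million requests tolerates about 10,000 failures.
status = error_budget(slo=0.999, total_requests=10_000_000,
                      failed_requests=4_000)
```

With 4,000 failures recorded, roughly 40% of the budget is spent and about 6,000 failures remain. The same arithmetic works in time units: a 99.9% objective over a 30-day window allows 0.001 × 43,200 minutes, or about 43 minutes of full outage.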

The theory is clean. The practice is messy in ways the theory does not warn you about.

The first mess is measurement. Picking the right success criteria turns out to be much harder than it sounds. A request that returned 200 but took 12 seconds is not really a success. A request that returned 500 because the user sent malformed input is not really a failure. A request that succeeded for the user but failed in a way that corrupted internal state is the worst possible outcome and looks like a success in your metrics. Every team that runs an error budget hits these edge cases, and most teams either ignore them or argue about them forever.

The second mess is enforcement. Error budgets only work if the organization actually changes behavior when the budget is exhausted. In practice, the moment the budget runs out is also the moment somebody important wants to ship something important, and the budget gets waived. After this happens three or four times, the budget becomes a number on a dashboard that everyone ignores. The credibility cost of ignoring the budget once is much higher than people realize.

The third mess is cadence. A budget that resets monthly behaves very differently from a budget that resets quarterly, and both behave very differently from a sliding window. Each cadence has different failure modes. The wrong cadence for your traffic patterns can make the budget either too punishing or too lax, and neither extreme produces the cultural changes you wanted.

An error budget is not just a number. It is a contract between the people who build features and the people who run them, and like any contract, the value comes from how rigorously it is enforced rather than how cleverly it is written.

The teams I have seen succeed with error budgets are not the teams with the most sophisticated objectives. They are the teams that wired the budget into their actual release process so that the contract enforced itself. The teams that failed are the ones that left enforcement up to human judgment, because human judgment under pressure always finds a reason to ship.

The Objective Skill

The first skill in my error budget workflow handles objective definition. Given a service and its traffic patterns, the skill proposes a service level objective and the supporting indicators that feed into it.

The skill does not just pick a percentage. It looks at historical traffic, current failure rates, and customer impact patterns to recommend an objective that is achievable but meaningful. An objective of 99.99% sounds impressive but is usually meaningless for a service whose current state is 99.5%. An objective of 99% is useless for a service that already runs at 99.95%. The right objective is one that requires real work to maintain but does not require fantasy.

The skill also defines the success criteria precisely. It does not just say "successful requests." It specifies what success means for this particular service, including the edge cases. A successful request might require a 2xx status code, a response time under a specific threshold, and the absence of any internal error logs correlated with that request ID. The precision is important because vagueness is where the arguments start.
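A precise success definition of that kind might look like the sketch below. The Request fields and the one-second default threshold are assumptions for illustration; the real values would come from the objective document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int           # HTTP status code returned to the caller
    latency_ms: float     # end-to-end response time
    internal_errors: int  # error log lines correlated with this request ID


def is_success(req: Request, max_latency_ms: float = 1000.0) -> bool:
    """A request counts as a success only if all three criteria hold."""
    return (
        200 <= req.status < 300
        and req.latency_ms < max_latency_ms
        and req.internal_errors == 0
    )
```

Under this definition the 12-second 200 response counts against the budget, and so does a fast 200 that left error logs behind.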

The output is a structured objective document that can be reviewed, debated, and eventually signed off by both the engineering team and the product team. The document is the foundation of the budget program. Without precise definitions, every subsequent decision becomes a fight about what the words mean.

The Burn Rate Skill

The second skill tracks burn rate in real time. The burn rate is the speed at which the budget is being consumed, measured against the steady pace that would use up exactly the whole budget by the end of the window.

A budget that burns at exactly the expected rate is not interesting. A budget that burns at three times the expected rate is a warning. A budget that burns at ten times the expected rate is an active fire. The skill watches the burn rate continuously and surfaces deviations from expected behavior.

The interesting design choice is the smoothing. A naive burn rate calculation produces wild swings every time a single bad request comes in. A heavily smoothed calculation hides real problems for hours. The skill uses multiple time horizons in parallel. A one-hour view catches active fires. A six-hour view catches sustained degradations. A 24-hour view catches gradual drift that nobody would notice in shorter windows.
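The parallel horizons can be sketched as follows. This assumes the common definition of burn rate as the observed error rate divided by the error rate the objective allows, and it assumes events arrive as (timestamp, succeeded) pairs; neither detail is confirmed by the workflow itself.

```python
from datetime import datetime, timedelta

# The three horizons named in the text.
WINDOWS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "24h": timedelta(hours=24),
}


def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error rate divided by the allowed error rate (1.0 = on pace)."""
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo)


def burn_rates(events, now: datetime, slo: float) -> dict:
    """events: list of (timestamp, succeeded) pairs. Returns a rate per window."""
    rates = {}
    for name, span in WINDOWS.items():
        in_window = [ok for ts, ok in events if now - span <= ts <= now]
        failed = sum(1 for ok in in_window if not ok)
        rates[name] = burn_rate(failed, len(in_window), slo)
    return rates
```

A short spike shows up as a high one-hour rate while the longer windows stay calm, which is the distinction between an active fire and gradual drift.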

When the burn rate crosses a threshold, the skill alerts. The alert is not just a number. It includes the context needed to act. Which endpoint is contributing most to the burn. Which deploy correlated with the change in burn rate. Whether the burn is concentrated in a single tenant or distributed across the user base. The context turns the alert from a notification into a starting point for investigation.

The investigation often connects directly to a log analysis pass. If you have set up the workflow I described in Claude Code for Log Analysis, the burn rate alert can hand off straight into pattern detection, which compresses the time from "budget is burning" to "we know why" by a meaningful margin.

The Policy Skill

The third skill enforces the budget policy in the release pipeline. The skill sits between the deploy command and the actual deploy and checks whether the current budget state permits the release.

The policy is configurable per service. A typical configuration might say that any deploy is permitted while the budget is above 50%, that only low-risk deploys are permitted between 25% and 50%, and that no non-critical deploys are permitted below 25%. The thresholds and the risk classifications are defined in the objective document so that the policy is mechanical rather than negotiable.
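The typical configuration above reduces to a small gate. This is a sketch under stated assumptions: the Risk classes are hypothetical, and letting critical deploys through the middle band is my reading of the policy rather than something the text spells out.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    NORMAL = "normal"
    CRITICAL = "critical"


def deploy_permitted(budget_remaining: float, risk: Risk) -> tuple[bool, str]:
    """Return (allowed, reason) for a deploy given the fraction of budget left."""
    if budget_remaining > 0.50:
        return True, "budget above 50%: any deploy permitted"
    if budget_remaining >= 0.25:
        # Assumption: critical fixes are never blocked by the middle band.
        allowed = risk in (Risk.LOW, Risk.CRITICAL)
        return allowed, "budget between 25% and 50%: low-risk deploys only"
    return risk is Risk.CRITICAL, "budget below 25%: critical deploys only"
```

Returning the reason alongside the decision gives the audit trail something to record for both permitted and blocked deploys.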

The mechanical enforcement is the entire point. When the budget gets low and the policy blocks a deploy, the response is not a debate about whether to override the policy. The response is a question about whether to spend the remaining budget on this specific change. If the answer is yes, the change ships and the budget burns further. If the answer is no, the change waits. Either way, the budget stays meaningful.

The skill also produces an audit trail. Every deploy that was permitted under the policy is logged with the budget state at the time. Every deploy that was blocked is logged with the reason. The audit trail makes it possible to look back at a quarter and see exactly how the budget was spent and whether the spending decisions were the right ones in retrospect.

The Postmortem Skill

The fourth skill connects error budget consumption to postmortem actions. After every significant budget burn, the skill produces a draft postmortem that documents what happened, how much budget was consumed, and what changes would prevent the burn from recurring.

The draft is not a finished postmortem. It is a structured starting point. The skill fills in the data sections automatically, including the timeline, the affected metrics, and the related deploys. The human writes the analysis sections, which are the parts that actually require judgment. The split between mechanical sections and judgment sections cuts the time to produce a postmortem roughly in half without reducing its quality.

The postmortem also includes a budget impact summary. The summary expresses the incident in terms of how much of the quarterly budget it consumed and how that affects the remaining release capacity for the quarter. The budget framing makes the postmortem read differently. Instead of saying "this incident lasted 47 minutes," the postmortem says "this incident consumed 18% of the quarterly budget." The second framing leads to different priorities about prevention.
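One way to turn a duration into a budget percentage is sketched below. It assumes a time-based availability budget and a constant failure fraction during the incident. The inputs in the usage note are hypothetical and are not the numbers behind the example in the text.

```python
def budget_consumed(
    slo: float,
    window_minutes: float,
    incident_minutes: float,
    failure_fraction: float = 1.0,
) -> float:
    """Fraction of the window's budget used up by a single incident."""
    budget_minutes = (1.0 - slo) * window_minutes
    return incident_minutes * failure_fraction / budget_minutes
```

For instance, with a 99.9% objective over a 90-day quarter the budget is about 129.6 minutes of full outage, so a 47-minute incident in which half of the requests failed would consume roughly 18% of it.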

For incident response context that pairs naturally with this postmortem workflow, the system I described in Claude Code for Incident Response handles the live response side and feeds directly into the postmortem skill once the incident is closed.

How the Workflow Runs in Practice

The workflow runs continuously rather than on demand. The burn rate skill is always watching. The policy skill is always sitting in front of the deploy pipeline. The postmortem skill triggers automatically when a burn crosses a threshold.

When the burn rate alerts, my first move is to check the context the alert provides. The endpoint, the correlated deploy, the affected user segment. Most of the time the context points directly at the cause. If the alert correlates with a recent deploy, the deploy is probably the cause and the response is a rollback. If the alert correlates with a tenant spike, the cause is probably load and the response is a scaling decision.

When the policy skill blocks a deploy, the response is a conversation rather than a fight. The conversation is about whether the change is important enough to spend remaining budget on, given that the budget cannot be replenished mid-quarter. Sometimes the answer is yes and the team takes ownership of the increased risk. Sometimes the answer is no and the change moves to the next quarter. Either answer is fine because both are deliberate.

When the postmortem skill produces a draft, my first move is to fill in the human analysis sections. The data is already there. The narrative, the root cause, the action items are the parts that need judgment. The structured starting point makes it easier to focus on the parts that matter rather than getting bogged down in data assembly.

The objective skill runs once per service when the service is onboarded and then again at quarterly reviews. The review cycle keeps the objectives calibrated to actual traffic and actual customer expectations. A service whose traffic has grown 10x in a year usually needs a tighter objective than the one it was launched with.

What This Workflow Did to My Practice

The most visible change is that the arguments stopped. The product team and the engineering team no longer fight about whether to ship a particular change. They look at the budget, they look at the change, and they make a decision. The decision is not always the one I would have preferred, but the decision-making process is much faster and much less politically expensive than the old arguments were.

The second change is that incidents feel different. An incident that would previously have produced a vague sense of "things were bad for a while" now produces a precise statement about how much budget was consumed and what that means for the rest of the quarter. The precision makes prevention work easier to justify, because the cost of the incident is no longer abstract.

The third change is in how the team thinks about reliability investments. Before the budget program, reliability work was something to argue for during planning. After the budget program, reliability work happens whenever the budget gets tight, because the alternative is freezing deploys. The forcing function is mechanical rather than political, which means the work actually happens.

The fourth change, which I did not expect, is in how features get scoped. The product team has started asking about expected error budget impact early in the design process. A feature that requires a risky new dependency now gets weighed against the budget cost of integrating it, not just the engineering cost of building it. The conversation about scope is informed by reliability data instead of opinions.

For the broader set of workflows that connect to this one, my Claude Code Practical Workflows series on DEV.to covers everything from observability through incident response, refactoring, migrations, and security. The error budget workflow ties many of them together because the budget is the unifying measurement that tells you whether the rest of the practice is working.

FAQ

What if the team does not want to commit to a service level objective?

That resistance is usually about fear of being held to an unrealistic number rather than disagreement with the concept. The objective skill helps because it grounds the objective in actual traffic and current failure rates, which makes the number defensible. Once the team sees that the proposed objective is achievable, the resistance usually fades.

How do I handle services with very low traffic?

Low-traffic services have noisy budget calculations because a single bad request consumes a much larger percentage of the budget. The skill handles this by using longer windows for low-traffic services and by combining related services into a single budget where appropriate. A budget calculated over 1,000 requests per quarter is meaningful in a way that a budget calculated over 50 is not.

What happens when the budget is exhausted halfway through the quarter?

The policy skill blocks non-critical deploys for the rest of the quarter. The team uses the time to do the reliability work that the burn rate revealed. This is the intended behavior. If exhausting the budget produces no behavioral change, the budget program is not working and the program needs to be reconsidered, not the budget.

Can I run multiple budgets per service?

Yes. A single service might have a budget for availability and a separate budget for latency. The skills handle multiple budgets per service and produce aggregated views for cases where the budgets need to be reasoned about together. Most teams start with a single availability budget and add additional budgets only when the first one is operating well.

How do I get product team buy-in?

The biggest unlock is reframing the budget as a permission slip rather than a restriction. The budget is what gives the product team the right to ship risky changes. Without the budget, every risky change has to be argued individually. With the budget, the team can ship anything that fits within the available capacity. Most product teams respond well to this framing once they see that the budget enables faster shipping when reliability is healthy.

The error budget workflow is the piece of my SRE practice that I would recommend to any team that has ongoing tension between product and engineering about reliability. The tension is real and it does not resolve itself through arguments. It resolves through a contract that enforces itself mechanically, and the workflow I described is how I made that contract operational. The investment is significant. The payoff is that reliability stops being a source of conflict and becomes a source of shared planning, which is the version of the relationship that healthy teams have.

Claude Code for Log Analysis: How I Stopped Drowning in Stack Traces

Nex Tools — Mon, 18 May 2026 10:54:44 +0000

The first time a production incident hit at 2 AM, I spent two hours scrolling through logs before I found the line that mattered. It was a single timestamp buried inside 800,000 entries from the same hour. The bug had been throwing the same exception in a hot loop, drowning out the rare error that actually caused the outage. By the time I found it, the customer impact window had stretched well past anything we wanted to report in the postmortem.

That night taught me something. Log analysis is not a search problem. It is a triage problem. The interesting signal is almost never the most frequent line. It is the rare line that happens once or twice and then never again. Humans are terrible at finding rare signals in high-volume noise, especially at 2 AM. Grep is even worse, because grep finds matches but does not rank them.

This is where Claude Code rewired how I do log analysis. The workflow I built turns a wall of unstructured text into a ranked list of things worth investigating, and it does it in seconds rather than hours. I run it every time something goes wrong in production, and it has compressed my median time to root cause from somewhere around an hour to somewhere around five minutes. Here is how the workflow works.

Why Traditional Log Analysis Falls Apart at Scale

The standard tools for log analysis were built for a world where logs were small. Grep, awk, and tail work fine when you have a few thousand lines and a clear idea of what you are looking for. They fall apart at modern volumes for two reasons.

The first reason is that you do not know what you are looking for. You know that something went wrong, but the error message that caught your attention might not be the actual cause. It might be a downstream symptom. The cause is buried somewhere earlier in the timeline, inside a log line that looked unremarkable when it was written.

The second reason is that the signal-to-noise ratio is brutal. A production service running at moderate load produces tens of thousands of log lines per minute. The vast majority of those lines are routine. The interesting ones are needles in a haystack of needles. Even when you find a candidate, you cannot easily tell whether it is rare or common without running another query.

The bottleneck in log analysis is never the speed of the search. It is the speed of pattern recognition across high-volume text. The tools we have are good at search and bad at pattern recognition, which is exactly the wrong tradeoff for the job.

The teams I have worked with that have invested heavily in observability platforms still have this problem. The platforms make ingestion easier but they do not solve the pattern recognition problem. They let you slice the data faster but they still require you to know what slice to ask for. When the incident is novel, you do not know.

If you want context on why I treat observability as a first-class engineering concern, the workflow I described in Claude Code for Observability Stacks lays out the broader system this log analysis workflow plugs into.

The Frequency Skill

The first skill in the workflow handles frequency analysis. Given a log file and a time window, the skill produces a ranked list of every distinct log pattern and how often it occurred.

The interesting part is the normalization. The skill recognizes that two log lines with different timestamps, different request IDs, and different user IDs but the same underlying message template are the same pattern. It strips out the variable parts and groups the lines by template. The output is a list of templates, each with a count, an example line, and a sample of the variable values.

The ranking flips the normal log ordering on its head. The most common patterns are at the bottom, not the top. The rare patterns, the ones that occurred once or twice in the window, float to the top. The list becomes a tour of every unusual thing that happened during the incident window, ranked by how rare it was.
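The normalization and the rare-first ranking can be approximated with a few regular expressions. This is a rough sketch, not the skill's actual logic: the three substitution rules below are assumptions and would need extending for real log formats.

```python
import re
from collections import Counter

# Variable parts to strip, applied in order. Illustrative rules only.
_VARIABLE_PARTS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<ts>"),
    (re.compile(r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def to_template(line: str) -> str:
    """Replace timestamps, UUIDs, and bare numbers with placeholders."""
    for pattern, placeholder in _VARIABLE_PARTS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def rank_rare_first(lines):
    """Return (template, count, example_line) tuples, rarest template first."""
    counts = Counter()
    examples = {}
    for line in lines:
        template = to_template(line)
        counts[template] += 1
        examples.setdefault(template, line)  # keep the first line as the example
    return sorted(
        ((t, n, examples[t]) for t, n in counts.items()),
        key=lambda row: (row[1], row[0]),
    )
```

Sorting by ascending count is what puts a one-off line, such as a connection pool warning, ahead of thousands of routine health checks.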

The first time I ran this on a real incident, the cause jumped out from position three on the list. It was a single log line from a connection pool exhaustion event, twelve seconds before the cascade of customer-facing errors started. The line had been there the whole time, but it was invisible inside the 800,000 line haystack.

The Correlation Skill

Once the frequency skill has surfaced rare patterns, the correlation skill ties them together. The skill takes a set of candidate patterns and looks for temporal relationships between them.

The relationships it finds are useful for debugging. If pattern A always appears within five seconds of pattern B, that is worth knowing. If pattern A spikes shortly before pattern C starts firing, that is worth knowing. If pattern D appears only when pattern E has not appeared for several minutes, that is also worth knowing.

The skill also looks at cross-service correlations. When the log stream includes multiple services, the skill can ask whether a pattern in service X correlates with a pattern in service Y. The cross-service view often reveals causes that are invisible inside any single service.

The output is a small graph of temporal relationships. Each edge is annotated with the lag time and the strength of the correlation. The graph is far smaller than the original log volume and is much easier to reason about.
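A simple heuristic for building such a graph is sketched here. It treats "A is followed by B within a few seconds" as an edge and uses the fraction of A occurrences that are followed by B as the strength. Both the heuristic and the parameter names are assumptions; the skill may measure correlation quite differently.

```python
from bisect import bisect_right
from statistics import median


def temporal_edges(occurrences, max_lag=5.0, min_strength=0.8):
    """occurrences maps pattern name -> list of timestamps in seconds.

    Returns (a, b, median_lag, strength) for every ordered pair where at
    least min_strength of a's occurrences are followed by b within max_lag.
    """
    edges = []
    for a, a_times in occurrences.items():
        for b, b_times in occurrences.items():
            if a == b or not a_times:
                continue
            b_sorted = sorted(b_times)
            lags = []
            for t in a_times:
                i = bisect_right(b_sorted, t)  # first b strictly after t
                if i < len(b_sorted) and b_sorted[i] - t <= max_lag:
                    lags.append(b_sorted[i] - t)
            strength = len(lags) / len(a_times)
            if lags and strength >= min_strength:
                edges.append((a, b, median(lags), strength))
    return edges
```

The edges are directional, so a pattern that reliably precedes another produces one edge and not its reverse, which is what lets the graph suggest an order of events.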

The Hypothesis Skill

The third skill builds hypotheses about what went wrong. Given the frequency analysis and the correlation graph, the skill produces a ranked list of possible root causes.

Each hypothesis comes with the evidence that supports it. The evidence is specific log lines, specific timestamps, specific correlation strengths. The hypothesis is not a guess. It is a falsifiable claim grounded in the actual log data, which means I can validate or reject it quickly.

The ranking is based on how well the evidence supports the hypothesis. A hypothesis that is consistent with every observed pattern is ranked higher than one that only explains part of the data. A hypothesis that contradicts a known pattern is ranked lower.

I treat the top hypothesis as the starting point of my investigation rather than the answer. Sometimes the top hypothesis is correct and I move on. Sometimes it is wrong but the evidence reveals a different cause that the skill missed. Either way, the starting point is much better than scrolling logs from the beginning.

If you want to see how this hypothesis-driven approach extends to incident response broadly, Claude Code for Incident Response covers the full workflow. The log analysis skills described here are the substrate that makes the incident response workflow possible.

The Diff Skill

Long-running incidents have a special challenge. The logs from before the incident and the logs from during the incident look almost identical at the line level, but the distributions are different. The diff skill quantifies the difference.

The skill takes two log windows. One is a baseline, usually a similar period from before the incident. The other is the incident window itself. The skill compares the frequency distributions of every log pattern in the two windows and surfaces the ones that changed the most.

The patterns that increased dramatically in the incident window are usually symptoms. The patterns that decreased are sometimes the most interesting. A drop in the rate of successful operations or healthy heartbeats often points more directly at the cause than the new errors do.

The diff also surfaces patterns that are entirely new. Lines that appear in the incident window but do not appear at all in the baseline are particularly interesting. They are the things the system was not doing in normal operation, which means they are very likely related to the incident.
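The comparison can be sketched with two counters of template frequencies. The sketch assumes both windows cover the same length of time, and it uses a log ratio with add-one smoothing as the measure of change, which is an illustrative choice and not necessarily what the skill computes.

```python
import math
from collections import Counter


def diff_windows(baseline: Counter, incident: Counter):
    """Compare template counts from two equal-length log windows.

    Returns (new_patterns, shifts). new_patterns are templates absent from
    the baseline. shifts are (template, log2_ratio) pairs for baseline
    templates, largest change first; negative values mean the rate dropped.
    """
    new_patterns = sorted(t for t in incident if baseline.get(t, 0) == 0)
    shifts = []
    for template, base_count in baseline.items():
        ratio = (incident.get(template, 0) + 1) / (base_count + 1)
        shifts.append((template, math.log2(ratio)))
    shifts.sort(key=lambda s: abs(s[1]), reverse=True)
    return new_patterns, shifts
```

Ranking by absolute change keeps drops, such as missing heartbeats, on the same list as spikes, so the decreases the text calls out are not hidden behind new errors.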

How the Workflow Runs in Practice

When an incident fires, my first move is to grab the log window. I usually take fifteen minutes before the first customer-visible error and fifteen minutes after. The window is small enough to process quickly and large enough to capture context.

I pass the window to the frequency skill. The output is a ranked list of patterns that takes about ten seconds to skim. The rare patterns at the top are my first candidates. I tag the ones that look interesting.

The correlation skill takes the tagged patterns and produces a small graph. The graph usually reveals the rough order of events. I can see which pattern came first, which came second, which seemed to trigger which.

The hypothesis skill takes everything and gives me a starting point. Maybe two or three candidate root causes, each with supporting evidence. I validate the top hypothesis by checking the evidence directly and either confirming it or ruling it out.

If the incident is long-running, I bring in the diff skill. The baseline comparison is particularly useful when the incident is a gradual degradation rather than a sudden break. The patterns whose rates have drifted reveal the degradation in a way that simple error scanning cannot.

The whole workflow takes ten to fifteen minutes for a typical incident. The actual fix often takes longer than the analysis, which is the opposite of how my workflow used to be balanced.

What This Workflow Did to My Practice

The most measurable change is the median time to root cause. Before this workflow, I would estimate it at sixty to ninety minutes for a typical incident. Today it is closer to five to ten minutes for the same class of incident. The reduction is dramatic enough that I no longer dread getting paged.

The less measurable change is more important. I now expect to understand what happened, not just to make it stop. Pre-workflow, I would frequently hit a state where the system was healthy again but I did not really know why or what had caused the original problem. The temptation was to declare victory and move on. Post-workflow, I almost always have a hypothesis grounded in evidence by the time the system stabilizes. The postmortems are shorter and more accurate because the root cause is already documented.

The third change is in how I think about logs themselves. I used to view logging as a debugging output, something to write more of when I was stuck. Now I view it as a structured signal source that needs to be analyzable at scale. The way I write log statements has changed. I include more context, I use more consistent templates, I avoid baking variable values into the static parts of the message. The downstream tooling works better because the upstream emission is more disciplined.

If you want to take this further, my full set of practical workflows is in the Claude Code Practical Workflows series on DEV.to. The series covers everything from incident response to refactoring to migrations.

FAQ

Does this work for unstructured logs?

Yes. The frequency skill normalizes log lines into templates even when the lines are unstructured. Structured logs are easier to work with, but the workflow does not require them.

What about logs that are too large to process in a single pass?

The skills can sample. For very large windows, the frequency analysis runs on a sample first and then drills down on the interesting patterns at full resolution. The sampling is much faster and usually surfaces the same candidates as the full pass.

Does this replace observability platforms?

No. The observability platform is where the logs live. The skills consume what the platform provides. They complement the platform by adding pattern recognition that the platform itself does not offer.

How do I get started?

Start with the frequency skill. Pick a recent incident, grab the log window, run the analysis, and see what shows up. The first time you find a needle the platform missed, you will know whether the workflow is worth investing in further.

The log analysis workflow is one piece of a larger pattern. The pattern is using Claude Code to add pattern recognition layers on top of tools that were designed for human-scale data and now have to work at machine-scale volumes. Every layer makes a different part of the job tractable. Log analysis was where the payoff was clearest for me, and it is where I would recommend starting if you want to try this on your own systems.

Claude Code for Feature Flags: How I Ship Risky Changes Without Losing Sleep

Nex Tools — Wed, 13 May 2026 11:43:28 +0000

The riskiest deployment I have ever done was a payment processor migration. The old processor was being deprecated. The new one had better rates but a completely different API. The migration touched the most sensitive code path in the business. A bug in the new path would silently lose revenue or charge customers incorrectly. There was no acceptable amount of downtime.

I shipped that migration on a Tuesday afternoon at three o'clock. No, that is not a typo. I shipped it in the middle of a normal Tuesday because feature flags let me. The new path was already in production behind a flag set to zero percent of traffic. I ran a script that increased the percentage gradually: one percent, then ten, then fifty, then one hundred. Each step, I watched the metrics. If anything looked wrong, I would have rolled back to zero with a single command. Nothing looked wrong. The migration completed in about ninety minutes and nobody on the team even knew it was happening except me.

That is what feature flags do when they work right. They turn a scary deployment into a routine one. They let you separate the act of shipping code from the act of activating it. They give you the ability to react to problems in seconds instead of minutes or hours. But they only work right when the infrastructure around them is solid. Building that infrastructure is what Claude Code helps with.

Here is the workflow I use to run feature flags across multiple products.

What Feature Flags Are Actually For

Feature flags get pitched as a way to do A/B testing. That is the marketing pitch and it sells products. The actual reason to have feature flags in your codebase is more boring and more important. They let you decouple deployment from release.

When the deployment ships, the new code is in production but the new behavior is not active. When you decide the behavior should be active, you flip the flag and the behavior turns on without redeploying. When you discover a problem, you flip the flag back and the behavior turns off without redeploying.

The value of feature flags is not in the experimentation they enable. The value is in the asymmetry they create between deployment risk and release risk. A bad deployment forces a rollback. A bad release flips a flag. The difference is the difference between a Sunday night incident and a Tuesday afternoon decision.

The challenge is that feature flags themselves become a source of complexity. Code that is gated behind flags is harder to reason about. Tests have to cover both branches. The flag state has to be consistent across requests. The flag has to be cheap to evaluate. The flag has to be easy to clean up when the experiment is done.

Most teams I have worked with started with a simple flag system and then watched it grow into something unmanageable. Hundreds of flags. Stale flags from features shipped years ago. Flags whose purpose nobody remembers. Flag configurations that disagree across environments. The complexity of the flag system eventually exceeds the complexity it was meant to manage.

The Claude Code workflow tackles this by automating the lifecycle of flags from creation through cleanup, and by enforcing the patterns that keep the system manageable.

The Flag Creation Skill

The flag creation skill takes a feature request and produces the flag-gated implementation. The skill handles the boilerplate that makes flags consistent and the discipline that prevents bad patterns.

The skill creates the flag definition in the central flag registry. The registry has a single source of truth for every flag, including its name, its description, its expected lifetime, its owner, and its allowed values. The flag does not exist if it is not in the registry. New flags require an explicit registration step that captures the metadata.
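A registry with that rule could be as small as the sketch below. The field names follow the metadata listed in the paragraph, but the classes themselves are hypothetical and stand in for whatever store the registry actually uses.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagDefinition:
    name: str
    description: str
    owner: str
    expires_on: date                      # expected end of the flag's lifetime
    allowed_values: tuple = (False, True)


class FlagRegistry:
    """Single source of truth: a flag that is not registered does not exist."""

    def __init__(self):
        self._flags = {}

    def register(self, flag: FlagDefinition) -> None:
        if flag.name in self._flags:
            raise ValueError(f"flag already registered: {flag.name}")
        if not flag.description or not flag.owner:
            raise ValueError("description and owner are required metadata")
        self._flags[flag.name] = flag

    def get(self, name: str) -> FlagDefinition:
        if name not in self._flags:
            raise KeyError(f"unregistered flag: {name}")
        return self._flags[name]
```

Failing loudly on an unregistered name is the design choice that matters here, because it makes the explicit registration step impossible to skip.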

The skill instruments the new code with the flag check at the right level. The check is at the boundary where the new behavior diverges from the old. Putting the check too deep means you have to thread the flag value through many layers. Putting the check too shallow means the entire request path has to be duplicated. The right place is the smallest scope that contains all the diverging logic.

The skill writes both branches of the code. The new branch is the new behavior. The old branch is the existing behavior, preserved unchanged. Both branches have tests. The tests run for both branches in CI, ensuring that turning the flag on or off does not break the build.

The skill also writes the migration plan. The plan documents how the flag will be rolled out: starting percentage, ramp schedule, success criteria, and rollback criteria. The plan goes into the PR description and gets reviewed alongside the code. Without a plan, the flag has no path to being fully on or fully off.

The Rollout Skill

Once a flag exists, it has to be rolled out. The rollout skill handles the progressive activation that turns a flag from zero percent to one hundred percent.

The skill operates on a rollout schedule. The schedule has stages, each with a target percentage and a duration. Stage one might be one percent for one hour. Stage two might be ten percent for one day. Stage three might be fifty percent for one day. Stage four is one hundred percent.

Between stages, the skill checks the health metrics for the rollout. The metrics include the error rate for the new behavior, the latency, the conversion rate if it is a user-facing change, and any custom metrics specified in the rollout plan. If the metrics are within the acceptable range, the rollout proceeds to the next stage. If they are outside the range, the rollout halts and a human is notified.

The skill never advances past a stage on a schedule alone. The schedule sets the maximum pace, but the metrics set the actual pace. A rollout that looks fine on the schedule but has degrading metrics will stop. A rollout that has clean metrics but is ahead of schedule will not skip the wait, because the wait is what lets long-tail issues surface.
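The rule that the schedule sets the maximum pace while metrics set the actual pace can be sketched as a single decision function. The stage values mirror the example schedule above; the error-rate threshold and the function name are assumptions, and a real rollout would check more metrics than one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    percent: int
    min_hours: float  # minimum time to hold this stage before advancing


SCHEDULE = [Stage(1, 1), Stage(10, 24), Stage(50, 24), Stage(100, 0)]


def next_action(stage_index: int, hours_in_stage: float,
                error_rate: float, max_error_rate: float = 0.01) -> str:
    """Decide what the rollout does next: halt, wait, advance, or done."""
    if error_rate > max_error_rate:
        return "halt"     # bad metrics stop the rollout regardless of schedule
    if stage_index == len(SCHEDULE) - 1:
        return "done"
    if hours_in_stage < SCHEDULE[stage_index].min_hours:
        return "wait"     # clean metrics never shorten the hold time
    return "advance"
```

Checking the metrics before the clock is what keeps a degrading rollout from advancing just because its hold time has elapsed.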

The skill also handles segment-based rollouts. Sometimes you want to roll out to specific users first: internal staff, then beta testers, then a geographic region, then everyone. The skill expresses these segments in the rollout plan and applies them in sequence. The segment-based rollout catches issues that percentage-based rollouts miss, because a one percent rollout might still miss specific cohorts where the bug manifests.

The Evaluation Skill

The flag has to be evaluated at runtime, and the evaluation has to be fast and consistent. The evaluation skill produces the runtime code that checks whether a flag is on for a given context.

The skill produces an evaluation function that takes a context (user ID, request attributes, environment) and returns the flag value. The function is deterministic. Given the same flag state and the same context, it always returns the same value. This consistency is important for testing and debugging. If you cannot reproduce a flag evaluation, you cannot debug the resulting behavior.
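One common way to get this determinism for percentage rollouts is to hash the flag name and user ID into a stable bucket. The sketch below assumes that approach; the text does not say which scheme the skill uses, and `bucket` and `is_enabled` are hypothetical names.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout when their bucket falls below the percentage."""
    return bucket(flag_name, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and the user ID, a user who is enabled at ten percent stays enabled when the rollout grows to fifty percent, and replaying the same inputs in a test reproduces the same decision.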

The skill caches flag values aggressively. Flag definitions change rarely. The flag values for a given context can be computed once per request and reused everywhere. The skill produces an in-request cache that avoids redundant evaluations. For longer-lived contexts like user sessions, there is also a session-level cache.

The skill also handles flag dependencies. Sometimes one flag depends on another. The dependent flag should not be evaluated if the parent flag is off. The skill produces evaluation code that respects the dependency graph and avoids spurious evaluations.
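The in-request cache and the parent check fit together naturally. The sketch below is an assumption about how that could look, with a hypothetical `RequestFlags` class created once per request; it assumes each flag has at most one parent and that the dependency graph has no cycles.

```python
from typing import Callable, Dict


class RequestFlags:
    """Per-request flag view: each flag is computed at most once."""

    def __init__(self, rules: Dict[str, Callable[[dict], bool]],
                 parents: Dict[str, str], context: dict) -> None:
        self._rules = rules        # flag name -> rule over the request context
        self._parents = parents    # child flag -> parent flag (assumed acyclic)
        self._context = context
        self._cache: Dict[str, bool] = {}
        self.rule_calls = 0        # how many rules actually ran

    def is_on(self, name: str) -> bool:
        if name in self._cache:
            return self._cache[name]
        parent = self._parents.get(name)
        if parent is not None and not self.is_on(parent):
            value = False          # parent is off: skip the child's rule
        else:
            self.rule_calls += 1
            value = self._rules[name](self._context)
        self._cache[name] = value
        return value
```

The `rule_calls` counter exists only to make the savings visible in a test; a production version would likely drop it.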

The evaluation has to be cheap. Every request might evaluate dozens of flags. If each evaluation takes a millisecond, dozens of them add tens of milliseconds to every request. The skill produces evaluation code that completes in microseconds for the cached case and in no more than a few milliseconds for the cold case, which the caching keeps rare. The performance budget for flag evaluation is much tighter than most teams realize.

The Observability Skill

Flags need observability for the same reason any production code needs observability. You have to know what the flag is doing. The observability skill adds the instrumentation that makes flags debuggable.

Every flag evaluation produces a log line. The line includes the flag name, the context attributes that mattered, the value returned, and the reason the value was chosen. The reason is important. A flag returning true might be returning true because the user is in the rollout cohort, or because the user is in a force-on list, or because the flag is fully on. The reason tells you which path was taken.
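A sketch of what such a log line could look like follows. The field names, the reason strings, and the shape of the `flag` dictionary are all assumptions made for illustration; the text specifies only that the flag name, the relevant context, the value, and the reason are recorded.

```python
import json


def evaluate_with_reason(flag: dict, user_id: str, bucket: int) -> tuple:
    """Return (value, reason). `bucket` is the user's stable 0-99 bucket."""
    if user_id in flag.get("force_on", ()):
        return True, "force_on_list"
    if flag["percent"] >= 100:
        return True, "fully_on"
    if bucket < flag["percent"]:
        return True, "rollout_cohort"
    return False, "outside_rollout"


def evaluation_log_line(flag: dict, user_id: str, bucket: int) -> str:
    """Serialize one evaluation as a structured JSON log line."""
    value, reason = evaluate_with_reason(flag, user_id, bucket)
    return json.dumps({
        "event": "flag_evaluated",
        "flag": flag["name"],
        "user_id": user_id,
        "value": value,
        "reason": reason,
    }, sort_keys=True)
```

Returning the reason from the same function that computes the value keeps the two from drifting apart, which is the failure mode that makes a reason field untrustworthy.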

Every flag value gets a metric. The metric tracks the distribution of values for the flag over time. You can see at a glance whether a flag is at one percent, ten percent, or one hundred percent. You can see when a flag was changed, by looking for the inflection point in the metric. You can see whether the rollout is consistent with what the configuration says, by comparing the metric to the configured percentage.

The skill also produces trace spans that show which flags were evaluated during a request and what values they returned. The trace span is what lets you debug behavior that depends on flag values. When a user reports an issue, you can pull up their request trace and see exactly which flag values applied to their session.

The observability is what makes the flag system trustworthy. Without it, flag-related bugs are hard to diagnose because you cannot tell what the flag system was doing. With it, the flag system is transparent and bugs are quick to find.

The Cleanup Skill

The biggest source of flag debt is flags that should have been removed but were not. The cleanup skill is what keeps the flag system from growing unbounded.

The skill watches the flag registry for flags that are eligible for cleanup. A flag is eligible if it has been at one hundred percent or zero percent for a sufficient period, with no recent changes. The threshold is configurable, typically a few weeks for fully rolled out flags and a few days for fully rolled back ones.
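The eligibility rule is simple enough to state as code. In this sketch the two thresholds are placeholder values, since the text gives only "a few weeks" and "a few days", and `eligible_for_cleanup` is a hypothetical name.

```python
from datetime import datetime, timedelta

# Placeholder thresholds; the text says only "a few weeks" and "a few days".
ON_THRESHOLD = timedelta(weeks=3)
OFF_THRESHOLD = timedelta(days=3)


def eligible_for_cleanup(percent: int, last_changed: datetime,
                         now: datetime) -> bool:
    """A flag qualifies once it has sat unchanged at 0% or 100% long enough."""
    age = now - last_changed
    if percent == 100:
        return age >= ON_THRESHOLD
    if percent == 0:
        return age >= OFF_THRESHOLD
    return False   # partially rolled out flags are never eligible
```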

When a flag is eligible, the skill produces a cleanup PR. The PR removes the flag check from the code, keeping only the active branch. The removed branch is the dead branch, the one not in use. The flag definition is also removed from the registry. The tests are updated to remove the branch coverage that no longer applies.

The cleanup PR goes through normal review. A human looks at it, confirms the removed branch is actually dead, and merges. The merge is what completes the flag lifecycle. The flag existed for as long as it needed to. Now it does not exist and the codebase is simpler.

The skill also surfaces flags that have not had any activity for a long time. These are zombie flags, where the rollout stalled and was never completed. The zombie flag should either be rolled out the rest of the way or rolled back. The skill produces a report that surfaces the zombies and asks for a decision.

The Coordination Skill

Multiple flags interact. A flag for a new payment processor might depend on a flag for the new checkout UI, which might depend on a flag for the new auth flow. The coordination skill manages these interactions.

The skill maintains a dependency graph of flags. The graph encodes which flags depend on which. When you change one flag, the skill checks whether the change is consistent with the dependent flags. Turning on a flag that depends on another flag that is off produces an error.
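The consistency check can be sketched as a walk over the dependency graph. The function below is illustrative only: `check_enable` is a hypothetical name, and the flag names in the usage note mirror the payment, checkout, and auth example from the text.

```python
def check_enable(flag: str, depends_on: dict, state: dict) -> list:
    """Return the prerequisites (direct and transitive) that are currently off.

    An empty list means the flag can be turned on safely.
    """
    blocked, seen = [], set()
    stack = list(depends_on.get(flag, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue           # guards against revisiting shared prerequisites
        seen.add(dep)
        if not state.get(dep, False):
            blocked.append(dep)
        stack.extend(depends_on.get(dep, []))
    return sorted(blocked)
```

With `depends_on = {"new_payments": ["new_checkout"], "new_checkout": ["new_auth"]}`, trying to enable `new_payments` while `new_checkout` is off returns `["new_checkout"]`, which a caller would surface as the error.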

The skill also handles experiment conflicts. Two experiments running at the same time might target overlapping segments. The skill detects the overlap and warns. Sometimes the overlap is fine because the experiments are independent. Sometimes the overlap is a problem because the experiments interfere with each other.

The skill produces a calendar of flag activity. Each flag rollout shows up on the calendar with its start date and end date. The calendar helps the team see when the system has many things in flight versus when it is quiet. Scheduling a risky rollout for a quiet period reduces the chance of conflicting changes.

The Testing Skill

Flag-gated code has to be tested for both branches. The testing skill ensures this happens automatically.

The skill produces a test matrix for each flag. The matrix enumerates the relevant flag values: off, on, partial. For each value, the existing test suite runs against that flag state. If any test fails for any flag value, the build fails. The matrix ensures that turning the flag on or off does not break the system.
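A stripped-down version of the matrix idea, without any test framework, might look like the following. The `price` functions are a made-up gated feature, and modelling "partial" as a mix of both branches is an assumption about what the partial state means.

```python
def old_price(cents: int) -> int:     # hypothetical existing branch
    return cents


def new_price(cents: int) -> int:     # hypothetical flag-gated branch
    return round(cents * 0.9)


def price(cents: int, flag_on: bool) -> int:
    return new_price(cents) if flag_on else old_price(cents)


# "partial" is modelled as traffic that hits both branches.
FLAG_STATES = {"off": [False], "on": [True], "partial": [False, True]}


def run_matrix(check) -> list:
    """Run `check(flag_on)` under every flag state; return the failing states."""
    failures = []
    for state, values in FLAG_STATES.items():
        for flag_on in values:
            try:
                check(flag_on)
            except AssertionError:
                failures.append(state)
                break
    return failures
```

A build step would fail whenever `run_matrix` returns a non-empty list, which is the "any failure in any state fails the build" rule from the text.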

The skill also produces flag-specific transition tests. These tests target the boundary between the two branches and verify that the boundary works correctly. They are what catch bugs where the flag check is in the wrong place or where the two branches diverge in unexpected ways.

For more complex flags, the skill produces integration tests that exercise the full flow through both branches. The integration tests are slow but catch the kinds of bugs that unit tests miss. Integration tests run on a schedule rather than on every commit, so the slowness does not block the team.

The skill also handles snapshot testing for user-visible changes. When a flag is going to change a user interface, the skill produces snapshots for both states. The snapshots get reviewed before the flag rolls out, ensuring that the visual change is intended.

How the Skills Compose

The skills compose into a flag lifecycle. A new feature comes in. The creation skill produces the flag-gated implementation. The rollout skill activates the flag gradually. The evaluation skill makes the flag fast at runtime. The observability skill makes the flag visible. The coordination skill keeps the interactions clean. The testing skill catches the branch bugs. The cleanup skill removes the flag when it is done.

The team interacts with the skills through normal PR flow. The creation PR adds the flag and the gated code. The rollout PR sets the schedule. The cleanup PR removes the completed flag. Each PR is small and reviewable. The skills handle the boilerplate but the humans make the decisions.

The result is a flag system that scales to many flags without becoming a source of pain. Flags get created, rolled out, observed, and cleaned up on a regular cadence. The codebase stays manageable. The risk of changes goes down.

What This Costs

The skills took about a month to build. The evaluation skill was the most complex piece because the performance requirements are tight and the consistency requirements are strict. The cleanup skill required the most ongoing tuning to avoid false positives that would propose removing flags that were still in use.

The benefit is in the rate at which the team can ship. Before the flag system, every risky change required a deployment that activated the change immediately. The deployment had to be timed carefully and watched closely. With the flag system, the deployment is decoupled from the activation. Risky changes ship at any time and activate when the conditions are right.

The benefit also shows up in the rate of recovery from problems. A bug that gets discovered after a flag-gated rollout takes seconds to mitigate. The flag flips off and the bug stops happening. The fix can be deployed at a normal pace because the production impact is already neutralized.

What the Skills Do Not Do

The skills do not pick your flag platform. Whether you use a SaaS flag service, an open source flag system, or a homegrown one, the skills produce code that fits the platform's evaluation interface. The platform itself is your choice.

The skills also do not write your feature code. They handle the flag gating and the lifecycle, but the feature behavior is yours to design and implement. The flag is the wrapping around the feature, not the feature itself.

The skills also do not decide which features should be flag-gated. Not every change needs a flag. Small changes, low-risk changes, and changes that have no rollback path do not benefit from flagging. The decision of when to use flags is a judgment call that the skills support but do not make.

Setting Up Your Own Flag System

Start with the registry. The registry is the foundation. Every flag exists in the registry, with its metadata. Without the registry, the flag system is not manageable at scale.

Add the evaluation library next. The library is what application code uses to check flags. The library should be small, fast, and well-tested. It is the most performance-sensitive part of the system.

Add observability third. Get flag evaluations into your logs and metrics. This is what makes the flag system debuggable when it does not behave as expected.

Add the rollout tooling after that. The progressive rollout is what turns flags from a binary on-off switch into a graduated mechanism for managing risk.

Add cleanup last. The cleanup matters but it can wait until you have a few flags in flight and need to start removing them.

The Bigger Picture

Feature flags are one of those tools that pay off enormously when they work and create chaos when they do not. The difference is the infrastructure around them. A flag system without a registry is a mess of strings scattered through the codebase. A flag system without observability is a black box that nobody trusts. A flag system without cleanup is a graveyard of dead branches that accumulate over years.

The pattern in this workflow is the pattern I keep applying. The repetitive parts of flag management get automated. The judgment parts stay human. The infrastructure is what lets the team use flags confidently. Without the infrastructure, flags become risky. With it, flags become routine.

If you have a codebase that needs more sophisticated release management than your current deployment pipeline supports, the answer is probably not to invest in faster rollbacks. The answer is to invest in flags. The flags are what let you take the deployment pressure off, ship more often, and recover faster when things go wrong.

The first concrete step is the registry. Create a single source of truth for flags. Make it impossible to use a flag that is not registered. Once the registry exists, every other piece of the system has a foundation to build on. Without it, the flags drift and the system becomes ungovernable.

Build the registry. Add one flag. Watch it through its full lifecycle. Then add the next one. The discipline compounds quickly.

FAQ

Should every change go behind a flag? No. Small changes and changes that you can revert with a quick deploy do not benefit from the overhead. Use flags for changes where the risk of a bad release is high.

How long should a flag live? As short as possible. A flag that is fully rolled out should be cleaned up within weeks. A flag that is fully rolled back should be cleaned up within days. Flags that live for years are usually a sign of a stalled rollout.

What about flags for permissions? Permission flags are different from rollout flags. Permission flags live forever and are part of the application's permanent logic. The same registry can hold both, but they should be tagged differently and treated differently by the cleanup skill.

What about server-side versus client-side flags? The same patterns apply but the implementation differs. Server-side flags evaluate on every request. Client-side flags evaluate once per session. The evaluation skill handles both modes with the same interface.

What is the biggest mistake to avoid? Adding flags without a plan to remove them. Every flag should have an expected end state. Without a plan, the flag accumulates and the system gets messy.

If you found this useful, follow for more posts about practical Claude Code workflows. I write about how I run a multi-product business with AI agents handling most of the operational work.

Claude Code for Observability Stacks: How I Stopped Flying Blind in Production

Nex Tools — Wed, 13 May 2026 11:37:38 +0000

The first real outage I had to debug without proper observability took fourteen hours to resolve. The system was throwing 500s intermittently. Logs showed nothing useful. Metrics showed the error rate climbing but no signal about why. Traces did not exist. I spent the entire day adding log lines, redeploying, watching, and adding more log lines until I finally cornered the root cause.

The fix took eight minutes once I understood what was happening. The other thirteen hours and fifty-two minutes were spent building the observability that should have already been in place.

After that incident, I made a rule. Every service has to have observability built in from day one. Not as a future improvement. Not as something that gets added when there is time. Built in from the first commit. I have kept that rule for years now, and Claude Code is the thing that made it cheap enough to actually follow.

Here is the workflow I use to build and maintain observability across every service I run.

What Good Observability Actually Means

Good observability is the ability to answer questions about your system without having to deploy new code to answer them. When something breaks, you should be able to look at what is already being collected and figure out the answer. When you cannot answer the question with existing data, that is a gap in your observability and the next outage will be the one that exposes it.

The three pillars are logs, metrics, and traces. Each one answers different questions. Logs tell you what happened. Metrics tell you how often it happened and how it changed over time. Traces tell you how a single request moved through the rest of the system and where it spent its time.

Deploying all three pillars does not by itself give you observability. You have observability when you can answer the question you actually need to answer at three in the morning when something is on fire and you have ten minutes to find the root cause. The three pillars are necessary but not sufficient.

The hard part of observability is not picking the tools. The hard part is making sure the right data is being collected in the right shape, with the right labels, at the right cardinality. This is where most observability stacks fail. The tools are deployed but the data is wrong, and at the moment of crisis the answer is not in the system.

The Claude Code workflow targets the data collection problem directly. The skills produce instrumentation that is consistent across services, that captures the right shape of information, and that gets maintained as the services evolve.

The Instrumentation Skill

The instrumentation skill takes a service and adds the structured logs, metrics, and traces that the service needs to be debuggable.

The skill starts by reading the service code and identifying the natural instrumentation points. Public API endpoints get request and response logging with latency metrics and distributed trace spans. Database queries get duration metrics and trace spans with the query type as a label. External API calls get the same treatment, plus retry counts and circuit breaker state. Background jobs get start, complete, and failure events with duration histograms.

The skill applies the same patterns across every service. The endpoint logs have the same fields everywhere. The trace spans have consistent naming. The metrics have a shared set of labels. The consistency is what lets dashboards and alerts work across services without per-service configuration.

The skill avoids over-instrumentation. Adding a log line to every function in a codebase produces enormous volumes of low-value data that drowns out the signals you actually need. The skill instruments at the boundary points where data crosses subsystem lines. Inside a subsystem, only the points that have historically been useful for debugging get instrumented.

The skill also handles the cardinality problem. Metrics with high-cardinality labels explode in storage cost and query latency. The skill identifies labels that should be high-cardinality (request IDs, user IDs) and ensures they go into logs and traces rather than metrics. It identifies labels that should be low-cardinality (endpoint paths, error categories) and uses those for metrics.
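The routing rule can be shown in a few lines. This sketch assumes a fixed deny-list of high-cardinality keys built from the two examples in the text; how the skill actually classifies labels is not specified.

```python
# The two examples named in the text; a real list would be longer.
HIGH_CARDINALITY = {"request_id", "user_id"}


def split_attributes(attrs: dict) -> tuple:
    """Return (metric_labels, log_fields) for one event.

    Metrics get only the low-cardinality attributes; logs and traces keep all.
    """
    metric_labels = {k: v for k, v in attrs.items()
                     if k not in HIGH_CARDINALITY}
    log_fields = dict(attrs)
    return metric_labels, log_fields
```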

The Schema Skill

Logs without a schema are unsearchable. Logs with a schema are queryable like a database. The schema skill makes sure every log line follows the schema for the service.

The schema starts simple. Every log line has a timestamp, a level, a service name, a request ID if one is available, and a message. Every log line is JSON, not free text. The fields beyond these are specific to the event being logged.

The skill captures the per-event fields as it instruments. A login event logs the user ID, the auth method, and whether the attempt succeeded. A database query event logs the query type, the table, the duration, and the row count. The fields are explicit and consistent across the codebase.
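As a rough illustration of the schema, the helper below builds one JSON log line with the common fields plus per-event fields. `log_event` and the exact key names are assumptions; the login fields follow the example in the text.

```python
import json
from datetime import datetime, timezone


def log_event(service: str, level: str, message: str,
              request_id=None, **fields) -> str:
    """Build one JSON log line: common fields first, then per-event fields."""
    record = {
        **fields,   # per-event fields; the common fields below take precedence
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "message": message,
    }
    if request_id is not None:
        record["request_id"] = request_id
    return json.dumps(record)
```

A login event would then be logged as `log_event("auth", "INFO", "login attempt", request_id="req-1", user_id="u1", auth_method="password", success=True)`.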

The skill produces a schema document that lists every event type, the fields it carries, and what each field means. The document gets checked into the repository and updated whenever a new event type is added. The document is what makes the logs usable by anyone other than the person who wrote them.

The schema also drives the log aggregation pipeline. The aggregator parses the JSON, extracts the fields, and indexes them so they can be queried. The indexing is much faster than searching through unstructured text. A query that takes thirty seconds against unstructured logs takes a few hundred milliseconds against indexed structured logs.

The Trace Sampling Skill

Distributed tracing is valuable but expensive. Tracing every request consumes too much storage and slows down query performance. Sampling solves the cost problem but introduces a bias problem. Uniform random sampling drops most of the rare failures that you actually want to investigate.

The trace sampling skill implements smart sampling that catches the interesting traces while keeping the cost manageable. Every trace gets a sampling decision based on its characteristics.

Traces from healthy successful requests get sampled at a low rate, maybe one in a hundred. The aggregate behavior is captured but the storage cost is low. Traces from slow requests, where latency exceeds a threshold, get sampled at a higher rate, maybe one in ten. Traces from failed requests, where the response is an error, get sampled at one hundred percent. Every failure is captured.

The skill also implements tail sampling for the cases where the sampling decision has to wait until the trace is complete. A request that started normal but ended in an error needs to be sampled, but the decision can only be made after the error happens. The tail sampler buffers traces in memory and makes the decision when the trace ends.
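The sampling rates above translate into a small tail-sampling decision. The sketch below uses the rates from the text; the 500 ms slow threshold is a placeholder, and hashing the trace ID to make the decision repeatable is an assumption rather than something the text specifies.

```python
import hashlib


def keep_fraction(is_error: bool, duration_ms: float,
                  slow_ms: float = 500.0) -> float:
    """Rates from the text: all errors, 1 in 10 slow, 1 in 100 healthy."""
    if is_error:
        return 1.0
    if duration_ms > slow_ms:      # slow_ms is a placeholder threshold
        return 0.1
    return 0.01


def should_keep(trace_id: str, is_error: bool, duration_ms: float) -> bool:
    """Tail decision, made once the trace is complete and its outcome known."""
    fraction = keep_fraction(is_error, duration_ms)
    if fraction >= 1.0:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64   # roughly uniform in [0, 1]
    return position < fraction
```

Deriving the decision from the trace ID instead of a random number means every collector that sees the same trace reaches the same verdict.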

The result is trace storage dominated by failures and outliers, which is exactly what you want for debugging. The successful requests are represented but not dominant. The cost stays manageable and the data stays useful.

The Alert Generation Skill

Alerts are the part of observability that most teams get wrong. Either there are too many alerts and people stop responding, or there are too few alerts and outages go undetected for hours. The alert generation skill produces alerts that are actionable, specific, and rare.

The skill starts from the service-level objectives. Each service has objectives for availability, latency, and correctness. The alerts measure deviation from those objectives. When the error budget is being consumed at a rate that would exhaust it before the period ends, an alert fires.
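The error-budget rule is often expressed as a burn rate: the observed error rate divided by the rate the objective allows. The sketch below assumes that formulation; `burn_rate`, `should_alert`, and the default threshold of 1.0 are illustrative, not taken from the skill.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget.

    slo_target is the availability objective as a fraction, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = 1.0 - slo_target      # allowed error rate over the period
    return error_rate / budget


def should_alert(error_rate: float, slo_target: float,
                 threshold: float = 1.0) -> bool:
    """Above 1.0, a sustained rate exhausts the budget before the period ends."""
    return burn_rate(error_rate, slo_target) > threshold
```

For a 99.9 percent objective, an error rate of 0.2 percent burns the budget about twice as fast as allowed, so it would fire; 0.05 percent burns it at half speed and would not.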

The skill avoids the common trap of alerting on raw metrics. An alert that fires when CPU usage exceeds 90% is mostly noise. CPU usage at 90% is not a problem if the requests are still being served fast. The alert should fire on the user-visible effect, not the internal cause.

The skill also avoids per-instance alerts. When a single instance is unhealthy, the load balancer should route around it and the system should self-heal. The alert should fire when the system as a whole cannot self-heal, which is when the redundancy is exhausted.

Each alert produced by the skill includes the runbook link that explains what the alert means, what to check, and how to mitigate. The runbook is generated alongside the alert and updated whenever the alert is modified. The alert without the runbook is useless. The combination is actionable.

The Dashboard Generation Skill

Dashboards are the interface that lets a human understand a system at a glance. Dashboards that have everything on them are unusable. Dashboards that have only one metric are not informative enough. The right level of detail is hard to find.

The dashboard generation skill produces dashboards for each service following a consistent template. The template has four panels at the top showing the four golden signals: latency, traffic, errors, and saturation. Below those, there are panels for the specific behavior of the service.

The skill picks the specific panels based on what the service does. An API service gets per-endpoint latency and error breakdowns. A background worker gets queue depth and processing latency. A database client gets per-query duration and connection pool saturation. The specifics are different but the layout is consistent.

The skill also produces composite dashboards that show multiple services together. When a user-facing feature spans several services, the dashboard for that feature shows the relevant panels from each service on one page. The composite dashboards are what get used during incident response, when you need to see across the whole call chain at once.

The dashboards get committed to the repository as code rather than configured in the UI. The code is reviewable, versioned, and reproducible. When a dashboard changes, the change goes through the same review process as any other code change.

The Correlation Skill

The hardest part of debugging in production is correlating signals across the three pillars. The metric shows the error rate climbing. The logs show a stream of errors. The traces show specific failed requests. Connecting these requires a shared identifier that flows through all three.

The correlation skill ensures that every request gets a request ID that propagates everywhere. The ID is created at the edge of the system, included in the logs, attached to the trace, and linked from the relevant metrics as an exemplar rather than as a label, because a per-request label would bring back the cardinality problem the instrumentation skill is meant to avoid. With the ID in place, you can pivot between the three pillars by querying for the same ID.

The skill also adds correlation for asynchronous flows. A background job triggered by a request gets the request ID propagated through the job queue. A retry of a failed operation gets the original request ID so the full retry chain can be traced. A user session ID gets attached to every request in the session so you can see the full user journey.
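One way to carry the ID through synchronous and asynchronous code in Python is a context variable plus an explicit field in the job envelope. The sketch below assumes that approach; the function names and the envelope shape are hypothetical.

```python
import contextvars
import json
import uuid

# Holds the ID for whatever request the current task is serving.
request_id_var = contextvars.ContextVar("request_id", default=None)


def start_request(incoming_id=None) -> str:
    """Called at the edge: reuse an upstream ID or mint a new one."""
    rid = incoming_id or uuid.uuid4().hex
    request_id_var.set(rid)
    return rid


def log(message: str) -> str:
    """Every log line picks up the current request ID automatically."""
    return json.dumps({"request_id": request_id_var.get(), "message": message})


def enqueue_job(payload: dict) -> dict:
    """Copy the current request ID into the job envelope."""
    return {"request_id": request_id_var.get(), "payload": payload}


def run_job(job: dict) -> str:
    """The worker restores the originating ID before doing any work."""
    start_request(job["request_id"])
    return log("job started")
```

The context variable covers code running inside the request; the envelope field covers the hop across the queue, where in-process context does not survive.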

The correlation is what makes the observability data composable. Without it, each pillar is an island. With it, the pillars become a connected graph that you can navigate based on the question you are asking.

The Cost Control Skill

Observability data is expensive. Storage costs scale with the volume of data. Query costs scale with the cardinality of the labels. A naively instrumented service can produce so much data that the observability bill exceeds the compute bill.

The cost control skill keeps the observability spend in check. The skill watches the ingestion volume per service and alerts when a service starts producing significantly more data than its peers. The alert prompts a review of whether the additional data is valuable or whether it represents an instrumentation mistake.

The skill also implements log level controls per service. Production runs at INFO level by default, which captures the events that matter without the volume of DEBUG. When a service is being actively debugged, the level can be switched to DEBUG for a short window and then returned to INFO. The temporary verbose period gives you the data you need without paying for it all the time.

The skill manages retention as well. High-cardinality data like traces gets retained for a short window, maybe seven days, because the value of a trace drops quickly after the incident is resolved. Lower-cardinality data like metrics gets retained for a longer window, maybe a year, because long-term trend analysis is valuable. The retention policies match the value of the data.

The cost control skill turns observability from an open-ended expense into a managed one. The spend has a budget. The budget gets allocated across services based on their criticality. The skill makes sure the allocation is being respected.

How the Skills Compose

The skills compose into an observability practice. The instrumentation skill adds the data collection. The schema skill makes the data queryable. The trace sampling skill keeps the volume manageable. The alert generation skill turns the data into actionable signals. The dashboard generation skill turns the data into visual summaries. The correlation skill makes the data connectable. The cost control skill keeps the bill predictable.

A new service comes into the system with all of this from day one. The instrumentation skill runs on the initial codebase. The schema document is generated. The trace sampling is configured. The alerts are generated. The dashboards are created. The service ships with observability built in.

When the service evolves, the skills evolve with it. New endpoints get instrumented as they are added. New event types get added to the schema. New alerts and dashboards appear as the surface area grows. The observability stays current without anyone scheduling observability work.

What This Costs

Building the skills took a few weeks. The instrumentation skill was the largest single piece because it has to understand many different code patterns. The schema and dashboard generation skills came next. The alert generation skill required the most tuning because alert quality is hard to get right.

Once the skills are in place, the cost of adding observability to a new service is close to zero. The skill runs, the output is reviewed and merged, and the service is observable. Compared to the days or weeks of manual work this would have taken, the savings are enormous.

The bigger benefit is the consistency. Every service follows the same patterns. Every dashboard has the same shape. Every alert links to a runbook. When something breaks, the cognitive load of finding the right data is low because the data is always in the same place.

What the Skills Do Not Do

The skills do not pick your observability vendor. Whether you use an open source stack, a commercial platform, or something in between, the skills produce instrumentation that follows the OpenTelemetry standard. The downstream pipeline is yours to configure.

The skills also do not replace the human judgment in incident response. They give you the data, but the data does not interpret itself. When something is breaking, a human has to look at the dashboards, read the logs, and decide what to do. The skills make this easier but do not automate it.

The skills also do not write your service-level objectives. The objectives are a product and business decision. The skill takes the objectives as input and produces alerts and dashboards that measure against them, but the objectives themselves come from you.

Setting Up Your Own Stack

Start with structured logging. Get every service producing JSON logs with a consistent schema. This is the foundation that everything else builds on. Without it, the other pillars cannot connect to the logs.

Add request ID correlation next. Make sure every log line in a request flow carries the same ID. Once the IDs are in place, you can connect logs from different services that participated in the same request.

Add metrics third. Start with the four golden signals per service. Add custom metrics as you discover the need. Resist the temptation to add metrics for everything just because you can.

Add traces fourth. Traces are the most expensive pillar and the one with the highest setup cost, so it makes sense to add them after the cheaper pillars are working. The trace sampling skill keeps the cost manageable.

Add alerts and dashboards last. These depend on the data being clean and the schema being stable. Premature alerting produces noise. Premature dashboards become abandoned.

The Bigger Picture

Observability is the kind of work that pays off enormously but feels invisible when it is working. When the system is healthy, you do not think about your observability stack. When something breaks, the stack is either there to help you or it is not. The investment in observability is paid back in minutes saved during outages, and the savings compound across every outage for the life of the system.

The pattern in this workflow is the same pattern I keep using. The repetitive parts of observability work get automated. The judgment parts remain human. The result is a practice that scales without scaling headcount, and a production environment where outages get resolved in minutes instead of hours.

If you have services in production without proper observability, the answer is not to wait for a quieter quarter. Build the workflow. Add observability to one service. Use the workflow to add it to the next service. After a few services, the workflow is mature and adding observability to the rest is fast.

The first concrete step is structured logging. Every log line as JSON, every log line with a request ID, every service following the same schema. Once that is in place, the rest of the stack starts to make sense. Without it, every additional pillar is harder than it needs to be.

Pick one service. Add structured logging. Verify the logs are queryable. Then move to the next service. The compounding starts immediately.

FAQ

Which observability platform should I use? The skills produce OpenTelemetry-compatible output, which works with most platforms. Pick the platform based on your team's familiarity and your budget.

How much should I spend on observability? A reasonable starting point is between five and ten percent of the compute spend for the service. If you are spending much less, you probably do not have enough observability. If you are spending much more, you probably have too much.

What about security and privacy? Logs and traces can capture sensitive data. The skills include a redaction step that removes known sensitive fields before the data is shipped to the observability platform. Configure the redaction rules for your context.

What about local development? The skills produce the same instrumentation locally as in production. Local logs and traces go to a local collector. This way you can debug observability issues without needing to deploy.

What is the biggest mistake to avoid? Treating observability as something to add later. The data you cannot collect during the incident is data you cannot have. Build it in from the start.

If you found this useful, follow for more posts about practical Claude Code workflows. I write about how I run a multi-product business with AI agents handling most of the operational work.

Claude Code for TypeScript Migrations: How I Converted a 200,000-Line JavaScript Codebase Without Stopping Shipping

Nex Tools — Mon, 11 May 2026 13:18:59 +0000

Originally published on Hashnode. Cross-posted for the DEV.to community.

The first time I tried to migrate a large JavaScript codebase to TypeScript, I made the classic mistake. I planned a six-week migration project, kicked off with a team meeting, started at the top of the directory tree, and got about 4% through before the project stalled. Other work kept coming in. The migration sat in a long-running branch that diverged from main every day. Three months later, I gave up and merged the partial work back with a lot of any types and a lot of regrets.

The second time I tried, I had learned the lesson. Big-bang migrations do not work in production codebases that have to keep shipping. The migration has to be incremental, and it has to be done in a way that lets the rest of the team keep working without coordination overhead. The tools that exist for this kind of migration are good but require enormous amounts of human time to apply correctly across a large codebase.

This is where Claude Code changed the math for me. I built a migration workflow that turned what would have been a six-month project into a six-week project, and the codebase kept shipping the entire time. Today, the migration is done, the code is fully typed, and the team is faster than they were before. Here is how the workflow works.

Why Most TypeScript Migrations Fail

The reason most TypeScript migrations fail is that they are framed as a project. Projects have start dates and end dates and dedicated resources. Production codebases do not. They have a continuous stream of feature work that cannot stop, and they have a team that cannot pause for weeks to focus on a migration.

When the migration is a project, it competes with feature work for time. Feature work always wins because feature work has external pressure. The migration loses, falls behind schedule, and eventually gets canceled or quietly abandoned.

The migrations I have seen succeed are the ones framed as a continuous activity rather than a project. The work happens alongside feature work. Each commit takes a small bite out of the migration. The bites accumulate. After a few months, the migration is done without anyone having scheduled a migration sprint.

The migration succeeds when it becomes invisible. When the work is so cheap that every PR can include a slice of it without anyone noticing, the migration moves forward at the speed of the regular development cadence. The migration that demands focus is the migration that gets deprioritized.

The challenge is making the work cheap enough. TypeScript migration is not naturally cheap. Adding types to existing code requires understanding the code, the runtime behavior, the patterns of use, and the edge cases. Doing this well across thousands of files takes a long time. Doing this badly produces a codebase full of any types that buys none of the benefits of TypeScript.

The Claude Code workflow makes the work cheap by automating the parts that can be automated and focusing human attention on the parts that cannot.

The Foundation Skill

Before you can migrate any code, you have to set up the foundation. The foundation skill handles the project configuration that makes incremental migration possible.

The skill configures the TypeScript compiler to accept both .ts and .js files. It enables allowJs so existing JavaScript files keep working. It enables checkJs so JSDoc types in JavaScript files get checked. It sets strict mode for new TypeScript files but allows untyped JavaScript files to coexist.
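A minimal sketch of that configuration step, assuming the skill edits the `compilerOptions` of an existing tsconfig. The `allowJs`, `checkJs`, and `strict` flags are real compiler options; the `withMigrationOptions` helper and the `TsConfigLike` shape are hypothetical names used only for illustration.

```typescript
// Minimal shapes for the parts of tsconfig.json this sketch touches.
type CompilerOptions = Record<string, unknown>;

interface TsConfigLike {
  compilerOptions?: CompilerOptions;
  include?: string[];
}

// The compiler flags named in the text.
const migrationOptions: CompilerOptions = {
  allowJs: true, // let .js files take part in the program
  checkJs: true, // type-check .js files, including their JSDoc annotations
  strict: true, // full strictness for the files that are checked
};

// Return a copy of the config with the migration flags layered on top.
function withMigrationOptions(config: TsConfigLike): TsConfigLike {
  const merged: CompilerOptions = {};
  const existing = config.compilerOptions || {};
  for (const key of Object.keys(existing)) merged[key] = existing[key];
  for (const key of Object.keys(migrationOptions)) merged[key] = migrationOptions[key];
  return { compilerOptions: merged, include: config.include };
}

const migratedConfig = withMigrationOptions({
  compilerOptions: { target: "ES2020", allowJs: false },
  include: ["src"],
});
```

With `checkJs` on, a legacy file that is not ready to be checked can opt out with a `// @ts-nocheck` comment at the top. Whether the foundation skill uses that mechanism or a different one is not stated here.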

The skill also sets up the build pipeline. The build needs to compile a mix of .ts and .js files. The test runner needs to handle both. The bundler needs to handle both. The CI needs to type-check the TypeScript files and lint everything together. Each of these has small configuration changes that the skill handles in a single pass.

The foundation skill produces a codebase where you can rename a .js file to a .ts file and it still works. That is the precondition for incremental migration. Without it, every renamed file becomes a blocker that breaks the build for everyone.

The Inventory Skill

Once the foundation is in place, the inventory skill maps out what needs to be migrated. The map is the basis for prioritization.

The skill produces an inventory of every JavaScript file in the codebase. Each entry includes the file path, the size in lines, the number of exports, the number of importers, the cyclomatic complexity, and a migration difficulty estimate. The difficulty estimate is based on signals like dynamic property access, runtime type checking, eval usage, and the presence of patterns that are hard to express in TypeScript.
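To make the difficulty estimate concrete, here is a deliberately crude sketch that counts a few of those signals with regular expressions. The patterns and weights are assumptions for illustration. A real implementation would more likely walk the syntax tree, and the dynamic-access pattern below will also match ordinary loops such as `items[i]`.

```typescript
// Illustrative signal patterns; the weights are hypothetical.
interface DifficultySignal {
  label: string;
  pattern: RegExp;
  weight: number;
}

const difficultySignals: DifficultySignal[] = [
  { label: "eval usage", pattern: /\beval\s*\(/g, weight: 5 },
  { label: "dynamic property access", pattern: /\w\[[A-Za-z_$][\w$]*\]/g, weight: 2 },
  { label: "runtime type checking", pattern: /\btypeof\s+\w+/g, weight: 1 },
];

// Sum weight * occurrences for every signal found in the source text.
function estimateDifficulty(source: string): number {
  let score = 0;
  for (const signal of difficultySignals) {
    const matches = source.match(signal.pattern);
    score += (matches ? matches.length : 0) * signal.weight;
  }
  return score;
}
```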

The inventory also includes a dependency graph. For each file, the skill lists which files depend on it and which files it depends on. The graph is what drives the migration order. Files with no dependencies on other JavaScript files are leaves. Leaves can be migrated independently. Files with many JavaScript dependencies are roots. Roots have to wait until the dependencies are migrated.

The output is a prioritized list of migration targets. The top of the list is leaf files with low difficulty and high importance, ranked by impact per hour of work. The bottom of the list is complex roots that depend on many other things being migrated first.
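The leaves-before-roots ordering can be sketched as a simple topological pass over the dependency graph. This is an assumption about how such a list could be produced, not the skill's actual ranking: it uses difficulty alone as the tie-breaker and leaves out importance and impact per hour. `InventoryEntry` and `migrationOrder` are hypothetical names.

```typescript
interface InventoryEntry {
  path: string;
  imports: string[]; // unmigrated JavaScript files this file depends on
  difficulty: number; // lower is easier
}

// Order files so that each file comes after the files it imports
// (leaves first), preferring easier files among those that are ready.
function migrationOrder(entries: InventoryEntry[]): string[] {
  const done: Record<string, boolean> = {};
  const remaining = entries.slice();
  const order: string[] = [];

  while (remaining.length > 0) {
    // A file is ready once every file it imports has been migrated.
    const ready = remaining.filter((entry) =>
      entry.imports.every((dep) => done[dep] === true)
    );
    if (ready.length === 0) {
      // An import cycle or an unknown import: stop and leave the rest for a human.
      break;
    }
    ready.sort((a, b) => a.difficulty - b.difficulty);
    const next = ready[0];
    order.push(next.path);
    done[next.path] = true;
    remaining.splice(remaining.indexOf(next), 1);
  }
  return order;
}
```

Recomputing the order after each merge is what keeps the next entry on the list current.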

The inventory becomes the migration plan. Instead of asking "what should I migrate next?" I look at the next entry on the list. The list itself is updated as files get migrated, so the next entry is always the right next entry.

The Conversion Skill

The conversion skill handles the actual migration of a single file. The skill takes a JavaScript file and produces a TypeScript file with types added.

The skill starts by reading the file and understanding its structure. It identifies all the exports, the function signatures, the class definitions, the constants, and the patterns of use. It then queries the importers of the file to see how the exports are actually used. The usage tells it what the types should be.

For a function that takes a string and returns a number, the skill can infer the types from the function body if the body is simple enough. For a function that takes an object with various properties, the skill looks at how callers construct the object and what properties they pass.

For exports that are used in multiple places with conflicting types, the skill produces a union type or a generic. The decision depends on the pattern. If the function is genuinely polymorphic across usages, the skill uses a generic. If the function has a few specific usage patterns, the skill uses a union.
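The distinction is easier to see side by side. Both functions below are made-up examples, not code from the migrated codebase.

```typescript
// A few specific usage patterns: callers pass an id as a number or a string.
// A union documents exactly those cases.
function formatUserId(id: number | string): string {
  return typeof id === "number" ? "user-" + id : id;
}

// Genuinely polymorphic: the function behaves the same for any element type,
// and the return type follows the argument type. A generic captures that.
function firstOrDefault<T>(items: T[], fallback: T): T {
  return items.length > 0 ? items[0] : fallback;
}

const idFromNumber = formatUserId(7); // "user-7"
const firstName = firstOrDefault(["ada"], ""); // T is string
const firstCount = firstOrDefault([3, 4], 0); // T is number
```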

The skill avoids any whenever possible. When the type is genuinely unknown, it uses unknown instead, which forces the caller to narrow the type before using the value. When the type is partially known, it uses the most specific type it can derive.
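A small sketch of why `unknown` is the safer fallback: the caller has to prove the shape before touching the value. The `parseSettings`, `hasPort`, and `readPort` functions are hypothetical examples.

```typescript
// With `unknown`, the value cannot be used until the caller narrows it.
function parseSettings(raw: string): unknown {
  return JSON.parse(raw);
}

// Type guard for the one shape this caller cares about.
function hasPort(value: unknown): value is { port: number } {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { port?: unknown }).port === "number"
  );
}

function readPort(raw: string, fallback: number): number {
  const settings = parseSettings(raw);
  // Writing settings.port here would be a compile error: settings is unknown.
  return hasPort(settings) ? settings.port : fallback;
}
```

Had `parseSettings` returned `any`, the unchecked `settings.port` access would have compiled and failed only at runtime.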

The output is a TypeScript file with types that match the actual usage. The file is not perfect. Edge cases that the skill could not figure out are flagged for review. But the bulk of the work is done.

The Validation Skill

After the conversion skill produces a TypeScript file, the validation skill checks the result. The check has three parts.

The first part is the compile check. The TypeScript compiler runs on the file and reports any type errors. The skill reads the errors and decides whether they are real or whether they reflect places where the inferred types were wrong. The skill can often fix the inferred types automatically if the error is clear.

The second part is the test check. The test suite runs to make sure the migrated file still behaves correctly. If a test fails, the skill correlates the failure with the migration. Most test failures after a migration are caused by overly strict types that rejected runtime patterns the original code allowed. The skill identifies these and proposes a fix.

The third part is the usage check. The skill looks at every importer of the migrated file and verifies that the new types work for them. If an importer was passing an argument that does not match the new type, the skill identifies the mismatch. The mismatch might be a bug in the importer, in which case it should be fixed. Or it might be a sign that the migrated type is too narrow, in which case the type needs to be widened.

The validation skill catches the cases where the migration would have broken something downstream. Without it, a migration that compiles locally can introduce errors that only surface when other files try to use the migrated module. Catching these at migration time is much faster than catching them later.

The Pattern Library

Most JavaScript codebases have repeated patterns. The same idiom for error handling. The same shape of options object. The same approach to async iteration. Once you have migrated one instance of a pattern, the rest of the instances can be migrated mechanically.

The pattern library skill identifies repeated patterns in the codebase and learns how to migrate them. The library starts empty. As migrations happen, the skill notices when a similar pattern appears and asks whether to apply the same migration approach. After a few applications, the pattern is captured and applied automatically.

The pattern library is what makes the migration accelerate over time. The first hundred files are slow because everything is novel. The next hundred files are faster because most of the patterns are already captured. The last few thousand files are fast because almost everything is a known pattern.

The library also handles the codebase-specific idioms. Every codebase has weird things. Custom hooks. Custom decorators. Custom inheritance patterns. The library captures these and applies them consistently across the migration, which means the migrated codebase has consistent type patterns instead of one-off solutions in every file.

The Coordination Skill

The migration happens alongside feature work. Feature work changes files. Migration changes files. When the migration touches a file someone else is also touching, there is potential for conflict.

The coordination skill prevents this. The skill watches the open pull requests across the team and reserves files that are being actively worked on. The migration never touches a file that has an open PR against it. When the PR merges, the file becomes available for migration. When the migration is in progress, the team is notified to avoid that file.
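The reservation rule reduces to a filter once the open pull requests are known. How that list is fetched from the code host is outside this sketch, and `OpenPullRequest` and `availableForMigration` are hypothetical names.

```typescript
interface OpenPullRequest {
  id: number;
  changedFiles: string[];
}

// Return the candidate files that no open pull request currently touches.
function availableForMigration(
  candidates: string[],
  openPullRequests: OpenPullRequest[]
): string[] {
  const reserved: Record<string, boolean> = {};
  for (const pr of openPullRequests) {
    for (const file of pr.changedFiles) reserved[file] = true;
  }
  return candidates.filter((file) => reserved[file] !== true);
}
```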

The coordination is what keeps the migration from creating friction for the team. Without it, the migration would be a source of constant merge conflicts. With it, the migration moves through the parts of the codebase that are quiet at any given moment, and the team rarely notices.

The skill also batches the migration into small PRs. Each PR migrates a handful of related files. The small PR size makes the migration changes easy to review and reduces the chance of conflicts. The team reviews the migration PRs the same way they review any other PR, just with the awareness that the changes are mostly mechanical.

The Strictness Ramp

TypeScript has many levels of strictness. Starting with full strict mode in a freshly migrated codebase is too much, because the migration produces types that are correct but not always the strictest possible. The strictness ramp skill increases strictness gradually as the migration matures.

The first level of strictness allows implicit any but checks everything else. The second level disallows implicit any but allows explicit any. The third level disallows any entirely and requires unknown instead. The fourth level enables exhaustive switch checking. The fifth level enables strict null checks. The sixth level enables strict function types. The seventh level is full strict mode.

The skill tracks where the codebase is on each level and surfaces opportunities to advance. When 90% of the files pass a stricter level, the skill suggests turning that level on globally and fixing the remaining 10%. The ramp lets the codebase get to full strict mode in stages instead of trying to satisfy all of strict mode at once.
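The 90% rule can be sketched as a threshold check over per-level pass counts. How the counts are gathered (for example, by running the compiler or linter at the stricter setting) is assumed to happen elsewhere, and `LevelReport` and `nextLevelToEnable` are hypothetical names.

```typescript
interface LevelReport {
  level: number; // position on the ramp, 1 through 7
  passingFiles: number; // files that are clean at this level
  totalFiles: number;
}

// Suggest enabling the next level globally once enough files already pass it.
function nextLevelToEnable(
  currentLevel: number,
  reports: LevelReport[],
  threshold = 0.9
): number | null {
  const next = reports.filter((r) => r.level === currentLevel + 1)[0];
  if (!next || next.totalFiles === 0) return null;
  return next.passingFiles / next.totalFiles >= threshold ? next.level : null;
}
```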

The strictness ramp also includes per-file overrides. A file that is not yet at the target level has its level set explicitly, and the override is removed when the file reaches the target. This way the codebase can have a mix of strictness levels temporarily while everything converges.

The Regression Detection Skill

A successful migration is one that does not introduce regressions. The regression detection skill watches for cases where the migration changed runtime behavior accidentally.

The skill has three modes of detection. The first mode is type-driven, looking for places where the new types narrowed the behavior compared to what the JavaScript allowed. If the original code accepted a number or a string and the new type only accepts a number, the skill flags it for review.
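A made-up example of the narrowing that the type-driven mode looks for. Both functions are illustrative and do not come from the migrated codebase.

```typescript
// What the original JavaScript tolerated: numbers and numeric strings.
function toCents(amount: number | string): number {
  return Math.round(Number(amount) * 100);
}

// A narrower signature that a conversion might infer from most call sites.
// Any caller still passing "19.99" no longer compiles, which is the kind of
// change that should be flagged for review instead of merged silently.
function toCentsNarrow(amount: number): number {
  return Math.round(amount * 100);
}

const fromString = toCents("19.99"); // allowed by the wide signature
const fromNumber = toCentsNarrow(19.99);
```

The two functions behave identically for numbers. The difference is only in which callers the compiler accepts, so the review question is whether any real caller relied on the string form.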

The second mode is test-driven, watching for tests that started passing or failing after the migration. A test that started failing is an obvious regression. A test that started passing is sometimes a sign that the migration fixed a latent bug, but more often a sign that the test is checking something the migration changed.
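The test-driven comparison amounts to diffing two result sets. A minimal sketch, assuming test results are available as a map from test name to outcome (`TestRun` and `diffTestRuns` are hypothetical names):

```typescript
type TestOutcome = "pass" | "fail";
type TestRun = Record<string, TestOutcome>;

interface OutcomeChanges {
  startedFailing: string[];
  startedPassing: string[];
}

// Compare test outcomes from before and after a migration commit.
function diffTestRuns(before: TestRun, after: TestRun): OutcomeChanges {
  const changes: OutcomeChanges = { startedFailing: [], startedPassing: [] };
  for (const testName of Object.keys(before)) {
    const was = before[testName];
    const now = after[testName];
    if (was === "pass" && now === "fail") changes.startedFailing.push(testName);
    if (was === "fail" && now === "pass") changes.startedPassing.push(testName);
  }
  return changes;
}
```

Both lists deserve a look: the first for regressions, the second for tests whose meaning the migration may have changed.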

The third mode is production-driven, watching for runtime errors that appear after deployment of the migration. The skill correlates production errors with the files that were migrated and surfaces likely regressions. This catches the cases where the type system allowed something the runtime did not, or where the migration introduced a subtle behavior change that only manifests in production.

The regression detection is what makes the migration safe. Without it, a migration that looks good in development can introduce production issues that take weeks to find. With it, regressions are caught quickly and rolled back before they accumulate.

How the Skills Compose

The skills compose into a migration cadence. Each day, the inventory skill identifies the next handful of files to migrate. The coordination skill confirms they are available. The conversion skill produces TypeScript versions. The validation skill checks the result. The regression detection skill watches for issues.

I review the produced PRs, approve them, and merge them. The team reviews them as part of their normal workflow. The pattern library skill captures any new patterns. The strictness ramp skill tracks progress and surfaces opportunities to advance.

The total time I spend on the migration is about an hour per day. That is enough time to review and merge five to ten file migrations. The team spends almost no extra time, because the PRs are small and mechanical.

Over several weeks, the migration completes. The codebase moves to TypeScript without anyone scheduling a migration sprint, without anyone feeling like the migration was disruptive, and without any production regressions caused by the work.

What This Costs

The skills took a few weeks to build, mostly because the conversion skill needed a lot of tuning to produce good types instead of any types. The pattern library skill needs a few months of usage to accumulate the patterns specific to your codebase.

The benefit is in the rate of migration. Before this workflow, a TypeScript migration on a 200,000-line codebase would have been a six-month project requiring dedicated headcount. With the workflow, it was a six-week migration that ran alongside normal feature work and required about an hour of my time per day.

The benefit also shows up in the quality of the migration. Codebases that get migrated in a hurry end up with any types scattered throughout, because the team did not have time to do the work properly. Codebases that get migrated with this workflow end up with proper types from the start, because the conversion skill defaults to specific types and only falls back when forced.

What the Skills Do Not Do

The skills do not replace architectural decisions. When the migration reveals a design that does not work in TypeScript, the skills tell you but do not redesign. Some patterns that work in JavaScript do not have clean TypeScript equivalents and require code restructuring. The restructuring is yours to do.

The skills also do not write tests. They check that existing tests still pass and surface places where new tests would be valuable, but they do not write the tests themselves. The test writing is yours.

The skills also do not enforce style decisions. Whether to use interfaces or type aliases, const assertions or explicit types, enums or string unions: these are style choices the skills are agnostic about. You configure them based on your team's preferences.

Setting Up Your Own Workflow

Start with the foundation skill. Without the foundation, nothing else works. Get the project to a state where renaming a .js file to .ts does not break the build. This is the minimum viable starting point.

Add the inventory skill next. You need to know what you have before you can plan the migration. The inventory tells you whether the migration will take a week or a year.

Add the conversion skill after that. Migrate ten files manually first to see what good output looks like, then build the conversion skill to produce similar output. The first version of the conversion skill will be rough. Tune it on real files until the output is consistently good.

The validation, pattern library, coordination, strictness ramp, and regression detection skills can come later. They are valuable additions but the migration can start without them. Build them as you discover the need.

The Bigger Picture

The pattern in this migration workflow is the pattern I keep seeing in every successful application of Claude Code to a large engineering problem. The work has repetitive parts and judgment parts. The repetitive parts can be automated. The judgment parts cannot. The automation is what makes the work tractable. Without the automation, the work is too expensive to do well. With the automation, the work becomes routine.

TypeScript migration is a particularly good example because the scale is so visible. A 200,000-line codebase is intimidating. Most of the lines do not require any human judgment to migrate, but the few that do require careful thought. Automating the routine 95% lets the human focus on the 5% that needs them.

If you have a JavaScript codebase that you have been meaning to migrate but have been putting off because the project feels too large, the answer is probably not to wait for a quieter quarter. The answer is to build a workflow that makes the migration cheap enough to run continuously. The migration completes eventually, without disrupting anything else, and the codebase ends up in a better place than it would have if you had tried to do the migration as a project.

If you have been reading along, the first concrete step is to configure your build to accept both .ts and .js files. Once that is working, every subsequent step gets easier. The migration becomes a series of small commits instead of a giant lift. The compounding effect of small commits is how big migrations actually get done.

Build the foundation. Run the inventory. Start migrating leaves. The rest follows.

FAQ

How long did the migration actually take? Six weeks of calendar time, about an hour per day of my time, plus normal review time from the team for the migration PRs.

What language version did you migrate to? The most recent TypeScript at the time. The skills do not care which TypeScript version. They produce types compatible with the target version.

What about React components? React components migrate well because the types are mostly mechanical. The skills handle JSX correctly and produce typed props and state.

What about node_modules dependencies that lack types? The skills produce ambient declaration files for dependencies without types. Most popular dependencies have types available on DefinitelyTyped.

What is the biggest mistake to avoid? Trying to migrate the most complex parts first. Leaves before roots. Easy before hard. The accumulation of small wins gives you the momentum to tackle the hard parts.


Claude Code for Dependency Management: How I Stopped Being Afraid of npm Update

Nex Tools — Mon, 11 May 2026 13:13:18 +0000

Originally published on Hashnode. Cross-posted for the DEV.to community.

Every developer I know has a story about dependency hell. Mine was a Friday afternoon in 2024 when I ran npm update on a project I had inherited, and the entire test suite turned red. Not a few tests. Every single test. The lockfile diff showed 400 changed packages, and I had no idea which of those changes had broken what. I spent the rest of the weekend bisecting the upgrade manually, package by package, until I found the breaking change buried four levels deep in a transitive dependency of a transitive dependency.

That experience changed how I think about dependency management. The default workflow most teams use is to ignore dependencies until something forces an upgrade, and then panic. The panic upgrade is what happens when security patches pile up and someone finally runs npm audit fix --force at 11 PM the night before an audit. It is also when most production incidents happen, because the gap between the version that worked and the version you are jumping to is measured in months, and breaking changes accumulate silently across that gap.

I built a Claude Code workflow that turned dependency management from a periodic crisis into a routine background activity. The workflow is not glamorous. It does not involve any clever AI tricks. What it does is make the work of staying current on dependencies cheap enough that I actually do it, instead of letting it pile up until it explodes.

Here is how the workflow works and why it has saved me hundreds of hours.

Why Dependency Management Goes Wrong

Dependency management is not hard because individual upgrades are hard. Most upgrades are easy. It is hard because the work is spread across so many small decisions that no one can keep them all in their head, and every wrong decision carries a cost.

Every dependency in your project has a release cycle. Most have patch releases monthly, minor releases quarterly, major releases yearly. If you have 50 direct dependencies and 500 transitive dependencies, you are looking at thousands of version changes per year flowing into your project from the outside. Each one of them is a potential surprise.

The way most teams handle this is to ignore the firehose and react to specific events. Security alerts force upgrades. A new feature in a library forces an upgrade. A bug that blocks shipping forces an upgrade. Between those events, dependencies drift further and further out of date, and the cost of catching up grows.

Dependency management is not a project. It is a habit. The workflow that makes the habit cheap is the workflow that gets followed. The workflow that demands a half-day of focus is the workflow that gets skipped.

I needed a workflow that was cheap. Cheap enough that I would actually run it every week. Cheap enough that I would not skip it when I was busy. The Claude Code skills I built are the result of optimizing for cost-to-run, not cost-to-build.

The Audit Skill

The first skill in the workflow is an audit skill. It runs every Monday morning and produces a report on the current state of dependencies across all my projects.

The report has four sections. The first section lists outdated dependencies, sorted by how far behind they are. A package three patch versions behind is a low priority. A package five major versions behind is a flashing red light. The skill annotates each entry with the release date of the installed version and the release date of the latest version, so the gap is obvious at a glance.
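Sorting by "how far behind" needs a definition of distance. One possible definition, sketched below, counts only the most significant component that differs. It assumes plain `major.minor.patch` strings with no prerelease tags, and the helper names are hypothetical.

```typescript
interface VersionGap {
  major: number;
  minor: number;
  patch: number;
}

// Parse a plain "major.minor.patch" string; prerelease tags are not handled.
function parseVersion(version: string): number[] {
  return version.split(".").map((part) => parseInt(part, 10));
}

// How far the installed version is behind the latest one. Only the most
// significant differing component is counted: once the major differs,
// the minor and patch numbers are no longer comparable.
function versionGap(installed: string, latest: string): VersionGap {
  const a = parseVersion(installed);
  const b = parseVersion(latest);
  if (b[0] !== a[0]) return { major: b[0] - a[0], minor: 0, patch: 0 };
  if (b[1] !== a[1]) return { major: 0, minor: b[1] - a[1], patch: 0 };
  return { major: 0, minor: 0, patch: b[2] - a[2] };
}

interface OutdatedEntry {
  pkg: string;
  installed: string;
  latest: string;
}

// Sort so the packages that are furthest behind come first.
function sortByGap(entries: OutdatedEntry[]): OutdatedEntry[] {
  const rank = (e: OutdatedEntry): number => {
    const gap = versionGap(e.installed, e.latest);
    return gap.major * 1e6 + gap.minor * 1e3 + gap.patch;
  };
  return entries.slice().sort((x, y) => rank(y) - rank(x));
}
```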

The second section lists security advisories. The skill queries the security advisory database for every dependency and surfaces anything with a known vulnerability. The advisories include the severity, the affected version range, and the patched version. I see exactly what I need to upgrade and how urgent it is.

The third section lists deprecation warnings. Many packages get deprecated silently. The package still works, but the maintainer has marked it as no longer supported. The audit skill catches these before they become problems.

The fourth section lists dependencies with significant changes. Significant means breaking changes have been released, or the maintainer has been replaced, or the package has been transferred to a new owner. Maintainer and ownership changes in particular often get missed because they do not show up as version bumps.

The audit skill takes 90 seconds to run across all my projects. It produces a one-page markdown report that I can read in two minutes. The report is what drives the rest of the week's dependency work.

The Categorization Skill

Not all upgrades are equal. The categorization skill takes the audit output and assigns each entry to a category that determines how it gets handled.

The first category is critical. Critical means a security vulnerability with a high severity score, an active exploit in the wild, or a package my code depends on at runtime for something user-facing. Critical upgrades happen the day they are identified, regardless of what else is on the schedule.

The second category is high. High means a security vulnerability with medium severity, a deprecated package that needs replacement before it stops working, or a major version of a key dependency that will be needed for an upcoming feature. High upgrades happen within the week.

The third category is medium. Medium means a major version bump of a non-critical dependency, a deprecation warning that does not have an immediate impact, or accumulated minor version updates of dependencies I want to keep current. Medium upgrades happen monthly.

The fourth category is low. Low means patch versions that have not introduced any changes I care about. Low upgrades happen quarterly, batched together so the upgrade work is amortized.

The categorization is what makes the workflow tractable. Instead of treating every dependency as needing immediate attention, I have a triage system that focuses my time where it matters. The skill does the categorization based on rules I tuned over a few months. The rules are not fancy. They look at vulnerability severity, version distance, package criticality, and a few signals about the package itself.
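A simplified sketch of what such rules can look like. It covers vulnerability severity, deprecation, and version distance only. Package criticality and the other package signals are left out, and the field names and thresholds are assumptions, not the tuned rules described above.

```typescript
type Category = "critical" | "high" | "medium" | "low";
type Severity = "none" | "low" | "medium" | "high";

interface AuditEntry {
  pkg: string;
  severity: Severity; // worst known advisory, if any
  activelyExploited: boolean;
  deprecated: boolean;
  majorVersionsBehind: number;
  minorVersionsBehind: number;
}

// Rules are checked from most to least urgent; the first match wins.
function categorize(entry: AuditEntry): Category {
  if (entry.severity === "high" || entry.activelyExploited) return "critical";
  if (entry.severity === "medium" || entry.deprecated) return "high";
  if (entry.majorVersionsBehind > 0 || entry.minorVersionsBehind > 0) return "medium";
  return "low"; // only patch releases are pending
}
```

Keeping the rules as an ordered list of plain conditions makes them easy to adjust as the triage gets tuned.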

The Upgrade Skill

The upgrade skill is where the work happens. For each upgrade I need to perform, the skill produces an upgrade plan. The plan includes the specific commands to run, the changes that will be applied, the tests that need to pass, and the rollback procedure if anything goes wrong.

The most useful part of the upgrade plan is the changelog summary. The skill reads the release notes for every version between my current version and the target version, summarizes the breaking changes, and flags anything that might affect my code. If I am jumping from version 3.2 to version 4.5, the summary tells me what changed in 3.3, 3.4, 3.5, 4.0, 4.1, 4.2, 4.3, 4.4, and 4.5. The major version is highlighted because that is where breaking changes live.
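Selecting which release notes to read is a small filtering problem. A sketch, assuming the list of published versions is already known and uses plain `major.minor` strings as in the example above (`releasesToRead` is a hypothetical name):

```typescript
// Compare two "major.minor" strings; negative when a is older than b.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  return pa[0] !== pb[0] ? pa[0] - pb[0] : pa[1] - pb[1];
}

interface ReleaseToRead {
  version: string;
  isMajorBump: boolean; // first release read on a new major line
}

// Releases after `current` up to and including `target`, oldest first.
function releasesToRead(
  published: string[],
  current: string,
  target: string
): ReleaseToRead[] {
  const selected = published
    .filter((v) => compareVersions(v, current) > 0 && compareVersions(v, target) <= 0)
    .sort(compareVersions);
  let previousMajor = Number(current.split(".")[0]);
  return selected.map((version) => {
    const major = Number(version.split(".")[0]);
    const isMajorBump = major > previousMajor;
    previousMajor = major;
    return { version, isMajorBump };
  });
}
```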

The summary is not just a copy of the release notes. The skill reads my code, identifies how I use the package, and tells me which of the changes are likely to affect me. If the changelog says a function I do not use was removed, the summary deprioritizes that. If the changelog says a function I use heavily had its signature changed, the summary flags it prominently.

The flagging is the difference between a 30-minute upgrade and a 3-hour upgrade. Without the flagging, I would have to read every release note and check every change against my code by hand. With the flagging, I read a short summary and know exactly where to focus.

The Test Skill

After every upgrade, the test skill runs. Running the test suite is obvious. What the test skill adds is the intelligence about what to do when something fails.

When a test fails after an upgrade, the test skill correlates the failure with the upgrade. It looks at the test that broke, compares it to the changes in the upgraded package, and tells me whether the failure is likely caused by the upgrade or whether it is unrelated. Most of the time it is the upgrade. Sometimes the test was already flaky and the upgrade just happened to be the moment it failed. Knowing which is which saves me from a wild goose chase.

When the failure is caused by the upgrade, the test skill produces a hypothesis about what changed. The hypothesis is based on the changelog summary and the actual error. If the changelog says a function signature changed and the test fails with a type error on that function, the hypothesis is clear. If the changelog says a default behavior changed and the test fails with an assertion that depends on the default, the hypothesis is also clear.

The hypothesis is not always right. When it is wrong, I have to debug manually. But when it is right, the upgrade fix is a one-line change instead of an hour of digging.

The Rollback Skill

Some upgrades fail. Either the tests fail in ways I cannot quickly fix, or the upgrade introduces runtime behavior that breaks something not covered by tests. When that happens, I need to roll back fast.

The rollback skill takes a snapshot before every upgrade. The snapshot includes the previous lockfile, the previous package versions, and the state of any related configuration. Rolling back is a single command that restores the snapshot. Total time to roll back is under 30 seconds.

The rollback is not the end. The rollback skill also produces an analysis of why the upgrade failed and what would need to be true for the upgrade to succeed. Sometimes the answer is a small code change. Sometimes the answer is to wait for a patch release that fixes the issue. Sometimes the answer is to switch to a different package because the current path is no longer viable.

The analysis is what prevents the rollback from being a permanent retreat. Without the analysis, a failed upgrade often turns into a permanent skip. The dependency stays at the old version forever, and the gap grows. With the analysis, I have a concrete plan for when and how to try again.

The Cross-Project Skill

Most of my projects share some dependencies. When a critical update lands on a shared dependency, I need to apply it across multiple projects. The cross-project skill handles this.

The skill identifies all projects that depend on a given package, plans the order of upgrades based on which projects are most critical, and executes the upgrades in parallel where it can. The output is a single report that tells me the status of the upgrade across all projects.

The cross-project view also helps me identify which packages are good candidates for centralization. If five of my projects depend on the same internal utility package, I know I should be tracking that package carefully and consider whether the utility should live in a shared library instead.

The cross-project skill catches the case where a dependency has different versions in different projects. Version drift across projects is a subtle problem. The same bug behaves differently in different projects because they are using different versions of a shared library. The skill flags drift and proposes a unification plan.
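A drift check like this can be sketched in a few lines. The example below is an assumption-laden illustration: it compares the version ranges declared in each project's `package.json` text rather than the versions actually resolved in the lockfile, and `find_version_drift` is a hypothetical name.

```python
import json
from collections import defaultdict


def find_version_drift(manifests: dict[str, str]) -> dict[str, dict[str, list[str]]]:
    """Map package -> {declared spec -> [projects]} for packages with more than one spec.

    `manifests` maps a project name to the text of its package.json.
    """
    seen: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for project, text in manifests.items():
        data = json.loads(text)
        for section in ("dependencies", "devDependencies"):
            for package, spec in data.get(section, {}).items():
                seen[package][spec].append(project)
    # Keep only packages that are declared with at least two different specs.
    return {
        package: {spec: sorted(projects) for spec, projects in specs.items()}
        for package, specs in seen.items()
        if len(specs) > 1
    }
```

Comparing resolved versions from the lockfiles would be more precise, since two projects with the same declared range can still resolve to different versions.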

The Transitive Dependency Skill

Direct dependencies are visible. Transitive dependencies are not. Most of the packages in your node_modules are not packages you chose. They are packages your packages chose, recursively. When something goes wrong with a transitive dependency, the path from cause to effect is long.

The transitive dependency skill maps out the dependency tree and identifies hotspots. A hotspot is a transitive dependency that many of your direct dependencies depend on, which means a problem with that transitive dependency affects many things at once. The skill ranks the hotspots and tracks them like first-class dependencies, even though I never directly added them.
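The hotspot ranking described above boils down to a reachability count over the dependency graph. Here is a minimal sketch, assuming the graph is already available as a plain dictionary from package name to its dependencies; the function names and the shape of `graph` are illustrative assumptions.

```python
from collections import Counter


def reachable(graph: dict[str, list[str]], start: str) -> set[str]:
    """Return every package reachable from `start`, excluding `start` itself."""
    seen: set[str] = set()
    stack = list(graph.get(start, []))
    while stack:
        package = stack.pop()
        if package not in seen:
            seen.add(package)
            stack.extend(graph.get(package, []))
    seen.discard(start)  # guards against cycles that lead back to the start
    return seen


def rank_hotspots(graph: dict[str, list[str]], direct: list[str]) -> list[tuple[str, int]]:
    """Count, for each transitive package, how many direct dependencies pull it in."""
    counts: Counter[str] = Counter()
    for dep in direct:
        for package in reachable(graph, dep):
            if package not in direct:
                counts[package] += 1
    # Highest count first; ties are broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```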

The skill also identifies transitive dependencies that have known issues. If a transitive dependency has a security advisory, the skill traces it back to the direct dependencies that pulled it in. I get a clear picture of what I would need to change at the direct level to fix the issue at the transitive level.
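Tracing an advisory back to the direct level is a shortest-path search from each direct dependency to the affected package. The sketch below assumes the dependency graph is a dictionary of package name to dependency names; `path_to` and `trace_to_direct` are hypothetical helpers written for illustration.

```python
from collections import deque
from typing import Optional


def path_to(graph: dict[str, list[str]], start: str, target: str) -> Optional[list[str]]:
    """Breadth-first search for the shortest dependency chain from start to target."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for child in graph.get(path[-1], []):
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None


def trace_to_direct(
    graph: dict[str, list[str]], direct: list[str], vulnerable: str
) -> dict[str, list[str]]:
    """Return {direct dependency: chain} for each direct dependency that reaches `vulnerable`."""
    chains = {}
    for dep in direct:
        chain = path_to(graph, dep, vulnerable)
        if chain is not None:
            chains[dep] = chain
    return chains
```

The returned chain shows which intermediate packages sit between your manifest and the affected package, which is usually enough to decide which direct dependency to upgrade.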

This skill is the one that prevented my next dependency hell. I have caught two security issues in transitive dependencies that I would not have noticed otherwise. Both were patched within hours of detection because I knew exactly which direct dependency to upgrade.

The Lockfile Hygiene Skill

Lockfiles are easy to get wrong. They get committed incorrectly, they drift out of sync with the package manifest, and they pick up changes you never made. The lockfile hygiene skill keeps the lockfile sane.

The skill detects unexpected lockfile changes. If a commit changes the lockfile without changing the package manifest, the skill flags it for review. Most of the time the change is legitimate, but sometimes it is a sign that someone ran the package manager in a way that updated something they did not mean to update.
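The core of that check is small. Below is a minimal sketch that takes the list of files changed in a commit and flags any lockfile whose sibling manifest did not change. The `LOCK_TO_MANIFEST` table and the function name are assumptions for illustration.

```python
from pathlib import PurePosixPath

# Assumed lockfile -> manifest pairs; extend for your ecosystem.
LOCK_TO_MANIFEST = {
    "package-lock.json": "package.json",
    "yarn.lock": "package.json",
    "pnpm-lock.yaml": "package.json",
    "poetry.lock": "pyproject.toml",
    "Cargo.lock": "Cargo.toml",
}


def unexpected_lockfile_changes(changed_files: list[str]) -> list[str]:
    """Return lockfiles that changed without the manifest in the same directory changing."""
    changed = {PurePosixPath(p) for p in changed_files}
    flagged = []
    for path in sorted(changed):
        manifest_name = LOCK_TO_MANIFEST.get(path.name)
        if manifest_name and path.with_name(manifest_name) not in changed:
            flagged.append(str(path))
    return flagged
```

The list of changed files could come from something like `git diff --name-only` between the commit and its parent. A flagged file is a prompt for review, not proof of a mistake.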

The skill also detects diverged lockfiles. When two branches each modify the lockfile, the merge can resolve in ways that lose updates. The skill catches this by comparing the resolved lockfile to what it should be and flagging discrepancies.

The hygiene skill is the least exciting part of the workflow, but it is the part that prevents the silent bugs. Lockfile drift is one of those problems that produces incidents months later when nobody can figure out why the same build produces different results.

How the Skills Compose

The skills compose into a weekly rhythm. Monday morning, the audit skill runs and produces the report. I spend 10 minutes reading the report and deciding which upgrades to do this week. The categorization skill has already prioritized them, so the decision is mostly which medium-priority items to include alongside the critical and high.

Throughout the week, the upgrade skill produces plans for each upgrade. I review the plan, run the upgrade, and watch the test skill validate the result. If the tests pass, I commit. If they fail, the test skill diagnoses, and I either fix or roll back. The rollback skill makes rollback safe.

The cross-project skill kicks in for shared dependencies. The transitive dependency skill kicks in when something interesting shows up in the dependency tree. The lockfile hygiene skill runs continuously in the background.

The total time I spend on dependency management is about 90 minutes per week, spread across the week. Before this workflow, dependency management was a quarterly all-hands fire drill that consumed two days and produced incidents in the following week. Now it is a routine activity that produces no surprises.

What This Costs

The skills took about a week to build. Most of the time was spent tuning the categorization rules and the changelog summary heuristics. The skills do not require any special infrastructure. They run against the same package manager output that any developer already has.

The benefit is in the rhythm. Once you have a workflow that costs 90 minutes per week, dependencies stop being a thing you are afraid of. You upgrade things as they become available. You catch problems when they are small. You never end up six months behind on a critical dependency because the upgrade work is too daunting to start.

The benefit also shows up in production. The number of production incidents I trace back to a dependency upgrade has dropped to roughly zero. The upgrades I do are small and safe, because they are spread out and tested individually. The upgrades I used to do were large and risky, because they bundled months of changes into a single chaotic push.

What the Skills Do Not Do

The skills do not replace judgment. They produce reports, plans, and hypotheses. I am still the one who decides what to upgrade, when, and how. The skills make the decisions faster and better informed, but the decisions are still mine.

The skills also do not handle every edge case. When a dependency has been abandoned and needs replacement, the skill tells me but does not pick the replacement. When a major upgrade requires architectural changes to my code, the skill identifies the changes but does not write them. The hard parts are still hard.

What the skills do is make the easy parts trivial. The cumulative effect of trivializing the easy parts is that I have time and energy for the hard parts when they come up.

Setting Up Your Own Workflow

Start with the audit skill. It is the cheapest to build and produces the most value per hour of effort. You will get a weekly report that tells you the state of your dependencies. That alone changes how you think about them.

Add the upgrade skill next. The upgrade plans cut the time for individual upgrades by half. You will feel the difference within a week.

Add the test skill after that. The diagnosis when something breaks is where you save the most time per incident. Without it, a failed upgrade can eat hours. With it, most failures are resolved in minutes.

Build the rollback skill once you have done a few upgrades. You need the snapshots in place before you need to roll back, because trying to capture state in a panic is not reliable.

The other skills are useful but optional. The cross-project skill matters if you have multiple projects. The transitive dependency skill matters if you have a deep tree. The lockfile hygiene skill matters if you have multiple committers.

The Bigger Picture

The pattern in this workflow is the same as in every other Claude Code workflow that has worked for me. Repetitive work gets automated. Judgment-heavy work stays with the human. The automation makes the repetitive work cheap enough that it actually happens, instead of being skipped and accumulating into a crisis.

Dependency management is the canonical example. The work is repetitive. There is a lot of it. Each individual piece is small. The accumulated weight is what breaks teams. Automating the repetitive parts and triaging by judgment is the right shape of the solution.

If you have a project that has not had its dependencies looked at in six months or more, you have technical debt that is compounding silently. The way to stop the bleeding is to build a workflow that makes the maintenance cheap. The way to make it cheap is to automate the boring parts so you can focus the human time on the parts that need a human.

If you have been reading along and recognizing your own situation, the first step is to run an audit on one project. Pick the project with the most direct dependencies. See what the audit tells you. Once you see the report, you will know whether you have a manageable situation or a five-alarm fire. Either way, you are better off knowing than not knowing.

Build the audit skill. Run it weekly. Decide what to do based on the report. The rest of the workflow grows from there.

FAQ

How long does it take to build the audit skill? A few hours for a basic version. A day if you want it polished. The polished version pays for itself in the first week.

Does this work for languages other than JavaScript? Yes. The patterns translate to any ecosystem with a package manager. The audit query is different for Python or Rust or Go, but the workflow is the same.

What about monorepos? Monorepos make the cross-project skill more important and the audit skill more interesting because the report has to handle multiple packages. The basic structure is the same.

How do I get my team to adopt this? Run the audit yourself for a few weeks. Bring the reports to standups. The team will see the value when they see the reports identify real issues before they become incidents.

What is the biggest mistake to avoid? Trying to upgrade everything at once when you start. Build the workflow first. Use it to triage. Upgrade in order of priority. Resist the urge to do a giant catch-up upgrade.

If you found this useful, follow for more posts about practical Claude Code workflows. I write about how I run a multi-product business with AI agents handling most of the operational work.

Claude Code for Incident Response: How I Cut My Mean Time to Recovery in Half

Nex Tools — Sun, 10 May 2026 10:49:40 +0000

It is 3 AM. PagerDuty is screaming. Production is down. You are half-awake, half-dressed, and trying to figure out which of the 47 dashboards in your monitoring system is showing the actual problem versus a downstream symptom of the actual problem. Your team is asking what they can do to help. Customers are tweeting. The status page is still green because nobody has had time to update it.

If you have been on call for any length of time, you have lived this scene. The first 15 minutes of an incident are chaos, not because the people responding are incompetent, but because the cognitive load of an incident is much higher than the cognitive load of normal work, and humans degrade under that load in predictable ways.

I started using Claude Code during incidents because I noticed that the same patterns repeat every time. Run these queries. Check these logs. Look at these dashboards. Update the status page. Notify the right stakeholders. The patterns are predictable enough that they could be partially automated. So I automated them. The result is that my median time to recovery has dropped from 38 minutes to 17 minutes, and the incidents themselves feel less like trauma and more like a process.

Here is the workflow.

What Goes Wrong in the First 15 Minutes

The first 15 minutes of an incident are where most of the damage happens, and also where most of the recovery time gets wasted. That time is wasted not because the responder does not know what to do, but because the responder is operating at 30% of their normal cognitive capacity and has to do everything from scratch.

The things a responder needs to do in the first 15 minutes are mostly the same across incidents. They need to confirm the incident is real. They need to identify the affected service or services. They need to find the most likely cause. They need to notify stakeholders. They need to update the status page. They need to start a timeline. They need to coordinate with anyone else who has been paged. They need to begin investigation while keeping the rest of the team informed.

In a calm moment, this list is manageable. At 3 AM with PagerDuty screaming and adrenaline running, this list is overwhelming. The responder ends up doing some of these things and forgetting others, and the incident drags on while small mistakes compound.

The first 15 minutes of an incident are the highest-leverage time you have, and also the time when you are least equipped to use it well. Anything you can pre-load into automation is time you do not have to spend thinking when thinking is hardest.

Claude Code is a way to pre-load that automation. The patterns that repeat every incident can be encoded as skills. The skills run when an incident is declared, gather the information that is always needed, and present it in a format the responder can read in 30 seconds. That changes the first 15 minutes from chaos to a checklist.

The Triage Skill

The first skill I wrote is a triage skill. It runs when I declare an incident and gathers the information I always need to confirm whether the incident is real and what is affected.

The skill checks the status of every critical service by running its health check, queries the error rate for each service for the last 15 minutes and compares it to the rolling baseline, looks at the latency percentiles for each service and flags anything that has degraded, queries the deployment log for any deploys in the last hour, and checks for any infrastructure events from the cloud provider that might be relevant.

The output is a one-page summary that tells me which services are degraded, by how much, and what changed recently that might explain the degradation. The summary takes about 90 seconds to generate. It replaces about 10 minutes of manual dashboard navigation that I would otherwise do half-asleep.
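The comparison against a rolling baseline is the piece of triage that is easiest to show in code. Here is a minimal sketch, assuming the recent error rates for a service have already been fetched as a list of floats. The `Finding` type, the three-sigma threshold, and the `floor` parameter are illustrative choices, not the skill's actual rules.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Finding:
    service: str
    current: float
    baseline: float
    degraded: bool


def compare_to_baseline(
    service: str,
    current: float,
    history: list[float],
    sigmas: float = 3.0,
    floor: float = 0.001,
) -> Finding:
    """Flag `current` when it exceeds the baseline mean by more than the allowed margin.

    The margin is `sigmas` standard deviations of the history, but never less than
    `floor`, so a flat, near-zero history does not flag tiny wobbles.
    """
    baseline = mean(history)
    threshold = baseline + max(sigmas * pstdev(history), floor)
    return Finding(service, current, baseline, current > threshold)
```

Running this once per service and keeping only the findings with `degraded` set gives the "which services are degraded, by how much" portion of the summary.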

The triage skill has caught two incidents that I would have misdiagnosed without it. In one case, the alerting service paged me about a database problem. The triage skill showed that the database was fine and the actual issue was a load balancer misconfiguration that was causing the alert. In another case, the page was about a single endpoint, but the triage skill showed that the underlying issue was affecting three other services that had not paged yet. Knowing this earlier let me get ahead of the cascading failures.

The Deploy Correlation Skill

The second skill correlates the incident with recent deploys. About 60% of production incidents are caused by a recent deploy, but identifying which deploy and which change is harder than it sounds, especially in environments where multiple services deploy independently.

The deploy correlation skill queries the deployment log for the last 24 hours across all services, identifies which deploys overlap with the incident timeline, retrieves the changes included in each candidate deploy, and ranks the candidates by how likely each change is to be related to the symptoms.

The ranking uses heuristics like whether the change touches the affected service, whether it changes any code paths in the failing endpoints, whether it modifies dependencies or configuration, and whether the deploy completed shortly before the incident started. The ranking is not always right, but it is right often enough to give me a strong starting point for investigation.
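Those four heuristics map naturally onto an additive score. The sketch below is one possible shape for it. The weights, the `CONFIG_HINTS` list, the 30-minute window, and the `Deploy` record are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    service: str
    finished_at: datetime
    changed_paths: list[str]


# Illustrative weights and hints, not tuned values.
WEIGHTS = {"same_service": 3, "failing_path": 3, "deps_or_config": 2, "recent": 2}
CONFIG_HINTS = ("package.json", "requirements.txt", ".env", "config/")
RECENT_WINDOW = timedelta(minutes=30)


def score(deploy, affected_service, failing_prefixes, incident_start):
    """Add up the weight of every heuristic the deploy satisfies."""
    points = 0
    if deploy.service == affected_service:
        points += WEIGHTS["same_service"]
    if any(path.startswith(tuple(failing_prefixes)) for path in deploy.changed_paths):
        points += WEIGHTS["failing_path"]
    if any(hint in path for path in deploy.changed_paths for hint in CONFIG_HINTS):
        points += WEIGHTS["deps_or_config"]
    gap = incident_start - deploy.finished_at
    if timedelta(0) <= gap <= RECENT_WINDOW:
        points += WEIGHTS["recent"]
    return points


def rank(deploys, affected_service, failing_prefixes, incident_start):
    """Sort candidate deploys from most to least suspicious."""
    return sorted(
        deploys,
        key=lambda d: score(d, affected_service, failing_prefixes, incident_start),
        reverse=True,
    )
```

A simple additive score is easy to explain at 3 AM: each candidate's total can be broken down into the heuristics that contributed to it.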

When the deploy correlation skill identifies a likely culprit, the next question is whether to roll back. Rolling back is a high-stakes decision because it can introduce new problems and it costs time. The skill produces a rollback plan with the specific commands to run, the expected downtime, and the rollback risk assessment. I make the call, but I make it with all the relevant information in front of me, not in my head.

The Communication Skill

The third skill handles communication. Communication during an incident is critical and almost always done badly. Stakeholders need to know what is happening. Customers need to know what is happening. The status page needs to reflect reality. Internal channels need updates. The on-call engineer needs to coordinate with anyone else who is involved.

The communication skill drafts the messages. It produces a status page update appropriate for customers, a Slack message for internal channels, an email for the executive notification list if the severity warrants it, and a customer support brief for the support team to use when responding to inquiries.

Each message is drafted from a template and filled in with the specific incident details. The templates are tuned to communicate the right amount of information for each audience. Customers get plain language about what is affected and when we expect it to be resolved. Internal channels get more detail, including what has been ruled out and what is being investigated. Executives get a brief that matches the format they expect.
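Template filling of this kind needs nothing beyond the standard library. Here is a minimal sketch using `string.Template`; the template wording and the field names are invented for illustration and are not the author's actual templates.

```python
from string import Template

# Hypothetical templates, one per audience.
TEMPLATES = {
    "status_page": Template(
        "We are investigating an issue affecting $service. "
        "Next update by $next_update."
    ),
    "internal": Template(
        "[$severity] $service degraded since $started. "
        "Ruled out: $ruled_out. Investigating: $investigating."
    ),
}


def draft_messages(incident: dict[str, str]) -> dict[str, str]:
    """Fill every template from the incident details.

    Template.substitute raises KeyError when a field is missing, which is
    preferable to sending a message with a blank in it.
    """
    return {name: tpl.substitute(incident) for name, tpl in TEMPLATES.items()}
```

The drafts are plain strings, so the review-then-send step stays with the responder.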

The skill produces drafts. I review and send. The review takes 30 seconds per message, compared to several minutes of writing each message from scratch while my brain is still booting up.

The Timeline Skill

The fourth skill maintains the timeline. Every incident needs a timeline that captures what happened, when, and what was done about it. The timeline is what feeds the post-mortem, and a post-mortem with a sparse timeline is a post-mortem that misses lessons.

Capturing the timeline in real time is hard. The responder is busy responding. They make notes in Slack or in their head and intend to write up the timeline later, except later they have forgotten the details and the timeline ends up incomplete or wrong.

The timeline skill captures events automatically. It watches the incident channel and pulls out timestamped events. It watches the alert system and captures every alert fire and resolution. It watches the deploy log and captures every deploy and rollback. It produces a structured timeline that I can edit and annotate during the incident or after.
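Once events from each source carry a timestamp, building the structured timeline is a merge and a sort. The sketch below assumes each source has already been reduced to a list of events; the `Event` type and both function names are hypothetical.

```python
from datetime import datetime
from itertools import chain
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime
    source: str  # e.g. "alert", "deploy", "chat"
    text: str


def merge_timeline(*streams: list[Event]) -> list[Event]:
    """Combine events from every source into one list ordered by timestamp."""
    return sorted(chain.from_iterable(streams), key=lambda event: event.at)


def render(events: list[Event]) -> str:
    """Format the timeline as one line per event, ready to paste and annotate."""
    return "\n".join(f"{e.at:%H:%M:%S} [{e.source}] {e.text}" for e in events)
```

Keeping the source on every line makes it obvious later whether an entry came from a human note or from an automated system.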

The result is a timeline that is comprehensive without me having to do the bookkeeping. When I sit down to write the post-mortem the next day, the timeline is already there. I just need to add the narrative.

The Hypothesis Skill

The fifth skill is the one that does the most work during an incident. It is a hypothesis skill that takes the symptoms, the recent changes, and the system architecture and proposes hypotheses about what might be wrong.

The skill reads the symptom description, looks at the recent changes from the deploy correlation skill, queries the relevant logs and metrics, and produces a ranked list of hypotheses. Each hypothesis includes what it would predict about the symptoms, what evidence would confirm or refute it, and the next investigation step.

The hypothesis skill is the part of the workflow that feels most like working with a senior engineer who happens to have read every line of the codebase recently. It is not always right. The hypotheses are sometimes wrong, and the ranking is sometimes off. But it produces useful starting points faster than I can think of them on my own, and during an incident the time saved is the entire point.

The skill handles the cognitive load that I cannot reliably handle at 3 AM. It generates the hypotheses I should be considering. It identifies the evidence I should be looking for. It tells me which dashboard would confirm or refute each hypothesis. I do the actual investigation, but the framing is provided.

The Coordination Skill

The sixth skill handles coordination when multiple people are involved. Big incidents pull in multiple responders. Each responder needs to know what the others are doing. Without coordination, two people end up investigating the same thing while a third thing goes uninvestigated.

The coordination skill maintains a live document that lists who is on the incident, what each person is investigating, what has been ruled out, and what is still open. The document updates from the incident channel automatically. The responders can see at a glance who is doing what.

The skill also enforces handoff protocol. When the primary responder needs to step away, the skill produces a handoff document that captures everything the next responder needs to know to take over. The handoff document includes the current hypotheses, the evidence collected, the actions taken, and the open questions. The handoff that used to take 10 minutes of conversation now takes 2 minutes of reading.

The Post-Mortem Skill

The seventh skill writes the post-mortem. The post-mortem is the deliverable that comes out of the incident, and writing it is most teams' weakest link. Post-mortems are tedious to write, they are painful to read, and they often skip the lessons that would actually prevent the next incident.

The post-mortem skill produces a draft. It uses the timeline from the timeline skill, the hypotheses from the hypothesis skill, the actions taken from the coordination skill, and the resolution from the responders. It structures the draft using the post-mortem template the team has agreed on, with sections for what happened, what went well, what went badly, what we are going to change, and what we are not going to change.

The draft is rarely the final post-mortem. It is missing the deeper analysis that requires actual reflection on what went wrong and why. But it captures all the facts, the timeline, and the obvious lessons, so the work I have to do is the reflection rather than the bookkeeping. The post-mortem that used to take three hours now takes one hour, and the one hour is the hour where the actual learning happens.

How the Skills Compose

The skills are designed to compose during an incident. When I declare an incident, the triage skill runs immediately and gives me the lay of the land. The deploy correlation skill runs in parallel and identifies likely culprits. The communication skill produces draft messages while I am reviewing the triage output. The timeline skill starts capturing events.

As the incident progresses, the hypothesis skill generates investigation directions. The coordination skill tracks who is doing what. After resolution, the post-mortem skill drafts the writeup.

The composition is what matters. Any single skill helps a little. All the skills together transform the experience of being on call. The cognitive load drops. The mistakes drop. The mean time to recovery drops. The job becomes sustainable rather than corrosive.

What This Costs

I built the skills over about two weeks of evenings, mostly while my brain was still warm from a recent on-call rotation that had reminded me how miserable incident response can be. The initial versions were rough. I have tuned them based on what worked and what did not over the last several incidents.

The maintenance cost is low. The skills change when the system changes, but most of the patterns are stable. New runbooks get added when new failure modes are encountered. The cost of maintenance is much lower than the cost of working without the skills.

The benefit is real. My mean time to recovery has dropped by about half. The communication during incidents is consistently better. The post-mortems are more thorough because the timeline is captured automatically. On-call no longer feels like the worst week of the rotation. It feels like work that is hard but doable.

What the Skills Do Not Replace

I want to be clear about the limits. The skills do not replace the judgment of a competent on-call engineer. They produce drafts, hypotheses, and summaries. The engineer decides what to do with them.

When the skills are wrong, the engineer needs to recognize that and override them. When the situation is novel, the skills will not have a useful pattern to apply, and the engineer has to fall back on first principles. When the impact assessment is wrong, the engineer has to correct it.

The skills make the routine parts of incident response faster. They do not make the hard parts easier. The hard parts are still hard, and they still require human judgment. What the skills do is free up the cognitive bandwidth that would otherwise be spent on routine work, so the human can apply judgment where it matters.

Setting Up Your Own Workflow

If you want to build something similar for your team, the place to start is to look at the last five incidents and identify the patterns. What did the responder do every time? What information did they need to gather? What messages did they need to send? Those are the patterns that can be encoded.

Pick the one that takes the most time and automate it first. The triage skill is usually a good starting point because it is the highest leverage. Once that is working, add the next one. Build the workflow incrementally. Do not try to build everything at once, because you will not know which parts you actually need until you have used the early skills in a real incident.

The most important property of the workflow is that it actually runs during incidents. A skill that exists but does not run during the chaos of an actual incident is worthless. The way to make the skills run is to integrate them into the incident response runbook so that running them is the first step rather than an optional step. When PagerDuty fires, the responder runs the triage skill before doing anything else. That is the muscle memory you want to build.

The Bigger Picture

There is a pattern that runs through this whole approach, and it is the same pattern that runs through every other workflow I have built with Claude Code. The pattern is that high-stakes work tends to have repeatable parts and judgment-heavy parts. The repeatable parts can be automated. The judgment-heavy parts cannot. Most of the value of automation comes from removing the cognitive cost of the repeatable parts so that the human can focus on the judgment.

Incident response is high-stakes work with a lot of repeatable parts. The triage, the communication, the timeline, the post-mortem are all repeatable. The hypothesis generation has a repeatable scaffold. The coordination has a repeatable protocol. Automating these parts is not a replacement for the engineer. It is a way to make the engineer more effective when it matters most.

If you are on call for a system that you care about, the cost of building this workflow is much smaller than the cost of one bad incident. The math is overwhelming. The only thing stopping you is the time it takes to start, and the way to deal with that is to start with one skill and grow from there.

If you have read this far, you are probably someone who has been on the receiving end of a bad incident response and is looking for a way to do it better. The way is to stop trying to handle the chaos with raw human cognition and start offloading the mechanical parts to automation. Claude Code is one tool for doing this. There are others. The point is that the workflow is the answer, not the tool.

Build the workflow. Run the workflow. Improve the workflow. The next 3 AM page will go better than the last one.

FAQ

How long does it take to build this workflow? The initial set of skills takes about two weeks of evenings. You can get started with just the triage skill in a day.

Does this work for small teams? Yes. Small teams benefit even more, because they cannot afford the time cost of bad incident response.

What about incidents that are not in the patterns? The skills handle the routine 80%. The novel 20% still requires human judgment. The skills free up cognitive bandwidth so the human can focus on the novel parts.

How do I get my team to actually use this? Make running the skills the first step in the incident runbook. Update the runbook so that the very first thing the responder does is invoke the triage skill. Build the muscle memory.

What is the biggest mistake to avoid? Trying to automate the judgment parts. The skills should produce drafts, hypotheses, and summaries. The engineer decides what to do with them. Do not build skills that try to make decisions on behalf of the responder.

Claude Code for Security Audits: How I Catch Vulnerabilities Before They Cost Me

Nex Tools — Sun, 10 May 2026 10:43:50 +0000

Three years ago, a junior engineer on a team I was advising committed an environment file to a public GitHub repository. The file contained an AWS access key with admin permissions on a production account. The key was harvested by an automated scanner within four minutes of the commit. By the time the team noticed, about four hours later, an attacker had spun up 200 EC2 instances to mine cryptocurrency. The bill for those four hours was $14,000.

The team had a security checklist. The checklist included a line that said "do not commit secrets to git." The line had been on the checklist for two years. It had been read by every engineer on the team. None of that mattered, because security checklists do not run themselves, and the moment of committing a file is exactly the moment when nobody has the bandwidth to consult a checklist.

I started using Claude Code for security audits because I wanted the checklist to run itself. Not as a replacement for human review, but as the first pass that catches the obvious mistakes before they reach a human reviewer or, worse, production. Here is the workflow that has caught real vulnerabilities in real codebases.

Why Security Audits Get Skipped

Most teams have security checklists. Most teams do not run them consistently. The reason is not that engineers do not care about security. The reason is that security audits feel like a tax that gets paid out of the same time budget as shipping features, and the visible reward for shipping a feature is much higher than the visible reward for catching a vulnerability that would not have been exploited for another six months.

This math is wrong, but it feels right in the moment. The cost of a missed vulnerability is theoretical and deferred. The cost of pausing to audit is concrete and immediate. So the audit gets skipped, the vulnerabilities accumulate, and six months later somebody pays the deferred cost in cash and reputation.

The second reason security audits get skipped is that they are tedious. A real audit means reading every line of new code with a paranoid mindset. It means thinking about what an attacker could do with each input, each query, each file path. It means imagining failure modes that have not happened yet. This is exhausting work, and humans are bad at sustaining it for long stretches.

A security audit is the highest-leverage hour you can spend on a codebase, and it is also the hour engineers are least motivated to spend, because the work is invisible when it succeeds and only visible when it fails.

Claude Code does not get tired. Claude Code does not get bored. Claude Code can read every line of a diff with the same paranoid mindset on the hundredth file as on the first. That is exactly the kind of work where automation pays off.

The Pre-Commit Audit Skill

The first skill I built is a pre-commit audit. It runs on the staged diff before I commit and flags anything that looks like a security risk. The skill has a list of patterns it looks for and a list of file types it pays extra attention to.

The patterns it looks for include hardcoded credentials of any kind, calls to dangerous functions like eval and exec with user input, SQL queries built by string concatenation, file paths constructed from user input without validation, deserialization of untrusted data, and authentication checks that are missing, bypassable, or applied inconsistently.

The file types it pays extra attention to include environment files, configuration files, anything that looks like it might contain credentials, anything in an authentication or authorization module, and anything that handles user uploads.

When the skill flags something, it explains what the risk is and what the fix would look like. It does not block the commit. It just tells me what it found, and I decide whether to address the issue or proceed. Most of the time the issue is real and worth fixing. Sometimes the issue is a false positive, and I commit anyway. The skill is calibrated to err on the side of flagging too much rather than too little, because a false positive costs me 30 seconds and a missed vulnerability could cost me $14,000.

The skill caught a hardcoded API key in a test file last month. The test file was meant to use a mocked credential, but somebody had pasted a real key into the test while debugging and forgotten to remove it. The commit would have gone to a public repository. The skill flagged it before I pushed, and I cleaned it up.

The Dependency Audit Skill

The second skill audits dependencies. Modern applications include hundreds or thousands of transitive dependencies, and any of them could be compromised. The dependency audit skill cross-references my package manifest against published vulnerability databases and flags packages with known issues.

This is not a novel idea. Tools like npm audit and pip-audit do something similar. What the Claude Code version adds is context. When npm audit tells me there is a high-severity vulnerability in a transitive dependency, I have to figure out whether the vulnerable code path is actually reachable from my code, whether the fix requires a major version bump that will break things, and whether the risk is actually material to my application or just theoretical.

Claude Code reads the vulnerability description, looks at how the dependency is used in my code, and gives me an honest assessment. Sometimes the answer is "this is a real risk, fix it now." Sometimes the answer is "this vulnerability requires the attacker to control the input to a function you do not call, so it is not exploitable in your application." Sometimes the answer is "this vulnerability is real and exploitable, but the fix requires upgrading three other packages first, so you should plan a separate sprint."

The contextual assessment is the part that matters. A list of vulnerabilities is overwhelming. A prioritized list of vulnerabilities with reasoning attached is actionable.
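One way to picture the prioritized list is as a sort over findings that carry a reachability judgment and a reason. The sketch below is an assumption about how such a report could be ordered, not the skill's actual output format; the `Finding` fields and the `SEVERITY_RANK` table are hypothetical.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"critical": 0, "high": 1, "moderate": 2, "low": 3}


@dataclass
class Finding:
    package: str
    severity: str    # as reported by the vulnerability database
    reachable: bool  # assumption: judged by reviewing how the code uses the package
    reasoning: str   # the explanation attached to the judgment


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Reachable findings first, then most severe first within each group."""
    return sorted(
        findings,
        key=lambda f: (
            not f.reachable,
            SEVERITY_RANK.get(f.severity, len(SEVERITY_RANK)),
        ),
    )
```

Under this ordering, a reachable moderate-severity issue sorts above an unreachable critical one, which is the kind of reordering a raw scanner list does not give you.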

The Authentication Flow Skill

The third skill audits authentication and authorization flows. This is the highest-stakes area of most applications and the area where mistakes are most likely to happen, because authentication code looks similar across applications and engineers tend to copy patterns from previous projects without checking whether the patterns still apply.

The authentication audit skill looks at every endpoint and asks: who is allowed to call this endpoint, and how is that enforced? It traces the authentication middleware, looks at the authorization checks, and verifies that the checks are present, correct, and not bypassable.

Common issues the skill catches include endpoints that are missing authorization checks entirely, endpoints where the authorization check uses the wrong identifier, endpoints where the authorization check happens after a side effect has already occurred, endpoints where the authorization logic is correct in one place but wrong in another, and endpoints where the authorization can be bypassed by malformed input.
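The "check happens after a side effect" case is easy to miss in review because both versions contain the same check. A minimal Python sketch, with a plain dict standing in for the data store and hypothetical function names:

```python
class Forbidden(Exception):
    """Raised when the caller is not allowed to perform the action."""


def delete_document_unsafe(documents: dict, caller_id: str, doc_id: str) -> None:
    # Bug: the side effect runs first, so the check below cannot undo it.
    doc = documents.pop(doc_id)
    if doc["owner_id"] != caller_id:
        raise Forbidden(doc_id)


def delete_document(documents: dict, caller_id: str, doc_id: str) -> None:
    # Check against the document's owner before touching any state.
    doc = documents[doc_id]
    if doc["owner_id"] != caller_id:
        raise Forbidden(doc_id)
    del documents[doc_id]
```

Both functions raise `Forbidden` for the wrong caller, so a test that only asserts on the exception passes for either one. Only the second leaves the document in place.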

I run this skill against every authentication-related PR. It has caught issues that would have shipped to production in two of the last twelve PRs. Both issues were the result of an engineer copying a pattern from a different endpoint without realizing that the new endpoint had different authorization requirements. Both would have been hard to catch in code review because the code looked correct.

The Secrets Scan Skill

The fourth skill scans the entire repository for secrets. This is more aggressive than the pre-commit audit, which only looks at the staged diff. The secrets scan looks at every file, every commit in the history, and every branch.

The skill looks for high-entropy strings that match known credential patterns, environment files that have been committed even if they are now gitignored, hardcoded passwords in test data, API keys in documentation examples, and credentials embedded in deployment scripts.
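"High-entropy" usually means Shannon entropy over the characters of a token. The sketch below shows that calculation in Python; the length and entropy thresholds and the token regex are assumptions chosen for illustration, and real scanners tune them per character set and combine them with known credential prefixes.

```python
import math
import re
from collections import Counter

# Assumed thresholds for illustration only.
MIN_LENGTH = 20
MIN_ENTROPY_BITS = 4.0
TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{%d,}" % MIN_LENGTH)


def shannon_entropy(text: str) -> float:
    """Bits of entropy per character, based on character frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def candidate_secrets(line: str) -> list[str]:
    """Long tokens whose character distribution looks random."""
    return [t for t in TOKEN.findall(line) if shannon_entropy(t) >= MIN_ENTROPY_BITS]
```

Entropy alone produces false positives on hashes, UUIDs, and minified assets, so it works best as a first filter ahead of a closer look at each candidate.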

When the skill finds something in git history, the fix is more involved than just removing the file. The credential needs to be rotated, because anyone who cloned the repository while the credential was visible could still extract it. Then the history needs to be cleaned up, which requires a force push and coordination with everyone who has the repository checked out.

The skill produces a report with the findings sorted by severity and a runbook for each finding that explains what to do. The runbook includes the rotation procedure, the history cleanup procedure, and a list of stakeholders to notify. This is the kind of detail that a generic secrets scanner does not include, and it is the part that turns a finding into a fix.

The Input Validation Skill

The fifth skill audits input validation across the application. The skill identifies every place where the application accepts external input and verifies that the input is being validated before it is used.

External input includes HTTP request parameters, file uploads, environment variables, configuration files loaded at runtime, message queue payloads, and data read from third-party APIs. Each of these is a place where untrusted data enters the system, and each needs to be validated before it is used in a sensitive operation.

The skill looks for input that flows into database queries, file system operations, command execution, deserialization, template rendering, and HTTP requests to other services. For each flow, the skill verifies that the input has been validated against an explicit schema and rejected if it does not match.

The most common issue the skill catches is input that is validated in one path and not in another. An engineer adds a new endpoint that calls an existing function. The existing function assumes its input has already been validated, because the original caller validated it. The new endpoint does not validate, because the engineer assumed the function would handle it. The result is an injection vulnerability that could not have been caught by reading either function in isolation.
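One common defense against the "validated in one path, not in another" problem is to make the shared function accept only a type that cannot be constructed without passing validation. This is a sketch of that idea, not something the article's skill prescribes; `ValidUsername`, the `USERNAME` rule, and `find_user` are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed rule for the example: 3 to 32 lowercase letters, digits, or underscores.
USERNAME = re.compile(r"[a-z0-9_]{3,32}")


@dataclass(frozen=True)
class ValidUsername:
    """A username that has already passed validation."""

    value: str

    def __post_init__(self):
        if not isinstance(self.value, str) or not USERNAME.fullmatch(self.value):
            raise ValueError("invalid username")


def find_user(db: dict, username: ValidUsername):
    # The shared function takes the validated type, not a raw string, so a
    # new endpoint has to validate before it can call this. A type checker
    # flags a raw str here, and at runtime a str has no .value attribute.
    return db.get(username.value)
```

With this shape the assumption "the caller already validated" is written into the signature instead of living in the original author's head.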

The Configuration Audit Skill

The sixth skill audits configuration. Configuration is where security defaults turn into security disasters, because configuration changes do not go through the same review as code changes and the people who make them often do not understand the implications.

The configuration audit skill looks at infrastructure as code, deployment manifests, environment configuration, feature flag definitions, and any file that controls how the application behaves at runtime. It checks for common misconfigurations like overly permissive IAM policies, public S3 buckets that should be private, security groups that allow access from anywhere, debug mode enabled in production, default credentials that have not been changed, and encryption disabled where it should be enabled.

The skill is calibrated for the specific cloud provider and infrastructure stack I use, so it understands the difference between a configuration that is correct for development and one that would be a disaster in production. When it flags something, it tells me whether the issue is hypothetical or material, and what the fix looks like.
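As a rough sketch of the development-versus-production distinction, the Python function below checks an already-parsed configuration dict. The keys (`debug`, `ingress`, `encryption_at_rest`) and the rule that port 443 may be open to the internet are assumptions for this example; map them to whatever your manifests and policies actually contain.

```python
def audit_config(config: dict, environment: str) -> list[str]:
    """Return human-readable findings for a parsed deployment config."""
    findings = []
    is_prod = environment == "production"

    # Debug mode is acceptable in development, not in production.
    if is_prod and config.get("debug", False):
        findings.append("debug mode enabled in production")

    # Assumption: only HTTPS (443) may be reachable from anywhere.
    for rule in config.get("ingress", []):
        if rule.get("cidr") == "0.0.0.0/0" and rule.get("port") != 443:
            findings.append(f"port {rule.get('port')} open to the internet")

    if is_prod and not config.get("encryption_at_rest", False):
        findings.append("encryption at rest disabled in production")

    return findings
```

The same config can produce no findings for `development` and several for `production`, which is the environment-aware behavior described above.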

How the Skills Compose

The skills are designed to compose. I run the pre-commit audit on every commit. I run the dependency audit weekly. I run the authentication flow audit on every PR that touches auth-related code. I run the secrets scan monthly across the full history. I run the input validation audit on any PR that adds new endpoints. I run the configuration audit before any deployment to production.

This composition is the part that matters. A single audit run catches the issues that are present at one moment. A continuous audit pipeline catches issues as they are introduced, before they accumulate into a backlog that nobody has time to address.

The pipeline has a meta-rule attached. If any audit flags something at high severity, the relevant deployment is blocked until the issue is addressed or explicitly waived. The waiver requires a written explanation of why the issue is acceptable, which goes into a record that gets reviewed periodically. This means that when an issue is waived, it is waived deliberately, not by accident.
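The meta-rule can be expressed as a small gate function. This is a minimal sketch under assumed data shapes: `AuditFinding` and the `waivers` mapping (finding id to written explanation) are hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass
class AuditFinding:
    id: str
    severity: str  # assumed levels: "low", "medium", "high"


def deployment_blocked(findings: list[AuditFinding], waivers: dict[str, str]) -> bool:
    """Block if any high-severity finding lacks a written waiver.

    An empty or whitespace-only explanation does not count as a waiver.
    """
    for finding in findings:
        if finding.severity != "high":
            continue
        if not waivers.get(finding.id, "").strip():
            return True
    return False
```

Requiring non-empty text is what keeps a waiver deliberate: the gate cannot be opened by adding a finding id with a blank explanation.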

What the Skills Do Not Catch

I want to be honest about the limits. The skills catch the kind of issue that has a known pattern and shows up in a recognizable shape. They do not catch novel vulnerabilities, business logic flaws, or issues that require deep understanding of the application's threat model.

Examples of what the skills miss include race conditions in business logic that allow value extraction, authorization checks that are technically correct but enforce the wrong policy, side channels that leak information through timing or error messages, and chained vulnerabilities where each individual issue is low severity but the combination is high severity.

For these classes of issue, you still need human review. What the skills do is reduce the volume of low-hanging issues so that human review can focus on the hard problems. If a human reviewer spends 80% of their time on the security equivalent of catching missing semicolons, they have 20% left for the issues that actually require their judgment. Flip that ratio, and the audit becomes valuable.

Setting Up the Skills

If you want to build something similar, the structure is straightforward. Each skill is a markdown file that describes what to look for, what to flag, and how to format the report. The skill reads the relevant inputs, looks for the patterns, and produces a report.

The skills are stored alongside the codebase and version-controlled. When the codebase changes in a way that affects the security model, the skills change too. When a new attack surface is added, a new skill is added. When an existing skill produces too many false positives, it is tuned. The skills are living documents, not a one-time setup.

The most important thing is to run the skills consistently. A skill that runs every commit catches issues. A skill that runs once a quarter catches a backlog. The whole point of automation is to remove the human decision about whether to run the audit, and that only works if the audit runs every time.

What This Workflow Costs

The skills took about a day to write initially. Tuning them took another two days spread over the first month, as I saw which patterns produced false positives and which patterns missed real issues. Maintenance takes about an hour a month.

The time saved is harder to measure, because the value of catching a vulnerability is the cost of the breach that did not happen, and you cannot measure something that did not happen. What I can measure is that I no longer skip security audits, because the cost of running them is now measured in seconds rather than hours. The audits have caught real issues that would have shipped to production. The cost-benefit math always favored running the audits. The difference is that now the audits actually get run.

The Bigger Pattern

There is a bigger pattern here that goes beyond security audits. The pattern is that any kind of work that is high-stakes and tedious tends to get skipped, and the skipping accumulates costs that show up later. Code review skipped because it is tedious leads to bugs. Documentation skipped because it is tedious leads to onboarding pain. Security audits skipped because they are tedious lead to breaches.

The pattern for fixing this is the same in each case. Find the part of the work that is mechanical and automate it. Use the time saved to do the part that requires human judgment. Refuse to skip the work entirely, because the math is overwhelming if you account for the deferred costs.

Claude Code is a tool for executing this pattern. It is not a replacement for engineering judgment. It is a way to make sure the tedious 80% of the work gets done so that the engineering judgment can be applied to the 20% that needs it.

If you want to apply this pattern to your own codebase, the place to start is to pick one audit skill and run it. Pick the one that matches your biggest current risk. If you have ever committed a secret, start with the secrets scan. If your authentication is complex, start with the auth flow audit. If your dependency tree is deep, start with the dependency audit. Run it once. See what it finds. Fix what is real. Then schedule it to run continuously.

The first audit will probably find issues that have been sitting in your codebase for months. That is normal. The second audit will find fewer. By the third or fourth iteration, the audit becomes a regular checkpoint rather than a fire drill, and that is when the workflow starts paying back the time you put into it.

FAQ

How do I get started? Pick one audit skill that matches your biggest risk. Write a markdown file that describes what to look for and what to flag. Run it on your codebase. Tune it based on the results.

Do I still need professional security testing? Yes. The audit skills catch the patterns that are easy to encode. They do not catch novel vulnerabilities or business logic issues. Use them as the first line of defense, not the only line.

What about false positives? False positives are a cost. The way to reduce them is to tune the patterns, narrow the scope, and add suppression rules for known-safe cases. Aim for high precision over high recall on issues that block deployment.

How often should I run the audits? Pre-commit audits should run on every commit. Dependency audits weekly. Secrets scans monthly. Configuration audits before every production deployment.

Will this work for my language and framework? The pattern works for any language. The specific patterns depend on the language and framework. Customize the skills for your stack.

If you found this useful, follow for more posts about practical Claude Code workflows. I write about how I run a multi-product business with AI agents handling most of the operational work.

Claude Code for Documentation Generation: How I Stopped Shipping Code Nobody Could Read

Nex Tools — Fri, 08 May 2026 09:06:52 +0000

Originally published on Hashnode. Cross-posted for the DEV.to community.

Two years ago I inherited a project from an engineer who had left the company. The codebase was clean. The test coverage was reasonable. The architecture was defensible. The documentation was a single README that said "TODO: write docs." There were 200 commits, three deployment environments, a set of cron jobs, and a database schema with 47 tables. None of it was documented.

I spent six weeks figuring out how the system worked before I felt comfortable making changes. Six weeks. The original engineer had probably written the whole thing in three months. I lost the equivalent of nearly half of his entire build time to the absence of a document he could have written in an afternoon.

That experience changed how I think about documentation. Documentation is not a nice-to-have that you write when you have time. Documentation is a force multiplier for everyone who comes after you, and the math for whether it is worth writing is almost always overwhelming. The reason most teams ship without documentation is not that the math is bad. It is that writing documentation is tedious, and the people who would benefit from it are not in the room when the decision is made.

Claude Code changed this for me. Documentation that used to take an afternoon now takes 15 minutes. Documentation that I would have skipped because the cost was too high now gets written because the cost is trivial. Here is the workflow.

Why Documentation Goes Unwritten

Most engineers do not skip documentation because they think it is unimportant. They skip it because the cost feels disproportionate to the benefit at the moment they would have to write it. You just shipped a feature. You are tired. The next feature is already lined up. The documentation is for some hypothetical future engineer who probably will not need it. You skip it.

Six months later, you are that engineer. You stare at the code you wrote and try to remember why a particular decision was made. You cannot. You spend an hour reverse engineering your own thinking. The cost was real. It was just deferred.

The second reason documentation goes unwritten is that the kind of documentation engineers can write quickly is the kind of documentation that nobody reads. Inline comments are easy and largely useless. JSDoc blocks that restate the function signature are easy and largely useless. The documentation that actually helps people is the documentation that captures intent, context, and tradeoffs. That kind of documentation is hard to write because it requires you to step out of implementation mode and think about what someone else would need to know.

The third reason documentation goes unwritten is that there is no obvious place to put it. Should it go in the code as comments? In a docs folder as markdown? In a wiki? In a knowledge base? Each option has tradeoffs and most teams pick one and then regret it later. The friction of figuring out where the documentation belongs is enough to make people skip writing it.

The cost of documentation feels high in the moment of writing it and low when reading it. The cost of missing documentation feels low in the moment of skipping it and high every time someone has to reverse engineer the missing context.

Claude Code does not change the math on whether documentation is worth writing. The math was always overwhelming. Claude Code just makes the writing fast enough that the in-the-moment cost stops being a barrier.

The Module Documentation Skill

When I finish a module, I run the module documentation skill. The skill takes the module source code and produces a markdown document with the following sections.

The first section is what this module does, written in two to four sentences. Not what each function does. What the module as a whole accomplishes. This is the section that future engineers read first to decide whether this module is the one they need to be looking at.

The second section is the public interface. What can callers do with this module? What are the inputs and outputs? What are the error conditions? This section is what Claude Code generates well from code, because the public interface is mostly mechanical.

The third section is the design choices. Why was this module structured this way? What alternatives were considered? What tradeoffs were made? This section is the one that requires actual thought, and it is the one Claude Code does not generate automatically. I write this section as a prompt for Claude Code to fill in based on context I provide. Sometimes I dictate a paragraph and ask Claude Code to clean it up. Sometimes I ask Claude Code to read the code and propose what the design choices probably were, which I then correct.

The fourth section is the gotchas. What surprised me about this module? What is non-obvious? What edge cases caused bugs that I had to fix? This section is the most valuable for future maintenance and the easiest to forget to write, because the gotchas seem obvious to me right after I have just dealt with them.

The fifth section is the change history. Major versions, the reasons for them, and links to the PRs. This is what tells future engineers whether the current behavior is the original intent or a deliberate departure from it.

The skill produces a draft of all five sections. I review the draft, fix the parts Claude Code got wrong, fill in the parts Claude Code could not infer, and commit the file alongside the module. The whole process takes 15 minutes for a module that took me a day to write. Fifteen minutes of documentation for a day of implementation is a ratio I can sustain.

The README Skill

Every repository should have a README that someone unfamiliar with the project can read in five minutes and walk away with a working mental model. Most repositories do not have this README. They have either a stub README that says "this is the [project name] repository" or a sprawling README that tries to be comprehensive and ends up being unreadable.

The README skill takes the repository structure, the package configuration, the recent commit history, and any existing documentation, and produces a draft README with these sections.

A one-paragraph description of what the project is and who it is for. The audience matters more than the description. A README that does not tell me whether I am the intended audience is a README I will skim and forget.

A quick start guide that walks through the most common setup path. Not every possible setup path. The one that 80 percent of new contributors will use. The other paths can have their own dedicated documentation pages.

A high-level architecture overview. Three to five sentences about the major components and how they fit together. This is the section that helps somebody figure out where to look when they want to make a change.

A pointer to the deeper documentation. The README is a starting point, not a comprehensive guide. It should make it easy to find the deeper material when the reader needs it.

A contribution guide. How are issues tracked? What is the PR process? What conventions does the team follow? This section is what makes the difference between a repo that strangers can contribute to and a repo where strangers bounce off without contributing.

The skill produces a complete first draft. I edit it, sometimes substantially, and commit. The README that used to take a half day to write now takes 30 minutes including my edits. More importantly, the README actually exists, which is a meaningful improvement over the previous baseline.

The Claude Code memory files workflow is what makes the README skill produce useful output instead of generic boilerplate. Claude Code reading the project context once and remembering it across documentation tasks is what changes the output quality.

The Architecture Decision Record Skill

Some decisions deserve a permanent written record. Not every decision. The decisions where future engineers might wonder "why did we do it this way" and where the answer is non-obvious. Architecture Decision Records (ADRs) are the standard format for this kind of documentation, and they are profoundly underutilized.

The ADR skill takes a brief description of a decision, the context that led to it, the alternatives considered, and the tradeoffs accepted, and produces a properly formatted ADR. Each ADR has a number, a title, a status, a date, the context, the decision, the consequences, and the alternatives.

The reason ADRs are underutilized is that the format feels heavyweight relative to the value of any individual decision. Engineers think "this decision is not big enough to deserve an ADR" and so the ADR does not get written. Six months later the decision turns out to have been bigger than they thought, and now there is no record.

The skill changes this calculus. Writing an ADR no longer takes 30 minutes. It takes five. The threshold for "big enough to deserve an ADR" can drop accordingly. I now write ADRs for decisions I would have left undocumented two years ago, and the ADRs are paying off in conversations where I can point to the document instead of trying to reconstruct the reasoning.

The format I use:

# ADR 042: Use cursor pagination for the orders API

Status: Accepted
Date: 2026-04-15

## Context
The orders API returns lists of orders to mobile clients. Order
volume is high enough that offset pagination causes issues at
high page numbers (slow queries, inconsistent results across
pages when new orders are inserted).

## Decision
Use opaque cursor-based pagination. Cursors are base64-encoded
JSON containing the last-seen order id and timestamp.

## Consequences
- Clients cannot jump to arbitrary pages, only navigate forward
- Cursors are stable across data changes
- Cursor format is not part of the public contract and may change
- Migration from offset pagination requires a deprecation window

## Alternatives considered
- Offset pagination: rejected due to performance and consistency
- Keyset pagination with exposed keys: rejected due to leaking
  internal id format to clients
- Time-based pagination: rejected because orders within the same
  millisecond can collide

This format is short enough that writing it does not feel like a chore. It is structured enough that future readers can find the parts they care about quickly. The skill produces drafts in this format from a brief verbal description of the decision.
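For readers who have not built the kind of cursor the example ADR describes, here is a minimal Python sketch of base64-encoded JSON carrying the last-seen order id and timestamp. The field names `id` and `ts`, the URL-safe base64 variant, and the function names are assumptions; the ADR only specifies the general shape.

```python
import base64
import json


def encode_cursor(last_id: int, last_timestamp: str) -> str:
    """Pack the last-seen order id and timestamp into an opaque cursor."""
    payload = json.dumps({"id": last_id, "ts": last_timestamp}, separators=(",", ":"))
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_cursor(cursor: str) -> tuple[int, str]:
    """Unpack a cursor; raise ValueError on anything malformed."""
    try:
        payload = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
        return int(payload["id"]), str(payload["ts"])
    except (ValueError, KeyError, TypeError) as exc:
        # Cursors come from clients, so treat them as untrusted input.
        raise ValueError("invalid cursor") from exc
```

Base64 here is encoding, not protection: a client can decode the cursor and read the id. If hiding the internal id format matters, as the rejected keyset alternative in the ADR suggests, the payload would need to be encrypted or signed as well.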

The API Documentation Skill

API documentation is its own discipline. Module documentation tells you how a piece of code works internally. API documentation tells you how to call a piece of code from outside it. The two have different audiences and different requirements.

I covered API documentation in detail in my Claude Code for API design article. The short version is that API documentation should be generated from specifications, not from code, and the specifications should be written before the code. Claude Code makes both halves of that workflow practical.

The relevant skill for this article is the one that takes existing code that does not have specifications and reverse-engineers documentation from it. This is what you do when you inherit an undocumented API and need to bootstrap documentation without rewriting everything from scratch.

The skill reads the route handlers, the request validation, the response shapes, and the tests, and produces a draft specification document for each endpoint. The draft is incomplete because the code does not always tell you the full story. Authentication requirements might be enforced by middleware that is not visible in the route handler. Idempotency behavior might be implicit in the database constraints. Error responses might depend on conditions the code only handles indirectly.

I review the drafts and fill in the gaps. The drafts get me 70 percent of the way there. Closing the last 30 percent is the part that requires my judgment. But starting from a 70 percent draft is dramatically faster than starting from nothing.

The Tutorial Skill

Reference documentation tells you what is possible. Tutorials tell you how to actually do something useful. Most projects have reference documentation and no tutorials, which is why most projects have a steep onboarding curve.

The tutorial skill takes a goal ("connect this service to a Postgres database with TLS," "set up authentication with custom JWT claims," "deploy this service behind a load balancer") and produces a step-by-step tutorial with code examples, explanations, and troubleshooting tips for the common failure modes.

The tutorials are not autogenerated content with empty filler. They are actual narratives that walk a reader from a starting state to a completed setup, with the reasoning visible at each step. The skill produces these narratives by reading the code, the existing documentation, and the issue tracker (where troubleshooting tips often live as resolved tickets).

I edit the tutorials before publishing. Sometimes I add screenshots. Sometimes I correct steps that Claude Code got slightly wrong because the documentation was outdated. But the structure is sound and the content is mostly correct, which is what matters. Tutorials I would not have written because the cost was too high now exist because the cost is trivial.

If you are starting a new project and want documentation built into the workflow from day one, the CLAUDE.md context file pattern is how you make Claude Code understand your project well enough to produce documentation that does not feel generic.

The Inline Comment Skill

Inline comments are a paradox. Most inline comments are noise. Comments that restate what the code already says are worse than no comments because they take up space and rot when the code changes. But the inline comments that explain non-obvious decisions are gold. The trick is writing the second kind without writing the first.

The inline comment skill reads code and proposes inline comments only for the lines where context is genuinely missing. Hidden constraints. Subtle invariants. Workarounds for specific bugs. Behaviors that would surprise a reader. Things that an engineer reading the code six months from now would wonder about.

The skill is conservative by design. If it is not sure that a comment adds value, it does not propose one. The proposed comments are short, factual, and focused on the why rather than the what.
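The difference between the two kinds of comment is easiest to see on the same line of code. In this Python sketch the function and the gateway constraint it mentions are both hypothetical, invented only to contrast a "what" comment with a "why" comment.

```python
def retry_delay_seconds(attempt: int) -> int:
    # Noise (restates the code):
    #   Return 2 to the power of attempt, capped at 30.

    # Useful (explains a constraint the code cannot show):
    #   Cap at 30 seconds because the upstream gateway in this example
    #   drops requests that wait longer than that, so a bigger delay
    #   guarantees a failed retry.
    return min(2 ** attempt, 30)
```

The first comment goes stale the moment someone changes the constant. The second tells the next reader what they would break by changing it.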

I review the proposals and accept the ones that make sense. Usually I accept three or four out of every ten proposed. The rest I either reject (the comment was redundant) or modify (the comment had the wrong emphasis). The result is that the code has comments where comments are useful and is comment-free where comments would be noise.

This is the kind of detail work that I would never have time to do manually but that meaningfully improves the readability of code I revisit months later.

The Changelog Skill

Changelogs are documentation that nobody writes and everybody wants. Users want to know what changed in the version they just upgraded to. Maintainers want to remember why they made certain changes when they look back at the version history. Both groups are usually disappointed.

The changelog skill takes the commit history between two release tags and produces a human-readable changelog with sections for new features, improvements, bug fixes, breaking changes, and deprecations. The classification is based on the commit messages and, when those are inadequate, the actual code changes.

The skill is not magic. It cannot tell you which changes are exciting and which are boring. But it can produce a complete first draft that captures the structural changes accurately. I edit the draft to add commentary, group related changes, and highlight the things users actually care about. The whole process takes 20 minutes per release. Without the skill, it would take two hours, which is why I used to skip it.
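The message-based part of the classification can be sketched in a few lines of Python. This example assumes Conventional Commits-style prefixes such as `feat:` and `fix:`, which the article does not say it uses; the `deprecate` prefix and the section mapping are also assumptions. Commits that do not match land in an "Unclassified" bucket, which is where reading the actual code changes would take over.

```python
import re
from collections import defaultdict

# Assumed mapping from commit type to changelog section.
SECTION_BY_PREFIX = {
    "feat": "New features",
    "perf": "Improvements",
    "refactor": "Improvements",
    "fix": "Bug fixes",
    "deprecate": "Deprecations",
}
PREFIX = re.compile(
    r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<breaking>!)?:\s*(?P<summary>.+)$"
)


def draft_changelog(messages: list[str]) -> dict[str, list[str]]:
    """Group commit subjects into changelog sections by their prefix."""
    sections = defaultdict(list)
    for message in messages:
        lines = message.strip().splitlines()
        if not lines:
            continue
        subject = lines[0]
        match = PREFIX.match(subject)
        if not match:
            sections["Unclassified"].append(subject)
            continue
        if match["breaking"] or "BREAKING CHANGE" in message:
            section = "Breaking changes"
        else:
            section = SECTION_BY_PREFIX.get(match["type"], "Unclassified")
        sections[section].append(match["summary"])
    return dict(sections)
```

The output is a structural draft only. Grouping related changes and deciding what users care about is still the editing pass described above.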

The Cost of This Workflow

The total time investment to set up the module documentation, README, ADR, API documentation, tutorial, inline comment, and changelog skills was about two days. Most of that was iterating on the prompts to produce output I trusted. The ongoing cost is essentially zero. The skills run as part of my normal development flow.

The benefit is that the projects I work on now have documentation. Not perfect documentation. Not comprehensive documentation. But the kind of documentation that makes a difference for the next engineer who has to work on the codebase. The README explains what the project is. The module documentation explains how the modules work. The ADRs capture the major decisions. The tutorials cover the common workflows. The changelog tracks the releases. The inline comments illuminate the non-obvious lines.

The six weeks of context recovery I went through on the project I inherited two years ago would not have happened with this workflow. The original engineer would have run the skills as part of finishing the project, the documentation would have been thorough enough for me to onboard in days rather than weeks, and the company would have gotten back roughly five weeks of my time that it instead spent on me reading code.

The Bottom Line

Documentation is a leverage activity that most engineers skip because the in-the-moment cost feels too high. The cost was always lower than the benefit. Claude Code makes the cost actually low, which removes the last excuse for skipping it.

If you have ever inherited an undocumented codebase, you know how much time gets lost to the absence of context. The engineers who came before you were not lazy or careless. They were busy and the documentation was the thing they could safely skip. Claude Code removes "safely skip" as an option by making documentation cheap enough that there is no longer a reason to skip it.

If this resonates and you want to build a documentation pipeline into your team's workflow, the Claude Code skills guide shows how to package these workflows so that every engineer on the team gets the leverage automatically. The hardest part of documentation is making it routine. Skills make it routine.

The codebases I am proudest of are the ones future engineers will actually be able to read. Claude Code is what makes that possible.

Claude Code for API Design: How I Stopped Shipping Endpoints I Regret Six Months Later

Nex Tools — Fri, 08 May 2026 09:00:57 +0000

Originally published on Hashnode. Cross-posted for the DEV.to community.

The first public API I designed had 47 endpoints. Eight months later, 31 of them were either deprecated, broken, or quietly ignored by the only client that ever consumed them. Two were so badly named that we shipped a v2 just to rename them. One returned a different shape depending on which day of the week you called it, because of a bug nobody had caught in code review. The whole thing was a monument to what happens when an engineer designs an API by writing endpoints in the order they get requested.

That was 2021. Since then I have shipped four more public APIs and a handful of internal ones. Of the four public ones, I actually like three, and the fourth is fine. None of them has needed the embarrassing mid-stream redesigns that the first one did. The difference is not that I got smarter. The difference is that I stopped designing APIs by typing route handlers and started designing them by having a conversation with Claude Code about what the API is for.

This is a workflow article, not a theory article. I am going to walk you through how I use Claude Code to design APIs from a blank slate, how I review existing APIs for design problems before they ship, and how I evolve APIs without breaking the clients that depend on them. The patterns are language- and framework-agnostic. I have used them for REST, GraphQL, and gRPC services. The tooling matters less than the discipline.

Why API Design Goes Wrong

Most APIs that age badly share a common failure mode. The team designs the API by following the immediate request pattern. The first client wants a way to fetch users, so we add GET /users. The second client wants a way to fetch a user by id, so we add GET /users/:id. The third client wants to filter users by status, so we add GET /users?status=active. Six months later we have 40 endpoints, three different ways to filter, two different pagination strategies, and an inconsistent response envelope.

The problem is not that any individual decision was wrong. Each endpoint, viewed in isolation, made sense at the time it was added. The problem is that nobody designed the API as a whole. The API emerged from accumulated requests, and emergent designs almost always have rough edges.

The second failure mode is designing for the present implementation instead of the future contract. The team exposes the database schema directly because that is what the implementation looks like today. Six months later the database schema changes and now the API has to either change too (breaking clients) or include a translation layer that nobody has time to maintain. Either way, somebody is unhappy.
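
As a rough illustration, the translation layer can be as small as one mapping function between the storage row and the public resource. The sketch below assumes hypothetical column names (user_pk, first_nm, last_nm, is_active) and a hypothetical UserResource shape; none of them come from a real schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UserResource:
    """Public contract: the shape clients are promised (hypothetical)."""
    id: str
    display_name: str
    status: str


def to_user_resource(row: dict) -> UserResource:
    """Map a storage row with made-up column names onto the public shape."""
    return UserResource(
        id=str(row["user_pk"]),
        display_name=f'{row["first_nm"]} {row["last_nm"]}'.strip(),
        status="active" if row["is_active"] else "inactive",
    )


row = {"user_pk": 42, "first_nm": "Ada", "last_nm": "Lovelace", "is_active": 1}
resource = asdict(to_user_resource(row))
```

If the columns are renamed later, only to_user_resource changes. The resource that clients see stays the same.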

The third failure mode is the optimistic naming problem. The team names the endpoint after what it does today, not what it represents conceptually. POST /sendWelcomeEmail becomes a problem the moment the product team decides welcome emails should sometimes be SMS messages. Now you have an endpoint named after a transport when the actual concern is the welcome flow.

APIs are contracts. Contracts are about what they promise, not how they are currently fulfilled.

I have made all three of these mistakes more than once. The reason I have stopped making them is that Claude Code now flags them before they ship.

The Design Conversation Skill

Before I write any route handlers, I have a design conversation with Claude Code about what the API is for. This is not a casual chat. It is a structured process that produces a markdown document I commit to the repo before the first endpoint exists.

The conversation has five sections.

The first section is the actor inventory. Who calls this API, and what are they trying to accomplish? Not what features they need, but what jobs they are trying to do. A mobile app trying to render a user profile is doing a different job than a backend service trying to validate a webhook signature, even if both involve the same user record. Most APIs are easier to design when you know the jobs first.

The second section is the resource inventory. What are the nouns this API talks about? Not the database tables. The conceptual nouns. Sometimes a database has six tables but the API only has two resources because the other four are implementation details. Sometimes the database has one table but the API has three resources because what looks like one entity to the storage layer is three different concepts to the consumer.

The third section is the operation inventory. For each resource, what operations does the API support? Create, read, update, delete are the obvious ones, but most APIs need more. Listing with filters. Bulk operations. State transitions that are not just updates. Search. Subscriptions. Each one needs to be explicit so that nothing surprises us later.

The fourth section is the consistency rules. How does this API handle pagination? Errors? Idempotency? Versioning? Authentication? These are the cross-cutting concerns that, if left ad-hoc, end up inconsistent across endpoints. Decide them once and document them.

The fifth section is the explicit non-goals. What is this API not for? Which use cases are out of scope? This section saves more arguments than any other section in the document. Six months from now when somebody asks "can we add a search endpoint to this API," the non-goals section gives a principled answer.

The skill takes my rough description of what I am building and produces a draft of all five sections. I review it, push back on the parts that feel wrong, and iterate until I have a document I would be willing to defend in a design review.
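
One way to keep the five sections honest is to treat the document as structured data and check that none of them is empty before committing it. This is a minimal sketch, not the skill itself; the DesignDocument class and its field names are my own shorthand for the sections described above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DesignDocument:
    """The five sections of the design conversation (names are illustrative)."""
    actors: list[str] = field(default_factory=list)
    resources: list[str] = field(default_factory=list)
    operations: dict[str, list[str]] = field(default_factory=dict)
    consistency_rules: dict[str, str] = field(default_factory=dict)
    non_goals: list[str] = field(default_factory=list)


def empty_sections(doc: DesignDocument) -> list[str]:
    """Return the names of sections that still have no content."""
    return [f.name for f in fields(doc) if not getattr(doc, f.name)]


draft = DesignDocument(
    actors=["mobile app rendering a user profile"],
    resources=["user"],
)
missing = empty_sections(draft)
```

A draft with anything left in missing is not ready for a design review.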

The Endpoint Specification Skill

Once the design document is settled, I move to specifying individual endpoints. The endpoint specification skill takes a resource and an operation and produces a complete specification including the URL path, the HTTP method, the request shape, the response shape, the error responses, the idempotency behavior, and the authentication requirements.

This is where most of the bugs in API design get caught. Writing a specification forces you to confront edge cases that are easy to ignore when you are typing route handlers. What happens if the request body is malformed? What happens if the resource does not exist? What happens if the user is authenticated but lacks permission? What happens if a required field is empty versus missing?

The specification format I use looks like this:

POST /api/v1/orders
Auth: Bearer token (scope: orders:write)
Idempotency: Idempotency-Key header (UUID, 24h retention)

Request:
  {
    customer_id: string (required)
    line_items: [{
      product_id: string (required)
      quantity: integer (required, min: 1, max: 999)
      unit_price_cents: integer (optional, defaults to product price)
    }] (required, min: 1, max: 50)
    shipping_address: AddressObject (required)
    notes: string (optional, max: 500)
  }

Response 201:
  Order resource (full shape)
  Location: /api/v1/orders/{id}

Response 400:
  ValidationError with field-level details

Response 409:
  IdempotencyConflict if same key seen with different body

Response 422:
  BusinessLogicError (e.g., product out of stock)

The skill produces these specifications for every endpoint in the API. I commit them to a specs/ directory. They become the contract that the implementation has to match and that the tests verify.
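
To show how a specification turns into checks, here is a minimal validation sketch for the order request above. It is an illustration under assumptions, not the generated implementation: the error dictionary shape is invented, an empty customer_id is treated as invalid (the spec only says required), AddressObject is assumed to be a JSON object, and unit_price_cents is left unchecked for brevity.

```python
def validate_order_request(body: dict) -> list[dict]:
    """Return field-level errors; an empty list means these checks passed."""
    errors = []

    def fail(field: str, message: str) -> None:
        errors.append({"field": field, "message": message})

    # Missing and empty are reported differently so clients can tell them apart.
    if "customer_id" not in body:
        fail("customer_id", "required field is missing")
    elif not isinstance(body["customer_id"], str) or not body["customer_id"]:
        fail("customer_id", "must be a non-empty string")

    items = body.get("line_items")
    if not isinstance(items, list) or not 1 <= len(items) <= 50:
        fail("line_items", "must be a list of 1 to 50 items")
    else:
        for i, item in enumerate(items):
            item = item if isinstance(item, dict) else {}
            if not isinstance(item.get("product_id"), str):
                fail(f"line_items[{i}].product_id", "required string")
            qty = item.get("quantity")
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(qty, bool) or not isinstance(qty, int) or not 1 <= qty <= 999:
                fail(f"line_items[{i}].quantity", "must be an integer from 1 to 999")

    if not isinstance(body.get("shipping_address"), dict):
        fail("shipping_address", "required object")

    notes = body.get("notes")
    if notes is not None and (not isinstance(notes, str) or len(notes) > 500):
        fail("notes", "must be a string of at most 500 characters")

    return errors
```

A non-empty result maps onto the 400 response in the specification, with each entry supplying the field-level detail.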

The discipline of writing specifications first sounds bureaucratic. It is not. It is faster than typing route handlers and discovering the design problems through bugs. The specifications take maybe an hour each. The bugs they prevent take days each.
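
The idempotency line in the order specification is a good example of a design problem that is cheaper to settle on paper. Below is a minimal in-memory sketch of the behavior it describes: the same key with the same body replays the stored response, the same key with a different body yields 409, and entries expire after 24 hours. The IdempotencyStore class, the replay status, and the hashing scheme are assumptions of mine; a real service would use a shared store rather than a dictionary.

```python
import hashlib
import json
import time

RETENTION_SECONDS = 24 * 60 * 60  # the 24h retention from the spec


class IdempotencyStore:
    """In-memory stand-in for a shared idempotency store (hypothetical)."""

    def __init__(self, clock=time.time):
        self._seen = {}  # key -> (body_hash, stored_response, stored_at)
        self._clock = clock

    def handle(self, key: str, body: dict, create):
        """Return (status, response) for a POST carrying an Idempotency-Key."""
        body_hash = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")
        ).hexdigest()
        now = self._clock()
        entry = self._seen.get(key)
        if entry and now - entry[2] < RETENTION_SECONDS:
            if entry[0] != body_hash:
                return 409, {"error": "IdempotencyConflict"}
            return 201, entry[1]  # replay the original response, create nothing
        response = create(body)
        self._seen[key] = (body_hash, response, now)
        return 201, response
```

Injecting the clock keeps the retention rule testable without waiting a day.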

Want the playbook for setting up Claude Code skills like the API design conversation skill? It is in the Claude Code skills guide. Start with one skill and add more as you find friction in your workflow.

The Consistency Audit Skill

Designing endpoints in isolation is how inconsistencies creep in. The third endpoint uses created_at and the fourth uses createdAt and the fifth uses creationDate. The first list endpoint uses cursor pagination and the second uses offset pagination. The first error response includes a code field and the second does not. Each individual decision is fine. The collection is a mess.

The consistency audit skill takes the full set of endpoint specifications and produces a report of inconsistencies. Naming conventions, pagination strategies, error formats, authentication patterns, response envelopes. Anything that varies across endpoints when it should not.
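
The naming part of such an audit is mechanical enough to sketch. The example below is a simplified stand-in, not the skill: it assumes each endpoint spec has been reduced to a list of field names, and it only detects mixed casing styles. It would not notice that createdAt and creationDate name the same concept, which is where a model reading the specs earns its keep.

```python
import re
from collections import defaultdict


def naming_style(name: str) -> str:
    """Classify a field name as snake_case, camelCase, or a single word."""
    if re.fullmatch(r"[a-z0-9]+(_[a-z0-9]+)+", name):
        return "snake_case"
    if re.fullmatch(r"[a-z0-9]+([A-Z][a-z0-9]*)+", name):
        return "camelCase"
    return "single_word"


def audit_field_names(specs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each multi-word naming style to the endpoints that use it."""
    usage = defaultdict(set)
    for endpoint, field_names in specs.items():
        for name in field_names:
            style = naming_style(name)
            if style != "single_word":
                usage[style].add(endpoint)
    return {style: sorted(endpoints) for style, endpoints in usage.items()}


specs = {
    "GET /orders": ["id", "created_at"],
    "GET /users": ["id", "createdAt"],
    "GET /invoices": ["creationDate"],
}
report = audit_field_names(specs)
```

More than one key in the report means the API mixes naming styles, and the lists say where to look.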

I run this skill at three points in the API lifecycle. First, after the initial design pass, before any code is written. The earliest fixes are the cheapest. Second, after every batch of new endpoints is added. The skill catches drift from the original conventions. Third, before any major release. The skill provides a final sanity check.

The audit report is brutal in a useful way. Last month it told me I had three different pagination strategies in an API I thought was internally consistent. I had been pattern-matching against whichever endpoint I had looked at most recently. The skill noticed what my tired eyes had missed.

The Versioning Strategy Skill

Versioning is where most APIs go to die. The team picks a strategy that sounds reasonable, ships v1, and discovers six months later that the strategy does not work for the actual changes the API needs. By then there are clients depending on v1 and changing the strategy means breaking them.

The versioning strategy skill takes the API design document and produces a versioning plan that answers four questions. How do additive changes get versioned? How do breaking changes get versioned? How long does each version stay supported? What is the deprecation process?

The reason this matters is that different versioning strategies suit different APIs. URL path versioning (/api/v1/, /api/v2/) is simple but creates massive code duplication if you maintain multiple major versions. Header versioning is more flexible but harder for clients to discover. Date-based versioning works well for SaaS APIs where clients pin to a specific release. Each strategy has tradeoffs and the right choice depends on the API.
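
To make the three strategies concrete, here is a minimal sketch of how a server might resolve the requested version under each one. The API-Version header name, the default version, and the release dates are all hypothetical.

```python
import re
from datetime import date

# Hypothetical release dates for a date-versioned API.
RELEASES = [date(2025, 1, 15), date(2025, 9, 1), date(2026, 3, 10)]


def version_from_path(path: str):
    """URL path versioning: /api/v2/orders -> 2, or None if unversioned."""
    match = re.match(r"^/api/v(\d+)/", path)
    return int(match.group(1)) if match else None


def version_from_header(headers: dict, default: int = 1) -> int:
    """Header versioning: read a hypothetical API-Version header."""
    return int(headers.get("API-Version", default))


def release_for_pin(pinned: date):
    """Date-based versioning: latest release on or before the pinned date."""
    eligible = [release for release in RELEASES if release <= pinned]
    return max(eligible) if eligible else None
```

The path version is visible in every URL, while the header and date variants only work if the client knows to send something. That is the discoverability tradeoff in practice.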

The skill walks through the tradeoffs for the specific API and recommends a strategy with reasoning. I have never accepted the first recommendation without modification, but the recommendation is always close enough to argue with productively.

The Client SDK Generation Skill

Most APIs are easier to use through an SDK than through raw HTTP calls. The problem is that maintaining SDKs in five languages is more work than most teams can sustain. So the SDK either does not exist, or exists in only one language, or exists in five languages but only two are kept up to date.

The client SDK generation skill takes the endpoint specifications and produces SDK code for whatever languages I need. TypeScript, Python, Ruby, Go, Java. The generated SDKs include type definitions, error handling, retries, idempotency key generation, and pagination helpers. They are not as polished as a hand-written SDK by an expert in that language, but they are 80 percent of the way there and they stay in sync with the API automatically.
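
As an example of what a pagination helper in a generated SDK might look like, here is a minimal sketch for cursor pagination. The page shape (an items list plus a next_cursor that is None on the last page) and the fetch_page callable are assumptions for illustration, not the output of the skill.

```python
from typing import Callable, Iterator, Optional


def iterate_pages(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every item across pages, following next_cursor until it is None."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break


# A fake two-page backend standing in for real HTTP calls.
fake_pages = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "c1"},
    "c1": {"items": [{"id": 3}], "next_cursor": None},
}
ids = [item["id"] for item in iterate_pages(lambda cursor: fake_pages[cursor])]
```

Because the helper is a generator, callers can stop early without fetching pages they do not need.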

The trick is that the SDK generation reads the same specifications that the implementation tests verify. If the implementation drifts from the spec, the tests fail. If the spec is updated, the SDK regenerates. The whole pipeline is connected.

This is the kind of thing that used to require a dedicated developer experience team. Now it requires a half day of skill setup and an evening of polish per language.

The same spec-driven workflow applies to internal team APIs too. If you are building a service mesh of internal APIs, the Claude Code memory files approach gives every service a shared context that makes cross-service design conversations dramatically easier.

The Breaking Change Detector

The single most expensive class of API mistake is the accidental breaking change. You think you are making a backwards-compatible change. You are not. A client breaks in production. You roll back. The team loses a day. Trust in the deployment process drops.

The breaking change detector takes the current API specification and the proposed API specification and produces a report of every change, classified as additive, breaking, or ambiguous. Adding a new optional field is additive. Removing a field is breaking. Changing a field from optional to required is breaking. Changing a field from required to optional is technically additive but might break clients that expect the field to always be present.

The ambiguous category is the interesting one. There are changes that are technically backwards-compatible but practically break some clients. Changing the order of fields in a list response. Changing the precision of a float. Changing the timezone of a timestamp. The detector flags these explicitly so I can make a deliberate decision rather than discovering the breakage in production.
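
The core of such a classification can be sketched as a diff over field definitions. This is a simplified illustration rather than the detector itself. It assumes each spec is reduced to a map of field name to type and required flag, it treats a newly added required field as breaking, and it files required-to-optional under ambiguous. Ordering, precision, and timezone changes need richer metadata than this sketch carries.

```python
def classify_changes(current: dict, proposed: dict) -> list[tuple[str, str, str]]:
    """Diff two {field: {"type": ..., "required": ...}} maps for one endpoint."""
    changes = []
    for name in sorted(current.keys() | proposed.keys()):
        old, new = current.get(name), proposed.get(name)
        if old is None:
            kind = "breaking" if new["required"] else "additive"
            changes.append((name, kind, "field added"))
        elif new is None:
            changes.append((name, "breaking", "field removed"))
        elif old["type"] != new["type"]:
            changes.append((name, "breaking", "type changed"))
        elif not old["required"] and new["required"]:
            changes.append((name, "breaking", "optional became required"))
        elif old["required"] and not new["required"]:
            changes.append((name, "ambiguous", "required became optional"))
    return changes


current = {
    "id": {"type": "string", "required": True},
    "notes": {"type": "string", "required": False},
    "total": {"type": "integer", "required": True},
    "legacy": {"type": "string", "required": False},
}
proposed = {
    "id": {"type": "string", "required": True},
    "notes": {"type": "string", "required": True},
    "total": {"type": "integer", "required": False},
    "coupon": {"type": "string", "required": False},
}
report = classify_changes(current, proposed)
```

Unchanged fields such as id produce no entry, so the report contains only the decisions that need a human.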

I run the detector on every PR that touches the specifications. It is wired into CI. If the PR introduces a breaking change without a corresponding version bump, CI fails. The discipline is enforced by tooling rather than memory.
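
The CI rule itself is small. Here is a minimal sketch, assuming the detector's classifications arrive as a list of strings and that "version bump" means an increase in the major version. Both are my assumptions, and the function names are illustrative.

```python
import sys


def major(version: str) -> int:
    """Extract the major number from strings like 'v2' or '2.1.0'."""
    return int(version.lstrip("v").split(".")[0])


def ci_gate(change_kinds: list[str], old_version: str, new_version: str) -> int:
    """Return a process exit code: 0 to pass, 1 to fail the build."""
    has_breaking = "breaking" in change_kinds
    bumped = major(new_version) > major(old_version)
    if has_breaking and not bumped:
        print("breaking change without a major version bump", file=sys.stderr)
        return 1
    return 0
```

A CI step would pass the result to sys.exit so that a non-zero code blocks the merge.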

The Documentation Generation Skill

API documentation is the work that nobody has time for and that everybody complains about when it is missing. Most teams ship documentation that is either nonexistent, out of date, or autogenerated from code in a way that makes it technically complete but practically unusable.

The documentation generation skill takes the endpoint specifications and produces documentation that is more useful than autogenerated reference material. It includes example requests and responses. It includes common workflows that span multiple endpoints. It includes troubleshooting guides for common error conditions. It includes migration guides between versions.

The trick is that the documentation reads the same specifications that everything else reads. So the documentation never drifts from the actual API behavior. If the spec changes, the documentation regenerates. If there is a bug in the documentation, fixing it usually means fixing the spec, which means fixing the implementation, which means the bug gets fixed everywhere at once.
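
The mechanical half of that pipeline, rendering reference text straight from a spec, can be sketched in a few lines. The spec dictionary layout below (method, path, summary, example_request, errors) is a hypothetical simplification of the specification format shown earlier. The workflows and troubleshooting guides are the parts that need more than a template.

```python
import json


def render_endpoint_doc(spec: dict) -> str:
    """Render a plain-text reference entry from one endpoint spec."""
    lines = [
        f'{spec["method"]} {spec["path"]}',
        "",
        spec["summary"],
        "",
        "Example request:",
        json.dumps(spec["example_request"], indent=2),
        "",
        "Errors:",
    ]
    for status, meaning in sorted(spec["errors"].items()):
        lines.append(f"  {status}: {meaning}")
    return "\n".join(lines)


spec = {
    "method": "POST",
    "path": "/api/v1/orders",
    "summary": "Create an order.",
    "example_request": {"customer_id": "c1"},
    "errors": {409: "IdempotencyConflict", 400: "ValidationError"},
}
doc = render_endpoint_doc(spec)
```

Since the text is derived from the spec on every run, a spec change shows up in the rendered page without anyone remembering to edit it.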

This is one of the workflows where the leverage from Claude Code is most obvious. Documentation that used to take a week to write and that nobody trusted now takes an afternoon to generate and reflects reality.

What This Workflow Has Cost Me

Setting up the design conversation, endpoint specification, consistency audit, versioning strategy, SDK generation, breaking change detector, and documentation generation skills took me about three days. Most of that was iterating on the prompts to produce output I trusted. The skills themselves are short. The expertise is in knowing what good API design looks like, which is not something Claude Code can give me but is something Claude Code can amplify.

The ongoing cost is essentially zero. I run the skills as part of my normal development flow. They produce artifacts I would have wanted to produce anyway. The friction of using them is lower than the friction of skipping them.

The benefit is that I have not shipped a regrettable API since I started using this workflow. The APIs I ship are more consistent, better documented, and more pleasant to consume. The clients that integrate with them complain less. The teams that maintain them complain less. The on-call burden from API issues is lower.

There is a version of this article that is about specific tools. This is not that article. The tools I use are not magic. The discipline is.

The Bottom Line

API design is a leverage activity. The decisions you make in the first week of an API live with you for years. The cost of getting them right early is dramatically lower than the cost of getting them wrong and discovering the mistake six months later when there are clients depending on the wrong shape.

Claude Code does not turn me into a better API designer. It turns me into the API designer I would be if I had infinite patience for writing specifications, running consistency audits, and producing migration guides. Most engineers know what good API design looks like. Most engineers do not have time to do all the work that good API design requires. Claude Code closes that gap.

If you are about to design a new API, do not start by typing route handlers. Start by writing the design document. Then write the endpoint specifications. Then audit them for consistency. Then implement. The implementation will be faster because you will not be redesigning while you type, and the API will be better because you will have thought it through before you committed to it.

If this workflow resonates, the team workflows guide shows how to scale these patterns across an engineering team. The hardest part is not the tooling. The hardest part is the cultural shift from typing first to thinking first.

The APIs I am proudest of were the ones I designed slowest. Claude Code makes slow design fast.