<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rohit Gavali</title>
    <description>The latest articles on Forem by Rohit Gavali (@rohit_gavali_0c2ad84fe4e0).</description>
    <link>https://forem.com/rohit_gavali_0c2ad84fe4e0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3426616%2F01f5c41b-77c2-4cbe-9d6e-e1126d1cd6b0.png</url>
      <title>Forem: Rohit Gavali</title>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rohit_gavali_0c2ad84fe4e0"/>
    <language>en</language>
    <item>
      <title>My Workflow for Validating AI Outputs Before Shipping Code</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Wed, 18 Mar 2026 10:08:54 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/my-workflow-for-validating-ai-outputs-before-shipping-code-4abe</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/my-workflow-for-validating-ai-outputs-before-shipping-code-4abe</guid>
      <description>&lt;p&gt;I shipped AI-generated code to production exactly once without a validation workflow. It took down our payment processing for forty minutes and cost us three customer escalations.&lt;/p&gt;

&lt;p&gt;The code looked perfect. Clean structure, proper error handling, comprehensive logging. It passed our test suite. The AI that generated it—&lt;a href="https://crompt.ai/chat?id=72" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt;—confidently assured me it was production-ready.&lt;/p&gt;

&lt;p&gt;The bug was subtle: the payment retry logic used exponential backoff with no maximum delay. After five retries, it was waiting sixteen minutes before attempting the sixth retry. Users saw pending payments that never resolved. Our monitoring didn't catch it because technically nothing crashed—the code was just waiting.&lt;/p&gt;

&lt;p&gt;A human would have questioned sixteen-minute delays. The AI never considered whether the behavior made sense in a production context. It implemented the algorithm correctly but didn't reason about the consequences.&lt;/p&gt;

&lt;p&gt;That incident forced me to build a systematic validation workflow. Not because AI code is inherently bad, but because &lt;strong&gt;AI-generated code fails in different ways than human-written code, and our traditional review processes don't catch those failures.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;The Core Problem With AI Code Review&lt;/h2&gt;

&lt;p&gt;Traditional code review assumes the author understood the requirements and attempted to meet them. The reviewer checks if the implementation matches the intent.&lt;/p&gt;

&lt;p&gt;AI-generated code breaks this assumption. The AI didn't understand requirements—it pattern-matched against similar code in its training data. Sometimes the pattern is right. Sometimes it's subtly wrong in ways that look correct until you reason about behavior.&lt;/p&gt;

&lt;p&gt;This means standard code review questions don't work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Does this implementation match the requirements?"&lt;/strong&gt; — AI code usually matches the literal requirements while missing implicit constraints you'd assume any developer would understand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Are there edge cases that aren't handled?"&lt;/strong&gt; — AI code often handles edge cases you specified while introducing new edge cases you didn't think to mention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Is this maintainable?"&lt;/strong&gt; — AI code is usually well-structured and readable. Maintainability isn't the problem. Correctness is.&lt;/p&gt;

&lt;p&gt;I needed a validation workflow that accounted for AI's specific failure modes, not just general code quality issues.&lt;/p&gt;

&lt;h2&gt;The Validation Workflow That Actually Works&lt;/h2&gt;

&lt;p&gt;After six months of shipping AI-generated code without incidents, here's the workflow that survived:&lt;/p&gt;

&lt;h3&gt;Stage 1: Multi-Model Generation&lt;/h3&gt;

&lt;p&gt;I never ship code generated by a single AI model. I generate implementations from at least two different models and compare them.&lt;/p&gt;

&lt;p&gt;When I needed a function to parse and validate user-uploaded configuration files, I asked both &lt;a href="https://crompt.ai/chat?id=72" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt; and &lt;a href="https://crompt.ai/chat?id=78" rel="noopener noreferrer"&gt;Gemini 3.1 Pro&lt;/a&gt; to implement it independently.&lt;/p&gt;

&lt;p&gt;Claude's version prioritized error messages and validation feedback. It returned detailed errors explaining what was wrong with malformed configs.&lt;/p&gt;

&lt;p&gt;Gemini's version prioritized performance. It validated config structure in a single pass and returned boolean valid/invalid with minimal error detail.&lt;/p&gt;

&lt;p&gt;Neither was wrong. But the comparison revealed an implicit requirement I hadn't specified: &lt;strong&gt;we needed detailed error messages for user feedback, not just validation results.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If I'd accepted the first implementation I received, I would have shipped the wrong behavior. The multi-model comparison forced me to clarify requirements I'd assumed were obvious.&lt;/p&gt;

&lt;p&gt;Using platforms that let you &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;compare AI models side-by-side&lt;/a&gt; makes this stage practical. You can see both implementations simultaneously without copy-pasting between interfaces.&lt;/p&gt;
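&lt;p&gt;The contract the comparison surfaced looks roughly like this. A minimal sketch, assuming a hypothetical config schema; the field names and rules here are invented for illustration:&lt;/p&gt;

```javascript
// Sketch of a validator that returns detailed errors, not just a boolean.
// The schema ('name', 'timeoutMs') is a hypothetical example.
function validateConfig(config) {
  const errors = [];
  if (typeof config.name !== "string" || config.name === "") {
    errors.push("'name' must be a non-empty string");
  }
  // Math.sign(n) === 1 only for positive numbers, so this also rejects NaN.
  if (!Number.isInteger(config.timeoutMs) || Math.sign(config.timeoutMs) !== 1) {
    errors.push("'timeoutMs' must be a positive integer");
  }
  // Callers that only need pass/fail read 'valid'; user-facing code reads 'errors'.
  return { valid: errors.length === 0, errors };
}

console.log(validateConfig({ name: "", timeoutMs: 0 }).errors);
```

&lt;p&gt;Returning the pair costs almost nothing over a bare boolean, and it's exactly the requirement the side-by-side comparison revealed.&lt;/p&gt;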

&lt;h3&gt;Stage 2: Behavioral Verification&lt;/h3&gt;

&lt;p&gt;I don't review AI-generated code the way I review human code. I don't ask "does this look right?" I ask &lt;strong&gt;"what does this actually do?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For every AI-generated function, I manually trace execution with specific inputs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy path input:&lt;/strong&gt; Does it produce the expected output?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boundary conditions:&lt;/strong&gt; Empty strings, null values, zero, maximum values—what happens at the edges?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malformed input:&lt;/strong&gt; What happens with invalid data? Does it fail gracefully or crash?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production-scale input:&lt;/strong&gt; What happens with realistic data volumes? Does performance degrade?&lt;/p&gt;

&lt;p&gt;For the payment retry logic that failed, this stage would have caught the issue. Tracing through the exponential backoff with actual numbers would have revealed the sixteen-minute delay.&lt;/p&gt;
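&lt;p&gt;Here's what that trace looks like. This is an illustrative sketch, not the actual production code; the 30-second base delay and two-minute cap are assumptions for the example:&lt;/p&gt;

```javascript
// Sketch of the uncapped vs. capped retry schedule (illustrative only).
function backoffDelayMs(attempt, baseMs) {
  return baseMs * 2 ** attempt; // uncapped: doubles forever
}

function cappedBackoffDelayMs(attempt, baseMs, maxMs) {
  return Math.min(baseMs * 2 ** attempt, maxMs); // capped: bounded wait
}

// Trace attempts 0 through 5 with an assumed 30-second base delay.
for (const attempt of [0, 1, 2, 3, 4, 5]) {
  const uncappedMin = backoffDelayMs(attempt, 30_000) / 60_000;
  const cappedMin = cappedBackoffDelayMs(attempt, 30_000, 120_000) / 60_000;
  console.log(`attempt ${attempt}: uncapped ${uncappedMin} min, capped ${cappedMin} min`);
}
// Attempt 5 waits 30s * 2 ** 5 = 960 seconds: the sixteen-minute delay.
```

&lt;p&gt;Writing the delays out in minutes is the whole point of the trace: the sixteen-minute wait jumps off the page in a way it never does inside the formula.&lt;/p&gt;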

&lt;p&gt;I use tools that help &lt;a href="https://crompt.ai/chat/ai-fact-checker" rel="noopener noreferrer"&gt;verify the logical flow&lt;/a&gt; of generated code, not just syntax. The goal is to confirm the code behaves correctly under all conditions, not just that it compiles and runs.&lt;/p&gt;

&lt;h3&gt;Stage 3: Cross-Model Review&lt;/h3&gt;

&lt;p&gt;After selecting an implementation, I have a different AI model review it.&lt;/p&gt;

&lt;p&gt;If Claude generated the code, I ask Gemini to review it. If Gemini generated it, I ask &lt;a href="https://crompt.ai/chat?id=87" rel="noopener noreferrer"&gt;GPT-5.4&lt;/a&gt; to review it.&lt;/p&gt;

&lt;p&gt;Each model has different blind spots. Code that passes Claude's conceptual review might fail Gemini's performance analysis. Code that passes GPT's readability check might have architectural issues Claude would catch.&lt;/p&gt;

&lt;p&gt;The key is asking the right review questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not:&lt;/strong&gt; "Is this code correct?"&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Instead:&lt;/strong&gt; "What could go wrong with this code in production?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not:&lt;/strong&gt; "Does this follow best practices?"&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Instead:&lt;/strong&gt; "What implicit assumptions does this code make?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not:&lt;/strong&gt; "Is this well-written?"&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Instead:&lt;/strong&gt; "What edge cases might this code not handle?"&lt;/p&gt;

&lt;p&gt;Cross-model review isn't about finding syntax errors. It's about surfacing assumptions the generating model made that might be invalid for your specific context.&lt;/p&gt;

&lt;h3&gt;Stage 4: Test Case Generation&lt;/h3&gt;

&lt;p&gt;I have AI generate comprehensive test cases for the code, then review those tests more carefully than the code itself.&lt;/p&gt;

&lt;p&gt;AI-generated tests reveal assumptions the model made during implementation. If the tests don't cover a scenario you care about, the code probably doesn't handle it correctly.&lt;/p&gt;

&lt;p&gt;For the payment retry function, I had &lt;a href="https://crompt.ai/chat/claude-sonnet-45" rel="noopener noreferrer"&gt;Claude Sonnet 4.5&lt;/a&gt; generate test cases. The tests covered retry counts, error handling, and backoff timing—but none tested total elapsed time.&lt;/p&gt;

&lt;p&gt;That omission revealed the model didn't consider time limits as a constraint worth testing. Which meant it didn't consider them during implementation either.&lt;/p&gt;

&lt;p&gt;I now add test cases the AI didn't generate, specifically targeting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time-based behavior (timeouts, delays, expiration)&lt;/li&gt;
&lt;li&gt;Resource constraints (memory, connections, file handles)&lt;/li&gt;
&lt;li&gt;Concurrent access (race conditions, locking)&lt;/li&gt;
&lt;li&gt;Production-scale data (performance, pagination)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are areas where AI-generated code consistently has gaps.&lt;/p&gt;
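&lt;p&gt;For the time-based gap specifically, the test that was never generated looks something like this. The retry parameters and the 60-second budget are assumptions for illustration:&lt;/p&gt;

```javascript
// Cumulative backoff time for a capped exponential schedule (illustrative).
function totalBackoffMs(maxRetries, baseMs, capMs) {
  let total = 0;
  for (const attempt of Array(maxRetries).keys()) {
    total += Math.min(baseMs * 2 ** attempt, capMs);
  }
  return total;
}

// The missing test: bound the cumulative wait, not just per-attempt behavior.
const BUDGET_MS = 60_000; // assumed operational budget
const total = totalBackoffMs(6, 1_000, 16_000); // 1+2+4+8+16+16 = 47 seconds
// Math.min(total, BUDGET_MS) === total holds only when total fits the budget.
console.assert(Math.min(total, BUDGET_MS) === total, "total backoff exceeds budget");
```

&lt;p&gt;A test like this encodes an operational constraint the model had no way to know about, which is precisely why it has to come from you.&lt;/p&gt;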

&lt;h3&gt;Stage 5: Context Validation&lt;/h3&gt;

&lt;p&gt;This is the stage most developers skip, and it's where the subtlest bugs hide.&lt;/p&gt;

&lt;p&gt;AI doesn't know your system architecture, your constraints, or your operational requirements. It generates code that works in isolation but might fail in context.&lt;/p&gt;

&lt;p&gt;For every AI-generated component, I explicitly verify:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this integrate correctly with existing systems?&lt;/strong&gt; AI might use patterns that conflict with how the rest of your codebase works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this match our performance requirements?&lt;/strong&gt; AI optimizes for correctness, not performance. It might choose approaches that work but don't scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this handle our operational constraints?&lt;/strong&gt; Retry limits, timeout budgets, connection pools—AI doesn't know these exist unless you specify them explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this maintain our security posture?&lt;/strong&gt; AI might use libraries or approaches that introduce vulnerabilities in your specific context.&lt;/p&gt;

&lt;p&gt;I use &lt;a href="https://crompt.ai/chat/data-extractor" rel="noopener noreferrer"&gt;AI-powered analysis tools&lt;/a&gt; to validate that generated code handles our specific data patterns correctly. But the final verification is always manual—checking that the code makes sense within our system's constraints.&lt;/p&gt;

&lt;h2&gt;The Validation Checklist I Actually Use&lt;/h2&gt;

&lt;p&gt;Before shipping any AI-generated code, I run through this checklist:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[ ] Generated by at least two different models and compared&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Different implementations reveal ambiguities in requirements&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[ ] Manually traced execution with realistic inputs&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Confirms code does what I think it does, not just what it claims to do&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[ ] Reviewed by a different AI model than the one that generated it&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Catches blind spots specific to the generating model&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[ ] Test cases generated and reviewed for gaps&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
AI-generated tests reveal what the model considered important&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[ ] Additional tests written for time, resources, concurrency, scale&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Areas where AI consistently misses edge cases&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[ ] Verified integration with existing systems&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Confirms code works in context, not just in isolation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[ ] Checked against operational constraints&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Ensures code respects system-specific limits and requirements&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[ ] Security review for libraries, approaches, data handling&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
AI might introduce vulnerabilities specific to your context&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[ ] Performance tested with production-scale data&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Confirms code doesn't just work but works at scale&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[ ] Documentation reviewed for accuracy&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
AI-generated docs often describe what code should do, not what it actually does&lt;/p&gt;

&lt;p&gt;This sounds like a lot. In practice, it takes 10-15 minutes for a typical function. That's longer than reviewing human-written code, but shorter than debugging production incidents caused by skipping validation.&lt;/p&gt;

&lt;h2&gt;What This Workflow Catches&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Implicit requirements the AI missed:&lt;/strong&gt; Multi-model generation reveals ambiguities you didn't realize existed in your requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logic errors that look syntactically correct:&lt;/strong&gt; Manual execution tracing catches bugs that pass automated testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model-specific blind spots:&lt;/strong&gt; Cross-model review surfaces assumptions one model made that another would question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing edge cases:&lt;/strong&gt; Test case generation plus manual additions ensure coverage of scenarios AI doesn't naturally consider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context mismatches:&lt;/strong&gt; Validation against system constraints catches code that works in isolation but fails in production.&lt;/p&gt;

&lt;h2&gt;What This Workflow Costs&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time:&lt;/strong&gt; 10-15 minutes per function instead of 2-3 minutes for standard review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context switching:&lt;/strong&gt; Using multiple models means explaining the same requirements multiple times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cognitive load:&lt;/strong&gt; Comparing implementations and tracing execution requires more mental effort than accepting the first plausible solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool overhead:&lt;/strong&gt; Managing multiple AI models and comparison workflows requires infrastructure.&lt;/p&gt;

&lt;p&gt;But here's what I learned: &lt;strong&gt;the time cost of validation is negligible compared to the time cost of debugging production issues caused by skipped validation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That forty-minute payment outage cost me six hours of debugging, incident response, and customer communication. Plus reputation damage that's harder to quantify.&lt;/p&gt;

&lt;p&gt;The validation workflow would have caught that bug in ten minutes. The ROI is obvious.&lt;/p&gt;

&lt;h2&gt;The Skills This Workflow Requires&lt;/h2&gt;

&lt;p&gt;Validating AI code isn't about knowing how to prompt better. It's about developing specific review skills:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ability to read code behaviorally, not structurally.&lt;/strong&gt; Don't ask if the code looks right. Ask what it actually does with specific inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern recognition for AI failure modes.&lt;/strong&gt; After validating dozens of AI-generated functions, you start recognizing the types of bugs AI consistently introduces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The discipline to check what seems obvious.&lt;/strong&gt; AI code looks so clean and confident that your brain wants to trust it. You need to develop skepticism that overrides that instinct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comfort with multiple models.&lt;/strong&gt; You need to be fluent enough with different AI systems to quickly generate and compare implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The judgment to know when validation is overkill.&lt;/strong&gt; Not every AI-generated snippet needs full validation. A one-line string transformation doesn't need multi-model review. A payment processing function does.&lt;/p&gt;

&lt;h2&gt;When I Skip Steps&lt;/h2&gt;

&lt;p&gt;I don't run every AI-generated code snippet through full validation. That would be inefficient.&lt;/p&gt;

&lt;p&gt;I skip validation for:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pure data transformations with no side effects.&lt;/strong&gt; If the function just transforms input to output with no external dependencies, the input/output tests are usually sufficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code I'm going to manually rewrite anyway.&lt;/strong&gt; Sometimes I use AI to generate a starting point that I'll completely refactor. Full validation is overkill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-critical scripts and tools.&lt;/strong&gt; Deployment scripts, data migration helpers, one-off analysis tools—if failure is low-cost, lightweight validation is fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code that's easy to verify through use.&lt;/strong&gt; UI components, formatting utilities, display logic—if you can immediately see whether it works through normal use, formal validation isn't necessary.&lt;/p&gt;

&lt;p&gt;I run full validation for:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anything that handles money, authentication, or user data.&lt;/strong&gt; High-stakes code gets maximum scrutiny.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance-critical paths.&lt;/strong&gt; Code that needs to scale or run efficiently requires validation of resource usage and timing behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex business logic.&lt;/strong&gt; Anything implementing domain-specific rules where correctness isn't obvious from casual inspection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration points between systems.&lt;/strong&gt; Code that connects different parts of your architecture where bugs can cascade.&lt;/p&gt;

&lt;p&gt;The judgment about when to validate thoroughly is a skill you develop by seeing what types of AI-generated code tend to have subtle bugs versus what's usually fine.&lt;/p&gt;

&lt;h2&gt;What Changed After Adopting This Workflow&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I ship AI-generated code confidently.&lt;/strong&gt; Before the workflow, every deploy felt risky. Now I trust validated AI code as much as code I wrote myself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I catch bugs before they reach production.&lt;/strong&gt; The last six months: zero production incidents from AI-generated code. Compare that to one major incident in the first month before I had a validation workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I write less code but understand it better.&lt;/strong&gt; AI handles implementation, I focus on verification. This forces me to think deeply about behavior rather than syntax.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'm faster overall despite validation overhead.&lt;/strong&gt; AI generates code in seconds. Validation takes minutes. Writing code manually takes hours. Net win.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I've developed pattern recognition for AI failures.&lt;/strong&gt; After validating hundreds of functions, I can spot likely bugs in AI code quickly. It's a learnable skill.&lt;/p&gt;

&lt;h2&gt;What I'd Tell Someone Starting Today&lt;/h2&gt;

&lt;p&gt;Don't ship AI code without validation. The time savings from AI generation disappear instantly when you have to debug production issues.&lt;/p&gt;

&lt;p&gt;Build validation into your workflow from day one. Use multiple models to compare implementations. Manually trace execution. Cross-model review. Test comprehensively.&lt;/p&gt;

&lt;p&gt;Use tools that make multi-model workflows practical. Platforms like &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; let you generate and compare outputs from different models without switching between interfaces. This makes validation fast enough to actually do it.&lt;/p&gt;

&lt;p&gt;Develop skepticism for code that looks too clean. AI-generated code is suspiciously well-structured. That's a red flag, not a green light.&lt;/p&gt;

&lt;p&gt;Learn to recognize AI's failure patterns. Off-by-one errors, missing timeouts, ignored resource constraints, subtle regex bugs—these show up repeatedly. Pattern recognition makes validation faster.&lt;/p&gt;

&lt;p&gt;The goal isn't to not use AI. It's to use it safely. AI can generate code faster than you can type. But only careful validation ensures that code actually works in production.&lt;/p&gt;

&lt;p&gt;-Rohit&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>What I Learned After Letting Different AI Models Refactor the Same Function</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Tue, 17 Mar 2026 11:25:11 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/what-i-learned-after-letting-different-ai-models-refactor-the-same-function-2pa6</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/what-i-learned-after-letting-different-ai-models-refactor-the-same-function-2pa6</guid>
      <description>&lt;p&gt;I had a function that bothered me. Not broken—just inelegant. 200 lines of nested conditionals handling user permissions across three different access levels with special cases for admin overrides and temporary grants.&lt;/p&gt;

&lt;p&gt;It worked. Tests passed. But every time I looked at it, I knew it could be better.&lt;/p&gt;

&lt;p&gt;So I did something unusual. I asked five different AI models to refactor it. Same function, same context, same instruction: "Make this better."&lt;/p&gt;

&lt;p&gt;What I got back revealed something fundamental about how different AI systems think about code—and exposed assumptions I didn't know I was making about what "better" even means.&lt;/p&gt;

&lt;h2&gt;The Function That Started It&lt;/h2&gt;

&lt;p&gt;The original code looked like this (simplified for clarity):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;checkPermission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;admin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;temporaryGrants&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;grant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;temporaryGrants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; 
      &lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; 
      &lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
      &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expiresAt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;editor&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;write&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;document&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;draft&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ownerId&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sharedWith&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;public&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;viewer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sharedWith&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Functional, but the nested conditionals obscured the actual permission logic. Each model saw this differently.&lt;/p&gt;

&lt;h2&gt;What Claude Focused On&lt;/h2&gt;

&lt;p&gt;When I fed this to &lt;a href="https://crompt.ai/chat?id=72" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt;, it took a strategy-first approach. It didn't just refactor—it restructured around permission strategies.&lt;/p&gt;

&lt;p&gt;Claude's version introduced a permission strategy pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;permissionStrategies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;admin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="na"&gt;temporaryGrant&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;temporaryGrants&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;grant&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class="nx"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
      &lt;span class="nx"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
      &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;isExpired&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expiresAt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="na"&gt;editor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;canEdit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;write&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;document&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;draft&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ownerId&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sharedWith&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;canReadPublic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;public&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;canEdit&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;canReadPublic&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="na"&gt;viewer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; 
      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sharedWith&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;checkPermission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;strategies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;admin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;permissionStrategies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;admin&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;permissionStrategies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;temporaryGrant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;permissionStrategies&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt;&lt;span class="p"&gt;]?.(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;strategies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What struck me: Claude optimized for &lt;strong&gt;conceptual clarity&lt;/strong&gt;. Each permission type became explicit. The code was longer, but the logic was clearer. If someone asked "how do editor permissions work?", you could point to a single function.&lt;/p&gt;

&lt;p&gt;But there was a tradeoff. The strategy pattern added abstraction overhead. For a function this size, was the pattern worth it? Claude thought in terms of extensibility and maintainability. It assumed this code would grow.&lt;/p&gt;
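&lt;p&gt;The extensibility claim is easy to test with a sketch. Suppose a hypothetical &lt;code&gt;auditor&lt;/code&gt; role needs read-only access to audit logs (my example, not part of the original system). In Claude's structure, that's one new entry in the strategy map:&lt;/p&gt;

```javascript
// Hypothetical sketch of the extensibility claim: the 'auditor' role
// and 'audit-log' resource type are invented for illustration, not
// taken from the article's system.
const permissionStrategies = {
  admin: () => true,
  viewer: (user, resource, action) => {
    if (action !== 'read') return false;
    return Boolean(resource.public || resource.sharedWith?.includes(user.id));
  },
  // New role: one new entry, no existing strategy changes.
  auditor: (user, resource, action) => {
    if (action !== 'read') return false;
    return resource.type === 'audit-log';
  }
};

function checkPermission(user, resource, action) {
  // Unknown roles fall through to false via optional chaining.
  return permissionStrategies[user.role]?.(user, resource, action) ?? false;
}
```

&lt;p&gt;No existing strategy changes, and the new rule can be tested on its own. That is the payoff Claude was betting on.&lt;/p&gt;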

&lt;h2&gt;
  
  
  What Gemini Prioritized
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://crompt.ai/chat?id=78" rel="noopener noreferrer"&gt;Gemini 3.1 Pro&lt;/a&gt; took a completely different approach. It focused on &lt;strong&gt;data-driven configuration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of encoding permission logic in code, Gemini extracted it into a declarative structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;permissionRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;admin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;allowAll&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="na"&gt;editor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;documentTypes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;document&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;draft&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;isOwner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;isSharedWith&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;isPublic&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;documentTypes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;document&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;draft&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;isOwner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;isSharedWith&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;

  &lt;span class="na"&gt;viewer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;isPublic&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;isSharedWith&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;isOwner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ownerId&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;isSharedWith&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sharedWith&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;isPublic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;public&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;checkPermission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// First check temporary grants&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;hasValidTemporaryGrant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;permissionRules&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;allowAll&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;actionRules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;actionRules&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Check document type if specified&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;actionRules&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;documentTypes&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; 
      &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;actionRules&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;documentTypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Check if any condition is satisfied&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;actionRules&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;conditionName&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;conditionName&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini optimized for &lt;strong&gt;configurability&lt;/strong&gt;. Want to add a new role? Update the config. Change permission logic? Modify the rules object. The code itself barely needs to change.&lt;/p&gt;
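&lt;p&gt;To make that concrete, here's a minimal sketch of a role addition under Gemini's split (the &lt;code&gt;reviewer&lt;/code&gt; role and its &lt;code&gt;comment&lt;/code&gt; action are invented for illustration):&lt;/p&gt;

```javascript
// Hypothetical sketch: with rules and conditions separated, a new
// 'reviewer' role is pure data. Role name and 'comment' action are
// my inventions, not from the original system.
const conditions = {
  isOwner: (user, resource) => resource.ownerId === user.id,
  isSharedWith: (user, resource) => Boolean(resource.sharedWith?.includes(user.id)),
  isPublic: (user, resource) => Boolean(resource.public)
};

const permissionRules = {
  viewer: { read: { conditions: ['isPublic', 'isSharedWith'] } },
  // New role: one config entry, reusing existing condition functions.
  reviewer: {
    read: { conditions: ['isSharedWith', 'isPublic'] },
    comment: { conditions: ['isSharedWith'] }
  }
};

function checkPermission(user, resource, action) {
  const actionRules = permissionRules[user.role]?.[action];
  if (!actionRules) return false;
  // Permission is granted if any named condition is satisfied.
  return actionRules.conditions.some(name => conditions[name](user, resource));
}
```

&lt;p&gt;The new role is pure data reusing existing condition functions; &lt;code&gt;checkPermission&lt;/code&gt; itself never changes.&lt;/p&gt;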

&lt;p&gt;This approach would shine in systems where permissions evolve frequently or need to be managed by non-developers. But it introduced indirection—to understand what permissions a viewer has, you need to trace through the config object and condition functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GPT Emphasized
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://crompt.ai/chat?id=87" rel="noopener noreferrer"&gt;GPT-5.4&lt;/a&gt; went minimal. It focused on &lt;strong&gt;reducing cognitive load&lt;/strong&gt; through simplification and early returns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;checkPermission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Admin access&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;admin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Temporary grants&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;hasActiveTemporaryGrant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Public read access&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;public&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Shared resource access&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hasSharedAccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sharedWith&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Editor permissions&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;editor&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isEditableType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;document&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;draft&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isOwner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ownerId&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;canModify&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;write&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;canModify&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;isEditableType&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isOwner&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;hasSharedAccess&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Viewer permissions&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;viewer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;read&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;hasSharedAccess&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;hasActiveTemporaryGrant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;temporaryGrants&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;grant&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
    &lt;span class="nx"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;grant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expiresAt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPT extracted one helper function and flattened the rest. It optimized for &lt;strong&gt;readability through simplicity&lt;/strong&gt;. No patterns, no config objects, just straightforward conditional logic with good naming.&lt;/p&gt;

&lt;p&gt;This version was easiest to read linearly. But it wouldn't scale well. Adding a fourth role means adding another conditional block. The logic is inline, which makes it clear but also harder to reuse or test in isolation.&lt;/p&gt;
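&lt;p&gt;A trimmed sketch shows the scaling cost. Adding a hypothetical &lt;code&gt;commenter&lt;/code&gt; role (not in the original requirements) means wedging another block into the function body:&lt;/p&gt;

```javascript
// Hypothetical illustration of the scaling cost in the flat version.
// The 'commenter' role is invented; editor and public-read branches
// are omitted to keep the sketch short.
function checkPermission(user, resource, action) {
  if (user.role === 'admin') return true;

  const hasSharedAccess = Boolean(resource.sharedWith?.includes(user.id));

  if (user.role === 'viewer') {
    if (action !== 'read') return false;
    return hasSharedAccess;
  }

  // The fourth role: clear to read inline, but its logic cannot be
  // exercised without constructing a full user/resource/action triple.
  if (user.role === 'commenter') {
    if (!['read', 'comment'].includes(action)) return false;
    return hasSharedAccess;
  }

  return false;
}
```

&lt;p&gt;Each role costs another conditional block, and none of the blocks can be reused or unit-tested independently of the function.&lt;/p&gt;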

&lt;h2&gt;
  
  
  What The Differences Revealed
&lt;/h2&gt;

&lt;p&gt;Each model made implicit assumptions about what "better" meant:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude assumed the code would grow.&lt;/strong&gt; It optimized for future extensibility even though the current requirements didn't demand it. The strategy pattern adds complexity now to make changes easier later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini assumed the logic would change frequently.&lt;/strong&gt; It separated logic from code, optimizing for configurability. This is brilliant if permissions need to be modified by non-developers or if you're building a multi-tenant system where each tenant defines their own rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT assumed simplicity was the highest virtue.&lt;/strong&gt; It reduced abstraction, making the code as straightforward as possible. This works great for stable, well-understood requirements that won't grow much.&lt;/p&gt;

&lt;p&gt;None of these approaches is objectively better. They're optimized for different futures.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Assumptions I Didn't Know I Had
&lt;/h2&gt;

&lt;p&gt;Watching AI models refactor the same code exposed my own biases.&lt;/p&gt;

&lt;p&gt;I initially favored Claude's approach because I value extensibility. I've been burned by rigid code that became painful to extend. But that's my history, not necessarily this code's future.&lt;/p&gt;

&lt;p&gt;The Gemini approach made me uncomfortable because I've seen over-engineered configuration systems that became harder to understand than code. But I've also seen systems where pulling the rules out into configuration was exactly right.&lt;/p&gt;

&lt;p&gt;The GPT approach felt too simple at first. Then I realized that was internalized complexity bias—the assumption that good code must involve some sophisticated abstraction. Sometimes the simple solution is actually the right one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means For Using AI To Refactor
&lt;/h2&gt;

&lt;p&gt;Different AI models have different philosophies about code quality, and those philosophies reflect different tradeoffs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some models optimize for future flexibility.&lt;/strong&gt; They'll add abstractions that make the code more complex now but easier to extend later. Great if you're building something that will evolve. Overkill if you're not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some models optimize for separation of concerns.&lt;/strong&gt; They'll extract configuration, create clear boundaries, and make components testable in isolation. Valuable for complex systems. Unnecessary overhead for simple ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some models optimize for immediate clarity.&lt;/strong&gt; They'll keep things simple and readable even if it means sacrificing some extensibility. Perfect for stable code. Limiting for code that needs to grow.&lt;/p&gt;

&lt;p&gt;When you ask an AI to refactor code, you're not just getting a technical transformation—you're getting a philosophy about what makes code good. Understanding which philosophy fits your actual needs is more important than accepting whatever the AI suggests.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Actually Use Multiple Models Now
&lt;/h2&gt;

&lt;p&gt;I don't ask one AI to refactor and accept the result. I ask several and compare their approaches.&lt;/p&gt;

&lt;p&gt;Using platforms like &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; that let you work with &lt;a href="https://crompt.ai/chat/gemini-25-pro" rel="noopener noreferrer"&gt;multiple AI models side-by-side&lt;/a&gt;, I can see different perspectives on the same code simultaneously. Not to find the "right" answer, but to understand the tradeoffs.&lt;/p&gt;

&lt;p&gt;When refactoring now, I ask myself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How likely is this code to change?&lt;/li&gt;
&lt;li&gt;What kind of changes will it face?&lt;/li&gt;
&lt;li&gt;Who will maintain it?&lt;/li&gt;
&lt;li&gt;What's the cost of added abstraction?&lt;/li&gt;
&lt;li&gt;What's the cost of missing abstraction?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then I look at which AI approach optimizes for my actual constraints, not theoretical best practices.&lt;/p&gt;

&lt;p&gt;Sometimes I take Claude's strategy pattern because I know the permission system will grow. Sometimes I take GPT's simplicity because the requirements are stable and the team values clarity. Sometimes I take Gemini's config-driven approach because permissions actually do need to be managed separately from code.&lt;/p&gt;

&lt;p&gt;And sometimes I take elements from multiple approaches, using the AI suggestions as a menu of options rather than a prescription.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern Recognition Problem
&lt;/h2&gt;

&lt;p&gt;The most interesting thing I learned: &lt;strong&gt;AI models are pattern matchers, and they match different patterns&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Claude sees permission code and matches it to strategy patterns it's seen work well in large systems. Gemini sees permission code and matches it to configuration-driven systems that provide flexibility. GPT sees permission code and matches it to straightforward implementations that prioritize readability.&lt;/p&gt;

&lt;p&gt;None of them asked about my specific constraints. They can't—they don't know if this is a startup prototype that will change daily or a stable enterprise system that will run unchanged for years.&lt;/p&gt;

&lt;p&gt;This is why blindly accepting AI refactoring suggestions is dangerous. The AI is optimizing for patterns it's seen succeed in its training data, not for your specific context.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Should Do
&lt;/h2&gt;

&lt;p&gt;Next time you're tempted to ask AI to refactor your code:&lt;/p&gt;

&lt;p&gt;Ask multiple models. Compare their approaches. Notice what each one optimizes for. Then make a conscious decision about which tradeoffs align with your actual needs.&lt;/p&gt;

&lt;p&gt;Don't treat AI as an oracle that knows the "right" way to structure code. Treat it as a source of different perspectives on what "better" could mean.&lt;/p&gt;

&lt;p&gt;The value isn't in getting one perfect refactoring. It's in seeing multiple valid approaches and understanding the philosophical differences between them.&lt;/p&gt;

&lt;p&gt;Your code's future depends on constraints the AI doesn't know: how often requirements change, who maintains the code, how the system will evolve. Choose the approach that fits your actual constraints, not the one that sounds most sophisticated.&lt;/p&gt;

&lt;p&gt;Sometimes that means taking the simple version and resisting the urge to over-engineer. Sometimes it means accepting extra abstraction because you know complexity is coming. Sometimes it means splitting the difference.&lt;/p&gt;

&lt;p&gt;The AI gives you options. The judgment about which option fits your context? That's still on you.&lt;/p&gt;

&lt;p&gt;-ROHIT&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Lessons from Zero-Downtime Postgres Migrations That Nearly Took Prod Down</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Mon, 19 Jan 2026 04:41:19 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/lessons-from-zero-downtime-postgres-migrations-that-nearly-took-prod-down-5fdp</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/lessons-from-zero-downtime-postgres-migrations-that-nearly-took-prod-down-5fdp</guid>
      <description>&lt;p&gt;The migration was supposed to be routine. Add an index, update some constraints, deploy the new application code. Zero downtime, zero risk. We'd done this dozens of times.&lt;/p&gt;

&lt;p&gt;Then at 2:47 PM on a Wednesday, our production database locked up. API response times spiked from 50ms to 30 seconds. User sessions started timing out. The queue of pending requests grew exponentially. Within ninety seconds, our entire platform was effectively down.&lt;/p&gt;

&lt;p&gt;The migration was still running. The index creation we thought would take two minutes had been holding an exclusive lock for three minutes and counting. Every query waiting for that lock was blocking other queries. The cascade failure was complete.&lt;/p&gt;

&lt;p&gt;We had to make a choice: kill the migration and restore service, or wait for it to complete and hope the platform survived. We killed it. Service restored in fifteen seconds. But the damage was done—users had experienced downtime we promised would never happen.&lt;/p&gt;

&lt;p&gt;The irony: we had followed best practices. We'd tested the migration in staging. We'd verified the execution plan. We'd even calculated the expected lock time. Everything looked safe.&lt;/p&gt;

&lt;p&gt;We were wrong about what "safe" meant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Confidence That Kills You
&lt;/h2&gt;

&lt;p&gt;Zero-downtime migrations sound straightforward in theory. You design schema changes that don't require locking tables. You deploy code that works with both old and new schemas. You migrate data in small batches. You verify everything in staging.&lt;/p&gt;

&lt;p&gt;This works beautifully until production has twenty times the data volume, different query patterns, and active connections you can't replicate in testing.&lt;/p&gt;

&lt;p&gt;Our staging database had 2 million rows in the table we were indexing. Production had 40 million. The index creation we tested took 90 seconds in staging. In production, it took over 5 minutes—and held an exclusive lock the entire time.&lt;/p&gt;

&lt;p&gt;The lock wasn't technically required for index creation. Postgres supports &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;, which builds indexes without blocking writes. We knew this. We used it.&lt;/p&gt;

&lt;p&gt;What we didn't account for: the table had active long-running transactions when the migration started. &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; waits for existing transactions to complete before it can proceed without blocking. In staging, there were no long-running transactions. In production, there were three.&lt;/p&gt;

&lt;p&gt;One was an analytics query someone had kicked off five minutes earlier. Another was a batch job that had been running for eight minutes. The third was a zombie connection that had been idle in transaction for over an hour.&lt;/p&gt;

&lt;p&gt;Our "concurrent" index creation waited for these transactions to complete. While waiting, it held locks that blocked new queries. The cascade began.&lt;/p&gt;

&lt;p&gt;We had tested the migration. We just hadn't tested it under production conditions.&lt;/p&gt;
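&lt;p&gt;A pre-flight check along these lines can surface those transactions before the build starts. The SQL targets Postgres's real &lt;code&gt;pg_stat_activity&lt;/code&gt; view; the helper, row shape, and 60-second threshold are illustrative assumptions, and wiring the query through a client is not shown:&lt;/p&gt;

```javascript
// Query for transactions old enough to stall CREATE INDEX CONCURRENTLY.
// pg_stat_activity is Postgres's live session catalog view.
const BLOCKER_SQL = `
  SELECT pid, state, now() - xact_start AS age
  FROM pg_stat_activity
  WHERE xact_start IS NOT NULL
    AND now() - xact_start > interval '60 seconds'`;

// Pure helper so the go/no-go decision is testable without a database.
// rows: [{ pid, state, ageMs }], shaped like the query result.
function findBlockers(rows, maxAgeMs = 60_000) {
  return rows.filter((r) => r.ageMs > maxAgeMs).map((r) => r.pid);
}

// If findBlockers(rowsFromQuery) is non-empty, postpone the index build.
```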

&lt;h2&gt;
  
  
  What "Zero-Downtime" Actually Means
&lt;/h2&gt;

&lt;p&gt;The term "zero-downtime migration" creates a dangerous illusion: that you can change database schemas without affecting system performance or availability.&lt;/p&gt;

&lt;p&gt;This is technically possible. It's also rarely what actually happens.&lt;/p&gt;

&lt;p&gt;Real zero-downtime migrations aren't about eliminating all impact. They're about controlling and minimizing impact in ways that users don't notice. There's a difference between "no user-facing downtime" and "no database impact."&lt;/p&gt;

&lt;p&gt;Every schema change has impact. The question is whether that impact stays within acceptable boundaries or cascades into user-visible failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acceptable impact&lt;/strong&gt;: Slightly elevated CPU during index creation. Temporary increase in replication lag. Brief moments where query plans are suboptimal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unacceptable impact&lt;/strong&gt;: Queries timing out. Connections refused. Response times degrading to the point where features stop working.&lt;/p&gt;

&lt;p&gt;The line between these isn't clear until you cross it. And in production, you often don't know you've crossed it until the alerts start firing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Patterns That Fail
&lt;/h2&gt;

&lt;p&gt;We analyzed our near-disaster and five other problematic migrations from the previous year. Patterns emerged—not in what we did wrong technically, but in what we assumed incorrectly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assumption one: Staging matches production.&lt;/strong&gt; It never does. Production has more data, different data distribution, different query patterns, different connection behavior, different resource contention. A migration that runs perfectly in staging can behave completely differently in production.&lt;/p&gt;

&lt;p&gt;We started actually measuring production conditions before migrations. Not just table sizes—connection counts, active transaction lengths, query patterns during the deployment window, disk I/O patterns. We'd use tools to &lt;a href="https://crompt.ai/chat/trend-analyzer" rel="noopener noreferrer"&gt;analyze our database performance metrics&lt;/a&gt; over the previous week to understand what "normal" looked like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assumption two: Lock duration is predictable.&lt;/strong&gt; It's not. Even with &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;, locks can persist longer than expected. Even with carefully designed multi-phase migrations, unexpected locks can emerge.&lt;/p&gt;

&lt;p&gt;We stopped trusting execution time estimates and started setting hard timeouts. If a migration step runs longer than expected, kill it. Better to abort cleanly than let it cascade into a full outage.&lt;/p&gt;
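&lt;p&gt;Postgres can enforce those hard timeouts itself via session settings. A sketch (&lt;code&gt;lock_timeout&lt;/code&gt; and &lt;code&gt;statement_timeout&lt;/code&gt; are real Postgres settings; the helper and the values are illustrative, and running the statements through a client is assumed):&lt;/p&gt;

```javascript
// Session-level timeouts make Postgres abort a stuck migration step
// on its own instead of letting a lock queue cascade into an outage.
function migrationPreamble({ lockTimeout = "5s", statementTimeout = "10min" } = {}) {
  return [
    `SET lock_timeout = '${lockTimeout}'`,           // give up if a lock wait exceeds this
    `SET statement_timeout = '${statementTimeout}'`, // kill any single step running longer
  ];
}

// e.g. run each statement before the DDL:
//   for (const sql of migrationPreamble()) await client.query(sql);
```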

&lt;p&gt;&lt;strong&gt;Assumption three: You can test everything in advance.&lt;/strong&gt; You can't. Production has edge cases you can't replicate. The combination of active queries, concurrent transactions, and resource contention creates scenarios that don't exist in testing.&lt;/p&gt;

&lt;p&gt;We started treating every migration as a potential incident. Not pessimistically—pragmatically. We had rollback plans. We had monitoring during execution. We had clear criteria for when to abort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assumption four: If it worked before, it's safe now.&lt;/strong&gt; Previous success doesn't guarantee future safety. Table sizes grow. Data distributions change. Query patterns evolve. A migration strategy that worked six months ago can fail today because conditions have changed.&lt;/p&gt;

&lt;p&gt;We started reviewing migration approaches every quarter, not just reusing patterns that had worked previously.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multi-Phase Approach That Actually Works
&lt;/h2&gt;

&lt;p&gt;After our production incident, we redesigned our migration process. Not the technical implementation—the operational approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase one: Make the schema compatible.&lt;/strong&gt; Add new columns, tables, or indexes without removing anything old. The database now supports both old and new application code. This phase might degrade performance slightly, but it doesn't break anything.&lt;/p&gt;

&lt;p&gt;We'd deploy this during low-traffic periods and monitor closely. If anything looked wrong, rollback was simple—just drop the new schema elements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase two: Deploy application code that uses new schema.&lt;/strong&gt; The application starts writing to new columns or using new indexes, but still maintains compatibility with old schema. Both versions of the application can coexist.&lt;/p&gt;

&lt;p&gt;This is where we'd use &lt;a href="https://crompt.ai/chat/claude-sonnet-45" rel="noopener noreferrer"&gt;AI to help review our code changes&lt;/a&gt; for potential edge cases—having a fresh set of eyes (even artificial ones) often caught assumptions we'd embedded in the migration logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase three: Migrate existing data.&lt;/strong&gt; In small batches, during low-traffic periods, with extensive monitoring. If migration causes problems, we can pause or rollback without affecting new data.&lt;/p&gt;
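&lt;p&gt;The batching itself can be trivially simple. A sketch (the helper, batch size, and the UPDATE shape are illustrative assumptions, not our exact tooling):&lt;/p&gt;

```javascript
// Split a backfill over [minId, maxId] into small id ranges so each
// UPDATE touches a bounded number of rows and can be paused between batches.
function batchRanges(minId, maxId, batchSize) {
  const ranges = [];
  let start = minId;
  while (maxId >= start) {
    ranges.push([start, Math.min(start + batchSize - 1, maxId)]);
    start += batchSize;
  }
  return ranges;
}

// Each range becomes something like:
//   UPDATE users SET new_col = ... WHERE id BETWEEN $1 AND $2
// with monitoring (and a short sleep) between batches.
```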

&lt;p&gt;For complex data transformations, we'd sometimes use &lt;a href="https://crompt.ai/chat/claude-sonnet-37" rel="noopener noreferrer"&gt;Claude Sonnet 3.7&lt;/a&gt; to help verify our migration scripts caught all edge cases in the data—it's surprisingly good at spotting scenarios you didn't consider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase four: Remove old schema elements.&lt;/strong&gt; Only after verifying that nothing is using them. This is often weeks after the migration started.&lt;/p&gt;

&lt;p&gt;This approach is slower than "deploy everything at once." It's also far more reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monitoring You Actually Need
&lt;/h2&gt;

&lt;p&gt;Standard database monitoring tells you when things have already gone wrong. You need monitoring that tells you when things are about to go wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lock monitoring during migrations.&lt;/strong&gt; We built custom tooling that watches for locks during migration execution. If any lock lasts longer than expected, if queries are queuing behind locks, if transaction wait times spike—abort the migration immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query performance tracking before and during migrations.&lt;/strong&gt; We baseline query performance in the hours before a migration, then monitor for regressions during execution. A 2x slowdown might not trigger alerts, but it's a signal that something isn't working as expected.&lt;/p&gt;
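&lt;p&gt;The baseline comparison can be as simple as a per-query factor check (a sketch; the data shapes and helper are illustrative, and the 2x factor is the one from the text):&lt;/p&gt;

```javascript
// Flag queries whose latency during the migration exceeds the
// pre-migration baseline by a factor. baseline and current are
// assumed to be { queryName: p95Ms } maps.
function latencyRegressions(baseline, current, factor = 2) {
  return Object.keys(current).filter((q) =>
    baseline[q] == null ? false : current[q] > baseline[q] * factor
  );
}
```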

&lt;p&gt;&lt;strong&gt;Connection pool monitoring.&lt;/strong&gt; Migrations can exhaust connection pools in subtle ways. We watch for increasing connection acquisition times and pool exhaustion patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replication lag tracking.&lt;/strong&gt; Schema changes can cause replication lag spikes. For systems relying on read replicas, this can cascade into user-visible issues even if the primary database is fine.&lt;/p&gt;

&lt;p&gt;We'd use &lt;a href="https://crompt.ai/chat/data-extractor" rel="noopener noreferrer"&gt;analytical tools&lt;/a&gt; to quickly parse and visualize metrics during migrations, helping us spot patterns that would take too long to notice manually.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rollback Plan You Need Before You Start
&lt;/h2&gt;

&lt;p&gt;The worst time to figure out rollback is when things are failing. We learned this by nearly making our outage worse.&lt;/p&gt;

&lt;p&gt;When our index creation locked up production, we panicked briefly trying to remember the correct way to kill it safely. Could we just terminate the migration connection? Would that leave the database in a corrupted state? Should we wait for it to complete?&lt;/p&gt;

&lt;p&gt;These questions should have been answered before we started.&lt;/p&gt;

&lt;p&gt;Now every migration has a documented rollback procedure written before execution begins:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Immediate abort criteria.&lt;/strong&gt; Clear thresholds for when to kill the migration. Lock duration exceeding X seconds. Query queue depth exceeding Y. Response time degradation beyond Z. No judgment calls during an incident—just follow the criteria.&lt;/p&gt;
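&lt;p&gt;Encoding the criteria as data makes the abort decision mechanical. A sketch (the threshold values are illustrative stand-ins for the X/Y/Z above, not our real numbers):&lt;/p&gt;

```javascript
// Hard-coded abort criteria checked during migration execution,
// so nobody has to make a judgment call mid-incident.
const ABORT_CRITERIA = {
  maxLockWaitMs: 10_000, // X: longest acceptable lock wait
  maxQueueDepth: 100,    // Y: queries queued behind the migration
  maxP95Ms: 500,         // Z: response-time ceiling
};

function shouldAbort(metrics, c = ABORT_CRITERIA) {
  return (
    metrics.lockWaitMs > c.maxLockWaitMs ||
    metrics.queueDepth > c.maxQueueDepth ||
    metrics.p95Ms > c.maxP95Ms
  );
}
```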

&lt;p&gt;&lt;strong&gt;Abort procedure.&lt;/strong&gt; Exact commands to safely stop the migration. Not "kill the connection"—the specific SQL commands, in order, with expected outcomes for each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification steps.&lt;/strong&gt; How to confirm the database is in a stable state after abort. What queries to run, what metrics to check, what behaviors indicate success.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rollback procedure if needed.&lt;/strong&gt; If aborting the migration isn't enough, how to roll back schema changes. This is especially critical for multi-phase migrations where partial completion might leave the schema in an unexpected state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communication plan.&lt;/strong&gt; Who to notify when aborting, what to tell them, how to coordinate with application deployments if needed.&lt;/p&gt;

&lt;p&gt;Writing this before the migration forces you to think through failure modes clearly. You're not optimizing for success—you're optimizing for surviving failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Changed Permanently
&lt;/h2&gt;

&lt;p&gt;The near-outage changed how we think about database migrations fundamentally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We stopped doing migrations during business hours.&lt;/strong&gt; Even "safe" migrations. The risk isn't worth the convenience. Nights and weekends aren't fun, but they give you breathing room if something goes wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We started doing dry runs in production.&lt;/strong&gt; Not the actual migration—test runs that verify assumptions. Check for long-running transactions before scheduling the migration. Verify connection counts are within expected ranges. Confirm query patterns match what we planned for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We built in mandatory waiting periods.&lt;/strong&gt; After deploying code that supports new schema, we wait at least 24 hours before migrating data. After migrating data, we wait at least a week before removing old schema. Rushing migrations causes problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We created a migration review process.&lt;/strong&gt; Every migration gets reviewed by someone who didn't write it. Fresh eyes catch assumptions the original author embedded without realizing.&lt;/p&gt;

&lt;p&gt;Using platforms like &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; where you can &lt;a href="https://crompt.ai/chat/gpt-5" rel="noopener noreferrer"&gt;compare different AI model outputs&lt;/a&gt; helped us during reviews—we'd ask multiple AIs to review migration scripts and identify potential issues. Different models caught different edge cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We started measuring migration success differently.&lt;/strong&gt; Success isn't "the migration completed." Success is "the migration completed without user impact." We track metrics during every migration and review them afterward, even for successful migrations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Truth
&lt;/h2&gt;

&lt;p&gt;Zero-downtime database migrations are possible. They're also fragile, complex, and dependent on conditions you can't fully control.&lt;/p&gt;

&lt;p&gt;Every migration carries risk. The question isn't whether to avoid risk—it's whether you've done enough to survive when things go wrong.&lt;/p&gt;

&lt;p&gt;Testing in staging helps but doesn't eliminate uncertainty. Following best practices helps but doesn't guarantee success. Having smart engineers helps but doesn't prevent mistakes.&lt;/p&gt;

&lt;p&gt;What helps most is accepting that migrations can fail and building systems that handle failure gracefully. Have rollback plans. Have abort criteria. Have monitoring that tells you when to bail out before user impact becomes severe.&lt;/p&gt;

&lt;p&gt;The migration that nearly took down our production system followed all the best practices we knew at the time. It still almost failed catastrophically. Not because we were careless, but because production differs from staging in ways you can't fully predict.&lt;/p&gt;

&lt;p&gt;The lesson isn't "don't do database migrations." It's "respect the complexity and plan for failure."&lt;/p&gt;

&lt;p&gt;Your migrations will work ninety-nine times. It's the hundredth time—when production conditions align in ways you didn't anticipate—that determines whether your "zero-downtime" approach actually delivers zero downtime or just a slower disaster.&lt;/p&gt;

&lt;p&gt;Managing complex database migrations? Use &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; to review migration scripts, analyze patterns, and catch edge cases before they hit production—because the best outages are the ones you prevent.&lt;/p&gt;

&lt;p&gt;-ROHIT&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>How Production Logs Forced Me to Simplify API Error Handling</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Fri, 16 Jan 2026 08:34:38 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/how-production-logs-forced-me-to-simplify-api-error-handling-388f</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/how-production-logs-forced-me-to-simplify-api-error-handling-388f</guid>
      <description>&lt;p&gt;At 3 AM on a Tuesday, our API threw an error that took me forty-five minutes to understand from the logs alone.&lt;/p&gt;

&lt;p&gt;Not because the error was complex. Because our error handling was.&lt;/p&gt;

&lt;p&gt;We had built what we thought was a sophisticated error handling system. Detailed error codes, extensive logging, custom exception hierarchies, contextual metadata attached to every failure. The kind of system that looks impressive in code review and feels like enterprise-grade engineering.&lt;/p&gt;

&lt;p&gt;Then production hit, and I found myself scrolling through thousands of log lines, unable to quickly answer the simplest question: "What actually went wrong?"&lt;/p&gt;

&lt;p&gt;That night, staring at logs that told me everything except what I needed to know, I realized we had optimized for the wrong thing. We had built error handling for the code's elegance, not for the human debugging it at 3 AM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Abstraction Trap
&lt;/h2&gt;

&lt;p&gt;Our error handling started simple. Catch exceptions, log them, return appropriate HTTP status codes. Basic, functional, boring.&lt;/p&gt;

&lt;p&gt;Then we started adding "improvements."&lt;/p&gt;

&lt;p&gt;We created custom exception classes for every failure mode. &lt;code&gt;DatabaseConnectionException&lt;/code&gt;, &lt;code&gt;InvalidAuthTokenException&lt;/code&gt;, &lt;code&gt;RateLimitExceededException&lt;/code&gt;, &lt;code&gt;UpstreamServiceTimeoutException&lt;/code&gt;. Each with its own error code, severity level, and metadata schema.&lt;/p&gt;

&lt;p&gt;We built middleware that caught these exceptions, transformed them into standardized error responses, logged them with rich context, and tracked them in our monitoring system. We had error hierarchies, error factories, error serializers.&lt;/p&gt;

&lt;p&gt;The code looked clean. The architecture felt robust. The error handling was thorough and type-safe.&lt;/p&gt;

&lt;p&gt;And it was completely useless when trying to debug production issues.&lt;/p&gt;

&lt;p&gt;The problem wasn't that our errors lacked information—they had too much. Every error logged twenty fields of context. Stack traces were pristine. Error codes were precise. But when scanning through logs at 3 AM trying to understand why the API was returning 500s, I couldn't quickly distinguish signal from noise.&lt;/p&gt;

&lt;p&gt;Our sophisticated error system had created a new problem: &lt;strong&gt;information overload that masked the actual failures.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Production Logs Revealed
&lt;/h2&gt;

&lt;p&gt;After that 3 AM incident, I started actually reading our production logs. Not during incidents—during normal operation. What I found was humbling.&lt;/p&gt;

&lt;p&gt;Most of our carefully crafted error context was never useful. The detailed metadata we attached to exceptions? Rarely relevant. The precise error codes mapping to specific failure modes? Nobody referenced them. The error hierarchies we'd designed? They didn't help anyone understand what was failing.&lt;/p&gt;

&lt;p&gt;What actually helped during debugging was simple, direct information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What was the API trying to do?&lt;/li&gt;
&lt;li&gt;What went wrong?&lt;/li&gt;
&lt;li&gt;What should we do about it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else was noise.&lt;/p&gt;

&lt;p&gt;I noticed patterns in how I actually debugged production issues. I'd grep for the endpoint that was failing, scan for error keywords, look for repeated failures, check for upstream service names. The sophisticated error handling we'd built didn't support this workflow—it fought against it.&lt;/p&gt;

&lt;p&gt;Our logs looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ERROR] Exception caught in middleware layer
Type: DatabaseConnectionException
Code: DB_CONN_001
Severity: HIGH
Message: Unable to establish connection to database
Context: {
  "request_id": "abc123",
  "user_id": "user_456",
  "endpoint": "/api/users/profile",
  "database_host": "prod-db-1.internal",
  "connection_pool_size": 50,
  "retry_attempt": 3,
  "timeout_ms": 5000,
  ...15 more fields
}
Stack trace: [50 lines]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What I actually needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ERROR] /api/users/profile - Database connection failed after 3 retries (prod-db-1 timeout)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first format was "complete." The second was useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Simplification
&lt;/h2&gt;

&lt;p&gt;I started rewriting our error handling with a new principle: &lt;strong&gt;optimize for the person reading logs, not the person writing code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First change: Flatten the error hierarchy.&lt;/strong&gt; Instead of custom exception classes for every failure mode, we went to three categories: client errors (4xx), server errors (5xx), and dependency failures (upstream services, databases, etc.). That's it.&lt;/p&gt;

&lt;p&gt;This felt wrong at first. We were losing type safety. We were giving up precise error categorization. But in production logs, those distinctions didn't matter. What mattered was: Is this our fault or the client's fault? Is this a code bug or an infrastructure issue?&lt;/p&gt;
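&lt;p&gt;The three-bucket scheme fits in a few lines. A hypothetical sketch, not our actual implementation (the &lt;code&gt;isDependency&lt;/code&gt; flag is an assumed marker for upstream/database failures):&lt;/p&gt;

```javascript
// Map any failure into one of three buckets: the caller's fault,
// our fault, or a dependency's fault.
function categorize(error) {
  if (error.isDependency) return "dependency"; // upstream service, database, cache
  if (error.status >= 500) return "server";    // 5xx: our fault
  if (error.status >= 400) return "client";    // 4xx: the caller's fault
  return "server";                             // anything unclassified is on us
}
```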

&lt;p&gt;&lt;strong&gt;Second change: Structure logs for grep, not JSON parsers.&lt;/strong&gt; We had been logging errors as structured JSON, thinking it would make them easier to query. In practice, it made them harder to read. When debugging, you scan logs visually. JSON objects spread across multiple lines are hard to scan.&lt;/p&gt;

&lt;p&gt;We switched to a simple format: &lt;code&gt;[LEVEL] endpoint - what happened (relevant context)&lt;/code&gt;. One line per error. No nested objects. Critical information in predictable positions.&lt;/p&gt;
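&lt;p&gt;That format is small enough to enforce with a one-line helper (a hypothetical sketch, not the article's actual code):&lt;/p&gt;

```javascript
// Emit the single-line, grep-friendly format:
//   [LEVEL] endpoint - what happened (relevant context)
function logLine(level, endpoint, message, context) {
  const ctx = context ? ` (${context})` : "";
  return `[${level}] ${endpoint} - ${message}${ctx}`;
}

// Produces the one-line style shown earlier in the post.
```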

&lt;p&gt;&lt;strong&gt;Third change: Context only when it matters.&lt;/strong&gt; We stopped attaching comprehensive metadata to every error. Instead, we included only the context that would help debug that specific failure type.&lt;/p&gt;

&lt;p&gt;Database connection failed? Log which database and how many retries. Don't log request IDs, user IDs, or the entire request context—those are already in the access logs.&lt;/p&gt;

&lt;p&gt;Rate limit exceeded? Log the endpoint and the limit. Don't log the client's entire request history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth change: Make errors actionable.&lt;/strong&gt; Every error should suggest what to do next. Not in a user-facing message, but in the logs themselves.&lt;/p&gt;

&lt;p&gt;Instead of: &lt;code&gt;InvalidAuthToken&lt;/code&gt; we logged: &lt;code&gt;Authentication failed - token expired (client should refresh)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Instead of: &lt;code&gt;UpstreamServiceTimeout&lt;/code&gt; we logged: &lt;code&gt;Payment service timeout after 5s - check payment-service health&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This changed how we thought about errors. They weren't just failures to categorize—they were signals for action.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools That Actually Help
&lt;/h2&gt;

&lt;p&gt;Once we simplified our error handling, we needed better ways to make sense of the patterns emerging in logs.&lt;/p&gt;

&lt;p&gt;We started using &lt;a href="https://crompt.ai/chat/sentiment-analyzer" rel="noopener noreferrer"&gt;AI to analyze log patterns&lt;/a&gt; when we noticed repeated errors. Not to replace human investigation, but to quickly surface correlations we might miss. "These three endpoints are failing at the same rate—probably the same root cause."&lt;/p&gt;

&lt;p&gt;For complex debugging sessions, we'd use &lt;a href="https://crompt.ai/chat/claude-sonnet-45" rel="noopener noreferrer"&gt;Claude Sonnet 4.5&lt;/a&gt; to help structure our investigation. Paste in a sample of errors, ask it to identify the common pattern or suggest what to check next. The AI didn't debug for us, but it helped organize our thinking when we were overwhelmed.&lt;/p&gt;

&lt;p&gt;When logs revealed issues with specific data transformations or validation logic, we'd use tools that could &lt;a href="https://crompt.ai/chat/data-extractor" rel="noopener noreferrer"&gt;analyze and extract structured information&lt;/a&gt; from messy error patterns, helping us understand what types of inputs were causing failures.&lt;/p&gt;

&lt;p&gt;The goal wasn't to automate debugging—it was to accelerate the pattern recognition that helps you form hypotheses about what's actually broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Gave Up (And Why It Didn't Matter)
&lt;/h2&gt;

&lt;p&gt;Simplifying our error handling meant sacrificing things that felt important:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed error taxonomies.&lt;/strong&gt; We went from 50+ error types to basically three categories. This felt like a loss of precision. In practice, the precision wasn't helping anyone. Knowing the exact error type didn't make debugging faster—knowing what was broken did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comprehensive metadata on every error.&lt;/strong&gt; We stopped logging everything we could and started logging only what was relevant. This meant sometimes we'd have to add more logging after discovering we needed additional context. That's fine—better to add specific logging when it's needed than to drown in context that's never used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type-safe error handling.&lt;/strong&gt; Our custom exception hierarchy gave us compile-time guarantees about error handling. Removing it felt risky. But runtime reliability isn't about type safety—it's about humans understanding failures quickly and fixing them correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sophisticated error transformation pipelines.&lt;/strong&gt; We had middleware that enriched errors, categorized them, and routed them to different logging systems based on type. We deleted most of it. Simpler error handling meant fewer places for bugs to hide in the error handling itself.&lt;/p&gt;

&lt;p&gt;What we gained was worth more than what we lost: &lt;strong&gt;the ability to debug production issues quickly.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern That Emerged
&lt;/h2&gt;

&lt;p&gt;After six months with simplified error handling, we noticed something interesting: we were fixing bugs faster, but we weren't fixing more bugs.&lt;/p&gt;

&lt;p&gt;The complex error handling hadn't prevented bugs. It had just made them harder to understand. When you can't quickly diagnose what's failing, you either ignore intermittent errors (hoping they'll go away) or spend excessive time debugging simple issues.&lt;/p&gt;

&lt;p&gt;With clearer errors, we could quickly distinguish between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Known issues we're monitoring&lt;/li&gt;
&lt;li&gt;New failures that need immediate attention&lt;/li&gt;
&lt;li&gt;Client errors that don't require action&lt;/li&gt;
&lt;li&gt;Infrastructure problems vs. code bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This meant less time investigating false alarms and more time fixing actual problems.&lt;/p&gt;

&lt;p&gt;The developers on our team started writing simpler error handling in new code. Not because we mandated it, but because they saw how much easier it made their own debugging. The cultural shift from "comprehensive error handling" to "useful error handling" happened organically.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Your API
&lt;/h2&gt;

&lt;p&gt;If you're building error handling right now, here's what I'd do differently:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with simple logging.&lt;/strong&gt; Don't build sophisticated error categorization until you've actually debugged production issues and know what information you need. Your first error handling should be almost embarrassingly simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimize for human scanning, not machine parsing.&lt;/strong&gt; Structured logging has its place, but errors should be readable first, queryable second. When something's on fire, you need to scan logs visually and quickly form hypotheses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make errors actionable.&lt;/strong&gt; Every error should tell you what to do next. "Database connection failed" isn't enough. "Database connection failed - check if prod-db-1 is accepting connections" actually helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Include context that matters, exclude context that doesn't.&lt;/strong&gt; You don't need to log everything about the request with every error. You need to log what's relevant to that specific failure mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test your error handling by reading logs.&lt;/strong&gt; Don't just test that errors are caught and logged. Actually read the logs and see if you can quickly understand what's failing. If it takes you more than a few seconds to understand an error, your error handling is too complex.&lt;/p&gt;

&lt;p&gt;Use platforms like &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; that let you work with &lt;a href="https://crompt.ai/chat/gemini-25-flash" rel="noopener noreferrer"&gt;multiple AI models&lt;/a&gt; to help analyze error patterns when you're debugging complex issues. Not as a replacement for good logging, but as a thinking partner when you're trying to make sense of what logs are telling you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Error handling isn't about catching every possible failure mode and logging comprehensive context. It's about making failures understandable to the person who has to fix them.&lt;/p&gt;

&lt;p&gt;The best error handling I've seen isn't sophisticated—it's simple, direct, and optimized for human comprehension under pressure.&lt;/p&gt;

&lt;p&gt;Your errors will be read by tired developers at inconvenient times trying to fix problems quickly. Write error handling for them, not for the idealized version of yourself that has unlimited time to investigate issues.&lt;/p&gt;

&lt;p&gt;The sophistication comes from understanding what information actually helps during debugging, not from building elaborate error taxonomies and transformation pipelines.&lt;/p&gt;

&lt;p&gt;-ROHIT&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>api</category>
      <category>ai</category>
    </item>
    <item>
      <title>How Production Logs Forced Me to Simplify API Error Handling</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Fri, 16 Jan 2026 07:07:05 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/how-production-logs-forced-me-to-simplify-api-error-handling-m15</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/how-production-logs-forced-me-to-simplify-api-error-handling-m15</guid>
      <description>&lt;p&gt;At 3 AM on a Tuesday, our API threw an error that took me forty-five minutes to understand from the logs alone.&lt;/p&gt;

&lt;p&gt;Not because the error was complex. Because our error handling was.&lt;/p&gt;

&lt;p&gt;We had built what we thought was a sophisticated error handling system. Detailed error codes, extensive logging, custom exception hierarchies, contextual metadata attached to every failure. The kind of system that looks impressive in code review and feels like enterprise-grade engineering.&lt;/p&gt;

&lt;p&gt;Then production hit, and I found myself scrolling through thousands of log lines, unable to quickly answer the simplest question: "What actually went wrong?"&lt;/p&gt;

&lt;p&gt;That night, staring at logs that told me everything except what I needed to know, I realized we had optimized for the wrong thing. We had built error handling for the code's elegance, not for the human debugging it at 3 AM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Abstraction Trap
&lt;/h2&gt;

&lt;p&gt;Our error handling started simple. Catch exceptions, log them, return appropriate HTTP status codes. Basic, functional, boring.&lt;/p&gt;

&lt;p&gt;Then we started adding "improvements."&lt;/p&gt;

&lt;p&gt;We created custom exception classes for every failure mode. &lt;code&gt;DatabaseConnectionException&lt;/code&gt;, &lt;code&gt;InvalidAuthTokenException&lt;/code&gt;, &lt;code&gt;RateLimitExceededException&lt;/code&gt;, &lt;code&gt;UpstreamServiceTimeoutException&lt;/code&gt;. Each with its own error code, severity level, and metadata schema.&lt;/p&gt;

&lt;p&gt;We built middleware that caught these exceptions, transformed them into standardized error responses, logged them with rich context, and tracked them in our monitoring system. We had error hierarchies, error factories, error serializers.&lt;/p&gt;

&lt;p&gt;The code looked clean. The architecture felt robust. The error handling was thorough and type-safe.&lt;/p&gt;

&lt;p&gt;And it was completely useless when trying to debug production issues.&lt;/p&gt;

&lt;p&gt;The problem wasn't that our errors lacked information—they had too much. Every error logged twenty fields of context. Stack traces were pristine. Error codes were precise. But when scanning through logs at 3 AM trying to understand why the API was returning 500s, I couldn't quickly distinguish signal from noise.&lt;/p&gt;

&lt;p&gt;Our sophisticated error system had created a new problem: &lt;strong&gt;information overload that masked the actual failures.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Production Logs Revealed
&lt;/h2&gt;

&lt;p&gt;After that 3 AM incident, I started actually reading our production logs. Not during incidents—during normal operation. What I found was humbling.&lt;/p&gt;

&lt;p&gt;Most of our carefully crafted error context was never useful. The detailed metadata we attached to exceptions? Rarely relevant. The precise error codes mapping to specific failure modes? Nobody referenced them. The error hierarchies we'd designed? They didn't help anyone understand what was failing.&lt;/p&gt;

&lt;p&gt;What actually helped during debugging was simple, direct information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What was the API trying to do?&lt;/li&gt;
&lt;li&gt;What went wrong?&lt;/li&gt;
&lt;li&gt;What should we do about it?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else was noise.&lt;/p&gt;

&lt;p&gt;I noticed patterns in how I actually debugged production issues. I'd grep for the endpoint that was failing, scan for error keywords, look for repeated failures, check for upstream service names. The sophisticated error handling we'd built didn't support this workflow—it fought against it.&lt;/p&gt;

&lt;p&gt;Our logs looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ERROR] Exception caught in middleware layer
Type: DatabaseConnectionException
Code: DB_CONN_001
Severity: HIGH
Message: Unable to establish connection to database
Context: {
  "request_id": "abc123",
  "user_id": "user_456",
  "endpoint": "/api/users/profile",
  "database_host": "prod-db-1.internal",
  "connection_pool_size": 50,
  "retry_attempt": 3,
  "timeout_ms": 5000,
  ...15 more fields
}
Stack trace: [50 lines]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What I actually needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ERROR] /api/users/profile - Database connection failed after 3 retries (prod-db-1 timeout)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first format was "complete." The second was useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Simplification
&lt;/h2&gt;

&lt;p&gt;I started rewriting our error handling with a new principle: &lt;strong&gt;optimize for the person reading logs, not the person writing code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First change: Flatten the error hierarchy.&lt;/strong&gt; Instead of custom exception classes for every failure mode, we went to three categories: client errors (4xx), server errors (5xx), and dependency failures (upstream services, databases, etc.). That's it.&lt;/p&gt;

&lt;p&gt;This felt wrong at first. We were losing type safety. We were giving up precise error categorization. But in production logs, those distinctions didn't matter. What mattered was: Is this our fault or the client's fault? Is this a code bug or an infrastructure issue?&lt;/p&gt;
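&lt;p&gt;A minimal sketch of the flattened hierarchy. The class names and status codes are illustrative, not our production code:&lt;/p&gt;

```python
# Three categories instead of 50+ exception classes (names are illustrative).
class ClientError(Exception):      # maps to 4xx: the client's fault
    status = 400

class ServerError(Exception):      # maps to 5xx: our code's fault
    status = 500

class DependencyError(Exception):  # upstream services, databases, etc.
    status = 502

def categorize(exc: Exception) -> str:
    """Answer the only question that matters at 3 AM: whose fault is this?"""
    if isinstance(exc, ClientError):
        return "client"
    if isinstance(exc, DependencyError):
        return "dependency"
    return "server"
```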

&lt;p&gt;&lt;strong&gt;Second change: Structure logs for grep, not JSON parsers.&lt;/strong&gt; We had been logging errors as structured JSON, thinking it would make them easier to query. In practice, it made them harder to read. When debugging, you scan logs visually. JSON objects spread across multiple lines are hard to scan.&lt;/p&gt;

&lt;p&gt;We switched to a simple format: &lt;code&gt;[LEVEL] endpoint - what happened (relevant context)&lt;/code&gt;. One line per error. No nested objects. Critical information in predictable positions.&lt;/p&gt;
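&lt;p&gt;A formatter for that one-line style might look like this. This is a sketch of the format, not our actual logging code:&lt;/p&gt;

```python
def format_error(level: str, endpoint: str, what: str, context: str = "") -> str:
    """Render `[LEVEL] endpoint - what happened (relevant context)` on one line."""
    suffix = f" ({context})" if context else ""
    return f"[{level}] {endpoint} - {what}{suffix}"
```

&lt;p&gt;Everything lands in a predictable position, so a grep for the endpoint or the failure keyword finds it immediately.&lt;/p&gt;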

&lt;p&gt;&lt;strong&gt;Third change: Context only when it matters.&lt;/strong&gt; We stopped attaching comprehensive metadata to every error. Instead, we included only the context that would help debug that specific failure type.&lt;/p&gt;

&lt;p&gt;Database connection failed? Log which database and how many retries. Don't log request IDs, user IDs, or the entire request context—those are already in the access logs.&lt;/p&gt;

&lt;p&gt;Rate limit exceeded? Log the endpoint and the limit. Don't log the client's entire request history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth change: Make errors actionable.&lt;/strong&gt; Every error should suggest what to do next. Not in a user-facing message, but in the logs themselves.&lt;/p&gt;

&lt;p&gt;Instead of: &lt;code&gt;InvalidAuthToken&lt;/code&gt; we logged: &lt;code&gt;Authentication failed - token expired (client should refresh)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Instead of: &lt;code&gt;UpstreamServiceTimeout&lt;/code&gt; we logged: &lt;code&gt;Payment service timeout after 5s - check payment-service health&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This changed how we thought about errors. They weren't just failures to categorize—they were signals for action.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools That Actually Help
&lt;/h2&gt;

&lt;p&gt;Once we simplified our error handling, we needed better ways to make sense of the patterns emerging in logs.&lt;/p&gt;

&lt;p&gt;We started using &lt;a href="https://crompt.ai/chat/sentiment-analyzer" rel="noopener noreferrer"&gt;AI to analyze log patterns&lt;/a&gt; when we noticed repeated errors. Not to replace human investigation, but to quickly surface correlations we might miss. "These three endpoints are failing at the same rate—probably the same root cause."&lt;/p&gt;

&lt;p&gt;For complex debugging sessions, we'd use &lt;a href="https://crompt.ai/chat/claude-sonnet-45" rel="noopener noreferrer"&gt;Claude Sonnet 4.5&lt;/a&gt; to help structure our investigation. Paste in a sample of errors, ask it to identify the common pattern or suggest what to check next. The AI didn't debug for us, but it helped organize our thinking when we were overwhelmed.&lt;/p&gt;

&lt;p&gt;When logs revealed issues with specific data transformations or validation logic, we'd use tools that could &lt;a href="https://crompt.ai/chat/data-extractor" rel="noopener noreferrer"&gt;analyze and extract structured information&lt;/a&gt; from messy error patterns, helping us understand what types of inputs were causing failures.&lt;/p&gt;

&lt;p&gt;The goal wasn't to automate debugging—it was to accelerate the pattern recognition that helps you form hypotheses about what's actually broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Gave Up (And Why It Didn't Matter)
&lt;/h2&gt;

&lt;p&gt;Simplifying our error handling meant sacrificing things that felt important:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed error taxonomies.&lt;/strong&gt; We went from 50+ error types to basically three categories. This felt like a loss of precision. In practice, the precision wasn't helping anyone. Knowing the exact error type didn't make debugging faster—knowing what was broken did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comprehensive metadata on every error.&lt;/strong&gt; We stopped logging everything we could and started logging only what was relevant. This meant sometimes we'd have to add more logging after discovering we needed additional context. That's fine—better to add specific logging when it's needed than to drown in context that's never used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type-safe error handling.&lt;/strong&gt; Our custom exception hierarchy gave us compile-time guarantees about error handling. Removing it felt risky. But runtime reliability isn't about type safety—it's about humans understanding failures quickly and fixing them correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sophisticated error transformation pipelines.&lt;/strong&gt; We had middleware that enriched errors, categorized them, and routed them to different logging systems based on type. We deleted most of it. Simpler error handling meant fewer places for bugs to hide in the error handling itself.&lt;/p&gt;

&lt;p&gt;What we gained was worth more than what we lost: &lt;strong&gt;the ability to debug production issues quickly.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern That Emerged
&lt;/h2&gt;

&lt;p&gt;After six months with simplified error handling, we noticed something interesting: we were fixing bugs faster, but we weren't fixing more bugs.&lt;/p&gt;

&lt;p&gt;The complex error handling hadn't prevented bugs. It had just made them harder to understand. When you can't quickly diagnose what's failing, you either ignore intermittent errors (hoping they'll go away) or spend excessive time debugging simple issues.&lt;/p&gt;

&lt;p&gt;With clearer errors, we could quickly distinguish between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Known issues we're monitoring&lt;/li&gt;
&lt;li&gt;New failures that need immediate attention&lt;/li&gt;
&lt;li&gt;Client errors that don't require action&lt;/li&gt;
&lt;li&gt;Infrastructure problems vs. code bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This meant less time investigating false alarms and more time fixing actual problems.&lt;/p&gt;

&lt;p&gt;The developers on our team started writing simpler error handling in new code. Not because we mandated it, but because they saw how much easier it made their own debugging. The cultural shift from "comprehensive error handling" to "useful error handling" happened organically.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Your API
&lt;/h2&gt;

&lt;p&gt;If you're building error handling right now, here's what I'd do differently:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with simple logging.&lt;/strong&gt; Don't build sophisticated error categorization until you've actually debugged production issues and know what information you need. Your first error handling should be almost embarrassingly simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimize for human scanning, not machine parsing.&lt;/strong&gt; Structured logging has its place, but errors should be readable first, queryable second. When something's on fire, you need to scan logs visually and quickly form hypotheses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make errors actionable.&lt;/strong&gt; Every error should tell you what to do next. "Database connection failed" isn't enough. "Database connection failed - check if prod-db-1 is accepting connections" actually helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Include context that matters, exclude context that doesn't.&lt;/strong&gt; You don't need to log everything about the request with every error. You need to log what's relevant to that specific failure mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test your error handling by reading logs.&lt;/strong&gt; Don't just test that errors are caught and logged. Actually read the logs and see if you can quickly understand what's failing. If it takes you more than a few seconds to understand an error, your error handling is too complex.&lt;/p&gt;
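&lt;p&gt;You can even automate a crude version of this check. The heuristics below are assumptions about what makes a line scannable, not a definitive rule:&lt;/p&gt;

```python
def is_scannable(line: str) -> bool:
    """Rough readability check: one physical line, level up front,
    and a hyphen separating what happened from the endpoint."""
    return ("\n" not in line
            and line.startswith("[")
            and " - " in line)

def test_error_lines_are_scannable():
    sample = "[ERROR] /api/users/profile - Database connection failed after 3 retries (prod-db-1 timeout)"
    assert is_scannable(sample)
```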

&lt;p&gt;Use platforms like &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; that let you work with &lt;a href="https://crompt.ai/chat/gemini-25-flash" rel="noopener noreferrer"&gt;multiple AI models&lt;/a&gt; to help analyze error patterns when you're debugging complex issues. Not as a replacement for good logging, but as a thinking partner when you're trying to make sense of what logs are telling you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Error handling isn't about catching every possible failure mode and logging comprehensive context. It's about making failures understandable to the person who has to fix them.&lt;/p&gt;

&lt;p&gt;The best error handling I've seen isn't sophisticated—it's simple, direct, and optimized for human comprehension under pressure.&lt;/p&gt;

&lt;p&gt;Your errors will be read by tired developers at inconvenient times trying to fix problems quickly. Write error handling for them, not for the idealized version of yourself that has unlimited time to investigate issues.&lt;/p&gt;

&lt;p&gt;The sophistication comes from understanding what information actually helps during debugging, not from building elaborate error taxonomies and transformation pipelines.&lt;/p&gt;

&lt;p&gt;Sometimes the most professional thing you can do is keep it simple enough that anyone can understand it at 3 AM.&lt;/p&gt;

&lt;p&gt;-ROHIT&lt;/p&gt;

</description>
      <category>api</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>What failed when I scaled LLM prompts beyond a single user</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Thu, 15 Jan 2026 06:26:04 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/what-failed-when-i-scaled-llm-prompts-beyond-a-single-user-5gg6</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/what-failed-when-i-scaled-llm-prompts-beyond-a-single-user-5gg6</guid>
      <description>&lt;p&gt;The prompt worked perfectly in my terminal. Clean output, consistent format, exactly what I needed. I'd spent two days refining it, testing edge cases, optimizing token usage. It was beautiful.&lt;/p&gt;

&lt;p&gt;Then we opened it to beta users and everything broke.&lt;/p&gt;

&lt;p&gt;Not in the "throw an error" way. In the "produces wildly different results for different users even with identical inputs" way. In the "works great for power users, completely confuses beginners" way. In the "costs spiral out of control because some users trigger maximum context windows while others use 5% of available tokens" way.&lt;/p&gt;

&lt;p&gt;I learned a hard lesson: &lt;strong&gt;a prompt that works for one user is not a prompt that works for thousands.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The skills that make you good at prompt engineering for personal use—iteration, context building, implicit assumptions—become liabilities at scale. What you need instead is something closer to API design: clear contracts, defensive validation, and systems that fail gracefully when users do unexpected things.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Assumption Problem
&lt;/h2&gt;

&lt;p&gt;My prompt assumed context I had but users didn't.&lt;/p&gt;

&lt;p&gt;I'd been testing with my own data—clean JSON files, consistent formatting, domain knowledge baked into my example inputs. My prompt said "analyze this data" because I knew what "this data" meant. I knew the schema, the edge cases, the business logic.&lt;/p&gt;

&lt;p&gt;Users uploaded CSV files with missing columns. Excel spreadsheets with merged cells. PDFs with tables that became gibberish after extraction. Documents in languages the prompt never anticipated. Data that looked structured but violated every assumption the prompt was built on.&lt;/p&gt;

&lt;p&gt;The prompt that worked flawlessly for me failed 40% of the time for real users because "analyze this data" means nothing without shared understanding of what "this" and "data" actually are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I should have built:&lt;/strong&gt; Explicit validation before the prompt even runs. Not just "did the user upload a file" but "does this file match the structure our prompt expects?" Use tools that can &lt;a href="https://crompt.ai/chat/document-summarizer" rel="noopener noreferrer"&gt;verify document structure&lt;/a&gt; before passing it to the LLM. Reject early, validate thoroughly, fail fast.&lt;/p&gt;

&lt;p&gt;The second version of the prompt started with structured instructions: "You will receive data with these specific fields. If any required field is missing, return an error message in this exact format." This didn't just make the prompt more reliable—it made debugging user issues actually possible.&lt;/p&gt;
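&lt;p&gt;The validation step can be sketched as a pre-flight check that runs before any tokens are spent. The required field names here are hypothetical examples, not the actual schema:&lt;/p&gt;

```python
# Hypothetical required schema for the prompt's input rows.
REQUIRED_FIELDS = {"date", "amount", "category"}

def validate_rows(rows: list[dict]) -> tuple[bool, str]:
    """Reject input before it ever reaches the LLM if required fields are missing."""
    for i, row in enumerate(rows):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            return False, f"row {i}: missing required fields {sorted(missing)}"
    return True, "ok"
```

&lt;p&gt;Reject early, validate thoroughly, fail fast—and the error message tells the user exactly which row and which field to fix.&lt;/p&gt;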

&lt;h2&gt;
  
  
  The Expertise Gradient Nobody Plans For
&lt;/h2&gt;

&lt;p&gt;I'm a developer. I understand technical terminology, can debug unexpected outputs, and know how to reframe questions when the AI misunderstands. I assumed users would too.&lt;/p&gt;

&lt;p&gt;They didn't.&lt;/p&gt;

&lt;p&gt;Some users were more technical than me—they pushed the system in ways I never imagined, found edge cases I'd never considered, and got frustrated when the AI couldn't handle advanced use cases.&lt;/p&gt;

&lt;p&gt;Other users had never used an AI before. They typed conversational queries expecting human understanding. They got confused by structured outputs. They didn't know how to provide the context the prompt needed.&lt;/p&gt;

&lt;p&gt;The same prompt that felt intuitive to me was simultaneously too rigid for experts and too open-ended for beginners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The failure:&lt;/strong&gt; I built one prompt for one user type (me). Real products have user gradients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually works:&lt;/strong&gt; Adaptive prompt strategies based on user signals. Track how users interact with the system. If someone consistently provides well-structured inputs, give them more flexibility. If someone struggles, add more guardrails and examples.&lt;/p&gt;

&lt;p&gt;Better yet, build &lt;a href="https://crompt.ai/chat/ai-tutor" rel="noopener noreferrer"&gt;different entry points for different use cases&lt;/a&gt;. Power users get a flexible prompt with minimal constraints. Beginners get a guided experience with clear examples and structured fields. Same underlying functionality, different interfaces for different expertise levels.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Token Cost Explosion
&lt;/h2&gt;

&lt;p&gt;In my terminal, token usage was predictable. I knew roughly how long my prompts were, what kind of outputs to expect, and could optimize accordingly.&lt;/p&gt;

&lt;p&gt;With real users, token usage became chaotic.&lt;/p&gt;

&lt;p&gt;Some users wrote three-word queries. Others pasted entire documents into the input field. One user uploaded a 50-page PDF and asked for "a summary"—the context window maxed out, the API call cost $12, and the output was truncated garbage.&lt;/p&gt;

&lt;p&gt;I'd estimated API costs based on my own usage patterns. Reality was 3.5x higher because I hadn't accounted for how real users actually behave.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Users don't think about tokens. They think about tasks. They'll paste everything that seems relevant because they don't know what the AI needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt; Input limiting isn't just about preventing abuse—it's about system sustainability. Set hard limits on input length. Show users the limits before they hit them. When someone tries to upload a massive document, don't just error—offer to &lt;a href="https://crompt.ai/chat/data-extractor" rel="noopener noreferrer"&gt;extract the relevant sections first&lt;/a&gt; or summarize before processing.&lt;/p&gt;

&lt;p&gt;I added tiered usage limits: free users get 2,000 tokens per request, paid users get 8,000. But more importantly, I added preprocessing. Large documents get automatically chunked and processed in pieces. Users see a warning when their input approaches the limit. The system suggests optimization strategies before hitting API rate limits.&lt;/p&gt;
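&lt;p&gt;The tiered limits and chunking can be sketched as follows. The character-per-token estimate is a crude approximation, and the naive fixed-width chunking is illustrative—a real system would split on document boundaries:&lt;/p&gt;

```python
# Token budgets per tier, matching the limits described above.
LIMITS = {"free": 2000, "paid": 8000}

def estimate_tokens(text: str) -> int:
    """Crude approximation: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def chunk_for_tier(text: str, tier: str) -> list[str]:
    """Split oversized input into pieces that fit the tier's token budget."""
    budget_chars = LIMITS[tier] * 4
    return [text[i:i + budget_chars] for i in range(0, len(text), budget_chars)]
```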

&lt;p&gt;Token cost control isn't about restricting users—it's about architecting systems that work sustainably at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Consistency Illusion
&lt;/h2&gt;

&lt;p&gt;My test prompts produced consistent outputs because I was testing with consistent inputs.&lt;/p&gt;

&lt;p&gt;Real users exposed something I'd missed: &lt;strong&gt;LLMs are probabilistic systems that produce different outputs for the same input.&lt;/strong&gt; Temperature settings, random seeds, and model updates all introduce variance that doesn't matter in development but breaks things in production.&lt;/p&gt;

&lt;p&gt;A user would run the same analysis twice and get different results. They'd complain that the system was "broken" because outputs weren't deterministic. I'd explain that "this is how LLMs work," which was technically true but practically useless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architectural mistake:&lt;/strong&gt; I treated the LLM like a pure function—same input, same output. That's not what LLMs are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Make non-determinism explicit in the UX. Show users they're getting AI-generated insights, not database queries. Add confidence scores. Offer multiple outputs and let users choose. Use lower temperature settings for tasks where consistency matters.&lt;/p&gt;

&lt;p&gt;For critical operations, I started running prompts through &lt;a href="https://crompt.ai/chat/claude-sonnet-45" rel="noopener noreferrer"&gt;multiple models&lt;/a&gt; and comparing outputs. If &lt;a href="https://crompt.ai/chat/claude-3-7-sonnet" rel="noopener noreferrer"&gt;Claude&lt;/a&gt; and &lt;a href="https://crompt.ai/chat/gemini-25-flash" rel="noopener noreferrer"&gt;Gemini&lt;/a&gt; agree, show that output. If they diverge significantly, flag it for review or present both options. This increased costs but dramatically reduced user confusion about inconsistent results.&lt;/p&gt;
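&lt;p&gt;The cross-model comparison can be sketched with a simple similarity threshold. The threshold value is an assumption, and in practice you would compare the two model calls' actual outputs rather than raw strings:&lt;/p&gt;

```python
import difflib

def agreement(a: str, b: str) -> float:
    """Similarity ratio between two model outputs, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def reconcile(a: str, b: str, threshold: float = 0.8) -> dict:
    """Return one answer when the models agree; flag both for review otherwise."""
    if agreement(a, b) >= threshold:
        return {"status": "agreed", "output": a}
    return {"status": "diverged", "outputs": [a, b]}
```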

&lt;h2&gt;
  
  
  The Error Message Disaster
&lt;/h2&gt;

&lt;p&gt;When my prompt failed in development, I got the raw API error and could debug it immediately.&lt;/p&gt;

&lt;p&gt;When users hit errors, they got messages like "An error occurred" or technical stack traces they couldn't interpret. They didn't know if the problem was their input, our system, or the AI itself. They couldn't fix it, and neither could support because the error messages contained no actionable information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; Error handling for LLM systems needs to be contextual, not technical. Users don't care that the API returned a 429 rate limit error. They care that they can't complete their task right now.&lt;/p&gt;

&lt;p&gt;I rewrote error handling to be user-facing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate limit errors → "We're experiencing high demand. Your request is queued and will complete in approximately 2 minutes."&lt;/li&gt;
&lt;li&gt;Context window exceeded → "Your input is too large. Try removing unnecessary details or breaking it into smaller sections."&lt;/li&gt;
&lt;li&gt;Invalid format → "The AI couldn't understand this input format. Here's an example of what works well: [example]"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Better yet, I added error recovery. When possible, the system automatically retries with adjusted parameters. If a prompt fails because the output is too long, reduce the requested length and try again. If the input is too large, automatically chunk it and process in pieces.&lt;/p&gt;
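&lt;p&gt;The translation layer can be sketched as a lookup from status code to the contextual messages above. The codes and wording here are illustrative:&lt;/p&gt;

```python
# Map raw API status codes to user-facing explanations (wording is illustrative).
USER_FACING = {
    429: "We're experiencing high demand. Your request is queued and will complete shortly.",
    413: "Your input is too large. Try removing unnecessary details or breaking it into smaller sections.",
}

def explain(status_code: int) -> str:
    """Tell the user what to do next, not which HTTP code the API returned."""
    return USER_FACING.get(status_code, "Something went wrong on our side. Please try again.")
```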

&lt;p&gt;Users don't care about your technical constraints. They care about completing their task.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt Iteration Trap
&lt;/h2&gt;

&lt;p&gt;In development, I iterated fast. Test a prompt, see the output, adjust, test again. This tight feedback loop made prompt engineering feel manageable.&lt;/p&gt;

&lt;p&gt;In production, iteration became dangerous. Every prompt change affected thousands of users simultaneously. A "small improvement" that worked great in testing broke workflows that depended on specific output formats. Users built integrations assuming consistent behavior. When I "improved" the prompt, I broke their systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mistake:&lt;/strong&gt; Treating prompts like code where you can just push updates and move on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works:&lt;/strong&gt; Version prompts like APIs. When you make breaking changes, don't just replace the old version—offer both. Let users opt into the new version gradually. Provide migration paths.&lt;/p&gt;

&lt;p&gt;I started maintaining multiple prompt versions simultaneously. New users get the latest version. Existing users stay on the version they started with unless they explicitly upgrade. This sounds like maintenance hell, but it's actually essential for systems where users build workflows around your outputs.&lt;/p&gt;
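&lt;p&gt;The pinning scheme can be sketched as a small registry. The prompt names and the &lt;code&gt;pinnedVersions&lt;/code&gt; field here are hypothetical:&lt;/p&gt;

```javascript
// Sketch: version prompts like APIs. New users get the latest version;
// existing users stay pinned to the one they started with.
// Prompt names and the pinnedVersions field are illustrative.
const PROMPTS = {
  'summarize@1': 'Summarize the following text in three bullet points:',
  'summarize@2': 'Return a JSON object {"bullets": [...]} summarizing the text:',
};
const LATEST = { summarize: 2 };

function promptFor(user, name) {
  // A user pinned to an old version keeps it until they explicitly upgrade.
  const pinned = user.pinnedVersions && user.pinnedVersions[name];
  const version = pinned || LATEST[name];
  return PROMPTS[`${name}@${version}`];
}
```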

&lt;p&gt;For significant changes, I started using &lt;a href="https://crompt.ai/chat" rel="noopener noreferrer"&gt;A/B testing&lt;/a&gt; to validate improvements before rolling them out. Send 10% of requests to the new prompt, compare performance metrics, gradually increase if results are better. This caught several "improvements" that worked well in isolated testing but performed worse with real usage patterns.&lt;/p&gt;
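&lt;p&gt;The 10% split needs to be sticky per user, or the same person bounces between prompt versions across requests. A minimal sketch, using a simple string hash (the hash choice is arbitrary, any stable one works):&lt;/p&gt;

```javascript
// Sketch: deterministically route a fixed percentage of users to the
// candidate prompt. Hashing the user id keeps each user in one bucket.
function bucket(userId, percent = 10) {
  let hash = 0;
  for (const ch of String(userId)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple rolling hash
  }
  return hash % 100 < percent ? 'candidate' : 'control';
}
```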

&lt;h2&gt;
  
  
  The Documentation Gap
&lt;/h2&gt;

&lt;p&gt;I didn't need documentation for my own prompts. I knew how they worked, what inputs they expected, what outputs they'd produce.&lt;/p&gt;

&lt;p&gt;Users had no idea. They'd send inputs in formats the system never anticipated, expect outputs in structures it never produced, and get frustrated when reality didn't match their mental model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The realization:&lt;/strong&gt; Prompts need documentation just like APIs. Not just "what this does" but "what inputs it accepts," "what outputs it produces," "what error cases to expect," and "how to interpret results."&lt;/p&gt;

&lt;p&gt;I added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Example inputs and outputs for every use case&lt;/li&gt;
&lt;li&gt;Clear field descriptions explaining what the AI needs&lt;/li&gt;
&lt;li&gt;Output format specifications users could rely on&lt;/li&gt;
&lt;li&gt;Troubleshooting guides for common issues&lt;/li&gt;
&lt;li&gt;Limitations documentation that is explicit about what won't work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduced support tickets by 60% because users could self-diagnose issues and adjust their inputs accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Scales
&lt;/h2&gt;

&lt;p&gt;The prompts that work at scale aren't the cleverest or most optimized. They're the ones built with defensive architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate inputs before they hit the LLM.&lt;/strong&gt; Don't let the AI figure out that the input is wrong—catch it earlier. Check file formats, validate required fields, reject invalid structures. Use &lt;a href="https://crompt.ai/chat/sentiment-analyzer" rel="noopener noreferrer"&gt;AI-powered validation&lt;/a&gt; to verify inputs match expected patterns before expensive API calls.&lt;/p&gt;
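&lt;p&gt;The cheap-checks-first idea looks roughly like this. The fields, formats, and limits are examples, not a real schema:&lt;/p&gt;

```javascript
// Sketch: a cheap validation gate in front of the LLM call.
// Fields, allowed formats, and limits are illustrative examples.
const MAX_INPUT_CHARS = 20000;

function validateRequest(req) {
  const errors = [];
  if (!req.text || typeof req.text !== 'string') {
    errors.push('Missing "text" field.');
  } else if (req.text.length > MAX_INPUT_CHARS) {
    errors.push(`Input exceeds ${MAX_INPUT_CHARS} characters. Break it into smaller sections.`);
  }
  if (req.format && !['json', 'markdown', 'plain'].includes(req.format)) {
    errors.push('Unsupported "format". Use json, markdown, or plain.');
  }
  // Only requests that pass this gate pay for an API call.
  return { ok: errors.length === 0, errors };
}
```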

&lt;p&gt;&lt;strong&gt;Make implicit assumptions explicit.&lt;/strong&gt; Every assumption your prompt makes should be stated clearly. Don't assume users understand technical terminology. Don't assume they know what format you expect. Spell it out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design for the worst case, not the average case.&lt;/strong&gt; Your prompt will encounter edge cases you never imagined. Build fallbacks for when assumptions fail. Handle malformed inputs gracefully. Degrade functionality rather than fail completely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat prompts as contracts, not conversations.&lt;/strong&gt; Define clear input/output specifications. Version changes carefully. Document behavior explicitly. Let users depend on consistency where it matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build observability into your prompt architecture.&lt;/strong&gt; Track what inputs users actually send. Monitor where prompts fail. Measure token usage per user. Log edge cases that break assumptions. Use this data to improve continuously.&lt;/p&gt;
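&lt;p&gt;Concretely, that can be as simple as one structured record per prompt call. Field names and the console sink here are illustrative; plug in your own logger or metrics pipeline:&lt;/p&gt;

```javascript
// Sketch: emit one structured record per prompt call so failure
// patterns and token costs are queryable later.
function logPromptCall({ userId, promptVersion, inputChars, tokensUsed, outcome }) {
  const record = {
    ts: new Date().toISOString(),
    userId,
    promptVersion,
    inputChars,
    tokensUsed,
    outcome, // e.g. 'ok', 'retried', 'failed'
  };
  console.log(JSON.stringify(record));
  return record;
}
```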

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Scaling LLM prompts beyond a single user isn't a technical problem—it's a systems design problem.&lt;/p&gt;

&lt;p&gt;The prompt engineering skills that work in your terminal—rapid iteration, implicit context, flexible outputs—become liabilities in production. What you need instead is API thinking: clear contracts, defensive validation, versioned changes, comprehensive documentation.&lt;/p&gt;

&lt;p&gt;Your personal prompt solves your specific problem with your understanding of context. A production prompt solves thousands of variations of similar problems for users who understand nothing about how it works.&lt;/p&gt;

&lt;p&gt;That's not a harder prompt to write. That's a different system to architect.&lt;/p&gt;

&lt;p&gt;-ROHIT&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>How a single React hook destroyed my app’s performance</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Tue, 13 Jan 2026 09:41:37 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/how-a-single-react-hook-destroyed-my-apps-performance-2fc0</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/how-a-single-react-hook-destroyed-my-apps-performance-2fc0</guid>
      <description>&lt;p&gt;The app loaded in 180 milliseconds on Friday. By Monday morning, the same page took eleven seconds.&lt;/p&gt;

&lt;p&gt;Nothing had changed in production. No new deployments. No infrastructure issues. No traffic spikes. Just a single developer, me, adding what seemed like an innocuous React hook to track user interactions.&lt;/p&gt;

&lt;p&gt;One &lt;code&gt;useEffect&lt;/code&gt;. Seventeen lines of code. And suddenly our entire application was unusable.&lt;/p&gt;

&lt;p&gt;This is the story of how I learned that React's mental model and actual performance characteristics are two completely different things. And how the abstraction that makes React easy to use is the same abstraction that makes it easy to accidentally destroy performance at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Innocent Addition
&lt;/h2&gt;

&lt;p&gt;We were building an analytics feature. Simple requirement: track when users viewed certain components. Marketing wanted to know which features got the most attention. Product wanted engagement metrics. Engineering wanted to ship and move on.&lt;/p&gt;

&lt;p&gt;I added a custom hook called &lt;code&gt;useViewTracking&lt;/code&gt;. Clean, reusable, following all the React best practices I'd learned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;useViewTracking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;isVisible&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;setIsVisible&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;observer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;IntersectionObserver&lt;/span&gt;&lt;span class="p"&gt;(([&lt;/span&gt;&lt;span class="nx"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;setIsVisible&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isIntersecting&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;observer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;observer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;disconnect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isVisible&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;trackEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;component_viewed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;componentId&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;isVisible&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks reasonable, right? I thought so too. The code reviews passed. Tests passed. It worked perfectly in development.&lt;/p&gt;

&lt;p&gt;Then I dropped this hook into our dashboard component. The dashboard rendered a list of cards—anywhere from 50 to 200 items depending on the user. Each card called &lt;code&gt;useViewTracking&lt;/code&gt; to log when it entered the viewport.&lt;/p&gt;

&lt;p&gt;The first user to hit this new version of the dashboard waited eleven seconds for the page to render. Then filed a bug report. Then another user. Then another.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Missed
&lt;/h2&gt;

&lt;p&gt;React hooks are elegant. They let you compose behavior, share logic, and write functional components that feel clean and declarative. But that elegance hides computational cost that isn't obvious until it explodes in production.&lt;/p&gt;

&lt;p&gt;Every card that mounted created a new &lt;code&gt;IntersectionObserver&lt;/code&gt;. In a list of 200 cards, that meant 200 separate observers watching 200 separate DOM elements. Each observer triggered state updates, each state update triggered a re-render, and every card scrolling into or out of view kept the cycle going.&lt;/p&gt;

&lt;p&gt;I had created a runaway performance cascade hidden inside what looked like simple, idiomatic React code.&lt;/p&gt;

&lt;p&gt;The mental model React teaches you is: "Components are functions. State updates trigger renders. Effects run after renders. Trust the framework."&lt;/p&gt;

&lt;p&gt;What it doesn't teach you: "Every hook call has cost. Every state update ripples through the tree. Every effect creates overhead. And when you multiply this by hundreds of components, the framework can't save you."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rendering Trap
&lt;/h2&gt;

&lt;p&gt;React's reconciliation algorithm is optimized for frequent, small updates. It's designed around the idea that most changes affect small parts of the tree. When you violate this assumption—when you create patterns that cause widespread re-renders—performance collapses in ways that aren't immediately obvious.&lt;/p&gt;

&lt;p&gt;My hook violated this in three ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, every visibility change triggered a state update, which triggered a re-render and re-ran the tracking effect. The cleanup ran, but the constant churn of state updates and effect work during scrolling was overwhelming on its own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second&lt;/strong&gt;, the hook was called in every child card, and during scrolling dozens of cards crossed the viewport boundary at once. React had to diff and reconcile large portions of the dashboard nearly continuously as visibility state flipped across the list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third&lt;/strong&gt;, I was creating 200 IntersectionObservers when one would suffice. Each observer maintained its own callback, its own state, its own connection to the browser's layout engine.&lt;/p&gt;

&lt;p&gt;The browser's developer tools showed the problem clearly once I knew what to look for: thousands of layout recalculations per second, massive memory allocation for observer callbacks, and React spending more time reconciling than rendering.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;The fix wasn't better React code. It was questioning whether React's patterns were the right tool for this problem.&lt;/p&gt;

&lt;p&gt;Instead of a hook per component, I created a single global IntersectionObserver managed outside React's lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ViewTracker&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;observer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;IntersectionObserver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;handleIntersection&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tracked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tracked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;observer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;untrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tracked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;observer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unobserve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;handleIntersection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isIntersecting&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tracked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;trackEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;component_viewed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ViewTracker&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I used a simple ref-based hook that didn't trigger renders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;useViewTracking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useRef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;untrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;componentId&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load time dropped from eleven seconds to 190 milliseconds. Memory usage decreased by 80%. The user experience went from broken to invisible.&lt;/p&gt;

&lt;p&gt;The code was less "React-idiomatic" but infinitely more performant. Sometimes the framework's best practices aren't best for your use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern That Keeps Breaking Apps
&lt;/h2&gt;

&lt;p&gt;This isn't unique to my analytics hook. I see the same pattern everywhere:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State updates in loops.&lt;/strong&gt; Developers map over arrays and trigger state updates for each item. React has to reconcile every update, even if they could be batched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effects without dependencies.&lt;/strong&gt; Hooks that run on every render because developers didn't understand the dependency array, creating infinite update cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context overuse.&lt;/strong&gt; Wrapping entire apps in context providers, then wondering why unrelated components re-render when context values change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memo misuse.&lt;/strong&gt; Aggressively memoizing everything thinking it helps, not realizing that memo itself has cost and often prevents optimizations React could make naturally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom hooks that hide complexity.&lt;/strong&gt; Beautiful, reusable hooks that abstract away performance problems until you use them at scale.&lt;/p&gt;

&lt;p&gt;The React documentation teaches patterns that work for small apps. It doesn't prepare you for what happens when those patterns hit production scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Framework Hides
&lt;/h2&gt;

&lt;p&gt;Modern frameworks sell you on developer experience. Write declarative code. Let the framework handle the hard parts. Trust the abstractions.&lt;/p&gt;

&lt;p&gt;But abstractions leak. And in React, they leak performance.&lt;/p&gt;

&lt;p&gt;When you write &lt;code&gt;useState&lt;/code&gt;, you're not just declaring a variable. You're registering that component with React's state management system, creating subscription relationships, and setting up re-render triggers.&lt;/p&gt;

&lt;p&gt;When you write &lt;code&gt;useEffect&lt;/code&gt;, you're not just running side effects. You're creating lifecycle hooks that React must track, schedule, and execute in specific order relative to rendering.&lt;/p&gt;

&lt;p&gt;When you create custom hooks, you're not just extracting logic. You're composing these state registrations and effect subscriptions in ways that multiply their cost.&lt;/p&gt;

&lt;p&gt;The framework makes it easy to write code that works. It doesn't make it easy to write code that performs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools That Actually Help
&lt;/h2&gt;

&lt;p&gt;When performance collapses, you need visibility into what React is actually doing. The browser's performance profiler shows symptoms. React DevTools shows the cause.&lt;/p&gt;

&lt;p&gt;I use tools that help me &lt;a href="https://crompt.ai/chat/data-extractor" rel="noopener noreferrer"&gt;understand code behavior at a system level&lt;/a&gt;, not just at a component level. When debugging performance, I need to see the entire render tree, track state updates across components, and understand how my hooks compose.&lt;/p&gt;

&lt;p&gt;For complex logic, I'll sometimes use &lt;a href="https://crompt.ai/chat/claude-sonnet-45" rel="noopener noreferrer"&gt;AI assistants to analyze patterns&lt;/a&gt; in my code that might cause performance issues. Not to write the code for me, but to spot patterns I've stopped noticing because they've become habitual.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://crompt.ai/chat/gemini-25-flash" rel="noopener noreferrer"&gt;Gemini 2.5 Flash&lt;/a&gt; model is particularly good at identifying anti-patterns in React code when you give it context about your component structure and ask it to spot potential performance bottlenecks.&lt;/p&gt;

&lt;p&gt;But the most valuable tool is changing how you think about React. Stop trusting the framework blindly. Start questioning whether its patterns serve your use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Lessons
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lesson one: Hooks have cost.&lt;/strong&gt; Every &lt;code&gt;useState&lt;/code&gt; and &lt;code&gt;useEffect&lt;/code&gt; adds overhead. When you multiply that across hundreds of components, the cost compounds. Design accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson two: React's reconciliation is optimized for specific patterns.&lt;/strong&gt; Frequent small updates in isolated components work well. Widespread state changes that ripple through large trees don't. Know which pattern your code creates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson three: Idiomatic code isn't always performant code.&lt;/strong&gt; The React way isn't always the right way. Sometimes you need to break the abstraction and use refs, vanilla JavaScript, or patterns the documentation doesn't recommend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson four: Developer experience and user experience aren't the same.&lt;/strong&gt; Code that's pleasant to write can create terrible user experiences. Optimize for the user, not for the developer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson five: Performance problems hide in abstraction.&lt;/strong&gt; Custom hooks, higher-order components, and context providers all abstract away complexity. That complexity doesn't disappear—it just becomes invisible until it breaks things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Should Check Today
&lt;/h2&gt;

&lt;p&gt;Open your codebase. Look for these patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hooks in loops or maps.&lt;/strong&gt; Every array item that calls a hook creates separate state management overhead. Consider whether that's necessary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State updates that could be batched.&lt;/strong&gt; React 18 helps with automatic batching, but you can still create patterns that bypass it. Consolidate related state updates.&lt;/p&gt;
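&lt;p&gt;The consolidation idea, stripped down to plain JavaScript (in a React component this would be a single &lt;code&gt;setState&lt;/code&gt; or &lt;code&gt;useReducer&lt;/code&gt; dispatch rather than one setter call per item):&lt;/p&gt;

```javascript
// Sketch: compute the next state in one pass and apply it once,
// instead of one setter call (and one potential render) per item.
function applyUpdates(state, updates) {
  return updates.reduce(
    (next, { id, value }) => ({ ...next, [id]: value }),
    state
  );
}
```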

&lt;p&gt;&lt;strong&gt;Effects without cleanup.&lt;/strong&gt; If your effect creates subscriptions, observers, or timers, it needs cleanup. Missing cleanup creates memory leaks that compound over renders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context that changes frequently.&lt;/strong&gt; Context is convenient but expensive. Every context update re-renders every consumer. Consider whether state lifting or props would be more efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom hooks you haven't profiled.&lt;/strong&gt; Your beautiful reusable hooks might be performance disasters waiting to happen. Profile them under realistic load before using them widely.&lt;/p&gt;

&lt;p&gt;Use platforms like &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; to compare different approaches to the same problem. Sometimes the best way to spot performance issues is to see multiple solutions side-by-side and understand their tradeoffs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Reality
&lt;/h2&gt;

&lt;p&gt;React makes it easy to build UIs. It doesn't make it easy to build performant UIs.&lt;/p&gt;

&lt;p&gt;The mental models the framework teaches—components as functions, render as pure computation, state as automatic updates—these work until they don't. And when they don't, you need to understand what's actually happening under the abstraction.&lt;/p&gt;

&lt;p&gt;Every hook call has cost. Every state update triggers work. Every effect creates overhead. At small scale, this doesn't matter. At production scale, it's the difference between apps that feel instant and apps that feel broken.&lt;/p&gt;

&lt;p&gt;The developers who build fast React apps aren't the ones who know the most hooks. They're the ones who know when not to use them.&lt;/p&gt;

&lt;p&gt;Your framework will lie to you. It will make bad patterns look reasonable. It will hide performance problems until they explode in production.&lt;/p&gt;

&lt;p&gt;The question is whether you'll catch them before your users file bug reports.&lt;/p&gt;

&lt;p&gt;-ROHIT&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
<title>Image Generation APIs Compared: DALL·E 3 vs SD 3.5 vs Ideogram in Production</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Mon, 12 Jan 2026 10:22:37 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/image-generation-apis-compared-dalle-3-vs-sd-35-vs-ideogram-in-production-321b</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/image-generation-apis-compared-dalle-3-vs-sd-35-vs-ideogram-in-production-321b</guid>
      <description>&lt;p&gt;We spent $4,000 generating 10,000 images across three different APIs to figure out which one actually works in production. Not which one has the best cherry-picked examples in their marketing materials. Which one consistently delivers usable results when your users are typing unpredictable prompts at 3 AM.&lt;/p&gt;

&lt;p&gt;The answer surprised us. And cost us.&lt;/p&gt;

&lt;p&gt;Most comparisons of image generation APIs focus on the wrong metrics. They compare the best possible outputs each system can produce under ideal conditions. They analyze prompt adherence on carefully crafted test cases. They evaluate aesthetic quality on curated samples.&lt;/p&gt;

&lt;p&gt;But production isn't ideal conditions. Production is messy prompts from non-technical users. Production is edge cases you never anticipated. Production is the difference between "this looks amazing in our demo" and "why are all my users complaining?"&lt;/p&gt;

&lt;p&gt;Here's what we learned spending real money generating real images for a real product.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Test Setup
&lt;/h2&gt;

&lt;p&gt;We built a feature that generates custom social media graphics based on user descriptions. Simple concept: user describes what they want, AI generates it, user downloads and shares. The kind of feature that looks trivial in a prototype and becomes complex at scale.&lt;/p&gt;

&lt;p&gt;We tested three APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DALL·E 3&lt;/strong&gt; through OpenAI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable Diffusion 3.5&lt;/strong&gt; through Stability AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ideogram v2&lt;/strong&gt; through their API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We ran the same 10,000 prompts through every API, drawn from actual user requests we'd collected. Not synthetic test cases—real prompts from real users who don't know or care about optimal prompting techniques.&lt;/p&gt;

&lt;p&gt;We measured five things that actually matter in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Success rate (percentage of generations that were usable)&lt;/li&gt;
&lt;li&gt;Prompt adherence (did it generate what was requested)&lt;/li&gt;
&lt;li&gt;Consistency (similar prompts produced similar outputs)&lt;/li&gt;
&lt;li&gt;Cost per usable image&lt;/li&gt;
&lt;li&gt;Latency (time to generate)&lt;/li&gt;
&lt;/ol&gt;
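
&lt;p&gt;Tracking those five metrics amounts to a per-provider scorecard. A minimal sketch in Python; the field names and example numbers are illustrative, not our actual pipeline:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class ProviderStats:
    """Aggregate counters for one image API, updated per generation."""
    name: str
    attempts: int = 0
    usable: int = 0          # passed the quality check
    total_cost: float = 0.0  # sum of per-image API charges
    total_latency: float = 0.0

    def record(self, usable: bool, cost: float, latency: float) -> None:
        self.attempts += 1
        self.usable += int(usable)
        self.total_cost += cost
        self.total_latency += latency

    @property
    def success_rate(self) -> float:
        return self.usable / self.attempts if self.attempts else 0.0

    @property
    def cost_per_usable(self) -> float:
        # What a good image really costs once failed generations are paid for.
        return self.total_cost / self.usable if self.usable else float("inf")

# Illustration with DALL-E-like numbers: 87 usable out of 100 at $0.04 each.
stats = ProviderStats("dalle3")
for i in range(100):
    stats.record(usable=(i in range(87)), cost=0.04, latency=8.2)
print(round(stats.cost_per_usable, 3))  # 0.046
```

&lt;p&gt;The &lt;code&gt;cost_per_usable&lt;/code&gt; property is the number the pricing pages never show you, and it is where the per-usable-image figures later in this post come from.&lt;/p&gt;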

&lt;p&gt;The results weren't what the marketing materials promised.&lt;/p&gt;

&lt;h2&gt;
  
  
  DALL·E 3: The Reliable Workhorse
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it's good at:&lt;/strong&gt; DALL·E 3 consistently produces decent results across the widest variety of prompts. It's the Toyota Camry of image generation—not the most exciting, but reliably gets you where you need to go.&lt;/p&gt;

&lt;p&gt;Our success rate with &lt;a href="https://crompt.ai/image-tool/ai-image-generator?id=48" rel="noopener noreferrer"&gt;DALL·E 3&lt;/a&gt; was 87%. Out of every 100 generations, 87 were usable with minimal or no regeneration. That's significantly higher than the others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt understanding is DALL·E 3's killer feature.&lt;/strong&gt; Users would type vague descriptions like "make something cool for my coffee shop" and DALL·E would interpret that into something coherent. It understood implied context better than the alternatives.&lt;/p&gt;

&lt;p&gt;The aesthetic is distinctly "AI-generated" in a way that's immediately recognizable. There's a certain smoothness and polish that screams "this came from DALL·E." For some use cases, that's fine. For others, it's limiting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text rendering is where DALL·E 3 beats SD 3.5 decisively.&lt;/strong&gt; If your users need text in images—signs, logos, captions—DALL·E 3 gets it right far more consistently. Not perfectly, but significantly better. We tested 500 prompts requiring text, and DALL·E produced readable, correctly spelled text 73% of the time; SD 3.5 managed 41%. (Ideogram is the exception on text, and gets its own section below.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $0.04 per image (1024x1024 standard quality). With an 87% success rate, we're paying $0.046 per usable image when accounting for regenerations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Average 8.2 seconds from request to image delivery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; DALL·E 3's content policy is aggressive. Innocent prompts get rejected regularly. "Person wearing business suit" sometimes gets flagged. "Child playing in park" is a coin flip. We had a 12% rejection rate on prompts that weren't even remotely inappropriate. Each rejection costs user trust and requires fallback handling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stable Diffusion 3.5: The Customizable Chaos
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it's good at:&lt;/strong&gt; When you need specific aesthetic control and are willing to work for it, &lt;a href="https://crompt.ai/image-tool/ai-image-generator?id=51" rel="noopener noreferrer"&gt;SD 3.5&lt;/a&gt; gives you more levers to pull than the alternatives.&lt;/p&gt;

&lt;p&gt;Our success rate was 64%. SD 3.5 produces more variance—the highs are higher, but the lows are lower. When it hits, it really hits. When it misses, it misses spectacularly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The aesthetic flexibility is real.&lt;/strong&gt; SD 3.5 better understands art styles, photography techniques, and compositional instructions. Prompts like "shot on Kodak Portra 400, shallow depth of field" actually influence the output in meaningful ways. DALL·E largely ignores these details.&lt;/p&gt;

&lt;p&gt;But this flexibility comes with a steep learning curve. Users who understand photography and art direction get great results. Users who just want "a picture of my product" get unpredictable outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text rendering is rough.&lt;/strong&gt; SD 3.5 attempts text but regularly produces gibberish. We saw improvements over earlier versions, but it's still significantly behind DALL·E 3. Budget for post-processing if text matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The community ecosystem is SD's secret weapon.&lt;/strong&gt; You can use custom models, LoRAs, and fine-tuned versions for specific use cases. But that flexibility means more operational complexity. You're not just calling an API—you're managing model versions, weights, and configurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Varies wildly based on hosting (self-hosted vs. cloud). Through Stability AI's API: $0.065 per image. With a 64% success rate, actual cost per usable image is $0.102.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Average 12.7 seconds through their API. Self-hosting can be faster but adds infrastructure costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; Consistency is a problem. The same prompt generates significantly different images on different days. This makes testing and debugging frustrating. Users complain about not being able to recreate results they liked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ideogram v2: The Text Specialist
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it's good at:&lt;/strong&gt; If text rendering is your primary concern, &lt;a href="https://crompt.ai/image-tool/ai-image-generator?id=56" rel="noopener noreferrer"&gt;Ideogram v2&lt;/a&gt; deserves serious consideration despite being the newest player.&lt;/p&gt;

&lt;p&gt;Our success rate was 71%. Middle of the pack, but with interesting specializations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text rendering is genuinely impressive.&lt;/strong&gt; Ideogram focuses specifically on getting text right, and it shows. Complex layouts with multiple text elements, logos, and signage work better here than anywhere else. In our text-heavy tests, Ideogram produced usable results 79% of the time—better than DALL·E 3 and dramatically better than SD 3.5.&lt;/p&gt;

&lt;p&gt;The trade-off is that general image quality sometimes suffers. Images without text requirements often feel less polished than DALL·E 3 equivalents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Style diversity is growing but limited.&lt;/strong&gt; Ideogram has fewer distinct aesthetic modes than SD 3.5. Most outputs have a similar look and feel, which might be fine for consistent brand work but limiting for diverse use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $0.08 per image (high resolution). With a 71% success rate, we're at $0.113 per usable image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Average 9.4 seconds. Faster than SD 3.5, slightly slower than DALL·E 3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; The API is less mature. Documentation is sparse. Error messages are vague. Rate limits are stricter. We hit production issues that required support tickets to resolve—not ideal when you're shipping features.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Matters in Production
&lt;/h2&gt;

&lt;p&gt;The benchmark numbers don't tell the whole story. Here's what we learned trying to productionize each option:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error handling complexity varies dramatically.&lt;/strong&gt; DALL·E 3 returns clear error codes and actionable messages. SD 3.5 sometimes times out without explanation. Ideogram occasionally returns 500 errors with no details. Your error handling needs to account for this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limits hit differently.&lt;/strong&gt; DALL·E 3's rate limits are per-key and predictable. SD 3.5's limits depend on your tier and aren't always enforced consistently. Ideogram's limits are strict and poorly documented. Plan your scaling strategy accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image storage costs matter.&lt;/strong&gt; All three APIs return URLs that expire. You need to download and store images yourself. At 10,000 images per week, that's 40GB of storage growing continuously. Budget for CDN and storage costs beyond the API fees.&lt;/p&gt;
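
&lt;p&gt;Because every provider hands back an expiring URL, the download-and-store step has to be part of the request path, not an afterthought. A stdlib-only sketch; the hashing scheme and local-disk target are our illustration (production would write to S3 behind a CDN):&lt;/p&gt;

```python
import hashlib
import urllib.request
from pathlib import Path

def storage_key(image_url: str) -> str:
    """Derive a stable filename from the (expiring) source URL."""
    digest = hashlib.sha256(image_url.encode("utf-8")).hexdigest()
    return digest[:16] + ".png"

def persist_image(image_url: str, root: Path) -> Path:
    """Download an image before its signed URL expires and store it locally."""
    root.mkdir(parents=True, exist_ok=True)
    dest = root / storage_key(image_url)
    if not dest.exists():  # idempotent: re-delivered URLs are not re-fetched
        with urllib.request.urlopen(image_url, timeout=30) as resp:
            dest.write_bytes(resp.read())
    return dest
```

&lt;p&gt;The idempotency check matters: providers sometimes re-deliver the same URL on retries, and duplicate objects inflate that 40GB-per-week figure even further.&lt;/p&gt;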

&lt;p&gt;&lt;strong&gt;Content policy compliance isn't optional.&lt;/strong&gt; Even if you have permissive use cases, you need moderation. We built &lt;a href="https://crompt.ai/chat/sentiment-analyzer" rel="noopener noreferrer"&gt;moderation checks&lt;/a&gt; into our pipeline to catch problematic user prompts before they hit the API. This saved us from repeated policy violations and potential account suspensions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture We Built
&lt;/h2&gt;

&lt;p&gt;After testing all three, we didn't choose one—we use all three strategically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Default to DALL·E 3&lt;/strong&gt; for general use cases. Highest success rate, best prompt understanding, most reliable text rendering for simple cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Route to Ideogram&lt;/strong&gt; when detecting text-heavy prompts (users mention "logo," "sign," "text," "words"). Their specialized text rendering justifies the higher cost and lower general image quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fall back to SD 3.5&lt;/strong&gt; for style-specific requests when users indicate they want particular aesthetics ("photorealistic," "oil painting," "anime style"). Accept the higher failure rate in exchange for better aesthetic control.&lt;/p&gt;

&lt;p&gt;This routing logic lives in a thin abstraction layer. We parse the user prompt, score it for text requirements and style specificity, then route to the appropriate API. Users don't see which API generated their image—they just get better results.&lt;/p&gt;
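
&lt;p&gt;A condensed version of that routing layer. The keyword lists and return values here are illustrative; our production scorer weights signals instead of doing boolean matches:&lt;/p&gt;

```python
TEXT_HINTS = ("logo", "sign", "text", "words", "caption", "poster")
STYLE_HINTS = ("photorealistic", "oil painting", "anime",
               "watercolor", "shot on", "depth of field")

def route(prompt: str) -> str:
    """Pick a provider: Ideogram for text-heavy prompts, SD 3.5 for
    style-specific ones, DALL-E 3 as the reliable default."""
    p = prompt.lower()
    if any(hint in p for hint in TEXT_HINTS):
        return "ideogram"
    if any(hint in p for hint in STYLE_HINTS):
        return "sd35"
    return "dalle3"

print(route("A logo for my coffee shop"))        # ideogram
print(route("photorealistic mountain at dawn"))  # sd35
print(route("something cool for my bakery"))     # dalle3
```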

&lt;p&gt;The abstraction layer also handles retries. If DALL·E rejects a prompt, we automatically sanitize and retry. If that fails, we fall back to SD 3.5 with a modified prompt. This multi-layer fallback improved our overall success rate from 87% (DALL·E alone) to 94% (all three with routing logic).&lt;/p&gt;
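
&lt;p&gt;The sanitize-and-fall-back behaviour is a small loop over an ordered provider list. Everything below is a sketch: &lt;code&gt;sanitize&lt;/code&gt; and the fake provider callables stand in for real API clients and real policy errors:&lt;/p&gt;

```python
def sanitize(prompt: str) -> str:
    """Hypothetical cleanup pass: drop words that commonly trip content filters."""
    blocked = {"weapon", "blood"}
    return " ".join(w for w in prompt.split() if w.lower() not in blocked)

def generate_with_fallback(prompt: str, providers):
    """Try each provider in order; on a rejection, sanitize once and retry
    before moving on. Returns (provider_name, image)."""
    for name, call in providers:
        for attempt in (prompt, sanitize(prompt)):
            try:
                return name, call(attempt)
            except ValueError:  # stand-in for a policy-rejection error
                continue
    raise RuntimeError("all providers rejected the prompt")

# Fake providers: this "dalle3" rejects any prompt mentioning a weapon.
def fake_dalle3(p):
    if "weapon" in p:
        raise ValueError("content policy")
    return "dalle3:" + p

def fake_sd35(p):
    return "sd35:" + p

name, image = generate_with_fallback(
    "knight holding a weapon",
    [("dalle3", fake_dalle3), ("sd35", fake_sd35)],
)
print(name)  # dalle3: the sanitized retry succeeded
```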

&lt;p&gt;&lt;strong&gt;We use &lt;a href="https://crompt.ai/chat/improve-text" rel="noopener noreferrer"&gt;prompt optimization&lt;/a&gt; to preprocess user input.&lt;/strong&gt; Raw user prompts are often vague or poorly structured. Running them through an LLM to expand and clarify before hitting the image API improved success rates by 11% across all three systems.&lt;/p&gt;
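
&lt;p&gt;That preprocessing step is one extra model call before the image request. A sketch with an injected &lt;code&gt;llm&lt;/code&gt; callable standing in for whichever chat-completion client you use; the template wording is made up:&lt;/p&gt;

```python
EXPAND_TEMPLATE = (
    "Rewrite this image request as a detailed, concrete image-generation "
    "prompt. Keep the user's intent; add subject, style, and composition. "
    "Request: {raw}"
)

def preprocess(raw_prompt: str, llm) -> str:
    """Expand a vague user prompt via an LLM before hitting the image API.

    llm is any callable from str to str; in production it wraps a real
    chat-completion client.
    """
    expanded = llm(EXPAND_TEMPLATE.format(raw=raw_prompt)).strip()
    # Guard: if the LLM returns something degenerate, keep the original.
    return expanded if len(expanded) >= len(raw_prompt) else raw_prompt
```

&lt;p&gt;Injecting the client as a callable keeps the step testable and lets the same preprocessing run in front of all three image APIs.&lt;/p&gt;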

&lt;p&gt;&lt;strong&gt;Monitoring is critical.&lt;/strong&gt; We track success rates, latency, and cost per API in real-time. When one API's performance degrades, we automatically shift traffic to alternatives. This saved us during an SD 3.5 outage last month—our users never noticed because we'd already routed traffic to DALL·E 3.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Costs
&lt;/h2&gt;

&lt;p&gt;The per-image API cost is just the beginning. Here's what actually adds up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regeneration costs:&lt;/strong&gt; Even with an 87% success rate, that 13% failure rate means 1,300 failed generations per 10,000 attempts. That's $52 in wasted API calls just from DALL·E. Across all three APIs with our routing logic, we spend about $180/month on failed generations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage costs:&lt;/strong&gt; At 40GB of new images per week and $0.023/GB-month on S3, the bill compounds as the library grows; ours is already around $41/month and climbing. Plus CloudFront CDN costs for serving images to users. Our total storage and delivery costs are now higher than our API costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing time:&lt;/strong&gt; Our pre-processing (prompt optimization) and post-processing (quality checks, moderation) add 3-5 seconds of latency on top of API generation time. This means our effective latency is 11-15 seconds from user request to image display.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support burden:&lt;/strong&gt; Despite all our optimization, 6% of generations still fail or produce unusable results. That generates support tickets. We now have clear user-facing messaging for failures and offer manual regeneration with human review for edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If we rebuilt this feature today, here's what we'd change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with a simple comparison interface.&lt;/strong&gt; We built our entire routing logic based on assumptions about which API was best for which use case. We should have started with a UI that let users &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;compare outputs from different models&lt;/a&gt; side-by-side and choose what they preferred. User preference data would have guided our routing logic better than our engineering assumptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest more in prompt engineering tooling.&lt;/strong&gt; The quality difference between raw user prompts and optimized prompts is massive—often more impactful than choosing the right API. We should have built better prompt preprocessing earlier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build for multi-provider from day one.&lt;/strong&gt; We initially integrated only DALL·E 3, then added the others later. Refactoring to support multiple providers was painful. Starting with an abstraction layer designed for multiple backends would have saved weeks of work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget for storage and CDN from the start.&lt;/strong&gt; We severely underestimated these costs. They're now a larger line item than the API costs themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Recommendation
&lt;/h2&gt;

&lt;p&gt;If you're building image generation into a product, here's what I'd actually recommend:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with DALL·E 3.&lt;/strong&gt; It's the most reliable, has the best documentation, and will work for 80% of use cases. Get to production with one provider first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add Ideogram if text rendering is critical.&lt;/strong&gt; If your users regularly need text in images (social media graphics, posters, signage), Ideogram's specialized capabilities justify the integration effort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider SD 3.5 only if you need aesthetic control and have sophisticated users.&lt;/strong&gt; The complexity isn't worth it for general use cases, but for products where style matters and users understand prompting, it's powerful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the abstraction layer early.&lt;/strong&gt; Don't couple your application logic to a specific provider's API. You'll want flexibility to switch or route between providers as their capabilities and costs evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest in prompt preprocessing.&lt;/strong&gt; Running user prompts through an LLM to clarify and optimize before hitting the image API will improve your results more than any other single change. Tools like &lt;a href="https://crompt.ai/chat/claude-sonnet-45" rel="noopener noreferrer"&gt;Claude Sonnet 4.5&lt;/a&gt; excel at this—they understand user intent and can restructure prompts for better image generation outcomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost
&lt;/h2&gt;

&lt;p&gt;After three months in production with 10,000 images generated per week, our total monthly costs break down to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API costs: $1,840&lt;/li&gt;
&lt;li&gt;Storage + CDN: $580&lt;/li&gt;
&lt;li&gt;Failed generations: $180&lt;/li&gt;
&lt;li&gt;Infrastructure (servers, monitoring): $320&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: $2,920/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
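
&lt;p&gt;Sanity-checking the arithmetic on those line items (10,000 images a week is roughly 40,000 a month):&lt;/p&gt;

```python
monthly = {
    "api": 1840,
    "storage_cdn": 580,
    "failed_generations": 180,
    "infrastructure": 320,
}
images_per_month = 10_000 * 4  # 10k images per week

total = sum(monthly.values())
print(total)                               # 2920
print(round(total / images_per_month, 3))  # 0.073
```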

&lt;p&gt;That's $0.073 per generated image when you account for all costs, not just the API call. And that doesn't include engineering time for maintenance, monitoring, and responding to issues.&lt;/p&gt;

&lt;p&gt;The indie developer fantasy of "just call the API and ship it" doesn't survive contact with production. Image generation at scale requires thoughtful architecture, multi-provider strategies, robust error handling, and ongoing operational attention.&lt;/p&gt;

&lt;p&gt;But when it works—when users generate images that genuinely help them, when your success rate stays above 90%, when your costs are predictable and your latency is acceptable—it's worth the complexity.&lt;/p&gt;

&lt;p&gt;Just don't expect it to be as simple as the tutorials make it look.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building with image generation APIs? Try &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; to compare outputs across models before committing to a provider. Test with real prompts, measure real costs, make real decisions.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>comparison</category>
    </item>
    <item>
      <title>Why AI Breaks Down in Long-Lived Systems (And What Devs Miss)</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Fri, 09 Jan 2026 06:25:47 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/why-ai-breaks-down-in-long-lived-systems-and-what-devs-miss-5gfl</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/why-ai-breaks-down-in-long-lived-systems-and-what-devs-miss-5gfl</guid>
      <description>&lt;p&gt;Six months after launch, our AI-powered feature stopped working. Not in the obvious, everything-crashes-and-burns way. It just started getting worse. Subtly, progressively, almost imperceptibly.&lt;/p&gt;

&lt;p&gt;Users complained that responses were less accurate. Support tickets mentioned "weird recommendations." The AI that had been 92% accurate in testing was now hovering around 73% in production. No code had changed. No models had been updated. The system was running exactly as we'd built it.&lt;/p&gt;

&lt;p&gt;And that was the problem.&lt;/p&gt;

&lt;p&gt;We had treated AI like static software—something you build once, deploy, and maintain through bug fixes. But AI doesn't work that way. AI degrades over time not because it breaks, but because the world around it changes while it stays frozen in place.&lt;/p&gt;

&lt;p&gt;This is the fundamental mistake developers make when integrating AI into systems meant to last years, not months. We're building for longevity with components designed for obsolescence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Illusion of Deployed Intelligence
&lt;/h2&gt;

&lt;p&gt;When you deploy traditional software, you're shipping deterministic logic. The code does what it says. If your sorting algorithm works on day one, it works on day one thousand. The math doesn't change. The behavior doesn't drift.&lt;/p&gt;

&lt;p&gt;AI is different. You're not deploying logic—you're deploying a statistical model trained on a specific snapshot of data from a specific point in time. That model reflects patterns that existed when it was trained. As the world evolves, those patterns become less relevant.&lt;/p&gt;

&lt;p&gt;We launched our product in January 2024. Our recommendation engine was trained on user behavior data from 2023. By June, user preferences had shifted. New product categories emerged. Seasonal patterns changed. Competitor features influenced what people expected.&lt;/p&gt;

&lt;p&gt;Our AI didn't know any of this. It was still making recommendations based on what users wanted six months ago. To the model, it was eternally January 2024.&lt;/p&gt;

&lt;p&gt;This isn't a bug. This is fundamental to how AI works. Models don't learn from production usage unless you explicitly design them to. They don't adapt to changing patterns unless you retrain them. They don't understand that the world has moved on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Ways AI Systems Decay
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data drift&lt;/strong&gt; happens when the input distribution changes. The kinds of queries users send, the format of uploaded documents, the types of problems they're trying to solve—all of this evolves. Your AI was trained on historical patterns. When current patterns diverge, accuracy drops.&lt;/p&gt;

&lt;p&gt;We saw this in our document analysis feature. Early users uploaded clean PDFs with standard formatting. Six months later, users were uploading screenshots, scanned images with handwritten notes, and documents in languages the model had barely seen during training. Same feature, completely different input distribution.&lt;/p&gt;
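
&lt;p&gt;Data drift like this is measurable long before accuracy collapses. One standard approach (not tied to any vendor) is the Population Stability Index over binned input features, for example upload type or document length:&lt;/p&gt;

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index between two binned distributions.

    Inputs are per-bin proportions summing to 1. Rule of thumb:
    below 0.1 stable, 0.1 to 0.25 drifting, above 0.25 significant drift.
    """
    eps = 1e-6  # avoid log(0) for empty bins
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Training-time vs. current distribution of upload types (illustrative bins).
train = [0.7, 0.2, 0.1]  # clean PDF, scan, screenshot
now = [0.3, 0.3, 0.4]    # screenshots took over
print(round(psi(train, train), 6))  # 0.0: identical, no drift
print(psi(train, now) > 0.25)       # True: retrain-worthy drift
```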

&lt;p&gt;&lt;strong&gt;Concept drift&lt;/strong&gt; happens when the underlying relationships change. What constitutes "good" content shifts. What users consider "relevant" evolves. Market dynamics change the meaning of signals your AI relies on.&lt;/p&gt;

&lt;p&gt;Our content moderation AI learned that short posts with lots of emoji were usually low-quality spam. Then legitimate users started adopting that style. The patterns the AI used to identify spam became the patterns real users exhibited. We were flagging authentic engagement as abuse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feedback loop degradation&lt;/strong&gt; happens when AI decisions shape future data, creating cycles that amplify errors. Your recommendation engine suggests content. Users engage with suggested content. That engagement trains the next model. If the suggestions were slightly off, the next model learns from biased data, making worse suggestions, which creates worse training data.&lt;/p&gt;

&lt;p&gt;We built a feature that suggested conversation starters based on user interests. The AI learned that users who saw certain prompts engaged more. But correlation isn't causation—the prompts weren't better, they were just suggested more often. The AI doubled down on mediocre suggestions because its own recommendations inflated their apparent success.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Traditional Monitoring Misses
&lt;/h2&gt;

&lt;p&gt;Standard observability catches when things break. It doesn't catch when things slowly stop working.&lt;/p&gt;

&lt;p&gt;Your API response times look fine. Your error rates are stable. Your uptime is 99.99%. Meanwhile, your AI is confidently generating increasingly irrelevant responses, and your metrics don't care because technically nothing is failing.&lt;/p&gt;

&lt;p&gt;We had comprehensive monitoring. We tracked API latency, model inference time, request volumes, error rates. We had alerts for everything that could crash. What we didn't have was drift detection.&lt;/p&gt;

&lt;p&gt;Our AI could return confident predictions with terrible accuracy, and our systems would happily log "200 OK." The code worked. The AI just wasn't intelligent anymore.&lt;/p&gt;

&lt;p&gt;The metrics that matter for AI systems aren't in your standard observability stack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prediction confidence distribution&lt;/strong&gt; over time. If your model is less certain about its predictions, something has changed. We started tracking this and noticed a gradual shift toward lower confidence scores weeks before accuracy metrics confirmed the problem.&lt;/p&gt;
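
&lt;p&gt;Tracking that shift does not require an ML platform; a rolling window over logged confidence scores is enough to alert on. A sketch, with the window size and tolerance as made-up values:&lt;/p&gt;

```python
from collections import deque

class ConfidenceMonitor:
    """Alert when mean confidence over a recent window drops well below
    the baseline measured at deployment time."""

    def __init__(self, baseline: float, window: int = 500, tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def observe(self, confidence: float) -> bool:
        """Record one prediction's confidence; return True if drifting."""
        self.scores.append(confidence)
        if len(self.scores) != self.scores.maxlen:
            return False  # window not full yet
        mean = sum(self.scores) / len(self.scores)
        return self.baseline - self.tolerance > mean

mon = ConfidenceMonitor(baseline=0.90, window=100)
alerts = [mon.observe(0.80) for _ in range(100)]
print(alerts[-1])  # True: a full window of 0.80 against a 0.90 baseline
```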

&lt;p&gt;&lt;strong&gt;Feature importance drift.&lt;/strong&gt; The signals your model relies on should be relatively stable. If feature weights shift dramatically, your model is compensating for distribution changes in ways that might not be sustainable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output diversity metrics.&lt;/strong&gt; If your AI starts producing increasingly similar outputs or falls back to safe, generic responses more often, it's struggling with inputs it doesn't recognize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ground truth validation rate.&lt;/strong&gt; We started sampling production outputs and manually validating them weekly. This caught degradation that automated metrics missed because automated metrics only measure whether the AI returned something, not whether what it returned was good.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Nobody Builds
&lt;/h2&gt;

&lt;p&gt;Most teams integrate AI like this: train a model, wrap it in an API, deploy it, move on to the next feature. Six months later when accuracy tanks, they scramble to retrain on more recent data.&lt;/p&gt;

&lt;p&gt;This is reactive. The architecture should be adaptive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build versioning into your AI layer from day one.&lt;/strong&gt; Not just model versioning—data versioning, prompt versioning, validation logic versioning. When something degrades, you need to know exactly what changed and when. We use &lt;a href="https://crompt.ai/chat/content-writer" rel="noopener noreferrer"&gt;version control for our prompts&lt;/a&gt; the same way we version code, tracking every change and its impact on output quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement continuous evaluation, not just continuous deployment.&lt;/strong&gt; Reserve a holdout set that represents current production patterns. Run your deployed model against this set weekly. Track accuracy over time. When performance drops below a threshold, trigger retraining automatically.&lt;/p&gt;
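
&lt;p&gt;The evaluation loop itself is tiny; the hard part is keeping the holdout set representative of current traffic. A sketch where &lt;code&gt;trigger_retraining&lt;/code&gt; is a placeholder for your own training job:&lt;/p&gt;

```python
def evaluate(model, holdout) -> float:
    """Fraction of holdout examples the deployed model gets right."""
    correct = sum(1 for x, y in holdout if model(x) == y)
    return correct / len(holdout)

def weekly_check(model, holdout, threshold=0.85, trigger_retraining=print):
    """Score the model against a current-traffic holdout and kick off
    retraining when accuracy sinks under the floor."""
    acc = evaluate(model, holdout)
    if threshold > acc:
        trigger_retraining("accuracy %.2f under %.2f: retraining" % (acc, threshold))
    return acc

# Toy demonstration: labels are input parity, and the model knows it.
holdout = [(i, i % 2) for i in range(100)]
print(weekly_check(lambda x: x % 2, holdout))  # 1.0
```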

&lt;p&gt;&lt;strong&gt;Design for model swapping without code changes.&lt;/strong&gt; Your application logic shouldn't be coupled to a specific model architecture. We built an abstraction layer that lets us A/B test model versions, roll back to previous versions, or swap in entirely different models without touching application code. Tools like &lt;a href="https://crompt.ai/chat/claude-sonnet-45" rel="noopener noreferrer"&gt;Claude Sonnet 4.5&lt;/a&gt; work alongside &lt;a href="https://crompt.ai/chat/gemini-25-flash" rel="noopener noreferrer"&gt;Gemini 2.5 Flash&lt;/a&gt; in our stack, letting us compare outputs and switch between them based on task requirements.&lt;/p&gt;
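
&lt;p&gt;The swap layer can be as simple as a registry keyed by logical task, so application code asks for a capability rather than a vendor. All names below are illustrative:&lt;/p&gt;

```python
class ModelRouter:
    """Map logical task names to interchangeable backends so app code
    never imports a vendor SDK directly."""

    def __init__(self):
        self._backends = {}
        self._active = {}

    def register(self, task: str, name: str, fn) -> None:
        self._backends.setdefault(task, {})[name] = fn

    def use(self, task: str, name: str) -> None:
        """Switch (or roll back) the active backend with no app changes."""
        self._active[task] = self._backends[task][name]

    def run(self, task: str, payload: str) -> str:
        return self._active[task](payload)

router = ModelRouter()
router.register("summarize", "model_a", lambda text: "A:" + text[:10])
router.register("summarize", "model_b", lambda text: "B:" + text[:10])
router.use("summarize", "model_a")
print(router.run("summarize", "long document text"))  # A:long docum
router.use("summarize", "model_b")  # one-line swap or rollback
print(router.run("summarize", "long document text"))  # B:long docum
```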

&lt;p&gt;&lt;strong&gt;Build feedback collection into the user experience.&lt;/strong&gt; Every AI output should have a mechanism for users to flag issues. Not just thumbs up/down—structured feedback that helps you understand what went wrong. "Was this response accurate? Relevant? Helpful?" These signals become your ground truth for measuring real-world performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create human review layers for high-stakes decisions.&lt;/strong&gt; Some AI outputs matter more than others. For critical decisions, build in human verification. This isn't just about catching errors—it's about generating high-quality labeled data for future retraining.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Retraining Problem
&lt;/h2&gt;

&lt;p&gt;Once you accept that AI models degrade, the obvious solution is retraining. Collect new data, retrain the model, deploy the update. Simple, right?&lt;/p&gt;

&lt;p&gt;Not even close.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retraining is expensive.&lt;/strong&gt; Not just computationally—organizationally. Someone needs to collect data, clean it, label it if needed, run training jobs, validate outputs, coordinate deployment. This isn't a weekend project. It's a recurring operational burden.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retraining can make things worse.&lt;/strong&gt; Your model was trained on historical data that included both good and bad outcomes. When you retrain on recent data, you're training on outcomes influenced by your previous model's mistakes. If your AI was making bad recommendations, and users adapted their behavior around those recommendations, your new training data is contaminated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retraining doesn't fix architectural problems.&lt;/strong&gt; If your model degraded because the input distribution changed fundamentally, retraining on more of the same won't help. You might need different features, different architecture, or different problem framing entirely.&lt;/p&gt;

&lt;p&gt;We learned this the hard way. After six months of degradation, we invested three weeks in retraining. The new model performed worse than the original because we'd trained it on data that reflected our system's declining accuracy. We had to go back, carefully curate a training set that filtered out AI-influenced outcomes, and retrain again.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;The teams succeeding with long-lived AI systems aren't the ones with the best models. They're the ones with the best operational discipline around model management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They treat AI models like infrastructure, not features.&lt;/strong&gt; Models need maintenance schedules, health checks, and replacement plans. Just like you plan database migrations or server upgrades, you need planned model refresh cycles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They invest in tooling for rapid experimentation.&lt;/strong&gt; When a model degrades, you need to test alternatives quickly. Platforms that let you &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;compare AI outputs side-by-side&lt;/a&gt; become essential. We can now test a hypothesis about model degradation in hours instead of days because we can rapidly compare how different models handle the same inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They build interpretability into their systems from the start.&lt;/strong&gt; When something goes wrong, you need to understand why. Using tools that help you &lt;a href="https://crompt.ai/chat/data-extractor" rel="noopener noreferrer"&gt;analyze model behavior and extract insights&lt;/a&gt; turns debugging from guesswork into systematic investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They maintain human expertise in the loop.&lt;/strong&gt; The best AI systems we've seen have domain experts who regularly review outputs, understand model behavior, and can spot drift before metrics confirm it. AI augments human judgment—it doesn't replace the need for people who understand the problem domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;AI is not a solution you implement once. It's a system you operate continuously.&lt;/p&gt;

&lt;p&gt;Every time I hear "we're adding AI to our product," I want to ask: "Who's going to maintain it? What's your retraining schedule? How will you detect degradation? What's your rollback plan?"&lt;/p&gt;

&lt;p&gt;Most teams can't answer these questions because they're thinking about AI like a software feature, not like a living system that requires ongoing care.&lt;/p&gt;

&lt;p&gt;The developers who succeed with AI in production understand something fundamental: &lt;strong&gt;the hard part isn't building AI systems. It's keeping them working.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your model will degrade. Your data distribution will drift. Your users will change how they interact with your product. The world will evolve while your frozen statistical model stays stuck in the past.&lt;/p&gt;

&lt;p&gt;The question isn't whether your AI will break down in a long-lived system. The question is whether you'll notice before your users do, and whether you've built the operational infrastructure to fix it when they tell you.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Should Do Tomorrow
&lt;/h2&gt;

&lt;p&gt;Stop thinking about AI as something you deploy and forget. Start thinking about it as something you monitor, maintain, and evolve.&lt;/p&gt;

&lt;p&gt;Add drift detection to your monitoring. Set up regular validation of production outputs. Build versioning into your AI layer. Create mechanisms for user feedback. Design your architecture to support model swapping.&lt;/p&gt;

&lt;p&gt;Use platforms like &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; that let you work with &lt;a href="https://crompt.ai/chat/gemini-25-pro" rel="noopener noreferrer"&gt;multiple AI models&lt;/a&gt; simultaneously, because when one model starts degrading, you need alternatives ready to test. Build comparison and validation into your workflow from day one.&lt;/p&gt;

&lt;p&gt;The future of AI in production isn't better models. It's better operational practices around managing models that inevitably become worse over time.&lt;/p&gt;

&lt;p&gt;Your AI will fail. The only question is whether your systems are designed to handle that failure gracefully or catastrophically.&lt;/p&gt;

&lt;p&gt;-ROHIT&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>What Happened When I Let AI Handle My Debugging Sessions</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Tue, 06 Jan 2026 09:38:45 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/what-happened-when-i-let-ai-handle-my-debugging-sessions-4ekc</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/what-happened-when-i-let-ai-handle-my-debugging-sessions-4ekc</guid>
      <description>&lt;p&gt;I spent four hours debugging a memory leak last Tuesday.&lt;/p&gt;

&lt;p&gt;The first three hours were me and the AI going in circles. "Check for event listener leaks." Already did. "Look for unclosed database connections." None found. "Profile the heap." Nothing obvious. The AI kept suggesting things I'd already tried, confidently asserting each new suggestion would "definitely" solve the problem.&lt;/p&gt;

&lt;p&gt;Then I opened the network tab manually. Five seconds later I found it: a WebSocket reconnection loop triggered by a race condition in the initialization code. Something the AI never suggested because it was reasoning from patterns, not actually understanding my system.&lt;/p&gt;
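&lt;p&gt;The eventual fix for that class of bug is small. Here's a hedged sketch (the &lt;code&gt;Reconnector&lt;/code&gt; class and the injected scheduler are illustrative, not my production code): make reconnect scheduling idempotent, so overlapping triggers cannot stack reconnection loops:&lt;/p&gt;

```typescript
type Schedule = (fn: () => void, ms: number) => void;

class Reconnector {
  private pending = false;
  public attempts = 0;

  // The timer is injected so the logic is testable without real sockets.
  constructor(private schedule: Schedule, private connect: () => void) {}

  // Every failure path (close event, error event, health check) calls this;
  // only the first call actually schedules a reconnect.
  onDisconnect(): void {
    if (this.pending) return;
    this.pending = true;
    this.schedule(() => {
      this.pending = false;
      this.attempts += 1;
      this.connect();
    }, 1000);
  }
}
```

&lt;p&gt;The race in my initialization code existed precisely because nothing guaranteed that guard: two code paths both believed they were the only one scheduling a reconnect.&lt;/p&gt;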

&lt;p&gt;Here's what I learned: AI can accelerate debugging. But only if you know exactly when to ignore it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Debugging Fails (And When It Works)
&lt;/h2&gt;

&lt;p&gt;AI is pattern-matching, not reasoning.&lt;/p&gt;

&lt;p&gt;When you paste an error message into ChatGPT or Claude, it's searching its training data for similar errors and suggesting solutions that worked for those. This is incredibly useful when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The error is common (NullPointerException, CORS issues, syntax errors)&lt;/li&gt;
&lt;li&gt;The solution is standard (missing dependency, typo in config, wrong import)&lt;/li&gt;
&lt;li&gt;The context is generic (framework defaults, standard library usage)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's completely useless when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The bug is specific to your system architecture&lt;/li&gt;
&lt;li&gt;The issue involves interaction between multiple services&lt;/li&gt;
&lt;li&gt;The problem is a race condition or timing issue&lt;/li&gt;
&lt;li&gt;The root cause isn't where the error surfaces&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I've debugged about 60 issues with AI assistance over the last four months. Here's the actual success rate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI solved it in under 10 minutes:&lt;/strong&gt; 23 issues (~38%)&lt;br&gt;
&lt;strong&gt;AI pointed me in the right direction:&lt;/strong&gt; 19 issues (~32%)&lt;br&gt;
&lt;strong&gt;AI wasted my time with irrelevant suggestions:&lt;/strong&gt; 18 issues (~30%)&lt;/p&gt;

&lt;p&gt;The 38% success rate is real leverage—problems that would have taken 30-60 minutes to debug manually got solved in under 10 minutes. But that 30% failure rate cost me hours of chasing dead ends.&lt;/p&gt;

&lt;p&gt;The pattern is clear: AI accelerates debugging when the problem matches training data patterns. It actively harms debugging when the problem is novel or system-specific.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problems AI Actually Solves
&lt;/h2&gt;

&lt;p&gt;Let's be specific about what works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Type 1: Syntax and Configuration Errors&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Cannot find module '@/utils/helper'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AI nails this every time. Missing import, wrong path, typo in the alias. GPT-5 and Claude both immediately suggest checking &lt;code&gt;tsconfig.json&lt;/code&gt; paths and verifying the file exists. Problem solved in 2 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Type 2: Common Framework Issues&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Hydration failed because the initial UI does not match what was rendered on the server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AI knows this pattern. It's a Next.js hydration mismatch. It suggests checking for &lt;code&gt;window&lt;/code&gt; access during SSR, mismatched HTML structure, and client-only components. One of these suggestions usually hits.&lt;/p&gt;
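&lt;p&gt;The usual root cause is render code that reads browser-only state. A sketch of the guard, with the global object injected so the logic is testable (a real component would check &lt;code&gt;typeof window&lt;/code&gt; directly):&lt;/p&gt;

```typescript
// Hypothetical sketch: return the same stable default on the server and on
// the first client render, so the hydrated markup matches. Only read
// browser-only state after you know a browser exists.
function getInitialTheme(globals: any): string {
  if (globals.window === undefined) {
    // Server render path: stable default, no mismatch.
    return "light";
  }
  return globals.window.localStorage.getItem("theme") ?? "light";
}
```

&lt;p&gt;The companion fix is to apply the browser-specific value in an effect after mount, not during the initial render.&lt;/p&gt;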

&lt;p&gt;&lt;strong&gt;Problem Type 3: Dependency Conflicts&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Cannot resolve dependency tree
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AI walks through &lt;code&gt;package.json&lt;/code&gt;, identifies version mismatches, suggests compatible versions. When you can &lt;a href="https://crompt.ai/chat/excel-analyzer" rel="noopener noreferrer"&gt;analyze dependency patterns across your project files&lt;/a&gt;, you catch these conflicts before they break builds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Type 4: Type Errors in Statically Typed Languages&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Type 'string | undefined' is not assignable to type 'string'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AI immediately suggests the fix: optional chaining, null checks, or type guards. These are mechanical fixes with standard solutions.&lt;/p&gt;
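&lt;p&gt;For example, two of those mechanical fixes for &lt;code&gt;string | undefined&lt;/code&gt;, using a hypothetical &lt;code&gt;lookupName&lt;/code&gt; helper:&lt;/p&gt;

```typescript
function lookupName(id: number): string | undefined {
  const names: { [id: number]: string } = { 1: "Ada" };
  return names[id];
}

function greet(name: string): string {
  return "Hello, " + name;
}

// Fix 1: a default value collapses `string | undefined` to `string`.
const a = greet(lookupName(1) ?? "stranger");

// Fix 2: an explicit guard narrows the type inside each branch.
const raw = lookupName(2);
const b = raw !== undefined ? greet(raw) : "no user";
```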

&lt;p&gt;The success rate for these four categories is above 80%. AI has seen these errors thousands of times. It knows the standard solutions.&lt;/p&gt;

&lt;p&gt;But most production bugs aren't syntax errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problems Where AI Makes Things Worse
&lt;/h2&gt;

&lt;p&gt;Here's what actually wastes your time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Type 1: Race Conditions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have a bug that only appears under load. Sometimes it happens, sometimes it doesn't. The error message is generic: "Cannot read property 'x' of undefined."&lt;/p&gt;

&lt;p&gt;AI suggests: null checks, optional chaining, defensive coding. All reasonable. None solve the actual problem because the actual problem is two async operations completing in the wrong order.&lt;/p&gt;

&lt;p&gt;AI can't reason about timing. It can't see that your initialization function sometimes completes before your data fetch, and sometimes after. It pattern-matches on the error message, not the root cause.&lt;/p&gt;

&lt;p&gt;I wasted 90 minutes following AI suggestions on a race condition before I realized it was suggesting solutions to the symptom, not the disease.&lt;/p&gt;
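&lt;p&gt;The disease, reduced to a sketch: an initialization promise that nobody awaits. The fix is to make the ordering explicit rather than accidental (the names here are illustrative):&lt;/p&gt;

```typescript
let config: { retries: number } | undefined;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function init() {
  await sleep(10); // stands in for slow startup work
  config = { retries: 3 };
}

// The fix: keep a handle on the init promise and await it before any
// dependent read. The buggy version read config.retries with no await,
// so it worked only when init happened to win the race.
const ready = init();

async function fetchData() {
  await ready;
  if (config === undefined) throw new Error("unreachable after await");
  return config.retries;
}
```

&lt;p&gt;Null checks would have silenced the error message; only sequencing the two operations fixes the bug.&lt;/p&gt;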

&lt;p&gt;&lt;strong&gt;Problem Type 2: Performance Degradation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your API response time goes from 200ms to 2000ms. No errors. No crashes. Just slow.&lt;/p&gt;

&lt;p&gt;AI suggests: check database indexes, optimize queries, add caching, profile the code. Generic advice that's technically correct but doesn't help you find the specific query that's slow.&lt;/p&gt;

&lt;p&gt;The actual problem in my case: a Sequelize query was doing an N+1 on a relation I'd added three days earlier. AI never suggested looking at recent code changes. It just gave me a performance optimization checklist.&lt;/p&gt;
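&lt;p&gt;The N+1 shape is easy to see with a toy query counter (this is not the Sequelize API, just a stand-in where each &lt;code&gt;query&lt;/code&gt; call is one database round trip):&lt;/p&gt;

```typescript
let queryCount = 0;
function query(sql: string): number[] {
  queryCount += 1;
  return [1, 2, 3]; // stand-in result rows
}

// N+1 shape: one query for the posts, then one more per post.
function loadNaive(): void {
  const postIds = query("SELECT id FROM posts");
  for (const id of postIds) query("SELECT name FROM users WHERE post_id = " + id);
}

// Batched shape: one query for posts, one IN-query for all their authors.
function loadBatched(): void {
  const postIds = query("SELECT id FROM posts");
  query("SELECT name FROM users WHERE post_id IN (" + postIds.join(",") + ")");
}
```

&lt;p&gt;Sequelize's eager loading (&lt;code&gt;include&lt;/code&gt;) produces roughly the batched shape; the naive loop is the shape my added relation was silently producing.&lt;/p&gt;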

&lt;p&gt;&lt;strong&gt;Problem Type 3: Integration Issues Across Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your microservice returns 500 errors intermittently. Logs show: "Service A failed to connect to Service B."&lt;/p&gt;

&lt;p&gt;AI suggests: check network connectivity, verify service B is running, look for firewall rules, check authentication tokens.&lt;/p&gt;

&lt;p&gt;The actual problem: Service B's load balancer was silently dropping 5% of requests due to a misconfigured health check. The logs made it look like a network issue. It was actually a deployment config issue three layers deep.&lt;/p&gt;

&lt;p&gt;AI debugs based on the error message. It doesn't understand your infrastructure topology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Type 4: Heisenbugs That Disappear When You Try to Debug Them&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The bug happens in production. It doesn't happen in staging. It doesn't reproduce locally. Logs are clean. Metrics look normal. But users are reporting failures.&lt;/p&gt;

&lt;p&gt;AI suggests: add more logging, reproduce the issue, check environment differences.&lt;/p&gt;

&lt;p&gt;Thanks, AI. Super helpful.&lt;/p&gt;

&lt;p&gt;The actual solution in my case: attaching a debugger to a production instance and stepping through the code manually. Something AI can't do.&lt;/p&gt;

&lt;p&gt;The pattern is clear: AI is useless when the problem requires understanding your specific system, not generic debugging advice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debugging Protocol That Actually Works
&lt;/h2&gt;

&lt;p&gt;Here's the workflow I use now. It minimizes AI's weaknesses while leveraging its strengths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Categorize the Bug (30 seconds)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before touching AI, ask yourself:&lt;/p&gt;

&lt;p&gt;Is this a &lt;strong&gt;symptom bug&lt;/strong&gt; or a &lt;strong&gt;message bug&lt;/strong&gt;?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message bug:&lt;/strong&gt; The error message clearly describes the problem (syntax error, missing import, type mismatch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symptom bug:&lt;/strong&gt; The error message describes a symptom, not the root cause (null reference, timeout, 500 error)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For message bugs: Use AI immediately. Paste the error. Apply the fix. Move on.&lt;/p&gt;

&lt;p&gt;For symptom bugs: Skip AI in Stage 1. Go directly to Stage 2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Gather Context (5-10 minutes)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For symptom bugs, you need data before AI can help.&lt;/p&gt;

&lt;p&gt;Collect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full stack trace (not just the error message)&lt;/li&gt;
&lt;li&gt;Recent code changes (git log for the last week)&lt;/li&gt;
&lt;li&gt;Reproduction steps (exactly how to trigger the bug)&lt;/li&gt;
&lt;li&gt;Environment differences (does it happen in staging? locally?)&lt;/li&gt;
&lt;li&gt;Timing information (does it happen immediately? after 10 minutes? randomly?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you have context. Now AI becomes useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Multi-Model Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Different models reason about debugging differently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crompt.ai/chat/gpt-5" rel="noopener noreferrer"&gt;GPT-5&lt;/a&gt;: Fast pattern matching. Best for "what could cause this error message?" Give it the stack trace and recent changes. It will generate 5-10 hypotheses quickly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crompt.ai/chat/claude-opus-41" rel="noopener noreferrer"&gt;Claude Opus 4.1&lt;/a&gt;: Deep logical analysis. Best for "walk through this code and find logical flaws." Give it the relevant code sections. It will reason through the execution path and spot issues GPT-5 misses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crompt.ai/chat/gemini-25-pro" rel="noopener noreferrer"&gt;Gemini 2.5 Pro&lt;/a&gt;: Documentation synthesis. Best for "what does the documentation say about this error?" It cross-references official docs and finds non-obvious configuration issues.&lt;/p&gt;

&lt;p&gt;The workflow: GPT-5 generates hypotheses. Claude analyzes logic. Gemini checks docs. When you can &lt;a href="https://crompt.ai/" rel="noopener noreferrer"&gt;compare different debugging approaches in one conversation&lt;/a&gt;, you triangulate toward the root cause faster than using any single model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4: Test Hypotheses Systematically&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AI just gave you 10 possible causes. Don't test them randomly.&lt;/p&gt;

&lt;p&gt;Prioritize by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Likelihood:&lt;/strong&gt; Based on your knowledge of the system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of testing:&lt;/strong&gt; Quick tests first, time-consuming tests later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast radius:&lt;/strong&gt; Test safe changes before risky ones&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Document what you test and the results. When you go back to AI with "I tried X, Y, Z—none worked," it can reason about what's left.&lt;/p&gt;
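&lt;p&gt;Those three criteria can be folded into a crude ranking. A sketch (the scoring weights are arbitrary; the point is forcing an explicit order instead of testing hypotheses randomly):&lt;/p&gt;

```typescript
interface Hypothesis {
  name: string;
  likelihood: number;  // 1-5: your judgment of how probable the cause is
  testMinutes: number; // estimated time to check it
  risky: boolean;      // true if testing it could affect users
}

function prioritize(hypotheses: Hypothesis[]): Hypothesis[] {
  // Likelihood per minute of testing, with risky tests pushed to the end.
  const score = (h: Hypothesis) =>
    h.likelihood / h.testMinutes - (h.risky ? 100 : 0);
  return [...hypotheses].sort((a, b) => score(b) - score(a));
}
```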

&lt;p&gt;&lt;strong&gt;Stage 5: The Manual Escape Hatch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If AI suggestions aren't working after 30 minutes, stop using AI.&lt;/p&gt;

&lt;p&gt;You're either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dealing with a novel bug AI can't pattern-match&lt;/li&gt;
&lt;li&gt;Missing context that AI needs but you haven't provided&lt;/li&gt;
&lt;li&gt;Stuck in an AI reasoning loop where it keeps suggesting variations of wrong answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, do what always works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the source code of the library/framework causing the issue&lt;/li&gt;
&lt;li&gt;Attach a debugger and step through execution&lt;/li&gt;
&lt;li&gt;Add targeted logging at each decision point&lt;/li&gt;
&lt;li&gt;Diff your code against a working version&lt;/li&gt;
&lt;li&gt;Rubber duck the problem to a colleague&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI accelerates debugging when the problem is familiar. It cannot replace systematic investigation of unfamiliar problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Each Model Is Actually Good At
&lt;/h2&gt;

&lt;p&gt;After four months of AI-assisted debugging, here's what I've learned about model-specific strengths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://crompt.ai/chat/gpt-5" rel="noopener noreferrer"&gt;GPT-5&lt;/a&gt; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fastest at generating initial hypotheses&lt;/li&gt;
&lt;li&gt;Best at recognizing common error patterns&lt;/li&gt;
&lt;li&gt;Good at suggesting related issues you might not have considered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinates solutions that sound plausible but don't exist&lt;/li&gt;
&lt;li&gt;Suggests fixes without understanding your specific architecture&lt;/li&gt;
&lt;li&gt;Keeps suggesting the same solution in different words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://crompt.ai/chat/claude-opus-41" rel="noopener noreferrer"&gt;Claude Opus 4.1&lt;/a&gt; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Best at logical reasoning through code execution&lt;/li&gt;
&lt;li&gt;Spots edge cases and race conditions GPT-5 misses&lt;/li&gt;
&lt;li&gt;Explains why a solution should work, not just what to try&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Claude Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verbose. Takes 3 paragraphs to say what needs 1 sentence&lt;/li&gt;
&lt;li&gt;Overthinks simple bugs&lt;/li&gt;
&lt;li&gt;Sometimes gets lost in its own reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://crompt.ai/chat/gemini-25-pro" rel="noopener noreferrer"&gt;Gemini 2.5 Pro&lt;/a&gt; Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Best at cross-referencing documentation&lt;/li&gt;
&lt;li&gt;Good at finding configuration issues&lt;/li&gt;
&lt;li&gt;Synthesizes information from multiple error sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Gemini Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sometimes prioritizes obscure solutions over common ones&lt;/li&gt;
&lt;li&gt;Struggles with code-level logic debugging&lt;/li&gt;
&lt;li&gt;Less useful for runtime issues vs. configuration issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Strategy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with GPT-5 for quick pattern matching. If that doesn't work, switch to Claude for logical analysis. If it's looking like a config issue, bring in Gemini.&lt;/p&gt;

&lt;p&gt;When you can &lt;a href="https://crompt.ai/" rel="noopener noreferrer"&gt;maintain debugging context across model switches&lt;/a&gt;, you're not starting over each time—each model builds on what the previous one discovered.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metrics That Actually Matter
&lt;/h2&gt;

&lt;p&gt;Let's be specific about what AI debugging actually saves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to First Hypothesis:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without AI: 5-10 minutes (reading docs, searching GitHub issues)&lt;/li&gt;
&lt;li&gt;With AI: 30 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Time to Solution (Message Bugs):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without AI: 15-30 minutes&lt;/li&gt;
&lt;li&gt;With AI: 2-5 minutes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Speedup: 5-10x&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Time to Solution (Symptom Bugs):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without AI: 1-3 hours&lt;/li&gt;
&lt;li&gt;With AI: 45 minutes to 2 hours&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Speedup: 1.5-2x&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Time Wasted on Wrong Paths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without AI: Minimal (you test your own hypotheses)&lt;/li&gt;
&lt;li&gt;With AI: 30-60 minutes per dead-end suggested by AI&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slowdown: Significant if you don't verify AI suggestions&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The net result: AI debugging is a 3-4x productivity multiplier for routine bugs. It's roughly neutral for complex bugs. And it's actively harmful if you blindly follow suggestions without understanding them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Do Now
&lt;/h2&gt;

&lt;p&gt;My debugging workflow has stabilized into this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For syntax/config errors (40% of bugs):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Paste error into GPT-5&lt;/li&gt;
&lt;li&gt;Apply suggested fix&lt;/li&gt;
&lt;li&gt;Verify it works&lt;/li&gt;
&lt;li&gt;Move on&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total time: 2-5 minutes. No manual debugging needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For common runtime errors (30% of bugs):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gather context (stack trace, recent changes)&lt;/li&gt;
&lt;li&gt;Get hypotheses from &lt;a href="https://crompt.ai/chat/gpt-5" rel="noopener noreferrer"&gt;GPT-5&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Test top 3 hypotheses&lt;/li&gt;
&lt;li&gt;If none work, switch to &lt;a href="https://crompt.ai/chat/claude-opus-41" rel="noopener noreferrer"&gt;Claude&lt;/a&gt; for deeper analysis&lt;/li&gt;
&lt;li&gt;Implement solution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total time: 15-45 minutes. AI cut this from 30-90 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For complex/novel bugs (30% of bugs):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use AI to generate initial hypotheses (keep expectations low)&lt;/li&gt;
&lt;li&gt;Test the most obvious ones&lt;/li&gt;
&lt;li&gt;If AI suggestions don't work within 30 minutes, abandon AI&lt;/li&gt;
&lt;li&gt;Debug manually: profilers, debuggers, source code, logging&lt;/li&gt;
&lt;li&gt;Once I find the root cause, ask AI for implementation approaches&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total time: 1-4 hours. AI provides minimal speedup but occasionally suggests implementation approaches I wouldn't have considered.&lt;/p&gt;

&lt;p&gt;The key realization: AI is a tool for generating hypotheses quickly. It's not a replacement for systematic debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;AI doesn't actually "handle" your debugging sessions.&lt;/p&gt;

&lt;p&gt;You handle your debugging sessions. AI suggests things to try. Sometimes those suggestions are brilliant. Sometimes they're completely wrong. Sometimes they're right but inapplicable to your specific situation.&lt;/p&gt;

&lt;p&gt;The title of this article is misleading. I didn't "let AI handle" my debugging. I used AI to accelerate hypothesis generation while maintaining full responsibility for verification and solution implementation.&lt;/p&gt;

&lt;p&gt;Here's what actually happens when you let AI handle debugging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You waste time on irrelevant suggestions&lt;/li&gt;
&lt;li&gt;You miss root causes because you're focused on symptoms&lt;/li&gt;
&lt;li&gt;You ship fixes that solve the error message but not the underlying problem&lt;/li&gt;
&lt;li&gt;You lose the debugging skills that make you valuable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what happens when you use AI as a hypothesis generator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You explore solution spaces faster&lt;/li&gt;
&lt;li&gt;You catch common issues in minutes instead of hours&lt;/li&gt;
&lt;li&gt;You learn new debugging patterns from AI suggestions&lt;/li&gt;
&lt;li&gt;You maintain the judgment to know when AI is wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap between developers who blindly follow AI suggestions and those who critically evaluate them compounds with every debugging session.&lt;/p&gt;

&lt;p&gt;I still use AI for debugging. But I never "let it handle" anything. I generate hypotheses with AI. I test systematically. I verify before implementing. I maintain responsibility for the solution.&lt;/p&gt;

&lt;p&gt;The question isn't whether AI can debug for you. It can't. The question is whether you can use AI to debug faster while maintaining quality.&lt;/p&gt;

&lt;p&gt;Four months in: yes, but only if you know when to stop listening.&lt;/p&gt;

&lt;p&gt;-ROHIT&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Lessons from running the same debugging prompt through different AI systems</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Tue, 23 Dec 2025 10:55:50 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/lessons-from-running-the-same-debugging-prompt-through-different-ai-systems-1l37</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/lessons-from-running-the-same-debugging-prompt-through-different-ai-systems-1l37</guid>
      <description>&lt;p&gt;Last Tuesday, I spent three hours chasing a memory leak in a Next.js application that was crashing our staging environment every six hours. The pattern was clear—memory usage would climb steadily until the process died—but the cause was invisible. No obvious infinite loops, no massive data structures, nothing in the profiler that screamed "this is your problem."&lt;/p&gt;

&lt;p&gt;Out of frustration, I did something I'd never done before: I took the exact same debugging prompt—code snippet, error logs, system metrics, everything—and ran it through four different AI systems back-to-back. Claude, GPT-4, Gemini, and Grok. Same problem, same context, four completely different approaches.&lt;/p&gt;

&lt;p&gt;What I learned in those twenty minutes changed how I think about AI-assisted debugging entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Prompt That Started Everything
&lt;/h2&gt;

&lt;p&gt;Here's what I fed each system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Next.js app, memory usage climbing from 150MB to 2GB 
over 6 hours then crashes. No obvious leaks in heap 
snapshots. Using React Server Components, streaming 
SSR, and edge runtime. Event listeners properly cleaned 
up. What am I missing?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, direct, frustrating. The kind of problem where you've already tried the obvious solutions and you're starting to question your career choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Systems, Four Personalities
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; came back like a senior engineer doing a code review. It asked clarifying questions first. "Are you caching API responses? How are you handling streaming cleanup? Have you checked for dangling promises in your server components?" It didn't rush to conclusions. It wanted to understand the full system before offering theories.&lt;/p&gt;

&lt;p&gt;When it finally suggested causes, they were architectural—focusing on how Next.js handles server component lifecycle and where streaming responses might not be properly closed. It pointed me toward the &lt;code&gt;after()&lt;/code&gt; hook and suggested auditing my middleware chain for response streams that might not be terminating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-4&lt;/strong&gt; behaved like a textbook come to life. It gave me a structured, methodical breakdown: "Here are the seven most common causes of memory leaks in Next.js applications with streaming SSR." Each point had an explanation, example code, and specific things to check. Comprehensive, organized, slightly generic.&lt;/p&gt;

&lt;p&gt;It suggested checking my database connection pooling, verifying that fetch requests in server components weren't being cached indefinitely, and looking for event emitters that might not be garbage collected. Solid advice, but it felt like it was working from first principles rather than debugging instinct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; went for breadth over depth. It immediately started pattern matching across similar issues it had "seen" before. "This sounds like the Next.js 14.2 streaming bug that was patched in 14.2.3. Also possibly related to Vercel's edge runtime memory management. Have you tried..." &lt;/p&gt;

&lt;p&gt;It threw out five different possibilities rapid-fire, each one plausible, none of them developed deeply. Useful if you want to brainstorm many angles quickly, less useful if you want to methodically work through a single theory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grok&lt;/strong&gt; surprised me by being the most opinionated. It basically said "This is almost certainly your middleware chain. Next.js middleware runs on every request in the edge runtime and if you're not properly cleaning up, memory accumulates. Check your logging middleware first."&lt;/p&gt;

&lt;p&gt;Bold, direct, and—as it turned out—partially right. My logging middleware was indeed holding references longer than it should have been, though that wasn't the whole story.&lt;/p&gt;
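&lt;p&gt;The leak shape, reduced to a sketch (illustrative names, not my actual middleware): module-level state that grows with every request, plus the bound that fixes it. In an edge runtime the module lives across requests, so anything appended here is retained indefinitely:&lt;/p&gt;

```typescript
const MAX_ENTRIES = 1000;
const recentRequests: string[] = [];

// The leak: per-request data pushed into module-level state that nothing
// ever trims. The while-loop bound below is the fix.
function logRequest(url: string): void {
  recentRequests.push(url);
  while (recentRequests.length > MAX_ENTRIES) recentRequests.shift();
}
```

&lt;p&gt;Without the bound, memory climbs linearly with traffic, which matches the steady six-hour climb I was seeing.&lt;/p&gt;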

&lt;h2&gt;
  
  
  The Pattern That Emerged
&lt;/h2&gt;

&lt;p&gt;After working through all four responses, something clicked. &lt;strong&gt;Each AI wasn't better or worse—each one was optimized for a different debugging strategy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude excels at architectural debugging. When your problem is systemic, when the bug emerges from how different parts of your system interact, Claude's tendency to ask questions and think holistically is invaluable. It's the AI you want when you need to step back and reconsider your entire approach.&lt;/p&gt;

&lt;p&gt;GPT-4 is your methodical checklist generator. When you need comprehensive coverage of all possibilities, when you want to make sure you haven't missed something obvious, GPT-4's structured, textbook approach prevents blind spots. It's the AI you want when you need discipline, not intuition.&lt;/p&gt;

&lt;p&gt;Gemini shines at pattern recognition across domains. When you're debugging something that might be a known issue, or when you want to quickly explore many possible causes, Gemini's breadth helps you cast a wider net. It's the AI you want when you're still in the hypothesis generation phase.&lt;/p&gt;

&lt;p&gt;Grok cuts through ambiguity with confident theories. When you're paralyzed by too many possibilities, when you need someone to just pick the most likely cause and run with it, Grok's directness can be clarifying. It's the AI you want when you need momentum over completeness.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Discovery
&lt;/h2&gt;

&lt;p&gt;Here's what those twenty minutes taught me: &lt;strong&gt;using a single AI for debugging is like using only a hammer because it's the best tool you own.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most effective debugging session I've had in months happened because I stopped treating AI as "an assistant" and started treating different AIs as different modes of thought. When I needed systematic analysis, I consulted GPT-4. When I needed architectural insight, I asked Claude. When I got stuck on a hunch, I bounced it off Grok.&lt;/p&gt;

&lt;p&gt;This isn't about playing them against each other. It's about understanding that different cognitive approaches reveal different aspects of the same problem. The memory leak wasn't just one thing—it was a confluence of middleware behavior, streaming lifecycle issues, and subtle edge runtime quirks. No single AI caught all of it because no single debugging approach would have either.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practical Protocol
&lt;/h2&gt;

&lt;p&gt;After this experience, I developed a new debugging workflow that leverages these differences deliberately:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with breadth&lt;/strong&gt; using &lt;a href="https://crompt.ai/chat" rel="noopener noreferrer"&gt;Gemini&lt;/a&gt; to generate hypotheses. Let it throw out five or six possible causes without committing to any single theory. This prevents premature narrowing of your investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move to structure&lt;/strong&gt; with &lt;a href="https://crompt.ai/chat" rel="noopener noreferrer"&gt;GPT-4o&lt;/a&gt; to systematically work through each hypothesis. Use its love of comprehensive checklists to ensure you're testing each theory properly and not missing obvious checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go architectural&lt;/strong&gt; with &lt;a href="https://crompt.ai/chat" rel="noopener noreferrer"&gt;Claude&lt;/a&gt; when structural issues emerge. If the problem seems to stem from how components interact rather than a single buggy function, Claude's systems-thinking approach becomes invaluable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get decisive&lt;/strong&gt; with Grok when you're drowning in possibilities. Sometimes you just need someone to say "it's probably this, check here first" to break analysis paralysis.&lt;/p&gt;

&lt;p&gt;The key is treating this not as consensus-building but as &lt;strong&gt;perspective-gathering&lt;/strong&gt;. You're not looking for four AIs to agree on the answer. You're collecting different lenses through which to view the same problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for How We Debug
&lt;/h2&gt;

&lt;p&gt;The traditional debugging narrative is linear: identify the problem, form a hypothesis, test it, repeat until solved. But modern systems are too complex for purely linear thinking. You need multiple angles of attack simultaneously.&lt;/p&gt;

&lt;p&gt;Different AI systems naturally provide those angles. Using &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; to access multiple models in one interface means you're not just getting different answers—you're developing different ways of thinking about the problem in real-time.&lt;/p&gt;

&lt;p&gt;This isn't about outsourcing debugging to AI. It's about &lt;strong&gt;expanding your cognitive toolkit&lt;/strong&gt; by borrowing different reasoning styles as needed. The AIs aren't solving the problem for you. They're helping you think about it from angles your default mental model might miss.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debugging Blind Spot
&lt;/h2&gt;

&lt;p&gt;Here's what's interesting: after running this experiment several more times with different bugs, I noticed a pattern in my own thinking. I was gravitating toward certain AIs based on my cognitive comfort zone, not based on what the problem actually needed.&lt;/p&gt;

&lt;p&gt;When debugging frontend issues, I defaulted to Claude because I naturally think architecturally about UI systems. When debugging backend performance, I reached for GPT-4o because I prefer methodical profiling. But some of my biggest breakthroughs came when I forced myself to consult the AI whose approach felt least natural to me.&lt;/p&gt;

&lt;p&gt;The memory leak? Grok's aggressive "it's probably your middleware" hunch was right, but I initially dismissed it because it felt too simple. Claude's architectural perspective helped me understand &lt;em&gt;why&lt;/em&gt; the middleware was leaking. GPT-4o's systematic approach ensured I tested the fix properly. Gemini pointed me to similar reports in the Next.js GitHub issue tracker that confirmed my theory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bug wasn't solved by one AI. It was solved by thinking through the problem from four different angles.&lt;/strong&gt;&lt;/p&gt;
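&lt;p&gt;The leak itself followed a shape common enough to sketch. What follows is a hypothetical reconstruction, not our actual middleware: module-scope state in a long-lived server process that accumulates one entry per unique request and never evicts anything.&lt;/p&gt;

```javascript
// Hypothetical sketch of the leak pattern: a module-scope cache keyed by
// full URL (including unique query params), so entries are almost never
// reused and the Map only ever grows for the lifetime of the process.
const leakyCache = new Map();

function leakyMiddleware(req) {
  leakyCache.set(req.url, { seenAt: Date.now() });
  return leakyCache.size;
}

// The boring fix: bound the cache and evict the oldest entry when full.
// Maps iterate in insertion order, so the first key is the oldest one.
const MAX_ENTRIES = 1000;
const boundedCache = new Map();

function boundedMiddleware(req) {
  if (boundedCache.size >= MAX_ENTRIES) {
    boundedCache.delete(boundedCache.keys().next().value);
  }
  boundedCache.set(req.url, { seenAt: Date.now() });
  return boundedCache.size;
}
```

&lt;p&gt;Nothing here crashes, which is exactly why it slips past monitoring: the process just gets slower and heavier until the orchestrator kills it.&lt;/p&gt;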

&lt;h2&gt;
  
  
  The Synthesis Problem
&lt;/h2&gt;

&lt;p&gt;The hardest part of this approach isn't accessing different AIs—it's synthesizing their perspectives into actionable insight. Each system gives you a piece of the puzzle, but you're still responsible for seeing the complete picture.&lt;/p&gt;

&lt;p&gt;This is where tools like the &lt;a href="https://crompt.ai/chat/research-paper-summarizer" rel="noopener noreferrer"&gt;Research Assistant&lt;/a&gt; become valuable. Not for the initial debugging, but for organizing and connecting the different theories you've collected. When you've got four different explanations of the same bug, you need a way to map their relationships and contradictions.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://crompt.ai/chat/data-extractor" rel="noopener noreferrer"&gt;Data Extractor&lt;/a&gt; helps when you're comparing system metrics across different debugging sessions. The &lt;a href="https://crompt.ai/chat/document-summarizer" rel="noopener noreferrer"&gt;Document Summarizer&lt;/a&gt; becomes useful when you're trying to distill lessons from multiple debugging attempts into principles you can apply next time.&lt;/p&gt;

&lt;p&gt;But the synthesis itself? That's still on you. The AIs can't do that part—and they shouldn't. That synthesis is where the learning happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Lesson
&lt;/h2&gt;

&lt;p&gt;Running the same debugging prompt through different AI systems taught me something bigger than debugging strategy. It revealed how much our choice of thinking tool shapes what we're able to see.&lt;/p&gt;

&lt;p&gt;If you only use one AI, you'll only develop one mode of problem-solving. If you only use Claude, you'll become great at architectural thinking but potentially weak at systematic elimination. If you only use GPT-4o, you'll be thorough but potentially miss bold hunches. If you only use Gemini, you'll be great at generating possibilities but struggle to go deep on any single theory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real skill isn't learning to use AI for debugging. It's learning to think like different AIs do, using them to expand your own cognitive range rather than narrow it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practice
&lt;/h2&gt;

&lt;p&gt;Next time you hit a truly stubborn bug, try this: don't ask just one AI for help. Ask three or four, deliberately choosing systems with different approaches. Don't look for consensus—look for complementary insights.&lt;/p&gt;

&lt;p&gt;Notice which perspectives you naturally gravitate toward and which ones feel uncomfortable. The uncomfortable ones are probably expanding your thinking the most.&lt;/p&gt;

&lt;p&gt;Use platforms like &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt&lt;/a&gt; that let you switch between models seamlessly, so you're not managing multiple interfaces while trying to debug. The tool should facilitate perspective-gathering, not add cognitive overhead.&lt;/p&gt;

&lt;p&gt;The goal isn't to crowdsource debugging. It's to develop the kind of multi-perspective thinking that the best senior engineers have naturally—the ability to look at the same problem from architectural, systematic, intuitive, and pattern-matching angles simultaneously.&lt;/p&gt;

&lt;p&gt;The AIs just make that kind of cognitive flexibility more accessible to the rest of us.&lt;/p&gt;

&lt;p&gt;That memory leak taught me more than how to debug Next.js. It taught me that the limitation isn't the AI's intelligence—it's our tendency to use AI as an extension of our existing thinking rather than a way to think differently.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to experiment with multi-perspective debugging? Try &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt AI&lt;/a&gt; free and see how different models approach the same problem differently—because sometimes the bug isn't in your code, it's in how you're thinking about it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>How AI Explains Code Correctly but Misses Architectural Context</title>
      <dc:creator>Rohit Gavali</dc:creator>
      <pubDate>Mon, 22 Dec 2025 07:07:20 +0000</pubDate>
      <link>https://forem.com/rohit_gavali_0c2ad84fe4e0/how-ai-explains-code-correctly-but-misses-architectural-context-1an8</link>
      <guid>https://forem.com/rohit_gavali_0c2ad84fe4e0/how-ai-explains-code-correctly-but-misses-architectural-context-1an8</guid>
      <description>&lt;p&gt;Last week, a junior developer on my team asked ChatGPT to explain why we structure our API responses in a specific way. The AI gave a technically perfect answer about REST principles, data serialization, and HTTP status codes. Everything it said was correct.&lt;/p&gt;

&lt;p&gt;It was also completely useless.&lt;/p&gt;

&lt;p&gt;Because the real answer wasn't in the code—it was in a decision we made eighteen months ago when our mobile team reported that nested JSON objects were causing performance issues on older Android devices. The weird flat structure that confused the junior dev wasn't a REST best practice. It was a compromise born from a production incident at 2 AM.&lt;/p&gt;

&lt;p&gt;No AI would know that. And that's the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Syntax vs Story Gap
&lt;/h2&gt;

&lt;p&gt;AI tools have become remarkably good at explaining what code does. Feed &lt;a href="https://crompt.ai/chat" rel="noopener noreferrer"&gt;Claude 3.7 Sonnet&lt;/a&gt; a function and it will walk you through the logic, identify edge cases, and even suggest optimizations. It understands patterns, recognizes anti-patterns, and can cite best practices with impressive accuracy.&lt;/p&gt;

&lt;p&gt;But code doesn't exist in a vacuum. Every line you write is a small piece of a much larger story—a story shaped by deadlines, team capabilities, technical debt, business constraints, and the ghosts of decisions past.&lt;/p&gt;

&lt;p&gt;AI sees the code. It misses the story.&lt;/p&gt;

&lt;p&gt;When you ask an AI to explain a codebase, it gives you the architectural equivalent of describing a building by listing the materials used. "This wall is made of brick. This beam is steel. This joint uses a mortise and tenon connection." All true. All correct. All missing the point.&lt;/p&gt;

&lt;p&gt;The real question isn't what the building is made of—it's why the architect chose brick over concrete, why the beam is oversized for the load it carries, why there's an awkward support column in the middle of what should be an open space.&lt;/p&gt;

&lt;p&gt;Those answers live in context that AI cannot access.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Architecture
&lt;/h2&gt;

&lt;p&gt;Every codebase contains two architectures. There's the &lt;strong&gt;intended architecture&lt;/strong&gt;—the clean, logical structure you'd design if you were building from scratch with perfect knowledge and unlimited time. This is what's documented in architecture diagrams and design docs, if those even exist.&lt;/p&gt;

&lt;p&gt;Then there's the &lt;strong&gt;actual architecture&lt;/strong&gt;—the messy, compromised, battle-tested structure that emerged from real-world constraints. This is the architecture that contains:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legacy integrations that can't be refactored yet.&lt;/strong&gt; That weird data transformation layer that feels over-engineered? It exists because the third-party API changed its response format three times in six months, and we got tired of updating every consumer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance hacks that solved specific problems.&lt;/strong&gt; That caching layer with the oddly specific TTL? It's tuned precisely to our database replication lag and peak traffic patterns. Change it and you'll rediscover why we set it that way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team capability compromises.&lt;/strong&gt; That overly simple state management that seems to ignore best practices? We built it that way because half the team was new to the framework, and we needed something they could debug at 3 AM without escalating to seniors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business deadline tradeoffs.&lt;/strong&gt; That duplicated code that violates DRY principles? We knew it was wrong when we wrote it, but shipping on time for the conference demo was more important than perfect architecture.&lt;/p&gt;

&lt;p&gt;AI tools can't see any of this. They evaluate code against platonic ideals of correctness and best practices. They don't understand that sometimes the "wrong" solution is exactly right for your specific constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  When AI Misleads More Than It Helps
&lt;/h2&gt;

&lt;p&gt;The danger isn't that AI gives wrong answers. It's that it gives confidently correct answers that ignore crucial context.&lt;/p&gt;

&lt;p&gt;I've watched junior developers use AI to refactor "bad code" into "good code" that then broke production because they didn't understand why the bad code was written that way. The AI saw inefficiency; it didn't see the rate limiting requirements from our third-party provider. The AI saw redundancy; it didn't see the failover mechanism we built after the database incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI optimizes for local correctness without understanding global constraints.&lt;/strong&gt; It will happily suggest replacing your custom validation with a standard library, not knowing that your custom version exists specifically to handle the malformed data that one critical enterprise client sends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI suggests patterns that sound good but ignore your reality.&lt;/strong&gt; It will recommend microservices architecture principles to a three-person team running on a shoestring budget. It will suggest sophisticated caching strategies without knowing your traffic is 95% writes. It will advocate for test coverage metrics without understanding your release cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI can't navigate organizational context.&lt;/strong&gt; It doesn't know that your monorepo structure is dictated by your DevOps team's capabilities. It doesn't understand that your technology choices are constrained by your hiring market. It can't see that your architecture reflects power dynamics between product and engineering.&lt;/p&gt;

&lt;p&gt;When you use tools like &lt;a href="https://crompt.ai/chat" rel="noopener noreferrer"&gt;GPT-4o mini&lt;/a&gt; or the &lt;a href="https://crompt.ai/chat/code-explainer" rel="noopener noreferrer"&gt;Code Explainer&lt;/a&gt; without understanding these limitations, you're not getting explanations—you're getting technically accurate hallucinations that feel true but miss the point entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Is the Scar Tissue
&lt;/h2&gt;

&lt;p&gt;Good architecture isn't just logical structure—it's accumulated wisdom. Every weird pattern, every unusual constraint, every apparent inefficiency carries information about the problems the team has actually faced.&lt;/p&gt;

&lt;p&gt;When I onboard new developers, I don't start with the architecture diagram. I start with the git blame history and the post-mortem documents. I show them the scars—the commits that start with "hotfix" or "emergency patch." I walk them through the PRs with fifty comments and three rewrites. I show them the Slack threads where we debated approaches for hours before settling on something that looked obvious in retrospect.&lt;/p&gt;
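&lt;p&gt;Plain git surfaces most of this archaeology. Here's a sketch; the throwaway repo and its commit messages are invented for the demo, and the follow-up paths are placeholders for whatever looks suspicious in your real codebase:&lt;/p&gt;

```shell
# Build a throwaway repo with invented history, just for the demo.
demo_repo=$(mktemp -d) && cd "$demo_repo" && git init -q

git -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "hotfix: cap payment retry backoff at 60s"
git -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "feat: add invoice export"

# Find the scars: commits whose messages mention hotfixes or
# emergency patches (multiple --grep patterns are OR'd together).
git log --oneline -i --grep='hotfix' --grep='emergency patch'

# On a real codebase, follow up per file and per commit:
#   git blame -- path/to/weird-module.ts
#   git show <commit-sha>
```

&lt;p&gt;The point isn't the commands; it's that the history answers "why" questions the working tree can't.&lt;/p&gt;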

&lt;p&gt;This is the architecture that matters. Not the idealized version in the docs, but the evolved version that survived contact with production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The weird caching layer?&lt;/strong&gt; Added after the traffic spike that took down the site during our TechCrunch feature. &lt;strong&gt;The redundant validation?&lt;/strong&gt; Built after we discovered that mobile clients were sending malformed requests that passed our API gateway but crashed our servers. &lt;strong&gt;The overly defensive error handling?&lt;/strong&gt; Implemented after we spent a weekend debugging why errors weren't logging properly in our Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;Each of these decisions made perfect sense at the time. Each solved a real problem. Each would look like over-engineering or poor design to an AI analyzing the code without context.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Questions AI Can't Answer
&lt;/h2&gt;

&lt;p&gt;When you're trying to understand a codebase, the most important questions aren't about what the code does—they're about why it does it that way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this abstraction more complex than it needs to be?&lt;/strong&gt; Maybe it's premature optimization. Or maybe it's preparing for a requirement that's coming in Q2. Or maybe it's over-engineered because the original developer was learning a new pattern. You can't know without asking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why do we have two similar implementations of this feature?&lt;/strong&gt; Maybe it's technical debt that should be consolidated. Or maybe they look similar but serve different use cases with different constraints. Or maybe we're running an A/B test. Context matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why isn't this following the established pattern?&lt;/strong&gt; Maybe it's inconsistency that should be fixed. Or maybe the established pattern doesn't work for this edge case. Or maybe this was built by a contractor who didn't know the patterns. Or maybe the pattern changed and this is legacy code.&lt;/p&gt;

&lt;p&gt;AI tools will confidently answer these questions based on code analysis alone. They'll spot the inconsistency, identify the duplication, note the deviation from best practices. What they can't do is tell you whether those things are problems or solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using AI Without Losing Context
&lt;/h2&gt;

&lt;p&gt;This doesn't mean AI tools are useless for understanding code—it means you need to use them differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use AI to explain the what, not the why.&lt;/strong&gt; When you're reading unfamiliar code, use tools like &lt;a href="https://crompt.ai/chat" rel="noopener noreferrer"&gt;Claude&lt;/a&gt; to understand what each piece does. But don't trust it to tell you why the code is structured that way. That requires git history, documentation, and conversations with people who were there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use AI to generate hypotheses, not conclusions.&lt;/strong&gt; When an AI suggests that code is poorly designed, treat it as a hypothesis to investigate. Maybe it is poor design. Maybe it's a clever solution to a constraint the AI doesn't know about. Use tools like the &lt;a href="https://crompt.ai/chat/research-paper-summarizer" rel="noopener noreferrer"&gt;Research Paper Summarizer&lt;/a&gt; to find documented patterns, but verify they apply to your context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use AI to accelerate learning, not replace understanding.&lt;/strong&gt; When onboarding to a new codebase, AI can help you understand the mechanics faster. But you still need to talk to the team, read the commit history, and understand the business context. The AI can explain the tree structure of your database schema, but only humans can explain why that third normal form violation is actually the right choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use AI as a second opinion, not the final word.&lt;/strong&gt; When you're unsure about an architectural decision, ask an AI for perspective. But remember it's evaluating against generic best practices, not your specific constraints. Tools like &lt;a href="https://crompt.ai" rel="noopener noreferrer"&gt;Crompt&lt;/a&gt; let you compare responses from multiple models—useful for getting different viewpoints, but none of them will understand your production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Irreplaceable Human Context
&lt;/h2&gt;

&lt;p&gt;The best architecture documentation I've ever read wasn't generated by tools—it was written by engineers who explained not just what they built, but why they built it that way, what alternatives they considered, and what constraints shaped their decisions.&lt;/p&gt;

&lt;p&gt;These documents capture the architectural context that AI can never infer: the business pressures, the team dynamics, the technical limitations, the future plans that influenced present choices.&lt;/p&gt;

&lt;p&gt;When senior engineers review code, they're not just checking if it works—they're evaluating if it fits the larger architectural story. They ask: Does this decision make sense given our constraints? Will future developers understand why this exists? Are we taking on technical debt consciously or accidentally?&lt;/p&gt;

&lt;p&gt;AI can't make these judgments because they require understanding not just the code, but the organization that writes it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Skill
&lt;/h2&gt;

&lt;p&gt;Understanding code architecture isn't about memorizing design patterns or identifying anti-patterns. It's about developing the ability to read between the lines—to see not just what the code does, but what problems the team was solving when they wrote it.&lt;/p&gt;

&lt;p&gt;This is what separates developers who can join any codebase and be productive from those who need everything explained. It's not that they understand the code better—it's that they understand how to discover the context that explains the code.&lt;/p&gt;

&lt;p&gt;They know that every architectural decision is a tradeoff, and they've learned to identify what was being traded for what. They recognize that "bad code" is often code that solved yesterday's problem, and that understanding why it solved that problem is more valuable than knowing how to refactor it.&lt;/p&gt;

&lt;p&gt;They use AI tools to accelerate their understanding, but they don't mistake technical correctness for architectural wisdom.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern That Matters
&lt;/h2&gt;

&lt;p&gt;If there's one meta-pattern to understand about codebases, it's this: &lt;strong&gt;architecture is frozen history&lt;/strong&gt;. Every weird structure, every apparent inefficiency, every deviation from best practices—they all made sense to someone at some point.&lt;/p&gt;

&lt;p&gt;Your job isn't to judge whether they were right or wrong. Your job is to understand what problem they were solving, whether that problem still exists, and whether their solution still makes sense given current constraints.&lt;/p&gt;

&lt;p&gt;AI can help you understand the syntax. Only humans can help you understand the story.&lt;/p&gt;

&lt;p&gt;And in the end, the story is what matters.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ROHIT&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
