<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Razorpay</title>
    <description>The latest articles on Forem by Razorpay (@razorpaytech).</description>
    <link>https://forem.com/razorpaytech</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F11690%2F56b0267c-3b72-4e4e-bf6f-cce60348a644.png</url>
      <title>Forem: Razorpay</title>
      <link>https://forem.com/razorpaytech</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/razorpaytech"/>
    <language>en</language>
    <item>
      <title>LinkedPayments: Building Payment Chains in a Microservices Architecture</title>
      <dc:creator>Vatsal Mehta</dc:creator>
      <pubDate>Wed, 04 Feb 2026 08:24:36 +0000</pubDate>
      <link>https://forem.com/razorpaytech/linkedpayments-building-payment-chains-in-a-microservices-architecture-43ca</link>
      <guid>https://forem.com/razorpaytech/linkedpayments-building-payment-chains-in-a-microservices-architecture-43ca</guid>
      <description>&lt;p&gt;If you've ever tried to use a gift card or wallet for an online purchase, you've probably experienced this frustration. You have a ₹500 as gift card or in wallet, but your cart total is ₹750. The checkout won't let you combine the gift card with your credit card. You either abandon the extra items or don't use the gift card at all. &lt;/p&gt;

&lt;p&gt;This isn't a technical limitation; it's an architectural one. Most payment systems treat each payment method as an isolated transaction rather than composable building blocks.&lt;/p&gt;

&lt;p&gt;At Razorpay, we power checkout for thousands of merchants across India, and this limitation was costing them real money. &lt;/p&gt;

&lt;p&gt;Merchants wanted to accept partial payments through gift cards, loyalty points, or store credit, with customers covering the remaining balance through standard payment methods. However, our payment gateway architecture wasn't designed for chaining multiple payment methods in a single order. &lt;/p&gt;

&lt;p&gt;Each payment method lived in its own silo, processed independently, with no concept of partial payments or sequential authorization.&lt;/p&gt;

&lt;p&gt;That's why we built &lt;strong&gt;LinkedPayments&lt;/strong&gt;, a system that treats payment methods as composable units that can be chained together to fulfill an order. Customers can now use a gift card for ₹500, then pay the remaining ₹250 via UPI, card, or any other supported method. &lt;/p&gt;

&lt;p&gt;The system handles authorization sequencing, failure recovery, settlement splitting, and reconciliation across linked payments automatically. &lt;/p&gt;

&lt;p&gt;Here's how we built it, and more importantly, how we designed for reliability when payment chains introduce complex failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Core Challenge: Payment Atomicity vs. Composability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before diving into the solution, let's talk about what makes linked payments architecturally complex. The fundamental tension is between &lt;strong&gt;atomicity&lt;/strong&gt; (payments either complete fully or fail entirely) and &lt;strong&gt;composability&lt;/strong&gt; (payments can be partial and sequential).&lt;/p&gt;

&lt;p&gt;Traditional payment systems optimize for atomicity. When a customer initiates a payment, the system authorizes the full amount from one payment method, captures it if authorization succeeds, and settles it to the merchant. &lt;/p&gt;

&lt;p&gt;This flow is simple, deterministic, and easy to reason about. Either the payment worked or it didn't. Reconciliation is straightforward because there's exactly one authorization, one capture, and one settlement per order.&lt;/p&gt;

&lt;p&gt;Linked payments break this simplicity. Now you have multiple authorizations for a single order, sequential dependencies between them (the second payment only happens if the first succeeds), partial failure scenarios (first payment succeeds, second fails), and split settlements (money comes from different sources).&lt;/p&gt;

&lt;p&gt;Here's what makes this architecturally challenging: each payment method at Razorpay has its own microservice and its own database.&lt;/p&gt;

&lt;p&gt;Gift cards are managed by one service with its own data store, UPI by another, cards by yet another. In a monolithic architecture, coordinating sequential payments would be straightforward; you'd just use database transactions. But in a microservices architecture, maintaining consistency across multiple independent services with separate databases requires explicit orchestration, distributed state management, and careful handling of partial failures. &lt;/p&gt;

&lt;p&gt;Each of these complexities introduces new failure modes that don't exist when everything lives in a single service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwopwctw9yg1lmlctd2b1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwopwctw9yg1lmlctd2b1.png" alt="sequence-diagram" width="800" height="705"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider what happens if the first payment method (gift card) succeeds but the second (UPI) fails. You can't just tell the customer "payment failed, try again" because you've already captured ₹500 from their gift card. You need to reverse that capture, refund the gift card balance, and handle the coordination between two different payment providers who might have different reversal timelines and APIs.&lt;/p&gt;

&lt;p&gt;The reconciliation complexity multiplies. Merchants need to understand which portion of an order's payment came from which source. Settlement reports need to break down gift card amounts separately from standard payment method amounts. Refunds become complicated because you may need to refund across multiple payment methods in the correct proportions. Tax calculations need to account for different payment sources that might have different tax treatments.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The LinkedPayments Architecture: Building for Complexity&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Our solution treats linked payments as a first-class concept with explicit orchestration rather than trying to shoehorn it into existing single-payment flows. &lt;/p&gt;

&lt;p&gt;The architecture has several key components designed specifically to handle the sequential, multi-source nature of payment chains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbqlhg6fydprjgc8yaon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbqlhg6fydprjgc8yaon.png" alt="flow-diagram" width="800" height="1095"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Payment Orchestrator&lt;/strong&gt; is the brain of the system. It maintains the state machine for payment chains, coordinates sequencing across multiple payment methods, handles failure recovery and reversals, and ensures exactly-once semantics even when components fail and retry. &lt;/p&gt;

&lt;p&gt;The orchestrator knows that gift cards must be attempted first (per business rules), subsequent methods can only proceed after previous ones capture successfully, and any failure in the chain triggers reversal of all completed payments.&lt;/p&gt;
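&lt;p&gt;As a minimal sketch of that sequencing rule (in Python, with illustrative method names and priorities that are assumptions, not Razorpay's actual code), the Orchestrator can derive the chain order from a priority table in which stored-value methods come first:&lt;/p&gt;

```python
# Illustrative sketch of the business rule described above: stored-value
# methods such as gift cards are attempted before standard methods.
# The priority values and method names are assumptions for illustration.
METHOD_PRIORITY = {"gift_card": 0, "wallet": 1, "upi": 2, "card": 2}

def order_chain(methods):
    """Sort the requested methods into the order the chain will run them."""
    return sorted(methods, key=lambda m: METHOD_PRIORITY.get(m, 99))
```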

&lt;p&gt;&lt;strong&gt;Payment Method Handlers&lt;/strong&gt; are specialized for their respective methods. The Gift Card Handler understands gift card balance checks, partial redemption logic, and expiry validation. &lt;/p&gt;

&lt;p&gt;Each handler exposes a consistent interface (authorize, capture, reverse) but implements method-specific logic internally. This abstraction lets the Orchestrator treat all payment methods uniformly while each handler optimizes for its method's unique characteristics.&lt;/p&gt;
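&lt;p&gt;A hedged sketch of that uniform interface, in Python. The class and field names here are illustrative assumptions; the point is that every handler exposes the same authorize/capture/reverse surface while hiding method-specific logic such as gift card balance checks:&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class PaymentMethodHandler(ABC):
    """Hypothetical version of the uniform handler interface described above."""

    @abstractmethod
    def authorize(self, payment_id: str, amount: int) -> bool: ...

    @abstractmethod
    def capture(self, payment_id: str) -> bool: ...

    @abstractmethod
    def reverse(self, payment_id: str) -> bool: ...

class GiftCardHandler(PaymentMethodHandler):
    """Method-specific logic lives behind the shared interface."""

    def __init__(self, balances):
        self.balances = balances   # gift card id mapped to available paise
        self.holds = {}            # payment_id mapped to held amount

    def authorize(self, payment_id, amount):
        # Gift-card specifics: balance check before placing a hold.
        if self.balances.get(payment_id, 0) >= amount:
            self.holds[payment_id] = amount
            return True
        return False

    def capture(self, payment_id):
        amount = self.holds.pop(payment_id, 0)
        self.balances[payment_id] -= amount
        return amount > 0

    def reverse(self, payment_id):
        # Releasing a hold moves no money, so it incurs no charges.
        return self.holds.pop(payment_id, None) is not None
```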

&lt;p&gt;The &lt;strong&gt;data model&lt;/strong&gt; is critical for maintaining consistency. Here's how it works: when the customer chooses to combine payment methods at checkout, a LinkedPayment is initialized for the order. The order stores all &lt;code&gt;payment_ids&lt;/code&gt; in the chain, in sequence, within &lt;code&gt;order_meta&lt;/code&gt;, indexed by &lt;code&gt;order_id&lt;/code&gt;. When the primary payment method (UPI or card) is authorized, an event is published to Kafka. A worker consumes this event, looks up &lt;code&gt;order_meta&lt;/code&gt; by &lt;code&gt;order_id&lt;/code&gt; to retrieve the stored chain of &lt;code&gt;payment_ids&lt;/code&gt;, and authorizes the next payment in the sequence. &lt;/p&gt;

&lt;p&gt;This event-driven approach ensures sequential processing while maintaining loose coupling between payment steps.&lt;/p&gt;
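&lt;p&gt;The worker step can be sketched like this (Python; the event fields and helper names are assumptions for illustration, not the production schema):&lt;/p&gt;

```python
# Hypothetical consumer logic for the Kafka worker described above: on a
# payment-authorized event, read the chain from order_meta and start the
# next leg, or signal that the chain is fully authorized.

def on_payment_authorized(event, order_meta, authorize):
    chain = order_meta[event["order_id"]]["payment_ids"]   # ordered chain
    idx = chain.index(event["payment_id"])
    if idx + 1 == len(chain):
        return None            # last leg authorized; capture phase begins
    next_id = chain[idx + 1]
    authorize(next_id)         # kick off the next payment in the sequence
    return next_id
```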

&lt;p&gt;Redis provides distributed locking to prevent race conditions. When processing a payment chain, we acquire a lock on the order ID to ensure only one process attempts to advance the chain at a time. This prevents double-processing.&lt;/p&gt;
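&lt;p&gt;A minimal sketch of that per-order lock, assuming a redis-py style client (key naming and TTL are illustrative). An in-memory stand-in is included so the sketch runs without a Redis server:&lt;/p&gt;

```python
import uuid

def acquire_order_lock(client, order_id, ttl_ms=5000):
    """Take the per-order lock with SET NX PX; returns a token on
    success, None if another process already holds the lock."""
    token = uuid.uuid4().hex
    ok = client.set(f"lock:order:{order_id}", token, nx=True, px=ttl_ms)
    return token if ok else None

class FakeRedis:
    """Minimal in-memory stand-in so the sketch runs without a server."""
    def __init__(self):
        self.store = {}
    def set(self, key, value, nx=False, px=None):
        if nx and key in self.store:
            return None
        self.store[key] = value
        return True
```

&lt;p&gt;Note that a safe release must verify the token before deleting the key (commonly done atomically with a small Lua script) so one process cannot release a lock that expired and was re-acquired by another.&lt;/p&gt;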

&lt;h2&gt;
  
  
  &lt;strong&gt;Handling Failure Scenarios: The Hard Part&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The real complexity in LinkedPayments isn't the happy path; it's the failure scenarios. Payment systems are inherently distributed, which means failures can happen at every boundary. &lt;/p&gt;

&lt;p&gt;Networks timeout, payment providers have outages, authorization succeeds but capture fails, reversals are requested but the provider is temporarily unavailable. The system needs to handle all of these gracefully without leaving orders in inconsistent states.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: Second payment fails after first succeeds&lt;/strong&gt;. A customer attempts to pay ₹750 using a ₹500 gift card and ₹250 via UPI. The gift card authorization succeeds, but the UPI authorization fails (insufficient balance or a technical issue). Here's the critical design decision: we don't capture any payments until all authorizations in the chain succeed. This means we only have an authorization hold on the gift card, not an actual capture. The system simply reverses the authorization (releases the hold), and importantly, this reversal doesn't incur any charges to the merchant because no money was actually captured. The customer can retry the order with a different payment combination without any financial impact.&lt;/p&gt;
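&lt;p&gt;The authorize-all-then-capture rule can be sketched as follows (Python; handler and leg shapes are assumptions carried over from the handler interface described earlier):&lt;/p&gt;

```python
def run_chain(handlers, legs):
    """Authorize every leg first; capture only when all holds succeed,
    otherwise release every hold taken so far. Nothing was captured at
    that point, so the merchant incurs no charges. Names illustrative.

    handlers: method name mapped to an object with authorize/capture/reverse
    legs: list of (method, payment_id, amount) in chain order
    """
    held = []
    for method, pid, amount in legs:
        if handlers[method].authorize(pid, amount):
            held.append((method, pid))
        else:
            for m, p in reversed(held):   # unwind holds in reverse order
                handlers[m].reverse(p)
            return "failed"
    for method, pid in held:
        handlers[method].capture(pid)
    return "captured"
```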

&lt;p&gt;&lt;strong&gt;Scenario 2: Capture fails after authorization succeeds&lt;/strong&gt;. Both payment methods authorize successfully, but capturing the second payment fails due to a provider outage. We can't just retry the capture indefinitely because authorizations expire (typically 7 days for cards, but gift cards might expire sooner). The system needs to reverse all authorizations that won't be captured and notify the customer to retry the entire order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3: Partial refund on linked payments&lt;/strong&gt;. Customer requests a partial refund for ₹200 on a ₹750 order paid via ₹500 gift card and ₹250 UPI. Which payment method should we refund from? The system provides merchants with the flexibility to decide. Merchants can choose to refund from the gift card, refund from UPI, or even split the refund across both methods based on their business policies or customer preferences. The system calculates the specified refund distribution, triggers appropriate reversals for the chosen payment methods, and updates settlement records to reflect the new amounts. This flexibility lets merchants optimize for customer experience or operational efficiency based on their specific use case.&lt;/p&gt;
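&lt;p&gt;The refund distribution step can be sketched like this (Python; the allocation shape is an assumption, with the merchant's chosen ordering expressed as the list order):&lt;/p&gt;

```python
def split_refund(amount, allocation):
    """Draw a refund from legs in the merchant-chosen order.

    allocation: ordered list of (payment_id, refundable_amount); e.g. a
    merchant may list the gift card leg first to refund store value first.
    Returns a list of (payment_id, refund_amount) to execute.
    """
    remaining = amount
    plan = []
    for payment_id, available in allocation:
        take = min(available, remaining)
        if take > 0:
            plan.append((payment_id, take))
            remaining -= take
    if remaining > 0:
        raise ValueError("refund exceeds refundable balance")
    return plan
```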

&lt;p&gt;&lt;strong&gt;Scenario 4: Idempotency on retries&lt;/strong&gt;. Network glitches can cause the same capture request to be sent multiple times. The system must recognize duplicate requests (using idempotency keys) and return the same result without double-capturing.&lt;/p&gt;

&lt;p&gt;We handle these scenarios through careful state machine design. Each payment in the chain progresses through well-defined states (created, authorized, captured, failed, refunded) with explicit transitions. State transitions are atomic database operations with pessimistic locking to prevent concurrent modifications. The Orchestrator implements retry logic with exponential backoff for transient failures and circuit breakers for sustained provider outages.&lt;/p&gt;
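&lt;p&gt;The transition table for the states named above might look like this (a sketch; enforcing it inside each locked, atomic database update is what keeps every leg's lifecycle legal):&lt;/p&gt;

```python
# Allowed transitions for the per-payment state machine described above.
TRANSITIONS = {
    "created":    {"authorized", "failed"},
    "authorized": {"captured", "failed"},   # a failed chain reverses holds
    "captured":   {"refunded"},
    "failed":     set(),                    # terminal
    "refunded":   set(),                    # terminal
}

def advance(state, target):
    """Validate a transition before persisting it; reject anything else."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} to {target}")
    return target
```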

&lt;h2&gt;
  
  
  &lt;strong&gt;The API Contract: Making It Simple for Merchants&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Despite the internal complexity, we designed the merchant-facing API to be remarkably simple. Here's one of the key architectural advantages: &lt;strong&gt;merchants don't need to create anything special for linked payments orders&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is crucial because you never know upfront whether a customer will combine two payments or use a single method. Forcing merchants to create different order types would be impractical. Instead, merchants create orders exactly like any normal payment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating an order with linked payments&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/v&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;/orders&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Just a standard order creation with amount and currency. The system returns an order ID and checkout URL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The magic happens dynamically at checkout&lt;/strong&gt;. When the customer reaches the payment page, they choose their payment method. If they select "Use Gift Card + Another Method," the LinkedPayments flow activates automatically. The system handles the entire chain (gift card authorization, then secondary payment method) without the merchant needing to have anticipated this during order creation. Everything is updated dynamically based on the customer's actual payment preference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webhook events&lt;/strong&gt; keep merchants informed, but with an important design choice: &lt;strong&gt;we withhold authorization webhooks until all payments in the chain are authorized&lt;/strong&gt;. This prevents merchants from receiving premature notifications that might suggest the order is ready to fulfill when subsequent payments could still fail.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"payment.captured"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"order_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"order_XYZ"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"linked_payment_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lp_001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gift_card"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only after all authorizations succeed do merchants receive webhook notifications. When all linked payments have been captured, we send an order-paid webhook notification. If any payment authorization fails, the order is not marked as paid and any authorized amounts are automatically reversed.&lt;/p&gt;
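&lt;p&gt;The gating rule can be expressed as a small decision function (a sketch; the event names and state vocabulary here are illustrative assumptions, not the exact webhook catalogue):&lt;/p&gt;

```python
def pending_webhooks(leg_states):
    """Decide what to notify for a chain, given each leg's state.

    Authorization webhooks are withheld until every leg is authorized;
    an order-paid event fires only when every leg is captured.
    """
    states = set(leg_states.values())
    if states == {"captured"}:
        return ["order.paid"]
    if states == {"authorized"}:
        return ["payment.authorized"]   # now safe: the whole chain held
    return []                           # in progress or failed: withhold
```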

&lt;p&gt;This event-driven approach lets merchants track progress without polling APIs. They can update their order status in real-time, show customers which payments completed, and handle failures gracefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Reconciliation and Settlement: Following the Money&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's one of the most elegant aspects of the LinkedPayments architecture: &lt;strong&gt;reconciliation and settlement work out-of-the-box without requiring any changes&lt;/strong&gt;. This wasn't accidental; it was a deliberate design choice.&lt;/p&gt;

&lt;p&gt;The system was architected so that each linked payment behaves like an independent payment from the settlement perspective. &lt;/p&gt;

&lt;p&gt;When ₹500 comes from a gift card and ₹250 from UPI, the existing settlement infrastructure treats them as two separate payment transactions associated with the same order. The gift card portion settles according to existing gift card program rules. The UPI portion settles through standard UPI settlement flows.&lt;/p&gt;

&lt;p&gt;Merchants receive settlement reports that automatically break down amounts by payment method using the same reporting infrastructure they already use for regular payments. No new report formats, no special reconciliation processes, no additional integration work. The LinkedPayments logic is transparent to downstream settlement and reconciliation systems.&lt;/p&gt;

&lt;p&gt;This design choice meant we could ship LinkedPayments without requiring merchants to update their financial operations, accounting integrations, or reconciliation workflows. It just works with their existing setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What We Got Right (And What We'd Do Differently)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Building LinkedPayments taught us several lessons about designing payment systems for composability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State machines are essential for reliability&lt;/strong&gt;. Explicit state transitions with validation make the system predictable and debuggable. When a payment gets stuck, we can look at its state and understand exactly where in the flow it stopped and what the next valid transitions are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency isn't optional&lt;/strong&gt;. Payment operations must be safely retryable. Using idempotency keys for every authorization, capture, and reversal request ensures that network issues don't cause double-processing. We learned to make idempotency keys deterministic based on order ID and sequence number so retries naturally use the same key.&lt;/p&gt;
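&lt;p&gt;A minimal sketch of that deterministic derivation (Python; the exact hashing and key format are assumptions, the property that matters is that retries reproduce the same key):&lt;/p&gt;

```python
import hashlib

def idempotency_key(order_id, sequence, operation):
    """Deterministic key from order ID and chain sequence number, so a
    retried request naturally carries the same key instead of minting
    a fresh one and risking double-processing."""
    raw = f"{order_id}:{sequence}:{operation}"
    return hashlib.sha256(raw.encode()).hexdigest()
```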

&lt;p&gt;&lt;strong&gt;Design for automatic reversal handling&lt;/strong&gt;. Rather than building complex reversal logic as a separate concern, we ensured our design handled reversals automatically. If any payment in the chain gets stuck or fails, the system automatically reverses completed authorizations without requiring manual intervention or specialized reversal services. This design-first approach to failure handling proved more reliable than bolt-on reversal mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Settlement works out-of-the-box&lt;/strong&gt;. As mentioned earlier, we designed LinkedPayments so that settlement was supported from day one without requiring any changes to existing settlement infrastructure. Each linked payment leverages the standard settlement flows for its respective payment method. This meant zero additional complexity for settlement routing or reconciliation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing linked payments requires scenario coverage&lt;/strong&gt;. Unit tests aren't sufficient. We built integration test suites covering dozens of scenarios: all payments succeed, first fails, second fails, both fail, reversals succeed, reversals fail, partial refunds, full refunds. This comprehensive testing caught edge cases that would have been painful to discover in production.&lt;/p&gt;

&lt;p&gt;If we were starting over, we'd invest even more heavily in observability from day one. The ability to trace a payment chain's execution across multiple services, understand decision points, and replay sequences for debugging is invaluable. We added this later but wish we'd built it into the initial architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real-World Impact: When Flexibility Drives Adoption&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The business impact of LinkedPayments is currently focused on &lt;strong&gt;gift card programs&lt;/strong&gt;, where we've seen measurable improvements in merchant adoption and customer behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Merchants using gift card programs saw redemption rates increase&lt;/strong&gt;. Customers who previously abandoned gift cards because they couldn't cover full purchases now use them confidently, knowing they can combine with other payment methods. This increased redemption drives customer loyalty and repeat purchases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Average order values increased&lt;/strong&gt; for merchants offering gift cards. When customers can apply gift card balances as partial payment, they're more willing to make larger purchases. The psychology of "I'm already getting ₹500 off" encourages adding more items to the cart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Payment success rates improved&lt;/strong&gt; because customers have more flexibility. If a card payment fails, they can try splitting it with a gift card and a smaller card payment. This fallback path converts what would have been failed checkouts into successful orders.&lt;/p&gt;

&lt;p&gt;However, LinkedPayments was architected to support &lt;strong&gt;any combination of payment methods&lt;/strong&gt;, not just gift cards. &lt;/p&gt;

&lt;p&gt;The same infrastructure that enables gift card + UPI combinations can support scenarios we're actively exploring: &lt;strong&gt;store credit + card&lt;/strong&gt; for e-commerce platforms with loyalty programs, &lt;strong&gt;wallet + netbanking&lt;/strong&gt; for customers managing balances across multiple sources, &lt;strong&gt;corporate credit + personal card&lt;/strong&gt; for expense reimbursement scenarios, and &lt;strong&gt;multiple cards&lt;/strong&gt; for high-value purchases split across credit limits.&lt;/p&gt;

&lt;p&gt;From a platform perspective, LinkedPayments positions Razorpay competitively for merchants running loyalty programs, gift card initiatives, or store credit systems. &lt;/p&gt;

&lt;p&gt;These merchants need payment infrastructure that supports their business model, not just basic transaction processing. By providing this capability out of the box, we've differentiated ourselves from competitors who would require custom integration work.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Broader Lesson: Composability as Infrastructure&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LinkedPayments demonstrates a broader principle about building payment infrastructure: &lt;strong&gt;composability multiplies capability&lt;/strong&gt;. When you design payment methods as modular, chainable units rather than isolated silos, you enable use cases you didn't originally anticipate.&lt;/p&gt;

&lt;p&gt;We've already seen merchants using LinkedPayments for scenarios we never explicitly designed for, such as subscription payments where customers apply account credit before charging the registered card.&lt;/p&gt;

&lt;p&gt;This composability emerges naturally from the architecture. Because the Orchestrator treats payment methods uniformly and the state machine handles arbitrary sequencing, any combination of supported methods "just works" without requiring special case implementation.&lt;/p&gt;

&lt;p&gt;The lesson applies beyond payments. When building platform capabilities, investing in composability early pays dividends. Each new composable unit you add doesn't just enable one new use case; it enables N new combinations with existing units. The value grows combinatorially rather than linearly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Bottom Line&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LinkedPayments transforms payment acceptance from an all-or-nothing proposition to a flexible composition of sources. By treating payment methods as chainable building blocks with robust orchestration, failure handling, and settlement routing, we've enabled merchants to support complex payment scenarios that would otherwise require custom development.&lt;/p&gt;

&lt;p&gt;The technical complexity is real. Payment chains introduce failure modes, latency challenges, reconciliation challenges, and settlement complications that don't exist in single-payment flows. However, the business value justifies this complexity. Merchants need this flexibility to run modern loyalty programs, gift card initiatives, and customer credit systems.&lt;/p&gt;

&lt;p&gt;The architecture we've built demonstrates that composability is achievable in payment systems when you design explicitly for it. Clear state machines, comprehensive failure handling, idempotent operations, and observable execution create systems that remain reliable even as complexity increases.&lt;/p&gt;

&lt;p&gt;If you're building payment infrastructure or financial systems that need to support multiple funding sources, the lessons from LinkedPayments apply directly. Design for composability from day one, make failure handling first-class, invest in state machine rigor, and build observability that lets you understand what's happening when things go wrong.&lt;/p&gt;

&lt;p&gt;The future of payment systems isn't just supporting more payment methods; it's enabling flexible combinations of those methods to match how customers actually want to pay. LinkedPayments is our step toward that future, and the patterns we've discovered are worth considering for anyone building similar financial infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;editor: &lt;a class="mentioned-user" href="https://dev.to/paaarth96"&gt;@paaarth96&lt;/a&gt;&lt;/em&gt;  &lt;/p&gt;

</description>
      <category>razorpay</category>
      <category>payments</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Gateway Integration Agent: How We Cut Payment Gateway Integration Time from Weeks to Days</title>
      <dc:creator>Nikhilesh Chamarthi</dc:creator>
      <pubDate>Wed, 17 Dec 2025 13:39:49 +0000</pubDate>
      <link>https://forem.com/razorpaytech/gateway-integration-agent-how-we-cut-payment-gateway-integration-time-from-weeks-to-days-4cfk</link>
      <guid>https://forem.com/razorpaytech/gateway-integration-agent-how-we-cut-payment-gateway-integration-time-from-weeks-to-days-4cfk</guid>
      <description>&lt;p&gt;contributors : &lt;a class="mentioned-user" href="https://dev.to/ankit_choudhary_2209"&gt;@ankit_choudhary_2209&lt;/a&gt;, &lt;a class="mentioned-user" href="https://dev.to/jating06"&gt;@jating06&lt;/a&gt;, &lt;a class="mentioned-user" href="https://dev.to/amanlalwani007"&gt;@amanlalwani007&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;If you've ever integrated a payment gateway, you know it's rarely as simple as the documentation makes it seem. API quirks, undocumented headers, field mappings that change between environments, and edge cases that only surface in production. At Razorpay, we integrate with dozens of banks and payment providers across India, and until recently, each new gateway meant 2-3 weeks of developer time, extensive testing, and inevitable deployment surprises.&lt;/p&gt;

&lt;p&gt;The traditional approach: read hundreds of pages of documentation, map request/response fields between formats, implement error handling for dozens of failure scenarios, write integration code matching existing patterns, create comprehensive tests, and iterate through multiple rounds of bug fixes. This process is tedious, error-prone, and doesn't scale when you're adding three new gateways simultaneously or when banks update APIs and break existing integrations.&lt;/p&gt;

&lt;p&gt;That's why we built the &lt;strong&gt;Gateway Integration Agent&lt;/strong&gt; on our SWE Agent platform. This isn't just code generation; it's an intelligent system that understands bank documentation, learns from our existing implementations, and generates production-ready integration code with comprehensive test coverage. The result? We've reduced integration time from 2-3 weeks to 4-5 days, increased throughput by 3x, and freed developers to focus on genuinely complex edge cases. Even product managers can now initiate integrations.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Problem: When Payment Integration Becomes a Bottleneck&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Razorpay supports multiple payment methods across different gateways: credit cards, debit cards, UPI, netbanking, wallets. Each method has different workflows, API contracts, and failure modes. Integrating a new gateway means implementing support for whichever methods that gateway offers, often five or six different endpoints with distinct structures.&lt;/p&gt;

&lt;p&gt;Documentation quality varies wildly. Some banks provide comprehensive API specs. Others give you PDFs with SOAP screenshots and vague field descriptions. You spend hours clarifying whether &lt;code&gt;txn_id&lt;/code&gt; equals &lt;code&gt;transaction_reference&lt;/code&gt; or something entirely different. Authentication schemes range from simple API keys to byzantine combinations of certificates, HMAC signatures, and rotating tokens.&lt;/p&gt;
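&lt;p&gt;As one concrete flavour of those schemes, many bank APIs require an HMAC-SHA256 signature over the request payload. A generic sketch (the header name and payload canonicalization differ per gateway, so treat this as a shape, not any specific bank's scheme):&lt;/p&gt;

```python
import hashlib
import hmac

def sign_payload(secret, payload):
    """HMAC-SHA256 over the serialized request body, hex-encoded.
    Secret handling and canonicalization are gateway-specific."""
    mac = hmac.new(secret.encode(), payload.encode(), hashlib.sha256)
    return mac.hexdigest()
```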

&lt;p&gt;Our integration codebase has accumulated patterns over years. Gateway implementations live in &lt;code&gt;integrations-go&lt;/code&gt; with established conventions for client structure, retry handling, logging, and data transformation. New integrations must follow these patterns for consistency and maintainability. Developers can't just implement what bank docs say; they need to implement it the Razorpay way.&lt;/p&gt;

&lt;p&gt;Testing is another massive time sink. Unit tests covering happy paths, error scenarios, timeout handling, retry logic. Integration tests verifying field transformation correctness. Handling cases where banks return success codes with error messages in the body. Writing comprehensive tests often takes as long as writing the integration code itself.&lt;/p&gt;

&lt;p&gt;Gateway integrations became a bottleneck. Business partnerships signed, merchants requested specific banks, and the development backlog took weeks to clear. We needed dramatic acceleration without sacrificing code quality or test coverage.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Phase 1: Manual LLM-Assisted Integration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Our first attempt used Claude with Cursor rules to assist developers: load bank docs and a reference gateway into context, then iteratively prompt for functions, error handlers, and tests. This augmented development approach cut integration time from weeks to days in many cases. Claude excelled at complex field mappings, boilerplate generation, and test case creation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mofc7qrd99e0rsssz7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mofc7qrd99e0rsssz7z.png" alt="LLM integration" width="800" height="91"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, limitations emerged. The process still required experienced developers to orchestrate it and make architectural decisions. Context window limits meant carefully curating what information to provide. Consistency varied with how developers crafted prompts, and quality depended heavily on prompt engineering and output review.&lt;/p&gt;

&lt;p&gt;What we learned: LLMs could handle payment gateway complexity when properly guided. The limitation wasn't model capability; it was lack of systematic orchestration. We needed to encode the entire workflow into an automated system applying best practices consistently.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Enter SWE Agent: Platform for Engineering Automation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SWE Agent&lt;/strong&gt; is our internal automation platform for streamlining repetitive SDLC tasks through AI-powered workflows. It is an orchestration layer for developer productivity where you define agents that understand engineering contexts, make intelligent decisions, and execute complex multi-step tasks autonomously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84pakuz5pat83ed2dea4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84pakuz5pat83ed2dea4.png" alt="overview" width="800" height="1048"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;React frontend&lt;/strong&gt; democratizes automation access. Teams browse available agents, configure parameters, and monitor execution through a clean web interface without needing command-line expertise.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;FastAPI backend&lt;/strong&gt; handles API routing, authentication, and job scheduling. An integrated MCP Server exposes the Model Context Protocol, allowing IDEs and external clients to interact programmatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangGraph workflows&lt;/strong&gt; provide intelligent orchestration with state machines and conditional logic. Workflows execute multi-step processes (Initialize → Run Parallel → Validate → Deploy → E2E Tests → Create PRs) that can branch, retry, or fail gracefully based on intermediate results. This is crucial for complex tasks where subsequent steps depend on earlier discoveries.&lt;/p&gt;
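
&lt;p&gt;A framework-agnostic sketch of that conditional flow (the real system uses LangGraph; the step names and retry budget here are illustrative):&lt;/p&gt;

```python
# A plain-Python stand-in for the LangGraph state machine: each step
# returns the name of the next step, and validate branches on results.
def initialize(state):
    state["attempts"] = 0
    return "run_parallel"

def run_parallel(state):
    # Stand-in for parallel code generation plus static checks; here the
    # build fails on the first attempt and passes after one fix round.
    state["build_ok"] = state["attempts"] > 0
    return "validate"

def validate(state):
    if state["build_ok"]:
        return "create_pr"
    state["attempts"] += 1
    if state["attempts"] > 3:
        return "fail"          # give up gracefully after a bounded retry budget
    return "run_parallel"      # otherwise loop back and regenerate

def create_pr(state):
    state["result"] = "pr_opened"
    return None

def fail(state):
    state["result"] = "needs_human"
    return None

STEPS = {f.__name__: f for f in (initialize, run_parallel, validate, create_pr, fail)}

def run_workflow():
    state, step = {}, "initialize"
    while step is not None:
        step = STEPS[step](state)
    return state

print(run_workflow()["result"])  # pr_opened
```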

&lt;p&gt;&lt;strong&gt;Background workers&lt;/strong&gt; provide scalable async execution via SQS. Long-running tasks execute on horizontally scalable worker nodes without blocking APIs or hitting timeouts.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Agent Execution Layer&lt;/strong&gt; supports multiple headless CLI agents: Claude Code (via AWS Bedrock and Vertex AI), Gemini CLI (via Vertex AI), and Agent-to-Agent communication through Google ADK. &lt;strong&gt;MCP integrations&lt;/strong&gt; extend capabilities: Memory, Sequential Thinking, Service Level, E2E Tests, K8s, Devstack, Infra, and Data Lake access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt; includes MySQL for persistence, Redis for caching, GitHub CLI for repos, AWS EFS for shared codebase access, and dual LLM provider support (Bedrock and Vertex AI) for environment-specific model selection.&lt;/p&gt;

&lt;p&gt;What makes this powerful is the combination: intelligent orchestration making context-aware decisions, robust execution infrastructure handling scale and retries, comprehensive tooling interacting with Git/GitHub/K8s/test frameworks like developers would, and an extensible agent catalogue where new agents integrate without core modifications.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Gateway Integration Workflow: The Complete Flow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna75v73t0cqz5cru46zp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna75v73t0cqz5cru46zp.png" alt="architecture overview" width="800" height="988"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Configuration&lt;/strong&gt;. User selects Gateway Integration Agent, provides gateway name, reference gateway, APIs to integrate, payment methods, and bank documentation in MDC (Markdown Catalog) format. MDC is structured documentation clearly delineating endpoints, schemas, authentication, error codes, and validations, allowing programmatic extraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Documentation Analysis&lt;/strong&gt;. LangGraph parses MDC semantically, identifying endpoint patterns, required vs optional fields, authentication schemes, error codes, and response mappings. This creates a knowledge graph of the gateway's API surface. The agent understands that to initiate payment, you call endpoint X with fields A, B, C; error &lt;code&gt;1001&lt;/code&gt; means invalid credentials; the bank uses ISO 8601 timestamps in responses but expects epoch milliseconds in requests.&lt;/p&gt;
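
&lt;p&gt;The timestamp quirk above is the kind of detail the knowledge graph captures. A minimal sketch, assuming a hypothetical bank that returns ISO 8601 strings but expects epoch milliseconds in requests:&lt;/p&gt;

```python
from datetime import datetime, timezone

def to_bank_timestamp(dt):
    """Razorpay-side aware datetime to the bank's epoch-millisecond format."""
    return int(dt.timestamp() * 1000)

def from_bank_timestamp(iso_string):
    """Bank's ISO 8601 response field back to an aware datetime."""
    return datetime.fromisoformat(iso_string.replace("Z", "+00:00"))

dt = datetime(2026, 2, 4, 8, 24, 36, tzinfo=timezone.utc)
print(to_bank_timestamp(dt))                       # 1770193476000
print(from_bank_timestamp("2026-02-04T08:24:36Z"))
```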

&lt;p&gt;&lt;strong&gt;Step 3: Reference Learning&lt;/strong&gt;. Agent clones the reference gateway repository and analyzes implementation patterns: client initialization structure, configuration constant placement, authentication token refresh handling, retry logic for network failures, transaction logging approaches, field transformation abstractions. This pattern recognition ensures new gateways match existing conventions automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Conditional Routing&lt;/strong&gt;. Workflow checks payment methods being integrated. UPI QR payments route to specific repositories; other methods route to &lt;code&gt;integrations-go&lt;/code&gt;. This dynamic routing based on configuration makes the system extensible without core rewrites.&lt;/p&gt;
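
&lt;p&gt;The routing check itself can be sketched in a few lines; only &lt;code&gt;integrations-go&lt;/code&gt; is named in this post, so the UPI QR repository name below is hypothetical:&lt;/p&gt;

```python
# Minimal sketch of the Step 4 routing decision.
UPI_QR_METHODS = {"upi_qr"}

def route_repository(payment_methods):
    """Pick the target repository based on the methods being integrated."""
    if UPI_QR_METHODS.intersection(payment_methods):
        return "upi-qr-service"   # hypothetical UPI QR repository
    return "integrations-go"

print(route_repository({"upi_qr", "netbanking"}))  # upi-qr-service
print(route_repository({"cards", "wallet"}))       # integrations-go
```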

&lt;p&gt;&lt;strong&gt;Step 5: Code Generation&lt;/strong&gt;. Agent generates gateway client (authentication, connection, request execution), field transformers (Razorpay format to bank format bidirectionally), error handlers (catching, classifying, standardizing errors), validators (pre-flight requirement checks), and logging instrumentation. The code is customized to specific bank quirks: custom headers, XML error responses, specific transaction ID patterns.&lt;/p&gt;
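
&lt;p&gt;A hedged sketch of one generated artifact, the bidirectional field transformer; the bank-side field names are invented for illustration:&lt;/p&gt;

```python
# Map Razorpay-side field names to a hypothetical bank's names and back.
RZP_TO_BANK = {
    "transaction_id": "txn_id",
    "amount_paise": "amt",
    "currency": "ccy",
}
BANK_TO_RZP = {bank: rzp for rzp, bank in RZP_TO_BANK.items()}

def transform(payload, mapping):
    # Unknown fields are dropped rather than passed through, so new bank
    # fields must be mapped explicitly before they reach our models.
    return {mapping[k]: v for k, v in payload.items() if k in mapping}

req = {"transaction_id": "pay_123", "amount_paise": 75000, "currency": "INR"}
bank_req = transform(req, RZP_TO_BANK)
print(bank_req)  # {'txn_id': 'pay_123', 'amt': 75000, 'ccy': 'INR'}
assert transform(bank_req, BANK_TO_RZP) == req  # round-trips cleanly
```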

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokfed37zo5rs3g3l55ed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokfed37zo5rs3g3l55ed.png" alt="sequence diagram" width="800" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Automated Error Resolution&lt;/strong&gt;. Agent runs linters checking style violations, then compilers checking syntax errors and type mismatches. When errors occur, the workflow analyzes error messages, understands what went wrong, generates targeted fixes, applies them, and recompiles. This iterative process continues until achieving clean compilation.&lt;/p&gt;
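
&lt;p&gt;The fix-and-recompile loop can be sketched as follows; &lt;code&gt;compile_step&lt;/code&gt; and &lt;code&gt;generate_fix&lt;/code&gt; are stand-ins for the real compiler invocation and LLM call:&lt;/p&gt;

```python
def resolve_errors(source, compile_step, generate_fix, max_rounds=5):
    """Iterate until the code compiles cleanly or the retry budget runs out."""
    for _ in range(max_rounds):
        ok, errors = compile_step(source)
        if ok:
            return source
        source = generate_fix(source, errors)  # targeted patch per error
    raise RuntimeError("could not reach clean compilation")

# Toy stand-ins: the 'compiler' rejects the code until a fix is applied.
fake_compile = lambda src: ("FIXED" in src, ["undefined: txnID"])
fake_fix = lambda src, errs: src + "  // FIXED: declare txnID"

print(resolve_errors("func Pay() {}", fake_compile, fake_fix))
```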

&lt;p&gt;&lt;strong&gt;Step 7: Test Generation&lt;/strong&gt;. Agent generates comprehensive unit tests targeting 80%+ coverage: successful payment flows, error scenarios, timeout/retry logic, field transformation correctness, validation edge cases, authentication refresh. Tests are informed by bank error codes and reference gateway patterns, ensuring consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8: Pull Request Creation&lt;/strong&gt;. Workflow creates feature branch, commits code and tests, pushes to remote, and opens PR with detailed description including gateway name, supported methods, implemented endpoints, bank-specific quirks, and coverage statistics. The PR is ready for human review; the tedious work is done, and critical oversight stays with humans.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real-World Impact&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgmv9pj0ioey24v26qor.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgmv9pj0ioey24v26qor.png" alt="flowchart" width="398" height="1366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Integration time dropped from 2-3 weeks to 4-5 days (75% reduction). We've successfully onboarded Decentro and HDFC Smart Gateway using the agent. Throughput increased 3x; teams now handle three to four gateways monthly versus one. Developer workload shifted from 100% hands-on coding to 25% review and edge case handling. Code consistency improved dramatically; every generated gateway follows identical patterns, conventions, and logging approaches.&lt;/p&gt;

&lt;p&gt;Product managers can now initiate integrations through the intuitive UI without deep technical knowledge. This democratization means integrations start immediately after partnerships sign, without waiting for developer availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Makes This Different&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic understanding&lt;/strong&gt; of MDC documentation: the agent parses it as structured API specifications, not just text. &lt;strong&gt;Pattern learning&lt;/strong&gt; from reference implementations goes beyond simple copying, identifying abstract patterns and applying them contextually. &lt;strong&gt;Conditional orchestration&lt;/strong&gt; via LangGraph makes intelligent routing decisions dynamically. &lt;strong&gt;Iterative error resolution&lt;/strong&gt; analyzes, fixes, and retries until clean compilation. &lt;strong&gt;Infrastructure integration&lt;/strong&gt; creates branches, runs linters and compilers, generates tests, and opens PRs. &lt;strong&gt;Comprehensive test generation&lt;/strong&gt; achieves 80%+ coverage informed by error scenarios and reference patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Bottom Line&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Gateway Integration Agent represents a fundamental shift in how Razorpay approaches payment infrastructure development. We've moved from manual bottlenecks constraining business growth to rapid, consistent integrations with minimal developer overhead.&lt;/p&gt;

&lt;p&gt;Results: 75% time reduction, 3x throughput increase, democratized capability. Decentro and HDFC Smart Gateway successfully onboarded, proving production viability. Code quality is high, test coverage comprehensive, architectural patterns consistent.&lt;/p&gt;

&lt;p&gt;More importantly, we've built a scalable foundation for continued improvement. Every gateway integrated makes the system smarter. Every edge case enriches pattern libraries. Every workflow improvement benefits all future integrations automatically.&lt;/p&gt;

&lt;p&gt;This approach demonstrates that complex technical tasks traditionally requiring significant human expertise can be automated when you combine AI capabilities with systematic orchestration, robust tooling integration, and iterative refinement. The Gateway Integration Agent proves this works in production at scale.&lt;/p&gt;

&lt;p&gt;If you're facing similar integration bottlenecks, the lessons apply directly. Build structured documentation formats machines can parse. Create reference implementations encoding your patterns. Use orchestration frameworks supporting conditional logic and state management. Integrate deeply with development infrastructure. Generate comprehensive tests, not just code. Design for continuous improvement rather than one-time automation.&lt;/p&gt;

&lt;p&gt;The future of developer productivity isn't replacing engineers; it's empowering them to focus on genuinely complex problems by automating routine work. That's exactly what the Gateway Integration Agent does.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;editor: &lt;a class="mentioned-user" href="https://dev.to/paaarth96"&gt;@paaarth96&lt;/a&gt;&lt;/em&gt;  &lt;/p&gt;

</description>
      <category>razorpay</category>
      <category>ai</category>
      <category>automation</category>
      <category>langgraph</category>
    </item>
    <item>
      <title>Project Viveka: A Multi-Agent AI That Does Root Cause Analysis in Under 90 Seconds</title>
      <dc:creator>Anuj Gupta</dc:creator>
      <pubDate>Tue, 09 Dec 2025 09:04:24 +0000</pubDate>
      <link>https://forem.com/razorpaytech/project-viveka-a-multi-agent-ai-that-does-root-cause-analysis-in-under-90-seconds-4g44</link>
      <guid>https://forem.com/razorpaytech/project-viveka-a-multi-agent-ai-that-does-root-cause-analysis-in-under-90-seconds-4g44</guid>
      <description>&lt;p&gt;If you've ever been on-call for production systems, you know the 2 AM drill. An alert fires. You groggily open your laptop, check the incident dashboard, jump into Grafana to examine metrics, dig through Coralogix logs looking for error spikes, SSH into Kubernetes to check pod health, review recent deployments, correlate across six different data sources, and thirty minutes later you're still trying to figure out what's actually wrong. &lt;/p&gt;

&lt;p&gt;At Razorpay, where payment infrastructure processes billions of rupees daily, this manual investigation dance was costing us precious time during every incident.&lt;/p&gt;

&lt;p&gt;The industry obsesses over Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR), but there's a critical metric in between that often gets overlooked: &lt;strong&gt;Mean Time to Investigate (MTTI)&lt;/strong&gt;. This is the gap between knowing something is broken and understanding why it's broken. &lt;/p&gt;

&lt;p&gt;Traditional incident response spends the majority of time in this investigation phase, manually following runbooks, querying systems, and correlating signals. By the time you understand the root cause, you've already burned through the minutes that matter most for customer impact.&lt;/p&gt;

&lt;p&gt;That's why we built &lt;strong&gt;Project Viveka&lt;/strong&gt;, a multi-agent AI system that automates the entire investigation workflow. When an alert fires, Viveka orchestrates specialist agents across our observability stack, correlates evidence, and produces a structured root cause analysis with supporting data in under 90 seconds. &lt;/p&gt;

&lt;p&gt;The name comes from Sanskrit meaning "discernment" or "wisdom," which felt appropriate for a system designed to cut through observability noise and find signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Investigation Problem: When Manual Triage Doesn't Scale&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before diving into how Viveka works, let's talk about why incident investigation is so painful at scale. The challenge isn't having too little data; it's having too much data in too many disconnected systems, and needing human intelligence to connect the dots.&lt;/p&gt;

&lt;p&gt;Our observability stack spans multiple systems. &lt;strong&gt;Zenduty&lt;/strong&gt; handles alert routing and incident coordination. &lt;strong&gt;Grafana&lt;/strong&gt; and &lt;strong&gt;VictoriaMetrics&lt;/strong&gt; provide metrics dashboards and PromQL queries. &lt;strong&gt;Coralogix&lt;/strong&gt; aggregates logs from hundreds of services. &lt;strong&gt;Kubernetes&lt;/strong&gt; provides pod health and deployment information. &lt;strong&gt;AWS&lt;/strong&gt; surfaces infrastructure-level signals about compute, load balancers, and networking. Each system has valuable information, but they don't talk to each other automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz92sigac9rxdeiuc0gyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz92sigac9rxdeiuc0gyf.png" alt="Incident response" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When an alert fires for something like "Payment success rate dropped below 50%," an engineer follows a mental runbook. Check recent deployments. Look at error logs. Examine pod restarts. Query database metrics. Check downstream dependencies. Cross-reference all these signals to form a hypothesis about what's wrong.&lt;/p&gt;

&lt;p&gt;This manual correlation is where time disappears. Each check takes minutes, involves context switching between tools, and requires the engineer to remember how different signals relate to each other. Moreover, the quality of investigation depends heavily on who's on-call. Experienced engineers know exactly which signals matter for which alerts. Junior engineers might check irrelevant systems or miss critical correlations. This inconsistency means similar incidents get diagnosed differently depending on who's investigating.&lt;/p&gt;

&lt;p&gt;The consequences are measurable. &lt;strong&gt;High MTTI&lt;/strong&gt; because no system automatically correlates signals across observability tools. Engineers spend 20-40 minutes just figuring out what's wrong before they can start fixing it. &lt;strong&gt;Inconsistent diagnosis&lt;/strong&gt; because different engineers investigate the same symptoms differently. &lt;strong&gt;Knowledge silos&lt;/strong&gt; because the correlation logic lives in people's heads rather than documented playbooks. &lt;strong&gt;After-hours pain&lt;/strong&gt; because automated systems can detect problems but can't explain them, requiring human intervention regardless of the hour.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Solution Architecture: Multi-Agent Orchestration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Our response was to encode the investigation workflow into an AI system that thinks like an experienced SRE. Rather than a single monolithic AI trying to understand all observability signals, we built a multi-agent system where specialized agents handle different domains, orchestrated by a Supervisor that coordinates the investigation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70772g89wnikakson1oa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70772g89wnikakson1oa.png" alt="Multi agent orchestration" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Supervisor Agent&lt;/strong&gt; is built on &lt;strong&gt;LangGraph&lt;/strong&gt;, a framework for creating stateful multi-agent workflows. It receives incident context, retrieves relevant knowledge from our RAG systems, creates an investigation plan based on alert runbooks, delegates tasks to specialist agents, and synthesizes their findings into a coherent root cause analysis. Think of it as the incident commander making strategic decisions about what to investigate and how to correlate findings.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Specialist Agents&lt;/strong&gt; are domain experts. The Kubernetes Agent knows how to check pod health, identify failed rollouts, and spot resource constraints. The AWS Agent understands infrastructure patterns like load balancer saturation, network issues, or compute degradation. The Coralogix Agent analyzes logs for error spikes, exception patterns, and anomalous behavior. The PromQL Tool queries metrics to understand performance degradation, latency increases, or throughput drops. Each agent is focused and excellent in its domain.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;RAG systems&lt;/strong&gt; provide contextual memory. Application Info contains service architecture, dependencies, ownership, and common failure modes. Alert Runbooks store diagnostic procedures specific to each alert type. When investigating a payment service alert, the Supervisor retrieves that service's architecture and the specific runbook for payment success rate degradation. This contextual grounding prevents generic responses and ensures investigations follow proven procedures.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Memory system&lt;/strong&gt; is crucial for correlation. After each agent completes its investigation, results get stored as structured evidence: what was checked, what was found, confidence level, and supporting data. Once all agents finish, the Supervisor reviews all stored evidence together, identifies patterns and correlations, resolves conflicts between signals, and constructs the most likely hypothesis based on collective evidence.&lt;/p&gt;
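
&lt;p&gt;A minimal sketch of such an evidence record; the field names follow the description above, not Viveka's actual schema:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    agent: str         # which specialist produced this
    checked: str       # the input: what was examined
    found: str         # the output: what was observed
    note: str          # one-line interpretation
    confidence: float  # 0.0 to 1.0

memory = []
memory.append(Evidence("k8s", "pod health in payments namespace",
                       "14 restarts in 5 minutes", "crash-looping after rollout", 0.9))
memory.append(Evidence("coralogix", "error logs, last 15 minutes",
                       "spike in connection errors", "new code path failing", 0.8))

# The Supervisor later reasons over this fact base, highest confidence first.
for ev in sorted(memory, key=lambda e: e.confidence, reverse=True):
    print(ev.agent, "-", ev.note)
```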

&lt;h2&gt;
  
  
  &lt;strong&gt;The Investigation Workflow: From Alert to Answer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let me walk through exactly what happens when an incident triggers Viveka. Understanding the step-by-step flow reveals why this approach dramatically reduces investigation time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag4hefde4eg5rmke1c1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag4hefde4eg5rmke1c1l.png" alt="sequence diagram" width="800" height="959"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Context Retrieval&lt;/strong&gt;. When the alert arrives, the Supervisor immediately pulls relevant information from both RAG collections. Application Info provides the payment service's architecture, dependencies, recent changes, and ownership. Alert Runbook provides the specific diagnostic procedure for payment success rate alerts. This contextual loading takes 2-3 seconds and ensures the investigation is targeted rather than generic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Investigation Planning&lt;/strong&gt;. The Supervisor parses the runbook and creates a task plan. For a payment success rate alert, the plan might specify: check recent deployments, analyze error logs for payment failures, query success rate metrics over time, examine pod health and restarts, validate database connection health. This planning phase takes 1-2 seconds and produces a prioritized list of checks.&lt;/p&gt;
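
&lt;p&gt;A toy version of that planning step, turning numbered runbook lines into a prioritized task list (the runbook text is illustrative):&lt;/p&gt;

```python
RUNBOOK = """1. check recent deployments
2. analyze error logs for payment failures
3. query success rate metrics over time
4. examine pod health and restarts
5. validate database connection health"""

def plan(runbook):
    """Parse 'N. action' lines into (priority, action) pairs."""
    tasks = []
    for line in runbook.splitlines():
        priority, _, action = line.partition(". ")
        tasks.append((int(priority), action.strip()))
    return sorted(tasks)   # lowest number runs first

for priority, action in plan(RUNBOOK):
    print(priority, action)
```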

&lt;p&gt;&lt;strong&gt;Step 3: Parallel Investigation&lt;/strong&gt;. Here's where the architecture shines. The Supervisor delegates tasks to multiple agents simultaneously. While the Kubernetes Agent checks pod health, the Coralogix Agent analyzes logs, the PromQL Tool queries metrics, and the AWS Agent validates infrastructure. These investigations happen in parallel with a per-agent timeout of 5-8 seconds. Total investigation time is bounded by the slowest agent, not the sum of all agents.&lt;/p&gt;
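
&lt;p&gt;The fan-out can be sketched with &lt;code&gt;asyncio&lt;/code&gt;; the agent names come from the article, and the sleep calls stand in for real checks:&lt;/p&gt;

```python
import asyncio

async def check(name, seconds):
    await asyncio.sleep(seconds)   # stands in for the real agent call
    return (name, "ok")

async def investigate():
    # Each specialist check is bounded by its own timeout, so wall time
    # tracks the slowest agent rather than the sum of all checks.
    tasks = [
        asyncio.wait_for(check("k8s", 0.02), timeout=8),
        asyncio.wait_for(check("coralogix", 0.05), timeout=8),
        asyncio.wait_for(check("promql", 0.03), timeout=8),
    ]
    # A timed-out agent yields an exception, not a crash of the whole run.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]

print(asyncio.run(investigate()))
```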

&lt;p&gt;&lt;strong&gt;Step 4: Evidence Storage&lt;/strong&gt;. As each agent completes, it writes structured evidence to Memory: the input (what was checked), the output (what was found), a one-line note (interpretation), and confidence level. This structured storage is critical because it creates a fact base that the Supervisor can reason over during synthesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Correlation and Hypothesis Scoring&lt;/strong&gt;. With all evidence collected, the Supervisor builds an incident timeline ordered by timestamp. Deployment at T, error spike at T+2 minutes, pod restarts at T+5 minutes. Temporal correlation is powerful; events happening in sequence suggest causality. The Supervisor generates multiple hypotheses (bad deployment, infrastructure issue, downstream dependency failure) and scores each based on evidence count, temporal correlation, and historical patterns. The highest-scoring hypothesis becomes the primary explanation.&lt;/p&gt;
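
&lt;p&gt;A toy version of that scoring, rewarding evidence count and tight temporal clustering (the events and weights here are illustrative):&lt;/p&gt;

```python
# Points for each supporting timeline event, plus a bonus when the
# supporting events cluster within a ten-minute window.
timeline = [           # (minutes after the deploy, observed event)
    (0, "deploy"), (2, "error_spike"), (5, "pod_restarts"),
]

hypotheses = {
    "bad_deployment":    {"deploy", "error_spike", "pod_restarts"},
    "infra_issue":       {"pod_restarts"},
    "downstream_outage": {"error_spike"},
}

def score(supports):
    events = [t for t, name in timeline if name in supports]
    tight = bool(events) and 10 >= max(events) - min(events)
    return len(events) + (2 if tight else 0)

best = max(hypotheses, key=lambda h: score(hypotheses[h]))
print(best, score(hypotheses[best]))  # bad_deployment 5
```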

&lt;p&gt;&lt;strong&gt;Step 6: RCA Generation&lt;/strong&gt;. The Supervisor generates a human-readable summary including the root cause hypothesis, confidence score, key supporting evidence with citations, reasoning trail explaining the conclusion, and recommended next actions. This isn't just "here's what's wrong" but "here's what's wrong, here's the evidence, here's what you should do."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Slack Posting&lt;/strong&gt;. The RCA gets posted to the relevant team's Slack channel in a threaded format under the original alert. This keeps conversation tied to the incident and provides visibility to the entire team. Engineers can review the analysis, provide feedback on accuracy, and discuss remediation approaches without switching tools.&lt;/p&gt;
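
&lt;p&gt;Threading under the original alert uses Slack's &lt;code&gt;chat.postMessage&lt;/code&gt; with &lt;code&gt;thread_ts&lt;/code&gt;; a minimal payload builder, with the channel and timestamp values invented for illustration:&lt;/p&gt;

```python
def build_rca_message(channel, alert_ts, rca):
    """Payload for Slack chat.postMessage; thread_ts threads the RCA."""
    return {
        "channel": channel,
        "thread_ts": alert_ts,  # ties this reply to the alert's thread
        "text": "Root cause: {0} (confidence {1:.0%})".format(
            rca["hypothesis"], rca["confidence"]),
    }

rca = {"hypothesis": "bad deployment of payments service", "confidence": 0.9}
msg = build_rca_message("#payments-oncall", "1733734000.123400", rca)
print(msg["text"])  # Root cause: bad deployment of payments service (confidence 90%)
```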

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Multi-Agent Architecture Matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You might wonder why we chose a multi-agent approach rather than a single powerful model. The answer reveals fundamental insights about building reliable AI systems for production operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specialization beats generalization for complex domains&lt;/strong&gt;. A single model trying to understand Kubernetes, AWS infrastructure, log patterns, metrics interpretation, and incident correlation would need enormous context and struggle with domain-specific nuances. Specialist agents can be optimized for their specific task, use domain-specific reasoning patterns, and maintain focused expertise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel execution dramatically reduces latency&lt;/strong&gt;. A sequential investigation checking systems one after another would take minutes. Parallel agent execution means total time is bounded by the slowest check (typically 5-8 seconds), not the sum of all checks. This parallelism is critical for hitting the 90-second target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bounded context prevents token overflow&lt;/strong&gt;. Each agent receives only the context it needs for its specific check. The Kubernetes Agent gets pod names and namespace, not the entire application architecture. This focused context prevents token limit issues that plague single-agent approaches with comprehensive context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory-based synthesis reduces hallucinations&lt;/strong&gt;. Rather than asking the LLM to remember everything from the investigation, we store facts in Memory and have the Supervisor reason over concrete evidence. This grounds the analysis in observable data rather than model-generated speculation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compositional improvement over time&lt;/strong&gt;. When we improve an individual agent (better log analysis, more sophisticated metrics queries), all investigations automatically benefit. This modularity makes the system easier to enhance incrementally.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Results and Impact&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The shift to AI-powered investigation produced measurable improvements. &lt;strong&gt;MTTI dropped by approximately 80%&lt;/strong&gt;. Investigations that previously took 20-40 minutes of manual correlation now complete in 90 seconds. &lt;strong&gt;MTTR improved by 50-60%&lt;/strong&gt; because faster investigation means faster remediation. Engineers can act on findings immediately rather than spending half their time figuring out what's wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency improved dramatically&lt;/strong&gt;. Every alert of a given type follows the same investigation procedure, checking the same systems and correlating the same signals. Junior and senior engineers see identical analysis quality. This consistency also improves knowledge sharing; when the investigation is documented automatically, everyone learns from each incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After-hours coverage became genuine&lt;/strong&gt;. Previously, automated alerting still required human investigation. Now the investigation happens automatically, and the on-call engineer receives a complete RCA alongside the alert. In many cases, the recommended action is clear enough that remediation can start immediately without additional investigation.&lt;/p&gt;

&lt;p&gt;The system posts its analysis for most alerts within roughly 90 seconds, and teams rate accuracy through feedback in Slack threads. These ratings feed back into improving the RAG knowledge base and refining runbooks over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Bottom Line&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Project Viveka demonstrates that AI-powered incident investigation is practical, reliable, and dramatically faster than manual approaches. By encoding investigation workflows into orchestrated multi-agent systems, we've automated the most time-consuming phase of incident response while maintaining the quality and thoroughness of human investigation.&lt;/p&gt;

&lt;p&gt;The 80% MTTI reduction isn't just a number; it represents minutes saved during every incident, which compounds into hours saved weekly and days saved annually. More importantly, it changes how engineers experience on-call. Rather than starting from scratch with every alert, they receive structured analysis immediately and can focus on remediation rather than diagnosis.&lt;/p&gt;

&lt;p&gt;The multi-agent architecture is key to this success. Specialization enables domain expertise, parallelism enables speed, memory enables accurate correlation, and modularity enables continuous improvement. This isn't a single AI trying to do everything; it's a coordinated system of specialist AIs working together like an experienced SRE team.&lt;/p&gt;

&lt;p&gt;If your organization faces similar incident investigation challenges, the lessons apply broadly. Build specialist agents for different observability domains, orchestrate them with explicit workflows, ground analysis in stored evidence rather than model memory, and integrate directly into team communication channels. The technology exists, the patterns are proven, and the benefits are measurable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;editor: &lt;a class="mentioned-user" href="https://dev.to/paaarth96"&gt;@paaarth96&lt;/a&gt;&lt;/em&gt;  &lt;/p&gt;

</description>
      <category>rca</category>
      <category>mtti</category>
      <category>mttr</category>
      <category>razorpay</category>
    </item>
    <item>
      <title>Meet Bumblebee: Agentic AI Flagging Risky Merchants in Under 90 Seconds</title>
      <dc:creator>Ankur</dc:creator>
      <pubDate>Tue, 02 Dec 2025 10:59:50 +0000</pubDate>
      <link>https://forem.com/razorpaytech/meet-bumblebee-agentic-ai-flagging-risky-merchants-in-under-90-seconds-2nlf</link>
      <guid>https://forem.com/razorpaytech/meet-bumblebee-agentic-ai-flagging-risky-merchants-in-under-90-seconds-2nlf</guid>
      <description>&lt;p&gt;&lt;strong&gt;contributors: &lt;a class="mentioned-user" href="https://dev.to/parin-k"&gt;@parin-k&lt;/a&gt;, &lt;a class="mentioned-user" href="https://dev.to/sumit12dec"&gt;@sumit12dec&lt;/a&gt; &lt;a class="mentioned-user" href="https://dev.to/yashshree_shinde"&gt;@yashshree_shinde&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you've worked at a payments company, you know the drill. Risk agents manually review thousands of merchant websites every month, checking for red flags: sketchy privacy policies, misaligned pricing, questionable social media presence, suspicious domain registration patterns. &lt;/p&gt;

&lt;p&gt;At Razorpay, our risk operations team was conducting 10,000 to 12,000 manual website reviews monthly, each taking roughly four minutes of human attention. That's 700 to 800 human hours consumed every month, and the quality was inconsistent because different agents would interpret the same signals differently.&lt;/p&gt;

&lt;p&gt;The traditional approach to fraud detection involves throwing bodies at the problem or building rigid rule engines that break the moment fraudsters adapt their tactics. We needed something better, something that could scale with our transaction volume while actually getting smarter over time. &lt;/p&gt;

&lt;p&gt;That's why we built what we're calling &lt;strong&gt;Agentic Risk&lt;/strong&gt;, a multi-agent AI system that automates merchant website evaluation from end to end while maintaining the nuanced judgment that used to require human expertise.&lt;/p&gt;

&lt;p&gt;Here's what makes this interesting: we didn't just replace humans with AI and call it done. We went through three distinct architectural iterations, each one teaching us hard lessons about what works and what doesn't when you're building AI agents for production fraud detection. &lt;/p&gt;

&lt;p&gt;The journey from our initial n8n prototype through an AI agent to our current multi-agent architecture reveals fundamental truths about building reliable AI systems at scale.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/819bPUF0of8"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Business Problem: When Manual Review Can't Keep Up&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let me paint the picture of what risk operations looked like before automation. When a new merchant signs up for Razorpay or when our fraud detection system flags an existing merchant, a case lands in our Risk Case Manager system. A human agent picks up that case and begins the investigation dance. &lt;/p&gt;

&lt;p&gt;This process takes four minutes when everything goes smoothly, but that's rarely the case. Websites are structured differently, policy pages are hidden in weird places, domain information services have different interfaces, and social media handles aren't always obvious. The worst part isn't the time; it's the inconsistency. One agent might flag a merchant for having a generic privacy policy while another agent considers the same policy acceptable. &lt;/p&gt;

&lt;p&gt;We were also paying thousands of dollars monthly for a third-party explicit content screening service, and it was generating about 50 alerts per month with less than 10% precision. Moreover, this service only caught one specific type of risk while ignoring dozens of other fraud indicators we cared about.&lt;/p&gt;

&lt;p&gt;The fundamental issue was that we had excellent observability tools, structured data systems, and experienced risk analysts, but the connective tissue between all these components was human labor. Scaling meant hiring more agents, which meant more inconsistency, higher cost, and no improvement in detection speed or accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Phase 1: The n8n Prototype - When Visual Orchestration Hits Its Limits&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We started with n8n, a visual workflow automation platform, to quickly prototype and validate our hypothesis. Within weeks, we had a working proof-of-concept integrating webhook ingestion, merchant metadata fetching, website content review via multimodal AI, domain lookups, GST enrichment, fraud metrics, and LLM-based risk analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqjfn2uho4pyhh2pn9we.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqjfn2uho4pyhh2pn9we.png" alt="bumblebee-n8n-workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The prototype validated that automation was feasible and helped us identify the complete set of data points needed. However, n8n quickly revealed fundamental limitations: &lt;strong&gt;branch explosion&lt;/strong&gt; (handling edge cases created unmaintainable 40-node workflows with duplicated logic), &lt;strong&gt;observability gaps&lt;/strong&gt; (debugging failed nodes was painful with coarse logs), and &lt;strong&gt;platform instability&lt;/strong&gt; (non-deterministic behavior in HTTP and merge operations). The n8n prototype taught us that production-grade risk automation would require a code-first approach with proper observability and the ability to use Python libraries directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Phase 2: Python + ReAct Agent - Better Control, New Bottlenecks&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We rebuilt as a Python web application with an API frontend and task workers. This immediately solved several Phase 1 problems: native Python libraries, structured logging with trace IDs, proper exception handling with retry logic, and complex NLP preprocessing capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g0sjs9puoe27ldedilz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g0sjs9puoe27ldedilz.png" alt="bumblebee-sequence-diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core was a single ReAct-style agent that iteratively reasoned about which tools to call, executed them, and incorporated results until producing a structured risk assessment. Phase 2 brought full observability, easy tool addition, and dynamic behavior that replaced brittle conditional logic.&lt;/p&gt;

&lt;p&gt;However, new bottlenecks emerged. &lt;strong&gt;Token bloat&lt;/strong&gt; became critical as the agent accumulated 50KB+ of HTML content, domain data, and fraud metrics in its context window, regularly hitting token limits. &lt;strong&gt;Sequential execution&lt;/strong&gt; meant tool invocations happened one after another even when they had no dependencies, scaling linearly with tool count. &lt;strong&gt;Temperature conflation&lt;/strong&gt; forced a compromise setting that was suboptimal for both exploration (tool selection) and exploitation (final scoring). Phase 2 proved agentic orchestration was right, but single-agent architecture couldn't scale to thousands of concurrent evaluations.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Phase 3: Multi-Agent Architecture - When Specialization Wins&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The breakthrough came when we stopped treating fraud detection as a single AI task and started building a &lt;strong&gt;multi-agent collaboration system&lt;/strong&gt;. Rather than one agent doing everything, we split responsibilities across specialized agents optimized for specific roles: Planner, Fetchers, and Analyzer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74bokg78giaw38ehj9ml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74bokg78giaw38ehj9ml.png" alt="bumblebee-multiagent-arch"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Planner Agent&lt;/strong&gt; receives the merchant case, examines available tools, checks system health and API quotas, and generates an execution plan. This isn't a rigid script; it's a structured specification of what information to gather, with priorities, timeouts, token budgets, and expected schemas. The Planner enforces business rules deterministically. Skip GST validation for non-Indian merchants. Deprioritize social media checks for B2B merchants where social presence matters less. This reduces unnecessary API calls and focuses resources on high-signal checks.&lt;/p&gt;
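&lt;p&gt;A minimal sketch of what such an execution plan could look like. The field names and rules here are illustrative, not Razorpay's actual schema; the point is that the plan is plain data and the business rules are ordinary deterministic code, applied before any fetcher runs.&lt;/p&gt;

```python
# Illustrative Planner-style execution plan (hypothetical field names).
# Business rules are enforced deterministically before any fetcher runs.

def build_plan(merchant):
    """Return a prioritized list of fetch tasks with timeouts and budgets."""
    tasks = [
        {"tool": "website_scraper", "priority": 1, "timeout_s": 20, "token_budget": 2000},
        {"tool": "whois_lookup",    "priority": 2, "timeout_s": 10, "token_budget": 500},
        {"tool": "gst_validation",  "priority": 2, "timeout_s": 10, "token_budget": 300},
        {"tool": "social_media",    "priority": 3, "timeout_s": 15, "token_budget": 800},
    ]
    # Rule: skip GST validation for non-Indian merchants.
    if merchant.get("country") != "IN":
        tasks = [t for t in tasks if t["tool"] != "gst_validation"]
    # Rule: deprioritize social media checks for B2B merchants.
    if merchant.get("segment") == "B2B":
        for t in tasks:
            if t["tool"] == "social_media":
                t["priority"] = 9
    return sorted(tasks, key=lambda t: t["priority"])

plan = build_plan({"country": "US", "segment": "B2B"})
```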

&lt;p&gt;&lt;strong&gt;Data Fetcher Agents&lt;/strong&gt; execute in parallel, each owning one data source or tool. Website scraping, WHOIS lookups, fraud database queries, social media metrics, pricing comparisons, policy verification. Here's the critical insight: fetchers don't just retrieve raw data. They perform &lt;strong&gt;local data pruning&lt;/strong&gt; before returning results.&lt;/p&gt;

&lt;p&gt;The website content reviewer doesn't send back 50KB of HTML. It extracts only relevant sections: privacy policies, contact information, pricing tables, product descriptions. Using keyword matching or lightweight NLP models, it returns a compact JSON payload with structured snippets, confidence scores, and provenance links. This solves the token bloat problem. Instead of accumulating full raw outputs, the system maintains small, information-dense summaries.&lt;/p&gt;
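&lt;p&gt;A hedged sketch of that pruning step, using keyword matching only (the real fetchers may also use lightweight NLP models). The patterns and confidence value are placeholders; what matters is the shape of the output: compact snippets with confidence and provenance instead of raw HTML.&lt;/p&gt;

```python
# Fetcher-side pruning sketch: keyword matching over page text, returning
# a compact payload instead of raw HTML. Patterns and scores are placeholders.
import re

RELEVANT = {
    "privacy": r"privacy policy",
    "contact": r"contact (us|info)",
    "pricing": r"(price|cost)",
}

def prune_page(text, source_url):
    sections = {}
    for label, pattern in RELEVANT.items():
        for para in text.split("\n\n"):
            if re.search(pattern, para, re.IGNORECASE):
                sections[label] = {
                    "snippet": para[:300],   # cap snippet size
                    "confidence": 0.6,       # placeholder heuristic score
                    "provenance": source_url,
                }
                break
    return {"source": source_url, "sections": sections}
```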

&lt;p&gt;Each fetcher compresses its domain's data into a format optimized for downstream analysis. Fetchers also implement caching for data that doesn't change frequently. WHOIS information and domain reputation scores get cached with appropriate TTLs, reducing redundant external API calls and improving throughput during traffic spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Analyzer Agent&lt;/strong&gt; consumes these structured payloads and produces the final risk assessment. It runs &lt;strong&gt;deterministic rules first&lt;/strong&gt;: hard thresholds for fraud metrics, blacklist checks, compliance violations. These rules are fast, explainable, and don't require LLM inference.&lt;/p&gt;

&lt;p&gt;Only after deterministic rules does the Analyzer invoke the LLM for interpretive tasks: generating human-readable summaries, explaining why certain indicators triggered, identifying nuanced patterns that don't fit simple rules. Because fetchers already pruned and structured the data, the Analyzer's LLM calls work with minimal context, avoiding token limit issues entirely.&lt;/p&gt;
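&lt;p&gt;The rules-first ordering can be sketched like this. The threshold, field names, and &lt;code&gt;llm_summarize&lt;/code&gt; stub are all hypothetical; the structure is what the text describes: cheap deterministic checks run and produce findings before any LLM inference happens.&lt;/p&gt;

```python
# Sketch of rules-first analysis: deterministic checks before any LLM call.
# Thresholds and `llm_summarize` are hypothetical stand-ins.

def llm_summarize(payloads, findings):
    # Stand-in for the interpretive LLM call over pruned, structured context.
    return f"{len(findings)} deterministic finding(s); see structured payloads."

def analyze(payloads, fraud_metrics, blacklist):
    findings = []
    # Deterministic rules first: fast, explainable, no inference cost.
    if fraud_metrics.get("chargeback_rate", 0) > 0.02:
        findings.append("chargeback_rate above hard threshold")
    if payloads.get("domain") in blacklist:
        findings.append("domain is blacklisted")
    # Only then invoke the LLM for interpretive work.
    summary = llm_summarize(payloads, findings)
    return {"findings": findings, "summary": summary}
```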

&lt;p&gt;Different agents use different temperature settings tuned for their roles. The Planner runs at medium temperature for flexible tool selection. The Analyzer uses very low temperature for deterministic risk scoring and higher temperature when generating business narratives where creative expression improves readability. This per-agent temperature control eliminates the compromises from Phase 2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The execution model&lt;/strong&gt; leverages Celery for orchestration. When a case arrives, the API enqueues a planning job. The Planner generates the execution plan and enqueues multiple fetcher jobs in parallel. As fetchers complete, their results stream into a shared state store. The Analyzer subscribes to fetcher completion events and begins processing as soon as enough data is available, not waiting for every fetcher if some are slow or failing.&lt;/p&gt;

&lt;p&gt;If a fetcher fails entirely (website unreachable, API rate-limiting), the Planner degrades gracefully. The Analyzer proceeds with available data and flags the missing information for manual review rather than blocking the entire evaluation. This resilience was impossible in Phase 2's sequential architecture.&lt;/p&gt;
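&lt;p&gt;The production system uses Celery, but the fan-out/fan-in pattern itself is easy to show with the standard library. This stdlib sketch (names are illustrative) captures the two behaviors described above: fetchers run in parallel, and a failing fetcher is flagged for manual review instead of blocking the case.&lt;/p&gt;

```python
# Stdlib sketch of the Celery-based fan-out/fan-in: fetchers run in
# parallel, failures degrade gracefully rather than blocking the case.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_case(fetchers, merchant_id):
    results, missing = {}, []
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(fn, merchant_id): name for name, fn in fetchers.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception:
                missing.append(name)  # flag for manual review, keep going
    return {"results": results, "missing_for_manual_review": missing}
```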

&lt;h2&gt;
  
  
  &lt;strong&gt;The Results: When Architecture Meets Reality&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The shift to multi-agent architecture produced measurable improvements across every dimension. &lt;strong&gt;Token usage dropped 60%&lt;/strong&gt; through fetcher-level pruning and elimination of full raw data in LLM context. &lt;strong&gt;End-to-end latency fell from 35 seconds to 8-12 seconds&lt;/strong&gt; via parallel fetcher execution and focused LLM calls. &lt;strong&gt;Success rate rose from 88% to 99%+&lt;/strong&gt;, measured as cases completing without token limits or LLM failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost per evaluation decreased&lt;/strong&gt; despite adding sophisticated analysis. Smaller context windows meant cheaper LLM calls. Caching at the fetcher level reduced external API expenses. The system now handles thousands of concurrent evaluations without bottlenecking, scaling horizontally by adding task workers rather than vertically with bigger servers.&lt;/p&gt;

&lt;p&gt;The most important improvement is &lt;strong&gt;maintainability and extensibility&lt;/strong&gt;. Adding a new risk signal requires writing a new fetcher agent with its pruning logic and output schema. The Planner automatically incorporates new tools once registered. The Analyzer adapts to new data sources without modification. This composability enables continuous fraud detection improvement by adding signals incrementally rather than requiring architectural rewrites.&lt;/p&gt;

&lt;p&gt;The multi-agent approach provides &lt;strong&gt;observability impossible in earlier phases&lt;/strong&gt;. Each agent logs trace IDs, tokens consumed, latency, confidence scores, and reasoning. When a case produces unexpected results, we replay the exact sequence of fetcher outputs, examine what the Analyzer saw, and understand why it reached that conclusion. This audit trail is critical for debugging, regulatory compliance, and explaining decisions to merchants who dispute risk assessments.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What We Learned: Principles for Building Production AI Agents&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Our journey from n8n through ReAct to multi-agent orchestration taught us several lessons that apply broadly to anyone building AI systems for production use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start simple, evolve deliberately&lt;/strong&gt;. n8n was the right choice for Phase 1 even though we knew it wouldn't scale. Rapid prototyping and stakeholder validation matter more than architectural purity in early stages. What's critical is recognizing when you've outgrown your current approach and having the discipline to rebuild rather than patch over fundamental limitations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token budgets are real constraints&lt;/strong&gt;. Many blog posts about AI agents gloss over token management, but in production systems with large, messy real-world data, token limits are where architectures break. Design explicitly for token efficiency: prune early, prune often, and never pass raw, unstructured data to LLMs when you can send structured summaries instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specialization beats generalization at scale&lt;/strong&gt;. A single agent trying to handle planning, data fetching, and analysis will hit walls that can't be solved with better prompts or bigger models. Splitting responsibilities across specialized agents with clear interfaces between them produces systems that are faster, more reliable, and easier to understand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temperature is not a hyperparameter you tune once&lt;/strong&gt;. Different tasks need different temperature settings, and trying to find a compromise temperature for a single agent produces mediocre results everywhere. Per-agent temperature control is a fundamental architectural requirement, not an optimization detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallelism matters more than model size&lt;/strong&gt;. Running multiple smaller, focused agents in parallel often outperforms running one large agent sequentially, both in terms of latency and cost. This runs counter to the instinct to throw the biggest model available at every problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability is not optional&lt;/strong&gt;. Without structured logging, trace IDs, and the ability to replay decision sequences, debugging production AI systems is nearly impossible. Invest in observability infrastructure early, ideally before you have a production incident that requires it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Path Forward: Continuous Improvement by Design&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What makes the multi-agent architecture particularly powerful is that it's designed for continuous improvement. As we accumulate more cases, we can identify patterns where the Analyzer produces low-confidence results or where human reviewers frequently override AI decisions. These cases become training data for improving fetcher pruning heuristics, refining Planner rules, and tuning Analyzer prompts.&lt;/p&gt;

&lt;p&gt;We're exploring several extensions. Fine-tuning small, specialized models for specific fetchers rather than relying entirely on general-purpose LLMs. This could further reduce cost and latency while improving accuracy for domain-specific tasks like policy compliance checking. Implementing feedback loops where human overrides automatically update Planner rules or Analyzer thresholds, creating a self-improving system that gets smarter as risk operators correct its mistakes.&lt;/p&gt;

&lt;p&gt;Another direction is adding &lt;strong&gt;predictive agents&lt;/strong&gt; that don't just evaluate merchant risk at onboarding but continuously monitor for behavioral changes that might indicate fraud. Imagine fetchers running periodically in the background, detecting when a merchant's website content changes significantly, when pricing diverges from competitors, or when social media presence suddenly evaporates. The same multi-agent architecture that handles point-in-time evaluation can drive continuous risk monitoring with minimal modification.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters Beyond Fraud Detection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I've been talking about merchant risk evaluation specifically, but the architectural patterns we discovered apply broadly to any domain where AI agents need to process large amounts of heterogeneous data, make complex decisions, and produce explainable results. Financial services, healthcare, supply chain management, cybersecurity, and legal research all have similar characteristics: multiple data sources with different formats and latencies, domain expertise encoded in rules and models, and requirements for auditability and compliance.&lt;/p&gt;

&lt;p&gt;The lesson isn't "use multi-agent architecture for everything." The lesson is that as AI systems scale from demos to production, the architecture that got you to the first prototype often becomes the main thing preventing you from scaling further. Having the discipline to recognize when you've hit architectural limits, the willingness to rebuild from first principles, and the engineering rigor to measure improvements objectively separates successful production AI from expensive science projects.&lt;/p&gt;

&lt;p&gt;At Razorpay, we've taken fraud detection from a manual, inconsistent process consuming 800 agent hours monthly to an automated system that evaluates merchants in seconds with higher accuracy and comprehensive audit trails. We've reduced our per-review time by 75%, improved detection consistency, and freed up risk operators to focus on genuinely complex cases that require human judgment. And we've done it with an architecture that gets better over time rather than more fragile.&lt;/p&gt;

&lt;p&gt;If you're building AI agents for production use cases, the technology is ready. The LLMs are capable, the orchestration frameworks exist, and the integration tools work. The hard part is designing systems that handle real-world messiness, scale with your business, and maintain reliability when things inevitably break. That's where architecture matters, and that's what we learned the hard way through three iterations of building Agentic Risk.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;editor: &lt;a class="mentioned-user" href="https://dev.to/paaarth96"&gt;@paaarth96&lt;/a&gt;&lt;/em&gt; &lt;/p&gt;

</description>
      <category>frauddetection</category>
      <category>agenticai</category>
      <category>razorpay</category>
      <category>merchants</category>
    </item>
    <item>
      <title>Delta Lake Health Analyzer</title>
      <dc:creator>Nitesh Jain</dc:creator>
      <pubDate>Wed, 26 Nov 2025 11:14:37 +0000</pubDate>
      <link>https://forem.com/razorpaytech/delta-lake-health-analyzer-fg</link>
      <guid>https://forem.com/razorpaytech/delta-lake-health-analyzer-fg</guid>
      <description>&lt;p&gt;If you're running Delta Lake at any meaningful scale, you've probably experienced this. Queries that used to complete in seconds now take minutes. Your cloud storage bill shows mysterious costs. When you finally dig into the file structure, you discover tens of thousands of tiny files causing chaos.&lt;/p&gt;

&lt;p&gt;The problem with Delta Lake table health is that it's invisible until it becomes a crisis. Small files accumulate gradually, partitions develop skew, storage bloats with old data. By the time query performance degrades noticeably, fixing it requires expensive OPTIMIZE operations you can't justify without understanding the scope.&lt;/p&gt;

&lt;p&gt;We needed visibility into our Delta Lake health, but existing solutions didn't fit. Commercial tools are platform-locked, and open-source alternatives require complex setup. We wanted something instant: point it at a table, get actionable insights in seconds.&lt;/p&gt;

&lt;p&gt;That's why we built the &lt;strong&gt;Delta Lake Health Analyzer&lt;/strong&gt;, a completely browser-based diagnostic tool using DuckDB WASM. Everything runs in your browser, and the data never leaves your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fundamental Problem: Delta Lake Degradation Patterns
&lt;/h2&gt;

&lt;p&gt;Delta Lake provides ACID transactions on top of object storage, which is powerful but introduces operational complexity. The core issues fall into three categories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small file proliferation&lt;/strong&gt; is the most common problem. Streaming pipelines writing every few seconds generate thousands of files daily. Query engines need to open each file separately, causing thousands of S3 API calls. Cloud storage providers charge per request, not just volume. Reading 10,000 files of 1MB each costs significantly more than reading 10 files of 1GB each.&lt;/p&gt;
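&lt;p&gt;Back-of-the-envelope arithmetic makes the request-cost gap concrete. Assuming roughly 0.0004 USD per 1,000 S3 GET requests (the same ballpark figure used elsewhere in this post) and one full table scan per day:&lt;/p&gt;

```python
# Monthly S3 GET cost for scanning the same ~10GB of data stored as
# 10,000 x 1MB files vs 10 x 1GB files, at ~0.0004 USD per 1,000 requests.
cost_per_1000 = 0.0004  # assumed rough average across regions

def monthly_get_cost(file_count, scans_per_day=1, days=30):
    requests = file_count * scans_per_day * days
    return requests / 1000 * cost_per_1000

small_files = monthly_get_cost(10_000)  # fragmented layout
large_files = monthly_get_cost(10)      # compacted layout
# Same data volume, ~1000x the request cost, before any latency penalty.
```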

&lt;p&gt;&lt;strong&gt;Partition skew&lt;/strong&gt; develops as data patterns change. Initially balanced transaction volumes become dominated by a few large merchants, creating massive partitions while others remain small. Query engines can't parallelize effectively when partitions have wildly different sizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage inefficiency&lt;/strong&gt; stems from uncompacted data. Delta Lake maintains historical versions for time travel, accumulating storage costs. Without regular VACUUM, you're paying to store data that will never be queried. Tables with frequent updates develop tombstone files marking deleted rows without removing them from storage.&lt;/p&gt;

&lt;p&gt;These problems accumulate gradually. By the time query performance degrades noticeably, the file structure is badly fragmented.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works: From Checkpoint to Actionable Insights
&lt;/h2&gt;

&lt;p&gt;Let's walk through what actually happens when you analyze a table. Understanding the technical flow reveals why this approach is both practical and powerful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9tfntxbokkvlpxforej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9tfntxbokkvlpxforej.png" alt="Sequence diagram for table analysis" width="800" height="1175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process begins when you provide a Delta table path. The browser reads the &lt;code&gt;_last_checkpoint&lt;/code&gt; file from the &lt;code&gt;_delta_log&lt;/code&gt; directory to determine the most recent checkpoint. This small JSON file tells us which checkpoint Parquet file contains the latest table state. We then fetch that checkpoint file from S3 using a pre-signed URL with the user's AWS credentials.&lt;/p&gt;

&lt;p&gt;This checkpoint file is the key to everything. It's a Parquet file containing metadata for every active file in the Delta table: file paths, sizes, partition values, modification times, row counts, and statistics. For a table with 50,000 files, this checkpoint might be 20-30MB, which loads quickly even on modest internet connections. Once loaded into browser memory, DuckDB WASM makes this data queryable via SQL.&lt;/p&gt;
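&lt;p&gt;Resolving the checkpoint path from &lt;code&gt;_last_checkpoint&lt;/code&gt; is straightforward, since Delta's transaction log names checkpoint files by zero-padding the version to 20 digits. A minimal sketch (ignoring multi-part checkpoints, which the format's optional &lt;code&gt;parts&lt;/code&gt; field describes):&lt;/p&gt;

```python
# Resolve the latest checkpoint Parquet path from the _last_checkpoint JSON.
# Single-part checkpoints only; the "parts" field is ignored in this sketch.
import json

def checkpoint_path(last_checkpoint_json):
    meta = json.loads(last_checkpoint_json)
    version = meta["version"]
    # Delta log files are named with the version zero-padded to 20 digits.
    return f"_delta_log/{version:020d}.checkpoint.parquet"

path = checkpoint_path('{"version": 102, "size": 51234}')
```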

&lt;p&gt;The &lt;strong&gt;file-level analysis&lt;/strong&gt; examines the distribution of file sizes. We run queries like "how many files are under 128MB?" and "what's the total size of files under 10MB?" Small files are the primary indicator of optimization opportunities because they directly impact query performance and cloud costs. We also calculate the coefficient of variation (CV) for file sizes to understand how uniform the file distribution is. A high CV means file sizes vary wildly, suggesting inconsistent ingestion patterns or lack of compaction.&lt;/p&gt;
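&lt;p&gt;The coefficient of variation itself is just the standard deviation divided by the mean, computable directly from the file sizes in the checkpoint (sample values below are made up for illustration):&lt;/p&gt;

```python
# Coefficient of variation (stdev / mean) over file sizes, as used in the
# file-level analysis. Sample size lists are illustrative.
import statistics

def coefficient_of_variation(sizes_mb):
    return statistics.pstdev(sizes_mb) / statistics.mean(sizes_mb)

uniform = coefficient_of_variation([120, 125, 130, 135])  # well-compacted
skewed = coefficient_of_variation([1, 1, 2, 3, 500])      # one giant file
# `skewed` is far above `uniform`, flagging inconsistent ingestion
# patterns or a lack of compaction.
```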

&lt;p&gt;The &lt;strong&gt;partition-level analysis&lt;/strong&gt; looks at how data is distributed across partitions. We count total partitions, calculate files per partition, and compute the coefficient of variation of partition sizes. High partition skew (high CV) means some partitions are massive while others are tiny, which hurts query parallelism. We identify the largest and smallest partitions by row count and size, helping users understand where imbalances exist.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;health scoring algorithm&lt;/strong&gt; combines these metrics into a single 0-100 score. Here's the actual scoring logic we use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def calculate_health_score(metrics):
    score = 100

    # Small files penalty (up to -40 points)
    small_file_ratio = metrics['small_files_count'] / metrics['total_files']
    if small_file_ratio &amp;gt; 0.5:
        score -= 40
    elif small_file_ratio &amp;gt; 0.3:
        score -= 25
    elif small_file_ratio &amp;gt; 0.1:
        score -= 10

    # Partition skew penalty (up to -30 points)
    if metrics['partition_cv'] &amp;gt; 2.0:
        score -= 30
    elif metrics['partition_cv'] &amp;gt; 1.5:
        score -= 20
    elif metrics['partition_cv'] &amp;gt; 1.0:
        score -= 10

    # Average file size penalty (up to -20 points)
    avg_file_size_mb = metrics['avg_file_size'] / (1024 * 1024)
    if avg_file_size_mb &amp;lt; 64:
        score -= 20
    elif avg_file_size_mb &amp;lt; 128:
        score -= 10

    # Partition count penalty (up to -10 points)
    if metrics['partition_count'] &amp;gt; 10000:
        score -= 10
    elif metrics['partition_count'] &amp;gt; 5000:
        score -= 5

    return max(0, score)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This scoring approach is opinionated but based on observed patterns across hundreds of tables. The small file ratio is weighted most heavily because it has the biggest impact on query performance. Partition skew matters for parallelism. Average file size provides a sanity check on overall table structure. Partition count flags tables that might have excessive partitioning granularity.&lt;/p&gt;

&lt;p&gt;The beauty of this browser-based architecture is that once the checkpoint is loaded, all these analyses execute instantly. Users can explore different aspects of table health without waiting for backend processing. Want to see which specific partitions have the most files? Run a query. Curious about file size distribution over time? We can infer that from modification timestamps. Wondering if certain columns have high null rates that suggest pruning opportunities? Column statistics from the checkpoint reveal that immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Tool Actually Does: Key Features
&lt;/h2&gt;

&lt;p&gt;Let's talk about the capabilities that make this tool useful in day-to-day operations. These aren't just interesting statistics; they're actionable insights that drive real optimization decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health scoring and visualization&lt;/strong&gt; provides the at-a-glance assessment. When you load a table, the first thing you see is the health score (0-100) with color coding: green for healthy (80+), yellow for attention needed (50-79), red for critical (below 50). Below the score, we break down the contributing factors: small file percentage, partition skew coefficient, average file size, and partition count. This breakdown helps you understand which specific issue is dragging down the score.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nc7i49th3duoqqkxf2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nc7i49th3duoqqkxf2t.png" alt="Dashboard View for Delta Lake Health Analyzer" width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's how a Health Score Breakdown works: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfew5epf437jo7dqacqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfew5epf437jo7dqacqy.png" alt="Health Score Breakdown" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File analysis&lt;/strong&gt; digs into the details. We show file count distribution across size buckets (under 10MB, 10-64MB, 64-128MB, 128MB+) so you can see exactly where files cluster. A histogram visualizes this distribution, making patterns obvious. If you see a massive spike of files under 10MB, that's your smoking gun for why queries are slow. The tool also lists the largest and smallest files by path, which helps identify specific ingestion jobs or time periods that created problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition analysis&lt;/strong&gt; reveals imbalances. We display partition count, files per partition (average, min, max), size per partition (average, min, max), and the coefficient of variation for partition sizes. High CV means significant skew. We also rank partitions by size and file count, showing the top 10 largest and most fragmented partitions. This targeting is valuable; you often don't need to optimize the entire table, just the handful of partitions causing the real problems.&lt;/p&gt;
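&lt;p&gt;The skew metric is the standard coefficient of variation (standard deviation divided by mean) over partition sizes, and the "top fragmented partitions" view is a simple ranking. A sketch of both:&lt;/p&gt;

```python
# Sketch of the partition-skew metric and fragmentation ranking.
from statistics import mean, pstdev

def partition_skew_cv(partition_sizes):
    """Coefficient of variation of partition sizes; higher means more skew."""
    avg = mean(partition_sizes)
    return pstdev(partition_sizes) / avg if avg else 0.0

def most_fragmented(partition_file_counts, top_n=10):
    """Rank partitions by file count, descending, like the dashboard's top-10 view."""
    return sorted(partition_file_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

&lt;p&gt;Perfectly balanced partitions give a CV of 0; a CV above ~1 usually means a few partitions dominate and are worth compacting first.&lt;/p&gt;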

&lt;p&gt;&lt;strong&gt;Column-level insights&lt;/strong&gt; come from Delta's built-in statistics. When Delta writes files, it collects min/max/null count statistics for each column. We surface these at the table level: which columns have the most nulls, which have the widest ranges, which might benefit from ZORDER optimization. ZORDER co-locates similar values in the same files, dramatically improving query performance when you're filtering on high-cardinality columns. The tool identifies candidate columns by looking at their cardinality and filter frequency patterns.&lt;/p&gt;
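&lt;p&gt;Aggregating the per-file statistics to table level is mostly bookkeeping. The dictionary shape below is an illustrative stand-in for the &lt;code&gt;numRecords&lt;/code&gt;/&lt;code&gt;nullCount&lt;/code&gt; fields Delta writes per file; the rollup logic is the interesting part:&lt;/p&gt;

```python
# Sketch of rolling per-file column stats up to table-level null rates.
# The input shape mirrors Delta's per-file stats (numRecords, nullCount)
# but is simplified for illustration.

def column_null_rates(file_stats):
    """Aggregate per-file null counts into per-column null rates for the table."""
    total_rows = sum(s["numRecords"] for s in file_stats)
    nulls = {}
    for s in file_stats:
        for col, n in s["nullCount"].items():
            nulls[col] = nulls.get(col, 0) + n
    return {col: n / total_rows for col, n in nulls.items()} if total_rows else {}
```

&lt;p&gt;Columns with unexpectedly high null rates surface data quality problems; columns with high cardinality and frequent filter use become ZORDER candidates.&lt;/p&gt;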

&lt;p&gt;&lt;strong&gt;Cost estimation&lt;/strong&gt; translates metrics into dollars. This was the feature that got the most enthusiastic feedback because it provides business justification for running optimization commands. We calculate estimated costs based on two factors: S3 API request pricing and query compute costs.&lt;/p&gt;

&lt;p&gt;For S3 costs, the calculation is straightforward:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def estimate_s3_cost_savings(current_files, optimal_files):
    # S3 GET request pricing (rough average across regions)
    cost_per_1000_requests = 0.0004  # USD

    current_monthly_scans = current_files * 30  # assuming daily queries
    optimal_monthly_scans = optimal_files * 30

    current_cost = (current_monthly_scans / 1000) * cost_per_1000_requests
    optimal_cost = (optimal_monthly_scans / 1000) * cost_per_1000_requests

    savings = current_cost - optimal_cost
    return savings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For &lt;strong&gt;query compute costs&lt;/strong&gt;, we estimate based on scan time reduction. Fewer files mean fewer seeks, less metadata processing, and faster query completion. The relationship isn't perfectly linear, but empirical testing shows that reducing file count by 10x typically improves query time by 3-5x for scan-heavy workloads. We use conservative estimates to avoid overpromising.&lt;/p&gt;
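&lt;p&gt;One way to encode that conservative mapping: translate the file-count reduction into a speedup on a log scale, anchored at the low end of the observed 3-5x range for a 10x reduction. The rate and query-hour inputs below are hypothetical; this is a sketch of the estimation approach, not the tool's exact formula.&lt;/p&gt;

```python
# Sketch of the compute-savings estimate: 10x fewer files -> ~3x faster
# (the conservative end of the observed 3-5x range), scaled logarithmically.
import math

def estimate_compute_savings(current_files, optimal_files,
                             monthly_query_hours, hourly_rate):
    """Estimate monthly compute dollars saved by compacting files."""
    if optimal_files <= 0 or current_files <= optimal_files:
        return 0.0
    reduction = current_files / optimal_files
    # 10x reduction -> 3x speedup, 100x -> 5x... capped to stay conservative
    speedup = min(1 + 2 * math.log10(reduction), 9.0)
    optimized_hours = monthly_query_hours / speedup
    return (monthly_query_hours - optimized_hours) * hourly_rate
```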

&lt;p&gt;When users see "estimated monthly savings: $X from S3 optimization, $Y from faster queries," it changes the conversation. Suddenly running OPTIMIZE isn't just an operational task; it's a cost reduction initiative with measurable ROI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pruning recommendations&lt;/strong&gt; identify opportunities to clean up old data. Delta Lake's time travel is powerful, but maintaining 90 days of history for a table that's only queried for the last 7 days is wasteful. The tool analyzes file modification timestamps and data freshness patterns to recommend appropriate VACUUM retention periods. We also flag tables with excessive deletion tombstones that need compaction to reclaim space.&lt;/p&gt;
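&lt;p&gt;The retention recommendation reduces to comparing how far back queries actually read against how much history the table retains. The function below is a simplified sketch of that comparison; the access-pattern input and safety factor are assumptions, not the tool's exact heuristics:&lt;/p&gt;

```python
# Sketch of the VACUUM retention recommendation: keep enough history to cover
# actual access patterns plus a safety margin, and flag the rest as prunable.

def recommend_vacuum_retention_hours(oldest_queried_days, retained_history_days,
                                     safety_factor=2):
    """Suggest a VACUUM retention (hours), or None if current retention is justified."""
    needed_days = max(1, oldest_queried_days) * safety_factor
    if needed_days >= retained_history_days:
        return None  # queries already use most of the retained history
    return needed_days * 24
```

&lt;p&gt;For the example in the text (90 days retained, last 7 days queried), this suggests a 14-day retention, reclaiming roughly ten weeks of dead history.&lt;/p&gt;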

&lt;h2&gt;
  
  
  What We Learned Building This
&lt;/h2&gt;

&lt;p&gt;Building a browser-based data analysis tool taught us several lessons that weren't obvious from the outset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DuckDB WASM is genuinely production-ready.&lt;/strong&gt; We were skeptical about running a full SQL engine in the browser, but DuckDB WASM parsed our largest checkpoint files (30MB+, 50,000+ rows) without issues. Complex aggregations execute in milliseconds, and the SQL interface proved complete enough for all our analysis needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser memory limits matter less than expected.&lt;/strong&gt; Modern browsers handle datasets in the hundreds of megabytes without problems. We implemented guardrails for extremely large checkpoints, but these edge cases are rare. Most Delta Lake tables have manageable checkpoint sizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost estimates drive action more than performance metrics.&lt;/strong&gt; We thought query performance insights would motivate optimization. We were wrong. Showing "you're wasting $X per month on excessive S3 requests" provided concrete justification. Finance teams control prioritization, and they care about costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Column statistics are underutilized.&lt;/strong&gt; Surfacing Delta Lake's min/max/null count statistics revealed patterns people didn't know existed. High null rates flagged data quality issues. Unexpected ranges revealed incorrect data types. The column analysis section became unexpectedly popular for data quality monitoring beyond just optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Approach Can't Do (And Why That's Acceptable)
&lt;/h2&gt;

&lt;p&gt;Browser-based analysis isn't a silver bullet. Massive tables with hundreds of thousands of partitions exceed browser capabilities. Real-time monitoring with automated alerts requires backend infrastructure. Historical trending is manual since we don't maintain server-side metrics. For very large tables, we sample files rather than analyzing all of them, introducing statistical uncertainty.&lt;/p&gt;

&lt;p&gt;These limitations are real, but they don't invalidate the approach. For the vast majority of Delta Lake tables at typical organizations, browser-based analysis works excellently. The 5% of edge cases that exceed browser capabilities can use alternative tools. Optimizing the common case while providing escape hatches for edge cases is good engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Directions: From Diagnostic to Predictive
&lt;/h2&gt;

&lt;p&gt;The Delta Lake Health Analyzer has proven valuable as a diagnostic tool, but we're seeing patterns that suggest predictive possibilities. &lt;/p&gt;

&lt;p&gt;Real-time streaming pipelines predictably create small file problems within 48-72 hours. Batch loads develop skew after 30-60 days when transaction volumes shift. These patterns are consistent enough to enable proactive maintenance.&lt;/p&gt;

&lt;p&gt;Imagine automatic warnings: "Table X will hit critical small file threshold in 3 days" or "Partition skew will impact performance next week unless compaction runs today." &lt;/p&gt;
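&lt;p&gt;The simplest version of such a warning is a linear extrapolation of small-file growth toward the critical threshold. This is a hypothetical sketch of the idea, not a shipped feature; the growth-rate input would come from repeated analyzer runs:&lt;/p&gt;

```python
# Hypothetical sketch of the predictive warning: extrapolate small-file
# growth linearly to estimate days until the critical threshold.

def days_until_threshold(current_small_pct, daily_growth_pct, critical_pct=50.0):
    """Days until small-file percentage crosses critical_pct; None if not growing."""
    if current_small_pct >= critical_pct:
        return 0.0  # already critical
    if daily_growth_pct <= 0:
        return None  # stable or shrinking; no warning needed
    return (critical_pct - current_small_pct) / daily_growth_pct
```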

&lt;p&gt;We're also exploring automated optimization recommendations beyond "run OPTIMIZE," integration with workflow orchestration platforms like Airflow, and data-driven ZORDER recommendations based on actual query patterns from warehouse logs.&lt;/p&gt;




</description>
      <category>razorpay</category>
      <category>dev</category>
      <category>duckdb</category>
      <category>deltalake</category>
    </item>
  </channel>
</rss>
