<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mayank Gupta</title>
    <description>The latest articles on Forem by Mayank Gupta (@mayankcse).</description>
    <link>https://forem.com/mayankcse</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F749855%2Fcad306d7-ed2f-40dc-84d8-7076ed4611ee.png</url>
      <title>Forem: Mayank Gupta</title>
      <link>https://forem.com/mayankcse</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mayankcse"/>
    <language>en</language>
    <item>
      <title>Efficiency at Scale: Scaling, Scheduling, and Measuring Databricks SQL</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:07:32 +0000</pubDate>
      <link>https://forem.com/mayankcse/efficiency-at-scale-scaling-scheduling-and-measuring-databricks-sql-2cjg</link>
      <guid>https://forem.com/mayankcse/efficiency-at-scale-scaling-scheduling-and-measuring-databricks-sql-2cjg</guid>
      <description>&lt;p&gt;In our final look at Databricks SQL, we move beyond individual table tweaks to the broader architecture. Optimization isn't just about making one query fast; it’s about building a sustainable, cost-efficient system. This means picking the right warehouse size, automating recurring workloads, and—most importantly—proving your impact with hard data.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Right-Sizing Your Warehouse
&lt;/h2&gt;

&lt;p&gt;A common trap is assuming a larger warehouse is always better. Doubling a warehouse size (e.g., from Small to Medium) can cut query time roughly in half, but it also doubles your hourly DBU (Databricks Unit) rate; if the query doesn't speed up proportionally, you end up paying more for the same work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sizing Strategies:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2X-Small to X-Small:&lt;/strong&gt; Best for light exploratory queries and cost-sensitive, low-concurrency tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small to Medium:&lt;/strong&gt; The "sweet spot" for interactive dashboards and general ad-hoc analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large and Beyond:&lt;/strong&gt; Reserved for heavy ETL (Extract, Transform, Load) jobs, massive aggregations, and high-concurrency production environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost Control Checklist:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-Stop:&lt;/strong&gt; Set this to a low threshold (e.g., 1–10 minutes) to prevent paying for idle compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless:&lt;/strong&gt; Use serverless warehouses to all but eliminate "cold starts." They spin up in 2–6 seconds, which lets you be far more aggressive with auto-stop settings.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Scheduling and Automation Patterns
&lt;/h2&gt;

&lt;p&gt;You shouldn't be running production workloads manually from the SQL editor. Databricks provides three ways to move from "manual" to "managed."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Modern Patterns:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Scheduled Queries:&lt;/strong&gt; Great for daily reports or cleanup tasks. Always save your query first, then use the &lt;strong&gt;Schedule&lt;/strong&gt; button to define the cadence.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Materialized Views (MVs):&lt;/strong&gt; These pre-compute expensive aggregations. Instead of re-scanning raw data every time, users query the MV and get instant results.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Streaming Tables:&lt;/strong&gt; These ingest data continuously, ensuring your dashboards are always fresh without the "spiky" load of scheduled batch jobs.&lt;/li&gt;
&lt;/ol&gt;
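
&lt;p&gt;The second and third patterns are plain DDL in Databricks SQL. A minimal sketch (schema, table, and path names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Pre-compute an expensive aggregation as a Materialized View
CREATE MATERIALIZED VIEW gold.daily_sales_mv AS
SELECT sale_date, region, SUM(total_sales) AS revenue
FROM silver.sales_data
GROUP BY sale_date, region;

-- Continuously ingest newly arriving files as a Streaming Table
CREATE STREAMING TABLE bronze.sales_raw AS
SELECT * FROM STREAM read_files('/Volumes/main/landing/sales/', format =&amp;gt; 'json');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;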




&lt;h2&gt;
  
  
  3. The Power of Parameterization
&lt;/h2&gt;

&lt;p&gt;Stop hard-coding your &lt;code&gt;WHERE&lt;/code&gt; clauses! Using parameters (e.g., &lt;code&gt;:start_date&lt;/code&gt;) makes your SQL more secure and much more efficient.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Prevents SQL injection by separating the query logic from the input data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Efficiency:&lt;/strong&gt; Databricks can reuse the same execution plan because the "text" of the query remains identical even when the parameter values change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusability:&lt;/strong&gt; A single query can power multiple dashboard widgets by simply changing the input values.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Using parameters for a reusable, cache-friendly query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_data&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;status_filter&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Measuring Success: The Optimization Feedback Loop
&lt;/h2&gt;

&lt;p&gt;Optimization is meaningless if you can't prove it. You need to establish baselines and track four key metrics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why it Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P95 Duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detects "outlier" queries that are frustrating your users.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBU Consumption&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The "Bottom Line"—tracks the literal cost of your SQL workloads.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bytes Scanned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Validates that your pruning and Z-ORDERing are actually working.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cache Hit Ratio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Measures how often you are getting "free" results from the result cache.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
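
&lt;p&gt;The first three metrics can be pulled straight from the system tables. A sketch of a baseline query (the 7-day window is arbitrary; adjust it to your reporting cadence):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Baseline: P95 duration and total scan volume over the last 7 days
SELECT
    approx_percentile(total_duration_ms, 0.95) AS p95_duration_ms,
    SUM(read_bytes) / (1024 * 1024 * 1024) AS gb_scanned,
    COUNT(*) AS query_count
FROM system.query.history
WHERE start_time &amp;gt;= current_date() - INTERVAL 7 DAYS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;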

&lt;h3&gt;
  
  
  The "Before &amp;amp; After" Audit
&lt;/h3&gt;

&lt;p&gt;To prove your value, query the &lt;code&gt;system.query.history&lt;/code&gt; and &lt;code&gt;system.billing.usage&lt;/code&gt; tables. Compare a 24-hour window &lt;em&gt;before&lt;/em&gt; you applied Liquid Clustering vs. a 24-hour window &lt;em&gt;after&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check your top 20 most recent queries for scan efficiency&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;statement_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;total_duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;read_bytes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mb_scanned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
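
&lt;p&gt;For the cost half of the audit, &lt;code&gt;system.billing.usage&lt;/code&gt; records DBU consumption per day. A hedged sketch (the warehouse ID is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Daily DBU consumption for one SQL warehouse over the last 30 days
SELECT
    usage_date,
    SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_metadata.warehouse_id = '1234abcd'  -- placeholder ID
  AND usage_date &amp;gt;= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date
ORDER BY usage_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;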






&lt;h2&gt;
  
  
  Best Practices Summary (The Do's and Don'ts)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ Do:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stagger Schedules:&lt;/strong&gt; Don't have 50 dashboards refresh exactly at 8:00 AM; space them out by 5 minutes to avoid resource contention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use CTEs:&lt;/strong&gt; Common Table Expressions (using the &lt;code&gt;WITH&lt;/code&gt; clause) make your logic readable and easier for the optimizer to handle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor Parallelism:&lt;/strong&gt; Use the warehouse monitoring tab to see if you are leaving compute capacity on the table.&lt;/li&gt;
&lt;/ul&gt;
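
&lt;p&gt;A quick illustration of the CTE pattern (table and column names are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Name intermediate steps instead of nesting subqueries
WITH regional_totals AS (
  SELECT region, SUM(total_sales) AS revenue
  FROM silver.sales_data
  GROUP BY region
)
SELECT region, revenue
FROM regional_totals
WHERE revenue &amp;gt; 100000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;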

&lt;h3&gt;
  
  
  ❌ Don't:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep-Alive Queries:&lt;/strong&gt; Don't run "dummy" queries just to keep a warehouse from spinning down. Use Serverless and let Auto-Stop do its job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip ANALYZE:&lt;/strong&gt; Always run &lt;code&gt;ANALYZE TABLE&lt;/code&gt; after large loads so the cost-based optimizer (CBO) has fresh statistics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function Wrapping:&lt;/strong&gt; Avoid &lt;code&gt;WHERE YEAR(date) = 2026&lt;/code&gt;; it breaks partition pruning.&lt;/li&gt;
&lt;/ul&gt;
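
&lt;p&gt;The last point deserves a concrete example: the two filters below are logically equivalent, but only the second leaves the column bare so the engine can prune files.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Breaks pruning: YEAR() must be evaluated against every row
SELECT * FROM fact_sales WHERE YEAR(sale_date) = 2026;

-- Prunes: a plain range predicate on the raw column
SELECT * FROM fact_sales
WHERE sale_date &amp;gt;= '2026-01-01' AND sale_date &amp;lt; '2027-01-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;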




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;Optimization is a &lt;strong&gt;cycle&lt;/strong&gt;, not a destination. By monitoring with Query History, diagnosing with Query Profiles, fixing with Liquid Clustering, and measuring with System Tables, you transform your Databricks environment into a high-performance, cost-effective data powerhouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Interview Questions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;How do you determine if a SQL Warehouse needs to be scaled up or if the queries need optimization?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;What are the benefits of using a Materialized View over a standard View in Databricks SQL?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does query parameterization improve the "Cache Hit Ratio"?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How would you calculate the total DBU cost of a specific user's queries over the last 30 days?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now that you have the full toolkit, which of these optimization strategies will you implement first to lower your DBU burn?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Optimizing Delta Tables: From Maintenance to Managed Excellence</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:06:33 +0000</pubDate>
      <link>https://forem.com/mayankcse/optimizing-delta-tables-from-maintenance-to-managed-excellence-1bd</link>
      <guid>https://forem.com/mayankcse/optimizing-delta-tables-from-maintenance-to-managed-excellence-1bd</guid>
      <description>&lt;p&gt;If high-performance SQL queries are the engine of your data platform, then your Delta tables are the fuel. Even the best-written SQL can't overcome a poorly organized data layer. In this guide, we shift from query logic to the physical storage layer—exploring how to maintain, cluster, and automate your Delta tables for maximum efficiency.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: "Small File Syndrome" and Data Scattering
&lt;/h2&gt;

&lt;p&gt;Two main issues plague Delta table performance over time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Small File Problem:&lt;/strong&gt; Frequent streaming or incremental writes create thousands of tiny Parquet files. Each file requires a separate I/O task, leading to massive scheduling overhead.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Data Scattering:&lt;/strong&gt; Without organization, related records (e.g., all sales for a specific &lt;code&gt;user_id&lt;/code&gt;) are scattered across hundreds of files, forcing the engine to scan the entire table for a single lookup.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. File Compaction with &lt;code&gt;OPTIMIZE&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;OPTIMIZE&lt;/code&gt; command is your first line of defense. It solves the small file problem by physically rewriting many tiny files into large, efficient &lt;strong&gt;1 GB chunks&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Reduces file open/close overhead by 10x to 100x.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best Practice:&lt;/strong&gt; Run &lt;code&gt;OPTIMIZE&lt;/code&gt; after large batch loads or frequent streaming updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-Compaction:&lt;/strong&gt; For streaming tables, set &lt;code&gt;optimizeWrite = true&lt;/code&gt;. This coalesces files during the write process so you don't have to manage manual maintenance jobs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Manually compact small files in a table&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;sales_unoptimized&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enable auto-compaction for continuous maintenance&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_streaming&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;autoOptimize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizeWrite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
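
&lt;p&gt;You can confirm that compaction actually reduced the file count: &lt;code&gt;DESCRIBE DETAIL&lt;/code&gt; reports &lt;code&gt;numFiles&lt;/code&gt; and &lt;code&gt;sizeInBytes&lt;/code&gt; for a Delta table, and &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt; shows the recorded &lt;code&gt;OPTIMIZE&lt;/code&gt; operation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Compare numFiles before and after running OPTIMIZE
DESCRIBE DETAIL sales_unoptimized;

-- Inspect the recorded OPTIMIZE operation and its metrics
DESCRIBE HISTORY sales_unoptimized LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;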






&lt;h2&gt;
  
  
  2. Z-ORDER: High-Performance Data Skipping
&lt;/h2&gt;

&lt;p&gt;Compaction makes files larger, but &lt;code&gt;Z-ORDER&lt;/code&gt; makes them &lt;strong&gt;smarter&lt;/strong&gt;. By co-locating related data in the same files, &lt;code&gt;Z-ORDER&lt;/code&gt; allows the engine to skip the majority of data using file-level Min/Max statistics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When to use:&lt;/strong&gt; On high-cardinality columns (e.g., &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;) that appear frequently in &lt;code&gt;WHERE&lt;/code&gt; or &lt;code&gt;JOIN&lt;/code&gt; clauses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limit:&lt;/strong&gt; Stick to 1–4 columns. Each additional column reduces the clustering effectiveness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A point-lookup that previously scanned 1,600 files might only touch 1 or 2 files after Z-ORDERing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Optimize and cluster data by high-cardinality columns&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; 
&lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Liquid Clustering: The Modern Standard
&lt;/h2&gt;

&lt;p&gt;While Z-ORDER is powerful, it is rigid (requiring full table rewrites if keys change). &lt;strong&gt;Liquid Clustering&lt;/strong&gt; is the modern replacement that simplifies everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Liquid Clustering Wins:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incremental:&lt;/strong&gt; It only re-clusters new data, avoiding expensive full-table rewrites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible:&lt;/strong&gt; You can change your clustering keys with a simple &lt;code&gt;ALTER TABLE&lt;/code&gt; without migrating data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent:&lt;/strong&gt; It automatically handles skew and data distribution.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Defining a table with Liquid Clustering&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;LONG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
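
&lt;p&gt;Because clustering keys are table metadata, changing them later is a one-line operation; the new keys apply to data written (and re-clustered) afterward. A sketch on the same illustrative table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Change clustering keys without migrating existing data
ALTER TABLE fact_sales CLUSTER BY (product_id, sale_date);

-- Incrementally re-cluster newly written data
OPTIMIZE fact_sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;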






&lt;h2&gt;
  
  
  4. Reclaiming Storage with &lt;code&gt;VACUUM&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Delta Lake's "Time Travel" is a lifesaver, but keeping every version of every file forever will explode your storage costs. &lt;code&gt;VACUUM&lt;/code&gt; removes files no longer needed for time travel.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retention:&lt;/strong&gt; The default is 7 days (168 hours). &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety:&lt;/strong&gt; You cannot vacuum files newer than 7 days on Serverless SQL Warehouses to prevent breaking active queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro-Tip:&lt;/strong&gt; Always use &lt;code&gt;DRY RUN&lt;/code&gt; first to see what will be deleted!
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Preview files to be deleted&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt; &lt;span class="n"&gt;RETAIN&lt;/span&gt; &lt;span class="mi"&gt;168&lt;/span&gt; &lt;span class="n"&gt;HOURS&lt;/span&gt; &lt;span class="n"&gt;DRY&lt;/span&gt; &lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Execute the cleanup&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. The "Set and Forget" Strategy: Predictive Optimization
&lt;/h2&gt;

&lt;p&gt;The ultimate goal of a Data Engineer is to spend less time on maintenance. &lt;strong&gt;Predictive Optimization&lt;/strong&gt; is a managed service where Databricks monitors your tables and automatically runs &lt;code&gt;OPTIMIZE&lt;/code&gt;, &lt;code&gt;VACUUM&lt;/code&gt;, and &lt;code&gt;ANALYZE&lt;/code&gt; when needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Enable Databricks to manage maintenance automatically&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt; &lt;span class="n"&gt;ENABLE&lt;/span&gt; &lt;span class="n"&gt;PREDICTIVE&lt;/span&gt; &lt;span class="n"&gt;OPTIMIZATION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary / Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compact:&lt;/strong&gt; Use &lt;code&gt;OPTIMIZE&lt;/code&gt; to merge small files and reduce I/O overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster:&lt;/strong&gt; Use &lt;code&gt;Liquid Clustering&lt;/code&gt; for all new tables to enable massive data skipping with total flexibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean:&lt;/strong&gt; Use &lt;code&gt;VACUUM&lt;/code&gt; to keep storage costs down by removing stale data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate:&lt;/strong&gt; Enable &lt;code&gt;Predictive Optimization&lt;/code&gt; at the catalog or schema level to let the platform handle the heavy lifting.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;What is the "Small File Problem," and how does &lt;code&gt;OPTIMIZE&lt;/code&gt; resolve it?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does Z-ORDER improve performance for point-lookup queries?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Why is Liquid Clustering considered superior to traditional Hive partitioning or Z-ORDER?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;What are the risks of running &lt;code&gt;VACUUM&lt;/code&gt; with a very short retention period?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;How much manual maintenance are you currently doing on your Delta tables, or have you already moved toward Predictive Optimization?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Hands-On Performance: Diagnosing and Fixing Databricks SQL Bottlenecks</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:05:20 +0000</pubDate>
      <link>https://forem.com/mayankcse/hands-on-performance-diagnosing-and-fixing-databricks-sql-bottlenecks-4e5</link>
      <guid>https://forem.com/mayankcse/hands-on-performance-diagnosing-and-fixing-databricks-sql-bottlenecks-4e5</guid>
      <description>&lt;p&gt;Once you know how to monitor your queries, the next step is taking action. In Databricks SQL, performance tuning isn't a "set it and forget it" task—it’s a hands-on process of reducing data scans, optimizing joins, and leveraging intelligent caching.&lt;/p&gt;

&lt;p&gt;This guide moves from theory to execution, showing you exactly how to identify bottlenecks and apply the right fixes to make your queries run faster and cheaper.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Inefficient "Brute Force" Queries
&lt;/h2&gt;

&lt;p&gt;A common mistake for SQL developers is writing "brute force" queries that scan entire tables to find a single row. While modern engines are fast, this approach is unsustainable at the petabyte scale. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inefficient queries lead to:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Latency:&lt;/strong&gt; Users waiting minutes for simple dashboard refreshes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wasted Spend:&lt;/strong&gt; Paying for compute resources to read data that is immediately discarded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Contention:&lt;/strong&gt; One "heavy" query slowing down the entire warehouse for everyone else.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Core Concept: The Golden Rule of Big Data
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The fastest way to speed up a query is to read less data.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To achieve this, Databricks uses three primary scan-reduction techniques:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Partition Pruning:&lt;/strong&gt; Skipping entire directories of files based on a filter (e.g., &lt;code&gt;WHERE region = 'North'&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Predicate Pushdown:&lt;/strong&gt; Using metadata (Min/Max statistics) within files to skip specific blocks of data before they are even read into memory.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dynamic File Pruning:&lt;/strong&gt; Eliminating files at runtime based on values discovered from the &lt;em&gt;other&lt;/em&gt; side of a join.&lt;/li&gt;
&lt;/ol&gt;
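
&lt;p&gt;A minimal illustration of the first two techniques, assuming a &lt;code&gt;sales&lt;/code&gt; table partitioned by &lt;code&gt;region&lt;/code&gt; (names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Partition pruning: only the region='North' directory is scanned
-- Predicate pushdown: Min/Max stats skip file blocks where amount never exceeds 1000
SELECT COUNT(*)
FROM sales
WHERE region = 'North'
  AND amount &amp;gt; 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;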




&lt;h2&gt;
  
  
  Deep Dive: Join Strategies &amp;amp; Optimization
&lt;/h2&gt;

&lt;p&gt;Joins are often the most expensive part of a query execution. Understanding how Databricks handles them is key to fixing a slow DAG.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. BroadcastHashJoin (The Winner)
&lt;/h3&gt;

&lt;p&gt;The engine takes the smaller table and sends a full copy to every worker node. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it's fast:&lt;/strong&gt; No data shuffle is required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Joining a massive fact table with a smaller dimension table.&lt;/li&gt;
&lt;/ul&gt;
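
&lt;p&gt;If the optimizer misjudges table sizes, you can request the broadcast explicitly with a standard Spark SQL hint (table names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Ship the small dimension table to every worker; no shuffle of the fact table
SELECT /*+ BROADCAST(d) */ f.order_id, d.product_name
FROM fact_sales f
JOIN dim_product d ON f.product_id = d.product_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;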

&lt;h3&gt;
  
  
  2. SortMergeJoin (The Workhorse)
&lt;/h3&gt;

&lt;p&gt;Both tables are shuffled across the network, sorted by the join key, and then merged.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it's used:&lt;/strong&gt; It is the standard for joining two very large tables that don't fit in memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The downside:&lt;/strong&gt; Heavy network and I/O overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Adaptive Query Execution (AQE)
&lt;/h3&gt;

&lt;p&gt;Databricks isn't static. AQE can look at a query &lt;em&gt;during&lt;/em&gt; execution and say, "Wait, this table I thought was big is actually small. Let's switch from a SortMergeJoin to a BroadcastJoin on the fly."&lt;/p&gt;
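
&lt;p&gt;AQE is enabled by default on modern Databricks runtimes; in a notebook or cluster session you can check or set the flag explicitly (a sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Show the current value of the AQE flag
SET spark.sql.adaptive.enabled;

-- Explicitly enable it (already the default on recent runtimes)
SET spark.sql.adaptive.enabled = true;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;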




&lt;h2&gt;
  
  
  Technical Implementation: Writing Cache-Friendly SQL
&lt;/h2&gt;

&lt;p&gt;Caching can make a query go from 30 seconds to 0.5 seconds, but only if you write your SQL correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Cache-Busters" to Avoid
&lt;/h3&gt;

&lt;p&gt;Certain functions make your query "non-deterministic," meaning the engine can't be sure the result will be the same next time, so it refuses to cache it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;❌ Always Misses Cache&lt;/th&gt;
&lt;th&gt;✅ Always Hits Cache (Recommended)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE date = CURRENT_DATE()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE date = '2024-05-20'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE ts &amp;gt; NOW() - INTERVAL 1 DAY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE ts &amp;gt; '2024-05-19 08:00:00'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adding/Removing random whitespace&lt;/td&gt;
&lt;td&gt;Consistent, formatted SQL blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
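
&lt;p&gt;In practice, the cache-friendly version swaps the volatile function for a parameter that the dashboard fills in (the &lt;code&gt;:as_of_ts&lt;/code&gt; name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Cache-buster: NOW() makes every execution non-deterministic
SELECT COUNT(*) FROM orders WHERE ts &amp;gt; NOW() - INTERVAL 1 DAY;

-- Cache-friendly: the query text stays identical; only the parameter value changes
SELECT COUNT(*) FROM orders WHERE ts &amp;gt; :as_of_ts;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;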

&lt;h3&gt;
  
  
  Pre-Warming the Cache
&lt;/h3&gt;

&lt;p&gt;If you have a high-priority dashboard, you can "pre-warm" the SSDs of your warehouse using the &lt;code&gt;CACHE SELECT&lt;/code&gt; command. This ensures the data is sitting on local fast storage before the first user even logs in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Pre-warm the local Delta cache for high-priority tables&lt;/span&gt;
&lt;span class="k"&gt;CACHE&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_summary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  System Design: The Performance Playbook
&lt;/h2&gt;

&lt;p&gt;To build a high-performance environment, follow this four-step cycle:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Identify the "Heavy Hitters"
&lt;/h3&gt;

&lt;p&gt;Query the system tables to find the top 10 most expensive queries by &lt;code&gt;total_task_duration_ms&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;statement_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;total_task_duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;read_bytes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mb_read&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;date_add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_task_duration_ms&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Analyze the DAG
&lt;/h3&gt;

&lt;p&gt;Open the &lt;strong&gt;Query Profile&lt;/strong&gt;. Look for the "Scan Table" node. If the &lt;strong&gt;Pruning Percentage&lt;/strong&gt; is low, you are reading too much data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Apply the Fix
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Missing Pruning?&lt;/strong&gt; Add a filter on a partitioned column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Massive Shuffles?&lt;/strong&gt; Use a &lt;code&gt;/*+ BROADCAST(small_table) */&lt;/code&gt; hint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow Scans?&lt;/strong&gt; Run &lt;code&gt;OPTIMIZE table_name ZORDER BY (frequent_filter_column)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Verify &amp;amp; Alert
&lt;/h3&gt;

&lt;p&gt;Compare the "Before" and "After" metrics in &lt;code&gt;system.query.history&lt;/code&gt;. If the &lt;code&gt;read_bytes&lt;/code&gt; dropped significantly, your fix worked.&lt;/p&gt;




&lt;h2&gt;
  
  
  Best Practices &amp;amp; Pitfalls
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Functions on Filter Columns:&lt;/strong&gt; Writing &lt;code&gt;WHERE YEAR(my_date) = 2023&lt;/code&gt; prevents the engine from using partition pruning. Use &lt;code&gt;WHERE my_date BETWEEN '2023-01-01' AND '2023-12-31'&lt;/code&gt; instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-Size the Warehouse:&lt;/strong&gt; Don't use a Large warehouse for 2X-Small tasks. Use the smallest tier that meets your SLA to save money.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor Parallelism:&lt;/strong&gt; A low &lt;strong&gt;Parallelism Ratio&lt;/strong&gt; in your history logs means your query is running sequentially and not taking advantage of your cluster's power.&lt;/li&gt;
&lt;/ul&gt;
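
&lt;p&gt;The first pitfall is easiest to see side by side; in this sketch, &lt;code&gt;events&lt;/code&gt; is a hypothetical table partitioned by &lt;code&gt;my_date&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Anti-pattern: the function call hides the column from partition pruning
SELECT count(*) FROM events WHERE YEAR(my_date) = 2023;

-- Better: a plain range predicate lets the engine skip whole partitions
SELECT count(*) FROM events WHERE my_date BETWEEN '2023-01-01' AND '2023-12-31';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
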




&lt;h2&gt;
  
  
  Summary / Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read Less:&lt;/strong&gt; Partition pruning and predicate pushdown are your best friends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter Early:&lt;/strong&gt; The closer a filter is to the source scan, the less work every downstream join and aggregate has to do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be Deterministic:&lt;/strong&gt; Replace &lt;code&gt;NOW()&lt;/code&gt; and &lt;code&gt;CURRENT_DATE()&lt;/code&gt; with static parameters to unlock the Result Cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Serverless:&lt;/strong&gt; For bursty workloads, serverless warehouses prevent you from paying for idle compute time.&lt;/li&gt;
&lt;/ul&gt;
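
&lt;p&gt;The determinism point is worth a quick sketch (&lt;code&gt;orders&lt;/code&gt; is a hypothetical table; the literal date stands in for a dashboard parameter):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Cache-hostile: the non-deterministic function defeats the Result Cache
SELECT count(*) FROM orders WHERE order_date = CURRENT_DATE();

-- Cache-friendly: a static parameter value produces an identical, cacheable query
SELECT count(*) FROM orders WHERE order_date = '2024-06-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
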




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;What is the difference between Partition Pruning and Predicate Pushdown?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does wrapping a column in a function (like &lt;code&gt;TO_DATE()&lt;/code&gt;) affect query performance?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;When would you choose a ShuffleHashJoin over a BroadcastHashJoin?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;What metrics in the Query Profile indicate that a table needs Z-Ordering or better partitioning?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Master Your Queries: A Guide to Databricks SQL Performance Monitoring</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:03:31 +0000</pubDate>
      <link>https://forem.com/mayankcse/master-your-queries-a-guide-to-databricks-sql-performance-monitoring-1mdm</link>
      <guid>https://forem.com/mayankcse/master-your-queries-a-guide-to-databricks-sql-performance-monitoring-1mdm</guid>
      <description>&lt;p&gt;Optimization isn't just about writing cleaner SQL; it's about knowing exactly where your compute dollars are going. In a world of auto-scaling warehouses and serverless compute, a single "bad" query can silently inflate your monthly cloud bill.&lt;/p&gt;

&lt;p&gt;Whether you are a Data Engineer trying to slash execution times or a Platform Architect managing costs, Databricks provides a powerful duo of tools to help you: &lt;strong&gt;Query History&lt;/strong&gt; and &lt;strong&gt;Query Profile&lt;/strong&gt;. In this post, we’ll explore how to move from reactive "firefighting" to proactive performance management.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: The "Black Box" Query
&lt;/h2&gt;

&lt;p&gt;We've all been there: you trigger a SQL statement, and the loading spinner just keeps turning. Is the warehouse overloaded? Is your join causing a massive data shuffle? Or is the engine simply struggling to read millions of unpartitioned files?&lt;/p&gt;

&lt;p&gt;Without visibility, optimization is just guesswork. Monitoring these queries is essential because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost Control:&lt;/strong&gt; Reducing "Scan Volume" directly lowers DBU (Databricks Unit) consumption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Experience:&lt;/strong&gt; Faster dashboards mean happier business stakeholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Allocation:&lt;/strong&gt; Identifying if you need a larger warehouse or simply better SQL.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Core Concepts: History vs. Profile
&lt;/h2&gt;

&lt;p&gt;Before we dive into the code, let's distinguish between the two primary diagnostic layers in Databricks SQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Query History (The Macro View)
&lt;/h3&gt;

&lt;p&gt;Think of this as your &lt;strong&gt;Flight Log&lt;/strong&gt;. It shows every query executed over a period. It is your first stop for isolating "slow performers."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Insight:&lt;/strong&gt; Wall-clock breakdown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Breakdown:&lt;/strong&gt; It splits time into &lt;strong&gt;Scheduling&lt;/strong&gt;, &lt;strong&gt;Compilation&lt;/strong&gt;, and &lt;strong&gt;Execution&lt;/strong&gt;. 

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;High Scheduling Time?&lt;/em&gt; Your warehouse is likely queued up and needs more clusters.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;High Execution Time?&lt;/em&gt; Your SQL logic or data layout is the bottleneck.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Query Profile (The Micro View)
&lt;/h3&gt;

&lt;p&gt;Think of this as the &lt;strong&gt;X-Ray&lt;/strong&gt;. It provides a &lt;strong&gt;Directed Acyclic Graph (DAG)&lt;/strong&gt; of the execution plan.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Insight:&lt;/strong&gt; Operator-level metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Breakdown:&lt;/strong&gt; It shows exactly how many rows went into a filter and how many came out, which operators "spilled" to disk, and how much data was shuffled across the network.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Deep Dive: Anatomy of a Query Profile
&lt;/h2&gt;

&lt;p&gt;When you open a Query Profile, you are looking at a visual representation of the Spark engine at work.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Top Operators Panel
&lt;/h3&gt;

&lt;p&gt;This panel ranks operations by time. If a &lt;strong&gt;FileScan&lt;/strong&gt; takes 80% of the time, your issue is IO-bound (likely missing partitioning). If a &lt;strong&gt;Join&lt;/strong&gt; takes 80%, you have a compute/shuffle issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Spills: The Silent Killer
&lt;/h3&gt;

&lt;p&gt;Keep a sharp eye on &lt;strong&gt;Spill to Disk&lt;/strong&gt;. This occurs when an operation (like a large Sort or Join) exceeds the available RAM in the executor. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Fix:&lt;/strong&gt; Increase the warehouse size or optimize the query to handle less data at once (e.g., using a &lt;code&gt;BROADCAST&lt;/code&gt; hint for smaller tables).&lt;/li&gt;
&lt;/ul&gt;
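
&lt;p&gt;As a hedged illustration of the broadcast fix (&lt;code&gt;big_fact&lt;/code&gt; and &lt;code&gt;small_dim&lt;/code&gt; are hypothetical tables), the hint keeps the small side in memory on every node instead of shuffling both sides:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Broadcast the small table so the large one never shuffles
SELECT /*+ BROADCAST(small_dim) */
    f.id,
    d.label
FROM big_fact AS f
JOIN small_dim AS d
    ON f.dim_id = d.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
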

&lt;h3&gt;
  
  
  Predicate Pushdown
&lt;/h3&gt;

&lt;p&gt;The "Rows In vs. Rows Out" ratio is the most underrated metric. If an operator reads 10 million rows only to filter out 9.9 million, that filter should have happened earlier (at the Scan level). This is known as &lt;strong&gt;Predicate Pushdown&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Implementation: Monitoring via System Tables
&lt;/h2&gt;

&lt;p&gt;The UI is great for one-offs, but for long-term governance, you should query the &lt;strong&gt;System Tables&lt;/strong&gt; directly. This allows you to build automated dashboards and alert on SLA breaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Identifying High-Cost Outliers
&lt;/h3&gt;

&lt;p&gt;The following query identifies queries running longer than 60 seconds and calculates the "Data Scanned" to help you find inefficient full-table scans.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Identify long-running queries with high data scan volume&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;statement_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;statement_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;executed_as&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;duration_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;external_links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query_profile&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;profile_link&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- Convert bytes to GB for better readability&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_bytes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;gb_scanned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="c1"&gt;-- Filter for queries longer than 1 minute&lt;/span&gt;
    &lt;span class="n"&gt;total_duration_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt; 
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;statement_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SELECT'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example: User Cost Leaderboard
&lt;/h3&gt;

&lt;p&gt;To see which users or teams are driving the most load, you can aggregate metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;executed_as&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_duration_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_duration_sec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_gb_scanned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;total_gb_scanned&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-World Applications
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retail/E-commerce:&lt;/strong&gt; Monitoring "Peak Season" dashboard performance to ensure sub-second latency for executive reports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FinTech:&lt;/strong&gt; Auditing query history for compliance to ensure only authorized service accounts are touching sensitive PII tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SaaS Providers:&lt;/strong&gt; Using System Tables to "Chargeback" compute costs to specific departments based on their DBU usage.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Performance Red Flags &amp;amp; Best Practices
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Red Flag&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Recommended Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Skew&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One task takes 100x longer than others.&lt;/td&gt;
&lt;td&gt;Check join keys for nulls or highly frequent values. Use skew hints.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cartesian Product&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive row count explosion.&lt;/td&gt;
&lt;td&gt;Ensure all &lt;code&gt;JOIN&lt;/code&gt; statements have a proper &lt;code&gt;ON&lt;/code&gt; clause.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stale Statistics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optimizer making bad join choices.&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;ANALYZE TABLE [table_name] COMPUTE STATISTICS&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High Exchange Volume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large amounts of data shuffling.&lt;/td&gt;
&lt;td&gt;Optimize &lt;code&gt;GROUP BY&lt;/code&gt; keys or use Z-Ordering on join columns.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
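
&lt;p&gt;For the stale-statistics row in particular, the refresh is a one-liner (the table name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Recompute table- and column-level statistics for the Cost-Based Optimizer
ANALYZE TABLE sales_fact COMPUTE STATISTICS FOR ALL COLUMNS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
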




&lt;h2&gt;
  
  
  Summary / Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query History&lt;/strong&gt; identifies &lt;em&gt;which&lt;/em&gt; queries are slow; &lt;strong&gt;Query Profile&lt;/strong&gt; explains &lt;em&gt;why&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Check &lt;strong&gt;Wall-clock breakdown&lt;/strong&gt; to see if the issue is the warehouse (Queuing) or the SQL (Execution).&lt;/li&gt;
&lt;li&gt;Watch for &lt;strong&gt;Disk Spills&lt;/strong&gt;—they are the primary cause of sudden slowdowns in large joins.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;System Tables&lt;/strong&gt; to move from reactive troubleshooting to a proactive observability practice.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Explain the difference between Scheduling Time and Execution Time in Databricks Query History.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What does a "Spill to Disk" indicate in a Query Profile, and how would you resolve it?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What is Predicate Pushdown, and how can you verify it is working using the Query Profile?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How would you use the &lt;code&gt;system.query.history&lt;/code&gt; table to find the top 5 most expensive queries by data volume?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What’s the most common performance bottleneck you’ve run into—is it usually the SQL logic or the underlying warehouse configuration?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Demystifying Databricks SQL: How Your Queries Actually Run Under the Hood</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Fri, 17 Apr 2026 12:45:29 +0000</pubDate>
      <link>https://forem.com/mayankcse/demystifying-databricks-sql-how-your-queries-actually-run-under-the-hood-4c83</link>
      <guid>https://forem.com/mayankcse/demystifying-databricks-sql-how-your-queries-actually-run-under-the-hood-4c83</guid>
      <description>&lt;p&gt;In the world of big data, writing a SQL query is the easy part. The real challenge—and the mark of a great data engineer—is understanding &lt;strong&gt;how&lt;/strong&gt; that query executes. If you’ve ever stared at a "Running..." status for ten minutes, you know that the "black box" of query execution can be frustrating.&lt;/p&gt;

&lt;p&gt;In this guide, based on insights from industry experts, we’re going to peel back the layers of &lt;strong&gt;Databricks SQL (DBSQL)&lt;/strong&gt;. We’ll explore the architecture, the engine, and the lifecycle of a query so you can stop guessing and start optimizing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: The "Black Box" of Latency
&lt;/h2&gt;

&lt;p&gt;When a query is slow, most developers reflexively increase the cluster size. While "throwing hardware at the problem" sometimes works, it’s expensive and often masks underlying issues like &lt;strong&gt;data skew&lt;/strong&gt;, &lt;strong&gt;poor pruning&lt;/strong&gt;, or &lt;strong&gt;shuffle spills&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Understanding the execution pipeline allows you to identify exactly where the bottleneck lies: is it taking too long to find the files (Metadata), too long to read them (I/O), or too long to process the math (CPU)?&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Big Picture: Databricks SQL Architecture
&lt;/h2&gt;

&lt;p&gt;Before a single row is read, your query travels through a specific ecosystem. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Interface:&lt;/strong&gt; You write your query in the SQL Editor or an external tool (Tableau, Power BI).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Governance Layer:&lt;/strong&gt; &lt;strong&gt;Unity Catalog&lt;/strong&gt; checks if you actually have permission to see that data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Compute Layer:&lt;/strong&gt; The &lt;strong&gt;SQL Warehouse&lt;/strong&gt; (the "engine room") receives the request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Storage Layer:&lt;/strong&gt; Your data lives in &lt;strong&gt;Delta Lake&lt;/strong&gt; on cloud storage (S3, ADLS, or GCS).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choosing Your Engine: SQL Warehouse Types
&lt;/h3&gt;

&lt;p&gt;Not all warehouses are created equal. Your choice dictates performance and cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Serverless:&lt;/strong&gt; The gold standard. Instant start, auto-managed, and uses the high-performance Photon engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro:&lt;/strong&gt; Offers Photon engine benefits but gives you more manual control over configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classic:&lt;/strong&gt; The legacy option. Cheaper per unit but lacks modern optimizations like Predictive I/O.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. The Lifecycle of a Query: From SQL to Results
&lt;/h2&gt;

&lt;p&gt;When you hit "Run," your query undergoes a 5-stage transformation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Parsing:&lt;/strong&gt; The engine checks your syntax. Are the commas in the right place? Does the table &lt;code&gt;orders&lt;/code&gt; actually exist in Unity Catalog?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Logical Planning:&lt;/strong&gt; The engine creates an abstract map of &lt;em&gt;what&lt;/em&gt; you want to do (e.g., "Join Table A and B, then filter").&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Physical Planning (The Optimizer):&lt;/strong&gt; This is where the &lt;strong&gt;Cost-Based Optimizer (CBO)&lt;/strong&gt; looks at table statistics. It decides the most efficient way to join tables—for example, broadcasting a small table to all nodes instead of a massive shuffle.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Execution:&lt;/strong&gt; The &lt;strong&gt;Driver Node&lt;/strong&gt; breaks the plan into small tasks and sends them to &lt;strong&gt;Worker Nodes&lt;/strong&gt;. This is where &lt;strong&gt;Adaptive Query Execution (AQE)&lt;/strong&gt; lives; if the engine notices the data is skewed during the run, it can change the plan on the fly.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Result Delivery &amp;amp; Caching:&lt;/strong&gt; Results are sent back and cached. If you run the exact same query again and the underlying data hasn’t changed, Databricks serves the result from the cache almost instantly.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  3. The Secret Sauce: The Photon Engine
&lt;/h2&gt;

&lt;p&gt;One of the biggest differentiators in Databricks is &lt;strong&gt;Photon&lt;/strong&gt;. Unlike traditional Spark, which runs on the Java Virtual Machine (JVM), Photon is a native C++ engine.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vectorized Execution:&lt;/strong&gt; It processes data in batches (vectors) rather than row-by-row, which is significantly faster for modern CPUs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictive I/O:&lt;/strong&gt; It "guesses" which data blocks you'll need next, reducing the time the CPU spends waiting for data from the cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro-Tip:&lt;/strong&gt; Look for the &lt;strong&gt;Lightning Bolt icon&lt;/strong&gt; in your Query Profile. This indicates the operation was handled by Photon. If it's missing, you've experienced a "Spark Fallback." This often happens if you use complex Python UDFs—try to stick to native SQL functions to keep things in Photon!&lt;/p&gt;
&lt;/blockquote&gt;
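
&lt;p&gt;For instance, a transformation that might tempt you toward a Python UDF can often be expressed with built-ins and stay in Photon (the column names here are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Native SQL functions keep the whole plan in Photon
SELECT
    order_id,
    CASE WHEN amount &amp;gt;= 1000 THEN 'high' ELSE 'standard' END AS tier,
    date_trunc('MONTH', order_date) AS order_month
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
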




&lt;h2&gt;
  
  
  4. Deep Dive: Decoding the Query Profile
&lt;/h2&gt;

&lt;p&gt;To optimize, you must learn to read the &lt;strong&gt;Query Profile&lt;/strong&gt;. It’s the "medical X-ray" of your query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Metrics to Watch:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it tells you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Files Pruned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How many files were skipped. High pruning = Great performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shuffle Spill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data was too big for RAM and spilled to disk. This is a massive speed killer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scan Table&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Shows how much raw data was pulled from storage.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Architecture of a Distributed Join
&lt;/h3&gt;

&lt;p&gt;In a typical distributed join, data is "shuffled" across the network so that matching keys end up on the same worker node.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Practical Implementation: Querying with Best Practices
&lt;/h2&gt;

&lt;p&gt;Here is a real-world example of a query designed to perform well in Databricks SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Using native SQL functions to stay in the Photon Engine&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; 
    &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tpch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; 
    &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tpch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; 
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_custkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;o_custkey&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; 
    &lt;span class="c1"&gt;-- Filtering on a partitioned column (Date) for better pruning&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;o_orderdate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'1995-01-01'&lt;/span&gt; 
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;o_orderdate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="s1"&gt;'1995-12-31'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. Best Practices &amp;amp; Pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ The "Do's"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Right-size your Warehouse:&lt;/strong&gt; Use &lt;strong&gt;2X-Small&lt;/strong&gt; for testing, but move to &lt;strong&gt;Large+&lt;/strong&gt; for heavy ETL to avoid memory pressure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage Serverless:&lt;/strong&gt; It shuts down aggressively when idle, so you stop paying for unused compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check Statistics:&lt;/strong&gt; Ensure your tables have updated statistics (&lt;code&gt;ANALYZE TABLE&lt;/code&gt;) so the Optimizer can make smart choices.&lt;/li&gt;
&lt;/ul&gt;
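
&lt;p&gt;As a sketch of that last point (the table name is illustrative), a statistics refresh looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Give the Cost-Based Optimizer fresh row counts and column statistics
ANALYZE TABLE orders COMPUTE STATISTICS FOR ALL COLUMNS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
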

&lt;h3&gt;
  
  
  ❌ The "Don'ts"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Python UDFs in SQL:&lt;/strong&gt; These force the engine to leave the C++ Photon environment, slowing down execution significantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't ignore the Shuffle:&lt;/strong&gt; If you see high shuffle numbers, consider if your join keys are causing "Data Skew" (where one worker does 90% of the work).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary: Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Warehouses&lt;/strong&gt; are the compute power; Serverless is generally the best choice for speed and cost-efficiency.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Query Profile&lt;/strong&gt; is your best friend for identifying bottlenecks like low file pruning or shuffle spills.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Photon&lt;/strong&gt; is the high-performance C++ engine that powers modern Databricks SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Query Execution (AQE)&lt;/strong&gt; optimizes your query &lt;em&gt;while&lt;/em&gt; it is running.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;What is the difference between the Logical Plan and the Physical Plan in Databricks SQL?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does Predictive I/O improve query performance compared to standard cloud storage reads?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;What are "Spark Fallbacks," and how do they impact the performance of the Photon engine?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Building Resilient AI: Architectural Patterns for Event-Driven Agents</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Sun, 12 Apr 2026 11:05:08 +0000</pubDate>
      <link>https://forem.com/mayankcse/building-resilient-ai-architectural-patterns-for-event-driven-agents-3i71</link>
      <guid>https://forem.com/mayankcse/building-resilient-ai-architectural-patterns-for-event-driven-agents-3i71</guid>
      <description>&lt;p&gt;In the rush to build the next generation of "agentic" AI systems, developers often focus on the LLM's reasoning capabilities while neglecting the pipes that carry the data. But here is the hard truth: &lt;strong&gt;Most agentic systems fail or fly based on one decision—how you design your infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you move from a simple chatbot to an autonomous agent that can process orders, detect fraud, or triage support tickets, you are no longer just making API calls. You are managing state, concurrency, and reliability across a distributed landscape. &lt;/p&gt;

&lt;p&gt;In this guide, we’ll explore how to build a robust backbone for your AI agents using event-driven architecture (EDA).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: The Fragility of Synchronous Agents
&lt;/h2&gt;

&lt;p&gt;Traditional "request-response" architectures are brittle. If an agent calls a payment service and that service is down, the agent hangs. Even worse, if the agent completes a task but the network blips before it can save the result, you end up with "ghost actions"—money spent, but no record of the transaction.&lt;/p&gt;

&lt;p&gt;As we scale AI agents, we face three primary challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Blast Radius:&lt;/strong&gt; One failing component shouldn't crash the entire agent swarm.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;State Inconsistency:&lt;/strong&gt; Ensuring the agent's "brain" and the system's database always agree.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Throughput vs. Latency:&lt;/strong&gt; Balancing the need for speed with the reality of heavy processing loads.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Choosing Your Backbone: Centralized vs. Federated
&lt;/h2&gt;

&lt;p&gt;How you route events defines your system's DNA.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Event Bus:&lt;/strong&gt; A single backbone (like a corporate Kafka cluster) offers strong governance, consistent security, and a single place to observe everything. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Federated/Decentralized:&lt;/strong&gt; Each domain owns its own bus. This creates "failure domains," meaning a spike in your "Triage Agent" won't take down your "Payment Agent."&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Toolbelt: Kafka vs. NATS vs. Azure
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Key Feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long-term durability &amp;amp; replay&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Consumer Groups:&lt;/strong&gt; Allows different teams to scale and process the same stream independently.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NATS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-performance "Walkie-Talkie"&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Fire-and-forget:&lt;/strong&gt; Ultra-low latency. Use &lt;strong&gt;JetStream&lt;/strong&gt; if you eventually need persistence.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Trio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enterprise Cloud Ecosystem&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Event Hubs&lt;/strong&gt; (Streaming), &lt;strong&gt;Service Bus&lt;/strong&gt; (Messaging), &lt;strong&gt;Event Grid&lt;/strong&gt; (Serverless/SaaS).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Maintaining Consistency: Sagas and Outboxes
&lt;/h2&gt;

&lt;p&gt;In an event-driven world, we don't use traditional distributed transactions (which lock databases and kill performance). Instead, we use the &lt;strong&gt;Saga Pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Saga Pattern
&lt;/h3&gt;

&lt;p&gt;A Saga is a multi-step story told through events. If Step 3 fails, the system triggers "compensating actions" to undo Steps 1 and 2. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; An Order Agent charges a card but finds the item is out of stock. The Saga triggers a refund event automatically.&lt;/p&gt;
&lt;/blockquote&gt;
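&lt;p&gt;The refund flow above can be sketched in a few lines. This is a minimal illustration, not a production saga orchestrator; &lt;code&gt;charge_card&lt;/code&gt;, &lt;code&gt;reserve_stock&lt;/code&gt;, and &lt;code&gt;refund_card&lt;/code&gt; are hypothetical stand-ins for real service calls:&lt;/p&gt;

```python
# Minimal saga sketch: each step is a local transaction; a failure
# triggers the compensating action for the steps already completed.
def charge_card(order):
    return {"charge_id": "ch_001"}

def reserve_stock(order):
    # Simulate the failure described above: the item is out of stock.
    raise RuntimeError("out_of_stock")

def refund_card(charge):
    print(f"Compensating: refunding charge {charge['charge_id']}")

def run_order_saga(order):
    charge = charge_card(order)      # Step 1: local transaction
    try:
        reserve_stock(order)         # Step 2: next local transaction
    except RuntimeError:
        refund_card(charge)          # Compensating action undoes Step 1
        return "compensated"
    return "completed"

print(run_order_saga({"item": "A1", "amount": 49.99}))
```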

&lt;h3&gt;
  
  
  The Outbox Pattern
&lt;/h3&gt;

&lt;p&gt;To prevent the "Internal State Updated but Event Not Sent" bug, use an &lt;strong&gt;Outbox&lt;/strong&gt;. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your service writes the business change and the event to the &lt;em&gt;same&lt;/em&gt; database in one transaction.&lt;/li&gt;
&lt;li&gt;A background publisher reads that "Outbox" table and pushes the event to the bus.&lt;/li&gt;
&lt;li&gt;This guarantees that state and events are always in sync.&lt;/li&gt;
&lt;/ol&gt;
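&lt;p&gt;The three steps above can be sketched with SQLite standing in for your database; the table layout and the in-memory &lt;code&gt;bus&lt;/code&gt; list are illustrative stand-ins for a real store and broker:&lt;/p&gt;

```python
import json
import sqlite3

# Outbox pattern sketch: the business row and the event row commit together.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id):
    # 1. Business change and event land in the SAME transaction.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        event = json.dumps({"type": "order_placed", "order_id": order_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))

def publish_pending(bus):
    # 2. A background publisher drains the outbox to the event bus.
    rows = conn.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        bus.append(json.loads(payload))  # stand-in for a real bus client
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

bus = []
place_order("ORD-1")
publish_pending(bus)
print(bus)
```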




&lt;h2&gt;
  
  
  3. Implementation: Idempotency and Concurrency
&lt;/h2&gt;

&lt;p&gt;In distributed systems, &lt;strong&gt;"Exactly Once" delivery is a myth&lt;/strong&gt; (or at least, incredibly expensive). Aim for &lt;strong&gt;Effectively Once&lt;/strong&gt; by using &lt;strong&gt;Idempotency Keys&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Example: Idempotent Event Handler (Python)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Redis for effect logging
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_payment_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idempotency_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Check if we've already processed this specific event
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Duplicate event &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ignored.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;already_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 2. Perform the business logic
&lt;/span&gt;        &lt;span class="nf"&gt;execute_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Log the effect with an expiration (e.g., 24 hours)
&lt;/span&gt;        &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 4. Handle failure (allow for retry)
&lt;/span&gt;        &lt;span class="nf"&gt;log_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Use &lt;strong&gt;Optimistic Concurrency&lt;/strong&gt;. Instead of locking a row, use an &lt;code&gt;ETag&lt;/code&gt; or &lt;code&gt;version_number&lt;/code&gt;. If two agents try to update the same record, the second one will fail the version check and can retry with fresh data.&lt;/p&gt;
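&lt;p&gt;A toy version of that version check, with the record and its fields invented purely for illustration:&lt;/p&gt;

```python
# Optimistic concurrency sketch: the write succeeds only if nobody
# bumped the version number since our read.
class VersionConflict(Exception):
    pass

record = {"balance": 100, "version": 1}

def update_balance(rec, expected_version, new_balance):
    if rec["version"] != expected_version:
        raise VersionConflict("stale read; refetch and retry")
    rec["balance"] = new_balance
    rec["version"] = expected_version + 1

update_balance(record, 1, 80)       # agent A wins the race
try:
    update_balance(record, 1, 90)   # agent B also read version 1, now stale
except VersionConflict:
    print("Agent B retries with fresh data")
```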




&lt;h2&gt;
  
  
  4. Avoiding the "2 AM" Pitfalls
&lt;/h2&gt;

&lt;p&gt;Real-world systems get "weird." Here is how to guard against the most common failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dead Letter Queues (DLQ):&lt;/strong&gt; When an agent fails to process a "poison message" (bad data), don't let it block the line. Route it to a DLQ for manual inspection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Storms:&lt;/strong&gt; Sudden bursts of retries can act like a self-inflicted DDoS attack. Use &lt;strong&gt;Rate Limiting&lt;/strong&gt; (Token Buckets) at the edge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hot Partitions:&lt;/strong&gt; If all your events use the same ID (e.g., "User_1"), one server gets crushed while others sit idle. &lt;strong&gt;Hash your partition keys&lt;/strong&gt; to spread the load.&lt;/li&gt;
&lt;/ul&gt;
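&lt;p&gt;A token bucket is easy to sketch. The capacity and refill rate below are illustrative; a production limiter would sit at the ingress, not inside the handler:&lt;/p&gt;

```python
import time

# Minimal token-bucket limiter: a burst drains the bucket, which then
# refills at a steady rate.
class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if int(self.tokens) == 0:
            return False         # over budget: shed or queue the event
        self.tokens -= 1.0
        return True

bucket = TokenBucket(capacity=3, refill_per_sec=1)
results = [bucket.allow() for _ in range(5)]
print(results)  # the burst of 5 drains the bucket: first 3 pass, rest are shed
```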




&lt;h2&gt;
  
  
  5. Performance: Batching and Backpressure
&lt;/h2&gt;

&lt;p&gt;Performance is a three-legged stool: Latency, Throughput, and Backpressure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batching:&lt;/strong&gt; Grouping 100 events into one network call trades a little latency for massive throughput gains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit Breakers:&lt;/strong&gt; If a downstream LLM provider is timing out, the circuit breaker "trips." The agent immediately fails-fast with a graceful message rather than making users wait 30 seconds for a timeout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-warming:&lt;/strong&gt; For serverless agents (like Azure Functions), use "Premium" plans or "Always-on" instances to avoid &lt;strong&gt;Cold Start&lt;/strong&gt; latency during critical paths.&lt;/li&gt;
&lt;/ul&gt;
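&lt;p&gt;The circuit-breaker idea fits in a short sketch. This is a bare-bones version with illustrative thresholds; &lt;code&gt;flaky_llm_call&lt;/code&gt; is a hypothetical stand-in for a timing-out provider:&lt;/p&gt;

```python
import time

# Bare-bones circuit breaker: after max_failures, trip open and fail fast
# until reset_after seconds have passed, then allow a trial call.
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            remaining = max(0.0, self.reset_after - (time.monotonic() - self.opened_at))
            if remaining:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures == self.max_failures:
                self.opened_at = time.monotonic()
                self.failures = 0
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker(max_failures=2, reset_after=30.0)

def flaky_llm_call():
    raise TimeoutError("provider timed out")

for _ in range(3):
    try:
        breaker.call(flaky_llm_call)
    except TimeoutError:
        print("call failed")             # first two attempts hit the provider
    except RuntimeError as err:
        print(err)                       # third attempt fails fast instead
```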




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Design for the "Bad Day":&lt;/strong&gt; Assume events will be duplicated, out of order, or delayed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency is King:&lt;/strong&gt; Every action an agent takes should be safe to repeat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the Right Tool:&lt;/strong&gt; Kafka for history, NATS for speed, Cloud-native buses for ease of integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observe and Partition:&lt;/strong&gt; Keep your "junk drawer" clean by using well-defined topic schemas.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;What is the difference between a Saga and a distributed transaction (2PC)?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Answer:&lt;/em&gt; 2PC locks resources and can hinder throughput; Sagas use asynchronous local transactions and compensating actions for better scalability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does the Outbox Pattern ensure atomicity?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Answer:&lt;/em&gt; It uses a single database transaction to commit both the record update and the event message, ensuring they either both succeed or both fail.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Explain "Effectively Once" processing.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Answer:&lt;/em&gt; It is the combination of "At Least Once" delivery and an idempotent consumer that filters out duplicates using unique keys.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>automation</category>
    </item>
    <item>
      <title>Foundations of Event-Driven Agentic Systems: From Chatbots to Proactive Teammates</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Sun, 12 Apr 2026 10:42:30 +0000</pubDate>
      <link>https://forem.com/mayankcse/foundations-of-event-driven-agentic-systems-from-chatbots-to-proactive-teammates-2ji2</link>
      <guid>https://forem.com/mayankcse/foundations-of-event-driven-agentic-systems-from-chatbots-to-proactive-teammates-2ji2</guid>
      <description>&lt;p&gt;In the world of Generative AI, we often think of "agents" as sophisticated chatbots waiting for a user to type a prompt. But in a production environment, the world doesn't wait for a prompt. Systems are constantly whispering (or shouting) through a stream of data: "Payment declined," "Sensor spike detected," "Order shipped."&lt;/p&gt;

&lt;p&gt;To build AI that actually &lt;em&gt;works&lt;/em&gt; in the real world, we have to move away from request-response loops and toward &lt;strong&gt;Event-Driven Agentic Architecture&lt;/strong&gt;. In this post, we’ll explore how to build agents that don't just answer questions, but react to the heartbeat of your business in real-time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: The Latency &amp;amp; Context Gap
&lt;/h2&gt;

&lt;p&gt;Traditional AI applications suffer from two main issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;High Latency:&lt;/strong&gt; Users won't wait 10 seconds for an agent to "think" while a process hangs.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Stale Context:&lt;/strong&gt; If an agent isn't fed real-time data, it makes decisions based on yesterday’s news.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a customer’s payment fails, they expect an immediate notification or a retry. If an agent has to wait for a manual trigger to check the logs, the "magic" of AI evaporates. We need systems that &lt;strong&gt;push&lt;/strong&gt; context to agents the moment it exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Concepts: The Language of Events
&lt;/h2&gt;

&lt;p&gt;Before building, we must define the vocabulary of an event-driven world. These aren't just synonyms; they have specific technical implications.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Event&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An immutable record of the past.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;UserLoggedIn&lt;/code&gt;, &lt;code&gt;SnoozeClicked&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Command&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An instruction to perform an action.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SendEmail&lt;/code&gt;, &lt;code&gt;ProcessRefund&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An event worth keeping forever for audit/memory.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Order_123_Shipped&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stream&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An append-only sequence of events.&lt;/td&gt;
&lt;td&gt;A Kafka topic or AWS Kinesis stream.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Saga&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A coordinator for long-running workflows.&lt;/td&gt;
&lt;td&gt;Managing a booking that spans 3 services.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
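&lt;p&gt;The event/command distinction is easy to encode in types. A small sketch, with the class and field names invented for illustration:&lt;/p&gt;

```python
import time
from dataclasses import dataclass, field

# Events record the past, so we freeze them; commands request an action
# and may be amended before dispatch.
@dataclass(frozen=True)
class Event:
    name: str            # e.g. "UserLoggedIn"
    payload: dict
    occurred_at: float = field(default_factory=time.time)

@dataclass
class Command:
    name: str            # e.g. "SendEmail"
    payload: dict

evt = Event("UserLoggedIn", {"user_id": "U9921"})
cmd = Command("SendEmail", {"to": "U9921"})
print(evt.name, cmd.name)
```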

&lt;h3&gt;
  
  
  The "Saga" Pattern
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Saga&lt;/strong&gt; is critical for agents. If an agent issues a &lt;code&gt;RefundCommand&lt;/code&gt; but the refund service is down, the Saga ensures a &lt;strong&gt;compensating action&lt;/strong&gt; occurs (like alerting a human or retrying with a different gateway) to keep the system consistent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dive: System Architecture
&lt;/h2&gt;

&lt;p&gt;An Event-Driven Agentic system functions like a high-speed nervous system. Instead of the agent polling a database, the database (or service) emits a signal that "wakes up" the agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Messaging Patterns
&lt;/h3&gt;

&lt;p&gt;How do these signals reach our agents?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Webhooks:&lt;/strong&gt; The "doorbell." A third party (like Stripe) pings your URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pub/Sub (Publish/Subscribe):&lt;/strong&gt; The "bulletin board." One event (e.g., &lt;code&gt;NewPurchase&lt;/code&gt;) is broadcast to multiple agents—one for fraud detection, one for inventory, and one for a personalized thank-you note.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDC (Change Data Capture):&lt;/strong&gt; The "security camera." Every tiny update in your SQL or NoSQL database is turned into a stream of events for the agent to watch.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Code Example: Building a Reactive Agent
&lt;/h2&gt;

&lt;p&gt;Let's look at a Python-based example of simple event-driven logic, where an agent reacts to a &lt;code&gt;payment_failed&lt;/code&gt; event.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Simulated Event from a Message Queue (like RabbitMQ or NATS)
&lt;/span&gt;&lt;span class="n"&gt;event_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;U9921&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insufficient_funds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;49.99&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgenticSystem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;etype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;etype&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_recovery_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_recovery_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- Agent Analysis Starting ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Step 1: Fact Gathering (Context)
&lt;/span&gt;        &lt;span class="c1"&gt;# In a real system, the agent might query a RAG store here
&lt;/span&gt;        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed a payment of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 2: Agent Action (Command)
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action: Issuing &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Offer_Alternative_Payment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; command to User &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 3: Emit new Event
&lt;/span&gt;        &lt;span class="n"&gt;new_event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_flow_initiated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Running the system
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgenticSystem&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handle_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-World Applications
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;RAG Refresh:&lt;/strong&gt; Instead of manually re-indexing your documents every night, an agent listens to your GitHub or Notion webhooks. The moment you save a doc, the agent updates your Vector Database.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Commerce Fraud:&lt;/strong&gt; Agents act as "store detectives," monitoring IP address spikes or rapid-fire purchases to freeze accounts before the money leaves the building.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ops Runbooks:&lt;/strong&gt; When a server's disk hits 90%, an event triggers an agent to clear temp files, log the action, and summarize the incident for the dev team.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Best Practices &amp;amp; Pitfalls
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency is King:&lt;/strong&gt; Agents might receive the same event twice (network hiccups). Ensure that running the same event twice doesn't charge the customer twice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Loop" Trap:&lt;/strong&gt; Be careful. An agent's action could trigger an event that triggers the same agent. Use "Guardrails" to prevent infinite AI loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify Signatures:&lt;/strong&gt; If you're using webhooks, always verify the cryptographic signature. Don't let unauthorized "doorbells" trigger your expensive AI workflows.&lt;/li&gt;
&lt;/ul&gt;
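&lt;p&gt;Signature verification is a few lines of standard-library code. The secret and scheme below are illustrative; real providers such as Stripe and GitHub each document their own signing header and format:&lt;/p&gt;

```python
import hashlib
import hmac

# Verify a webhook body's HMAC-SHA256 signature before acting on it.
SECRET = b"whsec_demo_secret"  # illustrative; load from config in practice

def sign(body):
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def handle_webhook(body, signature_header):
    expected = sign(body)
    # compare_digest avoids timing attacks on the comparison itself
    if not hmac.compare_digest(expected, signature_header):
        return "rejected"
    return "accepted"          # safe to trigger the expensive AI workflow

body = b'{"event_type": "payment_failed"}'
print(handle_webhook(body, sign(body)))          # accepted
print(handle_webhook(body, "forged-signature"))  # rejected
```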




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Events are History:&lt;/strong&gt; They are immutable and tell us what happened.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sagas provide Safety:&lt;/strong&gt; They handle failures in multi-step agent workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push over Pull:&lt;/strong&gt; Use Webhooks or Pub/Sub to reduce latency and keep agents "live."&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;What is the difference between an Event and a Command in an agentic system?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does Change Data Capture (CDC) help in maintaining a Retrieval-Augmented Generation (RAG) system?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Explain the concept of a 'Compensating Action' within a Saga.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Why is MQTT preferred over HTTP for IoT-based agents?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>eventdriven</category>
      <category>agents</category>
    </item>
    <item>
      <title>🚀 Swiggy System Design: How a Food Delivery Giant Scales to Millions</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Sat, 31 Jan 2026 08:46:32 +0000</pubDate>
      <link>https://forem.com/mayankcse/swiggy-system-design-how-a-food-delivery-giant-scales-to-millions-2mlb</link>
      <guid>https://forem.com/mayankcse/swiggy-system-design-how-a-food-delivery-giant-scales-to-millions-2mlb</guid>
      <description>&lt;p&gt;Food delivery apps like &lt;strong&gt;Swiggy&lt;/strong&gt; look deceptively simple on the surface—search, order, track, eat 😄&lt;br&gt;
But behind the scenes, they operate one of the &lt;strong&gt;most complex real-time distributed systems&lt;/strong&gt; in production today.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll break down &lt;strong&gt;Swiggy’s system design&lt;/strong&gt; step by step—from requirements to APIs and high-level architecture—based on this excellent walkthrough:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Design Video (Reference)&lt;/strong&gt;&lt;br&gt;
👉 &lt;a href="https://youtu.be/xQnY-DDhEBw" rel="noopener noreferrer"&gt;https://youtu.be/xQnY-DDhEBw&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;

&lt;p&gt;Design a &lt;strong&gt;food delivery platform&lt;/strong&gt; similar to &lt;strong&gt;Swiggy / Zomato / Uber Eats&lt;/strong&gt; that allows users to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discover nearby restaurants&lt;/li&gt;
&lt;li&gt;Place orders&lt;/li&gt;
&lt;li&gt;Make payments&lt;/li&gt;
&lt;li&gt;Track delivery partners in real time&lt;/li&gt;
&lt;li&gt;Receive notifications at every stage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system must scale to &lt;strong&gt;millions of users and restaurants&lt;/strong&gt; while remaining fast and reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Functional Requirements
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqocr6argzf3gi484p88m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqocr6argzf3gi484p88m.png" alt="Swiggy-Zomato-System-Design-Functional-Requirement" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  User Side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;User registration &amp;amp; authentication&lt;/li&gt;
&lt;li&gt;Profile management and order history&lt;/li&gt;
&lt;li&gt;Location-based restaurant discovery&lt;/li&gt;
&lt;li&gt;Search restaurants by name and menu&lt;/li&gt;
&lt;li&gt;View dynamic menus (price, availability, images)&lt;/li&gt;
&lt;li&gt;Add multiple items to cart&lt;/li&gt;
&lt;li&gt;Secure payments&lt;/li&gt;
&lt;li&gt;Real-time order tracking&lt;/li&gt;
&lt;li&gt;Notifications for every order state&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Restaurant Side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Accept or reject orders&lt;/li&gt;
&lt;li&gt;Update menu availability&lt;/li&gt;
&lt;li&gt;Manage incoming orders&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Delivery Partner Side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Driver discovery &amp;amp; assignment&lt;/li&gt;
&lt;li&gt;Location updates&lt;/li&gt;
&lt;li&gt;Optimized delivery routing&lt;/li&gt;
&lt;li&gt;ETA calculation&lt;/li&gt;
&lt;/ul&gt;
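&lt;p&gt;As a rough sketch of ETA calculation, we can combine the haversine distance with an assumed average rider speed. The coordinates and the 20 km/h figure are illustrative assumptions, not Swiggy's actual routing model:&lt;/p&gt;

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres.
    r = 6371.0  # Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def eta_minutes(distance_km, avg_speed_kmph=20.0):
    # Straight-line distance over an assumed average speed; real systems
    # use road networks and live traffic instead.
    return distance_km / avg_speed_kmph * 60.0

# Approximate coordinates: Koramangala restaurant to Indiranagar customer
d = haversine_km(12.9352, 77.6245, 12.9719, 77.6412)
print(round(d, 1), "km, about", round(eta_minutes(d)), "min")
```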




&lt;h2&gt;
  
  
  Step 2: Non-Functional Requirements
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx8daxspcrv9yto5smtx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx8daxspcrv9yto5smtx.png" alt="Swiggy-Zomato-System-Design-Non-Functional-Requirement" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Expectation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50M users, ~1M restaurants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Search &amp;amp; discovery must always work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Payments &amp;amp; inventory must be accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low latency for search &amp;amp; tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fault-tolerant order flow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Step 3: API Design (High Level)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmlwzhulqqycl89m7cb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmlwzhulqqycl89m7cb8.png" alt="Swiggy-Zomato-API-Design" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication &amp;amp; User
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /api/v1/auth/register
POST /api/v1/auth/login
GET  /api/v1/users/me
POST /api/v1/users/location
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Restaurants &amp;amp; Search
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /api/v1/restaurants/nearby
GET /api/v1/restaurants/{restaurantId}
GET /api/v1/restaurants/search
GET /api/v1/restaurants/{restaurantId}/menu
GET /api/v1/menu/search
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cart &amp;amp; Orders
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST   /api/v1/cart/items
DELETE /api/v1/cart/items/{itemId}
POST   /api/v1/orders
GET    /api/v1/orders/{orderId}/tracking
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Delivery &amp;amp; Tracking
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST  /api/v1/delivery/assign
PATCH /api/v1/delivery/orders/{orderId}/accept
POST  /api/v1/delivery/location
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Notifications
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /api/v1/notifications/send
GET  /api/v1/notifications
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4: High-Level Design (HLD)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Entry Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Load Balancer&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;API Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authentication &amp;amp; Authorization&lt;/li&gt;
&lt;li&gt;Rate limiting&lt;/li&gt;
&lt;li&gt;Request routing&lt;/li&gt;
&lt;li&gt;Load balancing (Round Robin)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
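&lt;p&gt;The gateway's round-robin routing can be sketched in a few lines. This is a minimal illustration, not a real gateway implementation, and the backend names are made up:&lt;/p&gt;

```python
from itertools import cycle

# Minimal round-robin load balancer sketch.
# Backend names are purely illustrative.
class RoundRobinBalancer:
    def __init__(self, backends):
        self._cycle = cycle(backends)  # endless iterator over backends

    def next_backend(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["order-svc-1", "order-svc-2", "order-svc-3"])
first_four = [lb.next_backend() for _ in range(4)]
print(first_four)
```

&lt;p&gt;After one full pass, the fourth request wraps back to the first backend, which is the defining property of round robin.&lt;/p&gt;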




&lt;h3&gt;
  
  
  Microservices Architecture
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auth, profile, location&lt;/td&gt;
&lt;td&gt;User DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Search Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nearby restaurants, filtering&lt;/td&gt;
&lt;td&gt;Restaurant DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cart Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cart operations&lt;/td&gt;
&lt;td&gt;Cart DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Order Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Order lifecycle&lt;/td&gt;
&lt;td&gt;Order DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Payment Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Payment processing&lt;/td&gt;
&lt;td&gt;External PG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delivery Matching Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Driver assignment&lt;/td&gt;
&lt;td&gt;Driver DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Location Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time driver tracking&lt;/td&gt;
&lt;td&gt;Geo Store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Notification Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Push &amp;amp; in-app alerts&lt;/td&gt;
&lt;td&gt;Event Store&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4lt9152za8qt4337g49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4lt9152za8qt4337g49.png" alt="Swiggy-Zomato-High-Level-Design" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-Time Delivery Tracking (Key Insight)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Drivers continuously send GPS updates&lt;/li&gt;
&lt;li&gt;Location Service processes updates&lt;/li&gt;
&lt;li&gt;Users poll or subscribe via WebSockets&lt;/li&gt;
&lt;li&gt;ETA recalculated dynamically&lt;/li&gt;
&lt;li&gt;Notifications triggered on status changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where &lt;strong&gt;event-driven architecture&lt;/strong&gt; and &lt;strong&gt;async messaging&lt;/strong&gt; (Kafka / SQS / PubSub) shine.&lt;/p&gt;
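&lt;p&gt;The tracking flow above can be sketched as a tiny in-memory publish/subscribe loop. All names here (&lt;code&gt;LocationService&lt;/code&gt;, &lt;code&gt;haversine_km&lt;/code&gt;, the assumed average speed) are illustrative; a production system would publish updates through Kafka, SQS, or Pub/Sub rather than direct callbacks:&lt;/p&gt;

```python
import math
from collections import defaultdict

AVG_SPEED_KMPH = 25  # assumed average driver speed for the ETA estimate

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

class LocationService:
    def __init__(self):
        self.positions = {}                   # driver_id -> latest (lat, lon)
        self.subscribers = defaultdict(list)  # driver_id -> user callbacks

    def subscribe(self, driver_id, callback):
        self.subscribers[driver_id].append(callback)

    def publish_update(self, driver_id, lat, lon, dest):
        """Driver GPS ping: store position, recompute ETA, notify subscribers."""
        self.positions[driver_id] = (lat, lon)
        eta_min = haversine_km(lat, lon, dest[0], dest[1]) / AVG_SPEED_KMPH * 60
        for cb in self.subscribers[driver_id]:
            cb(driver_id, eta_min)

svc = LocationService()
etas = []
svc.subscribe("driver-42", lambda d, eta: etas.append(round(eta, 1)))
svc.publish_update("driver-42", 12.9716, 77.5946, dest=(12.9352, 77.6245))
```

&lt;p&gt;Each GPS update fans out to every subscribed client with a freshly recomputed ETA, which is exactly the behaviour a message broker provides at scale.&lt;/p&gt;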




&lt;h2&gt;
  
  
  Design Trade-offs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search = High Availability&lt;/strong&gt;
Even stale data is acceptable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payments = Strong Consistency&lt;/strong&gt;
No double charges, no missing orders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracking = Eventual Consistency&lt;/strong&gt;
Minor delays are acceptable&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Designing Swiggy isn’t about cramming features—it’s about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fault isolation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency optimization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Correctness where it matters&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re preparing for &lt;strong&gt;system design interviews&lt;/strong&gt; or building &lt;strong&gt;large-scale distributed systems&lt;/strong&gt;, this architecture is a goldmine.&lt;/p&gt;

&lt;p&gt;Don’t forget to check out the original video:&lt;br&gt;
👉 &lt;a href="https://youtu.be/xQnY-DDhEBw" rel="noopener noreferrer"&gt;https://youtu.be/xQnY-DDhEBw&lt;/a&gt;&lt;/p&gt;




</description>
      <category>webdev</category>
      <category>programming</category>
      <category>distributedsystems</category>
      <category>ai</category>
    </item>
    <item>
      <title>Clustering News Articles for Topic Detection: A Technical Deep Dive</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Sun, 22 Jun 2025 11:22:31 +0000</pubDate>
      <link>https://forem.com/mayankcse/clustering-news-articles-for-topic-detection-a-technical-deep-dive-2692</link>
      <guid>https://forem.com/mayankcse/clustering-news-articles-for-topic-detection-a-technical-deep-dive-2692</guid>
      <description>&lt;p&gt;With the explosive growth of digital journalism, news readers and analysts often find themselves overwhelmed by an avalanche of information from numerous sources. Imagine a journalist trying to keep up with evolving stories across platforms like Times of India, CNN, and BBC, where the same events are covered from different angles and styles. This creates a dire need for systems that can &lt;em&gt;automatically detect and group related news stories&lt;/em&gt; — a challenge the research paper tackles head-on using clustering-based topic detection techniques.&lt;/p&gt;

&lt;p&gt;In this blog, we break down their methodology, discuss alternative approaches, and explain why &lt;strong&gt;agglomerative hierarchical clustering&lt;/strong&gt; was chosen as the foundation for the topic detection system.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Problem: Making Sense of News Floods
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 What Is Topic Detection?
&lt;/h3&gt;

&lt;p&gt;Topic Detection is the unsupervised process of identifying distinct subjects or themes within a collection of text — here, news articles. The aim is to detect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;New topics&lt;/strong&gt; (e.g., breaking news)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subsequent articles&lt;/strong&gt; covering those topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships between different articles&lt;/strong&gt; on the same event&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables systems to identify story boundaries and link news content semantically, even when the articles come from different publishers or regions.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Why Is It Important?
&lt;/h3&gt;

&lt;p&gt;A few applications include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;News aggregators&lt;/strong&gt; like Google News wanting to show “related stories”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Media analysts&lt;/strong&gt; tracking how stories evolve&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprises&lt;/strong&gt; monitoring press mentions of their competitors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governments&lt;/strong&gt; watching for sudden geopolitical shifts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Available Methods for Topic Detection
&lt;/h2&gt;

&lt;p&gt;Before zooming into the chosen method, let’s look at the landscape of available techniques for detecting topics in unstructured text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8ffpncqx9493p973uyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8ffpncqx9493p973uyx.png" alt="Topic-Detection-Methods" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Rule-based and Heuristic Methods
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use keyword matching, regex rules, and metadata (tags, categories)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drawback&lt;/strong&gt;: Brittle and inflexible to language evolution or phrasing variations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 Supervised Learning Approaches
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use labeled datasets to train classifiers (e.g., SVM, Naïve Bayes, Decision Trees)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drawback&lt;/strong&gt;: Need labeled examples for each topic; fails with unseen events&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.3 Deep Learning Methods
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Models like lda2vec, BERTopic, or LSTM-based classifiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strength&lt;/strong&gt;: Capture contextual semantics well&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drawback&lt;/strong&gt;: Computationally expensive, harder to interpret, and require large training sets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.4 Clustering Techniques (Chosen by the Researchers)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unsupervised&lt;/strong&gt;: No labeled data required&lt;/li&gt;
&lt;li&gt;Finds naturally occurring groupings in text based on similarity&lt;/li&gt;
&lt;li&gt;Suitable when new, unknown topics may emerge dynamically&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Why Agglomerative Hierarchical Clustering?
&lt;/h2&gt;

&lt;p&gt;The researchers specifically opted for &lt;strong&gt;Agglomerative Hierarchical Clustering (AHC)&lt;/strong&gt; with &lt;strong&gt;average linkage&lt;/strong&gt;, due to the following reasons:&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 No Need for Predefined Cluster Count
&lt;/h3&gt;

&lt;p&gt;Unlike K-means (which requires specifying k in advance), AHC builds a &lt;strong&gt;tree of clusters (dendrogram)&lt;/strong&gt; from the bottom up—each document starts in its own cluster, and clusters are merged based on similarity.&lt;/p&gt;

&lt;p&gt;This is ideal for unpredictable, real-world news data where the number of topics is not known beforehand.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Handles Multi-topic Overlaps and Duplicates
&lt;/h3&gt;

&lt;p&gt;The dataset contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different articles covering the same event (with different styles)&lt;/li&gt;
&lt;li&gt;Near-duplicates from press agencies re-used by various outlets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AHC, with &lt;strong&gt;average linkage&lt;/strong&gt;, balances between complete and single linkage to handle such redundancies and overlaps effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Outlier Robustness
&lt;/h3&gt;

&lt;p&gt;Using &lt;strong&gt;average distance&lt;/strong&gt; (rather than minimum or maximum) mitigates sensitivity to noisy or outlier articles—important for large, heterogeneous news datasets.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Preprocessing Pipeline
&lt;/h2&gt;

&lt;p&gt;Before clustering, textual data undergoes a series of NLP preprocessing steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 Tokenization
&lt;/h3&gt;

&lt;p&gt;Splits text into individual words (tokens) for processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
Input: "Text mining extracts useful information."&lt;br&gt;
Output: [Text, mining, extracts, useful, information]&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Stopword Removal
&lt;/h3&gt;

&lt;p&gt;Eliminates common but uninformative words like &lt;em&gt;the&lt;/em&gt;, &lt;em&gt;is&lt;/em&gt;, &lt;em&gt;and&lt;/em&gt;, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Stemming
&lt;/h3&gt;

&lt;p&gt;Reduces words to their root form for better matching.&lt;br&gt;
&lt;strong&gt;Example&lt;/strong&gt;: walking, walks, walked → walk&lt;/p&gt;

&lt;p&gt;This reduces vocabulary sparsity and improves similarity calculations.&lt;/p&gt;
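&lt;p&gt;The three preprocessing steps can be chained as a small sketch. The stopword set and suffix rules below are deliberately tiny stand-ins for real components such as NLTK's stopword list and the Porter stemmer:&lt;/p&gt;

```python
import re

# Tiny illustrative stopword list (a real pipeline would use a full one).
STOPWORDS = {"the", "is", "and", "a", "an", "of", "for", "to", "in"}

def tokenize(text):
    # Lowercase and split into alphabetic tokens.
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # Naive suffix stripping, a stand-in for a Porter-style stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("Text mining extracts useful information."))
```

&lt;p&gt;With these toy rules, &lt;em&gt;walking&lt;/em&gt;, &lt;em&gt;walks&lt;/em&gt;, and &lt;em&gt;walked&lt;/em&gt; all reduce to &lt;em&gt;walk&lt;/em&gt;, matching the example above.&lt;/p&gt;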




&lt;h2&gt;
  
  
  5. Similarity Calculation Using TF-IDF + Cosine Distance
&lt;/h2&gt;

&lt;p&gt;Each news article is vectorized using &lt;strong&gt;TF-IDF (Term Frequency – Inverse Document Frequency)&lt;/strong&gt;, which emphasizes terms that are important within a document but rare across documents.&lt;/p&gt;

&lt;p&gt;Then, &lt;strong&gt;Cosine Similarity&lt;/strong&gt; is used to measure document closeness:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vj91vioazva4k5ausrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vj91vioazva4k5ausrh.png" alt="Cosine Similarity" width="614" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This ensures that similarity is based on direction (not magnitude) of the document vectors—ideal when documents vary in length.&lt;/p&gt;
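&lt;p&gt;A from-scratch sketch of this step follows, using a three-document toy corpus. In practice scikit-learn's &lt;code&gt;TfidfVectorizer&lt;/code&gt; and &lt;code&gt;cosine_similarity&lt;/code&gt; would be used instead:&lt;/p&gt;

```python
import math
from collections import Counter

docs = [
    "india wins cricket world cup final",
    "cricket world cup final thriller in india",
    "stock markets rally after budget announcement",
]

def tfidf_vectors(corpus):
    tokenized = [doc.split() for doc in corpus]
    n = len(tokenized)
    # Document frequency: how many documents each term appears in.
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # TF-IDF weight: term frequency scaled by inverse document frequency.
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def norm(v):
    return math.sqrt(sum(x * x for x in v.values()))

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    denom = norm(u) * norm(v)
    return dot / denom if denom else 0.0

vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # two cricket stories
sim_unrelated = cosine(vecs[0], vecs[2])  # cricket vs. markets
```

&lt;p&gt;The two cricket articles score much higher than the cricket/markets pair, which share no terms at all.&lt;/p&gt;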




&lt;h2&gt;
  
  
  6. Algorithm: Agglomerative Hierarchical Clustering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Treat each document as its own cluster.&lt;/li&gt;
&lt;li&gt;Calculate pairwise distances between all clusters.&lt;/li&gt;
&lt;li&gt;Merge the two closest clusters using average linkage:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzbm3v8w5fmyieue2wy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzbm3v8w5fmyieue2wy2.png" alt="Agglomerative clustering" width="632" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Repeat until one global cluster remains (or stop early based on threshold).&lt;/li&gt;
&lt;/ol&gt;
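&lt;p&gt;The steps above can be sketched over a toy pairwise-distance matrix. This is an illustrative implementation with an early-stop threshold, not the paper's code; documents 0 and 1 are near-duplicates, while document 2 is an outlier:&lt;/p&gt;

```python
from itertools import combinations

def average_linkage(c1, c2, dist):
    # Average of all pairwise distances between the two clusters.
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

def ahc(dist, threshold):
    # Start with every document in its own cluster.
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        pairs = [(average_linkage(clusters[ai], clusters[bi], dist), ai, bi)
                 for ai, bi in combinations(range(len(clusters)), 2)]
        d, ai, bi = min(pairs)
        if d > threshold:   # early stop: no remaining pair is close enough
            break
        clusters[ai] = clusters[ai] + clusters[bi]
        del clusters[bi]    # bi is always the later index, so ai stays valid
    return clusters

dist = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
print(ahc(dist, threshold=0.5))
```

&lt;p&gt;With the threshold at 0.5 the near-duplicates merge and the outlier stays apart; raising the threshold lets all three collapse into one global cluster, tracing out the dendrogram from bottom to top.&lt;/p&gt;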




&lt;h2&gt;
  
  
  7. Practical Example
&lt;/h2&gt;

&lt;p&gt;Consider a paragraph like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Congratulations! You are selected for the interview. You can visit our office after 11 AM."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokenizes and stems the sentences.&lt;/li&gt;
&lt;li&gt;Computes word probabilities (unigram, bigram).&lt;/li&gt;
&lt;li&gt;Assigns the paragraph a &lt;strong&gt;label (topic)&lt;/strong&gt; by checking the dominance of topic scores among predefined categories like &lt;em&gt;educational&lt;/em&gt;, &lt;em&gt;entertainment&lt;/em&gt;, &lt;em&gt;personal&lt;/em&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For classification, &lt;strong&gt;Hidden Markov Models (HMM)&lt;/strong&gt; are used to label sequences of statements in the paragraph and associate the whole paragraph to the most likely category.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Evaluation and Future Scope
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 Initial Focus
&lt;/h3&gt;

&lt;p&gt;The paper proposes initial experiments on news from the &lt;strong&gt;sports domain&lt;/strong&gt;, with plans to extend to &lt;strong&gt;politics&lt;/strong&gt;, &lt;strong&gt;education&lt;/strong&gt;, and &lt;strong&gt;entertainment&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.2 Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No formal evaluation metrics (e.g., Precision, Recall) are presented&lt;/li&gt;
&lt;li&gt;Scalability to real-time streams or multilingual content is not addressed&lt;/li&gt;
&lt;li&gt;The use of HMM for classification could be modernized with transformer-based models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8.3 Future Enhancements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;strong&gt;Topic Tracking&lt;/strong&gt; (supervised component) to monitor evolving topics&lt;/li&gt;
&lt;li&gt;Integrate &lt;strong&gt;Named Entity Recognition (NER)&lt;/strong&gt; for enhanced similarity&lt;/li&gt;
&lt;li&gt;Experiment with &lt;strong&gt;semantic vector models&lt;/strong&gt; (e.g., Word2Vec, BERT)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The paper presents a well-structured and computationally reasonable approach to the complex problem of topic detection from news articles. By leveraging &lt;strong&gt;Agglomerative Hierarchical Clustering&lt;/strong&gt; with &lt;strong&gt;TF-IDF-based cosine similarity&lt;/strong&gt;, the researchers offer a robust framework for discovering story boundaries and organizing large-scale news data without needing manual labels.&lt;/p&gt;

&lt;p&gt;For practitioners, the key takeaway is this: &lt;strong&gt;when dealing with dynamic, unlabeled news data, hierarchical clustering remains a practical, explainable, and extensible foundation.&lt;/strong&gt;&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Evaluating Google Gemini for Document OCR Using Hugging Face Invoice Dataset</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Thu, 19 Jun 2025 13:40:21 +0000</pubDate>
      <link>https://forem.com/mayankcse/evaluating-google-gemini-for-document-ocr-using-hugging-face-invoice-dataset-567i</link>
      <guid>https://forem.com/mayankcse/evaluating-google-gemini-for-document-ocr-using-hugging-face-invoice-dataset-567i</guid>
      <description>&lt;p&gt;In the digital age, invoices are the lifeblood of businesses, but processing them manually can be a monumental task, prone to errors and inefficiency. This is where Optical Character Recognition (OCR) shines, transforming scanned documents into structured, usable data. With the rise of advanced AI models like Google's Gemini, the promise of highly accurate and intelligent OCR has never been closer.&lt;/p&gt;

&lt;p&gt;But how well does Gemini actually perform on real-world documents like invoices? And how can we systematically evaluate its accuracy? This blog post dives into just that, demonstrating a practical approach to benchmark Gemini's OCR capabilities using the widely accessible Hugging Face &lt;code&gt;invoices-donut-data-v1&lt;/code&gt; dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge of Invoice OCR: More Than Just Reading Text
&lt;/h3&gt;

&lt;p&gt;Imagine an invoice. It's not just a block of text; it contains crucial, structured information: invoice numbers, dates, vendor details, line items with descriptions, quantities, and prices, and of course, the grand total. A truly effective OCR solution for invoices needs to do more than just extract raw text; it needs to understand the &lt;em&gt;meaning&lt;/em&gt; of that text within the document's context, identify these specific fields, and present them in a structured format, typically JSON.&lt;/p&gt;

&lt;p&gt;Traditional OCR might give you a jumbled string of all the words on the page. Advanced, intelligent OCR, like what Gemini aims to provide, should be able to tell you, "This is the invoice number," "This is the total amount," and so on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our Battlefield: The Hugging Face &lt;code&gt;invoices-donut-data-v1&lt;/code&gt; Dataset
&lt;/h3&gt;

&lt;p&gt;For our evaluation, we turn to a fantastic resource: the &lt;code&gt;katanaml-org/invoices-donut-data-v1&lt;/code&gt; dataset available on Hugging Face. This dataset is specifically designed for document understanding tasks, offering a collection of invoice images paired with their "ground truth" – the perfect, manually extracted JSON representation of the invoice data. This "ground truth" is our gold standard against which we'll compare Gemini's output.&lt;/p&gt;

&lt;p&gt;Each sample in this dataset provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;code&gt;image&lt;/code&gt;: The invoice document itself.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ground_truth&lt;/code&gt;: A JSON string containing the accurately extracted fields, often with a nested &lt;code&gt;gt_parse&lt;/code&gt; key holding the structured data we care about.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Gemini Advantage: Multimodal Power for Document Understanding
&lt;/h3&gt;

&lt;p&gt;Gemini models, especially versions like Gemini 1.5 Pro and Flash, are inherently multimodal. This means they can process and understand information from various modalities simultaneously – text, images, and even audio or video. For OCR, this is a game-changer. Instead of just "seeing" pixels, Gemini can leverage its understanding of visual layout, textual patterns, and even common invoice structures to more accurately extract and interpret information.&lt;/p&gt;

&lt;p&gt;While the exact API call for Gemini's specialized document parsing might vary, the core principle remains: you send an image, and you receive a structured response. For this demonstration, we'll assume an API endpoint (&lt;code&gt;API_URL&lt;/code&gt;) that takes an image and returns a JSON object containing the OCR'd data. Your &lt;code&gt;API_KEY&lt;/code&gt; will, of course, be required for authentication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up the Evaluation Pipeline (Code Walkthrough)
&lt;/h3&gt;

&lt;p&gt;Let's break down the Python code used for this evaluation.&lt;/p&gt;

&lt;p&gt;First, we install necessary libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="n"&gt;fsspec&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt; &lt;span class="n"&gt;jiwer&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lfs&lt;/span&gt; &lt;span class="c1"&gt;# For potential git large file storage needs, though not strictly required for this dataset
&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;lfs&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;clone&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;huggingface&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;openthaigpt&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;thai&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ocr&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;evaluation&lt;/span&gt; &lt;span class="c1"&gt;# Not directly used in this script but good for context
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we load the &lt;code&gt;invoices-donut-data-v1&lt;/code&gt; dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Load dataset
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;katanaml-org/invoices-donut-data-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ground_truth_json_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Renamed to avoid shadowing
&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert PIL image to byte stream
&lt;/span&gt;    &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PNG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Prepare request for Gemini OCR API
&lt;/span&gt;    &lt;span class="c1"&gt;# The actual API call for Gemini might look different,
&lt;/span&gt;    &lt;span class="c1"&gt;# often involving `google.generativeai.GenerativeModel.generate_content`
&lt;/span&gt;    &lt;span class="c1"&gt;# and structuring your prompt to ask for structured data extraction.
&lt;/span&gt;    &lt;span class="c1"&gt;# For this example, we're simulating a generic OCR API call.
&lt;/span&gt;    &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;template&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;benchmark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# This could be a prompt for Gemini to extract invoice data
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-API-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;API_KEY&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Send to your OCR API (simulating Gemini API call)
&lt;/span&gt;    &lt;span class="c1"&gt;# In a real Gemini integration, you'd use the `google.generativeai` client
&lt;/span&gt;    &lt;span class="c1"&gt;# and craft a prompt like:
&lt;/span&gt;    &lt;span class="c1"&gt;# response = model.generate_content([image, "Extract all invoice details as a JSON object, including invoice_number, total_amount, date, and line_items with description, quantity, and price."])
&lt;/span&gt;    &lt;span class="c1"&gt;# ocr_output = response.text or response.parts[0].text if it's text-based output
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Adjust 'result' based on your actual Gemini API response structure
&lt;/span&gt;        &lt;span class="n"&gt;ocr_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ocr_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# We need a unique ID for each image, typically from the dataset itself or a generated one.
&lt;/span&gt;    &lt;span class="c1"&gt;# For simplicity, let's use the loop index or assume a unique ID field exists in `sample`.
&lt;/span&gt;    &lt;span class="c1"&gt;# As the original code didn't define image_id, let's use a simple index.
&lt;/span&gt;    &lt;span class="n"&gt;image_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ground_truth_json_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Keep as string for initial storage
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ocr_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key modification for Gemini:&lt;/strong&gt; The &lt;code&gt;requests.post&lt;/code&gt; call is a placeholder. In a real-world scenario, you would use the &lt;code&gt;google-generativeai&lt;/code&gt; library. Your prompt to Gemini would be crucial, guiding it to extract the specific invoice fields in a structured (e.g., JSON) format.&lt;/p&gt;

&lt;p&gt;For example, a conceptual Gemini integration might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.generativeai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="c1"&gt;# Configure your Gemini API key
&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gemini-pro-vision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Or 'gemini-1.5-flash', 'gemini-1.5-pro'
&lt;/span&gt;
&lt;span class="c1"&gt;# Inside your loop:
# image is a PIL Image object
# Craft a detailed prompt for invoice extraction
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract the following details from this invoice and provide them in a JSON format:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;gt_parse&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: {&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;invoice_number&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;vendor_name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;line_items&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;      {&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;unit_price&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;      }&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    ]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  }&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ensure all values are extracted as strings. If a field is not present, leave its value as an empty string.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# Gemini's response.text contains the extracted JSON string
&lt;/span&gt;    &lt;span class="n"&gt;ocr_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ocr_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error during Gemini processing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This conceptual integration highlights how Gemini's multimodal capabilities let you supply both the image and a specific instruction (the prompt) to guide its OCR and information-extraction process.&lt;/p&gt;
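&lt;p&gt;One practical wrinkle worth handling before evaluation: Gemini often wraps its JSON output in markdown code fences, which makes a naive &lt;code&gt;json.loads&lt;/code&gt; fail. A small illustrative helper (the function name is my own, not part of the original code) that strips the fences first:&lt;/p&gt;

```python
import json

def parse_model_json(text):
    """Strip optional markdown code fences before parsing model output as JSON."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (which may carry a language tag like ```json) ...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # ... and the closing fence, if present.
        if cleaned.rstrip().endswith("```"):
            cleaned = cleaned.rstrip()[:-3]
    return json.loads(cleaned)

raw = '```json\n{"gt_parse": {"invoice_number": "INV-2025-001"}}\n```'
print(parse_model_json(raw)["gt_parse"]["invoice_number"])  # INV-2025-001
```

&lt;p&gt;Routing predictions through a cleanup step like this noticeably reduces "Invalid Prediction JSON" failures during evaluation.&lt;/p&gt;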

&lt;h3&gt;
  
  
  Measuring Success: Beyond Simple Text Comparison
&lt;/h3&gt;

&lt;p&gt;Evaluating OCR for structured documents requires more than just a simple string match. We need to assess how accurately individual fields are extracted. For this, we'll use the Character Error Rate (CER) and field-level accuracy.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;jiwer&lt;/code&gt; library is excellent for calculating CER, which measures the minimum number of edits (insertions, deletions, substitutions) needed to change one string into another, divided by the length of the ground truth string. A lower CER indicates higher accuracy.&lt;/p&gt;
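&lt;p&gt;To make the metric concrete, here is what CER computes under the hood; a minimal pure-Python sketch (in practice you would simply call &lt;code&gt;jiwer.cer&lt;/code&gt;):&lt;/p&gt;

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_error_rate(ground_truth, prediction):
    return levenshtein(ground_truth, prediction) / len(ground_truth)

# "0" misread as "O": one substitution over 7 reference characters
print(char_error_rate("INV-001", "INV-O01"))  # 1/7 ≈ 0.143
```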

&lt;p&gt;We'll also calculate "accuracy" as the proportion of fields that are &lt;em&gt;exactly&lt;/em&gt; matched between the ground truth and the prediction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jiwer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections.abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Mapping&lt;/span&gt;

&lt;span class="c1"&gt;# Utility: flatten nested dicts with compound keys
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;flatten_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parent_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;new_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;parent_key&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parent_key&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Mapping&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;flatten_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;flatten_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;new_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;new_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Compare
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Load ground truth and prediction JSONs
&lt;/span&gt;        &lt;span class="n"&gt;gt_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid GT JSON in ID &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="n"&gt;pred_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;pred_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid Prediction JSON in ID &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Prediction: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Mapping&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="c1"&gt;# Ensure it's a dictionary for flattening
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prediction for ID &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is not a valid JSON object or dict. Prediction: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract nested gt_parse only for both ground truth and prediction
&lt;/span&gt;    &lt;span class="n"&gt;gt_flat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;flatten_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gt_json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gt_parse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}))&lt;/span&gt;
    &lt;span class="n"&gt;pred_flat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;flatten_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gt_parse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}))&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gt_flat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;correct_matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;total_cer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gt_flat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;gt_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gt_flat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;pred_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred_flat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Get predicted value, default to empty string if not found
&lt;/span&gt;
        &lt;span class="n"&gt;field_cer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gt_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;total_cer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;field_cer&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gt_val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pred_val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;correct_matches&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: CER=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;field_cer&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | GT=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gt_val&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | Pred=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pred_val&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;avg_cer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_cer&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_fields&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total_fields&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;correct_matches&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_fields&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total_fields&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Accuracy (Exact Match): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Avg CER: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_cer&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
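&lt;p&gt;The loop above prints per-invoice numbers, but a benchmark verdict needs corpus-level figures. One way (a sketch with hypothetical tallies; in the real loop you would accumulate &lt;code&gt;correct_matches&lt;/code&gt;, &lt;code&gt;total_fields&lt;/code&gt;, and &lt;code&gt;total_cer&lt;/code&gt; across iterations) is to weight by field count rather than averaging per-invoice averages:&lt;/p&gt;

```python
# Hypothetical per-invoice tallies: (correct_matches, total_fields, summed_field_cer)
tallies = [(8, 8, 0.00), (6, 8, 0.35), (7, 8, 0.10)]

total_correct = sum(c for c, _, _ in tallies)
total_fields = sum(n for _, n, _ in tallies)
total_cer = sum(s for _, _, s in tallies)

overall_acc = total_correct / total_fields  # field-weighted exact-match accuracy
overall_cer = total_cer / total_fields      # mean per-field CER across the corpus
print(f"Overall accuracy: {overall_acc:.2%} | Overall CER: {overall_cer:.3f}")
```

&lt;p&gt;Field-weighted aggregation keeps an invoice with many line items from counting the same as one with a single field.&lt;/p&gt;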



&lt;p&gt;&lt;strong&gt;Explanation of Evaluation Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;flatten_dict&lt;/code&gt;&lt;/strong&gt;: This helper function is crucial for comparing nested JSON structures. It converts dictionaries like &lt;code&gt;{"gt_parse": {"invoice_number": "123", "line_items": [{"description": "Item A"}]}}&lt;/code&gt; into a flat dictionary with compound keys: &lt;code&gt;{"gt_parse.invoice_number": "123", "gt_parse.line_items[0].description": "Item A"}&lt;/code&gt;. This allows for straightforward field-by-field comparison.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Character Error Rate (CER)&lt;/strong&gt;: Calculated for each field, it tells us how "close" the predicted text is to the ground truth at a character level. A CER of 0.00 means a perfect match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy (Exact Match)&lt;/strong&gt;: This metric specifically counts how many fields were extracted &lt;em&gt;perfectly&lt;/em&gt;, meaning the predicted value exactly matches the ground truth value after stripping whitespace. This is particularly important for critical fields like invoice numbers or total amounts where even a single character error can invalidate the data.&lt;/li&gt;
&lt;/ul&gt;
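&lt;p&gt;To see &lt;code&gt;flatten_dict&lt;/code&gt; in action, here is a self-contained check (the helper is restated from the code above so the snippet runs on its own):&lt;/p&gt;

```python
from collections.abc import Mapping

def flatten_dict(d, parent_key='', sep='.'):
    # Nested dicts get dotted keys, lists get indexed keys, leaves are stringified.
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, Mapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            for i, item in enumerate(v):
                items.extend(flatten_dict(item, f"{new_key}[{i}]", sep=sep).items())
        else:
            items.append((new_key, str(v)))
    return dict(items)

nested = {"gt_parse": {"invoice_number": "123",
                       "line_items": [{"description": "Item A"}]}}
print(flatten_dict(nested))
# {'gt_parse.invoice_number': '123', 'gt_parse.line_items[0].description': 'Item A'}
```

&lt;p&gt;Note that the helper assumes list elements are themselves dicts, which holds for the &lt;code&gt;line_items&lt;/code&gt; schema used here; a list of bare strings would need an extra branch.&lt;/p&gt;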

&lt;h3&gt;
  
  
  Expected Outcomes and Why This Matters
&lt;/h3&gt;

&lt;p&gt;When running this evaluation with a robust OCR model like Gemini, you would ideally observe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low Average CER&lt;/strong&gt;: Indicating that Gemini is highly accurate at recognizing individual characters and words across the invoice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Accuracy (Exact Match)&lt;/strong&gt;: Especially for key fields like &lt;code&gt;invoice_number&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, and &lt;code&gt;total_amount&lt;/code&gt;, which are critical for automated processing and downstream systems. For example, extracting exactly "INV-2025-001" when that is the ground truth yields a perfect exact match and a CER of 0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent Extraction&lt;/strong&gt;: Beyond just character accuracy, Gemini's multimodal understanding should enable it to correctly map extracted text to the right fields, even if the layout varies across invoices. For instance, correctly identifying the total amount even if it's styled differently on different invoices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's consider an example for a single invoice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ground Truth (&lt;code&gt;gt_parse&lt;/code&gt;):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"invoice_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INV-2025-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-06-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"150.75"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"line_items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Consulting Services"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Travel Expenses"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50.75"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50.75"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gemini Prediction (same schema as &lt;code&gt;gt_parse&lt;/code&gt;):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"invoice_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INV-2025-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-06-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"150.75"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"line_items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Consulting Services"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Travel Expenses"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50.75"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50.75"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this ideal scenario, all fields would have a CER of 0.00 and contribute to 100% exact match accuracy.&lt;/p&gt;

&lt;p&gt;Now consider a less ideal scenario:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini Prediction (same schema as &lt;code&gt;gt_parse&lt;/code&gt;):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"invoice_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INV-2025-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Missing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-06-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"150.75"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"line_items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Consulting Sercices"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Typo&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, "invoice_number" and "line_items[0].description" would show a non-zero CER, and would not count towards exact match accuracy. The "total_amount" and "date" fields, if correctly extracted, would still contribute to exact match accuracy and have a CER of 0.00. This granular evaluation helps pinpoint areas where the OCR model might need further refinement or where certain document layouts pose greater challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion: Unlocking Automation with Intelligent OCR
&lt;/h3&gt;

&lt;p&gt;Evaluating OCR models like Gemini against structured datasets such as &lt;code&gt;invoices-donut-data-v1&lt;/code&gt; is not just an academic exercise. It's a critical step in building robust, automated document processing workflows. By systematically measuring performance using metrics like CER and exact match accuracy, we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validate Model Performance&lt;/strong&gt;: Objectively determine how well Gemini handles invoice OCR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identify Strengths and Weaknesses&lt;/strong&gt;: Pinpoint specific fields or document variations where Gemini excels or struggles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drive Improvement&lt;/strong&gt;: Use the insights to refine prompts, fine-tune models, or implement post-processing steps to achieve even higher accuracy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ability of multimodal AI models like Gemini to not just "read" text but to "understand" documents is transformative for business automation. By rigorously testing and evaluating these capabilities, we move closer to a future where manual data entry from invoices becomes a relic of the past, freeing up human potential for more strategic and creative endeavors.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
      <category>learning</category>
    </item>
    <item>
      <title>From Volume to Persistent Volume in Kubernetes</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Mon, 19 May 2025 17:03:45 +0000</pubDate>
      <link>https://forem.com/mayankcse/from-volume-to-persistent-volume-in-kubernetes-3kog</link>
      <guid>https://forem.com/mayankcse/from-volume-to-persistent-volume-in-kubernetes-3kog</guid>
      <description>&lt;p&gt;In our &lt;a href="https://dev.to/mayankcse/managing-data-volumes-in-kubernetes-1d93"&gt;previous blog&lt;/a&gt;, we explored how Kubernetes volumes help preserve data across container restarts. We worked with &lt;code&gt;emptyDir&lt;/code&gt;, which retains data as long as the &lt;strong&gt;pod&lt;/strong&gt; is running but loses it when the pod is deleted. Then, we improved this setup using &lt;code&gt;hostPath&lt;/code&gt;, which allows a container to persist data at a specified directory on the node.&lt;/p&gt;

&lt;p&gt;This worked seamlessly in &lt;strong&gt;Minikube&lt;/strong&gt; because it runs on a &lt;strong&gt;single-node cluster&lt;/strong&gt;. But what happens in &lt;strong&gt;multi-node cloud environments&lt;/strong&gt;? The same solution will fail because Kubernetes dynamically schedules pods across multiple nodes, meaning the data stored in &lt;code&gt;hostPath&lt;/code&gt; on one node &lt;strong&gt;won’t be available to a pod running on another node&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To solve this, we need &lt;strong&gt;Persistent Volumes (PVs)&lt;/strong&gt; and &lt;strong&gt;CSI (Container Storage Interface)&lt;/strong&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introducing Persistent Volumes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Persistent Volume (PV)&lt;/strong&gt; is independent of individual pods and nodes, providing a stable and reusable storage location across the cluster. &lt;/p&gt;

&lt;p&gt;To configure it, we create a new file &lt;code&gt;host-pv.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolume&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host-pv&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;capacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
  &lt;span class="na"&gt;volumeMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Block&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/data&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DirectoryOrCreate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Breaking Down the Configuration&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;apiVersion: v1&lt;/code&gt; &amp;amp; &lt;code&gt;kind: PersistentVolume&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Defines the resource type as a Persistent Volume.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;metadata.name: host-pv&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Assigns a unique name to the PV, making it identifiable when claimed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage Capacity (&lt;code&gt;capacity.storage: 1Gi&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Specifies the storage capacity, set to &lt;strong&gt;1GiB&lt;/strong&gt; in this example.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Volume Mode (&lt;code&gt;volumeMode&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Declares how the volume is exposed. &lt;code&gt;Filesystem&lt;/code&gt; (the default) mounts the volume as a directory inside the container, which is what this example's &lt;code&gt;hostPath&lt;/code&gt; directory needs; &lt;code&gt;Block&lt;/code&gt; exposes a raw block device for low-level storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Access Modes (&lt;code&gt;accessModes&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Controls how pods can interact with the volume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ReadWriteOnce&lt;/code&gt;: Can be mounted read-write by a single node; pods on that node can share it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ReadOnlyMany&lt;/code&gt;: Multiple pods can access it, but only in read-only mode.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ReadWriteMany&lt;/code&gt;: Multiple pods can read and write simultaneously.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Host Path (&lt;code&gt;hostPath&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Maps &lt;code&gt;/data&lt;/code&gt; on the node as the volume’s storage location. &lt;code&gt;type: DirectoryOrCreate&lt;/code&gt; ensures the directory exists.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
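&lt;p&gt;To create the PV and confirm it registered, apply the manifest (standard &lt;code&gt;kubectl&lt;/code&gt; commands; the file name matches the one above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply -f host-pv.yaml
# The PV should be listed with its capacity, access mode, and an "Available" status
kubectl get pv host-pv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;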

&lt;h2&gt;
  
  
  &lt;strong&gt;Claiming the Persistent Volume&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Persistent Volume Claim (PVC)&lt;/strong&gt; requests a PV from Kubernetes and ensures a pod can access it. Define this in &lt;code&gt;host-pvc.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host-pvc&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;volumeName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host-pv&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Explanation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;volumeName: host-pv&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Directly references our previously created PV.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access Modes (&lt;code&gt;accessModes&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Specifies access rights; &lt;code&gt;ReadWriteOnce&lt;/code&gt; means the volume can be mounted read-write by a single node at a time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage Class (&lt;code&gt;storageClassName: standard&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Selects the storage class used to match the claim to a volume; the claim's class must match the PV's for binding to succeed. &lt;code&gt;standard&lt;/code&gt; is the default class on Minikube; cloud providers use their own (e.g., &lt;code&gt;gp2&lt;/code&gt; on AWS).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Requests (&lt;code&gt;resources.requests.storage: 1Gi&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Requests &lt;strong&gt;1GiB&lt;/strong&gt; of storage from an available PV.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
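&lt;p&gt;After applying the claim, its status tells you whether it bound to &lt;code&gt;host-pv&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply -f host-pvc.yaml
# STATUS should read "Bound" once the claim matches the PV
kubectl get pvc host-pvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;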

&lt;h2&gt;
  
  
  &lt;strong&gt;Integrating Persistent Volume into Kubernetes Deployment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Finally, update &lt;code&gt;deployment.yaml&lt;/code&gt; to use the &lt;strong&gt;PVC&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mayankcse1/kub-data-01-starting-setup-stories:1&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-volume&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app/story&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-volume&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host-pvc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Key Enhancements&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We increased &lt;strong&gt;replicas to 2&lt;/strong&gt;, ensuring multiple instances of our application run.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;PVC (host-pvc)&lt;/strong&gt; is mounted inside the container at &lt;code&gt;/app/story&lt;/code&gt;, making persistent storage accessible.&lt;/li&gt;
&lt;li&gt;Because the volume is &lt;code&gt;ReadWriteOnce&lt;/code&gt;, both replicas can share it only while they are scheduled on the same node; cross-node read-write sharing requires &lt;code&gt;ReadWriteMany&lt;/code&gt;-capable storage.&lt;/li&gt;
&lt;/ul&gt;
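&lt;p&gt;Rolling this out and checking that both replicas are running takes a couple of commands:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply -f deployment.yaml
# Both pods should reach the Running state
kubectl get pods -l app=story
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;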

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Persistent Volumes Matter in Multi-Node Clusters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Unlike a bare &lt;code&gt;hostPath&lt;/code&gt; volume, which binds storage &lt;strong&gt;to a single node&lt;/strong&gt;, Persistent Volumes decouple storage from the pod lifecycle. When backed by network-attached storage through a &lt;strong&gt;CSI driver&lt;/strong&gt; (for example, EBS on AWS or Persistent Disks on GCP), the data remains intact even if a pod fails and Kubernetes reschedules it on another node. Keep in mind that the &lt;code&gt;hostPath&lt;/code&gt;-backed PV used in this example is still node-local, so it suits single-node setups like Minikube; production clusters should use a CSI-provisioned volume instead.&lt;/p&gt;
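&lt;p&gt;You can verify the persistence directly. Assuming the service from the previous post still exposes the app at &lt;code&gt;localhost/story&lt;/code&gt;, write some data, delete a pod, and read the data back:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl --location 'localhost/story' \
  --header 'Content-Type: application/json' \
  --data '{"text": "persisted!"}'
kubectl delete pod -l app=story
# Kubernetes recreates the pod; the stored text should still be returned
curl --location 'localhost/story'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;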

&lt;h3&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Moving from standard &lt;strong&gt;Docker Volumes&lt;/strong&gt; to &lt;strong&gt;Kubernetes Volumes&lt;/strong&gt; and finally to &lt;strong&gt;Persistent Volumes&lt;/strong&gt; is essential for building scalable, cloud-ready applications. &lt;/p&gt;

&lt;p&gt;For further exploration, refer to the &lt;a href="https://kubernetes.io/docs/concepts/storage/volumes/" rel="noopener noreferrer"&gt;official Kubernetes storage documentation&lt;/a&gt;. &lt;/p&gt;




&lt;p&gt;With Persistent Volumes, your application's data &lt;strong&gt;survives pod failures, rescheduling, and multi-node deployments&lt;/strong&gt;, making it &lt;strong&gt;reliable for production environments&lt;/strong&gt;. Now you’re ready to manage &lt;strong&gt;stateful applications at scale&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>MANAGING DATA &amp; VOLUMES IN KUBERNETES</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Mon, 19 May 2025 16:29:34 +0000</pubDate>
      <link>https://forem.com/mayankcse/managing-data-volumes-in-kubernetes-1d93</link>
      <guid>https://forem.com/mayankcse/managing-data-volumes-in-kubernetes-1d93</guid>
      <description>&lt;p&gt;Imagine you’re running an application that lets users store text, but every time the container crashes or restarts, their data is lost. Frustrating, right? That’s exactly the problem Kubernetes solves with volumes—ensuring data survives beyond container lifecycles. In this blog, we’ll walk through a practical example that demonstrates how Kubernetes can persist data effectively, even in failure scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Problem: Where Did My Data Go?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Containers are lightweight, flexible, and easily restartable, but they come with a challenge: they are &lt;strong&gt;stateless by default&lt;/strong&gt;. This means that every time a container crashes or is redeployed, any data stored inside it vanishes. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mayank-cse1/docker-kubernetes-the-practical-guide/tree/main/kub-data-01-starting-setup" rel="noopener noreferrer"&gt;Practice Resource&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider a simple application where users submit text, and the text is stored in a file inside a container. Here's how you can run it using Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the container is running, test it using &lt;code&gt;curl&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieve stored text:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="s1"&gt;'localhost/story'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Add new text:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="s1"&gt;'localhost/story'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{
        "text": "My text!"
    }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if the container is removed and recreated (for example, after a crash or a redeploy), all previously submitted text is gone! This happens because every new container starts from the image with a fresh writable layer.&lt;/p&gt;
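&lt;p&gt;You can reproduce the data loss yourself by recreating the container, which discards its writable layer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down   # removes the container along with its writable layer
docker compose up
# The previously submitted text is gone
curl --location 'localhost/story'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;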

&lt;p&gt;So, how do we &lt;strong&gt;persist&lt;/strong&gt; the data beyond container failures? Kubernetes volumes provide the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Solution: Adding Volumes in Kubernetes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kubernetes allows containers to mount &lt;strong&gt;volumes&lt;/strong&gt;—storage spaces that persist beyond container lifecycles. Compared to Docker volumes, Kubernetes volumes offer greater flexibility and resilience.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Setting Up Kubernetes Deployment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First, push the Docker image to a registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker tag kub-data-01-starting-setup-stories mayankcse1/kub-data-01-starting-set
docker push mayankcse1/kub-data-01-starting-setup-stories
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, create a &lt;code&gt;deployment.yaml&lt;/code&gt; file to define how Kubernetes should manage our container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mayankcse1/kub-data-01-starting-setup-stories&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app/story&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stories-volume&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stories-volume&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stories-pvc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
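
&lt;p&gt;The deployment above mounts a claim named &lt;code&gt;stories-pvc&lt;/code&gt; that isn’t defined yet. A minimal sketch of that claim, assuming the cluster’s default StorageClass (as in Minikube) handles provisioning and that 1Gi is enough for this demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stories-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Apply it before (or together with) the deployment—otherwise the pod will sit in &lt;code&gt;Pending&lt;/code&gt; waiting for its volume.&lt;/p&gt;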



&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a Service to Expose the App&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, define a Kubernetes Service of type &lt;code&gt;LoadBalancer&lt;/code&gt; to allow external access to the application, forwarding traffic from port 80 to the container’s port 3000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Deploy and Test the Application in Kubernetes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Start Minikube if it's not running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;minikube status
minikube start &lt;span class="nt"&gt;--driver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, apply the Kubernetes configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;service.yaml &lt;span class="nt"&gt;-f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;deployment.yaml
kubectl get pods
kubectl get deployments
minikube service story-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="s1"&gt;'http://&amp;lt;Host Address&amp;gt;/story'&lt;/span&gt;
curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="s1"&gt;'http://&amp;lt;Host Address&amp;gt;/story'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{
    "text": "mayank"
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, the service is running—but our data can still &lt;strong&gt;disappear&lt;/strong&gt; when the pod itself is removed. To solve this, let’s improve our volume strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Ensuring Data Persistence with Volumes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kubernetes supports various volume types, but a simple way to persist data within a pod is using &lt;code&gt;emptyDir&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-volume&lt;/span&gt;
    &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An &lt;code&gt;emptyDir&lt;/code&gt; volume lives as long as the &lt;strong&gt;pod&lt;/strong&gt; does: data survives container crashes and restarts within that pod. However, if the pod is &lt;strong&gt;deleted&lt;/strong&gt; or rescheduled, all data is lost.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Handling Multiple Pods &amp;amp; Node Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you scale your application to multiple pods, data consistency becomes tricky. Suppose a pod crashes and traffic is redirected to another pod—the new pod won’t have the old data!&lt;/p&gt;

&lt;p&gt;To store data across multiple pods running on the same &lt;strong&gt;node&lt;/strong&gt;, we can use &lt;code&gt;hostPath&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-volume&lt;/span&gt;
    &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/data&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DirectoryOrCreate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, &lt;strong&gt;this only works for pods on the same node&lt;/strong&gt;—if a pod is scheduled on a different node, it won’t have access to the previous data.&lt;/p&gt;

&lt;p&gt;For a more robust solution across multiple nodes, consider &lt;strong&gt;Persistent Volumes (PVs)&lt;/strong&gt; that work with cloud storage or external databases.&lt;/p&gt;
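
&lt;p&gt;For illustration, a &lt;strong&gt;PersistentVolume&lt;/strong&gt; is declared cluster-wide and then bound by a claim. A minimal sketch (the name and capacity here are assumptions); on a single-node cluster like Minikube it can be backed by &lt;code&gt;hostPath&lt;/code&gt;, while a real multi-node cluster would swap in a cloud or network storage driver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: stories-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data
    type: DirectoryOrCreate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;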

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts: Why Kubernetes Volumes Matter&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kubernetes volumes solve &lt;strong&gt;critical data loss issues&lt;/strong&gt; in containerized applications. By implementing persistent storage solutions, developers ensure that user data survives container crashes, pod restarts, and even scaling across multiple nodes.&lt;/p&gt;

&lt;p&gt;To explore all available volume storage options in Kubernetes, visit the official documentation:&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/storage/volumes/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/storage/volumes/&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Understanding how &lt;strong&gt;stateful applications&lt;/strong&gt; work within Kubernetes is essential for building scalable, resilient infrastructure. Whether deploying a simple text storage app or a large-scale distributed system, &lt;strong&gt;managing volumes effectively ensures reliable data persistence&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
