<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mayank Gupta</title>
    <description>The latest articles on Forem by Mayank Gupta (@mayankcse).</description>
    <link>https://forem.com/mayankcse</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F749855%2Fcad306d7-ed2f-40dc-84d8-7076ed4611ee.png</url>
      <title>Forem: Mayank Gupta</title>
      <link>https://forem.com/mayankcse</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mayankcse"/>
    <language>en</language>
    <item>
      <title>Efficiency at Scale: Scaling, Scheduling, and Measuring Databricks SQL</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:07:32 +0000</pubDate>
      <link>https://forem.com/mayankcse/efficiency-at-scale-scaling-scheduling-and-measuring-databricks-sql-2cjg</link>
      <guid>https://forem.com/mayankcse/efficiency-at-scale-scaling-scheduling-and-measuring-databricks-sql-2cjg</guid>
      <description>&lt;p&gt;In our final look at Databricks SQL, we move beyond individual table tweaks to the broader architecture. Optimization isn't just about making one query fast; it’s about building a sustainable, cost-efficient system. This means picking the right warehouse size, automating recurring workloads, and—most importantly—proving your impact with hard data.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Right-Sizing Your Warehouse
&lt;/h2&gt;

&lt;p&gt;A common trap is assuming a larger warehouse is always better. Doubling a warehouse size (e.g., from Small to Medium) can cut query time roughly in half, but it also doubles your hourly DBU (Databricks Unit) rate; if the query doesn't speed up proportionally, you end up paying more for the same work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sizing Strategies:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2X-Small to X-Small:&lt;/strong&gt; Best for light exploratory queries and cost-sensitive, low-concurrency tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small to Medium:&lt;/strong&gt; The "sweet spot" for interactive dashboards and general ad-hoc analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large and Beyond:&lt;/strong&gt; Reserved for heavy ETL (Extract, Transform, Load) jobs, massive aggregations, and high-concurrency production environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost Control Checklist:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-Stop:&lt;/strong&gt; Set this to a low threshold (e.g., 1–10 minutes) to prevent paying for idle compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless:&lt;/strong&gt; Use serverless warehouses to all but eliminate "cold starts." They spin up in 2–6 seconds, which lets you be far more aggressive with auto-stop settings.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Scheduling and Automation Patterns
&lt;/h2&gt;

&lt;p&gt;You shouldn't be running production workloads manually from the SQL editor. Databricks provides three ways to move from "manual" to "managed."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Modern Patterns:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Scheduled Queries:&lt;/strong&gt; Great for daily reports or cleanup tasks. Always save your query first, then use the &lt;strong&gt;Schedule&lt;/strong&gt; button to define the cadence.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Materialized Views (MVs):&lt;/strong&gt; These pre-compute expensive aggregations. Instead of re-scanning raw data every time, users query the MV and get instant results.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Streaming Tables:&lt;/strong&gt; These ingest data continuously, ensuring your dashboards are always fresh without the "spiky" load of scheduled batch jobs.&lt;/li&gt;
&lt;/ol&gt;
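
&lt;p&gt;The second and third patterns are plain DDL in Databricks SQL. A minimal sketch (schema, table, and path names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Pre-compute an expensive aggregation as a Materialized View
CREATE MATERIALIZED VIEW gold.daily_sales_mv AS
SELECT sale_date, region, SUM(total_sales) AS revenue
FROM silver.sales_data
GROUP BY sale_date, region;

-- Continuously ingest newly arriving files as a Streaming Table
CREATE STREAMING TABLE bronze.sales_raw AS
SELECT * FROM STREAM read_files('/Volumes/main/landing/sales/', format =&amp;gt; 'json');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;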




&lt;h2&gt;
  
  
  3. The Power of Parameterization
&lt;/h2&gt;

&lt;p&gt;Stop hard-coding your &lt;code&gt;WHERE&lt;/code&gt; clauses! Using parameters (e.g., &lt;code&gt;:start_date&lt;/code&gt;) makes your SQL more secure and much more efficient.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Prevents SQL injection by separating the query logic from the input data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Efficiency:&lt;/strong&gt; Databricks can reuse the same execution plan because the "text" of the query remains identical even when the parameter values change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusability:&lt;/strong&gt; A single query can power multiple dashboard widgets by simply changing the input values.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Using parameters for a reusable, cache-friendly query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_data&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;start_date&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;status_filter&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Measuring Success: The Optimization Feedback Loop
&lt;/h2&gt;

&lt;p&gt;Optimization is meaningless if you can't prove it. You need to establish baselines and track four key metrics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why it Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P95 Duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detects "outlier" queries that are frustrating your users.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBU Consumption&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The "Bottom Line"—tracks the literal cost of your SQL workloads.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bytes Scanned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Validates that your pruning and Z-ORDERing are actually working.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cache Hit Ratio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Measures how often you are getting "free" results from the result cache.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
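
&lt;p&gt;The first three metrics can be pulled straight from the system tables. A sketch of a baseline query (the 7-day window is arbitrary; adjust it to your reporting cadence):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Baseline: P95 duration and total scan volume over the last 7 days
SELECT
    approx_percentile(total_duration_ms, 0.95) AS p95_duration_ms,
    SUM(read_bytes) / (1024 * 1024 * 1024) AS gb_scanned,
    COUNT(*) AS query_count
FROM system.query.history
WHERE start_time &amp;gt;= current_date() - INTERVAL 7 DAYS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;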

&lt;h3&gt;
  
  
  The "Before &amp;amp; After" Audit
&lt;/h3&gt;

&lt;p&gt;To prove your value, query the &lt;code&gt;system.query.history&lt;/code&gt; and &lt;code&gt;system.billing.usage&lt;/code&gt; tables. Compare a 24-hour window &lt;em&gt;before&lt;/em&gt; you applied Liquid Clustering vs. a 24-hour window &lt;em&gt;after&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check your top 20 most recent queries for scan efficiency&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;statement_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;total_duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;read_bytes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mb_scanned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
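
&lt;p&gt;For the cost half of the audit, &lt;code&gt;system.billing.usage&lt;/code&gt; records DBU consumption per day. A hedged sketch (the warehouse ID is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Daily DBU consumption for one SQL warehouse over the last 30 days
SELECT
    usage_date,
    SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_metadata.warehouse_id = '1234abcd'  -- placeholder ID
  AND usage_date &amp;gt;= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date
ORDER BY usage_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;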






&lt;h2&gt;
  
  
  Best Practices Summary (The Do's and Don'ts)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ Do:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stagger Schedules:&lt;/strong&gt; Don't have 50 dashboards refresh exactly at 8:00 AM; space them out by 5 minutes to avoid resource contention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use CTEs:&lt;/strong&gt; Common Table Expressions (using the &lt;code&gt;WITH&lt;/code&gt; clause) make your logic readable and easier for the optimizer to handle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor Parallelism:&lt;/strong&gt; Use the warehouse monitoring tab to see if you are leaving compute capacity on the table.&lt;/li&gt;
&lt;/ul&gt;
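
&lt;p&gt;A quick illustration of the CTE pattern (table and column names are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Name intermediate steps instead of nesting subqueries
WITH regional_totals AS (
  SELECT region, SUM(total_sales) AS revenue
  FROM silver.sales_data
  GROUP BY region
)
SELECT region, revenue
FROM regional_totals
WHERE revenue &amp;gt; 100000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;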

&lt;h3&gt;
  
  
  ❌ Don't:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep-Alive Queries:&lt;/strong&gt; Don't run "dummy" queries just to keep a warehouse from spinning down. Use Serverless and let Auto-Stop do its job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip ANALYZE:&lt;/strong&gt; Always run &lt;code&gt;ANALYZE TABLE&lt;/code&gt; after large loads so the cost-based optimizer (CBO) has fresh statistics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function Wrapping:&lt;/strong&gt; Avoid &lt;code&gt;WHERE YEAR(date) = 2026&lt;/code&gt;; it breaks partition pruning.&lt;/li&gt;
&lt;/ul&gt;
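
&lt;p&gt;The last point deserves a concrete example: the two filters below are logically equivalent, but only the second leaves the column bare so the engine can prune files.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Breaks pruning: YEAR() must be evaluated against every row
SELECT * FROM fact_sales WHERE YEAR(sale_date) = 2026;

-- Prunes: a plain range predicate on the raw column
SELECT * FROM fact_sales
WHERE sale_date &amp;gt;= '2026-01-01' AND sale_date &amp;lt; '2027-01-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;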




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;Optimization is a &lt;strong&gt;cycle&lt;/strong&gt;, not a destination. By monitoring with Query History, diagnosing with Query Profiles, fixing with Liquid Clustering, and measuring with System Tables, you transform your Databricks environment into a high-performance, cost-effective data powerhouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Interview Questions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;How do you determine if a SQL Warehouse needs to be scaled up or if the queries need optimization?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;What are the benefits of using a Materialized View over a standard View in Databricks SQL?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does query parameterization improve the "Cache Hit Ratio"?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How would you calculate the total DBU cost of a specific user's queries over the last 30 days?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now that you have the full toolkit, which of these optimization strategies will you implement first to lower your DBU burn?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Optimizing Delta Tables: From Maintenance to Managed Excellence</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:06:33 +0000</pubDate>
      <link>https://forem.com/mayankcse/optimizing-delta-tables-from-maintenance-to-managed-excellence-1bd</link>
      <guid>https://forem.com/mayankcse/optimizing-delta-tables-from-maintenance-to-managed-excellence-1bd</guid>
      <description>&lt;p&gt;If high-performance SQL queries are the engine of your data platform, then your Delta tables are the fuel. Even the best-written SQL can't overcome a poorly organized data layer. In this guide, we shift from query logic to the physical storage layer—exploring how to maintain, cluster, and automate your Delta tables for maximum efficiency.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: "Small File Syndrome" and Data Scattering
&lt;/h2&gt;

&lt;p&gt;Two main issues plague Delta table performance over time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Small File Problem:&lt;/strong&gt; Frequent streaming or incremental writes create thousands of tiny Parquet files. Each file requires a separate I/O task, leading to massive scheduling overhead.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Data Scattering:&lt;/strong&gt; Without organization, related records (e.g., all sales for a specific &lt;code&gt;user_id&lt;/code&gt;) are scattered across hundreds of files, forcing the engine to scan the entire table for a single lookup.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. File Compaction with &lt;code&gt;OPTIMIZE&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;OPTIMIZE&lt;/code&gt; command is your first line of defense. It solves the small file problem by physically rewriting many tiny files into large, efficient &lt;strong&gt;1 GB chunks&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Reduces file open/close overhead by 10x to 100x.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best Practice:&lt;/strong&gt; Run &lt;code&gt;OPTIMIZE&lt;/code&gt; after large batch loads or frequent streaming updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-Compaction:&lt;/strong&gt; For streaming tables, set &lt;code&gt;optimizeWrite = true&lt;/code&gt;. This coalesces files during the write process so you don't have to manage manual maintenance jobs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Manually compact small files in a table&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;sales_unoptimized&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enable auto-compaction for continuous maintenance&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_streaming&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;autoOptimize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizeWrite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
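
&lt;p&gt;You can confirm that compaction actually reduced the file count: &lt;code&gt;DESCRIBE DETAIL&lt;/code&gt; reports &lt;code&gt;numFiles&lt;/code&gt; and &lt;code&gt;sizeInBytes&lt;/code&gt; for a Delta table, and &lt;code&gt;DESCRIBE HISTORY&lt;/code&gt; shows the recorded &lt;code&gt;OPTIMIZE&lt;/code&gt; operation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Compare numFiles before and after running OPTIMIZE
DESCRIBE DETAIL sales_unoptimized;

-- Inspect the recorded OPTIMIZE operation and its metrics
DESCRIBE HISTORY sales_unoptimized LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;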






&lt;h2&gt;
  
  
  2. Z-ORDER: High-Performance Data Skipping
&lt;/h2&gt;

&lt;p&gt;Compaction makes files larger, but &lt;code&gt;Z-ORDER&lt;/code&gt; makes them &lt;strong&gt;smarter&lt;/strong&gt;. By co-locating related data in the same files, &lt;code&gt;Z-ORDER&lt;/code&gt; allows the engine to skip the majority of data using file-level Min/Max statistics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When to use:&lt;/strong&gt; On high-cardinality columns (e.g., &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;) that appear frequently in &lt;code&gt;WHERE&lt;/code&gt; or &lt;code&gt;JOIN&lt;/code&gt; clauses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limit:&lt;/strong&gt; Stick to 1–4 columns. Each additional column reduces the clustering effectiveness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A point-lookup that previously scanned 1,600 files might only touch 1 or 2 files after Z-ORDERing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Optimize and cluster data by high-cardinality columns&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; 
&lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Liquid Clustering: The Modern Standard
&lt;/h2&gt;

&lt;p&gt;While Z-ORDER is powerful, it is rigid (requiring full table rewrites if keys change). &lt;strong&gt;Liquid Clustering&lt;/strong&gt; is the modern replacement that simplifies everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Liquid Clustering Wins:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incremental:&lt;/strong&gt; It only re-clusters new data, avoiding expensive full-table rewrites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible:&lt;/strong&gt; You can change your clustering keys with a simple &lt;code&gt;ALTER TABLE&lt;/code&gt; without migrating data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent:&lt;/strong&gt; It automatically handles skew and data distribution.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Defining a table with Liquid Clustering&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="n"&gt;LONG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
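
&lt;p&gt;Because clustering keys are table metadata, changing them later is a one-line operation; the new keys apply to data written (and re-clustered) afterward. A sketch on the same illustrative table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Change clustering keys without migrating existing data
ALTER TABLE fact_sales CLUSTER BY (product_id, sale_date);

-- Incrementally re-cluster newly written data
OPTIMIZE fact_sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;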






&lt;h2&gt;
  
  
  4. Reclaiming Storage with &lt;code&gt;VACUUM&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Delta Lake's "Time Travel" is a lifesaver, but keeping every version of every file forever will explode your storage costs. &lt;code&gt;VACUUM&lt;/code&gt; removes files no longer needed for time travel.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retention:&lt;/strong&gt; The default is 7 days (168 hours). &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety:&lt;/strong&gt; You cannot vacuum files newer than 7 days on Serverless SQL Warehouses to prevent breaking active queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro-Tip:&lt;/strong&gt; Always use &lt;code&gt;DRY RUN&lt;/code&gt; first to see what will be deleted!
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Preview files to be deleted&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt; &lt;span class="n"&gt;RETAIN&lt;/span&gt; &lt;span class="mi"&gt;168&lt;/span&gt; &lt;span class="n"&gt;HOURS&lt;/span&gt; &lt;span class="n"&gt;DRY&lt;/span&gt; &lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Execute the cleanup&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. The "Set and Forget" Strategy: Predictive Optimization
&lt;/h2&gt;

&lt;p&gt;The ultimate goal of a Data Engineer is to spend less time on maintenance. &lt;strong&gt;Predictive Optimization&lt;/strong&gt; is a managed service where Databricks monitors your tables and automatically runs &lt;code&gt;OPTIMIZE&lt;/code&gt;, &lt;code&gt;VACUUM&lt;/code&gt;, and &lt;code&gt;ANALYZE&lt;/code&gt; when needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Enable Databricks to manage maintenance automatically&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt; &lt;span class="n"&gt;ENABLE&lt;/span&gt; &lt;span class="n"&gt;PREDICTIVE&lt;/span&gt; &lt;span class="n"&gt;OPTIMIZATION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary / Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compact:&lt;/strong&gt; Use &lt;code&gt;OPTIMIZE&lt;/code&gt; to merge small files and reduce I/O overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster:&lt;/strong&gt; Use &lt;code&gt;Liquid Clustering&lt;/code&gt; for all new tables to enable massive data skipping with total flexibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean:&lt;/strong&gt; Use &lt;code&gt;VACUUM&lt;/code&gt; to keep storage costs down by removing stale data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate:&lt;/strong&gt; Enable &lt;code&gt;Predictive Optimization&lt;/code&gt; at the catalog or schema level to let the platform handle the heavy lifting.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;What is the "Small File Problem," and how does &lt;code&gt;OPTIMIZE&lt;/code&gt; resolve it?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does Z-ORDER improve performance for point-lookup queries?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Why is Liquid Clustering considered superior to traditional Hive partitioning or Z-ORDER?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;What are the risks of running &lt;code&gt;VACUUM&lt;/code&gt; with a very short retention period?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;How much manual maintenance are you currently doing on your Delta tables, or have you already moved toward Predictive Optimization?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Hands-On Performance: Diagnosing and Fixing Databricks SQL Bottlenecks</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:05:20 +0000</pubDate>
      <link>https://forem.com/mayankcse/hands-on-performance-diagnosing-and-fixing-databricks-sql-bottlenecks-4e5</link>
      <guid>https://forem.com/mayankcse/hands-on-performance-diagnosing-and-fixing-databricks-sql-bottlenecks-4e5</guid>
      <description>&lt;p&gt;Once you know how to monitor your queries, the next step is taking action. In Databricks SQL, performance tuning isn't a "set it and forget it" task—it’s a hands-on process of reducing data scans, optimizing joins, and leveraging intelligent caching.&lt;/p&gt;

&lt;p&gt;This guide moves from theory to execution, showing you exactly how to identify bottlenecks and apply the right fixes to make your queries run faster and cheaper.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Inefficient "Brute Force" Queries
&lt;/h2&gt;

&lt;p&gt;A common mistake for SQL developers is writing "brute force" queries that scan entire tables to find a single row. While modern engines are fast, this approach is unsustainable at the petabyte scale. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inefficient queries lead to:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High Latency:&lt;/strong&gt; Users waiting minutes for simple dashboard refreshes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wasted Spend:&lt;/strong&gt; Paying for compute resources to read data that is immediately discarded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Contention:&lt;/strong&gt; One "heavy" query slowing down the entire warehouse for everyone else.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Core Concept: The Golden Rule of Big Data
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The fastest way to speed up a query is to read less data.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To achieve this, Databricks uses three primary scan-reduction techniques:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Partition Pruning:&lt;/strong&gt; Skipping entire directories of files based on a filter (e.g., &lt;code&gt;WHERE region = 'North'&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Predicate Pushdown:&lt;/strong&gt; Using metadata (Min/Max statistics) within files to skip specific blocks of data before they are even read into memory.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dynamic File Pruning:&lt;/strong&gt; Eliminating files at runtime based on values discovered from the &lt;em&gt;other&lt;/em&gt; side of a join.&lt;/li&gt;
&lt;/ol&gt;
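
&lt;p&gt;A minimal illustration of the first two techniques, assuming a &lt;code&gt;sales&lt;/code&gt; table partitioned by &lt;code&gt;region&lt;/code&gt; (names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Partition pruning: only the region='North' directory is scanned
-- Predicate pushdown: Min/Max stats skip file blocks where amount never exceeds 1000
SELECT COUNT(*)
FROM sales
WHERE region = 'North'
  AND amount &amp;gt; 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;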




&lt;h2&gt;
  
  
  Deep Dive: Join Strategies &amp;amp; Optimization
&lt;/h2&gt;

&lt;p&gt;Joins are often the most expensive part of a query execution. Understanding how Databricks handles them is key to fixing a slow DAG.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. BroadcastHashJoin (The Winner)
&lt;/h3&gt;

&lt;p&gt;The engine takes the smaller table and sends a full copy to every worker node. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it's fast:&lt;/strong&gt; No data shuffle is required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Joining a massive fact table with a smaller dimension table.&lt;/li&gt;
&lt;/ul&gt;
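
&lt;p&gt;If the optimizer misjudges table sizes, you can request the broadcast explicitly with a standard Spark SQL hint (table names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Ship the small dimension table to every worker; no shuffle of the fact table
SELECT /*+ BROADCAST(d) */ f.order_id, d.product_name
FROM fact_sales f
JOIN dim_product d ON f.product_id = d.product_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;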

&lt;h3&gt;
  
  
  2. SortMergeJoin (The Workhorse)
&lt;/h3&gt;

&lt;p&gt;Both tables are shuffled across the network, sorted by the join key, and then merged.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it's used:&lt;/strong&gt; It is the standard for joining two very large tables that don't fit in memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The downside:&lt;/strong&gt; Heavy network and I/O overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Adaptive Query Execution (AQE)
&lt;/h3&gt;

&lt;p&gt;Databricks isn't static. AQE can look at a query &lt;em&gt;during&lt;/em&gt; execution and say, "Wait, this table I thought was big is actually small. Let's switch from a SortMergeJoin to a BroadcastJoin on the fly."&lt;/p&gt;
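
&lt;p&gt;AQE is enabled by default on modern Databricks runtimes; in a notebook or cluster session you can check or set the flag explicitly (a sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Show the current value of the AQE flag
SET spark.sql.adaptive.enabled;

-- Explicitly enable it (already the default on recent runtimes)
SET spark.sql.adaptive.enabled = true;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;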




&lt;h2&gt;
  
  
  Technical Implementation: Writing Cache-Friendly SQL
&lt;/h2&gt;

&lt;p&gt;Caching can make a query go from 30 seconds to 0.5 seconds, but only if you write your SQL correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Cache-Busters" to Avoid
&lt;/h3&gt;

&lt;p&gt;Certain functions make your query "non-deterministic," meaning the engine can't be sure the result will be the same next time, so it refuses to cache it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;❌ Always Misses Cache&lt;/th&gt;
&lt;th&gt;✅ Always Hits Cache (Recommended)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE date = CURRENT_DATE()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE date = '2024-05-20'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE ts &amp;gt; NOW() - INTERVAL 1 DAY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE ts &amp;gt; '2024-05-19 08:00:00'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adding/Removing random whitespace&lt;/td&gt;
&lt;td&gt;Consistent, formatted SQL blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
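
&lt;p&gt;In practice, the cache-friendly version swaps the volatile function for a parameter that the dashboard fills in (the &lt;code&gt;:as_of_ts&lt;/code&gt; name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Cache-buster: NOW() makes every execution non-deterministic
SELECT COUNT(*) FROM orders WHERE ts &amp;gt; NOW() - INTERVAL 1 DAY;

-- Cache-friendly: the query text stays identical; only the parameter value changes
SELECT COUNT(*) FROM orders WHERE ts &amp;gt; :as_of_ts;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;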

&lt;h3&gt;
  
  
  Pre-Warming the Cache
&lt;/h3&gt;

&lt;p&gt;If you have a high-priority dashboard, you can "pre-warm" the SSDs of your warehouse using the &lt;code&gt;CACHE SELECT&lt;/code&gt; command. This ensures the data is sitting on local fast storage before the first user even logs in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Pre-warm the local Delta cache for high-priority tables&lt;/span&gt;
&lt;span class="k"&gt;CACHE&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_summary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  System Design: The Performance Playbook
&lt;/h2&gt;

&lt;p&gt;To build a high-performance environment, follow this four-step cycle:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Identify the "Heavy Hitters"
&lt;/h3&gt;

&lt;p&gt;Query the system tables to find the top 10 most expensive queries by &lt;code&gt;total_task_duration_ms&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;statement_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;total_task_duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;read_bytes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mb_read&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;date_add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_task_duration_ms&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Analyze the DAG
&lt;/h3&gt;

&lt;p&gt;Open the &lt;strong&gt;Query Profile&lt;/strong&gt;. Look for the "Scan Table" node. If the &lt;strong&gt;Pruning Percentage&lt;/strong&gt; is low, you are reading too much data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Apply the Fix
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Missing Pruning?&lt;/strong&gt; Add a filter on a partitioned column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Massive Shuffles?&lt;/strong&gt; Use a &lt;code&gt;/*+ BROADCAST(small_table) */&lt;/code&gt; hint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow Scans?&lt;/strong&gt; Run &lt;code&gt;OPTIMIZE table_name ZORDER BY (frequent_filter_column)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Verify &amp;amp; Alert
&lt;/h3&gt;

&lt;p&gt;Compare the "Before" and "After" metrics in &lt;code&gt;system.query.history&lt;/code&gt;. If the &lt;code&gt;read_bytes&lt;/code&gt; dropped significantly, your fix worked.&lt;/p&gt;




&lt;h2&gt;
  
  
  Best Practices &amp;amp; Pitfalls
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Functions on Filter Columns:&lt;/strong&gt; Writing &lt;code&gt;WHERE YEAR(my_date) = 2023&lt;/code&gt; prevents the engine from using partition pruning. Use &lt;code&gt;WHERE my_date BETWEEN '2023-01-01' AND '2023-12-31'&lt;/code&gt; instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-Size the Warehouse:&lt;/strong&gt; Don't use a Large warehouse for 2X-Small tasks. Use the smallest tier that meets your SLA to save money.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor Parallelism:&lt;/strong&gt; A low &lt;strong&gt;Parallelism Ratio&lt;/strong&gt; in your history logs means your query is running sequentially and not taking advantage of your cluster's power.&lt;/li&gt;
&lt;/ul&gt;
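
&lt;p&gt;The first pitfall is easiest to see side by side; in this sketch, &lt;code&gt;events&lt;/code&gt; is a hypothetical table partitioned by &lt;code&gt;my_date&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Anti-pattern: the function call hides the column from partition pruning
SELECT count(*) FROM events WHERE YEAR(my_date) = 2023;

-- Better: a plain range predicate lets the engine skip whole partitions
SELECT count(*) FROM events WHERE my_date BETWEEN '2023-01-01' AND '2023-12-31';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
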




&lt;h2&gt;
  
  
  Summary / Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read Less:&lt;/strong&gt; Partition pruning and predicate pushdown are your best friends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter Early:&lt;/strong&gt; The closer a filter is to the source scan, the less work every downstream join and aggregate has to do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be Deterministic:&lt;/strong&gt; Replace &lt;code&gt;NOW()&lt;/code&gt; and &lt;code&gt;CURRENT_DATE()&lt;/code&gt; with static parameters to unlock the Result Cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Serverless:&lt;/strong&gt; For bursty workloads, serverless warehouses prevent you from paying for idle compute time.&lt;/li&gt;
&lt;/ul&gt;
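
&lt;p&gt;The determinism point is worth a quick sketch (&lt;code&gt;orders&lt;/code&gt; is a hypothetical table; the literal date stands in for a dashboard parameter):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Cache-hostile: the non-deterministic function defeats the Result Cache
SELECT count(*) FROM orders WHERE order_date = CURRENT_DATE();

-- Cache-friendly: a static parameter value produces an identical, cacheable query
SELECT count(*) FROM orders WHERE order_date = '2024-06-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
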




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;What is the difference between Partition Pruning and Predicate Pushdown?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does wrapping a column in a function (like &lt;code&gt;TO_DATE()&lt;/code&gt;) affect query performance?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;When would you choose a ShuffleHashJoin over a BroadcastHashJoin?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;What metrics in the Query Profile indicate that a table needs Z-Ordering or better partitioning?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Master Your Queries: A Guide to Databricks SQL Performance Monitoring</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:03:31 +0000</pubDate>
      <link>https://forem.com/mayankcse/master-your-queries-a-guide-to-databricks-sql-performance-monitoring-1mdm</link>
      <guid>https://forem.com/mayankcse/master-your-queries-a-guide-to-databricks-sql-performance-monitoring-1mdm</guid>
      <description>&lt;p&gt;Optimization isn't just about writing cleaner SQL; it's about knowing exactly where your compute dollars are going. In a world of auto-scaling warehouses and serverless compute, a single "bad" query can silently inflate your monthly cloud bill.&lt;/p&gt;

&lt;p&gt;Whether you are a Data Engineer trying to slash execution times or a Platform Architect managing costs, Databricks provides a powerful duo of tools to help you: &lt;strong&gt;Query History&lt;/strong&gt; and &lt;strong&gt;Query Profile&lt;/strong&gt;. In this post, we’ll explore how to move from reactive "firefighting" to proactive performance management.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: The "Black Box" Query
&lt;/h2&gt;

&lt;p&gt;We've all been there: you trigger a SQL statement, and the loading spinner just keeps turning. Is the warehouse overloaded? Is your join causing a massive data shuffle? Or is the engine simply struggling to read millions of unpartitioned files?&lt;/p&gt;

&lt;p&gt;Without visibility, optimization is just guesswork. Monitoring these queries is essential because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost Control:&lt;/strong&gt; Reducing "Scan Volume" directly lowers DBU (Databricks Unit) consumption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Experience:&lt;/strong&gt; Faster dashboards mean happier business stakeholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Allocation:&lt;/strong&gt; Identifying if you need a larger warehouse or simply better SQL.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Core Concepts: History vs. Profile
&lt;/h2&gt;

&lt;p&gt;Before we dive into the code, let's distinguish between the two primary diagnostic layers in Databricks SQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Query History (The Macro View)
&lt;/h3&gt;

&lt;p&gt;Think of this as your &lt;strong&gt;Flight Log&lt;/strong&gt;. It shows every query executed over a period. It is your first stop for isolating "slow performers."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Insight:&lt;/strong&gt; Wall-clock breakdown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Breakdown:&lt;/strong&gt; It splits time into &lt;strong&gt;Scheduling&lt;/strong&gt;, &lt;strong&gt;Compilation&lt;/strong&gt;, and &lt;strong&gt;Execution&lt;/strong&gt;. 

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;High Scheduling Time?&lt;/em&gt; Your warehouse is likely queued up and needs more clusters.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;High Execution Time?&lt;/em&gt; Your SQL logic or data layout is the bottleneck.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Query Profile (The Micro View)
&lt;/h3&gt;

&lt;p&gt;Think of this as the &lt;strong&gt;X-Ray&lt;/strong&gt;. It provides a &lt;strong&gt;Directed Acyclic Graph (DAG)&lt;/strong&gt; of the execution plan.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Insight:&lt;/strong&gt; Operator-level metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Breakdown:&lt;/strong&gt; It shows exactly how many rows went into a filter and how many came out, which operators "spilled" to disk, and how much data was shuffled across the network.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Deep Dive: Anatomy of a Query Profile
&lt;/h2&gt;

&lt;p&gt;When you open a Query Profile, you are looking at a visual representation of the Spark engine at work.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Top Operators Panel
&lt;/h3&gt;

&lt;p&gt;This panel ranks operations by time. If a &lt;strong&gt;FileScan&lt;/strong&gt; takes 80% of the time, your issue is IO-bound (likely missing partitioning). If a &lt;strong&gt;Join&lt;/strong&gt; takes 80%, you have a compute/shuffle issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Spills: The Silent Killer
&lt;/h3&gt;

&lt;p&gt;Keep a sharp eye on &lt;strong&gt;Spill to Disk&lt;/strong&gt;. This occurs when an operation (like a large Sort or Join) exceeds the available RAM in the executor. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Fix:&lt;/strong&gt; Increase the warehouse size or optimize the query to handle less data at once (e.g., using a &lt;code&gt;BROADCAST&lt;/code&gt; hint for smaller tables).&lt;/li&gt;
&lt;/ul&gt;
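
&lt;p&gt;As a hedged illustration of the broadcast fix (&lt;code&gt;big_fact&lt;/code&gt; and &lt;code&gt;small_dim&lt;/code&gt; are hypothetical tables), the hint keeps the small side in memory on every node instead of shuffling both sides:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Broadcast the small table so the large one never shuffles
SELECT /*+ BROADCAST(small_dim) */
    f.id,
    d.label
FROM big_fact AS f
JOIN small_dim AS d
    ON f.dim_id = d.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
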

&lt;h3&gt;
  
  
  Predicate Pushdown
&lt;/h3&gt;

&lt;p&gt;The "Rows In vs. Rows Out" ratio is the most underrated metric. If an operator reads 10 million rows only to filter out 9.9 million, that filter should have happened earlier (at the Scan level). This is known as &lt;strong&gt;Predicate Pushdown&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Implementation: Monitoring via System Tables
&lt;/h2&gt;

&lt;p&gt;The UI is great for one-offs, but for long-term governance, you should query the &lt;strong&gt;System Tables&lt;/strong&gt; directly. This allows you to build automated dashboards and alert on SLA breaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Identifying High-Cost Outliers
&lt;/h3&gt;

&lt;p&gt;The following query identifies queries running longer than 60 seconds and calculates the "Data Scanned" to help you find inefficient full-table scans.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Identify long-running queries with high data scan volume&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;statement_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;statement_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;executed_as&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;duration_seconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;external_links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query_profile&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;profile_link&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- Convert bytes to GB for better readability&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_bytes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;gb_scanned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="c1"&gt;-- Filter for queries longer than 1 minute&lt;/span&gt;
    &lt;span class="n"&gt;total_duration_ms&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt; 
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;statement_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SELECT'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example: User Cost Leaderboard
&lt;/h3&gt;

&lt;p&gt;To see which users or teams are driving the most load, you can aggregate metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;executed_as&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_duration_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_duration_sec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_gb_scanned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;total_gb_scanned&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-World Applications
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retail/E-commerce:&lt;/strong&gt; Monitoring "Peak Season" dashboard performance to ensure sub-second latency for executive reports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FinTech:&lt;/strong&gt; Auditing query history for compliance to ensure only authorized service accounts are touching sensitive PII tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SaaS Providers:&lt;/strong&gt; Using System Tables to "Chargeback" compute costs to specific departments based on their DBU usage.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Performance Red Flags &amp;amp; Best Practices
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Red Flag&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Recommended Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Skew&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One task takes 100x longer than others.&lt;/td&gt;
&lt;td&gt;Check join keys for nulls or highly frequent values. Use skew hints.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cartesian Product&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive row count explosion.&lt;/td&gt;
&lt;td&gt;Ensure all &lt;code&gt;JOIN&lt;/code&gt; statements have a proper &lt;code&gt;ON&lt;/code&gt; clause.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stale Statistics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optimizer making bad join choices.&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;ANALYZE TABLE [table_name] COMPUTE STATISTICS&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High Exchange Volume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large amounts of data shuffling.&lt;/td&gt;
&lt;td&gt;Optimize &lt;code&gt;GROUP BY&lt;/code&gt; keys or use Z-Ordering on join columns.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
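
&lt;p&gt;For the stale-statistics row in particular, the refresh is a one-liner (the table name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Recompute table- and column-level statistics for the Cost-Based Optimizer
ANALYZE TABLE sales_fact COMPUTE STATISTICS FOR ALL COLUMNS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
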




&lt;h2&gt;
  
  
  Summary / Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query History&lt;/strong&gt; identifies &lt;em&gt;which&lt;/em&gt; queries are slow; &lt;strong&gt;Query Profile&lt;/strong&gt; explains &lt;em&gt;why&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Check &lt;strong&gt;Wall-clock breakdown&lt;/strong&gt; to see if the issue is the warehouse (Queuing) or the SQL (Execution).&lt;/li&gt;
&lt;li&gt;Watch for &lt;strong&gt;Disk Spills&lt;/strong&gt;—they are the primary cause of sudden slowdowns in large joins.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;System Tables&lt;/strong&gt; to move from reactive troubleshooting to a proactive observability practice.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Explain the difference between Scheduling Time and Execution Time in Databricks Query History.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What does a "Spill to Disk" indicate in a Query Profile, and how would you resolve it?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What is Predicate Pushdown, and how can you verify it is working using the Query Profile?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How would you use the &lt;code&gt;system.query.history&lt;/code&gt; table to find the top 5 most expensive queries by data volume?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What’s the most common performance bottleneck you’ve run into—is it usually the SQL logic or the underlying warehouse configuration?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Demystifying Databricks SQL: How Your Queries Actually Run Under the Hood</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Fri, 17 Apr 2026 12:45:29 +0000</pubDate>
      <link>https://forem.com/mayankcse/demystifying-databricks-sql-how-your-queries-actually-run-under-the-hood-4c83</link>
      <guid>https://forem.com/mayankcse/demystifying-databricks-sql-how-your-queries-actually-run-under-the-hood-4c83</guid>
      <description>&lt;p&gt;In the world of big data, writing a SQL query is the easy part. The real challenge—and the mark of a great data engineer—is understanding &lt;strong&gt;how&lt;/strong&gt; that query executes. If you’ve ever stared at a "Running..." status for ten minutes, you know that the "black box" of query execution can be frustrating.&lt;/p&gt;

&lt;p&gt;In this guide, based on insights from industry experts, we’re going to peel back the layers of &lt;strong&gt;Databricks SQL (DBSQL)&lt;/strong&gt;. We’ll explore the architecture, the engine, and the lifecycle of a query so you can stop guessing and start optimizing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: The "Black Box" of Latency
&lt;/h2&gt;

&lt;p&gt;When a query is slow, most developers reflexively increase the cluster size. While "throwing hardware at the problem" sometimes works, it’s expensive and often masks underlying issues like &lt;strong&gt;data skew&lt;/strong&gt;, &lt;strong&gt;poor pruning&lt;/strong&gt;, or &lt;strong&gt;shuffle spills&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Understanding the execution pipeline allows you to identify exactly where the bottleneck lies: is it taking too long to find the files (Metadata), too long to read them (I/O), or too long to process the math (CPU)?&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Big Picture: Databricks SQL Architecture
&lt;/h2&gt;

&lt;p&gt;Before a single row is read, your query travels through a specific ecosystem. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Interface:&lt;/strong&gt; You write your query in the SQL Editor or an external tool (Tableau, Power BI).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Governance Layer:&lt;/strong&gt; &lt;strong&gt;Unity Catalog&lt;/strong&gt; checks if you actually have permission to see that data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Compute Layer:&lt;/strong&gt; The &lt;strong&gt;SQL Warehouse&lt;/strong&gt; (the "engine room") receives the request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Storage Layer:&lt;/strong&gt; Your data lives in &lt;strong&gt;Delta Lake&lt;/strong&gt; on cloud storage (S3, ADLS, or GCS).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choosing Your Engine: SQL Warehouse Types
&lt;/h3&gt;

&lt;p&gt;Not all warehouses are created equal. Your choice dictates performance and cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Serverless:&lt;/strong&gt; The gold standard. Instant start, auto-managed, and uses the high-performance Photon engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro:&lt;/strong&gt; Offers Photon engine benefits but gives you more manual control over configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classic:&lt;/strong&gt; The legacy option. Cheaper per unit but lacks modern optimizations like Predictive I/O.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. The Lifecycle of a Query: From SQL to Results
&lt;/h2&gt;

&lt;p&gt;When you hit "Run," your query undergoes a 5-stage transformation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Parsing:&lt;/strong&gt; The engine checks your syntax. Are the commas in the right place? Does the table &lt;code&gt;orders&lt;/code&gt; actually exist in Unity Catalog?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Logical Planning:&lt;/strong&gt; The engine creates an abstract map of &lt;em&gt;what&lt;/em&gt; you want to do (e.g., "Join Table A and B, then filter").&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Physical Planning (The Optimizer):&lt;/strong&gt; This is where the &lt;strong&gt;Cost-Based Optimizer (CBO)&lt;/strong&gt; looks at table statistics. It decides the most efficient way to join tables—for example, broadcasting a small table to all nodes instead of a massive shuffle.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Execution:&lt;/strong&gt; The &lt;strong&gt;Driver Node&lt;/strong&gt; breaks the plan into small tasks and sends them to &lt;strong&gt;Worker Nodes&lt;/strong&gt;. This is where &lt;strong&gt;Adaptive Query Execution (AQE)&lt;/strong&gt; lives; if the engine notices the data is skewed during the run, it can change the plan on the fly.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Result Delivery &amp;amp; Caching:&lt;/strong&gt; Results are sent back and cached. If you run the exact same query again and the underlying data hasn’t changed, Databricks serves the result from the cache almost instantly.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  3. The Secret Sauce: The Photon Engine
&lt;/h2&gt;

&lt;p&gt;One of the biggest differentiators in Databricks is &lt;strong&gt;Photon&lt;/strong&gt;. Unlike traditional Spark, which runs on the Java Virtual Machine (JVM), Photon is a native C++ engine.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vectorized Execution:&lt;/strong&gt; It processes data in batches (vectors) rather than row-by-row, which is significantly faster for modern CPUs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictive I/O:&lt;/strong&gt; It "guesses" which data blocks you'll need next, reducing the time the CPU spends waiting for data from the cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro-Tip:&lt;/strong&gt; Look for the &lt;strong&gt;Lightning Bolt icon&lt;/strong&gt; in your Query Profile. This indicates the operation was handled by Photon. If it's missing, you've experienced a "Spark Fallback." This often happens if you use complex Python UDFs—try to stick to native SQL functions to keep things in Photon!&lt;/p&gt;
&lt;/blockquote&gt;
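
&lt;p&gt;For instance, a transformation that might tempt you toward a Python UDF can often be expressed with built-ins and stay in Photon (the column names here are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Native SQL functions keep the whole plan in Photon
SELECT
    order_id,
    CASE WHEN amount &amp;gt;= 1000 THEN 'high' ELSE 'standard' END AS tier,
    date_trunc('MONTH', order_date) AS order_month
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
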




&lt;h2&gt;
  
  
  4. Deep Dive: Decoding the Query Profile
&lt;/h2&gt;

&lt;p&gt;To optimize, you must learn to read the &lt;strong&gt;Query Profile&lt;/strong&gt;. It’s the "medical X-ray" of your query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Metrics to Watch:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it tells you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Files Pruned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How many files were skipped. High pruning = Great performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shuffle Spill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data was too big for RAM and spilled to disk. This is a massive speed killer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scan Table&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Shows how much raw data was pulled from storage.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Architecture of a Distributed Join
&lt;/h3&gt;

&lt;p&gt;In a typical distributed join, data is "shuffled" across the network so that matching keys end up on the same worker node.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Practical Implementation: Querying with Best Practices
&lt;/h2&gt;

&lt;p&gt;Here is a real-world example of a query designed to perform well in Databricks SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Using native SQL functions to stay in the Photon Engine&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; 
    &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tpch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; 
    &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tpch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; 
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_custkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;o_custkey&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; 
    &lt;span class="c1"&gt;-- Filtering on a partitioned column (Date) for better pruning&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;o_orderdate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'1995-01-01'&lt;/span&gt; 
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;o_orderdate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="s1"&gt;'1995-12-31'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. Best Practices &amp;amp; Pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ The "Do's"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Right-size your Warehouse:&lt;/strong&gt; Use &lt;strong&gt;2X-Small&lt;/strong&gt; for testing, but move to &lt;strong&gt;Large+&lt;/strong&gt; for heavy ETL to avoid memory pressure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage Serverless:&lt;/strong&gt; It shuts down aggressively when idle, so you stop paying for unused compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check Statistics:&lt;/strong&gt; Ensure your tables have updated statistics (&lt;code&gt;ANALYZE TABLE&lt;/code&gt;) so the Optimizer can make smart choices.&lt;/li&gt;
&lt;/ul&gt;
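
&lt;p&gt;As a sketch of that last point (the table name is illustrative), a statistics refresh looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Give the Cost-Based Optimizer fresh row counts and column statistics
ANALYZE TABLE orders COMPUTE STATISTICS FOR ALL COLUMNS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
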

&lt;h3&gt;
  
  
  ❌ The "Don'ts"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Python UDFs in SQL:&lt;/strong&gt; These force the engine to leave the C++ Photon environment, slowing down execution significantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't ignore the Shuffle:&lt;/strong&gt; If you see high shuffle numbers, consider if your join keys are causing "Data Skew" (where one worker does 90% of the work).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary: Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Warehouses&lt;/strong&gt; are the compute power; Serverless is generally the best choice for speed and cost-efficiency.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Query Profile&lt;/strong&gt; is your best friend for identifying bottlenecks like low file pruning or shuffle spills.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Photon&lt;/strong&gt; is the high-performance C++ engine that powers modern Databricks SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Query Execution (AQE)&lt;/strong&gt; optimizes your query &lt;em&gt;while&lt;/em&gt; it is running.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;What is the difference between the Logical Plan and the Physical Plan in Databricks SQL?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does Predictive I/O improve query performance compared to standard cloud storage reads?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;What are "Spark Fallbacks," and how do they impact the performance of the Photon engine?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Building Resilient AI: Architectural Patterns for Event-Driven Agents</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Sun, 12 Apr 2026 11:05:08 +0000</pubDate>
      <link>https://forem.com/mayankcse/building-resilient-ai-architectural-patterns-for-event-driven-agents-3i71</link>
      <guid>https://forem.com/mayankcse/building-resilient-ai-architectural-patterns-for-event-driven-agents-3i71</guid>
      <description>&lt;p&gt;In the rush to build the next generation of "agentic" AI systems, developers often focus on the LLM's reasoning capabilities while neglecting the pipes that carry the data. But here is the hard truth: &lt;strong&gt;Most agentic systems fail or fly based on one decision—how you design your infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you move from a simple chatbot to an autonomous agent that can process orders, detect fraud, or triage support tickets, you are no longer just making API calls. You are managing state, concurrency, and reliability across a distributed landscape. &lt;/p&gt;

&lt;p&gt;In this guide, we’ll explore how to build a robust backbone for your AI agents using event-driven architecture (EDA).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: The Fragility of Synchronous Agents
&lt;/h2&gt;

&lt;p&gt;Traditional "request-response" architectures are brittle. If an agent calls a payment service and that service is down, the agent hangs. Even worse, if the agent completes a task but the network blips before it can save the result, you end up with "ghost actions"—money spent, but no record of the transaction.&lt;/p&gt;

&lt;p&gt;As we scale AI agents, we face three primary challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Blast Radius:&lt;/strong&gt; One failing component shouldn't crash the entire agent swarm.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;State Inconsistency:&lt;/strong&gt; Ensuring the agent's "brain" and the system's database always agree.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Throughput vs. Latency:&lt;/strong&gt; Balancing the need for speed with the reality of heavy processing loads.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Choosing Your Backbone: Centralized vs. Federated
&lt;/h2&gt;

&lt;p&gt;How you route events defines your system's DNA.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Event Bus:&lt;/strong&gt; A single backbone (like a corporate Kafka cluster) offers strong governance, consistent security, and a single place to observe everything. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Federated/Decentralized:&lt;/strong&gt; Each domain owns its own bus. This creates "failure domains," meaning a spike in your "Triage Agent" won't take down your "Payment Agent."&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Toolbelt: Kafka vs. NATS vs. Azure
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Key Feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long-term durability &amp;amp; replay&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Consumer Groups:&lt;/strong&gt; Allows different teams to scale and process the same stream independently.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NATS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-performance "Walkie-Talkie"&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Fire-and-forget:&lt;/strong&gt; Ultra-low latency. Use &lt;strong&gt;JetStream&lt;/strong&gt; if you eventually need persistence.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Trio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enterprise Cloud Ecosystem&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Event Hubs&lt;/strong&gt; (Streaming), &lt;strong&gt;Service Bus&lt;/strong&gt; (Messaging), &lt;strong&gt;Event Grid&lt;/strong&gt; (Serverless/SaaS).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Maintaining Consistency: Sagas and Outboxes
&lt;/h2&gt;

&lt;p&gt;In an event-driven world, we don't use traditional distributed transactions (which lock databases and kill performance). Instead, we use the &lt;strong&gt;Saga Pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Saga Pattern
&lt;/h3&gt;

&lt;p&gt;A Saga is a multi-step story told through events. If Step 3 fails, the system triggers "compensating actions" to undo Steps 1 and 2. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; An Order Agent charges a card but finds the item is out of stock. The Saga triggers a refund event automatically.&lt;/p&gt;
&lt;/blockquote&gt;
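&lt;p&gt;The refund flow above can be sketched in a few lines. This is a minimal illustration, not a production saga orchestrator; &lt;code&gt;charge_card&lt;/code&gt;, &lt;code&gt;reserve_stock&lt;/code&gt;, and &lt;code&gt;refund_card&lt;/code&gt; are hypothetical stand-ins for real service calls:&lt;/p&gt;

```python
# Minimal saga sketch: each step is a local transaction; a failure
# triggers the compensating action for the steps already completed.
def charge_card(order):
    return {"charge_id": "ch_001"}

def reserve_stock(order):
    # Simulate the failure described above: the item is out of stock.
    raise RuntimeError("out_of_stock")

def refund_card(charge):
    print(f"Compensating: refunding charge {charge['charge_id']}")

def run_order_saga(order):
    charge = charge_card(order)      # Step 1: local transaction
    try:
        reserve_stock(order)         # Step 2: next local transaction
    except RuntimeError:
        refund_card(charge)          # Compensating action undoes Step 1
        return "compensated"
    return "completed"

print(run_order_saga({"item": "A1", "amount": 49.99}))
```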

&lt;h3&gt;
  
  
  The Outbox Pattern
&lt;/h3&gt;

&lt;p&gt;To prevent the "Internal State Updated but Event Not Sent" bug, use an &lt;strong&gt;Outbox&lt;/strong&gt;. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your service writes the business change and the event to the &lt;em&gt;same&lt;/em&gt; database in one transaction.&lt;/li&gt;
&lt;li&gt;A background publisher reads that "Outbox" table and pushes the event to the bus.&lt;/li&gt;
&lt;li&gt;This guarantees that state and events are always in sync.&lt;/li&gt;
&lt;/ol&gt;
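&lt;p&gt;The three steps above can be sketched with SQLite standing in for your database; the table layout and the in-memory &lt;code&gt;bus&lt;/code&gt; list are illustrative stand-ins for a real store and broker:&lt;/p&gt;

```python
import json
import sqlite3

# Outbox pattern sketch: the business row and the event row commit together.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id):
    # 1. Business change and event land in the SAME transaction.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        event = json.dumps({"type": "order_placed", "order_id": order_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))

def publish_pending(bus):
    # 2. A background publisher drains the outbox to the event bus.
    rows = conn.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        bus.append(json.loads(payload))  # stand-in for a real bus client
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

bus = []
place_order("ORD-1")
publish_pending(bus)
print(bus)
```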




&lt;h2&gt;
  
  
  3. Implementation: Idempotency and Concurrency
&lt;/h2&gt;

&lt;p&gt;In distributed systems, &lt;strong&gt;"Exactly Once" delivery is a myth&lt;/strong&gt; (or at least, incredibly expensive). Aim for &lt;strong&gt;Effectively Once&lt;/strong&gt; by using &lt;strong&gt;Idempotency Keys&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Example: Idempotent Event Handler (Python)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Redis for effect logging
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_payment_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idempotency_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Check if we've already processed this specific event
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Duplicate event &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ignored.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;already_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 2. Perform the business logic
&lt;/span&gt;        &lt;span class="nf"&gt;execute_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Log the effect with an expiration (e.g., 24 hours)
&lt;/span&gt;        &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 4. Handle failure (allow for retry)
&lt;/span&gt;        &lt;span class="nf"&gt;log_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Use &lt;strong&gt;Optimistic Concurrency&lt;/strong&gt;. Instead of locking a row, use an &lt;code&gt;ETag&lt;/code&gt; or &lt;code&gt;version_number&lt;/code&gt;. If two agents try to update the same record, the second one will fail the version check and can retry with fresh data.&lt;/p&gt;
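&lt;p&gt;A toy version of that version check, with the record and its fields invented purely for illustration:&lt;/p&gt;

```python
# Optimistic concurrency sketch: the write succeeds only if nobody
# bumped the version number since our read.
class VersionConflict(Exception):
    pass

record = {"balance": 100, "version": 1}

def update_balance(rec, expected_version, new_balance):
    if rec["version"] != expected_version:
        raise VersionConflict("stale read; refetch and retry")
    rec["balance"] = new_balance
    rec["version"] = expected_version + 1

update_balance(record, 1, 80)       # agent A wins the race
try:
    update_balance(record, 1, 90)   # agent B also read version 1, now stale
except VersionConflict:
    print("Agent B retries with fresh data")
```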




&lt;h2&gt;
  
  
  4. Avoiding the "2 AM" Pitfalls
&lt;/h2&gt;

&lt;p&gt;Real-world systems get "weird." Here is how to guard against the most common failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dead Letter Queues (DLQ):&lt;/strong&gt; When an agent fails to process a "poison message" (bad data), don't let it block the line. Route it to a DLQ for manual inspection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Storms:&lt;/strong&gt; Sudden bursts of retries can act like a self-inflicted DDoS attack. Use &lt;strong&gt;Rate Limiting&lt;/strong&gt; (Token Buckets) at the edge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hot Partitions:&lt;/strong&gt; If all your events use the same ID (e.g., "User_1"), one server gets crushed while others sit idle. &lt;strong&gt;Hash your partition keys&lt;/strong&gt; to spread the load.&lt;/li&gt;
&lt;/ul&gt;
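&lt;p&gt;A token bucket is easy to sketch. The capacity and refill rate below are illustrative; a production limiter would sit at the ingress, not inside the handler:&lt;/p&gt;

```python
import time

# Minimal token-bucket limiter: a burst drains the bucket, which then
# refills at a steady rate.
class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if int(self.tokens) == 0:
            return False         # over budget: shed or queue the event
        self.tokens -= 1.0
        return True

bucket = TokenBucket(capacity=3, refill_per_sec=1)
results = [bucket.allow() for _ in range(5)]
print(results)  # the burst of 5 drains the bucket: first 3 pass, rest are shed
```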




&lt;h2&gt;
  
  
  5. Performance: Batching and Backpressure
&lt;/h2&gt;

&lt;p&gt;Performance is a three-legged stool: Latency, Throughput, and Backpressure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batching:&lt;/strong&gt; Grouping 100 events into one network call trades a little latency for massive throughput gains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit Breakers:&lt;/strong&gt; If a downstream LLM provider is timing out, the circuit breaker "trips." The agent immediately fails-fast with a graceful message rather than making users wait 30 seconds for a timeout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-warming:&lt;/strong&gt; For serverless agents (like Azure Functions), use "Premium" plans or "Always-on" instances to avoid &lt;strong&gt;Cold Start&lt;/strong&gt; latency during critical paths.&lt;/li&gt;
&lt;/ul&gt;
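&lt;p&gt;The circuit-breaker idea fits in a short sketch. This is a bare-bones version with illustrative thresholds; &lt;code&gt;flaky_llm_call&lt;/code&gt; is a hypothetical stand-in for a timing-out provider:&lt;/p&gt;

```python
import time

# Bare-bones circuit breaker: after max_failures, trip open and fail fast
# until reset_after seconds have passed, then allow a trial call.
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            remaining = max(0.0, self.reset_after - (time.monotonic() - self.opened_at))
            if remaining:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures == self.max_failures:
                self.opened_at = time.monotonic()
                self.failures = 0
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker(max_failures=2, reset_after=30.0)

def flaky_llm_call():
    raise TimeoutError("provider timed out")

for _ in range(3):
    try:
        breaker.call(flaky_llm_call)
    except TimeoutError:
        print("call failed")             # first two attempts hit the provider
    except RuntimeError as err:
        print(err)                       # third attempt fails fast instead
```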




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Design for the "Bad Day":&lt;/strong&gt; Assume events will be duplicated, out of order, or delayed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency is King:&lt;/strong&gt; Every action an agent takes should be safe to repeat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the Right Tool:&lt;/strong&gt; Kafka for history, NATS for speed, Cloud-native buses for ease of integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observe and Partition:&lt;/strong&gt; Keep your "junk drawer" clean by using well-defined topic schemas.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;What is the difference between a Saga and a distributed transaction (2PC)?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Answer:&lt;/em&gt; 2PC locks resources and can hinder throughput; Sagas use asynchronous local transactions and compensating actions for better scalability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does the Outbox Pattern ensure atomicity?&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Answer:&lt;/em&gt; It uses a single database transaction to commit both the record update and the event message, ensuring they either both succeed or both fail.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Explain "Effectively Once" processing.&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Answer:&lt;/em&gt; It is the combination of "At Least Once" delivery and an idempotent consumer that filters out duplicates using unique keys.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>automation</category>
    </item>
    <item>
      <title>Foundations of Event-Driven Agentic Systems: From Chatbots to Proactive Teammates</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Sun, 12 Apr 2026 10:42:30 +0000</pubDate>
      <link>https://forem.com/mayankcse/foundations-of-event-driven-agentic-systems-from-chatbots-to-proactive-teammates-2ji2</link>
      <guid>https://forem.com/mayankcse/foundations-of-event-driven-agentic-systems-from-chatbots-to-proactive-teammates-2ji2</guid>
      <description>&lt;p&gt;In the world of Generative AI, we often think of "agents" as sophisticated chatbots waiting for a user to type a prompt. But in a production environment, the world doesn't wait for a prompt. Systems are constantly whispering (or shouting) through a stream of data: "Payment declined," "Sensor spike detected," "Order shipped."&lt;/p&gt;

&lt;p&gt;To build AI that actually &lt;em&gt;works&lt;/em&gt; in the real world, we have to move away from request-response loops and toward &lt;strong&gt;Event-Driven Agentic Architecture&lt;/strong&gt;. In this post, we’ll explore how to build agents that don't just answer questions, but react to the heartbeat of your business in real-time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: The Latency &amp;amp; Context Gap
&lt;/h2&gt;

&lt;p&gt;Traditional AI applications suffer from two main issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;High Latency:&lt;/strong&gt; Users won't wait 10 seconds for an agent to "think" while a process hangs.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Stale Context:&lt;/strong&gt; If an agent isn't fed real-time data, it makes decisions based on yesterday’s news.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a customer’s payment fails, they expect an immediate notification or a retry. If an agent has to wait for a manual trigger to check the logs, the "magic" of AI evaporates. We need systems that &lt;strong&gt;push&lt;/strong&gt; context to agents the moment it exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Concepts: The Language of Events
&lt;/h2&gt;

&lt;p&gt;Before building, we must define the vocabulary of an event-driven world. These aren't just synonyms; they have specific technical implications.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Event&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An immutable record of the past.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;UserLoggedIn&lt;/code&gt;, &lt;code&gt;SnoozeClicked&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Command&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An instruction to perform an action.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SendEmail&lt;/code&gt;, &lt;code&gt;ProcessRefund&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An event worth keeping forever for audit/memory.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Order_123_Shipped&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stream&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An append-only sequence of events.&lt;/td&gt;
&lt;td&gt;A Kafka topic or AWS Kinesis stream.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Saga&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A coordinator for long-running workflows.&lt;/td&gt;
&lt;td&gt;Managing a booking that spans 3 services.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
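&lt;p&gt;The event/command distinction is easy to encode in types. A small sketch, with the class and field names invented for illustration:&lt;/p&gt;

```python
import time
from dataclasses import dataclass, field

# Events record the past, so we freeze them; commands request an action
# and may be amended before dispatch.
@dataclass(frozen=True)
class Event:
    name: str            # e.g. "UserLoggedIn"
    payload: dict
    occurred_at: float = field(default_factory=time.time)

@dataclass
class Command:
    name: str            # e.g. "SendEmail"
    payload: dict

evt = Event("UserLoggedIn", {"user_id": "U9921"})
cmd = Command("SendEmail", {"to": "U9921"})
print(evt.name, cmd.name)
```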

&lt;h3&gt;
  
  
  The "Saga" Pattern
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Saga&lt;/strong&gt; is critical for agents. If an agent issues a &lt;code&gt;RefundCommand&lt;/code&gt; but the refund service is down, the Saga ensures a &lt;strong&gt;compensating action&lt;/strong&gt; occurs (like alerting a human or retrying with a different gateway) to keep the system consistent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dive: System Architecture
&lt;/h2&gt;

&lt;p&gt;An Event-Driven Agentic system functions like a high-speed nervous system. Instead of the agent polling a database, the database (or service) emits a signal that "wakes up" the agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Messaging Patterns
&lt;/h3&gt;

&lt;p&gt;How do these signals reach our agents?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Webhooks:&lt;/strong&gt; The "doorbell." A third party (like Stripe) pings your URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pub/Sub (Publish/Subscribe):&lt;/strong&gt; The "bulletin board." One event (e.g., &lt;code&gt;NewPurchase&lt;/code&gt;) is broadcast to multiple agents—one for fraud detection, one for inventory, and one for a personalized thank-you note.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDC (Change Data Capture):&lt;/strong&gt; The "security camera." Every tiny update in your SQL or NoSQL database is turned into a stream of events for the agent to watch.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Code Example: Building a Reactive Agent
&lt;/h2&gt;

&lt;p&gt;Let's look at a Python-based example of simple event-driven logic, where an agent reacts to a &lt;code&gt;payment_failed&lt;/code&gt; event.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Simulated Event from a Message Queue (like RabbitMQ or NATS)
&lt;/span&gt;&lt;span class="n"&gt;event_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;U9921&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insufficient_funds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;49.99&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgenticSystem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;etype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;etype&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_recovery_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_recovery_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- Agent Analysis Starting ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Step 1: Fact Gathering (Context)
&lt;/span&gt;        &lt;span class="c1"&gt;# In a real system, the agent might query a RAG store here
&lt;/span&gt;        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed a payment of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 2: Agent Action (Command)
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action: Issuing &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Offer_Alternative_Payment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; command to User &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 3: Emit new Event
&lt;/span&gt;        &lt;span class="n"&gt;new_event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recovery_flow_initiated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Running the system
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgenticSystem&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handle_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-World Applications
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;RAG Refresh:&lt;/strong&gt; Instead of manually re-indexing your documents every night, an agent listens to your GitHub or Notion webhooks. The moment you save a doc, the agent updates your Vector Database.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Commerce Fraud:&lt;/strong&gt; Agents act as "store detectives," monitoring IP address spikes or rapid-fire purchases to freeze accounts before the money leaves the building.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Ops Runbooks:&lt;/strong&gt; When a server's disk hits 90%, an event triggers an agent to clear temp files, log the action, and summarize the incident for the dev team.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Best Practices &amp;amp; Pitfalls
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency is King:&lt;/strong&gt; Agents might receive the same event twice (network hiccups). Ensure that running the same event twice doesn't charge the customer twice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Loop" Trap:&lt;/strong&gt; Be careful. An agent's action could trigger an event that triggers the same agent. Use "Guardrails" to prevent infinite AI loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify Signatures:&lt;/strong&gt; If you're using webhooks, always verify the cryptographic signature. Don't let unauthorized "doorbells" trigger your expensive AI workflows.&lt;/li&gt;
&lt;/ul&gt;
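&lt;p&gt;Signature verification is a few lines of standard-library code. The secret and scheme below are illustrative; real providers such as Stripe and GitHub each document their own signing header and format:&lt;/p&gt;

```python
import hashlib
import hmac

# Verify a webhook body's HMAC-SHA256 signature before acting on it.
SECRET = b"whsec_demo_secret"  # illustrative; load from config in practice

def sign(body):
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def handle_webhook(body, signature_header):
    expected = sign(body)
    # compare_digest avoids timing attacks on the comparison itself
    if not hmac.compare_digest(expected, signature_header):
        return "rejected"
    return "accepted"          # safe to trigger the expensive AI workflow

body = b'{"event_type": "payment_failed"}'
print(handle_webhook(body, sign(body)))          # accepted
print(handle_webhook(body, "forged-signature"))  # rejected
```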




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Events are History:&lt;/strong&gt; They are immutable and tell us what happened.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sagas provide Safety:&lt;/strong&gt; They handle failures in multi-step agent workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push over Pull:&lt;/strong&gt; Use Webhooks or Pub/Sub to reduce latency and keep agents "live."&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interview Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;What is the difference between an Event and a Command in an agentic system?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;How does Change Data Capture (CDC) help in maintaining a Retrieval-Augmented Generation (RAG) system?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Explain the concept of a 'Compensating Action' within a Saga.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Why is MQTT preferred over HTTP for IoT-based agents?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>eventdriven</category>
      <category>agents</category>
    </item>
    <item>
      <title>🚀 Swiggy System Design: How a Food Delivery Giant Scales to Millions</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Sat, 31 Jan 2026 08:46:32 +0000</pubDate>
      <link>https://forem.com/mayankcse/swiggy-system-design-how-a-food-delivery-giant-scales-to-millions-2mlb</link>
      <guid>https://forem.com/mayankcse/swiggy-system-design-how-a-food-delivery-giant-scales-to-millions-2mlb</guid>
      <description>&lt;p&gt;Food delivery apps like &lt;strong&gt;Swiggy&lt;/strong&gt; look deceptively simple on the surface—search, order, track, eat 😄&lt;br&gt;
But behind the scenes, they operate one of the &lt;strong&gt;most complex real-time distributed systems&lt;/strong&gt; in production today.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll break down &lt;strong&gt;Swiggy’s system design&lt;/strong&gt; step by step—from requirements to APIs and high-level architecture—based on this excellent walkthrough:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Design Video (Reference)&lt;/strong&gt;&lt;br&gt;
👉 &lt;a href="https://youtu.be/xQnY-DDhEBw" rel="noopener noreferrer"&gt;https://youtu.be/xQnY-DDhEBw&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;

&lt;p&gt;Design a &lt;strong&gt;food delivery platform&lt;/strong&gt; similar to &lt;strong&gt;Swiggy / Zomato / Uber Eats&lt;/strong&gt; that allows users to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discover nearby restaurants&lt;/li&gt;
&lt;li&gt;Place orders&lt;/li&gt;
&lt;li&gt;Make payments&lt;/li&gt;
&lt;li&gt;Track delivery partners in real time&lt;/li&gt;
&lt;li&gt;Receive notifications at every stage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system must scale to &lt;strong&gt;millions of users and restaurants&lt;/strong&gt; while remaining fast and reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Functional Requirements
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqocr6argzf3gi484p88m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqocr6argzf3gi484p88m.png" alt="Swiggy-Zomato-System-Design-Functional-Requirement" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  User Side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;User registration &amp;amp; authentication&lt;/li&gt;
&lt;li&gt;Profile management and order history&lt;/li&gt;
&lt;li&gt;Location-based restaurant discovery&lt;/li&gt;
&lt;li&gt;Search restaurants by name and menu&lt;/li&gt;
&lt;li&gt;View dynamic menus (price, availability, images)&lt;/li&gt;
&lt;li&gt;Add multiple items to cart&lt;/li&gt;
&lt;li&gt;Secure payments&lt;/li&gt;
&lt;li&gt;Real-time order tracking&lt;/li&gt;
&lt;li&gt;Notifications for every order state&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Restaurant Side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Accept or reject orders&lt;/li&gt;
&lt;li&gt;Update menu availability&lt;/li&gt;
&lt;li&gt;Manage incoming orders&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Delivery Partner Side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Driver discovery &amp;amp; assignment&lt;/li&gt;
&lt;li&gt;Location updates&lt;/li&gt;
&lt;li&gt;Optimized delivery routing&lt;/li&gt;
&lt;li&gt;ETA calculation&lt;/li&gt;
&lt;/ul&gt;
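&lt;p&gt;As a rough sketch of ETA calculation, we can combine the haversine distance with an assumed average rider speed. The coordinates and the 20 km/h figure are illustrative assumptions, not Swiggy's actual routing model:&lt;/p&gt;

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres.
    r = 6371.0  # Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def eta_minutes(distance_km, avg_speed_kmph=20.0):
    # Straight-line distance over an assumed average speed; real systems
    # use road networks and live traffic instead.
    return distance_km / avg_speed_kmph * 60.0

# Approximate coordinates: Koramangala restaurant to Indiranagar customer
d = haversine_km(12.9352, 77.6245, 12.9719, 77.6412)
print(round(d, 1), "km, about", round(eta_minutes(d)), "min")
```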




&lt;h2&gt;
  
  
  Step 2: Non-Functional Requirements
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx8daxspcrv9yto5smtx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx8daxspcrv9yto5smtx.png" alt="Swiggy-Zomato-System-Design-Non-Functional-Requirement" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Expectation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50M users, ~1M restaurants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Search &amp;amp; discovery must always work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Payments &amp;amp; inventory must be accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low latency for search &amp;amp; tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fault-tolerant order flow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Step 3: API Design (High Level)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmlwzhulqqycl89m7cb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmlwzhulqqycl89m7cb8.png" alt="Swiggy-Zomato-API-Design" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication &amp;amp; User
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /api/v1/auth/register
POST /api/v1/auth/login
GET  /api/v1/users/me
POST /api/v1/users/location
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Restaurants &amp;amp; Search
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /api/v1/restaurants/nearby
GET /api/v1/restaurants/{restaurantId}
GET /api/v1/restaurants/search
GET /api/v1/restaurants/{restaurantId}/menu
GET /api/v1/menu/search
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cart &amp;amp; Orders
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST   /api/v1/cart/items
DELETE /api/v1/cart/items/{itemId}
POST   /api/v1/orders
GET    /api/v1/orders/{orderId}/tracking
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Delivery &amp;amp; Tracking
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST  /api/v1/delivery/assign
PATCH /api/v1/delivery/orders/{orderId}/accept
POST  /api/v1/delivery/location
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Notifications
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /api/v1/notifications/send
GET  /api/v1/notifications
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 4: High-Level Design (HLD)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Entry Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Load Balancer&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;API Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authentication &amp;amp; Authorization&lt;/li&gt;
&lt;li&gt;Rate limiting&lt;/li&gt;
&lt;li&gt;Request routing&lt;/li&gt;
&lt;li&gt;Load balancing (Round Robin)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
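&lt;p&gt;The gateway's round-robin routing can be sketched in a few lines. This is a minimal illustration, not a real gateway implementation, and the backend names are made up:&lt;/p&gt;

```python
from itertools import cycle

# Minimal round-robin load balancer sketch.
# Backend names are purely illustrative.
class RoundRobinBalancer:
    def __init__(self, backends):
        self._cycle = cycle(backends)  # endless iterator over backends

    def next_backend(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["order-svc-1", "order-svc-2", "order-svc-3"])
first_four = [lb.next_backend() for _ in range(4)]
print(first_four)
```

&lt;p&gt;After one full pass, the fourth request wraps back to the first backend, which is the defining property of round robin.&lt;/p&gt;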




&lt;h3&gt;
  
  
  Microservices Architecture
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auth, profile, location&lt;/td&gt;
&lt;td&gt;User DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Search Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nearby restaurants, filtering&lt;/td&gt;
&lt;td&gt;Restaurant DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cart Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cart operations&lt;/td&gt;
&lt;td&gt;Cart DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Order Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Order lifecycle&lt;/td&gt;
&lt;td&gt;Order DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Payment Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Payment processing&lt;/td&gt;
&lt;td&gt;External PG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delivery Matching Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Driver assignment&lt;/td&gt;
&lt;td&gt;Driver DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Location Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time driver tracking&lt;/td&gt;
&lt;td&gt;Geo Store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Notification Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Push &amp;amp; in-app alerts&lt;/td&gt;
&lt;td&gt;Event Store&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4lt9152za8qt4337g49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4lt9152za8qt4337g49.png" alt="Swiggy-Zomato-High-Level-Design" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-Time Delivery Tracking (Key Insight)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Drivers continuously send GPS updates&lt;/li&gt;
&lt;li&gt;Location Service processes updates&lt;/li&gt;
&lt;li&gt;Users poll or subscribe via WebSockets&lt;/li&gt;
&lt;li&gt;ETA recalculated dynamically&lt;/li&gt;
&lt;li&gt;Notifications triggered on status changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where &lt;strong&gt;event-driven architecture&lt;/strong&gt; and &lt;strong&gt;async messaging&lt;/strong&gt; (Kafka / SQS / PubSub) shine.&lt;/p&gt;
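&lt;p&gt;The tracking flow above can be sketched as a tiny in-memory publish/subscribe loop. All names here (&lt;code&gt;LocationService&lt;/code&gt;, &lt;code&gt;haversine_km&lt;/code&gt;, the assumed average speed) are illustrative; a production system would publish updates through Kafka, SQS, or Pub/Sub rather than direct callbacks:&lt;/p&gt;

```python
import math
from collections import defaultdict

AVG_SPEED_KMPH = 25  # assumed average driver speed for the ETA estimate

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

class LocationService:
    def __init__(self):
        self.positions = {}                   # driver_id -> latest (lat, lon)
        self.subscribers = defaultdict(list)  # driver_id -> user callbacks

    def subscribe(self, driver_id, callback):
        self.subscribers[driver_id].append(callback)

    def publish_update(self, driver_id, lat, lon, dest):
        """Driver GPS ping: store position, recompute ETA, notify subscribers."""
        self.positions[driver_id] = (lat, lon)
        eta_min = haversine_km(lat, lon, dest[0], dest[1]) / AVG_SPEED_KMPH * 60
        for cb in self.subscribers[driver_id]:
            cb(driver_id, eta_min)

svc = LocationService()
etas = []
svc.subscribe("driver-42", lambda d, eta: etas.append(round(eta, 1)))
svc.publish_update("driver-42", 12.9716, 77.5946, dest=(12.9352, 77.6245))
```

&lt;p&gt;Each GPS update fans out to every subscribed client with a freshly recomputed ETA, which is exactly the behaviour a message broker provides at scale.&lt;/p&gt;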




&lt;h2&gt;
  
  
  Design Trade-offs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search = High Availability&lt;/strong&gt;
Even stale data is acceptable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Payments = Strong Consistency&lt;/strong&gt;
No double charges, no missing orders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracking = Eventual Consistency&lt;/strong&gt;
Minor delays are acceptable&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Designing Swiggy isn’t about cramming features—it’s about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fault isolation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency optimization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Correctness where it matters&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re preparing for &lt;strong&gt;system design interviews&lt;/strong&gt; or building &lt;strong&gt;large-scale distributed systems&lt;/strong&gt;, this architecture is a goldmine.&lt;/p&gt;

&lt;p&gt;Don’t forget to check out the original video:&lt;br&gt;
👉 &lt;a href="https://youtu.be/xQnY-DDhEBw" rel="noopener noreferrer"&gt;https://youtu.be/xQnY-DDhEBw&lt;/a&gt;&lt;/p&gt;




</description>
      <category>webdev</category>
      <category>programming</category>
      <category>distributedsystems</category>
      <category>ai</category>
    </item>
    <item>
      <title>Clustering News Articles for Topic Detection: A Technical Deep Dive</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Sun, 22 Jun 2025 11:22:31 +0000</pubDate>
      <link>https://forem.com/mayankcse/clustering-news-articles-for-topic-detection-a-technical-deep-dive-2692</link>
      <guid>https://forem.com/mayankcse/clustering-news-articles-for-topic-detection-a-technical-deep-dive-2692</guid>
      <description>&lt;p&gt;With the explosive growth of digital journalism, news readers and analysts often find themselves overwhelmed by an avalanche of information from numerous sources. Imagine a journalist trying to keep up with evolving stories across platforms like Times of India, CNN, and BBC, where the same events are covered from different angles and styles. This creates a dire need for systems that can &lt;em&gt;automatically detect and group related news stories&lt;/em&gt; — a challenge the research paper tackles head-on using clustering-based topic detection techniques.&lt;/p&gt;

&lt;p&gt;In this blog, we break down their methodology, discuss alternative approaches, and explain why &lt;strong&gt;agglomerative hierarchical clustering&lt;/strong&gt; was chosen as the foundation for the topic detection system.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Problem: Making Sense of News Floods
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 What Is Topic Detection?
&lt;/h3&gt;

&lt;p&gt;Topic Detection is the unsupervised process of identifying distinct subjects or themes within a collection of text — here, news articles. The aim is to detect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;New topics&lt;/strong&gt; (e.g., breaking news)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subsequent articles&lt;/strong&gt; covering those topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships between different articles&lt;/strong&gt; on the same event&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables systems to identify story boundaries and link news content semantically, even when the articles come from different publishers or regions.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Why Is It Important?
&lt;/h3&gt;

&lt;p&gt;A few applications include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;News aggregators&lt;/strong&gt; like Google News wanting to show “related stories”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Media analysts&lt;/strong&gt; tracking how stories evolve&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprises&lt;/strong&gt; monitoring press mentions of their competitors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governments&lt;/strong&gt; watching for sudden geopolitical shifts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Available Methods for Topic Detection
&lt;/h2&gt;

&lt;p&gt;Before zooming into the chosen method, let’s look at the landscape of available techniques for detecting topics in unstructured text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8ffpncqx9493p973uyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8ffpncqx9493p973uyx.png" alt="Topic-Detection-Methods" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Rule-based and Heuristic Methods
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use keyword matching, regex rules, and metadata (tags, categories)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drawback&lt;/strong&gt;: Brittle and inflexible to language evolution or phrasing variations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 Supervised Learning Approaches
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use labeled datasets to train classifiers (e.g., SVM, Naïve Bayes, Decision Trees)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drawback&lt;/strong&gt;: Need labeled examples for each topic; fails with unseen events&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.3 Deep Learning Methods
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Models like lda2vec, BERTopic, or LSTM-based classifiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strength&lt;/strong&gt;: Capture contextual semantics well&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drawback&lt;/strong&gt;: Computationally expensive, harder to interpret, and require large training sets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.4 Clustering Techniques (Chosen by the Researchers)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unsupervised&lt;/strong&gt;: No labeled data required&lt;/li&gt;
&lt;li&gt;Finds naturally occurring groupings in text based on similarity&lt;/li&gt;
&lt;li&gt;Suitable when new, unknown topics may emerge dynamically&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Why Agglomerative Hierarchical Clustering?
&lt;/h2&gt;

&lt;p&gt;The researchers specifically opted for &lt;strong&gt;Agglomerative Hierarchical Clustering (AHC)&lt;/strong&gt; with &lt;strong&gt;average linkage&lt;/strong&gt;, due to the following reasons:&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 No Need for Predefined Cluster Count
&lt;/h3&gt;

&lt;p&gt;Unlike K-means (which requires specifying k in advance), AHC builds a &lt;strong&gt;tree of clusters (dendrogram)&lt;/strong&gt; from the bottom up—each document starts in its own cluster, and clusters are merged based on similarity.&lt;/p&gt;

&lt;p&gt;This is ideal for unpredictable, real-world news data where the number of topics is not known beforehand.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Handles Multi-topic Overlaps and Duplicates
&lt;/h3&gt;

&lt;p&gt;The dataset contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different articles covering the same event (with different styles)&lt;/li&gt;
&lt;li&gt;Near-duplicates from press agencies re-used by various outlets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AHC, with &lt;strong&gt;average linkage&lt;/strong&gt;, balances between complete and single linkage to handle such redundancies and overlaps effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Outlier Robustness
&lt;/h3&gt;

&lt;p&gt;Using &lt;strong&gt;average distance&lt;/strong&gt; (rather than minimum or maximum) mitigates sensitivity to noisy or outlier articles—important for large, heterogeneous news datasets.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Preprocessing Pipeline
&lt;/h2&gt;

&lt;p&gt;Before clustering, textual data undergoes a series of NLP preprocessing steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 Tokenization
&lt;/h3&gt;

&lt;p&gt;Splits text into individual words (tokens) for processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
Input: "Text mining extracts useful information."&lt;br&gt;
Output: [Text, mining, extracts, useful, information]&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Stopword Removal
&lt;/h3&gt;

&lt;p&gt;Eliminates common but uninformative words like &lt;em&gt;the&lt;/em&gt;, &lt;em&gt;is&lt;/em&gt;, &lt;em&gt;and&lt;/em&gt;, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Stemming
&lt;/h3&gt;

&lt;p&gt;Reduces words to their root form for better matching.&lt;br&gt;
&lt;strong&gt;Example&lt;/strong&gt;: walking, walks, walked → walk&lt;/p&gt;

&lt;p&gt;This reduces vocabulary sparsity and improves similarity calculations.&lt;/p&gt;
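&lt;p&gt;The three preprocessing steps can be chained as a small sketch. The stopword set and suffix rules below are deliberately tiny stand-ins for real components such as NLTK's stopword list and the Porter stemmer:&lt;/p&gt;

```python
import re

# Tiny illustrative stopword list (a real pipeline would use a full one).
STOPWORDS = {"the", "is", "and", "a", "an", "of", "for", "to", "in"}

def tokenize(text):
    # Lowercase and split into alphabetic tokens.
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # Naive suffix stripping, a stand-in for a Porter-style stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("Text mining extracts useful information."))
```

&lt;p&gt;With these toy rules, &lt;em&gt;walking&lt;/em&gt;, &lt;em&gt;walks&lt;/em&gt;, and &lt;em&gt;walked&lt;/em&gt; all reduce to &lt;em&gt;walk&lt;/em&gt;, matching the example above.&lt;/p&gt;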




&lt;h2&gt;
  
  
  5. Similarity Calculation Using TF-IDF + Cosine Distance
&lt;/h2&gt;

&lt;p&gt;Each news article is vectorized using &lt;strong&gt;TF-IDF (Term Frequency – Inverse Document Frequency)&lt;/strong&gt;, which emphasizes terms that are important within a document but rare across documents.&lt;/p&gt;

&lt;p&gt;Then, &lt;strong&gt;Cosine Similarity&lt;/strong&gt; is used to measure document closeness:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vj91vioazva4k5ausrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vj91vioazva4k5ausrh.png" alt="Cosine Similarity" width="614" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This ensures that similarity is based on direction (not magnitude) of the document vectors—ideal when documents vary in length.&lt;/p&gt;
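&lt;p&gt;A from-scratch sketch of this step follows, using a three-document toy corpus. In practice scikit-learn's &lt;code&gt;TfidfVectorizer&lt;/code&gt; and &lt;code&gt;cosine_similarity&lt;/code&gt; would be used instead:&lt;/p&gt;

```python
import math
from collections import Counter

docs = [
    "india wins cricket world cup final",
    "cricket world cup final thriller in india",
    "stock markets rally after budget announcement",
]

def tfidf_vectors(corpus):
    tokenized = [doc.split() for doc in corpus]
    n = len(tokenized)
    # Document frequency: how many documents each term appears in.
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # TF-IDF weight: term frequency scaled by inverse document frequency.
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def norm(v):
    return math.sqrt(sum(x * x for x in v.values()))

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    denom = norm(u) * norm(v)
    return dot / denom if denom else 0.0

vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # two cricket stories
sim_unrelated = cosine(vecs[0], vecs[2])  # cricket vs. markets
```

&lt;p&gt;The two cricket articles score much higher than the cricket/markets pair, which share no terms at all.&lt;/p&gt;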




&lt;h2&gt;
  
  
  6. Algorithm: Agglomerative Hierarchical Clustering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Treat each document as its own cluster.&lt;/li&gt;
&lt;li&gt;Calculate pairwise distances between all clusters.&lt;/li&gt;
&lt;li&gt;Merge the two closest clusters using average linkage:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzbm3v8w5fmyieue2wy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzbm3v8w5fmyieue2wy2.png" alt="Agglomerative clustering" width="632" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Repeat until one global cluster remains (or stop early based on threshold).&lt;/li&gt;
&lt;/ol&gt;
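&lt;p&gt;The steps above can be sketched over a toy pairwise-distance matrix. This is an illustrative implementation with an early-stop threshold, not the paper's code; documents 0 and 1 are near-duplicates, while document 2 is an outlier:&lt;/p&gt;

```python
from itertools import combinations

def average_linkage(c1, c2, dist):
    # Average of all pairwise distances between the two clusters.
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

def ahc(dist, threshold):
    # Start with every document in its own cluster.
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        pairs = [(average_linkage(clusters[ai], clusters[bi], dist), ai, bi)
                 for ai, bi in combinations(range(len(clusters)), 2)]
        d, ai, bi = min(pairs)
        if d > threshold:   # early stop: no remaining pair is close enough
            break
        clusters[ai] = clusters[ai] + clusters[bi]
        del clusters[bi]    # bi is always the later index, so ai stays valid
    return clusters

dist = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
print(ahc(dist, threshold=0.5))
```

&lt;p&gt;With the threshold at 0.5 the near-duplicates merge and the outlier stays apart; raising the threshold lets all three collapse into one global cluster, tracing out the dendrogram from bottom to top.&lt;/p&gt;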




&lt;h2&gt;
  
  
  7. Practical Example
&lt;/h2&gt;

&lt;p&gt;Consider a paragraph like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Congratulations! You are selected for the interview. You can visit our office after 11 AM."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokenizes and stems the sentences.&lt;/li&gt;
&lt;li&gt;Computes word probabilities (unigram, bigram).&lt;/li&gt;
&lt;li&gt;Assigns the paragraph a &lt;strong&gt;label (topic)&lt;/strong&gt; by checking the dominance of topic scores among predefined categories like &lt;em&gt;educational&lt;/em&gt;, &lt;em&gt;entertainment&lt;/em&gt;, &lt;em&gt;personal&lt;/em&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For classification, &lt;strong&gt;Hidden Markov Models (HMM)&lt;/strong&gt; are used to label sequences of statements in the paragraph and associate the whole paragraph to the most likely category.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Evaluation and Future Scope
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 Initial Focus
&lt;/h3&gt;

&lt;p&gt;The paper proposes initial experiments on news from the &lt;strong&gt;sports domain&lt;/strong&gt;, with plans to extend to &lt;strong&gt;politics&lt;/strong&gt;, &lt;strong&gt;education&lt;/strong&gt;, and &lt;strong&gt;entertainment&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.2 Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No formal evaluation metrics (e.g., Precision, Recall) are presented&lt;/li&gt;
&lt;li&gt;Scalability to real-time streams or multilingual content is not addressed&lt;/li&gt;
&lt;li&gt;The use of HMM for classification could be modernized with transformer-based models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8.3 Future Enhancements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;strong&gt;Topic Tracking&lt;/strong&gt; (supervised component) to monitor evolving topics&lt;/li&gt;
&lt;li&gt;Integrate &lt;strong&gt;Named Entity Recognition (NER)&lt;/strong&gt; for enhanced similarity&lt;/li&gt;
&lt;li&gt;Experiment with &lt;strong&gt;semantic vector models&lt;/strong&gt; (e.g., Word2Vec, BERT)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The paper presents a well-structured and computationally reasonable approach to the complex problem of topic detection from news articles. By leveraging &lt;strong&gt;Agglomerative Hierarchical Clustering&lt;/strong&gt; with &lt;strong&gt;TF-IDF-based cosine similarity&lt;/strong&gt;, the researchers offer a robust framework for discovering story boundaries and organizing large-scale news data without needing manual labels.&lt;/p&gt;

&lt;p&gt;For practitioners, the key takeaway is this: &lt;strong&gt;when dealing with dynamic, unlabeled news data, hierarchical clustering remains a practical, explainable, and extensible foundation.&lt;/strong&gt;&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Evaluating Google Gemini for Document OCR Using Hugging Face Invoice Dataset</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Thu, 19 Jun 2025 13:40:21 +0000</pubDate>
      <link>https://forem.com/mayankcse/evaluating-google-gemini-for-document-ocr-using-hugging-face-invoice-dataset-567i</link>
      <guid>https://forem.com/mayankcse/evaluating-google-gemini-for-document-ocr-using-hugging-face-invoice-dataset-567i</guid>
      <description>&lt;p&gt;In the digital age, invoices are the lifeblood of businesses, but processing them manually can be a monumental task, prone to errors and inefficiency. This is where Optical Character Recognition (OCR) shines, transforming scanned documents into structured, usable data. With the rise of advanced AI models like Google's Gemini, the promise of highly accurate and intelligent OCR has never been closer.&lt;/p&gt;

&lt;p&gt;But how well does Gemini actually perform on real-world documents like invoices? And how can we systematically evaluate its accuracy? This blog post dives into just that, demonstrating a practical approach to benchmark Gemini's OCR capabilities using the widely accessible Hugging Face &lt;code&gt;invoices-donut-data-v1&lt;/code&gt; dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge of Invoice OCR: More Than Just Reading Text
&lt;/h3&gt;

&lt;p&gt;Imagine an invoice. It's not just a block of text; it contains crucial, structured information: invoice numbers, dates, vendor details, line items with descriptions, quantities, and prices, and of course, the grand total. A truly effective OCR solution for invoices needs to do more than just extract raw text; it needs to understand the &lt;em&gt;meaning&lt;/em&gt; of that text within the document's context, identify these specific fields, and present them in a structured format, typically JSON.&lt;/p&gt;

&lt;p&gt;Traditional OCR might give you a jumbled string of all the words on the page. Advanced, intelligent OCR, like what Gemini aims to provide, should be able to tell you, "This is the invoice number," "This is the total amount," and so on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our Battlefield: The Hugging Face &lt;code&gt;invoices-donut-data-v1&lt;/code&gt; Dataset
&lt;/h3&gt;

&lt;p&gt;For our evaluation, we turn to a fantastic resource: the &lt;code&gt;katanaml-org/invoices-donut-data-v1&lt;/code&gt; dataset available on Hugging Face. This dataset is specifically designed for document understanding tasks, offering a collection of invoice images paired with their "ground truth" – the perfect, manually extracted JSON representation of the invoice data. This "ground truth" is our gold standard against which we'll compare Gemini's output.&lt;/p&gt;

&lt;p&gt;Each sample in this dataset provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;code&gt;image&lt;/code&gt;: The invoice document itself.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ground_truth&lt;/code&gt;: A JSON string containing the accurately extracted fields, often with a nested &lt;code&gt;gt_parse&lt;/code&gt; key holding the structured data we care about.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Gemini Advantage: Multimodal Power for Document Understanding
&lt;/h3&gt;

&lt;p&gt;Gemini models, especially versions like Gemini 1.5 Pro and Flash, are inherently multimodal. This means they can process and understand information from various modalities simultaneously – text, images, and even audio or video. For OCR, this is a game-changer. Instead of just "seeing" pixels, Gemini can leverage its understanding of visual layout, textual patterns, and even common invoice structures to more accurately extract and interpret information.&lt;/p&gt;

&lt;p&gt;While the exact API call for Gemini's specialized document parsing might vary, the core principle remains: you send an image, and you receive a structured response. For this demonstration, we'll assume an API endpoint (&lt;code&gt;API_URL&lt;/code&gt;) that takes an image and returns a JSON object containing the OCR'd data. Your &lt;code&gt;API_KEY&lt;/code&gt; will, of course, be required for authentication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up the Evaluation Pipeline (Code Walkthrough)
&lt;/h3&gt;

&lt;p&gt;Let's break down the Python code used for this evaluation.&lt;/p&gt;

&lt;p&gt;First, we install necessary libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="n"&gt;fsspec&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt; &lt;span class="n"&gt;jiwer&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;apt&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;git&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lfs&lt;/span&gt; &lt;span class="c1"&gt;# For potential git large file storage needs, though not strictly required for this dataset
&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;lfs&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;git&lt;/span&gt; &lt;span class="n"&gt;clone&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;huggingface&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;openthaigpt&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;thai&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ocr&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;evaluation&lt;/span&gt; &lt;span class="c1"&gt;# Not directly used in this script but good for context
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we load the &lt;code&gt;invoices-donut-data-v1&lt;/code&gt; dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Load dataset
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;katanaml-org/invoices-donut-data-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ground_truth_json_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Renamed to avoid shadowing
&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert PIL image to byte stream
&lt;/span&gt;    &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PNG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Prepare request for Gemini OCR API
&lt;/span&gt;    &lt;span class="c1"&gt;# The actual API call for Gemini might look different,
&lt;/span&gt;    &lt;span class="c1"&gt;# often involving `google.generativeai.GenerativeModel.generate_content`
&lt;/span&gt;    &lt;span class="c1"&gt;# and structuring your prompt to ask for structured data extraction.
&lt;/span&gt;    &lt;span class="c1"&gt;# For this example, we're simulating a generic OCR API call.
&lt;/span&gt;    &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;template&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;benchmark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# This could be a prompt for Gemini to extract invoice data
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-API-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;API_KEY&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Send to your OCR API (simulating Gemini API call)
&lt;/span&gt;    &lt;span class="c1"&gt;# In a real Gemini integration, you'd use the `google.generativeai` client
&lt;/span&gt;    &lt;span class="c1"&gt;# and craft a prompt like:
&lt;/span&gt;    &lt;span class="c1"&gt;# response = model.generate_content([image, "Extract all invoice details as a JSON object, including invoice_number, total_amount, date, and line_items with description, quantity, and price."])
&lt;/span&gt;    &lt;span class="c1"&gt;# ocr_output = response.text or response.parts[0].text if it's text-based output
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Adjust 'result' based on your actual Gemini API response structure
&lt;/span&gt;        &lt;span class="n"&gt;ocr_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ocr_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# We need a unique ID for each image, typically from the dataset itself or a generated one.
&lt;/span&gt;    &lt;span class="c1"&gt;# For simplicity, let's use the loop index or assume a unique ID field exists in `sample`.
&lt;/span&gt;    &lt;span class="c1"&gt;# As the original code didn't define image_id, let's use a simple index.
&lt;/span&gt;    &lt;span class="n"&gt;image_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ground_truth_json_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Keep as string for initial storage
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ocr_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key modification for Gemini:&lt;/strong&gt; The &lt;code&gt;requests.post&lt;/code&gt; call is a placeholder. In a real-world scenario, you would use the &lt;code&gt;google-generativeai&lt;/code&gt; library. Your prompt to Gemini would be crucial, guiding it to extract the specific invoice fields in a structured (e.g., JSON) format.&lt;/p&gt;

&lt;p&gt;For example, a conceptual Gemini integration might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.generativeai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="c1"&gt;# Configure your Gemini API key
&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gemini-pro-vision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Or 'gemini-1.5-flash', 'gemini-1.5-pro'
&lt;/span&gt;
&lt;span class="c1"&gt;# Inside your loop:
# image is a PIL Image object
# Craft a detailed prompt for invoice extraction
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract the following details from this invoice and provide them in a JSON format:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;gt_parse&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: {&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;invoice_number&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;vendor_name&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;line_items&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;      {&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;unit_price&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"\"\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;      }&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    ]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  }&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ensure all values are extracted as strings. If a field is not present, leave its value as an empty string.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# Gemini's response.text contains the extracted JSON string
&lt;/span&gt;    &lt;span class="n"&gt;ocr_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ocr_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error during Gemini processing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This conceptual integration highlights how Gemini's multimodal capabilities let you supply both the image and a specific instruction (the prompt) to guide its OCR and information-extraction process.&lt;/p&gt;
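&lt;p&gt;One practical wrinkle worth handling before evaluation: Gemini often wraps its JSON output in markdown code fences, which makes a naive &lt;code&gt;json.loads&lt;/code&gt; fail. A small illustrative helper (the function name is my own, not part of the original code) that strips the fences first:&lt;/p&gt;

```python
import json

def parse_model_json(text):
    """Strip optional markdown code fences before parsing model output as JSON."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (which may carry a language tag like ```json) ...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # ... and the closing fence, if present.
        if cleaned.rstrip().endswith("```"):
            cleaned = cleaned.rstrip()[:-3]
    return json.loads(cleaned)

raw = '```json\n{"gt_parse": {"invoice_number": "INV-2025-001"}}\n```'
print(parse_model_json(raw)["gt_parse"]["invoice_number"])  # INV-2025-001
```

&lt;p&gt;Routing predictions through a cleanup step like this noticeably reduces "Invalid Prediction JSON" failures during evaluation.&lt;/p&gt;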

&lt;h3&gt;
  
  
  Measuring Success: Beyond Simple Text Comparison
&lt;/h3&gt;

&lt;p&gt;Evaluating OCR for structured documents requires more than just a simple string match. We need to assess how accurately individual fields are extracted. For this, we'll use the Character Error Rate (CER) and field-level accuracy.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;jiwer&lt;/code&gt; library is excellent for calculating CER, which measures the minimum number of edits (insertions, deletions, substitutions) needed to change one string into another, divided by the length of the ground truth string. A lower CER indicates higher accuracy.&lt;/p&gt;
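&lt;p&gt;To make the metric concrete, here is what CER computes under the hood; a minimal pure-Python sketch (in practice you would simply call &lt;code&gt;jiwer.cer&lt;/code&gt;):&lt;/p&gt;

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_error_rate(ground_truth, prediction):
    return levenshtein(ground_truth, prediction) / len(ground_truth)

# "0" misread as "O": one substitution over 7 reference characters
print(char_error_rate("INV-001", "INV-O01"))  # 1/7 ≈ 0.143
```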

&lt;p&gt;We'll also calculate "accuracy" as the proportion of fields that are &lt;em&gt;exactly&lt;/em&gt; matched between the ground truth and the prediction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jiwer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections.abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Mapping&lt;/span&gt;

&lt;span class="c1"&gt;# Utility: flatten nested dicts with compound keys
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;flatten_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parent_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;new_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;parent_key&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parent_key&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Mapping&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;flatten_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;flatten_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;new_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;new_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Compare
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Load ground truth and prediction JSONs
&lt;/span&gt;        &lt;span class="n"&gt;gt_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid GT JSON in ID &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="n"&gt;pred_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;pred_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid Prediction JSON in ID &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Prediction: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Mapping&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="c1"&gt;# Ensure it's a dictionary for flattening
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prediction for ID &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is not a valid JSON object or dict. Prediction: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract nested gt_parse only for both ground truth and prediction
&lt;/span&gt;    &lt;span class="n"&gt;gt_flat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;flatten_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gt_json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gt_parse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}))&lt;/span&gt;
    &lt;span class="n"&gt;pred_flat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;flatten_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gt_parse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}))&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gt_flat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;correct_matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;total_cer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gt_flat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;gt_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gt_flat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;pred_val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pred_flat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Get predicted value, default to empty string if not found
&lt;/span&gt;
        &lt;span class="n"&gt;field_cer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gt_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;total_cer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;field_cer&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;gt_val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pred_val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;correct_matches&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: CER=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;field_cer&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | GT=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gt_val&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | Pred=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pred_val&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;avg_cer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_cer&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_fields&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total_fields&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;correct_matches&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_fields&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total_fields&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Accuracy (Exact Match): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Avg CER: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_cer&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
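&lt;p&gt;The loop above prints per-invoice numbers, but a benchmark verdict needs corpus-level figures. One way (a sketch with hypothetical tallies; in the real loop you would accumulate &lt;code&gt;correct_matches&lt;/code&gt;, &lt;code&gt;total_fields&lt;/code&gt;, and &lt;code&gt;total_cer&lt;/code&gt; across iterations) is to weight by field count rather than averaging per-invoice averages:&lt;/p&gt;

```python
# Hypothetical per-invoice tallies: (correct_matches, total_fields, summed_field_cer)
tallies = [(8, 8, 0.00), (6, 8, 0.35), (7, 8, 0.10)]

total_correct = sum(c for c, _, _ in tallies)
total_fields = sum(n for _, n, _ in tallies)
total_cer = sum(s for _, _, s in tallies)

overall_acc = total_correct / total_fields  # field-weighted exact-match accuracy
overall_cer = total_cer / total_fields      # mean per-field CER across the corpus
print(f"Overall accuracy: {overall_acc:.2%} | Overall CER: {overall_cer:.3f}")
```

&lt;p&gt;Field-weighted aggregation keeps an invoice with many line items from counting the same as one with a single field.&lt;/p&gt;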



&lt;p&gt;&lt;strong&gt;Explanation of Evaluation Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;flatten_dict&lt;/code&gt;&lt;/strong&gt;: This helper function is crucial for comparing nested JSON structures. It converts dictionaries like &lt;code&gt;{"gt_parse": {"invoice_number": "123", "line_items": [{"description": "Item A"}]}}&lt;/code&gt; into a flat dictionary with compound keys: &lt;code&gt;{"gt_parse.invoice_number": "123", "gt_parse.line_items[0].description": "Item A"}&lt;/code&gt;. This allows for straightforward field-by-field comparison.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Character Error Rate (CER)&lt;/strong&gt;: Calculated for each field, it tells us how "close" the predicted text is to the ground truth at a character level. A CER of 0.00 means a perfect match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy (Exact Match)&lt;/strong&gt;: This metric specifically counts how many fields were extracted &lt;em&gt;perfectly&lt;/em&gt;, meaning the predicted value exactly matches the ground truth value after stripping whitespace. This is particularly important for critical fields like invoice numbers or total amounts where even a single character error can invalidate the data.&lt;/li&gt;
&lt;/ul&gt;
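&lt;p&gt;To see &lt;code&gt;flatten_dict&lt;/code&gt; in action, here is a self-contained check (the helper is restated from the code above so the snippet runs on its own):&lt;/p&gt;

```python
from collections.abc import Mapping

def flatten_dict(d, parent_key='', sep='.'):
    # Nested dicts get dotted keys, lists get indexed keys, leaves are stringified.
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, Mapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            for i, item in enumerate(v):
                items.extend(flatten_dict(item, f"{new_key}[{i}]", sep=sep).items())
        else:
            items.append((new_key, str(v)))
    return dict(items)

nested = {"gt_parse": {"invoice_number": "123",
                       "line_items": [{"description": "Item A"}]}}
print(flatten_dict(nested))
# {'gt_parse.invoice_number': '123', 'gt_parse.line_items[0].description': 'Item A'}
```

&lt;p&gt;Note that the helper assumes list elements are themselves dicts, which holds for the &lt;code&gt;line_items&lt;/code&gt; schema used here; a list of bare strings would need an extra branch.&lt;/p&gt;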

&lt;h3&gt;
  
  
  Expected Outcomes and Why This Matters
&lt;/h3&gt;

&lt;p&gt;When running this evaluation with a robust OCR model like Gemini, you would ideally observe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low Average CER&lt;/strong&gt;: Indicating that Gemini is highly accurate at recognizing individual characters and words across the invoice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Accuracy (Exact Match)&lt;/strong&gt;: Especially for key fields like &lt;code&gt;invoice_number&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, and &lt;code&gt;total_amount&lt;/code&gt;, which are critical for automated processing and downstream systems. For example, extracting exactly "INV-2025-001" when that is the ground truth yields a perfect exact match and a CER of 0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent Extraction&lt;/strong&gt;: Beyond just character accuracy, Gemini's multimodal understanding should enable it to correctly map extracted text to the right fields, even if the layout varies across invoices. For instance, correctly identifying the total amount even if it's styled differently on different invoices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's consider an example for a single invoice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ground Truth (&lt;code&gt;gt_parse&lt;/code&gt;):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"invoice_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INV-2025-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-06-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"150.75"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"line_items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Consulting Services"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Travel Expenses"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50.75"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50.75"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gemini Prediction (same schema as &lt;code&gt;gt_parse&lt;/code&gt;):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"invoice_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INV-2025-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-06-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"150.75"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"line_items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Consulting Services"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Travel Expenses"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50.75"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50.75"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this ideal scenario, all fields would have a CER of 0.00 and contribute to 100% exact match accuracy.&lt;/p&gt;

&lt;p&gt;Now consider a less ideal scenario:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini Prediction (same schema as &lt;code&gt;gt_parse&lt;/code&gt;):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"invoice_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INV-2025-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Missing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-06-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"150.75"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"line_items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Consulting Sercices"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Typo&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unit_price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100.00"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, "invoice_number" and "line_items[0].description" would show a non-zero CER, and would not count towards exact match accuracy. The "total_amount" and "date" fields, if correctly extracted, would still contribute to exact match accuracy and have a CER of 0.00. This granular evaluation helps pinpoint areas where the OCR model might need further refinement or where certain document layouts pose greater challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion: Unlocking Automation with Intelligent OCR
&lt;/h3&gt;

&lt;p&gt;Evaluating OCR models like Gemini against structured datasets such as &lt;code&gt;invoices-donut-data-v1&lt;/code&gt; is not just an academic exercise. It's a critical step in building robust, automated document processing workflows. By systematically measuring performance using metrics like CER and exact match accuracy, we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validate Model Performance&lt;/strong&gt;: Objectively determine how well Gemini handles invoice OCR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identify Strengths and Weaknesses&lt;/strong&gt;: Pinpoint specific fields or document variations where Gemini excels or struggles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drive Improvement&lt;/strong&gt;: Use the insights to refine prompts, fine-tune models, or implement post-processing steps to achieve even higher accuracy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ability of multimodal AI models like Gemini to not just "read" text but to "understand" documents is transformative for business automation. By rigorously testing and evaluating these capabilities, we move closer to a future where manual data entry from invoices becomes a relic of the past, freeing up human potential for more strategic and creative endeavors.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
      <category>learning</category>
    </item>
    <item>
      <title>From Volume to Persistent Volume in Kubernetes</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Mon, 19 May 2025 17:03:45 +0000</pubDate>
      <link>https://forem.com/mayankcse/from-volume-to-persistent-volume-in-kubernetes-3kog</link>
      <guid>https://forem.com/mayankcse/from-volume-to-persistent-volume-in-kubernetes-3kog</guid>
      <description>&lt;p&gt;In our &lt;a href="https://dev.to/mayankcse/managing-data-volumes-in-kubernetes-1d93"&gt;previous blog&lt;/a&gt;, we explored how Kubernetes volumes help preserve data across container restarts. We worked with &lt;code&gt;emptyDir&lt;/code&gt;, which retains data as long as the &lt;strong&gt;pod&lt;/strong&gt; is running but loses it when the pod is deleted. Then, we improved this setup using &lt;code&gt;hostPath&lt;/code&gt;, which allows a container to persist data at a specified directory on the node.&lt;/p&gt;

&lt;p&gt;This worked seamlessly in &lt;strong&gt;Minikube&lt;/strong&gt; because it runs on a &lt;strong&gt;single-node cluster&lt;/strong&gt;. But what happens in &lt;strong&gt;multi-node cloud environments&lt;/strong&gt;? The same solution will fail because Kubernetes dynamically schedules pods across multiple nodes, meaning the data stored in &lt;code&gt;hostPath&lt;/code&gt; on one node &lt;strong&gt;won’t be available to a pod running on another node&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To solve this, we need &lt;strong&gt;Persistent Volumes (PVs)&lt;/strong&gt; and &lt;strong&gt;CSI (Container Storage Interface)&lt;/strong&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introducing Persistent Volumes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Persistent Volume (PV)&lt;/strong&gt; is independent of individual pods and nodes, providing a stable and reusable storage location across the cluster. &lt;/p&gt;

&lt;p&gt;To configure it, we create a new file &lt;code&gt;host-pv.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolume&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host-pv&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;capacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
  &lt;span class="na"&gt;volumeMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Block&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/data&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DirectoryOrCreate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Breaking Down the Configuration&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;apiVersion: v1&lt;/code&gt; &amp;amp; &lt;code&gt;kind: PersistentVolume&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Defines the resource type as a Persistent Volume.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;metadata.name: host-pv&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Assigns a unique name to the PV, making it identifiable when claimed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage Capacity (&lt;code&gt;capacity.storage: 1Gi&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Specifies the storage capacity, set to &lt;strong&gt;1GiB&lt;/strong&gt; in this example.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Volume Mode (&lt;code&gt;volumeMode&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Declares how the volume is exposed. &lt;code&gt;Filesystem&lt;/code&gt; (the default) mounts the volume as a directory inside the container, which is what this example's &lt;code&gt;hostPath&lt;/code&gt; directory needs; &lt;code&gt;Block&lt;/code&gt; exposes a raw block device for low-level storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Access Modes (&lt;code&gt;accessModes&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Controls how pods can interact with the volume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ReadWriteOnce&lt;/code&gt;: Can be mounted read-write by a single node; pods on that node can share it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ReadOnlyMany&lt;/code&gt;: Multiple pods can access it, but only in read-only mode.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ReadWriteMany&lt;/code&gt;: Multiple pods can read and write simultaneously.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Host Path (&lt;code&gt;hostPath&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Maps &lt;code&gt;/data&lt;/code&gt; on the node as the volume’s storage location. &lt;code&gt;type: DirectoryOrCreate&lt;/code&gt; ensures the directory exists.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
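&lt;p&gt;To create the PV and confirm it registered, apply the manifest (standard &lt;code&gt;kubectl&lt;/code&gt; commands; the file name matches the one above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply -f host-pv.yaml
# The PV should be listed with its capacity, access mode, and an "Available" status
kubectl get pv host-pv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;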

&lt;h2&gt;
  
  
  &lt;strong&gt;Claiming the Persistent Volume&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Persistent Volume Claim (PVC)&lt;/strong&gt; requests a PV from Kubernetes and ensures a pod can access it. Define this in &lt;code&gt;host-pvc.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host-pvc&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;volumeName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host-pv&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Explanation&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;volumeName: host-pv&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Directly references our previously created PV.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access Modes (&lt;code&gt;accessModes&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Specifies access rights; &lt;code&gt;ReadWriteOnce&lt;/code&gt; means the volume can be mounted read-write by a single node at a time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage Class (&lt;code&gt;storageClassName: standard&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Selects the storage class used to match the claim to a volume; the claim's class must match the PV's for binding to succeed. &lt;code&gt;standard&lt;/code&gt; is the default class on Minikube; cloud providers use their own (e.g., &lt;code&gt;gp2&lt;/code&gt; on AWS).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Requests (&lt;code&gt;resources.requests.storage: 1Gi&lt;/code&gt;)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Requests &lt;strong&gt;1GiB&lt;/strong&gt; of storage from an available PV.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
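&lt;p&gt;After applying the claim, its status tells you whether it bound to &lt;code&gt;host-pv&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply -f host-pvc.yaml
# STATUS should read "Bound" once the claim matches the PV
kubectl get pvc host-pvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;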

&lt;h2&gt;
  
  
  &lt;strong&gt;Integrating Persistent Volume into Kubernetes Deployment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Finally, update &lt;code&gt;deployment.yaml&lt;/code&gt; to use the &lt;strong&gt;PVC&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mayankcse1/kub-data-01-starting-setup-stories:1&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-volume&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app/story&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-volume&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host-pvc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Key Enhancements&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We increased &lt;strong&gt;replicas to 2&lt;/strong&gt;, ensuring multiple instances of our application run.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;PVC (host-pvc)&lt;/strong&gt; is mounted inside the container at &lt;code&gt;/app/story&lt;/code&gt;, making persistent storage accessible.&lt;/li&gt;
&lt;li&gt;Because the volume is &lt;code&gt;ReadWriteOnce&lt;/code&gt;, both replicas can share it only while they are scheduled on the same node; cross-node read-write sharing requires &lt;code&gt;ReadWriteMany&lt;/code&gt;-capable storage.&lt;/li&gt;
&lt;/ul&gt;
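&lt;p&gt;Rolling this out and checking that both replicas are running takes a couple of commands:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply -f deployment.yaml
# Both pods should reach the Running state
kubectl get pods -l app=story
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;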

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Persistent Volumes Matter in Multi-Node Clusters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Unlike a bare &lt;code&gt;hostPath&lt;/code&gt; volume, which binds storage &lt;strong&gt;to a single node&lt;/strong&gt;, Persistent Volumes decouple storage from the pod lifecycle. When backed by network-attached storage through a &lt;strong&gt;CSI driver&lt;/strong&gt; (for example, EBS on AWS or Persistent Disks on GCP), the data remains intact even if a pod fails and Kubernetes reschedules it on another node. Keep in mind that the &lt;code&gt;hostPath&lt;/code&gt;-backed PV used in this example is still node-local, so it suits single-node setups like Minikube; production clusters should use a CSI-provisioned volume instead.&lt;/p&gt;
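&lt;p&gt;You can verify the persistence directly. Assuming the service from the previous post still exposes the app at &lt;code&gt;localhost/story&lt;/code&gt;, write some data, delete a pod, and read the data back:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl --location 'localhost/story' \
  --header 'Content-Type: application/json' \
  --data '{"text": "persisted!"}'
kubectl delete pod -l app=story
# Kubernetes recreates the pod; the stored text should still be returned
curl --location 'localhost/story'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;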

&lt;h3&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Moving from standard &lt;strong&gt;Docker Volumes&lt;/strong&gt; to &lt;strong&gt;Kubernetes Volumes&lt;/strong&gt; and finally to &lt;strong&gt;Persistent Volumes&lt;/strong&gt; is essential for building scalable, cloud-ready applications. &lt;/p&gt;

&lt;p&gt;For further exploration, refer to the &lt;a href="https://kubernetes.io/docs/concepts/storage/volumes/" rel="noopener noreferrer"&gt;official Kubernetes storage documentation&lt;/a&gt;. &lt;/p&gt;




&lt;p&gt;With Persistent Volumes, your application's data &lt;strong&gt;survives pod failures, rescheduling, and multi-node deployments&lt;/strong&gt;, making it &lt;strong&gt;reliable for production environments&lt;/strong&gt;. Now you’re ready to manage &lt;strong&gt;stateful applications at scale&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>devops</category>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>MANAGING DATA &amp; VOLUMES IN KUBERNETES</title>
      <dc:creator>Mayank Gupta</dc:creator>
      <pubDate>Mon, 19 May 2025 16:29:34 +0000</pubDate>
      <link>https://forem.com/mayankcse/managing-data-volumes-in-kubernetes-1d93</link>
      <guid>https://forem.com/mayankcse/managing-data-volumes-in-kubernetes-1d93</guid>
      <description>&lt;p&gt;Imagine you’re running an application that lets users store text, but every time the container crashes or restarts, their data is lost. Frustrating, right? That’s exactly the problem Kubernetes solves with volumes—ensuring data survives beyond container lifecycles. In this blog, we’ll walk through a practical example that demonstrates how Kubernetes can persist data effectively, even in failure scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Problem: Where Did My Data Go?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Containers are lightweight, flexible, and easily restartable, but they come with a challenge: they are &lt;strong&gt;stateless by default&lt;/strong&gt;. This means that every time a container crashes or is redeployed, any data stored inside it vanishes. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mayank-cse1/docker-kubernetes-the-practical-guide/tree/main/kub-data-01-starting-setup" rel="noopener noreferrer"&gt;Practice Resource&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider a simple application where users submit text, and the text is stored in a file inside a container. Here's how you can run it using Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the container is running, test it using &lt;code&gt;curl&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieve stored text:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="s1"&gt;'localhost/story'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Add new text:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="s1"&gt;'localhost/story'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{
        "text": "My text!"
    }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if the container is removed and recreated (for example, after a crash or a redeploy), all previously submitted text is gone! This happens because every new container starts from the image with a fresh writable layer.&lt;/p&gt;
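&lt;p&gt;You can reproduce the data loss yourself by recreating the container, which discards its writable layer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down   # removes the container along with its writable layer
docker compose up
# The previously submitted text is gone
curl --location 'localhost/story'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;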

&lt;p&gt;So, how do we &lt;strong&gt;persist&lt;/strong&gt; the data beyond container failures? Kubernetes volumes provide the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Solution: Adding Volumes in Kubernetes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kubernetes allows containers to mount &lt;strong&gt;volumes&lt;/strong&gt;—storage spaces that persist beyond container lifecycles. Compared to Docker volumes, Kubernetes volumes offer greater flexibility and resilience.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Setting Up Kubernetes Deployment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First, push the Docker image to a registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker tag kub-data-01-starting-setup-stories mayankcse1/kub-data-01-starting-set
docker push mayankcse1/kub-data-01-starting-setup-stories
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, create a &lt;code&gt;deployment.yaml&lt;/code&gt; file to define how Kubernetes should manage our container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mayankcse1/kub-data-01-starting-setup-stories&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app/story&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stories-volume&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stories-volume&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stories-pvc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
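
&lt;p&gt;The deployment above mounts a claim named &lt;code&gt;stories-pvc&lt;/code&gt; that isn’t defined yet. A minimal sketch of that claim, assuming the cluster’s default StorageClass (as in Minikube) handles provisioning and that 1Gi is enough for this demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stories-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Apply it before (or together with) the deployment—otherwise the pod will sit in &lt;code&gt;Pending&lt;/code&gt; waiting for its volume.&lt;/p&gt;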



&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a Service to Expose the App&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, define a Kubernetes Service of type &lt;code&gt;LoadBalancer&lt;/code&gt; to allow external access to the application, forwarding traffic from port 80 to the container’s port 3000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Deploy and Test the Application in Kubernetes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Start Minikube if it's not running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;minikube status
minikube start &lt;span class="nt"&gt;--driver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, apply the Kubernetes configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;service.yaml &lt;span class="nt"&gt;-f&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;deployment.yaml
kubectl get pods
kubectl get deployments
minikube service story-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="s1"&gt;'http://&amp;lt;Host Address&amp;gt;/story'&lt;/span&gt;
curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="s1"&gt;'http://&amp;lt;Host Address&amp;gt;/story'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{
    "text": "mayank"
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, the service is running—but our data can still &lt;strong&gt;disappear&lt;/strong&gt; when the pod itself is removed. To solve this, let’s improve our volume strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Ensuring Data Persistence with Volumes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kubernetes supports various volume types, but a simple way to persist data within a pod is using &lt;code&gt;emptyDir&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-volume&lt;/span&gt;
    &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An &lt;code&gt;emptyDir&lt;/code&gt; volume lives as long as the &lt;strong&gt;pod&lt;/strong&gt; does: data survives container crashes and restarts within that pod. However, if the pod is &lt;strong&gt;deleted&lt;/strong&gt; or rescheduled, all data is lost.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Handling Multiple Pods &amp;amp; Node Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you scale your application to multiple pods, data consistency becomes tricky. Suppose a pod crashes and traffic is redirected to another pod—the new pod won’t have the old data!&lt;/p&gt;

&lt;p&gt;To store data across multiple pods running on the same &lt;strong&gt;node&lt;/strong&gt;, we can use &lt;code&gt;hostPath&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;story-volume&lt;/span&gt;
    &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/data&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DirectoryOrCreate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, &lt;strong&gt;this only works for pods on the same node&lt;/strong&gt;—if a pod is scheduled on a different node, it won’t have access to the previous data.&lt;/p&gt;

&lt;p&gt;For a more robust solution across multiple nodes, consider &lt;strong&gt;Persistent Volumes (PVs)&lt;/strong&gt; that work with cloud storage or external databases.&lt;/p&gt;
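
&lt;p&gt;For illustration, a &lt;strong&gt;PersistentVolume&lt;/strong&gt; is declared cluster-wide and then bound by a claim. A minimal sketch (the name and capacity here are assumptions); on a single-node cluster like Minikube it can be backed by &lt;code&gt;hostPath&lt;/code&gt;, while a real multi-node cluster would swap in a cloud or network storage driver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: stories-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data
    type: DirectoryOrCreate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;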

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts: Why Kubernetes Volumes Matter&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kubernetes volumes solve &lt;strong&gt;critical data loss issues&lt;/strong&gt; in containerized applications. By implementing persistent storage solutions, developers ensure that user data survives container crashes, pod restarts, and even scaling across multiple nodes.&lt;/p&gt;

&lt;p&gt;To explore all available volume storage options in Kubernetes, visit the official documentation:&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/storage/volumes/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/storage/volumes/&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Understanding how &lt;strong&gt;stateful applications&lt;/strong&gt; work within Kubernetes is essential for building scalable, resilient infrastructure. Whether deploying a simple text storage app or a large-scale distributed system, &lt;strong&gt;managing volumes effectively ensures reliable data persistence&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
