<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kushal Nagrani</title>
    <description>The latest articles on Forem by Kushal Nagrani (@kushalnagrani).</description>
    <link>https://forem.com/kushalnagrani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3701942%2F9fec9144-c56e-4155-a50e-5079700e5b32.png</url>
      <title>Forem: Kushal Nagrani</title>
      <link>https://forem.com/kushalnagrani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kushalnagrani"/>
    <language>en</language>
    <item>
      <title>Open Tables, Shared Truth: Architecting a Multi-Engine Lakehouse</title>
      <dc:creator>Kushal Nagrani</dc:creator>
      <pubDate>Tue, 31 Mar 2026 07:35:27 +0000</pubDate>
      <link>https://forem.com/kushalnagrani/open-tables-shared-truth-architecting-a-multi-engine-lakehouse-28jo</link>
      <guid>https://forem.com/kushalnagrani/open-tables-shared-truth-architecting-a-multi-engine-lakehouse-28jo</guid>
      <description>&lt;p&gt;In the modern data landscape, we often hear the phrase "single source of truth." But as many data engineers know, the reality behind that phrase is often a complex web of data copies, inconsistent metrics, and redundant governance.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The problem isn’t processing data. It’s where truth lives.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For over a decade, we’ve been solving the wrong problem in data engineering.&lt;/p&gt;

&lt;p&gt;We’ve optimized compute. We’ve scaled storage. We’ve built faster pipelines.&lt;/p&gt;

&lt;p&gt;And yet—&lt;br&gt;
we still don’t trust our data.&lt;/p&gt;

&lt;p&gt;This post is not about introducing another analytics engine or tool. It’s about challenging a design flaw we’ve collectively normalized, and showing how modern lakehouse architectures are finally fixing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem We Stopped Questioning
&lt;/h2&gt;

&lt;p&gt;Let’s start with an uncomfortable truth. Most organizations today have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The same dataset copied multiple times&lt;/li&gt;
&lt;li&gt;The same metric producing different results&lt;/li&gt;
&lt;li&gt;Governance logic re-implemented across systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet, we confidently say:&lt;/p&gt;

&lt;p&gt;“We have a single source of truth.”&lt;/p&gt;

&lt;p&gt;But do we really?&lt;/p&gt;

&lt;p&gt;In reality, what we have is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A warehouse copy&lt;/li&gt;
&lt;li&gt;A lake copy&lt;/li&gt;
&lt;li&gt;A serving copy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each slightly different. Each “correct” in its own context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49wiibdmwnsbecchsvv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49wiibdmwnsbecchsvv3.png" alt=" " width="344" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From Data Copies to Data Contracts
&lt;/h2&gt;

&lt;p&gt;Why did we end up here? Because historically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute engines couldn’t agree on formats&lt;/li&gt;
&lt;li&gt;Storage systems lacked transactional guarantees&lt;/li&gt;
&lt;li&gt;Governance was tied to specific platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we did what engineers do best:&lt;br&gt;
👉 We built pipelines. Lots of them.&lt;/p&gt;

&lt;p&gt;Pipelines became the glue holding together fragmented truth. But pipelines don’t scale truth — they multiply it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why “Querying the Lake” Wasn’t Enough
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhg7ggobgzdgun9dwfbl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhg7ggobgzdgun9dwfbl9.png" alt=" " width="326" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The industry tried to fix this with data lakes. We said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Let’s store everything in one place”&lt;br&gt;
“Let’s allow multiple engines to query it”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And yes, we achieved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster access&lt;/li&gt;
&lt;li&gt;Fewer ETL jobs&lt;/li&gt;
&lt;li&gt;Flexible analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But we missed something critical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ownership didn’t change.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The lake became accessible—but not authoritative.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Shift: From Engines → Tables
&lt;/h2&gt;

&lt;p&gt;Here’s the mindset shift that changes everything:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The unit of ownership is not the engine. It’s the table.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Historically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Engines owned data&lt;/li&gt;
&lt;li&gt;Pipelines moved data between engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tables become shared, governed, authoritative assets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where open table formats like Apache Iceberg, Delta Lake, and Apache Hudi come in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes Open Tables Different?
&lt;/h2&gt;

&lt;p&gt;Open table formats bring database-like guarantees to object storage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACID transactions&lt;/li&gt;
&lt;li&gt;Schema evolution&lt;/li&gt;
&lt;li&gt;Time travel&lt;/li&gt;
&lt;li&gt;Snapshot isolation&lt;/li&gt;
&lt;li&gt;Concurrent reads and writes&lt;/li&gt;
&lt;/ul&gt;
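
&lt;p&gt;To make snapshot isolation and time travel concrete, here is a minimal toy model in Python. It is a sketch under stated assumptions, not how Iceberg, Delta Lake, or Hudi are implemented: &lt;code&gt;ToyTable&lt;/code&gt;, &lt;code&gt;Snapshot&lt;/code&gt;, and the file names are hypothetical, and the “table” lives in memory. The one idea it borrows is that every commit adds a new immutable snapshot instead of editing an old one.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the full list of data files at one commit."""
    snapshot_id: int
    data_files: tuple


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table's metadata log."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self):
        return self.snapshots[-1]

    def append(self, new_files):
        # A commit never edits an old snapshot; it adds a new one on top.
        base = self.current()
        snap = Snapshot(base.snapshot_id + 1, base.data_files + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def as_of(self, snapshot_id):
        # Time travel: look up the file list as it was at an earlier commit.
        # Ids are sequential from 0 in this toy, so the id doubles as the index.
        return self.snapshots[snapshot_id]


table = ToyTable()
table.append(["orders-0001.parquet"])
reader_view = table.current()          # a reader pins snapshot 1
table.append(["orders-0002.parquet"])  # a writer commits snapshot 2

print(reader_view.data_files)      # the pinned reader still sees one file
print(table.current().data_files)  # new readers see both files
print(table.as_of(1).data_files)   # time travel back to snapshot 1
```

&lt;p&gt;Because the reader holds a reference to an immutable snapshot, a later commit cannot change what it sees. That is the property that lets readers and writers share one table without stepping on each other.&lt;/p&gt;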

&lt;p&gt;But the real magic is this:&lt;/p&gt;

&lt;p&gt;👉 Multiple engines can read and write to the same table reliably&lt;/p&gt;

&lt;p&gt;Engines like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Spark&lt;/li&gt;
&lt;li&gt;Trino&lt;/li&gt;
&lt;li&gt;Amazon Athena&lt;/li&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…can all operate on the same dataset, as long as each one supports the table format and resolves the table through the same catalog.&lt;/p&gt;

&lt;p&gt;No duplication. No translation layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing Back to the Lake
&lt;/h2&gt;

&lt;p&gt;This is where the architecture itself changes.&lt;/p&gt;

&lt;p&gt;In traditional architectures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transformations write to new systems&lt;/li&gt;
&lt;li&gt;Each system maintains its own “truth”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a modern lakehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transformations write back to shared tables&lt;/li&gt;
&lt;li&gt;These tables become data products&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 The lake is no longer just storage. It’s the system of record.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Control Actually Lives
&lt;/h2&gt;

&lt;p&gt;Another critical shift:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Governance moves from the engines to the table layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of duplicating policies across systems, you define them once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access controls&lt;/li&gt;
&lt;li&gt;Column-level security&lt;/li&gt;
&lt;li&gt;Schema ownership&lt;/li&gt;
&lt;li&gt;Audit trails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Services like AWS Lake Formation let you define table-, column-, and row-level permissions once and enforce them across the engines that integrate with it, such as Amazon Athena.&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Engines come and go&lt;/li&gt;
&lt;li&gt;Policies stay consistent&lt;/li&gt;
&lt;/ul&gt;
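
&lt;p&gt;As an illustration of “define once, enforce everywhere”, here is a small Python sketch of column-level security keyed on the table, not on the engine. Everything in it is an assumption made for the example: the &lt;code&gt;POLICIES&lt;/code&gt; dictionary, the role names, the &lt;code&gt;sales.orders&lt;/code&gt; table, and the &lt;code&gt;plan_scan&lt;/code&gt; helper are hypothetical, and this is not how AWS Lake Formation or any other service works internally.&lt;/p&gt;

```python
# Hypothetical policy store: one entry per table, independent of any engine.
POLICIES = {
    "sales.orders": {
        "analyst": {"order_id", "order_date", "amount"},
        "finance": {"order_id", "order_date", "amount", "customer_email"},
    }
}


def allowed_columns(table, role, requested):
    """Return the requested columns the role may read, preserving order."""
    granted = POLICIES.get(table, {}).get(role, set())
    return [col for col in requested if col in granted]


def plan_scan(engine, table, role, requested):
    # Every engine goes through the same check, so the answer cannot drift.
    columns = allowed_columns(table, role, requested)
    if not columns:
        raise PermissionError(f"{role} has no access to {table}")
    return {"engine": engine, "table": table, "columns": columns}


wanted = ["order_id", "customer_email", "amount"]
spark_plan = plan_scan("spark", "sales.orders", "analyst", wanted)
trino_plan = plan_scan("trino", "sales.orders", "analyst", wanted)
print(spark_plan["columns"])
print(trino_plan["columns"])
```

&lt;p&gt;Both plans end up with the same column list because the policy is looked up by table and role. Swapping the engine name changes nothing about what the analyst is allowed to read.&lt;/p&gt;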

&lt;h2&gt;
  
  
  The Reference Architecture: Multi-Engine Lakehouse
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxhlw84cyrhkc1aq4czw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxhlw84cyrhkc1aq4czw.png" alt=" " width="490" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A modern architecture looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storage Layer: Object storage (e.g., Amazon S3)&lt;/li&gt;
&lt;li&gt;Table Layer: Open table formats (Iceberg/Delta/Hudi)&lt;/li&gt;
&lt;li&gt;Compute Layer: Multiple engines (Spark, Trino, Athena, etc.)&lt;/li&gt;
&lt;li&gt;Governance Layer: Centralized policy enforcement&lt;/li&gt;
&lt;li&gt;Consumption Layer: BI, ML, APIs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And here’s what you won’t find:&lt;br&gt;
❌ A “gold copy”&lt;br&gt;
❌ Duplicated datasets&lt;br&gt;
❌ Pipeline sprawl&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pivot: Two Architectures, Same Problem
&lt;/h2&gt;

&lt;p&gt;Traditional Approach&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is copied&lt;/li&gt;
&lt;li&gt;Pipelines everywhere&lt;/li&gt;
&lt;li&gt;Multiple versions of truth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern Approach&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is shared&lt;/li&gt;
&lt;li&gt;Minimal pipelines&lt;/li&gt;
&lt;li&gt;One consistent truth&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Same problem.&lt;br&gt;
Two very different philosophies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Common Anti-Patterns (That Still Exist)
&lt;/h2&gt;

&lt;p&gt;Even with open tables, teams often fall into traps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treating tables like CSV files&lt;/li&gt;
&lt;li&gt;Having no ownership model&lt;/li&gt;
&lt;li&gt;Allowing every engine to write freely&lt;/li&gt;
&lt;li&gt;Ignoring cost and compaction strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Technology alone doesn’t solve the problem. Design discipline does.&lt;/p&gt;
&lt;/blockquote&gt;
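
&lt;p&gt;Compaction is the easiest of these to put numbers on. The sketch below is a deliberately simple planner in Python: it groups data files by partition and estimates how many files a rewrite would remove. The 128 MB target, the partition names, and the &lt;code&gt;plan_compaction&lt;/code&gt; function are assumptions for illustration. Real table formats ship their own maintenance procedures, and those should be preferred over anything hand-rolled.&lt;/p&gt;

```python
import math
from collections import defaultdict

MB = 1024 * 1024
TARGET_BYTES = 128 * MB  # assumed target file size, tune per workload


def plan_compaction(files):
    """Group files by partition and estimate how many files a rewrite would save."""
    by_partition = defaultdict(list)
    for partition, size_bytes in files:
        by_partition[partition].append(size_bytes)

    plan = {}
    for partition, sizes in by_partition.items():
        ideal = max(1, math.ceil(sum(sizes) / TARGET_BYTES))
        saved = max(0, len(sizes) - ideal)
        if saved:  # only rewrite partitions where compaction removes files
            plan[partition] = {"files_before": len(sizes), "files_after": ideal}
    return plan


files = [
    ("dt=2026-03-30", 8 * MB),
    ("dt=2026-03-30", 12 * MB),
    ("dt=2026-03-30", 5 * MB),
    ("dt=2026-03-31", 128 * MB),
]
print(plan_compaction(files))
```

&lt;p&gt;Only the partition with three small files is selected. The partition that already holds a single full-size file is left alone, which is the behavior you want if compaction cost matters.&lt;/p&gt;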

&lt;h2&gt;
  
  
  Design Trade-offs You Must Consider
&lt;/h2&gt;

&lt;p&gt;This architecture isn’t “free”.&lt;/p&gt;

&lt;p&gt;You need to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concurrent writers → Conflict resolution strategies&lt;/li&gt;
&lt;li&gt;Compaction ownership → Who maintains table performance?&lt;/li&gt;
&lt;li&gt;Performance tuning → Partitioning, indexing&lt;/li&gt;
&lt;li&gt;Failure domains → What breaks, and where?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are platform decisions—not just engineering ones.&lt;/p&gt;
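
&lt;p&gt;The concurrent-writer trade-off is worth seeing in miniature. Open table formats generally rely on some form of optimistic concurrency: a writer prepares its change against the version it read, and the commit succeeds only if nobody else committed in the meantime. The Python sketch below models that with a compare-and-swap on a version number. &lt;code&gt;ToyCatalog&lt;/code&gt;, &lt;code&gt;CommitConflict&lt;/code&gt;, and &lt;code&gt;commit_with_retry&lt;/code&gt; are hypothetical names, and real formats apply finer-grained rules about which concurrent changes actually conflict.&lt;/p&gt;

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Hypothetical catalog holding one version pointer for one table."""

    def __init__(self):
        self.version = 0
        self.rows = []

    def commit(self, expected_version, new_rows):
        # Compare-and-swap: succeed only if nobody committed since we read.
        if self.version != expected_version:
            raise CommitConflict(f"expected v{expected_version}, found v{self.version}")
        self.rows = self.rows + list(new_rows)
        self.version += 1
        return self.version


def commit_with_retry(catalog, new_rows, max_attempts=3):
    for _ in range(max_attempts):
        base = catalog.version  # re-read the latest version on every attempt
        try:
            return catalog.commit(base, new_rows)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


catalog = ToyCatalog()
spark_base = catalog.version   # both writers read version 0
trino_base = catalog.version

catalog.commit(spark_base, ["row-from-spark"])      # first writer wins
try:
    catalog.commit(trino_base, ["row-from-trino"])  # stale base, rejected
except CommitConflict as err:
    print("conflict:", err)

commit_with_retry(catalog, ["row-from-trino"])      # retry against v1 succeeds
print(catalog.version, catalog.rows)
```

&lt;p&gt;The platform decision hiding in this sketch is the retry policy: how many attempts, who gets paged when they run out, and which writers are allowed to contend for the same table in the first place.&lt;/p&gt;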

&lt;h2&gt;
  
  
  Rethinking Data Ownership
&lt;/h2&gt;

&lt;p&gt;This shift is bigger than technology.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It’s an organizational change.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You move from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline ownership → Data product ownership&lt;/li&gt;
&lt;li&gt;System silos → Shared contracts&lt;/li&gt;
&lt;li&gt;Tool-centric thinking → Agreement-centric thinking&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Stop copying data&lt;/li&gt;
&lt;li&gt;Start sharing truth&lt;/li&gt;
&lt;li&gt;Design tables as products&lt;/li&gt;
&lt;li&gt;Let engines be interchangeable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The most scalable analytics platforms are built around agreements, not tools.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;We’ve spent years optimizing how fast we process data.&lt;/p&gt;

&lt;p&gt;Now it’s time to ask a better question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Where does truth live, and who owns it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because until that’s solved,&lt;br&gt;
no amount of compute will fix your data platform.&lt;/p&gt;

</description>
      <category>lakehouse</category>
      <category>dataengineering</category>
      <category>iceberg</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
