<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alex Merced</title>
    <description>The latest articles on Forem by Alex Merced (@alexmercedcoder).</description>
    <link>https://forem.com/alexmercedcoder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F288069%2Fb20116a9-b178-4ab1-bcb0-8aa28ed732b0.png</url>
      <title>Forem: Alex Merced</title>
      <link>https://forem.com/alexmercedcoder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alexmercedcoder"/>
    <language>en</language>
    <item>
      <title>AI Weekly: Free Web Tools, MCP Production Wins, Trusted-Compute Models (April 30–May 6, 2026)</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 06 May 2026 16:22:22 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/ai-weekly-free-web-tools-mcp-production-wins-trusted-compute-models-april-30-may-6-2026-325h</link>
      <guid>https://forem.com/alexmercedcoder/ai-weekly-free-web-tools-mcp-production-wins-trusted-compute-models-april-30-may-6-2026-325h</guid>
      <description>&lt;p&gt;This week pushed three concrete lines forward at once. Vercel open-sourced an AI security harness, TinyFish made paid web search and fetch APIs free for AI agents, and Jama Software shipped the first Model Context Protocol server for engineering management. Underneath that, Z.ai's GLM-5.1 went live inside a trusted execution environment, and Anthropic previewed a proactive assistant for its work product. Here is what shipped and why each piece matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Coding Tools: Vercel Ships deepsec, TinyFish Drops Its Search Paywall
&lt;/h2&gt;

&lt;p&gt;Vercel &lt;a href="https://www.cryptointegrat.com/p/ai-news-may-5-2026" rel="noopener noreferrer"&gt;open-sourced deepsec&lt;/a&gt;, an AI-powered security harness that uses Claude and Codex to scan large codebases for vulnerabilities. The tool runs CLI-first, scales to over 1,000 concurrent sandboxes, and works with any pluggable coding agent through Vercel's AI Gateway or your own subscription. The pitch is straightforward: most coding agents handle one repo at a time, but security audits need to fan out across dozens of services in parallel. deepsec treats the agent as a worker pool and scales horizontally.&lt;/p&gt;

&lt;p&gt;TinyFish made its &lt;a href="https://www.cryptointegrat.com/p/ai-news-may-5-2026" rel="noopener noreferrer"&gt;Web Search and Fetch APIs free for all developers and AI agents&lt;/a&gt; on May 5. The free tier supports Claude Code, Cursor, Codex, and other major agent frameworks, with no credit card required and what TinyFish calls generous rate limits. Web access has been a paid bottleneck for agent workflows since 2024, and a free tier from a vendor that already serves the agent ecosystem will put pricing pressure on the rest of the search-API market.&lt;/p&gt;

&lt;p&gt;Anthropic also previewed a proactive assistant called &lt;a href="https://www.cryptointegrat.com/p/ai-news-may-5-2026" rel="noopener noreferrer"&gt;Orbit&lt;/a&gt; for Claude Cowork. Orbit will pull insights from Gmail, Slack, GitHub, Calendar, Drive, and Figma, then surface them on its own without the user asking. The product is reportedly a Max-tier feature, and Orbit Apps were also referenced in the leaks. The combination of always-on context and proactive surface area is the next step beyond chat-only agent products.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Processing: GLM-5.1 Runs FP8 Inside a Trusted Execution Environment
&lt;/h2&gt;

&lt;p&gt;Z.ai's GLM-5.1 &lt;a href="https://www.cryptointegrat.com/p/ai-news-may-5-2026" rel="noopener noreferrer"&gt;went live on the 0G Private Computer&lt;/a&gt; running FP8 inside a Trusted Execution Environment on May 5. The model is a 754-billion-parameter Mixture-of-Experts release with 40 billion active parameters per token, shipped under the MIT license on April 7. Running it inside a TEE means the weights and prompts stay encrypted from the host operating system and cloud provider, which closes the residual trust gap that has slowed enterprise self-hosting of large open-weight models.&lt;/p&gt;

&lt;p&gt;The 0G deployment matters for a specific reason. GLM-5.1 was &lt;a href="https://winbuzzer.com/2026/04/09/z-ai-releases-glm-5-1-754b-model-tops-swe-bench-pro-xcxwbn/" rel="noopener noreferrer"&gt;trained entirely on Huawei Ascend 910B chips&lt;/a&gt; with no Nvidia or AMD GPUs, scores 58.4 percent on SWE-Bench Pro, and sustains autonomous task execution for over 8 hours. Putting that capability behind a TEE on a third-party serving platform is the first time a frontier-tier open-weight model has been delivered with hardware-backed confidentiality outside a hyperscaler.&lt;/p&gt;

&lt;p&gt;Air Street's State of AI report for May noted that the UK AI Security Institute now estimates &lt;a href="https://press.airstreet.com/p/state-of-ai-may-2026" rel="noopener noreferrer"&gt;frontier cyber-offence capability is doubling every four months&lt;/a&gt;, with both Anthropic's Claude Mythos Preview and OpenAI's GPT-5.5 clearing a 32-step end-to-end cyber-attack range in a single month. The compute side of the picture stayed dense as well. Anthropic raised an &lt;a href="https://press.airstreet.com/p/state-of-ai-may-2026" rel="noopener noreferrer"&gt;additional $40 billion from Google plus $5 billion from Amazon&lt;/a&gt;, packaged with $100 billion of AWS spend and chip deals with Google and Broadcom reportedly worth hundreds of billions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standards &amp;amp; Protocols: First MCP Server for Engineering Management
&lt;/h2&gt;

&lt;p&gt;Jama Software &lt;a href="https://itbusinessnet.com/2026/05/jama-software-launches-model-context-protocol-mcp-server/" rel="noopener noreferrer"&gt;launched the first MCP Server for engineering management software&lt;/a&gt; on May 4. Jama Connect 9.35 lets engineers work in Claude, Codex, Cursor, GitHub Copilot, Visual Studio, or any MCP-compatible tool while keeping the existing Traceability Information Model, permissions, lifecycle workflows, and audit trails intact. The pitch from CTO Jim Davidson is that AI engineering agents need Spec Driven Development to deliver compliant velocity gains, and MCP is now the standard pipe for that integration.&lt;/p&gt;

&lt;p&gt;Unity AI also entered open beta this week with built-in MCP Server support, alongside the AI Gateway for third-party AI integrations. Game studios get a built-in agent tuned for Unity workflows plus the option to plug in any MCP-compatible client. The pattern across both releases is clear: vertical product vendors are no longer asking whether to support MCP. They are shipping it as a default integration surface alongside their native UIs.&lt;/p&gt;

&lt;p&gt;The Model Context Protocol's &lt;a href="https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/" rel="noopener noreferrer"&gt;2026 roadmap&lt;/a&gt; sets the priorities behind these adoptions. Lead Maintainer David Soria Parra named four focus areas: stateless transport for horizontal scaling, server discovery through .well-known URLs, task lifecycle for retry semantics and result expiry, and enterprise-readiness work covering audit trails, SSO, and gateway behavior. The June 2026 specification cycle is targeted for the stateless transport changes. Agentic AI Foundation governance, two specification releases through 2025, and a 500-plus public server count have moved MCP from an experiment into the production layer it was designed to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources to Go Further
&lt;/h2&gt;

&lt;p&gt;The AI landscape changes fast. Here are tools and resources to help you keep pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Dremio Free&lt;/strong&gt;: Experience agentic analytics and an Apache Iceberg-powered lakehouse. &lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=05-06-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Start your free trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Agentic AI with Data&lt;/strong&gt;: Dremio's agentic analytics features let your AI agents query and act on live data. &lt;a href="https://www.dremio.com/use-cases/agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=05-06-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Explore Dremio Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the Community&lt;/strong&gt;: Connect with data engineers and AI practitioners building on open standards. &lt;a href="https://developer.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=05-06-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Join the Dremio Developer Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: The 2026 Guide to AI-Assisted Development&lt;/strong&gt;: Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. &lt;a href="https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: Using AI Agents for Data Engineering and Data Analysis&lt;/strong&gt;: A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. &lt;a href="https://www.amazon.com/Using-Agents-Data-Engineering-Analysis-ebook/dp/B0GR6PYJT9/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>news</category>
      <category>security</category>
    </item>
    <item>
      <title>Apache Data Lakehouse Weekly: April 30–May 6, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 06 May 2026 15:20:55 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-30-may-6-2026-1ibl</link>
      <guid>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-30-may-6-2026-1ibl</guid>
      <description>&lt;p&gt;The release wave that defined late April carried straight into early May, with Arrow shipping two more votes in seven days, Polaris settling into post-1.4.0 stabilization mode, and the Iceberg dev list staying focused on V4 design follow-ups from the summit. The clearest story of the week is Arrow's release engineering: the arrow-rs 58.2.0 vote that opened on April 28 closed cleanly on May 2, and the Arrow .NET 23.0.0 vote opened the same day and passed by May 5. Two votes, two passing results, four days apart — a cadence that would have been unimaginable a year ago when the project was still navigating its full-stack release cycle. Iceberg's design lists stayed in absorption mode as contributors continued to translate post-summit alignments into formal specification work, and Parquet's dev list remained dense with format-level threads that have been simmering since the ALP encoding vote closed in April.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Iceberg
&lt;/h2&gt;

&lt;p&gt;Iceberg's dev list ran quieter this week than the Arrow and Polaris lists, but the design conversations that have anchored 2026 continued to advance in the background. The V4 metadata.json optionality direction — the proposal to treat catalog-managed metadata as a first-class supported mode while preserving static-table portability through explicit opt-in semantics — is still the project's defining specification conversation, with Anton Okolnychyi, Yufei Gu, Shawn Chang, Steven Wu, and Russell Spitzer continuing to push edge cases on portability guarantees and Spark driver behavior. The single-file commits proposal that Russell Spitzer and Amogh Jahagirdar have been advancing remains on track for a formal write-up that should land on the dev list in the coming weeks.&lt;/p&gt;

&lt;p&gt;Péter Váry's &lt;a href="https://www.mail-archive.com/dev@iceberg.apache.org/msg12972.html" rel="noopener noreferrer"&gt;efficient column updates proposal&lt;/a&gt; for wide tables continues to attract collaboration. Anurag Mantripragada and Gábor Kaszab are working alongside Péter on POC benchmarks for both the Iceberg-native and Parquet-native approaches, with the latency and metadata footprint improvements making this one of the more practically grounded V4 proposals on the list. The design — write only the columns that change on each commit, then stitch the result at read time — is squarely aimed at petabyte-scale feature stores with thousands of embedding and model-score columns, and that workload pressure is precisely what's pulling the V4 spec design forward.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.mail-archive.com/dev@iceberg.apache.org/msg13144.html" rel="noopener noreferrer"&gt;labels in LoadTableResponse proposal&lt;/a&gt; that Andrei Tserakhau drove through March continues to anchor the catalog-managed metadata conversation. The design lets each catalog (Polaris, Unity Catalog, Lakekeeper) surface internal metadata such as ownership, cost attribution, and semantic context through a standard optional field on table loads, without forcing requirements onto catalogs that don't track that data. The cross-implementation POCs that Andrei published — Polaris (PR #4048), Unity Catalog (PR #1417), Lakekeeper (PR #1676), and the PyIceberg client (PR #3191) — remain useful reference points as the spec change progresses through review. Iceberg Summit 2026 session recordings continued rolling out on the project's YouTube channel, and the published AI contribution policy that Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun pushed through March remains the next concrete deliverable to track.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Polaris
&lt;/h2&gt;

&lt;p&gt;Polaris transitioned from release-week intensity into stabilization mode this week. The 1.4.0 release that Adnan Hemani &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04499.html" rel="noopener noreferrer"&gt;announced on April 23&lt;/a&gt;, followed by the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04551.html" rel="noopener noreferrer"&gt;Python CLI 1.4.0 release&lt;/a&gt; on April 28, gave the project its first major release pair as a graduated top-level project. The post-launch issues that Alexandre Dutra surfaced — the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04512.html" rel="noopener noreferrer"&gt;Helm chart repo inconsistency&lt;/a&gt;, the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04513.html" rel="noopener noreferrer"&gt;release workflow failure in step 4&lt;/a&gt;, the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04514.html" rel="noopener noreferrer"&gt;Artifact Hub request&lt;/a&gt;, and the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04544.html" rel="noopener noreferrer"&gt;KMS-related upgrade bug&lt;/a&gt; — are exactly the kind of friction a project surfaces in its first independent release cycle. Yufei Gu has continued to triage most of the upgrade-path issues, and the Helm packaging questions are converging toward resolution.&lt;/p&gt;

&lt;p&gt;Design discussions stayed active alongside the post-release stabilization. EJ Wang's &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04485.html" rel="noopener noreferrer"&gt;DISCUSS thread on AGENTS.md for Polaris&lt;/a&gt; — the proposal to add agent-readable repository metadata so coding agents can pick up the project conventions consistently — continued building toward a concrete implementation proposal, which the previous newsletter flagged as the next deliverable to watch. ITing Lee's &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04430.html" rel="noopener noreferrer"&gt;proposal to add OpenLineage to Polaris&lt;/a&gt; has accumulated the volume of review feedback from Adnan Hemani, Jean-Baptiste Onofré, Yufei Gu, and Michael Collado that it needs to move toward an implementation RFC. Yufei's &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04486.html" rel="noopener noreferrer"&gt;thread on narrowing the scope of SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION&lt;/a&gt; drew further engagement from Dmitri Bourlatchkov and Dennis Huo, and Alexandre Dutra's URL path decoding and PolarisPrivilege grant validation threads continued to be active points of discussion.&lt;/p&gt;

&lt;p&gt;Jean-Baptiste Onofré's confirmation that Polaris is back on a &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04476.html" rel="noopener noreferrer"&gt;monthly release cadence&lt;/a&gt; means a 1.4.1 patch release or 1.5.0 planning email is the natural next step. Given the volume of upgrade-path issues that surfaced after 1.4.0, a quick 1.4.1 to address the KMS bug and Helm packaging fixes seems the more likely path before the project moves on to 1.5.0 feature scoping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Arrow
&lt;/h2&gt;

&lt;p&gt;Arrow's release engine kept running. Andrew Lamb's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34631.html" rel="noopener noreferrer"&gt;arrow-rs 58.2.0 RC1 vote&lt;/a&gt; that opened on April 28 closed on May 2, with &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34637.html" rel="noopener noreferrer"&gt;the release approved&lt;/a&gt; by 6 +1 votes (4 binding) and immediately published to crates.io. Bryce Mecum, Ed Seidl, Jeffrey Vo, Raúl Cumplido, and L. C. Hsieh carried the verification work, with L. C. Hsieh casting the final binding +1 from an Intel Mac on April 29. The 58.2.0 release continues the monthly arrow-rs cadence that has held since 58.1.0 shipped in March, and 59.0.0 remains scheduled as a major release that may include breaking changes.&lt;/p&gt;

&lt;p&gt;Curt Hagenlocher opened the &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34638.html" rel="noopener noreferrer"&gt;Arrow .NET 23.0.0 RC0 vote&lt;/a&gt; on May 2 — the same day arrow-rs 58.2.0 was approved — and &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34653.html" rel="noopener noreferrer"&gt;the vote passed&lt;/a&gt; on May 5 with 5 binding +1s from Bryce Mecum, Adam Reeve, Raúl Cumplido, Sutou Kouhei, and Curt himself. Sutou Kouhei verified on Debian sid with .NET SDK 8.0.413, and Curt ported verify_rc.sh to PowerShell as part of the validation. Curt is now working through the post-vote release tasks, including a 401 issue with the GitHub release download script that he flagged for follow-up. The .NET 23.0.0 release continues the steady cadence the .NET implementation has settled into since the 22.0.0 cycle.&lt;/p&gt;

&lt;p&gt;Beyond releases, the design conversations stayed lively. The &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34576.html" rel="noopener noreferrer"&gt;pyarrow-stubs donation vote&lt;/a&gt; that Rok Mihevc opened on April 14 continued building toward a final tally. Emil Sadek's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34619.html" rel="noopener noreferrer"&gt;ADBC Logo Proposal&lt;/a&gt; drew further engagement from Nic Crane, Julian Hyde, and Rusty Conover, and Benjamin Philip's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34628.html" rel="noopener noreferrer"&gt;Arrow Erlang grant documents thread&lt;/a&gt; continued the project's expansion into more language ecosystems. Andrew Lamb's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34610.html" rel="noopener noreferrer"&gt;arrow-rs security policy discussion&lt;/a&gt; and Mandukhai Alimaa's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34604.html" rel="noopener noreferrer"&gt;canonical BigDecimal extension type proposal&lt;/a&gt; both continued to draw input as the project tightens its production posture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Parquet
&lt;/h2&gt;

&lt;p&gt;Parquet's lists stayed dense. Manu Zhang's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27212.html" rel="noopener noreferrer"&gt;DISCUSS thread on a new parquet-java release&lt;/a&gt; continued attracting input from Steve Loughran, Aaron Niskode-Dossett, Fokko Driesprong, Julien Le Dem, Gang Wu, and Rahil C, with the conversation now narrowing on a target version and ship date for what would be the next parquet-java release after 1.17.0. Ismaël Mejía's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27247.html" rel="noopener noreferrer"&gt;thread soliciting code reviews for Java performance optimization work&lt;/a&gt; continued with Steve Loughran picking up the review load.&lt;/p&gt;

&lt;p&gt;The format-level proposals continued evolving. Will Edwards's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27142.html" rel="noopener noreferrer"&gt;DISCUSS thread on an alternative to the FlatBuffer footer with a lightweight byte-offset index&lt;/a&gt; kept drawing design feedback from Andrew Lamb, Ed Seidl, Jan Finis, Alkis Evlogimenos, Raphael Taylor-Davies, and Andrew Bell. Ed Seidl's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27197.html" rel="noopener noreferrer"&gt;proposal to make path_in_schema optional&lt;/a&gt; continued attracting commentary from Gang Wu, Steve Loughran, and Micah Kornfield. Andrew Lamb's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27192.html" rel="noopener noreferrer"&gt;thread on where VariantJsonParser should live&lt;/a&gt; — the cross-project boundary question between Parquet and Iceberg's variant tooling — kept moving with input from Steve Loughran and Gang Wu.&lt;/p&gt;

&lt;p&gt;The Geospatial work continued threading toward closure. Milan Stefanovic's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27136.html" rel="noopener noreferrer"&gt;Geospatial CRS string format clarification&lt;/a&gt; drew further input from Dewey Dunnington and Micah Kornfield, and Jan Finis's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27214.html" rel="noopener noreferrer"&gt;question on RLE bitpack page-edge validity&lt;/a&gt; continued the kind of spec-edge clarification work that matters for cross-implementation interoperability. The Parquet sync that Julien Le Dem ran on April 22 set the agenda for the design work that's now playing out across the dev list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Project Themes
&lt;/h2&gt;

&lt;p&gt;This week's clearest pattern is the rhythm of post-graduation Polaris finding its operational footing alongside Arrow's well-established release cadence. Two Arrow votes in four days plus the Polaris 1.4.x stabilization wave plus Iceberg's quiet absorption of summit alignments plus Parquet's dense format-level work make the lakehouse stack feel less like four separate projects and more like one coordinated platform. The arrow-rs 58.2.0 release in particular landed inside a single five-day vote window — proposed April 28, approved May 2, published to crates.io the same day — which is a useful benchmark for how tight Apache release engineering can run when the verification community is engaged.&lt;/p&gt;

&lt;p&gt;The second pattern is the continued translation of post-summit alignments into spec work. The V4 metadata.json optionality direction, the labels-in-LoadTableResponse proposal, the AGENTS.md thread for Polaris, the OpenLineage RFC, the Parquet footer redesign work, and the Geospatial spec clarifications are all converging on the same broader question: what does the lakehouse stack look like when the workloads it powers shift from analytical SQL to AI agents and ML feature engineering? Each design conversation makes more sense if you assume the next decade's workload mix looks meaningfully different from the last decade's.&lt;/p&gt;

&lt;p&gt;The third pattern is enterprise-readiness work surfacing in real time. Polaris's KMS upgrade bug, Helm packaging issues, OAuth2 Manager v2 design, and credential-subscoping scope discussion are all the work of a project being deployed at scale rather than a project being built. The visible triage on the dev list rather than behind closed doors is a healthy signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;Watch for a Polaris 1.4.1 patch release vote to address the KMS bug and Helm packaging issues that surfaced after 1.4.0, with 1.5.0 planning to follow. The AGENTS.md discussion should firm into a concrete implementation proposal, and the Polaris OpenLineage RFC has the volume of feedback it needs to move toward an implementation. On the Iceberg side, the formal V4 single-file commits write-up, the V4 metadata.json optionality direction, and the published AI contribution policy remain the next concrete deliverables to track. The labels-in-LoadTableResponse spec PR (apache/iceberg#15750) should converge toward merge as the cross-catalog POCs validate the design.&lt;/p&gt;

&lt;p&gt;On the Arrow side, the pyarrow-stubs donation vote should close in the coming days, and arrow-go and arrow-cpp release planning will shape what ships in May and June. For Parquet, Manu Zhang's parquet-java release thread should converge on a target version, the path_in_schema optionality proposal looks ready for a formal vote, and the FlatBuffer-footer alternative is on track for a more formal design document. Iceberg Summit 2026 session recordings will continue rolling out on YouTube — the V4 design talks and production case studies from Apple, Bloomberg, and Pinterest are particularly worth catching as they land.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Started with Dremio&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-05-06&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Free&lt;/a&gt; — Build your lakehouse on Iceberg with a free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/use-cases/lake-to-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-05-06&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Build a Lakehouse with Iceberg, Parquet, Polaris &amp;amp; Arrow&lt;/a&gt; — Learn how Dremio brings the open lakehouse stack together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free Downloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html" rel="noopener noreferrer"&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-polaris-guide-reg.html" rel="noopener noreferrer"&gt;Apache Polaris: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Books by Alex Merced&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Enabling-Agentic-Analytics-Apache-Iceberg-ebook/dp/B0GQXT6W3N/" rel="noopener noreferrer"&gt;Enabling Agentic Analytics with Apache Iceberg and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/" rel="noopener noreferrer"&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Book-Using-Apache-Iceberg-Python/dp/B0GNZ454FF/" rel="noopener noreferrer"&gt;The Book on Using Apache Iceberg with Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Performance and Apache Iceberg's Metadata</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 06 May 2026 15:00:09 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/performance-and-apache-icebergs-metadata-2eka</link>
      <guid>https://forem.com/alexmercedcoder/performance-and-apache-icebergs-metadata-2eka</guid>
      <description>&lt;p&gt;This is Part 3 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; covered the metadata structures of all five table formats. This article focuses on exactly how query engines use Iceberg's metadata to avoid reading data they don't need.&lt;/p&gt;

&lt;p&gt;The single biggest performance advantage of Iceberg over raw data lakes is not a clever algorithm or a faster codec. It is metadata-driven data skipping. By the time a query engine begins scanning actual Parquet files, Iceberg's metadata has already eliminated 90-99% of the files from consideration. Understanding this process explains why Iceberg tables with billions of rows can return query results in seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Scan Planning Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9iqxdtzyjelnm7xo4ji4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9iqxdtzyjelnm7xo4ji4.png" alt="Iceberg scan planning cascade showing how metadata progressively eliminates files at each stage" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a query engine like &lt;a href="https://www.dremio.com/blog/apache-iceberg-metadata-for-performance/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt;, Spark, or Trino receives a query against an Iceberg table, it executes a four-stage planning pipeline before reading any data:&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Snapshot Resolution
&lt;/h3&gt;

&lt;p&gt;The engine contacts the catalog to get the current metadata file location. It reads &lt;code&gt;metadata.json&lt;/code&gt; and identifies the current snapshot. This tells the engine which manifest list represents the table's current state.&lt;/p&gt;

&lt;p&gt;If the query includes a time travel clause (&lt;code&gt;AS OF TIMESTAMP '2024-03-01'&lt;/code&gt;), the engine scans the snapshot list in &lt;code&gt;metadata.json&lt;/code&gt; to find the snapshot that was current at that timestamp. This is a metadata-only operation; no data files are touched.&lt;/p&gt;
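
&lt;p&gt;As a minimal sketch of what this looks like from SQL (Spark syntax shown; time-travel clauses vary by engine, and the snapshot id below is an illustrative value):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Both queries are resolved from the snapshot list in metadata.json;
-- no data files are read to decide which snapshot applies.
SELECT * FROM orders TIMESTAMP AS OF '2024-03-01 00:00:00';

-- Or pin the scan to an exact snapshot id (illustrative value):
SELECT * FROM orders VERSION AS OF 8510920345941608279;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;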

&lt;h3&gt;
  
  
  Stage 2: Manifest List Pruning
&lt;/h3&gt;

&lt;p&gt;The manifest list contains one entry per manifest file. Each entry includes partition-level summary statistics: the minimum and maximum values of the partition columns across all data files tracked by that manifest.&lt;/p&gt;

&lt;p&gt;The engine evaluates query predicates against these summaries. If a query filters on &lt;code&gt;order_date = '2024-03-15'&lt;/code&gt; and a manifest's partition summary shows its date range is &lt;code&gt;2024-01 to 2024-02&lt;/code&gt;, that entire manifest is skipped. This single check can eliminate hundreds of manifest files and the thousands of data files they reference.&lt;/p&gt;
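
&lt;p&gt;You can inspect the same partition-level summaries the planner uses by querying Iceberg's &lt;code&gt;manifests&lt;/code&gt; metadata table (metadata tables are covered in Part 11). A minimal sketch, assuming Spark SQL and a table at &lt;code&gt;db.orders&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per manifest; partition_summaries carries the lower and
-- upper partition bounds the engine checks before opening the manifest.
SELECT path,
       added_data_files_count,
       partition_summaries
FROM   db.orders.manifests;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;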

&lt;h3&gt;
  
  
  Stage 3: Manifest File Pruning (File Skipping)
&lt;/h3&gt;

&lt;p&gt;For each surviving manifest, the engine reads the individual file entries. Each entry contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File path and size&lt;/li&gt;
&lt;li&gt;Row count&lt;/li&gt;
&lt;li&gt;Partition values for this specific file&lt;/li&gt;
&lt;li&gt;Column-level min/max values for each column&lt;/li&gt;
&lt;li&gt;Null counts and NaN counts per column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engine evaluates query predicates against these per-file statistics. A query filtering on &lt;code&gt;amount &amp;gt; 500&lt;/code&gt; can skip every file whose &lt;code&gt;amount&lt;/code&gt; column has a maximum value below 500. A query filtering on &lt;code&gt;status = 'shipped'&lt;/code&gt; can skip files where the min and max of the &lt;code&gt;status&lt;/code&gt; column are both &lt;code&gt;'pending'&lt;/code&gt;: a file whose min and max are equal contains only that one value, so it cannot hold any &lt;code&gt;'shipped'&lt;/code&gt; rows. More generally, an equality filter skips any file whose min/max range excludes the target value, though how well string-range pruning works depends on how the data is sorted.&lt;/p&gt;
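
&lt;p&gt;The same per-file statistics are exposed through the &lt;code&gt;files&lt;/code&gt; metadata table, which is a convenient way to see exactly what the planner sees. A minimal sketch (Spark SQL; &lt;code&gt;lower_bounds&lt;/code&gt; and &lt;code&gt;upper_bounds&lt;/code&gt; come back as maps keyed by column id):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per data file, with the stats used for file skipping.
SELECT file_path,
       record_count,
       null_value_counts,
       lower_bounds,
       upper_bounds
FROM   db.orders.files;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;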

&lt;h3&gt;
  
  
  Stage 4: Parquet Internal Pruning
&lt;/h3&gt;

&lt;p&gt;After Iceberg's metadata has identified the relevant files, the engine reads each Parquet file's footer. Parquet stores its own row-group-level min/max statistics. The engine can skip individual row groups within a file if their statistics exclude the query's filter values.&lt;/p&gt;

&lt;p&gt;If bloom filters are configured (available in Iceberg v2+), the engine can also check probabilistic membership tests for equality filters, skipping row groups where the bloom filter says the value definitely does not exist.&lt;/p&gt;
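
&lt;p&gt;Bloom filters are enabled per column through Iceberg's Parquet write properties. A minimal sketch (Spark SQL; &lt;code&gt;customer_id&lt;/code&gt; is an illustrative column, and the property only affects files written after it is set):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Existing files keep their old footers until they are rewritten.
ALTER TABLE orders SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.customer_id' = 'true'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;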

&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdzn9ukyk76vgprnqsub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdzn9ukyk76vgprnqsub.png" alt="Three layers of data skipping showing partition pruning, file pruning, and the final result" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider a table &lt;code&gt;orders&lt;/code&gt; partitioned by month with 12 months of data, 20 files per month (240 total files):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-03-15'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Manifest list pruning&lt;/strong&gt;: The engine checks partition summaries. 11 of 12 monthly manifests have date ranges that do not include March 2024. They are skipped. Only the March manifest is read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File pruning&lt;/strong&gt;: The March manifest contains 20 file entries. The engine checks each file's &lt;code&gt;amount&lt;/code&gt; column statistics. 15 files have &lt;code&gt;max(amount) &amp;lt; 500&lt;/code&gt;, so they cannot contain any rows matching &lt;code&gt;amount &amp;gt; 500&lt;/code&gt;. They are skipped. 5 files remain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 5 out of 240 files are scanned. The engine eliminated 98% of I/O through metadata alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes Statistics Effective
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2hlq7im0c76h2vc5vgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2hlq7im0c76h2vc5vgq.png" alt="Per-file statistics tracked in Iceberg manifest entries" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The effectiveness of file skipping depends entirely on how tight the min/max ranges are per file. Two factors determine this:&lt;/p&gt;

&lt;h3&gt;
  
  
  Sort Order
&lt;/h3&gt;

&lt;p&gt;If the &lt;code&gt;amount&lt;/code&gt; column is sorted within each file (or approximately sorted through &lt;a href="https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/" rel="noopener noreferrer"&gt;clustering&lt;/a&gt;), each file contains a narrow range of values. File 1 might have &lt;code&gt;amount&lt;/code&gt; from 10 to 200, File 2 from 200 to 400, and so on. A filter on &lt;code&gt;amount &amp;gt; 500&lt;/code&gt; can skip the first several files completely.&lt;/p&gt;

&lt;p&gt;If the column is randomly distributed, every file has a range of roughly &lt;code&gt;min(amount)&lt;/code&gt; to &lt;code&gt;max(amount)&lt;/code&gt; across the entire dataset. No file can be skipped because every file's range overlaps every filter. Sort order turns file skipping from theoretical to practical.&lt;/p&gt;

&lt;p&gt;Iceberg supports declaring a &lt;a href="https://iceberg.apache.org/spec/#sorting" rel="noopener noreferrer"&gt;sort order&lt;/a&gt; at the table level. When engines compact data (rewrite files), they can apply this sort order to produce files with tight column ranges. Dremio's &lt;a href="https://www.dremio.com/blog/table-optimization-in-dremio/" rel="noopener noreferrer"&gt;automatic table optimization&lt;/a&gt; handles this without manual intervention for tables managed by Open Catalog.&lt;/p&gt;
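
&lt;p&gt;Declaring the sort order is a one-line DDL change. A minimal sketch, assuming the Spark SQL extensions are enabled (existing files are untouched until they are rewritten):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Applied by subsequent writes and by compaction rewrites.
ALTER TABLE orders WRITE ORDERED BY order_date, amount;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;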

&lt;h3&gt;
  
  
  File Size and Count
&lt;/h3&gt;

&lt;p&gt;Smaller files mean tighter statistics per file but more manifest entries to manage. Larger files reduce metadata overhead but produce wider min/max ranges (less effective pruning). The typical recommendation is 128 MB to 512 MB per file for analytical workloads.&lt;/p&gt;

&lt;p&gt;Too many small files (the "small file problem") bloat manifests and slow down planning. Regular &lt;a href="https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/" rel="noopener noreferrer"&gt;compaction&lt;/a&gt; merges small files into optimally-sized ones while preserving or improving sort order.&lt;/p&gt;
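
&lt;p&gt;In Spark, compaction typically runs through Iceberg's &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure; Dremio exposes the equivalent as &lt;code&gt;OPTIMIZE TABLE&lt;/code&gt;. A minimal sketch, assuming a catalog named &lt;code&gt;catalog&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Merge small files and apply the table's declared sort order.
CALL catalog.system.rewrite_data_files(
  table    =&amp;gt; 'db.orders',
  strategy =&amp;gt; 'sort'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;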

&lt;h2&gt;
  
  
  Beyond Min/Max: Other Statistics
&lt;/h2&gt;

&lt;p&gt;Iceberg's spec supports several statistical measures per column per file:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Statistic&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Pruning Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Min/Max values&lt;/td&gt;
&lt;td&gt;Range-based filtering&lt;/td&gt;
&lt;td&gt;High (if sorted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Null count&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;IS NOT NULL&lt;/code&gt; filters&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NaN count&lt;/td&gt;
&lt;td&gt;Float NaN filtering&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Value count&lt;/td&gt;
&lt;td&gt;Row count estimation&lt;/td&gt;
&lt;td&gt;Used by optimizer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distinct count&lt;/td&gt;
&lt;td&gt;Cardinality estimation&lt;/td&gt;
&lt;td&gt;Used by cost-based optimizer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Engines like &lt;a href="https://www.dremio.com/platform/reflections/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; and Spark use the value counts and distinct counts for cost-based optimization decisions (choosing join strategies, selecting scan parallelism) even when they do not directly prune files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metadata Caching
&lt;/h2&gt;

&lt;p&gt;Reading metadata from object storage on every query adds latency. Production engines cache metadata aggressively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metadata file cache&lt;/strong&gt;: The &lt;code&gt;metadata.json&lt;/code&gt; and manifest list are typically cached in memory. They change only when the table is updated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manifest cache&lt;/strong&gt;: Manifest files are immutable (they are never modified, only replaced). Once read, they can be cached until no live snapshot references them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parquet footer cache&lt;/strong&gt;: Since Parquet files are immutable, their footers (which contain row-group statistics and schema) can be cached permanently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dremio's &lt;a href="https://www.dremio.com/platform/reflections/" rel="noopener noreferrer"&gt;Columnar Cloud Cache (C3)&lt;/a&gt; caches both metadata and data on local NVMe drives at executor nodes, turning cloud storage latency into local-disk speed for frequently-accessed tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Metadata Is Not Enough
&lt;/h2&gt;

&lt;p&gt;Metadata-driven pruning has limits. If a filter column is not in the partition spec and the data is not sorted by that column, min/max ranges will overlap across all files and no pruning occurs. In these cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add the column to the sort order&lt;/strong&gt; and compact the table. This is the most effective fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider partition evolution&lt;/strong&gt; (covered in &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Part 4&lt;/a&gt;) to add a partition transform on the column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable bloom filters&lt;/strong&gt; for high-cardinality equality filters like user IDs or transaction IDs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The metadata is only as good as the physical organization of the data. Well-organized tables skip 95%+ of I/O. Poorly organized tables with random data distribution skip nothing, and the metadata overhead becomes pure cost with no benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
    <item>
      <title>What is Dremio? The Unified Lakehouse and AI Platform</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Tue, 05 May 2026 18:30:15 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/what-is-dremio-the-unified-lakehouse-and-ai-platform-1n2g</link>
      <guid>https://forem.com/alexmercedcoder/what-is-dremio-the-unified-lakehouse-and-ai-platform-1n2g</guid>
      <description>&lt;p&gt;If you manage a modern data stack, you likely spend the majority of your time and compute budget moving data around. You pull data from an operational database, stage it in object storage, transform it, load it into a data warehouse, and finally extract it into BI extracts. This DIY approach creates fragile pipelines, delayed insights, and vendor lock-in.&lt;/p&gt;

&lt;p&gt;Dremio exists to eliminate this complexity. A mature platform with 11 years of engineering behind it, Dremio is a unified analytics solution that lets you query data where it lives, govern it securely, and interact with it using built-in Agentic AI.&lt;/p&gt;

&lt;p&gt;To understand what Dremio does, you must view it as a three-part platform: a Federated Query Engine, an Iceberg Lakehouse Platform, and an Agentic AI Layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq5mkh6q0atubrtru5yi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq5mkh6q0atubrtru5yi.png" alt="Dremio's Three-Part Platform Overview: Federated Query Engine, Iceberg Lakehouse, and Agentic AI" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 1: The Federated Query Engine
&lt;/h2&gt;

&lt;p&gt;At its core, Dremio is an execution engine built on the principle of "Query, Don't Move." &lt;/p&gt;

&lt;p&gt;Instead of forcing you to centralize all your data into a single proprietary warehouse, Dremio acts as a logical abstraction layer. When a user or BI dashboard submits a SQL query, Dremio parses the request, identifies the underlying data sources, and generates optimized sub-queries. It pushes down filters and aggregations to the source systems, retrieves the minimal necessary data, and executes the final joins in memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzuo9eqh10xln8mzftkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzuo9eqh10xln8mzftkh.png" alt="Federated Query Engine splitting a single query to Amazon S3, PostgreSQL, and Oracle" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This architecture eliminates the serialization tax and allows for &lt;strong&gt;Zero-Copy Data Movement&lt;/strong&gt;. While many other platforms have historically struggled to scale query federation, Dremio is able to scale it effortlessly. This is because of Apache Arrow's high-speed in-memory columnar execution, Dremio's intelligent pushdowns, and Iceberg-based Reflections. These features give Dremio a massive performance advantage over other query federation tools that do not leverage them. You bypass complex, multi-stage ETL pipelines entirely while maintaining interactive analytics speed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froccv6nbtkjxr9vntkdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froccv6nbtkjxr9vntkdi.png" alt="Comparison of a massive ETL pipeline against a direct zero-copy pointer to raw storage" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 2: The Iceberg Lakehouse Platform
&lt;/h2&gt;

&lt;p&gt;While federation is a great starting place to operationalize your data analytics rapidly, you ideally want the majority of your analytics to operate directly from your data lake using Apache Iceberg tables. Shifting workloads to Iceberg provides three major advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reduction in costs:&lt;/strong&gt; You rely on cheaper object storage (like Amazon S3, ADLS, or Google Cloud Storage) while eliminating the need for duplicative storage and expensive ETL pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool interoperability:&lt;/strong&gt; Open standards ensure better collaboration between teams, allowing data engineers, analysts, and data scientists to interact with the exact same data using different compute engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous performance management:&lt;/strong&gt; Dremio automatically optimizes your Iceberg tables and accelerates their performance with background Reflections. This makes a lakehouse feel as fast and easy to use as a traditional warehouse, but without the premium costs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By natively supporting Apache Parquet and Apache Iceberg, Dremio brings relational database capabilities (like ACID transactions, schema evolution, and time travel) directly to your object storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox2rioexbs05sb8qwnsx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox2rioexbs05sb8qwnsx.png" alt="Iceberg Lakehouse Architecture showing the hierarchy from catalog to metadata to Parquet files" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To manage this open ecosystem securely, Dremio integrates tightly with Apache Polaris. Polaris serves as a neutral, open catalog that provides centralized governance, role-based access control (RBAC), and credential vending. It ensures that whether you query data using Dremio, Apache Spark, or Apache Flink, every engine respects the same security policies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyo1kgdzu5lybengnxr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyo1kgdzu5lybengnxr8.png" alt="Apache Polaris Governance acting as an umbrella over multiple query engines" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, querying raw files on object storage can occasionally bottleneck at large scales. Dremio solves this with &lt;strong&gt;Autonomous Reflections&lt;/strong&gt;. Instead of relying on data engineers to manually build and maintain materialized views or OLAP cubes, Dremio monitors query patterns and automatically materializes optimized data structures in the background. When a user runs a query, the engine transparently routes it to the Reflection, delivering sub-second BI performance directly on the lakehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5otxov5jjfihwubb2elt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5otxov5jjfihwubb2elt.png" alt="Autonomous Reflections Lifecycle: Query Monitoring, Background Materialization, and Instant Acceleration" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 3: The Agentic AI Layer
&lt;/h2&gt;

&lt;p&gt;A fast query engine is useless if users cannot find or understand the data. Dremio bridges this gap by integrating artificial intelligence deeply into the platform. &lt;/p&gt;

&lt;p&gt;The foundation of this layer is the AI-powered semantic layer. It maps raw tables and columns into clean, business-friendly concepts through SQL views, tags, wikis, lineage, and a knowledge graph, with built-in semantic search to make all of it discoverable. This governed semantic layer ensures that both human analysts and AI agents interpret the data identically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2im7xegqun08kxrv44qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2im7xegqun08kxrv44qr.png" alt="Agentic AI Layer Overview showing the Semantic Layer feeding both Human Analysts and AI Agents" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For human users, Dremio includes a built-in AI Agent. Users simply type a natural language request, such as "Show top customers by revenue," and the agent instantly translates it into a highly optimized SQL query based on the context embedded in the semantic layer. It goes beyond translation: the agent immediately executes the query and can automatically generate interactive data visualizations or insights based on the results.&lt;/p&gt;
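
&lt;p&gt;For illustration, a request like the one above might resolve to SQL along these lines; the view and column names are hypothetical stand-ins for whatever your semantic layer actually exposes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical SQL an agent might generate for
-- "Show top customers by revenue"; names are illustrative.
SELECT customer_name,
       SUM(revenue) AS total_revenue
FROM   analytics.customer_revenue   -- governed business view
GROUP  BY customer_name
ORDER  BY total_revenue DESC
LIMIT  10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;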

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xo6mrjgfd7qw7rbenai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xo6mrjgfd7qw7rbenai.png" alt="Built-in AI Agent Flow translating natural language into SQL, executing it, and generating a visual chart" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For system automation, Dremio provides a Model Context Protocol (MCP) Server. The Dremio MCP Server allows external AI assistants and local IDEs to securely connect to the lakehouse with built-in access to Dremio's semantic layer. The server registers tools for semantic discovery and query execution, enabling AI agents to autonomously research and analyze data on your behalf.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgy1kxz95yelsr9r13jx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgy1kxz95yelsr9r13jx.png" alt="Dremio MCP Server Architecture connecting a Local AI Assistant to the Lakehouse" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, Dremio brings Generative AI directly into your data pipelines through Native AI SQL Functions. Functions like &lt;code&gt;AI_COMPLETE&lt;/code&gt;, &lt;code&gt;AI_GENERATE&lt;/code&gt;, and &lt;code&gt;AI_CLASSIFY&lt;/code&gt; allow you to process unstructured data directly within a &lt;code&gt;SELECT&lt;/code&gt; statement. You can extract structured fields from raw PDF blobs or classify customer sentiment without ever moving the data to an external machine learning service.&lt;/p&gt;
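
&lt;p&gt;As a rough sketch of what this looks like in a query (the exact argument shapes of these functions are an assumption here; check the Dremio docs for precise signatures), sentiment classification might be as simple as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Classify sentiment inline, no external ML service needed.
-- The AI_CLASSIFY argument shape is an assumption.
SELECT ticket_id,
       AI_CLASSIFY(ticket_text, 'positive', 'negative', 'neutral') AS sentiment
FROM   support.tickets
WHERE  created_at &amp;gt; CURRENT_DATE - INTERVAL '7' DAY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;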

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhcfobik3v5uco0vf1jw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhcfobik3v5uco0vf1jw.png" alt="Native AI SQL Functions extracting structured data from a raw PDF document" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Dremio is not a traditional data warehouse. It is a unified platform that eliminates data silos through a federated query engine, secures your object storage with an Iceberg-based lakehouse, and accelerates insights with an Agentic AI layer. &lt;/p&gt;

&lt;p&gt;By building on open standards like Apache Iceberg, Apache Parquet, Apache Arrow, and Apache Polaris, you maintain full control of your data. You achieve interactive BI performance without vendor lock-in.&lt;/p&gt;

&lt;p&gt;Ready to build your open data architecture? Take the next step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.dremio.com/get-started" rel="noopener noreferrer"&gt;Try the free trial&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn more about Dremio at a workshop or webinar&lt;/strong&gt; (&lt;a href="https://www.dremio.com/events" rel="noopener noreferrer"&gt;Events&lt;/a&gt; and &lt;a href="https://www.dremio.com/workshops" rel="noopener noreferrer"&gt;Workshops&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download free books:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1" rel="noopener noreferrer"&gt;FREE - The Apache Iceberg Digest: Vol1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>analytics</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Semantic Layer: The Definitive Guide</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 01 May 2026 13:13:54 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/semantic-layer-the-definitive-guide-5h6i</link>
      <guid>https://forem.com/alexmercedcoder/semantic-layer-the-definitive-guide-5h6i</guid>
      <description>&lt;p&gt;The term "semantic layer" has been part of the data industry's vocabulary for over 35 years. It first appeared in a 1991 patent filing by Business Objects, and it has since been reinvented, abandoned, and reinvented again across three distinct eras of data architecture. Today, it sits at the center of one of the most consequential design debates in the industry: should the semantic layer be a standalone product you bolt onto your stack, or a native capability of the platform that already manages your data?&lt;/p&gt;

&lt;p&gt;This guide covers the full arc: what a semantic layer is, where it came from, how it split into two competing architectural approaches, and why the choice between them determines whether your AI agents produce accurate answers or hallucinated nonsense.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Semantic Layer Actually Is
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9hbvo6s9mu46w8k30xn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9hbvo6s9mu46w8k30xn.png" alt="The semantic layer sits between raw data sources and consumers, providing metric consistency, access governance, and query abstraction" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A semantic layer is an abstraction that maps the physical structure of your data (table names, column names, join logic, filter conditions) to the business terms that people actually use (revenue, churn rate, active customer, cost per acquisition). It sits between the raw data and every consumer of that data: BI dashboards, AI agents, Python notebooks, Excel spreadsheets, and custom applications.&lt;/p&gt;

&lt;p&gt;The semantic layer has three core responsibilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metric consistency.&lt;/strong&gt; When the finance team says "revenue," they mean recognized revenue net of refunds. When the sales team says "revenue," they mean bookings including pending deals. Without a semantic layer, both teams write their own SQL, get different numbers, and spend the next two weeks arguing about which dashboard is right. A semantic layer defines "revenue" once, in one place, and every downstream consumer uses that definition.&lt;/p&gt;
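
&lt;p&gt;In practice, "define it once" usually means a single governed view that every dashboard queries; a minimal sketch, with illustrative table and column names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One shared definition of revenue: recognized revenue
-- net of refunds. Names are illustrative.
CREATE VIEW finance.revenue AS
SELECT order_id,
       order_date,
       order_amount - COALESCE(refund_amount, 0) AS revenue_usd
FROM   raw.orders
WHERE  status = 'recognized';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;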

&lt;p&gt;&lt;strong&gt;Access governance.&lt;/strong&gt; The semantic layer controls who sees what. A marketing analyst querying customer data should not see Social Security numbers. A regional manager should only see data for their region. These rules (row-level security, column masking, role-based access) are defined at the semantic layer and enforced consistently regardless of which tool is doing the querying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query abstraction.&lt;/strong&gt; Business users and AI agents should not need to know that "customer churn rate" requires joining three tables, filtering out test accounts, calculating a 90-day rolling window, and dividing by the active customer count from the prior period. The &lt;a href="https://www.dremio.com/platform/unified-analytics/ai-semantic-layer/" rel="noopener noreferrer"&gt;semantic layer&lt;/a&gt; encapsulates that logic in a reusable definition. Consumers ask for "churn rate" and get the right answer.&lt;/p&gt;
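
&lt;p&gt;A sketch of what that encapsulation can look like, following the churn description above (names illustrative, and the logic deliberately simplified):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Encapsulates the churn logic once: test-account filtering
-- and the 90-day window live here, not in every dashboard.
-- Simplified: omits the prior-period denominator for brevity.
CREATE VIEW metrics.churn_rate AS
SELECT COUNT(CASE WHEN last_order_date &amp;lt; CURRENT_DATE - INTERVAL '90' DAY
                  THEN 1 END) * 1.0
       / NULLIF(COUNT(*), 0) AS churn_rate
FROM   analytics.customers
WHERE  account_type != 'test';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;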

&lt;h2&gt;
  
  
  The Origin Story: Business Objects, 1991
&lt;/h2&gt;

&lt;p&gt;The semantic layer was invented to solve a simple problem: business users could not write SQL.&lt;/p&gt;

&lt;p&gt;In 1991, Business Objects filed a patent for a "relational database access system using semantically dynamic objects." The product feature was called "Universes." It worked like this: a data architect would build a metadata model that mapped physical database tables and join paths into business-friendly objects ("Customer," "Product," "Sales Amount"). Report builders could then drag and drop these objects to create queries without touching SQL.&lt;/p&gt;

&lt;p&gt;This was a significant advance. Before Universes, generating a report from a relational database required either a developer who understood the schema or a business user willing to learn SQL. Business Objects eliminated that requirement entirely.&lt;/p&gt;

&lt;p&gt;IBM's Cognos followed with "Framework Manager," which served the same purpose: map the physical database into a logical, business-friendly model. SAP built InfoProviders and BEx queries on top of SAP BW. Microsoft introduced SQL Server Analysis Services.&lt;/p&gt;

&lt;p&gt;Every major enterprise BI vendor in the 1990s built some version of a semantic layer. But they all shared the same fundamental limitation: &lt;strong&gt;the semantic layer was proprietary and locked to a single vendor's BI tool.&lt;/strong&gt; If you built your metrics in Business Objects Universes, those definitions did not carry over to Cognos. If you modeled your data in SSAS, Tableau could not read it. The semantic layer existed, but it was a walled garden.&lt;/p&gt;

&lt;h2&gt;
  
  
  OLAP Cubes: The Implicit Semantic Layer
&lt;/h2&gt;

&lt;p&gt;Running parallel to the relational semantic layer was the OLAP (Online Analytical Processing) cube. Products like SQL Server Analysis Services, Cognos TM1, and Oracle Essbase pre-computed data into multidimensional structures: dimensions (Customer, Product, Time), measures (Revenue, Quantity, Profit), and hierarchies (Year &amp;gt; Quarter &amp;gt; Month &amp;gt; Day).&lt;/p&gt;

&lt;p&gt;The cube itself functioned as a semantic layer. Business users did not query tables; they navigated dimensions. They did not write SQL; they used MDX (Multidimensional Expressions) or simply clicked through pivot-table interfaces. The business logic was baked into the cube's structure.&lt;/p&gt;

&lt;p&gt;OLAP cubes worked well for their era. Pre-computing aggregations meant that analytical queries returned in seconds, even on the hardware of the early 2000s. But they had three fatal weaknesses:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rigidity.&lt;/strong&gt; Adding a new dimension or changing a hierarchy required rebuilding the cube, which could take hours for large datasets. Business requirements change faster than cubes can be rebuilt.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Cubes stored pre-aggregated copies of data. For large organizations, this meant maintaining terabytes of redundant, pre-computed data on expensive storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Specialization.&lt;/strong&gt; Operating an OLAP cube required specialized skills (MDX, cube design, aggregation strategies) that most data teams did not have.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As cloud data warehouses like Snowflake, BigQuery, and Redshift made raw compute cheap and fast, the need for pre-aggregation declined. You could run the analytical query directly against the detail data and get results in seconds. The cube's primary value proposition, speed through pre-computation, was no longer unique.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Self-Service Era and the Loss of the Semantic Layer
&lt;/h2&gt;

&lt;p&gt;The 2010s brought a dramatic shift. Self-service BI tools like Tableau and Power BI connected directly to databases, bypassing the semantic layer entirely. This was marketed as democratization: give every analyst direct access to the data, and they will find their own insights.&lt;/p&gt;

&lt;p&gt;For small teams, this worked. For organizations with more than a handful of analysts, it created a problem that the industry calls "metric drift." Without a centralized semantic layer, each analyst wrote their own SQL. Each SQL query embedded its own business logic. Revenue was calculated five different ways by five different people, and no one could agree on which number was correct.&lt;/p&gt;

&lt;p&gt;The first response to metric drift came from &lt;a href="https://cloud.google.com/looker" rel="noopener noreferrer"&gt;Looker&lt;/a&gt;, founded in 2012, which introduced LookML as a code-based semantic layer. You defined your metrics, dimensions, and relationships in version-controlled modeling files. This was a meaningful evolution because it separated the semantic logic from the BI tool's proprietary report format. Google acquired Looker for $2.6 billion in 2019, validating that the semantic layer was worth owning. But LookML was still tied to Looker's ecosystem. If you used Tableau or Power BI as your primary BI tool, LookML did not help.&lt;/p&gt;

&lt;p&gt;The broader industry realization was clear: &lt;strong&gt;skipping the semantic layer does not eliminate the need for one. It just distributes the problem across every team and every dashboard, where it becomes harder to find and harder to fix.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dremio: The Semantic Layer Built Into the Query Engine From Day One
&lt;/h2&gt;

&lt;p&gt;While Looker was coupling the semantic layer to a BI tool, a different approach was emerging. Dremio was founded in 2015 by Tomer Shiran and Jacques Nadeau, creators and contributors to the Apache Drill project. When Dremio publicly launched in July 2017, it introduced what it called a "governed, self-service semantic layer" as a core architectural component, not an add-on.&lt;/p&gt;

&lt;p&gt;The key difference: Dremio's semantic layer was integrated directly into a high-performance query engine. From its first release, the platform shipped with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Virtual Datasets (Views).&lt;/strong&gt; SQL-defined business logic that users could create, share, and layer on top of any connected data source. No data movement required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reflections.&lt;/strong&gt; Patented, transparent materialized views that the query optimizer substitutes automatically. Users query the governed view; Dremio serves the fastest available Reflection behind the scenes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Federated access.&lt;/strong&gt; The semantic layer worked across data sources (S3, HDFS, relational databases) from the start, not just against a single warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dremio added Wikis and Labels (Tags) in subsequent releases, providing Markdown-formatted documentation and classification metadata directly attached to datasets in the catalog. This meant the semantic layer was not just a set of views; it included the context that made those views discoverable and understandable.&lt;/p&gt;

&lt;p&gt;This was architecturally distinct from every other semantic layer on the market. AtScale (founded 2013) and Cube (open-sourced 2019) built the semantic layer as a separate product. Dremio built it into the same platform that executed the queries and managed the catalog. That design decision would become increasingly important as AI agents entered the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Modern Resurgence: Two Divergent Paths
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nc7zxbgjz4vp1gnj96s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nc7zxbgjz4vp1gnj96s.png" alt="The semantic layer evolved from 1991 through OLAP cubes and self-service BI into two divergent paths: standalone products and platform-integrated semantic layers" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the early 2020s, the semantic layer was firmly back. dbt Labs acquired Transform (the creators of MetricFlow) in February 2023 to build a code-based metrics layer. Cube had open-sourced its API-first semantic layer in 2019 and launched Cube Cloud commercially in 2021. AtScale had been building its enterprise virtualization layer since 2013.&lt;/p&gt;

&lt;p&gt;The market had split into two fundamentally different architectural forms, and the choice between them has significant consequences for how your data platform operates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path 1: The semantic layer as a standalone product.&lt;/strong&gt; Companies like AtScale (2013), Cube (2019), and dbt (MetricFlow, 2023) built the semantic layer as a separate service that sits between your data warehouse and your BI tools. You deploy it as its own infrastructure, manage it as its own system, and integrate it with your existing stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path 2: The semantic layer as a platform feature.&lt;/strong&gt; &lt;a href="https://www.dremio.com/blog/agentic-analytics-semantic-layer/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; (2017) integrated the semantic layer directly into its query engine and data catalog from the start. There is no separate service to deploy. The semantic layer is a native capability of the same platform that stores, governs, and queries your data.&lt;/p&gt;

&lt;p&gt;Both approaches solve the metric consistency problem. They differ in how they solve it, what they require operationally, and how well they extend to AI-driven analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 1: The Semantic Layer as a Standalone Product
&lt;/h2&gt;

&lt;p&gt;Three standalone semantic layer products dominate the current market. Each targets a different architecture and team profile.&lt;/p&gt;

&lt;h3&gt;
  
  
  AtScale (Founded 2013)
&lt;/h3&gt;

&lt;p&gt;AtScale, founded by veterans of the Yahoo data team, positions itself as a "universal semantic layer" for large enterprises. It creates a virtualization layer across multiple data warehouses (Snowflake, BigQuery, Databricks), presenting a unified semantic model to BI tools. Its strongest feature is native MDX and DAX compatibility, which makes it the only option for organizations with heavy Excel and SSAS dependencies.&lt;/p&gt;

&lt;p&gt;AtScale excels when you have data spread across multiple warehouses and need a single semantic model that works across all of them. The tradeoff is infrastructure complexity and licensing cost. AtScale requires dedicated infrastructure, and its enterprise pricing model reflects its positioning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cube (Open-Sourced 2019)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cube.dev/" rel="noopener noreferrer"&gt;Cube&lt;/a&gt; started as Statsbot in 2016 before pivoting to become an open-source, API-first semantic layer in 2019. It provides REST, GraphQL, SQL, MDX, and DAX APIs, making it the most flexible option for embedded analytics and customer-facing dashboards. Cube Cloud launched commercially in 2021. Cube's pre-aggregation engine can deliver sub-second responses for complex queries by pre-computing results and caching them.&lt;/p&gt;

&lt;p&gt;Cube excels when your primary consumer is a custom application, not a BI tool. The tradeoff is operational overhead: Cube runs as a separate server, requires its own infrastructure, and demands expertise in designing pre-aggregation strategies to achieve optimal performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  dbt Semantic Layer (MetricFlow, 2023)
&lt;/h3&gt;

&lt;p&gt;The dbt Semantic Layer is powered by MetricFlow, the technology dbt Labs acquired when it purchased Transform in February 2023. It lets teams define metrics as code in YAML files within their existing dbt projects. Metrics are version-controlled, reviewed via pull requests, and deployed alongside your dbt transformations. In late 2025, dbt Labs moved MetricFlow to an Apache 2.0 license, signaling a commitment to open, portable metric definitions.&lt;/p&gt;

&lt;p&gt;The dbt Semantic Layer excels when your team is already a dbt shop and wants metrics managed in the same Git-based workflow as transformations. The tradeoff is that it requires dbt Cloud for the serving layer, lacks native caching (relying on the underlying warehouse for query execution), and is less suited for high-concurrency embedded applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Structural Tradeoff of Standalone Products
&lt;/h3&gt;

&lt;p&gt;All three standalone products share the same architectural limitation: they exist as a separate layer of infrastructure that must integrate with your data catalog, your governance system, and your query engine.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Another system to operate.&lt;/strong&gt; You deploy it, monitor it, upgrade it, and debug it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance is a separate concern.&lt;/strong&gt; Access policies defined in your catalog or warehouse must be replicated or synced with the semantic layer. Any gap is a security risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No native execution.&lt;/strong&gt; Standalone semantic layers define metrics but do not execute queries. They translate user requests into SQL and send that SQL to an external engine. If the engine and the semantic layer disagree on the data model, you get wrong results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sync lag.&lt;/strong&gt; When you change a table schema, add a column, or update governance rules, the semantic layer must be updated separately. Until it is, your definitions are stale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams with a single data warehouse, a strong DevOps practice, and a primary use case that matches one of these products, standalone semantic layers work well. For teams managing federated data across multiple sources, or teams building AI-driven analytics, the gap between "definition" and "execution" creates friction that compounds over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 2: The Semantic Layer as a Platform Feature
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbnjy2bm9l4e641f9l3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbnjy2bm9l4e641f9l3u.png" alt="Dremio's architecture integrating semantic layer, query engine, and open catalog in a single platform" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The alternative is to build the semantic layer into the same platform that manages your data catalog, governs access, and executes queries. This is the approach &lt;a href="https://www.dremio.com/blog/the-ai-foundation-of-the-agentic-lakehouse/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; takes.&lt;/p&gt;

&lt;p&gt;In Dremio, the semantic layer is not a separate product you bolt on. It is a native set of capabilities (views, wikis, labels, lineage, knowledge graph) that are integrated with the &lt;a href="https://www.dremio.com/platform/enterprise-data-catalog/" rel="noopener noreferrer"&gt;Open Catalog&lt;/a&gt; (built on Apache Polaris), the MPP query engine (built on Apache Arrow), and the governance system (Fine-Grained Access Control, row-level security, column masking).&lt;/p&gt;

&lt;p&gt;This matters because the three activities that define a semantic layer (defining metrics, governing access, and executing queries) all happen in the same system. There is no handoff, no sync, no governance gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Dremio's Semantic Layer Works
&lt;/h2&gt;

&lt;p&gt;Dremio's &lt;a href="https://www.dremio.com/platform/unified-analytics/ai-semantic-layer/" rel="noopener noreferrer"&gt;AI Semantic Layer&lt;/a&gt; is built from five components that work together (views, wikis, labels, lineage, and the knowledge graph), with semantic search as the entry point across all of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Views (Virtual Datasets)
&lt;/h3&gt;

&lt;p&gt;Views are the foundation. A view is a SQL-defined virtual dataset that encapsulates business logic: joins, filters, calculations, and transformations. You write the SQL once, and every consumer (BI tool, AI agent, Python notebook) queries the view instead of the raw tables.&lt;/p&gt;

&lt;p&gt;Dremio recommends a three-layer architecture for views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preparation Layer.&lt;/strong&gt; One view per source table. Handles type casting, column renaming, null handling. A direct 1:1 mapping of raw data into clean, standardized form.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business Layer.&lt;/strong&gt; Shared business logic. This is where you define "active customer" (customers with at least one order in the last 90 days, excluding test accounts), "revenue" (order_amount minus refunds, in USD), and every other metric that needs a single definition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Layer.&lt;/strong&gt; Tailored datasets for specific consumers. A marketing dashboard view joins customer demographics with campaign performance. An AI agent view exposes the most commonly asked metrics with rich column-level documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because views are virtual, they do not copy or move data. They execute against the underlying data at query time, using Dremio's &lt;a href="https://www.dremio.com/blog/why-agentic-analytics-requires-federation-virtualization-and-the-lakehouse-how-dremio-delivers/" rel="noopener noreferrer"&gt;federated query engine&lt;/a&gt; to pull from S3, PostgreSQL, Snowflake, MongoDB, or any connected source. Change the underlying data, and the view reflects it immediately.&lt;/p&gt;
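
&lt;p&gt;A minimal sketch of the three layers described above, using hypothetical source and view names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Preparation layer: 1:1 cleanup of one raw source table.
CREATE VIEW prep.orders AS
SELECT CAST(order_ts AS TIMESTAMP) AS order_date,
       CAST(amount_cents AS DECIMAL(18,2)) / 100 AS order_amount_usd,
       customer_id,
       account_type
FROM   source_pg.public.orders;

-- Business layer: the shared "active customer" definition.
CREATE VIEW business.active_customers AS
SELECT customer_id
FROM   prep.orders
WHERE  order_date &amp;gt; CURRENT_DATE - INTERVAL '90' DAY
  AND  account_type != 'test'
GROUP  BY customer_id;

-- Application layer: tailored for one dashboard.
CREATE VIEW app_marketing.active_customer_spend AS
SELECT o.customer_id,
       SUM(o.order_amount_usd) AS spend_90d
FROM   prep.orders o
JOIN   business.active_customers a ON a.customer_id = o.customer_id
GROUP  BY o.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;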

&lt;h3&gt;
  
  
  Wikis
&lt;/h3&gt;

&lt;p&gt;Wikis are Markdown-formatted documentation attached directly to spaces, sources, folders, tables, views, and columns. They serve two audiences: human analysts browsing the catalog, and AI agents generating SQL.&lt;/p&gt;

&lt;p&gt;A wiki for a view called &lt;code&gt;analytics.customer_health&lt;/code&gt; might contain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Customer Health Score&lt;/span&gt;

Composite metric combining purchase frequency, support ticket volume,
and NPS survey responses over the trailing 90 days.

&lt;span class="gs"&gt;**Owner:**&lt;/span&gt; Customer Success team
&lt;span class="gs"&gt;**Refresh:**&lt;/span&gt; Updated daily by the ETL pipeline
&lt;span class="gs"&gt;**Filters:**&lt;/span&gt; Excludes test accounts (account_type != 'test')
&lt;span class="gs"&gt;**Churn definition:**&lt;/span&gt; Score below 30 for two consecutive months
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dremio can also auto-generate wiki content. The platform samples table data, analyzes column distributions, and produces context-rich descriptions using generative AI. This is particularly valuable for large data estates where manually documenting hundreds of tables is impractical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Labels
&lt;/h3&gt;

&lt;p&gt;Labels classify and organize data assets. You tag a table as &lt;code&gt;PII&lt;/code&gt;, &lt;code&gt;Finance&lt;/code&gt;, &lt;code&gt;Marketing&lt;/code&gt;, &lt;code&gt;Approved&lt;/code&gt;, or &lt;code&gt;Draft&lt;/code&gt;. Labels serve two purposes: they improve discoverability (semantic search returns results filtered by label), and they integrate with governance rules (all &lt;code&gt;PII&lt;/code&gt;-labeled columns automatically apply masking policies).&lt;/p&gt;

&lt;p&gt;Like wikis, labels can be AI-suggested. Dremio analyzes column names, data patterns, and content to recommend labels like &lt;code&gt;contains-email&lt;/code&gt; or &lt;code&gt;likely-PII&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lineage
&lt;/h3&gt;

&lt;p&gt;Dremio automatically tracks the flow of data from source to view to consumer. You can see which raw tables feed into which business views, and which dashboards or AI queries consume those views.&lt;/p&gt;

&lt;p&gt;Lineage is critical for impact analysis. Before changing the schema of a source table, you can trace all downstream dependencies and understand exactly what will break. Without automated lineage, this analysis requires manually reading SQL definitions and hoping you did not miss one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Knowledge Graph
&lt;/h3&gt;

&lt;p&gt;The knowledge graph is the newest addition to Dremio's semantic layer. It operates at a higher level than individual wikis and labels, building a connected graph of entity relationships, metric definitions, and usage patterns.&lt;/p&gt;

&lt;p&gt;The knowledge graph works in three ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pattern detection.&lt;/strong&gt; It analyzes query patterns across your organization to detect implicit definitions. If 80% of queries that reference "active customers" use the same WHERE clause (&lt;code&gt;last_order_date &amp;gt; CURRENT_DATE - INTERVAL '90' DAY AND account_type != 'test'&lt;/code&gt;), the knowledge graph surfaces that pattern as a candidate definition (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User-defined context.&lt;/strong&gt; You can provide business context as structured markdown files. These files define entities, relationships, and business rules that the knowledge graph ingests and makes available to AI agents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relationship mapping.&lt;/strong&gt; The knowledge graph connects related entities (customers are related to orders, orders contain products, products belong to categories) and exposes those relationships to AI agents, enabling more accurate multi-table SQL generation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
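
&lt;p&gt;To make the pattern detection from point 1 concrete: the recurring ad-hoc query below is the kind of shape the graph would flag, since its WHERE clause keeps reappearing across teams (names illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- The recurring pattern the knowledge graph detects across
-- many ad-hoc queries; it becomes a candidate definition.
SELECT COUNT(DISTINCT customer_id) AS active_customers
FROM   analytics.customers
WHERE  last_order_date &amp;gt; CURRENT_DATE - INTERVAL '90' DAY
  AND  account_type != 'test';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;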

&lt;h3&gt;
  
  
  Semantic Search
&lt;/h3&gt;

&lt;p&gt;Semantic search lets users and AI agents discover data assets using natural language. Instead of browsing a schema tree looking for a table called &lt;code&gt;dwh_fact_cust_ord_line_item&lt;/code&gt;, you search for "customer orders by product category" and find the relevant view, complete with its wiki documentation and labels.&lt;/p&gt;

&lt;p&gt;Semantic search indexes wikis, labels, column names, table descriptions, and view definitions. It is the entry point for both human exploration and AI agent data discovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Integrated Approach Changes Everything for AI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8tgus2xtevp5jevo7tq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8tgus2xtevp5jevo7tq.png" alt="How an AI agent uses the semantic layer to generate accurate SQL from a natural language question" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reason the platform-versus-product distinction matters more now than it did five years ago is AI. Specifically, AI agents that generate SQL from natural language questions.&lt;/p&gt;

&lt;p&gt;An AI agent that receives the question "What was our customer churn rate by region last quarter?" needs three things to produce an accurate answer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context.&lt;/strong&gt; What does "churn rate" mean in this organization? What table contains the data? Which columns are relevant? What filters should be applied? The semantic layer's wikis, labels, views, and knowledge graph provide this context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access.&lt;/strong&gt; Can this user see the churn data? Are there row-level filters based on their role? Are any columns masked? The governance system enforces these rules.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Execution speed.&lt;/strong&gt; The user expects an answer in seconds, not minutes. The query engine needs to be fast enough for interactive use.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In a standalone semantic layer architecture, these three capabilities live in three different systems: the semantic layer product provides context, the data catalog or warehouse provides governance, and a separate query engine provides execution. The AI agent must coordinate across all three, and any mismatch between them produces wrong answers, security violations, or slow responses.&lt;/p&gt;

&lt;p&gt;In Dremio's architecture, all three are co-located. The &lt;a href="https://www.dremio.com/ai-agent/" rel="noopener noreferrer"&gt;AI Agent&lt;/a&gt; reads the wikis, labels, and view definitions from the semantic layer, generates SQL that respects governance rules, and executes the query on the built-in MPP engine. The entire loop happens within a single governed platform.&lt;/p&gt;

&lt;p&gt;Dremio's &lt;a href="https://docs.dremio.com/current/developer/mcp-server/" rel="noopener noreferrer"&gt;MCP Server&lt;/a&gt; extends this to external AI tools. ChatGPT, Claude Desktop, or any custom agent that supports the Model Context Protocol can connect to Dremio and query through the same governed semantic layer. The external AI agent receives the same business context, respects the same governance rules, and gets the same fast query execution as the built-in AI Agent.&lt;/p&gt;

&lt;p&gt;The semantic layer teaches the AI your business language so it generates the right SQL, not generic SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Platform vs Product: A Side-by-Side Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Standalone Products (AtScale, Cube, dbt)&lt;/th&gt;
&lt;th&gt;Platform-Integrated (Dremio)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate server or service to deploy&lt;/td&gt;
&lt;td&gt;Built into the platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Must integrate with external catalog and warehouse&lt;/td&gt;
&lt;td&gt;Native FGAC, row-level security, column masking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Depends on external warehouse or engine&lt;/td&gt;
&lt;td&gt;Built-in MPP engine (Apache Arrow)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metric definitions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;YAML files, code, or GUI-based models&lt;/td&gt;
&lt;td&gt;SQL views in the catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI readiness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires separate MCP adapter or API integration&lt;/td&gt;
&lt;td&gt;Native AI Agent + MCP Server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single warehouse or requires federation setup&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.dremio.com/blog/why-agentic-analytics-requires-federation-virtualization-and-the-lakehouse-how-dremio-delivers/" rel="noopener noreferrer"&gt;Federated queries&lt;/a&gt; across 30+ sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-aggregation (Cube) or warehouse-dependent (dbt)&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.dremio.com/blog/5-ways-dremio-reflections-outsmart-traditional-materialized-views/" rel="noopener noreferrer"&gt;Reflections&lt;/a&gt; (autonomous, transparent acceleration)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sync lag&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Possible lag between definition changes and enforcement&lt;/td&gt;
&lt;td&gt;Real-time; definitions and execution are the same system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams with a single warehouse and specific tooling needs&lt;/td&gt;
&lt;td&gt;Teams with diverse data sources, AI-driven analytics, or multi-engine requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When a Standalone Product Fits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You use a single data warehouse (Snowflake, BigQuery) and your semantic layer needs are limited to consistent BI metrics&lt;/li&gt;
&lt;li&gt;Your team is already deeply invested in dbt and wants metrics alongside transformations&lt;/li&gt;
&lt;li&gt;You are building customer-facing embedded analytics and need Cube's pre-aggregation performance&lt;/li&gt;
&lt;li&gt;You have heavy Excel/MDX requirements that only AtScale supports&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When the Platform Approach Fits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Your data lives across multiple sources (S3, PostgreSQL, Snowflake, MongoDB) and you need federated access&lt;/li&gt;
&lt;li&gt;You want governance rules defined once and enforced everywhere, including for AI agents&lt;/li&gt;
&lt;li&gt;You are building or planning AI-driven analytics (AI Agent, MCP, natural language querying)&lt;/li&gt;
&lt;li&gt;You want to eliminate the operational overhead of managing a separate semantic layer product&lt;/li&gt;
&lt;li&gt;You need the semantic layer, the catalog, and the query engine to operate as a single governed system&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building Your Semantic Layer: A Practical Framework
&lt;/h2&gt;

&lt;p&gt;If you are starting from scratch or migrating from an ad-hoc metric landscape, here is a practical sequence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Identify your top 10 metrics.&lt;/strong&gt; Not all metrics need to be in the semantic layer on day one. Start with the metrics that cause the most confusion: revenue, churn, active users, cost per acquisition, NPS. These are the metrics where two teams have two different SQL queries and two different numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Build the layered view architecture.&lt;/strong&gt; For each metric, create the three-layer view stack in Dremio. Preparation views clean the source data. Business views encode the agreed-upon logic. Application views tailor the output for specific consumers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Add wikis and labels.&lt;/strong&gt; Document each view and its columns. Define what the metric means, who owns it, how it is calculated, and what filters are applied. Tag columns with labels like &lt;code&gt;PII&lt;/code&gt;, &lt;code&gt;Finance&lt;/code&gt;, or &lt;code&gt;Approved&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Configure governance.&lt;/strong&gt; Apply Fine-Grained Access Control: row-level security for multi-tenant data, column masking for sensitive fields, role-based access for views. These rules are enforced at query time for every consumer, including AI agents.&lt;/p&gt;
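
&lt;p&gt;As a rough sketch of what such policies can look like (the exact DDL is an assumption here; consult Dremio's Fine-Grained Access Control documentation for the precise syntax):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Row-level security: a boolean UDF attached as a row policy.
-- Toy rule for illustration: admins see all rows, everyone
-- else sees only the EMEA region.
CREATE FUNCTION security.region_filter(region VARCHAR)
RETURNS BOOLEAN
RETURN is_member('admins') OR region = 'EMEA';

ALTER TABLE sales.orders
  ADD ROW ACCESS POLICY security.region_filter(region);

-- Column masking: mask SSNs for non-finance users.
CREATE FUNCTION security.mask_ssn(ssn VARCHAR)
RETURNS VARCHAR
RETURN CASE WHEN is_member('finance') THEN ssn ELSE 'XXX-XX-XXXX' END;

ALTER TABLE hr.employees
  MODIFY COLUMN ssn SET MASKING POLICY security.mask_ssn(ssn);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;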

&lt;p&gt;&lt;strong&gt;Step 5: Connect AI interfaces.&lt;/strong&gt; Enable the &lt;a href="https://www.dremio.com/blog/5-steps-to-supercharge-your-analytics-with-dremios-ai-agent-and-apache-iceberg/" rel="noopener noreferrer"&gt;Dremio AI Agent&lt;/a&gt; for your team. Set up the MCP Server for external AI tools. The wikis and labels you added in Step 3 become the context that makes AI-generated SQL accurate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Expand.&lt;/strong&gt; Add the next 10 metrics. Build knowledge graph definitions for complex entity relationships. Let autonomous Reflections learn from query patterns and accelerate the most common queries automatically.&lt;/p&gt;

&lt;p&gt;The semantic layer is not a one-time project. It is a living system that grows with your organization's data needs. Start small, prove value on the metrics that matter most, and expand from there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dremio.com/get-started" rel="noopener noreferrer"&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; to build your semantic layer on top of your existing data sources with zero data movement and native AI agent support.&lt;/p&gt;

&lt;h3&gt;
  
  
  Free Resources to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The Metadata Structure of Modern Table Formats</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:46:39 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/the-metadata-structure-of-modern-table-formats-i81</link>
      <guid>https://forem.com/alexmercedcoder/the-metadata-structure-of-modern-table-formats-i81</guid>
      <description>&lt;p&gt;This is Part 2 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt; covered why table formats exist. This article breaks down exactly how each format organizes its metadata.&lt;/p&gt;

&lt;p&gt;The metadata structure of a table format determines everything: how fast queries start planning, how efficiently concurrent writes are handled, how schema changes propagate, and how much overhead accumulates over time. Two formats can both claim "ACID support" and "time travel" while having fundamentally different mechanisms under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Apache Iceberg: The Metadata Tree
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5df8rbdlb2no5ink2fr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5df8rbdlb2no5ink2fr.png" alt="Iceberg's three-layer metadata architecture from catalog to metadata.json to manifest list to manifest files to data files" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Iceberg organizes metadata into a tree with four levels. Each level adds specificity and enables pruning at query planning time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Catalog pointer.&lt;/strong&gt; The catalog (a REST catalog, &lt;a href="https://www.dremio.com/platform/open-catalog/" rel="noopener noreferrer"&gt;Dremio Open Catalog&lt;/a&gt;, AWS Glue, or Hive Metastore) stores a pointer to the current &lt;code&gt;metadata.json&lt;/code&gt; file. This pointer is the single source of truth for the table's current state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Metadata file (&lt;code&gt;metadata.json&lt;/code&gt;).&lt;/strong&gt; A JSON file containing the table's schema (with column IDs), partition spec, sort order, table properties, and a list of snapshots. Each snapshot represents a complete, immutable version of the table. When the table is updated, a new &lt;code&gt;metadata.json&lt;/code&gt; is created with the new snapshot appended to the list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Manifest list (Avro).&lt;/strong&gt; Each snapshot points to exactly one manifest list. The manifest list is a table of contents: it lists all the manifest files that make up this snapshot and includes partition-level summary statistics for each manifest. These summaries let the query planner skip entire manifests that cannot contain data matching the query filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 4: Manifest files (Avro).&lt;/strong&gt; Each manifest file tracks a set of data files and delete files. For each file, the manifest stores the file path, file size, row count, partition values, and column-level statistics (min value, max value, null count, NaN count, distinct count). These per-file statistics enable file-level pruning during query planning.&lt;/p&gt;

&lt;p&gt;The key insight is that each level progressively narrows the search space. A query engine using &lt;a href="https://www.dremio.com/blog/apache-iceberg-metadata-for-performance/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; or Spark reads the catalog pointer (1 request), loads the metadata file (1 read), checks the manifest list to skip irrelevant manifests (1 read, many skips), then reads only the relevant manifests to find the actual data files to scan. For a petabyte table, this can reduce planning from minutes of directory listing to milliseconds of metadata traversal.&lt;/p&gt;
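
&lt;p&gt;You can inspect this tree directly. With Spark and an Iceberg catalog, the built-in metadata tables expose each layer (the &lt;code&gt;prod.db.orders&lt;/code&gt; names are assumed for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Snapshots recorded in metadata.json.
SELECT snapshot_id, committed_at, operation
FROM   prod.db.orders.snapshots;

-- Entries in the current snapshot's manifest list.
SELECT path, added_data_files_count
FROM   prod.db.orders.manifests;

-- Per-file stats collected from the manifests.
SELECT file_path, record_count, file_size_in_bytes
FROM   prod.db.orders.files;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;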

&lt;h2&gt;
  
  
  Delta Lake: The Sequential Transaction Log
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4yzkef5bi833tuqck72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4yzkef5bi833tuqck72.png" alt="Delta Lake's transaction log structure with JSON commits, Parquet checkpoints, and the reader process" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Delta Lake uses a simpler, linear structure. All metadata lives in the &lt;code&gt;_delta_log/&lt;/code&gt; directory alongside the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON commit files&lt;/strong&gt; (&lt;code&gt;000001.json&lt;/code&gt;, &lt;code&gt;000002.json&lt;/code&gt;, ...) record each transaction as a set of actions: &lt;code&gt;add&lt;/code&gt; (a new data file), &lt;code&gt;remove&lt;/code&gt; (a file marked for deletion), &lt;code&gt;metaData&lt;/code&gt; (schema or property change), and &lt;code&gt;protocol&lt;/code&gt; (version requirements). Each commit file is sequentially numbered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parquet checkpoint files&lt;/strong&gt; are created every 10 commits (by default). A checkpoint is a Parquet file that summarizes the cumulative state of the table at that version, essentially a snapshot of all currently-active &lt;code&gt;add&lt;/code&gt; actions. This prevents readers from having to replay hundreds of small JSON files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;_last_checkpoint&lt;/code&gt;&lt;/strong&gt; is a small file pointing to the most recent checkpoint. The read process is: find the latest checkpoint, load it, then replay any JSON commits after it.&lt;/p&gt;
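
&lt;p&gt;Because each commit is a numbered version, time travel falls out of the log structure for free. In Delta's SQL dialect (engine support varies):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Reading version 42: the reader loads the nearest checkpoint
-- at or before 42, then replays the JSON commits up to it.
SELECT * FROM sales.orders VERSION AS OF 42;

-- Commit history comes straight from the _delta_log entries.
DESCRIBE HISTORY sales.orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;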

&lt;p&gt;The tradeoff: Delta's log is simple and easy to reason about, but it does not have the multi-level pruning that Iceberg's manifest tree provides. File-level statistics exist in the add actions but are not organized hierarchically. For very large tables (millions of files), the planning phase can be slower because there is no intermediate pruning layer equivalent to Iceberg's manifest list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Hudi: The Timeline
&lt;/h2&gt;

&lt;p&gt;Hudi stores metadata in the &lt;code&gt;.hoodie/&lt;/code&gt; directory as a sequence of "instants" on a timeline. Each instant represents an operation (commit, compaction, rollback, clean) and transitions through three states: &lt;code&gt;REQUESTED&lt;/code&gt;, &lt;code&gt;INFLIGHT&lt;/code&gt;, and &lt;code&gt;COMPLETED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The timeline is split into two parts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active timeline&lt;/strong&gt; contains recent instants that are needed for current read and write operations. The file naming pattern is &lt;code&gt;&amp;lt;timestamp&amp;gt;.&amp;lt;action_type&amp;gt;.&amp;lt;state&amp;gt;&lt;/code&gt;. For example, &lt;code&gt;20250429010500.commit.completed&lt;/code&gt; indicates a completed write operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Archived timeline&lt;/strong&gt; contains older instants that have been moved to &lt;code&gt;.hoodie/archived/&lt;/code&gt; to keep the active timeline lean. Hudi 1.0 introduced an LSM-based timeline that compacts archived instants into Parquet files for efficient long-term storage.&lt;/p&gt;

&lt;p&gt;Hudi's timeline tracks more granular operation types than other formats: &lt;code&gt;commit&lt;/code&gt; (COW write), &lt;code&gt;delta_commit&lt;/code&gt; (MOR write), &lt;code&gt;compaction&lt;/code&gt;, &lt;code&gt;clean&lt;/code&gt; (garbage collection), &lt;code&gt;rollback&lt;/code&gt;, &lt;code&gt;savepoint&lt;/code&gt;, and &lt;code&gt;replace&lt;/code&gt; (clustering). This granularity reflects Hudi's focus on complex write patterns like CDC pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Paimon: Snapshots and LSM Trees
&lt;/h2&gt;

&lt;p&gt;Paimon's metadata is organized around snapshots and buckets. Each partition is divided into a fixed number of buckets, and each bucket contains an independent LSM (Log-Structured Merge) tree.&lt;/p&gt;

&lt;p&gt;The snapshot metadata tracks which data files and changelog files belong to each bucket at each point in time. Inside each bucket, the LSM tree structure contains multiple "sorted runs" (levels) of Parquet files. When data is written, it lands in level 0 as a small sorted file. Background compaction merges small files into larger ones at higher levels.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from the other formats because Paimon's metadata structure is designed for continuous streaming writes rather than batch commits. The LSM tree handles high-frequency inserts and updates efficiently by buffering writes in memory and flushing them as sorted runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  DuckLake: SQL Database as Metadata
&lt;/h2&gt;

&lt;p&gt;DuckLake takes the most radical departure. Instead of storing metadata as files in object storage, all metadata lives in a traditional SQL database (PostgreSQL, MySQL, SQLite, or DuckDB itself).&lt;/p&gt;

&lt;p&gt;The metadata database contains tables for: schemas, snapshots, data files, column statistics, and table properties. When a query engine needs to plan a query, it issues a single SQL query against the metadata database instead of reading multiple metadata files from object storage.&lt;/p&gt;
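
&lt;p&gt;A sketch of what planning-time access looks like; the &lt;code&gt;ducklake_*&lt;/code&gt; table names follow the published DuckLake spec's naming, but treat the exact columns as illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Find the latest snapshot, then the data files visible in it.
SELECT MAX(snapshot_id) FROM ducklake_snapshot;

SELECT path, record_count
FROM   ducklake_data_file
WHERE  begin_snapshot &amp;lt;= 42
  AND  (end_snapshot IS NULL OR end_snapshot &amp;gt; 42);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;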

&lt;p&gt;The tradeoff is a dependency on a running database process for metadata management. The benefit is dramatically simpler metadata access patterns and the ability to use SQL for metadata operations like listing snapshots, finding files, and checking statistics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Side-by-Side Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mnuqgoy9v85w76z9vka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mnuqgoy9v85w76z9vka.png" alt="Five approaches to table metadata from file-based to database-backed" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Iceberg&lt;/th&gt;
&lt;th&gt;Delta Lake&lt;/th&gt;
&lt;th&gt;Hudi&lt;/th&gt;
&lt;th&gt;Paimon&lt;/th&gt;
&lt;th&gt;DuckLake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JSON + Avro files&lt;/td&gt;
&lt;td&gt;JSON + Parquet files&lt;/td&gt;
&lt;td&gt;Avro instant files&lt;/td&gt;
&lt;td&gt;Snapshot + LSM files&lt;/td&gt;
&lt;td&gt;SQL database tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Object storage&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;_delta_log/&lt;/code&gt; directory&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.hoodie/&lt;/code&gt; directory&lt;/td&gt;
&lt;td&gt;Table directory&lt;/td&gt;
&lt;td&gt;External database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-level pruning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (manifest list + manifests)&lt;/td&gt;
&lt;td&gt;No (flat file list)&lt;/td&gt;
&lt;td&gt;Partial (index-based)&lt;/td&gt;
&lt;td&gt;No (bucket-level)&lt;/td&gt;
&lt;td&gt;Via SQL queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planning overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (tree traversal)&lt;/td&gt;
&lt;td&gt;Moderate (checkpoint + replay)&lt;/td&gt;
&lt;td&gt;Moderate (timeline scan)&lt;/td&gt;
&lt;td&gt;Low (snapshot lookup)&lt;/td&gt;
&lt;td&gt;Lowest (single SQL query)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata growth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Controlled (manifest reuse)&lt;/td&gt;
&lt;td&gt;Requires checkpointing&lt;/td&gt;
&lt;td&gt;Requires archiving&lt;/td&gt;
&lt;td&gt;Requires compaction&lt;/td&gt;
&lt;td&gt;Database manages it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engine independence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (spec-defined)&lt;/td&gt;
&lt;td&gt;Moderate (Spark-oriented)&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Low (Flink-oriented)&lt;/td&gt;
&lt;td&gt;Low (DuckDB-oriented)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For teams building on multiple engines, Iceberg's metadata structure provides the best combination of planning efficiency and engine independence. &lt;a href="https://www.dremio.com/blog/apache-iceberg-delta-lake-apache-hudi-a-comparison/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; uses Iceberg's metadata tree to achieve fast query planning even on tables with millions of files, and its &lt;a href="https://www.dremio.com/platform/reflections/" rel="noopener noreferrer"&gt;Columnar Cloud Cache&lt;/a&gt; caches frequently accessed metadata locally to further reduce planning latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Part 3&lt;/a&gt; covers how query engines use Iceberg's metadata specifically for performance optimization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What Are Table Formats and Why Were They Needed?</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:17:52 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/what-are-table-formats-and-why-were-they-needed-4f9k</link>
      <guid>https://forem.com/alexmercedcoder/what-are-table-formats-and-why-were-they-needed-4f9k</guid>
      <description>&lt;p&gt;This is Part 1 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. This article covers the fundamental question: what problem do table formats solve, and why does the choice between them matter?&lt;/p&gt;

&lt;p&gt;A data lake without a table format is a collection of files. It has no concept of a transaction, no mechanism to prevent two writers from producing corrupted state, and no efficient way to determine which files belong to the current version of a table. Table formats exist because the gap between "a pile of Parquet files" and "a reliable analytical table" is enormous, and bridging it requires a formal metadata specification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The World Before Table Formats
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o685rz15tbrp5y5lx82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o685rz15tbrp5y5lx82.png" alt="How table formats solved the chaos of raw data lakes with a structured metadata layer" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before table formats, data lakes relied on a simple convention: data was organized into directories in object storage (S3, ADLS, GCS), and the &lt;a href="https://cwiki.apache.org/confluence/display/hive/design#Design-HiveMetastore" rel="noopener noreferrer"&gt;Hive Metastore&lt;/a&gt; tracked which directories corresponded to which partitions.&lt;/p&gt;

&lt;p&gt;This approach had five critical problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No atomic commits.&lt;/strong&gt; If a Spark job wrote 500 new Parquet files and failed after writing 300, readers could see the 300 partial files. There was no mechanism to make all 500 files visible at once or none of them. Cleanup required manual intervention or custom garbage collection scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expensive query planning.&lt;/strong&gt; To determine which files to scan, the engine issued &lt;code&gt;LIST&lt;/code&gt; requests against object storage. S3 returns at most 1,000 objects per request, so a table with 100,000 files required 100 sequential HTTP calls before query execution could even start. At Netflix, query planning for large tables could take minutes just from directory listing.&lt;/p&gt;
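
&lt;p&gt;For a sense of the mechanics, this boto3 sketch counts how many sequential LIST requests a plain listing costs; each page is one HTTP round trip capped at 1,000 keys. The bucket and prefix names are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

calls = 0
keys = 0
# Each page is a single ListObjectsV2 request returning at most 1,000 keys.
for page in paginator.paginate(Bucket="my-data-lake", Prefix="warehouse/events/"):
    calls += 1
    keys += len(page.get("Contents", []))

print(f"{keys} objects took {calls} sequential LIST calls")
&lt;/code&gt;&lt;/pre&gt;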

&lt;p&gt;&lt;strong&gt;Schema changes required rewrites.&lt;/strong&gt; Adding a column to a Hive table meant either rewriting every file (expensive) or accepting that old files had a different schema than new files (confusing). Renaming a column was not supported without a full table rewrite because Hive mapped columns by position, not by identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No time travel.&lt;/strong&gt; Once data was overwritten, the previous version was gone. There was no snapshot history, no ability to roll back a bad write, and no way to reproduce a query result from last Tuesday.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exposed partitioning.&lt;/strong&gt; Users had to know the physical partition layout. A table partitioned by &lt;code&gt;year&lt;/code&gt; and &lt;code&gt;month&lt;/code&gt; required queries to explicitly filter on those columns using the exact partition column names (&lt;code&gt;WHERE year = 2024 AND month = 3&lt;/code&gt;). If partitioning changed, every downstream query broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Table Format Actually Is
&lt;/h2&gt;

&lt;p&gt;A table format is a specification that defines how to organize metadata about data files so that query engines can treat them as reliable, transactional tables. It sits between the query engine and the physical files.&lt;/p&gt;

&lt;p&gt;The core responsibilities of every table format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File tracking&lt;/strong&gt;: Maintain an explicit list of which data files belong to the current version of the table, eliminating directory listing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic commits&lt;/strong&gt;: Make all changes to a table visible to readers at once through a single metadata pointer swap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema management&lt;/strong&gt;: Track the table schema and its evolution history, allowing safe column adds, drops, renames, and reorders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition management&lt;/strong&gt;: Define how data is partitioned and enable query pruning without exposing the physical layout to users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot history&lt;/strong&gt;: Maintain a history of table states for time travel, rollback, and auditing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistics&lt;/strong&gt;: Store column-level min/max values and other statistics to enable file skipping during query planning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data files themselves are still standard &lt;a href="https://parquet.apache.org/" rel="noopener noreferrer"&gt;Parquet&lt;/a&gt; or ORC. The table format adds a metadata layer on top that gives those files the properties of a database table.&lt;/p&gt;
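
&lt;p&gt;To make the atomic-commit responsibility concrete, here is a toy Python model, nothing like a real format's spec, in which each snapshot is an immutable JSON file and a single pointer file names the current snapshot. Readers can never observe a partial commit because visibility changes through one atomic rename on a local filesystem:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import os
import tempfile
import time

def commit(table_dir, data_files, schema):
    # Write an immutable snapshot file listing every live data file.
    snap = {"schema": schema, "data_files": data_files}
    snap_path = os.path.join(table_dir, f"snap-{time.time_ns()}.json")
    with open(snap_path, "w") as f:
        json.dump(snap, f)
    # Stage the new pointer, then atomically rename it over the old one,
    # so readers see all of the new files at once, or none of them.
    fd, tmp = tempfile.mkstemp(dir=table_dir)
    with os.fdopen(fd, "w") as f:
        f.write(snap_path)
    os.replace(tmp, os.path.join(table_dir, "current"))

def current_files(table_dir):
    # Resolve the pointer, then read the snapshot it names.
    with open(os.path.join(table_dir, "current")) as f:
        snap_path = f.read()
    with open(snap_path) as f:
        return json.load(f)["data_files"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Real table formats achieve the same all-or-nothing visibility through mechanisms like catalog pointer swaps or atomic log appends rather than filesystem renames, but the reader-visible guarantee is the same.&lt;/p&gt;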

&lt;h2&gt;
  
  
  The Five Table Formats
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54mph70nshvopthgc1wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54mph70nshvopthgc1wg.png" alt="Timeline showing the evolution from Hive Metastore through Hudi, Iceberg, Delta Lake, Paimon, and DuckLake" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Five major table formats are in wide use today, each born from a different problem and optimized for a different workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Iceberg
&lt;/h3&gt;

&lt;p&gt;Iceberg started at Netflix in 2017, created by Ryan Blue and Daniel Weeks to solve Netflix's petabyte-scale query planning problems. It uses a three-layer metadata tree: a &lt;code&gt;metadata.json&lt;/code&gt; file points to a manifest list, which points to manifest files, which track individual data files with column-level statistics.&lt;/p&gt;
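
&lt;p&gt;You can walk this tree from Python. The sketch below assumes PyIceberg is installed, a catalog named &lt;code&gt;default&lt;/code&gt; is configured, and a table &lt;code&gt;analytics.events&lt;/code&gt; exists; the &lt;code&gt;inspect&lt;/code&gt; API is available in recent PyIceberg releases:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyiceberg.catalog import load_catalog

# Hypothetical names; catalog settings come from PyIceberg configuration.
catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

# Root of the tree: the current metadata.json file.
print(table.metadata_location)

# Each layer of the tree is exposed as a metadata table.
print(table.inspect.snapshots())  # snapshots, each pointing at a manifest list
print(table.inspect.manifests())  # manifests referenced by the manifest list
print(table.inspect.files())      # data files with column-level statistics
&lt;/code&gt;&lt;/pre&gt;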

&lt;p&gt;Iceberg's defining feature is its &lt;a href="https://iceberg.apache.org/spec/" rel="noopener noreferrer"&gt;formal specification&lt;/a&gt;. Any engine that follows the spec can read and write Iceberg tables correctly. This makes Iceberg the most engine-neutral format. Spark, Trino, Flink, &lt;a href="https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt;, Snowflake, BigQuery, Athena, StarRocks, and DuckDB all support it.&lt;/p&gt;

&lt;p&gt;Iceberg also introduced &lt;a href="https://www.dremio.com/blog/fewer-accidental-full-table-scans-brought-to-you-by-apache-icebergs-hidden-partitioning/" rel="noopener noreferrer"&gt;hidden partitioning&lt;/a&gt; and partition evolution, which are covered in depth in Parts 4 and 5 of this series.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delta Lake
&lt;/h3&gt;

&lt;p&gt;Delta Lake was created at Databricks in 2019. It stores metadata as a sequential transaction log (&lt;code&gt;_delta_log/&lt;/code&gt;) of JSON and Parquet checkpoint files. Each commit appends a new log entry describing which files were added or removed.&lt;/p&gt;
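
&lt;p&gt;Because the log is newline-delimited JSON, replaying it takes only a few lines of Python. This sketch reconstructs the live file set of a local Delta table (the table path is hypothetical, and real readers also apply Parquet checkpoints instead of replaying from zero):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import glob
import json

live_files = set()
# Replay log entries in commit order; each line holds one action.
for entry in sorted(glob.glob("my_table/_delta_log/*.json")):
    with open(entry) as f:
        for line in f:
            action = json.loads(line)
            if "add" in action:
                live_files.add(action["add"]["path"])
            elif "remove" in action:
                live_files.discard(action["remove"]["path"])

print(live_files)
&lt;/code&gt;&lt;/pre&gt;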

&lt;p&gt;Delta Lake's design prioritizes simplicity within the Spark ecosystem. Its strongest features are Liquid Clustering (adaptive data organization that replaces static partitioning) and UniForm (automatic generation of Iceberg-compatible metadata so other engines can read Delta tables as Iceberg).&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Hudi
&lt;/h3&gt;

&lt;p&gt;Hudi originated at Uber in 2016, designed specifically for Change Data Capture (CDC) pipelines that needed to upsert millions of records per hour. Hudi uses a timeline-based metadata architecture where each commit, compaction, and rollback is an "action instant."&lt;/p&gt;

&lt;p&gt;Hudi offers both Copy-on-Write (rewrite entire files on update) and Merge-on-Read (write deltas and merge at read time) table types, plus record-level indexing for fast point lookups. It excels when your primary workload involves frequent row-level updates and deletes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Paimon
&lt;/h3&gt;

&lt;p&gt;Paimon evolved from Flink Table Store at Alibaba and entered Apache incubation in 2023. It uses &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" rel="noopener noreferrer"&gt;LSM-tree&lt;/a&gt; based storage internally, making it the most streaming-native table format.&lt;/p&gt;

&lt;p&gt;Tables in Paimon are divided into partitions and then further into buckets, each containing an independent LSM tree. This structure enables high-throughput streaming writes with millisecond-level latency. Paimon supports multiple merge engines (deduplication, partial update, aggregation) that determine how records with the same primary key are resolved.&lt;/p&gt;

&lt;h3&gt;
  
  
  DuckLake
&lt;/h3&gt;

&lt;p&gt;DuckLake is the newest entry, released by DuckDB Labs and MotherDuck in 2025. It takes a fundamentally different approach: instead of storing metadata as files in object storage, DuckLake stores all metadata in a standard SQL database (PostgreSQL, MySQL, SQLite, or DuckDB itself).&lt;/p&gt;

&lt;p&gt;This means a single SQL query resolves all metadata (schema, file list, statistics) instead of the multiple HTTP requests required by file-based metadata formats. The tradeoff is a dependency on a running database for the metadata layer and currently limited engine support (primarily DuckDB).&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Each Format Excels
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiu05e2c5oholrsnf2qb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiu05e2c5oholrsnf2qb.png" alt="Positioning chart showing where Iceberg, Delta Lake, Hudi, Paimon, and DuckLake sit on batch vs streaming and single vs multi-engine axes" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Iceberg&lt;/th&gt;
&lt;th&gt;Delta Lake&lt;/th&gt;
&lt;th&gt;Hudi&lt;/th&gt;
&lt;th&gt;Paimon&lt;/th&gt;
&lt;th&gt;DuckLake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File-based tree&lt;/td&gt;
&lt;td&gt;File-based log&lt;/td&gt;
&lt;td&gt;File-based timeline&lt;/td&gt;
&lt;td&gt;File-based LSM&lt;/td&gt;
&lt;td&gt;SQL database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engine support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Broadest&lt;/td&gt;
&lt;td&gt;Good (via UniForm)&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema evolution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;By column ID&lt;/td&gt;
&lt;td&gt;By name&lt;/td&gt;
&lt;td&gt;By version&lt;/td&gt;
&lt;td&gt;By version&lt;/td&gt;
&lt;td&gt;SQL ALTER&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Partition evolution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (unique)&lt;/td&gt;
&lt;td&gt;Liquid Clustering&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Bucket evolution&lt;/td&gt;
&lt;td&gt;SQL-managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming writes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-engine analytics&lt;/td&gt;
&lt;td&gt;Spark/Databricks&lt;/td&gt;
&lt;td&gt;CDC/upserts&lt;/td&gt;
&lt;td&gt;Flink streaming&lt;/td&gt;
&lt;td&gt;Local SQL analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: each format reflects the priorities of the team that built it. Netflix needed multi-engine reads at petabyte scale (Iceberg). Uber needed high-frequency upserts (Hudi). Alibaba needed real-time streaming from Flink (Paimon). Databricks needed Spark-optimized simplicity (Delta). DuckDB Labs wanted SQL-native metadata management (DuckLake).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Iceberg Has Become the Default
&lt;/h2&gt;

&lt;p&gt;Iceberg has achieved the broadest adoption for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Specification-first design.&lt;/strong&gt; Iceberg's &lt;a href="https://iceberg.apache.org/spec/" rel="noopener noreferrer"&gt;spec&lt;/a&gt; is independent of any engine or vendor. Any team can build a conforming implementation. This created a network effect: more engine support attracted more users, which attracted more engine support.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No engine dependency.&lt;/strong&gt; Unlike Delta Lake's historical Spark dependency or Paimon's Flink focus, Iceberg was designed from day one to work across engines. A table written by Spark can be read by &lt;a href="https://www.dremio.com/blog/apache-iceberg-delta-lake-apache-hudi-a-comparison/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt;, Trino, Flink, or Snowflake without conversion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Industry convergence.&lt;/strong&gt; Snowflake, AWS (Athena, EMR), Google (BigQuery), and Databricks (via UniForm) have all adopted Iceberg as an interoperability standard. When the major cloud vendors align on a format, it becomes the safe choice for long-term investments.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That said, Iceberg is not universally superior. Hudi's record-level indexing makes it faster for point lookups on upsert-heavy tables. Paimon's LSM-tree architecture handles continuous streaming ingestion with lower latency than Iceberg's batch-oriented commit model. DuckLake's SQL-based metadata is simpler for single-engine, local-first analytics.&lt;/p&gt;

&lt;p&gt;The rest of this series focuses on Iceberg because its design decisions and capabilities represent the state of the art for multi-engine analytical lakehouses. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; examines the metadata structures of all five formats in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;p&gt;To learn more about Apache Iceberg and the lakehouse architecture, check out these resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AI Weekly: Google's TPU Split, Cursor's $60B, and MCP at Scale</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 29 Apr 2026 12:54:58 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/ai-weekly-googles-tpu-split-cursors-60b-and-mcp-at-scale-1c6e</link>
      <guid>https://forem.com/alexmercedcoder/ai-weekly-googles-tpu-split-cursors-60b-and-mcp-at-scale-1c6e</guid>
      <description>&lt;p&gt;&lt;strong&gt;Week of April 23–29, 2026&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This week, Google split its eighth-generation TPU into two specialized chips. SpaceX disclosed rights to acquire Cursor for $60 billion. Google Cloud Next 2026 framed enterprise software around autonomous agents, and the Model Context Protocol moved deeper into production-grade territory.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Coding Tools: SpaceX Eyes Cursor at $60B and Google Pushes Agent Platforms
&lt;/h2&gt;

&lt;p&gt;SpaceX announced on April 22 that it has rights to buy AI coding tool Cursor for $60 billion later this year, with an alternative $10 billion partnership option. The move positions Elon Musk's space and AI properties to compete with Anthropic and OpenAI ahead of a planned Wall Street debut. Cursor, made by San Francisco startup Anysphere, has wide distribution to expert software engineers, which is part of what makes it attractive to Musk's company. &lt;a href="https://www.usnews.com/news/best-states/california/articles/2026-04-22/spacex-says-it-can-buy-ai-coding-tool-cursor-for-60b-later-this-year" rel="noopener noreferrer"&gt;Read the AP report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Google Cloud Next 2026 ran April 22–24 in Las Vegas and made coding agents the centerpiece. Google rebranded its AI platform as the Gemini Enterprise Agent Platform, billed as a one-stop shop for autonomous agents with 200+ foundation models and enterprise governance. The platform supports a new Agents CLI that takes agents from creation to production through a single command-line tool. &lt;a href="https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/next-2026/" rel="noopener noreferrer"&gt;See the announcements&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cursor 2.0 also gained attention this month for supporting up to eight parallel AI agents working on different sections of a codebase at the same time. Claude Code, meanwhile, now powers GitHub Copilot's enterprise tier with multi-agent coordination that splits large tasks into parallel subtasks. The category leaders are converging on the same pattern: agents that read codebases, plan changes across multiple files, write the code, and run the tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Processing: Google Splits Its TPU Into Training and Inference Chips
&lt;/h2&gt;

&lt;p&gt;Google Cloud announced on April 22 that its eighth-generation TPU is splitting into two specialized chips. The TPU 8t targets model training and the TPU 8i targets inference. Google reports up to 3x faster AI model training and 80% better performance per dollar over the previous generation, with the ability to link more than 1 million TPUs in a single cluster. &lt;a href="https://techcrunch.com/2026/04/22/google-cloud-next-new-tpu-ai-chips-compete-with-nvidia/" rel="noopener noreferrer"&gt;Read the TechCrunch coverage&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Google also confirmed a partnership with Nvidia to extend Falcon, the software-based networking technology Google created and open-sourced in 2023 under the Open Compute Project. The work aims to make Nvidia-based systems perform better inside Google Cloud, a notable detente given Google's TPU sales push.&lt;/p&gt;

&lt;p&gt;The market for Nvidia's chip rivals is also booming. AI chip startups raised $8.3 billion globally in 2026, according to Dealroom, with Cerebras Systems pulling in $1 billion in February and $500 million rounds going to MatX, Ayar Labs, and Etched. European companies like Axelera and Olix raised rounds north of $200 million. The argument: GPUs were not purpose-designed for AI inference, and novel system architectures bring big savings in energy and cost. &lt;a href="https://www.cnbc.com/2026/04/17/nvidia-ai-chip-rivals-funding-euclyd-fractile.html" rel="noopener noreferrer"&gt;See the CNBC report&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standards &amp;amp; Protocols: MCP Hits Production Scale and Agentic Foundations Mature
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol crossed a clear adoption threshold this month. MCP downloads now run at roughly 110 million per month across OpenAI, Google, LangChain, and other frameworks, according to a recent Anthropic keynote on the protocol's direction. By Q2 2026, community-built MCP servers exist for GitHub, Slack, PostgreSQL, Stripe, Figma, Docker, Kubernetes, and over 200 other tools. &lt;a href="https://en.wikipedia.org/wiki/Model_Context_Protocol" rel="noopener noreferrer"&gt;See the Wikipedia summary&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The 2026 MCP roadmap published in March identified four priorities. First is streamable HTTP transport scalability. Second is the Tasks primitive lifecycle, including retry semantics and expiry policies. Third is governance maturation. Fourth is enterprise readiness covering audit trails, SSO-integrated auth, gateway behavior, and configuration portability. Stateful sessions conflict with load balancers, and horizontal scaling needs better support, so the working groups are evolving the existing transport rather than adding new ones. &lt;a href="https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/" rel="noopener noreferrer"&gt;Read the roadmap&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Google Cloud Next 2026 also gave standards work a public showcase. A breakout session covered "Generative UI for any agent, anywhere: A2UI, AG-UI, MCP Apps, and more." Interoperability between agent UI standards is now part of mainstream cloud roadmaps.&lt;/p&gt;

&lt;p&gt;The Agentic AI Foundation launched in December 2025 under the Linux Foundation. Founding contributions came from Anthropic's MCP, OpenAI's AGENTS.md, and Block's Goose. AAIF held its first MCP Dev Summit North America in New York earlier this month, drawing about 1,200 attendees, double the prior event. The next AAIF events are AGNTCon + MCPCon Europe on September 17–18 in Amsterdam and AGNTCon + MCPCon North America on October 22–23 in San Jose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources to Go Further
&lt;/h2&gt;

&lt;p&gt;The AI landscape changes fast. Here are tools and resources to help you keep pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Dremio Free&lt;/strong&gt; — Experience agentic analytics and an Apache Iceberg-powered lakehouse. &lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-29-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Start your free trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Agentic AI with Data&lt;/strong&gt; — Dremio's agentic analytics features let your AI agents query and act on live data. &lt;a href="https://www.dremio.com/use-cases/agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-29-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Explore Dremio Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the Community&lt;/strong&gt; — Connect with data engineers and AI practitioners building on open standards. &lt;a href="https://developer.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-29-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Join the Dremio Developer Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: The 2026 Guide to AI-Assisted Development&lt;/strong&gt; — Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. &lt;a href="https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: Using AI Agents for Data Engineering and Data Analysis&lt;/strong&gt; — A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. &lt;a href="https://www.amazon.com/Using-Agents-Data-Engineering-Analysis-ebook/dp/B0GR6PYJT9/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>google</category>
      <category>news</category>
      <category>startup</category>
    </item>
    <item>
      <title>Apache Data Lakehouse Weekly: April 23–29, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 29 Apr 2026 12:52:51 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-23-29-2026-36b7</link>
      <guid>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-23-29-2026-36b7</guid>
      <description>&lt;p&gt;Three weeks past the Iceberg Summit, the lakehouse projects shifted from in-person alignment back into shipping mode. Polaris cut its 1.4.0 release and immediately followed up with a Python CLI 1.4.0, Arrow shipped its 24.0.0 major release and kicked off an arrow-rs 58.2.0 vote, and Parquet's design lists stayed dense with proposals on footers, page encoding, and a new java release discussion. Iceberg's dev list was quieter this week as contributors digested summit follow-ups and continued narrowing on V4 design questions in the background.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Iceberg
&lt;/h2&gt;

&lt;p&gt;The post-summit wave of formal proposals continued translating into design work this week. The V4 metadata.json optionality direction that has anchored multiple syncs — treating catalog-managed metadata as a first-class supported mode while keeping static-table portability through explicit opt-in semantics — is still the defining V4 design conversation, with Anton Okolnychyi, Yufei Gu, Shawn Chang, and Steven Wu continuing to push edge cases on portability and Spark driver behavior. The single-file commits proposal that Russell Spitzer and Amogh Jahagirdar have been advancing remains on track for a formal write-up, with the latency and metadata footprint reductions driving urgency.&lt;/p&gt;

&lt;p&gt;Péter Váry's &lt;a href="https://www.mail-archive.com/dev@iceberg.apache.org/msg12972.html" rel="noopener noreferrer"&gt;efficient column updates proposal&lt;/a&gt; for wide tables continued attracting collaboration. The design — write only the columns that change on each commit, then stitch the result at read time — is squarely aimed at petabyte-scale feature stores with thousands of embedding and model-score columns, and the I/O savings make it one of the more practically grounded V4 proposals on the list. Anurag Mantripragada and Gábor Kaszab are working alongside Péter on POC benchmarks to support the formal proposal that should land on the dev list in the coming weeks.&lt;/p&gt;

&lt;p&gt;On the Rust side, the &lt;a href="https://www.mail-archive.com/dev@iceberg.apache.org/msg12986.html" rel="noopener noreferrer"&gt;Iceberg Rust 0.9.0 release&lt;/a&gt; shipped earlier this development cycle and continues to anchor downstream adoption discussions, with its DataFusion integration making it a serious option for teams that want Iceberg without a JVM dependency. Iceberg Summit 2026 session recordings are also rolling out on the project's YouTube channel this week, giving the global community access to the V4 design talks, the vendor panel, and the production case studies from Apple, Bloomberg, Pinterest, and others. The AI contribution policy that Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun pushed through March is still expected to land as published guidance covering disclosure requirements and code provenance standards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Polaris
&lt;/h2&gt;

&lt;p&gt;Polaris had its biggest release week of the year. Adnan Hemani &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04499.html" rel="noopener noreferrer"&gt;announced Apache Polaris 1.4.0&lt;/a&gt; on April 23, the project's first major release as a graduated top-level project. Dmitri Bourlatchkov, Yufei Gu, Xi Wen, and Alexandre Dutra all weighed in with congratulations and follow-up notes on packaging and distribution. Right behind it, Adnan kicked off and shepherded the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04509.html" rel="noopener noreferrer"&gt;Apache Polaris Python CLI 1.4.0 RC2 vote&lt;/a&gt;, which collected binding +1s from Yufei Gu, Honah J., and Jean-Baptiste Onofré, with Yong Zheng adding non-binding support. The &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04551.html" rel="noopener noreferrer"&gt;Python CLI 1.4.0 release&lt;/a&gt; shipped on April 28, completing the back-to-back release pair. Jean-Baptiste also confirmed in a &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04476.html" rel="noopener noreferrer"&gt;HEADS UP note&lt;/a&gt; that the project is now back on a monthly release cadence after the graduation transition.&lt;/p&gt;

&lt;p&gt;The release had its share of post-launch fires. Alexandre Dutra opened threads on &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04512.html" rel="noopener noreferrer"&gt;Helm chart repo inconsistency after the 1.4.0 release&lt;/a&gt;, &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04513.html" rel="noopener noreferrer"&gt;a release workflow failure in step 4&lt;/a&gt;, and an &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04514.html" rel="noopener noreferrer"&gt;Artifact Hub request for official status&lt;/a&gt;. A &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04544.html" rel="noopener noreferrer"&gt;GitHub thread on KMS-related errors after bumping to 1.4.0&lt;/a&gt; surfaced a real upgrade bug that drew immediate attention. Yufei Gu took the lead on triaging most of these, and the discussions are doing exactly what a healthy post-release cycle should — surfacing rough edges before they reach more users.&lt;/p&gt;

&lt;p&gt;Design discussions stayed active alongside the release work. EJ Wang's &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04485.html" rel="noopener noreferrer"&gt;DISCUSS thread on AGENTS.md for Polaris&lt;/a&gt; opened a conversation about adding agent-readable repository metadata, picking up engagement from Yufei Gu. Yufei separately started &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04486.html" rel="noopener noreferrer"&gt;a discussion on narrowing the scope of SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION&lt;/a&gt;, which Dmitri Bourlatchkov and Dennis Huo dug into. ITing Lee's &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04430.html" rel="noopener noreferrer"&gt;proposal to add OpenLineage to Polaris&lt;/a&gt; continued attracting feedback from Adnan Hemani, Jean-Baptiste Onofré, Yufei Gu, and Michael Collado. Alexandre Dutra's URL path decoding thread and his &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04429.html" rel="noopener noreferrer"&gt;PolarisPrivilege fields and grant validation&lt;/a&gt; discussion both kept multiple contributors engaged through the week, and Selvamohan Neethiraj raised a &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04496.html" rel="noopener noreferrer"&gt;PolarisPrincipal user attributes server-side bug&lt;/a&gt; that Alexandre and Yufei traced through.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Arrow
&lt;/h2&gt;

&lt;p&gt;Arrow had its own back-to-back release week. Raúl Cumplido &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34611.html" rel="noopener noreferrer"&gt;announced Apache Arrow 24.0.0&lt;/a&gt; on April 22, closing out the &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34606.html" rel="noopener noreferrer"&gt;24.0.0 RC0 vote&lt;/a&gt; that spanned mid-April. Matt Topol followed with the &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34613.html" rel="noopener noreferrer"&gt;Apache Arrow Go 18.6.0 RC0 vote&lt;/a&gt; on April 22 and announced the &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34629.html" rel="noopener noreferrer"&gt;release result&lt;/a&gt; on April 28, with Pedro Matias, Ian Cook, David Li, and Bryce Mecum carrying the verification work. Andrew Lamb then opened the &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34631.html" rel="noopener noreferrer"&gt;arrow-rs 58.2.0 RC1 vote&lt;/a&gt; on April 28, with Bryce Mecum, Ed Seidl, Jeffrey Vo, and Raúl Cumplido moving quickly through verification — finishing what last week's newsletter flagged as the next ship to watch.&lt;/p&gt;

&lt;p&gt;Beyond releases, the design conversations stayed lively. Emil Sadek opened a &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34619.html" rel="noopener noreferrer"&gt;DISCUSS thread on an ADBC Logo Proposal&lt;/a&gt; with Nic Crane, Julian Hyde, and Rusty Conover weighing in on visual identity for the database connectivity standard. Benjamin Philip kicked off a new &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34628.html" rel="noopener noreferrer"&gt;DISCUSS thread on Arrow Erlang's grant documents&lt;/a&gt;, continuing the project's expansion into more language ecosystems. The &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34576.html" rel="noopener noreferrer"&gt;pyarrow-stubs donation vote&lt;/a&gt; that Rok Mihevc opened on April 14 stayed active, drawing additional support this week with Rok pushing for a final tally. Mandukhai Alimaa's earlier &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34604.html" rel="noopener noreferrer"&gt;proposal for a canonical BigDecimal extension type&lt;/a&gt; and Andrew Lamb's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34610.html" rel="noopener noreferrer"&gt;arrow-rs security policy discussion&lt;/a&gt; both continued generating engagement as the project tightens its production posture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Parquet
&lt;/h2&gt;

&lt;p&gt;Parquet's lists were as dense as any project's this week. Ismaël Mejía opened a &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27247.html" rel="noopener noreferrer"&gt;thread soliciting code reviews for Java performance optimization work&lt;/a&gt;, with Steve Loughran picking it up immediately. Manu Zhang's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27212.html" rel="noopener noreferrer"&gt;DISCUSS thread on a new parquet-java release&lt;/a&gt; drew sustained engagement from Steve Loughran, Aaron Niskode-Dossett, Fokko Driesprong, Julien Le Dem, Gang Wu, and Rahil C — covering both the timing question and what should ship in the next release. Julien Le Dem's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27227.html" rel="noopener noreferrer"&gt;Parquet sync on April 22&lt;/a&gt; drew Manu Zhang and Micah Kornfield into the agenda discussion.&lt;/p&gt;

&lt;p&gt;The format-level proposals continued to evolve. Will Edwards's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27142.html" rel="noopener noreferrer"&gt;DISCUSS thread on an alternative to the FlatBuffer footer with a lightweight byte-offset index&lt;/a&gt; kept pulling in design feedback from Andrew Lamb, Ed Seidl, Jan Finis, Alkis Evlogimenos, Raphael Taylor-Davies, Andrew Bell, and others. Ed Seidl's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27197.html" rel="noopener noreferrer"&gt;proposal to make path_in_schema optional&lt;/a&gt; attracted commentary from Gang Wu, Steve Loughran, and Micah Kornfield. Andrew Lamb's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27192.html" rel="noopener noreferrer"&gt;thread on where VariantJsonParser should live&lt;/a&gt; — touching the boundary between Parquet and Iceberg's variant tooling — continued with Steve Loughran and Gang Wu. Jan Finis's question on &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27214.html" rel="noopener noreferrer"&gt;whether a too-long RLE bitpack at the end of a page is valid&lt;/a&gt; drew careful answers from Raphael Taylor-Davies and Micah Kornfield, the kind of spec-edge clarification that matters for cross-implementation interop. Milan Stefanovic's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27136.html" rel="noopener noreferrer"&gt;Geospatial CRS string format clarification&lt;/a&gt; continued threading toward closure with Dewey Dunnington and Micah Kornfield.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Project Themes
&lt;/h2&gt;

&lt;p&gt;This week's clearest pattern is post-graduation Polaris finding its operational rhythm. The 1.4.0 release plus the Python CLI 1.4.0, the return to a monthly cadence, and the visible upgrade-path bugs and Helm packaging issues are all the work of a project growing into its TLP independence. The fact that contributors are surfacing problems publicly and triaging them on the dev list — rather than routing through a parent project — is itself the marker of a healthy graduation.&lt;/p&gt;

&lt;p&gt;The release wave across projects also reflects how synchronized the lakehouse stack has become. Arrow 24.0.0 plus arrow-rs 58.2.0 plus arrow-go 18.6.0 plus Polaris 1.4.0 plus Polaris Python CLI 1.4.0 all landing within a single week is a coordination story. Engines and tools downstream of these libraries — Spark, Trino, Dremio, DataFusion, DuckDB, Snowflake — can pick up the new versions in a coherent batch rather than chasing staggered upgrades across half a dozen vendors. The format-level design work in Parquet (footers, optional path_in_schema, variant tooling location) and the V4 design work in Iceberg (metadata.json optionality, single-file commits, efficient column updates) are also starting to rhyme: both communities are picking apart assumptions baked into v1 and v2 spec design and asking what a leaner, AI-workload-aware format looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;Watch the arrow-rs 58.2.0 RC vote close out in the coming days. Polaris should publish 1.4.1 or move toward 1.5.0 planning given the monthly cadence commitment, and the AGENTS.md discussion is likely to firm into a concrete proposal. The Polaris OpenLineage RFC has the volume of feedback it needs to move toward implementation. On the Iceberg side, the formal V4 single-file commits write-up and the published AI contribution policy remain the next concrete deliverables to track. Iceberg Summit 2026 talk recordings will continue rolling out on YouTube, and the parquet-java release discussion should converge on a target version.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Started with Dremio&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-04-29&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Free&lt;/a&gt; — Build your lakehouse on Iceberg with a free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/use-cases/lake-to-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-04-29&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Build a Lakehouse with Iceberg, Parquet, Polaris &amp;amp; Arrow&lt;/a&gt; — Learn how Dremio brings the open lakehouse stack together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free Downloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html" rel="noopener noreferrer"&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-polaris-guide-reg.html" rel="noopener noreferrer"&gt;Apache Polaris: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Books by Alex Merced&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/ref=sr_1_5?crid=1304S78BQAP6U&amp;amp;dib=eyJ2IjoiMSJ9.7Z17wXFJVWtv1gDIVF5-z5NwgT7B-vj9kEQuLkAKtLh00KncwXYc4bQ6hyydwcMHXbJOlFCSO7-2JmKTC5KCV-q2XEdeq7kBBmicVzI6tlDtqPqAgE6RHJE_XZ_n-zxxAjRHE2THP0J4DEgzDmiXrF9bdkEFyaruSUW28Ryx0zYyI_NuD5vZ4HYqQv3u5hzBVjjOlxyRYSTIsRSeVIoJC2XvjrXdNFvQ9jm4Kr1xFOw.yog4MgCdYecbJT0bAcGXNJJvZbvD4F_TP0lDbPA1xGI&amp;amp;dib_tag=se&amp;amp;keywords=alex+merced&amp;amp;qid=1773236747&amp;amp;sprefix=alex+mer%2Caps%2C570&amp;amp;sr=8-5" rel="noopener noreferrer"&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Enabling-Agentic-Analytics-Apache-Iceberg-ebook/dp/B0GQXT6W3N/" rel="noopener noreferrer"&gt;Enabling Agentic Analytics with Apache Iceberg and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/" rel="noopener noreferrer"&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Book-Using-Apache-Iceberg-Python/dp/B0GNZ454FF/" rel="noopener noreferrer"&gt;The Book on Using Apache Iceberg with Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The Journey from Scattered Data to an Apache Iceberg Lakehouse with Governed Agentic Analytics</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Sat, 25 Apr 2026 20:53:35 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/the-journey-from-scattered-data-to-an-apache-iceberg-lakehouse-with-governed-agentic-analytics-1o3o</link>
      <guid>https://forem.com/alexmercedcoder/the-journey-from-scattered-data-to-an-apache-iceberg-lakehouse-with-governed-agentic-analytics-1o3o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9w3dhe9wzzqfysof72z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9w3dhe9wzzqfysof72z.png" alt="Journey from scattered data to governed agentic analytics through federation, semantic layer, and Iceberg lakehouse" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The conventional wisdom for data platform modernization goes like this: pick a target system, build ETL pipelines for every source, migrate everything, validate the data, retrain your users, and then start getting value. That process takes six to eighteen months. During that time, analysts are waiting and leadership is asking why the investment has not produced results yet.&lt;/p&gt;

&lt;p&gt;There is a better sequence. Instead of making everyone wait for a full migration, you start producing value on day one and migrate to &lt;a href="https://iceberg.apache.org/" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; at your own pace. The key is treating federation, the semantic layer, AI access, and Iceberg migration as four independent phases, each delivering value on its own, rather than a single all-or-nothing project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehben2lwpuek3ftz1vav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehben2lwpuek3ftz1vav.png" alt="Four-phase journey from connecting sources to Iceberg lakehouse showing value at every phase" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Connect Your Data Where It Lives
&lt;/h2&gt;

&lt;p&gt;Sign up for &lt;a href="https://www.dremio.com/get-started" rel="noopener noreferrer"&gt;Dremio Cloud&lt;/a&gt; and you get a lakehouse project with a pre-configured Open Catalog right away. From there, start connecting your existing data sources through Dremio's federated query engine: PostgreSQL, MySQL, MongoDB, S3, Snowflake, BigQuery, Redshift, AWS Glue, Unity Catalog, and more.&lt;/p&gt;

&lt;p&gt;No data copying. No ETL pipelines. Dremio queries your data where it already lives, using predicate pushdown to delegate filtering work to each source system.&lt;/p&gt;

&lt;p&gt;The result: by the end of day one, your team has unified SQL access across every connected source. An analyst can join a PostgreSQL customer table with an S3-based event stream in a single query, without waiting for a data engineer to build a pipeline first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Build a Semantic Layer Over Everything
&lt;/h2&gt;

&lt;p&gt;Raw source tables have cryptic column names, inconsistent types, and zero business context. Before anyone, human or AI, can get reliable answers, you need a curated layer on top.&lt;/p&gt;

&lt;p&gt;Dremio's &lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vl7mjbliccc61w8okl7q.png" rel="noopener noreferrer"&gt;AI Semantic Layer&lt;/a&gt; uses SQL views organized in three tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bronze/Raw views&lt;/strong&gt; map to raw sources. They standardize column names, cast data types, and apply basic filters. One Bronze view per source table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver/Business views&lt;/strong&gt; apply business logic. This is where you define what "active customer" means (purchased in the last 90 days, not on a trial), join data across sources, and compute metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gold/Application views&lt;/strong&gt; serve specific consumers: a dashboard, a report, or an AI agent. Each Gold view is optimized for its use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dremio's AI Agent can help draft the SQL for these views, so you are not writing every tier by hand. A sketch of the pattern follows.&lt;/p&gt;
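
&lt;p&gt;All names here are illustrative, and the view SQL is a minimal sketch rather than a recipe. The shape of the three tiers is the point:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Bronze: standardize names and types for one raw source table.
CREATE OR REPLACE VIEW bronze.customers AS
SELECT cust_id AS customer_id,
       email,
       CAST(signup_dt AS DATE) AS signup_date,
       UPPER(region_cd) AS region,
       COALESCE(trial_flag, FALSE) AS is_trial
FROM postgres_prod.public.customers;

-- Silver: business logic. "Active customer" means purchased in the
-- last 90 days and not on a trial.
CREATE OR REPLACE VIEW silver.active_customers AS
SELECT c.customer_id,
       c.region,
       MAX(o.order_ts) AS last_order_ts
FROM bronze.customers AS c
JOIN bronze.orders AS o
  ON o.customer_id = c.customer_id
WHERE o.order_ts &gt;= CURRENT_DATE - INTERVAL '90' DAY
  AND c.is_trial = FALSE
GROUP BY c.customer_id, c.region;

-- Gold: shaped for one consumer, here a dashboard.
CREATE OR REPLACE VIEW gold.active_customers_by_region AS
SELECT region, COUNT(*) AS active_customers
FROM silver.active_customers
GROUP BY region;
&lt;/code&gt;&lt;/pre&gt;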

&lt;h3&gt;
  
  
  Govern Access and Document Everything
&lt;/h3&gt;

&lt;p&gt;Grant users access to specific views using Role-Based Access Control (RBAC) at the folder, dataset, and column level. For sensitive data, add Fine-Grained Access Control (FGAC) via UDFs for row-level security and column-level masking.&lt;/p&gt;
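
&lt;p&gt;Here is a sketch of the FGAC pattern. The exact DDL can vary by Dremio version, and the function, role, and column names are made up for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Row-level security: users outside the example role only see US rows.
CREATE FUNCTION region_filter(region VARCHAR)
RETURNS BOOLEAN
RETURN SELECT is_member('global_analysts') OR region = 'US';

ALTER VIEW bronze.customers
  ADD ROW ACCESS POLICY region_filter(region);

-- Column masking: hide emails from everyone outside a PII role.
CREATE FUNCTION mask_email(email VARCHAR)
RETURNS VARCHAR
RETURN SELECT CASE WHEN is_member('pii_readers') THEN email ELSE '***' END;

ALTER VIEW bronze.customers
  MODIFY COLUMN email SET MASKING POLICY mask_email(email);
&lt;/code&gt;&lt;/pre&gt;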

&lt;p&gt;Then enrich every dataset with &lt;strong&gt;Wikis&lt;/strong&gt; (human-readable documentation explaining what each column means) and &lt;strong&gt;Tags&lt;/strong&gt; (categorical labels for discoverability). Dremio can auto-generate Wiki descriptions and suggest Tags by sampling your table data and schema. You review and refine the output instead of writing everything from scratch.&lt;/p&gt;

&lt;p&gt;This metadata is not just for humans. It is what the AI Agent reads when generating SQL. Better documentation means more accurate answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3: Turn On Agentic Analytics
&lt;/h2&gt;

&lt;p&gt;With a governed semantic layer in place, you are ready for AI. This is the important part: &lt;strong&gt;you do not need to complete the Iceberg migration first.&lt;/strong&gt; Agentic analytics works on federated data from the moment the semantic layer exists.&lt;/p&gt;

&lt;p&gt;Dremio's built-in &lt;a href="https://www.dremio.com/ai-agent/" rel="noopener noreferrer"&gt;AI Agent&lt;/a&gt; lets users type plain-English questions in the console. The agent writes SQL, executes it against your governed views, returns results, generates charts, and suggests follow-up questions. It respects every RBAC and FGAC policy in your catalog. Users can only get answers about data they are authorized to see.&lt;/p&gt;

&lt;p&gt;For teams that want to use external tools, Dremio's MCP (Model Context Protocol) server lets ChatGPT, Claude Desktop, or custom agents connect directly to your Dremio environment. External tools get the same semantic context and security controls as the built-in agent.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Interface&lt;/th&gt;
&lt;th&gt;What It Provides&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Built-in AI Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natural language queries, SQL generation, charts, follow-up suggestions inside Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP Server&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connect any MCP-compatible AI tool (ChatGPT, Claude, custom agents) with full governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI SQL Functions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;AI_GENERATE&lt;/code&gt;, &lt;code&gt;AI_CLASSIFY&lt;/code&gt;, &lt;code&gt;AI_COMPLETE&lt;/code&gt; directly in SQL for unstructured data analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
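
&lt;p&gt;The AI SQL function names come from Dremio; the call shapes below are an assumption for illustration, so check the current docs for exact signatures before relying on them:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Hypothetical columns, prompts, and argument shapes.
SELECT ticket_id,
       AI_CLASSIFY(body, ARRAY['billing', 'bug', 'feature request']) AS category,
       AI_GENERATE('Summarize this ticket in one sentence: ' || body) AS summary
FROM gold.support_tickets;
&lt;/code&gt;&lt;/pre&gt;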

&lt;p&gt;At this point your organization has unified data access, a governed semantic layer, and AI-powered analytics, and you have not migrated a single table to Iceberg yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 4: Migrate to Iceberg, One Dataset at a Time
&lt;/h2&gt;

&lt;p&gt;Federation gets you access, but a full &lt;a href="https://www.dremio.com/platform/apache-iceberg/" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; lakehouse gets you more: Autonomous Reflections that optimize query performance based on actual usage patterns, end-to-end caching, automated table maintenance (compaction, clustering, vacuuming), and interoperability with every Iceberg-compatible engine (Spark, Flink, Trino). Your data stays in your storage, in an open format, with no vendor lock-in.&lt;/p&gt;

&lt;p&gt;The migration pattern is deliberately incremental:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one dataset&lt;/strong&gt; to migrate (start with the highest-volume or most-queried table)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build an Iceberg pipeline&lt;/strong&gt; to land that data in your object storage (S3 or Azure)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update the Bronze view&lt;/strong&gt; to point to the new Iceberg table instead of the legacy federated source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver and Gold views stay unchanged.&lt;/strong&gt; They reference the Bronze view, which now reads from Iceberg instead of the old source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every consumer is unaffected.&lt;/strong&gt; Dashboards, reports, and AI agents continue to work exactly as before.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Repeat for the next dataset whenever you are ready. There is no deadline and no big-bang cutover.&lt;/p&gt;
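
&lt;p&gt;Steps 2 and 3 in SQL, as a hedged sketch with made-up catalog and column names (and CTAS standing in for whatever ingest pipeline you actually build):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Step 2 (one option): land the data as an Iceberg table.
CREATE TABLE open_catalog.bronze_ice.customers AS
SELECT * FROM postgres_prod.public.customers;

-- Step 3: repoint the Bronze view at the Iceberg copy. The view's
-- contract (column names, types) is unchanged, so Silver, Gold,
-- dashboards, and the AI Agent never notice.
CREATE OR REPLACE VIEW bronze.customers AS
SELECT cust_id AS customer_id,
       email,
       CAST(signup_dt AS DATE) AS signup_date,
       UPPER(region_cd) AS region,
       COALESCE(trial_flag, FALSE) AS is_trial
FROM open_catalog.bronze_ice.customers;
&lt;/code&gt;&lt;/pre&gt;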

&lt;h2&gt;
  
  
  Why the View Layer Makes Migration Invisible
&lt;/h2&gt;

&lt;p&gt;This is the architectural insight that makes the whole journey work. The semantic layer acts as a contract between physical data storage and every consumer above it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrv7pvm5lsfxz2ytn6e7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrv7pvm5lsfxz2ytn6e7.png" alt="View layer swap showing Bronze view pointing to PostgreSQL before migration and Apache Iceberg after, with Silver, Gold, and AI Agent layers unchanged" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you swap a Bronze view's underlying source from PostgreSQL to an Iceberg table, every Silver view, Gold view, dashboard, report, and AI agent that depends on it continues to work without changes. The view contract (column names, data types, business logic) is preserved. Only the physical source pointer changes.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No dashboard rewiring&lt;/li&gt;
&lt;li&gt;No report migration&lt;/li&gt;
&lt;li&gt;No API endpoint changes&lt;/li&gt;
&lt;li&gt;No AI Agent reconfiguration&lt;/li&gt;
&lt;li&gt;No user communication (beyond governance notifications if your policies require them)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The migration happens underneath the abstraction layer. Everyone above it is oblivious.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tradeoffs
&lt;/h2&gt;

&lt;p&gt;This phased approach is not free of costs.&lt;/p&gt;

&lt;p&gt;Federation introduces network latency. Queries that join a PostgreSQL table in one region with an S3 bucket in another will be slower than queries against co-located Iceberg tables. Reflections and caching mitigate this for repeated queries, but the first execution of a new query pattern will feel it.&lt;/p&gt;

&lt;p&gt;Iceberg migration still requires building ingest pipelines. Dremio does not eliminate that work. What it does is decouple the pipeline work from the analytics timeline. Your analysts and AI agents are productive while engineers build migration pipelines in the background.&lt;/p&gt;

&lt;p&gt;Autonomous Reflections need a 7-day observation window before they start optimizing. Day-one performance on brand-new Iceberg tables relies on baseline optimizations (C3 caching, predicate pushdown, vectorized execution). The system gets faster as it learns your query patterns.&lt;/p&gt;

&lt;p&gt;And Dremio is an analytical engine, not a transactional database. Your OLTP workloads stay in PostgreSQL, MongoDB, or whatever system runs your application; Dremio queries those systems through federation rather than replacing them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Today, Migrate Over Time
&lt;/h2&gt;

&lt;p&gt;The traditional approach forces you to choose: spend months migrating, or keep running fragmented analytics on scattered data. Dremio eliminates that choice. Connect your sources, build your semantic layer, enable AI access, and start migrating to Iceberg when you are ready. Each phase delivers value independently, and the view layer ensures that migration never disrupts the people who are already getting answers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dremio.com/get-started" rel="noopener noreferrer"&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start the journey from wherever your data lives today.&lt;/p&gt;

&lt;h3&gt;
  
  
  Free Resources to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>analytics</category>
      <category>architecture</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Apache Data Lakehouse Weekly: April 16–22, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:19:22 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-16-22-2026-2519</link>
      <guid>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-16-22-2026-2519</guid>
      <description>&lt;p&gt;Two weeks past the Iceberg Summit, the San Francisco in-person alignments are now translating into formal proposals and code on the dev lists. Iceberg's V4 design work continued consolidating, Polaris kept moving toward its 1.4.0 milestone, Parquet's Geospatial spec picked up a cleanup commit from a new contributor, and Arrow's release engineering and Java modernization discussions stayed active.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Iceberg
&lt;/h2&gt;

&lt;p&gt;The post-summit V4 design work continued as the defining thread on the Iceberg dev list this week. The &lt;a href="http://www.mail-archive.com/dev@iceberg.apache.org/msg12699.html" rel="noopener noreferrer"&gt;V4 metadata.json optionality discussion&lt;/a&gt; that Anton Okolnychyi, Yufei Gu, Shawn Chang, and Steven Wu drove through March kept narrowing in on practical design questions. The concrete direction emerging from the summit is to treat catalog-managed metadata as a first-class supported mode while preserving static-table portability through explicit opt-in semantics, rather than the current implicit assumption that the root JSON file is always present.&lt;/p&gt;

&lt;p&gt;Russell Spitzer and Amogh Jahagirdar's &lt;a href="http://www.mail-archive.com/dev@iceberg.apache.org/msg12574.html" rel="noopener noreferrer"&gt;one-file commits design&lt;/a&gt; moved toward a formal spec write-up this week. The approach replaces manifest lists with root manifests and introduces manifest delete vectors, enabling single-file commits that cut metadata write overhead dramatically for high-frequency writers. The in-person sessions at the summit cleared the last design disagreements about inline versus external manifest delete vectors, and the community is now aligning on the implementation plan.&lt;/p&gt;

&lt;p&gt;Péter Váry's &lt;a href="http://www.mail-archive.com/dev@iceberg.apache.org/msg12958.html" rel="noopener noreferrer"&gt;efficient column updates proposal&lt;/a&gt; for AI and ML workloads drew steady engagement. The design lets Iceberg write only the columns that change on each write to a wide feature table, then stitch the full row back together at read time. For teams managing petabyte-scale feature stores with embedding vectors and model scores, the I/O savings are meaningful. Anurag Mantripragada and Gábor Herman are working alongside Péter on POC benchmarks to support the formal proposal.&lt;/p&gt;

&lt;p&gt;The AI contribution policy that Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun pushed through March is moving toward published guidance. The summit provided the in-person alignment that async debate rarely produces, and a working policy covering disclosure requirements and code provenance standards for AI-generated contributions is expected on the dev list in the next couple of weeks. Polaris is navigating the same question in parallel, and the two communities are likely to converge on a shared approach given their overlapping contributor base.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Polaris
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://polaris.apache.org/downloads/" rel="noopener noreferrer"&gt;Polaris 1.4.0 release&lt;/a&gt; is in active scope finalization as the project's first release since graduating to top-level status on February 18. Credential vending for Azure and Google Cloud Storage is the headline feature, alongside catalog federation that lets one Polaris instance front multiple catalog backends across clouds. The &lt;a href="https://polaris.apache.org/community/release-guides/semi-automated-release-guide/" rel="noopener noreferrer"&gt;schedule-driven release model&lt;/a&gt; calls for a release intent email to the dev list about a week before the RC cut, so watch the list for that thread shortly.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://www.mail-archive.com/dev@ranger.apache.org/msg39491.html" rel="noopener noreferrer"&gt;Apache Ranger authorization RFC from Selvamohan Neethiraj&lt;/a&gt; remained the most active governance discussion. The plugin lets organizations running Ranger with Hive, Spark, and Trino manage Polaris security within the same policy framework, eliminating the policy duplication that arises when teams bolt separate authorization onto each engine. It is opt-in and backward compatible with Polaris's internal authorization layer, which lowers the enterprise adoption barrier considerably.&lt;/p&gt;

&lt;p&gt;On the community side, Polaris's blog continued its post-graduation cadence with an &lt;a href="https://polaris.apache.org/blog/" rel="noopener noreferrer"&gt;April 4 post on building a fully integrated, locally-running open data lakehouse in under 30 minutes&lt;/a&gt; using k3d, Apache Ozone, Polaris, and Trino. The Polaris PMC also shipped a &lt;a href="https://polaris.apache.org/blog/" rel="noopener noreferrer"&gt;March 29 post&lt;/a&gt; covering automated entity management for catalogs, principals, and roles. With incubator overhead behind it, release velocity has picked up noticeably from the 1.3.0 release on January 16.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Arrow
&lt;/h2&gt;

&lt;p&gt;Arrow's &lt;a href="https://github.com/apache/arrow-rs" rel="noopener noreferrer"&gt;release calendar&lt;/a&gt; shows arrow-rs 58.2.0 landing this month, following 58.1.0 in March, which shipped with no breaking API changes. The cadence has held at roughly one minor version per month, with 59.0.0 already scheduled for May as a major release that may include breaking changes. The Rust implementation has become one of the most actively maintained segments of the Arrow ecosystem, with the DataFusion integration drawing in engines that want Arrow without a JVM dependency.&lt;/p&gt;

&lt;p&gt;Jean-Baptiste Onofré's JDK 17 minimum proposal for Arrow Java 20.0.0 continued drawing input from Micah Kornfield and Antoine Pitrou. The practical rationale is coordination: setting JDK 17 as Arrow's Java baseline aligns with Iceberg's own upgrade timeline and effectively raises the minimum across the entire lakehouse stack in a single coordinated move. The decision is expected before the 20.0.0 release cycle formally opens.&lt;/p&gt;

&lt;p&gt;Nic Crane's thread on using LLMs for Arrow project maintenance continued generating discussion. The framing — AI as a resource for maintainers, not just contributors — is distinct from how Iceberg and Polaris are approaching their AI policies. Arrow's angle is practical: a lean maintainer group managing a growing issue backlog needs help triaging, and LLMs can do that work without introducing the code-provenance concerns that matter for contributions. Google Summer of Code 2026 student proposals that landed in early April are being sorted this week, with interest concentrated in compute kernels and Go and Swift language bindings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Parquet
&lt;/h2&gt;

&lt;p&gt;Parquet's week centered on hardening the Geospatial spec that was adopted earlier this year. Milan Stefanovic merged &lt;a href="http://www.mail-archive.com/commits@parquet.apache.org/msg04335.html" rel="noopener noreferrer"&gt;PR #560 on April 20&lt;/a&gt;, clarifying the Geospatial spec wording for coordinate reference systems. The change documents existing CRS usage practice for the default OGC:CRS84 system and removes ambiguity caught during implementation reviews. Small spec-hardening commits like this are how a new type goes from "shipped" to "production-reliable" across engines.&lt;/p&gt;

&lt;p&gt;The community blog effort continued alongside the spec work. The &lt;a href="https://parquet.apache.org/blog/2026/02/13/native-geospatial-types-in-apache-parquet/" rel="noopener noreferrer"&gt;Native Geospatial Types blog&lt;/a&gt; that Jia Yu and Dewey Dunnington published on February 13 remains the community's reference explainer, and Andrew Lamb has been coordinating with Aihua Xu on the companion Variant blog post. Spotlighting recent additions through the Parquet blog is part of a deliberate push to give the project the same kind of voice that DataFusion and Arrow have built.&lt;/p&gt;

&lt;p&gt;The ALP encoding that cleared its acceptance vote in the prior week moved into implementation discussion. Engine teams across Spark, Trino, Dremio, and DataFusion are comparing notes on how to integrate ALP into their Parquet readers, with compression gains for float-heavy ML feature stores as the immediate benefit. The File logical type proposal for unstructured data (images, PDFs, audio) also kept advancing in community discussion, extending Parquet's scope beyond pure analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Project Themes
&lt;/h2&gt;

&lt;p&gt;The summit's downstream effect is now visible across every dev list. Iceberg's V4 work, Polaris's 1.4.0 scope, Arrow's JDK 17 decision, and Parquet's Geospatial cleanup are running in parallel, and the cross-project coordination on shared questions like AI contribution policy and Java baselines has intensified. The JDK 17 alignment is the clearest case: moving Arrow Java 20.0.0, Iceberg's next major, and downstream engines to the same floor in a single window removes years of compatibility friction.&lt;/p&gt;

&lt;p&gt;The second pattern is the steady expansion of format scope to meet AI workloads. Iceberg's efficient column updates, Parquet's File logical type, the Geospatial spec hardening, and Polaris's multi-cloud federation all respond to the same pressure: the lakehouse stack is being asked to power AI pipelines, not just analytical queries. Each project is making changes that only make sense if you assume the next decade's workloads look different from the last.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;Watch for the V4 single-file commits formal spec write-up and the metadata optionality vote on the Iceberg dev list, along with a published AI contribution policy. The Polaris 1.4.0 release intent email should land in the coming days. Arrow's JDK 17 baseline decision for Java 20.0.0 is close to a vote, and arrow-rs 58.2.0 should ship before the end of the month. Iceberg Summit 2026 session recordings are also rolling out on the project's YouTube channel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Started with Dremio&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-04-22&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Free&lt;/a&gt; — Build your lakehouse on Iceberg with a free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/use-cases/lake-to-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-04-22&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Build a Lakehouse with Iceberg, Parquet, Polaris &amp;amp; Arrow&lt;/a&gt; — Learn how Dremio brings the open lakehouse stack together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free Downloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html" rel="noopener noreferrer"&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-polaris-guide-reg.html" rel="noopener noreferrer"&gt;Apache Polaris: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Books by Alex Merced&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Enabling-Agentic-Analytics-Apache-Iceberg-ebook/dp/B0GQXT6W3N/" rel="noopener noreferrer"&gt;Enabling Agentic Analytics with Apache Iceberg and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/" rel="noopener noreferrer"&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Book-Using-Apache-Iceberg-Python/dp/B0GNZ454FF/" rel="noopener noreferrer"&gt;The Book on Using Apache Iceberg with Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI Weekly: Opus 4.7, Kimi K2.6, and a $25B Amazon Deal, April 16–22, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:03:32 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/ai-weekly-opus-47-kimi-k26-and-a-25b-amazon-deal-april-16-22-2026-1k7p</link>
      <guid>https://forem.com/alexmercedcoder/ai-weekly-opus-47-kimi-k26-and-a-25b-amazon-deal-april-16-22-2026-1k7p</guid>
      <description>&lt;p&gt;Three stories defined the past week: Anthropic shipped Claude Opus 4.7, Moonshot open-sourced Kimi K2.6 with 300-agent swarms, and Amazon committed another $25 billion to Anthropic alongside a $100 billion AWS spend. Here is what you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Coding Tools: Opus 4.7 Ships With a 1M Context Window
&lt;/h2&gt;

&lt;p&gt;Anthropic released &lt;a href="https://www.cnbc.com/2026/04/16/anthropic-claude-opus-4-7-model-mythos.html" rel="noopener noreferrer"&gt;Claude Opus 4.7 on April 16&lt;/a&gt;, a new flagship model focused on agentic coding and long-horizon work. The model scores 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro, up from Opus 4.6's 80.8% on Verified. It runs with a full 1 million token context window and high-resolution image support for charts and dense documents.&lt;/p&gt;

&lt;p&gt;The model landed across major platforms the same week. &lt;a href="https://aws.amazon.com/blogs/aws/aws-weekly-roundup-claude-opus-4-7-in-amazon-bedrock-aws-interconnect-ga-and-more-april-20-2026/" rel="noopener noreferrer"&gt;Claude Opus 4.7 arrived on Amazon Bedrock&lt;/a&gt; on launch day in four regions, with up to 10,000 requests per minute per account. &lt;a href="https://github.blog/changelog/2026-04-16-claude-opus-4-7-is-generally-available/" rel="noopener noreferrer"&gt;GitHub Copilot began rolling out Opus 4.7 to Copilot Pro+&lt;/a&gt; users with a 7.5x premium request multiplier until April 30. The model is replacing both Opus 4.5 and Opus 4.6 in the Copilot model picker.&lt;/p&gt;

&lt;p&gt;Claude Code shipped Opus 4.7 the same day with new controls. The update added an "xhigh" effort level between high and max, a &lt;code&gt;/ultrareview&lt;/code&gt; multi-agent code review command, and Auto mode for Max subscribers. &lt;a href="https://releasebot.io/updates/anthropic" rel="noopener noreferrer"&gt;Anthropic also launched Claude Design&lt;/a&gt;, a new Anthropic Labs product for building prototypes, slides, and one-pagers in collaboration with the model. Pricing stays at $5 per million input tokens and $25 per million output tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Models: Kimi K2.6 Opens the Door to 12-Hour Agent Runs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://siliconangle.com/2026/04/20/moonshot-ai-releases-kimi-k2-6-model-1t-parameters-attention-optimizations/" rel="noopener noreferrer"&gt;Moonshot AI released Kimi K2.6 on April 20&lt;/a&gt; as an open-source agentic model built for long-horizon coding. The model has 1 trillion total parameters in a Mixture-of-Experts architecture with 32 billion active per forward pass. It supports text, image, and video input, a 256K context window, and thinking and non-thinking modes behind an OpenAI-compatible API.&lt;/p&gt;

&lt;p&gt;The headline claim is stamina. &lt;a href="https://www.marktechpost.com/2026/04/20/moonshot-ai-releases-kimi-k2-6-with-long-horizon-coding-agent-swarm-scaling-to-300-sub-agents-and-4000-coordinated-steps/" rel="noopener noreferrer"&gt;Kimi K2.6 targets 12-hour autonomous coding sessions&lt;/a&gt; and agent swarms that scale to 300 sub-agents across 4,000 coordinated steps. On benchmarks, Moonshot claims SWE-bench Pro at 58.6, SWE-bench Multilingual at 76.7, and BrowseComp at 83.2. The model matches or beats GPT-5.4 and Claude Opus 4.6 on several open-source leaderboards.&lt;/p&gt;

&lt;p&gt;K2.6 is available immediately on Kimi.com, the developer API, Kimi Code CLI, Ollama, and Hugging Face. Day-one integrations cover Kilo Code, VS Code and JetBrains extensions, OpenClaw, Tencent CodeBuddy, and Genspark. The MIT-derived license allows commercial use and redistribution, a direct challenge to closed-source frontier labs.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Infrastructure: AWS Interconnect Reaches GA and Amazon Adds $25B to Anthropic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/aws-weekly-roundup-claude-opus-4-7-in-amazon-bedrock-aws-interconnect-ga-and-more-april-20-2026/" rel="noopener noreferrer"&gt;AWS Interconnect reached general availability on April 20&lt;/a&gt; with two new capabilities. AWS Interconnect Multicloud provides Layer 3 private connections between AWS VPCs and other clouds, starting with Google Cloud, with Azure and OCI coming later in 2026. Traffic flows over the AWS global backbone with built-in MACsec encryption, never crossing the public internet. AWS also published the Interconnect specification on GitHub under Apache 2.0, so any cloud provider can become a partner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cnbc.com/2026/04/20/amazon-invest-up-to-25-billion-in-anthropic-part-of-ai-infrastructure.html" rel="noopener noreferrer"&gt;Amazon announced a $25 billion investment in Anthropic on April 20&lt;/a&gt;, on top of the $8 billion already committed. The deal includes $5 billion immediately, with up to $20 billion tied to commercial milestones. &lt;a href="https://finance.yahoo.com/sectors/technology/articles/amazon-investing-25-billion-more-113801183.html" rel="noopener noreferrer"&gt;Anthropic committed to spending more than $100 billion on AWS over 10 years&lt;/a&gt;, securing up to 5 gigawatts of Trainium chip capacity. One gigawatt is scheduled to come online this year using Trainium2 and Trainium3.&lt;/p&gt;

&lt;p&gt;The structure mirrors the &lt;a href="https://www.geekwire.com/2026/amazon-doubles-down-on-anthropic-with-25b-investment-mirroring-its-openai-cloud-deal/" rel="noopener noreferrer"&gt;$50 billion Amazon-OpenAI deal from February&lt;/a&gt;. Anthropic is now valued at $380 billion, with annualized revenue climbing from $9 billion at the end of 2025 to more than $30 billion. Enterprise customers spending at least $1 million annually have doubled since February, crossing 1,000 accounts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standards and Protocols: Interconnect Spec Goes Open
&lt;/h2&gt;

&lt;p&gt;The AWS Interconnect specification going to GitHub under Apache 2.0 is the standards story of the week. The move gives any cloud provider a published path to join the private connectivity mesh without negotiating bilateral deals. For AI workloads moving data between model training clusters in one cloud and inference infrastructure in another, the alternative has been either the public internet or expensive dedicated circuits.&lt;/p&gt;

&lt;p&gt;The broader pattern is that hyperscale cloud providers are open-sourcing infrastructure specs to lock in network effects. Trainium chip access is exclusive, but the connectivity layer is open. This is the same playbook the Linux Foundation's Agentic AI Foundation uses for MCP and A2A: open standards for the plumbing, proprietary value on top.&lt;/p&gt;

&lt;p&gt;MCP and A2A also saw continued adoption this week. Claude Opus 4.7 ships with both protocols built in, and Kimi K2.6 supports tool calls and OpenAI-compatible APIs that slot into MCP-aware agent stacks. The layered architecture is holding: MCP handles agent-to-tool connections, A2A handles agent-to-agent coordination, and the new open models and frontier releases are all landing with both built in by default.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources to Go Further
&lt;/h2&gt;

&lt;p&gt;The AI landscape changes fast. Here are tools and resources to help you keep pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Dremio Free&lt;/strong&gt; — Experience agentic analytics and an Apache Iceberg-powered lakehouse. &lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-22-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Start your free trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Agentic AI with Data&lt;/strong&gt; — Dremio's agentic analytics features let your AI agents query and act on live data. &lt;a href="https://www.dremio.com/use-cases/agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-22-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Explore Dremio Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the Community&lt;/strong&gt; — Connect with data engineers and AI practitioners building on open standards. &lt;a href="https://developer.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-22-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Join the Dremio Developer Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: The 2026 Guide to AI-Assisted Development&lt;/strong&gt; — Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. &lt;a href="https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: Using AI Agents for Data Engineering and Data Analysis&lt;/strong&gt; — A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. &lt;a href="https://www.amazon.com/Using-Agents-Data-Engineering-Analysis-ebook/dp/B0GR6PYJT9/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>llm</category>
      <category>news</category>
    </item>
  </channel>
</rss>
