<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alex Merced</title>
    <description>The latest articles on Forem by Alex Merced (@alexmercedcoder).</description>
    <link>https://forem.com/alexmercedcoder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F288069%2Fb20116a9-b178-4ab1-bcb0-8aa28ed732b0.png</url>
      <title>Forem: Alex Merced</title>
      <link>https://forem.com/alexmercedcoder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alexmercedcoder"/>
    <language>en</language>
    <item>
      <title>AI Weekly: Free Web Tools, MCP Production Wins, Trusted-Compute Models (April 30–May 6, 2026)</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 06 May 2026 16:22:22 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/ai-weekly-free-web-tools-mcp-production-wins-trusted-compute-models-april-30-may-6-2026-325h</link>
      <guid>https://forem.com/alexmercedcoder/ai-weekly-free-web-tools-mcp-production-wins-trusted-compute-models-april-30-may-6-2026-325h</guid>
      <description>&lt;p&gt;This week pushed three concrete lines forward at once. Vercel open-sourced an AI security harness, TinyFish made paid web search and fetch APIs free for AI agents, and Jama Software shipped the first Model Context Protocol server for engineering management. Underneath that, Z.ai's GLM-5.1 went live inside a trusted execution environment, and Anthropic previewed a proactive assistant for its work product. Here is what shipped and why each piece matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Coding Tools: Vercel Ships deepsec, TinyFish Drops Its Search Paywall
&lt;/h2&gt;

&lt;p&gt;Vercel &lt;a href="https://www.cryptointegrat.com/p/ai-news-may-5-2026" rel="noopener noreferrer"&gt;open-sourced deepsec&lt;/a&gt;, an AI-powered security harness that uses Claude and Codex to scan large codebases for vulnerabilities. The tool runs CLI-first, scales to over 1,000 concurrent sandboxes, and works with any pluggable coding agent through Vercel's AI Gateway or your own subscription. The pitch is straightforward: most coding agents handle one repo at a time, but security audits need to fan out across dozens of services in parallel. deepsec treats the agent as a worker pool and scales horizontally.&lt;/p&gt;

&lt;p&gt;TinyFish made its &lt;a href="https://www.cryptointegrat.com/p/ai-news-may-5-2026" rel="noopener noreferrer"&gt;Web Search and Fetch APIs free for all developers and AI agents&lt;/a&gt; on May 5. The free tier supports Claude Code, Cursor, Codex, and other major agent frameworks, with no credit card required and what TinyFish calls generous rate limits. Web access has been a paid bottleneck for agent workflows since 2024, and a free tier from a vendor that already serves the agent ecosystem will put pricing pressure on the rest of the search-API market.&lt;/p&gt;

&lt;p&gt;Anthropic also previewed a proactive assistant called &lt;a href="https://www.cryptointegrat.com/p/ai-news-may-5-2026" rel="noopener noreferrer"&gt;Orbit&lt;/a&gt; for Claude Cowork. Orbit will pull insights from Gmail, Slack, GitHub, Calendar, Drive, and Figma, then surface them on its own without the user asking. The product is reportedly a Max-tier feature, and Orbit Apps were also referenced in the leaks. The combination of always-on context and proactive surface area is the next step beyond chat-only agent products.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Processing: GLM-5.1 Runs FP8 Inside a Trusted Execution Environment
&lt;/h2&gt;

&lt;p&gt;Z.ai's GLM-5.1 &lt;a href="https://www.cryptointegrat.com/p/ai-news-may-5-2026" rel="noopener noreferrer"&gt;went live on the 0G Private Computer&lt;/a&gt; running FP8 inside a Trusted Execution Environment on May 5. The model is a 754-billion-parameter Mixture-of-Experts release with 40 billion active parameters per token, shipped under the MIT license on April 7. Running it inside a TEE means the weights and prompts stay encrypted from the host operating system and cloud provider, which closes the residual trust gap that has slowed enterprise self-hosting of large open-weight models.&lt;/p&gt;

&lt;p&gt;The 0G deployment matters for a specific reason. GLM-5.1 was &lt;a href="https://winbuzzer.com/2026/04/09/z-ai-releases-glm-5-1-754b-model-tops-swe-bench-pro-xcxwbn/" rel="noopener noreferrer"&gt;trained entirely on Huawei Ascend 910B chips&lt;/a&gt; with no Nvidia or AMD GPUs, scores 58.4 percent on SWE-Bench Pro, and sustains autonomous task execution for over 8 hours. Putting that capability behind a TEE on a third-party serving platform is the first time a frontier-tier open-weight model has been delivered with hardware-backed confidentiality outside a hyperscaler.&lt;/p&gt;

&lt;p&gt;Air Street's State of AI report for May noted that the UK AI Security Institute now estimates &lt;a href="https://press.airstreet.com/p/state-of-ai-may-2026" rel="noopener noreferrer"&gt;frontier cyber-offence capability is doubling every four months&lt;/a&gt;, with both Anthropic's Claude Mythos Preview and OpenAI's GPT-5.5 clearing a 32-step end-to-end cyber-attack range in a single month. The compute side of the picture stayed dense as well. Anthropic raised an &lt;a href="https://press.airstreet.com/p/state-of-ai-may-2026" rel="noopener noreferrer"&gt;additional $40 billion from Google plus $5 billion from Amazon&lt;/a&gt;, packaged with $100 billion of AWS spend and chip deals with Google and Broadcom reportedly worth hundreds of billions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standards &amp;amp; Protocols: First MCP Server for Engineering Management
&lt;/h2&gt;

&lt;p&gt;Jama Software &lt;a href="https://itbusinessnet.com/2026/05/jama-software-launches-model-context-protocol-mcp-server/" rel="noopener noreferrer"&gt;launched the first MCP Server for engineering management software&lt;/a&gt; on May 4. Jama Connect 9.35 lets engineers work in Claude, Codex, Cursor, GitHub Copilot, Visual Studio, or any MCP-compatible tool while keeping the existing Traceability Information Model, permissions, lifecycle workflows, and audit trails intact. The pitch from CTO Jim Davidson is that AI engineering agents need Spec Driven Development to deliver compliant velocity gains, and MCP is now the standard pipe for that integration.&lt;/p&gt;

&lt;p&gt;Unity AI also entered open beta this week with built-in MCP Server support, alongside the AI Gateway for third-party AI integrations. Game studios get a built-in agent tuned for Unity workflows plus the option to plug in any MCP-compatible client. The pattern across both releases is clear: vertical product vendors are no longer asking whether to support MCP. They are shipping it as a default integration surface alongside their native UIs.&lt;/p&gt;

&lt;p&gt;The Model Context Protocol's &lt;a href="https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/" rel="noopener noreferrer"&gt;2026 roadmap&lt;/a&gt; sets the priorities behind these adoptions. Lead Maintainer David Soria Parra named four focus areas: stateless transport for horizontal scaling, server discovery through .well-known URLs, task lifecycle for retry semantics and result expiry, and enterprise-readiness work covering audit trails, SSO, and gateway behavior. The June 2026 specification cycle is targeted for the stateless transport changes. Agentic AI Foundation governance, two specification releases through 2025, and a 500-plus public server count have moved MCP from an experiment into the production layer it was designed to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources to Go Further
&lt;/h2&gt;

&lt;p&gt;The AI landscape changes fast. Here are tools and resources to help you keep pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Dremio Free&lt;/strong&gt;: Experience agentic analytics and an Apache Iceberg-powered lakehouse. &lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=05-06-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Start your free trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Agentic AI with Data&lt;/strong&gt;: Dremio's agentic analytics features let your AI agents query and act on live data. &lt;a href="https://www.dremio.com/use-cases/agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=05-06-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Explore Dremio Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the Community&lt;/strong&gt;: Connect with data engineers and AI practitioners building on open standards. &lt;a href="https://developer.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=05-06-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Join the Dremio Developer Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: The 2026 Guide to AI-Assisted Development&lt;/strong&gt;: Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. &lt;a href="https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: Using AI Agents for Data Engineering and Data Analysis&lt;/strong&gt;: A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. &lt;a href="https://www.amazon.com/Using-Agents-Data-Engineering-Analysis-ebook/dp/B0GR6PYJT9/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>news</category>
      <category>security</category>
    </item>
    <item>
      <title>Apache Data Lakehouse Weekly: April 30–May 6, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 06 May 2026 15:20:55 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-30-may-6-2026-1ibl</link>
      <guid>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-30-may-6-2026-1ibl</guid>
      <description>&lt;p&gt;The release wave that defined late April carried straight into early May, with Arrow shipping two more votes in seven days, Polaris settling into post-1.4.0 stabilization mode, and the Iceberg dev list staying focused on V4 design follow-ups from the summit. The clearest story of the week is Arrow's release engineering: the arrow-rs 58.2.0 vote that opened on April 28 closed cleanly on May 2, and the Arrow .NET 23.0.0 vote opened the same day and passed by May 5. Two votes, two passing results, four days apart — a cadence that would have been unimaginable a year ago when the project was still navigating its full-stack release cycle. Iceberg's design lists stayed in absorption mode as contributors continued to translate post-summit alignments into formal specification work, and Parquet's dev list remained dense with format-level threads that have been simmering since the ALP encoding vote closed in April.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Iceberg
&lt;/h2&gt;

&lt;p&gt;Iceberg's dev list ran quieter this week than the Arrow and Polaris lists, but the design conversations that have anchored 2026 continued to advance in the background. The V4 metadata.json optionality direction — the proposal to treat catalog-managed metadata as a first-class supported mode while preserving static-table portability through explicit opt-in semantics — is still the project's defining specification conversation, with Anton Okolnychyi, Yufei Gu, Shawn Chang, Steven Wu, and Russell Spitzer continuing to push edge cases on portability guarantees and Spark driver behavior. The single-file commits proposal that Russell Spitzer and Amogh Jahagirdar have been advancing remains on track for a formal write-up that should land on the dev list in the coming weeks.&lt;/p&gt;

&lt;p&gt;Péter Váry's &lt;a href="https://www.mail-archive.com/dev@iceberg.apache.org/msg12972.html" rel="noopener noreferrer"&gt;efficient column updates proposal&lt;/a&gt; for wide tables continues to attract collaboration. Anurag Mantripragada and Gábor Kaszab are working alongside Péter on POC benchmarks for both the Iceberg-native and Parquet-native approaches, with the latency and metadata footprint improvements making this one of the more practically grounded V4 proposals on the list. The design — write only the columns that change on each commit, then stitch the result at read time — is squarely aimed at petabyte-scale feature stores with thousands of embedding and model-score columns, and that workload pressure is precisely what's pulling the V4 spec design forward.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.mail-archive.com/dev@iceberg.apache.org/msg13144.html" rel="noopener noreferrer"&gt;labels in LoadTableResponse proposal&lt;/a&gt; that Andrei Tserakhau drove through March continues to anchor the catalog-managed metadata conversation. The design lets each catalog (Polaris, Unity Catalog, Lakekeeper) surface internal metadata such as ownership, cost attribution, and semantic context through a standard optional field on table loads, without forcing requirements onto catalogs that don't track that data. The cross-implementation POCs that Andrei published — Polaris (PR #4048), Unity Catalog (PR #1417), Lakekeeper (PR #1676), and the PyIceberg client (PR #3191) — remain useful reference points as the spec change progresses through review. Iceberg Summit 2026 session recordings continued rolling out on the project's YouTube channel, and the published AI contribution policy that Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun pushed through March remains the next concrete deliverable to track.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Polaris
&lt;/h2&gt;

&lt;p&gt;Polaris transitioned from release-week intensity into stabilization mode this week. The 1.4.0 release that Adnan Hemani &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04499.html" rel="noopener noreferrer"&gt;announced on April 23&lt;/a&gt;, followed by the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04551.html" rel="noopener noreferrer"&gt;Python CLI 1.4.0 release&lt;/a&gt; on April 28, gave the project its first major release pair as a graduated top-level project. The post-launch issues that Alexandre Dutra surfaced — the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04512.html" rel="noopener noreferrer"&gt;Helm chart repo inconsistency&lt;/a&gt;, the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04513.html" rel="noopener noreferrer"&gt;release workflow failure in step 4&lt;/a&gt;, the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04514.html" rel="noopener noreferrer"&gt;Artifact Hub request&lt;/a&gt;, and the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04544.html" rel="noopener noreferrer"&gt;KMS-related upgrade bug&lt;/a&gt; — are exactly the kind of friction a project surfaces in its first independent release cycle. Yufei Gu has continued to triage most of the upgrade-path issues, and the Helm packaging questions are converging toward resolution.&lt;/p&gt;

&lt;p&gt;Design discussions stayed active alongside the post-release stabilization. EJ Wang's &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04485.html" rel="noopener noreferrer"&gt;DISCUSS thread on AGENTS.md for Polaris&lt;/a&gt; — the proposal to add agent-readable repository metadata so coding agents can pick up the project conventions consistently — continued building toward a concrete implementation proposal, which the previous newsletter flagged as the next deliverable to watch. ITing Lee's &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04430.html" rel="noopener noreferrer"&gt;proposal to add OpenLineage to Polaris&lt;/a&gt; has accumulated the volume of review feedback from Adnan Hemani, Jean-Baptiste Onofré, Yufei Gu, and Michael Collado that it needs to move toward an implementation RFC. Yufei's &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04486.html" rel="noopener noreferrer"&gt;thread on narrowing the scope of SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION&lt;/a&gt; drew further engagement from Dmitri Bourlatchkov and Dennis Huo, and Alexandre Dutra's URL path decoding and PolarisPrivilege grant validation threads continued to be active points of discussion.&lt;/p&gt;

&lt;p&gt;Jean-Baptiste Onofré's confirmation that Polaris is back on a &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04476.html" rel="noopener noreferrer"&gt;monthly release cadence&lt;/a&gt; means a 1.4.1 patch release or 1.5.0 planning email is the natural next step. Given the volume of upgrade-path issues that surfaced after 1.4.0, a quick 1.4.1 to address the KMS bug and Helm packaging fixes seems the more likely path before the project moves on to 1.5.0 feature scoping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Arrow
&lt;/h2&gt;

&lt;p&gt;Arrow's release engine kept running. Andrew Lamb's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34631.html" rel="noopener noreferrer"&gt;arrow-rs 58.2.0 RC1 vote&lt;/a&gt; that opened on April 28 closed on May 2, with &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34637.html" rel="noopener noreferrer"&gt;the release approved&lt;/a&gt; by 6 +1 votes (4 binding) and immediately published to crates.io. Bryce Mecum, Ed Seidl, Jeffrey Vo, Raúl Cumplido, and L. C. Hsieh carried the verification work, with L. C. Hsieh casting the final binding +1 from an Intel Mac on April 29. The 58.2.0 release continues the monthly arrow-rs cadence that has held since 58.1.0 shipped in March, and 59.0.0 remains scheduled as a major release that may include breaking changes.&lt;/p&gt;

&lt;p&gt;Curt Hagenlocher opened the &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34638.html" rel="noopener noreferrer"&gt;Arrow .NET 23.0.0 RC0 vote&lt;/a&gt; on May 2 — the same day arrow-rs 58.2.0 was approved — and &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34653.html" rel="noopener noreferrer"&gt;the vote passed&lt;/a&gt; on May 5 with 5 binding +1s from Bryce Mecum, Adam Reeve, Raúl Cumplido, Sutou Kouhei, and Curt himself. Sutou Kouhei verified on Debian sid with .NET SDK 8.0.413, and Curt ported verify_rc.sh to PowerShell as part of the validation. Curt is now working through the post-vote release tasks, including a 401 issue with the GitHub release download script that he flagged for follow-up. The .NET 23.0.0 release continues the steady cadence the .NET implementation has settled into since the 22.0.0 cycle.&lt;/p&gt;

&lt;p&gt;Beyond releases, the design conversations stayed lively. The &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34576.html" rel="noopener noreferrer"&gt;pyarrow-stubs donation vote&lt;/a&gt; that Rok Mihevc opened on April 14 continued building toward a final tally. Emil Sadek's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34619.html" rel="noopener noreferrer"&gt;ADBC Logo Proposal&lt;/a&gt; drew further engagement from Nic Crane, Julian Hyde, and Rusty Conover, and Benjamin Philip's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34628.html" rel="noopener noreferrer"&gt;Arrow Erlang grant documents thread&lt;/a&gt; continued the project's expansion into more language ecosystems. Andrew Lamb's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34610.html" rel="noopener noreferrer"&gt;arrow-rs security policy discussion&lt;/a&gt; and Mandukhai Alimaa's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34604.html" rel="noopener noreferrer"&gt;canonical BigDecimal extension type proposal&lt;/a&gt; both continued to draw input as the project tightens its production posture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Parquet
&lt;/h2&gt;

&lt;p&gt;Parquet's lists stayed dense. Manu Zhang's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27212.html" rel="noopener noreferrer"&gt;DISCUSS thread on a new parquet-java release&lt;/a&gt; continued attracting input from Steve Loughran, Aaron Niskode-Dossett, Fokko Driesprong, Julien Le Dem, Gang Wu, and Rahil C, with the conversation now narrowing on a target version and ship date for what would be the next parquet-java release after 1.17.0. Ismaël Mejía's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27247.html" rel="noopener noreferrer"&gt;thread soliciting code reviews for Java performance optimization work&lt;/a&gt; continued with Steve Loughran picking up the review load.&lt;/p&gt;

&lt;p&gt;The format-level proposals continued evolving. Will Edwards's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27142.html" rel="noopener noreferrer"&gt;DISCUSS thread on an alternative to the FlatBuffer footer with a lightweight byte-offset index&lt;/a&gt; kept drawing design feedback from Andrew Lamb, Ed Seidl, Jan Finis, Alkis Evlogimenos, Raphael Taylor-Davies, and Andrew Bell. Ed Seidl's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27197.html" rel="noopener noreferrer"&gt;proposal to make path_in_schema optional&lt;/a&gt; continued attracting commentary from Gang Wu, Steve Loughran, and Micah Kornfield. Andrew Lamb's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27192.html" rel="noopener noreferrer"&gt;thread on where VariantJsonParser should live&lt;/a&gt; — the cross-project boundary question between Parquet and Iceberg's variant tooling — kept moving with input from Steve Loughran and Gang Wu.&lt;/p&gt;

&lt;p&gt;The Geospatial work continued threading toward closure. Milan Stefanovic's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27136.html" rel="noopener noreferrer"&gt;Geospatial CRS string format clarification&lt;/a&gt; drew further input from Dewey Dunnington and Micah Kornfield, and Jan Finis's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27214.html" rel="noopener noreferrer"&gt;question on RLE bitpack page-edge validity&lt;/a&gt; continued the kind of spec-edge clarification work that matters for cross-implementation interoperability. The Parquet sync that Julien Le Dem ran on April 22 set the agenda for the design work that's now playing out across the dev list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Project Themes
&lt;/h2&gt;

&lt;p&gt;This week's clearest pattern is the rhythm of post-graduation Polaris finding its operational footing alongside Arrow's well-established release cadence. Two Arrow votes in four days plus the Polaris 1.4.x stabilization wave plus Iceberg's quiet absorption of summit alignments plus Parquet's dense format-level work make the lakehouse stack feel less like four separate projects and more like one coordinated platform. The arrow-rs 58.2.0 release in particular landed inside a single five-day vote window — proposed April 28, approved May 2, published to crates.io the same day — which is a useful benchmark for how tight Apache release engineering can run when the verification community is engaged.&lt;/p&gt;

&lt;p&gt;The second pattern is the continued translation of post-summit alignments into spec work. The V4 metadata.json optionality direction, the labels-in-LoadTableResponse proposal, the AGENTS.md thread for Polaris, the OpenLineage RFC, the Parquet footer redesign work, and the Geospatial spec clarifications are all converging on the same broader question: what does the lakehouse stack look like when the workloads it powers shift from analytical SQL to AI agents and ML feature engineering? Each design conversation makes more sense if you assume the next decade's workload mix looks meaningfully different from the last decade's.&lt;/p&gt;

&lt;p&gt;The third pattern is enterprise-readiness work surfacing in real time. Polaris's KMS upgrade bug, Helm packaging issues, OAuth2 Manager v2 design, and credential-subscoping scope discussion are all the work of a project being deployed at scale rather than a project being built. The visible triage on the dev list rather than behind closed doors is a healthy signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;Watch for a Polaris 1.4.1 patch release vote to address the KMS bug and Helm packaging issues that surfaced after 1.4.0, with 1.5.0 planning to follow. The AGENTS.md discussion should firm into a concrete implementation proposal, and the Polaris OpenLineage RFC has the volume of feedback it needs to move toward an implementation. On the Iceberg side, the formal V4 single-file commits write-up, the V4 metadata.json optionality direction, and the published AI contribution policy remain the next concrete deliverables to track. The labels-in-LoadTableResponse spec PR (apache/iceberg#15750) should converge toward merge as the cross-catalog POCs validate the design.&lt;/p&gt;

&lt;p&gt;On the Arrow side, the pyarrow-stubs donation vote should close in the coming days, and arrow-go and arrow-cpp release planning will shape what ships in May and June. For Parquet, Manu Zhang's parquet-java release thread should converge on a target version, the path_in_schema optionality proposal looks ready for a formal vote, and the FlatBuffer-footer alternative is on track for a more formal design document. Iceberg Summit 2026 session recordings will continue rolling out on YouTube — the V4 design talks and production case studies from Apple, Bloomberg, and Pinterest are particularly worth catching as they land.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Started with Dremio&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-05-06&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Free&lt;/a&gt; — Build your lakehouse on Iceberg with a free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/use-cases/lake-to-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-05-06&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Build a Lakehouse with Iceberg, Parquet, Polaris &amp;amp; Arrow&lt;/a&gt; — Learn how Dremio brings the open lakehouse stack together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free Downloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html" rel="noopener noreferrer"&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-polaris-guide-reg.html" rel="noopener noreferrer"&gt;Apache Polaris: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Books by Alex Merced&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Enabling-Agentic-Analytics-Apache-Iceberg-ebook/dp/B0GQXT6W3N/" rel="noopener noreferrer"&gt;Enabling Agentic Analytics with Apache Iceberg and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/" rel="noopener noreferrer"&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Book-Using-Apache-Iceberg-Python/dp/B0GNZ454FF/" rel="noopener noreferrer"&gt;The Book on Using Apache Iceberg with Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Performance and Apache Iceberg's Metadata</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 06 May 2026 15:00:09 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/performance-and-apache-icebergs-metadata-2eka</link>
      <guid>https://forem.com/alexmercedcoder/performance-and-apache-icebergs-metadata-2eka</guid>
      <description>&lt;p&gt;This is Part 3 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; covered the metadata structures of all five table formats. This article focuses on exactly how query engines use Iceberg's metadata to avoid reading data they don't need.&lt;/p&gt;

&lt;p&gt;The single biggest performance advantage of Iceberg over raw data lakes is not a clever algorithm or a faster codec. It is metadata-driven data skipping. By the time a query engine begins scanning actual Parquet files, Iceberg's metadata has already eliminated 90-99% of the files from consideration. Understanding this process explains why Iceberg tables with billions of rows can return query results in seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Scan Planning Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9iqxdtzyjelnm7xo4ji4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9iqxdtzyjelnm7xo4ji4.png" alt="Iceberg scan planning cascade showing how metadata progressively eliminates files at each stage" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a query engine like &lt;a href="https://www.dremio.com/blog/apache-iceberg-metadata-for-performance/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt;, Spark, or Trino receives a query against an Iceberg table, it executes a four-stage planning pipeline before reading any data:&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Snapshot Resolution
&lt;/h3&gt;

&lt;p&gt;The engine contacts the catalog to get the current metadata file location. It reads &lt;code&gt;metadata.json&lt;/code&gt; and identifies the current snapshot. This tells the engine which manifest list represents the table's current state.&lt;/p&gt;

&lt;p&gt;If the query includes a time travel clause (&lt;code&gt;AS OF TIMESTAMP '2024-03-01'&lt;/code&gt;), the engine scans the snapshot list in &lt;code&gt;metadata.json&lt;/code&gt; to find the snapshot that was current at that timestamp. This is a metadata-only operation; no data files are touched.&lt;/p&gt;
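
&lt;p&gt;As a minimal sketch of what this looks like from SQL (Spark syntax shown; time-travel clauses vary by engine, and the snapshot id below is an illustrative value):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Both queries are resolved from the snapshot list in metadata.json;
-- no data files are read to decide which snapshot applies.
SELECT * FROM orders TIMESTAMP AS OF '2024-03-01 00:00:00';

-- Or pin the scan to an exact snapshot id (illustrative value):
SELECT * FROM orders VERSION AS OF 8510920345941608279;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;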

&lt;h3&gt;
  
  
  Stage 2: Manifest List Pruning
&lt;/h3&gt;

&lt;p&gt;The manifest list contains one entry per manifest file. Each entry includes partition-level summary statistics: the minimum and maximum values of the partition columns across all data files tracked by that manifest.&lt;/p&gt;

&lt;p&gt;The engine evaluates query predicates against these summaries. If a query filters on &lt;code&gt;order_date = '2024-03-15'&lt;/code&gt; and a manifest's partition summary shows its date range is &lt;code&gt;2024-01 to 2024-02&lt;/code&gt;, that entire manifest is skipped. This single check can eliminate hundreds of manifest files and the thousands of data files they reference.&lt;/p&gt;
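
&lt;p&gt;You can inspect the same partition-level summaries the planner uses by querying Iceberg's &lt;code&gt;manifests&lt;/code&gt; metadata table (metadata tables are covered in Part 11). A minimal sketch, assuming Spark SQL and a table at &lt;code&gt;db.orders&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per manifest; partition_summaries carries the lower and
-- upper partition bounds the engine checks before opening the manifest.
SELECT path,
       added_data_files_count,
       partition_summaries
FROM   db.orders.manifests;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;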

&lt;h3&gt;
  
  
  Stage 3: Manifest File Pruning (File Skipping)
&lt;/h3&gt;

&lt;p&gt;For each surviving manifest, the engine reads the individual file entries. Each entry contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File path and size&lt;/li&gt;
&lt;li&gt;Row count&lt;/li&gt;
&lt;li&gt;Partition values for this specific file&lt;/li&gt;
&lt;li&gt;Column-level min/max values for each column&lt;/li&gt;
&lt;li&gt;Null counts and NaN counts per column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engine evaluates query predicates against these per-file statistics. A query filtering on &lt;code&gt;amount &amp;gt; 500&lt;/code&gt; can skip every file whose &lt;code&gt;amount&lt;/code&gt; column has a maximum value below 500. A query filtering on &lt;code&gt;status = 'shipped'&lt;/code&gt; can skip files where the min and max of the &lt;code&gt;status&lt;/code&gt; column are both &lt;code&gt;'pending'&lt;/code&gt;: a file whose min and max are equal contains only that one value, so it cannot hold any &lt;code&gt;'shipped'&lt;/code&gt; rows. More generally, an equality filter skips any file whose min/max range excludes the target value, though how well string-range pruning works depends on how the data is sorted.&lt;/p&gt;
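
&lt;p&gt;The same per-file statistics are exposed through the &lt;code&gt;files&lt;/code&gt; metadata table, which is a convenient way to see exactly what the planner sees. A minimal sketch (Spark SQL; &lt;code&gt;lower_bounds&lt;/code&gt; and &lt;code&gt;upper_bounds&lt;/code&gt; come back as maps keyed by column id):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per data file, with the stats used for file skipping.
SELECT file_path,
       record_count,
       null_value_counts,
       lower_bounds,
       upper_bounds
FROM   db.orders.files;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;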

&lt;h3&gt;
  
  
  Stage 4: Parquet Internal Pruning
&lt;/h3&gt;

&lt;p&gt;After Iceberg's metadata has identified the relevant files, the engine reads each Parquet file's footer. Parquet stores its own row-group-level min/max statistics. The engine can skip individual row groups within a file if their statistics exclude the query's filter values.&lt;/p&gt;

&lt;p&gt;If bloom filters are configured (available in Iceberg v2+), the engine can also check probabilistic membership tests for equality filters, skipping row groups where the bloom filter says the value definitely does not exist.&lt;/p&gt;
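
&lt;p&gt;Bloom filters are enabled per column through Iceberg's Parquet write properties. A minimal sketch (Spark SQL; &lt;code&gt;customer_id&lt;/code&gt; is an illustrative column, and the property only affects files written after it is set):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Existing files keep their old footers until they are rewritten.
ALTER TABLE orders SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.customer_id' = 'true'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;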

&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdzn9ukyk76vgprnqsub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdzn9ukyk76vgprnqsub.png" alt="Three layers of data skipping showing partition pruning, file pruning, and the final result" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider a table &lt;code&gt;orders&lt;/code&gt; partitioned by month with 12 months of data, 20 files per month (240 total files):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-03-15'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Manifest list pruning&lt;/strong&gt;: The engine checks partition summaries. 11 of 12 monthly manifests have date ranges that do not include March 2024. They are skipped. Only the March manifest is read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File pruning&lt;/strong&gt;: The March manifest contains 20 file entries. The engine checks each file's &lt;code&gt;amount&lt;/code&gt; column statistics. 15 files have &lt;code&gt;max(amount) &amp;lt; 500&lt;/code&gt;, so they cannot contain any rows matching &lt;code&gt;amount &amp;gt; 500&lt;/code&gt;. They are skipped. 5 files remain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 5 out of 240 files are scanned. The engine eliminated 98% of I/O through metadata alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes Statistics Effective
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2hlq7im0c76h2vc5vgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2hlq7im0c76h2vc5vgq.png" alt="Per-file statistics tracked in Iceberg manifest entries" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The effectiveness of file skipping depends entirely on how tight the min/max ranges are per file. Two factors determine this:&lt;/p&gt;

&lt;h3&gt;
  
  
  Sort Order
&lt;/h3&gt;

&lt;p&gt;If the &lt;code&gt;amount&lt;/code&gt; column is sorted within each file (or approximately sorted through &lt;a href="https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/" rel="noopener noreferrer"&gt;clustering&lt;/a&gt;), each file contains a narrow range of values. File 1 might have &lt;code&gt;amount&lt;/code&gt; from 10 to 200, File 2 from 200 to 400, and so on. A filter on &lt;code&gt;amount &amp;gt; 500&lt;/code&gt; can skip the first several files completely.&lt;/p&gt;

&lt;p&gt;If the column is randomly distributed, every file has a range of roughly &lt;code&gt;min(amount)&lt;/code&gt; to &lt;code&gt;max(amount)&lt;/code&gt; across the entire dataset. No file can be skipped because every file's range overlaps every filter. Sort order turns file skipping from theoretical to practical.&lt;/p&gt;

&lt;p&gt;Iceberg supports declaring a &lt;a href="https://iceberg.apache.org/spec/#sorting" rel="noopener noreferrer"&gt;sort order&lt;/a&gt; at the table level. When engines compact data (rewrite files), they can apply this sort order to produce files with tight column ranges. Dremio's &lt;a href="https://www.dremio.com/blog/table-optimization-in-dremio/" rel="noopener noreferrer"&gt;automatic table optimization&lt;/a&gt; handles this without manual intervention for tables managed by Open Catalog.&lt;/p&gt;
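
&lt;p&gt;Declaring the sort order is a one-line DDL change. A minimal sketch, assuming the Spark SQL extensions are enabled (existing files are untouched until they are rewritten):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Applied by subsequent writes and by compaction rewrites.
ALTER TABLE orders WRITE ORDERED BY order_date, amount;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;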

&lt;h3&gt;
  
  
  File Size and Count
&lt;/h3&gt;

&lt;p&gt;Smaller files mean tighter statistics per file but more manifest entries to manage. Larger files reduce metadata overhead but produce wider min/max ranges (less effective pruning). The typical recommendation is 128 MB to 512 MB per file for analytical workloads.&lt;/p&gt;

&lt;p&gt;Too many small files (the "small file problem") bloat manifests and slow down planning. Regular &lt;a href="https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/" rel="noopener noreferrer"&gt;compaction&lt;/a&gt; merges small files into optimally-sized ones while preserving or improving sort order.&lt;/p&gt;
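
&lt;p&gt;In Spark, compaction typically runs through Iceberg's &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure; Dremio exposes the equivalent as &lt;code&gt;OPTIMIZE TABLE&lt;/code&gt;. A minimal sketch, assuming a catalog named &lt;code&gt;catalog&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Merge small files and apply the table's declared sort order.
CALL catalog.system.rewrite_data_files(
  table    =&amp;gt; 'db.orders',
  strategy =&amp;gt; 'sort'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;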

&lt;h2&gt;
  
  
  Beyond Min/Max: Other Statistics
&lt;/h2&gt;

&lt;p&gt;Iceberg's spec supports several statistical measures per column per file:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Statistic&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Pruning Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Min/Max values&lt;/td&gt;
&lt;td&gt;Range-based filtering&lt;/td&gt;
&lt;td&gt;High (if sorted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Null count&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;IS NOT NULL&lt;/code&gt; filters&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NaN count&lt;/td&gt;
&lt;td&gt;Float NaN filtering&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Value count&lt;/td&gt;
&lt;td&gt;Row count estimation&lt;/td&gt;
&lt;td&gt;Used by optimizer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distinct count&lt;/td&gt;
&lt;td&gt;Cardinality estimation&lt;/td&gt;
&lt;td&gt;Used by cost-based optimizer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Engines like &lt;a href="https://www.dremio.com/platform/reflections/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; and Spark use the value counts and distinct counts for cost-based optimization decisions (choosing join strategies, selecting scan parallelism) even when they do not directly prune files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metadata Caching
&lt;/h2&gt;

&lt;p&gt;Reading metadata from object storage on every query adds latency. Production engines cache metadata aggressively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metadata file cache&lt;/strong&gt;: The &lt;code&gt;metadata.json&lt;/code&gt; and manifest list are typically cached in memory. They change only when the table is updated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manifest cache&lt;/strong&gt;: Manifest files are immutable (they are never modified, only replaced). Once read, they can be cached until no live snapshot references them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parquet footer cache&lt;/strong&gt;: Since Parquet files are immutable, their footers (which contain row-group statistics and schema) can be cached permanently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dremio's &lt;a href="https://www.dremio.com/platform/reflections/" rel="noopener noreferrer"&gt;Columnar Cloud Cache (C3)&lt;/a&gt; caches both metadata and data on local NVMe drives at executor nodes, turning cloud storage latency into local-disk speed for frequently-accessed tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Metadata Is Not Enough
&lt;/h2&gt;

&lt;p&gt;Metadata-driven pruning has limits. If a filter column is not in the partition spec and the data is not sorted by that column, min/max ranges will overlap across all files and no pruning occurs. In these cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add the column to the sort order&lt;/strong&gt; and compact the table. This is the most effective fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider partition evolution&lt;/strong&gt; (covered in &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Part 4&lt;/a&gt;) to add a partition transform on the column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable bloom filters&lt;/strong&gt; for high-cardinality equality filters like user IDs or transaction IDs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The metadata is only as good as the physical organization of the data. Well-organized tables skip 95%+ of I/O. Poorly organized tables with random data distribution skip nothing, and the metadata overhead becomes pure cost with no benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
    <item>
      <title>What is Dremio? The Unified Lakehouse and AI Platform</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Tue, 05 May 2026 18:30:15 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/what-is-dremio-the-unified-lakehouse-and-ai-platform-1n2g</link>
      <guid>https://forem.com/alexmercedcoder/what-is-dremio-the-unified-lakehouse-and-ai-platform-1n2g</guid>
      <description>&lt;p&gt;If you manage a modern data stack, you likely spend the majority of your time and compute budget moving data around. You pull data from an operational database, stage it in object storage, transform it, load it into a data warehouse, and finally extract it into BI extracts. This DIY approach creates fragile pipelines, delayed insights, and vendor lock-in.&lt;/p&gt;

&lt;p&gt;Dremio exists to eliminate this complexity. A mature platform with 11 years of engineering behind it, Dremio is a unified analytics solution that lets you query data where it lives, govern it securely, and interact with it using built-in Agentic AI.&lt;/p&gt;

&lt;p&gt;To understand what Dremio does, you must view it as a three-part platform: a Federated Query Engine, an Iceberg Lakehouse Platform, and an Agentic AI Layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq5mkh6q0atubrtru5yi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq5mkh6q0atubrtru5yi.png" alt="Dremio's Three-Part Platform Overview: Federated Query Engine, Iceberg Lakehouse, and Agentic AI" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 1: The Federated Query Engine
&lt;/h2&gt;

&lt;p&gt;At its core, Dremio is an execution engine built on the principle of "Query, Don't Move." &lt;/p&gt;

&lt;p&gt;Instead of forcing you to centralize all your data into a single proprietary warehouse, Dremio acts as a logical abstraction layer. When a user or BI dashboard submits a SQL query, Dremio parses the request, identifies the underlying data sources, and generates optimized sub-queries. It pushes down filters and aggregations to the source systems, retrieves the minimal necessary data, and executes the final joins in memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzuo9eqh10xln8mzftkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzuo9eqh10xln8mzftkh.png" alt="Federated Query Engine splitting a single query to Amazon S3, PostgreSQL, and Oracle" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This architecture eliminates the serialization tax and allows for &lt;strong&gt;Zero-Copy Data Movement&lt;/strong&gt;. While many other platforms have historically struggled to scale query federation, Dremio is able to scale it effortlessly. This is because of Apache Arrow's high-speed in-memory columnar execution, Dremio's intelligent pushdowns, and Iceberg-based Reflections. These features give Dremio a massive performance advantage over other query federation tools that do not leverage them. You bypass complex, multi-stage ETL pipelines entirely while maintaining interactive analytics speed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froccv6nbtkjxr9vntkdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Froccv6nbtkjxr9vntkdi.png" alt="Comparison of a massive ETL pipeline against a direct zero-copy pointer to raw storage" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 2: The Iceberg Lakehouse Platform
&lt;/h2&gt;

&lt;p&gt;While federation is a great starting place to operationalize your data analytics rapidly, you ideally want the majority of your analytics to operate directly from your data lake using Apache Iceberg tables. Shifting workloads to Iceberg provides three major advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reduction in costs:&lt;/strong&gt; You rely on cheaper object storage (like Amazon S3, ADLS, or Google Cloud Storage) while eliminating the need for duplicative storage and expensive ETL pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool interoperability:&lt;/strong&gt; Open standards ensure better collaboration between teams, allowing data engineers, analysts, and data scientists to interact with the exact same data using different compute engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous performance management:&lt;/strong&gt; Dremio automatically optimizes your Iceberg tables and accelerates their performance with background Reflections. This makes a lakehouse feel as fast and easy to use as a traditional warehouse, but without the premium costs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By natively supporting Apache Parquet and Apache Iceberg, Dremio brings relational database capabilities (like ACID transactions, schema evolution, and time travel) directly to your object storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox2rioexbs05sb8qwnsx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox2rioexbs05sb8qwnsx.png" alt="Iceberg Lakehouse Architecture showing the hierarchy from catalog to metadata to Parquet files" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To manage this open ecosystem securely, Dremio integrates tightly with Apache Polaris. Polaris serves as a neutral, open catalog that provides centralized governance, role-based access control (RBAC), and credential vending. It ensures that whether you query data using Dremio, Apache Spark, or Apache Flink, every engine respects the same security policies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyo1kgdzu5lybengnxr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyo1kgdzu5lybengnxr8.png" alt="Apache Polaris Governance acting as an umbrella over multiple query engines" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, querying raw files on object storage can occasionally bottleneck at large scales. Dremio solves this with &lt;strong&gt;Autonomous Reflections&lt;/strong&gt;. Instead of relying on data engineers to manually build and maintain materialized views or OLAP cubes, Dremio monitors query patterns and automatically materializes optimized data structures in the background. When a user runs a query, the engine transparently routes it to the Reflection, delivering sub-second BI performance directly on the lakehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5otxov5jjfihwubb2elt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5otxov5jjfihwubb2elt.png" alt="Autonomous Reflections Lifecycle: Query Monitoring, Background Materialization, and Instant Acceleration" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 3: The Agentic AI Layer
&lt;/h2&gt;

&lt;p&gt;A fast query engine is useless if users cannot find or understand the data. Dremio bridges this gap by integrating artificial intelligence deeply into the platform. &lt;/p&gt;

&lt;p&gt;The foundation of this layer is the AI-powered semantic layer. It maps raw tables and columns into clean, business-friendly concepts through SQL views, tags, wikis, lineage, and a knowledge graph, with built-in semantic search to make all of it discoverable. This governed semantic layer ensures that both human analysts and AI agents interpret the data identically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2im7xegqun08kxrv44qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2im7xegqun08kxrv44qr.png" alt="Agentic AI Layer Overview showing the Semantic Layer feeding both Human Analysts and AI Agents" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For human users, Dremio includes a built-in AI Agent. Users simply type a natural language request, such as "Show top customers by revenue," and the agent instantly translates it into a highly optimized SQL query based on the context embedded in the semantic layer. It goes beyond translation: the agent immediately executes the query and can automatically generate interactive data visualizations or insights based on the results.&lt;/p&gt;
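
&lt;p&gt;For illustration, a request like the one above might resolve to SQL along these lines; the view and column names are hypothetical stand-ins for whatever your semantic layer actually exposes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical SQL an agent might generate for
-- "Show top customers by revenue"; names are illustrative.
SELECT customer_name,
       SUM(revenue) AS total_revenue
FROM   analytics.customer_revenue   -- governed business view
GROUP  BY customer_name
ORDER  BY total_revenue DESC
LIMIT  10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;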

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xo6mrjgfd7qw7rbenai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xo6mrjgfd7qw7rbenai.png" alt="Built-in AI Agent Flow translating natural language into SQL, executing it, and generating a visual chart" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For system automation, Dremio provides a Model Context Protocol (MCP) Server. The Dremio MCP Server allows external AI assistants and local IDEs to securely connect to the lakehouse with built-in access to Dremio's semantic layer. The server registers tools for semantic discovery and query execution, enabling AI agents to autonomously research and analyze data on your behalf.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgy1kxz95yelsr9r13jx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgy1kxz95yelsr9r13jx.png" alt="Dremio MCP Server Architecture connecting a Local AI Assistant to the Lakehouse" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, Dremio brings Generative AI directly into your data pipelines through Native AI SQL Functions. Functions like &lt;code&gt;AI_COMPLETE&lt;/code&gt;, &lt;code&gt;AI_GENERATE&lt;/code&gt;, and &lt;code&gt;AI_CLASSIFY&lt;/code&gt; allow you to process unstructured data directly within a &lt;code&gt;SELECT&lt;/code&gt; statement. You can extract structured fields from raw PDF blobs or classify customer sentiment without ever moving the data to an external machine learning service.&lt;/p&gt;
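
&lt;p&gt;As a rough sketch of what this looks like in a query (the exact argument shapes of these functions are an assumption here; check the Dremio docs for precise signatures), sentiment classification might be as simple as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Classify sentiment inline, no external ML service needed.
-- The AI_CLASSIFY argument shape is an assumption.
SELECT ticket_id,
       AI_CLASSIFY(ticket_text, 'positive', 'negative', 'neutral') AS sentiment
FROM   support.tickets
WHERE  created_at &amp;gt; CURRENT_DATE - INTERVAL '7' DAY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;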

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhcfobik3v5uco0vf1jw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhcfobik3v5uco0vf1jw.png" alt="Native AI SQL Functions extracting structured data from a raw PDF document" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Dremio is not a traditional data warehouse. It is a unified platform that eliminates data silos through a federated query engine, secures your object storage with an Iceberg-based lakehouse, and accelerates insights with an Agentic AI layer. &lt;/p&gt;

&lt;p&gt;By building on open standards like Apache Iceberg, Apache Parquet, Apache Arrow, and Apache Polaris, you maintain full control of your data. You achieve interactive BI performance without vendor lock-in.&lt;/p&gt;

&lt;p&gt;Ready to build your open data architecture? Take the next step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.dremio.com/get-started" rel="noopener noreferrer"&gt;Try the free trial&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn more about Dremio at a workshop or webinar&lt;/strong&gt; (&lt;a href="https://www.dremio.com/events" rel="noopener noreferrer"&gt;Events&lt;/a&gt; and &lt;a href="https://www.dremio.com/workshops" rel="noopener noreferrer"&gt;Workshops&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download free books:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1" rel="noopener noreferrer"&gt;FREE - The Apache Iceberg Digest: Vol1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>analytics</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Semantic Layer: The Definitive Guide</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Fri, 01 May 2026 13:13:54 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/semantic-layer-the-definitive-guide-5h6i</link>
      <guid>https://forem.com/alexmercedcoder/semantic-layer-the-definitive-guide-5h6i</guid>
      <description>&lt;p&gt;The term "semantic layer" has been part of the data industry's vocabulary for over 35 years. It first appeared in a 1991 patent filing by Business Objects, and it has since been reinvented, abandoned, and reinvented again across three distinct eras of data architecture. Today, it sits at the center of one of the most consequential design debates in the industry: should the semantic layer be a standalone product you bolt onto your stack, or a native capability of the platform that already manages your data?&lt;/p&gt;

&lt;p&gt;This guide covers the full arc: what a semantic layer is, where it came from, how it split into two competing architectural approaches, and why the choice between them determines whether your AI agents produce accurate answers or hallucinated nonsense.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Semantic Layer Actually Is
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9hbvo6s9mu46w8k30xn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9hbvo6s9mu46w8k30xn.png" alt="The semantic layer sits between raw data sources and consumers, providing metric consistency, access governance, and query abstraction" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A semantic layer is an abstraction that maps the physical structure of your data (table names, column names, join logic, filter conditions) to the business terms that people actually use (revenue, churn rate, active customer, cost per acquisition). It sits between the raw data and every consumer of that data: BI dashboards, AI agents, Python notebooks, Excel spreadsheets, and custom applications.&lt;/p&gt;

&lt;p&gt;The semantic layer has three core responsibilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metric consistency.&lt;/strong&gt; When the finance team says "revenue," they mean recognized revenue net of refunds. When the sales team says "revenue," they mean bookings including pending deals. Without a semantic layer, both teams write their own SQL, get different numbers, and spend the next two weeks arguing about which dashboard is right. A semantic layer defines "revenue" once, in one place, and every downstream consumer uses that definition.&lt;/p&gt;
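
&lt;p&gt;In practice, "define it once" usually means a single governed view that every dashboard queries; a minimal sketch, with illustrative table and column names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One shared definition of revenue: recognized revenue
-- net of refunds. Names are illustrative.
CREATE VIEW finance.revenue AS
SELECT order_id,
       order_date,
       order_amount - COALESCE(refund_amount, 0) AS revenue_usd
FROM   raw.orders
WHERE  status = 'recognized';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;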

&lt;p&gt;&lt;strong&gt;Access governance.&lt;/strong&gt; The semantic layer controls who sees what. A marketing analyst querying customer data should not see Social Security numbers. A regional manager should only see data for their region. These rules (row-level security, column masking, role-based access) are defined at the semantic layer and enforced consistently regardless of which tool is doing the querying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query abstraction.&lt;/strong&gt; Business users and AI agents should not need to know that "customer churn rate" requires joining three tables, filtering out test accounts, calculating a 90-day rolling window, and dividing by the active customer count from the prior period. The &lt;a href="https://www.dremio.com/platform/unified-analytics/ai-semantic-layer/" rel="noopener noreferrer"&gt;semantic layer&lt;/a&gt; encapsulates that logic in a reusable definition. Consumers ask for "churn rate" and get the right answer.&lt;/p&gt;
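
&lt;p&gt;A sketch of what that encapsulation can look like, following the churn description above (names illustrative, and the logic deliberately simplified):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Encapsulates the churn logic once: test-account filtering
-- and the 90-day window live here, not in every dashboard.
-- Simplified: omits the prior-period denominator for brevity.
CREATE VIEW metrics.churn_rate AS
SELECT COUNT(CASE WHEN last_order_date &amp;lt; CURRENT_DATE - INTERVAL '90' DAY
                  THEN 1 END) * 1.0
       / NULLIF(COUNT(*), 0) AS churn_rate
FROM   analytics.customers
WHERE  account_type != 'test';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;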

&lt;h2&gt;
  
  
  The Origin Story: Business Objects, 1991
&lt;/h2&gt;

&lt;p&gt;The semantic layer was invented to solve a simple problem: business users could not write SQL.&lt;/p&gt;

&lt;p&gt;In 1991, Business Objects filed a patent for a "relational database access system using semantically dynamic objects." The product feature was called "Universes." It worked like this: a data architect would build a metadata model that mapped physical database tables and join paths into business-friendly objects ("Customer," "Product," "Sales Amount"). Report builders could then drag and drop these objects to create queries without touching SQL.&lt;/p&gt;

&lt;p&gt;This was a significant advance. Before Universes, generating a report from a relational database required either a developer who understood the schema or a business user willing to learn SQL. Business Objects eliminated that requirement entirely.&lt;/p&gt;

&lt;p&gt;IBM's Cognos followed with "Framework Manager," which served the same purpose: map the physical database into a logical, business-friendly model. SAP built InfoProviders and BEx queries on top of SAP BW. Microsoft introduced SQL Server Analysis Services.&lt;/p&gt;

&lt;p&gt;Every major enterprise BI vendor in the 1990s built some version of a semantic layer. But they all shared the same fundamental limitation: &lt;strong&gt;the semantic layer was proprietary and locked to a single vendor's BI tool.&lt;/strong&gt; If you built your metrics in Business Objects Universes, those definitions did not carry over to Cognos. If you modeled your data in SSAS, Tableau could not read it. The semantic layer existed, but it was a walled garden.&lt;/p&gt;

&lt;h2&gt;
  
  
  OLAP Cubes: The Implicit Semantic Layer
&lt;/h2&gt;

&lt;p&gt;Running parallel to the relational semantic layer was the OLAP (Online Analytical Processing) cube. Products like SQL Server Analysis Services, Cognos TM1, and Oracle Essbase pre-computed data into multidimensional structures: dimensions (Customer, Product, Time), measures (Revenue, Quantity, Profit), and hierarchies (Year &amp;gt; Quarter &amp;gt; Month &amp;gt; Day).&lt;/p&gt;

&lt;p&gt;The cube itself functioned as a semantic layer. Business users did not query tables; they navigated dimensions. They did not write SQL; they used MDX (Multidimensional Expressions) or simply clicked through pivot-table interfaces. The business logic was baked into the cube's structure.&lt;/p&gt;

&lt;p&gt;OLAP cubes worked well for their era. Pre-computing aggregations meant that analytical queries returned in seconds, even on the hardware of the early 2000s. But they had three fatal weaknesses:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rigidity.&lt;/strong&gt; Adding a new dimension or changing a hierarchy required rebuilding the cube, which could take hours for large datasets. Business requirements change faster than cubes can be rebuilt.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Cubes stored pre-aggregated copies of data. For large organizations, this meant maintaining terabytes of redundant, pre-computed data on expensive storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Specialization.&lt;/strong&gt; Operating an OLAP cube required specialized skills (MDX, cube design, aggregation strategies) that most data teams did not have.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As cloud data warehouses like Snowflake, BigQuery, and Redshift made raw compute cheap and fast, the need for pre-aggregation declined. You could run the analytical query directly against the detail data and get results in seconds. The cube's primary value proposition, speed through pre-computation, was no longer unique.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Self-Service Era and the Loss of the Semantic Layer
&lt;/h2&gt;

&lt;p&gt;The 2010s brought a dramatic shift. Self-service BI tools like Tableau and Power BI connected directly to databases, bypassing the semantic layer entirely. This was marketed as democratization: give every analyst direct access to the data, and they will find their own insights.&lt;/p&gt;

&lt;p&gt;For small teams, this worked. For organizations with more than a handful of analysts, it created a problem that the industry calls "metric drift." Without a centralized semantic layer, each analyst wrote their own SQL. Each SQL query embedded its own business logic. Revenue was calculated five different ways by five different people, and no one could agree on which number was correct.&lt;/p&gt;

&lt;p&gt;The first response to metric drift came from &lt;a href="https://cloud.google.com/looker" rel="noopener noreferrer"&gt;Looker&lt;/a&gt;, founded in 2012, which introduced LookML as a code-based semantic layer. You defined your metrics, dimensions, and relationships in version-controlled modeling files. This was a meaningful evolution because it separated the semantic logic from the BI tool's proprietary report format. Google acquired Looker for $2.6 billion in 2019, validating that the semantic layer was worth owning. But LookML was still tied to Looker's ecosystem. If you used Tableau or Power BI as your primary BI tool, LookML did not help.&lt;/p&gt;

&lt;p&gt;The broader industry realization was clear: &lt;strong&gt;skipping the semantic layer does not eliminate the need for one. It just distributes the problem across every team and every dashboard, where it becomes harder to find and harder to fix.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dremio: The Semantic Layer Built Into the Query Engine From Day One
&lt;/h2&gt;

&lt;p&gt;While Looker was coupling the semantic layer to a BI tool, a different approach was emerging. Dremio was founded in 2015 by Tomer Shiran and Jacques Nadeau, creators and contributors to the Apache Drill project. When Dremio publicly launched in July 2017, it introduced what it called a "governed, self-service semantic layer" as a core architectural component, not an add-on.&lt;/p&gt;

&lt;p&gt;The key difference: Dremio's semantic layer was integrated directly into a high-performance query engine. From its first release, the platform shipped with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Virtual Datasets (Views).&lt;/strong&gt; SQL-defined business logic that users could create, share, and layer on top of any connected data source. No data movement required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reflections.&lt;/strong&gt; Patented, transparent materialized views that the query optimizer substitutes automatically. Users query the governed view; Dremio serves the fastest available Reflection behind the scenes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Federated access.&lt;/strong&gt; The semantic layer worked across data sources (S3, HDFS, relational databases) from the start, not just against a single warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dremio added Wikis and Labels (Tags) in subsequent releases, providing Markdown-formatted documentation and classification metadata directly attached to datasets in the catalog. This meant the semantic layer was not just a set of views; it included the context that made those views discoverable and understandable.&lt;/p&gt;

&lt;p&gt;This was architecturally distinct from every other semantic layer on the market. AtScale (founded 2013) and Cube (open-sourced 2019) built the semantic layer as a separate product. Dremio built it into the same platform that executed the queries and managed the catalog. That design decision would become increasingly important as AI agents entered the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Modern Resurgence: Two Divergent Paths
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nc7zxbgjz4vp1gnj96s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nc7zxbgjz4vp1gnj96s.png" alt="The semantic layer evolved from 1991 through OLAP cubes and self-service BI into two divergent paths: standalone products and platform-integrated semantic layers" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the early 2020s, the semantic layer was firmly back. dbt Labs acquired Transform (the creators of MetricFlow) in February 2023 to build a code-based metrics layer. Cube had open-sourced its API-first semantic layer in 2019 and launched Cube Cloud commercially in 2021. AtScale had been building its enterprise virtualization layer since 2013.&lt;/p&gt;

&lt;p&gt;The market had split into two fundamentally different architectural forms, and the choice between them has significant consequences for how your data platform operates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path 1: The semantic layer as a standalone product.&lt;/strong&gt; Companies like AtScale (2013), Cube (2019), and dbt (MetricFlow, 2023) built the semantic layer as a separate service that sits between your data warehouse and your BI tools. You deploy it as its own infrastructure, manage it as its own system, and integrate it with your existing stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path 2: The semantic layer as a platform feature.&lt;/strong&gt; &lt;a href="https://www.dremio.com/blog/agentic-analytics-semantic-layer/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; (2017) integrated the semantic layer directly into its query engine and data catalog from the start. There is no separate service to deploy. The semantic layer is a native capability of the same platform that stores, governs, and queries your data.&lt;/p&gt;

&lt;p&gt;Both approaches solve the metric consistency problem. They differ in how they solve it, what they require operationally, and how well they extend to AI-driven analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 1: The Semantic Layer as a Standalone Product
&lt;/h2&gt;

&lt;p&gt;Three standalone semantic layer products dominate the current market. Each targets a different architecture and team profile.&lt;/p&gt;

&lt;h3&gt;
  
  
  AtScale (Founded 2013)
&lt;/h3&gt;

&lt;p&gt;AtScale, founded by veterans of the Yahoo data team, positions itself as a "universal semantic layer" for large enterprises. It creates a virtualization layer across multiple data warehouses (Snowflake, BigQuery, Databricks), presenting a unified semantic model to BI tools. Its strongest feature is native MDX and DAX compatibility, which makes it the only option for organizations with heavy Excel and SSAS dependencies.&lt;/p&gt;

&lt;p&gt;AtScale excels when you have data spread across multiple warehouses and need a single semantic model that works across all of them. The tradeoff is infrastructure complexity and licensing cost. AtScale requires dedicated infrastructure, and its enterprise pricing model reflects its positioning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cube (Open-Sourced 2019)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cube.dev/" rel="noopener noreferrer"&gt;Cube&lt;/a&gt; started as Statsbot in 2016 before pivoting to become an open-source, API-first semantic layer in 2019. It provides REST, GraphQL, SQL, MDX, and DAX APIs, making it the most flexible option for embedded analytics and customer-facing dashboards. Cube Cloud launched commercially in 2021. Cube's pre-aggregation engine can deliver sub-second responses for complex queries by pre-computing results and caching them.&lt;/p&gt;

&lt;p&gt;Cube excels when your primary consumer is a custom application, not a BI tool. The tradeoff is operational overhead: Cube runs as a separate server, requires its own infrastructure, and demands expertise in designing pre-aggregation strategies to achieve optimal performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  dbt Semantic Layer (MetricFlow, 2023)
&lt;/h3&gt;

&lt;p&gt;The dbt Semantic Layer is powered by MetricFlow, the technology dbt Labs acquired when it purchased Transform in February 2023. It lets teams define metrics as code in YAML files within their existing dbt projects. Metrics are version-controlled, reviewed via pull requests, and deployed alongside your dbt transformations. In late 2025, dbt Labs moved MetricFlow to an Apache 2.0 license, signaling a commitment to open, portable metric definitions.&lt;/p&gt;

&lt;p&gt;The dbt Semantic Layer excels when your team is already a dbt shop and wants metrics managed in the same Git-based workflow as transformations. The tradeoff is that it requires dbt Cloud for the serving layer, lacks native caching (relying on the underlying warehouse for query execution), and is less suited for high-concurrency embedded applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Structural Tradeoff of Standalone Products
&lt;/h3&gt;

&lt;p&gt;All three standalone products share the same architectural limitation: they exist as a separate layer of infrastructure that must integrate with your data catalog, your governance system, and your query engine.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Another system to operate.&lt;/strong&gt; You deploy it, monitor it, upgrade it, and debug it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance is a separate concern.&lt;/strong&gt; Access policies defined in your catalog or warehouse must be replicated or synced with the semantic layer. Any gap is a security risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No native execution.&lt;/strong&gt; Standalone semantic layers define metrics but do not execute queries. They translate user requests into SQL and send that SQL to an external engine. If the engine and the semantic layer disagree on the data model, you get wrong results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sync lag.&lt;/strong&gt; When you change a table schema, add a column, or update governance rules, the semantic layer must be updated separately. Until it is, your definitions are stale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams with a single data warehouse, a strong DevOps practice, and a primary use case that matches one of these products, standalone semantic layers work well. For teams managing federated data across multiple sources, or teams building AI-driven analytics, the gap between "definition" and "execution" creates friction that compounds over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 2: The Semantic Layer as a Platform Feature
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbnjy2bm9l4e641f9l3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbnjy2bm9l4e641f9l3u.png" alt="Dremio's architecture integrating semantic layer, query engine, and open catalog in a single platform" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The alternative is to build the semantic layer into the same platform that manages your data catalog, governs access, and executes queries. This is the approach &lt;a href="https://www.dremio.com/blog/the-ai-foundation-of-the-agentic-lakehouse/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; takes.&lt;/p&gt;

&lt;p&gt;In Dremio, the semantic layer is not a separate product you bolt on. It is a native set of capabilities (views, wikis, labels, lineage, knowledge graph) that are integrated with the &lt;a href="https://www.dremio.com/platform/enterprise-data-catalog/" rel="noopener noreferrer"&gt;Open Catalog&lt;/a&gt; (built on Apache Polaris), the MPP query engine (built on Apache Arrow), and the governance system (Fine-Grained Access Control, row-level security, column masking).&lt;/p&gt;

&lt;p&gt;This matters because the three activities that define a semantic layer (defining metrics, governing access, and executing queries) all happen in the same system. There is no handoff, no sync, no governance gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Dremio's Semantic Layer Works
&lt;/h2&gt;

&lt;p&gt;Dremio's &lt;a href="https://www.dremio.com/platform/unified-analytics/ai-semantic-layer/" rel="noopener noreferrer"&gt;AI Semantic Layer&lt;/a&gt; is built from five components that work together (views, wikis, labels, lineage, and the knowledge graph), with semantic search as the entry point across all of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Views (Virtual Datasets)
&lt;/h3&gt;

&lt;p&gt;Views are the foundation. A view is a SQL-defined virtual dataset that encapsulates business logic: joins, filters, calculations, and transformations. You write the SQL once, and every consumer (BI tool, AI agent, Python notebook) queries the view instead of the raw tables.&lt;/p&gt;

&lt;p&gt;Dremio recommends a three-layer architecture for views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preparation Layer.&lt;/strong&gt; One view per source table. Handles type casting, column renaming, null handling. A direct 1:1 mapping of raw data into clean, standardized form.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business Layer.&lt;/strong&gt; Shared business logic. This is where you define "active customer" (customers with at least one order in the last 90 days, excluding test accounts), "revenue" (order_amount minus refunds, in USD), and every other metric that needs a single definition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Layer.&lt;/strong&gt; Tailored datasets for specific consumers. A marketing dashboard view joins customer demographics with campaign performance. An AI agent view exposes the most commonly asked metrics with rich column-level documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because views are virtual, they do not copy or move data. They execute against the underlying data at query time, using Dremio's &lt;a href="https://www.dremio.com/blog/why-agentic-analytics-requires-federation-virtualization-and-the-lakehouse-how-dremio-delivers/" rel="noopener noreferrer"&gt;federated query engine&lt;/a&gt; to pull from S3, PostgreSQL, Snowflake, MongoDB, or any connected source. Change the underlying data, and the view reflects it immediately.&lt;/p&gt;
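
&lt;p&gt;A minimal sketch of the three layers described above, using hypothetical source and view names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Preparation layer: 1:1 cleanup of one raw source table.
CREATE VIEW prep.orders AS
SELECT CAST(order_ts AS TIMESTAMP) AS order_date,
       CAST(amount_cents AS DECIMAL(18,2)) / 100 AS order_amount_usd,
       customer_id,
       account_type
FROM   source_pg.public.orders;

-- Business layer: the shared "active customer" definition.
CREATE VIEW business.active_customers AS
SELECT customer_id
FROM   prep.orders
WHERE  order_date &amp;gt; CURRENT_DATE - INTERVAL '90' DAY
  AND  account_type != 'test'
GROUP  BY customer_id;

-- Application layer: tailored for one dashboard.
CREATE VIEW app_marketing.active_customer_spend AS
SELECT o.customer_id,
       SUM(o.order_amount_usd) AS spend_90d
FROM   prep.orders o
JOIN   business.active_customers a ON a.customer_id = o.customer_id
GROUP  BY o.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;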

&lt;h3&gt;
  
  
  Wikis
&lt;/h3&gt;

&lt;p&gt;Wikis are Markdown-formatted documentation attached directly to spaces, sources, folders, tables, views, and columns. They serve two audiences: human analysts browsing the catalog, and AI agents generating SQL.&lt;/p&gt;

&lt;p&gt;A wiki for a view called &lt;code&gt;analytics.customer_health&lt;/code&gt; might contain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Customer Health Score&lt;/span&gt;

Composite metric combining purchase frequency, support ticket volume,
and NPS survey responses over the trailing 90 days.

&lt;span class="gs"&gt;**Owner:**&lt;/span&gt; Customer Success team
&lt;span class="gs"&gt;**Refresh:**&lt;/span&gt; Updated daily by the ETL pipeline
&lt;span class="gs"&gt;**Filters:**&lt;/span&gt; Excludes test accounts (account_type != 'test')
&lt;span class="gs"&gt;**Churn definition:**&lt;/span&gt; Score below 30 for two consecutive months
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dremio can also auto-generate wiki content. The platform samples table data, analyzes column distributions, and produces context-rich descriptions using generative AI. This is particularly valuable for large data estates where manually documenting hundreds of tables is impractical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Labels
&lt;/h3&gt;

&lt;p&gt;Labels classify and organize data assets. You tag a table as &lt;code&gt;PII&lt;/code&gt;, &lt;code&gt;Finance&lt;/code&gt;, &lt;code&gt;Marketing&lt;/code&gt;, &lt;code&gt;Approved&lt;/code&gt;, or &lt;code&gt;Draft&lt;/code&gt;. Labels serve two purposes: they improve discoverability (semantic search returns results filtered by label), and they integrate with governance rules (all &lt;code&gt;PII&lt;/code&gt;-labeled columns automatically apply masking policies).&lt;/p&gt;

&lt;p&gt;Like wikis, labels can be AI-suggested. Dremio analyzes column names, data patterns, and content to recommend labels like &lt;code&gt;contains-email&lt;/code&gt; or &lt;code&gt;likely-PII&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lineage
&lt;/h3&gt;

&lt;p&gt;Dremio automatically tracks the flow of data from source to view to consumer. You can see which raw tables feed into which business views, and which dashboards or AI queries consume those views.&lt;/p&gt;

&lt;p&gt;Lineage is critical for impact analysis. Before changing the schema of a source table, you can trace all downstream dependencies and understand exactly what will break. Without automated lineage, this analysis requires manually reading SQL definitions and hoping you did not miss one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Knowledge Graph
&lt;/h3&gt;

&lt;p&gt;The knowledge graph is the newest addition to Dremio's semantic layer. It operates at a higher level than individual wikis and labels, building a connected graph of entity relationships, metric definitions, and usage patterns.&lt;/p&gt;

&lt;p&gt;The knowledge graph works in three ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pattern detection.&lt;/strong&gt; It analyzes query patterns across your organization to detect implicit definitions. If 80% of queries that reference "active customers" use the same WHERE clause (&lt;code&gt;last_order_date &amp;gt; CURRENT_DATE - INTERVAL '90' DAY AND account_type != 'test'&lt;/code&gt;), the knowledge graph surfaces that pattern as a candidate definition (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User-defined context.&lt;/strong&gt; You can provide business context as structured markdown files. These files define entities, relationships, and business rules that the knowledge graph ingests and makes available to AI agents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relationship mapping.&lt;/strong&gt; The knowledge graph connects related entities (customers are related to orders, orders contain products, products belong to categories) and exposes those relationships to AI agents, enabling more accurate multi-table SQL generation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
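
&lt;p&gt;To make the pattern detection from point 1 concrete: the recurring ad-hoc query below is the kind of shape the graph would flag, since its WHERE clause keeps reappearing across teams (names illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- The recurring pattern the knowledge graph detects across
-- many ad-hoc queries; it becomes a candidate definition.
SELECT COUNT(DISTINCT customer_id) AS active_customers
FROM   analytics.customers
WHERE  last_order_date &amp;gt; CURRENT_DATE - INTERVAL '90' DAY
  AND  account_type != 'test';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;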

&lt;h3&gt;
  
  
  Semantic Search
&lt;/h3&gt;

&lt;p&gt;Semantic search lets users and AI agents discover data assets using natural language. Instead of browsing a schema tree looking for a table called &lt;code&gt;dwh_fact_cust_ord_line_item&lt;/code&gt;, you search for "customer orders by product category" and find the relevant view, complete with its wiki documentation and labels.&lt;/p&gt;

&lt;p&gt;Semantic search indexes wikis, labels, column names, table descriptions, and view definitions. It is the entry point for both human exploration and AI agent data discovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Integrated Approach Changes Everything for AI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8tgus2xtevp5jevo7tq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8tgus2xtevp5jevo7tq.png" alt="How an AI agent uses the semantic layer to generate accurate SQL from a natural language question" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reason the platform-versus-product distinction matters more now than it did five years ago is AI. Specifically, AI agents that generate SQL from natural language questions.&lt;/p&gt;

&lt;p&gt;An AI agent that receives the question "What was our customer churn rate by region last quarter?" needs three things to produce an accurate answer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context.&lt;/strong&gt; What does "churn rate" mean in this organization? What table contains the data? Which columns are relevant? What filters should be applied? The semantic layer's wikis, labels, views, and knowledge graph provide this context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access.&lt;/strong&gt; Can this user see the churn data? Are there row-level filters based on their role? Are any columns masked? The governance system enforces these rules.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Execution speed.&lt;/strong&gt; The user expects an answer in seconds, not minutes. The query engine needs to be fast enough for interactive use.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In a standalone semantic layer architecture, these three capabilities live in three different systems: the semantic layer product provides context, the data catalog or warehouse provides governance, and a separate query engine provides execution. The AI agent must coordinate across all three, and any mismatch between them produces wrong answers, security violations, or slow responses.&lt;/p&gt;

&lt;p&gt;In Dremio's architecture, all three are co-located. The &lt;a href="https://www.dremio.com/ai-agent/" rel="noopener noreferrer"&gt;AI Agent&lt;/a&gt; reads the wikis, labels, and view definitions from the semantic layer, generates SQL that respects governance rules, and executes the query on the built-in MPP engine. The entire loop happens within a single governed platform.&lt;/p&gt;

&lt;p&gt;Dremio's &lt;a href="https://docs.dremio.com/current/developer/mcp-server/" rel="noopener noreferrer"&gt;MCP Server&lt;/a&gt; extends this to external AI tools. ChatGPT, Claude Desktop, or any custom agent that supports the Model Context Protocol can connect to Dremio and query through the same governed semantic layer. The external AI agent receives the same business context, respects the same governance rules, and gets the same fast query execution as the built-in AI Agent.&lt;/p&gt;

&lt;p&gt;The semantic layer teaches the AI your business language so it generates the right SQL, not generic SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Platform vs Product: A Side-by-Side Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Standalone Products (AtScale, Cube, dbt)&lt;/th&gt;
&lt;th&gt;Platform-Integrated (Dremio)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate server or service to deploy&lt;/td&gt;
&lt;td&gt;Built into the platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Must integrate with external catalog and warehouse&lt;/td&gt;
&lt;td&gt;Native FGAC, row-level security, column masking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Depends on external warehouse or engine&lt;/td&gt;
&lt;td&gt;Built-in MPP engine (Apache Arrow)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metric definitions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;YAML files, code, or GUI-based models&lt;/td&gt;
&lt;td&gt;SQL views in the catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI readiness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requires separate MCP adapter or API integration&lt;/td&gt;
&lt;td&gt;Native AI Agent + MCP Server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single warehouse or requires federation setup&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.dremio.com/blog/why-agentic-analytics-requires-federation-virtualization-and-the-lakehouse-how-dremio-delivers/" rel="noopener noreferrer"&gt;Federated queries&lt;/a&gt; across 30+ sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-aggregation (Cube) or warehouse-dependent (dbt)&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.dremio.com/blog/5-ways-dremio-reflections-outsmart-traditional-materialized-views/" rel="noopener noreferrer"&gt;Reflections&lt;/a&gt; (autonomous, transparent acceleration)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sync lag&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Possible lag between definition changes and enforcement&lt;/td&gt;
&lt;td&gt;Real-time; definitions and execution are the same system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Teams with a single warehouse and specific tooling needs&lt;/td&gt;
&lt;td&gt;Teams with diverse data sources, AI-driven analytics, or multi-engine requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When a Standalone Product Fits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You use a single data warehouse (Snowflake, BigQuery) and your semantic layer needs are limited to consistent BI metrics&lt;/li&gt;
&lt;li&gt;Your team is already deeply invested in dbt and wants metrics alongside transformations&lt;/li&gt;
&lt;li&gt;You are building customer-facing embedded analytics and need Cube's pre-aggregation performance&lt;/li&gt;
&lt;li&gt;You have heavy Excel/MDX requirements that only AtScale supports&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When the Platform Approach Fits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Your data lives across multiple sources (S3, PostgreSQL, Snowflake, MongoDB) and you need federated access&lt;/li&gt;
&lt;li&gt;You want governance rules defined once and enforced everywhere, including for AI agents&lt;/li&gt;
&lt;li&gt;You are building or planning AI-driven analytics (AI Agent, MCP, natural language querying)&lt;/li&gt;
&lt;li&gt;You want to eliminate the operational overhead of managing a separate semantic layer product&lt;/li&gt;
&lt;li&gt;You need the semantic layer, the catalog, and the query engine to operate as a single governed system&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building Your Semantic Layer: A Practical Framework
&lt;/h2&gt;

&lt;p&gt;If you are starting from scratch or migrating from an ad-hoc metric landscape, here is a practical sequence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Identify your top 10 metrics.&lt;/strong&gt; Not all metrics need to be in the semantic layer on day one. Start with the metrics that cause the most confusion: revenue, churn, active users, cost per acquisition, NPS. These are the metrics where two teams have two different SQL queries and two different numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Build the layered view architecture.&lt;/strong&gt; For each metric, create the three-layer view stack in Dremio. Preparation views clean the source data. Business views encode the agreed-upon logic. Application views tailor the output for specific consumers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Add wikis and labels.&lt;/strong&gt; Document each view and its columns. Define what the metric means, who owns it, how it is calculated, and what filters are applied. Tag columns with labels like &lt;code&gt;PII&lt;/code&gt;, &lt;code&gt;Finance&lt;/code&gt;, or &lt;code&gt;Approved&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Configure governance.&lt;/strong&gt; Apply Fine-Grained Access Control: row-level security for multi-tenant data, column masking for sensitive fields, role-based access for views. These rules are enforced at query time for every consumer, including AI agents.&lt;/p&gt;
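
&lt;p&gt;As a rough sketch of what such policies can look like (the exact DDL is an assumption here; consult Dremio's Fine-Grained Access Control documentation for the precise syntax):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Row-level security: a boolean UDF attached as a row policy.
-- Toy rule for illustration: admins see all rows, everyone
-- else sees only the EMEA region.
CREATE FUNCTION security.region_filter(region VARCHAR)
RETURNS BOOLEAN
RETURN is_member('admins') OR region = 'EMEA';

ALTER TABLE sales.orders
  ADD ROW ACCESS POLICY security.region_filter(region);

-- Column masking: mask SSNs for non-finance users.
CREATE FUNCTION security.mask_ssn(ssn VARCHAR)
RETURNS VARCHAR
RETURN CASE WHEN is_member('finance') THEN ssn ELSE 'XXX-XX-XXXX' END;

ALTER TABLE hr.employees
  MODIFY COLUMN ssn SET MASKING POLICY security.mask_ssn(ssn);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;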

&lt;p&gt;&lt;strong&gt;Step 5: Connect AI interfaces.&lt;/strong&gt; Enable the &lt;a href="https://www.dremio.com/blog/5-steps-to-supercharge-your-analytics-with-dremios-ai-agent-and-apache-iceberg/" rel="noopener noreferrer"&gt;Dremio AI Agent&lt;/a&gt; for your team. Set up the MCP Server for external AI tools. The wikis and labels you added in Step 3 become the context that makes AI-generated SQL accurate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Expand.&lt;/strong&gt; Add the next 10 metrics. Build knowledge graph definitions for complex entity relationships. Let autonomous Reflections learn from query patterns and accelerate the most common queries automatically.&lt;/p&gt;

&lt;p&gt;The semantic layer is not a one-time project. It is a living system that grows with your organization's data needs. Start small, prove value on the metrics that matter most, and expand from there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dremio.com/get-started" rel="noopener noreferrer"&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; to build your semantic layer on top of your existing data sources with zero data movement and native AI agent support.&lt;/p&gt;

&lt;h3&gt;
  
  
  Free Resources to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The Metadata Structure of Modern Table Formats</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:46:39 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/the-metadata-structure-of-modern-table-formats-i81</link>
      <guid>https://forem.com/alexmercedcoder/the-metadata-structure-of-modern-table-formats-i81</guid>
      <description>&lt;p&gt;This is Part 2 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt; covered why table formats exist. This article breaks down exactly how each format organizes its metadata.&lt;/p&gt;

&lt;p&gt;The metadata structure of a table format determines everything: how fast queries start planning, how efficiently concurrent writes are handled, how schema changes propagate, and how much overhead accumulates over time. Two formats can both claim "ACID support" and "time travel" while having fundamentally different mechanisms under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Apache Iceberg: The Metadata Tree
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5df8rbdlb2no5ink2fr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5df8rbdlb2no5ink2fr.png" alt="Iceberg's three-layer metadata architecture from catalog to metadata.json to manifest list to manifest files to data files" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Iceberg organizes metadata into a tree with four levels. Each level adds specificity and enables pruning at query planning time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Catalog pointer.&lt;/strong&gt; The catalog (a REST catalog, &lt;a href="https://www.dremio.com/platform/open-catalog/" rel="noopener noreferrer"&gt;Dremio Open Catalog&lt;/a&gt;, AWS Glue, or Hive Metastore) stores a pointer to the current &lt;code&gt;metadata.json&lt;/code&gt; file. This pointer is the single source of truth for the table's current state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Metadata file (&lt;code&gt;metadata.json&lt;/code&gt;).&lt;/strong&gt; A JSON file containing the table's schema (with column IDs), partition spec, sort order, table properties, and a list of snapshots. Each snapshot represents a complete, immutable version of the table. When the table is updated, a new &lt;code&gt;metadata.json&lt;/code&gt; is created with the new snapshot appended to the list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Manifest list (Avro).&lt;/strong&gt; Each snapshot points to exactly one manifest list. The manifest list is a table of contents: it lists all the manifest files that make up this snapshot and includes partition-level summary statistics for each manifest. These summaries let the query planner skip entire manifests that cannot contain data matching the query filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 4: Manifest files (Avro).&lt;/strong&gt; Each manifest file tracks a set of data files and delete files. For each file, the manifest stores the file path, file size, row count, partition values, and column-level statistics (min value, max value, null count, NaN count, distinct count). These per-file statistics enable file-level pruning during query planning.&lt;/p&gt;

&lt;p&gt;The key insight is that each level progressively narrows the search space. A query engine using &lt;a href="https://www.dremio.com/blog/apache-iceberg-metadata-for-performance/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; or Spark reads the catalog pointer (1 request), loads the metadata file (1 read), checks the manifest list to skip irrelevant manifests (1 read, many skips), then reads only the relevant manifests to find the actual data files to scan. For a petabyte table, this can reduce planning from minutes of directory listing to milliseconds of metadata traversal.&lt;/p&gt;
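
&lt;p&gt;You can inspect this tree directly. With Spark and an Iceberg catalog, the built-in metadata tables expose each layer (the &lt;code&gt;prod.db.orders&lt;/code&gt; names are assumed for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Snapshots recorded in metadata.json.
SELECT snapshot_id, committed_at, operation
FROM   prod.db.orders.snapshots;

-- Entries in the current snapshot's manifest list.
SELECT path, added_data_files_count
FROM   prod.db.orders.manifests;

-- Per-file stats collected from the manifests.
SELECT file_path, record_count, file_size_in_bytes
FROM   prod.db.orders.files;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;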

&lt;h2&gt;
  
  
  Delta Lake: The Sequential Transaction Log
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4yzkef5bi833tuqck72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4yzkef5bi833tuqck72.png" alt="Delta Lake's transaction log structure with JSON commits, Parquet checkpoints, and the reader process" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Delta Lake uses a simpler, linear structure. All metadata lives in the &lt;code&gt;_delta_log/&lt;/code&gt; directory alongside the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON commit files&lt;/strong&gt; (&lt;code&gt;000001.json&lt;/code&gt;, &lt;code&gt;000002.json&lt;/code&gt;, ...) record each transaction as a set of actions: &lt;code&gt;add&lt;/code&gt; (a new data file), &lt;code&gt;remove&lt;/code&gt; (a file marked for deletion), &lt;code&gt;metaData&lt;/code&gt; (schema or property change), and &lt;code&gt;protocol&lt;/code&gt; (version requirements). Each commit file is sequentially numbered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parquet checkpoint files&lt;/strong&gt; are created every 10 commits (by default). A checkpoint is a Parquet file that summarizes the cumulative state of the table at that version, essentially a snapshot of all currently-active &lt;code&gt;add&lt;/code&gt; actions. This prevents readers from having to replay hundreds of small JSON files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;_last_checkpoint&lt;/code&gt;&lt;/strong&gt; is a small file pointing to the most recent checkpoint. The read process is: find the latest checkpoint, load it, then replay any JSON commits after it.&lt;/p&gt;
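
&lt;p&gt;Because each commit is a numbered version, time travel falls out of the log structure for free. In Delta's SQL dialect (engine support varies):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Reading version 42: the reader loads the nearest checkpoint
-- at or before 42, then replays the JSON commits up to it.
SELECT * FROM sales.orders VERSION AS OF 42;

-- Commit history comes straight from the _delta_log entries.
DESCRIBE HISTORY sales.orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;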

&lt;p&gt;The tradeoff: Delta's log is simple and easy to reason about, but it does not have the multi-level pruning that Iceberg's manifest tree provides. File-level statistics exist in the add actions but are not organized hierarchically. For very large tables (millions of files), the planning phase can be slower because there is no intermediate pruning layer equivalent to Iceberg's manifest list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Hudi: The Timeline
&lt;/h2&gt;

&lt;p&gt;Hudi stores metadata in the &lt;code&gt;.hoodie/&lt;/code&gt; directory as a sequence of "instants" on a timeline. Each instant represents an operation (commit, compaction, rollback, clean) and transitions through three states: &lt;code&gt;REQUESTED&lt;/code&gt;, &lt;code&gt;INFLIGHT&lt;/code&gt;, and &lt;code&gt;COMPLETED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The timeline is split into two parts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active timeline&lt;/strong&gt; contains recent instants that are needed for current read and write operations. The file naming pattern is &lt;code&gt;&amp;lt;timestamp&amp;gt;.&amp;lt;action_type&amp;gt;.&amp;lt;state&amp;gt;&lt;/code&gt;. For example, &lt;code&gt;20250429010500.commit.completed&lt;/code&gt; indicates a completed write operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Archived timeline&lt;/strong&gt; contains older instants that have been moved to &lt;code&gt;.hoodie/archived/&lt;/code&gt; to keep the active timeline lean. Hudi 1.0 introduced an LSM-based timeline that compacts archived instants into Parquet files for efficient long-term storage.&lt;/p&gt;

&lt;p&gt;Hudi's timeline tracks more granular operation types than other formats: &lt;code&gt;commit&lt;/code&gt; (COW write), &lt;code&gt;delta_commit&lt;/code&gt; (MOR write), &lt;code&gt;compaction&lt;/code&gt;, &lt;code&gt;clean&lt;/code&gt; (garbage collection), &lt;code&gt;rollback&lt;/code&gt;, &lt;code&gt;savepoint&lt;/code&gt;, and &lt;code&gt;replace&lt;/code&gt; (clustering). This granularity reflects Hudi's focus on complex write patterns like CDC pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Paimon: Snapshots and LSM Trees
&lt;/h2&gt;

&lt;p&gt;Paimon's metadata is organized around snapshots and buckets. Each partition is divided into a fixed number of buckets, and each bucket contains an independent LSM (Log-Structured Merge) tree.&lt;/p&gt;

&lt;p&gt;The snapshot metadata tracks which data files and changelog files belong to each bucket at each point in time. Inside each bucket, the LSM tree structure contains multiple "sorted runs" (levels) of Parquet files. When data is written, it lands in level 0 as a small sorted file. Background compaction merges small files into larger ones at higher levels.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from the other formats because Paimon's metadata structure is designed for continuous streaming writes rather than batch commits. The LSM tree handles high-frequency inserts and updates efficiently by buffering writes in memory and flushing them as sorted runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  DuckLake: SQL Database as Metadata
&lt;/h2&gt;

&lt;p&gt;DuckLake takes the most radical departure. Instead of storing metadata as files in object storage, all metadata lives in a traditional SQL database (PostgreSQL, MySQL, SQLite, or DuckDB itself).&lt;/p&gt;

&lt;p&gt;The metadata database contains tables for: schemas, snapshots, data files, column statistics, and table properties. When a query engine needs to plan a query, it issues a single SQL query against the metadata database instead of reading multiple metadata files from object storage.&lt;/p&gt;
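
&lt;p&gt;A sketch of what planning-time access looks like; the &lt;code&gt;ducklake_*&lt;/code&gt; table names follow the published DuckLake spec's naming, but treat the exact columns as illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Find the latest snapshot, then the data files visible in it.
SELECT MAX(snapshot_id) FROM ducklake_snapshot;

SELECT path, record_count
FROM   ducklake_data_file
WHERE  begin_snapshot &amp;lt;= 42
  AND  (end_snapshot IS NULL OR end_snapshot &amp;gt; 42);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;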

&lt;p&gt;The tradeoff is a dependency on a running database process for metadata management. The benefit is dramatically simpler metadata access patterns and the ability to use SQL for metadata operations like listing snapshots, finding files, and checking statistics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Side-by-Side Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mnuqgoy9v85w76z9vka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mnuqgoy9v85w76z9vka.png" alt="Five approaches to table metadata from file-based to database-backed" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Iceberg&lt;/th&gt;
&lt;th&gt;Delta Lake&lt;/th&gt;
&lt;th&gt;Hudi&lt;/th&gt;
&lt;th&gt;Paimon&lt;/th&gt;
&lt;th&gt;DuckLake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JSON + Avro files&lt;/td&gt;
&lt;td&gt;JSON + Parquet files&lt;/td&gt;
&lt;td&gt;Avro instant files&lt;/td&gt;
&lt;td&gt;Snapshot + LSM files&lt;/td&gt;
&lt;td&gt;SQL database tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Object storage&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;_delta_log/&lt;/code&gt; directory&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.hoodie/&lt;/code&gt; directory&lt;/td&gt;
&lt;td&gt;Table directory&lt;/td&gt;
&lt;td&gt;External database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-level pruning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (manifest list + manifests)&lt;/td&gt;
&lt;td&gt;No (flat file list)&lt;/td&gt;
&lt;td&gt;Partial (index-based)&lt;/td&gt;
&lt;td&gt;No (bucket-level)&lt;/td&gt;
&lt;td&gt;Via SQL queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planning overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (tree traversal)&lt;/td&gt;
&lt;td&gt;Moderate (checkpoint + replay)&lt;/td&gt;
&lt;td&gt;Moderate (timeline scan)&lt;/td&gt;
&lt;td&gt;Low (snapshot lookup)&lt;/td&gt;
&lt;td&gt;Lowest (single SQL query)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata growth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Controlled (manifest reuse)&lt;/td&gt;
&lt;td&gt;Requires checkpointing&lt;/td&gt;
&lt;td&gt;Requires archiving&lt;/td&gt;
&lt;td&gt;Requires compaction&lt;/td&gt;
&lt;td&gt;Database manages it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engine independence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (spec-defined)&lt;/td&gt;
&lt;td&gt;Moderate (Spark-oriented)&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Low (Flink-oriented)&lt;/td&gt;
&lt;td&gt;Low (DuckDB-oriented)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For teams building on multiple engines, Iceberg's metadata structure provides the best combination of planning efficiency and engine independence. &lt;a href="https://www.dremio.com/blog/apache-iceberg-delta-lake-apache-hudi-a-comparison/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt; uses Iceberg's metadata tree to achieve fast query planning even on tables with millions of files, and its &lt;a href="https://www.dremio.com/platform/reflections/" rel="noopener noreferrer"&gt;Columnar Cloud Cache&lt;/a&gt; caches frequently accessed metadata locally to further reduce planning latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Part 3&lt;/a&gt; covers how query engines use Iceberg's metadata specifically for performance optimization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What Are Table Formats and Why Were They Needed?</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Thu, 30 Apr 2026 15:17:52 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/what-are-table-formats-and-why-were-they-needed-4f9k</link>
      <guid>https://forem.com/alexmercedcoder/what-are-table-formats-and-why-were-they-needed-4f9k</guid>
      <description>&lt;p&gt;This is Part 1 of a 15-part &lt;a href="https://iceberglakehouse.com/posts/" rel="noopener noreferrer"&gt;Apache Iceberg Masterclass&lt;/a&gt;. This article covers the fundamental question: what problem do table formats solve, and why does the choice between them matter?&lt;/p&gt;

&lt;p&gt;A data lake without a table format is a collection of files. It has no concept of a transaction, no mechanism to prevent two writers from producing corrupted state, and no efficient way to determine which files belong to the current version of a table. Table formats exist because the gap between "a pile of Parquet files" and "a reliable analytical table" is enormous, and bridging it requires a formal metadata specification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-01/" rel="noopener noreferrer"&gt;What Are Table Formats and Why Were They Needed?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;The Metadata Structure of Current Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-03/" rel="noopener noreferrer"&gt;Performance and Apache Iceberg's Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-04/" rel="noopener noreferrer"&gt;Technical Deep Dive on Partition Evolution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-05/" rel="noopener noreferrer"&gt;Technical Deep Dive on Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-06/" rel="noopener noreferrer"&gt;Writing to an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-07/" rel="noopener noreferrer"&gt;What Are Lakehouse Catalogs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-08/" rel="noopener noreferrer"&gt;Embedded Catalogs: S3 Tables and MinIO AI Stor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-09/" rel="noopener noreferrer"&gt;How Iceberg Table Storage Degrades Over Time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-10/" rel="noopener noreferrer"&gt;Maintaining Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-11/" rel="noopener noreferrer"&gt;Apache Iceberg Metadata Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-12/" rel="noopener noreferrer"&gt;Using Iceberg with Python and MPP Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-13/" rel="noopener noreferrer"&gt;Streaming Data into Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-14/" rel="noopener noreferrer"&gt;Hands-On with Iceberg Using Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-15/" rel="noopener noreferrer"&gt;Migrating to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The World Before Table Formats
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o685rz15tbrp5y5lx82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o685rz15tbrp5y5lx82.png" alt="How table formats solved the chaos of raw data lakes with a structured metadata layer" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before table formats, data lakes relied on a simple convention: data was organized into directories in object storage (S3, ADLS, GCS), and the &lt;a href="https://cwiki.apache.org/confluence/display/hive/design#Design-HiveMetastore" rel="noopener noreferrer"&gt;Hive Metastore&lt;/a&gt; tracked which directories corresponded to which partitions.&lt;/p&gt;

&lt;p&gt;This approach had five critical problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No atomic commits.&lt;/strong&gt; If a Spark job wrote 500 new Parquet files and failed after writing 300, readers could see the 300 partial files. There was no mechanism to make all 500 files visible at once or none of them. Cleanup required manual intervention or custom garbage collection scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expensive query planning.&lt;/strong&gt; To determine which files to scan, the engine issued &lt;code&gt;LIST&lt;/code&gt; requests against object storage. S3 returns at most 1,000 objects per request, so a table with 100,000 files required 100 sequential HTTP calls before query execution could even start. At Netflix, query planning for large tables could take minutes just from directory listing.&lt;/p&gt;
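
&lt;p&gt;For a sense of the mechanics, this boto3 sketch counts how many sequential LIST requests a plain listing costs; each page is one HTTP round trip capped at 1,000 keys. The bucket and prefix names are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

calls = 0
keys = 0
# Each page is a single ListObjectsV2 request returning at most 1,000 keys.
for page in paginator.paginate(Bucket="my-data-lake", Prefix="warehouse/events/"):
    calls += 1
    keys += len(page.get("Contents", []))

print(f"{keys} objects took {calls} sequential LIST calls")
&lt;/code&gt;&lt;/pre&gt;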

&lt;p&gt;&lt;strong&gt;Schema changes required rewrites.&lt;/strong&gt; Adding a column to a Hive table meant either rewriting every file (expensive) or accepting that old files had a different schema than new files (confusing). Renaming a column was not supported without a full table rewrite because Hive mapped columns by position, not by identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No time travel.&lt;/strong&gt; Once data was overwritten, the previous version was gone. There was no snapshot history, no ability to roll back a bad write, and no way to reproduce a query result from last Tuesday.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exposed partitioning.&lt;/strong&gt; Users had to know the physical partition layout. A table partitioned by &lt;code&gt;year&lt;/code&gt; and &lt;code&gt;month&lt;/code&gt; required queries to explicitly filter on those columns using the exact partition column names (&lt;code&gt;WHERE year = 2024 AND month = 3&lt;/code&gt;). If partitioning changed, every downstream query broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Table Format Actually Is
&lt;/h2&gt;

&lt;p&gt;A table format is a specification that defines how to organize metadata about data files so that query engines can treat them as reliable, transactional tables. It sits between the query engine and the physical files.&lt;/p&gt;

&lt;p&gt;The core responsibilities of every table format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File tracking&lt;/strong&gt;: Maintain an explicit list of which data files belong to the current version of the table, eliminating directory listing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic commits&lt;/strong&gt;: Make all changes to a table visible to readers at once through a single metadata pointer swap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema management&lt;/strong&gt;: Track the table schema and its evolution history, allowing safe column adds, drops, renames, and reorders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition management&lt;/strong&gt;: Define how data is partitioned and enable query pruning without exposing the physical layout to users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot history&lt;/strong&gt;: Maintain a history of table states for time travel, rollback, and auditing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistics&lt;/strong&gt;: Store column-level min/max values and other statistics to enable file skipping during query planning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data files themselves are still standard &lt;a href="https://parquet.apache.org/" rel="noopener noreferrer"&gt;Parquet&lt;/a&gt; or ORC. The table format adds a metadata layer on top that gives those files the properties of a database table.&lt;/p&gt;
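
&lt;p&gt;To make the atomic-commit responsibility concrete, here is a toy Python model, nothing like a real format's spec, in which each snapshot is an immutable JSON file and a single pointer file names the current snapshot. Readers can never observe a partial commit because visibility changes through one atomic rename on a local filesystem:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import os
import tempfile
import time

def commit(table_dir, data_files, schema):
    # Write an immutable snapshot file listing every live data file.
    snap = {"schema": schema, "data_files": data_files}
    snap_path = os.path.join(table_dir, f"snap-{time.time_ns()}.json")
    with open(snap_path, "w") as f:
        json.dump(snap, f)
    # Stage the new pointer, then atomically rename it over the old one,
    # so readers see all of the new files at once, or none of them.
    fd, tmp = tempfile.mkstemp(dir=table_dir)
    with os.fdopen(fd, "w") as f:
        f.write(snap_path)
    os.replace(tmp, os.path.join(table_dir, "current"))

def current_files(table_dir):
    # Resolve the pointer, then read the snapshot it names.
    with open(os.path.join(table_dir, "current")) as f:
        snap_path = f.read()
    with open(snap_path) as f:
        return json.load(f)["data_files"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Real table formats achieve the same all-or-nothing visibility through mechanisms like catalog pointer swaps or atomic log appends rather than filesystem renames, but the reader-visible guarantee is the same.&lt;/p&gt;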

&lt;h2&gt;
  
  
  The Five Table Formats
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54mph70nshvopthgc1wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54mph70nshvopthgc1wg.png" alt="Timeline showing the evolution from Hive Metastore through Hudi, Iceberg, Delta Lake, Paimon, and DuckLake" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Five major table formats are in wide use today, each born from a different problem and optimized for a different workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Iceberg
&lt;/h3&gt;

&lt;p&gt;Iceberg started at Netflix in 2017, created by Ryan Blue and Daniel Weeks to solve Netflix's petabyte-scale query planning problems. It uses a three-layer metadata tree: a &lt;code&gt;metadata.json&lt;/code&gt; file points to a manifest list, which points to manifest files, which track individual data files with column-level statistics.&lt;/p&gt;
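
&lt;p&gt;You can walk this tree from Python. The sketch below assumes PyIceberg is installed, a catalog named &lt;code&gt;default&lt;/code&gt; is configured, and a table &lt;code&gt;analytics.events&lt;/code&gt; exists; the &lt;code&gt;inspect&lt;/code&gt; API is available in recent PyIceberg releases:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyiceberg.catalog import load_catalog

# Hypothetical names; catalog settings come from PyIceberg configuration.
catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

# Root of the tree: the current metadata.json file.
print(table.metadata_location)

# Each layer of the tree is exposed as a metadata table.
print(table.inspect.snapshots())  # snapshots, each pointing at a manifest list
print(table.inspect.manifests())  # manifests referenced by the manifest list
print(table.inspect.files())      # data files with column-level statistics
&lt;/code&gt;&lt;/pre&gt;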

&lt;p&gt;Iceberg's defining feature is its &lt;a href="https://iceberg.apache.org/spec/" rel="noopener noreferrer"&gt;formal specification&lt;/a&gt;. Any engine that follows the spec can read and write Iceberg tables correctly. This makes Iceberg the most engine-neutral format. Spark, Trino, Flink, &lt;a href="https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt;, Snowflake, BigQuery, Athena, StarRocks, and DuckDB all support it.&lt;/p&gt;

&lt;p&gt;Iceberg also introduced &lt;a href="https://www.dremio.com/blog/fewer-accidental-full-table-scans-brought-to-you-by-apache-icebergs-hidden-partitioning/" rel="noopener noreferrer"&gt;hidden partitioning&lt;/a&gt; and partition evolution, which are covered in depth in Parts 4 and 5 of this series.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delta Lake
&lt;/h3&gt;

&lt;p&gt;Delta Lake was created at Databricks in 2019. It stores metadata as a sequential transaction log (&lt;code&gt;_delta_log/&lt;/code&gt;) of JSON and Parquet checkpoint files. Each commit appends a new log entry describing which files were added or removed.&lt;/p&gt;
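
&lt;p&gt;Because the log is newline-delimited JSON, replaying it takes only a few lines of Python. This sketch reconstructs the live file set of a local Delta table (the table path is hypothetical, and real readers also apply Parquet checkpoints instead of replaying from zero):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import glob
import json

live_files = set()
# Replay log entries in commit order; each line holds one action.
for entry in sorted(glob.glob("my_table/_delta_log/*.json")):
    with open(entry) as f:
        for line in f:
            action = json.loads(line)
            if "add" in action:
                live_files.add(action["add"]["path"])
            elif "remove" in action:
                live_files.discard(action["remove"]["path"])

print(live_files)
&lt;/code&gt;&lt;/pre&gt;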

&lt;p&gt;Delta Lake's design prioritizes simplicity within the Spark ecosystem. Its strongest features are Liquid Clustering (adaptive data organization that replaces static partitioning) and UniForm (automatic generation of Iceberg-compatible metadata so other engines can read Delta tables as Iceberg).&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Hudi
&lt;/h3&gt;

&lt;p&gt;Hudi originated at Uber in 2016, designed specifically for Change Data Capture (CDC) pipelines that needed to upsert millions of records per hour. Hudi uses a timeline-based metadata architecture where each commit, compaction, and rollback is an "action instant."&lt;/p&gt;

&lt;p&gt;Hudi offers both Copy-on-Write (rewrite entire files on update) and Merge-on-Read (write deltas and merge at read time) table types, plus record-level indexing for fast point lookups. It excels when your primary workload involves frequent row-level updates and deletes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Paimon
&lt;/h3&gt;

&lt;p&gt;Paimon evolved from Flink Table Store at Alibaba and entered Apache incubation in 2023. It uses &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" rel="noopener noreferrer"&gt;LSM-tree&lt;/a&gt; based storage internally, making it the most streaming-native table format.&lt;/p&gt;

&lt;p&gt;Tables in Paimon are divided into partitions and then further into buckets, each containing an independent LSM tree. This structure enables high-throughput streaming writes with millisecond-level latency. Paimon supports multiple merge engines (deduplication, partial update, aggregation) that determine how records with the same primary key are resolved.&lt;/p&gt;

&lt;h3&gt;
  
  
  DuckLake
&lt;/h3&gt;

&lt;p&gt;DuckLake is the newest entry, released by DuckDB Labs and MotherDuck in 2025. It takes a fundamentally different approach: instead of storing metadata as files in object storage, DuckLake stores all metadata in a standard SQL database (PostgreSQL, MySQL, SQLite, or DuckDB itself).&lt;/p&gt;

&lt;p&gt;This means a single SQL query resolves all metadata (schema, file list, statistics) instead of the multiple HTTP requests required by file-based metadata formats. The tradeoff is a dependency on a running database for the metadata layer and currently limited engine support (primarily DuckDB).&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Each Format Excels
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiu05e2c5oholrsnf2qb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiu05e2c5oholrsnf2qb.png" alt="Positioning chart showing where Iceberg, Delta Lake, Hudi, Paimon, and DuckLake sit on batch vs streaming and single vs multi-engine axes" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Iceberg&lt;/th&gt;
&lt;th&gt;Delta Lake&lt;/th&gt;
&lt;th&gt;Hudi&lt;/th&gt;
&lt;th&gt;Paimon&lt;/th&gt;
&lt;th&gt;DuckLake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File-based tree&lt;/td&gt;
&lt;td&gt;File-based log&lt;/td&gt;
&lt;td&gt;File-based timeline&lt;/td&gt;
&lt;td&gt;File-based LSM&lt;/td&gt;
&lt;td&gt;SQL database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engine support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Broadest&lt;/td&gt;
&lt;td&gt;Good (via UniForm)&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;DuckDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema evolution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;By column ID&lt;/td&gt;
&lt;td&gt;By name&lt;/td&gt;
&lt;td&gt;By version&lt;/td&gt;
&lt;td&gt;By version&lt;/td&gt;
&lt;td&gt;SQL ALTER&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Partition evolution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (unique)&lt;/td&gt;
&lt;td&gt;Liquid Clustering&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Bucket evolution&lt;/td&gt;
&lt;td&gt;SQL-managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming writes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-engine analytics&lt;/td&gt;
&lt;td&gt;Spark/Databricks&lt;/td&gt;
&lt;td&gt;CDC/upserts&lt;/td&gt;
&lt;td&gt;Flink streaming&lt;/td&gt;
&lt;td&gt;Local SQL analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: each format reflects the priorities of the team that built it. Netflix needed multi-engine reads at petabyte scale (Iceberg). Uber needed high-frequency upserts (Hudi). Alibaba needed real-time streaming from Flink (Paimon). Databricks needed Spark-optimized simplicity (Delta). DuckDB Labs wanted SQL-native metadata management (DuckLake).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Iceberg Has Become the Default
&lt;/h2&gt;

&lt;p&gt;Iceberg has achieved the broadest adoption for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Specification-first design.&lt;/strong&gt; Iceberg's &lt;a href="https://iceberg.apache.org/spec/" rel="noopener noreferrer"&gt;spec&lt;/a&gt; is independent of any engine or vendor. Any team can build a conforming implementation. This created a network effect: more engine support attracted more users, which attracted more engine support.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No engine dependency.&lt;/strong&gt; Unlike Delta Lake's historical Spark dependency or Paimon's Flink focus, Iceberg was designed from day one to work across engines. A table written by Spark can be read by &lt;a href="https://www.dremio.com/blog/apache-iceberg-delta-lake-apache-hudi-a-comparison/" rel="noopener noreferrer"&gt;Dremio&lt;/a&gt;, Trino, Flink, or Snowflake without conversion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Industry convergence.&lt;/strong&gt; Snowflake, AWS (Athena, EMR), Google (BigQuery), and Databricks (via UniForm) have all adopted Iceberg as an interoperability standard. When the major cloud vendors align on a format, it becomes the safe choice for long-term investments.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That said, Iceberg is not universally superior. Hudi's record-level indexing makes it faster for point lookups on upsert-heavy tables. Paimon's LSM-tree architecture handles continuous streaming ingestion with lower latency than Iceberg's batch-oriented commit model. DuckLake's SQL-based metadata is simpler for single-engine, local-first analytics.&lt;/p&gt;

&lt;p&gt;The rest of this series focuses on Iceberg because its design decisions and capabilities represent the state of the art for multi-engine analytical lakehouses. &lt;a href="https://iceberglakehouse.com/posts/2026-04-29-iceberg-masterclass-02/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; examines the metadata structures of all five formats in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books to Go Deeper
&lt;/h3&gt;

&lt;p&gt;To learn more about Apache Iceberg and the lakehouse architecture, check out these resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting the Apache Iceberg Lakehouse&lt;/a&gt; by Alex Merced (Manning)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands-ebook/dp/B0GQL4QNRT/" rel="noopener noreferrer"&gt;Lakehouses with Apache Iceberg: Agentic Hands-on&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Constructing-Context-Semantics-Agents-Embeddings/dp/B0GSHRZNZ5/" rel="noopener noreferrer"&gt;Constructing Context: Semantics, Agents, and Embeddings&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Apache-Iceberg-Agentic-Connecting-Structured/dp/B0GW2WF4PX/" rel="noopener noreferrer"&gt;Apache Iceberg &amp;amp; Agentic AI: Connecting Structured Data&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Open-Source-Lakehouse-Architecting-Analytical/dp/B0GW595MVL/" rel="noopener noreferrer"&gt;Open Source Lakehouse: Architecting Analytical Systems&lt;/a&gt; by Alex Merced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Free Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AI Weekly: Google's TPU Split, Cursor's $60B, and MCP at Scale</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 29 Apr 2026 12:54:58 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/ai-weekly-googles-tpu-split-cursors-60b-and-mcp-at-scale-1c6e</link>
      <guid>https://forem.com/alexmercedcoder/ai-weekly-googles-tpu-split-cursors-60b-and-mcp-at-scale-1c6e</guid>
      <description>&lt;p&gt;&lt;strong&gt;Week of April 23–29, 2026&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This week, Google split its eighth-generation TPU into two specialized chips. SpaceX disclosed rights to acquire Cursor for $60 billion. Google Cloud Next 2026 framed enterprise software around autonomous agents, and the Model Context Protocol moved deeper into production-grade territory.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Coding Tools: SpaceX Eyes Cursor at $60B and Google Pushes Agent Platforms
&lt;/h2&gt;

&lt;p&gt;SpaceX announced on April 22 that it has rights to buy AI coding tool Cursor for $60 billion later this year, with an alternative $10 billion partnership option. The move positions Elon Musk's space and AI properties to compete with Anthropic and OpenAI ahead of a planned Wall Street debut. Cursor, made by San Francisco startup Anysphere, has wide distribution to expert software engineers, which is part of what makes it attractive to Musk's company. &lt;a href="https://www.usnews.com/news/best-states/california/articles/2026-04-22/spacex-says-it-can-buy-ai-coding-tool-cursor-for-60b-later-this-year" rel="noopener noreferrer"&gt;Read the AP report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Google Cloud Next 2026 ran April 22–24 in Las Vegas and made coding agents the centerpiece. Google rebranded its AI platform as the Gemini Enterprise Agent Platform, billed as a one-stop shop for autonomous agents with 200+ foundation models and enterprise governance. The platform supports a new Agents CLI that takes agents from creation to production through a single command-line tool. &lt;a href="https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/next-2026/" rel="noopener noreferrer"&gt;See the announcements&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cursor 2.0 also gained attention this month for supporting up to eight parallel AI agents working on different sections of a codebase at the same time. Claude Code, meanwhile, now powers GitHub Copilot's enterprise tier with multi-agent coordination that splits large tasks into parallel subtasks. The category leaders are converging on the same pattern: agents that read codebases, plan changes across multiple files, write the code, and run the tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Processing: Google Splits Its TPU Into Training and Inference Chips
&lt;/h2&gt;

&lt;p&gt;Google Cloud announced on April 22 that its eighth-generation TPU is splitting into two specialized chips. The TPU 8t targets model training and the TPU 8i targets inference. Google reports up to 3x faster AI model training and 80% better performance per dollar over the previous generation, with the ability to link more than 1 million TPUs in a single cluster. &lt;a href="https://techcrunch.com/2026/04/22/google-cloud-next-new-tpu-ai-chips-compete-with-nvidia/" rel="noopener noreferrer"&gt;Read the TechCrunch coverage&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Google also confirmed a partnership with Nvidia to extend Falcon, the software-based networking technology Google created and open-sourced in 2023 under the Open Compute Project. The work aims to make Nvidia-based systems perform better inside Google Cloud, a notable detente given Google's TPU sales push.&lt;/p&gt;

&lt;p&gt;The market for Nvidia's chip rivals is also booming. AI chip startups raised $8.3 billion globally in 2026, according to Dealroom, with Cerebras Systems pulling in $1 billion in February and $500 million rounds going to MatX, Ayar Labs, and Etched. European companies like Axelera and Olix raised rounds north of $200 million. The argument: GPUs were not purpose-designed for AI inference, and novel system architectures bring big savings in energy and cost. &lt;a href="https://www.cnbc.com/2026/04/17/nvidia-ai-chip-rivals-funding-euclyd-fractile.html" rel="noopener noreferrer"&gt;See the CNBC report&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standards &amp;amp; Protocols: MCP Hits Production Scale and Agentic Foundations Mature
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol crossed a clear adoption threshold this month. MCP downloads now run at roughly 110 million per month across OpenAI, Google, LangChain, and other frameworks, according to a recent Anthropic keynote on the protocol's direction. By Q2 2026, community-built MCP servers exist for GitHub, Slack, PostgreSQL, Stripe, Figma, Docker, Kubernetes, and over 200 other tools. &lt;a href="https://en.wikipedia.org/wiki/Model_Context_Protocol" rel="noopener noreferrer"&gt;See the Wikipedia summary&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The 2026 MCP roadmap published in March identified four priorities. First is streamable HTTP transport scalability. Second is the Tasks primitive lifecycle, including retry semantics and expiry policies. Third is governance maturation. Fourth is enterprise readiness covering audit trails, SSO-integrated auth, gateway behavior, and configuration portability. Stateful sessions conflict with load balancers, and horizontal scaling needs better support, so the working groups are evolving the existing transport rather than adding new ones. &lt;a href="https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/" rel="noopener noreferrer"&gt;Read the roadmap&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Google Cloud Next 2026 also gave standards work a public showcase. A breakout session covered "Generative UI for any agent, anywhere: A2UI, AG-UI, MCP Apps, and more." Interoperability between agent UI standards is now part of mainstream cloud roadmaps.&lt;/p&gt;

&lt;p&gt;The Agentic AI Foundation launched in December 2025 under the Linux Foundation. Founding contributions came from Anthropic's MCP, OpenAI's AGENTS.md, and Block's Goose. AAIF held its first MCP Dev Summit North America in New York earlier this month, drawing about 1,200 attendees, double the prior event. The next AAIF events are AGNTCon + MCPCon Europe on September 17–18 in Amsterdam and AGNTCon + MCPCon North America on October 22–23 in San Jose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources to Go Further
&lt;/h2&gt;

&lt;p&gt;The AI landscape changes fast. Here are tools and resources to help you keep pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Dremio Free&lt;/strong&gt; — Experience agentic analytics and an Apache Iceberg-powered lakehouse. &lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-29-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Start your free trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Agentic AI with Data&lt;/strong&gt; — Dremio's agentic analytics features let your AI agents query and act on live data. &lt;a href="https://www.dremio.com/use-cases/agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-29-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Explore Dremio Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the Community&lt;/strong&gt; — Connect with data engineers and AI practitioners building on open standards. &lt;a href="https://developer.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-29-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Join the Dremio Developer Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: The 2026 Guide to AI-Assisted Development&lt;/strong&gt; — Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. &lt;a href="https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: Using AI Agents for Data Engineering and Data Analysis&lt;/strong&gt; — A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. &lt;a href="https://www.amazon.com/Using-Agents-Data-Engineering-Analysis-ebook/dp/B0GR6PYJT9/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>google</category>
      <category>news</category>
      <category>startup</category>
    </item>
    <item>
      <title>Apache Data Lakehouse Weekly: April 23–29, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 29 Apr 2026 12:52:51 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-23-29-2026-36b7</link>
      <guid>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-23-29-2026-36b7</guid>
      <description>&lt;p&gt;Three weeks past the Iceberg Summit, the lakehouse projects shifted from in-person alignment back into shipping mode. Polaris cut its 1.4.0 release and immediately followed up with a Python CLI 1.4.0, Arrow shipped its 24.0.0 major release and kicked off an arrow-rs 58.2.0 vote, and Parquet's design lists stayed dense with proposals on footers, page encoding, and a new java release discussion. Iceberg's dev list was quieter this week as contributors digested summit follow-ups and continued narrowing on V4 design questions in the background.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Iceberg
&lt;/h2&gt;

&lt;p&gt;The post-summit wave of formal proposals continued translating into design work this week. The V4 metadata.json optionality direction that has anchored multiple syncs — treating catalog-managed metadata as a first-class supported mode while keeping static-table portability through explicit opt-in semantics — is still the defining V4 design conversation, with Anton Okolnychyi, Yufei Gu, Shawn Chang, and Steven Wu continuing to push edge cases on portability and Spark driver behavior. The single-file commits proposal that Russell Spitzer and Amogh Jahagirdar have been advancing remains on track for a formal write-up, with the latency and metadata footprint reductions driving urgency.&lt;/p&gt;

&lt;p&gt;Péter Váry's &lt;a href="https://www.mail-archive.com/dev@iceberg.apache.org/msg12972.html" rel="noopener noreferrer"&gt;efficient column updates proposal&lt;/a&gt; for wide tables continued attracting collaboration. The design — write only the columns that change on each commit, then stitch the result at read time — is squarely aimed at petabyte-scale feature stores with thousands of embedding and model-score columns, and the I/O savings make it one of the more practically grounded V4 proposals on the list. Anurag Mantripragada and Gábor Kaszab are working alongside Péter on POC benchmarks to support the formal proposal that should land on the dev list in the coming weeks.&lt;/p&gt;

&lt;p&gt;On the Rust side, the &lt;a href="https://www.mail-archive.com/dev@iceberg.apache.org/msg12986.html" rel="noopener noreferrer"&gt;Iceberg Rust 0.9.0 release&lt;/a&gt; shipped earlier this development cycle and continues to anchor downstream adoption discussions, with its DataFusion integration making it a serious option for teams that want Iceberg without a JVM dependency. Iceberg Summit 2026 session recordings are also rolling out on the project's YouTube channel this week, giving the global community access to the V4 design talks, the vendor panel, and the production case studies from Apple, Bloomberg, Pinterest, and others. The AI contribution policy that Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun pushed through March is still expected to land as published guidance covering disclosure requirements and code provenance standards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Polaris
&lt;/h2&gt;

&lt;p&gt;Polaris had its biggest release week of the year. Adnan Hemani &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04499.html" rel="noopener noreferrer"&gt;announced Apache Polaris 1.4.0&lt;/a&gt; on April 23, the project's first major release as a graduated top-level project. Dmitri Bourlatchkov, Yufei Gu, Xi Wen, and Alexandre Dutra all weighed in with congratulations and follow-up notes on packaging and distribution. Right behind it, Adnan kicked off and shepherded the &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04509.html" rel="noopener noreferrer"&gt;Apache Polaris Python CLI 1.4.0 RC2 vote&lt;/a&gt;, which collected binding +1s from Yufei Gu, Honah J., and Jean-Baptiste Onofré, with Yong Zheng adding non-binding support. The &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04551.html" rel="noopener noreferrer"&gt;Python CLI 1.4.0 release&lt;/a&gt; shipped on April 28, completing the back-to-back release pair. Jean-Baptiste also confirmed in a &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04476.html" rel="noopener noreferrer"&gt;HEADS UP note&lt;/a&gt; that the project is now back on a monthly release cadence after the graduation transition.&lt;/p&gt;

&lt;p&gt;The release had its share of post-launch fires. Alexandre Dutra opened threads on &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04512.html" rel="noopener noreferrer"&gt;Helm chart repo inconsistency after the 1.4.0 release&lt;/a&gt;, &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04513.html" rel="noopener noreferrer"&gt;a release workflow failure in step 4&lt;/a&gt;, and an &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04514.html" rel="noopener noreferrer"&gt;Artifact Hub request for official status&lt;/a&gt;. A &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04544.html" rel="noopener noreferrer"&gt;GitHub thread on KMS-related errors after bumping to 1.4.0&lt;/a&gt; surfaced a real upgrade bug that drew immediate attention. Yufei Gu took the lead on triaging most of these, and the discussions are doing exactly what a healthy post-release cycle should — surfacing rough edges before they reach more users.&lt;/p&gt;

&lt;p&gt;Design discussions stayed active alongside the release work. EJ Wang's &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04485.html" rel="noopener noreferrer"&gt;DISCUSS thread on AGENTS.md for Polaris&lt;/a&gt; opened a conversation about adding agent-readable repository metadata, picking up engagement from Yufei Gu. Yufei separately started &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04486.html" rel="noopener noreferrer"&gt;a discussion on narrowing the scope of SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION&lt;/a&gt;, which Dmitri Bourlatchkov and Dennis Huo dug into. ITing Lee's &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04430.html" rel="noopener noreferrer"&gt;proposal to add OpenLineage to Polaris&lt;/a&gt; continued attracting feedback from Adnan Hemani, Jean-Baptiste Onofré, Yufei Gu, and Michael Collado. Alexandre Dutra's URL path decoding thread and his &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04429.html" rel="noopener noreferrer"&gt;PolarisPrivilege fields and grant validation&lt;/a&gt; discussion both kept multiple contributors engaged through the week, and Selvamohan Neethiraj raised a &lt;a href="https://mail-archive.com/dev@polaris.apache.org/msg04496.html" rel="noopener noreferrer"&gt;PolarisPrincipal user attributes server-side bug&lt;/a&gt; that Alexandre and Yufei traced through.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Arrow
&lt;/h2&gt;

&lt;p&gt;Arrow had its own back-to-back release week. Raúl Cumplido &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34611.html" rel="noopener noreferrer"&gt;announced Apache Arrow 24.0.0&lt;/a&gt; on April 22, closing out the &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34606.html" rel="noopener noreferrer"&gt;24.0.0 RC0 vote&lt;/a&gt; that spanned mid-April. Matt Topol followed with the &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34613.html" rel="noopener noreferrer"&gt;Apache Arrow Go 18.6.0 RC0 vote&lt;/a&gt; on April 22 and announced the &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34629.html" rel="noopener noreferrer"&gt;release result&lt;/a&gt; on April 28, with Pedro Matias, Ian Cook, David Li, and Bryce Mecum carrying the verification work. Andrew Lamb then opened the &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34631.html" rel="noopener noreferrer"&gt;arrow-rs 58.2.0 RC1 vote&lt;/a&gt; on April 28, with Bryce Mecum, Ed Seidl, Jeffrey Vo, and Raúl Cumplido moving quickly through verification — finishing what last week's newsletter flagged as the next ship to watch.&lt;/p&gt;

&lt;p&gt;Beyond releases, the design conversations stayed lively. Emil Sadek opened a &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34619.html" rel="noopener noreferrer"&gt;DISCUSS thread on an ADBC Logo Proposal&lt;/a&gt; with Nic Crane, Julian Hyde, and Rusty Conover weighing in on visual identity for the database connectivity standard. Benjamin Philip kicked off a new &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34628.html" rel="noopener noreferrer"&gt;DISCUSS thread on Arrow Erlang's grant documents&lt;/a&gt;, continuing the project's expansion into more language ecosystems. The &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34576.html" rel="noopener noreferrer"&gt;pyarrow-stubs donation vote&lt;/a&gt; that Rok Mihevc opened on April 14 stayed active, drawing additional support this week with Rok pushing for a final tally. Mandukhai Alimaa's earlier &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34604.html" rel="noopener noreferrer"&gt;proposal for a canonical BigDecimal extension type&lt;/a&gt; and Andrew Lamb's &lt;a href="https://www.mail-archive.com/dev@arrow.apache.org/msg34610.html" rel="noopener noreferrer"&gt;arrow-rs security policy discussion&lt;/a&gt; both continued generating engagement as the project tightens its production posture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Parquet
&lt;/h2&gt;

&lt;p&gt;Parquet's lists were as dense as any project's this week. Ismaël Mejía opened a &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27247.html" rel="noopener noreferrer"&gt;thread soliciting code reviews for Java performance optimization work&lt;/a&gt;, with Steve Loughran picking it up immediately. Manu Zhang's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27212.html" rel="noopener noreferrer"&gt;DISCUSS thread on a new parquet-java release&lt;/a&gt; drew sustained engagement from Steve Loughran, Aaron Niskode-Dossett, Fokko Driesprong, Julien Le Dem, Gang Wu, and Rahil C — covering both the timing question and what should ship in the next release. Julien Le Dem's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27227.html" rel="noopener noreferrer"&gt;Parquet sync on April 22&lt;/a&gt; drew Manu Zhang and Micah Kornfield into the agenda discussion.&lt;/p&gt;

&lt;p&gt;The format-level proposals continued to evolve. Will Edwards's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27142.html" rel="noopener noreferrer"&gt;DISCUSS thread on an alternative to the FlatBuffer footer with a lightweight byte-offset index&lt;/a&gt; kept pulling in design feedback from Andrew Lamb, Ed Seidl, Jan Finis, Alkis Evlogimenos, Raphael Taylor-Davies, Andrew Bell, and others. Ed Seidl's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27197.html" rel="noopener noreferrer"&gt;proposal to make path_in_schema optional&lt;/a&gt; attracted commentary from Gang Wu, Steve Loughran, and Micah Kornfield. Andrew Lamb's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27192.html" rel="noopener noreferrer"&gt;thread on where VariantJsonParser should live&lt;/a&gt; — touching the boundary between Parquet and Iceberg's variant tooling — continued with Steve Loughran and Gang Wu. Jan Finis's question on &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27214.html" rel="noopener noreferrer"&gt;whether a too-long RLE bitpack at the end of a page is valid&lt;/a&gt; drew careful answers from Raphael Taylor-Davies and Micah Kornfield, the kind of spec-edge clarification that matters for cross-implementation interop. Milan Stefanovic's &lt;a href="https://mail-archive.com/dev@parquet.apache.org/msg27136.html" rel="noopener noreferrer"&gt;Geospatial CRS string format clarification&lt;/a&gt; continued threading toward closure with Dewey Dunnington and Micah Kornfield.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Project Themes
&lt;/h2&gt;

&lt;p&gt;This week's clearest pattern is post-graduation Polaris finding its operational rhythm. The 1.4.0 release plus the Python CLI 1.4.0, the return to a monthly cadence, and the visible upgrade-path bugs and Helm packaging issues are all the work of a project growing into its TLP independence. The fact that contributors are surfacing problems publicly and triaging them on the dev list — rather than routing through a parent project — is itself the marker of a healthy graduation.&lt;/p&gt;

&lt;p&gt;The release wave across projects also reflects how synchronized the lakehouse stack has become. Arrow 24.0.0 plus arrow-rs 58.2.0 plus arrow-go 18.6.0 plus Polaris 1.4.0 plus Polaris Python CLI 1.4.0 all landing within a single week is a coordination story. Engines and tools downstream of these libraries — Spark, Trino, Dremio, DataFusion, DuckDB, Snowflake — can pick up the new versions in a coherent batch rather than chasing staggered upgrades across half a dozen vendors. The format-level design work in Parquet (footers, optional path_in_schema, variant tooling location) and the V4 design work in Iceberg (metadata.json optionality, single-file commits, efficient column updates) are also starting to rhyme: both communities are picking apart assumptions baked into v1 and v2 spec design and asking what a leaner, AI-workload-aware format looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;Watch the arrow-rs 58.2.0 RC vote close out in the coming days. Polaris should publish 1.4.1 or move toward 1.5.0 planning given the monthly cadence commitment, and the AGENTS.md discussion is likely to firm into a concrete proposal. The Polaris OpenLineage RFC has the volume of feedback it needs to move toward implementation. On the Iceberg side, the formal V4 single-file commits write-up and the published AI contribution policy remain the next concrete deliverables to track. Iceberg Summit 2026 talk recordings will continue rolling out on YouTube, and the parquet-java release discussion should converge on a target version.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Started with Dremio&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-04-29&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Free&lt;/a&gt; — Build your lakehouse on Iceberg with a free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/use-cases/lake-to-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-04-29&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Build a Lakehouse with Iceberg, Parquet, Polaris &amp;amp; Arrow&lt;/a&gt; — Learn how Dremio brings the open lakehouse stack together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free Downloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html" rel="noopener noreferrer"&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-polaris-guide-reg.html" rel="noopener noreferrer"&gt;Apache Polaris: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Books by Alex Merced&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/ref=sr_1_5?crid=1304S78BQAP6U&amp;amp;dib=eyJ2IjoiMSJ9.7Z17wXFJVWtv1gDIVF5-z5NwgT7B-vj9kEQuLkAKtLh00KncwXYc4bQ6hyydwcMHXbJOlFCSO7-2JmKTC5KCV-q2XEdeq7kBBmicVzI6tlDtqPqAgE6RHJE_XZ_n-zxxAjRHE2THP0J4DEgzDmiXrF9bdkEFyaruSUW28Ryx0zYyI_NuD5vZ4HYqQv3u5hzBVjjOlxyRYSTIsRSeVIoJC2XvjrXdNFvQ9jm4Kr1xFOw.yog4MgCdYecbJT0bAcGXNJJvZbvD4F_TP0lDbPA1xGI&amp;amp;dib_tag=se&amp;amp;keywords=alex+merced&amp;amp;qid=1773236747&amp;amp;sprefix=alex+mer%2Caps%2C570&amp;amp;sr=8-5" rel="noopener noreferrer"&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Enabling-Agentic-Analytics-Apache-Iceberg-ebook/dp/B0GQXT6W3N/" rel="noopener noreferrer"&gt;Enabling Agentic Analytics with Apache Iceberg and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/" rel="noopener noreferrer"&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Book-Using-Apache-Iceberg-Python/dp/B0GNZ454FF/" rel="noopener noreferrer"&gt;The Book on Using Apache Iceberg with Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The Journey from Scattered Data to an Apache Iceberg Lakehouse with Governed Agentic Analytics</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Sat, 25 Apr 2026 20:53:35 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/the-journey-from-scattered-data-to-an-apache-iceberg-lakehouse-with-governed-agentic-analytics-1o3o</link>
      <guid>https://forem.com/alexmercedcoder/the-journey-from-scattered-data-to-an-apache-iceberg-lakehouse-with-governed-agentic-analytics-1o3o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9w3dhe9wzzqfysof72z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9w3dhe9wzzqfysof72z.png" alt="Journey from scattered data to governed agentic analytics through federation, semantic layer, and Iceberg lakehouse" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The conventional wisdom for data platform modernization goes like this: pick a target system, build ETL pipelines for every source, migrate everything, validate the data, retrain your users, and then start getting value. That process takes six to eighteen months. During that time, analysts are waiting and leadership is asking why the investment has not produced results yet.&lt;/p&gt;

&lt;p&gt;There is a better sequence. Instead of making everyone wait for a full migration, you start producing value on day one and migrate to &lt;a href="https://iceberg.apache.org/" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; at your own pace. The key is treating federation, the semantic layer, AI access, and Iceberg migration as four independent phases, each delivering value on its own, rather than a single all-or-nothing project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehben2lwpuek3ftz1vav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehben2lwpuek3ftz1vav.png" alt="Four-phase journey from connecting sources to Iceberg lakehouse showing value at every phase" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Connect Your Data Where It Lives
&lt;/h2&gt;

&lt;p&gt;Sign up for &lt;a href="https://www.dremio.com/get-started" rel="noopener noreferrer"&gt;Dremio Cloud&lt;/a&gt; and you get a lakehouse project with a pre-configured Open Catalog right away. From there, start connecting your existing data sources through Dremio's federated query engine: PostgreSQL, MySQL, MongoDB, S3, Snowflake, BigQuery, Redshift, AWS Glue, Unity Catalog, and more.&lt;/p&gt;

&lt;p&gt;No data copying. No ETL pipelines. Dremio queries your data where it already lives, using predicate pushdown to delegate filtering work to each source system.&lt;/p&gt;

&lt;p&gt;The result: by the end of day one, your team has unified SQL access across every connected source. An analyst can join a PostgreSQL customer table with an S3-based event stream in a single query, without waiting for a data engineer to build a pipeline first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Build a Semantic Layer Over Everything
&lt;/h2&gt;

&lt;p&gt;Raw source tables have cryptic column names, inconsistent types, and zero business context. Before anyone, human or AI, can get reliable answers, you need a curated layer on top.&lt;/p&gt;

&lt;p&gt;Dremio's &lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vl7mjbliccc61w8okl7q.png" rel="noopener noreferrer"&gt;AI Semantic Layer&lt;/a&gt; uses SQL views organized in three tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bronze/Raw views&lt;/strong&gt; map to raw sources. They standardize column names, cast data types, and apply basic filters. One Bronze view per source table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver/Business views&lt;/strong&gt; apply business logic. This is where you define what "active customer" means (purchased in the last 90 days, not on a trial), join data across sources, and compute metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gold/Application views&lt;/strong&gt; serve specific consumers: a dashboard, a report, or an AI agent. Each Gold view is optimized for its use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dremio's AI Agent can help draft the SQL for these views, so you are not writing every tier by hand. A sketch of the pattern follows.&lt;/p&gt;
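
&lt;p&gt;All names here are illustrative, and the view SQL is a minimal sketch rather than a recipe. The shape of the three tiers is the point:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Bronze: standardize names and types for one raw source table.
CREATE OR REPLACE VIEW bronze.customers AS
SELECT cust_id AS customer_id,
       email,
       CAST(signup_dt AS DATE) AS signup_date,
       UPPER(region_cd) AS region,
       COALESCE(trial_flag, FALSE) AS is_trial
FROM postgres_prod.public.customers;

-- Silver: business logic. "Active customer" means purchased in the
-- last 90 days and not on a trial.
CREATE OR REPLACE VIEW silver.active_customers AS
SELECT c.customer_id,
       c.region,
       MAX(o.order_ts) AS last_order_ts
FROM bronze.customers AS c
JOIN bronze.orders AS o
  ON o.customer_id = c.customer_id
WHERE o.order_ts &gt;= CURRENT_DATE - INTERVAL '90' DAY
  AND c.is_trial = FALSE
GROUP BY c.customer_id, c.region;

-- Gold: shaped for one consumer, here a dashboard.
CREATE OR REPLACE VIEW gold.active_customers_by_region AS
SELECT region, COUNT(*) AS active_customers
FROM silver.active_customers
GROUP BY region;
&lt;/code&gt;&lt;/pre&gt;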

&lt;h3&gt;
  
  
  Govern Access and Document Everything
&lt;/h3&gt;

&lt;p&gt;Grant users access to specific views using Role-Based Access Control (RBAC) at the folder, dataset, and column level. For sensitive data, add Fine-Grained Access Control (FGAC) via UDFs for row-level security and column-level masking.&lt;/p&gt;
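
&lt;p&gt;Here is a sketch of the FGAC pattern. The exact DDL can vary by Dremio version, and the function, role, and column names are made up for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Row-level security: users outside the example role only see US rows.
CREATE FUNCTION region_filter(region VARCHAR)
RETURNS BOOLEAN
RETURN SELECT is_member('global_analysts') OR region = 'US';

ALTER VIEW bronze.customers
  ADD ROW ACCESS POLICY region_filter(region);

-- Column masking: hide emails from everyone outside a PII role.
CREATE FUNCTION mask_email(email VARCHAR)
RETURNS VARCHAR
RETURN SELECT CASE WHEN is_member('pii_readers') THEN email ELSE '***' END;

ALTER VIEW bronze.customers
  MODIFY COLUMN email SET MASKING POLICY mask_email(email);
&lt;/code&gt;&lt;/pre&gt;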

&lt;p&gt;Then enrich every dataset with &lt;strong&gt;Wikis&lt;/strong&gt; (human-readable documentation explaining what each column means) and &lt;strong&gt;Tags&lt;/strong&gt; (categorical labels for discoverability). Dremio can auto-generate Wiki descriptions and suggest Tags by sampling your table data and schema. You review and refine the output instead of writing everything from scratch.&lt;/p&gt;

&lt;p&gt;This metadata is not just for humans. It is what the AI Agent reads when generating SQL. Better documentation means more accurate answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3: Turn On Agentic Analytics
&lt;/h2&gt;

&lt;p&gt;With a governed semantic layer in place, you are ready for AI. This is the important part: &lt;strong&gt;you do not need to complete the Iceberg migration first.&lt;/strong&gt; Agentic analytics works on federated data from the moment the semantic layer exists.&lt;/p&gt;

&lt;p&gt;Dremio's built-in &lt;a href="https://www.dremio.com/ai-agent/" rel="noopener noreferrer"&gt;AI Agent&lt;/a&gt; lets users type plain-English questions in the console. The agent writes SQL, executes it against your governed views, returns results, generates charts, and suggests follow-up questions. It respects every RBAC and FGAC policy in your catalog. Users can only get answers about data they are authorized to see.&lt;/p&gt;

&lt;p&gt;For teams that want to use external tools, Dremio's MCP (Model Context Protocol) server lets ChatGPT, Claude Desktop, or custom agents connect directly to your Dremio environment. External tools get the same semantic context and security controls as the built-in agent.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Interface&lt;/th&gt;
&lt;th&gt;What It Provides&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Built-in AI Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natural language queries, SQL generation, charts, follow-up suggestions inside Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP Server&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connect any MCP-compatible AI tool (ChatGPT, Claude, custom agents) with full governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI SQL Functions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;AI_GENERATE&lt;/code&gt;, &lt;code&gt;AI_CLASSIFY&lt;/code&gt;, &lt;code&gt;AI_COMPLETE&lt;/code&gt; directly in SQL for unstructured data analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
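
&lt;p&gt;The AI SQL function names come from Dremio; the call shapes below are an assumption for illustration, so check the current docs for exact signatures before relying on them:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Hypothetical columns, prompts, and argument shapes.
SELECT ticket_id,
       AI_CLASSIFY(body, ARRAY['billing', 'bug', 'feature request']) AS category,
       AI_GENERATE('Summarize this ticket in one sentence: ' || body) AS summary
FROM gold.support_tickets;
&lt;/code&gt;&lt;/pre&gt;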

&lt;p&gt;At this point your organization has unified data access, a governed semantic layer, and AI-powered analytics, and you have not migrated a single table to Iceberg yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 4: Migrate to Iceberg, One Dataset at a Time
&lt;/h2&gt;

&lt;p&gt;Federation gets you access, but a full &lt;a href="https://www.dremio.com/platform/apache-iceberg/" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; lakehouse gets you more: Autonomous Reflections that optimize query performance based on actual usage patterns, end-to-end caching, automated table maintenance (compaction, clustering, vacuuming), and interoperability with every Iceberg-compatible engine (Spark, Flink, Trino). Your data stays in your storage, in an open format, with no vendor lock-in.&lt;/p&gt;

&lt;p&gt;The migration pattern is deliberately incremental:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick one dataset&lt;/strong&gt; to migrate (start with the highest-volume or most-queried table)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build an Iceberg pipeline&lt;/strong&gt; to land that data in your object storage (S3 or Azure)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update the Bronze view&lt;/strong&gt; to point to the new Iceberg table instead of the legacy federated source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver and Gold views stay unchanged.&lt;/strong&gt; They reference the Bronze view, which now reads from Iceberg instead of the old source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every consumer is unaffected.&lt;/strong&gt; Dashboards, reports, and AI agents continue to work exactly as before.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Repeat for the next dataset whenever you are ready. There is no deadline and no big-bang cutover.&lt;/p&gt;
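
&lt;p&gt;Steps 2 and 3 in SQL, as a hedged sketch with made-up catalog and column names (and CTAS standing in for whatever ingest pipeline you actually build):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Step 2 (one option): land the data as an Iceberg table.
CREATE TABLE open_catalog.bronze_ice.customers AS
SELECT * FROM postgres_prod.public.customers;

-- Step 3: repoint the Bronze view at the Iceberg copy. The view's
-- contract (column names, types) is unchanged, so Silver, Gold,
-- dashboards, and the AI Agent never notice.
CREATE OR REPLACE VIEW bronze.customers AS
SELECT cust_id AS customer_id,
       email,
       CAST(signup_dt AS DATE) AS signup_date,
       UPPER(region_cd) AS region,
       COALESCE(trial_flag, FALSE) AS is_trial
FROM open_catalog.bronze_ice.customers;
&lt;/code&gt;&lt;/pre&gt;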

&lt;h2&gt;
  
  
  Why the View Layer Makes Migration Invisible
&lt;/h2&gt;

&lt;p&gt;This is the architectural insight that makes the whole journey work. The semantic layer acts as a contract between physical data storage and every consumer above it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrv7pvm5lsfxz2ytn6e7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrv7pvm5lsfxz2ytn6e7.png" alt="View layer swap showing Bronze view pointing to PostgreSQL before migration and Apache Iceberg after, with Silver, Gold, and AI Agent layers unchanged" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you swap a Bronze view's underlying source from PostgreSQL to an Iceberg table, every Silver view, Gold view, dashboard, report, and AI agent that depends on it continues to work without changes. The view contract (column names, data types, business logic) is preserved. Only the physical source pointer changes.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No dashboard rewiring&lt;/li&gt;
&lt;li&gt;No report migration&lt;/li&gt;
&lt;li&gt;No API endpoint changes&lt;/li&gt;
&lt;li&gt;No AI Agent reconfiguration&lt;/li&gt;
&lt;li&gt;No user communication (beyond governance notifications if your policies require them)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The migration happens underneath the abstraction layer. Everyone above it is oblivious.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tradeoffs
&lt;/h2&gt;

&lt;p&gt;This phased approach is not free of costs.&lt;/p&gt;

&lt;p&gt;Federation introduces network latency. Queries that join a PostgreSQL table in one region with an S3 bucket in another will be slower than queries against co-located Iceberg tables. Reflections and caching mitigate this for repeated queries, but the first execution of a new query pattern will feel it.&lt;/p&gt;

&lt;p&gt;Iceberg migration still requires building ingest pipelines. Dremio does not eliminate that work. What it does is decouple the pipeline work from the analytics timeline. Your analysts and AI agents are productive while engineers build migration pipelines in the background.&lt;/p&gt;

&lt;p&gt;Autonomous Reflections need a 7-day observation window before they start optimizing. Day-one performance on brand-new Iceberg tables relies on baseline optimizations (C3 caching, predicate pushdown, vectorized execution). The system gets faster as it learns your query patterns.&lt;/p&gt;

&lt;p&gt;And Dremio is an analytical engine, not a transactional database. Your OLTP workloads stay in PostgreSQL, MongoDB, or whatever system runs your application; Dremio queries those systems through federation rather than replacing them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Today, Migrate Over Time
&lt;/h2&gt;

&lt;p&gt;The traditional approach forces you to choose: spend months migrating, or keep running fragmented analytics on scattered data. Dremio eliminates that choice. Connect your sources, build your semantic layer, enable AI access, and start migrating to Iceberg when you are ready. Each phase delivers value independently, and the view layer ensures that migration never disrupts the people who are already getting answers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dremio.com/get-started" rel="noopener noreferrer"&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start the journey from wherever your data lives today.&lt;/p&gt;

&lt;h3&gt;
  
  
  Free Resources to Go Deeper
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpageiceberg" rel="noopener noreferrer"&gt;FREE - Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://drmevn.fyi/linkpagepolaris" rel="noopener noreferrer"&gt;FREE - Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-ai-for-dummies-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Agentic AI for Dummies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hello.dremio.com/wp-resources-agentic-analytics-guide-reg.html?utm_source=link_page&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=qr-link-list-04-07-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;FREE - Leverage Federation, The Semantic Layer and the Lakehouse for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forms.gle/xdsun6JiRvFY9rB36" rel="noopener noreferrer"&gt;FREE with Survey - Understanding and Getting Hands-on with Apache Iceberg in 100 Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>analytics</category>
      <category>architecture</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Apache Data Lakehouse Weekly: April 16–22, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:19:22 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-16-22-2026-2519</link>
      <guid>https://forem.com/alexmercedcoder/apache-data-lakehouse-weekly-april-16-22-2026-2519</guid>
      <description>&lt;p&gt;Two weeks past the Iceberg Summit, the San Francisco in-person alignments are now translating into formal proposals and code on the dev lists. Iceberg's V4 design work continued consolidating, Polaris kept moving toward its 1.4.0 milestone, Parquet's Geospatial spec picked up a cleanup commit from a new contributor, and Arrow's release engineering and Java modernization discussions stayed active.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Iceberg
&lt;/h2&gt;

&lt;p&gt;The post-summit V4 design work continued as the defining thread on the Iceberg dev list this week. The &lt;a href="http://www.mail-archive.com/dev@iceberg.apache.org/msg12699.html" rel="noopener noreferrer"&gt;V4 metadata.json optionality discussion&lt;/a&gt; that Anton Okolnychyi, Yufei Gu, Shawn Chang, and Steven Wu drove through March kept narrowing in on practical design questions. The concrete direction emerging from the summit is to treat catalog-managed metadata as a first-class supported mode while preserving static-table portability through explicit opt-in semantics, rather than the current implicit assumption that the root JSON file is always present.&lt;/p&gt;

&lt;p&gt;Russell Spitzer and Amogh Jahagirdar's &lt;a href="http://www.mail-archive.com/dev@iceberg.apache.org/msg12574.html" rel="noopener noreferrer"&gt;one-file commits design&lt;/a&gt; moved toward a formal spec write-up this week. The approach replaces manifest lists with root manifests and introduces manifest delete vectors, enabling single-file commits that cut metadata write overhead dramatically for high-frequency writers. The in-person sessions at the summit cleared the last design disagreements about inline versus external manifest delete vectors, and the community is now aligning on the implementation plan.&lt;/p&gt;

&lt;p&gt;Péter Váry's &lt;a href="http://www.mail-archive.com/dev@iceberg.apache.org/msg12958.html" rel="noopener noreferrer"&gt;efficient column updates proposal&lt;/a&gt; for AI and ML workloads drew steady engagement. The design lets Iceberg write only the columns that change on each write to a wide feature table, then stitch the full row back together at read time. For teams managing petabyte-scale feature stores with embedding vectors and model scores, the I/O savings are meaningful. Anurag Mantripragada and Gábor Herman are working alongside Péter on POC benchmarks to support the formal proposal.&lt;/p&gt;

&lt;p&gt;The AI contribution policy that Holden Karau, Kevin Liu, Steve Loughran, and Sung Yun pushed through March is moving toward published guidance. The summit provided the in-person alignment that async debate rarely produces, and a working policy covering disclosure requirements and code provenance standards for AI-generated contributions is expected on the dev list in the next couple of weeks. Polaris is navigating the same question in parallel, and the two communities are likely to converge on a shared approach given their overlapping contributor base.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Polaris
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://polaris.apache.org/downloads/" rel="noopener noreferrer"&gt;Polaris 1.4.0 release&lt;/a&gt; is in active scope finalization as the project's first release since graduating to top-level status on February 18. Credential vending for Azure and Google Cloud Storage is the headline feature, alongside catalog federation that lets one Polaris instance front multiple catalog backends across clouds. The &lt;a href="https://polaris.apache.org/community/release-guides/semi-automated-release-guide/" rel="noopener noreferrer"&gt;schedule-driven release model&lt;/a&gt; calls for a release intent email to the dev list about a week before the RC cut, so watch the list for that thread shortly.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://www.mail-archive.com/dev@ranger.apache.org/msg39491.html" rel="noopener noreferrer"&gt;Apache Ranger authorization RFC from Selvamohan Neethiraj&lt;/a&gt; remained the most active governance discussion. The plugin lets organizations running Ranger with Hive, Spark, and Trino manage Polaris security within the same policy framework, eliminating the policy duplication that arises when teams bolt separate authorization onto each engine. It is opt-in and backward compatible with Polaris's internal authorization layer, which lowers the enterprise adoption barrier considerably.&lt;/p&gt;

&lt;p&gt;On the community side, Polaris's blog continued its post-graduation cadence with an &lt;a href="https://polaris.apache.org/blog/" rel="noopener noreferrer"&gt;April 4 post on building a fully integrated, locally-running open data lakehouse in under 30 minutes&lt;/a&gt; using k3d, Apache Ozone, Polaris, and Trino. The Polaris PMC also shipped a &lt;a href="https://polaris.apache.org/blog/" rel="noopener noreferrer"&gt;March 29 post&lt;/a&gt; covering automated entity management for catalogs, principals, and roles. With incubator overhead behind it, release velocity has picked up noticeably from the 1.3.0 release on January 16.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Arrow
&lt;/h2&gt;

&lt;p&gt;Arrow's &lt;a href="https://github.com/apache/arrow-rs" rel="noopener noreferrer"&gt;release calendar&lt;/a&gt; shows arrow-rs 58.2.0 landing this month, following 58.1.0 in March, which shipped with no breaking API changes. The cadence has held at roughly one minor version per month, with 59.0.0 already scheduled for May as a major release that may include breaking changes. The Rust implementation has become one of the most actively maintained segments of the Arrow ecosystem, with the DataFusion integration drawing in engines that want Arrow without a JVM dependency.&lt;/p&gt;

&lt;p&gt;Jean-Baptiste Onofré's JDK 17 minimum proposal for Arrow Java 20.0.0 continued drawing input from Micah Kornfield and Antoine Pitrou. The practical rationale is coordination: setting JDK 17 as Arrow's Java baseline aligns with Iceberg's own upgrade timeline and effectively raises the minimum across the entire lakehouse stack in a single coordinated move. The decision is expected before the 20.0.0 release cycle formally opens.&lt;/p&gt;

&lt;p&gt;Nic Crane's thread on using LLMs for Arrow project maintenance continued generating discussion. The framing — AI as a resource for maintainers, not just contributors — is distinct from how Iceberg and Polaris are approaching their AI policies. Arrow's angle is practical: a lean maintainer group managing a growing issue backlog needs help triaging, and LLMs can do that work without introducing the code-provenance concerns that matter for contributions. Google Summer of Code 2026 student proposals that landed in early April are being sorted this week, with interest concentrated in compute kernels and Go and Swift language bindings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Parquet
&lt;/h2&gt;

&lt;p&gt;Parquet's week centered on hardening the Geospatial spec that was adopted earlier this year. Milan Stefanovic merged &lt;a href="http://www.mail-archive.com/commits@parquet.apache.org/msg04335.html" rel="noopener noreferrer"&gt;PR #560 on April 20&lt;/a&gt;, clarifying the Geospatial spec wording for coordinate reference systems. The change documents existing CRS usage practice for the default OGC:CRS84 system and removes ambiguity caught during implementation reviews. Small spec-hardening commits like this are how a new type goes from "shipped" to "production-reliable" across engines.&lt;/p&gt;

&lt;p&gt;The community blog effort continued alongside the spec work. The &lt;a href="https://parquet.apache.org/blog/2026/02/13/native-geospatial-types-in-apache-parquet/" rel="noopener noreferrer"&gt;Native Geospatial Types blog&lt;/a&gt; that Jia Yu and Dewey Dunnington published on February 13 remains the community's reference explainer, and Andrew Lamb has been coordinating with Aihua Xu on the companion Variant blog post. Spotlighting recent additions through the Parquet blog is part of a deliberate push to give the project the same kind of voice that DataFusion and Arrow have built.&lt;/p&gt;

&lt;p&gt;The ALP encoding that cleared its acceptance vote in the prior week moved into implementation discussion. Engine teams across Spark, Trino, Dremio, and DataFusion are comparing notes on how to integrate ALP into their Parquet readers, with compression gains for float-heavy ML feature stores as the immediate benefit. The File logical type proposal for unstructured data (images, PDFs, audio) also kept advancing in community discussion, extending Parquet's scope beyond pure analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-Project Themes
&lt;/h2&gt;

&lt;p&gt;The summit's downstream effect is now visible across every dev list. Iceberg's V4 work, Polaris's 1.4.0 scope, Arrow's JDK 17 decision, and Parquet's Geospatial cleanup are running in parallel, and the cross-project coordination on shared questions like AI contribution policy and Java baselines has intensified. The JDK 17 alignment is the clearest case: moving Arrow Java 20.0.0, Iceberg's next major, and downstream engines to the same floor in a single window removes years of compatibility friction.&lt;/p&gt;

&lt;p&gt;The second pattern is the steady expansion of format scope to meet AI workloads. Iceberg's efficient column updates, Parquet's File logical type, the Geospatial spec hardening, and Polaris's multi-cloud federation all respond to the same pressure: the lakehouse stack is being asked to power AI pipelines, not just analytical queries. Each project is making changes that only make sense if you assume the next decade's workloads look different from the last.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;Watch for the V4 single-file commits formal spec write-up and the metadata optionality vote on the Iceberg dev list, along with a published AI contribution policy. The Polaris 1.4.0 release intent email should land in the coming days. Arrow's JDK 17 baseline decision for Java 20.0.0 is close to a vote, and arrow-rs 58.2.0 should ship before the end of the month. Iceberg Summit 2026 session recordings are also rolling out on the project's YouTube channel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources &amp;amp; Further Learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Started with Dremio&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-04-22&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Try Dremio Free&lt;/a&gt; — Build your lakehouse on Iceberg with a free trial&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.dremio.com/use-cases/lake-to-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=apache-newsletter-2026-04-22&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Build a Lakehouse with Iceberg, Parquet, Polaris &amp;amp; Arrow&lt;/a&gt; — Learn how Dremio brings the open lakehouse stack together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Free Downloads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html" rel="noopener noreferrer"&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hello.dremio.com/wp-apache-polaris-guide-reg.html" rel="noopener noreferrer"&gt;Apache Polaris: The Definitive Guide&lt;/a&gt; — O'Reilly book, free download&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Books by Alex Merced&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Architecting-Apache-Iceberg-Lakehouse-open-source/dp/1633435105/" rel="noopener noreferrer"&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Enabling-Agentic-Analytics-Apache-Iceberg-ebook/dp/B0GQXT6W3N/" rel="noopener noreferrer"&gt;Enabling Agentic Analytics with Apache Iceberg and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/" rel="noopener noreferrer"&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Book-Using-Apache-Iceberg-Python/dp/B0GNZ454FF/" rel="noopener noreferrer"&gt;The Book on Using Apache Iceberg with Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI Weekly: Opus 4.7, Kimi K2.6, and a $25B Amazon Deal, April 16–22, 2026</title>
      <dc:creator>Alex Merced</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:03:32 +0000</pubDate>
      <link>https://forem.com/alexmercedcoder/ai-weekly-opus-47-kimi-k26-and-a-25b-amazon-deal-april-16-22-2026-1k7p</link>
      <guid>https://forem.com/alexmercedcoder/ai-weekly-opus-47-kimi-k26-and-a-25b-amazon-deal-april-16-22-2026-1k7p</guid>
      <description>&lt;p&gt;Three stories defined the past week: Anthropic shipped Claude Opus 4.7, Moonshot open-sourced Kimi K2.6 with 300-agent swarms, and Amazon committed another $25 billion to Anthropic alongside a $100 billion AWS spend. Here is what you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Coding Tools: Opus 4.7 Ships With a 1M Context Window
&lt;/h2&gt;

&lt;p&gt;Anthropic released &lt;a href="https://www.cnbc.com/2026/04/16/anthropic-claude-opus-4-7-model-mythos.html" rel="noopener noreferrer"&gt;Claude Opus 4.7 on April 16&lt;/a&gt;, a new flagship model focused on agentic coding and long-horizon work. The model scores 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro, up from Opus 4.6's 80.8% on Verified. It runs with a full 1 million token context window and high-resolution image support for charts and dense documents.&lt;/p&gt;

&lt;p&gt;The model landed across major platforms the same week. &lt;a href="https://aws.amazon.com/blogs/aws/aws-weekly-roundup-claude-opus-4-7-in-amazon-bedrock-aws-interconnect-ga-and-more-april-20-2026/" rel="noopener noreferrer"&gt;Claude Opus 4.7 arrived on Amazon Bedrock&lt;/a&gt; on launch day in four regions, with up to 10,000 requests per minute per account. &lt;a href="https://github.blog/changelog/2026-04-16-claude-opus-4-7-is-generally-available/" rel="noopener noreferrer"&gt;GitHub Copilot began rolling out Opus 4.7 to Copilot Pro+&lt;/a&gt; users with a 7.5x premium request multiplier until April 30. The model is replacing both Opus 4.5 and Opus 4.6 in the Copilot model picker.&lt;/p&gt;

&lt;p&gt;Claude Code shipped Opus 4.7 the same day with new controls. The update added an "xhigh" effort level between high and max, a &lt;code&gt;/ultrareview&lt;/code&gt; multi-agent code review command, and Auto mode for Max subscribers. &lt;a href="https://releasebot.io/updates/anthropic" rel="noopener noreferrer"&gt;Anthropic also launched Claude Design&lt;/a&gt;, a new Anthropic Labs product for building prototypes, slides, and one-pagers in collaboration with the model. Pricing stays at $5 per million input tokens and $25 per million output tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Models: Kimi K2.6 Opens the Door to 12-Hour Agent Runs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://siliconangle.com/2026/04/20/moonshot-ai-releases-kimi-k2-6-model-1t-parameters-attention-optimizations/" rel="noopener noreferrer"&gt;Moonshot AI released Kimi K2.6 on April 20&lt;/a&gt; as an open-source agentic model built for long-horizon coding. The model has 1 trillion total parameters in a Mixture-of-Experts architecture with 32 billion active per forward pass. It supports text, image, and video input, a 256K context window, and thinking and non-thinking modes behind an OpenAI-compatible API.&lt;/p&gt;

&lt;p&gt;The headline claim is stamina. &lt;a href="https://www.marktechpost.com/2026/04/20/moonshot-ai-releases-kimi-k2-6-with-long-horizon-coding-agent-swarm-scaling-to-300-sub-agents-and-4000-coordinated-steps/" rel="noopener noreferrer"&gt;Kimi K2.6 targets 12-hour autonomous coding sessions&lt;/a&gt; and agent swarms that scale to 300 sub-agents across 4,000 coordinated steps. On benchmarks, Moonshot claims SWE-bench Pro at 58.6, SWE-bench Multilingual at 76.7, and BrowseComp at 83.2. The model matches or beats GPT-5.4 and Claude Opus 4.6 on several open-source leaderboards.&lt;/p&gt;

&lt;p&gt;K2.6 is available immediately on Kimi.com, the developer API, Kimi Code CLI, Ollama, and Hugging Face. Day-one integrations cover Kilo Code, VS Code and JetBrains extensions, OpenClaw, Tencent CodeBuddy, and Genspark. The MIT-derived license allows commercial use and redistribution, a direct challenge to closed-source frontier labs.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Infrastructure: AWS Interconnect Reaches GA and Amazon Adds $25B to Anthropic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/aws-weekly-roundup-claude-opus-4-7-in-amazon-bedrock-aws-interconnect-ga-and-more-april-20-2026/" rel="noopener noreferrer"&gt;AWS Interconnect reached general availability on April 20&lt;/a&gt; with two new capabilities. AWS Interconnect Multicloud provides Layer 3 private connections between AWS VPCs and other clouds, starting with Google Cloud, with Azure and OCI coming later in 2026. Traffic flows over the AWS global backbone with built-in MACsec encryption, never crossing the public internet. AWS also published the Interconnect specification on GitHub under Apache 2.0, so any cloud provider can become a partner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.cnbc.com/2026/04/20/amazon-invest-up-to-25-billion-in-anthropic-part-of-ai-infrastructure.html" rel="noopener noreferrer"&gt;Amazon announced a $25 billion investment in Anthropic on April 20&lt;/a&gt;, on top of the $8 billion already committed. The deal includes $5 billion immediately, with up to $20 billion tied to commercial milestones. &lt;a href="https://finance.yahoo.com/sectors/technology/articles/amazon-investing-25-billion-more-113801183.html" rel="noopener noreferrer"&gt;Anthropic committed to spending more than $100 billion on AWS over 10 years&lt;/a&gt;, securing up to 5 gigawatts of Trainium chip capacity. One gigawatt is scheduled to come online this year using Trainium2 and Trainium3.&lt;/p&gt;

&lt;p&gt;The structure mirrors the &lt;a href="https://www.geekwire.com/2026/amazon-doubles-down-on-anthropic-with-25b-investment-mirroring-its-openai-cloud-deal/" rel="noopener noreferrer"&gt;$50 billion Amazon-OpenAI deal from February&lt;/a&gt;. Anthropic is now valued at $380 billion, with annualized revenue climbing from $9 billion at the end of 2025 to more than $30 billion. Enterprise customers spending at least $1 million annually have doubled since February, crossing 1,000 accounts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standards and Protocols: Interconnect Spec Goes Open
&lt;/h2&gt;

&lt;p&gt;The AWS Interconnect specification going to GitHub under Apache 2.0 is the standards story of the week. The move gives any cloud provider a published path to join the private connectivity mesh without negotiating bilateral deals. For AI workloads moving data between model training clusters in one cloud and inference infrastructure in another, the alternative has been either the public internet or expensive dedicated circuits.&lt;/p&gt;

&lt;p&gt;The broader pattern is that hyperscale cloud providers are open-sourcing infrastructure specs to lock in network effects. Trainium chip access is exclusive, but the connectivity layer is open. This is the same playbook the Linux Foundation's Agentic AI Foundation uses for MCP and A2A: open standards for the plumbing, proprietary value on top.&lt;/p&gt;

&lt;p&gt;MCP and A2A also saw continued adoption this week. Claude Opus 4.7 ships with both protocols built in, and Kimi K2.6 supports tool calls and OpenAI-compatible APIs that slot into MCP-aware agent stacks. The layered architecture is holding: MCP handles agent-to-tool connections, A2A handles agent-to-agent coordination, and the new open models and frontier releases are all landing with both built in by default.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources to Go Further
&lt;/h2&gt;

&lt;p&gt;The AI landscape changes fast. Here are tools and resources to help you keep pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try Dremio Free&lt;/strong&gt; — Experience agentic analytics and an Apache Iceberg-powered lakehouse. &lt;a href="https://www.dremio.com/get-started?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-22-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Start your free trial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Agentic AI with Data&lt;/strong&gt; — Dremio's agentic analytics features let your AI agents query and act on live data. &lt;a href="https://www.dremio.com/use-cases/agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-22-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Explore Dremio Agentic AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join the Community&lt;/strong&gt; — Connect with data engineers and AI practitioners building on open standards. &lt;a href="https://developer.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=04-22-2026&amp;amp;utm_content=alexmerced" rel="noopener noreferrer"&gt;Join the Dremio Developer Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: The 2026 Guide to AI-Assisted Development&lt;/strong&gt; — Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. &lt;a href="https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Book: Using AI Agents for Data Engineering and Data Analysis&lt;/strong&gt; — A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. &lt;a href="https://www.amazon.com/Using-Agents-Data-Engineering-Analysis-ebook/dp/B0GR6PYJT9/" rel="noopener noreferrer"&gt;Get it on Amazon&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>llm</category>
      <category>news</category>
    </item>
  </channel>
</rss>
