
Metadata That Works: How Snowflake Is Raising the Bar for Iceberg Performance

Metadata is the unsung hero of modern data processing systems. Whether it's schema definitions, partition stats or column-level min/max values, metadata enables query engines to resolve, validate and optimize SQL workloads efficiently. As open table formats like Apache Iceberg gain popularity for their flexibility and cloud-native design, they also introduce a unique set of metadata challenges — especially when it comes to delivering consistent, high-performance query execution.

At Snowflake, we've embraced Iceberg by building a data format-agnostic query engine that handles open formats with minimal assumptions and optimal efficiency. Our approach enables strong performance even when metadata is partial or approximate — common in most real-world lakehouse environments. 

In this post, we’ll walk through the metadata challenges of Iceberg, share Snowflake’s principles for data format-agnostic lakehouse optimization, and highlight a key optimization — sampling-based estimation of the number of distinct values (NDV) — which delivered a 2x improvement in TPC-DS benchmarks. This technique improves performance without requiring manual stats collection or tuning, showcasing the power of smart metadata inference in the absence of exact statistics.

Principles for metadata-agnostic performance

As the data lakehouse model evolves, systems must be ready to handle diverse table formats, such as Iceberg, without relying on tightly coupled assumptions. At Snowflake, we’ve embraced this shift by defining a clear set of guiding principles for achieving data format-agnostic query performance — enabling our engine to deliver consistent performance regardless of metadata completeness.

First, we believe performance should “just work” — without users needing to toggle knobs or run manual stats collection. Second, we treat nonexact metadata as a first-class input, designing our query planner to generate smart execution plans even when metadata is partial or approximate. And finally, we avoid format-specific logic wherever possible, instead relying on a unified abstraction layer that allows our engine to reason about tables generically.

These principles are core to how Snowflake delivers performance for open table formats without compromise, regardless of whether your data is native, Iceberg or anything in between.

How Snowflake boosts lakehouse query performance

To enable world-class performance across any data format — especially in the flexible, evolving world of Iceberg — Snowflake has built a series of format-agnostic optimizations directly into our query compiler.

  • Unified metadata abstraction: Rather than writing format-specific logic for Iceberg, Parquet or Hudi, we've rearchitected our metadata layer to present a consistent interface to the query planner. This enables plug-and-play performance when rich metadata is available — without custom compiler changes for each format. (A minimal sketch of such an interface follows Figure 1.)
Figure 1. Metadata abstraction
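
To make this concrete, here is a minimal Python sketch of what such a unified abstraction might look like. The names (TableMetadata, ColumnStats, IcebergMetadata) are illustrative, not Snowflake's internal API; the key idea is that every statistic can be absent or approximate, so the planner consumes one interface no matter which format produced the stats.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ColumnStats:
    min_value: Optional[Any] = None  # None means "unknown"
    max_value: Optional[Any] = None
    ndv: Optional[int] = None        # distinct-value count or estimate
    exact: bool = True               # False: derived from a subset of
                                     # files, so not usable as hard bounds


class TableMetadata(ABC):
    """The uniform view the planner sees, regardless of table format."""

    @abstractmethod
    def row_count(self) -> Optional[int]: ...

    @abstractmethod
    def column_stats(self, column: str) -> ColumnStats: ...


class IcebergMetadata(TableMetadata):
    """Adapter mapping per-file stats from Iceberg manifests onto the view."""

    def __init__(self, data_files: list):
        self._files = data_files  # dicts parsed from manifest entries

    def row_count(self) -> Optional[int]:
        counts = [f.get("record_count") for f in self._files]
        return sum(counts) if all(c is not None for c in counts) else None

    def column_stats(self, column: str) -> ColumnStats:
        lows = [f.get("lower_bounds", {}).get(column) for f in self._files]
        highs = [f.get("upper_bounds", {}).get(column) for f in self._files]
        return ColumnStats(
            min_value=min((v for v in lows if v is not None), default=None),
            max_value=max((v for v in highs if v is not None), default=None),
            # Partial stats are surfaced to the planner, just flagged inexact.
            exact=all(v is not None for v in lows + highs),
        )
```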
  • Partial metadata pruning: Real-world Iceberg tables often contain metadata of inconsistent quality, and it is a suboptimal design for an engine to categorize partial metadata as missing metadata: there may still be a wealth of metadata available for a subset of the data that can help optimize a query. In Snowflake, we treat partial metadata as a distinct form of metadata that the query compiler understands, allowing the engine to selectively leverage it in optimizations where full accuracy is not required. One specific optimization is partial metadata pruning. Snowflake takes full advantage of data files’ metadata at compile time to prune the set of files that need to be scanned, based on predicates in the query. We’ve enhanced our pruning capabilities to apply predicates to whatever metadata a table does have and defer the remaining work to execution, when actual data values can be inspected. This also works on approximate metadata, where the stored min and max values are only lower and upper bounds on the true values (see the sketch after Figure 2).

Figure 2. Pruning with partial metadata
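
To illustrate the conservative logic involved, here is a small, self-contained Python sketch (not Snowflake's implementation) of file pruning for an equality predicate. A file is skipped only when its stats prove that no row can match; files with missing stats are kept and the predicate is deferred to execution. Because an approximate min is still a valid lower bound and an approximate max a valid upper bound, the same check stays safe on approximate metadata.

```python
from typing import Optional


def can_prune_equals(file_min: Optional[int],
                     file_max: Optional[int],
                     literal: int) -> bool:
    """True iff `col = literal` provably matches no row in the file.

    Safe for approximate stats too: an approximate min is a lower bound
    and an approximate max an upper bound on the true values, so both
    checks remain conservative. Missing stats prune nothing.
    """
    if file_min is not None and literal < file_min:
        return True   # literal is below every value in the file
    if file_max is not None and literal > file_max:
        return True   # literal is above every value in the file
    return False      # unknown or overlapping range: must scan


files = [
    {"path": "a.parquet", "min": 10,   "max": 20},
    {"path": "b.parquet", "min": None, "max": None},  # stats missing
    {"path": "c.parquet", "min": 30,   "max": 40},
]

# Compile-time pruning for `WHERE col = 25`: a and c are eliminated;
# b survives, and the filter is applied at execution time instead.
to_scan = [f for f in files if not can_prune_equals(f["min"], f["max"], 25)]
print([f["path"] for f in to_scan])  # ['b.parquet']
```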
  • Sampling-based NDVs: Missing number of distinct values (NDV) statistics are a common performance pain point. We addressed this by developing a fast, low-cost sampling technique that estimates NDVs without any user intervention, improving query planning for joins and aggregates and delivering significant performance wins out of the box. To keep the sampling itself performant, we’ve implemented a dynamic, intelligent solution that chooses the optimal way to execute the sampling operation based on a table’s size, schema and other factors. This enables both low latency between table creation and NDV availability and minimal user-visible latency when creating or refreshing a table (see the sketch after Figure 3).

Figure 3. Performance benchmark with NDV sampling
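
Snowflake hasn’t published its exact estimator, but the following Python sketch shows the general shape of the technique: scan only a small sample and extrapolate the distinct count from the sample’s frequency profile, here using the classic bias-corrected Chao1 estimator. The function name, sample fraction and test data are all illustrative.

```python
import random
from collections import Counter

# Minimal sketch of sampling-based NDV estimation (not Snowflake's
# internal algorithm). Chao1 estimates the total number of distinct
# values as  d + f1*(f1 - 1) / (2*(f2 + 1)),  where d is the number of
# distinct values seen in the sample and f_i counts the values that
# appear exactly i times in it.


def estimate_ndv(values, sample_fraction=0.01, seed=42):
    rng = random.Random(seed)
    r = max(1, int(len(values) * sample_fraction))
    sample = rng.sample(values, r)

    counts = Counter(sample)                # value -> occurrences in sample
    freq = Counter(counts.values())         # i -> #values seen exactly i times
    d, f1, f2 = len(counts), freq.get(1, 0), freq.get(2, 0)
    return d + f1 * (f1 - 1) / (2 * (f2 + 1))


# Example: 1M rows drawn uniformly from 50k distinct keys. The estimate
# should land near 50,000 while reading only 1% of the data.
data = [random.randrange(50_000) for _ in range(1_000_000)]
print(round(estimate_ndv(data)))
```

In production, the sample size and execution strategy would be chosen dynamically per table, as described in the bullet above, rather than fixed at a single fraction.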

Conclusion

As the lakehouse ecosystem continues to evolve, Snowflake is committed to delivering best-in-class performance — even when working with diverse formats and inconsistent metadata. By rethinking how metadata is abstracted and used across formats like Iceberg, we've built a foundation that’s both flexible and powerful.

Our sampling-based NDV estimation, which delivered a 2x improvement in TPC-DS benchmarks, is just one example of how we're pushing the boundaries of what’s possible with metadata-aware query optimization. We’re excited to share more soon on how AI-powered optimizations will further elevate performance and simplicity in the lakehouse era.
