Table formats such as Apache Iceberg are part of what makes data lakes and data mesh strategies fast and effective solutions for querying data at scale. Choosing the right table format allows organizations to realize the full potential of their data by providing performance, interoperability, and ease of use. The Iceberg table format stands out among alternatives because it is agnostic to both processing engine and file format and is developed as a highly collaborative, transparent open-source project. In this post, we cover the benefits of using a table format for analyzing large data sets and why Iceberg has quickly become one of the most popular open-source table formats.
How External Table Formats Streamline Data Lakes
Data lakes are ideal for storing massive amounts of semi-structured and unstructured data in native file formats. They provide organizations with a comprehensive way to explore, refine, and analyze petabytes of information constantly arriving from multiple data sources.
But the individual files in a data lake don’t contain the information needed by query engines and other applications. As a result, it’s difficult and time-consuming to complete tasks such as time traveling to a previous version of the data. Table formats solve these issues by providing capabilities similar to those offered by SQL tables in a traditional relational database. They explicitly define a table, its metadata, and each file that composes the table. In addition, table formats such as Iceberg ensure ACID compliance, allowing multiple applications to safely work on the same data simultaneously.
What Is Apache Iceberg?
Iceberg is an open-source table format that was originally developed by Netflix to address issues in Apache Hive. Netflix donated Iceberg to the Apache Software Foundation in 2018, making it a completely open-source, openly managed project. It remedies many of the shortcomings of its predecessor and has quickly become one of the most popular open-source table formats.
Benefits of the Apache Iceberg Table Format
The Iceberg table format offers many features to help power your data lake architecture.
Iceberg fully supports flexible SQL commands, making it possible to update existing rows, merge new data, and perform targeted deletes. Iceberg can also rewrite data files to improve read performance and use delete deltas to speed up updates.
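To make the delete-delta idea concrete, here is a minimal Python sketch. It is a toy in-memory model and not Iceberg's API: `DataFile`, `Table`, `delete_rows`, `read_table`, and `compact` are hypothetical names invented for illustration. Instead of rewriting a data file on every delete, the writer records the positions of deleted rows in a small side structure, and readers skip those positions. A later rewrite folds the deletes into fresh data files so reads no longer pay that cost.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for an immutable data file."""
    name: str
    rows: tuple


@dataclass
class Table:
    data_files: list
    # Delete delta: file name -> set of deleted row positions.
    position_deletes: dict = field(default_factory=dict)


def delete_rows(table, predicate):
    """Record matching row positions without rewriting any data file."""
    for f in table.data_files:
        hits = {i for i, row in enumerate(f.rows) if predicate(row)}
        if hits:
            table.position_deletes.setdefault(f.name, set()).update(hits)


def read_table(table):
    """Merge on read: skip positions listed in the delete delta."""
    out = []
    for f in table.data_files:
        deleted = table.position_deletes.get(f.name, set())
        out.extend(row for i, row in enumerate(f.rows) if i not in deleted)
    return out


def compact(table):
    """Rewrite data files with deletes applied, so later reads skip the merge."""
    new_files = []
    for f in table.data_files:
        deleted = table.position_deletes.get(f.name, set())
        kept = tuple(r for i, r in enumerate(f.rows) if i not in deleted)
        new_files.append(DataFile(f.name + "-rewritten", kept))
    return Table(new_files)


events = Table([DataFile("f1", ((1, "a"), (2, "b"))), DataFile("f2", ((3, "c"),))])
delete_rows(events, lambda row: row[0] == 2)
```

In this sketch the delete is cheap because only a set of positions is written, while `compact` trades a one-time rewrite for faster reads afterward.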
Iceberg supports full schema evolution. Schema updates in Iceberg tables change only the metadata, leaving the data files themselves unaffected. Supported changes include adding, dropping, renaming, and reordering columns, as well as type promotions.
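One way to see why a schema change can be metadata-only is to track columns by a stable field ID rather than by name, which is the approach Iceberg takes. The Python sketch below is a toy model under that assumption, not Iceberg's API: the dictionaries and the `rename_column`, `add_column`, and `read` helpers are hypothetical.

```python
# Toy model: names and structures here are illustrative, not Iceberg's API.

# A data file stores values keyed by stable field ID, never by column name.
data_file = [
    {1: 101, 2: "alice"},
    {1: 102, 2: "bob"},
]

# The schema lives in table metadata and maps field IDs to current names.
schema_v1 = {1: "id", 2: "name"}


def rename_column(schema, old, new):
    """Return a new schema with one column renamed; data files are untouched."""
    return {fid: (new if name == old else name) for fid, name in schema.items()}


def add_column(schema, name):
    """Return a new schema with an extra column under a fresh field ID.

    A real implementation would track the highest ID ever assigned so that
    IDs of dropped columns are never reused; max() is a simplification.
    """
    new_id = max(schema) + 1
    return {**schema, new_id: name}


def read(rows, schema):
    """Project stored rows through the schema; missing fields read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


schema_v2 = add_column(rename_column(schema_v1, "name", "username"), "email")
result = read(data_file, schema_v2)
```

Here the old data file is read correctly under the new schema: the renamed column still resolves through field ID 2, and the added column reads as `None` for rows written before it existed.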
Partitioning divides a large table into smaller parts by grouping similar rows together, speeding up reads and loads for queries that only need to access a portion of the data. In Iceberg, a partition spec can evolve without rewriting the data written under an earlier spec. The metadata associated with each partition spec version is stored separately.
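The following Python sketch illustrates how old and new partition layouts might coexist. It is a toy model with made-up spec IDs, transforms, and file names, not Iceberg's API. Each file remembers which spec it was written with, so a scan can apply the matching transform to each file without touching older data.

```python
from datetime import date

# Toy model: spec IDs, transforms, and file names are illustrative only.
partition_specs = {
    0: lambda d: (d.year, d.month),         # original spec: partition by month
    1: lambda d: (d.year, d.month, d.day),  # evolved spec: partition by day
}

# Each file remembers the spec it was written with and its partition value.
files = [
    {"path": "a.parquet", "spec_id": 0, "partition": (2023, 1)},
    {"path": "b.parquet", "spec_id": 0, "partition": (2023, 2)},
    {"path": "c.parquet", "spec_id": 1, "partition": (2023, 3, 14)},
    {"path": "d.parquet", "spec_id": 1, "partition": (2023, 3, 15)},
]


def plan_scan(files, day):
    """Select files that may hold rows for `day`, using each file's own spec."""
    selected = []
    for f in files:
        transform = partition_specs[f["spec_id"]]
        if transform(day) == f["partition"]:
            selected.append(f["path"])
    return selected
```

For example, `plan_scan(files, date(2023, 3, 15))` selects only the daily file for that date, while a February date selects the monthly file written under the original spec.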
Time travel and rollback
Iceberg’s time travel feature makes it possible to run reproducible queries against the same table snapshot and lets users inspect previous changes. Rollback allows users to quickly correct errors by resetting a table to a previous state.
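Time travel and rollback both follow from keeping every committed table state as an immutable snapshot. The Python sketch below is a toy model of that idea, not Iceberg's API: the `Table` class, its methods, and the file names are hypothetical.

```python
class Table:
    """Toy table: each commit appends an immutable snapshot of file names."""

    def __init__(self):
        self.snapshots = [()]   # snapshot 0 is the empty table
        self.current = 0        # index of the current snapshot

    def append(self, *new_files):
        """Commit a new snapshot that adds files to the current one."""
        self.snapshots.append(self.snapshots[self.current] + new_files)
        self.current = len(self.snapshots) - 1
        return self.current

    def scan(self, snapshot_id=None):
        """Read the current snapshot, or time travel to an older one."""
        sid = self.current if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

    def rollback(self, snapshot_id):
        """Make an earlier snapshot current again; no data is rewritten."""
        self.current = snapshot_id


orders = Table()
good = orders.append("day1.parquet")
bad = orders.append("corrupt.parquet")
```

Calling `orders.scan(good)` reads the table as it was before the bad commit, and `orders.rollback(good)` makes that state current again while the later snapshot remains available for inspection.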
Data stored in a data lake or data mesh architecture is available to multiple independent applications across an organization simultaneously. While this is a significant benefit, it also carries substantial risk, especially when multiple users write to the same data at the same time. Iceberg addresses this risk by enabling ACID transactions at scale, allowing concurrent writers to work safely in tandem. Support for ACID ensures readers are not affected by partial or uncommitted changes from writers. When a writer commits a change, Iceberg creates a new, immutable version of the table’s data files and metadata.
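A common way to let concurrent writers commit safely is optimistic concurrency: each writer prepares a new table version from the state it read, then atomically swaps it in only if no one else committed in the meantime, retrying otherwise. The Python sketch below is a toy illustration of that pattern, with a lock standing in for the atomic swap. The `Catalog` class and `commit_append` function are hypothetical and not Iceberg's API.

```python
import threading


class Catalog:
    """Toy catalog holding one pointer to the current table metadata."""

    def __init__(self):
        self._lock = threading.Lock()
        # (version, data files), always replaced as a single unit so that
        # readers never observe a half-applied commit.
        self.state = (0, ())

    def compare_and_swap(self, expected_version, new_files):
        """Atomically install new metadata only if nobody committed first."""
        with self._lock:
            version, _ = self.state
            if version != expected_version:
                return False
            self.state = (version + 1, new_files)
            return True


def commit_append(catalog, new_file, max_retries=100):
    """Optimistic writer: read base state, build a new version, retry on conflict."""
    for _ in range(max_retries):
        base_version, base_files = catalog.state
        candidate = base_files + (new_file,)
        if catalog.compare_and_swap(base_version, candidate):
            return True
    return False


catalog = Catalog()
writers = [
    threading.Thread(target=commit_append, args=(catalog, f"file-{i}.parquet"))
    for i in range(8)
]
for w in writers:
    w.start()
for w in writers:
    w.join()
```

Because every successful commit replaces the whole state in one step, a reader sees either the table before a commit or after it, never a partial change, and no writer's file is lost to a race.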
Iceberg is designed for use with huge analytical data sets. It offers multiple features designed to increase querying speed and efficiency, including fast scan planning, pruning of metadata files that aren’t needed, and filtering out of data files that don’t contain matching data.
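Filtering out data files that cannot match a query can be sketched with per-file column statistics. The Python example below is a toy model that assumes each file's metadata records the minimum and maximum value of an `order_id` column. The manifest layout, file names, and `files_for_range` helper are made up for illustration and are not Iceberg's API.

```python
# Toy manifest: per-file min/max statistics for an `order_id` column.
# File names and values are made up for illustration.
manifest = [
    {"path": "f1.parquet", "min": 1, "max": 1000},
    {"path": "f2.parquet", "min": 1001, "max": 2000},
    {"path": "f3.parquet", "min": 2001, "max": 3000},
]


def files_for_range(manifest, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f["path"] for f in manifest if f["max"] >= lo and f["min"] <= hi]
```

A query for `order_id` between 1500 and 1600 would open only `f2.parquet` in this sketch, since the statistics show the other two files cannot contain matching rows.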
Vibrant community of active users and contributors
Iceberg is one of the Apache Software Foundation’s flagship projects. Its support for multiple processing engines and file formats including Apache Parquet, Apache Avro, and Apache ORC has attracted a diverse group of talented commercial users eager to contribute to its ongoing success.
Apache Iceberg and Snowflake
The Snowflake Data Cloud makes it easy to execute big data workloads using numerous file formats, including Parquet, Avro, ORC, JSON, and XML. While Snowflake’s internal tables greatly simplify the process of creating a data lake, mesh, or other storage pattern for data stored directly in Snowflake, some organizations with regulatory or other constraints either are not able to store all of their data in Snowflake or prefer to store data in open formats. Apache Iceberg is currently supported in private preview by the Snowflake Data Cloud in two ways: Iceberg Tables and External Tables. Iceberg Tables combine the performance and familiar query semantics of Snowflake tables with customer-managed cloud storage. Iceberg Tables are ideal for use cases requiring full DML, fast performance, and many Snowflake platform features with data kept in external storage. External Tables are ideal for use cases requiring easy, read-only access to query, govern, and share data that cannot be moved from cloud storage.
Snowflake users don’t have to contend with common barriers that stand in the way of realizing the true value of their data. Snowflake makes it possible to eliminate siloed data, securely share complex data sets internally and with outside data partners, and run large-scale analytics tasks on massive data sets quickly and efficiently.
See Snowflake’s capabilities for yourself. To give it a test drive, sign up for a free trial.