
What Are Apache Iceberg Tables?

Table formats with support for ACID transactions, such as Apache Iceberg, are part of what makes data lakes and data mesh strategies fast and effective solutions for querying data at scale.

  • Overview
  • How Table Formats Streamline Data Lakes
  • What Is Apache Iceberg?
  • The Benefits of Apache Iceberg
  • Resources

Overview

Choosing the right table format allows organizations to realize the full potential of their data by providing performance, interoperability and ease of use. The Iceberg table format is unique among open source alternatives: it is engine- and file format-agnostic, and it is developed as a highly collaborative, transparent open source project. Let’s explore the benefits of using an open table format for analyzing large data sets and why Iceberg has quickly become one of the most popular open source table formats.

How Table Formats Streamline Data Lakes

Data lakes are ideal for storing massive amounts of structured, semi-structured and unstructured data in native file formats. They provide organizations with a comprehensive way to explore, refine and analyze petabytes of information constantly arriving from multiple data sources. 

But the individual files in a data lake don’t carry the information that query engines and other applications need for effective pruning, time travel, schema evolution and more. As a result, these management tasks are difficult and time-consuming to perform. Table formats address this by adding a metadata layer that enables capabilities similar to those offered by SQL tables in a traditional relational database: they explicitly define a table, its schema, its history and each file that composes the table. In addition, table formats such as Iceberg enable ACID compliance, allowing multiple applications to safely work on the same data simultaneously.
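The idea of a metadata layer over raw files can be sketched in plain Python. Everything below is hypothetical and heavily simplified for illustration; real Iceberg metadata is stored as JSON metadata files and Avro manifests, not Python objects:

```python
from dataclasses import dataclass, field

@dataclass
class DataFile:
    path: str
    min_id: int   # per-file column statistics are what make pruning possible
    max_id: int

@dataclass
class Snapshot:
    snapshot_id: int
    files: tuple  # an immutable set of files defining one table version

@dataclass
class TableMetadata:
    schema: dict                       # column name -> type
    snapshots: list = field(default_factory=list)

    def current(self):
        return self.snapshots[-1]

    def prune(self, wanted_id):
        # Skip any file whose statistics prove it cannot match the predicate.
        return [f for f in self.current().files
                if f.min_id <= wanted_id <= f.max_id]

table = TableMetadata(schema={"id": "long", "name": "string"})
table.snapshots.append(Snapshot(1, (
    DataFile("s3://bucket/a.parquet", 1, 100),
    DataFile("s3://bucket/b.parquet", 101, 200),
)))

print([f.path for f in table.prune(150)])  # only b.parquet needs scanning
```

Because the table, its schema and its file statistics are described explicitly, an engine can answer "which files could contain id = 150?" from metadata alone, without opening any data file.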

What Is Apache Iceberg?

Iceberg is an open source table format originally developed at Netflix to address challenges it encountered with Apache Hive tables on Hadoop. In 2018, Netflix donated Iceberg to the Apache Software Foundation as a completely open source, openly governed project. It remedies many of the shortcomings of its predecessor and has quickly become one of the most popular open source table formats.

The Benefits of Apache Iceberg

The Iceberg table format offers many features to help power your data lake architecture. 

  • Expressive SQL: Iceberg fully supports flexible SQL commands, making it possible to update existing rows, merge new data and make targeted deletes. Iceberg can rewrite data files to improve read performance, or use delete deltas to make updates faster.
  • Schema evolution: Iceberg supports full schema evolution. Schema updates in Iceberg tables change only the metadata, leaving the data files themselves unaffected. Supported changes include column adds, drops, renames, reordering and type promotion.
  • Partition evolution: Partitioning groups similar rows together within a large table, speeding up reads and loads for queries that only need to access a portion of the data. In Iceberg, a partition spec can evolve without rewriting data written under an earlier spec; the metadata associated with each partition version is stored separately.
  • Time travel and rollback: Iceberg’s time travel feature makes it possible to run reproducible queries against the same table snapshot and lets users inspect previous changes. Its rollback capability allows users to easily walk back errors by resetting a table to a previous state.
  • Transactional consistency: Data stored in a data lake or data mesh architecture is available to multiple independent applications across an organization simultaneously. While this is a significant benefit, it can also come with substantial risks, especially if multiple users are writing to the same data at the same time. However, Iceberg enables ACID transactions at scale, allowing concurrent writers to work in tandem. Support for ACID helps ensure readers are not affected by partial or uncommitted changes from writers. When a writer commits a change, Iceberg creates a new, immutable version of the table’s data files and metadata.
  • Faster querying: Iceberg is designed for use with huge analytical data sets. It offers multiple features to increase querying speed and efficiency, including fast scan planning, pruning of metadata files that aren’t needed, and filtering out data files that don’t contain matching data.
  • Vibrant community of active developers and contributors: Iceberg is one of the Apache Software Foundation’s flagship projects. Its support for multiple processing engines and file formats, including Apache Parquet, Apache Avro and Apache ORC, has attracted a diverse group of talented commercial users eager to contribute to its ongoing success.
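The time travel, rollback and transactional-consistency features above all rest on the same mechanism: every commit produces a new, immutable snapshot. A minimal Python sketch of that idea follows; the class and method names are hypothetical, and real Iceberg exposes these operations through engine SQL extensions and table APIs rather than anything like this:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    rows: tuple  # in practice, a set of data files; plain rows for brevity

class Table:
    def __init__(self):
        self.snapshots = []
        self.commit(())                        # empty initial snapshot

    def commit(self, rows):
        # A commit never mutates old snapshots; it appends a new version,
        # so concurrent readers always see a complete, committed state.
        sid = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(sid, tuple(rows)))

    def read(self, snapshot_id=None):
        # Time travel: reading a historical snapshot is reproducible
        # because snapshots are immutable.
        sid = snapshot_id if snapshot_id else self.snapshots[-1].snapshot_id
        return self.snapshots[sid - 1].rows

    def rollback(self, snapshot_id):
        # Rollback re-commits an old snapshot's state as the new head;
        # history is preserved, nothing is destroyed.
        self.commit(self.read(snapshot_id))

t = Table()
t.commit(("a", "b"))           # snapshot 2
t.commit(("a", "b", "oops"))   # snapshot 3: a bad write
t.rollback(2)                  # snapshot 4 restores snapshot 2's state
print(t.read())                # ('a', 'b')
```

Note that rollback adds a snapshot rather than deleting one, which is why readers holding an older snapshot are never affected by a writer's changes.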
