Expanding the Data Cloud with Apache Iceberg
Updates have been made since the original publishing of this blog post. For the latest information about Snowflake’s support for Apache Iceberg, please see here.
The Snowflake Data Cloud is a powerful place to work with data because we have made it easy to do difficult things with data, such as breaking down data silos, safely sharing complex data sets, and querying massive amounts of data. As customers move to the Data Cloud, their needs and timelines vary—our goal is to meet every customer where they are on their Data Cloud journey.
This is why we released external tables in 2019, to expand the boundaries of the Data Cloud and to remove barriers to using Snowflake, especially for data that takes time to move or cannot move into Snowflake (for example, some customers have regulatory requirements on where data can be stored). External tables have been popular with customers since they make challenging data tasks easier. Since the initial release, we have expanded external tables to support the object stores of all major cloud providers and proprietary table formats, such as Delta Lake (currently in public preview), for customers looking to migrate from Spark-based platforms.
Recently, as Apache Iceberg has addressed many of the challenges associated with object stores, we have heard significant customer demand to extend external table support to data stored in this format, connecting it with the Data Cloud. Today, we are excited to announce that external table support for Apache Iceberg is coming to private preview. Iceberg support will provide additional flexibility and interoperability, while also simplifying customers’ data landscapes.
What is a table format and why is it useful?
Creating data lakes can be challenging and time-consuming, which is why external tables are so useful: they simplify many of the steps needed to create and use data lakes. Storing individual files in a blob store bucket is a very common way to build a data lake, but it presents a series of challenges, such as how to use the files as a table with a schema. To solve this challenge, external tables understand many different file formats, such as CSV, XML, ORC, Parquet, Avro, and JSON. When files are imported into an external table, metadata about the files is saved, and a schema is applied on read when a query is run on the table. While this offers flexibility, there are some limitations and drawbacks. This is where table formats come into play.
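To make "schema on read" concrete, here is a minimal Python sketch. It is not how Snowflake implements external tables; the file names, contents, and the `read_with_schema` helper are all hypothetical and exist only for illustration. The files carry no types of their own, and the reader casts values at query time.

```python
import csv
import io

# Hypothetical raw file contents, standing in for objects in a blob store bucket.
RAW_FILES = {
    "sales/part-0001.csv": "id,amount,region\n1,19.99,EU\n2,5.00,US\n",
    "sales/part-0002.csv": "id,amount,region\n3,42.50,EU\n",
}

# The schema lives with the reader, not with the files: column name -> caster.
SCHEMA = {"id": int, "amount": float, "region": str}


def read_with_schema(files, schema):
    """Yield typed rows, casting each raw string value only when it is read."""
    for path in sorted(files):
        for raw in csv.DictReader(io.StringIO(files[path])):
            yield {col: cast(raw[col]) for col, cast in schema.items()}


# Every file is scanned to answer the query, even files with no EU rows.
rows = list(read_with_schema(RAW_FILES, SCHEMA))
eu_total = sum(r["amount"] for r in rows if r["region"] == "EU")
```

In this sketch nothing tells the reader which files are relevant before it opens them, which is one of the limitations that table formats aim to address.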
Table formats explicitly define a table, its metadata, and the files that compose the table. Instead of applying a schema when the data is read, clients already know the schema before the query is run. Moreover, the table metadata can be saved in a way that offers more fine-grained partitioning. This approach can offer a number of advantages, such as:
- Faster performance due to better filtering or partitioning
- Easier evolution of the schema
- Ability to “time travel” to the state of the table at a given point in time
- Table ACID compliance
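The following Python sketch illustrates, in a deliberately simplified way, how explicit table metadata can enable some of the advantages listed above. It is a toy model, not the Apache Iceberg specification or Snowflake's internal format: the `DataFile`, `Snapshot`, and `TableMetadata` classes, the `commit` and `plan_scan` methods, and the file paths are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataFile:
    path: str
    partition: Dict[str, str]  # partition values recorded in metadata


@dataclass
class Snapshot:
    snapshot_id: int
    files: List[DataFile]  # the complete set of files that make up the table


@dataclass
class TableMetadata:
    schema: Dict[str, str]  # known before any query runs
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, added: List[DataFile]) -> int:
        """Record a new snapshot: the previous files plus the added ones."""
        previous = self.snapshots[-1].files if self.snapshots else []
        snapshot = Snapshot(len(self.snapshots) + 1, previous + added)
        self.snapshots.append(snapshot)
        return snapshot.snapshot_id

    def plan_scan(self, partition_filter: Dict[str, str],
                  as_of: Optional[int] = None) -> List[str]:
        """Choose the files to read from metadata alone, without opening them."""
        if not self.snapshots:
            return []
        snapshot = self.snapshots[-1] if as_of is None else self.snapshots[as_of - 1]
        return [
            f.path for f in snapshot.files
            if all(f.partition.get(k) == v for k, v in partition_filter.items())
        ]


table = TableMetadata(schema={"id": "long", "amount": "double", "region": "string"})
first = table.commit([
    DataFile("sales/eu-0001.parquet", {"region": "EU"}),
    DataFile("sales/us-0001.parquet", {"region": "US"}),
])
second = table.commit([DataFile("sales/eu-0002.parquet", {"region": "EU"})])

latest_eu = table.plan_scan({"region": "EU"})                   # partition pruning
eu_as_of_first = table.plan_scan({"region": "EU"}, as_of=first)  # "time travel"
```

Because each snapshot lists the full set of files, a reader in this sketch can skip the US file entirely and can still plan a scan against the table as it existed at the first snapshot.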
Snowflake was designed from the ground up to offer this functionality, so customers can already get these benefits on Snowflake tables today. Some customers, though, would prefer an open specification table format that is separable from the processing platform because their data may be in many places outside of Snowflake. Specifically, some customers have data outside of Snowflake because of hard operational constraints, such as regulatory requirements, or slowly changing technical limitations, such as use of tools that work only on files in a blob store. For these customers, projects such as Apache Iceberg can be especially helpful.
Table formats support new architectures and patterns
Snowflake innovated in its internal table format early on, which enabled all sorts of new capabilities. However, there isn’t a one-size-fits-all storage pattern or architecture that works for everyone, and the flexibility to choose the pattern that works for you should be a key consideration when evaluating platforms. Some table formats have been an accelerant for new data management approaches, such as data mesh, that rely on distributed storage and usage of data.
To unpack this, it makes sense to look at the four principles of a data mesh architecture, which we have previously discussed, with and without a table format.
It’s no surprise the popularity of table formats has risen with the growth of the data mesh architecture. Some table formats greatly simplify ownership, production, self-service, and governance of data by removing the messiness of wrangling and interpreting individual files. We say some, because some table formats are explicitly packaged and sold only for one data architecture, an approach we believe is inflexible. This is why we have taken a careful and thorough look at which table formats to support, and how.
Why Apache Iceberg
Iceberg is an open source table format that was developed by Netflix and subsequently donated to the well-known Apache Software Foundation. Along with the benefits offered by many table formats, such as concurrency, basic schema support, and better performance, Iceberg offers a number of specific benefits and advancements to users, including:
- Vibrant ecosystem - Support for multiple file types, technical data metastores, and processing engines, which enables many different storage patterns (lake, mesh, etc.)
- Project velocity - Rapid adoption across a broad range of customers and commercial products, along with contributions from them
- Interoperability - Well-documented specification with clear goals, community input, and the ability to grow through versioning; enables many tools to operate on one set of data
In the past, we have cautioned customers to Choose Open Wisely, because often pursuing open can become the goal instead of the benefit. Specifically, we believe that open formats and projects are useful when they provide a tangible benefit to you, the customer:
“At Snowflake, we think about first principles, about desired outcomes, about intended and unintended consequences and, most importantly, we’re always focused on what is best for our customers.”
In our view, Iceberg aligns with our perspectives on open formats and projects, because it provides broader choices and benefits to customers without adding complexity or unintended outcomes. The Iceberg project is inside of a well-known, transparent software foundation and is not dependent on one vendor for its success. Rather, Iceberg has seen organic interest based on its own merits. Likewise, Iceberg avoids complexity by not coupling itself to any specific processing framework, query engine, or file format. Therefore, when customers must use an open file format and ask us for advice, our recommendation is to take a look at Apache Iceberg. While many table formats claim to be open, we believe Iceberg is more than just “open code”; it is an open and inclusive project. Based on its rapid growth and merits, customers have asked us to bring Iceberg to our platform. Given how Iceberg aligns with our goal of choosing open wisely, we think it makes sense to incorporate Iceberg into our platform.
Today, we are announcing that support for creating external tables from Iceberg tables will be coming to private preview. Using Iceberg tables is easy because the syntax is similar to that of other external tables: you tell Snowflake where to find the latest Iceberg snapshot file.
In this example, which may change in the final release, we show our current design for creating an Iceberg external table:
How do external tables fit into my architecture?
External tables have been purposefully designed as a powerful and flexible tool that enables two key use cases where complex data patterns often make working with data hard:
- Import for data migrations - As a mechanism to more easily import data into Snowflake so you can use the full power of the Data Cloud
- Query in place - As a tool to query data that cannot or will not be moved into Snowflake
The addition of table formats, such as Apache Iceberg, strengthens the power and flexibility of external tables and enhances both of these use cases. This is because table formats are commonly used as a key ingredient in deploying a storage pattern across an organization, such as a data lake or a data mesh. In bringing table formats to external tables, we are reinforcing the usefulness of external tables to a variety of storage patterns, including but not limited to data lakes.
We want to be clear: If you want a data lake, mesh, or other storage pattern in Snowflake, it does not mean you have to use external tables. They are one of the many tools we offer to simplify data use and management. External tables offer greater customer choice and flexibility, and, importantly, do not force you into choosing only one storage pattern, unlike other platforms and providers.
Learn more and get started
We encourage you to learn more about our storage strategy and how open table formats play a key role. Our November 2021 Snowday session is a great place to get a primer on our storage strategy. Want to learn more from the experts? Attend this webinar on March 4th at 10am PT to ask questions and learn more about our support for Iceberg.
If you are interested in using a table format, such as Apache Iceberg, with external tables, please contact your account team to be included in our preview releases.
Forward-Looking Statements
Other than statements of historical fact, all information contained in these materials and any accompanying oral commentary (collectively, the “Materials”), including statements regarding (i) Snowflake’s business strategy and plans, (ii) Snowflake’s new and enhanced products, services, and technology offerings, including those that are under development or not generally available, (iii) market growth, trends, and competitive considerations, and (iv) the integration, interoperability, and availability of products with and on third-party platforms, are forward-looking statements. These forward-looking statements are subject to a number of risks, uncertainties and assumptions, including those described under the heading “Risk Factors” and elsewhere in the Annual Reports on Form 10-K and the Quarterly Reports on Form 10-Q that Snowflake files with the Securities and Exchange Commission. In light of these risks, uncertainties, and assumptions, the future events and trends discussed in the Materials may not occur, and actual results could differ materially and adversely from those anticipated or implied in the forward-looking statements. As a result, you should not rely on any forward-looking statements as predictions of future events.
Any future product or roadmap information (collectively, the “Roadmap”) is intended to outline general product direction; is not a commitment, promise, or legal obligation for Snowflake to deliver any future products, features, or functionality; and is not intended to be, and shall not be deemed to be, incorporated into any contract. The actual timing of any product, feature, or functionality that is ultimately made available may be different from what is presented in the Roadmap. The Roadmap information should not be used when making a purchasing decision. In case of conflict between the information contained in the Materials and official Snowflake documentation, official Snowflake documentation should take precedence over these Materials. Further, note that Snowflake has made no determination as to whether separate fees will be charged for any future products, features, and/or functionality which may ultimately be made available. Snowflake may, in its own discretion, choose to charge separate fees for the delivery of any future products, features, and/or functionality which are ultimately made available.
© 2022 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature and service names mentioned in the Materials are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used in the Materials are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be sponsored or endorsed by, any such holder(s).