Introducing Polaris Catalog: An Open Source Catalog for Apache Iceberg
Open source file and table formats have garnered much interest in the data industry because of their potential for interoperability — unlocking the ability for many technologies to safely operate over a single copy of data. Greater interoperability not only reduces the complexity and costs associated with using many tools and processing engines in parallel, but also reduces the risks associated with vendor lock-in.
Despite rapid adoption of open file and table formats, many interdependent limitations exist between engines and catalogs, creating lock-in that diminishes the value of Iceberg's open standards. This leaves data architects and engineers with the task of navigating these constraints and making difficult trade-offs between complexity and lock-in. To improve interoperability, the Apache Iceberg community has developed an open standard REST protocol within the Iceberg project. The open API specification is a big step toward achieving interoperability, and the ecosystem could further benefit from open source catalog implementations that enable vendor-neutral storage.
Today, Snowflake is delighted to announce Polaris Catalog, which provides enterprises and the Iceberg community with new levels of choice, flexibility and control over their data, with full enterprise security and Apache Iceberg interoperability with Amazon Web Services (AWS), Confluent, Dremio, Google Cloud, Microsoft Azure, Salesforce and more. Polaris Catalog builds on standards created by the Iceberg community to address the challenges described above.
- Instead of moving and copying data between different engines and catalogs, you can run many engines against a single copy of data from one place.
- You can host Polaris Catalog on Snowflake-managed infrastructure or on the infrastructure of your choice.
Polaris Catalog will be open sourced in the next 90 days and will soon be available to run in public preview on Snowflake infrastructure. The remainder of this blog post provides more detail on functionality and hosting options.
Cross-engine read and write interoperability
Many organizations either use various processing engines to perform specific workloads or seek the flexibility to easily add or swap processing engines in the future. Either way, they want the freedom to safely use multiple engines on a single copy of data to minimize the storage and compute costs associated with moving data or maintaining multiple copies.
Catalogs play a critical role in a multi-engine architecture. They make operations on tables reliable by supporting atomic transactions. This means that data engineers and their pipelines can modify tables concurrently, and queries on these tables produce accurate results. To accomplish this, all Iceberg table read and write operations, even from different engines, are routed through a catalog.
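The commit flow described above can be sketched with a toy model. This is a hypothetical in-memory catalog for illustration only — not Polaris Catalog's actual implementation — but it shows the essential mechanism: each writer reads the table's current metadata version, prepares its changes, and asks the catalog to swap the metadata pointer atomically; the catalog rejects the commit if another writer committed first, forcing a safe retry.

```python
import threading

class ToyCatalog:
    """Toy in-memory catalog: tracks each table's current metadata version
    and accepts a commit only if the writer saw the latest version
    (optimistic concurrency, the style of commit an Iceberg catalog arbitrates)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tables = {}  # table name -> (version, metadata)

    def load(self, name):
        with self._lock:
            return self._tables.get(name, (0, None))

    def commit(self, name, expected_version, new_metadata):
        # Atomic compare-and-swap on the table's metadata pointer.
        with self._lock:
            current_version, _ = self._tables.get(name, (0, None))
            if current_version != expected_version:
                return False  # another engine committed first; caller must retry
            self._tables[name] = (current_version + 1, new_metadata)
            return True

catalog = ToyCatalog()

# Two "engines" read the same version, then race to commit.
version, _ = catalog.load("db.events")
assert catalog.commit("db.events", version, {"snapshot": "A"})      # first commit wins
assert not catalog.commit("db.events", version, {"snapshot": "B"})  # second is rejected

# The losing writer re-reads the latest version and retries successfully.
version, _ = catalog.load("db.events")
assert catalog.commit("db.events", version, {"snapshot": "B"})
```

Because every read and write goes through this single arbiter, concurrent writers from different engines cannot silently overwrite each other's changes.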
A standardized catalog protocol for all engines unlocks multi-engine interoperability. Fortunately, the Apache Iceberg community has created an open source specification for a REST protocol. An increasing number of both open source and commercial engines and catalogs are adding support for this REST API specification because of the interoperability it unlocks.
Polaris Catalog implements Iceberg's open REST API to maximize the number of engines you can integrate. Today, this includes Apache Doris, Apache Flink, Apache Spark, PyIceberg, StarRocks and Trino, with more options, including commercial engines like Dremio, coming in the future. You can also use Snowflake to both read from and write to Iceberg tables with Polaris Catalog because of Snowflake's expanded support for catalog integrations with Iceberg's REST API (in public preview soon).
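Because the REST API is standardized, engines typically connect through configuration alone. As a sketch, a Spark deployment might register an Iceberg REST catalog with properties like the following (the catalog name `polaris`, the URI, and the credential values are placeholders, and the exact Polaris Catalog endpoint and authentication details may differ):

```properties
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
spark.sql.catalog.polaris.uri=https://<polaris-host>/api/catalog
spark.sql.catalog.polaris.credential=<client-id>:<client-secret>
```

Any engine that speaks the same REST specification can point at the same endpoint, which is what makes a single copy of data usable across engines.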
Run anywhere, no lock-in
You can get started by running the open source Polaris Catalog hosted on Snowflake's AI Data Cloud infrastructure (public preview soon), or you can self-host it on your own infrastructure (coming soon) in containers, using tools such as Docker or Kubernetes. Regardless of how you deploy Polaris Catalog, there's no lock-in. Should you want to swap your underlying infrastructure, you can freely do so.
Extend Snowflake Horizon’s governance via Polaris Catalog Integration
Once integration between Snowflake Horizon and Polaris Catalog is set up, Snowflake Horizon’s governance and discovery capabilities, like column masking policies, row access policies, object tagging and sharing, work on top of Polaris Catalog. So whether an Iceberg table is created in Polaris Catalog by Snowflake or another engine, like Flink or Spark, you can extend Snowflake Horizon’s features to these tables as if they were native Snowflake objects.
Looking ahead
Polaris Catalog is intended to provide not just Snowflake customers, but the broader data ecosystem, with fully interoperable storage by building on the standards from the Apache Iceberg community. Drawing on our experience running a global, cross-cloud platform, and working with the incredible, rapidly growing Iceberg community, we will continue to improve Polaris Catalog together. If you'd like to learn more about Polaris Catalog, please attend the AI Data Cloud Summit or register for this webinar to hear more from the team. If you'd like to be the first to know when the code for Polaris Catalog is released, sign up for notifications by watching this GitHub repository.
Forward-Looking Statements
This article contains forward-looking statements, including about our future product offerings; these statements are not commitments to deliver any product offerings. Actual results and offerings may differ and are subject to known and unknown risks and uncertainties. See our latest 10-Q for more information.