Choosing Open Wisely

Where open helps and where it hurts

Since Snowflake’s inception we’ve had the needs of our customers as our North Star, with a clear focus on security and governance of data, continuous improvements in performance and latency, and relentless innovation in capabilities. As part of our product development, we constantly evaluate where open standards, open formats, and open source can help or hinder our progress towards those goals. In short, it is a matter of degree. Over the years we’ve seen the customer benefits of this approach, and in sharing it we hope to offer a useful perspective to the broad range of individuals and organizations creating innovative technology.

It is not uncommon to fall into the trap of letting the means get confused with the end. There are instances where a goal is set given a specific intended outcome, and over time the why of the goal is forgotten and its pursuit becomes a goal in itself, oblivious to the original purpose.

We believe this is the case with the pursuit of “open” platforms in our industry. We see strong opinions for and against open, we see table pounding demanding open and chest pounding extolling open, often without much reflection on benefits versus downsides for the customers they serve. We hear mischaracterizations about the negative consequences of the alternatives. Some companies would want everyone to believe that open is what really matters whereas what matters is security, performance, costs, simplicity, and innovation. Using open should be at the service of these goals, not a goal unto itself at customers’ expense.

Open is often understood to encompass two broad characteristics: open standards and open source. In the appropriate context, these characteristics can enhance value to users of technology systems. However, these characteristics are not universally positive or without drawbacks. Many organizations that have fallen into the trap of assuming that open is synonymous with innovative and cost-effective have learned the hard way that neither is the case.

We feel it is important to share the Snowflake perspective on technology, innovation, and what we have seen work with thousands of customers with regard to the next generation of open. Below we expand on those topics, starting from general principles and lessons learned through several decades of open, and then explaining the treatment of open employed at Snowflake to maximize the benefit to customers.

We want to emphasize our belief in the importance and impacts of the open source community. We are grateful for the community’s efforts, which propelled the big data revolution and much more. We hope, though, that sharing our perspective of where open is beneficial versus counterproductive will help everyone, from individuals contributing to open source projects all the way to enterprise customers.

Open Standards

A key aspect of the discussion of open is the notion of open standards. This encompasses file formats, protocols, and programming models, which themselves include languages and APIs. Even though open standards generally provide value to users and vendors alike, we believe it’s important to understand where they’re helpful to higher-level priorities of customers and where they may hinder such priorities.

File Formats

In the world of data management, the topic of formats receives significant attention. This emphasis is often clustered around portability of data and interoperability through direct file access.

For as long as database systems have been used, file formats have been an important consideration, in particular within the context of making migrations easier or harder and enabling or preventing vendor lock-in. It is true that some organizations have data tied down in legacy systems with massive efforts and user disruptions presented as the cost of ransom. It is true that some legacy systems focus on simplifying data ingestion and do little or nothing to simplify the process of getting data out. It is also true that application code and business logic often contribute more towards migration difficulties than the process of moving the data itself.

Open file formats are often presented as an important part of the answer to vendor lock-in and we agree with that premise. Interoperability and, in general, avoiding lock-in require enabling ingestion of the most common standard file formats as well as export of data in the same standard file formats. We argue that an important measure of open, when the desired goal is eliminating lock-in, is how much has been invested in making it easy to get data out of a given system or service.

Where the discussion on file formats takes a turn for the worse is around the belief that those open formats are the optimal way to represent data during processing. To make things even worse, the belief expands to portraying direct file access as a key characteristic of a data platform. Supporters of the argument state that direct file access to standard formats is the best way to enable interoperability and prevent vendor lock-in. We disagree with this premise and, more importantly, history has precedents that have informed our perspective.

At first glance, the idea of any data consumer or any application being able to directly access files in a standard, well-known format sounds appealing. Of course, that is until a) the format needs to evolve, b) the data needs to be secured and governed, c) the data requires integrity and consistency, and/or d) the performance of the system needs to improve. What about an enhancement in the file format that enables better compression or better processing? How do we coordinate across all possible users and applications so that they understand the new format? Or what about a new security capability where data access depends on a broader context? How do we roll out a new privacy capability that reasons through a broader semantic understanding of the data to avoid re-identification of individuals? How do we ensure transactional integrity of changes to data sets made by multiple applications? What about performance optimizations that can be achieved with additional information derived from the data files? Is it necessary to coordinate all possible users and applications to adopt these changes in lockstep? What happens if one of these is missed?

Decades of experience navigating through these very trade-offs give us a strong understanding of and conviction about the superior value of providing abstraction and indirection versus exposing raw files and file formats. We strongly believe in API-driven access to data, in higher level constructs abstracting away physical storage details. It’s not about rejecting open; it is about delivering better value for our customers. We balance this with making it very easy to get data in and out in standard formats.

A good illustration of where abstracting away the details of file formats significantly helps end users is compression. An ability to transparently modify the underlying representation of data to achieve better compression translates to storage saving, compute savings, and better performance. Exposing the details of file formats makes it next to impossible to roll out better compression without causing long migrations, breaking changes, or adding complexity for applications and developers.
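To make the compression point concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how Snowflake stores data: the class name TableStore, its methods, the JSON row encoding, and the use of the zlib and lzma modules from the Python standard library are all hypothetical choices. Callers only ever see rows, so the store is free to re-encode its blocks without any caller changing.

```python
import json
import lzma
import zlib


class TableStore:
    """Hypothetical store: callers see rows, never the bytes on disk."""

    # Codec name -> (compress, decompress). Both modules are in the stdlib.
    _CODECS = {
        "zlib": (zlib.compress, zlib.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }

    def __init__(self, codec="zlib"):
        self._codec = codec
        self._blocks = []  # list of (codec_name, compressed_bytes)

    def append(self, rows):
        """Encode a batch of rows with whatever codec is current."""
        raw = json.dumps(rows).encode("utf-8")
        compress, _ = self._CODECS[self._codec]
        self._blocks.append((self._codec, compress(raw)))

    def scan(self):
        """Yield rows; the physical encoding stays private to the store."""
        for codec, payload in self._blocks:
            _, decompress = self._CODECS[codec]
            yield from json.loads(decompress(payload).decode("utf-8"))

    def recompress(self, codec):
        """Rewrite every block with a new codec; readers are unaffected."""
        compress, _ = self._CODECS[codec]
        new_blocks = []
        for old_codec, payload in self._blocks:
            _, decompress = self._CODECS[old_codec]
            new_blocks.append((codec, compress(decompress(payload))))
        self._blocks = new_blocks
        self._codec = codec


store = TableStore()
store.append([{"id": i, "region": "emea"} for i in range(1000)])
before = list(store.scan())
store.recompress("lzma")  # change the physical representation
after = list(store.scan())  # same rows through the same interface
```

If applications had instead read the compressed blocks directly, the call to recompress would have been a breaking change for every one of them.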

Security and data governance are another area where well-intentioned but rigid or even dogmatic application of open can lead to serious issues. Customers want to sleep well at night knowing that policies are centralized and they don’t have to protect against accidental direct access that may bypass defenses. We reject the notion that security of data and enforcement of access control needs to be left for a variety of users and applications to coordinate and get right. Key management done properly is a complex problem in itself; why add complexity and risk? The stakes are too high.

Governance goes beyond security and for us includes integrity, consistency, and privacy of data. Consider maintaining transactional integrity of as many files as needed based on user and application intent. The notion of direct file access needs to wrestle with this type of trade-off and either give up on governance (but then, what is the value of such a solution?) or adopt an appropriate abstraction in the programming model.
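One way to picture an appropriate abstraction for multi-file integrity is a catalog that publishes an immutable snapshot of the files that make up a table. The Python sketch below is a simplified illustration, not Snowflake's design: the Catalog class, its methods, and the file names are hypothetical. A commit that touches several files becomes visible all at once or not at all, which is a guarantee that uncoordinated direct file access cannot easily provide.

```python
import threading


class Catalog:
    """Hypothetical catalog: a table is the file set in the current snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshot = frozenset()  # immutable set of visible file names

    def snapshot(self):
        """Return a consistent, immutable view of the table's files."""
        return self._snapshot

    def commit(self, add=(), remove=()):
        """Apply all file changes atomically, or none of them."""
        with self._lock:
            current = self._snapshot
            missing = set(remove) - current
            if missing:
                # Validation fails before anything is published.
                raise ValueError(f"cannot remove unknown files: {sorted(missing)}")
            # A single reference swap makes the whole change visible at once.
            self._snapshot = frozenset((current - set(remove)) | set(add))


catalog = Catalog()
catalog.commit(add=["part-001", "part-002"])
reader_view = catalog.snapshot()  # a reader that started before the next commit
catalog.commit(add=["part-003", "part-004"], remove=["part-001"])
```

The earlier reader keeps its original view, while new readers see the full result of the second commit and never a partially applied one.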

The history of database systems has plenty of examples like ISAMs or CODASYL showing us that physical access to data leads to an innovation dead end. More recently, adopters of Hadoop found themselves managing costly, complex, and unsecured environments that didn’t deliver the promised performance benefits. In a world with direct file access, introducing new capabilities translates into delays in realizing the benefits of those capabilities, complexity for application developers and, potentially, governance breaches. This is another point arguing for abstracting away the internal representation of data to enable more value to our customers, while supporting ingestion and export of open file formats.

Open Protocols and APIs

Data access methods are more important than file formats. We all agree that avoiding vendor lock-in is a key customer priority. While some may claim that open formats are the solution, the heavy lifting in any migration effort is about code and data access, whether it’s protocols and connectivity drivers, query languages, or business logic. Those who have gone through a system migration can likely attest that the topic of file formats is just a red herring.

For us, this is where open matters most. This is where the infamous (and costly!) lock-in can be avoided, where data governance can be maximized, and where all sorts of innovation is enabled to the benefit of customers. Focusing on open protocols and APIs is a key element for avoiding complexity for users and enabling continuous and transparent innovation.

Open Source

Started in the late ’90s, open source represents the practice where the source code of technology solutions is available for others to read, modify, or copy under a variety of license terms, some more permissive than others. Key characteristics and goals of open source include the ability to provide greater control and understanding of technology implemented by an organization, increased security based on transparency, an assumption of lower cost, and a development model that is decentralized with community collaboration at the forefront.

Open source can deliver against these goals primarily when technology solutions are delivered as installable software that is deployed within customers’ data centers or security domains. This was the prevalent deployment model during the last couple of decades. We’re now undergoing one of the most substantial changes in our industry: delivering solutions as managed services, which shifts the need for and impact of the above-mentioned characteristics. Let’s look at each of them in more detail and how the reality of cloud-managed services is changing this.

Control and Understanding

It is reasonable to assert that software delivered as a compiled black box may be difficult to understand and that making source code available is the solution. For certain systems and for certain specialized users this may be true. But for a large majority of users and organizations this may not be the case.

As an example, the query processor of a sophisticated data platform is typically built by dozens of PhD program graduates and then evolved, refined, and optimized over years. Source code availability may not significantly increase the ability to comprehend its inner workings. Instead, there may be greater value in surfacing data, metadata, and metrics that provide clarity to customers, as opposed to publishing source code and declaring the mission accomplished.

Another aspect of this discussion is the desire to copy and modify source code. While we believe this indeed provides value and optionality to certain organizations that can invest to build these capabilities, we’ve seen many unintended or undesired consequences including fragmentation of platforms, incompatible forks, less agility to implement changes, and competitive dysfunctions.

Increased Security

This has traditionally been one of the main arguments in favor of open source delivery. We agree that if an organization is to deploy software within their security perimeter, source code availability can increase the confidence about its security if the organization can inspect what it does and how it does it.

Much has been said recently about the supply chain of software systems and the risks of the traditional embedded deployment model. Complex solutions may aggregate many layers of software subsystems provided by several organizations without an understanding of the full end-to-end impact of the security of the system.

Luckily there is a better model. It applies to every organization. And that model is not the cloud. The cloud in and of itself does not substantially address the security challenge. Many software solutions get adapted to run on a cloud virtual machine or a cloud container and that does not change the security posture or risks. Similarly, deploying all sorts of software solutions within an organization’s virtual private cloud does not change the fundamental risks either.

The better model is the deployment of technology solutions as (cloud) managed services. Encapsulation of the inner workings of the service allows for its fast evolution and speedy delivery of innovation and improvements to customers. With additional focus, services remove configuration burden and eliminate provisioning and tuning effort. The security benefits of this approach are detailed in the later section on how open is used at Snowflake.

Lower Cost

Many also associate open source with lower cost. “You don’t have to pay for the software” is equated to free, casting commercial software as infinitely more expensive. But this doesn’t take into account many other costs, including maintenance and support. What is the cost of deploying and running the software, of updating it, and of fixing it when it breaks? There is such complexity in this that in the last few years the entire business model around open source has shifted to companies monetizing the hosting and operation of open source. Increasingly there’s the need to purchase the broader solution that builds on the open source but provides proprietary extensions to simplify or improve the software. It was supposed to be free!

Building on the better model outlined above—cloud managed services—we believe that lower cost should be measured in terms of total cost and price/performance available out of the box for customers. Instead of creating complexity and selling simplification, ease of use needs to be delivered in the first place. Having to manage versions, work around maintenance windows, fiddle with knobs, etc., are all major steps backwards relative to what the cloud enables. Misguided application of open deprives customers of those benefits in the name of an abstract concept.

Community

One of the more powerful aspects associated with open source is the topic of community, by which a group of users works together towards the success of a technology solution, helps improve it, and helps one another. Note however that collaboration does not need to imply source code contribution. We think of community as users helping one another, sharing best practices, and discussing future directions for the technology.

As the migration and modernization of on-premises solutions towards cloud solutions and managed services continue, these topics of control, security, cost and community recur. What is most interesting to us is that the original goals of open source are being met without necessarily providing source code for everyone—which is where we started this discussion: Let’s not lose sight of the desired outcomes by focusing instead on specific tactics that may no longer be the best route to those outcomes.

Open at Snowflake

At Snowflake, we think about first principles, about desired outcomes, about intended and unintended consequences and, most importantly, we’re always focused on what is best for our customers. As such, we don’t think of open as a blanket non-negotiable attribute of our platform, but we are instead very intentional in choosing where and how to embrace open.

Our goals and priorities are clear: 1) Deliver the highest levels of security and governance; 2) Provide industry-leading performance and price/performance through continuous innovation; 3) Continue setting the highest levels of quality, capabilities and ease of use to enable our customers to focus on deriving value from data without the need to manage infrastructure. We also focus on ensuring our customers use and leverage Snowflake because they want to, and not because they’re locked in. To the extent that open standards, open formats, and open source help us achieve these goals, we embrace them, in line with principles and observations outlined in the previous section. But, when open threatens these goals, our priorities and our customer obsession dictate against it.

Open Standards at Snowflake

With those principles in mind, Snowflake has fully embraced standard file formats, standard protocols, standard languages, and standard APIs. We are very intentional and careful about where and how we do so in order to maximize the value we provide to our customers. In particular, we have invested heavily in the ability to leverage the full capabilities of our parallel processing engine to enable customers to get data out of Snowflake quickly, should they choose or need to do so. However, abstracting away the details of our internal low-level data representation allows us to continually improve our compression—completely transparently to our users—and deliver other optimizations. The result for customers is lower costs and better performance.

We are able to advance the controls for security and data governance quickly, without the additional burden of managing direct (and brittle) access to files. Similarly, transactional integrity at Snowflake benefits from the introduced level of abstraction and from not exposing underlying files directly to users.

On the topic of integrity of data: Snowflake has implemented multi-table transactional consistency since its inception. The query engine coordinates changes and ensures ACID properties are maintained while providing concurrency across workloads and users. But integrity goes beyond transactions: Snowflake uses a variety of capabilities to track physical and logical data attributes which, in turn, are leveraged to deliver a variety of functional and performance benefits.

Also, we have introduced all sorts of new capabilities that improve query and application performance transparently for our customers, and whose introduction benefited greatly from abstraction and indirection. Examples include the transparent matching of materialized views and our Search Optimization Service.

Similarly, we fully embrace open protocols, languages, and APIs. This includes open standards for data access, traditional APIs such as ODBC and JDBC, and also REST-based access. Supporting the ANSI SQL standard is likewise key to query compatibility while offering the power of a declarative, higher-level model. Other examples embraced by Snowflake include enterprise security standards such as SAML, OAuth, and SCIM, as well as numerous technology certifications.

With proper abstractions in the domain of file access and formats, and with open promoted where it matters, open protocols allow us to move faster (we do not need to reinvent them), allow our customers to re-use their knowledge, and enable fast innovation due to abstracting the “what” from the “how.”

Open Source at Snowflake

Snowflake delivers a small number of components that get deployed as software solutions into our customers’ systems, such as connectivity drivers (for example, the JDBC or Python connectors) or our Kafka connector. For all of these we provide the source code. Our goal is to enable maximum security for our customers: we deliver our core platform as a managed service, and we increase the peace of mind for installable software through open source.

As noted earlier, misguided application of open can create costly complexity instead of low-cost ease of use. We are very aware of those pitfalls, as ultimate usability is our core tenet. Offering stable, standard APIs but not opening up Snowflake’s internals allows us to quickly iterate, innovate, and deliver value to customers. In addition, customers cannot create dependencies on internal implementation details, whether deliberately or unintentionally, because we have encapsulated them behind APIs designed following solid software engineering practices. That is a major benefit for both sides, and it’s key to maintaining our weekly cadence of releases, to continuous innovation, and to resource efficiency, about which we obsess and which directly benefits price/performance. We’ve consistently heard validation from customers who migrated to Snowflake that those choices and decisions are much appreciated.

The interface to our fully managed service, run in its own security perimeter, is the contract between us and our customers. We can do this because we understand every component running and we devote a great amount of resources to engineer for security. Snowflake has been evaluated by security teams across the full gamut of company profiles and industries including highly regulated ones such as healthcare and financial services. The system is not only secure, but the separation of the security perimeter through the clean abstraction of a managed service simplifies the job of securing data and data systems for our customers.

On a final note, we love our user groups, our customer councils, and our user conferences. We fully embrace the value of a vibrant community, open communications, open forums, and open discussions. Open source is an orthogonal concept, from which we do not shy away. For example, Snowflake collaborated on open sourcing FoundationDB (FDB), and made numerous and significant contributions to evolving FDB further. However, we do not extend this very positive experience to state that there is a particular merit to open source software. We could have equally well used a different operational store and a different model of making it suit our requirements if needed. The FDB example illustrates our key thesis well: Open is a great collection of initiatives and processes, but it’s just one of many tools. It is not the hammer to hit all nails and is the best choice only in some situations. In others it hinders progress or is cynically presented as a “higher calling” to promote dubious causes. As always, customers must carefully consider which priorities are most important to them.

Conclusions

At Snowflake we believe in open where open matters. We do not believe in open as a principle to follow blindly without weighing the trade-offs. We believe in the value of open standards and open source, but also in the value of data governance and security; we believe in the value of ease of use, the power of community, and the value of abstractions that enable transparent optimizations and improvements over time. Some companies tout being open and pride themselves on being open source, but in reality their embrace is not 100%; as described in this document, there are good reasons dictating such departures. Our goal is to be clear and transparent about how we think about these topics at Snowflake and to dispel myths and misconceptions.

We enjoy taking complex technology and simplifying it so our customers can spend the bulk of their time getting value out of data rather than managing infrastructure. We remain committed to open sourcing components that get deployed in customer premises or security perimeters, and to importing and exporting open formats. We remain committed to standards-based APIs and programming models. Above all, we remain committed to continuing to innovate, to raising the bar of what’s possible, and to elevating standards for our industry with no other goal than increasing the data capability of our customers.
