During the 2019 Snowflake Summit World Tour, I was fortunate to speak to very engaged audiences of Snowflake prospects about the challenges many of them were facing implementing a data lake strategy.
I mentioned that the amount of information that has circulated over the years just to define a data lake and clear up confusion far exceeds the amount of information discussing data lake successes.
As far as definitions go, the gold standard is the one put forth by James Dixon on his blog back in 2010. Dixon, then CTO of Pentaho (a Hadoop distribution provider at the time), originated the data lake concept, and most data lake definitions are variations of his. Even today, ten years later, new definitions are still forthcoming. Therefore, in my talks, I focused on an area that is ultimately more interesting to organizations than definitions: the assumptions that went into requiring a data lake computing framework in the first place. The question that needs to be asked is whether these assumptions hold true now that data strategies are being adopted in the cloud, the future for data and analytics, rather than on premises with Hadoop-based solutions.
As mentioned in Dixon’s blog, the common attributes (among the companies he spoke with) that drove the need for a Hadoop data lake solution were as follows:
- 80%–90% of the companies were dealing with structured or semi-structured (not unstructured) data.
- The data source was typically a single application or system.
- The data was typically sub-transactional (i.e., not transactional).
- There were some known questions to ask of the data.
- Many unknown questions would arise in the future.
- Multiple user communities had questions to ask of the data.
- The data was of a scale or daily volume such that it wouldn’t fit technically or economically into an RDBMS.
In the end, he concluded the concept of a data lake computing framework was necessary, because relational database technologies of the time could not technologically or economically handle the volume and variety of data flooding organizations.
From the context of 2020, and given that cloud-based approaches will be the dominant deployment type within just a few years (according to IDC), let’s review each of the attributes from Dixon’s blog post and assess whether it holds true today. Before you read my assessment of each attribute, consider it yourself and decide whether you think it still holds.
80% to 90% of organizations deal with structured or semi-structured data
True or not true today? It’s practically impossible to be a modern company without a variety of data sources generating a mixture of structured and semi-structured data. Weblogs, along with the information needed to track how visitors navigate and engage with your web pages, are usually captured in semi-structured, document-style formats such as JSON. The same is true for application-generated data, as well as for event data in general from use cases such as IoT sensors and games. Based on Snowflake’s interactions with companies, this figure is likely well over 90%. Therefore, this attribute is very much true today.
The data source is typically a single application or system
True or not true today? Extending the points from the previous attribute, it’s difficult to be a modern company without multiple sources of data: sales data, customer relationship management (CRM) data, marketing data, social media data, and application data, to name a few. In most large organizations, the data sources easily number in the dozens. Supporting multiple sources of data is therefore a necessity, which makes this attribute not true, but in a positive way. The implication is that you’ll need a plan that allows your analytics environment to load multiple data sources simultaneously. To take your environment to the next level, adopt a plan that lets you load and write multiple sources of new data into existing tables while your users are simultaneously reading from those tables.
The data is typically sub-transactional
True or not true today? This attribute indicates that the data is not frequently changed or updated, or perhaps that the data set is not operational data. It implies that the data, in the majority of cases, will be read rather than transacted upon. The challenge with this attribute is that transactional must be defined. In fast-moving, data-driven organizations, users and communities want data sets updated as quickly as possible as new data arrives, and the insert and update process against an existing data set is transactional; no one wants to work with stale data. Snowflake frequently polls its prospects, and we know that slow onboarding of new data is a frequent frustration. Furthermore, you might not consider use cases such as live, real-time dashboards and in-game event analysis to be the same as financial OLTP use cases; however, dashboard updates and data events are transactional nonetheless. This is especially true for companies whose business is building products and services on these use cases. Given these exceptions, this attribute is not true today. Your analytics platform must be ready for ever-changing data from a variety of sources.
There are some known questions to ask of the data
True or not true today? Organizations generate data from known sources, know the nature of that data, and know what they want to query or report from it. This implies that if the nature of the data and the questions are both known, it’s easy to create a schema or model for the data in advance, and traditional RDBMS approaches will therefore work satisfactorily. This attribute remains true today.
Many unknown questions will arise in the future
True or not true today? This is the antithesis of the previous attribute, and it remains true in 2020. Back in 2010, when the data lake concept originated, the implication was that you didn’t know what you didn’t know; therefore, it was not possible (or at least it was difficult) to model the data in advance for querying in an RDBMS. Thus, a data lake was necessary to ingest, store, and analyze the data using a variety of tools from the Hadoop ecosystem attached to the lake. Taking this approach in the cloud in 2020, even when it is implemented on cloud storage instead of the Hadoop Distributed File System (HDFS), means mirroring the difficulties and complexities associated with Hadoop-based computing. Technology has evolved. With Snowflake, for instance, the schema can be inferred from many data types, including semi-structured data such as JSON, and the data is immediately available for querying with familiar SQL, without the complexity of Hadoop-style computing.
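The idea of inferring a schema from semi-structured records can be sketched in a few lines of Python. This is a toy illustration of the concept, not Snowflake’s implementation; the function name and the “variant” fallback type are hypothetical choices for this sketch:

```python
import json

def infer_schema(records):
    """Infer column names and types from a list of JSON objects.

    A toy illustration of schema inference over semi-structured data;
    real platforms handle nesting, type promotion, and missing keys
    far more robustly.
    """
    schema = {}
    for record in records:
        for key, value in record.items():
            inferred = type(value).__name__
            # If the same key appears with conflicting types, fall back
            # to a generic "variant" type (a hypothetical catch-all here).
            if key in schema and schema[key] != inferred:
                schema[key] = "variant"
            else:
                schema.setdefault(key, inferred)
    return schema

# Example: weblog-style JSON events where one field's type drifts.
events = [
    json.loads('{"user": "alice", "page": "/home", "duration_ms": 120}'),
    json.loads('{"user": "bob", "page": "/cart", "duration_ms": "95"}'),
]
print(infer_schema(events))
# {'user': 'str', 'page': 'str', 'duration_ms': 'variant'}
```

Once a schema like this exists, the records can be exposed to users as ordinary columns, which is what makes semi-structured data immediately queryable with SQL.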
Multiple user communities have questions to ask of the data
True or not true today? This is still true. In fact, because so many companies are striving to be data-driven, the number of user communities within an organization that need to ask questions of data is likely much higher in 2020 and beyond than it was in 2010. The implication for organizations is to implement a platform that allows for the easy expansion of users, communities, and workloads—for example, data science, data engineering, BI and analytics, and reporting—all within a common environment operating against a single source of truth. The more this can be accomplished without contention for resources or disruption to the environment as the platform grows, the better.
The data is of a scale or daily volume such that it will not fit technically or economically in an RDBMS
True or not true today? In 2010, many RDBMSs and data warehouses were indeed challenged to support growing and accelerating volumes of data, as well as a mix of data types such as structured and semi-structured data. This is what spurred the excitement about, and the necessity of, Hadoop. In 2020, in the cloud, many of these same systems implement the same architecture they had on premises. The technical data struggles therefore continue, and the cost easily runs into millions of dollars, so these systems are not very economical. If you thought this attribute was true, you are in complete agreement with every audience that answered “true!” in a loud chorus during my talks.
But wait; not so fast. Let’s turn the RDBMS assumptions upside down.
Snowflake offers a different kind of cloud data platform, with an architecture built for any cloud. This sets the Snowflake experience apart from the rest. Any volume of structured tables and semi-structured JSON data fits within Snowflake and can be queried directly with high performance. In addition, any number of users and communities can work with the same data without interference and without transactional inconsistencies. With data storage priced from $23 per TB per month (for compressed data) and Snowflake compute charges beginning at $0.0006 per second (with a one-minute minimum), many organizations using Snowflake experience dramatic cost savings compared to traditional RDBMS platforms and data warehouses hosted in the cloud. If you already have a data lake, Snowflake supercharges your queries with exceptional performance when SQL data transformations and queries are pushed into Snowflake and the results are returned to your data lake. Thus, this attribute is not true—if you use Snowflake.
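To make the economics concrete, here is a back-of-the-envelope calculation in Python using only the list prices quoted above ($23 per TB per month for compressed storage, $0.0006 per second of compute with a one-minute minimum). The workload figures are hypothetical, and actual billing depends on edition, region, and warehouse size:

```python
STORAGE_PER_TB_MONTH = 23.00   # USD per TB of compressed data per month
COMPUTE_PER_SECOND = 0.0006    # USD per second of compute

def monthly_cost(compressed_tb, compute_seconds_per_day, runs_per_day=1, days=30):
    """Rough monthly estimate from the list prices quoted above.

    A simplified sketch: the one-minute minimum is approximated as
    60 billed seconds per run per day. Real billing is more nuanced.
    """
    storage = compressed_tb * STORAGE_PER_TB_MONTH
    # Each day's compute is billed for at least 60 seconds per run.
    billed_seconds = max(compute_seconds_per_day, 60 * runs_per_day)
    compute = billed_seconds * COMPUTE_PER_SECOND * days
    return round(storage + compute, 2)

# Hypothetical workload: 10 TB compressed, 2 hours of compute per day.
print(monthly_cost(10, compute_seconds_per_day=2 * 3600))
# → 359.6
```

Even this rough sketch shows why per-second pricing changes the economics: compute is paid for only while it runs, rather than being provisioned for peak load around the clock.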
At Snowflake, we’re constantly innovating and challenging the status quo for what’s expected in the cloud. If you want to learn more, try Snowflake for free. If you want to become part of the data lake discussion, register for Snowflake Summit.
I will continue to elaborate on how the Snowflake cloud data platform and our multi-cluster, shared data architecture uniquely drive your data-driven initiatives forward. Stay tuned to the Snowflake blog and the Snowflake Twitter feed (@SnowflakeDB), or follow me on Twitter (@miclnixon1).