Data Lake or Data Swamp?
Author: Brad Ganas
I’ve been in many conversations about using a “data lake” as a new component in a big data solution. The idea is to take raw data and land it in a system (often Hadoop and HDFS) where it can be stored and, when needed, processed to create data sets for other applications and users. In theory, a data lake avoids the need to build a complex (and often expensive) data pipeline simply to collect and store diverse data when you don’t necessarily know in advance how it will be used.
The Common Data Lake Project Scenario
So how is that working in practice? I recently met with a customer who is in the midst of deploying a HDFS-based data lake. The description they shared was something that’s becoming increasingly familiar everywhere I go. It went something like this:
“We’ve selected [major Hadoop distro] and have been working on deployment for the past eight months. We’re a little behind because it took a month to get hardware purchased, installed and configured in our data center. We’re not in production yet and probably won’t be anytime soon. Our use case is data and ETL offload, as it’s expensive to store data on our [major MPP database vendor]. Hive performance is not exactly what we thought it would be at scale. Hive is ‘OK’, but it’s not ANSI SQL. Redesigning our ETL processes for data warehousing is more complex than we thought, and getting Hadoop security figured out isn’t exactly straightforward.”
Eight months is a long time in today’s business world -‐ and to be not even close to production at that point is almost guaranteed to prompt questions from the business as to what they invested in.
For organizations interested in bringing together diverse data, a Hadoop-based data lake is often assumed to be the only option. Why? Because for the semi-structured data (e.g. JSON, Avro, XML, etc.) that makes up an increasing share of the data they want to collect, alternatives like relational databases are not appropriate—they require data transformation, don’t handle flexible schemas well (if at all), and can be very expensive.
Hadoop doesn’t require data transformation nor fixed schemas and the software cost is far less than commercial databases, but when it comes to structured and semi‐structured data there are lots of things it can’t do without a lot of custom programming and additional tools. Although a tiny number of sophisticated organizations with specialized resources have deployed a Hadoop-based data lake, for most it’s been many miles of rough road. They’ve come to find that performance, easeof‐use, productivity, and security don’t generally come ‘out of the box’ and can in fact be quite difficult to achieve. Many find it complex to have a completely separate system for the purpose of extracting data they need from semi-structured content only to put that data back into a relational system for analysis.
That’s why a lot of data lake projects become data swamps—projects that suck up significant time and resources, but leave data in a murky pool that’s hard to see into.
Rethinking the Data Lake
What would a system look like that provided the benefits organizations want when considering a data lake, but without requiring additional systems, complexity, and tooling? It would provide:
- Cost-effective, elastic storage of diverse data sources and types. Organizations want to store massive volumes of disparate data in a cost-effective and efficient manner. That requires true elasticity (not just single-direction scalability), able to both grow infinitely and shrink as needed, all at a low cost.
- Flexibility to support ad hoc integration of those data sources. More than just dumping data into a container, organizations need mechanisms to access, analyze, and share insights gleaned from on-the-fly exploration of data–without first going through a months-long, arduous process of building data models and data integration scripts to load it into a monolithic database only for it to become obsolete before any real usage. Flexible data formats like JSON and Avro put a premium on efficient integration of this data with existing structured data.
- Fast SQL-based processing of data. SQL has been and will continue to be the de facto standard for querying business data. Of course, speed counts too. Having SQL is great but if it’s unusably slow, what’s the point? People need performance and concurrency scalability without complex tuning, sharding, or partitioning.
The Snowflake Approach
Snowflake is an entirely new kind of database. It’s a relational database built from scratch, not derived from some other database or data platform. Snowflake allows users to securely and cost-effectively store any volume of data, process semi-structured and structured data together, and provide fast processing using ANSI SQL at any scale. It offers the benefits that organizations are seeking in data lake projects, but without sacrificing ease of use and fast analytics. Let’s look at how in more detail.
Getting Data in
Data can be easily loaded and/or migrated into Snowflake. One option for loading data is to use Amazon S3 for staging before loading data into Snowflake tables using Snowflake’s bulk loading command (a SQL COPY command). Because Snowflake decouples data storage from compute, storage is low-cost, near infinite, and can start as small as you’d like and grow to petabyte scale without any up-front capacity planning.
If you have ETL tools in your organization, they can simply connect to Snowflake through standard ODBC/JDBC connectivity to move data back and forth as needed. This approach works well including for frequent incremental data loads.
Bringing Data Together
A major advantage of Snowflake is that there’s no need to pre‐process semi-structured datasets. JSON, Avro, XML can be loaded as-is and then analyzed via standard SQL. Snowflake accomplishes this through a special data type called ‘VARIANT’. During load, Snowflake examines the data and then stores it in an efficient compressed fashion. Statistics are also captured for use in relational query optimization.
With Snowflake, you can perform SQL-based analysis across disparate databases (not just across schemas). This is a subtle but important capability in Snowflake, as it allows you to just load and query. There’s no requirement to move data sources into one common data model or set of tables. You truly have freedom of exploration across disparate datasets.
Fast Data Processing
Snowflake is designed to provide high performance, SQL-based analysis. To do that, users simply create a compute cluster, called a virtual warehouse, to perform operations. Each virtual warehouse can be resized on‐the‐fly and be set to turn on or off automatically for cost savings.
Data can be explored via ANSI SQL statements on your data including your semi‐structured datasets like JSON, Avro, and XML. For example, you can have a table that stores nothing but JSON data (in its native format without any pre-processing) and write SQL against that document using a simple notation syntax. Users can then easily integrate that data with data in other databases, schemas, tables, and views.
Of course, users can connect any BI tool to a Snowflake database via ODBC or JDBC connections.
A Word About Security
When bringing together lots of disparate data, security is immensely important. A huge difference between Snowflake and software like Hadoop is that Snowflake designed in robust security as the platform was being built, providing and extraordinarily strong offering.
For example, Snowflake offers the following:
- Role‐based access control and privilege management
- Two‐factor authentication
- Encryption of all data in flight via HTTPS
- Encryption of all data at rest. We employ key rotation and rekeying for data at rest.
- End‐to‐end encryption:
- Client provides encrypted data
- Snowflake processes data securely
- Snowflake outputs encrypted data
- Snowflake client tools (Web UI, JDBC/ODBC drivers) use HTTPS to communicate with the Snowflake service
Snowflake is a compelling offering for anyone looking for the benefits of a data lake for structured and semi‐structured data. The performance of SQL‐based processing in Snowflake is extraordinary and provides a cost effective solution. Bottom line, Snowflake is >10x faster than any SQL-on‐HDFS solution, offers better TCO than Hadoop, and is a much simpler environment to use.