Data Lake: A Definition
A data lake is a repository of raw, unprocessed data, stored without a predetermined organization or hierarchy. It allows for the general storage of all types of data, from all sources.
Data lakes typically store a massive amount of raw data in its native formats. This data is made available on demand: when a data lake is queried, a subset of data is selected based on search criteria and presented for analysis.
The characteristics of data lakes that distinguish them from other types of big data storage are:
- Open to all data, regardless of type or source
- Data is stored in its original raw, untransformed state
- Data is transformed only when provided for analysis based on matching query criteria
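The three characteristics above can be illustrated with a minimal Python sketch. It is not a real data lake engine: the in-memory list `RAW_LAKE`, the sample records, and the helper names `read_records` and `query_purchases` are all hypothetical, chosen only to show raw payloads being kept in their native format and parsed at query time.

```python
import csv
import io
import json

# A toy "lake": raw payloads kept exactly as they arrived, tagged only with
# their source and format. All names and records here are hypothetical.
RAW_LAKE = [
    {"source": "web", "format": "json",
     "payload": '{"customer": "ana", "amount": "19.90"}'},
    {"source": "pos", "format": "csv",
     "payload": "customer,amount\nben,5.00\nana,12.10"},
    {"source": "iot", "format": "text",
     "payload": "sensor-7 temp=21.4"},
]


def read_records(entry):
    """Parse one raw payload into dict rows at read time."""
    if entry["format"] == "json":
        return [json.loads(entry["payload"])]
    if entry["format"] == "csv":
        return list(csv.DictReader(io.StringIO(entry["payload"])))
    # Formats this query does not understand are skipped, not deleted.
    return []


def query_purchases(lake, customer):
    """Select matching rows and transform them only for this query."""
    results = []
    for entry in lake:
        for row in read_records(entry):
            if row.get("customer") == customer:
                results.append({
                    "customer": row["customer"],
                    "amount": float(row["amount"]),  # typed at read time
                    "source": entry["source"],
                })
    return results


ana_rows = query_purchases(RAW_LAKE, "ana")
total = sum(row["amount"] for row in ana_rows)
```

The stored payloads are never modified. Only the rows that match the query are parsed into a typed form, and a different query could interpret the same payloads differently.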
The source- and format-agnostic nature of data stored in a data lake offers several benefits for businesses, including:
- Flexibility, as data scientists can quickly and easily configure queries
- Accessibility, as all users can access all data
- Affordability, as many data lake technologies are open source
- Compatibility with most data analytics methods
- Comprehensiveness, as a data lake combines data from all of an enterprise’s data sources, including IoT
Both data lakes and data warehouses are big data repositories. The primary difference between a data lake and a data warehouse is in how data is stored. A data warehouse typically stores data in a predetermined organization with a schema. A data lake does not have a predetermined schema. Also, whereas a data warehouse usually stores structured data, a data lake stores structured and unstructured data.
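The storage difference described above can be sketched in Python. As an assumption for illustration only, an in-memory SQLite table stands in for a data warehouse and a plain list of JSON strings stands in for a data lake. The table name, column names, and sample records are hypothetical.

```python
import json
import sqlite3

# Hypothetical incoming records; the second does not fit the warehouse schema.
incoming = [
    {"customer": "ana", "amount": 19.9},
    {"device": "sensor-7", "reading": "temp=21.4"},
]

# Warehouse-style: the schema is fixed before any data is loaded.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)"
)

rejected = []
for record in incoming:
    try:
        warehouse.execute(
            "INSERT INTO purchases (customer, amount) VALUES (?, ?)",
            (record.get("customer"), record.get("amount")),
        )
    except sqlite3.IntegrityError:
        # The record does not match the predetermined schema.
        rejected.append(record)

# Lake-style: every record is stored as-is, with no schema check on write.
lake = [json.dumps(record) for record in incoming]

warehouse_rows = warehouse.execute(
    "SELECT customer, amount FROM purchases"
).fetchall()
```

In this sketch the warehouse accepts only the record that fits its schema, while the lake keeps both records and leaves their interpretation to whoever reads them later.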
Comparison Chart: Data Lake and Data Warehouse
| |Data Lake|Data Warehouse|
|---|---|---|
|Type of data|Structured and unstructured from any source, raw|Structured, curated|
|Users|Data scientists, developers, and data analysts|Data analysts|
Data Lakes in the Cloud
The sheer volume of big data, particularly the unfiltered data of a data lake, often makes on-premises data storage impractical. Amazon S3 and Microsoft Azure Data Lake are cloud-based storage services, and Apache Hadoop is an open source framework that can run on premises or in the cloud. All three enable data storage of varying sizes and speeds for processing and analysis.
Snowflake’s platform provides the benefits of data lakes and the advantages of data warehousing and cloud storage. With Snowflake as your central data repository, your business gains best-in-class performance, relational querying, security, and governance. Alternatively, store your data in cloud storage from Amazon S3 or Azure Data Lake and use Snowflake to accelerate data transformations and analytics.