What is a Data Warehouse?
A data warehouse is a relational database that is designed for analytical rather than transactional work. It collects and aggregates data from one or many sources so it can be analyzed to produce business insights. It serves as a federated repository for all or certain data sets collected by a business’s operational systems.
Data Warehouse vs. Database
A data warehouse focuses on collecting data from multiple sources to facilitate broad access and analysis. They specialize in data aggregation and providing a longer view of an organization’s data over time. A data warehouse is optimized to store large volumes of historical data and enables fast and complex querying of that data. Standard operational databases focus on transactional functions such as real-time data updates for ongoing business processes.
Why Might You Need a Data Warehouse?
A data warehouse has two key functions. First, it serves as a historical repository for integrating the information and data that is needed by the business, which may come from a variety of different sources. Second, it serves as a query execution and processing engine for that data, enabling end users to interact with the data that is stored in the database.
Complex queries are very difficult to run without a temporary pause of database update operations. A frequently paused transactional database will inevitably lead to data errors and gaps. Therefore a data warehouse serves as a separate platform for aggregation across multiple sources and then for analytics tasks across those diverse sources. This separation of roles allows databases to remain focused on purely transactional jobs without interruption.
A data warehouse is usually a relational database, traditionally housed on an enterprise server. Today, cloud-built and hybrid cloud data warehouses are becoming more common and popular. A pure cloud data warehouse allows businesses to easily scale compute resources up, down, or even out to handle increased volume and concurrency demands. It also allows organizations to easily facilitate data sharing without having to move data via ETL or other means. The more sophisticated cloud data warehouses can also easily ingest and aggregate both semi-structured data (such as JSON) and structured data in unified relational SQL views. This allows businesses in the age of mobile and IoT to analyze and share disparate data sources with minimum effort, speeding time to insight. To learn more, read our blog post, “Why You Need a Cloud Data Warehouse.”
A data warehouse usually consists of data sources from operational and transactional systems (ERP, CRM, finance apps, IoT devices, mobile and online systems) as well as:
- A data staging area for aggregation and cleaning
- A presentation/access area where data is warehoused for analytics (querying, reporting) and sharing
- A range of data tool integrations or APIs (BI software, ingestion and ETL tools, etc.)
Most platforms, whether in the cloud or otherwise, use an older “shared nothing” architecture. This architecture tightly couples storage, compute, and database services, which severely hampers the ability of the database administrator to elastically scale the database to respond to the need to store or analyze more data or to support more concurrent users.
Data warehouses can also supply decentralized data marts where a subset is made available for the analytics needs of specific business groups.
Data Warehouse vs. Data Lake
These are both widely used terms for storing big data, but they are not interchangeable. A data lake is a vast pool of raw data —often a mix of structured, semi-structured , and unstructured data — which can be stored in a highly flexible format for future use.. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
Snowflake: A different architecture
Snowflake has built a pure cloud, SQL data warehouse from the ground up. Designed with a patented new architecture to handle all aspects of data and analytics, it combines high performance, high concurrency, simplicity, and affordability at levels not possible with other data warehouses.
Snowflake physically separates but logically integrates storage, compute, and services (like metadata and user management). Because each one of these components is separate, they can be expanded and contracted independently, enabling Snowflake to be more responsive and adaptable.
Snowflake employs a central persisted data repository that is accessible from all compute nodes. But similar to shared-nothing architecture, Snowflake processes queries using MPP (massively parallel processing) compute clusters. In this set-up, each node in the cluster stores a portion of the entire data set locally.
Snowflake can also serve as your data lake while maintaining at-cost spend for cloud data storage. A Snowflake data lake can natively ingest and query a wide variety of disparate data types — from JSON, CVS, and tables to Parquet, ORCC and more — in a relational format and with full transactional ACID integrity.