The Best Way to Gauge Performance of a Cloud Data Warehouse
Author: Michael Nixon
Raise your hand if you’re a CTO or platform architect who decided to choose a particular data analytics or cloud data warehouse solution based solely on a performance benchmark report marketed by a vendor or a third-party consultant.
While it’s critical to understand the performance of a data analytics platform before you make your commitment, we’d be willing to bet very few of you raised your hand. So, what is the best method to determine optimal performance of a data warehouse?
Defining a real-world test
Proofs of concept (PoCs), which evaluate a solution within an actual environment, provide the best assurance that the data analytics and data warehouse solutions under test will satisfy the needs of your organization. PoCs also keep you from falling into the trap of a third-party benchmark that is too narrowly defined or one-sided. With PoCs, you define and control the evaluation and queries based on your own real-world scenarios. You can also test the operational performance of a data warehouse (not just query performance) based on your specific data needs.
Because data often arrives in a data warehouse from multiple sources, customers frequently tell us that onboarding new data is a significant pain point. What’s more, in an effort to focus on query performance, benchmark marketing reports often avoid discussing the impact of data ingestion (i.e., data I/O). By underreporting the time and effort necessary to onboard and load new data, these reports fail to give a clear picture of the human resources required.
Additionally, in benchmark reports, data partitioning and distribution are mapped in favor of the MPP architecture as a means to speed up the benchmark. This also underreports the time and effort required to distribute or redistribute data in a real environment.
Executing a PoC requires more effort than reading a report, but what you learn about operational and query performance will be worth it in the long run.
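To make the point about operational performance concrete, here is a minimal Python sketch that records wall-clock time for each phase of a PoC run rather than for queries alone. The phase names and the `load_data` and `run_queries` functions are hypothetical stand-ins, not part of any product; in a real PoC they would call your own ingestion jobs and your own queries.

```python
import time
from typing import Callable, Dict


def time_phases(phases: Dict[str, Callable[[], None]]) -> Dict[str, float]:
    """Run each phase once, in order, and return elapsed seconds per phase."""
    timings = {}
    for name, phase in phases.items():
        start = time.perf_counter()
        phase()
        timings[name] = time.perf_counter() - start
    return timings


# Hypothetical stand-ins: replace with real ingestion and query calls.
def load_data() -> None:
    time.sleep(0.02)


def run_queries() -> None:
    time.sleep(0.01)


timings = time_phases({"load": load_data, "query": run_queries})
total = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f}s ({seconds / total:.0%} of total)")
```

Reporting each phase as a share of the total makes it easy to see when loading, not querying, dominates the time to a result.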
How to gauge the performance of a cloud data warehouse delivered as a service
As part of the PoC planning process, ask yourself these questions:
- Does my data warehousing plan include implementing self-service analytics?
- Does my plan include empowering multiple cross-functional teams, including executives, data scientists, data analysts, program managers or BI users?
- Does my data warehouse environment require ingesting new data on a frequent basis?
- Do I work with multiple data formats, such as CSV, JSON, Parquet or ORC?
- Will I need to expand my data warehouse to match anticipated business growth?
If you answered “yes” to any of these questions, make sure your PoC plan specifically accounts for testing the manageability, ease of use, and performance of the cloud data warehouse environment across all skill levels and functional workgroups. This will be especially important for growing businesses and organizations.
Software can scale but people don’t
Query performance is an important factor to test with PoCs, but it is not the only one. For instance, you don’t want to offset query performance gains with time-consuming management and administration overhead. In a real environment, if a data warehouse requires the intervention of a data engineer, a data warehouse admin or a technical support person every time you need to scale a cluster, it adds to your overall costs, delays results and hinders your self-service requirements. That is not the direction you want to go in this age of cloud services.
Some cloud data warehouses take hours, if not days, to scale, and they require operator intervention as well. Other cloud data warehouses scale in 3 to 5 minutes, but still require operator intervention. Further, that 3 to 5 minutes applies to each node added. If you need to add 3 nodes, you may be looking at a total of 9 to 15 minutes of overhead and manual effort by an operator just to be in a position to run queries with a marginally larger configuration. You want to evaluate all of this.
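The arithmetic behind that estimate can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes nodes are added one after another, so the per-node time simply multiplies.

```python
def scaling_overhead_minutes(nodes_added, per_node_minutes=(3, 5)):
    """Return the (low, high) scaling overhead in minutes.

    Assumes nodes are added serially, each taking between
    per_node_minutes[0] and per_node_minutes[1] minutes.
    """
    low, high = per_node_minutes
    return nodes_added * low, nodes_added * high


low, high = scaling_overhead_minutes(3)
print(f"Adding 3 nodes: {low} to {high} minutes of scaling overhead")
# prints "Adding 3 nodes: 9 to 15 minutes of scaling overhead"
```

Plugging in the node counts and per-node times you actually observe during a PoC gives a quick view of how scaling overhead grows with cluster size.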
The same caution applies to managing data consistency across separate workgroups and workloads and consolidating different types of data formats. The goal of a cloud data warehouse is to produce the fastest total time to insights possible (and ultimately fast time-to-market), including the time required to manage and scale the environment.
Seek balanced benchmarks
Still, benchmarks can be useful tools for gauging cloud data warehouse performance. Because no two environments are the same, at Snowflake we frequently measure the performance of our data warehouse-as-a-service using balanced benchmark tests derived from the Transaction Processing Performance Council (TPC) TPC-DS benchmark suite. Doing so reflects the workload variety likely to be seen across different business intelligence and analytical environments.
The TPC-DS benchmark suite includes 99 different queries spanning reporting, analytics, interactive and mixed workloads. This includes ad-hoc queries with changing sets of data. The diversity of queries provides a better cross-section for organizations to test and evaluate than more narrowly defined, one-sided benchmarks that use a limited number of queries.
We also run performance tests using 10 TB, 100 TB, and 1 PB data warehouse sizes with data sets provided by the TPC. Concurrency performance and JSON data sets are also tested regularly.
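If you want to time a varied query mix in your own PoC, a harness along the following lines is one way to start. This is only an illustrative sketch, not how the tests described above are run: it uses Python's built-in `sqlite3` module as a local stand-in so it runs anywhere, and the `sales` table and both queries are hypothetical. In a real evaluation you would swap in the warehouse's own client, your own data and your own queries.

```python
import sqlite3
import statistics
import time

# sqlite3 is only a local stand-in so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east" if i % 2 else "west", float(i)) for i in range(10_000)],
)

# Hypothetical query mix: one reporting-style query, one ad hoc filter.
queries = {
    "report_by_region": "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "adhoc_filter": "SELECT COUNT(*) FROM sales WHERE amount > 5000",
}


def time_query(sql, runs=5):
    """Return the elapsed seconds of each run of one query."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)
    return samples


results = {name: time_query(sql) for name, sql in queries.items()}
for name, samples in results.items():
    print(f"{name}: median {statistics.median(samples) * 1000:.2f} ms, "
          f"max {max(samples) * 1000:.2f} ms over {len(samples)} runs")
```

Reporting a median and a maximum per query, rather than one aggregate number, helps show whether a result holds across the whole mix or only for a few favorable queries.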
Try, then buy
Dynamic, data-driven organizations of today have constantly changing data needs. Plus, different parts of an organization need specific insights from different collections of data.
We encourage every organization interested in a cloud data warehouse to run their own PoC. Take advantage of our $400 worth of free usage (compute and storage) for 30 days. Or, if you prefer a more structured evaluation, engage with a Snowflake representative to kick off a formal PoC process.
The choice is yours. We provide a true data warehouse-as-a-service experience. There’s nothing like trying Snowflake for yourself and experiencing the cloud data warehouse performance difference.