The Best Way to Gauge Performance of a Cloud Data Warehouse
Author: Michael Nixon
Market News, Snowflake Technology
Raise your hand if you’re a CTO or platform architect who decided to choose a particular data analytics or cloud data warehouse solution based solely on a performance benchmark report marketed by a vendor or a third-party consultant.
While it’s critical to understand the performance of a data analytics platform before you make your commitment, we’d be willing to bet very few of you raised your hand. So, what is the best way to gauge the real-world performance of a data warehouse?
Defining a real-world test
Proofs of concept (PoCs), which are evaluations of solutions within actual environments, provide the best assurance that the data analytics and data warehouse solutions under test will satisfy the needs of your organization. PoCs also prevent you from falling into the trap of a third-party benchmark that is too narrowly defined or too one-sided. With PoCs, you define and control the evaluation and queries based on your own real-world scenarios. You also can test the operational performance of a data warehouse (not just query performance) based on your specific data needs.
Because data often arrives in a data warehouse from multiple sources, customers frequently tell us that onboarding new data is a significant pain point. What’s more, in an effort to focus on query performance, benchmark marketing reports often avoid discussing the impact of data ingestion (i.e., data I/O). By underreporting the time and energy necessary to onboard and load new data, these reports fail to provide a clear picture of the human resources required.
Additionally, in most benchmark reports, data partitioning and distribution are mapped in favor of an MPP architecture as a means to speed up the benchmark. This also underreports the time and effort required to distribute or redistribute data in a real environment.
Executing a PoC will require more effort than reading a report, but the operational and query performance insights you gain will be worth it in the long run.
How to gauge the performance of a cloud data warehouse delivered as a service
As part of the PoC planning process, ask yourself these questions:
- Does your data warehousing plan include implementing self-service analytics?
- Does your plan include empowering multiple cross-functional teams, including executives, data scientists, data analysts, program managers or BI users?
- Does your data warehouse environment require ingesting new data on a frequent basis?
- Do you work with multiple types of data, such as CSV, JSON, Parquet or ORC?
- Do you require expanding your data warehouse to match anticipated business growth?
- Do you require analytics on separate workloads?
- Do you require data processing of multiple connections to the same database, simultaneously?
If you answered “yes” to any of these questions, make sure your PoC plan specifically accounts for testing the manageability, ease of use, performance and concurrency scaling capacity of the cloud data warehouse environment for all skill levels and functional workgroups. This will be especially important for growing businesses and organizations.
Software can scale, but people can’t
Query performance is an important factor to test with PoCs, but not the only important factor. For instance, you don’t want to offset query performance gains with time-consuming management and administration overhead. In a real environment, if a data warehouse requires the intervention of a data engineer, a data warehouse admin or a technical support person every time you need to scale a cluster, it would add to your overall costs, delay the time to receive results and obstruct your self-service requirements. Not the direction you want to go in this data-driven, cloud services age.
Along with requiring operator intervention, some cloud data warehouses take hours, if not days, to scale. Other cloud data warehouses scale in 3 to 5 minutes but still require operator intervention. Further, that 3 to 5 minutes is usually per node added. If you need to add 3 nodes, you may be looking at a total of 9 to 15 minutes of overhead and manual effort by an operator just to be in a position to run queries on a marginally larger configuration.
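That per-node overhead is worth quantifying up front in your PoC plan. A minimal back-of-the-envelope sketch (the 3-to-5-minutes-per-node figure is the example from above; substitute the numbers you actually measure for each candidate platform):

```python
# Hypothetical PoC-planning helper: estimates total scale-up overhead when a
# warehouse adds nodes one at a time. Replace the per-node range with the
# figures you measure during your own PoC.

def scale_up_overhead(nodes_added, per_node_minutes=(3, 5)):
    """Return the (min, max) total minutes of overhead for a sequential,
    per-node scale-up."""
    lo, hi = per_node_minutes
    return (nodes_added * lo, nodes_added * hi)

# Adding 3 nodes at 3 to 5 minutes each:
print(scale_up_overhead(3))  # (9, 15)
```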
In addition, many cloud data warehouses struggle with concurrency scaling and parallel processing. Concurrency can take the form of multiple connections to the same database, and parallelism can take the form of multiple separate workloads executing simultaneously. These platforms may stall as you push up the concurrency or run isolated workloads at the same time. How much time will it take to overcome these limitations? You want to evaluate all of this.
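One way to probe concurrency in a PoC is a small harness that fires the same query from many connections at once and records how total elapsed time grows as concurrency rises. A minimal sketch, assuming you supply a `run_query` callable backed by whatever client library your candidate warehouse provides; the stub below just sleeps to stand in for query latency:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_concurrency(run_query, levels=(1, 4, 16)):
    """For each concurrency level, launch that many simultaneous calls to
    run_query and record the wall-clock time for all of them to finish.
    A platform that scales well should show elapsed time growing far more
    slowly than the concurrency level."""
    results = {}
    for n in levels:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            futures = [pool.submit(run_query) for _ in range(n)]
            for f in futures:
                f.result()  # propagate any query errors
        results[n] = time.perf_counter() - start
    return results

# Stand-in "query" that simply sleeps; in a real PoC, replace this with a
# call into your candidate warehouse's client library.
def fake_query():
    time.sleep(0.05)

for level, elapsed in measure_concurrency(fake_query).items():
    print(f"{level:>3} concurrent queries: {elapsed:.2f}s total")
```

In a real evaluation you would run this against each platform with your own queries and watch where elapsed time starts climbing in step with the concurrency level rather than staying flat.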
The same caution applies to managing data consistency across separate workgroups and workloads and consolidating different types of data formats. The goal of a cloud data warehouse is to produce the fastest total time-to-insights possible (and ultimately fast time-to-market), across all queries and all workloads, including the time required to manage and scale the environment to handle these demands.
Seek balanced benchmarks
Still, benchmarks can be useful tools for gauging cloud data warehouse performance. Because no two environments are the same, at Snowflake we frequently measure the performance of our data warehouse-as-a-service using balanced benchmark tests derived from the Transaction Processing Performance Council (TPC) TPC-DS benchmark suite. Doing so reflects the workload variety likely to be seen across different business intelligence and analytical environments.
The TPC-DS benchmark suite includes 99 different queries spanning reporting, analytical, interactive and mixed workloads, including ad hoc queries with changing sets of data. This diversity of queries provides a better cross-section for organizations to test and evaluate compared with more narrowly defined, one-sided benchmarks that use a limited number of queries.
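When running a query suite like this in a PoC, it helps to record per-query elapsed times and summarize them with a geometric mean, a common convention in benchmark reporting because it keeps a single outlier query from dominating the result. A minimal sketch, with placeholder queries and a sleep-based executor standing in for the real TPC-DS SQL and warehouse client:

```python
import math
import time

def time_query_suite(queries, execute):
    """Run each query once through `execute` and return per-query timings
    plus their geometric mean, so one slow outlier can't dominate the summary."""
    timings = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        execute(sql)
        timings[name] = time.perf_counter() - start
    geo_mean = math.exp(sum(math.log(t) for t in timings.values()) / len(timings))
    return timings, geo_mean

# Placeholder suite and executor; in a real PoC these would be the TPC-DS
# queries and your warehouse client's execute call.
suite = {"q01": "SELECT 1", "q02": "SELECT 2"}
timings, geo = time_query_suite(suite, execute=lambda sql: time.sleep(0.01))
print(f"geometric mean: {geo:.4f}s")
```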
We also run performance tests using 10 TB, 100 TB, and 1 PB data warehouse sizes with data sets provided by the TPC. Concurrency performance and JSON data sets are also tested regularly.
Try, then buy
Dynamic, data-driven organizations of today have constantly changing data needs. Plus, different parts of an organization need specific insights from different collections of data.
We encourage every organization interested in a cloud data warehouse to run their own PoC. Take advantage of our $400 worth of free usage (compute and storage) for 30 days. Or, if you prefer a more structured evaluation, engage with a Snowflake representative to kick off a formal PoC process.
The choice is yours. We provide a true data warehouse-as-a-service experience. There’s nothing like trying Snowflake for yourself and experiencing the cloud data warehouse performance difference.