JUN 09, 2026/10 min readCore Platform

Efficient Ingest for Query-Ready Data in Snowflake

A good benchmark should make builders smarter. It should show engineers how to design a system, what trade-offs matter, and what it costs to meet your product objectives. The question ClickHouse asked is a good one: What does it actually cost to keep fresh data query-ready for AI agents and analytics?

ClickHouse recently published a benchmark claiming Snowflake costs 22× to 28× more to ingest and serve 100 billion rows of query-ready data. Upon review, the benchmark appears to omit material cost components and uses a Snowflake ingestion setup that does not reflect Snowflake's recommended production architecture for this workload.

As stated in ClickHouse's blog: "Snowflake was loaded via a single continuous stream into the table, using a Gen1 M warehouse." Based on our review, what appears to have been measured is a fully managed server-side batch load exercise on Snowflake, instead of a streaming ingestion workload. For real-time streaming workloads, Snowpipe Streaming provides a Snowflake-native ingestion path that, in the pricing comparison described below, shows an 80% lower ingestion cost than ClickHouse on equivalent ingestion paths.¹

At the same time, our analysis found that material client-side costs for ClickHouse, including file decompression and parsing, were not included in the final reported ingestion cost. In our reproduction, those client-side costs represented more than 60% of the total measured ingestion cost for the tested ClickHouse path.² ClickHouse is also measured using its recommended write path.

The Snowflake configuration used in the benchmark reflected many ingestion anti-patterns rather than Snowflake's recommended production best practices. In this blog, we'll walk through those choices and explain best practices for effectively loading data into Snowflake.

Why this is important

Modern analytics no longer begin at query time. AI agents, observability systems, security analytics, real-time dashboards and customer-facing applications all depend on fresh data that is not merely loaded, but loaded in a way that preserves throughput, controls cost and supports pruning at query time.

Snowflake offers builders a number of flexible options for bringing data into the platform.

For a high-throughput batch, start with `COPY INTO`

When data already exists as files, especially at a large scale, COPY INTO is the most direct and controllable path. It lets builders choose the warehouse, control when the load runs and shape the file batches so the warehouse is well-utilized.

There are two important things to consider to achieve good throughput/low cost:

When using COPY INTO, it's critical to ensure that parallelism cannot exceed the total file count.
We recommend files of roughly 100-250 MB compressed, or larger, and splitting very large files so work can be distributed across a warehouse's compute.

The warehouse should be sized to finish the work efficiently. A larger warehouse can avoid disk spilling due to memory constraints, reducing overall runtime and cost. Gen2 warehouses have a significant set of improvements that make operations like COPY much faster and can be less expensive than Gen1 warehouses for DML-heavy workloads. To find out more, check out this blog post.

Clustering costs in high-volume ingestion: Beyond artificial benchmarks

Snowflake uses Clustering to organize data into shapes that align with your query pattern. For example, if your query asks for the total sales of Cheese, the query engine can skip all the Wine rows in the right side (clustered), but needs to scan all rows on the left side (raw).

To help simplify this process, Snowflake provides auto-clustering capabilities. As a workload's shape changes and as data arrives, Snowflake optimizes the data layout in the background without using the query's cluster resources. Auto-clustering is triggered only when the table would benefit from it, and it avoids unnecessary reclustering.

In ideal scenarios, newly arrived data lands in order, and auto-clustering doesn't have to rewrite any historical data.

But most real-world data sets arrive slightly out of order, and auto-clustering is designed to work well in those scenarios as well.

But in scenarios such as ClickHouse's benchmark, what happened is a forced re-sort of the entire table, up to 10,000 times. (To build to 100 billion rows, the 10 million rows data set is layered on top of each other 10,000 times).

Other anti-patterns to avoid

Anti-pattern 1: Not enough files

Loading a single, large file with a single large warehouse will not parallelize loading the data and is a clear anti-pattern. For best results, we recommend using compressed files of roughly 100-250 MB (or larger) and splitting very large files to distribute the work.

For the most commonly seen CSV files, we offer some additional recommendations to follow, as documented here.

File format	MULTI_LINE	ON_ERROR
Compressed	FALSE	ABORT_STATEMENT
Uncompressed (Following RFC4180)	FALSE	ABORT_STATEMENT
Uncompressed (Compatible mode, slow)	TRUE (Default)	CONTINUE (Default)

Anti-pattern 2: Mixing push streaming with managed pull ingestion

When loading data into Snowflake, the ClickHouse benchmark loaded in a single continuous stream into a table on a Gen1 Medium-sized warehouse, while ClickHouse used 170 parallel clients sending 20,000-row batches with async inserts, which buffer server-side before flushing sorted parts. That is not the same ingestion architecture.

Snowpipe Streaming high-performance architecture offers a scalable, low-latency, fully managed ingestion solution for real-time applications. It can easily scale to 10GB/s and has a low per GB price.

Snowpipe Streaming also supports preclustering with no additional cost during ingestion through a named pipe: In-flight data is sorted before the data is committed, when the target table has clustering keys and CLUSTER_AT_INGEST_TIME = TRUE is set in the pipe's COPY INTO statement (refer to Snowflake documentation for more).

Anti-pattern 3: Choosing the wrong warehouse generation for the job

Snowflake Gen2 warehouses are intended to improve performance for data engineering workloads such as data ingestion. A serious ingest benchmark should use Gen2 for ingestion.

Recommended warehouse type

Workload phase	Warehouse
Ingest and data processing	Gen2
Ad-hoc query and general-purpose	Gen1 and Gen2
High concurrency serving	Interactive

In our optimized configuration, a Gen2 2XL warehouse loaded 100 billion rows in just around five hours, compared with the benchmark's approximately 28-hour controlled 1 million rows/sec run on a medium Gen1 warehouse.

Anti-pattern 4: Suboptimal clustering design

Suboptimal clustering not only leads to high clustering cost, but it also leads to poor query performance.

We recommend choosing clustering keys based on selective filters, common joins and representative query testing. Clustering is most cost-effective when many queries benefit from the same key and the table has a high query-to-DML ratio.

Bad:

CLUSTER BY (user_id)

when the workload is:

WHERE event_date BETWEEN ...
 AND tenant_id = ...

Better:

CLUSTER BY (event_date, tenant_id)

assuming this matches the dominant query pattern.

As outlined in Strategies for Selecting Clustering Keys, Snowflake recommends having fewer than three to four columns in the clustering key. Users should also avoid including high cardinality keys in the system. Those recommendations are important because they balance clustering cost with query performance in large-scale systems.

In ClickHouse's setup, the clustering key included five columns, of which three are high cardinality: UserID, EventTime and WatchID.

Instead, the correct setup for this use case is to use EVENTDATE and COUNTERID. For builders with real-world time series data, you can consider lowering the resolution of the time clustering key to minimize clustering cost: DATE_TRUNC('DAY', time)

For high-cardinality dimensions that you'd like to filter on, Snowflake recommends Search Optimization service instead of clustering.

What the ClickHouse benchmark setup misses

Replicating the ClickHouse benchmark proved difficult due to inconsistent claims in its blog post, and the key source code was not on GitHub (we received a 404 when accessing the files). To understand the discrepancy, we analyzed historical logs to audit its exact setup. Snowflake is committed to Compete with Integrity to help customers make informed decisions, and we encourage all companies to operate with transparency and reproducibility.

The most problematic part is that there is no such thing as a "single continuous stream." Instead, the "benchmark" is layering the time-ordered data 10,000 times on top of each other, effectively forcing the system to constantly reshuffle the data.

At the same time, in our reproduction, ClickHouse's client-side CSV parsing and decompression cost more ($214) than the server-side async insert cost ($131) for a 10 billion row workload.

Here is a summary of lessons to be learned for others who run production workloads:

Benchmark choice	What you should do in production
Client-side parsing/decompression excluded	Measure the entire cost of ingestion; don't ignore the parsing/decompression cost.
Repeat INSERT 10,000 times	Use Snowpipe for batch loading, or Snowpipe Streaming for real-time loads
Gen1 medium ingest warehouse	In general, use Gen2 warehouse for DML and data processing tasks
Controlled 1 million rows/sec stream	Use smaller warehouses if your throughput is low. Snowpipe Streaming bills by GB, no need to keep a warehouse up
Heavy clustering cost reported without optimized load layout	In real-world scenarios, clustering should keep up as your new data lands.

There are also other reproducibility concerns. At least two linked GitHub paths for the ClickHouse ingestion script and command file returned 404 during review, preventing benchmark validation.

The main ClickHouse loading script is 404: https://github.com/ClickHouse/CostBench/blob/main/_INGEST_TEST/ingest_chunks.py
Snowflake cost data is 404: https://github.com/ClickHouse/Bench2Cost/blob/main/snowflake/results_100B/gen2_clustered_4xl_enriched.json
ClickHouse cost data is 404: https://github.com/ClickHouse/Bench2Cost/blob/main/clickhouse-cloud/results_100B/aws.40.236._2.parallel_replicas.json

Trading off cost with reliability

Snowflake provides automatic, out-of-the-box, synchronous redundancy across availability zones (AZs) within a single cloud region. This experience is built on top of a globally consistent cloud service layer. Even with an XS warehouse, customers are protected against outages of AZs without having to run a second replica.

In comparison, on ClickHouse Cloud, configurations that require an additional active replica for comparable redundancy carry additional infrastructure costs, which should be included when comparing total cost, reliability and availability trade-offs across platforms.

Handling high concurrency at low cost with Interactive Warehouses

Interactive tables and warehouses, which launched in 2025, are designed to serve exactly the use case ClickHouse is tailored for. We offer better performance than ClickHouse at a lower cost:

On equivalent hardware, Interactive can serve two to four times as many concurrent queries as ClickHouse through better resource scheduling. This means you can serve the same workload at a lower cost, especially at a larger scale.

You can power user-facing dashboards and other high-concurrency, low-latency interactive workloads without exporting data from Snowflake or managing a separate ClickHouse infrastructure. This setup simplifies platform setup, while providing the same security, governance, reliability (e.g., cross-cloud, cross-region disaster recovery failover), and rich SQL capabilities that come standard with Snowflake.

In the interactive warehouse, queries that run longer than five seconds don't time out. Instead, they are transparently re-executed on a fallback warehouse of your choice. It allows longer-running queries to execute on the regular warehouse without taking resources away from the next low-latency query. This enables transparent balancing of cost and provides performance isolation. (Snowflake Documentation)

Building with you

A benchmark can measure a slice of a system.

Data engineers and AI builders need more than fast ingest. They need fast ingest along with governed data, reliable pipelines, secure access, low-latency serving, observability, cost control and tools that help humans and agents build together. This is the Snowflake platform.

AI agents are powerful, but they need context, permissions, lineage and guardrails. Snowflake's governance model gives builders that foundation: role-based access control, Horizon Catalog for consistent policies and lineage, and Trust Center for monitoring and reducing security risks across accounts.

Builders do not just need a tool. They need a platform that helps them go from ingestion to reliable applications; from SQL to agents; from raw data to governed data products; from experimentation to production.

But AI assistance without trust is not a platform. It is a demo.

The Snowflake standard is to give builders both performance and trust in a single platform, never forcing them to choose one or the other.

Based on public list pricing as of June 8, 2026 for the listed SKUs, comparing Snowpipe Streaming against ClickPipe streaming. ↩
Based on Snowflake's reproduction of the tested ClickHouse ingestion path for 100 Billion rows, client-side file decompression and parsing costs were measured at $214, compared with total measured ingestion costs of $131. Costs are based on AWS us-west region. Results are workload and configuration-specific. ↩

Learn more about the author

Will Xu

Principal Product Manager