Author: Mingli Rui
April 8, 2020 · 5 min read

Snowflake Connector for Spark Version 2.6 Turbocharges Reads with Apache Arrow

The Snowflake Connector for Spark (“Spark Connector”) now uses the Apache Arrow columnar result format to dramatically improve query read performance.

Previously, the Spark Connector would first execute a query and copy the result set to a stage in either CSV or JSON format before reading data from Snowflake and loading it into a Spark DataFrame. Typically, downloading and deserializing the CSV or JSON data consumed the bulk of end-to-end processing time when data was read from a Snowflake Cloud Data Platform data source.

With this 2.6.0 release, the Snowflake Spark Connector executes the query directly via JDBC and (de)serializes the data using Arrow, Snowflake’s new client result format. This saves time in data reads and also enables the use of cached query results.
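As a minimal sketch of what this looks like from the application side (every connection value below is a placeholder), an ordinary DataFrame read is all that is needed; with version 2.6.0 and above, no Arrow-specific option is required:

import org.apache.spark.sql.SparkSession

// Hypothetical minimal read; all connection values are placeholders.
// With connector 2.6.0+ the query runs directly over JDBC and the
// results are fetched in Arrow format by default.
val spark = SparkSession.builder().appName("arrow-read-sketch").getOrCreate()

val df = spark.read
  .format("net.snowflake.spark.snowflake")
  .option("sfURL", "<account>.snowflakecomputing.com")
  .option("sfUser", "<user>")
  .option("sfPassword", "<password>")
  .option("sfDatabase", "<database>")
  .option("sfSchema", "<schema>")
  .option("sfWarehouse", "<warehouse>")
  .option("query", "SELECT * FROM LINEITEM LIMIT 1000")
  .load()

df.show(5)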

Benchmark results: Cacheable, speedy reads with Apache Arrow

In this benchmark, we ran a Spark job that reads the LINEITEM table, a standard TPC-H table with a compressed size of 16.3 GB in Snowflake. We first measured the throughput gain from the more efficient columnar binary data format by performing an uncached read of the Snowflake table.

We saw an immediate 4x improvement in the end-to-end performance of this Spark job, driven by a 10x reduction in the time the Spark Connector spends fetching and processing the results of the Snowflake query.

In addition, Snowflake has a query-result cache for repeated queries that operate on unchanged data. By storing results that may be reused, the database can avoid recomputation and simply direct the client driver to read from the already computed result cache. In previous versions of the Spark Connector, this query result cache was not usable. Beginning in version 2.6.0, the Spark Connector will issue pushdown jobs to Snowflake using direct queries; this means that the Spark Connector is able to take full advantage of the query result cache. We also saw this benefit in our benchmark results, which are shown below. With cached reads, the end-to-end performance for the Spark job described above is 14x faster than when using uncached CSV-format reads in previous versions of the Spark Connector.
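As an illustrative sketch only (assuming an active SparkSession named spark and placeholder connection values), reading the same table twice with "use_cached_result" enabled demonstrates the effect: the first read is computed by the warehouse, while the second identical query can be served from the result cache.

// Illustrative sketch: issue the same query twice with
// "use_cached_result" -> "true"; the second read should be served
// from Snowflake's query result cache rather than recomputed.
// `spark` is an active SparkSession; connection values are placeholders.
val cachedOptions = Map(
  "sfURL" -> "<account>.snowflakecomputing.com",
  "sfUser" -> "<user>",
  "sfPassword" -> "<password>",
  "sfDatabase" -> "<database>",
  "sfSchema" -> "TPCH_SF100",
  "sfWarehouse" -> "<warehouse>",
  "use_cached_result" -> "true"
)

def timedReadMs(): Long = {
  val t0 = System.currentTimeMillis()
  spark.read
    .format("net.snowflake.spark.snowflake")
    .options(cachedOptions)
    .option("dbtable", "LINEITEM")
    .load()
    .count() // forces the query to run on Snowflake
  System.currentTimeMillis() - t0
}

println(s"cold read:   ${timedReadMs()} ms")
println(s"cached read: ${timedReadMs()} ms")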

We ran a four-worker Spark cluster of AWS EC2 c4.2xlarge machines with Apache Spark 2.4.5 and Scala 2.11, against a 4X-Large Snowflake warehouse. The Snowflake deployment and the Spark cluster ran in the same cloud region, US-West-2 (Oregon).

[Chart: benchmark read times comparing uncached CSV reads, uncached Arrow reads, and cached Arrow reads]

Scala Code Used for the Benchmark Test

The following snippet shows the code used for the benchmark test with Arrow. To measure performance without Arrow, set "use_copy_unload" to "true".

import net.snowflake.spark.snowflake._

// Snowflake Spark Connector options.
// "use_cached_result" defaults to "false"; set it to "true" to let
// repeated queries be served from Snowflake's query result cache.
val sfOptions: Map[String, String] = Map(
  "sfSSL" -> "on",
  "sfUser" -> "<snowflake_user_name>",
  "pem_private_key" -> "<private_key>",
  "use_copy_unload" -> "false",
  "use_cached_result" -> "true",
  "sfDatabase" -> "<snowflake_database>",
  "sfURL" -> "<snowflake_account>.<snowflake_cloud_domain>",
  "sfWarehouse" -> "<snowflake_warehouse>",
  "partition_size_in_mb" -> "60"
)

// Test table is TPC-H LINEITEM, which has 600M rows.
// Its compressed size in Snowflake is 16.3 GB.
val sourceTableName = "LINEITEM"
val sourceSchema = "TPCH_SF100"

val df = sqlContext.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", sourceTableName)
  .option("sfSchema", sourceSchema)
  .load()

// Clear the Spark cache before each test run.
sqlContext.clearCache()

// cache() followed by show(1) reads all the data into the Spark cache
// and displays one row. The time to show one row is negligible, so the
// elapsed time of the DataFrame execution is taken as the read time.
val startTime = System.currentTimeMillis()
df.cache().show(1)
val endTime = System.currentTimeMillis()

val result_msg = s"read time: ${(endTime - startTime).toDouble / 1000.0} s"
println(result_msg)
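To reproduce the pre-2.6 baseline without Arrow, the same code can be reused with a single option flipped, which routes reads through the older stage-based COPY UNLOAD (CSV) path:

// Baseline without Arrow: override one option so reads fall back to
// the pre-2.6 COPY UNLOAD path instead of direct Arrow results.
val csvOptions = sfOptions + ("use_copy_unload" -> "true")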

Getting Started with Arrow Support in Snowflake Connector for Spark

The Arrow format is available in Snowflake Connector for Spark version 2.6.0 and above, and it is enabled by default. For more details, see the Snowflake Connector for Spark documentation.
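As one way to pick up the release (the coordinates below assume the Scala 2.11 / Spark 2.4 build used in the benchmark; match the suffixes to your own environment), the connector can be declared as an sbt dependency:

// sbt: assumed coordinates for the Scala 2.11 / Spark 2.4 build of
// connector 2.6.0; adjust the suffixes for other Spark/Scala versions.
libraryDependencies += "net.snowflake" % "spark-snowflake_2.11" % "2.6.0-spark_2.4"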
