
Spark Examples

Apache Spark is an open-source, general-purpose cluster-computing system for large-scale data processing.

But what use cases are a good fit for the Spark framework? Below are a few scenarios where using Spark makes sense: 

  • Projects that involve massive sets of disparate data types (such as multi-terabyte structured data sets mixed with JSON)

  • Projects that have massive data volumes and also require quick (even in-stream) analysis 

  • Projects without a budget for proprietary third-party tools 

Why is Spark well-suited for the conditions mentioned above?

  • For those with budget concerns, it is an open-source framework that can run on commodity hardware 

  • It processes data mainly in memory (leveraging RDDs, or Resilient Distributed Datasets), speeding data access and reducing disk I/O latency

  • It features an extensive API that can drastically reduce data application development times

  • Spark is easy to program, and users can write simple, object-oriented queries within a distributed computing environment

  • It offers over 80 high-level operators that aid in parallel application development (see the sketch after this list)

  • It supports graph processing for advanced machine learning, data science, and data mining applications
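
To make these points concrete, here is a minimal PySpark sketch, using hypothetical file paths and column names, that joins semi-structured JSON with structured CSV data, caches the result in memory, and aggregates it with a few high-level operators:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-example").getOrCreate()

# Read semi-structured JSON alongside structured CSV data (hypothetical paths)
events = spark.read.json("s3://example-bucket/events/")
customers = spark.read.option("header", True).csv("s3://example-bucket/customers.csv")

# Cache the joined result in memory so repeated analyses avoid disk I/O
joined = events.join(customers, on="customer_id").cache()

# High-level operators express the aggregation; Spark parallelizes the work
daily_totals = (
    joined
    .filter(F.col("event_type") == "purchase")
    .groupBy("customer_id", F.to_date("event_ts").alias("day"))
    .agg(F.sum("amount").alias("total_spend"))
)
daily_totals.show()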

Snowflake and Spark

The Snowflake Connector for Spark enables Spark to read data from and write data to Snowflake. It provides the Spark ecosystem with access to Snowflake as a fully managed and governed repository for all data types, including JSON, Avro, CSV, XML, machine-generated data, and more. The connector also enables powerful integration use cases, including:

  • Complex ETL: Using Spark, you can easily build complex, functionally rich, and highly scalable data ingestion pipelines for Snowflake. With a large set of readily available connectors to diverse data sources, Spark facilitates data extraction, which is typically the first step in any complex ETL pipeline (see the first sketch below).

  • Machine Learning: With Spark integration, Snowflake provides users with an elastic, scalable repository for all the data underlying algorithm training and testing. Processing capacity requirements for machine learning pipelines often fluctuate heavily, and Snowflake can easily expand its compute capacity so machine learning in Spark can process large amounts of data (see the second sketch below).
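
As a sketch of the ETL case, the example below writes a small Spark DataFrame into Snowflake through the Snowflake Connector for Spark. The account URL, credentials, warehouse, database, and table name are placeholders, and it assumes the connector and the Snowflake JDBC driver are already on the Spark classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-etl").getOrCreate()

SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",   # placeholder connection details
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

# Hypothetical result of an upstream Spark transformation step
daily_totals = spark.createDataFrame(
    [("c1", "2024-01-01", 120.0), ("c2", "2024-01-01", 75.5)],
    ["CUSTOMER_ID", "DAY", "TOTAL_SPEND"],
)

# Write the DataFrame into a Snowflake table (placeholder name)
(daily_totals.write
    .format(SNOWFLAKE_SOURCE)
    .options(**sf_options)
    .option("dbtable", "DAILY_TOTALS")
    .mode("overwrite")
    .save())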
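
For the machine learning case, a similar sketch pulls training data out of Snowflake with a pushed-down query and fits a simple Spark MLlib model. The table, columns, credentials, and model choice are again placeholders for illustration:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("snowflake-ml").getOrCreate()

SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",   # placeholder connection details
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ML_WH",
}

# Push a query down to Snowflake and load only the rows needed for training
training = (spark.read
    .format(SNOWFLAKE_SOURCE)
    .options(**sf_options)
    .option("query", "SELECT TOTAL_SPEND, NUM_ORDERS, TENURE_DAYS FROM CUSTOMER_FEATURES")
    .load())

# Assemble feature columns and fit a basic regression model in Spark
features = VectorAssembler(
    inputCols=["NUM_ORDERS", "TENURE_DAYS"], outputCol="features"
).transform(training)

model = LinearRegression(featuresCol="features", labelCol="TOTAL_SPEND").fit(features)
print(model.coefficients)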