How to Use AWS Glue with Snowflake

Author: Harsha Kapre


The process of extraction, transformation, and load (ETL) is central to any data warehousing initiative. With advances in cloud data warehouse architectures, customers are also benefiting from the alternative approach of extraction, load, and transformation (ELT), where data processing is pushed to the database.

With either approach, the debate continues: should you take a hand-coded approach or leverage any number of the available ETL or ELT data integration tools? While there are advantages to both, and some will choose a “one or the other” approach, many organizations select a combination of a data integration tool and hand coding. Code provides developers with the flexibility to build using preferred languages while maintaining a high level of control over integration processes and structures. The challenge has been that hand-coding options are traditionally more complex and costly to maintain. However, with AWS Glue, developers now have an option to easily build and manage their data preparation and loading processes with generated code that is customizable, reusable, and portable, with no infrastructure to buy, set up, or manage.

In this blog, we’ll cover how to leverage the power of AWS Glue with Snowflake and how processing is optimized through the use of query pushdown for ELT.

Why AWS Glue with Snowflake

Snowflake customers now have a simple option to manage their programmatic data integration processes without worrying about servers, Spark clusters, or the ongoing maintenance traditionally associated with these systems. AWS Glue provides a fully managed environment that integrates easily with Snowflake’s data warehouse-as-a-service. Together, these two solutions enable customers to manage their data ingestion and transformation pipelines with more ease and flexibility than ever before. With AWS Glue and Snowflake, customers get the added benefit of Snowflake’s query pushdown, which automatically pushes Spark workloads, translated to SQL, into Snowflake. Customers can focus on writing their code and instrumenting their pipelines without having to worry about optimizing Spark performance. (For more on this, read our “Why Spark” and “Pushing Spark Query Processing to Snowflake” blog posts.) With AWS Glue and Snowflake, customers can reap the benefits of optimized ELT processing that is low cost and easy to use and maintain.
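
To make the pushdown benefit concrete, here is a minimal, hypothetical sketch; ORDERS, STATUS, and ORDER_DATE are made-up names, and spark and sfOptions refer to the Spark session and connection options created in the sample script later in this post. With pushdown enabled, the filter and aggregation below are translated into a single SQL query that runs inside Snowflake, and only the small aggregated result is returned to Spark.

## Hypothetical illustration of query pushdown (ORDERS, STATUS, and ORDER_DATE are placeholder names)
orders = spark.read.format("net.snowflake.spark.snowflake").options(**sfOptions).option("dbtable", "ORDERS").load()

## With pushdown enabled, this filter and aggregation run as one SQL statement inside Snowflake
shipped_per_day = orders.filter(orders["STATUS"] == "SHIPPED").groupBy("ORDER_DATE").count()
shipped_per_day.show()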

AWS Glue and Snowflake in Action

Prerequisites:

  1. A Snowflake account and an AWS account with access to S3 and AWS Glue.
  2. The Snowflake Spark Connector .jar file.
  3. The Snowflake JDBC driver .jar file (both .jar files are uploaded to S3 in the setup steps below).

Setup

  1. Log into AWS.
  2. Search for and click on the S3 link.
    1. Create an S3 bucket and folder.
    2. Add the Spark Connector and JDBC .jar files to the folder.
    3. Create another folder in the same bucket to be used as the Glue temporary directory in later steps (see below). For a scripted alternative to steps 2a–2c, see the sketch after this list.
  3. Switch to the AWS Glue Service.
  4. Click on Jobs on the left panel under ETL.
  5. Add a job by clicking Add job.
    1. Provide a name for the job.
    2. Select an IAM role. Create a new IAM role if one doesn’t already exist and be sure to add all Glue policies to this role.
    3. Select the option for A new script to be authored by you.
    4. Give the script a name.
    5. Set the temporary directory to the one you created in step 2c.
    6. Expand Script libraries and job parameters:
      1. Under Dependent jars path, add entries for both .jar files from 2b.

[NOTE: You have to add the full path to the actual .jar files. Example: s3://[bucket_name]/GlueJars/spark-snowflake_2.11-2.2.6.jar,s3://[bucket_name]/GlueJars/snowflake-jdbc-3.2.4.jar]

      2. Under Job parameters, enter the following keys, each with the corresponding value from your Snowflake account: --URL, --ACCOUNT, --WAREHOUSE, --DB, --SCHEMA, --USERNAME, --PASSWORD (these match the arguments read by the sample script below). Make sure to include the two dashes before each key.

[NOTE: Storing your account information and credentials this way will expose them to anyone with access to this job. This can be useful for testing purposes, but it is recommended that you store your credentials securely as outlined in the Securing credentials section below.]

  6. Click Next, click Next again, then click Finish.
  7. You will be prompted with a blank script interface.
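
If you would rather script the S3 preparation from step 2 than click through the console, the following is a minimal sketch using boto3. The bucket name, region, folder names, and local .jar file paths are placeholders to replace with your own values.

import boto3

## Placeholder values: replace with your own bucket name, region, and local .jar file paths
bucket = "my-glue-snowflake-bucket"
region = "us-west-2"
s3 = boto3.client("s3", region_name=region)

## Step 2a: create the bucket (omit CreateBucketConfiguration if the region is us-east-1)
s3.create_bucket(Bucket=bucket, CreateBucketConfiguration={"LocationConstraint": region})

## Step 2b: upload the Spark Connector and JDBC driver .jar files to a GlueJars folder
s3.upload_file("spark-snowflake_2.11-2.2.6.jar", bucket, "GlueJars/spark-snowflake_2.11-2.2.6.jar")
s3.upload_file("snowflake-jdbc-3.2.4.jar", bucket, "GlueJars/snowflake-jdbc-3.2.4.jar")

## Step 2c: create an empty GlueTemp/ prefix to use as the Glue temporary directory
s3.put_object(Bucket=bucket, Key="GlueTemp/")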

Sample script

Use the following sample script to test the integration between AWS Glue and your Snowflake account. This script assumes you have stored your account information and credentials using Job parameters as described in section 5.6.2.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from py4j.java_gateway import java_import
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

## @params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA', 'USERNAME', 'PASSWORD'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
java_import(spark._jvm, "net.snowflake.spark.snowflake")

## uj = sc._jvm.net.snowflake.spark.snowflake
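## Enable query pushdown so that Spark transformations are translated to SQL and executed inside Snowflake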
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(
    spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate()
)
sfOptions = {
"sfURL" : args['URL'],
"sfAccount" : args['ACCOUNT'],
"sfUser" : args['USERNAME'],
"sfPassword" : args['PASSWORD'],
"sfDatabase" : args['DB'],
"sfSchema" : args['SCHEMA'],
"sfWarehouse" : args['WAREHOUSE'],
}

## Read from a Snowflake table into a Spark Data Frame
df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "[table_name]").load()

## Perform any kind of transformations on your data and save as a new Data Frame:
## df1 = df.[Insert any filter, transformation, or other operation]

## Write the Data Frame contents back to Snowflake in a new table
df1.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "[new_table_name]").mode("overwrite").save()

job.commit()
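
As a usage example, the transformation placeholder above could be any standard DataFrame operation. A hypothetical sketch (AMOUNT and ID are made-up column names, not from the original post) that keeps only high-value rows before the write step:

## Hypothetical transformation: keep two columns and only the rows where AMOUNT exceeds 1000
df1 = df.filter(df["AMOUNT"] > 1000).select("ID", "AMOUNT")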

Securing credentials

To store your account information and credentials securely, see the following article, which describes how this is accomplished with EC2: How to Securely Store Credentials with EC2.
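
As an alternative approach not covered in that article, an AWS Glue job can also fetch credentials at run time from AWS Secrets Manager with boto3 instead of passing them as job parameters. A minimal sketch, assuming a hypothetical secret named snowflake/glue_user that stores the username and password as JSON, and a Glue IAM role allowed to read it:

import json
import boto3

## Hypothetical secret name and region; the job's IAM role needs the secretsmanager:GetSecretValue permission
secrets = boto3.client("secretsmanager", region_name="us-west-2")
secret_value = secrets.get_secret_value(SecretId="snowflake/glue_user")
creds = json.loads(secret_value["SecretString"])

## Use the retrieved values in place of the USERNAME and PASSWORD job parameters when building sfOptions
sfOptions["sfUser"] = creds["username"]
sfOptions["sfPassword"] = creds["password"]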

Conclusion

AWS Glue and Snowflake make it easy to get started with and manage your programmatic data integration processes. AWS Glue can be used standalone or in conjunction with a data integration tool without adding significant overhead. With native query pushdown through the Snowflake Spark connector, this approach optimizes both performance and cost for true ELT processing. With AWS Glue and Snowflake, customers get a fully managed, fully optimized platform to support a wide range of custom data integration requirements.

