Integrating the Snowflake Data Warehouse with R via dplyr

Engineering, Snowflake Technology

While SQL remains the most popular query language for analyzing small, mid-size, and ‘big’ data, other languages such as R and Python are used by fast-growing communities. In particular, R offers sophisticated statistical and graphical techniques for quickly building models and plotting data, aligned with how data scientists and data miners think. At Snowflake, we continue to embrace these communities by working on tight integration that lets our users take advantage of modern programming languages.

We recently updated our dplyr package, dplyr.snowflakedb, and have refreshed this original blog post to reflect the latest version. Snowflake’s unique capabilities are a great fit for everyone who wants to work with any scale of data, workloads, or users while still leveraging their familiar R scripts and functions.

What is dplyr, and Why Should I Care?

Some of the key benefits that the R package dplyr offers are:

  • The ability to leverage processing capabilities and the scalability of external (remote) databases by translating certain R functions (verbs) into equivalent SQL statements executed by those databases so that the heavy lifting is done by the database engine.
  • Improved performance through in-memory processing of data frames.
  • Simple chaining of R verbs and operations combined with lazy loading of data to enable more complex data manipulation workflows without moving data unnecessarily from the remote database.
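To make the first and third points concrete, here is a minimal sketch of lazy evaluation and SQL translation. It uses a local SQLite database as a stand-in for a remote source (src_sqlite and the nycflights13 package are assumptions for illustration; with Snowflake, the source would come from dplyr.snowflakedb instead):

```r
library(dplyr)

# In-memory SQLite database standing in for a remote source
db <- src_sqlite(":memory:", create = TRUE)
flights_tbl <- copy_to(db, nycflights13::flights, name = "flights")

# Nothing is fetched yet: these verbs only build up a query description
delayed <- flights_tbl %>%
  filter(dep_delay > 60) %>%
  select(carrier, dep_delay)

delayed$query$sql  # inspect the SQL that dplyr generated
collect(delayed)   # only now does the database execute the query
```

The same pattern applies unchanged when the source is a Snowflake virtual warehouse: the verbs translate to SQL that the warehouse executes.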

dplyr & Snowflake: Leveraging the Benefits of an Elastic Data Warehouse Service

dplyr allows the user to interact with remote data from a familiar R environment without moving all the remote data locally, instead pushing the heavy lifting to the database. Snowflake perfectly aligns with this philosophy. You can try it out yourself via our dplyr.snowflakedb package, now available in our GitHub repository.

What Makes This dplyr Package Different from Ones Created for Other Databases?

Instant scale-out elasticity

One key differentiator is unlimited scalability and the freedom to scale up or down dynamically. Not only can R users continue using their favorite R scripts, but they also have the flexibility to run their R scripts with the performance needed over various data set sizes in the cloud — from small and medium-sized to very large.

Processing horsepower is instantly available to R users through Snowflake’s virtual data warehouse concept. In contrast to traditional on-premises MPP data warehouse solutions and other cloud solutions used as a data repository, with Snowflake an R user can (re)size the computational power needed within seconds. For example, if you have a data set that is too large to be processed by your local machine, but you do not need an expensive on-premises MPP solution, you can spin up a small or mid-size virtual data warehouse through Snowflake. As your data volume grows, you can easily add computational power, and your R scripts will instantly run faster.

An easy solution for creating sandboxes

With Snowflake, R users can explore data at any size and don’t need to be concerned about impacting other workloads or being impacted by other workloads. R users can use a virtual data warehouse for their operations, while users running SQL queries use a different virtual data warehouse, and the daily load process yet another virtual data warehouse — all (potentially) operating on the exact same data without having to worry about data movement across systems.
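For illustration, a sandbox could be as simple as two sessions pointed at separate virtual warehouses against the same database. The connection signature follows the src_snowflakedb example later in this post; the warehouse names here are hypothetical placeholders:

```r
# R exploration runs on its own virtual warehouse (names are placeholders)
explore_db <- src_snowflakedb(user="<username>", password="<password>",
  account="<account>", host="<snowflake_host_url>",
  opts=list(db="<database>", warehouse="R_SANDBOX_WH", schema="<schema>"))

# ETL and SQL users run on a separate warehouse against the same data
etl_db <- src_snowflakedb(user="<username>", password="<password>",
  account="<account>", host="<snowflake_host_url>",
  opts=list(db="<database>", warehouse="ETL_WH", schema="<schema>"))
```

Both sessions see the same tables, but their queries consume compute from different warehouses, so neither workload impacts the other.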

Build, validate, train, and test all within a single system

Combining dplyr with the Snowflake service greatly simplifies and accelerates the end-to-end workflow for data scientists. An R user can now quickly build and train a model, and also backtest it at scale within seconds, significantly reducing the time it takes to deploy new models to production.
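As a rough sketch of what this workflow might look like (assuming a flights table already loaded into Snowflake and referenced via flights <- tbl(my_db, "flights"), as in the walkthrough below):

```r
library(dplyr)

# 1. Prepare features in Snowflake -- the aggregation runs in the
#    warehouse, so only a small summary crosses the wire.
daily <- flights %>%
  group_by(year, month, day) %>%
  summarise(dep = mean(dep_delay), arr = mean(arr_delay))

# 2. Pull the summary locally and fit a simple model in R.
daily_local <- collect(daily)
fit <- lm(arr ~ dep, data = daily_local)
summary(fit)

# 3. To backtest at scale, rebuild the same summary over a larger or
#    different slice of the data -- the heavy lifting again happens
#    inside Snowflake, not on the R client.
```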

dplyr.snowflakedb in Action

Let’s walk through an example of how customers can use dplyr in combination with the Snowflake service. Examples are based on the dplyr vignette.

  1. First, a few steps to make sure the right libraries are loaded and the classpath is correctly set.
    library(dplyr.snowflakedb)
    options(dplyr.jdbc.classpath = "/home/snowman/Downloads/snowflake_jdbc.jar")
  2. Second, connect to Snowflake and choose the virtual warehouse, database, and schema. Account is just the first portion of the Snowflake URL, and host is the full URL without the https:// prefix.
    my_db <- src_snowflakedb(user="<username>", password="<password>",
    account="<account>", host="<snowflake_host_url>", opts=list(db="<database>",
    warehouse="<warehouse>", schema="<schema>"))
  3. Once connectivity is set up, copy the data frames from the nycflights13 data set into Snowflake database tables.
    library(nycflights13)
    copy_to(my_db, flights, name = "flights")
    # repeat for airlines, airports, planes, and weather
  4. List all tables of the data loaded.
    src_tbls(my_db)
    #> [1] "airlines" "airports" "flights" "planes" "weather"
  5. flights is now a reference to the database table of the same name.
    flights <- tbl(my_db, "flights")
  6. A few basic R commands to describe the flights data frame.
    # dim() returns the number of rows and number of columns
    dim(flights)
    #> [1] 336776     16
    # colnames() returns the column names
    colnames(flights)
    #> [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time" "arr_delay" "carrier" "tailnum" "flight" "origin" "dest" "air_time" "distance" "hour" "minute"
    # head() returns the first rows
    head(flights)
    #> year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
    #> 1 2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227     1400    5     17
    #> 2 2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227     1416    5     33
    #> 3 2013     1   1      542         2      923        33      AA  N619AA   1141    JFK  MIA      160     1089    5     42
    #> 4 2013     1   1      544        -1     1004       -18      B6  N804JB    725    JFK  BQN      183     1576    5     44
    #> 5 2013     1   1      554        -6      812       -25      DL  N668DN    461    LGA  ATL      116      762    5     54
    #> 6 2013     1   1      554        -4      740        12      UA  N39463   1696    EWR  ORD      150      719    5     54
  7. Here are some basic examples using some dplyr verbs and functions; operations are evaluated lazily, with no data moved to the client until necessary.
    a <- filter(flights, flight == 5708)
    b <- arrange(a, dep_delay)
    c <- distinct(select(a, tailnum))
  8. We can see the SQL text being built and modified along the way.
    > a$query$sql
    <SQL> SELECT "year", "month", "day", "dep_time", "dep_delay", "arr_time", "arr_delay", "carrier", "tailnum", "flight", "origin", "dest", "air_time", "distance", "hour", "minute"
    FROM "flights"
    WHERE "flight" = 5708.0
    > b$query$sql
    <SQL> SELECT "year", "month", "day", "dep_time", "dep_delay", "arr_time", "arr_delay", "carrier", "tailnum", "flight", "origin", "dest", "air_time", "distance", "hour", "minute"
    FROM "flights"
    WHERE "flight" = 5708.0
    ORDER BY "dep_delay"
    > c$query$sql
    <SQL> SELECT "tailnum" AS "tailnum"
    FROM (SELECT DISTINCT * FROM (SELECT "tailnum" AS "tailnum"
    FROM "flights"
    WHERE "flight" = 5708.0) AS "zzz1") AS "zzz2"
  9. Example of chaining using the %>% operator.
    df <- tbl(my_db, "flights") %>%
          group_by(year, month, day) %>%
          select(arr_delay, dep_delay) %>%
          summarise(
            arr = mean(arr_delay),
            dep = mean(dep_delay)
          ) %>%
          filter(arr > 30 | dep > 30)
  10. SQL query that would be run given this data frame
    > df$query$sql
    <SQL> SELECT "year", "month", "day", "arr", "dep"
    FROM (SELECT "year", "month", "day", AVG("arr_delay") AS "arr", AVG("dep_delay") AS "dep"
    FROM "flights"
    GROUP BY "year", "month", "day") AS "zzz3"
    WHERE "arr" > 30.0 OR "dep" > 30.0
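To complete the walkthrough, the final lazily built query can be executed in Snowflake and its (small) result pulled into a local data frame with dplyr's collect():

```r
# Execute the query in Snowflake and fetch the result locally
# (df is the data frame built in the chaining example above)
result <- collect(df)
head(result)
```

Until collect() is called, every step above only rewrites the SQL; the warehouse does no work and no rows leave Snowflake.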

How to Get Started

For more information about dplyr.snowflakedb, including installation and usage instructions, please visit our GitHub repository. Please stay tuned, as this is just one more step in giving you the flexibility to use the languages and tools of your choice.