In June, we announced that Java functions and the Snowpark API were available for preview in AWS. Today, we’re announcing a few additions to that preview: We’re expanding both where Snowpark is available as well as what you can do with it.
New Cloud Providers
At Snowflake, we aim to provide the same great support across all of the clouds and regions in which we operate. Cloud providers can be pretty different from each other, but that should be our problem, not yours.
Today, we’re happy to announce another step toward that goal: Java functions and the Snowpark API are now live in preview across all of Azure.
And the big news is there’s really not much more to say. Since these features work the same across both clouds, you can develop once and deploy where you’d like. And if you’re on GCP, we haven’t forgotten about you: The exact same support is coming soon.
Table Functions
We’re also announcing a large expansion of our Java functions support: Table functions are now in preview on both AWS and Azure.
To understand table functions, let’s take a quick step back. Previously, we only supported scalar functions: functions that operate on each row in isolation, and produce a single (possibly complex) result. Scalar functions are nice: They’re simple to write, they’re simple for Snowflake to scale out as part of your query, and they address a broad array of problems.
But there are things you can’t do with scalar functions. You’ll get stuck if you need to do any of the following:
- Return multiple rows for each input row
- Maintain state across multiple rows
- Return a single result for a group of rows
While you can’t do these with scalar functions, all of these are possible with table functions. The trade-off is that they’re a little more involved to write and use.
Let’s take a look.
An Example: Word Counts
Let’s say that we want to compute how many times each word occurs across a collection of rows. For example, suppose we have data on books, such as:
books
| title | author | first_line |
| --- | --- | --- |
| Moby Dick | Herman Melville | Call me Ishmael. |
| Billy Budd, Sailor | Herman Melville | In the time before steamships, or then more frequently than now, a stroller along the docks of any considerable seaport would occasionally have his attention arrested by a group of bronzed mariners, man-of-war’s men or merchant sailors in holiday attire ashore on liberty. |
| Gravity’s Rainbow | Thomas Pynchon | A screaming comes across the sky. |
| Inherent Vice | Thomas Pynchon | She came along the alley and up the back steps the way she always used to. |
| A Portrait of the Artist as a Young Man | James Joyce | Once upon a time and a very good time it was there was a moocow coming down along the road and this moocow that was coming down along the road met a nicens little boy named baby tuckoo. |
| Ulysses | James Joyce | Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed. |
| … | … | … |
We’re interested in analyzing the frequency of words in these lines. If we wanted to do this row-by-row, we could do so with a scalar Java function that returned the counts using a variant dictionary.
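To make that concrete, here is a minimal sketch of what such a scalar handler might look like. The class and method names here are illustrative, not part of the post's example; the point is that each call sees exactly one line and returns one value, a map that Snowflake can surface as a variant:

```java
import java.util.HashMap;
import java.util.Map;

public class LineWordCount {
    // Scalar handler: counts the words in a single line and returns
    // the result as a map. Each row is processed in isolation.
    public Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("[.,!\"\\s]+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```

Because the handler has no way to see any row but its own, the counts it produces are necessarily per-line, which is exactly the limitation we run into next.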
But we want the counts across multiple rows; for example, across all of an author’s rows at once. For that, we need a table function. To build a table function, we need to provide a class that exposes:
- A static getOutputClass() method that the system can use to discover the schema of the table function. (We need this because of Java’s type erasure: a Stream&lt;OutputRow&gt; return type erases to a plain Stream at runtime, but that’s a story for another day.)
- A default constructor, which the system will call for each partition in the query. Unlike with scalar functions, where the handler method may be called in parallel and as such should not modify class state, the process() method for table functions is called sequentially for each row in the partition, and may accumulate state.
- A process() method, which takes a single row, does whatever we need with it, and returns a potentially empty Stream of results that the system will associate with the input row.
- An endPartition() method, which is called at the end of the partition, and can return a Stream of rows that the system will associate with the partition as a whole, rather than a specific row in the partition.
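Before filling in the word-count logic, the contract above can be sketched as a bare skeleton. The names and bodies here are placeholders; only the shapes of the methods matter:

```java
import java.util.stream.Stream;

public class HandlerSkeleton {
    // Placeholder output schema: one column per public field.
    static class OutputRow {
        public final String value;
        public OutputRow(String value) { this.value = value; }
    }

    // Tells the system the output schema, despite type erasure.
    public static Class<OutputRow> getOutputClass() {
        return OutputRow.class;
    }

    // Called once per partition; per-partition state starts here.
    public HandlerSkeleton() {}

    // Called once per input row; may return zero or more rows
    // associated with that row. Here we just echo the input.
    public Stream<OutputRow> process(String input) {
        return Stream.of(new OutputRow(input));
    }

    // Called at the end of the partition; may return rows
    // associated with the partition as a whole.
    public Stream<OutputRow> endPartition() {
        return Stream.empty();
    }
}
```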
Let’s put this together for our example.
The Code
We start with some Java framing and the definition of our output type, OutputRow:
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;
public class WordCount {
// A class to define the schema of our output: pairings of words
// and counts.
static class OutputRow {
public final String word;
public final int count;
public OutputRow(String word, int count) {
this.word = word;
this.count = count;
}
}
// The getOutputClass routine is what actually sets this class as
// the return type for the table function.
public static Class<OutputRow> getOutputClass() {
return OutputRow.class;
}
Our output table will have the two columns defined by our OutputRow class: a word and a count of occurrences.
Continuing, we have a Map, wordCounts, that will keep the state of our count, and a constructor that initializes a clean count for each partition:
// Keeps partition-wide counts for each word seen.
private final Map<String, Integer> wordCounts;
// Each partition will have a new instance of WordCount
// constructed, and we are allowed to maintain state over the
// partition. In this case, we're going to start the partition with
// an empty map, and keep running counts as we go.
public WordCount() {
wordCounts = new HashMap<>();
}
Our process method takes a string, breaks it up into constituent words, and updates the state with the words it finds.
// The process method is called on each record. In this case, we'll
// split up the line and add the words to our partition-wide count.
// This method could return per-row values if we wished, but in our
// case, we'll just return an empty Stream because our results are
// really for the partition as a whole.
public Stream<OutputRow> process(String text) {
// Update the counts with the words in this line.
for (String word : text.toLowerCase().split("[.,!\"\\s]+")) {
// If we don't have an entry for the word, set the count to 1.
// Otherwise, increment the count.
wordCounts.compute(word, (key, value) ->
(value == null) ? 1 : value + 1);
}
// We're waiting until the end of the partition to return per-
// word counts and return nothing here.
return Stream.empty();
}
Note that we don’t actually return any rows when we process a line: For our example, we’re only computing the counts per partition. Let’s do that:
// The endPartition routine is called at the end of the partition.
// In our case, this will return the total counts across all lines
// in the input.
public Stream<OutputRow> endPartition() {
// Stream back the word counts for the whole partition. Calling
// stream() on the keySet enables us to iterate lazily over the
// contents of the map.
return wordCounts.keySet().stream().map(word ->
new OutputRow(word, wordCounts.get(word)));
}
}
There we are. We could jar this up, put it on a stage, and register it from there. Or we can write the whole function registration in-line:
create or replace function wordcount(s string)
returns table(word string, count int)
language java
handler = 'WordCount'
as
$$
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;
public class WordCount {
// A class to define the schema of our output: pairings of words
// and counts.
static class OutputRow {
public final String word;
public final int count;
public OutputRow(String word, int count) {
this.word = word;
this.count = count;
}
}
// The getOutputClass routine is what actually sets this class as
// the return type for the table function.
public static Class<OutputRow> getOutputClass() {
return OutputRow.class;
}
// Keeps partition-wide counts for each word seen.
private final Map<String, Integer> wordCounts;
// Each partition will have a new instance of WordCount
// constructed, and we are allowed to maintain state over the
// partition. In this case, we're going to start the partition with
// an empty map, and keep running counts as we go.
public WordCount() {
wordCounts = new HashMap<>();
}
// The process method is called on each record. In this case, we'll
// split up the line and add the words to our partition-wide count.
// This method could return per-row values if we wished, but in our
// case, we'll just return an empty Stream because our results are
// really for the partition as a whole.
public Stream<OutputRow> process(String text) {
// Update the counts with the words in this line.
for (String word : text.toLowerCase().split("[.,!\"\\s]+")) {
// If we don't have an entry for the word, set the count to 1.
// Otherwise, increment the count.
wordCounts.compute(word, (key, value) ->
(value == null) ? 1 : value + 1);
}
// We're waiting until the end of the partition to return per-
// word counts and return nothing here.
return Stream.empty();
}
// The endPartition routine is called at the end of the partition.
// In our case, this will return the total counts across all lines
// in the input.
public Stream<OutputRow> endPartition() {
// Stream back the word counts for the whole partition. Calling
// stream() on the keySet enables us to iterate lazily over the
// contents of the map.
return wordCounts.keySet().stream().map(word ->
new OutputRow(word, wordCounts.get(word)));
}
}
$$;
With this in hand, we can use our new table function in a query:
select author, word, count
from books,
table(wordcount(books.first_line) over (partition by author))
order by count desc;
And we get our counts, sorted from most to least frequent. We could improve this by handling stop words, and we’d probably get better insights by analyzing a larger corpus of text.
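As an aside, process() doesn’t have to wait for the end of the partition: it can emit rows immediately, which the system then associates with the current input row rather than with the partition as a whole. A hypothetical variation on our handler, emitting one row per word as each line arrives (the WordEmitter name is illustrative, not from the example above):

```java
import java.util.stream.Stream;

public class WordEmitter {
    // Output schema: a single word column.
    static class OutputRow {
        public final String word;
        public OutputRow(String word) { this.word = word; }
    }

    public static Class<OutputRow> getOutputClass() {
        return OutputRow.class;
    }

    public WordEmitter() {}

    // Emit one output row per word, tied to the current input row.
    public Stream<OutputRow> process(String text) {
        return Stream.of(text.toLowerCase().split("[.,!\"\\s]+"))
                     .map(OutputRow::new);
    }

    // Nothing left over to emit for the partition as a whole.
    public Stream<OutputRow> endPartition() {
        return Stream.empty();
    }
}
```

Which style to use depends on whether your results belong to individual rows or to the group: per-row emission keeps memory flat, while accumulating state, as WordCount does, lets you aggregate.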
Conclusion
Snowpark is all about making it easy for you to do interesting things with data, while retaining the simplicity, scalability, and security that you expect from Snowflake. With these extensions, we’re hoping you have even more opportunities to put Snowpark to good use. And as always, we look forward to hearing about the exciting things you come up with.
To get started with Java table functions, please check out the documentation. And don’t miss the documentation for Java functions and the Snowpark API as well.
Happy hacking!