Don’t Ignore ACID-Compliant Data Processing in the Cloud

When moving workloads to the cloud, you should expect high performance and robust security. You should also have high expectations for scalability, workload concurrency, elasticity, ease of use, usage-based pricing and resilience. These capabilities, and this consumption model, are more readily available in the cloud and tend to perform far better there than in on-premises implementations.

However, there’s one area where you should take special care to at least match your current capabilities once you migrate your enterprise data warehouse and analytic workloads to the cloud: ACID-compliant data processing.

What is ACID?

The acronym ACID describes a set of processing capabilities (Atomic, Consistent, Isolated and Durable) that ensure a database management system (DBMS) makes changes to data in a reliable, high-integrity way. A DBMS that is ACID-compliant guarantees that all executed and committed transactions, changes and/or updates to data are:

  • Atomic – All statements within a transaction are executed in their entirety or not at all. In database terms, a transaction can be a single statement or operation on data, or a logical group of multiple statements and operations. If the DBMS cannot successfully execute every statement and operation in a transaction, yet some changes have already been applied, it automatically and completely rolls the database back to its prior state.
  • Consistent – Every transaction against the database adheres to defined and declared integrity constraints. This includes requiring that all committed changes to data (i.e., data writes) are up to date and immediately visible to all subsequent reads of the data.
  • Isolated – Results of concurrent (simultaneous) transactions, whether against the same data or table or against multiple data sets or tables, are independent of one another. In other words, the results of concurrent processing are the same as if each transaction had executed sequentially rather than simultaneously.
  • Durable – Changes to data, once committed, remain in place and survive failures of the data management system and platform, even if a hardware failure occurs just as the DBMS is applying committed transactions.
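
To make atomicity concrete, here is a minimal, hypothetical SQL sketch (the table and amounts are illustrative, not drawn from any system discussed here): a funds transfer either commits both updates or rolls back entirely, leaving the database in its prior state.

    -- Hypothetical example: move 100 from one account to another atomically.
    BEGIN TRANSACTION;

    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

    -- If both statements succeed, make the changes permanent (durable):
    COMMIT;

    -- If either statement fails, the DBMS undoes everything the transaction changed:
    -- ROLLBACK;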

Data warehousing workloads benefit from ACID-compliant data processing

Often, in the context of data warehouse processing, it’s argued that transactional ACID-compliant processing is “less relevant” in the data warehouse space and is more appropriate for classic online transaction processing (OLTP). Thus, the argument goes, requirements for data warehouse and analytic workloads can be relaxed to eventual consistency rather than full ACID compliance.

Snowflake believes differently. While it can be argued that OLTP workloads, such as banking or financial transactions, have more at stake should transactions lack integrity, that does not mean online analytical processing (OLAP) or operational data warehousing in the cloud deserve anything less than the highest levels of integrity.

For example, an OLAP application that analyzes the performance of a website will typically create dashboards that reflect customers’ engagement and interaction with a company. If the primary source of customer transactions for this company is the web, then the integrity of the captured data and the subsequent analytics (including dashboards) is critically important.

Not all cloud data warehouses are the same

The issue is that some cloud data warehousing platforms do not assure ACID compliance, which introduces the risk of data inconsistencies (see Figure 1).

 

Figure 1 – Range of ACID Capabilities Based on Cloud Architecture Type

 

A cloud data warehouse infrastructure that is based on a collection of separate data warehouse clusters carries the highest risk. Each cluster is independent: data may have been copied from one cluster to another, but there is no coordination between clusters. Therefore, divergent versions of the data can arise, and there is no single authority for the data from a system-wide perspective.

The highest level of ACID capability comes from a cloud architecture that behaves like a single, integrated data storage system even though there may be separate compute engines. With system-wide integration and metadata management, data changes can be tracked, databases can be rolled back in the event of a transaction failure, and consistency between compute engines can be assured nearly instantaneously.

In the middle are loosely coupled clusters that may have some level of data syncing between cluster nodes and central storage. ACID risk is a function of how swiftly data can be synchronized before read attempts occur.

Maintain ACID processing when migrating analytic workloads to the cloud

In summary, ACID-compliant processing guarantees the highest level of integrity for all transactions on a database and prevents the dirty reads that can lead to invalid results. If you have ACID-compliant processing now with your current on-premises data warehouse, you certainly want to maintain this level of capability as you deploy your data warehouse workloads on a cloud data warehouse platform.

Or, if you’re already in the cloud and want ACID capabilities, consider a single, integrated data warehouse solution that has been built for the cloud.

Start Your 30-Day Free Trial. Receive $400 of credits to try all Snowflake features.

 


How to Get Started with Snowflake on Azure

Snowflake is now available on Microsoft Azure for preview in the East US 2 region. Collaborating closely with the Microsoft Azure team, we ensured we could build the familiar scalability, performance and reliability into Snowflake on Azure. We leverage several new Azure features, including limitless storage accounts, accelerated networking, and storage soft delete. The goal is to provide the same Snowflake experience no matter which cloud infrastructure provider customers choose, with no barriers to entry. Snowflake on Azure will make it easier than ever for teams across an organization to become more data-driven, efficient and productive with their most valuable resources: data and people.

The Technical Details

Snowflake on Azure is architected to run on Azure, leveraging Azure compute and storage infrastructure services for data storage and query processing.

Figure 1.

As Figure 1 shows, Snowflake relies on Azure Blob Storage for data storage. To achieve scalable, highly performing data access, Snowflake stripes customer data across many storage accounts in Azure. Customer requests are processed by what we call virtual warehouses. A virtual warehouse is a set of virtual machines provisioned by Snowflake on Azure Compute. Each virtual warehouse with its virtual machines is dedicated to a single customer’s account. Snowflake receives requests via a load balancer. Requests leverage Snowflake’s cloud services and metadata layers for authentication, query compilation, transaction management, security, data sharing and other capabilities.

First Steps on Azure

The most powerful insights often come from analytics that tie together different data sets. For this blog post, we will explore a scenario that uses Snowflake on Azure to correlate clickstream data from a customer-facing website with transactional data from an order processing system and visualize the results. To do that, we’ll use the following services in Azure:

  • Snowflake on Azure: We’ll show you how to connect to the Snowflake web UI to manage your Snowflake account, provision warehouses, explore your Snowflake databases, run queries, etc.
  • Azure Blob Storage: In this example, Azure Blob Storage stages the load files from the order processing system.
  • Azure Data Lake Store: The clickstream logs in this example are stored in Azure Data Lake Store (Gen1), from which we will load them into Snowflake.
  • PowerBI: Finally, we will connect PowerBI Desktop to Snowflake on Azure to visualize the results of the analytics.   

The following paragraphs walk you through the different Azure data services and explain how to use them together with Snowflake on Azure.

Snowflake on Azure: Using Snowflake through the Web UI

Customers access their Snowflake account on Azure using a URL such as https://<accountname>.east-us-2.azure.snowflakecomputing.com/. After authenticating, you can use the familiar Snowflake web UI to manage your databases, warehouses and worksheets, and access your query history and account details. The screenshot below (Figure 2) shows a Snowflake worksheet with the object explorer on the left, the query editor in the center and the query results at the bottom.

 

Figure 2.

In our example scenario, we’ll use the Snowflake web UI to submit the COPY statements to load data from Azure Blob Storage into Snowflake databases – which we will explain next.

Azure Blob Storage: Loading data from Azure Blob Storage into Snowflake

In this example, we assume that you already exported the data from the transactional order processing system into load files in Azure Blob Storage. Once the data is in Azure Blob Storage, you can load it into Snowflake using the existing support for external stages in Azure Storage (https://www.snowflake.com/global-snowflake-loading-data-into-snowflake-from-azure-blob-storage/). Now, you can use these familiar steps to create a stage in Azure storage and run the COPY command to load the data:

  1. Store your load files in an Azure Blob Storage container.
  2. Create a Snowflake stage object referring to the blob storage location in its URL parameter:

    CREATE STAGE azstage
    URL = 'azure://<storageaccount>.blob.core.windows.net/tpch1000'
    CREDENTIALS = (AZURE_SAS_TOKEN = '…');
  3. Use the COPY statement to load data files from the stage into a target table (ORDERS in this example) in Snowflake:

    COPY INTO Orders
    FROM @azstage/orders

You can find more information about how to configure your Azure Blob Storage account for data loading with Snowflake in the following documentation page: https://docs.snowflake.net/manuals/user-guide/data-load-azure-config.html.
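
If the load files need explicit parsing rules, a named file format can also be attached to the COPY command. The following is a minimal sketch; the delimiter, header setting and file pattern are assumptions for illustration rather than details of the example above:

    -- Hypothetical CSV file format describing the exported order files:
    CREATE FILE FORMAT my_csv_format
      TYPE = CSV
      FIELD_DELIMITER = '|'
      SKIP_HEADER = 1;

    -- COPY can reference the file format (and a file name pattern) explicitly:
    COPY INTO Orders
    FROM @azstage/orders
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
    PATTERN = '.*[.]csv';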

Consider using Azure Data Factory (https://azure.microsoft.com/en-us/services/data-factory/) for richer orchestration scenarios. Azure Data Factory helps with extracting data from multiple Azure services and persisting it as load files in Blob Storage.

You can use these steps to load the files with the order processing data from Azure Blob Storage. For this example, we used TPC-H, a common benchmark data set; Figure 3 shows the blob storage account with the data directories. You can use several COPY statements like the one above to populate the order processing data in your Snowflake tables.

 

Figure 3.

Azure Data Lake Store: Connecting Snowflake with Azure HDInsight, Spark and Azure Data Lake Store

Many customers rely on Apache Spark as an integral part of their data analytics solutions. Snowflake natively integrates with Spark through its Spark connector. Below you can see how to use Spark in Azure HDInsight together with Snowflake. The use case implements a data pipeline that moves data stored in Azure Data Lake Store through HDInsight Spark into a Snowflake table.

Our running example will use this approach to make the clickstream data available in Snowflake next to the order processing data. The following screenshot (Figure 4) shows the folders with load files for the clickstream data in Azure Data Lake Store.

Figure 4.

The HDInsight cluster shown in Figure 5 below runs Spark 2.2. The cluster has been configured with a service principal so that it can access the clickstream files in the data lake account.

Figure 5.

Using the built-in Jupyter notebook capability of HDInsight, you can now run the following simple PySpark program to populate a dataframe in Spark with the clickstream data from data lake store and then persist the dataframe into a Snowflake table:

%%configure
{ "conf": {"spark.jars.packages": "net.snowflake:spark-snowflake_2.11:2.4.0-spark_2.2" }}

df = spark.read.csv("adl://azuredemo.azuredatalakestore.net/clickstreams/20180629/*", header="true")

sfOptions = {
  "sfURL": "<sfaccount>.east-us-2.azure.snowflakecomputing.com",
  "sfAccount": "<sfaccount>",
  "sfUser": "<user>",
  "sfPassword": "<password>",
  "sfDatabase": "<dbname>",
  "sfSchema": "<schemaname>",
  "sfWarehouse": "<whname>"
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

df.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "CLICKSTREAM").mode("append").save()

 

Note the use of the %%configure magic in the first lines of the code snippet. This loads the Snowflake Spark Connector that provides deep integration between Spark and Snowflake for highly optimized performance. After loading the clickstream data into the dataframe df, you can perform further transformations in Spark before writing the result into a Snowflake table. The last line of the code shows how to append data from the Spark dataframe to a table called CLICKSTREAM in Snowflake.

Now that we have loaded both the transactional data and the clickstream logs into Snowflake, we are ready to start analyzing the data sets. The next paragraph explains how to do that with Snowflake on Azure and Microsoft PowerBI.

PowerBI: Visualizing your Snowflake Analytics

You can connect PowerBI Desktop to Snowflake by installing the Snowflake ODBC driver on your PowerBI Desktop machine. Once installed, you can use the built-in Snowflake data source for PowerBI. Click on the ‘Get Data’ button in PowerBI and then choose the ‘Database’ section, as shown in Figure 6. You can find Snowflake at the bottom of the list of supported data sources.

Figure 6.

After connecting to Snowflake using the hostname from your Snowflake account URL, you can explore the tables and develop powerful reports and charts. In Figure 7 below, you can see a chart created from both the transactional and the clickstream data sets. It shows the distribution of daily online customer activity from the clickstream logs over customer market segments from the order processing data.

Figure 7.

You can also publish your Snowflake reports to the PowerBI service. Note that you need the PowerBI gateway to connect PowerBI service to Snowflake.

This completes our running example for Snowflake on Azure. As part of this scenario, we have seen how to load data from both Azure Blob Storage and Azure Data Lake Store into Snowflake on Azure. We also demonstrated how to use Microsoft PowerBI to visualize your analytics in Snowflake.

Resources
Webinar: Snowflake on Azure: Modern Data Analytics, August 23, 11am PT
Blog: Providing Customer Choice: Snowflake on Azure
Partner Solutions: Snowflake on Microsoft Azure

Give it a try!

We’re excited to see what you build with Snowflake on Microsoft Azure. Create your Snowflake account on Azure today. And, of course, we’d love to hear your feedback. Join the Snowflake on Azure community on the Snowflake website to share your feedback, experiences, tips & tricks and to ask questions.


Why You Need a Cloud Data Warehouse

Are you new to the concepts of data warehousing? Do you want to know how your enterprise can benefit from using a data warehouse to streamline your business, better serve your customers and create new market opportunities? If so, this blog is for you. Let’s cover the basics.

It begins with production data

Day-to-day, nearly all enterprise-class companies process data as part of core business operations. Banks process debit and credit transactions for account holders. Brick-and-mortar and online retailers process in-store and website purchases, respectively. Insurance companies maintain and update customer profile and insurance policy information for policyholders.

These production systems are transactional in nature and require databases that can capture, write, update or delete information at the pace of business operations. The systems behind these transactions are online transaction processing (OLTP) databases. For example, OLTP databases for investment firms must operate at lightning speed to keep up with high-volume stock and bond trading activity that occurs in fractions of a second.

The need for a data warehouse solution

In addition to capturing transactions, another aspect of business operations is understanding what’s happening, or what has happened, based on the information captured in OLTP databases. By this, I mean companies must not only know how much revenue is coming in, but also where that revenue is coming from, the profile of the customers making purchases, whether business trends are up or down, which products and services are being purchased, and when those transactions take place. And certainly, businesses need to know what it will take for customers to remain loyal and buy more. Answers to these questions are necessary to develop strategic business plans and new products that keep the business growing.

Why transactional (OLTP) systems are not optimized for data warehousing

Acquiring these insights requires accumulating, synthesizing and analyzing the influx of data from OLTP databases. The aggregation of all this data results in very large data sets for analytics. In contrast, when OLTP systems capture and update data, the amount of data involved in each transaction is actually very small. However, OLTP systems will execute thousands upon thousands of small transactions at a time. This is what OLTP systems are optimized to do; they are not optimized for the analysis of large to extremely large data sets.

This is why data warehousing solutions emerged. Data warehouse solutions will hold a copy of data stored in OLTP databases. In addition, data warehouses also hold exponentially larger amounts of data accessed by enterprises, thanks to the enormous amount of Internet and cloud-born data. Ideally, data warehouses should be optimized to handle analytics on data sets of any size.  A typical data warehouse will have two primary components: One, a database (or a collection of databases) to store all of the data copied from the production system; and two, a query engine, which will enable a user, a program or an application to ask questions of the data and present an answer.
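
For instance, a minimal, hypothetical analytic query against a warehouse copy of order data (the table and column names are assumptions for illustration) might aggregate revenue by customer segment and month:

    -- Hypothetical: revenue and order counts by customer segment and month.
    SELECT c.market_segment,
           DATE_TRUNC('month', o.order_date) AS order_month,
           SUM(o.total_price)                AS revenue,
           COUNT(*)                          AS order_count
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    GROUP BY c.market_segment, order_month
    ORDER BY order_month, revenue DESC;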

Benefits of deploying a data warehouse

As previously stated, with a data warehouse you can ask and find answers to questions such as:

  • What’s the revenue?
  • Who’s buying?
  • What’s the profile of customers?
  • What pages did they visit on our website?
  • What caught their attention?
  • Which customers are buying which products?

With natural language processing and other deep learning capabilities gaining popularity, you can even develop insights about the sentiment of prospects and potential customers as they journey toward your enterprise.

Benefits of data warehousing… in the cloud

Many data warehouses deployed today were developed during the 1980s and were built for the on-premises data centers typical of the time. These solutions still exist, including “cloud-washed” versions. Both options typically involve upfront licensing charges to buy and maintain these legacy data warehouses. Yet neither legacy data warehouses nor current-generation data lakes based on Hadoop can elastically scale up, down, or suspend as needed to meet the continuously varying demands of today’s enterprises.

 

As a result, these types of solutions require a lot of attention to low-level infrastructure tasks, which diverts IT and data science teams from truly strategic analytics projects that advance the business.

With modern, cloud-built data warehouse technology now available, such as Snowflake, you can gather even more data from a multitude of data sources and instantly and elastically scale to support virtually unlimited users and workloads.

All of this is accomplished while ensuring the integrity and consistency of a single source of truth, without a fight for computing resources, and across a mix of data varieties such as structured and semi-structured data. With a modern cloud service, any number of users can query data easily, in a fully relational manner, using familiar tools, with better security, performance, data protection and ease of use built in.

For these reasons, you can expect enterprises to turn to companies like Snowflake to help propel insights from their data in new directions and at new speeds, regardless of the size of the business or the industry in which it competes.


How Usage-Based Pricing Delivers a Budget-Friendly Cloud Data Warehouse

When you have budgets and business plans you need to stick to, you want clear pricing when operating your cloud data warehouse. However, it can be a frustrating task to understand how some cloud data warehouses charge for services. Each vendor has its own pricing model, and users expect differences. What’s frustrating are the hidden charges, quotas and other penalties.

Here are just some of the ways Snowflake helps you to manage your data warehouse budget easily, cost-effectively and efficiently.

A simple data warehouse pricing model

Snowflake’s pricing model includes only two items: the cost of storage and the cost of compute resources consumed. The charge for storage is per terabyte, compressed, per month. The charge for compute is based on the processing units, which we refer to as credits, consumed to run your queries or perform a service (for example, Snowpipe data loading). Compute charges are billed on actual usage, per second. All of our Snowflake editions and pricing details can be found here on our website.   

All charges are usage-based

As an example, using US pricing as a reference, Snowflake storage can cost as little as a flat rate of $23/TB (average compressed amount) per month, accrued daily. Compute costs $0.00056 per second, per credit, for our Snowflake On Demand Standard Edition. Our Snowflake On Demand Enterprise Sensitive Data Edition (which includes HIPAA compliance, PCI compliance, customer-managed encryption keys and other security-hardened features) is $0.0011 per second, per compute credit.
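
To put those numbers in context with a quick, hypothetical calculation (the warehouse size and runtime are chosen purely for illustration): an X-Small warehouse consumes one credit per hour, so running it for one hour on Standard Edition costs roughly 3,600 seconds × $0.00056 ≈ $2.02 of compute. Because billing is based on per-second usage, a 20-minute run would cost roughly a third of that.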

Clear and straightforward compute sizing  

Once you’re in Snowflake, you can enable any number of “virtual warehouses”, which are effectively the compute engines that power query execution. Virtual warehouses are available in eight “T-shirt” sizes: X-Small, Small, Medium, Large and X-Large through 4X-Large. Each warehouse size has a compute credit designation; as you go up in size, credit usage increases. Refer to Table 1.

Table 1 – Snowflake Warehouse Sizes and Credit Usage

Based on our internal benchmarks, performance improves linearly as the size of the warehouse increases.

No charge for idle compute time

Queries run on the virtual warehouse from which they are launched. What’s cool about Snowflake is that you can configure a warehouse to shift into suspend mode when no queries are actively running. Once suspended, charges for idle compute time stop as well (Figure 1).

Figure 1 – Snowflake Pricing Tracks With Actual Usage

The warehouse restarts automatically to process a new query; no manual re-provisioning is required. This is more graceful than terminating a service to stop charges, as other cloud-based solutions require.
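
As a minimal sketch (the warehouse name and thresholds are illustrative assumptions), auto-suspend and auto-resume behavior can be set when a virtual warehouse is created or altered:

    -- Hypothetical warehouse that suspends after 5 minutes of inactivity
    -- and resumes automatically when the next query arrives:
    CREATE WAREHOUSE dev_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 300      -- seconds of inactivity before suspending
      AUTO_RESUME = TRUE;

    -- The same properties can be adjusted later:
    ALTER WAREHOUSE dev_wh SET AUTO_SUSPEND = 60;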

With other cloud data warehouse solutions, when you terminate services (to stop billing), you lose the data because it is unloaded from the data warehouse. If you later need to run queries against that data, you must re-provision the data warehouse and reload the data all over again. This is disruptive and inefficient. To avoid this disruption, these cloud data warehouse solutions must remain on and active, 24×7, with the meter running, whether you are running queries or not. Further, with other cloud data warehouses, if terminated services are restarted, unused credits do not roll over.

No hidden quotas or price bumps

When you run queries in Snowflake, there are no added usage quotas or hidden price premiums. You pay only for what you use.

Other cloud data warehouse solutions may not charge based on time, but rather will charge based on how many terabytes are scanned or how many terabytes are returned from a query. In addition, you may have no ability to control the compute resources available to you because you are given an allocation of processing units or slots. And if the query you run consumes more than the allocation provided to you, you are charged a premium.

True cloud-scale elasticity

On-premises data warehouses, in particular, make it nearly impossible to scale down hardware once it is installed and provisioned. This is cost prohibitive because most on-premises implementations are pre-sized and purchased for a peak demand period, which could be just a few days a month or a year, leaving a massive investment idle the rest of the time.

Keeping it simple, cost-effective and price-efficient

Unlike traditional data warehouse technology (on-premises or in the cloud) never designed for the cloud, Snowflake’s cloud-built data warehouse-as-a-service makes analytics and budgeting easy. We also make it as cost-effective as possible for you to get all the insight from all your data at a fraction of the time required by your current solution. By avoiding the headaches of traditional solutions, Snowflake enables you to focus on strategic data analytics and engineering projects that advance your business.

 

The Dream Data Warehouse Development Environment

Earlier this month, Snowflake’s Customer Enablement Team was assigned an email from one of our customers. The customer stated that he was not happy about the idea of cloning full copies of production databases for development purposes. “Do we really want to develop and maintain a system to copy a fraction of the production DB to dev?” he asked, explaining that simply copying the entire production database would give the dev team access to too much data. Being a veteran of Snowflake, I initially dismissed his concern because of Snowflake’s zero-copy clone capability, as outlined in this article. From my perspective, the zero-copy clone would not incur any additional cost for development purposes, so why not give the dev team all of the data?

The answer of course, as the customer pointed out, has to do with making the development effort more efficient. The assumption of zero-copy clone equating to zero-cost development is, of course, incorrect. There’s the cost of querying the data (requiring virtual warehouse credits) and the cost behind each development hour. After all, longer query times lead to longer development iterations and longer testing cycles. To create a “blissful” development environment in Snowflake, we need a more refined approach towards building the development data set.

The approach outlined by the customer was rooted in complementing Snowflake’s zero-copy clone with the additional technique of Block Sampling. In other words, he proposed creating a view containing a sample set from the original table. This approach enables an administrator to quickly set up a dev environment with minimal data. Controlling the amount of sample data is a good thing in many development situations because developers seldom require access to the full dataset.

Ok, let’s take a look at SAMPLE / TABLESAMPLE and see how we can do this. The syntax is quite simple:

 
 SELECT ...
 FROM ...
   { SAMPLE | TABLESAMPLE } [ samplingMethod ] ( <probability> ) [ { REPEATABLE | SEED } ( <seed> ) ]
 [ ... ]
 
 -- Where:
    samplingMethod ::= { { BERNOULLI | ROW } | { SYSTEM | BLOCK } }


Note that there are some interchangeable terms we should pay attention to when writing the query. These terms are synonyms and differ only in spelling (a small example follows the list below):

SAMPLE | TABLESAMPLE

BERNOULLI | ROW

SYSTEM | BLOCK

REPEATABLE | SEED
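
As a small illustration using the LINEITEM table from later in this post, the first two statements below use the same row-sampling method written two different ways, while the third uses block sampling with a seed so the same sample is returned on every run (the probabilities and seed value are arbitrary):

    -- BERNOULLI and ROW are interchangeable, as are SAMPLE and TABLESAMPLE:
    select * from SALES.PUBLIC.LINEITEM sample bernoulli (5);
    select * from SALES.PUBLIC.LINEITEM tablesample row (5);

    -- SYSTEM/BLOCK sampling accepts a seed so the same blocks are chosen each run:
    select * from SALES.PUBLIC.LINEITEM sample block (10) seed (42);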

The two main methods of sampling are ROW (or BERNOULLI) and BLOCK (or SYSTEM) sampling. Let’s take a closer look at each one. 

Row Sampling

This approach uses the Bernoulli principle to select data by applying a probability of p/100 to each row. In the Snowflake documentation, we mention that this is similar to “flipping a weighted coin” on each row. The number of sampled (selected) rows should be approximately equal to (p/100) * n, where n is the total number of rows in the table and p is the sample probability value set by the user.

This method implies that we iterate through each row and evaluate a probability for each row to decide whether it becomes a candidate. This iteration impacts query performance. The advantage, however, is that we end up with a well-distributed, representative sample of our data. For creating a dev environment, using the Bernoulli sampling method in a view does not improve performance, since each query against the view must still evaluate every row of the underlying table. However, you could certainly create a new table based on this sampling method and then scan less data. For example:

create table dev_Sales_LineItem_Bernoulli_20 as
    select *
    from SALES.PUBLIC.LINEITEM
    sample bernoulli (20);

This statement will create a development table from the LINEITEM table in our SALES database with 20% of the rows from the original table.

I won’t focus too much on this method in this post. You can feel free to experiment on your own to see if this sampling method would suit your needs. Instead, I will talk more about Block Sampling.

Block Sampling

In this approach, we apply a probability of p/100 to each block of rows. For those who are familiar with Snowflake’s micro partitions (details below), block sampling chooses individual partitions based on a specific probability. If you simply want quick access to production data and to run queries against a small percentage of the rows in a table, leveraging Block Sampling is a good way to go.

Let’s look at this with a specific example using the Block Sampling method on the LINEITEM table in the SALES database. Here are the table metadata:

In this example, I have a developer who is working on an aggregation query against the LINEITEM table. He created a simple query like this:

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_orderdate <= dateadd(day, -90, to_date('1998-12-01'))
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

Result:

 

This query reports the amount of business that was billed, shipped and returned. When the developer issues this query against the table, we see the following outcome in the query plan. (Note: the result below was produced on an XS warehouse.)


It took about 19 seconds on an XS warehouse, which is not bad by any means. However, we can see that the query still performed a full table scan along with a fair amount of aggregation.

Now, let’s see how we can improve the experience for the developer by creating a sample set of the LINEITEM table. We’ll create a special view (against our cloned dev database) for this developer and call it V_SALES_LINEITEM_SAMPLE_10. In this view, we’ll use the Block Sampling technique and only scan 10% of the partitions required by this query.

First create the view:

create view v_Sales_LineItem_Sample_10 as
    select *
    from SALES.PUBLIC.LINEITEM
    sample block (10);

Next, let’s update the query to use the view instead.

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    v_Sales_LineItem_Sample_10
where
    l_orderdate <= dateadd(day, -90, to_date('1998-12-01'))
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

Result:

 

Let’s take a look at the query profile for this query:

 

This query ran in about 2.5 seconds on an XS warehouse, and we scanned only about 10% of the table’s partitions. This result is much better! Snowflake samples 10% of the partitions and then applies the WHERE filter to the rows pulled from them. The result should still be accurate enough to let the developer know this query is working as expected.

In conclusion, leveraging Snowflake’s SAMPLE clause considerably reduces the size of the test dataset for a developer without losing data integrity. Even if a specific id or timestamp biases the sampled data, chances are the developer will accept the nuance and continue with development rather than spend additional compute credits and hours waiting for results to return. An alternative is to use the LIMIT clause to reduce the number of rows returned; however, this involves modifying the original query or creating a view that still needs to be maintained. Using the SAMPLE clause, you can reliably subset a large table with an acceptable result mix and good performance. I hope you find this helpful. Feel free to leave comments in our community and let me know your thoughts.
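
For comparison, here is a minimal, hypothetical LIMIT-based approach (the row count is an illustrative assumption). It caps how many rows come back, but the rows are arbitrary rather than a representative sample, and every query has to be edited to include it:

    -- Returns 10,000 arbitrary rows; not a statistically representative sample:
    select *
    from SALES.PUBLIC.LINEITEM
    limit 10000;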

 

Deliveroo Delivers with Real-time Data

In a field of struggling food delivery startups, one notable success story has emerged from the fray. Termed “the European unicorn” by TechCrunch, Deliveroo is a British startup that offers fast and reliable food delivery service from a premium network of restaurants.

Deliveroo recently raised a $385 million funding round, boasts an estimated $2 billion valuation and is credited with transforming the way people think about food delivery. What is this unicorn doing differently? How has it found success where so many others have failed?

“Data is baked into every aspect of the organization,” said Henry Crawford, Deliveroo’s head of business intelligence. “Having instant access to data reveals which geographic areas are experiencing a shortage of restaurants and a shortage of particular cuisines, so we can create these hubs right at the consumer’s doorstep.”

Using data-driven insights, Deliveroo analyzes customer behavior, spots market trends and responds with swift decisions and rapid execution. Snowflake makes all of this possible.

“With data coming from a variety of sources, including web traffic, transactions and customer behavior, having a data warehouse built for the cloud provides one repository for a single source of truth,” Henry explains. “The shift to Snowflake’s cloud data warehouse has enabled us to make good on our promise that got Deliveroo started: To connect consumers with great food from great restaurants, wherever you are, and whatever it takes.”

Snowflake also accommodated Deliveroo’s 650% growth in 2016. Such rapid momentum prompted Deliveroo to expand its business intelligence team from two employees to 14, which triggered the need for more people to access the same data without impacting performance.

Since Snowflake is built for the cloud, an unlimited number of users can access all of an organization’s data from a single repository, which is critical to Deliveroo’s success. There’s no replicating data, shifting queries and other workloads to non-business hours, or queueing users to preserve performance. Instead, Snowflake’s true cloud elasticity means Deliveroo can automatically scale up, down and out (concurrency) to load and analyze data without disruption.

“None of these future plans would be possible without real-time, concurrent access to massive volumes of data,” Henry said.

What’s next for Deliveroo? Using real-time logistics algorithms to increase the number and the speed of deliveries. Deliveroo’s expansion plans also include an “Editions” program—delivery-only kitchens so partner restaurants can expand their footprint without opening brick-and-mortar locations.

Learn more about how Snowflake can accelerate your data storage and analytics initiatives.