Three years ago, I predicted that data science and machine learning (ML) would converge with data warehousing, and that it would happen in the cloud. The typical reaction was short and sweet: Why would data scientists use a data warehouse, which is often the most constrained resource in an enterprise, when they’re perfectly happy to use files such as Apache Parquet, XML and comma-separated values (CSV)? And how could this scenario ever work in the cloud?
Today, that convergence is a reality. Snowflake is proud to announce a full integration between our built-for-the-cloud data warehouse and Databricks, the leader in unified analytics. With complete accessibility between two unique offerings, this groundbreaking convergence is finally happening.
Since my prediction came true, I’d like to make another one: This convergence will have an accelerating effect on artificial intelligence (AI) and ML. The reason? AI and ML get better the more data you have. And more data is being produced every minute.
However, many data scientists still don’t think about AI and ML data in the same way analysts think about BI data. A common (mis)perception among data scientists using ML is that primitive files, rather than live data, should be used for training models and for applying different kinds of ML functions. These files are typically extracted from the data warehouse so as not to disrupt production workloads.
With a traditional data warehouse, these beliefs make some sense. However, Snowflake is no ordinary data warehouse, and there’s no reason for legacy practices to continue now that cloud data warehousing is an option. Snowflake opens the door for scalability, full support for semi-structured data such as JSON, and the utilization of real-time data without duplication.
In fact, with the two cloud SaaS platforms working together, Snowflake and Databricks have created a unified solution that arms ML scientists with the right tools to evolve their practices and benefit from:
- Faster performance: Get high-speed queries through an extended set of Spark operations that are automatically pushed down to Snowflake as SQL, combined with Snowflake’s always-on, metadata-based pruning.
- Rapid pace of innovation: Eliminate data wrangling, which currently takes up the majority of ML engineering time. Instead, focus on tuning ML models for faster insights.
- Easy access to up-to-date data: Ditch comma-delimited files or directory-partitioned data. Use SQL, Python or R with your libraries of choice to define data transformations.
- Zero duplication: Never settle for stale extract files again. Utilize a single source of data and truth for both AI and ML.
- Scalability: Remove limits on your data usage. Organize vast amounts of data at scale in a performant manner, thanks to built-in cloud elasticity and flexibility.
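To make the list above a little more concrete, here is a minimal Python sketch of how a data scientist might read live Snowflake data into a Spark DataFrame instead of working from an extract file. It is an illustration under stated assumptions, not the integration's definitive usage: the helper names `build_snowflake_options` and `load_training_frame` are hypothetical, all connection values are placeholders, and the data source name and `sf*` option keys are those of the Snowflake Connector for Spark, which is assumed to be installed on the cluster.

```python
from typing import Any, Dict

# Data source name of the Snowflake Connector for Spark
# (assumed to be available on the Spark cluster).
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"


def build_snowflake_options(
    account_url: str,
    user: str,
    password: str,
    database: str,
    schema: str,
    warehouse: str,
) -> Dict[str, str]:
    """Collect connector options in one place (hypothetical helper).

    In practice the credentials would come from a secrets manager,
    not from literals in a notebook.
    """
    return {
        "sfURL": account_url,
        "sfUser": user,
        "sfPassword": password,
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }


def load_training_frame(spark: Any, options: Dict[str, str], query: str) -> Any:
    """Read the result of a SQL query from Snowflake as a Spark DataFrame.

    `spark` is expected to be an existing SparkSession. The query runs
    inside Snowflake, so only the result set travels to Spark.
    """
    return (
        spark.read.format(SNOWFLAKE_SOURCE_NAME)
        .options(**options)
        .option("query", query)
        .load()
    )
```

In a notebook, this might be called as `load_training_frame(spark, options, "SELECT ... FROM ...")`, with the table and column names supplied by the reader. The resulting DataFrame reflects the warehouse at query time, so no extract file needs to be created or refreshed.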
With Snowflake’s first-class back end and foundation for your data management tier, coupled with Databricks’ Unified Analytics Platform, everything just works. Data scientists can train models while analysts run dashboards, all on the same data, while new data continues to flow into the data warehouse without any downside or disruption. Even customer-facing applications can hit the data, all at the same time!
However, the real winners here are the customers, as ML becomes mainstream. Teams that lack the necessary technical expertise are now empowered to run these services without worrying about the infrastructure or where the data comes from. When everything is democratized, organizations are stronger and more data-driven from end to end.
I’d be remiss if I didn’t also mention that only Snowflake enables instant access to shared data in real time, across and between organizations. For ML, this zero movement and sharing of real-time data will prove to be powerful for all businesses that need to collaborate and work together to gain deeper insights.
I firmly believe that being data-driven means deciphering patterns in data to make better decisions. In the end, with all data and data relationships converging in one connected place, AI driven by ML will take us to a whole new level by telling us what questions we should ask and what insights we should glean about our data.