Date: 2025-01-01

In their prior architecture, Sanofi faced several challenges with the managed Spark engine that served as their computation layer. The engine was deployed manually and had to be updated by hand each time new features were introduced or the back-end pipeline changed. This added coordination overhead and manual dependencies across all processes, so the pipeline took longer to run from end to end.

Building and configuring a Spark cluster was also very resource-intensive. “We observed that the cluster was not scalable and required manual configuration to spin up a bigger instance to run any complex or intensive queries, leading to performance issues in the pipeline,” says Ratan Roy, Data Engineer at Sanofi. “There were also no automatic optimizations in place, and processing required huge amounts of memory.”

The data team frequently saw pipelines fail or stall for lack of compute resources. Because the managed Spark environment was shared, compute was allocated according to the Spark cluster's availability rather than on demand for each request.

The web-based platform used by the medical user community encountered concurrency issues when multiple interactive users submitted requests to run web application programs at the same time. Because the Spark cluster had to process more than a billion records, the average analytical response time for a request was around 15 minutes during peak hours.

While separation of compute and data storage is native to the Snowflake platform, the previous managed Spark solution did not have an integrated data storage layer. Data processing had to occur separately and required additional setup, configuration and data movement between multiple platforms, resulting in longer processing times.

When deciding whether to migrate to Snowpark and the Snowflake platform, the Sanofi data team ran a benchmark analysis that found an overall 50% performance improvement over their managed Spark cluster, along with a lower total cost of ownership (TCO). “We are able to have large-scale data processing all within the Snowflake environment, giving us greater agility and speed at a lower cost,” Ratan says. “With Snowflake as the central data storage and Snowpark compute, we’ve reduced data movement costs, which has led to faster performance and decreased compute cost.”