Industry
Pharmaceutical
Location
Bridgewater, NJ
Chasing the miracles of science
With flagship locations in Bridgewater, NJ, and Cambridge, MA, Sanofi in the United States employs more than 13,000 professionals throughout the country. Sanofi U.S. comprises four business units: Specialty Care, Vaccines, General Medicines and Consumer Healthcare. Around the world, more than 100,000 people at Sanofi are dedicated to chasing the miracles of science to improve people’s lives.
Story Highlights
Real-world clinical data for the medical community: Sanofi is creating a reactive web application, built on Snowflake, for medical professionals to analyze real-world clinical data for assessing therapy benefits or risks.
Managed Spark to Snowpark: Migrating from Sanofi’s previous managed Spark solution resulted in a 50% performance improvement and eliminated challenges with administration, configuration and concurrency.
Professional Services: Snowflake Professional Services and the Snowpark Migration Accelerator, an automated code conversion tool, helped convert PySpark code to Snowpark and accelerate the migration.
Scaling to support a real-world clinical data application
Sanofi is undertaking a project to develop an enterprise-wide data processing platform aimed at assisting the medical community with its analytical needs, particularly in the context of drug discovery. The project focuses on creating an intuitive web application in which medical professionals enter query filters related to diseases, drugs or procedures and identify patient cohorts that meet specific criteria, so they can analyze real-world clinical data more quickly when assessing therapy benefits or risks.
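As a rough illustration of how user-entered filters for diseases, drugs or procedures might be turned into a cohort query, here is a minimal sketch. It is not Sanofi’s implementation: the `CohortFilter` class, the `build_cohort_query` function, the `clinical_events` table and all column names are hypothetical, and the sketch assumes a single events table in which one row can carry all three codes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CohortFilter:
    """User-selected criteria. Every field name here is hypothetical."""
    diseases: List[str] = field(default_factory=list)
    drugs: List[str] = field(default_factory=list)
    procedures: List[str] = field(default_factory=list)


def build_cohort_query(
    f: CohortFilter, table: str = "clinical_events"
) -> Tuple[str, List[str]]:
    """Return (sql, params) selecting distinct patients that match all filters.

    Filter values are returned as bind parameters instead of being
    interpolated, so user input never becomes part of the SQL text.
    """
    clauses: List[str] = []
    params: List[str] = []
    for column, values in (
        ("disease_code", f.diseases),      # hypothetical column names
        ("drug_code", f.drugs),
        ("procedure_code", f.procedures),
    ):
        if values:
            placeholders = ", ".join("?" for _ in values)
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(values)
    # With no filters selected, every patient qualifies.
    where = " AND ".join(clauses) if clauses else "TRUE"
    sql = f"SELECT DISTINCT patient_id FROM {table} WHERE {where}"
    return sql, params
```

The returned SQL text and parameter list could then be handed to whichever query layer the application uses; keeping the values separate from the SQL string is the main design point of the sketch.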
The web application processes billions of records to generate the key analytical insights a user is interested in exploring. To support this, Sanofi’s previous data architecture relied on a managed Spark engine as its computation layer. However, the data team faced several challenges with that setup: manual deployment and maintenance, limited resource scalability, frequent pipeline failures caused by insufficient computational resources, concurrency problems during peak usage, and complex data movement between multiple platforms.
In order to serve their users better, the Sanofi data team decided to redesign their analytical engine. Architecture and Data Engineering Lead Suku Muramula says, “As we were leveraging Snowflake for various data processing tasks, we saw the opportunity to explore Snowpark as a prospective solution for addressing our upcoming data processing needs.”
Redesigning Sanofi’s analytical engine on Snowflake and Snowpark
Sanofi chose Snowflake and Snowpark, the set of libraries and runtimes to securely deploy Python code, for the redesign of one of its analytical engines. Snowflake’s separation of storage and compute, near-zero maintenance and on-demand scalability allowed Sanofi to efficiently handle increased workloads and data volumes without compromising performance, all while keeping costs at an optimal level.
As they began their migration, the data team prioritized a service-focused architecture. The goal was to build a robust and efficient system with independent services, enhancing fault isolation to ensure that issues in one service do not impact the entire system. This was instrumental in expediting their migration journey from a managed Spark cluster to Snowflake because it minimized disruption to the web application.
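The fault-isolation goal described above can be sketched in a few lines. This is only a toy, in-process illustration of the principle, not Sanofi’s architecture: the `run_services` helper and the two example services are hypothetical.

```python
from typing import Callable, Dict, Tuple


def run_services(
    services: Dict[str, Callable[[], object]]
) -> Dict[str, Tuple[str, object]]:
    """Run each service independently and record its outcome.

    An exception raised by one service is caught and reported for that
    service only, so the remaining services still run.
    """
    results: Dict[str, Tuple[str, object]] = {}
    for name, service in services.items():
        try:
            results[name] = ("ok", service())
        except Exception as exc:  # isolate the failure to this service
            results[name] = ("failed", str(exc))
    return results


def cohort_service() -> int:
    """Hypothetical service that succeeds and returns a placeholder value."""
    return 42


def export_service() -> None:
    """Hypothetical service that always fails."""
    raise RuntimeError("warehouse unavailable")
```

Calling `run_services({"cohort": cohort_service, "export": export_service})` reports the export failure while still returning the cohort result, which is the behavior a service-focused design aims for at system scale.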
Figure 1. Current architecture with Snowflake and Snowpark ecosystem.
As shown in Figure 1, instead of using complex pipelines, the new architecture streamlines data processing by keeping the data and compute platform together with Snowflake and Snowpark. This reduced latency and improved overall performance, enabling faster data processing and analytics.
Snowflake’s features for data governance, including granular permissions and role-based access control, provide robust control over data and libraries. This ensures data security and compliance with policies.
“In addition to accelerating our data processing speeds, it’s paramount for our industry to safeguard intellectual proprietary data and ensure algorithm security and effective governance. Leveraging Snowpark as the computational layer for Python code within our Snowflake data platform empowers us to eradicate the need for data transfer while granting our administrators comprehensive authority over all data and libraries.”
Suku Muramula
Gaining a 50% performance improvement with Snowpark
In their prior architecture, Sanofi faced several challenges when they used a managed Spark engine as their computation layer. Because the engine was deployed manually, it had to be updated by hand each time new features were introduced or changes were made to the back-end pipeline. This increased coordination and manual dependencies across all processes and, as a result, the pipeline took longer to run from end to end.
Building and configuring a Spark cluster was also very resource-intensive. “We observed that the cluster was not scalable and required manual configuration to spin up a bigger instance to run any complex or intensive queries, leading to performance issues in the pipeline,” says Ratan Roy, Data Engineer at Sanofi. “There were also no automatic optimizations in place, and processing required huge amounts of memory.”
The data team experienced many pipeline failures or delays due to a lack of computational resources. Because the managed Spark environment was shared, compute resources depended on the Spark cluster’s availability rather than being provisioned on demand for each request.
The web-based platform, which the medical user community uses, encountered concurrency issues when multiple interactive users made requests to simultaneously run web application programs. Considering the heavy processing requirement of over a billion records by the Spark cluster, the average analytical response time for a request was around 15 minutes during peak hours.
While separation of compute and data storage is native to the Snowflake platform, the previous managed Spark solution did not have an integrated data storage layer. Data processing had to occur separately and required additional setup, configuration and data movement between multiple platforms, resulting in longer processing times.
When deciding whether to migrate to Snowpark and the Snowflake platform, the Sanofi data team performed a benchmark analysis that found an overall 50% performance improvement over their managed Spark cluster, along with lower overall TCO. “We are able to have large-scale data processing all within the Snowflake environment, giving us greater agility and speed at a lower cost,” Ratan says. “With Snowflake as the central data storage and Snowpark compute, we’ve reduced data movement costs, which has led to faster performance and decreased compute cost.”
“Our entire data engineering pipeline and algorithm is built using Python and Snowpark code. All data queries are processed through Snowpark on the Snowflake platform.”
Ratan Roy
Partnering with Snowflake Professional Services
When Sanofi decided to migrate from Spark to Snowpark, they engaged with the Snowflake Professional Services team as a key part of their journey.
“Frankly, I was very happy with the experience we had with Professional Services. The team was supportive from day one, facilitating us to help manage and identify what’s required for a successful migration,” Muramula says. “Their readiness assessment was nothing short of exceptional, and I would say it helped us gain valuable insights into the migration process and ensure that we are well prepared to establish resources and identify any gaps in the process.”
The Snowpark Migration Accelerator, an automated code conversion tool, converted PySpark code to Snowpark and accelerated the migration process. “This was a game changer and helped us move quickly while ensuring the code stayed pristine. Overall, I strongly encourage others to engage with Professional Services when embarking on such a journey,” Muramula says.
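To give a feel for one small part of what automated PySpark-to-Snowpark conversion involves, the toy sketch below rewrites import lines from PySpark modules to their Snowpark counterparts. It is an assumption-laden illustration and does not describe how the Snowpark Migration Accelerator works internally: the `rewrite_imports` function and `IMPORT_MAP` table are hypothetical, and only the module names themselves are real.

```python
# Real module pairs; the rewriting logic is a toy, not the tool's method.
IMPORT_MAP = {
    "pyspark.sql.functions": "snowflake.snowpark.functions",
    "pyspark.sql.types": "snowflake.snowpark.types",
}


def rewrite_imports(source: str) -> str:
    """Replace PySpark module names with Snowpark ones on import lines only.

    Non-import lines are left untouched, so strings or comments that
    happen to mention a PySpark module are not modified.
    """
    rewritten = []
    for line in source.splitlines():
        if line.strip().startswith(("import ", "from ")):
            for spark_module, snowpark_module in IMPORT_MAP.items():
                line = line.replace(spark_module, snowpark_module)
        rewritten.append(line)
    return "\n".join(rewritten)
```

A real conversion has to handle far more than imports, such as session creation and API calls that differ between the two libraries, which is why a dedicated tool and a readiness assessment are useful.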
Streamlining data sharing and data science
With the current application, the data team at Sanofi is focusing on a handful of data sources. However, they plan to expand to more data sources to enable the medical community to research additional diseases and therapies.
“As we move forward, our data collection and processing procedures continue to evolve, necessitating the handling of billions of additional records to enhance our analytical capabilities,” Muramula says. “We believe the Snowflake platform will remain as our reliable choice for dynamic scalability and robustness, and to effortlessly cater to the ever-expanding landscape.”