At Snowflake, we support customer AI workloads by offering a diverse range of open source and proprietary large language models (LLMs). As we expand our first-party product offerings in Snowflake Cortex AI, including Cortex Analyst (in public preview), Snowflake Copilot (generally available in select regions) and Cortex Search (in public preview), we have found that spinning up dedicated instances for each new model can be costly and inefficient. In this article, we will explore model hotswapping, a technique for dynamically swapping LLMs on a static set of inference engines.
Previous model-scaling challenges
Although it might seem that inference is just about writing efficient kernel code, scaling our services to support thousands of customers required significant time and effort. We had to determine how to allocate our machines and how to support new models without increasing infrastructure costs. Without the ability to swap models on the fly, we pinned dedicated instances to each model. To decide how many replicas of each model to deploy, we had to observe previous traffic patterns and reallocate model instances based on usage.
This resulted in a significant amount of manual work: deciding which machines to allocate to which models, and then releasing the resulting set of model configurations into production, which took a non-trivial amount of time. Moreover, when we wanted to deploy new models to new regions, we commonly had to ask the question: Are there additional GPUs available for provisioning?
Why not only use traditional horizontal pod autoscaling (HPA)?
With the limited availability of high-end GPU resources from the different cloud providers supported by Snowflake, we reserve a fixed number of instances for availability. This practice contrasts with workloads that do not use GPU resources, where machines may be abundant and a deployment can scale horizontally to an arbitrary number of nodes. Because the number of GPU instances is fixed, the cost is also constant. Therefore, today we deploy our instances at their maximum capacity, with no additional instances to scale up to.
We run our LLM inference infrastructure across 42+ production clusters globally, each configured as a Kubernetes deployment. The easiest way to respond to a large workload or a burst of traffic for a given model is to horizontally autoscale the deployment for that model independently. To effectively utilize HPA, nodes need to be able to scale up quickly in response to real-time metrics. We can address the long cold start issue by keeping a set of reserve nodes hot with the image and model weights already loaded onto the node. This would make it very simple to scale up and down each different model deployment.
Unfortunately, this method was still not fast enough for our use case. When scaling pods up and down on these hot nodes, we still had to deal with the overheads of scheduling each pod, loading the model weights from disk into GPU memory for serving, and gracefully releasing resources upon scaling down. By the time a new pod is ready to serve production traffic, the current load may require an entirely different allocation of models! Put simply, our workloads were large enough and dynamic enough that HPA was too slow to respond to our services' needs. Knowing that HPA would continue to be a bottleneck and that the number of models we support was likely to grow, we determined that what matters most for scaling resources up and down in response to real-time traffic is the speed of resource reallocation.
What does Model Hotswapping solve?
Efficiently serves a large set of models on fewer machines
Much like the name suggests, model hotswapping allows us to serve a large number of models on the same device. Unlike HPA, which allocates individual models to GPUs for active inference, each of our swappable inference engines holds the weights of many models from our registry. Since not all of these models can be loaded into GPU memory at once, model weights are cached in CPU memory and on disk.
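To make the tiering concrete, here is a minimal sketch of weights moving between GPU memory, CPU memory, and disk. It is an illustration under stated assumptions, not our production implementation: the `TieredWeightCache` class, its method names, the least-recently-used eviction policy, and the sizes in gigabytes are all hypothetical.

```python
from collections import OrderedDict


class TieredWeightCache:
    """Toy model of weights tiered across GPU memory, CPU memory, and disk.

    Assumption: every registered model always has a copy on disk, and
    eviction is least-recently-used (a policy chosen only for illustration).
    """

    def __init__(self, gpu_capacity_gb, cpu_capacity_gb, model_sizes_gb):
        self.capacity = {"gpu": gpu_capacity_gb, "cpu": cpu_capacity_gb}
        self.sizes = dict(model_sizes_gb)
        # OrderedDict keeps the least recently used model first.
        self.tiers = {"gpu": OrderedDict(), "cpu": OrderedDict()}

    def _used(self, tier):
        return sum(self.sizes[m] for m in self.tiers[tier])

    def location(self, model):
        """Return the fastest tier that currently holds the model."""
        for tier in ("gpu", "cpu"):
            if model in self.tiers[tier]:
                return tier
        return "disk"

    def _demote_to_cpu(self, model):
        size = self.sizes[model]
        if size > self.capacity["cpu"]:
            return  # too large for CPU memory; the disk copy is the fallback
        while self._used("cpu") + size > self.capacity["cpu"]:
            self.tiers["cpu"].popitem(last=False)  # evicted model stays on disk
        self.tiers["cpu"][model] = True

    def activate(self, model):
        """Make the model resident on the GPU; return the tier it came from."""
        source = self.location(model)
        if source == "gpu":
            self.tiers["gpu"].move_to_end(model)  # mark as recently used
            return source
        size = self.sizes[model]
        if size > self.capacity["gpu"]:
            raise ValueError(f"{model} does not fit in GPU memory")
        if source == "cpu":
            del self.tiers["cpu"][model]
        while self._used("gpu") + size > self.capacity["gpu"]:
            victim, _ = self.tiers["gpu"].popitem(last=False)
            self._demote_to_cpu(victim)
        self.tiers["gpu"][model] = True
        return source
```

In this sketch, `activate` returns where the weights were found, which is a rough proxy for swap cost: a model served from the `"cpu"` tier avoids a disk read, while one served from `"disk"` pays for it.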
Previously, scaling to meet load required scheduling and provisioning of a new specialized node. Now scaling only involves adjusting which models are active. When more replicas for a model are needed, the inference engine can quickly swap to one of the many models we offer.
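As a rough illustration of scaling by changing which models are active, the sketch below splits a fixed pool of engines across models and then plans the swaps needed to reach that split. The proportional-to-demand policy, the function names `target_replicas` and `plan_swaps`, and the engine identifiers are assumptions made for this example; the article does not specify the actual allocation policy.

```python
from collections import Counter


def target_replicas(demand, num_engines, min_replicas=1):
    """Split a fixed pool of engines across models in proportion to demand.

    Uses the largest-remainder method after giving every model its minimum.
    """
    models = sorted(demand)
    if num_engines < min_replicas * len(models):
        raise ValueError("not enough engines for the requested minimums")
    targets = {m: min_replicas for m in models}
    spare = num_engines - min_replicas * len(models)
    total = sum(demand.values())
    if total == 0:
        return targets  # no traffic: leave spare engines unassigned
    shares = {m: spare * demand[m] / total for m in models}
    for m in models:
        targets[m] += int(shares[m])
    leftover = num_engines - sum(targets.values())
    by_remainder = sorted(
        models, key=lambda m: shares[m] - int(shares[m]), reverse=True
    )
    for m in by_remainder[:leftover]:
        targets[m] += 1
    return targets


def plan_swaps(assignment, targets):
    """Return (engine, old_model, new_model) swaps that reach the targets."""
    counts = Counter(assignment.values())
    # One entry per missing replica, e.g. ["a", "a"] if model a needs two more.
    deficits = [
        m for m in sorted(targets)
        for _ in range(max(0, targets[m] - counts[m]))
    ]
    swaps = []
    for engine in sorted(assignment):
        if not deficits:
            break
        old = assignment[engine]
        if counts[old] > targets.get(old, 0):  # this model has replicas to spare
            new = deficits.pop(0)
            counts[old] -= 1
            counts[new] += 1
            swaps.append((engine, old, new))
    return swaps
```

Note that no node is provisioned or scheduled anywhere in this plan: the output is only a list of engines and the model each should switch to, which is the property that makes reallocation fast.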
Failover quickly
In production, nodes occasionally become unhealthy because of node failures or node pool upgrades. For each model, we specify a minimum number of replicas that it is allowed to drop down to. If the minimum number of replicas for a model is not met, we quickly identify a candidate model to swap in for failover.
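The failover check can be sketched as follows. This is a simplified illustration, not the production logic: the function name `failover_swaps`, the data shapes, and the rule of taking an engine from the model with the largest surplus above its own minimum are assumptions for the example.

```python
from collections import Counter


def failover_swaps(assignment, healthy, min_replicas):
    """Plan swaps that restore per-model minimums using healthy engines only.

    assignment:   engine id -> model currently loaded on that engine
    healthy:      set of engine ids that are still serving
    min_replicas: model -> minimum number of replicas to keep
    """
    live = {e: m for e, m in assignment.items() if e in healthy}
    counts = Counter(live.values())
    swaps = []
    for model in sorted(min_replicas):
        while counts[model] < min_replicas[model]:
            # Candidate donors: models with replicas above their own minimum.
            donors = [
                m for m in counts if counts[m] - min_replicas.get(m, 0) > 0
            ]
            if not donors:
                return swaps  # nothing left to give; return the partial plan
            donor = max(
                donors, key=lambda m: (counts[m] - min_replicas.get(m, 0), m)
            )
            engine = min(e for e, m in live.items() if m == donor)
            live[engine] = model
            counts[donor] -= 1
            counts[model] += 1
            swaps.append((engine, donor, model))
    return swaps
```

For example, if engine `e1` fails while serving one of two required replicas of model `a`, and model `b` has three replicas with a minimum of one, the plan is a single swap of one `b` engine over to `a`.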
Dynamic and efficient GPU resource utilization
Our LLM inference service offers more than 30 proprietary and open source models to be served in production today, including Snowflake Arctic.

