SEP 05, 2024/9 min readGen AI

Model Hotswapping: Optimizing AI Infrastructure and Enhancing LLM Efficiency

At Snowflake, we support customer AI workloads by offering a diverse range of open source and proprietary large language models (LLMs). As we expand our first-party product offerings in Snowflake Cortex AI, including Cortex Analyst (in public preview), Snowflake Copilot (generally available in select regions) and Cortex Search (public preview), we have found that spinning up multiple dedicated instances for each new individual model can be costly and inefficient. In this article, we will explore model hotswapping, a technique for dynamically swapping LLMs on a static set of inference engines.

Previous model-scaling challenges

Although it might seem that inference is just about writing efficient kernel code, scaling our services to support thousands of customers required significant time and effort. We had to determine how to allocate our machines and how to support new models without increased infrastructure costs. Without the ability to swap models on the fly, we had dedicated pinned instances for each model. To determine the number of replicas for each model to deploy, we had to observe previous traffic patterns and reallocate model instances based on usage.

This resulted in a significant amount of manual work in determining how to allocate our machines to which models, along with releasing the set of model configurations into production, which took a non-trivial amount of time. Moreover, when we wanted to deploy new models to new regions, we commonly had to ask the question: Are there additional GPUs available for provisioning?

Why not only use traditional horizontal pod autoscaling (HPA)?

With the limited availability of high-end GPU resources from the different cloud providers supported by Snowflake, we reserve a fixed number of instances for availability. This practice contrasts with workloads that do not utilize GPU resources, where machines may be abundant and can horizontally scale up to an arbitrary amount of nodes. Because the number of GPU instances is fixed, the cost is also constant. Therefore, today we deploy our instances at their maximum capacity without additional instances to scale up to.

We run our LLM inference infrastructure across 42+ production clusters globally, each configured as a Kubernetes deployment. The easiest way to respond to a large workload or a burst of traffic for a given model is to horizontally autoscale the deployment for that model independently. To effectively utilize HPA, nodes need to be able to scale up quickly in response to real-time metrics. We can address the long cold start issue by keeping a set of reserve nodes hot with the image and model weights already loaded onto the node. This would make it very simple to scale up and down each different model deployment.

Unfortunately, this method was still not fast enough and inadequate for our use case. When scaling up and down pods on these hot nodes, we had to deal with different overheads of scheduling the pod, such as loading the model weights into GPU memory from disk for serving, and gracefully releasing resources upon scaling down. By the time the new pod is ready to serve production traffic, the current load may require an entirely different allocation of models! Very simply stated, our workloads were large enough and dynamic enough that HPA was too slow to respond to our services' needs. Knowing that HPA would continue to be a bottleneck and the number of models we support was likely to grow, we determined that in order to scale up/down resources in response to real-time traffic, what matters most is the speed of resource reallocation.

What does Model Hotswapping solve?

Efficiently serves a large set of models on fewer machines

Much like the name suggests, model hotswapping allows us to serve a large number of models on the same device. Unlike HPA, which allocates individual models to GPUs for active inference, our swappable inference engines contain many model weights from a large set of models in our registry. Since not all models can be loaded into GPU memory, model weights are cached in CPU memory and disk space.

Previously, scaling to meet load required scheduling and provisioning of a new specialized node. Now scaling only involves adjusting which models are active. When more replicas for a model are needed, the inference engine can quickly swap to one of the many models we offer.

Failover quickly

In production, there are sporadically instances when a node becomes unhealthy due to node failures and node pool upgrades. We specify a minimum number of replicas for each model that the model can drop down to. If the minimum number of replicas for a model is not met, we quickly identify a candidate model to swap in for failover.

Dynamic and efficient GPU resource utilization