Snowflake recently announced a collaboration with NVIDIA to make it easy to run NVIDIA accelerated computing workloads directly within Snowflake accounts. One interesting use case is to train, customize, and deploy large language models (LLMs) safely and securely within Snowflake. Our new Snowpark Container Services, currently in private preview, together with NVIDIA AI, makes this possible.
Many of our customers would like to run LLMs in Snowflake to use them in conjunction with their Snowflake data, without the need to call external APIs. In addition to simply consuming LLMs, our customers are also interested in fine-tuning pretrained LLMs, including models available with the NVIDIA NeMo framework and Meta’s Llama models, with their own corporate and Snowflake data. In this blog post we will discuss the possibilities this opens for our customers, a high-level view of how it works, and some of the initial use cases we are exploring with early adopters.
Snowpark Container Services: Running NVIDIA GPU compute inside Snowflake
Our new Snowpark Container Services feature enables you to run Docker containers inside Snowflake, including ones that are accelerated with NVIDIA GPUs. NVIDIA provides pre-built and free Docker containers for a variety of AI workloads, including those that use the NeMo framework for LLMs.
By spinning up one of the NVIDIA AI Enterprise container images, included with a 90-day evaluation license, within Snowpark Container Services, you can run it securely, directly inside your Snowflake account and against your Snowflake data. Because the container runs inside Snowflake, organizations can provide users with user-defined functions (UDFs) that can be called from SQL to run advanced processing inside the container without operational burden.
The flexibility you get with Snowpark Container Services is nearly limitless: you can even run open source development tools such as Jupyter notebooks, which provide a convenient way to experiment with and perform LLM fine-tuning.
Pretraining and fine-tuning LLMs within Snowflake
Out-of-the-box LLMs are known as “pretrained models,” meaning someone else has assembled a large amount of training data and trained the model in advance. Usually these models, such as those available with the NVIDIA NeMo framework and Meta’s Llama family, are generalists—they are fluent in English and other languages and have varying degrees of knowledge about various topics. So they may be familiar to some extent with your industry and its lingo, processes, and needs, but likely not in a very deep way. And they generally know little to nothing about your specific company, your policies, your products, or your customers.
They are also usually optimized to perform certain kinds of tasks, like question-answering, summarization, essay-writing, and general chat. They are “jacks of all trades”: broad enough to be generally useful, but often not knowledgeable enough to perform the enterprise use cases you may envision with a high degree of accuracy and skill.
This is where pretraining and fine-tuning of LLMs come into play. Pretraining generally involves training an LLM from scratch on a large set of training data, which you as the trainer can curate, to produce a model that knows about the things important to you. While some pretrained models can be “further trained” to learn more about a certain domain, often this means fine-tuning.
The NVIDIA NeMo framework, included with the NVIDIA AI Enterprise software platform, enables this fine-tuning on Snowflake. It should be noted that fine-tuning existing models often produces sufficient results, given that pretraining can be very data- and resource-intensive. The NeMo framework supports a range of fine-tuning techniques, from full fine-tuning to parameter-efficient fine-tuning (PEFT). The full fine-tuning techniques supported include supervised fine-tuning and reinforcement learning with human feedback (RLHF). The PEFT techniques supported include prompt learning (which encompasses prompt-tuning and p-tuning), adapters, low-rank adaptation (LoRA), and IA3.
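To build intuition for why PEFT is so much cheaper than full fine-tuning, here is a minimal NumPy sketch of the low-rank adaptation (LoRA) idea: the pretrained weight matrix stays frozen, and only two small low-rank factors are trained. The dimensions and initialization here are purely illustrative, not NeMo’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight matrix (d_out x d_in); tiny sizes for illustration.
d_out, d_in, rank = 8, 8, 2
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank factors. B starts at zero so the adapted layer
# initially behaves exactly like the pretrained one.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def adapted_forward(x):
    # Base projection plus the low-rank update (B @ A) @ x.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)

# Before any tuning, the low-rank update contributes nothing.
assert np.allclose(adapted_forward(x), W @ x)

# Trainable parameter count: rank * (d_in + d_out) for the adapter,
# versus d_in * d_out for full fine-tuning of this layer.
adapter_params = rank * (d_in + d_out)
full_params = d_in * d_out
```

At realistic transformer dimensions (for example, d_in = d_out = 4096 with rank 8), the adapter trains well under 1% of the layer’s parameters, which is what makes PEFT runs fit in hours on a modest number of GPUs.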
Secure fine-tuning with your corporate data
Fine-tuning a model is useful for teaching it to perform a new task, or for adjusting the way it performs an existing task, but it is not as useful for teaching a model new base knowledge. If you need to provide the LLM with new base knowledge and don’t want to pretrain a new LLM, you can also explore retrieval-augmented generation approaches, such as the ones shown in this post about our Frosty example chatbot.
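The core of retrieval-augmented generation is simple: at question time, find the most relevant passages from your own data and prepend them to the prompt, so the LLM answers from retrieved context rather than from its frozen base knowledge. The sketch below illustrates that loop with a toy bag-of-words similarity; a real deployment would use an embedding model and a vector index over documents stored in Snowflake.

```python
import math
from collections import Counter

# Toy corpus standing in for passages stored in Snowflake tables.
docs = [
    "Refunds are processed within five business days.",
    "Our headquarters are located in Bozeman, Montana.",
    "Premium support is available 24/7 for enterprise plans.",
]

def embed(text):
    # Bag-of-words term counts as a stand-in "embedding", for illustration only.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, k=1):
    # Rank documents by similarity to the question and keep the top k.
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

question = "How long do refunds take?"
context = retrieve(question)[0]
# The retrieved passage is injected into the prompt sent to the LLM.
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
```

Because the model only ever sees the retrieved snippets at inference time, the underlying knowledge base can be updated continuously without retraining anything.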
But fine-tuning is a great way to change the way an LLM crafts its responses or to adapt it to new tasks. For example, if you were making a customer service chatbot and wanted it to act and write a certain way, you could fine-tune it to do so. This involves taking data from your Snowflake account, creating a training set of examples of how you want it to act and write, and then invoking the fine-tuning process that then creates a new model.
Then you can test the new model against the base model to see if it performs the way you want it to. Fine-tuning of models is usually relatively fast (a few hours using NVIDIA H100, A100, or A10 Tensor Core GPUs, depending on the base model size and the number of tuning examples), so you can experiment to see what works best. Compare this to pretraining a model from scratch, which can take days or weeks and requires fleets of GPUs working together.
Sourcing fine-tuning training data from Snowflake
What does take time for fine-tuning is the creation of the tuning training set: the examples of how you want the tuned LLM to respond in certain situations. There are example tuning sets, but if you want a bespoke model, you’ll need a bespoke tuning set. For some use cases, you may be able to source this type of tuning data from your Snowflake account. For example, if you have historical customer service chats, transcripts, or email exchanges with your customers stored in Snowflake, you could point the fine-tuner at the historical examples that were most highly rated or that led to the best outcomes.
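To make the pipeline concrete, here is a minimal sketch of turning historical support chats into a supervised fine-tuning set: keep only highly rated exchanges and emit prompt/completion pairs as JSONL, a layout commonly used for fine-tuning. The rows and column names are hypothetical stand-ins for what you might query out of a Snowflake table; they are not a real schema.

```python
import json

# Hypothetical rows, as they might be queried from a Snowflake table of
# historical support chats. Column names are illustrative only.
rows = [
    {"customer_msg": "My order arrived damaged.",
     "agent_reply": "I'm sorry to hear that! I've issued a replacement.",
     "rating": 5},
    {"customer_msg": "Where is my invoice?",
     "agent_reply": "Dunno.",
     "rating": 1},
    {"customer_msg": "Can I change my shipping address?",
     "agent_reply": "Absolutely, I've updated the address on your order.",
     "rating": 5},
]

# Keep only the exchanges customers rated highly; these become the examples
# of how the tuned model should respond.
examples = [
    {"input": r["customer_msg"], "output": r["agent_reply"]}
    for r in rows
    if r["rating"] >= 4
]

# Write one JSON object per line (JSONL), ready to hand to a fine-tuning job.
with open("tuning_set.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

In practice the filtering step would run as a query inside Snowflake, so the raw transcripts never leave your account; only the curated training file is handed to the fine-tuning container.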
Since the fine-tuning happens directly on GPU nodes running on Snowpark containers within your Snowflake account, your confidential training data never leaves your account. It stays in the Snowflake tables or staged files where it always lives. Also, the resulting model, which is now based on learnings from your confidential information, itself stays inside Snowflake in a secure Snowflake stage, and can be served up from Snowpark containers.
By keeping the end-to-end pretraining, fine-tuning, storage of the resulting models, and model serving all within Snowflake, you can leverage the latest open source LLM technologies from NVIDIA and others without any worries about your data leaving your organization.
See this Medium post for a technical walkthrough of how p-tuning works using NeMo on Snowpark Container Services.
A variety of possible use cases
Many of our customers have started experimenting with LLMs, and some are now moving on to training and fine-tuning LLMs. Some of the more interesting use cases include:
- Customer service triage, response generation, and eventually full-chat experiences, as described above
- Advertising creative generation personalized for each customer, based on everything you know about each customer in Snowflake
- SQL-drafting and question-answering data analysis chatbots based on your Snowflake data and schemas
- Salesperson assistance to help them understand their customers and prospects and create more effective communications and sales campaigns
- Training and question-answering for new (and existing!) hires based on corporate documents, policies, history, and Snowflake data
- Ingesting and understanding financial data and reports for downstream financial question-answering
- Serving a drug discovery model from Snowpark to generate various molecule and protein combinations
- Curating a multimodal LLM for your many applications
We’re very excited to be partnering with NVIDIA to deliver these—and many other—applications of LLMs to our customers across the Data Cloud. While it is early days for a lot of this, we’re excited to start working with our customers to turn these use cases into a reality.