In a previous blog post, as part of our mission to be an open research team through our Snowflake Arctic Cookbook Series, we shared our data recipes. In this blog post we are going one step further: discussing how we came up with the right proportions of data sets.
The Snowflake Arctic large language model (LLM) was trained just this spring, but in the quickly evolving world of AI and LLMs, that’s a lifetime ago. As we revisit the training mixture used at the time, let’s not only examine its composition but also reflect on the decisions we made. With the benefit of hindsight and the progress that has been made in LLMs over the past months, we can assess whether our choices were validated or if there are lessons to be learned and alternative approaches that could have been taken.
Here is how the story begins
What is the starting point? You have a number of possible data sources to choose from, including Wikipedia, STEM papers and web-based corpora such as Common Crawl or C4. To begin training, you need to select an initial data distribution, which will be used for the entire pretraining process or, in our case, at least for the first phase. A uniform distribution over tokens, in which each data source contributes a number of tokens proportional to its size, might seem like a natural approach. However, this would be misguided, as the largest source would eclipse smaller (in size but not necessarily in value) sources. Instead, you need to either upsample or downsample selected sources to achieve a more balanced distribution.
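To make the up- and downsampling concrete, here is a minimal sketch of how per-source multipliers turn raw source sizes into a sampling distribution; the token counts and multipliers are purely illustrative, not the values we used.

```python
# Minimal sketch: turning raw source sizes into sampling weights.
# Token counts and multipliers below are hypothetical, for illustration only.

source_tokens = {           # raw size of each source, in billions of tokens
    "common_crawl": 2000.0,
    "c4": 175.0,
    "wikipedia": 6.0,
    "stem_papers": 50.0,
}

multipliers = {             # >1 upsamples a source, <1 downsamples it
    "common_crawl": 0.5,
    "c4": 1.0,
    "wikipedia": 4.0,
    "stem_papers": 2.0,
}

def mixture_weights(tokens, mults):
    """Return the fraction of training tokens drawn from each source."""
    effective = {s: tokens[s] * mults.get(s, 1.0) for s in tokens}
    total = sum(effective.values())
    return {s: effective[s] / total for s in effective}

if __name__ == "__main__":
    for source, weight in mixture_weights(source_tokens, multipliers).items():
        print(f"{source:>14}: {weight:.1%}")
```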
A nonuniform distribution is required then. However, a review of recent literature reveals a lack of transparency, with Meta’s Llama series and the open source LLM360 initiative being notable exceptions. Most other technical reports for newly released LLMs vaguely mention using “data from various web sources” without providing further details. For the Llama model series, the technical report for the first model, LLaMA, provides a clear description of the pretraining data composition. Subsequent Llama models were built upon this foundation, and we assessed the original LLaMA data mixture to select our starting point.
The LLaMA data composition was a nonuniform data mixture, but not the definitive one; in other words, there is no simple formula or principle that gives you the right nonuniform mixture for any collection of data sets. Since we needed to add more data sets to our mixture, we spent time searching for additional approaches and principles to guide our choice of distribution.
DoReMi
We began with the Domain Reweighting with Minimax Optimization (DoReMi) method, which trains a much smaller proxy language model by varying the source weights on the fly, in such a way that no domain is harmed (as measured by perplexity) even if the corresponding data source is downsampled. The promise is that data sources will be used in a more economical manner, yet we will still observe improvements in at least some domains.
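At its core, DoReMi repeatedly nudges the domain weights toward the domains where the proxy model lags a reference model the most, using an exponentiated-gradient update. The sketch below shows only that update step, with made-up losses and hyperparameters; it is not the full DoReMi training loop.

```python
import math

# Sketch of a DoReMi-style domain-weight update (exponentiated gradient).
# Losses and hyperparameters are illustrative, not taken from our runs.

def update_domain_weights(weights, proxy_loss, reference_loss, step_size=1.0, smoothing=1e-3):
    """One update: upweight domains with the largest excess loss, then renormalize."""
    domains = list(weights)
    # Excess loss of the proxy model over the reference model, clipped at zero.
    excess = {d: max(proxy_loss[d] - reference_loss[d], 0.0) for d in domains}
    unnormalized = {d: weights[d] * math.exp(step_size * excess[d]) for d in domains}
    total = sum(unnormalized.values())
    normalized = {d: unnormalized[d] / total for d in domains}
    # Mix with the uniform distribution so no domain's weight collapses to zero.
    uniform = 1.0 / len(domains)
    return {d: (1 - smoothing) * normalized[d] + smoothing * uniform for d in domains}

weights = {"web": 0.25, "wikipedia": 0.25, "stem": 0.25, "code": 0.25}
proxy = {"web": 2.9, "wikipedia": 2.1, "stem": 2.6, "code": 1.8}
reference = {"web": 2.8, "wikipedia": 1.9, "stem": 2.5, "code": 1.8}
print(update_domain_weights(weights, proxy, reference))
```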
We gave DoReMi a try, running numerous ablations on small-scale models (e.g., 350M-parameter language models trained on 300 billion tokens, with various distributions of domain/source weights as indicated by an even smaller LLM trained using the DoReMi recipe). The ablations were evaluated on a suite of benchmarks spanning commonsense reasoning, knowledge and math, including MMLU, PIQA, OpenBookQA, BoolQ and GSM8K. Interestingly, DoReMi converged to a data composition similar to the LLaMA nonuniform distribution, even when started from a simple uniform distribution, with notable exceptions: It upsampled Wikipedia at the expense of seemingly high-quality expert data sources, such as STEM papers and StackExchange. However, using that distribution did not improve the benchmark results. Our intuition was that STEM papers were too challenging for a model of this scale; imagine schoolkids deciding on a mixture of textbooks to learn from: books on advanced calculus would be discarded because they wouldn’t understand them anyway.
Streamline it with Mixture of Experts and Streamlit
With all the inconclusive results obtained with DoReMi, we decided to focus on streamlining the whole ablation process so that we could test as many data mixtures as possible, using insights obtained from DoReMi and its variations, the few breadcrumbs left in papers and technical reports and, finally, our intuitions. Here, our Mixture of Experts (MoE) architecture came in handy, as it is much faster to train than a dense model. This was true not only for Snowflake Arctic itself but also for its smaller siblings used during ablations, where turnaround time was crucial.
The goal of the experiments was to optimize downstream metrics, including not only the benchmarks mentioned above but also coding metrics, such as MBPP and HumanEval (and their “plus” versions). By the way, all the benchmark scores were logged to our database at Snowflake (we’re big fans of dogfooding!) and could be easily visualized using a custom Streamlit in Snowflake app. In general, we used a number of small Streamlit applications, for example, for comparing outputs or doing manual labeling.
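As an illustration of what such a dashboard can look like, here is a minimal Streamlit sketch; the CSV file and its columns (run_name, tokens_seen, benchmark, score) are hypothetical stand-ins, since in our setup the scores lived in Snowflake rather than in a local file.

```python
# Minimal sketch of a benchmark dashboard in Streamlit.
# Assumes scores were exported to a CSV with hypothetical columns:
# run_name, tokens_seen, benchmark, score.
import pandas as pd
import streamlit as st

st.title("Ablation benchmark scores")

scores = pd.read_csv("benchmark_scores.csv")

benchmark = st.selectbox("Benchmark", sorted(scores["benchmark"].unique()))
selected = scores[scores["benchmark"] == benchmark]

# One line per ablation run: score vs. tokens seen during training.
chart_data = selected.pivot_table(
    index="tokens_seen", columns="run_name", values="score"
)
st.line_chart(chart_data)
st.dataframe(selected)
```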
From whence the web data?
In the pretraining phase of most LLMs, web-based text corpora are the primary contributor. These corpora are almost always based on Common Crawl, an open repository of web crawl data. However, Common Crawl requires significant processing, filtering and deduplication, due to the presence of low-quality and potentially harmful content. The C4 (“Colossal Clean Crawled Corpus”) data set, gleaned from Common Crawl in 2019, has stood the test of time as far as quality is concerned, but its size (~200B tokens) is now insufficient for modern LLMs. Furthermore, the data set’s age means it lacks information on significant global events from the past several years. Similarly, OpenWebText, a high-quality data set that replicates the training set behind GPT-2, is too small and too old on its own. We used both C4 and OpenWebText, but we also felt that a larger source of high-quality web data was definitely needed.
At the outset of the Arctic project, the available up-to-date Common Crawl-based web data consisted of an annotated set of 84 different Common Crawl web dumps from Together.AI, and RefinedWeb, a massive web corpus constructed by filtering and deduplicating Common Crawl. Notably, these data sets exhibited distinct distributions of domains and web pages (even within the same domain), likely due to RefinedWeb’s sampling approach, which drew from multiple Common Crawls. However, by applying filtering, we were able to improve the quality of the 84 dumps, bringing them more in line with RefinedWeb. Ultimately, we chose to use both data sets, leveraging their complementary strengths. In Phase 1, general Common Crawl and RefinedWeb comprised 20% and 40.5% of the training set, respectively, together accounting for 87% of the web-based data used during this phase (the rest being C4 and OpenWebText).
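Our exact filters are beyond the scope of this post, but the sketch below conveys the flavor of the heuristic, rule-based document filters commonly applied to Common Crawl-style text; the rules and thresholds are generic examples, not our production pipeline.

```python
# Illustrative heuristic document filter for Common Crawl-style text.
# The rules and thresholds below are generic examples, not our production filters.

def keep_document(text: str) -> bool:
    words = text.split()
    if not (50 <= len(words) <= 100_000):        # too short or absurdly long
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):           # gibberish or boilerplate-heavy pages
        return False
    alpha_words = sum(1 for w in words if any(c.isalpha() for c in w))
    if alpha_words / len(words) < 0.8:           # mostly symbols, numbers or markup
        return False
    lines = [l for l in text.splitlines() if l.strip()]
    if lines and sum(l.strip().endswith("...") for l in lines) / len(lines) > 0.3:
        return False                             # many truncated lines (menus, teasers)
    return True

docs = ["A normal paragraph of English prose " * 20, "$$$ ### |||" * 100]
print([keep_document(d) for d in docs])  # [True, False]
```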
Let’s pause to consider the developments that followed the release of Snowflake Arctic in April 2024. In the period since, several new, high-quality web-based data sets have emerged for pretraining LLMs: FineWeb, Dolma, DCLM and more. Our current experiments indicate that DCLM is the best source of high-quality tokens. Another interesting resource is FineWeb-Edu, a 1.3 trillion-token subset of educational content, filtered from the FineWeb data set. Interestingly, we’ve observed a recent decline in the quality of Common Crawl, which may worsen in the future as the web becomes increasingly inundated with AI-generated “slop.”
Regardless of the web-based data set used, it’s essential to further refine it through postprocessing, filtering and deduplication. Deduplication, in particular, is crucial, as we discovered during our initial experiments. Interestingly, we found that aggressively deduplicating the data mixture in a uniform setup led to results comparable to the original nonuniform composition.
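To give a feel for what deduplication involves, here is a self-contained toy sketch of near-duplicate detection using word shingles and Jaccard similarity. Production pipelines typically rely on MinHash and LSH to make this scale, and the similarity threshold here is arbitrary.

```python
# Toy near-duplicate detection via word shingles and Jaccard similarity.
# Real pipelines use MinHash + LSH for scale; this sketch only shows the idea.

def shingles(text: str, n: int = 5) -> set:
    """Set of n-word shingles for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any previously kept one."""
    kept, kept_shingles = [], []
    for doc in docs:
        sh = shingles(doc)
        if all(jaccard(sh, prev) < threshold for prev in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept

corpus = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "an entirely different document about large language model training data",
]
print(len(deduplicate(corpus)))  # the second document is dropped as a near duplicate
```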
More than general web content
In addition to general web-based content, we drew from 11 other sources of plain text. A tip for those just starting out with pretraining LLMs: Regularly check Hugging Face for new data sets — not just at the beginning of your project, but every few weeks. We did this throughout our training process and stumbled upon great resources, like Cosmopedia, which we incorporated into later phases.
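One low-effort way to keep an eye on new releases is the huggingface_hub client. The snippet below is a sketch under the assumption that the listing parameters behave as documented at the time of writing; the search keyword is just an example.

```python
# Sketch: periodically list recently updated datasets on the Hugging Face Hub.
# Parameter names follow the huggingface_hub API as we understand it;
# adjust the sort key if your client version expects a different value.
from huggingface_hub import HfApi

api = HfApi()
recent = api.list_datasets(
    search="pretraining",      # keyword filter; pick whatever you are hunting for
    sort="lastModified",       # newest first
    direction=-1,
    limit=20,
)
for ds in recent:
    print(ds.id)
```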
Our 11 data sets spanned a range of domains, including STEM fields such as mathematics (ProofPile and OpenWebMath), law (FreeLaw), finance (EDGAR-CORPUS) and patents (Pile-USPTO), among others.
For code, we began with a modest 4.5% allocation in Phase 1, increasing the proportion in subsequent phases. Our primary source for code data was StarCoder.
Phases 2 and 3
LLMs have a temporal bias: The data they’re trained on early on has a different impact than the data they’re exposed to later. This is similar to human learning, where the timing of new information affects how we process and retain it. Just as our ability to learn and understand new concepts depends on our individual life trajectories, LLMs are also influenced by the sequence of their training data.
This insight led us to modify the data mixture during training. Our approach was guided by the intuition that it’s best to start with general web data to establish a broad foundation of knowledge, and then add more specialized data, such as code and STEM-related content (see Table 1). This isn’t just about developing narrow coding skills (e.g., Python, SQL), but also about enhancing general reasoning abilities. Our results validated this approach, as we saw steady progress on key metrics (MMLU, GSM8K and coding average) across the three pretraining phases.
As mentioned earlier, for Phase 1 we assessed the LLaMA recipe to select our starting point. For Phases 2 and 3, choosing the fractions was mostly an intuition call, based on where the metrics ended up at the end of the previous phase. Our intuition was backed up by extensive experiments and ablations, always aiming to strike a balance between general knowledge metrics, such as MMLU, and good coding skills.
All in all, for Phases 2 and 3 our goal was to develop a powerful enterprise model. To achieve this, we increased the proportion of STEM and code-related data, with a particular focus on SQL. (In fact, because we wanted the model to be stronger on SQL, we trained for more epochs on SQL code examples.) Again, to make sure we had high-quality sources, we ran an ablation for each individual data set, combining it with a fixed amount of web tokens. We then evaluated the downstream metrics to determine whether each data set was suitable for use. Typically, we trained on each data set for 2-4 epochs. One lesson learned from our experiments with coding data sets was that the run-to-run variance was high, so we considered results within a few percentage points to be equivalent. As always, it was essential to manually review the changes to verify that they were meaningful and sensible.
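To make the "equivalent within a few points" rule concrete, here is a toy sketch of how an ablation run can be compared against a baseline while allowing for run-to-run noise; the metric values and the tolerance are illustrative.

```python
# Toy comparison of ablation metrics against a baseline, allowing for
# run-to-run variance. The 2-point tolerance and scores are illustrative.

def verdict(candidate: dict, baseline: dict, tolerance: float = 2.0) -> dict:
    """Label each metric as 'better', 'worse' or 'equivalent' (within tolerance points)."""
    out = {}
    for metric, base_score in baseline.items():
        delta = candidate[metric] - base_score
        if abs(delta) <= tolerance:
            out[metric] = "equivalent"
        else:
            out[metric] = "better" if delta > 0 else "worse"
    return out

baseline  = {"MMLU": 42.0, "GSM8K": 17.5, "HumanEval": 21.0}
candidate = {"MMLU": 43.1, "GSM8K": 21.0, "HumanEval": 20.2}
print(verdict(candidate, baseline))
```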
How did we pick 25% code and STEM data in Phases 2 and 3? As we trained the model, we identified key metrics that were crucial for the enterprise use case, such as coding, SQL and math capabilities. Rather than trying to predetermine the optimal data mixture up front, we took a more adaptive approach. At the end of each phase, we evaluated the metrics and identified areas where we wanted to see further improvement. We then adjusted the data sources accordingly, increasing the proportion of data in those categories. The multiphase approach gave us the flexibility to make these adjustments and refine our strategy as needed.
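Schematically, the end-of-phase adjustment can be thought of as the following sketch: boost the categories whose metrics you want to push and renormalize the rest. The weights and the boost factor are hypothetical, not our actual Phase 2 or Phase 3 numbers.

```python
# Schematic end-of-phase re-weighting: boost the categories you want to improve,
# then renormalize. Weights and the boost factor are hypothetical.

def adjust_mixture(weights: dict, boost_categories: set, boost: float = 1.5) -> dict:
    """Multiply selected categories' weights by `boost` and renormalize to sum to 1."""
    raw = {c: w * (boost if c in boost_categories else 1.0) for c, w in weights.items()}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}

phase1 = {"web": 0.70, "stem": 0.10, "code": 0.045, "other": 0.155}
# After reviewing Phase 1 metrics, push code and STEM harder in Phase 2.
phase2 = adjust_mixture(phase1, boost_categories={"code", "stem"}, boost=2.5)
print({k: round(v, 3) for k, v in phase2.items()})
```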
Summary
The Snowflake Arctic LLM was trained using a nonuniform data mixture, which we developed through a combination of analysis of the existing literature, experimentation and intuition. Keeping the same proportions of “ingredients” throughout training would not have been optimal; we needed to adjust the data mixture across the pretraining phases to ultimately obtain a powerful enterprise model. One important takeaway is that it is indispensable to regularly explore data set repositories, such as Hugging Face, every few weeks to discover new and exciting data sets to incorporate into your LLM.