
Evaluating and Optimizing Search Quality: Seamlessly Tune Your Snowflake Cortex Search Service with an LLM Judge

Recently, we showed that Snowflake’s Cortex Search has unmatched out-of-the-box retrieval quality, but what do you do if your out-of-the-box experience isn’t good enough? You tune your retrieval and ranking algorithms, of course. But how do you know what to tune?  Large tech companies have entire teams dedicated to tuning specific features for numerous query types so that the perfect results are displayed to the user. For example, a navigational, single-word query will probably use more lexical features than a long, complete sentence query, which is more likely to use semantic features. This process of manual tuning is painstaking and costly; most businesses don’t have the resources to devote to this endeavor.

What if there was a way to serve high-quality results to the user without spending days and resources tuning each parameter? What if you could rely on an autonomous pipeline that could do this for you? In this investigation, we find that, with the use of LLMs, you can build high-quality evaluation data sets for iteration without spending precious human hours to improve your search system — all with Snowflake.   

Search quality 101: Beyond the search bar

We all use search engines every day, typing in queries and expecting relevant results. But have you ever stopped to think about how "relevance" is measured and how search engines are constantly improved? This is where the concept of search quality comes in. Essentially, search quality boils down to this: How well does a search system retrieve the most relevant documents from a given collection for a specific query? While striving for a perfect score is a natural goal, the absolute number is less important than the relative improvement between different search systems or modifications to a single system. We're looking for progress, not necessarily perfection.

Measuring search quality: The power of goldens

So how do we quantify "good" search results? The key lies in query relevance labels, often called "goldens." Think of these as the gold standard (pun intended!) for evaluating search performance. Goldens provide a structured way to express how relevant a document is to a specific query. They assign a score that represents this relevance.

These scores can be binary (0 = irrelevant, 1 = relevant), offering a simple yes/no assessment. However, they are more often scaled, providing a more nuanced evaluation. A typical scaled system might look like this:

  • 0 = Irrelevant

  • 1 = Fair

  • 2 = Good

  • 3 = Perfect

Having these goldens is crucial. They allow us to evaluate different retrieval systems or tweaks to existing ones. By comparing the results against the goldens, we can get numerical metrics that reflect the relative retrieval quality of each system. These metrics provide a quantifiable way to track progress and make informed decisions about system tuning. An example of this metric is NDCG, or Normalized Discounted Cumulative Gain. NDCG measures how well search results are ranked by rewarding relevant results that appear higher on the list while giving less credit to those ranked lower. The result's relevance is defined by the golden set the system is being evaluated on. The process is iterative: Goldens go in, and a better search system comes out. Goldens can be formatted like the following:

| Query | Document | Relevancy |
|---|---|---|
| What are apples? | Apples are sweet, crisp fruits that grow on trees and come in various colors like red, green and yellow. | 3 |
| What are apples? | Discover the best hiking trails in Yosemite. | 0 |
| What are apples? | Many people enjoy eating apples as a snack. | 1 |
| What are apples? | The apple tree produces round fruits that can be eaten raw or used in desserts, like pies. | 2 |

Table 1. An example of a golden set and how relevancy labels can be assigned. A 3 represents something perfectly relevant; 2 is good relevance; 1 is partially relevant; and 0 is irrelevant.
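To make the NDCG metric above concrete, here is a minimal Python sketch of NDCG@k. It uses one common formulation of the gain term ((2^relevance − 1) discounted by log2 of the rank), and the example ranking order is made up for illustration using the labels from Table 1; it is not Cortex Search's internal implementation.

```python
# Minimal NDCG@k sketch. Input: graded relevance labels (0-3) for the
# documents a system returned, in the order the system returned them.
import math

def dcg_at_k(relevances, k):
    # Gain (2^rel - 1), discounted by log2(rank + 1) with ranks starting at 1.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)       # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical ranking of the four Table 1 documents:
# "snack" (1), "sweet, crisp fruits" (3), "Yosemite" (0), "apple tree" (2).
print(round(ndcg_at_k([1, 3, 0, 2], k=4), 3))   # below 1.0: the best doc is not ranked first
```

Because the score is normalized by the ideal ordering, it can be compared across queries and averaged into a single number, such as the NDCG@10 figures reported later in this post.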

The million-dollar question: Where do goldens come from?

If goldens are so essential, you might wonder where they come from. Is there some magical Golden Search Goose? Do they simply appear? Unfortunately, no. Creating high-quality goldens is a significant undertaking. Goldens are where businesses encode their core logic into a search system: a subjective representation of what is relevant for their use case. Building them typically involves trained human assessors evaluating the relevance of documents to a wide range of queries, following well-defined, explicit labeling guidelines, such as Google’s (181 pages long!). Maintaining a good set of goldens is an ongoing effort, as document collections and user search behavior evolve. Goldens are not about randomly tuning knobs until your performance on a public benchmark goes up. Instead, they are about creating representative samples of what you care about so that your search system is optimized to behave the way your business needs.

The rise of LLMs: A new era for golden creation?

Given the effort involved in creating goldens, researchers are exploring ways to automate or augment the process. This is where large language models (LLMs) enter the picture. LLMs, with their ability to understand and generate natural language text, hold the potential to revolutionize how we create and use goldens. Could LLMs act as judges, automatically assessing the relevance of documents to queries? This is a hot area of research with exciting possibilities. Imagine a future where LLMs assist in generating goldens, making the process faster, cheaper and more scalable. This could lead to even more rapid improvements in search quality, benefiting users worldwide.

To infinity and beyond: How Snowflake Cortex Search improves your retrieval quality automatically 

In the past few paragraphs, we have been talking at a fairly high level, but what does this mean for our Cortex Search users? Luckily, Cortex Search retrieval performance is easily tunable if you have a useful signal. Cortex Search uses a variety of search features, such as topicality, semantic similarity, popularity and more, to determine whether a result is relevant to a query. These features interact differently with one another depending on the corpus that has been indexed. With our Streamlit app eval tool, customers can evaluate how well their search service performs on its indexed contents. Additionally, customers can autotune their search service: The tool proposes alternative sets of parameters that outperform the defaults.

Figure 1. Snowflake’s Streamlit app eval tool for Cortex Search.

While the eval tool uses customer-provided query-doc pairs to evaluate the search service, the autotune pipeline uses an LLM to suggest hyperparameters for the features used to rank search results. Using an LLM judge gives customers flexibility: Even if no goldens are provided, they can still tune their system. However, to help ensure that the system autotunes correctly, the LLM judge needs to judge at a level comparable to a human assessor. With the following experiments, we show that our LLM judge reliably tunes a search system in the correct direction.

The results

Let’s examine the results of our experiments to prove this claim. First, we show that our LLM-based relevance assessor (commonly referred to as LLM as Judge) produces high-quality labels. Next, we show that, given these high-quality labels, LLM-based judges are reliable enough to drive autotuning.

Our first experiment compares our in-house LLM judge labels with the human labels from NIST’s TREC DL21, TREC DL22 and TREC DL23 data sets. TREC (Text REtrieval Conference) data sets are benchmarks for passage and document ranking tasks, designed to evaluate deep learning-based retrieval models using human-annotated relevance labels. We also compared against labels generated by UMBRELA, an open source toolkit that uses GPT-4o to perform document-relevance assessments, replicating human judgment for information retrieval evaluation. Our in-house judge’s labels were generated using the SNOWFLAKE.CORTEX.COMPLETE function with the llama3.1-405b model. Each label is on a scale of 0 to 3:

  • 0 = the doc is not relevant to the query at all

  • 1 = the doc is slightly relevant to the query

  • 2 = the doc is somewhat relevant to the query

  • 3 = the doc is very relevant to the query
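For illustration only, the snippet below shows one way such a label could be requested from Python through the documented SNOWFLAKE.CORTEX.COMPLETE SQL function. It is a hedged sketch rather than our production pipeline: the connection parameters, the abbreviated prompt and the judge_relevance helper are placeholders, and parsing of the model output is left out.

```python
# Hedged sketch: ask an LLM judge for a 0-3 relevance label via Snowflake's
# SNOWFLAKE.CORTEX.COMPLETE SQL function (here with llama3.1-405b).
import snowflake.connector

# Abbreviated placeholder prompt; see the Appendix for the full prompt we used.
JUDGE_PROMPT = (
    "You are an expert search result rater. Rate the search result on a "
    "scale of 0 to 3 based on its relevance to the user query, and provide "
    "a reasoning for your rating.\n"
    "User Query: {query}\n"
    "Search Result: {passage}\n"
    "OUTPUT:"
)

def judge_relevance(cursor, query, passage):
    prompt = JUDGE_PROMPT.format(query=query, passage=passage)
    cursor.execute(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('llama3.1-405b', %s)",
        (prompt,),
    )
    # The model returns free text such as "Rating: 2\nReasoning: ...".
    return cursor.fetchone()[0]

# Placeholder connection parameters.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>",
)
print(judge_relevance(conn.cursor(),
                      "What are apples?",
                      "Apples are sweet, crisp fruits that grow on trees."))
```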

For this experiment, we measured an off-by-1 metric, which is the percentage of predictions within ±1 of the ground truth. When prompting an LLM to generate a label for a query-doc pair, the model’s reasoning could lead it to rate a pair a 1 when the true label is a 0. Even among human assessors, being off by 1 is expected, given different reasoning about and interpretations of the query-doc pair. In other words, off-by-1 tells us whether a disagreement is merely a matter of degree or a complete misunderstanding of the content.
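As a quick illustration of the metric, here is a minimal sketch; the label lists are made up for the example and are not experiment data.

```python
# Off-by-1 agreement: the fraction of LLM labels within +/-1 of the
# corresponding human (ground-truth) label on the 0-3 scale.
def off_by_one(human_labels, llm_labels):
    assert len(human_labels) == len(llm_labels)
    within_one = sum(abs(h - m) <= 1 for h, m in zip(human_labels, llm_labels))
    return within_one / len(human_labels)

# Example: three of the four pairs agree within one point -> 0.75.
print(off_by_one([3, 0, 1, 2], [3, 2, 0, 2]))
```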

| Data set | Judge | Off-by-1 |
|---|---|---|
| TREC DL21 | In-house | 0.915 |
| TREC DL21 | UMBRELA | 0.883 |
| TREC DL22 | In-house | 0.660 |
| TREC DL22 | UMBRELA | 0.678 |
| TREC DL23 | In-house | 0.888 |
| TREC DL23 | UMBRELA | 0.875 |
| Average | In-house | 0.821 |
| Average | UMBRELA | 0.812 |

Table 2. Performance of LLM as Judge compared to professional human raters from NIST’s TREC.

These numbers show general agreement on query-doc pair relevance; that is, the LLM labels from both judges tend to agree with the ground-truth human labels.

Analyzing the human and synthetic labels for individual query-doc pairs, we saw that the majority either agreed exactly or differed by one point. For TREC DL22 specifically, we saw that the synthetic labels were more accurate.

| Data Set | Query | Document | Human Rating | In-house Synthetic Rating |
|---|---|---|---|---|
| TREC DL21 | What classes do I need to take to go to vet school? | Therefore, begin developing your competitive advantage during high school. Get good grades, complete the prerequisites, and gain experience working with animals. Math and science are important for getting into veterinary school, so join high school and college math and science clubs, as well as taking advanced classes in these areas. | 3 | 3 |
| TREC DL21 | What is a Kanima? | So, what is Kanikama (Surimi)? Kanikama is the imitation or fake crab meat produced from surimi paste that is made by grinding various species of white fish (mostly Alaska Pollock). | 1 | 0 |
| TREC DL22 | How many people watch NBA basketball? | Fans are looking forward to the return of HBO’s Succession for Season 3 after the series’ critically acclaimed sophomore season ended in October 2019. | 3 | 0 |
| TREC DL23 | How to cook thin sliced home fries? | For a healthy alternative to potato chips, you can dehydrate sweet potatoes into crunchy snacks using your oven or a dehydrator. | 2 | 1 |
Table 3. Examples of human and synthetic ratings (generated with our in-house judge) for query-doc pairs. While the model’s absolute ratings can differ from the human ones when compared side by side, the model is within 1 point of the human judgment more than 82% of the time.

LLM as Judge works! How can I use it for Snowflake Cortex Search?

The second experiment explores the directionality of the LLM judge. We want to show that if a Cortex Search service is autotuned on synthetic labels, the search quality of that system will improve. In this experiment, we generate synthetic labels for a query set that already has human labels. First, we evaluate the performance of a Cortex Search service on the human labels to get a baseline metric. Then, we autotune the Cortex Search service on the synthetic labels. Autotuning tells us which search-feature coefficients need to be tuned up or down. Finally, we apply the new coefficients suggested by the autotune job to our Cortex Search service and reevaluate its performance on the human labels. This experiment also used the TREC data sets and generated labels using Snowflake’s COMPLETE API with Llama 3.1 405b as the LLM.
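The before/after measurement could look roughly like the sketch below. It is an assumption-laden illustration, not our internal harness: it assumes the Snowflake Python API’s cortex_search_services resource for querying the service, reuses the ndcg_at_k helper from the earlier sketch, simplifies the ideal ranking to the retrieved documents only, and uses placeholder object names in angle brackets. The autotune job itself is a managed pipeline and is not shown.

```python
# Hedged sketch: score a Cortex Search service against human goldens with NDCG@10.
# `goldens` maps each query string to {doc_id: human relevance label (0-3)}.
# Assumes ndcg_at_k from the earlier sketch is in scope; names in angle
# brackets are placeholders.
from snowflake.core import Root

def mean_ndcg_at_10(session, goldens):
    svc = (Root(session)
           .databases["<db>"].schemas["<schema>"]
           .cortex_search_services["<service>"])
    scores = []
    for query, labels in goldens.items():
        hits = svc.search(query=query, columns=["doc_id"], limit=10).results
        ranked = [labels.get(hit["doc_id"], 0) for hit in hits]  # unjudged docs count as 0
        scores.append(ndcg_at_k(ranked, k=10))
    return sum(scores) / len(scores)

# Measure once with the default parameters, apply the coefficients suggested
# by the autotune job, then measure again and compare the two averages.
```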

Figure 2. NDCG@10 of a search service with two different parameter coefficients.

As shown in Figure 2, we see a 3% gain in NDCG@10. In other words, a service autotuned on synthetic labels also scores better when measured against the human labels for the same query set. This means that our synthetically generated labels move search quality in the same, positive direction as human labels.

Cortex Search autotuning: All the gains, none of the pain

In this post we have gone deep into why you should try out autotuning in Cortex Search and why it works, but what does this mean for you? Simple: The search quality of your Cortex Search service can be improved with just the click of a button. Not only does your search performance improve, it does so without increasing the cost of your deployment. There is no need to fine-tune a model, verify that the fine-tuning parameters are right, wait for your search service to re-embed your entire corpus or worry about how to serve it at scale. Your deployment’s relevance gains come without any new model, and this approach will continue to work even if you change your underlying embedding model. Regardless of their size and available resources, our customers can iterate and build the best search service for their use case. This functionality is available in our Cortex Search eval tool, allowing customers to evaluate and tune their search services.


Appendix

Below is the in-house prompt we used for evaluation.

You are an expert search result rater. You are given a user query and a search result. Your task is to rate the search result based on its relevance to the user query. You should rate the search result on a scale of 0 to 3, where:

    0: The search result has no relevance to the user query.

    1: The search result has low relevance to the user query. In this case the search result may contain some information which seems very slightly related to the user query but not enough information to answer the user query. The search result contains some references or very limited information about some entities present in the user query. In case the query is a statement on a topic, the search result should be tangentially related to it.

    2: The search result has medium relevance to the user query. If the user query is a question, the search result may contain some information that is relevant to the user query but not enough information to answer the user query. If the user query is a search phrase/sentence, either the search result is centered around about most but not all entities present in the user query, or if all the entities are present in the result, the search result while not being centered around it has medium level of relevance. In case the query is a statement on a topic, the search result should be related to the topic.

    3: The search result has high relevance to the user query. If the user query is a question, the search result contains information that can answer the user query. Otherwise if the search query is a search phrase/sentence, it provides relevant information about all entities that are present in the user query and the search result is centered around the entities mentioned in the query. In case the query is a statement on a topic, the search result should be either be directly addressing it or be on the same topic.

    

    You should think step by step about the user query and the search result and rate the search result. You should also provide a reasoning for your rating.

    

    Use the following format:

    Rating: Example Rating

    Reasoning: Example Reasoning

    

    ### Examples

    Example:

    Example 1:

    INPUT:

    User Query: What is the definition of an accordion?

    Search Result: Accordion definition, Also called piano accordion. a portable wind instrument having a large bellows for forcing air through small metal reeds, a keyboard for the right hand, and buttons for sounding single bass notes or chords for the left hand. a similar instrument having single-note buttons instead of a keyboard.

    OUTPUT:

    Rating: 3

    Reasoning: In this case the search query is a question. The search result directly answers the user question for the definition of an accordion, hence it has high relevance to the user query.

    

    Example 2:

    INPUT:

    User Query: dark horse

    Search Result: Darkhorse is a person who everyone expects to be last in a race. Think of it this way. The person who looks like he can never get laid defies the odds and gets any girl he can by being sly,shy and cunning. Although he's not a player, he can really charm the ladies.

    OUTPUT:

    Rating: 3

    Reasoning: In this case the search query is a search phrase mentioning 'dark horse'. The search result contains information about the term 'dark horse' and provides a definition for it and is centered around it. Hence it has high relevance to the user query.

    

    Example 3:

    INPUT:

    User Query: Global warming and polar bears

    Search Result: Polar bear The polar bear is a carnivorous bear whose native range lies largely within the Arctic Circle, encompassing the Arctic Ocean, its surrounding seas and surrounding land masses. It is a large bear, approximately the same size as the omnivorous Kodiak bear (Ursus arctos middendorffi).

    OUTPUT:

    Rating: 2

    Reasoning: In this case the search query is a search phrase mentioning two entities 'Global warming' and 'polar bears'. The search result contains is centered around the polar bear which is one of the two entities in the search query. Therefore it addresses most of the entities present and hence has medium relevance.

    

    Example 4:

    INPUT:

    User Query: Snowflake synapse private link

    Search Result: "This site can\'t be reached" error when connecting to Snowflake via Private Connectivity\nThis KB article addresses an issue that prevents connections to Snowflake failing with: "This site can\'t be reached" ISSUE: Attempting to reach Snowflake via Private Connectivity fails with the "This site can\'t be reached" error

    OUTPUT:

    Rating: 1

    Reasoning: In this case the search result is a search query mentioning 'Snowflake synapse private link'. However the search result doesn't contain information about it. However it shows an error message for a generic private link which is tangentially related to the query, since snowflake synapse private link is a type of private link. Hence it has low relevance to the user query.

    

    Example 5:

    INPUT:

    User Query: The Punisher is American.

    Search Result: The Rev(Samuel Smith) is a fictional character, a supervillain appearing in American comic books published by Marvel Comics. Created by Mike Baron and Klaus Janson, the character made his first appearance in The Punisher Vol. 2, #4 (November 1987). He is an enemy of the Punisher.

    OUTPUT:

    Rating: 1

    Reasoning: In this case the search query is a statement concerning the Punisher. However the search result is about a character called Rev, who is an enemy of the Punisher. The search result is tangentially related to the user query but does not address topic about Punisher being an American. Hence it has low relevance to the user query.

 

    Example 6:

    INPUT:

    User Query: query_history

    Search Result: The function task_history() is not enough for the purposes when the required result set is more than 10k.If we perform UNION between information_schema and account_usage , then we will get more than 10k records along with recent records as from information_schema.query_history to snowflake.account_usage.query_history is 45 mins behind.

    OUTPUT:

    Rating: 1

    Reasoning: In this case the search query mentioning one entity 'query_history'. The search result is neither centered around it and neither has medium relevance, it only contains an unimportant reference to it. Hence it has low relevance to the user query.

    

    Example 7:

    INPUT:

    User Query: Who directed pulp fiction?

    Search Result: Life on Earth first appeared as early as 4.28 billion years ago, soon after ocean formation 4.41 billion years ago, and not long after the formation of the Earth 4.54 billion years ago.

    OUTPUT:

    Rating: 0

    Reasoning: In the case the search query is a question. However the search result does is completely unrelated to it. Hence the search result is completely irrelevant to the movie pulp fiction.

    ###

    

    Now given the user query and search result below, rate the search result based on its relevance to the user query and provide a reasoning for your rating.

    INPUT:

    User Query: {query}

    Search Result: {passage}

    OUTPUT:\n
