Improving Ecommerce Search: Intelligent Ranking Made Simple with Snowflake Cortex Search

At first glance, a search query like "affordable ergonomic office chair" looks straightforward enough, right? But this is a deceptively challenging query for a search system. To address user needs, the search system must simultaneously understand price ranges, ergonomic features and product categories, as well as the relative importance of each: What's most important? The price? The product? The ergonomics?
Such queries contain several sub-intentions and are commonly called “multi-aspect” queries. In the world of catalog/ecommerce search, a “catalog” — more generally, a “corpus” — can be any organized collection of records with multiple aspects (e.g., product listings with size and price attributes; property listings with location and price; customer support tickets with interaction details; or business documents with interconnected fields like department and compliance tags). The multi-aspect queries that naturally arise in these contexts require the search system to juggle multiple — sometimes competing — criteria at once. Historically, meeting these needs required painstaking, manual tuning of numeric weights across different metadata fields, or training on behavioral signals (such as click logs).
With Cortex Search, we've rethought the multi-aspect problem from the ground up, taking an AI-forward approach that delivers better search quality on enterprise search use cases by unifying these signals through advanced embeddings, query boosts, aspect-aware ranking and automated tuning.
In this blog, we will first benchmark the quality of Cortex Search against common enterprise search providers, both out of the box and with best-effort tuning. Second, we'll walk through the novel capabilities we've built into and around Cortex Search that make this performance possible.
Raising the out-of-the-box performance bar
We evaluated on four industry-standard ecommerce search benchmarks: TREC Product Search 2023 and 2024, the English subset of the Amazon ESCI data set (product search across millions of Amazon products) and the WANDS data set (Wayfair's product search relevance data set).
These data sets are full of multi-aspect queries. For example:
- 1.4 cubic feet small refrigerator without freezer
- levis 513 slim straight stretch jeans
First, we compare out-of-the-box Cortex Search performance (native hybrid search) with an Azure AI Search stack (hybrid search using full-text keyword, vector retrieval and semantic reranking). We evaluate the Azure hybrid search stack with a variety of embedding models (Cohere Embed v3, OpenAI text-embedding-3-small and OpenAI text-embedding-3-large), with and without semantic reranking. Our Cortex Search setup can be found here.

On average, Cortex Search outperforms Azure AI Search with all embedding models, as measured by NDCG@10.¹ More impressively, Cortex Search does so at a fraction of the latency, thanks to the efficient Snowflake Arctic Embed model, which is smaller than the other models.
Only Azure's most powerful combination — OpenAI's large embedding model with Azure AI Search’s semantic reranker — edges out the Cortex Search baseline on two of the four selected data sets. However, this comes at substantially higher computational and operational cost: On the ESCI data set (1 million documents, ~500 tokens on average), an Azure AI Search index built with the OpenAI embedding models text-embedding-3-small and text-embedding-3-large takes ~100x longer to create than with Cortex Search.²
Outperforming state-of-the-art: Technical innovations in multi-aspect search

Without tuning, Cortex Search performs well on product search use cases with challenging multi-aspect data. However, to make our comparisons as fair as possible, we then compared the best “tuned” versions of the Azure AI Search and Cortex Search stacks. For Azure, we grid search over the weights of the different vector columns³; for Cortex Search, we use query understanding and autotuning, as described below.
In our experiments, we found that the tuned Cortex Search stack showed consistent improvements across the board, outperforming all of the tuned Azure AI Search stacks even with reranking enabled. As measured by NDCG@10 on the same data sets, Cortex Search outperforms Azure with Cohere embeddings by 5.73%, OpenAI small embeddings by 6.27% and OpenAI large embeddings by 3.64%.
Most notably, on challenging queries where none of the untuned systems retrieves any relevant documents in its top 10 results (i.e., "hard queries" in Figure 2), Cortex Search outperforms Azure's best configuration by more than 11.3%, as measured by NDCG@10. This suggests that the innovations in multi-aspect query handling matter more on challenging product search queries than simply using larger, more expensive embedding models.
To achieve state-of-the-art performance on product catalog search benchmarks, we leveraged several novel techniques to “tune” Cortex Search, each of which will be detailed in following sections:
- Smart query boost: Dynamically emphasizes relevant fields in queries without requiring manual field weighting or expensive per-document vector computations.
- Automatic tuning with LLM-judged labels: Uses LLMs to evaluate search results and automatically optimize the ranking configuration.
- Intelligent ranking signals: Combines multiple factors, such as popularity and ratings, to determine result relevance.
- Aspect-aware boost: Considers document structure (e.g., headers, subheadings) when determining match relevance.
Smart query boost
Search relevance often hinges on understanding which aspects of a document matter most for a given query. Other search systems, such as Elasticsearch, AWS OpenSearch or Azure AI Search, tackle this through manual field weighting — requiring administrators to explicitly configure weights (e.g., title: 1.0; brand: 0.8; description: 0.5). This process is not only time-consuming but also imprecise, as optimal weights vary significantly across query types.
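For concreteness, this is roughly what static field weighting looks like in a Lucene-based engine such as Elasticsearch or AWS OpenSearch, where the `^` suffix attaches a fixed boost to each field. The field names and weights here simply mirror the example above:

```python
# Illustrative Elasticsearch-style query body with static per-field boosts.
# The "^" suffix applies the same fixed weight on every query,
# regardless of what the user is actually asking for.
weighted_query = {
    "query": {
        "multi_match": {
            "query": "affordable ergonomic office chair",
            "fields": ["title^1.0", "brand^0.8", "description^0.5"],
        }
    }
}
```

Because these weights are baked into a single server-side configuration, a query where brand matters more than title still gets the same 1.0/0.8/0.5 split.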
Cortex Search provides a much more intuitive and simpler way to emphasize query terms. Instead of guessing weights for different query terms, users can specify which terms to boost per query, the logic of which can be dictated by business rules or mined via query logs (sample logic).
For example, when searching for 'Daiwa Liberty Club Short Swing' with smart query boost, we can correctly assign the product type (e.g., fishing equipment) using client-provided enhanced context and get more relevant results:
my_service.search(
    query="Daiwa Liberty Club Short Swing",
    experimental={
        "softBoosts": [
            {"phrase": "Brand: Daiwa, Rods & Accessories, Fishing, Hunting & Fishing"}
        ]
    }
)

Notice that after applying query boost with the correct aspects, previously irrelevant results are replaced with fishing-related products. Here, simply assigning more weight to the 'brand' or 'product category' fields on the document side would not work well, because what is missing is the correct contextualization of the query — understanding that 'club short swing' in this query does not refer to golf equipment. Smart query boost is thus a great mechanism for incorporating additional information from supporting subsystems (e.g., category prediction) or business logic.
Automatic tuning with LLM-judged labels
Building an effective search system traditionally requires labeled data sets and extensive manual tuning. Cortex Search brings a novel approach: automated optimization that works with minimal user input. The entire optimization process takes just three steps:
1. Upload your query set.
2. The LLM-powered relevance system evaluates search results, mimicking how a real customer would judge product relevance. Instead of relying on expensive, time-consuming manual evaluations, this automated system provides consistent ratings at scale.
3. Bayesian optimization automatically tunes your ranking config based on the LLM-labeled relevance. More precisely, we define the objective function for any particular ranking config as the search system's performance metric under that configuration, then optimize over this objective.
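The loop above can be sketched in a few lines of Python. This is a conceptual illustration only: the function names are hypothetical, the LLM judge and search function are stubbed out as callables, and plain random search stands in for the Bayesian optimizer:

```python
import math
import random

def ndcg_at_10(ranked_relevance):
    """NDCG@10 for one query, given graded relevance labels in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:10]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:10]))
    return dcg / idcg if idcg > 0 else 0.0

def objective(weights, queries, llm_judge, search_fn):
    """Mean NDCG@10 of the search system under one ranking config,
    scored with LLM-assigned relevance labels."""
    scores = []
    for q in queries:
        results = search_fn(q, weights)          # retrieve under this config
        labels = [llm_judge(q, doc) for doc in results]  # LLM relevance judgments
        scores.append(ndcg_at_10(labels))
    return sum(scores) / len(scores)

def tune(queries, llm_judge, search_fn, trials=50, seed=0):
    """Stand-in for Bayesian optimization: random search over field weights."""
    rng = random.Random(seed)
    best_w, best_s = None, -1.0
    for _ in range(trials):
        w = {f: rng.uniform(0.0, 1.0) for f in ("title", "brand", "description")}
        s = objective(w, queries, llm_judge, search_fn)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s
```

A real Bayesian optimizer would model the objective surface and pick each next config intelligently rather than at random, but the structure of the loop — propose a config, score it with LLM-judged NDCG@10, keep the best — is the same.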
The Streamlit app used for Cortex Search tuning is open sourced here (more details in a future blog). In our experiments, this approach consistently improved performance across different search domains:
| | Amazon ESCI | TREC23 Product Search | TREC24 Product Search |
|---|---|---|---|
| Baseline (NDCG@10) | 0.3059 | 0.7578 | 0.7033 |
| Autotuned (NDCG@10) | 0.3372 (+10.23%) | 0.7605 (+0.36%) | 0.7316 (+4.02%) |

By combining advanced machine learning techniques with an intelligent evaluation mechanism, Cortex Search democratizes search optimization. Users can now achieve state-of-the-art search performance with unprecedented ease, regardless of their information-retrieval expertise.
Intelligent ranking signals: Reimagining relevance
Beyond text matching: Leveraging product engagement signals
While text matching is important, enterprise product search needs to also consider what items customers actually engage with and purchase. Cortex Search enhances search quality by intelligently incorporating user engagement signals into ranking.
For example, the query 'keyboard' is ambiguous: it could refer to computer peripherals or musical instruments. With help from user engagement signals, however, we can improve the ranking:
| Rank | Baseline Product | Baseline Product Ratings & Stars | Experiment Product | Experiment Product Ratings & Stars |
|---|---|---|---|---|
| 1 | Das Keyboard 4C TKL Mechanical Keyboard | 73 ratings, 4.4 stars | RockJam 54 Key Keyboard | 35,746 ratings, 4.5 stars |
| 2 | Mustar 61 Key Electric Piano | 1,266 ratings, 4.5 stars | Dell Wired Keyboard | 27,989 ratings, 4.6 stars |
| 3 | Das Keyboard X50Q RGB Mechanical Keyboard | 215 ratings, 3.9 stars | Mustar 61 Key Electric Piano | 1,266 ratings, 4.5 stars |
Notice how products with stronger engagement signals (i.e., a high number of ratings and stars) rank higher in the experiment than in the baseline. This reflects an important reality: While all of these products may be relevant matches, products with more customer validation tend to be safer recommendations. The ranking incorporates:
- Popularity dynamics: The total number of ratings and interactions identifies products that customers frequently choose and trust.
- Quality indicators: Star ratings show that popular products also maintain high customer satisfaction.
By combining text relevance with these engagement signals, Cortex Search helps surface products that are not just relevant matches but are also proven customer favorites.
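One common way to blend these signals — shown here as an illustrative sketch, not the actual Cortex Search formula — is to smooth star ratings toward a prior (so a handful of 5-star reviews can't outrank tens of thousands of 4.5-star reviews) and damp raw popularity with a logarithm:

```python
import math

def engagement_score(num_ratings, avg_stars, prior_stars=3.5, prior_weight=20):
    """Bayesian-smoothed star rating times log-damped popularity.
    Items with few ratings shrink toward the prior, so 4.4 stars from
    73 reviews doesn't beat 4.5 stars from 35,746 reviews."""
    smoothed = (avg_stars * num_ratings + prior_stars * prior_weight) / (num_ratings + prior_weight)
    popularity = math.log1p(num_ratings)  # diminishing returns on sheer volume
    return smoothed * popularity

def rank_score(text_score, num_ratings, avg_stars, alpha=0.8):
    """Blend text relevance with engagement; alpha controls the trade-off."""
    return alpha * text_score + (1 - alpha) * engagement_score(num_ratings, avg_stars)
```

With the keyboard example from the table, the RockJam (35,746 ratings, 4.5 stars) gets a much higher engagement score than the Das Keyboard 4C (73 ratings, 4.4 stars), which is exactly the reordering shown in the experiment column.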
Aspect-aware boost

Aspect-aware boost in Cortex Search considers how information is presented within a product listing — applying configured boost factors to different product fields — which enables:
- Recognition of contextual importance in query term matches
- Intelligent weighting based on textual emphasis and structure
- Dynamic ranking adjustments that reflect information hierarchy
For ecommerce and product searches, this structural awareness is crucial. When searching for 'ergonomic office chair,' the system understands the difference between:
- Primary product information, where 'ergonomic chair' appears in key descriptive elements
- Incidental mentions of 'ergonomic chair' in supplementary sections like "related products" or "customer comments"
We indeed observe that aspect-aware field weight tuning yields more relevant top results across ecommerce data sets (Figure 4). By understanding these structural nuances, Cortex Search can better interpret the true relevance of a product to the user's query, going beyond simple text matching to comprehend how information is emphasized and organized within product listings.
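A minimal sketch of the idea — with hypothetical field names and made-up weights standing in for the tuned values a real system would learn — might look like:

```python
# Illustrative aspect weights; in practice these would come from autotuning.
ASPECT_WEIGHTS = {
    "title": 1.0,             # primary descriptive field
    "description": 0.6,
    "related_products": 0.1,  # incidental mentions count far less
    "customer_comments": 0.1,
}

def aspect_aware_score(query_terms, doc_fields):
    """Sum per-field match counts, weighted by where the match occurs."""
    score = 0.0
    for field, text in doc_fields.items():
        tokens = text.lower().split()
        hits = sum(tokens.count(t) for t in query_terms)
        score += ASPECT_WEIGHTS.get(field, 0.3) * hits
    return score
```

Under this scheme, a product whose title contains 'ergonomic chair' outscores one that only mentions the phrase in a "related products" section, even though both contain the same matching terms.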
Conclusion: Unifying ranking intelligence through Cortex Search
By unifying multiple innovative approaches — smart query boosting, LLM-guided tuning, aspect-aware boost and engagement-based ranking signals — Cortex Search achieves superior search quality across challenging enterprise scenarios. Our experiments show that this combination of techniques delivers particularly strong results on complex multi-aspect queries, outperforming other search systems by 5%-10% on standard benchmarks — with even larger gains of 17.75% on challenging queries, where other search systems struggle to return relevant results.
Taking a modern approach, Cortex Search provides best-in-class search quality with minimal operational overhead. Now, Cortex Search users don’t have to spend excessive time focused on manual field weight tuning or get bogged down with tedious infrastructure and operations management.
Instead, Cortex Search liberates engineering teams to focus on what truly matters: building high-accuracy search experiences that effectively serve end users. With Cortex Search, your team can concentrate on understanding user needs and translating that insight into product improvements.
¹ Normalized Discounted Cumulative Gain at rank 10 (NDCG@10) is a ranking-quality metric: it compares a system's ranking to an ideal ordering in which all relevant items appear at the top of the list.
² Based on Azure’s throughput guideline, which caps OpenAI embedding models at 350K tokens per minute, embedding 1,118,658 documents of 512 tokens each takes 1 day and 2.5 hours. Indexing takes an additional 4 hours for text-embedding-3-large; we did not try using a pod in AKS (Azure Kubernetes Service), which could potentially speed up indexing (though embedding throughput would be unaffected). In comparison, using a 2XL warehouse in Snowflake, Cortex Search takes 15 minutes to create a search service on the same data set.
³ We add three additional vector columns for the important fields: title, description and category. We then grid search over four candidate weights for each. Due to the combinatorial nature of weights across columns, this yields 4^3 = 64 search experiments per data set.
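Sketched in Python, the sweep described in this footnote looks like the following (the weight values and the `evaluate` callback are illustrative):

```python
from itertools import product

FIELDS = ("title", "description", "category")
WEIGHT_GRID = (0.25, 0.5, 0.75, 1.0)  # four candidate weights per field

def grid_search(evaluate):
    """Exhaustive sweep over per-field weights: 4^3 = 64 configurations.
    `evaluate` scores one config, e.g., by running the benchmark
    and returning NDCG@10."""
    best_cfg, best_score = None, -1.0
    for ws in product(WEIGHT_GRID, repeat=len(FIELDS)):
        cfg = dict(zip(FIELDS, ws))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

The cost grows exponentially with the number of fields and grid points, which is part of why the autotuning approach described above uses Bayesian optimization instead of an exhaustive sweep.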