Eval-Guided Optimization of LLM Judges for the RAG Triad

In 2023, as part of the TruLens open source project developed by the TruEra team, we introduced the RAG Triad.1, 2, 3 This triad comprises a set of three metrics — context relevance, groundedness and answer relevance — that measure how well each step of a retrieval-augmented generation (RAG) system is performing.
Each of these metrics is automatically computed using an LLM-as-a-Judge (a carefully prompted LLM), thus providing a scalable evaluation method for the common case in enterprises where ground truth data sets are often limited in scope. This kind of LLM-as-a-Judge can also be thought of as an agent that reviews and reasons about the quality of the retrieval and generation steps of a RAG.
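To make the idea concrete, a minimal LLM-as-a-Judge can be a single chat-completion call that asks a model to score a statement against its source on a fixed scale. The sketch below uses the OpenAI Python client; the prompt wording is illustrative and is not the TruLens prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def groundedness_judge(source: str, statement: str, model: str = "gpt-4o") -> int:
    """Ask an LLM to rate how well `statement` is supported by `source`, on a 0-3 scale."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You rate groundedness. Respond with a single integer from 0 "
                    "(unsupported) to 3 (fully supported by the source)."
                ),
            },
            {"role": "user", "content": f"SOURCE:\n{source}\n\nSTATEMENT:\n{statement}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```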
The RAG Triad introduced a modular approach to specifying requirements for each step of a RAG and either verifying that these requirements are met or enabling debuggability by localizing errors. As this approach has gained widespread adoption, with RAGs increasingly moving into production in enterprises, we have consistently heard two related questions from our users:
How can we evaluate an LLM-as-a-Judge and build confidence in its trustworthiness?
How can we optimize an LLM-as-a-Judge to further improve it?
In a companion blog post, we address the first question. Specifically, we share the results of benchmarking the three LLM Judges of the RAG Triad on standard ground truth data sets — TREC-DL for context relevance, LLM-AggreFact for groundedness and HotpotQA for answer relevance — reporting precision, recall, F1 scores and Cohen’s Kappa.4 Our benchmarking results indicate that our LLM Judges are comparable to or exceed the existing state of the art for groundedness, and are comparable to or exceed the corresponding MLflow judges for the other two metrics in the RAG Triad.
In this blog post, we focus on the second question: How can we optimize an LLM-as-a-Judge to further improve it? We address this question by developing a new method for Eval-Guided Optimization that leverages the results of the benchmarking to guide an end-to-end agentic prompt optimizer with an appropriate choice of a loss function and a data slice. We implemented this method using TextGrad as the prompt optimizer and observed significant improvements.
For the LLM Judge for groundedness, precision increased by roughly 16% with a 2.5% drop in recall, leading to an F1 score increase of 8% on the LLMAggreFact data set. This placed it above the SOTA fine-tuned, proprietary Bespoke-MiniCheck-7B model on precision, recall and F1 score, as well as above the related LLM Judge from MLflow with respect to precision and F1 score.
For the LLM Judge for context relevance, precision increased by 4.26% with a 3.7% drop in recall, leading to an F1 score increase of 2.4% on the TREC-DL data set. This placed it above the LLM Judge with the UMBRELA prompt and the corresponding MLflow Judge on the F1 and recall metrics.
For the LLM Judge for answer relevance, recall increased by 5% with a 0.76% drop in precision, leading to an F1 score increase of 3.5% on the HotpotQA data set. This makes it comparable to the MLflow Judge for the related metric.
We have released the updated prompts in TruLens (see here). We encourage you to try them out as you build and evaluate RAGs. Here’s a notebook to get you started!
Leveraging evals to guide the TextGrad prompt optimizer with an appropriate choice of a loss function and a data slice was essential to see these improvements. Without this guidance, TextGrad failed to improve the LLM Judges.
The rest of the blog post is organized as follows. We begin with a quick overview of the RAG Triad, with more details in a companion blog post. Then we describe our Eval-Guided Optimization method, point out how it addresses observed challenges with TextGrad, and illustrate the method using an effective optimization of an LLM Judge for groundedness. Finally, we summarize our experimental results, including comparison with the LLM Judges from related projects.
The RAG Triad
In a simple RAG, there are three primary artifacts we can use to evaluate quality: query, retrieved context and generated output. Common failure modes of RAGs, including poor retrieval quality, hallucination and irrelevant answers, can all be traced back to the interactions between those three artifacts. We proposed the RAG Triad of metrics — context relevance, groundedness and answer relevance — as a system of reference-free evaluations to identify and root-cause these common failure modes of RAG systems.

The LLM Judges for the RAG Triad are described in detail in a previous blog post. The prompt for each of these LLM Judges is composed of a few key parts: the system prompt, judging criteria, few-shot examples, output scale and a user prompt containing the text to be evaluated. TruLens provides an easy way to configure the judging criteria, few-shot examples and output scale. Each LLM Judge is also backed by an underlying LLM; we used GPT-4o as the default in all our experiments. In this blog post, we focus on the method for optimizing the prompts for these judges, which we describe next.
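Before turning to optimization, here is a rough sketch of how those prompt parts fit together; the strings are placeholders rather than the actual TruLens prompts.

```python
from dataclasses import dataclass, field


@dataclass
class JudgePromptConfig:
    """Illustrative container for the parts of an LLM Judge prompt."""

    system_prompt: str = "You are a GROUNDEDNESS judge."
    criteria: str = "Statements directly supported by the source should get a high score."
    few_shot_examples: list[str] = field(default_factory=list)
    output_scale: str = "Respond only as a number from 0 (lowest) to 3 (highest)."

    def render(self, source: str, statement: str) -> tuple[str, str]:
        """Return the (system, user) messages for judging one statement."""
        system = "\n".join(
            [self.system_prompt, self.criteria, self.output_scale, *self.few_shot_examples]
        )
        user = f"SOURCE:\n{source}\n\nSTATEMENT:\n{statement}"
        return system, user
```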
Eval-Guided Optimization
End-to-end prompt optimization frameworks, such as DSPy and TextGrad, became a natural starting point for our work. However, we observed that out-of-the-box optimizers from DSPy and TextGrad failed to improve the LLM Judges. For the DSPy optimizers, it appeared that augmenting prompts with few-shot demonstrations did not generalize well to new inputs. With TextGrad, a challenge we observed was that textual gradients over a set of training data weren’t producing consistent feedback to improve the prompts; a second challenge was to determine what loss function to use in the optimization.
These observations motivated us to develop our Eval-Guided Optimization method, which consists of the following steps (a code sketch of the end-to-end loop follows the list):
Evaluate an LLM Judge on a benchmark data set.
Identify a data slice — a subset of the full data set — on which a metric of interest is performing poorly.
Automatically construct the loss function for the prompt optimizer from a textual description of the objective.
Run the prompt optimizer on the identified data slice from Step 2 with the loss function from Step 3 to produce an optimized prompt for the LLM Judge.
Re-evaluate the LLM Judge on the entire benchmark data set and report the results of the optimization.
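A high-level sketch of this loop, with the evaluation, loss construction and prompt optimization left as injectable callables (all names are hypothetical), might look like the following.

```python
from typing import Callable, Sequence


def eval_guided_optimization(
    judge_prompt: str,
    dataset: Sequence[dict],  # each item: {"slice": ..., "input": ..., "label": ...}
    evaluate: Callable[[str, Sequence[dict]], dict],  # Steps 1 and 5: per-slice metrics
    build_loss: Callable[[str], Callable],            # Step 3: textual objective -> loss fn
    optimize_prompt: Callable[[str, Sequence[dict], Callable], str],  # Step 4: e.g. a TextGrad run
    metric: str = "precision",
) -> tuple[str, dict]:
    # Step 1: evaluate the judge on the full benchmark.
    per_slice = evaluate(judge_prompt, dataset)

    # Step 2: pick the slice where the target metric is weakest.
    worst_slice = min(per_slice, key=lambda name: per_slice[name][metric])
    slice_data = [ex for ex in dataset if ex["slice"] == worst_slice]

    # Step 3: construct the loss from a textual description of the objective.
    loss_fn = build_loss(f"Penalize errors that hurt {metric} on this slice.")

    # Step 4: run the prompt optimizer on the identified slice.
    new_prompt = optimize_prompt(judge_prompt, slice_data, loss_fn)

    # Step 5: re-evaluate on the entire benchmark and report.
    return new_prompt, evaluate(new_prompt, dataset)
```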

In our experiments, we use TextGrad as the prompt optimizer. Note that Step 2 addresses the first challenge with TextGrad by focusing it on a data slice with a low-performing metric that offers an opportunity for consistent improvement. Further, Step 3 addresses the second challenge by automatically constructing the loss function from a textual description of it.
We will now illustrate our Eval-Guided Optimization method by walking through its application to an LLM Judge for groundedness.
Step 1: Evaluate an LLM Judge on a benchmark data set
We use LLMAggreFact as the benchmark data set to evaluate the TruLens LLM Judge for groundedness and compare it against the SOTA Bespoke-MiniCheck-7B model.
| Evaluator | Precision | Recall | F1 score |
|---|---|---|---|
| Bespoke-MiniCheck-7B | 0.7610 | 0.8038 | 0.7771 |
| TruLens groundedness (un-optimized) | 0.6238 | 0.8779 | 0.7232 |
We notice that the precision of the TruLens LLM Judge for groundedness on the benchmark falls behind the SOTA model. Precision is an important metric for the LLM Judge to excel at since we want to avoid situations where the LLM Judge says that a sentence is well-grounded when in fact it is not. Thus, precision became our primary optimization target.
Step 2: Identify a data slice on which a metric of interest is performing poorly
Since the LLMAggreFact data set consisted of 11 smaller data sets, it was easy for us to observe the precision of the LLM Judge on these 11 data slices. We selected the RAGTruth subset as our slice to perform prompt optimization on, since the LLM Judge exhibited low precision on it (0.57 vs. overall precision of 0.62) and it had a relatively large number of samples.
As noted earlier, we have found that slice selection is critical to the success and generalization ability of the optimization steps. Our hypothesis is that by identifying a slice with a consistent error trend (e.g., low precision) we increase the chance that the textual gradients are similar and thus provide consistent feedback to improve the prompts. Indeed, using the entire benchmark data set for optimization (or a random sample thereof) — as one would normally use with TextGrad — did not work well in our experiments.
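For example, with the judge's binary predictions and the ground truth labels in a pandas DataFrame (column names are illustrative), the per-slice precision can be computed and the weakest sufficiently large slice selected as follows.

```python
import pandas as pd
from sklearn.metrics import precision_score


def pick_low_precision_slice(df: pd.DataFrame, min_samples: int = 200) -> str:
    """Return the sub-data set with the lowest judge precision among reasonably large slices.

    `df` has one row per example with columns: dataset, label, prediction.
    """
    per_slice = df.groupby("dataset").apply(
        lambda g: pd.Series(
            {"precision": precision_score(g["label"], g["prediction"]), "n": len(g)}
        )
    )
    candidates = per_slice[per_slice["n"] >= min_samples]
    return candidates["precision"].idxmin()  # e.g. "RAGTruth" in our experiments
```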
Step 3: Automatically construct the loss function for the prompt optimizer from a textual description of it
In prompt optimization with TextGrad, analogous to backpropagation in deep learning, the textual gradient of the loss L with respect to the prompt x is obtained by prompting an LLM:
∂L/∂x = ∇_LLM(x, y, ∂L/∂y) ≜ LLM("Here is a conversation with an LLM judge: {x | y}. Below are the criticisms on {y}: {∂L/∂y}. Explain how to improve {x}."),
where x is the judge prompt being optimized, y is the LLM Judge's prediction, and the loss L (the TextLoss) measures the accuracy of that prediction against the ground truth label.
Both the overall precision and the slice precision are low, which translates to a high false-positive rate: the LLM Judge is too lenient on failing examples. We therefore draft a separate prompt, containing the relevant TextGrad API references, that asks GPT-4o to design a function returning a numerical loss in [0, 1] that penalizes false positives. The generated weighted groundedness loss function serves as the TextLoss for the optimizer; it places greater weight on false positives, steering TextGrad toward precision improvements.
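We do not reproduce the generated loss here; the sketch below illustrates the kind of weighted, false-positive-penalizing numerical loss in [0, 1] that we asked GPT-4o to produce, before it is wrapped into the optimizer's TextLoss.

```python
def weighted_groundedness_loss(
    prediction: int,  # judge verdict: 1 = grounded, 0 = not grounded
    label: int,       # ground truth label
    fp_weight: float = 1.0,  # full penalty for calling an ungrounded statement grounded
    fn_weight: float = 0.4,  # smaller penalty for the opposite error
) -> float:
    """Return a loss in [0, 1] that penalizes false positives more heavily than false negatives."""
    if prediction == label:
        return 0.0
    return fp_weight if (prediction == 1 and label == 0) else fn_weight
```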
Step 4: Run the prompt optimizer on the identified data slice from Step 2 with the loss function from Step 3 to produce an optimized prompt for the LLM Judge
The diff below shows the prompt edits added by the auto prompt optimizer using the previously defined loss function after 15 iterations. Note the addition of the sentence — “Be cautious of false positives; ensure that high scores are only given when there is clear supporting evidence” — which was added through the Eval-Guided Optimization process that sought to reduce false positives. Interestingly, the optimization process also added in a sentence to correct false negatives: “Consider indirect or implicit evidence, or the context of the statement, to avoid penalizing potentially factual claims due to lack of explicit support.”
You are an INFORMATION OVERLAP classifier; providing the overlap of information (entailment or groundedness) between the source and statement.
Respond only as a number from 0 to 3, where 0 is the lowest score according to the criteria and 3 is the highest possible score.
You should score the groundedness of the statement based on the following criteria:
- Statements that are directly supported by the source should be considered grounded and should get a high score.
+
+ - Consider indirect or implicit evidence, or the context of the statement, to avoid penalizing potentially factual claims due to lack of explicit support.
+ - Be cautious of false positives; ensure that high scores are only given when there is clear supporting evidence.
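For reference, here is a minimal sketch of how such a run can be wired up with TextGrad, assuming its Variable / BlackboxLLM / TGD / TextLoss interfaces from the project's quickstart. Exact signatures may differ across versions, the placeholder loss instruction stands in for the generated weighted loss, and names such as INITIAL_GROUNDEDNESS_PROMPT and ragtruth_slice are illustrative.

```python
import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)

# The judge's system prompt is the variable being optimized.
judge_prompt = tg.Variable(
    INITIAL_GROUNDEDNESS_PROMPT,
    requires_grad=True,
    role_description="system prompt of the groundedness LLM judge",
)
judge = tg.BlackboxLLM("gpt-4o", system_prompt=judge_prompt)
optimizer = tg.TGD(parameters=[judge_prompt])

# Stand-in textual loss; in our experiments the GPT-4o-generated weighted loss
# (which also sees the ground truth labels) played this role.
loss_fn = tg.TextLoss(
    "Critique the judge's verdict. Be especially harsh if an ungrounded "
    "statement appears to have been scored as grounded (a false positive)."
)

for example in ragtruth_slice:  # the low-precision slice from Step 2
    optimizer.zero_grad()
    verdict = judge(
        tg.Variable(example, requires_grad=False, role_description="source and statement to judge")
    )
    loss = loss_fn(verdict)
    loss.backward()
    optimizer.step()  # proposes edits to judge_prompt
```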
Step 5: Re-evaluate the LLM Judge on the entire benchmark data set and report the results of the optimization
On the LLM-AggreFact holdout set of 11,000 examples, we observed a significant improvement in overall precision (+16%) and F1 score (+8%), with a smaller decrease in the overall recall (-3%). The optimized LLM Judge beats the SOTA Bespoke-MiniCheck-7B model on all three metrics, as well as the related LLM Judge from MLflow on precision and F1 score.
| Evaluator | Precision | Recall | F1 score |
|---|---|---|---|
| Bespoke-MiniCheck-7B | 0.7610 | 0.8038 | 0.7771 |
| MLflow faithfulness | 0.6693 | 0.8902 | 0.7545 |
| TruLens groundedness (un-optimized) | 0.6238 | 0.8779 | 0.7232 |
| TruLens groundedness (optimized) | 0.7830 | 0.8515 | 0.8082 |
Confusion matrix changes. Figure 3 shows the confusion matrices of the TruLens LLM Judge for groundedness before and after the optimization. They show the significant reduction in false positives (bottom left cell changes from 2504 to 1088) accompanied by a smaller increase in false negatives (top right cell changes from 561 to 624). This leads to higher precision after optimization, a smaller drop in recall and an overall increase in the F1 score on the LLM-AggreFact data set.
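These numbers can be reproduced from binary judge predictions and labels with scikit-learn; the arrays in the usage comment are hypothetical placeholders.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support


def report(y_true, y_pred) -> dict:
    """Confusion-matrix cells plus precision / recall / F1 for a binary judge."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {"tn": tn, "fp": fp, "fn": fn, "tp": tp,
            "precision": precision, "recall": recall, "f1": f1}


# before = report(holdout_labels, unoptimized_judge_predictions)
# after = report(holdout_labels, optimized_judge_predictions)
```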

Data splits. We employ a data splitting strategy to generate a 40/20/20 or 30/30/40 training/validation/testing split for all experiments. We find that for prompt optimization, the train split doesn't usually need to be more than a few hundred high-quality examples. Newly proposed prompts and prompt edits are accepted only if we see improvements on the test split, and we re-evaluate on the entire data set to report final results. Thanks to the large-scale annotation available in LLM-AggreFact, we use the original dev split as our holdout set for evaluating pre- vs. post-optimization, and we sample from the original test split (29,000 examples) to generate the train/val/test splits used for prompt optimization.
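A two-stage split along these lines can be produced with scikit-learn; the default proportions below follow the 30/30/40 option mentioned above, and variable names are illustrative.

```python
from sklearn.model_selection import train_test_split


def make_splits(examples, labels, train_frac=0.3, val_frac=0.3, test_frac=0.4, seed=42):
    """Split the optimization pool into train/val/test; the benchmark dev set stays a holdout."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        examples, labels, train_size=train_frac, stratify=labels, random_state=seed
    )
    rel_val = val_frac / (val_frac + test_frac)  # fraction of the remainder used for validation
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, train_size=rel_val, stratify=y_rest, random_state=seed
    )
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```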
Results for context and answer relevance LLM Judges
In this section, we summarize our experimental results for the two other RAG Triad LLM Judges and include a comparison with the LLM Judges from related projects.
Context relevance. Context relevance is closely related to the task of relevance prediction in information retrieval. For the benchmark data set, we used a sample of the TREC-DL passage retrieval data sets with human annotations from 2021 and 2022, with a balanced distribution of labels across the relevance scores {0, 1, 2, 3}.
The original relevance scores are then unified to binary labels {0, 1}, where {2,3} are converted to 1 (relevant) and {0, 1} are converted to 0 (nonrelevant), following the instructions from the original TREC passage retrieval challenge.
As shown in the table below, we see a similar, and even more pronounced, low-precision/high-recall pattern. We also see that despite having low precision, the off-by-1 accuracy of the LLM Judges (the fraction of judge scores within one point of the human label on the 0-3 scale) is high, highlighting an opportunity for prompt optimization, in particular by aligning criteria between human labels and evaluation prompts.
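The label binarization and the off-by-1 accuracy we report can be computed as follows (a sketch; array names are illustrative).

```python
import numpy as np


def binarize_trec_labels(scores) -> np.ndarray:
    """Map graded TREC-DL relevance {0, 1, 2, 3} to binary labels: {2, 3} -> 1, {0, 1} -> 0."""
    return (np.asarray(scores) >= 2).astype(int)


def off_by_one_accuracy(judge_scores, human_scores) -> float:
    """Fraction of judge scores within one point of the human label on the 0-3 scale."""
    return float(np.mean(np.abs(np.asarray(judge_scores) - np.asarray(human_scores)) <= 1))
```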
For the LLM Judge for context relevance, after prompt optimization, precision increased by 4.26% with a 3.7% drop in recall, leading to an F1 score increase of 2.4% on the TREC-DL data set. This placed it above both the UMBRELA Judge and the MLflow Judge on recall and F1 score.
| Evaluator | Precision | Recall | F1 score | Off-by-1 acc |
|---|---|---|---|---|
| UMBRELA | 0.6000 | 0.6449 | 0.6216 | 0.8945 |
| MLflow relevance | 0.5973 | 0.6885 | 0.6396 | N/A |
| TruLens context relevance (unoptimized) | 0.4723 | 0.9034 | 0.6203 | 0.8634 |
| TruLens context relevance (optimized) | 0.5129 | 0.8660 | 0.6443 | 0.8902 |
You are a RELEVANCE grader; providing the relevance of the given RESPONSE to the given PROMPT.
Respond only as a number from 0 to 3, where 0 is the lowest score according to the criteria and 3 is the highest possible score.
A few additional scoring guidelines:
- Long RESPONSES should score equally well as short RESPONSES.
- RESPONSE must be relevant to the entire PROMPT to get a maximum score of 3.
- RELEVANCE score should increase as the RESPONSE provides RELEVANT context to more parts of the PROMPT.
- RESPONSE that is RELEVANT to none of the PROMPT should get a minimum score of 0.
- RESPONSE that is RELEVANT and answers the entire PROMPT completely should get a score of 3.
- RESPONSE that confidently FALSE should get a score of 0.
- RESPONSE that is only seemingly RELEVANT should get a score of 0.
- Answers that intentionally do not answer the question, such as "I don't know" and model refusals, should also be counted as the least RELEVANT and get a score of 0.
+ - Be cautious of false negatives, as they are heavily penalized. Ensure that relevant responses are not mistakenly classified as irrelevant.
Answer relevance. We also include answer relevance evaluation results on HotpotQA samples and provide comparisons with MLflow. The benchmark examples are sampled with both classes balanced: ground truth answers are assumed to be relevant, and we shuffle answers across queries to create negative examples. Both TruLens and MLflow achieve strong precision on the benchmark but weaker recall, driven by a higher number of false negatives. Comparing the evaluation prompts side by side, MLflow's answer-relevance instructions mention more aspects (appropriateness, applicability) than TruLens', which specifies only the relevance of the answer with respect to the query. Our hypothesis is that this higher specificity makes MLflow's eval stricter, resulting in lower recall but higher precision.
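One simple way to build such a balanced benchmark, assuming a list of (query, ground-truth answer) pairs, is to keep each pair as a positive and pair each query with another query's answer as a negative. The sketch below uses a fixed offset; the exact sampling we used may differ.

```python
import random


def build_answer_relevance_benchmark(qa_pairs: list[tuple[str, str]], seed: int = 0) -> list[dict]:
    """Positives: each query with its own answer. Negatives: each query with a shifted answer."""
    rng = random.Random(seed)
    queries, answers = zip(*qa_pairs)
    positives = [{"query": q, "answer": a, "label": 1} for q, a in qa_pairs]
    shift = rng.randrange(1, len(answers))  # nonzero shift so no query keeps its own answer
    negatives = [
        {"query": q, "answer": answers[(i + shift) % len(answers)], "label": 0}
        for i, q in enumerate(queries)
    ]
    return positives + negatives
```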
For completeness, and to showcase the generalization ability of the method, we include prompt optimization results for TruLens' answer-relevance metric, where Eval-Guided Optimization improves it to match MLflow's performance through interpretable prompt edits, shown above as a diff at the end of the evaluation prompt.
| Evaluator | Precision | Recall | F1 score |
|---|---|---|---|
| MLflow answer relevance | 1.0000 | 0.6650 | 0.7988 |
| TruLens answer relevance (unoptimized) | 1.0000 | 0.6050 | 0.7539 |
| TruLens answer relevance (optimized) | 0.9924 | 0.6550 | 0.7892 |
Conclusions and future work
We addressed the challenge of building trust in LLM Judges by benchmarking the quality of the baseline TruLens LLM Judges for the RAG Triad against standard ground truth data sets — TREC-DL for context relevance, LLMAggreFact for groundedness and HotpotQA for answer relevance — reporting precision, recall and F1 scores. We developed a new method for Eval-Guided Optimization that leverages the results of the benchmarking to guide an end-to-end prompt optimizer with an appropriate choice of a loss function and a data slice. We implemented this method using TextGrad as the agentic prompt optimizer and observed significant improvements.
For the LLM Judge for groundedness, precision increased by roughly 16% with a 2.5% drop in recall, leading to an F1 score increase of 8% on the LLMAggreFact data set. This placed it above the SOTA fine-tuned, proprietary Bespoke-MiniCheck-7B model on precision, recall and F1 score, as well as above the related LLM Judge from MLflow with respect to precision and F1 score.
For the LLM Judge for context relevance, precision increased by 4.26% with a 3.7% drop in recall, leading to an F1 score increase of 2.4% on the TREC-DL data set. This placed it above the LLM Judge with the UMBRELA prompt and the corresponding MLflow Judge on the F1 and recall metrics.
For the LLM Judge for answer relevance, recall increased by 5% with a 0.76% drop in precision, leading to an F1 score increase of 3.5% on the HotpotQA data set. This makes it comparable to the MLflow Judge for the related metric.
We also identified a common trend across evaluation tasks: LLM Judges tend to be more liberal or lenient in their judgments than their human counterparts.
In future work, we plan to develop the methodology further and apply it to tasks beyond RAGs, including but not limited to agentic workflows and multimodal use cases. In addition, as we strive to push the frontier of novel evaluation frameworks and make enterprise AI enabled by Snowflake more trustworthy, we will explore incorporating eval-guided optimization into Snowflake’s upcoming product offerings.
1 Shayak Sen, LLMs: Consider Hallucinatory Unless Proven Otherwise, AI Transformation Summit – Pinecone, July 2023.
2 Anupam Datta, Jerry Liu with Andrew Ng, Building and Evaluating Advanced RAG.
3 TruLens docs, The RAG Triad.
4 Cohen's Kappa measures inter-rater reliability between human annotators and LLM Judges performing the same labeling task. It ranges from -1 to 1 and accounts for agreement occurring by chance.