Is Your RAG Consistent?
In highly regulated enterprise environments, predictability is non-negotiable: the same question to your RAG application should return consistent answers every time. With Open-RAG-Eval’s new Consistency-Adjusted Index, you can now precisely quantify that level of reliability.
Introduction
In a world where decision-making is increasingly driven by data, RAG promises to combine the best of information retrieval and generative AI, pulling in up-to-date facts from your own knowledge sources and weaving them into coherent, context-aware responses.
Yet as companies continue to integrate RAG into customer support, research assistants, and compliance workflows, an uneasy question arises: if you ask the same question more than once, will your RAG system give you the same answer?
Inconsistent outputs are particularly critical for regulated industries like financial services, healthcare, and insurance, as they can undermine trust and pose significant hurdles to successful production deployments.
In this blog post, we’ll explore why reproducibility matters in RAG pipelines, identify the factors that introduce variation, and introduce the new consistency metric in Open-RAG-Eval.
Why can RAG be inconsistent?
When you issue a query in a RAG stack, a typical processing flow involves embedding models, vector search, hybrid search, reranking, and a final call to a generative LLM. Each of these components may behave differently even with the same input query.
Let’s start with the generative LLM call. LLM responses are often inconsistent: even with the temperature set to 0.0 and a fixed random seed, repeated calls can produce different outputs. There are many reasons for this, including floating-point rounding differences on GPUs and TPUs, the effects of techniques like model parallelism and pipeline parallelism, and architectures like Mixture of Experts (MoE) that add further unpredictability.
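As a hedged illustration (assuming the OpenAI Python client; any LLM API with temperature and seed parameters would do), the sketch below issues the same request twice with temperature 0.0 and a fixed seed. Even under these settings, the two completions are not guaranteed to be identical.

```python
# Minimal sketch: even with temperature=0.0 and a fixed seed,
# repeated LLM calls are not guaranteed to return identical text.
# Assumes the OpenAI Python client; swap in your own provider as needed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=0.0,  # minimize sampling randomness
        seed=42,          # best-effort determinism, not a guarantee
    )
    return response.choices[0].message.content

question = "Can GenAI solutions completely replace human labor in manufacturing?"
first = ask(question)
second = ask(question)
print("Identical outputs:", first == second)  # may well print False
```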
Yet the LLM is not the only reason RAG responses can be inconsistent.
How you implement and deploy vector search, hybrid search, or reranking can also cause minor changes in the retrieved chunks, adding further variability. For example, your vector database may implement a latency guard that returns only partial results once a time limit is reached. The LLM then receives a slightly different context as input, and produces a different response.
When RAG responses are subject to regulatory scrutiny, this matters.
A lot!
To address this, we introduced the Consistency-Adjusted Index in Open-RAG-Eval - a metric that captures both the quality and stability of model responses across repeated generations.
Measuring RAG Consistency with Open-RAG-Eval
Open-RAG-Eval (ORE) is an open-source package for RAG evaluation, known for its ability to measure retrieval, generation, and citation accuracy, as well as detect hallucinations, without the need for golden answers.
Today we are excited to introduce a new addition to Open-RAG-Eval: the Consistency Evaluator, designed to measure how stable your RAG system is across multiple invocations of the same query.
This evaluator can be chained with the existing evaluators to measure how consistent their scores are across repeated runs. By running your RAG pipeline (via the ORE connector) N times per query and computing the mean, median, standard deviation, and range of each metric, you get a clear picture of how stable your system’s performance is across generations.
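As a rough sketch of this idea (not the actual Open-RAG-Eval API; run_rag_pipeline and score_generation are hypothetical stand-ins for your ORE connector and one of the existing evaluators), the snippet below runs the same query several times and summarizes the resulting per-run scores:

```python
import statistics

# Hypothetical stand-ins: in practice these would be your ORE connector
# and one of the existing evaluators (e.g., groundedness).
def run_rag_pipeline(query: str) -> str: ...
def score_generation(query: str, answer: str) -> float: ...

def consistency_summary(query: str, n: int = 5) -> dict:
    """Run the same query n times and summarize the per-run metric scores."""
    scores = [score_generation(query, run_rag_pipeline(query)) for _ in range(n)]
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "std_dev": statistics.pstdev(scores),
        "range": max(scores) - min(scores),
    }
```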
In addition, the evaluator introduces two new metrics that focus on answer consistency across generations. For each query, it computes relevance scores across all possible pairs of generated responses (e.g., 3 pairs for 3 generations), enabling analysis of consistency at the semantic and lexical levels. These pairwise scores are then aggregated using statistical summaries (mean, median, standard deviation, and range) to capture the variability in similarity across generations.
The current implementation supports two such metrics for answer consistency:
- BERTScore: This neural similarity metric assesses the semantic alignment between pairs of generated answers. We utilize the multilingual roberta-large model, which supports over 100 languages.
- ROUGE-L: A lexical overlap metric that quantifies surface-level similarity by identifying the longest common subsequence between pairs of answers.
While both BERTScore and ROUGE-L effectively measure semantic and lexical consistency across generations, they do not ensure factual correctness or groundedness. In other words, it is possible for all generations to exhibit high similarity (high BERTScore and ROUGE-L values) yet still contain factual inaccuracies or ungrounded information. Therefore, it is crucial to measure system consistency using the existing ORE metrics like groundedness score and HHEM as well.
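To make the pairwise idea concrete, here is a simplified sketch (illustrative only, not the ORE implementation) that uses the bert-score and rouge-score packages to score every pair of generated answers and summarize the results:

```python
# Simplified sketch of pairwise answer-consistency scoring.
# Assumes at least two generations per query.
import statistics
from itertools import combinations

from bert_score import score as bert_score
from rouge_score import rouge_scorer

def pairwise_consistency(answers: list[str]) -> dict:
    pairs = list(combinations(answers, 2))  # e.g., 3 pairs for 3 generations
    a_side = [a for a, _ in pairs]
    b_side = [b for _, b in pairs]

    # Semantic similarity for each pair (BERTScore F1).
    _, _, f1 = bert_score(a_side, b_side, lang="en", verbose=False)
    bert_f1 = [f.item() for f in f1]

    # Lexical similarity for each pair (ROUGE-L F-measure).
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = [scorer.score(a, b)["rougeL"].fmeasure for a, b in pairs]

    def summarize(scores: list[float]) -> dict:
        return {
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
            "std_dev": statistics.pstdev(scores),
            "range": max(scores) - min(scores),
        }

    return {"bertscore": summarize(bert_f1), "rougeL": summarize(rouge_l)}
```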
Generation Variability Example:
To demonstrate generation variability, we present five outputs for a single query generated using Vectara’s RAG pipeline. These generations were produced with two different decoding temperatures, 0.0 and 0.7, to highlight how generation settings can impact consistency, even when the retrieved context remains constant, and to demonstrate how the Consistency Evaluator works in practice.
Example generations for query: “Can GenAI solutions completely replace human labor in manufacturing?”
Table 1: Five generations for the same query using Vectara’s RAG pipeline with decoding temperatures 0.0 and 0.7. Color annotations indicate exact matches (🟩) and semantically consistent but non-identical responses (🟨).
As Table 1 above illustrates, even at a decoding temperature of 0.0, minor phrasing variations occur across generations. This shows that decoding can introduce subtle variability, even in low-randomness settings. At higher temperatures, this variability becomes more pronounced, resulting in greater lexical diversity.
These patterns underscore the critical importance of evaluating consistency. Even with minimal randomness, generation behavior can vary, impacting reliability, user experience, and subsequent processing. Quantifying the stability of a model's outputs across multiple generations offers a deeper insight into its behavior, beyond simply the quality of a single response.
How to Measure Stability and Quality
To assess not just the quality but also the stability of model outputs across multiple generations, we introduce a Consistency-Adjusted Index (CAI) for each evaluation metric (retrieval, groundedness, hallucination, citations and lexical and semantic similarity metrics). This index fuses the mean and standard deviation into a single score, reflecting how reliably the model performs by penalizing high variance across generations, and is defined as follows:
Consistency-Adjusted Index = μ / (1 + σ)
- μ represents the mean of the metric scores across generations, reflecting the average quality.
- σ represents the standard deviation (Std Dev), indicating the degree of variation.
Because the underlying metric scores are bounded between 0 and 1, the index also ranges from 0 to 1. A score approaching 1.0 signifies high-quality outputs that are tightly clustered, indicating both stability and strength. Conversely, a score around 0.5 or lower suggests unstable quality, considerable variation, or uniform failure.
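Here is a minimal sketch of the computation (the choice of population standard deviation below is an assumption, not necessarily what ORE uses internally):

```python
import statistics

def consistency_adjusted_index(scores: list[float]) -> float:
    """CAI = mean / (1 + std) over per-generation or pairwise metric scores in [0, 1]."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) if len(scores) > 1 else 0.0  # population std (assumption)
    return mu / (1.0 + sigma)

# Both score lists below have the same mean (0.85);
# the more spread-out one is penalized by the CAI.
print(round(consistency_adjusted_index([0.85, 0.83, 0.87]), 3))  # ~0.836: low spread
print(round(consistency_adjusted_index([0.65, 0.90, 1.00]), 3))  # ~0.741: high spread
```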
Example Scenario
To demonstrate the practical application of the CAI, let’s review an example where we show both BERTScore and Groundedness metrics, applied to three generations for a single query.
Pairwise Similarity (BERTScore):
BERTScore evaluates internal semantic consistency by comparing each pair of responses.
Table 2: BERTScore consistency examples.
These examples show how the index rewards outputs that are both high-quality and well-aligned. Even a single outlier can substantially reduce the value.
Per-Generation Metric (Groundedness):
For metrics such as Groundedness, which are evaluated once per generation, the Consistency-Adjusted Index reflects the stability and quality of these per-generation values for each query. To illustrate, let's consider two scenarios: one with high deviation and another with low deviation.
Table 3: Groundedness consistency examples.
While Scenario B might initially seem "more consistent" due to zero variation in scores, all generations are uniformly poor. This represents a uniform failure, not meaningful consistency. In contrast, Scenario A exhibits some variation across generations, but the responses are, on average, of significantly higher quality. Consequently, Scenario A achieves a superior CAI despite its higher standard deviation.
These examples underscore a crucial point: low variation does not equate to true consistency. A model that consistently fails in the same manner remains unreliable. Genuine consistency demands both stability and quality.
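As a hedged illustration with made-up numbers (not the values from Table 3): three groundedness scores of 0.9, 0.7, and 0.8 have μ = 0.8 and a sample standard deviation of 0.1, giving CAI = 0.8 / 1.1 ≈ 0.73, whereas three identical scores of 0.3 give μ = 0.3, σ = 0, and CAI = 0.3. The varied-but-strong scenario still comes out well ahead.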
Assessing Stability and Quality: Approach and Findings
To put the Consistency-Adjusted Index (CAI) into practice, we evaluated our Vectara RAG system's consistency on an internal corpus, using 100 distinct queries. Our main goal was to understand how various generation settings affect output stability and quality. We focused on two key analyses: consistency across different generation models at a fixed temperature, and consistency within a given model across varying temperature values.
We applied the index to six core evaluation metrics: BERTScore and ROUGE-L (which measure semantic and lexical similarity, respectively), and four others discussed in our earlier blog post - Groundedness Score (measuring support from retrieved results), Citation Score (F1 score for citation accuracy), Factuality Score (quantifying freedom from hallucinations), and Retrieval Score (assessing the relevance of retrieved passages).
Figure 1 illustrates per-metric CAI distribution at temperature 0.0, comparing GPT-4o and GPT-3.5.

Figure 1: Per-metric CAI comparison between GPT-4o and GPT-3.5 at temperature 0.0. Each boxplot summarizes Consistency-Adjusted Index (CAI) scores computed across 100 queries. Higher CAI values indicate more consistent and higher-quality responses for a given query, while tighter boxplots reflect more stable performance across different queries. In other words, models with high and tightly clustered CAI scores demonstrate reliable behavior across a wide range of inputs. Note: The Citation boxplot for GPT-4o appears compressed because most scores are exactly 1.0, indicating highly consistent citation behavior with minimal variation across queries.
GPT-4o consistently shows higher CAI scores and significantly tighter distributions across most metrics, indicating superior stability and quality across queries. The most striking difference is in the Factuality metric, where GPT-4o is highly consistent, while GPT-3.5 is notably lower and more variable across queries. GPT-4o also demonstrates stronger and more consistent performance in Citations, BERTScore, and ROUGE-L. Groundedness Consistency is broadly similar across both models, with a slight edge to GPT-3.5. Since the retrieved passages are fixed per query and reused across all generations, Retrieval Consistency remains identical for both models, as anticipated.
Figure 2 shows the per-metric CAI distribution for GPT-4o at temperatures 0.0 and 0.7.

Figure 2: Per-metric consistency of GPT-4o at temperatures 0.0 and 0.7. Each boxplot summarizes Consistency-Adjusted Index (CAI) scores computed across 100 queries. Higher CAI values indicate more consistent and higher-quality responses for a given query, while tighter boxplots reflect more stable performance across different queries. In other words, models with high and tightly clustered CAI scores demonstrate reliable behavior across a wide range of inputs. Note: The Citation boxplot for GPT-4o appears compressed because most scores are exactly 1.0, indicating highly consistent citation behavior with minimal variation across queries.
The Consistency-Adjusted Index reveals expected shifts: Factuality, BERTScore, and ROUGE-L show slightly lower quality and consistency at higher temperatures, reflecting increased variability and a drop in average response quality due to more diverse sampling. In contrast, Citation, Retrieval, and Groundedness CAI remain largely stable, indicating robustness in these aspects regardless of temperature.
Measuring both quality and stability, the CAI illuminates subtle differences in how systems behave. This diagnostic power helps identify model inconsistencies that average scores might miss, leading to more robust decisions regarding generation parameters and overall RAG system reliability.
Conclusions
In this blog post, we introduced the Consistency-Adjusted Index (CAI) in Open-RAG-Eval - a metric designed to capture not just the quality of your RAG system’s outputs, but how reliably that quality holds across multiple runs. This kind of measurement is especially critical for RAG and agentic RAG applications in regulated environments, where both consistency and trustworthiness matter.
We encourage you to give this new metric a try, or contact our team if you need our help with implementing trusted RAG applications.
As always, if you have any further suggestions for Open-RAG-Eval, please let us know or submit a PR.