Is Your RAG Consistent?
In highly regulated enterprise environments, predictability is non-negotiable: the same question to your RAG application should return consistent answers every time. With Open-RAG-Eval’s new Consistency-Adjusted Index, you can now precisely quantify that level of reliability.
Introduction
In a world where decision-making is increasingly driven by data, RAG promises to combine the best of information retrieval and generative AI, pulling in up-to-date facts from your own knowledge sources and weaving them into coherent, context-aware responses.
Yet as companies continue to integrate RAG into customer support, research assistants, and compliance workflows, an uneasy question arises: if you ask the same question more than once, will your RAG system give you the same answer?
Inconsistent outputs are particularly critical for regulated industries like financial services, healthcare, and insurance, as they can undermine trust and pose significant hurdles to successful production deployments.
In this blog post, we’ll explore why reproducibility matters in RAG pipelines, identify the factors that introduce variation, and introduce the new consistency metric in Open-RAG-Eval.
Why can RAG be inconsistent?
When you issue a query in a RAG stack, a typical processing flow involves embedding models, vector search, hybrid search, reranking, and a final call to a generative LLM. Each of these components may behave differently even with the same input query.
Let’s start with the generative LLM call. LLM responses are often inconsistent: even with the temperature set to 0.0 and a fixed random seed, repeated calls can produce different outputs. There are many reasons for this, including floating-point rounding differences on GPUs and TPUs, the effects of techniques like model parallelism and pipeline parallelism, and architectures like Mixture of Experts (MoE) that add further unpredictability.
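As a hedged illustration (assuming the OpenAI Python client; any LLM API with temperature and seed parameters would do), the sketch below issues the same request twice with temperature 0.0 and a fixed seed. Even under these settings, the two completions are not guaranteed to be identical.

```python
# Minimal sketch: even with temperature=0.0 and a fixed seed,
# repeated LLM calls are not guaranteed to return identical text.
# Assumes the OpenAI Python client; swap in your own provider as needed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=0.0,  # minimize sampling randomness
        seed=42,          # best-effort determinism, not a guarantee
    )
    return response.choices[0].message.content

question = "Can GenAI solutions completely replace human labor in manufacturing?"
first = ask(question)
second = ask(question)
print("Identical outputs:", first == second)  # may well print False
```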
Yet the LLM is not the only reason RAG responses can be inconsistent.
How you implement and deploy vector search, hybrid search, or reranking can also cause minor changes in the retrieved chunks, adding further variability. For example, your vector database may implement a latency guard that returns only partial results once a time limit is reached. The LLM then receives a slightly different context as input, and produces a different response.
When RAG responses are subject to regulatory scrutiny, this matters.
A lot!
To address this, we introduced the Consistency-Adjusted Index in Open-RAG-Eval - a metric that captures both the quality and stability of model responses across repeated generations.
Measuring RAG Consistency with Open-RAG-Eval
Open-RAG-Eval (ORE) is an open-source package for RAG evaluation, known for its ability to measure retrieval, generation, and citation accuracy, as well as detect hallucinations, without the need for golden answers.
Today we are excited to introduce a new addition to Open-RAG-Eval: the Consistency Evaluator, designed to measure how stable your RAG system is across multiple invocations of the same query.
This evaluator can be chained with the existing evaluators to measure how consistent their scores are across repeated runs. By running your RAG pipeline (via the ORE connector) N times per query and computing the mean, median, standard deviation, and range of each metric, you get a clear picture of how stable your system’s performance is across generations.
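As a rough sketch of this idea (not the actual Open-RAG-Eval API; run_rag_pipeline and score_generation are hypothetical stand-ins for your ORE connector and one of the existing evaluators), the snippet below runs the same query several times and summarizes the resulting per-run scores:

```python
import statistics

# Hypothetical stand-ins: in practice these would be your ORE connector
# and one of the existing evaluators (e.g., groundedness).
def run_rag_pipeline(query: str) -> str: ...
def score_generation(query: str, answer: str) -> float: ...

def consistency_summary(query: str, n: int = 5) -> dict:
    """Run the same query n times and summarize the per-run metric scores."""
    scores = [score_generation(query, run_rag_pipeline(query)) for _ in range(n)]
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "std_dev": statistics.pstdev(scores),
        "range": max(scores) - min(scores),
    }
```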
In addition, the evaluator introduces two new metrics that focus on answer consistency across generations. For each query, it computes relevance scores across all possible pairs of generated responses (e.g., 3 pairs for 3 generations), enabling analysis of consistency at the semantic and lexical levels. These pairwise scores are then aggregated using statistical summaries (mean, median, standard deviation, and range) to capture the variability in similarity across generations.
The current implementation supports two such metrics for answer consistency:
- BERTScore: This neural similarity metric assesses the semantic alignment between pairs of generated answers. We utilize the multilingual roberta-large model, which supports over 100 languages.
- ROUGE-L: A lexical overlap metric that quantifies surface-level similarity by identifying the longest common subsequence between pairs of answers.
While both BERTScore and ROUGE-L effectively measure semantic and lexical consistency across generations, they do not ensure factual correctness or groundedness. In other words, it is possible for all generations to exhibit high similarity (high BERTScore and ROUGE-L values) yet still contain factual inaccuracies or ungrounded information. Therefore, it is crucial to measure system consistency using the existing ORE metrics like groundedness score and HHEM as well.
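To make the pairwise idea concrete, here is a simplified sketch (illustrative only, not the ORE implementation) that uses the bert-score and rouge-score packages to score every pair of generated answers and summarize the results:

```python
# Simplified sketch of pairwise answer-consistency scoring.
# Assumes at least two generations per query.
import statistics
from itertools import combinations

from bert_score import score as bert_score
from rouge_score import rouge_scorer

def pairwise_consistency(answers: list[str]) -> dict:
    pairs = list(combinations(answers, 2))  # e.g., 3 pairs for 3 generations
    a_side = [a for a, _ in pairs]
    b_side = [b for _, b in pairs]

    # Semantic similarity for each pair (BERTScore F1).
    _, _, f1 = bert_score(a_side, b_side, lang="en", verbose=False)
    bert_f1 = [f.item() for f in f1]

    # Lexical similarity for each pair (ROUGE-L F-measure).
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = [scorer.score(a, b)["rougeL"].fmeasure for a, b in pairs]

    def summarize(scores: list[float]) -> dict:
        return {
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
            "std_dev": statistics.pstdev(scores),
            "range": max(scores) - min(scores),
        }

    return {"bertscore": summarize(bert_f1), "rougeL": summarize(rouge_l)}
```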
Generation Variability Example:
To demonstrate generation variability, we present five outputs for a single query generated using Vectara’s RAG pipeline. These generations were produced with two different decoding temperatures, 0.0 and 0.7, to highlight how generation settings can impact consistency, even when the retrieved context remains constant, and to demonstrate how the Consistency Evaluator works in practice.
Example generations for query: “Can GenAI solutions completely replace human labor in manufacturing?”
Table 1: Five generations for the same query using Vectara’s RAG pipeline with decoding temperatures 0.0 and 0.7. Color annotations indicate exact matches (🟩) and semantically consistent but non-identical responses (🟨).
As Table 1 above illustrates, even at a decoding temperature of 0.0, minor phrasing variations occur across generations. This shows that decoding can introduce subtle variability, even in low-randomness settings. At higher temperatures, this variability becomes more pronounced, resulting in greater lexical diversity.
These patterns underscore the critical importance of evaluating consistency. Even with minimal randomness, generation behavior can vary, impacting reliability, user experience, and subsequent processing. Quantifying the stability of a model's outputs across multiple generations offers a deeper insight into its behavior, beyond simply the quality of a single response.
How to Measure Stability and Quality
To assess not just the quality but also the stability of model outputs across multiple generations, we introduce a Consistency-Adjusted Index (CAI) for each evaluation metric (retrieval, groundedness, hallucination, citations and lexical and semantic similarity metrics). This index fuses the mean and standard deviation into a single score, reflecting how reliably the model performs by penalizing high variance across generations, and is defined as follows:
Consistency-Adjusted Index = μ / (1 + σ)
- μ represents the mean of the metric scores across generations, reflecting the average quality.
- σ represents the standard deviation (Std Dev), indicating the degree of variation.
Because the underlying metric scores are bounded between 0 and 1, the index also ranges from 0 to 1. A score approaching 1.0 signifies high-quality outputs that are tightly clustered, indicating both stability and strength. Conversely, a score around 0.5 or lower suggests unstable quality, considerable variation, or uniform failure.
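Here is a minimal sketch of the computation (the choice of population standard deviation below is an assumption, not necessarily what ORE uses internally):

```python
import statistics

def consistency_adjusted_index(scores: list[float]) -> float:
    """CAI = mean / (1 + std) over per-generation or pairwise metric scores in [0, 1]."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) if len(scores) > 1 else 0.0  # population std (assumption)
    return mu / (1.0 + sigma)

# Both score lists below have the same mean (0.85);
# the more spread-out one is penalized by the CAI.
print(round(consistency_adjusted_index([0.85, 0.83, 0.87]), 3))  # ~0.836: low spread
print(round(consistency_adjusted_index([0.65, 0.90, 1.00]), 3))  # ~0.741: high spread
```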
Example Scenario
To demonstrate the practical application of the CAI, let’s review an example where we show both BERTScore and Groundedness metrics, applied to three generations for a single query.
Pairwise Similarity (BERTScore):
BERTScore evaluates internal semantic consistency by comparing each pair of responses.
Table 2: BERTScore consistency examples.
These examples show how the index rewards outputs that are both high-quality and well-aligned. Even a single outlier can substantially reduce the value.
Per-Generation Metric (Groundedness):
For metrics such as Groundedness, which are evaluated once per generation, the Consistency-Adjusted Index reflects the stability and quality of these per-generation values for each query. To illustrate, let's consider two scenarios: one with high deviation and another with low deviation.
Table 3: Groundedness consistency examples.
While Scenario B might initially seem "more consistent" due to zero variation in scores, all generations are uniformly poor. This represents a uniform failure, not meaningful consistency. In contrast, Scenario A exhibits some variation across generations, but the responses are, on average, of significantly higher quality. Consequently, Scenario A achieves a superior CAI despite its higher standard deviation.
These examples underscore a crucial point: low variation does not equate to true consistency. A model that consistently fails in the same manner remains unreliable. Genuine consistency demands both stability and quality.
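As a hedged illustration with made-up numbers (not the values from Table 3): three groundedness scores of 0.9, 0.7, and 0.8 have μ = 0.8 and a sample standard deviation of 0.1, giving CAI = 0.8 / 1.1 ≈ 0.73, whereas three identical scores of 0.3 give μ = 0.3, σ = 0, and CAI = 0.3. The varied-but-strong scenario still comes out well ahead.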
Assessing Stability and Quality: Approach and Findings
To put the Consistency-Adjusted Index (CAI) into practice, we evaluated our Vectara RAG system's consistency on an internal corpus, using 100 distinct queries. Our main goal was to understand how various generation settings affect output stability and quality. We focused on two key analyses: consistency across different generation models at a fixed temperature, and consistency within a given model across varying temperature values.
We applied the index to six core evaluation metrics: BERTScore and ROUGE-L (which measure semantic and lexical similarity, respectively), and four others discussed in our earlier blog post - Groundedness Score (measuring support from retrieved results), Citation Score (F1 score for citation accuracy), Factuality Score (quantifying freedom from hallucinations), and Retrieval Score (assessing the relevance of retrieved passages).
Figure 1 illustrates per-metric CAI distribution at temperature 0.0, comparing GPT-4o and GPT-3.5.

Figure 1: Per-metric CAI comparison between GPT-4o and GPT-3.5 at temperature 0.0. Each boxplot summarizes Consistency-Adjusted Index (CAI) scores computed across 100 queries. Higher CAI values indicate more consistent and higher-quality responses for a given query, while tighter boxplots reflect more stable performance across different queries. In other words, models with high and tightly clustered CAI scores demonstrate reliable behavior across a wide range of inputs. Note: The Citation boxplot for GPT-4o appears compressed because most scores are exactly 1.0, indicating highly consistent citation behavior with minimal variation across queries.
GPT-4o consistently shows higher CAI scores and significantly tighter distributions across most metrics, indicating superior stability and quality across queries. The most striking difference is in the Factuality metric, where GPT-4o is highly consistent, while GPT-3.5 is notably lower and more variable across queries. GPT-4o also demonstrates stronger and more consistent performance in Citations, BERTScore, and ROUGE-L. Groundedness Consistency is broadly similar across both models, with a slight edge to GPT-3.5. Since the retrieved passages are fixed per query and reused across all generations, Retrieval Consistency remains identical for both models, as anticipated.
Figure 2 shows the per-metric CAI distribution for GPT-4o at temperatures 0.0 and 0.7.

Figure 2: Per-metric consistency of GPT-4o at temperatures 0.0 and 0.7. Each boxplot summarizes Consistency-Adjusted Index (CAI) scores computed across 100 queries. Higher CAI values indicate more consistent and higher-quality responses for a given query, while tighter boxplots reflect more stable performance across different queries. In other words, models with high and tightly clustered CAI scores demonstrate reliable behavior across a wide range of inputs. Note: The Citation boxplot for GPT-4o appears compressed because most scores are exactly 1.0, indicating highly consistent citation behavior with minimal variation across queries.
The Consistency-Adjusted Index reveals expected shifts: Factuality, BERTScore, and ROUGE-L show slightly lower quality and consistency at higher temperatures, reflecting increased variability and a drop in average response quality due to more diverse sampling. In contrast, Citation, Retrieval, and Groundedness CAI remain largely stable, indicating robustness in these aspects regardless of temperature.
Measuring both quality and stability, the CAI illuminates subtle differences in how systems behave. This diagnostic power helps identify model inconsistencies that average scores might miss, leading to more robust decisions regarding generation parameters and overall RAG system reliability.
Conclusions
In this blog post, we introduced the Consistency-Adjusted Index (CAI) in Open-RAG-Eval - a metric designed to capture not just the quality of your RAG system’s outputs, but how reliably that quality holds across multiple runs. This kind of measurement is especially critical for RAG and agentic RAG applications in regulated environments, where both consistency and trustworthiness matter.
We encourage you to give this new metric a try, or contact our team if you need our help with implementing trusted RAG applications.
As always, if you have any further suggestions for Open-RAG-Eval, please let us know or submit a PR.