
Hallucination Detection: Commercial vs Open Source - A Deep Dive

HHEM-2.1-Open is the de facto open-source hallucination evaluation model for RAG, but how much better is the commercial version available in Vectara’s platform?


Introduction

Vectara's Hallucination Leaderboard has become the go-to reference for evaluating hallucination rates of AI models in enterprise RAG and Agentic applications.

Many organizations rely on its evaluations to determine which LLMs are least likely to hallucinate in their responses, as seen in these articles from TechRepublic and Digital Information World. The hallucination leaderboard is powered by Vectara's commercial-strength hallucination detection model, HHEM-2.3 (at the time of writing), which is accessible via Vectara's API.

Vectara also has an open-weights version of this model, called HHEM-2.1-Open, which is available on Hugging Face and Kaggle.

In this blog post, we show that for mission-critical enterprise applications, HHEM-2.3 provides significant improvements over HHEM-2.1-Open that lead to better performance in hallucination detection.

Perhaps the most important distinction between these two models is their effective context window. Both models technically allow you to provide unlimited context, but performance at longer context lengths differs. HHEM-2.1-Open was derived from Google's T5 model, which was primarily pre-trained on relatively short input sequences. In contrast, HHEM-2.3 is based on a more advanced model that has a much longer context window.

Furthermore, HHEM-2.3's multilingual support allows it to detect hallucinations in 11 languages, a capability that HHEM-2.1-Open (which supports only English) does not offer.

HHEM-2.1-Open vs. HHEM-2.3: Experimental Design

We benchmarked HHEM-2.3 against HHEM-2.1-Open to measure their hallucination detection performance. In the context of RAG, we define a response from an LLM to be hallucinated if the text generated by the LLM is not supported by the provided premise (aka context).

Given a premise and a generated response, both HHEM models output a score between 0 and 1, with 0 representing a response that is not evidenced at all by the premise and 1 representing a response that is fully supported by the premise. For this experiment, we use a threshold value of 0.5 to classify a response as either hallucinated (score <= 0.5) or factual (score > 0.5).
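To make this concrete, here is a minimal sketch of how such a thresholded check might look with the open-weights model. It assumes the vectara/hallucination_evaluation_model checkpoint on Hugging Face and the predict() helper described on its model card, so treat the exact interface and the example pairs as illustrative.

```python
# Minimal sketch: score (premise, response) pairs with HHEM-2.1-Open and
# apply the 0.5 threshold used in this post. Assumes the Hugging Face
# checkpoint 'vectara/hallucination_evaluation_model' and its predict()
# helper (loaded with trust_remote_code=True); see the model card for the
# exact interface.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

pairs = [
    # (premise / RAG context, generated response)
    ("The company reported Q3 revenue of $12M, up 8% year over year.",
     "Revenue grew 8% in the third quarter."),
    ("The company reported Q3 revenue of $12M, up 8% year over year.",
     "The company's revenue declined sharply in Q3."),
]

scores = model.predict(pairs)  # factual consistency scores in [0, 1]
for (premise, response), score in zip(pairs, scores):
    label = "factual" if float(score) > 0.5 else "hallucinated"
    print(f"{float(score):.3f}  {label}  ->  {response}")
```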

Figure 1: HHEM works by comparing the facts in the RAG context (or premise) to the generated response, and outputs a factual consistency score between 0…1.

To compare the two models, we used the question answering and news summarization subsets (denoted RAGTruth-QA and RAGTruth-Summary, respectively) from the RAGTruth benchmark and the MediaSum and MeetingBank subsets (denoted TofuEval-MediaSum and TofuEval-MeetingBank, respectively) from the TofuEval benchmark.

For this analysis, we aimed to not only quantify performance in general but also investigate how both models differ in performance with varying premise lengths. To do so, we split each dataset at specific character thresholds, creating two subsets per threshold: one with premise lengths below the threshold and one with premise lengths above it. Both models were then evaluated separately on each subset.

We used the following thresholds for each dataset:

  • RAGTruth-QA: 1024, 1536 characters
  • RAGTruth-Summary: 2048, 4096 characters
  • TofuEval-MediaSum: 4608 characters
  • TofuEval-MeetingBank: 4608 characters

These thresholds were chosen as multiples of 512 characters while ensuring that each resulting subset contains at least 1,000 samples.
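As a rough illustration, the splitting procedure can be expressed as follows. The pandas-based setup and the column names (premise, response, is_hallucinated) are assumptions for the sketch, not the exact evaluation harness used here.

```python
# Sketch of the premise-length split: for each character threshold, build a
# subset with premises shorter than the threshold and one with premises at or
# above it, then evaluate each model on both. Column names are illustrative.
import pandas as pd

THRESHOLDS = {
    "RAGTruth-QA": [1024, 1536],
    "RAGTruth-Summary": [2048, 4096],
    "TofuEval-MediaSum": [4608],
    "TofuEval-MeetingBank": [4608],
}

def split_by_premise_length(df: pd.DataFrame, threshold: int):
    """Return (below, above) subsets based on premise length in characters."""
    lengths = df["premise"].str.len()
    return df[lengths < threshold], df[lengths >= threshold]

def make_splits(datasets: dict) -> dict:
    """Map (dataset name, threshold, side) -> subset DataFrame."""
    splits = {}
    for name, df in datasets.items():
        for threshold in THRESHOLDS[name]:
            below, above = split_by_premise_length(df, threshold)
            splits[(name, threshold, "below")] = below
            splits[(name, threshold, "above")] = above
    return splits
```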

RAGTruth: HHEM-2.3 vs HHEM-2.1-Open

The overall takeaway from our experiment is that HHEM-2.3 is much better at hallucination detection than HHEM-2.1-Open on these two benchmark datasets: it outperforms HHEM-2.1-Open across all the metrics we recorded (balanced accuracy, precision, recall, and F1 score) and across all data splits.

We observe specifically that HHEM-2.3 is much better than HHEM-2.1-Open at assigning low scores to hallucinated responses (correctly identifying hallucinations), achieving better recall, while its precision is also higher than HHEM-2.1-Open's across all sequence lengths. HHEM-2.3 additionally produces lower scores for these responses, demonstrating higher confidence in identifying hallucinated responses.
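For reference, these metrics can be computed from the thresholded scores roughly as follows. This is a sketch using scikit-learn that treats hallucinated responses as the positive class, consistent with how recall is discussed above; the function and variable names are illustrative rather than the exact evaluation harness.

```python
# Sketch: compute balanced accuracy, precision, recall, and F1 from HHEM
# scores, treating "hallucinated" as the positive class. A response counts
# as predicted hallucinated when its score is <= 0.5. Names are illustrative.
from sklearn.metrics import (
    balanced_accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

def evaluate(scores, is_hallucinated, threshold=0.5):
    """scores: HHEM outputs in [0, 1]; is_hallucinated: 1 for hallucinated."""
    preds = [1 if s <= threshold else 0 for s in scores]
    return {
        "balanced_accuracy": balanced_accuracy_score(is_hallucinated, preds),
        "precision": precision_score(is_hallucinated, preds),
        "recall": recall_score(is_hallucinated, preds),
        "f1": f1_score(is_hallucinated, preds),
    }
```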

Let's take a look at some of the results from the RAGTruth-QA dataset to demonstrate this difference. As we can see in Figures 2 and 3 below, HHEM-2.3 has a much higher recall than HHEM-2.1-Open on this dataset while its precision is higher as well in the same setting. As the premise length increases, HHEM-2.3's ability to identify these hallucinations (its recall) improves while HHEM-2.1-Open exhibits diminishing performance.

Figure 2: Recall for HHEM-2.3 vs HHEM-2.1-Open with varying premise (context) lengths in characters for RAGTruth-QA

Figure 3: Precision for HHEM-2.3 vs HHEM-2.1-Open with varying premise (context) lengths in characters for RAGTruth-QA

Figure 4 plots the median HHEM score for hallucinated responses, and as we can see it demonstrates a similar result: we expect these models to output scores close to 0 for hallucinated responses, and we see that HHEM-2.3 does so, with median output scores well below 0.1. Conversely, HHEM-2.1-Open outputs much higher median scores, demonstrating reduced confidence in its identification of hallucinated responses.

Figure 4: Median model outputs for HHEM-2.3 vs HHEM-2.1-Open (RAGTruth-QA) as a function of the premise length in characters.

We can see this even more clearly in Figure 5. Among the hallucinated samples with the highest premise length in RAGTruth-QA, HHEM-2.3 correctly identifies most of the responses as hallucinations and outputs scores that are close to 0. On the other hand, HHEM-2.1-Open outputs scores that are close to a uniform distribution, with some higher bars at the edges, indicating that its evaluation of these responses is not as helpful in identifying the responses as hallucinations.

Figure 5: Distribution of HHEM scores for HHEM-2.3 vs HHEM-2.1-Open in RAGTruth-QA for hallucinated responses with the premise length > 1536 characters.

The difference is even more striking when we look at the RAGTruth-Summary dataset. Across all data splits, HHEM-2.3 achieves a recall of at least 0.758 while HHEM-2.1-Open achieves a maximum recall of 0.377, as shown in Figure 6 below:

Figure 6: Recall for RAGTruth-Summary for various premise length in characters, comparing HHEM-2.3 with HHEM-2.1-Open

Among hallucinated responses, where we expect the models to output scores close to 0, HHEM-2.3's median score across all data splits is at most 0.063, while HHEM-2.1-Open's median scores are at least 0.643, above the binary threshold value of 0.5. This means that more than half of the hallucinated responses in this dataset are classified as factual by HHEM-2.1-Open.

Table 1: Median hallucination score for HHEM-2.3 vs HHEM-2.1-Open for hallucinated responses in RAGTruth-Summary

Figure 7 shows the distribution of output scores from the two models across the entire dataset. We see a clear difference between the two models' outputs, with consistently low scores from HHEM-2.3, indicating good performance, and consistently high scores from HHEM-2.1-Open, indicating poor performance.

Figure 7: Distribution of HHEM scores for HHEM-2.3 vs HHEM-2.1-Open for Hallucinated responses on RAGTruth-Summary

TofuEval: HHEM-2.3 vs HHEM-2.1-Open

While the types of hallucinations in the RAGTruth benchmark are pretty straightforward, focusing primarily on clear factual errors or baseless information in the response that is not found in the premise, the types of hallucinations in the TofuEval benchmark dataset are more complex. These include nuanced meaning shifts, misreferences, and stating opinions as facts, among others. In that sense, TofuEval represents a more difficult benchmark dataset.

As we see in Figure 8, the balanced accuracies for HHEM-2.1-Open and HHEM-2.3 on the TofuEval-MediaSum dataset among samples with premise length less than 4608 characters were 71.57% and 78.48%, respectively, a relatively modest gap between the two models.

Figure 8: Balanced Accuracy of HHEM-2.1-Open vs HHEM-2.3 for TofuEval-MediaSum.

However, when we evaluated the samples with longer premise lengths, HHEM-2.1-Open's balanced accuracy dropped significantly to 62.58% (a drop of 8.99%) while HHEM-2.3's balanced accuracy remained about the same at 78.2% (with only a minor drop of 0.28%).

HHEM-2.1-Open's diminishing performance in balanced accuracy can again be connected to its recall, dropping from 0.526 on the shorter premise samples to 0.328 on the longer premise samples.

On the TofuEval-MeetingBank dataset, both models saw an increase in performance as the average premise length increased. As shown in Figure 9, HHEM-2.1-Open saw a jump in balanced accuracy from 63.44% to 78.6% (+15.16%) while HHEM-2.3's balanced accuracy rose from 77.92% to 86% (+8.08%). Both models also saw improvements in recall as premise length increased, with HHEM-2.1-Open nearly doubling from 0.345 to 0.668 (+0.323) and HHEM-2.3 exhibiting a more modest increase from 0.586 to 0.732 (+0.146).

Figure 9: Balanced accuracy for HHEM-2.3 vs HHEM-2.1-Open with TofuEval-MeetingBank

One surprising result we found in TofuEval-MeetingBank was HHEM-2.1-Open's shift in median output score among hallucinated responses. For samples with lower premise length, its median output was 0.82, once again indicating poor ability to detect hallucinations. However, among the samples with higher premise length, its median output score was 0.082, comparable to HHEM-2.3's median output score of 0.018 among those same samples.

Table 2: Median hallucination score for HHEM-2.3 vs HHEM-2.1-Open for hallucinated responses in TofuEval-MeetingBank.

Conclusion

In this blog post, we compared Vectara's open-weights HHEM model, HHEM-2.1-Open, with its commercial-strength, production-quality model, HHEM-2.3, which is available via the Vectara platform's API.

We tested both models across the RAGTruth and TofuEval datasets, and our results clearly demonstrate the superiority of HHEM-2.3, with much higher recall and precision as well as a more favorable distribution of factual consistency scores.

If you would like to explore HHEM-2.3 for yourself, you can sign up for a 30-day free trial of Vectara and try HHEM using our API.
