HCMBench: an evaluation toolkit for hallucination correction models
HCMBench is Vectara’s open-source evaluation toolkit designed to rigorously test and compare hallucination correction models. With modular pipelines, diverse datasets, and multi-level evaluation metrics, it gives developers a powerful, standardized way to measure and improve the accuracy of RAG system outputs.
Introduction
Hallucinations in RAG systems lead to inaccurate and unreliable output, distort context, erode user trust, complicate evaluation, and increase error rates, particularly in domains where accuracy is critical, such as financial services or healthcare. One solution to this problem is to add a post-editing step after the RAG system generates its initial response, using a Hallucination Correction Model (HCM).
In our previous blog post, we introduced Vectara's Hallucination Corrector. As we work to further improve its performance, we are excited to share HCMBench, Vectara's open-source HCM evaluation toolkit. HCMBench streamlines the assessment and development of hallucination correction models and helps you measure how effectively different models and prompts mitigate an LLM's hallucinations.
Toolkit components
Our toolkit has four components: the Dataset, Hallucination Correction Model (HCM), Postprocessor, and Hallucination Evaluation Model (HEM).

Dataset
To determine the effectiveness of an HCM, we need hallucinated generations from LLMs together with the contexts used to produce them. At the time of writing, we have integrated four public datasets into our toolkit: RAGTruth, FaithBench, FAVABENCH, and FACTS Grounding. We converted the datasets into a unified format in which each sample has two fields: the context and the LLM response.
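To make the format concrete, here is a rough sketch of what a converted sample might look like, using HuggingFace's datasets library; the field names follow the "context" and "claim" (LLM response) columns referenced later in this post, and the snippet is an illustration rather than the toolkit's actual conversion code:

```python
from datasets import Dataset

# Illustrative sample in the unified two-field format:
# "context" holds the source passages, "claim" holds the (possibly hallucinated) LLM response.
samples = [
    {
        "context": "The Eiffel Tower was completed in 1889 and is 330 metres tall.",
        "claim": "The Eiffel Tower, finished in 1887, stands 330 metres tall.",
    },
]

unified = Dataset.from_list(samples)
print(unified.column_names)  # ['context', 'claim']
```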
Hallucination Correction Model
This component represents any model designed to identify and rectify hallucinations in text. These models are the primary focus of the evaluation process. As shown in Figure 1, the HCM takes the context and the LLM response as input and outputs an edited response.
In the toolkit, we provide an "identity" correction model and an example wrapper for FAVA. The "identity" correction model makes no edits and returns the LLM response unchanged; it serves as a baseline for all other correction models. The FAVA model is an open-source 7B LLM fine-tuned from Llama-2 on the FAVABENCH data.
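To make the HCM interface concrete, below is a minimal sketch of an identity-style correction model; the class and method names are hypothetical and may not match the toolkit's actual API:

```python
from typing import Dict, List

class IdentityCorrectionModel:
    """Hypothetical baseline HCM: returns the LLM response unchanged."""

    def correct(self, context: str, response: str) -> str:
        # A real HCM (e.g. a FAVA wrapper) would use `context` to rewrite
        # unsupported spans in `response`; the identity baseline edits nothing.
        return response

def run_hcm(model: IdentityCorrectionModel, samples: List[Dict[str, str]]) -> List[str]:
    """Produce one corrected response per (context, LLM response) sample."""
    return [model.correct(s["context"], s["claim"]) for s in samples]
```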
Postprocessor
The postprocessor prepares the input data, ensuring it's in the correct format for the evaluation models. This step is crucial for data consistency and accurate results. The hallucination evaluation for a generated response can happen at different levels: response-level, sentence-level, and claim-level.
- Response-level evaluation takes the entire generated response as input and assesses the overall accuracy and factuality.
- Sentence-level evaluation breaks down the response into individual sentences and evaluates each one for hallucinations, providing more fine-grained insights.
- Claim-level evaluation goes further by extracting atomic facts or claims from the response and assessing the veracity of each individual claim, offering the most detailed analysis.
Different metrics may rely on different levels to achieve the greatest accuracy. In the toolkit, we can choose the desired evaluation level by using postprocessors:
- Sentencizer, which breaks the LLM response into sentences. Note that some consecutive sentences may depend on each other, so we also include an option to decontextualize the sentences by asking an LLM to rephrase them (a minimal sentence-splitting sketch follows this list).
- Claim Extractor, which extracts the atomic facts from an LLM response.
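As an illustration of sentence-level postprocessing, the sketch below splits a response into sentences with NLTK. It is a simplified stand-in for the toolkit's Sentencizer and omits the optional LLM-based decontextualization step:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence-tokenizer models

def sentencize(response: str) -> list:
    """Split an LLM response into sentences for sentence-level evaluation."""
    return [s.strip() for s in sent_tokenize(response) if s.strip()]

print(sentencize("The tower opened in 1887. It is 330 metres tall."))
# ['The tower opened in 1887.', 'It is 330 metres tall.']
```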
Hallucination evaluation model
This model assesses the accuracy and efficacy of the hallucination correction models. It provides metrics and insights into how well the correction models perform. The metrics for evaluating hallucination are in active development, as the field of hallucination detection continues to evolve and improve.
HCMBench includes multiple hallucination metrics that can be used together to evaluate the responses edited by an HCM: HHEM, Minicheck, AXCEL, and FACTSJudge. Among them, HHEM and Minicheck are classifiers specifically fine-tuned for hallucination detection, while AXCEL and FACTSJudge are two different LLM-as-a-judge prompts where the base LLM can be swapped as needed.
In addition to hallucination metrics, we also include ROUGE as a similarity metric that compares each edited response with its corresponding original response. It serves as a sanity check to make sure that the edited response does not deviate too far from the original content.
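As an example, the ROUGE sanity check can be computed with the rouge-score package; the snippet below is a minimal illustration rather than the toolkit's exact implementation:

```python
from rouge_score import rouge_scorer

# Compare the corrected response (prediction) against the original LLM response (reference).
original = "The Eiffel Tower, finished in 1887, stands 330 metres tall."
corrected = "The Eiffel Tower, finished in 1889, stands 330 metres tall."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(original, corrected)  # score(target, prediction)
print(scores["rougeL"].fmeasure)  # close to 1.0 -> the edit stays close to the original wording
```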
Using the toolkit
Measure HCM quality
HCMs use contextual information to correct the hallucinations within an LLM's generated text. At a high level, HCMBench uses the same contextual information fed to the LLM to evaluate how well an HCM corrected those hallucinations. Users can tailor the evaluation by defining which HCM to use.
Example pipeline
The toolkit is designed for ease of use. Users can integrate their hallucination correction models, leverage the built-in evaluation model, and utilize the postprocessor to manage data efficiently. Here we show an example configuration for running an evaluation pipeline:
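The listing below is an illustrative sketch of such a configuration, reconstructed from the description that follows; the exact keys and processor names may differ from the toolkit's actual sample_run.yaml:

```yaml
# Hypothetical sketch of a run configuration; see sample_run.yaml in the repository for the real one.
output_path: output/Identical      # where the experiment results are written

eval_datasets:                     # datasets with "context" and "claim" (LLM response) columns
  - RAGTruth
  - FaithBench

pipeline:                          # processors run sequentially
  - IdenticalCorrectionModel       # HCM: writes the "corrected" column
  - ClaimExtractor                 # Postprocessor: writes the "extracted" column
  - ROUGE                          # similarity between "claim" and "corrected"
  - HHEM                           # claim-level factuality on "extracted"
  - Minicheck                      # claim-level factuality on "extracted"
  - FACTSJudge                     # response-level factuality on "corrected"
```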
Simply run ``python run.py sample_run.yaml`` and the toolkit will execute the pipeline as configured in the YAML file. The YAML configuration has three main sections:
- output_path: specifies the output location for the experiment run.
- eval_datasets: specifies the datasets to be used.
- pipeline: the list of processors (HCM, Postprocessor, and HEM) to be run sequentially.
In this sample configuration, we save the experimental results in the ``output/Identical`` folder and run the evaluation on the RAGTruth and FaithBench datasets. The two datasets are loaded with HuggingFace's datasets library and include two columns, “context” and “claim” (the LLM response).
Pipeline execution
The processors in the pipeline section run sequentially. Each processor receives samples from the dataset as input and writes its results as new columns in the dataset (a toy sketch of this column-adding pattern follows the list). In the sample configuration pipeline:
- Correct hallucinations: The specified HCM, IdenticalCorrectionModel, outputs a corrected version of the LLM’s response. This output is stored in the ``corrected`` column by default.
- Extract claims: Next, the specified Postprocessor, ClaimExtractor, derives a set of atomic facts from the corrected output and stores them in the ``extracted`` column. An atomic fact is a short sentence containing a single piece of information. ClaimExtractor uses Llama-3.3-70B to derive these facts.
- Measure similarity: We then measure the similarity between the corrected response and the original response by calculating ROUGE scores, using the ``claim`` column as references and the ``corrected`` column as predictions.
- Measure factuality with HHEM: An HHEM score is calculated at the claim level, based on the ``extracted`` column.
- Measure factuality with Minicheck: A Minicheck score is also calculated at the claim level, based on the ``extracted`` column.
- Measure factuality with FACTSJudge: Last, a FACTSJudge score is calculated at the response level, based on the ``corrected`` column. In this example, this step uses Llama-3.3-70B to judge each response.
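The toy sketch below mimics this column-adding pattern with a pandas DataFrame; the column names follow the defaults described above, while everything else is a hypothetical illustration of the data flow rather than the toolkit's internals:

```python
import pandas as pd

# One sample in the unified format: source context plus the (possibly hallucinated) LLM response.
df = pd.DataFrame({
    "context": ["The Eiffel Tower was completed in 1889 and is 330 metres tall."],
    "claim":   ["The Eiffel Tower, finished in 1887, stands 330 metres tall."],
})

# Each processor appends its output as a new column.
df["corrected"] = df["claim"]  # identity HCM: no edits
df["extracted"] = [[
    "The Eiffel Tower was finished in 1887.",
    "The Eiffel Tower is 330 metres tall.",
]]  # claim extractor: atomic facts from the corrected response

# Metric processors (ROUGE, HHEM, Minicheck, FACTSJudge) would append their
# score columns in the same way.
print(df.columns.tolist())  # ['context', 'claim', 'corrected', 'extracted']
```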
Note that the pipeline itself is highly configurable. Although Figure 1 and the sample configuration set the pipeline as HCM -> Postprocessor -> HEM, the processors may run in any order. For example, you could:
- Use a sentencizer first to break the original response into sentences, and then apply an HCM to edit the response sentence by sentence.
- Apply an HEM first to identify which samples are hallucinated, and then use an HCM to correct only those samples.

Finally, to visualize the experimental results, we provide a Jupyter notebook (display_results.ipynb) for averaging the metric scores across the dataset and aggregating the different metrics into overall scores (Figure 2).
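As a rough sketch of that aggregation (the notebook's actual logic may differ), averaging per-metric scores and taking a simple majority vote over binarized metrics could look like this; the column names, sample scores, and 0.5 threshold are all assumptions for illustration:

```python
import pandas as pd

# Hypothetical per-sample scores from three hallucination metrics.
results = pd.DataFrame({
    "hhem":       [0.92, 0.31, 0.88],
    "minicheck":  [0.85, 0.40, 0.91],
    "factsjudge": [1.00, 0.00, 1.00],
})

# Average each metric across the dataset.
print(results.mean())

# Majority vote: a response counts as factually consistent if at least
# two of the three metrics exceed the 0.5 threshold.
votes = (results > 0.5).sum(axis=1) >= 2
print(votes.mean())  # fraction of responses judged consistent by majority vote
```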
Vectara Hallucination Corrector on HCMBench

Leveraging the toolkit, we evaluated the performance of our own Vectara Hallucination Corrector (VHC) on HCMBench. We tested two variants of VHC: VHC-OnPrem, a small but powerful model that balances efficiency and accuracy for on-premise deployments, and VHC-SaaS, our strongest model, which provides the most accurate results and is available only in SaaS. In this evaluation, we used the following three metrics as HEMs:
- Vectara's HHEM-Open at the claim level
- Bespoke-MiniCheck-7B at the claim level
- FACTS Grounding LLM-as-a-judge with Llama-3.3-70B at the response level
In Figure 3, we report the majority-voting score of the three metrics above. Both variants of VHC improve the HEM scores over the “identity” baseline by a large margin.
Conclusion
HCMBench offers a robust and flexible toolkit for evaluating hallucination correction models, addressing a critical challenge in RAG systems.
By providing a standardized framework with diverse datasets, models, postprocessors, and evaluation metrics, it simplifies the assessment of HCM effectiveness. The toolkit's modular design allows for customizable pipelines and various evaluation levels, catering to different research and development needs. With ongoing contributions and updates, HCMBench aims to advance the field of hallucination correction and improve the reliability of LLM-generated content.
Explore HCMBench further by visiting our repository. We encourage you to contribute issues, submit pull requests, and help us enhance this toolkit. You can also connect with the broader RAG evaluation community through Open-RAG-Eval.
To experience the power of Hallucination Correction Models in action, sign up for Vectara here. We look forward to your involvement and collaboration.
