Introducing the Next Generation of Vectara's Hallucination Leaderboard
Vectara’s Hallucination Leaderboard has been a critical tool for tracking and comparing the factuality of large language models. But as AI models evolve, so must the way we measure them. We're excited to introduce a new, more granular hallucination leaderboard.
12-minute read time
Introduction
Since its release nearly two years ago, the Vectara Hallucination Leaderboard has served as the de facto resource for the AI community to understand the extent to which various Large Language Models (LLMs) hallucinate.
By providing a standardized benchmark, the leaderboard has enabled developers, researchers, and enterprises to assess and compare the propensity of various LLMs to "hallucinate", or generate factually incorrect information, in RAG and Agentic RAG use cases such as conversational AI or document generation.
The leaderboard has been instrumental in driving progress towards more reliable and trustworthy AI, and we’ve been very happy to see frontier models improve their ranking on the leaderboard over time. With the increased adoption of AI Agents across all industries (financial services, healthcare, insurance, manufacturing, etc.), hallucination detection and correction are more important than ever.
While the original leaderboard and its underlying dataset have been crucial in establishing the need for measuring hallucination, the landscape of generative AI is constantly shifting: models continue to evolve and perform better overall, while new capabilities, such as reasoning and longer context windows, are released at a rapid pace. As a result, over time we observed models clustering at the top of the current leaderboard, diminishing the leaderboard’s ability to separate LLMs by hallucination performance. In addition, academic research has expanded the landscape of hallucination evaluation, producing new benchmarks and datasets, including FaithBench, RAGTruth, TofuEval, and many others.
To keep pace with this multi-faceted evolution and to continue to provide meaningful - and more granular - insights into the propensity of LLMs to hallucinate, we recognized the need to refresh our leaderboard with a new, richer dataset and an updated evaluation methodology. The original dataset, while effective in its time, has served its purpose. It’s time to introduce a new, more rigorous benchmark that reflects current state-of-the-art LLMs and anticipates the challenges of tomorrow.
We are thrilled to announce the launch of the new Vectara Hallucination Leaderboard, powered by a dataset that is larger, more robust, and significantly more challenging. This new benchmark is designed to push the limits of even the most advanced models like GPT-5, Gemini-2.5-Pro, Claude Sonnet 4.5, and Grok-4, providing a more granular and accurate picture of their factual consistency in complex enterprise applications.
We believe this will not only provide a more accurate ranking of the likelihood of current frontier models to hallucinate, but also spur the development of more reliable and hallucination-resistant LLMs, ultimately benefiting the entire AI ecosystem.
Can you guess which five LLMs have the lowest hallucination rate? Let’s dive in to see the results.
How Does Vectara’s Hallucination Leaderboard Work?
The leaderboard defines a (non-public) dataset of articles (segments of text intended for summarization). Evaluating an LLM on the leaderboard follows these 3 simple steps:
- Provide the LLM with the text of each article, using the leaderboard’s prompt, which instructs the LLM to summarize the article.
- Use Vectara’s commercial-strength Hallucination Detection Model (HHEM) to evaluate hallucination in the generated summary. The HHEM score ranges from 0 to 1, and any value below 0.5 is considered a hallucination.
- Measure the hallucination rate, namely the percentage of articles for which the summary is deemed hallucinated by HHEM.
This is illustrated in Figure 1 below:

Figure 1: How the leaderboard works: (1) summarize articles, (2) evaluate each summary with Vectara’s commercial HHEM, (3) compute the hallucination rate.
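For readers who prefer code, here is a minimal sketch of the per-article scoring step. The `summarize_article` and `hhem_score` functions are hypothetical placeholders, standing in for the LLM under evaluation and for HHEM respectively; only the 0.5 threshold is taken directly from the leaderboard’s methodology.

```python
from typing import Callable

HALLUCINATION_THRESHOLD = 0.5  # HHEM scores below this are counted as hallucinations


def score_article(article: str,
                  summarize_article: Callable[[str], str],
                  hhem_score: Callable[[str, str], float]) -> dict:
    """Steps 1-2 of the evaluation for a single article.

    `summarize_article` and `hhem_score` are hypothetical stand-ins for the
    LLM being evaluated and for the HHEM factual-consistency model.
    """
    summary = summarize_article(article)   # step 1: LLM summarizes using the leaderboard prompt
    score = hhem_score(article, summary)   # step 2: HHEM score in [0, 1]
    return {
        "summary": summary,
        "hhem_score": score,
        "hallucinated": score < HALLUCINATION_THRESHOLD,  # per-article verdict used in step 3
    }
```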
In addition to the hallucination rate, the leaderboard also measures the following:
- Answer rate: an LLM can respond with a refusal to summarize, and in such a case we do not count that answer against its hallucination rate. The answer rate reflects the percentage of articles that were properly summarized - a low value here reflects an LLM with a high refusal rate.
- Average Summary Length (words): the average response length in words across all articles in the dataset, which sometimes provides helpful additional context.
We calculate the hallucination rate only on articles that the LLM successfully summarizes. This is why the answer rate provides essential context. For example, an LLM might refuse to summarize difficult articles to avoid a high hallucination rate, but its answer rate would drop as a result.
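Continuing the sketch above, the headline metrics can then be aggregated from per-article records. The record schema (the `refused`, `hallucinated`, and `summary` fields) is our own bookkeeping; the key point, taken from the methodology, is that refused articles are excluded from the hallucination-rate denominator.

```python
def compute_leaderboard_metrics(records: list[dict]) -> dict:
    """Aggregate per-article records into the three leaderboard metrics.

    Each record is assumed to carry `refused` (bool), `hallucinated` (bool),
    and `summary` (str) fields; this schema is illustrative, not the
    leaderboard's internal format.
    """
    answered = [r for r in records if not r["refused"]]
    answer_rate = len(answered) / len(records)
    # Refused articles are excluded from the hallucination-rate denominator.
    hallucination_rate = sum(r["hallucinated"] for r in answered) / len(answered)
    avg_summary_len = sum(len(r["summary"].split()) for r in answered) / len(answered)
    return {
        "answer_rate": answer_rate,
        "hallucination_rate": hallucination_rate,
        "avg_summary_length_words": avg_summary_len,
    }
```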
The evaluation process uses a specific prompt that instructs the LLM to summarize the passage. With the new leaderboard, we’ve updated the prompt as follows:
💡 Leaderboard Prompt
Your task is to provide a concise and factual summary for the given passage.
Rules:
1. Summarize using only the information in the given passage. Do not infer. Do not use your internal knowledge.
2. Do not provide a preamble or explanation, output only the summary.
3. Summaries should never exceed 20 percent of the original text's length.
4. Maintain the tone of the passage.
If you are unable to summarize the text due to missing, unreadable, irrelevant or insufficient content, respond only with: "I am unable to summarize this text."
Here is the passage:
{passage}
This new prompt, which differs slightly from the one used in the original leaderboard, is designed to give each LLM more precise instructions for the summarization task.
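As a small illustration of how the prompt is applied, the sketch below fills the template with an article and flags refusals via the sentinel string defined in the prompt. The `call_llm` wrapper is a hypothetical placeholder for whichever model API is being evaluated, and detecting refusals by exact string match is our own simplifying assumption.

```python
LEADERBOARD_PROMPT = """Your task is to provide a concise and factual summary for the given passage.
Rules:
1. Summarize using only the information in the given passage. Do not infer. Do not use your internal knowledge.
2. Do not provide a preamble or explanation, output only the summary.
3. Summaries should never exceed 20 percent of the original text's length.
4. Maintain the tone of the passage.
If you are unable to summarize the text due to missing, unreadable, irrelevant or insufficient content, respond only with: "I am unable to summarize this text."
Here is the passage:
{passage}"""

REFUSAL_SENTINEL = "I am unable to summarize this text."


def summarize_with_leaderboard_prompt(article: str, call_llm) -> dict:
    """Fill the prompt template and flag refusals.

    `call_llm` is a hypothetical wrapper around the model API being evaluated.
    """
    response = call_llm(LEADERBOARD_PROMPT.format(passage=article)).strip()
    return {"summary": response, "refused": response == REFUSAL_SENTINEL}
```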
What’s In The New Dataset?
At the heart of our new leaderboard is a meticulously curated dataset designed to reflect the complexity and diversity of information sources encountered in the real world across consumer and enterprise applications of generative AI.
First, to enhance the benchmark's statistical power and depth, we have expanded its scale from 1,000 to over 7,700 unique articles. This substantial increase in volume ensures that our evaluations are not only rigorous but also highly reliable, minimizing the likelihood that a model's score reflects chance or a narrow set of evaluation criteria.
Second, we've significantly increased the difficulty of the task by including articles that are longer (up to 32K tokens), as you can see in Figure 2. This is an important change, as it tests a model's ability to maintain factual consistency over the extended contexts that are now commonplace in the most recent LLMs and are increasingly a requirement for enterprise-grade RAG and Agentic applications.

Figure 2: distribution of article size in words. The x-axis shows the number of words and the y-axis the number of articles.
On top of that, the dataset incorporates a deliberate mix of both low and high complexity text, allowing us to evaluate how models perform when summarizing simple facts versus comprehending nuanced, complex information.
For example, here is text we consider straightforward (taken from a news article):
When I had my first panic attack at the age of 19, I believed with absolute certainty that I was in mortal danger. I lie in my dorm room bed for what felt like hours, clutching my pounding heart and gasping for air. Fifteen interminable minutes later, it was as if it had never happened, and I felt relatively normal — but that wouldn't be the last incident. I went on to have many more panic attacks, and have since been diagnosed with a panic disorder (PD). I'm among two to three percent of Americans with PD; while 18.1 percent of Americans have anxiety disorders in general — the most common mental illness. Since that day, I’ve treated my condition both with therapy and medication. Despite managing my PD, I do still suffer the occasional panic attack, but with professional guidance (a must), I’ve learned that there are simple things I can do to stop a panic attack in its tracks. I talked with mental health professionals to discuss why my own techniques work and what more those of us living with panic attacks can do. A panic attack comes out of nowhere and is not an anxiety attack Though we tend to use the terms “panic attack” and “anxiety attack” interchangeably, it’s worth noting that professionally speaking (and when referencing the Diagnostic and Statistical Manual of Mental Disorders, aka, DSM–5), there’s actually no such thing as an anxiety attack, per se. “Anxiety is an excessive persistent worrying over an imminent event that can last a while. A panic attack is a burst of intense fear that typically lasts fewer than 30 minutes,” Dr. Carolyn Rodriguez, assistant professor of psychiatry and behavioral sciences at Stanford tells NBC News BETTER.
In contrast, a more challenging text is illustrated by this excerpt (taken from the technical reference manual of an electronic device):
The state of the device after boot is determined by sampling the input states of the BTMODE[15:0] pins when device reset (POR or RESET) is de-asserted. The sampled values are latched into the CONTROL_STATUS register, which is part of the Control Module. The BTMODE[15:11] values determine GPMC CS0 Default Data Bus Width, Wait Enable, and Address/Data Multiplexing. For additional details on BTMODE[15:11] pin functions, see Table 4-1, Boot Configuration Terminal Functions.
The BTMODE[4:0] values determine the boot mode order according to Table 5-1, Boot Mode Order. The 1st boot mode listed for each BTMODE[4:0] configuration is executed as the primary boot mode. If the primary boot mode fails, the 2nd, 3rd, and 4th boot modes are executed in that order until a successful boot is completed.
The BTMODE[7:5] pins are RESERVED and should be pulled down as indicated in Table 4-1, Boot Configuration Terminal Functions.
When the EMAC bootmode is selected (see Table 5-1), the sampled value from BTMODE[9:8] pins are used to determine the Ethernet PHY Mode selection (see Table 5-7).
When the XIP (MUX0), XIP (MUX1), XIP w/ WAiT (MUX0) or XIP w/ WAiT (MUX1) bootmode is selected (see Table 5-1), the sampled value from BTMODE[10] pin is used to select between GPMC pin muxing options shown in Table 5-2, XIP (on GPMC) Boot Options [Muxed or Non-Muxed].
For more detailed information on booting the device, see the ROM Code Memory and Peripheral Booting chapter of the TMS320DM814x DaVinci Digital Media Processors Technical Reference Manual
In our dataset we have a roughly even split between low and high complexity articles, with 3,792 low complexity and 3,939 high complexity.
Finally, a key pillar of this new dataset is its breadth of coverage across different domains. We have introduced a range of distinct categories including technology, stocks, sports, science, politics, medicine, law, finance, education, and business, as shown in Figure 3. This diversity is a deliberate stress test for regulated industries: factual accuracy in a general news summary is important, but in a healthcare or legal context it becomes non-negotiable.

Figure 3: distribution of number of articles per category
By evaluating LLMs across these specialized fields, we can provide far more granular insights into their likelihood to hallucinate in each field, in addition to their overall hallucination rate.
Ultimately, the value of this new dataset lies in its realism and rigor. It provides a more comprehensive benchmark that pushes the state-of-the-art and prevents models from "overfitting" to a single style of evaluation.
For enterprises and end-users, this new dataset delivers a more trustworthy and nuanced assessment of which model is best suited for their specific needs.
Leaderboard Results
We re-evaluated a large set of LLMs from the previous leaderboard against this new benchmark (leaving out those LLMs that we believe are no longer in broad use).
When we look at the top LLMs in the new leaderboard (Figure 4), we observe that the new hallucination rates are much higher than before. This is consistent with our expectations and the reason for launching this new leaderboard in the first place: the dataset is fresh, larger, and includes longer and more challenging articles.

Figure 4: Hallucination rates for top-25 LLMs with the new dataset. Note that here we included only LLMs with answer rates >= 95%.
Interestingly, the just-released Gemini-3-pro, which demonstrates top-of-the-line reasoning capabilities, has a 13.6% hallucination rate and didn’t even make the top-25 list.
Other notable thinking models like Claude Sonnet 4.5, GPT-5, GPT-OSS-120B, Grok-4, and Deepseek-R1 all have hallucination rates above 10%.
On the flip side, Gemini-2.5-flash-lite leads with a 3.3% hallucination rate, with Mistral-large, Deepseek-v3.2, and IBM Granite-4 not far behind.
The top-10 large models (more than 32B parameters) with the lowest hallucination rates are shown in Figure 5, and the top-10 small models (up to 32B parameters) are shown in Figure 6:

Figure 5: Hallucination rates for top-10 LLMs with more than 32B parameters (LLMs)

Figure 6: Hallucination rates for the top-10 LLMs with less than 32B parameters (SLMs)
Similarly, if we look at commercial vs. open-weights LLMs, the top-10 in each category are shown in Figures 7 and 8, respectively:

Figure 7: Hallucination rates for the top-10 commercial LLMs

Figure 8: Hallucination rates for the top-10 open-weights LLMs
It’s interesting to see the results for articles from various categories - as expected, not all LLMs perform the same way, but on average we see consistent patterns of difference between categories. Figure 9 shows the average hallucination rate per category for selected LLMs.

Figure 9: Hallucination rate by article category and LLM
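A per-category breakdown like the one in Figure 9 can be produced with a simple group-by over the per-article records; the column names below are illustrative assumptions rather than the leaderboard's internal schema.

```python
import pandas as pd


def hallucination_rate_by_category(results: pd.DataFrame) -> pd.DataFrame:
    """Per-model, per-category hallucination rate, in percent.

    `results` is assumed to hold one row per (model, article) with columns
    `model`, `category`, `refused` (bool), and `hallucinated` (bool); this
    schema is illustrative, not the leaderboard's internal format.
    """
    answered = results[~results["refused"]]
    return (
        answered.groupby(["model", "category"])["hallucinated"]
        .mean()                      # fraction of answered articles flagged by HHEM
        .unstack("category") * 100   # rows: models, columns: categories
    )
```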
If we look at the hallucination rate by type of article (low vs. high complexity; Figure 10), we see that hallucination rates for high complexity articles are consistently higher than those for low complexity articles, which is consistent with our expectations.

Figure 10: Hallucination rate by LLM for low vs high complexity articles.
Finally, looking at article length, we split all the articles into 10 bins by length and, as expected, observe the general trend that hallucination rates are higher for longer articles.

Figure 11: Hallucination rate by length of article
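The length analysis behind Figure 11 follows the same pattern. Since the exact binning isn't specified, the sketch below makes two assumptions of its own: article length is approximated by word count, and the 10 bins are equal-population bins produced by pd.qcut.

```python
import pandas as pd


def hallucination_rate_by_length(results: pd.DataFrame, n_bins: int = 10) -> pd.Series:
    """Average hallucination rate per article-length bin, in percent.

    `results` is assumed to have `article` (str), `refused` (bool), and
    `hallucinated` (bool) columns; word-count length and equal-population
    binning are our own choices, not the leaderboard's documented method.
    """
    answered = results[~results["refused"]].copy()
    answered["length_words"] = answered["article"].str.split().str.len()
    # qcut puts roughly the same number of articles in each of the n_bins bins.
    answered["length_bin"] = pd.qcut(answered["length_words"], q=n_bins)
    return answered.groupby("length_bin", observed=True)["hallucinated"].mean() * 100
```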
Conclusion
We are excited to announce the launch of our next generation RAG and Agentic Hallucination Leaderboard, enabling the generative AI ecosystem to better capture how modern large language models (LLMs) handle factual accuracy.
The original leaderboard, released two years ago, became a key benchmark for measuring hallucination rates in generative AI. The updated leaderboard forms a new baseline: it is based on a much larger dataset that better reflects real enterprise data, and includes over 7,700 articles spanning diverse domains like law, medicine, finance, education and technology. It’s designed to test factual consistency across longer and more complex text, using Vectara’s Hallucination Detection Model (HHEM) to automatically score summaries produced by each LLM being evaluated.
Early results show that hallucination rates are generally higher under the new benchmark, confirming its tougher standards and modern relevance. The new benchmark also enables better separation among top-performing LLMs and offers deeper insight into their reliability across real-world scenarios, ultimately helping developers and enterprises select models that are both capable and trustworthy.
If you need to reduce hallucinations in your RAG or agentic applications, HHEM or VHC (Vectara Hallucination Corrector) - both standard API calls in Vectara - may be the solution you are looking for. Please contact us; we would be happy to help.




