
Open Evaluation

The Open Evaluation website makes it easier for you to spot the patterns within an Open RAG Eval evaluation report and take effective steps to improve your RAG system.


At Vectara, we believe that great search and retrieval systems deserve great evaluation tools—open, accessible, and trusted by the community. Evaluation shouldn't be locked behind heavy frameworks or fragile standards. It should be something everyone can use, understand, and build on together.

That's why we've invested so heavily in making evaluation open-source, lightweight, and useful for developers, researchers, and product teams alike. Our goal for Open RAG Eval is to create a toolkit that developers actually use, that non-developers can trust, and that drives real-world improvements.

Open RAG Eval was the First Step

When it comes to evaluating RAG systems, traditional methods fall short. They often rely on "golden answers"—perfect, hand-crafted reference answers—which are impractical, expensive, and fragile in fast-moving real-world applications.

To fix this, we launched Open RAG Eval—an open-source, lightweight, practical evaluation framework designed for the realities of modern RAG.

Open RAG Eval brings several key innovations:

  • No need for golden answers: Instead of relying on fragile gold sets, we use high-quality human judgments and LLM-based assessments of helpfulness and relevance.
  • Lightweight and developer-friendly: Easy to install, easy to run. Built to be incorporated into day-to-day workflows.
  • Open and community-centric: We believe evaluation should be a shared yardstick—something the entire community can use, trust, and improve together. Open RAG Eval is free, open, and available to all.

Making Evaluation Reports More Accessible

But we’re not stopping there. Great evaluation isn't just about running a script and getting numbers—it’s about understanding the reports, spotting opportunities, and taking action.

That's why we're excited to launch the Open Evaluation website.

We designed Open Evaluation to surface the patterns and insights represented within a report’s metrics. After generating reports with Open RAG Eval, you can load a report into Open Evaluation and use the UI to compare metrics across the queries within a report. You can also load multiple reports to compare metrics across reports. By analyzing how your RAG configurations map to changes in metrics across reports, you can zero in on the tweaks that will optimize your RAG system.

If you want to try it out, you can head over to Open Evaluation and explore our sample report. Now let’s take a tour through the features!

Open Evaluation Features

Metric Comparison and Explanation

Open Evaluation translates Open RAG Eval’s retrieval and generation metrics into understandable terms and formats them for easy comparison and analysis. The UI lets you review these metrics at a high level and compare them across reports, and it also lets you drill down into specific queries and metrics to understand why a query scored well or poorly on a given metric. You can also click on metric values to read about what each metric means and how it was generated. Read more about the various metrics below.

A screenshot of the Open Evaluation UI. You can add as many evaluation reports as you want and compare their performance.

Relevance Metric

Relevance score measures whether the retrieved search results are relevant to the query. Open RAG Eval uses the UMBRELA evaluation method to determine an UMBRELA score for each search result. These scores range from 0 (worst) to 3 (best). Having even one search result with a 3 is great because it means that your retrieval system was able to find a result that contains the information needed to precisely answer the query.

  • Irrelevant (0): The search result has nothing to do with the query.
  • Partially relevant (1): The search result seems related to the query, but it doesn't answer it.
  • Partial answer (2): The search result addresses the query but doesn't entirely answer it, or it isn't a clear answer.
  • Complete answer (3): The search result precisely answers the query.
A screenshot of the UI that demonstrates the relevance score, which is the highest UMBRELA score of those calculated. The interface also provides the citation number used by the generated answer, so you can cross-reference various parts of the generated answer with the search results.
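
To make the aggregation concrete, here is a minimal sketch in plain Python (illustrative only, not the Open RAG Eval API; the function name and inputs are our own) of how a query-level relevance score follows from per-result UMBRELA scores: it is simply the highest score among the retrieved results.

```python
from typing import List


def relevance_score(umbrela_scores: List[int]) -> int:
    """Query-level relevance: the best UMBRELA score (0-3) among the retrieved results."""
    if not umbrela_scores:
        return 0  # nothing was retrieved, so nothing can answer the query
    return max(umbrela_scores)


# Five retrieved results judged by UMBRELA; one of them fully answers the query.
print(relevance_score([1, 0, 3, 2, 1]))  # -> 3
```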

Groundedness and Factuality Metrics

Groundedness score measures how strongly the generated answer reflects the relevant facts contained in the search results. Here’s how this works:

  1. Open RAG Eval analyzes the search results and extracts factual statements, known as “nuggets”.
  2. Each nugget is labeled as a “relevant fact” or “irrelevant fact”.
  3. It then labels each nugget as Not reflected (0), Partially reflected (0.5), or Reflected (1) in the generated answer.
  4. The recommended way to arrive at a score from this data is to average the numerical reflection labels of all relevant facts (as sketched below), but you can also look at scores derived using other standard methods.
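
Here is a minimal sketch in plain Python (again illustrative, not the Open RAG Eval API) of that recommended aggregation: average the reflection labels of the relevant nuggets only, ignoring irrelevant ones.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Nugget:
    text: str
    relevant: bool      # labeled as a "relevant fact" vs. an "irrelevant fact"
    reflection: float   # 0.0 = not reflected, 0.5 = partially reflected, 1.0 = reflected


def groundedness_score(nuggets: List[Nugget]) -> float:
    """Average the reflection labels of the relevant nuggets."""
    relevant = [n for n in nuggets if n.relevant]
    if not relevant:
        return 0.0  # no relevant facts were extracted from the search results
    return sum(n.reflection for n in relevant) / len(relevant)


nuggets = [
    Nugget("The warranty lasts two years.", relevant=True, reflection=1.0),
    Nugget("Returns require a receipt.", relevant=True, reflection=0.5),
    Nugget("The store opened in 1998.", relevant=False, reflection=0.0),
]
print(groundedness_score(nuggets))  # -> 0.75 (the irrelevant nugget is ignored)
```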

Factuality score measures how hallucination-free the generated answer is. Open RAG Eval uses Vectara's Hughes Hallucination Evaluation Model (HHEM) to produce this score, which ranges from 0 (fully hallucinated) to 1 (fully factual).
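
If you want to reproduce a factuality-style score on your own, an open HHEM checkpoint is available on Hugging Face. The sketch below follows the usage shown on that model card; Open RAG Eval's internal invocation may differ, and the example pair is made up.

```python
# Sketch: score a generated answer against its source passage with the open
# HHEM checkpoint (usage per the Hugging Face model card; details may differ
# from what Open RAG Eval does internally).
from transformers import AutoModelForSequenceClassification

# Each pair is (evidence, generated answer). Scores near 1.0 mean the answer
# is well supported by the evidence; scores near 0.0 suggest hallucination.
pairs = [
    (
        "The warranty covers manufacturing defects for two years from the date of purchase.",
        "Manufacturing defects are covered for two years after you buy the product.",
    ),
]

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)
print(model.predict(pairs))  # consistency scores in [0, 1]
```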

A screenshot of the UI that demonstrates the generated answer, the groundedness score, which is an average of the relevant nuggets’ scores, and the factuality score as calculated by HHEM. You can click on the inline citation numbers within the generated answer to view the referenced search results. In the nuggets section, you can select a different nuggetizer algorithm to calculate the groundedness score with an alternate method.

Citations

Citations score measures the strength of the relationship between the statements contained in the generated answer and the search results they cite. Each citation receives one of these scores:

  • No relationship (0): No connection between the cited search result and the generated statement that references it.
  • Partial relationship (0.5): There is a connection, but it's not strong.
  • Strong relationship (1): The generated statement is clearly justified by the cited search result.
A screenshot of the UI that demonstrates the citations score. The score is an average of the scores for each citation. You can click the citation number to view the referenced search result.
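
For completeness, here is the same kind of illustrative sketch (plain Python, not the Open RAG Eval API) for the citations score: map each citation's judged relationship to 0, 0.5, or 1 and average across all citations in the answer.

```python
from typing import List

RELATIONSHIP_VALUES = {
    "none": 0.0,     # no relationship
    "partial": 0.5,  # partial relationship
    "strong": 1.0,   # strong relationship
}


def citations_score(relationships: List[str]) -> float:
    """Average the per-citation relationship scores for a generated answer."""
    if not relationships:
        return 0.0  # the answer cites nothing
    return sum(RELATIONSHIP_VALUES[r] for r in relationships) / len(relationships)


print(citations_score(["strong", "partial", "strong"]))  # -> ~0.83
```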

Our Commitment to the Community

Open Evaluation and Open RAG Eval represent our long-term commitment to building a transparent and tractable RAG ecosystem. We believe RAG evaluation should be:

  • Actionable, to facilitate rapid iteration cycles.
  • Continuous, in order to produce better and ever-improving AI systems.
  • Collaborative, because as AI becomes woven into our organizations, evaluation will require input from everyone within an organization.

We’re just getting started.

If you care about making RAG better, come join the discussion.


