Open Evaluation
The Open Evaluation website makes it easier for you to spot the patterns within an Open RAG Eval evaluation report and take effective steps to improve your RAG system.
At Vectara, we believe that great search and retrieval systems deserve great evaluation tools—open, accessible, and trusted by the community. Evaluation shouldn't be locked behind heavy frameworks or fragile standards. It should be something everyone can use, understand, and build on together.
That's why we've invested so heavily in making evaluation open-source, lightweight, and useful for developers, researchers, and product teams alike. Our goal for Open RAG Eval is to create a toolkit that developers actually use, that non-developers can trust, and that drives real-world improvements.
Open RAG Eval was the First Step
When it comes to evaluating RAG systems, traditional methods fall short. They often rely on "golden answers"—perfect, hand-crafted reference answers—which are impractical, expensive, and fragile in fast-moving real-world applications.
To fix this, we launched Open RAG Eval—an open-source, lightweight, practical evaluation framework designed for the realities of modern RAG.
Open RAG Eval brings several key innovations:
- No need for golden answers: Instead of relying on fragile gold sets, we use high-quality human judgments and LLM-based assessments of helpfulness and relevance.
- Lightweight and developer-friendly: Easy to install, easy to run. Built to be incorporated into day-to-day workflows.
- Open and community-centric: We believe evaluation should be a shared yardstick—something the entire community can use, trust, and improve together. Open RAG Eval is free, open, and available to all.
Making Evaluation Reports More Accessible
But we’re not stopping there. Great evaluation isn't just about running a script and getting numbers—it’s about understanding the reports, spotting opportunities, and taking action.
That's why we're excited to launch the Open Evaluation website.
We designed Open Evaluation to surface the patterns and insights represented within a report’s metrics. After generating reports with Open RAG Eval, you can load a report into Open Evaluation and use the UI to compare metrics across the queries within a report. You can also load multiple reports and compare metrics across them. By analyzing how your RAG configurations map to changes in metrics across reports, you can zero in on the tweaks that will optimize your RAG system.
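If you prefer a quick comparison outside the UI, you can also diff two reports programmatically. The sketch below assumes each report has been exported as a CSV with one row per query and one column per metric; the file names and column names are placeholders, not the actual Open RAG Eval report schema.

```python
import pandas as pd

# Hypothetical files and column names ("query", "umbrela_mean", "hhem");
# adapt them to the actual Open RAG Eval report schema.
baseline = pd.read_csv("report_baseline.csv")
candidate = pd.read_csv("report_new_chunking.csv")

# Line up the two reports query by query.
merged = baseline.merge(candidate, on="query", suffixes=("_base", "_new"))

# Per-query deltas show which configuration change helped or hurt.
for metric in ("umbrela_mean", "hhem"):
    merged[f"{metric}_delta"] = merged[f"{metric}_new"] - merged[f"{metric}_base"]

print(merged[["query", "umbrela_mean_delta", "hhem_delta"]]
      .sort_values("umbrela_mean_delta"))
```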
If you want to try it out, you can head over to Open Evaluation and explore our sample report. Now let’s take a tour through the features!
Open Evaluation Features
Metric Comparison and Explanation
Open Evaluation translates Open RAG Eval’s retrieval and generation metrics into understandable terms and formats them for easy comparison and analysis. The UI lets you review these metrics at a high level, comparing them across reports, and drill down into specific queries to understand why a query scored well or poorly on a given metric. You can also click on a metric value to read about what that metric means and how it was generated. Read more about the various metrics below.

Relevance Metric
Relevance score measures whether the retrieved search results are relevant to the query. Open RAG Eval uses the UMBRELA evaluation method to determine an UMBRELA score for each search result. These scores range from 0 (worst) to 3 (best). Having even one search result with a 3 is great because it means that your retrieval system was able to find a result that contains the information needed to precisely answer the query.
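To make that concrete, here is a small illustrative sketch of how you might summarize per-result UMBRELA scores for a single query; the scores are made up and the calculation is not part of Open RAG Eval itself.

```python
# Hypothetical per-result UMBRELA scores (0 = worst, 3 = best) for one query.
umbrela_scores = [3, 1, 0, 2, 1]

best = max(umbrela_scores)
mean = sum(umbrela_scores) / len(umbrela_scores)

# A single 3 means at least one retrieved result can precisely answer the query.
print(f"best={best}, mean={mean:.2f}, has_precise_answer={best == 3}")
```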

Groundedness and Factuality Metrics
Groundedness score measures how strongly the generated answer reflects the relevant facts contained in the search results. Here’s how this works:
- Open RAG Eval analyzes the search results and extracts factual statements, known as “nuggets”.
- Each nugget is labeled as a “relevant fact” or “irrelevant fact”.
- Lastly, it labels each nugget as Not reflected (0), Partially reflected (0.5), or Reflected (1) in the generated answer.
- The recommended way to turn this data into a score is to average the numerical labels of all the relevant facts (see the sketch below this list), but you can also look at scores derived using other standard methods.
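Here is a minimal sketch of that recommended aggregation, assuming you already have the nugget labels from a report; the data structure is illustrative, not the actual report schema.

```python
# Each nugget: (is_relevant, reflection) where reflection is 0, 0.5, or 1.
# This structure is illustrative; the real labels come from the Open RAG Eval report.
nuggets = [
    (True, 1.0),   # relevant fact, fully reflected in the answer
    (True, 0.5),   # relevant fact, partially reflected
    (True, 0.0),   # relevant fact, not reflected
    (False, 1.0),  # irrelevant fact, excluded from the recommended score
]

relevant = [reflection for is_relevant, reflection in nuggets if is_relevant]
groundedness = sum(relevant) / len(relevant) if relevant else 0.0
print(f"Groundedness: {groundedness:.2f}")  # 0.50
```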
Factuality score measures how hallucination-free the generated answer is. Open RAG Eval uses Vectara's Hughes Hallucination Evaluation Model (HHEM) to produce this score, ranging from 0 (fully hallucinated) to 1 (fully factual).
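HHEM is also openly available. The sketch below assumes the open HHEM checkpoint on Hugging Face and the predict() helper it exposes via trust_remote_code; check the model card for the exact interface before relying on it.

```python
from transformers import AutoModelForSequenceClassification

# Assumes the open HHEM checkpoint; the exact entry point may differ, so
# verify against the model card on Hugging Face.
model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (evidence, generated answer); scores near 1 mean factual,
# scores near 0 mean hallucinated.
pairs = [
    ("The sky was overcast all day in Boston.", "It was sunny in Boston."),
    ("The sky was overcast all day in Boston.", "It was cloudy in Boston."),
]
print(model.predict(pairs))
```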

Citations
Citations score measures the strength of the relationship between the statements contained in the generated answer and the search results they cite, so you can see how well each cited result actually supports the statement that cites it.
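As a rough illustration, you could aggregate per-citation support into a single score. The 0/0.5/1 labels below are placeholders for "unsupported", "partially supported", and "fully supported", not necessarily the exact scale the report uses.

```python
# Hypothetical support labels for each (statement, cited result) pair:
# 1.0 = fully supported, 0.5 = partially supported, 0.0 = unsupported.
# These values are illustrative; read the actual labels from the report.
citation_support = [1.0, 0.5, 0.5, 0.0]

citations_score = sum(citation_support) / len(citation_support)
print(f"Citations: {citations_score:.2f}")  # 0.50
```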

Our Commitment to the Community
Open Evaluation and Open RAG Eval represent our long-term commitment to building a transparent and tractable RAG ecosystem. We believe in enabling RAG evaluation to be:
- Actionable, to facilitate rapid iteration cycles.
- Continuous, in order to produce better and ever-improving AI systems.
- Collaborative, because as AI becomes woven into our organizations, evaluation will require input from everyone within them.
We’re just getting started.
If you care about making RAG better, come join the discussion.