Open RAG Benchmark: A New Frontier for Multimodal PDF Understanding in RAG
Introducing the Open RAG Benchmark: A revolutionary, multimodal dataset built from arXiv PDFs to elevate RAG system evaluation with real-world text, table, and image understanding.
RAG system evaluation is often limited to measuring how well generated text is grounded in the retrieved textual information. However, this approach overlooks a significant challenge: much of the world's information, particularly in business and research, is stored in PDFs that convey information through a mix of text, tables, and images. Existing RAG evaluation tools frequently fall short in assessing a system's ability to comprehend and synthesize information from these varied modalities.
To address this gap, we have developed the Open RAG Benchmark, a new dataset for evaluating RAG systems. The benchmark is designed to measure a system's proficiency in processing and integrating multimodal information, and it complements our Open RAG Evaluation framework. What distinguishes the Open RAG Benchmark is its construction from a diverse collection of arXiv PDF documents: we have moved beyond simple text extraction to generate queries that specifically target the content within text, tables, and images. This allows for a more comprehensive assessment of a RAG system's ability to understand and reason across the different data types interwoven within a single document.
We have made the dataset freely available on Hugging Face, and the details of its creation are on GitHub.
A Glimpse at the Dataset
Our dataset follows the format of the BEIR benchmark. This initial release offers a robust testing ground:
- 1000 PDF papers carefully selected and evenly distributed across all arXiv categories. This includes a mix of "positive" documents (each serving as the golden source for certain queries) and "hard negative" documents (completely irrelevant, designed to test a system's ability to filter out noise).
- Multimodal Content: we've successfully extracted text, tables, and images, capturing the rich diversity of research papers.
- 3000+ Question-Answer Pairs: These pairs are thoughtfully categorized by:
  - Query Type: including both abstractive queries (requiring understanding and synthesis to generate a summary or rephrased response) and extractive queries (seeking concise, fact-based answers directly from the text).
  - Generation Source: distinguishing queries that rely purely on text from those that require understanding text alongside images, tables, or a combination of all three.
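To make the BEIR-style layout concrete, here is a minimal sketch of reading such a dataset in Python. The file names (`corpus.jsonl`, `queries.jsonl`, `qrels/test.tsv`) and field names follow the standard BEIR convention and are assumptions here; check the Hugging Face dataset card for the exact files and fields in this release.

```python
import csv
import json
from collections import defaultdict

def load_jsonl(path):
    """Load a file with one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Corpus entries carry a document id plus the extracted content (text sections,
# Markdown tables, encoded images); queries carry a query id and the question text.
corpus = {doc["_id"]: doc for doc in load_jsonl("corpus.jsonl")}
queries = {q["_id"]: q["text"] for q in load_jsonl("queries.jsonl")}

# Qrels map each query id to the corpus ids that answer it.
qrels = defaultdict(dict)
with open("qrels/test.tsv", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        qrels[row["query-id"]][row["corpus-id"]] = int(row["score"])

print(f"{len(corpus)} documents, {len(queries)} queries, {len(qrels)} judged queries")
```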
Features
Traditional RAG evaluations often fall short when confronted with complex real-world documents like PDFs. Most evaluation tools are designed to work only with extracted text, rendering them blind to the crucial information embedded in tables, charts, and images. This leads to flawed assessments: a tool might fail to verify a correct answer sourced from a table or be unable to detect a hallucination related to an image. This inability to account for non-textual data means that a system's true understanding of the document remains untested. The Open RAG Benchmark aims to help practitioners address this challenge by offering:
- Comprehensive Multimodal Understanding: our corpus is derived exclusively from PDF documents, encompassing a wide spectrum of text, tabular, and visual information. This richness, combined with our seamless integration with the Open RAG Evaluation framework, enables a deeper, more nuanced assessment of RAG system performance. This translates directly to improved table and image understanding, benefiting verticals like Legal, Healthcare, and Finance, where critical information is often embedded within these complex modalities. For instance, accurately extracting data from structured tables or identifying anomalies in medical imaging becomes more achievable, supporting applications such as enterprise search solutions or legal discovery platforms.
- High-Quality Retrieval Queries & Answers: the query-answer pairs for each modality provide a reliable standard for assessing how well RAG systems perform, ensuring that evaluations are precise and reflect real-world behavior. This is particularly helpful when evaluating customer support Q&A applications or tools designed to enhance research capabilities in scientific domains. By using our benchmark, you can precisely assess a RAG system's ability to retrieve relevant information from PDFs, regardless of whether that information originally appeared as text, a table, or an image (see the retrieval-scoring sketch after this list).
- Diverse Knowledge Domains: by sourcing content from various scientific and technical domains within arXiv, the dataset ensures broad applicability and challenges RAG systems across a wide array of knowledge areas. This diversity helps ensure that RAG models trained and evaluated with our benchmark are robust enough for real-world deployment across different industries, from automating information extraction in engineering firms to streamlining data analysis in pharmaceutical research.
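As a simple illustration of how these query-answer pairs drive retrieval evaluation, the sketch below scores a retriever with Recall@10 against the qrels loaded in the earlier snippet. Here `retrieve` is a hypothetical stand-in for whatever retrieval system is under test, and the Open RAG Evaluation framework itself goes well beyond this single metric.

```python
def recall_at_k(qrels, run, k=10):
    """Average fraction of each query's relevant documents found in its top-k results."""
    scores = []
    for query_id, relevant in qrels.items():
        if not relevant:
            continue
        top_k = run.get(query_id, [])[:k]
        hits = sum(1 for doc_id in top_k if doc_id in relevant)
        scores.append(hits / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0

# `retrieve` is a hypothetical function: given a query string, it should
# return a ranked list of corpus ids from the system being evaluated.
run = {qid: retrieve(text, top_k=10) for qid, text in queries.items()}
print(f"Recall@10: {recall_at_k(qrels, run):.3f}")
```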
Dataset Creation

Creating such a comprehensive dataset is a systematic endeavor. Our process involved the following steps:
- Document Collection: gathering raw PDFs from sources like arXiv.
- Document Processing: applying advanced OCR techniques to parse these PDFs, extracting text, converting tables into structured Markdown, and encoding images.
- Content Segmentation: dividing documents into logical sections based on their structural elements.
- Query Generation: generating appropriate retrieval queries for each section using powerful LLMs (currently gpt-4o-mini), with particular attention to multimodal content such as tables and images (a minimal sketch follows this list).
- Quality Filtering: utilizing a suite of high-performing embedding models and further LLM-based filtering to ensure the quality and relevance of every query (also sketched below).
- Hard-Negative Mining (Optional): identifying "hard negative" documents, i.e., additional documents that are irrelevant to every query, and adding them to the corpus to further stress-test retrieval capabilities.
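To make the query-generation step concrete, here is a minimal sketch of how a single section's extracted content could be turned into retrieval queries with gpt-4o-mini through the OpenAI chat API. The prompt wording and inputs are illustrative assumptions rather than the benchmark's exact pipeline; images extracted from the PDF could be passed to the same model as encoded image inputs, which this sketch omits for brevity.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_queries(section_text, table_markdown=None, n_queries=3):
    """Ask gpt-4o-mini for retrieval queries answerable from one document section.
    The prompt below is illustrative, not the benchmark's exact prompt."""
    content = section_text
    if table_markdown:
        content += "\n\nTable (Markdown):\n" + table_markdown
    prompt = (
        f"Write {n_queries} retrieval queries that can be answered using only the "
        "section below. If a table is present, include at least one query that "
        "requires it. Return one query per line.\n\n" + content
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("-*0123456789. ").strip() for line in lines if line.strip()]
```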
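Similarly, the embedding-based part of quality filtering can be sketched as a simple check: keep a generated query only if the section it came from is also the section it retrieves best, above some similarity threshold. The sentence-transformers model and threshold below are stand-in assumptions for the benchmark's actual suite of embedding models and LLM-based filters.

```python
from sentence_transformers import SentenceTransformer, util

# A single model stands in for the suite of embedding models; the model name
# and the threshold are illustrative assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")

def filter_queries(candidate_queries, source_idx, all_sections, min_score=0.5):
    """Keep queries whose most similar section is the one they were generated
    from (index `source_idx`) and whose similarity clears the threshold."""
    section_embs = model.encode(all_sections, convert_to_tensor=True)
    kept = []
    for query in candidate_queries:
        query_emb = model.encode(query, convert_to_tensor=True)
        scores = util.cos_sim(query_emb, section_embs)[0]
        if int(scores.argmax()) == source_idx and float(scores[source_idx]) >= min_score:
            kept.append(query)
    return kept
```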
Looking Ahead: The Future of Open RAG Benchmark
While our current release is a significant achievement, we are continuously working to enhance the Open RAG Benchmark. Our future plans include:
- Dataset Expansion: broadening the scope beyond academic papers, enhancing OCR for unstructured documents, and potentially adding multilingual support.
- Enriched Dataset Structure: adding more comprehensive document metadata (authors, publication dates, categories) and offering flexible content granularity (sections, paragraphs, sentences) to support diverse retrieval strategies.
- Advanced Multimodal Representation: exploring alternative PDF table and image extraction solutions, providing both Markdown and programmatically accessible formats for tables, and maintaining clear positional relationships between text and visual elements.
Conclusion
The Open RAG Benchmark marks a crucial step forward in evaluating RAG systems, especially when dealing with the complex, multimodal nature of real-world documents like PDFs. By moving beyond text-only assessments, this benchmark, along with the Open RAG Evaluation framework, provides a robust and comprehensive method for gauging a system's ability to understand and synthesize information from text, tables, and images.
We encourage you to explore the dataset on Hugging Face and the detailed creation process on GitHub. To experience the power of Open RAG Evaluation in action, sign up for Vectara here. We look forward to your involvement and collaboration.