When AI “Assures” Without Evidence: Lessons from Deloitte’s $290K Hallucination

A costly lesson in why AI outputs must be verifiable

A $290K Refund and an Uncomfortable Truth

When Deloitte Australia delivered a 237-page “independent assurance” report to the Australian government earlier this year, it aimed to validate a complex welfare compliance system. Instead, the document quickly became an embarrassing case study in AI failure.

Soon after publication, Chris Rudge, an academic at Sydney University, noticed fundamental flaws: citations that didn't exist, misquoted legal judgments, and references to academic papers that were simply fabricated. The culprit was generative AI (reportedly GPT-4o via Azure OpenAI) used to help draft key sections of the report. Deloitte was forced to revise the document, acknowledge its use of the tools, and agree to refund the government A$290,000 (around US$190,000).

A tangled web of hallucinations

Most people think of an AI hallucination as simply getting a fact wrong, but in high-stakes fields like enterprise and government work, and increasingly in academia, citation hallucination is far more toxic. This occurs when a model invents sources, quotes, or references that look entirely legitimate. The AI-generated passages in Deloitte’s report included nonexistent legal citations and made-up academic papers.

The document looked polished, professional, and thoroughly researched, right up until experts tried to verify the sources. Human error is common, but verification relies on traceable source citations: a reader can follow a real reference and check the claim. This is why citation hallucination is so dangerous: the AI doesn't just make an incorrect claim; it actively manufactures the illusion of credibility.

Not Just "AI’s Fault"

The failure wasn’t just that GPT-4o “made stuff up.” It was that no organizational safeguards existed for validating or measuring the quality of the report’s citations before publication.

This is precisely the kind of breakdown Retrieval-Augmented Generation (RAG) was designed to prevent. By constraining a model’s responses to retrieved, verifiable sources, RAG systems reduce hallucination risk. For example, when the LLM is connected to a vector database of actual scientific papers or verified news articles, it is compelled to ground each claim and citation in retrieved evidence rather than invent one.
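
To make that pattern concrete, here is a minimal sketch of retrieval-grounded generation. The bag-of-words "embedding" and the prompt template are illustrative stand-ins, not any particular product's API, and the final LLM call is left to the reader:

```python
# Minimal sketch of the RAG pattern: retrieve candidate sources first, then
# constrain the model to answer only from them, citing each claim.
# The bag-of-words "embedding" is a toy stand-in for a real embedding model.
import numpy as np

def embed(text: str, vocab: list[str]) -> np.ndarray:
    words = [w.strip(".,?!").lower() for w in text.split()]
    return np.array([words.count(v) for v in vocab], dtype=float)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    vocab = sorted({w.strip(".,?!").lower() for doc in corpus + [query] for w in doc.split()})
    q = embed(query, vocab)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc, vocab)), reverse=True)
    return ranked[:k]

def grounded_prompt(query: str, passages: list[str]) -> str:
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return ("Answer using ONLY the numbered sources below and cite each claim as [n]. "
            "If the sources do not cover the question, say so.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {query}")

corpus = [
    "Toy passage about peanut allergy prevalence.",
    "Toy passage about welfare compliance audits.",
    "Toy passage about citation practices in legal reports.",
]
query = "How common is peanut allergy?"
print(grounded_prompt(query, retrieve(query, corpus)))  # this prompt would then go to the LLM of your choice
```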

That being said, even within RAG systems, hallucinations can still sneak in. The model might cite irrelevant or mismatched documents, often because retrieval relies only on cosine similarity of embeddings, so the retrieved context may not be fully relevant to the question. For instance, a database may retrieve 10 articles about peanut allergy using simple embeddings, but require a second pass, or reranking, by an LLM to determine which of those 10 articles is most relevant to a specific query (e.g., how common is peanut allergy in India?). The necessity of this reranking highlights that the initial retrieval step, which must sift through millions of documents, is inherently imperfect. Furthermore, retrieval might be incomplete or stale, or the citations might simply appear correct while not being semantically aligned with the generated facts.
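
Here is what that second pass looks like in miniature. In production the scorer would be an LLM or a dedicated reranking model; the word-overlap scorer below is a toy stand-in so the example runs on its own:

```python
# Sketch of two-stage retrieval: a cheap first pass returns candidates, then a
# second pass reranks them against the actual question.
from typing import Callable

def tokens(text: str) -> set[str]:
    return {w.strip(".,?!").lower() for w in text.split()}

def toy_relevance(query: str, doc: str) -> float:
    # Stand-in for an LLM or reranker judgment: fraction of query words found in the doc.
    q, d = tokens(query), tokens(doc)
    return len(q & d) / max(len(q), 1)

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]

candidates = [
    "General overview of peanut allergy treatments",
    "Survey of peanut allergy prevalence in India",
    "Guidelines for labeling tree nuts in packaged food",
]
print(rerank("How common is peanut allergy in India?", candidates, toy_relevance, top_n=1))
```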

That’s where RAG evaluation frameworks like Open RAG Eval come in, and where Deloitte’s experience becomes a vivid reminder of a crucial element missing from most AI content pipelines.

An enhancement to peer review

AI can draft faster than humans can review, but rather than try to decrease a writer's ability to produce, we can increase a reviewer's ability to verify.

Open RAG Eval (ORE) is Vectara’s open framework for evaluating factual grounding in RAG outputs. It analyzes generated text against retrieved passages and computes:

  • Citation Score: Measures the fraction of generated sentences that carry at least one corresponding citation. This is critical for catching the failure mode seen in the Deloitte case, where uncited claims were fabricated.
  • Groundedness Score (AutoNuggetizer): Measures the fraction of claims in the generation that are actually supported by the source documents. It decomposes the generated text into atomic "nuggets" (claims) and checks each nugget against the retrieved context (a toy sketch of both scores follows this list).
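
Here is that toy sketch. It approximates both ideas with simple heuristics, regex citation markers and word overlap, purely to illustrate what the scores measure; Open RAG Eval's actual implementation relies on LLM-based nugget extraction and judgment.

```python
# Toy illustration of the two metrics above, NOT the Open RAG Eval implementation:
# - citation score: fraction of generated sentences carrying a [n]-style citation
# - groundedness: fraction of claims whose words substantially overlap a source passage
import re

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def citation_score(generated: str) -> float:
    sentences = split_sentences(generated)
    cited = sum(1 for s in sentences if re.search(r"\[\d+\]", s))
    return cited / len(sentences) if sentences else 0.0

def groundedness_score(generated: str, passages: list[str], threshold: float = 0.5) -> float:
    # Word overlap is a crude stand-in for ORE's nugget-level, LLM-judged support check.
    def words(text: str) -> set[str]:
        return {w.strip(".,;:[]()?!").lower() for w in text.split()} - {""}
    sentences = split_sentences(generated)
    supported = 0
    for sentence in sentences:
        claim = words(sentence)
        if any(len(claim & words(p)) / max(len(claim), 1) >= threshold for p in passages):
            supported += 1
    return supported / len(sentences) if sentences else 0.0

passages = ["The welfare compliance program was audited in 2023 and gaps were found."]
answer = ("The program was audited in 2023 and gaps were found [1]. "
          "Its legal basis has never been questioned.")
print(f"citation score:     {citation_score(answer):.2f}")                # 0.50 (1 of 2 sentences cited)
print(f"groundedness score: {groundedness_score(answer, passages):.2f}")  # 0.50 (second claim unsupported)
```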

There are many other metrics available in ORE that folks should feel free to play around with, including our latest Consistency Score. You can learn more about the framework here: Introducing Open RAG Eval.

Conclusion: Scaling the Safeguards for Enterprise-Ready AI

With that kind of evaluation in the pipeline, confident but unsupported statements (the failure mode at the heart of Deloitte's case) are flagged for review before they reach a reader. AI can be an amazing productivity tool, but using it for drafting alone is like mounting a V8 engine on a bicycle: the brakes that have worked so well for so long will no longer be up to the task. For AI to become Enterprise Ready, the complete pipeline, including the safeguards, must be scaled up proportionally.

(For a collection of other failure modes and real-world examples, see Vectara's list of awesome-agent-failures.)

Want to learn how ORE and Agentic RAG can make your AI outputs verifiable?
→ Explore Vectara ORE Evaluation
→ Talk to our team about Agentic RAG
