Your Copilot is Not Enough
Why Microsoft’s Copilot is to GenAI what Windows 95 was to operating systems.
8-minute read time
Microsoft's recent announcement of 'Mico,' a voice agent for Copilot, has been presented as a nostalgic nod to the old Office 'Clippy.' However, that comparison feels uncomfortably accurate to many users, who remember the '90s paperclip mascot as distracting and often unhelpful.
While Copilot presents a far more sophisticated, AI-enabled interface, the core experience is beginning to echo the frustrations of its spiritual predecessor. This new assistant, integrated across the entire Microsoft ecosystem, promises to revolutionize productivity, but users are quickly discovering that it is just not good enough: the quality of its responses falls well below expectations, and hallucinations are rampant.
In many ways it reminds me of Windows 95 - while it certainly was a monumental leap forward, bringing graphical interfaces to the masses, I also vividly remember the infamous "blue screen of death," the constant crashes, the inexplicable glitches, and the drivers that simply wouldn't work. It was a functional, groundbreaking, and necessary step, but it was also deeply flawed, unstable, and often maddening to use.
Copilot is exhibiting the same first-generation, mass-market teething problems. Its relatively low per-seat pricing makes it easy to adopt, but its responses are plagued with hallucinations and inaccuracies that undermine its utility. In fact, many corporate users complain that “ChatGPT in my pocket is so much better than the Copilot my employer provides.”
How Copilot Works (or Doesn’t)
Microsoft Copilot implements a Retrieval-Augmented Generation (RAG) system that integrates three key components: large language models such as OpenAI's GPT-5, Microsoft 365 apps (like Outlook, Teams, and Word), and your organization's business data.
The most critical feature of the M365 version is its ability to "ground" prompts in your organization's own data, just as you would expect from any RAG-based application. However, as many are quick to discover, while the integration with Outlook, Teams, and SharePoint is relatively good, using “connectors” to ground responses in other data sources leaves a lot to be desired.
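To make the grounding idea concrete, here is a minimal, illustrative sketch of what a RAG-style flow looks like in principle. It is not Copilot's actual internals: search_enterprise_data and call_llm are hypothetical placeholders standing in for whatever retriever and LLM a given system uses.

```python
# Illustrative RAG-style grounding sketch (NOT Copilot's actual internals).
# `search_enterprise_data` and `call_llm` are hypothetical placeholders.

def search_enterprise_data(query: str, top_k: int = 5) -> list[str]:
    """Hypothetical retriever: returns the most relevant text chunks
    from documents, emails, chats, or a connector-backed index."""
    raise NotImplementedError("Backed by your search index in a real system")

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call (e.g., an OpenAI-style chat completion)."""
    raise NotImplementedError("Backed by your LLM provider in a real system")

def grounded_answer(question: str) -> str:
    # 1. Retrieve chunks that are (hopefully) relevant to the question.
    chunks = search_enterprise_data(question)

    # 2. Build a prompt that instructs the model to answer ONLY from the
    #    retrieved context - this is the "grounding" step.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate the answer. Quality depends on both retrieval (did we find
    #    the right chunks?) and generation (did the model stick to them?).
    return call_llm(prompt)
```

The point of the sketch is that both steps matter: if retrieval misses, or if the model ignores the retrieved context and falls back on its internal knowledge, the answer drifts into hallucination.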
Copilot has a standard web UI that is provided to every user, appearing both as a centralized, cross-app chat experience and as deeply integrated sidebars within individual M365 apps. It also provides an API (in beta) if you want to integrate Copilot into your own workflows, exposing its capabilities programmatically through the Microsoft Graph API. This allows developers to embed Copilot's grounded intelligence directly into their own line-of-business applications or create entirely new, automated processes.
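As a rough sketch of what that programmatic access might look like, the snippet below calls a beta Copilot retrieval endpoint on Microsoft Graph using a plain HTTP request. The exact endpoint path and payload field names are assumptions based on the beta documentation at the time of writing, so verify them against the current Microsoft Graph docs before relying on them.

```python
import requests

# NOTE: the endpoint path and payload fields below are assumptions based on
# the beta Copilot retrieval API - verify against the current Graph docs.
GRAPH_BETA = "https://graph.microsoft.com/beta"

def copilot_retrieve(access_token: str, query: str) -> dict:
    """Sketch of a grounded retrieval call against the beta Copilot API."""
    resp = requests.post(
        f"{GRAPH_BETA}/copilot/retrieval",  # assumed beta endpoint
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        json={
            "queryString": query,        # assumed field name
            "dataSource": "sharePoint",  # assumed field name / value
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```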
Sounds good, right? Let’s go ahead and test how well it works.
Evaluating Quality: Copilot vs Vectara
I started by creating a corpus in Vectara and using vectara-ingest to crawl the complete set of pages from our own website: https://vectara.com.
Similarly, I used an M365 Copilot connector to crawl the same content so that it is available for grounding in the M365 Copilot application. By default, web search is enabled in Copilot, so I had to dig through the docs quite extensively to figure out how to disable it - and even then, it took about 24 hours for the change to take effect.
But finally, after passing all the hurdles, I was ready to try some queries.
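For readers who prefer code over the console, the same queries can be issued directly against the corpus via Vectara's REST API. The sketch below assumes a corpus key of vectara-website and the v2 corpus-query endpoint; check both, and the exact request fields, against the current API reference.

```python
import requests

VECTARA_API_KEY = "your-api-key"  # assumption: replace with a real API key
CORPUS_KEY = "vectara-website"    # assumption: the corpus created above

def ask_vectara(question: str) -> dict:
    """Sketch of a RAG query against a Vectara corpus (v2 REST API)."""
    resp = requests.post(
        f"https://api.vectara.io/v2/corpora/{CORPUS_KEY}/query",
        headers={"x-api-key": VECTARA_API_KEY, "Content-Type": "application/json"},
        json={
            "query": question,
            "search": {"limit": 10},                       # chunks to retrieve
            "generation": {"max_used_search_results": 5},  # chunks fed to the LLM
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # includes the generated summary and its citations

print(ask_vectara("Who is Amr Awadallah?"))
```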
I first asked: “Who is Amr Awadallah?”
Vectara’s response was great, well-grounded in the website data and with citations:
Amr Awadallah is the Co-Founder and CEO of Vectara, a company focused on building AI Assistants and Agents using Retrieval Augmented Generation (RAG) technology. He has a background in the big data industry, having been a co-founder of Cloudera, which played a significant role in the adoption of big data solutions across enterprises. At Vectara, he leads efforts to create scalable, secure, and accurate AI solutions that transform real-time data into actionable insights, enhancing customer experiences and business operations [1], [2], [6]
When I tried the same query on Copilot I got this response:
You asked about Amr Awadallah, and I searched across your enterprise's people directory, files, emails, chats, meetings, and transcripts. There are no internal references to Amr Awadallah in your organization's data.
If you're referring to the well-known technologist and entrepreneur—Amr Awadallah is recognized for co-founding Cloudera, a company focused on big data and enterprise data platforms. He previously served as CTO of Cloudera and has been involved in various AI and data infrastructure initiatives. More recently, he founded ZIR AI, a company working on AI-driven solutions.
If you're looking for something specific about Amr—like a presentation, collaboration, or commentary—feel free to clarify and I’ll dig deeper or help you draft something.
This response is not grounded in the data coming from the connector (which crawls the web pages of vectara.com) - instead, the model presents information from its internal knowledge, which - even if partially true - is clearly a RAG hallucination, and it also includes outdated facts (such as “ZIR AI”).
Next, I tried the following query: “What is Vectara's Hallucination Model?”
Again, Vectara’s RAG provides a solid response:
Vectara's Hallucination Model, known as the Hallucination Evaluation Model (HEM), is designed to assess the trustworthiness of responses generated by large language models (LLMs) in Retrieval Augmented Generation (RAG) systems. The model provides a factual consistency score, which quantifies how well a generative LLM adheres to the provided context, helping to detect and reduce hallucinations in AI-generated responses. This model is fast, cost-effective, and suitable for production-grade applications, making it a practical solution for enterprises. Vectara has also released an open-source version of this model, allowing enterprises to evaluate and improve the reliability of their RAG systems [6], [9].
In contrast, when I presented the same query to Copilot I got the following:
You asked about Vectara's Hallucination Model, and I found 389 relevant enterprise documents that provide a comprehensive view of its capabilities and evolution.
🧠 What Is Vectara’s Hallucination Model?
Vectara’s approach to hallucination detection and correction centers around a suite of tools and models designed to improve the factual accuracy of AI-generated content, especially in Retrieval-Augmented Generation (RAG) systems.
1. Vectara Hallucination Corrector (VHC): This is the core product that identifies inaccuracies in AI outputs, suggests corrections, and provides guardrails to prevent hallucinations. It’s built to mitigate risks in high-stakes industries like finance, healthcare, and legal
2. Hughes Hallucination Evaluation Model (HHEM): Vectara’s open-source model for evaluating hallucinations in LLMs. It has surpassed 2 million downloads and is widely used to benchmark hallucination rates across models like GPT-4, Claude, and PaLM.
3. Factual Consistency Score (FCS): A calibrated score embedded in Vectara’s platform that helps developers and end-users assess the factual reliability of AI responses. It’s used both internally and as a visual cue in applications.
🧪 Evaluation Tools & Benchmarks
HCMBench: Vectara’s toolkit for testing hallucination correction models with modular pipelines and diverse datasets.
Leaderboards: Vectara maintains public leaderboards comparing hallucination rates across models, helping developers choose the most reliable LLMs.
🧬 Integration with Vectara’s Platform
Mockingbird LLM: Vectara’s proprietary model optimized for RAG workloads, featuring low hallucination rates and embedded FCS. It outperforms GPT-4 in citation precision and structured output quality.
Guardian Agents: Vectara positions its hallucination correction tools as part of a broader strategy to build trustworthy AI agents that operate safely in enterprise environments.
📰 Media Coverage & Industry Impact
The New York Times highlighted Vectara’s hallucination model as a key innovation in addressing the growing concern over AI-generated misinformation.
Vectara’s work has influenced industry standards, especially in regulated sectors where factual accuracy is paramount.
This is a very long and detailed response that includes information well beyond what I asked for (e.g., details about the Mockingbird LLM and Guardian Agents), which, from a RAG perspective, is a failure to answer the user’s actual question.
I ran a few more queries (like “what is a reranker?” or “What is an agent?”), and the overall experience remained disappointing. In some cases Copilot found no information, and in others it provided far too much - it seems as if Copilot uses any retrieved chunk of information in its response, whether or not it is relevant to the query, which leads to a higher rate of hallucinations and diminished trust from end users.
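One way to quantify that “use every retrieved chunk” behavior is to score each generated answer against its source passages with the open-source Hughes Hallucination Evaluation Model (HHEM). The sketch below follows the usage pattern on the model’s Hugging Face card at the time of writing; treat the exact call and the example pair as illustrative assumptions.

```python
# pip install transformers torch
from transformers import AutoModelForSequenceClassification

# Open-source HHEM model published on Hugging Face. The predict() helper
# follows the model card's documented usage - verify before relying on it.
model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (source passage, generated claim). A score near 1.0 means the
# claim is well supported by the source; near 0.0 means likely hallucinated.
pairs = [
    (
        "Amr Awadallah is the Co-Founder and CEO of Vectara.",  # source
        "Amr Awadallah recently founded ZIR AI.",                # claim
    ),
]
scores = model.predict(pairs)
print(scores)  # a low score flags the claim as unsupported by the source
```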
Conclusion
Copilot is a milestone: widely deployed, inexpensive, and aiming to be ubiquitous. But “ubiquitous” isn’t the same as “reliable.” In our tests, the gap showed up in the unglamorous parts of RAG - low-recall retrieval, grounding that doesn’t stick to the retrieved data, and not knowing when to say “I don’t know.” And that is only what was visible in our tests - there could be many more hallucinations we just haven’t seen yet.
Until those basics are boringly good, breadth only creates the illusion of intelligence, while trust rapidly decays and the value of your GenAI initiative erodes with it.
To achieve success with enterprise Generative AI, teams need an AI Agent that earns trust answer by answer. That means accurate retrieval, precise responses, citations that really back the response, hallucination mitigation, and a graceful no-answer when the corpus is thin.
This is why we built Vectara’s high accuracy retrieval stack, Hallucination Evaluation Model, and Vectara Hallucination Corrector: tools that measure and actively reduce unsupported claims, not just style the output.
What has been your experience with Copilot?
If you’re feeling the same friction, contact us for a demo, or just try Vectara for yourself to see the difference - sign up for a free trial and experience firsthand better-quality responses and reduced hallucinations in your GenAI application.



