HHEM | Flash Update: Phi 2
See how Phi 2 compares to Mixtral 8x7B and Titan Express on the Hughes Hallucination Evaluation Model (HHEM) leaderboard
Last week, we added Microsoft Phi 2 to our hallucination leaderboard. This leaderboard quantifies the factual consistency of LLM-generated summaries from sets of underlying facts. It’s designed to reflect how LLMs are used in the retrieval-augmented generation (RAG) architecture. As our late colleague Simon Hughes explained in the context of honoring GDPR wipeout requests in generative AI systems, RAG restricts the role of the LLM to interpreting the results of the retrieval step, thereby significantly narrowing the opportunity to hallucinate.
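To make the metric concrete, here is a minimal sketch of how a hallucination rate could be derived from per-summary scores. It assumes each summary has already been given a factual-consistency score between 0 and 1, and that a summary scoring below a cutoff counts as hallucinated. The function name, the 0.5 cutoff, and the sample scores are illustrative assumptions, not the leaderboard's actual pipeline or data.

```python
def hallucination_rate(scores, threshold=0.5):
    """Fraction of summaries whose consistency score falls below the threshold.

    `scores` are hypothetical per-summary factual-consistency scores in [0, 1];
    the 0.5 default threshold is an assumption made for illustration.
    """
    if not scores:
        raise ValueError("need at least one score")
    flagged = sum(1 for s in scores if s < threshold)
    return flagged / len(scores)


# Made-up scores for eight summaries (not real leaderboard data).
scores = [0.97, 0.12, 0.88, 0.91, 0.45, 0.99, 0.76, 0.83]
rate = hallucination_rate(scores)
print(f"Hallucination rate: {rate:.1%}")  # prints "Hallucination rate: 25.0%"
```

Two of the eight sample scores fall below the cutoff, so the sketch reports 25.0%. A leaderboard figure such as 8.5% would be read the same way: the share of evaluated summaries judged inconsistent with their source facts.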
The table below, which reproduces the leaderboard as of January 16, 2024, shows Phi 2 hallucinating in 8.5% of summaries, the same as Cohere and Claude 2, and slightly better than Mixtral 8x7B and Titan Express.

The importance of these results stems from two facts. First, although Phi 2 weighs in at “only” 2.7B parameters, our results confirm Microsoft’s observation that its novel training strategy allows it to outperform much larger models. In turn, this eases adoption and deployment of LLMs: not only is hardware quickly growing more powerful and cheaper, but the LLMs themselves are steadily increasing bang-for-the-parameter. Second, on January 5th Microsoft moved the model to the MIT license, making it much easier to incorporate into commercial software and enterprise systems.
Taken together, these moves significantly increase pressure on proprietary LLM vendors to accelerate innovation and reduce prices, with consumers being the ultimate winners.






