
Context Engineering: Can you trust long context?

Context Engineering is emerging as the evolution of prompt engineering. Yet larger context windows can backfire: too many tokens invite noise, contradictions, and diminishing returns.


Introduction

Modern LLMs support very long context windows (1M tokens for Gemini 2.5 or GPT-4.1, and 200K for Anthropic's Claude Sonnet 4), which opens up the opportunity to include far more text in the LLM prompt than was ever possible before.

This benefits RAG applications, since you can provide larger chunks or more chunks in the generative step, and it benefits agentic applications even more, where the context includes instructions, tool information, and both short- and long-term agentic memory.

This has triggered recent discussions in the AI world (see, for example, this post by Philipp Schmid) about a new term: “context engineering”. In a nutshell, context engineering describes the “art and science” of providing all the right context for an LLM to ensure success when building agents or other generative AI applications.

But it’s not that simple - we do see degraded LLM performance at long sequence lengths.

The Joy of Longer Context Windows

Just a couple of years ago, the context windows of models like GPT-4 or Llama 3 were limited to about 8,192 tokens. That was quite small and quickly became a bottleneck.

Fast forward to today: building on research like YaRN and LongRoPE, most modern LLMs support context lengths of 200K to 1M tokens, with even longer windows on the way.

Image 1: As LLMs evolved over the last 2 years, their context window size has dramatically increased.

This changes the game for RAG applications, opening the door for using more chunks and larger chunks of retrieved information, providing the model with a more comprehensive understanding of the topic.

This expansion of the context window is even more transformative for the development of agentic workflows, where the prompt is a carefully constructed “workspace” containing a variety of essential elements (a minimal assembly sketch follows the list):

  • The User's Query: What the user wants to accomplish.
  • Agent Instructions: The "system prompt" that defines the agent's persona and goals.
  • Available Tools: Detailed descriptions of the functions the agent can call upon, along with details about their arguments.
  • Short-Term Memory: The immediate history of the conversation so far.
  • Long-Term Memory: The agent's repository of learned preferences, past interactions, and accumulated knowledge.
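
Putting these elements together, here is a minimal sketch of how such a workspace might be assembled, trimming the oldest short-term memory turns when a token budget is exceeded. The AgentContext class, build_prompt function, and count_tokens callback are illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    user_query: str                  # what the user wants to accomplish
    instructions: str                # system prompt: persona and goals
    tools: list[dict] = field(default_factory=list)             # name, description, arguments
    short_term_memory: list[str] = field(default_factory=list)  # recent conversation turns
    long_term_memory: list[str] = field(default_factory=list)   # learned preferences, past facts

def build_prompt(ctx: AgentContext, max_tokens: int, count_tokens) -> str:
    """Assemble the workspace, dropping the oldest short-term turns if over budget."""
    short_term = list(ctx.short_term_memory)
    while True:
        sections = [
            ("Instructions", ctx.instructions),
            ("Tools", "\n".join(f"- {t['name']}: {t['description']}" for t in ctx.tools)),
            ("Long-term memory", "\n".join(ctx.long_term_memory)),
            ("Conversation so far", "\n".join(short_term)),
            ("User query", ctx.user_query),
        ]
        prompt = "\n\n".join(f"## {title}\n{body}" for title, body in sections if body)
        if count_tokens(prompt) <= max_tokens or not short_term:
            return prompt
        short_term.pop(0)  # drop the oldest conversation turn and rebuild
```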

As we can see, the sheer volume of information we can now include in a prompt makes context engineering a central challenge. It's no longer a matter of what we can fit, but what we should include and where we should place it to achieve the best possible outcome.

Challenges with Longer Context Windows

Support for such long contexts seems too good to be true.

Unfortunately, it is: LLMs demonstrate degraded performance with longer sequences.

The most widely documented and intuitive form of long-context performance degradation is the "Lost in the Middle" effect. This phenomenon describes the tendency of LLMs to exhibit strong performance when recalling or reasoning about information located at the very beginning or very end of a long input context, while performance drops significantly for information buried in the middle.

When plotted, this creates a characteristic "U-shaped" performance curve, with high accuracy at the edges of the context and a deep trough in the center, as shown in image 2:

Image 2: Typical U-shaped curve for accuracy of generated responses relative to position of document in the prompt.

This means that when an LLM is asked a question, if the answer lies in the first or last few paragraphs of a long prompt, it is likely to answer correctly. However, if the same critical information is moved to the middle of the prompt, the model's accuracy can plummet sharply.

The so-called “needle in the haystack” challenge demonstrates the difficulty LLMs have when asked to locate and extract a very specific piece of information - a “needle” - from within a vastly larger body of text - the “haystack” (the long prompt). While LLMs excel at generating fluent and coherent language, pinpointing low-frequency or deeply buried facts in long contexts often trips them up.
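
To illustrate both effects, here is a minimal sketch of a position-sweep probe: a known fact (the “needle”) is inserted at different depths of a long filler context, and accuracy is recorded per position. The ask_llm callable is a placeholder assumption standing in for whatever LLM client you use; a dip in accuracy around the middle depths reproduces the “lost in the middle” pattern.

```python
import random

NEEDLE = "The access code for the archive is 7412."
QUESTION = "What is the access code for the archive?"

def build_haystack(filler_paragraphs: list[str], depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start of context, 1.0 = end)."""
    docs = list(filler_paragraphs)
    docs.insert(int(depth * len(docs)), NEEDLE)
    return "\n\n".join(docs)

def run_probe(ask_llm, filler_paragraphs, depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=10):
    """Return accuracy per needle position, averaged over several shuffled haystacks."""
    scores = {}
    for depth in depths:
        correct = 0
        for _ in range(trials):
            random.shuffle(filler_paragraphs)  # vary the surrounding context each trial
            prompt = build_haystack(filler_paragraphs, depth) + f"\n\nQuestion: {QUESTION}"
            if "7412" in ask_llm(prompt):
                correct += 1
        scores[depth] = correct / trials
    return scores  # a trough near depth 0.5 mirrors the U-shaped curve in image 2
```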

More recently, AbsenceBench flips the classic “needle-in-a-haystack” retrieval task on its head: rather than asking an LLM to find an unexpected insertion, it asks the model to identify what has been removed from a document. AbsenceBench demonstrates that even the most capable long-context LLMs remain surprisingly bad at spotting omissions.

Taken together, we see that long contexts are useful but vulnerable to degraded performance. The LongBench leaderboard demonstrates this clearly: even the best models are only around 50-60% accurate on tasks with long context windows.

Context Distillation

To counteract these effects, you need a well-implemented retrieval system combined with careful context orchestration. Instead of overwhelming the model with a massive, unfiltered context, an intelligent retrieval system acts as a precise filter, pinpointing and extracting only the most relevant information for the task at hand.

This type of “contextual distillation” is only half the battle. The orchestration of the retrieved context - strategically placing the most critical facts where the model is least likely to ignore them - is equally important. By curating a smaller, more potent context, detecting and correcting hallucinations, and arranging the cleanest, most important facts thoughtfully in the prompt, we guide the LLM to leverage its capabilities effectively and mitigate the risks of long-context vulnerabilities.
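
As one illustration of this distillation-plus-placement idea, the sketch below keeps only the top-scoring retrieved chunks and reorders them so the strongest evidence sits at the beginning and end of the prompt, where the U-shaped curve suggests the model attends best. The scoring input and the alternating placement strategy are assumptions for illustration, not a description of Vectara's pipeline.

```python
def distill_and_order(scored_chunks: list[tuple[float, str]], top_k: int = 8) -> list[str]:
    """Keep the top-k chunks by relevance score, then interleave them so the
    highest-scoring chunks land at the edges of the prompt and weaker ones in the middle."""
    top = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)[:top_k]
    front, back = [], []
    for i, (_, chunk) in enumerate(top):
        (front if i % 2 == 0 else back).append(chunk)  # alternate ends, best chunks first
    return front + back[::-1]

# Example with scores from a (hypothetical) reranker:
chunks = [(0.91, "A"), (0.85, "B"), (0.60, "C"), (0.40, "D"), (0.30, "E")]
print(distill_and_order(chunks, top_k=5))  # ['A', 'C', 'E', 'D', 'B']
```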

Crucially, this grounded approach helps directly address the challenge of hallucination. By forcing the model to base its response on specific, verifiable information provided in the prompt, we constrain its tendency to invent facts or generate plausible-sounding falsehoods.

This act of grounding the LLM in retrieved data is a foundational pillar of building trusted AI, as it makes the model's outputs not only more accurate but also more attributable and transparent.

Conclusion: The Future of Context Engineering

Context Engineering is the evolution of prompt engineering. Success in RAG and AI agents is no longer about a single, simple prompt; it's about a carefully constructed sequence of inputs to the LLM.

The existing (and often ignored) vulnerability of LLMs to long context windows means we must think through and carefully select just the right context, instead of “stuffing” all the data we have into the LLM prompt.

Contact our team to learn more about Vectara’s approach to context engineering and agentic AI.


