
A new framework from Stanford University and SambaNova addresses a key challenge in building robust AI agents: context engineering. Called Agentic Context Engineering (ACE), the framework automatically populates and modifies the context window of large language model (LLM) applications by treating it as an “evolving playbook” that creates and refines strategies as the agent gains experience in its environment.
ACE is designed to overcome a key limitation of other context-engineering frameworks: the tendency of a model’s context to deteriorate as it accumulates more information. Experiments show that ACE, which works by optimizing prompts and managing the agent’s memory, outperforms other methods while being significantly more efficient.
The context engineering challenge
Advanced AI applications built on LLMs rely heavily on “context adaptation,” or context engineering, to guide their behavior. Instead of the expensive process of retraining or fine-tuning models, developers use LLMs’ ability to learn in context, steering behavior by modifying input prompts with specific instructions, reasoning steps, or domain-specific knowledge. This additional information is typically gathered as the agent interacts with its environment and accumulates new data and experience. The main goal of context engineering is to organize this new information in a way that improves model performance and avoids confusing the model. This approach is becoming a central paradigm for building capable, scalable, and self-improving AI systems.
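The core idea can be shown in a few lines. This is a minimal, hypothetical sketch (the function and strategy text are illustrative, not from the ACE paper): rather than changing model weights, the application steers behavior by prepending accumulated strategies to each query.

```python
def build_prompt(query: str, playbook: list[str]) -> str:
    """Assemble the model input from a learned context plus the user query."""
    context = "\n".join(f"- {tip}" for tip in playbook)
    return (
        "You are a support assistant. Follow these learned strategies:\n"
        f"{context}\n\n"
        f"User query: {query}"
    )

# Strategies accumulated from past interactions (illustrative examples)
playbook = [
    "Always confirm the account ID before making changes.",
    "Cite the relevant policy section when denying a request.",
]
print(build_prompt("Cancel my subscription.", playbook))
```

Updating the playbook changes the model’s behavior on the very next call, with no retraining involved; the open question, which ACE addresses, is how to grow that playbook without degrading it.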
Context engineering has many advantages for enterprise applications. Contexts are interpretable for both users and developers, can be updated with new knowledge at runtime, and can be shared across different models. Context engineering also benefits from ongoing hardware and software advances, such as LLMs’ growing context windows and efficient inference techniques like context caching.
There are various automated context-engineering techniques, but most suffer from two major limitations. The first is a “brevity bias,” in which prompt-optimization methods favor brief, general instructions over comprehensive, detailed ones. This can degrade performance in complex domains.
The second, more serious issue is “context collapse.” When an LLM is tasked with rewriting its entire accumulated context over and over, it can suffer a form of digital amnesia.
“What we call ‘context collapse’ occurs when an AI attempts to rewrite or compress everything it has learned into a new version of its prompt or memory,” the researchers said in written comments to VentureBeat. “Over time, that rewriting process erases important details – like overwriting a document so many times that key notes disappear. In customer-facing systems, this could mean a support agent suddenly losing awareness of previous interactions… leading to erratic or inconsistent behavior.”
The researchers argue that “contexts should serve not as brief summaries, but as comprehensive, evolving playbooks – detailed, inclusive, and rich with domain insights.” This approach leans on a strength of modern LLMs: they can effectively pick out what is relevant from long, detailed contexts.
How Agentic Context Engineering (ACE) works
ACE is a framework for comprehensive context adaptation, designed for both offline settings, such as system-prompt optimization, and online scenarios, such as real-time memory updates for agents. Instead of compressing information, ACE treats the context as a dynamic playbook that collects and organizes strategies over time.
The framework divides the labor among three specialized roles: a Generator, a Reflector, and a Curator. According to the paper, this modular design is inspired by “how humans learn – experimenting, reflecting and consolidating – while avoiding the bottleneck of overloading a single model with all responsibilities.”
The workflow starts with the Generator, which produces reasoning trajectories for input queries, highlighting both effective strategies and common mistakes. The Reflector then analyzes these trajectories to extract key lessons. Finally, the Curator synthesizes these lessons into compact updates and merges them into the existing playbook.
To prevent context collapse and brevity bias, ACE incorporates two key design principles. First, it uses incremental updates: the context is represented as a collection of structured, itemized bullets rather than a single block of text. This allows ACE to make targeted changes and retrieve the most relevant information without rewriting the entire context.
Second, ACE uses a “grow-and-refine” mechanism. As new experience is gathered, new bullets are added to the playbook and existing bullets are updated. A de-duplication step regularly removes redundant entries, ensuring that the context remains comprehensive yet relevant and concise over time.
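The two principles together can be sketched as a small data structure. The shapes below are illustrative assumptions (the paper does not prescribe these exact fields), and the similarity check uses a simple string ratio as a stand-in for whatever de-duplication criterion an implementation might choose.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Bullet:
    id: int
    text: str

class Playbook:
    """An itemized context: a set of bullets edited by deltas, never rewritten wholesale."""

    def __init__(self):
        self.bullets: dict[int, Bullet] = {}
        self._next_id = 0

    def apply_delta(self, added: list[str], updated: dict[int, str]) -> None:
        """Incremental update: touch only the listed bullets."""
        for text in added:
            self.bullets[self._next_id] = Bullet(self._next_id, text)
            self._next_id += 1
        for bid, text in updated.items():
            if bid in self.bullets:
                self.bullets[bid].text = text

    def refine(self, threshold: float = 0.9) -> None:
        """Grow-and-refine: drop bullets that are near-duplicates of earlier ones."""
        kept: list[Bullet] = []
        for b in sorted(self.bullets.values(), key=lambda b: b.id):
            if all(SequenceMatcher(None, b.text, k.text).ratio() < threshold for k in kept):
                kept.append(b)
        self.bullets = {b.id: b for b in kept}
```

Because each delta adds or edits individual bullets, the full playbook is never passed through a lossy rewrite, which is precisely the failure mode behind context collapse.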
ACE in action
The researchers evaluated ACE on two types of tasks that benefit from an evolving context: agent benchmarks that require multi-turn reasoning and tool use, and domain-specific financial-analysis benchmarks that demand specialized knowledge. For high-stakes industries like finance, the benefits go beyond raw performance. As the researchers put it, the framework is “far more transparent: A compliance officer can literally read what the AI has learned, as it is stored in human-readable text, rather than hidden in billions of parameters.”
Results showed that ACE consistently outperformed strong baselines such as GEPA and classic in-context learning, achieving an average performance gain of 10.6% on agent tasks in both offline and online settings and 8.6% on domain-specific benchmarks.
Critically, ACE can build effective contexts by analyzing feedback from its own actions and environment, rather than requiring manually labeled data. The researchers note that this ability is a “key ingredient for self-improving LLMs and agents.” On the public AppWorld benchmark, designed to evaluate agentic systems, an ACE-powered agent using a small open-source model (DeepSeek-V3.1) matched the performance of the top-ranked, GPT-4.1-powered agent on average and surpassed it on the more difficult test set.
The takeaway is important for businesses. “This means companies will no longer have to rely on large proprietary models to remain competitive,” the research team said. “They can deploy local models, protect sensitive data, and still get top-tier results by continuously refining the context rather than retraining the weights.”
Beyond accuracy, ACE proved highly efficient. It adapts to new tasks with an average of 86.9% lower latency than existing methods, and it requires fewer steps and tokens. The researchers say this efficiency demonstrates that “scalable self-correction can be achieved with both high accuracy and low overhead.”
For enterprises concerned about inference costs, the researchers point out that the longer contexts ACE produces do not translate into proportionately higher costs. Modern serving infrastructures are increasingly optimized for long-context workloads, with techniques such as KV cache reuse, compression, and offloading steadily lowering the cost of handling long contexts.
Ultimately, ACE points to a future where AI systems are dynamic and constantly improving. “Today, only AI engineers can update models, but context engineering opens the door for domain experts – lawyers, analysts, doctors – to directly shape what the AI knows by editing its contextual playbooks,” the researchers said. It also makes governance more practical: “Selective unlearning becomes more streamlined: if a piece of information is outdated or legally sensitive, it can be easily removed or replaced in context, without having to retrain the model.”

