
Researchers at the University of Illinois Urbana-Champaign and Google Cloud AI Research have developed a framework that enables large language model (LLM) agents to organize their experiences into a memory bank, helping them improve at complex tasks over time.
The framework, called ReasoningBank, distills "general reasoning strategies" from an agent's successful and unsuccessful attempts to solve problems. The agent draws on this memory during inference to avoid repeating past mistakes and to make better decisions when faced with new problems. The researchers show that when ReasoningBank is combined with test-time scaling techniques, where an agent makes multiple attempts at a problem, it significantly improves LLM agents' performance and efficiency.
Their findings show that ReasoningBank consistently outperforms classic memory mechanisms in web browsing and software engineering benchmarks, providing a practical path toward building more adaptive and reliable AI agents for enterprise applications.
LLM Agent Memory Challenge
As LLM agents are deployed in long-running applications, they face a continuous stream of tasks. One of the major limitations of current LLM agents is their failure to learn from this accumulated experience. By approaching each task in isolation, they inevitably repeat past mistakes, discard valuable insights from related problems, and fail to develop skills that would make them more competent over time.
The solution to this limitation is to give agents some form of memory. Previous efforts to give agents memory have focused on storing past interactions for reuse by organizing information in a variety of forms, from plain text to structured graphs. However, these approaches often fall short. Many use raw interaction logs or store only successful action instances. This means that they cannot distill high-level, transferable reasoning patterns and, importantly, they do not extract and use valuable information from the agent’s failures. As the researchers write in their paper, “Existing memory designs are often limited to passive record-keeping rather than providing actionable, generalizable guidance for future decisions.”
How does ReasoningBank work?
ReasoningBank is a memory framework designed to overcome these limitations. Its central idea is to distill useful strategies and reasoning signals from past experiences into structured memory items that can be stored and reused.
According to Jun Yan, a Google research scientist and co-author of the paper, this marks a fundamental change in the way agents operate. "Traditional agents operate statically—each task is processed in isolation," Yan explained. "ReasoningBank changes this by turning every work experience (successful or unsuccessful) into a structured, reusable reasoning memory. As a result, the agent doesn't start from scratch with each customer; it remembers and adapts proven strategies from previous similar cases."
The framework processes both successful and unsuccessful experiences and turns them into a collection of useful strategies and preventive lessons. The agent judges success and failure with LLM-as-a-judge schemes, eliminating the need for human labeling.
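To make the distillation step concrete, here is a minimal sketch in Python. The field names (title, description, content) follow the paper's description of structured memory items, but the judge is a trivial stand-in heuristic for what would actually be an LLM-as-a-judge call, and the distilled text is illustrative only.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    title: str        # short name of the strategy or lesson
    description: str  # one-line summary
    content: str      # actionable guidance distilled from the trajectory

def judge_success(trajectory: str) -> bool:
    """Stand-in for an LLM-as-a-judge call; here, a trivial keyword check."""
    return "task completed" in trajectory.lower()

def distill(trajectory: str) -> MemoryItem:
    """Turn a trajectory into a strategy (success) or a preventive lesson (failure)."""
    if judge_success(trajectory):
        return MemoryItem(
            title="Proven strategy",
            description="Approach that completed the task",
            content=f"Reuse the successful steps observed in: {trajectory[:60]}",
        )
    return MemoryItem(
        title="Preventive lesson",
        description="Pitfall to avoid",
        content=f"Do not repeat the failure observed in: {trajectory[:60]}",
    )
```

Because failures produce memory items too, the bank accumulates both kinds of guidance rather than only recording what worked.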
Yan offered a practical example of this process in action. An agent tasked with finding Sony headphones may fail because its broad search query returns more than 4,000 irrelevant products. "ReasoningBank would first try to find out why this approach failed," Yan said. "It will then develop strategies such as 'Optimize search queries' and 'Limit products with category filtering.' Those strategies will be extremely useful for successfully completing similar tasks in the future."
The process operates in a closed loop. When an agent faces a new task, it uses embedding-based search to retrieve relevant memories from ReasoningBank to guide its actions. These memories are fed into the agent's system prompt, providing context for its decision making. Once the task is complete, the framework creates new memory items that extract insights from successes and failures. This new knowledge is then analyzed, distilled and merged into ReasoningBank, allowing the agent to continuously grow and improve its abilities.
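The retrieve-act-consolidate loop can be sketched as follows. This is a simplified illustration, not the paper's implementation: it substitutes a bag-of-words cosine similarity for the learned embedding model ReasoningBank would use, and the class and method names are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words token counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ReasoningBankSketch:
    def __init__(self):
        self.memories: list[str] = []

    def retrieve(self, task: str, k: int = 2) -> list[str]:
        """Rank stored memories by similarity to the new task."""
        q = embed(task)
        ranked = sorted(self.memories, key=lambda m: cosine(q, embed(m)),
                        reverse=True)
        return ranked[:k]

    def build_prompt(self, task: str) -> str:
        """Inject retrieved memories into the agent's system prompt."""
        hints = "\n".join(f"- {m}" for m in self.retrieve(task))
        return f"Relevant past strategies:\n{hints}\n\nTask: {task}"

    def consolidate(self, new_memory: str) -> None:
        """After the task, merge distilled insights back into the bank."""
        self.memories.append(new_memory)
```

In this sketch, `consolidate` would be fed the output of the distillation step described above, closing the loop between acting and remembering.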
Supercharging memory with scaling
The researchers found a powerful synergy between memory and test-time scaling. Classic test-time scaling involves generating multiple independent answers to the same question, but the researchers argue that this "vanilla form is sub-optimal because it does not take advantage of the inherent contrastive signal that arises from redundant exploration on the same problem."
To address this, they propose Memory-Aware Test-Time Scaling (MaTTS), which integrates scaling with ReasoningBank. MaTTS comes in two forms. In “parallel scaling”, the system generates multiple trajectories for the same query, then compares and contrasts them to identify consistent logic patterns. In sequential scaling, the agent iteratively refines its reasoning in a single attempt, with intermediate notes and improvements also serving as valuable memory cues.
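A minimal sketch of the parallel-scaling variant, under stated assumptions: the `rollout` function is a hypothetical stand-in for running the LLM agent once, answer consistency is reduced to a simple majority vote, and the contrast between agreeing and disagreeing rollouts is recorded as a memory cue. The actual MaTTS procedure compares full trajectories, not just final answers.

```python
from collections import Counter

def parallel_matts(query, rollout, k=5):
    """Roll out k attempts, keep the most consistent answer, and
    distill the agreement/disagreement contrast into a memory cue."""
    trajectories = [rollout(query, seed=i) for i in range(k)]
    answers = [t["answer"] for t in trajectories]
    best, votes = Counter(answers).most_common(1)[0]
    memory = {
        "query": query,
        "strategy": f"{votes}/{k} rollouts agreed on '{best}'",
        "pitfalls": sorted(set(a for a in answers if a != best)),
    }
    return best, memory
```

The returned `memory` dict is what would be consolidated into the bank, so the extra compute spent on rollouts also pays off as higher-quality stored experience.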
This creates a virtuous cycle: the existing memory in the ReasoningBank leads the agent to more promising solutions, while the diverse experiences generated through scaling enable the agent to form higher quality memories to store in the ReasoningBank.
“This positive feedback loop positions memory-driven experience scaling as a new scaling dimension for agents,” the researchers wrote.
ReasoningBank in action
The researchers tested their framework on the WebArena (web browsing) and SWE-Bench-Verified (software engineering) benchmarks, using models such as Google's Gemini 2.5 Pro and Anthropic's Claude 3.7 Sonnet. They compared ReasoningBank against baselines including memory-free agents and agents using trajectory-based or workflow-based memory frameworks.
The results show that ReasoningBank consistently outperforms these baselines across all datasets and LLM backbones. On WebArena, it improved the overall success rate by 8.3 percentage points compared to the memory-free agent. It also generalized better to more difficult, cross-domain tasks while reducing the number of interaction steps required to complete them. When combined with MaTTS, both parallel and sequential scaling further boosted performance, consistently outperforming standard test-time scaling.
This efficiency gain has a direct impact on operating costs. Yan pointed to a case where a memory-free agent took eight trial-and-error steps to find the right product filter on a website. "Those trial-and-error costs can be avoided by leveraging relevant insights from ReasoningBank," he noted. "In this case, we save almost double the operating costs," which also improves the user experience by resolving issues faster.
For enterprises, ReasoningBank can help develop cost-effective agents that can learn from experience and adapt over time in complex workflows and areas such as software development, customer support, and data analysis. As the paper concludes, “Our findings suggest a practical path toward building adaptive and lifelong learning agents.”
Yan said these findings point toward agents that can compose what they learn. For example, a coding agent can learn different skills from different tasks, such as API integration and database management. "Over time, these modular skills… become building blocks that the agent can flexibly recombine to solve more complex tasks," he said, suggesting a future where agents can autonomously pool their knowledge to manage entire workflows with minimal human oversight.

