Researchers from the University of California, Berkeley, Stanford University, and Databricks have introduced a new AI optimization method called GEPA that outperforms traditional reinforcement learning (RL) techniques for adapting large language models (LLMs) to specialized tasks.
GEPA replaces the popular paradigm of learning through thousands of trial-and-error attempts guided by simple numerical scores. Instead, it uses an LLM's own language understanding to reflect on its performance, diagnose errors, and iteratively evolve its instructions. In addition to being more accurate than established techniques, GEPA is significantly more efficient, achieving better results with up to 35 times fewer trial runs.
For businesses building complex AI agents and workflows, this translates directly into faster development cycles, substantially lower computational costs, and more performant, reliable applications.
The high cost of optimizing modern AI systems
Modern enterprise AI applications are rarely a single call to an LLM. They are often "compound AI systems": complex workflows that chain together multiple LLM modules, external tools such as databases or code interpreters, and custom logic to perform sophisticated tasks, including multi-step research and data analysis.
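For a concrete picture, a compound system might look like the minimal Python sketch below. The function names (call_llm, search_database, answer_question) are illustrative stand-ins, not code from the paper.

```python
# A minimal sketch of a "compound AI system": several LLM modules chained
# together with an external tool. All names here are hypothetical stand-ins.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call (e.g., an OpenAI or Anthropic client)."""
    return f"<llm answer for: {prompt[:40]}...>"

def search_database(query: str) -> list[str]:
    """Stand-in for an external tool such as a vector database."""
    return [f"doc about {query}"]

def answer_question(question: str) -> str:
    # Module 1: an LLM rewrites the user question into a search query.
    query = call_llm(f"Rewrite as a search query: {question}")
    # External tool call: retrieve supporting documents.
    docs = search_database(query)
    # Module 2: a second LLM call synthesizes the final answer from the docs.
    return call_llm(f"Answer '{question}' using: {docs}")

print(answer_question("Which paper introduced GEPA?"))
```

Each module has its own prompt, which is exactly what makes hand-tuning such systems tedious and what GEPA sets out to automate.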
A popular way to optimize these systems is through reinforcement learning methods such as Group Relative Policy Optimization (GRPO), a technique employed in popular reasoning models, including DeepSeek-R1. This method treats the system as a black box: it runs a task, receives a simple success metric (a "scalar reward," such as a score of 7/10), and uses this feedback to slowly nudge the model's parameters in the right direction.
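To make the information bottleneck concrete, here is a toy sketch (not GRPO itself, whose actual update rule is a policy gradient over model weights) of what the optimizer "sees" in this setup: one number per rollout, with no explanation attached. All functions are invented stand-ins.

```python
# Toy illustration of scalar-reward optimization: the optimizer only ever
# observes a single number per rollout. Not an implementation of GRPO.

import random

def run_system(params: float, task: str) -> str:
    """Stand-in for executing the whole compound AI system once."""
    return f"output({params:.2f}, {task})"

def score(output: str) -> float:
    """Scalar reward, e.g., 0.7 for a 7/10 answer. No diagnosis attached."""
    return random.random()  # placeholder scoring for illustration

params = 0.0
for step in range(1000):  # RL methods typically need thousands of rollouts...
    reward = score(run_system(params, "some task"))
    params += 0.01 * (reward - 0.5)  # ...because each yields a single number
```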
The major flaw of RL is its sample inefficiency. To learn effectively from these sparse numerical scores, RL methods often require tens of thousands, or even hundreds of thousands, of trial runs, known as "rollouts." For any real-world enterprise application that involves expensive tool calls (e.g., API queries, code compilation) or uses powerful proprietary models, this process is prohibitively slow and expensive.
This cost and complexity is a major barrier for many companies, said Lakshya A Agrawal, co-author of the paper and a doctoral student at UC Berkeley. "For many teams, RL is not practical due to its cost and complexity, and so far, they would often resort to hand prompt engineering instead," Agrawal said. He noted that GEPA is designed for teams that need to optimize systems built on top-tier models that often can't be fine-tuned, allowing them to improve performance without managing custom GPU clusters.
The researchers frame the challenge as follows: "How can we extract maximal learning signal from every expensive rollout to enable effective adaptation of complex, modular AI systems in low-data or budget-constrained settings?"
An optimizer that learns with language

GEPA (Genetic-Pareto) is a prompt optimizer that tackles this challenge by replacing sparse rewards with rich, natural-language feedback. It leverages the fact that the entire execution of an AI system (including its reasoning steps, tool calls, and even error messages) can be serialized into text that an LLM can read and understand. GEPA's methodology is built on three core pillars.
The first is "genetic prompt evolution," where GEPA treats a population of prompts like a gene pool. It iteratively "mutates" prompts to create new, potentially better versions. This mutation is an intelligent process driven by the second pillar: "reflection with natural language feedback." After a few rollouts, GEPA provides an LLM with the full execution trace (what the system tried to do) and the outcome (what went right or wrong). The LLM then "reflects" on this feedback in natural language to diagnose the problem and write an improved, more detailed prompt. For instance, instead of just seeing a low score on a code generation task, it might analyze a compiler error and conclude that the prompt needs to specify a particular library version.
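A single reflective mutation step might look roughly like the following sketch, assuming any chat-capable model behind a stand-in reflect_llm function; the prompt wording and example values are illustrative, not taken from the paper.

```python
# A minimal sketch of a GEPA-style reflective mutation step: the reflector
# LLM reads the current prompt, the execution trace, and the outcome, then
# proposes an improved prompt. reflect_llm is a hypothetical stand-in.

def reflect_llm(prompt: str) -> str:
    """Stand-in for an LLM call used as the 'reflector'."""
    return "Improved instruction: pin the library to version 2.1 before compiling ..."

def mutate_prompt(current_prompt: str, trace: str, outcome: str) -> str:
    """One reflective mutation: read the trace, diagnose, rewrite the prompt."""
    reflection_request = (
        "You are improving an AI system's instructions.\n"
        f"Current instruction:\n{current_prompt}\n\n"
        f"Execution trace (reasoning steps, tool calls, errors):\n{trace}\n\n"
        f"Outcome:\n{outcome}\n\n"
        "Diagnose what went wrong and write a better, more detailed instruction."
    )
    return reflect_llm(reflection_request)

new_prompt = mutate_prompt(
    current_prompt="Write CUDA code for the task.",
    trace="compiler error: undefined symbol in libfoo (version mismatch)",
    outcome="score 2/10: code failed to compile",
)
```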
The third pillar is "Pareto-based selection," which ensures smart exploration. Instead of focusing only on the single best-performing prompt, which risks getting stuck in a suboptimal solution (a "local optimum"), GEPA maintains a diverse roster of "specialist" prompts. It tracks which prompts perform best on different individual examples, creating a list of top candidates. By sampling from this diverse set of winning strategies, GEPA ensures it explores more solutions and is more likely to discover a prompt that generalizes well across a wide range of inputs.
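The selection rule can be sketched in a few lines: keep every prompt that achieves the top score on at least one training example, then sample a parent from that set for the next mutation. The scores below are invented for illustration.

```python
# A sketch of Pareto-based candidate selection as described above.
# scores[prompt_id][example_id] -> score of that prompt on that example.

import random

scores = {
    "prompt_A": [0.9, 0.2, 0.4],  # specialist on example 0
    "prompt_B": [0.3, 0.8, 0.5],  # specialist on example 1
    "prompt_C": [0.5, 0.5, 0.6],  # specialist on example 2
}

def pareto_candidates(scores: dict[str, list[float]]) -> set[str]:
    """Return every prompt that is the best on at least one example."""
    winners = set()
    num_examples = len(next(iter(scores.values())))
    for ex in range(num_examples):
        best = max(scores, key=lambda p: scores[p][ex])
        winners.add(best)
    return winners

candidates = pareto_candidates(scores)      # all three survive here
parent = random.choice(sorted(candidates))  # diverse parent for the next mutation
```

Sampling from this frontier, rather than greedily keeping one champion, is what preserves the diversity the researchers credit for avoiding local optima.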

The effectiveness of this entire process hinges on what the researchers call "feedback engineering." Agrawal explains that the key is to surface the rich, textual details that systems already produce but often discard. "Traditional pipelines often reduce this detail to a single numerical reward, obscuring why particular outcomes occur," he said. "GEPA's core guidance is to engineer feedback that surfaces not only outcomes but also intermediate trajectories and errors in plain text, the same evidence a human would use to diagnose system behavior."
For example, for a document retrieval system, this means listing which documents were retrieved correctly and which were missed, rather than just computing a final score.
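In code, such feedback for a retrieval module might be structured like this sketch; the helper name and feedback format are our own, not the paper's.

```python
# A sketch of "feedback engineering" for a retrieval module: return text
# naming which documents were found and which were missed, not just a score.

def retrieval_feedback(retrieved: set[str], relevant: set[str]) -> tuple[float, str]:
    hits = retrieved & relevant
    missed = relevant - retrieved
    score = len(hits) / len(relevant) if relevant else 1.0
    feedback = (
        f"Score: {score:.2f}. "
        f"Correctly retrieved: {sorted(hits)}. "
        f"Missed (still needed): {sorted(missed)}."
    )
    return score, feedback  # the text, not just the number, goes to the reflector

score, feedback = retrieval_feedback(
    retrieved={"doc1", "doc3"},
    relevant={"doc1", "doc2"},
)
print(feedback)
# e.g.: Score: 0.50. Correctly retrieved: ['doc1']. Missed (still needed): ['doc2'].
```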
GEPA in action
The researchers evaluated GEPA across four diverse tasks, including multi-hop question answering (HotpotQA) and privacy-preserving queries (PUPA). They used both open-source (Qwen3 8B) and proprietary (GPT-4.1 mini) models, comparing GEPA against RL-based GRPO and the state-of-the-art prompt optimizer MIPROv2.
Across all tasks, GEPA substantially outperformed GRPO, achieving up to a 19% higher score while using up to 35 times fewer rollouts. Agrawal offered a concrete example of this efficiency gain: "We used GEPA to optimize a QA system in about 3 hours versus GRPO's 24 hours, an 8x reduction in development time, while also achieving 20% higher performance," he explained. "RL-based optimization of the same scenario in our tests cost about $300 in GPU time, while GEPA cost less than $20 to achieve better results, a 15x savings in our experiments."

Beyond raw performance, the researchers found that GEPA-optimized systems are more reliable when facing new, unseen data. This is measured by the "generalization gap," the difference between performance on training data and performance on the final test data. Agrawal hypothesizes that this is because GEPA learns from richer feedback. "GEPA's smaller generalization gap may stem from its use of rich natural-language feedback on each outcome, capturing what worked, what failed, and why, rather than relying solely on a single scalar reward," he said. "This may encourage the system to develop instructions and strategies grounded in a broader understanding of success, rather than learning patterns specific to the training data." For enterprises, this improved reliability means less brittle, more adaptable AI applications in customer-facing roles.
A key practical benefit is that GEPA's instruction-based prompts are up to 9.2 times shorter than the prompts produced by optimizers like MIPROv2, which include many few-shot examples. Shorter prompts reduce latency and cut costs for API-based models, making the final application faster and cheaper to run in production.
The paper also presents promising results for using GEPA as an "inference-time" search strategy, turning the AI from a single-shot answer generator into an iterative problem solver. Agrawal described a scenario where GEPA could be integrated into a company's CI/CD pipeline. When new code is committed, GEPA could automatically generate and refine several optimized versions, test them for performance, and open a pull request with the best-performing variant for engineers to review. "This turns optimization into a continuous, automated process, rapidly generating solutions that often match or surpass expert hand-tuning," Agrawal said. In their experiments on CUDA code generation, this approach raised performance to an expert level on 20% of tasks, compared to 0% for a single-shot attempt from GPT-4o.
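As a rough sketch, such a pipeline hook might look like the following, with every function a hypothetical stand-in for the real GEPA search, benchmarking, and repository automation.

```python
# A sketch of the CI/CD scenario described above: generate several refined
# code variants, benchmark them, and propose the best one for human review.

def generate_variants(source: str, n: int = 4) -> list[str]:
    """Stand-in for GEPA-style iterative generation and refinement."""
    return [f"{source}  # optimized variant {i}" for i in range(n)]

def benchmark(code: str) -> float:
    """Stand-in for running the project's performance tests."""
    return float(len(code))  # placeholder metric for illustration only

def open_pull_request(code: str) -> None:
    """Stand-in for repo automation that files a PR for review."""
    print(f"PR opened with:\n{code}")

def on_commit(source: str) -> None:
    variants = generate_variants(source)
    best = max(variants, key=benchmark)  # keep the best-performing variant
    open_pull_request(best)              # engineers still review and merge

on_commit("def kernel(x): ...")
```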
The paper's authors believe GEPA is a foundational step toward a new paradigm of AI development. But beyond creating more human-like AI, its most immediate impact may be on who can build high-performing AI systems.
"We hope GEPA enables a positive shift in AI system building, making the optimization of such systems approachable by end users, who often hold domain expertise relevant to the task, but not necessarily the time and inclination to learn complex RL specifics," Agrawal said. "It directly empowers the stakeholders with the exact task-relevant domain knowledge."