
2025 was supposed to be the year of "AI agents," according to Nvidia CEO Jensen Huang and other AI industry figures. In many ways it has been, with major AI model providers such as OpenAI, Google and even Chinese rivals such as Alibaba releasing fine-tuned AI models or applications designed to focus on a narrow set of tasks, such as web search and report writing.
But a major hurdle remains on the path to high-performing, reliable AI agents: keeping them on task as the task expands to multiple stages. Third-party benchmark tests show that even the most powerful AI models fail more often the more steps they take to complete a task and the longer they spend on it (beyond a few hours).
A new academic framework called EAGLET proposes a practical and efficient method to improve long-horizon task performance in LLM-based agents – without the need for manual data labeling or retraining.
Developed by researchers at Tsinghua University, Peking University, DeepLang AI and the University of Illinois Urbana-Champaign, EAGLET offers a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.
EAGLET is a fine-tuned language model that interprets task instructions – usually provided as prompts by the user or the agent’s operating environment – and produces a high-level plan for the agent (which is powered by its own LLM). It does not intervene during execution, but its up-front guidance helps reduce planning errors and improve task completion rates.
Solving the planning problem in long-horizon agents
Many LLM-based agents struggle with long-horizon tasks because they rely on reactive, step-by-step reasoning. This approach often leads to trial-and-error behavior, planning hallucinations, and inefficient trajectories.
EAGLET addresses this limitation by introducing a global planning module that works alongside the executor agent.
Instead of blending planning and action generation in a single model, EAGLET separates them, enabling more coherent, task-level strategies.
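To make the separation concrete, here is a minimal sketch of a plan-then-execute loop. It is an illustration only: the article does not describe EAGLET's interface, so `run_with_global_plan`, the toy planner, the toy policy and the toy environment are all hypothetical names invented for this example. The point it shows is that the planner is called once, before execution, and the executor then acts step by step with the plan held fixed in its context.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not EAGLET's actual API).
Planner = Callable[[str], str]                  # task instruction -> high-level plan
Policy = Callable[[str, str, List[str]], str]   # (task, plan, history) -> next action
EnvStep = Callable[[str], Tuple[str, bool]]     # action -> (observation, done)


def run_with_global_plan(task: str, planner: Planner, policy: Policy,
                         env_step: EnvStep, max_steps: int = 20):
    """Call the planner once up front, then let the executor act step by step."""
    plan = planner(task)  # the planner does not intervene after this point
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(task, plan, history)
        observation, done = env_step(action)
        history.append(f"{action} -> {observation}")
        if done:
            break
    return plan, history


# Toy stand-ins so the sketch runs without any model or environment.
def toy_planner(task: str) -> str:
    return "1. find the item; 2. pick it up; 3. deliver it"


def toy_policy(task: str, plan: str, history: List[str]) -> str:
    # A real executor would prompt its own LLM with the task, plan and history.
    return ["find", "pick up", "deliver"][len(history)]


def toy_env_step(action: str) -> Tuple[str, bool]:
    return "ok", action == "deliver"


plan, history = run_with_global_plan(
    "bring the mug to the desk", toy_planner, toy_policy, toy_env_step
)
print(plan)
print(history)
```

Because the executor only receives the plan as extra context, swapping in a different executor model would not require touching the planner in this sketch.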
Two-stage training pipeline with no human annotations
EAGLET’s planner is trained using a two-stage process that does not require any human-written plans or annotations.
The first stage involves generating synthetic plans with high-capability LLMs, such as GPT-5 and DeepSeek-V3.1-Think.
These plans are then filtered using a novel strategy called homologous consensus filtering, which retains only those that improve task performance for both expert and novice executor agents.
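The filtering idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `consensus_filter`, the `score` callback and the toy score table are all hypothetical, and "improves performance" is reduced here to "scores strictly higher than the same executor does without a plan."

```python
from typing import Callable, Dict, List, Optional

# score(executor_name, task, plan) -> task score; plan=None means "run without a plan".
# This callback is a hypothetical stand-in for actually running an executor agent.
ScoreFn = Callable[[str, str, Optional[str]], float]


def consensus_filter(task: str, candidate_plans: List[str],
                     executors: List[str], score: ScoreFn) -> List[str]:
    """Keep only plans that beat the no-plan baseline for every executor."""
    baseline: Dict[str, float] = {name: score(name, task, None) for name in executors}
    kept = []
    for plan in candidate_plans:
        if all(score(name, task, plan) > baseline[name] for name in executors):
            kept.append(plan)
    return kept


# Toy score table (made-up numbers, for illustration only).
TOY_SCORES = {
    ("expert", None): 0.70, ("novice", None): 0.30,
    ("expert", "plan_a"): 0.85, ("novice", "plan_a"): 0.55,  # helps both
    ("expert", "plan_b"): 0.90, ("novice", "plan_b"): 0.25,  # helps only the expert
}


def toy_score(name: str, task: str, plan: Optional[str]) -> float:
    return TOY_SCORES[(name, plan)]


print(consensus_filter("demo task", ["plan_a", "plan_b"], ["expert", "novice"], toy_score))
```

In this toy run, `plan_b` is dropped even though it gives the expert its best score, because it does not help the novice.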
In the second stage, a rule-based reinforcement learning process further refines the planner, using a custom-designed reward function to assess how much each plan helps multiple agents succeed.
Introducing the Executor Capability Gain Reward (ECGR)
One of the key innovations of EAGLET is the Executor Capability Gain Reward (ECGR).
This reward measures the value of the generated plan by examining whether it helps both high- and low-ability agents complete tasks more successfully and with fewer steps.
It also includes a decay factor to favor shorter, more efficient task trajectories. This approach avoids over-rewarding plans that are only useful to already-competent agents and promotes more generalizable planning guidance.
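The article does not give the reward's exact formula, so the following is only a rough sketch of the idea under explicit assumptions: a successful run is worth `gamma ** steps` (the decay factor), a failed run is worth zero, and the reward is the average improvement across executors when the plan is supplied. The names `Run`, `decayed_return` and `capability_gain_reward`, and the value of `gamma`, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Run:
    success: bool
    steps: int


def decayed_return(run: Run, gamma: float = 0.9) -> float:
    """Assumed scoring: success is discounted by trajectory length; failure scores 0."""
    return gamma ** run.steps if run.success else 0.0


def capability_gain_reward(with_plan: Dict[str, Run],
                           without_plan: Dict[str, Run],
                           gamma: float = 0.9) -> float:
    """Average per-executor gain from using the plan (an assumed aggregation)."""
    gains = [
        decayed_return(with_plan[name], gamma) - decayed_return(without_plan[name], gamma)
        for name in with_plan
    ]
    return sum(gains) / len(gains)


# Made-up runs: the baseline is the same for both candidate plans.
baseline = {"expert": Run(True, 12), "novice": Run(False, 20)}
helps_both = {"expert": Run(True, 9), "novice": Run(True, 10)}
helps_expert_only = {"expert": Run(True, 9), "novice": Run(False, 20)}

print(capability_gain_reward(helps_both, baseline))
print(capability_gain_reward(helps_expert_only, baseline))
```

Under these assumptions, the plan that also rescues the novice earns the larger reward, and shortening a successful trajectory raises the reward even when the outcome is unchanged.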
Compatible with existing agents and models
The EAGLET planner is designed to be modular and "plug-and-play," meaning it can be inserted into existing agent pipelines without retraining the executor.
In evaluations, the planner boosted performance across a variety of foundation models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.
It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches like Reflexion.
State-of-the-art performance across all benchmarks
EAGLET was tested on three widely used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates scientific experiments in a text-based laboratory environment; ALFWorld, which tasks agents with completing household activities through natural language in a simulated home setting; and WebShop, which evaluates goal-driven behavior in a realistic online shopping interface.
Across all three, executor agents equipped with EAGLET outperformed their non-planning counterparts and other planning baselines, including MPO and KnowAgent.
In experiments with the open-source Llama-3.1-8B-Instruct model, EAGLET raised average performance from 39.5 to 59.4, a gain of +19.9 points across tasks.
On ScienceWorld’s unseen scenarios, it raised performance from 42.2 to 61.6.
On ALFWorld’s seen scenarios, EAGLET improved results from 22.9 to 54.3, a performance increase of more than 2.3×.
Gains also held for more capable models, though by smaller margins.
For example, with EAGLET the average score for GPT-4.1 improved from 75.5 to 82.2, and GPT-5 rose from 84.5 to 88.1, despite already being a strong performer.
In some benchmarks, the performance gain was as high as +11.8 points, such as when combining EAGLET with the ETO executor method on ALFWorld unseen tasks.
Compared to other planning baselines such as MPO, EAGLET consistently delivered higher task completion rates. For example, on ALFWorld’s unseen tasks with GPT-4.1, MPO scored 79.1 points, while EAGLET scored 83.6 points – a +4.5 point advantage.
Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as the executor, the average step count dropped from 13.0 (no planner) to 11.1 (EAGLET). With GPT-5, it dropped from 11.4 to 9.4, supporting the claim of better execution efficiency.
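For readers who want to compare these figures on a common footing, the short script below recomputes absolute gains, relative improvement and step reductions. It uses only the numbers quoted in this article; the dictionary labels and helper names are ours.

```python
# (before, after) scores as quoted in the article.
scores = {
    "Llama-3.1-8B-Instruct, average": (39.5, 59.4),
    "Llama-3.1-8B-Instruct, ScienceWorld unseen": (42.2, 61.6),
    "Llama-3.1-8B-Instruct, ALFWorld seen": (22.9, 54.3),
    "GPT-4.1, average": (75.5, 82.2),
    "GPT-5, average": (84.5, 88.1),
}

# (without planner, with EAGLET) average step counts as quoted in the article.
steps = {
    "GPT-4.1": (13.0, 11.1),
    "GPT-5": (11.4, 9.4),
}


def point_gain(before: float, after: float) -> float:
    """Absolute improvement in points."""
    return round(after - before, 1)


def ratio(before: float, after: float) -> float:
    """Relative improvement as a multiple of the baseline."""
    return round(after / before, 2)


def percent_fewer_steps(before: float, after: float) -> float:
    """Step reduction as a percentage of the no-planner baseline."""
    return round(100 * (before - after) / before, 1)


for label, (before, after) in scores.items():
    print(f"{label}: +{point_gain(before, after)} points, {ratio(before, after)}x")

for label, (before, after) in steps.items():
    print(f"{label}: {percent_fewer_steps(before, after)}% fewer steps")
```

Run on the quoted figures, the absolute gains are largest for the smaller open model and narrower for GPT-4.1 and GPT-5, which start from much higher baselines.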
Efficiency gains in training and execution
Compared to RL-based methods like GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with about one-eighth of the training effort.
This efficiency also carries over to execution: agents using EAGLET typically need fewer steps to complete tasks, which translates into lower inference time and compute cost in production scenarios.
No public code—yet
As of the version submitted to arXiv, the authors have not released an open-source implementation of EAGLET. It is unclear when or under what license the code will be released, or how it will be maintained, which may limit the near-term usefulness of the framework for enterprise deployments.
VentureBeat has reached out to the authors to clarify these points and will update this piece when they respond.
Enterprise deployment questions remain
While the planner is described as plug-and-play, it is unclear whether EAGLET can be easily integrated into popular enterprise agent frameworks like LangChain or AutoGen, or whether it requires a custom stack to support the separation of planning and execution.
Similarly, the training setup relies on multiple executor agents, which may be difficult to replicate in an enterprise environment with limited model access. VentureBeat has asked the researchers whether the homologous consensus filtering method could be adapted for teams that have only one executor model or limited compute resources.
The authors of EAGLET report success across model types and sizes, but it is not yet known what the minimum viable model scale is for practical deployment. For example, can enterprise teams effectively use the planner with a sub-10B-parameter open model in latency-sensitive environments? Additionally, the framework may provide industry-specific value in domains such as customer support or IT automation, but it remains to be seen how easily the planner can be fine-tuned or adapted for such verticals.
Real-time vs. pre-generated planning
Another open question is how EAGLET is best deployed in practice. Should the planner run in real time alongside executors within a loop, or is it better used offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat has posed this question to the authors and will report on any insights that emerge.
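One way a team might hedge between the two options is a hybrid: look up a pre-generated plan for known task types and fall back to a live planner call otherwise. The sketch below is purely illustrative and not something the paper describes; `PlanProvider`, its methods and the toy planner are hypothetical names.

```python
from typing import Callable, Dict, Optional


class PlanProvider:
    """Serve pre-generated plans when available; otherwise call the planner live."""

    def __init__(self, planner: Callable[[str], str],
                 pregenerated: Optional[Dict[str, str]] = None):
        self.planner = planner                      # hypothetical live planner call
        self.pregenerated = dict(pregenerated or {})  # task type -> offline plan
        self.live_calls = 0                         # proxy for added latency and cost

    def get_plan(self, task_type: str, instruction: str) -> str:
        if task_type in self.pregenerated:
            return self.pregenerated[task_type]     # offline path: no planner call
        self.live_calls += 1                        # real-time path
        return self.planner(instruction)


def toy_planner(instruction: str) -> str:
    return f"plan for: {instruction}"


provider = PlanProvider(
    toy_planner,
    pregenerated={"password_reset": "1. verify identity; 2. issue reset link"},
)
print(provider.get_plan("password_reset", "User forgot their password"))
print(provider.get_plan("vpn_issue", "User cannot connect to the VPN"))
print(provider.live_calls)
```

The tradeoff this makes visible is that offline plans are cheap and predictable but generic to a task type, while live plans are tailored to the instruction at the cost of an extra model call per task.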
Strategic tradeoffs for enterprise teams
For technology leaders at medium-to-large enterprises, EAGLET offers a compelling proof of concept for improving the reliability and efficiency of LLM agents. But without public tooling or implementation guidelines, the framework still presents a build-versus-wait decision. Enterprises must weigh the potential gains in performance and efficiency against the cost of reproducing or approximating the training process in-house.
Potential use cases in enterprise settings
For enterprises developing agentic AI systems – especially in environments requiring stepwise planning, such as IT automation, customer support, or online interactions – EAGLET provides a template for how to incorporate planning without retraining. Its efficient training methodology and its ability to guide both open- and closed-source models may make it an attractive starting point for teams looking to improve agent performance with minimal overhead.

