A new framework from researchers at the University of Hong Kong (HKU) and collaborating institutions provides an open-source foundation for building robust AI agents that operate computers. The framework, called OpenCUA, includes tools, data and recipes to accelerate the development of computer-use agents (CUAs).
Models trained with this framework perform strongly on CUA benchmarks, outperforming existing open-source models and competing closely with agents from leading AI labs such as OpenAI and Anthropic.
The challenge of building computer-use agents
Computer-use agents are designed to autonomously complete tasks on a computer, from navigating websites to operating complex software. They can also help automate workflows in the enterprise. However, the most capable CUA systems are proprietary, with important details about their training data, architectures and development processes kept private.
“The lack of transparency limits technical progress and raises safety concerns, as the research community needs truly open CUA frameworks to study their capabilities, limitations and risks,” the researchers write in their paper.
At the same time, open-source efforts face their own hurdles. There has been no scalable infrastructure for collecting the diverse, large-scale data needed to train these agents. Existing open-source datasets for graphical user interfaces (GUIs) contain limited data, and many research projects provide insufficient detail about their methods, making it difficult for others to replicate their work.
“These limitations collectively impede advances in general-purpose CUAs and restrict their scalability, generalizability and meaningful exploration of possible learning approaches,” according to the paper.
Introducing OpenCUA

OpenCUA is an open-source framework designed to address these challenges by scaling both data collection and the models themselves. At its core is an annotation tool for recording human demonstrations of computer tasks across different operating systems.
Running in the background on an annotator’s personal computer, the tool streamlines data collection by capturing screen video, mouse and keyboard input, and the underlying accessibility tree, which provides structured information about on-screen elements. This raw data is then processed into “state-action trajectories,” pairing a screenshot of the computer (the state) with the user’s corresponding action (a click, key press, etc.). Annotators can then review, edit and submit these demonstrations.
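To make the data model concrete, here is a minimal sketch of what one state-action trajectory could look like. The class and field names are illustrative, not the tool's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    kind: str        # e.g. "click", "type", "scroll"
    target: str      # accessibility-tree element the action applies to
    value: str = ""  # typed text or key combination, if any

@dataclass
class Step:
    screenshot_path: str  # the "state": a screenshot captured at this moment
    action: Action        # the user's corresponding action

@dataclass
class Trajectory:
    task: str
    os_name: str
    steps: List[Step] = field(default_factory=list)

# Raw events (screen video plus the input log) are reduced to such pairs:
traj = Trajectory(task="Rename a downloaded file", os_name="Ubuntu")
traj.steps.append(Step("frame_001.png", Action("click", "file_icon")))
traj.steps.append(Step("frame_002.png", Action("type", "name_field", "report.pdf")))
```

Each step is thus a self-contained (screenshot, action) pair, which is what later stages of the pipeline consume.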

Using this tool, the researchers collected the AgentNet dataset, comprising more than 22,600 task demonstrations across Windows, macOS and Ubuntu, spanning over 200 applications and websites. “This dataset authentically captures the complexity of human behaviors and environmental dynamics from users’ personal computing environments,” the paper notes.
Recognizing that screen-recording tools raise significant privacy concerns for enterprises, the researchers designed the annotation tool with safeguards. Xinyuan Wang, co-author of the paper and PhD student at HKU, explained that the team implemented a multi-layer privacy protection framework. “First, annotators can fully view the data they generate themselves … before deciding whether to submit it,” he told VentureBeat. The data then undergoes manual verification for privacy issues, as well as scanning by a large model to detect any remaining sensitive content before release. “This layered process ensures enterprise-grade robustness for scenarios that handle sensitive customer or financial data,” Wang said.
To speed up evaluation, the team also curated AgentNetBench, an offline benchmark that provides multiple correct actions for each step, offering a more efficient way to measure an agent’s performance.
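The offline scoring idea is simple: since several actions can be valid at any step (for example, opening a menu by click or by keyboard shortcut), a predicted action counts as correct if it matches any acceptable one. A minimal sketch, with invented function names and action strings rather than the benchmark's real format:

```python
def step_correct(predicted: str, acceptable: list) -> bool:
    """A prediction is correct if it matches any acceptable action for the step."""
    return predicted in acceptable

def score_task(predictions: list, gold_steps: list) -> float:
    """Fraction of steps where the agent's action matched an acceptable one."""
    correct = sum(
        step_correct(pred, acceptable)
        for pred, acceptable in zip(predictions, gold_steps)
    )
    return correct / len(gold_steps)

gold = [
    ["click(start_menu)", "press(super)"],   # either way to open the menu is fine
    ["type('terminal')"],
    ["press(enter)", "click(terminal_icon)"],
]
preds = ["press(super)", "type('terminal')", "click(terminal_icon)"]
print(score_task(preds, gold))  # 1.0
```

Because no live environment has to be driven, this kind of check runs far faster than online evaluation, at the cost of only crediting action sequences the benchmark anticipated.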
A new recipe for training agents
The OpenCUA framework introduces a novel pipeline for processing data and training computer-use agents. The first step converts raw human demonstrations into clean state-action pairs suitable for training vision-language models (VLMs). However, the researchers found that simply training models on these pairs yields limited performance gains, even with large amounts of data.

The key insight was to augment these trajectories with chain-of-thought (CoT) reasoning. This process generates a detailed “inner monologue” for each action, comprising planning, memory and reflection. This structured reasoning is organized into three levels: a high-level observation of the screen; reflective thoughts that analyze the situation and plan the next steps; and, finally, the concise, executable action. This approach helps the agent develop a deeper understanding of tasks.
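The three-level structure described above can be pictured as a per-step record that is serialized into the text the model learns to produce. A minimal sketch, with invented field contents and formatting (the paper's actual serialization may differ):

```python
# One augmented step: observation -> thought -> action, per the article's
# description. The contents here are made up for illustration.
step_with_cot = {
    "observation": "The Settings window is open; the 'Display' tab is visible.",
    "thought": (
        "The task is to enable dark mode. I already opened Settings; "
        "the toggle is likely under 'Appearance', so I should click that tab."
    ),
    "action": "click(element='Appearance tab')",
}

def to_training_text(step: dict) -> str:
    """Serialize the structured reasoning into a training target for a VLM."""
    return (
        f"Observation: {step['observation']}\n"
        f"Thought: {step['thought']}\n"
        f"Action: {step['action']}"
    )

print(to_training_text(step_with_cot))
```

Training on the full serialized text, rather than the bare action string, is what forces the model to internalize the reasoning behind each step.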
“We find natural language reasoning to be generally important for computer-use foundation models, helping to internalize CUA cognitive capabilities,” the researchers write.
This data-synthesis pipeline is a general framework that companies can adapt to train agents on their own bespoke internal tools. According to Wang, an enterprise can record demonstrations of its proprietary workflows and run them through the same “reflector” and “generator” pipeline to create the necessary training data. “This allows them to train a high-performing agent tailored to their internal tools without manually annotating reasoning traces,” he explained.
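The generator/reflector loop can be sketched as follows. Here `generate_thought` and `critique` stand in for calls to large models (a generator drafting the inner monologue and a reflector vetting it); their bodies are placeholders for illustration, not the paper's implementation:

```python
def generate_thought(screenshot: str, action: str) -> str:
    # Placeholder: in practice, prompt a VLM with the screenshot and the
    # recorded action to draft the reasoning that would justify it.
    return f"Given the state in {screenshot}, the right next move is {action}."

def critique(thought: str, action: str) -> bool:
    # Placeholder: in practice, a reflector model checks the drafted
    # reasoning for consistency with the action. Here we only verify
    # that the thought actually references the action.
    return action in thought

def synthesize(steps: list, max_tries: int = 3) -> list:
    """Attach machine-generated reasoning to recorded (screenshot, action) pairs."""
    labeled = []
    for screenshot, action in steps:
        for _ in range(max_tries):
            thought = generate_thought(screenshot, action)
            if critique(thought, action):  # keep only reasoning the reflector accepts
                labeled.append({"state": screenshot, "thought": thought, "action": action})
                break
    return labeled

data = synthesize([("frame_001.png", "click(save_button)")])
print(len(data))  # 1
```

The point of the loop is that no human writes the reasoning: demonstrations go in, and reasoning-annotated training examples come out, with the reflector filtering drafts that don't hold up.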
Putting OpenCUA to the test
The researchers applied the OpenCUA framework to train a range of open-source VLMs, including Qwen and Kimi-VL variants, with parameter counts ranging from 3 billion to 32 billion. The models were evaluated on a suite of online and offline benchmarks that test their ability to perform tasks and understand GUIs.
The 32-billion-parameter model, OpenCUA-32B, established a new state-of-the-art success rate among open-source models on the OSWorld-Verified benchmark. It also surpassed OpenAI’s GPT-4o-based CUA and narrowed the performance gap with Anthropic’s leading proprietary models.

For enterprise developers and product leaders, the research offers several key takeaways. The OpenCUA method is broadly applicable, improving performance on models with different architectures (both dense and mixture-of-experts) and sizes. The trained agents also show strong generalization, performing well across a diverse range of tasks and operating systems.
According to Wang, the framework is particularly suited to automating repetitive, labor-intensive enterprise workflows. “For example, in the AgentNet dataset, we have already captured demonstrations of launching EC2 instances on Amazon AWS and configuring annotation parameters on MTurk,” he told VentureBeat. “These tasks involve many sequential steps but follow repeatable patterns.”
However, Wang noted that important challenges around safety and reliability must still be solved to bridge the gap to live deployment. “The biggest challenge in real-world deployment is safety and reliability: The agent must avoid mistakes that could inadvertently alter system settings or trigger harmful side effects beyond the intended task,” he said.
The researchers have released the code, dataset and model weights for their work.
As open-source agents built on frameworks like OpenCUA become more capable, they could fundamentally change the relationship between knowledge workers and their computers. Wang envisions a future where proficiency in complex software matters less than the ability to clearly articulate goals to an AI agent.
He described two primary modes of work: “offline automation, where the agent leverages its comprehensive software knowledge to carry out a task end-to-end,” and “online collaboration, where the agent reacts in real time and works shoulder to shoulder with humans like a colleague.” In essence, humans will provide the strategic “what,” while increasingly sophisticated AI agents handle the operational “how.”

