
A new framework developed by researchers at Google Cloud and DeepMind aims to address one of the key challenges of developing computer-use agents (CUAs): collecting high-quality training examples at scale.
The framework, dubbed Watch & Learn (W&L), tackles the problem of training-data generation in a way that does not require human annotation and can automatically extract trajectories from raw videos.
Their experiments show that data generated with W&L can be used to train or fine-tune existing computer-use and foundation models to improve their performance on computer-use tasks. Equally important, the same approach can be used to create in-context learning (ICL) examples for computer-use agents, enabling companies to build CUAs for bespoke internal tasks without the expense of training specialized models.
The CUA data bottleneck
The web is rich with video tutorials and screencasts that demonstrate complex workflows for using applications. These videos are a gold mine that can provide computer-use agents with domain knowledge and instructions for completing tasks through user-interface interactions.
However, before they can be used to train CUA agents, these videos must be transformed into annotated trajectories (i.e., task descriptions, screenshots, and the sequence of actions taken), a process that is prohibitively expensive and time-consuming when done manually.
Existing approaches to this data bottleneck rely on annotating videos with multimodal language models, which typically yields low accuracy and faulty examples. A different approach uses self-play agents that autonomously explore user interfaces to collect trajectories; however, these techniques tend to produce simplistic examples that are not useful in unpredictable real-world situations.
As the researchers write in their paper, these approaches either rely on brittle heuristics, are expensive because they depend on exploration in real environments, or generate low-complexity data that is misaligned with human intent.
Watch and learn
The Watch & Learn framework attempts to address the challenges of generating CUA trajectories by rethinking the problem formulation.
Rather than directly generating trajectories or relying on complex multi-stage pipelines, the researchers formulate the problem as an "inverse dynamics objective": given two consecutive observations, predict the intermediate action that produced the transition.
According to the researchers, this formulation is "easy to learn, avoids hand-crafted approximations and generalizes robustly across applications."
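The inverse dynamics objective can be sketched as ordinary supervised learning: the model sees a pair of consecutive observations and must predict the action between them. The sketch below is illustrative only; the `Transition` fields, the toy predictor, and the 0/1 loss are hypothetical stand-ins, not the paper's actual model or data format.

```python
from dataclasses import dataclass
from typing import Callable

# A transition is two consecutive UI observations plus the action that
# caused the change. The inverse dynamics model (IDM) learns to predict
# the action from the observation pair.
@dataclass
class Transition:
    obs_before: dict   # e.g. screenshot features or UI state (hypothetical encoding)
    obs_after: dict
    action: str        # e.g. "scroll(down)" or "click"

def idm_error_rate(predict: Callable[[dict, dict], str],
                   batch: list[Transition]) -> float:
    """Fraction of transitions where the predicted action is wrong."""
    wrong = sum(1 for t in batch if predict(t.obs_before, t.obs_after) != t.action)
    return wrong / len(batch)

# Toy predictor: if the scroll offset changed between frames, it was a scroll.
def toy_predict(before: dict, after: dict) -> str:
    if before.get("scroll_y") != after.get("scroll_y"):
        return "scroll(down)"
    return "click"

batch = [
    Transition({"scroll_y": 0}, {"scroll_y": 120}, "scroll(down)"),
    Transition({"scroll_y": 0}, {"scroll_y": 0}, "click"),
]
print(idm_error_rate(toy_predict, batch))  # → 0.0
```

In the paper's setting the predictor is a small transformer trained on hundreds of thousands of such transitions rather than a hand-written rule.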
The W&L framework comprises three main stages: training an inverse dynamics model (IDM), retrieving raw videos, and training CUA agents.
In the first phase, the researchers used agents to interact with live web pages, creating a large corpus of 500,000 state transitions (two consecutive observations and the action that caused the change). They then used this data (along with 132,000 human-annotated transitions from an existing open dataset) to train an inverse dynamics model (IDM) that takes two consecutive observations and predicts the transition action. Their trained IDM, a small transformer model, outperforms off-the-shelf foundation models at predicting transition actions.
The researchers then designed a pipeline that retrieves videos from platforms like YouTube and runs them through the IDM to generate high-quality trajectories. The IDM takes consecutive video frames and determines the actions (scroll, click) that caused changes in the environment, which are then packaged into annotated trajectories. Using this method, they generated 53,125 trajectories with high-accuracy action labels.
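The pipeline step can be sketched as sliding over consecutive frame pairs, labeling each pair with the IDM, and keeping confident labels. This is a simplified illustration under assumed interfaces; the `idm` callable, its confidence score, and the trajectory record format are hypothetical, not the paper's implementation.

```python
def video_to_trajectory(frames: list, idm, min_conf: float = 0.8) -> list[dict]:
    """Turn an ordered list of video frames into an annotated trajectory
    by running a (hypothetical) IDM over each consecutive frame pair."""
    trajectory = []
    for before, after in zip(frames, frames[1:]):
        action, conf = idm(before, after)     # predicted action + confidence
        if conf >= min_conf:                  # drop low-confidence labels
            trajectory.append({"obs": before, "action": action})
    return trajectory

# Stub IDM for illustration: labels every pair as a click with confidence 0.9.
stub_idm = lambda before, after: ("click", 0.9)
traj = video_to_trajectory(["frame0", "frame1", "frame2"], stub_idm)
print(len(traj))  # → 2 (three frames yield two transitions)
```

A real pipeline would also need frame sampling, deduplication of static frames, and the task description recovered from the video; those steps are omitted here.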
These examples can be used to train effective computer-use models for specific tasks. But the researchers also found that trajectories extracted by the IDM can serve as in-context learning examples to improve CUA performance on bespoke tasks at inference time. For ICL, they use Gemini 2.5 Flash to add reasoning annotations to the observation/action pairs in the trajectories, which can then be inserted into the CUA agent's prompt (typically three to five examples) during inference.
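Assembling such a prompt can be sketched as simple string templating: a handful of annotated trajectory steps are prepended to the task before the agent acts. The field names (`obs`, `reasoning`, `action`) and the prompt layout below are illustrative assumptions, not the paper's actual prompt format.

```python
def build_icl_prompt(task: str, examples: list[dict], k: int = 3) -> str:
    """Insert up to k annotated trajectory steps into a CUA prompt."""
    lines = [f"Task: {task}", "Worked examples from similar workflows:"]
    for ex in examples[:k]:
        lines.append(f"- Observation: {ex['obs']}")
        lines.append(f"  Reasoning: {ex['reasoning']}")
        lines.append(f"  Action: {ex['action']}")
    lines.append("Now act on the current screen.")
    return "\n".join(lines)

examples = [
    {"obs": "settings page open",
     "reasoning": "the display name lives under the profile tab",
     "action": "click('Profile')"},
]
prompt = build_icl_prompt("update the display name", examples)
print("click('Profile')" in prompt)  # → True
```

The appeal of this route is that no weights change: the same trajectories that fine-tune an open model can steer a closed general-purpose model purely through its prompt.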
"This dual role (training and in-context guidance) enables flexible integration with both open-source models and general-purpose agents," the researchers write.
W&L in action
To test the usefulness of W&L, the researchers ran a series of experiments with closed and open-source models on the OSWorld benchmark, which evaluates agents in real desktop and operating-system environments across a variety of tasks, including productivity, programming, and design.
For fine-tuning, they used their corpus of 53,000 trajectories to train two open-source models: UI-TARS-1.5, a robust open-source vision-language-action model designed specifically for computer use, and Qwen 2.5-VL, an open-weight multimodal LLM.
For the in-context learning tests, they supplied W&L examples to general-purpose multimodal models such as Gemini 2.5 Flash, OpenAI o3, and Claude Sonnet 4.
W&L improved OSWorld scores across all model categories: up to 3 points for ICL on general-purpose models and up to 11 points for fine-tuned open-source models.
More importantly, these benefits were achieved without any manual annotation, “demonstrating that web-scale human workflows can serve as a practical and scalable basis for advancing CUA to real-world deployment,” the researchers write.
This could have significant implications for real-world applications, enabling enterprises to turn their existing videos and conference recordings into training data for CUAs. It also lowers the barrier to generating new training trajectories: simply record videos of tasks being performed and have an IDM annotate them. And with frontier models continually improving and becoming cheaper, you can expect to get more out of your existing data as the field continues to progress.

