
    Google’s ‘Watch and Learn’ framework removes the data barrier for training computer-using agents

By PineapplesUpdate | October 31, 2025

A new framework developed by researchers at Google Cloud and DeepMind aims to address one of the key challenges of building computer-use agents (CUAs): collecting high-quality training examples at scale.

The framework, dubbed Watch and Learn (W&L), tackles the problem of training-data generation in a way that requires no human annotation and can automatically extract demonstrations from raw videos.

Their experiments show that data generated with W&L can be used to train or fine-tune existing computer-use and foundation models, improving their performance on computer-use tasks. Just as important, the same approach can produce in-context learning (ICL) examples for computer-use agents, enabling companies to build CUAs for bespoke internal tasks without the expense of training specialized models.

The CUA data bottleneck

The web is rich with video tutorials and screencasts that walk through complex application workflows. These videos are a gold mine that can provide computer-use agents with domain knowledge and instructions for completing tasks through user-interface interactions.

However, before they can be used to train CUA agents, these videos must be transformed into annotated trajectories (i.e., a task description, screenshots, and the sequence of actions taken), a process that is prohibitively expensive and time-consuming when done manually.
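To make the notion of an annotated trajectory concrete, here is a minimal sketch of what such a record might look like. The class and field names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    screenshot: bytes   # raw screen capture taken before the action
    action: str         # e.g. "click(x=312, y=88)" or "scroll(dy=-200)"

@dataclass
class Trajectory:
    task_description: str            # natural-language goal for the episode
    steps: list[Step] = field(default_factory=list)

    def add_step(self, screenshot: bytes, action: str) -> None:
        """Append one observation/action pair to the trajectory."""
        self.steps.append(Step(screenshot, action))

# Hypothetical two-step episode: open the File menu, then choose Export.
traj = Trajectory("Export the current sheet as CSV")
traj.add_step(b"<png bytes>", "click(x=24, y=12)")
traj.add_step(b"<png bytes>", "click(x=60, y=180)")
```

Producing thousands of records like this by hand is exactly the bottleneck W&L is designed to remove.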

Existing approaches to this data bottleneck rely on multimodal language models to annotate the videos, which typically yields low accuracy and faulty examples. A different approach uses self-play agents that autonomously explore user interfaces to collect trajectories, but these techniques tend to produce simplistic examples that are not useful in unpredictable real-world situations.

As the researchers write in their paper, “Overall, these approaches either rely on brittle heuristics, are expensive because they require exploration in real environments, or generate low-complexity demonstrations that do not reflect human intent.”

Watch and Learn

The Watch and Learn framework tackles the challenge of creating CUA demonstrations by rethinking the problem formulation.

Rather than directly generating trajectories or relying on complex multi-stage pipelines, the researchers frame the problem as an “inverse dynamics objective”: given two consecutive observations, predict the intermediate action that produced the transition.

According to the researchers, this formulation is “easy to learn, avoids hand-crafted heuristics and generalizes robustly across applications.”

The W&L framework can be divided into three major stages: training an inverse dynamics model (IDM), retrieving and annotating raw videos, and training CUA agents.

In the first stage, the researchers used agents interacting with live web pages to build a collection of 500,000 state transitions (two consecutive observations plus the action that caused the change). They then used this data, along with 132,000 human-annotated transitions from an existing open dataset, to train an inverse dynamics model (IDM) that takes two consecutive observations and predicts the intervening action. Their trained IDM, a small transformer model, outperformed off-the-shelf foundation models at predicting transition actions.

The researchers then designed a pipeline that retrieves videos from platforms like YouTube and runs them through the IDM to generate high-quality trajectories. The IDM takes consecutive video frames and determines the actions (scroll, click, etc.) that caused the changes on screen, which are then packaged into annotated trajectories. Using this method, they generated 53,125 trajectories with high-accuracy action labels.
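The video-to-trajectory step above can be sketched as a loop over consecutive frame pairs, keeping only the transitions where the IDM detects an action. The function names and the stand-in IDM are illustrative assumptions, not the paper's actual pipeline.

```python
from typing import Callable, List, Optional, Tuple

Frame = List[int]  # toy stand-in for a video frame

def frames_to_trajectory(
    frames: List[Frame],
    idm: Callable[[Frame, Frame], Optional[str]],
) -> List[Tuple[Frame, str]]:
    """Label each frame transition with the action the IDM infers.

    Transitions where the IDM returns None (no detected action) are skipped,
    so idle stretches of video do not pollute the trajectory.
    """
    trajectory = []
    for before, after in zip(frames, frames[1:]):
        action = idm(before, after)
        if action is not None:
            trajectory.append((before, action))
    return trajectory

# Stand-in IDM: report a "click" whenever exactly one pixel changed.
def fake_idm(before: Frame, after: Frame) -> Optional[str]:
    changed = sum(a != b for a, b in zip(before, after))
    return "click" if changed == 1 else None

frames = [[0, 0], [0, 1], [0, 1], [1, 1]]
steps = frames_to_trajectory(frames, fake_idm)
# The identical middle pair is skipped; the two single-pixel changes are kept.
```

In the real system the IDM runs over screen recordings and emits UI actions, but the shape of the computation is the same: frame pairs in, labeled steps out.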

These examples can be used to train effective computer-use models for specific tasks. But the researchers also found that trajectories extracted by the IDM can serve as in-context learning examples, improving CUA performance on bespoke tasks at inference time. For ICL, they use Gemini 2.5 Flash to add reasoning annotations to the observation/action pairs in the trajectories, which can then be inserted into the CUA agent’s prompt (typically 3–5 examples) during inference.
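A minimal sketch of that prompt-assembly step, assuming a simple dict format for the annotated examples (the keys and prompt layout are assumptions, not the paper's actual template):

```python
def build_icl_prompt(task: str, examples: list[dict], k: int = 3) -> str:
    """Prepend up to k reasoning-annotated demonstration steps to the task prompt."""
    lines = ["You control a computer via UI actions. Worked examples:"]
    for ex in examples[:k]:
        lines.append(f"- Observation: {ex['observation']}")
        lines.append(f"  Reasoning: {ex['reasoning']}")  # annotation added by an LLM pass
        lines.append(f"  Action: {ex['action']}")
    lines.append(f"Now complete this task: {task}")
    return "\n".join(lines)

demo = [{
    "observation": "File menu visible",
    "reasoning": "Export options live under the File menu",
    "action": "click(File)",
}]
prompt = build_icl_prompt("Export the sheet as CSV", demo)
```

Because the examples ride along in the prompt rather than in the weights, this works with closed models the enterprise cannot fine-tune.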

    “This dual role (training and guiding in context) enables flexible integration with both open-source models and general-purpose agents,” the researchers write.

    W&L in action

To test the usefulness of W&L, the researchers ran a series of experiments with closed and open-source models on the OSWorld benchmark, which evaluates agents in real desktop and operating-system environments across a variety of tasks, including productivity, programming, and design.

For fine-tuning, they used their corpus of 53,000 trajectories to train two open-source models: UI-TARS-1.5, a robust, open-source vision-language-action model designed specifically for computer use, and Qwen 2.5-VL, an open-weight multimodal LLM.

For the in-context learning tests, they applied W&L examples to general-purpose multimodal models such as Gemini 2.5 Flash, OpenAI o3, and Claude Sonnet 4.

W&L delivered OSWorld improvements across all model categories: up to 3 points for ICL on general-purpose models and up to 11 points for fine-tuned open-source models.

    More importantly, these benefits were achieved without any manual annotation, “demonstrating that web-scale human workflows can serve as a practical and scalable basis for advancing CUA to real-world deployment,” the researchers write.

This could have significant implications for real-world applications, enabling enterprises to turn their existing videos and conference recordings into CUA training data. It also lowers the cost of generating new training trajectories: simply record videos of tasks being performed and have the IDM annotate them. And with frontier models continually improving and getting cheaper, you can expect to extract more value from existing data as the field advances.
