
Some of the largest providers of large language models (LLMs) have sought to move beyond multimodal chatbots, fleshing out their models into "agents" that can actually take actions on websites on the user's behalf. Recall OpenAI's ChatGPT Agent (formerly known as "Operator") and Anthropic's Computer Use, both released in the last two years.
Now, Google is getting into the same game. Today, the search giant's DeepMind AI lab unveiled a new, fine-tuned and custom-trained version of its powerful Gemini 2.5 Pro LLM known as "Gemini 2.5 Computer Use," which can use a virtual browser to surf the web, retrieve information, fill out forms, and even take actions on websites on your behalf, all from a single user text prompt.
"These are early days, but the model’s ability to interact with the web – like scrolling, filling out forms + navigating dropdowns – is a Important next step in creating general-purpose agents," Said Google CEO Sundar Pichai, as part of Long statement on social networks, x.
The model is not available to consumers directly from Google, however.
Instead, Google has partnered with another company, Browserbase, founded by former Twilio engineer Paul Klein in early 2024, which provides virtual "headless" web browsers specifically for use by AI agents and applications. (A "headless" browser is one that does not require a graphical user interface, or GUI, to navigate the web, although in this case and others, Browserbase shows a graphical representation to the user.)
Users can access the new Gemini 2.5 Computer Use model directly on Browserbase here, and even compare it side by side with older, rival offerings from OpenAI and Anthropic in a new "Browser Arena" launched by the startup (though only one additional model can be selected alongside Gemini at a time).
For AI builders and developers, it is being made available as a raw, albeit proprietary, LLM through the Gemini API in Google AI Studio for rapid prototyping, and through Vertex AI, Google Cloud's model selection and application-building platform.
The new offering builds on the capabilities of Gemini 2.5 Pro, which was released back in March 2025 and has been updated several times since then, with a specific focus on enabling AI agents to interact directly with user interfaces, including browsers and mobile applications.
Overall, Gemini 2.5 Computer Use is designed to let developers create agents that can autonomously complete interface-driven tasks, such as clicking, typing, scrolling, filling out forms, and navigating past login screens.
Rather than relying solely on APIs or structured input, this model allows AI systems to interact with the software visually and functionally, much like a human would.
Brief hands-on testing
In my brief, unscientific initial hands-on test on the Browserbase website, Gemini 2.5 Computer Use successfully navigated to Taylor Swift's official website as directed and provided me with a summary of what was being sold or promoted up top: a special edition of her latest album, "The Life of a Showgirl."
In another test, I used Gemini 2.5 Computer Use to search Amazon for highly rated and well-reviewed solar lights that I could stake in my backyard, and I was pleased to see it successfully complete a Google Search CAPTCHA designed to weed out non-human users ("Select all the squares with a motorcycle") in a few seconds.
However, once it got to Amazon, it stalled and was unable to complete the task, despite displaying a "task completed" message.
I should also note that while OpenAI's ChatGPT Agent and Anthropic's Claude can create and edit local files, such as PowerPoint presentations, spreadsheets, or text documents, on the user's behalf, Gemini 2.5 Computer Use does not currently offer direct file system access or native file creation capabilities.
Instead, it is designed to control and navigate web and mobile user interfaces through actions such as clicking, typing, and scrolling. Its output is limited to suggested UI actions or chatbot-style text responses; any structured output, like a document or file, must be handled separately by the developer, often through custom code or third-party integration.
Strong benchmark performance
Google says Gemini 2.5 Computer Use has demonstrated leading results on several interface-control benchmarks, especially when compared to other leading AI systems, including Anthropic's Claude Sonnet and OpenAI's agent-based models.
The evaluations were conducted through Browserbase and Google's own testing.
Some highlights include:
- Online-Mind2Web (Browserbase): 65.7% for Gemini 2.5 Computer Use vs. 61.0% (Claude Sonnet 4) and 44.3% (OpenAI agent)
- WebVoyager (Browserbase): 79.9% for Gemini 2.5 Computer Use vs. 69.4% (Claude Sonnet 4) and 61.0% (OpenAI agent)
- AndroidWorld (DeepMind): 69.7% for Gemini 2.5 Computer Use vs. 62.1% (Claude Sonnet 4); OpenAI's model could not be measured due to lack of access
- OSWorld: not currently supported by Gemini 2.5 Computer Use; the top competing result was 61.4%
In addition to strong accuracy, Google reports that the model operates at lower latency than other browser control solutions – an important factor in production use cases like UI automation and testing.
How it works
Agents driven by the Computer Use model work within an interaction loop. They receive:
- a user task prompt
- a screenshot of the interface
- a history of past actions
The model analyzes this input and produces a recommended UI action, such as clicking a button or typing in a field.
If necessary, it may request confirmation from the end user for risky actions, such as making a purchase.
Once the action is executed, the interface state is updated and a new screenshot is sent back to the model. The loop continues until the task is completed or is stopped due to an error or security decision.
The model exposes a special tool called computer_use, and it can be integrated into custom environments using tools like Playwright or through Browserbase's demo sandbox.
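To make the loop concrete, here is a minimal Python sketch of that screenshot, action, execute cycle using Playwright to drive the browser. The `request_action` helper is a hypothetical stand-in for a call to the Gemini API with the computer_use tool enabled, and the action dictionary fields shown are assumptions for illustration, not the exact API schema.

```python
# Minimal sketch of the screenshot -> action -> execute loop described above.
# `request_action` is a hypothetical stand-in for a Gemini API call with the
# computer_use tool enabled; the action dict shape is assumed for illustration.
from playwright.sync_api import sync_playwright

def request_action(task: str, screenshot: bytes, history: list) -> dict:
    """Placeholder: send the task, screenshot, and action history to the model
    and return its proposed UI action, e.g. {"name": "click_at", "x": 500, "y": 300}."""
    raise NotImplementedError("wire this to the Gemini API's computer_use tool")

def run_agent(task: str, start_url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1440, "height": 900})
        page.goto(start_url)
        history: list[dict] = []

        for _ in range(max_steps):
            screenshot = page.screenshot()                 # current interface state
            action = request_action(task, screenshot, history)

            if action["name"] == "done":                   # model signals completion
                break
            if action.get("requires_confirmation"):        # e.g. purchases, CAPTCHAs
                if input(f"Allow {action['name']}? [y/N] ").lower() != "y":
                    break

            # Normalized 0-1000 coordinates are mapped back to pixels before execution.
            if action["name"] == "click_at":
                page.mouse.click(action["x"] * 1440 / 1000, action["y"] * 900 / 1000)
            elif action["name"] == "type_text_at":
                page.mouse.click(action["x"] * 1440 / 1000, action["y"] * 900 / 1000)
                page.keyboard.type(action["text"])
            elif action["name"] == "scroll_document":
                page.mouse.wheel(0, 600 if action.get("direction", "down") == "down" else -600)

            history.append(action)                         # feed back on the next turn

        browser.close()
```

In practice, Browserbase's hosted browsers could replace the local Playwright instance, and the confirmation prompt would be handled by whatever interface the developer builds around the agent.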
Use cases and adoption
According to Google, teams internally and externally have already started using the model in several domains:
- Google's payments platform team reports that Gemini 2.5 Computer Use successfully recovers over 60% of failed test executions, reducing a major source of engineering inefficiency.
- Autotab, a third-party AI agent platform, said the model outperformed others on complex data-parsing tasks, improving performance by 18% on its most difficult evaluation.
- Poke.com, a proactive AI assistant provider, said that the Gemini model often runs 50% faster than competing solutions during interface interactions.
The model is also being used in Google's own product development efforts, including Project Mariner, the Firebase Testing Agent, and AI Mode in Search.
Safety measures
Because this model directly controls software interfaces, Google emphasizes a multi-layered approach to safety:
- A per-step safety service inspects every proposed action before execution.
- Developers can define system-level instructions that require confirmation for, or refuse, specific actions.
- The model includes built-in safeguards to prevent actions that may compromise security or violate Google's prohibited-use policies.
For example, if the model encounters a CAPTCHA, it will generate an action to click a checkbox, but flag it as requiring user confirmation, ensuring that the system does not proceed without human inspection.
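On the developer side, those system-level instructions and confirmation requirements could be enforced with a small guard between the model's proposed action and its execution. The sketch below is illustrative only: the blocked and confirm-required action names are hypothetical examples, and Google's per-step safety service performs its own inspection independently of anything the developer writes.

```python
# Illustrative developer-side guard applied before executing a proposed action.
# The policy sets below are hypothetical examples; Google's per-step safety
# service inspects actions on its own, regardless of this local check.
from typing import Callable

BLOCKED_ACTIONS = {"submit_payment"}                    # hypothetical: never auto-execute
CONFIRM_ACTIONS = {"solve_captcha", "accept_terms"}     # hypothetical: ask a human first

def gate_action(action: dict, confirm: Callable[[dict], bool]) -> bool:
    """Return True only if the proposed action may be executed."""
    name = action.get("name", "")
    if name in BLOCKED_ACTIONS:
        return False                                    # refuse outright
    if name in CONFIRM_ACTIONS or action.get("requires_confirmation"):
        return confirm(action)                          # defer to a human reviewer
    return True                                         # safe to execute automatically

# Example usage:
# ok = gate_action(proposed, confirm=lambda a: input(f"Allow {a['name']}? [y/N] ").lower() == "y")
```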
Technical capabilities
The model supports a wide array of built-in UI actions such as:
- click_at, type_text_at, scroll_document, drag_and_drop, and more
- User-defined functions can be added to extend its reach to mobile or custom environments
- Screen coordinates are normalized (0–1000 scale) and translated back to pixel dimensions during execution
It accepts image and text input and outputs either text responses or function calls representing actions. The recommended screen resolution for optimal results is 1440×900, although it can work with other sizes.
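Since the model emits coordinates on that normalized 0–1000 scale, the developer's execution layer has to translate them into real pixels for whatever viewport is in use. A minimal sketch of that conversion, assuming the recommended 1440×900 resolution (the function names are illustrative, not part of the API):

```python
# Convert the model's normalized 0-1000 coordinates into pixel positions
# for a given viewport, and back again. 1440x900 is the recommended size.
VIEWPORT_WIDTH, VIEWPORT_HEIGHT = 1440, 900

def to_pixels(norm_x: int, norm_y: int,
              width: int = VIEWPORT_WIDTH, height: int = VIEWPORT_HEIGHT) -> tuple[int, int]:
    """Map normalized (0-1000) coordinates onto actual screen pixels."""
    return round(norm_x * width / 1000), round(norm_y * height / 1000)

def to_normalized(px: int, py: int,
                  width: int = VIEWPORT_WIDTH, height: int = VIEWPORT_HEIGHT) -> tuple[int, int]:
    """Map pixel coordinates back onto the 0-1000 scale the model works in."""
    return round(px * 1000 / width), round(py * 1000 / height)

# Example: a click_at action at normalized (500, 250) on a 1440x900 page
# lands at pixel (720, 225).
print(to_pixels(500, 250))  # -> (720, 225)
```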
API pricing is almost the same as Gemini 2.5 Pro
Pricing for Gemini 2.5 Computer Use aligns closely with the standard Gemini 2.5 Pro model. Both follow the same per-token billing structure: input tokens are priced at $1.25 per one million tokens for prompts under 200,000 tokens, and $2.50 per million tokens for prompts longer than that.
Output tokens follow a similar split, at $10.00 per million below that same 200,000-token threshold and $15.00 per million above it.
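For a rough sense of what those rates mean per request, here is a small cost estimator under the published prices, assuming the 200,000-token prompt threshold governs both the input and output tiers:

```python
# Rough cost estimate for a single request under the published per-token rates.
# Assumes the 200K-token prompt threshold governs both input and output tiers.
def estimate_cost(prompt_tokens: int, output_tokens: int) -> float:
    long_prompt = prompt_tokens > 200_000
    input_rate = 2.50 if long_prompt else 1.25     # USD per million input tokens
    output_rate = 15.00 if long_prompt else 10.00  # USD per million output tokens
    return (prompt_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: a 12,000-token prompt (task plus screenshot) with a 600-token
# action response costs about $0.021.
print(f"${estimate_cost(12_000, 600):.4f}")  # -> $0.0210
```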
Where the models differ is in availability and additional features.
Gemini 2.5 Pro includes a free tier that allows developers to use the model at no cost, with no explicit token cap published, although use may be subject to rate limits or quota constraints depending on the platform (e.g., Google AI Studio).
This free access includes both input and output tokens. Once developers exceed their allotted quota or switch to the paid tier, standard per token pricing applies.
By contrast, Gemini 2.5 Computer Use is available exclusively through a paid tier. There is no free access currently offered for this model, and all usage incurs token-based fees from the start.
Feature-wise, Gemini 2.5 Pro supports optional capabilities like context caching (starting at $0.31 per million tokens) and grounding with Google Search (free for up to 1,500 requests per day, then $35 per 1,000 additional requests). These are not available for the Computer Use model at this time.
Another difference is in data handling: output from the Computer Use model is not used to improve Google products in the paid tier, whereas Gemini 2.5 Pro's free-tier usage contributes to product improvement unless users explicitly opt out.
Overall, developers can expect similar token-based costs in both models, but they should consider tiered access, included capabilities, and data usage policies to decide which model best meets their needs.

