    AI/ML

    MCP-Universe benchmark suggests that GPT-5 fails more than half of real world orchestration works

By PineapplesUpdate | August 23, 2025 | 6 min read



Adopting interoperability standards such as the Model Context Protocol (MCP) can give enterprises insight into how agents and models work outside their walls. However, many benchmarks fail to capture real-life interactions with MCP.

Salesforce AI Research has developed a new open-source benchmark, which it calls MCP-Universe, aimed at tracking LLMs as they interact with MCP servers in the real world, arguing that it paints a better picture of how enterprises actually use tools in real time. In its initial tests, it found that models such as OpenAI's recently released GPT-5 perform strongly, but still fall short in real-life scenarios.

"Existing benchmarks mainly focus on individual aspects of LLM performance, such as mathematical reasoning or function calling, without providing a comprehensive assessment of how models interact with real-world MCP servers across diverse landscapes," the paper said.

The MCP-Universe tool captures model performance across tool use, multi-turn tool calls, long context windows and large tool spaces. It is built on existing MCP servers with access to real data sources and environments.
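The multi-turn tool-calling pattern such a benchmark exercises can be sketched in a few lines. This is an illustrative loop, not MCP-Universe's actual API: the model proposes a tool call, the MCP server executes it, and the result is fed back into the growing context until the model emits a final answer or runs out of turns.

```python
# Minimal sketch of a multi-turn tool-calling episode. All names here
# (run_episode, action dict keys) are illustrative, not MCP-Universe's API.
def run_episode(model, server, task, max_turns=10):
    history = [task]                     # context grows every turn
    for _ in range(max_turns):
        action = model(history)          # model picks a tool call or answers
        if action["type"] == "final":
            return action["answer"]
        result = server(action["tool"], action["args"])  # execute via MCP
        history.append(result)           # feed the tool result back
    return None                          # agent failed to finish in budget
```

This is where the long-context and large-tool-space pressures the benchmark measures come from: every tool result lengthens `history`, and the model must keep its plan coherent across turns.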



Junnan Li, director of AI research at Salesforce, told VentureBeat that many models "still face limitations that hold them back on enterprise-grade work."

"The two biggest are: long-context challenges, where models may lose track of earlier reasoning or get confused when handling very long or complex inputs," Li said. "And unknown-tool challenges, where models are often unable to use unfamiliar tools or systems the way humans can adapt on the fly. That is why it is important not to take a DIY approach of powering agents with a single model alone, but to rely on a platform that adds the data context the work needs."

MCP-Universe joins other proposed MCP-based benchmarks, such as MCP-Radar from the University of Massachusetts Amherst and Xi'an Jiaotong University, as well as MCPWorld from the Beijing University of Posts and Telecommunications. Salesforce also makes MCPEval, released in July, which is mainly focused on agents. Li said the biggest difference between MCP-Universe and MCPEval is that the latter evaluates agents with synthetic tasks.

How it works

MCP-Universe evaluates how well each model completes a series of tasks that mimic those performed by enterprises. Salesforce said it designed MCP-Universe to cover six core enterprise domains: location navigation, repository management, financial analysis, 3D design, browser automation and web search. It accesses 11 MCP servers for a total of 231 tasks.

    • Location navigation focuses on geographic reasoning and the execution of spatial tasks. Researchers tapped the Google Maps MCP server for this domain.
    • Repository management looks at codebase operations and connects to the GitHub MCP server to exercise version-control tools such as repo search, issue tracking and code editing.
    • Financial analysis connects to the Yahoo Finance MCP server for quantitative reasoning and the evaluation of financial decision-making.
    • 3D design evaluates the use of computer-aided design tools through the Blender MCP server.
    • Browser automation connects to the Playwright MCP server to test browser interaction.
    • Web search employs the Google Search MCP server and the Fetch MCP server to check open-domain information retrieval, and is structured as a more open-ended task.
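The domain-to-server wiring above can be summarized as a simple mapping. This is a hypothetical sketch for illustration (the identifiers are invented, and it lists only the seven servers named in the article, not all 11 in the actual suite):

```python
# Hypothetical mapping of the six benchmark domains to the MCP servers
# the article names. Identifiers are illustrative, not MCP-Universe's code.
DOMAIN_SERVERS = {
    "location_navigation": ["google-maps"],
    "repository_management": ["github"],
    "financial_analysis": ["yahoo-finance"],
    "3d_design": ["blender"],
    "browser_automation": ["playwright"],
    "web_search": ["google-search", "fetch"],
}

def servers_for(domain: str) -> list[str]:
    """Return the MCP servers an agent may call for a given domain."""
    return DOMAIN_SERVERS.get(domain, [])
```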

Salesforce said it had to design new MCP tasks that reflect real use cases. For each domain, the team created four to five task types that researchers felt an LLM should be able to complete. For example, researchers assigned the model a goal that included route planning, identifying optimal stops and then finding the destination.


Each model is evaluated on how it completed the tasks. Li and his team opted for an execution-based evaluation paradigm rather than the more common LLM-as-a-judge approach. The researchers noted that the LLM-as-a-judge paradigm "is not well suited to our MCP-Universe, as some tasks are designed to use real-time data, while the LLM judge's knowledge is static."

Salesforce researchers used three types of evaluators: a format evaluator to check whether agents meet format requirements, a static evaluator to assess correctness against ground truth that does not change over time, and dynamic evaluators that fetch real-time answers, such as flight prices or GitHub issues, before grading.
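The three evaluator roles could look something like the sketch below. This is an illustrative reconstruction under the article's description, not the actual MCP-Universe implementation; all function names are invented:

```python
# Illustrative sketch of the three execution-based evaluator roles the
# article describes. Not the actual MCP-Universe code.
from typing import Callable

def format_evaluator(answer: str, required_prefix: str) -> bool:
    """Check that the agent's answer obeys the task's format requirements."""
    return answer.startswith(required_prefix)

def static_evaluator(answer: str, ground_truth: str) -> bool:
    """Compare against a fixed ground truth that does not change over time."""
    return answer.strip() == ground_truth.strip()

def dynamic_evaluator(answer: str, fetch_truth: Callable[[], str]) -> bool:
    """Fetch the current value (e.g. a live flight price) at grading time."""
    return answer.strip() == fetch_truth().strip()

def grade(answer: str, checks: list[Callable[[str], bool]]) -> bool:
    """A task passes only if every evaluator attached to it passes."""
    return all(check(answer) for check in checks)
```

The key design point, per the paper's argument, is that `dynamic_evaluator` queries live state at grading time, which a frozen LLM judge cannot do.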

"MCP-Universe focuses on creating challenging real-world tasks with execution-based evaluators, which can stress-test agents in complex scenarios. In addition, MCP-Universe provides an extensible framework/codebase for building and evaluating agents," Li said.

    Even big models have trouble

To test MCP-Universe, Salesforce evaluated many popular proprietary and open-source models. These include xAI's Grok-4; Anthropic's Claude 4 Sonnet and Claude 3.7 Sonnet; OpenAI's GPT-5, o4-mini, o3, GPT-4.1, GPT-4o and GPT-OSS; Google's Gemini 2.5 Pro and Gemini 2.5 Flash; Zhipu's GLM-4.5; Moonshot's Kimi-K2; Alibaba's Qwen3 Coder and Qwen3-235B-A22B-Instruct-2507; and DeepSeek's DeepSeek-V3-0324. Each model tested had at least 120B parameters.

In its tests, Salesforce found that GPT-5 had the best success rate, especially on financial analysis tasks. Grok-4 beat all models on browser automation, and Claude 4 Sonnet rounded out the top three, although it did not post the best numbers in any single domain. Among the open-source models, GLM-4.5 performed best.

However, MCP-Universe showed that the models struggled to handle longer contexts, particularly for location navigation, browser automation and financial analysis, where efficiency declined significantly. Performance also fell the moment LLMs faced unknown tools. Overall, the LLMs typically failed to complete more than half of the enterprise tasks.

"These findings highlight that current frontier LLMs still fall short on diverse real-world MCP tasks. Our MCP-Universe benchmark therefore provides a challenging and necessary testbed to evaluate LLM performance in areas overlooked by existing benchmarks," the paper said.

Li told VentureBeat he hoped enterprises would use MCP-Universe to see where agents and models fail on tasks, so they can improve their frameworks or the implementation of their MCP tools.
