
ZDNET's key takeaways
- Frontier AI models fail to provide safe and accurate output on medical subjects.
- LMArena and Data Tecnica aim to rigorously test LLMs' medical knowledge.
- It is not clear how agents and medicine-specific LLMs will be measured.
Despite the many advances in medicine attributed to AI in the scholarly literature, generative AI still fails to produce output that is both safe and accurate when dealing with medical subjects, according to a new report by benchmark firm LMArena.
The finding is especially relevant given that people are turning to chatbots like ChatGPT for medical answers, and research suggests that people trust AI's medical advice over that of doctors, even when it is wrong.
Also: Patients trust AI's medical advice over doctors' – even when it's wrong, study finds
The new study, which compares OpenAI's GPT-5 with numerous models from Google, Anthropic, and Meta, finds that the models "fall short of adequate performance in real-world biomedical research."
(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
A knowledge gap in medicine
According to the LMArena team, "no current model reliably meets biomedical scientists' demands for reasoning and domain-specific knowledge."
The report concludes that current models are simply too loose and too fuzzy to meet the standards of medicine:
"This fundamental gap highlights the growing mismatch between general AI capabilities and the needs of specialized scientific communities. Biomedical researchers work at the intersection of complex, evolving knowledge and real-world impact. They do not need models that are occasionally 'right'; they need tools that sharpen insight, reduce errors, and increase the speed of discovery."
The study echoes the conclusions of other medicine-related benchmark tests. For example, in May, OpenAI unveiled HealthBench, a suite of tests covering medical conditions and scenarios that a person seeking medical advice might put to a chatbot. That study found that OpenAI's o3 large language model achieved the best accuracy score, 0.598, leaving plenty of room for improvement on the benchmark.
Also: OpenAI's HealthBench shows AI's medical advice is improving – but who will listen?
Expanding the benchmark
To address the gap between AI models and medicine, LMArena has partnered with startup Data Tecnica, which earlier this year unveiled CARDBiomedBench, a question-answer benchmark suite for evaluating LLMs in biomedical research.
Together, LMArena and Data Tecnica plan to expand what is called BiomedArena, a leaderboard that pits AI models against one another head to head and lets people vote on which performs best.
Also: Meta's Llama 4 'herd' controversy and AI contamination, explained
BiomedArena is meant specifically for research in medicine, unlike general-purpose leaderboards built around very general questions.
BiomedArena is already being used by scientists in the Intramural Research Program of the US National Institutes of Health, the team notes, "where scientists pursue high-risk, high-impact projects that are often beyond traditional academic research due to their scale, complexity, or resource demands."
According to the LMArena team, BiomedArena's work "will focus on tasks and evaluation strategies grounded in the day-to-day realities of biomedical discovery – from interpreting data and literature to generating hypotheses to assisting in clinical translation."
Also: You can track the top AI image generators with this new leaderboard – and vote for your favorite, too
As ZDNET's Webb Wright reported in June, Lmarena.ai ranks AI models. The site was originally established as a research initiative at UC Berkeley under the name Chatbot Arena and has since grown into a full-fledged platform, with financial backing from UC Berkeley, a16z, Sequoia Capital, and others.
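Leaderboards of this kind typically convert pairwise human votes into an Elo-style rating, so that models which win more head-to-head battles rise to the top. The following is a minimal illustrative sketch of that mechanism, not LMArena's actual implementation; the model names and vote log are hypothetical.

```python
# Illustrative Elo-style rating from pairwise votes, as used by
# Chatbot Arena-style leaderboards. Sketch only; model names and
# the vote log below are hypothetical, not LMArena's real data.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a model rated r_a beats one rated r_b under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_ratings(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed vote outcome."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_win)
    ratings[loser] -= k * (1.0 - e_win)

# Hypothetical vote log: (winner, loser) pairs from side-by-side battles.
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
    ("model-a", "model-b"),
]
ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
for winner, loser in votes:
    update_ratings(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard[0])  # the model that won the most battles ranks first
```

In practice, arena-style leaderboards have moved toward statistical refinements of this idea (such as Bradley-Terry models with confidence intervals), but the core input is the same: crowd-sourced pairwise preferences.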
Where could they go wrong?
There are two big questions hanging over this new benchmark effort.
First, studies with doctors have shown that generative AI's usefulness expands dramatically when AI models are connected to "gold standard" databases of medical information, with top frontier models able to outperform dedicated large language models (LLMs) simply by tapping into that information.
Also: Hooking generative AI up to medical data improves its utility for doctors
From today's announcement, it is not clear how LMArena and Data Tecnica plan to address that aspect of AI models, which is really a kind of agentic capability: the ability to tap into external resources. Without measuring how AI models use such resources, the benchmark may be of limited utility.
Second, numerous medicine-specific LLMs are being developed all the time, including Google's Med-PaLM program, introduced two years ago. It is not clear whether BiomedArena's work will take these dedicated medical LLMs into account; the work so far has tested only general frontier models.
Also: Google's Med-PaLM medical AI emphasizes human physicians
That may be an entirely valid choice by LMArena and Data Tecnica, but it leaves out an important area of effort.

