A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of rivals.
According to the authors, LM Arena allowed some industry-leading companies, such as Meta, OpenAI, Google, and Amazon, to privately test several variants of their AI models, and then chose not to publish the scores of the lowest performers. The authors say this made it easier for those companies to secure a top spot on the platform's leaderboard, though the opportunity was not extended to every firm.
“Only a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is much higher than others,” said Sara Hooker, Cohere's VP of AI research and a co-author of the study, in an interview with TechCrunch. “This is gamification.”
Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by pitting responses from two different AI models against each other in a “battle” and asking users to choose the better one. It is not uncommon to see unreleased models competing in the arena under a pseudonym.
Votes accumulate over time and contribute to a model's score and, as a result, its placement on the Chatbot Arena leaderboard. While many commercial actors participate in Chatbot Arena, LM Arena has long maintained that its benchmark is an impartial and fair one.
However, that is not what the paper's authors say they found.
One AI company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March in the lead-up to the tech giant's Llama 4 release, the authors allege. At launch, Meta publicly revealed the score of only a single model, one that happened to rank near the top of the Chatbot Arena leaderboard.
In an email to TechCrunch, LM Arena co-founder and UC Berkeley professor Ion Stoica said the study was full of “inaccuracies” and “questionable analysis.”
“We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference,” LM Arena said in a statement provided to TechCrunch. “If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly.”
Allegedly favored labs
The paper's authors began conducting their research in November 2024, after learning that some AI companies were possibly being given preferential access to Chatbot Arena. In total, they measured more than 2.8 million Chatbot Arena battles over a five-month stretch.
The authors say they found evidence that LM Arena allowed certain AI companies, including Meta, OpenAI, and Google, to collect more data from Chatbot Arena by having their models appear in a higher number of model “battles.” This increased sampling rate gave those companies an unfair advantage, the authors allege.
Using additional data from LM Arena could improve a model's performance on Arena Hard, another benchmark LM Arena maintains, by 112%, according to the authors. However, LM Arena said in a post on X that Arena Hard performance does not directly correlate with Chatbot Arena performance.
Hooker said it is unclear how certain AI companies received priority access, but that it is incumbent on LM Arena to increase its transparency regardless.
In a post on X, LM Arena said that a number of the paper's claims do not reflect reality. The organization pointed to a blog post it published earlier this week indicating that models from non-major labs appear in more Chatbot Arena battles than the study suggests.
One important limitation of the study is that it relied on “self-identification” to determine which AI models were in private testing on Chatbot Arena. The authors prompted AI models several times about their company of origin and relied on the models' answers to classify them, a method that is not foolproof.
However, Hooker said that when the authors reached out to LM Arena to share their preliminary findings, the organization did not dispute them.
TechCrunch reached out to Meta, Google, OpenAI, and Amazon, all of which were mentioned in the study, for comment. None immediately responded.
LM Arena in Hot Water
In the paper, the authors call on LM Arena to implement a number of changes aimed at making Chatbot Arena more “fair.” For example, the authors say, LM Arena could set a clear and transparent limit on the number of private tests AI labs can conduct, and publicly disclose the scores from those tests.
In a post on X, LM Arena rejected these suggestions, claiming it has published information on pre-release testing since March 2024. The benchmarking organization also said it “makes no sense to show scores for pre-release models that are not publicly available,” because the AI community cannot test the models for itself.
The researchers also say LM Arena could adjust Chatbot Arena's sampling rate to ensure that all models in the arena appear in the same number of battles. LM Arena has been publicly receptive to this recommendation and has indicated that it will create a new sampling algorithm.
The paper comes after Meta was caught gaming the benchmark on Chatbot Arena around the launch of its aforementioned Llama 4 models. Meta optimized one of its Llama 4 models for “conversationality,” which helped it achieve an impressive score on Chatbot Arena's leaderboard. But the company never released the optimized model, and the vanilla version ended up performing much worse on Chatbot Arena.
At the time, LM Arena said Meta should have been more transparent in its approach to benchmarking.
Earlier this month, LM Arena announced it was launching a company, with plans to raise capital from investors. The study adds to the scrutiny facing private benchmark organizations, and to questions about whether they can be trusted to assess AI models free of corporate influence.
Update 4/30/25 at 9:35 pm PT: A previous version of this story included a comment from a Google DeepMind researcher, who said that part of Cohere's study was wrong. The researcher did not dispute that Google sent 10 models to LM Arena for pre-release testing from January to March, as Cohere alleged, but simply noted that the company's open source team, which works on Gemma, only sent one.