AI benchmarking platform is helping top companies rig their model performances, study claims

By Editor | May 23, 2025

The go-to benchmark for artificial intelligence (AI) chatbots is facing scrutiny from researchers who claim that its tests favor proprietary AI models from big tech companies.

LM Arena effectively places two unidentified large language models (LLMs) in a battle to see which can best tackle a prompt, with users of the benchmark voting for the output they like most. The results are then fed into a leaderboard that tracks which models perform the best and how they have improved.
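For readers unfamiliar with how pairwise votes turn into rankings, the sketch below shows one common approach, an Elo-style rating update. LM Arena's actual scoring model, parameters and model names are not detailed here, so everything in the snippet is an illustrative assumption rather than the arena's real pipeline.

```python
# Minimal sketch: turning pairwise human votes into a leaderboard with an
# Elo-style update. The K-factor, starting rating and model names are
# illustrative assumptions, not LM Arena's actual configuration.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings: dict, model_a: str, model_b: str, winner: str, k: float = 32.0) -> None:
    """Update both ratings in place after one battle; winner is 'a', 'b' or 'tie'."""
    ra = ratings.setdefault(model_a, 1000.0)
    rb = ratings.setdefault(model_b, 1000.0)
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (score_a - ea)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - ea))

ratings = {}
battles = [("model-x", "model-y", "a"), ("model-x", "model-z", "tie"), ("model-y", "model-z", "b")]
for a, b, w in battles:
    update(ratings, a, b, w)

print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```

Each vote nudges the winner's rating up and the loser's down in proportion to how surprising the outcome was, and the leaderboard is simply the models sorted by rating.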

However, researchers have claimed that the benchmark is skewed, granting major LLM providers “undisclosed private testing practices” that give their models an advantage over open-source LLMs. The researchers published their findings April 29 on the preprint database arXiv; the study has not yet been peer reviewed.


“We show that coordination among a handful of providers and preferential policies from Chatbot Arena [later LM Arena] towards the same small group have jeopardized scientific integrity and reliable Arena rankings,” the researchers wrote in the study. “As a community, we must demand better.”

Luck? Limitation? Manipulation?

Beginning as Chatbot Arena, a research project created in 2023 by researchers at the University of California, Berkeley’s Sky Computing Lab, LM Arena quickly became a popular site for top AI companies and open-source underdogs to test their models. Favoring “vibes-based” analysis drawn from user responses over academic benchmarks, the site now gets more than 1 million visitors a month.

To assess the impartiality of the site, the researchers analyzed more than 2.8 million battles played over a five-month period. Their analysis suggests that a handful of preferred providers — the flagship models of companies including Meta, OpenAI, Google and Amazon — had “been granted disproportionate access to data and testing,” as their models appeared in a higher number of battles, giving their final versions a significant advantage.

“Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively,” the researchers wrote. “In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data.”
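Taken at face value, those figures imply a stark per-model gap. The back-of-the-envelope arithmetic below uses only the percentages quoted above; it is an illustration, not the study's own accounting.

```python
# Rough comparison using only the percentages quoted in the study.
google_share = 19.2        # % of all arena data (Google), per the researchers
openai_share = 20.4        # % of all arena data (OpenAI), per the researchers
open_weight_total = 29.7   # % of all arena data shared by 83 open-weight models
open_weight_models = 83

per_open_model = open_weight_total / open_weight_models
print(f"Average share per open-weight model: {per_open_model:.2f}%")                 # ~0.36%
print(f"Google's share vs. that average:     {google_share / per_open_model:.0f}x")  # ~54x
```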


In addition, the researchers noted that proprietary LLMs are tested in LM Arena multiple times before their official release, giving those models more access to the arena’s data. When they are finally pitted against other LLMs, the researchers claimed, they can handily beat them, and only the best-performing iteration of each LLM is placed on the public leaderboard.

“At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives,” the researchers wrote in the study. “Both these policies lead to large data access asymmetries over time.”

In effect, the researchers argue that being able to test multiple pre-release LLMs, retract benchmark scores and have only the highest-performing iteration placed on the leaderboard, combined with certain commercial models appearing in the arena more often than others, gives big AI companies the ability to “overfit” their models to the arena. This potentially boosts their arena performance over competitors, even though their models may not necessarily be of better quality.
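The selection effect behind that “overfitting” concern can be illustrated with a toy simulation. The sketch below assumes equally capable private variants and a provider that publishes only its best-scoring one; it is a hypothetical illustration, not the study's methodology.

```python
# Toy simulation: if 27 private variants of identical true quality are each
# evaluated on a finite number of votes and only the best result is published,
# the published score is inflated purely by selection. Numbers are illustrative.
import random

random.seed(0)
TRUE_WIN_RATE = 0.5   # every variant is genuinely a coin flip against opponents
BATTLES = 200         # votes per variant during private testing

def observed_win_rate() -> float:
    wins = sum(random.random() < TRUE_WIN_RATE for _ in range(BATTLES))
    return wins / BATTLES

best_of_27 = max(observed_win_rate() for _ in range(27))  # provider reports only this
single = observed_win_rate()                              # a provider with one public test

print(f"best of 27 private variants: {best_of_27:.3f}")  # typically well above 0.5
print(f"single tested variant:       {single:.3f}")      # typically close to 0.5
```

Even though every variant is equally good, the published best-of-27 win rate sits noticeably above the true 50%, which is why the researchers argue that private multi-variant testing can inflate leaderboard standing without reflecting real quality gains.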

The research has called into question the authority of LM Arena as an AI benchmark. LM Arena has yet to provide an official comment to Live Science, only offering background information in an email response. But the organization did post a response to the research on the social platform X.

“Regarding the statement that some model providers are not treated fairly: this is not true. Given our capacity, we have always tried to honor all the evaluation requests we have received,” company representatives wrote in the post. “If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly. Every model provider makes different choices about how to use and value human preferences.”

LM Arena also claimed that there were errors in the researchers’ data and methodology, responding that LLM developers don’t get to choose the best score to disclose, and that only the score achieved by a released LLM is put on the public leaderboard.

Nonetheless, the findings raise questions about how LLMs can be tested in a fair and consistent manner, particularly as passing the Turing test is no longer the high-water mark of AI it arguably once was, and scientists are looking for better ways to truly assess the rapidly growing capabilities of AI.

