Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash “across a broad range of widely reported benchmarks.”
Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta’s press release, the company highlighted Maverick’s ELO score of 1417, which placed it above OpenAI’s 4o and just below Gemini 2.5 Pro. (A higher ELO score means the model wins more often in the arena when going head-to-head with competitors.)
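For context, an Elo-style rating converts a score gap into an expected head-to-head win rate; the standard Elo expected-score formula (shown here purely as an illustration, since LMArena’s exact rating methodology may differ) is E_A = 1 / (1 + 10^((R_B − R_A) / 400)). Under that formula, a model rated 1417 facing an opponent rated 1350 would be expected to win roughly 60 percent of matchups.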
The achievement seemed to position Meta’s open-weight Llama 4 as a serious challenger to the state-of-the-art, closed models from OpenAI, Anthropic, and Google. Then, AI researchers digging through Meta’s documentation discovered something unusual.
In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn’t the same as what’s available to the public. According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality,” TechCrunch first reported.
“Meta’s interpretation of our policy did not match what we expect from model providers,” LMArena posted on X two days after the model’s release. “Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference. As a result of that, we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.”
A spokesperson for Meta, Ashley Gabriel, said in an emailed statement that “we experiment with all types of custom variants.”
“‘Llama-4-Maverick-03-26-Experimental’ is a chat optimized version we experimented with that also performs well on LMArena,” Gabriel said. “We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We’re excited to see what they will build and look forward to their ongoing feedback.”
While what Meta did with Maverick isn’t explicitly against LMArena’s rules, the site has shared concerns about gaming the system and taken steps to “prevent overfitting and benchmark leakage.” When companies can submit specially tuned versions of their models for testing while releasing different versions to the public, benchmark rankings like LMArena become less meaningful as indicators of real-world performance.
“It’s the most widely respected general benchmark because all of the other ones suck,” independent AI researcher Simon Willison tells The Verge. “When Llama 4 came out, the fact that it came second in the arena, just after Gemini 2.5 Pro — that really impressed me, and I’m kicking myself for not reading the small print.”
Shortly after Meta released Maverick and Scout, the AI community started talking about a rumor that Meta had also trained its Llama 4 models to perform better on benchmarks while hiding their real limitations. Ahmad Al-Dahle, VP of generative AI at Meta, addressed the accusations in a post on X: “We’ve also heard claims that we trained on test sets — that’s simply not true and we’d never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.”
Some also noticed that Llama 4 was released at an odd time. Saturday doesn’t tend to be when big AI news drops. After someone on Threads asked why Llama 4 was released over the weekend, Meta CEO Mark Zuckerberg replied: “That’s when it was ready.”
“It’s a very confusing release generally,” says Willison, who closely follows and documents AI models. “The model score that we got there is completely worthless to me. I can’t even use the model that they got a high score on.”
Meta’s path to releasing Llama 4 wasn’t exactly smooth. According to a recent report from The Information, the company repeatedly pushed back the launch because the model failed to meet internal expectations. Those expectations are especially high after DeepSeek, an open-source AI startup from China, released an open-weight model that generated a ton of buzz.
Ultimately, using an optimized model in LMArena puts developers in a difficult position. When selecting models like Llama 4 for their applications, they naturally look to benchmarks for guidance. But as is the case with Maverick, those benchmarks can reflect capabilities that aren’t actually available in the models the public can access.
As AI development accelerates, this episode shows how benchmarks are becoming battlegrounds. It also shows how eager Meta is to be seen as an AI leader, even if that means gaming the system.
Update, April 7th: This story was updated to add Meta’s statement.