Meta's benchmarks for its new AI models are a bit misleading


Maverick, one of the new flagship AI models Meta released on Saturday, ranks second on LM Arena, a test in which human raters compare the outputs of models and choose the one they prefer. However, the version of Maverick placed on LM Arena appears to differ from the version that is widely available to developers.

As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an « experimental chat version ». A chart on the official Llama website, meanwhile, indicates that an « optimized » Llama 4 Maverick was used for the LM Arena testing.

As we have written before, LM Arena has never been the most reliable measure of an AI model's performance, for a variety of reasons. Still, AI companies generally have not tuned their models specifically to score better on LM Arena, or at least have not admitted to doing so.

The trouble with tailoring a model to a benchmark, withholding that version, and then releasing a « vanilla » variant of the same model is that it makes it difficult for developers to predict how the model will perform in particular contexts. It is also misleading. Ideally, benchmarks, inadequate as they are, provide a snapshot of a single model's strengths and weaknesses.

Indeed, researchers on X have observed stark differences in the behavior of the openly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version uses lots of emojis and gives incredibly long-winded answers.

We have reached out to Meta and to Chatbot Arena, the organization that maintains LM Arena, for comment.
