Yes, and I also noted how it beats Claude 3.5 Sonnet in Chatbot Arena by a noticeable margin.
This further feeds my concern that as AI models get more advanced, the random enthusiasts on that site may no longer be able to rank them well, and that tuning for Chatbot Arena might become a thing, one that GPT-4o is already exploiting. GPT-4o absolutely does not rank wildly ahead of Claude 3.5 Sonnet across a wide variety of benchmarks, yet it does on Chatbot Arena. People actually using Claude 3.5 Sonnet are also quite satisfied with its performance, often finding it more helpful than GPT-4o for engineering problems, though with tighter usage limits.
Chatbot Arena was great when the models were still fairly stupid, but these days everyday people are tasked with ranking premium LLMs that can solve logic puzzles and trick questions and have general knowledge far beyond that of any single human. Voters can go after traditional weaknesses like math, but then all of the models struggle. So it's not an easy task at all, and I'm not sure the site is very reliable anymore, other than for smaller models.
There was a mini-uproar when GPT-4o-mini (an obviously "dumber" model) outscored claude-3.5-sonnet on Chatbot Arena, so much so that LMSYS released a subset of the battles: https://huggingface.co/spaces/lmsys/gpt-4o-mini_battles
You can review it for yourself and decide whether the ranking was justified (you can compare based on W/L/T outcomes and matchups). Generally, Claude still has more refusals (easy wins for the model that actually answers the request), often has worse formatting (it's arguable whether the heavier formatting really is better, but people like it more), and is less verbose (personally, I'd prefer the right answer in fewer words, but Chatbot Arena users generally disagree).
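If you want to tally the outcomes yourself, a rough sketch like the one below works on an exported battles table. The column names ("model_a", "model_b", "winner") and the tie labels are assumptions based on how LMSYS has published battle logs before, so adjust them to whatever that space actually exports.

    # Rough sketch: tally W/L/T for one model from an exported battles table.
    # Column names and tie labels are assumptions; adjust to the real export.
    from collections import Counter

    import pandas as pd

    def tally(battles: pd.DataFrame, model: str = "gpt-4o-mini") -> Counter:
        """Count wins/losses/ties for `model` across the battles it appears in."""
        counts = Counter()
        for _, row in battles.iterrows():
            if model not in (row["model_a"], row["model_b"]):
                continue
            if str(row["winner"]).startswith("tie"):  # covers "tie" and "tie (bothbad)"
                counts["tie"] += 1
            elif (row["winner"] == "model_a") == (row["model_a"] == model):
                counts["win"] += 1
            else:
                counts["loss"] += 1
        return counts

    if __name__ == "__main__":
        battles = pd.read_json("battles.json")  # hypothetical local export of the space's data
        print(tally(battles))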
If you look at the questions (and the Chatbot Arena and Wildchat analyses), most people aren't using LLMs for math, reasoning, or even coding; if anything, arena usage is probably overly skewed toward reasoning/trick questions because of the subset of people poking at the models.
Of course, different people value different things. I've almost exclusively been using 3.5 Sonnet since it came out because it's been the best code assistant and Artifacts are great, only falling back to GPT-4o for occasional Code Interpreter work (for tricky problems, Mistral's Codestral actually seems to be a good fallback, often being able to debug issues that neither of those models can, despite being a tiny model in comparison).
Are there any standardized ways of objectively testing LLMs yet? The Chatbot Arena thing has always felt weird to me; it's basically ranking them based on vibes.
Not really. There are a hundred benchmarks, but they all suffer from the same issues: they're rated by other LLMs, and the tasks are often too simple and too similar to each other. The hope is that gathering enough of these benchmarks gets you a representative test suite, but in my view we're still pretty far off.
- The prompts people use on the arena have an incredible sample bias toward certain tasks and styles, and as such are unrepresentative of the "overall performance" that people expect a leaderboard to reflect.
- It is incredibly easy for a company, its employees, or its fanboys to game if they wanted to. No idea if anyone has actually done so, but it would be trivial.
Just to give one example of the bias: advances in non-English performance don't even register on the leaderboard, because almost everyone rating completions there is doing so in English. You could have a model that's a 100 in English and a 0 in every other language, and it would do better on the leaderboard than a model that's a 98 in every human language in the world.
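For what it's worth, the "vibes" do get aggregated into numbers: each vote is a pairwise comparison, and the leaderboard fits a rating to those comparisons. Below is a minimal sketch using a plain online Elo update (LMSYS's published methodology fits a Bradley-Terry model over all battles, so treat this as an illustration of the idea, not their pipeline). The simulated scenario is the English-only bias above: since every vote is an English prompt, the multilingual model's strength elsewhere never shows up in any battle and never moves the ratings.

    # Minimal sketch: how pairwise votes become a leaderboard score.
    # Plain Elo update for illustration only; not the actual LMSYS pipeline.
    import random
    from collections import defaultdict

    K = 32  # update step size; arbitrary choice for this sketch

    def expected(r_a: float, r_b: float) -> float:
        """Probability that A beats B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(ratings: dict, a: str, b: str, score_a: float) -> None:
        """score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
        e_a = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - e_a)
        ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))

    random.seed(0)
    ratings = defaultdict(lambda: 1000.0)

    # Every simulated vote is an English prompt, where the English-only model
    # is marginally better and so wins a bit more than half the time. Its
    # inability to handle other languages never appears in any battle, so it
    # ends up rated above the broadly capable multilingual model.
    for _ in range(2000):
        won = random.random() < 0.55
        update(ratings, "english_only", "multilingual", score_a=1.0 if won else 0.0)

    print(dict(ratings))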