Regular reminder that most open source LLM benchmarks are not very useful, in the sense that they don't represent day-to-day AI chatbot usage or what users actually care about. If you haven't looked through the datasets to see what they contain, I'd encourage you to do so. [1] I think we're just stuck in a strange, suboptimal Schelling point of sorts: people report their scores on these benchmarks because they think other people care about them, and so these become the benchmarks people expect and care about.
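If you want to eyeball it yourself, here's a minimal sketch using the Hugging Face datasets library and the lukaemon/mmlu mirror linked in [1] (the "abstract_algebra" config is just one of the per-subject splits, picked for illustration):

  # Peek at a few MMLU rows to see what the benchmark actually tests.
  # Assumes the Hugging Face `datasets` library and the lukaemon/mmlu mirror from [1];
  # newer versions of `datasets` may require trust_remote_code=True for script-based datasets.
  from datasets import load_dataset

  ds = load_dataset("lukaemon/mmlu", "abstract_algebra", split="test")
  for row in list(ds)[:3]:
      print(row)  # each row is a four-option multiple-choice question plus the target answer

It's a few minutes of looking at rows like these, and it makes it pretty clear how far the benchmark is from what a chatbot user actually does all day.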
And to unpack their claim that it's the second most powerful model: it's based on MMLU scores, which IMO is not a useful comparison. (It also isn't tested against GPT-4-Turbo or Claude-long-2.1.)
What they're saying is that Inflection-2 ranks #2 relative to other models, including GPT-4, Claude-2, PaLM 2, Grok-1, and Llama 2 70B, specifically on MMLU scores.
This model could be great, but that'll be determined by "do day-to-day users, both free and paying, prefer it over Claude 2 and GPT-4-Turbo?" - not by MMLU scores.
[1]: https://huggingface.co/datasets/lukaemon/mmlu/viewer/abstrac...