I don't think it should be completely open-ended. I mean, you could have an "ask_hooman" tool that solves a ton of problems with current LLMs, but that doesn't mean the LLM is capable at whatever the benchmark is actually measuring.
Why not? One of the most intelligent things to do when stuck on a problem is to get outside help.
If allowing this behaviour creates a problem, you can always add constraints to the benchmark, such as "the final answer must be returned within 15 seconds" or something. The LLM can then decide whether asking around is worth the time risk.
Because AIs are good at optimizing for the highest score, regardless of the test's intent. For most problems "ask_hooman", and especially its plural (ask many humans), would be far more effective, so the degenerate strategy would dominate and tell you precisely zero about the intelligence of the AI. If a specific "tool" is more adept than the "AI", then "choose tool" will always be the correct answer. But I agree, a tight time constraint would help.
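To make that concrete, here's a minimal sketch of how a per-task wall-clock budget could make "ask_hooman" a costly strategy rather than a free win. All names here (ask_hooman, Model.solve, run_task, the latency numbers) are hypothetical illustrations, not any real benchmark's API:

```python
import time

TIME_BUDGET_S = 15.0      # hard per-task limit, as suggested above
HUMAN_LATENCY_S = 60.0    # assumed median time for a human to answer

def ask_hooman(question: str) -> str:
    """Hypothetical tool: blocks until a human replies."""
    time.sleep(HUMAN_LATENCY_S)  # simulated human response time
    return "a human-provided answer"

def run_task(model, task) -> tuple[str, bool]:
    """Run one benchmark task under a wall-clock budget.

    Returns (answer, within_budget). Any tool call, including ask_hooman,
    burns the same clock, so deferring to a human only pays off when the
    expected human latency fits inside the remaining budget.
    """
    start = time.monotonic()
    answer = model.solve(task, tools={"ask_hooman": ask_hooman})
    elapsed = time.monotonic() - start
    return answer, elapsed <= TIME_BUDGET_S
```

Under that setup, a model that always punts to humans times out on most tasks, so the benchmark rewards knowing when outside help is worth the wait rather than defaulting to it.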