Hacker News

I agree that pass@k feels a bit weird for large k. But for LLMs, it's a decent proxy for "are the knowledge, skills, and circuits needed to solve the problem somewhere in the model". Note that a large k here is on the order of 256, while the space of valid answers is far larger than that. So the infinite-monkeys critique, while true in the limit, wouldn't actually let random generation outperform models in the tested regime.
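For concreteness, pass@k is usually computed with the standard unbiased estimator (draw k of n generated samples without replacement, given c of the n were correct). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations is correct,
    given that c of the n generations were correct."""
    if n - c < k:
        # Fewer incorrect samples than k draws: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Even a single correct answer out of 256 gives pass@128 = 0.5,
# while pass@1 is only ~0.004 -- large k is forgiving by design.
print(pass_at_k(256, 1, 1))    # ~0.0039
print(pass_at_k(256, 1, 128))  # 0.5
```

This is why large-k numbers probe "is the capability in there at all" rather than "does the model reliably produce it".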

Also, in practice, models don't have that much semantic entropy for a given prompt: with temperature-based sampling, they tend to generate very similar, though not identical, responses.
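The effect of temperature on sample diversity can be seen in a toy next-token distribution (a stand-in, not a real LLM): when the distribution is peaked, lowering the temperature concentrates nearly all samples on the mode.

```python
import math
import random
from collections import Counter

def sample_with_temperature(logits, temperature, rng):
    """Sample one token index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# A mildly peaked "next-token" distribution, like a fairly confident model.
logits = [2.0, 1.0, 0.5, 0.0]
rng = random.Random(0)
for temperature in (0.2, 1.0):
    counts = Counter(
        sample_with_temperature(logits, temperature, rng) for _ in range(1000)
    )
    print(temperature, counts.most_common())
```

At temperature 0.2 almost every draw is the top token; at 1.0 the samples spread out somewhat, but the mode still dominates, which is the low-semantic-entropy behavior described above.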



