It’s quite a deceptive paper. On the main headline benchmarks (MATH500, AIME24/25) the final answer is just a number from 0-1000, so what is the takeaway supposed to be for pass@k at 512/1024?
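To put rough numbers on that, here’s a back-of-envelope sketch (my own arithmetic, not from the paper, assuming an answer space of roughly 1000 equally likely integers, which is approximately the AIME format) of what pure random guessing would score under pass@k:

```python
# Back-of-envelope: if a benchmark's answers are integers drawn from ~1000
# possibilities, a model guessing uniformly at random still "solves" a problem
# under pass@k with probability 1 - (1 - 1/1000)^k.
for k in (1, 16, 256, 512, 1024):
    p_random = 1 - (1 - 1 / 1000) ** k
    print(f"pass@{k} for blind guessing over 1000 answers: {p_random:.2%}")
```

Blind guessing already clears roughly 40% at k=512 and roughly 64% at k=1024, which is why it’s hard to read much into high pass@k on numeric-answer benchmarks.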
On the unstructured outputs, where you can’t just ratchet up pass@k until it’s almost random guessing, it swaps the base model out for an instruct model, and in the worst case, on LiveCodeBench, it uses a Qwen R1-distill as the _base_ model (!?), which is an instruct model further fine-tuned on R1’s reasoning traces. I assume that was because no matter how high the pass@k, a true base model won’t output correct Python.