
I don't think this disproves my claim, for several reasons.

First, I don't know where those human participants came from, but if you pick people off the street or from a college campus, they aren't going to be the world's best programmers. On the other hand, github users are on average more skilled than the average CS student, and even the students and beginners who use github usually don't have much code there. If LLM training weights every line of code about the same, the models will pick up more code from prolific developers (who are often more experienced) than from beginners.

Also, in a coding contest you're under time pressure. Even when your code works, it's often ugly and thrown together. On github, the only code I check in is code that solves whatever problem I set out to solve. I suspect we all write better code on github than we do in programming competitions. And if you gave the competitors effectively unlimited time, I suspect many more of them would outperform GPT-4.

Programming contests also usually require that you write a fully self-contained program against a very precise specification. The program usually doesn't need any error handling, and it doesn't need to be maintained. (And if it does need error handling, the cases are all fully specified in the problem description.) Relatively speaking, LLMs are pretty good at this kind of problem - where I want some throwaway code that'll work today and get deleted tomorrow.

But most software I write isn't like that. LLMs struggle to write maintainable software in large projects. Most problems aren't so well specified. And for most code, you end up spending more effort maintaining it over its lifetime than it took to write in the first place. ChatGPT usually writes code that's a headache to maintain: it doesn't write or use local utility functions, it doesn't factor its code well, and it's often overly verbose and poorly optimized. The code frequently contains obvious bugs on unexpected input - like overflow errors or missed boundary conditions - and it very rarely handles errors correctly. None of these problems really matter in programming competitions, but they matter a lot when writing real software, and they make LLMs much less useful at work.
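To be concrete about the kind of boundary/overflow bug I mean, here's a contrived sketch (not actual ChatGPT output, just an illustration): the "obvious" midpoint calculation in a binary search silently overflows on large inputs. Contest-sized test data never triggers it, but production arrays can.

    /* Contrived illustration, not actual ChatGPT output. */
    int binary_search(const int *a, int n, int target) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            /* The naive (lo + hi) / 2 overflows once lo + hi exceeds INT_MAX;
               writing it as lo + (hi - lo) / 2 avoids that. */
            int mid = lo + (hi - lo) / 2;
            if (a[mid] == target) return mid;
            if (a[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;  /* not found */
    }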


