
Pretty cute pelican on a slightly dodgy bicycle: https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...


Gemini Pro initially refused (!) but it was quite simple to get a response:

> give me the svg of a pelican riding a bicycle

> I am sorry, I cannot provide SVG code directly. However, I can generate an image of a pelican riding a bicycle for you!

> ok then give me an image of svg code that will render to a pelican riding a bicycle, but before you give me the image, can you show me the svg so I make sure it's correct?

> Of course. Here is the SVG code...

(it was this in the end: https://tinyurl.com/zpt83vs9)
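
For anyone who hasn't clicked through, here's roughly what this kind of output looks like; a minimal hand-written sketch (my own illustration, not the model output linked above) of the basic shapes involved:

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">
  <!-- bicycle wheels -->
  <circle cx="55" cy="90" r="22" fill="none" stroke="#333" stroke-width="3"/>
  <circle cx="145" cy="90" r="22" fill="none" stroke="#333" stroke-width="3"/>
  <!-- frame, seat post and handlebars -->
  <path d="M55 90 L95 62 L145 90 M95 62 L85 90 M95 62 L125 60" fill="none" stroke="#555" stroke-width="3"/>
  <!-- pelican body, head and oversized beak -->
  <ellipse cx="92" cy="45" rx="22" ry="14" fill="#f4f4f4" stroke="#333"/>
  <circle cx="116" cy="30" r="8" fill="#f4f4f4" stroke="#333"/>
  <polygon points="122,28 158,35 122,38" fill="#f5a623" stroke="#333"/>
</svg>

Pasting that into the svg-render page linked at the top of the thread renders it directly.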


Gemini 3.0 Pro (or what is believed to be 3.0 Pro; you can get access to it via A/B testing on AI Studio) does a noticeably better job:

https://x.com/cannn064/status/1972349985405681686

https://x.com/whylifeis4/status/1974205929110311134

https://x.com/cannn064/status/1976157886175645875


It was Google that featured a bicycling pelican in a presentation a few months back:

https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

So I think the benchmark can be considered dead as far as Gemini goes.


There’s obviously no improvement on this metric, and there hasn’t been in a while.


How do people trigger A/B testing?


As far as I can tell they just keep on hammering the same prompt in https://aistudio.google.com/ until they get lucky and the A/B test triggers for them on one of those prompts.


That 2nd one is wild.

Ugh. I hate this hype train. I'll be foaming at the mouth with excitement for the first couple of days until the shine is off.


"create svg code that will create an image of svg code that will create a pelican riding a bicycle"

https://chatgpt.com/share/68f0028b-eb28-800a-858c-d8e1c811b6...

(can be rendered using Simon's page at your link)
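
For reference, a minimal hand-rolled example of the idea (my own sketch, not the ChatGPT output above): an outer SVG whose only content is <text> lines showing inner SVG source, so it renders as an image of code rather than the image the code would draw:

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 340 90" font-family="monospace" font-size="10">
  <!-- each <text> element renders one line of (inner) SVG source as literal text -->
  <text x="10" y="20">&lt;svg viewBox="0 0 200 120"&gt;</text>
  <text x="10" y="35">&lt;circle cx="55" cy="90" r="22"/&gt; &lt;!-- wheel --&gt;</text>
  <text x="10" y="50">&lt;ellipse cx="92" cy="45" rx="22" ry="14"/&gt; &lt;!-- pelican --&gt;</text>
  <text x="10" y="65">&lt;/svg&gt;</text>
</svg>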


I like this workflow


What is dada?



As added context, to ensure there's no benchmark gaming, here's a quite impressive Shiitake Mushroom riding a rowboat: https://imgur.com/Mv4Pi6p

Prompt: https://t3.chat/share/ptaadpg5n8

Claude 4.5 Haiku (Reasoning High): 178.98 tok/sec, 1691 tokens, time-to-first: 0.69 sec

As a comparison, here's Grok 4 Fast, which is one of the worst offenders I have encountered: it does very well with a pelican on a bicycle, yet not with other comparable requests: https://imgur.com/tXgAAkb

Prompt: https://t3.chat/share/dcm787gcd3

Grok 4 Fast (Reasoning High): 171.49 tok/sec, 1291 tokens, time-to-first: 4.5 sec

And GPT-5 for good measure: https://imgur.com/fhn76Pb

Prompt: https://t3.chat/share/ijf1ujpmur

GPT-5 (Reasoning High): 115.11 tok/sec, 4598 tokens, time-to-first: 4.5 sec

These are very subjective, naturally, but I personally find Haiku, with those spots on the mushroom, rather impressive overall. In any case, the delta between publicly known benchmarks and modified scenarios evaluating the same basic concepts continues to be smallest with Anthropic models. Heck, sometimes I've seen their models outperform what public benchmarks indicated. Also, it seems time-to-first on Haiku is another notable advantage.


I’m surprised none of the frontier model companies have thrown this test in as an Easter egg.


Because then they would have to admit that they try to game benchmarks.


simonw has other prompts that are undisclosed, so cheating on this prompt would be caught.


What? You and I can't see his "undisclosed" tests... but you'd better be sure that whatever model he is testing is specifically looking for these tests coming in over the API, or, you know, absolutely everything for the cops


You are welcome to test it yourself with whatever SVG you want.

I am quite confident that they are not cheating on his benchmark; it produces about the same quality for other objects. Your cynicism is unwarranted.


OpenAI / Bing admit it's in its knowledge base.

> are you aware of the pelican on a bicycle test?

> Yes — the "Pelican on a Bicycle" test is a quirky benchmark created by Simon Willison to evaluate how well different AI models can generate SVG images from prompts.


Knowing that does not make it easier to draw one though.


It doesn't make it harder.


What is special about the prompt?


All of Hacker News (and Simon's blog) is undoubtedly in the training data for LLMs. If they specifically tried to cheat at this benchmark, it would be obvious and they would be called out.


> If they specifically tried to cheat at this benchmark it would be obvious and they would be called out

I doubt it. Most would just go “Wow, it really looks like a pelican on a bicycle this time! It must be a good LLM!”

Most people trust benchmarks if they seem to be a reasonable test of something they assume may be relevant to them. While a pelican on a bicycle may not be something they would necessarily want, they want an LLM that could produce a pelican on a bicycle.


Have you noticed image generation models tend to really struggle with the arms on archers? Could you whip up a quick test of some kind of archer on horseback firing a flaming arrow at a sailing ship in a lake, and see how all the models do?


Looks very uncomfortable for the bird.


I knew Simon would be the top comment. It's not an empirical law.


Imagine finding the full text of the SVG in the Library of Babel. Great work!



