Maxproof

Posted by ilreb 5 days ago

Comments

Comment by daquisu 5 days ago

"I thought it was interesting and a bit underappreciated that the fraction of gold medalists at the 2025 IMO (72/630 = 11.4%) is the highest it’s been since 1981.

Crudely, IMO gold medals are awarded to the highest-scoring 1/12 of contestants.1 However, because scores are integers up to 42 and there’s no provision for tiebreaking, it’s possible for a lot of contestants to be tied around the threshold. In that case, either all of them get a gold medal or none do, and the fraction of gold medalists might deviate substantially from 1/12. That’s what happened this year: 46 contestants all won a gold medal by scoring exactly 35 points.

In fact, bizarrely, 35 is the mode of the scores this year; the last time the modal score was a gold medal score was in 1994. And, of course, 35 is the same score claimed by AI systems from Google, OpenAI, and others."

From https://blog.vero.site/post/imo-2025

Comment by quibono 5 days ago

I was under the impression that IMO is conducted in an official "exam" capacity, on site and in a very formal setting. So I find it hard to believe _direct_ LLM usage would be a factor Then again - it very well could be a factor in the training and preparation? I imagine "Write me a prep document for the IMO" will surface all kinds of interesting things from the training set.

Comment by quietbritishjim 5 days ago

> And, of course, 35 is the same score claimed by AI systems from Google, OpenAI, and others.

This is the part of the quote your6 replying about.

You seemed to take "of course" as an implication that the contestants used LLMs, and that's why they got the same score as the LLMs.

I took it to mean: since this was the modal score, there seemed to be 35 points worth of significantly easier answers (relatively speaking) than the remaining points, so it's not a surprise that LLMs got the same easier bits right. (Though I doubt all contestants got their points on exactly the same answers.)

But it's certainly unclear what exactly the author meant.

Comment by daquisu 4 days ago

Later in the same blog post, the author says:

> We can also consider the IMO 2025 problems individually. In the Epoch AI newsletter, Greg Burnham combines a subjective analysis with Evan Chen’s MOHS ratings to argue that the first five problems at IMO 2025 were unusually easy and the sixth was unusually hard, so it’s not surprising that the first five problems were exactly the ones solved by these AIs. Though I’m not sure the MOHS scale is rigorous enough to make sense as the x-axis of a bar chart it’s easy to corroborate the high-level story with the official IMO statistics. Based on average scores, this year’s Problem 6 was the fourth hardest and its Problem 3 was by far the easiest of all Problem 3s and 6s since 2000.

In the linked MaxProof paper, in the section "6.3.1. Per-Problem Analysis" it shows the same behavior: 7/7 in the first 5 problems, 0/7 in the last problem.

Comment by dooglius 5 days ago

This is not bizarre, it's a reflection of how the IMO is scored: 6 questions with scores from 0-7 but partial credit is rare. It's really a score of 5/6.

Comment by gus_massa 4 days ago

> Partial credit is rare

It's not rare at all. I can't find the 2026 results, but here are the 2025 ones.

https://www.imo-official.org/results/individual/year/2025/

The top of the table is full of 7 and the bottom is full of 0, but in the middle there are a lot of intermediate points. It's not uncommon "7 ? 0 7 ? 0" because the 1st and 4th are usually the easiest and the 3rd and 6th the hardest. But there are a of of other combinations due to stupid mistakes and lucky solutions and different personal styles/preferences that make some problems easier/harder for each contestant.

Comment by dooglius 4 days ago

You're right, it's less uncommon than I recalled, and thanks for the source. But I don't think that 35/42 is suspicious/bizarre, and it does look like the 5x7 scores make up the bulk of the 35s.

Comment by gus_massa 4 days ago

I agree. If 6 was too hard and 3 "easy", then it would be common to get 777770.

Comment by pfannl 5 days ago

The real AGI test is apparently not solving the IMO, but getting caught in the same scoring traffic jam as 46 teenagers.