AI, Math, and Cognitive Dissonance
July 28, 2025
This blog post is the work of an authentic dinobaby. Sorry. No smart software can help this reptilian thinker.
AI marketers will have to spend some time positioning their smart software as great tools for solving mathematical problems. “Not Even Bronze: Evaluating LLMs on 2025 International Math Olympiad” reports that words about prowess are disconnected from performance. The write up says:
The best-performing model is Gemini 2.5 Pro, achieving a score of 31% (13 points), which is well below the 19/42 score necessary for a bronze medal. Other models lagged significantly behind, with Grok-4 and Deepseek-R1 in particular underperforming relative to their earlier results on other MathArena benchmarks.
The write up points out, possibly to call attention to the slight disconnect between the marketing of Google AI and its performance in this contest:
As mentioned above, Gemini 2.5 Pro achieved the highest score with an average of 31% (13 points). While this may seem low, especially considering the $400 spent on generating just 24 answers, it nonetheless represents a strong performance given the extreme difficulty of the IMO. However, these 13 points are not enough for a bronze medal (19/42). In contrast, other models trail significantly behind and we can already safely say that none of them will achieve the bronze medal. Full results are available on our leaderboard, where everyone can explore and analyze individual responses and judge feedback in detail.
This is only one “competition”; the lousy performance of the high-profile models and the complex process required to assess that performance make it easy to ignore this result.
Let’s just assume that it is close enough for horseshoes and good enough. With that assumption in mind, do you want smart software making decisions about what information you can access, the medical prognosis for your nine-year-old child, or the renewal of your driver’s license?
Now, let’s consider this write up fragmented across Tweets: [Thread] An OpenAI researcher says the company’s latest experimental reasoning LLM achieved gold medal-level performance on the 2025 International Math Olympiad. The little posts are perfect for a person familiar with TikTok-type and Twitter-like content. Not me. The main idea is that, in the same competition, OpenAI earned “gold medal-level performance.”
The $64 question is, “Who is correct?” The answer is, “It depends.”
Is this an example of what I learned in 1962 in my freshman year at a so-so university? I think the term was cognitive dissonance.
Stephen E Arnold, July 28, 2025