
Confident, fluent and wrong

Current large language models do not perform well on Math Olympiad questions. But in such advanced domains, they might fool people into thinking otherwise.

Right after the recent USA Math Olympiad, before any of the questions could make it onto the web and be crawled by LLMs, a team of researchers took those questions and, as Ernest Davis and Gary Marcus report, "gave the problems to some of the top large language models, whose mathematical and reasoning abilities have been loudly proclaimed: o3-Mini, o1-Pro, DeepSeek R1, QwQ-32B, Gemini-2.0-Flash-Thinking-Exp, and Claude-3.7-Sonnet-Thinking."

It did not go well: none of the "reasoning" models scored above 5%. Of course, their output still sounded confident, owing to their mastery of linguistic form. For Davis and Marcus, this is exactly where the danger lies:

The refusal of these kinds of AI to admit ignorance or incapacity and their obstinate preference for generating incorrect but plausible-looking answers instead are one of their most dangerous characteristics. It is extremely easy for a user to pose a question to an LLM, get what looks like a valid answer, and then trust to it, without doing the careful inspection necessary to check that it is actually right.

In other words, they fear that people will leave their epistemic vigilance at the door when conversing with AI. If an answer looks good, and especially if it concerns a complicated topic that's hard to verify, people can be tempted to lean on the tool. Indeed, low task confidence predicts less critical thinking during GenAI use, so you'd expect issues to arise in domains like advanced mathematical reasoning, where few people have the expertise to stay vigilant.

To some extent, the poor performance of the models may be down to suboptimal prompting. Davis and Marcus concede that this might be the case, but argue that the point is moot, because:

It is – maybe – relevant for judging the usefulness of AIs to regular users of the technology who will take the time to optimize their query style and who are experienced in crafting such prompts. [...] But what about the hundreds of millions of users who are not specialists in prompting? By now, naïve users expect that, when they ask a question, they can get a reliable answer. Reality is otherwise.

To be fair, I suppose there will be progress in LLMs on exactly this point. You could build a prompt optimiser, trained on verified question-answer pairs, pray it generalises, and add it as a layer between user input and LLM processing. This could work well for questions already covered by the underlying corpus, and it may be good enough in domains where users have the expertise to check the answers themselves.
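Here's a minimal sketch of that idea, with everything hypothetical: optimise_prompt stands in for a component that would be trained on verified question-answer pairs, and call_llm is a placeholder for whatever model API sits behind it.

```python
# A toy sketch of a prompt-optimiser layer between the user and the model.
# All names are hypothetical; a real optimiser would be learned from
# verified question-answer pairs rather than hard-coded like this.

from typing import Callable


def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; returns a canned string here."""
    return f"[model response to: {prompt!r}]"


def optimise_prompt(user_question: str) -> str:
    """Rewrite a raw user question into a more structured prompt."""
    return (
        "Answer the question below. Reason step by step, and say "
        "'I don't know' if you cannot verify the answer.\n\n"
        f"Question: {user_question}"
    )


def answer(user_question: str, llm: Callable[[str], str] = call_llm) -> str:
    """The layer sitting between user input and LLM processing."""
    return llm(optimise_prompt(user_question))


if __name__ == "__main__":
    print(answer("Prove that 2^n > n^2 for all integers n >= 5."))
```

The hard-coded rewrite is only there to show where the layer sits; the interesting part would be learning that rewrite from verified pairs, which is exactly what the next paragraph questions.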

However, the addition of a prompt optimiser would still not instil reasoning capabilities into an LLM – it would just offer a searchlight into a tangled web of linguistic forms. Like retrieval-augmented generation, it's a way to calibrate responses, but the underlying machine cognition does not change. For the user, the tool could become more reliable, yet the more fundamental issue remains: the machine isn't doing the thinking.