Just less than before, according to the ORCA test

2 weeks ago theregister.co.uk

Exclusive: Current-day LLMs are prediction engines and, as such, they can only find the most likely solution to problems, which is not necessarily the correct one. Though popular models have mostly become better at math, even top performer Gemini 3 Flash would receive a C if assessed with a letter grade.

Researchers affiliated with Omni Calculator, a maker of online calculators for specific applications, have subjected a new set of AI models to the company's ORCA Benchmark, which consists of 500 practical math questions.

In their initial evaluation last November, OpenAI's ChatGPT-5, Google's Gemini 2.5 Flash, Anthropic's Claude Sonnet 4.5, xAI's Grok 4, and DeepSeek's DeepSeek V3.2 (alpha) all did poorly, scoring 63 percent or less on math problems.

The latest set of contestants consists of ChatGPT-5.2, Gemini 3 Flash, Grok 4.1, and DeepSeek V3.2 (stable release). Sonnet ...

Copyright of this story belongs solely to theregister.co.uk.
