How AI reasoning exploded in 12 months, which models lead on which benchmarks, and why no single model is "best" anymore.
In early 2025, Humanity's Last Exam launched with a simple premise: create a test so hard that AI couldn't pass it. The Center for AI Safety and Scale AI recruited nearly 1,000 experts across mathematics (41%), physics, biology, chemistry, computer science, and humanities to create 2,500 questions designed to be "Google-proof" — requiring genuine understanding, not information retrieval.
The results were sobering. GPT-4o scored 2.7%. Claude 3.5 Sonnet managed 4.1%. Even OpenAI's o1, the first dedicated reasoning model, could only reach 8%. Human experts, by contrast, score around 90%.
Then, over the next twelve months, something remarkable happened.
By January 2026, Gemini 3 Pro Preview hit 37.2%, GPT-5.2 reached 35.4%, and Gemini 3 Flash (Reasoning) scored 33.7%. Then on February 5, 2026, Anthropic released Claude Opus 4.6 — scoring 40.0% without tools and 53.1% with tools, the highest score any frontier model has achieved on HLE.
Unlike benchmarks such as MATH-500 or MMLU that test well-trodden territory, HLE questions require synthesizing knowledge across disciplines at expert level. The test resists memorization because many questions have never appeared in any training data. It's designed to be the final closed-ended academic benchmark — a ceiling that rises with human knowledge.
The most striking finding of early 2026 is that no single model dominates every benchmark. Instead, each frontier model has carved out a domain of excellence — and the choice of model now depends entirely on the task.
| Model | Best At | Key Score | Weakness |
|---|---|---|---|
| Claude Opus 4.6 | Agentic coding, professional work | Terminal-Bench: 65.4% (SOTA) | Abstract reasoning (vs o3) |
| GPT-5.2 | Pure mathematics, abstract reasoning | FrontierMath: 40.3% (10x prev.) | Professional work (vs Opus) |
| Gemini 3 Pro | Multimodal science, broad excellence | GPQA Diamond: 91.9% | ARC-AGI-2 (31.1%) |
| OpenAI o3 | Abstract visual reasoning | ARC-AGI-1: 87.5% (high compute) | Extremely expensive per puzzle |
Released February 5, 2026, Opus 4.6 introduced several firsts: a 1 million token context window for Opus-class models, "Adaptive Thinking" that dynamically adjusts reasoning depth across four effort levels, and "Agent Teams" enabling multiple Claude instances to collaborate on the same project.
On GDPval-AA — a benchmark measuring real-world professional tasks across 44 knowledge work occupations — Opus 4.6 scored 1606 Elo, beating GPT-5.2's 1462 by 144 points. That translates to winning roughly 70% of head-to-head comparisons on finance, legal, and other professional tasks.
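That roughly-70% figure follows from the standard Elo expected-score formula, assuming GDPval-AA uses conventional Elo scaling (the benchmark's exact scaling is not specified here):

```python
# Expected head-to-head win probability implied by an Elo gap,
# using the standard logistic Elo formula.
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """P(A beats B) = 1 / (1 + 10^((B - A) / 400))."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Ratings reported above: Opus 4.6 at 1606, GPT-5.2 at 1462 (gap of 144).
p = elo_win_prob(1606, 1462)
print(f"{p:.1%}")  # ≈ 69.6%, i.e. roughly 70% of head-to-head wins
```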
Two demonstrations underscored the model's capabilities: 16 parallel Claude instances autonomously built a C compiler over two weeks (consuming 2 billion input tokens), and Opus 4.6 discovered more than 500 zero-day vulnerabilities in open-source code using "out-of-the-box" capabilities.
"Claude Opus 4.6 excels on the hardest problems. It shows greater persistence, stronger code review, and the ability to stay on long tasks where other models tend to give up."
— Michael Truell, Co-founder of Cursor

OpenAI's GPT-5.2 achieved a paradigm shift in mathematical reasoning. On FrontierMath — a benchmark of 350 original, unpublished problems requiring PhD-level mathematics — it scored 40.3%. Previous models couldn't break 2%. On AIME (American Invitational Mathematics Examination), GPT-5.2 Thinking achieved a perfect 100%.
GPT-5.2 also scored 77% on FrontierScience Olympiad tasks and 25% on Research tasks — demonstrating capability at the boundary of current scientific knowledge.
o3's 87.5% on ARC-AGI-1 in high-compute mode surpassed the 85% prize threshold — a historic milestone. But at what cost? Some puzzles consumed hundreds of dollars in compute; even in low-compute mode, o3 costs $17–20 per task. Humans solve the same puzzles for roughly $5 each. ARC-AGI's François Chollet has responded by adding efficiency metrics: intelligence = capability + cost-effectiveness. On the harder ARC-AGI-2, o3 scores 75.7% in standard compute.
A quieter crisis underlies the benchmark race: the tests that made headlines in 2024 are now functionally obsolete. MATH-500, the gold standard for mathematical reasoning, is no longer used for new model releases because most frontier models score above 90%. On MMLU, once the universal intelligence test, frontier models now routinely exceed 95%.
This saturation has driven the creation of a new generation of harder benchmarks:
| New Benchmark | What It Tests | Best Score | Why It Matters |
|---|---|---|---|
| Humanity's Last Exam | Expert-level multidisciplinary reasoning | 53.1% | 2,500 never-before-published questions |
| FrontierMath | Research-grade mathematics | 40.3% | PhD mathematicians need hours per problem |
| ARC-AGI-2 | Abstract visual reasoning + efficiency | 75.7% | Now measures cost per puzzle, not just score |
| Terminal-Bench 2.0 | Agentic coding in real environments | 65.4% | Tests multi-step tool use, not just code generation |
| GDPval-AA | Real paid professional work | 1606 Elo | 44 occupations across 9 industries |
François Chollet has announced ARC-AGI-3, launching March 25, 2026. It introduces 1,000+ interactive video-game-like environments where AI agents must discover rules through exploration — no instructions provided. This represents a shift from static question-answering to dynamic problem-solving that resists memorization entirely.
The data tells a clear story: the era of a single "best" model is over. Early adopters are implementing model-switching strategies — GPT-5.2 for abstract mathematics, Claude Opus 4.6 for agentic coding and professional work, Gemini 3 Pro for multimodal science, DeepSeek R1 for high-volume budget workloads.
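A model-switching strategy can start as a simple routing table from task category to the model that leads on the corresponding benchmarks. Here is a minimal sketch; the model names are illustrative labels drawn from the discussion above, not actual API identifiers:

```python
# Minimal model router: map task categories to the model that leads
# on the corresponding benchmarks. Names are illustrative labels,
# not real API model identifiers.
ROUTING_TABLE = {
    "abstract_math":  "gpt-5.2",          # FrontierMath, AIME
    "agentic_coding": "claude-opus-4.6",  # Terminal-Bench
    "professional":   "claude-opus-4.6",  # GDPval-AA
    "multimodal_sci": "gemini-3-pro",     # GPQA Diamond
    "high_volume":    "deepseek-r1",      # budget workloads
}

def route(task_category: str) -> str:
    # Fall back to the budget model for unrecognized categories.
    return ROUTING_TABLE.get(task_category, "deepseek-r1")

print(route("agentic_coding"))  # claude-opus-4.6
```

In practice the routing key would come from a lightweight classifier or from explicit task metadata, but the core idea is just this lookup.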
The single biggest technical shift of 2026 isn't bigger models — it's smarter thinking. OpenAI's o1/o3 proved that letting models "think longer" at inference time unlocks capabilities that larger base models cannot achieve, and every major lab has since adopted the approach.
The implication for enterprise buyers: reasoning quality is now a dial, not a fixed property. You pay more for deeper thinking, but get measurably better results on hard problems.
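The dial can be modeled explicitly when budgeting. The sketch below is purely illustrative: the effort levels, token budgets, and per-token price are hypothetical placeholders, not any vendor's actual parameters or pricing:

```python
# Hypothetical reasoning-effort dial: deeper thinking buys a larger
# thinking-token budget at proportionally higher cost. All levels,
# budgets, and prices here are illustrative assumptions.
EFFORT_LEVELS = {
    "low":    1_000,    # thinking tokens per request
    "medium": 8_000,
    "high":   32_000,
    "max":    128_000,
}

def estimated_cost(effort: str, price_per_mtok: float = 15.0) -> float:
    """Rough thinking-token cost per request, in dollars."""
    return EFFORT_LEVELS[effort] / 1_000_000 * price_per_mtok

for level in EFFORT_LEVELS:
    print(f"{level:>6}: ${estimated_cost(level):.4f} per request")
```

The point of the exercise: the jump from "low" to "max" is two orders of magnitude in thinking-token spend, so effort selection belongs in the request path, not in a global default.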
DeepSeek R1 matches OpenAI o1 on math, coding, and reasoning tasks. Qwen3-235B achieves 81.5% on AIME 2025, competitive with many proprietary models. Microsoft's Phi-4-reasoning (14B parameters) rivals the full DeepSeek-R1 (671B) on AIME — roughly a 48x reduction in parameter count. The reasoning capability moat is shrinking fast.
The benchmark progress has split AI leaders into three camps: optimists celebrating the numbers, skeptics questioning what they measure, and pragmatists evolving the tests themselves.
"We might be 6 to 12 months away from when the model is doing most, maybe all of what SWEs do end-to-end. There's a lot of uncertainty, and it's easy to see how this could take a few years."
— Dario Amodei, CEO of Anthropic

"The model is now unlocking long-horizon tasks that were previously achievable only by humans."
— Mario Rodriguez, GitHub

"The AI industry is completely LLM-pilled. Fluency is often mistaken for reasoning or world understanding. LLMs will never achieve humanlike intelligence."
— Yann LeCun, Chief AI Scientist, Meta

"We're going to see every task served neatly on a platter improved. But jobs that need long, multimodal, error-correcting task sequences remain a different challenge."
— Andrej Karpathy

"Current AI systems lack several critical abilities: they don't do continual learning, they don't have true creativity yet, and they don't do long-term planning and reasoning."
— Demis Hassabis, CEO of Google DeepMind (estimates AGI at 50% chance within a decade)

"We did decide to put most of our effort in 5.2 into making it super good at intelligence, reasoning, coding, engineering. But memory enhancement — not just reasoning — is the key breakthrough for 2026."
— Sam Altman, CEO of OpenAI

Andrej Karpathy dismissed the "Humanity's Last Exam" framing as "a bit much, and misleading." His argument: while AI will solve any problem served neatly as a multiple-choice test, real-world jobs require sustained, multi-step, error-correcting workflows — something benchmarks fundamentally can't capture. The gap between test performance and job performance remains the critical unsolved problem.
For enterprise leaders, the benchmark race raises an uncomfortable question: if AI is getting so much smarter, why aren't the results showing up in the bottom line?
BCG AI Radar 2026: 80% of CEOs are more optimistic about AI ROI, and >30% of AI budgets are committed to agentic AI. But MIT found a 95% failure rate for enterprise GenAI projects (no measurable financial returns within 6 months). Accenture found that companies with AI-led processes achieve 2.5x higher revenue growth — but this describes only the top 25% of performers.
Gartner predicts 40% of enterprise applications will embed AI agents by end of 2026, up from less than 5% in 2025 — an 8x increase. The market for AI agents is projected to grow from $7.84 billion (2025) to $52.62 billion (2030) at 46.3% CAGR.
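Those projections are internally consistent; the implied growth rate can be checked in two lines:

```python
# Verify the implied compound annual growth rate (CAGR) of the
# AI-agent market projection: $7.84B (2025) -> $52.62B (2030).
start, end, years = 7.84, 52.62, 5
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.1%}")  # ≈ 46.3%, matching the cited figure
```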
As BCG put it: "ROI will be the acronym of 2026 and beyond." The models are ready. The question is whether organizations are.
No single model is best at everything. Use Claude Opus 4.6 for agentic coding and professional workflows. Use GPT-5.2 for abstract math and research. Use Gemini 3 Pro for multimodal science. Use open-source models (DeepSeek R1, Phi-4) for high-volume budget workloads. Model routing is becoming a core competency.
Inference-time scaling means you can now trade cost for quality. Claude's Adaptive Thinking, OpenAI's extended reasoning, and Google's Deep Think all let you dial up reasoning for hard problems. Budget for this — the productivity gains on complex tasks justify the per-token premium.
Karpathy is right: benchmark scores don't translate directly to job performance. Focus on GDPval-style evaluations that measure real-world tasks in your industry, not academic tests. The models that win on HLE may not be the models that win on your specific workflows.
Published February 5, 2026 · Analysis by aictrl.dev