2026-02-05

The Reasoning Race: From 2.7% to 53.1% on Humanity's Last Exam

How AI reasoning exploded in 12 months, which models lead on which benchmarks, and why no single model is "best" anymore.

53.1%: Claude Opus 4.6 on HLE (with tools)
20x: HLE improvement in 12 months
5+: Saturated benchmarks (>90%)
0: Models that are "best at everything"

Humanity's Last Exam: The Benchmark That Humbled AI

In early 2025, Humanity's Last Exam launched with a simple premise: create a test so hard that AI couldn't pass it. The Center for AI Safety and Scale AI recruited nearly 1,000 experts across mathematics (41%), physics, biology, chemistry, computer science, and humanities to create 2,500 questions designed to be "Google-proof" — requiring genuine understanding, not information retrieval.

The results were sobering. GPT-4o scored 2.7%. Claude 3.5 Sonnet managed 4.1%. Even OpenAI's o1, the first dedicated reasoning model, could only reach 8%. Human experts, by contrast, score around 90%.

Then, over the next twelve months, something remarkable happened.

HLE progress from 2.7% to 53.1%
Figure 1: Humanity's Last Exam scores over 12 months. GPT-4o's 2.7% in early 2025 gave way to Claude Opus 4.6's 53.1% (with tools) by February 2026 — a 20x improvement. The gap to human experts (90%) has narrowed from 87 points to 37.

By January 2026, Gemini 3 Pro Preview hit 37.2%, GPT-5.2 reached 35.4%, and Gemini 3 Flash (Reasoning) scored 33.7%. Then on February 5, 2026, Anthropic released Claude Opus 4.6 — scoring 40.0% without tools and 53.1% with tools, the highest score any frontier model has achieved on HLE.

What Makes HLE Different

Unlike benchmarks such as MATH-500 or MMLU that test well-trodden territory, HLE questions require synthesizing knowledge across disciplines at expert level. The test resists memorization because many questions have never appeared in any training data. It's designed to be the final closed-ended academic benchmark — a ceiling that rises with human knowledge.


The Frontier Four: Each Model Has a Superpower

The most striking finding of early 2026 is that no single model dominates every benchmark. Instead, each frontier model has carved out a domain of excellence — and the choice of model now depends entirely on the task.

Head-to-head model comparison
Figure 2: Head-to-head on three shared benchmarks. Claude Opus 4.6 leads HLE (53.1%) and is competitive on SWE-bench. OpenAI o3 leads ARC-AGI-2 (75.7% standard compute; 87.5% on ARC-AGI-1 high compute). No model wins everywhere.
Model | Best At | Key Score | Weakness
Claude Opus 4.6 | Agentic coding, professional work | Terminal-Bench: 65.4% (SOTA) | Abstract reasoning (vs o3)
GPT-5.2 | Pure mathematics, abstract reasoning | FrontierMath: 40.3% (10x prev.) | Professional work (vs Opus)
Gemini 3 Pro | Multimodal science, broad excellence | GPQA Diamond: 91.9% | ARC-AGI-2 (31.1%)
OpenAI o3 | Abstract visual reasoning | ARC-AGI-1: 87.5% (high compute) | Extremely expensive per puzzle

Claude Opus 4.6: The Professional's Model

Released February 5, 2026, Opus 4.6 introduced several firsts: a 1 million token context window for Opus-class models, "Adaptive Thinking" that dynamically adjusts reasoning depth across four effort levels, and "Agent Teams" enabling multiple Claude instances to collaborate on the same project.

Claude Opus 4.6 benchmark scores
Figure 3: Claude Opus 4.6 benchmark performance with previous best scores (red dots) where available. Notable jumps: ARC-AGI-2 improved 83% over Opus 4.5, MRCR v2 jumped from 18.5% to 76%.

On GDPval-AA — a benchmark measuring real-world professional tasks across 44 knowledge work occupations — Opus 4.6 scored 1606 Elo, beating GPT-5.2's 1462 by 144 points. That translates to winning roughly 70% of head-to-head comparisons on finance, legal, and other professional tasks.
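Under the standard Elo model, a rating gap maps to an expected head-to-head win rate. A quick sketch (not part of the benchmark itself) confirms that a 144-point gap corresponds to roughly 70%:

```python
def elo_win_probability(delta: float) -> float:
    """Expected win rate for the higher-rated side under the Elo model."""
    return 1.0 / (1.0 + 10 ** (-delta / 400))

# GDPval-AA: Opus 4.6 at 1606 Elo vs GPT-5.2 at 1462, a 144-point gap
print(f"{elo_win_probability(1606 - 1462):.1%}")  # 69.6%
```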

Two demonstrations underscored these capabilities: 16 parallel Claude instances autonomously built a C compiler over two weeks (consuming 2 billion input tokens), and Opus 4.6 discovered more than 500 zero-day vulnerabilities in open-source code using out-of-the-box capabilities.

"Claude Opus 4.6 excels on the hardest problems. It shows greater persistence, stronger code review, and the ability to stay on long tasks where other models tend to give up."

— Michael Truell, Co-founder of Cursor

GPT-5.2: The Mathematician

OpenAI's GPT-5.2 achieved a paradigm shift in mathematical reasoning. On FrontierMath — a benchmark of 350 original, unpublished problems requiring PhD-level mathematics — it scored 40.3%. Previous models couldn't break 2%. On AIME (American Invitational Mathematics Examination), GPT-5.2 Thinking achieved a perfect 100%.

GPT-5.2 also scored 77% on FrontierScience Olympiad tasks and 25% on Research tasks — demonstrating capability at the boundary of current scientific knowledge.

OpenAI o3: The Cost-Efficiency Question

o3's 87.5% on ARC-AGI-1 in high-compute mode surpassed the 85% prize threshold — a historic milestone. But at what cost? Some puzzles consumed hundreds of dollars in compute; even in low-compute mode, o3 costs $17–20 per task. Humans solve the same puzzles for roughly $5 each. ARC-AGI's François Chollet has responded by adding efficiency metrics: intelligence = capability + cost-effectiveness. On the harder ARC-AGI-2, o3 scores 75.7% in standard compute.


The Saturation Problem: When Tests Become Too Easy

A quieter crisis underlies the benchmark race: the tests that made headlines in 2024 are now functionally obsolete. MATH-500, the gold standard for mathematical reasoning, is no longer used for new model releases because most frontier models score above 90%. MMLU, once the universal intelligence test, routinely exceeds 95%.

Benchmark saturation comparison
Figure 4: Saturated benchmarks (yellow, above 90% threshold) versus frontier benchmarks (green) that still differentiate models. The field is shifting toward harder, contamination-resistant evaluations.

This saturation has driven the creation of a new generation of harder benchmarks:

New Benchmark | What It Tests | Best Score | Why It Matters
Humanity's Last Exam | Expert-level multidisciplinary reasoning | 53.1% | 2,500 never-before-published questions
FrontierMath | Research-grade mathematics | 40.3% | PhD mathematicians need hours per problem
ARC-AGI-2 | Abstract visual reasoning + efficiency | 75.7% | Now measures cost per puzzle, not just score
Terminal-Bench 2.0 | Agentic coding in real environments | 65.4% | Tests multi-step tool use, not just code generation
GDPval-AA | Real paid professional work | 1606 Elo | 44 occupations across 9 industries

ARC-AGI-3: The Next Frontier

François Chollet has announced ARC-AGI-3, launching March 25, 2026. It introduces 1,000+ interactive video-game-like environments where AI agents must discover rules through exploration — no instructions provided. This represents a shift from static question-answering to dynamic problem-solving that resists memorization entirely.


The Specialization Era: Pick Your Model, Pick Your Strength

The data tells a clear story: the era of a single "best" model is over. Early adopters are implementing model-switching strategies — GPT-5.2 for abstract mathematics, Claude Opus 4.6 for agentic coding and professional work, Gemini 3 Pro for multimodal science, DeepSeek R1 for high-volume budget workloads.

Model specialization map
Figure 5: Models mapped by abstract reasoning strength (x-axis) versus professional work capability (y-axis). Each model occupies a distinct quadrant. OpenAI o3 leads abstract reasoning; Claude Opus 4.6 leads professional tasks.

The Inference-Time Scaling Revolution

The single biggest technical shift of 2026 isn't bigger models — it's smarter thinking. OpenAI's o1 and o3 proved that letting models "think longer" at inference time unlocks capabilities that larger base models cannot reach, and every major lab has since adopted the approach.

The implication for enterprise buyers: reasoning quality is now a dial, not a fixed property. You pay more for deeper thinking, but get measurably better results on hard problems.

Open Source Is Closing the Gap

DeepSeek R1 matches OpenAI o1 on math, coding, and reasoning tasks. Qwen3-235B achieves 81.5% on AIME 2025, competitive with many proprietary models. Microsoft's Phi-4-reasoning (14B parameters) rivals the full DeepSeek-R1 (671B) on AIME — a 47x efficiency improvement. The reasoning capability moat is shrinking fast.


The Debate: Do Benchmarks Measure Intelligence?

The benchmark progress has split AI leaders into three camps: optimists celebrating the numbers, skeptics questioning what they measure, and pragmatists evolving the tests themselves.

The Optimists

"We might be 6 to 12 months away from when the model is doing most, maybe all of what SWEs do end-to-end. There's a lot of uncertainty, and it's easy to see how this could take a few years."

— Dario Amodei, CEO of Anthropic

"The model is now unlocking long-horizon tasks that were previously achievable only by humans."

— Mario Rodriguez, GitHub

The Skeptics

"The AI industry is completely LLM-pilled. Fluency is often mistaken for reasoning or world understanding. LLMs will never achieve humanlike intelligence."

— Yann LeCun, Chief AI Scientist, Meta

"We're going to see every task served neatly on a platter improved. But jobs that need long, multimodal, error-correcting task sequences remain a different challenge."

— Andrej Karpathy

The Pragmatists

"Current AI systems lack several critical abilities: they don't do continual learning, they don't have true creativity yet, and they don't do long-term planning and reasoning."

— Demis Hassabis, CEO of Google DeepMind (estimates AGI at 50% chance within a decade)

"We did decide to put most of our effort in 5.2 into making it super good at intelligence, reasoning, coding, engineering. But memory enhancement — not just reasoning — is the key breakthrough for 2026."

— Sam Altman, CEO of OpenAI

The Karpathy Critique

Andrej Karpathy dismissed the "Humanity's Last Exam" framing as "a bit much, and misleading." His argument: while AI will solve any problem served neatly as a multiple-choice test, real-world jobs require sustained, multi-step, error-correcting workflows — something benchmarks fundamentally can't capture. The gap between test performance and job performance remains the critical unsolved problem.


The Enterprise Reality: The ROI Gap

For enterprise leaders, the benchmark race raises an uncomfortable question: if AI is getting so much smarter, why aren't the results showing up in the bottom line?

Enterprise AI ROI gap
Figure 6: The enterprise AI paradox. CEO optimism (80%) and AI adoption (88%) are at all-time highs, but only 14% of CFOs report measurable ROI. The gap between deploying AI and profiting from it is the central challenge of 2026.
The Numbers That Matter

BCG AI Radar 2026: 80% of CEOs are more optimistic about AI ROI, and >30% of AI budgets are committed to agentic AI. But MIT found a 95% failure rate for enterprise GenAI projects (no measurable financial returns within 6 months). Accenture found that companies with AI-led processes achieve 2.5x higher revenue growth — but this describes only the top 25% of performers.

Gartner predicts 40% of enterprise applications will embed AI agents by end of 2026, up from less than 5% in 2025 — an 8x increase. The market for AI agents is projected to grow from $7.84 billion (2025) to $52.62 billion (2030) at 46.3% CAGR.
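The quoted growth rate checks out. A sketch of the compound annual growth rate implied by $7.84B in 2025 and $52.62B in 2030 (five growth years):

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by start value, end value, and horizon."""
    return (end / start) ** (1 / years) - 1

# AI agent market: $7.84B (2025) -> $52.62B (2030)
print(f"{cagr(7.84, 52.62, 5):.1%}")  # 46.3%
```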

As BCG put it: "ROI will be the acronym of 2026 and beyond." The models are ready. The question is whether organizations are.


What This Means for Enterprise Leaders

1. Adopt a Multi-Model Strategy

No single model is best at everything. Use Claude Opus 4.6 for agentic coding and professional workflows. Use GPT-5.2 for abstract math and research. Use Gemini 3 Pro for multimodal science. Use open-source models (DeepSeek R1, Phi-4) for high-volume budget workloads. Model routing is becoming a core competency.
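In practice, a multi-model strategy starts with a routing table. A minimal sketch, where the task categories and model identifiers are illustrative placeholders rather than real provider API names:

```python
# Illustrative routing table; model identifiers are placeholders,
# not actual provider API model names.
TASK_ROUTES = {
    "agentic_coding": "claude-opus-4.6",
    "professional_work": "claude-opus-4.6",
    "abstract_math": "gpt-5.2",
    "multimodal_science": "gemini-3-pro",
    "high_volume": "deepseek-r1",
}

def route(task_type: str, default: str = "deepseek-r1") -> str:
    """Return the model for a task type, falling back to a budget default."""
    return TASK_ROUTES.get(task_type, default)

print(route("abstract_math"))  # gpt-5.2
```

A production router would also weigh latency, cost ceilings, and reasoning-effort settings, but the core decision is this lookup.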

2. Buy Reasoning Depth, Not Just Speed

Inference-time scaling means you can now trade cost for quality. Claude's Adaptive Thinking, OpenAI's extended reasoning, and Google's Deep Think all let you dial up reasoning for hard problems. Budget for this — the productivity gains on complex tasks justify the per-token premium.

3. Don't Chase Benchmarks — Chase Workflows

Karpathy is right: benchmark scores don't translate directly to job performance. Focus on GDPval-style evaluations that measure real-world tasks in your industry, not academic tests. The models that win on HLE may not be the models that win on your specific workflows.



Published February 5, 2026 · Analysis by aictrl.dev