AI Coding Research / February 2026

SWE-EVO Benchmark:
The 3x Performance Collapse

Frontier AI models that ace coding benchmarks fail dramatically at real software evolution tasks. GPT-5 drops from 65% to 21% when moving from isolated fixes to long-horizon multi-file changes.

Performance drop: ~3x, from 65% to 21%
1

What is SWE-EVO?

A new benchmark testing long-horizon software evolution tasks — the kind of work that actually matters in production codebases.

Tasks

48

Real evolution scenarios from 7 mature Python projects

Avg Files

21

Files modified per task — true multi-file reasoning

Tests

874

Per instance, validating that existing functionality is preserved

2

The Performance Gap

GPT-5 Performance Comparison

SWE-Bench Verified (isolated) vs SWE-EVO (long-horizon)

SWE-Bench Verified: 65.0%
SWE-EVO: 20.8%

Same model. Same tasks. Different scope. The gap between isolated fixes and sustained multi-file evolution is enormous.

Source: SWE-EVO Paper, Vals.ai SWE-Bench

3

Full Model Comparison

SWE-EVO vs SWE-Bench Verified — the gap shows capability lost on long-horizon tasks.

Drop from SWE-Bench Verified (isolated) to SWE-EVO (long-horizon), in percentage points:

GPT-5: -44
GPT-5-mini: -49
O3: -52
Deepseek-R1: -49
Qwen3-Coder: -41
GLM-4.5: -38
Kimi-K2: -25
GPT-4.1: -29

Source: SWE-EVO Paper (Table 1), Vals.ai SWE-Bench Verified

4

Harness Choice: SWE-Agent vs OpenHands

The same model performs very differently depending on which harness executes it.

Score change moving from OpenHands (container-based) to SWE-Agent (CLI-based):

GPT-4.1: +5x
GPT-5: +11%
Kimi-K2: +12%
GLM-4.5: tie
Qwen3-Coder: tie
Deepseek-R1: -20%
O3: +50%

Biggest Swing

5x

GPT-4.1: 2.08% → 10.42% with SWE-Agent

Exception

-20%

Deepseek-R1 performs worse on SWE-Agent


What's a harness? The agent framework executing LLM actions. SWE-Agent uses CLI-based shell access; OpenHands runs in isolated containers. Your harness architecture matters as much as model selection.
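As an illustration only (the loop, function names, and stub model below are hypothetical, not the actual SWE-Agent or OpenHands APIs), a CLI-based harness reduces to an act/observe loop over a shell; a container-based harness such as OpenHands runs the same loop inside an isolated sandbox:

```python
import subprocess

def scripted_model(transcript):
    # Stand-in for an LLM call: a real harness queries a model API here.
    # This stub issues one shell command, then stops.
    if "observation:" in transcript:
        return "DONE"
    return "echo patched"

def run_cli_harness(model, max_steps=5):
    """Minimal CLI-style harness loop: the model proposes a shell
    command, the harness executes it, appends the output to the
    transcript, and repeats until the model says DONE."""
    transcript = "task: fix the failing test\n"
    for _ in range(max_steps):
        action = model(transcript)
        if action == "DONE":
            break
        result = subprocess.run(action, shell=True,
                                capture_output=True, text=True)
        transcript += f"action: {action}\nobservation: {result.stdout}"
    return transcript

out = run_cli_harness(scripted_model)
```

The design difference the benchmark surfaces is in how that `subprocess.run` step is realized: direct shell access versus an isolated container with its own filesystem and tool surface.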

Source: SWE-EVO Paper (Table 2)

Claude / Anthropic Models

Important: Claude models were not included in SWE-EVO. Here's performance on related benchmarks:

SWE-Bench Verified: 80.9% — Claude Opus 4.5, #1
SWE-Bench Std (Vals.ai harness): 74.6%
SWE-Bench-Pro 30d (Claude Code): 54%
SWE-EVO: not evaluated

Applying GPT-5's ~3x relative drop to Claude Opus 4.5's 80.9%, an estimated SWE-EVO score would be ~25-30% (extrapolation, not a measurement).
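The estimate follows from assuming Claude Opus 4.5 suffers the same relative drop GPT-5 did; as a back-of-envelope sketch:

```python
# Back-of-envelope extrapolation (an estimate, not a measurement):
# assume Claude Opus 4.5 suffers the same relative drop GPT-5 did.
gpt5_verified = 65.0    # GPT-5, SWE-Bench Verified (%)
gpt5_evo = 20.8         # GPT-5, SWE-EVO (%)
opus_verified = 80.9    # Claude Opus 4.5, SWE-Bench Verified (%)

drop_ratio = gpt5_evo / gpt5_verified           # 0.32, i.e. a ~3x drop
opus_evo_estimate = opus_verified * drop_ratio  # ~25.9%
```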

5

Key Takeaways for Engineering Leaders

1

Benchmarks Lie (By Omission)

A model scoring 65% on SWE-Bench Verified may only achieve 21% on real software evolution. The gap between isolated fixes and sustained multi-file work is enormous.

2

Task Decomposition is Non-Negotiable

AI excels at bounded, single-file changes. A 2-day refactor should become 15-20 well-scoped AI tasks, not one "please refactor this system" prompt.
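One way to make that decomposition concrete (the `ScopedTask` schema and file names below are invented for illustration, not a prescribed tool) is to represent each AI task as a bounded unit with an explicit file scope and a verifiable done-condition:

```python
from dataclasses import dataclass, field

@dataclass
class ScopedTask:
    """One bounded unit of AI work: a single goal, an explicit file
    boundary, and a check that tells you when it is done."""
    goal: str
    files: list = field(default_factory=list)   # files the task may touch
    done_when: str = ""                         # verifiable completion check

# A 2-day refactor expressed as small, independently reviewable tasks
# (two shown here; a real plan would hold 15-20 of these).
refactor_plan = [
    ScopedTask("Extract DB access into a repository class",
               files=["app/db.py", "app/repo.py"],
               done_when="existing tests pass unchanged"),
    ScopedTask("Route view code through the repository",
               files=["app/views.py"],
               done_when="no direct db imports remain in views"),
]
```

Each entry is a prompt-sized task with its own review and rollback point, rather than one open-ended "refactor this system" request.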

3

Harness Choice Matters More Than You Think

The same model can swing from 2% to 10% based on execution framework. Your tooling architecture is as important as model selection.

4

Strong Models Fail Differently

Frontier models fail on instruction following (misinterpreting nuanced specs). Weaker models fail on tool use and syntax. Train your team accordingly.

5

The "10x Engineer" Becomes "10x Context Manager"

The 21% vs 65% gap is fundamentally about context. Engineers who excel with AI tools have learned to manage context boundaries explicitly.

Sources

SWE-EVO: Benchmarking Long-Horizon Coding — arXiv, January 2026
SWE-Bench Leaderboard — Vals.ai
Claude Code Tracker — MarginLab