AI Coding Research / February 2026

SWE-EVO Benchmark:
The 3x Performance Collapse

Frontier AI models that ace coding benchmarks fail dramatically at real software evolution tasks. GPT-5 drops from 65% to 21% when moving from isolated fixes to long-horizon multi-file changes.

Performance drop: ~3x, from 65% to 21%
1

What is SWE-EVO?

A new benchmark testing long-horizon software evolution tasks — the kind of work that actually matters in production codebases.

Tasks

48

Real evolution scenarios from 7 mature Python projects

Avg Files

21

Files modified per task — true multi-file reasoning

Tests

874

Per instance, validating that existing functionality is preserved

2

The Performance Gap

GPT-5 Performance Comparison

SWE-Bench Verified (isolated) vs SWE-EVO (long-horizon)

SWE-Bench Verified: 65.0%
SWE-EVO: 20.8%

Same model. Same tasks. Different scope. The gap between isolated fixes and sustained multi-file evolution is enormous.

Source: SWE-EVO Paper, Vals.ai SWE-Bench

3

Full Model Comparison

SWE-EVO vs SWE-Bench Verified — the gap shows capability lost on long-horizon tasks.

Drop from SWE-Bench Verified (isolated) to SWE-EVO (long-horizon), in percentage points:

GPT-5: -44
GPT-5-mini: -49
O3: -52
Deepseek-R1: -49
Qwen3-Coder: -41
GLM-4.5: -38
Kimi-K2: -25
GPT-4.1: -29

Source: SWE-EVO Paper (Table 1), Vals.ai SWE-Bench Verified

4

Harness Choice: SWE-Agent vs OpenHands

The same model performs very differently depending on which harness executes it.

Score change moving from OpenHands (container-based) to SWE-Agent (CLI-based):

GPT-4.1: +5x
GPT-5: +11%
Kimi-K2: +12%
GLM-4.5: tie
Qwen3-Coder: tie
Deepseek-R1: -20%
O3: +50%

Biggest Swing

5x

GPT-4.1: 2.08% → 10.42% with SWE-Agent

Exception

-20%

Deepseek-R1 performs worse on SWE-Agent


What's a harness? The agent framework executing LLM actions. SWE-Agent uses CLI-based shell access; OpenHands runs in isolated containers. Your harness architecture matters as much as model selection.
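As an illustration only (the loop, function names, and stub model below are hypothetical, not the actual SWE-Agent or OpenHands APIs), a CLI-based harness reduces to an act/observe loop over a shell; a container-based harness such as OpenHands runs the same loop inside an isolated sandbox:

```python
import subprocess

def scripted_model(transcript):
    # Stand-in for an LLM call: a real harness queries a model API here.
    # This stub issues one shell command, then stops.
    if "observation:" in transcript:
        return "DONE"
    return "echo patched"

def run_cli_harness(model, max_steps=5):
    """Minimal CLI-style harness loop: the model proposes a shell
    command, the harness executes it, appends the output to the
    transcript, and repeats until the model says DONE."""
    transcript = "task: fix the failing test\n"
    for _ in range(max_steps):
        action = model(transcript)
        if action == "DONE":
            break
        result = subprocess.run(action, shell=True,
                                capture_output=True, text=True)
        transcript += f"action: {action}\nobservation: {result.stdout}"
    return transcript

out = run_cli_harness(scripted_model)
```

The design difference the benchmark surfaces is in how that `subprocess.run` step is realized: direct shell access versus an isolated container with its own filesystem and tool surface.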

Source: SWE-EVO Paper (Table 2)

Claude / Anthropic Models

Important: Claude models were not included in SWE-EVO. Here's performance on related benchmarks:

SWE-Bench Verified: 80.9% — Claude Opus 4.5, #1
SWE-Bench Std (Vals.ai harness): 74.6%
SWE-Bench-Pro 30d (Claude Code): 54%
SWE-EVO: not evaluated

Applying GPT-5's ~3x relative drop to Claude Opus 4.5's 80.9%, an estimated SWE-EVO score would be ~25-30% (extrapolation, not a measurement).
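The estimate follows from assuming Claude Opus 4.5 suffers the same relative drop GPT-5 did; as a back-of-envelope sketch:

```python
# Back-of-envelope extrapolation (an estimate, not a measurement):
# assume Claude Opus 4.5 suffers the same relative drop GPT-5 did.
gpt5_verified = 65.0    # GPT-5, SWE-Bench Verified (%)
gpt5_evo = 20.8         # GPT-5, SWE-EVO (%)
opus_verified = 80.9    # Claude Opus 4.5, SWE-Bench Verified (%)

drop_ratio = gpt5_evo / gpt5_verified           # 0.32, i.e. a ~3x drop
opus_evo_estimate = opus_verified * drop_ratio  # ~25.9%
```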

5

Key Takeaways for Engineering Leaders

1

Benchmarks Lie (By Omission)

A model scoring 65% on SWE-Bench Verified may only achieve 21% on real software evolution. The gap between isolated fixes and sustained multi-file work is enormous.

2

Task Decomposition is Non-Negotiable

AI excels at bounded, single-file changes. A 2-day refactor should become 15-20 well-scoped AI tasks, not one "please refactor this system" prompt.
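One way to make that decomposition concrete (the `ScopedTask` schema and file names below are invented for illustration, not a prescribed tool) is to represent each AI task as a bounded unit with an explicit file scope and a verifiable done-condition:

```python
from dataclasses import dataclass, field

@dataclass
class ScopedTask:
    """One bounded unit of AI work: a single goal, an explicit file
    boundary, and a check that tells you when it is done."""
    goal: str
    files: list = field(default_factory=list)   # files the task may touch
    done_when: str = ""                         # verifiable completion check

# A 2-day refactor expressed as small, independently reviewable tasks
# (two shown here; a real plan would hold 15-20 of these).
refactor_plan = [
    ScopedTask("Extract DB access into a repository class",
               files=["app/db.py", "app/repo.py"],
               done_when="existing tests pass unchanged"),
    ScopedTask("Route view code through the repository",
               files=["app/views.py"],
               done_when="no direct db imports remain in views"),
]
```

Each entry is a prompt-sized task with its own review and rollback point, rather than one open-ended "refactor this system" request.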

3

Harness Choice Matters More Than You Think

The same model can swing from 2% to 10% based on execution framework. Your tooling architecture is as important as model selection.

4

Strong Models Fail Differently

Frontier models fail on instruction following (misinterpreting nuanced specs). Weaker models fail on tool use and syntax. Train your team accordingly.

5

The "10x Engineer" Becomes "10x Context Manager"

The 21% vs 65% gap is fundamentally about context. Engineers who excel with AI tools have learned to manage context boundaries explicitly.

Sources

SWE-EVO: Benchmarking Long-Horizon Coding — arXiv, January 2026
SWE-Bench Leaderboard — Vals.ai
Claude Code Tracker — MarginLab