AI Coding Research / February 2026

Your Model Isn't the Bottleneck.
Your Harness Is.

New research reveals that framework choice creates a 2.5x performance swing with the same AI model. The infrastructure around your model matters more than the model itself.

GPT-5's success rate on the same tasks varies 2.5x from harness choice alone.

1. Same Model, Different Frameworks

GPT-5 Performance on ABC-Bench

224 backend coding tasks across 8 languages and 19 frameworks

OpenHands: ~50%
Claude Code: ~45%
mini-SWE-agent: <20%

Same GPT-5 model. Only the agent harness changed. Framework architecture accounts for a 30+ percentage point swing in task success rate.

Best framework: OpenHands with GPT-5 (~50%)
Worst framework: mini-SWE-agent with GPT-5 (<20%)

2. The Operating System Analogy

Computer System → AI System

CPU (processing power) = Model, e.g. GPT-5 or Claude (reasoning & generation)
RAM (working memory) = Context window (128K-200K tokens)
Operating system (orchestration layer) = Agent harness (OpenHands, Claude Code, etc.)
Application (user logic) = Your coding task (bug fix, feature, refactor)

Key insight: a powerful CPU running under a bad operating system is wasted potential. The same applies to AI: a great model in a poor harness underperforms.

3. What Good Harnesses Do

1. Context Curation

Bad: dumps the entire codebase → the model drowns in noise.
Good: retrieves only the relevant files → the model stays focused.
Impact: +10-15%
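The idea can be sketched in a few lines of Python. The function name, the keyword-counting score, and the `top_k` cutoff are illustrative inventions, not any particular harness's API; real harnesses use embeddings or repository maps rather than raw keyword counts.

```python
def curate_context(task_keywords, files, top_k=3):
    """Rank files by crude keyword relevance and keep only the top_k.

    files: {path: source_text}. Keyword counting is a toy stand-in for
    the embedding- or repo-map-based retrieval real harnesses use.
    """
    def score(path, text):
        return sum(text.count(kw) + path.count(kw) for kw in task_keywords)

    ranked = sorted(files.items(), key=lambda item: score(*item), reverse=True)
    # Drop zero-relevance files so pure noise never reaches the model.
    return [path for path, text in ranked[:top_k] if score(path, text) > 0]
```

The point is the shape, not the scoring: the model sees three files instead of three thousand.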
2. Parallel Tool Calls

Bad: sequential calls, waiting on each response.
Good: parallel execution with batched operations.
Impact: 3-5x faster
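A minimal sketch of the difference, using Python's standard thread pool (the function name and flag are invented for this example). The speedup comes from overlapping the waiting, which is most of the cost when tool calls are I/O-bound file reads and searches.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tools(calls, parallel=True):
    """Execute a batch of independent tool calls.

    calls: list of zero-argument callables (file reads, searches, etc.).
    """
    if not parallel:
        return [call() for call in calls]      # the slow, sequential path
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(call) for call in calls]
        return [f.result() for f in futures]   # results in submission order
```

Only independent calls may be batched this way; a call that needs another call's output must still wait for it.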
3. Error Recovery

Bad: a test fails → gives up or loops forever.
Good: analyzes the error → retries with a fix → escalates if still stuck.
Impact: +20-30%
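The retry-then-escalate pattern can be sketched as a bounded loop; `run_with_recovery` and its arguments are hypothetical names, assuming the harness exposes the attempt as a callable and the error-driven fix as another.

```python
def run_with_recovery(action, fix, max_retries=3):
    """Run action(); on failure, let fix(error) adjust things, then retry.

    A bounded loop: the agent neither gives up on the first failure nor
    loops forever -- after max_retries it escalates by re-raising.
    """
    last_error = None
    for _ in range(max_retries):
        try:
            return action()
        except Exception as error:
            last_error = error
            fix(error)  # e.g. patch the code based on the test output
    raise RuntimeError(f"escalating after {max_retries} failed attempts") from last_error
```

The cap is the important part: unbounded retry loops burn tokens without converging, and escalation hands the problem back to a human with the failure history attached.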
4. Drift Detection

Bad: forgets the goal after 50+ tool calls.
Good: periodic goal reinforcement and state checkpoints.
Impact: critical
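Goal reinforcement is simple to sketch: every N tool calls, restate the original goal in the message history. The function and message format below are illustrative assumptions, not a specific harness's API.

```python
def reinforce_goal(messages, goal, step, every=10):
    """Append a reminder of the original goal every `every` tool calls.

    Long agent runs accumulate hundreds of messages; periodically
    restating the goal keeps it inside the model's effective attention.
    """
    if step > 0 and step % every == 0:
        messages.append({"role": "system",
                         "content": f"Reminder of the original goal: {goal}"})
    return messages
```

State checkpoints are the complementary half: snapshotting progress so a drifted run can be rewound instead of restarted.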
5. Standardized Interfaces

Bad: custom prompts per tool → inconsistent behavior.
Good: AGENTS.md plus consistent tool schemas.
Impact: reproducible runs
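As an illustration, a minimal AGENTS.md might look like the sketch below. Every section heading, command, and rule here is invented for this example; the convention itself only asks for plain-markdown instructions that the agent reads from the repository root.

```markdown
# AGENTS.md

## Build & test
- Install: `pip install -e .[dev]`
- Run tests: `pytest -q` (must pass before any commit)

## Conventions
- Python 3.11, type hints required, lint with `ruff`.
- Never edit files under `generated/`.

## Tool expectations
- All tools accept and return JSON; errors use `{"error": "<message>"}`.
```

The value is not any single rule but that every agent, on every run, gets the same instructions through the same channel.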
4. Better Models = Bigger Harness Gap

The performance gap between harnesses increases with model capability (OpenHands vs. mini-SWE-agent, ABC-Bench success rates):

Claude Sonnet 4.5: 63% vs. 35% (28pt gap)
GPT-5: ~50% vs. <20% (30+pt gap)
DeepSeek-V3.2: 50% vs. 25% (25pt gap)
Qwen3-8B: 10% vs. 5% (5pt gap)

Insight: Weak models are bottlenecked by capability. Strong models are bottlenecked by harness. The better your model, the more framework choice matters.

5. The Business Case

Option A: Upgrade Your Model

Change: GPT-4 → GPT-5
Cost impact: 3-5x higher API bills
Accuracy gain: ~15%

vs.

Option B: Upgrade Your Harness

Change: switch frameworks, e.g. to OpenHands (open-source)
Cost impact: engineering time only, no extra API spend
Accuracy gain: up to 30+ points, per the ABC-Bench numbers above
6. Action Items

This Week

Audit your current harness — what framework are you using?
Add AGENTS.md to your key repositories (60k+ repos already have them)
Benchmark the same task across different frameworks to measure your gap
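A crude way to measure that gap yourself, sketched in Python: the framework commands are placeholders you would replace with each harness's real CLI invocation, and success is judged by running the same acceptance test for every framework.

```python
import subprocess

def benchmark_frameworks(frameworks, test_cmd):
    """Let each framework attempt the same task, then score the result.

    frameworks: {name: agent_command_as_list} -- placeholder commands,
    to be replaced with each harness's actual invocation.
    test_cmd: the acceptance test, identical for every framework.
    """
    results = {}
    for name, agent_cmd in frameworks.items():
        subprocess.run(agent_cmd)         # agent attempts the task
        check = subprocess.run(test_cmd)  # same pass/fail check for all
        results[name] = check.returncode == 0
    return results
```

Run it on a handful of representative tasks, not one: single-task results are noisy.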

This Month

Evaluate OpenHands — open-source with best documented performance
Standardize tool interfaces with consistent schemas across your codebase

This Quarter

Build feedback loops — tests, linters, type checkers that give agents immediate feedback
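One way to wire that up, as a minimal sketch: after every agent edit, run each fast check and hand the failures straight back as the agent's next observation. The check commands shown in use are placeholders for your real test, lint, and type-check invocations.

```python
import subprocess

def collect_feedback(checks):
    """Run each fast check and collect the output of any failures.

    checks: {name: command_as_list}, e.g. tests, linter, type checker.
    The returned strings are ready to feed back to the agent verbatim.
    """
    feedback = []
    for name, cmd in checks.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            feedback.append(f"{name} failed:\n{result.stdout}{result.stderr}")
    return feedback
```

The faster these checks run, the tighter the loop; a linter that answers in a second is worth more to an agent than a perfect nightly suite.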

Sources

ABC-Bench: Benchmarking Agentic Backend Coding — OpenMOSS, January 2026
The OpenHands Index — January 28, 2026
The Importance of Agent Harness in 2026 — Phil Schmid
SWE-EVO: Benchmarking Coding Agents — January 2026