AI Coding Research / February 2026
New research reveals that framework choice creates a 2.5x performance swing with the same AI model. The infrastructure around your model matters more than the model itself.
The study: 224 backend coding tasks across 8 languages and 19 frameworks, all run with the same GPT-5 model; only the agent harness changed. Framework architecture accounts for a 30+ percentage point swing in task success rate.
Best framework: OpenHands with GPT-5, ~50% task success
Worst framework: mini-SWE-agent with GPT-5, <20% task success
The computer analogy, layer by layer:
CPU (processing power) ↔ Model (GPT-5, Claude): reasoning & generation
RAM (working memory) ↔ Context Window: 128K-200K tokens
Operating System (orchestration layer) ↔ Agent Harness: OpenHands, Claude Code, etc.
Application (user logic) ↔ Your Coding Task: bug fix, feature, refactor
Key insight: A powerful CPU with a bad operating system = wasted potential. The same applies to AI: a great model with a poor harness underperforms.
Context management
Bad: Dumps entire codebase → model drowns in noise
Good: Retrieves only relevant files → model focuses
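"Retrieve only relevant files" can be as simple as scoring files by keyword overlap with the task description. A minimal sketch, where the function name and the scoring heuristic are illustrative assumptions, not details from the study (real harnesses typically use embeddings or repository maps):

```python
import os
import re

def retrieve_relevant_files(repo_root, task_description, top_k=5):
    """Score each .py file by keyword overlap with the task; keep the top_k."""
    keywords = set(re.findall(r"[a-z_]{3,}", task_description.lower()))
    scored = []
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    text = f.read().lower()
            except OSError:
                continue
            # Count how often each task keyword appears in the file
            score = sum(text.count(k) for k in keywords)
            if score:
                scored.append((score, path))
    scored.sort(reverse=True)  # highest keyword overlap first
    return [path for _score, path in scored[:top_k]]
```

Even this crude heuristic puts a handful of candidate files in the prompt instead of the whole tree.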
Tool-call efficiency
Bad: Sequential calls, waits for each response
Good: Parallel execution, batched operations
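The sequential-vs-parallel difference is easy to sketch with Python's standard thread pool; the `run_tools_parallel` helper below is a hypothetical stand-in for how a harness might batch independent tool calls:

```python
from concurrent.futures import ThreadPoolExecutor

def run_tools_parallel(tool_calls, max_workers=8):
    """tool_calls: list of (callable, args) pairs.

    Submit all calls at once instead of awaiting each in turn,
    then collect results in the original order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, *args) for fn, args in tool_calls]
        return [f.result() for f in futures]
```

With N independent calls, wall-clock time approaches the slowest single call rather than the sum of all of them.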
Error recovery
Bad: Test fails → gives up or loops forever
Good: Analyzes error → retries with fix → escalates
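The retry-then-escalate pattern is a bounded control loop. In this sketch all four callbacks (`run_tests`, `propose_fix`, `apply_fix`, `escalate`) are hypothetical stand-ins for harness internals:

```python
def run_with_recovery(run_tests, propose_fix, apply_fix, escalate, max_retries=3):
    """Run tests; on failure, analyze the error and apply a candidate fix.

    Bounded retries prevent infinite loops; after max_retries the error
    is escalated (e.g. to a human or a stronger model) instead of dropped.
    """
    for _attempt in range(max_retries):
        ok, error = run_tests()
        if ok:
            return "passed"
        apply_fix(propose_fix(error))  # analyze the error, try a fix
    ok, error = run_tests()
    if ok:
        return "passed"
    escalate(error)
    return "escalated"
```

The key property is that every exit path is explicit: pass, or escalate with the last error attached, never a silent give-up.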
Goal persistence
Bad: Forgets goal after 50+ tool calls
Good: Periodic goal reinforcement, state checkpoints
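Goal reinforcement can be as simple as re-injecting the original goal into the conversation whenever too many tool calls have passed since the last reminder. A sketch assuming an OpenAI-style message list; the `REMINDER:` convention is invented for illustration:

```python
def reinforce_goal(messages, goal, every_n_tool_calls=10):
    """Append a goal reminder if the last one is > N tool calls back."""
    calls_since_reminder = 0
    # Walk backwards until we hit the most recent reminder (or run out)
    for msg in reversed(messages):
        if msg.get("role") == "system" and msg.get("content", "").startswith("REMINDER:"):
            break
        if msg.get("role") == "tool":
            calls_since_reminder += 1
    if calls_since_reminder >= every_n_tool_calls:
        messages.append({"role": "system",
                         "content": "REMINDER: the original goal is: " + goal})
    return messages
```

Calling this after every tool result keeps the goal within the model's recent context without duplicating reminders on every turn.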
Tool configuration
Bad: Custom prompts per tool → inconsistent
Good: AGENTS.md + consistent schemas
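AGENTS.md is a plain markdown file of project instructions that every tool and prompt can share, instead of each tool carrying its own ad hoc prompt. A hypothetical sketch; the commands, paths, and schema fields below are illustrative, not from the study:

```markdown
# AGENTS.md

## Project overview
Python 3.11 backend service; source in `src/`, tests in `tests/`.

## Commands
- Run tests: `pytest -q`
- Lint: `ruff check src/`

## Conventions
- Every tool call uses the same JSON shape: `{"name", "arguments", "timeout_s"}`.
- Prefer small, reviewable diffs; never rewrite whole files.
```

One shared file means the model sees the same instructions regardless of which tool it is about to invoke.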
The performance gap between harnesses increases with model capability
Insight: Weak models are bottlenecked by capability. Strong models are bottlenecked by harness. The better your model, the more framework choice matters.