SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

Published October 10, 2023 · arXiv: 2310.06770

TL;DR

A benchmark that tests whether an LLM, given a real GitHub issue and the full repository, can produce a code change that passes the repo's tests. Solve rates went from roughly 2% in 2023 to over 70% by 2025: the clearest quantitative record of agentic coding progress we have.

Why it matters

SWE-bench is why the narrative of "AI can write code" shifted from demo to production between 2023 and 2025. Every coding assistant vendor reports its score on it, and the progress curve is the steepest in any LLM benchmark family.

More importantly, it catalyzed the design of coding-specific agent scaffolds: file search, repo navigation, test-run feedback loops. The paper defines the problem; the last two years of agent architecture answer it.

How you'd use this

Read this if you're building or evaluating coding agents. The SWE-bench Lite and SWE-bench Verified subsets are the practical entry points for your own eval work; use the official harness as a reference for how to structure an agent eval with real test feedback.
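The core of that "real test feedback" grading is simple: after applying the model's patch, an instance counts as resolved only if every issue-revealing test (FAIL_TO_PASS) now passes and no previously passing test (PASS_TO_PASS) regresses. A minimal sketch of that scoring rule (function and variable names are illustrative, not the harness's actual API):

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """Decide whether a patched repo resolves a SWE-bench-style instance.

    test_results: dict mapping test id -> True (passed) / False (failed),
                  collected after applying the model's patch
    fail_to_pass: tests that failed before the patch and must now pass
    pass_to_pass: tests that passed before and must keep passing
    """
    # A missing test id is treated as a failure (e.g. collection error).
    return (all(test_results.get(t, False) for t in fail_to_pass)
            and all(test_results.get(t, False) for t in pass_to_pass))


results = {"test_issue_case": True, "test_existing_api": True}
print(is_resolved(results, ["test_issue_case"], ["test_existing_api"]))  # True

# A patch that breaks an existing test does not count, even if it fixes the issue.
regressed = {"test_issue_case": True, "test_existing_api": False}
print(is_resolved(regressed, ["test_issue_case"], ["test_existing_api"]))  # False
```

The PASS_TO_PASS check is what separates this from a plain "does the new test pass" benchmark: it penalizes patches that fix the issue by breaking something else.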

Read the authors' abstract

We introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories.
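Each of those 2,294 problems is a self-contained record pairing the issue text with the repo state it was filed against and the tests that adjudicate a fix. A rough sketch of the instance shape (field names follow the publicly released dataset schema as I recall it; treat this as illustrative, not the authoritative format):

```python
from dataclasses import dataclass, field

@dataclass
class SWEBenchInstance:
    # What the agent starts from
    instance_id: str        # unique id, conventionally "owner__repo-PRnumber"
    repo: str               # GitHub "owner/name"
    base_commit: str        # commit the issue was filed against
    problem_statement: str  # the issue text shown to the model
    # Reference solution and grading signals (hidden from the model)
    patch: str              # the gold PR diff that resolved the issue
    test_patch: str         # diff adding/updating the adjudicating tests
    FAIL_TO_PASS: list = field(default_factory=list)  # tests the fix must make pass
    PASS_TO_PASS: list = field(default_factory=list)  # tests that must not regress

# Hypothetical example instance, not drawn from the real dataset
ex = SWEBenchInstance(
    instance_id="example__repo-1",
    repo="example/repo",
    base_commit="abc1234",
    problem_statement="Calling frobnicate() with a str raises TypeError.",
    patch="",
    test_patch="",
    FAIL_TO_PASS=["tests/test_frob.py::test_frobnicate_str"],
)
print(ex.repo)
```

The split between what the agent sees (issue plus repo at `base_commit`) and what the grader holds back (`patch`, `test_patch`, test lists) is what keeps the benchmark honest.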