TL;DR
A benchmark that tests whether an LLM, given a real GitHub issue and the full repository, can produce a code change that passes the repo's tests. Solve rates went from about 2% in 2023 to over 70% by 2025 -- the clearest quantitative record of agentic coding progress we have.
Why it matters
SWE-bench is why the narrative around "AI can write code" shifted from demo to production between 2023 and 2025. Every coding assistant vendor reports its number on it, and the progress curve is the steepest of any LLM benchmark family.
More importantly, it catalyzed the design of coding-specific agent scaffolds: file search, repo navigation, and test-run feedback loops. The paper defines the problem; the last two years of agent architecture answer it.
How you'd use this
Read if you're building coding agents or evaluating them. The SWE-bench Lite and SWE-bench Verified subsets are the practical entry points for your own eval work. Use their harness as a reference for how to structure an agent eval with real test feedback.
Read the authors' abstract
We introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories.
