SWE-bench Verified Leaderboard
coding · 49 models · Updated June 2026
On the SWE-bench Verified benchmark, Claude 4.5 Opus ranks #1 with a score of 79.2, while Mistral Devstral Small 2505 offers the best score-per-dollar at $0.06/1M output tokens. The full ranking, with cost per million tokens, is below.
💰 Best value
Mistral Devstral Small 2505 — score 46.8 at $0.06/1M output tokens
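The page does not say how "score-per-dollar" is calculated. A minimal sketch, assuming it is simply the benchmark score divided by the output price per 1M tokens, reproduces the "best value" pick from the priced rows of the ranking table. The names `PRICED_ROWS`, `score_per_dollar`, and `rank_by_value` are hypothetical, introduced here for illustration only.

```python
# Priced rows copied from the ranking table:
# (model, score, USD per 1M output tokens). Rows without a price are omitted.
PRICED_ROWS = [
    ("Doubao-Seed-Code", 78.8, 1.12),
    ("Claude 4 Sonnet", 76.8, 14.99),
    ("Gemini 3 Flash Thinking", 75.8, 3.00),
    ("MiniMax M2.5", 75.8, 1.20),
    ("Anthropic: Claude Opus 4.6 (Fast)", 75.6, 150.00),
    ("GPT 4.1", 74.6, 8.00),
    ("OpenAI o4-mini high", 74.4, 4.40),
    ("Claude 4 Opus Thinking (1K)", 73.2, 75.00),
    ("DeepSeek: DeepSeek V3.2", 70.0, 0.34),
    ("GPT-5.1", 66.0, 9.00),
    ("Qwen: Qwen3 Coder 30B A3B Instruct", 60.4, 0.27),
    ("Devstral Small", 56.4, 0.30),
    ("Devstral 2", 53.8, 2.00),
    ("Mistral Devstral Small 2505", 46.8, 0.06),
    ("Amazon: Nova Premier 1.0", 42.4, 12.50),
    ("Gemini 2.5 Flash Preview", 28.7, 0.60),
    ("GPT 4.1 Mini", 23.9, 1.60),
    ("Qwen2.5 Coder 32B Instruct", 9.0, 1.00),
]


def score_per_dollar(score, price):
    """Assumed metric: score points per USD of 1M output tokens."""
    return score / price


def rank_by_value(rows):
    """Sort rows from best to worst under the assumed metric."""
    return sorted(rows, key=lambda r: score_per_dollar(r[1], r[2]), reverse=True)


ranked = rank_by_value(PRICED_ROWS)

# Show the five best-value models under this assumption.
for name, score, price in ranked[:5]:
    print(f"{name:36s} {score_per_dollar(score, price):7.1f} points per $")
```

Under this assumption Mistral Devstral Small 2505 comes out on top at roughly 780 points per dollar (46.8 / 0.06), matching the callout. The metric ignores input-token pricing and how many tokens a model spends per task, so it is only a rough guide to real cost.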
| # | Model | Provider | Score | Output price (USD / 1M tokens) | Context window (tokens) |
|---|---|---|---|---|---|
| 1 | Claude 4.5 Opus | Qiniu | 79.2 | — | 200K |
| 2 | Doubao-Seed-Code | ZenMux | 78.8 | $1.12 | 256K |
| 3 | Gemini 3 Pro | Google DeepMind | 77.4 | — | — |
| 4 | Claude 4 Sonnet | NanoGPT | 76.8 | $14.99 | 200K |
| 5 | Gemini 3 Flash Thinking | NanoGPT | 75.8 | $3.00 | 1M |
| 6 | MiniMax M2.5 | NanoGPT | 75.8 | $1.20 | 205K |
| 7 | Anthropic: Claude Opus 4.6 (Fast) | Anthropic | 75.6 | $150.00 | 1M |
| 8 | Gemini 2.5 Pro (Jun 2025) | Google DeepMind | 75.2 | — | — |
| 9 | Claude 4.5 Sonnet | Qiniu | 74.8 | — | 200K |
| 10 | GPT 4.1 | NanoGPT | 74.6 | $8.00 | 1M |
| 11 | OpenAI o4-mini high | NanoGPT | 74.4 | $4.40 | 200K |
| 12 | GPT-5 | OpenAI | 74.4 | — | — |
| 13 | Claude 4 Opus Thinking (1K) | NanoGPT | 73.2 | $75.00 | 200K |
| 14 | GPT-5.2 Codex | OpenAI | 72.8 | — | — |
| 15 | GLM-5 | Z.ai (Zhipu AI) | 72.8 | — | — |
| 16 | GPT-5.2 | OpenAI | 72.8 | — | — |
| 17 | Kimi K2.5 | Moonshot | 70.8 | — | — |
| 18 | DeepSeek: DeepSeek V3.2 | DeepSeek | 70.0 | $0.34 | 131K |
| 19 | Qwen3-Coder-480B-A35B | Alibaba | 69.6 | — | — |
| 20 | GLM-4.6 | Z.ai (Zhipu AI), Tsinghua University | 68.2 | — | — |
| 21 | Claude 4.5 Haiku | Qiniu | 66.6 | — | 200K |
| 22 | GPT-5.1-Codex | OpenAI | 66.0 | — | — |
| 23 | GPT-5.1 | Poe | 66.0 | $9.00 | 400K |
| 24 | Kimi K2 Thinking | Moonshot | 65.4 | — | — |
| 25 | GLM-4.5 | Z.ai (Zhipu AI), Tsinghua University | 64.2 | — | — |
| 26 | Claude 3.7 Sonnet | Anthropic | 63.2 | — | — |
| 27 | MiniMax-M2 | MiniMax | 61.0 | — | — |
| 28 | Qwen: Qwen3 Coder 30B A3B Instruct | Qwen | 60.4 | $0.27 | 160K |
| 29 | GPT-5 mini | OpenAI | 59.8 | — | — |
| 30 | o3 | OpenAI | 58.4 | — | — |
| 31 | Devstral Small | Mistral | 56.4 | $0.30 | 128K |
| 32 | Devstral 2 | Mistral | 53.8 | $2.00 | 262K |
| 33 | Gemini 2.0 Flash | Qiniu | 52.2 | — | 1M |
| 34 | Claude 3.5 Sonnet | Anthropic | 50.8 | — | — |
| 35 | Mistral Devstral Small 2505 | NanoGPT | 46.8 | $0.06 | 33K |
| 36 | Amazon: Nova Premier 1.0 | Amazon | 42.4 | $12.50 | 1M |
| 37 | o3-mini | OpenAI | 42.4 | — | — |
| 38 | Claude 3.5 Haiku | Qiniu | 40.6 | — | 200K |
| 39 | GPT-4o (Mar 2025) | OpenAI | 38.8 | — | — |
| 40 | GPT-5 nano | OpenAI | 34.8 | — | — |
| 41 | Gemini 2.5 Flash Preview | NanoGPT | 28.7 | $0.60 | 1M |
| 42 | gpt-oss-120b | OpenAI | 26.0 | — | — |
| 43 | GPT 4.1 Mini | NanoGPT | 23.9 | $1.60 | 1M |
| 44 | GPT-4 (Jun 2023) | OpenAI | 22.4 | — | — |
| 45 | Llama 4 Maverick | Meta AI | 21.0 | — | — |
| 46 | Claude 3 Opus | Anthropic | 15.8 | — | — |
| 47 | Llama 4 Scout | Meta AI | 9.1 | — | — |
| 48 | Qwen2.5 Coder 32B Instruct | — | 9.0 | $1.00 | 128K |
| 49 | Claude 2 | Anthropic | 4.4 | — | — |
What does SWE-bench Verified test?
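SWE-bench Verified is a human-validated subset of 500 tasks drawn from SWE-bench, whose full test set contains 2,294 real GitHub issues from open-source Python repositories. For each task, a model must produce a patch that resolves the issue, and the patch is checked against the repository's tests. Scores are conventionally reported as the percentage of tasks resolved.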
Pricing is indicative; confirm with the provider before production use.