SWE-bench Verified Leaderboard

coding · 49 models · Updated June 2026

On the SWE-bench Verified benchmark, Claude 4.5 Opus ranks #1 with a score of 79.2, while Mistral Devstral Small 2505 offers the best score per dollar among models with a listed price: 46.8 at $0.06 per 1M output tokens. The full ranking, with output cost per million tokens, is below.

💰 Best value
Mistral Devstral Small 2505 — score 46.8 at $0.06/1M output tokens
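| # | Model | Provider | Score | Output / 1M | Context |
|---|---|---|---|---|---|
| 1 | Claude 4.5 Opus | Qiniu | 79.2 | | 200K |
| 2 | Doubao-Seed-Code | ZenMux | 78.8 | $1.12 | 256K |
| 3 | Gemini 3 Pro | Google DeepMind | 77.4 | | |
| 4 | Claude 4 Sonnet | NanoGPT | 76.8 | $14.99 | 200K |
| 5 | Gemini 3 Flash Thinking | NanoGPT | 75.8 | $3.00 | 1M |
| 6 | MiniMax M2.5 | NanoGPT | 75.8 | $1.20 | 205K |
| 7 | Anthropic: Claude Opus 4.6 (Fast) | Anthropic | 75.6 | $150.00 | 1M |
| 8 | Gemini 2.5 Pro (Jun 2025) | Google DeepMind | 75.2 | | |
| 9 | Claude 4.5 Sonnet | Qiniu | 74.8 | | 200K |
| 10 | GPT 4.1 | NanoGPT | 74.6 | $8.00 | 1M |
| 11 | OpenAI o4-mini high | NanoGPT | 74.4 | $4.40 | 200K |
| 12 | GPT-5 | OpenAI | 74.4 | | |
| 13 | Claude 4 Opus Thinking (1K) | NanoGPT | 73.2 | $75.00 | 200K |
| 14 | GPT-5.2 Codex | OpenAI | 72.8 | | |
| 15 | GLM-5 | Z.ai (Zhipu AI) | 72.8 | | |
| 16 | GPT-5.2 | OpenAI | 72.8 | | |
| 17 | Kimi K2.5 | Moonshot | 70.8 | | |
| 18 | DeepSeek: DeepSeek V3.2 | DeepSeek | 70.0 | $0.34 | 131K |
| 19 | Qwen3-Coder-480B-A35B | Alibaba | 69.6 | | |
| 20 | GLM-4.6 | Z.ai (Zhipu AI), Tsinghua University | 68.2 | | |
| 21 | Claude 4.5 Haiku | Qiniu | 66.6 | | 200K |
| 22 | GPT-5.1-Codex | OpenAI | 66.0 | | |
| 23 | GPT-5.1 | Poe | 66.0 | $9.00 | 400K |
| 24 | Kimi K2 Thinking | Moonshot | 65.4 | | |
| 25 | GLM-4.5 | Z.ai (Zhipu AI), Tsinghua University | 64.2 | | |
| 26 | Claude 3.7 Sonnet | Anthropic | 63.2 | | |
| 27 | MiniMax-M2 | MiniMax | 61.0 | | |
| 28 | Qwen: Qwen3 Coder 30B A3B Instruct | Qwen | 60.4 | $0.27 | 160K |
| 29 | GPT-5 mini | OpenAI | 59.8 | | |
| 30 | o3 | OpenAI | 58.4 | | |
| 31 | Devstral Small | Mistral | 56.4 | $0.30 | 128K |
| 32 | Devstral 2 | Mistral | 53.8 | $2.00 | 262K |
| 33 | Gemini 2.0 Flash | Qiniu | 52.2 | | 1M |
| 34 | Claude 3.5 Sonnet | Anthropic | 50.8 | | |
| 35 | Mistral Devstral Small 2505 | NanoGPT | 46.8 | $0.06 | 33K |
| 36 | Amazon: Nova Premier 1.0 | Amazon | 42.4 | $12.50 | 1M |
| 37 | o3-mini | OpenAI | 42.4 | | |
| 38 | Claude 3.5 Haiku | Qiniu | 40.6 | | 200K |
| 39 | GPT-4o (Mar 2025) | OpenAI | 38.8 | | |
| 40 | GPT-5 nano | OpenAI | 34.8 | | |
| 41 | Gemini 2.5 Flash Preview | NanoGPT | 28.7 | $0.60 | 1M |
| 42 | gpt-oss-120b | OpenAI | 26.0 | | |
| 43 | GPT 4.1 Mini | NanoGPT | 23.9 | $1.60 | 1M |
| 44 | GPT-4 (Jun 2023) | OpenAI | 22.4 | | |
| 45 | Llama 4 Maverick | Meta AI | 21.0 | | |
| 46 | Claude 3 Opus | Anthropic | 15.8 | | |
| 47 | Llama 4 Scout | Meta AI | 9.1 | | |
| 48 | Qwen2.5 Coder 32B Instruct | | 9.0 | $1.00 | 128K |
| 49 | Claude 2 | Anthropic | 4.4 | | |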
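
The "best value" pick can be checked against the priced rows of the ranking. The page does not say how score-per-dollar is calculated, so the sketch below assumes the simplest reading: benchmark score divided by the listed price per 1M output tokens. The names `priced_rows` and `score_per_dollar` are illustrative and do not come from the leaderboard itself. Models without a listed price are left out.

```python
# (model, score, USD per 1M output tokens), copied from the priced rows of the ranking.
priced_rows = [
    ("Doubao-Seed-Code", 78.8, 1.12),
    ("Claude 4 Sonnet", 76.8, 14.99),
    ("Gemini 3 Flash Thinking", 75.8, 3.00),
    ("MiniMax M2.5", 75.8, 1.20),
    ("Anthropic: Claude Opus 4.6 (Fast)", 75.6, 150.00),
    ("GPT 4.1", 74.6, 8.00),
    ("OpenAI o4-mini high", 74.4, 4.40),
    ("Claude 4 Opus Thinking (1K)", 73.2, 75.00),
    ("DeepSeek: DeepSeek V3.2", 70.0, 0.34),
    ("GPT-5.1", 66.0, 9.00),
    ("Qwen: Qwen3 Coder 30B A3B Instruct", 60.4, 0.27),
    ("Devstral Small", 56.4, 0.30),
    ("Devstral 2", 53.8, 2.00),
    ("Mistral Devstral Small 2505", 46.8, 0.06),
    ("Amazon: Nova Premier 1.0", 42.4, 12.50),
    ("Gemini 2.5 Flash Preview", 28.7, 0.60),
    ("GPT 4.1 Mini", 23.9, 1.60),
    ("Qwen2.5 Coder 32B Instruct", 9.0, 1.00),
]


def score_per_dollar(score: float, price_per_million: float) -> float:
    """Assumed value metric: benchmark points per USD of output-token price."""
    return score / price_per_million


# Sort from best to worst value under the assumed metric.
ranked = sorted(
    priced_rows,
    key=lambda row: score_per_dollar(row[1], row[2]),
    reverse=True,
)
best_value = ranked[0][0]

for model, score, price in ranked:
    print(f"{model:36s} {score_per_dollar(score, price):8.1f}")
```

Under this assumption, Mistral Devstral Small 2505 comes out on top at roughly 780 points per dollar. The metric rewards very low prices heavily, so it is worth reading alongside the absolute score.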

What does SWE-bench Verified test?

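SWE-bench Verified is a human-validated subset of 500 problems drawn from SWE-bench, an evaluation framework with 2,294 real GitHub software-engineering problems. The score is the percentage of problems a model resolves.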

Pricing is indicative — confirm with the provider before production use. Updated June 2026.