Which model is best on SWE-bench Verified?

Claude 4.5 Opus (Qiniu) currently leads the SWE-bench Verified leaderboard with a score of 79.2.

What is the best value model on SWE-bench Verified?

Mistral Devstral Small 2505 offers the best score-per-dollar on SWE-bench Verified — a score of 46.8 at $0.06 per 1M output tokens.

SWE-bench Verified Leaderboard

coding · 49 models · Updated June 2026

On the SWE-bench Verified benchmark, Claude 4.5 Opus ranks #1 with a score of 79.2, while Mistral Devstral Small 2505 offers the best score-per-dollar at $0.06/1M output tokens. The full ranking, with cost per million tokens, is below.

💰 Best value

Mistral Devstral Small 2505 — score 46.8 at $0.06/1M output tokens

#	Model	Provider	Score	Output price / 1M tokens	Context
1	Claude 4.5 Opus	Qiniu	79.2	—	200K
2	Doubao-Seed-Code	ZenMux	78.8	$1.12	256K
3	Gemini 3 Pro	Google DeepMind	77.4	—	—
4	Claude 4 Sonnet	NanoGPT	76.8	$14.99	200K
5	Gemini 3 Flash Thinking	NanoGPT	75.8	$3.00	1M
6	MiniMax M2.5	NanoGPT	75.8	$1.20	205K
7	Anthropic: Claude Opus 4.6 (Fast)	Anthropic	75.6	$150.00	1M
8	Gemini 2.5 Pro (Jun 2025)	Google DeepMind	75.2	—	—
9	Claude 4.5 Sonnet	Qiniu	74.8	—	200K
10	GPT 4.1	NanoGPT	74.6	$8.00	1M
11	OpenAI o4-mini high	NanoGPT	74.4	$4.40	200K
12	GPT-5	OpenAI	74.4	—	—
13	Claude 4 Opus Thinking (1K)	NanoGPT	73.2	$75.00	200K
14	GPT-5.2 Codex	OpenAI	72.8	—	—
15	GLM-5	Z.ai (Zhipu AI)	72.8	—	—
16	GPT-5.2	OpenAI	72.8	—	—
17	Kimi K2.5	Moonshot	70.8	—	—
18	DeepSeek: DeepSeek V3.2	DeepSeek	70.0	$0.34	131K
19	Qwen3-Coder-480B-A35B	Alibaba	69.6	—	—
20	GLM-4.6	Z.ai (Zhipu AI), Tsinghua University	68.2	—	—
21	Claude 4.5 Haiku	Qiniu	66.6	—	200K
22	GPT-5.1-Codex	OpenAI	66.0	—	—
23	GPT-5.1	Poe	66.0	$9.00	400K
24	Kimi K2 Thinking	Moonshot	65.4	—	—
25	GLM-4.5	Z.ai (Zhipu AI), Tsinghua University	64.2	—	—
26	Claude 3.7 Sonnet	Anthropic	63.2	—	—
27	MiniMax-M2	MiniMax	61.0	—	—
28	Qwen: Qwen3 Coder 30B A3B Instruct	Qwen	60.4	$0.27	160K
29	GPT-5 mini	OpenAI	59.8	—	—
30	o3	OpenAI	58.4	—	—
31	Devstral Small	Mistral	56.4	$0.30	128K
32	Devstral 2	Mistral	53.8	$2.00	262K
33	Gemini 2.0 Flash	Qiniu	52.2	—	1M
34	Claude 3.5 Sonnet	Anthropic	50.8	—	—
35	Mistral Devstral Small 2505	NanoGPT	46.8	$0.06	33K
36	Amazon: Nova Premier 1.0	Amazon	42.4	$12.50	1M
37	o3-mini	OpenAI	42.4	—	—
38	Claude 3.5 Haiku	Qiniu	40.6	—	200K
39	GPT-4o (Mar 2025)	OpenAI	38.8	—	—
40	GPT-5 nano	OpenAI	34.8	—	—
41	Gemini 2.5 Flash Preview	NanoGPT	28.7	$0.60	1M
42	gpt-oss-120b	OpenAI	26.0	—	—
43	GPT 4.1 Mini	NanoGPT	23.9	$1.60	1M
44	GPT-4 (Jun 2023)	OpenAI	22.4	—	—
45	Llama 4 Maverick	Meta AI	21.0	—	—
46	Claude 3 Opus	Anthropic	15.8	—	—
47	Llama 4 Scout	Meta AI	9.1	—	—
48	Qwen2.5 Coder 32B Instruct	—	9.0	$1.00	128K
49	Claude 2	Anthropic	4.4	—	—
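
The "best value" pick can be reproduced from the table above. The sketch below is a minimal illustration, not the site's actual method: it assumes score-per-dollar means the benchmark score divided by the listed price per 1M output tokens, and that models with no listed price are left out. The `Row` type and the `score_per_dollar` and `best_value` helpers are hypothetical names introduced here, and only a handful of rows are copied in, with provider names dropped.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    score: float
    # USD per 1M output tokens; None where the table shows no price.
    output_price: Optional[float]


# A few rows copied from the leaderboard table.
ROWS = [
    Row("Claude 4.5 Opus", 79.2, None),
    Row("Doubao-Seed-Code", 78.8, 1.12),
    Row("MiniMax M2.5", 75.8, 1.20),
    Row("DeepSeek: DeepSeek V3.2", 70.0, 0.34),
    Row("Qwen: Qwen3 Coder 30B A3B Instruct", 60.4, 0.27),
    Row("Devstral Small", 56.4, 0.30),
    Row("Mistral Devstral Small 2505", 46.8, 0.06),
]


def score_per_dollar(row: Row) -> float:
    """Assumed definition: score divided by output price per 1M tokens."""
    return row.score / row.output_price


def best_value(rows: list[Row]) -> Row:
    """Return the priced row with the highest score-per-dollar."""
    priced = [r for r in rows if r.output_price]  # skip unpriced models
    return max(priced, key=score_per_dollar)


best = best_value(ROWS)
print(best.model, round(score_per_dollar(best)))
```

Under this definition a low price can outweigh a large gap in score, which is why a model ranked #35 by score can still come out as the best value.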

What does SWE-bench Verified test?

SWE-bench Verified is a human-validated subset of 500 problems drawn from SWE-bench, an evaluation framework with 2,294 real GitHub software-engineering problems.

Pricing is indicative — confirm with the provider before production use. Updated June 2026.