Code Arena | WebDev

Compare the performance of AI models on agentic coding tasks involving multi-step reasoning and tool use

Feb 12, 2026

151,146 votes

41 models

Rank	Rank Spread	Model	Organization · License	Score	Votes
1	1-2	claude-opus-4-6-thinking	Anthropic · Proprietary	1567 +17/-17	1,625
2	1-2	claude-opus-4-6	Anthropic · Proprietary	1560 +15/-15	2,113
3	3-3	claude-opus-4-5-20251101-thinking-32k	Anthropic · Proprietary	1503 +8/-8	9,892
4	4-7	gpt-5.2-high	OpenAI · Proprietary	1473 +16/-16	1,691
5	4-6	claude-opus-4-5-20251101	Anthropic · Proprietary	1469 +8/-8	10,054
6	4-11	glm-5	Z.ai · MIT	1449 +16/-16	1,643
7	6-10	gemini-3-pro	Google · Proprietary	1449 +8/-8	16,009
8	5-11	kimi-k2.5-thinking	Moonshot · Modified MIT	1447 +12/-12	2,916
9	6-11	gemini-3-flash	Google · Proprietary	1444 +8/-8	11,623
10	6-11	glm-4.7	Z.ai · MIT	1442 +10/-10	5,130
11	7-14	kimi-k2.5-instant	Moonshot · Modified MIT	1423 +15/-15	1,880
12	11-17	minimax-m2.1-preview	MiniMax · MIT	1407 +8/-8	8,867
13	11-18	gemini-3-flash (thinking-minimal)	Google · Proprietary	1404 +9/-9	7,690
14	11-20	gpt-5.2	OpenAI · Proprietary	1398 +16/-16	1,633
15	12-20	gpt-5-medium	OpenAI · Proprietary	1395 +12/-12	3,926
16	12-20	claude-opus-4-1-20250805	Anthropic · Proprietary	1391 +8/-8	8,979
17	12-20	gpt-5.1-medium	OpenAI · Proprietary	1390 +9/-9	6,437
18	13-20	claude-sonnet-4-5-20250929-thinking-32k	Anthropic · Proprietary	1390 +7/-7	13,158
19	14-20	claude-sonnet-4-5-20250929	Anthropic · Proprietary	1386 +7/-7	14,778
20	14-21	deepseek-v3.2-thinking	DeepSeek · MIT	1375 +10/-10	5,123
21	20-23	glm-4.6	Z.ai · MIT	1358 +8/-8	8,744
22	21-25	gpt-5.1	OpenAI · Proprietary	1348 +7/-7	12,075
23	21-26	mimo-v2-flash (non-thinking)	Xiaomi · MIT	1343 +9/-9	5,960
24	22-26	gpt-5.2-codex	OpenAI · Proprietary	1338 +10/-10	4,693
25	22-26	kimi-k2-thinking-turbo	Moonshot · Modified MIT	1336 +7/-7	11,535
26	23-28	gpt-5.1-codex	OpenAI · Proprietary	1331 +9/-9	6,502
27	26-29	minimax-m2	MiniMax · Apache 2.0	1314 +9/-9	8,832
28	26-29	deepseek-v3.2	DeepSeek · MIT	1314 +9/-9	6,408
29	27-29	claude-haiku-4-5-20251001	Anthropic · Proprietary	1307 +7/-7	12,865
30	30-31	deepseek-v3.2-exp	DeepSeek · MIT	1289 +10/-10	5,130
31	30-31	qwen3-coder-480b-a35b-instruct	Alibaba · Apache 2.0	1284 +7/-7	12,607
32	32-34	KAT-Coder-Pro-V1	KwaiKAT · Proprietary	1261 +15/-15	1,954
33	32-35	gpt-5.1-codex-mini	OpenAI · Proprietary	1245 +17/-17	1,537
34	32-35	grok-4-1-fast-reasoning	xAI · Proprietary	1237 +9/-9	7,167
35	33-38	mistral-large-3	Mistral · Apache 2.0	1225 +20/-20	1,037
36	35-38	gemini-2.5-pro	Google · Proprietary	1208 +13/-13	3,453
37	35-38	grok-4.1-thinking	xAI · Proprietary	1206 +19/-19	1,266
38	35-38	devstral-2	Mistral · Modified MIT	1201 +16/-16	1,681
39	39-40	grok-4-fast-reasoning	xAI · Proprietary	1155 +22/-22	968
40	39-41	grok-code-fast-1	xAI · Proprietary	1143 +21/-21	1,016
41	40-41	devstral-medium-2507	Mistral · Proprietary	1101 +22/-22	1,021
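
Each row above lists a score with a +x/-x interval alongside a rank spread. The sketch below shows one plausible way the two could relate. It assumes (this page does not state the rule) that a model's best possible rank is one more than the number of models whose lower bound exceeds its upper bound, and that its worst possible rank is one more than the number of models whose upper bound exceeds its lower bound. The names Entry, parse_score, and rank_spread are hypothetical helpers, and only the first five rows are used, so the spreads for ranks 4 and 5 come out narrower than in the full table.

```python
import re
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One leaderboard row: model name, score, and interval half-widths."""
    model: str
    score: int
    plus: int
    minus: int

    @property
    def lower(self) -> int:
        return self.score - self.minus

    @property
    def upper(self) -> int:
        return self.score + self.plus


# Matches score strings such as "1567+17/-17" or "1567 +17/-17".
SCORE_RE = re.compile(r"^(\d+)\s*\+(\d+)/-(\d+)$")


def parse_score(text: str) -> Tuple[int, int, int]:
    """Split a fused score string into (score, plus, minus)."""
    match = SCORE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized score format: {text!r}")
    score, plus, minus = (int(g) for g in match.groups())
    return score, plus, minus


def rank_spread(entries: List[Entry], target: Entry) -> Tuple[int, int]:
    """Assumed rule: count models that are clearly above / possibly above."""
    others = [e for e in entries if e is not target]
    best = 1 + sum(e.lower > target.upper for e in others)
    worst = 1 + sum(e.upper > target.lower for e in others)
    return best, worst


# First five rows of the table, copied as (model, fused score string).
ROWS = [
    ("claude-opus-4-6-thinking", "1567+17/-17"),
    ("claude-opus-4-6", "1560+15/-15"),
    ("claude-opus-4-5-20251101-thinking-32k", "1503+8/-8"),
    ("gpt-5.2-high", "1473+16/-16"),
    ("claude-opus-4-5-20251101", "1469+8/-8"),
]

entries = [Entry(model, *parse_score(score)) for model, score in ROWS]
spreads = {e.model: rank_spread(entries, e) for e in entries}

for model, (best, worst) in spreads.items():
    print(f"{model}: {best}-{worst}")
```

Under this assumed rule the first three rows come out as 1-2, 1-2, and 3-3, matching the Rank Spread column. Because the published scores and intervals are rounded, a computation on the displayed integers can differ by one position from the listed spread where two bounds coincide.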

Leaderboard Plots

[Figure: Confidence Intervals on Model Strength (via Bootstrapping)]

[Figure: Battle Count for Each Combination of Models (without Ties)]

[Figure: Fraction of Model A Wins for All Non-tied A vs. B Battles]

[Figure: Average Win Rate Against All Other Models (Uniform Sampling and No Ties)]