Code Arena 🏆 Overall

View overall rankings across AI models on agentic coding tasks involving multi-step reasoning and tool use.

Apr 9, 2026
231,158 votes
13 labs
| Lab Rank | Lab | Model | Score | Rank | Rank Spread |
|---|---|---|---|---|---|
| 1 | Anthropic | claude-opus-4-6-thinking · Proprietary | 1548 +11/-11 | 1 | 1-3 |
| 2 | Z.ai | glm-5.1 | 1530 +20/-20 | 3 | 1-4 |
| 3 | OpenAI | gpt-5.4-high (codex-harness) · Proprietary | 1457 +17/-17 | 7 | 6-15 |
| 4 | Google | gemini-3.1-pro-preview · Proprietary | 1456 +9/-9 | 8 | 6-12 |
| 5 | Alibaba | qwen3.6-plus-preview · Proprietary | 1453 +14/-14 | 9 | 6-15 |
| 6 | Xiaomi | mimo-v2-pro · Proprietary | 1433 +12/-12 | 15 | 8-17 |
| 7 | Moonshot | kimi-k2.5-thinking | 1429 +8/-8 | 16 | 10-17 |
| 8 | MiniMax | minimax-m2.7 | 1425 +12/-12 | 17 | 10-20 |
| 9 | xAI | grok-4.20-beta-0309-reasoning · Proprietary | 1393 +11/-11 | 21 | 18-31 |
| 10 | DeepSeek | deepseek-v3.2-thinking | 1368 +8/-8 | 32 | 31-34 |
| 11 | KwaiKAT | KAT-Coder-Pro-V1 · Proprietary | 1257 +15/-15 | 47 | 47-52 |
| 12 | Mistral | mistral-large-3 | 1222 +20/-20 | 53 | 48-56 |
| 13 | Inception AI | mercury-2 · Proprietary | 1166 +23/-23 | 57 | 55-59 |
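
The score entries pair a point estimate with a +/- margin. As a minimal sketch, assuming those margins are the bounds of the confidence interval around each score, the snippet below parses an entry and checks whether two intervals overlap, which is one way to see why closely scored models can share a rank spread. The function names `parse_score` and `intervals_overlap` are hypothetical helpers, not part of the leaderboard.

```python
import re


def parse_score(text):
    """Parse an entry such as '1548+11/-11' into (low, score, high).

    Assumption: the +x/-y margins are interval bounds around the score.
    """
    match = re.fullmatch(r"(\d+)\s*\+(\d+)/-(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized score entry: {text!r}")
    score, plus, minus = (int(group) for group in match.groups())
    return score - minus, score, score + plus


def intervals_overlap(a, b):
    """Return True if two (low, score, high) intervals share any value."""
    return a[0] <= b[2] and b[0] <= a[2]


claude = parse_score("1548+11/-11")  # (1537, 1548, 1559)
glm = parse_score("1530+20/-20")     # (1510, 1530, 1550)
gpt = parse_score("1457+17/-17")     # (1440, 1457, 1474)

print(intervals_overlap(claude, glm))  # True
print(intervals_overlap(claude, gpt))  # False
```

Overlapping intervals mean the ordering of two models is not statistically settled; the published rank spread is computed over the full leaderboard, so it cannot be reproduced from these 13 rows alone.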

Leaderboard Plots

Confidence Intervals on Model Strength (via Bootstrapping)

Battle Count for Each Combination of Models (without Ties)

Average Win Rate Against All Other Models (Uniform Sampling and No Ties)

Fraction of Model A Wins for All Non-tied A vs. B Battles
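
The plot titles above name three quantities: pairwise win fractions over non-tied battles, an average win rate against all opponents, and bootstrapped confidence intervals. The sketch below illustrates how such quantities could be computed from a battle log. The log format, the toy data, and all function names are hypothetical. The page does not specify the rating model behind "model strength", so the bootstrap here is applied to the simpler average win rate.

```python
import random
from collections import defaultdict

# Hypothetical battle log: (model_a, model_b, winner), winner in {"a", "b", "tie"}.
BATTLES = [
    ("m1", "m2", "a"),
    ("m1", "m2", "a"),
    ("m1", "m2", "b"),
    ("m1", "m3", "a"),
    ("m2", "m3", "tie"),
    ("m2", "m3", "b"),
    ("m3", "m1", "b"),
]


def win_fractions(battles):
    """Return {(x, y): fraction of non-tied x-vs-y battles won by x}."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for a, b, winner in battles:
        if winner == "tie":
            continue  # ties are excluded, as in the plot titles
        winner_name, loser_name = (a, b) if winner == "a" else (b, a)
        wins[(winner_name, loser_name)] += 1
        totals[(a, b)] += 1
        totals[(b, a)] += 1
    return {pair: wins[pair] / count for pair, count in totals.items()}


def average_win_rate(fractions):
    """Average each model's win fraction over its opponents, weighting opponents equally."""
    per_model = defaultdict(list)
    for (model, _opponent), fraction in fractions.items():
        per_model[model].append(fraction)
    return {model: sum(vals) / len(vals) for model, vals in per_model.items()}


def bootstrap_interval(battles, model, rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for one model's average win rate."""
    rng = random.Random(seed)
    samples = []
    for _ in range(rounds):
        resampled = rng.choices(battles, k=len(battles))  # sample with replacement
        rate = average_win_rate(win_fractions(resampled)).get(model)
        if rate is not None:  # skip resamples in which the model never appears
            samples.append(rate)
    if not samples:
        raise ValueError(f"no resample contained a non-tied battle for {model!r}")
    samples.sort()
    low = samples[int((alpha / 2) * len(samples))]
    high = samples[min(len(samples) - 1, int((1 - alpha / 2) * len(samples)))]
    return low, high


fractions = win_fractions(BATTLES)
print(fractions[("m1", "m2")])
print(average_win_rate(fractions))
print(bootstrap_interval(BATTLES, "m1"))
```

Averaging per opponent rather than per battle keeps a frequently matched pairing from dominating the result, which is one reading of "uniform sampling" in the plot title.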