Code Arena

Compare the performance of AI models on agentic coding tasks involving multi-step reasoning and tool use

Last Updated: Feb 6, 2026
Total Votes: 136,159
Total Models: 39

Rank  Rank Spread  Score          Votes   Organization  License
1     1◄─►1        1576 +19/-19   1,422   Anthropic     Proprietary
2     2◄─►2        1502 +9/-9     9,003   Anthropic     Proprietary
3     3◄─►6        1472 +16/-16   1,691   OpenAI        Proprietary
4     3◄─►5        1470 +9/-9     9,179   Anthropic     Proprietary
5     4◄─►8        1452 +8/-8     15,193  Google        Proprietary
6     3◄─►8        1449 +14/-14   2,123   Moonshot      Modified MIT
7     5◄─►8        1442 +8/-8     10,736  Google        Proprietary
8     5◄─►8        1441 +10/-10   5,125   Z.ai          MIT
9     9◄─►13       1408 +9/-9     8,095   MiniMax       MIT
10    9◄─►17       1407 +19/-19   1,056   Moonshot      Modified MIT
11    9◄─►15       1406 +9/-9     6,788   Google        Proprietary
12    9◄─►18       1397 +16/-16   1,632   OpenAI        Proprietary
13    9◄─►18       1394 +12/-12   3,925   OpenAI        Proprietary
14    10◄─►18      1389 +9/-9     8,980   Anthropic     Proprietary
15    10◄─►18      1389 +9/-9     6,432   OpenAI        Proprietary
16    11◄─►18      1387 +7/-7     12,309  Anthropic     Proprietary
17    11◄─►18      1386 +7/-7     13,951  Anthropic     Proprietary
18    12◄─►19      1374 +10/-10   4,449   DeepSeek      MIT
19    18◄─►21      1357 +9/-9     8,741   Z.ai          MIT
20    19◄─►22      1349 +8/-8     11,221  OpenAI        Proprietary
21    19◄─►24      1344 +9/-9     5,156   Xiaomi        MIT
22    20◄─►24      1336 +11/-11   3,852   OpenAI        Proprietary
23    21◄─►24      1331 +8/-8     10,780  Moonshot      Modified MIT
24    21◄─►25      1329 +9/-9     6,501   OpenAI        Proprietary
25    24◄─►27      1313 +9/-9     8,833   MiniMax       Apache 2.0
26    25◄─►27      1309 +9/-9     5,654   DeepSeek      MIT
27    25◄─►28      1301 +7/-7     12,024  Anthropic     Proprietary
28    27◄─►29      1287 +10/-10   5,130   DeepSeek      MIT
29    28◄─►30      1281 +7/-7     11,785  Alibaba       Apache 2.0
30    29◄─►32      1259 +15/-15   1,954   KwaiKAT       Proprietary
31    30◄─►33      1243 +17/-17   1,537   OpenAI        Proprietary
32    30◄─►33      1235 +10/-10   6,480   xAI           Proprietary
33    31◄─►36      1223 +20/-20   1,037   Mistral       Apache 2.0
34    33◄─►36      1206 +13/-13   3,454   Google        Proprietary
35    33◄─►36      1205 +19/-19   1,265   xAI           Proprietary
36    33◄─►36      1199 +16/-16   1,678   Mistral       Modified MIT
37    37◄─►38      1153 +23/-23   968     xAI           Proprietary
38    37◄─►39      1141 +21/-21   1,016   xAI           Proprietary
39    38◄─►39      1099 +22/-22   1,021   Mistral       Proprietary

Leaderboard Plots

Fraction of Model A Wins for All Non-tied A vs. B Battles
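As a rough illustration of what this plot summarizes, the sketch below computes pairwise win fractions and non-tied battle counts from a list of battle records. The record layout `(model_a, model_b, winner)` and the function name `win_fractions` are assumptions made for this example; the page does not describe the leaderboard's actual data format or code, and this sketch ignores which side a model was shown on.

```python
from collections import defaultdict

def win_fractions(battles):
    """Return (fractions, counts) over non-tied battles.

    battles: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". This record layout is hypothetical.
    """
    wins = defaultdict(int)    # wins[(x, y)]: times x beat y
    counts = defaultdict(int)  # counts[(x, y)]: non-tied battles between x and y
    for a, b, winner in battles:
        if winner == "tie":
            continue  # ties are excluded, as in the plot title
        victor, loser = (a, b) if winner == "a" else (b, a)
        wins[(victor, loser)] += 1
        counts[(a, b)] += 1
        counts[(b, a)] += 1
    fractions = {pair: wins[pair] / n for pair, n in counts.items()}
    return fractions, dict(counts)
```

Calling `win_fractions(records)` yields a dictionary keyed by ordered model pairs, so `fractions[("x", "y")]` is the share of non-tied battles between `x` and `y` that `x` won.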

Confidence Intervals on Model Strength (via Bootstrapping)
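The general idea of a bootstrap confidence interval can be sketched as follows: resample the battles with replacement, recompute a strength statistic on each resample, and read the interval off the percentiles. This is a minimal sketch under stated assumptions, not the leaderboard's method: the `(winner, loser)` record layout, the function names, and the use of a plain win rate as the strength statistic are all hypothetical stand-ins, since the page does not say how its scores are fitted.

```python
import random

def bootstrap_ci(battles, statistic, rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(battles)."""
    rng = random.Random(seed)
    n = len(battles)
    # Recompute the statistic on `rounds` resamples drawn with replacement.
    samples = sorted(
        statistic([battles[rng.randrange(n)] for _ in range(n)])
        for _ in range(rounds)
    )
    lower = samples[int((alpha / 2) * rounds)]
    upper = samples[min(rounds - 1, int((1 - alpha / 2) * rounds))]
    return lower, upper

def win_rate(model):
    """Build a statistic: the share of `model`'s battles that it won.

    Stand-in for a real strength estimate; records are (winner, loser).
    """
    def stat(sample):
        played = [(w, l) for w, l in sample if model in (w, l)]
        if not played:
            return 0.0
        return sum(1 for w, _ in played if w == model) / len(played)
    return stat
```

For example, `bootstrap_ci(records, win_rate("x"))` returns a lower and upper bound on the win rate of the hypothetical model `x`. Swapping in a different statistic, such as a fitted rating, leaves the resampling loop unchanged.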

Average Win Rate Against All Other Models (Uniform Sampling and No Ties)

Battle Count for Each Combination of Models (without Ties)