Agent Arena

Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.

Jun 15, 2026
807,801 sessions
28 models
| Rank | Rank range | Model | Organization · License · Host | Overall | Confirmed Success | Praise vs Complaint | Steerability | Bash Recovery | Tool Hallucination | Sessions |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1-1 | Claude Fable 5 (High) | Anthropic · Proprietary | 14.17% ± 1.54% | 16.48% ± 2.55% | 29.65% ± 5.72% | 13.39% ± 2.94% | 9.46% ± 1.84% | 1.86% ± 0.23% | 16,236 |
| 2 | 2-7 | Claude Opus 4.8 (Thinking) | Anthropic · Proprietary | 9.04% ± 1.19% | 10.75% ± 2.15% | 14.36% ± 4.30% | 10.48% ± 2.23% | 9.08% ± 1.24% | 0.56% ± 1.20% | 27,781 |
| 3 | 2-10 | GPT 5.5 (xHigh) | OpenAI · Proprietary | 8.27% ± 1.73% | 4.42% ± 3.17% | 17.52% ± 6.44% | 3.12% ± 3.44% | 14.42% ± 1.24% | 1.86% ± 0.23% | 12,982 |
| 4 | 2-10 | Claude Opus 4.7 | Anthropic · Proprietary | 8.12% ± 1.51% | 4.63% ± 3.07% | 10.73% ± 5.33% | 11.69% ± 2.94% | 11.74% ± 1.45% | 1.81% ± 0.23% | 29,820 |
| 5 | 2-10 | Claude Opus 4.7 (Thinking) | Anthropic · Proprietary | 8.09% ± 1.48% | 4.51% ± 3.04% | 12.40% ± 5.31% | 8.66% ± 2.93% | 13.08% ± 0.99% | 1.82% ± 0.23% | 29,777 |
| 6 | 2-10 | GPT 5.5 (High) | OpenAI · Proprietary | 7.78% ± 1.07% | 5.93% ± 2.03% | 10.90% ± 3.80% | 7.56% ± 2.06% | 12.64% ± 1.19% | 1.86% ± 0.23% | 39,502 |
| 7 | 3-10 | GPT 5.5 | OpenAI · Proprietary | 6.73% ± 1.03% | 4.75% ± 1.97% | 8.13% ± 3.71% | 7.86% ± 1.96% | 11.07% ± 1.27% | 1.86% ± 0.23% | 39,811 |
| 8 | 2-10 | Claude Opus 4.6 | Anthropic · Proprietary | 6.73% ± 1.42% | 6.17% ± 2.98% | 7.40% ± 4.64% | 8.13% ± 2.83% | 10.11% ± 2.16% | 1.86% ± 0.23% | 29,795 |
| 9 | 3-10 | GPT 5.4 (High) | OpenAI · Proprietary | 6.54% ± 1.07% | 4.63% ± 2.12% | 6.55% ± 3.84% | 7.53% ± 2.12% | 12.14% ± 1.01% | 1.86% ± 0.23% | 39,602 |
| 10 | 3-13 | GLM 5.2 (Max) | Z.ai · MIT · SiliconFlow | 4.37% ± 2.48% | 9.43% ± 4.52% | 14.88% ± 9.11% | 6.00% ± 4.50% | 1.69% ± 3.28% | 1.86% ± 0.23% | 12,593 |
| 11 | 10-13 | Claude Opus 4.8 | Anthropic · Proprietary | 3.60% ± 1.55% | 7.31% ± 2.45% | 12.53% ± 4.73% | 6.05% ± 2.49% | 8.01% ± 1.40% | 15.90% ± 4.11% | 25,291 |
| 12 | 10-13 | Claude Sonnet 4.6 | Anthropic · Proprietary | 3.22% ± 1.38% | 2.69% ± 3.03% | 2.54% ± 4.19% | 3.98% ± 2.73% | 10.12% ± 2.45% | 1.85% ± 0.23% | 29,777 |
| 13 | 10-13 | GLM 5.1 | Z.ai · MIT · SiliconFlow | 2.66% ± 1.14% | 3.42% ± 2.37% | 1.34% ± 3.99% | 1.24% ± 2.33% | 5.44% ± 1.09% | 1.86% ± 0.23% | 34,600 |
| 14 | 14-20 |  | DeepSeek · MIT · SiliconFlow | 0.10% ± 1.41% | 0.44% ± 3.07% | 0.07% ± 4.86% | 2.81% ± 3.04% | 2.93% ± 1.33% | 0.74% ± 0.32% | 28,526 |
| 15 | 14-19 |  | Google · Proprietary | 0.04% ± 1.00% | 1.98% ± 2.24% | 2.29% ± 3.28% | 0.23% ± 1.88% | 2.98% ± 1.31% | 1.69% ± 0.26% | 32,581 |
| 16 | 14-20 | Kimi K2.6 | Moonshot · Modified MIT · Fireworks | 0.50% ± 1.09% | 0.36% ± 2.28% | 1.70% ± 3.64% | 2.82% ± 2.20% | 0.22% ± 1.66% | 1.86% ± 0.23% | 36,113 |
| 17 | 14-20 |  | Google · Proprietary | 0.78% ± 0.94% | 0.09% ± 2.09% | 1.97% ± 3.02% | 2.23% ± 1.74% | 5.87% ± 1.52% | 1.80% ± 0.24% | 39,635 |
| 18 | 14-20 |  | DeepSeek · MIT · SiliconFlow | 1.18% ± 1.33% | 4.14% ± 2.56% | 1.50% ± 4.60% | 6.59% ± 2.76% | 2.42% ± 1.80% | 0.48% ± 0.38% | 34,762 |
| 19 | 14-23 | Kimi K2.7 Code | Moonshot · Modified MIT · Fireworks | 2.71% ± 2.39% | 3.82% ± 4.53% | 5.17% ± 7.84% | 12.25% ± 4.81% | 1.83% ± 4.94% | 1.86% ± 0.23% | 15,537 |
| 20 | 15-21 | Minimax M3 | MiniMax · Proprietary · Fireworks | 2.79% ± 1.70% | 2.30% ± 3.91% | 9.39% ± 5.50% | 7.62% ± 3.78% | 3.52% ± 1.40% | 1.86% ± 0.23% | 14,473 |
| 21 | 19-23 |  | Alibaba · Proprietary · Fireworks | 4.24% ± 1.20% | 2.84% ± 2.44% | 5.58% ± 4.08% | 8.94% ± 2.46% | 2.42% ± 1.62% | 1.41% ± 0.56% | 34,128 |
| 22 | 20-25 | Grok Build 0.1 | xAI · Proprietary | 6.20% ± 1.10% | 6.78% ± 2.57% | 11.49% ± 3.54% | 8.64% ± 2.24% | 1.65% ± 1.60% | 2.42% ± 0.47% | 29,414 |
| 23 | 22-26 | Grok 4.3 (High) | xAI · Proprietary | 7.21% ± 1.23% | 10.75% ± 2.83% | 15.89% ± 3.35% | 4.46% ± 2.25% | 4.44% ± 3.11% | 0.49% ± 0.55% | 17,867 |
| 24 | 22-26 |  | MiniMax · Modified MIT · Fireworks | 7.81% ± 1.05% | 13.51% ± 2.58% | 15.50% ± 3.20% | 8.77% ± 2.04% | 3.05% ± 1.71% | 1.79% ± 0.25% | 34,788 |
| 25 | 23-26 |  | Google · Proprietary | 8.47% ± 1.01% | 11.23% ± 2.19% | 13.73% ± 2.67% | 4.25% ± 1.78% | 14.72% ± 2.96% | 1.58% ± 0.30% | 39,787 |
| 26 | 20-27 | Nemotron 3 Ultra | Nvidia · OpenMDW-1.1 | 8.65% ± 4.12% | 4.39% ± 7.57% | 5.54% ± 14.18% | 22.66% ± 8.28% | 12.31% ± 7.55% | 1.63% ± 0.51% | 4,498 |
| 27 | 26-28 |  | Google · Apache 2.0 | 12.73% ± 1.97% | 5.87% ± 2.40% | 7.22% ± 3.58% | 6.20% ± 2.20% | 27.61% ± 6.38% | 16.73% ± 5.59% | 28,759 |
| 28 | 27-28 |  | xAI · Proprietary | 15.78% ± 1.46% | 12.19% ± 2.26% | 14.66% ± 2.76% | 4.01% ± 1.81% | 48.86% ± 5.94% | 0.79% ± 0.43% | 38,955 |

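
The page does not say how the rank range shown next to each rank (1 to 1 for the top row, 2 to 7 for the second) is derived. One reading that reproduces the ranges of the first thirteen rows treats each overall score and its ± margin as an interval: a model's best possible rank counts the models whose whole interval lies above its own, and its worst possible rank counts the models whose interval reaches above its lower bound. The Python sketch below assumes that convention; `rank_ranges` and `top13` are illustrative names, not part of the leaderboard.

```python
from typing import List, Tuple


def rank_ranges(rows: List[Tuple[float, float]]) -> List[Tuple[int, int]]:
    """Return (best_rank, worst_rank) for each (score, margin) pair."""
    bounds = [(score - margin, score + margin) for score, margin in rows]
    ranges = []
    for i, (low, high) in enumerate(bounds):
        # Models whose entire interval sits above ours are certainly ahead.
        surely_ahead = sum(
            1 for j, (low_j, _) in enumerate(bounds) if j != i and low_j > high
        )
        # Models whose interval reaches above our lower bound might be ahead.
        maybe_ahead = sum(
            1 for j, (_, high_j) in enumerate(bounds) if j != i and high_j > low
        )
        ranges.append((surely_ahead + 1, maybe_ahead + 1))
    return ranges


# Overall score and margin, in percentage points, for the first thirteen rows.
top13 = [
    (14.17, 1.54), (9.04, 1.19), (8.27, 1.73), (8.12, 1.51), (8.09, 1.48),
    (7.78, 1.07), (6.73, 1.03), (6.73, 1.42), (6.54, 1.07), (4.37, 2.48),
    (3.60, 1.55), (3.22, 1.38), (2.66, 1.14),
]

for rank, (best, worst) in enumerate(rank_ranges(top13), start=1):
    print(f"rank {rank}: {best}-{worst}")
```

The sketch ranks the thirteen rows only among themselves. Under that restriction it yields the same ranges the page lists for them, which is consistent with the interval reading but does not confirm it is the method actually used.
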
Signal Leaders
  1. Claude Fable 5 (High) gets users to confirm the task is done most often: 16.48% ± 2.55%
  2. Claude Fable 5 (High) draws the most positive responses relative to negative ones: 29.65% ± 5.72%
  3. Claude Fable 5 (High) lands user corrections best: 13.39% ± 2.94%
  4. GPT 5.5 (xHigh) recovers from failed commands with the fewest steps: 14.42% ± 1.24%
  5. GPT 5.5 (High) is least likely to hallucinate tools it doesn't have: 1.86% ± 0.23%

Confirmed Success

How often the model gets users to confirm the task is done.

  1. Claude Fable 5 (High): 16.48%
  2. Claude Opus 4.8 (Thinking): 10.75%
  3. GLM 5.2 (Max): 9.43%
  4. Claude Opus 4.8: 7.31%
  5. Claude Opus 4.6: 6.17%
  6. GPT 5.5 (High): 5.93%
  7. GPT 5.5: 4.75%
  8. GPT 5.4 (High): 4.63%
  9. Claude Opus 4.7: 4.63%
  10. Claude Opus 4.7 (Thinking): 4.51%
260,536 Sessions

Praise vs Complaint

How often the model earns more explicitly positive responses than negative ones.

  1. Claude Fable 5 (High): 29.65%
  2. GPT 5.5 (xHigh): 17.52%
  3. GLM 5.2 (Max): 14.88%
  4. Claude Opus 4.8 (Thinking): 14.36%
  5. Claude Opus 4.8: 12.53%
  6. Claude Opus 4.7 (Thinking): 12.40%
  7. GPT 5.5 (High): 10.90%
  8. Claude Opus 4.7: 10.73%
  9. GPT 5.5: 8.13%
  10. Claude Opus 4.6: 7.40%
89,416 Sessions

Steerability

How well the model lands user corrections when they push back.

  1. Claude Fable 5 (High): 13.39%
  2. Claude Opus 4.7: 11.69%
  3. Claude Opus 4.8 (Thinking): 10.48%
  4. Claude Opus 4.7 (Thinking): 8.66%
  5. Claude Opus 4.6: 8.13%
  6. GPT 5.5: 7.86%
  7. GPT 5.5 (High): 7.56%
  8. GPT 5.4 (High): 7.53%
  9. Claude Opus 4.8: 6.05%
  10. Claude Sonnet 4.6: 3.98%
153,020 Sessions

Bash Recovery

How quickly the model recovers when a command doesn't work.

  1. GPT 5.5 (xHigh): 14.42%
  2. Claude Opus 4.7 (Thinking): 13.08%
  3. GPT 5.5 (High): 12.64%
  4. GPT 5.4 (High): 12.14%
  5. Claude Opus 4.7: 11.74%
  6. GPT 5.5: 11.07%
  7. Claude Sonnet 4.6: 10.12%
  8. Claude Opus 4.6: 10.11%
  9. Claude Fable 5 (High): 9.46%
  10. Claude Opus 4.8 (Thinking): 9.08%
145,672 Sessions

Tool Hallucination

How much the model hallucinates tools it doesn't have.

  1. GPT 5.5 (High): 1.86%
  2. GLM 5.2 (Max): 1.86%
  3. GPT 5.5: 1.86%
  4. Claude Fable 5 (High): 1.86%
  5. Kimi K2.6: 1.86%
  6. GLM 5.1: 1.86%
  7. GPT 5.5 (xHigh): 1.86%
  8. Minimax M3: 1.86%
  9. Kimi K2.7 Code: 1.86%
  10. GPT 5.4 (High): 1.86%
573,205 Sessions
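
The page does not describe how a hallucinated tool call is detected, and the published scores are not raw rates. As a rough illustration of the underlying idea only, the Python sketch below assumes a session log that records which tool names were offered to the model and which names it tried to call. The function name and the tool names are hypothetical.

```python
from typing import Iterable, List, Set


def hallucinated_calls(offered: Set[str], called: Iterable[str]) -> List[str]:
    """Return the called tool names that were never offered to the model."""
    return [name for name in called if name not in offered]


# Hypothetical session: three tools offered, four calls attempted.
offered_tools = {"bash", "read_file", "write_file"}
session_calls = ["bash", "read_file", "web_search", "bash"]

bad_calls = hallucinated_calls(offered_tools, session_calls)
print(bad_calls)                            # calls to tools that do not exist
print(len(bad_calls) / len(session_calls))  # share of calls that were invalid
```
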

How the Agent Leaderboard works

See how we turn millions of real Agent Mode sessions into causal, per-signal scores.