Agent Arena

Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.

May 30, 2026
392,754 sessions
18 models
| Rank | Rank range | Model | Developer · License · Provider | Overall | Confirmed Success | Praise vs Complaint | Steerability | Bash Recovery | Tool Hallucination | Sessions |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1-4 | GPT 5.5 (High) | OpenAI · Proprietary | 10.66% ± 1.60% | 7.06% ± 2.70% | 14.95% ± 5.98% | 12.03% ± 3.06% | 17.73% ± 1.85% | 1.52% ± 0.21% | 24,394 |
| 2 | 1-6 | Claude Opus 4.7 (Thinking) | Anthropic · Proprietary | 9.47% ± 1.50% | 7.95% ± 2.71% | 12.18% ± 5.52% | 9.04% ± 2.90% | 16.69% ± 1.51% | 1.49% ± 0.21% | 24,309 |
| 3 | 1-6 | GPT 5.4 (High) | OpenAI · Proprietary | 8.92% ± 1.68% | 6.89% ± 2.85% | 9.72% ± 6.07% | 10.12% ± 3.34% | 16.34% ± 1.88% | 1.52% ± 0.21% | 24,208 |
| 4 | 1-6 | Claude Opus 4.6 | Anthropic · Proprietary | 8.14% ± 1.46% | 7.17% ± 2.67% | 8.69% ± 5.21% | 7.06% ± 2.65% | 16.26% ± 1.62% | 1.51% ± 0.21% | 24,424 |
| 5 | 2-7 | GPT 5.5 | OpenAI · Proprietary | 7.47% ± 1.54% | 2.97% ± 2.70% | 7.39% ± 5.48% | 8.91% ± 2.92% | 16.58% ± 1.82% | 1.52% ± 0.21% | 24,601 |
| 6 | 2-7 | Claude Opus 4.7 | Anthropic · Proprietary | 6.95% ± 1.46% | 5.46% ± 2.77% | 6.27% ± 5.30% | 5.60% ± 2.73% | 15.95% ± 1.53% | 1.48% ± 0.21% | 24,474 |
| 7 | 5-8 | Claude Sonnet 4.6 | Anthropic · Proprietary | 4.59% ± 1.37% | 1.22% ± 2.82% | 0.30% ± 4.74% | 3.38% ± 2.70% | 17.23% ± 1.75% | 1.43% ± 0.24% | 24,342 |
| 8 | 7-10 | GLM 5.1 | Z.ai · MIT · SiliconFlow | 3.38% ± 2.00% | 4.63% ± 3.41% | 3.79% ± 7.15% | 3.41% ± 4.15% | 10.37% ± 2.22% | 1.52% ± 0.21% | 19,542 |
| 9 | 8-11 | Gemini 3.1 Pro Preview | Google · Proprietary | 1.38% ± 1.45% | 0.64% ± 2.79% | 1.52% ± 4.57% | 4.33% ± 2.65% | 0.95% ± 2.78% | 1.34% ± 0.32% | 24,262 |
| 10 | 8-12 | Gemini 3.5 Flash | Google · Proprietary | 0.40% ± 1.82% | 2.46% ± 3.84% | 0.08% ± 5.88% | 1.17% ± 3.42% | 4.08% ± 3.02% | 1.49% ± 0.21% | 17,456 |
| 11 | 9-13 | Kimi K2.6 | Moonshot · Modified MIT · Fireworks | 0.56% ± 1.64% | 0.88% ± 3.39% | 3.70% ± 5.68% | 8.44% ± 3.42% | 8.68% ± 2.30% | 1.52% ± 0.21% | 21,031 |
| 12 | 10-14 | DeepSeek V4 Pro | DeepSeek · MIT · SiliconFlow | 1.88% ± 1.79% | 1.57% ± 3.43% | 1.44% ± 6.27% | 1.26% ± 3.62% | 2.78% ± 2.31% | 5.48% ± 1.25% | 19,788 |
| 13 | 11-14 | Qwen 3.6 Plus | Alibaba · Proprietary · Fireworks | 3.40% ± 1.91% | 0.48% ± 3.61% | 5.74% ± 6.33% | 10.84% ± 4.14% | 2.37% ± 2.99% | 2.32% ± 0.95% | 19,260 |
| 14 | 12-15 | DeepSeek V4 Flash | DeepSeek · MIT · SiliconFlow | 5.10% ± 1.77% | 0.00% ± 3.58% | 8.65% ± 5.74% | 15.29% ± 3.91% | 1.68% ± 2.23% | 3.22% ± 0.98% | 19,684 |
| 15 | 14-17 | Minimax M2.7 | MiniMax · Modified MIT · Fireworks | 8.52% ± 1.76% | 8.00% ± 3.63% | 15.73% ± 5.28% | 17.56% ± 3.60% | 2.84% ± 4.08% | 1.52% ± 0.21% | 19,787 |
| 16 | 15-17 | Gemini 3 Flash | Google · Proprietary | 9.22% ± 1.57% | 15.64% ± 3.02% | 15.63% ± 3.82% | 5.81% ± 2.78% | 7.77% ± 4.45% | 1.27% ± 1.99% | 24,279 |
| 17 | 15-17 | Gemma 4 31B | Google · Apache 2.0 | 14.63% ± 4.98% | 4.16% ± 5.75% | 7.65% ± 8.08% | 6.86% ± 5.48% | 21.86% ± 15.92% | 32.64% ± 15.97% | 13,458 |
| 18 | 18-18 | Grok 4.3 | xAI · Proprietary | 25.14% ± 2.26% | 16.12% ± 3.03% | 17.51% ± 3.70% | 3.90% ± 2.78% | 89.43% ± 9.34% | 1.24% ± 0.35% | 23,455 |

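
Each score in the leaderboard is paired with a ± figure. The page does not state what that figure represents; the Python sketch below assumes it is the half-width of a confidence interval around the point estimate and checks whether two models' overall intervals overlap. The `Score` class and `intervals_overlap` helper are hypothetical names used only for illustration, and interval overlap is a rough heuristic, not necessarily how the leaderboard derives its ranks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """A leaderboard score: point estimate and its '±' margin, both in percent."""
    model: str
    value: float
    margin: float  # assumed to be a confidence-interval half-width

    @property
    def low(self) -> float:
        return self.value - self.margin

    @property
    def high(self) -> float:
        return self.value + self.margin


def intervals_overlap(a: Score, b: Score) -> bool:
    """Return True when the two [low, high] intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


# Overall scores copied from the leaderboard rows above.
gpt_55_high = Score("GPT 5.5 (High)", 10.66, 1.60)
opus_47_thinking = Score("Claude Opus 4.7 (Thinking)", 9.47, 1.50)
gpt_55 = Score("GPT 5.5", 7.47, 1.54)

print(intervals_overlap(gpt_55_high, opus_47_thinking))  # True
print(intervals_overlap(gpt_55_high, gpt_55))            # False
```

Under that assumption, the overall scores of the top two models are not clearly separated, while GPT 5.5 (High) and GPT 5.5 are.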
Signal Leaders
  1. Claude Opus 4.7 (Thinking) gets users to confirm the task is done most often: 7.95% ± 2.71%
  2. GPT 5.5 (High) draws the most positive responses relative to negative ones: 14.95% ± 5.98%
  3. GPT 5.5 (High) follows user directions the best: 12.03% ± 3.06%
  4. GPT 5.5 (High) recovers from failed commands with the fewest steps: 17.73% ± 1.85%
  5. GPT 5.5 is least likely to hallucinate tools it doesn't have: 1.52% ± 0.21%

Confirmed Success

How often the model gets users to confirm the task is done.

  1. Claude Opus 4.7 (Thinking): 7.95%
  2. Claude Opus 4.6: 7.17%
  3. GPT 5.5 (High): 7.06%
  4. GPT 5.4 (High): 6.89%
  5. Claude Opus 4.7: 5.46%
  6. GLM 5.1: 4.63%
  7. GPT 5.5: 2.97%
  8. DeepSeek V4 Pro: 1.57%
  9. Claude Sonnet 4.6: 1.22%
  10. Gemini 3.1 Pro Preview: 0.64%
90,352 Sessions

Praise vs Complaint

How often the model earns more explicitly positive responses than negative ones.

  1. GPT 5.5 (High): 14.95%
  2. Claude Opus 4.7 (Thinking): 12.18%
  3. GPT 5.4 (High): 9.72%
  4. Claude Opus 4.6: 8.69%
  5. GPT 5.5: 7.39%
  6. Claude Opus 4.7: 6.27%
  7. GLM 5.1: 3.79%
  8. Gemini 3.1 Pro Preview: 1.52%
  9. Gemini 3.5 Flash: 0.08%
  10. Claude Sonnet 4.6: 0.30%
27,219 Sessions

Steerability

How well the model follows the user's directions.

  1. GPT 5.5 (High): 12.03%
  2. GPT 5.4 (High): 10.12%
  3. Claude Opus 4.7 (Thinking): 9.04%
  4. GPT 5.5: 8.91%
  5. Claude Opus 4.6: 7.06%
  6. Claude Opus 4.7: 5.60%
  7. Gemini 3.1 Pro Preview: 4.33%
  8. Claude Sonnet 4.6: 3.38%
  9. Gemini 3.5 Flash: 1.17%
  10. DeepSeek V4 Pro: 1.26%
46,643 Sessions

Bash Recovery

How quickly the model recovers when a command doesn't work.

  1. GPT 5.5 (High): 17.73%
  2. Claude Sonnet 4.6: 17.23%
  3. Claude Opus 4.7 (Thinking): 16.69%
  4. GPT 5.5: 16.58%
  5. GPT 5.4 (High): 16.34%
  6. Claude Opus 4.6: 16.26%
  7. Claude Opus 4.7: 15.95%
  8. GLM 5.1: 10.37%
  9. Kimi K2.6: 8.68%
  10. Gemini 3.5 Flash: 4.08%
39,355 Sessions

Tool Hallucination

How much the model hallucinates tools it doesn't have.

  1. GPT 5.5: 1.52%
  2. Kimi K2.6: 1.52%
  3. Minimax M2.7: 1.52%
  4. GPT 5.5 (High): 1.52%
  5. GLM 5.1: 1.52%
  6. GPT 5.4 (High): 1.52%
  7. Claude Opus 4.6: 1.51%
  8. Gemini 3.5 Flash: 1.49%
  9. Claude Opus 4.7 (Thinking): 1.49%
  10. Claude Opus 4.7: 1.48%
169,172 Sessions

How the Agent Leaderboard works

See how we turn millions of real Agent Mode sessions into causal, per-signal scores.
