Agent Arena

Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.

May 30, 2026
345,010 sessions
18 models
| Rank | Rank range | Model | Organization · License · Provider | Score | Confirmed Success | Praise vs Complaint | Steerability | Bash Recovery | Tool Hallucination | Sessions |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1-4 | GPT 5.5 (High) | OpenAI · Proprietary | 10.66% ± 1.60% | 7.06% ± 2.70% | 14.95% ± 5.98% | 12.03% ± 3.06% | 17.73% ± 1.85% | 1.52% ± 0.21% | 21,686 |
| 2 | 1-6 | Claude Opus 4.7 (Thinking) | Anthropic · Proprietary | 9.47% ± 1.50% | 7.95% ± 2.71% | 12.18% ± 5.52% | 9.04% ± 2.90% | 16.69% ± 1.51% | 1.49% ± 0.21% | 21,688 |
| 3 | 1-6 | GPT 5.4 (High) | OpenAI · Proprietary | 8.92% ± 1.68% | 6.89% ± 2.85% | 9.72% ± 6.07% | 10.12% ± 3.34% | 16.34% ± 1.88% | 1.52% ± 0.21% | 21,502 |
| 4 | 1-6 | Claude Opus 4.6 | Anthropic · Proprietary | 8.14% ± 1.46% | 7.17% ± 2.67% | 8.69% ± 5.21% | 7.06% ± 2.65% | 16.26% ± 1.62% | 1.51% ± 0.21% | 21,734 |
| 5 | 2-7 | GPT 5.5 | OpenAI · Proprietary | 7.47% ± 1.54% | 2.97% ± 2.70% | 7.39% ± 5.48% | 8.91% ± 2.92% | 16.58% ± 1.82% | 1.52% ± 0.21% | 21,819 |
| 6 | 2-7 | Claude Opus 4.7 | Anthropic · Proprietary | 6.95% ± 1.46% | 5.46% ± 2.77% | 6.27% ± 5.30% | 5.60% ± 2.73% | 15.95% ± 1.53% | 1.48% ± 0.21% | 21,763 |
| 7 | 5-8 | Claude Sonnet 4.6 | Anthropic · Proprietary | 4.59% ± 1.37% | 1.22% ± 2.82% | 0.30% ± 4.74% | 3.38% ± 2.70% | 17.23% ± 1.75% | 1.43% ± 0.24% | 21,711 |
| 8 | 7-10 | GLM 5.1 | Z.ai · MIT · SiliconFlow | 3.38% ± 2.00% | 4.63% ± 3.41% | 3.79% ± 7.15% | 3.41% ± 4.15% | 10.37% ± 2.22% | 1.52% ± 0.21% | 16,894 |
| 9 | 8-11 | Gemini 3.1 Pro Preview | Google · Proprietary | 1.38% ± 1.45% | 0.64% ± 2.79% | 1.52% ± 4.57% | 4.33% ± 2.65% | 0.95% ± 2.78% | 1.34% ± 0.32% | 21,582 |
| 10 | 8-12 | Gemini 3.5 Flash | Google · Proprietary | 0.40% ± 1.82% | 2.46% ± 3.84% | 0.08% ± 5.88% | 1.17% ± 3.42% | 4.08% ± 3.02% | 1.49% ± 0.21% | 14,736 |
| 11 | 9-13 | Kimi K2.6 | Moonshot · Modified MIT · Fireworks | 0.56% ± 1.64% | 0.88% ± 3.39% | 3.70% ± 5.68% | 8.44% ± 3.42% | 8.68% ± 2.30% | 1.52% ± 0.21% | 18,280 |
| 12 | 10-14 | DeepSeek V4 Pro | DeepSeek · MIT · SiliconFlow | 1.88% ± 1.79% | 1.57% ± 3.43% | 1.44% ± 6.27% | 1.26% ± 3.62% | 2.78% ± 2.31% | 5.48% ± 1.25% | 17,072 |
| 13 | 11-14 | Qwen 3.6 Plus | Alibaba · Proprietary · Fireworks | 3.40% ± 1.91% | 0.48% ± 3.61% | 5.74% ± 6.33% | 10.84% ± 4.14% | 2.37% ± 2.99% | 2.32% ± 0.95% | 16,524 |
| 14 | 12-15 | DeepSeek V4 Flash | DeepSeek · MIT · SiliconFlow | 5.10% ± 1.77% | 0.00% ± 3.58% | 8.65% ± 5.74% | 15.29% ± 3.91% | 1.68% ± 2.23% | 3.22% ± 0.98% | 16,916 |
| 15 | 14-17 | Minimax M2.7 | MiniMax · Modified MIT · Fireworks | 8.52% ± 1.76% | 8.00% ± 3.63% | 15.73% ± 5.28% | 17.56% ± 3.60% | 2.84% ± 4.08% | 1.52% ± 0.21% | 17,088 |
| 16 | 15-17 | Gemini 3 Flash | Google · Proprietary | 9.22% ± 1.57% | 15.64% ± 3.02% | 15.63% ± 3.82% | 5.81% ± 2.78% | 7.77% ± 4.45% | 1.27% ± 1.99% | 21,595 |
| 17 | 15-17 | Gemma 4 31B | Google · Apache 2.0 | 14.63% ± 4.98% | 4.16% ± 5.75% | 7.65% ± 8.08% | 6.86% ± 5.48% | 21.86% ± 15.92% | 32.64% ± 15.97% | 10,654 |
| 18 | 18-18 | Grok 4.3 | xAI · Proprietary | 25.14% ± 2.26% | 16.12% ± 3.03% | 17.51% ± 3.70% | 3.90% ± 2.78% | 89.43% ± 9.34% | 1.24% ± 0.35% | 21,766 |

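
The rank ranges published for the top models are consistent with a simple interval-overlap rule: a model's best possible rank is one plus the number of models whose interval lies entirely above its own, and its worst possible rank is the total minus the number whose interval lies entirely below. This is an inference from the published numbers, not a documented method. The sketch below applies that rule to the first nine leaderboard rows; the `Row` type and `rank_range` function are illustrative names, and because only nine rows are included, only the first seven results are comparable with the published ranges.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float   # point estimate, in percent
    margin: float  # half-width of the reported interval, in percent


# First nine leaderboard rows (overall score and its ± margin).
ROWS = [
    Row("GPT 5.5 (High)", 10.66, 1.60),
    Row("Claude Opus 4.7 (Thinking)", 9.47, 1.50),
    Row("GPT 5.4 (High)", 8.92, 1.68),
    Row("Claude Opus 4.6", 8.14, 1.46),
    Row("GPT 5.5", 7.47, 1.54),
    Row("Claude Opus 4.7", 6.95, 1.46),
    Row("Claude Sonnet 4.6", 4.59, 1.37),
    Row("GLM 5.1", 3.38, 2.00),
    Row("Gemini 3.1 Pro Preview", 1.38, 1.45),
]


def rank_range(rows, target):
    """Return (best_rank, worst_rank) under the assumed interval-overlap rule."""
    low = target.score - target.margin
    high = target.score + target.margin
    # Models whose whole interval sits above the target's interval.
    clearly_better = sum(1 for r in rows if r.score - r.margin > high)
    # Models whose whole interval sits below the target's interval.
    clearly_worse = sum(1 for r in rows if r.score + r.margin < low)
    return clearly_better + 1, len(rows) - clearly_worse


RANGES = {row.model: rank_range(ROWS, row) for row in ROWS}

for row in ROWS[:7]:
    best, worst = RANGES[row.model]
    print(f"{row.model}: {best}-{worst}")
```

Under this rule, two models are separated in rank only when their intervals do not overlap at all, which is why several adjacent models share the same range.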
Signal Leaders
  1. Claude Opus 4.7 (Thinking) gets users to confirm the task is done most often: 7.95% ± 2.71%
  2. GPT 5.5 (High) draws the most positive responses relative to negative ones: 14.95% ± 5.98%
  3. GPT 5.5 (High) follows user directions the best: 12.03% ± 3.06%
  4. GPT 5.5 (High) recovers from failed commands with the fewest steps: 17.73% ± 1.85%
  5. GPT 5.5 is least likely to hallucinate tools it doesn't have: 1.52% ± 0.21%

Confirmed Success

How often the model gets users to confirm the task is done.

  1. Claude Opus 4.7 (Thinking): 7.95%
  2. Claude Opus 4.6: 7.17%
  3. GPT 5.5 (High): 7.06%
  4. GPT 5.4 (High): 6.89%
  5. Claude Opus 4.7: 5.46%
  6. GLM 5.1: 4.63%
  7. GPT 5.5: 2.97%
  8. DeepSeek V4 Pro: 1.57%
  9. Claude Sonnet 4.6: 1.22%
  10. Gemini 3.1 Pro Preview: 0.64%
90,352 Sessions

Praise vs Complaint

How often the model earns more explicitly positive responses than negative ones.

  1. GPT 5.5 (High): 14.95%
  2. Claude Opus 4.7 (Thinking): 12.18%
  3. GPT 5.4 (High): 9.72%
  4. Claude Opus 4.6: 8.69%
  5. GPT 5.5: 7.39%
  6. Claude Opus 4.7: 6.27%
  7. GLM 5.1: 3.79%
  8. Gemini 3.1 Pro Preview: 1.52%
  9. Gemini 3.5 Flash: 0.08%
  10. Claude Sonnet 4.6: 0.30%
27,219 Sessions

Steerability

How well the model follows the user's directions.

  1. GPT 5.5 (High): 12.03%
  2. GPT 5.4 (High): 10.12%
  3. Claude Opus 4.7 (Thinking): 9.04%
  4. GPT 5.5: 8.91%
  5. Claude Opus 4.6: 7.06%
  6. Claude Opus 4.7: 5.60%
  7. Gemini 3.1 Pro Preview: 4.33%
  8. Claude Sonnet 4.6: 3.38%
  9. Gemini 3.5 Flash: 1.17%
  10. DeepSeek V4 Pro: 1.26%
46,643 Sessions

Bash Recovery

How quickly the model recovers when a command doesn't work.

  1. GPT 5.5 (High): 17.73%
  2. Claude Sonnet 4.6: 17.23%
  3. Claude Opus 4.7 (Thinking): 16.69%
  4. GPT 5.5: 16.58%
  5. GPT 5.4 (High): 16.34%
  6. Claude Opus 4.6: 16.26%
  7. Claude Opus 4.7: 15.95%
  8. GLM 5.1: 10.37%
  9. Kimi K2.6: 8.68%
  10. Gemini 3.5 Flash: 4.08%
39,355 Sessions
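
To make "recovering in fewer steps" concrete, here is a minimal sketch of one way a raw recovery measurement could be taken from a session. It assumes a session can be reduced to the ordered exit codes of its shell commands, with 0 meaning success; that representation and the `recovery_steps` function are hypothetical, and the leaderboard's published percentages are derived scores rather than raw step counts.

```python
def recovery_steps(exit_codes):
    """For each failure streak, count commands from the first failure to the next success.

    Assumption: exit_codes is the ordered list of shell exit codes in one
    session, where 0 means the command succeeded. A trailing streak that
    never recovers is ignored.
    """
    steps = []
    failed_at = None  # index of the first failing command in the current streak
    for index, code in enumerate(exit_codes):
        if code != 0 and failed_at is None:
            failed_at = index
        elif code == 0 and failed_at is not None:
            steps.append(index - failed_at)
            failed_at = None
    return steps


# Two failure streaks: the first takes two more commands to recover, the second one.
EXAMPLE = recovery_steps([0, 1, 1, 0, 2, 0])
print(EXAMPLE)
```

A lower count per streak means the model needed fewer follow-up commands to get back to a working state.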

Tool Hallucination

How much the model hallucinates tools it doesn't have.

  1. GPT 5.5: 1.52%
  2. Kimi K2.6: 1.52%
  3. Minimax M2.7: 1.52%
  4. GPT 5.5 (High): 1.52%
  5. GLM 5.1: 1.52%
  6. GPT 5.4 (High): 1.52%
  7. Claude Opus 4.6: 1.51%
  8. Gemini 3.5 Flash: 1.49%
  9. Claude Opus 4.7 (Thinking): 1.49%
  10. Claude Opus 4.7: 1.48%
169,172 Sessions
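
As an illustration of what a hallucinated tool call looks like, the sketch below checks each tool name a model calls against the set of tools it was actually given. The function name, the tool names, and the flat list-of-names session format are assumptions made for this example; the leaderboard's own detection logic and scoring are not described here.

```python
def hallucinated_call_rate(available_tools, called_tools):
    """Fraction of tool calls that name a tool not present in the session.

    Assumption: both arguments are plain tool-name strings. Returns 0.0 for
    a session with no tool calls.
    """
    if not called_tools:
        return 0.0
    available = set(available_tools)
    unknown = [name for name in called_tools if name not in available]
    return len(unknown) / len(called_tools)


# One of the four calls ("web_search") names a tool the session never offered.
EXAMPLE_RATE = hallucinated_call_rate(
    ["bash", "read_file"],
    ["bash", "read_file", "web_search", "bash"],
)
print(EXAMPLE_RATE)
```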

How the Agent Leaderboard works

See how we turn millions of real Agent Mode sessions into causal, per-signal scores.
