Agent Arena

Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.

Jun 8, 2026
479,210 sessions
20 models
| Rank | Rank range | Model | Score | Confirmed Success | Praise vs Complaint | Steerability | Bash Recovery | Tool Hallucination | Sessions |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1-5 | GPT 5.5 (High) (OpenAI · Proprietary) | 9.22% ±1.29% | 6.13% ±2.30% | 14.52% ±4.76% | 9.59% ±2.39% | 14.13% ±1.34% | 1.75% ±0.21% | 28,073 |
| 2 | 1-6 | Claude Opus 4.7 (Thinking) (Anthropic · Proprietary) | 8.26% ±1.21% | 7.86% ±2.25% | 9.12% ±4.42% | 9.21% ±2.28% | 13.44% ±1.04% | 1.69% ±0.22% | 26,957 |
| 3 | 1-6 | Claude Opus 4.6 (Anthropic · Proprietary) | 7.91% ±1.22% | 7.29% ±2.27% | 11.87% ±4.32% | 6.87% ±2.16% | 11.77% ±1.36% | 1.75% ±0.21% | 27,089 |
| 4 | 1-6 | GPT 5.4 (High) (OpenAI · Proprietary) | 7.79% ±1.34% | 7.20% ±2.36% | 9.11% ±4.97% | 8.01% ±2.71% | 12.90% ±1.34% | 1.75% ±0.21% | 27,985 |
| 5 | 1-6 | GPT 5.5 (OpenAI · Proprietary) | 7.68% ±1.29% | 1.85% ±2.30% | 13.60% ±4.71% | 8.11% ±2.31% | 13.12% ±1.36% | 1.75% ±0.21% | 28,349 |
| 6 | 2-6 | Claude Opus 4.7 (Anthropic · Proprietary) | 6.48% ±1.25% | 5.17% ±2.33% | 7.75% ±4.53% | 6.04% ±2.30% | 11.73% ±1.23% | 1.69% ±0.21% | 27,115 |
| 7 | 7-8 | Claude Sonnet 4.6 (Anthropic · Proprietary) | 3.37% ±1.13% | 1.34% ±2.38% | 1.96% ±3.74% | 3.54% ±2.10% | 12.20% ±2.02% | 1.73% ±0.21% | 27,001 |
| 8 | 7-10 | GLM 5.1 (Z.ai · MIT · SiliconFlow) | 1.87% ±1.39% | 4.57% ±2.78% | 0.14% ±4.88% | 3.01% ±2.91% | 6.18% ±1.50% | 1.75% ±0.21% | 23,118 |
| 9 | 8-13 | DeepSeek V4 Pro (DeepSeek · MIT · SiliconFlow) | 0.36% ±1.39% | 2.78% ±2.69% | 1.66% ±4.94% | 3.91% ±2.88% | 1.16% ±1.26% | 0.18% ±0.42% | 23,447 |
| 10 | 8-13 | Gemini 3.5 Flash (Google · Proprietary) | 0.39% ±1.24% | 2.02% ±2.65% | 2.64% ±4.12% | 1.77% ±2.37% | 2.79% ±1.62% | 1.71% ±0.21% | 21,048 |
| 11 | 9-13 | Gemini 3.1 Pro Preview (Google · Proprietary) | 0.81% ±1.13% | 0.53% ±2.39% | 1.97% ±3.67% | 1.35% ±2.08% | 4.56% ±2.01% | 1.65% ±0.24% | 28,209 |
| 12 | 9-13 | Kimi K2.6 (Moonshot · Modified MIT · Fireworks) | 1.15% ±1.26% | 0.10% ±2.57% | 5.48% ±4.26% | 3.91% ±2.68% | 1.77% ±1.75% | 1.75% ±0.21% | 24,578 |
| 13 | 9-14 | DeepSeek V4 Flash (DeepSeek · MIT · SiliconFlow) | 1.43% ±1.61% | 2.19% ±3.00% | 2.25% ±5.62% | 7.86% ±3.29% | 0.99% ±1.77% | 0.20% ±0.49% | 23,281 |
| 14 | 13-15 | Qwen 3.6 Plus (Alibaba · Proprietary · Fireworks) | 4.01% ±1.43% | 1.39% ±2.81% | 5.90% ±5.00% | 10.96% ±3.07% | 0.08% ±1.75% | 1.88% ±0.64% | 22,784 |
| 15 | 14-15 | Grok Build 0.1 (xAI · Proprietary) | 5.31% ±1.26% | 6.33% ±2.92% | 15.85% ±3.92% | 7.00% ±2.87% | 6.15% ±1.57% | 3.53% ±0.64% | 18,031 |
| 16 | 16-18 | Minimax M2.7 (MiniMax · Modified MIT · Fireworks) | 8.39% ±1.24% | 10.80% ±2.96% | 20.06% ±3.74% | 10.05% ±2.61% | 2.77% ±2.05% | 1.75% ±0.21% | 23,322 |
| 17 | 16-18 | Grok 4.3 (High) (xAI · Proprietary) | 9.45% ±2.22% | 15.85% ±5.07% | 16.61% ±6.10% | 9.30% ±4.23% | 3.87% ±5.11% | 1.62% ±1.52% | 6,179 |
| 18 | 16-18 | Gemini 3 Flash (Google · Proprietary) | 9.47% ±1.23% | 13.68% ±2.55% | 14.49% ±3.12% | 6.41% ±2.16% | 13.69% ±3.58% | 0.91% ±0.61% | 28,191 |
| 19 | 19-19 | Gemma 4 31B (Google · Apache 2.0) | 14.89% ±2.40% | 9.30% ±3.03% | 11.50% ±4.34% | 7.34% ±2.85% | 30.32% ±8.61% | 15.99% ±5.55% | 17,115 |
| 20 | 20-20 | Grok 4.3 (xAI · Proprietary) | 23.31% ±2.03% | 13.52% ±2.63% | 14.30% ±3.55% | 6.17% ±2.18% | 83.23% ±8.53% | 0.65% ±0.58% | 27,338 |

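
The pair of rank bounds listed beside each model (for example, 1 to 5 for GPT 5.5 (High)) appears consistent with a common interval-overlap rule. Under that rule, a model's best possible rank counts only the models whose entire interval lies above its own, and its worst possible rank counts every model whose interval reaches above its lower bound. The sketch below is an illustration under that assumption, not the leaderboard's documented method. The names `Entry` and `rank_ranges` are hypothetical, and only the overall scores of the top six models are used.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    score: float       # overall score, in percentage points
    half_width: float  # the value printed after the ± sign


def rank_ranges(entries: list[Entry]) -> dict[str, tuple[int, int]]:
    """Return (best_rank, worst_rank) per model from interval overlap.

    Assumed rule, not confirmed by the source:
      best  = 1 + number of models whose lower bound exceeds this upper bound
      worst = 1 + number of models whose upper bound exceeds this lower bound
    """
    ranges = {}
    for e in entries:
        lo, hi = e.score - e.half_width, e.score + e.half_width
        others = [o for o in entries if o is not e]
        surely_better = sum(1 for o in others if o.score - o.half_width > hi)
        possibly_better = sum(1 for o in others if o.score + o.half_width > lo)
        ranges[e.name] = (1 + surely_better, 1 + possibly_better)
    return ranges


# Overall scores and ± values of the top six models, copied from the leaderboard.
TOP_SIX = [
    Entry("GPT 5.5 (High)", 9.22, 1.29),
    Entry("Claude Opus 4.7 (Thinking)", 8.26, 1.21),
    Entry("Claude Opus 4.6", 7.91, 1.22),
    Entry("GPT 5.4 (High)", 7.79, 1.34),
    Entry("GPT 5.5", 7.68, 1.29),
    Entry("Claude Opus 4.7", 6.48, 1.25),
]

if __name__ == "__main__":
    for name, (best, worst) in rank_ranges(TOP_SIX).items():
        print(f"{name}: {best}-{worst}")
```

For these six models the sketch gives 1-5, 1-6, 1-6, 1-6, 1-6, and 2-6, which matches the listed bounds. Lower-ranked models are left out because their bounds depend on entries outside this subset.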
Signal Leaders
  1. Claude Opus 4.7 (Thinking) gets users to confirm the task is done most often: 7.86% ±2.25%
  2. GPT 5.5 (High) draws the most positive responses relative to negative ones: 14.52% ±4.76%
  3. GPT 5.5 (High) follows user directions the best: 9.59% ±2.39%
  4. GPT 5.5 (High) recovers from failed commands with the fewest steps: 14.13% ±1.34%
  5. GPT 5.5 is least likely to hallucinate tools it doesn't have: 1.75% ±0.21%

Confirmed Success

How often the model gets users to confirm the task is done.

  1. Claude Opus 4.7 (Thinking): 7.86%
  2. Claude Opus 4.6: 7.29%
  3. GPT 5.4 (High): 7.20%
  4. GPT 5.5 (High): 6.13%
  5. Claude Opus 4.7: 5.17%
  6. GLM 5.1: 4.57%
  7. DeepSeek V4 Pro: 2.78%
  8. DeepSeek V4 Flash: 2.19%
  9. GPT 5.5: 1.85%
  10. Claude Sonnet 4.6: 1.34%
156,803 Sessions

Praise vs Complaint

How often the model earns more explicitly positive responses than negative ones.

  1. GPT 5.5 (High): 14.52%
  2. GPT 5.5: 13.60%
  3. Claude Opus 4.6: 11.87%
  4. Claude Opus 4.7 (Thinking): 9.12%
  5. GPT 5.4 (High): 9.11%
  6. Claude Opus 4.7: 7.75%
  7. GLM 5.1: 0.14%
  8. DeepSeek V4 Pro: 1.66%
  9. Claude Sonnet 4.6: 1.96%
  10. Gemini 3.1 Pro Preview: 1.97%
53,044 Sessions

Steerability

How well the model follows the user's directions.

  1. GPT 5.5 (High): 9.59%
  2. Claude Opus 4.7 (Thinking): 9.21%
  3. GPT 5.5: 8.11%
  4. GPT 5.4 (High): 8.01%
  5. Claude Opus 4.6: 6.87%
  6. Claude Opus 4.7: 6.04%
  7. Claude Sonnet 4.6: 3.54%
  8. Gemini 3.1 Pro Preview: 1.35%
  9. Gemini 3.5 Flash: 1.77%
  10. GLM 5.1: 3.01%
91,203 Sessions

Bash Recovery

How quickly the model recovers when a command doesn't work.

  1. GPT 5.5 (High): 14.13%
  2. Claude Opus 4.7 (Thinking): 13.44%
  3. GPT 5.5: 13.12%
  4. GPT 5.4 (High): 12.90%
  5. Claude Sonnet 4.6: 12.20%
  6. Claude Opus 4.6: 11.77%
  7. Claude Opus 4.7: 11.73%
  8. GLM 5.1: 6.18%
  9. Grok Build 0.1: 6.15%
  10. Gemini 3.5 Flash: 2.79%
88,574 Sessions

Tool Hallucination

How often the model hallucinates tools it doesn't have.

  1. GPT 5.5: 1.75%
  2. GPT 5.5 (High): 1.75%
  3. GLM 5.1: 1.75%
  4. GPT 5.4 (High): 1.75%
  5. Minimax M2.7: 1.75%
  6. Kimi K2.6: 1.75%
  7. Claude Opus 4.6: 1.75%
  8. Claude Sonnet 4.6: 1.73%
  9. Gemini 3.5 Flash: 1.71%
  10. Claude Opus 4.7: 1.69%
347,671 Sessions

How the Agent Leaderboard works

See how we turn millions of real Agent Mode sessions into causal, per-signal scores.
