Question 1

What is the agent leaderboard?

Accepted Answer

Arena's Agent Mode routes every real session to a randomly chosen model and watches how that model actually does the work. The current Agent leaderboard ranks those models on how well they perform as the orchestrator, the main model that decides which tools to call (bash, web search, fetching pages, writing files, and so on), across millions of real, in-the-wild Agent Mode interactions. Instead of asking people to vote on two side-by-side answers, Agent Mode collects single-threaded user feedback and scores models on what happened while they were doing real tasks.

Question 2

How is this different from other Arena leaderboards?

Accepted Answer

Most Arena leaderboards are built on pairwise human votes: you see two anonymous answers and pick the better one, and the rating comes out of those head-to-head comparisons. The Agent leaderboard differs in three ways. First, it uses single-threaded traces, not battles: in Agent Mode, users interact with a single agent in a long-running thread, sometimes over hundreds of turns, rather than with two models at a time in a battle. Second, it combines explicit feedback with implicit signals: earlier leaderboards were calculated from votes alone, which are explicitly stated feedback, whereas Agent Arena also measures implicit behavioral signals such as natural language praise and complaints, tool hallucination, and more, and aggregates them into a leaderboard that goes beyond explicit feedback. Third, it uses a new methodology called causal tracing instead of Bradley-Terry regression: we mine traces for signals and then use causal inference techniques to estimate treatment effects for different subcomponents of the agent, reporting the causal effect of using a specific model compared to the average model.

Question 3

How does the ranking work?

Accepted Answer

Because Agent Mode sends every session to a random model, we can infer a model's causal treatment effect by observing its behavior. For each signal we compute a per-model score and express it as a contrast, in percentage points, against a randomized baseline that represents the average model. That per-model, per-signal contrast is the net improvement: how much better or worse a behavior becomes when a particular model is substituted in. Positive means above average, negative means below average. As models get stronger overall, the baseline rises, so the net improvement of any particular model shrinks, which keeps the leaderboard live and relative to flagship models from all the labs. The headline score, which determines the rank, is a weighted average of a model's net improvement across all the signals (today equally weighted, so every signal gets one vote). We also show a 95% confidence interval on each number so you can see when two models are genuinely separated versus too close to call. Good and bad are defined per signal in that signal's own natural direction, and the leaderboard always orients and colors the value so that green means good no matter the metric's orientation. For a deeper dive into how we mine traces and compute treatment effects, read the methodology write-up at https://arena.ai/blog/agent-arena-methodology/.
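The arithmetic described above can be sketched in a few lines of Python. This is a simplified illustration, not Arena's causal tracing pipeline: it assumes the baseline is the plain mean of the per-model rates, it flips the sign for lower-is-better signals so that positive always means good, and every model name and number in it is made up.

```python
from statistics import mean

# Hypothetical per-model rates in percent; the values are invented for illustration.
rates = {
    "model_a": {"confirmed_success": 62.0, "tool_hallucination": 1.0},
    "model_b": {"confirmed_success": 50.0, "tool_hallucination": 3.0},
    "model_c": {"confirmed_success": 44.0, "tool_hallucination": 5.0},
}
# Natural direction of each signal (assumption: lower is better for hallucination).
HIGHER_IS_BETTER = {"confirmed_success": True, "tool_hallucination": False}


def net_improvement(rates, higher_is_better):
    """Contrast each model against the average model, in percentage points."""
    signals = list(higher_is_better)
    baseline = {s: mean(r[s] for r in rates.values()) for s in signals}
    result = {}
    for model, r in rates.items():
        result[model] = {}
        for s in signals:
            diff = r[s] - baseline[s]
            # Orient the value so that positive always means "better than average".
            result[model][s] = diff if higher_is_better[s] else -diff
    return result


def headline(contrasts):
    """Equal-weighted average of the oriented contrasts across all signals."""
    return {model: mean(c.values()) for model, c in contrasts.items()}


contrasts = net_improvement(rates, HIGHER_IS_BETTER)
scores = headline(contrasts)
for model, value in scores.items():
    print(f"{model}: {value:+.1f}")
# model_a: +6.0, model_b: -1.0, model_c: -5.0
```

Because every contrast is taken against the same baseline, the oriented contrasts for each signal sum to zero across models in this sketch, which is why a score near zero reads as "about average".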

Question 4

What do the percentages mean?

Accepted Answer

Every score on the leaderboard is a treatment effect: the improvement, in percentage points, that you would get on a signal by substituting a specific orchestrator for the average orchestrator. A green highlight means the model does better than a typical model, a red highlight means it does worse, and a score near zero means it is about average. So "+7%" with a green highlight means clearly above average and "-3%" with a red highlight means a bit below. The big number next to each model is its overall score, the average of its percentages across every signal. Each signal column shows that model's percentage for that one behavior. The little "±" after a number is the 95% confidence interval: a score of "+5% ± 2%" really means somewhere around 3% to 7%, so when two models' ranges overlap a lot, treat them as basically tied. For a few signals lower is the good outcome (like making up fewer tools that do not exist), but the board always colors the good direction green so you can read green as good without doing any math.
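As a small illustration of reading the "±" ranges, the sketch below turns a score and its margin into an interval and checks whether two intervals overlap. The helper names are hypothetical, and interval overlap is only a rough reading aid, not a formal significance test. The three overall scores are taken from the leaderboard table on this page.

```python
def interval(score, margin):
    """Return the (low, high) range implied by 'score ± margin'."""
    return (score - margin, score + margin)


def overlaps(a, b):
    """True when two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# The example from the answer above: +5% ± 2% is roughly 3% to 7%.
example = interval(5.0, 2.0)

# Overall scores as listed in the leaderboard table.
fable = interval(13.94, 1.56)      # Claude Fable 5 (High)
sol = interval(10.94, 3.76)        # GPT 5.6 Sol (xHigh)
gpt55_high = interval(7.16, 0.75)  # GPT 5.5 (High)

print(overlaps(fable, sol))         # True: ranges overlap, too close to call
print(overlaps(fable, gpt55_high))  # False: ranges are clearly separated
```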

Question 5

What are signals?

Accepted Answer

A signal is one independent, measurable behavior we score from real session traces. Each one captures a different dimension of doing the work well, and the headline score is the equal-weighted average across all of them. The current signals are:

Confirmed Success: how often users explicitly confirm the task is done, built from the final explicit task approval or disapproval within a trace. Higher is better.

Praise vs Complaint: within a task, whether users say more explicitly positive things than negative things, isolating natural language satisfaction from button clicks. Higher is better.

Steerability: when a user pushes back or corrects the model, whether the very next response actually lands instead of being rejected or going nowhere. Higher is better.

Bash Recovery: after a command fails because of the model's own mistake, how quickly the model gets back to a working command, with fewer retries scoring higher. Higher is better.

Tool Hallucination: how often the model calls a tool that does not exist. Lower is better.
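To make one of these signals concrete, here is a minimal sketch of how a Tool Hallucination rate could be computed for a single trace. The trace representation (a flat list of called tool names) and the tool names themselves are assumptions made for illustration; the actual trace format and scoring rules are not described here.

```python
# Hypothetical set of tools the orchestrator is actually given.
AVAILABLE_TOOLS = {"bash", "web_search", "fetch_page", "write_file"}


def tool_hallucination_rate(tool_calls, available=AVAILABLE_TOOLS):
    """Percentage of tool calls that name a tool which does not exist."""
    if not tool_calls:
        return 0.0  # no calls, so nothing could have been hallucinated
    hallucinated = sum(1 for name in tool_calls if name not in available)
    return 100.0 * hallucinated / len(tool_calls)


# One made-up trace: "read_pdf" is not in the available set.
trace_calls = ["bash", "web_search", "read_pdf", "bash"]
rate = tool_hallucination_rate(trace_calls)
print(rate)  # 25.0
```

Since lower is better for this signal, a leaderboard built on a rate like this would flip its sign before coloring, so that fewer hallucinated calls still shows up as green.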

Question 6

Do the rankings change over time?

Accepted Answer

Yes. The leaderboard is a living measurement, not a one-time static score. It refreshes as new real Agent Mode sessions come in, so a model's score can move as we gather more evidence about how it behaves, and you can always see the last updated date and the number of observations behind the current leaderboard. Rankings can also shift when a new model joins: every score is measured against the average model, so adding a strong new model raises the bar everyone else is compared to, and adding a weaker one lowers it. That means a model's number can move a little even when its own behavior has not changed, simply because the competition did. As more sessions add up, the margins of error also get smaller, so close calls between models become clearer over time.
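Both effects can be shown with a toy calculation. The sketch below assumes the baseline is the plain mean of per-model rates and uses the normal approximation for a proportion's 95% margin of error. Both are simplifications chosen for illustration rather than the leaderboard's actual estimator, and all rates are made up.

```python
from math import sqrt
from statistics import mean


def contrasts(rates):
    """Each model's rate minus the average model's rate, in percentage points."""
    baseline = mean(rates.values())
    return {model: rate - baseline for model, rate in rates.items()}


def margin_95(p, n):
    """Approximate 95% margin of error, in percentage points, for rate p over n sessions."""
    return 1.96 * sqrt(p * (1 - p) / n) * 100


# Hypothetical success rates in percent before and after a strong model joins.
before = contrasts({"a": 60.0, "b": 50.0, "c": 40.0})
after = contrasts({"a": 60.0, "b": 50.0, "c": 40.0, "d": 70.0})

# Model "a" did not change, yet its score drops because the baseline rose.
print(before["a"], after["a"])  # 10.0 5.0

# Quadrupling the number of sessions roughly halves the margin of error.
print(round(margin_95(0.5, 10_000), 2))  # 0.98
print(round(margin_95(0.5, 40_000), 2))  # 0.49
```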

Question 7

Will there be more signals?

Accepted Answer

Yes. The current set is a starting point, and the framework is built to grow. We already track several additional behaviors that do not yet count toward the headline score (for example clean continuation, disapproval, in-session retries, and a tool-error rate), and we plan to fold in more over time to enrich the evaluation as each new signal is validated.

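Rank	Model	Score	Confirmed Success	Praise vs Complaint	Steerability	Bash Recovery	Tool Hallucination	Observations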
1 12	Claude Fable 5 (High) Anthropic · Proprietary	13.94%±1.56%	17.27%±2.75%	30.65%±5.67%	12.07%±2.94%	8.39%±1.81%	1.33%±0.15%	16,059
2 19	GPT 5.6 Sol (xHigh) OpenAI · Proprietary	10.94%±3.76%	10.93%±3.73%	17.64%±7.48%	17.26%±16.35%	7.53%±1.68%	1.33%±0.15%	7,881
3 27	Claude Opus 4.8 (Thinking) Anthropic · Proprietary	9.28%±1.35%	7.82%±2.63%	17.96%±4.91%	10.15%±2.48%	9.91%±1.00%	0.56%±0.71%	33,392
4 28	GPT 5.5 (xHigh) OpenAI · Proprietary	8.26%±0.87%	6.88%±1.71%	12.61%±3.14%	6.70%±1.70%	13.79%±0.77%	1.33%±0.15%	36,289
5 213	Claude Sonnet 5 (High) Anthropic · Proprietary	8.00%±1.71%	10.33%±3.39%	13.90%±6.50%	4.92%±3.24%	9.62%±0.88%	1.21%±0.16%	23,640
6 213	Claude Opus 4.7 (Thinking) Anthropic · Proprietary	7.73%±1.22%	5.22%±2.56%	11.36%±4.35%	9.17%±2.31%	11.70%±1.16%	1.22%±0.17%	34,357
7 213	Claude Opus 4.7 Anthropic · Proprietary	7.63%±1.22%	5.28%±2.54%	13.38%±4.39%	7.56%±2.27%	10.66%±1.32%	1.27%±0.16%	34,950
8 313	GPT 5.5 (High) OpenAI · Proprietary	7.16%±0.75%	6.42%±1.47%	7.61%±2.61%	8.72%±1.42%	11.75%±1.00%	1.33%±0.15%	61,455
9 414	GLM 5.2 (Max) Z.ai · MIT · SiliconFlow	6.24%±1.10%	8.72%±2.15%	12.59%±4.01%	4.10%±2.03%	4.45%±1.06%	1.33%±0.15%	31,993
10 514	GPT 5.5 OpenAI · Proprietary	6.05%±0.72%	4.33%±1.46%	5.53%±2.51%	7.99%±1.33%	11.06%±0.81%	1.33%±0.15%	62,335
11 514	Claude Opus 4.6 Anthropic · Proprietary	5.88%±1.23%	1.67%±2.66%	9.05%±4.17%	6.70%±2.25%	10.63%±1.33%	1.33%±0.15%	34,075
12 514	GPT 5.4 (High) OpenAI · Proprietary	5.86%±0.74%	6.15%±1.49%	4.40%±2.58%	7.88%±1.42%	9.55%±0.89%	1.33%±0.15%	61,743
13 514	Grok 4.5 SpaceXAI · Proprietary	4.99%±1.87%	6.95%±3.21%	6.09%±5.58%	0.88%±6.12%	9.72%±1.34%	1.33%±0.15%	9,851
14 915	Claude Opus 4.8 Anthropic · Proprietary	3.90%±1.54%	6.75%±2.72%	11.17%±4.77%	7.79%±2.58%	9.44%±1.22%	15.63%±3.62%	31,463
15 1418	Claude Sonnet 4.6 Anthropic · Proprietary	1.99%±1.12%	1.95%±2.64%	0.75%±3.63%	0.64%±2.13%	10.71%±1.39%	1.30%±0.16%	34,873
16 1520	GLM 5.1 Z.ai · MIT · SiliconFlow	1.19%±0.82%	0.78%±1.80%	0.53%±2.83%	0.62%±1.68%	3.95%±0.86%	1.33%±0.15%	51,362
17 1526	Muse Spark 1.1 Meta · Proprietary	0.17%±2.04%	2.06%±3.70%	0.61%±5.26%	7.21%±6.54%	4.04%±3.44%	1.33%±0.15%	9,262
18 1626	Qwen3.7 Plus Alibaba · Proprietary	0.61%±1.38%	1.10%±3.18%	4.04%±4.31%	2.73%±3.04%	4.65%±1.73%	0.19%±0.68%	10,048
19 1724	Gemini 3.1 Pro Preview Google · Proprietary	0.70%±0.69%	0.63%±1.53%	0.07%±2.19%	1.85%±1.26%	7.21%±1.15%	1.29%±0.15%	61,459
20 1526	Kimi K2.7 Code Moonshot · Modified MIT	0.76%±2.01%	4.51%±4.03%	1.38%±7.14%	10.78%±4.26%	0.23%±2.20%	1.33%±0.15%	7,468
21 1626	Qwen3.7 Max Alibaba · Proprietary	0.81%±1.39%	2.25%±3.30%	5.38%±4.50%	4.16%±2.97%	6.84%±1.63%	0.91%±0.28%	9,944
22 1726	DeepSeek V4 Pro DeepSeek · MIT	1.11%±1.36%	2.95%±3.30%	4.79%±4.46%	3.10%±2.94%	4.50%±1.19%	0.80%±0.34%	10,335
23 1726	Gemini 3.5 Flash (High) Google · Proprietary	1.12%±0.77%	1.66%±1.74%	4.58%±2.42%	0.80%±1.43%	0.71%±1.29%	1.17%±0.35%	45,991
24 1727	Kimi K2.6 Moonshot · Modified MIT	2.09%±2.09%	2.36%±3.98%	0.20%±6.64%	9.16%±4.32%	4.79%±4.00%	1.33%±0.15%	7,620
25 1827	Mimo V2.5 Pro Xiaomi · MIT	3.17%±1.40%	6.00%±3.39%	9.22%±4.41%	4.81%±3.16%	3.55%±1.76%	0.63%±0.32%	10,379
26 1827	Minimax M3 MiniMax · MiniMax Community License	3.23%±1.36%	7.51%±3.38%	8.13%±4.29%	7.92%±3.02%	6.27%±0.98%	1.14%±0.20%	9,913
27 2428	DeepSeek V4 Flash DeepSeek · MIT	4.64%±1.30%	6.86%±3.44%	12.90%±3.90%	6.99%±2.89%	4.10%±1.17%	0.53%±0.48%	10,051
28 2731	Gemini 3.5 Flash (Medium) Google · Proprietary	7.04%±1.79%	11.84%±4.22%	8.07%±5.27%	10.43%±3.28%	5.90%±4.10%	1.04%±0.37%	7,041
29 2831	Grok Build 0.1 SpaceXAI · Proprietary	8.21%±0.79%	5.27%±1.79%	13.65%±2.36%	12.84%±1.64%	9.83%±1.59%	0.53%±0.17%	52,802
30 2831	Grok 4.3 (High) SpaceXAI · Proprietary	8.31%±0.80%	8.81%±1.76%	15.63%±1.99%	7.97%±1.38%	10.06%±2.27%	0.92%±0.17%	41,665
31 2831	Gemini 3 Flash Google · Proprietary	8.54%±0.77%	8.46%±1.58%	12.83%±1.90%	5.24%±1.27%	16.46%±2.09%	0.28%±1.26%	62,100
32 3234	Minimax M2.7 MiniMax · Modified MIT	11.46%±1.63%	16.97%±3.84%	15.68%±4.65%	17.75%±3.43%	7.94%±3.22%	1.06%±0.26%	10,183
33 3235	Gemma 4 31B Google · Apache 2.0	13.58%±1.42%	3.54%±1.68%	6.03%±2.40%	5.33%±1.50%	30.16%±4.55%	22.86%±4.41%	51,231
34 3235	Nemotron 3 Ultra Nvidia · OpenMDW-1.1	13.68%±2.57%	13.69%±5.33%	11.10%±7.85%	23.99%±5.09%	19.12%±5.92%	0.51%±0.81%	9,511
35 3335	Grok 4.3 SpaceXAI · Proprietary	15.31%±1.01%	10.12%±1.62%	16.64%±1.86%	8.96%±1.24%	41.90%±4.11%	1.10%±0.17%	61,497
