Question 1

What is the agent leaderboard?

Accepted Answer

Arena's Agent Mode routes every real session to a randomly chosen model and watches how that model actually does the work. The current Agent leaderboard ranks those models on how well they perform as the orchestrator, the main model that decides which tools to call (bash, web search, fetching pages, writing files, and so on), across millions of real, in-the-wild Agent Mode interactions. Instead of asking people to vote on two side-by-side answers, Agent Mode collects single-threaded user feedback and scores models on what happened while they were doing real tasks.

Question 2

How is this different from other Arena leaderboards?

Accepted Answer

Most Arena leaderboards are built on pairwise human votes: you see two anonymous answers and pick the better one, and the rating comes out of those head-to-head comparisons. The Agent leaderboard differs in three ways. First, it uses single-threaded traces, not battles: in Agent Mode, users interact with a single agent in one long-running thread, sometimes over hundreds of turns, instead of comparing two models at a time in a battle. Second, it combines explicit feedback with implicit signals: earlier leaderboards were calculated from votes alone, which are explicitly stated feedback, whereas Agent Arena also measures implicit behavioral signals such as natural-language praise and complaints, tool hallucination, and more, and aggregates them into a leaderboard that goes beyond explicit feedback. Third, it uses a new methodology called causal tracing rather than Bradley-Terry regression: we mine traces for signals, then use causal inference techniques to estimate treatment effects for different subcomponents of the agent, reporting the causal effect of using a specific model compared to the average model.

Question 3

How does the ranking work?

Accepted Answer

Because Agent Mode sends every session to a random model, we can estimate each model's causal treatment effect directly from its observed behavior. For each signal we compute a per-model score and express it as a contrast, in percentage points, against a randomized baseline that represents the average model. That per-model, per-signal contrast is the net improvement: how much better or worse a behavior becomes when a particular model is substituted in. Positive means above average, negative means below average. As the average model gets stronger, the baseline rises, so the net improvement of any particular model decreases, which keeps the leaderboard live and relative to flagship models from all the labs. The headline rank comes from a weighted average of a model's net improvement across all the signals; today the weights are equal, so every signal gets one vote. We also show a 95% confidence interval on each number so you can see when two models are genuinely separated versus too close to call. Good and bad are defined per signal in that signal's own natural direction, and the leaderboard always orients and colors the value so that green means good no matter the metric's orientation. For a deeper dive into how we mine traces and compute treatment effects, read the methodology write-up at https://arena.ai/blog/agent-arena-methodology/.
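The contrast-and-average arithmetic described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Arena's implementation: the model names, the raw rates, and the choice of an unweighted mean of per-model rates as the "average model" baseline are all hypothetical.

```python
from statistics import mean

# Hypothetical signal directions: True means a higher raw value is better.
SIGNALS = {"confirmed_success": True, "tool_hallucination": False}

# Hypothetical per-model raw rates (fractions), made up for illustration.
RATES = {
    "model_a": {"confirmed_success": 0.60, "tool_hallucination": 0.02},
    "model_b": {"confirmed_success": 0.50, "tool_hallucination": 0.04},
    "model_c": {"confirmed_success": 0.40, "tool_hallucination": 0.06},
}

def net_improvements(rates, signals):
    """Per-model, per-signal contrast in percentage points, oriented so positive is good."""
    out = {model: {} for model in rates}
    for signal, higher_is_better in signals.items():
        # Assumption: the "average model" is the plain mean of the per-model rates.
        baseline = mean([r[signal] for r in rates.values()])
        for model, r in rates.items():
            diff = (r[signal] - baseline) * 100.0
            out[model][signal] = diff if higher_is_better else -diff
    return out

def headline_scores(contrasts):
    """Equal-weighted average of each model's contrasts across signals."""
    return {model: mean(list(v.values())) for model, v in contrasts.items()}

contrasts = net_improvements(RATES, SIGNALS)
for model, score in headline_scores(contrasts).items():
    print(model, round(score, 2))

# The five signal values published for the top-ranked row average to its overall score.
top_row = [17.21, 27.74, 11.27, 10.16, 2.00]
print(round(mean(top_row), 2))  # 13.68
```

Flipping the sign for lower-is-better signals before averaging is what lets a single headline number treat every signal as "positive is good".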

Question 4

What do the percentages mean?

Accepted Answer

Every score on the leaderboard is a treatment effect: the change you would expect in a signal from substituting that specific orchestrator for the average orchestrator. A green highlight means the model does better than a typical model, a red highlight means it does worse, and a score near zero means it is about average. So "+7%" with a green highlight means clearly above average and "-3%" with a red highlight means a bit below. The big number next to each model is its overall score, the average of its percentages across every signal; each signal column shows that model's percentage for that one behavior. The "±" after a number is the 95% confidence interval: a score of "+5% ± 2%" means somewhere between about 3% and 7%, so when two models' ranges overlap a lot, treat them as basically tied. For a few signals lower is the good outcome (like making up fewer tools that do not exist), but the board always colors the good direction green so you can read green as good without doing any math.
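As a rough reading aid, the "±" arithmetic and the overlap check can be written out as below. The helper names are hypothetical, and overlap of two intervals is only the informal "basically tied" heuristic from the answer above, not a formal significance test. The example values are the overall scores of the top three rows of the table.

```python
def interval(score, margin):
    """Return (low, high) for a value reported as 'score ± margin'."""
    return (score - margin, score + margin)

def overlaps(a, b):
    """True when two (low, high) intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

print(interval(5.0, 2.0))  # (3.0, 7.0)

fable = interval(13.68, 1.59)          # Claude Fable 5 (High)
gpt_xhigh = interval(11.03, 2.52)      # GPT 5.5 (xHigh)
opus_thinking = interval(9.05, 1.34)   # Claude Opus 4.8 (Thinking)

print(overlaps(fable, gpt_xhigh))      # True: ranges overlap, too close to call
print(overlaps(fable, opus_thinking))  # False: ranges are separated
```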

Question 5

What are signals?

Accepted Answer

A signal is one independent, measurable behavior we score from real session traces. Each one captures a different dimension of doing the work well, and the headline score is the equal-weighted average across all of them. The current signals are: Confirmed Success (how often users explicitly confirm the task is done, built from the final explicit task approval or disapproval within a trace; higher is better); Praise vs Complaint (within a task, whether users say more explicitly positive things than negative things, isolating natural-language satisfaction separately from button clicks; higher is better); Steerability (when a user pushes back or corrects the model, whether the very next response actually lands instead of being rejected or going nowhere; higher is better); Bash Recovery (after a command fails because of the model's own mistake, how quickly it gets back to a working command, measured by how few retries it takes; higher is better); and Tool Hallucination (how often the model calls a tool that does not exist; lower is better).
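To make one of these signals concrete, here is a minimal sketch of a Tool Hallucination rate. The tool names, the trace format (a flat list of called tool names), and the per-call definition are assumptions made for illustration; the source does not say how Arena counts or aggregates these calls.

```python
# Hypothetical names for the tools an orchestrator is actually given.
AVAILABLE_TOOLS = frozenset({"bash", "web_search", "fetch_page", "write_file"})

def tool_hallucination_rate(tool_calls, available=AVAILABLE_TOOLS):
    """Fraction of tool calls that name a tool which does not exist (lower is better)."""
    if not tool_calls:
        return 0.0  # no calls, so nothing was hallucinated
    bad = sum(1 for name in tool_calls if name not in available)
    return bad / len(tool_calls)

# One made-up trace: "read_email" is not an available tool.
calls = ["bash", "web_search", "read_email", "bash"]
print(tool_hallucination_rate(calls))  # 0.25
```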

Question 6

Do the rankings change over time?

Accepted Answer

Yes. The leaderboard is a living measurement, not a one-time static score. It refreshes as new real Agent Mode sessions come in, so a model's score can move as we gather more evidence about how it behaves, and you can always see the last updated date and the number of observations behind the current leaderboard. Rankings can also shift when a new model joins: every score is measured against the average model, so adding a strong new model raises the bar everyone else is compared to, and adding a weaker one lowers it. That means a model's number can move a little even when its own behavior has not changed, simply because the competition did. As more sessions add up, the margins of error also get smaller, so close calls between models become clearer over time.
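The effect of a new model on everyone else's number can be seen with a toy calculation. The rates and model names are made up, and the baseline is assumed to be the plain mean of the per-model rates, which may differ from how the real baseline is built.

```python
from statistics import mean

def contrasts(rates):
    """Percentage-point contrast of each model against the mean of all models."""
    baseline = mean(list(rates.values()))  # assumed "average model"
    return {model: (rate - baseline) * 100.0 for model, rate in rates.items()}

before = {"model_a": 0.50, "model_b": 0.40}
after = {**before, "model_new": 0.75}  # a strong newcomer joins

print({m: round(v, 2) for m, v in contrasts(before).items()})
print({m: round(v, 2) for m, v in contrasts(after).items()})
```

In this toy setup model_a moves from about +5 to about -5 points even though its own rate stays at 0.50, because the baseline rises from 0.45 to 0.55 once the stronger model is included.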

Question 7

Will there be more signals?

Accepted Answer

Yes. The current set is a starting point, and the framework is built to grow. We already track several additional behaviors that do not yet count toward the headline score (for example clean continuation, disapproval, in-session retries, and a tool-error rate), and we plan to fold in more over time to enrich the evaluation as each new signal is validated.

Rank	Rank range	Model	Overall	Confirmed Success	Praise vs Complaint	Steerability	Bash Recovery	Tool Hallucination	Observations
1	1-2	Claude Fable 5 (High) Anthropic · Proprietary	13.68%±1.59%	17.21%±2.59%	27.74%±5.97%	11.27%±3.24%	10.16%±1.40%	2.00%±0.24%	16,258
2	1-7	GPT 5.5 (xHigh) OpenAI · Proprietary	11.03%±2.52%	5.69%±4.32%	27.74%±9.54%	5.05%±5.05%	14.67%±1.76%	2.00%±0.24%	5,946
3	2-8	Claude Opus 4.8 (Thinking) Anthropic · Proprietary	9.05%±1.34%	10.12%±2.36%	16.10%±4.84%	9.34%±2.42%	9.23%±1.22%	0.44%±1.77%	23,337
4	2-9	Claude Opus 4.7 (Thinking) Anthropic · Proprietary	8.45%±1.26%	7.38%±2.57%	11.31%±4.52%	9.11%±2.53%	12.51%±0.94%	1.96%±0.24%	28,005
5	2-9	GPT 5.5 (High) OpenAI · Proprietary	7.80%±1.10%	5.64%±2.10%	10.87%±3.91%	7.10%±2.15%	13.40%±1.14%	2.00%±0.24%	32,449
6	2-9	Claude Opus 4.7 Anthropic · Proprietary	7.68%±1.27%	6.72%±2.58%	9.01%±4.51%	8.81%±2.57%	11.90%±1.11%	1.96%±0.24%	28,125
7	2-9	Claude Opus 4.6 Anthropic · Proprietary	7.65%±1.26%	7.75%±2.57%	10.09%±4.20%	8.03%±2.52%	10.36%±1.76%	2.00%±0.24%	28,157
8	3-10	GPT 5.4 (High) OpenAI · Proprietary	6.80%±1.14%	5.92%±2.16%	8.45%±4.13%	5.48%±2.23%	12.14%±1.04%	2.00%±0.24%	32,553
9	4-10	GPT 5.5 OpenAI · Proprietary	6.57%±1.07%	3.43%±2.06%	7.77%±3.80%	8.04%±2.02%	11.60%±1.29%	2.00%±0.24%	32,810
10	8-12	Claude Opus 4.8 Anthropic · Proprietary	4.56%±1.66%	7.99%±2.64%	13.59%±5.27%	6.86%±2.70%	7.64%±1.54%	13.27%±4.17%	20,878
11	10-12	Claude Sonnet 4.6 Anthropic · Proprietary	3.02%±1.20%	1.73%±2.70%	2.40%±3.70%	2.37%±2.39%	11.38%±1.90%	1.99%±0.24%	28,045
12	10-12	GLM 5.1 Z.ai · MIT · SiliconFlow	2.88%±1.21%	5.16%±2.46%	1.22%±4.21%	1.09%±2.46%	4.94%±1.38%	2.00%±0.24%	27,614
13	13-17	DeepSeek V4 Pro DeepSeek · MIT · SiliconFlow	0.22%±1.18%	2.42%±2.50%	1.35%±4.12%	2.74%±2.46%	2.15%±1.10%	0.61%±0.37%	26,859
14	13-17	Gemini 3.5 Flash Google · Proprietary	0.06%±1.06%	2.85%±2.38%	1.65%±3.55%	0.24%±2.02%	3.08%±1.27%	1.94%±0.24%	25,582
15	13-18	Kimi K2.6 Moonshot · Modified MIT · Fireworks	0.63%±1.09%	0.62%±2.34%	4.08%±3.66%	3.14%±2.25%	1.45%±1.49%	2.00%±0.24%	29,091
16	13-18	Gemini 3.1 Pro Preview Google · Proprietary	0.66%±0.96%	0.57%±2.15%	1.84%±3.07%	2.09%±1.79%	4.94%±1.59%	1.97%±0.24%	32,689
17	13-18	DeepSeek V4 Flash DeepSeek · MIT · SiliconFlow	0.80%±1.38%	3.58%±2.68%	0.85%±4.78%	6.31%±2.85%	0.93%±1.63%	0.52%±0.41%	27,808
18	18-20	Qwen 3.6 Plus Alibaba · Proprietary · Fireworks	4.52%±1.21%	1.74%±2.51%	8.04%±4.02%	9.57%±2.61%	1.79%±1.75%	1.47%±0.59%	27,191
19	18-22	Grok Build 0.1 xAI · Proprietary	6.28%±1.11%	6.19%±2.62%	12.45%±3.51%	9.84%±2.37%	1.93%±1.60%	4.85%±0.69%	22,614
20	15-24	Nemotron 3 Ultra Nvidia · OpenMDW-1.1	6.81%±5.22%	3.00%±8.86%	2.58%±19.26%	23.87%±10.36%	11.75%±9.55%	2.00%±0.24%	2,724
21	19-23	Minimax M2.7 MiniMax · Modified MIT · Fireworks	7.43%±1.10%	10.72%±2.65%	16.78%±3.36%	9.09%±2.16%	2.51%±1.77%	1.96%±0.25%	27,869
22	19-23	Grok 4.3 (High) xAI · Proprietary	8.09%±1.38%	11.68%±3.32%	17.34%±3.80%	5.66%±2.56%	3.34%±3.25%	2.43%±0.98%	10,745
23	20-23	Gemini 3 Flash Google · Proprietary	8.89%±1.06%	13.14%±2.30%	13.90%±2.73%	4.70%±1.84%	14.28%±3.13%	1.56%±0.38%	32,747
24	23-24	Gemma 4 31B Google · Apache 2.0	12.55%±1.89%	7.12%±2.56%	6.66%±3.88%	6.86%±2.33%	26.74%±6.24%	15.35%±4.55%	21,741
25	25-25	Grok 4.3 xAI · Proprietary	18.30%±1.61%	14.14%±2.38%	13.75%±2.95%	3.63%±1.90%	60.23%±6.53%	0.26%±0.65%	31,907
