Introducing Max

Today we are releasing Max, Arena's model router powered by our community’s 5+ million real-world votes. Max acts as an intelligent orchestrator—it routes each user prompt to the most capable model for that specific prompt. Through this, Max achieves top performance across all domains.

In today’s rapidly advancing AI landscape, models and providers are evolving to fill different niches: some models are great at coding, others are strong in math; some answer quickly, while others think longer. Max intelligently leverages the varying strength profiles of different models to produce a unified experience that is reliable across the full usage spectrum.
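
To make the routing idea concrete, here is a minimal sketch of a category-based router in Python. It is purely illustrative: the keyword classifier, the model names, and the per-category scores are hypothetical stand-ins, not Max's actual learned routing policy or its real inputs.

```python
# Illustrative only: Max's real router is learned from Arena votes, not
# keyword rules. Model names and scores below are hypothetical.

CATEGORY_SCORES = {
    "coding":  {"model-a": 1560, "model-b": 1520, "model-c": 1540},
    "math":    {"model-a": 1490, "model-b": 1485, "model-c": 1468},
    "general": {"model-a": 1500, "model-b": 1488, "model-c": 1468},
}

def classify(prompt: str) -> str:
    """Toy stand-in for a learned prompt classifier."""
    text = prompt.lower()
    if "def " in text or "traceback" in text:
        return "coding"
    if any(tok in text for tok in ("integral", "prove", "equation")):
        return "math"
    return "general"

def route(prompt: str) -> str:
    """Send the prompt to the model with the best score for its category."""
    scores = CATEGORY_SCORES[classify(prompt)]
    return max(scores, key=scores.get)

print(route("Fix this bug: def merge(xs, ys): ..."))  # -> model-a
```

In practice the interesting work is in the classifier and the score estimates; the final selection step is as simple as the sketch suggests.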

We recently deployed a Max version in Battle mode, codenamed theta-hat, which achieved #1 on the Arena Overall leaderboard with a score of 1500. This base version of Max is also #1 across all major categories, including Coding, Math, and Expert.

We also built a latency-aware version of Max that provides top-level performance while keeping response latency low. Our latest latency-aware Max, codenamed arcstride, achieved an Arena score of 1495 while reducing time-to-first-token latency by more than 16 seconds compared to the next-best model.
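
As a rough illustration of what latency-awareness can mean, the sketch below discounts each candidate's quality score by a per-second latency penalty and respects a time-to-first-token budget. The profiles, the penalty value, and the selection rule are assumptions for exposition, not a description of how Max actually weighs latency.

```python
# Hypothetical latency-aware selection rule; values and policy are
# illustrative, not Max's actual tradeoff.

MODEL_PROFILES = [
    # (name, quality_score, time_to_first_token_seconds) -- made-up values
    ("slow-strong", 1490, 19.7),
    ("fast-strong", 1476, 7.2),
    ("fast-light",  1471, 5.8),
]

def pick(max_ttft_s: float, penalty_per_s: float = 0.5) -> str:
    """Among models within the latency budget (or all, if none fit),
    pick the best score after subtracting a per-second latency penalty."""
    within = [m for m in MODEL_PROFILES if m[2] <= max_ttft_s] or MODEL_PROFILES
    return max(within, key=lambda m: m[1] - penalty_per_s * m[2])[0]

print(pick(max_ttft_s=8.0))   # -> fast-strong
print(pick(max_ttft_s=30.0))  # -> slow-strong
```

Tightening the budget or raising the penalty shifts selections toward faster models, which is the knob a latency-aware router effectively turns.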

Going forward, latency-aware Max will be our default experience in our Direct Chat mode. We plan on continually updating and improving Max over time. Max is now available at arena.ai/max.


Base Router Performance

Table 1: Max Arena Scores

Model Arena scores overall and across performance categories

| Category | Max (base) | gemini-3-pro | grok-4.1-thinking | gemini-3-flash | claude-opus-4-5-thinking-32k |
| Overall | 1500 | 1488 | 1476 | 1471 | 1468 |
| Expert | 1528 | 1501 | 1490 | 1494 | 1508 |
| Hard Prompts | 1527 | 1503 | 1487 | 1488 | 1501 |
| Coding | 1567 | 1519 | 1508 | 1505 | 1539 |
| Math | 1489 | 1485 | 1454 | 1470 | 1468 |
| Creative Writing | 1493 | 1491 | 1437 | 1462 | 1456 |
| Instruction Following | 1484 | 1472 | 1437 | 1453 | 1478 |
| Longer Query | 1503 | 1492 | 1449 | 1465 | 1494 |

Figure 1: Overall Routing Distribution

Top 10 models selected across all prompts for Max (base)

gemini-3-pro 37.68%
grok-4.1-thinking 15.35%
claude-opus-4-5-thinking-32k 15.05%
ernie-5.0-preview-1203 9.54%
gemini-2.5-pro 7.55%
claude-sonnet-4-5 3.13%
claude-opus-4-5 2.93%
gemini-3-flash 2.23%
claude-sonnet-4-5-thinking-32k 1.79%
ernie-5.0-0110 1.65%
Other 3.10%

Figure 2: Category-wise Routing Distribution Comparison

Top 5 routed models by category shown as stacked percentage bars for Max (base)

[Stacked-bar chart omitted. Per-category prompt counts: Hard 2,316; Instruction Following 1,120; Longer Query 990; Code 866; Creative Writing 640; Math 312; Expert 278. Models appearing in the top-5 breakdowns: gemini-3-pro, claude-opus-4-5-thinking-32k, grok-4.1-thinking, gemini-2.5-pro, claude-sonnet-4-5, claude-sonnet-4-5-thinking-32k, claude-opus-4-5, gpt-5.1-high, ernie-5.0-preview, plus an "other" bucket.]

Latency-Aware Router Performance

Table 2: Latency-Aware Router Arena Score vs. Latency

Performance and latency metrics in Battle mode

| Model | Score | TTFT (s) |
| Max (latency-aware) | 1495 | 3.44 |
| gemini-3-pro | 1488 | 19.72 |
| grok-4.1-thinking | 1476 | 7.19 |
| gemini-3-flash | 1471 | 5.83 |
| claude-opus-4-5-thinking-32k | 1468 | 11.58 |

Figure 3: Latency-Aware Router Provider Distribution

Model selection grouped by provider for Max (latency-aware)

Total selections: 25,882
Google 42.46%
xAI 33.58%
Anthropic 23.94%
Other 0.02%

In December 2025, we launched six experimental latency-aware versions of Max with various tradeoffs between speed and performance. Most of these versions sit on the Pareto frontier between latency and Arena score on the current leaderboard.
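
For readers who want to check a Pareto-frontier claim like this on their own data, here is a small sketch: a configuration is on the frontier if no other configuration is both at least as fast and at least as strong. The data points below are invented placeholders, not the actual leaderboard values.

```python
# Sketch of a (latency, score) Pareto-frontier check; values are illustrative.

def pareto_frontier(points):
    """points: list of (name, ttft_seconds, arena_score).
    Returns names not dominated on (lower TTFT, higher score)."""
    frontier = []
    for name, ttft, score in points:
        dominated = any(
            o_ttft <= ttft and o_score >= score and (o_ttft, o_score) != (ttft, score)
            for _, o_ttft, o_score in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

points = [
    ("router-fast", 3.4, 1480),      # made-up example values
    ("router-balanced", 5.0, 1492),
    ("model-x", 6.0, 1476),
]
print(pareto_frontier(points))  # -> ['router-fast', 'router-balanced']
```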

Figure 4: Arena Score vs. Time to First Token

Figure 5: Arena Score vs. End to End Generation Time


Say Hello to Max

Max provides our users with a single entry point where anyone can leverage the diverse skill sets of the latest cutting-edge LLMs in the most effective way possible. The latency-aware version we initially released balances speed and performance, offering a smooth, reliable, and helpful chat experience.


Appendix

Benchmarking Performance

We also ran Max on a variety of relevant static benchmarks. Max was not explicitly optimized for these benchmarks, yet it still achieves results on par with the current top models. Our latency-aware versions of Max also handle the accuracy-latency tradeoff effectively on these benchmarks (Figures 6-8).

Table 3: Benchmark Performance

Accuracy scores across major benchmarks

| Benchmark | Max (base) | gemini-3-pro | claude-opus-4-5-thinking-32k | gpt-5.2-high |
| HLE | 38.1% | 38.7% | 26.6% | 31.1% |
| GPQA Diamond | 91.0% | 90.5% | 84.9% | 87.7% |
| SimpleQA Verified | 70.4% | 72.0% | 40.8% | 35.4% |
| MMLU-Pro | 89.7% | 89.8% | 89.5% | 87.4% |
| MMMLU | 91.8% | 91.8% | 90.8% | 89.6% |
| AIME 2025 | 95.3% | 95.7% | 91.3% | 99.0% |

Latency Tradeoffs on Benchmark Performance

Figure 6: HLE: Latency vs. Accuracy Tradeoffs

Figure 7: GPQA Diamond: Latency vs. Accuracy Tradeoffs

Figure 8: SimpleQA Verified: Latency vs. Accuracy Tradeoffs