Introducing Max

Today we are releasing Max, Arena's model router powered by our community’s 5+ million real-world votes. Max acts as an intelligent orchestrator—it routes each user prompt to the most capable model for that specific prompt. Through this, Max achieves top performance across all domains.

In today’s rapidly advancing AI landscape, models and providers are evolving to fill different niches: some models are great at coding, others are strong in math; some answer quickly, while others think longer. Max intelligently leverages the varying strength profiles of different models to produce a unified experience that is reliable across the full usage spectrum.
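
To make the routing idea concrete, here is a minimal sketch of a category-based router in Python. It is purely illustrative: the keyword classifier, the model names, and the per-category scores are hypothetical stand-ins, not Max's actual learned routing policy or its real inputs.

```python
# Illustrative only: Max's real router is learned from Arena votes, not
# keyword rules. Model names and scores below are hypothetical.

CATEGORY_SCORES = {
    "coding":  {"model-a": 1560, "model-b": 1520, "model-c": 1540},
    "math":    {"model-a": 1490, "model-b": 1485, "model-c": 1468},
    "general": {"model-a": 1500, "model-b": 1488, "model-c": 1468},
}

def classify(prompt: str) -> str:
    """Toy stand-in for a learned prompt classifier."""
    text = prompt.lower()
    if "def " in text or "traceback" in text:
        return "coding"
    if any(tok in text for tok in ("integral", "prove", "equation")):
        return "math"
    return "general"

def route(prompt: str) -> str:
    """Send the prompt to the model with the best score for its category."""
    scores = CATEGORY_SCORES[classify(prompt)]
    return max(scores, key=scores.get)

print(route("Fix this bug: def merge(xs, ys): ..."))  # -> model-a
```

In practice the interesting work is in the classifier and the score estimates; the final selection step is as simple as the sketch suggests.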

We recently deployed a Max version in Battle mode, codenamed theta-hat, which achieved #1 on the Arena Overall leaderboard with a score of 1500. This base version of Max is also #1 across all major categories, including Coding, Math, and Expert.

We also built a latency-aware version of Max that provides top-level performance while keeping response latency low. Our latest latency-aware Max, codenamed arcstride, achieved an Arena score of 1495 while reducing time-to-first-token latency by more than 16 seconds compared to the next-best model.
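
As a rough illustration of what latency-awareness can mean, the sketch below discounts each candidate's quality score by a per-second latency penalty and respects a time-to-first-token budget. The profiles, the penalty value, and the selection rule are assumptions for exposition, not a description of how Max actually weighs latency.

```python
# Hypothetical latency-aware selection rule; values and policy are
# illustrative, not Max's actual tradeoff.

MODEL_PROFILES = [
    # (name, quality_score, time_to_first_token_seconds) -- made-up values
    ("slow-strong", 1490, 19.7),
    ("fast-strong", 1476, 7.2),
    ("fast-light",  1471, 5.8),
]

def pick(max_ttft_s: float, penalty_per_s: float = 0.5) -> str:
    """Among models within the latency budget (or all, if none fit),
    pick the best score after subtracting a per-second latency penalty."""
    within = [m for m in MODEL_PROFILES if m[2] <= max_ttft_s] or MODEL_PROFILES
    return max(within, key=lambda m: m[1] - penalty_per_s * m[2])[0]

print(pick(max_ttft_s=8.0))   # -> fast-strong
print(pick(max_ttft_s=30.0))  # -> slow-strong
```

Tightening the budget or raising the penalty shifts selections toward faster models, which is the knob a latency-aware router effectively turns.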

Going forward, latency-aware Max will be our default experience in our Direct Chat mode. We plan on continually updating and improving Max over time. Max is now available at arena.ai/max.


Base Router Performance

Table 1: Max Arena Scores

Model Arena scores overall and across performance categories

| Category | Max (base) | gemini-3-pro | grok-4.1-thinking | gemini-3-flash | claude-opus-4-5-thinking-32k |
| Overall | 1500 | 1488 | 1476 | 1471 | 1468 |
| Expert | 1528 | 1501 | 1490 | 1494 | 1508 |
| Hard Prompts | 1527 | 1503 | 1487 | 1488 | 1501 |
| Coding | 1567 | 1519 | 1508 | 1505 | 1539 |
| Math | 1489 | 1485 | 1454 | 1470 | 1468 |
| Creative Writing | 1493 | 1491 | 1437 | 1462 | 1456 |
| Instruction Following | 1484 | 1472 | 1437 | 1453 | 1478 |
| Longer Query | 1503 | 1492 | 1449 | 1465 | 1494 |

Figure 1: Overall Routing Distribution

Top 10 models selected across all prompts for Max (base)

gemini-3-pro 37.68%
grok-4.1-thinking 15.35%
claude-opus-4-5-thinking-32k 15.05%
ernie-5.0-preview-1203 9.54%
gemini-2.5-pro 7.55%
claude-sonnet-4-5 3.13%
claude-opus-4-5 2.93%
gemini-3-flash 2.23%
claude-sonnet-4-5-thinking-32k 1.79%
ernie-5.0-0110 1.65%
Other 3.10%

Figure 2: Category-wise Routing Distribution Comparison

Top 5 routed models by category shown as stacked percentage bars for Max (base)

[Stacked-bar chart omitted. Per-category prompt counts: Hard 2,316; Instruction Following 1,120; Longer Query 990; Code 866; Creative Writing 640; Math 312; Expert 278. Models appearing in the top-5 breakdowns: gemini-3-pro, claude-opus-4-5-thinking-32k, grok-4.1-thinking, gemini-2.5-pro, claude-sonnet-4-5, claude-sonnet-4-5-thinking-32k, claude-opus-4-5, gpt-5.1-high, ernie-5.0-preview, plus an "other" bucket.]

Latency-Aware Router Performance

Table 2: Latency-Aware Router Arena Score vs. Latency

Performance and latency metrics in Battle mode

| Model | Score | TTFT (s) |
| Max (latency-aware) | 1495 | 3.44 |
| gemini-3-pro | 1488 | 19.72 |
| grok-4.1-thinking | 1476 | 7.19 |
| gemini-3-flash | 1471 | 5.83 |
| claude-opus-4-5-thinking-32k | 1468 | 11.58 |

Figure 3: Latency-Aware Router Provider Distribution

Model selection grouped by provider for Max (latency-aware)

Total selections: 25,882
Google 42.46%
xAI 33.58%
Anthropic 23.94%
Other 0.02%

In December 2025, we launched six experimental latency-aware versions of Max with various tradeoffs between speed and performance. Most of these versions sit on the Pareto frontier between latency and Arena score on the current leaderboard.
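
For readers who want to check a Pareto-frontier claim like this on their own data, here is a small sketch: a configuration is on the frontier if no other configuration is both at least as fast and at least as strong. The data points below are invented placeholders, not the actual leaderboard values.

```python
# Sketch of a (latency, score) Pareto-frontier check; values are illustrative.

def pareto_frontier(points):
    """points: list of (name, ttft_seconds, arena_score).
    Returns names not dominated on (lower TTFT, higher score)."""
    frontier = []
    for name, ttft, score in points:
        dominated = any(
            o_ttft <= ttft and o_score >= score and (o_ttft, o_score) != (ttft, score)
            for _, o_ttft, o_score in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

points = [
    ("router-fast", 3.4, 1480),      # made-up example values
    ("router-balanced", 5.0, 1492),
    ("model-x", 6.0, 1476),
]
print(pareto_frontier(points))  # -> ['router-fast', 'router-balanced']
```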

Figure 4: Arena Score vs. Time to First Token

Figure 5: Arena Score vs. End to End Generation Time


Say Hello to Max

Max provides our users with a single entry point where anyone can leverage the diverse skill sets of the latest cutting-edge LLMs in the most effective way possible. The latency-aware version we initially released balances speed and performance, offering a smooth, reliable, and helpful chat experience.


Appendix

Benchmarking Performance

We also ran Max on a variety of relevant static benchmarks. Max was not explicitly optimized for these benchmarks, yet it still achieves results on par with the current top models. Our latency-aware versions of Max also handle the accuracy-latency tradeoff effectively on these benchmarks (Figures 6-8).

Table 3: Benchmark Performance

Accuracy scores across major benchmarks

| Benchmark | Max (base) | gemini-3-pro | claude-opus-4-5-thinking-32k | gpt-5.2-high |
| HLE | 38.1% | 38.7% | 26.6% | 31.1% |
| GPQA Diamond | 91.0% | 90.5% | 84.9% | 87.7% |
| SimpleQA Verified | 70.4% | 72.0% | 40.8% | 35.4% |
| MMLU-Pro | 89.7% | 89.8% | 89.5% | 87.4% |
| MMMLU | 91.8% | 91.8% | 90.8% | 89.6% |
| AIME 2025 | 95.3% | 95.7% | 91.3% | 99.0% |

Latency Tradeoffs on Benchmark Performance

Figure 6: HLE: Latency vs. Accuracy Tradeoffs

Figure 7: GPQA Diamond: Latency vs. Accuracy Tradeoffs

Figure 8: SimpleQA Verified: Latency vs. Accuracy Tradeoffs