Agent Arena: Causal Evaluation of Agents in the Real World Agents are increasingly doing real work. The resulting task distribution has greatly expanded. We desire an agent evaluation that scales along with usage and capability.
Empowering Users to Get More Done With Agent Mode The future of AI is not single-modality chat; it is in powerful agentic capabilities. Today, we are excited to introduce Agent Mode, designed to help everyone from everyday users looking to get more done to entrepreneurs looking to maximize agentic efficacy across complex use cases. While traditional chat requires
New Categories for Web Development in Code Arena AI coding models are increasingly used to build web apps, but aggregated leaderboards obscure key performance differences. After analyzing 250k+ Code Arena prompts, we identified major front-end task categories and built new leaderboard views to compare model strengths and weaknesses.
Multimodal Max Text Arena Score vs. Time to First Token Note: displayed models were the top public models when the router was created. Search Arena Score vs. Time to First Token Note: displayed models were the top public models when the router was created. Vision Arena Score vs. Time to First Token
Arena Leaderboard Dataset For almost three years, Arena has been publishing leaderboards covering frontier AI capabilities across 10 arenas, dozens of categories, and hundreds of models, and today we're releasing the entire history of those leaderboards as a public-access dataset. We've watched how our leaderboards are cited, referenced,
March 2026: Arena Updates across Product, Leaderboard Rankings & Research March 2026 brought major updates to the Arena leaderboard, including new rankings across document, video, text, and code models. In this monthly roundup, we break down the best AI models, latest LLM benchmarks, and key trends shaping AI evaluation. Whether you're a daily voter or just checking in
Supporting Independent Research in AI Evaluation Arena’s Academic Partnerships Program provides funding and support for independent research advancing the scientific foundations of AI evaluation.