From Live Data to High-Quality Benchmarks - The Arena-Hard Pipeline
Authors
Tianle Li*
Wei-Lin Chiang*
Evan Frick
Lisa Dunlap
Banghua Zhu
Joseph E. Gonzalez
Ion Stoica
Building an affordable and reliable benchmark for LLM chatbots has become a critical challenge. A high-quality benchmark should 1) robustly separate models by capability, 2) reflect human preference in real-world use cases, and 3) frequently