Confirmed Success
How often the model gets users to confirm the task is done.
- Claude Opus 4.7 (Thinking): 7.95%
- Claude Opus 4.6: 7.17%
- GPT 5.5 (High): 7.06%
- GPT 5.4 (High): 6.89%
- Claude Opus 4.7: 5.46%
- GLM 5.1: 4.63%
- GPT 5.5: 2.97%
- DeepSeek V4 Pro: 1.57%
- Claude Sonnet 4.6: 1.22%
- Gemini 3.1 Pro Preview: 0.64%

