Confirmed Success
How often the model gets users to confirm the task is done.
- Claude Opus 4.7 (Thinking): 7.86%
- Claude Opus 4.6: 7.29%
- GPT 5.4 (High): 7.20%
- GPT 5.5 (High): 6.13%
- Claude Opus 4.7: 5.17%
- GLM 5.1: 4.57%
- DeepSeek V4 Pro: 2.78%
- DeepSeek V4 Flash: 2.19%
- GPT 5.5: 1.85%
- Claude Sonnet 4.6: 1.34%

