Confirmed Success
How often the model gets users to confirm the task is done.
- Claude Fable 5 (High): 16.27%
- Claude Opus 4.8 (Thinking): 10.65%
- GLM 5.2 (Max): 9.96%
- GPT 5.5 (High): 6.33%
- GPT 5.4 (High): 5.61%
- Claude Opus 4.8: 5.60%
- GPT 5.5 (xHigh): 5.13%
- Claude Opus 4.6: 5.11%
- Claude Opus 4.7: 5.00%
- GPT 5.5: 4.97%

