Confirmed Success
How often the model gets users to confirm the task is done.
- Claude Fable 5 (High): 16.48%
- Claude Opus 4.8 (Thinking): 10.75%
- GLM 5.2 (Max): 9.43%
- Claude Opus 4.8: 7.31%
- Claude Opus 4.6: 6.17%
- GPT 5.5 (High): 5.93%
- GPT 5.5: 4.75%
- GPT 5.4 (High): 4.63%
- Claude Opus 4.7: 4.63%
- Claude Opus 4.7 (Thinking): 4.51%

