Confirmed Success
How often the model gets users to confirm the task is done.
- Claude Fable 5 (High): 17.21%
- Claude Opus 4.8 (Thinking): 10.12%
- Claude Opus 4.8: 7.99%
- Claude Opus 4.6: 7.75%
- Claude Opus 4.7 (Thinking): 7.38%
- Claude Opus 4.7: 6.72%
- GPT 5.4 (High): 5.92%
- GPT 5.5 (xHigh): 5.69%
- GPT 5.5 (High): 5.64%
- GLM 5.1: 5.16%

