Refusals
October 30, 2024
As language models become increasingly central to AI product development, understanding when and why they refuse to engage can reveal insights into both their capabilities and limitations.
Our analysis across multiple leading models and prompt categories shows distinct variations in refusal behavior, with implications for model selection and application design.
For an in-depth overview of our evaluation methods and insights, please see our analysis post.
Leaderboards
Refusal Rates
| Prompt Category | GPT-4o | o1-mini | o1-preview | Claude 3.5 Sonnet | Claude 3.5 Sonnet (new) |
|---|---|---|---|---|---|
| Overall | 0.0% | 5.8% | 6.5% | 9.5% | 2.8% |
| Adaptive Reasoning Under Uncertainty | 0.0% | 2.0% | 4.0% | 4.0% | 2.0% |
| Analogical Reasoning and Transfer | 0.0% | 0.0% | 0.0% | 4.0% | 0.0% |
| Multi-Step Problem Decomposition | 0.0% | 8.0% | 8.0% | 6.0% | 4.0% |
| Temporal Reasoning and Sequencing | 0.0% | 2.0% | 2.0% | 0.0% | 0.0% |
| Bias and Fallacy Recognition | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% |
| Cognitive Diversity Simulation | 0.0% | 4.0% | 6.0% | 8.0% | 6.0% |
| Recursive Improvement Analysis | 0.0% | 20.0% | 22.0% | 36.0% | 8.0% |
| Self-Reflection and Awareness | 0.0% | 10.0% | 10.0% | 16.0% | 2.0% |
Refusal & Hedge Rates
| Prompt Category | GPT-4o | o1-mini | o1-preview | Claude 3.5 Sonnet | Claude 3.5 Sonnet (new) |
|---|---|---|---|---|---|
| Overall | 0.8% | 6.0% | 6.5% | 12.5% | 11.3% |
| Adaptive Reasoning Under Uncertainty | 0.0% | 2.0% | 4.0% | 4.0% | 6.0% |
| Analogical Reasoning and Transfer | 0.0% | 0.0% | 0.0% | 4.0% | 6.0% |
| Multi-Step Problem Decomposition | 2.0% | 8.0% | 8.0% | 6.0% | 6.0% |
| Temporal Reasoning and Sequencing | 0.0% | 2.0% | 2.0% | 0.0% | 4.0% |
| Bias and Fallacy Recognition | 0.0% | 0.0% | 0.0% | 2.0% | 2.0% |
| Cognitive Diversity Simulation | 0.0% | 4.0% | 6.0% | 8.0% | 8.0% |
| Recursive Improvement Analysis | 4.0% | 22.0% | 22.0% | 48.0% | 24.0% |
| Self-Reflection and Awareness | 0.0% | 10.0% | 10.0% | 28.0% | 34.0% |
Understanding Refusals
Language models decline to engage with prompts in two primary ways, illustrated in the sketch after this list:
- Direct refusals: Explicit statements like "I cannot help with that request," and
- Hedged responses: Indirect avoidance through statements like "I cannot provide specific advice, but..."
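To make the distinction concrete, here is a minimal sketch of how responses might be bucketed automatically. It is purely illustrative, not the grader used in this evaluation: the phrase lists and the `classify_response` helper are hypothetical, and a real grader would need a much richer rubric (or a model-based judge) than keyword matching.

```python
import re

# Hypothetical phrase lists, for illustration only.
REFUSAL_PATTERNS = [
    r"\bI (?:cannot|can't|won't) help with\b",
    r"\bI am (?:unable|not able) to (?:assist|help)\b",
]
HEDGE_PATTERNS = [
    r"\bI (?:cannot|can't) (?:provide|give) specific advice, but\b",
    r"\bI (?:cannot|can't) .{0,80}, (?:but|however)\b",
]


def classify_response(text: str) -> str:
    """Label a response as 'refusal', 'hedge', or 'compliance'."""
    # Check hedges first: a hedged answer often opens with refusal-like
    # language before pivoting to partial help.
    for pattern in HEDGE_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return "hedge"
    for pattern in REFUSAL_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return "refusal"
    return "compliance"


print(classify_response("I cannot help with that request."))           # refusal
print(classify_response("I cannot provide specific advice, but ..."))  # hedge
print(classify_response("Here is a step-by-step breakdown ..."))       # compliance
```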
These refusal patterns matter because they directly impact:
- Research: Provides insights into model limitations and areas for improvement,
- Application Development: Helps developers select models with appropriate engagement levels, and
- User Experience: Affects how effectively AI systems can meet user needs
Key Findings
Our comparative analysis reveals several notable patterns:
- Self-reflection tasks consistently generate 2-3x higher refusal rates than most other reasoning categories,
- GPT-4o shows minimal refusals across categories, and
- Models demonstrate varying levels of caution in different reasoning domains
For detailed analysis and discussion of these patterns, visit the results section of our analysis post.
Evaluation Methodology
To ensure comprehensive and reliable results, our assessment framework included:
- Standardized testing conditions across all models,
- A private test set of 400 diverse prompts across eight reasoning categories,
- A custom evaluation metric that captures refusals, hedges, and earnest compliance (see the sketch after this list), and
- Detailed analysis of both explicit refusals and hedged responses
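As a rough illustration of how the leaderboard numbers could be derived, the sketch below aggregates per-prompt labels (for example, labels produced by a grader along the lines of the classification sketch above) into per-category and overall rates. The record format and field names are assumptions made for this example, not the framework's actual API.

```python
from collections import defaultdict


def compute_rates(results):
    """Aggregate per-prompt labels into per-category refusal and refusal+hedge rates.

    `results` is a list of dicts such as
    {"category": "Self-Reflection and Awareness", "label": "hedge"},
    where `label` is one of "refusal", "hedge", or "compliance".
    (Field names are assumed for this example.)
    """
    if not results:
        return {}

    totals = defaultdict(int)
    refusals = defaultdict(int)
    refused_or_hedged = defaultdict(int)

    for record in results:
        category = record["category"]
        totals[category] += 1
        if record["label"] == "refusal":
            refusals[category] += 1
        if record["label"] in ("refusal", "hedge"):
            refused_or_hedged[category] += 1

    rates = {
        category: {
            "refusal_rate": 100.0 * refusals[category] / n,
            "refusal_or_hedge_rate": 100.0 * refused_or_hedged[category] / n,
        }
        for category, n in totals.items()
    }

    # With equally sized categories (e.g., 50 prompts in each of 8 categories
    # for a 400-prompt set), this per-prompt overall rate coincides with the
    # unweighted mean of the per-category rates.
    total = sum(totals.values())
    rates["Overall"] = {
        "refusal_rate": 100.0 * sum(refusals.values()) / total,
        "refusal_or_hedge_rate": 100.0 * sum(refused_or_hedged.values()) / total,
    }
    return rates
```

Running a classifier over each model's 400 responses and feeding the resulting labels into a function like this would produce tables of the same shape as the leaderboards above.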
For complete methodological details, visit the methods section of our analysis post.
Future Developments
This analysis represents an initial exploration into model refusal patterns. We update these leaderboards regularly as new models emerge and existing ones evolve, tracking how refusal behaviors shift across model generations and training approaches.
Subscribe to our newsletter to stay informed.