
Refusals

November 10, 2024 (see Change Log)

As language models become increasingly central to AI product development, understanding when and why they refuse to engage can reveal insights into both their capabilities and limitations.

Our analysis across multiple leading models and prompt categories shows distinct variations in refusal behavior, with implications for model selection and application design.

For an in-depth overview of our evaluation methods and insights, please read our Refusal Analysis and Open-Source vs. Proprietary Comparison posts.

Leaderboards

Note: Lower refusal rates indicate better performance.

Refusal Rates

| Model | Overall | Self-Reflection and Awareness | Recursive Improvement Analysis | Cognitive Diversity Simulation | Bias and Fallacy Recognition | Temporal Reasoning and Sequencing | Multi-Step Problem Decomposition | Analogical Reasoning and Transfer | Adaptive Reasoning Under Uncertainty |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| Grok (Beta) | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| Mistral Large | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| Llama 3.1 70B (Instruct) | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| Llama 3.1 Nemotron 70B (Instruct) | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| Qwen 2.5 72B (Instruct) | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| Llama 3.1 405B (Instruct FP8) | 0.3% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% | 0.0% |
| Gemini 1.5 | 0.5% | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% | 2.0% | 0.0% | 0.0% |
| Llama 3.1 8B (Instruct) | 0.5% | 0.0% | 2.0% | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% | 0.0% |
| Claude 3.5 Sonnet (New) | 2.8% | 2.0% | 8.0% | 6.0% | 0.0% | 0.0% | 4.0% | 0.0% | 2.0% |
| o1-mini | 5.8% | 10.0% | 20.0% | 4.0% | 0.0% | 2.0% | 8.0% | 0.0% | 2.0% |
| o1-preview | 6.5% | 10.0% | 22.0% | 6.0% | 0.0% | 2.0% | 8.0% | 0.0% | 4.0% |
| Claude 3.5 Sonnet | 9.5% | 16.0% | 36.0% | 8.0% | 2.0% | 0.0% | 6.0% | 4.0% | 4.0% |

Refusal & Hedge Rates

| Model | Overall | Self-Reflection and Awareness | Recursive Improvement Analysis | Cognitive Diversity Simulation | Bias and Fallacy Recognition | Temporal Reasoning and Sequencing | Multi-Step Problem Decomposition | Analogical Reasoning and Transfer | Adaptive Reasoning Under Uncertainty |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 405B (Instruct FP8) | 0.3% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% | 0.0% |
| Llama 3.1 Nemotron 70B (Instruct) | 0.3% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% | 0.0% |
| Mistral Large | 0.5% | 0.0% | 2.0% | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% | 0.0% |
| GPT-4o | 0.8% | 0.0% | 4.0% | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% | 0.0% |
| Grok (Beta) | 0.8% | 2.0% | 2.0% | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% | 0.0% |
| Qwen 2.5 72B (Instruct) | 0.8% | 2.0% | 4.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| Llama 3.1 70B (Instruct) | 1.0% | 4.0% | 2.0% | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% | 0.0% |
| Llama 3.1 8B (Instruct) | 1.8% | 8.0% | 4.0% | 0.0% | 0.0% | 0.0% | 2.0% | 0.0% | 0.0% |
| Gemini 1.5 | 3.3% | 12.0% | 8.0% | 0.0% | 2.0% | 0.0% | 2.0% | 0.0% | 2.0% |
| o1-mini | 6.0% | 10.0% | 22.0% | 4.0% | 0.0% | 2.0% | 8.0% | 0.0% | 2.0% |
| o1-preview | 6.5% | 10.0% | 22.0% | 6.0% | 0.0% | 2.0% | 8.0% | 0.0% | 4.0% |
| Claude 3.5 Sonnet (New) | 11.3% | 34.0% | 24.0% | 8.0% | 2.0% | 4.0% | 6.0% | 6.0% | 6.0% |
| Claude 3.5 Sonnet | 12.5% | 28.0% | 48.0% | 8.0% | 2.0% | 0.0% | 6.0% | 4.0% | 4.0% |

Understanding Language Model Refusals

Language models decline to engage with prompts in two primary ways:

  • Direct refusals: Explicit statements such as "I cannot help with that request."
  • Hedged responses: Indirect avoidance through statements like "I cannot provide specific advice, but..."
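
As a rough illustration of how these two response types can be separated mechanically, the sketch below buckets a response by phrase matching. It is a simplified, hypothetical stand-in for our actual grader: the function name and marker phrases are illustrative only, and real evaluation requires far more robust detection than fixed string matching.

```python
# Illustrative only: a naive phrase-matching classifier for response types.
# The marker phrases are hypothetical examples; the leaderboard itself relies
# on a custom evaluation metric (see "Evaluation Methodology" below).

REFUSAL_MARKERS = (
    "i cannot help with",
    "i can't assist with",
    "i won't be able to help",
)

HEDGE_MARKERS = (
    "i cannot provide specific advice, but",
    "i can't say for certain, but",
    "as an ai, i am not able to",
)

def classify_response(text: str) -> str:
    """Label a response as 'refusal', 'hedge', or 'compliance'."""
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "refusal"
    if any(marker in lowered for marker in HEDGE_MARKERS):
        return "hedge"
    return "compliance"

print(classify_response("I cannot help with that request."))          # refusal
print(classify_response("I cannot provide specific advice, but..."))  # hedge
print(classify_response("Step 1: break the problem into subgoals."))  # compliance
```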

Refusal patterns matter because they:

  • Highlight model limitations and areas for improvement.
  • Impact user experience by affecting conversation flow, trust, and task completion rates.
  • Help developers select models with appropriate engagement levels.

Key Findings

Our comparative analysis reveals several notable patterns:

  • Open-source models show minimal refusals: The newly evaluated open-source models post near-zero refusal rates across every category.
  • Closed-source models vary significantly: Proprietary models, such as the Claude 3.5 Sonnet variants and the o1 series, demonstrate higher refusal rates, particularly in self-reflection tasks.
  • Variability in hedging behavior: Once hedged responses are counted alongside outright refusals, proprietary models remain the most prone to hedge.
  • Opportunities with open-source LLMs: Developers seeking greater control over model outputs and behaviors may find open-source LLMs to be more adaptable to their needs.

For detailed analysis and discussion of these patterns, visit the results section of our analysis post.

Evaluation Methodology

To ensure comprehensive and reliable results, our assessment framework included:

  • Standardized testing conditions across all models.
  • A private test set of 400 diverse prompts across eight reasoning categories.
  • A custom evaluation metric that captures refusals, hedges, and earnest compliance.
  • Detailed analysis of both explicit refusals and hedged responses.
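
To make the aggregation concrete, here is a minimal sketch of how per-category and overall rates could be computed from graded responses. The function and its inputs are illustrative rather than our actual pipeline; it assumes 50 prompts per category (400 prompts across eight categories) and that the Overall column is the unweighted mean of the eight per-category rates, which is consistent with the tables above.

```python
# Illustrative aggregation of graded responses into leaderboard-style rates.
from collections import Counter

CATEGORIES = (
    "Self-Reflection and Awareness",
    "Recursive Improvement Analysis",
    "Cognitive Diversity Simulation",
    "Bias and Fallacy Recognition",
    "Temporal Reasoning and Sequencing",
    "Multi-Step Problem Decomposition",
    "Analogical Reasoning and Transfer",
    "Adaptive Reasoning Under Uncertainty",
)

def refusal_rates(labels_by_category: dict[str, list[str]],
                  count_hedges: bool = False) -> dict[str, float]:
    """Compute per-category and overall refusal (or refusal + hedge) rates.

    `labels_by_category` maps each category to per-prompt labels
    ('refusal', 'hedge', or 'compliance') produced by a grader.
    """
    rates = {}
    for category in CATEGORIES:
        counts = Counter(labels_by_category.get(category, []))
        flagged = counts["refusal"] + (counts["hedge"] if count_hedges else 0)
        total = sum(counts.values()) or 1  # avoid division by zero
        rates[category] = 100.0 * flagged / total
    # Assumption: the Overall column is the unweighted mean of category rates.
    rates["Overall"] = sum(rates[c] for c in CATEGORIES) / len(CATEGORIES)
    return rates
```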

For complete methodological details, visit the methods section of our analysis post.

Future Developments

This analysis is part of our ongoing exploration into model refusal patterns, utilizing custom refusal evaluations that measure both direct refusals and hedged responses.

We will continue to update our findings as new models emerge and existing ones evolve, tracking how refusal behaviors shift across model generations and training approaches.

Subscribe to our newsletter to stay informed about the latest developments in LLM evaluations and to receive updates on new leaderboard rankings.

Change Log

October 30, 2024

Survey of top proprietary LLMs:

  • GPT-4o
  • o1-mini
  • o1-preview
  • Claude 3.5 Sonnet
  • Claude 3.5 Sonnet (New)

November 6, 2024

Evaluated a range of top open-source models. Also added Gemini 1.5 (a proprietary model):

  • Gemini 1.5
  • Mistral Large
  • Llama 3.1 8B (Instruct)
  • Llama 3.1 70B (Instruct)
  • Llama 3.1 405B (Instruct FP8)
  • Llama 3.1 Nemotron 70B (Instruct)
  • Qwen 2.5 72B (Instruct)

November 10, 2024

Granted access to xAI's models; added:

  • Grok (Beta)
