Mandoline Leaderboards

Welcome to Mandoline's leaderboards!

Our leaderboards focus on real-world applications of LLMs:

  • Coding: Pattern adherence, scope discipline, and comment quality on real sprint tickets
  • Refusals: When and why models decline to engage across reasoning categories

Measuring LLM Performance

For any real-world use case, there is no single best model; the answer always depends on context.

To rank models effectively, you need to consider:

  • The model's intended tasks and workflows
  • The behaviors you want the model to embody
  • Your tolerance for different trade-offs
  • Your scale and cost constraints

So we evaluate models on realistic test data and measure the behaviors that shape user experience and product quality. Our goal is to provide context-specific results that help you choose the best LLM for you.

Build Your Own

Want to see how the top LLMs perform on your specific tasks and preferences?

Our Best LLM For tool lets you easily run custom evaluations with your own prompts and metrics.

Need more granular control? Check out our Node.js and Python SDKs.
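
As a rough sketch only: the client, method, and field names below (Mandoline, create_metric, create_evaluation, score, and the MANDOLINE_API_KEY variable) are assumptions for illustration, not the confirmed SDK surface; see the SDK docs for the actual API. A custom evaluation with the Python SDK might look like this:

```python
# Hypothetical sketch -- names below are assumptions, not the confirmed API.
from mandoline import Mandoline

client = Mandoline()  # assumes an API key, e.g. via the MANDOLINE_API_KEY env var

# Define a metric for a behavior you care about.
metric = client.create_metric(
    name="Scope Discipline",
    description="Stays within the ticket's stated scope without unrequested refactors.",
)

# Score one prompt/response pair from the model under test against that metric.
evaluation = client.create_evaluation(
    metric_id=metric.id,
    prompt="Fix the off-by-one error in the paginator. Do not refactor.",
    response="Here's a one-line fix to the loop bound: ...",
)

print(evaluation.score)  # assumed: higher means closer to the desired behavior
```

Repeating this across the models you're considering, with your own prompts and metrics, yields the same kind of context-specific ranking the leaderboards provide.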

Contact

Have questions, feedback, or want to learn more? Please reach out!

You can contact us at support@mandoline.ai
