Mandoline Leaderboards

Welcome to Mandoline's leaderboards!

Our leaderboards focus on real-world applications of LLMs:

  • Coding: Pattern adherence, scope discipline, and comment quality on real sprint tickets
  • Refusals: When and why models decline to engage across reasoning categories

Measuring LLM Performance

For any real-world use case, there is no single best model; the answer always depends on context.

To rank models effectively, you need to consider:

  • The model's intended tasks and workflows
  • The behaviors you want the model to embody
  • Your tolerance for different trade-offs
  • Your scale and cost constraints

So we evaluate models on realistic test data and measure the behaviors that shape user experience and product quality. Our goal is to provide context-specific results that help you choose the best LLM for you.

Build Your Own

Want to see how the top LLMs perform on your specific tasks and preferences?

Our Best LLM For tool lets you easily run custom evaluations with your own prompts and metrics.

Need more granular control? Check out our Node.js and Python SDKs.
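
As a rough sketch only: the client, method, and field names below (Mandoline, create_metric, create_evaluation, score, and the MANDOLINE_API_KEY variable) are assumptions for illustration, not the confirmed SDK surface; see the SDK docs for the actual API. A custom evaluation with the Python SDK might look like this:

```python
# Hypothetical sketch -- names below are assumptions, not the confirmed API.
from mandoline import Mandoline

client = Mandoline()  # assumes an API key, e.g. via the MANDOLINE_API_KEY env var

# Define a metric for a behavior you care about.
metric = client.create_metric(
    name="Scope Discipline",
    description="Stays within the ticket's stated scope without unrequested refactors.",
)

# Score one prompt/response pair from the model under test against that metric.
evaluation = client.create_evaluation(
    metric_id=metric.id,
    prompt="Fix the off-by-one error in the paginator. Do not refactor.",
    response="Here's a one-line fix to the loop bound: ...",
)

print(evaluation.score)  # assumed: higher means closer to the desired behavior
```

Repeating this across the models you're considering, with your own prompts and metrics, yields the same kind of context-specific ranking the leaderboards provide.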

Contact

Have questions, feedback, or want to learn more? Please reach out!

You can contact us at support@mandoline.ai
