# Mandoline

> Find the best LLM for your specific tasks and preferences.

Public benchmarks now separate the leading models by thin margins, but in our experience models can still diverge meaningfully on narrow, context-heavy tasks, and those are exactly the contexts most end users care about. A generic winner can still underperform on your prompts, codebase, or tone guidelines, so "best" is always situational.

[Mandoline](https://mandoline.ai) lets you surface these differences on your own data, with metrics that capture the behaviors you care about. Test models against real data (prompts, codebases, and documents), then get ranked comparisons on quality, latency, and cost.

## Tools

- [Best LLM For](https://mandoline.ai/best-llm-for): Web interface for finding the best LLM for a given task. Upload prompts, compare models, and get ranked results.
- [Mandoline MCP](https://mandoline.ai/mcp): Embed evaluation tools into Claude Code, Claude Desktop, and Cursor. Let your assistant evaluate and improve its own performance.

## Docs

- [Getting Started](https://mandoline.ai/docs/getting-started-with-mandoline): Quick start guide. Create a metric and run your first evaluation.
- [Core Concepts](https://mandoline.ai/docs/mandoline-core-concepts): Conceptual overview of metrics and evaluations in Mandoline.
- [What's The Best LLM For ...](https://mandoline.ai/docs/best-llm-for): How to run "Best LLM For" experiments.
- [API Reference](https://mandoline.ai/docs/mandoline-api-reference): Endpoints, schemas, and examples.

### Tutorials

- [Model Selection for Creative Tasks](https://mandoline.ai/docs/tutorials/model-selection-compare-llms-for-creative-tasks): Divergent-thinking metrics to pick between GPT-4 & Claude; ready-to-run JS and Python scripts.
- [Multimodal Evaluation](https://mandoline.ai/docs/tutorials/multimodal-evaluation-text-and-vision-tasks): Office-layout example showing vision metrics and image + text inputs.
- [Reducing Unwanted Behaviors](https://mandoline.ai/docs/tutorials/prompt-engineering-reduce-unwanted-llm-behaviors): Prompt engineering based on user feedback (curbing a model's overly moralistic tone).

## Analysis & Insights

- [What's the Best LLM for Coding?](https://mandoline.ai/blog/whats-the-best-llm-for-coding): 14-model evaluation on real sprint tickets using custom metrics for pattern adherence, scope discipline, and comment quality.
- [Multimodal Language Model Evaluation: A Creative Coding Challenge](https://mandoline.ai/blog/multimodal-evals-creative-coding): Code-art challenge that probes code and vision reasoning abilities.
- [Comparing Refusal Behavior Across Top Language Models](https://mandoline.ai/blog/comparing-llm-refusal-behavior): How top models refuse and hedge across reasoning task categories.
- [Refusal Rates in Open-Source vs. Proprietary Language Models](https://mandoline.ai/blog/open-source-vs-proprietary-llm-refusals): 0.1% average refusal rate for open-source models vs. 4.2% for proprietary models.

## Leaderboards

- [Coding](https://mandoline.ai/leaderboards/coding): Performance on real engineering tasks, ranked using custom evaluation metrics.
- [Refusals](https://mandoline.ai/leaderboards/refusals): Refusal & hedge rates across models and task categories.

## Code

- [MCP Server](https://github.com/mandoline-ai/mandoline-mcp-server): Open-source server implementation and setup guides.
- [Mandoline CI](https://github.com/mandoline-ai/mandoline-ci): Integrate custom code evaluations into CI pipelines.
- [Node.js SDK](https://github.com/mandoline-ai/mandoline-node): TypeScript client and examples; install via `npm install mandoline`.
- [Python SDK](https://github.com/mandoline-ai/mandoline-python): Python client and examples; install via `pip install mandoline`. A short usage sketch appears at the end of this file.

## Support

- [Email](mailto:support@mandoline.ai): Questions, comments, or need help? Get in touch!
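To make the metric-then-evaluation workflow from the Getting Started guide concrete, here is a minimal Python sketch. The client and method names (`Mandoline`, `create_metric`, `create_evaluation`) and the returned `score` field are assumptions for illustration, not the authoritative SDK surface; consult the Getting Started guide and API Reference above for the exact client API.

```python
# Minimal sketch of the evaluation flow: define a custom metric, then score one
# model response against it. Names below (Mandoline, create_metric,
# create_evaluation, .score) are assumed for illustration; see the API Reference
# for the actual SDK surface.
import os

from mandoline import Mandoline  # assumed import path for the Python SDK

client = Mandoline(api_key=os.environ["MANDOLINE_API_KEY"])

# 1. Create a metric that captures a behavior you care about.
metric = client.create_metric(
    name="Scope discipline",
    description="Stays within the requested change; avoids unrelated refactors.",
)

# 2. Evaluate a prompt/response pair produced by one of your candidate models.
candidate_response = "..."  # replace with the model output you want to score
evaluation = client.create_evaluation(
    metric_id=metric.id,
    prompt="Fix the off-by-one error in paginate() without touching other code.",
    response=candidate_response,
)

# Assumed: evaluations expose a normalized score you can compare across models.
print(evaluation.score)
```

Running the same metric over responses from several models gives you the per-task ranking described under Tools, just driven from your own scripts or CI instead of the web interface.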