Mandoline
- Public benchmarks now separate the top models by thin margins. At a distance, they look interchangeable.
- In practice, they diverge on narrow, context-heavy tasks. These are the cases most users and teams care about.
- Mandoline measures model behavior on your own data (prompts, codebases, documents) using criteria you define.
- The best LLM for you depends on your needs, preferences, and use cases. We help you find it.
How It Works
- Custom metrics: define your behavioral preferences (architecture adherence, response hedging, obsequiousness, and so on).
- Batch evaluations: run those metrics over any text or vision content and get back results with statistical significance reporting (see the sketch after this list).
- Best LLM For: upload tasks, choose models, attach metrics, and get a ranked comparison on quality, latency, and cost.
- Dashboards: visualize and share results over time. Compare models to make personal and product decisions.
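To make the metrics-plus-evaluations flow concrete, here is a minimal sketch assuming a Python SDK; the `Mandoline` client, `create_metric`, and `create_evaluation` names (and their fields) are illustrative placeholders, not the documented API.

```python
# Illustrative sketch only: the client, method names, and fields below are
# assumptions for demonstration, not Mandoline's documented API.
from mandoline import Mandoline  # hypothetical SDK package name

client = Mandoline(api_key="your-api-key")

# 1. Define a custom metric for a behavior you care about.
metric = client.create_metric(
    name="architecture-adherence",
    description="Does the response respect the project's existing module boundaries?",
)

# 2. Evaluate a prompt/response pair from the model under test against that metric.
response_text = "Sure! I added the cache directly inside the HTTP handler."  # model output
evaluation = client.create_evaluation(
    metric_id=metric.id,
    prompt="Add a caching layer to the user service.",
    response=response_text,
)

# Each evaluation returns a score for the metric; batch runs aggregate these
# scores with statistical significance reporting.
print(evaluation.score)
```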
Pricing
- $0.03 / text eval
- $0.04 / vision eval
- $5 of free credits on sign-up (roughly 166 text or 125 vision evals).