Core Concepts

What is Mandoline?

Mandoline helps developers evaluate and improve LLM applications in ways that matter to users. It bridges the gap between abstract model performance and real-world usefulness.

Main Features

  1. Custom Metrics: Create evaluation criteria that align with your unique use case and user requirements.
  2. User-Focused Evaluation: Assess your LLM's performance in real situations.
  3. Progress Tracking: Monitor how your LLM improves over time.
  4. Informed Decisions: Get insights to guide your LLM setup and configuration choices.
  5. Large-Scale Evaluation: Apply your custom metrics efficiently across large numbers of LLM outputs.
  6. Easy Integration: Integrate Mandoline into your existing development workflow.

Metrics

In Mandoline, metrics define the specific LLM behaviors or output qualities you want to measure.

Key aspects of Metrics:

  • Customizable: Define metrics that matter for your specific use case.
  • Scalable: Apply metrics consistently across a large number of LLM interactions.
  • Composable: Combine multiple metrics to create more complex evaluations (see the sketch after this list).
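
For example, here's a minimal sketch of composing metrics, assuming a second hypothetical realismMetric alongside the antagonism metric created below: it scores one response against both metrics and blends the results with illustrative weights. The weighting is plain TypeScript, not a Mandoline feature, and createEvaluation is introduced later on this page.

// Sketch: evaluate one response against several metrics, then
// combine the scores. Weights are illustrative, not a Mandoline API.
const metrics = [antagonismMetric, realismMetric];
const weights = [0.6, 0.4];

const evaluations = await Promise.all(
  metrics.map((metric) =>
    mandoline.createEvaluation({ metricId: metric.id, prompt, response }),
  ),
);

const compositeScore = evaluations.reduce(
  (sum, evaluation, i) => sum + evaluation.score * weights[i],
  0,
);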

Example: Creating a Custom Metric

Let's say you're using LLMs to generate dialogue for a set of video game characters. Your goal is to create engaging, complex, and realistic characters that make the game more fun.

We'll focus on a "bully" character, whose main role is to create tension and conflict in the game's story. Here's how you might create a metric to track this behavior:

const antagonismMetric = await mandoline.createMetric({
  name: "Antagonism",
  description:
    "Measures the character's ability to create conflict and disruption in interactions.",
  tags: ["personality", "narrative_impact", "social_interaction"],
});

You can now use this metric to evaluate how well the bully character's dialogue creates conflict across different game situations.

Evaluations

Evaluations in Mandoline apply your custom metrics to specific LLM interactions.

Key aspects of Evaluations:

  • Context-rich: Include relevant information about the metric, prompt, response, and surrounding context.
  • Aggregatable: Combine multiple evaluations to analyze trends and patterns (see the sketch after this list).
  • Actionable: Provide insights that can guide improvements to your LLM pipeline.
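
As a rough illustration of aggregation, the helper below averages scores per situation to surface patterns. Here, evaluations stands in for results you've already collected via createEvaluation (shown in the next example), and the grouping logic is plain TypeScript rather than a Mandoline API.

// Sketch: average evaluation scores per situation to see where a
// character's behavior holds up or breaks down. `evaluations` is
// assumed to be a collection of createEvaluation results.
type EvalResult = { score: number; properties: { situation: string } };

function averageBySituation(evaluations: EvalResult[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const { score, properties } of evaluations) {
    const entry = totals.get(properties.situation) ?? { sum: 0, count: 0 };
    entry.sum += score;
    entry.count += 1;
    totals.set(properties.situation, entry);
  }
  return new Map(
    [...totals].map(([situation, { sum, count }]) => [situation, sum / count]),
  );
}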

Example: Using the Custom Metric

Let's use our antagonism metric to evaluate the bully character's dialogue:

// Build the prompt from a reusable template.
const promptTemplate =
  "Generate dialogue for a {character_type} character in response to: '{situation}'.";
const characterType = "unrepentant bully";
const situation = "A new student asks for directions";

const prompt = promptTemplate
  .replace("{character_type}", characterType)
  .replace("{situation}", situation);

// Generate a response with your own LLM client.
const response = await yourLLM.generate(prompt);

// Score the response against the antagonism metric, storing the
// template variables as properties for later analysis.
const evaluation = await mandoline.createEvaluation({
  metricId: antagonismMetric.id,
  prompt,
  response,
  properties: {
    promptTemplate,
    characterType,
    situation,
  },
});

console.log(`Antagonism score: ${evaluation.score}`);

In this case, we'd expect a high antagonism score. If scores are unexpectedly low or start to vary widely over time, we might need to adjust our character prompts or fine-tune our model.
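
One way to watch for that kind of drift is to track the mean and spread of recent scores, as in this sketch. The thresholds are placeholders you'd tune for your application, not Mandoline defaults.

// Sketch: flag a metric whose recent scores are low on average or
// vary widely. Thresholds are placeholders, not Mandoline defaults.
function needsAttention(
  scores: number[],
  minMean = 0.7,
  maxStdDev = 0.15,
): boolean {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((acc, s) => acc + (s - mean) ** 2, 0) / scores.length;
  return mean < minMean || Math.sqrt(variance) > maxStdDev;
}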

Putting It All Together

Through application-specific metrics and evaluations, you can:

  1. Spot patterns in your LLM's performance across different scenarios.
  2. Find specific areas to improve in your prompts or model fine-tuning.
  3. Track how changes to your LLM pipeline affect performance over time.
  4. Make smart choices about which models to use and how to set them up.

Want to learn more? Check out our Tutorials for practical examples and step-by-step guides on using Mandoline to solve real-world AI challenges.