Core Concepts

What is Mandoline?

Mandoline helps developers evaluate and improve LLM applications in ways that matter to users. It bridges the gap between abstract model performance and real-world usefulness.

Main Features

  1. Custom Metrics: Create evaluation criteria that align with your unique use case and user requirements.
  2. User-Focused Evaluation: Assess your LLM's performance in real situations.
  3. Progress Tracking: Monitor how your LLM improves over time.
  4. Informed Decisions: Get insights to guide your LLM setup and configuration choices.
  5. Large-Scale Evaluation: Apply your custom metrics efficiently across large numbers of LLM outputs.
  6. Easy Integration: Integrate Mandoline into your existing development workflow.

Metrics

In Mandoline, metrics define the specific LLM behaviors or output qualities you want to measure.

Key aspects of Metrics:

  • Customizable: Define metrics that matter for your specific use case.
  • Scalable: Apply metrics consistently across a large number of LLM interactions.
  • Composable: Combine multiple metrics to create more complex evaluations (see the sketch after this list).
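
For example, here's a minimal sketch of composing metrics, assuming a second hypothetical realismMetric alongside the antagonism metric created below: it scores one response against both metrics and blends the results with illustrative weights. The weighting is plain TypeScript, not a Mandoline feature, and createEvaluation is introduced later on this page.

// Sketch: evaluate one response against several metrics, then
// combine the scores. Weights are illustrative, not a Mandoline API.
const metrics = [antagonismMetric, realismMetric];
const weights = [0.6, 0.4];

const evaluations = await Promise.all(
  metrics.map((metric) =>
    mandoline.createEvaluation({ metricId: metric.id, prompt, response }),
  ),
);

const compositeScore = evaluations.reduce(
  (sum, evaluation, i) => sum + evaluation.score * weights[i],
  0,
);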

Example: Creating a Custom Metric

Let's say you're using LLMs to generate dialogue for a set of video game characters. Your goal is to create engaging, complex, and realistic characters that make the game more fun.

We'll focus on a "bully" character, whose main role is to create tension and conflict in the game's story. Here's how you might create a metric to track this behavior:

const antagonismMetric = await mandoline.createMetric({
  name: "Antagonism",
  description:
    "Measures the character's ability to create conflict and disruption in interactions.",
  tags: ["personality", "narrative_impact", "social_interaction"],
});

You can now use this metric to evaluate how well the bully character's dialogue creates conflict across different game situations.

Evaluations

Evaluations in Mandoline apply your custom metrics to specific LLM interactions.

Key aspects of Evaluations:

  • Context-rich: Include relevant information about the metric, prompt, response, and surrounding context.
  • Aggregatable: Combine multiple evaluations to analyze trends and patterns (see the sketch after this list).
  • Actionable: Provide insights that can guide improvements to your LLM pipeline.
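
As a rough illustration of aggregation, the helper below averages scores per situation to surface patterns. Here, evaluations stands in for results you've already collected via createEvaluation (shown in the next example), and the grouping logic is plain TypeScript rather than a Mandoline API.

// Sketch: average evaluation scores per situation to see where a
// character's behavior holds up or breaks down. `evaluations` is
// assumed to be a collection of createEvaluation results.
type EvalResult = { score: number; properties: { situation: string } };

function averageBySituation(evaluations: EvalResult[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const { score, properties } of evaluations) {
    const entry = totals.get(properties.situation) ?? { sum: 0, count: 0 };
    entry.sum += score;
    entry.count += 1;
    totals.set(properties.situation, entry);
  }
  return new Map(
    [...totals].map(([situation, { sum, count }]) => [situation, sum / count]),
  );
}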

Example: Using the Custom Metric

Let's use our antagonism metric to evaluate the bully character's dialogue:

// Build the prompt from a reusable template.
const promptTemplate =
  "Generate dialogue for a {character_type} character in response to: '{situation}'.";
const characterType = "unrepentant bully";
const situation = "A new student asks for directions";

const prompt = promptTemplate
  .replace("{character_type}", characterType)
  .replace("{situation}", situation);

// Generate a response with your own LLM client.
const response = await yourLLM.generate(prompt);

// Score the response against the antagonism metric, storing the
// template variables as properties for later analysis.
const evaluation = await mandoline.createEvaluation({
  metricId: antagonismMetric.id,
  prompt,
  response,
  properties: {
    promptTemplate,
    characterType,
    situation,
  },
});

console.log(`Antagonism score: ${evaluation.score}`);

In this case, we'd expect a high antagonism score. If scores are unexpectedly low or start to vary widely over time, we might need to adjust our character prompts or fine-tune our model.
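
One way to watch for that kind of drift is to track the mean and spread of recent scores, as in this sketch. The thresholds are placeholders you'd tune for your application, not Mandoline defaults.

// Sketch: flag a metric whose recent scores are low on average or
// vary widely. Thresholds are placeholders, not Mandoline defaults.
function needsAttention(
  scores: number[],
  minMean = 0.7,
  maxStdDev = 0.15,
): boolean {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((acc, s) => acc + (s - mean) ** 2, 0) / scores.length;
  return mean < minMean || Math.sqrt(variance) > maxStdDev;
}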

Putting It All Together

Through application-specific metrics and evaluations, you can:

  1. Spot patterns in your LLM's performance across different scenarios.
  2. Find specific areas to improve in your prompts or model fine-tuning.
  3. Track how changes to your LLM pipeline affect performance over time.
  4. Make smart choices about which models to use and how to set them up.

Want to learn more? Check out our Tutorials for practical examples and step-by-step guides on using Mandoline to solve real-world AI challenges.