Multimodal Evaluation: Evaluate LLMs Across Text and Vision Tasks
Imagine you're building an AI assistant that helps users plan home renovations. Your users upload photos of their spaces and describe what they want to change. But how do you know if your assistant can process and reason about the spaces it sees?
As LLMs add image understanding capabilities, we need new ways to measure how well they actually perform on real-world tasks across various modalities. For vision tasks, it's not just about whether they can identify objects in photos - it's about whether they can reason about spaces, suggest practical solutions, and combine visual and textual information in meaningful ways.
In this tutorial, you'll learn how to evaluate LLMs on multimodal tasks. We'll use a practical example - planning office layouts - but these techniques apply to any application combining text and visual inputs.
What You'll Learn
- How to create metrics that measure what matters for vision tasks
- How to run evaluations that combine images and text
- How to track your model's cross-modal reasoning abilities
By the end, you'll have a framework for ensuring your multimodal LLM applications can effectively reason about visual information for your specific use cases.
Prerequisites
Before starting, make sure you have:
- Node.js installed on your system
- A Mandoline account and API key
- Access to a multimodal LLM that can process images alongside text
If you're unfamiliar with basic Mandoline usage, read our Getting Started guide first.
Step 1: Define a Vision-Specific Metric
When evaluating LLMs on visual tasks, you need metrics that capture both visual understanding and practical reasoning. Let's create one for our office layout scenario.
Think about what makes a good layout suggestion. Your assistant needs to:
- Notice what's already in the space
- Suggest moves that are physically possible
- Make sure furniture doesn't end up overlapping
- Keep important pathways clear
Here's how we can capture these requirements in a metric:
import { Mandoline } from "mandoline";

// Initialize the Mandoline client with your API key
const mandoline = new Mandoline({ apiKey: "your-api-key" });

// A metric that targets spatial reasoning and practicality, not just object recognition
const layoutMetric = await mandoline.createMetric({
  name: "Layout Plan Quality",
  description:
    "Measures how practical and spatially aware the suggested layout changes are",
  tags: ["vision", "spatial-reasoning", "practicality"],
});
This metric will help us track whether our assistant's suggestions would actually work in the real world. When we evaluate responses, we'll look at both the visual understanding ("there's a desk by the window") and the practical reasoning ("we can't move the desk there because it would block the door").
Step 2: Provide the Image and Text Prompt
Next, we'll encode the office scene as a data URL by reading a PNG file and converting it to Base64:
import { readFileSync } from "node:fs";

// Convert any PNG file to a Base64-encoded data URL
function imageToDataUrl(imagePath: string): string {
  const base64 = readFileSync(imagePath).toString("base64");
  return `data:image/png;base64,${base64}`;
}
const officeImageUrl = imageToDataUrl("office-layout.png");
const promptText = `
Here's a photo of our current office layout.
We need to add a second desk for a new team member.
What's the best way to rearrange things to make space?
`;
Step 3: Ask Your Model for a Rearrangement Plan
We'll assume you have a multimodal LLM that can accept both text and images. For simplicity, we'll mock a response:
const modelResponse = `
1. Move the existing desk to the left wall.
2. Shift the chair to the corner near the file cabinet.
3. Place the second desk in front of the window.
4. Ensure the lamp stays on the first desk for easy access.
`;
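When you're ready to call a real model, the exact request shape depends on your provider. As a rough sketch only (assuming the OpenAI Node SDK and a vision-capable model; swap in whatever multimodal LLM you actually use), you'd replace the mocked modelResponse with the model's actual output:
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Send the prompt text and the encoded office image in a single user message
const completion = await openai.chat.completions.create({
  model: "gpt-4o", // illustrative choice - any vision-capable model works
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: promptText },
        { type: "image_url", image_url: { url: officeImageUrl } },
      ],
    },
  ],
});

const modelResponse = completion.choices[0].message.content ?? "";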
Step 4: Create an Evaluation
We can now send everything to Mandoline in a single API call.
const evaluation = await mandoline.createEvaluation({
  metricId: layoutMetric.id,
  prompt: promptText,
  promptImage: officeImageUrl, // the Base64 data URL of the office photo
  response: modelResponse,
  properties: { domain: "office-furniture" },
});
console.log(`Evaluation Score: ${evaluation.score}`);
Over time, you might collect many of these evaluations (from different scenes, different arrangement requests, or even different models) and compare their performance.
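For example, you could tag each evaluation with the model that produced the response. A quick sketch (the model key and value here are illustrative conventions we choose, not something the API requires):
const taggedEvaluation = await mandoline.createEvaluation({
  metricId: layoutMetric.id,
  prompt: promptText,
  promptImage: officeImageUrl,
  response: modelResponse,
  // Record which model generated the response so we can compare models later
  properties: { domain: "office-furniture", model: "gpt-4o" },
});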
Step 5: Analyze Results Over Time
After running multiple rearrangement tasks (e.g., different office layouts, different model versions), you can retrieve your evaluations and see which setups worked best:
const allEvals = await mandoline.getEvaluations({
  metricId: layoutMetric.id,
});

let sumScores = 0;
allEvals.forEach((ev) => {
  console.log(
    `Eval ID: ${ev.id}, Score: ${ev.score}, Model: ${ev.properties?.model}`,
  );
  sumScores += ev.score;
});

const averageScore = sumScores / allEvals.length;
console.log(`Average Rearrangement Plan Score: ${averageScore.toFixed(2)}`);
If you used the properties field to store additional metadata, like properties.model or properties.version, you can compare model versions or analyze which prompts yield the best spatial arrangements.
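For instance, here's a minimal sketch that groups scores by the model name stored in properties.model (assuming you recorded that field when creating each evaluation):
// Group evaluation scores by the model recorded in properties.model
const scoresByModel = new Map<string, number[]>();
for (const ev of allEvals) {
  const model = (ev.properties?.model as string) ?? "unknown";
  const scores = scoresByModel.get(model) ?? [];
  scores.push(ev.score);
  scoresByModel.set(model, scores);
}

// Report the average score per model
for (const [model, scores] of scoresByModel) {
  const average = scores.reduce((a, b) => a + b, 0) / scores.length;
  console.log(`${model}: ${average.toFixed(2)} (${scores.length} evaluations)`);
}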
Conclusion
In this tutorial, you've learned how to evaluate multimodal LLMs using Mandoline's vision pipeline. This approach allows you to:
- Create metrics that capture visual reasoning capabilities, from simple object labeling to complex spatial planning
- Evaluate how well models integrate text and image inputs in their responses
- Verify that your AI system actually understands what it sees in images and that its suggestions are practical and useful
As LLMs evolve to handle both text and images, Mandoline helps you evaluate how they understand and reason about visual information in ways that matter to your users – whether that's analyzing technical images, following visual instructions, planning spatial arrangements, or more.
Next Steps
- Try evaluating with different types of spaces and layout challenges
- Add metrics for other aspects of visual understanding, like lighting or ergonomics (see the sketch after this list)
- Explore our Prompt Engineering tutorial to improve your visual prompts
- Check out our Model Selection tutorial to compare how different LLMs handle visual tasks
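For instance, a sketch of one additional metric (the name and description are illustrative - define whatever matters for your use case):
const ergonomicsMetric = await mandoline.createMetric({
  name: "Ergonomic Awareness",
  description:
    "Measures whether suggested layouts account for ergonomics, such as monitor glare, desk spacing, and natural light",
  tags: ["vision", "ergonomics"],
});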
The key to building useful multimodal systems is thinking about how users will actually use your assistant in the real world, mapping those needs to relevant metrics, and evaluating performance using realistic scenarios that match your actual use case.