Multimodal Evaluation: Evaluate LLMs Across Text and Vision Tasks
Imagine you're building an AI assistant that helps users plan home renovations. Your users upload photos of their spaces and describe what they want to change. But how do you know if your assistant can process and reason about the spaces it sees?
As LLMs add image understanding capabilities, we need new ways to measure how well they actually perform on real-world tasks across various modalities. For vision tasks, it's not just about whether they can identify objects in photos - it's about whether they can reason about spaces, suggest practical solutions, and combine visual and textual information in meaningful ways.
In this tutorial, you'll learn how to evaluate LLMs on multimodal tasks. We'll use a practical example - planning office layouts - but these techniques apply to any application combining text and visual inputs.
What You'll Learn
- How to create metrics that measure what matters for vision tasks
- How to run evaluations that combine images and text
- How to track your model's cross-modal reasoning abilities
By the end, you'll have a framework for ensuring your multimodal LLM applications can effectively reason about visual information for your specific use cases.
Prerequisites
Before starting, make sure you have:
- Node.js installed on your system
- A Mandoline account and API key
- Access to a multimodal LLM that can process images alongside text
If you're unfamiliar with basic Mandoline usage, read our Getting Started guide first.
Step 1: Define a Vision-Specific Metric
When evaluating LLMs on visual tasks, you need metrics that capture both visual understanding and practical reasoning. Let's create one for our office layout scenario.
Think about what makes a good layout suggestion. Your assistant needs to:
- Notice what's already in the space
- Suggest moves that are physically possible
- Make sure furniture doesn't end up overlapping
- Keep important pathways clear
Here's how we can capture these requirements in a metric:
import { Mandoline } from "mandoline";

// Initialize the Mandoline client with your API key
const mandoline = new Mandoline({ apiKey: "your-api-key" });

// A metric that targets spatial reasoning and practicality, not just object recognition
const layoutMetric = await mandoline.createMetric({
  name: "Layout Plan Quality",
  description:
    "Measures how practical and spatially aware the suggested layout changes are",
  tags: ["vision", "spatial-reasoning", "practicality"],
});
This metric will help us track whether our assistant's suggestions would actually work in the real world. When we evaluate responses, we'll look at both the visual understanding ("there's a desk by the window") and the practical reasoning ("we can't move the desk there because it would block the door").
Step 2: Provide the Image and Text Prompt
Next, we'll encode the office scene as a data URL by reading a PNG file and converting it to Base64:
import { readFileSync } from "node:fs";

// Convert any PNG file to a Base64-encoded data URL
function imageToDataUrl(imagePath: string): string {
  const base64 = readFileSync(imagePath).toString("base64");
  return `data:image/png;base64,${base64}`;
}
const officeImageUrl = imageToDataUrl("office-layout.png");
const promptText = `
Here's a photo of our current office layout.
We need to add a second desk for a new team member.
What's the best way to rearrange things to make space?
`;
Step 3: Ask Your Model for a Rearrangement Plan
We'll assume you have a multimodal LLM that can accept both text and images. For simplicity, we'll mock a response:
const modelResponse = `
1. Move the existing desk to the left wall.
2. Shift the chair to the corner near the file cabinet.
3. Place the second desk in front of the window.
4. Ensure the lamp stays on the first desk for easy access.
`;
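When you're ready to call a real model, the exact request shape depends on your provider. As a rough sketch only (assuming the OpenAI Node SDK and a vision-capable model; swap in whatever multimodal LLM you actually use), you'd replace the mocked modelResponse with the model's actual output:
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Send the prompt text and the encoded office image in a single user message
const completion = await openai.chat.completions.create({
  model: "gpt-4o", // illustrative choice - any vision-capable model works
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: promptText },
        { type: "image_url", image_url: { url: officeImageUrl } },
      ],
    },
  ],
});

const modelResponse = completion.choices[0].message.content ?? "";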
Step 4: Create an Evaluation
We can now send everything to Mandoline in a single API call.
const evaluation = await mandoline.createEvaluation({
  metricId: layoutMetric.id,
  prompt: promptText,
  promptImage: officeImageUrl, // the Base64 data URL of the office photo
  response: modelResponse,
  properties: { domain: "office-furniture" },
});
console.log(`Evaluation Score: ${evaluation.score}`);
Over time, you might collect many of these evaluations (from different scenes, different arrangement requests, or even different models) and compare their performance.
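For example, you could tag each evaluation with the model that produced the response. A quick sketch (the model key and value here are illustrative conventions we choose, not something the API requires):
const taggedEvaluation = await mandoline.createEvaluation({
  metricId: layoutMetric.id,
  prompt: promptText,
  promptImage: officeImageUrl,
  response: modelResponse,
  // Record which model generated the response so we can compare models later
  properties: { domain: "office-furniture", model: "gpt-4o" },
});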
Step 5: Analyze Results Over Time
After running multiple rearrangement tasks (e.g., different office layouts, different model versions), you can retrieve your evaluations and see which setups worked best:
const allEvals = await mandoline.getEvaluations({
  metricId: layoutMetric.id,
});

let sumScores = 0;
allEvals.forEach((ev) => {
  console.log(
    `Eval ID: ${ev.id}, Score: ${ev.score}, Model: ${ev.properties?.model}`,
  );
  sumScores += ev.score;
});

const averageScore = sumScores / allEvals.length;
console.log(`Average Rearrangement Plan Score: ${averageScore.toFixed(2)}`);
If you used the properties field to store additional metadata, like properties.model or properties.version, you can compare model versions or analyze which prompts yield the best spatial arrangements.
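For instance, here's a minimal sketch that groups scores by the model name stored in properties.model (assuming you recorded that field when creating each evaluation):
// Group evaluation scores by the model recorded in properties.model
const scoresByModel = new Map<string, number[]>();
for (const ev of allEvals) {
  const model = (ev.properties?.model as string) ?? "unknown";
  const scores = scoresByModel.get(model) ?? [];
  scores.push(ev.score);
  scoresByModel.set(model, scores);
}

// Report the average score per model
for (const [model, scores] of scoresByModel) {
  const average = scores.reduce((a, b) => a + b, 0) / scores.length;
  console.log(`${model}: ${average.toFixed(2)} (${scores.length} evaluations)`);
}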
Conclusion
In this tutorial, you've learned how to evaluate multimodal LLMs using Mandoline's vision pipeline. This approach allows you to:
- Create metrics that capture visual reasoning capabilities, from simple object labeling to complex spatial planning
- Evaluate how well models integrate text and image inputs in their responses
- Verify that your AI system actually understands what it sees in images and that its suggestions are practical and useful
As LLMs evolve to handle both text and images, Mandoline helps you evaluate how they understand and reason about visual information in ways that matter to your users – whether that's analyzing technical images, following visual instructions, planning spatial arrangements, or more.
Next Steps
- Try evaluating with different types of spaces and layout challenges
- Add metrics for other aspects of visual understanding, like lighting or ergonomics (see the sketch after this list)
- Explore our Prompt Engineering tutorial to improve your visual prompts
- Check out our Model Selection tutorial to compare how different LLMs handle visual tasks
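For instance, a sketch of one additional metric (the name and description are illustrative - define whatever matters for your use case):
const ergonomicsMetric = await mandoline.createMetric({
  name: "Ergonomic Awareness",
  description:
    "Measures whether suggested layouts account for ergonomics, such as monitor glare, desk spacing, and natural light",
  tags: ["vision", "ergonomics"],
});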
The key to building useful multimodal systems is thinking about how users will actually use your assistant in the real world, mapping those needs to relevant metrics, and evaluating performance using realistic scenarios that match your actual use case.