
Model Selection: Comparing LLMs for Creative Tasks

Suppose you're building a creative brainstorming app. You think LLMs could help users generate creative ideas through divergent thinking.

But which LLM is best for this task? "Creativity" isn't a single quality, and different models may excel at different aspects of divergent thinking.

In this tutorial, we'll show you how to use Mandoline to compare the performance of OpenAI's GPT-4 and Anthropic's Claude. We'll evaluate them on various aspects of creative thinking to help you make an informed decision.

Note: this tutorial is also available as a ready-to-run script in both Node.js and Python.

What You'll Learn

  • How to define custom metrics for LLM evaluation
  • How to run a systematic comparison between different models
  • How to analyze results to inform model selection

Prerequisites

Before starting, make sure you have:

  • A Mandoline account and API key
  • An OpenAI API key and an Anthropic API key
  • A recent version of Node.js and npm installed

Step 1: Set Up Your Experiment

First, install the needed packages:

npm install mandoline openai @anthropic-ai/sdk

Now, initialize each client:

import { Mandoline } from "mandoline";
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";
 
const mandoline = new Mandoline();
const openai = new OpenAI();
const anthropic = new Anthropic();

Note: this assumes all API keys are set as environment variables, so each client can be constructed without arguments.
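If you want to fail fast when a key is missing, you can add a quick check. The MANDOLINE_API_KEY variable name is an assumption here (check your Mandoline configuration for the exact name); OPENAI_API_KEY and ANTHROPIC_API_KEY are the defaults read by the OpenAI and Anthropic SDKs.

// Optional: verify the expected environment variables are present.
// MANDOLINE_API_KEY is an assumed name -- confirm it against your Mandoline setup.
for (const key of ["MANDOLINE_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY"]) {
  if (!process.env[key]) {
    throw new Error(`Missing environment variable: ${key}`);
  }
}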

Step 2: Define Metrics

Let's create metrics to evaluate several different aspects of creative thinking:

// Helper function to create a metric
const createMetric = async (name: string, description: string) => {
  return await mandoline.createMetric({ name, description });
};
 
// Create metrics for evaluation
const metrics = await Promise.all([
  createMetric(
    "Conceptual Leap",
    "Assesses the model's ability to generate unconventional ideas.",
  ),
  createMetric(
    "Contextual Reframing",
    "Measures how the model approaches problems from different perspectives.",
  ),
  createMetric(
    "Idea Synthesis",
    "Evaluates the model's capacity to connect disparate concepts.",
  ),
  createMetric(
    "Constraint Navigation",
    "Examines how the model handles limitations creatively.",
  ),
  createMetric(
    "Metaphorical Thinking",
    "Looks at the model's use of figurative language to explore ideas.",
  ),
]);

These metrics will help us understand LLM performance across the various aspects relevant to our use case.
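Each createMetric call returns a metric object; its id is what we'll pass when recording evaluations in the next steps. If you'd like to confirm everything was created, you can log the ids (printing the name here assumes the API echoes it back):

// Sanity check: list the ids of the metrics we just created
metrics.forEach((metric) => {
  console.log(metric.id, metric.name);
});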

Step 3: Generate Responses

Now, let's create a function to get responses from both models:

async function generateIdeas(
  prompt: string,
  model: "gpt-4" | "claude",
): Promise<string> {
  if (model === "gpt-4") {
    // Generate ideas using GPT-4
    const completion = await openai.chat.completions.create({
      messages: [{ role: "user", content: prompt }],
      model: "gpt-4o-2024-08-06",
    });
    return completion.choices[0].message.content || "";
  } else if (model === "claude") {
    // Generate ideas using Claude
    const msg = await anthropic.messages.create({
      model: "claude-3-5-sonnet-20240620",
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    });
    // The first content block should be text; guard against other block types
    const block = msg.content[0];
    return block.type === "text" ? block.text : "";
  }
  throw new Error("Unsupported model");
}

This function takes a prompt and a model name, then returns the generated ideas as a string.
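For example (the prompt below is purely illustrative):

const sampleIdeas = await generateIdeas(
  "Design a holiday tradition for a city that never sleeps.",
  "claude",
);
console.log(sampleIdeas);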

Step 4: Evaluate Responses

Let's create a function to evaluate each response:

async function evaluateResponse(
  metric: { id: string },
  prompt: string,
  response: string,
  model: string,
) {
  // Create an evaluation in Mandoline for the given metric
  return await mandoline.createEvaluation({
    metricId: metric.id,
    prompt,
    response,
    properties: { model }, // Include the model name for later analysis
  });
}

This function creates an evaluation in Mandoline for a given metric.
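For instance, scoring a single Claude response against the "Conceptual Leap" metric from Step 2 might look like this. The prompt is illustrative, and we assume the created evaluation includes its numeric score, matching the evaluations we fetch in Step 6.

const ideaPrompt = "What would cities look like if streets were rivers?";
const claudeIdeas = await generateIdeas(ideaPrompt, "claude");
 
const evaluation = await evaluateResponse(
  metrics[0], // "Conceptual Leap"
  ideaPrompt,
  claudeIdeas,
  "claude",
);
console.log(evaluation.score);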

Step 5: Run Experiments

Now, let's compare the models:

async function runExperiment(prompt: string) {
  const models = ["gpt-4", "claude"] as const;
  const results: Record<string, any> = {};
 
  for (const model of models) {
    // Generate ideas using the current model
    const response = await generateIdeas(prompt, model);
 
    // Evaluate the response on all five metrics
    results[model] = {
      response,
      evaluations: await Promise.all(
        metrics.map((metric) =>
          evaluateResponse(metric, prompt, response, model),
        ),
      ),
    };
  }
 
  return results;
}
 
// Example prompt
const prompt =
  "If humans could photosynthesize like plants, how would our daily lives and global systems be different?";
 
// Run the experiment and log results
const experimentResults = await runExperiment(prompt);
console.log(JSON.stringify(experimentResults, null, 2));

This function runs the experiment for both models, generating responses and evaluating them on all five metrics.
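A single prompt is only one data point. To build a more reliable picture before moving on to analysis, run the experiment across a set of prompts (the list below is illustrative):

const brainstormingPrompts = [
  "If humans could photosynthesize like plants, how would our daily lives and global systems be different?",
  "Design a sport that could only be played in zero gravity.",
  "How would education change if memories could be transferred directly?",
];
 
for (const p of brainstormingPrompts) {
  const results = await runExperiment(p);
  console.log(JSON.stringify(results, null, 2));
}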

Step 6: Analyze Results

After running multiple experiments, analyze the results:

async function analyzeResults(metricId: string) {
  // Fetch evaluations for the given metric
  const evaluations = await mandoline.getEvaluations({ metricId });
 
  // Group evaluations by model
  const groupedByModel = groupBy(
    evaluations,
    (evaluation) => evaluation.properties.model,
  );
 
  // Calculate and display average scores for each model
  Object.entries(groupedByModel).forEach(([model, evals]) => {
    const avgScore =
      evals.reduce((sum, evaluation) => sum + evaluation.score, 0) /
      evals.length;
    console.log(`Average score for ${model}: ${avgScore.toFixed(2)}`);
  });
}
 
// Helper function to group evaluations by model
function groupBy<T>(arr: T[], key: (item: T) => string): Record<string, T[]> {
  return arr.reduce(
    (groups, item) => {
      const groupKey = key(item);
      if (!groups[groupKey]) {
        groups[groupKey] = [];
      }
      groups[groupKey].push(item);
      return groups;
    },
    {} as Record<string, T[]>,
  );
}
 
// Analyze results for each metric
for (const metric of metrics) {
  await analyzeResults(metric.id);
}

This analysis will show how GPT-4 and Claude compare across different dimensions of creative thinking.
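If you'd rather see everything side by side, a small variation collects the averages into one object and prints a metric-by-model table. This is a sketch reusing getEvaluations and the groupBy helper above; it assumes each metric object includes its name.

async function summarizeResults() {
  const summary: Record<string, Record<string, number>> = {};
 
  for (const metric of metrics) {
    // Fetch all evaluations recorded against this metric
    const evaluations = await mandoline.getEvaluations({ metricId: metric.id });
    const grouped = groupBy(evaluations, (evaluation) => evaluation.properties.model);
 
    // Average the scores per model for this metric
    summary[metric.name] = Object.fromEntries(
      Object.entries(grouped).map(([model, evals]) => [
        model,
        evals.reduce((sum, evaluation) => sum + evaluation.score, 0) / evals.length,
      ]),
    );
  }
 
  console.table(summary); // rows are metrics, columns are models
}
 
await summarizeResults();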

Conclusion

You've now set up a system to compare LLMs for your specific use case. This approach allows you to:

  1. Create custom metrics for evaluating LLM performance
  2. Systematically evaluate responses from different models
  3. Analyze performance across various dimensions

By repeating this process with different prompts and analyzing the results, you can:

  • Identify strengths and weaknesses of each model
  • Refine prompts to get better results
  • Make informed decisions about which LLM to use for your task

By using Mandoline to evaluate AI models, you can choose the best LLM for your creative tasks based on real data. This helps you build AI-powered apps that better meet your users' needs.

Next Steps

  • Try more prompts to get a fuller picture of each model's strengths
  • Use Mandoline to keep track of how models improve over time
  • Check out our Prompt Engineering tutorial to learn how to get even better results from your chosen model

Remember, the best model for you depends on your specific use case. Keep testing and measuring to find the right fit for your project.