Prompt Engineering: Reducing Unwanted LLM Behaviors

Imagine you've built an app for learning about historical events. You've fine-tuned an open-source LLM to power the product's core interactive chat.

However, you've received some concerning user feedback. Users are frustrated by the model's tendency to lecture them on ethical matters, whether or not they asked for that perspective. This is particularly problematic when users are trying to learn about complex or nuanced historical topics.

In this tutorial, you'll learn how to use Mandoline to measure and reduce this unwanted behavior through prompt engineering.

Note: this tutorial is also available as a ready-to-run script in both Node.js and Python.

What You'll Learn

  • How to create a custom metric for evaluating LLM responses
  • How to test different prompt structures
  • How to analyze results to improve your LLM's conversational style

Prerequisites

Before starting, make sure you have:

  • A Mandoline account and an API key
  • A Node.js environment (this tutorial's examples use the TypeScript SDK)

Step 1: Set Up Your Experiment

First, install Mandoline:

npm install mandoline

Then, set up your Mandoline client:

import { Mandoline } from "mandoline";
 
const mandoline = new Mandoline();

Note: the client reads your Mandoline API key from the MANDOLINE_API_KEY environment variable.
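
If you'd rather not rely on the environment variable, you can pass the key explicitly when constructing the client. A minimal sketch, assuming the constructor accepts an apiKey option (check the SDK reference for the exact shape):

// Assumption: the constructor accepts an apiKey option.
const mandoline = new Mandoline({ apiKey: process.env.MANDOLINE_API_KEY });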

Step 2: Create a Use-Case Specific Metric

Let's create a metric to measure moralistic language:

const metric = await mandoline.createMetric({
  name: "Moralistic Tendency",
  description:
    "Assesses how frequently the model adopts a moralistic tone or attempts to lecture users on ethical matters.",
  tags: ["tone", "personality", "user_experience"],
});

This metric directly targets the frustration you identified through user feedback. The returned metric includes an id, which you'll reference when creating evaluations in the next step.

Step 3: Test Different Prompts

Now, let's test different prompt structures against a series of controversial historical events:

async function testPrompt(template: string, event: string) {
  const prompt = template.replace("{event}", event);
  // `yourLLM` is a placeholder for your fine-tuned model's client.
  const response = await yourLLM.generate(prompt);
 
  return mandoline.createEvaluation({
    metricId: metric.id,
    prompt,
    response,
    properties: { template, event },
  });
}
 
const events = [
  "The use of atomic bombs in World War II",
  "The Industrial Revolution",
  // Add more events...
];
 
const promptTemplates = [
  "Discuss the historical event: {event}",
  "Provide an objective overview of: {event}",
  "Describe the facts surrounding: {event}",
  "Outline key points of: {event} without moral judgment",
  // Add more templates...
];
 
const results = await Promise.all(
  events.flatMap((event) =>
    promptTemplates.map((template) => testPrompt(template, event)),
  ),
);

Note: The properties field stores information about your experiment, which will help with later analysis.
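
One caveat: Promise.all sends every request at once, which can exceed your LLM provider's rate limits. A sequential variant trades speed for predictability:

// Sequential alternative: evaluate one prompt at a time.
const sequentialResults: Awaited<ReturnType<typeof testPrompt>>[] = [];
for (const event of events) {
  for (const template of promptTemplates) {
    sequentialResults.push(await testPrompt(template, event));
  }
}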

Step 4: Analyze the Results

Let's dig deeper into our data:

// Overall moralistic tendency
const avgScore = results.reduce((sum, r) => sum + r.score, 0) / results.length;
console.log(`Average Moralistic Tendency: ${avgScore.toFixed(2)}`);
 
// Moralistic tendency by event
// `groupBy` groups by a dotted property path; see the helper below.
const eventScores = groupBy(results, "properties.event");
Object.entries(eventScores).forEach(([event, evals]) => {
  const eventAvg = evals.reduce((sum, e) => sum + e.score, 0) / evals.length;
  console.log(`${event}: ${eventAvg.toFixed(2)}`);
});
 
// Best prompt structure (lowest average score = least moralistic)
const promptScores = groupBy(results, "properties.template");
const bestPrompt = Object.entries(promptScores)
  .map(([template, evals]) => ({
    template,
    avgScore: evals.reduce((sum, e) => sum + e.score, 0) / evals.length,
  }))
  .reduce((best, current) =>
    current.avgScore < best.avgScore ? current : best,
  );
 
console.log(`Best prompt: ${bestPrompt.template}`);
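
The groupBy used above is a generic utility, not something the snippet imports. Lodash's groupBy supports dotted paths like "properties.event", or you can define a minimal version yourself:

// Minimal groupBy supporting dotted paths such as "properties.event".
function groupBy<T>(items: T[], path: string): Record<string, T[]> {
  const get = (obj: unknown) =>
    path.split(".").reduce<any>((o, key) => o?.[key], obj);
  return items.reduce<Record<string, T[]>>((acc, item) => {
    const key = String(get(item));
    (acc[key] ??= []).push(item);
    return acc;
  }, {});
}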

This analysis helps you understand:

  • How moralistic your LLM's responses are overall
  • Which events trigger more moralistic responses
  • Which prompt structures lead to more balanced responses (see the per-topic sketch below)
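
Because each evaluation stored both template and event in properties, you can also cross-tabulate the two to check whether the best template varies by topic. A quick sketch:

// Average score per (template, event) pair, to spot topic-specific wins.
const pairScores = new Map<string, { sum: number; n: number }>();
for (const r of results) {
  const key = `${r.properties.template} | ${r.properties.event}`;
  const agg = pairScores.get(key) ?? { sum: 0, n: 0 };
  agg.sum += r.score;
  agg.n += 1;
  pairScores.set(key, agg);
}
for (const [key, { sum, n }] of pairScores) {
  console.log(`${key}: ${(sum / n).toFixed(2)}`);
}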

Step 5: Refine Your Approach

Based on these insights, you can now:

  1. Understand which topics trigger more moralistic responses
  2. Identify effective prompt structures for reducing moralistic tendencies
  3. Improve your LLM application to meet users' preferences for objective historical discussions, as sketched below
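
For example, once the experiment settles on a winning template, you might route every user question through it. A minimal sketch, reusing bestPrompt from Step 4 and the yourLLM placeholder from Step 3 (the topic string is just an example):

// Wrap each user topic in the best-performing template before generation.
function buildPrompt(userTopic: string): string {
  return bestPrompt.template.replace("{event}", userTopic);
}

const answer = await yourLLM.generate(buildPrompt("The Space Race"));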

Conclusion

You've now used Mandoline to:

  1. Create a custom metric targeting a specific user frustration
  2. Test different prompt structures to address this issue
  3. Analyze results to improve your LLM's responses

This process helps you act directly on user feedback about unwanted moralistic tendencies, which should translate into a better experience for your customers.

Next Steps

  • Apply this process to other aspects of your app, perhaps creating other user-centric metrics
  • Use Mandoline to track your LLM's performance over time as you implement changes
  • Explore our Model Selection tutorial to learn how to compare different LLMs for your use case

By periodically refining your prompts and monitoring performance with Mandoline, you can ensure your app provides the objective, informative experience your users want.