
Multimodal Language Model Evaluation: A Creative Coding Challenge

January 27, 2025

Introduction

Creative code-based art challenges offer a way to explore and test how language models solve problems requiring both technical and visual thinking.

Recently, we challenged four leading LLMs (DeepSeek-R1, Claude 3.5 Sonnet, Gemini 1.5 Flash, and o1) to produce original code-based digital art as part of Genuary, a month-long event where participants create code art based on daily prompts.

Specifically, we tasked each model with completing the January 27 prompt: "Make something interesting with no randomness or noise or trig."

We then evaluated each submission against three user-focused visual metrics, using Mandoline's multimodal evaluation capabilities.

Below we discuss our experiment and share key insights on using user-relevant evaluation metrics in creative contexts.

Experiment

We tested four leading multimodal language models:

  • DeepSeek-R1
  • Claude 3.5 Sonnet
  • Gemini 1.5 Flash
  • o1

Each model was prompted using identical instructions, which included:

  • The code-based art challenge
  • A request to brainstorm approaches
  • Implementation instructions (output must be a single, self-contained HTML file)
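
To make the constraints concrete, here is a minimal sketch (our own illustration, not one of the evaluated submissions) of what a compliant entry could look like: a single self-contained HTML file that draws a deterministic pattern using only integer arithmetic, with no Math.random, no noise functions, and no trigonometry.

```html
<!DOCTYPE html>
<html>
  <body>
    <canvas id="art" width="512" height="512"></canvas>
    <script>
      // Deterministic per-pixel colors from bitwise/arithmetic operations on
      // the pixel coordinates: no randomness, no noise, no trig.
      const ctx = document.getElementById("art").getContext("2d");
      const img = ctx.createImageData(512, 512);
      for (let y = 0; y < 512; y++) {
        for (let x = 0; x < 512; x++) {
          const i = (y * 512 + x) * 4;
          const v = (x ^ y) & 255;          // XOR texture
          img.data[i] = v;                  // red
          img.data[i + 1] = (x + y) & 255;  // green
          img.data[i + 2] = 255 - v;        // blue
          img.data[i + 3] = 255;            // fully opaque
        }
      }
      ctx.putImageData(img, 0, 0);
    </script>
  </body>
</html>
```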

We then evaluated each response using three metrics:

  1. Conceptual Depth: Does the code exhibit thoughtful, novel, and thematically interesting logic?
  2. Creative Originality: Is the approach unusual or surprising within the constraints?
  3. Aesthetic Impact: How visually compelling or striking is the result?

Each metric is scored from -1.0 to 1.0, with higher scores indicating better performance.
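
As a rough illustration of what one of these evaluations pairs together, the sketch below combines a metric definition with a model's code and rendered output and attaches a score. The field names are hypothetical and are not Mandoline's actual API schema.

```js
// Hypothetical shape of a single evaluation record. Field names are
// illustrative only, not Mandoline's actual API schema.
const evaluation = {
  metric: {
    name: "Aesthetic Impact",
    description: "How visually compelling or striking is the result?",
  },
  inputs: {
    prompt: "Make something interesting with no randomness or noise or trig.",
    code: "<!DOCTYPE html> ...", // the model's self-contained HTML file
    render: "artwork.png",       // screenshot of the rendered output
  },
  score: 0.7, // ranges from -1.0 (worst) to 1.0 (best)
};
```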

Results

Here's how each model responded to the challenge:

DeepSeek-R1

[artwork generated by DeepSeek-R1]

Claude 3.5 Sonnet

[artwork generated by Claude 3.5 Sonnet]

Gemini 1.5 Flash

[artwork generated by Gemini 1.5 Flash]

o1

[artwork generated by o1]

Evaluations

The models scored across our three metrics as follows:

Model             | Conceptual Depth | Creative Originality | Aesthetic Impact | Notable Approach
DeepSeek-R1       | 0.28             | 0.26                 | 0.70             | Modular arithmetic with phase-shifted color transitions
Claude 3.5 Sonnet | 0.25             | 0.26                 | 0.25             | Fibonacci sequences with "crystalline" expansions
Gemini 1.5 Flash  | 0.30             | 0.16                 | 0.10             | Cellular automata based on neighbor rules
o1                | 0.10             | -0.10                | -0.10            | Pixel-based color evolution system

DeepSeek-R1's modular arithmetic approach with phase-shifted color transitions led to significantly higher aesthetic impact (0.70) compared to other models.
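
The general technique can be sketched as follows. This is our illustration of the idea, not DeepSeek-R1's actual submission: each channel of a pixel's color is computed with modular arithmetic on the coordinates, and each channel gets a different phase offset that advances over time.

```html
<canvas id="c" width="400" height="400"></canvas>
<script>
  // Sketch of modular arithmetic with phase-shifted color transitions
  // (illustrative only). Each channel uses the same modular pattern with a
  // different phase offset; the offsets advance each animation frame.
  const ctx = document.getElementById("c").getContext("2d");
  const size = 400;
  let t = 0;
  function draw() {
    const img = ctx.createImageData(size, size);
    for (let y = 0; y < size; y++) {
      for (let x = 0; x < size; x++) {
        const i = (y * size + x) * 4;
        img.data[i]     = (x * y + t) % 256;       // red, phase 0
        img.data[i + 1] = (x * y + t + 85) % 256;  // green, phase-shifted
        img.data[i + 2] = (x * y + t + 170) % 256; // blue, phase-shifted
        img.data[i + 3] = 255;
      }
    }
    ctx.putImageData(img, 0, 0);
    t += 2;
    requestAnimationFrame(draw);
  }
  draw();
</script>
```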

Claude 3.5 Sonnet and Gemini 1.5 Flash showed similar conceptual strength (0.25 and 0.30, respectively) with their Fibonacci and cellular automata approaches, but neither matched DeepSeek-R1's visual appeal.
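
For reference, a deterministic cellular automaton in this spirit takes only a few lines. The sketch below (our illustration, not Gemini 1.5 Flash's submission) uses Rule 90, where each cell is the XOR of its previous-row neighbors, seeded from a single live cell rather than any random input.

```html
<canvas id="ca" width="400" height="200"></canvas>
<script>
  // Deterministic 1D cellular automaton (Rule 90), drawn row by row.
  // Each new cell is the XOR of its left and right neighbors in the
  // previous row; the initial row has a single live cell -- no randomness.
  const ctx = document.getElementById("ca").getContext("2d");
  const width = 400, rows = 200;
  let cells = new Array(width).fill(0);
  cells[Math.floor(width / 2)] = 1; // deterministic initial state
  for (let row = 0; row < rows; row++) {
    for (let x = 0; x < width; x++) {
      if (cells[x]) ctx.fillRect(x, row, 1, 1); // draw live cells
    }
    const next = new Array(width).fill(0);
    for (let x = 1; x < width - 1; x++) {
      next[x] = cells[x - 1] ^ cells[x + 1]; // Rule 90 update
    }
    cells = next;
  }
</script>
```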

o1's pixel-based system scored lowest despite its technical sophistication, highlighting that complex implementations don't necessarily yield better creative results.

Note: these scores reflect performance on this specific creative coding task; broader claims about model capabilities would require evaluation across more tasks.

Conclusion

When we give language models open-ended, creative tasks we learn something valuable about how they solve problems. Each model handled this challenge differently, providing some insight into their strengths and limitations.

For tasks like generative art, the end result is what matters most: how good it looks. Technical polish such as memory efficiency or code optimization matters less than the quality of the resulting visual design.

This is why we're building structured ways to evaluate AI beyond pure technical metrics, looking at qualities like conceptual thinking, creativity, and visual design. Standard benchmarks often miss what matters in real applications.

If you're interested in similar challenges, please reach out at support@mandoline.ai.
