
Multimodal Language Model Evaluation: A Creative Coding Challenge

January 27, 2025 (Updated February 25, 2025)

Introduction

Creative code-based art challenges offer a way to explore and test how language models solve problems requiring both technical and visual thinking.

Recently, we challenged several leading LLMs (Claude 3.7 Sonnet, DeepSeek-R1, Claude 3.5 Sonnet, Gemini 1.5 Flash, and o1) to produce original code-based digital art as part of Genuary, a month-long event where participants create code art based on daily prompts.

Specifically, we tasked each model with completing the January 27 prompt: "Make something interesting with no randomness or noise or trig."

We then evaluated each submission against three user-focused visual metrics, using Mandoline's multimodal evaluation capabilities.

Below we discuss our experiment and share key insights on using user-relevant evaluation metrics in creative contexts.

Experiment

We tested the following leading multimodal language models:

  • Claude 3.7 Sonnet (with and without Extended Thinking)
  • DeepSeek-R1
  • Claude 3.5 Sonnet
  • Gemini 1.5 Flash
  • o1

Each model received identical instructions, which included:

  • The code-based art challenge
  • A request to brainstorm approaches
  • Implementation instructions (output must be a single, self-contained HTML file)
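
For illustration, here is the kind of output format we asked for: a single, self-contained HTML file that respects the constraint. This is a minimal sketch of our own, not one of the models' actual entries; it draws a deterministic pattern using only integer and modular arithmetic, with no randomness, noise, or trigonometry.

```html
<!DOCTYPE html>
<!-- Minimal illustrative sketch (not a model submission): a single,
     self-contained HTML file that uses only integer and modular
     arithmetic: no randomness, no noise functions, no trigonometry. -->
<html>
  <body style="margin:0">
    <canvas id="art" width="512" height="512"></canvas>
    <script>
      const ctx = document.getElementById("art").getContext("2d");
      const cell = 4; // pixel size of each grid cell (128 x 128 grid)
      for (let y = 0; y < 128; y++) {
        for (let x = 0; x < 128; x++) {
          // Deterministic "interference" of two modular sequences
          const v = (x * x + y * y) % 255;
          const hue = (x * y) % 360;
          ctx.fillStyle = `hsl(${hue}, 70%, ${30 + (v % 50)}%)`;
          ctx.fillRect(x * cell, y * cell, cell, cell);
        }
      }
    </script>
  </body>
</html>
```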

We then evaluated each response using three metrics:

  1. Conceptual Depth: Does the code exhibit thoughtful, novel, and thematically interesting logic?
  2. Creative Originality: Is the approach unusual or surprising within the constraints?
  3. Aesthetic Impact: How visually compelling or striking is the result?

Each metric is scored from -1.0 to 1.0, with higher scores indicating better performance.
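
To make the setup concrete, here is a rough sketch of how such a per-metric evaluation loop might be structured. The metric names and descriptions mirror the list above, but the evaluateWithMandoline helper, its signature, and the idea of passing both the HTML source and a rendered screenshot are assumptions made for illustration; this is not Mandoline's actual SDK.

```js
// Sketch of a per-metric scoring loop. The metric definitions come from the
// post above; evaluateWithMandoline is a hypothetical placeholder, not the
// real Mandoline SDK.
const metrics = [
  {
    name: "Conceptual Depth",
    description: "Does the code exhibit thoughtful, novel, and thematically interesting logic?",
  },
  {
    name: "Creative Originality",
    description: "Is the approach unusual or surprising within the constraints?",
  },
  {
    name: "Aesthetic Impact",
    description: "How visually compelling or striking is the result?",
  },
];

// Stand-in for a real multimodal evaluation call; returns a neutral score.
async function evaluateWithMandoline({ metric, code, image }) {
  // In practice this would send the metric, the HTML source, and the rendered
  // image to an evaluator and return its judgment in [-1.0, 1.0].
  return 0.0;
}

async function scoreSubmission(modelName, htmlSource, screenshotPng) {
  const scores = {};
  for (const metric of metrics) {
    // Each (submission, metric) pair gets a score in [-1.0, 1.0]; higher is better.
    scores[metric.name] = await evaluateWithMandoline({
      metric,
      code: htmlSource,     // the model's self-contained HTML file
      image: screenshotPng, // its rendered output, for the visual metrics
    });
  }
  return { model: modelName, scores };
}
```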

Results

Here's how each model responded to the challenge:

Claude 3.7 Sonnet (Extended Thinking)

DeepSeek-R1

Claude 3.5 Sonnet

Gemini 1.5 Flash

o1

Claude 3.7 Sonnet

Evaluations

The models scored as follows across our three metrics:

| Model | Conceptual Depth | Creative Originality | Aesthetic Impact | Approach |
| Claude 3.7 Sonnet (Extended Thinking) | 0.70 | 0.84 | 0.87 | Number theory with emergent complexity |
| DeepSeek-R1 | 0.28 | 0.26 | 0.70 | Modular arithmetic with phase-shifted color transitions |
| Claude 3.5 Sonnet | 0.25 | 0.26 | 0.25 | Fibonacci sequences with "crystalline" expansions |
| Gemini 1.5 Flash | 0.30 | 0.16 | 0.10 | Cellular automata based on neighbor rules |
| o1 | 0.10 | -0.10 | -0.10 | Pixel-based color evolution system |
| Claude 3.7 Sonnet | 0.00 | -0.20 | -0.60 | Non-functional result |

Claude 3.7 Sonnet with extended thinking outperformed all other models on this task. In contrast, "normal" Claude 3.7 Sonnet produced a non-functional result that scored poorly across all metrics. This highlights the impact that test-time compute can have on model performance.

DeepSeek-R1, the top scorer before this update, used a modular arithmetic approach with phase-shifted color transitions, which earned it a higher aesthetic impact score (0.70) than most other models.

Claude 3.5 Sonnet and Gemini 1.5 Flash showed comparable conceptual depth (0.25 and 0.30) with their respective Fibonacci and cellular-automata approaches, but neither matched DeepSeek-R1's visual appeal.

o1's pixel-based system scored lowest despite its technical sophistication, highlighting that complex implementations don't necessarily yield better creative results.

Note that these scores reflect performance on this specific creative coding task; additional evaluations across more tasks are needed before making any general capability claims.

Conclusion

When we give language models open-ended, creative tasks, we learn something valuable about how they solve problems. Each model handled this challenge differently, offering some insight into their respective strengths and limitations.

For tasks like generative art, what matters most is the end result: how good it looks. Technical polish such as memory efficiency or code optimization matters less than the quality of the resulting visual design.

This is why we're building structured ways to evaluate AI beyond purely technical metrics, looking at qualities like conceptual thinking, creativity, and visual design. Standard benchmarks often miss what matters in real applications.

If you're interested in similar challenges, please reach out at support@mandoline.ai.
