Multimodal Language Model Evaluation: A Creative Coding Challenge
January 27, 2025
Introduction
Creative code-based art challenges offer a way to explore and test how language models solve problems requiring both technical and visual thinking.
Recently, we challenged four leading LLMs (DeepSeek-R1, Claude 3.5 Sonnet, Gemini 1.5 Flash, and o1) to produce original code-based digital art as part of Genuary, a month-long event where participants create code art based on daily prompts.
Specifically, we tasked each model with completing the January 27 prompt: "Make something interesting with no randomness or noise or trig."
We then evaluated each submission against three user-focused visual metrics, using Mandoline's multimodal evaluation capabilities.
Below we discuss our experiment and share key insights on using user-relevant evaluation metrics in creative contexts.
Experiment
We tested four leading multimodal language models:
- DeepSeek-R1
- Claude 3.5 Sonnet
- Gemini 1.5 Flash
- o1
Each model was prompted using identical instructions, which included:
- The code-based art challenge
- A request to brainstorm approaches
- Implementation instructions (output must be a single, self-contained HTML file)
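To make the output constraint concrete, here is a minimal sketch of the kind of single-file submission we asked for. The drawing logic (a brute-force pixel fill driven by integer arithmetic on the coordinates) is our own illustration, not any model's actual response:

```html
<!-- Minimal single-file sketch: no external assets, and no Math.random(),
     noise functions, or trig calls anywhere. Illustrative only. -->
<!DOCTYPE html>
<html>
  <body style="margin:0">
    <canvas id="c" width="512" height="512"></canvas>
    <script>
      const ctx = document.getElementById("c").getContext("2d");
      // Brute-force pixel fill: each channel is a simple integer expression
      // of the coordinates, so the image is fully deterministic.
      for (let y = 0; y < 512; y++) {
        for (let x = 0; x < 512; x++) {
          ctx.fillStyle = `rgb(${(x * y) % 256}, ${(x + y) % 256}, ${(x ^ y) % 256})`;
          ctx.fillRect(x, y, 1, 1);
        }
      }
    </script>
  </body>
</html>
```

Everything lives in one file, nothing is fetched over the network, and the piece stays within the "no randomness or noise or trig" constraint.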
We then evaluated each response using three metrics:
- Conceptual Depth: Does the code exhibit thoughtful, novel, and thematically interesting logic?
- Creative Originality: Is the approach unusual or surprising within the constraints?
- Aesthetic Impact: How visually compelling or striking is the result?
Each metric is scored from -1.0 to 1.0, with higher scores indicating better performance.
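Conceptually, the evaluation pairs each model's code and a screenshot of its rendered page with each metric and asks for a score in that range. The sketch below is hypothetical shorthand for that workflow; the `evaluate` callback stands in for a Mandoline multimodal evaluation request and is not the actual SDK interface:

```js
// Hypothetical sketch of the scoring workflow, not the Mandoline SDK.
// `evaluate(metric, submission)` stands in for a multimodal evaluation
// request that sees both the code and a screenshot of the rendered page
// and returns a score between -1.0 and 1.0.
const metrics = ["Conceptual Depth", "Creative Originality", "Aesthetic Impact"];
const models = ["DeepSeek-R1", "Claude 3.5 Sonnet", "Gemini 1.5 Flash", "o1"];

async function scoreAll(evaluate, submissions) {
  const results = {};
  for (const model of models) {
    results[model] = {};
    for (const metric of metrics) {
      // submissions[model] holds { html, screenshot } for that model's piece.
      results[model][metric] = await evaluate(metric, submissions[model]);
    }
  }
  return results;
}
```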
Results
Here's how each model responded to the challenge:
(Rendered pieces: DeepSeek-R1, Claude 3.5 Sonnet, Gemini 1.5 Flash, and o1.)
Evaluations
The models scored across our three metrics as follows:
| Model | Conceptual Depth | Creative Originality | Aesthetic Impact | Notable Approach |
| --- | --- | --- | --- | --- |
| DeepSeek-R1 | 0.28 | 0.26 | 0.70 | Modular arithmetic with phase-shifted color transitions |
| Claude 3.5 Sonnet | 0.25 | 0.26 | 0.25 | Fibonacci sequences with "crystalline" expansions |
| Gemini 1.5 Flash | 0.30 | 0.16 | 0.10 | Cellular automata based on neighbor rules |
| o1 | 0.10 | -0.10 | -0.10 | Pixel-based color evolution system |
DeepSeek-R1's modular arithmetic approach with phase-shifted color transitions earned a significantly higher Aesthetic Impact score (0.70) than the other models achieved.
Claude 3.5 Sonnet and Gemini 1.5 Flash showed similar conceptual strength (0.25 and 0.30) with their respective Fibonacci and cellular automata approaches, but couldn't match DeepSeek-R1's visual appeal.
o1's pixel-based system scored lowest despite its technical sophistication, highlighting that complex implementations don't necessarily yield better creative results.
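For a sense of what DeepSeek-R1's technique looks like in practice, here is an illustrative sketch of the general idea (our own reconstruction, not the model's actual submission): each color channel is a modular combination of the cell coordinates plus a per-channel phase offset that advances every frame, producing smooth color transitions with no randomness, noise, or trig.

```html
<!-- Illustrative reconstruction of modular arithmetic with phase-shifted
     color transitions; not DeepSeek-R1's actual code. -->
<!DOCTYPE html>
<html>
  <body style="margin:0">
    <canvas id="c" width="400" height="400"></canvas>
    <script>
      const ctx = document.getElementById("c").getContext("2d");
      const size = 400, cell = 8;
      let phase = 0; // advances each frame, shifting every cell's color

      function draw() {
        for (let y = 0; y < size; y += cell) {
          for (let x = 0; x < size; x += cell) {
            // Each channel mixes the coordinates with a different multiple
            // of the phase, so the channels cycle out of sync.
            const r = (x + phase) % 256;
            const g = (y + 2 * phase) % 256;
            const b = (x + y + 3 * phase) % 256;
            ctx.fillStyle = `rgb(${r}, ${g}, ${b})`;
            ctx.fillRect(x, y, cell, cell);
          }
        }
        phase = (phase + 2) % 256;
        requestAnimationFrame(draw);
      }
      draw();
    </script>
  </body>
</html>
```

Because each channel cycles at a different rate, the palette keeps shifting rather than simply scrolling, even though the math is fully deterministic.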
Note: these scores reflect performance on this specific creative coding task; additional evaluation across more tasks would be needed for broader capability claims.
Conclusion
When we give language models open-ended, creative tasks, we learn something valuable about how they solve problems. Each model handled this challenge differently, offering some insight into its strengths and limitations.
For tasks like generative art, the end result (how good it looks) is what matters most. Technical polish such as memory efficiency or code optimization matters far less than the quality of the resulting visual design.
This is why we're building structured ways to evaluate AI beyond purely technical metrics, looking at qualities like conceptual thinking, creativity, and visual design. Standard benchmarks often miss what matters in real applications.
If you're interested in similar challenges, please reach out at support@mandoline.ai.
Next Steps
- Check our Getting Started guide
- Explore our Multimodal Evaluation tutorial
- For more on creative applications, check out our model selection guide