What's the Best LLM for Coding?

TL;DR: We evaluated 14 LLMs on real engineering tasks from our sprint backlog. o3 performed best on high-complexity, low-frequency tasks. o3-mini is our pick for difficult tasks at scale. Gemini 2.5 Flash is a good fit for documentation and high-volume work. For full rankings, see our coding leaderboard. These results reflect our codebase and preferences, so your "best" may differ!

Problem

After years of coding with LLMs, we've experienced three recurring pain points:

Pattern Adherence: Models ignore existing architectural patterns and implement naive solutions
Scope Discipline: Models make unsolicited refactors, turning simple tasks into sprawling diffs
Comment Quality: Models write verbose comments that restate what's obvious from the code

Standard benchmarks don't capture these real-world software engineering concerns.

Solution

So, we ran our own experiments to find the best models for our workflow and preferences.

We used our Best LLM For tool to evaluate 14 leading models against real sprint tickets from our internal codebases (tasks like "optimize database connections after Supabase migration").

We evaluated each model on our three pain points using custom metrics, scoring responses from -1.0 to 1.0.
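As a rough illustration of how scores like these could be rolled up into a ranking, here is a minimal Python sketch. It is not the Best LLM For tool's implementation: the metric names, the `Score` record, the `rank_models` helper, and the placeholder models and values are all assumptions made up for this example. The only detail taken from our setup is the -1.0 to 1.0 range.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical metric names, one per pain point described above.
METRICS = ("pattern_adherence", "scope_discipline", "comment_quality")


@dataclass(frozen=True)
class Score:
    model: str
    ticket: str
    metric: str
    value: float  # must lie in [-1.0, 1.0]


def validate(score: Score) -> Score:
    """Reject unknown metrics and out-of-range values."""
    if score.metric not in METRICS:
        raise ValueError(f"unknown metric: {score.metric}")
    if not -1.0 <= score.value <= 1.0:
        raise ValueError(f"score out of range: {score.value}")
    return score


def rank_models(scores):
    """Average each model's scores per metric, then rank models
    by the mean of those per-metric averages (highest first)."""
    per_model = {}
    for s in map(validate, scores):
        per_model.setdefault(s.model, {}).setdefault(s.metric, []).append(s.value)
    overall = {
        model: mean(mean(values) for values in by_metric.values())
        for model, by_metric in per_model.items()
    }
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)


# Placeholder data only; these are not results from our evaluation.
scores = [
    Score("model_a", "ticket-1", "pattern_adherence", 0.5),
    Score("model_a", "ticket-1", "scope_discipline", -0.5),
    Score("model_a", "ticket-1", "comment_quality", 0.0),
    Score("model_b", "ticket-1", "pattern_adherence", 1.0),
    Score("model_b", "ticket-1", "scope_discipline", 0.5),
    Score("model_b", "ticket-1", "comment_quality", 0.0),
]
ranking = rank_models(scores)
```

Averaging per metric first keeps a metric with many scored tickets from drowning out one with few. If one behavior matters more to you than the others, a weighted mean would be a natural variation.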

Results

Top performers:

o3 for high-complexity, low-frequency tasks
o3-mini for difficult tasks at scale
Gemini 2.5 Flash for documentation and high-volume work

To see all evaluated models ranked, check out our coding leaderboard.

One Caveat...

The "best" LLM for coding is the one that performs best on data from your work environment, with respect to the behaviors you care about.

Our results reflect our specific coding environment and priorities. Your results may differ!

To run custom evals against your own tasks, using your own custom metrics, give the Best LLM For tool a try.
