What's the Best LLM for Coding?

TL;DR: We evaluated 14 LLMs on real engineering tasks from our sprint backlog. o3 performed best on high-complexity, low-frequency tasks. o3-mini handled difficult tasks at scale. Gemini 2.5 Flash was a good fit for documentation and high-volume work. For full rankings, see our coding leaderboard. These results reflect our codebase and preferences, so your "best" may differ!

Problem

After years of coding with LLMs, we've experienced three recurring pain points:

  • Pattern Adherence: Models ignore existing architectural patterns and implement naive solutions
  • Scope Discipline: Models make unsolicited refactors, turning simple tasks into sprawling diffs
  • Comment Quality: Models write verbose comments that restate what's obvious from the code

Standard benchmarks don't capture these real-world software engineering concerns.

Solution

So, we ran our own experiments to find the best models for our workflow and preferences.

We used our Best LLM For tool to evaluate 14 leading models against real sprint tickets from our internal codebases (tasks like "optimize database connections after Supabase migration").

We evaluated each model on our three pain points using custom metrics, scoring responses from -1.0 to 1.0.
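To make the scoring setup concrete, here is a minimal sketch of how per-response scores on the three pain points could be recorded and summarized for one model. It is not the Best LLM For tool's implementation: the ResponseScore class, its field names, and the unweighted per-metric mean are all assumptions made for illustration. Only the three metrics and the -1.0 to 1.0 range come from the description above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ResponseScore:
    """Scores for one model response (hypothetical structure)."""

    pattern_adherence: float
    scope_discipline: float
    comment_quality: float

    def __post_init__(self):
        # Reject anything outside the -1.0 to 1.0 range used for scoring.
        for name, value in vars(self).items():
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside [-1.0, 1.0]")


def summarize(scores: list[ResponseScore]) -> dict[str, float]:
    """Average each metric over all scored responses for one model.

    The unweighted mean is an assumption; other aggregations are possible.
    """
    if not scores:
        raise ValueError("need at least one scored response")
    return {
        "pattern_adherence": mean(s.pattern_adherence for s in scores),
        "scope_discipline": mean(s.scope_discipline for s in scores),
        "comment_quality": mean(s.comment_quality for s in scores),
    }


# Illustrative values only, not results from our evaluation.
example = [ResponseScore(1.0, 0.5, -0.5), ResponseScore(0.0, 0.5, 0.5)]
summary = summarize(example)
```

Keeping the three metrics separate, rather than collapsing them into one number right away, makes it easier to see where a model is strong or weak.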

Results

Top performers:

  • o3 for high-complexity, low-frequency tasks
  • o3-mini for difficult tasks at scale
  • Gemini 2.5 Flash for documentation and high-volume work

To see all evaluated models ranked, check out our coding leaderboard.

One Caveat...

The "best" LLM for coding is the one that performs best on data from your work environment, with respect to the behaviors you care about.
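A small sketch shows why the answer depends on what you care about. The model names, scores, and weights below are made up for illustration, and the rank_models helper is hypothetical rather than part of any tool. Given the same per-metric scores, changing the weights changes which model comes out on top.

```python
def rank_models(summaries, weights):
    """Return model names sorted by weighted average score, best first.

    summaries: {model_name: {metric_name: score in [-1.0, 1.0]}}
    weights:   {metric_name: non-negative importance}
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    def weighted(metrics):
        return sum(metrics[name] * w for name, w in weights.items()) / total

    return sorted(summaries, key=lambda m: weighted(summaries[m]), reverse=True)


# Hypothetical scores for two placeholder models, not real results.
summaries = {
    "model_a": {"pattern_adherence": 0.8, "scope_discipline": 0.2, "comment_quality": 0.1},
    "model_b": {"pattern_adherence": 0.3, "scope_discipline": 0.7, "comment_quality": 0.6},
}

# A team that mostly cares about following existing patterns.
pattern_first = rank_models(
    summaries,
    {"pattern_adherence": 3, "scope_discipline": 1, "comment_quality": 1},
)

# A team that mostly cares about small, focused diffs.
scope_first = rank_models(
    summaries,
    {"pattern_adherence": 1, "scope_discipline": 3, "comment_quality": 1},
)
```

With these placeholder numbers, model_a ranks first under the pattern-heavy weights and model_b ranks first under the scope-heavy weights.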

Our results reflect our specific coding environment and priorities. Your results may differ!

To run custom evals against your own tasks, using your own custom metrics, give the Best LLM For tool a try.
