
What's the Best LLM for Coding?

July 21, 2025

We evaluated 14 top LLMs on real sprint tickets, measuring three pain points: Pattern Adherence (architectural thinking), Scope Discipline (staying focused), and Comment Quality (useful documentation). Here are the results.

For more details on our eval methodology, see our What's the Best LLM for Coding? post.

Leaderboard

| Model | Overall | Pattern Adherence | Scope Discipline | Comment Quality |
| --- | --- | --- | --- | --- |
| o3 medium | 0.53 ± 0.05 | 0.54 ± 0.04 | 0.66 ± 0.09 | 0.39 ± 0.12 |
| o4-mini medium | 0.48 ± 0.05 | 0.43 ± 0.07 | 0.64 ± 0.10 | 0.36 ± 0.11 |
| o3-mini medium | 0.46 ± 0.05 | 0.39 ± 0.12 | 0.72 ± 0.04 | 0.26 ± 0.10 |
| Gemini 2.5 Pro Preview | 0.43 ± 0.06 | 0.41 ± 0.08 | 0.50 ± 0.12 | 0.39 ± 0.10 |
| o1 medium | 0.42 ± 0.07 | 0.32 ± 0.16 | 0.67 ± 0.07 | 0.29 ± 0.08 |
| Gemini 2.5 Flash Preview | 0.42 ± 0.07 | 0.44 ± 0.09 | 0.47 ± 0.15 | 0.35 ± 0.12 |
| Claude 4 Sonnet | 0.41 ± 0.07 | 0.50 ± 0.05 | 0.58 ± 0.10 | 0.16 ± 0.17 |
| Claude 4 Opus Thinking | 0.39 ± 0.08 | 0.48 ± 0.08 | 0.47 ± 0.13 | 0.22 ± 0.17 |
| Grok 3 Beta | 0.38 ± 0.07 | 0.34 ± 0.11 | 0.62 ± 0.11 | 0.19 ± 0.13 |
| Claude 4 Opus | 0.34 ± 0.08 | 0.34 ± 0.17 | 0.51 ± 0.12 | 0.17 ± 0.12 |
| Claude 4 Sonnet Thinking | 0.34 ± 0.07 | 0.35 ± 0.11 | 0.48 ± 0.13 | 0.17 ± 0.11 |
| Grok 2 | 0.22 ± 0.08 | 0.18 ± 0.13 | 0.53 ± 0.15 | -0.05 ± 0.16 |
| Codestral | 0.21 ± 0.06 | 0.17 ± 0.13 | 0.65 ± 0.05 | -0.18 ± 0.09 |
| Mistral Large | 0.14 ± 0.09 | 0.10 ± 0.15 | 0.52 ± 0.16 | -0.21 ± 0.14 |

Note: Scores range from -1.0 to 1.0 (higher is better). The "Overall" column is the mean of the three metric scores.
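
As a concrete illustration, the sketch below recomputes an "Overall" value from a row's three metric scores. The unweighted mean follows from the note above; the function name, rounding, and example values (taken from the o3 medium row) are our own.

```python
# Minimal sketch of the "Overall" aggregation: the unweighted mean of the
# three metric scores, each in [-1.0, 1.0]. The function name is ours; the
# example values come from the o3 medium row.

def overall_score(pattern_adherence: float, scope_discipline: float, comment_quality: float) -> float:
    """Unweighted mean of the three metric scores."""
    return (pattern_adherence + scope_discipline + comment_quality) / 3

# (0.54 + 0.66 + 0.39) / 3 = 0.53, matching the o3 medium "Overall" entry
print(round(overall_score(0.54, 0.66, 0.39), 2))
```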

Insights

There is no singular "best" LLM for coding.

For our specific coding environment and preferences, we've implemented a tiered approach (a minimal routing sketch follows this list):

  • o3 for complex, low-frequency tasks where quality matters most
  • o3-mini for difficult work at higher scale
  • Gemini 2.5 Flash for documentation tasks and high-volume work
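
As a rough illustration of how such a tiered setup can be wired up, the sketch below maps task categories to models. The category names, model identifiers, and default tier are assumptions made for illustration, not part of the evaluation itself.

```python
# Hypothetical routing table for the tiered approach described above.
# Task categories, model identifiers, and the default tier are assumptions
# made for illustration only.

TIERED_MODELS = {
    "complex": "o3",                      # low-frequency tasks where quality matters most
    "difficult_at_scale": "o3-mini",      # difficult work at higher scale
    "documentation": "gemini-2.5-flash",  # documentation tasks
    "high_volume": "gemini-2.5-flash",    # high-volume work
}

def pick_model(task_type: str, default: str = "gemini-2.5-flash") -> str:
    """Map a task category to a model, falling back to the cheapest tier."""
    return TIERED_MODELS.get(task_type, default)

print(pick_model("complex"))   # -> o3
print(pick_model("refactor"))  # -> gemini-2.5-flash (unlisted task types use the default)
```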

Some additional observations:

  • The top three "Overall" scores fall within 0.07 of each other, a gap small enough that secondary factors like latency, price, and niche behaviors become the deciding criteria.
  • "Scope Discipline" is largely decorrelated from the other two metrics (correlation with "Pattern Adherence" is ~0.05, with "Comment Quality" is ~0.09). If scope creep is important to you (it is to us!), you may want to weight this axis higher.
  • Every model struggles with "Comment Quality". The top score is 0.391, and three models fall below zero. Since performance is low across the board, prompt / content engineering might be a viable way to improve it.
  • "Thinking" doesn't always help. Claude 4 Sonnet performs worse on "Pattern Adherence" (-0.146) and "Scope Discipline" (-0.095) in thinking mode; Opus Thinking gains "Pattern Adherence" (+0.135) but does worse on "Scope Discipline" (-0.034).

Try it Yourself

Your optimal picks will depend on your codebase, task types, and what behaviors matter most to your team.

To determine what works best for your team, try our Best LLM For tool with your own prompts and evaluation criteria.
