
What's the Best LLM for Coding?

July 21, 2025

We evaluated 14 top LLMs on real sprint tickets, measuring three pain points: Pattern Adherence (architectural thinking), Scope Discipline (staying focused), and Comment Quality (useful documentation). Here are the results.

For more details on our eval methodology, see our What's the Best LLM for Coding? post.

Leaderboard

| Model | Overall | Pattern Adherence | Scope Discipline | Comment Quality |
| --- | --- | --- | --- | --- |
| o3 medium | 0.53 ± 0.05 | 0.54 ± 0.04 | 0.66 ± 0.09 | 0.39 ± 0.12 |
| o4-mini medium | 0.48 ± 0.05 | 0.43 ± 0.07 | 0.64 ± 0.10 | 0.36 ± 0.11 |
| o3-mini medium | 0.46 ± 0.05 | 0.39 ± 0.12 | 0.72 ± 0.04 | 0.26 ± 0.10 |
| Gemini 2.5 Pro Preview | 0.43 ± 0.06 | 0.41 ± 0.08 | 0.50 ± 0.12 | 0.39 ± 0.10 |
| o1 medium | 0.42 ± 0.07 | 0.32 ± 0.16 | 0.67 ± 0.07 | 0.29 ± 0.08 |
| Gemini 2.5 Flash Preview | 0.42 ± 0.07 | 0.44 ± 0.09 | 0.47 ± 0.15 | 0.35 ± 0.12 |
| Claude 4 Sonnet | 0.41 ± 0.07 | 0.50 ± 0.05 | 0.58 ± 0.10 | 0.16 ± 0.17 |
| Claude 4 Opus Thinking | 0.39 ± 0.08 | 0.48 ± 0.08 | 0.47 ± 0.13 | 0.22 ± 0.17 |
| Grok 3 Beta | 0.38 ± 0.07 | 0.34 ± 0.11 | 0.62 ± 0.11 | 0.19 ± 0.13 |
| Claude 4 Opus | 0.34 ± 0.08 | 0.34 ± 0.17 | 0.51 ± 0.12 | 0.17 ± 0.12 |
| Claude 4 Sonnet Thinking | 0.34 ± 0.07 | 0.35 ± 0.11 | 0.48 ± 0.13 | 0.17 ± 0.11 |
| Grok 2 | 0.22 ± 0.08 | 0.18 ± 0.13 | 0.53 ± 0.15 | -0.05 ± 0.16 |
| Codestral | 0.21 ± 0.06 | 0.17 ± 0.13 | 0.65 ± 0.05 | -0.18 ± 0.09 |
| Mistral Large | 0.14 ± 0.09 | 0.10 ± 0.15 | 0.52 ± 0.16 | -0.21 ± 0.14 |

Note: Scores range from -1.0 to 1.0 (higher is better). The "Overall" column is the mean of the three metric scores.
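
As a concrete illustration, the sketch below recomputes an "Overall" value from a row's three metric scores. The unweighted mean follows from the note above; the function name, rounding, and example values (taken from the o3 medium row) are our own.

```python
# Minimal sketch of the "Overall" aggregation: the unweighted mean of the
# three metric scores, each in [-1.0, 1.0]. The function name is ours; the
# example values come from the o3 medium row.

def overall_score(pattern_adherence: float, scope_discipline: float, comment_quality: float) -> float:
    """Unweighted mean of the three metric scores."""
    return (pattern_adherence + scope_discipline + comment_quality) / 3

# (0.54 + 0.66 + 0.39) / 3 = 0.53, matching the o3 medium "Overall" entry
print(round(overall_score(0.54, 0.66, 0.39), 2))
```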

Insights

There is no singular "best" LLM for coding.

For our specific coding environment and preferences, we've implemented a tiered approach (a minimal routing sketch follows this list):

  • o3 for complex, low-frequency tasks where quality matters most
  • o3-mini for difficult work at higher scale
  • Gemini 2.5 Flash for documentation tasks and high-volume work
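
As a rough illustration of how such a tiered setup can be wired up, the sketch below maps task categories to models. The category names, model identifiers, and default tier are assumptions made for illustration, not part of the evaluation itself.

```python
# Hypothetical routing table for the tiered approach described above.
# Task categories, model identifiers, and the default tier are assumptions
# made for illustration only.

TIERED_MODELS = {
    "complex": "o3",                      # low-frequency tasks where quality matters most
    "difficult_at_scale": "o3-mini",      # difficult work at higher scale
    "documentation": "gemini-2.5-flash",  # documentation tasks
    "high_volume": "gemini-2.5-flash",    # high-volume work
}

def pick_model(task_type: str, default: str = "gemini-2.5-flash") -> str:
    """Map a task category to a model, falling back to the cheapest tier."""
    return TIERED_MODELS.get(task_type, default)

print(pick_model("complex"))   # -> o3
print(pick_model("refactor"))  # -> gemini-2.5-flash (unlisted task types use the default)
```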

Some additional observations:

  • The top three "Overall" scores fall within 0.07 of each other, a gap small enough that secondary factors like latency, price, and niche behaviors become the deciding criteria.
  • "Scope Discipline" is largely decorrelated from the other two metrics (correlation with "Pattern Adherence" is ~0.05, with "Comment Quality" is ~0.09). If scope creep is important to you (it is to us!), you may want to weight this axis higher.
  • Every model struggles with "Comment Quality". The top score is 0.391, and three models fall below zero. Since performance is low across the board, prompt / content engineering might be a viable way to improve it.
  • "Thinking" doesn't always help. Claude 4 Sonnet performs worse on "Pattern Adherence" (-0.146) and "Scope Discipline" (-0.095) in thinking mode; Opus Thinking gains "Pattern Adherence" (+0.135) but does worse on "Scope Discipline" (-0.034).

Try it Yourself

Your optimal picks will depend on your codebase, task types, and what behaviors matter most to your team.

To determine what works best for your team, try our Best LLM For tool with your own prompts and evaluation criteria.
