What's the Best LLM for Coding?
July 21, 2025
We evaluated 14 top LLMs on real sprint tickets, measuring three pain points: Pattern Adherence (architectural thinking), Scope Discipline (staying focused), and Comment Quality (useful documentation). Here are the results.
For more details on our evaluation methodology, see our "What's the Best LLM for Coding?" post.
Leaderboard
Model | Overall | Pattern Adherence | Scope Discipline | Comment Quality |
---|---|---|---|---|
o3 medium | 0.53 ± 0.05 | 0.54 ± 0.04 | 0.66 ± 0.09 | 0.39 ± 0.12 |
o4-mini medium | 0.48 ± 0.05 | 0.43 ± 0.07 | 0.64 ± 0.10 | 0.36 ± 0.11 |
o3-mini medium | 0.46 ± 0.05 | 0.39 ± 0.12 | 0.72 ± 0.04 | 0.26 ± 0.10 |
Gemini 2.5 Pro Preview | 0.43 ± 0.06 | 0.41 ± 0.08 | 0.50 ± 0.12 | 0.39 ± 0.10 |
o1 medium | 0.42 ± 0.07 | 0.32 ± 0.16 | 0.67 ± 0.07 | 0.29 ± 0.08 |
Gemini 2.5 Flash Preview | 0.42 ± 0.07 | 0.44 ± 0.09 | 0.47 ± 0.15 | 0.35 ± 0.12 |
Claude 4 Sonnet | 0.41 ± 0.07 | 0.50 ± 0.05 | 0.58 ± 0.10 | 0.16 ± 0.17 |
Claude 4 Opus Thinking | 0.39 ± 0.08 | 0.48 ± 0.08 | 0.47 ± 0.13 | 0.22 ± 0.17 |
Grok 3 Beta | 0.38 ± 0.07 | 0.34 ± 0.11 | 0.62 ± 0.11 | 0.19 ± 0.13 |
Claude 4 Opus | 0.34 ± 0.08 | 0.34 ± 0.17 | 0.51 ± 0.12 | 0.17 ± 0.12 |
Claude 4 Sonnet Thinking | 0.34 ± 0.07 | 0.35 ± 0.11 | 0.48 ± 0.13 | 0.17 ± 0.11 |
Grok 2 | 0.22 ± 0.08 | 0.18 ± 0.13 | 0.53 ± 0.15 | -0.05 ± 0.16 |
Codestral | 0.21 ± 0.06 | 0.17 ± 0.13 | 0.65 ± 0.05 | -0.18 ± 0.09 |
Mistral Large | 0.14 ± 0.09 | 0.10 ± 0.15 | 0.52 ± 0.16 | -0.21 ± 0.14 |
Note: Scores range from -1.0 to 1.0 (higher is better). The "Overall" column is the mean of the three metric scores.
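As a quick sanity check, you can recompute the "Overall" column yourself. A minimal sketch in Python, with a few rows transcribed from the table above; the ± intervals are not propagated, and rounding may differ in the last digit:

```python
# Recompute "Overall" as the unweighted mean of the three metric scores.
# A few rows transcribed from the leaderboard above; ± intervals are ignored.
scores = {
    "o3 medium": (0.54, 0.66, 0.39),
    "Claude 4 Sonnet": (0.50, 0.58, 0.16),
    "Mistral Large": (0.10, 0.52, -0.21),
}

for model, (pattern, scope, comments) in scores.items():
    overall = (pattern + scope + comments) / 3
    print(f"{model}: {overall:.2f}")  # 0.53, 0.41, 0.14 -- matches the table
```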
Insights
There is no single "best" LLM for coding.
For our specific coding environment and preferences, we've implemented a tiered approach:
- o3 for complex, low-frequency tasks where quality matters most
- o3-mini for difficult work at higher scale
- Gemini 2.5 Flash for documentation tasks and high-volume work
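In practice, this tiering is just a routing policy from task type to model. A hypothetical sketch of what that could look like; the task categories, model identifiers, and the `pick_model` helper are illustrative, not a description of our production setup:

```python
# Hypothetical routing table reflecting the tiers above.
# Task categories, model identifiers, and pick_model are illustrative only.
MODEL_TIERS = {
    "complex_low_frequency": "o3",               # quality matters most
    "difficult_high_scale": "o3-mini",           # hard work, higher volume
    "docs_and_high_volume": "gemini-2.5-flash",  # documentation, bulk tasks
}

def pick_model(task_type: str) -> str:
    """Return the model for a task type, defaulting to the high-volume tier."""
    return MODEL_TIERS.get(task_type, "gemini-2.5-flash")

print(pick_model("complex_low_frequency"))  # -> o3
```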
Some additional observations:
- The top three "Overall" scores are within 0.07 of each other, close enough that secondary factors like latency, price, and niche behaviors become the deciding ones.
- "Scope Discipline" is largely decorrelated from the other two metrics (correlation with "Pattern Adherence" is ~0.05, with "Comment Quality" is ~0.09). If scope creep is important to you (it is to us!), you may want to weight this axis higher.
- Every model struggles with "Comment Quality". The top score is 0.391 and three models are below zero. Since performance is low across the board, prompt / content engineering might be a viable approach.
- "Thinking" doesn't always help. Claude 4 Sonnet performs worse on "Pattern Adherence" (-0.146) and "Scope Discipline" (-0.095) in thinking mode; Opus Thinking gains "Pattern Adherence" (+0.135) but does worse on "Scope Discipline" (-0.034).
Try it Yourself
Your optimal picks will depend on your codebase, task types, and which behaviors matter most to your team.
To find what works best for you, try our Best LLM For tool with your own prompts and evaluation criteria.