AI Coding Benchmarks

Created: Jun 25, 2026 | B quality | Low importance

Ever since the LLM wars started, people have been developing benchmarks to compare models and decide which ones are "better" and which are the best. These are trotted out in every new model release to explain (probably mostly to investors) just how far we've come.

Setting aside the practical problems with such benchmarks, my opinion is that they just don't really matter at all.

When I'm using Claude Code, I'm much more concerned about the experience than the "raw algorithmic output". That's probably because I'm mostly working on projects that have some kind of web app or user interface, and verifying that they work requires a human in the loop (HITL). My experience prompting, getting feedback, iterating, etc. is much more important than the "raw ultimate coding power" that the model provides. I don't need Claude Code to write a perfect JIT compiler, or media decoding algorithm, or tree shaker for an obscure language. I need it to be able to create and iterate on a UI with me, ask intelligent questions that surface clear tradeoffs in architecture, and respond to my own questions.

I'm sorry, I know the loop-based agentic whatever is the future, but for now I'm getting the best results with an iterative HITL approach.

