AI Coding Benchmarks

Created: Jun 25, 2026 | B quality | Low importance

Ever since the LLM wars started, people have been developing benchmarks to compare models and decide which ones are "better" and which are the best. These benchmarks are trotted out with every new model release to show an audience (probably mostly investors) just how far we've come.

Setting aside the practical problems with such benchmarks, my opinion is that they don't really matter.

When I'm using Claude Code, I'm much more concerned about the experience than the "raw algorithmic output". That's probably because I'm mostly working on projects that have some kind of web app or user interface, and verifying that they work requires a human in the loop (HITL). My experience prompting, getting feedback, iterating, etc. is much more important than the "raw ultimate coding power" that the model provides. I don't need Claude Code to write a perfect JIT compiler, or media decoding algorithm, or tree shaker for an obscure language. I need it to be able to create and iterate on a UI with me, ask intelligent questions that surface clear tradeoffs in architecture, and respond to my own questions.

I'm sorry, I know loop-based agentic whatever is the future, but for now I'm getting the best results with an iterative HITL approach.
