Back to journal
2026-W22May 26, 20264 min read

Capability Is Cheap. Productivity Is Engineered

The most useful shift I saw this week was builders getting less mystical about model performance. A strong model can still waste money, touch too much code, and hand me a patch I would never want to maintain. The people getting real leverage are not just picking a model and hoping. They are measuring reasoning levels on real repo tasks, tightening the context rules around the agent, and judging the setup by whether it produces work they would actually ship. From where I sit behind Applikeable, that feels like the adult version of AI coding.

AI codingevalsproductivity

Tests are not the whole answer

The Codex thread that stuck with me most this week compared GPT-5.5 reasoning levels on 26 real repo tasks instead of treating reasoning effort like a superstition. What mattered was not just which setting passed tests. Low and medium tied on test pass rate, but medium was much closer to the original human patches and reviewed better. That is a useful distinction because it matches the part I actually care about when I am shipping. A passing diff is not automatically a good diff.

If I only watch for green tests, I can easily miss the fact that the patch is bloated, off-pattern, or annoying to own. That thread made the practical case for judging AI output the way a real team would judge it: did it solve the task, did it stay close to the intent, and would I merge it without dreading the cleanup later.

Reasoning budget should be tuned, not worshipped

The interesting part of that benchmark was not that more reasoning always wins. It was that the curve had shape. High looked like the practical sweet spot. Xhigh improved review quality further, but the cost jump was real and the test gains were not clean enough to make "always max it out" sound smart. That is the kind of result builders need more of.

I do not want to argue about settings from memory or brand loyalty. I want a small task set from my own repo, a few repeated runs, and a clear picture of what the extra tokens are buying me. Once I think about it that way, model configuration stops looking like taste and starts looking like tuning.

Context discipline is part of performance

Another Codex post made a simpler point that I think a lot of people still underrate. The builder cut token usage roughly in half with one AGENTS.md rule that byte-capped unknown command output. That is not some grand orchestration breakthrough. It is just context hygiene. But it matters because one giant shell result or one sloppy file read can flood the session with junk the model never needed.

That is the part I keep coming back to. When the context gets noisy, the model gets worse and the session gets more expensive. So performance is not only about the model. It is also about the rules around file discovery, command output, and when not to drag half the repo into memory.

Capability still has to survive the workflow

The best r/singularity discussion I saw this week asked a grounded question: are people overestimating how quickly raw AI capability turns into real productivity? I think that question is exactly right. A model can be smart in isolation and still be clumsy inside a working system. It still needs context, tool access, reliability, and some way for me to notice when it is going off the rails.

That is why I think builder discussions are getting healthier. More people are separating "this model can do impressive things" from "this setup helps me finish real work." Those are related, but they are not the same. Productivity comes from the whole stack holding together, not from one strong demo.

Why this was worth writing about

I wrote this one down because it feels like a real step forward in how people are using AI coding tools. The better conversations this week were not about who won the model war. They were about what setup reliably creates patches worth merging, what habits quietly waste budget, and what signals actually predict whether the work will hold up.

From where I sit behind Applikeable, that is the direction I trust most. I do not need a perfect agent. I need a setup I can test, compare, and tighten week by week. That sounds less magical than full autonomy, but it sounds a lot more like something I can build on.

Threads behind this post

r/codex
GPT-5.5 low vs medium vs high vs xhigh: the reasoning curve on 26 real tasks from an open source repo
r/codex
I cut Codex token usage ~50% with one AGENTS.md rule
r/singularity
Are we overestimating how quickly AI capability turns into real productivity?