Back to journal
2026-W21May 18, 20264 min read

Stop Making One Agent Do Everything

The most practical shift I saw this week had nothing to do with a shiny new model. It was builders getting more disciplined about role separation. One model plans. Another gathers context. Another reviews the result. That sounds less magical than full autonomy, but it is much closer to how I actually want to build software at Applikeable.

AI codingorchestrationworkflow

One giant context blob is not a workflow

A good Claude Code thread this week made the point cleanly. Once planning context gets flooded with implementation chatter, the model stops acting like a planner and starts acting like a patch machine. That matches what I keep seeing in practice. If I ask one agent to understand the system, decide the approach, write the code, react to failing tests, and judge its own result, I should not be surprised when the quality gets mushy halfway through.

The useful move is not more prompt poetry. It is giving the steps some separation. Let one model think about architecture and sequencing while the context is still clean. Then let a worker model deal with the noisy part, where files change, logs pile up, and the task becomes more mechanical.

The expensive model should think, not shovel

Another thread that stuck with me described a tiered setup where a cheap model filters routine CI failures and gathers the exact data before the expensive model ever gets involved. The headline was that Opus could end up costing less than Sonnet when it was used for judgment instead of for everything. That is a better mental model than most people have right now.

If the strongest model spends its time scanning raw logs, fetching obvious facts, or repeating simple checks, I am wasting the part I am actually paying for. I would rather use the expensive model for problem framing and decision-making, then let cheaper models or plain scripts do the grunt work.

Tool choice still matters, but role choice matters more

The loudest Codex thread I saw this week was from someone who felt a real productivity jump after switching away from Claude for repo-scale work. What I found more useful than the vendor comparison was the shape of the complaint. They were tired of babysitting, tired of incomplete implementations, and tired of spending more energy managing the model than shipping software.

That is why I do not think the lesson is "pick your forever winner." Models are still moving around too fast for that. The better lesson is to decide which role needs to be strong right now. If I need planning, I should reach for the tool that thinks clearly. If I need disciplined execution, I should reach for the tool that stays on task. Those are not always the same model, and they probably will not stay the same all year.

The setup I would actually keep

If I were tightening my own stack around this pattern, I would keep it pretty lean. One planner for architecture, tradeoffs, and task breakdown. One worker for implementation inside the repo. One separate review pass, either from another model or from deterministic checks that force the weak spots into view. I do not need a little AI org chart with twelve pretend employees. I need clean handoffs.

I also want the cheap parts to be boring on purpose. Scripts should fetch logs. Tests should verify behavior. Linters should police style. Small models can summarize diffs, inspect output, and challenge assumptions. The stronger model should spend its budget on the parts that still benefit from judgment.

Why this was worth writing about

I wrote this one down because it feels like a real sign of maturity. A few months ago the discourse was full of single-agent fantasies. This week felt more grounded. Builders are getting less interested in whether one model can do everything and more interested in how to divide the work so the whole system holds up.

From where I sit behind Applikeable, that is the healthier direction. Small teams do not need more mythology. We need workflows that stay understandable when a session runs long, a quota gets tight, or a model has an off day. Treating planning, execution, and review as different jobs is not flashy, but it is one of the more useful ideas I saw all week.

Threads behind this post

r/ClaudeCode
AI Coding Got Better Once I Split Planning From Execution
r/ClaudeCode
Switched from Claude Sonnet to Opus and costs went down - the tiered routing architecture is why
r/codex
Codex Feels Like a Vibe Coder’s Dream After Months of Fighting Claude
r/singularity
The Orchestrator Era: The Great Recalibration