The Verification Tax

The old question was whether the agent could do the work. That question still matters, but it is not the one that breaks teams. The sharper question is whether the human can verify the work cheaply enough for the whole loop to be worth running.

01The old unit

The demo stopped being the unit of truth.

For a while, the agent demo was enough. The model wrote the function, opened the browser, changed the CSS, explained the patch, and everyone in the room could feel the future move one chair closer. The right response was not cynicism. The thing was real. A tool that could move through a codebase, hold a task, and come back with a working artifact deserved attention.

But the demo measured the wrong surface. It measured whether the agent could produce something plausible at speed. It did not measure the second half of the loop: whether the result could be checked without turning the human into a slow parser for machine output.

That is the verification tax. It is the review time, evidence gathering, reruns, diff reading, browser checking, log scanning, and uneasy rereading that shows up after the agent says it is done. It is not a moral objection to automation. It is the bill for trusting work you did not personally perform.

02The bill

The tax appears in four places.

Generation is visible. Verification is mostly invisible. That is why the numbers feel so good until they do not. A system that produces ten times as much work can still be negative if it produces fifteen times as much review debt.

Tax 01

Behavior

Does the change do the thing, or did it merely satisfy the text of the task? The agent optimizes for completion unless the task makes evidence part of completion.

Tax 02

Substrate

Did it edit the right files, preserve the right source of truth, and leave the system in a state another person can reason about?

Tax 03

Price

Was the generated value larger than the tokens, wall time, retries, tool calls, and human review needed to get there?

Tax 04

Ownership

Who is willing to be named when the output reaches production, the customer, the audit trail, or the on-call rotation?

Most bad agent workflows hide at least one of these. They show the patch and skip the evidence. They show the green check and skip the changed fixture. They show the artifact and skip the source that generated it. That is how the tax compounds.

03The rule

A useful agent run has a cheap check.

The best question before handing work to an agent is not “can it do this?” The frontier models can do an absurd amount. The better question is “how will I know?” If the answer is “I will read everything it wrote and hope my attention holds,” the loop is already expensive.

Do not ask whether the agent can produce it. Ask whether you can verify it cheaply.

Cheap verification is not one thing. Sometimes it is a typecheck. Sometimes it is a failing test made green. Sometimes it is a screenshot before and after. Sometimes it is a SQL query, a link checker, a browser smoke test, a line count, a diff against generated output, or a human opening exactly three pages and looking for exactly two promises.

What matters is that the check is smaller than the work. If verification is as hard as doing the task yourself, the agent may still be useful as a study partner, but it is not leverage yet. It is a different way to spend the same attention.

04The cliff

The tax goes nonlinear when output outruns review.

The first agent patch is delightful. The fifth is productive. The fiftieth can be a landfill if the surrounding system has not changed. Review does not scale just because generation does. A human can inspect a small diff with intent. A human cannot responsibly absorb a full afternoon of broad edits, test changes, generated docs, renamed files, and confident summaries without a better structure.

This is why large agent tasks so often feel good during the run and worse the next morning. The agent did not fail in the dramatic way. It completed a lot. That is the problem. Completion is cheap; confidence is expensive.

The danger zone is easy to recognize. The diff is too wide to hold in your head. The agent changed both the code and the tests. The summary contains claims that cannot be mapped to commands. The result looks polished but has no audit trail. The page renders, but nobody checked the smaller viewport. The script passes, but it passed before the change too.

The agent does not have to be wrong to create risk. It only has to be faster than your ability to know what changed.

That is the real bottleneck shift. The agent is no longer the scarce resource in many loops. Human verification is.

05The ledgers

Keep three ledgers, or accept fog.

The fix is not to slow the agent back down to human speed. The fix is to make the work auditable as it moves. I trust agent output much more when the loop leaves three small ledgers behind.

Work ledger

What did we ask for, what changed, and what files or systems were touched?

Evidence ledger

What was run, what passed, what was opened, and what specific proof backs the claim?

Risk ledger

What was not checked, what remains speculative, and where can the change still bite?

These do not have to be ceremonies. They can be a PR description, a scratch file, a final note, a CI job, a screenshot folder, or a tiny script. The point is that the evidence must survive the moment. A summary that says “all set” is not evidence. A command, a result, and a remaining risk are evidence.

This is also where HTML keeps earning its place on this site. Not because every answer deserves a designed page. It does not. HTML earns its place when the artifact needs to be reviewed rendered, shared outside the editor, or kept as a small piece of evidence. The format is useful when it lowers the verification tax. It is waste when it raises it.

06The operating style

Design the task around the check.

The easiest way to improve agent work is to specify the verifier before the worker starts. “Make this better” is not a task; it is a mood. “Make this page load at the new URL, add the archive link, run the local link checker, rebuild the container, and smoke test the live route” is a task. The second one has edges. The agent can still make mistakes, but the mistakes have less room to hide.

For codeAsk for the failing test, the smallest diff, and the command that proves the behavior changed.
For UIAsk for the viewport, the screenshot, the interaction checked, and the element that must not shift.
For researchAsk for source classes, dates, contrary evidence, and which claims still need verification.
For proseAsk what the piece is trying to make the reader believe, then check every section against that job.

Notice the pattern: the task is not framed around the agent's effort. It is framed around the human's ability to verify the result. That is the shape of mature agent work. Generation is delegated. Accountability is not.

07The limit

Some work should not be delegated until the check exists.

The uncomfortable part is that cheap verification rules work out of the queue. If a task has no credible check, do not treat the agent as leverage yet. Treat it as an assistant, a drafter, a critic, or a search surface. Let it explore. Let it propose. Let it make throwaway artifacts. But do not count the work as done merely because it came back with confidence and structure.

This is where agent enthusiasm often gets sloppy. It treats all impressive output as progress. A generated migration with no rollback path is not progress. A generated policy with no owner is not progress. A refactor that passes tests only because the agent updated the tests is not progress. A beautiful document with unchecked claims is not progress. These may be useful intermediate states, but they are not finished work.

Finished work has an owner and a check. Everything else is draft material.

08The instrument

The argument, compiled into a skill.

A discipline you cannot run is just a mood. So here is the same thesis folded into one thing you keep open while you work: the four taxes, the three ledgers, and the cheap check, turned into a contract the agent fills before it starts and you audit when it returns. The essay makes the case once. The instrument makes it every time you delegate.

Everything upstream of the work happens in one block. Fill it before the agent generates. If a field will not fill with something concrete, the task is exploration, not delivery, and you reframe it instead of shipping it.

Agent task · paste before you generate

Agent task

Expected change:
- Change:
- Do not change:
- Done means:

Cheap verifier:
- I will verify by:

Evidence required:
- Commands / results:
- Screenshots / logs / diff summary:
- Before / after behavior:

Residual risk:
- Remaining risk:
- Untested areas:
- Owner:

Ready to delegate only when you can finish the line: I can verify this cheaply by checking ______, and the remaining risk is ______.

And the loop around it, in order, so the evidence survives the moment instead of evaporating with the run:

Write the task block before generating. If the four fields will not fill, mark the task exploration, not delivery.
Keep the output reviewable. Prefer the smallest diff. Split wide work before it outruns review.
Run the cheap verifier and capture its real output, not a description of it.
Emit the three ledgers: Work, Evidence, Risk.
Name an owner for anything headed to production.
Withhold “done” until a check has passed and an owner exists. Generation is delegated. Accountability is not.

Take the instrument

Drop it into your agent. It triggers itself whenever substantial output has to be trusted: before a pull request, before you ship, any time you are deciding whether something is actually done.

↓ Download the skill .claude/skills/verification-tax/SKILL.md

The agent is not the bottleneck anymore.

The next advantage will not come from asking for bigger outputs. It will come from shaping smaller loops where evidence is cheap, risk is named, and the human can say yes for a reason. The tax never disappears. Good agent work just keeps it visible, payable, and smaller than the value on the other side.

The verification tax.