Evals and accuracy

Why small error rates explode over a long agent loop

Manish Kumar, Principal Engineer, AI Systems · 3 July 2026 · 5 min read

"The agent is 90 percent accurate." It sounds like a pass. Nine times out of ten it does the right thing, and the tenth is a rounding error you can clean up later. Teams ship on that number all the time. Then the agent is asked to run a real workflow, twenty or fifty steps long, and it fails more often than it succeeds. Nothing regressed. The 90 percent was always going to do this. The mistake was treating a per-step number as an end-to-end one.

Accuracy compounds, and compounding is brutal

An agent that is right 90 percent of the time on a single step has to be right at every step, so its chance of finishing a chain is 0.9 times 0.9 times 0.9, one factor per step. Get every step right and the workflow succeeds; miss one and it derails. So, assuming steps fail independently and a miss is not caught along the way, the end-to-end success rate is roughly the per-step accuracy raised to the number of steps, and that exponent is merciless.

Run the numbers. At 90 percent per step, ten steps land you near 35 percent end to end, and fifty steps land you at about half of one percent: effectively broken. Now take an agent at 99.9 percent per step. Ten steps hold at 99 percent, and fifty steps still hold at about 95 percent. The gap between those two agents on a single step is 9.9 points and easy to wave away. Over fifty steps it is the difference between a system that works and one that never finishes. Per-step accuracy is not a vanity metric. It is the base of an exponent.
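
The arithmetic is easy to check for yourself. Here is a minimal Python sketch, assuming every step succeeds independently with the same probability and that a single miss fails the whole run. The function name `end_to_end_success` is made up for illustration.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step in a chain succeeds.

    Assumption: steps are independent and one miss fails the run.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # The two agents and two chain lengths discussed in the text.
    for accuracy in (0.90, 0.999):
        for steps in (10, 50):
            rate = end_to_end_success(accuracy, steps)
            print(f"{accuracy:.1%} per step, {steps} steps: {rate:.1%} end to end")
```

Real agents rarely satisfy the independence assumption exactly, so treat the output as a first-order estimate rather than a forecast.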

Real work is long

You might hope your workflows are short. They are not. A single "resolve this ticket" or "process this claim" unfolds into many steps once you count them honestly: read the input, retrieve context, call a tool, interpret the result, decide, update a system of record, handle the exception, verify. Every one of those is a step where the agent can be wrong, and every tool call, retry and hand-off adds to the exponent. Agentic systems are valuable precisely because they chain many steps without a human between each one. That same chaining is what turns a respectable per-step error rate into an unusable end-to-end one.

Two levers, and you need both

There are exactly two ways to beat compounding, and mature systems use both.

Raise per-step accuracy. Because the number is the base of an exponent, small improvements pay enormous dividends. Moving every step from 95 to 99.5 percent looks like a minor tune-up and changes the end-to-end outcome completely: over twenty steps, it is the difference between roughly 36 percent and roughly 90 percent. This is what the unglamorous work buys you: an eval suite that measures each step, better tools so the model is not guessing, tighter retrieval so it has the right context, and guardrails that catch a bad step before it propagates. You cannot improve what you do not measure per step.
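
It can help to turn the question around: given a target end-to-end success rate and a step count, how accurate does each step have to be? The sketch below inverts the same formula by taking the n-th root of the target. It assumes independent, equally accurate steps, and `required_per_step_accuracy` is a hypothetical helper name.

```python
def required_per_step_accuracy(target_end_to_end: float, steps: int) -> float:
    """Per-step accuracy needed to reach a target end-to-end success rate.

    Assumption: independent steps with identical accuracy, so the target
    equals accuracy ** steps and the answer is the steps-th root.
    """
    if not 0.0 < target_end_to_end <= 1.0:
        raise ValueError("target_end_to_end must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target_end_to_end ** (1.0 / steps)


if __name__ == "__main__":
    # How good must each step be to finish 90 percent of runs?
    for steps in (10, 20, 50):
        needed = required_per_step_accuracy(0.90, steps)
        print(f"{steps} steps: each step must be right {needed:.2%} of the time")
```

Under these assumptions, a twenty-step workflow that should finish nine times in ten needs each step to be right a little under 99.5 percent of the time, which is why per-step targets that sound excessive are often just the minimum.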

Reset the error. The other lever is to stop carrying a mistake through forty more steps. Checkpoints, verification stages and human review act like error-correction: at each one, the accumulated drift is caught and the chain restarts from a known-good state instead of compounding from a corrupted one. Confidence routing is the sharpest version of this. When the agent is unsure, the case goes to a person, the error is reset to zero at exactly the point it was about to explode, and the correction becomes training data. This is the real, mathematical argument for keeping humans in the loop: not caution for its own sake, but the cheapest way to break an exponent.
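
A rough model shows how much a reset is worth. The sketch below splits a chain into segments with a gate after each one. It assumes the gate is a perfect verifier, that a failed segment is retried from the last known-good state, and that each attempt is independent. None of these holds exactly in practice, and `checkpointed_success` and its parameters are invented for illustration.

```python
def checkpointed_success(
    per_step_accuracy: float,
    steps: int,
    segment_length: int,
    attempts_per_segment: int = 1,
) -> float:
    """End-to-end success rate for a chain with a gate after each segment.

    Assumptions: independent steps, a gate that always detects a bad
    segment, and independent retries from the last known-good state.
    """
    if segment_length < 1 or attempts_per_segment < 1 or steps < 0:
        raise ValueError("invalid segment_length, attempts_per_segment or steps")
    full_segments, remainder = divmod(steps, segment_length)
    lengths = [segment_length] * full_segments + ([remainder] if remainder else [])

    total = 1.0
    for length in lengths:
        one_try = per_step_accuracy ** length
        # The segment passes if at least one attempt gets every step right.
        total *= 1.0 - (1.0 - one_try) ** attempts_per_segment
    return total


if __name__ == "__main__":
    unbroken = checkpointed_success(0.90, 50, segment_length=50)
    gated = checkpointed_success(0.90, 50, segment_length=5, attempts_per_segment=3)
    print(f"one unbroken 50-step loop: {unbroken:.1%}")
    print(f"gate every 5 steps, up to 3 attempts each: {gated:.1%}")
```

Under these assumptions the same 90 percent agent goes from about half a percent to roughly 49 percent on a fifty-step task. With only one attempt per segment the gates change nothing in this model: detection alone does not help, the retry or the human correction is what resets the error.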

What to measure, and what to ask

The practical consequences are direct. Measure per-step accuracy, not just end-to-end, because the end-to-end number hides which step is bleeding you. Count the steps in your workflow honestly; the length is your exponent. Put verification gates and human checkpoints where a wrong step is expensive or irreversible, so the loop cannot run fifty steps past a mistake. And prefer several short, checkpointed loops over one long unbroken one, because the exponent resets at every gate.
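
Measuring per step can start very simply. The sketch below assumes you already have graded traces in the form of `(step_name, was_correct)` pairs. That record shape, the step names and the sample data are all made up for illustration and do not come from a real system.

```python
from collections import defaultdict


def per_step_accuracy(records):
    """Map each step name to the fraction of its graded attempts that were correct."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for step_name, was_correct in records:
        totals[step_name] += 1
        if was_correct:
            correct[step_name] += 1
    return {name: correct[name] / totals[name] for name in totals}


def weakest_step(records):
    """Return (step_name, accuracy) for the least accurate step."""
    accuracy = per_step_accuracy(records)
    if not accuracy:
        raise ValueError("no records to evaluate")
    return min(accuracy.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical graded traces, purely illustrative.
    sample = [
        ("retrieve context", True), ("call tool", True), ("decide", True),
        ("retrieve context", True), ("call tool", False), ("decide", True),
        ("retrieve context", True), ("call tool", True), ("decide", True),
        ("retrieve context", False), ("call tool", False), ("decide", True),
    ]
    for name, value in per_step_accuracy(sample).items():
        print(f"{name}: {value:.0%}")
    print("weakest step:", weakest_step(sample))
```

The point of the breakdown is the last line: an end-to-end pass rate would show that runs fail, while the per-step view shows where to put the next improvement or the next gate.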

When a vendor tells you an agent is "90 percent accurate," that is a per-step number wearing an end-to-end costume. Ask how many steps a real task takes, what the per-step accuracy is, and where the checkpoints are that keep the chain from compounding to zero. If there are no checkpoints and the loop is long, the demo will look brilliant and production will not survive contact with the twentieth step.
