Agentic AI

The AI pilot rescue checklist: five checks before you call it dead

a casemodelscores confidenceautomatedhigh confidencehuman reviewthe hard casescorrections raise accuracy over time
Confidence-based routing: the machine handles what it is sure of, people handle the rest.

Your AI pilot stalled, and someone is about to call it dead. Before that meeting, run this checklist. We wrote about why most AI pilots die between demo and production, and the anatomy is consistent: the model is almost never the problem; the missing operating system around it is. This is the working version of that argument. Five checks, each with a pass condition, a fail condition, and a first fix. Score honestly. The result tells you whether you are looking at a rescue, or at a use case that was wrong from the start.

1. Accuracy is a number

Pass: There is a written target for what the system must get right, per case type, plus a golden set of real cases with known answers that is rerun on every change. Everyone in the room quotes the same number.

Fail: Accuracy is an adjective. "Pretty good." "Mostly right." Or it was measured once, at demo time, on cases picked by the people who built the demo.

First fix: Pull real cases from production traffic, not curated examples, and have the people who own the outcome label the correct answers. Run the pilot against them and write the score down. This is uncomfortable, and that is the point. Our Accuracy Gap Assessment does exactly this fast: where the system actually fails, what each failure costs, and what closes the gap.

2. The workflow is mapped

Pass: A one-page map shows where work enters, what the AI decides, which systems of record it updates, and where exceptions go. Every arrow ends at a system or a named person.

Fail: The pilot answers questions in a sandbox. Nobody can say what happens to its output, who acts on it, or which step of the real process it replaced.

First fix: Sit with the person who does the job today and trace one real case end to end, including the input that arrives in the wrong format and the deadline nobody mentioned. Most pilots automate the easy middle of a process. The map shows you the painful edges that were quietly left to a human.

3. Failures have a path

Pass: The system knows when it is unsure. Low-confidence cases route to a review queue with an owner and a turnaround time. Failures are counted, categorized, and fed back as corrections.

Fail: Failures surface as screenshots in Slack. Nobody knows the failure rate, because nothing counts failures.

First fix: Add confidence-based routing, even a crude first version. The machine handles what it is sure of; everything else lands in a queue a human actually watches. This one change turns "the AI is unreliable" into "the AI handles the volume and people handle the rest", which is a system you can ship.

4. Cost per outcome is known

Pass: You can state what one outcome costs: one resolved ticket, one processed document, one published article. And you can compare it to what the old way costs.

Fail: You know the monthly model bill and nothing else. Retries, long contexts and tool calls were discovered by finance, not designed by engineering.

First fix: Instrument one number, cost per outcome, and put it on a dashboard with a ceiling. An autonomous newsroom we operate publishes every day at under $2 per article; that number exists because it was designed in from the start, not found later. Our One Workflow, Automated engagement ships with a before-and-after cost baseline for exactly this reason.

5. The pilot has an owner

Pass: One named person is accountable for this reaching production. They have time in their week, budget authority, and permission to change the human workflow around the system, not just the system.

Fail: The pilot belongs to "the innovation team", or to a vendor, or to a champion who has quietly stopped mentioning it. When something breaks, the fix waits for a meeting.

First fix: Name the owner and hand them two numbers to beat: the accuracy target from check one and the cost target from check four. A pilot with an owner and a number behaves like a project. A pilot with neither behaves like a demo that outlived its meeting.

What your score means

Count the passes.

  1. Five. You do not have a pilot problem; you have a rollout decision. Ship deliberately, keep the eval suite running, and watch the dashboard.
  2. Three or four. Rescue territory. In our experience the gap is rarely a rebuild; it is focused engineering on the checks that failed. A scoped Agent Readiness Audit maps the workflow, scores it for agent-fit, and gives you the plan with an honest go or no-go.
  3. Two or fewer. The use case itself may be wrong, and that is worth knowing this week, not months into the slow death. We have told clients not to build, in writing. A clear no is cheaper than a stalled yes.

One pattern holds across every rescue we have run: the demo proved the value exists, and the checklist shows what was never built around it. Bring us the pilot as it is, stalled and slightly embarrassing. Thirty minutes with a founder, and you will know which of the three outcomes you are holding.

Vinayaka RaoPrincipal Engineer, AgentsBuilds and hardens agentic systems: evals, guardrails, tool permissions and the monitoring that keeps them safe in production. extendfuture on LinkedIn. Reviewed by Amol Patil, Founder.

Working on something in this territory?

Tell us what you are trying to win. We answer within one business day, from the people who build.