
How to choose an AI agency: 12 questions serious buyers should ask

Figure: demo (looked great), pilot (stalls here), the gap (evals · guardrails · monitoring · humans), production (it ships).
Most pilots die in the gap: the operating layer around the model, not the model itself.

Searching for an AI agency returns hundreds of firms with nearly identical websites. Most were selling something else not long ago. The demos all look impressive, which tells you nothing, because the demo is the cheapest part of the job. What you are actually buying is everything that happens after the demo: the evals, the guardrails, the security posture, the people, and the handover.

You can learn most of that before signing anything. Here are twelve questions to ask in the first call, why each one matters, and what a good answer sounds like.

Proof, not promises

1. What do you run in production today? Not built. Run. A system that has survived real users, real data and a few model updates is a different artifact from a demo. Push for specifics. A production claim sounds like "field onboarding cut from about two days to about eight minutes, working offline". A demo claim sounds like "we improved efficiency". Vague answers in the sales call predict vague delivery.

2. Can I talk to a client whose system is still live? Launch-week references are easy. Ask for one whose system has been running long enough for accuracy and costs to drift, because both drift. That call tells you how the agency behaves when things wobble, which is most of what you are buying.

3. Tell me about one that failed. Every team that has shipped real AI has a failure story: a use case that was wrong, a model that never reached the accuracy bar, a client they told not to build. An agency with no such story has either shipped very little or is editing its history. The follow-up matters more: what changed in how they work? Most AI pilots stall between demo and production, and the teams worth hiring can tell you exactly why theirs did.

The engineering under the demo

4. How do you measure accuracy? Show me an eval stack. "It works well" is not a number. Production AI needs a golden set of real cases with known answers, an accuracy score, and a harness that runs on every change. Ask to see one from a past project; anonymized is fine. If there is no eval suite to show, quality on your project will be a feeling, and feelings collide in steering meetings.
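To make "a golden set, an accuracy score, and a harness that runs on every change" concrete, here is a minimal Python sketch. The names `GoldenCase`, `EvalReport` and `evaluate`, the exact-match scoring and the 0.9 bar are all assumptions made for this example, not any particular agency's stack. Real projects often need fuzzier scoring than string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One real case with a known correct answer (hypothetical schema)."""
    case_id: str
    input_text: str
    expected: str


@dataclass(frozen=True)
class EvalReport:
    accuracy: float
    failed_ids: list[str]
    passed: bool


def evaluate(
    cases: list[GoldenCase],
    predict: Callable[[str], str],
    min_accuracy: float = 0.9,  # assumed bar; set per project
) -> EvalReport:
    """Score `predict` against the golden set using normalized exact match."""
    if not cases:
        raise ValueError("golden set is empty")
    failed = [
        c.case_id
        for c in cases
        if predict(c.input_text).strip().lower() != c.expected.strip().lower()
    ]
    accuracy = (len(cases) - len(failed)) / len(cases)
    return EvalReport(accuracy, failed, accuracy >= min_accuracy)


# Toy stand-in for a model call, used only to exercise the harness.
def toy_predict(text: str) -> str:
    return "invoice" if "total due" in text.lower() else "other"


GOLDEN_SET = [
    GoldenCase("c1", "Total due: 400 EUR", "invoice"),
    GoldenCase("c2", "Meeting moved to Friday", "other"),
    GoldenCase("c3", "TOTAL DUE 90 USD", "invoice"),
    GoldenCase("c4", "Amount payable: 120 GBP", "invoice"),  # toy model misses this
]

report = evaluate(GOLDEN_SET, toy_predict)
print(f"accuracy={report.accuracy:.2f} failed={report.failed_ids} passed={report.passed}")
```

Wired into a build pipeline, a report with `passed` set to false would block the change, and the list of failed case IDs gives the steering meeting something specific to discuss.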

5. What happens when the model is wrong? It will be wrong on some slice of cases. The question is whether that slice routes somewhere designed: a review queue, an escalation path, a human in the loop with context and a deadline. Teams that bring up the human layer unprompted have operated production systems. Teams that promise full automation on day one have not.
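One way to picture a designed route for the uncertain slice is the sketch below. It assumes the system produces a usable confidence score, which is itself something to verify. `Prediction`, `ReviewTask`, `route`, the 0.9 threshold and the four-hour deadline are hypothetical choices for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    case_id: str
    answer: str
    confidence: float  # assumed to be in [0, 1] and reasonably calibrated


@dataclass(frozen=True)
class ReviewTask:
    case_id: str
    proposed_answer: str
    context: str       # what the reviewer needs to decide quickly
    due_at: datetime   # deadline for the human decision


def route(
    pred: Prediction,
    context: str,
    now: datetime,
    auto_threshold: float = 0.9,                 # hypothetical cut-off
    review_sla: timedelta = timedelta(hours=4),  # hypothetical deadline
) -> Optional[ReviewTask]:
    """Return None to auto-accept, or a ReviewTask for a human to resolve."""
    if pred.confidence >= auto_threshold:
        return None
    return ReviewTask(pred.case_id, pred.answer, context, now + review_sla)


now = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
confident = route(Prediction("a1", "approve", 0.97), "claim a1 summary", now)
uncertain = route(Prediction("a2", "approve", 0.55), "claim a2 summary", now)
print(confident, uncertain)
```

The point of the sketch is the shape: low-confidence cases become a task with context and a due time rather than silently passing through.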

6. Which models and frameworks, and why those? The honest answer is "it depends on the case", followed by reasons: accuracy on your data, cost, latency, data residency. Be wary of a shop wedded to one vendor or one framework. Your requirements did not choose their stack; their history did. Portability matters, because model pricing and quality shift constantly.
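Portability usually comes down to keeping vendor-specific code behind a narrow interface. The sketch below uses Python's `typing.Protocol` to show the idea. `TextModel`, `VendorAModel`, `VendorBModel` and `summarize` are hypothetical names, and the adapters return canned strings where a real one would call a vendor SDK.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface the application is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...


class VendorAModel:
    """Hypothetical adapter; a real one would wrap vendor A's SDK here."""
    def complete(self, prompt: str) -> str:
        return f"[vendor-a] {prompt[:20]}"


class VendorBModel:
    """Hypothetical adapter for a second vendor or a self-hosted model."""
    def complete(self, prompt: str) -> str:
        return f"[vendor-b] {prompt[:20]}"


def summarize(model: TextModel, document: str) -> str:
    # Application code sees only the TextModel interface, never a vendor SDK.
    return model.complete(f"Summarize: {document}")


for model in (VendorAModel(), VendorBModel()):
    print(summarize(model, "Quarterly onboarding report"))
```

With this shape, switching models means writing one new adapter and rerunning the evals, not rewriting the application.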

Security you can inspect

7. Where does our data live, and who can touch it? Demand specifics: whose accounts the data sits in, what access the models get, where personal data is masked. The pattern to listen for: least-privilege access, an audit log on every action, PII masking before models see records, data residency honored. "We take security very seriously" contains no information.
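Two parts of that pattern, masking before the model sees a record and logging every action, can be sketched in a few lines. This is an illustration under stated assumptions: the regular expressions are deliberately simplistic, `mask_pii` and `call_model_audited` are hypothetical names, and an in-memory list stands in for a real append-only audit store. Least-privilege access and data residency are infrastructure concerns and are not shown.

```python
import re
from datetime import datetime, timezone
from typing import Callable

# Deliberately simple patterns for illustration; real PII detection needs
# far broader coverage (names, addresses, national IDs, and so on).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace emails and phone-like digit runs with placeholders."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))


def call_model_audited(
    record_id: str,
    text: str,
    actor: str,
    model: Callable[[str], str],
    audit_log: list[dict],
) -> str:
    """Mask first, log who did what to which record, then call the model."""
    masked = mask_pii(text)
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": "model_call",
        "record_id": record_id,
    })
    return model(masked)


seen_by_model: list[str] = []


def fake_model(prompt: str) -> str:
    seen_by_model.append(prompt)  # capture what the model actually receives
    return "ok"


log: list[dict] = []
call_model_audited(
    "rec-17", "Contact priya@example.com or +91 98765 43210 today",
    actor="svc-triage", model=fake_model, audit_log=log,
)
print(seen_by_model[0])
```

A vendor with a real answer to this question can show the equivalent of `seen_by_model` and `log` from a live system: what the model was actually sent, and who triggered it.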

8. What will you claim when our auditor asks? A trick question with a correct answer. No vendor can hand you compliance; that is your auditor's call. A serious partner says its systems are designed to support DPDP, GDPR, HIPAA or SOC 2 expectations, then shows the mechanics: human approval gates before irreversible actions, complete logs, masked data. A vendor who flatly declares its AI "compliant" is answering a question your auditor has not asked yet, and that should worry you.
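A human approval gate before irreversible actions can be as small as the sketch below. `Action`, `ApprovalRequired` and `execute` are hypothetical names, and the set of approved IDs stands in for whatever approval workflow a real system would use. Nothing here confers compliance; it only shows the kind of mechanic an auditor can inspect.

```python
from dataclasses import dataclass
from typing import Callable


class ApprovalRequired(Exception):
    """Raised when an irreversible action has no recorded human approval."""


@dataclass(frozen=True)
class Action:
    action_id: str
    irreversible: bool
    run: Callable[[], str]


def execute(action: Action, approved_ids: set[str], log: list[tuple[str, str]]) -> str:
    """Run the action, but block irreversible ones that nobody has approved."""
    if action.irreversible and action.action_id not in approved_ids:
        log.append((action.action_id, "blocked_pending_approval"))
        raise ApprovalRequired(action.action_id)
    log.append((action.action_id, "executed"))
    return action.run()


log: list[tuple[str, str]] = []
refund = Action("refund-42", irreversible=True, run=lambda: "refund sent")

try:
    execute(refund, approved_ids=set(), log=log)
except ApprovalRequired:
    pass  # in a real system this would open an approval request

result = execute(refund, approved_ids={"refund-42"}, log=log)
print(result, log)
```

Both the blocked attempt and the approved execution land in the log, which is what makes the gate inspectable after the fact.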

The money and the people

9. What is the smallest thing I can buy? The shape of the first engagement shows how an agency thinks about risk, including yours. A fixed, scoped entry that ends in a working system on your real cases, or an honest recommendation not to build, keeps proof cheap. Look at how a firm structures its services and whether each one has a small, priced front door. Then ask about running costs, and listen for the unit: cost per outcome, per published article, per processed document. Build fees are paid once; unit economics are forever.
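The arithmetic behind "build fees are paid once; unit economics are forever" is simple enough to sketch. Every figure below is invented for illustration, and `cost_per_unit` and `cumulative_cost` are hypothetical helper names.

```python
def cost_per_unit(monthly_costs: dict[str, float], units_per_month: int) -> float:
    """Total monthly running cost divided by units of outcome delivered."""
    if units_per_month <= 0:
        raise ValueError("units_per_month must be positive")
    return sum(monthly_costs.values()) / units_per_month


def cumulative_cost(build_fee: float, unit_cost: float, units_per_month: int, months: int) -> float:
    """One-off build fee plus running cost accumulated over `months`."""
    return build_fee + unit_cost * units_per_month * months


# All figures below are invented for illustration.
running = {"model_api": 1200.0, "hosting": 300.0, "human_review": 500.0}
unit_cost = cost_per_unit(running, units_per_month=10_000)  # per processed document
year_one = cumulative_cost(30_000.0, unit_cost, 10_000, months=12)
year_three = cumulative_cost(30_000.0, unit_cost, 10_000, months=36)
print(f"unit cost: {unit_cost:.2f}, year 1: {year_one:,.0f}, year 3: {year_three:,.0f}")
```

With these made-up numbers, running costs of 2,000 a month overtake the 30,000 build fee after about fifteen months, which is why the per-unit figure deserves as much scrutiny as the quote.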

10. Who exactly does the work? The people in the sales call and the people in the repository are often different. Ask who writes the code, how senior they are, whether anything is subcontracted, and how much principal attention your project gets each week. Small senior teams beat large rotating ones at this work. Ask for the ratio, and ask who you can message directly when something breaks.

11. Can you build the whole thing? An AI feature is also a backend, an interface, a deployment pipeline and a cost model. If the agency only does "the AI part", you are signing up to run a vendor relay race, and projects leak time at the seams between vendors. Full-stack AI product teams remove the seam. If you have your own engineers, flip the question: how will the agency pair with them, and hand over to them?

12. What do we own when you leave? The uncomfortable question, so ask it early. Who owns the code, the prompts, the eval sets, the training data? Is there a runbook a new engineer could operate from? Has the agency handed a system over cleanly before, and can you speak to the team that received it? Vendors who plan for their own exit are the ones worth keeping around.

What to do with the answers

Send the list ahead of your first calls and compare what comes back. Answers built from specifics, numbers, named trade-offs and a failure freely admitted mean the team has lived through production. Answers built from adjectives tell you how the project will go.

Notice that none of the twelve ask which model is smartest. Model choice is the most reversible decision in the whole project. Evidence, evals, security and handover are not.

We are glad to be tested against our own list. extendfuture has built AI systems for clients worldwide since 2019, and every engagement starts with a founder-led thirty-minute call. Bring all twelve questions to it: book a call.

Radhika Menon, Product Lead. Turns AI capability into products people actually adopt, with adoption and unit economics instrumented from day one. extendfuture on LinkedIn. Reviewed by Amol Patil, Founder.
