The vanity metrics trap
Three commonly cited indicators do not measure what they claim to:
- Number of queries: a user who asks the same question ten times because the answer is bad inflates the metric.
- Number of users created: opened accounts say nothing about real usage.
- NPS on AI: too generic, too influenced by the novelty effect.
Here are the six KPIs that measure real value.
1. Active adoption: weekly users / users created
Definition. Percentage of created users who asked at least three questions in the past seven days.
Healthy threshold. > 60 percent at three months, > 70 percent at six months.
Diagnosis. Below 50 percent at three months, usage is shallow: revisit onboarding, check whether the assistant is actually faster than the existing tools.
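As a minimal sketch of the definition above, the ratio can be computed from a query log. The log format (pairs of user and timestamp), the function name `active_adoption`, and the sample users are assumptions made for illustration, not part of any specific product.

```python
from collections import Counter
from datetime import datetime, timedelta

def active_adoption(created_users, query_log, now, min_questions=3, window_days=7):
    """Share of created users with at least `min_questions` queries in the window.

    `created_users` is a set of user ids; `query_log` is an iterable of
    (user_id, asked_at) pairs. Both shapes are assumptions for this sketch.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(
        user for user, asked_at in query_log
        if user in created_users and cutoff < asked_at <= now
    )
    active = sum(1 for n in counts.values() if n >= min_questions)
    return active / len(created_users) if created_users else 0.0

now = datetime(2024, 6, 10)
created = {"ana", "ben", "chloe", "dev"}
log = (
    [("ana", now - timedelta(days=d)) for d in (1, 2, 3)]         # 3 recent questions
    + [("ben", now - timedelta(days=d)) for d in (1, 2)]          # only 2 questions
    + [("chloe", now - timedelta(days=d)) for d in (10, 11, 12)]  # outside the window
)
rate = active_adoption(created, log, now)
print(f"{rate:.0%}")  # 25%
```

Only one of the four created accounts counts as active here, which is exactly the gap that a raw "users created" figure hides.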
2. Measured accuracy: percent of correct answers on a test set
Definition. On a stable test set of 30 to 50 questions, the percent of answers judged correct by a business reviewer.
Healthy threshold. > 90 percent in stable production.
Diagnosis. Below 85 percent, suspect a problem with the source documentation or the configuration. See Good documentation makes a good assistant.
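The scoring itself is simple arithmetic once a reviewer has marked each test answer. The sketch below assumes the verdicts are stored as a list of booleans; the function names and the "watch" label for the band between 85 and 90 percent are illustrative assumptions, since the text only names the two outer thresholds.

```python
def measured_accuracy(verdicts):
    """Fraction of test-set answers the reviewer judged correct."""
    if not verdicts:
        raise ValueError("the test set is empty")
    return sum(1 for ok in verdicts if ok) / len(verdicts)

def diagnose_accuracy(accuracy):
    """Map an accuracy value to a status using the thresholds from the text."""
    if accuracy > 0.90:
        return "healthy"
    if accuracy >= 0.85:
        return "watch"        # assumed label: the text does not name this band
    return "investigate"      # documentation or configuration problem likely

# 40-question test set, 37 answers judged correct by the reviewer
verdicts = [True] * 37 + [False] * 3
accuracy = measured_accuracy(verdicts)
print(f"{accuracy:.1%}", diagnose_accuracy(accuracy))
```

Keeping the test set stable matters more than the code: the number is only comparable over time if the same questions are re-scored after each change.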
3. Documentation coverage: percent of “I don’t know” answers
Definition. Percent of queries where the assistant says it can’t answer.
Healthy threshold. Between 5 percent and 15 percent in steady state. Below 5 percent, the assistant is likely making things up (hallucinations). Above 15 percent, the corpus is too thin for real usage.
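Because this indicator is healthy inside a band rather than above a floor, an alert needs to fire on both sides. A minimal sketch, assuming the logs already flag which answers were "I don't know" (how that flag is produced is outside this example, and the function names are hypothetical):

```python
def idk_rate(total_queries, idk_answers):
    """Share of queries where the assistant said it could not answer."""
    if total_queries <= 0:
        raise ValueError("no queries in the period")
    return idk_answers / total_queries

def coverage_status(rate, low=0.05, high=0.15):
    """Classify the rate against the 5 to 15 percent band from the text."""
    if rate < low:
        return "too low: possible hallucinations"
    if rate > high:
        return "too high: corpus too thin"
    return "healthy"

for total, idk in [(1000, 20), (1000, 90), (1000, 240)]:
    rate = idk_rate(total, idk)
    print(f"{rate:.0%} -> {coverage_status(rate)}")
```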
4. Time saved: valued in hours and euros
Definition. Hours saved per user per week on the covered tasks (search, writing, summary), valued at fully loaded hourly cost.
Measurement method. Quarterly survey on a panel of 10 to 15 users: “for the questions you asked the assistant this week, how long would it have taken you without it?”
Healthy threshold. 3 to 8 hours per user per week for moderate use cases. More for intensive cases (legal, support, doctrine).
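One way to turn the survey into a euro figure is to average the panel's answers and extrapolate to all active users. This is a sketch under assumptions: the extrapolation from panel to active users, the function name, and every number below (panel answers, 100 active users, 50 euros per hour) are made up for illustration.

```python
from statistics import mean

def weekly_value_eur(panel_hours, active_users, hourly_cost_eur):
    """Return (average hours saved per user, estimated weekly value in euros).

    `panel_hours` holds one self-reported figure per surveyed user. The result
    assumes the panel is representative of all active users.
    """
    hours_per_user = mean(panel_hours)
    return hours_per_user, hours_per_user * active_users * hourly_cost_eur

# Hypothetical panel of 10 users, hours saved per week
panel = [4.0, 5.0, 3.0, 6.0, 2.0, 4.0, 5.0, 3.0, 6.0, 2.0]
hours, value = weekly_value_eur(panel, active_users=100, hourly_cost_eur=50.0)
print(f"{hours:.1f} h per user per week, about {value:,.0f} EUR per week")
```

Self-reported time tends to be optimistic, so it may be safer to present the result as an order of magnitude than as a precise figure.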
5. Offload: percent of requests handled before reaching a human
Definition. For assistants open to the public or support: percent of requests fully handled by the assistant, with no escalation to an agent.
Healthy threshold. Between 50 percent and 75 percent depending on scope. The broader the scope, the lower the offload rate.
6. User satisfaction: structured feedback, not generic NPS
Definition. Three short questions, asked at the end of a conversation on 10 percent of traffic:
- Was this answer helpful? (yes/no)
- Did you save time? (1 to 5)
- Would you recommend this assistant? (1 to 10)
Healthy threshold. > 80 percent helpful answers, > 4/5 on time saved, > 8/10 on recommendation.
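The three questions and their thresholds can be aggregated as follows. This is a sketch, not a reference implementation: the response fields (`helpful`, `time_saved`, `recommend`), the function names, and the hash-based way of picking roughly 10 percent of conversations are all assumptions.

```python
import hashlib
from statistics import mean

def in_survey_sample(conversation_id, sample_percent=10):
    """Deterministically select about `sample_percent` percent of conversations."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 < sample_percent

def satisfaction_summary(responses):
    """Aggregate survey responses and check them against the thresholds in the text."""
    helpful = mean(1.0 if r["helpful"] else 0.0 for r in responses)
    time_saved = mean(r["time_saved"] for r in responses)   # scale 1 to 5
    recommend = mean(r["recommend"] for r in responses)     # scale 1 to 10
    return {
        "helpful_rate": helpful,
        "time_saved_avg": time_saved,
        "recommend_avg": recommend,
        "healthy": helpful > 0.80 and time_saved > 4 and recommend > 8,
    }

# Hypothetical responses: nine satisfied users, one dissatisfied
responses = [{"helpful": True, "time_saved": 5, "recommend": 9}] * 9
responses += [{"helpful": False, "time_saved": 3, "recommend": 6}]
summary = satisfaction_summary(responses)
print(summary)
```

Hashing the conversation id keeps the sample stable across restarts, so the same conversation is never surveyed in one place and skipped in another.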
To turn these indicators into a ROI calculation, see Calculating and maximizing ROI.