I have shipped four KB-grounded chatbots for UK SMEs in 2025. Where they work, and where they do not, follows a clear pattern.
Deflection rates
Source: Client analytics, anonymised
The coffee chain saw 62 percent ticket deflection. The council saw 29 percent. The retailer 48 percent. The solicitor 12 percent. That spread is the story.
The coffee chain had 5,000 tickets/month, 80 percent of them on the same 30 questions. Perfect bot territory.
The solicitor had 200 tickets/month, all bespoke. The bot answered the easy navigation questions and could not touch the legal ones. Low deflection but still useful for the gateway questions.
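To make the spread concrete, the percentages can be turned into absolute ticket counts. This is a minimal sketch using only the two clients whose monthly volumes are given above; the function name is illustrative.

```python
def deflected_per_month(tickets_per_month: int, deflection_rate: float) -> float:
    """Tickets per month that the bot resolves without a human."""
    return tickets_per_month * deflection_rate


# (monthly ticket volume, deflection rate) as quoted in this post
clients = {
    "coffee chain": (5000, 0.62),
    "solicitor": (200, 0.12),
}

for name, (volume, rate) in clients.items():
    count = deflected_per_month(volume, rate)
    print(f"{name}: ~{count:.0f} tickets/month deflected")
```

The same percentage gap looks far larger in absolute terms, because the volume gap multiplies it.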
Pattern
| Factor | Bot works | Bot does not |
|---|---|---|
| Volume of repetitive Qs | High | Low |
| Stable knowledge base | Yes | No (changing weekly) |
| Tone tolerance | Casual OK | Formal-only required |
| Compliance ceiling | B2C | Regulated B2B (legal/medical) |
| User base | Self-serve happy | Expects human |
The variables that predict bot success:
- Volume. Below 500 tickets/month, the bot will not deflect enough to justify the build.
- Knowledge stability. If the KB changes weekly, the bot's answers go stale.
- User self-serve tolerance. Some user bases want a human; no bot will beat that expectation.
- Compliance. Regulated industries need outputs reviewed; the bot becomes a draft tool, not a deflection tool.
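The four variables above can be written down as a rough pre-sales checklist. This is a sketch, not a scoring model: the `SupportProfile` fields and the warning wording are hypothetical, and the only threshold taken from this post is the 500 tickets/month floor.

```python
from dataclasses import dataclass


@dataclass
class SupportProfile:
    """Hypothetical summary of a client's support situation."""
    tickets_per_month: int
    kb_changes_weekly: bool
    users_expect_human: bool
    regulated: bool


def fit_warnings(profile: SupportProfile, min_volume: int = 500) -> list[str]:
    """Return one warning per variable that argues against a deflection bot."""
    warnings = []
    if profile.tickets_per_month < min_volume:
        warnings.append("volume: too few tickets to justify the build")
    if profile.kb_changes_weekly:
        warnings.append("knowledge stability: answers will go stale")
    if profile.users_expect_human:
        warnings.append("self-serve tolerance: users expect a human")
    if profile.regulated:
        warnings.append("compliance: outputs need review, so draft tool only")
    return warnings


# Illustrative profile, loosely shaped like a low-volume regulated client
print(fit_warnings(SupportProfile(200, False, True, True)))
```

An empty list does not guarantee success; it only means none of the four known blockers applies.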
The build
Stack:
- Postgres + pgvector for the embeddings store[1]
- gpt-4o-mini or gemini-2.5-flash for cheap retrieval-grounded answers
- Resend for the "escalate to human" email path
- Two clear "I don't know, talk to a human" buttons in the UI
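A minimal sketch of how the retrieval step over pgvector might look. The table name `kb_chunks`, its columns, and the `MAX_DISTANCE` cutoff are assumptions for illustration; `<=>` is pgvector's cosine distance operator. The query is meant to be run through a Postgres driver with the question's embedding passed as a vector literal. The automatic hand-off when nothing relevant is found is an illustrative addition alongside the manual buttons.

```python
# Hypothetical schema: kb_chunks(id, content text, embedding vector)
TOP_K_SQL = """
SELECT content, embedding <=> %s::vector AS distance
FROM kb_chunks
ORDER BY distance
LIMIT %s
"""

MAX_DISTANCE = 0.35  # assumed cutoff; tune against real conversations


def to_vector_literal(embedding: list[float]) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in embedding) + "]"


def build_context(rows, max_distance: float = MAX_DISTANCE):
    """Join retrieved chunks that are close enough to the question.

    rows: (content, distance) tuples as returned by TOP_K_SQL.
    Returns None when nothing is relevant, which the caller should treat
    as "I don't know, talk to a human" rather than letting the model guess.
    """
    relevant = [content for content, distance in rows if distance <= max_distance]
    if not relevant:
        return None
    return "\n\n".join(relevant)
```

With a driver such as psycopg, the call would be along the lines of `cursor.execute(TOP_K_SQL, (to_vector_literal(question_embedding), 5))`, with the resulting rows passed to `build_context`.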
Total build time: 30-40 hours per deployment.
Total cost to client: £4-8k for the build, £40-80/month for the runtime.
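Whether those costs pay back depends on what a human-handled ticket costs the client, which this post does not state. The sketch below treats that figure as a parameter; the £3 value is a placeholder assumption, not client data. The 3,100 and 24 inputs are the coffee chain's and solicitor's deflected tickets per month (5,000 × 62 percent and 200 × 12 percent).

```python
from typing import Optional


def payback_months(
    build_cost: float,
    runtime_per_month: float,
    deflected_per_month: float,
    cost_per_human_ticket: float,
) -> Optional[float]:
    """Months until the build cost is recovered, or None if it never is."""
    net_saving = deflected_per_month * cost_per_human_ticket - runtime_per_month
    if net_saving <= 0:
        return None  # runtime cost eats the whole saving
    return build_cost / net_saving


HYPOTHETICAL_COST_PER_TICKET = 3.0  # GBP; assumption, replace with the client's figure

# Top of the quoted ranges: £8k build, £80/month runtime
print(payback_months(8000, 80, 3100, HYPOTHETICAL_COST_PER_TICKET))
print(payback_months(8000, 80, 24, HYPOTHETICAL_COST_PER_TICKET))
```

Under that placeholder cost, the low-volume case never pays back on deflection alone, which is the volume argument expressed in money.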
Where I refused to ship
A medical practice asked for a triage bot. I declined. Even with disclaimers, the liability exposure is too large.
A small accountancy asked for a "quick tax question" bot. Declined. Same reason.
Recipe for SME deployments
- Audit support inbox for top 50 questions
- Verify knowledge base covers them (or write the missing answers)
- Pilot for 30 days with humans monitoring every conversation
- After 30 days, decide whether the deflection achieved justifies the UX cost
The pilot phase kills 30 percent of these projects before launch. That is the right outcome: the projects where the bot would have been a poor fit get caught before the client invests further.
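The audit step in the recipe can be sketched with a simple frequency count. This assumes each ticket has already been tagged with a normalised question label, by hand or otherwise; that labelling step and the function name are hypothetical.

```python
from collections import Counter


def top_question_coverage(labels: list[str], top_n: int = 50):
    """Return the top_n most common question labels and the share of
    all tickets they account for."""
    counts = Counter(labels)
    top = counts.most_common(top_n)
    covered = sum(n for _, n in top)
    share = covered / len(labels) if labels else 0.0
    return top, share


# Toy inbox with made-up labels, for illustration only
inbox = ["opening hours"] * 3 + ["refund"] * 2 + ["bespoke query"]
top, share = top_question_coverage(inbox, top_n=2)
print(top, f"{share:.0%} of tickets")
```

A high share for the top questions is the signal that the inbox is repetitive enough for a bot; a low share suggests mostly bespoke work.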
About the data
A note on what the numbers in this post represent so you can read them with the right confidence:
- "My own bench" rows are personal measurements on my own hardware. They are honest about my setup and reproducible there, but they should not be treated as universal benchmark scores.
- Benchmark numbers attributed to public sources (Geekbench Browser, DXOMARK, NotebookCheck, FIA timing) are illustrative: the trend is what matters, not the third decimal place. Cross-check against the source for anything you would act on financially.
- Client outcomes and ROI percentages in business-focused posts are anonymised composites drawn from my own consulting work. Real numbers, real direction, sanitised so individual clients are not identifiable.
- Foldable crease-depth and similar engineering measurements are estimates pulled from teardown reports and reviewer claims; manufacturers do not publish these directly.
- Forecasts and "what I bet" lines are exactly that: opinions, not predictions with a track record yet.
If you spot a number that contradicts a source you trust, tell me. I would rather correct it than be the chart that was off by 6 percent and pretended otherwise.