A quoting tool at a services organisation stopped including GST on its quotes. Nobody noticed for six weeks. The fix took an hour. The refunds, apologies, and one lost client took longer.
That's AI drift. The model didn't break — it just started behaving slightly differently, and nobody was watching closely enough to catch it.
If you're running AI inside your business, you need a way to know it's still doing what you hired it to do. The good news: you don't need an ML team to do this. You need three habits and about thirty minutes a week.
Why AI drifts in the first place
AI tools change. The underlying model gets updated by the vendor. Your prompts get tweaked by someone on the team. Your input data shifts — customers start asking different questions, products change, edge cases appear.
Any of those can quietly shift the output. The AI keeps working. It just stops working the way you set it up to work.
LangChain's 2026 State of AI Agents report surveyed over 1,300 professionals and found quality is the top barrier to getting agents into production — cited by 32% of respondents. The same report found 89% of teams have observability in place, but only 52% have actual evaluation processes (LangChain). Most teams can see what their AI is doing. Far fewer can tell whether it's doing it well.
This is the gap. And it's where customer-visible failures hide.
Check 1: The random sample read
Once a week, pull 5 to 10 outputs from your AI at random. Read them. That's it.
Not the ones the team flagged. Not the success stories the vendor sent you. Random. A spreadsheet pull, a database query, a colleague hitting "shuffle" on the support inbox.
You're looking for anything that feels off. A customer service AI that started rewriting people's names. A summary that's technically correct but oddly phrased. A quote that's missing a line item. You won't have a precise definition of "wrong" — that's fine. You're using your business sense, not a metric.
The point isn't to catch every error. It's to keep a human eye on the output so drift doesn't go six weeks without being noticed.
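If your outputs live in a log or export, the random pull can be scripted so nobody is tempted to cherry-pick. The following is a minimal sketch, not a prescribed tool: `weekly_sample` and the `outputs` list are hypothetical names, and it assumes you can load last week's outputs as a list of text strings.

```python
import random


def weekly_sample(outputs, k=10, seed=None):
    """Return up to k outputs chosen uniformly at random, without repeats."""
    rng = random.Random(seed)   # pass a seed only if you need a repeatable pull
    k = min(k, len(outputs))    # avoid an error when there are fewer than k outputs
    return rng.sample(outputs, k)


# Hypothetical stand-in for last week's exported AI outputs.
outputs = [f"Quote #{i}: total incl. GST ${100 + i}" for i in range(1, 201)]

for text in weekly_sample(outputs, k=5):
    print(text)
```

Leaving `seed` unset gives a different sample each week, which is what you want for this check.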
Check 2: Golden questions
Write down 10 inputs you know the correct answer to. Real ones, from your business. Mix easy and tricky cases.
For a customer service AI: 10 common questions with the right answers. For a quoting tool: 10 sample jobs with the price you'd quote. For a summary tool: 10 documents with the points you'd expect in the summary.
Run them every week. Same inputs, every time. If the outputs change, you have a signal — and you can show someone exactly what changed and when.
This is the single most useful thing most SMEs can do. It takes 15 minutes, and it catches the rare failures that a random sample is likely to miss.
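A golden-question run can be as simple as a list of inputs paired with something the answer must contain. The sketch below is illustrative only: `GOLDEN`, `run_golden_checks`, and `fake_ai` are hypothetical names, and `fake_ai` stands in for whatever call or copy-paste step actually reaches your tool. It also assumes a "must contain this phrase" check is good enough for your case, which will not suit every kind of output.

```python
# Each golden case: (input you send, phrase the answer must contain).
GOLDEN = [
    ("Quote: 2 hours on-site support", "incl. GST"),
    ("Quote: annual maintenance plan", "incl. GST"),
    ("What are your opening hours?", "9am to 5pm"),
]


def run_golden_checks(golden, ask_ai):
    """Return (input, expected phrase, actual answer) for every failing case."""
    failures = []
    for question, must_contain in golden:
        answer = ask_ai(question)
        if must_contain.lower() not in answer.lower():
            failures.append((question, must_contain, answer))
    return failures


def fake_ai(question):
    """Hypothetical stand-in for the real AI tool, with one drifted answer."""
    canned = {
        "Quote: 2 hours on-site support": "Total: $330.00 incl. GST",
        "Quote: annual maintenance plan": "Total: $1,200.00",  # GST wording missing
        "What are your opening hours?": "We are open 9am to 5pm, Monday to Friday.",
    }
    return canned[question]


for question, expected, actual in run_golden_checks(GOLDEN, fake_ai):
    print(f"FAILED: {question!r} should mention {expected!r}, got {actual!r}")
```

Saving each week's failures with a date gives you the "what changed and when" record described above.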
Check 3: The trend line
Pick one number you can pull from your AI's outputs and track it weekly.
Some that work well: average output length, percentage of outputs containing a specific required element (a price, a disclaimer, a signature), sentiment score, escalation rate, on-topic rate.
Shifts worth a look: a summary tool whose summaries got 30% longer over six months; a support AI whose escalation rate dropped from 18% to 6% (great, unless it stopped flagging things it should); a quoting tool whose average quote value drifted up 12%.
You're not trying to explain the number. You're trying to spot the shift. When the line moves, that's when you go look.
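As one way to track such a number, the sketch below computes the percentage of outputs containing a required element and flags a week that sits far from the average of earlier weeks. The function names, the sample data, and the 10-point tolerance are all assumptions chosen for illustration; pick a threshold that matches how noisy your own number is.

```python
from statistics import mean


def share_containing(outputs, required):
    """Percentage of outputs that contain the required text (case-insensitive)."""
    if not outputs:
        return 0.0
    hits = sum(1 for text in outputs if required.lower() in text.lower())
    return 100.0 * hits / len(outputs)


def has_shifted(history, latest, tolerance=10.0):
    """True if latest is more than `tolerance` points from the average of history."""
    if not history:
        return False  # nothing to compare against yet
    return abs(latest - mean(history)) > tolerance


# Hypothetical data: earlier weekly percentages, then this week's raw outputs.
history = [98.0, 97.0, 99.0]
this_week = ["Total: $110 incl. GST"] * 4 + ["Total: $100"] * 6

latest = share_containing(this_week, "GST")
if has_shifted(history, latest):
    print(f"Go look: {latest:.0f}% of quotes mention GST, earlier weeks averaged {mean(history):.0f}%")
```

The same two functions work in a spreadsheet as a percentage column and a comparison against the running average; the code is only one way to do it.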
What to do when a check fails
Don't panic and don't tear down the system. Drift usually has a small cause.
Walk it backwards. When did the metric or output last look right? What changed since — vendor update, prompt edit, new data source, new product launched? Compare two outputs side by side: one from before, one from now. The shift is almost always visible.
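For text outputs, the before-and-after comparison can be done with Python's standard `difflib` module. This is a small sketch with made-up quote text; `show_changes` is a hypothetical helper name, and it assumes you kept a copy of an older output for the same input.

```python
import difflib


def show_changes(before, after):
    """Return a line-by-line diff: '-' lines were removed, '+' lines were added."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="before",
        tofile="now",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical outputs for the same job, six weeks apart.
before = "Labour: $300\nGST: $30\nTotal: $330"
after = "Labour: $300\nTotal: $300"

print(show_changes(before, after))
```

Lines starting with `-` appear only in the older output and lines starting with `+` only in the newer one, so a dropped line item stands out immediately.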
Then fix the smallest thing that explains it. Roll back a prompt change. Add a missing instruction. Flag a vendor model update. Don't rebuild — adjust.
If you're rebuilding more than you're adjusting, the system was probably too ambitious for what your team can monitor. That's a separate conversation, and it's one we have a lot at Kursol — most AI failures aren't model failures, they're scope failures. We've written about why we augment teams instead of automating them away, and the same logic applies to quality control: you keep humans in the loop because humans notice when something is off.
The closing position
You don't need to evaluate AI like an ML engineer. You need to quality-check it like an operations manager.
Pull random samples. Keep golden questions. Watch one trend line. That's the whole system. It's a habit, not a project.
Gartner forecasts that 40% of organisations deploying AI will use dedicated evaluation and observability platforms by 2028 (Gartner). Most SMEs won't get there — and don't need to. What you need is enough discipline to know whether the AI you're paying for is still helping your team. That's a thirty-minute weekly habit, not a tooling problem.
The AI doesn't know it's drifting. Your customers will find out before you do unless you check.
FAQ
How often to check: weekly is the right cadence for most SMEs. Daily is overkill unless your AI handles high-volume customer-facing tasks. Monthly is too slow, because drift that goes a month unnoticed usually means lost customers.
You don't need special software. A spreadsheet and a calendar reminder are enough to start. If your AI volume is high, basic logging from your vendor plus a weekly dashboard is plenty. Dedicated evaluation tools are useful at scale, but they're not a prerequisite.
If your team can't spare thirty minutes a week, your AI is probably doing more than your team can supervise. That's a scope problem, not a time problem. Either reduce what the AI is doing, or assign one person 30 minutes a week to own the checks. Both are cheaper than catching a six-week drift after a customer complains.