We ask an AI model to classify the same 12 borderline requests, 8 times in a row. Because its answers aren't perfectly repeatable, its accuracy shifts from run to run, and a single score can mislead you.
GPT-3.5 Turbo sits right at its limit on these cases, so it wobbles hard — easy to see. Switch to a stronger model and watch the spread tighten… but never fully flatten.
Before anything runs, here are the requests sent to the model on every run. Each has one correct label (the chip): Answerable, Clarify, Contradictory, or Nonsense. The right answer never changes; we measure how often the model agrees with it, and how often it agrees with itself from one run to the next.
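The two measurements described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind this page: the function names `accuracy` and `self_agreement`, the "share of runs that gave the most common label" definition of self-agreement, and the tiny sample data are all made up for the example.

```python
from collections import Counter

def accuracy(predicted, gold):
    """Fraction of requests in one run whose label matches the correct one."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def self_agreement(runs):
    """Average, over requests, of the share of runs that gave the
    model's most common label for that request (1.0 = always the same)."""
    per_request = []
    for labels in zip(*runs):  # all labels one request received across runs
        top_count = Counter(labels).most_common(1)[0][1]
        per_request.append(top_count / len(labels))
    return sum(per_request) / len(per_request)

# Hypothetical data: 3 requests, 4 runs (the demo uses 12 requests, 8 runs).
gold = ["Answerable", "Clarify", "Nonsense"]
runs = [
    ["Answerable", "Clarify", "Nonsense"],
    ["Answerable", "Answerable", "Nonsense"],
    ["Answerable", "Clarify", "Contradictory"],
    ["Answerable", "Answerable", "Nonsense"],
]

per_run_accuracy = [accuracy(run, gold) for run in runs]
consistency = self_agreement(runs)
```

Note that the two numbers answer different questions: a model can agree with itself on every run and still be wrong every time, so self-agreement complements accuracy and does not replace it.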
Each run is a full pass through all 12 requests. This big number is just the latest run — like checking your speed once and calling it your average driving speed.
Try it: switch the model at the top and run again. A stronger model's spread shrinks, but the underlying randomness doesn't go away. Every model's output is stochastic; a frontier model just wobbles less. And that's the real trap: the danger isn't this big, obvious swing on a weak model. It's the small, unmeasured wobble in the frontier model you're shipping to production, the one nobody is putting error bars on. We picked a weak model here only so you can see in 60 seconds what's always there.
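For readers who want to put error bars on their own evaluation, here is a minimal sketch of one way to summarize repeated runs. The helper name `summarize` and the correct-answer counts are hypothetical, chosen only to illustrate the arithmetic. The standard error here captures run-to-run noise on a fixed set of requests; it says nothing about how well those 12 requests represent real traffic.

```python
import math
import statistics

def summarize(run_accuracies):
    """Summarize per-run accuracies: latest run, mean, range, and spread."""
    n = len(run_accuracies)
    sd = statistics.stdev(run_accuracies) if n > 1 else 0.0
    return {
        "latest": run_accuracies[-1],
        "mean": statistics.mean(run_accuracies),
        "min": min(run_accuracies),
        "max": max(run_accuracies),
        "stdev": sd,                   # sample standard deviation across runs
        "stderr": sd / math.sqrt(n),   # standard error of the mean
    }

# Hypothetical counts of correct labels (out of 12) over eight runs.
correct_counts = [7, 9, 6, 8, 10, 7, 8, 9]
accuracies = [c / 12 for c in correct_counts]

stats = summarize(accuracies)
print(
    f"latest {stats['latest']:.2f}, "
    f"mean {stats['mean']:.2f} ± {stats['stderr']:.2f}, "
    f"range {stats['min']:.2f} to {stats['max']:.2f}"
)
```

Reporting the mean with its standard error (or simply the min-to-max range) next to the latest score makes it harder to mistake one lucky or unlucky run for the model's typical behavior.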