You run your eval suite against your agent and see a pass rate of 91%. You make one small change to a prompt. You run it again. Now it's 95%. You run it a third time, changing nothing: 97%. A fourth: back to 91%.

So — did your change help? Hurt? Do nothing? You genuinely cannot tell. And if you can't tell, then the number you've been pasting into your standup, your investor update, and your "ready to ship" decision was never really telling you anything.

We've started calling this Schrödinger's Eval: the agent both passes and fails, and it stays in superposition until a real user opens the box. Same eval, same agent, run after run, and the score refuses to hold still.

Stop treating agent evaluation as a win/loss count. Start treating it as a probabilistic question.

Why the number won't hold still

The instinct is to assume the wobble is a bug — a flaky test, a bad network call, a judge that hiccuped. It isn't. The wobble is the truth, and the stable single number was the lie.

An LLM-driven agent is stochastic by construction. The same input does not produce the same output; temperature, sampling, and tool-call ordering see to that. So any given test isn't a fact about your agent — it's a coin flip weighted by your agent's quality. Run it once and you observe one flip. Run your hundred-test suite once and you've observed a hundred single flips, then collapsed them into one percentage as if you'd measured something fixed.

You didn't. You sampled a distribution exactly once and reported the sample as the parameter. The honest version of "91%" was always "somewhere around 91%, and ask me again and I'll say something different."
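To make the coin-flip picture concrete, here is a minimal simulation sketch. It assumes an idealized agent whose every test passes independently with the same fixed probability; the function names, the 0.93 "true" pass rate, and the seed are illustrative assumptions, not measurements of any real agent.

```python
import random


def run_suite_once(true_pass_prob: float, n_tests: int, rng: random.Random) -> float:
    """Simulate one suite run: each test is a weighted coin flip."""
    passes = sum(rng.random() < true_pass_prob for _ in range(n_tests))
    return passes / n_tests


def simulate_runs(true_pass_prob: float = 0.93, n_tests: int = 100,
                  n_runs: int = 20, seed: int = 0) -> list[float]:
    """Run the same simulated suite n_runs times with nothing changed."""
    rng = random.Random(seed)
    return [run_suite_once(true_pass_prob, n_tests, rng) for _ in range(n_runs)]


if __name__ == "__main__":
    rates = simulate_runs()
    # The agent never changed, yet the reported pass rate moves from run to run.
    print("min:", min(rates), "max:", max(rates))
```

Even in this best case, where every test really is independent, the reported score moves between runs. Real agent evals are usually noisier than this, not cleaner.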

The win/loss trap

Treating eval as a scoreboard — this build beat the last one, ship it — fails in three specific, expensive ways.

1. You can't separate signal from noise

91% → 93% looks like progress. But if the run-to-run noise on this suite is ±4 points, that "improvement" is indistinguishable from rolling the dice again. Teams ship regressions and revert improvements every week on exactly this confusion. The fix is the thing every other empirical field already does: put error bars on the number. Run the eval enough times to estimate a confidence interval, and a report like "91% ± 4% (n=50 runs)" tells you instantly whether 93% is real or a mirage.
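As a rough sketch of what "error bars" can mean here: collect the pass rate from several identical runs, then report the mean, the run-to-run spread, and an interval for the mean. The helper names are hypothetical, and the 1.96 multiplier assumes a normal approximation that is only reasonable with a decent number of runs (with a handful of runs, a t-based interval would be wider).

```python
import math
import statistics


def summarize_runs(pass_rates: list[float]) -> dict:
    """Summarize repeated runs of the same suite against the same agent."""
    n = len(pass_rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(pass_rates)
    run_sd = statistics.stdev(pass_rates)   # run-to-run noise (sample std dev)
    sem = run_sd / math.sqrt(n)             # uncertainty of the mean itself
    return {
        "n": n,
        "mean": mean,
        "run_sd": run_sd,
        "ci95": (mean - 1.96 * sem, mean + 1.96 * sem),  # normal approximation
    }


def within_noise(new_rate: float, summary: dict, z: float = 2.0) -> bool:
    """True if a single new run is within z standard deviations of the baseline."""
    return abs(new_rate - summary["mean"]) <= z * summary["run_sd"]


if __name__ == "__main__":
    baseline = summarize_runs([0.91, 0.95, 0.97, 0.91])
    print(baseline)
    print("0.93 within noise?", within_noise(0.93, baseline))
```

The point of `within_noise` is the one made above: a single new score that lands inside the baseline's own wobble is not evidence of anything.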

2. The "independent trials" assumption is quietly false

The math that would let you trust one run assumes each test is an independent, identically distributed draw. Agent evals violate this constantly. Tests cluster by scenario, by persona, by tool path; failures correlate; unlikely-looking runs show up far more often than a clean coin-flip model predicts. If you don't model that, your confidence interval is too narrow and your green is overconfident.
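One common way to account for clustering is a cluster bootstrap: resample whole scenarios rather than individual tests, so correlated failures move together. The sketch below is one option under stated assumptions, not the only valid model; the `(cluster_id, passed)` input shape and the function names are hypothetical.

```python
import random
from collections import defaultdict


def cluster_bootstrap_ci(results, n_boot: int = 2000, alpha: float = 0.05,
                         seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the pass rate, resampling whole clusters.

    results: iterable of (cluster_id, passed) pairs, one per test.
    """
    by_cluster = defaultdict(list)
    for cluster_id, passed in results:
        by_cluster[cluster_id].append(1 if passed else 0)
    clusters = list(by_cluster.values())

    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        # Draw clusters with replacement; every test in a cluster travels together.
        sample = rng.choices(clusters, k=len(clusters))
        total = sum(len(c) for c in sample)
        rates.append(sum(sum(c) for c in sample) / total)

    rates.sort()
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def naive_bootstrap_ci(results, **kwargs) -> tuple[float, float]:
    """Same bootstrap, but pretending every test is its own independent cluster."""
    singletons = [(i, passed) for i, (_, passed) in enumerate(results)]
    return cluster_bootstrap_ci(singletons, **kwargs)
```

When failures concentrate in a few scenarios, the clustered interval comes out wider than the naive one. That extra width is the overconfidence the independence assumption was hiding.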

3. Pooling hides the regression that matters

A single pooled pass rate is an average, and averages bury their worst cases. Your agent can sit at a comfortable 95% overall while the "frustrated non-native speaker trying to cancel" persona has quietly collapsed to 60%. The pooled number never twitches. The churn shows up in your support queue six weeks later. You have to evaluate per group — per persona, per scenario, per risk category — or the one cohort you most needed to catch is the one the average erases.
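A per-group breakdown needs very little code. The sketch below assumes each test result carries a group label (persona, scenario, or risk category); the function names, the 0.8 floor, and the example labels are illustrative assumptions.

```python
from collections import defaultdict


def pass_rate_by_group(results) -> dict[str, float]:
    """results: iterable of (group, passed) pairs, one per test."""
    counts = defaultdict(lambda: [0, 0])  # group -> [passes, total]
    for group, passed in results:
        counts[group][0] += int(passed)
        counts[group][1] += 1
    return {group: passes / total for group, (passes, total) in counts.items()}


def flag_weak_groups(results, floor: float = 0.8) -> list[str]:
    """Groups whose pass rate falls below the floor, regardless of the pooled rate."""
    rates = pass_rate_by_group(results)
    return sorted(group for group, rate in rates.items() if rate < floor)


if __name__ == "__main__":
    # Hypothetical data: pooled rate is 95%, but one persona sits at 60%.
    results = (
        [("cancel_non_native", True)] * 6 + [("cancel_non_native", False)] * 4
        + [("everything_else", True)] * 89 + [("everything_else", False)] * 1
    )
    pooled = sum(passed for _, passed in results) / len(results)
    print("pooled:", pooled)
    print("weak groups:", flag_weak_groups(results))
```

Small groups have wide error bars of their own, so in practice each per-group rate deserves an interval as well, not just a point estimate.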

The reframe in one line

An agent's quality is a distribution, not a verdict. The job of an eval is to estimate that distribution well enough to support a confident ship / no-ship decision — with error bars, accounting for correlation, and broken out by group.

What an honest eval actually requires

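Once you accept that you're estimating a distribution rather than reading a verdict, the requirements fall out on their own: repeat the suite enough times to see the spread, put a confidence interval on the pass rate, account for correlated failures when you compute it, and break the results out by group.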
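For the opening question ("did your change help?"), one simple option is a permutation test on repeated runs of the old and new builds. This is a sketch of one possible approach, assuming you have several pass rates per build; the function name and defaults are hypothetical.

```python
import random
import statistics


def permutation_p_value(baseline: list[float], candidate: list[float],
                        n_perm: int = 5000, seed: int = 0) -> float:
    """Two-sided p-value for the difference in mean pass rate between two builds.

    Shuffles the build labels and counts how often chance alone produces a gap
    at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(candidate) - statistics.mean(baseline))
    pooled = list(baseline) + list(candidate)
    k = len(baseline)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gap = abs(statistics.mean(pooled[k:]) - statistics.mean(pooled[:k]))
        if gap >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids a p-value of 0
```

A large p-value means the gap between builds looks like what you'd get by rerunning the same build. A small one is evidence the change did something, though the test itself treats runs as exchangeable and says nothing about which groups moved.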

Why this is the same fight as the rest of QualityMax

If you've read us before, this will rhyme. Our whole thesis is that you can't review your own work — that verification has to be an independent system with its own context and its own incentives, not the agent grading its own homework. Probabilistic evaluation is that same principle pointed at a different surface.

A vanity score that drifts with the wind isn't verification — it's decoration, the same way a hollow test that asserts nothing is decoration. A green checkmark is only evidence if it would have turned red when something was actually wrong. A pass rate is only evidence if you know how much it moves when nothing changed. Anyone can call a model and print a percentage. The hard, durable part — the part that's actually worth paying for — is the harness around it that turns a noisy sample into a decision you can defend: repeat, bound, group, and answer the only question that matters — is this safe to ship, and how sure are we?

The one line

Your agent isn't 91% good. It's a distribution, and 91% is one sample you happened to draw. Until you measure the spread, you're not evaluating your agent — you're reading tea leaves with a decimal point. Open the box, run it enough times to see the shape, and decide on the shape.

See your agent's real distribution — free

Point QualityMax at your chatbot or agent endpoint. We run an adversarial and conversational eval, score it with an independent judge, and tell you what's safe to ship — not just a number that won't hold still.
