Schrödinger's Eval

The 12 requests we ask it to classify

Before anything runs, here are the 12 requests sent to the model on every run. Each has one correct label (shown on its chip): Answerable, Clarify, Contradictory, or Nonsense. The right answer never changes; we measure how often the model agrees with it, and how often it agrees with itself from one run to the next.
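The two measurements can be sketched in a few lines. This is a minimal illustration, not the demo's actual scoring code: the function names `accuracy` and `self_agreement`, the definition of self-agreement (share of runs that picked the most common label for a request, averaged over requests), and the tiny three-request dataset are all assumptions made for the example.

```python
from collections import Counter

def accuracy(predictions, gold):
    """Fraction of requests where the predicted label equals the correct label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def self_agreement(runs):
    """Average, over requests, of the share of runs that chose the most common label.

    1.0 means the model gave the same label for every request on every run.
    """
    per_request = []
    for labels in zip(*runs):  # all labels given to one request across runs
        top_count = Counter(labels).most_common(1)[0][1]
        per_request.append(top_count / len(labels))
    return sum(per_request) / len(per_request)

# Hypothetical data: 3 requests classified in 4 runs (the demo uses 12 and 8).
gold = ["Answerable", "Clarify", "Nonsense"]
runs = [
    ["Answerable", "Clarify", "Nonsense"],
    ["Answerable", "Answerable", "Nonsense"],
    ["Answerable", "Clarify", "Contradictory"],
    ["Answerable", "Clarify", "Nonsense"],
]

per_run_accuracy = [accuracy(run, gold) for run in runs]
consistency = self_agreement(runs)
print(per_run_accuracy, round(consistency, 3))
```

Note that the two numbers are independent: a model can agree with itself perfectly while being consistently wrong, which is why both are worth tracking.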

One run gives you one score

Each run is a full pass through all 12 requests. The big number is the score from the latest run only. Treating it as the model's accuracy is like glancing at your speedometer once and calling that your average driving speed.

Press “Run the eval”. We’ll classify all 12 requests eight times.
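To make the idea of repeated runs concrete, here is a small sketch that repeats a 12-request eval eight times and reports the spread instead of a single score. The model is replaced by a stand-in that labels each request correctly with a fixed probability; `run_eval`, `summarize`, and the 0.7 probability are hypothetical choices for illustration, not the demo's implementation.

```python
import random
import statistics

N_REQUESTS = 12  # requests per run, as in the demo
N_RUNS = 8       # number of full passes

def run_eval(rng, p_correct=0.7):
    """Stand-in for one full pass: each request is right with probability p_correct."""
    correct = sum(rng.random() < p_correct for _ in range(N_REQUESTS))
    return correct / N_REQUESTS

def summarize(scores):
    """Describe a set of run scores as a range rather than one number."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }

rng = random.Random(0)  # seeded so the sketch is repeatable
scores = [run_eval(rng) for _ in range(N_RUNS)]
summary = summarize(scores)
print([round(s, 2) for s in scores])
print({k: round(v, 3) for k, v in summary.items()})
```

With a real model, `run_eval` would call the classifier on each request and compare against the fixed labels; the summary step stays the same.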

“Just use a better model”?

Try it: switch the model at the top and run again. A stronger model's spread shrinks, but it never reaches zero. Every model is stochastic; a frontier model just wobbles less. That is the real trap. The danger isn't the big, obvious swing on a weak model; it's the small, unmeasured wobble in the frontier model you're shipping to production, the one nobody is putting error bars on. We picked a weak model here only so you can see in 60 seconds what's always there.
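Putting error bars on a set of run scores takes only a few lines. The sketch below uses the standard error of the mean with a normal approximation; that choice is an assumption for illustration, and with only eight runs a t-based or bootstrap interval would be somewhat wider. The function name `mean_with_error_bars` and the example scores are hypothetical.

```python
import math
import statistics

def mean_with_error_bars(scores, confidence=0.95):
    """Return (mean, low, high) using a normal-approximation confidence interval."""
    n = len(scores)
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean, mean - z * sem, mean + z * sem

# Hypothetical scores from eight runs of a strong model on 12 requests:
# mostly perfect, with an occasional single miss.
strong_model_scores = [11 / 12, 1.0, 11 / 12, 1.0, 1.0, 11 / 12, 1.0, 1.0]

mean, low, high = mean_with_error_bars(strong_model_scores)
print(f"{mean:.3f} (95% CI roughly {low:.3f} to {high:.3f})")
```

Even when the wobble is one request out of twelve, the interval is not zero-width, and reporting it makes a later regression distinguishable from ordinary run-to-run noise.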

The 12 requests we ask it to classify

One run gives you one score

Same test, different results each time

The real answer: a range, not one number

How much did each run score?

Which requests does it get wrong?

“Just use a better model”?

That was a fixed demo. Point this at your agent.

Quick check before we run