Anyone can make an AI write a test. Point a model at your app, ask for a checkout test, and you’ll get a beautiful, confident, perfectly-formatted Playwright script in seconds.
The hard part — the only part that actually matters — is knowing whether that test is real. Whether it clicks buttons that exist. Whether it passed because your app works, or because the assertion was hollow. Whether “green” means anything at all.
If May was the month we made QualityMax more trustworthy, June was the month we stopped taking the model’s word for anything. The spine of the month: verify the AI’s output instead of trusting it. Here’s what shipped.
A hallucination gate on every generated test
The classic failure mode of AI test generation: the model writes a flawless test for an endpoint, a selector, or a button that doesn’t exist. It looks right. It reads right. It’s fiction. In June we shipped a post-generation verification gate that checks every AI-written test against the real, live app before you ever see it. If it references something the app can’t actually do, it’s quarantined and flagged — not handed to you as “done.” You can see the verification status right on the test, and toggle the strictness per project.
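To make the idea concrete, here is a minimal sketch of what such a gate could look like, assuming the check boils down to comparing the selectors a generated test references against the selectors the live app actually exposes. Everything here is hypothetical and for illustration only: the `SELECTOR_CALL` pattern, `extract_selectors`, `verify_test`, and the `Verdict` type are invented names, not QualityMax’s implementation, and a real gate would also need to cover endpoints and navigation.

```python
import re
from dataclasses import dataclass, field

# Hypothetical pattern: grabs the first string argument of a few common
# Playwright-style calls. A real gate would parse the script properly.
SELECTOR_CALL = re.compile(r"""\.(?:locator|click|fill)\(\s*["']([^"']+)["']""")


@dataclass
class Verdict:
    status: str                                   # "verified" or "quarantined"
    missing: list[str] = field(default_factory=list)


def extract_selectors(test_source: str) -> list[str]:
    """Collect the selectors a generated test script refers to."""
    return SELECTOR_CALL.findall(test_source)


def verify_test(test_source: str, live_selectors: set[str]) -> Verdict:
    """Quarantine the test if it references anything the app does not expose."""
    missing = [s for s in extract_selectors(test_source) if s not in live_selectors]
    return Verdict("quarantined" if missing else "verified", missing)


generated = '''
page.locator("#cart").click()
page.click("#checkout-now")
'''
# Assumption: this set comes from inspecting the running app.
live = {"#cart", "#pay"}

verdict = verify_test(generated, live)
print(verdict.status, verdict.missing)
```

The test above references `#checkout-now`, which the assumed app does not have, so it lands in quarantine with the missing selector attached instead of being reported as done.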
A reviewer that learns when to stay quiet
Every AI code reviewer has the same problem: noise. Plausible-but-wrong comments that train your team to ignore it. This month our reviewer started learning which of its own findings are noise and suppressing them — and, crucially, keeping a telemetry trail so we can measure the noise it prevented instead of just claiming it. When the signal looked shaky in early data, we shipped it turned off by default and built a labelled benchmark to earn the confidence back. The full story is in Teaching the Reviewer.
“It passed once” is not a result
AI systems are probabilistic, so testing them with a single deterministic pass/fail is a category error. In June we added probabilistic evals: run the same eval N times and report the pass rate with an actual statistical confidence interval (Wilson / Clopper-Pearson), plus a visualization of the distribution. We wrapped it in a public browser demo you can play with — watch a model’s accuracy wobble in real time in Schrödinger’s Eval.
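For readers who want to see the arithmetic, here is a rough sketch of the Wilson variant, assuming a fixed z of 1.96 for roughly 95% confidence. The function names and the `flaky_eval` stand-in are illustrative assumptions, not the product’s API; Clopper-Pearson would need an inverse beta distribution and is left out.

```python
import math
import random
from typing import Callable


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def run_eval_n_times(eval_once: Callable[[], bool], n: int) -> tuple[int, int]:
    """Run the same eval n times and count how often it passes."""
    passes = sum(1 for _ in range(n) if eval_once())
    return passes, n


# Stand-in for a real model eval: passes about 80% of the time.
rng = random.Random(0)
def flaky_eval() -> bool:
    return rng.random() < 0.8

passes, runs = run_eval_n_times(flaky_eval, 50)
low, high = wilson_interval(passes, runs)
print(f"{passes}/{runs} passed, 95% CI [{low:.2f}, {high:.2f}]")
```

The interval is the useful part: 8 passes out of 10 gives a range of roughly 0.49 to 0.94, which is a very different statement from "it passed".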
Runs that admit “partly”
We rebuilt the run experience around a single live stream — you watch tests execute in real time instead of staring at a dead spinner. And we killed the pass/fail binary lie: a run where some sub-tests pass and some fail now reports honest yellow “partial”, with collapsible, downloadable artifacts for every step. We also added a JavaScript/Jest API runner and fixed the cloud-execution template so runs start in seconds instead of re-downloading a browser every time.
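The three-state status is simple to reason about. A small sketch, assuming a run is just a list of sub-test outcomes; the `RunStatus` enum, the `aggregate` function, and the green and red labels are illustrative assumptions (only the yellow "partial" comes from the description above):

```python
from enum import Enum


class RunStatus(Enum):
    PASSED = "green"
    PARTIAL = "yellow"
    FAILED = "red"


def aggregate(sub_results: list[bool]) -> RunStatus:
    """Collapse sub-test outcomes into one run status without hiding mixed results."""
    if not sub_results:
        raise ValueError("a run needs at least one sub-test")
    if all(sub_results):
        return RunStatus.PASSED
    if not any(sub_results):
        return RunStatus.FAILED
    return RunStatus.PARTIAL


print(aggregate([True, True, True]).name)
print(aggregate([True, False, True]).name)
print(aggregate([False, False]).name)
```

The empty case raises on purpose: `all([])` is `True` in Python, so a run with no sub-tests would otherwise report green.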
More models, one harness
We wired up several new model providers this month — including Cerebras with Gemma 4 and Qwen — and put every model selector in the product behind a single source of truth. The point isn’t any one model. The point is that you can swap the model underneath for anything, and the harness of checks around it stays exactly the same. More on why that matters in When Claude Goes Down, Your Tests Shouldn’t.
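One way to picture "swap the model, keep the harness" is a single registry plus a set of checks that never look at which model produced the output. This is a sketch under assumptions: `MODELS`, `CHECKS`, `generate_and_verify`, and `stub_model` are made-up names, the provider and model identifiers are placeholders, and the two checks are deliberately trivial.

```python
from dataclasses import dataclass
from typing import Callable

Generator = Callable[[str], str]
Check = Callable[[str], bool]


@dataclass(frozen=True)
class ModelSpec:
    provider: str
    model_id: str


# Hypothetical registry: the one place every model selector reads from.
MODELS: dict[str, ModelSpec] = {
    "primary": ModelSpec("provider-a", "placeholder-model-1"),
    "fallback": ModelSpec("provider-b", "placeholder-model-2"),
}

# The harness: checks that do not care which model wrote the output.
CHECKS: list[Check] = [
    lambda out: "expect(" in out,     # the test asserts something
    lambda out: "TODO" not in out,    # no unfinished stubs
]


def generate_and_verify(generate: Generator, prompt: str) -> tuple[str, bool]:
    """Generate with whatever model was passed in, then run the same checks."""
    output = generate(prompt)
    return output, all(check(output) for check in CHECKS)


def stub_model(spec: ModelSpec) -> Generator:
    """Stand-in for a real provider client; returns canned text."""
    return lambda prompt: f"// {spec.model_id}\nexpect(page).toHaveTitle('Checkout')"


for key in ("primary", "fallback"):
    _, ok = generate_and_verify(stub_model(MODELS[key]), "write a checkout test")
    print(key, "passed harness:", ok)
```

Swapping `"primary"` for `"fallback"` changes one dictionary lookup; `CHECKS` is untouched.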
So how do we ship this fast?
Fair question — and the answer is the product. We don’t ship this fast because we cut corners on quality. We ship this fast because a swarm of autonomous bots does the small, repetitive, incremental heavy lifting on every single change.
For context: everything above shipped across nearly 300 pull requests in a single month. That number is only survivable because every one of those PRs runs through the exact same automated gates we ship to you. No human has to remember to run them; they just fire:
- LLM diff review — reads the full PR diff, flags issues, can block the merge.
- Security scan — hybrid Semgrep patterns + an LLM review of security-sensitive files.
- Tech-debt scan — static complexity, maintainability, and TODO markers, no LLM cost.
- Run tests — the Go / Rust / pytest suite plus AI-generated scripts, in a sandbox.
- Preview-deploy tests — Playwright smoke tests against the preview URL for that PR.
- BOLA / IDOR audit — an LLM authorization-gap scan on route handlers, on every push.
- Guardian — auto-syncs and promotes validated tests, so coverage keeps up with the code on its own.
On top of that, a bug-detection bot watches our own production app around the clock and opens a fix PR when it finds something — which then runs back through the same gates. The humans get to move fast on the interesting 5% precisely because the bots never get bored of the other 95%: reviewing diffs, scanning for regressions, keeping tests in sync. That’s the whole flywheel, and it’s the same one you get when you point QualityMax at your repo.
The throughline
There’s a thesis under all of it, and it’s the same one we’ve been building toward: you can’t review your own work. A model that generates a test is the last thing you should trust to tell you the test is good. So we don’t. We generate with one pass and verify with another — against the live app, across repeated runs, with a reviewer that’s honest about its own noise and a run status that’s honest about partial failure.
The one line
Generating a test is easy. Knowing it’s real is the product.
That’s the bet. The model can hallucinate; the harness catches it. The model can be swapped; the harness stays. And when we’re not sure the harness is right yet, we ship it off by default and prove it before we trust it — the same standard we’re asking you to hold your own AI-built software to.
Links from June
- Schrödinger’s Eval — and the live browser demo
- Teaching the Reviewer
- You Can’t Review Your Own Work
- Value Per Token
- When Claude Goes Down, Your Tests Shouldn’t
Coming in July
Two big ones are almost out of the oven: a completely redesigned product — new look, new information architecture, faster to get from “here’s my app” to “here are your tests” — and personal coding seats, so you can bring your own AI agent and let it work inside QualityMax. Both go public next month.
Try QualityMax
AI-generated tests that are verified against your real app, self-heal when it changes, and report honest results instead of faking green.