You Can't Review Your Own Work


Ask an AI agent to write a feature. Then ask the same agent to write the tests for it. The tests pass. Of course they pass — they were written by the same mind, from the same assumptions, with the same blind spots. They verify what the code does. They have nothing to say about what it should do.

This is the quiet failure at the centre of the "let AI write your tests" pitch, and it's the single idea QualityMax is built on:

AI code generation and AI code testing are adversarial systems that should be separate products. You can't review your own work.

Every human engineering org already knows this. It's why the author of a pull request doesn't approve their own PR. It's why we separate "dev" from "QA," why auditors are independent, why you don't grade your own exam. The moment the same agent both produces and validates, validation collapses into a tautology. A green checkmark stops being evidence and becomes decoration.

Why this is getting worse, fast

A year ago this was a philosophical point. Now it's an operational one, because a large and growing share of the code entering production was written by an agent — and increasingly, so were its tests.

The trust numbers tell the story. Developers have adopted AI tooling almost universally, while their trust in the output has gone the other direction — the most recent Stack Overflow survey has the vast majority of developers using AI tools and only a low single-digit percentage saying they highly trust what comes out. The dominant frustration is no longer "we don't have enough tests." It's "we can't trust the tests we have."

When an agent writes both the implementation and the assertions, the most likely outcome isn't a wrong test. It's a hollow one: a test that runs, passes, and asserts nothing that could ever fail. expect(page.locator('body')).toBeVisible() after a click. A snapshot of whatever the code happened to render. A URL regex of /.*/. It's the safest possible way for a model to produce a green result, which is exactly why it defaults there.

QualityMax is the independent verifier

We don't sit in the agent's editor helping it write code faster. We sit on the other side of the wall: a separate system, with separate context, whose only job is to be the adversary that asks "does this actually work, and would we know if it didn't?"

And here's the part that matters for anyone who's watched a dozen "AI testing" demos: the moat isn't the model. Anyone can call Claude or GPT. The same models are one API key away for every competitor and every customer. If the product were the model, there would be no product.

The moat is the deterministic guardrail harness wrapping the model — a pipeline of non-AI checks that constrain, verify, and reject the AI's output regardless of which model produced it. The AI proposes. The harness disposes. A few of the layers:

1. A hollow-test gate that rejects tests which test nothing

Before any generated test is allowed to be saved, it's scored by a deterministic quality gate (no LLM in the verdict). It looks for the signatures of a test that only pretends to test: tautological assertions like expect(true).toBe(true), empty test bodies, assertions that just echo a literal copied from the source, Playwright tests with no real locator calls. Score below threshold, and the test doesn't ship — it goes back for regeneration. The gate exists precisely because the failure mode of AI-written tests isn't "wrong," it's "vacuous," and a human skimming a wall of green will never catch it.
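To make the idea concrete, here is a minimal sketch of what a deterministic gate of this kind could look like. It is an illustration, not QualityMax's implementation: the names (scoreTest, GateResult), the regex heuristics, the penalty weights, and the threshold of 70 are all assumptions, and a real gate would more plausibly work on a parsed syntax tree than on regexes. It covers tautologies, empty bodies, catch-all regexes, and missing locator calls, and leaves out the "literal copied from the source" check.

```typescript
// Illustrative only: patterns, weights, and threshold are assumptions.

interface GateResult {
  score: number;      // 100 = no hollow signals found, 0 = entirely hollow
  reasons: string[];  // one entry per penalty applied
  accepted: boolean;  // true when score >= threshold
}

// expect(<literal>).toBe(<same literal>), e.g. expect(true).toBe(true)
const TAUTOLOGY =
  /expect\(\s*(true|false|\d+|(['"`])[^'"`]*\2)\s*\)\s*\.(toBe|toEqual|toStrictEqual)\(\s*\1\s*\)/;
// test("name", async (...) => {}) with nothing between the braces
const EMPTY_BODY =
  /\btest\(\s*(['"`])[^'"`]*\1\s*,\s*(async\s*)?\([^)]*\)\s*=>\s*\{\s*\}\s*\)/;
// The literal characters /.*/ used as a pattern that matches anything
const CATCH_ALL_REGEX = /\/\.\*\//;
// A handful of real Playwright locator entry points
const LOCATOR_CALL =
  /\.(locator|getByRole|getByText|getByLabel|getByTestId|getByPlaceholder)\(/;
const ANY_ASSERTION = /\bexpect\s*\(/;

function scoreTest(source: string, threshold: number = 70): GateResult {
  const reasons: string[] = [];
  let score = 100;
  const penalize = (points: number, reason: string): void => {
    score -= points;
    reasons.push(reason);
  };

  if (EMPTY_BODY.test(source)) penalize(100, "empty test body");
  if (!ANY_ASSERTION.test(source)) penalize(60, "no assertions");
  if (TAUTOLOGY.test(source)) penalize(50, "tautological assertion");
  if (CATCH_ALL_REGEX.test(source)) penalize(40, "catch-all regex /.*/");
  if (!LOCATOR_CALL.test(source)) penalize(40, "no locator calls");

  score = Math.max(0, score);
  // The verdict is a plain comparison: no model is consulted here.
  return { score, reasons, accepted: score >= threshold };
}
```

The property worth noticing is that the same source text always produces the same verdict, and every rejection carries a reason a human can read.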

2. Self-healing that proposes diffs — and never silently edits

When a test breaks because a selector moved, QualityMax can re-locate the element and fix it. But the fix is surfaced as a reviewable unified diff — the same thing you'd see in a git diff — with the failure reason, a confidence score, and the healing tier that produced it. You apply it or you discard it. Nothing is written to your repository until you say so.

This line is non-negotiable, and it's where most "self-healing" marketing quietly betrays you:

Heal may change where, never what

Self-healing is allowed to change how an element is located — the path to reach a state, waits, scoping. It is never allowed to touch the expected value — what "correct" means.

A test that fails because the total now reads €0.00 instead of €129.00 is not a test to be "healed" green. That's not healing. That's deleting the test and keeping the checkmark — while the broken checkout ships.

Locators are opinions about where. Assertions are facts about what. We only ever let a machine rewrite the opinions, and only with a human approving the diff. (Yes — "you can't review your own work" applies to us too. The machine proposes; a person decides.)
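The "where, never what" rule can be expressed as a small deterministic guard. The sketch below is hypothetical: the Step and HealProposal shapes and the checkHeal function are invented for illustration and are not QualityMax's data model. It only shows how a proposal that touches the matcher or the expected value can be refused before it ever reaches a reviewer.

```typescript
// Hypothetical shapes for illustration; not an actual QualityMax schema.

interface Step {
  locator: string;   // where: how the element is found
  waitMs?: number;   // where: timing needed to reach the state
  matcher: string;   // what: the kind of check, e.g. "toHaveText"
  expected: string;  // what: the value that means "correct"
}

interface HealProposal {
  before: Step;
  after: Step;
  reason: string;      // why the original step failed
  confidence: number;  // 0..1, assumed to come from the healing tier
}

type HealVerdict =
  | { allowed: true }
  | { allowed: false; violation: string };

// A heal may change the locator or the wait, never the assertion itself.
function checkHeal(proposal: HealProposal): HealVerdict {
  const { before, after } = proposal;
  if (before.matcher !== after.matcher) {
    return { allowed: false, violation: "matcher changed" };
  }
  if (before.expected !== after.expected) {
    return {
      allowed: false,
      violation: `expected value changed from "${before.expected}" to "${after.expected}"`,
    };
  }
  return { allowed: true };
}
```

An allowed proposal in this sketch is still only a proposal; applying it remains a human decision.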

3. Multi-model routing, so the verifier never goes down with one vendor

An independent verifier that's offline whenever one provider has an incident isn't independent; it's a thin client. QualityMax routes per task across eight providers (six cloud, two self-hosted on our own GPUs), with task-fit selection and a circuit breaker that benches a failing provider and probes it before returning it to rotation. We wrote about the day this earned its keep in "When Claude Goes Down, Your Tests Shouldn't."
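For readers who haven't met the pattern, a circuit breaker in this setting might look roughly like the sketch below. Everything in it is an assumption made for illustration: the ProviderBreaker class, the failure limit, the cooldown, and the pickProvider helper are not taken from QualityMax's gateway, and task-fit ranking is reduced to "the list is already ordered by preference".

```typescript
// Illustrative circuit breaker; thresholds and names are assumptions.

type BreakerState = "closed" | "open" | "half-open";

class ProviderBreaker {
  readonly name: string;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";

  constructor(name: string, maxFailures: number = 3, cooldownMs: number = 30000) {
    this.name = name;
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
  }

  // Returns true if a request may be sent to this provider right now.
  canTry(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: allow probe traffic
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed"; // a successful probe puts the provider back in rotation
  }

  recordFailure(now: number): void {
    this.failures += 1;
    // A failed probe, or too many failures in a row, benches the provider.
    if (this.state === "half-open" || this.failures >= this.maxFailures) {
      this.state = "open";
      this.openedAt = now;
    }
  }
}

// `ranked` is assumed to be ordered by task fit, best first.
function pickProvider(ranked: ProviderBreaker[], now: number): ProviderBreaker | undefined {
  return ranked.find((breaker) => breaker.canTry(now));
}
```

Passing the clock in as a parameter, rather than reading it inside the class, keeps the breaker itself deterministic and easy to test.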

4. Retrieval grounded in a green-only corpus

Generation is RAG-grounded — we retrieve real, working examples from your own project to anchor the model. But retrieval is only as good as the corpus, and the subtle way this goes wrong is poisoning it with tests that were healed or are failing. We keep the corpus green: only verified-passing, project-scoped tests are eligible to be retrieved. The verifier learns from what actually works, not from its own past mistakes.
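The eligibility rule is simple enough to state as a filter. The sketch below assumes a hypothetical CorpusTest record with a last-run status and a flag for tests that have been healed; neither the field names nor the eligibleForRetrieval function come from QualityMax, and the actual retrieval step (embedding, ranking) is left out entirely.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.

interface CorpusTest {
  id: string;
  projectId: string;
  lastRunStatus: "passed" | "failed" | "skipped";
  healed: boolean;  // true if self-healing has modified this test
  source: string;
}

// Only verified-passing, unhealed tests from the same project may be retrieved.
function eligibleForRetrieval(tests: CorpusTest[], projectId: string): CorpusTest[] {
  return tests.filter(
    (t) => t.projectId === projectId && t.lastRunStatus === "passed" && !t.healed,
  );
}
```

Filtering before retrieval, rather than after, means a failing or healed test can never be the nearest match that anchors a new generation.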

What we don't do (on purpose)

The harness is defined as much by its refusals as its features. We do not let an LLM be the pass/fail judge at runtime — verdicts must be deterministic, because a model will confidently approve a broken total. We use AI to decide what to assert; we never use it to decide whether it passed. We don't auto-update assertions to make red go green. We don't silently mutate your repo. Every one of those is a place where the convenient thing and the correct thing diverge, and the harness is what holds the line when "just make it green" is the tempting answer.
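The split between "what to assert" and "whether it passed" can be shown in a few lines. In this sketch, which is an assumption-laden illustration rather than QualityMax's runtime, a model is imagined to have proposed the AssertionSpec ahead of time, and the verdict function that judges it is a plain comparison with no model in the loop. The names AssertionSpec and verdict are hypothetical.

```typescript
// Illustrative only. A model may author the spec; it never computes the verdict.

interface AssertionSpec {
  target: string;    // which observed value to check, e.g. "cart.total"
  expected: string;  // the value that means "correct"
}

type Verdict = "pass" | "fail";

// Pure function: the same spec and observations always give the same verdict.
function verdict(spec: AssertionSpec, observed: Map<string, string>): Verdict {
  const actual = observed.get(spec.target);
  // A missing observation is a failure, never a silent pass.
  return actual !== undefined && actual === spec.expected ? "pass" : "fail";
}
```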


Why a separate product, not a feature

The obvious objection: won't Copilot, Cursor, or your agent of choice just add a "write tests" button and absorb this? They can add the button. They can't escape the structural problem — the button is still the same agent grading its own homework. Coding agents are optimised to make the diff land. A verifier has to be optimised to make the diff fail when it should. Those are opposing objectives, and you cannot serve both from inside the same loop. The separation isn't a packaging decision; it's the whole point.

That's also why we're model-agnostic and provider-agnostic by design. We can swap the underlying model for anything that scores better on our evals — the gateway doesn't care, the harness doesn't care. What's ours is the pipeline of checks and the verified data flowing through it. Bias on quality, not on a vendor.

The one line

AI is fast at writing tests, and speed stopped being the bottleneck a while ago. The job, the entire job, is judgement: deciding what to assert, and deciding when to believe red. That judgement cannot come from the same system that wrote the code, for the same reason you don't approve your own pull requests.

You can't review your own work. Neither can your AI. So we built the thing that reviews it for you — and refuses to look away when something's actually broken.

Put an independent verifier on your AI-written code

RAG-grounded test generation, a hollow-test gate, self-healing as reviewable diffs, and multi-model routing — the deterministic harness that wraps the model instead of trusting it.

