
Take a strip of paper. Give one end a half-twist. Glue the ends together. What you've made is called a Möbius strip, and it's unusual: it has only one side. Run your finger along the surface and you'll end up on "both sides" without ever crossing an edge. Inside and outside are the same place.

one surface · one boundary · no "outside"

That's the shape of what we've been building at QualityMax.

This post is about the moment the QA tool and the thing it tests become one continuous surface — and why that only becomes possible once the tool owns the adjacent objects (the tests, the code, the errors, the AI).

Most QA tools are "outside in"

The standard QA stack is a set of separate tools, each with a clear boundary.

Each tool sits outside the code it touches. The error tracker sees the errors but not the tests. The test runner sees the tests but not the errors. The coding assistant writes the code but has no idea which tests cover it or which errors it's about to cause. The CI system sees the outputs but none of the context.

This architecture is fine. It also has a ceiling. When a production error fires at 3 AM, a human still has to walk the loop — read the stack trace in one tool, find the relevant test in another, open the file in a third, ask an AI in a fourth. The loop is there. It's just that you are the glue.

What happens when one tool owns the whole loop

QualityMax imports your repo. It crawls your app. It generates tests. It runs them in CI. It receives errors from production. It stores them. It knows which tests cover which files. It can call an AI with full context.

Those are all things other tools do. What's interesting is what happens when they're all in the same data model.

The instant the error tracker and the test database share a schema, the walk you used to do by hand becomes a JOIN. The instant the AI can read both the failing stack frame and the test that should have caught it, its suggestions stop being generic and start being specific to your codebase. The instant the same platform can open a PR with the fix, the loop closes without you touching a keyboard.
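To make that concrete, here's a minimal sketch of the JOIN, assuming the joined store is plain Postgres (as Supabase is). The table names echo the diagram below; the column names and the coverage heuristic are illustrative assumptions, not the real schema.

```python
import psycopg2  # assumes the joined store is plain Postgres

# A sketch of the hand-walk collapsed into one query. Table names
# (error_events, test_cases, executions) echo the diagram below;
# columns and the strpos() coverage heuristic are assumptions.
QUERY = """
SELECT e.title,
       e.culprit_file,
       t.name   AS covering_test,
       x.status AS last_run_status
FROM   error_events e
LEFT JOIN test_cases t ON strpos(t.code, e.culprit_file) > 0
LEFT JOIN executions x ON x.test_case_id = t.id
WHERE  e.status = 'open'
ORDER  BY e.last_seen DESC
"""

with psycopg2.connect("dbname=qualitymax") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for title, path, test, status in cur.fetchall():
        print(title, path, test or "NO COVERAGE", status)
```

One query answers what used to take three tools: which open errors exist, which test touches the same file, and how that test's last run went.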

You go from four tools with four seams to one continuous surface. A Möbius strip.

The actual loop, in one diagram

```
production error fires
          │
          ▼
Bugsink captures it ◄─────────────────────────────────┐
          │                                           │
          │ (sync job mirrors to Supabase)            │
          ▼                                           │
error row now joinable with:                          │
  • test_cases  (does a test cover this file?)        │
  • code_repo   (which PR last touched it?)           │
  • executions  (did that test just fail?)            │
          │                                           │
          ▼                                           │
Claude gets: error + stack + top-frame                │
  + surrounding source + last test run                │
  + recent commits                                    │
          │                                           │
          ▼                                           │
Proposed regression test + suggested patch            │
          │                                           │
          ▼                                           │
PR opens in the same repo that triggered              │
the error — reviewed, merged                          │
          │                                           │
          ▼                                           │
next run of the test hits the fix,                    │
Bugsink sees zero new events                          │
          │                                           │
          └───────────────────────────────────────────┘
```

No human glue. The tool owns every step.
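The "(sync job mirrors to Supabase)" arrow is the only moving part that isn't a plain query. A minimal sketch of what such a job could look like, assuming both sides are Postgres; the DSNs, table names, and columns here are all illustrative assumptions, not the production job.

```python
import psycopg2

# Hypothetical mirror step: copy new Bugsink issues into the joinable
# error_events table. Every identifier (both DSNs, issues, error_events,
# the column list) is an assumption for illustration.
SRC_DSN = "postgresql://bugsink-db/bugsink"    # assumed
DST_DSN = "postgresql://supabase-db/postgres"  # assumed

def mirror_new_events(since_id: int) -> int:
    with psycopg2.connect(SRC_DSN) as src, src.cursor() as read, \
         psycopg2.connect(DST_DSN) as dst, dst.cursor() as write:
        read.execute(
            "SELECT id, title, culprit_file, last_seen FROM issues WHERE id > %s",
            (since_id,),
        )
        rows = read.fetchall()
        for row in rows:
            # Upsert so a crashed or re-run job never duplicates events.
            write.execute(
                """INSERT INTO error_events (id, title, culprit_file, last_seen)
                   VALUES (%s, %s, %s, %s)
                   ON CONFLICT (id) DO UPDATE SET last_seen = EXCLUDED.last_seen""",
                row,
            )
    return len(rows)
```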

The twist: QualityMax tests QualityMax

Here's where it gets weird.

The QualityMax app is itself a codebase. That codebase has errors. Those errors get tracked in Bugsink. So we imported the QualityMax repository into QualityMax, linked our own Bugsink project to our own QM project, and let the loop run against us.

The first time I opened the embedded Bugsink page after the cross-link, the top issue was literally:

```
Failed to query Bugsink DB directly: connection to server at "localhost" (::1), port 15432 failed: Connection refused
```

That's our own MCP server, reporting its own inability to reach the old Bugsink endpoint — the exact bug the new cache architecture was built to eliminate. Bugsink tracking Bugsink's own obsolescence. The first user of the system was the system.

Why this is unusual. The standard boundary between "tool" and "target" assumes they're different programs, maintained by different teams. When they're the same program, the metrics the tool reports about itself become input to the tool's own improvement. Over time the product gets sharper at the things its users also hit, because its own engineering team feels the same pain first.

Why this is hard to copy

Any one of the pieces exists somewhere. Sentry owns the error side. GitHub owns the PR side. Cursor owns the code-authoring side. Playwright owns the test-execution side. But owning one of those is not the same as owning all four in one schema.

A standalone error tracker can never answer "which test should have caught this" because it doesn't know your tests. A coding assistant can never answer "which bug is this PR about to cause" because it doesn't know your error history. A CI platform can never propose a regression test because it doesn't own the test-authoring side.

To close the loop you have to have already imported the repo, generated tests, routed errors through your ingest, and kept the results in a shape that can be joined together. QualityMax has been doing that work as its primary job for a year. The Möbius loop isn't a new product — it's what you can see once the foundation is in place.

Concrete example: an Appetize timeout

Let me walk through one of the real issues on the dashboard right now so this isn't all metaphor.

Bugsink reports: [AppetizeBridge] Timed out waiting for session to be ready. 12 events in the last 24h. Stack frame: services/ai_crawl/mobile_discovery_crawler.py:start.

On the old architecture, that's where a human takes over. Open the file, figure out the timeout logic, check whether any test exercises mobile crawls, write a reproducer, open a PR.

On the new architecture, the loop does the walk:

  1. Linked project — the Bugsink project "QualityMax" is mapped to our QM project, which is mapped to the Quality-Max/qamax-rag-app repo. The file path resolves to a concrete repo location.
  2. Test coverage lookup — scan automation_scripts.code for the filename. Result: one integration test mentions the file, but it was last run three days ago and doesn't cover the ready-timeout path.
  3. Recent changes — the last commit to touch mobile_discovery_crawler.py was PR #414. Opens in one click for reference.
  4. Claude explain — the AI gets: the error title, the stack trace, the 50 lines around start(), and the diff from PR #414. It explains that the ready-check uses a fixed 10-second timeout that's shorter than Appetize's cold-start under load, suggests a longer timeout with exponential backoff (sketched just after this list), and proposes a regression test that simulates a slow Appetize session.
  5. PR drafted — the proposed test and patch are drafted back into the same repo. We review, tweak, merge. Bugsink's event count for this issue stops climbing.
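To make the suggested fix in step 4 concrete, here's a sketch of the shape Claude proposed, not the merged patch: swap the fixed 10-second ready-check for exponential backoff under a more generous ceiling. The session object and its is_ready() probe are assumed interfaces.

```python
import time

def wait_until_ready(session, max_wait: float = 60.0) -> None:
    """Poll with exponential backoff instead of a fixed 10 s timeout,
    so Appetize cold-starts under load no longer trip the bridge."""
    deadline = time.monotonic() + max_wait
    delay = 0.5
    while time.monotonic() < deadline:
        if session.is_ready():  # assumed probe on the Appetize session
            return
        time.sleep(delay)
        delay = min(delay * 2, 8.0)  # 0.5 s, 1 s, 2 s, 4 s, 8 s, 8 s, ...
    raise TimeoutError("[AppetizeBridge] Timed out waiting for session to be ready")
```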

That's the whole walk, end to end, on one platform. Not because the platform is magic — because the joins are possible.
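The context assembly in step 4 is itself just string-building over things the platform already stores. A rough sketch; the error and frame objects and the prompt shape are assumptions for illustration:

```python
from pathlib import Path

def build_explain_prompt(error, frame, pr_diff: str, window: int = 25) -> str:
    # ~50 lines of source centered on the failing frame (step 4's input).
    lines = Path(frame.path).read_text().splitlines()
    lo = max(frame.lineno - window, 0)
    source = "\n".join(lines[lo : frame.lineno + window])
    return (
        f"Error: {error.title}\n\n"
        f"Stack trace:\n{error.stacktrace}\n\n"
        f"Source around {frame.path}:{frame.lineno}:\n{source}\n\n"
        f"Most recent diff touching this file:\n{pr_diff}\n\n"
        "Explain the likely cause, propose a patch, and draft a regression test."
    )
```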

Screenshot of an expanded Bugsink issue inside /bugsink. Top: 'Session failed to start due to client error — Embeds are not enabled for this app' with the failing frame at https://js.appetize.io/embed.js. Stack trace shows 9 frames of embed.js with handleWindowMessage at the top. AI Explanation section with 'Explain with Claude' button noting Sonnet 4.6 cost ~$0.01–0.02 per call, hard-capped at 2. Test Coverage pill: 'No test coverage — No existing test case references https://js.appetize.io/embed.js' with 'Generate a test for this' + 'Browse test cases' buttons. Recent Changes section honestly notes the failing frame is an external script so no commit history applies. First seen / Last seen 2h ago.
Issue detail pane: stack trace, AI-explain button with cost displayed, test-coverage pill with real CTA, and commit-blame that honestly says "external script — not in our repo". Every box here is a join against a different part of the QualityMax schema.

What we're publishing today

The first cut of this is already live:

Screenshot of the embedded /bugsink page inside QualityMax. KPI row shows 87 open issues, 1,353 total events, noisiest: 'consumer: Cannot connect...' with 100 events in the QualityMax project. Filters for project, status, rows. Issue list shows the live 'Embeds are not enabled for this app' Appetize error at the top, alongside 'Failed to query Bugsink DB directly' and other production errors, each with project badge, event count, last-seen, and status pill.
The embedded /bugsink page — Bugsink data living inside QualityMax, not as an external link.

Still landing: "recent PRs that touched this file", Claude explain-this-error with source context, and auto-drafted regression tests. Each is a small amount of code on top of the join, because the join is the hard part.
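For instance, "recent PRs that touched this file" is essentially one git invocation over the already-imported clone, joined against stored PR metadata. A sketch, assuming a local checkout:

```python
import subprocess

def recent_commits(repo_dir: str, path: str, n: int = 5) -> list[str]:
    # One git call per lookup; mapping commit hashes to PRs is a join
    # against whatever PR metadata the platform has already stored.
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", f"-n{n}", "--oneline", "--", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

print(recent_commits(".", "services/ai_crawl/mobile_discovery_crawler.py"))
```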

The shape of the next decade of dev tools

Tools that sit outside the codebase are good at showing problems. Tools that sit inside — that own the errors, the tests, and the AI in one schema — are the ones that can close loops.

The Möbius strip isn't just clever framing. It's a description of the shape. One surface. One boundary. No more hand-off between the tool and the thing it tests, because they're the same surface traced continuously.

If you're building a product and you're tired of the standard four-tool, four-seam, human-is-the-glue architecture, come try ours. We built it the way we wish we had it — and now it monitors itself.

Try it on your repo

Import your codebase, connect Bugsink (or bring your own Sentry), watch the loop run against your own production errors.

Get Started →