Six Days of Shipping QualityMax from My Phone


Two weeks ago I posted about shipping our iPhone app to TestFlight in 4 days. That native app exists as a staff dashboard. But this trip’s real workhorse was the boring one: QualityMax’s own mobile web in iPhone Safari — the same product, the same routes, just rendered responsive on a 6.1-inch screen.

The last week has been a stress test of that idea, but in the other direction. I haven’t just watched from the phone. I’ve been merging from it. Features. Bug fixes. One revert.

A six-day Faroe Islands trip: Berlin → Copenhagen → Tórshavn on Monday May 18, Tórshavn → Berlin on Saturday May 23. Cliffs, ferries, guesthouse Wi-Fi, weather changing every ten minutes, zero days at a proper desk. The work didn’t slow down. If anything it sped up, because the pipeline did the work that I would normally have done by re-reading my own diff at 11pm.

Figure: Faroe Islands cliffs during a phone-first QualityMax shipping week. The actual backdrop: Faroe Islands cliffs, a phone, and a CI pipeline doing the heavy lifting.

Here’s what that looked like.

PRs landed (6 days · git log --since=May 18)
Mix: features · fixes · CI (backend, mobile UI, pipeline plumbing)
Days behind a desk: 0 / 6 (phone + Woodpecker the whole way)
Reverts (the gates were right; I wasn’t)

What’s actually in my pocket

The phone isn’t writing the code. I’m not pretending you can hammer out a Playwright executor refactor on a 6.1-inch screen. The phone is the observer and the merge button. The writing happens elsewhere — mostly in qmax-code on a laptop session I kicked off before I left, or in Claude Code, or in Codex. Then it lands as a PR, and the pipeline takes over.

Two browser tabs and one app do the work:

QualityMax in mobile Safari — the production web app on my phone. Same dashboard, same routes I’d hit on a laptop, just rendered responsive. Live runs, Bugsink errors, bug-bot ticks, infra health. It’s where I see whether the world is on fire.
GitHub mobile — PR review surface, including the AI review comments QualityMax has already posted by the time I open the PR.
The tab pinned to our Woodpecker dashboard — for when I want to see which step turned red and not just “something failed.”

And one important thing that isn’t on the phone: any local terminal access. Everything I merge from the phone has to have already passed enough gates that I’m willing to trust it without re-running anything myself. That’s the whole point.

The four QM gates that did the work

Every PR runs through four sequential gates in Woodpecker. Three of them are hard blocks. One is a soft warn. Together they’re the thing that lets me click “Merge” from a ferry terminal.

ALPHA (AI review + SAST): 5-persona structured review, SAST scan, prompt-injection check, secret scan. Hard block on BLOCK verdict.
GAMMA (native test suite): pytest + Go + Rust + JS. Lint, type-check, unit, integration. The full local suite, in CI.
DELTA (QM-generated scripts): Playwright tests the QM crawler wrote against earlier builds. Soft warn: signal, not gate.
BETA (preview-deploy E2E): Playwright run against the live Railway preview. Last word before merge.
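
To make the hard-block versus soft-warn distinction concrete, here is a minimal Python sketch of how a merge decision over the four gates could be computed. It is an illustration only, not the actual QualityMax or Woodpecker logic: `GateResult`, `Verdict`, and `merge_decision` are hypothetical names, and stopping at the first hard failure is an assumption based on the gates being sequential.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass(frozen=True)
class GateResult:
    name: str
    verdict: Verdict
    hard: bool  # True: a failure blocks the merge; False: a failure only warns


def merge_decision(results: List[GateResult]) -> Tuple[bool, List[str]]:
    """Walk the gates in order; stop at the first hard failure.

    Returns (can_merge, notes). Soft-gate failures are collected as notes
    but never block the merge (assumed behaviour for this sketch).
    """
    notes: List[str] = []
    for result in results:
        if result.verdict is Verdict.FAIL:
            if result.hard:
                return False, notes + [f"{result.name}: blocked"]
            notes.append(f"{result.name}: soft warn")
    return True, notes


# Example run: Delta fails, but it is the soft gate, so the merge is still allowed.
example = [
    GateResult("ALPHA", Verdict.PASS, hard=True),
    GateResult("GAMMA", Verdict.PASS, hard=True),
    GateResult("DELTA", Verdict.FAIL, hard=False),
    GateResult("BETA", Verdict.PASS, hard=True),
]
can_merge, notes = merge_decision(example)
print(can_merge, notes)
```

In this sketch a red Delta shows up as a note next to the merge button, while a red Alpha, Gamma, or Beta ends the evaluation immediately.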

The non-gate plumbing matters too: ruff, mypy, pylint, promptfoo, semgrep, supply-chain pytest, prompt-lint, two layers of SPA navigation E2E, fresh HTTP-integration coverage on the AI-crawler service. Each runs in its own Woodpecker step. If any of them is red, I can see exactly which one in the dashboard from the phone.

The thing that lets me trust a phone-merge isn’t any one gate. It’s that there’s nowhere for a bad commit to hide.

Bugs I found because I was actually using the app

The funniest thing about doing your phone work in your own product is that you start finding your own bugs. On a phone. As a phone user. With phone-sized hands and one-handed thumb reach.

A clean batch of mobile-responsive UI fixes landed across the trip’s final weekend:

A dead Back button. Tapping Back from a project view did nothing in iOS Safari — the route handler wasn’t wired on the header touch path. Caught the first time I tried to navigate out of a project from the phone on the road. Filed and fixed within an hour.
Three independent layout bugs. Settings overflowed off-screen. Inputs were clipped at narrow widths. A floating action button sat at full opacity over content text. All three were only visible because I was looking at the app on an actual phone in actual light, not at 700px in a browser devtools mobile preview.
A flush-left regression. The fix above accidentally baked a centered alignment into a layout helper. Caught on the next phone session, fixed flush-left an hour later.

Production didn’t crash, no one was paged — but these are the paper-cuts visible only to users on the surface I was now living on. The dogfooding loop only works if you actually use the thing on its target surface, and a Faroe Islands roadside stop with uneven signal is a much better mobile testbed than any office.

Backend features I shipped from the same phone

The interesting half of these six days is the backend work that landed while I was nowhere near a backend.

The pattern was always the same: I’d kick off a qmax-code or Claude Code session before leaving, give it a ticket as context, watch it open a PR. Then the four gates would run, the AI review would post, the SAST would scan, the Playwright preview would deploy and assert. By the time I opened the PR on my phone the review was already there.

What actually shipped from the phone:

A browser-health probe in the AI-crawl pipeline. Crawls had been silently continuing against a dead browser context after Playwright lost it. The probe checks the live pages and fails fast with a clear error instead of running the rest of the pipeline against a zombie. Alpha caught two BLOCK-tier issues on the first draft — an unhandled exception path and a leaky log line. Gamma green. Beta green. Merged from a guesthouse table.
Per-action heartbeats and an explicit state machine for AI actions. Each action now emits a heartbeat and transitions through named states — observability we’d been wanting for months. Reviewed and merged between stops.
A one-line auth fix. The impersonate endpoint was setting a cookie, but bearer-token callers (mobile, MCP) needed the JWT in the response body. One-line change. Alpha all green. SAST clean. Merged in under 10 minutes from notice to deploy.
The re-land of the reverted work. Three smaller PRs that split a too-big bundle into pipeline-resilience, retry-budget v2, and selector-rewrite v2. More on the revert that produced them below.
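
For readers who want a feel for the first two items, here is a rough Python sketch of a fail-fast browser-health probe and an explicit action state machine with heartbeats. None of it is the real QualityMax code: every class name, state name, and the heartbeat timeout parameter is a hypothetical stand-in. The only outside detail it leans on is that `PageLike.is_closed()` mirrors the `is_closed()` method on Playwright’s Python `Page`, so real pages could in principle be passed in.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Protocol, Sequence


class BrowserContextDead(RuntimeError):
    """Raised when the crawl has no usable page left in its browser context."""


class PageLike(Protocol):
    def is_closed(self) -> bool: ...


def probe_browser_health(pages: Sequence[PageLike]) -> None:
    """Fail fast with a clear error instead of crawling against a dead context."""
    live = [page for page in pages if not page.is_closed()]
    if not live:
        raise BrowserContextDead(
            f"no live pages in browser context ({len(pages)} tracked); aborting crawl"
        )


class ActionState(Enum):  # hypothetical state names
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed transitions; anything else is treated as a bug and raises.
ALLOWED = {
    ActionState.PENDING: {ActionState.RUNNING},
    ActionState.RUNNING: {ActionState.SUCCEEDED, ActionState.FAILED},
    ActionState.SUCCEEDED: set(),
    ActionState.FAILED: set(),
}


@dataclass
class AIAction:
    name: str
    state: ActionState = ActionState.PENDING
    last_heartbeat: Optional[float] = None

    def transition(self, new_state: ActionState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.name}: illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state

    def heartbeat(self, now: Optional[float] = None) -> None:
        """Record that the action is still making progress."""
        self.last_heartbeat = time.monotonic() if now is None else now

    def is_stalled(self, now: float, timeout_s: float) -> bool:
        """A RUNNING action with no heartbeat inside the timeout counts as stalled."""
        if self.state is not ActionState.RUNNING:
            return False
        return self.last_heartbeat is None or now - self.last_heartbeat > timeout_s
```

The design choice worth noting is that both pieces turn a silent failure into a loud one: the probe raises before the pipeline keeps going, and an illegal transition raises instead of leaving an action in an ambiguous state.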

The revert

The honest part of this post.

On Sunday May 24 — the day after I flew back from the islands, still living on the phone — around lunchtime I merged a P0 stuck-crawl fix that bundled two AI-crawl improvements into one PR. The AI review had been generally positive. Two of the five personas had flagged blockers about completed-step semantics and missing test coverage for one of the integration modes. I read them on the phone, decided they could be follow-ups, hit merge.

They could not be follow-ups.

Within an hour, the stuck-crawl behavior on the exact test case I’d been chasing was worse than before the fix. Bugsink picked up new exceptions. The bug-bot started ticking on them. Beta on subsequent PRs started timing out where it hadn’t before.

This is the receipt:

Sun 12:31

P0 fix merged after I overrode two persona-review blockers. Phone, away from the desk.

Sun 13:24

Bugsink + Beta on the next PR show a regression on the same test case. Mobile Safari surfaces both within seconds.

Sun 13:48

Opened a revert PR on the phone. Pre-commit hooks ran in CI; the four gates passed on a clean revert.

Sun 14:02

Revert merged. Bugsink event rate drops back to baseline.

Mon — Tue

Re-do in three smaller PRs — pipeline resilience, retry budget v2, selector rewrite v2. This time the persona blockers were addressed before merge, not after. Last one landed Tuesday.
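
The Sunday timestamps above are enough to work out how long each phase took. A small Python sketch of that arithmetic follows; the timestamps come straight from the receipt, while the dictionary keys and the helper name are made up for illustration.

```python
from datetime import datetime

FMT = "%H:%M"

# Sunday timestamps copied from the receipt; the key names are illustrative.
events = {
    "fix_merged": "12:31",
    "regression_seen": "13:24",
    "revert_opened": "13:48",
    "revert_merged": "14:02",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(events["fix_merged"], events["regression_seen"])
revert_turnaround = minutes_between(events["revert_opened"], events["revert_merged"])
time_to_recover = minutes_between(events["fix_merged"], events["revert_merged"])

print(time_to_detect, revert_turnaround, time_to_recover)  # 53 14 91
```

That is 53 minutes from merge to a visible regression, 14 minutes from opening the revert PR to merging it, and 91 minutes from the bad merge to recovery.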

The lesson isn’t that I shouldn’t merge from a phone. The lesson is the one the AI review was already telling me: when persona-review flags a blocker, “follow-up” is not an answer. The system was correct. I overrode it. The system then caught the consequences within an hour and gave me a clean way to undo it. Total customer-visible damage: zero, because Bugsink + Beta + the bug-bot triangulated the regression before any user hit it in earnest.
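
One way to read that rule in code: a review that is “generally positive” still blocks if any single persona raised a blocker. The Python sketch below is an assumption about how such an aggregation could work, not the real reviewer. The persona labels, the `severity` vocabulary, and `review_verdict` are all hypothetical; only the two blocker topics come from the review described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PersonaFinding:
    persona: str   # hypothetical persona label
    severity: str  # "blocker" or "note" (assumed vocabulary)
    summary: str


def review_verdict(findings: List[PersonaFinding]) -> str:
    """BLOCK if any persona raised a blocker, however positive the rest are."""
    return "BLOCK" if any(f.severity == "blocker" for f in findings) else "PASS"


# Roughly the shape of the Sunday review: three positive personas, two blockers.
sunday_review = [
    PersonaFinding("persona-1", "note", "looks good"),
    PersonaFinding("persona-2", "note", "looks good"),
    PersonaFinding("persona-3", "note", "looks good"),
    PersonaFinding("persona-4", "blocker", "completed-step semantics"),
    PersonaFinding("persona-5", "blocker", "missing test coverage for one integration mode"),
]
print(review_verdict(sunday_review))  # BLOCK
```

There is deliberately no majority vote here: three approvals do not outweigh two blockers.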

If you want one sentence on why we run our own pipeline against our own commits: it’s so the system is right even when I’m wrong.

Why Woodpecker, and why all this plumbing

People ask why we run our own CI. Three reasons:

Our gates aren’t a generic GitHub Action. Alpha is the QualityMax AI reviewer running on the diff. Beta is QualityMax running its own crawler-generated tests against the preview. If we hosted those on GHA we’d be paying GitHub to run our own infrastructure indirectly — or paying a vendor to host the gates that are our product. Woodpecker on our own EC2 means the gates that protect every customer PR also protect every QualityMax PR, at marginal cost.
We get the live receipt. When the same pipeline runs on customer code and our code, every weird interaction becomes a self-improvement signal. A recent switch to uv-based parallel install steps cut our pipeline time roughly in half — the same speedup landed for every customer pipeline behind us.
Phone-merging only works with a tight loop. If a gate failure takes 45 minutes to surface I’m back at my laptop by the time it does. The whole thing has to come back in under ~15 minutes for the “merge from anywhere” pattern to work. Woodpecker + uv + our own runners hit that envelope. Hosted CI on its bad days does not.

Also: our Woodpecker setup is now battle-tested in the way only production teaches. The hard-won lessons are written into the contributor docs as durable instructions for the next agent that touches the pipeline — including a four-PR silent-pipeline outage that turned out to be a config file-vs-directory ambiguity, and a separate parse-time failure mode where a missing CI secret aborts the whole manifest set before any when: clause is evaluated.

What this proves

Three things, in order of importance:

The pipeline is the unlock, not the phone. Phone-first engineering isn’t a productivity hack — it’s the artifact of a pipeline being load-bearing enough that you can step away. If your CI is “run the tests; you decide what to do” you can’t merge from anywhere except your desk. If your CI is “AI review + SAST + tests + preview-deploy E2E, all green or it doesn’t merge,” you can merge from a ferry terminal.
The dogfooding loop closes on the way it’s actually used. I shipped backend changes through the mobile web app. I caught its own responsive-layout bugs because I was living in it from a phone all week. The native iOS dashboard from the previous post still has its own role — but on this trip the boring web surface did the heavy lifting. Tool, target, and user kept folding into each other.
The system was right when I was wrong. The revert wasn’t a pipeline failure — it was a pipeline success. AI review told me the truth. Beta + Bugsink + the bug-bot caught the consequences of me ignoring the truth. There was a clean way back. That’s the entire premise of QualityMax in one revert.

We’ll keep building the platform the same way we sold it: every commit through the gates, every bug a future test, every system improvement compounding the next one. The next post is about what happens when the bug-detection bot starts opening PRs against itself.

Run the same gates on your own PRs

Alpha (AI review + SAST + prompt-injection), Gamma (your test suite), Delta (QM-generated scripts), Beta (Playwright on preview). Same pipeline we ran on every PR in this post. Connect your repo and the bot ships the same receipts on your next PR — no Woodpecker required on your side.
