M6.2 — Flake, retries, triage¶
What you'll learn
- The common sources of Gauntlet flake, mapped back to earlier modules.
- When a retry is legitimate and when it masks a real bug.
- A repeatable triage workflow for a red Gauntlet run, ending in a classification and an owner.
How it applies (QA)
Owning a Gauntlet suite is mostly managing red — and most red in a mature suite is flake, not fresh product bugs. A test owner who classifies fast and retries responsibly keeps the gate trusted. One who reflexively retries until green trains the team to ignore the gate, which is how a real regression ships.
Concepts¶
Where flake comes from (a map)¶
Almost every Gauntlet flake traces to a cause you've already met:
| Flake source | Module | Tell |
|---|---|---|
| Device/pool — kit offline, dirty, contended, leaked reservation | M3.3 | fails on one kit, passes on others; "no devices available" |
| Isolation — hardcoded port/path, shared state | M4.4 | green solo, red in batch, non-reproducible solo |
Timeout — MaxDuration too tight for a slow target |
M4.2 | red on cold/loaded devices, "timed out waiting for marker" |
| Marker visibility — stripped in this configuration | M5.3 | green in Development, red in Test/Shipping, no crash |
| Real network — latency, NAT, packet loss | M3.2 | only per-device multiplayer, intermittent connect failures |
| Genuine product flake — a real intermittent game bug | — | reproduces on clean devices, in isolation, within budget |
The first five are infrastructure/test flake — your problem as the harness owner, fixable without touching the game. The last is a real defect surfaced intermittently. Classifying which is the whole game.
Retries: legitimate vs. masking¶
A retry re-runs a red test hoping for green. It's a sharp tool with a wrong edge:
- Legitimate — re-running past a known transient infrastructure fault: a device that was being reimaged, a pool hiccup, a network blip. The thing you're retrying past is not the build. Best practice: retry narrowly (only on classified-transient failures), log every retry, and alert if retry rates climb (rising retries = degrading infra or a new flake).
- Masking — re-running a product or test failure until it happens to pass. This converts a real signal into noise: the regression is still there, the gate just stopped reporting it. A test that's "green after 3 retries" on a genuine 1-in-4 crash is lying.
The rule: retry to absorb infrastructure nondeterminism, never to launder a build failure. If you can't name the transient cause, don't retry — investigate.
Quarantine over deletion¶
When a test is too flaky to gate but you don't want to lose its coverage, quarantine it: keep it running but stop it from blocking the gate (informational), file the flake, and fix it on a clock. Quarantine is a holding cell, not a graveyard — an un-time-boxed quarantine is a silent deletion. The alternative (leaving a flaky test gating) erodes trust in every red, which is more expensive than the one test.
The triage workflow¶
A repeatable order for any red run (it composes the earlier modules):
- Skipped or failed? (M6.1) Grey/skipped → upstream build problem, not your test. Stop here.
- Crash/ensure? (M2.4) The run's crash summary first — a real crash usually ends triage at the dump.
- Localize the tier. (M1.3) Device/pool? session/role? verdict logic? Use the symptom→tier map.
- Reproducibility probes:
- Other kits pass? → device (M3.3).
- Green solo, red in batch? → isolation (M4.4).
- Only this configuration? → marker visibility (M5.3).
- Slow targets only, "timed out"? →
MaxDuration(M4.2). - Reproduces clean, isolated, in-budget? → real product bug.
- Classify + route. Infra/test flake → you fix (and decide retry/quarantine). Product bug → file against the feature with the artifacts. Either way: record the build provenance (M3.4).
Pitfall: the auto-retry that buries a regression
A pipeline that silently retries any red until green (or N times) will pass a build with a real 1-in-3 crash and report it green. The crash ships; the post-mortem finds "the test caught it but retried past it." Retries must be classified and counted, never blanket — a green that required retries is not the same as a clean green, and your reporting should say so.
Worked example — three reds, three classifications¶
| Symptom | Probe result | Classification | Action |
|---|---|---|---|
UE.BootTest PS5 red, "no devices available" |
n/a (never launched) | device/pool (M3.3) | free/repair pool; legitimate to re-queue, not "retry the test" |
| Custom net test red ~25% in batch, green every solo re-run | green solo | isolation (M4.4) | fix hardcoded port; do not retry-to-green in the meantime |
| Boot test red on cold console, green warm; "timed out at 90s" | passes at 150s | timeout (M4.2) | raise MaxDuration for that target; not a product bug |
None of these is "retry until green." Two are fixes you own; one is a pool action. The masking trap would have hidden all three behind a retry.
Lab — Write a triage runbook entry
For the boot test you've carried since M1.3, write a one-page runbook the next on-call would follow when it goes red: (1) the skipped-vs-failed check, (2) the ordered probes and which tier each implicates, (3) your retry policy (what transient causes justify a retry, the cap, and that retries are logged), (4) the quarantine criteria and time-box, (5) what to capture before re-running (provenance + artifacts).
Exercise 1 — Retry or not?
For each red, say retry (transient infra) or don't retry — investigate:
- "No devices available" during a pool reimage window.
- A client crashes with a dump, reproducibly, on clean kits.
- Green solo, red in batch, hardcoded port suspected.
- A network connect failure that vanished when the lab switch was rebooted.
Exercise 2 — Classify
A test is green in Development nightly, red on the first Test-configuration promotion run, with no crash and a "timed out waiting for marker" error. Which flake source, and what's the fix?
Self-check — answers
Exercise 1: 1 retry (known transient infra) — though really "re-queue for a device," 2 don't retry (reproducible product crash — file it), 3 don't retry (isolation bug — retrying masks it; fix the port), 4 retry was effectively the switch reboot; the failure was infra, so re-running after the fix is fine (but the root cause was the switch, log it).
Exercise 2: Marker visibility (M5.3) — the success marker is emitted at a verbosity stripped
in the Test configuration, so it's never seen and the run times out, despite the build being
fine. Fix: emit the marker at a verbosity that survives Test (or assert via a channel that
ships), then re-run. Not a product bug, not a retry candidate.
Done when
- [ ] You can map a flake symptom to its source module and tier.
- [ ] You can state when a retry is legitimate vs. masking, and why blanket retries are dangerous.
- [ ] You can describe quarantine and why it must be time-boxed.
- [ ] You can run the triage workflow to a classification (infra/test vs. product) and an owner.
Next: M6.3 — Capstone lab.