M6.2 — Flake, retries, triage¶

What you'll learn

The common sources of Gauntlet flake, mapped back to earlier modules.
When a retry is legitimate and when it masks a real bug.
A repeatable triage workflow for a red Gauntlet run, ending in a classification and an owner.

How it applies (QA)

Owning a Gauntlet suite is mostly managing red — and most red in a mature suite is flake, not fresh product bugs. A test owner who classifies fast and retries responsibly keeps the gate trusted. One who reflexively retries until green trains the team to ignore the gate, which is how a real regression ships.

Concepts¶

Where flake comes from (a map)¶

Almost every Gauntlet flake traces to a cause you've already met:

Flake source	Module	Tell
Device/pool — kit offline, dirty, contended, leaked reservation	M3.3	fails on one kit, passes on others; "no devices available"
Isolation — hardcoded port/path, shared state	M4.4	green solo, red in batch, non-reproducible solo
Timeout — `MaxDuration` too tight for a slow target	M4.2	red on cold/loaded devices, "timed out waiting for marker"
Marker visibility — stripped in this configuration	M5.3	green in Development, red in Test/Shipping, no crash
Real network — latency, NAT, packet loss	M3.2	only per-device multiplayer, intermittent connect failures
Genuine product flake — a real intermittent game bug	—	reproduces on clean devices, in isolation, within budget

The first five are infrastructure/test flake — your problem as the harness owner, fixable without touching the game. The last is a real defect surfaced intermittently. Classifying which is the whole game.

Retries: legitimate vs. masking¶

A retry re-runs a red test hoping for green. It's a sharp tool with a wrong edge:

Legitimate — re-running past a known transient infrastructure fault: a device that was being reimaged, a pool hiccup, a network blip. The thing you're retrying past is not the build. Best practice: retry narrowly (only on classified-transient failures), log every retry, and alert if retry rates climb (rising retries = degrading infra or a new flake).
Masking — re-running a product or test failure until it happens to pass. This converts a real signal into noise: the regression is still there, the gate just stopped reporting it. A test that's "green after 3 retries" on a genuine 1-in-4 crash is lying.

The rule: retry to absorb infrastructure nondeterminism, never to launder a build failure. If you can't name the transient cause, don't retry — investigate.

Quarantine over deletion¶

When a test is too flaky to gate but you don't want to lose its coverage, quarantine it: keep it running but stop it from blocking the gate (informational), file the flake, and fix it on a clock. Quarantine is a holding cell, not a graveyard — an un-time-boxed quarantine is a silent deletion. The alternative (leaving a flaky test gating) erodes trust in every red, which is more expensive than the one test.

The triage workflow¶

A repeatable order for any red run (it composes the earlier modules):

Skipped or failed? (M6.1) Grey/skipped → upstream build problem, not your test. Stop here.
Crash/ensure? (M2.4) The run's crash summary first — a real crash usually ends triage at the dump.
Localize the tier. (M1.3) Device/pool? session/role? verdict logic? Use the symptom→tier map.
Reproducibility probes:
- Other kits pass? → device (M3.3).
- Green solo, red in batch? → isolation (M4.4).
- Only this configuration? → marker visibility (M5.3).
- Slow targets only, "timed out"? → MaxDuration (M4.2).
- Reproduces clean, isolated, in-budget? → real product bug.
Classify + route. Infra/test flake → you fix (and decide retry/quarantine). Product bug → file against the feature with the artifacts. Either way: record the build provenance (M3.4).

Pitfall: the auto-retry that buries a regression

A pipeline that silently retries any red until green (or N times) will pass a build with a real 1-in-3 crash and report it green. The crash ships; the post-mortem finds "the test caught it but retried past it." Retries must be classified and counted, never blanket — a green that required retries is not the same as a clean green, and your reporting should say so.

Worked example — three reds, three classifications¶

Symptom	Probe result	Classification	Action
`UE.BootTest` PS5 red, "no devices available"	n/a (never launched)	device/pool (M3.3)	free/repair pool; legitimate to re-queue, not "retry the test"
Custom net test red ~25% in batch, green every solo re-run	green solo	isolation (M4.4)	fix hardcoded port; do not retry-to-green in the meantime
Boot test red on cold console, green warm; "timed out at 90s"	passes at 150s	timeout (M4.2)	raise `MaxDuration` for that target; not a product bug

None of these is "retry until green." Two are fixes you own; one is a pool action. The masking trap would have hidden all three behind a retry.

Lab — Write a triage runbook entry

For the boot test you've carried since M1.3, write a one-page runbook the next on-call would follow when it goes red: (1) the skipped-vs-failed check, (2) the ordered probes and which tier each implicates, (3) your retry policy (what transient causes justify a retry, the cap, and that retries are logged), (4) the quarantine criteria and time-box, (5) what to capture before re-running (provenance + artifacts).

Exercise 1 — Retry or not?

For each red, say retry (transient infra) or don't retry — investigate:

"No devices available" during a pool reimage window.
A client crashes with a dump, reproducibly, on clean kits.
Green solo, red in batch, hardcoded port suspected.
A network connect failure that vanished when the lab switch was rebooted.

Exercise 2 — Classify

A test is green in Development nightly, red on the first Test-configuration promotion run, with no crash and a "timed out waiting for marker" error. Which flake source, and what's the fix?

Self-check — answers

Exercise 1: 1 retry (known transient infra) — though really "re-queue for a device," 2 don't retry (reproducible product crash — file it), 3 don't retry (isolation bug — retrying masks it; fix the port), 4 retry was effectively the switch reboot; the failure was infra, so re-running after the fix is fine (but the root cause was the switch, log it).

Exercise 2: Marker visibility (M5.3) — the success marker is emitted at a verbosity stripped in the Test configuration, so it's never seen and the run times out, despite the build being fine. Fix: emit the marker at a verbosity that survives Test (or assert via a channel that ships), then re-run. Not a product bug, not a retry candidate.

Done when

[ ] You can map a flake symptom to its source module and tier.
[ ] You can state when a retry is legitimate vs. masking, and why blanket retries are dangerous.
[ ] You can describe quarantine and why it must be time-boxed.
[ ] You can run the triage workflow to a classification (infra/test vs. product) and an owner.

Next: M6.3 — Capstone lab.