M8.3 — Save Reliability¶

What you'll learn

Why a save file is untrusted input and the failure modes a robust loader must survive.
Three reliability layers, each driven by a concrete failure: atomic write (crash mid-write), versioned migration (save from an older build), and backup before overwrite (a bad save destroying a good one).
How to enumerate failure modes before writing the loader — negative testing as a design step.
How to keep all this proportionate to a beginner project: the layers that earn their keep versus the production extras that don't yet.

How it applies

Losing a save is the worst bug a game ships. An ARPG run is hours of investment; a loader that corrupts or loses it on an OS crash, a power loss mid-write, or a version mismatch will be the single thing players remember. Reliability is not polish here — it is the difference between a game players trust and one they don't.
Saves change format, guaranteed. The day you add a stat or a slot, old saves no longer match the new code. Without a migration path, the first patch after launch breaks every existing save. Planning for format change from the first version is far cheaper than retrofitting it.
The file is written by the wild. It may be truncated by a crash, edited by a player, or left by an older build. Treating it as untrusted input — validate, then use — is the same discipline as validating any external input, which a tester recognizes immediately as negative testing.
Proportionality is a skill. Production saves have many layers; a beginner project needs a few. Knowing which failure modes are likely enough to defend (crash mid-write, version drift) versus which are premature (base64 portable export, telemetry metadata) is judgment worth practicing.

Try it first

Before reading on: list every way loading the save from M8.2 could go wrong in a shipped game — not logic bugs, but the file itself being in a bad state. For each, decide what the loader should do (recover? refuse? start fresh?). Spend a few minutes; the failure list you produce is the loader's test matrix, and the recovery decisions are this chapter.

Concepts¶

The save is untrusted input¶

The M8.2 loader assumed a well-formed file. In the wild it won't always be. Enumerate the failure modes — this is negative testing, done at design time:

Failure mode	Cause	Recovery
File missing	first launch, or player deleted it	start a new game
File empty / truncated	crash or power loss mid-write	restore from backup, else new game
Malformed JSON	interrupted write, or bad hand-edit	restore from backup, else new game
Valid JSON, wrong shape	corruption, or a different file	refuse; restore from backup
Valid, older schema version	save from a previous build	migrate to current, then load
Valid, values out of range	corruption or cheating	clamp and log, or refuse

The loader's job is to detect each and recover deliberately, never to crash or silently load garbage. Each row is a test case (M8.3's QA bridge: the table is the negative-test suite).

Layer 1 — atomic write (crash mid-write)¶

The dangerous moment is writing: if the game dies after truncating the file but before finishing, the save is destroyed. The fix is to never overwrite the real file in place. Write to a temporary file, then rename it over the real one — a rename is atomic on every major OS, so the real save is either the old complete file or the new complete file, never a half-written one.

Example

Atomic save: write to a temp path, then rename.

func save_game() -> void:
    var json := JSON.stringify(_serialize(), "\t")
    var tmp := SAVE_PATH + ".tmp"
    var file := FileAccess.open(tmp, FileAccess.WRITE)
    if file == null:
        push_error("save: cannot open temp file")
        return
    file.store_string(json)
    file.close()                                  # fully written and flushed
    # promote temp to the real save; back up the previous real save first (layer 3)
    if FileAccess.file_exists(SAVE_PATH):
        DirAccess.copy_absolute(SAVE_PATH, SAVE_PATH + ".bak")
    DirAccess.rename_absolute(tmp, SAVE_PATH)

The real SAVE_PATH is only replaced by a complete temp file via rename_absolute. If the game crashes during store_string, only the .tmp is damaged; the real save is untouched. The .bak copy is layer 3.

Layer 2 — versioned migration (old saves)¶

M8.2 wrote a version field. When the schema changes (a new stat, a renamed field), bump the version and add a migration that upgrades an old dictionary to the current shape before _apply reads it. Run migrations in order, so a very old save steps up through each version.

Example

const CURRENT_VERSION := 2

func _migrate(data: Dictionary) -> Dictionary:
    var v := int(data.get("version", 1))
    if v < 2:
        # v1 had no "currency" field; default it, then mark migrated
        data["currency"] = data.get("currency", 0)
        data["version"] = 2
    # if v < 3: ... future migrations chain here ...
    return data

Each migration is a small, named step describing the real change it accounts for ("v1 had no currency"). Chaining them means a v1 save loads in a v5 build by stepping through each migration. The _apply's per-field data.get(field, default) (M8.2) already tolerates missing fields, so migration plus defensive reads together absorb format drift.

Layer 3 — backup before overwrite¶

Even with atomic writes, a logically bad save (a bug wrote nonsense, or a migration was wrong) can replace a good one. Keeping the previous save as .bak (copied just before the rename, above) gives the loader a fallback: if the main file fails validation, try the backup before giving up and starting fresh.

Example

Load with backup fallback:

func load_game() -> bool:
    if _try_load(SAVE_PATH):
        return true
    push_warning("save: main file failed; trying backup")
    if _try_load(SAVE_PATH + ".bak"):
        return true
    push_warning("save: no valid save; starting new game")
    return false

func _try_load(path: String) -> bool:
    if not FileAccess.file_exists(path):
        return false
    var file := FileAccess.open(path, FileAccess.READ)
    if file == null:
        return false
    var data = JSON.parse_string(file.get_as_text())
    file.close()
    if typeof(data) != TYPE_DICTIONARY:
        return false                              # malformed or wrong shape
    _apply(_migrate(data))
    return true

load_game tries the main file, then the backup, then concedes to a new game — each step logged. A crash that corrupted the main save still leaves the .bak from the previous successful save. Combined with atomic writes (layer 1) and migration (layer 2), the loader survives the common failure modes without crashing or loading garbage.

Proportionality¶

Production saves add more — a sentinel token to truncate a corrupt tail, base64 portable export for sharing across devices, ASCII-sanitization retries, telemetry metadata. Those are real, but they are not the first things a beginner project needs. The three layers above (atomic write, migration, backup) defend the likely failures (crash mid-write, version drift, a bad save clobbering a good one) and are small. Add the rest when the project's scale justifies it. Recognizing that ordering — defend the likely, defer the exotic — is the judgment; over-engineering the save before the game is done is its own failure.

Walkthrough¶

Change save_game (M8.2) to the atomic write: write to SAVE_PATH + ".tmp", copy the existing save to .bak, then DirAccess.rename_absolute the temp over the real path.
Add version to the serialized dictionary (M8.2 already did) and a _migrate(data) that upgrades older versions; call it in _apply's path (or in _try_load before _apply).
Restructure load_game to _try_load(main) → _try_load(bak) → new game, with each step logged via push_warning.
Write the negative-test list (the failure-mode table) down as a checklist for your project.
Press F5, save, then simulate the failures: delete the save (expect new game), truncate the JSON in a text editor (expect backup fallback), bump CURRENT_VERSION and remove a field from an old save (expect migration to fill it). Confirm each recovers as designed and logs what it did.

Optional sanity check

Hand-corrupt the main save (delete its closing brace so the JSON is malformed) and reload: the loader should log "main file failed; trying backup," load the .bak, and the run should come back intact. Then delete both files and reload: it should log "no valid save; starting new game" and not crash. These two probes exercise the malformed-file and missing-file rows of your failure table — the negative tests that distinguish a save system that recovers from one that loses the player's run.

Self-check quiz¶

Q1 — Why write to a temp file and rename it over the real save, instead of overwriting the real file directly?

A. Renaming is faster than writing. B. A crash during a direct overwrite leaves the real save half-written and destroyed; writing a complete temp file and atomically renaming it means the real save is always either the old complete file or the new complete one. C. Temp files compress better. D. The engine forbids overwriting save files.

Reveal answer

B. The vulnerable window is the write itself; an in-place overwrite that's interrupted corrupts the only copy. A complete temp file plus an atomic rename guarantees the real save is never in a half-written state. A is false (rename isn't about speed). C and D are fabricated.

Q2 — Why add a version field and migration chain from the very first save format?

A. To make the file larger. B. Because the format will change as the game grows; without a versioned migration path, the first patch that adds or renames a field breaks every existing save, and retrofitting migration later is far costlier than planning for it. C. Because JSON requires a version field. D. Versioning prevents all corruption.

Reveal answer

B. Format change is inevitable, and a version + ordered migrations lets old saves step up to the current shape; planning it from v1 is cheap, retrofitting it after saves exist in the wild is not. A is trivial. C is false. D overstates it — versioning handles format drift, not crashes or bad edits (that's layers 1 and 3).

Q3 — Why treat the save file as 'untrusted input,' and what testing discipline does that map to?

A. Because players are adversaries. B. Because the file can be truncated by a crash, hand-edited, or written by an older build, so the loader must validate and recover rather than assume well-formed data — which is negative testing (enumerate failure modes, verify graceful handling). C. Because JSON is insecure. D. Because user:// is shared between players.

Reveal answer

B. The file's state is not under the loader's control at read time (crash, edit, old version), so it must be validated before use and recover deliberately on failure — the definition of negative testing applied to save loading. A overdramatizes (most cases aren't malicious). C is false. D is false (user:// is per-user).

Integration question¶

Q4 — open

The three reliability layers each defend a specific failure mode. Map each layer to its failure, explain how they compose so the loader survives the common ways a save goes wrong, and justify why a beginner project should implement these three but defer production extras like sentinel tokens and base64 export — connecting the whole chapter to the negative-testing mindset from M8.1's state-coverage framing.

Reveal expected answer

Each layer answers one failure: atomic write (write temp, rename over) defends crash/power-loss mid-write, because the real save is only ever replaced by a complete file, never left half-written; versioned migration defends a save from an older build, because a version field plus ordered migrations upgrades an old dictionary to the current shape before it's applied; backup before overwrite (.bak copy before the rename, with load falling back to it) defends a bad save clobbering a good one, because the previous good save survives even if the new one is logically corrupt. They compose into a loader that tries the main file, falls back to the backup, and concedes to a new game only when both fail — so a crash mid-write leaves an intact main save (layer 1), a version bump still loads old saves (layer 2), and a logically corrupt overwrite still has the previous run recoverable (layer 3); each step is logged, and _apply's per-field defaults absorb missing fields. A beginner project implements exactly these three because they defend the likely failures (crashes, patches, bad writes) at small cost, while deferring production extras — sentinel tokens for corrupt-tail truncation, base64 portable export for cross-device sharing, ASCII-retry, telemetry metadata — that address rarer or scale-dependent needs and add complexity before the game is even finished; over-engineering the save before the game ships is its own failure of proportionality. The whole chapter is the negative-testing counterpart to M8.1's state coverage: M8.1 enumerated the state that must round-trip; M8.3 enumerates the ways the file can be wrong and defines the recovery for each, turning "is the save robust?" into a finite failure-mode table that doubles as the loader's test suite — the same move a QA professional makes when they list how an input can be malformed before trusting it.