Break-the-glass disaster recovery
Problem
Normally SPIKE recovers itself: Nexus rebuilds its root key from the keepers on startup. But if you lose enough keepers at once (the whole cluster, the node, the data center) there is nothing left to auto-recover from. Break-the-glass recovery is the human-held fallback: a set of recovery shards an operator exports ahead of time and feeds back in after a catastrophe.
The catch is in the timing. The shards must be exported while the system is healthy. If you wait until the outage, it is too late.
TL;DR
Two operator commands, two different moments:
# BEFORE disaster, while Nexus is healthy: export recovery shards
spike operator recover
# AFTER disaster, when Nexus cannot auto-recover: feed shards back in
spike operator restore # prompts for one shard at a time; repeat
recover needs the recover role; restore needs the restore role. Store
the exported shards encrypted, offline, and split across custodians.
Workflow
Phase 1 — Export shards (do this now, while healthy)
- As an operator with the recover role, run:
    spike operator recover
- SPIKE writes the recovery shards to the recovery directory as spike.recovery.0.txt, spike.recovery.1.txt, … Each file holds one shard in spike:<index>:<hex> format.
- Immediately secure them. Encrypt each shard, move it to safe offline storage (ideally different custodians/locations), and securely erase the plaintext files from the recovery directory. SPIKE prints this reminder for a reason: if you lose these shards, a total crash is unrecoverable.
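The "secure them" step can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not SPIKE tooling: the function names are hypothetical, the recovery directory is passed in by the caller because its location is not given here, and gpg and shred stand in for whatever encryption and secure-erase tools your organization has approved. Only the spike.recovery.N.txt file naming comes from the text above.

```shell
#!/bin/sh
# Hypothetical helpers; none of this ships with SPIKE.

# count_plaintext_shards DIR
# Prints how many plaintext shard files (spike.recovery.*.txt) remain in DIR.
count_plaintext_shards() {
    dir=$1
    n=0
    for f in "$dir"/spike.recovery.*.txt; do
        # An unmatched glob stays literal, so confirm the file exists.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# encrypt_and_erase_shards DIR
# Encrypts each plaintext shard with a passphrase (gpg prompts for it),
# then overwrites and removes the plaintext file.
# Assumption: gpg and GNU shred are installed. shred gives weaker
# guarantees on journaling or copy-on-write filesystems and on SSDs.
encrypt_and_erase_shards() {
    dir=$1
    for f in "$dir"/spike.recovery.*.txt; do
        [ -f "$f" ] || continue
        gpg --symmetric --cipher-algo AES256 --output "$f.gpg" "$f" || return 1
        shred -u "$f" || return 1
    done
}
```

After running encrypt_and_erase_shards on the recovery directory, count_plaintext_shards on the same directory should print 0 before you move the .gpg files to offline storage.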
Phase 2 — Restore (only after a catastrophe)
- When Nexus cannot auto-recover (keepers gone, no root key), an operator with the restore role runs:
    spike operator restore
- Paste one recovery shard when prompted. Input is hidden. SPIKE reports progress:
    Shards collected: 1
    Shards remaining: 1
    Please run `spike operator restore` again to provide the remaining shards.
- Repeat with the next shard until SPIKE collects the threshold and prints:
    SPIKE is now restored and ready to use.
Tips
- recover vs restore. recover exports shards from a healthy system (proactive backup). restore imports them into a broken one (reactive rebuild). They are not opposites of one command; they are two halves of one drill.
- Threshold, not all. Restore needs only the threshold number of shards, not every shard, so you can tolerate losing some custodians. This is the same Shamir threshold that backs keeper auto-recovery. See Where the root key lives.
- Rehearse it. Schedule a recovery drill: export shards, stand up a throwaway Nexus, and restore into it. A break-the-glass procedure no one has run is a guess, not a plan.
- Roles are separate identities. The recover and restore roles are distinct SPIFFE-ID roles, separate from day-to-day Pilot access. Provision them deliberately to the humans who hold the glass.
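The "threshold, not all" tip comes down to simple arithmetic when you plan custodians. The sketch below is illustrative only: the function name is hypothetical, and the totals passed in are example values, not SPIKE defaults.

```shell
#!/bin/sh
# tolerable_losses TOTAL THRESHOLD
# Prints how many shards (custodians) can be lost while a restore is
# still possible. Hypothetical helper for planning, not a SPIKE command.
tolerable_losses() {
    total=$1
    threshold=$2
    if [ "$threshold" -lt 1 ] || [ "$threshold" -gt "$total" ]; then
        echo "threshold must be between 1 and total" >&2
        return 1
    fi
    echo $((total - threshold))
}
```

For example, tolerable_losses 3 2 prints 1: with three shards and a threshold of two, one custodian can be unavailable. If the threshold equals the total, the result is 0 and every custodian becomes a single point of failure.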
Pitfalls
- You cannot export after the disaster. recover only works against a healthy Nexus. If you skipped Phase 1, there is no second chance once the keepers are gone. Export shards as part of going to production, not as an afterthought.
- Shards are root-key material. Anyone holding the threshold number of shards can rebuild the root key and decrypt everything. Treat them like the keys to the kingdom: encrypted, offline, split, audited.
- Shard format is exact. A shard is spike:<index>:<hex> where the hex is 64 characters (32 bytes). Truncated or reformatted shards are rejected. Keep them byte-for-byte.
- Re-keying invalidates old shards. If you re-bootstrap with a new root key, previously exported shards no longer restore the current system. Re-export after any deliberate root-key rotation.
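Because the shard format is exact, custodians may want to sanity-check a decrypted shard before pasting it into a restore prompt. The following is a minimal sketch of such a check, not SPIKE's own validation: the function name is hypothetical, and two details are assumptions not stated above, namely that the index is a decimal number and that hex digits may be upper or lower case.

```shell
#!/bin/sh
# is_valid_shard STRING
# Succeeds only if STRING looks exactly like spike:<index>:<hex> with
# 64 hex characters. Assumptions: decimal index, either hex case.
is_valid_shard() {
    # Reject multi-line input so a stray extra line cannot hide damage.
    case $1 in
        *"
"*) return 1 ;;
    esac
    printf '%s\n' "$1" | grep -Eq '^spike:[0-9]+:[0-9a-fA-F]{64}$'
}
```

A check such as is_valid_shard "$(cat spike.recovery.0.txt)" catches truncation, stray spaces, and pasted line breaks inside the shard. It cannot tell whether the shard belongs to the current root key, so it complements a rehearsal drill rather than replacing one.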
Cross-links
- Where the root key lives: keepers, Shamir, and recovery
- Backup and restore
- Bootstrapping a fresh SPIKE
- Reference: Recovery operations
What’s next
Pair key recovery with data backup: Backup and restore.