ADR-0029: Restrict Recovery and Restoration Operations to SPIKE Pilot
- Status: accepted
- Date: 2025-11-19
- Tags: Security, Recovery, Access Control, SPIFFE
Context
SPIKE provides various operations that workloads can perform against SPIKE Nexus, including secret management (get, put, delete), policy management, and critical recovery operations (recover, restore). Most operations can be controlled through SPIKE’s policy system, allowing fine-grained access control based on SPIFFE IDs and other attributes.
However, recovery and restoration operations are fundamentally different from regular operations:
- Recovery initiates the process of retrieving Shamir secret shards when SPIKE Nexus needs to be restored from a catastrophic failure
- Restoration submits these shards back to rebuild the root encryption key
These operations bypass normal secret access policies and directly manipulate the root cryptographic material that protects the entire secrets store. An attacker who gains the ability to invoke them could potentially decrypt all secrets in the system.
We need to determine the appropriate access control mechanism for these critical recovery operations.
Decision
Recovery (Recover) and restoration (Restore) operations will be restricted
exclusively to SPIKE Pilot at the SDK level, enforced through SPIFFE ID
validation.
Specifically:
- The SDK will check the caller’s SPIFFE ID using spiffeid.IsPilot()
- Only workloads identified as SPIKE Pilot may invoke these operations
- Violations will result in immediate fatal termination via log.FatalErr()
- This restriction is not configurable through policies
All other operations (secrets, policies, cipher, ACLs) remain policy-controlled and can be authorized for any workload based on configured policies.
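To make the identity check concrete, here is a minimal sketch of a hard-coded SPIFFE ID comparison. The trust domain `example.org`, the path `/spike/pilot`, and the function name `isPilot` are assumptions for illustration only; the SDK's actual `spiffeid.IsPilot()` may match IDs differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// Hypothetical values: the real SPIKE Pilot identity depends on the deployment.
const (
	trustDomain = "example.org"
	pilotPath   = "/spike/pilot"
)

// isPilot reports whether id has the spiffe scheme, the expected trust
// domain, and exactly the hard-coded Pilot path.
func isPilot(id string) bool {
	u, err := url.Parse(id)
	if err != nil {
		return false
	}
	return u.Scheme == "spiffe" && u.Host == trustDomain && u.Path == pilotPath
}

func main() {
	fmt.Println(isPilot("spiffe://example.org/spike/pilot"))  // true
	fmt.Println(isPilot("spiffe://example.org/workload/app")) // false
}
```

Because the expected identity is a compile-time constant in this sketch, no policy or configuration input can widen the set of accepted callers.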
Rationale
Security Criticality Hierarchy
SPIKE operations fall into different security tiers:
| Operation Type | Security Impact | Access Control |
|---|---|---|
| Secret read/write | Medium - affects individual secrets | Policy-based |
| Policy management | High - affects access control | Policy-based |
| Cipher operations | Medium - encryption/decryption | Policy-based |
| Recovery/Restore | Critical - affects entire system | Hard-coded |
Why Recovery Operations Are Different
Policy-controlled operations (secrets, policies, etc.):
- Operate within the normal secret access control framework
- Failure affects specific secrets or policies
- Can be safely delegated to various workloads
- Policy misconfiguration has limited blast radius
Recovery operations (recover, restore):
- Bypass all policy controls and access root cryptographic material
- A compromise could expose every secret in the system
- Should only be performed during disaster recovery scenarios
- Must have the smallest possible attack surface
- Policy-based control would create a circular dependency (policies are protected by the key being recovered)
Defense in Depth
While SPIKE Nexus itself validates recovery requests, enforcing the restriction at the SDK level provides defense in depth:
- SDK enforcement: Prevents unauthorized workloads from attempting recovery
- Nexus enforcement: Final validation even if SDK is bypassed
- SPIFFE authentication: Cryptographically verifiable identity
- Audit trail: Fatal errors logged when violations occur
SPIFFE Identity as Strong Authentication
SPIKE Pilot’s SPIFFE ID is:
- Cryptographically verified through mTLS
- Issued by the trusted SPIRE server
- Cannot be spoofed without compromising the SPIRE trust domain, or stolen without compromising the workload that holds the SVID's private key
- Provides stronger authentication than password-based or API key approaches
Fail-Safe Design
The SDK implementation uses log.FatalErr() rather than returning an error:
    if !spiffeid.IsPilot(selfSPIFFEID) {
        failErr := sdkErrors.ErrUnauthorized
        failErr.Msg = "recovery can only be performed from SPIKE Pilot"
        log.FatalErr(fName, *failErr)
    }
This ensures:
- No possibility of error handling bugs bypassing the check
- Clear audit trail in logs
- Immediate termination prevents any further processing
- Aligns with security-critical failure handling (similar to key length validation failures)
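The difference between returning an error and terminating can be seen in a small sketch. All names here (`checkPilot`, `mustBePilot`, `exit`) are hypothetical and are not the SDK's API. The `exit` function is a variable only so the behaviour can be exercised in a test without killing the test process.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

var errUnauthorized = errors.New("recovery can only be performed from SPIKE Pilot")

// exit is replaceable for testing; by default it ends the process.
var exit = func(code int) { os.Exit(code) }

// checkPilot returns an error, which a careless caller can silently drop.
func checkPilot(isPilot bool) error {
	if !isPilot {
		return errUnauthorized
	}
	return nil
}

// mustBePilot logs the violation and terminates; there is no return value
// for a caller to mishandle.
func mustBePilot(isPilot bool) {
	if !isPilot {
		fmt.Fprintln(os.Stderr, "fatal:", errUnauthorized)
		exit(1)
	}
}

func main() {
	_ = checkPilot(false) // error dropped: nothing stops the caller
	fmt.Println("continued after an ignored error")

	mustBePilot(true) // a Pilot caller passes through normally
	fmt.Println("pilot allowed")
	// mustBePilot(false) would print the error and exit with status 1.
}
```

The trade-off is that a fatal exit gives the calling program no chance to clean up or retry, which is acceptable here only because a non-Pilot caller has no legitimate reason to reach this code path.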
Alternatives Considered
Alternative 1: Policy-Based Control
Allow recovery operations to be controlled through the policy system like other operations.
Rejected because:
- Creates circular dependency: policies are protected by the key being recovered
- During disaster recovery, the policy system may not be available
- Increases attack surface unnecessarily
- Policy misconfiguration could enable unauthorized recovery
Alternative 2: No SDK Enforcement
Rely solely on SPIKE Nexus to validate recovery requests.
Rejected because:
- Violates defense-in-depth principle
- Allows unauthorized attempts to reach Nexus unnecessarily
- Reduces audit trail granularity
- Bypasses early-fail security principle
Alternative 3: Configuration-Based Control
Make the allowed SPIFFE IDs configurable via environment variables or config files.
Rejected because:
- Configuration errors could accidentally enable unauthorized access
- Increases operational complexity
- Provides no real benefit (recovery should always be from Pilot)
- Configuration-based security is generally weaker than hard-coded enforcement for critical operations
Consequences
Positive
- Reduced attack surface: Only SPIKE Pilot can initiate recovery operations
- Defense in depth: Multiple layers of validation (SDK + Nexus)
- Fail-safe: Fatal errors prevent accidental bypasses
- Clear security model: Critical operations have stricter controls than regular operations
- Audit trail: Failed attempts are logged with context
- No configuration complexity: No additional configuration required
Negative
- Less flexible: Cannot delegate recovery to other workloads
- Operational constraint: Requires SPIKE Pilot for disaster recovery scenarios
- Hard-coded policy: Cannot be changed without code modification
Neutral
- Consistent with design: SPIKE Pilot is already the administrative/operator interface
- Expected behavior: Recovery is inherently a privileged operation
Implementation Details
SDK Enforcement
The spike-sdk-go package enforces this in:
- api/internal/impl/operator/recover.go:67-71
- api/internal/impl/operator/restore.go:76-80
    selfSPIFFEID := svid.ID.String()
    // Security: Recovery and Restoration can ONLY be done via SPIKE Pilot.
    if !spiffeid.IsPilot(selfSPIFFEID) {
        failErr := sdkErrors.ErrUnauthorized
        failErr.Msg = "recovery can only be performed from SPIKE Pilot"
        log.FatalErr(fName, *failErr)
    }
Operations NOT Restricted
The following operations remain policy-controlled and can be performed by any workload with appropriate policy permissions:
- Secret operations: Get, Put, Delete, Undelete, List, GetMetadata
- Policy operations: Create, Get, Delete, List
- Cipher operations: Encrypt, Decrypt
- ACL operations: Get, List
- Bootstrap operations: Contribute, Verify
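The split between identity-gated and policy-controlled operations can be pictured as a small dispatch. This is a sketch rather than the SDK's or Nexus's actual structure: the operation names come from the lists in this ADR, while `authorize`, `pilotOnly`, and the `policyAllows` callback (a stand-in for a policy engine lookup) are hypothetical.

```go
package main

import "fmt"

// pilotOnly lists the operations gated by identity alone, per this ADR.
var pilotOnly = map[string]bool{"Recover": true, "Restore": true}

// policyAllows is a hypothetical stand-in for a policy engine lookup.
type policyAllows func(spiffeID, op string) bool

// authorize applies the hard-coded rule first. Policies are never consulted
// for recovery operations, so no policy can grant or revoke them.
func authorize(op, spiffeID string, callerIsPilot bool, allows policyAllows) bool {
	if pilotOnly[op] {
		return callerIsPilot
	}
	return allows(spiffeID, op)
}

func main() {
	// Worst case: a misconfigured policy that permits everything.
	permissive := func(string, string) bool { return true }
	app := "spiffe://example.org/workload/app"

	fmt.Println(authorize("Get", app, false, permissive))     // true
	fmt.Println(authorize("Restore", app, false, permissive)) // false
}
```

Even with a policy that allows everything, the non-Pilot workload in this sketch cannot reach `Restore`, which is the blast-radius property the decision relies on.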
Nexus-Side Validation
SPIKE Nexus performs additional validation of recovery requests, providing a second layer of defense even if the SDK check is bypassed.
Migration Impact
This ADR documents existing behavior and does not require migration. The restriction has been in place since the recovery operations were first implemented.
References
- SPIFFE specification: https://spiffe.io/docs/latest/spiffe-about/overview/
- Shamir Secret Sharing: https://en.wikipedia.org/wiki/Shamir%27s_Secret_Sharing
- Defense in Depth: https://www.nist.gov/publications/defense-depth-strategy
Related ADRs
- ADR-0001: Use SPIFFE/SPIRE for Workload Identity
- ADR-0028: Use Human-Readable Error Messages in CLI Tools