ADR-0029: Restrict Recovery and Restoration Operations to SPIKE Pilot


  • Status: accepted
  • Date: 2025-11-19
  • Tags: Security, Recovery, Access Control, SPIFFE

Context

SPIKE provides various operations that workloads can perform against SPIKE Nexus, including secret management (get, put, delete), policy management, and critical recovery operations (recover, restore). Most operations can be controlled through SPIKE’s policy system, allowing fine-grained access control based on SPIFFE IDs and other attributes.

However, recovery and restoration operations are fundamentally different from regular operations:

  1. Recovery initiates the process of retrieving Shamir secret shards when SPIKE Nexus needs to be restored from a catastrophic failure
  2. Restoration submits these shards back to rebuild the root encryption key

These operations bypass normal secret access policies and directly manipulate the root cryptographic material that protects the entire secrets store. If an attacker gained the ability to invoke them, they could potentially decrypt every secret in the system.

We need to determine the appropriate access control mechanism for these critical recovery operations.

Decision

Recovery (Recover) and restoration (Restore) operations will be restricted exclusively to SPIKE Pilot at the SDK level, enforced through SPIFFE ID validation.

Specifically:

  • The SDK will check the caller’s SPIFFE ID using spiffeid.IsPilot()
  • Only workloads identified as SPIKE Pilot may invoke these operations
  • Violations will result in immediate fatal termination via log.FatalErr()
  • This restriction is not configurable through policies

All other operations (secrets, policies, cipher, ACLs) remain policy-controlled and can be authorized for any workload based on configured policies.
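
As a rough illustration of what a SPIFFE ID check of this kind might look like, here is a minimal sketch using only the Go standard library. It does not reproduce spiffeid.IsPilot(): the function name isPilotID, the trust domain example.org, and the path prefix /spike/pilot/ are hypothetical values chosen for the example.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// pilotPathPrefix is a HYPOTHETICAL path prefix for SPIKE Pilot identities.
const pilotPathPrefix = "/spike/pilot/"

// isPilotID reports whether id is a SPIFFE ID in trustDomain whose path
// starts with pilotPathPrefix. Anything that does not parse cleanly as a
// plain spiffe:// URI (userinfo, query, fragment) is rejected.
func isPilotID(trustDomain, id string) bool {
	u, err := url.Parse(id)
	if err != nil {
		return false
	}
	if u.Scheme != "spiffe" || u.Host != trustDomain {
		return false
	}
	if u.User != nil || u.RawQuery != "" || u.Fragment != "" {
		return false
	}
	return strings.HasPrefix(u.Path, pilotPathPrefix)
}

func main() {
	ids := []string{
		"spiffe://example.org/spike/pilot/role/superuser",
		"spiffe://example.org/workload/app",
		"spiffe://evil.test/spike/pilot/role/superuser",
	}
	for _, id := range ids {
		fmt.Println(id, isPilotID("example.org", id))
	}
}
```

The point of the sketch is that the decision depends only on the caller's identity, with no policy lookup and no configuration input.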

Rationale

Security Criticality Hierarchy

SPIKE operations fall into different security tiers:

Operation Type       Security Impact                        Access Control
Secret read/write    Medium - affects individual secrets    Policy-based
Policy management    High - affects access control          Policy-based
Cipher operations    Medium - encryption/decryption         Policy-based
Recovery/Restore     Critical - affects entire system       Hard-coded

Why Recovery Operations Are Different

Policy-controlled operations (secrets, policies, etc.):

  • Operate within the normal secret access control framework
  • Failure affects specific secrets or policies
  • Can be safely delegated to various workloads
  • Policy misconfiguration has limited blast radius

Recovery operations (recover, restore):

  • Bypass all policy controls and access root cryptographic material
  • Failure or compromise could expose all secrets in the system
  • Should only be performed during disaster recovery scenarios
  • Must have the smallest possible attack surface
  • Policy-based control would create a circular dependency (policies are protected by the key being recovered)
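
The split between the two kinds of operation can be sketched as a single authorization function. This is an illustrative model, not SPIKE's code: Operation, PolicyStore, staticPolicies, and authorize are hypothetical names. The recovery branch never touches the policy store, so it still gives an answer when the store is unavailable (modeled here as nil).

```go
package main

import "fmt"

// Operation names are hypothetical labels for this sketch.
type Operation string

const (
	OpSecretGet Operation = "secret:get"
	OpRecover   Operation = "recover"
	OpRestore   Operation = "restore"
)

// PolicyStore stands in for the policy engine. It may be nil when the
// root key is lost and policies cannot be read.
type PolicyStore interface {
	Allows(callerID string, op Operation) bool
}

// staticPolicies is a toy in-memory PolicyStore.
type staticPolicies map[string]map[Operation]bool

func (p staticPolicies) Allows(callerID string, op Operation) bool {
	return p[callerID][op]
}

// authorize applies the hard-coded gate to recovery operations and the
// policy store to everything else.
func authorize(op Operation, callerID string, callerIsPilot bool, policies PolicyStore) bool {
	switch op {
	case OpRecover, OpRestore:
		// Identity only: no policy lookup, so no circular dependency.
		return callerIsPilot
	default:
		if policies == nil {
			return false // fail closed when policies are unavailable
		}
		return policies.Allows(callerID, op)
	}
}

func main() {
	policies := staticPolicies{"app": {OpSecretGet: true, OpRestore: true}}
	fmt.Println(authorize(OpSecretGet, "app", false, policies))
	fmt.Println(authorize(OpRestore, "app", false, policies))
	fmt.Println(authorize(OpRecover, "pilot", true, nil))
}
```

Note that in this model a policy entry granting restore to "app" has no effect, which mirrors the decision that the restriction is not configurable through policies.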

Defense in Depth

While SPIKE Nexus itself validates recovery requests, enforcing the restriction at the SDK level provides defense in depth:

  1. SDK enforcement: Prevents unauthorized workloads from attempting recovery
  2. Nexus enforcement: Final validation even if SDK is bypassed
  3. SPIFFE authentication: Cryptographically verifiable identity
  4. Audit trail: Fatal errors logged when violations occur

SPIFFE Identity as Strong Authentication

SPIKE Pilot’s SPIFFE ID is:

  • Cryptographically verified through mTLS
  • Issued by the trusted SPIRE server
  • Cannot be spoofed or stolen without compromising the SPIRE trust domain
  • Provides stronger authentication than password-based or API key approaches

Fail-Safe Design

The SDK implementation uses log.FatalErr() rather than returning an error:

if !spiffeid.IsPilot(selfSPIFFEID) {
    failErr := sdkErrors.ErrUnauthorized
    failErr.Msg = "recovery can only be performed from SPIKE Pilot"
    log.FatalErr(fName, *failErr)
}

This ensures:

  • No possibility of error handling bugs bypassing the check
  • Clear audit trail in logs
  • Immediate termination prevents any further processing
  • Aligns with security-critical failure handling (similar to key length validation failures)
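
The first point can be illustrated with a small sketch that contrasts the two styles. It is a simplified model, not the SDK's code: checkPilot, mustBePilot, the two recover functions, and the swappable exit variable are hypothetical, and os.Exit stands in for whatever log.FatalErr() does internally.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

var errNotPilot = errors.New("recovery can only be performed from SPIKE Pilot")

// exit is a variable (an assumption of this sketch) so the guard can be
// exercised without killing the process; it defaults to os.Exit.
var exit = os.Exit

// checkPilot returns an error that the caller is free to ignore.
func checkPilot(callerIsPilot bool) error {
	if !callerIsPilot {
		return errNotPilot
	}
	return nil
}

// mustBePilot logs and terminates instead of returning an error.
func mustBePilot(callerIsPilot bool) {
	if !callerIsPilot {
		fmt.Fprintln(os.Stderr, "FATAL:", errNotPilot)
		exit(1)
	}
}

// recoverWithIgnoredError shows the bug class the ADR wants to rule out:
// the error is dropped and the sensitive path runs anyway.
func recoverWithIgnoredError(callerIsPilot bool) string {
	_ = checkPilot(callerIsPilot) // BUG: result discarded
	return "shards"
}

// recoverWithFatalGuard cannot reach the sensitive path for a non-Pilot
// caller, because the process ends inside mustBePilot.
func recoverWithFatalGuard(callerIsPilot bool) string {
	mustBePilot(callerIsPilot)
	return "shards"
}

func main() {
	// An unauthorized caller still gets a result when the error is ignored.
	fmt.Println("ignored error:", recoverWithIgnoredError(false))
	// With the fatal guard, only a Pilot caller gets this far; passing
	// false here would terminate the process with status 1.
	fmt.Println("fatal guard:", recoverWithFatalGuard(true))
}
```

The trade-off is that a fatal guard takes the decision away from the caller entirely, which is acceptable here because no legitimate non-Pilot workload should ever reach this code path.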

Alternatives Considered

Alternative 1: Policy-Based Control

Allow recovery operations to be controlled through the policy system like other operations.

Rejected because:

  • Creates a circular dependency: policies are protected by the key being recovered
  • During disaster recovery, the policy system may not be available
  • Increases attack surface unnecessarily
  • Policy misconfiguration could enable unauthorized recovery

Alternative 2: No SDK Enforcement

Rely solely on SPIKE Nexus to validate recovery requests.

Rejected because:

  • Violates defense-in-depth principle
  • Allows unauthorized attempts to reach Nexus unnecessarily
  • Reduces audit trail granularity
  • Abandons the fail-early security principle

Alternative 3: Configuration-Based Control

Make the allowed SPIFFE IDs configurable via environment variables or config files.

Rejected because:

  • Configuration errors could accidentally enable unauthorized access
  • Increases operational complexity
  • Provides no real benefit (recovery should always be from Pilot)
  • Configuration-based security is generally weaker than hard-coded enforcement for critical operations

Consequences

Positive

  • Reduced attack surface: Only SPIKE Pilot can initiate recovery operations
  • Defense in depth: Multiple layers of validation (SDK + Nexus)
  • Fail-safe: Fatal errors prevent accidental bypasses
  • Clear security model: Critical operations have stricter controls than regular operations
  • Audit trail: Failed attempts are logged with context
  • No configuration complexity: No additional configuration required

Negative

  • Less flexible: Cannot delegate recovery to other workloads
  • Operational constraint: Requires SPIKE Pilot for disaster recovery scenarios
  • Hard-coded policy: Cannot be changed without code modification

Neutral

  • Consistent with design: SPIKE Pilot is already the administrative/operator interface
  • Expected behavior: Recovery is inherently a privileged operation

Implementation Details

SDK Enforcement

The spike-sdk-go package enforces this in:

  • api/internal/impl/operator/recover.go:67-71
  • api/internal/impl/operator/restore.go:76-80

selfSPIFFEID := svid.ID.String()

// Security: Recovery and Restoration can ONLY be done via SPIKE Pilot.
if !spiffeid.IsPilot(selfSPIFFEID) {
    failErr := sdkErrors.ErrUnauthorized
    failErr.Msg = "recovery can only be performed from SPIKE Pilot"
    log.FatalErr(fName, *failErr)
}

Operations NOT Restricted

The following operations remain policy-controlled and can be performed by any workload with appropriate policy permissions:

  • Secret operations: Get, Put, Delete, Undelete, List, GetMetadata
  • Policy operations: Create, Get, Delete, List
  • Cipher operations: Encrypt, Decrypt
  • ACL operations: Get, List
  • Bootstrap operations: Contribute, Verify

Nexus-Side Validation

SPIKE Nexus performs additional validation of recovery requests, providing a second layer of defense even if the SDK check is bypassed.

Migration Impact

This ADR documents existing behavior and does not require migration. The restriction has been in place since the recovery operations were first implemented.

References

  • SPIFFE specification: https://spiffe.io/docs/latest/spiffe-about/overview/
  • Shamir Secret Sharing: https://en.wikipedia.org/wiki/Shamir%27s_Secret_Sharing
  • Defense in Depth: https://www.nist.gov/publications/defense-depth-strategy
  • ADR-0001: Use SPIFFE/SPIRE for Workload Identity
  • ADR-0028: Use Human-Readable Error Messages in CLI Tools