ADR-0029: Restrict Recovery and Restoration Operations to SPIKE Pilot

Status: accepted
Date: 2025-11-19
Tags: Security, Recovery, Access Control, SPIFFE

Context

SPIKE provides various operations that workloads can perform against SPIKE Nexus, including secret management (get, put, delete), policy management, and critical recovery operations (recover, restore). Most operations can be controlled through SPIKE’s policy system, allowing fine-grained access control based on SPIFFE IDs and other attributes.

However, recovery and restoration operations are fundamentally different from regular operations:

Recovery initiates the process of retrieving Shamir secret shards when SPIKE Nexus needs to be restored from a catastrophic failure
Restoration submits these shards back to rebuild the root encryption key

These operations bypass normal secret access policies and directly manipulate the root cryptographic material that protects the entire secrets store. If they were abused, an attacker could decrypt every secret in the system.

We need to determine the appropriate access control mechanism for these critical recovery operations.

Decision

Recovery (Recover) and restoration (Restore) operations will be restricted exclusively to SPIKE Pilot at the SDK level, enforced through SPIFFE ID validation.

Specifically:

The SDK will check the caller’s SPIFFE ID using spiffeid.IsPilot()
Only workloads identified as SPIKE Pilot may invoke these operations
Violations will result in immediate fatal termination via log.FatalErr()
This restriction is not configurable through policies

All other operations (secrets, policies, cipher, ACLs) remain policy-controlled and can be authorized for any workload based on configured policies.
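
The identity check at the heart of this decision can be sketched as follows. This is not the implementation of spiffeid.IsPilot(), which this ADR does not show; the isPilot function, the pilotPath constant, and the example.org trust domain below are assumptions made for illustration, using only the Go standard library.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// pilotPath is a hypothetical SPIFFE ID path for SPIKE Pilot. The real
// path is defined by the SDK and is not stated in this ADR.
const pilotPath = "/spike/pilot"

// isPilot reports whether id is a SPIFFE ID in trustDomain whose path is
// pilotPath or sits beneath it. Anything that fails to parse is rejected.
func isPilot(trustDomain, id string) bool {
	u, err := url.Parse(id)
	if err != nil || u.Scheme != "spiffe" || u.Host != trustDomain {
		return false
	}
	return u.Path == pilotPath || strings.HasPrefix(u.Path, pilotPath+"/")
}

func main() {
	ids := []string{
		"spiffe://example.org/spike/pilot/role/superuser",
		"spiffe://example.org/workload/billing",
		"spiffe://other.org/spike/pilot",
	}
	for _, id := range ids {
		fmt.Printf("%-50s pilot=%v\n", id, isPilot("example.org", id))
	}
}
```

The check is a pure function of the caller's identity. No policy lookup is involved, which is what makes the restriction non-configurable.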

Rationale

Security Criticality Hierarchy

SPIKE operations fall into different security tiers:

Operation Type	Security Impact	Access Control
Secret read/write	Medium - affects individual secrets	Policy-based
Policy management	High - affects access control	Policy-based
Cipher operations	Medium - encryption/decryption	Policy-based
Recovery/Restore	Critical - affects entire system	Hard-coded
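
One way to read the table is as a dispatch rule: the tier decides which input is consulted when authorizing a call. The sketch below restates that rule as data. The operation labels, the control type, and the authorize function are hypothetical and do not correspond to SDK identifiers.

```go
package main

import "fmt"

// control names the mechanism that decides whether an operation may run.
type control int

const (
	policyBased control = iota // decided by the policy engine
	hardCoded                  // decided by caller identity only
)

// tiers restates the table as data. The keys are illustrative labels.
var tiers = map[string]control{
	"secret":  policyBased,
	"policy":  policyBased,
	"cipher":  policyBased,
	"recover": hardCoded,
	"restore": hardCoded,
}

// authorize ignores policyAllows for hard-coded operations, so no policy
// can grant recovery to a non-Pilot caller. Unknown operations are denied.
func authorize(op string, callerIsPilot, policyAllows bool) bool {
	c, ok := tiers[op]
	if !ok {
		return false
	}
	if c == hardCoded {
		return callerIsPilot
	}
	return policyAllows
}

func main() {
	fmt.Println(authorize("secret", false, true))  // policy decides
	fmt.Println(authorize("recover", false, true)) // policy is ignored
	fmt.Println(authorize("restore", true, false)) // identity decides
}
```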

Why Recovery Operations Are Different

Policy-controlled operations (secrets, policies, etc.):

Operate within the normal secret access control framework
Failure affects specific secrets or policies
Can be safely delegated to various workloads
Policy misconfiguration has limited blast radius

Recovery operations (recover, restore):

Bypass all policy controls and access root cryptographic material
Failure or compromise could decrypt all secrets in the system
Should only be performed during disaster recovery scenarios
Must have the smallest possible attack surface
Policy-based control would create a circular dependency (policies are protected by the key being recovered)

Defense in Depth

While SPIKE Nexus itself validates recovery requests, enforcing the restriction at the SDK level provides defense in depth:

SDK enforcement: Stops non-Pilot workloads that use the SDK before any recovery request reaches Nexus
Nexus enforcement: Final validation even if SDK is bypassed
SPIFFE authentication: Cryptographically verifiable identity
Audit trail: Fatal errors logged when violations occur
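
The layering can be illustrated with two independent gates. In this sketch both functions are stand-ins: sdkRecover models the client-side check and nexusRecover models the server-side check on the peer identity. The pilotID value is hypothetical, and the SDK stand-in returns an error instead of terminating so that the example stays runnable.

```go
package main

import (
	"errors"
	"fmt"
)

// pilotID is a hypothetical SPIFFE ID for SPIKE Pilot.
const pilotID = "spiffe://example.org/spike/pilot"

var errNotPilot = errors.New("recovery can only be performed from SPIKE Pilot")

// nexusRecover stands in for the server side. It re-checks the peer
// identity on every request, regardless of what the client did.
func nexusRecover(peerID string) (string, error) {
	if peerID != pilotID {
		return "", errNotPilot
	}
	return "shards", nil
}

// sdkRecover stands in for the SDK path. It checks the caller's own
// identity first and only then contacts Nexus, counting each request.
func sdkRecover(selfID string, nexusCalls *int) (string, error) {
	if selfID != pilotID {
		return "", errNotPilot
	}
	*nexusCalls++
	return nexusRecover(selfID)
}

func main() {
	calls := 0
	workload := "spiffe://example.org/workload/billing"

	// Through the SDK: rejected early, Nexus never sees the request.
	_, err := sdkRecover(workload, &calls)
	fmt.Println(err != nil, calls)

	// Bypassing the SDK: Nexus still rejects the request.
	_, err = nexusRecover(workload)
	fmt.Println(err != nil)

	// SPIKE Pilot passes both gates.
	shards, err := sdkRecover(pilotID, &calls)
	fmt.Println(shards, err, calls)
}
```

Either gate alone would reject the workload. Having both means a bug or bypass in one layer does not expose the operation.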

SPIFFE Identity as Strong Authentication

SPIKE Pilot’s SPIFFE ID is:

Cryptographically verified through mTLS
Issued by the trusted SPIRE server
Cannot be forged without compromising the SPIRE trust domain, or stolen without compromising the workload it was issued to
Provides stronger authentication than password-based or API key approaches

Fail-Safe Design

The SDK implementation uses log.FatalErr() rather than returning an error:

if !spiffeid.IsPilot(selfSPIFFEID) {
    failErr := sdkErrors.ErrUnauthorized
    failErr.Msg = "recovery can only be performed from SPIKE Pilot"
    log.FatalErr(fName, *failErr)
}

This ensures:

A caller cannot ignore or mishandle a returned error and continue past the check
Clear audit trail in logs
Immediate termination prevents any further processing
Aligns with security-critical failure handling (similar to key length validation failures)
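
The difference between the two styles can be sketched as follows. All names here are hypothetical. The fatal parameter stands in for a call that ends the process, such as SPIKE's log.FatalErr() or the standard library's log.Fatal; the sketch substitutes a panic so that termination can be observed without exiting.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotPilot = errors.New("recovery can only be performed from SPIKE Pilot")

// checkPilot is the error-returning form of the guard.
func checkPilot(isPilot bool) error {
	if !isPilot {
		return errNotPilot
	}
	return nil
}

// leakyRecover shows the bug class that a returned error permits: the
// result is dropped and execution continues past the check.
func leakyRecover(isPilot bool) string {
	_ = checkPilot(isPilot) // bug: error discarded
	return "shards"
}

// strictRecover hands violations to fatal. When fatal never returns, the
// final return is unreachable for non-Pilot callers.
func strictRecover(isPilot bool, fatal func(error)) string {
	if err := checkPilot(isPilot); err != nil {
		fatal(err)
	}
	return "shards"
}

// attempt runs strictRecover with a panicking fatal and reports whether
// the call was cut short.
func attempt(isPilot bool) (result string, terminated bool) {
	defer func() {
		if recover() != nil {
			terminated = true
		}
	}()
	result = strictRecover(isPilot, func(err error) { panic(err) })
	return result, false
}

func main() {
	fmt.Println(leakyRecover(false)) // the check was silently bypassed
	result, terminated := attempt(false)
	fmt.Println(result == "", terminated)
}
```

The trade-off is that a fatal guard cannot be handled gracefully by the caller, which is acceptable here because a non-Pilot workload has no legitimate reason to reach this code path.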

Alternatives Considered

Alternative 1: Policy-Based Control

Allow recovery operations to be controlled through the policy system like other operations.

Rejected because:

Creates a circular dependency: policies are protected by the key being recovered
During disaster recovery, the policy system may not be available
Increases attack surface unnecessarily
Policy misconfiguration could enable unauthorized recovery

Alternative 2: No SDK Enforcement

Rely solely on SPIKE Nexus to validate recovery requests.

Rejected because:

Violates defense-in-depth principle
Allows unauthorized attempts to reach Nexus unnecessarily
Reduces audit trail granularity
Abandons the fail-early principle: unauthorized callers should be stopped as close to the source as possible

Alternative 3: Configuration-Based Control

Make the allowed SPIFFE IDs configurable via environment variables or config files.

Rejected because:

Configuration errors could accidentally enable unauthorized access
Increases operational complexity
Provides no real benefit (recovery should always be from Pilot)
Configuration-based security is generally weaker than hard-coded enforcement for critical operations

Consequences

Positive

Reduced attack surface: Only SPIKE Pilot can initiate recovery operations
Defense in depth: Multiple layers of validation (SDK + Nexus)
Fail-safe: Fatal errors prevent accidental bypasses
Clear security model: Critical operations have stricter controls than regular operations
Audit trail: Failed attempts are logged with context
No configuration complexity: No additional configuration required

Negative

Less flexible: Cannot delegate recovery to other workloads
Operational constraint: Requires SPIKE Pilot for disaster recovery scenarios
Hard-coded policy: Cannot be changed without code modification

Neutral

Consistent with design: SPIKE Pilot is already the administrative/operator interface
Expected behavior: Recovery is inherently a privileged operation

Implementation Details

SDK Enforcement

The spike-sdk-go package enforces this in:

api/internal/impl/operator/recover.go:67-71
api/internal/impl/operator/restore.go:76-80

selfSPIFFEID := svid.ID.String()

// Security: Recovery and Restoration can ONLY be done via SPIKE Pilot.
if !spiffeid.IsPilot(selfSPIFFEID) {
    failErr := sdkErrors.ErrUnauthorized
    failErr.Msg = "recovery can only be performed from SPIKE Pilot"
    log.FatalErr(fName, *failErr)
}

Operations NOT Restricted

The following operations remain policy-controlled and can be performed by any workload with appropriate policy permissions:

Secret operations: Get, Put, Delete, Undelete, List, GetMetadata
Policy operations: Create, Get, Delete, List
Cipher operations: Encrypt, Decrypt
ACL operations: Get, List
Bootstrap operations: Contribute, Verify

Nexus-Side Validation

SPIKE Nexus performs additional validation of recovery requests, providing a second layer of defense even if the SDK check is bypassed.

Migration Impact

This ADR documents existing behavior and does not require migration. The restriction has been in place since the recovery operations were first implemented.

References

SPIFFE specification: https://spiffe.io/docs/latest/spiffe-about/overview/
Shamir Secret Sharing: https://en.wikipedia.org/wiki/Shamir%27s_Secret_Sharing
Defense in Depth: https://www.nist.gov/publications/defense-depth-strategy

Related ADRs

ADR-0001: Use SPIFFE/SPIRE for Workload Identity
ADR-0028: Use Human-Readable Error Messages in CLI Tools
