ADR-0021: SPIKE Keeper as a Stateless Shard Holder
- Status: accepted
- Date: 2025-03-01
- Tags: Stateless, Availability, Resilience, Fault-Tolerance
Context
SPIKE Nexus is the core secret store that generates and manages the root encryption key. To ensure availability and resilience, the root key is sharded using Shamir’s Secret Sharing Scheme and distributed to multiple SPIKE Keeper instances. These SPIKE Keepers are responsible for holding their respective key shards in memory to support the recovery process in case SPIKE Nexus needs to reconstruct its root key.
A key design principle in SPIKE is simplicity and minimalism: The SPIKE Keeper component should remain as lightweight as possible, avoiding unnecessary complexity such as awareness of Nexus, complex configuration, or inter-Keeper communication. Instead, SPIKE Nexus should maintain full control over key management while leveraging SPIKE Keepers as dumb storage units for key shards.
Decision
SPIKE Keepers remain stateless and unaware of SPIKE Nexus:
- SPIKE Keepers do not need to know which SPIKE Nexus instance they are serving.
- They store their assigned key shard in-memory and do not persist any state.
Zero configuration for SPIKE Keepers:
- SPIKE Keepers have no static configuration files or runtime parameters related to SPIKE Nexus.
- Deployment should be as simple as running a SPIKE Keeper instance without additional setup.
SPIKE Nexus is responsible for lifecycle management:
- SPIKE Nexus generates the root key, sharding it and distributing the pieces to SPIKE Keepers.
- SPIKE Nexus polls Keepers to check their health and ensure that a quorum is available.
- If a SPIKE Keeper goes down and restarts, SPIKE Nexus is responsible for rehydrating it with the correct key shard.
Polling-based health monitoring and rehydration:
- SPIKE Keepers do not initiate communication with SPIKE Nexus.
- Instead, SPIKE Nexus periodically queries SPIKE Keepers for their status.
- If a SPIKE Keeper is found to be empty (e.g., after a restart), SPIKE Nexus reassigns the missing shard.
Rationale
- Security: SPIKE Keepers hold only a single shard; which is not adequate to regenerate the root key. They are never aware of other SPIKE Keepers or the full key. This limits their attack surface.
- Simplicity: By eliminating configuration and inter-service dependencies, SPIKE Keepers become easy to deploy, replace, and scale.
- Availability: The polling and rehydration mechanism ensures that SPIKE Nexus* can automatically recover lost shards without manual intervention.
- Fault Tolerance: Stateless SPIKE Keepers can be replaced without requiring reconfiguration or coordination with other components.
Consequences
Positive
- Simplifies Keeper deployment and operation.
- Improves security by ensuring Keepers never hold full knowledge of the system.
- Enhances reliability by making Keepers easily replaceable without system-wide impact.
- Reduces operational burden since SPIKE Nexus automatically manages the lifecycle of SPIKE Keepers and their shards.
Negative
- SPIKE Nexus must handle additional logic for polling, health monitoring, and rehydration.
- SPIKE Keepers depend on SPIKE Nexus for their purpose, making them entirely reliant on SPIKE Nexus’ availability.
Alternatives Considered
SPIKE Keepers as Stateful Services:
- Rejected because it adds complexity and requires persistent storage.
- Would introduce additional configuration and synchronization challenges.
SPIKE Keepers Managing Their Own Shards:
- Rejected as it violates the principle of keeping SPIKE Keepers unaware of the full system state.
- Would require SPIKE Keepers to store metadata about SPIKE Nexus, increasing complexity and risk.
Push-Based Shard Distribution Instead of Polling:
- Rejected because it would require SPIKE Keepers to maintain knowledge of SPIKE Nexus.
- Polling ensures that SPIKE Keepers can remain stateless and unaware of the system topology.
Decision Outcome
This decision is final unless significant operational issues arise. Future revisions may consider optimizations such as event-driven polling or alternative SPIKE Keeper designs if the current model proves inefficient at scale.