ADR-0022: Continuous Polling of SPIKE Keepers Despite 404 Responses
- Status: accepted
- Date: 2025-05-03
- Tags: Resilience, Fault-Tolerance, Recovery, Availability
Context
SPIKE Nexus distributes root encryption key shards to multiple SPIKE Keeper instances using Shamir’s Secret Sharing Scheme.
When a SPIKE Keeper doesn’t have a shard (e.g., after restart or during
initial deployment), it returns a 404
HTTP response to shard retrieval
requests from SPIKE Nexus.
A question has arisen about the appropriate behavior when SPIKE Nexus receives 404 responses from SPIKE Keeper instances:
- Should SPIKE Nexus stop polling a SPIKE Keeper after receiving a 404 response?
- Should SPIKE Nexus continue polling SPIKE Keeper instances regardless of previous 404 responses?
This decision is particularly relevant in scenarios where:
- SPIKE Keeper instances may restart
- New SPIKE Keeper instances may be deployed
- SPIKE Keeper instances may be rehydrated through alternative methods
- The system architecture needs to support future recovery mechanisms
Decision
SPIKE Nexus will continue polling SPIKE Keepers even after receiving 404 responses.
Specifically:
- SPIKE Nexus will maintain a regular polling schedule for all known SPIKE Keeper instances, regardless of their response history.
- A 404 response from a SPIKE Keeper will be logged but will not affect the polling schedule.
- SPIKE Nexus will NOT remove a SPIKE Keeper from its polling list based solely on 404 responses.
- SPIKE Nexus will automatically attempt to rehydrate empty SPIKE Keepers when possible.
Rationale
The primary reasons for this decision are:
-
Future Extensibility: It allows for future mechanisms to rehydrate SPIKE Keepers through alternative methods:
- Other SPIKE Nexus instances may seed the SPIKE Keeper
- Cloning from backup SPIKE Keepers may become available
- Secure SPIKE Keeper APIs may be implemented that allow shard restoration
-
Architectural Simplicity: Continuing to poll all SPIKE Keepers regardless of their state creates a simpler, more consistent architecture:
- No complex logic to manage the polling schedule
- No state to track which SPIKE Keepers should be excluded
- Reduced risk of accidentally abandoning a recoverable SPIKE Keeper
-
Operational Resilience: Continuous polling allows the system to automatically recover when conditions change:
- SPIKE Keepers that restart will be discovered during the next polling cycle
- If a previously unavailable SPIKE Keeper comes back online with a shard, it will be immediately useful
- No manual intervention is required to re-enable polling
-
Fewer Assumptions: This approach makes fewer assumptions about the future state of the system:
- Does not assume a 404 response means permanent unavailability
- Does not assume the current distribution methods are the only ones possible
- Allows for unanticipated recovery scenarios
Consequences
Positive
- System can automatically recover from SPIKE Keeper restarts without manual intervention
- Architecture remains simpler with fewer conditional paths and state tracking
- Future extensibility is preserved for new recovery mechanisms
- Consistent behavior across all SPIKE Keeper instances
- Reduced operational burden for managing the system
Negative
- Slightly increased network traffic due to polling SPIKE Keepers that may remain empty
- Potential resource usage for maintaining connections to SPIKE Keepers that consistently return 404
- Additional log entries for expected 404 responses
- May mask actual problems if a SPIKE Keeper is consistently unavailable for other reasons
Alternatives Considered
Stop Polling After Consistent 404 Responses
- Rejected because it would require additional logic to track SPIKE Keeper states
- Would introduce a permanent failure mode requiring manual intervention
- Would not automatically benefit from future recovery mechanisms
- Would add complexity to the codebase
- Would create an inconsistent behavior pattern depending on response history
Event-Based Notification System
- Rejected in favor of simple polling, though may be reconsidered in the future
- Would require SPIKE Keepers to have knowledge of SPIKE Nexus, violating the design principle
- More complex to implement and maintain
- Introduces potential reliability issues with missed notifications
- Would conflict with
ADR-0021
’s principle of SPIKE Keeper as a stateless shard holder
Decision Outcome
This decision is implemented as the standard behavior for SPIKE Nexus when interacting with SPIKE Keeper instances. The continuous polling approach:
- Aligns with the principle of simplicity in the SPIKE architecture
- Maintains the stateless nature of SPIKE Keepers as defined in
ADR-0021
- Provides immediate recovery when SPIKE Keepers become available
- Supports future extensibility for alternative recovery mechanisms
The system should be monitored for any performance impacts from continuous polling, but the architectural benefits outweigh the minimal resource costs associated with this approach.