ADR-0022: Continuous Polling of SPIKE Keepers Despite 404 Responses

Status: accepted
Date: 2025-05-03
Tags: Resilience, Fault-Tolerance, Recovery, Availability

Context

SPIKE Nexus distributes root encryption key shards to multiple SPIKE Keeper instances using Shamir’s Secret Sharing Scheme.

When a SPIKE Keeper doesn’t have a shard (e.g., after restart or during initial deployment), it returns a 404 HTTP response to shard retrieval requests from SPIKE Nexus.
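
As a rough illustration of that contract, the sketch below models a Keeper that answers 404 until it holds a shard and 200 afterwards. The `keeper` type, the `setShard` and `status` helpers, and the `/shard` path are hypothetical names chosen for this example; they are not taken from the SPIKE codebase, and the real endpoint, HTTP method, and authentication are not shown.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// keeper is a hypothetical in-memory stand-in for a SPIKE Keeper.
type keeper struct {
	mu    sync.RWMutex
	shard []byte // nil until a shard has been delivered
}

// setShard stores a shard, as would happen when Nexus seeds the Keeper.
func (k *keeper) setShard(s []byte) {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.shard = s
}

// ServeHTTP answers shard retrieval requests:
// 404 while the Keeper is empty, 200 with the shard otherwise.
func (k *keeper) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	k.mu.RLock()
	shard := k.shard
	k.mu.RUnlock()

	if shard == nil {
		http.Error(w, "no shard", http.StatusNotFound)
		return
	}
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write(shard)
}

// status performs one in-process retrieval and returns the HTTP status code.
func status(k *keeper) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/shard", nil) // path is an assumption
	k.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	k := &keeper{}
	fmt.Println(status(k)) // Keeper just started, no shard yet

	k.setShard([]byte("example-shard"))
	fmt.Println(status(k)) // Keeper now holds a shard
}
```

The point of the sketch is that a 404 here describes the Keeper's current contents, not its health: the same instance can start answering 200 at any later time.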

A question has arisen about the appropriate behavior when SPIKE Nexus receives 404 responses from SPIKE Keeper instances:

- Should SPIKE Nexus stop polling a SPIKE Keeper after receiving a 404 response?
- Should SPIKE Nexus continue polling SPIKE Keeper instances regardless of previous 404 responses?

This decision is particularly relevant in scenarios where:

- SPIKE Keeper instances may restart
- New SPIKE Keeper instances may be deployed
- SPIKE Keeper instances may be rehydrated through alternative methods
- The system architecture needs to support future recovery mechanisms

Decision

SPIKE Nexus will continue polling SPIKE Keepers even after receiving 404 responses.

Specifically:

- SPIKE Nexus will maintain a regular polling schedule for all known SPIKE Keeper instances, regardless of their response history.
- A 404 response from a SPIKE Keeper will be logged but will not affect the polling schedule.
- SPIKE Nexus will NOT remove a SPIKE Keeper from its polling list based solely on 404 responses.
- SPIKE Nexus will automatically attempt to rehydrate empty SPIKE Keepers when possible.
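
The four rules above can be sketched as a single polling pass. This is a minimal sketch under stated assumptions, not the SPIKE Nexus implementation: `shardClient`, `pollOnce`, `fakeClient`, and the `known` map are hypothetical names, and treating "when possible" as "Nexus currently holds that Keeper's shard" is an assumption made for the example.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// shardClient is a hypothetical abstraction over Nexus-to-Keeper calls.
type shardClient interface {
	Fetch(keeperID string) (status int, shard []byte, err error)
	Send(keeperID string, shard []byte) error
}

// pollOnce visits every Keeper in the list, whatever it answered before.
// known holds the shards Nexus can currently hand back out (may be empty).
// It returns the shards collected in this pass and never modifies keepers.
func pollOnce(c shardClient, keepers []string, known map[string][]byte) map[string][]byte {
	collected := make(map[string][]byte)
	for _, id := range keepers {
		status, shard, err := c.Fetch(id)
		switch {
		case err != nil:
			log.Printf("keeper %s: request failed: %v", id, err)
		case status == http.StatusOK:
			collected[id] = shard
		case status == http.StatusNotFound:
			// Logged, but the Keeper stays on the list for the next pass.
			log.Printf("keeper %s: no shard (404)", id)
			if s, ok := known[id]; ok {
				if err := c.Send(id, s); err != nil {
					log.Printf("keeper %s: rehydration failed: %v", id, err)
				}
			}
		default:
			log.Printf("keeper %s: unexpected status %d", id, status)
		}
	}
	return collected
}

// fakeClient is an in-memory test double, not a real Keeper client.
type fakeClient struct{ shards map[string][]byte }

func (f *fakeClient) Fetch(id string) (int, []byte, error) {
	if s, ok := f.shards[id]; ok {
		return http.StatusOK, s, nil
	}
	return http.StatusNotFound, nil, nil
}

func (f *fakeClient) Send(id string, shard []byte) error {
	f.shards[id] = shard
	return nil
}

func main() {
	keepers := []string{"keeper-1", "keeper-2", "keeper-3"}
	c := &fakeClient{shards: map[string][]byte{"keeper-1": []byte("s1")}}
	known := map[string][]byte{"keeper-2": []byte("s2")}

	first := pollOnce(c, keepers, known)  // keeper-2 is rehydrated, keeper-3 stays empty
	second := pollOnce(c, keepers, known) // keeper-2 now answers 200
	fmt.Println(len(first), len(second), len(keepers)) // prints "1 2 3"
}
```

In a long-running process, a pass like this would typically be driven by a timer on a fixed interval. The property that matters for this decision is that `keepers` is identical before and after every pass, so a Keeper that answered 404 in one pass is asked again in the next.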

Rationale

The primary reasons for this decision are:

- Future Extensibility: It allows for future mechanisms to rehydrate SPIKE Keepers through alternative methods:
  - Other SPIKE Nexus instances may seed the SPIKE Keeper
  - Cloning from backup SPIKE Keepers may become available
  - Secure SPIKE Keeper APIs may be implemented that allow shard restoration
- Architectural Simplicity: Continuing to poll all SPIKE Keepers regardless of their state creates a simpler, more consistent architecture:
  - No complex logic to manage the polling schedule
  - No state to track which SPIKE Keepers should be excluded
  - Reduced risk of accidentally abandoning a recoverable SPIKE Keeper
- Operational Resilience: Continuous polling allows the system to automatically recover when conditions change:
  - SPIKE Keepers that restart will be discovered during the next polling cycle
  - If a previously unavailable SPIKE Keeper comes back online with a shard, that shard becomes usable as soon as the next poll reaches it
  - No manual intervention is required to re-enable polling
- Fewer Assumptions: This approach makes fewer assumptions about the future state of the system:
  - Does not assume a 404 response means permanent unavailability
  - Does not assume the current distribution methods are the only ones possible
  - Allows for unanticipated recovery scenarios

Consequences

Positive

- System can automatically recover from SPIKE Keeper restarts without manual intervention
- Architecture remains simpler with fewer conditional paths and state tracking
- Future extensibility is preserved for new recovery mechanisms
- Consistent behavior across all SPIKE Keeper instances
- Reduced operational burden for managing the system

Negative

- Slightly increased network traffic due to polling SPIKE Keepers that may remain empty
- Potential resource usage for maintaining connections to SPIKE Keepers that consistently return 404
- Additional log entries for expected 404 responses
- May mask actual problems when a SPIKE Keeper stays unavailable for reasons other than a missing shard
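
One way to limit the masking risk, offered only as an illustration and not as part of this decision, is to keep an expected 404 distinct from every other kind of failure when logging or counting poll results. The `outcome` type and the `classify` function below are hypothetical names; nothing here changes which Keepers are polled.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// outcome is a hypothetical label for the result of one poll.
type outcome string

const (
	outcomeOK      outcome = "ok"      // Keeper returned a shard
	outcomeEmpty   outcome = "empty"   // Keeper is reachable but holds no shard (404)
	outcomeFailure outcome = "failure" // transport error or any other status
)

// classify maps a poll result to an outcome so that an empty Keeper
// is reported separately from a Keeper that is failing.
func classify(status int, err error) outcome {
	switch {
	case err != nil:
		return outcomeFailure
	case status == http.StatusOK:
		return outcomeOK
	case status == http.StatusNotFound:
		return outcomeEmpty
	default:
		return outcomeFailure
	}
}

func main() {
	counts := map[outcome]int{}
	counts[classify(http.StatusOK, nil)]++
	counts[classify(http.StatusNotFound, nil)]++
	counts[classify(0, errors.New("connection refused"))]++
	counts[classify(http.StatusInternalServerError, nil)]++

	fmt.Println(counts[outcomeOK], counts[outcomeEmpty], counts[outcomeFailure]) // prints "1 1 2"
}
```

With a split like this, a rising "failure" count stands out even when "empty" results are frequent and expected.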

Alternatives Considered

Stop Polling After Consistent 404 Responses

- Rejected because it would require additional logic to track SPIKE Keeper states
- Would introduce a permanent failure mode requiring manual intervention
- Would not automatically benefit from future recovery mechanisms
- Would add complexity to the codebase
- Would create an inconsistent behavior pattern depending on response history

Event-Based Notification System

- Rejected in favor of simple polling, though may be reconsidered in the future
- Would require SPIKE Keepers to have knowledge of SPIKE Nexus, violating the design principle
- More complex to implement and maintain
- Introduces potential reliability issues with missed notifications
- Would conflict with ADR-0021’s principle of SPIKE Keeper as a stateless shard holder

Decision Outcome

This decision is implemented as the standard behavior for SPIKE Nexus when interacting with SPIKE Keeper instances. The continuous polling approach:

- Aligns with the principle of simplicity in the SPIKE architecture
- Maintains the stateless nature of SPIKE Keepers as defined in ADR-0021
- Picks up SPIKE Keepers within one polling cycle of their becoming available
- Supports future extensibility for alternative recovery mechanisms

The system should be monitored for any performance impacts from continuous polling, but the architectural benefits outweigh the minimal resource costs associated with this approach.