ADR-0024: Transition from In-Memory Cache to Direct Backend Storage for High Availability

Context

SPIKE Nexus currently uses a dual-layer storage architecture where an in-memory cache (secretStore using kv.KV and policies using sync.Map) serves as the primary source of truth, with a backing store (SQLite, S3, etc.) providing persistence. This design pattern follows ADR-0016 (Memory-First Secrets Store) and ADR-0017 (Synchronous Persistence).
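
For orientation, a minimal sketch of the memory-first pattern described above, assuming a simplified stand-in for the internal kv.KV store; the names and signatures are illustrative rather than the actual SPIKE Nexus code:

package storage // illustrative package; not SPIKE Nexus' actual layout

import "sync"

// Secret is a stand-in for the entries held by the in-memory secretStore (kv.KV).
type Secret struct {
    Data    map[string]string
    Version int
}

// memoryStore mirrors the current pattern: a mutex-guarded in-memory map is
// the source of truth for reads, while the backing store is only written
// synchronously after the cache is updated (ADR-0016 / ADR-0017).
type memoryStore struct {
    mu      sync.RWMutex
    secrets map[string]Secret
}

// GetSecret answers entirely from memory; the backing store is not consulted
// on the read path under the current design.
func (m *memoryStore) GetSecret(path string) (Secret, bool) {
    m.mu.RLock()
    defer m.mu.RUnlock()
    s, ok := m.secrets[path]
    return s, ok
}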

While this architecture provides excellent performance for single-instance deployments, it creates challenges when attempting to achieve high availability (HA) through horizontal scaling, unnecessarily increasing the architectural complexity of the distributed system:

Current Architecture Issues:

  1. Split-brain scenarios: Multiple instances behind a load balancer can maintain independent in-memory states, leading to inconsistent data views
  2. Version drift: When instance A and instance B update the same secret path, they maintain different version numbers in their local memory stores
  3. Lost updates: Without distributed locking, concurrent updates from different instances can silently overwrite each other
  4. Inconsistent reads: Users can receive different responses depending on which instance handles their request
  5. Complex synchronization: Maintaining cache coherency across instances would require additional infrastructure (message buses, distributed locks, consensus protocols)

Example Scenario:

Time T1: Instance A updates the secret at /api/key to version 2
Time T2: Instance B (unaware of A's update) updates the same path to its own version 2
Result:  Both instances have version 2 with different content
         The backend has whichever write completed last
         Users see different data depending on load balancer routing

Decision

We will eliminate the in-memory cache layer entirely and make all secret and policy operations interact directly with the backing store.

This change involves:

  1. Remove global in-memory stores:

    • Comment out secretStore (kv.KV) and secretStoreMu
    • Comment out policies (sync.Map)
  2. Refactor all operations to use the backend directly (see the sketch after this list):

    • GetSecret() → Read from persist.Backend().LoadSecret()
    • UpsertSecret() → Write to persist.Backend().StoreSecret() with atomic operations
    • DeleteSecret() → Update backend with deletion timestamps
    • Similar changes for policy operations
  3. Move version management to the persistence layer:

    • Implement compare-and-swap (CAS) operations in the backend
    • Use optimistic locking to handle concurrent updates
    • Maintain version consistency at the database level
  4. Transform NoopStore to InMemoryStore:

    • For development/testing scenarios requiring true in-memory operation
    • Implement full Backend interface using kv.KV internally
    • Provide the same semantics as persistent backends
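
As a rough illustration of steps 2 and 3, the sketch below shows what a backend-direct UpsertSecret with optimistic locking might look like. The Backend interface, method names, and error values are assumptions made for this sketch; the actual persist.Backend() API may differ.

package storage // illustrative package; not SPIKE Nexus' actual layout

import (
    "context"
    "errors"
    "fmt"
)

// Secret and Backend are stand-ins for illustration; the real
// persist.Backend() interface and its types may look different.
type Secret struct {
    Data    map[string]string
    Version int
}

var ErrVersionConflict = errors.New("version conflict")

type Backend interface {
    LoadSecret(ctx context.Context, path string) (*Secret, error)
    // StoreSecretCAS persists the secret only if the stored version still
    // equals expectedVersion (compare-and-swap), e.g. a conditional UPDATE
    // in SQLite or a conditional write against S3.
    StoreSecretCAS(ctx context.Context, path string, s *Secret, expectedVersion int) error
}

// UpsertSecret talks to the backing store directly: read the current version,
// bump it, and write back guarded by CAS. On a conflict, re-read and retry
// instead of silently overwriting another instance's write.
func UpsertSecret(ctx context.Context, be Backend, path string, values map[string]string) error {
    const maxRetries = 3
    for i := 0; i < maxRetries; i++ {
        current, err := be.LoadSecret(ctx, path)
        if err != nil {
            return fmt.Errorf("load %s: %w", path, err)
        }

        expected := 0
        if current != nil {
            expected = current.Version
        }

        next := &Secret{Data: values, Version: expected + 1}
        switch err := be.StoreSecretCAS(ctx, path, next, expected); {
        case err == nil:
            return nil
        case errors.Is(err, ErrVersionConflict):
            continue // another instance won the race; retry with the fresh version
        default:
            return fmt.Errorf("store %s: %w", path, err)
        }
    }
    return fmt.Errorf("upsert %s: too many concurrent writers", path)
}

With a conflict check like this, the scenario from the Context section resolves cleanly: whichever instance writes second gets a version conflict and retries against the fresh version instead of silently overwriting it. The same Backend interface can also be satisfied by a mutex-guarded in-memory map, which is essentially what the NoopStore-to-InMemoryStore transformation in step 4 provides for development and testing.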

Alternatives Considered

  1. Distributed cache with cache invalidation: Add Redis/Hazelcast with pub/sub for cache invalidation. Rejected due to added complexity and potential for cache inconsistency windows.
  2. Leader-follower architecture: Elect a single leader instance to handle all writes. Rejected as it doesn’t provide true HA and adds the complexity of leader election.
  3. Distributed locking for writes: Keep the cache but use distributed locks (e.g., via etcd) for write operations. Rejected as it adds external dependencies and doesn’t solve read inconsistency.
  4. Event sourcing: Transform to an event-driven architecture with CQRS. Rejected as it requires a fundamental redesign of the entire system and brings unnecessary complexity.

Consequences

Positive

  • Immediate consistency: All instances read the same data from the single source of truth
  • True high availability: Any SPIKE Nexus instance can serve any request without coordination
  • Simplified architecture: Removes an entire class of distributed systems problems
  • Easier debugging: No cache invalidation issues or cache/backend divergence
  • Reduced memory footprint: Instances no longer hold all secrets in memory
  • Atomic operations: Version conflicts handled at database level with proper transactions

Negative

  • Increased latency: Every operation now involves backend I/O (can be mitigated by backend proximity and connection pooling)
  • Higher backend load: All reads hit the backing store (can be mitigated by backend scaling and potentially adding read replicas)
  • Backend becomes the critical path: System availability now entirely depends on backend availability (can be mitigated by using an HA backing store)
  • Potential thundering herd: Many instances might request the same secret simultaneously after backend recovery

Neutral

  • Different operational characteristics: Everything (as always) is a trade-off. With this design, teams need to monitor and scale the backend appropriately
  • Changed failure modes: Failures now manifest as backend errors rather than cache inconsistencies

Implementation Notes

The refactoring follows a systematic approach:

  1. Comment out cache-related globals
  2. Let the compiler identify all broken dependencies
  3. Update each function to use backend directly
  4. Ensure proper error handling for I/O operations
  5. Add context propagation for timeouts and cancellation (see the sketch below)
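
As a hedged example of steps 4 and 5, a read handler might bound each backend call with a request-scoped timeout. The handler below is illustrative only and reuses the hypothetical Backend and Secret types from the sketch in the Decision section; it is not SPIKE Nexus' actual API.

package storage // illustrative; reuses the Backend and Secret stand-ins from the earlier sketch

import (
    "context"
    "encoding/json"
    "errors"
    "net/http"
    "time"
)

// handleGetSecret propagates the request context, bounded by a timeout, down
// to the backend call so that cancelled or slow requests fail fast and release
// resources instead of piling up on a struggling backing store.
func handleGetSecret(w http.ResponseWriter, r *http.Request, be Backend, path string) {
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    secret, err := be.LoadSecret(ctx, path)
    switch {
    case errors.Is(err, context.DeadlineExceeded), errors.Is(err, context.Canceled):
        http.Error(w, "backend timeout", http.StatusGatewayTimeout)
        return
    case err != nil:
        http.Error(w, "storage error", http.StatusInternalServerError)
        return
    case secret == nil:
        http.NotFound(w, r)
        return
    }
    _ = json.NewEncoder(w).Encode(secret.Data)
}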

This change aligns with modern cloud-native patterns where stateless services rely on robust, scalable data stores rather than maintaining local state.

References