ADR-0024: Transition from In-Memory Cache to Direct Backend Storage for High Availability
- Status: accepted
- Supersedes: ADR-0016: Memory-First Secrets Store
- Supersedes: ADR-0017: Synchronous Persistence
- Date: 2025-08-23
- Tags: Security, Operations, HA, Simplicity
Context
SPIKE Nexus currently uses a dual-layer storage architecture where an in-memory cache (`secretStore` using `kv.KV` and `policies` using `sync.Map`) serves as the primary source of truth, with a backing store (SQLite, S3, etc.) providing persistence. This design pattern follows ADR-0016 (Memory-First Secrets Store) and ADR-0017 (Synchronous Persistence).
While this architecture provides excellent performance for single-instance deployments, it creates challenges that unnecessarily increase architectural complexity when attempting to achieve high availability (HA) through horizontal scaling:
Current Architecture Issues:
- Split-brain scenarios: Multiple instances behind a load balancer can maintain independent in-memory states, leading to inconsistent data views
- Version drift: When instance A and instance B update the same secret path, they maintain different version numbers in their local memory stores
- Lost updates: Without distributed locking, concurrent updates from different instances can silently overwrite each other
- Inconsistent reads: Users can receive different responses depending on which instance handles their request
- Complex synchronization: Maintaining cache coherency across instances would require additional infrastructure (message buses, distributed locks, consensus protocols)
Example Scenario:
- Time T1: Instance A updates the secret at /api/key to version 2
- Time T2: Instance B (unaware of A's update) updates the same path to its version 2
- Result: Both instances have version 2 with different content; the backend holds whichever write completed last; users see different data depending on load balancer routing
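The race is easy to reproduce in miniature. The following Go sketch is purely illustrative and does not use any SPIKE Nexus code: two instances start from the same cached view, each bumps its local copy, and both end up claiming version 2 with different payloads.

```go
package main

import (
	"fmt"
	"sync"
)

// instance models one node's in-memory view of a single secret path.
type instance struct {
	name    string
	version int
	data    string
}

func main() {
	// Both instances start from the same cached state: version 1.
	nodes := []*instance{
		{name: "A", version: 1, data: "old"},
		{name: "B", version: 1, data: "old"},
	}

	var wg sync.WaitGroup
	for _, n := range nodes {
		wg.Add(1)
		go func(i *instance) {
			defer wg.Done()
			// Read-modify-write against the local cache only: each
			// node independently "creates" version 2.
			i.version++
			i.data = "written by " + i.name
		}(n)
	}
	wg.Wait()

	// Both report version 2 with different content; whichever write
	// reaches the backing store last silently wins.
	for _, n := range nodes {
		fmt.Printf("instance %s: version=%d data=%q\n", n.name, n.version, n.data)
	}
}
```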
Decision
We will eliminate the in-memory cache layer entirely and make all secret and policy operations interact directly with the backing store.
This change involves:
- Remove global in-memory stores:
  - Comment out `secretStore` (`kv.KV`) and `secretStoreMu`
  - Comment out `policies` (`sync.Map`)
- Refactor all operations to use the backend directly (see the sketch after this list):
  - `GetSecret()` → Read from `persist.Backend().LoadSecret()`
  - `UpsertSecret()` → Write to `persist.Backend().StoreSecret()` with atomic operations
  - `DeleteSecret()` → Update the backend with deletion timestamps
  - Similar changes for policy operations
- Move version management to the persistence layer:
  - Implement compare-and-swap (CAS) operations in the backend
  - Use optimistic locking to handle concurrent updates
  - Maintain version consistency at the database level
- Transform `NoopStore` into `InMemoryStore`:
  - For development/testing scenarios requiring true in-memory operation
  - Implement the full `Backend` interface using `kv.KV` internally
  - Provide the same semantics as persistent backends
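A minimal sketch of what the refactored write path could look like. The `LoadSecret`/`StoreSecret` names come from this ADR, but the `Backend` interface shape, the `Secret` type, and the expected-version (CAS) parameter below are illustrative assumptions rather than the project's actual API:

```go
// Sketch only: types and signatures are assumed for illustration.
package store

import (
	"context"
	"errors"
	"fmt"
)

// Secret is a placeholder for whatever the backing store persists.
type Secret struct {
	Version int
	Data    map[string]string
}

// ErrVersionConflict is assumed to be returned by StoreSecret when the
// expected version no longer matches the stored one (CAS semantics).
var ErrVersionConflict = errors.New("secret version conflict")

// Backend is an assumed shape for the persistence interface.
type Backend interface {
	LoadSecret(ctx context.Context, path string) (*Secret, error)
	StoreSecret(ctx context.Context, path string, s *Secret, expectedVersion int) error
}

// UpsertSecret writes directly to the backing store instead of a local
// cache, using optimistic locking: read the current version, then store
// conditionally on that version.
func UpsertSecret(ctx context.Context, b Backend, path string, values map[string]string) error {
	current, err := b.LoadSecret(ctx, path)
	if err != nil {
		return fmt.Errorf("load %q: %w", path, err)
	}

	expected := 0
	if current != nil {
		expected = current.Version
	}

	next := &Secret{Version: expected + 1, Data: values}
	if err := b.StoreSecret(ctx, path, next, expected); err != nil {
		// On ErrVersionConflict the caller retries with a fresh read;
		// contention is resolved at the database instead of in memory.
		return fmt.Errorf("store %q: %w", path, err)
	}
	return nil
}
```

Retrying on a version conflict with a fresh read replaces the old `secretStoreMu` mutex: concurrent writers from different instances are serialized by the backend rather than inside a single process.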
Alternatives Considered
- Distributed cache with cache invalidation: Add Redis/Hazelcast with pub/sub for cache invalidation. Rejected due to added complexity and potential for cache inconsistency windows.
- Leader-follower architecture: Elect a single leader instance to handle all writes. Rejected as it doesn’t provide true HA and adds complexity of leader election.
- Distributed locking for writes: Keep the cache but use distributed locks (e.g., via etcd) for write operations. Rejected as it adds external dependencies and doesn't solve read inconsistency.
- Event sourcing: Transform to an event-driven architecture with CQRS. Rejected as it requires a fundamental redesign of the entire system and brings unnecessary complexity.
Consequences
Positive
- Immediate consistency: All instances read the same data from the single source of truth
- True high availability: Any SPIKE Nexus instance can serve any request without coordination
- Simplified architecture: Removes entire class of distributed systems problems
- Easier debugging: No cache invalidation issues or cache/backend divergence
- Reduced memory footprint: Instances no longer hold all secrets in memory
- Atomic operations: Version conflicts handled at database level with proper transactions
Negative
- Increased latency: Every operation now involves backend I/O (can be mitigated by backend proximity and connection pooling)
- Higher backend load: All reads hit the backing store (can be mitigated by backend scaling and potentially adding read replicas)
- Backend becomes the critical path: System availability now entirely depends on backend availability (can be mitigated by using an HA backing store)
- Potential thundering herd: Many instances might request the same secret simultaneously after backend recovery
Neutral
- Different operational characteristics: Everything (as always) is a trade-off. With this new design, teams now need to monitor and scale the backend appropriately
- Changed failure modes: Failures now manifest as backend errors rather than cache inconsistencies
Implementation Notes
The refactoring follows a systematic approach:
- Comment out cache-related globals
- Let the compiler identify all broken dependencies
- Update each function to use backend directly
- Ensure proper error handling for I/O operations
- Add context propagation for timeouts and cancellation (see the sketch below)
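The read path follows the same shape. The sketch below reuses the assumed `Backend` and `Secret` types from the earlier example and shows the kind of context propagation and error handling each refactored function needs; the two-second timeout is an illustrative assumption, not a recommended value.

```go
// Sketch only: continues the assumed types from the earlier example.
package store

import (
	"context"
	"fmt"
	"time"
)

// GetSecret reads straight from the backing store on every call.
func GetSecret(ctx context.Context, b Backend, path string) (*Secret, error) {
	// Derive a bounded context so a slow backend cannot hold the
	// caller's request open indefinitely.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	secret, err := b.LoadSecret(ctx, path)
	if err != nil {
		// Failures now surface as backend I/O errors rather than as
		// silently stale cache reads.
		return nil, fmt.Errorf("load %q from backing store: %w", path, err)
	}
	if secret == nil {
		return nil, fmt.Errorf("secret %q not found", path)
	}
	return secret, nil
}
```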
This change aligns with modern cloud-native patterns where stateless services rely on robust, scalable data stores rather than maintaining local state.