ADR-0024: Transition from In-Memory Cache to Direct Backend Storage for High Availability

Context

SPIKE Nexus currently uses a dual-layer storage architecture where an in-memory cache (secretStore using kv.KV and policies using sync.Map) serves as the primary source of truth, with a backing store (SQLite, S3, etc.) providing persistence. This design pattern follows ADR-0016 (Memory-First Secrets Store) and ADR-0017 (Synchronous Persistence).
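
For orientation, a minimal sketch of the memory-first pattern described above, assuming a simplified stand-in for the internal kv.KV store; the names and signatures are illustrative rather than the actual SPIKE Nexus code:

package storage // illustrative package; not SPIKE Nexus' actual layout

import "sync"

// Secret is a stand-in for the entries held by the in-memory secretStore (kv.KV).
type Secret struct {
    Data    map[string]string
    Version int
}

// memoryStore mirrors the current pattern: a mutex-guarded in-memory map is
// the source of truth for reads, while the backing store is only written
// synchronously after the cache is updated (ADR-0016 / ADR-0017).
type memoryStore struct {
    mu      sync.RWMutex
    secrets map[string]Secret
}

// GetSecret answers entirely from memory; the backing store is not consulted
// on the read path under the current design.
func (m *memoryStore) GetSecret(path string) (Secret, bool) {
    m.mu.RLock()
    defer m.mu.RUnlock()
    s, ok := m.secrets[path]
    return s, ok
}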

While this architecture provides excellent performance for single-instance deployments, it creates challenges when attempting to achieve high availability (HA) through horizontal scaling, unnecessarily increasing the architectural complexity of the distributed system:

Current Architecture Issues:

  1. Split-brain scenarios: Multiple instances behind a load balancer can maintain independent in-memory states, leading to inconsistent data views
  2. Version drift: When instance A and instance B update the same secret path, they maintain different version numbers in their local memory stores
  3. Lost updates: Without distributed locking, concurrent updates from different instances can silently overwrite each other
  4. Inconsistent reads: Users can receive different responses depending on which instance handles their request
  5. Complex synchronization: Maintaining cache coherency across instances would require additional infrastructure (message buses, distributed locks, consensus protocols)

Example Scenario:

Time T1: Instance A updates the secret at /api/key to version 2
Time T2: Instance B (unaware of A's update) updates the same path to its own version 2
Result:  Both instances have version 2 with different content
         The backend has whichever write completed last
         Users see different data depending on load balancer routing

Decision

We will eliminate the in-memory cache layer entirely and make all secret and policy operations interact directly with the backing store.

This change involves:

  1. Remove global in-memory stores:

    • Comment out secretStore (kv.KV) and secretStoreMu
    • Comment out policies (sync.Map)
  2. Refactor all operations to use the backend directly (see the sketch after this list):

    • GetSecret() → Read from persist.Backend().LoadSecret()
    • UpsertSecret() → Write to persist.Backend().StoreSecret() with atomic operations
    • DeleteSecret() → Update backend with deletion timestamps
    • Similar changes for policy operations
  3. Move version management to the persistence layer:

    • Implement compare-and-swap (CAS) operations in the backend
    • Use optimistic locking to handle concurrent updates
    • Maintain version consistency at the database level
  4. Transform NoopStore to InMemoryStore:

    • For development/testing scenarios requiring true in-memory operation
    • Implement full Backend interface using kv.KV internally
    • Provide the same semantics as persistent backends
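
As a rough illustration of steps 2 and 3, the sketch below shows what a backend-direct UpsertSecret with optimistic locking might look like. The Backend interface, method names, and error values are assumptions made for this sketch; the actual persist.Backend() API may differ.

package storage // illustrative package; not SPIKE Nexus' actual layout

import (
    "context"
    "errors"
    "fmt"
)

// Secret and Backend are stand-ins for illustration; the real
// persist.Backend() interface and its types may look different.
type Secret struct {
    Data    map[string]string
    Version int
}

var ErrVersionConflict = errors.New("version conflict")

type Backend interface {
    LoadSecret(ctx context.Context, path string) (*Secret, error)
    // StoreSecretCAS persists the secret only if the stored version still
    // equals expectedVersion (compare-and-swap), e.g. a conditional UPDATE
    // in SQLite or a conditional write against S3.
    StoreSecretCAS(ctx context.Context, path string, s *Secret, expectedVersion int) error
}

// UpsertSecret talks to the backing store directly: read the current version,
// bump it, and write back guarded by CAS. On a conflict, re-read and retry
// instead of silently overwriting another instance's write.
func UpsertSecret(ctx context.Context, be Backend, path string, values map[string]string) error {
    const maxRetries = 3
    for i := 0; i < maxRetries; i++ {
        current, err := be.LoadSecret(ctx, path)
        if err != nil {
            return fmt.Errorf("load %s: %w", path, err)
        }

        expected := 0
        if current != nil {
            expected = current.Version
        }

        next := &Secret{Data: values, Version: expected + 1}
        switch err := be.StoreSecretCAS(ctx, path, next, expected); {
        case err == nil:
            return nil
        case errors.Is(err, ErrVersionConflict):
            continue // another instance won the race; retry with the fresh version
        default:
            return fmt.Errorf("store %s: %w", path, err)
        }
    }
    return fmt.Errorf("upsert %s: too many concurrent writers", path)
}

With a conflict check like this, the scenario from the Context section resolves cleanly: whichever instance writes second gets a version conflict and retries against the fresh version instead of silently overwriting it. The same Backend interface can also be satisfied by a mutex-guarded in-memory map, which is essentially what the NoopStore-to-InMemoryStore transformation in step 4 provides for development and testing.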

Alternatives Considered

  1. Distributed cache with cache invalidation: Add Redis/Hazelcast with pub/sub for cache invalidation. Rejected due to added complexity and potential for cache inconsistency windows.
  2. Leader-follower architecture: Elect a single leader instance to handle all writes. Rejected as it doesn’t provide true HA and adds the complexity of leader election.
  3. Distributed locking for writes: Keep the cache but use distributed locks (e.g., via etcd) for write operations. Rejected as it adds external dependencies and doesn’t solve read inconsistency.
  4. Event sourcing: Transform to an event-driven architecture with CQRS. Rejected as it requires a fundamental redesign of the entire system and brings unnecessary complexity.

Consequences

Positive

  • Immediate consistency: All instances read the same data from the single source of truth
  • True high availability: Any SPIKE Nexus instance can serve any request without coordination
  • Simplified architecture: Removes an entire class of distributed systems problems
  • Easier debugging: No cache invalidation issues or cache/backend divergence
  • Reduced memory footprint: Instances no longer hold all secrets in memory
  • Atomic operations: Version conflicts handled at database level with proper transactions

Negative

  • Increased latency: Every operation now involves backend I/O (can be mitigated by backend proximity and connection pooling)
  • Higher backend load: All reads hit the backing store (can be mitigated by backend scaling and potentially adding read replicas)
  • Backend becomes the critical path: System availability now entirely depends on backend availability (can be mitigated by using an HA backing store)
  • Potential thundering herd: Many instances might request the same secret simultaneously after backend recovery

Neutral

  • Different operational characteristics: Everything (as always) is a trade-off. With this design, teams need to monitor and scale the backend appropriately
  • Changed failure modes: Failures now manifest as backend errors rather than cache inconsistencies

Implementation Notes

The refactoring follows a systematic approach:

  1. Comment out cache-related globals
  2. Let the compiler identify all broken dependencies
  3. Update each function to use backend directly
  4. Ensure proper error handling for I/O operations
  5. Add context propagation for timeouts and cancellation (see the sketch below)
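
As a hedged example of steps 4 and 5, a read handler might bound each backend call with a request-scoped timeout. The handler below is illustrative only and reuses the hypothetical Backend and Secret types from the sketch in the Decision section; it is not SPIKE Nexus' actual API.

package storage // illustrative; reuses the Backend and Secret stand-ins from the earlier sketch

import (
    "context"
    "encoding/json"
    "errors"
    "net/http"
    "time"
)

// handleGetSecret propagates the request context, bounded by a timeout, down
// to the backend call so that cancelled or slow requests fail fast and release
// resources instead of piling up on a struggling backing store.
func handleGetSecret(w http.ResponseWriter, r *http.Request, be Backend, path string) {
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    secret, err := be.LoadSecret(ctx, path)
    switch {
    case errors.Is(err, context.DeadlineExceeded), errors.Is(err, context.Canceled):
        http.Error(w, "backend timeout", http.StatusGatewayTimeout)
        return
    case err != nil:
        http.Error(w, "storage error", http.StatusInternalServerError)
        return
    case secret == nil:
        http.NotFound(w, r)
        return
    }
    _ = json.NewEncoder(w).Encode(secret.Data)
}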

This change aligns with modern cloud-native patterns where stateless services rely on robust, scalable data stores rather than maintaining local state.

References