ADR-0016: Memory-First Secrets Store

Status: accepted
Date: 2024-12-22
Tags: Security, Operations, Storage, Performance, Scalability

Context

SPIKE keeps secrets in the memory of SPIKE Nexus by design. The secrets are encrypted and backed up to a secondary backup storage; however, the primary source of truth is the in-memory store.

This is an efficient mechanism to store application secrets (e.g., API keys, certificates, even relatively beefy Kubeconfig files).

However, we need to maintain certain requirements for SPIKE to be a production-grade secure, reliable, and robust secrets store:

Our requirements include:

High-Performance Access: Secrets should be rapidly retrievable with minimal latency.
Robust Backup and Recovery: The system should persist data safely and recover quickly from crashes.
Security: Restrict access to secrets via path-based policies and protect data at rest via encryption.
Auditability: Record all read/write operations for compliance and monitoring.
Scalability: The system should handle up to hundreds of thousands of secrets.
High Availability: Provide read replicas for scaling reads and failover strategies.

We considered disk-only, disk-first, cloud-storage-only (like AWS S3) and cloud-storage-first solutions and decided a memory-first secrets store with a reliable back-up mechanism is the best fit for SPIKE.

Decision

SPIKE will be an in-memory secrets store with the following characteristics:

In-Memory Data: The primary data store resides in RAM, offering near-instant reads and writes.
Periodic Backup: An encrypted backing store (SQLite, Postgres DB, or an S3-compatible interface) will serve as a backup. The system uses exponential retries to ensure data persistence.
Hardened Container: The service is recommended to run in a hardened container or sandbox with minimal OS surface area, reducing the likelihood of root compromise.
Path-Based Access Controls: Secrets are organized hierarchically (for, e.g., /secrets/acme/*). Only specific roles/tokens can access their respective paths.
Replication: A primary read-write store with read-only replicas. These replicas can be promoted or re-hydrated if the primary fails.
Auditing: All secret operations (reads, writes, deletes) are logged to an audit trail for compliance and investigation.

Rationale

Performance: In-memory data reduces latency compared to purely disk-backed solutions.
Backup Safety: The secondary backup (encrypted at rest) mitigates memory volatility by allowing the system to recover from unexpected crashes or restarts.
Security:
- Hardened Container: Minimizes OS-level attack surface.
- Encryption at Rest: Protects offline backups if the disk is compromised.
- Path-Based Policies: Enforces the principle of least privilege.
- Auditing: Aids in compliance and detection of unauthorized access.
- Scalability: Storing thousands or even hundreds of thousands of secrets in memory is feasible with proper resource planning.

Consequences

Positive Outcomes

Performance Gain: Ultra-fast secrets retrieval for latency-sensitive applications.
Backup Resilience: Encrypted disk backups reduce permanent data loss if the container restarts.
Fine-Grained Control*: Path-based policies and an internal auditing mechanism meet security and compliance needs.

Trade-Offs and Risks

In contrast to our decision, here are some benefits of using a database (or a remote object storage) as the single source of truth:

Security and Persistence: Using an encrypted database as the source of truth ensures that secrets are securely stored and persist across system restarts or crashes. Though with frequent forced writes, the risk of data loss is minimized and can further be mitigated by using mechanisms like message queues.
Scalability: Databases can handle growth more effectively, allowing the system to accommodate the increasing number of secrets without a significant redesign. Again, this is a non-issue because if you have to store millions of secrets, then you need to review your architecture anyway. In an ideal world, the only secret an app needs are PKI certificates (like SVIDs) as they can uniquely identify the app.
Simplicity: A single source of truth simplifies the architecture, making the system easier to develop and maintain. To counter this, SPIKE Nexus’ current architecture is simple enough to maintain and develop. We have abstracted exponential backoff and retry mechanisms to the storage layer, and once we have adequate abstractions, the maintainability of the system will be equivalent to a database-as-the-single-source-of-truth system. Besides, at the cost of simplicity, we lose performance and will have to implement additional caching mechanisms to mitigate latency, which will add complexity and result in an equally complex system. There is no free lunch.

Here are some other liabilities of a memory-first secrets store:

Crash Consistency: Potential for a small window of data loss if the system crashes just before backup.
- Mitigation: frequent or near-synchronous write-through.
Failover Complexity: Replication and promotion logic must be robustly implemented to handle node failures seamlessly.
Memory as an Additional Attack Surface:
- While ephemeral in-memory storage can mitigate certain disk-theft scenarios, memory itself can be inspected if an attacker gains OS-level access.
- That’s why hardening the container and ensuring proper access controls are crucial. SPIKE assume the machine as the trusted boundary. So, if the machine is compromised, the secrets are considered compromised as well.