ADR-0030: Minimal Error Messages in API Responses

Status: accepted
Date: 2025-11-22
Tags: Security, API Design, Error Handling, Information Disclosure

Context

SPIKE Nexus exposes an HTTP API that workloads use to interact with the secret management system. When operations fail, the API must communicate errors to clients. The design of error responses involves a fundamental trade-off between security and diagnostics:

Detailed error messages give clients rich diagnostic information and make debugging easier, but risk leaking information.
Minimal error messages return only error codes, which protects the system but limits client-side diagnostics.

For a security-critical secret management system, we need to determine the appropriate balance between these concerns.

Decision

SPIKE Nexus API responses will return error codes only, with no descriptive error messages to clients.

Specifically:

API responses contain only structured error codes (e.g., NOT_FOUND, UNAUTHORIZED, BAD_REQUEST)
Responses carry no additional error message fields, stack traces, or diagnostic information
All detailed error context is logged server-side with a full audit trail
Clients must interpret errors based solely on:
- HTTP status codes (404, 401, 400, 500)
- Structured error code enumerations
- Request parameters they provided

Rationale

Security: Preventing Information Leakage

Error messages can reveal sensitive information about the system:

Path existence enumeration:

❌ "Secret 'secrets/admin/root-password' not found"
   → Reveals path structure even when access is denied

✓  NOT_FOUND
   → Reveals nothing about whether path exists or is unauthorized

Permission structure disclosure:

❌ "Permission denied for path 'secrets/database'"
   → Leaks information about permission boundaries

✓  UNAUTHORIZED
   → No information about what paths exist or their structure

Implementation details:

❌ "Database query failed: table 'secrets' locked"
   → Leaks internal architecture details

✓  INTERNAL_SERVER_ERROR
   → No information about internal implementation

Stack traces (the “one bad commit” risk):

❌ Adding an ErrMsg field creates a risk of accidentally exposing:
   - File paths
   - Internal function names
   - SQL queries
   - Configuration details

✓  No message field = no risk of accidental disclosure

Defense Against Enumeration Attacks

Minimal errors prevent attackers from probing the system:

Attack Vector	Detailed Messages	Minimal Codes
Path enumeration	“Path X not found” vs “Path Y unauthorized” reveals valid paths	Not-found and unauthorized failures return the same code
Permission probing	Messages reveal permission boundaries	No distinction between not-found and unauthorized
Version detection	Stack traces reveal library versions	No version information leaked
Schema discovery	Error messages reveal data structure	No schema information exposed
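
The collapsing of "not found" and "not authorized" into one outcome can be sketched as a small pure function. This is an illustrative sketch only, not SPIKE's implementation: `ErrorCode`, `CodeNone`, `CodeNotFound`, and `clientCode` are hypothetical names, and the two booleans stand in for the real store lookup and policy check.

```go
package main

import "fmt"

// ErrorCode is a hypothetical stand-in for the project's error code type.
type ErrorCode string

const (
	CodeNone     ErrorCode = ""
	CodeNotFound ErrorCode = "NOT_FOUND"
)

// clientCode returns the code a caller would see for a read attempt.
// A missing path and a path the caller may not read are deliberately
// indistinguishable from the outside.
func clientCode(exists, allowed bool) ErrorCode {
	if !exists || !allowed {
		return CodeNotFound
	}
	return CodeNone
}

func main() {
	fmt.Println(clientCode(false, true))            // path does not exist
	fmt.Println(clientCode(true, false))            // path exists, caller not allowed
	fmt.Println(clientCode(true, true) == CodeNone) // authorized read succeeds
}
```

Because both failure cases produce the same value, an attacker comparing responses for different paths learns nothing about which paths exist.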

API Design: Clean and Stable

Error codes provide better API stability than messages:

Programmatic handling:

// Client can reliably handle specific errors
switch response.Err {
case data.ErrNotFound:
    // Handle missing secret
case data.ErrUnauthorized:
    // Handle permission denied
}

No versioning issues:

Error codes remain stable across versions
No message format changes breaking clients
No localization complexity
Consistent parsing and handling

Testability:

Deterministic error codes are easy to test
No string matching or regex required
Clear expected outcomes in test cases
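
As a rough illustration of the testability point, the sketch below checks error handling by comparing codes for equality, with no string matching. The names `classify`, `checkCases`, and `errEntityNotFound` are hypothetical and are not part of the SPIKE SDK; the error strings are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrorCode is a hypothetical stand-in for the project's error code type.
type ErrorCode string

const (
	CodeNotFound ErrorCode = "NOT_FOUND"
	CodeInternal ErrorCode = "INTERNAL_SERVER_ERROR"
)

// errEntityNotFound is a hypothetical sentinel for a missing entity.
var errEntityNotFound = errors.New("entity not found")

// classify maps an internal error to the client-facing code.
func classify(err error) ErrorCode {
	if errors.Is(err, errEntityNotFound) {
		return CodeNotFound
	}
	return CodeInternal
}

// checkCases runs table-driven cases and returns the number of mismatches.
func checkCases() int {
	cases := []struct {
		name string
		err  error
		want ErrorCode
	}{
		{"missing secret", fmt.Errorf("read failed: %w", errEntityNotFound), CodeNotFound},
		{"backend failure", errors.New("table locked"), CodeInternal},
	}
	failures := 0
	for _, c := range cases {
		if got := classify(c.err); got != c.want {
			fmt.Printf("%s: got %s, want %s\n", c.name, got, c.want)
			failures++
		}
	}
	return failures
}

func main() {
	fmt.Println("failures:", checkCases())
}
```

The expected value in each case is a constant, so the test does not break when internal error wording changes.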

Operational Model: Server-Side Context

SPIKE’s architecture provides full diagnostics where they belong:

Audit logging captures everything:

Server log:
[req_abc123] [SPIFFE: spiffe://example.org/workload/app]
Failed to read secret 'secrets/db/password': permission denied
Policy check failed: path pattern '^secrets/admin/.*' required

Client receives:

{
  "err": "UNAUTHORIZED"
}
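
The split between server log and client response shown above can be approximated with the Go standard library alone. This is a sketch under assumptions, not SPIKE's handler code: `fail`, `demo`, and `errorResponse` are hypothetical, and a plain `log.Logger` writing to a buffer stands in for the real audit journal.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// errorResponse mirrors the code-only response body.
type errorResponse struct {
	Err string `json:"err"`
}

// fail logs the detailed cause server-side and writes only the code to the client.
func fail(logger *log.Logger, w http.ResponseWriter, status int, code string, detail error) {
	logger.Printf("request failed: code=%s detail=%v", code, detail)
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(errorResponse{Err: code})
}

// demo runs one failing request and returns the server log and the client body.
func demo() (string, string) {
	var serverLog bytes.Buffer
	logger := log.New(&serverLog, "", 0)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// The detail names the path; it goes to the log, never to the client.
		detail := fmt.Errorf("policy check failed for %q", r.URL.Path)
		fail(logger, w, http.StatusUnauthorized, "UNAUTHORIZED", detail)
	})

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/secrets/db/password", nil)
	handler.ServeHTTP(rec, req)
	return serverLog.String(), rec.Body.String()
}

func main() {
	serverLog, body := demo()
	fmt.Print("server log:  ", serverLog)
	fmt.Print("client body: ", body)
}
```

Routing every failure through one helper like this keeps the response shape in a single place, which makes it harder for a later change to add detail to the client body by accident.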

Clear separation of concerns:

Clients: Get actionable error codes for programmatic handling
Operators: Have server access and can see full audit logs with context
Authorized users: Can correlate their requests with server logs if needed
Unauthorized users: Get nothing useful for reconnaissance

Industry Validation

Other security-critical systems follow a similar pattern, keeping client-facing errors terse and recording detailed context server-side:

HashiCorp Vault:

API Response: {"errors":["permission denied"]}
Server logs:  Detailed context with paths, policies, tokens

AWS Secrets Manager:

API Response: Generic error codes
CloudTrail:   Full audit trail with all context

Kubernetes Secrets API:

API Response: Standard error codes
Audit logs:   Complete request/response details

All separate client-facing errors from server-side diagnostics.

Alternatives Considered

Alternative 1: Include Generic Error Messages

Provide generic messages without sensitive details:

{
  "err": "NOT_FOUND",
  "message": "Secret not found"
}

Rejected because:

Adds API surface complexity with minimal benefit
Generic messages don’t provide actionable information beyond the code
Risk of messages accidentally becoming more detailed over time
The error code already conveys the same information
No clear line between “safe” and “unsafe” detail levels

Alternative 2: Detailed Messages for Authenticated Users

Provide detailed errors only to authenticated, authorized users:

{
  "err": "UNAUTHORIZED",
  "message": "Policy 'db-read' denies access to 'secrets/db/password'"
}

Rejected because:

Still risks information leakage (policy names, path details)
Adds complexity to determine what details are “safe”
Authentication doesn’t mean users should see internal details
Creates inconsistent error handling logic
Server-side logs already provide this for operators

Alternative 3: Request IDs for Correlation

Include correlation IDs so clients can reference server logs:

{
  "err": "NOT_FOUND",
  "requestId": "req_abc123"
}

Considered acceptable but not required because:

SPIKE’s audit logging already provides correlation via SPIFFE ID and timestamp
Users with legitimate need for diagnostics have server log access
Adding request IDs provides minimal benefit over existing correlation methods
Can be added later if operational experience shows clear need
Keeping responses minimal is preferred for initial implementation

Status: May be reconsidered based on operational feedback

Consequences

Positive

Security by design: Information leakage is prevented at the API layer
Enumeration protection: Attackers cannot probe system structure via errors
No accidental disclosure: With no message field, there is no channel through which stack traces or implementation details can leak
Clean API surface: Simple, stable error code enumeration
Programmatic handling: Clients can reliably switch on error codes
Stable interface: Error codes don’t change; messages would
Clear security model: “If you’re authorized, the code tells you everything. If you’re not, you get nothing.”

Negative

Limited client diagnostics: Clients cannot see detailed error reasons
Operator workflow: Users must correlate client errors with server logs for debugging
Learning curve: New users might expect more detailed error messages
Script debugging: Wrapper scripts get less information for error handling

Neutral

Consistent with design: SPIKE already has comprehensive audit logging
Expected for security systems: Users familiar with Vault and similar systems expect this pattern
Operational requirement: Operators need server access anyway for secret management

Implementation Details

Response Structure

All error responses follow this structure:

type ErrorResponse struct {
    Err data.ErrorCode `json:"err"`
    // No message, details, or stack trace fields
}
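
To make the wire format concrete, the following sketch marshals a response of this shape with `encoding/json`. The local `ErrorCode` type is a stand-in for `data.ErrorCode`, assumed here to be string-based, and `encode` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ErrorCode is a local stand-in for data.ErrorCode (assumed to be string-based).
type ErrorCode string

// ErrorResponse carries the error code and nothing else.
type ErrorResponse struct {
	Err ErrorCode `json:"err"`
}

// encode returns the JSON wire form of an error response.
func encode(code ErrorCode) string {
	b, err := json.Marshal(ErrorResponse{Err: code})
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode("NOT_FOUND")) // prints {"err":"NOT_FOUND"}
}
```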

Error Codes

Standard error codes returned:

HTTP Status	Error Code	Meaning
200	`null`	Success
400	`BAD_REQUEST`	Invalid request format or parameters
401	`UNAUTHORIZED`	Authentication or authorization failure
404	`NOT_FOUND`	Resource does not exist (or unauthorized)
500	`INTERNAL_SERVER_ERROR`	Backend or server-side failure

Note: 404 is returned both when a resource does not exist and when the caller is not authorized to access it, so that responses cannot be used to enumerate paths.
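
The table can be expressed as a single lookup so that codes and HTTP statuses cannot drift apart. This is a minimal sketch: `statusFor` and the `Code...` constants are hypothetical names, and falling back to 500 for unknown codes is an assumption of the sketch, not something this ADR specifies.

```go
package main

import (
	"fmt"
	"net/http"
)

// ErrorCode is a hypothetical stand-in for the project's error code type.
type ErrorCode string

const (
	CodeBadRequest   ErrorCode = "BAD_REQUEST"
	CodeUnauthorized ErrorCode = "UNAUTHORIZED"
	CodeNotFound     ErrorCode = "NOT_FOUND"
	CodeInternal     ErrorCode = "INTERNAL_SERVER_ERROR"
)

// statusFor maps an error code to its HTTP status.
// Unknown codes fall back to 500 (an assumption of this sketch).
func statusFor(code ErrorCode) int {
	switch code {
	case CodeBadRequest:
		return http.StatusBadRequest
	case CodeUnauthorized:
		return http.StatusUnauthorized
	case CodeNotFound:
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	for _, c := range []ErrorCode{CodeBadRequest, CodeUnauthorized, CodeNotFound, CodeInternal} {
		fmt.Printf("%d\t%s\n", statusFor(c), c)
	}
}
```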

Server-Side Logging

All errors are logged with full context:

// Handler logs detailed context
log.DebugErr(fName, sdkErrors.ErrAPINotFound.Wrap(err))
// Audit trail captures request details
journal.AuditRequest(fName, r, audit, journal.AuditRead)

// Client receives only:
net.Fail(reqres.SecretGetNotFound, w, http.StatusNotFound)

Error Handling Pattern

All route handlers follow this pattern:

func RouteGetSecret(w http.ResponseWriter, r *http.Request,
    audit *journal.AuditEntry) *sdkErrors.SDKError {

    // path and version come from the parsed request (parsing omitted here)
    secret, err := state.GetSecret(path, version)
    if err != nil {
        // Server-side: Log full context
        log.DebugErr("RouteGetSecret", err)

        // Client-side: Return only code
        if err.Is(sdkErrors.ErrEntityNotFound) {
            net.Fail(reqres.SecretGetNotFound, w, http.StatusNotFound)
        } else {
            net.Fail(reqres.SecretGetInternal, w,
                http.StatusInternalServerError)
        }
        return err
    }

    // Success response
    return net.Success(reqres.SecretGetSuccess, w)
}

Client Interpretation

Clients use error codes programmatically:

// Client code
resp, err := nexus.GetSecret(ctx, path, version)
if err != nil {
    switch resp.Err {
    case data.ErrNotFound:
        // Secret doesn't exist or not authorized
    case data.ErrUnauthorized:
        // Authentication failed
    case data.ErrInternal:
        // Server error - retry or escalate
    }
}
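
For clients that work with the raw JSON body rather than an SDK, the same idea can be sketched with the standard library. Everything here is hypothetical and for illustration: the `decide` function, the `action` values, and the choice of which codes are worth retrying are assumptions, not SPIKE client behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// response is the code-only body a client receives.
type response struct {
	Err string `json:"err"`
}

// action is what this hypothetical client does next.
type action string

const (
	actProceed  action = "proceed"
	actGiveUp   action = "give up"
	actReauth   action = "re-authenticate"
	actFixInput action = "fix request"
	actRetry    action = "retry"
)

// decide parses a response body and picks the next step from the code alone.
func decide(body []byte) (action, error) {
	var r response
	if err := json.Unmarshal(body, &r); err != nil {
		return "", err
	}
	switch r.Err {
	case "": // a null or absent code means success
		return actProceed, nil
	case "NOT_FOUND":
		return actGiveUp, nil
	case "UNAUTHORIZED":
		return actReauth, nil
	case "BAD_REQUEST":
		return actFixInput, nil
	case "INTERNAL_SERVER_ERROR":
		return actRetry, nil
	default:
		// Unknown codes are treated as non-retryable in this sketch.
		return actGiveUp, nil
	}
}

func main() {
	for _, body := range []string{`{"err":null}`, `{"err":"NOT_FOUND"}`, `{"err":"INTERNAL_SERVER_ERROR"}`} {
		a, err := decide([]byte(body))
		fmt.Println(body, "->", a, err)
	}
}
```

The decision depends only on the enumerated code, so it keeps working if server-side log wording changes.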

Future Enhancements

Correlation IDs

If operational experience shows a clear need, we may add request correlation IDs to API responses:

{
  "err": "NOT_FOUND",
  "requestId": "req_abc123"
}

Benefits:

Users can reference specific requests when asking operators for help
Operators can quickly locate relevant log entries
No security information is leaked (ID is opaque)
Improves support workflow without compromising security

Current status:

Not implemented in initial version
Existing correlation via SPIFFE ID and timestamp is sufficient
Will reconsider based on operational feedback and support burden
Can be added non-breaking if needed
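
The non-breaking claim can be illustrated with Go's `encoding/json`, which ignores unknown fields by default when decoding. In this sketch `responseV1`, `responseV2`, and `roundTrip` are hypothetical names; the `requestId` field name is taken from the example above, and the use of `omitempty` is an assumption about how the field might be added.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// responseV2 is a hypothetical future response with an optional correlation ID.
type responseV2 struct {
	Err       string `json:"err"`
	RequestID string `json:"requestId,omitempty"`
}

// responseV1 is what an existing client decodes today.
type responseV1 struct {
	Err string `json:"err"`
}

// roundTrip encodes a V2 response and decodes it the way an existing V1 client would.
func roundTrip(v2 responseV2) (string, responseV1, error) {
	wire, err := json.Marshal(v2)
	if err != nil {
		return "", responseV1{}, err
	}
	var v1 responseV1
	err = json.Unmarshal(wire, &v1)
	return string(wire), v1, err
}

func main() {
	// With an ID: older clients simply ignore the extra field.
	wire, v1, err := roundTrip(responseV2{Err: "NOT_FOUND", RequestID: "req_abc123"})
	fmt.Println(wire, v1.Err, err)

	// Without an ID: the wire format is identical to today's response.
	wire, v1, err = roundTrip(responseV2{Err: "NOT_FOUND"})
	fmt.Println(wire, v1.Err, err)
}
```

Clients that decode strictly, for example with `Decoder.DisallowUnknownFields`, would reject the new field, so the change is non-breaking only for clients that tolerate unknown fields.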

Evaluation criteria:

Frequency of users needing operator assistance for error diagnosis
Time spent by operators correlating client errors with server logs
User feedback on debugging difficulty
Comparison with alternative approaches (timestamp-based correlation, SPIFFE ID filtering)