Troubleshooting
Problem
Something is wrong and the error message points at a symptom, not a cause.
Nexus will not go Ready, a workload gets 403 even though you wrote a policy,
or bootstrap hangs. This recipe is organized symptom first: find the line
that matches what you see, then work the checklist under it.
TL;DR
| Symptom | Most likely cause |
|---|---|
| Nexus never Ready / loops in keeper recovery | Keepers unseeded; bootstrap not run |
| `403 Forbidden` on a secret | Missing policy, or pattern is a glob not regex |
| Bootstrap hangs / never completes | A keeper unreachable, or verify can’t reach Nexus |
| `no registration entry found` | SPIRE entry missing or selectors mismatch |
| `connection refused` (SVID) | SPIRE Agent down or wrong socket path |
Nexus never becomes Ready (stuck in keeper recovery)
This is the classic fresh-deployment failure. Nexus rebuilds its root key from the keepers on startup and waits forever until it can; if the keepers are empty, it loops and the readiness probe keeps restarting it.
Work down this list:
- Did bootstrap run? On a fresh `lite`/`sqlite` deployment the keepers start empty. Until bootstrap seeds them, Nexus cannot recover. This is the number-one cause. Run it. See Bootstrapping a fresh SPIKE.
- Did you `helm install --wait`? That deadlocks: Nexus is not Ready until the keepers are seeded, which happens after install. Install without `--wait`, run bootstrap, then wait on the Nexus rollout.
- Are all keepers reachable? Bootstrap needs every keeper in `SPIKE_NEXUS_KEEPER_PEERS` reachable, and Nexus needs `threshold` of them to reconstruct. Check the peer list and that each keeper is up.
- Is `SPIKE_NEXUS_API_URL` resolvable? In Kubernetes use the fully-qualified `*.svc.cluster.local` name; the short `service.namespace` form can be NXDOMAIN.
- Are all trust roots set? Nexus validates callers against the configured trust roots, including `SPIKE_TRUST_ROOT_NEXUS`. A missing root silently rejects otherwise-valid identities.
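The keeper and URL checks above can be sketched as a small script. This is a minimal illustration, not part of SPIKE: the function names, the keeper names, the port, and the `.svc.cluster.local` suffix test are all assumptions, and the reachability map is something you would fill in yourself (for example from `kubectl get pods`).

```python
import socket
from urllib.parse import urlparse


def api_host(url):
    """Extract the host name from a URL such as the value of SPIKE_NEXUS_API_URL."""
    return urlparse(url).hostname


def looks_fully_qualified(host, cluster_domain="svc.cluster.local"):
    """Assumption: the cluster uses the default Kubernetes cluster domain."""
    return host.endswith("." + cluster_domain)


def resolves(host):
    """Return True if the local resolver can resolve the host name."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def keeper_status(reachable, threshold):
    """Summarize a {keeper_name: is_reachable} map.

    Bootstrap needs every listed keeper; recovery needs at least `threshold`.
    """
    up = sum(1 for ok in reachable.values() if ok)
    return {
        "bootstrap_possible": len(reachable) > 0 and up == len(reachable),
        "recovery_possible": up >= threshold,
    }
```

For example, `keeper_status({"k1": True, "k2": True, "k3": False}, 2)` reports that recovery is possible but bootstrap is not, which matches the "one keeper down" case in the checklist. Run `resolves(api_host(url))` from inside the cluster network, since a name that resolves on your workstation says nothing about the pod's resolver.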
Policy created but access denied (403)
The workload has an identity but is not authorized for what it tried.
- Does a policy actually match? Both the `spiffeid-pattern` and the `path-pattern` must match. A policy that “looks right” usually has one too narrow or unanchored.
- Are the patterns regex, not globs? `^tenants/acme/.*$`, not `tenants/acme/*`. A glob-style `*` silently matches the wrong set or nothing. See Writing access policies.
- Did you escape the dots? `example\.org`, not `example.org` (an unescaped `.` matches any character, sometimes masking the real problem).
- Right permission? `read` to read, `write` to create/update/delete, `list` to enumerate. A reader with only `list` still cannot `get`.
- Path is a namespace. No leading slash: `tenants/acme/db/creds`, and the policy path and request path must agree.
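To see why these checks matter, here is a minimal sketch of regex-based policy matching. It is not SPIKE's evaluator: the `Policy` class, the `allows` function, and the SPIFFE ID are hypothetical, and it assumes patterns are applied as an unanchored regex search, which is why the `^` and `$` anchors matter.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical stand-in for a policy: two regex patterns plus permissions."""
    spiffeid_pattern: str
    path_pattern: str
    permissions: frozenset


def allows(policy, spiffe_id, path, permission):
    """Allow only if the permission is granted AND both patterns match."""
    return (
        permission in policy.permissions
        and re.search(policy.spiffeid_pattern, spiffe_id) is not None
        and re.search(policy.path_pattern, path) is not None
    )


WORKLOAD = "spiffe://example.org/workload/billing"  # illustrative identity

# Anchored regex, escaped dot, read-only.
good = Policy(
    spiffeid_pattern=r"^spiffe://example\.org/workload/billing$",
    path_pattern=r"^tenants/acme/.*$",
    permissions=frozenset({"read"}),
)

# Same intent written as a glob. As a regex, "/*" means "zero or more slashes",
# so the pattern matches anything that merely contains "tenants/acme".
globby = Policy(
    spiffeid_pattern=r"^spiffe://example\.org/workload/billing$",
    path_pattern="tenants/acme/*",
    permissions=frozenset({"read"}),
)

print(allows(good, WORKLOAD, "tenants/acme/db/creds", "read"))   # expected path
print(allows(good, WORKLOAD, "/tenants/acme/db/creds", "read"))  # leading slash
print(allows(good, WORKLOAD, "tenants/acme/db/creds", "list"))   # wrong permission
print(allows(globby, WORKLOAD, "tenants/acmecorp/db", "read"))   # glob over-matches
```

Under these assumptions only the first and last calls return `True`, and the last one is the dangerous case: the glob-style pattern grants access to a different tenant's path.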
Bootstrap hangs or never completes
- Keepers not all up. Bootstrap requires exactly `SHARES` reachable keepers; a wrong or incomplete peer list stalls or fails it.
- Verify cannot reach Nexus. The post-seed verification calls the Nexus API; if `SPIKE_NEXUS_API_URL` is wrong, verification fails even though the shares landed.
- Re-run did nothing. That is idempotency working: the `spike-bootstrap-state` ConfigMap records completion. To deliberately re-key, set `SPIKE_BOOTSTRAP_FORCE=true` (this orphans data under the old key).
SVID / SPIRE errors
- `no registration entry found`: the SPIRE entry is missing, or its selectors do not match the pod/process. Create the entry and confirm the selectors. See Granting a workload access.
- `connection refused` acquiring an SVID: the SPIRE Agent is down or `SPIFFE_ENDPOINT_SOCKET` points at the wrong socket. Start the agent and fix the path.
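A quick way to tell "wrong path" from "agent down" is to inspect the socket path directly. The sketch below is illustrative only: the helper names are hypothetical, and it assumes `SPIFFE_ENDPOINT_SOCKET` uses the `unix://` form (for example `unix:///run/spire/sockets/agent.sock`, a path chosen here purely as an example).

```python
import os
import stat


def socket_path(endpoint):
    """Strip the unix:// scheme from an endpoint such as unix:///run/agent.sock."""
    prefix = "unix://"
    if not endpoint.startswith(prefix):
        raise ValueError(f"expected a unix:// endpoint, got {endpoint!r}")
    return endpoint[len(prefix):]


def check_endpoint(endpoint):
    """Classify the endpoint path as 'missing', 'not-a-socket', or 'socket'."""
    path = socket_path(endpoint)
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"
```

Call it as `check_endpoint(os.environ["SPIFFE_ENDPOINT_SOCKET"])` from inside the workload's container. A result of `missing` or `not-a-socket` points at the path or the volume mount; a result of `socket` with `connection refused` still occurring suggests the file is stale and the agent behind it is not running.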
Tips
- Read the audit log carefully. Audit entries (`[AUDIT]:`) can log “enter/exit success” for a route even when the request fell through to a fallback and returned an error. Cross-check the actual HTTP status, not just the audit line.
- Local clusters cache images by tag. With `imagePullPolicy: Never`, kind and minikube reuse the cached image for the same tag even after a fresh load. Use a unique tag (or force a re-pull) when an image change “isn’t taking effect.”
- Isolate the layer. Confirm SPIRE issues the SVID, then that SPIKE authorizes it, then that the data operation works. Most “SPIKE” failures are actually identity or policy failures one layer down.
Pitfalls
- Treating the symptom, not the cause. A restarting Nexus pod is almost never a Nexus bug; it is usually unseeded keepers. Fix the cause (bootstrap), not the symptom (probe tuning).
- Assuming a 403 is a storage problem. It is authorization. Check the policy and its patterns before touching the secret store.
- Forgetting `lite` needs keepers too. “Encryption-only” does not mean “no bootstrap.” `lite` still recovers a root key from keepers and shows the exact same stuck-in-recovery symptom if unseeded.
Cross-links
- Bootstrapping a fresh SPIKE
- Where the root key lives
- Writing access policies
- Deploying SPIKE
- Reference: Configuration
What’s next
Once it is healthy, read secrets from your own code: Integrating the Go SDK.