LiteFS on Kubernetes - Reference Guide

Architecture Overview

LiteFS provides distributed SQLite by replicating from a primary node to replicas.

Pod-0 (Primary)     Pod-1 (Replica)     Pod-N (Replica)
    │                    │                    │
    └────────────────────┴────────────────────┘
           LiteFS replication stream
  • StatefulSet required for stable pod identities (abaci-app-0, abaci-app-1); see the manifest sketch below
  • Headless Service required for pod-to-pod DNS resolution
  • Pod-0 is always the primary (handles writes, runs migrations)
  • Other pods are replicas (receive replicated data, forward writes to primary)
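
A minimal sketch of the StatefulSet and headless Service referenced above, as plain Kubernetes YAML (this repo manages them via Terraform; the image and storage size are placeholders):

# Headless Service - gives each pod a stable DNS name like
# abaci-app-0.abaci-app-headless.abaci.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: abaci-app-headless
  namespace: abaci
spec:
  clusterIP: None            # headless: DNS resolves to individual pods
  selector:
    app: abaci-app
  ports:
    - port: 20202            # LiteFS replication port
---
# StatefulSet - stable identities abaci-app-0, abaci-app-1, ...
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: abaci-app
  namespace: abaci
spec:
  serviceName: abaci-app-headless    # must reference the headless Service
  replicas: 2
  selector:
    matchLabels:
      app: abaci-app
  template:
    metadata:
      labels:
        app: abaci-app
    spec:
      containers:
        - name: app
          image: ghcr.io/example/abaci-app:latest   # placeholder image
  volumeClaimTemplates:
    - metadata:
        name: litefs-data          # yields PVCs like litefs-data-abaci-app-0
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi           # placeholder size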

Critical Configuration Gotchas

1. HOSTNAME Environment Variable

NEVER set HOSTNAME in the Dockerfile or a ConfigMap. LiteFS needs the actual pod hostname (e.g., abaci-app-0) for cluster communication; setting HOSTNAME=0.0.0.0 or any other value breaks LiteFS.

# BAD - breaks LiteFS
ENV HOSTNAME="0.0.0.0"

# GOOD - let the pod set its own hostname
# (don't set HOSTNAME at all)
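
To confirm a pod sees its stable StatefulSet identity, a quick check (assumes the image ships a hostname binary; printenv HOSTNAME works as well):

# Should print "abaci-app-0", not "0.0.0.0"
kubectl -n abaci exec abaci-app-0 -- hostname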

2. Static Lease Configuration

For static primary election, ALL nodes must point to the same primary hostname:

# In litefs.yml - ALL NODES use the same values
lease:
  type: "static"
  hostname: "abaci-app-0"  # Fixed primary hostname
  advertise-url: "http://abaci-app-0.abaci-app-headless.abaci.svc.cluster.local:20202"
  candidate: ${LITEFS_CANDIDATE}  # true for pod-0, false for others

Common mistake: Using ${HOSTNAME} for the hostname field. This makes each pod think IT is the primary.
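
Spelled out in the same BAD/GOOD style as gotcha 1:

# BAD - ${HOSTNAME} expands to each pod's own name, so every pod elects itself
lease:
  type: "static"
  hostname: "${HOSTNAME}"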

3. LITEFS_CANDIDATE Must Be Set at Runtime

Environment variable substitution in litefs.yml doesn't work reliably. Generate the config at container startup:

# In container command (shell entrypoint)
export LITEFS_CANDIDATE=$(cat /config/litefs-candidate)
cat > /tmp/litefs.yml << EOF
# ...fuse, data, proxy, and the rest of the lease config go here...
lease:
  candidate: $LITEFS_CANDIDATE
EOF
exec litefs mount -config /tmp/litefs.yml
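
If you'd rather not mount a per-pod file, a sketch of an alternative that derives the flag from the StatefulSet ordinal (assumes the pod naming used throughout this guide):

# Pod-0 is the only candidate; all other pods are pure replicas
if [ "$(hostname)" = "abaci-app-0" ]; then
  export LITEFS_CANDIDATE=true
else
  export LITEFS_CANDIDATE=false
fi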

4. LiteFS 0.5 Config Changes

These fields do NOT exist in LiteFS 0.5:

  • primary-redirect-url (removed from proxy config)
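
For contrast, a proxy block that LiteFS 0.5 does accept looks roughly like this (the addr, target, and db values are placeholders, not taken from this deployment):

proxy:
  addr: ":8080"                    # port the LiteFS HTTP proxy listens on
  target: "localhost:3000"         # the application server behind it
  db: "app.db"                     # database used for write redirection
  passthrough: ["*.png", "*.ico"]  # requests that bypass consistency handling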

5. Cluster ID Mismatch After Failed Attempts

If replicas fail to connect with "cannot stream from primary with a different cluster id":

# Scale down
kubectl -n abaci scale statefulset abaci-app --replicas=1

# Delete the replica's PVC
kubectl -n abaci delete pvc litefs-data-abaci-app-1

# Scale back up - replica will get fresh volume and adopt primary's cluster ID
kubectl -n abaci scale statefulset abaci-app --replicas=2
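
Before deleting the PVC, you can confirm the mismatch by comparing the stored cluster IDs directly (an assumption here: the LiteFS data directory is /var/lib/litefs and it keeps a clusterid file; adjust to match your litefs.yml):

kubectl -n abaci exec abaci-app-0 -- cat /var/lib/litefs/clusterid
kubectl -n abaci exec abaci-app-1 -- cat /var/lib/litefs/clusterid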

6. Migration Path

Ensure the migration command path matches your build output:

exec:
  - cmd: "node dist/db/migrate.js"  # NOT dist/migrate.js
    if-candidate: true
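
A quick way to verify the file is actually in the image (the /app working directory is an assumption about the Dockerfile):

kubectl -n abaci exec abaci-app-0 -- ls /app/dist/db/migrate.js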

7. Required Container Privileges

LiteFS needs FUSE mount capability:

security_context {
  run_as_user  = 0
  run_as_group = 0
  privileged   = true  # Required for FUSE
}
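
If cluster policy disallows fully privileged containers, a narrower sketch that can work is granting only the SYS_ADMIN capability (depending on the container runtime you may still need to expose /dev/fuse to the pod):

security_context {
  run_as_user  = 0
  run_as_group = 0
  capabilities {
    add = ["SYS_ADMIN"]  # minimum capability for FUSE mounts
  }
}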

Debugging Commands

# Check pod logs for LiteFS messages
kubectl -n abaci logs abaci-app-0 | grep -i litefs
kubectl -n abaci logs abaci-app-1 | grep -i litefs

# Verify primary is accepting connections
kubectl -n abaci logs abaci-app-0 | grep "stream connected"

# Verify replica connected successfully
kubectl -n abaci logs abaci-app-1 | grep "connected to cluster"

# Check PVCs
kubectl -n abaci get pvc
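
One more check worth knowing: on a replica, LiteFS exposes a .primary file inside the FUSE mount that names the current primary (the /litefs mount point here is an assumption; use the mount dir from your litefs.yml):

# Prints the primary's hostname on a replica; the file is absent on the primary
kubectl -n abaci exec abaci-app-1 -- cat /litefs/.primary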

Workflow Reminder

Don't build Docker images locally. Use GitHub Actions:

  1. Commit and push changes
  2. Monitor build: gh run watch
  3. Apply Terraform: terraform apply
  4. Check the rollout and pod logs (commands below)
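
A sketch for step 4 (statefulset/abaci-app matches the names used throughout this guide):

# Wait for the StatefulSet to finish rolling out, then spot-check LiteFS logs
kubectl -n abaci rollout status statefulset/abaci-app
kubectl -n abaci logs abaci-app-0 | grep -i litefs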