K3s Infrastructure for Abaci.one
This directory contains Terraform configuration for deploying the Abaci.one application to a k3s (lightweight Kubernetes) cluster.
Architecture Overview
Internet
│
▼
┌───────────────────┐
│ NAS Traefik │ (Entry point, handles SSL for all domains)
│ (Docker) │ Config: /volume1/homes/antialias/projects/traefik/services.yaml
│ - SSL/TLS via │
│ Let's Encrypt │
│ - Routes to k3s │
└─────────┬─────────┘
│
▼ passHostHeader: true
┌───────────────────┐
│ k3s Traefik │ (Internal ingress controller)
│ - Rate Limiting │
│ - HSTS │
│ - Path routing │
└─────────┬─────────┘
│
▼
┌───────────────────┐
│ abaci-app Service│ (Load Balancer)
└─────────┬─────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Pod-0 │ │ Pod-1 │ │ Pod-2 │
│ PRIMARY │ │ REPLICA │ │ REPLICA │
│ │ │ │ │ │
│ LiteFS │──│ LiteFS │──│ LiteFS │
│ (FUSE) │ │ (FUSE) │ │ (FUSE) │
│ │ │ │ │ │
│ Next.js │ │ Next.js │ │ Next.js │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└────────────┴────────────┘
│
┌──────┴──────┐
│ Redis │
└─────────────┘
Key Components
StatefulSet: abaci-app
- 3 replicas with stable network identities (pod-0, pod-1, pod-2)
- Pod-0 is always the primary (handles database writes)
- Other pods are replicas (receive replicated data via LiteFS)
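Because this is a StatefulSet, the pod names and their ordinals are stable across restarts. A quick sanity check of the pod identities and where they are scheduled:

kubectl --kubeconfig=~/.kube/k3s-config -n abaci get pods -o wide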
LiteFS
- Provides distributed SQLite with automatic replication
- Mounted via FUSE at `/litefs`
- Primary (pod-0) handles all writes
- Replicas maintain read-only copies for load distribution
- Important: LiteFS proxy's `fly-replay` header only works on Fly.io, not k8s
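LiteFS normally exposes a `.primary` file inside the FUSE mount on replica nodes, containing the hostname of the current primary. A quick way to confirm pod-0 holds the primary lease (this assumes default LiteFS behavior and the `/litefs` mount path used here):

# On a replica, .primary names the current primary; the file does not exist on the primary itself
kubectl --kubeconfig=~/.kube/k3s-config -n abaci exec abaci-app-1 -- cat /litefs/.primary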
Keel (Auto-Deployment)
- Watches `ghcr.io` for new images
- Polls every 2 minutes for `:latest` tag changes
- Automatically triggers rolling updates when new images are detected
- No manual deployment steps required after pushing to main
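Keel is driven by `keel.sh/*` annotations on the StatefulSet (this deployment uses `keel.sh/policy=force`, per the troubleshooting section below; the exact poll-schedule annotation value is not shown in this README). To see what Keel is configured to do:

# List the keel.sh annotations controlling update policy and polling
kubectl --kubeconfig=~/.kube/k3s-config -n abaci get statefulset abaci-app -o yaml | grep keel.sh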
Services
- abaci-app: ClusterIP service, load balances GET requests across all pods
- abaci-app-primary: Routes to pod-0 only (for POST/PUT/DELETE/PATCH)
- abaci-app-headless: Headless service for pod-to-pod DNS (LiteFS replication)
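To confirm the three services exist and that abaci-app-primary really targets only pod-0:

# The headless service shows CLUSTER-IP as None
kubectl --kubeconfig=~/.kube/k3s-config -n abaci get svc

# The primary service should list a single endpoint (pod-0's IP)
kubectl --kubeconfig=~/.kube/k3s-config -n abaci get endpoints abaci-app-primary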
Ingress & Write Routing
- Traefik ingress controller (included with k3s)
- SSL certificates via cert-manager + Let's Encrypt
- HSTS, rate limiting, and in-flight request limits
- IngressRoute routes write methods (POST/PUT/DELETE/PATCH) to primary service
- This is required because LiteFS proxy on replicas returns a `fly-replay` header, which k8s doesn't understand
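To inspect how the split is actually expressed, dump the Traefik IngressRoute and look at its match rules. This is a rough check: it assumes the route uses Traefik's `Method()` matcher (the usual way to express method-based routing) and that the k3s-bundled Traefik CRDs are installed, which they are by default.

# Write methods should be matched to the abaci-app-primary service
kubectl --kubeconfig=~/.kube/k3s-config -n abaci get ingressroute -o yaml \
  | grep -E 'match:|name: abaci-app'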
File Structure
infra/terraform/
├── main.tf # Providers and namespace
├── app.tf # Main app StatefulSet, Services, Ingress
├── keel.tf # Keel auto-deployment
├── redis.tf # Redis deployment for sessions/cache
├── cert-manager.tf # SSL certificate management
├── storage.tf # PVC for vision training data
├── variables.tf # Input variables
├── outputs.tf # Terraform outputs
├── versions.tf # Provider versions
├── .claude/
│ └── LITEFS_K8S.md # LiteFS troubleshooting guide
├── CLAUDE.md # Agent instructions
└── README.md # This file
Deployment Workflow
Automatic (Normal Flow)
- Push code to main → GitHub Actions builds Docker image
- Image pushed to ghcr.io with `:latest` tag
- Keel detects new image (within 2 minutes)
- Rolling update triggered automatically
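To follow an automatic rollout right after a push to main:

# Blocks until all pods are running the new revision
kubectl --kubeconfig=~/.kube/k3s-config -n abaci rollout status statefulset abaci-app

# Show which image each pod is currently running (adjust the container index if the pod has sidecars)
kubectl --kubeconfig=~/.kube/k3s-config -n abaci get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'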
Manual Infrastructure Changes
When you modify Terraform files:
cd infra/terraform
terraform plan # Review changes
terraform apply # Apply changes
Manual Pod Restart
To force an immediate rollout without waiting for Keel:
kubectl --kubeconfig=~/.kube/k3s-config -n abaci rollout restart statefulset abaci-app
Common Operations
Check Pod Status
kubectl --kubeconfig=~/.kube/k3s-config -n abaci get pods
View Logs
# App logs
kubectl --kubeconfig=~/.kube/k3s-config -n abaci logs abaci-app-0 -f
# Keel logs (auto-deployment)
kubectl --kubeconfig=~/.kube/k3s-config -n keel logs -l app=keel
Check LiteFS Replication
# Primary should show "stream connected"
kubectl --kubeconfig=~/.kube/k3s-config -n abaci logs abaci-app-0 | grep stream
# Replicas should show "connected to cluster"
kubectl --kubeconfig=~/.kube/k3s-config -n abaci logs abaci-app-1 | grep connected
Query Production Database
kubectl --kubeconfig=~/.kube/k3s-config -n abaci exec abaci-app-0 -- sqlite3 /litefs/sqlite.db "SELECT COUNT(*) FROM users"
Scale Replicas
# Scale to 5 replicas
kubectl --kubeconfig=~/.kube/k3s-config -n abaci scale statefulset abaci-app --replicas=5
# Or update var.app_replicas in terraform.tfvars and apply
Troubleshooting
Pods Stuck in Pending
kubectl --kubeconfig=~/.kube/k3s-config -n abaci describe pod abaci-app-0
LiteFS Cluster ID Mismatch
If replicas fail with "cannot stream from primary with a different cluster id":
# Scale to 1, delete replica PVC, scale back up
kubectl --kubeconfig=~/.kube/k3s-config -n abaci scale statefulset abaci-app --replicas=1
kubectl --kubeconfig=~/.kube/k3s-config -n abaci delete pvc litefs-data-abaci-app-1
kubectl --kubeconfig=~/.kube/k3s-config -n abaci scale statefulset abaci-app --replicas=3
Keel Not Updating
- Check Keel logs for errors
- Verify annotations on StatefulSet: `keel.sh/policy=force`
- Check if image digest actually changed in ghcr.io
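To compare what the cluster is running against what CI pushed, look at the resolved image digest on a pod; it should match the digest of the latest image in ghcr.io:

# Prints the running image reference including its sha256 digest
kubectl --kubeconfig=~/.kube/k3s-config -n abaci get pod abaci-app-0 \
  -o jsonpath='{.status.containerStatuses[0].imageID}'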
Environment Variables
| Variable | Description |
|---|---|
| `NODE_ENV` | `production` |
| `PORT` | `3000` (internal, proxied through LiteFS at 8080) |
| `DATABASE_URL` | `/litefs/sqlite.db` |
| `REDIS_URL` | `redis://redis:6379` |
| `AUTH_SECRET` | NextAuth.js secret (from terraform secret) |
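To double-check what a running pod actually sees (AUTH_SECRET is left out so the secret is not printed to your terminal):

# Assumes the container image provides a standard env binary
kubectl --kubeconfig=~/.kube/k3s-config -n abaci exec abaci-app-0 -- \
  env | grep -E '^(NODE_ENV|PORT|DATABASE_URL|REDIS_URL)='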
SSL/TLS
SSL is handled at two levels:
- NAS Traefik (external entry point)
  - Terminates SSL for all domains (abaci.one, status.abaci.one, etc.)
  - Issues certs via Let's Encrypt (`certresolver: "myresolver"`)
  - Config: `nas:/volume1/homes/antialias/projects/traefik/services.yaml`
- k3s Traefik (internal)
  - Receives traffic from NAS Traefik (passHostHeader)
  - Handles internal routing and rate limiting
  - Can optionally manage additional certs for internal services
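To confirm which certificate external clients are actually served (it should be the Let's Encrypt certificate issued via the NAS Traefik resolver):

# Print the issuer and validity window of the certificate served for abaci.one
openssl s_client -connect abaci.one:443 -servername abaci.one </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -dates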
Adding New Subdomains
To add a new subdomain (e.g., api.abaci.one):
- Add DNS record (via Porkbun)

  # CNAME pointing to main domain
  curl -X POST "https://api.porkbun.com/api/json/v3/dns/create/abaci.one" \
    -d '{"name": "api", "type": "CNAME", "content": "abaci.one", ...}'

- Update NAS Traefik (`services.yaml`)

  http:
    routers:
      api-k3s:
        rule: "Host(`api.abaci.one`)"
        service: abaci-k3s
        entryPoints: ["websecure"]
        tls:
          certresolver: "myresolver"
      api-k3s-http:
        rule: "Host(`api.abaci.one`)"
        service: abaci-k3s
        entryPoints: ["web"]
        middlewares: ["redirect-https"]

  File location: `nas:/volume1/homes/antialias/projects/traefik/services.yaml` (Traefik auto-reloads this file.)

- Add k3s Ingress (in Terraform)

  resource "kubernetes_ingress_v1" "api" {
    # ... standard ingress config
    spec {
      rule {
        host = "api.abaci.one"
        # ...
      }
    }
  }

- Apply Terraform
terraform apply
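Once DNS has propagated and the changes are applied, a quick end-to-end check (api.abaci.one is the example host from above):

# Expect an HTTP response served over TLS via NAS Traefik -> k3s Traefik -> the app
curl -sI https://api.abaci.one | head -n 5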