260 lines
8.7 KiB
Markdown
260 lines
8.7 KiB
Markdown
# K3s Infrastructure for Abaci.one
|
|
|
|
This directory contains Terraform configuration for deploying the Abaci.one application to a k3s (lightweight Kubernetes) cluster.
|
|
|
|
## Architecture Overview
|
|
|
|
```
|
|
Internet
|
|
│
|
|
▼
|
|
┌───────────────────┐
|
|
│ NAS Traefik │ (Entry point, handles SSL for all domains)
|
|
│ (Docker) │ Config: /volume1/homes/antialias/projects/traefik/services.yaml
|
|
│ - SSL/TLS via │
|
|
│ Let's Encrypt │
|
|
│ - Routes to k3s │
|
|
└─────────┬─────────┘
|
|
│
|
|
▼ passHostHeader: true
|
|
┌───────────────────┐
|
|
│ k3s Traefik │ (Internal ingress controller)
|
|
│ - Rate Limiting │
|
|
│ - HSTS │
|
|
│ - Path routing │
|
|
└─────────┬─────────┘
|
|
│
|
|
▼
|
|
┌───────────────────┐
|
|
│ abaci-app Service│ (Load Balancer)
|
|
└─────────┬─────────┘
|
|
│
|
|
┌─────────────┼─────────────┐
|
|
▼ ▼ ▼
|
|
┌─────────┐ ┌─────────┐ ┌─────────┐
|
|
│ Pod-0 │ │ Pod-1 │ │ Pod-2 │
|
|
│ PRIMARY │ │ REPLICA │ │ REPLICA │
|
|
│ │ │ │ │ │
|
|
│ LiteFS │──│ LiteFS │──│ LiteFS │
|
|
│ (FUSE) │ │ (FUSE) │ │ (FUSE) │
|
|
│ │ │ │ │ │
|
|
│ Next.js │ │ Next.js │ │ Next.js │
|
|
└────┬────┘ └────┬────┘ └────┬────┘
|
|
│ │ │
|
|
└────────────┴────────────┘
|
|
│
|
|
┌──────┴──────┐
|
|
│ Redis │
|
|
└─────────────┘
|
|
```
|
|
|
|
## Key Components
|
|
|
|
### StatefulSet: `abaci-app`
|
|
- 3 replicas with stable network identities (pod-0, pod-1, pod-2)
|
|
- Pod-0 is always the primary (handles database writes)
|
|
- Other pods are replicas (receive replicated data via LiteFS)
|
|
|
|
### LiteFS
|
|
- Provides distributed SQLite with automatic replication
|
|
- Mounted via FUSE at `/litefs`
|
|
- Primary (pod-0) handles all writes
|
|
- Replicas maintain read-only copies for load distribution
|
|
- **Important:** LiteFS proxy's `fly-replay` header only works on Fly.io, not k8s
|
|
|
|
### Keel (Auto-Deployment)
|
|
- Watches `ghcr.io` for new images
|
|
- Polls every 2 minutes for `:latest` tag changes
|
|
- Automatically triggers rolling updates when new images are detected
|
|
- **No manual deployment steps required after pushing to main**
|
|
|
|
### Services
|
|
- **abaci-app**: ClusterIP service, load balances GET requests across all pods
|
|
- **abaci-app-primary**: Routes to pod-0 only (for POST/PUT/DELETE/PATCH)
|
|
- **abaci-app-headless**: Headless service for pod-to-pod DNS (LiteFS replication)
|
|
|
|
### Ingress & Write Routing
|
|
- Traefik ingress controller (included with k3s)
|
|
- SSL certificates via cert-manager + Let's Encrypt
|
|
- HSTS, rate limiting, and in-flight request limits
|
|
- **IngressRoute** routes write methods (POST/PUT/DELETE/PATCH) to primary service
|
|
- This is required because LiteFS proxy on replicas returns `fly-replay` header which k8s doesn't understand
|
|
|
|
## File Structure
|
|
|
|
```
|
|
infra/terraform/
|
|
├── main.tf # Providers and namespace
|
|
├── app.tf # Main app StatefulSet, Services, Ingress
|
|
├── keel.tf # Keel auto-deployment
|
|
├── redis.tf # Redis deployment for sessions/cache
|
|
├── cert-manager.tf # SSL certificate management
|
|
├── storage.tf # PVC for vision training data
|
|
├── variables.tf # Input variables
|
|
├── outputs.tf # Terraform outputs
|
|
├── versions.tf # Provider versions
|
|
├── .claude/
|
|
│ └── LITEFS_K8S.md # LiteFS troubleshooting guide
|
|
├── CLAUDE.md # Agent instructions
|
|
└── README.md # This file
|
|
```
|
|
|
|
## Deployment Workflow
|
|
|
|
### Automatic (Normal Flow)
|
|
|
|
1. **Push code to main** → GitHub Actions builds Docker image
|
|
2. **Image pushed to ghcr.io** with `:latest` tag
|
|
3. **Keel detects new image** (within 2 minutes)
|
|
4. **Rolling update triggered** automatically
|
|
|
|
### Manual Infrastructure Changes
|
|
|
|
When you modify Terraform files:
|
|
|
|
```bash
|
|
cd infra/terraform
|
|
terraform plan # Review changes
|
|
terraform apply # Apply changes
|
|
```
|
|
|
|
### Manual Pod Restart
|
|
|
|
To force an immediate rollout without waiting for Keel:
|
|
|
|
```bash
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci rollout restart statefulset abaci-app
|
|
```
|
|
|
|
## Common Operations
|
|
|
|
### Check Pod Status
|
|
```bash
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci get pods
|
|
```
|
|
|
|
### View Logs
|
|
```bash
|
|
# App logs
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci logs abaci-app-0 -f
|
|
|
|
# Keel logs (auto-deployment)
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n keel logs -l app=keel
|
|
```
|
|
|
|
### Check LiteFS Replication
|
|
```bash
|
|
# Primary should show "stream connected"
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci logs abaci-app-0 | grep stream
|
|
|
|
# Replicas should show "connected to cluster"
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci logs abaci-app-1 | grep connected
|
|
```
|
|
|
|
### Query Production Database
|
|
```bash
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci exec abaci-app-0 -- sqlite3 /litefs/sqlite.db "SELECT COUNT(*) FROM users"
|
|
```
|
|
|
|
### Scale Replicas
|
|
```bash
|
|
# Scale to 5 replicas
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci scale statefulset abaci-app --replicas=5
|
|
|
|
# Or update var.app_replicas in terraform.tfvars and apply
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### Pods Stuck in Pending
|
|
```bash
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci describe pod abaci-app-0
|
|
```
|
|
|
|
### LiteFS Cluster ID Mismatch
|
|
If replicas fail with "cannot stream from primary with a different cluster id":
|
|
|
|
```bash
|
|
# Scale to 1, delete replica PVC, scale back up
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci scale statefulset abaci-app --replicas=1
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci delete pvc litefs-data-abaci-app-1
|
|
kubectl --kubeconfig=~/.kube/k3s-config -n abaci scale statefulset abaci-app --replicas=3
|
|
```
|
|
|
|
### Keel Not Updating
|
|
1. Check Keel logs for errors
|
|
2. Verify annotations on StatefulSet: `keel.sh/policy=force`
|
|
3. Check if image digest actually changed in ghcr.io
|
|
|
|
## Environment Variables
|
|
|
|
| Variable | Description |
|
|
|----------|-------------|
|
|
| `NODE_ENV` | production |
|
|
| `PORT` | 3000 (internal, proxied through LiteFS at 8080) |
|
|
| `DATABASE_URL` | /litefs/sqlite.db |
|
|
| `REDIS_URL` | redis://redis:6379 |
|
|
| `AUTH_SECRET` | NextAuth.js secret (from terraform secret) |
|
|
|
|
## SSL/TLS
|
|
|
|
SSL is handled at **two levels**:
|
|
|
|
1. **NAS Traefik** (external entry point)
|
|
- Terminates SSL for all domains (abaci.one, status.abaci.one, etc.)
|
|
- Issues certs via Let's Encrypt (`certresolver: "myresolver"`)
|
|
- Config: `nas:/volume1/homes/antialias/projects/traefik/services.yaml`
|
|
|
|
2. **k3s Traefik** (internal)
|
|
- Receives traffic from NAS Traefik (passHostHeader)
|
|
- Handles internal routing and rate limiting
|
|
- Can optionally manage additional certs for internal services
|
|
|
|
## Adding New Subdomains
|
|
|
|
To add a new subdomain (e.g., `api.abaci.one`):
|
|
|
|
1. **Add DNS record** (via Porkbun)
|
|
```bash
|
|
# CNAME pointing to main domain
|
|
curl -X POST "https://api.porkbun.com/api/json/v3/dns/create/abaci.one" \
|
|
-d '{"name": "api", "type": "CNAME", "content": "abaci.one", ...}'
|
|
```
|
|
|
|
2. **Update NAS Traefik** (`services.yaml`)
|
|
```yaml
|
|
http:
|
|
routers:
|
|
api-k3s:
|
|
rule: "Host(`api.abaci.one`)"
|
|
service: abaci-k3s
|
|
entryPoints: ["websecure"]
|
|
tls:
|
|
certresolver: "myresolver"
|
|
api-k3s-http:
|
|
rule: "Host(`api.abaci.one`)"
|
|
service: abaci-k3s
|
|
entryPoints: ["web"]
|
|
middlewares: ["redirect-https"]
|
|
```
|
|
File location: `nas:/volume1/homes/antialias/projects/traefik/services.yaml`
|
|
Traefik auto-reloads this file.
|
|
|
|
3. **Add k3s Ingress** (in Terraform)
|
|
```hcl
|
|
resource "kubernetes_ingress_v1" "api" {
|
|
# ... standard ingress config
|
|
spec {
|
|
rule {
|
|
host = "api.abaci.one"
|
|
# ...
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
4. **Apply Terraform**
|
|
```bash
|
|
terraform apply
|
|
```
|