soroban-abacus-flashcards/infra/terraform
Thomas Hallock b04d0caeaf feat(infra): add OpenAI API key for LLM features
Add openai_api_key variable to terraform configuration for AI-powered
features like flowchart generation. The key is stored as a k8s secret
and exposed to pods as LLM_OPENAI_API_KEY environment variable.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:58:34 -06:00
.claude refactor(flowchart): remove legacy schema-specific formatting, add display.problem check 2026-01-22 11:03:15 -06:00
.gitignore
.terraform.lock.hcl feat(infra): add full k8s stack mirroring docker-compose setup 2026-01-21 11:33:49 -06:00
CLAUDE.md docs(infra): document LiteFS write routing for k8s deployments 2026-01-23 13:56:29 -06:00
README.md docs(infra): document LiteFS write routing for k8s deployments 2026-01-23 13:56:29 -06:00
app.tf feat(infra): add OpenAI API key for LLM features 2026-01-23 13:58:34 -06:00
cert-manager.tf feat(infra): add full k8s stack mirroring docker-compose setup 2026-01-21 11:33:49 -06:00
gatus.tf fix(infra): include paths in Gatus endpoint names 2026-01-23 12:56:30 -06:00
keel.tf feat(infra): add Keel for automatic k3s deployments 2026-01-23 12:11:19 -06:00
main.tf feat(infra): add full k8s stack mirroring docker-compose setup 2026-01-21 11:33:49 -06:00
outputs.tf feat(infra): add full k8s stack mirroring docker-compose setup 2026-01-21 11:33:49 -06:00
redis.tf feat(infra): add full k8s stack mirroring docker-compose setup 2026-01-21 11:33:49 -06:00
storage.tf refactor(flowchart): remove legacy schema-specific formatting, add display.problem check 2026-01-22 11:03:15 -06:00
terraform.tfvars.example feat(infra): add OpenAI API key for LLM features 2026-01-23 13:58:34 -06:00
variables.tf feat(infra): add OpenAI API key for LLM features 2026-01-23 13:58:34 -06:00
versions.tf feat(infra): add full k8s stack mirroring docker-compose setup 2026-01-21 11:33:49 -06:00

README.md

K3s Infrastructure for Abaci.one

This directory contains Terraform configuration for deploying the Abaci.one application to a k3s (lightweight Kubernetes) cluster.

Architecture Overview

                    Internet
                        │
                        ▼
            ┌───────────────────┐
            │  NAS Traefik      │  (Entry point, handles SSL for all domains)
            │  (Docker)         │  Config: /volume1/homes/antialias/projects/traefik/services.yaml
            │  - SSL/TLS via    │
            │    Let's Encrypt  │
            │  - Routes to k3s  │
            └─────────┬─────────┘
                      │
                      ▼ passHostHeader: true
            ┌───────────────────┐
            │  k3s Traefik      │  (Internal ingress controller)
            │  - Rate Limiting  │
            │  - HSTS           │
            │  - Path routing   │
            └─────────┬─────────┘
                      │
                      ▼
            ┌───────────────────┐
            │  abaci-app Service│  (ClusterIP, load-balances reads)
            └─────────┬─────────┘
                      │
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
   ┌─────────┐  ┌─────────┐  ┌─────────┐
   │ Pod-0   │  │ Pod-1   │  │ Pod-2   │
   │ PRIMARY │  │ REPLICA │  │ REPLICA │
   │         │  │         │  │         │
   │ LiteFS  │──│ LiteFS  │──│ LiteFS  │
   │ (FUSE)  │  │ (FUSE)  │  │ (FUSE)  │
   │         │  │         │  │         │
   │ Next.js │  │ Next.js │  │ Next.js │
   └────┬────┘  └────┬────┘  └────┬────┘
        │            │            │
        └────────────┴────────────┘
                     │
              ┌──────┴──────┐
              │    Redis    │
              └─────────────┘

Key Components

StatefulSet: abaci-app

  • 3 replicas with stable network identities (abaci-app-0, abaci-app-1, abaci-app-2, referred to here as pod-0, pod-1, pod-2)
  • Pod-0 is always the primary (handles database writes)
  • Other pods are replicas (receive replicated data via LiteFS)

LiteFS

  • Provides distributed SQLite with automatic replication
  • Mounted via FUSE at /litefs
  • Primary (pod-0) handles all writes
  • Replicas maintain read-only copies for load distribution
  • Important: the LiteFS proxy's fly-replay header is only honored on Fly.io, not on k8s

Keel (Auto-Deployment)

  • Watches ghcr.io for new images
  • Polls every 2 minutes for :latest tag changes
  • Automatically triggers rolling updates when new images are detected
  • No manual deployment steps required after pushing to main

Services

  • abaci-app: ClusterIP service, load balances GET requests across all pods
  • abaci-app-primary: Routes to pod-0 only (for POST/PUT/DELETE/PATCH)
  • abaci-app-headless: Headless service for pod-to-pod DNS (LiteFS replication)

Ingress & Write Routing

  • Traefik ingress controller (included with k3s)
  • SSL certificates via cert-manager + Let's Encrypt
  • HSTS, rate limiting, and in-flight request limits
  • IngressRoute routes write methods (POST/PUT/DELETE/PATCH) to primary service
  • This is required because the LiteFS proxy on a replica answers write requests with a fly-replay header, which only Fly.io acts on; on k8s the write would never reach the primary
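The routing rule above can be summarized in a small shell sketch. It is an illustration only: the actual routing is done by the Traefik IngressRoute, and the function name service_for_method is hypothetical.

```shell
# Illustration only: mirrors the method-based routing rule described above.
# service_for_method is a hypothetical helper, not part of this repo.
service_for_method() {
  case "$1" in
    POST|PUT|DELETE|PATCH) echo "abaci-app-primary" ;;  # writes: pod-0 only
    *)                     echo "abaci-app" ;;          # everything else (e.g. GET): any pod
  esac
}
```

For example, `service_for_method POST` prints abaci-app-primary and `service_for_method GET` prints abaci-app.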

File Structure

infra/terraform/
├── main.tf           # Providers and namespace
├── app.tf            # Main app StatefulSet, Services, Ingress
├── keel.tf           # Keel auto-deployment
├── gatus.tf          # Gatus endpoint monitoring
├── redis.tf          # Redis deployment for sessions/cache
├── cert-manager.tf   # SSL certificate management
├── storage.tf        # PVC for vision training data
├── variables.tf      # Input variables
├── outputs.tf        # Terraform outputs
├── versions.tf       # Provider versions
├── .claude/
│   └── LITEFS_K8S.md # LiteFS troubleshooting guide
├── CLAUDE.md         # Agent instructions
└── README.md         # This file

Deployment Workflow

Automatic (Normal Flow)

  1. Push code to main → GitHub Actions builds Docker image
  2. Image pushed to ghcr.io with :latest tag
  3. Keel detects new image (within 2 minutes)
  4. Rolling update triggered automatically
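
To confirm that step 4 has completed, the rollout can be watched from a workstation. This is a minimal sketch: the helper name wait_for_rollout and the KUBECONFIG_PATH variable are assumptions for illustration, and the kubeconfig location is taken from the other commands in this README.

```shell
# Assumed kubeconfig location ($HOME is used because ~ is not expanded inside quotes).
KUBECONFIG_PATH="${KUBECONFIG_PATH:-$HOME/.kube/k3s-config}"

# wait_for_rollout is a hypothetical helper name.
# Blocks until the StatefulSet rollout finishes or the timeout (default 5m) expires.
wait_for_rollout() {
  kubectl --kubeconfig="$KUBECONFIG_PATH" -n abaci \
    rollout status statefulset/abaci-app --timeout="${1:-5m}"
}
```

Calling `wait_for_rollout 10m` after a push returns a non-zero exit code if the rollout has not finished within ten minutes.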

Manual Infrastructure Changes

When you modify Terraform files:

cd infra/terraform
terraform plan        # Review changes
terraform apply       # Apply changes
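
One possible variation is to save the reviewed plan to a file and apply exactly that plan. The sketch below assumes a hypothetical helper name (plan_and_apply) and an arbitrary plan file name (tfplan); neither comes from this repo.

```shell
# plan_and_apply is a hypothetical helper name; tfplan is an arbitrary file name.
# Saving the plan means apply executes exactly what was reviewed.
plan_and_apply() {
  terraform plan -out=tfplan || return 1
  printf 'Apply this plan? [y/N] '
  read -r answer
  if [ "$answer" = "y" ]; then
    terraform apply tfplan
  else
    echo "Aborted."
  fi
}
```

A saved plan file can contain secret values (such as API keys passed as variables), so it should not be committed.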

Manual Pod Restart

To force an immediate rollout without waiting for Keel:

kubectl --kubeconfig=~/.kube/k3s-config -n abaci rollout restart statefulset abaci-app

Common Operations

Check Pod Status

kubectl --kubeconfig=~/.kube/k3s-config -n abaci get pods

View Logs

# App logs
kubectl --kubeconfig=~/.kube/k3s-config -n abaci logs abaci-app-0 -f

# Keel logs (auto-deployment)
kubectl --kubeconfig=~/.kube/k3s-config -n keel logs -l app=keel

Check LiteFS Replication

# Primary should show "stream connected"
kubectl --kubeconfig=~/.kube/k3s-config -n abaci logs abaci-app-0 | grep stream

# Replicas should show "connected to cluster"
kubectl --kubeconfig=~/.kube/k3s-config -n abaci logs abaci-app-1 | grep connected
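
The two checks above can be combined into one pass over all pods. This is a sketch under stated assumptions: three replicas named abaci-app-0 through abaci-app-2, the log phrases quoted above, and hypothetical helper names (pod_log_has, check_litefs).

```shell
KUBECONFIG_PATH="${KUBECONFIG_PATH:-$HOME/.kube/k3s-config}"

# pod_log_has (hypothetical): succeeds if the log of pod $1 contains the text $2.
pod_log_has() {
  kubectl --kubeconfig="$KUBECONFIG_PATH" -n abaci logs "$1" | grep -q -- "$2"
}

# check_litefs (hypothetical): assumes 3 replicas, abaci-app-0 being the primary.
check_litefs() {
  pod_log_has abaci-app-0 "stream connected" \
    || { echo "abaci-app-0: no stream connected"; return 1; }
  for pod in abaci-app-1 abaci-app-2; do
    pod_log_has "$pod" "connected to cluster" \
      || { echo "$pod: not connected to cluster"; return 1; }
  done
  echo "LiteFS replication looks healthy"
}
```

Because the check only searches the current log, a pod that restarted recently may need a moment before the expected lines appear.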

Query Production Database

kubectl --kubeconfig=~/.kube/k3s-config -n abaci exec abaci-app-0 -- sqlite3 /litefs/sqlite.db "SELECT COUNT(*) FROM users"

Scale Replicas

# Scale to 5 replicas
kubectl --kubeconfig=~/.kube/k3s-config -n abaci scale statefulset abaci-app --replicas=5

# Or, to make the change persistent, set app_replicas in terraform.tfvars and run terraform apply

Troubleshooting

Pods Stuck in Pending

kubectl --kubeconfig=~/.kube/k3s-config -n abaci describe pod abaci-app-0

LiteFS Cluster ID Mismatch

If replicas fail with "cannot stream from primary with a different cluster id":

# Scale to 1, delete replica PVC, scale back up
kubectl --kubeconfig=~/.kube/k3s-config -n abaci scale statefulset abaci-app --replicas=1
kubectl --kubeconfig=~/.kube/k3s-config -n abaci delete pvc litefs-data-abaci-app-1
kubectl --kubeconfig=~/.kube/k3s-config -n abaci scale statefulset abaci-app --replicas=3
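
The same recovery can be parameterized by replica ordinal. The sketch below is an assumption-laden variation, not the project's official procedure: the helper names (abaci_kubectl, reset_replica) are hypothetical, the PVC naming follows the litefs-data-abaci-app-N pattern shown above, and the final replica count is hard-coded to 3. It is destructive, because it deletes the replica's local LiteFS data.

```shell
KUBECONFIG_PATH="${KUBECONFIG_PATH:-$HOME/.kube/k3s-config}"

# abaci_kubectl (hypothetical): kubectl scoped to the abaci namespace.
abaci_kubectl() { kubectl --kubeconfig="$KUBECONFIG_PATH" -n abaci "$@"; }

# reset_replica (hypothetical): $1 is the replica ordinal (1, 2, ...).
# DESTRUCTIVE: deletes that replica's PVC so it re-syncs from the primary.
reset_replica() {
  ordinal="$1"
  case "$ordinal" in
    ''|0|*[!0-9]*) echo "refusing: ordinal must be a replica number >= 1" >&2; return 1 ;;
  esac
  abaci_kubectl scale statefulset abaci-app --replicas=1 || return 1
  # Wait for the replica pod to terminate so its PVC is no longer in use.
  # A failure here may only mean the pod is already gone, so it is not fatal.
  abaci_kubectl wait --for=delete "pod/abaci-app-$ordinal" --timeout=120s || true
  abaci_kubectl delete pvc "litefs-data-abaci-app-$ordinal" || return 1
  abaci_kubectl scale statefulset abaci-app --replicas=3
}
```

For example, `reset_replica 2` resets abaci-app-2. Ordinal 0 is rejected so that the primary's data is never deleted by this helper.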

Keel Not Updating

  1. Check Keel logs for errors
  2. Verify annotations on StatefulSet: keel.sh/policy=force
  3. Check if image digest actually changed in ghcr.io
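
Step 2 can be scripted roughly as follows. The helper name keel_policy_is_force is hypothetical, and the pattern tolerates both the JSON and the map[...] forms that different kubectl versions print for an annotations map.

```shell
KUBECONFIG_PATH="${KUBECONFIG_PATH:-$HOME/.kube/k3s-config}"

# keel_policy_is_force (hypothetical): succeeds if the StatefulSet carries
# the annotation keel.sh/policy=force.
keel_policy_is_force() {
  kubectl --kubeconfig="$KUBECONFIG_PATH" -n abaci \
    get statefulset abaci-app -o jsonpath='{.metadata.annotations}' \
    | grep -Eq 'keel\.sh/policy"?:"?force'
}
```

A typical use is `keel_policy_is_force || echo "keel.sh/policy=force is missing"`.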

Environment Variables

Variable       Description
NODE_ENV       production
PORT           3000 (internal, proxied through LiteFS at 8080)
DATABASE_URL   /litefs/sqlite.db
REDIS_URL      redis://redis:6379
AUTH_SECRET    NextAuth.js secret (from terraform secret)
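
To check that a running pod actually has these variables, something like the following sketch can help. It assumes the container image provides the env command, and the helper name missing_env is hypothetical. Only variable names are printed, never values, so secrets do not end up in a terminal log.

```shell
KUBECONFIG_PATH="${KUBECONFIG_PATH:-$HOME/.kube/k3s-config}"

# missing_env (hypothetical): prints the names of expected variables that are
# not set in the pod ($1, default abaci-app-0). Prints nothing if all are set.
missing_env() {
  pod="${1:-abaci-app-0}"
  pod_env=$(kubectl --kubeconfig="$KUBECONFIG_PATH" -n abaci exec "$pod" -- env) || return 1
  for name in NODE_ENV PORT DATABASE_URL REDIS_URL AUTH_SECRET; do
    printf '%s\n' "$pod_env" | grep -q "^${name}=" || echo "$name"
  done
}
```

Running `missing_env abaci-app-1` checks a replica instead of the primary.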

SSL/TLS

SSL is handled at two levels:

  1. NAS Traefik (external entry point)

    • Terminates SSL for all domains (abaci.one, status.abaci.one, etc.)
    • Issues certs via Let's Encrypt (certresolver: "myresolver")
    • Config: nas:/volume1/homes/antialias/projects/traefik/services.yaml
  2. k3s Traefik (internal)

    • Receives traffic from NAS Traefik (passHostHeader)
    • Handles internal routing and rate limiting
    • Can optionally manage additional certs for internal services

Adding New Subdomains

To add a new subdomain (e.g., api.abaci.one):

  1. Add DNS record (via Porkbun)

    # CNAME pointing to main domain
    curl -X POST "https://api.porkbun.com/api/json/v3/dns/create/abaci.one" \
      -d '{"name": "api", "type": "CNAME", "content": "abaci.one", ...}'
    
  2. Update NAS Traefik (services.yaml)

    http:
      routers:
        api-k3s:
          rule: "Host(`api.abaci.one`)"
          service: abaci-k3s
          entryPoints: ["websecure"]
          tls:
            certresolver: "myresolver"
        api-k3s-http:
          rule: "Host(`api.abaci.one`)"
          service: abaci-k3s
          entryPoints: ["web"]
          middlewares: ["redirect-https"]
    

    File location: nas:/volume1/homes/antialias/projects/traefik/services.yaml
    Traefik auto-reloads this file.

  3. Add k3s Ingress (in Terraform)

    resource "kubernetes_ingress_v1" "api" {
      # ... standard ingress config
      spec {
        rule {
          host = "api.abaci.one"
          # ...
        }
      }
    }
    
  4. Apply Terraform

    terraform apply
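
Once all four steps are done, the new subdomain can be smoke-tested from outside the cluster. This is a sketch only: check_subdomain is a hypothetical helper name, and the expected behavior (HTTP redirecting to HTTPS) is inferred from the redirect-https middleware in the router example above.

```shell
# check_subdomain (hypothetical): $1 is the host, e.g. api.abaci.one.
check_subdomain() {
  host="$1"
  # Plain HTTP should answer with a redirect to HTTPS (redirect-https middleware).
  printf 'http:  '
  curl -sS -o /dev/null -w '%{http_code} -> %{redirect_url}\n' "http://$host/"
  # HTTPS needs a valid certificate; curl reports an error here if it is not valid.
  printf 'https: '
  curl -sS -o /dev/null -w '%{http_code}\n' "https://$host/"
}
```

For example, `check_subdomain api.abaci.one` prints one status line for HTTP and one for HTTPS. A certificate error on the second line may simply mean Let's Encrypt has not issued the certificate yet.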