Commit Graph

23 Commits

Author SHA1 Message Date
Thomas Hallock 9c09851b44 fix(smoke-tests): add imagePullPolicy Always to CronJob
Ensures the latest smoke tests image is always pulled, avoiding
stale cached images when updates are pushed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 06:24:40 -06:00
Thomas Hallock affad2f4a6 feat(monitoring): add E2E smoke tests with Gatus integration
Add Playwright-based smoke tests that run every 15 minutes via k8s CronJob,
with results exposed to Gatus for status.abaci.one monitoring.

- Add smoke_test_runs table for storing test results
- Add /api/smoke-test-status endpoint (Gatus checks this)
- Add /api/smoke-test-results endpoint (CronJob reports here)
- Add smoke tests for homepage, arcade, practice, and flowchart pages
- Add smoke-test-runner.ts script
- Add Dockerfile.smoke-tests based on Playwright image
- Add GitHub Actions workflow to build smoke tests image
- Add Kubernetes CronJob Terraform config
- Update Gatus config with Browser Smoke Tests endpoint

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 05:08:50 -06:00
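
The status check in affad2f4a6 could work roughly as sketched below. This is a hypothetical illustration, not the repository's code: the `SmokeTestRun` shape, the `smokeTestStatus` function, and the 30-minute staleness window (twice the 15-minute CronJob interval) are all assumptions.

```typescript
// Hypothetical row shape for the smoke_test_runs table (field names assumed).
interface SmokeTestRun {
  finishedAt: Date;
  passed: boolean;
}

interface SmokeTestStatus {
  healthy: boolean;
  reason: string;
}

// Decide what a status endpoint might report to a monitor such as Gatus.
// A run older than maxAgeMinutes counts as unhealthy, so a CronJob that
// silently stops running is surfaced as well as a failing test.
function smokeTestStatus(
  latest: SmokeTestRun | undefined,
  now: Date,
  maxAgeMinutes: number = 30, // assumed: 2x the 15-minute schedule
): SmokeTestStatus {
  if (!latest) {
    return { healthy: false, reason: "no runs recorded" };
  }
  const ageMinutes = (now.getTime() - latest.finishedAt.getTime()) / 60000;
  if (ageMinutes > maxAgeMinutes) {
    return { healthy: false, reason: "latest run is stale" };
  }
  return latest.passed
    ? { healthy: true, reason: "latest run passed" }
    : { healthy: false, reason: "latest run failed" };
}
```

Treating a stale run as unhealthy is a design choice of this sketch. The commit message does not say how the real endpoint handles missing or old results.
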
Thomas Hallock 1e43ec18f3 fix(infra): configure Keel to watch all namespaces
Add watchAllNamespaces=true to the Keel Helm config so it monitors
workloads in the abaci namespace (not just the keel namespace).

Update documentation to clarify that Keel annotations must be on
the workload metadata, not the pod template.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 18:45:04 -06:00
Thomas Hallock 747bc4a5f0 fix(infra): move Keel annotations to StatefulSet metadata
Keel reads annotations from the workload's metadata, not the pod template.
Moving annotations from spec.template.metadata to metadata fixes auto-updates.

Also:
- Set NAMESPACE="" on Keel deployment to watch all namespaces
- Keep ghcr credentials config (optional, for private registries)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 18:44:38 -06:00
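
The annotation placement fixed in 747bc4a5f0 can be illustrated with a small check. This is a sketch under assumptions: the `Workload` type is a trimmed-down stand-in for a StatefulSet manifest, `keelAnnotationsVisible` is a hypothetical helper, and the annotation values are illustrative rather than copied from the repository.

```typescript
type Annotations = Record<string, string>;

// Minimal stand-in for a StatefulSet manifest (only the fields needed here).
interface Workload {
  metadata: { name: string; annotations?: Annotations };
  spec: { template: { metadata: { annotations?: Annotations } } };
}

// Per the commit message, Keel reads annotations from the workload's own
// metadata, so annotations on spec.template.metadata are not considered.
function keelAnnotationsVisible(workload: Workload): boolean {
  const keys = Object.keys(workload.metadata.annotations ?? {});
  return keys.some((key) => key.startsWith("keel.sh/"));
}

// Illustrative annotation values (assumed, not taken from the repo).
const keelAnnotations: Annotations = {
  "keel.sh/policy": "force",
  "keel.sh/trigger": "poll",
};

const misplaced: Workload = {
  metadata: { name: "abaci-app" },
  spec: { template: { metadata: { annotations: keelAnnotations } } },
};

const corrected: Workload = {
  metadata: { name: "abaci-app", annotations: keelAnnotations },
  spec: { template: { metadata: {} } },
};

console.log(keelAnnotationsVisible(misplaced)); // false
console.log(keelAnnotationsVisible(corrected)); // true
```
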
Thomas Hallock c1809d72ae feat(infra): add ghcr.io registry credentials for Keel polling
Keel needs to authenticate with ghcr.io to poll for new image digests
(ghcr.io requires auth for manifest API even on public images).

- Add ghcr_token and ghcr_username variables
- Create docker-registry secret for ghcr.io
- Add imagePullSecrets to StatefulSet (Keel reads these for auth)
- Document the setup in keel.tf

To enable auto-updates:
1. Create GitHub PAT with read:packages scope
2. Set ghcr_token in terraform.tfvars
3. terraform apply

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 15:56:26 -06:00
Thomas Hallock b04d0caeaf feat(infra): add OpenAI API key for LLM features
Add openai_api_key variable to terraform configuration for AI-powered
features like flowchart generation. The key is stored as a k8s secret
and exposed to pods as LLM_OPENAI_API_KEY environment variable.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:58:34 -06:00
Thomas Hallock c80eefa5e3 docs(infra): document LiteFS write routing for k8s deployments
- Explain why LiteFS proxy fly-replay doesn't work outside Fly.io
- Document the primary service and IngressRoute solution
- Add troubleshooting symptoms for broken write routing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:56:29 -06:00
Thomas Hallock 6f76ce61df fix(infra): route write requests to primary pod for LiteFS compatibility
LiteFS proxy on replica pods returns fly-replay header expecting Fly.io's
infrastructure to re-route requests to the primary. Since we're on k8s,
Traefik doesn't understand this header and returns empty responses.

Solution:
- Add abaci-app-primary service targeting only pod-0 (the LiteFS primary)
- Add Traefik IngressRoute matching POST/PUT/DELETE/PATCH methods
- Route these write requests directly to the primary service
- GET requests still load-balance across all replicas for reads

This fixes the intermittent empty PDF responses where ~60-80% of POST
requests were failing due to hitting replica pods.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:55:24 -06:00
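
The routing rule from 6f76ce61df amounts to a choice by HTTP method. The sketch below is only a model of that decision, not the Traefik IngressRoute itself. `abaci-app-primary` comes from the commit message, while `abaci-app` is an assumed name for the service that load-balances across all replicas.

```typescript
// Methods the commit message lists as write requests.
const WRITE_METHODS = new Set(["POST", "PUT", "DELETE", "PATCH"]);

// Pick the backing service for a request. Writes go to the service that
// targets only pod-0 (the LiteFS primary); everything else is spread
// across all replicas. "abaci-app" is an assumed service name.
function pickService(method: string): string {
  return WRITE_METHODS.has(method.toUpperCase())
    ? "abaci-app-primary"
    : "abaci-app";
}

console.log(pickService("POST")); // abaci-app-primary
console.log(pickService("GET")); // abaci-app
```

Routing at the ingress avoids relying on the fly-replay header, which only Fly.io's own infrastructure acts on.
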
Thomas Hallock f916358614 fix(infra): include paths in Gatus endpoint names
Gatus UI only shows hostnames, not full URLs. Include the path
directly in the endpoint name for clarity.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:56:30 -06:00
Thomas Hallock c4d4ca7122 feat(infra): improve Gatus status page with clearer endpoint groups
- Organize endpoints into logical groups: Website, Arcade, Worksheets, Flowcharts, Core API, Infrastructure
- Add hide-url: false to show actual URLs on status page
- Use user-friendly names like "Games Hub", "Worksheet Builder", "Flashcard Generator"
- Remove confusing internal service endpoints
- Check database and Redis via infrastructure group

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:51:27 -06:00
Thomas Hallock ba4d2d7f7d docs(infra): document NAS Traefik routing and subdomain setup
- Update architecture diagram to show NAS Traefik as entry point
- Add "Adding New Subdomains" guide with DNS, NAS Traefik, and k3s steps
- Document network architecture in CLAUDE.md for agents
- Note services.yaml location on NAS

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:43:22 -06:00
Thomas Hallock dda5485408 feat(infra): add Gatus status page at status.abaci.one
- Gatus deployment monitoring homepage, health API, Redis, DB
- Simplified ingress (plain HTTP; NAS Traefik handles SSL)
- Updated NAS Traefik services.yaml with status subdomain routes

Access: https://status.abaci.one

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:42:41 -06:00
Thomas Hallock f8d1ec730c feat(infra): add Gatus status page at status.abaci.one
- Gatus deployment with SQLite persistence
- ConfigMap with endpoint monitors (homepage, health API, Redis, DB)
- Ingress with SSL via cert-manager
- DNS CNAME record already configured

Deploy with: terraform apply

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:17:24 -06:00
Thomas Hallock ee26b1e361 feat(infra): add Keel for automatic k3s deployments
- Add Keel helm release that polls ghcr.io every 2 minutes
- Add keel.sh annotations to app StatefulSet for auto-updates
- Create comprehensive README.md documenting k3s architecture
- Update CLAUDE.md with automatic deployment workflow

After terraform apply, deployments are fully automatic:
push to main → build → Keel detects new image → rolling update

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:11:19 -06:00
Thomas Hallock 2f82bc28ec feat(infra): scale to 3 app replicas for better load distribution
Pod-0 remains LiteFS primary (handles writes), pod-1 and pod-2 are
replicas that serve reads and forward writes to primary.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 10:16:00 -06:00
Thomas Hallock 0abed6ae55 feat(infra): add performance remediation for k8s deployment
- Increase resource limits: 1Gi memory, 2 CPU cores per pod
- Tune health probes: 10s timeout, 5 failures (75s grace period)
- Add Traefik rate limiting: 50 req/sec avg, 100 burst
- Add in-flight request limiting: max 100 concurrent connections

Fixes pod crashes under moderate load (50+ concurrent connections).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 10:10:38 -06:00
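
The "50 req/sec avg, 100 burst" limit in 0abed6ae55 can be modelled as a token bucket, a common way to express an average rate with a burst allowance. The `TokenBucket` class below is a hypothetical illustration of that behaviour with explicit timestamps. It is not Traefik's code or configuration.

```typescript
// Token bucket: refills at ratePerSec and holds at most `burst` tokens.
// Each allowed request consumes one token.
class TokenBucket {
  private tokens: number;
  private lastMs: number;
  private readonly ratePerSec: number;
  private readonly burst: number;

  constructor(ratePerSec: number, burst: number, startMs: number = 0) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
    this.tokens = burst; // start full, so an initial burst is allowed
    this.lastMs = startMs;
  }

  // Returns true if a request arriving at nowMs is allowed.
  allow(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastMs = Math.max(this.lastMs, nowMs);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// With the numbers from the commit: 100 requests can pass at once, after
// which requests are admitted at roughly 50 per second.
const limiter = new TokenBucket(50, 100);
```
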
Thomas Hallock 6c51182c15 refactor(flowchart): remove legacy schema-specific formatting, add display.problem check
- Remove legacy schema-specific formatting fallbacks in formatting.ts and example-generator.ts
- All flowcharts now require explicit display.problem and display.answer expressions
- Add DISP-003 diagnostic for missing display.problem expressions
- Update doctor to treat missing display.answer as error (was warning)

Also includes:
- Terraform: generate LiteFS config at runtime, add AUTH_TRUST_HOST, add volume mounts for vision-training and uploads data
- Terraform: add storage.tf for persistent volume claims
- Add Claude instructions for terraform directory
- Various UI component formatting updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 11:03:15 -06:00
Thomas Hallock 2765b081bc fix(litefs): simplify candidate env var and add debug logging
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 16:50:02 -06:00
Thomas Hallock 42f55855eb fix(litefs): remove HOSTNAME env var to allow pod hostname detection
LiteFS needs the actual pod hostname for cluster communication,
but HOSTNAME=0.0.0.0 was being set in both the Dockerfile and
ConfigMap, overriding the pod's hostname.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 14:38:30 -06:00
Thomas Hallock e69a33838a feat(infra): add LiteFS for distributed SQLite in k8s
- Add LiteFS binary and config to Docker image for SQLite replication
- Convert k8s Deployment to StatefulSet for stable pod identities
- Pod-0 is primary (handles writes), others are replicas
- LiteFS proxy forwards write requests to primary automatically
- Add headless service for pod-to-pod communication
- Increase Node.js heap size to 4GB for Next.js build
- Exclude large Python venvs from Docker context

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 13:19:37 -06:00
Thomas Hallock c16b70090f feat(infra): add full k8s stack mirroring docker-compose setup
Terraform now deploys a complete k8s environment:
- cert-manager with Let's Encrypt (staging + prod issuers)
- Redis deployment with persistent storage
- App deployment (2 replicas, rolling updates)
- Traefik ingress with SSL, HSTS, HTTP→HTTPS redirect

Ready for switchover by forwarding ports 80/443 to k3s VM.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 11:33:49 -06:00
Thomas Hallock 38e289f626 chore(infra): add terraform lock file for reproducible builds
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 11:04:51 -06:00
Thomas Hallock 1cac633814 feat(infra): add initial Terraform config for k3s cluster
Set up Terraform to manage k3s resources on the NAS VM:
- Kubernetes and Helm providers configured
- Created 'abaci' namespace for workloads
- Ready for BullMQ workers and future scalable services

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 11:04:07 -06:00