Commit Graph

32 Commits

Author SHA1 Message Date
Thomas Hallock db1ca7fa7a feat(infra): add Gitea with Actions and Storybook deployment
Deploy Storybook / Build and Deploy Storybook (push): failing after 7s
- Add self-hosted Gitea server at git.dev.abaci.one
- Configure Gitea Actions runner with Docker-in-Docker
- Set up push mirror to GitHub for backup
- Add Storybook deployment workflow to dev.abaci.one/storybook/
- Update nginx config to serve Storybook from local storage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 19:53:12 -06:00
Thomas Hallock 97313618ae feat(dev): add redirect from /storybook/ to GitHub Pages 2026-01-24 16:59:41 -06:00
Thomas Hallock c1475e0306 feat(dev): add dev portal index page for dev.abaci.one
Creates a landing page that links to all dev resources:

Testing & QA:
- /smoke-reports/ - Playwright E2E test results
- /storybook/ - Component library (coming soon)
- /coverage/ - Test coverage reports (coming soon)

Monitoring:
- grafana.dev.abaci.one - Dashboards
- prometheus.dev.abaci.one - Metrics
- status.abaci.one - Uptime monitoring

Quick links to production app and GitHub repo.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:48:27 -06:00
Thomas Hallock 8362db4572 fix(keel): resolve DNS lookup failures with k3s CoreDNS
Go's pure-Go DNS resolver has incompatibilities with k3s's CoreDNS that
cause intermittent "server misbehaving" errors after the initial lookup.
This prevented Keel from polling ghcr.io for new image digests.

Setting GODEBUG=netdns=cgo forces Go to use its cgo-based resolver, which
delegates lookups to the system's libc resolver and works correctly with k3s.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:07:43 -06:00
Thomas Hallock 74e12c0029 feat(metrics): add session tracking and Grafana dashboard provisioning
- Add heartbeat-based session tracking (/api/heartbeat)
- Track active sessions, session duration, page views, unique visitors
- Use Page Visibility API to only send heartbeats when tab visible
- Add Grafana dashboard via ConfigMap provisioning
- Dashboard includes: sessions, Socket.IO, request rate, error rate,
  arcade games, worksheets, memory, event loop lag

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 14:40:55 -06:00
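The heartbeat-based tracking described in 74e12c0029 could look roughly like the sketch below. It is not the code from that commit: the class, its method names, and the 60-second timeout are assumptions made for illustration. The idea is that a session counts as active while its most recent heartbeat is newer than the timeout, and its duration is the span between its first and last heartbeat.

```typescript
// Hypothetical in-memory tracker. The timeout value and every name here
// are assumptions; the real /api/heartbeat handler may differ.
const SESSION_TIMEOUT_MS = 60000;

interface SessionInfo {
  firstSeen: number; // ms timestamp of the first heartbeat
  lastSeen: number; // ms timestamp of the latest heartbeat
  pageViews: number;
}

class SessionTracker {
  private sessions = new Map<string, SessionInfo>();

  // Record one heartbeat. `isNewPage` marks a heartbeat sent after navigation.
  heartbeat(sessionId: string, now: number, isNewPage: boolean = false): void {
    const existing = this.sessions.get(sessionId);
    if (existing === undefined) {
      this.sessions.set(sessionId, { firstSeen: now, lastSeen: now, pageViews: 1 });
      return;
    }
    existing.lastSeen = now;
    if (isNewPage) existing.pageViews += 1;
  }

  // Number of sessions whose last heartbeat falls inside the timeout window.
  activeCount(now: number): number {
    let count = 0;
    this.sessions.forEach((s) => {
      if (now - s.lastSeen <= SESSION_TIMEOUT_MS) count += 1;
    });
    return count;
  }

  // Time between the first and last heartbeat, or undefined if unknown.
  durationMs(sessionId: string): number | undefined {
    const s = this.sessions.get(sessionId);
    return s === undefined ? undefined : s.lastSeen - s.firstSeen;
  }
}
```

Because the client only sends heartbeats while the tab is visible, a hidden tab simply ages out of the active count once the timeout passes.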
Thomas Hallock f1223bb81b fix(monitoring): use /api/metrics path for ServiceMonitor
The Next.js API route is at /api/metrics, not /metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 13:15:46 -06:00
Thomas Hallock 35856afb2e feat(observability): add Prometheus/Grafana monitoring stack
- Deploy kube-prometheus-stack to k3s cluster via Terraform
- Add Prometheus metrics endpoint (/api/metrics) using prom-client
- Track Socket.IO connections, HTTP requests, and Node.js runtime
- Configure ServiceMonitor for auto-discovery by Prometheus
- Expose Grafana at grafana.dev.abaci.one
- Expose Prometheus at prometheus.dev.abaci.one

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:45:32 -06:00
Thomas Hallock 3c0df8099c fix(dev-artifacts): use correct NFS path under data directory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:18:56 -06:00
Thomas Hallock 5258437bef feat(dev): add dev.abaci.one for build artifacts
- Add nginx static server at dev.abaci.one for serving:
  - Playwright HTML reports at /smoke-reports/
  - Storybook (future) at /storybook/
  - Coverage reports (future) at /coverage/

- NFS-backed PVC shared between artifact producers and nginx
- Smoke tests now save HTML reports with automatic cleanup (keeps 20)
- Reports accessible at dev.abaci.one/smoke-reports/latest/

Infrastructure:
- infra/terraform/dev-artifacts.tf: nginx deployment, PVC, ingress
- Updated smoke-tests.tf to mount shared PVC
- Updated smoke-test-runner.ts to generate and save HTML reports

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 09:52:26 -06:00
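The "keeps 20" cleanup rule from 5258437bef can be sketched as a pure function. This is an illustration only: the function name is hypothetical, and it assumes report directory names sort chronologically (for example, timestamp-based names) and that special entries such as `latest` are filtered out by the caller.

```typescript
// Hypothetical helper: returns the report names that fall outside the
// retention window. Assumes lexicographic order equals chronological order.
function reportsToDelete(names: string[], keep: number = 20): string[] {
  const newestFirst = names.slice().sort().reverse();
  return newestFirst.slice(keep); // everything after the `keep` newest
}
```

Keeping the selection separate from the actual file deletion makes the retention rule easy to check without touching the NFS volume.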
Thomas Hallock 9c09851b44 fix(smoke-tests): add imagePullPolicy Always to CronJob
Ensures the latest smoke tests image is always pulled, avoiding
stale cached images when updates are pushed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 06:24:40 -06:00
Thomas Hallock affad2f4a6 feat(monitoring): add E2E smoke tests with Gatus integration
Add Playwright-based smoke tests that run every 15 minutes via k8s CronJob,
with results exposed to Gatus for status.abaci.one monitoring.

- Add smoke_test_runs table for storing test results
- Add /api/smoke-test-status endpoint (Gatus checks this)
- Add /api/smoke-test-results endpoint (CronJob reports here)
- Add smoke tests for homepage, arcade, practice, and flowchart pages
- Add smoke-test-runner.ts script
- Add Dockerfile.smoke-tests based on Playwright image
- Add GitHub Actions workflow to build smoke tests image
- Add Kubernetes CronJob Terraform config
- Update Gatus config with Browser Smoke Tests endpoint

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 05:08:50 -06:00
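One way the status endpoint from affad2f4a6 might summarize results for Gatus is sketched below. The commit does not describe the response logic, so everything here is an assumption: the type, the function name, the three status values, and the 30-minute staleness window (two missed 15-minute runs).

```typescript
// Hypothetical shape of one stored run; the real smoke_test_runs
// schema is not shown in the commit message.
interface SmokeTestRun {
  finishedAt: number; // ms timestamp
  passed: boolean;
}

type SmokeStatus = "healthy" | "failing" | "stale";

// Report "stale" when no recent run exists, so a dead CronJob is
// not mistaken for a passing test suite.
function smokeTestStatus(
  latest: SmokeTestRun | undefined,
  now: number,
  maxAgeMs: number = 30 * 60 * 1000
): SmokeStatus {
  if (latest === undefined || now - latest.finishedAt > maxAgeMs) return "stale";
  return latest.passed ? "healthy" : "failing";
}
```

Treating a missing or old run as its own state means the monitor can alert when the CronJob stops reporting, not only when a test fails.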
Thomas Hallock 1e43ec18f3 fix(infra): configure Keel to watch all namespaces
Add watchAllNamespaces=true to Keel helm config so it monitors
workloads in the abaci namespace (not just keel namespace).

Update documentation to clarify that Keel annotations must be on
the workload metadata, not the pod template.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 18:45:04 -06:00
Thomas Hallock 747bc4a5f0 fix(infra): move Keel annotations to StatefulSet metadata
Keel reads annotations from the workload's metadata, not the pod template.
Moving annotations from spec.template.metadata to metadata fixes auto-updates.

Also:
- Set NAMESPACE="" on Keel deployment to watch all namespaces
- Keep ghcr credentials config (optional, for private registries)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 18:44:38 -06:00
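The annotation placement fixed in 747bc4a5f0 is easier to see in a manifest-shaped object. The sketch below is illustrative: the StatefulSet name, the label, and the choice of `force` policy are assumptions, while the annotation keys are standard Keel polling annotations and the 2-minute schedule matches the interval mentioned in ee26b1e361.

```typescript
// Hypothetical manifest fragment. Names and policy are assumptions.
const statefulSet = {
  apiVersion: "apps/v1",
  kind: "StatefulSet",
  metadata: {
    name: "abaci-app",
    namespace: "abaci",
    // Keel reads annotations here, on the workload itself.
    annotations: {
      "keel.sh/policy": "force",
      "keel.sh/trigger": "poll",
      "keel.sh/pollSchedule": "@every 2m",
    } as Record<string, string>,
  },
  spec: {
    template: {
      // Pod template metadata: Keel annotations placed here are ignored.
      metadata: { labels: { app: "abaci-app" } },
    },
  },
};
```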
Thomas Hallock c1809d72ae feat(infra): add ghcr.io registry credentials for Keel polling
Keel needs to authenticate with ghcr.io to poll for new image digests
(ghcr.io requires auth for manifest API even on public images).

- Add ghcr_token and ghcr_username variables
- Create docker-registry secret for ghcr.io
- Add imagePullSecrets to StatefulSet (Keel reads these for auth)
- Document the setup in keel.tf

To enable auto-updates:
1. Create GitHub PAT with read:packages scope
2. Set ghcr_token in terraform.tfvars
3. terraform apply

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 15:56:26 -06:00
Thomas Hallock b04d0caeaf feat(infra): add OpenAI API key for LLM features
Add openai_api_key variable to terraform configuration for AI-powered
features like flowchart generation. The key is stored as a k8s secret
and exposed to pods as LLM_OPENAI_API_KEY environment variable.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:58:34 -06:00
Thomas Hallock c80eefa5e3 docs(infra): document LiteFS write routing for k8s deployments
- Explain why LiteFS proxy fly-replay doesn't work outside Fly.io
- Document the primary service and IngressRoute solution
- Add troubleshooting symptoms for broken write routing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:56:29 -06:00
Thomas Hallock 6f76ce61df fix(infra): route write requests to primary pod for LiteFS compatibility
LiteFS proxy on replica pods returns fly-replay header expecting Fly.io's
infrastructure to re-route requests to the primary. Since we're on k8s,
Traefik doesn't understand this header and returns empty responses.

Solution:
- Add abaci-app-primary service targeting only pod-0 (the LiteFS primary)
- Add Traefik IngressRoute matching POST/PUT/DELETE/PATCH methods
- Route these write requests directly to the primary service
- GET requests still load-balance across all replicas for reads

This fixes the intermittent empty PDF responses where ~60-80% of POST
requests were failing due to hitting replica pods.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:55:24 -06:00
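The routing decision in 6f76ce61df amounts to a method check. The sketch below expresses it as a plain function rather than Traefik configuration; `abaci-app-primary` comes from the commit message, while the function name and the `abaci-app` name for the load-balanced service are assumptions.

```typescript
// Methods treated as writes, as listed in the commit message.
const WRITE_METHODS = ["POST", "PUT", "DELETE", "PATCH"];

type Backend = "abaci-app-primary" | "abaci-app";

// Writes go to the service that targets only pod-0 (the LiteFS primary);
// everything else is load-balanced across all replicas.
function pickBackend(method: string): Backend {
  const isWrite = WRITE_METHODS.indexOf(method.toUpperCase()) !== -1;
  return isWrite ? "abaci-app-primary" : "abaci-app";
}
```

With three pods and evenly balanced traffic, about two in three POST requests would have landed on a replica before this change, which is consistent with the reported 60-80% failure rate.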
Thomas Hallock f916358614 fix(infra): include paths in Gatus endpoint names
Gatus UI only shows hostnames, not full URLs. Include the path
directly in the endpoint name for clarity.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:56:30 -06:00
Thomas Hallock c4d4ca7122 feat(infra): improve Gatus status page with clearer endpoint groups
- Organize endpoints into logical groups: Website, Arcade, Worksheets, Flowcharts, Core API, Infrastructure
- Add hide-url: false to show actual URLs on status page
- Use user-friendly names like "Games Hub", "Worksheet Builder", "Flashcard Generator"
- Remove confusing internal service endpoints
- Check database and Redis via infrastructure group

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:51:27 -06:00
Thomas Hallock ba4d2d7f7d docs(infra): document NAS Traefik routing and subdomain setup
- Update architecture diagram to show NAS Traefik as entry point
- Add "Adding New Subdomains" guide with DNS, NAS Traefik, and k3s steps
- Document network architecture in CLAUDE.md for agents
- Note services.yaml location on NAS

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:43:22 -06:00
Thomas Hallock dda5485408 feat(infra): add Gatus status page at status.abaci.one
- Gatus deployment monitoring homepage, health API, Redis, DB
- Simplified ingress (HTTP only; NAS Traefik handles SSL)
- Updated NAS Traefik services.yaml with status subdomain routes

Access: https://status.abaci.one

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:42:41 -06:00
Thomas Hallock f8d1ec730c feat(infra): add Gatus status page at status.abaci.one
- Gatus deployment with SQLite persistence
- ConfigMap with endpoint monitors (homepage, health API, Redis, DB)
- Ingress with SSL via cert-manager
- DNS CNAME record already configured

Deploy with: terraform apply

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:17:24 -06:00
Thomas Hallock ee26b1e361 feat(infra): add Keel for automatic k3s deployments
- Add Keel helm release that polls ghcr.io every 2 minutes
- Add keel.sh annotations to app StatefulSet for auto-updates
- Create comprehensive README.md documenting k3s architecture
- Update CLAUDE.md with automatic deployment workflow

After terraform apply, deployments are fully automatic:
push to main → build → Keel detects new image → rolling update

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:11:19 -06:00
Thomas Hallock 2f82bc28ec feat(infra): scale to 3 app replicas for better load distribution
Pod-0 remains LiteFS primary (handles writes), pod-1 and pod-2 are
replicas that serve reads and forward writes to primary.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 10:16:00 -06:00
Thomas Hallock 0abed6ae55 feat(infra): add performance remediation for k8s deployment
- Increase resource limits: 1Gi memory, 2 CPU cores per pod
- Tune health probes: 10s timeout, 5 failures (75s grace period)
- Add Traefik rate limiting: 50 req/sec avg, 100 burst
- Add in-flight request limiting: max 100 concurrent connections

Fixes pod crashes under moderate load (50+ concurrent connections).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 10:10:38 -06:00
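The "50 req/sec avg, 100 burst" limit in 0abed6ae55 follows the token bucket model. The sketch below shows the general technique with those two numbers; it is not Traefik's implementation, and the class and method names are hypothetical.

```typescript
// Minimal token bucket: refills at `ratePerSec`, holds at most `burst` tokens.
class TokenBucket {
  private ratePerSec: number;
  private burst: number;
  private tokens: number;
  private lastMs: number;

  constructor(ratePerSec: number, burst: number, nowMs: number) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
    this.tokens = burst; // start full, so an initial burst is allowed
    this.lastMs = nowMs;
  }

  // Returns true if the request is admitted, false if it should be rejected.
  allow(nowMs: number): boolean {
    const elapsedSec = (nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A bucket created as `new TokenBucket(50, 100, Date.now())` admits up to 100 back-to-back requests, then settles to 50 per second as tokens refill.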
Thomas Hallock 6c51182c15 refactor(flowchart): remove legacy schema-specific formatting, add display.problem check
- Remove legacy schema-specific formatting fallbacks in formatting.ts and example-generator.ts
- All flowcharts now require explicit display.problem and display.answer expressions
- Add DISP-003 diagnostic for missing display.problem expressions
- Update doctor to treat missing display.answer as error (was warning)

Also includes:
- Terraform: generate LiteFS config at runtime, add AUTH_TRUST_HOST, add volume mounts for vision-training and uploads data
- Terraform: add storage.tf for persistent volume claims
- Add Claude instructions for terraform directory
- Various UI component formatting updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 11:03:15 -06:00
Thomas Hallock 2765b081bc fix(litefs): simplify candidate env var and add debug logging
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 16:50:02 -06:00
Thomas Hallock 42f55855eb fix(litefs): remove HOSTNAME env var to allow pod hostname detection
LiteFS needs the actual pod hostname for cluster communication,
but HOSTNAME=0.0.0.0 was being set in both the Dockerfile and
ConfigMap, overriding the pod's hostname.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 14:38:30 -06:00
Thomas Hallock e69a33838a feat(infra): add LiteFS for distributed SQLite in k8s
- Add LiteFS binary and config to Docker image for SQLite replication
- Convert k8s Deployment to StatefulSet for stable pod identities
- Pod-0 is primary (handles writes), others are replicas
- LiteFS proxy forwards write requests to primary automatically
- Add headless service for pod-to-pod communication
- Increase Node.js heap size to 4GB for Next.js build
- Exclude large Python venvs from Docker context

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 13:19:37 -06:00
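The "pod-0 is primary" convention in e69a33838a relies on StatefulSet pods having stable hostnames that end in an ordinal. The sketch below shows one way to derive that ordinal; the function names and the example hostname `abaci-app-0` are assumptions, and the actual LiteFS candidate expression is not shown in the log.

```typescript
// Extract the trailing ordinal from a StatefulSet pod hostname,
// e.g. "abaci-app-2" -> 2. Returns undefined if there is none.
function podOrdinal(hostname: string): number | undefined {
  const match = /-(\d+)$/.exec(hostname);
  return match === null ? undefined : Number(match[1]);
}

// Only the pod with ordinal 0 is treated as a primary candidate.
function isPrimaryCandidate(hostname: string): boolean {
  return podOrdinal(hostname) === 0;
}
```

A value such as `0.0.0.0` carries no ordinal, which illustrates why overriding HOSTNAME (fixed in 42f55855eb) interferes with anything that depends on the pod's real hostname.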
Thomas Hallock c16b70090f feat(infra): add full k8s stack mirroring docker-compose setup
Terraform now deploys a complete k8s environment:
- cert-manager with Let's Encrypt (staging + prod issuers)
- Redis deployment with persistent storage
- App deployment (2 replicas, rolling updates)
- Traefik ingress with SSL, HSTS, HTTP→HTTPS redirect

Ready for switchover by forwarding ports 80/443 to k3s VM.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 11:33:49 -06:00
Thomas Hallock 38e289f626 chore(infra): add terraform lock file for reproducible builds
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 11:04:51 -06:00
Thomas Hallock 1cac633814 feat(infra): add initial Terraform config for k3s cluster
Set up Terraform to manage k3s resources on the NAS VM:
- Kubernetes and Helm providers configured
- Created 'abaci' namespace for workloads
- Ready for BullMQ workers and future scalable services

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 11:04:07 -06:00