Commit Graph

3565 Commits

Author SHA1 Message Date
Thomas Hallock 1363a84278 chore(ci): add secrets for NAS deployment
Deploy Storybook / Build and Deploy Storybook (push) Failing after 46m45s Details
Configure repository secrets for Storybook deploy:
- NAS_HOST
- NAS_DEPLOY_PATH
- NAS_SSH_KEY

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 10:20:25 -06:00
Thomas Hallock 08746960e1 perf(ci): increase runner resources for faster builds
Deploy Storybook / Build and Deploy Storybook (push) Failing after 53m24s Details
Increased dind container from 2GB/2CPU to 4GB/3CPU.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 09:07:23 -06:00
Thomas Hallock e36909d6e2 fix(ci): install rsync in Gitea Actions workflow
Deploy Storybook / Build and Deploy Storybook (push) Failing after 17m46s Details
The node:20 container doesn't include rsync by default.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 08:23:27 -06:00
Thomas Hallock b13bb3b126 ci: fix pnpm version mismatch - use packageManager from package.json
Deploy Storybook / Build and Deploy Storybook (push) Failing after 50m54s Details
2026-01-25 07:32:13 -06:00
Thomas Hallock cd651b3262 ci: trigger storybook workflow v7 with DNS fix
Deploy Storybook / Build and Deploy Storybook (push) Failing after 15m26s Details
2026-01-25 07:16:07 -06:00
Thomas Hallock c64426ddaa chore: v6
Deploy Storybook / Build and Deploy Storybook (push) Failing after 13m57s Details
2026-01-25 06:45:09 -06:00
Thomas Hallock bd606d8d99 chore: trigger v5
Deploy Storybook / Build and Deploy Storybook (push) Failing after 2m34s Details
2026-01-25 05:55:14 -06:00
Thomas Hallock 8a1c1c0c8f chore: trigger storybook v4
Deploy Storybook / Build and Deploy Storybook (push) Has been cancelled Details
2026-01-25 05:51:17 -06:00
Thomas Hallock 6928f02a9e chore: trigger storybook v3
Deploy Storybook / Build and Deploy Storybook (push) Has been cancelled Details
2026-01-25 05:47:13 -06:00
Thomas Hallock c47ec0258a chore: trigger storybook build v2
Deploy Storybook / Build and Deploy Storybook (push) Failing after 7s Details
2026-01-25 05:43:16 -06:00
Thomas Hallock 10e086e5c9 chore: trigger storybook build
Deploy Storybook / Build and Deploy Storybook (push) Failing after 9s Details
2026-01-25 05:24:55 -06:00
Thomas Hallock 8e133ddffe chore: re-trigger storybook workflow 2026-01-25 05:23:57 -06:00
Thomas Hallock ad4cc8c4a5 chore: trigger storybook workflow 2026-01-25 05:21:48 -06:00
Thomas Hallock db1ca7fa7a feat(infra): add Gitea with Actions and Storybook deployment
Deploy Storybook / Build and Deploy Storybook (push) Failing after 7s Details
- Add self-hosted Gitea server at git.dev.abaci.one
- Configure Gitea Actions runner with Docker-in-Docker
- Set up push mirror to GitHub for backup
- Add Storybook deployment workflow to dev.abaci.one/storybook/
- Update nginx config to serve Storybook from local storage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 19:53:12 -06:00
Thomas Hallock 0126f76994 fix(ci): build llm-client package before Storybook 2026-01-24 17:20:57 -06:00
Thomas Hallock 97313618ae feat(dev): add redirect from /storybook/ to GitHub Pages 2026-01-24 16:59:41 -06:00
Thomas Hallock 26a9fe784f fix(ci): generate build-info.json before Storybook build
The Storybook build was failing because DeploymentInfoContent.tsx
imports @/generated/build-info.json which doesn't exist until
the generate-build-info.js script runs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:57:12 -06:00
Thomas Hallock 74565b93af fix(tracing): use resourceFromAttributes for OTel SDK 2.x compatibility
@opentelemetry/resources v2.x changed the API - Resource class constructor
was replaced with resourceFromAttributes() factory function.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:57:05 -06:00
Thomas Hallock c1475e0306 feat(dev): add dev portal index page for dev.abaci.one
Creates a nice landing page that links to all dev resources:

Testing & QA:
- /smoke-reports/ - Playwright E2E test results
- /storybook/ - Component library (coming soon)
- /coverage/ - Test coverage reports (coming soon)

Monitoring:
- grafana.dev.abaci.one - Dashboards
- prometheus.dev.abaci.one - Metrics
- status.abaci.one - Uptime monitoring

Quick links to production app and GitHub repo.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:48:27 -06:00
Thomas Hallock dcad5bca46 feat(observability): add OpenTelemetry tracing with Tempo backend
- Add instrumentation.js for OTel SDK bootstrap via --require flag
- Add tracing.ts utility functions (getCurrentTraceId, recordError, withSpan)
- Install @opentelemetry packages for auto-instrumentation
- Update Dockerfile to copy instrumentation.js and use --require
- Add trace IDs to error responses in API routes

Traces are exported to Tempo via OTLP/gRPC when running in production
(KUBERNETES_SERVICE_HOST env var present).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:31:18 -06:00
Thomas Hallock 8362db4572 fix(keel): resolve DNS lookup failures with k3s CoreDNS
Go's pure-Go DNS resolver has incompatibilities with k3s's CoreDNS that
cause intermittent "server misbehaving" errors after the initial lookup.
This prevented Keel from polling ghcr.io for new image digests.

Setting GODEBUG=netdns=cgo forces Go to use the system's cgo DNS resolver,
which works correctly with k3s.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:07:43 -06:00
Thomas Hallock 74e12c0029 feat(metrics): add session tracking and Grafana dashboard provisioning
- Add heartbeat-based session tracking (/api/heartbeat)
- Track active sessions, session duration, page views, unique visitors
- Use Page Visibility API to only send heartbeats when tab visible
- Add Grafana dashboard via ConfigMap provisioning
- Dashboard includes: sessions, Socket.IO, request rate, error rate,
  arcade games, worksheets, memory, event loop lag

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 14:40:55 -06:00
Thomas Hallock ef75a07c2c feat(metrics): add comprehensive application metrics
Expand Prometheus metrics to track:
- HTTP request timing and counts
- Socket.IO connections and events
- Database query timing
- Practice sessions and problems
- Arcade games (completions, scores, win rates)
- Worksheet generations (by operator, timing)
- Flashcard generations
- Flowchart views
- Vision/camera recordings
- Classroom and user activity
- Curriculum/BKT metrics
- LLM API calls
- Error tracking

Instrument key API endpoints:
- /api/game-results: Track game completions and scores
- /api/create/worksheets: Track worksheet generations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 14:05:30 -06:00
Thomas Hallock f1223bb81b fix(monitoring): use /api/metrics path for ServiceMonitor
The Next.js API route is at /api/metrics, not /metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 13:15:46 -06:00
Thomas Hallock 35856afb2e feat(observability): add Prometheus/Grafana monitoring stack
- Deploy kube-prometheus-stack to k3s cluster via Terraform
- Add Prometheus metrics endpoint (/api/metrics) using prom-client
- Track Socket.IO connections, HTTP requests, and Node.js runtime
- Configure ServiceMonitor for auto-discovery by Prometheus
- Expose Grafana at grafana.dev.abaci.one
- Expose Prometheus at prometheus.dev.abaci.one

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:45:32 -06:00
Thomas Hallock 3c0df8099c fix(dev-artifacts): use correct NFS path under data directory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:18:56 -06:00
Thomas Hallock 5258437bef feat(dev): add dev.abaci.one for build artifacts
- Add nginx static server at dev.abaci.one for serving:
  - Playwright HTML reports at /smoke-reports/
  - Storybook (future) at /storybook/
  - Coverage reports (future) at /coverage/

- NFS-backed PVC shared between artifact producers and nginx
- Smoke tests now save HTML reports with automatic cleanup (keeps 20)
- Reports accessible at dev.abaci.one/smoke-reports/latest/

Infrastructure:
- infra/terraform/dev-artifacts.tf: nginx deployment, PVC, ingress
- Updated smoke-tests.tf to mount shared PVC
- Updated smoke-test-runner.ts to generate and save HTML reports

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 09:52:26 -06:00
Thomas Hallock 87bce550ad fix(smoke-tests): report last completed run instead of running test
Changed status endpoint to report the last COMPLETED test run instead
of any running test. This prevents Gatus from showing unhealthy status
while tests are in progress. Added currentlyRunning flag for info.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 09:39:45 -06:00
Thomas Hallock d2be19f1be fix(smoke-tests): simplify tests to only reliable critical paths
Removed tests for pages that were timing out or failing due to
hydration issues. Smoke tests should be minimal and reliable -
they detect if the site is down, not comprehensively test features.

Kept: homepage (3 tests), flowchart (1 test), arcade game (1 test),
practice navigation (1 test) = 6 total tests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 09:08:47 -06:00
Thomas Hallock 5ba12ef4cc fix(smoke-tests): update Playwright Docker image to v1.56.0
The smoke tests were failing because the Playwright package (1.56.0)
didn't match the Docker image version (v1.55.0-jammy). Updated the
Dockerfile to use mcr.microsoft.com/playwright:v1.56.0-jammy.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 07:42:06 -06:00
Thomas Hallock aa6506957c revert: undo performance changes that broke intervention badges
Reverts the following commits that traded functionality for
marginal (or negative) performance gains:

- Skip intervention computation during SSR (broke badges)
- Defer MiniAbacus rendering (caused visual flash)
- Batch DB queries with altered return type
- Eliminate redundant getViewerId calls

The intervention badges are critical for parents/teachers to
identify students who need help. Performance should not
compromise core functionality.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 06:51:17 -06:00
Thomas Hallock 8cdcb9f292 fix(smoke-tests): include .dockerignore in workflow paths filter
Ensures the smoke tests image is rebuilt when .dockerignore changes
affect which files are included.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 06:43:39 -06:00
Thomas Hallock 170497f245 fix(smoke-tests): add exception in .dockerignore for smoke test files
The .dockerignore was excluding **/*.spec.ts which blocked the smoke
test files from being copied into the Docker image.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 06:43:04 -06:00
Thomas Hallock 9c09851b44 fix(smoke-tests): add imagePullPolicy Always to CronJob
Ensures the latest smoke tests image is always pulled, avoiding
stale cached images when updates are pushed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 06:24:40 -06:00
Thomas Hallock 1914bcf9d0 perf(practice): eliminate redundant getViewerId and user lookups
Refactor getPlayersWithSkillData() to return viewerId and userId
along with players, avoiding 2 redundant calls in Practice page:
- Previous: 3 calls to getViewer() (via getViewerId) + 2 user lookups
- Now: 1 call to getViewer() + 1 user lookup

This should reduce Practice page SSR time by eliminating duplicate
auth checks and database queries.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 05:46:03 -06:00
Thomas Hallock dbc45b97b0 fix(smoke-tests): correct Playwright test path argument
The testDir in playwright.config.ts is './e2e', so we should pass 'smoke'
not 'e2e/smoke' to avoid looking in ./e2e/e2e/smoke.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 05:44:13 -06:00
Thomas Hallock affad2f4a6 feat(monitoring): add E2E smoke tests with Gatus integration
Add Playwright-based smoke tests that run every 15 minutes via k8s CronJob,
with results exposed to Gatus for status.abaci.one monitoring.

- Add smoke_test_runs table for storing test results
- Add /api/smoke-test-status endpoint (Gatus checks this)
- Add /api/smoke-test-results endpoint (CronJob reports here)
- Add smoke tests for homepage, arcade, practice, and flowchart pages
- Add smoke-test-runner.ts script
- Add Dockerfile.smoke-tests based on Playwright image
- Add GitHub Actions workflow to build smoke tests image
- Add Kubernetes CronJob Terraform config
- Update Gatus config with Browser Smoke Tests endpoint

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 05:08:50 -06:00
Thomas Hallock 958481b661 perf(homepage): defer MiniAbacus rendering until after hydration
- Skip heavy AbacusReact SVG rendering during SSR
- Render placeholder during SSR and initial hydration
- AbacusReact loads after client hydration
- Reduces SSR overhead by avoiding 4x AbacusReact renders

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 05:05:59 -06:00
Thomas Hallock 1e2f5c9010 perf(practice): skip intervention computation during SSR
Intervention badges are helpful but not critical for initial render.
By skipping the expensive BKT computation (which requires N additional
database queries for session history), we significantly reduce SSR time.

- Batched skill mastery query: N queries → 1 query
- Skipped intervention computation: N additional queries → 0

The intervention data can be computed lazily on the client if needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 21:14:37 -06:00
Thomas Hallock ed653db483 perf(practice): batch DB queries to reduce N+1 pattern
- Single query for all skill mastery records instead of N queries
- Single query for session history instead of N queries per player
- Group results in memory for O(1) lookups
- Expected improvement: ~150ms reduction in SSR time

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 20:42:49 -06:00
Thomas Hallock 30fb0e86e3 perf(worksheets): defer preview to client-side API fetch
The RSC Suspense streaming approach didn't work because the Suspense
boundary was inside a client component's props - React serializes all
props before streaming can begin.

Simpler solution: Don't embed the 1.25MB SVG in initial HTML at all.
- Page SSR returns immediately with just settings (~200ms TTFB)
- Preview is fetched via existing API after hydration (server-side generation)
- User sees page shell instantly, preview loads with loading indicator

This achieves the same UX goal: fast initial paint, preview appears when ready.
The preview generation still happens server-side via the API endpoint.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 19:29:59 -06:00
Thomas Hallock ba08409269 docs: add Keel and k8s deployment notes to agent instructions
Document lessons learned:
- Keel annotations must be on workload metadata, not pod template
- Keel namespace watching configuration
- Debugging Keel polling issues
- LiteFS replica migration handling

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 19:23:28 -06:00
Thomas Hallock 2b5d66f776 perf(worksheets): use Suspense streaming for preview generation
Problem: The worksheet page had 1.7-2.3s TTFB because the 1.25MB SVG
preview was being serialized into the initial HTML response, blocking
first paint.

Solution: Use React Suspense to stream the preview separately:
- Page shell renders immediately with settings (~200ms TTFB)
- Preview generates async and streams in when ready (~1.5s later)
- User sees the UI instantly, preview appears with loading skeleton

New components:
- StreamedPreview: async server component that generates preview
- PreviewSkeleton: loading placeholder while streaming
- StreamedPreviewContext: shares streamed data with PreviewCenter
- PreviewDataInjector: bridges server-streamed data to client context

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 18:54:17 -06:00
Thomas Hallock 1e43ec18f3 fix(infra): configure Keel to watch all namespaces
Add watchAllNamespaces=true to Keel helm config so it monitors
workloads in the abaci namespace (not just keel namespace).

Update documentation to clarify that Keel annotations must be on
the workload metadata, not the pod template.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 18:45:04 -06:00
Thomas Hallock 747bc4a5f0 fix(infra): move Keel annotations to StatefulSet metadata
Keel reads annotations from the workload's metadata, not the pod template.
Moving annotations from spec.template.metadata to metadata fixes auto-updates.

Also:
- Set NAMESPACE="" on Keel deployment to watch all namespaces
- Keep ghcr credentials config (optional, for private registries)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 18:44:38 -06:00
Thomas Hallock c1809d72ae feat(infra): add ghcr.io registry credentials for Keel polling
Keel needs to authenticate with ghcr.io to poll for new image digests
(ghcr.io requires auth for manifest API even on public images).

- Add ghcr_token and ghcr_username variables
- Create docker-registry secret for ghcr.io
- Add imagePullSecrets to StatefulSet (Keel reads these for auth)
- Document the setup in keel.tf

To enable auto-updates:
1. Create GitHub PAT with read:packages scope
2. Set ghcr_token in terraform.tfvars
3. terraform apply

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 15:56:26 -06:00
Thomas Hallock e72018ae44 fix(server): skip migrations on LiteFS replicas
LiteFS replicas are read-only, so migrations fail with "read only replica"
error. Check LITEFS_CANDIDATE env var and skip migrations on replicas.
The primary (pod-0) will run migrations and replicate the changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 14:31:19 -06:00
Thomas Hallock 1f1083773d perf: add timing instrumentation to worksheet page SSR
Track where time is spent during worksheet page render:
- loadWorksheetSettings (DB query + getViewerId)
- generateWorksheetPreview (problem generation + Typst compilation)
- Total page render time

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 14:21:12 -06:00
Thomas Hallock b04d0caeaf feat(infra): add OpenAI API key for LLM features
Add openai_api_key variable to terraform configuration for AI-powered
features like flowchart generation. The key is stored as a k8s secret
and exposed to pods as LLM_OPENAI_API_KEY environment variable.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:58:34 -06:00
Thomas Hallock c80eefa5e3 docs(infra): document LiteFS write routing for k8s deployments
- Explain why LiteFS proxy fly-replay doesn't work outside Fly.io
- Document the primary service and IngressRoute solution
- Add troubleshooting symptoms for broken write routing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:56:29 -06:00