Deployment + CI/CD spec — two targets, two stories

Bucket: Backend (Agent B)
Status: Reviewed, 2026-05-12 (Phase B seams applied: R2 release manifest auto-update parked; Phase I Jetson updates are SSH push)
Owner: Sophia · Reviewers: Andrew, Agent A (Jetson side)
Supersedes / refines: docs/technical/deployment.md is the on-site procedure; this doc is the CI/CD plumbing layer above it.

Context

Two completely different deploy targets, often conflated:

lbzfai.com — Cloudflare auto-deploys on push to main via its native GitHub integration. A scoped Cloudflare API token now lives in .env so agents can also trigger a deploy directly.
The Jetson — sits on LBZF’s network in Pereira; auto-deploy story has to handle “what if the deploy bricks the box and Sophia is in California?”

The decisions being made: the smallest-blast-radius pattern for agent-driven Cloudflare deploys, and how the Jetson gets updated without a Sophia-flies-to-Colombia recovery scenario.

Goals

An agent can deploy a website change with the minimum required token scope and a clear blast-radius preview
Jetson updates work without a physical visit, with a rollback path that doesn’t require Tailscale (in case the update breaks Tailscale)
The two deploy targets are clearly named, clearly scoped, never confused for each other
The CI loop covers test, build, deploy, smoke-check
Secrets never end up in commits or in agent transcripts

Non-goals

Multi-environment (staging vs prod) for the Jetson: Phase I has one production Jetson; staging is “Sophia’s dev Jetson in CA before flight”
Blue/green deploys and canarying: overkill at our scale
IaC for the Cloudflare account itself (Terraform / Pulumi): the Cloudflare dashboard is fine until we have >5 routes
Container orchestration on the Jetson: systemd is sufficient

Proposed approach

Target 1: lbzfai.com (Cloudflare Workers Static Assets)

How it deploys today

Push to main on github.com/sophiamann/lbzfai
Cloudflare’s GitHub integration receives a webhook
Cloudflare build VM clones the repo
Auto-runs astro add cloudflare (build VM only — does NOT touch the repo’s astro.config.mjs)
Runs npm run build → static dist/
Deploys to the lbzfai-com Worker (and lbzfai.sophiainesmann.workers.dev route)
Cache flush is automatic
Build vars (PUBLIC_AUTH0_DOMAIN, PUBLIC_AUTH0_CLIENT_ID) come from Cloudflare dashboard → Settings → Builds → Build variables

This works and shouldn’t be changed casually.

Agent-driven deploys (the new path)

A scoped Cloudflare API token is in .env. Token scope (recommended; verify in dashboard):

Scope	Why
`Account → Workers Scripts → Edit`	publish the Worker
`Zone → Workers Routes → Edit`	add/move routes
`Account → Workers KV Storage → Edit`	(Phase II) write user table once it moves to KV/D1
`User → User Details → Read`	API health check
Zone scope: only `lbzfai.com`	limits blast radius to one zone

Do NOT grant Account → Account Settings → Edit, Zone → DNS → Edit (for new zones), or User → API Tokens → Edit; those are nuclear.
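
The allow and deny lists above can be checked mechanically before an agent uses a token. The following is a minimal sketch, assuming the token's permissions have already been flattened into plain "Group:Level" strings. That string format and both function names are hypothetical; how the permissions are read out of Cloudflare is not shown.

```python
# Hypothetical "Group:Level" strings; map the dashboard listing onto these by hand.
ALLOWED = {
    "Workers Scripts:Edit",
    "Workers Routes:Edit",
    "Workers KV Storage:Edit",  # Phase II only
    "User Details:Read",
}
FORBIDDEN = {
    "Account Settings:Edit",
    "DNS:Edit",
    "API Tokens:Edit",
}


def audit_token_scopes(granted: set[str]) -> dict[str, set[str]]:
    """Split granted permissions into explicitly forbidden and merely unexpected."""
    return {
        "forbidden": granted & FORBIDDEN,
        "unexpected": granted - ALLOWED - FORBIDDEN,
    }


def token_is_acceptable(granted: set[str]) -> bool:
    """True only when every granted permission is on the allow list."""
    report = audit_token_scopes(granted)
    return not report["forbidden"] and not report["unexpected"]
```

Treating anything outside the allow list as a failure keeps the check conservative: a permission nobody thought about is reported as "unexpected" rather than silently accepted.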

Smallest-blast-radius agent pattern

Three rules:

Agents do not push to main. Agents open a PR from a worktree branch. The auto-deploy fires only after a human merges. Cloudflare’s GitHub integration is the deploy mechanism; agents are just authors.
For preview deploys, agents use wrangler deploy --env preview (or equivalent Worker route like pr-<n>.lbzfai-com.preview). Previews are namespaced and do not touch lbzfai.com. The Cloudflare API token used by agents should be preview-scoped if possible; if Cloudflare doesn’t support a preview-only token, agents run wrangler deploy --dry-run to surface the intended change before any actual write.
The .env Cloudflare token is dev-only. .env is listed in .gitignore, and .env.example carries only a placeholder for the token. Token rotation is manual: rotate quarterly and after every contractor offboarding.

Smallest blast-radius deploy pattern, in order of preference:

Action	Affected surface	Risk
Open PR; let CI build a preview Worker	`pr-N.lbzfai-com.preview.workers.dev`	None to prod
`wrangler deploy --dry-run` to validate config	nothing changes	None
`wrangler deploy --env preview`	preview Worker only	None to prod
Push to a non-main branch	nothing deploys (GitHub integration is main-only)	None
Merge to `main` (after PR review)	lbzfai.com prod	Owned by reviewer
Direct `wrangler deploy --env production` (skipping git)	lbzfai.com prod	Forbidden for agents — emergency only, Sophia/Andrew
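
The table above can be enforced by a small guard that an agent harness calls before running wrangler. This is a sketch under stated assumptions: the function names are hypothetical, only `whoami` is treated as read-only, and any environment named `preview` or starting with `preview-` counts as a preview target.

```python
from typing import Optional

READ_ONLY_SUBCOMMANDS = {"whoami"}


def _env_name(args: list[str]) -> Optional[str]:
    """Return the value passed via --env / -e / --env=<name>, if any."""
    for i, arg in enumerate(args):
        if arg in ("--env", "-e") and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith("--env="):
            return arg.split("=", 1)[1]
    return None


def classify_wrangler_command(argv: list[str]) -> str:
    """Return 'allowed' or 'forbidden' for an agent-issued wrangler command."""
    if len(argv) < 2 or argv[0] != "wrangler":
        raise ValueError("expected a wrangler command with a subcommand")
    subcommand, args = argv[1], argv[2:]
    if subcommand in READ_ONLY_SUBCOMMANDS:
        return "allowed"
    if subcommand != "deploy":
        return "forbidden"  # anything not explicitly listed is refused
    if "--dry-run" in args:
        return "allowed"  # validates config, writes nothing
    env = _env_name(args)
    if env is not None and (env == "preview" or env.startswith("preview-")):
        return "allowed"
    return "forbidden"  # bare deploy or --env production
```

A bare `wrangler deploy` is refused on purpose, since without an explicit environment it would target the top-level Worker configuration.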

Agent checklist for any Cloudflare-touching change

Run wrangler whoami to confirm the token is the scoped agent token, not a personal token
If editing wrangler.toml or astro.config.mjs, dry-run the build locally first (npm run build)
Open PR; do not push to main
Include the URL of the preview deploy in the PR description
Note any new Cloudflare resources (routes, secrets) so they’re caught at review

Target 2: The Jetson

The problem statement

The Jetson is in Pereira. The only inbound channel is Tailscale. If a deploy breaks Tailscale, recovery requires Armando or a local technical contact physically rebooting. This is the operational gating constraint on Jetson CI/CD.

Phase I: SSH push via Tailscale

Phase I uses Option A (SSH push) for Jetson updates. The Watchtower-style R2-manifest auto-pull (Option C) is parked to docs/design/60-parking/r2-release-manifest.md.

Reasoning:

Phase I has one production Jetson + one ITBA twin and one release author (Sophia). The R2 manifest design pays off when there is a fleet to coordinate or a non-author operator; neither holds in Phase I.
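SSH push is the simplest mechanism that works, and it is reversible over the same channel: ssh sophia@<jetson> "cd /opt/lbzf && git pull && ./bin/migrate.py && systemctl restart 'lbzf-*'". The unit glob is quoted so the remote shell passes it to systemctl instead of expanding it against files in /opt/lbzf.
Rollback is git checkout <prev_sha> && systemctl restart 'lbzf-*' over the same SSH session. No down-migration step is needed as long as Phase I migrations stay additive-only.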
The ITBA Jetson uses the same workflow over its own Tailscale tag (tag:itba-dev).

Trigger to thaw the R2 design: more than one production Jetson or a non-Sophia engineer regularly merging to main.
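
The push and rollback commands in the reasoning above can be assembled by a small helper so the two paths stay in sync. This is a sketch, not the project's tooling: the function names are hypothetical, the SHA validation is an added safety assumption, and the unit glob is quoted so the remote shell hands it to systemctl unexpanded.

```python
import re

REPO_DIR = "/opt/lbzf"  # path taken from the commands in this section
RESTART = "systemctl restart 'lbzf-*'"


def remote_deploy_script() -> str:
    """Shell script run on the Jetson for a normal SSH push."""
    return f"cd {REPO_DIR} && git pull && ./bin/migrate.py && {RESTART}"


def remote_rollback_script(prev_sha: str) -> str:
    """Shell script run on the Jetson to return to a previous commit."""
    # Accept only hex SHAs so nothing else can be injected into the remote shell.
    if not re.fullmatch(r"[0-9a-f]{7,40}", prev_sha):
        raise ValueError(f"not a git SHA: {prev_sha!r}")
    return f"cd {REPO_DIR} && git checkout {prev_sha} && {RESTART}"


def ssh_argv(host: str, script: str, user: str = "sophia") -> list[str]:
    """Argument vector suitable for subprocess.run(); the script is one argument."""
    return ["ssh", f"{user}@{host}", script]
```

Passing the result to `subprocess.run(ssh_argv(host, remote_deploy_script()), check=True)` would run the push; the host name is whatever the Jetson is called on the tailnet.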

Tailscale-breaking changes (still applies)

The single failure that defeats any unattended update path: a release that breaks Tailscale itself. Mitigations:

tailscaled is pinned and updated separately from lbzf-*.
The systemd unit tailscaled has Restart=always and is independent of lbzf-* units.
Rehearsed in CA: before the Pereira flight, Sophia runs one “deliberately break Tailscale” drill to confirm that the local technical contact at LBZF has the runbook and that the out-of-band path works (phone Armando, who walks LBZF staff through the reboot).
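
The second mitigation can be verified rather than assumed. The sketch below parses unit-file text (for example the output of systemctl cat tailscaled) and checks the restart policy and the absence of hard dependencies on lbzf-* units. The function names are hypothetical, and configparser is used as an approximation of systemd's unit syntax that is adequate for simple units only.

```python
import configparser
from typing import Optional


def _parse_unit(unit_text: str) -> configparser.ConfigParser:
    # strict=False tolerates repeated sections and keys, as drop-ins produce.
    parser = configparser.ConfigParser(
        strict=False, interpolation=None, delimiters=("=",)
    )
    parser.optionxform = str  # systemd keys are case-sensitive
    parser.read_string(unit_text)
    return parser


def restart_policy(unit_text: str) -> Optional[str]:
    """Return the [Service] Restart= value, or None if it is not set."""
    return _parse_unit(unit_text).get("Service", "Restart", fallback=None)


def depends_on_lbzf(unit_text: str) -> bool:
    """True if the unit is hard-coupled to any lbzf-* unit."""
    parser = _parse_unit(unit_text)
    for key in ("Requires", "Requisite", "BindsTo", "PartOf"):
        value = parser.get("Unit", key, fallback="")
        if any(name.startswith("lbzf-") for name in value.split()):
            return True
    return False
```

A pre-flight check on the dev Jetson could assert `restart_policy(text) == "always"` and `not depends_on_lbzf(text)` before any release is pushed.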

Rollback (Phase I)

SSH via Tailscale: ssh sophia@<jetson> "cd /opt/lbzf && git checkout <prev_sha> && systemctl restart 'lbzf-*'".
If Tailscale is broken, the runbook escalates to Armando → LBZF on-site reboot.

Rolling back is always faster than diagnosing; we roll back first and debug later.

Secrets management

Secret	Where	Rotation
Cloudflare API token (agent)	`.env` (dev), Cloudflare Worker secret (prod)	Quarterly + on offboard
Auth0 tenant config (public client ID, domain)	Repo + Cloudflare build vars	Never (it’s public)
`JETSON_PROXY_SECRET` (Worker → Jetson)	Cloudflare Worker secret + `/etc/lbzf/proxy.token` on Jetson	Quarterly
Internal HMAC for CV writer → API	`/etc/lbzf/internal.token` on Jetson	On model deploy
SQLite encryption key (if we add SQLCipher)	Not in Phase I	n/a
Camera RTSP passwords	`/etc/lbzf/cameras.yaml` (mode 0600) on Jetson	When cameras change
Tailscale auth key	Per-machine; not stored after enrollment	n/a
AWS credentials	Per Andrew’s “AWS day 1” — not yet provisioned	TBD

Rule: no secret in the repo, ever. .env.example shows the key names only; .env is listed in .gitignore. A pre-commit hook scans for patterns such as AKIA and sk_live_ (gitleaks or hand-rolled).
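
For the hand-rolled option, a scanner can be as small as the sketch below. It covers only the two prefixes named in the rule; the exact regular expressions and the function name are assumptions, and gitleaks ships a far larger rule set.

```python
import re

# Patterns for the two prefixes named above; lengths are illustrative assumptions.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "stripe_live_key": re.compile(r"\bsk_live_[0-9a-zA-Z]{10,}\b"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

A pre-commit hook would run `find_secrets` over each staged file and exit non-zero when the returned list is non-empty.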

CI matrix (Phase I)

Workflow	Trigger	Target	Job
`web-build.yml`	push to `main`	lbzfai.com Cloudflare	(handled by Cloudflare native integration, not GH Actions)
`web-preview.yml` (new)	PR	preview Worker	`wrangler deploy --env preview-pr-<n>`
`lint.yml` (new)	every push	none	ruff, mypy, eslint
`pytest.yml` (new)	push to `main` if `app/**` changed	none (test results only)	lint, mypy, pytest — but does not publish artifacts

jetson-build.yml and jetson-promote.yml are parked alongside the R2 manifest design. Phase II re-introduces them when fleet size justifies.

GH Actions secrets (Phase I):

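CLOUDFLARE_API_TOKEN_CI: scoped only to Workers Scripts → Edit and used only for the preview Worker. No prod write, no R2 (since there is no manifest in Phase I). If the token cannot be restricted to the preview Worker alone, the no-prod-write guarantee has to come from the workflow only ever deploying to a preview environment.
CLOUDFLARE_ACCOUNT_ID: public-ish, but kept in the secret store for tidiness.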

Observability for deploys

Every Worker deploy logs a structured line {deploy_id, sha, user, ts} to Cloudflare Logs (free; 7-day retention).
Every SSH-driven Jetson update is a manual operation. Sophia records the git SHA range she deployed in deploy-log.md (or just the output of git log --oneline <prev>..<new>) and pastes it into the project Slack/Discord channel.
Phase II adds structured deploy logs when the R2-manifest auto-pull thaws.
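
The record shape {deploy_id, sha, user, ts} can be produced by a few lines of code, which would also work for the manual Jetson entries. This is a sketch: the field names come from the list above, while the UUID deploy_id, the ISO 8601 UTC timestamp, and the function name are assumptions.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def deploy_log_line(
    sha: str,
    user: str,
    deploy_id: Optional[str] = None,
    ts: Optional[datetime] = None,
) -> str:
    """Return one JSON line describing a deploy."""
    record = {
        "deploy_id": deploy_id or str(uuid.uuid4()),  # assumed format
        "sha": sha,
        "user": user,
        "ts": (ts or datetime.now(timezone.utc)).isoformat(),
    }
    return json.dumps(record, sort_keys=True)
```

Appending the returned line to deploy-log.md would keep the hand-written record greppable by SHA without changing the Phase I workflow.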

Argentina ITBA twin Jetson

ITBA’s Jetson uses the same SSH-push workflow as LBZF’s, scoped to the tag:itba-dev tailnet tag. Configuration that differs (camera RTSP URLs, org_id, Tailscale auth) lives in /etc/lbzf/site.yaml and is not in the git repo — it’s per-machine.

When ITBA wants to diverge (try a different CV model, etc.), they branch from main locally on their Jetson and pin to that branch.

Alternatives considered

GitHub Actions deploys directly to Cloudflare instead of Cloudflare’s native GitHub integration: works, but duplicates the existing wiring. Native integration handles the build VM, secrets, and route configuration; reimplementing in Actions is a regression.
R2 release manifest + Watchtower-style auto-pull on the Jetson — the right design at fleet scale. Parked to 60-parking/r2-release-manifest.md until fleet exists or a non-author engineer ships.
systemd-timer git pull on the Jetson — every commit to main ships immediately. Fast but no decoupling. Sophia explicitly opts into SSH-push instead so “merged ≠ deployed” stays true.
Container image deploys (Docker on Jetson) — adds Docker to the Jetson stack we don’t otherwise need. JetPack + apt + systemd is the simpler base; revisit in Phase II if we get tired of dependency drift.
Ansible / Salt / cloud-init for Jetson provisioning — overkill for one box. A bash script + this doc is fine.
Cloudflare Pages instead of Workers Static Assets — we’re already on Workers Static Assets (per CLAUDE.md); the README is stale on this. Migrating is not a Phase I priority.

Open questions

PARKED: Cloudflare R2 cost at our scale — moot for Phase I; revisit when R2 manifest thaws.
OPEN: agent push permission scope — currently .env has the Cloudflare token; should an agent ever directly call wrangler against prod, or only against preview? Default: preview only. Owner: Sophia.
OPEN: pre-commit secret scanner — gitleaks is the standard; install + add to CI. Owner: Sophia.
OPEN: how do migrations roll back? If 0.4.7 adds a column, 0.4.6 should still work; if 0.4.7 removes a column, downgrade silently breaks. Defaulting to additive-only migrations in Phase I; document the policy.
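
The additive-only default in the last open question could be checked in CI with something like the sketch below. It is deliberately conservative and naive: it flags any statement containing DROP or RENAME, and it splits on semicolons without understanding string literals. The function name and the example column names are hypothetical.

```python
import re

# Any DROP or RENAME is treated as non-additive, including DROP INDEX.
NON_ADDITIVE = re.compile(r"\b(DROP|RENAME)\b", re.IGNORECASE)


def non_additive_statements(sql: str) -> list[str]:
    """Return the statements in a migration that remove or rename schema objects."""
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return [s for s in statements if NON_ADDITIVE.search(s)]


migration = (
    "ALTER TABLE cycle_events ADD COLUMN note TEXT; "
    "ALTER TABLE cycle_events DROP COLUMN legacy;"
)
flagged = non_additive_statements(migration)
```

Here `flagged` contains only the DROP COLUMN statement; an empty list means a rollback by git checkout leaves the older code with a schema it can still read.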

Cross-bucket dependencies

This doc depends on	Owner bucket	What we need
Tailscale on Jetson	Hardware (Agent A) + existing deployment.md	The always-on Tailscale daemon is the rollback escape hatch
Cloudflare Worker route config	This bucket (`lbzfai-jetson-integration.md`)	We deploy the Worker
systemd unit files for `lbzf-dashboard`, `lbzf-cv-writer`, `lbzf-exporter`	this bucket + ML (Agent D for cv-writer)	All units co-evolve. `lbzf-updater` ships only when the R2-manifest design thaws.
DB migration policy	This bucket (`data-model.md`)	Additive-only by default

This doc implies	Owner	Ask
ML/CV writer is part of the same git repo on the Jetson	Agent D	A single `git pull` updates all Jetson processes
Frontend deploys are decoupled from backend Jetson deploys	Frontend (Agent C)	They can ship UI changes without waiting for a Jetson SSH push

What’s weak in this doc

SSH-push depends on Sophia being available. A 3-day Sophia outage (sickness, travel without internet) means no Jetson updates. Phase II thaws the R2-manifest auto-pull design specifically to remove this single-person bottleneck.
Rollback rehearsal is mentioned but not scheduled. Without a documented dry-run on the dev Jetson, the first actual rollback in production is going to be exciting.
.env Cloudflare token + agents: anyone who gets read access to .env (e.g., via a careless cat .env in an agent transcript) gains Worker write access. The blast radius is the lbzfai.com Worker, not the AWS account, not Auth0, not the Jetson. Acceptable, but worth being clear-eyed about.
No security scan in CI yet (pip-audit, npm audit, gitleaks). Should be table stakes.
No structured deploy audit trail in Phase I. Sophia’s hand-written “I deployed <sha> at <time>” notes are the only record. Acceptable at this scale; the R2-manifest thaw replaces this with proper structured logs.

Rollout

Now (May 2026):
- Document the scoped Cloudflare token in .env.example
- Write web-preview.yml GH Action for PR previews
- Add gitleaks pre-commit
Before Argentina (2026-05-15):
- Demo a PR-driven preview deploy of lbzfai.com
- ITBA gets the SSH-push workflow on the BA handoff Jetson
Before Pereira (Jul 2026):
- SSH-push runbook tested end-to-end on the dev Jetson in CA, including a forced rollback
- Runbook for local technical contact: “if the dashboard is down, call Armando, who phones Sophia”
Day-1 Pereira:
- Deploy a known-good main SHA before flight; freeze the SHA during the first week of operation
- First in-prod release is the bug-fix release that comes out of the deploy week
Phase II:
- Thaw 60-parking/r2-release-manifest.md: build R2 bucket + manifest + lbzf-updater.service
- Canary on ITBA’s Jetson first; promote to LBZF after 72h of green
- Auto-rollback on richer signals (cycle_events count drops to zero for 5 min → automatic rollback)
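
The auto-rollback signal in the last Phase II item could be evaluated roughly as follows. This is a sketch of one possible reading: "drops to zero" is taken to mean that every sample in the last 5 minutes is zero and at least one earlier sample was non-zero. The sampling scheme and the function name are assumptions.

```python
from datetime import datetime, timedelta


def should_roll_back(
    samples: list[tuple[datetime, int]],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
) -> bool:
    """samples holds (timestamp, cycle_events count seen in that interval)."""
    cutoff = now - window
    recent = [count for ts, count in samples if cutoff < ts <= now]
    earlier = [count for ts, count in samples if ts <= cutoff]
    # Require data in the window, all of it zero, and a non-zero past to drop from.
    return bool(recent) and all(c == 0 for c in recent) and any(c > 0 for c in earlier)
```

Requiring a non-zero sample before the window avoids rolling back a Jetson that has simply never seen events, for example outside operating hours right after first boot.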

Unblocks: every other backend doc’s path to running on the Jetson; safer agent-driven website iteration; ITBA’s parallel work.