Training & fine-tuning plan

Bucket: technical/ml (Agent D) · Status: reviewed (Phase B: 3–5 fps per ADR-004) · Owner: Sophia Mann · Phase: I → II · Last updated: 2026-05-12

Context

The v1.9 plan (2026-04-27) commits to YOLOv8n + TensorRT, person class only, off-shelf COCO-pretrained. Andrew Kent’s 2026-05-02 direction reinforced this: “skip custom training in Phase I; start with off-shelf models.” The Granola transcription of that meeting renders Andrew’s model family as “VLMs” but this is widely suspected to be a transcription artifact (Granola also transcribes the project name as “LBCF” — see docs/references/meetings.md).

The off-shelf-COCO position is well-defended for cycle counting. COCO’s person class is exactly the right tool for “is a human standing/sitting in this ROI.” But Phase I’s validation methodology can benefit from a small amount of fine-tuning if the COCO model’s recall drops on actually-sewing operators in cluttered apparel-line scenes (fabric on hands, leaning over machine, partial occlusion by sewing-machine head).

Outstanding compute blocker: Sophia’s MacBook Air M1 is insufficient for any non-trivial fine-tuning. Two paths under discussion: (a) AWS GPU credits, (b) new laptop with discrete GPU. Andrew leans AWS. This doc makes a recommendation (B.3); the items that still need sign-off are tracked under Open questions.

Goals

G1: Establish a clear “no-fine-tune baseline” measurement of off-shelf YOLOv8n on Pereira-like footage before the trip, so we know whether fine-tuning is needed at all.
G2: If fine-tuning is needed, deliver yolov8n_lbzf_v0.engine against a documented, reproducible training set with proper train/val/test discipline.
G3: Make a concrete recommendation on compute (AWS vs new laptop) with cost figures.
G4: Define dataset construction discipline strict enough that the paper can rest on it (no train/test contamination, no Sophia-only annotation).

Non-goals

Phase II behavioral models (phone use, eating, talking) — see phase-ii-preview.md. This doc covers only person-detection fine-tuning for Phase I cycle counting.
Garment-type classification (Phase II/III).
Active learning, semi-supervised, self-supervised regimes. Phase I budget says “label what you must, ship.”

Proposed approach

Stage A — measure first (the “do we even need to fine-tune” experiment)

Before committing any annotation labor or AWS budget, run a structured comparison on a pre-Pereira sample.

From “Shared by Ronald 4/15/videos/” (folder 1mK-4my0gqGipGd5Zb0aKA1HG9vSCoxd6, ~41 operation videos), sample 3 videos per operation category (Collar, Cuffs, Back/yoke, Front, Assembly, Buttonholes & buttons, Inspection & pack — 7 categories × 3 = 21 videos).
From each sampled video, extract 1 frame every 5 seconds for a 60-second window centered on the operator actively working. ~12 frames × 21 videos = ~250 frames.
Manually label person bboxes in these 250 frames (Sophia + Armando, ~2 hours of work; CVAT or LabelImg, both free).
Run COCO-pretrained YOLOv8n (no fine-tune) over the 250 frames. Compute:
- Per-frame recall @ IoU 0.5 (does the detector find the operator at all?)
- Per-frame confidence distribution for the highest-conf person bbox
- False-positive count (detections outside the operator)

If recall ≥ 0.95 and median confidence ≥ 0.55, the off-shelf model is fine; skip Stage B for Phase I.

If either threshold is missed (recall below 0.95, or median confidence below 0.55), proceed to Stage B. Likely failure modes: operator heavily occluded by the machine, only hands visible, mannequin / dress form in frame mistaken for a person.
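
The Stage A measurements and the go/no-go rule can be sketched as follows. This is a minimal illustration, not the project's evaluation code: the frame record layout (`gt`, `preds`), the function names, the choice to count a frame with no detections as confidence 0.0, and the choice to treat a miss on either threshold as a reason to proceed are all assumptions.

```python
from statistics import median


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def stage_a_metrics(frames, iou_thr=0.5):
    """frames: list of {'gt': [box, ...], 'preds': [(box, conf), ...]} (hypothetical layout)."""
    matched = total_gt = false_pos = 0
    top_confs = []
    for f in frames:
        gt, preds = f["gt"], f["preds"]
        total_gt += len(gt)
        # A labeled operator is "found" if any predicted box overlaps it enough.
        matched += sum(any(iou(g, box) >= iou_thr for box, _ in preds) for g in gt)
        # A prediction is a false positive if it overlaps no labeled operator.
        false_pos += sum(all(iou(g, box) < iou_thr for g in gt) for box, _ in preds)
        # Assumption: a frame with no detections contributes confidence 0.0.
        top_confs.append(max((conf for _, conf in preds), default=0.0))
    recall = matched / total_gt if total_gt else 0.0
    return {"recall": recall, "median_conf": median(top_confs), "false_positives": false_pos}


def needs_stage_b(metrics, min_recall=0.95, min_median_conf=0.55):
    """True when the skip condition (both thresholds met) does not hold."""
    return metrics["recall"] < min_recall or metrics["median_conf"] < min_median_conf


# Toy data: one hit, one miss with a stray detection, one frame with no detections.
toy_frames = [
    {"gt": [(100, 100, 200, 300)], "preds": [((105, 100, 205, 300), 0.82)]},
    {"gt": [(100, 100, 200, 300)], "preds": [((400, 100, 500, 300), 0.40)]},
    {"gt": [(100, 100, 200, 300)], "preds": []},
]
toy_metrics = stage_a_metrics(toy_frames)
print(toy_metrics, needs_stage_b(toy_metrics))
```

With one labeled operator per frame, the box-level recall computed here equals the per-frame recall listed above; frames with several labeled people would need an explicit decision about which box is "the operator".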

This is a paper-worthy experiment in its own right: “do you need to fine-tune COCO YOLOv8 for an apparel line?” The answer is a contribution.

Stage B — fine-tuning plan (only if Stage A says we need to)

B.1 — Dataset construction

Sources:

41 operation videos (“Shared by Ronald 4/15/videos/”). ~2–5 min each. Spanish operation names in filenames.
16 plant photos (“Shared by Ronald 4/15”). Wide shots of the floor, mostly people-in-context.
Phase-I recording window (Jul 2026): from the 4TB rolling buffer, we will have continuous footage from cameras in actual install positions.

Frame extraction:

From the 41 videos: 1 frame / 2 s = ~60 frames/video at the 2-minute lower bound × 41 = ~2,500 frames (longer videos yield proportionally more).
From the 16 photos: 16 frames.
From Phase-I (post-install) recordings: 1 frame / 30 s × 6 cameras × 8 hours/day = ~5,760 frames/day, or ~28,800 frames over 5 days. (Adds to the dataset for v0.1 retrains, not v0.0.)
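
The frame budgets follow from simple rate arithmetic, which is worth keeping explicit because annotation labor scales with it. The sketch below uses only the rates and durations stated in this section; the helper name is hypothetical.

```python
def frames_at_rate(duration_s, period_s):
    """Number of frames when sampling one frame every `period_s` seconds."""
    return int(duration_s // period_s)


# Operation videos: 41 videos of roughly 2-5 minutes, one frame every 2 s.
n_videos = 41
video_frames_low = n_videos * frames_at_rate(2 * 60, 2)   # 2-minute videos
video_frames_high = n_videos * frames_at_rate(5 * 60, 2)  # 5-minute videos

# Post-install recordings: one frame every 30 s, 6 cameras, 8 hours/day, 5 days.
per_camera_day = frames_at_rate(8 * 3600, 30)
per_day = per_camera_day * 6
five_days = per_day * 5

print(video_frames_low, video_frames_high)  # 2460 6150
print(per_camera_day, per_day, five_days)   # 960 5760 28800
```

The ~2,500 figure for the operation videos corresponds to the 2-minute end of the stated length range, so it is best read as a floor for the annotation budget.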

Annotation:

Tool: CVAT (self-hosted Docker, free, open-source, COCO-format export). Roboflow is the obvious alternative — easier UI, free tier limited to 10K images for public projects; LBZF data is confidential (Confidencial — Uso Interno) so the dataset can’t be Roboflow-public-tier. Decision: CVAT, self-hosted on the same Jetson during off-hours, or on a laptop.
Labels: only person. Phase I doesn’t need machine / fabric / garment classes.
Annotators: Sophia + Armando + (optional) ITBA team for inter-annotator agreement. Target Cohen’s κ ≥ 0.85 on a 50-frame overlap subset between annotators. (Paper-worthy.)
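
A minimal sketch of the agreement check on the overlap subset. Cohen’s κ is defined over categorical labels, so bounding-box annotations first have to be reduced to one label per item; here the label is assumed to be the number of person boxes each annotator drew in a frame. That reduction, the function name, and the toy data are assumptions, not a decision recorded in this plan.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy 50-frame overlap: per-frame person counts from two annotators (made-up data).
annotator_1 = [1] * 40 + [0] * 10
annotator_2 = [1] * 38 + [0] * 12  # disagrees with annotator_1 on two frames
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3), kappa >= 0.85)
```

Count-based agreement says nothing about whether the boxes themselves coincide, so a box-level check (for example, matching boxes at IoU 0.5 before comparing) may be worth reporting alongside κ.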

Train / val / test split:

Critical paper discipline: split must be by video / camera / day, not by frame. Naive random frame split leaks ROIs and operator identities across train and test, inflating apparent accuracy.
Proposed: 70/15/15 split, but with the constraint that no operation video appears in more than one split. With 41 videos that’s 29 train / 6 val / 6 test.
Phase-I post-install recordings: held out as a second test set (“deployment test set”) for the final paper result.
Hash the split assignment (dataset_split_v0.csv checked into the model repo) so the split is reproducible.
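
The split discipline above can be sketched as a deterministic assignment at the video level, serialized to CSV and fingerprinted with a hash. This is an illustration under assumptions: the video IDs, the seed, the column names, and the use of SHA-256 are placeholders, not the contents of dataset_split_v0.csv.

```python
import csv
import hashlib
import io
import random


def split_by_video(video_ids, n_val=6, n_test=6, seed=0):
    """Assign whole videos to train/val/test so no video spans two splits."""
    ids = sorted(video_ids)            # sort first so input order does not matter
    random.Random(seed).shuffle(ids)   # seeded shuffle for a repeatable assignment
    assignment = {}
    for i, vid in enumerate(ids):
        if i < n_test:
            assignment[vid] = "test"
        elif i < n_test + n_val:
            assignment[vid] = "val"
        else:
            assignment[vid] = "train"
    return assignment


def split_csv_and_hash(assignment):
    """Serialize the assignment to CSV text and return (text, sha256 hex digest)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["video_id", "split"])
    for vid in sorted(assignment):
        writer.writerow([vid, assignment[vid]])
    text = buf.getvalue()
    return text, hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical IDs standing in for the 41 operation videos.
video_ids = [f"video_{i:02d}" for i in range(41)]
assignment = split_by_video(video_ids)
csv_text, digest = split_csv_and_hash(assignment)
counts = {s: sum(1 for v in assignment.values() if v == s) for s in ("train", "val", "test")}
print(counts)  # {'train': 29, 'val': 6, 'test': 6}
```

Every extracted frame then inherits the split of its source video. Checking the CSV itself into the repo matters more than the seed, since shuffle behavior is not guaranteed to be identical across Python versions.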

B.2 — Training recipe

yolo detect train \
  model=yolov8n.pt \
  data=lbzf_person_v0.yaml \
  imgsz=640 \
  epochs=50 \
  batch=32 \
  optimizer=SGD \
  lr0=0.01 \
  cos_lr=True \
  patience=10 \
  fliplr=0.0 \
  mixup=0.0 \
  device=0

Augmentations: Ultralytics defaults (HSV jitter and mosaic on; mixup off, which is also the Ultralytics default), with one override: horizontal flip disabled (fliplr=0.0; the Ultralytics default is 0.5). Operators on the assembly line are seated in a specific orientation relative to the machine, so left-right flips create unrealistic training examples. Re-evaluate this in Phase II if behavioral models care about handedness.

Validation metric: mAP@0.5 on the val split. Early-stopping patience = 10 epochs.

B.3 — Compute: AWS vs new laptop (the open decision)

Option	Up-front	Per-run (Stage B once)	Time to first train	Notes
AWS `g5.xlarge` (1× A10G, 24 GB)	$0 setup	~$1.00/hr on-demand. Fine-tune 50 epochs × 2,500 frames ≈ 2–3 hr. $3 per full retrain.	Hours, once AWS account is build-grade	Pairs with Andrew’s “full prod stack day 1” framing. Uses Sophia’s AWS credits if available (OPEN).
AWS `g6.xlarge` (1× L4, 24 GB)	$0 setup	$0.80/hr on-demand. $2.50 per full retrain.	Same	L4 is the newer, better $/perf option.
AWS SageMaker	$0 setup	More $/hr (~$1.41/hr `ml.g5.xlarge`); managed	Add a few hours	Pays for managed-ness Phase I doesn’t need.
New laptop — Apple M4 Max (16-inch, 48 GB)	~$3,500	$0/run, slower than A10G	1–3 weeks ordering + setup	Local; no cloud transfer cost; no PII exit from your machine.
Desktop (Linux + NVIDIA RTX 5090)	~$3,000	$0/run, faster than A10G	1–2 weeks	Higher peak; ugly form factor; you can’t carry it to Pereira.

Recommendation: AWS g6.xlarge for Stage B fine-tuning.

Reasoning:

Phase I fine-tune is at most a handful of full retrains before deployment. Even at 20 retrains × $3 = $60, cloud spend is about 2% of the cheapest hardware option’s ~$3,000 capex.
Andrew’s “full prod stack day 1” direction is consistent with cloud GPU.
Phase II will need more compute; AWS gets us a working IAM + pipeline ready for that.
The MacBook M1 remains adequate for everything else (annotation in CVAT, inference profiling against TensorRT engines built on the Jetson, paper-writing).
Caveat: the dataset leaves Sophia’s machine and enters AWS US-East. Coordinate with Agent E for the data-residency check under Colombian Law 1581/2012. If LBZF objects to data leaving Sophia’s local machine, we revisit (the local-laptop path becomes necessary).

Cost ceiling: budget $200 for Phase I AWS fine-tuning compute. If we exceed that, something has gone wrong.
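
The cost argument reduces to two numbers: how many full retrains fit under the ceiling, and how many retrains it would take before buying hardware breaks even. The sketch below uses only the per-retrain costs, the ceiling, and the capex figures quoted in this section; the function names are hypothetical, and it ignores storage, data transfer, and the resale value of hardware.

```python
import math


def retrains_within(budget_usd, cost_per_retrain_usd):
    """Whole retrains that fit inside a fixed budget."""
    return int(budget_usd // cost_per_retrain_usd)


def break_even_retrains(capex_usd, cost_per_retrain_usd):
    """Retrains needed before cloud spend equals a one-time hardware purchase."""
    return math.ceil(capex_usd / cost_per_retrain_usd)


CEILING_USD = 200
for name, per_retrain in (("g5.xlarge", 3.00), ("g6.xlarge", 2.50)):
    print(
        name,
        retrains_within(CEILING_USD, per_retrain),   # 66 for g5, 80 for g6
        break_even_retrains(3000, per_retrain),      # 1000 for g5, 1200 for g6
    )
```

Under these figures the ceiling allows several times more retrains than the 20 used as a pessimistic case above, and break-even sits around a thousand retrains.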

Account-prep blockers (chase before any training):

OPEN[Sophia, by 2026-05-20]: AWS account with billing + build-grade IAM (the earlier read-only IAM sketched in the 2026-04-29 Sam meeting is insufficient).
OPEN[Sophia, by 2026-05-20]: S3 bucket policy for the dataset upload (private; SSE-KMS at minimum).
OPEN[Sophia, by 2026-05-20]: Apply for AWS credits via Anthropic/Cloudflare/InspiritAI startup channels if available.

B.4 — Deliverables of Stage B

If Stage B runs:

yolov8n_lbzf_v0.pt (Ultralytics PyTorch checkpoint)
yolov8n_lbzf_v0.engine (TensorRT, built on the Jetson against the actual JetPack version)
lbzf_person_v0.yaml (dataset config)
dataset_split_v0.csv (reproducible split hash)
train_run_v0.tensorboard / wandb run link
eval_v0.json (mAP, precision, recall, per-split breakdown)
All stored per reproducibility-and-artifacts.md.

Stage C — periodic retraining

After Phase I goes live:

Once per month, sample 200 frames from the Phase-I rolling buffer that the production model marked quality=red or quality=yellow. Label them, add to training set, retrain. (Active-learning-lite.)
Version bumps v0 → v0.1 → v0.2. Each new model is A/B tested in shadow mode against the production model for 24 hours before swap.
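
A minimal sketch of the monthly sampling step, assuming each buffered frame has a record carrying the production model’s quality flag. The record layout (`frame_id`, `quality`), the function name, and the seed are hypothetical; only the 200-frame quota and the red/yellow filter come from the plan.

```python
import random


def sample_for_relabel(records, k=200, seed=0, flags=("red", "yellow")):
    """Pick up to k flagged frames, reproducibly, for the monthly labeling batch."""
    flagged = [r for r in records if r["quality"] in flags]
    if len(flagged) <= k:
        return flagged  # fewer flagged frames than the quota: take them all
    return random.Random(seed).sample(flagged, k)


# Toy buffer: 1,000 frame records, half of them flagged yellow or red.
qualities = ["green", "green", "yellow", "red"]
buffer_records = [{"frame_id": i, "quality": qualities[i % 4]} for i in range(1000)]
batch = sample_for_relabel(buffer_records)
print(len(batch), {r["quality"] for r in batch})
```

Recording the seed and the sampled frame IDs with each retrain keeps the v0.1, v0.2, ... training sets reconstructible.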

Alternatives considered

Alt	Why rejected
No measurement, just fine-tune anyway	Wastes a week of annotation if Stage A says off-shelf is fine. Also weakens the paper — without a baseline you can’t claim fine-tuning helped.
Roboflow public tier	Free but dataset becomes public; LBZF is `Confidencial`. Roboflow private tier is ~$249/user/mo — not worth it vs self-hosted CVAT.
LabelImg instead of CVAT	Adequate for 250-frame Stage A, but doesn’t scale to the 2,500-frame Stage B set; no team annotation features.
YOLO-NAS / DETR / Grounding DINO	NAS has license complications; DETR is heavier on Orin Nano; Grounding DINO is the right Phase II move (text-promptable) but well over budget for Phase I’s 3–5 fps × 2-cam load when richer per-frame work has to happen.
Just train on Roboflow’s public retail/factory datasets	Domain mismatch; LBZF apparel line has specific occlusions and operator posture not represented in retail-store datasets.
Fine-tune YOLOv8s instead of v8n	Higher capacity, higher latency. If Stage A shows v8n is on the edge of failing, this is a reasonable fallback. Premature commitment otherwise.
Bigger laptop (the Apple M4 Max path)	See B.3 — capex doesn’t pencil for Phase I scale of training. Revisit if Phase II proves we need continuous local iteration.

Open questions

OPEN[Sophia, by 2026-05-18]: Run Stage A before the Argentina trip — having the off-shelf-baseline numbers in hand makes the ITBA conversation 10× more concrete.
OPEN[Andrew, by 2026-05-20]: Cloud GPU: specifically g5 or g6? (Probably indifferent; we just need confirmation that Andrew agrees with the $200 ceiling.)
OPEN[Mariana / Agent E, by 2026-06-01]: Is dataset transit to AWS US-East acceptable, or does LBZF require data-stays-in-Colombia? Major consequence — flips the recommendation toward local-laptop.
OPEN[ITBA, by 2026-06-15]: Do ITBA researchers want a copy of the dataset for their parallel work? Affects how we share (S3 bucket + IAM cross-account, or signed-URL one-time download).
OPEN[Ronald via Armando, by 2026-05-30]: Were the 41 operation videos staged or recorded during normal production? Affects training-data validity — staged demonstrations don’t reflect real operator behavior (Hawthorne-light).
OPEN[Sophia, paper]: Phase II will fine-tune for behaviors. Are we able to reuse the Phase I dataset at all, or is it just person-bboxes that don’t carry over? (Likely: bboxes carry, but you’ll re-annotate with phone/food classes — see phase-ii-preview.md.)

Cross-bucket dependencies

Agent B (backend): cycle-event rows must include model_version so post-hoc analyses can correlate behavior with the deployed model. Already requested in cycle-event-detection.md.
Agent C (hardware): TensorRT engine builds must run on the actual deployed Jetson hardware + JetPack version. Confirm JetPack target before Stage B finishes (engines are not portable).
Agent E (business/legal): dataset privacy + AWS data-residency posture; consent for using training frames showing identifiable operators. Coordinate with the validation-methodology IRB section.
Agent A (frontend): dashboard’s “model version” badge in the footer; visible to Ronald so he knows which model is producing the day’s numbers.

What’s weak in this doc

Stage A’s 250-frame sample size is small. Statistically, you cannot estimate a recall around 0.95 with tight confidence intervals from 250 frames. You can rule out “catastrophic failure” but you cannot distinguish 0.92 from 0.96. The paper version of this experiment needs ≥ 1,000 labeled frames. The 250 number here is a pre-Pereira sanity check, not a publishable result.
The “labels: only person” stance leaves Phase II annotation labor on the floor. While annotators are looking at each frame, they could also label phones, food, machines, fabric bundles — for ~30% extra time. Not specified here; Phase II will probably regret it.
No data-augmentation experiments. “Ultralytics defaults” is a defensible starting point but not a result. Without an ablation (mosaic on/off, mixup on/off, custom augmentations vs none) the paper cannot claim the augmentation choices were considered.
The “no horizontal flip” claim is asserted as obviously true. Some operators may be left-handed at machines that are nominally right-handed setups, or the same operation may exist on mirror-image machines elsewhere in the factory. A defensible spec would test flip-on vs flip-off and report.
AWS recommendation does not address VPN / Tailscale-only access to the dataset bucket. If Sophia’s connection has the Firewalla-blocks-GitHub problem already documented, getting credentials and SDKs onto her laptop is its own setup story. Not specified.
No mention of model distillation for Phase II tier change. When Phase II moves to NX 32GB AGX, we might want to distill the production behavioral models down. The Phase I dataset is the substrate for that, but no preservation policy is specified.
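
The sample-size weakness noted above can be made concrete with a binomial confidence interval. The sketch below uses the Wilson score interval, which is one reasonable choice and not something this plan prescribes; the observed recall of 0.952 is a made-up illustration, and the calculation treats frames as independent.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Same observed recall (0.952) at the Stage A sample size and at the paper target.
for n, hits in ((250, 238), (1000, 952)):
    lo, hi = wilson_interval(hits, n)
    print(n, round(lo, 3), round(hi, 3))
```

At 250 frames the interval runs from roughly 0.918 to 0.972, so it contains both 0.92 and 0.96; at 1,000 frames it narrows enough to exclude 0.92. Because frames drawn from the same video are correlated, the effective sample size is smaller than the frame count and the real intervals are wider than these.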

Rollout

Date	Gate
2026-05-15	Stage A measurement complete on 250-frame sample. Decision: fine-tune yes/no.
2026-05-20	AWS account build-grade IAM + S3 bucket provisioned; if fine-tune=yes, CVAT instance up.
2026-06-01	First Stage B annotation batch (500 frames) complete; first training run started.
2026-06-15	`yolov8n_lbzf_v0.engine` deployed to Jetson; shadow-mode A/B vs off-shelf for 24 h.
2026-07-01	Deployment in Pereira uses whichever (off-shelf or v0) won the A/B.
2026-08-01	Stage C monthly retraining cadence begins.

Paper alignment

Methods: dataset construction, split discipline, annotator agreement (Cohen’s κ), training recipe — Section 3.3.
Experimental setup: Stage A baseline table (off-shelf YOLOv8n on the 250-frame sample) — Table 1.
Results: per-class mAP and recall, off-shelf vs fine-tuned — Table 4. Bonus: a “did fine-tuning help on which operations?” breakdown — Fig. 6.
Limitations: small dataset, staged-vs-real-production uncertainty, single-line generalization gap.
Replicability angle: open dataset release decision lives in reproducibility-and-artifacts.md; if cleared, this becomes a stronger paper.