Training & fine-tuning plan
Bucket: technical/ml (Agent D) · Status: reviewed (Phase B: 3–5 fps per ADR-004) · Owner: Sophia Mann · Phase: I → II · Last updated: 2026-05-12
Context
The v1.9 plan (2026-04-27) commits to YOLOv8n + TensorRT, person class only, off-shelf COCO-pretrained. Andrew Kent’s 2026-05-02 direction reinforced this: “skip custom training in Phase I; start with off-shelf models.” The Granola transcription of that meeting renders Andrew’s model family as “VLMs” but this is widely suspected to be a transcription artifact (Granola also transcribes the project name as “LBCF” — see docs/references/meetings.md).
The off-shelf-COCO position is well-defended for cycle counting. COCO’s person class is exactly the right tool for “is a human standing/sitting in this ROI.” But Phase I’s validation methodology can benefit from a small amount of fine-tuning if the COCO model’s recall drops on actually-sewing operators in cluttered apparel-line scenes (fabric on hands, leaning over machine, partial occlusion by sewing-machine head).
Outstanding compute blocker: Sophia’s MacBook Air M1 is insufficient for any non-trivial fine-tuning. Two paths under discussion: (a) AWS GPU credits, (b) new laptop with discrete GPU. Andrew leans AWS. This doc closes that decision.
Goals
- G1: Establish a clear “no-fine-tune baseline” measurement of off-shelf YOLOv8n on Pereira-like footage before the trip, so we know whether fine-tuning is needed at all.
- G2: If fine-tuning is needed, deliver `yolov8n_lbzf_v0.engine` against a documented, reproducible training set with proper train/val/test discipline.
- G3: Make a concrete recommendation on compute (AWS vs new laptop) with cost figures.
- G4: Define dataset construction discipline strict enough that the paper can rest on it (no train/test contamination, no Sophia-only annotation).
Non-goals
- Phase II behavioral models (phone use, eating, talking) — see `phase-ii-preview.md`. This doc covers only person-detection fine-tuning for Phase I cycle counting.
- Garment-type classification (Phase II/III).
- Active learning, semi-supervised, self-supervised regimes. Phase I budget says “label what you must, ship.”
Proposed approach
Stage A — measure first (the “do we even need to fine-tune” experiment)
Before committing any annotation labor or AWS budget, run a structured comparison on a pre-Pereira sample.
- From “Shared by Ronald 4/15/videos/” (folder `1mK-4my0gqGipGd5Zb0aKA1HG9vSCoxd6`, ~41 operation videos), sample 3 videos per operation category (Collar, Cuffs, Back/yoke, Front, Assembly, Buttonholes & buttons, Inspection & pack — 7 categories × 3 = 21 videos).
- From each sampled video, extract 1 frame every 5 seconds for a 60-second window centered on the operator actively working. ~12 frames × 21 videos = ~250 frames.
- Manually label `person` bboxes in these 250 frames (Sophia + Armando, ~2 hours of work; CVAT or LabelImg, both free).
- Run COCO-pretrained YOLOv8n (no fine-tune) over the 250 frames. Compute:
  - Per-frame recall @ IoU 0.5 (does the detector find the operator at all?)
  - Per-frame confidence distribution for the highest-conf person bbox
  - False-positive count (detections outside the operator)
If recall ≥ 0.95 and median confidence ≥ 0.55, the off-shelf model is fine; skip Stage B for Phase I.
If recall drops below 0.95 — likely failure modes: operator heavily occluded by machine, hands-only visible, mannequin / dress-form in frame mistaken for person — proceed to Stage B.
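The three Stage A measurements and the skip/proceed rule can be sketched in a few lines of plain Python. This is a minimal sketch, not the project's evaluation code: it assumes exactly one labeled operator box per frame, treats every detection that does not overlap the operator at IoU ≥ 0.5 as a false positive, and uses a hypothetical input layout (`frames` as a list of `(operator_box, detections)` pairs). The function name `stage_a_summary` is illustrative.

```python
from statistics import median

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def stage_a_summary(frames, iou_thr=0.5, min_recall=0.95, min_median_conf=0.55):
    """frames: list of (operator_box, detections); detections = [(box, conf), ...]."""
    if not frames:
        raise ValueError("no frames to evaluate")
    hits, top_confs, false_positives = 0, [], 0
    for operator_box, detections in frames:
        matched = [d for d in detections if iou(operator_box, d[0]) >= iou_thr]
        hits += bool(matched)  # the frame counts as a hit if any detection matches
        false_positives += len(detections) - len(matched)
        # Highest-confidence person box in the frame; 0.0 if nothing was detected.
        top_confs.append(max((conf for _, conf in detections), default=0.0))
    recall = hits / len(frames)
    med = median(top_confs)
    return {
        "recall": recall,
        "median_top_conf": med,
        "false_positives": false_positives,
        "skip_stage_b": recall >= min_recall and med >= min_median_conf,
    }


# Two made-up frames: one clean hit, one miss with a stray detection.
example = [
    ((100, 100, 200, 300), [((105, 110, 198, 295), 0.81)]),
    ((100, 100, 200, 300), [((400, 50, 480, 250), 0.40)]),
]
print(stage_a_summary(example))
```

In practice the detections would come from running the COCO-pretrained model over the labeled frames and keeping only the person class; that step is left out here so the metric logic stays independent of any particular inference library.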
This is a paper-worthy experiment in its own right: “do you need to fine-tune COCO YOLOv8 for an apparel line?” The answer is a contribution.
Stage B — fine-tuning plan (only if Stage A says we need to)
B.1 — Dataset construction
Sources:
- 41 operation videos (“Shared by Ronald 4/15/videos/”). ~2–5 min each. Spanish operation names in filenames.
- 16 plant photos (“Shared by Ronald 4/15”). Wide shots of the floor, mostly people-in-context.
- Phase-I recording window (Jul 2026): from the 4TB rolling buffer, we will have continuous footage from cameras in actual install positions.
Frame extraction:
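- From the 41 videos: 1 frame / 2 s ≈ 60 frames for a 2-minute video × 41 ≈ 2,500 frames. This is a lower bound: the videos run ~2–5 min, so a 5-minute clip yields ~150 frames.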
- From the 16 photos: 16 frames.
- From Phase-I (post-install) recordings: 1 frame / 30 s × 8 hours/day = 960 frames per camera per day; × 6 cameras = 5,760 frames/day; × 5 days = ~28,800 frames. (Adds to the dataset for v0.1 retrains, not v0.0.)
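The frame budgets above are simple interval arithmetic, and it is worth checking them in code before sizing the annotation effort. The sketch below uses only the parameters stated in the list (sampling intervals, 41 videos of roughly 2 to 5 minutes, 6 cameras, 8 hours per day, 5 days); the helper name `frames_at_interval` is illustrative.

```python
def frames_at_interval(duration_s: float, interval_s: float) -> int:
    """Number of frames when sampling one frame every `interval_s` seconds."""
    return int(duration_s // interval_s)


# Ronald's videos: 41 clips of roughly 2-5 minutes, one frame every 2 s.
video_low = 41 * frames_at_interval(2 * 60, 2)   # if every clip were 2 minutes
video_high = 41 * frames_at_interval(5 * 60, 2)  # if every clip were 5 minutes

# Phase-I recordings: one frame every 30 s, 6 cameras, 8 h/day, 5 days.
per_camera_day = frames_at_interval(8 * 3600, 30)
per_day = per_camera_day * 6
five_days = per_day * 5

print(f"videos: {video_low} to {video_high} frames")
print(f"recordings: {per_camera_day}/camera/day, {per_day}/day, {five_days} over 5 days")
```

With these inputs, 5,760 is the per-day total across six cameras, and the five-day total is 28,800. The video estimate of about 2,500 frames corresponds to the short end of the clip-length range.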
Annotation:
- Tool: CVAT (self-hosted Docker, free, open-source, COCO-format export). Roboflow is the obvious alternative — easier UI, free tier limited to 10K images for public projects; LBZF data is confidential (`Confidencial — Uso Interno`) so the dataset can’t be Roboflow-public-tier. Decision: CVAT, self-hosted on the same Jetson during off-hours, or on a laptop.
- Labels: only `person`. Phase I doesn’t need machine / fabric / garment classes.
- Annotators: Sophia + Armando + (optional) ITBA team for inter-annotator agreement. Target Cohen’s κ ≥ 0.85 on a 50-frame overlap subset between annotators. (Paper-worthy.)
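The κ ≥ 0.85 target can be checked with a short function once both annotators' decisions are reduced to categorical labels over the same items. How bounding boxes are reduced to such items is not specified in this plan; the sketch below assumes one binary judgment per item (for example, "person present in this ROI of this frame"), and the 50 labels in the example are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical 50-item overlap: the annotators disagree on 4 items.
annotator_1 = [1] * 40 + [0] * 10
annotator_2 = [1] * 38 + [0] * 2 + [0] * 8 + [1] * 2
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

In this made-up case raw agreement is 92% but κ comes out near 0.75, below the 0.85 target, because most items are positives and chance agreement is already high. That gap is the reason to report κ, not raw agreement.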
Train / val / test split:
- Critical paper discipline: split must be by video / camera / day, not by frame. Naive random frame split leaks ROIs and operator identities across train and test, inflating apparent accuracy.
- Proposed: 70/15/15 split, but with the constraint that no operation video appears in more than one split. With 41 videos that’s 29 train / 6 val / 6 test.
- Phase-I post-install recordings: held out as a second test set (“deployment test set”) for the final paper result.
- Hash the split assignment (`dataset_split_v0.csv` checked into the model repo) so the split is reproducible.
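One way to get a video-level split that is both leak-free and reproducible is to rank videos by a salted hash of their filename and cut the ranking at 29 / 6 / 6. The sketch below is an assumption about how `dataset_split_v0.csv` could be produced, not the project's actual script: the salt string, the CSV columns, and the placeholder filenames are all hypothetical. Every frame would inherit the split of the video it came from.

```python
import csv
import hashlib
import io


def split_videos(video_names, n_train=29, n_val=6, salt="lbzf_person_v0"):
    """Assign whole videos to train/val/test, independent of input order."""
    def key(name):
        return hashlib.sha256(f"{salt}:{name}".encode("utf-8")).hexdigest()

    ranked = sorted(set(video_names), key=key)
    assignment = {}
    for i, name in enumerate(ranked):
        if i < n_train:
            assignment[name] = "train"
        elif i < n_train + n_val:
            assignment[name] = "val"
        else:
            assignment[name] = "test"
    return assignment


def split_csv_and_digest(assignment):
    """Render the assignment as CSV text and return it with its SHA-256 digest."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["video", "split"])
    for name in sorted(assignment):
        writer.writerow([name, assignment[name]])
    text = buf.getvalue()
    return text, hashlib.sha256(text.encode("utf-8")).hexdigest()


videos = [f"op_{i:02d}.mp4" for i in range(41)]  # placeholder file names
assignment = split_videos(videos)
csv_text, digest = split_csv_and_digest(assignment)
print(digest)
```

Recording the digest next to each training run makes it easy to verify later that a reported result used the same split. A hash ranking ignores operation category, so if the test set must cover all seven categories, a stratified variant would be needed.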
B.2 — Training recipe
```bash
# Fine-tune COCO-pretrained YOLOv8n on the person-only LBZF dataset.
# fliplr=0.0 disables random horizontal flips (see Augmentations).
yolo detect train \
  model=yolov8n.pt \
  data=lbzf_person_v0.yaml \
  imgsz=640 \
  epochs=50 \
  batch=32 \
  optimizer=SGD \
  lr0=0.01 \
  cos_lr=True \
  patience=10 \
  fliplr=0.0 \
  device=0
```

Augmentations: Ultralytics defaults (HSV jitter, mosaic, mixup off for person-only). One deviation from the defaults: random horizontal flip is turned OFF (`fliplr=0.0`) for the assembly line, because operators are seated in a specific orientation relative to the machine and left-right flips create unrealistic training examples. Re-evaluate this in Phase II if behavioral models care about handedness.
Validation metric: mAP@0.5 on the val split. Early-stopping patience = 10 epochs.
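The same recipe can be driven from Python, which makes it easier to log the exact hyperparameters alongside the run. This is a sketch under the assumption that the `ultralytics` package is installed and a CUDA device is available. The `TRAIN_ARGS` name and the `train` wrapper are illustrative, and `fliplr=0.0` is how the "horizontal flip OFF" choice described above would be expressed.

```python
TRAIN_ARGS = {
    "data": "lbzf_person_v0.yaml",
    "imgsz": 640,
    "epochs": 50,
    "batch": 32,
    "optimizer": "SGD",
    "lr0": 0.01,
    "cos_lr": True,
    "patience": 10,  # stop early after 10 epochs without improvement
    "device": 0,
    "fliplr": 0.0,   # disable random horizontal flip
    "mixup": 0.0,    # keep mixup off for the person-only model
}


def train(weights: str = "yolov8n.pt"):
    """Run fine-tuning; needs the `ultralytics` package and a GPU at call time."""
    # Imported lazily so the config above can be inspected without the dependency.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(**TRAIN_ARGS)
```

Calling `train()` on the GPU instance starts the run. Keeping the arguments in one dictionary also gives a single object to dump into the run's metadata.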
B.3 — Compute: AWS vs new laptop (the open decision)
| Option | Up-front | Per-run (Stage B once) | Time to first train | Notes |
|---|---|---|---|---|
| AWS g5.xlarge (1× A10G, 24 GB) | $0 setup | ~$1.00/hr on-demand; about $3 per fine-tune run (the figure used in the reasoning below) | Hours, once AWS account is build-grade | Pairs with Andrew’s “full prod stack day 1” framing. Uses Sophia’s AWS credits if available (OPEN). |
| AWS g6.xlarge (1× L4, 24 GB) | $0 setup | Same | | L4 is the newer, better $/perf option. |
| AWS SageMaker | $0 setup | More $/hr (~$1.41/hr ml.g5.xlarge); managed | Add a few hours | Pays for managed-ness Phase I doesn’t need. |
| New laptop — Apple M4 Max (16-inch, 48 GB) | ~$3,500 | $0/run, slower than A10G | 1–3 weeks ordering + setup | Local; no cloud transfer cost; no PII exit from your machine. |
| Local hardware — Linux desktop + NVIDIA RTX 5090 | ~$3,000 | $0/run, faster than A10G | 1–2 weeks | Higher peak; ugly form factor; you can’t carry it to Pereira. |
Recommendation: AWS g6.xlarge for Stage B fine-tuning.
Reasoning:
- Phase I fine-tune is at most a handful of full retrains before deployment. Even at 20 retrains × $3 = $60, the laptop’s $3,000 capex is silly.
- Andrew’s “full prod stack day 1” direction is consistent with cloud GPU.
- Phase II will need more compute; AWS gets us a working IAM + pipeline ready for that.
- The MacBook M1 remains adequate for everything else (annotation in CVAT, inference profiling against TensorRT engines built on the Jetson, paper-writing).
- Caveat: the dataset leaves Sophia’s machine and enters AWS US-East. Coordinate with Agent E for the data-residency check under Colombian Law 1581/2012. If LBZF objects to data leaving Sophia’s local machine, we revisit (the local-laptop path becomes necessary).
Cost ceiling: budget $200 for Phase I AWS fine-tuning compute. If we exceed that, something has gone wrong.
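The capex-versus-cloud argument is a one-line calculation, sketched here so the assumptions are explicit. It uses the document's own figures ($3 per retrain, $200 ceiling, $3,000 hardware) and ignores storage, data transfer, and idle instance time, which would add to the cloud side. The function name is illustrative.

```python
def phase1_compute_cost(retrains, usd_per_retrain=3.0, ceiling_usd=200.0, capex_usd=3000.0):
    """Compare cloud fine-tuning spend with the budget ceiling and a hardware purchase."""
    total = retrains * usd_per_retrain
    return {
        "total_usd": total,
        "within_ceiling": total <= ceiling_usd,
        "retrains_within_ceiling": int(ceiling_usd // usd_per_retrain),
        "retrains_to_match_capex": int(capex_usd // usd_per_retrain),
    }


print(phase1_compute_cost(20))
```

At $3 per run, the $200 ceiling covers 66 retrains, and it would take about 1,000 retrains to spend what the local hardware costs up front.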
Account-prep blockers (chase before any training):
- OPEN[Sophia, by 2026-05-20]: AWS account with billing + build-grade IAM (the earlier read-only IAM sketched in 2026-04-29 Sam meeting is insufficient).
- OPEN[Sophia, by 2026-05-20]: S3 bucket policy for the dataset upload (private; SSE-KMS at minimum).
- OPEN[Sophia, by 2026-05-20]: Apply for AWS credits via Anthropic/Cloudflare/InspiritAI startup channels if available.
B.4 — Deliverables of Stage B
If Stage B runs:
- `yolov8n_lbzf_v0.pt` (Ultralytics PyTorch checkpoint)
- `yolov8n_lbzf_v0.engine` (TensorRT, built on the Jetson against the actual JetPack version)
- `lbzf_person_v0.yaml` (dataset config)
- `dataset_split_v0.csv` (reproducible split hash)
- `train_run_v0.tensorboard` / wandb run link
- `eval_v0.json` (mAP, precision, recall, per-split breakdown)
- All stored per `reproducibility-and-artifacts.md`.
Stage C — periodic retraining
After Phase I goes live:
- Once per month, sample 200 frames from the Phase-I rolling buffer that the production model marked `quality=red` or `quality=yellow`. Label them, add to training set, retrain. (Active-learning-lite.)
- Version bumps `v0` → `v0.1` → `v0.2`. Each new model is A/B tested in shadow mode against the production model for 24 hours before swap.
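The monthly selection step can be made reproducible with a seeded sampler, so the same 200 frames can be regenerated if the labeling batch is ever questioned. The record layout below (dicts with `frame_id` and `quality` keys) is a hypothetical stand-in for whatever the rolling buffer actually exposes, and the seed value is arbitrary.

```python
import random


def monthly_retrain_sample(frames, k=200, seed=0):
    """Pick up to k frames whose quality flag is red or yellow, reproducibly."""
    flagged = sorted(f["frame_id"] for f in frames if f["quality"] in {"red", "yellow"})
    rng = random.Random(seed)  # local generator; does not disturb global random state
    return rng.sample(flagged, min(k, len(flagged)))


# Made-up buffer index: 900 frames cycling through the three quality flags.
buffer_index = [
    {"frame_id": f"cam1_{i:05d}", "quality": ("red", "yellow", "green")[i % 3]}
    for i in range(900)
]
picked = monthly_retrain_sample(buffer_index, seed=202608)
print(len(picked), picked[:3])
```

Sorting the flagged IDs before sampling keeps the result independent of the order in which the buffer is scanned. Using a different seed each month (for example the year and month) avoids re-drawing the same frames.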
Alternatives considered
| Alt | Why rejected |
|---|---|
| No measurement, just fine-tune anyway | Wastes a week of annotation if Stage A says off-shelf is fine. Also weakens the paper — without a baseline you can’t claim fine-tuning helped. |
| Roboflow public tier | Free but dataset becomes public; LBZF is Confidencial. Roboflow private tier is ~$249/user/mo — not worth it vs self-hosted CVAT. |
| LabelImg instead of CVAT | Adequate for 250-frame Stage A, but doesn’t scale to the 2,500-frame Stage B set; no team annotation features. |
| YOLO-NAS / DETR / Grounding DINO | NAS has license complications; DETR is heavier on Orin Nano; Grounding DINO is the right Phase II move (text-promptable) but well over budget for Phase I’s 3–5 fps × 2-cam load when richer per-frame work has to happen. |
| Just train on Roboflow’s public retail/factory datasets | Domain mismatch; LBZF apparel line has specific occlusions and operator posture not represented in retail-store datasets. |
| Fine-tune YOLOv8s instead of v8n | Higher capacity, higher latency. If Stage A shows v8n is on the edge of failing, this is a reasonable fallback. Premature commitment otherwise. |
| Bigger laptop (the Apple M4 Max path) | See B.3 — capex doesn’t pencil for Phase I scale of training. Revisit if Phase II proves we need continuous local iteration. |
Open questions
- OPEN[Sophia, by 2026-05-18]: Run Stage A before the Argentina trip — having the off-shelf-baseline numbers in hand makes the ITBA conversation 10× more concrete.
- OPEN[Andrew, by 2026-05-20]: Cloud GPU specifically `g5` or `g6`? (Probably indifferent; just need confirmation Andrew agrees with $200 ceiling.)
- OPEN[Mariana / Agent E, by 2026-06-01]: Is dataset transit to AWS US-East acceptable, or does LBZF require data-stays-in-Colombia? Major consequence — flips the recommendation toward local-laptop.
- OPEN[ITBA, by 2026-06-15]: Do ITBA researchers want a copy of the dataset for their parallel work? Affects how we share (S3 bucket + IAM cross-account, or signed-URL one-time download).
- OPEN[Ronald via Armando, by 2026-05-30]: Were the 41 operation videos staged or recorded during normal production? Affects training-data validity — staged demonstrations don’t reflect real operator behavior (Hawthorne-light).
- OPEN[Sophia, paper]: Phase II will fine-tune for behaviors. Are we able to reuse the Phase I dataset at all, or is it just person-bboxes that don’t carry over? (Likely: bboxes carry, but you’ll re-annotate with phone/food classes — see `phase-ii-preview.md`.)
Cross-bucket dependencies
- Agent B (backend): cycle-event rows must include `model_version` so post-hoc analyses can correlate behavior with the deployed model. Already requested in `cycle-event-detection.md`.
- Agent C (hardware): TensorRT engine builds must run on the actual deployed Jetson hardware + JetPack version. Confirm JetPack target before Stage B finishes (engines are not portable).
- Agent E (business/legal): dataset privacy + AWS data-residency posture; consent for using training frames showing identifiable operators. Coordinate with the validation-methodology IRB section.
- Agent A (frontend): dashboard’s “model version” badge in the footer; visible to Ronald so he knows which model is producing the day’s numbers.
What’s weak in this doc
- Stage A’s 250-frame sample size is small. Statistically, you cannot estimate a recall around 0.95 with tight confidence intervals from 250 frames. You can rule out “catastrophic failure” but you cannot distinguish 0.92 from 0.96. The paper version of this experiment needs ≥ 1,000 labeled frames. The 250 number here is a pre-Pereira sanity check, not a publishable result.
- The “labels: only person” stance leaves Phase II annotation labor on the floor. While annotators are looking at each frame, they could also label phones, food, machines, fabric bundles — for ~30% extra time. Not specified here; Phase II will probably regret it.
- No data-augmentation experiments. “Ultralytics defaults” is a defensible starting point but not a result. Without an ablation (mosaic on/off, mixup on/off, custom augmentations vs none) the paper cannot claim the augmentation choices were considered.
- The “no horizontal flip” claim is asserted as obviously-true. Some operators may be left-handed at machines that are nominally right-handed setups, or the same operation may exist on mirror-image machines elsewhere in the factory. A defensible spec would test flip-on vs flip-off and report.
- AWS recommendation does not address VPN / Tailscale-only access to the dataset bucket. If Sophia’s connection has the Firewalla-blocks-GitHub problem already documented, getting credentials and SDKs onto her laptop is its own setup story. Not specified.
- No mention of model distillation for Phase II tier change. When Phase II moves to NX 32GB AGX, we might want to distill the production behavioral models down. The Phase I dataset is the substrate for that, but no preservation policy is specified.
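The sample-size concern in the first bullet can be made concrete with a Wilson score interval for a binomial proportion. The sketch below treats each frame as an independent trial, which is optimistic for frames drawn from the same video, so real intervals would be wider. The choice of the Wilson interval is an illustration, not something this plan prescribes.

```python
from math import sqrt


def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


for n in (250, 1000):
    lo, hi = wilson_interval(0.95, n)
    print(f"n={n}: 95% interval for an observed recall of 0.95 is [{lo:.3f}, {hi:.3f}]")
```

At 250 frames the interval runs from roughly 0.92 to 0.97, so both 0.92 and 0.96 remain plausible. At 1,000 frames it narrows to roughly 0.93 to 0.96, which is why the larger sample is needed before the recall figure can carry weight in the paper.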
Rollout
| Date | Gate |
|---|---|
| 2026-05-15 | Stage A measurement complete on 250-frame sample. Decision: fine-tune yes/no. |
| 2026-05-20 | AWS account build-grade IAM + S3 bucket provisioned; if fine-tune=yes, CVAT instance up. |
| 2026-06-01 | First Stage B annotation batch (500 frames) complete; first training run started. |
| 2026-06-15 | yolov8n_lbzf_v0.engine deployed to Jetson; shadow-mode A/B vs off-shelf for 24 h. |
| 2026-07-01 | Deployment in Pereira uses whichever (off-shelf or v0) won the A/B. |
| 2026-08-01 | Stage C monthly retraining cadence begins. |
Paper alignment
- Methods: dataset construction, split discipline, annotator agreement (Cohen’s κ), training recipe — Section 3.3.
- Experimental setup: Stage A baseline table (off-shelf YOLOv8n on the 250-frame sample) — Table 1.
- Results: per-class mAP and recall, off-shelf vs fine-tuned — Table 4. Bonus: a “did fine-tuning help on which operations?” breakdown — Fig. 6.
- Limitations: small dataset, staged-vs-real-production uncertainty, single-line generalization gap.
- Replicability angle: open dataset release decision lives in `reproducibility-and-artifacts.md`; if cleared, this becomes a stronger paper.