YOLOv8n + TensorRT (person class only) for Phase I detection

Context

The v1.9 plan committed to YOLOv8n person-class detection on the Jetson Orin Nano Super, compiled to a TensorRT engine to reduce inference latency. Andrew Kent (mentor since 2026-05-02) suggested in conversation that we “skip custom training in Phase I; start with off-shelf models”. Granola transcribed the suggested model family as “VLMs” (vision-language models), which is a meaningfully different family from YOLO.

Two issues with switching:

Granola transcription is known to be unreliable (e.g., it rendered lbzfai as LBCF in the same session), so “VLMs” may not be what Andrew actually said.
VLMs on edge hardware in 2026 are not a drop-in replacement for YOLO on a Jetson Orin Nano Super — inference latency, memory footprint, and tooling maturity all differ.

Decision

Stay on YOLOv8n + TensorRT (person class only) for Phase I. Defer any VLM evaluation to Phase II.

Why:

Off-shelf COCO-pretrained YOLOv8n meets the Phase I requirement (detect a person in a workstation ROI to time cycle start/stop) without custom training, satisfying Andrew’s “off-shelf only” framing.
Proven inference latency. YOLOv8n on Orin Nano Super has well-documented benchmarks and engine compilation pipelines.
Bench-test plan dependency. The pre-Argentina bench tests (G1) assume YOLOv8n — switching now would invalidate the demo path.
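The Phase I requirement above (a person in a workstation ROI starts and stops a cycle) can be sketched as a small state machine over per-frame person boxes. This is a minimal illustration, not the project's implementation: the names CycleTimer and box_center_in_roi, the centre-point ROI test, and the min_frames debounce are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def box_center_in_roi(box: Box, roi: Box) -> bool:
    """Return True if the centre of `box` lies inside `roi`."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    return roi[0] <= cx <= roi[2] and roi[1] <= cy <= roi[3]


@dataclass
class CycleTimer:
    """Hypothetical cycle timer: a cycle runs while a person occupies the ROI."""

    roi: Box
    min_frames: int = 3  # assumed debounce: frames needed to accept a change
    cycles: List[Tuple[float, float]] = field(default_factory=list)
    _present: bool = False
    _streak: int = 0
    _candidate_ts: Optional[float] = None
    _started_at: Optional[float] = None

    def update(self, timestamp: float, person_boxes: List[Box]) -> None:
        """Feed one frame's person detections (already filtered to class 'person')."""
        seen = any(box_center_in_roi(b, self.roi) for b in person_boxes)
        if seen == self._present:
            # Observation agrees with the current state: discard any pending change.
            self._streak = 0
            return
        if self._streak == 0:
            # First frame that disagrees: remember when the change began.
            self._candidate_ts = timestamp
        self._streak += 1
        if self._streak < self.min_frames:
            return
        # The change persisted long enough to be accepted.
        self._present = seen
        self._streak = 0
        if seen:
            self._started_at = self._candidate_ts
        else:
            self.cycles.append((self._started_at, self._candidate_ts))
            self._started_at = None


# Usage: one workstation ROI, boxes supplied per frame by the detector.
timer = CycleTimer(roi=(100.0, 100.0, 300.0, 300.0), min_frames=2)
inside = (150.0, 150.0, 250.0, 250.0)
outside = (400.0, 400.0, 500.0, 500.0)
frames = [(0.0, []), (1.0, [inside]), (2.0, [inside]), (3.0, [inside]),
          (4.0, [outside]), (5.0, [])]
for ts, boxes in frames:
    timer.update(ts, boxes)
print(timer.cycles)
```

The debounce is there because an off-shelf detector can drop a person for a frame or two; without it, each dropped frame would close one cycle and open another. The right min_frames value depends on frame rate and detection quality, and is not specified in this record.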

Consequences

Per-station fine-tuning deferred to Phase II. Phase I uses the off-shelf weights as-is. If detection quality is unacceptable for Pereira’s lighting/operator-position conditions during the July install, fine-tuning becomes a Phase I emergency task — not the current default.
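JetPack version pinning. Use a JetPack 6.x release that ships TensorRT 8.6, NOT 6.2. JetPack 6.2 ships TensorRT 10.x, which currently breaks INT8 quantization for YOLOv8 (per Ultralytics docs). JetPack 6.0 ships TRT 8.6 and matches working tutorials. Treat JetPack 6.1 with caution: NVIDIA's release notes list TensorRT 10.3 for it, so verify the installed TensorRT version before relying on 6.1.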
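Engine compiled on the Jetson itself. TensorRT engines are tied to the GPU and TensorRT version they were built with, so an engine built on another machine will not run on the Jetson (and a MacBook cannot build one at all, since TensorRT requires an NVIDIA GPU). Run the export on the Jetson; half=True builds an FP16 engine: yolo export model=yolov8n.pt format=engine half=True device=0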
VLM revisit trigger. If Phase II behavioral monitoring (phone use, eating, talking, absence) needs richer scene understanding than person detection provides, re-open this decision.
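The version pin and the on-device export above could be combined into a guarded export script along the following lines. This is a sketch under stated assumptions: the helper names (tensorrt_major, is_supported_tensorrt, export_engine) are hypothetical, and the "major version 8 or lower" threshold simply mirrors the TRT 8.6 pin in this record. The Ultralytics call is the Python equivalent of the yolo export command; it needs the ultralytics and tensorrt packages, so it is imported lazily and only makes sense on the Jetson.

```python
import re


def tensorrt_major(version: str) -> int:
    """Extract the major version number from a string such as '8.6.2'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised TensorRT version: {version!r}")
    return int(match.group(1))


def is_supported_tensorrt(version: str, max_major: int = 8) -> bool:
    """Assumed policy: accept TensorRT 8.x and earlier, reject 10.x."""
    return tensorrt_major(version) <= max_major


def export_engine(weights: str = "yolov8n.pt"):
    """Build an FP16 TensorRT engine on the local GPU; intended to run on the Jetson."""
    import tensorrt  # present on a flashed Jetson, usually absent elsewhere
    from ultralytics import YOLO

    if not is_supported_tensorrt(tensorrt.__version__):
        raise RuntimeError(
            f"TensorRT {tensorrt.__version__} is outside the pinned range; "
            "check the JetPack image before exporting."
        )
    # Mirrors: yolo export model=yolov8n.pt format=engine half=True device=0
    return YOLO(weights).export(format="engine", half=True, device=0)


# The guard itself needs no GPU and can be checked anywhere.
print(is_supported_tensorrt("8.6.2"))
print(is_supported_tensorrt("10.3.0"))
```

At inference time, Ultralytics accepts classes=[0] in predict to keep only the COCO person class, which matches the "person class only" scope of this decision.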

References

Andrew Kent transition memo, 2026-05-02
Granola transcript caveats noted in docs/design/40-prototype/architecture.md § “Post-v1.9 direction”