Cycle-event detection — spec

Bucket: technical/ml (Agent D) · Status: reviewed (Phase B seams applied: 3–5 fps per ADR-004; CV writer commits to direct SQLite) · Owner: Sophia Mann · Phase: I · Last updated: 2026-05-12

Context

Phase I’s product is a stream of cycle-time events in the cycle_events SQLite table (one row per garment-unit-cycle per workstation), feeding the dashboard and the Excel export that mirrors INDICADORES ABRIL.xlsx. The cycle event is the only thing the CV pipeline produces in Phase I — everything else (efficiency %, bottleneck flagging, behavioral codes) is downstream math.

A “cycle” at LBZF means one unit of garment work at a workstation — e.g., one collar topstitched at the PESPUNTAR CUELLO station. The Angela module’s reference garment (Ref22 Slim, 24.16 min total SAM, 21 workstations, target 49 units/hr) defines per-station Standard Allowed Minutes (SAMs) in the Balanceo tab of Ref22 Slim - Angela.xlsx. Per-cycle durations therefore range roughly 20 seconds to 4 minutes across stations (24.16 min / 21 ops ≈ 1.15 min mean; the slowest assembly steps run ~3–4 min, the fastest hem/press ops run well under a minute).

The v1.9 spec says: YOLOv8n + TensorRT, person class only (COCO class 0), rtspsrc → nvv4l2decoder → appsink → OpenCV. This doc defines what the inference loop does with the detections it gets — specifically how the start and end of a cycle are decided, and how the system avoids counting noise as cycles.

Naive approach: person enters station ROI → cycle starts; person leaves ROI → cycle ends. This is wrong in at least seven ways (operator leans back, talks to neighbor, supervisor enters frame, machine runs unattended, partial occlusion, hi-vis vest, lighting). The sections below specify the actual algorithm.

Goals

G1: Generate one cycle_events row per real garment unit produced at the workstation, with start_ts, end_ts, duration_seconds, and a confidence tag.
G2: Per-cycle duration agreement with Ronald’s stopwatch: target Lin’s CCC ≥ 0.85 and Bland-Altman 95% LoA within ±10 s for cycles with mean duration ≤ 90 s, ±15 s for longer cycles. (Calibrated to validation doc.)
G3: False-cycle rate (events the operator/Ronald disagrees were a cycle): ≤ 5% over the validation window.
G4: Missed-cycle rate (real cycles the system did not log): ≤ 5%.
G5: End-to-end latency from frame capture to event write ≤ 2 s p95 (Phase I is not real-time-critical, but anything over a few seconds breaks the live dashboard’s illusion).
G6: Inference budget on the Orin Nano: ≤ 10 ms/frame YOLOv8n FP16 TensorRT per camera, with massive headroom at the 2 cameras × 3–5 fps Phase I load (≤ 10 frames/s aggregate, per ADR-004). Public benchmarks show YOLOv8n at 640×480 FP16 TensorRT on Orin Nano around 7.5–10 ms per frame [verified, see literature review].
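The G2 agreement check can be sketched in a few lines. This is a minimal illustration, not the validation pipeline: the function names are hypothetical, the inputs are paired per-cycle durations in seconds (CV vs stopwatch), Lin's CCC uses population (1/n) moments, and the limits of agreement use the usual mean difference ± 1.96 × sample SD.

```python
from statistics import fmean, stdev


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient using population (1/n) moments."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


def bland_altman_loa(x, y):
    """Return (bias, lower, upper): mean difference and 95% limits of agreement."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = fmean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


def meets_g2(cv_s, stopwatch_s, loa_limit_s):
    """G2 check for one duration band: CCC >= 0.85 and LoA within +/- loa_limit_s."""
    ccc = lins_ccc(cv_s, stopwatch_s)
    _, lower, upper = bland_altman_loa(cv_s, stopwatch_s)
    return ccc >= 0.85 and lower >= -loa_limit_s and upper <= loa_limit_s
```

Under this sketch the caller would split cycles into the two G2 bands first, then call meets_g2 with loa_limit_s=10 for cycles with mean duration ≤ 90 s and loa_limit_s=15 for longer ones.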

Non-goals

Garment-type classification (Phase II).
Per-operation labeling within a workstation (Phase III if station performs multiple ops).
Quality / defect detection of the produced garment.
Anything pose-, skeleton-, or action-segmentation-based. Phase I uses bounding boxes only.

Proposed approach

1. Detection layer

Model: yolov8n.pt, Ultralytics, pinned to a specific commit / release tag (e.g., Ultralytics v8.2.x — OPEN[Sophia, pre-Pereira]: pin exact tag before flight to Pereira).
Export: yolo export model=yolov8n.pt format=engine half=True imgsz=640 device=0 to produce yolov8n.engine on the actual Jetson Orin Nano Super (TensorRT engines are not portable across GPU + JetPack versions; this must be built on-device or on an identical Jetson).
Classes: only person (class 0). All other class indices are filtered out before the post-processing layer ever sees them.
Per-frame inference:
- conf threshold = 0.35 (Ultralytics default 0.25 is too liberal for industrial scenes with mannequins, posters, photos on walls).
- iou (NMS) threshold = 0.5.
- max_det = 8 per frame (factory floor; rarely more than 3–4 people legitimately in any one camera ROI at once, headroom for supervisor / mechanic).
- imgsz = 640 (the model’s native train size; 640×480 input is letterboxed to 640×640; see “What’s weak in this doc”).
Output for each frame: list of (x1, y1, x2, y2, conf) tuples for the person class.
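A minimal sketch of the per-frame call, assuming the Ultralytics Python API (model.predict with the conf, iou, max_det, imgsz, and classes arguments, and results exposing boxes.xyxy and boxes.conf). The names PREDICT_KWARGS and detect_persons are hypothetical. The model is passed in as a parameter so the wrapper can be exercised without a GPU.

```python
# Per-frame inference settings from this section; the dict name is illustrative.
PREDICT_KWARGS = {
    "conf": 0.35,      # stricter than the Ultralytics default of 0.25
    "iou": 0.5,        # NMS threshold
    "max_det": 8,      # cap on detections per frame
    "imgsz": 640,
    "classes": [0],    # COCO class 0 = person; other classes are filtered out
    "verbose": False,
}


def detect_persons(model, frame):
    """Run one frame through the model; return [(x1, y1, x2, y2, conf), ...]."""
    result = model.predict(frame, **PREDICT_KWARGS)[0]
    boxes = result.boxes.xyxy.tolist()   # N x 4 pixel coordinates
    confs = result.boxes.conf.tolist()   # N confidences
    return [(x1, y1, x2, y2, c) for (x1, y1, x2, y2), c in zip(boxes, confs)]
```

On the device, the model would presumably be loaded once at startup with something like `YOLO("yolov8n.engine", task="detect")` from the ultralytics package and reused for every frame.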

2. Workstation-association layer

For each camera there is a set of ROI polygons — one polygon per workstation that camera sees (typically 1, occasionally 2; see roi-calibration.md). For each detected person bbox we compute:

overlap_ratio = area(bbox ∩ roi_polygon) / area(bbox)

A bbox is associated with a workstation iff overlap_ratio ≥ 0.5 and the bbox center is inside the polygon. If multiple workstations claim the bbox we take the one with highest overlap.
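The association rule can be sketched in pure Python under a few assumptions: ROI polygons are lists of (x, y) pixel vertices, bboxes are (x1, y1, x2, y2, ...) tuples, and the intersection area is obtained by clipping the polygon against the bbox (Sutherland-Hodgman) and applying the shoelace formula. All function names are hypothetical; a production version might use a geometry library instead.

```python
def clip_polygon_to_box(poly, box):
    """Sutherland-Hodgman clip of a polygon against an axis-aligned box."""
    x1, y1, x2, y2 = box
    # Each clip edge: (axis index, bound, keep points with coordinate >= bound?)
    edges = [(0, x1, True), (0, x2, False), (1, y1, True), (1, y2, False)]
    out = list(poly)
    for axis, bound, keep_ge in edges:
        if not out:
            break
        inp, out = out, []
        for i, cur in enumerate(inp):
            prev = inp[i - 1]
            cur_in = cur[axis] >= bound if keep_ge else cur[axis] <= bound
            prev_in = prev[axis] >= bound if keep_ge else prev[axis] <= bound
            if cur_in != prev_in:
                # The edge crosses the boundary: add the intersection point.
                t = (bound - prev[axis]) / (cur[axis] - prev[axis])
                out.append((prev[0] + t * (cur[0] - prev[0]),
                            prev[1] + t * (cur[1] - prev[1])))
            if cur_in:
                out.append(cur)
    return out


def polygon_area(poly):
    """Shoelace formula; returns 0.0 for an empty polygon."""
    s = 0.0
    for i, (xa, ya) in enumerate(poly):
        xb, yb = poly[i - 1]
        s += xb * ya - xa * yb
    return abs(s) / 2.0


def point_in_polygon(pt, poly):
    """Ray-casting point-in-polygon test."""
    x, y = pt
    inside = False
    for i, (xa, ya) in enumerate(poly):
        xb, yb = poly[i - 1]
        if (ya > y) != (yb > y):
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside


def overlap_ratio(bbox, poly):
    """area(bbox ∩ roi_polygon) / area(bbox)."""
    x1, y1, x2, y2 = bbox[:4]
    bbox_area = (x2 - x1) * (y2 - y1)
    if bbox_area <= 0:
        return 0.0
    return polygon_area(clip_polygon_to_box(poly, (x1, y1, x2, y2))) / bbox_area


def associate(bbox, rois, min_overlap=0.5):
    """Return the id of the ROI that claims this bbox, or None."""
    x1, y1, x2, y2 = bbox[:4]
    center = ((x1 + x2) / 2, (y1 + y2) / 2)
    best_id, best_ratio = None, 0.0
    for roi_id, poly in rois.items():
        ratio = overlap_ratio(bbox, poly)
        if ratio >= min_overlap and point_in_polygon(center, poly) and ratio > best_ratio:
            best_id, best_ratio = roi_id, ratio
    return best_id
```

The per-workstation occupancy signal then follows directly: a station is occupied in a frame if associate returns its id for at least one detection.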

3. Per-workstation occupancy signal

Per camera frame, per workstation: occupied(t) ∈ {0, 1}, where occupied(t) = 1 iff at least one associated person bbox exists. (Multiple people at the same station do not split a cycle; a supervisor walking by is the canonical case, see §5.)

4. State machine for cycle detection

States: IDLE, OCCUPIED, PAUSED.

IDLE → OCCUPIED:
   when occupied(t) == 1 for ≥ ENTER_HYST consecutive frames
   → tentative start_ts = t - (ENTER_HYST - 1) * frame_period

OCCUPIED → PAUSED:
   when occupied(t) == 0 for ≥ EXIT_HYST consecutive frames
   → tentative end_candidate = t - (EXIT_HYST - 1) * frame_period

PAUSED → OCCUPIED:
   when occupied(t) == 1 within REENTRY_WINDOW after entering PAUSED
   → cancel end_candidate; remain in the same cycle (this is the "operator
     turned to talk to neighbor / went to bobbin shelf" case)

PAUSED → IDLE:
   when PAUSED persists ≥ REENTRY_WINDOW
   → finalize cycle with end_ts = end_candidate; emit event if cycle valid
     (see §6)

Default parameters (per-workstation overridable, persisted in the rois table per roi-calibration.md):

Param	Default	Rationale
`ENTER_HYST`	2 frames (≈ 400–667 ms at 3–5 fps)	Robust against single-frame false positives (model flicker, motion blur). At the lower 3 fps tier, 2 frames is still sub-second.
`EXIT_HYST`	4 frames (≈ 800 ms – 1.3 s at 3–5 fps)	Operator brief lean-back / reach-for-spool should not end a cycle.
`REENTRY_WINDOW`	8 s	Operator can turn fully to neighbor / step to bobbin shelf and return. Calibrated against operation videos. OPEN[Sophia, training set]: empirically tune from the 41 videos.
`MIN_CYCLE_DURATION`	8 s	Cycles shorter than this are noise (someone walking through, supervisor inspecting). Below the fastest legitimate Angela SAM per `Ref22 Slim - Angela.xlsx`.
`MAX_CYCLE_DURATION`	`3 × SAM_station`	Anything longer than 3× the standard time is treated as garbage and rejected with reason `too_long` (operator absent but machine running, system stuck OCCUPIED).
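The transitions and defaults above can be sketched as a small per-workstation class. This is an illustration under assumptions, not the shipped implementation: the class and field names are hypothetical, timestamps are seconds, re-entry is treated as strictly inside REENTRY_WINDOW so that "persists ≥ REENTRY_WINDOW" wins at the boundary, and the validity filter of §6 is left to the caller.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    OCCUPIED = "OCCUPIED"
    PAUSED = "PAUSED"


@dataclass
class CycleDetector:
    frame_period: float            # seconds between inference frames
    enter_hyst: int = 2            # frames
    exit_hyst: int = 4             # frames
    reentry_window: float = 8.0    # seconds
    state: State = State.IDLE
    run: int = 0                   # consecutive frames supporting a pending transition
    start_ts: float = 0.0
    end_candidate: float = 0.0
    paused_at: float = 0.0

    def step(self, t, occupied):
        """Feed one frame; return (start_ts, end_ts) when a cycle is finalized, else None."""
        if self.state is State.IDLE:
            self.run = self.run + 1 if occupied else 0
            if self.run >= self.enter_hyst:
                self.start_ts = t - (self.enter_hyst - 1) * self.frame_period
                self.state, self.run = State.OCCUPIED, 0
        elif self.state is State.OCCUPIED:
            self.run = 0 if occupied else self.run + 1
            if self.run >= self.exit_hyst:
                self.end_candidate = t - (self.exit_hyst - 1) * self.frame_period
                self.paused_at = t
                self.state, self.run = State.PAUSED, 0
        else:  # PAUSED
            if t - self.paused_at >= self.reentry_window:
                # Pause outlasted the window: close the cycle at end_candidate.
                self.state = State.IDLE
                self.run = 1 if occupied else 0   # this frame may begin the next cycle
                return (self.start_ts, self.end_candidate)
            if occupied:
                self.state = State.OCCUPIED       # cancel end_candidate, same cycle
        return None
```

One detector instance per workstation would be stepped once per inference frame with that station's occupied(t) value; per-station overrides map onto the constructor arguments.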

5. Multi-person disambiguation

The “supervisor checking work” failure mode is common: Ronald or a line supervisor walks up to a station, leans in, looks at the garment, walks away. Naive code starts a new cycle.

Mitigation, in order of cost:

Rule 1, same-station rule: a station that is already OCCUPIED cannot be re-started by an additional person entering. The cycle ends only when all associated bboxes leave.
Rule 2, operator-track persistence (lightweight): assign each bbox a short-lived track ID using IoU-based frame-to-frame matching. Full ByteTrack/DeepSORT is not needed in Phase I: IoU > 0.3 across consecutive frames is enough at 3–5 fps, since tracks only need to survive the gap between inference frames. The “primary operator” of a cycle is the track with the longest dwell time inside the ROI. The cycle ends when the primary operator leaves, even if a supervisor remains briefly. (This is a stretch goal; Phase I can ship with rule 1 alone.)
Rule 3, Phase II add-on: hi-vis vest classifier or a face-ID-free “operator embedding” so a supervisor in a different uniform is recognizable as not-the-operator. Out of Phase I scope.
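A minimal sketch of the lightweight track-persistence idea, assuming greedy best-IoU matching between consecutive frames and that only bboxes already associated with the station's ROI are passed in. The class and method names are hypothetical. Tracks that find no match are dropped immediately, and the overlapping-bbox case (supervisor leaning behind the operator) is not handled here.

```python
from itertools import count


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class IouTracker:
    def __init__(self, min_iou=0.3):
        self.min_iou = min_iou
        self.tracks = {}     # track_id -> bbox from the previous frame
        self.dwell = {}      # track_id -> number of frames the track was seen
        self._ids = count(1)

    def update(self, bboxes):
        """Match this frame's bboxes to last frame's tracks; return one track id per bbox."""
        pairs = sorted(
            ((iou(prev, box), tid, j)
             for tid, prev in self.tracks.items()
             for j, box in enumerate(bboxes)),
            reverse=True,
        )
        assigned, used = {}, set()
        for score, tid, j in pairs:
            if score <= self.min_iou:
                break                      # remaining pairs overlap too little
            if tid in used or j in assigned:
                continue
            assigned[j] = tid
            used.add(tid)
        ids, new_tracks = [], {}
        for j, box in enumerate(bboxes):
            tid = assigned[j] if j in assigned else next(self._ids)
            new_tracks[tid] = box
            self.dwell[tid] = self.dwell.get(tid, 0) + 1
            ids.append(tid)
        self.tracks = new_tracks           # unmatched old tracks are dropped
        return ids

    def primary(self):
        """Track id with the longest dwell so far, or None if nothing was seen."""
        return max(self.dwell, key=self.dwell.get) if self.dwell else None
```

Creating one tracker per workstation per cycle keeps the dwell counts scoped to that cycle, which is what the primary-operator definition needs.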

6. Cycle validity filter (write or drop)

A cycle is written to cycle_events iff:

MIN_CYCLE_DURATION ≤ duration_seconds ≤ MAX_CYCLE_DURATION, and
the cycle’s primary-operator track has dwell ratio ≥ 0.6 (was present for at least 60% of duration_seconds).

Otherwise the cycle is written to a parallel cycle_events_rejected table with a reason enum (too_short, too_long, low_dwell, confidence_drop). This table is critical for the paper — it is the only audit trail of detector behavior and the substrate for the failure-mode section.
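The write-or-drop decision can be sketched as a single function. The name validate_cycle and its signature are hypothetical, the station SAM is assumed to be supplied in seconds (the spreadsheet lists minutes), and the confidence_drop reason is left out because its criterion is not defined in this section.

```python
def validate_cycle(duration_s, dwell_ratio, sam_station_s,
                   min_cycle_s=8.0, max_sam_multiple=3.0, min_dwell=0.6):
    """Return (True, None) to write to cycle_events, else (False, reason).

    reason is one of the cycle_events_rejected enum values covered here:
    "too_short", "too_long", "low_dwell".
    """
    if duration_s < min_cycle_s:
        return False, "too_short"
    if duration_s > max_sam_multiple * sam_station_s:
        return False, "too_long"
    if dwell_ratio < min_dwell:
        return False, "low_dwell"
    return True, None
```

For example, with a 60 s station SAM, a 200 s cycle would be routed to cycle_events_rejected as too_long, since 200 > 3 × 60.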

7. Confidence tagging

Each cycle event carries a quality score in {green, yellow, red}:

green — primary-operator dwell ≥ 0.85, mean per-frame max-person conf ≥ 0.6, no PAUSED excursions, duration within ±2σ of station’s running mean.
yellow — one of the above degraded (still emitted, dashboard styles it).
red — multiple criteria degraded; written but flagged for human spot-check. Red rate over a sliding window is one of the failure-monitor metrics (see failure-modes-and-monitoring.md).
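One way to read the three tags is as a count of degraded criteria. The sketch below assumes that reading: zero degraded criteria is green, exactly one is yellow, two or more is red. The function name and parameters are hypothetical; the station's running mean and standard deviation are assumed to be maintained elsewhere.

```python
def quality_tag(dwell_ratio, mean_conf, paused_excursions, duration_s,
                station_mean_s, station_std_s):
    """Map the four green criteria to a tag by counting how many are degraded."""
    degraded = sum([
        dwell_ratio < 0.85,                                    # primary-operator dwell
        mean_conf < 0.6,                                       # mean per-frame max-person conf
        paused_excursions > 0,                                 # any PAUSED excursion
        abs(duration_s - station_mean_s) > 2 * station_std_s,  # outside ±2σ of running mean
    ])
    if degraded == 0:
        return "green"
    return "yellow" if degraded == 1 else "red"
```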

8. The “operator absent, machine running” case

There is no Phase-I CV signal for “the operator’s foot is on the pedal but the operator’s torso is not in the ROI” because Phase I has no machine-on signal at all. Two mitigations:

MAX_CYCLE_DURATION = 3 × SAM catches a stuck cycle.
Phase II add: optional machine-on signal (audio classifier from camera audio if the Amcrest mic is enabled, or current-clamp sensor on the sewing machine motor). Out of Phase I scope.

9. The “operator working but occluded” case

If the operator leans forward over the fabric, much of the torso may go below the bbox-detector’s confidence threshold. Mitigations:

ROI polygons are deliberately drawn around the seated chest+head region (not the full body) so even a leaned-over operator’s head/upper-back keeps the bbox at ≥ 0.5 overlap, the §2 association threshold. Coordinated with roi-calibration.md.
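EXIT_HYST (≈ 1 s) and REENTRY_WINDOW (8 s) together tolerate an occlusion of roughly 9 s: a cycle is finalized only if the operator stays undetected through the exit hysteresis and then through the full re-entry window.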
For workstations with chronic occlusion (operator’s body geometry occludes head at the press / planchadora), EXIT_HYST is raised to ≈ 2 s (6–10 frames at 3–5 fps) per-station via the override table.

Alternatives considered

Alt	Pros	Cons	Why rejected
Background subtraction (OpenCV MOG2)	Cheap; no model dependency.	Sensitive to lighting changes (dusk shift), fabric motion, machine vibration. Cannot distinguish operator from supervisor.	Person-class detection is more robust and forward-compatible with Phase II behavioral models. Documented in `docs/technical/architecture.md`.
YOLOv6 / YOLOv7	Comparable accuracy; some variants faster on edge.	Weaker Python ergonomics, TensorRT export path, and community Jetson recipes than Ultralytics + YOLOv8; no latency win that matters, since YOLOv8n on Orin Nano is already well under budget.	Ergonomic / momentum reasons; not a performance call.
YOLOv8s or YOLOv8m	More accurate, better small-object recall.	2–4× the latency. Even at the Phase I 2 cameras × 3–5 fps load the Orin Nano can absorb it, but the n-tier already meets recall targets and leaves capacity for Phase II behavioral models.	Phase I task is person detection in mostly-clean ROIs at a known scale — YOLOv8n is sufficient. Re-evaluate for Phase II.
Optical-flow-based motion detection	Could detect hand motion → “actively working” signal.	Costly per pixel; flow on garment fabric is itself noisy (the fabric moves); doesn’t disambiguate operator from supervisor.	Defer to Phase II as a possible “active work” overlay signal.
Temporal action segmentation (I3D / SPOT / temporal-segment networks)	State-of-the-art for assembly-line action recognition [Rashid 2024, Ghoddoosian 2023 — needs-lit-review].	Requires labeled action segments; far more compute than Orin Nano provides; introduces a second model.	Phase II direction. Phase I cycle = ROI dwell, not action class.
Vision-Language Models (Andrew/Granola 5/2 transcript)	Could in principle zero-shot detect richer behaviors.	Costly per-frame; latency on Orin Nano hostile to real-time; the Granola “VLM” transcription is itself suspect (see `docs/references/meetings.md`).	Stay on YOLOv8 baseline until either a v2.0 plan or actual code change confirms a VLM direction.
Pose-based cycle detection (HRNet keypoints)	Lets you decide “operator is seated and facing machine” precisely.	Heavier model; pose at 3–5 fps × 2 cameras fits Phase I budget, but the value of fine pose info on cycle counting is marginal vs the implementation cost.	Phase II direction for behavioral monitoring; overkill for Phase I cycle counting.

Open questions

OPEN[Sophia, by 2026-06-15]: Pin the exact Ultralytics release tag and commit SHA we ship to Pereira. (Version churn between v8.0 and v8.3 has changed default args.)
OPEN[Ronald via Armando, by 2026-06-01]: Confirm typical and worst-case occlusion patterns at each Angela workstation. Specifically: at PLANCHAR (press), does the operator’s full upper body stay in frame from the planned camera angle? If not, override EXIT_HYST.
OPEN[Andrew, by 2026-05-20]: Is the “primary operator track” (§5 rule 2) worth shipping in Phase I, or is the same-station rule (§5 rule 1) enough? Andrew’s Form-AI experience is directly relevant — they likely have a default here.
OPEN[ITBA, by 2026-06-15]: Do the ITBA install conditions (different garment line) need different ENTER_HYST / EXIT_HYST? They should run the same algorithm with their own per-workstation thresholds, not a divergent algorithm.
OPEN[Sophia, before Argentina trip 2026-05-15]: Should the cycle-event schema include a model_version and roi_version foreign key from day one? (Answer should be: yes — see reproducibility-and-artifacts.md. Confirm with Agent B before they finalize the schema.)
OPEN[Ronald, by 2026-06-15]: Are there workstations where the operator legitimately leaves the station for >8 s during a cycle (e.g., walks to a fabric-cart 2 m away to grab the next bundle and comes back)? If yes, REENTRY_WINDOW is too short and we’ll under-count.
OPEN[Sophia, paper deadline]: Do we report cycle_events_rejected rates in the paper? Recommendation: yes, with breakdown by reason — it’s the only honest measure of how much the algorithm throws away.

CV writer → SQLite (direct write, no HTTP)

Canonical Phase I boundary (decided 2026-05-12): the CV writer process opens its own SQLite connection (/data/lbzf.db, WAL mode) and writes cycle_events, cycle_events_rejected, and manual_observations rows directly. There is no /api/v1/internal/cycles endpoint; the writer is not an HTTP client of the dashboard.

Why direct SQLite (not an internal HTTP endpoint):

One process writes (lbzf-cv-writer.service); the dashboard process and the exporter open the DB read-only.
SQLite WAL allows concurrent readers without blocking the writer; the dashboard’s reads do not interfere with the inference hot loop.
An HTTP hop between the writer and the DB would add latency and a failure mode (dashboard down → writer can’t persist) for no architectural benefit at Phase I scale.

Coordination:

Dashboard process applies migrations at startup. Writer waits for schema_version.version >= REQUIRED_MIN before opening writes. If migrations are not applied within 30 s, the writer alerts and exits.
Both processes set PRAGMA busy_timeout=5000.
The CV writer is the only writer of the cycle tables (one row inserted per cycle-end transition, one update on commit). All other tables (workstations, standard_times, operators) are seeded once and read-only at runtime; admin edits go through the dashboard API and are serialized via a WriteLock in-process mutex on the dashboard side. The dashboard process otherwise reads through a read-only connection and opens a read-write connection only for migrations at startup and for admin writes.
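The writer-side coordination can be sketched with the standard-library sqlite3 module. This is a minimal sketch under assumptions: REQUIRED_MIN = 1 is a placeholder value, the function names are hypothetical, schema_version is assumed to hold an integer version column, and the cycle_events column list is limited to fields named in this doc plus an assumed workstation_id.

```python
import sqlite3
import time

REQUIRED_MIN = 1  # placeholder; the real minimum schema version is set elsewhere


def open_writer(db_path, required_min=REQUIRED_MIN, timeout_s=30.0, poll_s=0.5):
    """Open the writer connection, then wait until migrations have been applied."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
            if row[0] is not None and row[0] >= required_min:
                return conn
        except sqlite3.OperationalError:
            pass  # e.g. table missing because migrations are still pending
        if time.monotonic() >= deadline:
            conn.close()
            raise TimeoutError("schema migrations not applied in time")
        time.sleep(poll_s)


def write_cycle(conn, workstation_id, start_ts, end_ts, quality):
    """Insert one finalized cycle; the context manager commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO cycle_events "
            "(workstation_id, start_ts, end_ts, duration_seconds, quality) "
            "VALUES (?, ?, ?, ?, ?)",
            (workstation_id, start_ts, end_ts, end_ts - start_ts, quality),
        )
```

In the service, the TimeoutError would be the point where the writer alerts and exits, and write_cycle would be called once per cycle-end transition.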

The HTTP-internal-API alternative is parked alongside 60-parking/websocket-live-updates.md (see backend/dashboard-api.md for the rejection note).

Cross-bucket dependencies

Agent A (frontend): cycle_events.quality field must surface in the dashboard (green/yellow/red station tiles). Confirm against Agent A’s dashboard spec.
Agent B (backend): data-model.md now includes cycle_events.quality, cycle_events.primary_track_dwell_ratio, cycle_events.mean_conf, cycle_events.roi_version_id (FK), plus the rois, cycle_events_rejected, and manual_observations tables. The CV writer reads rois to pick roi_version_id and writes the other tables directly.
Agent C (hardware): camera mount must keep the operator’s seated head+chest in frame; the Amcrest dome’s varifocal lens is wide-angle but ROI shrinks at oblique angles. Camera height ~2.5 m, tilted ~30° down from horizontal is the working assumption. Confirm with on-site measurements.
Agent E (business/legal): the cycle_events_rejected table and the per-frame confidence logs are personal-data adjacent under Colombian Law 1581/2012 because they’re linked to a workstation that is linked to a named operator via the Angela roster. Retention policy must be defined before deployment.

What’s weak in this doc

MIN_CYCLE_DURATION = 8 s is asserted, not derived. It is below the fastest Angela SAM in the spreadsheet, but it is not derived from a distribution of actual cycle durations observed in Ronald’s 41 videos. A reviewer will ask “where’s the histogram?” — and rightly. The honest answer is “we’ll calibrate it during the training-set construction sprint in June 2026 and update this doc.” Until then, the 8 s number is a placeholder.
The “primary operator track” mechanism is hand-waved. “IoU > 0.3 across consecutive frames” gets you most of the way, but the well-known failure case is when the supervisor and the operator’s bboxes overlap heavily (supervisor leaning behind operator). This doc does not specify what happens then. A real implementation needs at least a track-length tiebreaker, possibly an appearance embedding. Currently spec’d as “stretch goal” — a reviewer will read that as “didn’t think it through.”
640×480 is asserted as sufficient without a hand-motion experiment. Sewing operations involve fine motor movement, and Phase II’s behavioral cases (phone-to-ear, fork-to-mouth) live at exactly the resolution scale 640×480 will smear. We have not experimentally verified that 1080p is unnecessary on Orin Nano. The latency budget (Orin Nano at 1080p sub-stream × 6 cameras) needs a real benchmark, not a thought-experiment. Logged in phase-ii-preview.md but the resolution decision for Phase I is also weak.
The state machine does not handle camera disconnects gracefully. If a camera drops out mid-cycle and reconnects 30 s later, the current spec will sit in OCCUPIED until MAX_CYCLE_DURATION, then emit a garbage cycle. A real implementation needs a “stream stale → reset state machine, drop in-flight cycle” rule, and that rule needs to be coordinated with the GStreamer pipeline’s reconnection behavior (Agent C).
No principled story for variable SAM_station-dependent thresholds. MAX_CYCLE_DURATION = 3 × SAM is round-numbered. The literature ([Rashid 2024 — needs-lit-review]) suggests operator-time distributions in apparel manual work are heavy-tailed (lognormal), and a fixed multiplier under-rejects on short SAMs and over-rejects on long ones. A percentile-based threshold (e.g., 99th-percentile-of-fitted-lognormal) is more defensible but requires us to fit the distribution first — chicken-and-egg vs the deployment.
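For the percentile-based alternative in the last item, a minimal sketch of the computation, assuming per-station durations are lognormal and already observed (so it does not resolve the chicken-and-egg problem). The function name is hypothetical. The threshold is exp(μ + z·σ), where μ and σ are the mean and sample standard deviation of the log-durations and z is the standard-normal quantile for the chosen percentile.

```python
from math import exp, log
from statistics import NormalDist, fmean, stdev


def lognormal_percentile_threshold(durations_s, percentile=0.99):
    """Fit a lognormal to observed durations; return the given percentile in seconds."""
    logs = [log(d) for d in durations_s]
    mu, sigma = fmean(logs), stdev(logs)      # parameters of the log-durations
    z = NormalDist().inv_cdf(percentile)      # about 2.326 for the 99th percentile
    return exp(mu + z * sigma)
```

Such a value could replace 3 × SAM_station as the per-station MAX_CYCLE_DURATION once enough validated cycles exist to fit the distribution.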

Rollout

Date	Gate
2026-05-15	Algorithm spec frozen for Argentina handoff — ITBA must be able to run the same state machine on their twin hardware.
2026-06-01	First end-to-end run against one Angela operation video locally on the M1 (CPU PyTorch, no TensorRT) — produces a valid `cycle_events` row.
2026-06-15	First end-to-end run on the Orin Nano against a recorded Pereira clip — meets latency budget.
2026-07-01 (Pereira)	Day-1 on-site: ROI calibration (see `roi-calibration.md`), thresholds defaulted, one shift recorded for offline validation.
2026-07-02 → 07-15	Validation window vs Ronald stopwatch (see `validation-methodology.md`). Adjust thresholds.
2026-08-01	Phase I “stable” — kill switch: if false-cycle + missed-cycle rate exceeds 15% over any 4-hour window, the dashboard auto-falls-back to “manual count required” and the system writes only to `cycle_events_rejected`.

Paper alignment

Methods: state machine, hysteresis values, validity filter — Section 3.2 (“Cycle Detection”) of the paper.
Experimental setup: hardware + model + thresholds table, one row per parameter — Section 4.1.
Results: per-station cycle_events count vs ground truth (Table 2), false-cycle and missed-cycle rate over the validation window (Table 3).
Limitations: the multi-person disambiguation, the resolution choice, the camera-disconnect handling — discussed in Section 6.
Figures: state-machine diagram (Fig. 3), example timeline of occupied(t) with annotated state transitions (Fig. 4), confusion-matrix-equivalent vs stopwatch (Fig. 5).