Failure modes & self-monitoring

Bucket: technical/ml (Agent D) · Status: reviewed (Phase B: 3–5 fps per ADR-004) · Owner: Sophia Mann · Phase: I · Last updated: 2026-05-12

Context

The cycle-event detector (cycle-event-detection.md) is a probabilistic system attached to a deterministic-looking business surface (per-station cycle counts, per-shift efficiency %, Excel export matching INDICADORES ABRIL.xlsx). The danger is that the system silently lies: the dashboard shows numbers, Ronald trusts them, decisions get made on them, and nobody knows the underlying CV pipeline is mis-firing.

Phase I must include self-monitoring that fires before the numbers reach Ronald. This doc catalogs the failure modes, the detection signal for each, and the alert routing.

The “fail loudly” principle is non-negotiable for the paper: silent failure of a Phase I deployment disqualifies the paper’s validation methodology, because you can’t claim agreement with the stopwatch over a window where the CV system was secretly broken.

Goals

G1: Catalog every silent-failure mode we can think of with the detection signal for each.
G2: Specify alert thresholds and routing — what surfaces to Ronald, what surfaces to Sophia, what gets auto-mitigated.
G3: Define the kill-switch: at what aggregate metric does the system stop reporting numbers and ask for human intervention?
G4: Make the self-monitoring data part of the paper, not background noise.

Non-goals

General Linux/Jetson system monitoring (CPU, disk, network) — Agent C’s bucket; those signals are listed below only where they affect ML.
Application-level uptime monitoring (dashboard reachable, API responds) — Agent B.
Active model-improvement loops (retraining on bad cases) — covered in training-and-finetuning.md Stage C.

Catalog of silent failure modes

A — Pipeline-level failures (the system is broken; numbers are wrong)

#	Failure	Why silent	Detection signal	Threshold	Alert routing
A1	Camera dropped offline (network blip, PoE port reset)	GStreamer pipeline may auto-reconnect; cycle counter stops accumulating; dashboard tile freezes “last seen 3 min ago” — easy to miss	`frames_received_per_minute` per camera	< 50% of nominal (≈ 180–300/min at 3–5 fps) for ≥ 60 s	Ronald (dashboard banner) + Sophia (log)
A2	GStreamer pipeline stalled (decoder hangs but doesn’t error)	Process is alive, just stuck	Per-camera `last_frame_ts` not advancing	≥ 30 s since last frame	Auto-restart that pipeline; alert Sophia
A3	Jetson thermal throttle	Inference latency creeps up; YOLOv8 starts dropping confidence; no exception	Mean per-frame inference latency vs 24-h baseline	> 1.5× baseline over 10-min window	Ronald (banner: “system thermal — call Sophia”); Sophia (page)
A4	Inference engine OOM / crash with auto-restart	systemd restart loop; cycles missed during downtime	Process restart count per hour	≥ 1 unexpected restart	Sophia (log)
A5	SQLite locked / corrupted	`cycle_events` writes fail; events lost	DB write-success rate	< 99%	Sophia (page); fallback to JSONL on disk
A6	NVMe at >85% capacity	Rolling deletion stops new recordings	`df` on NVMe mount	> 85%	Sophia (banner)
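
The A1 and A2 rows reduce to two per-camera comparisons. A minimal sketch, assuming the evaluator already has a trailing-60-second frame count and the timestamp of the last frame for each camera; `CameraWindow` and `liveness_alerts` are hypothetical names, and reading "< 50% of nominal for ≥ 60 s" as "the trailing 60-second count is below half of nominal" is one interpretation of the threshold, not a confirmed one.

```python
from dataclasses import dataclass

LOW_RATE_FRACTION = 0.50   # A1: below 50% of nominal
STALL_SECONDS = 30         # A2: no new frame for at least 30 s


@dataclass
class CameraWindow:
    """Per-camera inputs for the liveness checks (hypothetical structure)."""
    camera_id: str
    nominal_fps: float       # 3-5 fps per ADR-004
    frames_last_60s: int     # frames_received_per_minute over the trailing 60 s
    last_frame_ts: float     # epoch seconds of the most recent frame


def liveness_alerts(cam, now):
    """Return the A-level alert ids (A1, A2) that are active for one camera."""
    alerts = []
    nominal_per_minute = cam.nominal_fps * 60
    # A1: frame rate under half of nominal across the trailing minute.
    if cam.frames_last_60s < LOW_RATE_FRACTION * nominal_per_minute:
        alerts.append("A1")
    # A2: the pipeline process may be alive, but last_frame_ts is not advancing.
    if now - cam.last_frame_ts >= STALL_SECONDS:
        alerts.append("A2")
    return alerts
```

A stall longer than 30 s at 5 fps will usually trip A1 as well, since fewer than 150 frames arrive in the minute; keeping the two ids separate preserves the distinction between "restart the pipeline" (A2) and "check the network or PoE port" (A1).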

B — Detector-quality failures (the system “works” but is silently mis-detecting)

#	Failure	Why silent	Detection signal	Threshold	Alert routing
B1	Camera lens fogged / smudged	Cycles still emit; confidence drops gradually	Per-camera median person-detection confidence, 1-hour window vs trailing-7-day baseline	drop > 0.15 absolute	Ronald (banner: “clean camera N”)
B2	Lighting change (sunset, fluorescent failure, new lamp)	Same as B1 plus possible bbox flicker	(a) Per-camera mean frame luminance, drift > 30% vs baseline. (b) Confidence distribution shift (KS test, p < 0.01 on 1-hour vs trailing 7-day).	Either signal	Sophia (log); Ronald only if persistent > 6 h
B3	Camera mount shift	Cycles emit but ROIs are mis-aligned → wrong workstation → wrong SAM comparison	ORB drift detection from `roi-calibration.md` §3	Per `roi-calibration.md` thresholds	Ronald (banner: re-calibrate camera N)
B4	Model drift on new operator / new garment SKU	Recall drops on specific operator; system silently under-counts their station	Per-station cycle count vs trailing-7-day median	drop > 25% over 4-hour window	Ronald (banner); flag the operator + station in `cycle_events_rejected` review
B5	Hi-vis vest / new uniform confuses person detector	Bbox misses operator; under-count	Per-station `frames_with_no_detection_inside_roi` while station should be working	spike vs trailing baseline	Sophia (log); if persistent → manual review
B6	Detector “stuck on”: a poster of a person on the wall is being detected	Phantom cycles on a station that is empty	A station detected as OCCUPIED for ≥ 4 hours with zero `PAUSED` excursions	duration ≥ 4 h continuously	Ronald (banner) + cycle suppressed
B7	Confidence collapse — model produces no detections at all	Cycle count → 0 for the whole module	Aggregate `n_detections_per_minute` across all cameras	< 5% of 24-h baseline for 10 min	Sophia (page)
B8	State machine wedged (a bug, not a model issue)	Cycle stays OCCUPIED past MAX_CYCLE_DURATION repeatedly	`cycle_events_rejected` rate with `reason='too_long'`	> 3 per hour per station	Sophia (log); auto-snapshot state for debug
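
Several B-tier signals are comparisons of a short window against a trailing baseline. A minimal sketch of B1 and B4 under that reading; the function names are hypothetical, and how the trailing-7-day baseline is stored and refreshed is an assumption left outside the sketch.

```python
from statistics import median


def b1_confidence_drop(window_confidences, baseline_median, max_drop=0.15):
    """B1: 1-hour median person-detection confidence vs the trailing-7-day median.

    Fires when the absolute drop exceeds max_drop.
    """
    if not window_confidences:
        # No detections at all is B7's signal, not B1's.
        return False
    return baseline_median - median(window_confidences) > max_drop


def b4_count_drop(count_4h, trailing_counts_4h, max_relative_drop=0.25):
    """B4: a station's 4-hour cycle count vs the median of comparable
    4-hour counts from the trailing 7 days."""
    baseline = median(trailing_counts_4h)
    if baseline <= 0:
        # A station with no history cannot show a drop.
        return False
    return (baseline - count_4h) / baseline > max_relative_drop
```

For example, a station whose comparable 4-hour windows have a median of 100 cycles trips B4 at 70 cycles (a 30% drop) but not at 80 (a 20% drop).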

C — Validation-window-specific failures (during the paper-grade validation)

#	Failure	Detection	Action
C1	Ronald’s stopwatch and CV disagree by > 20% on count for a single shift	live tally during W2 (`validation-methodology.md`)	Pause validation; investigate before next shift
C2	Tape-replay shows CV is right and Ronald’s stopwatch is wrong	post-shift review	Note for paper (the §3 pivot rule)
C3	Validation observer (Ronald or secondary) fatigue late in shift	last-2-hours-vs-first-2-hours inter-rater agreement drop	Cap validation observation at 6 h/day
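
The C1 rule is a single relative-difference check. A minimal sketch, assuming the stopwatch count is the denominator (the table does not say which count is the reference); both function names are hypothetical.

```python
def count_disagreement(cv_count, stopwatch_count):
    """Relative disagreement on a shift's cycle count, stopwatch as reference."""
    if stopwatch_count <= 0:
        raise ValueError("stopwatch_count must be positive")
    return abs(cv_count - stopwatch_count) / stopwatch_count


def should_pause_validation(cv_count, stopwatch_count, limit=0.20):
    """C1: pause validation when the shift-level disagreement exceeds 20%."""
    return count_disagreement(cv_count, stopwatch_count) > limit
```

With 500 stopwatch cycles, a CV count of 380 (24% off) pauses validation, while 450 (10% off) does not.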

D — Consent and privacy failures (the consent-posture safety net)

#	Failure	Detection	Action
D1	Camera repositioned to view a non-workstation area (bathroom hallway, locker room)	Manual periodic review of one frame per camera per week	Camera relocated / pointed away
D2	Operator who withdrew consent is still being recorded	`withdrawn_consent_operators` table cross-checked against `shift_assignments` and the relevant station’s camera	Auto-disable recording for that station for that operator’s shifts
D3	Raw frame for a face-blurred figure is accidentally checked into the public repo	git pre-commit hook scans for image files in figure paths	Block commit
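
The D2 cross-check can be expressed as one join. A minimal sketch using an in-memory SQLite database; the table names `withdrawn_consent_operators` and `shift_assignments` come from the row above, but every column name, the `station_cameras` mapping table, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: only the first two table names appear in this doc.
SCHEMA = """
CREATE TABLE withdrawn_consent_operators (operator_id TEXT PRIMARY KEY);
CREATE TABLE shift_assignments (
    operator_id TEXT, station_id TEXT, shift_start TEXT, shift_end TEXT
);
CREATE TABLE station_cameras (station_id TEXT, camera_id TEXT);
"""

BLACKOUT_QUERY = """
SELECT sc.camera_id, sa.station_id, sa.shift_start, sa.shift_end
FROM withdrawn_consent_operators AS w
JOIN shift_assignments AS sa ON sa.operator_id = w.operator_id
JOIN station_cameras AS sc ON sc.station_id = sa.station_id
ORDER BY sa.shift_start, sc.camera_id
"""


def recording_blackouts(conn):
    """Return (camera, station, start, end) windows where recording must be off."""
    return conn.execute(BLACKOUT_QUERY).fetchall()


def demo_blackouts():
    """Build a tiny example database and return its blackout windows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO shift_assignments VALUES (?, ?, ?, ?)",
        [
            ("op-17", "st-03", "2026-07-01T06:00", "2026-07-01T14:00"),
            ("op-22", "st-04", "2026-07-01T06:00", "2026-07-01T14:00"),
        ],
    )
    conn.executemany(
        "INSERT INTO station_cameras VALUES (?, ?)",
        [("st-03", "cam-2"), ("st-04", "cam-2")],
    )
    conn.execute("INSERT INTO withdrawn_consent_operators VALUES ('op-17')")
    return recording_blackouts(conn)
```

In the example only `op-17` has withdrawn consent, so the query returns a single blackout window for `st-03` on `cam-2`. When one camera covers several stations, as here, how "disable recording for that station" is carried out (masking the ROI versus stopping the camera) is a decision this sketch does not make.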

Aggregate health score

A single system_health value ∈ [0, 1] is computed every minute and surfaced in the dashboard footer:

health = 1.0
  - 0.30 * (1 if any A-level alert is active, else 0)
  - min(0.40, 0.10 * number of B-level alerts active)
  - min(0.20, 0.10 * fraction of cycles in trailing 1h tagged 'red' or rejected)
  - 0.10 * (1 if any drift warning from roi-calibration §3 is active, else 0)

Bands:

health ≥ 0.85 — green dashboard
0.65 ≤ health < 0.85 — yellow dashboard, with banner listing active alerts
health < 0.65 — red dashboard; kill-switch fires: cycle counts continue to be logged to the DB but the dashboard hides per-station efficiency numbers and shows “system degraded — manual count required” instead. Excel export annotates affected shifts.

The kill-switch is the critical Phase I safety net: it converts a silent failure into a noisy one. Ronald is trained to revert to stopwatch when the kill-switch fires; the system never knowingly outputs wrong numbers to the plant decision loop.
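
The score and its bands are small enough to sketch directly. This is one reading of the formula above, with each cap applied to the whole penalty term; the function names are hypothetical, and rounding to four decimals is an added assumption to keep floating-point noise from flipping a band at the 0.85 and 0.65 boundaries. Note that with a coefficient of 0.10 on a fraction in [0, 1], the 0.20 cap on the rejected-cycle term can never bind as written; the sketch keeps it literally.

```python
def system_health(any_a_alert, n_b_alerts, bad_cycle_fraction, any_drift_warning):
    """Compute system_health in [0, 1] from the four penalty terms."""
    penalty = 0.0
    penalty += 0.30 if any_a_alert else 0.0
    penalty += min(0.40, 0.10 * n_b_alerts)          # B-level term, capped at 0.40
    penalty += min(0.20, 0.10 * bad_cycle_fraction)  # cap cannot bind while fraction <= 1
    penalty += 0.10 if any_drift_warning else 0.0
    return round(max(0.0, 1.0 - penalty), 4)


def dashboard_band(health):
    """Map a health value to its dashboard band; 'red' means the kill-switch fires."""
    if health >= 0.85:
        return "green"
    if health >= 0.65:
        return "yellow"
    return "red"
```

Under this reading a single A-level alert on its own yields 0.70 (yellow). The red band is reached, for example, when an A-level alert coincides with one B-level alert (0.60) or when four B-level alerts are active at once (0.60).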

Logging plan

Per-frame logs are heavy. Three-tier:

Hot tier (always on): per-cycle event with quality tag, per-camera 1-minute aggregate (detections_per_min, mean_conf, median_conf, frame_count, mean_luminance).

Warm tier (sampled): per-frame log for 1 frame per camera per minute (frame_id, ts, n_detections, confidences[], bboxes[]). Stored as rotated JSONL.

Cold tier (audit): one full-resolution frame per camera per hour, immutable, retained for the project’s life (per phase-ii-preview.md §3).

Total log volume ≈ 50–200 MB/day, dwarfed by video buffer.

Logs go to local SQLite (telemetry table) + rotated JSONL. No external log service in Phase I — Tailscale-pulled logs only. Phase II revisit when AWS comes online.
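
A minimal sketch of the hot-tier write path, combining the per-camera 1-minute aggregate with the JSONL fallback named in A5. The column names follow the hot-tier field list above; the table layout, the primary key, and the `write_aggregate` helper are assumptions, not the agreed schema.

```python
import json
import sqlite3

# Hypothetical layout for the telemetry table (one row per camera per minute).
TELEMETRY_DDL = """
CREATE TABLE IF NOT EXISTS telemetry (
    ts TEXT NOT NULL,
    camera_id TEXT NOT NULL,
    detections_per_min REAL,
    mean_conf REAL,
    median_conf REAL,
    frame_count INTEGER,
    mean_luminance REAL,
    PRIMARY KEY (ts, camera_id)
)
"""

FIELDS = ("ts", "camera_id", "detections_per_min", "mean_conf",
          "median_conf", "frame_count", "mean_luminance")


def write_aggregate(conn, row, fallback_path):
    """Insert one 1-minute aggregate; on any SQLite error append it to JSONL.

    Returns "sqlite" or "jsonl" so the caller can track the DB write-success
    rate that A5 alerts on.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO telemetry VALUES (?, ?, ?, ?, ?, ?, ?)",
                tuple(row[name] for name in FIELDS),
            )
        return "sqlite"
    except sqlite3.Error:
        with open(fallback_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(row) + "\n")
        return "jsonl"
```

Returning the sink name rather than raising keeps the detector loop running while still making every failed write countable.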

Alert routing

Severity	Channel	Audience
Banner (UI)	Dashboard top bar, Spanish + English	Ronald + anyone with dashboard access
Log	`journalctl` on Jetson + Tailscale-pullable	Sophia / Andrew
Page (P1)	OPEN[Sophia, by 2026-06-15]: Slack/Discord/email/SMS — pick one channel. Most likely: Discord webhook (free) or email via Mailgun/Resend.	Sophia, escalate to Andrew if no ack in 4 h
Auto-mitigate	systemd restart, suppress phantom cycle	(none — silent recovery, but logged)

Ronald has no email / calendar / phone in any of Sophia’s connected services (per docs/overview/people.md). All Ronald-facing alerts must surface in the dashboard banner. Email-to-Ronald is not a path in Phase I.
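
The routing table and the Ronald constraint can be encoded as data so the constraint is checkable. A minimal sketch: the channel-to-audience mapping mirrors the table above, `ROUTES` is only an excerpt of catalog rows whose routing is stated explicitly (A3, A5, B1, B7), and all identifiers are hypothetical.

```python
from enum import Enum


class Channel(Enum):
    BANNER = "banner"
    LOG = "log"
    PAGE = "page"
    AUTO_MITIGATE = "auto_mitigate"


# Who each channel reaches, per the routing table.
AUDIENCE = {
    Channel.BANNER: ("Ronald", "dashboard users"),
    Channel.LOG: ("Sophia", "Andrew"),
    Channel.PAGE: ("Sophia",),
    Channel.AUTO_MITIGATE: (),
}

# Excerpt of the failure catalog, not the full mapping.
ROUTES = {
    "A3": (Channel.BANNER, Channel.PAGE),
    "A5": (Channel.PAGE,),
    "B1": (Channel.BANNER,),
    "B7": (Channel.PAGE,),
}

PAGE_ESCALATION_HOURS = 4  # escalate to Andrew if no ack in 4 h


def channels_for(person):
    """All channels through which a person can be reached."""
    return {ch for ch, audience in AUDIENCE.items() if person in audience}


def recipients(alert_id, hours_unacked=0.0):
    """People notified for an alert, including page escalation when overdue."""
    people = []
    for channel in ROUTES[alert_id]:
        for person in AUDIENCE[channel]:
            if person not in people:
                people.append(person)
        overdue = hours_unacked >= PAGE_ESCALATION_HOURS
        if channel is Channel.PAGE and overdue and "Andrew" not in people:
            people.append("Andrew")
    return people
```

A startup check such as `channels_for("Ronald") == {Channel.BANNER}` would fail fast if someone later routed an alert to Ronald by email.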

Alternatives considered

Alt	Why rejected
No self-monitoring; trust the model	The “silent lie” problem is the whole point of this doc. Hard no.
Datadog / Sentry / production observability stack	Overkill, costly, requires cloud connection. Phase II revisit.
Confidence threshold alerts only (single metric)	Misses the most common failures: phantom cycles (confidence is fine), camera bumps (confidence is fine), state machine bugs.
Email Ronald on every alert	Ronald has no email path in our infrastructure. Even if he did, alert fatigue would shut him down within a week.
Auto-pause the whole system on any B-level alert	Too aggressive; lighting changes are constant. Soft-banner + log is the right level for B-tier.
No kill-switch	Risk of silently-wrong production numbers driving real plant decisions. Hard no.

Open questions

OPEN[Sophia, by 2026-06-01]: Pick the page channel. Discord webhook is the cheapest and we already have a Discord; Mailgun free tier (5K/mo) is the more “production” answer.
OPEN[Ronald via Armando, by 2026-06-15]: When the dashboard banner is in Spanish, what term does Ronald already use for “system degraded”? Match plant vernacular, don’t invent terminology.
OPEN[Andrew, by 2026-05-30]: Form AI’s playbook for what’s “alertable” vs “log-only” — crib from it.
OPEN[Sophia, before validation W2]: Does the kill-switch fire during the validation window itself, or is it suppressed so the paper can report what happens with no human-mitigation? Recommendation: kill-switch fires during validation but the underlying cycle_events table continues to record; the dashboard display is what hides numbers. The paper analyzes the unfiltered table.
OPEN[Agent C]: Jetson thermal sensor read path. Without it, A3 is unmonitorable. tegrastats exposes temps; pipe to telemetry.
OPEN[Sophia, paper]: What fraction of validation-window hours had health < 0.85? This number is paper-honest; it goes in Limitations whether or not it’s flattering.

Cross-bucket dependencies

Agent A (frontend): dashboard banner UI, health-score display, kill-switch “manual count required” state.
Agent B (backend): telemetry table; alert evaluator (a periodic Python job); withdrawn_consent_operators table referenced by D2.
Agent C (hardware): thermal monitoring path; PoE switch SNMP / link-state for A1 confirmation; periodic-frame review process for D1.
Agent E (business/legal): D-tier failures are the consent-posture safety net; D2 specifically depends on a tracked withdraw-consent log per validation-methodology.md §5.1.

What’s weak in this doc

The health-score weighting is hand-tuned and untested. The 0.30 / 0.10 / 0.10 / 0.10 coefficients haven’t been validated against a window where we know which numbers were wrong. They’ll need to be re-tuned after the W1+W2 validation window in Pereira. A defensible alternative is to make the score a simple min(...) over component sub-scores rather than a weighted sum.
No statistical-process-control angle. Confidence-distribution drift via KS test is one signal; a real industrial monitoring system uses CUSUM or EWMA charts for slow drift. Phase I doc skips this — it’s a defensible Phase II add but the paper will note the gap.
The “page channel” is unresolved. Without a working page channel, A-level alerts are log-only, and Sophia is in California while the system runs in Pereira. A 6-hour delay between failure and Sophia seeing the log is a real possibility. This must be closed before go-live.
B4 (model drift on new operator) uses a per-station baseline that assumes the station’s nominal cycle rate is stable. In practice operators rotate between stations and rates vary. A more robust signal would be operator-conditioned, but Phase I doesn’t have operator identity at the CV layer (consent constraint). The pragmatic compromise — per-station baseline — is what’s specified, with the known weakness that operator rotation can trip the alert.
B6’s “stuck-on” rule (4-hour OCCUPIED) is generous — a slow-but-real workstation could legitimately run a 2-hour cycle on the longest assembly operation. The 4-hour threshold is safe but late. A smarter rule looks at frame-to-frame bbox motion: a poster of a person produces zero motion; a real seated operator produces some. Adds compute. Not specified here.
No paging-fatigue analysis. If A-level alerts fire more than ~1× per week on average, Sophia will start ignoring them. The doc lists severities but does not estimate frequency. Production observability guidance (e.g., Google’s Site Reliability Engineering book) emphasizes this. Worth a paragraph after the first month.
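
For the statistical-process-control gap noted above, a textbook EWMA control chart is short enough to sketch as a picture of what the Phase II add might involve. This is not part of the Phase I spec: the function name, the smoothing weight `lam=0.2`, the limit width `L=3.0`, and the choice of per-camera median confidence as the monitored series are all assumptions.

```python
import math


def ewma_alarms(values, mu0, sigma, lam=0.2, L=3.0):
    """Return the indices at which the EWMA statistic leaves its control limits.

    z_t = lam * x_t + (1 - lam) * z_{t-1}, with z_0 = mu0.
    Limits are mu0 +/- L * sigma * sqrt(lam / (2 - lam) * (1 - (1 - lam)^(2t))).
    """
    z = mu0
    alarms = []
    for t, x in enumerate(values, start=1):
        z = lam * x + (1 - lam) * z
        half_width = L * sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * t))
        )
        if abs(z - mu0) > half_width:
            alarms.append(t - 1)  # report 0-based index into values
    return alarms
```

Fed an hourly median-confidence series with in-control mean 0.80 and standard deviation 0.02, a sustained shift to 0.76 is flagged after a few samples, even though that 0.04 drop is well under the 0.15 absolute threshold B1 uses.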

Rollout

Date	Gate
2026-06-01	Telemetry table + warm-tier per-frame log running locally.
2026-06-15	Alert evaluator job running; banner UI wired into dashboard.
2026-06-22	Page channel live and tested with a synthetic failure.
2026-07-01	Pereira day 1: monitoring is on from minute one. Banners may flicker as we tune thresholds — acceptable.
2026-07-15	First post-validation threshold tuning pass; health-score weights revised.
2026-08-01	Phase I “stable” — false-banner rate ≤ 1 per camera per day.

Paper alignment

Methods / Experimental setup: §“Catalog of silent failure modes” → Table 7 in the paper. The act of enumerating these is itself a contribution at the IEEE-deployment-paper level.
Results: report the number of minutes (and the fraction of validation-window hours) with health < 0.85; report which alerts fired and what we learned from them. Honest reporting here makes the paper much stronger.
Discussion: the kill-switch as a deployment-safety design pattern is worth a paragraph — small-factory deployments rarely have it and rarely admit they don’t.