Failure modes & self-monitoring

Bucket: technical/ml (Agent D) · Status: reviewed (Phase B: 3–5 fps per ADR-004) · Owner: Sophia Mann · Phase: I · Last updated: 2026-05-12

The cycle-event detector (cycle-event-detection.md) is a probabilistic system attached to a deterministic-looking business surface (per-station cycle counts, per-shift efficiency %, Excel export matching INDICADORES ABRIL.xlsx). The danger is that the system silently lies: the dashboard shows numbers, Ronald trusts them, decisions get made on them, and nobody knows the underlying CV pipeline is mis-firing.

Phase I must include self-monitoring that fires before the numbers reach Ronald. This doc catalogs the failure modes, the detection signal for each, and the alert routing.

The “fail loudly” principle is non-negotiable for the paper: silent failure of a Phase I deployment disqualifies the paper’s validation methodology, because you can’t claim agreement with the stopwatch over a window where the CV system was secretly broken.

Goals

  • G1: Catalog every silent-failure mode we can think of, with the detection signal for each.
  • G2: Specify alert thresholds and routing: what surfaces to Ronald, what surfaces to Sophia, what gets auto-mitigated.
  • G3: Define the kill-switch: at what aggregate metric does the system stop reporting numbers and ask for human intervention?
  • G4: Make the self-monitoring data part of the paper, not background noise.

Non-goals

  • General Linux/Jetson system monitoring (CPU, disk, network): Agent C’s bucket, but signals are listed below where they affect ML.
  • Application-level uptime monitoring (dashboard reachable, API responds): Agent B.
  • Active model-improvement loops (retraining on bad cases): covered in training-and-finetuning.md Stage C.

A — Pipeline-level failures (the system is broken; numbers are wrong)

| # | Failure | Why silent | Detection signal | Threshold | Alert routing |
|---|---|---|---|---|---|
| A1 | Camera dropped offline (network blip, PoE port reset) | GStreamer pipeline may auto-reconnect; cycle counter stops accumulating; dashboard tile freezes at “last seen 3 min ago”, which is easy to miss | frames_received_per_minute per camera | < 50% of nominal (≈ 180–300/min at 3–5 fps) for ≥ 60 s | Ronald (dashboard banner) + Sophia (log) |
| A2 | GStreamer pipeline stalled (decoder hangs but doesn’t error) | Process is alive, just stuck | Per-camera last_frame_ts not advancing | ≥ 30 s since last frame | Auto-restart that pipeline; alert Sophia |
| A3 | Jetson thermal throttle | Inference latency creeps up; YOLOv8 starts dropping confidence; no exception | Mean per-frame inference latency vs 24-h baseline | > 1.5× baseline over 10-min window | Ronald (banner: “system thermal, call Sophia”); Sophia (page) |
| A4 | Inference engine OOM / crash with auto-restart | systemd restart loop; cycles missed during downtime | Process restart count per hour | ≥ 1 unexpected restart | Sophia (log) |
| A5 | SQLite locked / corrupted | cycle_events writes fail; events lost | DB write-success rate | < 99% | Sophia (page); fallback to JSONL on disk |
| A6 | NVMe at >85% capacity | Rolling deletion stops new recordings | df on NVMe mount | > 85% | Sophia (banner) |
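
As an illustration of how the A1 and A2 rows could be evaluated inside the alert evaluator, here is a minimal sketch. The `CameraStats` record, the `pipeline_alerts` function, and the choice of 3 fps as the nominal rate are assumptions made for this example; only the thresholds (50% of nominal, 30 s since last frame) come from the table.

```python
from dataclasses import dataclass

NOMINAL_FPS = 3.0          # assumption: low end of the 3-5 fps range
STALL_SECONDS = 30.0       # A2 threshold from the table
LOW_RATE_FRACTION = 0.5    # A1 threshold from the table


@dataclass
class CameraStats:
    """Hypothetical per-camera snapshot produced once per evaluation tick."""
    camera_id: str
    last_frame_ts: float       # Unix time of the most recent decoded frame
    frames_last_minute: int    # frames received in the trailing 60 s


def pipeline_alerts(stats: CameraStats, now: float) -> list[str]:
    """Return the A-level alert codes that apply to one camera."""
    alerts = []
    # A2: the process is alive but no frame has arrived for >= 30 s.
    if now - stats.last_frame_ts >= STALL_SECONDS:
        alerts.append("A2")
    # A1: the trailing-60-s frame count is below half of nominal.
    nominal_per_minute = NOMINAL_FPS * 60
    if stats.frames_last_minute < LOW_RATE_FRACTION * nominal_per_minute:
        alerts.append("A1")
    return alerts
```

Using the trailing-60-s count as the A1 input is one way to honor the "for ≥ 60 s" qualifier; a stricter reading would require the low rate to persist across consecutive ticks.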

B — Detector-quality failures (the system “works” but is silently mis-detecting)

| # | Failure | Why silent | Detection signal | Threshold | Alert routing |
|---|---|---|---|---|---|
| B1 | Camera lens fogged / smudged | Cycles still emit; confidence drops gradually | Per-camera median person-detection confidence, 1-hour window vs trailing-7-day baseline | drop > 0.15 absolute | Ronald (banner: “clean camera N”) |
| B2 | Lighting change (sunset, fluorescent failure, new lamp) | Same as B1 plus possible bbox flicker | (a) Per-camera mean frame luminance, drift > 30% vs baseline. (b) Confidence distribution shift (KS test, p < 0.01 on 1-hour vs trailing 7-day). | Either signal | Sophia (log); Ronald only if persistent > 6 h |
| B3 | Camera mount shift | Cycles emit but ROIs are mis-aligned → wrong workstation → wrong SAM comparison | ORB drift detection from roi-calibration.md §3 | Per roi-calibration.md thresholds | Ronald (banner: re-calibrate camera N) |
| B4 | Model drift on new operator / new garment SKU | Recall drops on specific operator; system silently under-counts their station | Per-station cycle count vs trailing-7-day median | drop > 25% over 4-hour window | Ronald (banner); flag the operator + station in cycle_events_rejected review |
| B5 | Hi-vis vest / new uniform confuses person detector | Bbox misses operator; under-count | Per-station frames_with_no_detection_inside_roi while station should be working | spike vs trailing baseline | Sophia (log); if persistent → manual review |
| B6 | Detector “stuck on”: a poster of a person on the wall is being detected | Phantom cycles on a station that is empty | A station detected as OCCUPIED for ≥ 4 hours with zero PAUSED excursions | duration > 4 h continuously | Ronald (banner) + cycle suppressed |
| B7 | Confidence collapse: model produces no detections at all | Cycle count → 0 for the whole module | Aggregate n_detections_per_minute across all cameras | < 5% of 24-h baseline for 10 min | Sophia (page) |
| B8 | State machine wedged (a bug, not a model issue) | Cycle stays OCCUPIED past MAX_CYCLE_DURATION repeatedly | cycle_events_rejected rate with reason='too_long' | > 3 per hour per station | Sophia (log); auto-snapshot state for debug |
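
Several B-tier signals are baseline comparisons that reduce to a few lines each. The sketch below covers B1, B4, and B6 under stated assumptions: the function names and argument shapes are hypothetical, baselines are assumed to be precomputed elsewhere (for example from the telemetry aggregates), and `occupied_since_ts` is assumed to reset whenever the station leaves OCCUPIED. Thresholds are taken from the table.

```python
from statistics import median

CONF_DROP_ABS = 0.15          # B1: absolute drop in median confidence
COUNT_DROP_FRAC = 0.25        # B4: relative drop in cycle count
STUCK_ON_SECONDS = 4 * 3600   # B6: continuous OCCUPIED duration


def b1_confidence_drop(window_confs, baseline_median):
    """B1: 1-hour median confidence fell more than 0.15 below baseline."""
    if not window_confs:
        return False  # no detections at all is the B7 case, not B1
    return baseline_median - median(window_confs) > CONF_DROP_ABS


def b4_count_drop(window_count, baseline_median_count):
    """B4: 4-hour cycle count is more than 25% below the 7-day median."""
    if baseline_median_count <= 0:
        return False  # no usable baseline yet
    drop = (baseline_median_count - window_count) / baseline_median_count
    return drop > COUNT_DROP_FRAC


def b6_stuck_on(occupied_since_ts, now):
    """B6: station has been OCCUPIED for over 4 h with no PAUSED excursion."""
    return now - occupied_since_ts > STUCK_ON_SECONDS
```

Each function returns a plain boolean so the evaluator can decide routing separately from detection.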

C — Validation-window-specific failures (during the paper-grade validation)

| # | Failure | Detection | Action |
|---|---|---|---|
| C1 | Ronald’s stopwatch and CV disagree by > 20% on count for a single shift | live tally during W2 (validation-methodology.md) | Pause validation; investigate before next shift |
| C2 | Tape-replay shows CV is right and Ronald’s stopwatch is wrong | post-shift review | Note for paper (the §3 pivot rule) |
| C3 | Validation observer (Ronald or secondary) fatigue late in shift | last-2-hours-vs-first-2-hours inter-rater agreement drop | Cap validation observation at 6 h/day |

D — Ethics / consent failures (the system silently violates the consent posture)

| # | Failure | Detection | Action |
|---|---|---|---|
| D1 | Camera repositioned to view a non-workstation area (bathroom hallway, locker room) | Manual periodic review of one frame per camera per week | Camera relocated / pointed away |
| D2 | Operator who withdrew consent is still being recorded | withdrawn_consent_operators table cross-checked against shift_assignments and the relevant station’s camera | Auto-disable recording for that station for that operator’s shifts |
| D3 | Raw frame for a face-blurred figure is accidentally checked into the public repo | git pre-commit hook scans for image files in figure paths | Block commit |
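
The D2 cross-check can be sketched as a single join. The table names `withdrawn_consent_operators` and `shift_assignments` come from the row above; every column name (`operator_id`, `station_id`, `shift_id`), the sample rows, and the helper names are assumptions for illustration, since the real schema is not specified here.

```python
import sqlite3


def stations_to_disable(conn):
    """Return (station_id, shift_id) pairs where a withdrawn operator is scheduled."""
    return conn.execute(
        """
        SELECT sa.station_id, sa.shift_id
        FROM shift_assignments AS sa
        JOIN withdrawn_consent_operators AS w
          ON w.operator_id = sa.operator_id
        ORDER BY sa.station_id, sa.shift_id
        """
    ).fetchall()


def demo_db():
    """Build an in-memory database with hypothetical columns and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE withdrawn_consent_operators (operator_id TEXT PRIMARY KEY);
        CREATE TABLE shift_assignments (
            shift_id TEXT, station_id TEXT, operator_id TEXT
        );
        INSERT INTO withdrawn_consent_operators VALUES ('op-7');
        INSERT INTO shift_assignments VALUES
            ('2026-07-01-AM', 'st-3', 'op-7'),
            ('2026-07-01-AM', 'st-4', 'op-2'),
            ('2026-07-01-PM', 'st-3', 'op-5');
        """
    )
    return conn
```

With the sample rows, only station `st-3` on the morning shift is returned, so recording for the afternoon shift at the same station would stay enabled.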

A single system_health value ∈ [0, 1] is computed every minute and surfaced in the dashboard footer:

health = 1.0
- 0.30 * (any A-level alert active)
- 0.10 * (number of B-level alerts active, capped at 0.40)
- 0.10 * (fraction of cycles in trailing 1h tagged 'red' or rejected, capped at 0.20)
- 0.10 * (any active drift warning from roi-calibration §3)

Bands:

  • health ≥ 0.85 — green dashboard
  • 0.65 ≤ health < 0.85 — yellow dashboard, with banner listing active alerts
  • health < 0.65 — red dashboard; kill-switch fires: cycle counts continue to be logged to the DB but the dashboard hides per-station efficiency numbers and shows “system degraded — manual count required” instead. Excel export annotates affected shifts.

The kill-switch is the critical Phase I safety net: it converts a silent failure into a noisy one. Ronald is trained to revert to stopwatch when the kill-switch fires; the system never knowingly outputs wrong numbers to the plant decision loop.
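
A minimal sketch of the score and band logic follows. It reads each "capped at" as a cap on that term's penalty, which is an interpretation rather than something the formula states outright; under that reading the 0.20 cap on the cycle-fraction term can never bind, because 0.10 times a fraction is at most 0.10. The function and parameter names are hypothetical.

```python
def system_health(a_alert_active, n_b_alerts, bad_cycle_fraction, drift_warning_active):
    """Compute system_health from the four penalty terms."""
    penalty = 0.0
    penalty += 0.30 if a_alert_active else 0.0
    penalty += min(0.40, 0.10 * n_b_alerts)
    # Assumption: the cap applies to the penalty; it is not reachable as written.
    penalty += min(0.20, 0.10 * bad_cycle_fraction)
    penalty += 0.10 if drift_warning_active else 0.0
    # Round to suppress float noise near the band edges.
    return round(max(0.0, 1.0 - penalty), 3)


def dashboard_band(health):
    """Map a health value to the dashboard color band."""
    if health >= 0.85:
        return "green"
    if health >= 0.65:
        return "yellow"
    return "red"


def kill_switch_fires(health):
    """The kill-switch corresponds to the red band."""
    return dashboard_band(health) == "red"
```

Under this reading, a single A-level alert with nothing else active gives 0.70 (yellow), so the kill-switch needs at least one further penalty, such as one B-level alert, to fire.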

Per-frame logs are heavy, so logging is split into three tiers:

Hot tier (always on): per-cycle event with quality tag, per-camera 1-minute aggregate (detections_per_min, mean_conf, median_conf, frame_count, mean_luminance).

Warm tier (sampled): per-frame log for 1 frame per camera per minute (frame_id, ts, n_detections, confidences[], bboxes[]). Stored as rotated JSONL.

Cold tier (audit): one full-resolution frame per camera per hour, immutable, retained for the project’s life (per phase-ii-preview.md §3).

Total log volume ≈ 50–200 MB/day, dwarfed by video buffer.
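
One possible shape for the warm-tier sampler is sketched below: it keeps the first frame seen per camera in each wall-clock minute and serializes it as one JSONL line. The class name, the `camera_id` field, and the "first frame of the minute" policy are assumptions; the remaining fields follow the warm-tier description above. File rotation is left out.

```python
import json


class WarmTierSampler:
    """Emit at most one JSONL record per camera per wall-clock minute."""

    def __init__(self):
        self._last_minute = {}  # camera_id -> minute index last logged

    def maybe_record(self, camera_id, frame_id, ts, confidences, bboxes):
        """Return a JSON line for this frame, or None if the minute is covered."""
        minute = int(ts // 60)
        if self._last_minute.get(camera_id) == minute:
            return None
        self._last_minute[camera_id] = minute
        return json.dumps({
            "camera_id": camera_id,   # assumption: not listed in the tier spec
            "frame_id": frame_id,
            "ts": ts,
            "n_detections": len(confidences),
            "confidences": confidences,
            "bboxes": bboxes,
        })
```

The caller would append each non-None line to the current JSONL file; sampling by minute index keeps the per-camera rate bounded regardless of the actual frame rate.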

Logs go to local SQLite (telemetry table) + rotated JSONL. No external log service in Phase I — Tailscale-pulled logs only. Phase II revisit when AWS comes online.

| Severity | Channel | Audience |
|---|---|---|
| Banner (UI) | Dashboard top bar, Spanish + English | Ronald + anyone with dashboard access |
| Log | journalctl on Jetson + Tailscale-pullable | Sophia / Andrew |
| Page (P1) | OPEN[Sophia, by 2026-06-15]: Slack/Discord/email/SMS, pick one channel. Most likely: Discord webhook (free) or email via Mailgun/Resend. | Sophia, escalate to Andrew if no ack in 4 h |
| Auto-mitigate | systemd restart, suppress phantom cycle | (none: silent recovery, but logged) |

Ronald has no email / calendar / phone in any of Sophia’s connected services (per docs/overview/people.md). All Ronald-facing alerts must surface in the dashboard banner. Email-to-Ronald is not a path in Phase I.

| Alternative | Why rejected |
|---|---|
| No self-monitoring; trust the model | The “silent lie” problem is the whole point of this doc. Hard no. |
| Datadog / Sentry / production observability stack | Overkill, costly, requires cloud connection. Phase II revisit. |
| Confidence threshold alerts only (single metric) | Misses the most common failures: phantom cycles (confidence is fine), camera bumps (confidence is fine), state machine bugs. |
| Email Ronald on every alert | Ronald has no email path in our infrastructure. Even if he did, alert fatigue would shut him down within a week. |
| Auto-pause the whole system on any B-level alert | Too aggressive; lighting changes are constant. Soft-banner + log is the right level for B-tier. |
| No kill-switch | Risk of silently-wrong production numbers driving real plant decisions. Hard no. |

  • OPEN[Sophia, by 2026-06-01]: Pick the page channel. Discord webhook is the cheapest and we already have a Discord; Mailgun free tier (5K/mo) is the more “production” answer.
  • OPEN[Ronald via Armando, by 2026-06-15]: When the dashboard banner is in Spanish, what term does Ronald already use for “system degraded”? Match plant vernacular, don’t invent terminology.
  • OPEN[Andrew, by 2026-05-30]: Form AI’s playbook for what’s “alertable” vs “log-only” — crib from it.
  • OPEN[Sophia, before validation W2]: Does the kill-switch fire during the validation window itself, or is it suppressed so the paper can report what happens with no human-mitigation? Recommendation: kill-switch fires during validation but the underlying cycle_events table continues to record; the dashboard display is what hides numbers. The paper analyzes the unfiltered table.
  • OPEN[Agent C]: Jetson thermal sensor read path. Without it, A3 is unmonitorable. tegrastats exposes temps; pipe to telemetry.
  • OPEN[Sophia, paper]: What fraction of validation-window hours had health < 0.85? This number is paper-honest; it goes in Limitations whether or not it’s flattering.
  • Agent A (frontend): dashboard banner UI, health-score display, kill-switch “manual count required” state.
  • Agent B (backend): telemetry table; alert evaluator (a periodic Python job); withdrawn_consent_operators table referenced by D2.
  • Agent C (hardware): thermal monitoring path; PoE switch SNMP / link-state for A1 confirmation; periodic-frame review process for D1.
  • Agent E (business/legal): D-tier failures are the consent-posture safety net; D2 specifically depends on a tracked withdraw-consent log per validation-methodology.md §5.1.
  1. The health-score weighting is hand-tuned and untested. The 0.30 / 0.10 / 0.10 / 0.10 coefficients haven’t been validated against a window where we know which numbers were wrong. They’ll need to be re-tuned after the W1+W2 validation window in Pereira. A defensible alternative is to make the score a simple min(...) over component sub-scores rather than a weighted sum.
  2. No statistical-process-control angle. Confidence-distribution drift via KS test is one signal; a real industrial monitoring system uses CUSUM or EWMA charts for slow drift. Phase I doc skips this — it’s a defensible Phase II add but the paper will note the gap.
  3. The “page channel” is unresolved. Without a working page channel, A-level alerts are log-only, and Sophia is in California while the system runs in Pereira. A 6-hour delay between failure and Sophia seeing the log is a real possibility. This must be closed before go-live.
  4. B4 (model drift on new operator) uses a per-station baseline that assumes the station’s nominal cycle rate is stable. In practice operators rotate between stations and rates vary. A more robust signal would be operator-conditioned, but Phase I doesn’t have operator identity at the CV layer (consent constraint). The pragmatic compromise — per-station baseline — is what’s specified, with the known weakness that operator rotation can trip the alert.
  5. B6’s “stuck-on” rule (4-hour OCCUPIED) is generous — a slow-but-real workstation could legitimately run a 2-hour cycle on the longest assembly operation. The 4-hour threshold is safe but late. A smarter rule looks at frame-to-frame bbox motion: a poster of a person produces zero motion; a real seated operator produces some. Adds compute. Not specified here.
  6. No paging-fatigue analysis. If A-level alerts fire more than ~1× per week on average, Sophia will start ignoring them. The doc lists severities but does not estimate frequency. Production observability guidance (e.g., the Google SRE book) emphasizes keeping pages rare and actionable. Worth a paragraph after the first month of operation.
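
To make the gap in item 2 concrete, here is a minimal sketch of a textbook EWMA control chart applied to a per-camera metric such as hourly median confidence. It is not part of the Phase I specification. The function name, the smoothing factor `lam=0.2`, the limit width `L=3.0`, and the assumption that an in-control mean and standard deviation are available from a baseline period are all choices made for this example.

```python
import math


def ewma_alarms(values, mean, sigma, lam=0.2, L=3.0):
    """Return indices where the EWMA statistic leaves its control limits.

    values: successive observations of the monitored metric
    mean, sigma: in-control mean and standard deviation (from a baseline)
    lam: smoothing factor in (0, 1]; L: control-limit width in sigmas
    """
    z = mean  # the EWMA statistic starts at the in-control mean
    alarms = []
    for t, x in enumerate(values, start=1):
        z = lam * x + (1 - lam) * z
        # Time-varying limit: widens toward its asymptote as t grows.
        width = L * sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * t))
        )
        if abs(z - mean) > width:
            alarms.append(t - 1)
    return alarms
```

Because the statistic accumulates small deviations, a slow steady decline that never trips a fixed per-window threshold can still cross the EWMA limits after a few observations.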

| Date | Gate |
|---|---|
| 2026-06-01 | Telemetry table + warm-tier per-frame log running locally. |
| 2026-06-15 | Alert evaluator job running; banner UI wired into dashboard. |
| 2026-06-22 | Page channel live and tested with a synthetic failure. |
| 2026-07-01 | Pereira day 1: monitoring is on from minute one. Banners may flicker as we tune thresholds, which is acceptable. |
| 2026-07-15 | First post-validation threshold tuning pass; health-score weights revised. |
| 2026-08-01 | Phase I “stable”: false-banner rate ≤ 1 per camera per day. |

  • Methods / Experimental setup: §“Catalog of silent failure modes” → Table 7 in the paper. The act of enumerating these is itself a contribution at the IEEE-deployment-paper level.
  • Results: report health < 0.85 minutes during the validation window; report which alerts fired and what we learned from them. Honest reporting here makes the paper much stronger.
  • Discussion: the kill-switch as a deployment-safety design pattern is worth a paragraph — small-factory deployments rarely have it and rarely admit they don’t.