Failure modes & self-monitoring
Bucket: technical/ml (Agent D) · Status: reviewed (Phase B: 3–5 fps per ADR-004) · Owner: Sophia Mann · Phase: I · Last updated: 2026-05-12
Context
The cycle-event detector (cycle-event-detection.md) is a probabilistic system attached to a deterministic-looking business surface (per-station cycle counts, per-shift efficiency %, Excel export matching INDICADORES ABRIL.xlsx). The danger is that the system silently lies: the dashboard shows numbers, Ronald trusts them, decisions get made on them, and nobody knows the underlying CV pipeline is mis-firing.
Phase I must include self-monitoring that fires before the numbers reach Ronald. This doc catalogs the failure modes, the detection signal for each, and the alert routing.
The “fail loudly” principle is non-negotiable for the paper: silent failure of a Phase I deployment disqualifies the paper’s validation methodology, because you can’t claim agreement with the stopwatch over a window where the CV system was secretly broken.
Goals
- G1: Catalog every silent-failure mode we can think of with the detection signal for each.
- G2: Specify alert thresholds and routing — what surfaces to Ronald, what surfaces to Sophia, what gets auto-mitigated.
- G3: Define the kill-switch: at what aggregate metric does the system stop reporting numbers and ask for human intervention?
- G4: Make the self-monitoring data part of the paper, not background noise.
Non-goals
- General Linux/Jetson system monitoring (CPU, disk, network): Agent C’s bucket, though signals are listed below where they affect ML.
- Application-level uptime monitoring (dashboard reachable, API responds) — Agent B.
- Active model-improvement loops (retraining on bad cases): covered in `training-and-finetuning.md`, Stage C.
Catalog of silent failure modes
A — Pipeline-level failures (the system is broken; numbers are wrong)
| # | Failure | Why silent | Detection signal | Threshold | Alert routing |
|---|---|---|---|---|---|
| A1 | Camera dropped offline (network blip, PoE port reset) | GStreamer pipeline may auto-reconnect; cycle counter stops accumulating; dashboard tile freezes “last seen 3 min ago” — easy to miss | frames_received_per_minute per camera | < 50% of nominal (≈ 180–300/min at 3–5 fps) for ≥ 60 s | Ronald (dashboard banner) + Sophia (log) |
| A2 | GStreamer pipeline stalled (decoder hangs but doesn’t error) | Process is alive, just stuck | Per-camera last_frame_ts not advancing | ≥ 30 s since last frame | Auto-restart that pipeline; alert Sophia |
| A3 | Jetson thermal throttle | Inference latency creeps up; YOLOv8 starts dropping confidence; no exception | Mean per-frame inference latency vs 24-h baseline | > 1.5× baseline over 10-min window | Ronald (banner: “system thermal — call Sophia”); Sophia (page) |
| A4 | Inference engine OOM / crash with auto-restart | systemd restart loop; cycles missed during downtime | Process restart count per hour | ≥ 1 unexpected restart | Sophia (log) |
| A5 | SQLite locked / corrupted | cycle_events writes fail; events lost | DB write-success rate | < 99% | Sophia (page); fallback to JSONL on disk |
| A6 | NVMe at >85% capacity | Rolling deletion stops new recordings | df on NVMe mount | > 85% | Sophia (banner) |
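The A1 and A2 rows above can be sketched as a per-camera check run by the alert evaluator. This is a minimal illustration, not the project's evaluator: `CameraState`, `check_camera`, and the choice of 3 fps as the nominal rate are assumptions made for the example; only the thresholds (50% of nominal for 60 s, 30 s without a frame) come from the table.

```python
from dataclasses import dataclass
from typing import List, Optional

NOMINAL_FPS = 3.0          # assumption: low end of the 3-5 fps range
A1_RATE_FRACTION = 0.50    # A1: below 50% of nominal
A1_MIN_DURATION_S = 60.0   # A1: sustained for at least 60 s
A2_STALL_S = 30.0          # A2: no new frame for at least 30 s


@dataclass
class CameraState:
    """Hypothetical per-camera bookkeeping kept by the evaluator."""
    last_frame_ts: float                    # time of the newest frame, seconds
    frames_last_minute: int                 # frames_received_per_minute
    low_rate_since: Optional[float] = None  # when the rate first fell too low


def check_camera(state: CameraState, now: float) -> List[str]:
    """Return the A-level alert ids currently active for one camera."""
    alerts: List[str] = []

    # A2: the process is alive but last_frame_ts has stopped advancing.
    if now - state.last_frame_ts >= A2_STALL_S:
        alerts.append("A2")

    # A1: frame rate under half of nominal, and it has stayed there for 60 s.
    nominal_per_min = NOMINAL_FPS * 60
    if state.frames_last_minute < A1_RATE_FRACTION * nominal_per_min:
        if state.low_rate_since is None:
            state.low_rate_since = now
        if now - state.low_rate_since >= A1_MIN_DURATION_S:
            alerts.append("A1")
    else:
        state.low_rate_since = None  # rate recovered, reset the timer

    return alerts
```

Tracking `low_rate_since` keeps a single slow minute from raising A1; the alert only appears once the low rate has persisted for the full 60 s.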
B — Detector-quality failures (the system “works” but is silently mis-detecting)
| # | Failure | Why silent | Detection signal | Threshold | Alert routing |
|---|---|---|---|---|---|
| B1 | Camera lens fogged / smudged | Cycles still emit; confidence drops gradually | Per-camera median person-detection confidence, 1-hour window vs trailing-7-day baseline | drop > 0.15 absolute | Ronald (banner: “clean camera N”) |
| B2 | Lighting change (sunset, fluorescent failure, new lamp) | Same as B1 plus possible bbox flicker | (a) Per-camera mean frame luminance, drift > 30% vs baseline. (b) Confidence distribution shift (KS test, p < 0.01 on 1-hour vs trailing 7-day). | Either signal | Sophia (log); Ronald only if persistent > 6 h |
| B3 | Camera mount shift | Cycles emit but ROIs are mis-aligned → wrong workstation → wrong SAM comparison | ORB drift detection from roi-calibration.md §3 | Per roi-calibration.md thresholds | Ronald (banner: re-calibrate camera N) |
| B4 | Model drift on new operator / new garment SKU | Recall drops on specific operator; system silently under-counts their station | Per-station cycle count vs trailing-7-day median | drop > 25% over 4-hour window | Ronald (banner); flag the operator + station in cycle_events_rejected review |
| B5 | Hi-vis vest / new uniform confuses person detector | Bbox misses operator; under-count | Per-station frames_with_no_detection_inside_roi while station should be working | spike vs trailing baseline | Sophia (log); if persistent → manual review |
| B6 | Detector “stuck on” — a poster of a person on the wall is being detected | Phantom cycles on a station that is empty | A station detected as OCCUPIED for ≥ 4 hours with zero PAUSED excursions | duration > 4 h continuously | Ronald (banner) + cycle suppressed |
| B7 | Confidence collapse — model produces no detections at all | Cycle count → 0 for the whole module | Aggregate n_detections_per_minute across all cameras | < 5% of 24-h baseline for 10 min | Sophia (page) |
| B8 | State machine wedged (a bug, not a model issue) | Cycle stays OCCUPIED past MAX_CYCLE_DURATION repeatedly | cycle_events_rejected rate with reason='too_long' | > 3 per hour per station | Sophia (log); auto-snapshot state for debug |
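Two of the B-tier thresholds above reduce to a comparison against a trailing median. The sketch below shows B1 and B4 under stated assumptions: the function names are hypothetical, and for B4 the trailing baseline is assumed to be the counts for the same 4-hour window on each of the previous 7 days, which the table does not spell out.

```python
from statistics import median
from typing import Sequence

B1_MAX_CONF_DROP = 0.15   # absolute drop in median confidence (row B1)
B4_MAX_COUNT_DROP = 0.25  # relative drop in cycle count (row B4)


def b1_confidence_drop(last_hour_conf: Sequence[float],
                       baseline_median: float) -> bool:
    """B1: 1-hour median confidence fell more than 0.15 below the baseline."""
    if not last_hour_conf:
        return False  # zero detections is covered by B7, not B1
    return baseline_median - median(last_hour_conf) > B1_MAX_CONF_DROP


def b4_count_drop(window_count: int, trailing_counts: Sequence[int]) -> bool:
    """B4: 4-hour cycle count fell more than 25% below the trailing median."""
    if not trailing_counts:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(trailing_counts)
    if baseline <= 0:
        return False  # station was idle in the baseline; ratio is undefined
    return (baseline - window_count) / baseline > B4_MAX_COUNT_DROP
```

For example, a station that normally completes about 100 cycles in the window and now shows 70 is a 30% drop and would raise B4; 80 cycles is a 20% drop and would not.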
C — Validation-window-specific failures (during the paper-grade validation)
| # | Failure | Detection | Action |
|---|---|---|---|
| C1 | Ronald’s stopwatch and CV disagree by > 20% on count for a single shift | live tally during W2 (validation-methodology.md) | Pause validation; investigate before next shift |
| C2 | Tape-replay shows CV is right and Ronald’s stopwatch is wrong | post-shift review | Note for paper (the §3 pivot rule) |
| C3 | Validation observer (Ronald or secondary) fatigue late in shift | last-2-hours-vs-first-2-hours inter-rater agreement drop | Cap validation observation at 6 h/day |
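The C1 rule is simple arithmetic, but the table does not say which count is the denominator. The sketch below assumes the stopwatch count is the reference; both function names are hypothetical.

```python
def count_disagreement(stopwatch_count: int, cv_count: int) -> float:
    """Relative disagreement on cycle count, with the stopwatch as reference."""
    if stopwatch_count <= 0:
        raise ValueError("stopwatch_count must be positive")
    return abs(cv_count - stopwatch_count) / stopwatch_count


def should_pause_validation(stopwatch_count: int, cv_count: int,
                            threshold: float = 0.20) -> bool:
    """C1: pause validation when the two counts differ by more than 20%."""
    return count_disagreement(stopwatch_count, cv_count) > threshold
```

With 400 stopwatch cycles in a shift, a CV count of 310 is a 22.5% disagreement and would pause validation; a CV count of 330 (17.5%) would not.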
D — Ethics / consent failures (the system silently violates the consent posture)
| # | Failure | Detection | Action |
|---|---|---|---|
| D1 | Camera repositioned to view a non-workstation area (bathroom hallway, locker room) | Manual periodic review of one frame per camera per week | Camera relocated / pointed away |
| D2 | Operator who withdrew consent is still being recorded | withdrawn_consent_operators table cross-checked against shift_assignments and the relevant station’s camera | Auto-disable recording for that station for that operator’s shifts |
| D3 | Raw frame for a face-blurred figure is accidentally checked into the public repo | git pre-commit hook scans for image files in figure paths | Block commit |
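The D3 check can be sketched as a pure function that a pre-commit hook would call with the list of staged paths. The figure directories and the suffix list are placeholders, not paths taken from the repo, and the function cannot tell a raw frame from a blurred one: it flags every image under a figure path, matching the rule as written in the table.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Assumptions for illustration: neither list comes from the project repo.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"}
FIGURE_DIRS = ("figures", "paper/figures")


def blocked_paths(staged: Iterable[str]) -> List[str]:
    """Return staged paths that are image files under a figure directory."""
    hits: List[str] = []
    for raw in staged:
        # Match on a directory prefix so "docs/figures.md" is not caught.
        in_figure_dir = any(raw.startswith(d.rstrip("/") + "/")
                            for d in FIGURE_DIRS)
        is_image = PurePosixPath(raw).suffix.lower() in IMAGE_SUFFIXES
        if in_figure_dir and is_image:
            hits.append(raw)
    return hits
```

In a hook, the staged list could come from `git diff --cached --name-only --diff-filter=ACM`, and a non-empty result would make the hook exit non-zero to block the commit. How approved, face-blurred figures get past the check is left open here.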
Aggregate health score
A single system_health value ∈ [0, 1] is computed every minute and surfaced in the dashboard footer:

    health = 1.0
           - 0.30 * (any A-level alert active)
           - 0.10 * (number of B-level alerts active, capped at 0.40)
           - 0.10 * (fraction of cycles in trailing 1h tagged 'red' or rejected, capped at 0.20)
           - 0.10 * (any active drift warning from roi-calibration §3)

Bands:
- `health ≥ 0.85`: green dashboard
- `0.65 ≤ health < 0.85`: yellow dashboard, with banner listing active alerts
- `health < 0.65`: red dashboard; kill-switch fires: cycle counts continue to be logged to the DB but the dashboard hides per-station efficiency numbers and shows “system degraded — manual count required” instead. Excel export annotates affected shifts.
The kill-switch is the critical Phase I safety net: it converts a silent failure into a noisy one. Ronald is trained to revert to stopwatch when the kill-switch fires; the system never knowingly outputs wrong numbers to the plant decision loop.
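A minimal sketch of the health score and its bands follows, assuming the B-level term means 0.10 per active alert with the total penalty capped at 0.40. The function and argument names are hypothetical. Two consequences of the formula as written are worth seeing in code: a single A-level alert on its own gives 0.70 (yellow, no kill-switch), and the 0.20 cap on the cycle-quality term can never bind, because 0.10 times a fraction in [0, 1] is at most 0.10.

```python
def system_health(a_alert_active: bool, n_b_alerts: int,
                  bad_cycle_fraction: float, drift_warning: bool) -> float:
    """Weighted-sum health score using the coefficients from this section."""
    penalty_a = 0.30 if a_alert_active else 0.0
    # Assumed reading: 0.10 per B-level alert, total capped at 0.40.
    penalty_b = min(0.10 * n_b_alerts, 0.40)
    # The cap is kept as written, although 0.10 * fraction never exceeds 0.10.
    penalty_cycles = min(0.10 * bad_cycle_fraction, 0.20)
    penalty_drift = 0.10 if drift_warning else 0.0
    score = 1.0 - penalty_a - penalty_b - penalty_cycles - penalty_drift
    # Round away float noise so band boundaries behave predictably.
    return round(score, 4)


def band(health: float) -> str:
    """Map a health value to the dashboard band; 'red' means kill-switch."""
    if health >= 0.85:
        return "green"
    if health >= 0.65:
        return "yellow"
    return "red"
```

Under these assumptions the kill-switch needs an A-level alert plus at least one other penalty, or four or more B-level alerts, before it fires; whether that is the intended sensitivity is one of the things the post-validation tuning pass would need to confirm.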
Logging plan
Per-frame logs are heavy, so logging is split into three tiers:
Hot tier (always on): per-cycle event with quality tag, per-camera 1-minute aggregate (detections_per_min, mean_conf, median_conf, frame_count, mean_luminance).
Warm tier (sampled): per-frame log for 1 frame per camera per minute (frame_id, ts, n_detections, confidences[], bboxes[]). Stored as rotated JSONL.
Cold tier (audit): one full-resolution frame per camera per hour, immutable, retained for the project’s life (per phase-ii-preview.md §3).
Total log volume ≈ 50–200 MB/day, dwarfed by video buffer.
Logs go to local SQLite (telemetry table) + rotated JSONL. No external log service in Phase I — Tailscale-pulled logs only. Phase II revisit when AWS comes online.
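The warm tier's "one frame per camera per minute" rule can be sketched as a small sampler that emits a JSONL line only for the first frame seen in each minute. `WarmTierSampler` and the `camera_id` field are assumptions added for the example; the other fields are the ones listed for the warm tier above.

```python
import json
from typing import Dict, Optional, Sequence


class WarmTierSampler:
    """Keep at most one per-frame record per camera per minute."""

    def __init__(self) -> None:
        # Last minute index (ts // 60) already logged, keyed by camera.
        self._last_minute: Dict[str, int] = {}

    def maybe_record(self, camera_id: str, frame_id: int, ts: float,
                     confidences: Sequence[float],
                     bboxes: Sequence[Sequence[float]]) -> Optional[str]:
        """Return one JSONL line for this frame, or None if it is skipped."""
        minute = int(ts // 60)
        if self._last_minute.get(camera_id) == minute:
            return None  # this camera already has a sample for this minute
        self._last_minute[camera_id] = minute
        record = {
            "camera_id": camera_id,
            "frame_id": frame_id,
            "ts": ts,
            "n_detections": len(confidences),
            "confidences": list(confidences),
            "bboxes": [list(b) for b in bboxes],
        }
        return json.dumps(record)
```

The caller would append each returned line to the current rotated JSONL file; rotation and retention are outside this sketch.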
Alert routing
| Severity | Channel | Audience |
|---|---|---|
| Banner (UI) | Dashboard top bar, Spanish + English | Ronald + anyone with dashboard access |
| Log | journalctl on Jetson + Tailscale-pullable | Sophia / Andrew |
| Page (P1) | OPEN[Sophia, by 2026-06-15]: Slack/Discord/email/SMS — pick one channel. Most likely: Discord webhook (free) or email via Mailgun/Resend. | Sophia, escalate to Andrew if no ack in 4 h |
| Auto-mitigate | systemd restart, suppress phantom cycle | (none — silent recovery, but logged) |
Ronald has no email / calendar / phone in any of Sophia’s connected services (per docs/overview/people.md). All Ronald-facing alerts must surface in the dashboard banner. Email-to-Ronald is not a path in Phase I.
Alternatives considered
| Alt | Why rejected |
|---|---|
| No self-monitoring; trust the model | The “silent lie” problem is the whole point of this doc. Hard no. |
| Datadog / Sentry / production observability stack | Overkill, costly, requires cloud connection. Phase II revisit. |
| Confidence threshold alerts only (single metric) | Misses the most common failures: phantom cycles (confidence is fine), camera bumps (confidence is fine), state machine bugs. |
| Email Ronald on every alert | Ronald has no email path in our infrastructure. Even if he did, alert fatigue would shut him down within a week. |
| Auto-pause the whole system on any B-level alert | Too aggressive; lighting changes are constant. Soft-banner + log is the right level for B-tier. |
| No kill-switch | Risk of silently-wrong production numbers driving real plant decisions. Hard no. |
Open questions
- OPEN[Sophia, by 2026-06-01]: Pick the page channel. Discord webhook is the cheapest and we already have a Discord; Mailgun free tier (5K/mo) is the more “production” answer.
- OPEN[Ronald via Armando, by 2026-06-15]: When the dashboard banner is in Spanish, what term does Ronald already use for “system degraded”? Match plant vernacular, don’t invent terminology.
- OPEN[Andrew, by 2026-05-30]: Form AI’s playbook for what’s “alertable” vs “log-only” — crib from it.
- OPEN[Sophia, before validation W2]: Does the kill-switch fire during the validation window itself, or is it suppressed so the paper can report what happens with no human-mitigation? Recommendation: kill-switch fires during validation but the underlying `cycle_events` table continues to record; the dashboard display is what hides numbers. The paper analyzes the unfiltered table.
- OPEN[Agent C]: Jetson thermal sensor read path. Without it, A3 is unmonitorable. `tegrastats` exposes temps; pipe to telemetry.
- OPEN[Sophia, paper]: What fraction of validation-window hours had `health < 0.85`? This number is paper-honest; it goes in Limitations whether or not it’s flattering.
Cross-bucket dependencies
- Agent A (frontend): dashboard banner UI, health-score display, kill-switch “manual count required” state.
- Agent B (backend): `telemetry` table; alert evaluator (a periodic Python job); `withdrawn_consent_operators` table referenced by D2.
- Agent C (hardware): thermal monitoring path; PoE switch SNMP / link-state for A1 confirmation; periodic-frame review process for D1.
- Agent E (business/legal): D-tier failures are the consent-posture safety net; D2 specifically depends on a tracked withdraw-consent log per `validation-methodology.md` §5.1.
What’s weak in this doc
- The health-score weighting is hand-tuned and untested. The 0.30 / 0.10 / 0.10 / 0.10 coefficients haven’t been validated against a window where we know which numbers were wrong. They’ll need to be re-tuned after the W1+W2 validation window in Pereira. A defensible alternative is to make the score a simple `min(...)` over component sub-scores rather than a weighted sum.
- No statistical-process-control angle. Confidence-distribution drift via KS test is one signal; a real industrial monitoring system uses CUSUM or EWMA charts for slow drift. The Phase I doc skips this; it’s a defensible Phase II add, but the paper will note the gap.
- The “page channel” is unresolved. Without a working page channel, A-level alerts are log-only, and Sophia is in California while the system runs in Pereira. A 6-hour delay between failure and Sophia seeing the log is a real possibility. This must be closed before go-live.
- B4 (model drift on new operator) uses a per-station baseline that assumes the station’s nominal cycle rate is stable. In practice operators rotate between stations and rates vary. A more robust signal would be operator-conditioned, but Phase I doesn’t have operator identity at the CV layer (consent constraint). The pragmatic compromise — per-station baseline — is what’s specified, with the known weakness that operator rotation can trip the alert.
- B6’s “stuck-on” rule (4-hour OCCUPIED) is generous — a slow-but-real workstation could legitimately run a 2-hour cycle on the longest assembly operation. The 4-hour threshold is safe but late. A smarter rule looks at frame-to-frame bbox motion: a poster of a person produces zero motion; a real seated operator produces some. Adds compute. Not specified here.
- No paging-fatigue analysis. If A-level alerts fire more than ~1× per week on average, Sophia will start ignoring them. The doc lists severities but does not estimate frequency. Real production observability frameworks (e.g., Google SRE handbook) emphasize this. Worth a paragraph after first month.
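To make the statistical-process-control gap in the list above concrete, here is a textbook EWMA control chart in a few lines. It is not part of the Phase I spec. The baseline `mean` and `sigma` are assumed to come from a period known to be healthy, and `lam=0.2` with `width=3.0` are common starting values for such charts, not tuned ones.

```python
import math
from typing import Iterable, List


def ewma_alarms(values: Iterable[float], mean: float, sigma: float,
                lam: float = 0.2, width: float = 3.0) -> List[int]:
    """Return the indices at which the EWMA statistic leaves its limits.

    z_t = lam * x_t + (1 - lam) * z_{t-1}, starting from z_0 = mean.
    Limits are mean +/- width * sigma * sqrt(lam/(2-lam) * (1-(1-lam)^(2t))).
    """
    z = mean
    alarms: List[int] = []
    for t, x in enumerate(values, start=1):
        z = lam * x + (1 - lam) * z
        # Standard deviation of the EWMA statistic after t samples.
        spread = sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        if abs(z - mean) > width * spread:
            alarms.append(t - 1)  # zero-based index into the input
    return alarms
```

Fed with per-camera hourly median confidence, a chart like this reacts to a slow, steady decline that never produces the single 0.15 step B1 is looking for, which is the kind of drift the KS test and fixed thresholds can miss.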
Rollout
| Date | Gate |
|---|---|
| 2026-06-01 | Telemetry table + warm-tier per-frame log running locally. |
| 2026-06-15 | Alert evaluator job running; banner UI wired into dashboard. |
| 2026-06-22 | Page channel live and tested with a synthetic failure. |
| 2026-07-01 | Pereira day 1: monitoring is on from minute one. Banners may flicker as we tune thresholds — acceptable. |
| 2026-07-15 | First post-validation threshold tuning pass; health-score weights revised. |
| 2026-08-01 | Phase I “stable” — false-banner rate ≤ 1 per camera per day. |
Paper alignment
- Methods / Experimental setup: §“Catalog of silent failure modes” → Table 7 in the paper. The act of enumerating these is itself a contribution at the IEEE-deployment-paper level.
- Results: report `health < 0.85` minutes during the validation window; report which alerts fired and what we learned from them. Honest reporting here makes the paper much stronger.
- Discussion: the kill-switch as a deployment-safety design pattern is worth a paragraph; small-factory deployments rarely have it and rarely admit they don’t.