Reproducibility & artifacts

Bucket: technical/ml (Agent D) · Status: draft · Owner: Sophia Mann · Phase: I (paper-grade) · Last updated: 2026-05-10

Context

For the IEEE submission (CASE / Access / T-ASE / IROS targets), reproducibility is increasingly an explicit reviewer criterion. IEEE Access in particular asks authors to make data and code available “to the extent permitted by the institution / data provider.” A reviewer who can re-run the analysis script against the released dataset and see the paper’s numbers come out is far more likely to recommend acceptance; one who cannot is likely to reject at R2.

LBZF data is classified Confidencial — Uso Interno (per the Needs Definition). Operator identifiers are linked to named individuals via Ref22 Slim - Angela.xlsx and INDICADORES ABRIL.xlsx. So the reproducibility plan cannot be “publish everything.” It must be: publish enough that the methodology is reproducible without compromising the operators or LBZF.

This doc specifies the artifact set, the open-source / data-release posture, and the storage / version discipline.

Goals

G1: Every figure/table in the paper can be regenerated from a small, named set of artifacts that lives in a known location.
G2: A third-party reviewer can run an evaluation script against a public sample and reproduce at least the model-side results.
G3: Operator identifiability is removable at the artifact level (face-blurring is a step in the pipeline, not a manual figure edit).
G4: LBZF retains control over its own footage; nothing exits without Mariana’s sign-off.

Non-goals

Long-term archive of the full 4 TB rolling video buffer (an operational concern, not a paper artifact).
Public release of operator-identifiable production data.
Open-sourcing the dashboard/landing-page web stack — Agent A’s call.
Hosting infrastructure for downloadable artifacts beyond what’s already in scope (S3, GitHub, Drive).

Artifact set

The full set a paper-reviewer should be able to access (modulo LBZF approval per category):

Artifact	Format	Public?	Location
Code: detector + state machine	Python package, `lbzf-cv/`	YES	GitHub `sophiamann/lbzfai-cv` (separate repo from the landing page)
Code: calibration tool	Python + JS in same repo	YES	same
Code: training pipeline	Same repo, `training/` subdir	YES	same
Code: validation analysis script (`validate_v1.py`)	Same repo	YES	same
Model: yolov8n_lbzf_v0.pt (fine-tuned, if Stage B runs)	PyTorch checkpoint	MAYBE — needs LBZF approval	If yes: GitHub LFS or HuggingFace Hub
Model: yolov8n.engine (TensorRT)	TensorRT engine	NO — non-portable; rebuild on user’s Jetson	(doc only)
Dataset: full Phase I training set	COCO-format JSON + frames	NO (operator-identifiable)	LBZF-owned S3
Dataset: 100-frame “public sample”	COCO-format + frames, face-blurred + manual review	MAYBE — LBZF sign-off	HuggingFace Datasets or Zenodo
Dataset: split CSV (`dataset_split_v0.csv`)	CSV, frame-id only (no images)	YES	GitHub
Validation: raw `cycle_events.csv` (anonymized)	CSV	MAYBE — aggregated only	OSF or Zenodo
Validation: per-shift `manual_observations.csv` (Ronald stopwatch)	CSV	MAYBE — aggregated only	OSF or Zenodo
Validation: paired Bland-Altman input data	CSV	YES (aggregated, no operator IDs)	OSF or Zenodo
Telemetry: health-score timeseries for validation window	CSV	YES	same
Paper figures source files (notebooks / matplotlib scripts)	`.ipynb` / `.py`	YES	same
Design docs (this directory)	Markdown	YES	this repo

The artifact set is intentionally conservative on raw data and liberal on code and aggregates. A reviewer can:

Read every design decision (markdown in this repo).
Re-run the model architecture and training pipeline (code).
Re-run the analysis pipeline against the released aggregate-level CSVs to confirm the paper’s statistical conclusions (analysis script).
Run the model on their own footage or, if LBZF signs off on its release, on the public-sample dataset.

What a reviewer cannot do:

Re-train the model on the full LBZF dataset (data is LBZF-owned).
Audit operator-level performance (deliberately anonymized).

Versioning discipline

Model versions

Field	Convention	Example
`model_name`	`yolov8n_lbzf_vN` (N = 0, 1, …) or `yolov8n_coco_pretrained` (off-shelf baseline)	`yolov8n_lbzf_v0`
`model_sha256`	SHA-256 of the `.pt` file	`a3b9...`
`trained_on`	dataset hash, see below	`lbzf_person_v0`
`trained_at`	UTC timestamp of training run	`2026-06-12T14:33:00Z`
`git_sha`	commit SHA of the training repo at training time	`f4e5...`
`framework_version`	`ultralytics==8.x.y`, `torch==2.z.w`	(pinned in `requirements.txt`)
`tensorrt_engine_jetpack`	JetPack version the `.engine` was built against (TRT engines are non-portable)	`6.0-r36.3`
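
As an illustration of how the fields above could be filled in at the end of a training run, here is a minimal sketch using only the Python standard library. The function names `sha256_file` and `model_record` are hypothetical (they are not part of the repo), and the caller is assumed to supply `trained_on`, `git_sha`, and `framework_version` from elsewhere in the pipeline.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_record(checkpoint: Path, model_name: str, trained_on: str,
                 git_sha: str, framework_version: str) -> dict:
    """Assemble the version fields from the table above for one checkpoint.

    Hypothetical helper: the field names follow the table, the function
    itself is an assumption about how the record might be produced.
    """
    return {
        "model_name": model_name,
        "model_sha256": sha256_file(checkpoint),
        "trained_on": trained_on,
        "trained_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "git_sha": git_sha,
        "framework_version": framework_version,
    }
```

Writing this record next to the `.pt` file at training time, rather than reconstructing it later, keeps the checkpoint hash and the commit SHA tied to the same run.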

Dataset versions

Each dataset version is a directory with:

images/ (frames, anonymized if released)
annotations.coco.json
dataset_split_vN.csv — frame_id, split (train/val/test), source_video, recorded_at
manifest.json — hash of every image + annotation
LICENSE — for the public sample; the private full set has an LBZF data-use agreement (Agent E)

Hash the manifest. dataset_id = sha256(manifest.json). This is what model.trained_on references.
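
The manifest-then-hash scheme can be sketched as follows. This is one possible implementation under stated assumptions: the manifest covers every file in the dataset directory except `manifest.json` itself, and it is serialized as key-sorted compact JSON so the resulting bytes do not depend on filesystem traversal order. The function names `build_manifest` and `write_manifest` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(dataset_dir: Path) -> dict:
    """Map each file's POSIX-style relative path to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(dataset_dir.rglob("*")):
        # Skip the manifest itself so re-running does not change the result.
        if path.is_file() and path.name != "manifest.json":
            rel = path.relative_to(dataset_dir).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def write_manifest(dataset_dir: Path) -> str:
    """Write manifest.json in canonical form and return its SHA-256 (dataset_id)."""
    manifest = build_manifest(dataset_dir)
    # sort_keys plus fixed separators make the serialized bytes deterministic.
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    (dataset_dir / "manifest.json").write_bytes(payload)
    return hashlib.sha256(payload).hexdigest()
```

Because the identifier is derived from per-file hashes, changing a single frame or annotation changes `dataset_id`, while re-running the tool on an unchanged directory returns the same value.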

Cycle-event log versions

Every row in cycle_events carries model_version, roi_version (per roi-calibration.md), and the git_sha of the inference code (requested from Agent B as the code_sha column), so any historical analysis can name the exact configuration that produced a given event.
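
To show what this buys an analysis, the sketch below counts events per configuration in an exported CSV. It is illustrative only: the function name `configurations` is hypothetical, the default column names follow the paragraph above, and the third column may need to be passed as `code_sha` if that is the name the backend settles on.

```python
import csv
import io
from collections import Counter

# Assumed column names; override via the `columns` argument if the schema differs.
PROVENANCE_COLUMNS = ("model_version", "roi_version", "git_sha")


def configurations(csv_text: str, columns=PROVENANCE_COLUMNS) -> Counter:
    """Count events per provenance tuple, failing loudly if a column is absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing provenance columns: {missing}")
    return Counter(tuple(row[c] for c in columns) for row in reader)
```

An analysis that expects a single configuration can assert that the returned counter has exactly one key before computing any statistic.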

Storage tiers and access control

Tier	What	Where	Who
Tier 0 — public	Code, design docs, dataset split CSV, aggregate CSVs, public-sample dataset (if approved), figures source.	GitHub `sophiamann/lbzfai-cv`; HuggingFace Hub for model + sample dataset (if approved); Zenodo / OSF for aggregate CSVs (durable DOIs for paper citation).	Anyone
Tier 1 — restricted	Full training dataset, full validation manual_observations, raw video clips used for figures.	LBZF-owned S3 bucket (or LBZF Google Drive). Access via signed-URL for reviewers under NDA, or via Mariana-approved researcher list.	LBZF + named collaborators (Sophia, Andrew, ITBA, designated reviewers)
Tier 2 — private	4TB rolling video buffer; per-frame logs; operator-identity coding keys; consent forms.	Jetson NVMe (locally); not exported.	LBZF only

Store model checkpoints (Tier 0) in GitHub LFS rather than committing large files to the main repo. The training-pipeline repo is sophiamann/lbzfai-cv (or whatever Sophia + Andrew settle on; the landing-page repo is intentionally separate per project hygiene).

Open-source posture (the LBZF coordination question)

Coordinate with Agent E. This doc proposes the following default, to be confirmed with Mariana before any public release:

Code (Tier 0): permissive license (MIT or Apache 2.0). No business reason to restrict the CV pipeline code; it’s a reference implementation of a small-factory deployment. Andrew’s “open-source ambition” framing (docs/overview/project.md) supports this.
Public-sample dataset (~100 frames, face-blurred): CC-BY-4.0 or CDLA-Permissive-2.0 with an explicit “this dataset is for research only; do not attempt to re-identify operators” clause. Requires Mariana sign-off.
Full dataset: NOT released. LBZF-owned. A reviewer-NDA path exists but is gated on Mariana.
Fine-tuned model checkpoint: released IF Stage B fine-tuning happens AND the model does not memorize identifiable operators (verify via membership inference attack on a sample subset before release). Default conservative: not released without check.
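
The checkpoint release gate could be scored roughly as below. This is a sketch, not the project’s procedure: it assumes some attack (for example the shadow-model attack mentioned under “What’s weak in this doc”) has already produced one membership score per frame, higher meaning “more likely in the training set.” The function names are hypothetical, and the 0.6 default is the threshold proposed in that section.

```python
from typing import Sequence


def attack_auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """AUC of the attack, computed as the Mann-Whitney U statistic (ties count half).

    Equals the probability that a random training frame scores higher than a
    random non-training frame; 0.5 means the attack is no better than chance.
    """
    if len(member_scores) == 0 or len(nonmember_scores) == 0:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def may_release(member_scores: Sequence[float], nonmember_scores: Sequence[float],
                max_auc: float = 0.6) -> bool:
    """Release gate: block the checkpoint when the attack AUC exceeds max_auc."""
    return attack_auc(member_scores, nonmember_scores) <= max_auc
```

With 50 frames per group the pairwise loop is only 2,500 comparisons, so no numerical library is needed. With so few frames the AUC estimate is noisy, so a result near the threshold deserves a second look rather than an automatic pass.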

Reviewer reproduction path

A reviewer who wants to reproduce the paper’s claims should be able to:

# 1. Clone the repo
git clone https://github.com/sophiamann/lbzfai-cv
cd lbzfai-cv
git checkout paper-v1   # tagged at paper submission

# 2. Set up env
pip install -r requirements.txt   # pinned versions

# 3. Pull the public sample dataset (if released)
make data-sample   # downloads from HuggingFace

# 4. Re-run the validation analysis against aggregate CSVs
python validate_v1.py --inputs data/aggregate/ --output results/

# 5. Compare to the paper's results
diff -r results/ paper_results/   # -r recurses into subdirectories; no output means all files match

This is the test: if diff is empty, the paper’s numbers are reproducible from the released artifacts. Any divergence is a paper bug.
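
For environments without a POSIX diff (or for running the same check in CI), the comparison could also be done in Python. The sketch below is an assumption about how such a check might look, not part of the repo; `differing_files` is a hypothetical name, and it performs a strict byte-for-byte comparison, matching the “diff is empty” criterion.

```python
import filecmp
from pathlib import Path


def differing_files(produced: Path, expected: Path) -> list[str]:
    """List relative paths that exist on one side only or differ byte-for-byte."""
    produced_files = {p.relative_to(produced).as_posix()
                      for p in produced.rglob("*") if p.is_file()}
    expected_files = {p.relative_to(expected).as_posix()
                      for p in expected.rglob("*") if p.is_file()}
    one_sided = sorted(produced_files ^ expected_files)
    common = sorted(produced_files & expected_files)
    # shallow=False forces a content comparison instead of trusting os.stat().
    _, mismatch, errors = filecmp.cmpfiles(produced, expected, common, shallow=False)
    return sorted(one_sided + mismatch + errors)
```

An empty return value corresponds to an empty diff; any entry names a file whose numbers need explaining.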

Alternatives considered

Alt	Why rejected
Release everything publicly	Violates LBZF confidentiality; jeopardizes operator consent; legally fraught under Colombian Law 1581/2012.
Release nothing publicly	Disqualifies the paper from competitive IEEE venues; defeats the project’s reference-implementation framing.
Release only code, no data	Acceptable but weaker — reviewers can’t independently verify any data-dependent claim. The aggregate-CSV-and-public-sample middle path is better.
Use Roboflow Public for the dataset	Same Roboflow-public-license issue as `training-and-finetuning.md` §B.1 — incompatible with LBZF confidentiality without re-licensing.
HuggingFace gated dataset (request access)	Reasonable for the full dataset if LBZF allows. Default for now: full dataset not released even gated; revisit.
Self-host artifact storage on the lbzfai.com Cloudflare worker	Cute but Cloudflare R2/Workers are not durable-DOI venues. Zenodo / OSF / HuggingFace are the right homes for paper-cited artifacts.

Open questions

OPEN[Mariana via Agent E, by 2026-08-01]: Approve the open-source posture. Specifically: (a) public release of code, (b) public release of a face-blurred 100-frame sample, (c) NDA path for the full dataset for reviewers.
OPEN[Andrew, by 2026-06-01]: License choice for the code: MIT vs Apache 2.0. (Apache 2.0 includes an explicit patent grant, which is mildly relevant if this is ever commercialized by LBZF.)
OPEN[Sophia, by 2026-06-15]: Reserve a Zenodo DOI for the aggregate-data archive before paper submission so the paper can cite a durable URL.
OPEN[ITBA, by 2026-06-15]: ITBA’s twin install may add its own dataset and metrics. Coordinate whether ITBA’s data is in the same Zenodo deposit or separate. Co-authorship implications.
OPEN[Sophia, by 2026-09-01]: Membership-inference attack on the fine-tuned model — does the model leak operator identity? If yes, do not release the checkpoint.
OPEN[Agent C]: TensorRT engine versioning is JetPack-pinned; what’s the right “minimum-supported JetPack version” claim for the paper?

Cross-bucket dependencies

Agent A (frontend): footer of the dashboard surfaces model_version + roi_version + commit SHA so any user can ask “which version produced this number.” Five-line UI change but essential for reproducibility.
Agent B (backend): cycle_events.model_version, cycle_events.roi_version, cycle_events.code_sha columns. Already requested in cycle-event-detection.md and roi-calibration.md.
Agent C (hardware): TensorRT engine is non-portable; the artifact set documents how to rebuild it on a Jetson but does not include the binary. Confirm with Agent C the rebuild recipe is robust enough that a reviewer can follow it.
Agent E (business/legal): LBZF approval; dataset license drafting; reviewer-NDA template.

What’s weak in this doc

The “public sample” is conditional on Mariana’s approval and the approval has not been requested. Without it, the paper falls back to “code-only reproducibility” which is weaker. The ask has to be teed up well before paper submission; the conversation with Mariana should happen before the Pereira trip, not after.
No formal data-management plan (DMP). Funder DMPs (NSF, EU) are not strictly required here (no funder), but IEEE Access reviewers increasingly look for one. A 1-page DMP would close the gap.
Membership-inference checks on the fine-tuned model are listed as a release gate but no procedure is specified. A concrete procedure: select 50 operator frames known to be in the training set and 50 known not to be, run a shadow-model-based membership-inference attack, and do not release the checkpoint if the attack AUC exceeds 0.6. This needs to be a step in the training pipeline.
No story for retracting an artifact once released. If LBZF revokes consent or an operator withdraws after publication, what’s the takedown procedure? HuggingFace and Zenodo both support takedowns but the process is manual; not specified.
GitHub LFS for model checkpoints has a free-tier cap of 1 GB storage and 1 GB/month bandwidth. YOLOv8n’s .pt is ~6 MB, which fits comfortably, but if Phase II’s bigger models go in the same repo, the cap bites. Plan to host model artifacts on HuggingFace Hub from day 1 to dodge the issue.
Reproducibility-via-diff assumes deterministic analysis. Bootstrap CIs depend on RNG seed; the analysis script must pin seeds and the validation doc must say which seed. A reviewer running with a different seed sees ε-level disagreement and may flag it. Specify the canonical seed.
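
The seed-pinning point in the last item can be illustrated with a small percentile bootstrap. This is a sketch under assumptions: the mean stands in for whatever statistic `validate_v1.py` actually bootstraps, `CANONICAL_SEED = 0` is a placeholder rather than the project’s chosen seed, and `bootstrap_mean_ci` is a hypothetical name.

```python
import random
import statistics
from typing import Sequence

CANONICAL_SEED = 0  # placeholder; the validation doc should state the real value


def bootstrap_mean_ci(values: Sequence[float], n_resamples: int = 2000,
                      alpha: float = 0.05,
                      seed: int = CANONICAL_SEED) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean, deterministic for a fixed seed."""
    # A private generator keeps the result independent of global random state.
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Simple index-based percentiles of the sorted resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Two runs with the same seed return identical bounds on the same interpreter, which is what a byte-level diff needs. Python only guarantees a stable `random()` sequence across versions, not stable `choices()` output, so the interpreter version should be pinned alongside the seed.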

Rollout

Date	Gate
2026-05-25	`sophiamann/lbzfai-cv` repo created with skeleton + LICENSE.
2026-06-01	`model_version` / `roi_version` columns in DB; inference code logs both per cycle.
2026-06-15	First version of `validate_v1.py` running against synthetic data.
2026-07-15	First version of `validate_v1.py` running against real Pereira data.
2026-08-01	Mariana approval ask for the public-sample dataset; Zenodo DOI reserved.
2026-09-01	Paper-v1 tag in repo; artifact set frozen.
2026-09-15	Paper submitted.

Paper alignment

Data availability section of the paper points to the Tier-0 artifacts and to the reviewer-NDA path for Tier-1.
Reproducibility checklist (some venues require it; if so, this doc is the checklist).
Methods cite the model_version / roi_version / dataset_id triple — makes the experimental setup unambiguous.
A footnote: “Code at https://github.com/sophiamann/lbzfai-cv, tagged paper-v1.”
An acknowledgement: “Data made available under the terms of an institutional research agreement with Louis Barton Zona Franca SA.”