LineWise — OEE Decision Support for Damm Canning Lines

HackDAMM 2026 entry. A decision-support layer for production planning on Damm's lines 14, 17 and 19 at El Prat. LineWise enriches Damm's existing theoretical planning (Blue Yonder) with evidence from what actually happened: it enforces hard physical constraints, flags historically-toxic SKU transitions, predicts weekly OEE, and gives planners a risk-banded view of any urgent-demand insertion.

TL;DR — the value, in three claims that survive scrutiny

  1. Three parallel data rules flag 33% of feasible urgent-demand insertion options (332 / 995 across 108 held-out test scenarios), on top of the format-compatibility safety net. The flags come from (a) historically bottom-decile OEE pairs, (b) pairs whose actual changeover is ≥1.5× the line median, and (c) cross-format insertions on multi-format lines. In 26% of scenarios no line offers a clean option, so the system tells the planner to "pick the least-bad option" and gives the reasons. (reports/urgent_demand_backtest_summary.txt)
  2. The weekly OEE forecaster predicts next-week 4-wk trailing OEE per line at R² = 0.82 — a calibrated capacity signal at the level the data supports. (models/framing_comparison.png)
  3. Annualised soft-rule value range: €49k–€99k per year across 3 lines (planner-time only; the HARD format check is excluded from this number because Blue Yonder is assumed to already enforce it — see "Why not just Blue Yonder?" below).

The daily run-level OEE model (R² ≈ 0.40, MAE ≈ 0.10) is deliberately not the headline. We tested it two independent ways on held-out data:

  • Predicted-lift backtest: 95% CI on within-day reordering lift is [-0.0016, +0.0023] — statistically rules out any meaningful daily reordering effect.
  • Realised-OEE agreement backtest: no significant correlation between optimizer agreement and realised OEE (Spearman ρ = -0.17, p = 0.09).

Both findings are documented honestly in POST_MORTEM.md and the reports/ folder. Within-day sequence reordering is not where the project's value lives, and we don't claim it does.

What's in the box

| Capability | Where it lives | Validation |
| --- | --- | --- |
| Hard line-format constraint layer (L14: 1/3+1/2 · L17: 1/3 only · L19: 1/3+1/2+2/5) | src/optimizer.py (LINE_FORMATS, line_can_run) | 100% catch; 15.8% of test-period urgent-demand options blocked |
| Worst-decile transition avoidance (data filter: 16 known-toxic pairs with mean OEE ≤ 0.38) | src/optimizer.py (worst_decile_transitions) | Surfaced in the urgent-demand tab as flagged options |
| Urgent-demand triage (rank by safety → then OEE band → then changeover) | src/simulator.py (inject_urgent_demand) | reports/urgent_demand_backtest_summary.txt |
| Weekly OEE forecaster (next-week 4-wk trailing mean per line) | src/weekly_forecast.py | R² = 0.82 on held-out weeks |
| Line-relative risk bands (per-line μ/σ, not absolute 0.70/0.80) | src/simulator.py (_classify_risk) | Makes the risk traffic-light meaningful given mean OEE of 0.40–0.53 |
| Daily-run OEE predictor (CatBoost + XGBoost + Combined ensemble) | src/catboost_model.py, src/xgboost_model.py, src/predict.py | Test MAE ≈ 0.10, R² ≈ 0.40; feeds the simulator as a tiebreaker |
| Quantile prediction intervals (q10 / q50 / q90) | src/xgboost_model.py (train_quantile) | ~71% empirical coverage |
| What-if simulator + scenario comparison | src/simulator.py | UI: Sequence Planner + Scenario Comparison tabs |
| Three-framing R² ceiling diagnostic (AS-IS / LEAKAGE / WEEKLY) | scripts/evaluate_model.py + src/framings.py | Headline plot models/framing_comparison.png |
| Data post-mortem (12-panel EDA) | scripts/eda_report.py | reports/eda_report.png, reports/eda_findings.txt |
| Held-out backtest #1: within-day sequence reordering | scripts/backtest_recommender.py | Statistical CI framing in reports/backtest_summary.txt |
| Held-out backtest #2: optimizer-agreement vs realised OEE | scripts/backtest_similarity.py | reports/backtest_similarity_summary.txt |
| Held-out backtest #3: urgent-demand triage counterfactual | scripts/backtest_urgent_demand.py | reports/urgent_demand_backtest_summary.txt |
| FastAPI bridge (5 data endpoints + Gemini chatbot stream) | api/server.py | /lines, /skus, /history, /simulate, /urgent, /chat |
| React frontend (4 routes + floating Gemini assistant) | web/ | TanStack Start + shadcn/ui + typed API client in src/lib/api/ |
| Gemini-backed assistant (system prompt grounded in this repo) | src/chatbot.py | Used by POST /chat; UI widget in web/src/components/linewise/Chatbot.tsx |
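
To illustrate the "line-relative risk bands" row above, here is a minimal sketch of banding by per-line μ/σ. The line means are the ones quoted in this README (0.40 / 0.53 / 0.47); their assignment to lines, the shared σ of 0.15, the 0.5σ cutoff and the function bodies are assumptions for illustration, not the actual `_classify_risk` implementation.

```python
# Per-line baselines. Means are the README's line means; which line gets
# which mean, and the sigma value, are placeholders rather than repo values.
LINE_BASELINES = {
    14: {"mu": 0.40, "sigma": 0.15},
    17: {"mu": 0.53, "sigma": 0.15},
    19: {"mu": 0.47, "sigma": 0.15},
}

Z_CUTOFF = 0.5  # hypothetical band width, in standard deviations


def classify_risk(line: int, predicted_oee: float) -> str:
    """Band a prediction by its distance from the line's own mean."""
    base = LINE_BASELINES[line]
    z = (predicted_oee - base["mu"]) / base["sigma"]
    if z >= Z_CUTOFF:
        return "low"
    if z <= -Z_CUTOFF:
        return "high"
    return "medium"


def classify_absolute(predicted_oee: float) -> str:
    """Absolute 0.70 / 0.80 thresholds, shown only for comparison."""
    if predicted_oee >= 0.80:
        return "low"
    if predicted_oee >= 0.70:
        return "medium"
    return "high"


# The same prediction is banded differently per line, while the absolute
# thresholds give the same answer for every line.
for line in (14, 17, 19):
    print(line, classify_risk(line, 0.55), classify_absolute(0.55))
```

With absolute thresholds every prediction in the 0.40–0.53 range lands in the same band, which is why the relative version carries more information for a planner.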

Repository layout

HackDAMM2026/
├── README.md                          ← this file
├── POST_MORTEM.md                     ← ceiling story + documented negative results
├── requirements.txt
├── LICENSE
│
├── data/
│   ├── raw/                           ← original Damm CSV/Excel exports
│   ├── parsed/                        ← normalised per-source DataFrames
│   └── processed/                     ← runs_df.csv, changeover_matrix.csv, product_meta.csv
│
├── src/                               ← all library code (importable as `src.X`)
│   ├── pipeline.py                    ← raw → parsed → processed (runs_df builder)
│   ├── features.py                    ← 37 leakage-free features + LabelEncoders
│   ├── framings.py                    ← AS-IS / LEAKAGE / WEEKLY problem framings
│   ├── xgboost_model.py               ← XGB train/tune/predict + quantile models
│   ├── catboost_model.py              ← CatBoost train/tune/predict
│   ├── predict.py                     ← Combined model + predict_oee_any dispatcher
│   ├── weekly_forecast.py             ← weekly panel + forecaster (R² 0.82)
│   ├── optimizer.py                   ← LINE_FORMATS + worst_pair + OR-Tools TSP
│   ├── simulator.py                   ← what-if + urgent-demand + worst-pair flagging
│   └── chatbot.py                     ← Gemini wrapper used by /chat
│
├── scripts/                           ← entry-point scripts (runnable from anywhere)
│   ├── evaluate_model.py              ← train all 3 framings × 3 model variants
│   ├── backtest_recommender.py        ← within-day reorder predicted-lift backtest
│   ├── backtest_similarity.py         ← optimizer-agreement vs realised OEE
│   ├── backtest_urgent_demand.py      ← rule-layer counterfactual on urgent demand
│   └── eda_report.py                  ← 12-panel data post-mortem
│
├── api/
│   ├── __init__.py
│   └── server.py                      ← FastAPI bridge (5 data endpoints + /chat SSE)
│
├── web/                               ← React (TanStack Start) frontend
│   ├── src/lib/api/                   ← typed fetch clients + reference store
│   ├── src/components/linewise/       ← LineWise UI components (incl. Chatbot)
│   └── src/routes/                    ← /, /forensic, /optimizado, /urgente
│
├── models/                            ← trained artefacts (.pkl, .json, .png)
├── reports/                           ← backtest + EDA outputs
└── notebooks/                         ← exploratory notebooks

Setup (one-time)

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Running the project

All commands assume you are at the repo root with .venv activated.

One-shot demo prep (recommended before showing a judge)

# Trains models + runs all three backtests. ~3-5 min total.
make demo
# or, if make isn't available:
python scripts/evaluate_model.py --no-shap
python scripts/backtest_recommender.py
python scripts/backtest_similarity.py
python scripts/backtest_urgent_demand.py

Then open the app in two terminals:

# Terminal 1 — FastAPI (models + Gemini chatbot bridge) on :8000
make api

# Terminal 2 — React frontend on :8080
make ui

Individual commands

# Train + diagnose all 3 framings × 3 model variants (~2-3 min)
python scripts/evaluate_model.py
python scripts/evaluate_model.py --no-shap         # skip SHAP for speed
python scripts/evaluate_model.py --tune --trials 60  # re-tune via Optuna (~15 min)

# The three held-out backtests
python scripts/backtest_recommender.py     # within-day reorder (predicted lift)
python scripts/backtest_similarity.py      # optimizer agreement vs realised OEE
python scripts/backtest_urgent_demand.py   # rule-layer counterfactual (headline)

# 12-panel data post-mortem
python scripts/eda_report.py

React frontend + FastAPI bridge

The React app (web/) is now the only UI; it talks to the Python models through api/server.py. The API loads the Combined (XGBoost + CatBoost) ensemble by default and falls back to CatBoost, then XGBoost.

| URL | What it shows |
| --- | --- |
| / | Home: today's Gantt across L14/L17/L19 + day highlights |
| /forensic | Historical OEE per line + worst-decile transition heatmap |
| /optimizado | Plan A (alternated baseline) vs Plan B (brand-grouped) with real /simulate predictions |
| /urgente | Insert urgent SKUs, see per-rule evidence flags from /urgent |
| floating bot button | Ask LineWise: Gemini-powered assistant streaming from /chat |

The frontend resolves the API URL from VITE_API_URL — copy web/.env.example to .env.local if you change the port.

Enable the Gemini chatbot

echo "GOOGLE_API_KEY=your-gemini-api-key" >> .env
make api    # restart so the API picks up the env var

Then click the bot button in the bottom-right of the React app. The dot is green when the assistant is reachable and amber when offline (e.g. key missing) — the offline reason is shown inline.

Docker (one command, no local Python or Node needed)

The whole stack — Python ML backend + React frontend — runs in two containers wired together by docker-compose.yml.

# Optional: enable the chatbot
echo "GOOGLE_API_KEY=your-gemini-key" >> .env   # append, so an existing .env is not overwritten

# Build + launch both services in the background
make docker-up           # or:  docker compose up --build -d

# Open the app
open http://localhost:8080

| Service | Container | Port | What it runs |
| --- | --- | --- | --- |
| api | linewise-api | 8000 | Python 3.11 + uvicorn + the Combined ML ensemble + the Gemini bridge |
| web | linewise-web | 8080 | Node 22 + Vite dev server serving the React UI |

The first build downloads ~1.2 GB of Python ML wheels (catboost / xgboost / shap / ortools) and ~250 MB of npm modules; subsequent builds reuse the layer cache and complete in seconds.

make docker-logs         # tail combined logs
make docker-down         # stop and remove containers
make docker-rebuild      # force a fresh build (no cache)

Customise ports / API URL by exporting env vars before bringing the stack up:

API_PORT=9000 WEB_PORT=3000 VITE_API_URL=http://localhost:9000 \
  docker compose up --build

The four routes of the React app

  1. / — Home — line KPIs and today's Gantt across L14/L17/L19 with brand colours and risk bands.
  2. /forensic — Cockpit Forensic — historical OEE per line, inefficient-transition heatmap, drill-down per shift.
  3. /optimizado — Plan Optimizado — two candidate weekly sequences run through the real /simulate endpoint, with KPI deltas, drag-to-reorder, and accept-into-store.
  4. /urgente — Demanda Urgente — drop urgent SKUs into a queue, replan against the live plan via /urgent. Format-incompatible lines filtered; per-rule evidence flags (low-OEE, friction, cross-format) visible in the breakdown.

A floating bot button is available on every route. It opens the Asistente LineWise, a Gemini-streamed chat scoped to the LineWise tool. The assistant accepts a line context (14/17/19) so answers are line-aware. It is intentionally narrow: its system prompt refuses off-topic questions.

The legacy Streamlit app (app/app.py) has been removed in favour of the React frontend + FastAPI bridge.

Chatbot system prompt

The Gemini assistant is grounded by the system prompt in src/chatbot.py. The model is gemini-2.5-flash. Both env-var names work for the key: GOOGLE_API_KEY or GEMINI_API_KEY. The API auto-loads .env at the repo root, so a single line in .env is enough — no extra dependency.

If the key is missing, the chat button stays visible with an amber dot and the side panel shows a clear offline reason — the rest of the app keeps working.

Why not just Blue Yonder?

Blue Yonder is Damm's existing theoretical planner. It almost certainly already enforces format compatibility (the HARD rule in our urgent-demand triage). The HARD rule in LineWise is belt-and-suspenders — its purpose is to be a safety net, not a new feature.
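
A minimal sketch of what such a hard format check might look like. The format sets come from this README (L14: 1/3+1/2, L17: 1/3 only, L19: 1/3+1/2+2/5); the dictionary layout, the `line_can_run` signature and the `feasible_lines` helper are assumptions for illustration, not the code in src/optimizer.py.

```python
from typing import Dict, List, Set

# Physical format capability per line, as stated in this README.
LINE_FORMATS: Dict[int, Set[str]] = {
    14: {"1/3", "1/2"},
    17: {"1/3"},
    19: {"1/3", "1/2", "2/5"},
}


def line_can_run(line: int, sku_format: str) -> bool:
    """Hard rule: a line may only run formats it is physically set up for."""
    return sku_format in LINE_FORMATS.get(line, set())


def feasible_lines(sku_format: str) -> List[int]:
    """Hypothetical helper: lines that survive the hard check for a format."""
    return sorted(line for line in LINE_FORMATS if line_can_run(line, sku_format))


print(feasible_lines("1/2"))  # a 50 cl SKU never reaches L17
```

Because the rule is a lookup rather than something learned from history, a stray historical row for an impossible (line, format) pair cannot weaken it.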

LineWise's new value is the empirical layer that Blue Yonder cannot infer because it plans from theoretical changeover times, not from how those changeovers actually behaved on the shop floor:

| Rule | What Blue Yonder sees | What LineWise adds |
| --- | --- | --- |
| Soft 1: low-OEE pair | This transition takes 20 min of theoretical changeover, OEE assumed nominal. | This specific (line, from→to) pair runs at mean OEE 0.34 across 3 historical observations; the theoretical OEE never materialises. |
| Soft 2: high-friction pair | Theoretical changeover = 20 min. | Median actual changeover on this pair has been 50 min, 2.5× the theoretical figure. |
| Soft 3: cross-format pair | "L19 can run all three formats, no flag." | "But the 1/3 → 1/2 transition on L19 historically loses 1.8 OEE points to setup overhead; flag for review." |

On the held-out test period, the soft-rule layer fires 332 times across 108 urgent-demand scenarios (33% of feasible options), at a flag rate of ~3 per scenario. In 26% of scenarios there's no clean option across all three lines, meaning the planner is told "every option here trips at least one historical-evidence flag — pick the least-bad."

That information is impossible to derive from theoretical changeover matrices alone. That's the value layer.
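
The three soft rules can be sketched as independent predicates over one candidate insertion. The 1.5× ratio is the threshold quoted in the TL;DR; the `Insertion` record, the lookup structures and the flag names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

FRICTION_RATIO = 1.5  # "actual changeover >= 1.5x line median", from this README


@dataclass(frozen=True)
class Insertion:
    """Hypothetical record for one candidate urgent-demand insertion."""
    line: int
    from_sku: str
    to_sku: str
    from_format: str
    to_format: str
    pair_changeover_min: Optional[float]  # historical actual changeover, if observed


def soft_flags(
    ins: Insertion,
    worst_pairs: Set[Tuple[int, str, str]],
    line_median_min: Dict[int, float],
) -> List[str]:
    """Return the evidence flags that fire for this insertion (possibly none)."""
    flags: List[str] = []
    # Rule 1: the pair sits in the historically bottom-decile OEE set.
    if (ins.line, ins.from_sku, ins.to_sku) in worst_pairs:
        flags.append("low-OEE")
    # Rule 2: observed changeover is well above the line's median.
    median = line_median_min.get(ins.line)
    if ins.pair_changeover_min is not None and median:
        if ins.pair_changeover_min >= FRICTION_RATIO * median:
            flags.append("friction")
    # Rule 3: the insertion forces a format change on the line.
    if ins.from_format != ins.to_format:
        flags.append("cross-format")
    return flags


worst = {(19, "SKU_A", "SKU_B")}
medians = {19: 20.0}
risky = Insertion(19, "SKU_A", "SKU_B", "1/3", "1/2", 50.0)
clean = Insertion(19, "SKU_C", "SKU_D", "1/3", "1/3", 22.0)
print(soft_flags(risky, worst, medians), soft_flags(clean, worst, medians))
```

Keeping the rules separate means each flag can be shown to the planner with its own piece of evidence.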

Key design choices (and why they matter to a judge)

  1. Line-format constraints are hard physical rules, not data-derived. L17 cannot run a 50 cl SKU even if a stray row exists in the history. Encoded in LINE_FORMATS (src/optimizer.py). Enforced in both recommend_line and inject_urgent_demand.
  2. Real product metadata. 183 SKUs in data/processed/product_meta.csv with brand/format/color — features like same_brand, same_format, color_change carry real domain signal, not synthetic stubs.
  3. No leakage in features. Every rolling/serial feature uses shift(1) on a chronological sort within (line, SKU) groups. The LEAKAGE framing in evaluate_model.py deliberately shows what R² would look like if we did leak (~0.99 with availability × performance ≈ OEE), as a ceiling diagnostic.
  4. Three framings, not one. Daily OEE has an irreducible ~0.40 R² ceiling from within-(line, SKU) noise. We make this explicit rather than overclaiming.
  5. Risk bands are line-relative, not absolute. With line means at 0.40 / 0.53 / 0.47, a threshold like "<0.70 = high risk" flags every single prediction. Bands are anchored to per-line μ/σ instead. See LINE_BASELINES in src/simulator.py.
  6. Worst-decile transitions are a data rule, not a model output. 16 historically-bad SKU pairs (mean OEE ≤ 0.38 with n ≥ 2 observations) are pre-computed from the training period and flagged at urgent-demand time. Defensible regardless of model accuracy.
  7. Optimiser uses real costs. build_cost_matrix mixes 0.7 × historical-actual-changeover + 0.3 × (1 − historical-OEE) per transition. Theoretical times are only the fallback, padded by 20%.
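
Design choice 3 (no leakage) is easiest to see on a toy frame. The sketch below applies `shift(1)` before the rolling mean inside each (line, SKU) group, so a run never sees its own OEE. Column names and values are made up for illustration; the real feature set lives in src/features.py.

```python
import pandas as pd

# Toy run history; column names and values are illustrative only.
runs = pd.DataFrame(
    {
        "line": [14, 14, 14, 14, 17, 17],
        "sku": ["A", "A", "A", "A", "B", "B"],
        "date": pd.to_datetime(
            ["2025-01-06", "2025-01-07", "2025-01-08",
             "2025-01-09", "2025-01-06", "2025-01-07"]
        ),
        "oee": [0.40, 0.50, 0.60, 0.30, 0.55, 0.45],
    }
)

# Chronological sort within each (line, SKU) group comes first.
runs = runs.sort_values(["line", "sku", "date"]).reset_index(drop=True)
grouped = runs.groupby(["line", "sku"])["oee"]

# Leakage-free: shift(1) drops the current run before any aggregation.
runs["prev_oee"] = grouped.shift(1)
runs["prev_oee_mean3"] = grouped.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)

# Leaky counterpart, for contrast only: the window includes the target itself.
runs["leaky_mean3"] = grouped.transform(
    lambda s: s.rolling(3, min_periods=1).mean()
)

print(runs)
```

The first run of every group has no history, so its lagged features are NaN; the model or an imputation step has to handle that explicitly.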

Headline numbers (last full run)

| Metric | Value | Source |
| --- | --- | --- |
| HARD-rule format blocks (test period) | 187 / 1,182 (15.8%) | reports/urgent_demand_backtest_summary.txt |
| SOFT rule 1: low-OEE-pair firings | 2 / 995 feasible (0.2%) | ↑ same |
| SOFT rule 2: high-friction firings | 10 / 995 feasible (1.0%) | ↑ same |
| SOFT rule 3: cross-format firings | 321 / 995 feasible (32.3%) | ↑ same |
| SOFT rule (any): total firings | 332 (33.4% of feasible) | ↑ same |
| Scenarios with NO clean option | 28 / 108 (25.9%) | ↑ same |
| Annualised soft-rule value (3 lines, BY-excluded) | €49k – €99k / year | ↑ same |
| Weekly OEE forecaster, test R² | 0.823 | models/framing_comparison.png |
| Daily OEE model, Combined ensemble test MAE / R² | 0.099 / 0.398 | models/framing_metrics.json |
| Daily within-day reorder lift, 95% CI | [-0.0016, +0.0023] | reports/backtest_summary.txt |
| Optimizer-agreement ↔ realised-OEE Spearman ρ | -0.17 (p = 0.09, n.s.) | reports/backtest_similarity_summary.txt |

Honest read of the rule distribution: Rule 3 (cross-format) carries 97% of the soft-flag firings; rule 2 (high-friction) 3%; rule 1 (low-OEE) 0.6%. The three rules are complementary, not equal-weight. Rule 3 is the dominant signal because cross-format setups are common on multi-format lines and are structurally costly. Rules 1 and 2 catch edge cases the format predicate misses: same-format pairs that nevertheless run at bottom-decile OEE, and same-format pairs with documented multi-x changeover overruns. Without them, these would slip past the predicate.
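
The percentages above can be re-derived from the raw counts. A quick arithmetic check, using only numbers quoted in this README (variable names are ours):

```python
# Raw counts as quoted in the headline numbers.
total_options = 1182
hard_blocked = 187
feasible = total_options - hard_blocked  # options that pass the format check

rule_firings = {"low-OEE": 2, "friction": 10, "cross-format": 321}
any_soft_flag = 332
scenarios = 108
scenarios_without_clean_option = 28


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print("feasible options:", feasible)
print("hard-blocked %:", pct(hard_blocked, total_options))
print("any soft flag %:", pct(any_soft_flag, feasible))
for name, count in rule_firings.items():
    print(f"{name} % of feasible:", pct(count, feasible))
print("no-clean-option %:", pct(scenarios_without_clean_option, scenarios))
print("flags per scenario:", round(any_soft_flag / scenarios, 2))
```

The per-rule counts sum to 333 rather than 332, which is consistent with one option tripping two rules at once.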

Demo scenario (90-second walkthrough)

Setup. In two terminals: make api and make ui. Open http://localhost:8080.

Step 1 — /forensic. Show the per-line OEE history and the worst-decile transition heatmap. "This is what Damm has today — descriptive."

Step 2 — /urgente. Add a 1/2 SKU urgent demand and click Replanificar con Damm.

  • The system shows L14 and L19 as feasible; L17 is filtered out because it can't run 1/2.
  • The Damm replan panel shows per-rule evidence flags from the backend (low-OEE / friction / cross-format).
  • The recommendation explains why in plain language.
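
The ordering behind this step (safety first, then OEE band, then changeover, as listed under "What's in the box") can be sketched as a sort key. The option fields, the use of flag count as the safety measure and the `rank_options` name are assumptions for illustration, not the behaviour of `inject_urgent_demand`.

```python
from typing import Dict, List

# Lower value ranks first.
BAND_ORDER = {"low": 0, "medium": 1, "high": 2}


def rank_options(options: List[Dict]) -> List[Dict]:
    """Drop hard-blocked lines, then sort by flags, risk band, changeover."""
    feasible = [o for o in options if not o["blocked"]]
    return sorted(
        feasible,
        key=lambda o: (
            len(o["flags"]),             # safety: fewer evidence flags first
            BAND_ORDER[o["risk_band"]],  # then the predicted-OEE risk band
            o["changeover_min"],         # then the shorter changeover
        ),
    )


# Hypothetical options for an urgent 1/2 SKU; L17 is blocked by the format rule.
options = [
    {"line": 14, "blocked": False, "flags": ["cross-format"],
     "risk_band": "low", "changeover_min": 25},
    {"line": 17, "blocked": True, "flags": [],
     "risk_band": "low", "changeover_min": 10},
    {"line": 19, "blocked": False, "flags": [],
     "risk_band": "medium", "changeover_min": 40},
]

ranked = rank_options(options)
print([o["line"] for o in ranked])
```

Putting the flag count first means a clean option always outranks a flagged one, even when the flagged one has a better predicted band or a shorter changeover.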

Step 3 — /optimizado. Optimize the week and walk the two Gantts side by side. KPI deltas are real /simulate predictions, not pre-baked numbers. Drag to reorder and see the comparative KPIs update.
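
The sequences in this step are optimised against a blended transition cost (0.7 × historical actual changeover + 0.3 × (1 − historical OEE), with theoretical times padded by 20% as the fallback). A minimal sketch of that blend follows. This README does not say how minutes and OEE are put on a common scale, so the division by a 120-minute cap, the default OEE and the function signature are assumptions, not the real `build_cost_matrix`.

```python
from typing import Optional

W_CHANGEOVER = 0.7         # weight on changeover time (from this README)
W_OEE_LOSS = 0.3           # weight on (1 - historical OEE) (from this README)
THEORETICAL_PADDING = 1.2  # theoretical times padded by 20% when used as fallback


def transition_cost(
    actual_min: Optional[float],
    theoretical_min: float,
    hist_oee: Optional[float],
    max_min: float = 120.0,     # assumed scaling cap, not a repo value
    default_oee: float = 0.47,  # assumed stand-in when a pair has no history
) -> float:
    """Blend changeover time and OEE loss into one unitless cost."""
    if actual_min is not None:
        minutes = actual_min
    else:
        minutes = theoretical_min * THEORETICAL_PADDING
    changeover_term = min(minutes / max_min, 1.0)  # scale minutes into [0, 1]
    oee = hist_oee if hist_oee is not None else default_oee
    return W_CHANGEOVER * changeover_term + W_OEE_LOSS * (1.0 - oee)


observed = transition_cost(actual_min=50.0, theoretical_min=20.0, hist_oee=0.34)
fallback = transition_cost(actual_min=None, theoretical_min=20.0, hist_oee=None)
print(round(observed, 3), round(fallback, 3))
```

In this sketch a pair with a 50-minute observed changeover and 0.34 OEE costs more than the same pair priced from its padded 20-minute theoretical time, which is the behaviour the blend is meant to produce.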

Step 4 — Ask LineWise (bot button, bottom-right). Open the assistant. Pick Tren 17. Ask: "Why isn't Line 17 showing up for my 50 cl SKU?" — the answer streams from Gemini, grounded in the system prompt that knows the hard format rules.

Known limitations + next steps

See POST_MORTEM.md for the full ceiling story. Headline limitations:

  • ~60% of OEE variance lives WITHIN (line, SKU) cells, driven by operator skill, material lot quality, and micro-stops — none of which are in the dataset. Daily R² > ~0.42 is unreachable without those.
  • No € ROI claim for daily reordering — the lift is too small relative to the model's MAE (0.10) to convert to currency meaningfully without Damm's internal economic data (hl/OEE-point sensitivity × €/hl margin).
  • Quantile coverage 71% vs 80% target — could be lifted with conformal recalibration.
  • No external data signal yet (weather, holidays). The brief explicitly says external data is encouraged but not required; we focused on getting maximum signal from the operational history first.

Next steps if Damm wants to deploy:

  1. Integrate Damm's actual planned_changeover_min field (we currently substitute actual) — removes the documented train/inference distribution shift.
  2. Integrate operator and material-lot IDs — lifts the daily R² ceiling.
  3. Calibrate quantile intervals via conformal prediction — reaches 80% coverage.
  4. Wire the urgent-demand tool into Blue Yonder via a REST hook — present blocked / flagged options as advisory annotations on theoretical schedules.
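
Next step 3 could be approached with conformalized quantile regression: measure on a calibration set how far the truth falls outside the q10/q90 band, then widen the band by the appropriate quantile of those misses. A stdlib-only sketch under that assumption follows; the data is invented and none of this is existing repo code.

```python
import math
from typing import List, Sequence


def cqr_margin(
    y: Sequence[float],
    q_lo: Sequence[float],
    q_hi: Sequence[float],
    alpha: float = 0.20,  # 1 - alpha = 80% target coverage
) -> float:
    """Margin to add to both ends of the interval, from calibration data."""
    # Conformity score: positive when y falls outside [q_lo, q_hi].
    scores: List[float] = sorted(
        max(lo - yi, yi - hi) for yi, lo, hi in zip(y, q_lo, q_hi)
    )
    n = len(scores)
    # Finite-sample rank; clamped to n as a simplification for tiny samples.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1]


def coverage(y: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> float:
    """Share of observations that fall inside their interval."""
    return sum(l <= yi <= h for yi, l, h in zip(y, lo, hi)) / len(y)


# Invented calibration set: a fixed q10/q90 band that is too narrow.
y_cal = [0.25, 0.35, 0.40, 0.45, 0.50, 0.55, 0.62, 0.65, 0.70, 0.28]
lo_cal = [0.30] * len(y_cal)
hi_cal = [0.60] * len(y_cal)

margin = cqr_margin(y_cal, lo_cal, hi_cal)
lo_adj = [l - margin for l in lo_cal]
hi_adj = [h + margin for h in hi_cal]

print("coverage before:", coverage(y_cal, lo_cal, hi_cal))
print("coverage after:", coverage(y_cal, lo_adj, hi_adj))
```

The calibration runs must be held out from quantile-model training, and for time-ordered production data the usual exchangeability assumption only holds approximately.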

🍺 Damm × Engineering HUB Hackathon 2026
