Inspiration

In 2023 a string of near-miss runway incursions put a number most travelers never think about on the front page: the United States is thousands of air-traffic controllers short, and has been for years. The FAA, the GAO (GAO-26-107320), and the National Academies keep issuing the same warning, yet "hire more controllers" never seems to translate into urgency. We wanted to understand why doing nothing feels free when it clearly isn't. That became TowerGuard: a tool that puts a dollar figure and a safety number on every year we wait.

What it does

TowerGuard is a "Cost of Doing Nothing" simulator for the controller staffing crisis. It projects the Certified Professional Controller (CPC) workforce across five policy scenarios through FY2036 and answers the three questions a policymaker actually asks:

  • How bad does it get? On the do-nothing path, CPCs collapse from ~11,000 to ~2,412 (−78%) and staffing stays below the 85% safety floor for the whole horizon.
  • What does waiting cost? On the order of \$365B in controller-attributable delay and overtime versus the current plan. The cost is front-loaded and largely irreversible, because the certification pipeline takes 2–3 years.
  • What about safety? Relative fatigue-error risk climbs to ~3.6× the rested baseline — the cost money can't buy back.

A second, live pipeline then validates the model against real traffic.

How we built it

TowerGuard is two decoupled halves that close a loop: a simulator that projects the future, and a live system that checks whether reality is unfolding the way the model assumed.

The core is a discrete-time system-dynamics stock-flow model. The crux is the certification lag: hiring grows developmentals $D$, but CPCs only grow after a delay $\tau$:

$$\mathrm{CPC}_{t+1} = \mathrm{CPC}_t + p\,D_{t-\tau} - a_t\,\mathrm{CPC}_t$$

where $\tau \approx 2\text{–}3$ years, $p$ is the OJT pass rate, and $a_t$ is attrition. The do-nothing collapse is driven by a reinforcing burnout loop — understaffing forces overtime, overtime raises attrition, attrition deepens the gap:

$$a_t = a_0\left(1 + \beta\,\max\!\bigl(0,\;1 - s_t\bigr)\right), \qquad s_t = \frac{\mathrm{CPC}_t}{\mathrm{CPC}^{\text{target}}}$$
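The two equations above can be sketched as a small pure-stdlib loop. This is a minimal illustration, not the project's actual model: the function names, the default values for `tau`, `p`, `a0`, and `beta`, and the choice to treat cohorts hired before the simulation starts as zero are all assumptions made for the example.

```python
def burnout_attrition(cpc, target, a0, beta):
    """a_t = a0 * (1 + beta * max(0, 1 - s_t)), with s_t = CPC_t / CPC_target."""
    staffing_ratio = cpc / target
    return a0 * (1.0 + beta * max(0.0, 1.0 - staffing_ratio))


def simulate_cpc(cpc0, hires, target, tau=3, p=0.8, a0=0.06, beta=2.0, years=10):
    """Project CPC headcount year by year; returns a list of length years + 1.

    hires[t] is the developmental cohort D_t hired in year t. Cohorts outside
    the supplied list (including pre-simulation hires) count as zero, which
    is a simplifying assumption of this sketch.
    """
    cpc = [float(cpc0)]
    for t in range(years):
        lagged = t - tau
        cohort = hires[lagged] if 0 <= lagged < len(hires) else 0.0  # D_{t-tau}
        a_t = burnout_attrition(cpc[t], target, a0, beta)
        # CPC_{t+1} = CPC_t + p * D_{t-tau} - a_t * CPC_t
        cpc.append(cpc[t] + p * cohort - a_t * cpc[t])
    return cpc
```

With `tau=3`, a cohort hired in year 0 first reaches the CPC stock in the update for year 4, so changing `hires[0]` leaves the first three projected years untouched. That is the certification lag at work.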

A Monte Carlo wrapper samples uncertain parameters to produce $P_{10}\text{–}P_{90}$ confidence bands instead of false-precision points. The cost of delay compares starting a plan in year $y$ against starting in 2026:

$$\Delta C(y) = \sum_{t}\left(C_t^{\,\text{start }y} - C_t^{\,\text{start }2026}\right)$$
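As a rough sketch of how the cost-of-delay sum and the $P_{10}\text{–}P_{90}$ bands could be computed with the standard library: the helper names (`cost_of_delay`, `p10_p90`, `monte_carlo_band`) and the `run_model` / `draw_params` callables are hypothetical, and the project's own sampling scheme may differ.

```python
import random
import statistics


def cost_of_delay(costs_start_y, costs_start_2026):
    """Delta C(y): summed yearly cost gap between a delayed and a 2026 start."""
    if len(costs_start_y) != len(costs_start_2026):
        raise ValueError("cost series must cover the same years")
    return sum(late - early for late, early in zip(costs_start_y, costs_start_2026))


def p10_p90(samples):
    """Return the (P10, P90) band of a list of Monte Carlo draws."""
    deciles = statistics.quantiles(samples, n=10)  # nine cut points, P10..P90
    return deciles[0], deciles[-1]


def monte_carlo_band(run_model, draw_params, n_runs=500, seed=0):
    """Run `run_model` on sampled parameters and summarize the spread.

    `draw_params(rng)` returns one parameter sample; `run_model(params)`
    returns one scalar outcome (for example, a Delta C value).
    """
    rng = random.Random(seed)  # seeded so the band is reproducible
    outcomes = [run_model(draw_params(rng)) for _ in range(n_runs)]
    return p10_p90(outcomes)
```

Reporting the pair returned by `monte_carlo_band` rather than a single run is what turns a point estimate into a band.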

The live half streams ADS-B traffic into three deterministic risk modules (traffic density, conflict geometry, workload index) over Redis. When risk escalates, Claude (claude-opus-4-8) phrases the advisory and the shift briefing — but never decides escalation (we call this Option B). The deterministic engine owns every decision; the LLM only turns structured evidence into readable text, with a template fallback. A human controller confirms every alert.
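The division of labor described above might look like the following sketch. Everything here is illustrative: the `Evidence` fields, the `ESCALATION_THRESHOLD` value, and the "any module crosses the cutoff" rule are assumptions, and `phrase` stands in for whatever wrapper calls the language model. No real SDK call is shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ESCALATION_THRESHOLD = 0.7  # hypothetical cutoff on a 0-1 risk score


@dataclass(frozen=True)
class Evidence:
    sector: str
    density: float   # each score is assumed to be normalized to 0-1
    conflict: float
    workload: float


def should_escalate(ev: Evidence) -> bool:
    """Deterministic rule: escalate when any module crosses the threshold."""
    return max(ev.density, ev.conflict, ev.workload) >= ESCALATION_THRESHOLD


def template_advisory(ev: Evidence) -> str:
    """Fallback wording used when no LLM is available or the call fails."""
    return (f"Sector {ev.sector}: elevated risk "
            f"(density {ev.density:.2f}, conflict {ev.conflict:.2f}, "
            f"workload {ev.workload:.2f}). Controller confirmation required.")


def build_advisory(ev: Evidence,
                   phrase: Optional[Callable[[Evidence], str]] = None) -> Optional[str]:
    """Return advisory text, or None when the engine did not escalate."""
    if not should_escalate(ev):  # the decision never consults the LLM
        return None
    if phrase is not None:
        try:
            text = phrase(ev)
            if text and text.strip():
                return text.strip()
        except Exception:
            pass  # any LLM failure falls through to the template
    return template_advisory(ev)
```

Because `phrase` is only called after `should_escalate` has returned `True`, swapping or removing the language model changes the wording of an advisory but not whether one is raised.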

Stack: Python (pure-stdlib SD model), FastAPI + SSE, Redis, the Anthropic SDK, OpenSky, pypdf; a React / Vite / Tailwind / shadcn dashboard built with Lovable; 304 tests guard two frozen contracts that keep the halves decoupled.

Validation — how we know it isn't just a pretty chart

  • Out-of-sample backtest (FY2020–2025): the model under-predicts CPC with a MAPE of 7.91%, breaching our own 5% threshold. We show this honestly — the drift monitor is catching the COVID structural break, and a model that flags its own failure mode is more credible than a fake-perfect one.
  • Extreme-condition tests: a one-year hiring flood does not raise next-year CPC; the model respects the certification lag.
  • Reproduction with honesty tiers: every check is labeled IN-SAMPLE vs. INDEPENDENT, so a tight fit is never dressed up as predictive skill.
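For reference, a backtest check of the kind described in the first bullet can be sketched as below. The function names and the PASS / BREACH labels are assumptions for illustration; only the 5% threshold comes from the text.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty series of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def backtest_status(actual, predicted, threshold_pct=5.0):
    """Label the backtest PASS or BREACH against the MAPE threshold."""
    error = mape(actual, predicted)
    label = "BREACH" if error > threshold_pct else "PASS"
    return label, error
```

Returning the error alongside the label lets a dashboard show the breach and its size together instead of hiding the number behind a status flag.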

Responsible AI

  • The AI phrases; it never decides. Escalation is deterministic and auditable; a human confirms every advisory.
  • Fail-safe: missing data shows DEGRADED, never a fake "LOW".
  • Honest about its limits: confidence bands, an assumption ledger with per-parameter confidence, explicit "model-void" conditions, and a drift monitor that self-flags when the model is stale.
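The fail-safe rule in the second bullet can be illustrated with a small labeling function. This is a sketch under stated assumptions: the `risk_label` name, the 30-second staleness limit, and the LOW / MODERATE / HIGH cutoffs are hypothetical, and only the DEGRADED and LOW labels appear in the text above.

```python
import math
from typing import Optional


def risk_label(score: Optional[float], age_s: float, max_age_s: float = 30.0) -> str:
    """Map a 0-1 risk score to a label, refusing to guess on bad input.

    Missing, NaN, or stale data yields DEGRADED rather than a reassuring LOW.
    """
    if score is None or math.isnan(score) or age_s > max_age_s:
        return "DEGRADED"
    if score >= 0.7:    # hypothetical cutoff
        return "HIGH"
    if score >= 0.4:    # hypothetical cutoff
        return "MODERATE"
    return "LOW"
```

The point of the ordering is that the data-quality check runs before any threshold comparison, so a missing reading can never fall through to the lowest bucket.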

Challenges we ran into

  • Keeping the two halves decoupled. We froze a JSON contract and a Redis contract so each half could be built independently — guarded by 304 tests.
  • Calibrating with thin public data. Controller counts mix CPCs, developmentals, and trainees under inconsistent definitions; we solved hiring to historical endpoints and pinned the CPC split with OJT washout. The burnout-loop coefficients remain illustrative, so we report collapse depth and cost as order-of-magnitude, not point predictions.
  • Wiring a live LLM without letting it "drive." Choosing augmentation over automation took several iterations plus a full template-fallback path.
  • Streaming a local engine to a cloud dashboard. Free tunnels buffer SSE, so the always-on dashboard replays a real captured session (real Claude output, re-sequenced for timing) while the live engine runs in our demo video as proof the agent is real.

What we learned

  • The crisis isn't "knowing what to do" — it's that the fix takes years to land, so delay compounds before it can be reversed. That reframing is the whole project.
  • Honest uncertainty (bands, a self-breaching backtest, confidence tiers) persuades decision-makers more than confident point estimates.
  • The right division of labor for safety-critical AI: let deterministic logic decide, let the language model communicate.

What's next for TowerGuard

  • Calibrate the burnout-loop coefficients against facility-level attrition data.
  • Run live OpenSky end-to-end (currently demo / replay).
  • Per-facility scenarios so a region can model its own staffing path.
