Project Story
About the Project
Airfare headlines kept citing “mystery surcharges,” but most articles lacked route-level evidence. That gap pushed me to build a transparent analytics stack that traces each fare back to measurable structure—distance, competition, and hub power—instead of vague narratives. The Data Marathon brief further nudged me to turn a static dataset into a story-first dashboard that stakeholders can actually explore, filter, and validate.
What I Learned
- Translating aviation economics into engineered features matters: deriving price-per-mile, demand bins, and hub premiums made the model’s behavior far more interpretable.
- Building a clean MLOps loop (data → XGBoost training → FastAPI serving) enforced disciplined reproducibility: versioned CSVs, pinned dependencies, and deterministic seeds.
- Story-first UX beats chart dumps: curating 16 endpoints into 5 dashboard sections helped surface the “so what” behind each metric instead of overwhelming users with raw plots.
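The engineered features mentioned above (price-per-mile, demand bins) can be sketched in a few lines of pandas. The column names (fare, nsmiles, passengers) and the quartile binning are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd

def add_fare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative feature engineering for route-level fares."""
    out = df.copy()
    # Price-per-mile: normalize fare by route distance (guard against zero miles).
    out["price_per_mile"] = out["fare"] / out["nsmiles"].clip(lower=1)
    # Demand bins: quartiles of passenger volume on the route.
    out["demand_bin"] = pd.qcut(
        out["passengers"], q=4, labels=["low", "mid", "high", "peak"]
    )
    return out

routes = pd.DataFrame({
    "fare": [220.0, 95.0, 410.0, 180.0],
    "nsmiles": [1100, 350, 2400, 900],
    "passengers": [320, 80, 1500, 600],
})
feats = add_fare_features(routes)
```

Deriving ratios and bins up front, rather than feeding raw columns to the model, is what makes downstream feature importances readable in route-economics terms.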
How I Built It
- Data pipeline — clean_data.py ingests ~14k rows, handles currency coercion, imputes missing city metrics, and exports both vis-ready and ML-ready tables. Market concentration is computed via the Herfindahl–Hirschman Index:
$$ HHI = \sum_i s_i^2 $$
where \( s_i \) is carrier \( i \)’s market share (using a three-share proxy when full carrier breakdowns aren’t available).
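The formula translates directly into code. A minimal sketch, with the clipping step echoing the stabilization described under Challenges; the three-share example values are made up:

```python
def hhi_from_shares(shares):
    """Herfindahl–Hirschman Index from carrier market shares (as fractions).

    With only a three-share proxy available (top carrier, second carrier,
    remainder), the same sum-of-squares applies to those three values.
    """
    # Clip to [0, 1] so noisy inputs can't blow up the index.
    s = [min(max(x, 0.0), 1.0) for x in shares]
    return sum(x * x for x in s)

# Example: one dominant carrier (60%), a challenger (30%), the rest (10%).
hhi = hhi_from_shares([0.60, 0.30, 0.10])  # 0.36 + 0.09 + 0.01 = 0.46
```

On the 0–1 share scale, a route near 1.0 is effectively a monopoly and one near 1/n is fully fragmented, which is what makes HHI a usable "hub power" signal per route.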
- Modeling layer — backend/app/models/train_cli.py runs a 100-trial XGBoost search (hist method, time-based split) and logs artifacts for deployment, including tuned hyperparameters and evaluation outputs.
- Service + UX — FastAPI exposes 16 visualization routes plus /predict/fare, while a Vite + Tailwind React dashboard consumes them with live filters and an interactive fare predictor for "what-if" exploration.
Challenges
- HHI estimation: Limited carrier breakdowns required a proxy that wouldn’t oscillate across routes. Clipping shares and smoothing medians stabilized the concentration signal without destroying ranking.
- GPU/CPU parity: Training on GPU introduced slight numeric drift. Exporting the tuned parameters and re-running with fallback_to_cpu kept serving behavior consistent and easier to reproduce.
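One way to keep a GPU-tuned search and CPU serving aligned is a simple parameter artifact round-trip: persist the tuned hyperparameters as JSON, then reload them for the CPU re-run so both start from one config. The file name, key set, and values below are placeholders, not the project's actual tuned output:

```python
import json
import pathlib
import tempfile

# Placeholder tuned hyperparameters; keys are typical XGBoost names.
tuned = {
    "max_depth": 8,
    "learning_rate": 0.05,
    "n_estimators": 600,
    "tree_method": "hist",  # CPU-friendly method for reproducible serving
    "random_state": 42,     # deterministic seed
}

# Write the artifact after the GPU search...
artifact = pathlib.Path(tempfile.mkdtemp()) / "xgb_params.json"
artifact.write_text(json.dumps(tuned, indent=2))

# ...and reload it verbatim for the CPU re-run.
reloaded = json.loads(artifact.read_text())
assert reloaded == tuned  # both runs share one configuration
```

Treating hyperparameters as a versioned artifact, rather than re-searching on CPU, is what makes the drift diagnosis tractable: any remaining difference is numeric, not configurational.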
Built With
- fastapi
- javascript
- python
- react
- scikit-learn
- tailwind