Project: Decoding the Airfare Black Box
Team 5 Airline Analysis
1. Inspiration
Anyone who has booked a flight has felt the frustration of watching prices fluctuate unpredictably. A fare can change overnight, and two cities of similar distance can have wildly different travel costs. This project was born from a desire to bring transparency to this "black box." We were inspired to move beyond anecdotal evidence and build a data-driven model to answer fundamental questions: Can route-level fares be predicted systematically? And what is the quantifiable impact of market competition and airport characteristics on the price of a ticket?
2. What It Does
The Airline Analysis project provides a comprehensive look into the U.S. domestic airfare market from 2021 to 2025. By analyzing over 14,000 unique route observations, the project delivers:
- Predictive Modeling: A high-accuracy system that predicts average fares based on market structure.
- Market Analysis: Quantifying how "Hub" status and the presence of low-cost carriers (LCCs) influence ticket prices.
- Actionable Insights: Evidence-based guidance for travelers to find cheaper routes, for airlines to evaluate competitive positioning, and for policymakers to monitor market fairness.
3. How We Built It
We followed a methodical data science workflow using Python and several specialized libraries:
- Data Foundation: Utilized the U.S. Department of Transportation (DOT) Domestic Airline Consumer Airfare Report, cleaning and standardizing four years of data.
- Feature Engineering: Created custom metrics such as a Hub Score (based on passenger volume and route density) and LCC Penetration (measuring the market share of budget airlines).
- The Tech Stack: Used Pandas and NumPy for data manipulation, Matplotlib and Seaborn for deep-dive EDA, and Scikit-learn and XGBoost for modeling.
- Interpretability: Integrated SHAP (SHapley Additive exPlanations) to move beyond "black box" predictions and quantify exactly how features like distance or competition push prices up or down.
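As an illustration of the feature engineering step, here is a minimal sketch of how an LCC Penetration and a Hub Score could be derived with Pandas. The column names, the toy rows, and the equal-weight scoring rule are all assumptions made for this example; the project's actual formulas are not reproduced here.

```python
import pandas as pd

# Hypothetical carrier-level rows; column names and values are illustrative,
# not the DOT report's actual schema.
rows = pd.DataFrame({
    "city1": ["A", "A", "A", "B", "B", "A"],
    "city2": ["B", "B", "C", "C", "C", "D"],
    "passengers": [400, 600, 500, 75, 225, 200],
    "is_lcc": [True, False, False, True, False, False],
})

# LCC Penetration: share of each route's passengers flown by budget carriers.
rows["lcc_passengers"] = rows["passengers"].where(rows["is_lcc"], 0)
routes = rows.groupby(["city1", "city2"], as_index=False)[
    ["passengers", "lcc_passengers"]
].sum()
routes["lcc_penetration"] = routes["lcc_passengers"] / routes["passengers"]

# Stack both endpoints so each city is credited for every route touching it.
endpoints = pd.concat([
    routes[["city1", "passengers"]].rename(columns={"city1": "city"}),
    routes[["city2", "passengers"]].rename(columns={"city2": "city"}),
])
city_stats = endpoints.groupby("city").agg(
    total_passengers=("passengers", "sum"),  # passenger volume
    route_count=("passengers", "size"),      # route density
)

# Assumed scoring rule: equal-weight mean of the two percentile ranks.
city_stats["hub_score"] = (
    city_stats["total_passengers"].rank(pct=True)
    + city_stats["route_count"].rank(pct=True)
) / 2

print(routes[["city1", "city2", "lcc_penetration"]])
print(city_stats.sort_values("hub_score", ascending=False))
```

Percentile ranks keep the two inputs on the same 0 to 1 scale, so neither passenger volume nor route count dominates the score simply because of its units.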
4. Challenges We Ran Into
- Defining a 'Hub': Traditional hub definitions are often fluid. We overcame this by creating a data-driven Hub Score that segments cities into four tiers based on actual passenger traffic.
- Multicollinearity: Many variables, such as distance and specific fare types, are highly correlated. We addressed this by using tree-based models such as Random Forest and XGBoost, whose predictions are largely unaffected by correlated inputs.
- Isolating Effects: Determining if a fare was high due to distance or lack of competition was difficult. SHAP values were crucial in isolating the "LCC effect" from the "distance effect."
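To make the idea of isolating effects concrete, the sketch below computes exact Shapley values by hand for a toy two-feature fare function. Everything in it is hypothetical: the function `toy_fare`, its coefficients, and the single baseline route are invented for illustration. The SHAP library works differently in practice (it explains a trained model against background data), but the attribution principle is the same.

```python
from itertools import permutations

def toy_fare(distance_miles, has_lcc):
    """Hypothetical fare rule: the LCC discount grows with distance."""
    return 50 + 0.10 * distance_miles - has_lcc * (15 + 0.01 * distance_miles)

def exact_shapley(model, instance, baseline):
    """Average each feature's marginal effect over every feature ordering."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)        # start from the baseline route
        previous = model(**current)
        for name in order:
            current[name] = instance[name]  # switch one feature to its real value
            value = model(**current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

# Assumed baseline: a 500-mile route with no low-cost carrier.
baseline = {"distance_miles": 500, "has_lcc": 0}
# Route to explain: 1,500 miles with a low-cost carrier present.
instance = {"distance_miles": 1500, "has_lcc": 1}

phi = exact_shapley(toy_fare, instance, baseline)
print(phi)
```

The two attributions always sum to the gap between the explained route's fare and the baseline fare, which is what lets a "distance effect" and an "LCC effect" be read off separately even when the features interact.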
5. Accomplishments We're Proud Of
- High Predictive Power: Our XGBoost model achieved an $R^2$ of 0.906 using only structural market features, indicating that airfare variation is highly systematic.
- Quantifying the "LCC Equalizer": We estimated that the presence of low-cost carriers lowers fares by an average of $17 to $34 per route.
- Systematic Transparency: Transformed a "black box" industry into a series of predictable levers.
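For readers unfamiliar with the metric, an $R^2$ of 0.906 means the model's predictions account for roughly 90.6% of the variance in observed fares. The short sketch below shows the calculation on made-up numbers; the fares and predictions are hypothetical and are not taken from the project's data.

```python
import numpy as np

# Hypothetical observed and predicted average fares (USD) for four routes.
actual = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 210.0, 240.0])

# Residual sum of squares: error left over after the model's predictions.
ss_res = np.sum((actual - predicted) ** 2)
# Total sum of squares: error from always guessing the mean fare.
ss_tot = np.sum((actual - actual.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```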
6. What We Learned
- Market Structure is King: Distance is the primary baseline driver (accounting for about one-third of predictive importance), but the competitive landscape determines the final premium.
- The Hub Paradox: Mid-tier hubs often have the highest fares because they lack the intense competition found at major international gateways.
- Data-Driven Policy: Market structure alone explains the vast majority of pricing; significant outliers can be identified as areas where competition may be failing.
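One way the outlier idea could be put into practice is to compare each route's actual fare with the model's prediction and flag unusually large positive gaps. The sketch below assumes a table of actual and predicted fares already exists; the route labels, fare values, and the z-score cutoff of 2.0 are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical actual vs. model-predicted average fares (USD) per route.
routes = pd.DataFrame({
    "route": ["A-B", "A-C", "C-D", "B-C", "A-D", "B-D"],
    "actual_fare": [200, 180, 350, 150, 210, 175],
    "predicted_fare": [195, 185, 230, 155, 205, 180],
})

# Positive residual: the route costs more than its market structure predicts.
routes["residual"] = routes["actual_fare"] - routes["predicted_fare"]

# Standardize residuals so one threshold applies across the whole table.
routes["z_score"] = (
    routes["residual"] - routes["residual"].mean()
) / routes["residual"].std(ddof=0)

Z_THRESHOLD = 2.0  # assumed cutoff; a real analysis would tune this
flagged = routes.loc[routes["z_score"] > Z_THRESHOLD, "route"].tolist()
print(flagged)
```

A flagged route is only a candidate for closer review, since a large residual can also reflect factors the model does not see, such as fuel costs or seasonality.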
7. What's Next
- Granular Variables: Integration of jet fuel prices and seasonal holiday calendars.
- Transparency Tooling: Developing a dashboard for travelers to check if a specific fare is "fair" based on market health.
- Global Expansion: Applying this methodology to international markets to compare competitive dynamics.
Built With
- jupyterhub
- machine-learning
- python