Project: Decoding the Airfare Black Box

Team 5 Airline Analysis

1. Inspiration

Anyone who has booked a flight has felt the frustration of watching prices fluctuate unpredictably. A fare can change overnight, and two routes of similar distance can have wildly different prices. This project was born from a desire to bring transparency to this "black box." We wanted to move beyond anecdotal evidence and build a data-driven model to answer two fundamental questions: Can route-level fares be predicted systematically? And what is the quantifiable impact of market competition and airport characteristics on the price of a ticket?

2. What It Does

The Airline Analysis project provides a comprehensive look into the U.S. domestic airfare market from 2021 to 2025. By analyzing over 14,000 unique route observations, the project delivers:

  • Predictive Modeling: A model that predicts average route fares from market-structure features (our XGBoost model reached an $R^2$ of 0.906).
  • Market Analysis: Quantifying how "Hub" status and the presence of low-cost carriers (LCCs) influence ticket prices.
  • Actionable Insights: Evidence-based guidance for travelers to find cheaper routes, for airlines to evaluate competitive positioning, and for policymakers to monitor market fairness.

3. How We Built It

We followed a methodical data science workflow using Python and several specialized libraries:

  • Data Foundation: Utilized the U.S. Department of Transportation (DOT) Domestic Airline Consumer Airfare Report, cleaning and standardizing four years of data.
  • Feature Engineering: Created custom metrics such as a Hub Score (based on passenger volume and route density) and LCC Penetration (measuring the market share of budget airlines).
  • The Tech Stack: Used Pandas and NumPy for data manipulation, Matplotlib and Seaborn for exploratory data analysis (EDA), and Scikit-learn and XGBoost for modeling.
  • Interpretability: Integrated SHAP (SHapley Additive exPlanations) to move beyond "black box" predictions and quantify how features like distance or competition push each predicted fare up or down.
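
The feature engineering step above can be sketched roughly as follows. This is a minimal illustration, not our exact pipeline: the column names, the sample rows, the equal weighting of passenger volume and route count, and the quantile-based split into four tiers are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical route-level rows; column names are illustrative, not the DOT schema.
routes = pd.DataFrame({
    "origin_city":    ["A", "A", "A", "B", "B", "C", "D", "D"],
    "dest_city":      ["B", "C", "D", "C", "D", "D", "E", "F"],
    "passengers":     [900, 400, 300, 250, 150, 100, 80, 40],
    "lcc_passengers": [450, 100, 0, 125, 30, 0, 40, 0],
})

# LCC Penetration: share of a route's passengers carried by low-cost carriers.
routes["lcc_penetration"] = routes["lcc_passengers"] / routes["passengers"]

# Each route counts toward both of its endpoint cities.
endpoints = pd.concat([
    routes[["origin_city", "passengers"]].rename(columns={"origin_city": "city"}),
    routes[["dest_city", "passengers"]].rename(columns={"dest_city": "city"}),
])
city = endpoints.groupby("city").agg(
    volume=("passengers", "sum"),      # passenger volume
    n_routes=("passengers", "count"),  # route density (routes touching the city)
)

def min_max(s):
    """Scale a Series to the [0, 1] range."""
    return (s - s.min()) / (s.max() - s.min())

# Assumed weighting: equal-weight average of scaled volume and scaled route count.
city["hub_score"] = 0.5 * min_max(city["volume"]) + 0.5 * min_max(city["n_routes"])

# Assumed tiering: quartiles of the score, with 4 as the largest hubs.
city["hub_tier"] = pd.qcut(city["hub_score"], q=4, labels=[1, 2, 3, 4])
```

The resulting city-level score and tier can then be merged back onto each route for both endpoints before modeling.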

4. Challenges We Ran Into

  • Defining a 'Hub': Traditional hub definitions are often fluid. We overcame this by creating a data-driven Hub Score that segments cities into four tiers based on actual passenger traffic.
  • Multicollinearity: Many variables, such as distance and specific fare types, are highly correlated. We addressed this by using tree-based models such as Random Forest and XGBoost, whose predictive accuracy is largely unaffected by correlated inputs, although importance can still be split across correlated features.
  • Isolating Effects: Determining whether a fare was high because of distance or because of a lack of competition was difficult. SHAP values were crucial in separating the "LCC effect" from the "distance effect."
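
To show the idea behind separating the two effects, here is a small sketch that computes exact Shapley values for a two-feature toy fare function. The fare formula, the baseline route, and all numbers are invented for illustration. The SHAP library works against a background dataset with tree-specific algorithms, so this only mirrors the underlying principle, not what the library does internally.

```python
from itertools import permutations

def toy_fare(distance_miles, lcc_present):
    """Hypothetical fare function (not our model): the LCC discount grows with distance."""
    discount = (10 + 0.01 * distance_miles) if lcc_present else 0.0
    return 80 + 0.10 * distance_miles - discount

def shapley_values(predict, instance, baseline):
    """Exact Shapley values relative to a single baseline, by enumerating feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(**current)
        for name in order:
            # Switch one feature from its baseline value to the instance value
            # and credit that feature with the change in the prediction.
            current[name] = instance[name]
            updated = predict(**current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}

baseline = {"distance_miles": 500, "lcc_present": 0}   # short route, no LCC
route = {"distance_miles": 1500, "lcc_present": 1}     # long route, LCC present

contributions = shapley_values(toy_fare, route, baseline)
```

In this toy case distance receives a positive contribution and LCC presence a negative one, and the two contributions sum to the difference between the route's fare and the baseline fare. That additivity is what lets a long but competitive route be distinguished from a short but uncompetitive one.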

5. Accomplishments We're Proud Of

  • High Predictive Power: Our XGBoost model achieved an $R^2$ of 0.906 using only structural market features, indicating that airfare variation is highly systematic.
  • Quantifying the "LCC Equalizer": We estimated that the presence of low-cost carriers lowers fares by an average of $17 to $34 per route.
  • Systematic Transparency: Reframed airfare pricing from a "black box" into a set of measurable drivers such as distance, hub status, and LCC competition.
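
For readers unfamiliar with the metric, $R^2$ is the share of variance in observed fares that the model's predictions account for. A minimal sketch of the calculation, using made-up fares rather than our data:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Illustrative fares only (not project data).
actual = [200, 250, 300, 350]
predicted = [210, 240, 310, 340]
score = r_squared(actual, predicted)
```

A score of 1.0 means perfect predictions, while always predicting the mean fare gives 0.0.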

6. What We Learned

  • Market Structure is King: Distance is the primary baseline driver (accounting for about one-third of predictive importance), but the competitive landscape determines the final premium.
  • The Hub Paradox: Mid-tier hubs often have the highest fares because they lack the intense competition found at major international gateways.
  • Data-Driven Policy: Market structure alone explains most of the variation in fares, so routes priced well above the model's prediction can be flagged as markets where competition may be failing.
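
One simple way to surface such outliers is to compare actual fares against model predictions and flag unusually large positive residuals. The sketch below assumes hypothetical column names (`actual_fare`, `predicted_fare`) and an arbitrary z-score threshold of 2; neither comes from our pipeline, and the sample fares are invented.

```python
import pandas as pd

def flag_fare_outliers(df, threshold=2.0):
    """Flag routes whose fare exceeds the prediction by an unusually large margin."""
    out = df.copy()
    out["residual"] = out["actual_fare"] - out["predicted_fare"]
    # Standardize residuals so the threshold is expressed in standard deviations.
    z = (out["residual"] - out["residual"].mean()) / out["residual"].std(ddof=0)
    out["overpriced"] = z > threshold
    return out

# Illustrative rows only (not project data).
fares = pd.DataFrame({
    "route": ["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8"],
    "predicted_fare": [200, 220, 180, 260, 240, 210, 230, 250],
    "actual_fare":    [205, 215, 183, 257, 244, 206, 230, 310],
})
flagged = flag_fare_outliers(fares)
overpriced_routes = flagged.loc[flagged["overpriced"], "route"].tolist()
```

A flagged route is only a starting point for review, since a large residual can also reflect factors the model does not include.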

7. What's Next

  • Granular Variables: Integrating jet fuel prices and seasonal holiday calendars as additional features.
  • Transparency Tooling: Developing a dashboard for travelers to check if a specific fare is "fair" based on market health.
  • Global Expansion: Applying this methodology to international markets to compare competitive dynamics.
