Team 46 - Why Is My Flight SO EXPENSIVE?

Inspiration

Airfare is supposed to be simple: fly farther, pay more. At least, that’s the theory. In reality, students and families know the experience can feel more like a game show — where two trips of nearly identical distance somehow come with wildly different price tags (・_・;).

That curiosity led us to ask a bigger question: is airfare really about geography, or is it secretly a story about power? We set out to explore whether structural forces — like airline dominance, low-cost competition, and hub control — quietly determine who gets a deal and who gets a dent in their wallet (￣▽￣).

Our goal wasn’t just to predict ticket prices, but to uncover whether affordability is shaped less by miles in the sky and more by market dynamics on the ground — and to measure just how much concentration influences what travelers ultimately pay (✧ω✧).

What it does

Our project develops an end-to-end airfare affordability simulator that integrates predictive modeling with statistical analysis to examine the structural drivers of pricing.

The first component is a machine-learning framework that models airfare as a function of route distance, demand intensity, market share concentration, hub exposure, and temporal factors. The model is designed not only to generate accurate fare predictions but also to isolate the marginal effects of key structural variables. In particular, it quantifies how dominant carrier market share influences pricing power, measures the fare-reducing impact of low-cost carrier presence, and estimates hub-related pricing premiums. These outputs are operationalized in an interactive “Flight Deck” web application, where users can input route characteristics and simulate counterfactual scenarios — such as increased competition or reduced hub dominance — to observe predicted fare changes and potential savings.

The second component consists of complementary statistical analysis conducted in both R and Python. This portion is used to validate model assumptions, interpret relationships within the data, and ensure robustness through econometric techniques and exploratory analysis. Beyond supporting the predictive results, the statistical framework is intended to inform future policy and market insights by providing interpretable evidence on how structural factors shape affordability. Together, these two components translate empirical analysis into a practical decision-support tool, bridging predictive accuracy with interpretability to better understand and potentially improve airfare affordability.

How we built it

Data Cleaning & Feature Engineering

We compiled route-level airfare data and standardized key variables, including fare, distance (nsmiles), passenger volume, dominant carrier market share (large_ms), and low-cost carrier share (lf_ms). To address skewness and stabilize variance, we modeled log(fare), log(distance), and log(passengers). We further engineered structural indicators such as hub_intensity to capture network concentration effects.
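
The log transformations described above can be sketched in a few lines. This is a minimal illustration rather than our actual pipeline code: the column names fare, nsmiles, large_ms, and lf_ms come from the dataset, while passengers and log_passengers are assumed names, and the example row is made up. hub_intensity is left out because its exact formula is not spelled out here.

```python
import math

def add_log_features(route):
    """Return a copy of a route record with log-transformed columns added."""
    out = dict(route)  # copy so the raw record is left untouched
    out["log_fare"] = math.log(route["fare"])
    out["log_distance"] = math.log(route["nsmiles"])
    # "passengers" is an assumed column name for passenger volume.
    out["log_passengers"] = math.log(route["passengers"])
    return out

# Made-up example row, not taken from the project's dataset.
example = {"fare": 250.0, "nsmiles": 1000.0, "passengers": 400.0,
           "large_ms": 0.62, "lf_ms": 0.18}
features = add_log_features(example)
print(features["log_fare"], features["log_distance"])
```

Taking logs compresses the long right tail of fares, so a handful of very expensive routes has less pull on the fitted model.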

Exploratory Analysis

Descriptive analysis confirmed a strongly right-skewed fare distribution. Visualization showed that while distance establishes a baseline cost, substantial price dispersion exists across routes of similar length, indicating the importance of competitive structure. Preliminary regressions demonstrated that incorporating market concentration variables significantly improves explanatory power beyond distance alone.

Econometric & Machine Learning Modeling

We implemented a tiered modeling strategy:

Baseline model: OLS regression (log_fare ~ log_distance) to establish cost fundamentals

Structural model: Extended regression including demand and competition variables to estimate interpretable marginal effects

Predictive model: Gradient Boosting to capture nonlinearities and interaction effects
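
To make the baseline tier concrete, here is a minimal sketch of a log-log OLS fit (log_fare ~ log_distance) written with the closed-form least-squares formulas. It is an illustration only: in practice a statistics library would do this, the function name fit_loglog_ols is hypothetical, and the data below is synthetic, generated so that the true slope is 0.5.

```python
import math

def fit_loglog_ols(distances, fares):
    """Fit log(fare) = intercept + slope * log(distance) by least squares."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(f) for f in fares]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Synthetic data: fare = 10 * distance ** 0.5, so the fit should recover 0.5.
distances = [200.0, 500.0, 1000.0, 2500.0]
fares = [10.0 * d ** 0.5 for d in distances]
intercept, slope = fit_loglog_ols(distances, fares)
print(round(slope, 3))
```

In a log-log model the slope reads as an elasticity: a 1% increase in distance goes with roughly a slope-percent change in fare. The structural tier adds more regressors to the same kind of equation.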

Model interpretation tools — including permutation importance and partial dependence analysis — show that distance sets the baseline, while dominant carrier share raises fares, low-cost carrier penetration reduces fares, and hub-to-hub exposure introduces systematic premiums. These relationships remain robust after controlling for demand and route characteristics.
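
Permutation importance is simple enough to sketch from scratch: shuffle one feature, re-score the model, and see how much the error grows. The version below is a stdlib-only illustration, not the library routine we ran. toy_model and its coefficients are made-up placeholders, the rows are random synthetic data, and the feature name unused is hypothetical.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of a model over a list of feature dicts."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, feature, n_repeats=20, seed=0):
    """Average increase in MSE when one feature's values are shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
        increases.append(mse(model, permuted, targets) - baseline)
    return sum(increases) / n_repeats

def toy_model(row):
    # Placeholder coefficients for illustration only, not fitted results.
    return 2.0 + 0.5 * row["log_distance"] + 0.4 * row["large_ms"]

data_rng = random.Random(42)
rows = [{"log_distance": data_rng.uniform(5.0, 8.0),
         "large_ms": data_rng.uniform(0.2, 1.0),
         "unused": data_rng.uniform(0.0, 1.0)} for _ in range(50)]
targets = [toy_model(r) for r in rows]

imp_distance = permutation_importance(toy_model, rows, targets, "log_distance")
imp_unused = permutation_importance(toy_model, rows, targets, "unused")
print(imp_distance, imp_unused)
```

A feature the model never reads scores exactly zero, while a feature it depends on scores above zero. Note that permutation importance measures how much a feature matters, not the direction of its effect; the direction comes from the regression coefficients and partial dependence plots.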

Statistical Validation & Analytical Framework

Parallel analyses in R and Python were conducted to validate assumptions, test robustness, and interpret structural relationships. This included diagnostic testing, sensitivity checks, and comparative model evaluation to ensure consistency between econometric inference and machine-learning predictions. This dual-framework approach strengthens both the credibility of findings and their relevance for future policy or market interventions.

Deployment

The final predictive model was deployed via a FastAPI backend with an interactive Streamlit frontend. The interface enables users to input route characteristics and simulate counterfactual scenarios — such as increased competition or reduced hub exposure — translating analytical results into real-time affordability insights.
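
The counterfactual logic behind the simulator can be sketched as a small function that scores a route twice, once as observed and once with changed inputs. This is a hedged illustration, not the deployed code: simulate_counterfactual and placeholder_model are hypothetical names, the coefficients are made up, and the real app would call the trained model instead.

```python
import math

def simulate_counterfactual(predict_log_fare, route, changes):
    """Compare the predicted fare for a route with and without a scenario change."""
    scenario = {**route, **changes}  # apply the what-if overrides to a copy
    baseline_fare = math.exp(predict_log_fare(route))
    scenario_fare = math.exp(predict_log_fare(scenario))
    return {
        "baseline_fare": baseline_fare,
        "scenario_fare": scenario_fare,
        "savings": baseline_fare - scenario_fare,
    }

def placeholder_model(row):
    # Made-up coefficients standing in for the trained model.
    return (2.0 + 0.5 * row["log_distance"]
            + 0.4 * row["large_ms"] - 0.3 * row["lf_ms"])

route = {"log_distance": math.log(1000.0), "large_ms": 0.8, "lf_ms": 0.1}
# Scenario: the dominant carrier's share falls from 0.8 to 0.5.
result = simulate_counterfactual(placeholder_model, route, {"large_ms": 0.5})
print(result["savings"] > 0)
```

Because the model predicts log(fare), predictions are exponentiated before comparison so that savings come out in fare units rather than log points.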

Challenges we ran into

We will be honest. At the beginning, machine learning felt like this: (╯°□°）╯︵ ┻━┻

Most of us had little to no experience with predictive modeling, so building an end-to-end pipeline — data cleaning → feature engineering → regression → gradient boosting → deployment — meant a lot of rapid learning and even more debugging.

There were moments where our model outputs made no sense and we stared at the screen like: ( •_•) … ( •_•)>⌐■-■ … (⌐■_■) “Why is this happening?”

We also had to learn Tableau Public from scratch, work out what each variable in the dataset meant, handle skewed distributions, and debug models that weren't behaving as expected.

Then came deployment. Virtual environments. Ports in use. API paths not found. Nested Model folders. We saw errors like: (ಥ﹏ಥ) (╥﹏╥) “Address already in use”

But each crash was just another learning checkpoint.

Accomplishments that we're proud of

Despite starting with little machine-learning experience, we built a fully functional, deployable ML system. (ง •̀_•́)ง

We:
• Engineered meaningful structural features (hub intensity, dominance measures, LCC share)
• Built both regression models and nonlinear gradient boosting models
• Interpreted permutation importance and partial dependence plots
• Built an interactive Streamlit “Flight Deck” simulator
• Drew interpretable insights from our statistical analysis of how market structure shapes fares

We didn’t just build a model that predicts fares — we built a tool that demonstrates how competition structure affects affordability.

And we did it while learning machine learning in real time.

From: “I don’t know what gradient boosting is” to: “We can explain partial dependence plots” (ﾉ◕ヮ◕)ﾉ*:･ﾟ✧

That growth is something we’re genuinely proud of.

What we learned

This project demonstrated that machine learning functions best as a tool for structured inquiry rather than pure prediction. While distance establishes the baseline cost of airfare, incorporating market structure variables revealed deeper dynamics: dominant carrier share increases pricing power, low-cost carrier presence reduces fares, and hub-to-hub routes carry systematic premiums.

From a methodological perspective, we found that log transformations improve model stability, data preparation constitutes the majority of analytical effort, and interpreting model outputs is more informative than relying solely on fit metrics. Deployment further emphasized the importance of robustness in real-world applications.

What's next for Why Is My Flight SO EXPENSIVE?

(Our affordable team trip to Japan (ﾉ◕ヮ◕)ﾉ*:･ﾟ✧)

Right now, our model shows how competition structure shapes pricing.

Next, we want to:

✈ Add booking-timing variables (advance purchase windows, seasonality).
✈ Incorporate real-time fare APIs.
✈ Add route-level policy simulation (e.g., “What if a new LCC entered this market?”).
✈ Visualize structural disadvantage maps for students and families.

Eventually, this could become:
• A student affordability tool
• A consumer transparency dashboard
• A policy evaluation simulator

Because if airfare affordability is structural, then it shouldn’t be invisible.