What it does

Tests property-level data on multiple models to help explain renters' decisions post-COVID.

How it was designed

Mostly a waterfall approach, carried out in five sequential steps: Data Understanding and EDA; Data Preprocessing and Feature Engineering; Model Development and Training; Model Interpretation and Insights; and Final Predictions.

Methodology Summary

BroadVail Finance Track - 2026 Rice Datathon

1. Data Understanding

Dataset Overview

  • Dataset: Property-level panel data with ~1,000 unique properties across multiple time periods and trade area definitions
  • Total Observations: Training ~7,000 rows, Scoring ~9,000 rows
  • Target Variable: RevPAR Growth (percentage change in Revenue per Available Unit)
  • Time Periods:
    • Pre-COVID: 2015-2020 (growth rate)
    • Post-COVID: 2022-2025 (growth rate)
  • Trade Areas: 10, 15, and 30-minute drivetime isochrones

Data Structure

Each row represents one combination of property, time window, and trade area, with:

  • Property characteristics (age, size, type, renovation status)
  • Location features (MSA ring, normalized distance to CBD)
  • Trade area amenities (grocery, food, services)
  • AARP Livability scores (7 domain scores + 50+ metrics)
  • Supply and competition data
  • Own-vs-rent economics

2. Feature Engineering

Engineered 146 features across these categories:

| Category | # Features | Key Examples |
|---|---|---|
| Property | 15 | age, years_since_renov, total_sqft, size_category |
| Location | 12 | msa_ring_numeric, is_suburban, is_urban_core |
| Drivetime | 6 | drv10_flag, drv15_flag, drv30_flag, drivetime_area |
| Amenities | 20 | grocery_access, food_density, amenity_density |
| Food/Dining | 18 | upscale_dining_ratio, fast_food_ratio, dining_variety |
| AARP Scores | 50+ | All 7 domain scores + individual metrics |
| Supply | 8 | competition_intensity, high_supply_growth |
| Own-vs-Rent | 7 | rent_vs_mortgage_ratio, rent_is_cheaper |
| Time Interactions | 12 | post_x_prox, post_x_trans, post_x_broadband |

Feature Engineering Approach

  1. Property Features: Age calculations, renovation indicators, size categorization
  2. Composite Scores: Created weighted AARP renter score, walkability proxy, remote work score
  3. Ratio Features: Normalized amenity counts (upscale_dining_ratio, chain_food_ratio)
  4. Accessibility Features: Inverse distance features (food_accessibility)
  5. Interaction Features: COVID period interactions to capture preference shifts
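The feature types listed above could be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual pipeline: the input column names (`year_built`, `upscale_dining_count`, `total_dining_count`, `nearest_food_miles`, `prox_score`), the reference year, and the exact formulas are all assumptions; only the output names `upscale_dining_ratio`, `food_accessibility`, and `post_x_prox` come from the write-up.

```python
import pandas as pd

# Toy rows; every input column name and value here is hypothetical.
df = pd.DataFrame({
    "year_built": [1985, 2005, 2018],
    "upscale_dining_count": [4, 0, 10],
    "total_dining_count": [20, 0, 25],
    "nearest_food_miles": [0.5, 2.0, 0.0],
    "prox_score": [60.0, 45.0, 80.0],
    "is_post_covid": [0, 1, 1],
})

REFERENCE_YEAR = 2025  # assumed reference year for the age calculation


def engineer_features(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    # Property feature: age in years.
    out["age"] = REFERENCE_YEAR - out["year_built"]
    # Ratio feature: share of dining that is upscale; 0 when there is no dining.
    total = out["total_dining_count"].where(out["total_dining_count"] > 0)
    out["upscale_dining_ratio"] = (out["upscale_dining_count"] / total).fillna(0.0)
    # Accessibility feature: inverse distance, offset by 1 so distance 0 stays finite.
    out["food_accessibility"] = 1.0 / (1.0 + out["nearest_food_miles"])
    # Interaction feature: lets the proximity effect differ in the post-COVID window.
    out["post_x_prox"] = out["is_post_covid"] * out["prox_score"]
    return out


features = engineer_features(df)
print(features[["age", "upscale_dining_ratio", "food_accessibility", "post_x_prox"]])
```

Guarding the ratio against a zero denominator matters here because some trade areas may have no dining establishments at all.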

3. Modeling Approach

Models Evaluated

  1. Linear Models: Ridge, Lasso, ElasticNet
  2. Tree Ensembles: Random Forest, Gradient Boosting
  3. Specialized Models: Separate Pre/Post COVID models
  4. Combined Approach: Model ensemble

Cross-Validation Strategy

  • Method: 5-fold GroupKFold
  • Grouping: By property ID (UBID) to prevent data leakage
  • Rationale: Same property appears in multiple rows (different time windows, trade areas), so grouping prevents information leakage
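A minimal sketch of this grouped split with scikit-learn's `GroupKFold`, using synthetic data in place of the real panel. The integer `groups` array stands in for UBID; the feature matrix, target, and model settings are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_properties, rows_per_property = 40, 6  # e.g. 2 time windows x 3 trade areas
groups = np.repeat(np.arange(n_properties), rows_per_property)  # stand-in for UBID
X = rng.normal(size=(groups.size, 5))
y = 0.1 * X[:, 0] + rng.normal(scale=0.05, size=groups.size)

cv = GroupKFold(n_splits=5)
fold_rmse = []
for train_idx, val_idx in cv.split(X, y, groups=groups):
    # No property ID may appear on both sides of a split.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    fold_rmse.append(float(np.sqrt(np.mean((y[val_idx] - pred) ** 2))))

print(f"mean validation RMSE: {np.mean(fold_rmse):.3f}")
```

With a plain `KFold`, rows from the same property could land in both the training and validation folds, which would tend to make validation scores look better than they are.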

Hyperparameter Tuning

Initial models used reasonable defaults. More extensive tuning could be done with:

  • Random search or Bayesian optimization (Optuna)
  • Nested cross-validation for unbiased performance estimates
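As one possible starting point, a random search can reuse the same property-grouped folds. The sketch below uses scikit-learn's `RandomizedSearchCV` on synthetic data; the search space, the number of iterations, and the data are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

rng = np.random.default_rng(1)
groups = np.repeat(np.arange(30), 4)  # stand-in for property IDs
X = rng.normal(size=(groups.size, 4))
y = 0.2 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(scale=0.05, size=groups.size)

# Hypothetical search space; real ranges would depend on the data.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.03, 0.1, 0.3],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="neg_root_mean_squared_error",
    cv=GroupKFold(n_splits=5),
    random_state=0,
)
# Passing groups keeps each property inside a single fold during the search.
search.fit(X, y, groups=groups)

best_rmse = -search.best_score_
print(search.best_params_, f"RMSE={best_rmse:.3f}")
```

The best score from a search like this is optimistically biased, which is why nested cross-validation is listed as the way to get an unbiased estimate.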

Final Model Selection

Ensemble (Gradient Boosting + Random Forest)

  • Averages predictions from both models
  • Leverages diversity between models for more robust predictions
  • Both models reached a similar R² (~0.63), so neither clearly dominates; averaging is meant to smooth out their individual errors
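The averaging step can be sketched as below. The data is synthetic and the model settings are scikit-learn defaults or arbitrary choices, so this shows only the equal-weight averaging idea and not the submitted configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(2)
X_train = rng.normal(size=(200, 5))
y_train = (
    0.15 * X_train[:, 0]
    + 0.05 * X_train[:, 1] ** 2
    + rng.normal(scale=0.05, size=200)
)
X_score = rng.normal(size=(50, 5))

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
gb = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

rf_pred = rf.predict(X_score)
gb_pred = gb.predict(X_score)
# Equal-weight average of the two models' predictions.
ensemble_pred = (rf_pred + gb_pred) / 2.0
```

Because each ensemble prediction lies between the two base predictions, the averaged output can never be more extreme than both models at once.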

4. Validation Performance

| Model | Validation RMSE | Validation R² |
|---|---|---|
| Post-COVID Only | 0.122 | 0.238 |
| Random Forest | 0.133 | 0.631 |
| Gradient Boosting | 0.134 | 0.628 |
| Final Ensemble | ~0.133 | ~0.63 |
| Linear Regression | 0.149 | 0.537 |
| Baseline (Mean) | 0.219 | -0.000 |

Notes on Model Selection

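  • The "Post-COVID Only" model shows the lowest RMSE but also the lowest R² because it was evaluated only on post-COVID data, where the target appears to vary less; its metrics are therefore not directly comparable with the other rows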
  • Random Forest and Gradient Boosting show best generalization (highest R²)
  • Ensemble provides robust, conservative predictions
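The combination of a low RMSE and a low R² can be checked with a little arithmetic. Since R² = 1 − RMSE² / Var(y) on the evaluation set, the reported pair (RMSE, R²) implies a target standard deviation of RMSE / sqrt(1 − R²). The sketch below applies that to the table values; it treats the cross-validated figures as if they came from a single evaluation set, so the results are only approximate.

```python
import math


def implied_target_std(rmse: float, r2: float) -> float:
    """Std of the evaluation targets implied by R^2 = 1 - RMSE^2 / Var(y)."""
    return rmse / math.sqrt(1.0 - r2)


post_covid_std = implied_target_std(0.122, 0.238)     # Post-COVID Only row
random_forest_std = implied_target_std(0.133, 0.631)  # Random Forest row

print(f"post-COVID subset: {post_covid_std:.3f}")
print(f"all rows:          {random_forest_std:.3f}")
```

The implied spread is roughly 0.14 for the post-COVID subset versus roughly 0.22 for all rows, the latter matching the mean baseline's RMSE of 0.219. A smaller spread leaves less variance to explain, so R² drops even though the absolute error is lower.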

5. Key Findings

Top Drivers of RevPAR Outperformance

  1. post_x_prox: Properties in areas with high proximity scores saw greater benefit post-COVID
  2. MARKET_target_encoded: Market-level historical performance strongly predicts future performance
  3. is_post_covid: Significant systematic difference between pre and post-COVID growth patterns
  4. post_x_remote_score: Remote work friendliness became more valuable post-COVID
  5. ownrent_avg_mortgage/rent: Local housing economics affect property performance
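The write-up does not describe how `MARKET_target_encoded` was computed. One common, leakage-aware way to build such a feature is out-of-fold target encoding, sketched here on toy data; the values, the fold count, and the fallback to the fold mean are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy panel: two rows per property, two markets. Values are made up.
df = pd.DataFrame({
    "UBID":   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "MARKET": ["A", "A", "A", "A", "B", "B", "B", "B", "A", "A", "B", "B"],
    "target": [0.10, 0.12, 0.08, 0.09, -0.02, 0.00,
               0.01, -0.01, 0.11, 0.10, 0.02, 0.00],
})

encoded = pd.Series(np.nan, index=df.index)
for fit_idx, enc_idx in GroupKFold(n_splits=3).split(df, groups=df["UBID"]):
    fit_rows = df.iloc[fit_idx]
    # Market means come only from properties outside the fold being encoded.
    market_means = fit_rows.groupby("MARKET")["target"].mean()
    fallback = fit_rows["target"].mean()  # used if a market is unseen in fit_rows
    encoded.iloc[enc_idx] = (
        df["MARKET"].iloc[enc_idx].map(market_means).fillna(fallback).to_numpy()
    )

df["MARKET_target_encoded"] = encoded
print(df)
```

Encoding each row from other properties' targets keeps a property's own outcome out of its feature, which matters when the encoded feature ranks among the top drivers.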

Pre-COVID vs Post-COVID Shift

Features that became MORE important post-COVID:

  • Hospital accessibility (aarp_met_health_hospital)
  • Property type/class (type_main_encoded, type_sub_encoded)
  • Property size (numunits, log_numunits)

Features that became LESS important post-COVID:

  • Market-level effects (MARKET_target_encoded)
  • Housing burden metrics (aarp_met_house_burden)
  • State-level effects (state_encoded)
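The write-up does not state how the change in importance was measured. One way to make such a comparison is to fit a separate model per period and difference their permutation importances, as in the sketch below. The data is synthetic and deliberately constructed so that one feature matters more post-COVID; the feature names are hypothetical, and importance is computed on the training rows for brevity, where held-out rows would be preferable in practice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 3))
feature_names = ["hospital_access", "market_effect", "noise"]  # hypothetical
is_post = rng.integers(0, 2, size=n).astype(bool)

# Synthetic target: market effect dominates pre-COVID, hospital access post-COVID.
y = np.where(
    is_post,
    0.30 * X[:, 0] + 0.05 * X[:, 1],
    0.05 * X[:, 0] + 0.30 * X[:, 1],
) + rng.normal(scale=0.02, size=n)


def period_importance(mask: np.ndarray) -> np.ndarray:
    """Fit a model on one period and return mean permutation importances."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[mask], y[mask])
    result = permutation_importance(
        model, X[mask], y[mask], n_repeats=5, random_state=0
    )
    return result.importances_mean


# Positive values: more important post-COVID; negative: less important.
shift = period_importance(is_post) - period_importance(~is_post)
for name, delta in zip(feature_names, shift):
    print(f"{name}: {delta:+.3f}")
```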

Interpretation

The shift suggests that post-COVID:

  1. Health infrastructure access became more valued by renters
  2. Property-specific characteristics matter more than market-level trends
  3. Location preferences shifted from market-driven to property-driven decisions

Drivetime Analysis

  • Best trade area: 10-minute drivetime (drv10)
  • This suggests renters primarily care about amenities within a 10-minute drive
  • Larger trade areas (15, 30 min) may include too much noise

Geographic Patterns

  • MSA Ring 4 (outer suburbs): Highest growth
  • MSA Ring 1 (downtown): Lowest growth
  • Consistent with suburban migration trend during/after COVID

6. Prediction Summary

Scoring Dataset Predictions

  • Total predictions: 8,997
  • Pre-COVID predictions: 4,485 (mean: 21.2%)
  • Post-COVID predictions: 4,512 (mean: -5.1%)
  • Overall range: [-24.7%, 77.4%]

Distribution Characteristics

  • Predictions are conservative (narrower range than training targets)
  • Mean prediction (8.0%) aligns well with training mean (8.2%)
  • Pre-COVID predictions are substantially higher than post-COVID predictions (mean 21.2% vs -5.1%)
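The reported figures can be cross-checked: the overall mean should equal the count-weighted average of the two period means. A quick check using only the numbers stated above:

```python
n_pre, mean_pre = 4485, 21.2     # pre-COVID predictions (count, mean %)
n_post, mean_post = 4512, -5.1   # post-COVID predictions (count, mean %)

total = n_pre + n_post
overall_mean = (n_pre * mean_pre + n_post * mean_post) / total

print(total, round(overall_mean, 1))
```

The counts sum to 8,997 and the weighted mean rounds to 8.0%, consistent with the totals reported in this section.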

7. Limitations

  1. Limited hyperparameter tuning: Default/light tuning used due to time constraints
  2. No external validation: All performance metrics are cross-validation based
  3. Assumption of stationarity: Model assumes feature relationships are stable within time periods
  4. Trade area overlap: Each property appears under multiple trade area definitions, so rows are not independent and the effective sample size is smaller than the row count

8. Future Improvements

  1. Advanced models: XGBoost, LightGBM, CatBoost with extensive hyperparameter tuning
  2. Feature selection: Use SHAP or permutation importance to reduce feature set
  3. Stacked ensemble: Train meta-learner on base model predictions
  4. Time-series validation: More rigorous temporal validation strategy
  5. External data: Incorporate additional macro-economic indicators
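The stacked ensemble in item 3 could be prototyped along these lines: out-of-fold predictions from each base model become the inputs to a simple meta-learner. This is a sketch under assumptions, with synthetic data, a Ridge meta-learner chosen arbitrarily, and property-grouped folds carried over from the cross-validation strategy.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_predict

rng = np.random.default_rng(4)
groups = np.repeat(np.arange(40), 5)  # stand-in for property IDs
X = rng.normal(size=(groups.size, 4))
y = 0.2 * X[:, 0] + 0.1 * np.abs(X[:, 1]) + rng.normal(scale=0.05, size=groups.size)

base_models = {
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
    "gb": GradientBoostingRegressor(random_state=0),
}

cv = GroupKFold(n_splits=5)
# Each column holds one base model's out-of-fold predictions for every row.
oof = np.column_stack([
    cross_val_predict(model, X, y, groups=groups, cv=cv)
    for model in base_models.values()
])

# The meta-learner sees only predictions made on rows the base model did not train on.
meta = Ridge(alpha=1.0).fit(oof, y)
print(dict(zip(base_models, meta.coef_.round(3))))
```

To score new data, each base model would be refit on all training rows and its predictions passed through `meta.predict`.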

What's next for faaris submission BroadVail

Develop useful applications for commercial sectors!
