What it does
Provides predictions and analysis, tested across multiple models, to help explain renter decisions after COVID.
How it was designed
A mostly waterfall-style process in five steps: Data Understanding and EDA, Data Preprocessing and Feature Engineering, Model Development and Training, Model Interpretation and Insights, and Final Predictions.
Methodology Summary
BroadVail Finance Track - 2026 Rice Datathon
1. Data Understanding
Dataset Overview
- Dataset: Property-level panel data with ~1,000 unique properties across multiple time periods and trade area definitions
- Total Observations: Training ~7,000 rows, Scoring ~9,000 rows
- Target Variable: RevPAR Growth (percentage change in Revenue per Available Unit)
- Time Periods:
  - Pre-COVID: 2015-2020 (growth rate)
  - Post-COVID: 2022-2025 (growth rate)
- Trade Areas: 10, 15, and 30-minute drivetime isochrones
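As a minimal sketch of the target definition, the function below computes a percentage change between two RevPAR values. It assumes the growth rate is the simple change between the start and end of a window (not annualized), which the write-up does not state; the function name and the sample values are hypothetical.

```python
def revpar_growth(revpar_start: float, revpar_end: float) -> float:
    """Fractional change in RevPAR between the start and end of a window."""
    if revpar_start <= 0:
        raise ValueError("revpar_start must be positive")
    return (revpar_end - revpar_start) / revpar_start


# Hypothetical property: RevPAR rises from 1,000 to 1,212 over the window.
growth = revpar_growth(1000.0, 1212.0)
print(f"{growth:.1%}")  # prints 21.2%
```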
Data Structure
Each row represents a property-time_window-trade_area combination with:
- Property characteristics (age, size, type, renovation status)
- Location features (MSA ring, normalized distance to CBD)
- Trade area amenities (grocery, food, services)
- AARP Livability scores (7 domain scores + 50+ metrics)
- Supply and competition data
- Own-vs-rent economics
2. Feature Engineering
Engineered 146 features across these categories:
| Category | # Features | Key Examples |
|---|---|---|
| Property | 15 | age, years_since_renov, total_sqft, size_category |
| Location | 12 | msa_ring_numeric, is_suburban, is_urban_core |
| Drivetime | 6 | drv10_flag, drv15_flag, drv30_flag, drivetime_area |
| Amenities | 20 | grocery_access, food_density, amenity_density |
| Food/Dining | 18 | upscale_dining_ratio, fast_food_ratio, dining_variety |
| AARP Scores | 50+ | All 7 domain scores + individual metrics |
| Supply | 8 | competition_intensity, high_supply_growth |
| Own-vs-Rent | 7 | rent_vs_mortgage_ratio, rent_is_cheaper |
| Time Interactions | 12 | post_x_prox, post_x_trans, post_x_broadband |
Feature Engineering Approach
- Property Features: Age calculations, renovation indicators, size categorization
- Composite Scores: Created weighted AARP renter score, walkability proxy, remote work score
- Ratio Features: Normalized amenity counts (upscale_dining_ratio, chain_food_ratio)
- Accessibility Features: Inverse distance features (food_accessibility)
- Interaction Features: COVID period interactions to capture preference shifts
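The interaction, ratio, and inverse-distance features described above might be built along these lines. This is an illustrative sketch on toy data, not the project's actual code: the input column names (prox_score, upscale_dining_count, total_dining_count, nearest_food_miles) and the exact formulas are assumptions; only the output names post_x_prox, upscale_dining_ratio, and food_accessibility come from the write-up.

```python
import numpy as np
import pandas as pd

# Toy rows; the input column names here are hypothetical.
df = pd.DataFrame({
    "is_post_covid": [0, 1, 0, 1],
    "prox_score": [40.0, 40.0, 70.0, 70.0],
    "upscale_dining_count": [2, 2, 0, 0],
    "total_dining_count": [10, 10, 0, 0],
    "nearest_food_miles": [0.5, 0.5, 3.0, 3.0],
})

# Interaction feature: nonzero only in the post-COVID window, so a model can
# learn a separate post-COVID effect for the proximity score.
df["post_x_prox"] = df["is_post_covid"] * df["prox_score"]

# Ratio feature: share of dining that is upscale; set to 0 when there is no dining.
total = df["total_dining_count"].replace(0, np.nan)
df["upscale_dining_ratio"] = (df["upscale_dining_count"] / total).fillna(0.0)

# Inverse-distance accessibility: closer food options give a larger value.
df["food_accessibility"] = 1.0 / (1.0 + df["nearest_food_miles"])

print(df[["post_x_prox", "upscale_dining_ratio", "food_accessibility"]])
```

Multiplying by the period flag keeps the pre-COVID rows at zero, which is what lets tree and linear models attribute a period-specific effect to the underlying score.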
3. Modeling Approach
Models Evaluated
- Linear Models: Ridge, Lasso, ElasticNet
- Tree Ensembles: Random Forest, Gradient Boosting
- Specialized Models: Separate Pre/Post COVID models
- Combined Approach: Model ensemble
Cross-Validation Strategy
- Method: 5-fold GroupKFold
- Grouping: By property ID (UBID) to prevent data leakage
- Rationale: Same property appears in multiple rows (different time windows, trade areas), so grouping prevents information leakage
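A small sketch of the grouped split, using scikit-learn's GroupKFold on synthetic data. The integer group labels stand in for UBID, and the row counts are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_properties, rows_per_property = 20, 6  # e.g. 2 time windows x 3 trade areas
groups = np.repeat(np.arange(n_properties), rows_per_property)  # stand-in for UBID
X = rng.normal(size=(groups.size, 4))
y = rng.normal(size=groups.size)

cv = GroupKFold(n_splits=5)
folds = list(cv.split(X, y, groups=groups))

for fold, (train_idx, val_idx) in enumerate(folds):
    # Properties that appear on both sides of the split (should be none).
    shared = set(groups[train_idx]) & set(groups[val_idx])
    print(fold, len(train_idx), len(val_idx), len(shared))
```

With a plain KFold, rows from the same property could land in both training and validation folds, and the validation score would partly measure memorization of that property.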
Hyperparameter Tuning
Initial models used reasonable defaults. More extensive tuning could be done with:
- Random search or Bayesian optimization (Optuna)
- Nested cross-validation for unbiased performance estimates
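One way the random-search option could look, sketched with scikit-learn's RandomizedSearchCV and the same grouped folds. This was not part of the submitted pipeline; the data is synthetic and the search ranges are arbitrary placeholders.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(30), 4)  # stand-in for property IDs
X = rng.normal(size=(groups.size, 5))
y = X[:, 0] + 0.1 * rng.normal(size=groups.size)

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_distributions={
        "max_depth": randint(2, 10),        # placeholder range
        "min_samples_leaf": randint(1, 10),  # placeholder range
    },
    n_iter=5,
    scoring="neg_root_mean_squared_error",
    cv=GroupKFold(n_splits=5),
    random_state=0,
)
# Passing groups keeps each property inside a single fold during the search.
search.fit(X, y, groups=groups)
print(search.best_params_, -search.best_score_)
```

The best score from a search like this is optimistically biased because the same folds pick the parameters and report the score, which is the motivation for the nested cross-validation mentioned above.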
Final Model Selection
Ensemble (Gradient Boosting + Random Forest)
- Averages predictions from both models
- Leverages diversity between models for more robust predictions
- Both models showed similar R² (~0.63), so neither clearly dominates; averaging hedges against the individual errors of either model
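The averaging step can be sketched as follows, on synthetic data with default scikit-learn hyperparameters (an assumption; the submitted models' settings are not listed here).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = X_train[:, 0] - 0.5 * X_train[:, 1] + 0.1 * rng.normal(size=200)
X_score = rng.normal(size=(50, 5))

gb = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

pred_gb = gb.predict(X_score)
pred_rf = rf.predict(X_score)

# Equal-weight average of the two models' predictions.
pred_ensemble = (pred_gb + pred_rf) / 2.0

# A correlation below 1 between the two models is what leaves room for
# averaging to cancel some of their individual errors.
print(np.corrcoef(pred_gb, pred_rf)[0, 1])
```

Because each averaged prediction lies between the two model outputs, the ensemble can never be more extreme than its members, which is consistent with the narrower prediction range reported later.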
4. Validation Performance
| Model | Validation RMSE | Validation R² |
|---|---|---|
| Post-COVID Only | 0.122 | 0.238 |
| Random Forest | 0.133 | 0.631 |
| Gradient Boosting | 0.134 | 0.628 |
| Final Ensemble | ~0.133 | ~0.63 |
| Linear Regression | 0.149 | 0.537 |
| Baseline (Mean) | 0.219 | -0.000 |
Notes on Model Selection
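- The "Post-COVID Only" model shows the lowest RMSE but also the lowest R² because it was evaluated only on post-COVID data, where the target varies less (the reported RMSE and R² imply a standard deviation of roughly 0.14, versus 0.219 for the pooled data); its metrics are therefore not directly comparable with the pooled models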
- Random Forest and Gradient Boosting show best generalization (highest R²)
- Ensemble provides robust, conservative predictions
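To illustrate how a subset evaluation can score a much lower R² than a pooled one at a similar RMSE, here is a sketch on synthetic data. The period means and noise level are made up, and the "model" only predicts the period mean. It also shows why a mean baseline lands at R² of about zero; in cross-validation the baseline mean comes from the training folds, which is one plausible reason for a value like -0.000.

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


rng = np.random.default_rng(0)
# Synthetic targets: two periods with different means, same within-period spread.
is_post = np.repeat([0, 1], 500)
period_mean = np.where(is_post == 1, -0.05, 0.21)
y = period_mean + 0.12 * rng.normal(size=1000)

# A predictor that knows only the period mean.
pred = period_mean
post = is_post == 1

rmse_pooled, r2_pooled = rmse(y, pred), r2(y, pred)
rmse_post, r2_post = rmse(y[post], pred[post]), r2(y[post], pred[post])
r2_baseline = r2(y, np.full_like(y, y.mean()))

print("pooled:   ", rmse_pooled, r2_pooled)
print("post only:", rmse_post, r2_post)
print("baseline: ", r2_baseline)
```

In the pooled evaluation the predictor gets credit for explaining the gap between periods; inside a single period that gap is absent, so the same errors yield a much lower R².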
5. Key Findings
Top Drivers of RevPAR Outperformance
- post_x_prox: Properties in areas with high proximity scores saw greater benefit post-COVID
- MARKET_target_encoded: Market-level historical performance strongly predicts future performance
- is_post_covid: Significant systematic difference between pre and post-COVID growth patterns
- post_x_remote_score: Remote work friendliness became more valuable post-COVID
- ownrent_avg_mortgage/rent: Local housing economics affect property performance
Pre-COVID vs Post-COVID Shift
Features that became MORE important post-COVID:
- Hospital accessibility (aarp_met_health_hospital)
- Property type/class (type_main_encoded, type_sub_encoded)
- Property size (numunits, log_numunits)
Features that became LESS important post-COVID:
- Market-level effects (MARKET_target_encoded)
- Housing burden metrics (aarp_met_house_burden)
- State-level effects (state_encoded)
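One way to measure such a shift is to fit separate models per period and compare feature importances. The sketch below assumes that approach, which may differ from how the comparison above was actually produced; the data is synthetic and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["hospital_access", "market_effect", "noise"]  # hypothetical names
X = pd.DataFrame(rng.normal(size=(600, 3)), columns=features)
is_post = np.repeat([0, 1], 300)

# Synthetic target: market effect drives pre-COVID growth, hospital access post-COVID.
signal = np.where(is_post == 1, X["hospital_access"], X["market_effect"])
y = signal + 0.1 * rng.normal(size=600)


def importances(mask):
    """Impurity-based importances from a model fit on one period only."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[mask], y[mask])
    return pd.Series(model.feature_importances_, index=features)


# Positive values: more important post-COVID; negative: less important.
shift = importances(is_post == 1) - importances(is_post == 0)
print(shift.sort_values(ascending=False))
```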
Interpretation
The shift suggests that post-COVID:
- Health infrastructure access became more valued by renters
- Property-specific characteristics matter more than market-level trends
- Location preferences shifted from market-driven to property-driven decisions
Drivetime Analysis
- Best trade area: 10-minute drivetime (drv10)
- This suggests renters primarily care about amenities within a 10-minute drive
- Larger trade areas (15, 30 min) may include too much noise
Geographic Patterns
- MSA Ring 4 (outer suburbs): Highest growth
- MSA Ring 1 (downtown): Lowest growth
- Consistent with suburban migration trend during/after COVID
6. Prediction Summary
Scoring Dataset Predictions
- Total predictions: 8,997
- Pre-COVID predictions: 4,485 (mean: 21.2%)
- Post-COVID predictions: 4,512 (mean: -5.1%)
- Overall range: [-24.7%, 77.4%]
Distribution Characteristics
- Predictions are conservative (narrower range than training targets)
- Mean prediction (8.0%) aligns well with training mean (8.2%)
- Pre-COVID predictions significantly more optimistic than post-COVID
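The overall mean can be checked from the two period means and their counts. Since the reported period means are rounded to one decimal, the result is approximate.

```python
n_pre, mean_pre = 4485, 21.2    # pre-COVID predictions, mean in percent
n_post, mean_post = 4512, -5.1  # post-COVID predictions, mean in percent

total = n_pre + n_post
overall_mean = (n_pre * mean_pre + n_post * mean_post) / total

print(total, round(overall_mean, 1))  # prints 8997 8.0
```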
7. Limitations
- Limited hyperparameter tuning: Default/light tuning used due to time constraints
- No external validation: All performance metrics are cross-validation based
- Assumption of stationarity: Model assumes feature relationships are stable within time periods
- Trade area overlap: Properties appear under multiple trade area definitions, so rows are not independent and the effective sample size is smaller than the row count suggests
8. Future Improvements
- Advanced models: XGBoost, LightGBM, CatBoost with extensive hyperparameter tuning
- Feature selection: Use SHAP or permutation importance to reduce feature set
- Stacked ensemble: Train meta-learner on base model predictions
- Time-series validation: More rigorous temporal validation strategy
- External data: Incorporate additional macro-economic indicators
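As a sketch of the feature-selection idea, permutation importance on a held-out, property-grouped split could rank features like this. The data is synthetic (only feature 0 carries signal) and the split settings are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(50), 4)  # stand-in for property IDs
X = rng.normal(size=(groups.size, 6))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=groups.size)  # only feature 0 matters

# Hold out whole properties so importance is measured on unseen properties.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups))

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[train_idx], y[train_idx])

# Importance = drop in validation score when one feature's values are shuffled.
result = permutation_importance(
    model, X[val_idx], y[val_idx], n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean[ranking])
```

Features whose mean importance is indistinguishable from zero on held-out data are candidates for removal from the 146-feature set.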
What's next for faaris submission BroadVail
Develop useful applications for commercial sectors!
Built With
- claude
- opus
- python
- vsc