What it does

Predicts RevPAR growth for rental properties and compares several models to help explain how renters' decisions shifted after COVID

How it was designed

Development followed a mostly waterfall approach in five sequential steps: Data Understanding and EDA, Data Preprocessing and Feature Engineering, Model Development and Training, Model Interpretation and Insights, and Final Predictions

Methodology Summary

BroadVail Finance Track - 2026 Rice Datathon

1. Data Understanding

Dataset Overview

Dataset: Property-level panel data with ~1,000 unique properties across multiple time periods and trade area definitions
Total Observations: Training ~7,000 rows, Scoring ~9,000 rows
Target Variable: RevPAR Growth (percentage change in Revenue per Available Unit)
Time Periods:
- Pre-COVID: 2015-2020 (growth rate)
- Post-COVID: 2022-2025 (growth rate)
Trade Areas: 10, 15, and 30-minute drivetime isochrones
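
To make the target concrete, here is a minimal sketch of a growth rate computed as a simple percentage change over a window. The function name and the example RevPAR values are hypothetical, and the dataset may define the window endpoints differently.

```python
def revpar_growth(revpar_start: float, revpar_end: float) -> float:
    """Fractional change in RevPAR between the start and end of a window."""
    if revpar_start <= 0:
        raise ValueError("revpar_start must be positive")
    return (revpar_end - revpar_start) / revpar_start


# Hypothetical property whose RevPAR rises from 1,000 to 1,212 over the window.
growth = revpar_growth(1000.0, 1212.0)
print(f"{growth:.1%}")  # 21.2%
```

A negative value means RevPAR fell over the window.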

Data Structure

Each row represents a property-time_window-trade_area combination with:

Property characteristics (age, size, type, renovation status)
Location features (MSA ring, normalized distance to CBD)
Trade area amenities (grocery, food, services)
AARP Livability scores (7 domain scores + 50+ metrics)
Supply and competition data
Own-vs-rent economics

2. Feature Engineering

Engineered 146 features across these categories:

Category	# Features	Key Examples
Property	15	age, years_since_renov, total_sqft, size_category
Location	12	msa_ring_numeric, is_suburban, is_urban_core
Drivetime	6	drv10_flag, drv15_flag, drv30_flag, drivetime_area
Amenities	20	grocery_access, food_density, amenity_density
Food/Dining	18	upscale_dining_ratio, fast_food_ratio, dining_variety
AARP Scores	50+	All 7 domain scores + individual metrics
Supply	8	competition_intensity, high_supply_growth
Own-vs-Rent	7	rent_vs_mortgage_ratio, rent_is_cheaper
Time Interactions	12	post_x_prox, post_x_trans, post_x_broadband

Feature Engineering Approach

Property Features: Age calculations, renovation indicators, size categorization
Composite Scores: Created weighted AARP renter score, walkability proxy, remote work score
Ratio Features: Normalized amenity counts (upscale_dining_ratio, chain_food_ratio)
Accessibility Features: Inverse distance features (food_accessibility)
Interaction Features: COVID period interactions to capture preference shifts
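
The ratio, accessibility, and interaction features above can be sketched roughly as follows. This is an illustration, not the project's actual code: the output names come from the feature table, but the input column names (such as upscale_dining_count and nearest_food_miles), the 1 / (1 + distance) form, and the reading of "prox" as an AARP proximity score are all assumptions.

```python
def engineer_features(row: dict) -> dict:
    """Add ratio, inverse-distance, and COVID-interaction features to one row."""
    out = dict(row)

    # Ratio features: normalize raw counts by the total number of food venues.
    total_food = (
        row["upscale_dining_count"]
        + row["fast_food_count"]
        + row["other_food_count"]
    )
    out["upscale_dining_ratio"] = row["upscale_dining_count"] / total_food if total_food else 0.0
    out["fast_food_ratio"] = row["fast_food_count"] / total_food if total_food else 0.0

    # Accessibility feature: inverse distance, bounded in (0, 1].
    out["food_accessibility"] = 1.0 / (1.0 + row["nearest_food_miles"])

    # Interaction feature: the score only contributes in the post-COVID window.
    out["post_x_prox"] = row["is_post_covid"] * row["aarp_proximity_score"]
    return out


# Hypothetical input row.
example = {
    "upscale_dining_count": 5,
    "fast_food_count": 15,
    "other_food_count": 30,
    "nearest_food_miles": 0.25,
    "is_post_covid": 1,
    "aarp_proximity_score": 60.0,
}
features = engineer_features(example)
```

Multiplying by the is_post_covid flag lets a model learn a separate post-COVID effect for the same underlying score.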

3. Modeling Approach

Models Evaluated

Linear Models: Ridge, Lasso, ElasticNet
Tree Ensembles: Random Forest, Gradient Boosting
Specialized Models: Separate Pre/Post COVID models
Combined Approach: Model ensemble

Cross-Validation Strategy

Method: 5-fold GroupKFold
Grouping: By property ID (UBID) to prevent data leakage
Rationale: The same property appears in multiple rows (different time windows and trade areas), so grouping keeps all of a property's rows in one fold and the model is never validated on a property it saw during training
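
The grouping idea can be illustrated in plain Python. The sketch below is not scikit-learn's GroupKFold (which the project names) but a small stand-in that follows the same principle: assign whole groups to folds, then check that no property ID appears on both sides of a split. The UBID values are hypothetical.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, val_idx) pairs in which no group spans both sides."""
    rows_by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        rows_by_group[group].append(idx)
    if len(rows_by_group) < n_splits:
        raise ValueError("need at least as many groups as folds")

    # Greedy balancing: place the largest groups first, always into the
    # fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_splits)]
    by_size = sorted(rows_by_group, key=lambda g: len(rows_by_group[g]), reverse=True)
    for group in by_size:
        smallest = min(range(n_splits), key=lambda i: len(folds[i]))
        folds[smallest].extend(rows_by_group[group])

    for i, val_rows in enumerate(folds):
        train_rows = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train_rows), sorted(val_rows)


# Hypothetical UBIDs: each property contributes several rows.
ubids = ["P1", "P1", "P1", "P2", "P2", "P3", "P3", "P4", "P5", "P5"]
splits = list(group_k_fold(ubids, n_splits=5))
for train_rows, val_rows in splits:
    train_ids = {ubids[i] for i in train_rows}
    val_ids = {ubids[i] for i in val_rows}
    assert train_ids.isdisjoint(val_ids)  # no property leaks across the split
```

A plain row-level K-fold would scatter a property's rows across folds, which tends to overstate validation scores.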

Hyperparameter Tuning

Initial models used reasonable defaults. More extensive tuning could be done with:

Random search or Bayesian optimization (Optuna)
Nested cross-validation for unbiased performance estimates

Final Model Selection

Ensemble (Gradient Boosting + Random Forest)

Averages predictions from both models
Leverages diversity between models for more robust predictions
Both models showed similar R² (~0.63), so neither clearly dominates; averaging is intended to benefit from any differences in their errors
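
A minimal sketch of the averaging step, assuming equal weights for the two models (the writeup says predictions are averaged but does not state weights). The prediction values are made up for illustration.

```python
def average_ensemble(*model_predictions):
    """Element-wise mean of several equally long prediction lists."""
    if not model_predictions:
        raise ValueError("at least one set of predictions is required")
    n_rows = len(model_predictions[0])
    if any(len(preds) != n_rows for preds in model_predictions):
        raise ValueError("all models must predict the same number of rows")
    return [sum(values) / len(values) for values in zip(*model_predictions)]


# Hypothetical per-row growth predictions from the two base models.
gb_preds = [0.10, -0.04, 0.25]
rf_preds = [0.14, -0.02, 0.21]
ensemble_preds = average_ensemble(gb_preds, rf_preds)
```

Each ensemble value lies between the two base predictions, which is one reason averaged predictions tend to be more conservative.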

4. Validation Performance

Model	Validation RMSE	Validation R²
Post-COVID Only	0.122	0.238
Random Forest	0.133	0.631
Gradient Boosting	0.134	0.628
Final Ensemble	~0.133	~0.63
Linear Regression	0.149	0.537
Baseline (Mean)	0.219	-0.000

Notes on Model Selection

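The "Post-COVID Only" model shows the lowest RMSE but also the lowest R² of the fitted models because it was evaluated only on post-COVID data. Growth varies less within that subset, so absolute errors are smaller, but R² is measured against the subset's own smaller variance and the model gets no credit for explaining the pre/post gap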
Random Forest and Gradient Boosting show best generalization (highest R²)
Ensemble provides robust, conservative predictions
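
The RMSE and R² columns can be cross-checked against each other. Since R² = 1 - MSE / Var(y), the target's standard deviation on the evaluation set is implied by RMSE / sqrt(1 - R²). The sketch below applies that identity to the table's figures; the function name is illustrative.

```python
import math


def implied_target_std(rmse: float, r2: float) -> float:
    """Solve R^2 = 1 - RMSE^2 / Var(y) for the standard deviation of y."""
    if r2 >= 1.0:
        raise ValueError("r2 must be below 1")
    return rmse / math.sqrt(1.0 - r2)


post_covid_std = implied_target_std(0.122, 0.238)  # Post-COVID Only row
pooled_std = implied_target_std(0.133, 0.631)      # Random Forest row

print(f"{post_covid_std:.3f}")  # 0.140
print(f"{pooled_std:.3f}")      # 0.219
```

The pooled figure matches the mean baseline's RMSE of 0.219, as expected, while the post-COVID subset has a noticeably smaller spread. That is why its RMSE and R² are not directly comparable with the other rows.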

5. Key Findings

Top Drivers of RevPAR Outperformance

post_x_prox: Properties in areas with high proximity scores saw greater benefit post-COVID
MARKET_target_encoded: Market-level historical performance strongly predicts future performance
is_post_covid: Significant systematic difference between pre and post-COVID growth patterns
post_x_remote_score: Remote work friendliness became more valuable post-COVID
ownrent_avg_mortgage/rent: Local housing economics affect property performance
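
The writeup does not say how MARKET_target_encoded was computed. One common approach is a smoothed mean encoding fitted on training rows only, sketched below. The smoothing value, function names, and market labels are assumptions for illustration.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to its target mean, shrunk toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + smoothing * global_mean) / (counts[category] + smoothing)
        for category in counts
    }
    return encoding, global_mean


def apply_target_encoding(categories, encoding, global_mean):
    """Encode categories, falling back to the global mean for unseen ones."""
    return [encoding.get(category, global_mean) for category in categories]


# Hypothetical training rows: market label and RevPAR growth.
train_markets = ["Houston", "Houston", "Dallas"]
train_growth = [0.10, 0.20, -0.05]
encoding, global_mean = fit_target_encoding(train_markets, train_growth, smoothing=1.0)
encoded = apply_target_encoding(["Houston", "Austin"], encoding, global_mean)
```

To avoid leaking the target during cross-validation, the encoding would need to be fitted inside each training fold rather than on the full dataset.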

Pre-COVID vs Post-COVID Shift

Features that became MORE important post-COVID:

Hospital accessibility (aarp_met_health_hospital)
Property type/class (type_main_encoded, type_sub_encoded)
Property size (numunits, log_numunits)

Features that became LESS important post-COVID:

Market-level effects (MARKET_target_encoded)
Housing burden metrics (aarp_met_house_burden)
State-level effects (state_encoded)

Interpretation

The shift suggests that post-COVID:

Health infrastructure access became more valued by renters
Property-specific characteristics matter more than market-level trends
Renter choices appear to have shifted from being market-driven to being property-driven

Drivetime Analysis

Best trade area: 10-minute drivetime (drv10)
This suggests renters primarily care about amenities within a 10-minute drive
Larger trade areas (15, 30 min) may include too much noise

Geographic Patterns

MSA Ring 4 (outer suburbs): Highest growth
MSA Ring 1 (downtown): Lowest growth
Consistent with suburban migration trend during/after COVID

6. Prediction Summary

Scoring Dataset Predictions

Total predictions: 8,997
Pre-COVID predictions: 4,485 (mean: 21.2%)
Post-COVID predictions: 4,512 (mean: -5.1%)
Overall range: [-24.7%, 77.4%]

Distribution Characteristics

Predictions are conservative (narrower range than training targets)
Mean prediction (8.0%) aligns well with training mean (8.2%)
Pre-COVID predictions significantly more optimistic than post-COVID
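
The reported figures can be checked for internal consistency: the overall mean should equal the count-weighted mean of the two period means. A short sketch using the numbers above (the helper name is illustrative):

```python
def pooled_mean(groups):
    """Count-weighted mean of (count, mean) pairs."""
    total_count = sum(count for count, _ in groups)
    weighted_sum = sum(count * mean for count, mean in groups)
    return total_count, weighted_sum / total_count


# (number of predictions, mean predicted growth in percent)
pre_covid = (4485, 21.2)
post_covid = (4512, -5.1)

total_count, overall_mean = pooled_mean([pre_covid, post_covid])
print(total_count)             # 8997
print(round(overall_mean, 1))  # 8.0
```

Both values agree with the totals reported in the prediction summary.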

7. Limitations

Limited hyperparameter tuning: Default/light tuning used due to time constraints
No external validation: All performance metrics are cross-validation based
Assumption of stationarity: Model assumes feature relationships are stable within time periods
Trade area overlap: Each property appears under multiple trade area definitions, so rows are not independent and the effective sample size is smaller than the row count suggests

8. Future Improvements

Advanced models: XGBoost, LightGBM, CatBoost with extensive hyperparameter tuning
Feature selection: Use SHAP or permutation importance to reduce feature set
Stacked ensemble: Train meta-learner on base model predictions
Time-series validation: More rigorous temporal validation strategy
External data: Incorporate additional macro-economic indicators

What's next for faaris submission BroadVail

Develop useful applications for commercial sectors!

Built With

claude
opus
python
vsc

Updates

faarisrice-creator khan started this project — Jan 25, 2026 06:12 AM EST
