Housing and Equity: Understanding the Post-COVID Real Estate Regime Shift

Inspiration

The COVID-19 pandemic permanently changed what makes a property valuable. We wanted to measure that shift quantitatively by asking simple but important questions. How did investor priorities change? Which neighborhood amenities matter now compared to before?

Our goal was to build a machine learning system that could not only predict property performance, but also surface the deeper structural changes in how neighborhoods are evaluated.

What It Does

Our system predicts hotel RevPAR (Revenue Per Available Room) growth using two separate models. One is trained on pre-COVID data from 2015 to 2020, and the other on post-COVID data from 2022 to 2025. By comparing feature importance between these models, we highlight how the pandemic reshaped real estate investment strategies.

Before COVID, markets rewarded relative positioning and amenity diversity. After COVID, performance depends far more on absolute property quality and health-focused environments.
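
As a rough illustration of the comparison described above, the sketch below normalizes two sets of feature importances and ranks features by how much their share changed between regimes. The feature names and numbers are invented for illustration and are not taken from our models; `normalize` and `importance_shift` are hypothetical helper names.

```python
def normalize(importances):
    """Scale raw importances so they sum to 1 and are comparable across models."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    return {name: value / total for name, value in importances.items()}


def importance_shift(pre, post):
    """Return (feature, post_share - pre_share) pairs, largest absolute shift first."""
    pre_n, post_n = normalize(pre), normalize(post)
    features = set(pre_n) | set(post_n)
    # A feature missing from one model (e.g. dropped by selection) counts as 0 there.
    shifts = {f: post_n.get(f, 0.0) - pre_n.get(f, 0.0) for f in features}
    return sorted(shifts.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical importances, for illustration only.
pre_covid = {"amenity_diversity": 30.0, "relative_rank": 50.0, "park_access": 20.0}
post_covid = {"amenity_diversity": 10.0, "property_quality": 60.0, "park_access": 30.0}

for feature, delta in importance_shift(pre_covid, post_covid):
    print(f"{feature:20s} {delta:+.2f}")
```

Normalizing first matters because the two models are trained on different feature sets, so raw importance values are not on a shared scale.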

How We Built It

Technology Stack

Python 3.14 for core development
XGBoost, LightGBM, Random Forest in an ensemble voting setup
Pandas and NumPy for data manipulation
Optuna for hyperparameter tuning
Scikit-learn for training and validation
Matplotlib and Seaborn for visualization

Architecture

We built two independent pipelines, one for each time period. Each pipeline trains an ensemble of three models to reduce overfitting and improve generalization. The system processes three drive-time datasets with 10-, 15-, and 30-minute radii, engineers more than 1,000 features, and applies VIF-based multicollinearity removal to prune redundant features before training.

Feature Engineering Pipeline

Starting with 115 raw columns across three spatial scales, we created:

588 multi-scale amenity features by merging drive-time datasets
440 composite features, including locality indices that compare 10-minute versus 30-minute access, diversity metrics such as Shannon entropy and Gini coefficients, and economic indicators such as mortgage-to-rent spreads
A VIF-selected final set of 62 features for the pre-COVID model and 217 for the post-COVID model
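
To make the composite features more concrete, here is a minimal sketch of the three kinds of metric named above. The entropy and Gini formulas are the standard ones; the locality index is written as a simple 10-minute to 30-minute ratio, which is an assumption about its form, and all function names are hypothetical.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in nats) of amenity category counts; 0 means a single category."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


def gini_coefficient(values):
    """Gini coefficient of non-negative values: 0 is perfectly even, near 1 is concentrated."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on the sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1) / n


def locality_index(count_10min, count_30min):
    """Share of the 30-minute amenity count already reachable within 10 minutes (assumed form)."""
    return count_10min / count_30min if count_30min else 0.0


# Hypothetical amenity counts per category within one drive-time radius.
counts = [5, 5, 5, 5]
print(shannon_entropy(counts), gini_coefficient(counts), locality_index(4, 20))
```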

Training Infrastructure

We implemented a checkpoint system that saves progress at each major stage such as data loading, feature engineering, feature selection, and model training. This cut iteration time in half.
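
A stage-level checkpoint can be sketched in a few lines. The version below is a simplified illustration, not our actual framework: it assumes each stage's output can be pickled, and `run_stage`, the stage names, and the toy stage functions are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def run_stage(name, compute, checkpoint_dir, force=False):
    """Load a stage's cached result if present; otherwise compute and save it."""
    path = Path(checkpoint_dir) / f"{name}.pkl"
    if path.exists() and not force:
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # records which stages actually ran, to show the cache working


def load_data():
    calls.append("load")
    return [1, 2, 3]


def engineer(rows):
    calls.append("engineer")
    return [r * 10 for r in rows]


with tempfile.TemporaryDirectory() as ckpt:
    for _ in range(2):  # the second pass resumes from checkpoints
        rows = run_stage("data_loading", load_data, ckpt)
        feats = run_stage("feature_engineering", lambda: engineer(rows), ckpt)
    # After changing feature engineering, force only that stage to rerun.
    feats = run_stage("feature_engineering", lambda: engineer(rows), ckpt, force=True)

print(calls)
```

One caveat with this simple design: when an upstream stage is rerun, every downstream checkpoint is stale and should be rerun too.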

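Hyperparameter optimization runs 75 Optuna trials, each scored with five-fold GroupKFold cross-validation. Rows are grouped by UBID, the property identifier, so the same property never appears in both the training and validation folds, which prevents data leakage.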
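
The grouping behavior can be checked with a small sketch using scikit-learn's `GroupKFold`. The toy panel below (10 made-up properties with 4 rows each) is an assumption for illustration; in a tuning loop, each trial would score its candidate parameters on folds like these.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy panel: 10 hypothetical properties (UBIDs), 4 yearly rows each.
ubid = np.repeat(np.arange(10), 4)
X = np.arange(len(ubid), dtype=float).reshape(-1, 1)
y = np.zeros(len(ubid))

cv = GroupKFold(n_splits=5)
overlaps = []
for train_idx, val_idx in cv.split(X, y, groups=ubid):
    # Every row of a given UBID lands on one side of the split.
    shared = set(ubid[train_idx]) & set(ubid[val_idx])
    overlaps.append(len(shared))

print(overlaps)
```

A plain `KFold` on the same rows would routinely place one property's different years on both sides of a split, letting the model memorize property-level effects.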

Challenges We Ran Into

1. Training Time Bottleneck

Early training runs took 2 hours per model. With 75 hyperparameter trials and five folds, each tuning run meant 75 × 5 = 375 model fits.

We solved this by adding a checkpoint system that lets us restart from any stage. When only feature selection needed changes, for example, we could resume from that stage instead of reprocessing everything.

2. Feature Engineering Complexity

Turning multi-scale spatial data into meaningful predictors was harder than expected. Our first version produced more than 1,500 features, many of them highly correlated.

We built a VIF analysis pipeline and correlation filters to prune redundancy. The key trade-off was between richness and interpretability. Too many features turned the model into a black box, while too few weakened predictive power.
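
For readers unfamiliar with VIF pruning, the sketch below shows one common way to do it with NumPy alone: compute each column's variance inflation factor, drop the worst offender, and repeat. The threshold of 10 is a widely used rule of thumb and an assumption here, as are the function names; it is not a description of our exact pipeline.

```python
import numpy as np


def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2) from regressing it on the rest."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        ss_tot = float(((target - target.mean()) ** 2).sum())
        if ss_tot == 0.0:
            out[j] = np.inf  # a constant column is collinear with the intercept
            continue
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - float(resid @ resid) / ss_tot
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out


def prune_by_vif(X, names, threshold=10.0):
    """Repeatedly drop the column with the highest VIF until all are <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names


# Synthetic example: c is nearly a linear combination of a and b.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + b + rng.normal(scale=0.01, size=300)
X_pruned, kept = prune_by_vif(np.column_stack([a, b, c]), ["a", "b", "c"])
print(kept)
```

Dropping one column at a time and recomputing is slower than a single pass, but it avoids discarding several features that were only redundant with each other.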

3. Data Leakage Prevention

With temporal data, leakage is easy to introduce accidentally. We carefully checked that:

No RevPAR or growth variables appeared as features
Same-period quartile rankings were excluded
Historical quartiles were allowed as legitimate predictors
UBID grouping prevented the same property from appearing in both training and validation sets

We also added automated scans for forbidden feature patterns that halt training if anything suspicious appears.
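
An automated scan of this kind can be as simple as a few regular expressions checked before training starts. The patterns and column names below are hypothetical, since the real naming scheme is not given in this write-up; the point is only the shape of the check.

```python
import re

# Hypothetical naming patterns; the real column names are not given in this write-up.
FORBIDDEN_PATTERNS = [
    re.compile(r"revpar", re.IGNORECASE),
    re.compile(r"growth", re.IGNORECASE),
    re.compile(r"current_.*quartile", re.IGNORECASE),
]
# Historical quartiles are legitimate predictors, so they are explicitly allowed.
ALLOWED_PATTERNS = [re.compile(r"^hist_.*quartile", re.IGNORECASE)]


class LeakageError(ValueError):
    """Raised when a feature name matches a forbidden pattern."""


def assert_no_leakage(feature_names):
    """Halt (raise) if any feature looks like a target or same-period ranking."""
    flagged = [
        name
        for name in feature_names
        if not any(p.search(name) for p in ALLOWED_PATTERNS)
        and any(p.search(name) for p in FORBIDDEN_PATTERNS)
    ]
    if flagged:
        raise LeakageError(f"possible leakage in features: {flagged}")


assert_no_leakage(["parks_10min", "hist_revpar_quartile_2019"])  # passes silently
try:
    assert_no_leakage(["parks_10min", "RevPAR_growth_2024"])
except LeakageError as err:
    print(err)
```

Name-based checks only catch leakage that shows up in column names, so they complement the grouped cross-validation rather than replace it.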

Accomplishments We Are Proud Of

A dual-model system that exposes regime shifts through feature importance comparisons
A checkpoint framework that dramatically sped up iteration
Over 1,000 engineered features distilled down to 62 (pre-COVID) and 217 (post-COVID) high-value predictors
Strong data leakage safeguards with automated validation
R² of 0.42 pre-COVID and 0.24 post-COVID, a gap we attribute to genuine market volatility rather than overfitting
A clear narrative showing the shift from relative positioning to absolute quality

What We Learned

Building a production-ready ML pipeline is about infrastructure as much as accuracy. The checkpoint system, validation checks, and training orchestration mattered just as much as the models themselves.

We also learned that lower R² does not always mean a worse model. Post-COVID markets are inherently more volatile, and our results reflect that uncertainty honestly.

What Is Next

Temporal feature engineering with seasonal and trend effects
An explainable AI dashboard to explore shifting drivers interactively
Transfer learning that initializes post-COVID training with pre-COVID weights
Causal inference methods to move beyond correlation
Real-time deployment through an API for live valuation

Built by: Femi, Ashwin, Didi, and Kene
Challenge: BroadVail Capital, Housing and Equity

Built With

lightgbm
matplotlib
numpy
optuna
pandas
python
randomforest
scikit-learn
seaborn
xgboost

Submitted to

Rice Datathon 2026

Created by

I worked on the data wrangling and cleaning. I also spent time working on the checkpoint system and implemented the parallelization. I also helped with the model architecture iterations.

ashwinrao1 Rao
I worked on data analysis, insight generation and interpretation, presentation design, and the overall communication of our findings.

deinbojack Jack
I worked on feature engineering and on model interpretability.

Private user
I proposed and implemented the ML techniques used in training our model (XGBoost, Random Forest, and LightGBM).

kene2x

Updates

ashwinrao1 Rao started this project — Jan 25, 2026 05:04 AM EST
