Inspiration

I was inspired by the potential of machine learning to surface real-world social information and help narrow the income divide. The U.S. Census Income Dataset poses a particular challenge: forecasting whether someone earns above or below $50K based on demographic and work-related variables. I wanted to see how much performance we could push out of a classification model without compromising interpretability and social utility. Curiosity about how machine learning handles categorical complexity and socioeconomic data drove this project from start to finish.

What it does

IncomeIQ classifies whether an individual earns more or less than $50,000 annually based on factors such as age, education level, work hours, and country of origin. The final model uses the XGBoost algorithm to process the data and predict income levels with high accuracy. IncomeIQ can be used by policymakers, economists, or social research agencies to gain insight into income distribution patterns across the U.S.

How I built it

The project began with loading and exploring the U.S. Census Income Dataset, which contains records for approximately 48,000 individuals. Exploratory data analysis used bar plots, histograms, heatmaps, and boxplots to understand variable distributions and detect outliers. Data cleaning involved dropping rows with missing values, then applying both label encoding and one-hot encoding to the categorical features. Numeric values were scaled with a min-max scaler. I also created two new features, capital-total and hours-per-day, to capture additional signal.

For modeling, I used an 80/20 train-test split and tried a range of models, including Linear Regression, Random Forest, and Neural Networks. XGBoost ultimately won out because it strikes a balance between speed, accuracy, and stability. I used DMatrix to reduce memory usage, cross-validation to avoid overfitting, and early stopping on log loss to determine the optimal number of boosting rounds. When GridSearchCV proved incompatible with this workflow, I tuned the model manually through iterative testing.

Challenges I ran into

The greatest challenge was handling the categorical variables correctly, especially given the size of the dataset; encoding them without hitting memory bottlenecks took multiple iterations. Hyperparameter tuning with XGBoost was also difficult: it did not play well with Scikit-learn's standard grid search, so I had to get creative with early stopping and cross-validation. Finally, Linear Regression and the Neural Networks struggled to capture the non-linear relationships in the data, which is where boosting algorithms pulled ahead.

Accomplishments that I'm proud of

I am particularly proud of achieving 84% accuracy on a challenging, high-dimensional dataset with a mix of categorical and numerical features. I also built a well-generalized model that performs strongly without overfitting. From feature engineering to optimization, every stage of the project taught me how professional-grade machine learning projects are designed and polished.

What I learned

I learned how to preprocess large datasets more efficiently, especially when combining different encoding methods. I gained a clearer understanding of why boosting models like XGBoost outperform other algorithms on complex tabular data. I also learned about tools like DMatrix, cross-validation methods, and metrics beyond simple accuracy, such as log loss and AUC.
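As a quick illustration of looking beyond accuracy, log loss and AUC can be computed with scikit-learn on a handful of hand-made predictions (the numbers are made up for the example):

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

# True labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.9, 0.35, 0.2])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 threshold

acc = accuracy_score(y_true, y_pred)   # fraction of correct hard labels
ll = log_loss(y_true, y_prob)          # penalizes confident wrong probs
auc = roc_auc_score(y_true, y_prob)    # ranking quality, threshold-free
```

Accuracy only sees the thresholded labels, while log loss and AUC use the probabilities themselves, so two models with identical accuracy can differ sharply on the latter two.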

What's next for IncomeIQ

Next, I would like to host IncomeIQ as a web app where users can enter individual-level data and receive income bracket predictions. I would also like to extend the model with other public datasets to enable geographic or industry-based analysis. Finally, I would like to explore SHAP values to make the model explainable and transparent, which is critical for real-world applications involving socioeconomic data.
