We were in awe of the wealth of open data published by governments and of the power of modern analytics, so we took our first steps into exploratory analysis and database management for a machine learning project. Heart disease is the most common cause of death in the US, ahead of even cancer, so we wanted to find out how a community's characteristics influence its obesity rate.

What it does

We store the data in a local database and draw from it. A support vector machine (SVM) is trained on a labelled subset of the data where the obesity rates are known, so that it learns to classify records as obese or non-obese. The trained SVM is then applied to a separate, unseen portion of the data to predict obesity there.
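The train-then-predict flow described above can be sketched roughly as follows. This is a minimal illustration, not our actual project code: the data is synthetic, and the number of features, the labelling rule, the 75/25 split and the `C` value are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 200 hypothetical communities, 4 hypothetical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Hypothetical binary label: 1 = "obese" class, 0 = "non-obese" class.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200) > 0).astype(int)

# Hold out a quarter of the labelled data so the model is scored on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit a linear SVM on the training subset only.
model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
model.fit(X_train, y_train)

# Predict the class of the unseen rows and measure accuracy against the known labels.
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Scaling is wrapped in the same pipeline as the SVM so that the scaler is fitted on the training subset alone and no information from the held-out rows leaks into training.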

How we built it

We used MySQL to create a local database to hold the dataset. For the ML model we used a linear SVM.
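To give a feel for the "store locally, then draw from it" step, here is a small sketch. Note that it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, purely so that it runs without a database server; the table name `community_health`, its columns and the rows are all made up for illustration and are not our real schema or data.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for the local MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE community_health ("
    " county TEXT, fast_food_per_1000 REAL, pct_inactive REAL, obesity_rate REAL)"
)

# Made-up example rows; the real dataset is not reproduced here.
rows = [
    ("A", 0.6, 20.1, 24.5),
    ("B", 0.9, 28.4, 33.2),
    ("C", 0.7, 25.0, 29.8),
]
conn.executemany("INSERT INTO community_health VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Pull the feature columns and the target column back out for modelling.
cursor = conn.execute(
    "SELECT fast_food_per_1000, pct_inactive, obesity_rate "
    "FROM community_health ORDER BY county"
)
records = cursor.fetchall()
features = [list(r[:2]) for r in records]  # model inputs
rates = [r[2] for r in records]            # quantity to predict
conn.close()

print(features)
print(rates)
```

With a real MySQL server the SQL would look much the same; only the connection setup and the parameter placeholder style would differ, depending on the driver.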

Challenges I ran into

The algorithm draws a hard boundary between obese and non-obese, even though a person's weight is a dynamic, continuously varying quantity. The hardest part was figuring out how to approach the dataset in the first place and choosing a quantity to predict from it; it took us a long time before we could narrow it down. I also found the data wrangling difficult: although the data was extremely clean, it was hard to pinpoint which variables to focus on and which to ignore.

Accomplishments that I'm proud of

The Machine Learning model!

What I learned

The principal techniques of data wrangling and pre-processing, and how to use scikit-learn, which I tried here for the first time.

What's next for SVM Obesity Predictions

Since we built this over a single weekend, there are plenty of improvements that could be made. We could use more extensive K-fold cross-validation to get a more reliable estimate of the model's accuracy, and add a third class for overweight, since the model currently assumes a hard threshold between obese and non-obese. Principal component analysis (PCA) could reduce the number of variables we use to predict obesity. Hyper-parameter optimization could also be used to tune the model's parameters.
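As a rough idea of how K-fold cross-validation, PCA and hyper-parameter optimization could fit together in scikit-learn, here is a hedged sketch. None of this is implemented in the project yet; the synthetic data, the 10 hypothetical features, the 5 folds and the grid of candidate values are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 200 rows, 10 hypothetical features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=200) > 0).astype(int)

# Scale, reduce dimensionality with PCA, then classify with a linear SVM.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("svm", LinearSVC(max_iter=10000)),
])

# Candidate settings to try (illustrative values, not tuned for any real dataset).
param_grid = {
    "pca__n_components": [2, 5, 10],
    "svm__C": [0.01, 0.1, 1.0, 10.0],
}

# 5-fold stratified cross-validation scores every combination in the grid.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.2f}")
```

Because the best score is picked from many candidates, it tends to be slightly optimistic, so a separate held-out test set would still be needed for a final accuracy figure.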
