As two Columbia students interested in the intersection of healthcare and data science, we wanted to solve health-related problems using an extensive dataset of health surveys. Because the dataset covers such a wide range of information, we chose to look specifically at exercise and nutrition and at the physical conditions in the dataset related to them. We hypothesized that individuals' nutrition and exercise habits would play a large role in their cardiovascular health and blood sugar.

What it does

We made a dashboard to help a health professional visualize an individual’s nutritional and exercise habits and make recommendations about them, in order to reduce risk factors (such as high cholesterol and high blood pressure) for diabetes and cardiovascular issues. The dashboard includes an overview of the patient (demographic and personal information relevant to our topic, such as smoking habits, age, race, medications, and diseases) as well as a graph showing where the individual’s exercise and nutrition habits fall relative to both healthy and unhealthy people in the same demographic group. Based on that position, the individual’s exercise and nutrition habits would be analyzed, and recommendations would show up in the corner of the dashboard to aid the doctor with diagnosis and advice.
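As a rough illustration of the "position relative to healthy and unhealthy others" idea, here is a minimal sketch. It assumes a habit can be summarized as one number and that position is measured between the two group means. The function names, the 0.5 threshold, and the sample values are all hypothetical and are not taken from our dashboard.

```python
from statistics import mean


def relative_position(value, healthy_values, unhealthy_values):
    """Return where `value` falls between the two group means.

    0.0 means at the unhealthy-group mean, 1.0 at the healthy-group mean.
    Values outside [0, 1] lie beyond one of the two means.
    """
    healthy_mean = mean(healthy_values)
    unhealthy_mean = mean(unhealthy_values)
    span = healthy_mean - unhealthy_mean
    if span == 0:
        raise ValueError("group means are identical; position is undefined")
    return (value - unhealthy_mean) / span


def recommend(position, habit_name, threshold=0.5):
    """Turn a position into a short note (threshold is an assumed cutoff)."""
    if position < threshold:
        return f"Consider improving {habit_name}: closer to the unhealthy group."
    return f"{habit_name} is closer to the healthy group."


# Made-up weekly exercise minutes for one demographic group.
healthy = [150, 180, 200, 170]
unhealthy = [30, 60, 45, 65]
patient_minutes = 75

pos = relative_position(patient_minutes, healthy, unhealthy)
note = recommend(pos, "weekly exercise")
print(round(pos, 2), note)
```

With these made-up numbers the patient sits much nearer the unhealthy-group mean, so the sketch suggests improving that habit. A real dashboard would likely use several habits and a more careful comparison than group means.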

How we built it

First, we went through each survey question and made a list of the columns and questions we wanted to explore, then pulled that information from the provided text file into a CSV of consolidated, relevant data. We chose data from the sections covering diet, exercise, blood pressure, diabetes, cholesterol, and cardiovascular disease. Our first attempts, using k-means clustering, multinomial naive Bayes, and linear regression, returned extremely low accuracy (R^2) values on the data we had pulled, which led us to focus on a smaller group of data for our decision tree classifier.
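The consolidation step can be sketched roughly as follows. This assumes the source text is tab-delimited with a header row, which may not match the actual file, and the column names are hypothetical placeholders rather than the real survey fields.

```python
import csv
import io

# Hypothetical column names; the real survey fields are different.
SELECTED_COLUMNS = ["age", "exercise_minutes", "fruit_servings", "diabetes"]


def consolidate(raw_text, selected, delimiter="\t"):
    """Keep only the selected columns and return the result as CSV text."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=selected,
        extrasaction="ignore",  # silently drop columns we did not select
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Tiny made-up sample standing in for the provided text file.
raw = (
    "age\tincome\texercise_minutes\tfruit_servings\tdiabetes\n"
    "52\t40000\t30\t1\tyes\n"
    "34\t55000\t150\t3\tno\n"
)
consolidated = consolidate(raw, SELECTED_COLUMNS)
print(consolidated)
```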

Challenges we ran into

Our initial methods (k-means clustering, multinomial naive Bayes, and linear regression) had low accuracy and high error because we tried to answer too diverse a set of questions with each model, so we had to narrow down the dataset and specialize in the area we wanted to focus on.

Accomplishments that we're proud of

After getting very low accuracy from several machine learning models, we finally ended up with a decision tree classifier that relates patients' diabetes levels to groups with similar nutrition and lifestyle habits.
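To show what a decision tree classifier does with this kind of data, here is a small from-scratch sketch that picks splits by Gini impurity. It is not our actual model: the two features (weekly exercise minutes, daily sugary drinks), the three labels, and every data point are invented for illustration, and in practice a library implementation would be the more sensible choice.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, max_depth=3):
    """Build a tree as nested (feature, threshold, left, right) tuples."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    )


def predict(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Invented data: [weekly exercise minutes, sugary drinks per day].
rows = [[200, 0], [180, 1], [150, 0], [90, 2], [80, 2],
        [60, 3], [30, 4], [20, 5], [10, 4]]
labels = ["none"] * 3 + ["prediabetes"] * 3 + ["diabetes"] * 3

tree = build_tree(rows, labels)
print(predict(tree, [170, 1]), predict(tree, [70, 2]), predict(tree, [25, 4]))
```

On this toy data the tree separates the three groups with two thresholds on exercise minutes. Real survey data would be far noisier and would need a held-out test set to estimate accuracy.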

What we learned

We learned that cleaning the data is crucial, especially when the dataset contains many attributes that aren't useful for a meaningful analysis.
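One simple cleaning rule of this kind is to drop attributes that most respondents left blank. The sketch below assumes rows are dictionaries and that missing answers are empty strings or None. The 50% cutoff and the field names are hypothetical, not the rule we actually applied.

```python
def drop_sparse_columns(rows, max_missing_fraction=0.5, missing_values=("", None)):
    """Remove columns whose share of missing answers exceeds the cutoff."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    keep = []
    for col in columns:
        missing = sum(1 for r in rows if r.get(col) in missing_values)
        if missing / len(rows) <= max_missing_fraction:
            keep.append(col)
    return [{c: r.get(c) for c in keep} for r in rows]


# Made-up survey rows: "pet_name" is mostly blank, so it gets dropped.
survey = [
    {"exercise_minutes": "30", "pet_name": ""},
    {"exercise_minutes": "150", "pet_name": ""},
    {"exercise_minutes": "", "pet_name": "Rex"},
]
cleaned = drop_sparse_columns(survey)
print(cleaned)
```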

What's next for Diabetes Risk and Recommendation Model

Implementing a web application for the dashboard and the recommendation model.
