Introduction

Diabetes is among the most prevalent chronic diseases in the United States, affecting millions of Americans each year and placing a significant financial burden on the economy. It is a serious chronic disease in which individuals lose the ability to effectively regulate blood glucose levels, and it can reduce both quality of life and life expectancy. During digestion, foods are broken down into sugars, which are released into the bloodstream. This signals the pancreas to release insulin, which enables the body's cells to use those sugars for energy. Diabetes is generally characterized by the body either not making enough insulin or being unable to use the insulin it makes as effectively as needed.

In people with diabetes, chronically high levels of sugar in the bloodstream are associated with complications such as heart disease, vision loss, lower-limb amputation, and kidney disease. While there is no cure for diabetes, strategies like losing weight, eating healthily, being active, and receiving medical treatment can mitigate the harms of the disease in many patients. Early diagnosis can lead to lifestyle changes and more effective treatment, making predictive models for diabetes risk important tools for the public and for public health officials.

Dataset

The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey conducted annually by the CDC. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventative services. It has been conducted every year since 1984. For this project, a CSV of the 2015 dataset available on Kaggle was used. The original dataset contains responses from 441,455 individuals and has 330 features. These features are either questions asked directly of participants or variables calculated from individual participant responses.

diabetes_012_health_indicators_BRFSS2015.csv is a cleaned dataset of 253,680 survey responses to the CDC's BRFSS 2015. The target variable Diabetes_012 has 3 classes: 0 is for no diabetes or diabetes only during pregnancy, 1 is for prediabetes, and 2 is for diabetes. The classes in this dataset are imbalanced. The dataset has 21 feature variables.

Data Points:

  • Demographics: Age, Sex, Income, Education

  • Health Status: BMI, General Health, Mental Health days, Physical Health days

  • Health Conditions: High BP, High Cholesterol, Cholesterol Check, Stroke, Heart Disease, Physical Activity, Fruits, Veggies, Heavy Alcohol, Healthcare Coverage, No Doctor Due to Cost, Difficulty Walking, Smoker
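The class imbalance mentioned above can be checked by counting how often each value of Diabetes_012 occurs. The following is a minimal sketch using only the Python standard library. The helper names `load_rows` and `class_distribution` are hypothetical, the sample rows are made up for illustration, and the assumption that the CSV stores classes as values like "0.0" should be verified against the actual file.

```python
import csv
from collections import Counter

TARGET = "Diabetes_012"  # target column named in the dataset description


def load_rows(path):
    """Read a CSV file into a list of dicts, one per survey response."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def class_distribution(rows, target=TARGET):
    """Return {class_label: (count, share)} for the target column."""
    # int(float(...)) accepts both "2" and "2.0" style values (assumed format).
    counts = Counter(int(float(r[target])) for r in rows)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Tiny made-up sample, NOT real BRFSS data; it only shows the output shape.
sample = [{TARGET: "0.0"}] * 8 + [{TARGET: "1.0"}] + [{TARGET: "2.0"}]
dist = class_distribution(sample)
for label, (n, share) in dist.items():
    print(f"class {label}: {n} rows ({share:.0%})")
```

With the real file, `class_distribution(load_rows(path))` would give the per-class counts and shares, which is a useful first look before choosing metrics or resampling strategies for an imbalanced target.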

Key Findings:

  • Age: The diabetes rate increases steadily with age category, rising from ~6% in young adults to over 30% in seniors.

  • BMI: There is a clear positive correlation. The plotted value peaks around BMI 75-80 and spikes again at extreme values. A value above 200% at BMI 75-80 is not a valid proportion, which suggests the plot shows weighted means rather than true rates.

  • General Health: The relationship is dramatic, with a 79% diabetes rate for those reporting "Poor" health vs. only 6% for "Excellent" health. This is the strongest visual correlation.

  • High Blood Pressure: Those with high BP have a ~17% diabetes rate vs. only ~6% for those without, as shown in the stacked distribution.

  • Feature Importance: The model ranked BMI (18.2%), Age (12.3%), and Income (9.8%) as the top predictors. Physical/Mental Health and Health Status round out the top factors.
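Group-wise rates like the ones listed above can be computed as the share of respondents in each group whose Diabetes_012 value is 2. The sketch below is one way to do this with the standard library and is not necessarily how the figures in this project were produced. The function name `rate_by_group` is hypothetical, the column name `HighBP` is an assumption about the CSV header, and the sample rows are made up.

```python
from collections import defaultdict


def rate_by_group(rows, group_col, target="Diabetes_012", positive=2):
    """Return {group_value: share of rows whose target equals `positive`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in rows:
        key = r[group_col]
        totals[key] += 1
        if int(float(r[target])) == positive:
            hits[key] += 1
    # Each value is a proportion, so it always lies between 0 and 1.
    return {k: hits[k] / totals[k] for k in sorted(totals)}


# Made-up rows, NOT real BRFSS data; "HighBP" is an assumed column name.
sample = [
    {"HighBP": "1.0", "Diabetes_012": "2.0"},
    {"HighBP": "1.0", "Diabetes_012": "0.0"},
    {"HighBP": "0.0", "Diabetes_012": "0.0"},
    {"HighBP": "0.0", "Diabetes_012": "0.0"},
    {"HighBP": "0.0", "Diabetes_012": "1.0"},
    {"HighBP": "0.0", "Diabetes_012": "2.0"},
]
rates = rate_by_group(sample, "HighBP")
for group, rate in rates.items():
    print(f"HighBP={group}: {rate:.0%} with diabetes")
```

Because each result is a proportion, it cannot exceed 100%. A plotted value above that range points to a different aggregation, such as a mean or a weighted statistic, rather than a rate.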
