Inspiration

Every year, nearly 1 in 5 Medicare patients is readmitted to the hospital within 30 days of discharge, costing the US healthcare system over $26 billion annually. What struck me most wasn't the cost but the human side: many of these readmissions are preventable. Patients go home, struggle with complex medication regimens and inadequate follow-up care, and end up back in the emergency room days later.

What it does

Our tool supports, but does not replace, clinical decision making: it gives clinicians a data-driven second opinion. The system takes 43 clinical and administrative features available at the time of discharge (number of prior inpatient visits, medication count, lab procedures, primary diagnosis, length of stay, and so on) and outputs a probability that the patient will be readmitted within 30 days. A patient is flagged as high risk if this probability exceeds our optimized threshold of 0.426. For high-risk patients, the tool recommends a specific course of action: scheduling a 7-day follow-up, reviewing medication adherence, and engaging care coordination resources.
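The flagging step above can be sketched in a few lines. The 0.426 threshold is the one from our precision-recall analysis; the function name and return shape here are illustrative, not the production interface.

```python
# Sketch of the discharge-time risk flagging step. The 0.426 threshold
# comes from the precision-recall analysis described later; the function
# and its return format are illustrative placeholders.

HIGH_RISK_THRESHOLD = 0.426

def flag_patient(risk_probability: float,
                 threshold: float = HIGH_RISK_THRESHOLD) -> dict:
    """Turn a model probability into a flag plus recommended actions."""
    high_risk = risk_probability >= threshold
    actions = []
    if high_risk:
        actions = [
            "schedule 7-day follow-up",
            "review medication adherence",
            "engage care coordination resources",
        ]
    return {"high_risk": high_risk, "actions": actions}

print(flag_patient(0.51))  # flagged: probability exceeds 0.426
print(flag_patient(0.30))  # not flagged: no actions recommended
```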

How we built it

Data: We used the Diabetes 130-US Hospitals dataset from the UCI Machine Learning Repository: 101,766 real hospital encounters from 130 US hospitals collected between 1999 and 2008. The dataset contains 50 features, including patient demographics, diagnoses, medications, and procedures.

Preprocessing: We made several clinically informed decisions during cleaning. We removed 2,423 patients who were discharged to hospice or had died, since they cannot be readmitted and including them would distort the model. We dropped features with over 40% missing data, mapped hundreds of ICD-9 diagnosis codes into 9 broad clinical categories, and applied stratified splitting to preserve the natural class imbalance in both training and test sets.

Modeling: We trained two models: Logistic Regression as an interpretable baseline, and a tuned Random Forest as our advanced model. Both used class weighting to handle the severe imbalance (only 11.2% of patients were readmitted within 30 days). We used precision-recall curve analysis to find optimal classification thresholds rather than defaulting to 0.5.

Explainability: We extracted feature importances from the Random Forest and coefficient directions from the Logistic Regression to understand not just what the model predicts but why, making the results meaningful to clinicians and not just data scientists.
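The modeling and threshold-selection steps can be sketched roughly as follows. This uses synthetic data with approximately our 11.2% positive rate as a stand-in for the real 43-feature discharge data, and picks the threshold that maximizes F1 along the precision-recall curve; our actual tuning procedure may differ in detail.

```python
# Minimal sketch of the modeling steps: class-weighted models on an
# imbalanced dataset, with the decision threshold chosen from the
# precision-recall curve instead of the default 0.5. The data here is
# synthetic; the real pipeline used 43 discharge-time features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the 11.2% positive rate we observed.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.888], random_state=42)

# Stratified split preserves the class imbalance in train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Both models use balanced class weights to counter the imbalance.
logreg = LogisticRegression(class_weight="balanced",
                            max_iter=1000).fit(X_tr, y_tr)
forest = RandomForestClassifier(class_weight="balanced",
                                n_estimators=200,
                                random_state=42).fit(X_tr, y_tr)

# Choose the threshold maximizing F1 along the precision-recall curve.
probs = forest.predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[f1[:-1].argmax()]
print(f"chosen threshold: {best_threshold:.3f}")
```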

Challenges we ran into

Class imbalance was our biggest technical challenge. With only 11.2% positive cases, a naive model could achieve 88.8% accuracy by simply predicting "not readmitted" for every patient — while being completely useless clinically. We addressed this through balanced class weighting, threshold optimization, and evaluating models on recall and ROC-AUC rather than accuracy.
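The accuracy trap above is easy to demonstrate in miniature: with 11.2% positives, a model that always predicts "not readmitted" scores roughly 88.8% accuracy yet catches zero readmissions.

```python
# Demonstrating why accuracy misleads on imbalanced data: a constant
# "never readmitted" predictor looks accurate but has zero recall.
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.112).astype(int)  # 1 = readmitted
y_naive = np.zeros_like(y_true)                    # always predict "no"

accuracy = (y_naive == y_true).mean()
recall = y_naive[y_true == 1].mean()  # fraction of readmissions caught

print(f"accuracy: {accuracy:.3f}")  # ~0.888, looks respectable
print(f"recall:   {recall:.3f}")    # 0.000, clinically useless
```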

Accomplishments that we're proud of

Our Random Forest model achieves a ROC-AUC of 0.664 and correctly identifies 51% of patients who will actually be readmitted within 30 days — 4.5 times better than random chance on a severely imbalanced dataset.

What we learned

We learned that accuracy is a misleading metric for imbalanced problems. A model that looks 89% accurate can be clinically worthless. ROC-AUC and recall tell a much more honest story.

What's next for Predicting 30-Day Hospital Readmission Using Machine Learning

External validation: Our model was trained on a single population of diabetic patients from 1999–2008. The next critical step is validating performance on a more recent and broader dataset.

Fairness analysis: Healthcare data carries historical demographic biases. We want to audit model performance across racial, age, and gender subgroups to ensure it doesn't perform significantly worse for any population.

Stronger models: We'd like to explore gradient boosting methods like XGBoost and LightGBM, which often outperform Random Forest on tabular healthcare data, as well as SHAP values for more granular patient-level explainability.

Clinical collaboration: The most important next step is getting this in front of actual clinicians (hospitalists, discharge planners, and care coordinators) to validate whether the model's outputs are actionable in a real workflow, and to refine the interface based on their feedback.

Real-time EHR integration: Ultimately, the goal is a system that pulls data directly from electronic health records at the time of discharge rather than requiring manual data entry.
