Inspiration
Every year, unplanned hospital readmissions within 30 days cost the U.S. healthcare system over $26 billion, and the frustrating part is that many of them are preventable. When a patient is discharged, clinicians have to make a judgment call about who needs a follow-up call, who needs home care, and who is actually fine.
What it does
We built an end-to-end machine learning pipeline that ingests anonymized Electronic Health Record (EHR) data across 18 linked tables and engineers clinically meaningful features from raw encounter, diagnosis, and medication records. It trains and evaluates two models, Logistic Regression and Random Forest, explains every prediction using SHAP (SHapley Additive exPlanations), audits the model for demographic fairness across race, gender, and age, and delivers everything through a live, browser-based interactive dashboard. If hospitals can identify high-risk patients before discharge, they can intervene by scheduling follow-up appointments, arranging home care, or reconciling medications, and potentially prevent the readmission entirely.
How we built it
We used Synthea, an open-source synthetic EHR generator used by the NIH and major academic medical centers, which produces realistic healthcare data including diagnosis codes (SNOMED CT), encounter histories, medications, and patient demographics. Our dataset included 1,163 patients and 61,459 encounters distributed across 18 linked CSV tables, which we had to merge and structure into a unified analytical dataset.
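The merge step above can be sketched with pandas. This is a minimal illustration using two toy frames with Synthea-style column names (`Id`, `PATIENT`, `START`, `ENCOUNTERCLASS`); the actual pipeline joins all 18 tables, and the exact columns are assumptions here, not a dump of our code.

```python
import pandas as pd

# Toy stand-ins for Synthea's patients.csv and encounters.csv (column
# names follow Synthea's convention but are illustrative here)
patients = pd.DataFrame({
    "Id": ["p1", "p2"],
    "BIRTHDATE": ["1950-01-01", "1980-06-15"],
    "GENDER": ["F", "M"],
})
encounters = pd.DataFrame({
    "Id": ["e1", "e2", "e3"],
    "PATIENT": ["p1", "p1", "p2"],
    "START": ["2020-01-01", "2020-02-10", "2020-03-05"],
    "ENCOUNTERCLASS": ["inpatient", "inpatient", "ambulatory"],
})

# Join encounters onto patient demographics; the PATIENT foreign key in
# encounters.csv references Id in patients.csv
df = encounters.merge(
    patients, left_on="PATIENT", right_on="Id", suffixes=("_enc", "_pat")
)
```

Repeating this pattern table by table (conditions, medications, claims, and so on) yields the unified analytical dataset.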
The most critical early step was defining the target variable, since a 30-day readmission label did not exist in the raw data. We constructed this manually by ordering each patient’s inpatient encounters chronologically and identifying whether a subsequent admission occurred within 30 days of discharge. This required careful handling of missing time gaps, duplicate encounters, and edge cases where patients had only a single hospitalization or incomplete follow-up data.
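The label-construction logic reads naturally as a groupby-and-shift over inpatient stays. Below is a minimal sketch of the idea with hypothetical column names (`PATIENT`, `START`, `STOP`); patients with a single hospitalization naturally fall out as negatives because they have no next admission.

```python
import pandas as pd

# Toy inpatient encounters: p1 has two stays 15 days apart, p2 has one
inpatient = pd.DataFrame({
    "PATIENT": ["p1", "p1", "p2"],
    "START": pd.to_datetime(["2020-01-01", "2020-01-20", "2020-03-01"]),
    "STOP":  pd.to_datetime(["2020-01-05", "2020-01-25", "2020-03-04"]),
})

# Order each patient's stays chronologically
inpatient = inpatient.sort_values(["PATIENT", "START"]).reset_index(drop=True)

# Start date of the same patient's next admission (NaT for the last stay)
next_start = inpatient.groupby("PATIENT")["START"].shift(-1)

# Gap from discharge to next admission; NaN gaps (no follow-up) become 0
gap_days = (next_start - inpatient["STOP"]).dt.days
inpatient["readmit_30d"] = ((gap_days >= 0) & (gap_days <= 30)).astype(int)
```

Here p1's first stay is labeled 1 (readmitted 15 days after discharge), while p1's last stay and p2's only stay are labeled 0.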
Once the dataset was structured, we performed feature engineering to translate raw medical records into clinically meaningful predictors. We created demographic features such as age, gender, race, and age groups, along with clinical history indicators like diabetes, hypertension, multimorbidity, and total number of conditions. We also engineered healthcare utilization features such as number of prior inpatient visits, medication counts, and length of stay, as well as time-based variables like days since last visit. To capture severity of illness, we included total claim cost (the comprehensive, final expenditure of an insurance claim) as a proxy for clinical complexity. Missing values were handled using median imputation for numerical variables and domain-informed defaults for structural cases, such as assigning a large constant value for “days since last visit” when no prior encounter existed.
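The two imputation strategies described above can be sketched in a few lines. Column names and the 9999-day sentinel are illustrative assumptions, not the exact values from our pipeline.

```python
import numpy as np
import pandas as pd

# Toy feature table; NaN in days_since_last_visit means "no prior encounter"
features = pd.DataFrame({
    "age": [72, 55, 81],
    "n_prior_inpatient": [3, 0, 5],
    "days_since_last_visit": [12.0, np.nan, 45.0],
    "total_claim_cost": [18000.0, np.nan, 52000.0],
})

# Domain-informed default: a structurally missing gap gets a large
# sentinel, since "never visited before" is not an ordinary missing value
features["days_since_last_visit"] = features["days_since_last_visit"].fillna(9999)

# Ordinary numeric gaps get median imputation
features["total_claim_cost"] = features["total_claim_cost"].fillna(
    features["total_claim_cost"].median()
)
```

Keeping the two cases separate matters: median-imputing "days since last visit" for a first-time patient would falsely make them look like a recent returner.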
Challenges we ran into
The readmission label did not exist in any of the 18 source files, so we had to define it entirely from scratch by chronologically ordering each patient's hospital visits and checking for a subsequent inpatient admission within 30 days of discharge. In addition, class imbalance was severe, with readmissions representing a small fraction of all cases. As a result, early models defaulted to predicting "no readmission" for nearly every patient while still achieving deceptively high accuracy. To address this, we shifted to ROC-AUC as our primary evaluation metric and introduced class weighting to penalize missed readmissions more heavily.
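The imbalance fix boils down to two scikit-learn knobs: `class_weight="balanced"` on the model and `roc_auc_score` for evaluation. A minimal sketch on synthetic imbalanced data (roughly 10% positives, standing in for the real EHR features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with ~10% positives, mimicking the rarity of readmissions
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" reweights the loss inversely to class frequency,
# so missing a rare readmission costs more than a false alarm
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# ROC-AUC scores ranked probabilities, so it is not fooled by a model
# that labels everyone "no readmission"
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Unlike accuracy, an always-negative classifier scores an AUC of 0.5 here, which is exactly the failure mode we needed the metric to expose.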
What's next for Clinical Risk
Rather than analyzing static synthetic EHR data from Synthea, a future system could continuously update risk scores as new patient information becomes available during a hospital stay, enabling earlier and more dynamic intervention decisions. Models trained on synthetic or single-source datasets often struggle when deployed at different hospitals due to variations in coding practices, patient populations, and care pathways, so future work should also focus on external validation across multiple institutions using real, publicly available patient data.