Inspiration

Medical practices require patients to make appointments so that they can see patients at an efficient rate. However, many patients do not show up to their appointments. This wastes the practice's time and forces other people to wait longer than necessary to receive treatment. We set out to create a data-driven solution to this problem by building machine learning models that predict how likely a patient is to be a “no show” to their appointment. This information lets medical practices act in anticipation of a likely “no show” to reduce lost time and costs.

What it does

We built and trained a machine learning model that predicts whether or not a patient will be a “no show” based on factors such as distance from the medical practice, weather (including precipitation, snow, and wind), and age. We also built a profile of the kind of patient who is likely to miss an appointment.

How we built it

We used MySQL to store the data and Python with pandas to load and analyze it. First, we cleaned the data, removing several extraneous rows and columns and mapping categorical values to numerical values. After examining the data for patterns through graphing and observation, we trained several supervised classification models in scikit-learn, including Decision Trees, Random Forest, Logistic Regression, Perceptrons, Linear SVC, and KNN. We trained and tested on various amounts and proportions of the data, and finally determined which models performed best and, in general, what type of patient tends not to show up.
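The workflow above can be sketched roughly as follows. This is a minimal illustration rather than our actual code: the data is synthetic, the column names other than IsNoShow (Age, DistanceKm, Weather) are hypothetical, and the model settings are scikit-learn defaults apart from a fixed random seed and a higher max_iter for Logistic Regression.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the appointment table (column names are assumptions).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "Age": rng.integers(0, 90, n),
    "DistanceKm": rng.uniform(0, 50, n).round(1),
    "Weather": rng.choice(["clear", "rain", "snow"], n),
    "IsNoShow": rng.choice(["No", "Yes"], n, p=[0.9, 0.1]),
})

# Map categorical values to numerical values.
df["Weather"] = df["Weather"].map({"clear": 0, "rain": 1, "snow": 2})
df["IsNoShow"] = df["IsNoShow"].map({"No": 0, "Yes": 1})

X = df.drop(columns="IsNoShow")
y = df["IsNoShow"]

# Hold out 25% of the rows for testing, keeping the class ratio the same.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}

# Fit each model and record its accuracy on the held-out rows.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Changing test_size is one simple way to try different training and testing proportions.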

Challenges we ran into

Early challenges included data points with impossible values for an attribute, such as the number of days between creating an appointment and attending it reaching the tens of thousands. The most daunting challenge, however, was that the data was heavily imbalanced: ~90 percent of people showed up and only ~10 percent were “no shows.” This caused an unfortunate behavior in most of the classifiers: many would always predict “No” for IsNoShow, which yields a high accuracy but is of little interest from a learning or data science perspective.
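Both problems can be reproduced on a small synthetic example. The sketch below is illustrative only: the DaysUntilAppointment column name and the 0 to 365 day cutoff are assumptions, and scikit-learn's DummyClassifier stands in for a model that has collapsed to always predicting “No.”

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic data: 900 patients who showed up (0) and 100 no-shows (1).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "DaysUntilAppointment": rng.integers(0, 60, 1000),
    "IsNoShow": [0] * 900 + [1] * 100,
})

# Inject a few impossible values like the ones described above.
df.loc[[0, 1, 2], "DaysUntilAppointment"] = [25000, 40000, -3]

# Drop rows outside a plausible range (the 0..365 cutoff is an assumption).
clean = df[df["DaysUntilAppointment"].between(0, 365)]

X = clean[["DaysUntilAppointment"]]
y = clean["IsNoShow"]

# A baseline that always predicts the majority class ("No").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

accuracy = accuracy_score(y, pred)  # high, because most patients show up
recall = recall_score(y, pred)      # fraction of actual no-shows that were caught

print(f"rows kept: {len(clean)}")
print(f"accuracy:  {accuracy:.3f}")
print(f"recall:    {recall:.3f}")
```

The baseline scores about 90 percent accuracy while catching none of the no-shows, which is why a metric such as recall on the “no show” class says more about these models than accuracy alone.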