GitHub repo: https://github.com/cye131/OfficeAllyPatientMatch
Introduction
And what is in an algorithm? Machine learning has been touted as a panacea for human imperfection; however, we are quickly learning that the one thing artificial intelligence is no match for is, in fact, our own intelligence.
For instance, you and I could most certainly identify a tank if we saw one. But what about AI? As the story goes, when the Pentagon tried to train a neural network (NN) to detect tanks, the photos it used for training were all taken in the same location. Consequently, the NN's "skill" at detecting tanks was really its adeptness at recognizing certain trees, shadows, and clouds. In other settings, the detection software failed completely.
This is not to say that AI is powerless; it is not. Using tools like lasso regression, we can wrangle enormous datasets and let machines make judgment calls for us, doing in minutes what would take any of us thousands of years to accomplish.
Consequently, our goal in this project is to combine our own qualitative improvements to the dataset and matching process with the power of algorithms like Levenshtein distance and Jaccard similarity.
This intentionality also extends to the name we chose for our project: "Let's Match." The word "Match" represents the task at hand, the job we have asked our program to do, while the contraction "Let's" represents human intuition, because we recognize that, like many problems, patient record redundancy is one we are better equipped to tackle when we do it together.
What it does!
We use logistic regression with coefficients chosen through elastic net regularization. While the specifics are detailed in our GitHub repo, what makes our project different from most other solutions is that we went beyond simply applying a variety of common text-difference algorithms (e.g., Levenshtein distance, Jaccard similarity) and also created and implemented our own qualitatively derived penalty functions.
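Our implementation itself is an R package, but the two text-difference measures named above are easy to illustrate. The following Python sketch (with hypothetical function names, not our actual code) shows how a Levenshtein edit distance and a character-bigram Jaccard similarity can be computed for a pair of name strings before being fed, as features, into the elastic-net-regularized logistic regression:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity of the two strings' character n-gram sets."""
    grams = lambda s: {s[i:i + n] for i in range(max(len(s) - n + 1, 0))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0


# Two feature values for one candidate record pair:
features = [levenshtein("Sara", "Sarah"), jaccard("Sara", "Sarah")]
```

In R, ready-made equivalents exist (for example in the stringdist package); the point is simply that each feature quantifies a different notion of how far apart two strings are.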
The result? A powerful patient matching script that learns not only from data but also draws on human intuition to make decisions appropriate to the context of public and private health care practices. We asked ourselves to step into the shoes of the end users and consider which kinds of clerical errors are plausible and which are not. Our goal was neither to underfit nor to overfit; we wanted a solution as precise as it was practical. We do not want a model that identifies 100% of the individuals in the test set, because in reality that is only possible through overfitting, which can lead to inefficient and, at worst, dangerous assumptions being made about sensitive patient records.
What's next for Let's Match!
We have created an R package and submitted it to the CRAN repository for future use by anyone who might be interested. Once it is accepted by CRAN and published, our solution will be available for implementation in an elegant, easy-to-use format.
As a team, we also came up with some suggestions for the Office Ally team: adjustments they might make to future patient-matching programs, and other information available in a patient's records that they may want to use:
Originally, we planned on incorporating addresses into our solution using the Google Maps API, both to correct incorrectly entered addresses and to measure the distance between addresses; however, because the test dataset uses nonexistent addresses for HIPAA compliance reasons, we did not see this as a useful way to spend our time. Additionally, it could have resulted in a script that performed better on the actual internal dataset but worse on the provided test set.
We wanted to look at characteristics we believed are more "fixed" for a patient and tended to weight those slightly more heavily because we considered them more reliable. For instance, it makes intuitive sense that a birthday will not change, aside from clerical errors. We chose not to use gender because some individuals may be questioning their gender identity or may have transitioned. One measure we suggested implementing was collecting information on a patient's emergency contacts or relatives. While an address or a maiden name might change, a patient's parents, siblings, and emergency contacts are much less likely to. Furthermore, such a complete and more fixed contextualization of the patient makes for far more reliable identification without burdening already busy doctors' offices, as this information is typically collected at most visits.
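To make the weighting idea concrete, here is a minimal Python sketch with hand-picked, hypothetical weights and field names (our real coefficients were chosen by elastic-net regularization, not set by hand): "fixed" fields such as date of birth count more toward a match than mutable ones such as address, and missing values are treated as neutral rather than as disagreements.

```python
# Hypothetical weights for illustration only: fixed fields (DOB, last
# name) count more than mutable ones (address).
FIELD_WEIGHTS = {"dob": 3.0, "last_name": 2.0, "first_name": 1.5, "address": 0.5}


def weighted_match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted fraction of agreeing fields, ignoring missing values."""
    total = score = 0.0
    for field, w in FIELD_WEIGHTS.items():
        va, vb = rec_a.get(field), rec_b.get(field)
        if va is None or vb is None:
            continue  # missing data is neutral, not a mismatch
        total += w
        if va.strip().lower() == vb.strip().lower():
            score += w
    return score / total if total else 0.0
```

With these weights, two records that agree on everything but address still score above 0.9, reflecting the intuition that a moved patient is far likelier than a coincidence of name and birthday.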
We noticed that the training set used N/As to create edge cases in which patients with otherwise identical identifiers should be classified as different individuals; even so, we did not think it prudent for our script to weigh N/As too heavily. In real life, even a human would believe these entries were identical, and machines should not be the only ones making judgment calls. A lack of information should not automatically cause a script to classify someone, like the two Sara Fields at the very end of the test dataset, as different individuals when all the information we have on these two women suggests they are the same. A good identification system should err on the side of caution and, in our opinion, prompt the end user with a message alerting them to a possible redundancy. In this case, we believe it is not prudent to assume the two records are different, because doing so automatically does not make sense in a real-life context.
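The "prompt the end user" behavior described above can be sketched as a simple decision rule (Python, with hypothetical threshold values; our actual script's cutoffs were learned from the data, not hand-set):

```python
def classify_pair(match_score: float, n_missing_fields: int,
                  match_threshold: float = 0.8) -> str:
    """Err on the side of caution: a high score that rests on missing
    data is flagged for human review, not silently merged or split."""
    if match_score >= match_threshold:
        # Like the two Sara Fields: everything we can see agrees, but
        # several fields are N/A, so a human should make the final call.
        return "review" if n_missing_fields >= 2 else "match"
    return "distinct"
```

The key design choice is the three-way output: rather than forcing every pair into "same" or "different," ambiguous cases are surfaced to the busy office staff who have the context to decide.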
Accomplishments that we are proud of
We had a lot of fun and learned a lot, together!

