LAHacks Patient Matching

What it does

Our project is a solution to OpenAlly's first Challenge: Patient Matching

How I built it

There is a is_same_person() function which takes in 2 rows and a logging file pointer, and returns a score
The lower the score is, the more likely that the two rows are describing the same person
The exact threshold where it decides that the two rows are not equal comes from testing
most mistakes in data fields (typos) can be measured using the Damerau-Levenshtein Distance algorithm.
- Damerau-Levenshtein Distance is an algorithm which measures string similarity as the minimum number of character substitutions, insertions, removals and adjacent character swaps to transform one string to another.
- I use this algorithm for many of the columns to determine if they are equivalent
- if the distance is below some threshold for each column, then consider the values for that column equivalent
  - for last and first names, the score modifier scales with HOW different the two strings are
  - for example, "Austin" vs. "Marco" would return a high number, as the names are very different
  - if the names are very different, then it becomes increasingly likely that they are different people
  - for more unpredictable, non-name fields, scores are static
- because the data had so many inconsistensies, we had to write a lot of logic for each column to normalize its contents before looking at the similarity
- for example, some of the States entries were postal codes and others were expanded names
- to solve this, I wrote different cases based on the combination of rows
  - if only one row is abbreviated, abbreviate the other
  - get the distance between the two; if they are the same, subtract 5 from the score
  - if they are only 1 apart, then add 5 to the score
  - if they are 2 apart (entirely different), add 15 to the score
  - if both were abbreviated, use the same logic as before, except add 7.5 if they are 1 apart
  - there is no conversion happening, so it is more likely that they are actually different
  - if neither is abbreviated, get the distance as is; if the distance > 2, add 15 to the score
  - otherwise, subtract 5
- another clever solution is in handle_names(), the function that takes care of first and last names
- in the data, there are instances where a patient's name is in its short form in one row and in its long form in another.
- while this cannot be known with certainty, it must be accounted for to prevent false positives
- the solution was to see if the difference in length between the two names was more than 2
  - if so, then check if one of the names starts with the other name
  - if it does, then subtract from the score
  - otherwise, if the two names appear to be unique, return a positive score proportionate to how different they are
- there are more things in the code that I found interesting, but I will leave them out so that this does not become an essay
- I made a conscious effort to comment all of my code, so I hope that it is readable

Grouping

I thought the way that I got grouping to work pretty interesting
- a set of indices is maintained to keep track of which rows have unique groups
- each row is given a pointer to a list of sibling rows
- when comparing 2 rows, if the rows are deemed equal, then they are put in the same group
- if neither has a group at that point, then one of the rows makes a new group containing both rows and adds the index of one of the rows to the set
  - the other row's group points to this list
- if one of them has a group but the other does not, then it adds the groupless row to the group and sets its pointer to point to that group
- if both of them have groups, then the smaller group is appended onto the larger group
  - the smaller group is then removed from the set
- the end result is that all of the rows that are equivalent are put in their own grouping

Challenges I ran into

figuring out how best to group the patients
- I tested the algorithm only by comparing adjacent rows
- I did not put a focus on the end goal; I just wanted to get accuracy
- I realized that the runtime complexity was quadratic and that it was going to be tested against 100s of thousands of records
- if I had realized this sooner, I maybe could have changed my logic to incorporate some sort of hashing
- this would mean being able to cut down on the main bottleneck: the is_same_person() function
- hashing would mean less computation in the long run
realizing that using numpy arrays were not going to instantly speed it up
- np arrays are useful because they offer direct access to elements with very little overhead
- however, when the dtype of the array is an object, there is a lot of overhead because of the pointer dereferencing that has to happen to get the object
- arrays are much more beneficial when using primitives

What I learned

I learned a lot about NLP in doing this project
- Learning about the different string similarity algorithms was really interesting for me
I learned about the importance of randomness in algorithmic design
- I already knew that many quicksort implementations randomize the order of the data before processing, but this project made me realize how beneficial that is
- With the default order, running the test() function, I was maxing out at 95% accuracy
- However, when randomizing the order this number approached 100%, even hitting it at one point
- this was happening because a lot of the changes between different patients were "gradual" in the dataset
- it was as if a gradient was applied, smoothing out sharp changes between rows
- I was able to figure it out with 95% accuracy with the gradient, but randomizing the order of the rows to put distinct rows next to each other bumped up the performance even more
- While this may be a case of overfitting the weights to the data, there were very few sacrifices for accuracy that had to be made to account for edge cases
- Therefore, I believe that if the algorithm did NOT run in quadratic time, it would be very useful in industrial applications

Built With

Updates

srithon Thoniyil started this project — Mar 29, 2020 11:28 AM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.