What it does
Our project is a solution to OpenAlly's first Challenge: Patient Matching
How I built it
- There is a is_same_person() function which takes in 2 rows and a logging file pointer, and returns a score
- The lower the score is, the more likely that the two rows are describing the same person
- The exact threshold where it decides that the two rows are not equal comes from testing
- most mistakes in data fields (typos) can be measured using the Damerau-Levenshtein Distance algorithm.
- Damerau-Levenshtein Distance is an algorithm which measures string similarity as the minimum number of character substitutions, insertions, removals and adjacent character swaps to transform one string to another.
- I use this algorithm for many of the columns to determine if they are equivalent
- if the distance is below some threshold for each column, then consider the values for that column equivalent
- for last and first names, the score modifier scales with HOW different the two strings are
- for example, "Austin" vs. "Marco" would return a high number, as the names are very different
- if the names are very different, then it becomes increasingly likely that they are different people
- for more unpredictable, non-name fields, scores are static
- because the data had so many inconsistensies, we had to write a lot of logic for each column to normalize its contents before looking at the similarity
- for example, some of the States entries were postal codes and others were expanded names
- to solve this, I wrote different cases based on the combination of rows
- if only one row is abbreviated, abbreviate the other
- get the distance between the two; if they are the same, subtract 5 from the score
- if they are only 1 apart, then add 5 to the score
- if they are 2 apart (entirely different), add 15 to the score
- if both were abbreviated, use the same logic as before, except add 7.5 if they are 1 apart
- there is no conversion happening, so it is more likely that they are actually different
- if neither is abbreviated, get the distance as is; if the distance > 2, add 15 to the score
- otherwise, subtract 5
- another clever solution is in handle_names(), the function that takes care of first and last names
- in the data, there are instances where a patient's name is in its short form in one row and in its long form in another.
- while this cannot be known with certainty, it must be accounted for to prevent false positives
- the solution was to see if the difference in length between the two names was more than 2
- if so, then check if one of the names starts with the other name
- if it does, then subtract from the score
- otherwise, if the two names appear to be unique, return a positive score proportionate to how different they are
- there are more things in the code that I found interesting, but I will leave them out so that this does not become an essay
- I made a conscious effort to comment all of my code, so I hope that it is readable
Grouping
- I thought the way that I got grouping to work pretty interesting
- a set of indices is maintained to keep track of which rows have unique groups
- each row is given a pointer to a list of sibling rows
- when comparing 2 rows, if the rows are deemed equal, then they are put in the same group
- if neither has a group at that point, then one of the rows makes a new group containing both rows and adds the index of one of the rows to the set
- the other row's group points to this list
- if one of them has a group but the other does not, then it adds the groupless row to the group and sets its pointer to point to that group
- if both of them have groups, then the smaller group is appended onto the larger group
- the smaller group is then removed from the set
- the end result is that all of the rows that are equivalent are put in their own grouping
Challenges I ran into
- figuring out how best to group the patients
- I tested the algorithm only by comparing adjacent rows
- I did not put a focus on the end goal; I just wanted to get accuracy
- I realized that the runtime complexity was quadratic and that it was going to be tested against 100s of thousands of records
- if I had realized this sooner, I maybe could have changed my logic to incorporate some sort of hashing
- this would mean being able to cut down on the main bottleneck: the is_same_person() function
- hashing would mean less computation in the long run
- realizing that using numpy arrays were not going to instantly speed it up
- np arrays are useful because they offer direct access to elements with very little overhead
- however, when the dtype of the array is an object, there is a lot of overhead because of the pointer dereferencing that has to happen to get the object
- arrays are much more beneficial when using primitives
What I learned
- I learned a lot about NLP in doing this project
- Learning about the different string similarity algorithms was really interesting for me
- I learned about the importance of randomness in algorithmic design
- I already knew that many quicksort implementations randomize the order of the data before processing, but this project made me realize how beneficial that is
- With the default order, running the
test()function, I was maxing out at 95% accuracy - However, when randomizing the order this number approached 100%, even hitting it at one point
- this was happening because a lot of the changes between different patients were "gradual" in the dataset
- it was as if a gradient was applied, smoothing out sharp changes between rows
- I was able to figure it out with 95% accuracy with the gradient, but randomizing the order of the rows to put distinct rows next to each other bumped up the performance even more
- While this may be a case of overfitting the weights to the data, there were very few sacrifices for accuracy that had to be made to account for edge cases
- Therefore, I believe that if the algorithm did NOT run in quadratic time, it would be very useful in industrial applications
Log in or sign up for Devpost to join the conversation.