What it does

Our project is a solution to OpenAlly's first Challenge: Patient Matching

How I built it

  • There is an is_same_person() function which takes in two rows and a logging file pointer and returns a score
  • The lower the score, the more likely it is that the two rows describe the same person
  • The exact threshold at which two rows are decided to be different people comes from testing
  • Most mistakes in data fields (typos) can be measured using the Damerau-Levenshtein distance algorithm.
    • Damerau-Levenshtein distance measures string similarity as the minimum number of character substitutions, insertions, deletions, and adjacent character swaps needed to transform one string into another.
    • I use this algorithm for many of the columns to determine if they are equivalent
    • if the distance is below some threshold for each column, then consider the values for that column equivalent
      • for last and first names, the score modifier scales with HOW different the two strings are
      • for example, "Austin" vs. "Marco" would return a high number, as the names are very different
      • if the names are very different, then it becomes increasingly likely that they are different people
      • for more unpredictable, non-name fields, scores are static
    • because the data had so many inconsistencies, we had to write a lot of logic for each column to normalize its contents before looking at the similarity
    • for example, some of the State entries were postal abbreviations and others were expanded names
    • to solve this, I wrote different cases based on which formats the two rows use
      • if only one row is abbreviated, abbreviate the other
      • get the distance between the two; if they are identical, subtract 5 from the score
      • if they are 1 apart, add 5 to the score
      • if they are 2 apart (entirely different codes), add 15 to the score
      • if both were already abbreviated, use the same logic as before, except add 7.5 if they are 1 apart
        • with no conversion step to blame, a distance of 1 more likely means genuinely different states
      • if neither is abbreviated, get the distance as is; if the distance is greater than 2, add 15 to the score
      • otherwise, subtract 5
    • another clever solution is in handle_names(), the function that takes care of first and last names
    • in the data, there are instances where a patient's name appears in its short form in one row and in its long form in another.
    • while a short-form/long-form match cannot be known with certainty, it must be accounted for to avoid splitting the same patient into two people
    • the solution was to check whether the difference in length between the two names was more than 2
      • if so, check whether one of the names starts with the other
      • if it does, subtract from the score
      • otherwise, if the two names appear to be unrelated, return a positive score proportional to how different they are
    • there are more things in the code that I found interesting, but I will leave them out so that this does not become an essay
    • I made a conscious effort to comment all of my code, so I hope that it is readable
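The distance described above can be sketched in a few lines. This is the restricted "optimal string alignment" variant, written for illustration rather than taken from the project's code; libraries such as jellyfish provide faster implementations:

```python
def dl_distance(a: str, b: str) -> int:
    """Optimal-string-alignment variant of Damerau-Levenshtein distance:
    minimum substitutions, insertions, deletions, and adjacent swaps."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # adjacent transposition, e.g. "Jonh" -> "John"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

A transposition typo like "Jonh" vs. "John" scores 1 here, where plain Levenshtein would charge 2, which is why this variant suits typo-heavy data.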
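The state-comparison cases could look roughly like the following sketch. The STATE_ABBREV table, the helper names, and the crude per-character distance standing in for Damerau-Levenshtein are all illustrative assumptions, not the project's actual code:

```python
# illustrative subset; a real table would cover all states
STATE_ABBREV = {"california": "CA", "colorado": "CO", "texas": "TX"}

def char_distance(a: str, b: str) -> int:
    """Crude stand-in for an edit distance: differing positions + length gap."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def state_score(s1: str, s2: str) -> float:
    """Score modifier for the State column: negative favors a match."""
    a1, a2 = len(s1) == 2, len(s2) == 2        # already abbreviated?
    if a1 != a2:                               # only one abbreviated: convert the other
        if not a1:
            s1 = STATE_ABBREV.get(s1.lower(), s1[:2].upper())
        else:
            s2 = STATE_ABBREV.get(s2.lower(), s2[:2].upper())
        dist = char_distance(s1.upper(), s2.upper())
        return -5 if dist == 0 else (5 if dist == 1 else 15)
    if a1 and a2:                              # both abbreviated: no conversion to blame
        dist = char_distance(s1.upper(), s2.upper())
        return -5 if dist == 0 else (7.5 if dist == 1 else 15)
    dist = char_distance(s1.lower(), s2.lower())  # neither abbreviated
    return 15 if dist > 2 else -5
```

Note the asymmetry described above: a distance of 1 costs 5 when a conversion happened but 7.5 when both values were already two-letter codes.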
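The short-form/long-form name handling might be sketched like this. The weights and the per-character distance proxy are made up for illustration; the writeup says the real handle_names() scales its score with Damerau-Levenshtein distance:

```python
def name_score(n1: str, n2: str) -> float:
    """Score modifier for a name field; negative favors a match.
    Illustrative sketch -- weights and distance proxy are assumptions."""
    n1, n2 = n1.lower(), n2.lower()
    if abs(len(n1) - len(n2)) > 2:
        short, full = sorted((n1, n2), key=len)
        if full.startswith(short):   # "rob" vs. "robert": likely the same name
            return -5
    # otherwise penalize proportionally to how different the strings are
    diffs = sum(a != b for a, b in zip(n1, n2)) + abs(len(n1) - len(n2))
    return 5 * diffs if diffs else -5
```

Under this sketch a pair like "Austin" vs. "Marco" earns a large positive penalty, while "Rob" vs. "Robert" is rewarded as a probable short form.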


  • I thought the way that I got grouping to work was pretty interesting
    • a set of indices is maintained to keep track of which rows own a group
    • each row is given a pointer to a list of its sibling rows
    • when comparing two rows, if the rows are deemed equal, they are put in the same group
    • if neither has a group at that point, a new group containing both rows is created, and the index of one of the rows is added to the set
      • both rows' group pointers are set to this list
    • if one of them has a group but the other does not, the groupless row is added to the group and its pointer is set to point to that group
    • if both of them have groups, the smaller group is appended onto the larger group
      • the smaller group's index is then removed from the set
    • the end result is that all of the rows that are equivalent are put in their own grouping
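The grouping scheme above can be sketched as a stand-alone function. This is a minimal illustration; the real code works on dataframe rows, and same_person here is a stand-in for the pairwise is_same_person() check:

```python
def group_rows(n_rows, same_person):
    """Cluster row indices; same_person(i, j) -> bool is the pairwise matcher.
    Groups merge small-into-large; singletons are left out of the result."""
    group_of = [None] * n_rows   # each row's pointer to its sibling list
    owners = set()               # indices of rows that own a group
    for i in range(n_rows):
        for j in range(i + 1, n_rows):
            if not same_person(i, j):
                continue
            gi, gj = group_of[i], group_of[j]
            if gi is None and gj is None:      # neither grouped: make a new group
                group = [i, j]
                group_of[i] = group_of[j] = group
                owners.add(i)
            elif gi is None:                   # only j grouped: i joins j's group
                gj.append(i)
                group_of[i] = gj
            elif gj is None:                   # only i grouped: j joins i's group
                gi.append(j)
                group_of[j] = gi
            elif gi is not gj:                 # both grouped: merge small into large
                small, large = (gi, gj) if len(gi) < len(gj) else (gj, gi)
                large.extend(small)
                for r in small:
                    group_of[r] = large
                owners.discard(small[0])       # the merged group loses its owner
                owners.add(large[0])
    return [group_of[i] for i in sorted(owners)]
```

Merging the smaller group into the larger one keeps the total pointer-rewriting work down, the same "small-to-large" trick used in union-find-style structures.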

Challenges I ran into

  • figuring out how best to group the patients
    • I tested the algorithm only by comparing adjacent rows
    • I did not put a focus on the end goal; I just wanted to get accuracy
    • I realized late that the runtime complexity was quadratic and that the solution was going to be tested against hundreds of thousands of records
    • if I had realized this sooner, I maybe could have changed my logic to incorporate some sort of hashing
    • this would mean cutting down on calls to the main bottleneck: the is_same_person() function
    • hashing would mean less computation in the long run
  • realizing that numpy arrays were not going to instantly speed things up
    • np arrays are useful because they offer direct access to elements with very little overhead
    • however, when the dtype of the array is object, there is a lot of overhead from the pointer dereferencing required to reach each element
    • numpy arrays are much more beneficial when they hold primitive dtypes
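The hashing idea mentioned above is usually called blocking: hash each row to a coarse key so the expensive pairwise check only runs within buckets, instead of over all O(n²) pairs. A hypothetical sketch (the key function is made up; a real one might use a phonetic code of the last name):

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(rows, key):
    """Yield only candidate index pairs that share a blocking key."""
    buckets = defaultdict(list)
    for idx, row in enumerate(rows):
        buckets[key(row)].append(idx)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)

# hypothetical blocking key: first letter of last name + birth year
rows = [("Smith", 1990), ("Smyth", 1990), ("Jones", 1990), ("Smith", 1985)]
pairs = list(blocked_pairs(rows, key=lambda r: (r[0][0].lower(), r[1])))
# only rows 0 and 1 share a bucket -> [(0, 1)]
```

The trade-off is that a typo in the key field can split a true match across buckets, so blocking keys are usually chosen to be robust (or several keys are combined).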

What I learned

  • I learned a lot about NLP in doing this project
    • Learning about the different string similarity algorithms was really interesting for me
  • I learned about the importance of randomness in algorithmic design
    • I already knew that many quicksort implementations randomize the order of the data before processing, but this project made me realize how beneficial that is
    • With the default order, running the test() function, I was maxing out at 95% accuracy
    • However, when randomizing the order this number approached 100%, even hitting it at one point
    • this was happening because a lot of the changes between different patients were "gradual" in the dataset
    • it was as if a gradient had been applied, smoothing out sharp changes between rows
    • the algorithm still managed 95% accuracy with that gradient, but randomizing the order of the rows, which puts distinct rows next to each other, bumped up the performance even more
    • While this may be a case of overfitting the weights to the data, very few accuracy sacrifices had to be made to account for edge cases
    • Therefore, I believe that if the algorithm did NOT run in quadratic time, it would be very useful in industrial applications