COVID-19 Scientific Article Clusters

Summary

For the Hackathon for Social Good, our team applied natural language processing to a large open dataset of scholarly articles on COVID-19 and related viruses in order to reveal hidden relations among the articles. We chose this topic because the COVID-19 pandemic is a pressing worldwide crisis that we want to help fight. Our dataset was the CORD-19 scholarly article collection, which is openly available and was therefore easy to obtain for analysis. The results can help researchers find groups of related articles, speeding up literature review and discovery as they research and combat the pandemic.

Our analysis consisted of three stages. First, we pre-processed the CORD-19 data to clean it for natural language processing. Then we ran k-means clustering on the cleaned data. Finally, we extracted meaningful groupings from the clusters based on their key word stems, so that relations between scholarly articles can be found through the stems most prominent in each group.
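The three stages can be illustrated with a minimal, standard-library-only sketch. It is not our hackathon code: the four sample titles are made up, the `stem` function is a crude stand-in for a real stemming algorithm, the TF-IDF weighting is an assumed way of turning text into vectors, and the farthest-first initialization is chosen only to keep this toy example deterministic.

```python
import math
import re
from collections import Counter

# Hypothetical stop words and suffix rules; a real pipeline would likely use
# a full stemmer and a much larger stop-word list.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "for", "is", "on", "with"}
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")

def stem(word):
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Stage 1: lowercase, keep alphabetic tokens, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Turn token lists into L2-normalized TF-IDF rows over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + len(docs)) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for d in docs:
        tf = Counter(d)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def init_centroids(rows, k):
    """Farthest-first start (an assumption, used here for determinism)."""
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids)))
    return [list(c) for c in centroids]

def kmeans(rows, k, max_iter=20):
    """Stage 2: alternate assignment and centroid updates until labels settle."""
    centroids = init_centroids(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def top_stems(centroid, vocab, n=3):
    """Stage 3: the stems with the largest weight in a cluster centroid."""
    ranked = sorted(zip(centroid, vocab), reverse=True)
    return [term for _, term in ranked[:n]]

# Made-up titles for illustration only; they are not taken from CORD-19.
titles = [
    "Vaccine trials and vaccine dosing in adults",
    "Vaccine efficacy in vaccinated adults",
    "Masks reduce airborne transmission",
    "Airborne transmission and masks in hospitals",
]
docs = [preprocess(t) for t in titles]
vocab, rows = tfidf_vectors(docs)
labels, centroids = kmeans(rows, k=2)
for c, centroid in enumerate(centroids):
    print(c, top_stems(centroid, vocab))
```

On real data, the number of clusters and the stemmer would need tuning, and the dense lists used here for readability would not scale to the full collection.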

Challenges we ran into

We repeatedly hit the performance and memory limits of our personal computers while processing the data. To work around these limits, we researched which data structures were most efficient for our computations and modified our code to use them.
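As an illustration of how much the choice of data structure can matter, the sketch below compares a dense word-weight vector with a sparse mapping that stores only the non-zero entries. This summary does not record which structures we ended up using, so the sparse dictionary and the helper names `to_sparse` and `sparse_dot` are assumptions for the example, not a description of our code.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as an {index: value} mapping."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}

def sparse_dot(a, b):
    """Dot product that only visits entries of the smaller mapping."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

# Hypothetical article vectors over a 10,000-word vocabulary; most entries
# are zero because each article uses only a few of the words.
vocab_size = 10_000
dense_a = [0.0] * vocab_size
dense_a[5], dense_a[42], dense_a[9999] = 0.5, 0.25, 2.0
dense_b = [0.0] * vocab_size
dense_b[42], dense_b[100] = 4.0, 1.0

sparse_a = to_sparse(dense_a)
sparse_b = to_sparse(dense_b)
dense_result = sum(x * y for x, y in zip(dense_a, dense_b))
sparse_result = sparse_dot(sparse_a, sparse_b)
print(len(dense_a), len(sparse_a), dense_result, sparse_result)
```

The sparse form stores three entries instead of 10,000 for the first vector, and the dot product touches only the stored entries rather than the whole vocabulary.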