Understanding the potential impacts of seasonality and social factors on the COVID-19 outbreak is fundamental for governments to define an effective strategy for returning to normality. Scientific research on COVID-19 is very active and a large volume of scientific papers is produced on a daily basis, making it hard to answer the question “what is known about a specific topic?”.
The purpose of the proposed solution is to leverage Natural Language Processing (“NLP”) and graph database technology to facilitate and automate the process of knowledge discovery on a large dataset of scientific papers (about 50k) about the COVID-19 disease.
What it does
Our team developed a Proof of Value solution in which we studied the feasibility of extracting the scientific statements contained in a dataset of scientific papers related to COVID-19. We used NLP techniques to extract statements as structured triples of the form “subject” -> “predicate” -> “object”, and we created a prototype knowledge-graph database to support the process of knowledge discovery.
The knowledge graph database allows the user to run queries to return a network of concepts represented as nodes connected by syntactic relationships represented as edges.
The primary objective of the solution is to build large networks of concepts spanning multiple scientific papers, allowing the discovery of knowledge that would not be possible when reviewing papers individually.
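To illustrate how a chain of concepts can emerge only once statements from several papers are combined, here is a minimal sketch using an in-memory graph as a stand-in for the graph database. The triples, paper identifiers and function names are invented placeholders for illustration; they are not results extracted from the dataset and not the project's actual code.

```python
from collections import defaultdict, deque

# Invented placeholder triples (subject, predicate, object, paper id).
# They are NOT statements extracted from the dataset.
TRIPLES = [
    ("temperature", "influences", "humidity", "paper_1"),
    ("humidity", "affects", "droplet survival", "paper_2"),
    ("droplet survival", "drives", "transmission", "paper_3"),
]


def build_graph(triples):
    """Map each subject to its outgoing (predicate, object, paper) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj, paper in triples:
        graph[subject].append((predicate, obj, paper))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns the list of edges from start to goal, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, obj, paper in graph.get(node, []):
            if obj not in visited:
                visited.add(obj)
                queue.append((obj, path + [(node, predicate, obj, paper)]))
    return None


graph = build_graph(TRIPLES)
path = find_path(graph, "temperature", "transmission")
# Each edge on the path comes from a different (hypothetical) paper.
papers = {edge[3] for edge in path}
```

In this toy example no single paper links "temperature" to "transmission"; the connection only appears when the edges from all three placeholder papers are traversed together.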
How I built it
We analysed a public dataset of about 50k scientific papers published on Kaggle.com (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks) and we used spaCy, an industry-standard Python library for NLP, to extract relevant statements in a structured format. We uploaded the structured information into a Neo4j graph database to create an exploratory interface for knowledge discovery.
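As a rough sketch of what the upload step could look like, the function below turns one extracted triple into a parameterised Cypher statement. The node label `Concept`, the relationship type `RELATION`, the property names and the function name are assumptions made for this example, not the schema we actually used.

```python
def triple_to_cypher(subject, predicate, obj, paper_id):
    """Return a parameterised Cypher statement and its parameters for one triple.

    The labels and property names are hypothetical, chosen for illustration.
    """
    query = (
        "MERGE (s:Concept {name: $subject}) "
        "MERGE (o:Concept {name: $object}) "
        "MERGE (s)-[r:RELATION {predicate: $predicate, paper: $paper_id}]->(o)"
    )
    params = {
        "subject": subject,
        "object": obj,
        "predicate": predicate,
        "paper_id": paper_id,
    }
    return query, params


# Placeholder triple, not taken from the dataset.
query, params = triple_to_cypher("humidity", "affects", "transmission", "paper_1")
```

Using `MERGE` rather than `CREATE` means a concept mentioned in several papers maps to a single node, which is what lets the graph connect statements across papers. With the official Neo4j Python driver, each pair could then be passed to `session.run(query, params)`.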
This solution is meant to represent a feasibility study where only simple sentence structures have been analysed. Additional work is required to produce models for the extraction of more complex sentence structures.
Challenges I ran into
Data processing and understanding NLP techniques were the main challenges; our team members had very limited or no past experience in NLP. We analysed multiple sentence structures and token dependencies in order to develop a simple, preliminary model for the extraction of knowledge in a structured format. The main complexity lay in the large number of syntactic structures to be analysed and in the design of the rules to extract the relevant concepts.
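A rule of the kind described above can be sketched as follows for the simplest sentence structure (subject, verb, direct object). To keep the example self-contained, the tokens are hand-annotated with dependency labels instead of being parsed; with spaCy, the same information is available as `token.dep_` and `token.head`. The `Token` class, the function name and the sample sentence are placeholders, not our actual rule set.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    dep: str   # dependency label, e.g. "nsubj", "ROOT", "dobj"
    head: int  # index of the syntactic head within the sentence


def extract_triples(tokens):
    """Return (subject, predicate, object) for each ROOT token that has
    both an nsubj child and a dobj child."""
    triples = []
    for index, token in enumerate(tokens):
        if token.dep != "ROOT":
            continue
        subjects = [t.text for t in tokens if t.head == index and t.dep == "nsubj"]
        objects = [t.text for t in tokens if t.head == index and t.dep == "dobj"]
        for subject in subjects:
            for obj in objects:
                triples.append((subject, token.text, obj))
    return triples


# Hand-annotated parse of the placeholder sentence "Humidity affects transmission".
sentence = [
    Token("Humidity", "nsubj", 1),
    Token("affects", "ROOT", 1),
    Token("transmission", "dobj", 1),
]
triples = extract_triples(sentence)
```

A sentence without a direct object yields no triple under this rule, which is one reason why covering more syntactic structures requires additional rules.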
Accomplishments that I'm proud of
Our team managed to test the feasibility of creating a prototype knowledge graph to support the process of knowledge discovery with regard to the COVID-19 outbreak. None of our team members had significant past experience in NLP techniques; for this reason we are proud of the result we achieved in such a short period of time.
What I learned
The use of the spaCy library for NLP, and the use of collaborative tools such as Slack and Miro.
What's next for The Effect of the Seasons
The project is at an early stage; the solution is currently only able to display relations between concepts extracted from a limited set of variants of simple syntactic structures. Additional work is required to create more sophisticated models for the extraction of relevant sentences in a structured format.
Currently our analysis is limited to a subset of articles and paragraphs matching a given set of keywords that we identified as part of our analysis; this restriction was applied to make the large dataset easier to process, but it will need to be removed in future development.
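The keyword restriction can be pictured with a minimal sketch like the one below. The keyword list, function name and sample paragraphs are hypothetical; the keywords we actually used are not listed in this write-up.

```python
import re

# Hypothetical keyword list, for illustration only.
KEYWORDS = ["season", "temperature", "humidity"]


def matches_keywords(paragraph, keywords=KEYWORDS):
    """True if any keyword starts a word in the paragraph (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + ")"
    return re.search(pattern, paragraph, flags=re.IGNORECASE) is not None


# Placeholder paragraphs, not taken from the dataset.
paragraphs = [
    "Seasonal patterns were examined.",
    "The protein structure was resolved.",
]
selected = [p for p in paragraphs if matches_keywords(p)]
```

Matching on word prefixes keeps variants such as "seasonal" or "seasonality", at the cost of occasionally admitting unrelated words that share the same stem.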