Synthesizing and rapidly assimilating the exponentially growing body of biomedical knowledge is becoming an impossible task for scientists. This knowledge is either inherently unstructured and not conducive to current computing paradigms, or siloed in structured databases that require specialized bioinformatics expertise.

Despite the recent renaissance in unsupervised neural networks for deciphering unstructured natural language and the availability of numerous bioinformatics resources, no holistic application has been advanced for real-time synthesis of the scientific literature and its seamless fusion with deep omic insights and real-world evidence. This gap severely slows the identification of new targets in drug discovery research and causes further delays downstream.

This has become evident in pandemics like COVID-19. We need a mechanism to quickly synthesize, analyze, and draw inferences from the ever-growing body of biomedical literature and curated datasets available globally.

What it does

  • The Orpheus application parses and analyzes COVID-19 related scientific literature and extracts biomedical concepts from it.
  • Once it has extracted concepts such as Diseases, Proteins and Compounds, it infers implied relationships between them based on NLP analysis of the paper itself.
  • Using the inferred relationships between the extracted concepts, it builds a series of 'triples' from each paper.
  • These triples are combined with other reference data sources to infer a confidence level for each 'triple', or association.
  • Finally, all of the inferred triples are added to a graph database to build a probabilistic knowledge graph.
  • This graph is exposed as an API that can be queried, searched and explored through an intuitive, visual web-based interface.
  • As new research on COVID-19 is published, the underlying Orpheus pipeline can be re-run as a workflow to keep the graph up to date.
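
The triple-building and confidence steps above can be sketched roughly as follows. This is a minimal illustration, not the Orpheus implementation: the `Triple` class, the `noisy_or` combination rule, and the placeholder entities and scores are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One assertion extracted from a paper (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str
    confidence: float  # assumed to lie in [0, 1]
    source: str        # identifier of the paper the assertion came from


def noisy_or(confidences):
    """Combine independent pieces of evidence: 1 - prod(1 - c_i)."""
    remaining_doubt = 1.0
    for c in confidences:
        remaining_doubt *= 1.0 - c
    return 1.0 - remaining_doubt


def build_graph(triples):
    """Merge triples that share (subject, predicate, object) into weighted edges."""
    evidence = defaultdict(list)
    for t in triples:
        evidence[(t.subject, t.predicate, t.obj)].append(t.confidence)
    return {edge: noisy_or(scores) for edge, scores in evidence.items()}


# Placeholder values for illustration only; not results from the project.
triples = [
    Triple("compound_A", "inhibits", "protein_X", 0.6, "paper_1"),
    Triple("compound_A", "inhibits", "protein_X", 0.5, "paper_2"),
    Triple("protein_X", "associated_with", "disease_Y", 0.9, "paper_1"),
]
graph = build_graph(triples)
```

Here an association reported by two papers ends up with a higher combined confidence than either paper gives on its own. A real pipeline would likely also weigh agreement with the reference data sources, which this sketch leaves out.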

How we built it

  • Orpheus supports visual triangulation of insights via statistical enrichments from curated collections of structured databases, with the diseases, biomolecules, drugs, and cells & tissues collections loaded by default.
  • For vectorization (converting text into numeric vectors), we used not only word-based schemes but also network representations, which let us infer and predict relationships between nodes and search for concepts.
  • We then fused disparate data sources and extracted triples (assertions) into a probabilistic knowledge graph.
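
As a rough illustration of the word-based vectorization idea, the sketch below turns text into bag-of-words count vectors and ranks concepts by cosine similarity to a query. It is an assumption-laden toy, not the scheme Orpheus uses: the function names and the concept descriptions are hypothetical, and the network-representation side is not shown.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-count vector (bag of words)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, concepts):
    """Rank concept names by similarity of their description to the query."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(desc)), name) for name, desc in concepts.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


# Placeholder descriptions, not taken from any reference database.
concepts = {
    "protein_X": "viral spike protein binding host receptor",
    "compound_A": "small molecule protease inhibitor",
    "disease_Y": "respiratory disease caused by viral infection",
}
results = search("viral protein receptor", concepts)
```

Concepts that share no terms with the query are dropped from the results. Learned embeddings would replace the count vectors in practice, but the search step keeps the same shape.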

Challenges we ran into

  • Correlating and normalizing the CORD-19 dataset with reference databases such as UniProt and Wikipedia, among others
  • Scaling to handle thousands of scientific papers on COVID-19
  • Identifying the right NLP models to extract the concepts needed
  • Defining the algorithm to infer relationships between concepts

What's next for Orpheus - Knowledge Synthesis for COVID-19

  • Identify and include additional data sources such as ChEMBL and PDB
  • Apply additional graph-based data mining algorithms to reveal hidden links
  • Run algorithms on the fused knowledge graph to infer or predict additional links and to surface other applications, for example PageRank as well as custom algorithms that can surface drug repurposing use cases
  • Add UI features that make it easier to search, explore and navigate the graph
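
To give a sense of the kind of graph algorithm mentioned above, here is a minimal PageRank sketch using plain power iteration. It assumes the knowledge graph can be flattened to directed (source, target) edges and ignores edge confidences. The node names and edges are placeholders, not project output.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # Nodes without outgoing edges spread their rank over all nodes.
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Placeholder graph for illustration only.
edges = [
    ("compound_A", "protein_X"),
    ("compound_B", "protein_X"),
    ("protein_X", "disease_Y"),
]
ranks = pagerank(edges)
top = max(ranks, key=ranks.get)
```

Nodes that many paths lead into accumulate the most rank. A drug repurposing use case would presumably need a custom scoring on top of this, for instance one that weights edges by their inferred confidence.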
