Searching the literature for relevant scientific articles is a tedious job. Even in a very specific scientific domain, finding and cataloguing the most pertinent papers is often time-consuming, despite the existence of structured databases such as PubMed, Scopus, and Web of Science. In this regard, semantic search protocols offer a more user-tailored approach to cataloguing and comparing multiple studies/articles based on specific search terms. The Open Research Knowledge Graph (ORKG) is an endeavor to represent scientific articles in a content-dependent manner, making the underlying information machine-readable and suitable for automated processing.
In the scientific literature, bioassay descriptions are among the most structured contents due to their intrinsic association with underlying standard protocols and methodologies. Even so, comparing several bioassays remains a challenging problem, because semantically organized information is critically lacking. The problem is compounded by the volume of research and assays published or approved, especially in extremely dynamic research domains such as COVID-19 research.
The COVID-19 pandemic has elicited a response of unprecedented magnitude from the research community across scientific domains. Bioassay development is no exception, with new and innovative diagnostic and therapeutic assays appearing every day. In such a scenario, it is critical for the developers and users of these assays to carefully examine and compare the relevant ones, both before and after development, to realize their practical benefit.
Our application of the ORKG approach to a semantic comparison of relevant bioassays in current COVID-19 research aims to fill this gap: the lack of structured, content-based ways to query bioassay information. We believe such an approach will bolster scholarly literature search for COVID-19-associated bioassays and contribute to the development and application of appropriate bioassays in this domain.
What it does
Presents structured, semantified, and comparable COVID-19 bioassay protocols in the Open Research Knowledge Graph infrastructure, converting these text-based protocols into machine-readable and comparable elements.
Specifically, our expert-curated data within the ORKG is presented as a tabulated summary comparing bioassays along specific dimensions of the data properties defined in the BioAssay Ontology. The curated data summarizes the state of the art in COVID-19 bioassay research.
Further, we implemented a moodbar feature (ref: https://en.wikipedia.org/wiki/Moodbar) that highlights values of selected semantified properties in the bioassay contribution comparison table based on their usage: widely used methods and instruments are marked in shades of green, while less common ones are marked in shades of blue. The idea behind this feature is to highlight at a glance the most used materials and methods in the creation of bioassays, and conversely the least used ones.
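The frequency-to-shade logic behind the moodbar can be sketched as follows. This is a minimal illustration, not the actual ORKG front-end code: the shade labels, the median-based threshold, and the example method names are all hypothetical.

```python
from collections import Counter

def moodbar_shades(values, palette_size=3):
    """Map each property value to a shade based on usage frequency:
    widely used values get green shades, rare ones get blue shades."""
    counts = Counter(values)
    # split at the median count: at/above -> green, below -> blue
    median = sorted(counts.values())[len(counts) // 2]
    shades = {}
    for value, n in counts.items():
        hue = "green" if n >= median else "blue"
        # deeper shade for counts farther from the median (illustrative scale)
        intensity = min(palette_size, abs(n - median) + 1)
        shades[value] = f"{hue}-{intensity}"
    return shades

# hypothetical detection methods across compared bioassay contributions
methods = ["fluorescence", "fluorescence", "fluorescence", "luminescence"]
print(moodbar_shades(methods))
```

In the comparison table, each cell's background color would then be looked up from the shade assigned to its property value.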
How we built it / What we did in the weekend
The semantification of COVID-19 bioassays in the Open Research Knowledge Graph consisted of two parts: 1) interacting with domain experts to obtain gold-standard semantified bioassays in accordance with the semantic definitions in the BioAssay Ontology; and 2) designing an automated machine learning system to semantify the data automatically, i.e., to play the role of the experts.
For the first part, the annotations were manually entered into the Open Research Knowledge Graph by a domain expert via its user-friendly interface for adding papers (https://www.orkg.org/orkg/). Six bioassays were selected from the PubChem library by searching for papers on COVID-19. All the assays aim to characterize the inhibitory activity of different compounds against coronavirus infection. Three assays were chosen for their enzymatic target: they used the same fluorescence method, but under different conditions, testing different molecules against COVID-19 with different substrates and buffers. The other three assays were chosen because they were cell-based: their target was not an enzyme but the coronavirus itself, and different cell types were chosen to evaluate the compounds' activity. These cases also used a different detection method. All the reference papers were chosen because their assay results demonstrated the presence of an active molecule at the end of the study, thus offering potential inspiration for future research. The resulting six bioassays that we structured, semantified, and made comparable are part of our submission here: https://www.orkg.org/orkg/comparison?contributions=R38392,R38323,R38371,R38296,R38344,R38266
For automatic semantification, we trained a neural network classification model based on SciBERT (Beltagy et al., 2019). We treated each (predicate, value) semantic pair as a document label and, for each bioassay, classified every label in the set of all possible labels obtained from the training data as either true or false. Our training dataset comprised the gold-standard curated set of bioassays by Clark et al., 2014 (https://doi.org/10.7717/peerj.524). Our test dataset comprised 124 unique COVID-19 bioassays downloaded from PubChem (https://pubchem.ncbi.nlm.nih.gov/#query=coronavirus&tab=assay). The final output of the machine learning model was 124 automatically semantified bioassays.
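The labeling scheme above can be sketched as follows: collect every (predicate, value) pair seen in training as one label, then encode each bioassay as a true/false decision per label. The predicate and value strings below are hypothetical placeholders, not the actual Clark et al. annotations, and the SciBERT encoder itself is omitted.

```python
def build_label_space(training_annotations):
    """Collect every (predicate, value) pair seen in training as one label."""
    return sorted({pair for assay in training_annotations for pair in assay})

def to_binary_vector(assay_pairs, labels):
    """Encode one bioassay as a true/false decision per label."""
    present = set(assay_pairs)
    return [pair in present for pair in labels]

# hypothetical gold-standard annotations (two bioassays)
train = [
    [("has detection method", "fluorescence"), ("has assay format", "enzymatic")],
    [("has detection method", "luminescence"), ("has assay format", "cell-based")],
]
labels = build_label_space(train)   # 4 distinct (predicate, value) labels
vec = to_binary_vector(train[0], labels)
```

The binary vectors would then serve as the multi-label targets for the SciBERT-based classifier, one sigmoid output per (predicate, value) label.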
Overall, our team comprised six people: three domain experts in biochemistry and neuroscience at the PhD and postdoc level, a software development expert, and two members with backgrounds in Artificial Intelligence and Natural Language Processing at the PhD and postdoc level. One member curated the six gold-standard bioassays in the ORKG; two members built the machine learning model; another member verified a portion of the automatically annotated data; one member enhanced the Contribution display user interface; and one member handled marketing and pitch creation.
Our code and datasets are publicly available online at this GitHub link.
Challenges we ran into
For our automatic predictions, given the limited time of the hackathon, we could validate only 20 of the 124 unique bioassays obtained from PubChem, and only at a coarse-grained level: the curator scanned the list of predictions made for an assay and marked the predictions overall as one of two categories, associated or unassociated. The curator found 60% of the 20 evaluated bioassays (12 of 20) to have associated predictions, which we consider a fairly good result given the complexity of the classification task.
Further, due to technical issues we were not able to upload our automatically parsed data into the ORKG before the hackathon deadline. This can be readily addressed by collaborating with the ORKG core development team.
Finally, our moodbar feature for highlighting certain comparison values based on their usage could not be tested against real values, although the underlying logic supports the abstraction.
Accomplishments that we’re proud of
Excellent team cooperation
Successfully implementing the idea we envisioned within the timeframe of the hackathon
Many platforms have been developed with the common purpose of presenting COVID-19 research outcomes. However, there is no open-access platform for sharing information specifically about the laboratory processes used in the battle against COVID-19. Researchers should not waste valuable time troubleshooting their experiments. In some cases, they may want to use a technique in their study design that they are not familiar with, or one outside their particular specialty, and thus lose time navigating the literature to find the right method or protocol to follow. This is where 'COVID-19 Bioassays in the Open Research Knowledge Graph' comes into play, offering the scientific community an open platform to share the various techniques used in research against COVID-19 and to access bioassays on demand. We are proud to have implemented this idea during the hackathon.
What we learned
Hackathons are excellent venues to test skills, platforms, and build collaborations.
What we need to continue the project
We have the main platform ready, incorporating six expert-curated COVID-19 bioassays semantified in the ORKG using semantic technologies. Winning the hackathon would give us the publicity needed to advertise the platform to the public and establish it as a reliable space to share and retrieve information about lab protocols currently in use for the COVID-19 pandemic.
What's next for Covid-19 Bioassays in the Open Research Knowledge Graph
We want to promote our idea to critical stakeholders in the development and use of bioassays, encouraging them to use the ORKG to create structured, semantified bioassays and thus enable a new era in the digital libraries of scholarly publications.
References

Beltagy, Iz, Kyle Lo, and Arman Cohan. 2019. "SciBERT: A Pretrained Language Model for Scientific Text." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Clark, Alex M., Barry A. Bunin, Nadia K. Litterman, Stephan C. Schürer, and Ubbo Visser. 2014. "Fast and Accurate Semantic Annotation of Bioassays Exploiting a Hybrid of Machine Learning and User Confirmation." PeerJ 2:e524. https://doi.org/10.7717/peerj.524