Searching the literature for relevant scientific articles is a tedious job. Even in a very specific scientific domain, finding and cataloguing the most pertinent papers is often time-consuming, despite the existence of structured databases such as PubMed, Scopus, and Web of Science. In this regard, semantic search protocols offer a more user-tailored approach to cataloguing and comparing multiple studies/articles based on specific search terms. The Open Research Knowledge Graph (ORKG) is an endeavor to represent scientific articles in a content-dependent manner, making the underlying information machine-readable and suitable for automated processing.
In the scientific literature, bioassay descriptions are among the most structured contents due to their intrinsic association with underlying standard protocols and methodologies. Even so, comparing several bioassays remains a challenging problem, because semantically organized information is critically lacking. The problem is compounded by the volume of research and assays published or approved, especially in extremely dynamic research domains such as COVID-19 research.
The COVID-19 pandemic has elicited a response of unprecedented magnitude from the research community across scientific domains. Bioassay development is no exception, with new and innovative diagnostic and therapeutic assays appearing every day. In such a scenario, it is critical for the developers and users of these assays to carefully examine and compare the relevant ones, both before and after development, to realize their practical benefit.
Our application of the ORKG approach to a semantic comparison of relevant bioassays in current COVID-19 research aims to fill this gap: the lack of structured, content-based ways to query bioassay information. We believe such an approach will bolster scholarly literature search for COVID-19-associated bioassays and contribute to the development and application of appropriate bioassays in this domain.
What it does
Presents structured, semantified, and comparable COVID-19 bioassay protocols in the Open Research Knowledge Graph infrastructure, converting these text-based protocols into machine-readable and comparable elements.
Specifically, our expert-curated data within the ORKG is presented as a tabulated summary comparing bioassays along specific dimensions of the data properties defined in the BioAssay Ontology. The curated data summarizes the state of the art in COVID-19 bioassay research.
Further, we implemented a moodbar feature (ref: https://en.wikipedia.org/wiki/Moodbar) that highlights values of selected semantified properties in the bioassay contribution comparison table based on their usage: widely used methods and instruments are marked in shades of green, while less common ones are marked in shades of blue. The idea behind this feature is to highlight at a glance the most used materials and methods in the creation of bioassays, and conversely the least used ones.
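The frequency-to-shade logic behind the moodbar can be sketched as follows. This is a minimal illustration, not the actual ORKG front-end code: the shade labels, the median-based threshold, and the example method names are all hypothetical.

```python
from collections import Counter

def moodbar_shades(values, palette_size=3):
    """Map each property value to a shade based on usage frequency:
    widely used values get green shades, rare ones get blue shades."""
    counts = Counter(values)
    # split at the median count: at/above -> green, below -> blue
    median = sorted(counts.values())[len(counts) // 2]
    shades = {}
    for value, n in counts.items():
        hue = "green" if n >= median else "blue"
        # deeper shade for counts farther from the median (illustrative scale)
        intensity = min(palette_size, abs(n - median) + 1)
        shades[value] = f"{hue}-{intensity}"
    return shades

# hypothetical detection methods across compared bioassay contributions
methods = ["fluorescence", "fluorescence", "fluorescence", "luminescence"]
print(moodbar_shades(methods))
```

In the comparison table, each cell's background color would then be looked up from the shade assigned to its property value.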
How we built it / What we did in the weekend
The semantification of COVID-19 bioassays in the Open Research Knowledge Graph consisted of two parts: 1) interacting with domain experts to obtain gold-standard semantified bioassays in accordance with the semantic definitions in the BioAssay Ontology; and 2) designing an automated machine learning system to semantify the data automatically, i.e., to play the role of the experts.
For the first part, the annotations were manually entered into the Open Research Knowledge Graph by a domain expert via its user-friendly interface for adding papers (https://www.orkg.org/orkg/). Six bioassays were selected from the PubChem library by searching for papers on COVID-19. All the assays aim to characterize the inhibitory activity of different compounds against coronavirus infection. Three assays were chosen for their enzymatic target: they used the same fluorescence method, but under different conditions, testing different molecules against COVID-19 with different substrates and buffers. The other three assays were chosen because they were cell-based: their target was not an enzyme but the coronavirus itself, and different cell types were chosen to evaluate the compounds' activity. These cases also used a different detection method. All the reference papers were chosen because their assay results demonstrated the presence of an active molecule at the end of the study, thus offering potential inspiration for future research. The resulting six bioassays that we structured, semantified, and made comparable are part of our submission here: https://www.orkg.org/orkg/comparison?contributions=R38392,R38323,R38371,R38296,R38344,R38266
For automatic semantification, we trained a neural network classification model based on SciBERT (Beltagy et al., 2019). We treated each (predicate, value) semantic pair as a document label and, for each bioassay, classified every label in the set of all possible labels obtained from the training data as either true or false. Our training dataset comprised the gold-standard curated set of bioassays by Clark et al., 2014 (https://doi.org/10.7717/peerj.524). Our test dataset comprised 124 unique COVID-19 bioassays downloaded from PubChem (https://pubchem.ncbi.nlm.nih.gov/#query=coronavirus&tab=assay). The final output of the machine learning model was 124 automatically semantified bioassays.
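The labeling scheme above can be sketched as follows: collect every (predicate, value) pair seen in training as one label, then encode each bioassay as a true/false decision per label. The predicate and value strings below are hypothetical placeholders, not the actual Clark et al. annotations, and the SciBERT encoder itself is omitted.

```python
def build_label_space(training_annotations):
    """Collect every (predicate, value) pair seen in training as one label."""
    return sorted({pair for assay in training_annotations for pair in assay})

def to_binary_vector(assay_pairs, labels):
    """Encode one bioassay as a true/false decision per label."""
    present = set(assay_pairs)
    return [pair in present for pair in labels]

# hypothetical gold-standard annotations (two bioassays)
train = [
    [("has detection method", "fluorescence"), ("has assay format", "enzymatic")],
    [("has detection method", "luminescence"), ("has assay format", "cell-based")],
]
labels = build_label_space(train)   # 4 distinct (predicate, value) labels
vec = to_binary_vector(train[0], labels)
```

The binary vectors would then serve as the multi-label targets for the SciBERT-based classifier, one sigmoid output per (predicate, value) label.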
Overall, our team comprised six people: three domain experts in biochemistry and neuroscience at the PhD and postdoc level, a software development expert, and two members with backgrounds in Artificial Intelligence and Natural Language Processing at the PhD and postdoc level. One member curated the six gold-standard bioassays in the ORKG; two members built the machine learning model; another member verified a portion of the automatically annotated data; one member enhanced the Contribution display user interface; and one member handled marketing and pitch creation.
Our code and datasets are publicly available online at this GitHub link.
Challenges we ran into
For our automatic predictions, given the limited time of the hackathon, we could validate only 20 of the 124 unique bioassays obtained from PubChem, and only at a coarse-grained level: the curator scanned the list of predictions made for an assay and marked the predictions overall as one of two categories, associated or unassociated. The curator found 60% of the 20 evaluated bioassays (12 of 20) to have associated predictions, which we consider a fairly good result given the complexity of the classification task.
Further, due to technical issues we were not able to upload our automatically parsed data into the ORKG before the hackathon deadline. This can be readily addressed by collaborating with the ORKG core development team.
Finally, our moodbar feature for highlighting certain comparison values based on their usage could not be tested against real values, although the underlying logic supports the abstraction.
Accomplishments that we’re proud of
Excellent team cooperation
Successfully implementing the idea we envisioned within the timeframe of the hackathon
Many platforms have been developed with the common purpose of presenting COVID-19 research outcomes. However, there is no open-access platform for sharing information specifically about the laboratory processes used in the battle against COVID-19. Researchers should not waste valuable time troubleshooting their experiments. In some cases, they may want to use a technique in their study design that they are not familiar with, or one outside their particular specialty, and thus lose time navigating the literature to find the right method or protocol to follow. This is where 'COVID-19 Bioassays in the Open Research Knowledge Graph' comes into play, offering the scientific community an open platform to share the various techniques used in research against COVID-19 and to access bioassays on demand. We are proud to have implemented this idea during the hackathon.
What we learned
Hackathons are excellent venues to test skills, platforms, and build collaborations.
What we need to continue the project
We have the main platform ready, incorporating six expert-curated COVID-19 bioassays semantified in the ORKG using semantic technologies. Winning the hackathon would give us the publicity needed to advertise the platform to the public and establish it as a reliable space to share and retrieve information about lab protocols currently in use for the COVID-19 pandemic.
What's next for Covid-19 Bioassays in the Open Research Knowledge Graph
We want to promote our idea to critical stakeholders in the development and use of bioassays, encouraging them to use the ORKG to create structured, semantified bioassays and thus enable a new era in the digital libraries of scholarly publications.
References

Beltagy, Iz, Kyle Lo, and Arman Cohan. 2019. "SciBERT: A Pretrained Language Model for Scientific Text." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Clark, Alex M., Barry A. Bunin, Nadia K. Litterman, Stephan C. Schürer, and Ubbo Visser. 2014. "Fast and Accurate Semantic Annotation of Bioassays Exploiting a Hybrid of Machine Learning and User Confirmation." PeerJ 2:e524. https://doi.org/10.7717/peerj.524