Inspiration

The way we discover drugs and vaccines is extremely inefficient. Something has to be done to reduce the impact of future pandemics or biological warfare. We are accelerating drug discovery by using machine learning algorithms to generate retrosynthesis pathways for drug molecule design and development.

Try it out

VAE implementation on potential drug target

VAE implementation on random data

LightDock implementation

What it does

We are accelerating drug discovery by using machine learning algorithms to generate retrosynthesis pathways for drug molecule design and development. The workflow calculates minimum-energy conformations of a potential candidate; calculates descriptors and shortlists them based on the correlation coefficient, cross-correlation coefficient, dissimilarity distances, cluster analysis, and the genetic function approach; and calculates drug-likeness and ADME properties, applying Lipinski’s rule for shortlisting.
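As an illustration of the shortlisting step, here is a minimal sketch of a Lipinski rule-of-five filter. It assumes the four descriptors have already been computed by a cheminformatics toolkit; the `Descriptors` class, the function names, and the candidate values are hypothetical and are not taken from the project code.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one candidate (hypothetical container)."""
    mol_weight: float       # molecular weight in daltons
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int      # number of hydrogen-bond donors
    h_bond_acceptors: int   # number of hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five limits the candidate exceeds."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_lipinski(d: Descriptors, max_violations: int = 1) -> bool:
    """The rule tolerates at most one violation by default."""
    return lipinski_violations(d) <= max_violations


# Made-up candidates, for illustration only.
candidates = {
    "candidate_a": Descriptors(350.4, 2.1, 2, 5),
    "candidate_b": Descriptors(720.9, 6.3, 6, 12),
}
shortlist = [name for name, d in candidates.items() if passes_lipinski(d)]
print(shortlist)
```

In a real run the descriptor values would come from the descriptor-calculation step described above rather than being typed in by hand.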

How we built it

1. Cheminformatics in Python: Predicting Solubility of Molecules | End-to-End Data Science Project

In this Kaggle notebook, we dive into cheminformatics, which lies at the interface of informatics and chemistry. We reproduce a research article by John S. Delaney by applying linear regression to predict the solubility of molecules (the solubility of a drug is an important physicochemical property in drug discovery, design, and development). The idea for this notebook was inspired by the excellent blog post by Pat Walters, in which he reproduced the linear regression model with a degree of performance similar to Delaney's. This example is also briefly described in the book Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More.
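The regression idea can be sketched without any external library. The example below fits ordinary least squares by solving the normal equations on a toy table of four descriptors per molecule (logP, molecular weight, rotatable bonds, aromatic proportion). The descriptor choice is our reading of Delaney's setup, and the data and weights are invented for illustration; none of the numbers are results from the notebook.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares with an intercept, via the normal equations.

    Solves (X^T X) w = X^T y by Gaussian elimination with partial pivoting.
    Assumes the design matrix has full column rank.
    """
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w


def predict(w, x):
    """Predicted log-solubility for one descriptor row."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


# Toy descriptor rows: (logP, molecular weight, rotatable bonds, aromatic proportion).
X = [
    (1.0, 100.0, 0, 0.0), (2.0, 150.0, 1, 0.5), (0.5, 200.0, 3, 0.2),
    (3.0, 250.0, 2, 0.8), (-1.0, 120.0, 5, 0.0), (1.5, 300.0, 4, 0.6),
    (2.5, 180.0, 6, 0.3),
]
# Arbitrary weights used only to generate toy targets (not fitted values).
true_w = [0.2, -0.6, -0.006, 0.07, -0.7]
y = [predict(true_w, x) for x in X]

w = fit_linear_regression(X, y)
print([round(v, 4) for v in w])
```

Because the toy targets are generated without noise, the fit should recover the generating weights up to floating-point error; real solubility data would of course leave a residual.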

2. Subtyping, COVID-19 Therapeutic Research Findings

The goal of this exercise is to study the literature provided by the Kaggle COVID-19 challenge organizing team and to subtype the COVID-19 therapeutic research findings. Specifically, we carried out the following four-part analysis:

Part A. Drugs that have been used in clinical trials for COVID-19. We identified and characterized the drugs in clinical trials by integrating the FDA drug database and PubChem repository. We hand-curated and summarized the reported effectiveness of each drug. We presented the mutual similarity of chemical structures across the drugs used in clinical trials. We categorized the drugs based on their molecular mechanisms, which can facilitate the discovery of related drugs with similar mechanisms and the creation of an effective cocktail treatment.
Category 1. RNA mutagens
Category 2. Protease inhibitors
Category 3. Virus-entry blockers
Category 4. Virus-release blockers
Category 5. Monoclonal antibodies
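The mutual similarity of chemical structures in Part A can be illustrated with the Tanimoto coefficient on fingerprint bit sets. This is a sketch under assumptions: the write-up does not say which similarity measure or fingerprint type was used, and the bit sets and drug labels below are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_matrix(fingerprints):
    """Pairwise similarity for a dict of name -> fingerprint bit set."""
    names = sorted(fingerprints)
    return {(m, n): tanimoto(fingerprints[m], fingerprints[n])
            for m in names for n in names}


# Hypothetical fingerprints (indices of set bits), for illustration only.
toy_fingerprints = {
    "drug_x": {1, 2, 3},
    "drug_y": {2, 3, 4},
    "drug_z": {10, 11},
}
matrix = similarity_matrix(toy_fingerprints)
print(matrix[("drug_x", "drug_y")])  # 2 shared bits out of 4 distinct bits
```

In practice the bit sets would be derived from the PubChem or FDA structures by a fingerprinting tool, and the resulting matrix could be clustered or plotted as a heat map.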

Part B. Drugs that have been proposed by computational works. We identified the computational publications for COVID-19 drugs, categorized their approaches into the following categories, and listed their previous applications in other disease domains and their potential limitations.
Category 1. Gene-gene network-based algorithms
Category 2. Expression-based algorithms
Category 3. Docking simulation or protein structure-based approaches for
Category 3.a. Small molecules
Category 3.b. Monoclonal antibodies

Part C. Drugs that have been proposed by in vitro experiments of COVID-19 invading human cells. We characterized the chemical structures and analyzed the chemical similarity for this group. For this list, in addition to literature mining, we carried out a machine learning experiment to prioritize previously unexplored FDA-approved drugs (to circumvent ADMET evaluation) for repurposing. After removing the contaminations by hand, we identified the following top candidates for repurposing: OLUMIANT (baricitinib), used to treat rheumatoid arthritis; BRIMONIDINE, used to treat glaucoma; EDURANT (rilpivirine), used to treat Human Immunodeficiency Virus-1 (HIV-1); MARPLAN, used to treat depression; and Corlanor (ivabradine), used to reduce the spontaneous pacemaker activity of the cardiac sinus node. We listed the potential contaminations/biases in this and related protein binding-associated approaches.

Part D. Epitope study for vaccines. We categorized vaccine studies by their approaches and discussed the background and limitations concerning evolution:

Approach 1. Homology-based with SARS-CoV (the 2003 version of SARS), other coronaviruses, or Ebola.

Approach 2. Immunoinformatics, including docking, molecular dynamics, protein structures, and antigenicity predictions. We hand-curated a list of 147 epitopes from these publications and their supplementary materials, and grouped them by the source virus proteins, human T-cell/B-cell targets, and MHC class. We merged all published epitopes into 124 consolidated groups by partial sub-sequence search, and into 91 unique virus protein sequence regions by a breadth-first search (BFS) algorithm. We hope the above lists will serve as the 'wisdom-of-the-crowd' reference for vaccine development.

DISCUSSION LIST

Join our discussion forum on DRUG & VACCINE R&D WITH AI & MACHINE LEARNING

INSTALLATION AND DATA REQUIREMENTS

Check the following datasets. They have been made public by our R&D department, and some were already public. Feel free to explore the data and augment it. All of these datasets are either already included in the Kaggle kernels or are downloaded by them.

  1. Delaney's solubility dataset
  2. COVID-19 Open Research Dataset Challenge (CORD-19)
  3. drugbank.ca dataset
  4. drugbank.ca-chunk dataset
  5. drugbank.ca-csv-chunk dataset

Challenges we ran into

Part A. Subtyping drugs currently in clinical trials

A.1 Methods: We first counted how many times each FDA drug occurred in the documents provided by Kaggle.

A.2.1 The number of publications in which each drug appeared. The top ones, >=100 times, are (full list in sorted_alresult):
103 hydrocortisone
106 ritonavir
111 prednisolone

A.2.2 The drugs that have been related to coronavirus in the literature. The top ones, >10 times, are (full list in sorted_alresult.coronavirus):
10 times: amoxicillin
10 times: fluorouracil
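The counting step in A.1 can be sketched as follows. This is a simplified stand-in rather than the notebook's code: it assumes each document is available as a plain-text string and that a drug counts once per document when its name appears as a whole word, case-insensitively. The function name and sample texts are hypothetical.

```python
import re
from collections import Counter


def count_drug_mentions(documents, drug_names):
    """Count, for each drug name, the number of documents that mention it."""
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in drug_names
    }
    counts = Counter()
    for text in documents:
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1  # one count per document, not per occurrence
    return counts


# Made-up snippets standing in for the literature set.
docs = [
    "Ritonavir and lopinavir were given to the patients.",
    "A second study evaluated ritonavir alone; ritonavir was well tolerated.",
    "This abstract does not mention any drug.",
]
mentions = count_drug_mentions(docs, ["ritonavir", "lopinavir", "hydrocortisone"])
for name, n in mentions.most_common():
    print(n, name)
```

Sorting the counter with `most_common()` gives the kind of ranked list shown above; a restricted count for coronavirus-related papers would apply the same function to a filtered subset of documents.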

A.2.4.1 RNA mutagens
Viruses need to copy themselves in order to invade the host and transmit (like cancer cells), so it makes sense that mutagens that block the copying can be used as drugs.
Remdesivir: It was studied in many publications related to coronavirus. It was suggested to be highly effective in controlling 2019-nCoV infection in vitro, while its cytotoxicity remains under control.

A.2.4.2 Protease inhibitors
Ritonavir: It was suggested to inhibit proteases and thus block multiplication of the virus. It was reported to deliver a substantial clinical benefit for COVID-19 patients (0562f70516579d557cd1486000bb7aac5ccec2a1.json), and its effectiveness is suggested by computational docking studies.

Lopinavir: Lopinavir is a protease inhibitor. It was reported to have substantial benefit for treating COVID-19 patients.

A.3 Limitations
The above analysis has the following limitations: We used a rather early version of the literature set (because the searching step took quite a long time), and some popular drugs, e.g. hydroxychloroquine, are discussed but without a clear clinical conclusion yet. The literature could be substantially biased towards positive results and by computational methods (discussed below).

Part B. Subtyping computational approaches that are used to propose drug candidates
We then subtyped the computational methods developed to repurpose drugs for COVID-19.

B.1 Methods
While reading the literature curated in Part A, we came across computational studies that focus on predicting drugs suitable for repurposing for COVID-19. These works tend to propose many drugs.

B.2 Results
B.2.1 Gene-gene network-based approaches
Example: https://www.nature.com/articles/s41421-020-0153-3 repurposed drugs by network approaches based on homology analysis to other viruses. The authors proposed 16 potential drugs: Irbesartan, Toremifene, Camphor, Equilin, Mesalazine, Mercaptopurine, Paroxetine, Sirolimus, Carvedilol, Colchicine, Dactinomycin, Melatonin, Quinacrine, Eplerenone, Emodin, Oxymetholone.
Background: Network-based drug response prediction has been intensively used in the cancer area and was shown to excel in several benchmarks.
B.2.2 Expression-based approaches
B.2.3 Docking or structure-based approaches
B.2.3.1 Small molecule prediction
B.2.3.2 Monoclonal antibody prediction

B.3 Limitations
Computational studies tend to propose many drugs in a single article, sometimes hundreds. Most of the works adopted methods from other pharmacogenomics fields that were previously developed for cancers. We are not aware that these approaches have generated hypotheses that are used in real-world clinical trials, even in popular fields, e.g. cancer and Alzheimer's. Thus, use them with caution.

Part C. Drugs proposed by in vitro experiments

C.1 Methods
C.1.1 Data curation
Other than the drugs used in clinical trials and computational methods, we found an interesting study that carried out genome-wide in vitro binding screening of the virus proteins and human proteins, and proposed 37 drugs that directly ...

C.1.2 Construction of training set
We carried out a machine learning exercise, with the hypothesis that the drugs that will be potentially effective should overlap globally in function with these drug targets. We could extract the chemical structure of 34 of the 37 drugs proposed by the authors, and these are used as positive examples. The second positive set is the combination of the first positive set and four other drugs that are currently under clinical trial and whose chemical structure can be extracted: remdesivir, hydroxychloroquine, favipiravir, and Vitamin C, and thus 38 in total. The negative training set, which is also the candidate set, is constructed using the FDA approved list, which was downloaded in Oct 2019 from https://www.accessdata.fda.gov/scripts/cder/daf/index.cfm. This list has a total of 7305 drugs, for 5596 of which we could obtain the fingerprint structure.

C.1.3 Nested CV to prioritize drug candidates
For each round, we randomly selected 80% of the examples for training and 20% for testing. The prediction scores for the test set were recorded in each round. We repeated this process 20 times, ensuring all examples occurred in the test set (100 experiments in total). Then the average score of each example was taken as its final prediction score.

C.2 Results
C.2.1 Top candidates in FDA approved drugs
Among the FDA approved drugs, we identified some top candidates that do not exist in the training gold standard. We hand-searched the literature for each of the top candidates with a probability >0.05 (55 in total). Most of them come from contaminations, i.e., overlapping with an example in the training set even though the drug appears with a ...

C.3 Limitations and biases in the finding
Drugs proposed by in vitro or computational protein targets/gene-gene network approaches are definitely biased towards targeted therapies in cancers, because these drugs were intensively screened in cell line experiments. This is true for both the above list and probably the original list proposed through the binding experiments, and certainly other studies. Second, low scores only mean that the drugs are not similar to others being investigated in the study, not that they are not useful. Remdesivir had a high score of 0.09 (we are not sure if this is an implicit contamination from the training set); the others had low scores, including Vitamin C, hydroxychloroquine, and favipiravir.

Part D. Epitope study for vaccines

D.1 Methods
We identified all paragraphs that contain the word vaccine and COVID-19/SARS-COV-2. Then we looked through each of the abstracts. If an abstract was deemed relevant, we went to the original paper and recorded its methods and proposed epitopes.

D.2 Results
D.2.1 Subtyping major approaches in vaccine research
D.2.1.1 Homology-based approach
D.2.1.2 Immunoinformatics
D.2.2 Compiled list of epitopes across the above publications
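The score-averaging procedure in C.1.3 can be read as 5-fold cross-validation repeated 20 times, which is an assumption on our part that matches the stated 80%/20% split and 100 experiments. The sketch below follows that reading. The scorer `mean_similarity_to_positives` is a hypothetical placeholder (mean Tanimoto similarity to the positive training examples), not the model used in the study, and the fingerprints are toy data.

```python
import random
from collections import defaultdict


def repeated_kfold_scores(examples, labels, fit_predict,
                          n_repeats=20, n_folds=5, seed=0):
    """Average held-out prediction score per example over repeated k-fold CV."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    counts = defaultdict(int)
    indices = list(range(len(examples)))
    for _ in range(n_repeats):
        rng.shuffle(indices)
        folds = [indices[k::n_folds] for k in range(n_folds)]
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            scores = fit_predict(
                [examples[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [examples[i] for i in test_idx],
            )
            for i, score in zip(test_idx, scores):
                totals[i] += score
                counts[i] += 1
    return [totals[i] / counts[i] for i in range(len(examples))]


def mean_similarity_to_positives(train_x, train_y, test_x):
    """Placeholder scorer: mean Tanimoto similarity to positive training examples."""
    positives = [set(x) for x, y in zip(train_x, train_y) if y == 1]

    def tanimoto(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    if not positives:
        return [0.0 for _ in test_x]
    return [sum(tanimoto(set(x), p) for p in positives) / len(positives)
            for x in test_x]


# Toy fingerprints: 4 positives followed by 6 candidate/negative examples.
toy_x = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 5}, {2, 3},
         {7, 8, 9}, {7, 8}, {8, 9, 10}, {7, 9}, {9, 10}, {7, 10}]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
final_scores = repeated_kfold_scores(toy_x, toy_y, mean_similarity_to_positives)
print([round(s, 3) for s in final_scores])
```

Each example lands in the test fold exactly once per repeat, so every final score is an average over 20 held-out predictions. Candidates from the negative set with unexpectedly high averages are the ones that would then be checked by hand.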

Accomplishments that we're proud of

D.2.4 Compiling the unique protein regions where epitopes have been identified from various publications by BFS
Here we find the unique virus protein regions where epitopes have been identified across publications, using partial sub-sequence overlap. The difference between this section and D.2.3 is the following: when epitopes A and B overlap, and B and C overlap, but A and C do not overlap substantially, the previous section treats them as separate groups, because there we were trying to find non-overlapping peptides. In this section they are placed in the same group, because they lie in the same protein region. The resulting groups are the unique protein regions where epitopes have so far been identified.
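The grouping described in D.2.4 amounts to finding connected components of an overlap graph, which a breadth-first search does directly. The sketch below is a minimal illustration: the overlap criterion (a shared contiguous stretch of at least `min_len` residues) and the peptide strings are assumptions made for the example, not the notebook's actual threshold or data.

```python
from collections import deque


def shares_subsequence(a, b, min_len=4):
    """True if peptides a and b share a contiguous stretch of >= min_len residues."""
    if len(a) < min_len or len(b) < min_len:
        return a in b or b in a
    kmers = {a[i:i + min_len] for i in range(len(a) - min_len + 1)}
    return any(b[i:i + min_len] in kmers for i in range(len(b) - min_len + 1))


def group_epitopes_bfs(epitopes, min_len=4):
    """Group epitopes into protein regions: connected components found by BFS."""
    seen = set()
    groups = []
    for start in range(len(epitopes)):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:
            i = queue.popleft()
            group.append(epitopes[i])
            for j in range(len(epitopes)):
                if j not in seen and shares_subsequence(epitopes[i], epitopes[j], min_len):
                    seen.add(j)
                    queue.append(j)
        groups.append(group)
    return groups


# Made-up peptides: A overlaps B, B overlaps C, A and C do not overlap, D is isolated.
peptides = ["ACDEFGHIK", "FGHIKLMNP", "LMNPQRSTV", "WYWYWYWY"]
regions = group_epitopes_bfs(peptides)
print(regions)
```

With this toy input, A, B, and C fall into one region through the chain of overlaps, while a stricter pairwise grouping in the spirit of D.2.3 would keep A and C apart.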

What we learned

Summary points and future recommended research topics for Phase 2.

Conclusion 1. There is not a single drug for which a consistent positive response has been reported.

Conclusion 2. There are overlaps between the drugs in clinical trials, those proposed by computational analysis, and those proposed by in vitro experiments. However, some of the overlaps, especially those with computational analysis, may come from circularity in the methods.

Conclusion 3. Drug candidates proposed by computation and in vitro screening could be biased towards cancer-related targeted therapy and substantially contaminated by existing literature or sometimes anecdotes. This bias/contamination may affect a significant number of computation-based drug-repurposing studies, including our work, and is certainly not limited to COVID-19.

What's next for COVID-19 drug and vaccine r&d with AI & Machine Learning

Future direction 1. Disagreement in the reported drug response can stem from differences in dosage, baseline biometrics, and population groups. With more clinical trial results coming in, the next step is to carry out a meta-analysis to stratify these variables.

Future direction 2. Analyzing vaccine findings at this stage is premature, as there is no clinical effectiveness study yet. It would be meaningful to treat genome variation and vaccines (and maybe antibodies as well) as a single topic, so that genome variations can be connected to the fraction of virus strains that a vaccine could cover.

Future direction 3. We suggest a topic on news (e.g., Google News) retrieval for therapeutic development, as many (if not most) treatment responses may not first appear in manuscripts.

Finally, we would like to take this opportunity to make one comment: the literature tends to be biased towards reporting positive results, known biology (e.g., cancer and immune drugs), and anecdotes, and we should read the results of this exercise and other documents critically.
