Title: Automatic Chemical Design Using Data Driven Molecular Representations

Who: Krishna Tewari(ktewari1), Ray Vecchio(rdelvecc), Cody Sims(csims)

Introduction: This paper attempts to automate the process of designing new molecules and optimizing their chemical properties. This is a major problem within biochemistry, as traditional techniques for developing new molecules without deep learning are time consuming and require extensive verification by a human being using a multitude of differing heuristics. A data-driven approach to representing molecules and exploring desirable variations drastically reduces the time required to create new molecules and decreases the computational power needed to parse existing molecular databases in search of specific optimal properties. Our group decided to implement this paper because it applies deep learning methodologies in a relatively new area, i.e. biochemistry. Moreover, the utility of deep learning techniques in this area is vast from both a methodological and a results standpoint: only on the order of 10^8 molecules have ever been synthesized, while the space of possible drug-like molecules is estimated to lie between 10^23 and 10^60, and that space can now be explored with a deep-learning-based approach. The problem being solved can mainly be characterized as an unsupervised learning task, as the researchers use VAEs to create a latent representation of existing molecular structures and sample from this representation to create new molecules with certain desirable properties.

Related Work: In terms of its place in the literature, this paper is the first to leverage a VAE to create new molecular structures with optimal properties, as the use of deep learning within the biochemical ecosystem is a relatively new phenomenon. However, much of the data used by the model has been used by numerous other researchers in other papers and has been publicly available for a considerable time.

An MIT article (linked below) describes the time-intensive nature of developing molecules suitable for drug discovery and the drastic boost in efficiency that deep learning's paradigm-shifting methodologies have brought to the development of new drugs. The traditional process lacked automation: selecting "lead" molecules with the specific advantageous properties required for new drug development demanded frequent human intervention, from the identification stage all the way through the verification/validation stage. Deep learning techniques allow biomedical researchers to largely mechanize the process of choosing a robust lead molecule, which lets them devote more time to choosing auxiliary molecules when developing new drug formulations. Moreover, deep learning in the biomedical context also enables a wider array of molecular combinations to be explored when developing new compounds than traditional industry practice allows, since a VAE enables rapid parsing and identification of millions of promising candidate combinations.

Article link: https://news.mit.edu/2018/automating-molecule-design-speed-drug-development-0706

Existing Implementation List:

  • https://github.com/Ishan-Kumar2/Molecular_VAE_Pytorch
  • https://github.com/cxhernandez/molencoder (original)
  • https://github.com/aksub99/molecular-vae
  • https://github.com/HIPS/molecule-autoencoder

Data: In our implementation of the research paper, we will use a set of 250,000 molecules drawn from the ZINC database, a publicly available collection of 3-D molecular structures; as in the original paper, the molecules are fed to the model as SMILES strings. At the intersection of deep learning and biomedicine, using a subset of the ZINC database is common practice in the existing literature. The dataset is quite large, which may lead to long and computationally expensive training, so our group will batch the inputs to lighten the computational load and reduce the carbon footprint of training the model. A rough sketch of this loading and batching step is shown below.
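As a sketch of how we might load and batch this data, assuming a plain-text file such as zinc_250k.smi with one SMILES string per line and the 120-character maximum SMILES length used in the original paper (the file name and helper names below are our own):

```python
# Minimal sketch of loading and batching a ZINC SMILES subset.
# Assumptions: a plain-text file "zinc_250k.smi" with one SMILES per line,
# and a fixed maximum SMILES length of 120 characters.
import torch
from torch.utils.data import DataLoader, TensorDataset

MAX_LEN = 120

def load_smiles(path="zinc_250k.smi"):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def one_hot_encode(smiles_list):
    # Build a character vocabulary from the data, padding with spaces.
    charset = sorted(set("".join(smiles_list)) | {" "})
    char_to_idx = {c: i for i, c in enumerate(charset)}
    data = torch.zeros(len(smiles_list), MAX_LEN, len(charset))
    for i, smi in enumerate(smiles_list):
        for j, c in enumerate(smi.ljust(MAX_LEN)[:MAX_LEN]):
            data[i, j, char_to_idx[c]] = 1.0
    return data, charset

smiles = load_smiles()
encoded, charset = one_hot_encode(smiles)
loader = DataLoader(TensorDataset(encoded), batch_size=256, shuffle=True)
```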

Methods: Since our project is an implementation of a research paper, we will use similar methods to construct and train our model. This means we will train a variational autoencoder (VAE) on our data, just as in the paper's implementation: a convolutional encoder maps SMILES strings into a continuous latent space, and a recurrent (GRU) decoder maps latent vectors back to SMILES strings.

The encoder for the ZINC dataset will consist of three 1D convolutional layers with 9, 9, and 10 filters and convolution kernels of widths 9, 9, and 11, followed by a single fully connected layer of width 196. The decoder has three GRU layers with a hidden dimension size of 488.
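A minimal PyTorch sketch of this encoder/decoder pair might look like the following. It assumes one-hot SMILES inputs of maximum length 120 and a 196-dimensional latent space; the class and variable names are our own, and this is an illustration of the architecture described above rather than the paper's reference implementation:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, charset_size, max_len=120, latent_dim=196):
        super().__init__()
        # Conv1d expects (batch, channels, length); channels = charset size.
        self.conv1 = nn.Conv1d(charset_size, 9, kernel_size=9)
        self.conv2 = nn.Conv1d(9, 9, kernel_size=9)
        self.conv3 = nn.Conv1d(9, 10, kernel_size=11)
        conv_out = 10 * (max_len - 9 - 9 - 11 + 3)   # length left after the convolutions
        self.fc = nn.Linear(conv_out, 196)
        self.mu = nn.Linear(196, latent_dim)
        self.logvar = nn.Linear(196, latent_dim)

    def forward(self, x):                      # x: (batch, max_len, charset_size)
        h = x.transpose(1, 2)                  # -> (batch, charset_size, max_len)
        h = torch.relu(self.conv1(h))
        h = torch.relu(self.conv2(h))
        h = torch.relu(self.conv3(h))
        h = torch.relu(self.fc(h.flatten(1)))
        return self.mu(h), self.logvar(h)      # parameters of the latent Gaussian

class Decoder(nn.Module):
    def __init__(self, charset_size, max_len=120, latent_dim=196):
        super().__init__()
        self.max_len = max_len
        self.fc = nn.Linear(latent_dim, latent_dim)
        self.gru = nn.GRU(latent_dim, 488, num_layers=3, batch_first=True)
        self.out = nn.Linear(488, charset_size)

    def forward(self, z):
        h = torch.relu(self.fc(z))
        # Repeat the latent vector at every timestep and let the GRU unroll.
        h = h.unsqueeze(1).repeat(1, self.max_len, 1)
        h, _ = self.gru(h)
        return self.out(h)                     # per-timestep logits over the charset
```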

The encoder used for the QM9 dataset has three 1D convolutional layers with 2, 2, and 1 filters and convolution kernels of widths 5, 5, and 4, followed by a single fully connected layer of width 156. The decoder's three recurrent (GRU) layers will have a hidden dimension size of 500. For property prediction, two fully connected layers of 1000 neurons each, with a dropout rate of 0.2, will be used to predict properties from the latent representation. For the model trained on the ZINC dataset, the objective properties will be logP, QED, and SAS. For the model trained on the QM9 dataset, the objective properties will be HOMO energies, LUMO energies, and the electronic spatial extent.
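The property predictor could be sketched in the same style; the class name and the default of three target properties (e.g. logP, QED, and SAS for ZINC) are our own assumptions:

```python
import torch.nn as nn

class PropertyPredictor(nn.Module):
    """Two fully connected layers of 1000 units with dropout 0.2,
    mapping a latent vector to the target property values."""
    def __init__(self, latent_dim=196, n_properties=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 1000), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(1000, 1000), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(1000, n_properties),
        )

    def forward(self, z):
        return self.net(z)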

Metrics: In our project, "success" is achieved when we are able to discover chemically possible molecules that are similar to a given compound. The original paper ran a few tests to verify this, and we plan to replicate them. First, we will provide ibuprofen, a very common drug, as an input and look at the molecules the decoder outputs as similar to it. Figure 2c in the original paper shows these molecules, so if ours are the same, similar, or valid molecules that appear to originate from ibuprofen, we will call it a success. In addition, we can provide the model with a few other common molecules (benzene, methane, ethanol, etc.), examine the outputs, and go through a similar process of verifying that they are chemically valid and similar, to determine success over a variety of different molecules. Further, as discussed in class, the paper also tested its model by Euclidean interpolation across the latent space (Figure 2d). Another way we can verify our model is working as intended is by sampling along a line between two molecules in the latent space and observing the structures along this vector; if they appear to evolve as a gradient from one to the other, we know our model is learning a sensible latent space.

Similar to our definition of success, our notion of accuracy can be defined in two ways: the fraction of output molecules that are chemically valid, and the similarity of output molecules to an input molecule. The first metric can be quantified: if we determine that a structure is not valid, we count it as such and then compute the percentage of valid versus invalid outputs. In fact, the only quantitative accuracy measure used in the original paper was the "% of 5000 latent space points that decode to valid molecules". The second measure is largely qualitative, since molecular similarity is hard to reduce to a single number; we will have to judge based on visual features alone, over a wide array of input molecules, to conclude whether our model can generally produce the desired similarity in outputs. We plan on using both methods to evaluate the overall accuracy of our project.
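The quantitative validity metric could be computed roughly as follows. This sketch assumes RDKit for validity checking and a hypothetical decode_to_smiles helper on our trained model; how the latent points are sampled is also an assumption on our part:

```python
import torch
from rdkit import Chem

def percent_valid(model, latent_dim=196, n_samples=5000):
    """Decode latent points and report the percentage that parse as valid SMILES."""
    valid = 0
    for _ in range(n_samples):
        z = torch.randn(1, latent_dim)            # sample a latent point
        smiles = model.decode_to_smiles(z)        # hypothetical decoding helper
        if Chem.MolFromSmiles(smiles) is not None:  # RDKit returns None for invalid SMILES
            valid += 1
    return 100.0 * valid / n_samples
```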

Ethics: Deep learning is a wonderful approach to the problem of molecule discovery. As detailed above, finding molecules with certain properties currently demands a great deal of human time, effort, and energy spent on detailed analyses of molecular features. Once trained, our model will be able to instantly produce molecules with features similar to an input molecule, providing scientists with a plethora of possibilities to choose from in their research. Since a large quantity of known molecules is already available, we can train our model on them to produce novel ones. By cutting out the time and challenge of identifying candidate molecules, our model will allow scientists to focus on testing candidates' viability rather than searching for them. We believe the largest application of the technology would be drug discovery, the process of identifying molecules that will serve as effective drugs against a disease. The major stakeholders in this application would be chiefly scientists, and secondarily patients whose drugs are based on the model's findings. First, the scientist is a stakeholder because they need to test and validate each molecule the model outputs. The model will not be perfect, and there is a chance that output molecules are invalid, impossible to create or maintain at standard temperature and pressure, too dangerous to be used as drugs, too expensive to manufacture, or problematic in a host of other ways. Thus, researchers are the first people impacted, because they will have to study the properties of the model's compounds before those compounds can be used; because of this, we have an incentive to produce a model that outputs the most accurate molecular representations possible to limit these problems. After the research phase, patients will be exposed to the molecules in drug form. If a scientist believes a molecule can make a feasible drug, it will enter the clinical research phase, where in-vivo trials are conducted. Researchers never know exactly how a drug will react in the human body, and there is a real possibility of side effects and harmful outcomes. Selecting drugs from an algorithm may seem ethically dubious; however, since every candidate will still be vetted by human researchers, we believe the tool is safe enough to be used for this purpose. The same concerns may arise if a molecule is used for other purposes: pesticides, dyes, plastics, etc. However, just like current drug discovery methods, our model will act only as a tool for scientists to find potential molecules to work with, not as a definitive selector for a given purpose. Ultimately, the technology simply speeds up the discovery process, since it will not be used to identify compounds for immediate manufacture with no testing. If ill effects were to occur, they would be a symptom of poorly conducted research, not an inherent problem with the deep learning model itself.

Division of labor:

  • Cody: Data preprocessing
  • Ray: Developing the model
  • Krishna: Training / Testing
  • All: Data Analysis
  • For this portion, Krishna did Introduction, Related Work, and Data, Ray did Ethics and Metrics, and Cody did Methods.
