Title: Automatic Chemical Design Using Data Driven Molecular Representations
Who: Krishna Tewari(ktewari1), Ray Vecchio(rdelvecc), Cody Sims(csims)
Where: https://github.com/xrayd/1470-Final-Project
Writeup PDF: https://drive.google.com/file/d/1sLVlZDspmGurxw2XICPlXXzLZVDNFj0p/view?usp=sharing
Introduction: This project attempts to automate the process of designing new molecules and optimizing their chemical properties. This is a major problem within biochemistry, as traditional techniques for developing new molecules without deep learning are time consuming and require extensive computational verification by a human being using a multitude of differing heuristics. A data-driven approach to representing molecules and exploring desired permutations drastically reduces the time required to create new molecules, and also decreases the computational power needed to parse existing molecular databases in search of specific optimal properties. Our group decided to implement this paper because it applies deep learning methodologies in a relatively new area, i.e. biochemistry. Moreover, the utility of deep learning techniques in this area is vast from both a methodological and a results standpoint: only about 10^8 molecules have ever been synthesized, even though the number of possible drug-like molecules is estimated to be between 10^23 and 10^60, a space that can now be explored with a deep-learning-based approach. The problem being solved can mainly be characterized as an unsupervised learning task, as the researchers use VAEs to create a latent representation of existing molecular structures and sample from this representation to create new molecules with certain desirable properties.
Methods: To begin developing and training our model, we first needed to obtain our data: molecules in the SMILES (simplified molecular-input line-entry system) string format, a notation commonly used in chemistry to represent a molecule as text. The dataset we used is the ChEMBL 22 dataset, consisting of ~1,300,000 unique molecules. We processed it in the following steps. First, we isolated a data frame consisting of only the SMILES strings for each molecule. Then, 400,000 were selected at random to use in training. From the strings, a dictionary was created of all characters present in all strings. To homogenize the size of our strings, we capped the length of a molecule at 120 characters and padded any shorter strings with whitespace. Then, the dictionary was used to “featurize” each molecule, creating a one-hot encoded array representing each character in each string. This featurization allows the model to better detect the placement of different characters in a string, thus allowing it to learn how to build new valid molecules in SMILES format. In the end, our final dataset had the shape (400000, 120, dict_length), where 400,000 is the number of molecules and each (120, dict_length) slice represents a molecule in its featurized form. This dataset, along with the dictionary, was saved to an h5py file so that we do not have to pre-process every time we run the model.
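As a concrete illustration of the featurization step, the sketch below builds a character dictionary and one-hot encodes a few hypothetical SMILES strings. The real dictionary comes from the full 400,000-string sample; `MAX_LEN`, `featurize`, and the toy molecules here are illustrative names and inputs, not our actual code.

```python
import numpy as np

# Hypothetical mini-batch of SMILES strings; the real data comes from ChEMBL 22.
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]

MAX_LEN = 120  # length cap used in our pre-processing

# Build the character dictionary from every string, plus the whitespace pad.
charset = sorted(set("".join(smiles)) | {" "})
char_to_idx = {c: i for i, c in enumerate(charset)}

def featurize(s, max_len=MAX_LEN):
    """Pad to max_len with whitespace, then one-hot encode each character."""
    s = s.ljust(max_len)[:max_len]
    one_hot = np.zeros((max_len, len(charset)), dtype=np.float32)
    for pos, ch in enumerate(s):
        one_hot[pos, char_to_idx[ch]] = 1.0
    return one_hot

data = np.stack([featurize(s) for s in smiles])
print(data.shape)  # (num_molecules, 120, dict_length)
```

With the full dataset, `data` would have shape (400000, 120, dict_length), matching the array we save to the h5py file.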
Since our project is an implementation of a research paper, we use similar methods to construct and train our model. Following the paper, we train a variational autoencoder (VAE) on our data. Our VAE consists of two sub-models: an encoder and a decoder.
The encoder for the ChEMBL 22 dataset consists of three 1D convolutional layers with kernel sizes of 11, 9, and 9 respectively. After running the input data through the CNN layers, we flatten the output. The decoder consists of three dense layers and a GRU layer: we first run the input through a dense layer before passing it into the GRU layer and two more dense layers. Within our VAE, we also have a mu layer and a logvar layer, two dense layers used for reparameterization. Our hyperparameters are a latent size of 292 and a hidden layer size of 512. For training, we use a cyclical learning rate that alternates between 0.01 and 0.001, switching every 10,000 samples. We compare the VAE's reconstruction to the data within the ChEMBL 22 dataset to get our VAE loss. After training the model on over 400,000 samples, we can make predictions, and we have a helper method to interpret the result. Ultimately, when called, our model outputs a probability distribution over each character for each position in the SMILES string, which allows us to generate new molecules.
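The reparameterization performed by the mu and logvar layers, and the KL term it feeds into, can be sketched in plain NumPy. This is a minimal illustration under our hyperparameters (latent size 292), not the actual TensorFlow layers; `reparameterize` and `kl_divergence` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT = 292  # our latent size

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); writing sigma as
    exp(0.5 * logvar) keeps the sampling step differentiable with
    respect to mu and logvar in a real autodiff framework."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    """Per-example KL(q(z|x) || N(0, I)), the regularization term
    added to the reconstruction loss in the VAE objective."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)

# Hypothetical encoder outputs for a batch of 4 molecules.
mu = rng.standard_normal((4, LATENT))
logvar = rng.standard_normal((4, LATENT))
z = reparameterize(mu, logvar, rng)
print(z.shape)  # (4, 292)
```

The decoder then maps each sampled `z` back to a (120, dict_length) grid of character probabilities.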
In our project, “success” is achieved when we are able to discover chemically possible molecules that are similar to a given compound. The original paper ran a few tests to verify this, and we plan to replicate them. First, we will provide Ibuprofen, a very common drug, as an input and look at what the decoder outputs as similar molecules. Figure 2c in the original paper shows all of these molecules, so if ours are the same, similar, or valid molecules that appear to originate from Ibuprofen, we will call it a success. In addition, we can provide the model with a few other common molecules (benzene, methane, ethanol, etc.), examine the outputs, and go through a similar process of verifying that they are chemically similar and valid, to determine success over a variety of different molecules. Further, as discussed in class, the paper also tested their model by Euclidean interpolation across the latent space (figure 2d). Another way we can verify that our model is working as intended is by sampling along a line between two molecules in the latent space and observing the structures across this vector. If they seem to evolve as a gradient from one to another, we know that our model is learning a correct latent space. Similar to our definition of success, our notion of accuracy can be defined two ways: the fraction of output molecules that are chemically valid, and the similarity of output molecules to an input molecule. The first metric can be quantified: if we determine that a structure is not valid, we can count it as such, then find the percentage of valid to invalid outputs. In fact, the only quantified accuracy measure used in the original paper was the “% of 5000 latent space points that decode to valid molecules.” The second measure is largely qualitative, since there is no simple way to quantify molecule similarity.
In this, we’ll have to judge based on visual features alone over a wide array of input molecules to come to a conclusion on whether or not our model can generally produce the desired similarity in outputs. We plan on using both methods to evaluate the overall accuracy of our project.
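To make the quantitative accuracy measure concrete, the sketch below counts the percentage of decoded strings that pass a crude syntactic check. A real validity test would parse each string with a cheminformatics toolkit such as RDKit (`Chem.MolFromSmiles`); this balanced-bracket heuristic is only an illustrative stand-in.

```python
def looks_syntactically_valid(smiles):
    """Crude syntax check: balanced () and [], and no whitespace
    padding leaking into the middle of the string. A real check
    would attempt to parse the string with RDKit."""
    depth_paren = depth_brack = 0
    for ch in smiles.strip():
        if ch == "(":
            depth_paren += 1
        elif ch == ")":
            depth_paren -= 1
            if depth_paren < 0:
                return False
        elif ch == "[":
            depth_brack += 1
        elif ch == "]":
            depth_brack -= 1
            if depth_brack < 0:
                return False
        elif ch == " ":
            return False  # pad character inside the molecule
    return depth_paren == 0 and depth_brack == 0

def percent_valid(outputs):
    """Percentage of decoded strings passing the check (cf. the paper's
    '% of 5000 latent space points that decode to valid molecules')."""
    return 100.0 * sum(map(looks_syntactically_valid, outputs)) / len(outputs)

print(percent_valid(["CCO", "CC(=O)O", "CC(=O)O)", "C C"]))  # 50.0
```

Running this over a batch of 5000 decoded latent points would reproduce the paper's validity metric, modulo the much stricter chemistry rules a real parser enforces.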
Results: The model is able to generate unique SMILES strings when we pass in a specific chemical molecule as input. The model is also learning specific patterns within the dataset, such as the fact that carbons are more common towards the beginning of each molecule and padding spaces are more prevalent at the end of the generated compound; patterns such as these are reflected in the molecules the model generates. From an accuracy standpoint, the model performs reasonably well: the compounds it generates are relatively similar to the input molecules but still show some elemental differences, which is a major positive, as the model has the ability to generate new molecules that have not been explored within the existing biomedical literature. At a micro level, the sub-components of the final output are relatively accurate, with similar clumps of elements grouped together; this is encouraging, as the dataset contains a plethora of molecules with double- and triple-bonded elements. The model currently does not seem to be overfitting or underfitting the data: there are enough structural similarities between the input and output molecules, while the output molecules still have unique, desirable chemical properties that the input molecules may not contain. From a loss perspective, the model gradually reduces the loss throughout training from a starting point of 600 to 220. Overall, the results are positive, with a few minor hiccups along the way that serve as points of improvement for the future.
Challenges: The main challenge we have faced has been simplifying the model so that we could implement it effectively, as the original implementation includes numerous components that were not covered within the scope of this class. The original model relies heavily on custom implementations of certain deep learning concepts, such as a TimeDistributed class, which applies a layer to every timestep of the decoder output, and a ReduceLROnPlateau class, which lowers the learning rate when training starts to plateau. Moreover, choosing the right simplifications so that we do not sacrifice the interpretability of our results or the model itself has been challenging: even though we are streamlining the model in some areas, we still want to generate accurate molecules with certain desirable properties. From our training data, our model seems to be learning that carbon atoms and whitespace are important (carbon is the most common element in SMILES strings, and we padded strings with whitespace); however, it initially output mostly these two characters. We were able to fix this by sampling over the output character distribution instead of using argmax selection, yet our model has still had some trouble producing valid SMILES strings. To produce a valid SMILES string (one that represents a real possible molecule), every character must be in the correct position with correct syntax. Since our model generates new strings from a probability distribution, there is a high chance that one character will be out of place, rendering the output useless. While the model does learn the relative probabilities and locations of characters during training, it does not know how to put them all together in a valid manner. For example, it often places three lowercase C's next to each other, “ccc”, which is an invalid aromatic ring.
Further, the model does not differentiate or correctly place opening and closing parentheses and brackets, which often breaks the output string.
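The argmax-to-sampling fix described above can be sketched as follows, using a toy character set and randomly generated per-position probabilities in place of real decoder outputs. The `temperature` knob is an optional extra for sharpening or flattening the distribution, not something the original model uses; `decode` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy character set; the real one is the dictionary built from the data.
charset = ["C", "c", "(", ")", "O", " "]

def decode(probs, rng, temperature=1.0):
    """Sample one character per position from the predicted distribution,
    instead of always taking argmax (which collapsed to 'C' and ' ')."""
    chars = []
    for p in probs:
        logits = np.log(p + 1e-9) / temperature
        p_t = np.exp(logits - logits.max())
        p_t /= p_t.sum()  # renormalize after temperature scaling
        chars.append(rng.choice(charset, p=p_t))
    return "".join(chars)

# Hypothetical decoder output: 10 positions, each a distribution over charset.
probs = rng.dirichlet(np.ones(len(charset)), size=10)
s = decode(probs, rng)
print(len(s))  # 10
```

Sampling restores diversity, but, as noted above, it cannot by itself guarantee that the sampled string is syntactically valid.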
Reflection: Overall, our group was proud of how the project turned out from both an architectural and a results-based perspective. Our baseline goals were met: we wanted to be able to generate SMILES strings with partial resemblance to molecules that could be used within the biochemical sector. Once we had a structured output that met the SMILES string encoding properties, our group was able to focus on refining the model to ensure that each of the individual elements within the resulting compound was as accurate and versatile as possible, i.e. we wanted the sub-sectional parts of the 120-element output to also be accurate and have desirable properties for researchers. From a results perspective, the model works as we originally intended: it is capable of generating new molecules based on certain inputs, and the current architecture lets researchers learn the relative probability of element positions, i.e. spaces should be towards the later portion of the generated compound and carbon atoms closer to the front. The main changes we made over time were simplifications of the original model, since the original implementation uses a plethora of deep learning techniques that hinder its interpretability, such as a TimeDistributed class, which applies a layer to every timestep of the decoder output, and a ReduceLROnPlateau class, which lowers the learning rate when training starts to plateau. Selecting the most straightforward simplifications without sacrificing the interpretability of our results or the model itself was a big hurdle, as we still wanted to generate accurate molecules with desirable properties while streamlining the model.
With more time to work on the project, we would build more robust methods for testing the accuracy of sub-components of our compounds, i.e. we would write methods that compare a portion of the input and output with one another to ensure that there are some differences between the molecules while also ensuring that the structural integrity of the output is similar to that of the input molecule. If we were to redo the project, we would also implement a custom scheduler for passing molecules into the model, so that molecules with similar chemical structures are passed in relatively close to one another and the model can focus on learning the specific chemical properties of the inputs; the current model has too large a scope. By reducing the scope of the model, accuracy should increase, as the model would learn specific chemical properties instead of a general set of features. The biggest takeaway from this project is that it is difficult to design a model from scratch, even when existing resources are available to guide you in the right direction. Deciding which features to keep from the original implementation and which techniques to add to the existing framework proved more challenging and thought-provoking than expected, as each of us had a different vision for the final version of the project. Moreover, we learned that a lot of deep learning is based on trial and error, as there are no strict guidelines for what the initial weights should be or what learning rate is most conducive to smooth gradient descent.
Built With
- chembl22
- elbowgrease
- python
- tensorflow