Title: OrgoNet: 3D Modeling of Complex Organic Compounds Using a DNN

Who:

Niko Bhatia - nbhatia

Ugo Piovan - upiovan

Manuel Lopez - manuellopez2

Cameron Fiore - cfiore

Introduction:

The motivation for this project is to speed up the slow process of visually representing molecular structures in Python’s ASE extension. Currently, the library contains very few molecules (<50), so it is more a demo than a useful tool. To create a new molecular structure, it guesses initial atom positions and takes 8+ hours to slowly adjust them to their final equilibrium positions using interatomic forces. We believe that a DNN can be trained to take in the molecule name, output the distances between atoms, and quickly produce the 3D molecule structure with Python’s ASE extension.

Our problem can be classified as a regression problem. We are working with labeled data, and we map the molecule name to a continuous vector representing the 3D positions of all atoms in the molecule.

Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00535-x

The article above discusses the development of a neural machine translation model that converts the International Chemical Identifier (InChI) into International Union of Pure and Applied Chemistry (IUPAC) names, which follow a standardized naming system for chemical compounds. The authors explain how they trained their model using a dataset of over 50,000 compounds and achieved high accuracy rates in predicting IUPAC names. Towards the end, they also discuss potential applications for this technology, including improving chemical database searches and aiding in the discovery of new compounds.

Struct2IUPAC -- Transformer-Based Artificial Neural Network for the Conversion Between Chemical Notations: https://www.researchgate.net/publication/347140196_Struct2IUPAC_--_Transformer-Based_Artificial_Neural_Network_for_the_Conversion_Between_Chemical_Notations

Data:

We have developed our own set of data for this project. Given that there wasn’t a repository of IUPAC names and structures large enough to train an ML model, we had to devise a methodology to make our own. Ultimately, 440,000 molecules were created, named and split into 40,000 for testing and 400,000 for training.

The methodology for creating data was split into five portions. The first was creating a molecule without functional groups: just a line of carbons, with each external carbon having three hydrogens attached and each internal carbon having two. This was done by importing methane or ethane from the Atomic Simulation Environment’s data. For molecules with more carbons than ethane, vector operations were used to place a new carbon in the corresponding position; the three hydrogens were then shifted onto the new external carbon, and two new hydrogens were added to the former external carbon with the same relative positions as those on the other internal carbon. An illustration of this is given below. This process was generalized to larger carbon chains.

Figure 1: Vector process for creating molecules
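The chain-building step described above can be sketched with plain vector arithmetic. This is a minimal sketch, not the project's code: it builds an idealized, unrelaxed zigzag chain directly rather than extending an ethane imported from ASE, and the bond lengths, the 109.5° angle, and all function names are assumptions chosen for illustration.

```python
import math

CC = 1.54  # assumed C-C bond length in angstroms
CH = 1.09  # assumed C-H bond length in angstroms
HALF = math.radians(109.5) / 2  # half of the assumed tetrahedral angle


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def unit(a):
    norm = math.sqrt(sum(x * x for x in a))
    return tuple(x / norm for x in a)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def build_alkane(n):
    """Return (symbols, positions) for an unbranched alkane with n >= 2 carbons."""
    if n < 2:
        raise ValueError("this sketch needs at least two carbons")
    # Zigzag carbon backbone in the xy-plane; neighbours are exactly CC apart.
    carbons = [(i * CC * math.sin(HALF), (i % 2) * CC * math.cos(HALF), 0.0)
               for i in range(n)]
    symbols, positions = ["C"] * n, list(carbons)
    z = (0.0, 0.0, 1.0)
    for i, c in enumerate(carbons):
        bonds = [unit(add(carbons[j], scale(c, -1.0)))
                 for j in (i - 1, i + 1) if 0 <= j < n]
        if len(bonds) == 2:
            # Internal carbon: two hydrogens pointing away from both neighbours,
            # one above and one below the backbone plane.
            away = unit(scale(add(bonds[0], bonds[1]), -1.0))
            dirs = [add(scale(away, math.cos(HALF)), scale(z, s * math.sin(HALF)))
                    for s in (1.0, -1.0)]
        else:
            # Terminal carbon: three hydrogens spaced 120 degrees apart around
            # the C-C axis, each at the ideal tetrahedral angle to it.
            b = bonds[0]
            side = cross(b, z)
            dirs = [add(scale(b, -1.0 / 3.0),
                        scale(add(scale(z, math.cos(p)), scale(side, math.sin(p))),
                              math.sqrt(8.0) / 3.0))
                    for p in (0.0, 2.0 * math.pi / 3.0, 4.0 * math.pi / 3.0)]
        for d in dirs:
            symbols.append("H")
            positions.append(add(c, scale(d, CH)))
    return symbols, positions
```

For example, `build_alkane(4)` returns the four carbons of butane followed by its ten hydrogens. The conformation is not optimized, which mirrors the unrelaxed geometries used for the dataset.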

The next portion of data development was adding functional groups. The following groups were included: alcohols, bromines, chlorines, fluorines, alkenes, alkynes, aldehydes, carboxylic acids, and ketones. Functions were created to handle each of these. For the simple cases of bromines, chlorines, and fluorines, a hydrogen was fed into each function, changed to the desired atom, and its distance to its carbon was adjusted. For alcohols the same process was followed: a hydrogen was changed to an oxygen and the distance to its carbon was adjusted, but then a hydrogen was added to the oxygen along the same axis that linked it to the carbon, at the distance between the original hydrogen and its carbon. Alkenes and alkynes were handled by removing two or four hydrogens, respectively, from two neighboring carbons. For ketones, two hydrogens on the same carbon were fed into the function; one was removed and the other was converted to an oxygen. For aldehydes the same was done, but on an external carbon. Finally, for carboxylic acids the same process was repeated, but then the third external hydrogen was fed into the alcohol function. A few examples are given below.

Figure 2: Alkene

Figure 3: Carboxylic acid

Figure 4: Bromine
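The halogen case described above (swap the element, then adjust the distance to the carbon) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name and argument layout are hypothetical, and the bond lengths are typical textbook values rather than the ones used in the project.

```python
import math

# Typical carbon-halogen bond lengths in angstroms (assumed textbook values).
BOND_LENGTH = {"F": 1.35, "Cl": 1.77, "Br": 1.94}


def substitute_halogen(symbols, positions, h_index, c_index, element):
    """Return copies of symbols/positions with one hydrogen replaced by a halogen."""
    if symbols[h_index] != "H" or symbols[c_index] != "C":
        raise ValueError("expected a hydrogen index and the index of its carbon")
    carbon, hydrogen = positions[c_index], positions[h_index]
    stretch = BOND_LENGTH[element] / math.dist(carbon, hydrogen)
    # Keep the bond direction and change only its length.
    moved = tuple(c + (h - c) * stretch for c, h in zip(carbon, hydrogen))
    new_symbols, new_positions = list(symbols), list(positions)
    new_symbols[h_index] = element
    new_positions[h_index] = moved
    return new_symbols, new_positions
```

Because the new atom stays on the original C-H axis, the rest of the molecule is untouched, which is what makes these three groups the simple cases.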

The third step to developing data was to create a significant number of molecules with random functional groups in random positions. Although this portion became somewhat convoluted, the simple explanation is that a loop was created over carbon chains ranging from size 1 (methane) to size 10 (decane). The version without functional groups then had its number of atoms, atomic positions, element types, and name (more to come) printed to a text file. Arrays were made to keep track of the number of hydrogens that hadn’t been converted to a functional group, the types of functional groups that were possible for each hydrogen (for example, carboxylic acids are only possible on external hydrogens, and alkenes are only possible when there is a hydrogen on one carbon and another on an adjacent carbon), and which functional groups were on each hydrogen (needed for naming). For each chain size, a random functional group was selected from the list of possible groups, then a random hydrogen was selected, the functional group was added to the molecule, and its information was added to the output text files. Because the sizes, groups, and hydrogens were chosen at random, a large number of distinct molecules could be generated. As such, this loop was run 10,000 times for the training data and 1,000 times for the test data.

The final portion of this was developing a name for each molecule. As stated previously, each molecule is accompanied by a simple array that describes which functional groups are present. The array has a length equal to the number of hydrogen atoms on the unsubstituted molecule (two times the number of carbons plus two). The numbers in the array correspond to functional groups. A diagram of the IUPAC naming rules is given below. The code breaks the process into four functions, each of which requires only the array described above. The first finds the Greek numbering system for the carbon chain: the number of carbons is converted to “meth”, “eth”, “prop”, etc. as necessary. It is also noted whether there is an alkene or alkyne group; if not, an “a” is appended to the prefix.
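The per-hydrogen bookkeeping array described above can be sketched as a list with one slot per hydrogen. This is a minimal sketch: the integer codes and function names are hypothetical, since the actual encoding is not given, and only single-hydrogen groups are included so that no placement constraints are needed.

```python
import random

# Hypothetical integer codes; the actual encoding is not specified.
GROUP_CODES = {"none": 0, "bromine": 1, "chlorine": 2, "fluorine": 3, "alcohol": 4}


def empty_slots(n_carbons):
    """One slot per hydrogen of an unbranched alkane: 2 * n + 2 entries."""
    return [GROUP_CODES["none"]] * (2 * n_carbons + 2)


def add_random_group(slots, rng):
    """Assign a random group code to a random hydrogen that is still free."""
    free = [i for i, code in enumerate(slots) if code == GROUP_CODES["none"]]
    if not free:
        return None  # every hydrogen has already been converted
    index = rng.choice(free)
    name = rng.choice([g for g in GROUP_CODES if g != "none"])
    slots[index] = GROUP_CODES[name]
    return index, name
```

Passing a seeded `random.Random` instance as `rng` makes a generation run reproducible. Groups that consume several hydrogens (alkenes, ketones, carboxylic acids) would need the extra validity checks mentioned in the text.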

The next step is the pre-suffix. This part is only needed if an alkene or alkyne is present; otherwise only “an” is returned. If there are alkenes or alkynes, they are counted and numbered, and the result is returned as the position numbers, followed by a Latin counting prefix, followed by “en” or “yn” depending on the group, for example “1-2-dien”. Next is the suffix. Only the functional group of the highest priority affects it, and it applies only to carboxylic acids, aldehydes, ketones, or alcohols, in that order of priority. If one is present, it is removed from the functional groups list and the suffix is given as “oic acid”, “al”, “one”, or “ol” depending on its type; if none of these are present, the suffix is “e”. Finally, the prefix is built from the groups that remain after removing the one given to the suffix: using the same Latin prefix and dash system as for alkenes, the groups are listed alphabetically (for example 1-bromo-2-3-dichloro). The name is given by adding the strings in this order: prefix + greek + pre suffix + suffix.

Figure 5: Nomenclature
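The final assembly step (prefix + greek + pre suffix + suffix) can be sketched for saturated chains. This is a simplified sketch, not the project's four functions: it takes already-extracted groups instead of the hydrogen array, it skips alkenes and alkynes so the pre-suffix is always “an”, and the function name and argument layout are assumptions.

```python
STEMS = ["meth", "eth", "prop", "but", "pent", "hex", "hept", "oct", "non", "dec"]
MULTIPLIERS = {1: "", 2: "di", 3: "tri", 4: "tetra"}
# Suffix groups in priority order, highest first, as listed in the text.
SUFFIXES = [("carboxylic acid", "oic acid"), ("aldehyde", "al"),
            ("ketone", "one"), ("alcohol", "ol")]


def make_name(n_carbons, prefix_groups, suffix_groups=()):
    """Assemble prefix + stem + pre-suffix + suffix for a saturated chain.

    prefix_groups maps a substituent prefix such as "bromo" to its carbon
    positions; suffix_groups lists the suffix-type groups that are present.
    """
    suffix = "e"
    for group, ending in SUFFIXES:
        if group in suffix_groups:
            suffix = ending  # only the highest-priority group sets the suffix
            break
    parts = []
    for name in sorted(prefix_groups):  # alphabetical order
        positions = sorted(prefix_groups[name])
        locants = "-".join(str(p) for p in positions)
        parts.append(f"{locants}-{MULTIPLIERS[len(positions)]}{name}")
    prefix = "-".join(parts)
    return prefix + STEMS[n_carbons - 1] + "an" + suffix
```

With this sketch, `make_name(3, {"bromo": [1], "chloro": [2, 3]})` yields `1-bromo-2-3-dichloropropane`, and `make_name(2, {}, ["alcohol"])` yields `ethanol`. Lower-priority groups that lose the suffix would have to be passed in as prefix groups by the caller.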

The last portion is applying molecular dynamics to the developed atomic model. This involves computing the forces on each atom and allowing the atoms to move in the direction of those forces. This process is iterated until the net force on each atom is zero. In theory, this gives the ‘correct’ model. The memory allotment from the CCV was only enough to run this relaxation for molecules smaller than butane. As such, in order to develop enough data for this project, atomic relaxation was skipped. In future work, more memory could be purchased to develop a truly accurate model; however, for the scope of this project, the vector operations for developing molecules were deemed accurate enough. The only preprocessing necessary for this data will be removing duplicates, so that the testing set is not skewed by repeats of molecules the model trained on.
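The relaxation loop described above (compute forces, move atoms along them, repeat until the forces vanish) can be illustrated on a toy system. This is a sketch under strong assumptions: two atoms on a line joined by a harmonic spring, with a hypothetical function name and arbitrary spring constant and step size. A real relaxation would use a proper force model in 3D.

```python
def relax_bond(x1, x2, rest_length, k=1.0, step=0.1, tol=1e-6, max_steps=1000):
    """Move two atoms on a line along their forces until the forces vanish."""
    for n_steps in range(max_steps):
        stretch = (x2 - x1) - rest_length
        force_1 = k * stretch    # pulls atom 1 toward atom 2 when stretched
        force_2 = -k * stretch   # equal and opposite force on atom 2
        if abs(force_1) < tol:
            return x1, x2, n_steps
        x1 += step * force_1
        x2 += step * force_2
    return x1, x2, max_steps
```

Starting from `relax_bond(0.0, 2.0, 1.54)`, the separation shrinks toward the rest length of 1.54 while the midpoint of the two atoms stays fixed. The cost of each iteration grows with the number of atoms and the complexity of the force model, which is why relaxing larger molecules was expensive.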

Methodology: What is the architecture of your model?

We use a transformer decoder as the body of our model. We are going to separate the molecular name into its various sections (explained above) and embed these sections in a high-dimensional vector space. This embedding will act as our encoder output. We will then format each molecule as a linear sequence (e.g. “Carb Carb Hydro Hydro Hydro Hydro Hydro Hydro” would be the sequence representing the ethane molecule). This will be our decoder input.
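The decoder input format can be sketched as a simple mapping from element symbols to tokens. Only the two tokens that appear in the example above are included, because tokens for other elements are not specified; the function name is an assumption.

```python
# Only the tokens shown in the text; tokens for other elements are not specified.
TOKEN = {"C": "Carb", "H": "Hydro"}


def atoms_to_sequence(symbols):
    """Turn a list of element symbols into the space-separated decoder input."""
    return " ".join(TOKEN[s] for s in symbols)


ethane = ["C", "C"] + ["H"] * 6
print(atoms_to_sequence(ethane))
# prints: Carb Carb Hydro Hydro Hydro Hydro Hydro Hydro
```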

Self attention will be calculated for the decoder input sequence, and encoder-decoder attention will be calculated from the molecular name embedding and the decoder input sequence.
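As a rough illustration of the encoder-decoder attention step, the sketch below computes standard scaled dot-product attention in plain Python, with atom-token vectors as queries and name-section embeddings as keys and values. It assumes the usual transformer formulation and omits the learned projection matrices and multiple heads; the function names are hypothetical and this is not the project's model code.

```python
import math


def softmax(scores):
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights
```

Self attention is the same computation with the decoder sequence supplying queries, keys, and values. In the decoder it is normally combined with a causal mask, which this sketch leaves out.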

The output of the transformer will be fed to a linear layer that regresses the 3D position of each word (each atom) in the decoder input. Thus, our loss is the mean squared distance over the sequence: each position output by the model lies some distance from the true position of its atom, and these squared distances are averaged.
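The loss described above can be written out directly. This is a minimal sketch that averages the squared Euclidean distance per atom; whether the project averages per atom or per coordinate is not stated, so that choice and the function name are assumptions.

```python
def mean_squared_distance(predicted, target):
    """Average squared Euclidean distance between matching atoms."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must list the same atoms")
    total = 0.0
    for p, t in zip(predicted, target):
        total += sum((a - b) ** 2 for a, b in zip(p, t))
    return total / len(predicted)
```

Averaging per coordinate instead would divide the same sum by three times the number of atoms, so the two conventions differ only by a constant factor.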

We have had to adjust our architecture a few times so far. Our first idea was to have 3 models with shared layers. The first model would handle the IUPAC name input. The second model would be a 1D fully connected CNN to find the number of atoms. The output would be fed to the 3rd model to ultimately find the final 3D vector of all atom positions.

However, we decided to abandon this approach because of the level of influence between many models and the complexity of the loss functions.

We will test our model on molecules of increasing complexity. Our hope is that the model will be able to, at the very least, achieve high precision on small linear molecules. We would ideally like our model to be able to achieve high accuracy on any sized organic molecule with complex shapes and structure.

For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?

The notion of accuracy does apply for our regression task. The details of this accuracy and loss calculation are touched on below.

The metric we intend to use to assess our model’s performance is MSE loss. The loss will be calculated from the difference between each atom’s predicted and true position.

Our base, target, and stretch goals:

Base: Goal is to have all the data processed and have our model reach X accuracy on the green data set (<10 atom linear molecules)

Target: Goal is to have X accuracy on the blue data set (Linear arbitrary size + 1 aromatic prefix)

Stretch: Goal is to have X accuracy on black data set (All possible Circular, complex geometry/aromatic groups, complex combinations)

Ethics:

Why is Deep Learning a good approach to this problem?

There are two main reasons we believe deep learning is a good approach to this problem. The first is that the number of possible molecules and combinations of atoms is effectively unlimited, so there is no practical limit on training or testing data. The second is that this is currently a time-consuming task, but once the DL model is trained it can save hours of time per molecule.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

Major stakeholders in this problem could be high school, college, graduate, and PhD students who would use these visualizations for their studies and research. Other stakeholders could include researchers studying new molecules in drug or chemical development. There are two ways the algorithm can be incorrect. The first is that the atoms are correct, but the outputted locations are not. The consequence is that the ASE extension must slowly adjust the locations to their equilibrium positions. This still leads to the correct result, but not in a swift, efficient manner. The other error is that the correct atoms are not read from the molecule name, which would be problematic and would misinform users. This error is likely easier to spot, and can ideally be avoided with adequate penalization in our loss functions.

Division of labor: Briefly outline who will be responsible for which part(s) of the project.

Manuel: Data collection, creation, and preprocessing

Ugo + Cam + Niko: Building, training, testing model (subdivided later)

All: Write-up/Presentation

Updates

OrgoNet: Project Progress Reflection

Challenges: Creating our own data for thousands of molecule names and positions has been the hardest part of this project so far. However, the most challenging obstacle we are facing right now is the time it takes for our model to train. This makes it difficult to test small adjustments, retrain, and reevaluate. To address this, we have decided to have each team member focus on fine-tuning one hyperparameter so that we can train simultaneously and hopefully reach ideal values faster.

Insights: We have trained the current implementation of our model, and it has a loss value of 0.22 Å (down from 6.54 Å at the start of training). While there are many hyperparameters and features left to tune, this is a good concrete result to improve on. When we ran our trained model on the testing set, the average error per atom location was 0.38 Å. While this is not an ideal final result, it is a sign that the model is learning and a great step in the right direction. Initially, we intended to create three datasets with incrementally increasing difficulty of training data. However, we realized during the data processing stage that separating the data risked mismatching data and labels, so we decided to train on the entire dataset. We were not certain how the model would perform, so the results so far are promising, and we hope to improve on them.

Plan: We are on track with our project because we have successfully trained the model and produced some results by this second check-in with our TA. Our next steps are to put our heads together and focus on how to improve the model. At this point, there are many possible reasons why the model is not outputting the correct positions accurately. We need to determine whether this is due to insufficient model depth, data points that are confusing the model, or the need for an architecture change. We have considered adjusting the learning rate and batch size, adding layers, or making other small model changes. We have also discussed possible data points that could be confusing the model, and we may remove those and retrain. We will continue to explore these options to improve the model's performance.
