We found a dataset called QM9. It consists of roughly 134k stable molecules made of C, H, O, N, and F atoms, and it provides a number of properties for each molecule. Our plan is to convert this dataset to SMILES strings. A SMILES string represents a molecule as a set of instructions for how to build it. These instructions follow the rules of chemistry, which should ensure that we build valid molecules. It also means we can use a sequence VAE instead of a more complex graph VAE.

Our goal is to generate feasible, novel molecules. We will measure our progress by:
a. Measuring how many unique molecules we can generate (not repeated generations of the same molecule)
b. Measuring how many novel molecules we generate (not in the dataset)
c. Measuring how many of the molecules we generate are feasible given their chemical properties as calculated by RDKit

We've read a number of papers about this dataset and the metrics used to calculate molecular validity. We've also looked at what models have already been created to achieve this goal, and found some based on graph VAEs and some other ML approaches. We're taking a different approach using a sequence VAE, but we've refrained from reading how these other models were implemented so that we can work through the problems ourselves and come up with a novel approach.
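As a rough sketch of how metrics a and b above could be computed, the snippet below counts unique and novel molecules after canonicalizing each SMILES string, since the same molecule can be written as several different strings. The function names, the extra parse-validity ratio, and the choice of denominators are our own assumptions rather than a settled design. The `rdkit_canonical` helper uses RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles`, and is imported lazily so the metric logic can be tried with any canonicalizer.

```python
from typing import Callable, Dict, Iterable, Optional


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string does not parse."""
    from rdkit import Chem  # lazy import: only needed when this helper is used

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> Dict[str, float]:
    """Compute validity, uniqueness, and novelty ratios (assumed definitions).

    validity   = parseable strings / all generated strings
    uniqueness = distinct molecules / parseable strings
    novelty    = distinct molecules not in the training set / distinct molecules
    """
    generated = list(generated)
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)

    # Canonicalize the training set the same way so the comparison is fair.
    train = {c for c in (canonicalize(s) for s in training) if c is not None}
    novel = unique - train

    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

With RDKit installed, this would be called as `generation_metrics(samples, qm9_smiles, rdkit_canonical)`, where `samples` and `qm9_smiles` are hypothetical lists of decoded and training SMILES strings.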

There are two sets of limitations we'll face with this project. The first is related to how we are training our model. Unlike other approaches that train on the structure of the molecule itself, we are using SMILES to train a model on sets of instructions for molecule creation. As a result, our latent space is made up of molecule-creation patterns, not actual molecules. Nearby points in the latent space will not necessarily be similar molecularly and may vary quite significantly in chemical properties. This makes interpolation between similar molecules very difficult, as the latent space isn't smooth with respect to molecular properties.

The second is our ability to actually evaluate the molecules we generate. We are using RDKit to calculate some important properties of the molecules, but we only plan to check that they fall in a feasible range. If there are chemically problematic properties that our metrics don't show, we don't have the expertise to identify that a molecule would be unlikely or difficult to synthesize. We've also read that property evaluation with RDKit is slow, so this may limit our ability to test our molecules' properties.
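The range check described above could look something like the sketch below. The property names and the numeric bounds in `FEASIBLE_RANGES` are placeholder assumptions for illustration, not values we have validated. The RDKit calls (`Descriptors.MolWt`, `Crippen.MolLogP`) are wrapped in `functools.lru_cache` so that a molecule generated more than once is only evaluated once, which may help if property evaluation turns out to be slow.

```python
from functools import lru_cache
from typing import Dict, Mapping, Optional, Tuple

# Placeholder bounds (assumptions): (low, high) per property, inclusive.
FEASIBLE_RANGES: Dict[str, Tuple[float, float]] = {
    "mol_wt": (0.0, 200.0),
    "logp": (-3.0, 5.0),
}


def within_ranges(
    props: Mapping[str, float],
    ranges: Mapping[str, Tuple[float, float]],
) -> bool:
    """True only if every required property is present and inside its range."""
    for name, (low, high) in ranges.items():
        value = props.get(name)
        if value is None or not (low <= value <= high):
            return False
    return True


@lru_cache(maxsize=None)
def rdkit_properties(smiles: str) -> Optional[Dict[str, float]]:
    """Compute a few RDKit properties; the cached dict should be treated as read-only."""
    from rdkit import Chem  # lazy import: only needed when this helper is used
    from rdkit.Chem import Crippen, Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable SMILES: no properties to report
    return {
        "mol_wt": Descriptors.MolWt(mol),
        "logp": Crippen.MolLogP(mol),
    }


def is_feasible(smiles: str) -> bool:
    """Combine the cached property lookup with the range check."""
    props = rdkit_properties(smiles)
    return props is not None and within_ranges(props, FEASIBLE_RANGES)
```

A check like this only says that the computed numbers look plausible. It does not address whether a molecule could actually be synthesized.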
