Many things motivated me to create this project. I am writing this with less than 24 hours left to submit it. The work is rushed, but it is special to me: it is a proof of concept that can later be improved to generate more specific molecules that could serve as antiviral drugs against COVID-19. I am motivated because I want to protect my friends, my family, and myself. Unfortunately, I currently have symptoms similar to those the WHO reports for COVID-19, and I plan to self-isolate.
What it does
The model generates novel drug-like molecules that are similar to those in a dataset of 250k drug-like molecules in SMILES format (a way of representing organic molecules as text strings). The same method can, of course, be applied to other, even larger, datasets to produce novel drug-like molecules.
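To make the SMILES idea concrete, here is a minimal sketch of splitting a SMILES string into tokens, which is a common first step before feeding molecules to a model. It is not code from this project: the regex and the `tokenize_smiles` name are my own illustration, and it only checks that every character is a recognised SMILES symbol, not that the molecule is chemically valid (a toolkit such as RDKit would be needed for that).

```python
import re

# Illustrative token pattern (an assumption, not taken from the project):
# bracket atoms, two-letter halogens, single-letter atoms, aromatic atoms,
# two-digit ring closures, single-digit ring closures, bonds and branches.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[=#$:/\\\-+().~*@])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; raise if any character is unknown."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so compare the round trip.
    if "".join(tokens) != smiles:
        raise ValueError("unrecognised characters in SMILES: %r" % smiles)
    return tokens


if __name__ == "__main__":
    # Aspirin written as a SMILES string.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Multi-character tokens such as `Cl`, `Br`, and bracket atoms like `[NH4+]` are kept whole, which is why a plain character split is usually not enough.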
How I built it
This work is a proof of concept and a first step, since my original, more complicated idea would not have been ready in time to submit to this competition.
This work builds on Latent-GAN, whose authors used a heteroencoder VAE to encode randomized SMILES into a latent space representing the dataset, which is then fed into a WGAN. In my work, however, I replaced the heteroencoder with a VAE-JTNN, which should in theory give 100% validity for the generated drug-like molecules (that is, they are realistic molecules).
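As a rough illustration of the WGAN stage described above, the sketch below computes the standard Wasserstein critic and generator losses on a few latent vectors. Everything in it is a toy assumption: the two-dimensional "latent vectors" stand in for encoder outputs, `linear_critic` stands in for a real neural network, and none of the names come from this project's code. It also leaves out whatever Lipschitz constraint the real model uses (weight clipping or a gradient penalty).

```python
from statistics import mean


def linear_critic(latent, weights):
    """Toy critic (hypothetical): a dot product instead of a neural network."""
    return sum(w * x for w, x in zip(weights, latent))


def critic_loss(real_scores, fake_scores):
    """WGAN critic objective to minimise: E[D(fake)] - E[D(real)]."""
    return mean(fake_scores) - mean(real_scores)


def generator_loss(fake_scores):
    """WGAN generator objective to minimise: -E[D(fake)]."""
    return -mean(fake_scores)


# Stand-ins for latent vectors: "real" ones would come from the encoder,
# "fake" ones from the generator. The values are made up for illustration.
real_latents = [[1.0, 0.0], [0.5, 0.5]]
fake_latents = [[0.0, 1.0], [0.0, 0.5]]
weights = [1.0, -1.0]

real_scores = [linear_critic(z, weights) for z in real_latents]
fake_scores = [linear_critic(z, weights) for z in fake_latents]

if __name__ == "__main__":
    print("critic loss:", critic_loss(real_scores, fake_scores))
    print("generator loss:", generator_loss(fake_scores))
```

The critic is rewarded for scoring encoded real molecules higher than generated ones, while the generator is rewarded for raising the critic's score on its own outputs.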
Challenges I ran into
Time, good ol' time. I began this project around the start of March with a literature review of machine-learning-driven drug discovery, and more specifically lead creation (creating candidate drugs). My original idea involved constraining the generated molecules to be similar to the receptors that the viral protein attaches to on a human cell. These molecules would then be optimized for the docking site on the virus's protein to inhibit its function. Other problems I ran into were computational power and training time (like I said, time was limiting). Because of the late start, I only began training on March 15th (yikes), so my model could not train for long enough, and I did not have time to demonstrate 100% validity of the generated molecules. At the time of this submission, that claim holds only in theory, based on the results in the JTNN paper.
However, the most annoying challenge, which I unfortunately could not solve before the deadline, was an environment issue I ran into when trying to decode the embeddings produced by the generator. Because PyTorch for Python 2.7 is only available on Linux and I have a Windows machine, I could not decode the embeddings myself. I have included the environment file in the GitHub repository, and I'm fairly confident that the resulting environment will let a user decode the embeddings back into SMILES strings. Another, less obvious environment issue was that I could not use RDKit within Colab; that is less of a problem for this project and more of one for future work.
Accomplishments that I'm proud of
Submitting, and actually being able to create novel drug-like molecules. I'm also proud that this project integrates part of my undergrad studies (I'm a materials engineering student) with my passion for inventing and data science. To clarify, my degree covers a variety of subjects, including chemistry and biology. I am also proud that I took this project from ideation to finish and that I have future steps planned to improve my work.
What I learned
I learnt quite a bit, from the best ways to represent a molecule's information to how to work with molecules in machine learning. I learnt how to use GANs, which I had no prior experience with. I strengthened my PyTorch skills and my ability to read and understand other people's models and projects built in PyTorch.
What's next for AI for a cure - MolGAN
I hope to complete the more complex architecture I came up with, which I believe can, in theory, create more tailored molecules that inhibit the activity of viruses. From there, I hope to get research funding or partner with a lab and contribute to the global effort behind my motivation for this project: finding a cure or a vaccine for COVID-19. Since one issue I ran into was an environment complication that forced me to work around Colab to complete my work, I hope that with funding I can set up a personal environment to train my models in and reduce the costs I incur (I'm a broke student :[ ).