DL Final Project

Final Writeup

Title: BlueNose

Who: William Hayward (whayward), Christos Michealides (cmichae4), Kaan Ozulkulu (kozulkul)

Introduction: Write a deep learning model to classify the smell of a molecule.

We are entering an odor-prediction challenge. We came across it and found the problem unique: we had never seen anything like it before and thought it would be fun to attempt. This is a multi-label classification problem. Given a string representation of a molecule (in SMILES format), the goal is to predict three words that describe the smell of the molecule.

Github:

link

Related Work: Are you aware of any, or is there any prior work that you drew on to do your project?

Papers: The paper linked below also uses GNNs to predict the relationship between a molecule's structure and its smell. The researchers state that machine learning is widely used for vision and hearing but much less so for olfaction. The goal is to predict sensory results given a molecule. To do this, the model has to learn a representation of molecules in an odor space, where molecules with similar smells cluster close to each other. The researchers believe further research in this area can be used to create new odorants that eliminate the need for heavy harvesting.

URLs: https://arxiv.org/pdf/1910.10685.pdf

Data:

The dataset is provided by the challenge. It is split into training and testing CSV files, where the inputs are SMILES representations of molecules and the labels are lists of strings that describe the smell of each molecule. The training set contains 4316 labeled molecules and the testing set contains 1079 unlabeled molecules.
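Because each molecule carries a list of odor words, the labels have to be turned into fixed-length targets before training. The following is a minimal sketch of one common way to do that (multi-hot encoding). The helper names `build_vocab` and `multi_hot` are hypothetical, and the three rows are made up for illustration; they are not taken from the challenge files, whose exact column layout is not described here.

```python
from typing import Dict, List, Sequence


def build_vocab(label_lists: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each distinct odor word to a column index (sorted for determinism)."""
    words = sorted({word for labels in label_lists for word in labels})
    return {word: index for index, word in enumerate(words)}


def multi_hot(labels: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 in the column of every word in `labels`."""
    vector = [0] * len(vocab)
    for word in labels:
        vector[vocab[word]] = 1
    return vector


# Illustrative rows only (SMILES string, odor words); NOT real challenge data.
rows = [
    ("CCO", ["alcoholic", "ethereal"]),
    ("CC(=O)O", ["sour", "pungent"]),
    ("CCOC(C)=O", ["ethereal", "fruity", "sweet"]),
]

vocab = build_vocab([labels for _, labels in rows])
targets = [multi_hot(labels, vocab) for _, labels in rows]

for (smiles, _), target in zip(rows, targets):
    print(smiles, target)
```

With targets in this form, a model can output one score per odor word, and the top-scoring words become the predicted description.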

Methodology:

We intend to train the model using a GNN with message passing. It is also easy to generate images from SMILES representations of molecules, so we may also explore using CNNs on the image files themselves. Word embeddings and NLP paradigms may also prove useful, since the labels are words. Given the strong performance we saw GNNs achieve on molecule data in HW5, we believe that a GNN with message passing will be the best fit for this problem.
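To make the idea of message passing concrete, here is a small sketch in plain Python. It is not our model: the graph for ethanol (SMILES "CCO") is built by hand rather than parsed from the string, the element vocabulary `ELEMENTS` is a hypothetical two-entry list, and the update rule is an unweighted sum, whereas a real GNN layer would apply learned weights and a nonlinearity.

```python
from typing import List, Tuple

# Hand-built heavy-atom graph for ethanol ("CCO"): C0 - C1 - O2.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

# Hypothetical element vocabulary used for one-hot atom features.
ELEMENTS = ["C", "O"]


def one_hot(symbol: str) -> List[float]:
    """Encode an element symbol as a one-hot vector over ELEMENTS."""
    return [1.0 if symbol == element else 0.0 for element in ELEMENTS]


def adjacency(n_atoms: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Build an undirected adjacency list from a list of bonds."""
    adj: List[List[int]] = [[] for _ in range(n_atoms)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def message_passing_step(features: List[List[float]],
                         adj: List[List[int]]) -> List[List[float]]:
    """New atom feature = own feature + sum of neighbor features (no weights)."""
    updated = []
    for i, own in enumerate(features):
        aggregated = [0.0] * len(own)
        for j in adj[i]:
            for k, value in enumerate(features[j]):
                aggregated[k] += value
        updated.append([o + m for o, m in zip(own, aggregated)])
    return updated


def readout(features: List[List[float]]) -> List[float]:
    """Sum atom features into a single molecule-level vector."""
    return [sum(column) for column in zip(*features)]


features = [one_hot(symbol) for symbol in atoms]
adj = adjacency(len(atoms), bonds)
hidden = message_passing_step(features, adj)
molecule_vector = readout(hidden)
print(hidden)
print(molecule_vector)
```

After one step, each atom's vector reflects its immediate neighbors; stacking more steps lets information travel further across the molecule before the readout produces a single vector for classification.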

Metrics:

Success is defined as correctly categorizing the smell of molecules in the testing set, as measured by the Jaccard index (also known as the Tanimoto similarity score), which compares the set of predicted words with the set of true words.
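For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below implements that standard definition; the function names `jaccard` and `mean_jaccard` are our own, the example words are illustrative, and the challenge's official scoring script may differ in details such as how multiple predictions per molecule are handled.

```python
from typing import Iterable, Sequence


def jaccard(predicted: Iterable[str], actual: Iterable[str]) -> float:
    """Jaccard index |P & A| / |P | A| of two label sets."""
    p, a = set(predicted), set(actual)
    if not p and not a:
        # Assumption: two empty sets count as a perfect match.
        return 1.0
    return len(p & a) / len(p | a)


def mean_jaccard(predictions: Sequence[Iterable[str]],
                 actuals: Sequence[Iterable[str]]) -> float:
    """Average the Jaccard index over a batch of molecules."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    scores = [jaccard(p, a) for p, a in zip(predictions, actuals)]
    return sum(scores) / len(scores)


# Two of three predicted words match, and the union has four words: 2 / 4.
score = jaccard(["fruity", "sweet", "floral"], ["fruity", "sweet", "green"])
print(score)  # 0.5
```

A score of 1.0 means the predicted words match the true words exactly, and 0.0 means they share no words at all.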

Our target goal is to place in the top 25% of the competition. Our reach goal is to win it.

Ethics:

What broader societal issues are relevant to your chosen problem space?

Molecular modeling with graph neural networks is a widely applicable field. One particularly interesting use for our work could be to discover synthetic alternatives to natural aromatic products. The cosmetics industry tends to overharvest specific natural products (e.g., certain types of flowers) in order to produce aromatic products with certain scents. Our work could have a large positive impact on the ecosystem by allowing these producers to go the synthetic route. Moreover, as the field of molecular modeling progresses, it will become easier to match molecular inputs to all sorts of labels. In the future, this could enable novel drug discovery, cancer detection, or other such use cases. We believe such work is of extreme importance.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

Ultimately, there is no "ground truth" odor for a molecule, because different people may perceive scents in different ways. For instance, people who live in an area where jasmine is not native may describe a scent that we would label as jasmine differently. Furthermore, we have no way of verifying the training labels ourselves, given the complexity of the data. This means we are essentially trusting that the dataset is not biased. If our project were focused not on odor classification but on something more personal, these problems with ground truth and possible dataset bias would be more consequential.

Division of labor:

At least in the beginning stages, we intend to work on the project together to ensure that nobody is left behind or left doing all the work. As the project progresses, we foresee more defined workload splits arising.