Inspiration

Our team focuses on understanding how networks of proteins contribute to organism-scale phenotypes, such as cancer growth or organism longevity. We addressed this challenge by using protein embeddings from ESM to predict these phenotypic outcomes, and by propagating the embeddings over protein-protein interaction (PPI) networks with graph neural networks (GNNs). By modeling the function from protein sequence to phenotype, we hope to better understand the biological drivers of complex diseases and how to target them with drugs to improve human health.

What it does

To perform this task, we leveraged three public datasets: DepMap, TCGA, and Species Longevity. Our goal was to develop predictive models using these datasets to infer how protein sequences lead to phenotypic outcomes, such as cancer cell growth.

How we built it

For each dataset, we built multiple models. For example, for DepMap, we developed a baseline model, a cell-line-specific model, and a PPI-informed model leveraging graph neural networks. Each model predicts the effect of protein knockout in a cell line. We made similar models for the longevity and TCGA datasets.
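
To illustrate what propagating embeddings over a PPI network can look like, here is a minimal sketch of a single mean-aggregation step, the simplest building block of a GNN layer. It is not our actual architecture: it has no learnable weights, and the protein names, vectors, and the `propagate` function are hypothetical placeholders standing in for real ESM embeddings and a real interaction graph.

```python
def propagate(embeddings, edges):
    """One round of mean aggregation over an undirected PPI graph.

    embeddings: dict mapping protein name -> list of floats (toy stand-in
                for an ESM embedding; an assumption for illustration).
    edges:      list of (protein_a, protein_b) interaction pairs.
    """
    # Every protein starts with a self-loop so it keeps part of its own signal.
    neighbors = {protein: {protein} for protein in embeddings}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    dim = len(next(iter(embeddings.values())))
    updated = {}
    for protein, nbrs in neighbors.items():
        # New embedding = element-wise mean over the protein and its partners.
        updated[protein] = [
            sum(embeddings[n][d] for n in nbrs) / len(nbrs) for d in range(dim)
        ]
    return updated


# Hypothetical toy graph: P1 and P2 interact, P3 has no known partners.
toy_embeddings = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0]}
toy_edges = [("P1", "P2")]
propagated = propagate(toy_embeddings, toy_edges)
```

In this toy graph, the two interacting proteins end up sharing information while the isolated one is unchanged. A trained GNN would add learned transformations and stack several such steps.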

Challenges we ran into

One challenge we focused on was data leakage, which is especially common in protein modeling. To address this, we chose train/test splits that keep related examples on the same side of the split, such as holding out entire branches of the phylogenetic tree for the species longevity dataset. Additionally, to check that our models were generalizing rather than memorizing, we compared them to a KNN baseline, in which the prediction for each test example comes from its K nearest neighbors among the training examples. Outperforming this baseline gave us evidence that our models learn more than a nearest-neighbor lookup.
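
The two ideas above, a group-wise hold-out and a KNN baseline, can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than our pipeline: the clade labels, 2-D vectors, and target values are made up, the helper names are hypothetical, and averaging the neighbors' targets with Euclidean distance is just one common way to define a KNN prediction.

```python
import math


def group_holdout_split(groups, held_out):
    """Split indices so that no group (e.g. a clade) appears on both sides."""
    train_idx = [i for i, g in enumerate(groups) if g not in held_out]
    test_idx = [i for i, g in enumerate(groups) if g in held_out]
    return train_idx, test_idx


def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training vectors closest to the query."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical toy data: one embedding, one target, one clade label per species.
embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.05, 0.05], [0.95, 0.95]]
targets = [1.0, 1.2, 5.0, 5.2, 1.1, 5.1]
clades = ["A", "A", "B", "B", "C", "C"]

# Hold out clade C entirely, so no close relative of a test species is seen in training.
train_idx, test_idx = group_holdout_split(clades, held_out={"C"})
train_x = [embeddings[i] for i in train_idx]
train_y = [targets[i] for i in train_idx]

baseline_preds = [knn_predict(train_x, train_y, embeddings[i], k=2) for i in test_idx]
```

A model is then scored on the same held-out indices, and its error is compared against the error of `baseline_preds`.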

Accomplishments that we're proud of

We were able to significantly outperform KNN baselines on the DepMap and TCGA datasets. This suggests that our models can learn from ESM3 embeddings to predict how cancer mutations lead to disease.

What we learned

We learned a great deal about the nuances of these datasets. For example, for the species longevity project, we had to work with multiple sequence alignments of proteins across many species. For the DepMap project, we learned a lot about cancer biology. Lastly, we learned to work together as a team and coordinate across multiple time zones.

What's next for PhenoSeq

We hope to further develop our approach and release it as a research preprint.
