Title: GeneFormer

Who:

Tas Rahman (terahman), Connor Flick (cflick), Sahil Gupta (sgupt136), Andrew Ni (awni)

Introduction: What problem are you trying to solve and why?

Our work is based on the study Transfer learning enables predictions in network biology. The goal of the paper was to develop a pretrained, self-supervised deep learning model that can perform many tasks, such as identifying therapeutic targets for disease, from limited task-specific data. Conventional approaches require building and training a separate model for each prediction task using labeled data, which is incredibly costly and inefficient; GeneFormer, by contrast, is a pretrained model that leverages a large corpus of unlabeled transcriptomic data and can then be fine-tuned on a small amount of labeled data for specific prediction tasks (anything from predicting gene sensitivities to chromatin dynamics), yielding more accurate predictions and a single model that can be applied to many use cases. This is a semi-supervised, transfer-learning problem. We chose this paper because it marks the first model of its kind and outlines an architecture that we can replicate.

We chose this problem for several reasons. First, it is highly relevant: following the Omics Revolution of the past decade, large amounts of data have become publicly available, but for scattered tasks. While this has greatly enabled the study of various biological molecules, including proteins (proteomics), metabolites (metabolomics), DNA (genomics), and more, the massive amounts of data also necessitate the development of more advanced computational tools and models to interpret and analyze them. Thus, large-scale, self-supervised models that can connect various datasets for biologically specific tasks are needed. Second, our group's own research experience in computational biology - specifically RNA sequencing and metabolomics - inspired us to pursue this project. We have experienced firsthand what it is like to have limited amounts of data, or to have the data but lack the computational tools with which to interpret it.

Related Work: Are you aware of any, or is there any prior work that you drew on to do your project?

CaSTLe ("classification of single cells by transfer learning") is a similar model that labels cells from single-cell RNA sequences via transfer learning from already-labeled datasets. The researchers use an XGBoost classification model to achieve high prediction accuracy on limited or small gene datasets, similar to GeneFormer. While its classification architecture differs from GeneFormer's, the paper is similar in that it discusses how current labeling/classification methods such as manual clustering are burdensome, and it turns to transfer learning on transcriptomic datasets to address that challenge.

Data: What data are you using (if any)?

For training data we are using the following two datasets:

  • Genecorpus-30M: The Genecorpus-30M dataset is publicly available, with 29.9 million human single-cell transcriptomes (found here: Genecorpus-30M). The data is already fully pre-processed and tokenized with a vocabulary of 25,424 Ensembl-annotated protein-coding and miRNA genes. The single-cell transcriptomes come from 561 publicly available datasets from original studies cited in the Methods of Theodoris et al., Nature 2023.

  • immune-c2s: While Genecorpus-30M is useful, it is the same dataset that Geneformer originally used and is somewhat unwieldy to manage at scale. To address this, we expect to use the immune-c2s dataset, which focuses exclusively on the immune system, spans only 273,502 rows, and was published on Feb. 14, 2024. There are also potential opportunities to augment this data with datasets curated by the Hemberg Lab, but how to clean and standardize that data to work with the minimal set of columns in some of these datasets remains unclear.

For future directions, we could train the model on multiomics data to obtain a true biological foundation model.

Methodology: What is the architecture of your model?

Each single-cell transcriptome is encoded into a rank value encoding (i.e., genes are ranked according to their expression, with highly expressed genes that are unique to a cell receiving higher ranks) and then passed through 6 layers of transformer encoder units (each with an input size of 2,048 tokens, an embedding dimension of 256, 4 attention heads per layer, and a feed-forward layer of size 512). Using a masked learning objective, we will train our model by masking 15% of the tokens in each input and requiring the model to predict those tokens from the surrounding context. This is a common approach in deep learning, particularly in language processing tasks, and it encourages the model to learn context-specific relationships between different parts of the input.
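The two preprocessing steps above - rank value encoding and token masking - can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's exact pipeline (Geneformer also normalizes each gene's expression by its median across the corpus, which we omit here), and the vocabulary and sequence sizes are shrunk to toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; the real model uses a vocabulary of
# 25,424 genes and sequences of up to 2,048 tokens per cell.
VOCAB_SIZE = 100
MAX_LEN = 16
MASK_FRACTION = 0.15   # fraction of tokens hidden for the masked objective
MASK_ID = VOCAB_SIZE   # reserve one extra id as the [MASK] token

def rank_value_encode(expression, max_len=MAX_LEN):
    """Encode one cell as gene ids sorted by descending expression.

    `expression` is a 1-D array of length VOCAB_SIZE; unexpressed genes
    are dropped, and the rest are ordered highest-expression-first.
    """
    expressed = np.flatnonzero(expression > 0)
    order = expressed[np.argsort(-expression[expressed])]
    return order[:max_len]

def mask_tokens(tokens, rng, frac=MASK_FRACTION):
    """Replace ~15% of positions with MASK_ID; return (inputs, labels)."""
    n_mask = max(1, int(round(frac * len(tokens))))
    positions = rng.choice(len(tokens), size=n_mask, replace=False)
    inputs = tokens.copy()
    inputs[positions] = MASK_ID
    labels = np.full(len(tokens), -100)   # -100 = position ignored by the loss
    labels[positions] = tokens[positions]
    return inputs, labels

# One synthetic cell: most genes silent, a few expressed at random levels.
cell = np.zeros(VOCAB_SIZE)
cell[rng.choice(VOCAB_SIZE, size=30, replace=False)] = rng.random(30)

tokens = rank_value_encode(cell)
inputs, labels = mask_tokens(tokens, rng)
```

The masked `inputs` would be fed to the 6-layer transformer encoder, and the loss would be computed only at the positions where `labels` is not -100.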

The hardest part of implementing this model will likely be the multiple fine-tuning processes required to generate task-specific models. The paper fine-tunes the same pretrained GeneFormer model for at least 4 different tasks, each with its own biological dataset, transformer architecture, and metrics. This will be difficult for us given the time and compute constraints of our personal devices, so we will likely fine-tune our model for only one prediction task.
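The core idea behind each fine-tuning process is the same: keep the pretrained representations (mostly) frozen and train a small task head on top. The numpy sketch below stands in a frozen embedding table with mean pooling for the pretrained encoder, and trains only a logistic-regression head on a synthetic binary task; everything here is a toy illustration, not the paper's fine-tuning setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pretrained encoder: a frozen embedding table plus mean
# pooling. In the real model this would be the 6-layer transformer.
VOCAB, DIM = 50, 8
frozen_embed = rng.normal(size=(VOCAB, DIM))

def encode_cell(tokens):
    """Frozen 'pretrained' representation: mean of token embeddings."""
    return frozen_embed[tokens].mean(axis=0)

def fine_tune_head(cells, labels, lr=0.5, epochs=200):
    """Train only a logistic-regression head on frozen cell embeddings."""
    X = np.stack([encode_cell(c) for c in cells])
    w, b = np.zeros(DIM), 0.0
    for _ in range(epochs):
        z = np.clip(X @ w + b, -30, 30)        # logits, clipped for stability
        p = 1.0 / (1.0 + np.exp(-z))           # sigmoid
        grad = p - labels                       # dLoss/dLogits for BCE
        w -= lr * X.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

# Synthetic task: labels follow a hidden linear rule on the embeddings.
cells = [rng.choice(VOCAB, size=10, replace=False) for _ in range(200)]
X = np.stack([encode_cell(c) for c in cells])
w_true = rng.normal(size=DIM)
labels = (X @ w_true > 0).astype(float)

w, b = fine_tune_head(cells, labels)
preds = np.array([(encode_cell(c) @ w + b) > 0 for c in cells])
accuracy = (preds == labels).mean()
```

Because only `w` and `b` are trained, each of the paper's tasks can reuse the same expensive pretrained backbone, which is exactly what makes fine-tuning cheap relative to training from scratch.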

Metrics: What constitutes “success?”

The paper we are basing our work on applied its model to cardiomyocytes to predict which gene deletions had the most harmful effects on the cells; these results were mirrored in real data from conditions such as cardiomyopathy. To assess the usefulness of our model, we plan to replicate this experiment and see whether we arrive at the same therapeutic targets. Additionally, we plan to use newly acquired transcriptomic data from the Lizaragga lab to see which genes play the largest role in autism dysregulation (Tas will be extending this work this summer to see whether that gene target is regulated via DEAF1).

Our model is accurate if it correctly predicts which genes contribute to a disease state. Since this is a self-supervised foundation model, there isn't a fundamental notion of accuracy to test against; however, we can create specific tasks for well-understood diseases and check whether our predictions match the current literature's understanding of the disease.
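One simple way to score such a comparison is precision-at-k: take the model's top-k predicted disease genes and measure what fraction appear in a literature-derived gene set. The gene names below are placeholders, not real predictions:

```python
def precision_at_k(ranked_genes, known_disease_genes, k=10):
    """Fraction of the top-k predicted genes found in the known gene set."""
    hits = sum(g in known_disease_genes for g in ranked_genes[:k])
    return hits / k

# Hypothetical example: genes ranked by predicted impact on the disease,
# compared against a (made-up) set of genes reported in the literature.
predicted_ranking = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
literature_set = {"GENE_A", "GENE_C", "GENE_F"}
score = precision_at_k(predicted_ranking, literature_set, k=5)  # 2/5 = 0.4
```

For the cardiomyocyte replication, the ranked list would come from our model's in silico deletion scores and the known set from published cardiomyopathy gene associations.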

Our paper's authors hoped to create a transcriptomic foundation model that can be applied in low-data contexts to help understand disease states. The model is adaptable: with an added fine-tuning layer for each task, they were able to model dosage effects as well as chromatin and network dynamics.

Our base goal is to reimplement the paper in Keras and run a similar experiment on cardiomyocytes. Our target goal is to apply newly collected transcriptomic data to identify targets of study for autism research. Our stretch goal is to integrate other omics types to make the model more comprehensive.

Ethics:

  • Human pathobiology is complex, and oftentimes there aren't enough researchers or data to fundamentally understand disease, especially for rare diseases and diseases of clinically inaccessible tissues. Patients are waiting for answers now, and deep learning provides a scalable way to understand many forms of disease and provide direction for future research.
  • Patients are, in our opinion, the most important stakeholders affected by our model. Mistakenly identifying the wrong target can direct limited research funding in the wrong direction, setting back the progress of science. This affects researchers who lose valuable time, taxpayers who fund fruitless research, and of course patients who are desperately waiting for answers.

Division of labor: Briefly outline who will be responsible for which part(s) of the project.

Everyone will work on the base model. Each member will work on a task application of the model.
