Contributors
- Joseph Aguilera
- Connor Staggs
- Nigel Anderson
- Bryan Natividad
Intro
Precise and coordinated regulation of gene expression during growth and development is essential for the viability of all organisms and for preventing diverse diseases ranging from cancer to neurodegeneration. Chromatin domains coordinate gene expression by concentrating key factors at specific genomic locations in a context-specific way to concordantly activate or repress groups of genes. Before the zygotic genome is activated, chromatin domains have not yet formed and only a few key transcription factors (TFs), called pioneer TFs, are bound. Pioneer TFs can bind to closed chromatin, recruit chromatin remodelers to open chromatin, and target additional transcription complexes. Furthermore, pioneer TFs are critical for cellular reprogramming and are often reactivated in cancer. Pioneer TFs can regulate the formation of the active and repressive chromatin domains that form after the zygotic genome becomes activated. However, the fundamental mechanisms by which chromatin domains are formed at the correct genomic locations across time and space remain unknown.
One example of a conserved active chromatin domain is the hyperactivated single male X-chromosome in Drosophila, which coordinately upregulates thousands of active genes two-fold. The complex responsible for this hyperactivation, the Dosage Compensation Complex (DCC), must be precisely targeted to specific DNA sequences named MSL Recognition Elements to initiate the spreading of the complex across the entire X-chromosome. However, the complex does not act alone. A pioneer TF named CLAMP has a synergistic relationship with the DCC, assisting in its specification to the X-chromosome. But because CLAMP is bound throughout the genome, there must be additional TFs that assist the DCC in targeting the X-chromosome. In addition, if more TFs assist in X-chromosome specification, there must also be additional cis-regulatory elements (TF binding sequences) yet to be revealed.
We aim to address this issue by determining the TFs that engage in global X-chromosome targeting while simultaneously revealing which regions of their binding are significant. Transformers allow us to achieve both tasks. Importantly, Transformers can detect both short- and long-range DNA patterns, which previous techniques were not able to capture.
Methodology
Data and Data Preprocessing:
Using the Gene Expression Omnibus (GEO), a public functional genomics data repository, we downloaded and curated over 50 unique datasets from Drosophila melanogaster (fruit fly) S2 and Kc cells. Each dataset corresponded to a unique TF binding profile obtained using ChIP-seq or CUT&RUN. The datasets needed moderate preprocessing because these techniques output binding coordinates rather than DNA sequence. Using bedtools, we extracted the DNA sequence at the coordinates of every binding site. Afterwards, we imported and prepared the DNA sequences in Python with the following packages: pandas, numpy, scipy, scikit-learn, and tensorflow.

We were interested in constructing and implementing two different models: a transformer (long-range predictions) and a convolutional neural network (local-range predictions). For our proposed transformer architecture, we tokenized our sequences into three different k-mer representations: 3-mer, 5-mer, and 8-mer DNA sequences. This allowed us to store and document the unique keys in a DNA sequence. We also used a convolutional neural network (CNN) as a comparison model against our transformer. To preprocess data for the CNN, we one-hot-encoded our sequences, placing a 1 at the position of the actual nucleotide and 0s at the other nucleotides. Our target variable was whether or not the binding sequence was on the X chromosome, so our labels were true or false. Finally, because our data set was extremely large, we randomly sampled 200 sequences from our 50 transcription factor datasets and concatenated them into one combined data set (still smaller than the full data set). We did this for both the transformer and CNN preprocessed data.
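As a rough illustration of this preprocessing, the sketch below shows how binding-site sequences might be tokenized into k-mers for the transformer and one-hot encoded for the CNN, with an on-X label derived from the chromosome name. The data frame columns, example sequences, and helper names are placeholders for illustration rather than our exact code; in practice the sequences come from running bedtools getfasta on the peak coordinates.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per binding site, with the chromosome name and the
# DNA sequence extracted from the coordinates (e.g. via `bedtools getfasta`).
peaks = pd.DataFrame({
    "chrom": ["chrX", "chr2L"],
    "sequence": ["ACGTACGTAC", "TTGCAAGGTT"],
})

def to_kmers(sequence, k):
    """Split a DNA sequence into overlapping k-mer 'words' for the transformer."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def one_hot_encode(sequence):
    """One-hot encode a DNA sequence as a (length x 4) matrix for the CNN."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for i, base in enumerate(sequence.upper()):
        if base in mapping:          # ambiguous bases (e.g. N) stay all-zero
            encoded[i, mapping[base]] = 1.0
    return encoded

# Binary label: True if the binding site lies on the X chromosome.
peaks["on_x"] = peaks["chrom"].eq("chrX")
peaks["kmers_3"] = peaks["sequence"].apply(lambda s: to_kmers(s, 3))
peaks["onehot"] = peaks["sequence"].apply(one_hot_encode)
```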
Model Architectures and Metrics:
To begin, our transformer models used a learning rate of 0.001, an embedding size of 64, two blocks, four attention heads, a window size of 1000, a batch size of five, and five epochs. We trained with three random states (1, 2, and 3) for each of our k-mer lengths, so in total we ran nine transformer models. The simplified architecture of our transformer model consisted of two transformer blocks and two add-and-normalize layers, followed by a 25% dropout. We then passed the output to a fully connected (Linear/Dense) layer before a final sigmoid activation to convert the output to predicted class probabilities for whether the given k-mer sequence is located on the X-chromosome or not. The metric used in this model was accuracy, with the baseline accuracy being the proportion of the most populous class (False), and the loss used was binary cross-entropy.
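A minimal Keras sketch of a transformer along these lines is shown below. The embedding size, number of heads and blocks, dropout rate, learning rate, and loss follow the description above, but the feed-forward width, the learned position-embedding scheme, and the global average pooling before the dense layer are assumptions for illustration rather than details of our exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    """Token embedding for the k-mer vocabulary plus a learned position embedding."""
    def __init__(self, seq_len, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(seq_len, embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)

def transformer_block(x, embed_dim=64, num_heads=4, ff_dim=64):
    # Self-attention followed by an add-and-normalize step.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
    x = layers.LayerNormalization()(x + attn)
    # Position-wise feed-forward network with a second add-and-normalize step.
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(embed_dim)(ff)
    return layers.LayerNormalization()(x + ff)

def build_transformer(vocab_size, seq_len=1000, embed_dim=64):
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    x = TokenAndPositionEmbedding(seq_len, vocab_size, embed_dim)(inputs)
    for _ in range(2):                                   # two transformer blocks
        x = transformer_block(x, embed_dim=embed_dim)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.25)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)   # P(sequence is on chrX)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# A 3-mer vocabulary has 4**3 = 64 possible tokens (plus one padding index).
model = build_transformer(vocab_size=4**3 + 1)
# model.fit(x_train, y_train, batch_size=5, epochs=5)
```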
Our CNN model used a kernel size of 20 across two convolutional (Conv1D) layers with max pooling. We then passed the flattened output through two linear + leaky ReLU layers followed by a 25% dropout. Finally, we passed the output through a sigmoid activation function. The metric used in this model was accuracy, with the baseline accuracy being the proportion of the most populous class (False), and the loss used was binary cross-entropy.
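A comparable Keras sketch of the CNN is below. The kernel size of 20, two Conv1D + max-pooling stages, two linear + leaky ReLU layers, 25% dropout, and sigmoid output follow the description above; the filter counts, dense-layer widths, and ReLU activations on the convolutions are placeholder choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(seq_len=1000):
    # Input: one-hot-encoded DNA (sequence length x 4 nucleotide channels).
    inputs = layers.Input(shape=(seq_len, 4))
    x = layers.Conv1D(filters=32, kernel_size=20, activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(filters=32, kernel_size=20, activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Flatten()(x)
    # Two linear + leaky ReLU layers, then dropout and a sigmoid output.
    x = layers.Dense(64)(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dense(32)(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dropout(0.25)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

cnn_model = build_cnn()
```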
Results
We obtained a baseline accuracy of 78.7% by using the majority class for all predictions. We tested our transformer model on DNA k-mers of lengths 3, 5, and 8. Each of these models achieved statistically significant improvements over the baseline accuracy according to a t-test at the 95% confidence level. That said, results from the transformer model for each k-mer length were statistically indistinguishable from one another, with accuracies of approximately 82%. Using 8-mers achieved the highest absolute accuracy, but again, this result was not appreciably different from using 3-mer or 5-mer sequences. In addition, the CNN model did not achieve a statistically significant improvement over the baseline accuracy and was outperformed by each of the transformer models.
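For reference, significance testing of this kind can be sketched with scipy as below; the accuracy values are illustrative placeholders, not our measured results.

```python
from scipy import stats

baseline = 0.787
# Placeholder accuracies from the three random states (illustrative only).
transformer_8mer = [0.823, 0.819, 0.826]
cnn = [0.789, 0.785, 0.791]

# One-sample t-test of the transformer accuracies against the baseline
# (95% confidence corresponds to alpha = 0.05).
t_stat, p_value = stats.ttest_1samp(transformer_8mer, popmean=baseline)
print(f"transformer vs. baseline: t = {t_stat:.2f}, p = {p_value:.4f}")

# Two-sample t-test comparing the transformer and CNN accuracies.
t_stat, p_value = stats.ttest_ind(transformer_8mer, cnn)
print(f"transformer vs. CNN: t = {t_stat:.2f}, p = {p_value:.4f}")
```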
Accordingly, we conclude that a transformer architecture is likely the more appropriate model type for our classification task. Intuitively, this makes sense, given that transformers are able to capture longer-range context across inputs, whereas CNNs are better optimized for feature extraction over shorter context lengths. Overall, while the transformer's improvement in classification accuracy was relatively small in substantive terms (about 4 percentage points), we believe it demonstrates a proof of concept for the basic utility of the approach.
Challenges
The challenges are manifold when preparing such a massive dataset for training. First, data preprocessing and cleaning were tedious and complex. Datasets like ours that consist of many transcription factors (TFs) and histone marks are typically characterized by a high degree of noise and redundancy. Cleaning the dataset, thus, became a task of paramount importance to ensure the validity of the training process. It required time and computational resources, but most importantly, a deep understanding of the data to discern between useful and redundant or erroneous information. This cleaning task is further complicated by the sheer volume of the data, which hinders our ability to manually inspect and clean the data.
Another challenge was ensuring the proper distribution of our data for training, validation, and testing sets. In machine learning, it is crucial to have these three sets correctly divided to avoid overfitting and underfitting. Overfitting happens when the model learns the training data too well, to the point of memorizing it. This results in excellent performance on the training data but poor generalization to new, unseen data. Conversely, underfitting occurs when the model cannot capture the underlying pattern of the data, resulting in poor performance both on the training and the testing data. The challenge is therefore to strike the right balance, which is not always straightforward given the large size of our dataset. Moreover, the process of splitting the data can be computationally intensive, further necessitating the use of the OSCAR computing cluster.
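As an example of how such a split can be handled, the sketch below uses scikit-learn's stratified splitting so that the X-chromosome class proportion is preserved in each set; the 70/15/15 ratios and placeholder arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: X holds encoded sequences, y holds the on-X labels
# (~21% positive, mirroring the class imbalance in our dataset).
X = np.random.rand(1000, 100)
y = np.random.rand(1000) < 0.21

# 70/15/15 train/validation/test split, stratified on the label.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=1)
```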
Finally, dealing with such large datasets requires a considerable amount of storage space and memory, which also poses challenges in terms of data management and infrastructure. The large file size of each dataset makes it difficult to load onto a single computer, requiring more sophisticated solutions such as distributed computing. The OSCAR computing cluster, while a robust solution, brings its own set of challenges including efficient data distribution across the cluster, handling network latency, and debugging across multiple nodes. As we progress with our model, we will need to remain attentive to these issues, ensuring that our approach to handling the dataset is as rigorous and meticulous as the model architecture itself.
Reflection
The project ultimately turned out to be a success and a great learning experience. We were able to implement the Transformer and CNN architectures that we proposed. However, we were not able to meet our target and stretch goals. We did not meet our target goal because additional preprocessing was needed to match the sequences with their coordinates while maintaining the same labels. Going forward, we would implement this step by using only the DNA coordinates of the sequences rather than including global coordinates. The stretch goal could still be met during the summer if we find interesting results.
The model did perform the way we expected it to, performing significantly better than the baseline. However, we were not able to interpret the results due to time constraints. This is a critical step that we will follow up on, so that we can determine what the self-attention matrix focuses on and which sequences/classifier IDs help specify a TF to the X-chromosome.
Originally, we thought about building only a CNN model, but after learning that Transformers can address the disadvantages of a CNN, we quickly rerouted to using Transformers. If we could start the project over again, we would spend significantly more time on preprocessing the DNA sequences and BedGraph files. In addition, we would have built ML pipelines designed to run on OSCAR. With these changes, we could have utilized entire datasets instead of sampling subsets, which could have substantially improved model performance.
If given more time, we would have implemented more than one module. The first module would have been the Transformer, but the second would have utilized BedGraph files. We would have merged the TF BedGraph files into a 2D data frame, organizing unique TFs as columns and the genome, sectioned into small bins, as rows. Each element would represent a regional binding score for the respective TF. Using this method, we would be able to detect which combinations of TFs bind at specific loci. Afterwards, we would merge the output of each module for a higher-resolution prediction.
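A rough sketch of how that second module might assemble the binding-score matrix is shown below. The bin size, file names, and the simple averaging of BedGraph scores within each bin are hypothetical choices for illustration.

```python
import pandas as pd

BIN_SIZE = 1000  # hypothetical bin width in base pairs

def bin_bedgraph(path, tf_name, bin_size=BIN_SIZE):
    """Average one TF's BedGraph signal within fixed-size genomic bins."""
    bg = pd.read_csv(path, sep="\t", header=None,
                     names=["chrom", "start", "end", "score"])
    bg["bin"] = bg["start"] // bin_size
    binned = bg.groupby(["chrom", "bin"])["score"].mean()
    return binned.rename(tf_name)

# Hypothetical file paths; in practice there would be one per TF dataset.
tf_files = {"CLAMP": "clamp.bedgraph", "MSL2": "msl2.bedgraph"}

# Rows = genomic bins, columns = TFs, values = regional binding scores.
binding_matrix = pd.concat(
    [bin_bedgraph(path, tf) for tf, path in tf_files.items()], axis=1)
binding_matrix = binding_matrix.fillna(0.0)  # bins with no signal for a TF
```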
The biggest takeaway is that available datasets and computing resources are underestimated bottlenecks. However, this pushed us to optimize our code so that we could run our programs locally. Lastly, towards the end of the project, we began writing and structuring our code in a manner that could be easily parallelized. We will be sure to apply these lessons as we progress through this project and in projects that follow.
Links
Notes
Check-in reports are in the updates section.