Contributors
- Joseph Aguilera
- Connor Staggs
- Nigel Anderson
- Bryan Natividad
Intro
Precise and coordinated regulation of gene expression during growth and development is essential for the viability of all organisms and for preventing diverse diseases ranging from cancer to neurodegeneration. Chromatin domains coordinate gene expression by concentrating key factors at specific genomic locations in a context-specific way to concordantly activate or repress groups of genes. Before the zygotic genome is activated, chromatin domains have not yet formed and only a few key transcription factors (TFs), called pioneer TFs, are bound. Pioneer TFs can bind to closed chromatin, recruit chromatin remodelers to open chromatin, and target additional transcription complexes. Furthermore, pioneer TFs are critical for cellular reprogramming and are often reactivated in cancer. Pioneer TFs can regulate the formation of the active and repressive chromatin domains that form after the zygotic genome becomes activated. However, the fundamental mechanisms by which chromatin domains are formed at the correct genomic locations across time and space remain unknown.
One example of a conserved active chromatin domain is the hyperactivated single male X-chromosome in Drosophila, which coordinately upregulates thousands of active genes two-fold. The complex responsible for this hyperactivation, the Dosage Compensation Complex (DCC), must be precisely targeted to specific DNA sequences named MSL Recognition Elements to initiate the spreading of the complex across the entire X-chromosome. However, the complex does not act alone. A pioneer TF named CLAMP has a synergistic relationship with the DCC, assisting in its specification to the X-chromosome. But because CLAMP is bound throughout the genome, there must be additional TFs that assist the DCC in targeting the X-chromosome. In addition, if more TFs assist in X-chromosome specification, there must also be additional cis-regulatory elements (TF binding sequences) yet to be revealed.
We aim to address this issue by determining the TFs that engage in global X-chromosome targeting while simultaneously revealing which regions of their binding are significant. Transformers allow us to achieve both tasks. Importantly, Transformers can detect both short- and long-range DNA patterns, which previous techniques were not able to capture.
Methodology
Data and Data Preprocessing:
Using the Gene Expression Omnibus (GEO), a public functional genomics data repository, we downloaded and curated over 50 unique datasets from Drosophila melanogaster (fruit fly) S2 and Kc cells. Each dataset corresponded to a unique TF binding profile obtained using ChIP-seq or CUT&RUN. The datasets needed moderate preprocessing because these techniques output binding coordinates rather than DNA sequence. Using bedtools, we extracted the DNA sequence at the coordinates of every binding site. Afterwards, we imported and prepared the DNA sequences in Python with the following packages: pandas, numpy, scipy, scikit-learn, and tensorflow.

We were interested in constructing and implementing two different models: a transformer (long-range predictions) and a convolutional neural network (local-range predictions). For our proposed transformer architecture, we tokenized our sequences into three different k-mer representations: 3-mer, 5-mer, and 8-mer DNA sequences. This allowed us to store and document the unique keys in a DNA sequence. We also used a convolutional neural network (CNN) as a comparison model against our transformer. To preprocess data for the CNN, we one-hot-encoded our sequences, placing a 1 at the position of the actual nucleotide and 0s at the other nucleotides. Our target variable was whether or not the binding sequence was on the X chromosome, so our labels were true or false. Finally, because our data set was extremely large, we randomly sampled 200 sequences from our 50 transcription factor datasets and concatenated them into one combined data set (still smaller than the full data set). We did this for both the transformer and CNN preprocessed data.
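As a rough illustration of this preprocessing, the sketch below shows how binding-site sequences might be tokenized into k-mers for the transformer and one-hot encoded for the CNN, with an on-X label derived from the chromosome name. The data frame columns, example sequences, and helper names are placeholders for illustration rather than our exact code; in practice the sequences come from running bedtools getfasta on the peak coordinates.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per binding site, with the chromosome name and the
# DNA sequence extracted from the coordinates (e.g. via `bedtools getfasta`).
peaks = pd.DataFrame({
    "chrom": ["chrX", "chr2L"],
    "sequence": ["ACGTACGTAC", "TTGCAAGGTT"],
})

def to_kmers(sequence, k):
    """Split a DNA sequence into overlapping k-mer 'words' for the transformer."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def one_hot_encode(sequence):
    """One-hot encode a DNA sequence as a (length x 4) matrix for the CNN."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for i, base in enumerate(sequence.upper()):
        if base in mapping:          # ambiguous bases (e.g. N) stay all-zero
            encoded[i, mapping[base]] = 1.0
    return encoded

# Binary label: True if the binding site lies on the X chromosome.
peaks["on_x"] = peaks["chrom"].eq("chrX")
peaks["kmers_3"] = peaks["sequence"].apply(lambda s: to_kmers(s, 3))
peaks["onehot"] = peaks["sequence"].apply(one_hot_encode)
```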
Model Architectures and Metrics:
To begin, our transformer models used a learning rate of 0.001, an embedding size of 64, two blocks, four attention heads, a window size of 1000, a batch size of five, and five epochs. We trained with three random states (1, 2, and 3) for each of our k-mer lengths, so in total we ran nine transformer models. The simplified architecture of our transformer model consisted of two transformer blocks and two add-and-normalize layers, followed by a 25% dropout. We then passed the output to a fully connected (Linear/Dense) layer before a final sigmoid activation to convert the output to predicted class probabilities for whether the given k-mer sequence is located on the X-chromosome or not. The metric used in this model was accuracy, with the baseline accuracy being the proportion of the most populous class (False), and the loss used was binary cross-entropy.
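A minimal Keras sketch of a transformer along these lines is shown below. The embedding size, number of heads and blocks, dropout rate, learning rate, and loss follow the description above, but the feed-forward width, the learned position-embedding scheme, and the global average pooling before the dense layer are assumptions for illustration rather than details of our exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    """Token embedding for the k-mer vocabulary plus a learned position embedding."""
    def __init__(self, seq_len, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(seq_len, embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)

def transformer_block(x, embed_dim=64, num_heads=4, ff_dim=64):
    # Self-attention followed by an add-and-normalize step.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
    x = layers.LayerNormalization()(x + attn)
    # Position-wise feed-forward network with a second add-and-normalize step.
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(embed_dim)(ff)
    return layers.LayerNormalization()(x + ff)

def build_transformer(vocab_size, seq_len=1000, embed_dim=64):
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    x = TokenAndPositionEmbedding(seq_len, vocab_size, embed_dim)(inputs)
    for _ in range(2):                                   # two transformer blocks
        x = transformer_block(x, embed_dim=embed_dim)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.25)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)   # P(sequence is on chrX)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# A 3-mer vocabulary has 4**3 = 64 possible tokens (plus one padding index).
model = build_transformer(vocab_size=4**3 + 1)
# model.fit(x_train, y_train, batch_size=5, epochs=5)
```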
Our CNN model used a kernel size of 20 across two convolutional (Conv1D) layers with max pooling. We then passed the flattened output through two linear + leaky ReLU layers followed by a 25% dropout. Finally, we passed the output through a sigmoid activation function. The metric used in this model was accuracy, with the baseline accuracy being the proportion of the most populous class (False), and the loss used was binary cross-entropy.
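A comparable Keras sketch of the CNN is below. The kernel size of 20, two Conv1D + max-pooling stages, two linear + leaky ReLU layers, 25% dropout, and sigmoid output follow the description above; the filter counts, dense-layer widths, and ReLU activations on the convolutions are placeholder choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(seq_len=1000):
    # Input: one-hot-encoded DNA (sequence length x 4 nucleotide channels).
    inputs = layers.Input(shape=(seq_len, 4))
    x = layers.Conv1D(filters=32, kernel_size=20, activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(filters=32, kernel_size=20, activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Flatten()(x)
    # Two linear + leaky ReLU layers, then dropout and a sigmoid output.
    x = layers.Dense(64)(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dense(32)(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dropout(0.25)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

cnn_model = build_cnn()
```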
Results
We obtained a baseline accuracy of 78.7% by using the majority class for all predictions. We tested our transformer model on DNA k-mers of lengths 3, 5, and 8. Each of these models achieved statistically significant improvements over the baseline accuracy according to a t-test at the 95% confidence level. That said, results from the transformer model for each k-mer length were statistically indistinguishable from one another, with accuracies of approximately 82%. Using 8-mers achieved the highest absolute accuracy, but again, this result was not appreciably different from using 3-mer or 5-mer sequences. In addition, the CNN model did not achieve a statistically significant improvement over the baseline accuracy and was outperformed by each of the transformer models.
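For reference, significance testing of this kind can be sketched with scipy as below; the accuracy values are illustrative placeholders, not our measured results.

```python
from scipy import stats

baseline = 0.787
# Placeholder accuracies from the three random states (illustrative only).
transformer_8mer = [0.823, 0.819, 0.826]
cnn = [0.789, 0.785, 0.791]

# One-sample t-test of the transformer accuracies against the baseline
# (95% confidence corresponds to alpha = 0.05).
t_stat, p_value = stats.ttest_1samp(transformer_8mer, popmean=baseline)
print(f"transformer vs. baseline: t = {t_stat:.2f}, p = {p_value:.4f}")

# Two-sample t-test comparing the transformer and CNN accuracies.
t_stat, p_value = stats.ttest_ind(transformer_8mer, cnn)
print(f"transformer vs. CNN: t = {t_stat:.2f}, p = {p_value:.4f}")
```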
Accordingly, we conclude that a transformer architecture is likely the more appropriate model type for our classification task. Intuitively, this makes sense, given that transformers are able to capture longer-range context across inputs, whereas CNNs are better optimized for feature extraction over shorter context lengths. Overall, while the transformer's improvement in classification accuracy was relatively small in substantive terms (about 4 percentage points), we believe it demonstrates a proof of concept for the basic utility of the approach.
Challenges
The challenges are manifold when preparing such a massive dataset for training. First, data preprocessing and cleaning were tedious and complex. Datasets like ours that consist of many transcription factors (TFs) and histone marks are typically characterized by a high degree of noise and redundancy. Cleaning the dataset, thus, became a task of paramount importance to ensure the validity of the training process. It required time and computational resources, but most importantly, a deep understanding of the data to discern between useful and redundant or erroneous information. This cleaning task is further complicated by the sheer volume of the data, which hinders our ability to manually inspect and clean the data.
Another challenge was ensuring the proper distribution of our data for training, validation, and testing sets. In machine learning, it is crucial to have these three sets correctly divided to avoid overfitting and underfitting. Overfitting happens when the model learns the training data too well, to the point of memorizing it. This results in excellent performance on the training data but poor generalization to new, unseen data. Conversely, underfitting occurs when the model cannot capture the underlying pattern of the data, resulting in poor performance both on the training and the testing data. The challenge is therefore to strike the right balance, which is not always straightforward given the large size of our dataset. Moreover, the process of splitting the data can be computationally intensive, further necessitating the use of the OSCAR computing cluster.
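As an example of how such a split can be handled, the sketch below uses scikit-learn's stratified splitting so that the X-chromosome class proportion is preserved in each set; the 70/15/15 ratios and placeholder arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: X holds encoded sequences, y holds the on-X labels
# (~21% positive, mirroring the class imbalance in our dataset).
X = np.random.rand(1000, 100)
y = np.random.rand(1000) < 0.21

# 70/15/15 train/validation/test split, stratified on the label.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=1)
```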
Finally, dealing with such large datasets requires a considerable amount of storage space and memory, which also poses challenges in terms of data management and infrastructure. The large file size of each dataset makes it difficult to load onto a single computer, requiring more sophisticated solutions such as distributed computing. The OSCAR computing cluster, while a robust solution, brings its own set of challenges including efficient data distribution across the cluster, handling network latency, and debugging across multiple nodes. As we progress with our model, we will need to remain attentive to these issues, ensuring that our approach to handling the dataset is as rigorous and meticulous as the model architecture itself.
Reflection
The project ultimately turned out to be a success and a great learning experience. We were able to implement the Transformer and CNN architectures that we proposed. However, we were not able to meet our target and stretch goals. We did not meet our target goal because additional preprocessing was needed to match the sequences with their coordinates while maintaining the same labels. Going forward, we would implement this step by using only the DNA coordinates of the sequences rather than including global coordinates. The stretch goal could still be met during the summer if we find interesting results.
The model did perform the way we expected it to, performing significantly better than the baseline. However, we were not able to interpret the results due to time constraints. This is a critical step that we will follow up on, so that we can determine what the self-attention matrix focuses on and which sequences/classifier IDs help specify a TF to the X-chromosome.
Originally, we thought about building only a CNN model, but after learning that Transformers can address the disadvantages of a CNN, we quickly rerouted to using Transformers. If we could start the project over again, we would spend significantly more time on preprocessing the DNA sequences and BedGraph files. In addition, we would have built ML pipelines designed to run on OSCAR. With these changes, we could have utilized entire datasets instead of sampling subsets, which could have substantially improved model performance.
If given more time, we would have implemented more than one module. The first module would have been the Transformer, but the second would have utilized BedGraph files. We would have merged the TF BedGraph files into a 2D data frame, organizing unique TFs as columns and the genome, sectioned into small bins, as rows. Each element would represent a regional binding score for the respective TF. Using this method, we would be able to detect which combinations of TFs bind at specific loci. Afterwards, we would merge the output of each module for a higher-resolution prediction.
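A rough sketch of how that second module might assemble the binding-score matrix is shown below. The bin size, file names, and the simple averaging of BedGraph scores within each bin are hypothetical choices for illustration.

```python
import pandas as pd

BIN_SIZE = 1000  # hypothetical bin width in base pairs

def bin_bedgraph(path, tf_name, bin_size=BIN_SIZE):
    """Average one TF's BedGraph signal within fixed-size genomic bins."""
    bg = pd.read_csv(path, sep="\t", header=None,
                     names=["chrom", "start", "end", "score"])
    bg["bin"] = bg["start"] // bin_size
    binned = bg.groupby(["chrom", "bin"])["score"].mean()
    return binned.rename(tf_name)

# Hypothetical file paths; in practice there would be one per TF dataset.
tf_files = {"CLAMP": "clamp.bedgraph", "MSL2": "msl2.bedgraph"}

# Rows = genomic bins, columns = TFs, values = regional binding scores.
binding_matrix = pd.concat(
    [bin_bedgraph(path, tf) for tf, path in tf_files.items()], axis=1)
binding_matrix = binding_matrix.fillna(0.0)  # bins with no signal for a TF
```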
The biggest takeaway is that available datasets and computing resources are underestimated bottlenecks. However, this pushed us to optimize our code so that we could run our programs locally. Lastly, towards the end of the project, we began writing and structuring our code in a manner that could be easily parallelized. We will be sure to apply these lessons as we progress through this project and in projects that follow.
Links
Notes
Check-in reports are in the updates section.