Intro
Transcription must be tightly regulated to drive normal organismal development and to prevent the formation of disease states. Mutations in transcription factors (TFs) and chromatin regulators have been identified as driver mutations in diverse diseases including cancer, autoimmunity, neurological disorders, diabetes, and cardiovascular disease. Therefore, it is essential to understand how TFs identify their correct targets within a highly complex and compact genome. In addition, it is critical to reveal the components responsible for mediating transient and sustained binding of these TFs.
Traditional TF enrichment assays, such as ChIP-seq or CUT & RUN, identify where a TF is bound across the genome. However, they do not reveal whether the TF acts alone or within a hub of TFs. Additionally, TFs can play multiple context-specific roles in the genome, so the set of co-bound factors may be unique at every site. To address this, we are building a deep neural network that predicts which other TFs are bound at each binding site obtained from a standard TF enrichment assay.
The deep learning task is classification: for every binding site, the model outputs a probability distribution over the TFs that could possibly bind that site.
Related Work
There has been prior work that uses DNA sequence to infer protein binding and gene expression. Additionally, some of the research below successfully uses transformers in the deep learning architecture to make predictions about gene expression. In "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome", the authors successfully apply the BERT language-modeling technique to extract more signal from DNA sequences. In fact, Ji et al. were able to find critical relationships between non-coding DNA and gene expression.
Given the success of DNABERT, we are drawn to research that uses transformers to understand how DNA sequence contributes to TF binding. The papers below describe useful architectures and DNA preprocessing techniques.
Articles:
- Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning (Nature Biotechnology)
- DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning (Genome Biology)
- Predicting gene expression levels from DNA sequences and post-transcriptional information with transformers
- DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome (Bioinformatics)
Data
Using the Gene Expression Omnibus (GEO), a public functional genomics data repository, we have downloaded and curated over 50 unique datasets from Drosophila melanogaster (fruit fly) S2 and Kc cells. Each dataset corresponds to a unique TF binding profile obtained using ChIP-seq or CUT & RUN. The datasets need moderate preprocessing, as the output of these techniques is a set of binding coordinates rather than DNA sequence. Fortunately, with bedtools we can extract the DNA sequence underlying the coordinates of every binding site. Afterwards, we can import and prepare the DNA sequences in Python using the following packages: pandas, numpy, scipy, scikit-learn, and tensorflow.
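The coordinate-to-sequence step above can be sketched in plain Python. This is a toy illustration, not our pipeline: in practice we run bedtools on the real Drosophila genome FASTA, and the chromosome dictionary, peak intervals, and function name below are all invented for the example.

```python
# Sketch of mapping BED-style binding coordinates to DNA sequences.
# A toy genome dict stands in for the real genome FASTA; in practice
# this step is done with bedtools.

def extract_sequences(bed_records, genome):
    """Map (chrom, start, end) intervals to DNA sequences.

    bed_records: iterable of (chrom, start, end) tuples using the BED
                 convention of 0-based, half-open coordinates.
    genome:      dict mapping chromosome name -> full sequence string.
    """
    return [genome[chrom][start:end] for chrom, start, end in bed_records]

# Toy stand-in for one chromosome; intervals are illustrative.
toy_genome = {"chr2L": "ACGTACGTTTGCAACGTACG"}
peaks = [("chr2L", 0, 4), ("chr2L", 8, 12)]
print(extract_sequences(peaks, toy_genome))  # ['ACGT', 'TTGC']
```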
Methodology
The simplified architecture of our model consists of multiple transformer blocks (the number will be determined experimentally) and multiple linear layers, with the final layer using a softmax activation function. This architecture outputs a probability distribution over the TFs most likely to bind a particular DNA sequence. Notably, we are not interested solely in the top score but in the top ten scores, as these indicate which TFs are most likely bound and acting together in that region.
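The final softmax-and-top-ten step can be sketched as post-processing on the model's output logits. This is a minimal illustration in plain Python; the TF names and logit values are placeholders, not model output.

```python
import math

def softmax(logits):
    """Convert raw output-layer scores to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_tfs(tf_names, logits, k=10):
    """Return the k TFs with the highest predicted binding probability."""
    probs = softmax(logits)
    ranked = sorted(zip(tf_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical output layer over four TFs (names and scores illustrative).
tfs = ["GAF", "CLAMP", "Pho", "M1BP"]
print(top_k_tfs(tfs, [2.0, 1.0, 0.1, -1.0], k=2))
```

In the real model the output layer would cover all ~50 curated TFs, and the top ten entries of this ranking would suggest which factors co-occupy the site.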
For training, every sequence is labeled with the TF that it represents. In addition, unlike the natural-language tasks transformers were originally applied to, DNA is written in an alphabet of only four letters: A, C, G, T. Therefore, to build a richer vocabulary for our proposed architecture, we group nucleotides into K-mers. For example, given the sequence 'ATCGTACACCGC', K = 1 yields the single-letter tokens 'A', 'T', 'C', 'G', and so on, while K = 3 yields 'ATC', 'GTA', 'CAC', 'CGC'. This allows us to enumerate and document the unique tokens in a DNA sequence.
Metrics
The metrics used in this project will be accuracy and perplexity. The base goal is to implement the architecture described above. The target goal is to incorporate a multi-input architecture that utilizes both the DNA sequences and the genomic coordinates from which they were extracted. The stretch goal is to validate some of our hits in the laboratory.
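For concreteness, the two metrics can be computed as follows. This is a generic sketch, not our evaluation code; the labels and probabilities are made up, and perplexity is computed in the standard way as the exponential of the mean negative log-probability assigned to the true label.

```python
import math

def accuracy(predicted_labels, true_labels):
    """Fraction of sites where the top-predicted TF matches the true label."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

def perplexity(true_label_probs):
    """exp of the average negative log-probability given to the true label.

    A perfect model scores 1.0; a uniform guess over N classes scores N.
    """
    nll = -sum(math.log(p) for p in true_label_probs) / len(true_label_probs)
    return math.exp(nll)

# Illustrative values only.
print(accuracy(["GAF", "Pho"], ["GAF", "CLAMP"]))  # 0.5
print(perplexity([0.5, 0.5]))                      # 2.0
```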
Ethics
Lastly, we will be addressing the following ethical issues:
- Why is Deep Learning a good approach to this problem?
- How would our findings translate to the clinic?