Title: Leveraging a graph neural network that integrates multimodal genomic data to predict dosage compensation complex (DCC) binding sites on the X-chromosome vs. autosomes.
Introduction: Sex-specific molecular hubs coordinate RNA, DNA, and protein interactions into dense 3D regulatory networks that shape sexual dimorphism, yet the organizational principles that govern their function remain largely unknown [1, 2, 3]. Within these hubs, biomolecules assemble in self-condensed regions of the nucleus to coordinate specific regulatory processes such as transcription. The hubs create localized environments that recruit molecules with a shared biological purpose to specific regions of the nucleus, improving the efficiency of gene regulation by coordinating which genes are expressed and how chromatin is organized.
Previous research from the Larschan Lab has identified dosage compensation as a sex-specific regulatory process in Drosophila [4]. In Drosophila, dosage compensation is regulated by the dosage compensation complex (DCC), which equalizes X-linked expression between the sexes by upregulating genes on the male’s single X-chromosome, contributing to sexual dimorphism. While the key molecular drivers of this process have been identified and their coordination within specialized hubs is recognized, little is understood about how each process’s distinct sets of RNAs, DNA elements, and proteins interact combinatorially in three-dimensional space to carry out their functions.
Related Work: In Drosophila, the DCC plays a direct role in dosage compensation by binding to MSL recognition elements (MREs) on the male X-chromosome, which act as entry sites for complex assembly and subsequent upregulation of X-linked genes [5]. While the pioneer transcription factor CLAMP (Chromatin-linked adapter for MSL proteins), a member of the DCC, facilitates DCC targeting, its widespread genomic binding suggests that additional cofactors or chromatin marks are required for precise localization [6]. I will integrate Micro-C and ChIP-seq data into a GNN framework in which genomic bins carrying ChIP-seq signal are represented as nodes and Micro-C contacts define the edges. By training the model on a classification task, I will be able to identify the nodes and edges that are important for classifying MREs on the X-chromosome versus the autosomes. Through this approach, I will test the hypothesis that specific combinations of transcription factors and histone modifications confer the specificity of DCC binding to MREs.
Data: As described above, genomic bins with their corresponding ChIP-seq signals represent the nodes, while Micro-C 3D contacts serve as the edges. ChIP-seq maps transcription factor binding and histone modifications genome-wide, and the node features include both types of signal. The dataset includes three transcription factors (CLAMP, GAF, and Psq) and eight histone modifications (H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9me3, and H4K16ac), which comprise methylation and acetylation marks. These ChIP-seq datasets were selected because they were previously shown to capture regulatory features of Drosophila dosage compensation or because they have sex-specific functions. They are publicly available and can be downloaded from NCBI. Micro-C is a 3D genomics technique that maps the three-dimensional structure of the genome at high resolution. The Micro-C dataset used to define the edges was generated in-house within the Larschan Lab. Both the ChIP-seq and Micro-C datasets are binned at 1 kilobase (kb) resolution to support fine-grained model interpretation.
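The node and edge definitions above can be illustrated with a minimal sketch in plain Python. The function name `build_graph`, the input layouts, and the chromosome names are assumptions made for illustration; they are not taken from the project's code. The edge list is returned in a two-row source/target layout, which is the shape graph libraries commonly expect.

```python
BIN_SIZE = 1_000  # 1 kb bins, as described in the text


def build_graph(bin_features, contacts):
    """Turn per-bin ChIP-seq features and Micro-C contacts into a graph.

    bin_features: dict mapping (chrom, bin_index) -> list of signal values
    contacts: iterable of (chrom_a, pos_a, chrom_b, pos_b) in base pairs
    (both input layouts are hypothetical)
    """
    # Give every bin a stable integer node id, ordered by chromosome and bin.
    ordered_bins = sorted(bin_features)
    node_ids = {key: i for i, key in enumerate(ordered_bins)}
    x = [bin_features[key] for key in ordered_bins]

    edges = set()
    for chrom_a, pos_a, chrom_b, pos_b in contacts:
        a = node_ids.get((chrom_a, pos_a // BIN_SIZE))
        b = node_ids.get((chrom_b, pos_b // BIN_SIZE))
        if a is None or b is None or a == b:
            continue  # skip contacts touching unknown bins, and self-loops
        edges.add((a, b))
        edges.add((b, a))  # store both directions so the graph is undirected

    # Two parallel lists: source node ids and target node ids.
    ordered_edges = sorted(edges)
    src = [a for a, _ in ordered_edges]
    dst = [b for _, b in ordered_edges]
    return node_ids, x, [src, dst]
```

In this sketch, contacts that fall within a single bin are dropped rather than kept as self-loops; whether self-loops should be kept is a modeling choice the text does not specify.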
Pre-processing will include counts-per-million (CPM) normalization of the ChIP-seq data to mitigate technical and batch effects. Significant chromatin loops will be identified with FitHiC at three significance thresholds (0.1, 0.01, and 0.001) to reduce the graph to interactions that carry meaningful biological signal [7]. MRE sequences consist of a highly conserved GA-rich motif that is present on both the autosomes and the X-chromosome. To characterize MRE sites throughout the genome, the Drosophila genome was scanned using FIMO (Find Individual Motif Occurrences), a tool from the MEME Suite. Each 1 kb bin was then labeled as follows: 0 for no MRE sequence, 1 for an MRE sequence on an autosome, and 2 for an MRE sequence on the X-chromosome. These labels define the multiclass classification task.
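The CPM normalization and the three-class labeling scheme can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: the function names are hypothetical, the CPM total is taken as the sum over the bins passed in, motif hits are assumed to use 1-based inclusive coordinates, the X-chromosome is assumed to be named `chrX`, and every other chromosome is treated as an autosome.

```python
BIN_SIZE = 1_000  # 1 kb bins


def cpm_normalize(counts):
    """Scale raw per-bin read counts to counts per million.

    Assumption: the library size is the sum of the counts passed in.
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c * 1_000_000 / total for c in counts]


def label_bins(n_bins_per_chrom, motif_hits, x_name="chrX"):
    """Label each bin: 0 = no MRE, 1 = MRE on an autosome, 2 = MRE on the X.

    n_bins_per_chrom: dict mapping chromosome name -> number of 1 kb bins
    motif_hits: iterable of (chrom, start, stop), 1-based inclusive (assumed)
    """
    labels = {
        (chrom, i): 0
        for chrom, n_bins in n_bins_per_chrom.items()
        for i in range(n_bins)
    }
    for chrom, start, stop in motif_hits:
        label = 2 if chrom == x_name else 1
        first = (start - 1) // BIN_SIZE
        last = (stop - 1) // BIN_SIZE
        # A hit that straddles a bin boundary labels every bin it overlaps.
        for i in range(first, last + 1):
            if (chrom, i) in labels:
                labels[(chrom, i)] = label
    return labels
```

Labeling every overlapped bin is one possible convention; assigning a hit only to the bin containing its start position would be an equally simple alternative.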
Methodology: To develop a robust framework for modeling chromatin interaction networks, I will train two baseline graph neural network architectures: the Graph Convolutional Network (GCN) and the Graph Attention Network (GAT). The GAT will operate on a k-nearest neighbor (k-NN) graph (k=10) that defines the receptive field of each node, so that the GNN aggregates information from relevant, closely related nodes rather than from the entire graph. Because the Drosophila genome is small, this k-NN graph should still capture both short- and long-range interactions. Both models will be implemented through the existing GNNTrainerClass, and I will perform systematic hyperparameter sweeps across both model types. Model optimization will occur in three stages. (1A) Model parameters: I will assess the effects of hidden dimension size, number of attention heads (for the GAT), and number of linear layers on the performance of the GCN and GAT architectures. (1B) Data parameters: I will vary the neighbor sampling strategy (number of layers and number of neighbors per genomic bin) and test the impact of each Micro-C significance threshold (p-value thresholds of 0.1, 0.01, and 0.001). (1C) Training hyperparameters: learning rate, weight decay, dropout, scheduler, and optimizer will also be optimized. I will use WandB sweeps to tune these hyperparameters, with the MRE labels as the classification target.
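The three optimization stages could be expressed as a single sweep configuration along the following lines. This is a hypothetical sketch: the parameter names, the candidate values (other than the two model types and the three Micro-C thresholds, which come from the text), the search method, and the metric name `val_auprc` are all placeholders, not the project's actual settings.

```python
# Hypothetical search space for illustration only. Parameter names and most
# candidate values are placeholders, not the project's real configuration.
sweep_config = {
    "method": "bayes",  # assumed search strategy; "grid" and "random" also exist
    "metric": {"name": "val_auprc", "goal": "maximize"},  # assumed metric name
    "parameters": {
        # (1A) model parameters
        "model_type": {"values": ["GCN", "GAT"]},
        "hidden_dim": {"values": [64, 128, 256]},
        "num_heads": {"values": [1, 4, 8]},  # only meaningful for the GAT
        "num_linear_layers": {"values": [1, 2, 3]},
        # (1B) data parameters
        "num_sampling_layers": {"values": [2, 3]},
        "num_neighbors": {"values": [5, 10, 20]},
        "microc_p_threshold": {"values": [0.1, 0.01, 0.001]},
        # (1C) training hyperparameters
        "learning_rate": {"values": [1e-4, 1e-3, 1e-2]},
        "weight_decay": {"values": [0.0, 1e-5, 1e-4]},
        "dropout": {"values": [0.0, 0.2, 0.5]},
        "scheduler": {"values": ["none", "cosine"]},
        "optimizer": {"values": ["adam", "adamw"]},
    },
}
```

A dictionary of this shape is what `wandb.sweep` accepts; the training function that reads these values would be supplied separately and is not shown here.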
Metrics: The best-performing configuration for each feature combination will then be compared across the GCN and GAT models to determine the most effective framework for downstream biological interpretation. Model performance will be evaluated with the torcheval.metrics package, using MulticlassAccuracy, MulticlassPrecision, MulticlassAUROC, and MulticlassAUPRC. My initial training sweeps will include all ChIP-seq factors; however, I also want to test whether including or excluding specific factors affects classification. In particular, I plan to exclude sex-specific transcription factors such as Psq, CLAMP, and GAF, as well as histone marks directly associated with sex-specific regulation, such as H4K16ac. This will allow me to capture chromatin interaction patterns that distinguish MREs on the autosomes from those on the X-chromosome without the confounding influence of sex-specific factors tied to the X-chromosome, which might otherwise make MRE prediction trivial.
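The feature-exclusion experiment amounts to dropping selected columns from the node feature matrix before training. A minimal sketch is shown below; the column order in `FEATURES` and the helper name `drop_features` are assumptions for illustration, while the factor names and the excluded set come from the text.

```python
# Assumed column order of the node feature matrix (hypothetical).
FEATURES = [
    "CLAMP", "GAF", "Psq",
    "H3K27ac", "H3K27me3", "H3K36me3",
    "H3K4me1", "H3K4me2", "H3K4me3", "H3K9me3", "H4K16ac",
]

# Factors the text proposes to exclude in the ablation.
SEX_SPECIFIC = {"CLAMP", "GAF", "Psq", "H4K16ac"}


def drop_features(x, feature_names, excluded):
    """Remove the columns named in `excluded` from a row-major feature matrix.

    Returns the filtered matrix and the names of the columns that remain.
    """
    keep = [i for i, name in enumerate(feature_names) if name not in excluded]
    kept_names = [feature_names[i] for i in keep]
    filtered = [[row[i] for i in keep] for row in x]
    return filtered, kept_names
```

Returning the surviving column names alongside the matrix keeps the mapping from feature index to factor explicit, which matters later when feature importances are interpreted.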
Ethics:
Why is Deep Learning a good approach to this problem? Deep learning is well suited to modeling dosage compensation because it can capture complex relationships across genomic data from different modalities. No available experimental genomic method can capture the combinatorial patterns of many factors simultaneously. Additionally, by using a graph neural network, I integrate 3D chromatin interactions with ChIP-seq signals into a unified representation that serves as a simplified model of the spatially organized regulatory environment within a cell.
How are you planning to quantify or measure error or success? What implications does your quantification have? If the deep learning model shows that MREs with strong DCC binding co-occur with specific pioneer transcription factors (TFs) and active histone marks, this would suggest that these chromatin features are important for DCC targeting and subsequent gene activation. In this scenario, pioneer TFs may facilitate chromatin accessibility, while active histone marks establish an environment that enables effective recruitment of the DCC. Together, these factors would point to a coordinated mechanism by which the DCC is directed toward active chromatin regions to regulate transcriptional output. Experimental scientists can test this interpretation by conducting targeted perturbation experiments on the top-ranked features from the model. Observing reduced DCC binding or altered expression levels following these perturbations would provide strong evidence that these chromatin features functionally support DCC targeting.
Division of labor: Me :D
Github Link: https://github.com/sarahsgu120/CSCI-1410-Final-Project---FlySolo
References:
- [1] Dundr, M. & Misteli, T. Biogenesis of nuclear bodies. Cold Spring Harb. Perspect. Biol. 2, a000711 (2010)
- [2] Pombo, A. & Dillon, N. Three-dimensional genome architecture: players and mechanisms. Nat. Rev. Mol. Cell Biol. 16, 245–257 (2015)
- [3] Strom, A. R. & Brangwynne, C. P. The liquid nucleome - phase transitions in the nucleus at a glance. J. Cell Sci. 132, jcs235093 (2019)
- [4] Gelbart, M. E. & Kuroda, M. I. Drosophila dosage compensation: a complex voyage to the X chromosome. Development 136, 1399–1410 (2009)
- [5] Kelley, R. L., Meller, V. H., Gordadze, P. R., Roman, G., Davis, R. L. & Kuroda, M. I. Epigenetic spreading of the Drosophila dosage compensation complex from roX RNA genes into flanking chromatin. Cell 98, 513–522 (1999)
- [6] Soruco, M. M. L. et al. The CLAMP protein links the MSL complex to the X chromosome during Drosophila dosage compensation. Genes Dev. 27, 1551–1556 (2013). doi:10.1101/gad.214585.113
- [7] Kaul, A., Bhattacharyya, S. & Ay, F. Identifying statistically significant chromatin contacts from Hi-C data with FitHiC2. Nat. Protoc. 15, 991–1012 (2020). https://doi.org/10.1038/s41596-019-0273-0
Built With
- genomics
- python