Inspiration

We were inspired to create this project by our individual connections to the life sciences and computational biology. Some of our group members specialize in gene regulatory network (GRN) inference, and we hope this project contributes to ongoing research in the area!

What it does

This model, based on the recently published SATORI software, uses a 1D convolutional layer to extract motifs from a candidate enhancer sequence. The model then applies self-attention over the detected motifs to capture potential motif/transcription-factor interactions. Finally, additional (less biologically grounded) layers, such as a bidirectional LSTM and a feed-forward classifier, label the input candidate as an enhancer (or not) of some biological activity of interest. While the predictive ability of a model like ours can yield some biological information, the real power comes from the model's self-attention weights, which we can use to infer transcription factor cooperation, a task critical to GRN inference.
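As a concrete reference, here is a minimal PyTorch sketch of that pipeline. The layer sizes, filter counts, and names (num_filters, motif_len, etc.) are illustrative assumptions, not the exact SATORI configuration:

```python
# Minimal sketch of the conv -> self-attention -> biLSTM pipeline described above.
# Layer sizes and names are illustrative assumptions, not SATORI's exact values.
import torch
import torch.nn as nn

class EnhancerClassifier(nn.Module):
    def __init__(self, num_filters=200, motif_len=13, attn_heads=8, lstm_hidden=100):
        super().__init__()
        # 1D convolution over one-hot DNA (4 channels) acts as a motif scanner
        self.conv = nn.Conv1d(4, num_filters, kernel_size=motif_len)
        self.pool = nn.MaxPool1d(kernel_size=4)
        # Self-attention over pooled motif activations captures motif-motif interactions
        self.attn = nn.MultiheadAttention(embed_dim=num_filters,
                                          num_heads=attn_heads, batch_first=True)
        # Bidirectional LSTM plus a linear head for the final enhancer/non-enhancer call
        self.lstm = nn.LSTM(num_filters, lstm_hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, x):                      # x: (batch, 4, seq_len), one-hot DNA
        h = torch.relu(self.conv(x))           # (batch, num_filters, L)
        h = self.pool(h).transpose(1, 2)       # (batch, L', num_filters)
        h, attn_weights = self.attn(h, h, h)   # attn_weights reused for interpretation
        h, _ = self.lstm(h)                    # (batch, L', 2 * lstm_hidden)
        logit = self.head(h.mean(dim=1))       # pool over positions, then classify
        return logit.squeeze(-1), attn_weights
```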

How we built it

We initially focused on replicating the SATORI model while exploring additional architectural modifications to improve performance. As we encountered challenges and unexpected outcomes, such as the limited interpretability of the bidirectional LSTM and difficulties in interpreting self-attention weights, we adjusted our approach by prioritizing components that showed the most promise in enhancing model performance and interpretability. This iterative process of experimentation and adaptation allowed us to make informed decisions and refine our model over time.

Challenges we ran into

One of the primary challenges we encountered was interpreting the interaction graph inferred from the self-attention weights. Despite achieving high accuracy in classifying transcription factor binding motifs (TFBMs) on D. melanogaster promoter data, we found the resulting interaction graph too complex to interpret reliably. While the self-attention mechanism captures relationships between motifs, extracting meaningful insights from the learned weights proved challenging, so the precise regulatory interactions between transcription factors remained elusive.
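To make the task concrete, below is a simplified sketch of the kind of post-hoc aggregation involved: attribute each sequence position to its most active convolutional filter, then accumulate attention mass between motif pairs. The attribution scheme and cutoff value here are our own simplifications, not SATORI's exact procedure:

```python
# Simplified sketch: turn one sequence's attention map into a motif interaction matrix.
# The max-activation attribution and the 0.1 cutoff are illustrative assumptions.
import numpy as np

def interaction_matrix(attn_weights, filter_activations, cutoff=0.1):
    """attn_weights: (positions, positions) attention map for one sequence.
    filter_activations: (num_filters, positions) conv activations, assumed
    pooled to the same positional resolution as the attention map."""
    num_filters = filter_activations.shape[0]
    # Attribute each position to the filter (motif) that fires hardest there
    pos_to_motif = filter_activations.argmax(axis=0)
    interactions = np.zeros((num_filters, num_filters))
    # Accumulate attention weight between the motifs behind each attended pair
    for i, j in zip(*np.where(attn_weights > cutoff)):
        interactions[pos_to_motif[i], pos_to_motif[j]] += attn_weights[i, j]
    return interactions  # entry (a, b): evidence that motif a attends to motif b
```

Even with an aggregation like this, the resulting graph is dense and noisy, which is exactly the interpretability problem described above.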

Our model also struggled to learn the short, highly variable motifs present in the regulatory regions of D. melanogaster promoters. These motifs, often crucial for gene regulation, posed a significant challenge due to their scarcity and variability: the convolutional filters designed to capture them had difficulty identifying and representing them accurately, hurting the model's overall performance.

Despite successfully extracting motifs from the convolutional filters, filter interpretability diminished as we increased the number of filters in the convolutional step: understanding the specific motif learned by each filter became increasingly complex. This lack of filter interpretability hindered our ability to analyze the model's representations effectively, limiting insights into the regulatory mechanisms underlying transcriptional regulation in D. melanogaster.
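For context, the standard way to read a motif out of a filter (following DeepBind-style analysis, which SATORI also builds on) is to collect the subsequences that strongly activate it and stack them into a position frequency matrix. A sketch, where the activation-fraction threshold and helper names are illustrative assumptions:

```python
# Sketch of filter-to-motif extraction for ONE filter: gather subsequences whose
# activation exceeds a fraction of the filter's max, then count bases per position.
# The 0.5 fraction and function name are illustrative assumptions.
import numpy as np

def filter_to_pfm(activations, onehot_seqs, motif_len, frac=0.5):
    """activations: (num_seqs, positions) pre-pooling conv output for one filter;
    onehot_seqs: (num_seqs, 4, seq_len) one-hot DNA, so activation position p
    corresponds to the subsequence [p, p + motif_len)."""
    threshold = frac * activations.max()
    pfm = np.zeros((4, motif_len))  # base counts per motif position (A, C, G, T)
    for s, p in zip(*np.where(activations > threshold)):
        pfm += onehot_seqs[s, :, p:p + motif_len]  # add the activating subsequence
    counts = pfm.sum(axis=0, keepdims=True)
    return pfm / np.maximum(counts, 1)  # column-normalize to base frequencies
```

With hundreds of filters, each produces a matrix like this, and matching them to known TFBMs one by one is what becomes unwieldy.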

While the model demonstrated high accuracy in classifying TFBMs, it had trouble learning entire interaction graphs. Not all of the information encoded in the interaction graph is necessary for achieving high classification accuracy, so the model struggled to discern relevant interactions from noise, making it difficult to accurately represent the regulatory network within D. melanogaster promoters.

Accomplishments that we're proud of

Reflecting on the project in light of our original goals, we feel that we made significant progress and achieved several key milestones. Our base goal of reaching an AUC of 0.85 without the bidirectional LSTM was successfully met, as we surpassed this threshold with our architectural modifications. However, we fell short of replicating the paper’s result of 0.94 AUC with the bidirectional LSTM. Despite this, we gained valuable insights into the impact of different architectural changes on model performance and transcription factor interaction inference.
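For reference, the AUC figures above come from standard ROC analysis on held-out sequences; a minimal sketch with placeholder data:

```python
# Minimal sketch of how the AUC figures above are computed; labels and scores
# here are placeholders, in practice they come from the held-out test split.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0])           # 0/1 enhancer labels
y_score = np.array([0.2, 0.9, 0.7, 0.4, 0.8, 0.1])  # model probabilities
print(f"held-out AUC: {roc_auc_score(y_true, y_score):.3f}")  # base goal: >= 0.85
```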

What we learned

Overall, our biggest takeaways from this project include the importance of iterative experimentation, the value of incorporating domain knowledge into model design, and the challenges associated with interpreting complex deep learning models in biological contexts. Despite falling short of some of our stretch goals, we believe that the insights gained from this project will inform future research and contribute to advancements in the field of computational biology.

What's next for Identifying Transcription Factor Cooperation with Attention

If we could do the project over again, we might allocate more time to fine-tuning the model's hyperparameters and exploring alternative architectures earlier in the process. Additionally, dedicating more resources to accurately interpreting self-attention weights and exploring the role of self-attention in the prediction process could provide valuable insights into the model's inner workings and improve its overall performance.
