Title
Using Deep Learning to Predict Protein Secondary Structure
Who
Team Echidnas: Hari Dandapani [hdandapa], Helen Zhang [hzhan103], Alex Guo [yguo62]
Introduction
Proteins have four levels of structure: primary, secondary, tertiary, and quaternary. The primary structure is the linear sequence of the 20 different amino acids as they appear after processing steps such as splicing. The secondary structure describes how local stretches of that chain fold on each other: in introductory biology coursework, students are taught that the secondary structure of an amino acid sequence can be alpha helices or beta pleated sheets. The dataset that we draw from further breaks this down into redundant and non-redundant alpha as well as redundant and non-redundant beta.
This project takes two different approaches (one based on Convolutional Neural Networks and one based on Natural Language Processing) to predict the secondary structure of proteins from the primary structure of the amino acid chain.
Related Work
For this project, we are drawing primarily on this research paper as inspiration. From the abstract of the paper, "In this study, we present a novel computational method, TMPSS, to predict the secondary structures in non-transmembrane parts and the topology structures in transmembrane parts of αTMPs. TMPSS applied a Convolutional Neural Network (CNN), combined with an attention-enhanced Bidirectional Long Short-Term Memory (BiLSTM) network, to extract the local contexts and long-distance interdependencies from primary sequences." We seek to replicate the same prediction structure with CNNs and LSTMs while also trying out a methodology of our own, based on NLP, potentially using Transformers and RNNs to see if they can also solve this problem.
This paper uses deep learning to predict 8-state secondary protein structures. They use four types of features, "including a position-specific scoring matrix (PSSM), protein coding features, conservation scores, and physical properties" to characterize each residue in a protein sequence. Using a convolutional, residual, and recurrent neural network to address this problem, their model captures local and global features to achieve 71.4% accuracy.
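The CNN-plus-BiLSTM arrangement described above can be sketched in Keras. This is a minimal illustration, not the papers' exact configurations: the layer sizes, kernel widths, the 21 input channels (20 amino acids plus one extra slot), and the 8-state output are all our assumptions.

```python
import tensorflow as tf

# Illustrative dimensions (assumptions, not the papers' settings):
# 21 input channels per residue, 8 secondary-structure states.
NUM_FEATURES = 21
NUM_STATES = 8

def build_cnn_bilstm(seq_len=None):
    """Sketch of a per-residue classifier: convolutions capture local
    context, a BiLSTM propagates long-range information along the chain."""
    inputs = tf.keras.Input(shape=(seq_len, NUM_FEATURES))
    # Local windows of neighboring residues.
    x = tf.keras.layers.Conv1D(64, 7, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv1D(64, 3, padding="same", activation="relu")(x)
    # Long-distance interdependencies in both directions.
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True))(x)
    # One softmax per residue over the secondary-structure states.
    outputs = tf.keras.layers.Dense(NUM_STATES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_cnn_bilstm()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Because `seq_len` is left as `None`, the same model accepts protein chains of any length, which suits variable-length sequences from the database.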
Data
We will extract our data from PDBTM: Protein Data Bank of Transmembrane Proteins. This database contains 6499 transmembrane proteins, of which 5976 are alpha and 472 are beta; the rest are unclassified. We will need to use web scraping to extract the sequences and labels from the database for use in our model.
One issue with the dataset is that 92% of the sequences are alpha. We will need to find intelligent ways to partition the data into training and test sets so that the model does not simply always predict alpha.
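One simple way to handle the 92% alpha skew is a stratified split, so that each label keeps the same proportion in the training and test sets. A minimal sketch in plain Python (function name and interface are our own):

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, test_frac=0.2, seed=0):
    """Split so that alpha/beta proportions match across train and test,
    rather than letting the majority class dominate one split."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_label[lab].append(ex)
    train, test = [], []
    # Carve off the same fraction of each class separately.
    for lab, items in by_label.items():
        rng.shuffle(items)
        cut = int(len(items) * test_frac)
        test.extend((ex, lab) for ex in items[:cut])
        train.extend((ex, lab) for ex in items[cut:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

Stratification keeps the evaluation honest, but it does not fix the imbalance itself; class weighting or resampling during training would still be worth considering.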
As a final aside, we are looking into other ways of obtaining the data, such as the Protein Data Bank, to see if we can use its more voluminous quantity of data for our model. We might try to use something like this one, though it would require much more preprocessing and pre-analysis.
Methodology
TMPSS uses arrangements of CNNs and LSTMs in its classification model. We will expand on the project by exploring NLP methodology for similar problems of protein sequence classification.
We will attempt to use a transformer-based model. This might entail any combination of the following, based on the Transformer seq2seq model:
- Attention Heads
- Position Encoding Layers to add positional embeddings
- Transformer blocks for our encoder and decoder
Metrics
We will train and test our model on the labeled redundant and non-redundant alpha/beta protein sequences. Accuracy is a common metric for evaluating protein secondary structure predictions. However, since our dataset is skewed toward alpha helices, we want more nuance in our analysis and will use accuracy, recall, precision, specificity, the Matthews correlation coefficient, and F-measure to holistically assess model performance. These indicators will tell us not just how often our model is correct, but also how it behaves relative to its mistakes. The authors of the original research paper used similar performance metrics, and also compared model performance against past state-of-the-art models using accuracy and Q3.
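All of the metrics above can be derived from the four confusion counts. A small sketch (our own helper, not from the paper) shows why accuracy alone is misleading on our skewed data: a model that always predicts alpha scores 92% accuracy but zero specificity and an undefined (here, zero) Matthews correlation coefficient.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute per-class metrics from confusion counts
    (true/false positives and negatives)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Matthews correlation coefficient; 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "mcc": mcc}

# Always-predict-alpha on a 92/8 split: high accuracy, useless elsewhere.
m = binary_metrics(tp=92, fp=8, tn=0, fn=0)
```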
Base goal: We will have a working model capable of processing input protein sequences and outputting predicted secondary structure labels. Ideally, this model will have overall similar architecture to TMPSS, though the exact layers may be slightly different.
Target goal: We will replicate the model proposed in the paper as well as finalize a working model using our Transformer based approach. Inputs will be similar to those proposed in the paper (one-hot matrix, HHblits profile).
Stretch goal: We will visualize the predicted topology of the protein sequence using our model’s predicted secondary structure labels.
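As a concrete illustration of the one-hot input matrix mentioned in the target goal, a primary sequence of length L can be encoded as an (L, 21) matrix: 20 standard residues plus one reserved column for unknown characters (the extra column and alphabet ordering here are our assumptions).

```python
import numpy as np

# The 20 standard amino acids, one-letter codes; the final index is
# reserved for unknown or non-standard residues (our convention).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_sequence(seq):
    """Encode a primary sequence as an (L, 21) one-hot matrix."""
    mat = np.zeros((len(seq), len(AMINO_ACIDS) + 1), dtype=np.float32)
    for pos, aa in enumerate(seq.upper()):
        # Unknown residues (e.g. 'X') fall into the last column.
        mat[pos, AA_INDEX.get(aa, len(AMINO_ACIDS))] = 1.0
    return mat
```

This matrix is exactly the per-residue feature shape the model sketches elsewhere expect; richer inputs like HHblits profiles would simply replace or extend these 21 channels.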
Ethics
Deep learning can be a useful tool in situations where technical limitations prohibit experimentation. It can also help identify new patterns at a larger scale than is feasible for humans. However, deep learning models must be extensively validated: they can amplify implicit biases and errors, which can go unnoticed without constant, objective scrutiny.
The use of biological data raises many important questions, especially in regard to human applications, which introduce many confounding variables as well as stronger repercussions. Biological data is highly variable because many factors that impact samples cannot be controlled, and sample collection is subject to countless logistical factors. Patient sample collection is also subject to societal factors and can perpetuate societal biases and structural idiosyncrasies. The over- or underrepresentation of various groups can produce results that apply poorly in certain cases, and unexamined assumptions about data quality can negatively impact future research and applications of findings.
Division of Labor
We plan on working equally across all three domains of the project:
- Downloading and preprocessing
- Original model with CNNs
- New NLP-ish model
Second Check-in:
https://docs.google.com/document/d/1CWWd8qkigAQ7O4noN4OHfGmpILWiJ5F5yO3_BsIlbrQ/edit?usp=sharing
Final Results and Reflection:
https://docs.google.com/document/d/1fo2V5xevNZkAFqgCa59d7A0ivcymH8V4ZrqrEZUro6Y/edit?usp=sharing
GitHub Repository:
https://github.com/helenzhang8/CSCI1470-Final/tree/fallingofftf
Digi-Poster:
https://docs.google.com/presentation/d/14rkeSDoSNMHfHTyGcTl_7cl8WEcw38xsQms6RHinMBA/edit?usp=sharing
Built With
- python
- tensorflow