
Hello Deep Learners! We are the Polypeptide Pals. For our final project, we will be reimplementing the paper Evaluating Protein Transfer Learning with TAPE. Protein modeling is an emerging field in deep learning research with enormous potential. The size of protein datasets has been increasing exponentially in recent years thanks to advances in sequencing technology. However, there is a growing gap between the size of those datasets and their labeled subsets, because labeling and annotating proteins in meaningful ways requires massive amounts of time, experimental equipment, and scientific expertise. With NLP models, we can automate the extraction of useful biological information from sequence datasets, enabling potentially rapid advances in structural bioinformatics and genomics, including structure prediction, detection of remote homologs, and protein engineering.

The paper assesses the performance of three self-supervised models (an LSTM, a ResNet, and a Transformer) on five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. Its objective is to determine how well these models can predict the biological features of input sequences. The paper has a GitHub repository providing access to the datasets and model architectures, built using PyTorch. We will be reimplementing one of the models in TensorFlow and checking its performance on one of the five biological tasks. We chose to implement the Transformer architecture on the first task: Secondary Structure (SS) Prediction. SS is an important feature for understanding the function of a protein, especially one that is not evolutionarily related to proteins with known structure.
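To make the plan concrete, here is a minimal sketch of what a per-residue SS classifier in TensorFlow could look like. All sizes here (vocabulary, layer count, hidden dimension, sequence length) are placeholder assumptions for illustration and are far smaller than the paper's Transformer; our actual implementation will follow the architecture in the TAPE repository.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical sizes for illustration; the paper's Transformer is much larger.
VOCAB_SIZE = 30   # amino-acid alphabet plus special tokens (assumed)
NUM_CLASSES = 3   # 3-state SS labels: helix, strand, coil
MAX_LEN = 512
D_MODEL = 128
NUM_HEADS = 4
NUM_LAYERS = 2

class TokenAndPositionEmbedding(layers.Layer):
    """Sum of learned token and position embeddings for each residue."""
    def __init__(self, maxlen, vocab_size, d_model):
        super().__init__()
        self.tok = layers.Embedding(vocab_size, d_model)
        self.pos = layers.Embedding(maxlen, d_model)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.tok(x) + self.pos(positions)

def transformer_block(x):
    # Self-attention across residue positions, with residual + layer norm.
    attn = layers.MultiHeadAttention(num_heads=NUM_HEADS,
                                     key_dim=D_MODEL // NUM_HEADS)(x, x)
    x = layers.LayerNormalization()(x + attn)
    # Position-wise feed-forward sublayer.
    ff = layers.Dense(4 * D_MODEL, activation="relu")(x)
    ff = layers.Dense(D_MODEL)(ff)
    return layers.LayerNormalization()(x + ff)

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, D_MODEL)(inputs)
for _ in range(NUM_LAYERS):
    x = transformer_block(x)
# One SS class prediction per residue position (sequence labeling).
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The key design point is that SS prediction is a sequence labeling problem: unlike sentence-level classification, the head emits one label per residue, so the final Dense layer is applied position-wise over the encoder output.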

The greatest challenge we have encountered so far in our implementation is parsing through thousands of lines of code and determining which parts are relevant to the task we are focusing on and the model we are building. The PyTorch repository contains the code for every model the authors of the paper implemented, as well as the preprocessing for each of the tasks and their respective datasets. Because of this, it has taken some time to work through this information and gain an understanding of the codebase and its different modules.

We have just finished downloading the necessary data for our Secondary Structure Prediction task and preprocessing it accordingly. Since our model's implementation is not yet finished, we have no concrete results to report at this point.
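For a rough idea of what this preprocessing involves, here is a simplified sketch, assuming each example is a (sequence, labels) pair of equal-length strings. The token IDs, padding scheme, and label encoding below are our own illustrative choices, not the TAPE repository's exact vocabulary or pipeline.

```python
import numpy as np
import tensorflow as tf

AA_VOCAB = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
AA_TO_ID = {aa: i + 1 for i, aa in enumerate(AA_VOCAB)}  # 0 reserved for padding
SS_TO_ID = {"H": 0, "E": 1, "C": 2}  # 3-state secondary structure labels
MAX_LEN = 512

def encode_example(sequence, ss_labels):
    """Map residues and SS labels to fixed-length integer arrays."""
    ids = [AA_TO_ID.get(aa, 0) for aa in sequence[:MAX_LEN]]
    labels = [SS_TO_ID[s] for s in ss_labels[:MAX_LEN]]
    pad = MAX_LEN - len(ids)
    return (np.array(ids + [0] * pad, dtype=np.int32),
            np.array(labels + [0] * pad, dtype=np.int32))

# Toy usage: encode one short protein and build a batched dataset from it.
x, y = encode_example("MKTAYIAKQR", "CCHHHHHHCC")
dataset = tf.data.Dataset.from_tensor_slices(([x], [y])).batch(1)
```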

We planned to have our data downloaded and preprocessed at this point, so we are on track with our project plan. Our biggest worry moving forward is training time: the paper states that, because of their size, each model takes an entire week to train. However, since we are only training our Transformer model on a single task, this time should be reduced. The plan for the next week is to begin the model implementation and, hopefully, training. During this next phase of development, we may end up shrinking the model relative to the paper's implementation because of time and resource constraints. We need to dedicate as much time as possible to training and tuning our model, so we hope to have an initial implementation finished as soon as possible.
