DLNA

DNA Sequence Classification

Who: Samuel Murk Caya (smurkcay), Gabriel Gallardo (ggallar2), Liam O’Connor (loconno3), Jacques von Steuben (jvonsteu)

Introduction: The paper proposes using convolutional neural networks (CNNs) and related hybrid models to classify DNA sequences, with a potentially novel application to COVID, SARS, MERS, dengue, hepatitis, and influenza. Its architectures combine CNNs, embedding layers, LSTMs, and more, so the paper integrates many of the concepts we have discussed throughout the semester! We chose the paper because of its real-world relevance and impact, especially with an ongoing pandemic. Faster and more accurate identification of pathogens is of vital importance. This is a classification problem.

Related Work: “A primer on deep learning in genomics” by Zou et al. provides an overview of how deep learning techniques and models may be applied to genomics data. First, it provides a workflow: curate data, select an architecture, train, evaluate, and interpret. Then, it gives an overview of different architectures and their uses, specifically feed-forward networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). The authors then provide tips on using deep learning models effectively and interpreting them. Finally, they cover specific applications in genomics and provide further resources.

Data: The data we will use consist of complete DNA/genomic sequences of viruses: COVID, SARS, MERS, dengue, hepatitis, and influenza. These are obtained from the public nucleotide sequence database of the National Center for Biotechnology Information (NCBI). The DNA sequence data come as FASTA files. The sequence lengths range from 8 to 37,971 nucleotides. In total, there are 66,153 sequence samples. There are more samples for certain viruses than others, so we will need to employ SMOTE (Synthetic Minority Oversampling Technique): synthetic samples for minority classes such as MERS and dengue will be generated so that their counts closely match that of the majority class.
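To make the oversampling step concrete, here is a minimal pure-Python sketch of the core SMOTE idea: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, the default `k`, and the use of plain lists of floats are our own illustrative assumptions; we have not confirmed how the paper applies SMOTE to encoded sequences, and interpolating label-encoded nucleotides yields non-integer values that would need separate handling.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length numeric feature vectors (needs >= 2 points).
    k: number of nearest minority neighbours to choose from (assumed value).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy usage: three minority points in 2-D, five synthetic points requested.
toy_minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]
toy_synthetic = smote(toy_minority, n_new=5)
```

In practice a library implementation would likely be preferable to this sketch; the point here is only to show what "synthetic" means for the minority classes.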

Label encoding and k-mer encoding are used to encode the DNA sequence.

In label encoding, each nucleotide in the DNA sequence is assigned an integer index (A = 1, C = 2, G = 3, T = 4).
In k-mer encoding, the raw DNA sequence is converted into an English-like sentence whose "words" are the k-mers (length-k substrings) of the sequence, so that natural language processing techniques can be used to classify it.
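The two encodings above can be sketched as follows. The index values come from the mapping given above; mapping any other character (such as the ambiguity code N) to 0, the default `k=3`, and the use of overlapping k-mers with stride 1 are assumptions made for illustration, not details taken from the paper.

```python
# Index values as listed above; 0 for unknown bases is our assumption.
LABELS = {"A": 1, "C": 2, "G": 3, "T": 4}


def label_encode(seq, unknown=0):
    """Map each nucleotide to its integer index."""
    return [LABELS.get(base, unknown) for base in seq.upper()]


def kmer_sentence(seq, k=3):
    """Turn a sequence into a space-separated 'sentence' of overlapping k-mers."""
    seq = seq.upper()
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


print(label_encode("ACGTN"))   # [1, 2, 3, 4, 0]
print(kmer_sentence("ATGCA"))  # ATG TGC GCA
```

The k-mer sentence can then be fed to any word-level tokenizer, which is what makes text-classification tooling applicable.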

Methodology:

CNN Model

Layer (type), Output Shape, Param #
Embedding, (None, 1000, 8), 128
Conv1D_1, (None, 1000, 128), 3200
MaxPooling1D_1, (None, 500, 128), 0
Conv1D_2, (None, 500, 64), 24640
MaxPooling_2, (None, 250, 64), 0
Flatten, (None, 16), 0
Dense1, (None, 128), 2176
Dense2, (None, 64), 8256
Dense3, (None, 6), 390
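As a sanity check on the table, the parameter counts can be reproduced with the standard formulas for embedding, 1-D convolution, and dense layers. The vocabulary size of 16 and kernel size of 3 below are not stated in the table; they are values we inferred because they reproduce the listed counts, so treat them as assumptions.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    # Weights plus one bias per filter.
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    # Weights plus one bias per output unit.
    return in_units * out_units + out_units


computed = {
    "Embedding": embedding_params(16, 8),    # assumed vocabulary size 16
    "Conv1D_1": conv1d_params(3, 8, 128),    # assumed kernel size 3
    "Conv1D_2": conv1d_params(3, 128, 64),   # assumed kernel size 3
    "Dense1": dense_params(16, 128),         # 16 inputs, per the Flatten row
    "Dense2": dense_params(128, 64),
    "Dense3": dense_params(64, 6),
}

# Counts as listed in the table above.
listed = {
    "Embedding": 128, "Conv1D_1": 3200, "Conv1D_2": 24640,
    "Dense1": 2176, "Dense2": 8256, "Dense3": 390,
}

for name, value in computed.items():
    print(name, value, "matches" if value == listed[name] else "differs")

# Flattening a (250, 64) feature map would normally give this many units.
flatten_units = 250 * 64
print("Flatten units implied by MaxPooling_2:", flatten_units)
```

One thing the arithmetic surfaces: the Dense1 count of 2176 is consistent with the listed Flatten width of 16, but flattening the (250, 64) output of the second pooling layer would give 16,000 units, so the table appears internally inconsistent at that row.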

CNN - LSTM:

An LSTM layer with 100 memory units is added after the convolutional layers. We also include dropout layers and regularization techniques to reduce overfitting.

CNN - Bidirectional LSTM:

This model uses the CNN for feature extraction and a bidirectional LSTM for classification.

We suspect that preprocessing the data could present difficulties, for example, using the SMOTE algorithm to generate samples. Moreover, it might be tricky to train the model because of the amount and size of the data under consideration. Thankfully, the model layers and their hyperparameters are discussed in the paper, which streamlines the implementation of the model for us somewhat.

Metrics: We plan to recreate the viral genome classification experiments from the paper. We will compare the accuracy and loss convergence of our implementation against those of previous approaches to this problem. Accuracy is an appropriate metric here, as our project is fundamentally a classification task. The authors of the paper aimed to create an architecture for DNA sequence classification that would allow the identification and classification of viruses. Through such a model, they hoped to provide a tool to help avoid outbreaks like COVID-19. They quantified their results by measuring each model's accuracy, precision, recall, F1 score, sensitivity, and specificity on a test dataset.
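For reference, the metrics listed above can be computed per class in a one-vs-rest fashion from the true and predicted labels. This is a minimal sketch with hypothetical function names and toy labels; it is not the paper's evaluation code. Note that sensitivity is the same quantity as recall.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, labels):
    """One-vs-rest precision, recall (sensitivity), specificity, and F1."""
    results = {}
    pairs = list(zip(y_true, y_pred))
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[c] = {
            "precision": precision,
            "recall": recall,  # also called sensitivity
            "specificity": specificity,
            "f1": f1,
        }
    return results


# Toy labels for illustration only.
y_true = ["covid", "covid", "sars", "mers", "sars", "mers"]
y_pred = ["covid", "sars", "sars", "mers", "sars", "covid"]
print(accuracy(y_true, y_pred))
print(per_class_metrics(y_true, y_pred, ["covid", "sars", "mers"]))
```

Reporting these per class matters with imbalanced data, since a high overall accuracy can hide poor recall on a minority virus.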

Project Goals:

Base: Implement CNN with label encoding with sufficient accuracy for two viruses.
Target: CNN, CNN with LSTM, CNN with bidirectional LSTM, each with label encoding, with sufficient accuracy for all six viruses.
Stretch: CNN, CNN with LSTM, CNN with bidirectional LSTM, each with label encoding and k-mer encoding, with sufficient accuracy for all six viruses and potentially also other viruses not in the paper. We would also like to calculate other metrics, such as precision and the F1 score, that are considered in the paper.

Ethics: Deep learning is indeed a good approach to this problem. As the paper discusses in its introduction, “[A]s the complexity of the data increases, the manual feature selection may lead to many problems like selecting features that do not lead to the best solution or missing out on essential features. Automatic feature selection can be used to overcome this issue. CNN is one of the best deep-learning techniques used to extract key features from the raw dataset.” It is always important to consider whether deep learning is an appropriate remedy for a certain task, and it certainly seems to be the case here. Still, false positives and negatives have potentially life-threatening implications in the world of medicine and are important to consider when we are measuring our error/success. An incorrect classification could result in a patient not receiving the proper treatment, thus prolonging the illness.

Tentative Division of Labor (Liable to Change):

Preprocessing (numpy) - (LEAD) Gabriel, (HELP) Samuel
Architecture (tf, keras) - (LEAD) Liam, (HELP) Jacques
Compiling Results (tf, matplotlib) - (LEAD) Gabriel
Write-Up (LaTeX) - (LEAD) Samuel, (HELP) Liam
Poster/Oral (InDesign, LaTeX) - Samuel, Jacques

---------> Would be happy to take the lead on either Preprocessing or Compiling Results, so that each of us is leading one technical aspect of the project -JACQUES

Reflection

Challenges: We have hit a roadblock in interpreting the preprocessing of the original paper. We assumed the authors utilized an existing data set, but it actually seems that they compiled their data set from a database that is constantly being updated. Therefore, it is impossible to exactly match our data set of DNA sequences to that in the paper. Moreover, the authors used a sequential encoding, but that erroneously introduces a notion of ordering that we do not want, so we are using one-hot encoding instead. Also, we are having issues matching our implementation to that of the authors. The paper lists the output shapes for each layer and the number of parameters. We have managed to have the output shapes of both implementations align, but the number of parameters sometimes differs. We think perhaps there are typos, but we need to investigate this further.
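For concreteness, the one-hot encoding mentioned above can be sketched as follows. Each nucleotide becomes a 4-element indicator vector, so no base is numerically "larger" than another. The A, C, G, T column order and the all-zeros row for any other character (such as N) are our own assumptions for this illustration.

```python
BASES = "ACGT"  # assumed column order


def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element indicator rows."""
    # Characters outside BASES (e.g. N) become an all-zeros row (assumption).
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


print(one_hot_encode("ACGN"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```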

Insights: Unfortunately, we don’t have insights about our model’s performance, but we have some troubling insights into the paper and its methodology. We are concerned the model might simply be learning to predict which virus a DNA sequence belongs to from its length alone. DNA sequences for a virus are submitted to the database in groups, and the groups often have very similar lengths. Moreover, the paper has only three citations, one of which is in a paper that uses deep learning to surveil migratory birds, which is quite unrelated to the topic of the original paper.

Plan: We are thinking of creating a baseline model that looks solely at the lengths of the sequences to guess which virus each sequence belongs to. We still hope to re-implement the paper, but we might instead focus on testing the validity of the methodology. We need to dedicate more time toward understanding whether the paper’s overall implementation is valid or not. We have definitely been second-guessing the validity of its implementation, but want to verify whether our suspicions are well founded, considering none of us have a background in genomics (and presumably less experience in deep learning compared to those publishing novel research in the field). If it turns out that our suspicions regarding prediction based on DNA length alone are correct, we plan to circumvent this flaw in one of the following ways:

Filter the data by range (e.g., select only genomes between 10,000 and 12,000 nucleotides long): This could be problematic given that the range we choose might contain predominantly genomes from a single virus.
Remove part of the sequences to make them all the same size: This could be problematic because DNA sequences for viruses are circular, and thus we are not quite sure where the sequences start in the database or whether there has been any standardization of that starting point. Moreover, could removing some data from each sequence potentially interfere with our accuracy down the line?
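A length-only baseline and the range filter described above could look roughly like this. The nearest-mean-length rule is just one simple choice of baseline that we are assuming for illustration; the function names and toy data are hypothetical.

```python
from collections import defaultdict


def fit_length_baseline(sequences, labels):
    """Return the mean sequence length for each virus label."""
    lengths = defaultdict(list)
    for seq, label in zip(sequences, labels):
        lengths[label].append(len(seq))
    return {label: sum(v) / len(v) for label, v in lengths.items()}


def predict_length_baseline(mean_lengths, sequence):
    """Predict the label whose mean length is closest to this sequence's length."""
    n = len(sequence)
    return min(mean_lengths, key=lambda label: abs(mean_lengths[label] - n))


def filter_by_length(sequences, labels, lo, hi):
    """Keep only (sequence, label) pairs with lo <= length <= hi."""
    return [(s, l) for s, l in zip(sequences, labels) if lo <= len(s) <= hi]


# Toy data: lengths stand in for real genome lengths.
train_seqs = ["A" * 10, "A" * 12, "C" * 30, "C" * 34]
train_labels = ["dengue", "dengue", "sars", "sars"]
means = fit_length_baseline(train_seqs, train_labels)
print(predict_length_baseline(means, "G" * 13))  # dengue
print(predict_length_baseline(means, "G" * 29))  # sars
```

If a baseline like this gets close to the neural models' accuracy on held-out data, that would support the suspicion that length alone carries most of the signal; after filtering, checking the label counts in the kept pairs would show whether a chosen range is dominated by one virus.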