CNN-based Analysis of DNA Sequence Classification

Reflections attached as links to Google Docs down below.

Title

CNN-Based Analysis of DNA Sequence Classification

Who

Group Members: Eurie, Logan, Imran

Introduction

We are addressing the challenge of accurately classifying DNA sequences, a pivotal task in computational biology with implications for understanding genetic information, disease diagnosis, and treatment planning. Implementing the methodology discussed by the authors in the article, we aim to replicate and extend their CNN-based approach using TensorFlow to attempt to achieve superior performance in classifying DNA sequences sourced from the UC Irvine Machine Learning Repository. This project is a classification problem, aiming to categorize DNA sequences as exon-intron or intron-exon boundaries, or to determine if there are no boundaries.

Related Work

Our project draws inspiration from the paper "Analysis of DNA Sequence Classification Using CNN and Hybrid Models" by Hemalatha Gunasekaran et al. In this paper, they demonstrated the effective use of CNN, CNN-LSTM, and CNN-Bidirectional LSTM architectures for DNA sequence classification, achieving high accuracy with their models. However, this paper utilizes multiple architectures, and for the scope of this project, we would like to focus on the CNN and CNN-LSTM models. Nonetheless, this paper is important for our methodology, as it will guide our architectural decisions and benchmarking. For further model comparisons and advancements in DNA sequence classification, we also reviewed related literature, ensuring a more comprehensive understanding of current methodologies.

Data

The data for this project is sourced from the UCI Machine Learning Repository, specifically focusing on splice junction DNA sequence data. The dataset's size and complexity require some preprocessing, including sequence encoding and normalization, to be used in our CNN model.

Methodology

Our approach centers around using Convolutional Neural Networks (CNN) with TensorFlow to classify DNA sequences. Given our previous experience of using CNNs in extracting features from complex datasets, we will employ a similar sequence to that outlined by Gunasekaran et al., with potential adjustments and hyperparameter optimizations to enhance model performance. Key aspects include data preprocessing techniques such as label and K-mer encoding and the exploration of CNN architecture variations for optimal classification accuracy.

Metrics

Success in our project will be quantitatively measured through:

Base Goal: Replicate the existing accuracy levels achieved in the referenced paper.
Target Goal: Improve accuracy by at least 5% through hyperparameter tuning and architectural adjustments
Stretch Goal: Integrate additional data sources or sequence types, aiming for a generalized model with broader applicability.
Accuracy, precision, recall, and F1 score will serve as our primary metrics, aligning with the standards set in the referenced paper.

Ethics

Deep Learning as a Suitable Approach: This project exemplifies deep learning's capability to handle large-scale, complex datasets, offering significant potential for biomedical advancements. However, we recognize the importance of careful model interpretation and the potential for overfitting.
Data Representativeness and Bias: While our data is sourced from a reputable national database, we acknowledge potential biases in data collection and representation, underscoring the need for diverse data to ensure model generalizability across different populations.