GenoNet: Cell type annotation on scRNA-seq data

Single-cell RNA sequencing (scRNA-seq) has transformed genomics by enabling transcriptome analysis at the individual cell level, providing detailed insights into cellular composition, gene expression variations, and dynamic transcriptional processes. Cell type annotation, which categorizes cells based on their gene expression profiles, is essential but has traditionally relied on expensive experimental methods. Computational techniques have emerged, offering both unsupervised and supervised approaches, with supervised methods demonstrating superior accuracy and scalability. However, their performance hinges on feature selection, model choice, and reference datasets. To address these challenges, this study proposes enhancing Cellcano with a conditional variational autoencoder to balance reference data, aiming to improve the precision and efficiency of cell type annotation in scRNA-seq datasets.

Final Submissions

Final Writeup

Github Repository

Presentation Slides

Updates

Check-in #3

Check-in #2

Introduction

Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of genomics by enabling transcriptome analysis at the individual cell level [1, 2, 3]. Unlike bulk RNA sequencing, which averages gene expression across many cells, scRNA-seq provides detailed insights into the cellular composition of complex tissues, variations in gene expression across individual cells, and dynamic transcriptional processes [4, 5].

Identifying and categorizing individual cells based on their gene expression profiles, a process known as "cell type annotation," is a fundamental and critical task in scRNA-seq studies. Traditionally, cell type annotation was facilitated by experimental techniques such as immunochemistry, fluorescence-activated cell sorting, and morphological methods, which are expensive and impractical for large-scale studies [6]. The advent of computational methods has led to the development of both unsupervised and supervised techniques for cell type annotation, enhancing the precision and efficiency of these analyses [7].

Benchmark studies have shown that supervised cell type annotation methods outperform unsupervised approaches in accuracy, robustness, and scalability [8, 9]. These methods leverage well-annotated reference datasets to train classifiers that can then predict cell types in new datasets. Despite their advantages, the performance of supervised methods heavily depends on the choice of features, the prediction models, and the reference datasets used. To address these limitations, this study proposes the enhancement of Cellcano [10] through the integration of advanced deep learning techniques, adapting it for applications to scRNA-seq data. This approach seeks to improve the accuracy and robustness of cell type annotation in scRNA-seq.

Related Work

Cellcano [10] is a computational method for annotating cell types in single-cell ATAC sequencing (scATAC-seq) data. It uses a two-phase supervised learning approach. In the first phase, Cellcano trains a multi-layer perceptron (MLP) on a reference dataset and uses it to predict cell types in the target dataset. From these predictions, it identifies well-predicted target cells with relatively low entropies (termed anchors), which are then used to construct a new self-reference. In the second phase, Cellcano trains a self-Knowledge Distiller model (KD model) on the anchor cells, using their predicted pseudo labels, to update cell types for the remaining non-anchor cells. However, the performance of Cellcano is significantly influenced by the degree of imbalance in the distribution of cell types and by the selection of anchors.

Seurat [11] also uses anchors to relate reference and query datasets, but defines them as matched pairs of cells that encode cross-dataset cellular relationships. It first reduces the dimensionality of the datasets jointly using diagonalized CCA, then applies L2-normalization to the canonical correlation vectors. Next, it searches for mutual nearest neighbors (MNN) in this shared low-dimensional representation to find anchors.
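
The mutual nearest neighbor search described above can be sketched as follows. This is a minimal illustration, not Seurat's implementation: it assumes the two datasets have already been projected into a shared low-dimensional space (the CCA step is skipped), that no cell has an all-zero embedding, and the function name `mutual_nearest_neighbors` and the parameter `k` are hypothetical.

```python
import numpy as np

def mutual_nearest_neighbors(ref, query, k=5):
    """Return (ref_index, query_index) pairs that are mutual k-nearest neighbors.

    `ref` and `query` are (cells x dims) embeddings in a shared space
    (assumed to be given; the joint dimensionality reduction is not shown).
    """
    # L2-normalize each cell's embedding vector.
    ref_n = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query, axis=1, keepdims=True)

    # Pairwise squared Euclidean distances, shape (n_ref, n_query).
    dist = ((ref_n[:, None, :] - query_n[None, :, :]) ** 2).sum(axis=2)

    # k nearest query cells for every reference cell, shape (n_ref, k).
    ref_to_query = np.argsort(dist, axis=1)[:, :k]
    # k nearest reference cells for every query cell, shape (n_query, k).
    query_to_ref = np.argsort(dist, axis=0)[:k, :].T

    pairs = []
    for i in range(ref.shape[0]):
        for j in ref_to_query[i]:
            # Keep the pair only if the neighbor relation holds in both directions.
            if i in query_to_ref[j]:
                pairs.append((i, int(j)))
    return pairs
```

The dense distance matrix keeps the sketch short but would not scale to large datasets, where an approximate nearest neighbor index would be the more realistic choice.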

Data

We aim to analyze several datasets derived from human peripheral blood mononuclear cells (PBMC) sequenced on the 10X platform. We will first focus on the “FACS” dataset [13], which comprises cells sorted using fluorescence-activated cell sorting (FACS) and is widely considered high-quality data. Additionally, we may explore other datasets related to COVID-19 research.

The “FACS” dataset includes 92,636 cells with 32,738 features. We will adhere to established protocols outlined in the literature to preprocess the data, which may include filtering out low-quality cells and genes, normalization, and feature selection. For other datasets, such as the “Covid CN” dataset [14] containing over one million cells, we may opt for downsampling to create a more manageable subset.
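
As a rough illustration of the preprocessing steps listed above (filtering, normalization, feature selection), a minimal NumPy sketch is given below. The thresholds, the variance-based gene ranking, and the function name `preprocess_counts` are assumptions made for illustration rather than the protocol we will necessarily follow, and the dense-array form is only practical for small subsets; the full datasets would need sparse matrices.

```python
import numpy as np

def preprocess_counts(counts, min_genes=200, min_cells=3,
                      target_sum=1e4, n_top_genes=2000):
    """Filter, normalize, and select features from a (cells x genes) count matrix.

    All thresholds are illustrative defaults, not values from the source.
    Assumes min_genes >= 1 so that every retained cell has a nonzero library size.
    """
    counts = np.asarray(counts, dtype=float)

    # Drop low-quality cells: keep cells that express at least `min_genes` genes.
    cell_mask = (counts > 0).sum(axis=1) >= min_genes
    counts = counts[cell_mask]

    # Drop rarely detected genes: keep genes seen in at least `min_cells` cells.
    gene_mask = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, gene_mask]

    # Library-size normalization followed by a log1p transform.
    library_size = counts.sum(axis=1, keepdims=True)
    normalized = np.log1p(counts / library_size * target_sum)

    # Feature selection: keep the genes with the highest variance.
    n_keep = min(n_top_genes, normalized.shape[1])
    top = np.sort(np.argsort(normalized.var(axis=0))[::-1][:n_keep])

    # Map selected columns back to indices in the original gene axis.
    gene_index = np.flatnonzero(gene_mask)[top]
    return normalized[:, top], cell_mask, gene_index
```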

Methodology

We apply the Cellcano framework and make several improvements with deep learning methods to identify cell types from single-cell sequencing data. First, we train a multi-layer perceptron (MLP) on the reference FACS dataset and predict cell types in the target data. Based on the prediction results, we select target cells deemed well-predicted (referred to as anchors) to form a new training set. Next, we use sampling methods and data augmentation to create balanced data for second-stage training. Finally, we train a KD model on the balanced anchor cells, using their predicted pseudo labels, to predict cell types for the remaining non-anchor cells.

Sampling

Sampling methods can be used to reconstruct the reference data and mitigate the imbalance among cell types. General sampling methods include subsampling majority cell types and oversampling minority cell types. Subsampling reduces the number of instances in overrepresented cell types to balance the dataset. Oversampling increases the number of instances in underrepresented cell types, possibly using SMOTE (Synthetic Minority Over-sampling Technique) tailored for single-cell data to create synthetic samples based on the feature space of the RNA-seq data. We apply the sampling methods to balance the reference data before Cellcano's first round of supervised learning, which may also allow a wider variety of cells to be selected as anchors.
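
The simplest form of this balancing can be sketched as below: each cell type is resampled to a common size, subsampling without replacement when the type is overrepresented and oversampling with replacement otherwise. Note that this is plain random resampling, not SMOTE, and that the function name `balance_by_resampling` and the parameter `n_per_class` are hypothetical.

```python
import numpy as np

def balance_by_resampling(X, y, n_per_class, seed=0):
    """Resample (X, y) so that every cell type has exactly `n_per_class` cells."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    chosen = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        # Majority types are subsampled without replacement;
        # minority types are oversampled with replacement (duplicated cells).
        replace = len(members) < n_per_class
        chosen.append(rng.choice(members, size=n_per_class, replace=replace))
    idx = np.concatenate(chosen)
    return X[idx], y[idx]
```

Choosing `n_per_class` somewhere between the smallest and largest cell type sizes trades off information lost from subsampling against the duplication introduced by oversampling.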

Data Augmentation

In addition to sampling methods, we also explore deep learning models that generate artificial data for training during the second stage. We plan to use a variational autoencoder, and possibly generative adversarial networks (GANs), for this augmentation.
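
For reference, a conditional variational autoencoder is commonly trained by maximizing a label-conditioned evidence lower bound. The form below is the standard textbook objective, written under our own notational assumptions: $x$ is a cell's expression vector, $y$ its cell type label, $z$ the latent variable, $q_\phi$ the encoder, and $p_\theta$ the decoder. It is not necessarily the exact objective our model will use.

```latex
\log p_\theta(x \mid y) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(x \mid z, y)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z \mid x, y)\,\|\,p(z \mid y)\right)
```

A common simplification is to take the prior $p(z \mid y) = \mathcal{N}(0, I)$. After training, synthetic cells of an underrepresented type $y$ can be generated by sampling $z$ from the prior and decoding with $p_\theta(x \mid z, y)$.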

Anchor Selection

The original Cellcano framework selects anchor cells based on entropies. Different cutoff quantiles on the entropy yield anchor sets of different sizes, which in turn changes the final accuracy of the KD model. In the original Cellcano paper [10], the authors tried cutoffs in {0.1, 0.2, ..., 0.6}, corresponding to selecting {10%, 20%, ..., 60%} of the cells in the target dataset. They reported poor performance at a cutoff quantile of 0.1 but relatively stable performance when the cutoff quantile is over 0.2, in both human PBMC and mouse brain cell typing tasks. For our task, we will tune this cutoff by grid search.
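
Entropy-based anchor selection can be sketched as follows. This is a minimal illustration under our own assumptions, not Cellcano's code: `probs` is taken to be the first-stage MLP's softmax output on the target cells, and the function name `select_anchors` is hypothetical.

```python
import numpy as np

def select_anchors(probs, quantile=0.4):
    """Pick low-entropy target cells as anchors.

    `probs` is an (n_cells x n_types) array of predicted class probabilities.
    Cells whose prediction entropy is at or below the given quantile of all
    entropies are marked as anchors.
    """
    probs = np.asarray(probs, dtype=float)
    eps = 1e-12  # avoids log(0) for probabilities that are exactly zero
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)

    cutoff = np.quantile(entropy, quantile)
    anchor_mask = entropy <= cutoff

    # Pseudo labels are the most probable cell type for each cell.
    pseudo_labels = probs.argmax(axis=1)
    return anchor_mask, pseudo_labels, entropy
```

A grid search over the cutoff would then call `select_anchors` once per candidate quantile (for example 0.1 through 0.6), retrain the second-stage model on each anchor set, and compare the resulting accuracy.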

Metrics

Given that our primary focus is on cell type annotation, a classic classification task, accuracy serves as a suitable metric for evaluating our model. Additionally, we can employ the F1 score, which combines precision and recall, to assess performance comprehensively. Moreover, to delve deeper into class-specific predictions, we may examine class-level accuracy and the macro F1 score, which offers an unweighted average of class-specific F1 scores.
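
These metrics can be computed directly from the true and predicted labels, as in the sketch below. The function name `classification_metrics` is hypothetical, and class-level accuracy is interpreted here as the fraction of cells of each true type that are labeled correctly (per-class recall), which is an assumption on our part.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Overall accuracy, per-class accuracy, per-class F1, and macro F1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))

    accuracy = float(np.mean(y_true == y_pred))
    class_accuracy, class_f1 = {}, {}
    for label in labels:
        tp = int(np.sum((y_true == label) & (y_pred == label)))
        fp = int(np.sum((y_true != label) & (y_pred == label)))
        fn = int(np.sum((y_true == label) & (y_pred != label)))
        # Per-class accuracy: share of this type's cells predicted correctly.
        # Undefined (NaN) for a label that only appears in the predictions.
        class_accuracy[str(label)] = tp / (tp + fn) if tp + fn > 0 else float("nan")
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because
        # every label occurs in y_true or y_pred.
        class_f1[str(label)] = 2 * tp / (2 * tp + fp + fn)

    # Macro F1 is the unweighted mean of the class-specific F1 scores.
    macro_f1 = float(np.mean(list(class_f1.values())))
    return {"accuracy": accuracy, "class_accuracy": class_accuracy,
            "class_f1": class_f1, "macro_f1": macro_f1}
```

Because macro F1 weights every cell type equally, it is more sensitive than overall accuracy to errors on rare cell types, which is the case the balancing steps are meant to address.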

At this stage, our primary objective is to assess the feasibility of jointly applying Ma's and Khan's methodologies [10, 12] within the context of scRNA-seq. Specifically, we aim to investigate whether Ma's approach translates well to scRNA-seq and whether incorporating Khan's method could enhance performance further. Our specific targets include investigating alternative approaches for defining anchor cells and identifying high-confidence cells, and exploring novel methods for leveraging predictions from the initial iteration, such as self-referencing using pseudo labels and data integration. Our stretch goal is to compare the efficacy of our proposed model against existing methodologies.

Ethics

Why is Deep Learning a good approach to this problem?

Deep learning fits this problem for several reasons. First, scRNA-seq data is high-dimensional: the FACS dataset, for example, has nearly 100,000 cells and more than 30,000 genes. Second, deep learning models can capture complex, non-linear relationships in data, especially when the number of features is large; such structure can be hard to model with traditional statistical methods. Moreover, deep learning models have shown their potential in scRNA-seq for various tasks, including cell type annotation, clustering, and disease diagnosis. Thus, deep learning is a good approach for our project.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

The possible major “stakeholders” are researchers and health providers working with genomic data. They may conduct downstream analysis based on the cell type annotation results. For example, researchers may screen for marker genes for a specific cell type after applying our method to their dataset. Health providers may use our method to identify cells related to diseases for diagnosis. If our algorithm makes mistakes, one direct consequence is that our cell annotation on the target dataset is incorrect. As a result, the following analysis would be biased, leading to false discovery of marker genes and false diagnosis.

Team Members

Xin Wei, xwei13

Haiyue Song, hsong57

Han Ji, hji19

Lanyu Zhang, lzhang27

Division of labor

Xin, Haiyue, Han, and Lanyu: collaborate on data preprocessing, model training, model evaluation, presentation and documentation.

References

[1] Junyue Cao, Malte Spielmann, Xiaojie Qiu, Xingfan Huang, Daniel M Ibrahim, Andrew J Hill, Fan Zhang, Stefan Mundlos, Lena Christiansen, Frank J Steemers, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature, 566(7745):496-502, 2019.

[2] Mariona Nadal-Ribelles, Saiful Islam, Wu Wei, Pablo Latorre, Michelle Nguyen, Eulalia de Nadal, Francesc Posas, and Lars M Steinmetz. Sensitive high-throughput single-cell RNA-seq reveals within-clonal transcript correlations in yeast populations. Nature Microbiology, 4(4):683-692, 2019.

[3] Pingjian Yu and Wei Lin. Single-cell transcriptome study as big data. Genomics, Proteomics and Bioinformatics, 14(1):21-30, 2016.

[4] A Haque, J Engel, SA Teichmann, and T Lonnberg. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Medicine, 9(1):75, 2017.

[5] Byungjin Hwang, Ji Hyun Lee, and Duhee Bang. Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental & Molecular Medicine, 50(8):1-14, 2018.

[6] Xinlei Zhao, Shuang Wu, Nan Fang, Xiao Sun, and Jue Fan. Evaluation of single-cell classifiers for single-cell RNA sequencing data sets. Briefings in Bioinformatics, 21(5):1581-1595, 2020.

[7] Luke Zappia, Belinda Phipson, and Alicia Oshlack. Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database. PLoS Computational Biology, 14(6):e1006245, 2018.

[8] Wenjing Ma, Kenong Su, and Hao Wu. Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction. Genome Biology, 22:1-23, 2021.

[9] Xiaobo Sun, Xiaochu Lin, Ziyi Li, and Hao Wu. A comprehensive comparison of supervised and unsupervised methods for cell type identification in single-cell RNA-seq. Briefings in Bioinformatics, 23(2):bbab567, 2022.

[10] Wenjing Ma, Jiaying Lu, and Hao Wu. Cellcano: supervised cell type identification for single cell ATAC-seq data. Nature Communications, 14(1):1864, 2023.

[11] Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive integration of single-cell data. Cell, 177(7):1888-1902, 2019.

[12] Sumeer Ahmad Khan, Alberto Maillo, Vincenzo Lagani, Robert Lehmann, Narsis A Kiani, David Gomez-Cabrero, and Jesper Tegner. Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers. Nature Machine Intelligence, 5(12):1437-1446, 2023.

[13] Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1):14049, 2017.

[14] Xianwen Ren, Wen Wen, Xiaoying Fan, Wenhong Hou, Bin Su, Pengfei Cai, Jiesheng Li, Yang Liu, Fei Tang, Fan Zhang, et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell, 184(7):1895-1913, 2021.