
Reflection 11/30 - Nitya Thakkar (nthakka3) and Madeline Hughes (mhughe10)

Introduction: This can be copied from the proposal.

We are implementing an existing paper whose goal is to identify cell types in single-cell data using gene expression data. This is a classification problem, and the paper's authors solve it using a Graph Convolutional Network (GCN). We chose this paper because we are both interested in computational biology and thought GCNs were really interesting (and not something we talked about in class).

Challenges: What has been the hardest part of the project you’ve encountered so far?

Preprocessing the data has been really difficult for us. It was hard to find a dataset that worked (and that was different from the one used in the paper). After finding it, there were many steps we had to take:

- Remove unlabeled cells and cells labelled as debris and doublets.
- Remove genes with zero expression values across all cells.
- Transform gene expression values into log scale and normalize each dataset by min–max scaling.
- After calculating the variances of the genes across all the cells, sort the variances in descending order and choose the top 1000 genes as the input of the classifiers.
- Construct the gene adjacency network from the selected genes.

We have not fully finished this yet, and after we complete it we next need to construct the gene adjacency matrix as follows:

- Choose the top N genes with the highest variances in expression values for training.
- The matrix is N x N (N = number of genes).
- Elements in the matrix represent the confidence scores between pairs of genes, extracted from the gene–gene interaction database.
- Normalize the weights by row sums.
- Use this to build a weighted graph where nodes are genes, edges represent the connections between genes, and the normalized confidence scores are the edge weights.

We anticipate this will be the most challenging part of our project.

Insights: Are there any concrete results you can show at this point? How is your model performing compared with expectations?

We unfortunately haven’t gotten to this point yet since we’ve been stuck on data pre-processing, but hope to have results soon (our goal is by the end of this week).
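
For reference, the preprocessing and adjacency-matrix steps listed under Challenges can be sketched roughly as follows. This is only a minimal illustration with NumPy, not the project's actual code. The function names `preprocess` and `build_adjacency`, the label strings used to drop cells, the use of `log1p` for the log transform, min–max scaling over the whole dataset rather than per gene, and the dictionary format for the gene–gene confidence scores are all assumptions made for the example.

```python
import numpy as np


def preprocess(expr, labels, n_top=1000,
               drop_labels=("unlabeled", "debris", "doublets")):
    """Filter, log-transform, scale, and select high-variance genes.

    expr:   cells x genes array of non-negative expression values
    labels: one label string per cell (label names here are hypothetical)
    Returns the reduced matrix, the kept labels, and the original column
    indices of the selected genes.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)

    # Remove unlabeled cells and cells labelled as debris or doublets.
    keep_cells = ~np.isin(labels, drop_labels)
    expr, labels = expr[keep_cells], labels[keep_cells]

    # Remove genes with zero expression across all remaining cells.
    keep_genes = np.flatnonzero(expr.sum(axis=0) > 0)
    expr = expr[:, keep_genes]

    # Log scale (log1p is an assumption so that zeros stay finite),
    # then min-max scale the whole dataset into [0, 1].
    expr = np.log1p(expr)
    lo, hi = expr.min(), expr.max()
    if hi > lo:
        expr = (expr - lo) / (hi - lo)

    # Sort gene variances in descending order and keep the top n_top genes.
    variances = expr.var(axis=0)
    top = np.argsort(variances)[::-1][:n_top]
    return expr[:, top], labels, keep_genes[top]


def build_adjacency(gene_ids, scores):
    """Build a row-normalized N x N gene adjacency matrix.

    gene_ids: list of the N selected gene identifiers
    scores:   dict mapping (gene_a, gene_b) -> confidence score, standing in
              for a lookup into a gene-gene interaction database
    """
    n = len(gene_ids)
    index = {g: i for i, g in enumerate(gene_ids)}
    adj = np.zeros((n, n))
    for (a, b), score in scores.items():
        if a in index and b in index:
            # Treating interactions as undirected is an assumption.
            adj[index[a], index[b]] = score
            adj[index[b], index[a]] = score

    # Normalize weights by row sums; genes with no interactions keep a zero row.
    row_sums = adj.sum(axis=1, keepdims=True)
    return np.divide(adj, row_sums, out=np.zeros_like(adj), where=row_sums > 0)
```

The returned matrix can then be read as a weighted graph: each row and column is a gene, and each nonzero entry is an edge weighted by its normalized confidence score.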

Plan: Are you on track with your project? What do you need to dedicate more time to? What are you thinking of changing, if anything?

We are running a bit behind because the data preprocessing is taking us so long. We are hopeful that once we are done with this, the rest of the modeling will go faster. Our goal is to finish data preprocessing this week so we can also create the model this week and have results by this weekend. We may have to change how we approach the model: we were hoping to modify it a bit if possible, but we may not have time to experiment with that.
