Inspiration
Cancer is heterogeneous at many levels, from variation in gene expression to differences in clinical presentation. To understand this variation and how these levels relate to one another, it is important to identify the biomarkers that drive the differences. Identifying the relevant features can offer valuable insight into potential therapeutic strategies and treatments.
What it does
This pipeline uses the TCGA-LIHC dataset from UCSC Xena to examine differences in gene expression between HBV-positive and non-viral cases. It trains classification algorithms to discriminate between the two sample types and identifies the genes whose expression varies most between them.
Process
The pipeline trains four classifiers on the gene expression data: logistic regression, random forest, and SVMs with linear and RBF kernels. Each is evaluated on a stratified 80/20 train/test split of the dataset.
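To illustrate what a stratified 80/20 split does, here is a minimal standard-library sketch. It is not the project's code: the function name `stratified_split`, the seed, and the toy labels are assumptions made for illustration. In practice the same effect is usually obtained with scikit-learn's `train_test_split(..., test_size=0.2, stratify=y)`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same
    proportion in both subsets. Hypothetical helper, for illustration only."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every class.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (made up, not TCGA-LIHC data): 1 = HBV-positive, 0 = non-viral.
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Stratifying matters when the classes are imbalanced: a purely random split could leave the test set with too few samples of the smaller class, which would make metrics such as F1 unstable.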
Results
Note: Further details are included in the report document.
On the whole, Logistic Regression provided the best overall performance:
- ROC-AUC: 0.698
- F1: 0.510
- Accuracy: 0.653
Several of the most influential genes identified by LR (DNER, TNFSF11, ZIC2) and RF (GULP1, APOA2, SPATA18) have previously been reported to be associated with HCC, which offers a further form of validation.
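For readers unfamiliar with the three metrics reported above, the following standard-library sketch shows how each is defined. The labels and scores are made-up toy values, not the project's predictions, and the function names and the 0.5 decision threshold are assumptions. scikit-learn provides equivalent functions (`accuracy_score`, `f1_score`, `roc_auc_score`).

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive sample is scored above a random
    negative one (ties count as half). Assumes both classes are present."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made up): 1 = HBV-positive, 0 = non-viral.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)
f1_score_value = f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(acc, f1_score_value, auc)
```

ROC-AUC is computed from the raw scores and does not depend on the threshold, whereas accuracy and F1 depend on the thresholded labels. This is one reason the three numbers can differ noticeably for the same model.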
Future Directions
I would like to incorporate additional clinical data into the training and testing processes to identify further factors that distinguish HBV-positive from non-viral cases.
Built With
- python
- tcga-lihc