Research Problem

Identify the genes most strongly associated with HCC and with HBV-HCC, and compare the two sets to see how they differ.

My code and the datasets

https://github.com/YcxBJ80/TurBiohacks_code_and_datasets.git

Datasets

GSE14520 and GSE121248: datasets that contain only HBV-HCC samples. Links: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse14520 and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121248

GSE25097: a dataset that contains only HCC samples. Link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25097

Processing the Data

  1. Extract the data from the txt files
  2. Convert it to csv files
  3. Convert the labels to numeric values (0 represents healthy or non-tumor, 1 represents HBV-HCC, 2 represents HCC)

This study uses three datasets: one with only HCC samples, one with only HBV-HCC samples, and a merged dataset that contains both.
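The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: it assumes the txt files are GEO series-matrix files (metadata lines start with `!`, and the expression table sits between `!series_matrix_table_begin` and `!series_matrix_table_end`). The function names, the `LABEL_MAP` keys, and the toy `EXAMPLE` text are all hypothetical.

```python
import csv
import io

# Label scheme from step 3; the string keys are assumed names for the sample types.
LABEL_MAP = {"non-tumor": 0, "HBV-HCC": 1, "HCC": 2}


def parse_series_matrix(text):
    """Return (sample_ids, rows) from the table block of a series-matrix file."""
    in_table = False
    table_lines = []
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            break
        elif in_table:
            table_lines.append(line)
    reader = csv.reader(table_lines, delimiter="\t")
    header = next(reader)          # first column is the probe ID, the rest are samples
    sample_ids = header[1:]
    rows = {r[0]: [float(v) for v in r[1:]] for r in reader if r}
    return sample_ids, rows


def to_labeled_csv(sample_ids, rows, sample_types):
    """Write one CSV row per sample: sample id, numeric label, then probe values."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    probes = list(rows)
    writer.writerow(["sample", "label"] + probes)
    for i, sid in enumerate(sample_ids):
        label = LABEL_MAP[sample_types[sid]]
        writer.writerow([sid, label] + [rows[p][i] for p in probes])
    return out.getvalue()


# Toy input standing in for a downloaded series-matrix txt file.
EXAMPLE = (
    "!Series_title\t\"toy\"\n"
    "!series_matrix_table_begin\n"
    "\"ID_REF\"\t\"GSM1\"\t\"GSM2\"\n"
    "\"probe_a\"\t1.5\t2.5\n"
    "\"probe_b\"\t3.0\t4.0\n"
    "!series_matrix_table_end\n"
)

ids, rows = parse_series_matrix(EXAMPLE)
csv_text = to_labeled_csv(ids, rows, {"GSM1": "non-tumor", "GSM2": "HBV-HCC"})
print(csv_text)
```

Transposing to one row per sample is a choice made here so that each row can carry its own label; the real pipeline may lay the data out differently.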

Machine Learning Models

This study implemented four commonly used ML classifiers: Random Forest, Logistic Regression, Support Vector Machine, and Decision Tree.
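A comparison of these four classifiers might look like the sketch below, written with scikit-learn. The data is synthetic (`make_classification`), and the hyperparameters, the 80/20 split, and the scaling step are assumptions for illustration; the settings actually used in the project are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an expression matrix (samples x genes) with 3 classes.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Linear and kernel models get a scaler in front; tree models do not need one.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit every model on the same split and record held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

best = max(scores, key=scores.get)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.4f}")
print("best:", best)
```

Using one shared, stratified split keeps the comparison fair, since every model is scored on exactly the same held-out samples.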

Machine Learning Performance

The dataset with both HBV-HCC and HCC samples

Random Forest achieved the best performance on this dataset.

Metrics

  • Accuracy: 0.9870
  • F1 Score: 0.9870
  • ROC-AUC: 0.9991
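For reference, here is one way these three metrics can be computed for a three-class problem with scikit-learn. The labels and probabilities are a tiny hand-made example, and the averaging choices (`average="weighted"` for F1, one-vs-rest macro averaging for ROC-AUC) are assumptions; the averaging used for the numbers above is not stated.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hand-made example: 0 = non-tumor, 1 = HBV-HCC, 2 = HCC.
y_true = np.array([0, 0, 1, 1, 2, 2])

# Predicted class probabilities, one row per sample (each row sums to 1).
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],  # true class 1, but class 0 has the highest probability
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
])
y_pred = y_proba.argmax(axis=1)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")

print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"ROC-AUC: {auc:.4f}")
```

Accuracy and F1 are computed from the hard predictions, while ROC-AUC is computed from how the probabilities rank the samples. In this toy case one sample is misclassified, yet within every class the true members still receive higher scores than the non-members, so ROC-AUC can sit above accuracy.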

Top 10 Predictive Features

Feature     Importance
CXCL14      0.008773
CAP2        0.008492
RACGAP1     0.008257
LIFR        0.007832
CFP         0.007159
SAE1        0.006903
ASPM        0.006617
FCN3        0.006325
TUBG1       0.006258
CLEC1B      0.006166
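A ranking like this can be produced from a fitted Random Forest, as in the sketch below. It assumes the scores come from scikit-learn's impurity-based `feature_importances_` attribute, which is not confirmed here; the data is random and the gene names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
genes = [f"GENE_{i}" for i in range(30)]  # placeholder gene names
X = rng.normal(size=(200, len(genes)))

# Make the label depend on two columns so that they should rank near the top.
y = (X[:, 3] + X[:, 7] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Sort feature indices by importance, highest first, and keep the top 10.
order = np.argsort(importances)[::-1][:10]
top10 = [(genes[i], float(importances[i])) for i in order]

for gene, score in top10:
    print(f"{gene} {score:.6f}")
```

These importances are normalized to sum to 1 across all features, so with thousands of genes even the top-ranked ones receive small values, which is consistent with scores under 0.01 in the table.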

The dataset with only HBV-HCC samples

Support Vector Machine achieved the best performance on this dataset.

Metrics

  • Accuracy: 0.9664
  • F1 Score: 0.9666
  • ROC-AUC: 0.9928

Top 10 Predictive Features

Feature     Importance
HBB         0.018545
HBA1        0.017628
NTS         0.013556
RPS4Y1      0.013492
XIST        0.013305
HAMP        0.012578
SPINK1      0.012172
PDZK1IP1    0.011213
ROBO1       0.011110
PPP1R3C     0.010240

The dataset with only HCC samples

Random Forest achieved the best performance on this dataset.

Metrics

  • Accuracy: 0.9911
  • F1 Score: 0.9911
  • ROC-AUC: 1.0

Top 10 Predictive Features

Feature     Importance
NEK2        0.012185
FCN2        0.012018
CLEC4G      0.011943
ECM1        0.011586
CLEC1B      0.010927
CDH19       0.010920
CXCL12      0.010765
PRC1        0.010759
RSPO3       0.010706
HMMR        0.009917

Citation

I used Cursor and ChatGPT to help me write the code, and Google to look up the background knowledge I was missing.
