Research Problem
Identify the genes most strongly associated with HCC and with HBV-HCC, and examine how the two sets differ.
My code and the datasets
https://github.com/YcxBJ80/TurBiohacks_code_and_datasets.git
Datasets
GSE14520 and GSE121248: datasets that contain only HBV-HCC samples. Links: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse14520 and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121248
GSE25097: dataset that contains only HCC samples. Link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25097
Processing the Data
- Extract the data from the txt files.
- Convert the data to csv files.
- Convert the labels to numeric values (0 = healthy or non-tumor, 1 = HBV-HCC, 2 = HCC).
This study uses three datasets: HCC only, HBV-HCC only, and a merged dataset that contains both.
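The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the tab-separated layout, the `label` column name, the label strings in `LABEL_CODES`, and the helper names `encode_label` and `txt_to_csv` are all assumptions made for the example. Only the numeric codes (0, 1, 2) come from the description above.

```python
import csv
import io

# Assumed label strings; the real GEO annotations may be worded differently.
# The numeric codes follow the scheme described above.
LABEL_CODES = {"non-tumor": 0, "healthy": 0, "hbv-hcc": 1, "hcc": 2}


def encode_label(raw):
    """Map a textual label to its numeric code (case-insensitive)."""
    key = raw.strip().lower()
    if key not in LABEL_CODES:
        raise ValueError(f"unknown label: {raw!r}")
    return LABEL_CODES[key]


def txt_to_csv(txt, delimiter="\t"):
    """Convert delimited text with a 'label' column into CSV text
    whose labels are numeric."""
    reader = csv.DictReader(io.StringIO(txt), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["label"] = encode_label(row["label"])
        writer.writerow(row)
    return out.getvalue()


# Tiny made-up input: one row per sample, one expression column, one label.
sample = "sample_id\tCXCL14\tlabel\nS1\t7.1\tnon-tumor\nS2\t3.4\tHBV-HCC\nS3\t2.9\tHCC\n"
csv_text = txt_to_csv(sample)
print(csv_text)
```

Raising an error on an unknown label, rather than silently assigning a default code, makes annotation mismatches between datasets visible before the merge.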
Machine Learning Models
This study implemented four commonly used ML classifiers: Random Forest, Logistic Regression, Support Vector Machine, and Decision Tree.
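A comparison of these four classifiers might look like the sketch below, assuming scikit-learn. The data here is synthetic (generated with `make_classification`), and the train/test split, hyperparameters, feature scaling, weighted F1 averaging, and one-vs-rest ROC-AUC are all assumptions for illustration; the actual settings used in this project are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an expression matrix: 300 samples, 50 "genes", 3 classes.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling is added for the two models that are sensitive to feature scale.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support Vector Machine": make_pipeline(
        StandardScaler(), SVC(probability=True, random_state=0)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)  # class probabilities for ROC-AUC
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, average="weighted"),
        "roc_auc": roc_auc_score(y_test, proba, multi_class="ovr"),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 4) for metric, value in scores.items()})
```

The numbers printed by this sketch describe the synthetic data only and are unrelated to the results reported below.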
Achieved Performances in Machine Learning
The dataset with both HBV-HCC and HCC samples
Random Forest achieved the best performance.
Metrics:
- Accuracy: 0.9870
- F1 Score: 0.9870
- ROC-AUC: 0.9991
Top 10 Predictive Features
| Feature | Importance |
|---|---|
| CXCL14 | 0.008773 |
| CAP2 | 0.008492 |
| RACGAP1 | 0.008257 |
| LIFR | 0.007832 |
| CFP | 0.007159 |
| SAE1 | 0.006903 |
| ASPM | 0.006617 |
| FCN3 | 0.006325 |
| TUBG1 | 0.006258 |
| CLEC1B | 0.006166 |
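A ranking like the one in this table can be produced by sorting per-feature importance scores, for example the `feature_importances_` array of a fitted scikit-learn `RandomForestClassifier`. The helper below is a hypothetical sketch (`top_k_features` is not a name from the repository); the four example scores are taken from the table above and deliberately listed out of order.

```python
def top_k_features(names, importances, k=10):
    """Return the k (name, importance) pairs with the highest importance,
    sorted from most to least important."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Four genes and their importances from the table, shuffled on purpose.
genes = ["LIFR", "CXCL14", "RACGAP1", "CAP2"]
scores = [0.007832, 0.008773, 0.008257, 0.008492]

top3 = top_k_features(genes, scores, k=3)
for gene, score in top3:
    print(f"| {gene} | {score:.6f} |")
```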
The dataset with only HBV-HCC samples
Support Vector Machine achieved the best performance.
Metrics:
- Accuracy: 0.9664
- F1 Score: 0.9666
- ROC-AUC: 0.9928
Top 10 Predictive Features
| Feature | Importance |
|---|---|
| HBB | 0.018545 |
| HBA1 | 0.017628 |
| NTS | 0.013556 |
| RPS4Y1 | 0.013492 |
| XIST | 0.013305 |
| HAMP | 0.012578 |
| SPINK1 | 0.012172 |
| PDZK1IP1 | 0.011213 |
| ROBO1 | 0.011110 |
| PPP1R3C | 0.010240 |
The dataset with only HCC samples
Random Forest achieved the best performance.
Metrics:
- Accuracy: 0.9911
- F1 Score: 0.9911
- ROC-AUC: 1.0000
Top 10 Predictive Features
| Feature | Importance |
|---|---|
| NEK2 | 0.012185 |
| FCN2 | 0.012018 |
| CLEC4G | 0.011943 |
| ECM1 | 0.011586 |
| CLEC1B | 0.010927 |
| CDH19 | 0.010920 |
| CXCL12 | 0.010765 |
| PRC1 | 0.010759 |
| RSPO3 | 0.010706 |
| HMMR | 0.009917 |
Citation
I used Cursor and ChatGPT to help write the code, and Google to look up the background knowledge I lacked.