Inspiration
Hepatitis B virus (HBV) is one of the leading causes of liver cancer worldwide, and distinguishing HBV-related hepatocellular carcinoma (HCC) from non-viral HCC matters when choosing targeted therapies. I wanted to build a clinically accessible tool that uses publicly available data (TCGA-LIHC) to show that even a small, interpretable set of genes can achieve strong classification performance.
What it does
HBVNet predicts whether a patient’s HCC is likely HBV-related or non-viral using:
- Clinical features (age, gender, tumor stage/grade)
- RNA-seq expression data
- A minimal 10–20 gene signature selected via LASSO for interpretability
It outputs:
- Predicted probability of HBV-related HCC
- Interactive Gradio app for instant predictions
- Figures (ROC curve, confusion matrix, feature importance)
How we built it
- Data Acquisition: Downloaded TCGA-LIHC clinical and expression data (FPKM-UQ).
- Preprocessing:
  - Harmonized patient barcodes, merged datasets
  - One-hot/ordinal encoded categorical variables
  - Imputed missing values
- Feature Engineering:
  - Selected top 500 most variable genes
  - Applied L1-regularized (LASSO) logistic regression to identify minimal gene set
- Modeling:
  - Balanced classes with SMOTE
  - Trained and tuned XGBoost models
  - Compared clinical-only, expression-only, combined, and minimal signature modalities
- Evaluation: Calculated ROC-AUC, F1 score, and accuracy; plotted the ROC curve and confusion matrix
- Deployment: Built an interactive Gradio app with pre-filled mean expression values
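The feature engineering step above (variance filter, then L1-regularized logistic regression) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the `select_signature` helper, the `log2(x + 1)` transform, the value of `C`, and the `GENE_i` names are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def select_signature(X, y, gene_names, n_variable=500, C=0.1, seed=0):
    """Hypothetical helper: variance filter followed by L1 logistic regression."""
    # Assumption: expression values are non-negative, so log2(x + 1) is defined.
    X_log = np.log2(np.asarray(X, dtype=float) + 1.0)

    # Keep the n_variable genes with the highest variance across samples.
    n_keep = min(n_variable, X_log.shape[1])
    top_idx = np.argsort(X_log.var(axis=0))[::-1][:n_keep]

    # Standardize so the L1 penalty treats all genes on the same scale.
    X_top = StandardScaler().fit_transform(X_log[:, top_idx])

    # The L1 penalty drives most coefficients to exactly zero;
    # a smaller C means a stronger penalty and a shorter signature.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=seed)
    clf.fit(X_top, y)

    kept = np.flatnonzero(clf.coef_[0])
    return [gene_names[top_idx[i]] for i in kept]


# Synthetic stand-in for an expression matrix (samples x genes).
rng = np.random.default_rng(0)
n_samples, n_genes = 120, 2000
gene_names = [f"GENE_{i}" for i in range(n_genes)]
y = rng.integers(0, 2, size=n_samples)
X = rng.gamma(shape=2.0, scale=50.0, size=(n_samples, n_genes))

# Make the first five genes informative: higher expression in class 1.
informative = gene_names[:5]
X[y == 1, :5] *= 8.0

signature = select_signature(X, y, gene_names)
print(len(signature), "genes kept:", signature[:10])
```

In practice `C` would be tuned (for example by cross-validation) until the signature lands in the desired 10–20 gene range, and the selection would be fit on training data only so the held-out evaluation stays honest.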
Challenges we ran into
- Data harmonization: Matching TCGA barcodes across clinical and expression datasets
- Class imbalance: Required SMOTE and careful evaluation using stratified splits
- Feature dimensionality: With more than 60,000 genes, we had to filter and regularize carefully
- Deployment: Ensuring model artifacts (pkl files) match the app’s expectations
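For the barcode matching problem, one approach is sketched below. TCGA sample barcodes start with a patient-level prefix (the first three dash-separated fields), and the first two characters of the fourth field encode the sample type (`01` for primary solid tumor, `11` for solid tissue normal). The toy tables, the column names, and the rule of keeping the first aliquot per patient are assumptions for illustration; the project's real tables may be laid out differently.

```python
import pandas as pd


def patient_id(barcode: str) -> str:
    """Patient-level prefix of a TCGA barcode, e.g. 'TCGA-AA-0001'."""
    return "-".join(barcode.split("-")[:3]).upper()


def is_primary_tumor(barcode: str) -> bool:
    """True when the sample-type code (fourth field) starts with '01'."""
    parts = barcode.split("-")
    return len(parts) >= 4 and parts[3][:2] == "01"


# Hypothetical toy tables; the IDs and values are made up.
clinical = pd.DataFrame(
    {
        "bcr_patient_barcode": ["TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"],
        "age": [61, 55, 70],
    }
)
expression = pd.DataFrame(
    {"GENE_A": [10.0, 3.0, 7.5], "GENE_B": [0.5, 2.0, 1.0]},
    index=[
        "TCGA-AA-0001-01A-11R-A131-07",  # tumor sample, patient 0001
        "TCGA-AA-0001-11A-11R-A131-07",  # matched normal, patient 0001
        "TCGA-AA-0002-01A-11R-A131-07",  # tumor sample, patient 0002
    ],
)

# Keep tumor samples only, then reduce to one row per patient.
mask = [is_primary_tumor(b) for b in expression.index]
tumor = expression.loc[mask].copy()
tumor["patient_id"] = [patient_id(b) for b in tumor.index]
tumor = tumor.drop_duplicates(subset="patient_id")  # assumption: first aliquot wins

# validate= makes pandas raise if either side has duplicate keys.
merged = clinical.merge(
    tumor,
    left_on="bcr_patient_barcode",
    right_on="patient_id",
    how="inner",
    validate="one_to_one",
)
print(merged)
```

The inner join silently drops patients that lack either clinical or expression data, so it is worth comparing `len(merged)` against both inputs before moving on.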
Accomplishments that we're proud of
- Built a reproducible ML pipeline in one weekend
- Achieved strong performance (accuracy ≈ 0.85 with minimal signature)
- Reduced to a compact gene signature with minimal loss of accuracy
- Deployed a working web app for real-time inference
- Generated a professional PDF report including plots + feature importance
What we learned
- Using feature selection to keep models interpretable
- Handling class imbalance in biomedical datasets
- Deploying ML models in a user-friendly interface with Gradio
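One deployment lesson, sketched in code: saving the expected feature order in the same file as the model lets the app check its inputs before predicting, which guards against the artifact mismatch described under challenges. This is an illustrative pattern, not the project's actual code. `save_bundle`, `predict_proba_from_inputs`, and the feature names are hypothetical, and a scikit-learn `LogisticRegression` stands in for the XGBoost model to keep the example small.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on random data (the real project used XGBoost).
feature_names = ["age", "GENE_A", "GENE_B"]
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, len(feature_names)))
y_train = (X_train[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)


def save_bundle(path, model, feature_names):
    """Store the model together with the feature order it was trained on."""
    joblib.dump({"model": model, "feature_names": list(feature_names)}, path)


def predict_proba_from_inputs(path, inputs):
    """Load the bundle, validate the input keys, and return P(class 1)."""
    bundle = joblib.load(path)
    expected = bundle["feature_names"]
    missing = [f for f in expected if f not in inputs]
    extra = [f for f in inputs if f not in expected]
    if missing or extra:
        raise ValueError(f"feature mismatch: missing={missing}, unexpected={extra}")
    # Reorder the inputs to match the training column order.
    row = np.array([[float(inputs[f]) for f in expected]])
    return float(bundle["model"].predict_proba(row)[0, 1])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bundle.pkl")
    save_bundle(path, model, feature_names)

    # Key order in the input dict does not matter; the bundle defines it.
    prob = predict_proba_from_inputs(path, {"GENE_B": 0.1, "age": 0.3, "GENE_A": 1.2})

    try:
        predict_proba_from_inputs(path, {"age": 0.3})
        mismatch_caught = False
    except ValueError:
        mismatch_caught = True

print(f"probability={prob:.3f}, mismatch_caught={mismatch_caught}")
```

The same check can sit directly inside the function a Gradio interface calls, so a stale or mismatched `.pkl` file fails loudly instead of producing a silently wrong probability.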
What's next for HBVNet
- Prospective use: Deploy as a decision support tool for clinicians
- Expand modalities: Integrate mutation data & survival analysis
- Productionization: Wrap as a REST API or Streamlit dashboard for wider access
Built With
- colab
- github
- gradio
- imbalanced-learn
- joblib
- matplotlib
- numpy
- pandas
- python
- reportlab
- scikit-learn
- xgboost