Inspiration

Hepatitis B virus (HBV) is one of the leading causes of liver cancer worldwide, and distinguishing HBV-related hepatocellular carcinoma (HCC) from non-viral HCC matters for choosing targeted therapies. I wanted to build an accessible tool that leverages publicly available data (TCGA-LIHC) to show that even a small, interpretable set of genes can achieve strong classification performance.

What it does

HBVNet predicts whether a patient’s HCC is likely HBV-related or non-viral using:

  • Clinical features (age, gender, tumor stage/grade)
  • RNA-seq expression data
  • A minimal 10–20 gene signature selected via LASSO for interpretability

It outputs:

  • Predicted probability of HBV-related HCC
  • Interactive Gradio app for instant predictions
  • Figures (ROC curve, confusion matrix, feature importance)
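
As a rough illustration of how a confusion matrix and a ROC-AUC value can be derived from predicted probabilities, here is a dependency-free sketch. The labels and probabilities are made-up values, not HBVNet results, and the 0.5 decision threshold is an assumption.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) after thresholding the probabilities."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_auc(y_true, y_prob):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [p for label, p in zip(y_true, y_prob) if label == 1]
    neg = [p for label, p in zip(y_true, y_prob) if label == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = HBV-related, 0 = non-viral.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

tp, fp, tn, fn = confusion_counts(y_true, y_prob)
accuracy = (tp + tn) / len(y_true)
f1 = 2 * tp / (2 * tp + fp + fn)
auc = roc_auc(y_true, y_prob)
print(f"TP={tp} FP={fp} TN={tn} FN={fn} acc={accuracy:.2f} F1={f1:.2f} AUC={auc:.2f}")
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a cohort of a few hundred patients; a library routine would be the usual choice in a real pipeline.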

How we built it

  1. Data Acquisition: Downloaded TCGA-LIHC clinical and expression data (FPKM-UQ).
  2. Preprocessing:
    • Harmonized patient barcodes, merged datasets
    • One-hot/ordinal encoded categorical variables
    • Imputed missing values
  3. Feature Engineering:
    • Selected top 500 most variable genes
    • Applied L1-regularized logistic regression to identify a minimal gene set
  4. Modeling:
    • Balanced classes with SMOTE
    • Trained and tuned XGBoost models
    • Compared clinical-only, expression-only, combined, and minimal signature modalities
  5. Evaluation: Calculated ROC-AUC, F1 score, and accuracy; plotted the ROC curve and confusion matrix
  6. Deployment: Built an interactive Gradio app with pre-filled mean expression values
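
The feature engineering step (variance filter followed by L1-regularized logistic regression) can be sketched in plain Python as follows. This is an illustrative, dependency-free version and not the project's code: the gene names and expression values are hypothetical, the penalty strength `lam`, learning rate, and epoch count are assumed, and proximal gradient descent stands in for whatever solver the real pipeline used.

```python
import math
from statistics import pvariance


def top_variable_genes(expr, k):
    """Return the k gene names with the highest variance across samples.

    expr maps gene name -> list of expression values (one per sample).
    """
    ranked = sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)
    return ranked[:k]


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.1, lr=0.1, epochs=500):
    """L1-regularized logistic regression via proximal gradient descent.

    X is a list of feature rows, y a list of 0/1 labels.
    Returns (weights, bias); weights of uninformative features become exactly 0.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for j in range(d):
                grad_w[j] += err * row[j] / n
            grad_b += err / n
        # Gradient step on the log-loss, then shrink each weight (L1 penalty).
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the bias is not penalized
    return w, b


# Hypothetical toy data: 8 samples, 1 = HBV-related, 0 = non-viral.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expr = {
    "GENE_A": [2.0, 2.2, 1.8, 2.1, -2.0, -1.9, -2.1, -2.2],  # tracks the label
    "GENE_B": [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1],  # low-variance noise
    "GENE_C": [1.0] * 8,                                      # constant
}

selected = top_variable_genes(expr, k=2)
X = [[expr[g][i] for g in selected] for i in range(len(labels))]
weights, bias = fit_l1_logistic(X, labels)
signature = [g for g, wj in zip(selected, weights) if wj != 0.0]
print("variance filter kept:", selected)
print("L1 signature:", signature)
```

The soft-thresholding step is what makes the signature sparse: a gene whose gradient never exceeds the penalty stays at a weight of exactly zero and drops out of the signature.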

Challenges we ran into

  • Data harmonization: Matching TCGA barcodes across clinical and expression datasets
  • Class imbalance: Required SMOTE and careful evaluation using stratified splits
  • Feature dimensionality: >60,000 genes → had to filter & regularize carefully
  • Deployment: Ensuring model artifacts (pkl files) match the app’s expectations
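
The barcode matching problem can be illustrated with a small sketch. TCGA barcodes are hyphen-separated; the first three fields identify the patient, and the first two digits of the fourth field give the sample type (01 for primary solid tumor, 11 for solid tissue normal). The barcodes, field values, and function names below are made up for illustration, and keeping only the first primary tumor sample per patient is an assumption rather than the project's documented choice.

```python
def patient_id(barcode):
    """Return the patient-level prefix (first three fields) of a TCGA barcode."""
    return "-".join(barcode.upper().split("-")[:3])


def is_primary_tumor(barcode):
    """True when the sample-type code in the fourth field is '01'."""
    fields = barcode.split("-")
    return len(fields) > 3 and fields[3][:2] == "01"


def merge_by_patient(clinical, expression):
    """Inner-join clinical rows (keyed by patient ID) with expression rows
    (keyed by full aliquot barcode), keeping one primary tumor sample each."""
    merged = {}
    for barcode, values in expression.items():
        pid = patient_id(barcode)
        if is_primary_tumor(barcode) and pid in clinical and pid not in merged:
            merged[pid] = {**clinical[pid], **values}
    return merged


# Hypothetical records; real TCGA-LIHC tables have many more columns.
clinical = {
    "TCGA-XX-0001": {"age": 61, "stage": "II"},
    "TCGA-XX-0002": {"age": 54, "stage": "I"},
}
expression = {
    "TCGA-XX-0001-01A-11R-A000-07": {"GENE_A": 5.2},  # primary tumor
    "TCGA-XX-0001-11A-11R-A000-07": {"GENE_A": 1.1},  # matched normal, skipped
    "TCGA-XX-0003-01A-11R-A000-07": {"GENE_A": 3.3},  # no clinical record
}

merged = merge_by_patient(clinical, expression)
print(merged)
```

Patients that appear in only one table are dropped by the inner join, so it is worth logging how many records survive the merge.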

Accomplishments that we're proud of

  • Built a reproducible ML pipeline in one weekend
  • Achieved strong performance (accuracy ≈ 0.85 with minimal signature)
  • Reduced to a compact gene signature with minimal loss of accuracy
  • Deployed a working web app for real-time inference
  • Generated a professional PDF report including plots + feature importance

What we learned

  • How feature selection supports model interpretability
  • How to handle class imbalance in biomedical datasets
  • How to deploy ML models in a user-friendly interface with Gradio

What's next for HBVNet

  • Prospective use: Deploy as a decision support tool for clinicians
  • Expand modalities: Integrate mutation data & survival analysis
  • Productionization: Wrap as a REST API or Streamlit dashboard for wider access
