Precision Phenotyping using Genomic Variants and XGBoost

Google Collab

Inspiration

Alzheimer's is not a single disease—it’s a complex spectrum. Currently, patients undergo expensive PET scans or painful spinal taps just to know which type of cognitive risk they face. We asked: What if a simple blood test could tell you the same thing? Inspired by the concept of "Precision Medicine," we wanted to build an AI that doesn't just say "You have Alzheimer's," but explains exactly which biological pathway is at risk, democratizing access to early diagnosis.

What it does

Our project is a Genomic Risk Classifier. It takes a patient's genetic profile (specifically Single Nucleotide Polymorphisms, or SNPs) and predicts their specific Alzheimer's phenotype. Instead of a vague "High Risk" label, our AI categorizes patients into 9 actionable groups, including:

Pure AD (Alzheimer's Disease)
Vascular/ADRD (Related Dementias)
Cognitive Decline (Early stage risk)
Neuropathological Risk (Brain tissue damage) This allows doctors to tailor treatments to the specific cause of the disease rather than just the symptoms.

How we built it

We utilized the Alzheimer’s Disease Variant Portal (ADVP) dataset (advp.hg38), a high-confidence catalog of genetic associations.

Data Engineering: We rigorously cleaned the data to prevent leakage (removing P-values) and used One-Hot Encoding to translate biological sequences into machine-readable vectors.
Modeling: We trained an XGBoost Classifier, chosen for its superior performance on structured biological data.
Explainability: We integrated SHAP (SHapley Additive exPlanations) to visualize exactly which genes drove the predictions, ensuring our "Black Box" AI is transparent and clinically trustworthy.

Challenges we ran into

The "Data Leakage" Trap: Our initial model achieved 100% accuracy, which seemed too good to be true. We discovered the dataset included statistical columns (like P-values) that gave away the answer. We had to rebuild our pipeline to strip these out and train on only the biological variants.
Class Imbalance: Some phenotypes (like "Neuropathology") were rare compared to "AD." We used stratified splitting and F1-score evaluation to ensure the model didn't ignore these critical minority cases.

Accomplishments that we're proud of

Achieving 98.90% Accuracy on a 9-class classification problem.
Successfully mapping specific genetic loci (like APOE variants) to their correct phenotypes, proving our model learned real biology.
Building a fully reproducible pipeline in Google Colab that any researcher can run in seconds.

What we learned

We learned that "More Data" isn't always better; "Clean Data" is. Understanding the biological context of our features (genes vs. statistical artifacts) was more important than the complexity of the model itself. We also learned how to use SHAP to bridge the gap between computer science and medicine, making complex AI decisions understandable to doctors.

What's next for Precision Phenotyping using Genomic Variants and XGBoost

Validation: Testing our model on raw, whole-genome sequencing data (e.g., from the UK Biobank) to prove its robustnes in the real world.
Expansion: Adding more demographic data to ensure the tool is accurate for non-European populations.
App Integration: Building a simple web interface where a user can upload their "23andMe" raw data file and get a personalized risk report.

Built With

google-collab
python
shap
xgboost

Updates

Syed Shahnizam started this project — Dec 28, 2025 12:32 AM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.