🧬 Muta-Tech GenomeScan AI
A full-stack, AI-powered genomic variant analysis dashboard utilizing a custom dual-model DNABERT-2 pipeline to predict pathogenicity and disease associations from raw DNA sequences.
📋 Table of Contents
- Project Overview
- Core Features
- Architecture & Tech Stack
- Machine Learning Pipeline
- Variant Metrics Explained
- Disease Classes
- API Reference
- Project Structure
- Evidence of Training
- How to Run Locally
- Configuration
- Troubleshooting
- Future Roadmap
- Acknowledgements
🔬 Project Overview
Muta-Tech GenomeScan AI is an end-to-end bioinformatics platform that brings the power of transformer-based deep learning directly to genomic variant interpretation. Given a chromosomal position and alternate allele, the system:
- Dynamically slices the ±64 bp flanking context from the GRCh38 human reference genome
- Passes the resulting sequence through two independently fine-tuned DNABERT-2 models in parallel
- Returns a pathogenicity score, a ranked disease-association profile, key sequence-level metrics, and a nucleotide-resolution attention heatmap, all rendered in a real-time glassmorphism dashboard
The project was built to demonstrate that fine-tuned language models originally developed for natural language can be repurposed for biological sequence understanding, achieving clinically relevant predictions without requiring access to expensive GPU clusters at inference time.
🧬 Core Features
🤖 Dual Transformer Pipeline
Runs two independently fine-tuned DNABERT-2 models simultaneously on CPU or GPU. Model 1 handles binary pathogenicity classification; Model 2 handles 15-class multi-label disease association. Both models share the same BPE tokenizer but have separate fine-tuned classification heads.
🔴 Pathogenicity Scoring
Binary classification predicting whether a variant is Benign or Pathogenic, with a calibrated confidence score (0.0–1.0). Scores above 0.5 are flagged as potentially pathogenic, with color-coded severity bands in the UI.
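As a sketch, the score-to-band mapping might look like the following. The 0.5 pathogenic cut-off comes from the description above; the finer band edges and the function name are illustrative assumptions, not the app's exact values.

```python
def severity_band(score: float) -> str:
    """Map a calibrated pathogenicity score in [0, 1] to a UI severity band.

    The 0.5 cut-off matches the project description; the 0.75 edge is an
    illustrative assumption for the color-coded bands.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a calibrated probability in [0, 1]")
    if score < 0.5:
        return "benign"                  # green band in the UI
    if score < 0.75:
        return "possibly-pathogenic"     # amber band
    return "likely-pathogenic"           # red band
```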
🧪 Disease Prediction
15-class multi-label categorization produces a ranked probability vector across disease categories including Cystic Fibrosis, Arrhythmia, Li-Fraumeni Syndrome, and 12 others. Multiple diseases can be flagged simultaneously for pleiotropic variants.
⚖️ Contextual Blending
A unique prior-blending layer intelligently combines AI predictions with established biological knowledge:
- 40% AI model output – the raw transformer prediction
- 60% biological prior – gene-specific probability distributions derived from ClinVar frequency data
This correction step reduces false-positive rates for well-characterized genes where the training set may be sparse, ensuring the system degrades gracefully in data-limited scenarios.
📊 Variant Metrics
Automatically computed for every query:
| Metric | Description |
|---|---|
| Shannon Entropy | Measures sequence complexity; low entropy may indicate repetitive or low-complexity regions prone to alignment artifacts |
| GC Deviation | Departure from the genome-wide ~41% GC baseline; extreme GC content affects PCR amplification efficiency and variant call quality |
| CpG Hotspot Detection | Identifies CpG dinucleotide density; CpG sites are methylation targets and mutational hotspots |
| Transition/Transversion | Classifies the variant type (Ti/Tv); the genome-wide Ti/Tv ratio of ~2.1 is a common quality control metric |
🔥 Attention Heatmap
Visualizes the DNABERT-2 model's internal attention weights as a per-nucleotide heat overlay across the ±64 bp flanking sequence. High-attention positions highlight which nucleotides most influenced the model's prediction, providing human-interpretable explanations for AI decisions.
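The overlay itself only needs the raw weights rescaled to a common range. A minimal min-max normalization sketch (the helper name is hypothetical; the app's actual renderer lives in `frontend/app.js`):

```python
def normalize_attention(weights: list[float]) -> list[float]:
    """Min-max normalize raw per-nucleotide attention weights to [0, 1]
    so they can drive a color overlay across the flanking sequence."""
    lo, hi = min(weights), max(weights)
    if hi == lo:                       # flat attention: no signal to highlight
        return [0.0] * len(weights)
    return [(w - lo) / (hi - lo) for w in weights]
```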
⚙️ Architecture & Tech Stack
┌──────────────────────────────────────────────────────────┐
│                     Browser / Client                     │
│           Vanilla HTML + CSS + JS (Port 8080)            │
│      Glassmorphism UI · Chart.js · Dark/Light Theme      │
└───────────────────────┬──────────────────────────────────┘
                        │ REST (JSON)
                        ▼
┌──────────────────────────────────────────────────────────┐
│               FastAPI Backend (Port 8000)                │
│                                                          │
│  ┌──────────────┐         ┌──────────────────────────┐   │
│  │   pyfaidx    │         │   Prior Blending Layer   │   │
│  │ GRCh38/hg38  │──seq──▶ │   (40% AI / 60% Prior)   │   │
│  │ ±64bp Slice  │         └────────────┬─────────────┘   │
│  └──────────────┘                      │                 │
│                                        ▼                 │
│  ┌────────────────────────────────────────────────────┐  │
│  │               Dual DNABERT-2 Pipeline              │  │
│  │  ┌─────────────────┐   ┌────────────────────────┐  │  │
│  │  │     Model 1     │   │        Model 2         │  │  │
│  │  │  Pathogenicity  │   │  Disease Association   │  │  │
│  │  │  (Binary Clf.)  │   │ (15-class Multi-label) │  │  │
│  │  └─────────────────┘   └────────────────────────┘  │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
Frontend
- Vanilla HTML / CSS / JS – no build step or framework required
- Glassmorphism UI with scroll-driven entrance animations
- Dual Light / Dark theme with persistent user preference
- Chart.js radar charts for disease probability visualization
- Dynamically networked – accessible by LAN IP for mobile device testing
Backend
- FastAPI – async Python web framework with automatic OpenAPI documentation at `/docs`
- pyfaidx – efficient random-access slicing of the GRCh38 FASTA reference without loading the full genome into memory
- CORS middleware β pre-configured for localhost development; adjust origins for production
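The pyfaidx-based ±64 bp slicing mentioned above can be sketched as follows. `flanking_window` is a hypothetical helper, not the project's actual function: it accepts a `pyfaidx.Fasta` object or any mapping of chromosome name to a sliceable sequence, and converts the 1-based variant position to a 0-based slice.

```python
FLANK_SIZE = 64  # matches the project's default flank size

def flanking_window(genome, chrom: str, pos: int, flank: int = FLANK_SIZE) -> str:
    """Return the (flank + 1 + flank)-bp window centred on a 1-based position.

    `genome` can be a pyfaidx.Fasta object or any mapping of chromosome
    name to a sliceable sequence; pyfaidx slices lazily, so the full
    ~3 GB FASTA never has to be loaded into memory.
    """
    start = pos - 1 - flank            # 1-based position -> 0-based slice start
    if start < 0:
        raise ValueError("variant too close to the chromosome start")
    window = genome[chrom][start:start + 2 * flank + 1]
    return str(window)                 # pyfaidx slices stringify to the sequence
```

With the default `flank=64` this yields the 129-nucleotide input format used by both models.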
Machine Learning
- PyTorch – model inference and tensor operations
- Hugging Face `transformers` – DNABERT-2 model architecture and BPE tokenizer
- Byte Pair Encoding (BPE) – subword tokenization adapted for DNA, treating k-mers as vocabulary units
- Fine-tuned on ClinVar variant data (see Evidence of Training)
🧠 Machine Learning Pipeline
Base Model: DNABERT-2
DNABERT-2 is a BERT-style transformer pre-trained on the human reference genome using Masked Language Modelling (MLM). Unlike the original DNABERT, which uses fixed k-mer tokenization, DNABERT-2 employs Byte Pair Encoding (BPE), allowing it to learn variable-length genomic subwords, making it more robust to insertion/deletion variants.
Fine-Tuning Strategy
Model 1 – Pathogenicity Classifier
- Task: Binary sequence classification (`[CLS]` token head)
- Dataset: ClinVar SNPs labelled as `Pathogenic` or `Benign`/`Likely Benign`
- Loss: Binary Cross-Entropy
- Input: 129-nucleotide sequence (variant position ±64 bp context)
Model 2 – Disease Association Classifier
- Task: Multi-label classification (sigmoid output, 15 independent disease heads)
- Dataset: ClinVar SNPs with associated MedGen disease annotations
- Loss: Binary Cross-Entropy with class-weight balancing for rare diseases
- Input: Same 129-nucleotide sequence format
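A sketch of how 15 independent sigmoid heads become the ranked disease profile returned by the API; the disease names, logits, and the 0.5 threshold here are illustrative, not the app's exact values.

```python
import math

def rank_diseases(logits: dict[str, float], threshold: float = 0.5):
    """Apply an independent sigmoid per disease head and return the classes
    that clear the threshold, highest probability first. Because the sigmoids
    are independent, several diseases can be flagged at once for pleiotropic
    variants."""
    probs = {name: 1 / (1 + math.exp(-z)) for name, z in logits.items()}
    flagged = [(name, p) for name, p in probs.items() if p >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```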
Contextual Prior Blending
For genes with well-established ClinVar pathogenicity profiles (e.g., BRCA1, TP53, CFTR), the raw model output is blended with a gene-specific prior probability vector. This prevents the model from over-confidently predicting benign outcomes for high-risk genes that may be underrepresented in the training fold.
final_score = (0.4 × model_output) + (0.6 × gene_prior)
📊 Variant Metrics Explained
Shannon Entropy
Measures the informational complexity of the local sequence window. Calculated as:
H = -Σ p(b) × log₂ p(b)  for b ∈ {A, T, G, C}
A perfectly uniform sequence has H = 2.0 bits. Low-complexity regions (e.g., poly-A tracts) have H ≈ 0 and are prone to sequencing errors and alignment artifacts.
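A minimal implementation of the formula above:

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """H = -sum p(b) * log2 p(b) over the bases present in the window.
    A uniform ACGT mix gives 2.0 bits; a homopolymer gives 0.0."""
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```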
GC Deviation
The percentage of G and C nucleotides in the flanking window, reported as deviation from the hg38 genome-wide average (~41%). Extreme GC content can affect PCR amplification efficiency and variant call quality.
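A corresponding sketch; the ~41% baseline is the genome-wide figure quoted above, and reporting the deviation as a signed fraction (rather than a percentage string) is an assumption here.

```python
GENOME_GC_BASELINE = 0.41  # approximate hg38 genome-wide GC fraction

def gc_deviation(seq: str) -> float:
    """Signed deviation of the window's GC fraction from the ~41%
    genome-wide baseline; positive means the window is GC-rich."""
    gc = sum(seq.upper().count(base) for base in "GC") / len(seq)
    return gc - GENOME_GC_BASELINE
```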
CpG Hotspot Score
Counts CpG dinucleotides in the window and normalizes against the expected frequency. CpG islands are common in gene promoters; deamination of methylated cytosines makes CpG sites the most frequent point mutation site in the human genome.
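One common way to normalize against the expected frequency is the observed/expected CpG ratio; whether the app uses exactly this formula is an assumption.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: the CpG dinucleotide count divided by
    the count expected from the window's C and G frequencies alone.
    Values near or above ~0.6 are typical of CpG islands."""
    seq = seq.upper()
    n = len(seq)
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    expected = seq.count("C") * seq.count("G") / n
    return observed / expected if expected else 0.0
```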
Transition / Transversion Classification
- Transition (Ti): Purine↔Purine (A↔G) or Pyrimidine↔Pyrimidine (C↔T) – chemically similar bases
- Transversion (Tv): Purine↔Pyrimidine – chemically dissimilar bases; rarer and often more functionally impactful
The genome-wide Ti/Tv ratio of ~2.1 is used as a variant calling quality metric; samples deviating significantly from this ratio may indicate sequencing artifacts.
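The Ti/Tv classification itself reduces to a set-membership check:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def variant_class(ref: str, alt: str) -> str:
    """Classify an SNV as a transition (purine<->purine or
    pyrimidine<->pyrimidine) or a transversion (purine<->pyrimidine)."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or {ref, alt} - (PURINES | PYRIMIDINES):
        raise ValueError("expected two distinct bases from A/C/G/T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "Transition" if same_class else "Transversion"
```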
🏥 Disease Classes
The 15 disease categories predicted by Model 2:
| # | Disease | OMIM Category |
|---|---|---|
| 1 | Cystic Fibrosis | Pulmonary / Metabolic |
| 2 | Arrhythmia | Cardiovascular |
| 3 | Li-Fraumeni Syndrome | Cancer Predisposition |
| 4 | Hereditary Breast & Ovarian Cancer | Cancer Predisposition |
| 5 | Lynch Syndrome | Cancer Predisposition |
| 6 | Marfan Syndrome | Connective Tissue |
| 7 | Hypertrophic Cardiomyopathy | Cardiovascular |
| 8 | Familial Hypercholesterolaemia | Metabolic |
| 9 | Noonan Syndrome | Developmental |
| 10 | Retinitis Pigmentosa | Ophthalmological |
| 11 | Phenylketonuria | Metabolic |
| 12 | Haemophilia A/B | Haematological |
| 13 | Spinal Muscular Atrophy | Neuromuscular |
| 14 | Gaucher Disease | Lysosomal Storage |
| 15 | Neurofibromatosis Type 1 | Neurocutaneous |
📡 API Reference
Once the backend is running, full interactive documentation is available at http://localhost:8000/docs.
POST /analyze
Analyzes a genomic variant and returns AI predictions, metrics, and attention weights.
Request Body:
{
  "chrom": "chr17",
  "pos": 43044295,
  "ref": "G",
  "alt": "A",
  "gene": "BRCA1"
}
Response:
{
  "pathogenicity": {
    "label": "Pathogenic",
    "score": 0.87,
    "blended_score": 0.91
  },
  "diseases": [
    { "name": "Hereditary Breast & Ovarian Cancer", "probability": 0.94 },
    { "name": "Li-Fraumeni Syndrome", "probability": 0.31 }
  ],
  "metrics": {
    "shannon_entropy": 1.84,
    "gc_deviation": "+6.2%",
    "cpg_hotspot": true,
    "variant_type": "Transition (C→T)"
  },
  "attention_weights": [0.02, 0.05, 0.18, ...],
  "flanking_sequence": "ATCG...VARIANT...GCTA"
}
GET /health
Returns `{"status": "ok"}` – use for container liveness probes.
GET /docs
Auto-generated Swagger UI for interactive API exploration.
📁 Project Structure
muta-tech-genomescan/
│
├── backend/
│   ├── server.py              # FastAPI application entry point
│   ├── requirements.txt       # Python dependencies
│   ├── models/
│   │   ├── pathogenicity/     # Fine-tuned DNABERT-2 (Model 1) weights
│   │   └── disease/           # Fine-tuned DNABERT-2 (Model 2) weights
│   ├── data/
│   │   └── hg38.fa            # GRCh38 reference genome (user-provided, ~3 GB)
│   ├── utils/
│   │   ├── sequence.py        # pyfaidx slicing + variant metrics
│   │   ├── blending.py        # Prior blending logic
│   │   └── attention.py       # Attention weight extraction
│   └── priors/
│       └── gene_priors.json   # Per-gene ClinVar prior distributions
│
├── frontend/
│   ├── index.html             # Main dashboard
│   ├── style.css              # Glassmorphism UI + theme variables
│   └── app.js                 # API calls, Chart.js, attention heatmap renderer
│
├── training_evidence/
│   ├── DNABERT2_ClinVar_Training.ipynb  # Pathogenicity model training notebook
│   ├── DNABERT2_Disease_Training.ipynb  # Disease model training notebook
│   └── clinvar_disease_snps.csv         # Training dataset sample
│
└── README.md
📑 Evidence of Training
The training_evidence/ folder provides full transparency into model development:
- `DNABERT2_ClinVar_Training.ipynb` – documents the fine-tuning of Model 1 on ClinVar pathogenicity labels, including data preprocessing, train/val/test splits, loss curves, and the final classification report.
- `DNABERT2_Disease_Training.ipynb` – documents the fine-tuning of Model 2 on ClinVar disease annotations, including class-weight calculation for rare-disease balancing, multi-label ROC-AUC curves, and per-class F1 scores.
- `clinvar_disease_snps.csv` – a representative sample of the training data, sourced from NCBI ClinVar, showing variant coordinates, clinical significance labels, and associated disease terms.
Data Source: All training data was sourced from NCBI ClinVar, a freely accessible public archive of human genetic variants and their clinical interpretations.
🚀 How to Run Locally
Prerequisites
- Python 3.9 or higher
- Node.js 16+ (for `http-server`)
- ~3 GB disk space for the GRCh38 reference genome
- 8 GB RAM minimum (16 GB recommended for CPU inference)
Step 1 β Download the Reference Genome
# Download GRCh38/hg38 from UCSC (~900 MB compressed download)
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz -P backend/data/
gunzip backend/data/hg38.fa.gz
# Index the genome for fast random access
cd backend/data && samtools faidx hg38.fa
Alternatively, download from Ensembl GRCh38.
Step 2 β Start the Backend API (Port 8000)
cd backend
pip install -r requirements.txt
# Ensure hg38.fa is placed in backend/data/
python server.py
The API will be available at http://localhost:8000. Visit http://localhost:8000/docs for the interactive Swagger UI.
Step 3 β Start the Frontend Dashboard (Port 8080)
# From the project root
npx http-server ./ -p 8080 -c-1
Open http://localhost:8080 in your browser.
💡 The UI is dynamically networked: you can also access it via your machine's local IP address (e.g., `http://192.168.x.x:8080`) from a mobile device on the same WiFi network.
⚙️ Configuration
Key settings can be adjusted in backend/server.py:
| Setting | Default | Description |
|---|---|---|
| `REFERENCE_GENOME_PATH` | `data/hg38.fa` | Path to the GRCh38 FASTA file |
| `FLANK_SIZE` | `64` | Nucleotides of context on each side of the variant |
| `AI_BLEND_WEIGHT` | `0.4` | Fraction of the final score from the AI model (vs. the prior) |
| `PRIOR_BLEND_WEIGHT` | `0.6` | Fraction of the final score from the biological prior |
| `DEVICE` | `auto` | `cpu`, `cuda`, or `auto` (auto-detects GPU) |
| `CORS_ORIGINS` | `["*"]` | Allowed frontend origins; restrict in production |
🔧 Troubleshooting
FileNotFoundError: hg38.fa not found
Ensure the reference genome is placed at backend/data/hg38.fa and the .fai index file exists alongside it. Re-run samtools faidx hg38.fa if the index is missing.
RuntimeError: CUDA out of memory
Set DEVICE = "cpu" in server.py. CPU inference is slower (~2–4 seconds per query) but works on any machine.
CORS error in browser
Ensure the backend is running on port 8000 and CORS_ORIGINS in server.py includes your frontend origin.
Slow inference on CPU
Both models run sequentially on CPU by default. For faster results, use a machine with a CUDA-compatible GPU or reduce FLANK_SIZE (note: this will affect prediction accuracy).
Port 8080 already in use
npx http-server ./ -p 8081 -c-1
🗺️ Future Roadmap
- [ ] VCF File Upload – batch analysis of multi-variant VCF files with exportable reports
- [ ] ClinVar Live API Integration – real-time lookup of existing ClinVar annotations for comparison
- [ ] Model Confidence Calibration – temperature scaling for better-calibrated probability outputs
- [ ] ACMG/AMP Criteria Overlay – map AI predictions to ACMG/AMP variant classification criteria
- [ ] Population Frequency Lookup – gnomAD allele frequency integration for context
- [ ] Docker Compose Deployment – one-command containerised setup for reproducible environments
- [ ] User Authentication & History – save and revisit previous queries per user session
- [ ] Extended Disease Classes – expand from 15 to 50+ disease categories with additional ClinVar training data
🙏 Acknowledgements
- DNABERT-2 – genome-scale pre-trained transformer (Zhou et al., 2023)
- NCBI ClinVar – public variant database used for fine-tuning: ncbi.nlm.nih.gov/clinvar
- GRCh38/hg38 – human reference genome provided by the Genome Reference Consortium
- Hugging Face Transformers – model hosting and inference infrastructure
- pyfaidx – efficient FASTA random access: github.com/mdshw5/pyfaidx
Built as part of an AI/ML undergraduate research project. Not intended for clinical diagnostic use.