🧬 Muta-Tech GenomeScan AI

A full-stack, AI-powered genomic variant analysis dashboard utilizing a custom dual-model DNABERT-2 pipeline to predict pathogenicity and disease associations from raw DNA sequences.

Python PyTorch FastAPI HuggingFace License


πŸ“– Table of Contents


πŸ”¬ Project Overview

Muta-Tech GenomeScan AI is an end-to-end bioinformatics platform that brings the power of transformer-based deep learning directly to genomic variant interpretation. Given a chromosomal position and alternate allele, the system:

  1. Dynamically slices the Β±64 bp flanking context from the GRCh38 human reference genome
  2. Passes the resulting sequence through two independently fine-tuned DNABERT-2 models in parallel
  3. Returns a pathogenicity score, a ranked disease-association profile, key sequence-level metrics, and a nucleotide-resolution attention heatmap β€” all rendered in a real-time glassmorphism dashboard

The project was built to demonstrate that fine-tuned language models originally developed for natural language can be repurposed for biological sequence understanding, achieving clinically relevant predictions without requiring access to expensive GPU clusters at inference time.


🧬 Core Features

πŸ€– Dual Transformer Pipeline

Runs two independently fine-tuned DNABERT-2 models simultaneously on CPU or GPU. Model 1 handles binary pathogenicity classification; Model 2 handles 15-class multi-label disease association. Both models share the same BPE tokenizer but have separate fine-tuned classification heads.

πŸ”΄ Pathogenicity Scoring

Binary classification predicting whether a variant is Benign or Pathogenic, with a calibrated confidence score (0.0–1.0). Scores above 0.5 are flagged as potentially pathogenic, with color-coded severity bands in the UI.

πŸ§ͺ Disease Prediction

15-class multi-label categorization produces a ranked probability vector across disease categories including Cystic Fibrosis, Arrhythmia, Li-Fraumeni Syndrome, and 12 others. Multiple diseases can be flagged simultaneously for pleiotropic variants.

βš–οΈ Contextual Blending

A unique prior-blending layer intelligently combines AI predictions with established biological knowledge:

  • 40% AI model output β€” raw transformer prediction
  • 60% Biological prior β€” gene-specific probability distributions derived from ClinVar frequency data

This correction step reduces false-positive rates for well-characterized genes where the training set may be sparse, ensuring the system degrades gracefully in data-limited scenarios.

πŸ“Š Variant Metrics

Automatically computed for every query:

Metric Description
Shannon Entropy Measures sequence complexity; low entropy may indicate repetitive or low-complexity regions prone to alignment artifacts
GC Deviation Departure from the genome-wide ~41% GC baseline; extreme GC content affects polymerase fidelity
CpG Hotspot Detection Identifies CpG dinucleotide density; CpG sites are methylation targets and mutational hotspots
Transition/Transversion Classifies the variant type (Ti/Tv); the genome-wide Ti/Tv ratio of ~2.1 is a common quality control metric

πŸ”₯ Attention Heatmap

Visualizes the DNABERT-2 model's internal attention weights as a per-nucleotide heat overlay across the Β±64 bp flanking sequence. High-attention positions highlight which nucleotides most influenced the model's prediction β€” providing human-interpretable explanations for AI decisions.


βš™οΈ Architecture & Tech Stack

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Browser / Client                     β”‚
β”‚         Vanilla HTML + CSS + JS (Port 8080)             β”‚
β”‚   Glassmorphism UI Β· Chart.js Β· Dark/Light Theme        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚ REST (JSON)
                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              FastAPI Backend (Port 8000)                β”‚
β”‚                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  pyfaidx     β”‚        β”‚   Prior Blending Layer   β”‚   β”‚
β”‚  β”‚  GRCh38/hg38 │──seq──▢│   (40% AI / 60% Prior)  β”‚   β”‚
β”‚  β”‚  Β±64bp Slice β”‚        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚                   β”‚
β”‚                                     β–Ό                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚           Dual DNABERT-2 Pipeline                β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚   β”‚
β”‚  β”‚  β”‚  Model 1        β”‚  β”‚  Model 2               β”‚ β”‚   β”‚
β”‚  β”‚  β”‚  Pathogenicity  β”‚  β”‚  Disease Association   β”‚ β”‚   β”‚
β”‚  β”‚  β”‚  (Binary Clf.)  β”‚  β”‚  (15-class Multi-label)β”‚ β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Frontend

  • Vanilla HTML / CSS / JS β€” no build step or framework required
  • Glassmorphism UI with scroll-driven entrance animations
  • Dual Light / Dark theme with persistent user preference
  • Chart.js radar charts for disease probability visualization
  • Dynamically networked β€” accessible by LAN IP for mobile device testing

Backend

  • FastAPI β€” async Python web framework with automatic OpenAPI documentation at /docs
  • pyfaidx β€” efficient random-access slicing of the GRCh38 FASTA reference without loading the full genome into memory
  • CORS middleware β€” pre-configured for localhost development; adjust origins for production

Machine Learning

  • PyTorch β€” model inference and tensor operations
  • Hugging Face transformers β€” DNABERT-2 model architecture and BPE tokenizer
  • Byte Pair Encoding (BPE) β€” subword tokenization adapted for DNA, treating k-mers as vocabulary units
  • Fine-tuned on ClinVar variant data (see Evidence of Training)

🧠 Machine Learning Pipeline

Base Model: DNABERT-2

DNABERT-2 is a BERT-style transformer pre-trained on the human reference genome using Masked Language Modelling (MLM). Unlike the original DNABERT which uses fixed k-mer tokenization, DNABERT-2 employs Byte Pair Encoding (BPE), allowing it to learn variable-length genomic subwords β€” making it more robust to insertion/deletion variants.

Fine-Tuning Strategy

Model 1 β€” Pathogenicity Classifier

  • Task: Binary sequence classification ([CLS] token head)
  • Dataset: ClinVar SNPs labelled as Pathogenic or Benign/Likely Benign
  • Loss: Binary Cross-Entropy
  • Input: 129-nucleotide sequence (variant position Β± 64 bp context)

Model 2 β€” Disease Association Classifier

  • Task: Multi-label classification (sigmoid output, 15 independent disease heads)
  • Dataset: ClinVar SNPs with associated MedGen disease annotations
  • Loss: Binary Cross-Entropy with class-weight balancing for rare diseases
  • Input: Same 129-nucleotide sequence format

Contextual Prior Blending

For genes with well-established ClinVar pathogenicity profiles (e.g., BRCA1, TP53, CFTR), the raw model output is blended with a gene-specific prior probability vector. This prevents the model from over-confidently predicting benign outcomes for high-risk genes that may be underrepresented in the training fold.

final_score = (0.4 Γ— model_output) + (0.6 Γ— gene_prior)

πŸ“ Variant Metrics Explained

Shannon Entropy

Measures the informational complexity of the local sequence window. Calculated as:

H = -Ξ£ p(b) Γ— logβ‚‚(p(b))   for b ∈ {A, T, G, C}

A perfectly uniform sequence has H = 2.0 bits. Low-complexity regions (e.g., poly-A tracts) have H β‰ˆ 0 and are prone to sequencing errors and alignment artifacts.

GC Deviation

The percentage of G and C nucleotides in the flanking window, reported as deviation from the hg38 genome-wide average (~41%). Extreme GC content can affect PCR amplification efficiency and variant call quality.

CpG Hotspot Score

Counts CpG dinucleotides in the window and normalizes against the expected frequency. CpG islands are common in gene promoters; deamination of methylated cytosines makes CpG sites the most frequent point mutation site in the human genome.

Transition / Transversion Classification

  • Transition (Ti): Purine↔Purine (A↔G) or Pyrimidine↔Pyrimidine (C↔T) β€” chemically similar bases
  • Transversion (Tv): Purine↔Pyrimidine β€” chemically dissimilar bases; rarer and often more functionally impactful

The genome-wide Ti/Tv ratio of ~2.1 is used as a variant calling quality metric; samples deviating significantly from this ratio may indicate sequencing artifacts.


πŸ₯ Disease Classes

The 15 disease categories predicted by Model 2:

# Disease OMIM Category
1 Cystic Fibrosis Pulmonary / Metabolic
2 Arrhythmia Cardiovascular
3 Li-Fraumeni Syndrome Cancer Predisposition
4 Hereditary Breast & Ovarian Cancer Cancer Predisposition
5 Lynch Syndrome Cancer Predisposition
6 Marfan Syndrome Connective Tissue
7 Hypertrophic Cardiomyopathy Cardiovascular
8 Familial Hypercholesterolaemia Metabolic
9 Noonan Syndrome Developmental
10 Retinitis Pigmentosa Ophthalmological
11 Phenylketonuria Metabolic
12 Haemophilia A/B Haematological
13 Spinal Muscular Atrophy Neuromuscular
14 Gaucher Disease Lysosomal Storage
15 Neurofibromatosis Type 1 Neurocutaneous

πŸ“‘ API Reference

Once the backend is running, full interactive documentation is available at http://localhost:8000/docs.

POST /analyze

Analyzes a genomic variant and returns AI predictions, metrics, and attention weights.

Request Body:

{
  "chrom": "chr17",
  "pos": 43044295,
  "ref": "G",
  "alt": "A",
  "gene": "BRCA1"
}

Response:

{
  "pathogenicity": {
    "label": "Pathogenic",
    "score": 0.87,
    "blended_score": 0.91
  },
  "diseases": [
    { "name": "Hereditary Breast & Ovarian Cancer", "probability": 0.94 },
    { "name": "Li-Fraumeni Syndrome", "probability": 0.31 }
  ],
  "metrics": {
    "shannon_entropy": 1.84,
    "gc_deviation": "+6.2%",
    "cpg_hotspot": true,
    "variant_type": "Transition (C→T)"
  },
  "attention_weights": [0.02, 0.05, 0.18, ...],
  "flanking_sequence": "ATCG...VARIANT...GCTA"
}

GET /health

Returns {"status": "ok"} β€” use for container liveness probes.

GET /docs

Auto-generated Swagger UI for interactive API exploration.


πŸ“‚ Project Structure

muta-tech-genomescan/
β”‚
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ server.py               # FastAPI application entry point
β”‚   β”œβ”€β”€ requirements.txt        # Python dependencies
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ pathogenicity/      # Fine-tuned DNABERT-2 (Model 1) weights
β”‚   β”‚   └── disease/            # Fine-tuned DNABERT-2 (Model 2) weights
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   └── hg38.fa             # GRCh38 reference genome (user-provided, ~3 GB)
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   β”œβ”€β”€ sequence.py         # pyfaidx slicing + variant metrics
β”‚   β”‚   β”œβ”€β”€ blending.py         # Prior blending logic
β”‚   β”‚   └── attention.py        # Attention weight extraction
β”‚   └── priors/
β”‚       └── gene_priors.json    # Per-gene ClinVar prior distributions
β”‚
β”œβ”€β”€ frontend/
β”‚   β”œβ”€β”€ index.html              # Main dashboard
β”‚   β”œβ”€β”€ style.css               # Glassmorphism UI + theme variables
β”‚   └── app.js                  # API calls, Chart.js, attention heatmap renderer
β”‚
β”œβ”€β”€ training_evidence/
β”‚   β”œβ”€β”€ DNABERT2_ClinVar_Training.ipynb     # Pathogenicity model training notebook
β”‚   β”œβ”€β”€ DNABERT2_Disease_Training.ipynb     # Disease model training notebook
β”‚   └── clinvar_disease_snps.csv            # Training dataset sample
β”‚
└── README.md

πŸ“‹ Evidence of Training

The training_evidence/ folder contains full transparency into model development:

  • DNABERT2_ClinVar_Training.ipynb β€” Documents the fine-tuning of Model 1 on ClinVar pathogenicity labels. Includes data preprocessing, train/val/test splits, loss curves, and final classification report.
  • DNABERT2_Disease_Training.ipynb β€” Documents the fine-tuning of Model 2 on ClinVar disease annotations. Includes class-weight calculation for rare disease balancing, multi-label ROC-AUC curves, and per-class F1 scores.
  • clinvar_disease_snps.csv β€” A representative sample of the training data, sourced from NCBI ClinVar, showing variant coordinates, clinical significance labels, and associated disease terms.

Data Source: All training data was sourced from NCBI ClinVar, a freely accessible public archive of human genetic variants and their clinical interpretations.


πŸš€ How to Run Locally

Prerequisites

  • Python 3.9 or higher
  • Node.js 16+ (for http-server)
  • ~3 GB disk space for the GRCh38 reference genome
  • 8 GB RAM minimum (16 GB recommended for CPU inference)

Step 1 β€” Download the Reference Genome

# Download GRCh38/hg38 from UCSC (requires ~900 MB compressed)
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz -P backend/data/
gunzip backend/data/hg38.fa.gz

# Index the genome for fast random access
cd backend/data && samtools faidx hg38.fa

Alternatively, download from Ensembl GRCh38.

Step 2 β€” Start the Backend API (Port 8000)

cd backend
pip install -r requirements.txt

# Ensure hg38.fa is placed in backend/data/
python server.py

The API will be available at http://localhost:8000. Visit http://localhost:8000/docs for the interactive Swagger UI.

Step 3 β€” Start the Frontend Dashboard (Port 8080)

# From the project root
npx http-server ./ -p 8080 -c-1

Open http://localhost:8080 in your browser.

πŸ’‘ The UI is dynamically networked β€” you can also access it via your machine's local IP address (e.g., http://192.168.x.x:8080) from a mobile device on the same WiFi network.


βš™οΈ Configuration

Key settings can be adjusted in backend/server.py:

Setting Default Description
REFERENCE_GENOME_PATH data/hg38.fa Path to the GRCh38 FASTA file
FLANK_SIZE 64 Nucleotides of context on each side of the variant
AI_BLEND_WEIGHT 0.4 Fraction of final score from AI model (vs. prior)
PRIOR_BLEND_WEIGHT 0.6 Fraction of final score from biological prior
DEVICE auto cpu, cuda, or auto (auto-detects GPU)
CORS_ORIGINS ["*"] Allowed frontend origins; restrict in production

πŸ›  Troubleshooting

FileNotFoundError: hg38.fa not found Ensure the reference genome is placed at backend/data/hg38.fa and the .fai index file exists alongside it. Re-run samtools faidx hg38.fa if the index is missing.

RuntimeError: CUDA out of memory Set DEVICE = "cpu" in server.py. CPU inference is slower (~2–4 seconds per query) but works on any machine.

CORS error in browser Ensure the backend is running on port 8000 and CORS_ORIGINS in server.py includes your frontend origin.

Slow inference on CPU Both models run sequentially on CPU by default. For faster results, use a machine with a CUDA-compatible GPU or reduce FLANK_SIZE (note: this will affect prediction accuracy).

Port 8080 already in use

npx http-server ./ -p 8081 -c-1

πŸ—Ί Future Roadmap

  • [ ] VCF File Upload β€” Batch analysis of multi-variant VCF files with exportable reports
  • [ ] ClinVar Live API Integration β€” Real-time lookup of existing ClinVar annotations for comparison
  • [ ] Model Confidence Calibration β€” Temperature scaling for better-calibrated probability outputs
  • [ ] ACMG/AMP Criteria Overlay β€” Map AI predictions to ACMG/AMP variant classification criteria
  • [ ] Population Frequency Lookup β€” gnomAD allele frequency integration for context
  • [ ] Docker Compose Deployment β€” One-command containerised setup for reproducible environments
  • [ ] User Authentication & History β€” Save and revisit previous queries per user session
  • [ ] Extended Disease Classes β€” Expand from 15 to 50+ disease categories with additional ClinVar training data

πŸ™ Acknowledgements


Built as part of an AI/ML undergraduate research project. Not intended for clinical diagnostic use.

Built With

Share this project:

Updates