Model Card: Alzheimer's Disease MRI Classification System

Model Name: AlzheimerNet-ResNet18
Version: 1.0
Last Updated: December 20, 2025
Model Type: Convolutional Neural Network (Transfer Learning)
Task: Multi-class Image Classification (Medical Imaging)

Model Details

Overview

A deep learning model for automated classification of Alzheimer's disease stages from brain MRI scans. Based on ResNet18 architecture with transfer learning from ImageNet, adapted for grayscale medical imaging.

Intended Use

Primary Use: Automated screening and classification support for Alzheimer's disease diagnosis
Target Users: Healthcare professionals, radiologists, neurologists, clinical researchers
Use Context: Clinical decision support tool for analyzing structural brain MRI scans
NOT INTENDED FOR: Standalone diagnosis without physician oversight, replacement of clinical judgment, or use by non-medical professionals

Model Architecture

Base Architecture: ResNet18 (He et al., 2016)
Pre-training: ImageNet (natural images)
Modifications:
- Input adapted from RGB (3 channels) to grayscale (1 channel)
- Output layer modified for 4-class classification
- All layers fine-tuned on medical data
Parameters: ~11 million trainable parameters
Input: 224×224 grayscale MRI images, normalized to [-1, 1]
Output: Probability distribution over 4 classes (softmax)

Training Details

Training Data: 5,120 labeled brain MRI scans
Validation Data: 1,280 samples (20% holdout of the 6,400 labeled scans)
Test Data: 1,280 unlabeled samples
Data Augmentation: Rotation (±10°), horizontal flip, brightness/contrast jitter
Optimizer: Adam (learning rate: 1e-4, weight decay: 1e-5)
Training Duration: 20 epochs (~90 minutes on Tesla T4 GPU)
Batch Size: 32
Loss Function: Cross-entropy

Performance Metrics

Metric	Value
Validation Accuracy	97.27%
Training Accuracy	99.19%
Precision (weighted avg)	97%
Recall (weighted avg)	97%
F1-Score (weighted avg)	97%

Per-Class Performance:

Class 0 (Non-Demented): 98% F1-score
Class 1 (Very Mild): 100% F1-score
Class 2 (Mild): 98% F1-score
Class 3 (Moderate): 96% F1-score
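
For readers who want to reproduce metrics of this kind, the sketch below shows how per-class and weighted-average precision, recall, and F1 can be derived from a confusion matrix. The matrix is made up for illustration and is not this model's validation result; the function names are hypothetical.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted elsewhere
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true elsewhere
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({"precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn})
    return rows


def weighted_average(rows, key):
    """Average a metric across classes, weighted by class support."""
    total = sum(r["support"] for r in rows)
    return sum(r[key] * r["support"] for r in rows) / total


def macro_average(rows, key):
    """Unweighted mean across classes; rare classes count as much as common ones."""
    return sum(r[key] for r in rows) / len(rows)


# Hypothetical 4-class confusion matrix (illustrative numbers only).
cm = [
    [48, 0, 2, 0],
    [0, 10, 0, 0],
    [1, 0, 97, 2],
    [0, 0, 3, 37],
]
rows = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(4)) / sum(map(sum, cm))
print(weighted_average(rows, "f1"), macro_average(rows, "f1"), accuracy)
```

Weighted averages are dominated by the largest classes, so reporting the macro average alongside them gives a clearer picture when classes are imbalanced.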

Limitations

Data Limitations

Limited Sample Size
- Training set of 5,120 samples is modest for deep learning
- May not capture full variability of real-world clinical presentations
- Impact: Reduced generalization to rare or unusual cases
Class Imbalance
- Class 1 (Very Mild) severely underrepresented (only 10 validation samples)
- Class 2 (Mild) dominates dataset (~50% of samples)
- Impact: Model may be less reliable for detecting very mild cases; perfect validation performance (100%) on Class 1 may not generalize
Single Data Source
- All training data appears to come from similar scanner/protocol
- No diversity in acquisition parameters, scanner manufacturers, or imaging protocols
- Impact: May not generalize well to different MRI scanners or imaging centers
Temporal Snapshot
- Uses single time-point imaging only
- No longitudinal progression data
- Impact: Cannot predict disease trajectory or rate of decline
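
To make the Class 1 caveat concrete, the sketch below computes a Wilson score interval for recall on a very small validation subset. It assumes the 100% F1-score corresponds to 10 of 10 Class 1 validation samples being classified correctly; the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Assumed scenario: 10 correct out of 10 Class 1 validation samples.
low, high = wilson_interval(10, 10)
print(f"95% interval for Class 1 recall: {low:.2f} to {high:.2f}")
```

Under that assumption the lower bound is only about 0.72, so a perfect score on 10 samples is compatible with substantially lower recall on new data.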

Model Limitations

Black Box Nature
- Deep learning model lacks full interpretability
- Difficult to explain specific predictions to clinicians
- Impact: May reduce trust and clinical adoption
Overfitting Risk
- 99.19% training accuracy suggests possible memorization of training samples
- The train-validation gap is small (1.92 percentage points), which is reassuring, but it should be rechecked on independent data
- Impact: Performance may degrade on truly novel cases
Discrete Class Outputs
- Provides discrete class predictions rather than continuous severity scores
- Real disease progression is continuous
- Impact: May miss subtle transitions between stages
No Uncertainty Quantification
- Model outputs raw softmax probabilities only; it provides no calibrated confidence intervals or uncertainty estimates
- All predictions treated equally regardless of confidence
- Impact: Cannot flag ambiguous cases for manual review
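
One simple way to start flagging ambiguous cases is a heuristic on the softmax output, sketched below. This is not calibrated uncertainty quantification: the thresholds `min_top_prob` and `max_entropy` are arbitrary placeholder values, the function names are hypothetical, and softmax probabilities from deep networks are often overconfident.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]; 1 means a uniform distribution."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def needs_review(probs, min_top_prob=0.80, max_entropy=0.50):
    """Flag a prediction for manual review (placeholder thresholds)."""
    return max(probs) < min_top_prob or normalized_entropy(probs) > max_entropy


confident = softmax([8.0, 0.5, 0.2, 0.1])   # one class clearly dominates
ambiguous = softmax([0.2, 0.1, 2.0, 1.8])   # Class 2 and Class 3 nearly tied
print(needs_review(confident), needs_review(ambiguous))
```

Thresholds like these would need to be tuned on held-out data, ideally after calibrating the probabilities (for example with temperature scaling).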

Technical Constraints

Computational Requirements
- Requires GPU for inference (CPU too slow for clinical deployment)
- Model size (~45MB) manageable but not edge-deployable
- Impact: Limits deployment to well-resourced facilities
Input Format Constraints
- Requires specific image preprocessing (224×224 resize, normalization)
- Sensitive to image quality and artifacts
- Impact: May fail on low-quality or corrupted scans
Single Modality
- Uses structural MRI only
- Ignores functional MRI, PET, CSF biomarkers, genetics, cognitive scores
- Impact: Misses complementary diagnostic information
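
The input format constraints above suggest a validation step before inference. The sketch below assumes the scan has already been resized to 224×224 and is given as rows of 8-bit pixel values; the mapping from [0, 255] to [-1, 1] via mean 0.5 and standard deviation 0.5 is an assumption about how the normalization was done, and `min_std` is a placeholder quality threshold.

```python
EXPECTED_SIZE = 224


def validate_scan(pixels, min_std=1.0):
    """Reject inputs with the wrong shape, out-of-range values, or no contrast."""
    if len(pixels) != EXPECTED_SIZE or any(len(row) != EXPECTED_SIZE for row in pixels):
        raise ValueError("expected a 224x224 grayscale image")
    flat = [v for row in pixels for v in row]
    if any(not 0 <= v <= 255 for v in flat):
        raise ValueError("pixel values must lie in [0, 255]")
    mean = sum(flat) / len(flat)
    std = (sum((v - mean) ** 2 for v in flat) / len(flat)) ** 0.5
    if std < min_std:
        raise ValueError("image is nearly uniform; possibly blank or corrupted")


def normalize(pixels):
    """Map 8-bit intensities to [-1, 1] (assumed mean=0.5, std=0.5 scheme)."""
    return [[(v / 255.0 - 0.5) / 0.5 for v in row] for row in pixels]


# Synthetic gradient image standing in for a real scan.
scan = [[(r + c) % 256 for c in range(EXPECTED_SIZE)] for r in range(EXPECTED_SIZE)]
validate_scan(scan)
normalized = normalize(scan)
lowest = min(min(row) for row in normalized)
highest = max(max(row) for row in normalized)
print(lowest, highest)
```

A blank or constant image raises `ValueError` here instead of reaching the model, which is one inexpensive guard against corrupted scans.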

Bias & Fairness Considerations

Potential Sources of Bias

Demographic Bias (Unknown)
- Issue: Dataset demographics (age, sex, race, ethnicity, socioeconomic status) not documented
- Risk: Model may perform differently across demographic groups
- Example: If training data over-represents Caucasian populations, may underperform on other ethnicities
- Mitigation Needed: Demographic analysis of training data and subgroup performance evaluation
Selection Bias
- Issue: Dataset may not represent general population (e.g., clinical trial participants vs. real-world patients)
- Risk: Higher prevalence of severe cases or younger patients in research datasets
- Impact: May misclassify community-dwelling, less severe cases
- Mitigation: Validate on diverse, real-world clinical populations
Scanner & Protocol Bias
- Issue: Training data likely from limited scanner types/imaging protocols
- Risk: Performance degradation on scans from different equipment or settings
- Impact: Model may favor specific MRI characteristics over disease features
- Mitigation: Multi-site validation with heterogeneous scanners
Labeling Bias
- Issue: Ground truth labels based on clinical diagnosis, which has inherent subjectivity
- Risk: Model learns clinician biases rather than objective disease features
- Impact: May perpetuate diagnostic disparities
- Mitigation: Multiple expert consensus labels, neuropathological confirmation
Socioeconomic Bias
- Issue: Access to MRI scans correlates with socioeconomic status
- Risk: Underrepresentation of lower-income populations in training data
- Impact: May not generalize to underserved communities
- Mitigation: Diverse data collection from community health centers

Fairness Metrics

Current Status: ⚠️ Not Evaluated

Required Analysis:

[ ] Stratified performance by age groups (60-70, 70-80, 80+)
[ ] Stratified performance by biological sex (if known)
[ ] Stratified performance by race/ethnicity (if known)
[ ] Error rate disparity across subgroups
[ ] False positive/negative rate parity
[ ] Equal opportunity metrics

Recommendation: Before clinical deployment, conduct comprehensive fairness audit with demographic-stratified evaluation.
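
A stratified evaluation of the kind listed above could be sketched as follows. The records are invented for illustration, since no demographic data is documented for this dataset, and the function name is hypothetical. The false negative rate here treats any dementia class (1-3) predicted as Class 0 as a miss.

```python
from collections import defaultdict


def subgroup_report(records):
    """records: iterable of (group, true_class, predicted_class) tuples."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "pos": 0, "missed": 0})
    for group, y_true, y_pred in records:
        s = stats[group]
        s["n"] += 1
        s["correct"] += int(y_true == y_pred)
        if y_true != 0:                      # any dementia stage counts as positive
            s["pos"] += 1
            s["missed"] += int(y_pred == 0)  # dementia predicted as Non-Demented
    return {
        group: {
            "accuracy": s["correct"] / s["n"],
            "fnr": s["missed"] / s["pos"] if s["pos"] else None,
        }
        for group, s in stats.items()
    }


# Hypothetical records grouped by age band.
records = [
    ("60-70", 0, 0), ("60-70", 2, 2), ("60-70", 3, 3), ("60-70", 2, 0),
    ("80+", 0, 0), ("80+", 2, 2), ("80+", 3, 2), ("80+", 1, 1),
]
report = subgroup_report(records)
fnrs = [r["fnr"] for r in report.values() if r["fnr"] is not None]
fnr_gap = max(fnrs) - min(fnrs)  # disparity in missed-dementia rate
print(report, fnr_gap)
```

In this toy data both groups have the same accuracy but different false negative rates, which is why error-rate parity is worth checking separately from overall accuracy.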

Ethical Considerations

False Positives (Type I Error)
- Impact: Unnecessary patient anxiety, costly follow-up testing
- Current Rate: ~3% overall (varies by class)
- Clinical Consequence: Mild - requires confirmatory testing anyway
False Negatives (Type II Error)
- Impact: Missed early diagnosis, delayed treatment
- Current Rate: ~3% overall (varies by class)
- Clinical Consequence: SEVERE - early intervention critical for AD
- Mitigation: Tune threshold to favor sensitivity over specificity if used for screening
Automation Bias
- Risk: Clinicians may over-rely on model predictions
- Impact: Reduced clinical judgment, missed complex cases
- Mitigation: Emphasize model as decision support, not replacement
Data Privacy
- Risk: MRI scans are protected health information (PHI)
- Impact: HIPAA violations, patient privacy breaches
- Mitigation: De-identification, secure storage, limited access
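
The false-negative mitigation above (favoring sensitivity when screening) can be illustrated with a simple decision rule on the softmax output. This is a sketch under assumptions: it treats 1 minus the Non-Demented probability as a "possible dementia" score, the cases and thresholds are invented, and the function names are hypothetical.

```python
def screen_positive(probs, threshold=0.5):
    """Flag a scan when P(any dementia class) = 1 - P(Class 0) reaches the threshold."""
    return (1.0 - probs[0]) >= threshold


def sensitivity_specificity(cases, threshold):
    """cases: list of (softmax_probs, has_dementia) pairs."""
    tp = fn = tn = fp = 0
    for probs, has_dementia in cases:
        flagged = screen_positive(probs, threshold)
        if has_dementia:
            tp += int(flagged)
            fn += int(not flagged)
        else:
            fp += int(flagged)
            tn += int(not flagged)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical softmax outputs with ground truth (illustrative only).
cases = [
    ([0.10, 0.05, 0.60, 0.25], True),
    ([0.65, 0.20, 0.10, 0.05], True),   # would be missed by a plain argmax
    ([0.30, 0.10, 0.40, 0.20], True),
    ([0.90, 0.04, 0.04, 0.02], False),
    ([0.60, 0.20, 0.15, 0.05], False),
    ([0.95, 0.02, 0.02, 0.01], False),
]
sens_default, spec_default = sensitivity_specificity(cases, threshold=0.5)
sens_screen, spec_screen = sensitivity_specificity(cases, threshold=0.3)
print(sens_default, spec_default, sens_screen, spec_screen)
```

Lowering the threshold catches the borderline case at the cost of one extra false positive, which is the trade-off a screening deployment would need to set deliberately.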

Interpretability

Current Interpretability: ⚠️ Limited (Black Box)

What We Can Interpret:

Class Predictions
- Model outputs clear class labels (0-3)
- Softmax probabilities indicate relative confidence
- Limitation: Doesn't explain why
Confusion Patterns
- Most errors between Class 2 ↔ Class 3 (adjacent stages)
- Clinically plausible confusion (subtle differences)
- Insight: Model learns clinically relevant feature boundaries
Feature Learning (Abstract)
- Early layers detect edges, textures (brain structure)
- Middle layers detect anatomical patterns (ventricles, cortex)
- Late layers detect disease signatures (atrophy, enlargement)
- Limitation: Specific features not directly visible

What We CANNOT Interpret:

Spatial Attribution
- Which brain regions drive each prediction?
- Are decisions based on hippocampus, cortex, ventricles, or multiple areas?
- Missing: Saliency maps, attention weights, GradCAM visualizations
Decision Boundaries
- What specific features distinguish Class 2 from Class 3?
- How much atrophy is "enough" for a Moderate (Class 3) classification?
- Missing: Feature importance scores, counterfactual examples
Individual Predictions
- Why was this specific patient classified as Class 3?
- Missing: Case-by-case explanations

Recommended Interpretability Enhancements:

High Priority:

GradCAM/GradCAM++ - Highlight influential brain regions
Attention Mechanisms - Built-in interpretability through attention weights
Saliency Maps - Pixel-level importance visualization

Medium Priority:

Feature Visualization - Show what specific neurons detect
Layer-wise Relevance Propagation (LRP) - Trace predictions back to inputs
SHAP Values - Local feature importance

Low Priority (Research):

Concept Activation Vectors - High-level semantic concepts
Prototypical Examples - Show similar training cases
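
As a starting point for the saliency-style enhancements listed above, the sketch below shows occlusion sensitivity, a model-agnostic technique that is simpler than GradCAM and is named here as a stand-in, not as the planned method. It masks one patch at a time and records how much the target class score drops. `toy_predict` is a hypothetical placeholder for the model's softmax output, and the 4×4 image is synthetic.

```python
def occlusion_map(image, predict, target_class, patch=2, fill=0.0):
    """Return a grid of score drops; larger values mark more influential patches."""
    height, width = len(image), len(image[0])
    baseline = predict(image)[target_class]
    heat = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            occluded = [r[:] for r in image]  # copy so the input stays intact
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    occluded[y][x] = fill
            row.append(baseline - predict(occluded)[target_class])
        heat.append(row)
    return heat


def toy_predict(image):
    """Hypothetical 2-class scorer that only looks at the top-left 2x2 region."""
    score = sum(image[y][x] for y in range(2) for x in range(2)) / 4.0
    return [1.0 - score, score]


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, toy_predict, target_class=1)
print(heat)
```

With a real model, `predict` would wrap a forward pass and the patch size would be chosen relative to the 224×224 input; the cost is one forward pass per patch.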

Clinical Interpretability Requirements:

For clinical adoption, we need to provide:

✅ Prediction confidence scores (currently available via softmax)
❌ Brain region heatmaps (NOT IMPLEMENTED)
❌ Comparison to "typical" cases (NOT IMPLEMENTED)
❌ Uncertainty quantification (NOT IMPLEMENTED)
❌ Explanation of decision (NOT IMPLEMENTED)

Status: Model currently unsuitable for clinical deployment without interpretability enhancements.

Out-of-Scope Use Cases

Explicitly NOT INTENDED FOR:

❌ Standalone Clinical Diagnosis
- Model must be used as decision support ONLY
- Requires confirmation by qualified healthcare professionals
- Not a replacement for comprehensive clinical evaluation
❌ Predictive Prognosis
- Cannot predict future disease progression or survival
- Not trained on longitudinal outcome data
❌ Treatment Recommendation
- Does not suggest specific treatments or interventions
- Clinical management decisions require physician expertise
❌ Non-MRI Modalities
- Trained exclusively on structural MRI
- Will fail on CT, PET, ultrasound, or X-ray images
❌ Pediatric or Non-AD Dementia
- Trained on adult Alzheimer's disease only
- Not applicable to frontotemporal dementia, Lewy body dementia, vascular dementia, etc.
❌ Real-Time Critical Decisions
- Not validated for emergency or time-sensitive scenarios
- Requires proper quality control and validation
❌ Consumer/Direct-to-Patient Use
- Requires medical expertise to interpret
- Not designed for self-diagnosis

Caveats & Recommendations

Deployment Considerations

Regulatory Approval Required
- Not FDA-cleared or CE-marked
- Requires validation for medical device classification
- Must comply with local healthcare regulations
Clinical Validation Needed
- External validation on independent datasets
- Prospective clinical trial to assess real-world performance
- Comparison to radiologist performance
Quality Control
- Implement input validation (image quality checks)
- Monitor prediction drift over time
- Regular re-validation as new data emerges
Human Oversight Mandatory
- All predictions require physician review
- System should flag uncertain predictions
- Maintain audit trail of predictions vs. final diagnoses
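
Monitoring prediction drift, as recommended above, can begin with something as simple as comparing the predicted class mix over a recent window against a reference period. The sketch below uses total variation distance; the alert threshold of 0.15 and the example counts are placeholders, and the function names are hypothetical.

```python
from collections import Counter


def class_distribution(preds, num_classes=4):
    """Fraction of predictions assigned to each class."""
    counts = Counter(preds)
    total = len(preds)
    return [counts.get(k, 0) / total for k in range(num_classes)]


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def drift_alert(reference_preds, recent_preds, threshold=0.15):
    """Return (distance, alert) comparing recent predictions to the reference."""
    tv = total_variation(class_distribution(reference_preds),
                         class_distribution(recent_preds))
    return tv, tv > threshold


# Hypothetical predicted labels from a reference period and a recent window.
reference = [0] * 30 + [1] * 5 + [2] * 50 + [3] * 15
recent = [0] * 10 + [1] * 5 + [2] * 45 + [3] * 40
tv, alert = drift_alert(reference, recent)
print(round(tv, 3), alert)
```

A shift in predicted class mix does not prove the model is wrong (the patient population may have changed), but it is a cheap trigger for re-validation against confirmed diagnoses.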

Safe Use Guidelines

DO:

✅ Use as screening tool to prioritize cases
✅ Validate predictions with clinical assessment
✅ Monitor performance on your local population
✅ Retrain periodically with new data
✅ Document all model decisions

DON'T:

❌ Use without physician oversight
❌ Apply to populations not represented in training data
❌ Ignore model uncertainty or low confidence predictions
❌ Deploy without local validation
❌ Use for legal or financial decisions

Model Versioning & Updates

Current Version: 1.0 (Baseline)

Release Date: December 20, 2025
Training Data Version: Kaggle MRI Alzheimer's Dataset (Dec 2025)
Performance: 97.27% validation accuracy

Planned Updates:

Version 1.1 (Proposed - Q1 2026)

Implement GradCAM interpretability
Add uncertainty quantification
Address Class 1 imbalance with synthetic augmentation

Version 2.0 (Proposed - Q2 2026)

Multi-site validation
Ensemble model for improved robustness
Demographic fairness audit and mitigation

Contact & Feedback

Model Developers: [Your Name/Team]
Institution/Organization: [Your Organization]
Email: [Contact Email]
Issues & Feedback: [GitHub Issues / Email]

Reporting Errors or Concerns

If you encounter:

Unexpected predictions or errors
Bias or fairness issues
Safety concerns
Technical bugs

Please contact us immediately with:

Anonymized case details
Input image characteristics
Expected vs. actual output
Your use context

Acknowledgments

AI for Alzheimer's Hackathon organizers
Dataset providers and contributors
Open-source PyTorch and torchvision communities
Medical imaging research community

License & Terms of Use

License: [To Be Determined - specify open-source or proprietary]

Terms:

Research and educational use permitted
Clinical use requires additional validation and regulatory approval
Commercial use requires separate licensing agreement
No warranties provided - use at your own risk
Users assume all liability for clinical decisions

Changelog

Version 1.0 (December 20, 2025)

Initial release
ResNet18 baseline model
97.27% validation accuracy
4-class Alzheimer's classification
Known limitations documented

This model card follows guidelines from Mitchell et al. (2019) "Model Cards for Model Reporting" and the EU AI Act technical documentation requirements.

Last Updated: December 20, 2025
Next Review: March 20, 2026 (quarterly review)

Built With

numpy
pandas
pil
scikit-learn
seaborn
torch
torchvision

Updates

Nikita Bond started this project — Dec 20, 2025 07:05 AM EST
