Model Card: Alzheimer's Disease MRI Classification System
- Model Name: AlzheimerNet-ResNet18
- Version: 1.0
- Last Updated: December 20, 2025
- Model Type: Convolutional Neural Network (Transfer Learning)
- Task: Multi-class Image Classification (Medical Imaging)
Model Details
Overview
A deep learning model for automated classification of Alzheimer's disease stages from brain MRI scans. Based on ResNet18 architecture with transfer learning from ImageNet, adapted for grayscale medical imaging.
Intended Use
- Primary Use: Automated screening and classification support for Alzheimer's disease diagnosis
- Target Users: Healthcare professionals, radiologists, neurologists, clinical researchers
- Use Context: Clinical decision support tool for analyzing structural brain MRI scans
- NOT INTENDED FOR: Standalone diagnosis without physician oversight, replacement of clinical judgment, or use by non-medical professionals
Model Architecture
- Base Architecture: ResNet18 (He et al., 2016)
- Pre-training: ImageNet (natural images)
- Modifications:
  - Input adapted from RGB (3 channels) to grayscale (1 channel)
  - Output layer modified for 4-class classification
  - All layers fine-tuned on medical data
- Parameters: ~11 million trainable parameters
- Input: 224×224 grayscale MRI images, normalized to [-1, 1]
- Output: Probability distribution over 4 classes (softmax)
Training Details
- Training Data: 5,120 labeled brain MRI scans
- Validation Data: 1,280 samples (20% holdout)
- Test Data: 1,280 unlabeled samples
- Data Augmentation: Rotation (±10°), horizontal flip, brightness/contrast jitter
- Optimizer: Adam (learning rate: 1e-4, weight decay: 1e-5)
- Training Duration: 20 epochs (~90 minutes on Tesla T4 GPU)
- Batch Size: 32
- Loss Function: Cross-entropy
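A single optimisation step with the hyperparameters listed above might look like the sketch below. The tiny linear model is only a stand-in so the example runs quickly, and the random tensors are placeholders for a real batch of 32 scans; neither comes from the project.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model (assumption): any module mapping (N, 1, 224, 224) to 4 logits works here.
model = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 4))

# Hyperparameters taken from the Training Details list.
criterion = nn.CrossEntropyLoss()  # expects raw logits, not softmax output
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)


def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """Run one forward/backward pass and return the batch loss."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


# Placeholder batch: 32 images in [-1, 1] with random labels in {0, 1, 2, 3}.
images = torch.rand(32, 1, 224, 224) * 2 - 1
labels = torch.randint(0, 4, (32,))
loss_value = train_step(images, labels)
```

Because `nn.CrossEntropyLoss` applies log-softmax internally, the softmax mentioned under Model Architecture would only be applied at inference time in a setup like this.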
Performance Metrics
| Metric | Value |
|---|---|
| Validation Accuracy | 97.27% |
| Training Accuracy | 99.19% |
| Precision (weighted avg) | 97% |
| Recall (weighted avg) | 97% |
| F1-Score (weighted avg) | 97% |
Per-Class Performance:
- Class 0 (Non-Demented): 98% F1-score
- Class 1 (Very Mild): 100% F1-score
- Class 2 (Mild): 98% F1-score
- Class 3 (Moderate): 96% F1-score
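For readers unfamiliar with support-weighted averages, the sketch below shows how per-class and weighted metrics are derived from a confusion matrix. The matrix values are invented for illustration and are not the model's actual results.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                          # true samples of class k
        predicted = sum(cm[i][k] for i in range(n))   # samples predicted as class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({"precision": precision, "recall": recall, "f1": f1, "support": support})
    return rows


def weighted_average(rows, key):
    """Average a metric across classes, weighting each class by its support."""
    total = sum(r["support"] for r in rows)
    return sum(r[key] * r["support"] for r in rows) / total


# Hypothetical 4-class confusion matrix (rows: true class, columns: predicted class).
example_cm = [
    [50, 0, 2, 0],
    [0, 10, 0, 0],
    [1, 0, 97, 2],
    [0, 0, 3, 35],
]
rows = per_class_metrics(example_cm)
weighted_recall = weighted_average(rows, "recall")
weighted_f1 = weighted_average(rows, "f1")
```

Support-weighted recall always equals overall accuracy, which is why the weighted figures in the table track the validation accuracy so closely and say little about small classes.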
Limitations
Data Limitations
Limited Sample Size
- Training set of 5,120 samples is modest for deep learning
- May not capture full variability of real-world clinical presentations
- Impact: Reduced generalization to rare or unusual cases
Class Imbalance
- Class 1 (Very Mild) severely underrepresented (only 10 validation samples)
- Class 2 (Mild) dominates dataset (~50% of samples)
- Impact: Model may be less reliable for detecting very mild cases; perfect validation performance (100%) on Class 1 may not generalize
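To make the caveat about Class 1 concrete, a Wilson score interval shows how little 10 samples can establish. The sketch treats the result as 10 correct out of 10, which is a simplification (F1 is not a plain proportion), and uses the conventional z = 1.96 for a 95% interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# 10 of 10 correct: the lower bound is only about 0.72.
low, high = wilson_interval(10, 10)
```

Under these assumptions, a perfect score on 10 samples is statistically compatible with a true rate anywhere from roughly 72% to 100%.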
Single Data Source
- All training data appears to come from similar scanner/protocol
- No diversity in acquisition parameters, scanner manufacturers, or imaging protocols
- Impact: May not generalize well to different MRI scanners or imaging centers
Temporal Snapshot
- Uses single time-point imaging only
- No longitudinal progression data
- Impact: Cannot predict disease trajectory or rate of decline
Model Limitations
Black Box Nature
- Deep learning model lacks full interpretability
- Difficult to explain specific predictions to clinicians
- Impact: May reduce trust and clinical adoption
Overfitting Risk
- 99.19% training accuracy suggests potential memorization
- Train-validation gap is small (~1.9 percentage points: 99.19% vs. 97.27%), which is reassuring, but vigilance is still needed
- Impact: Performance may degrade on truly novel cases
Discrete Class Outputs
- Provides discrete class predictions rather than continuous severity scores
- Real disease progression is continuous
- Impact: May miss subtle transitions between stages
No Uncertainty Quantification
- Model exposes raw softmax probabilities only; it provides no calibrated confidence intervals or prediction uncertainty
- All predictions treated equally regardless of confidence
- Impact: Cannot flag ambiguous cases for manual review
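One lightweight way to start flagging ambiguous cases, short of full uncertainty quantification, is to look at the top softmax probability and the entropy of the distribution. The sketch below assumes access to the four raw logits; the thresholds `min_confidence` and `max_entropy` are illustrative values that would need tuning, and softmax outputs are not calibrated probabilities.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]; 1 means a uniform distribution."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def needs_review(logits, min_confidence: float = 0.80, max_entropy: float = 0.50) -> bool:
    """Flag a prediction for manual review (thresholds are hypothetical)."""
    probs = softmax(logits)
    return max(probs) < min_confidence or normalized_entropy(probs) > max_entropy


confident_case = needs_review([8.0, 0.0, 0.0, 0.0])   # one class clearly dominates
ambiguous_case = needs_review([1.0, 0.0, 1.1, 0.9])   # three classes nearly tied
```

A rule like this only ranks predictions by how peaked the output is; techniques such as temperature scaling or ensembles would be needed before the scores could be read as probabilities.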
Technical Constraints
Computational Requirements
- Requires GPU for inference (CPU too slow for clinical deployment)
- Model size (~45MB) manageable but not edge-deployable
- Impact: Limits deployment to well-resourced facilities
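The quoted figures (~11 million parameters, ~45MB) can be checked with a little arithmetic over the standard ResNet18 layer shapes. The sketch assumes the stock torchvision layout with a 1-channel stem and a 4-class head, and float32 storage at 4 bytes per parameter.

```python
def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a bias-free convolution, as used throughout ResNet."""
    return c_in * c_out * kernel * kernel


def bn_params(channels: int) -> int:
    """BatchNorm scale and shift."""
    return 2 * channels


def basic_block_params(c_in: int, c_out: int) -> int:
    """Two 3x3 convs with BatchNorm, plus a 1x1 projection when channels change."""
    total = conv_params(c_in, c_out, 3) + bn_params(c_out)
    total += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if c_in != c_out:
        total += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return total


def resnet18_params(in_channels: int, num_classes: int) -> int:
    """Trainable parameter count for ResNet18 with the given stem and head."""
    total = conv_params(in_channels, 64, 7) + bn_params(64)
    previous = 64
    for channels in (64, 128, 256, 512):  # four stages, two blocks each
        total += basic_block_params(previous, channels)
        total += basic_block_params(channels, channels)
        previous = channels
    total += 512 * num_classes + num_classes  # fully connected head
    return total


adapted = resnet18_params(in_channels=1, num_classes=4)
size_mb = adapted * 4 / 1e6  # float32 weights, decimal megabytes
```

This gives about 11.17 million parameters and roughly 44.7 MB of weights, consistent with the figures stated in this card.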
Input Format Constraints
- Requires specific image preprocessing (224×224 resize, normalization)
- Sensitive to image quality and artifacts
- Impact: May fail on low-quality or corrupted scans
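A basic input check plus normalization could look like the sketch below. It assumes 8-bit pixel values already resized to 224×224 and reaches [-1, 1] by using mean 0.5 and standard deviation 0.5; the card states the target range but not the exact transform, so that part is an assumption, as are the function names.

```python
def normalize_pixel(value: int) -> float:
    """Map an 8-bit intensity (0..255) to [-1, 1] using mean 0.5 and std 0.5."""
    return (value / 255.0 - 0.5) / 0.5


def validate_and_normalize(image, expected_size: int = 224):
    """Check a resized grayscale image (list of pixel rows) and normalize it."""
    if len(image) != expected_size or any(len(row) != expected_size for row in image):
        raise ValueError(f"expected a {expected_size}x{expected_size} image")
    flat = [pixel for row in image for pixel in row]
    if min(flat) < 0 or max(flat) > 255:
        raise ValueError("pixel values must lie in 0..255")
    if min(flat) == max(flat):
        # A constant image carries no anatomy; likely a blank or corrupted scan.
        raise ValueError("image has no contrast")
    return [[normalize_pixel(pixel) for pixel in row] for row in image]


# Synthetic gradient image standing in for a real scan.
fake_scan = [[(r + c) % 256 for c in range(224)] for r in range(224)]
normalized = validate_and_normalize(fake_scan)
```

Rejecting malformed inputs up front is cheaper than letting the network produce a confident-looking prediction on a blank or truncated image.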
Single Modality
- Uses structural MRI only
- Ignores functional MRI, PET, CSF biomarkers, genetics, cognitive scores
- Impact: Misses complementary diagnostic information
Bias & Fairness Considerations
Potential Sources of Bias
Demographic Bias (Unknown)
- Issue: Dataset demographics (age, sex, race, ethnicity, socioeconomic status) not documented
- Risk: Model may perform differently across demographic groups
- Example: If training data over-represents Caucasian populations, may underperform on other ethnicities
- Mitigation Needed: Demographic analysis of training data and subgroup performance evaluation
Selection Bias
- Issue: Dataset may not represent general population (e.g., clinical trial participants vs. real-world patients)
- Risk: Higher prevalence of severe cases or younger patients in research datasets
- Impact: May misclassify community-dwelling, less severe cases
- Mitigation: Validate on diverse, real-world clinical populations
Scanner & Protocol Bias
- Issue: Training data likely from limited scanner types/imaging protocols
- Risk: Performance degradation on scans from different equipment or settings
- Impact: Model may favor specific MRI characteristics over disease features
- Mitigation: Multi-site validation with heterogeneous scanners
Labeling Bias
- Issue: Ground truth labels based on clinical diagnosis, which has inherent subjectivity
- Risk: Model learns clinician biases rather than objective disease features
- Impact: May perpetuate diagnostic disparities
- Mitigation: Multiple expert consensus labels, neuropathological confirmation
Socioeconomic Bias
- Issue: Access to MRI scans correlates with socioeconomic status
- Risk: Underrepresentation of lower-income populations in training data
- Impact: May not generalize to underserved communities
- Mitigation: Diverse data collection from community health centers
Fairness Metrics
Current Status: ⚠️ Not Evaluated
Required Analysis:
- [ ] Stratified performance by age groups (60-69, 70-79, 80+)
- [ ] Stratified performance by biological sex (if known)
- [ ] Stratified performance by race/ethnicity (if known)
- [ ] Error rate disparity across subgroups
- [ ] False positive/negative rate parity
- [ ] Equal opportunity metrics
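Once demographic labels are available, the checklist above could begin with per-subgroup error rates, as in this sketch. The records are invented, and collapsing the four classes into "demented" (class > 0) versus "non-demented" (class 0) to obtain false negative and false positive rates is an assumption made for illustration.

```python
from collections import defaultdict


def subgroup_rates(records):
    """records: iterable of (group, true_class, predicted_class) tuples."""
    counts = defaultdict(lambda: {"n": 0, "errors": 0, "pos": 0, "fn": 0, "neg": 0, "fp": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        c["n"] += 1
        c["errors"] += int(y_true != y_pred)
        if y_true > 0:                      # any dementia stage counts as positive
            c["pos"] += 1
            c["fn"] += int(y_pred == 0)     # missed: predicted Non-Demented
        else:
            c["neg"] += 1
            c["fp"] += int(y_pred > 0)      # false alarm on a Non-Demented scan
    return {
        group: {
            "error_rate": c["errors"] / c["n"],
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
        }
        for group, c in counts.items()
    }


# Hypothetical evaluation records, grouped by age band.
records = [
    ("60-69", 0, 0), ("60-69", 2, 2), ("60-69", 3, 2), ("60-69", 1, 0),
    ("80+", 0, 2), ("80+", 2, 2), ("80+", 3, 3), ("80+", 0, 0),
]
rates = subgroup_rates(records)
fnr_gap = abs(rates["60-69"]["fnr"] - rates["80+"]["fnr"])
```

Comparing `fnr` across groups corresponds to the equal opportunity item in the checklist, while comparing both `fnr` and `fpr` addresses false positive/negative rate parity.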
Recommendation: Before clinical deployment, conduct comprehensive fairness audit with demographic-stratified evaluation.
Ethical Considerations
False Positives (Type I Error)
- Impact: Unnecessary patient anxiety, costly follow-up testing
- Current Rate: Not measured separately; the overall misclassification rate is ~2.7% (100% - 97.27% validation accuracy) and varies by class
- Clinical Consequence: Mild - requires confirmatory testing anyway
False Negatives (Type II Error)
- Impact: Missed early diagnosis, delayed treatment
- Current Rate: Not measured separately; the overall misclassification rate is ~2.7% (100% - 97.27% validation accuracy) and varies by class
- Clinical Consequence: SEVERE - early intervention critical for AD
- Mitigation: Tune threshold to favor sensitivity over specificity if used for screening
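For a 4-class model, "tuning the threshold" could mean replacing argmax with a screening rule that flags a scan unless the model is highly confident it is Non-Demented. The sketch below illustrates that idea on invented probabilities; the rule itself and the threshold values are assumptions, not part of the current system.

```python
def screen_positive(probs, threshold: float) -> bool:
    """Flag the scan unless P(Non-Demented), i.e. probs[0], reaches the threshold."""
    return probs[0] < threshold


def sensitivity_specificity(samples, threshold: float):
    """samples: list of (class_probabilities, true_class); class 0 is Non-Demented."""
    tp = fn = tn = fp = 0
    for probs, true_class in samples:
        flagged = screen_positive(probs, threshold)
        if true_class > 0:
            tp += int(flagged)
            fn += int(not flagged)
        else:
            fp += int(flagged)
            tn += int(not flagged)
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    return sensitivity, specificity


# Hypothetical validation outputs: (softmax probabilities, true class).
samples = [
    ([0.60, 0.25, 0.10, 0.05], 1),  # argmax would call this Non-Demented
    ([0.95, 0.03, 0.01, 0.01], 0),
    ([0.85, 0.10, 0.03, 0.02], 0),
    ([0.10, 0.10, 0.60, 0.20], 2),
]
lenient = sensitivity_specificity(samples, threshold=0.5)
strict = sensitivity_specificity(samples, threshold=0.9)
```

On this toy data, raising the threshold from 0.5 to 0.9 catches the missed Very Mild case at the cost of one extra false alarm, which is the trade-off a screening deployment would have to quantify on real data.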
Automation Bias
- Risk: Clinicians may over-rely on model predictions
- Impact: Reduced clinical judgment, missed complex cases
- Mitigation: Emphasize model as decision support, not replacement
Data Privacy
- Risk: MRI scans are protected health information (PHI)
- Impact: HIPAA violations, patient privacy breaches
- Mitigation: De-identification, secure storage, limited access
Interpretability
Current Interpretability: ⚠️ Limited (Black Box)
What We Can Interpret:
Class Predictions
- Model outputs clear class labels (0-3)
- Softmax probabilities indicate relative confidence
- Limitation: Doesn't explain why
Confusion Patterns
- Most errors between Class 2 ↔ Class 3 (adjacent stages)
- Clinically plausible confusion (subtle differences)
- Insight: Model learns clinically relevant feature boundaries
Feature Learning (Abstract)
- Early layers likely detect edges and textures (brain structure)
- Middle layers likely detect anatomical patterns (ventricles, cortex)
- Late layers likely detect disease signatures (atrophy, enlargement)
- Limitation: Specific features not directly visible
What We CANNOT Interpret:
Spatial Attribution
- Which brain regions drive each prediction?
- Are decisions based on hippocampus, cortex, ventricles, or multiple areas?
- Missing: Saliency maps, attention weights, GradCAM visualizations
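The core arithmetic behind Grad-CAM is small enough to show directly. The sketch assumes the feature maps of the last convolutional block and the gradients of the chosen class score with respect to them have already been extracted (for example with forward and backward hooks); both inputs here are toy values, not model outputs.

```python
def grad_cam(activations, gradients):
    """Heatmap = ReLU(sum_k alpha_k * A_k), scaled to [0, 1].

    activations, gradients: nested lists shaped [channels][height][width].
    """
    height, width = len(activations[0]), len(activations[0][0])
    # alpha_k: gradient of the class score averaged over each feature map
    alphas = [sum(sum(row) for row in grad_map) / (height * width) for grad_map in gradients]

    cam = [[0.0] * width for _ in range(height)]
    for alpha, feature_map in zip(alphas, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * feature_map[i][j]

    # Keep only regions that push the class score up, then scale by the peak.
    cam = [[max(0.0, value) for value in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: two 2x2 feature maps with opposite-signed gradients.
toy_activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(toy_activations, toy_gradients)
```

In practice the resulting low-resolution map would be upsampled to 224×224 and overlaid on the scan so a clinician can see which regions influenced the prediction.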
Decision Boundaries
- What specific features distinguish Class 2 from Class 3?
- How much atrophy is "enough" for a Moderate (Class 3) classification?
- Missing: Feature importance scores, counterfactual examples
Individual Predictions
- Why was this specific patient classified as Class 3?
- Missing: Case-by-case explanations
Recommended Interpretability Enhancements:
High Priority:
- GradCAM/GradCAM++ - Highlight influential brain regions
- Attention Mechanisms - Built-in interpretability through attention weights
- Saliency Maps - Pixel-level importance visualization
Medium Priority:
- Feature Visualization - Show what specific neurons detect
- Layer-wise Relevance Propagation (LRP) - Trace predictions back to inputs
- SHAP Values - Local feature importance
Low Priority (Research):
- Concept Activation Vectors - High-level semantic concepts
- Prototypical Examples - Show similar training cases
Clinical Interpretability Requirements:
For clinical adoption, we need to provide:
- ✅ Prediction confidence scores (currently available via softmax)
- ❌ Brain region heatmaps (NOT IMPLEMENTED)
- ❌ Comparison to "typical" cases (NOT IMPLEMENTED)
- ❌ Uncertainty quantification (NOT IMPLEMENTED)
- ❌ Explanation of decision (NOT IMPLEMENTED)
Status: Model currently unsuitable for clinical deployment without interpretability enhancements.
Out-of-Scope Use Cases
Explicitly NOT INTENDED FOR:
❌ Standalone Clinical Diagnosis
- Model must be used as decision support ONLY
- Requires confirmation by qualified healthcare professionals
- Not a replacement for comprehensive clinical evaluation
❌ Predictive Prognosis
- Cannot predict future disease progression or survival
- Not trained on longitudinal outcome data
❌ Treatment Recommendation
- Does not suggest specific treatments or interventions
- Clinical management decisions require physician expertise
❌ Non-MRI Modalities
- Trained exclusively on structural MRI
- Will fail on CT, PET, ultrasound, or X-ray images
❌ Pediatric or Non-AD Dementia
- Trained on adult Alzheimer's disease only
- Not applicable to frontotemporal dementia, Lewy body dementia, vascular dementia, etc.
❌ Real-Time Critical Decisions
- Not validated for emergency or time-sensitive scenarios
- Requires proper quality control and validation
❌ Consumer/Direct-to-Patient Use
- Requires medical expertise to interpret
- Not designed for self-diagnosis
Caveats & Recommendations
Deployment Considerations
Regulatory Approval Required
- Not FDA-cleared or CE-marked
- Requires validation for medical device classification
- Must comply with local healthcare regulations
Clinical Validation Needed
- External validation on independent datasets
- Prospective clinical trial to assess real-world performance
- Comparison to radiologist performance
Quality Control
- Implement input validation (image quality checks)
- Monitor prediction drift over time
- Regular re-validation as new data emerges
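One simple drift signal is the distance between the class mix the model predicted at validation time and the mix it predicts in production. The sketch uses total variation distance; the counts and the alert threshold of 0.10 are invented for illustration.

```python
def class_distribution(counts):
    """Convert per-class prediction counts into proportions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no predictions to summarise")
    return [count / total for count in counts]


def total_variation(baseline_counts, recent_counts) -> float:
    """Total variation distance between two class distributions (0 = identical)."""
    p = class_distribution(baseline_counts)
    q = class_distribution(recent_counts)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical predicted-class counts for classes 0..3.
baseline = [25, 5, 50, 20]   # e.g. at validation time
recent = [50, 5, 25, 20]     # e.g. last month's predictions
drift = total_variation(baseline, recent)
drift_alert = drift > 0.10   # illustrative threshold, to be tuned locally
```

A shift in predicted class mix does not prove the model is wrong, since the patient population may have changed, but it is a cheap trigger for the re-validation step listed above.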
Human Oversight Mandatory
- All predictions require physician review
- System should flag uncertain predictions
- Maintain audit trail of predictions vs. final diagnoses
Safe Use Guidelines
DO:
- ✅ Use as screening tool to prioritize cases
- ✅ Validate predictions with clinical assessment
- ✅ Monitor performance on your local population
- ✅ Retrain periodically with new data
- ✅ Document all model decisions
DON'T:
- ❌ Use without physician oversight
- ❌ Apply to populations not represented in training data
- ❌ Ignore model uncertainty or low confidence predictions
- ❌ Deploy without local validation
- ❌ Use for legal or financial decisions
Model Versioning & Updates
Current Version: 1.0 (Baseline)
- Release Date: December 20, 2025
- Training Data Version: Kaggle MRI Alzheimer's Dataset (Dec 2025)
- Performance: 97.27% validation accuracy
Planned Updates:
Version 1.1 (Proposed - Q1 2026)
- Implement GradCAM interpretability
- Add uncertainty quantification
- Address Class 1 imbalance with synthetic augmentation
Version 2.0 (Proposed - Q2 2026)
- Multi-site validation
- Ensemble model for improved robustness
- Demographic fairness audit and mitigation
Contact & Feedback
- Model Developers: [Your Name/Team]
- Institution/Organization: [Your Organization]
- Email: [Contact Email]
- Issues & Feedback: [GitHub Issues / Email]
Reporting Errors or Concerns
If you encounter:
- Unexpected predictions or errors
- Bias or fairness issues
- Safety concerns
- Technical bugs
Please contact us immediately with:
- Anonymized case details
- Input image characteristics
- Expected vs. actual output
- Your use context
Acknowledgments
- AI for Alzheimer's Hackathon organizers
- Dataset providers and contributors
- Open-source PyTorch and torchvision communities
- Medical imaging research community
License & Terms of Use
License: [To Be Determined - specify open-source or proprietary]
Terms:
- Research and educational use permitted
- Clinical use requires additional validation and regulatory approval
- Commercial use requires separate licensing agreement
- No warranties provided - use at your own risk
- Users assume all liability for clinical decisions
Changelog
Version 1.0 (December 20, 2025)
- Initial release
- ResNet18 baseline model
- 97.27% validation accuracy
- 4-class Alzheimer's classification
- Known limitations documented
This model card follows guidelines from Mitchell et al. (2019) "Model Cards for Model Reporting" and the EU AI Act technical documentation requirements.
Last Updated: December 20, 2025
Next Review: March 20, 2026 (quarterly review)
Built With
- numpy
- pandas
- pil
- scikit-learn
- seaborn
- torch
- torchvision