Multimodal Precision for Skin Disease Detection
About the Project
Skin rashes and lesions are notoriously difficult to diagnose. Many conditions, including dermatitis, psoriasis, melanoma, and systemic lupus erythematosus (SLE), present with overlapping visual features that are hard to distinguish, even for experienced clinicians.
To improve diagnostic precision, we pursued a novel approach that pairs visual symptoms with molecular signals from the blood.
Rather than relying on images alone, we built a multimodal AI system that integrates skin photographs with blood-based transcriptomic measurements to predict whether a patient has a skin-related disease.
Proof-of-Concept Website: click here!
How We Built It
Image Understanding with Foundation Models
We leveraged a powerful pretrained dermatology vision-language model, MONET, built on a CLIP ViT-L/14 architecture. MONET was trained on over 105,000 dermatological images paired with medical text descriptions, enabling it to:
- Recognize dermatologic concepts with dermatologist-level accuracy
- Provide interpretable visual representations
- Maintain transparency throughout the AI pipeline
This provided a strong visual encoder capable of extracting meaningful disease-related features from skin images.
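As a sketch of this step, the snippet below extracts an image embedding with a CLIP ViT-L/14 encoder via Hugging Face transformers. The `openai/clip-vit-large-patch14` checkpoint is a stand-in for the actual MONET weights, and `lesion.jpg` is a placeholder path:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP ViT-L/14 as a stand-in; in our pipeline the MONET
# weights would be loaded here instead.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

image = Image.open("lesion.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    embedding = model.get_image_features(**inputs)  # shape: (1, 768)
```

These embeddings become the visual input to the fusion model described below.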
Transcriptomics Integration
To complement the image data, we incorporated blood gene expression profiles.
Rather than using the full transcriptome, which is high-dimensional and often noisy, we:
- Performed multicohort meta-analysis across multiple skin disease datasets
- Identified differentially expressed genes associated with disease
- Restricted the transcriptomic input to biologically relevant genes
Mathematically, instead of learning from:
\( X \in \mathbb{R}^{n \times G} \)
where \(G\) represents the full set of measured genes, we trained on:
\( X_{\text{filtered}} \in \mathbb{R}^{n \times g}, \quad g \ll G \)
This improved both model efficiency and biological signal quality.
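A minimal sketch of the filtering step, assuming the expression matrix lives in a pandas DataFrame with samples as rows and gene symbols as columns; the gene list here is illustrative, not our actual meta-analysis signature:

```python
import pandas as pd

# Placeholder signature; in practice these come from the multicohort
# differential-expression meta-analysis.
signature_genes = ["IFI27", "OAS1", "STAT1"]

X = pd.read_csv("expression.csv", index_col=0)  # n samples x G genes

# Keep only signature genes that were measured in this cohort.
present = [g for g in signature_genes if g in X.columns]
X_filtered = X[present]  # n x g, with g << G
print(X.shape, "->", X_filtered.shape)
```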
Multimodal Fusion
We integrated:
- Visual embeddings derived from MONET
- Gene expression features from blood transcriptomics
into a joint predictive model.
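A minimal PyTorch sketch of this late-fusion design; the dimensions and layer sizes are illustrative assumptions, not our tuned configuration:

```python
import torch
import torch.nn as nn

class VisionOmicsFusion(nn.Module):
    """Late fusion: project each modality, concatenate, classify.
    img_dim matches the CLIP ViT-L/14 embedding; gene_dim is the
    size of the filtered gene signature (both assumed here)."""
    def __init__(self, img_dim=768, gene_dim=50, hidden=256, n_classes=2):
        super().__init__()
        self.img_proj = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.gene_proj = nn.Sequential(nn.Linear(gene_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, img_emb, gene_expr):
        z = torch.cat([self.img_proj(img_emb), self.gene_proj(gene_expr)], dim=-1)
        return self.head(z)

logits = VisionOmicsFusion()(torch.randn(4, 768), torch.randn(4, 50))
```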
We then deployed the system through a user-friendly web interface that allows clinicians to upload images, input transcriptomic data, and receive disease predictions.
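The snippet below sketches that upload-and-predict flow with Gradio; our deployed site may use a different stack, and `predict` is a placeholder that would call the MONET encoder and fusion model:

```python
import gradio as gr

def predict(image, expression_csv):
    # Placeholder: encode the image, load and filter the expression
    # file, then run the fusion model. Scores below are dummy values.
    return {"skin disease": 0.7, "healthy": 0.3}

demo = gr.Interface(
    fn=predict,
    inputs=[gr.Image(type="pil", label="Skin photograph"),
            gr.File(label="Transcriptomic CSV")],
    outputs=gr.Label(label="Prediction"),
)
demo.launch()
```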
Challenges We Faced
Transcriptomic Data Quality
One of the primary challenges was identifying usable public gene expression datasets.
Many microarray datasets were:
- Poorly normalized
- Inconsistent across samples
- Not directly comparable across studies
Reprocessing raw data from scratch exceeded the time constraints of TreeHacks.
To address this, we prioritized datasets that were already properly processed and validated, enabling rapid integration while maintaining data quality.
Additionally, because our imaging and transcriptomic data were not paired at the patient level, we carefully designed the neural network architecture to effectively learn multimodal representations despite this limitation.
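One common pattern for training with unpaired modalities is modality masking, in which the features of a missing modality are zeroed so that paired and unpaired samples can pass through the same fusion network. The sketch below illustrates the idea, simplified relative to our actual architecture:

```python
import torch

def mask_missing(img_emb, gene_expr, has_img, has_gene):
    """Zero whichever modality a sample lacks, so paired and unpaired
    samples share one fusion network. has_img / has_gene are boolean
    tensors of shape (batch,)."""
    img_emb = img_emb * has_img.float().unsqueeze(-1)
    gene_expr = gene_expr * has_gene.float().unsqueeze(-1)
    return img_emb, gene_expr

# Example: sample 0 has only an image, sample 1 only expression data.
img, gene = torch.randn(2, 768), torch.randn(2, 50)
has_img, has_gene = torch.tensor([True, False]), torch.tensor([False, True])
img, gene = mask_missing(img, gene, has_img, has_gene)
```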
What We Learned
- Each team member learned a new data-handling approach during this project. Multimodal datasets are still novel in clinical applications, and mRNA-based gene expression data in particular is biologically rich yet undervalued in diagnostics.
- Multimodal models capture complementary biological signals more effectively than single-modality approaches
- Data preprocessing quality is critical to downstream model performance
- Unpaired data samples still hold predictive power, though each modality's features must then be characterized separately rather than jointly.
- Model architecture is critical for extracting meaningful topological features. We drew inspiration from the architecture of a recently described EHR-omics prediction model [Matarso, Nat Mach Intell, 2024], adapting it for vision-omics.
- Omics data remains highly valuable. We believe future diagnostics that integrate it will identify the correct disease condition more reliably.
Impact and Vision
Most existing skin disease classifiers focus primarily on melanoma or skin cancer and rely exclusively on images. Rare diseases and autoimmune conditions remain underrepresented and difficult to diagnose using single data modalities.
For many patients, this challenge translates into diagnostic timelines spanning several years.
Our integrative approach:
- Combines visual phenotypes with molecular biomarkers
- Improves diagnostic precision in ambiguous cases
- Provides a scalable framework for future multimodal medical AI systems
We envision this platform as a clinical decision-support tool that enables more accurate, faster diagnoses and advances precision medicine.
What’s Next for LUNA
Our prototype demonstrates the promise of multimodal learning, but substantial opportunities remain to expand both the dataset and modeling capabilities.
First, we will incorporate transcriptomic datasets that were excluded due to poor normalization. Since raw expression files are mandatory uploads in public repositories, we can retrieve and properly renormalize these data to ensure consistency across cohorts.
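For instance, quantile normalization is a standard way to make expression distributions comparable across samples; a minimal NumPy sketch (ignoring tie handling) follows:

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a genes-by-samples matrix: every sample
    (column) is mapped onto a shared reference distribution, the mean
    of the sorted columns. Ties are handled arbitrarily in this sketch."""
    ranks = X.argsort(axis=0).argsort(axis=0)    # rank of each value per column
    reference = np.sort(X, axis=0).mean(axis=1)  # mean expression at each rank
    return reference[ranks]
```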
Second, we will expand beyond microarrays, which measure gene expression via fluorescent probes, by integrating RNA sequencing (RNA-seq) data. RNA-seq directly quantifies mRNA transcript abundance and is increasingly prevalent in public datasets, enabling improved biological resolution and statistical power.
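A common first step for harmonizing RNA-seq counts before integration is log-CPM normalization; a minimal sketch, with the normalization choice itself being an assumption rather than a fixed part of our pipeline:

```python
import numpy as np

def log_cpm(counts):
    """Convert a samples-by-genes matrix of raw RNA-seq counts to
    log2 counts-per-million, a common scale for cross-cohort work."""
    cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
    return np.log2(cpm + 1.0)

log_expr = log_cpm(np.random.poisson(5.0, size=(4, 1000)).astype(float))
```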
In parallel, we will continue curating larger and more diverse skin image datasets.
From a modeling perspective, we plan to conduct systematic hyperparameter tuning and explore architectural enhancements to improve multimodal fusion.
Finally, we aim to translate this platform into a formal research study and potential clinical product, rigorously evaluating whether integrating skin imaging with blood transcriptomics significantly outperforms single-modality diagnostic approaches.