Epigenetic Markers of Memory Loss: Revolutionary Early Detection Platform for Alzheimer's Disease
Motivation
While searching for datasets to develop a research project, we came across studies that used DNA methylation data from brain tissue to distinguish between individuals with and without Alzheimer's disease. However, since brain samples can only be obtained post-mortem or through costly invasive procedures, this approach is impractical for diagnosing living patients. At the same time, we read about blood tests being explored for Alzheimer's diagnostics, which sparked a key question: If DNA methylation is associated with Alzheimer's disease, could blood-derived methylation data serve as a diagnostic tool? This motivated us to develop a web app that could make early detection available to healthcare providers worldwide.
Abstract
Epigenetics is the study of changes in gene expression that occur without altering the DNA sequence, some of which can be passed down through generations. These changes are regulated by chemical modifications to the DNA molecule that inhibit or promote the expression of a gene. One such mechanism is DNA methylation, which adds a methyl group to the cytosine of a CpG dinucleotide. Both environmental factors and genetics can shape DNA methylation patterns and determine whether a gene is expressed. Alzheimer’s disease, a neurodegenerative condition affected by both genetic and environmental factors, may therefore be studied through these epigenetic signatures. By analyzing DNA methylation profiles, we aim to monitor cognitive decline, identify disease-associated genes and CpG sites, and predict Alzheimer’s risk using machine learning. In this project, DNA methylation profiles extracted from blood samples of individuals with Alzheimer’s disease, individuals experiencing cognitive impairment, and healthy controls will be used to create a dataset for training the machine learning model. This model will predict the likelihood of Alzheimer’s and monitor cognitive decline based on methylation patterns. Furthermore, by analyzing feature importance scores from the trained model, we can identify the specific methylation sites and genes most strongly associated with Alzheimer’s, offering insights into disease mechanisms, potential biomarkers, and cognitive impairment.
How it Works
The machine learning model was trained with LightGBM on DNA methylation data derived from blood samples. The dataset included approximately 850,000 methylation sites across 823 samples. An Epigenome-Wide Association Study (EWAS) was conducted to identify statistically significant methylation sites. These sites were then used for feature selection, which removed noise from the dataset and improved the overall performance of the machine learning model.
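As a rough illustration of this kind of feature selection, the sketch below ranks CpG sites by a per-site Welch t-statistic between cases and controls and keeps the top k. It is a simplified stand-in, not the project's actual pipeline: a full EWAS typically fits a regression per site with covariates such as age, sex, and cell-type composition, and applies multiple-testing correction. The function name `select_top_cpgs`, the value of `k`, and the synthetic data are assumptions made for the example.

```python
import numpy as np


def select_top_cpgs(beta, labels, k):
    """Rank CpG sites by |Welch t| between cases (1) and controls (0).

    beta   : (n_samples, n_sites) array of methylation beta values in [0, 1]
    labels : (n_samples,) array of 0/1 class labels
    Returns the column indices of the k highest-ranked sites.
    """
    beta = np.asarray(beta, dtype=float)
    labels = np.asarray(labels)
    cases, controls = beta[labels == 1], beta[labels == 0]

    mean_diff = cases.mean(axis=0) - controls.mean(axis=0)
    # Welch standard error: per-group sample variance over group size.
    se = np.sqrt(
        cases.var(axis=0, ddof=1) / len(cases)
        + controls.var(axis=0, ddof=1) / len(controls)
    )
    t_stat = mean_diff / np.maximum(se, 1e-12)  # guard against zero variance

    # Indices of the k largest |t|, strongest first.
    return np.argsort(-np.abs(t_stat))[:k]


# Synthetic example (hypothetical data): 60 samples, 500 sites,
# with sites 0-4 shifted upward in cases.
rng = np.random.default_rng(0)
labels = np.array([0] * 30 + [1] * 30)
beta = rng.normal(0.5, 0.05, size=(60, 500)).clip(0, 1)
beta[labels == 1, :5] += 0.15
selected = select_top_cpgs(beta, labels, k=5)
print(sorted(selected.tolist()))
```

The selected column indices would then define the reduced feature matrix passed to the classifier.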
A web application was developed to host the trained machine learning model so that users can upload their own .csv files and receive predictions. Users also receive insights into which methylation sites contributed most to each prediction through SHAP (SHapley Additive exPlanations) values.
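To make the meaning of these values concrete, here is a minimal sketch that computes exact Shapley values for a single sample by brute-force enumeration of feature subsets. A production app would more plausibly call the `shap` library's tree explainer on the trained model; that is an assumption, as are the toy model, the three-feature sample, and the baseline below. Each value is one feature's average marginal contribution to the gap between the prediction for the sample and the prediction for the baseline.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating feature subsets.

    Features outside a subset are set to their baseline value. Cost grows as
    2**n_features, so this is only feasible for a handful of features.
    """
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for subsets of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


# Hypothetical toy "model": a risk score over three CpG beta values,
# including an interaction between the first and third sites.
def toy_model(beta):
    return 2.0 * beta[0] - 1.0 * beta[1] + 0.5 * beta[0] * beta[2]


sample = [0.8, 0.3, 0.6]
baseline = [0.5, 0.5, 0.5]  # e.g. mean methylation in a reference cohort
phi = shapley_values(toy_model, sample, baseline)
print(phi)
```

The values sum to the difference between the model output for the sample and for the baseline, which is the property that lets a dashboard present them as an additive breakdown of one prediction.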
Frontend Architecture:
• Next.js Application: Built with React, Tailwind CSS, shadcn/ui, and Radix UI components for a modern, accessible user interface
• Secure File Upload: Intuitive file uploading with real-time progress tracking for seamless preprocessing of high-dimensional methylation data
• Interactive Dashboard: User-friendly navigation with progress bars and explainable AI insights through SHAP visualizations, allowing clinicians to understand model decisions and identify key biomarkers
Backend Architecture:
• FastAPI Backend: Robust RESTful API endpoints with CORS configuration for secure data handling, serving our dual-model system with efficient data processing
Technical Challenges
Our dataset contains around 850,000 CpG sites across 823 samples. Feature selection proved challenging: we had to experiment with multiple techniques to optimize model performance, including statistical filtering methods, dimensionality reduction approaches, and machine learning-based selection strategies.
Uploading and processing large CSV methylation files required building an efficient data transfer pipeline between the Next.js/React frontend and the FastAPI backend, maintaining real-time updates and consistent data handling. This portable architecture ensures accessibility across different healthcare environments while maintaining data integrity and security.
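One way to keep memory use bounded on the server side is to stream the uploaded file row by row and keep only the columns the model needs. The sketch below uses only the Python standard library. The CSV layout (one row per sample, a `sample_id` column, one column per CpG site), the function name `read_selected_sites`, and the placeholder site IDs are assumptions for illustration, not the project's actual file format.

```python
import csv
import io


def read_selected_sites(csv_file, required_sites):
    """Stream a methylation CSV and keep only the sites the model needs.

    Assumed layout (hypothetical): one row per sample, a `sample_id` column,
    and one column per CpG site holding a beta value between 0 and 1.
    Yields (sample_id, [beta values ordered as required_sites]).
    """
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    required = ["sample_id", *required_sites]
    missing = [col for col in required if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    for row in reader:  # only one row is held in memory at a time
        values = []
        for site in required_sites:
            beta = float(row[site])
            if not 0.0 <= beta <= 1.0:
                raise ValueError(f"{row['sample_id']}: {site}={beta} outside [0, 1]")
            values.append(beta)
        yield row["sample_id"], values


# Placeholder upload with made-up site IDs.
upload = io.StringIO(
    "sample_id,cg0001,cg0002,cg0003\n"
    "S1,0.10,0.85,0.40\n"
    "S2,0.20,0.90,0.35\n"
)
rows = list(read_selected_sites(upload, ["cg0003", "cg0001"]))
print(rows)
```

Because the function is a generator, validation errors surface when the rows are consumed, so an API handler would wrap the iteration and translate a `ValueError` into a client-facing error message.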
What We Learned
The machine learning model built with LightGBM produced promising results, demonstrating strong predictive performance after feature selection was performed using an Epigenome-Wide Association Study (EWAS). By identifying the most statistically significant methylation sites, EWAS helped reduce noise and dimensionality in the dataset, enabling the model to focus on biologically relevant features and improving both interpretability and overall accuracy.
Using cross-validation taught us the importance of reproducibility and statistical rigor when working with biomedical data, ensuring our models can generalize to new patient populations.
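One pitfall worth guarding against with high-dimensional data is selecting features on the full dataset before cross-validating, which lets held-out samples influence the chosen sites and inflates the scores. The sketch below shows stratified k-fold cross-validation with feature selection repeated inside each training fold. It is written with NumPy only; the mean-difference filter and the nearest-centroid classifier are deliberately simple stand-ins (the project itself used EWAS and LightGBM), and all function names and the synthetic data are assumptions for illustration.

```python
import numpy as np


def stratified_folds(labels, n_splits, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits
    for fold in range(n_splits):
        yield np.flatnonzero(fold_of != fold), np.flatnonzero(fold_of == fold)


def top_k_by_mean_difference(X, y, k):
    """Toy filter: sites with the largest absolute case/control mean gap."""
    gap = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(-gap)[:k]


def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier: assign each sample to the closer class centroid."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def cross_validate(X, y, n_splits=5, k=10):
    scores = []
    for train_idx, test_idx in stratified_folds(y, n_splits):
        # Select features from the training fold only, so the held-out
        # samples never influence which sites the model sees.
        sites = top_k_by_mean_difference(X[train_idx], y[train_idx], k)
        pred = nearest_centroid_predict(
            X[train_idx][:, sites], y[train_idx], X[test_idx][:, sites]
        )
        scores.append(float((pred == y[test_idx]).mean()))
    return scores


# Synthetic data with no real signal: honest accuracy should hover near 0.5.
rng = np.random.default_rng(1)
X = rng.normal(0.5, 0.1, size=(100, 2000))
y = np.array([0, 1] * 50)
scores = cross_validate(X, y)
print(round(float(np.mean(scores)), 2))
```

On pure-noise data like this, a pipeline that selects features before splitting would tend to report accuracy well above chance, which is a quick sanity check for leakage.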
What's Next
Next, we plan to pursue clinical validation with larger patient cohorts, more thorough hyperparameter tuning, and expanded biomarker visualizations with correlation analysis to improve feature identification. These advancements should improve both usability for healthcare professionals and model performance.
Impact & Commercial Opportunity
We have developed a blood-based DNA methylation platform for early Alzheimer’s detection using advanced machine learning on 823 samples with 850,000+ CpG sites. Unlike costly brain scans or invasive procedures, our simple blood test makes early screening affordable, scalable, and accessible.
• Reduces cost compared with current diagnostic methods that require expensive brain imaging or invasive procedures.
• Provides interpretable biomarker discovery for researchers and clinicians through feature importance analysis, UMAP visualizations, and explainable AI with SHAP visualizations that reveal the key CpG sites and genes driving Alzheimer's risk classification.
By merging machine learning, software engineering, bioinformatics and healthcare, our solution offers a scientifically grounded, clinically relevant approach that can reduce costs, improve patient outcomes, and accelerate adoption in precision medicine.
Technologies Used: Next.js, React, Tailwind CSS, FastAPI, LightGBM, PyTorch, XGBoost, Python, TypeScript, shadcn/ui, Radix UI
Categories: Healthcare, Machine Learning, Data Science, Bioinformatics, Clinical Research, Precision Medicine
Built With
- fastapi
- lightgbm
- next.js
- pytorch
- radix-ui
- react
- shadcn/ui
- tailwindcss
- typescript