UHSP Datathon 2026

Inspiration

Mental health diagnostics often rely on subjective self-reporting and clinical interviews, which can lead to delayed or inconsistent treatment. We were inspired by the potential of neurotechnology to provide a "biological ground truth." By analyzing resting-state EEG signals recorded over the prefrontal cortex, we wanted to see whether machine learning could bridge the gap between subjective experience and objective physiological data in Major Depressive Disorder (MDD).

What it does

Our project is a diagnostic pipeline that classifies individuals as either Healthy Controls or MDD patients using only three electrodes: Fp1, Fpz, and Fp2. The system processes the raw EEG signals, extracts features such as Hurst exponents, theta/beta ratios, and wavelet energy, and uses a gradient-boosted tree model (XGBoost) to identify neural signatures associated with depression.
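As an illustration of one of the features named above, here is a minimal sketch of a theta/beta ratio computed from a single channel with a plain FFT periodogram. The band edges (4 to 8 Hz for theta, 13 to 30 Hz for beta), the 250 Hz sampling rate, and the function names are assumptions made for the example, not details taken from the project.

```python
import numpy as np

def band_power(signal, fs, low, high):
    """Total periodogram power of `signal` in the band [low, high) Hz."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                    # remove the DC offset
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)   # frequency of each FFT bin
    power = np.abs(np.fft.rfft(signal)) ** 2 / signal.size
    mask = (freqs >= low) & (freqs < high)
    return power[mask].sum()

def theta_beta_ratio(signal, fs, theta=(4.0, 8.0), beta=(13.0, 30.0)):
    """Ratio of theta-band power to beta-band power (band edges are assumed)."""
    return band_power(signal, fs, *theta) / band_power(signal, fs, *beta)

# Synthetic stand-in for one channel: a 6 Hz (theta) sine with amplitude 2.0
# plus a 20 Hz (beta) sine with amplitude 0.5, sampled for 4 seconds.
fs = 250
t = np.arange(4 * fs) / fs
demo = 2.0 * np.sin(2 * np.pi * 6 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

# For two clean sines the ratio is the ratio of squared amplitudes, (2.0 / 0.5) ** 2.
ratio = theta_beta_ratio(demo, fs)
print(round(ratio, 3))
```

On real recordings, a smoothed spectral estimate such as Welch's method would likely be a better choice than a single raw periodogram.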

How we built it

Data Processing: We worked with a dataset of 50 patients and used synthetic data generation to expand it to 5,000 samples for model training.

Feature Engineering: We implemented an automated selection process that removes columns with more than 85% null values and prunes highly correlated features (correlation above 0.90) to reduce multicollinearity.
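The two pruning rules above could look roughly like the following sketch. It assumes the features live in a pandas DataFrame, that correlation means absolute Pearson correlation, and that the later column of each correlated pair is the one dropped; `prune_features` and the demo columns are hypothetical.

```python
import numpy as np
import pandas as pd

def prune_features(df, max_null_frac=0.85, max_corr=0.90):
    """Drop mostly-null columns, then one column from each highly correlated pair."""
    # 1) Keep columns whose fraction of nulls does not exceed the threshold.
    kept = df.loc[:, df.isna().mean() <= max_null_frac]
    # 2) Absolute pairwise correlation; keep only the upper triangle so each
    #    pair is inspected once and the earlier column of a pair survives.
    corr = kept.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return kept.drop(columns=to_drop)

# Small synthetic table (not the project's data).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 0.01 * rng.normal(size=200),  # nearly collinear with "a"
    "b": rng.normal(size=200),
    "mostly_null": [np.nan] * 190 + [1.0] * 10,        # 95% null
})
pruned = prune_features(demo)
print(list(pruned.columns))
```

Which member of a correlated pair to keep is a design choice; dropping the later column is simply the most common convention.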

The Model: We used XGBoost as our core classifier. We tuned it with RandomizedSearchCV and 3-fold stratified cross-validation (StratifiedKFold), searching over learning rate and tree depth.

Selection: We ranked features by gain-based importance and focused on the "Vital Few" features that together account for 80% of the total model gain.
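The 80% cutoff can be sketched as a cumulative sum over features ranked by gain. The feature names and gain values below are made up; with a trained XGBoost model, a dictionary like this could come from `Booster.get_score(importance_type="total_gain")`, although which importance type the project used is an assumption here.

```python
def vital_few(gain_by_feature, coverage=0.80):
    """Smallest top-ranked set of features whose share of total gain reaches `coverage`."""
    total = sum(gain_by_feature.values())
    ranked = sorted(gain_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, gain in ranked:
        selected.append(name)
        running += gain
        if running / total >= coverage:   # stop once the cumulative share is reached
            break
    return selected

# Hypothetical gain values for illustration only.
demo_gain = {
    "hurst_fp1": 50.0,
    "theta_beta_fpz": 25.0,
    "wavelet_energy_fp2": 15.0,
    "alpha_power_fp1": 6.0,
    "delta_power_fp2": 4.0,
}
selected = vital_few(demo_gain)
print(selected)
```

In this toy example the first two features reach only 75% of the gain, so the third is included as well.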

Challenges we ran into

The biggest challenge was the "Perfect Metric" trap. Initially, our model returned an AUC and accuracy of 1.0. Instead of taking the win, we investigated why. We discovered temporal leakage and subject fingerprinting, where a model learns to recognize a specific person's brainwave pattern rather than the underlying pathology. Navigating the fine line between synthetic data expansion and data leakage was our steepest learning curve.

Accomplishments that we're proud of

Successfully extracting high-level EEG biomarkers that align with neurological research.

Building a pipeline that is "lean": we identified that a small subset of features accounts for the vast majority of predictive power.

Maintaining intellectual honesty by identifying and debugging data leakage rather than settling for "perfect" but unrealistic results.

What we learned

We learned that EEG data is incredibly personal: a signal recorded from the same person at 1:00 is almost identical to one recorded at 1:01. We gained a deep understanding of why Subject-Independent Validation (holding all of a person's data out of the training set) is necessary to build a model that works in a real clinic. We also learned to use SHAP and gain values to peek inside the "black box" of boosted trees.
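The subject-independent validation described above can be sketched as a split over subject IDs rather than over rows. The layout below (50 subjects with 100 rows each) and the `subject_holdout_split` helper are hypothetical; the project's actual mapping from samples to subjects is not given.

```python
import random

def subject_holdout_split(subject_ids, test_fraction=0.2, seed=0):
    """Split row indices so every subject lands entirely in train or entirely in test."""
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    held_out = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out]
    test_idx = [i for i, s in enumerate(subject_ids) if s in held_out]
    return train_idx, test_idx

# Hypothetical layout: 50 subjects, 100 rows per subject (5,000 rows in total).
subject_ids = [f"S{n:02d}" for n in range(50) for _ in range(100)]
train_idx, test_idx = subject_holdout_split(subject_ids)

train_subjects = {subject_ids[i] for i in train_idx}
test_subjects = {subject_ids[i] for i in test_idx}
print(len(train_subjects), len(test_subjects), train_subjects & test_subjects)
```

For cross-validation rather than a single hold-out, scikit-learn's `GroupKFold` or `StratifiedGroupKFold` applies the same idea when the subject IDs are passed as `groups`.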

Built With

python
randomforest
xgboost

Updates

Nino Godoradze started this project — Apr 18, 2026 06:19 PM EDT
