What Inspired Us
The inspiration for this project came from a critical gap in clinical decision-making for endometrial cancer patients. While molecular profiling has revolutionized cancer treatment, there's a significant subset of patients, those with No Specific Molecular Profile (NSMP), who don't fit into the standard molecular classification system. These patients represent a clinical challenge: how do we predict their survival outcomes and guide treatment decisions when traditional molecular markers aren't informative?
We were driven by the vision of creating a tool that could help clinicians make more informed decisions for these patients. By leveraging clinical and pathological features that are routinely available, we could build predictive models that complement molecular testing and provide actionable insights at the point of care.
The goal was ambitious: develop a web-based calculator that could take a patient's clinical profile and provide real-time predictions for both Overall Survival (OS) and Recurrence-Free Survival (RFS), complete with risk stratification and survival probability estimates.
What We Learned
This project was a deep dive into survival analysis, a statistical domain we had only encountered in theory before. Here are the key lessons we took away:
Survival Analysis Fundamentals
We learned that survival analysis is fundamentally different from standard classification or regression. The challenge isn't just predicting an outcome, but modeling time-to-event data where many patients are "censored" (we don't observe their event because they're still alive or haven't recurred at last follow-up). The Cox Proportional Hazards model became our tool of choice because it can handle this censoring naturally while modeling the effect of multiple risk factors simultaneously.
Data Preprocessing in Clinical Settings
Working with real clinical data taught us that medical datasets are messy in ways we hadn't anticipated:
- Missing data is the norm, not the exception: Some features had 80%+ missing values, requiring careful feature selection
- Mixed data types everywhere: The same variable might be encoded as text ("G1", "G2", "G3"), numeric (1, 2, 3), or even have both formats in the same column
- Multicollinearity is a real problem: Clinical features are often highly correlated (e.g., preoperative and final staging), requiring variance inflation factor (VIF) analysis to identify redundant predictors
- Small sample sizes: With only 161 patients, every data point mattered, and overfitting was a constant concern
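The mixed-format problem above (text, numeric, and both in one column) can be handled with a small normalizer; normalize_grade is a hypothetical helper for illustration, not code from the project:

```python
import pandas as pd

# Hypothetical mixed-format grade column like the ones described above.
raw = pd.Series(["G1", "G2", 2, "3", "G3", None])

def normalize_grade(value):
    """Map 'G1' / '1' / 1 style entries to an integer grade, else None."""
    if pd.isna(value):
        return None
    text = str(value).strip().upper().lstrip("G")
    return int(text) if text.isdigit() else None

grades = raw.map(normalize_grade)
print(grades.tolist())  # 1, 2, 2, 3, 3, then a missing value
```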
Feature Engineering for Survival Models
We discovered that feature selection for survival models requires different considerations than for classification:
- Temporal relationships matter: Preoperative features (available before surgery) are more clinically useful than postoperative features for early decision-making
- Domain knowledge is essential: Understanding which molecular markers (PMS2, MSH2, MSH6, MLH1) are relevant for endometrial cancer guided feature selection
- Encoding order matters: We learned the hard way that encoding categorical variables before feature selection prevents data leakage and ensures consistent preprocessing
Building Clinical Interfaces
Creating the Dash web application taught us that user experience in clinical tools is critical:
- Clinicians need interpretability: Not just a risk score, but understanding why a patient is high-risk (feature contributions)
- Visualization matters: Kaplan-Meier-style survival curves are familiar to clinicians and help build trust in the model
- Input validation is crucial: Medical data has constraints (e.g., BMI ranges, staging categories) that must be enforced
- Responsive design: The interface needed to work on tablets and desktops in clinical settings
How We Built the Project
The project evolved through several distinct phases, each building on lessons learned from the previous:
Phase 1: Exploration and Understanding
We started with exploratory data analysis (EDA) to understand the dataset structure. The Excel file contained 161 patients with over 100 potential features, ranging from demographics to molecular markers. We created comprehensive feature documentation (key_features.md) tracking missing-data percentages and clinical significance.
Key discovery: The dataset had two primary survival outcomes:
- Recurrence-Free Survival (RFS): Time from diagnosis to recurrence or last follow-up
- Overall Survival (OS): Time from diagnosis to death or last follow-up
These required separate models because the risk factors differed between recurrence and death.
Phase 2: Preprocessing Pipeline Development
The preprocessing pipeline (02_preproccesing.py) became the foundation of the project. This was where we encountered and solved most of the technical challenges:
Data Translation: The original data was in Spanish, requiring a comprehensive variable mapping dictionary to translate to English for consistency.
Survival Time Calculation: We wrote functions to calculate survival times from date columns, handling edge cases like:
- Patients who died before recurrence (for RFS, they're censored at death)
- Patients with missing follow-up dates
- Negative time differences (data quality issues)
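The edge cases above can be sketched in a single function; rfs_time_and_event and the 30.44-days-per-month conversion are illustrative assumptions, not the project's actual implementation:

```python
import pandas as pd

def rfs_time_and_event(diagnosis, recurrence, death, last_followup):
    """Return (time_in_months, event) for recurrence-free survival.

    Patients who die before any recurrence are censored at death;
    missing follow-up dates or negative intervals return (None, None).
    """
    if pd.isna(diagnosis):
        return None, None
    if pd.notna(recurrence):
        end, event = recurrence, 1
    elif pd.notna(death):
        end, event = death, 0          # died without recurrence: censored at death
    elif pd.notna(last_followup):
        end, event = last_followup, 0  # alive, no recurrence: censored
    else:
        return None, None              # no usable follow-up date
    months = (end - diagnosis).days / 30.44
    if months < 0:                     # data-quality issue
        return None, None
    return round(months, 1), event

d = pd.Timestamp("2018-01-01")
print(rfs_time_and_event(d, pd.NaT, pd.Timestamp("2020-01-01"), pd.NaT))
# (24.0, 0): died two years after diagnosis with no recurrence recorded
```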
Encoding Strategy: We implemented a multi-step encoding process:
- Ordinal variables (tumor grade, risk groups) → numeric encoding preserving order
- Categorical variables → one-hot encoding with drop_first=True to avoid multicollinearity
- Continuous variables → kept as-is for Cox models (which handle continuous features well)
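A minimal pandas sketch of this encoding strategy; the feature names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "grade": ["G1", "G3", "G2"],                            # ordinal
    "histology": ["endometrioid", "serous", "clear_cell"],  # nominal
    "bmi": [24.5, 31.2, 27.8],                              # continuous, kept as-is
})

# Ordinal: explicit mapping that preserves the clinical ordering.
df["grade"] = df["grade"].map({"G1": 1, "G2": 2, "G3": 3})

# Nominal: one-hot with drop_first=True so k categories yield k-1 columns.
df = pd.get_dummies(df, columns=["histology"], drop_first=True)
print(sorted(df.columns))
# ['bmi', 'grade', 'histology_endometrioid', 'histology_serous']
```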
Feature Selection: We used multiple criteria:
- Missing data threshold (<50% missing)
- Correlation analysis to remove highly correlated features
- VIF analysis to detect multicollinearity
- Clinical relevance (prioritizing features available preoperatively)
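The VIF criterion can be sketched with statsmodels; drop_high_vif is a hypothetical helper illustrating the iterative-removal idea on synthetic near-collinear staging features:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the feature with the highest VIF above threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=vifs.idxmax())  # remove the worst offender
    return X

rng = np.random.default_rng(0)
stage = rng.normal(size=200)
X = pd.DataFrame({
    "preop_stage": stage,
    "final_stage": stage + rng.normal(scale=0.01, size=200),  # nearly identical
    "bmi": rng.normal(size=200),
})
print(list(drop_high_vif(X).columns))  # one of the two stage features is gone
```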
Train-Test Splitting: Used stratified splitting to ensure both sets had similar event rates, critical for small datasets.
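Stratifying on the event indicator keeps the event rate balanced across the split; a minimal sketch with sklearn on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "T": rng.exponential(60, 161),  # follow-up time
    "E": rng.integers(0, 2, 161),   # event indicator
})

# Stratify on the event indicator so train and test keep similar event rates,
# which matters when only 161 patients are available.
train, test = train_test_split(df, test_size=0.25, stratify=df["E"], random_state=42)
print(round(train["E"].mean(), 2), round(test["E"].mean(), 2))
```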
Phase 3: Model Development
We built a modular modeling framework (src/modeling.py) centered around a CoxPHModel class that wrapped the lifelines library:
Key design decisions:
- Regularization: Used L2 penalization (penalizer=1.0) to prevent overfitting with the small sample size
- Feature standardization: Added an option to standardize features, which improves coefficient interpretability (coefficients represent effect of 1 standard deviation change)
- Model persistence: Implemented save/load functionality using joblib for deployment
Model Performance:
- Overall Survival: C-index = 0.77 (moderate discrimination)
- Recurrence-Free Survival: C-index = 0.85 (good discrimination)
The RFS model performed better, likely because recurrence is more directly related to tumor characteristics, while OS is influenced by many non-cancer factors (comorbidities, age, etc.).
Phase 4: Validation Framework
We implemented cross-validation (src/validation.py) to assess model stability and prevent overfitting. This was crucial given the small sample size. The framework:
- Used stratified K-fold cross-validation respecting survival structure
- Evaluated both discrimination (C-index) and calibration
- Generated visualizations comparing model performance across folds
Phase 5: Web Application Development
The Dash application (app/dash_app.py) was the most user-facing component. We designed it with several key features:
- Dual Model Support: Users can select OS, RFS, or both predictions
- Dynamic Input Forms: Different input fields for OS vs. RFS models (e.g., PMS2 for OS, lymphovascular invasion for RFS)
- Real-time Predictions: Instant risk scores and survival probabilities
- Visualizations:
- Risk distribution histograms
- Survival probability curves
- Feature contribution bar charts
- Hazard ratio visualizations
- Risk Stratification: Automatic classification into Low/Medium/High risk groups
The interface uses Bootstrap styling for a professional, clinical appearance and includes extensive input validation.
Phase 6: Addressing Class Imbalance
One challenge we discovered was class imbalance in survival outcomes. Most patients didn't experience events (recurrence or death), which could bias the models. We experimented with SMOTE (Synthetic Minority Oversampling Technique) for survival data (02b_oversampliong.py), though ultimately the Cox model's ability to handle censoring made this less critical than in classification problems.
Challenges We Faced
Challenge 1: The Encoding Order Bug
The Problem: Our initial preprocessing pipeline performed feature selection before encoding categorical variables. This caused a critical bug: the selected features were categorical column names, but the model expected encoded (one-hot) features.
The Solution: We restructured the pipeline to encode first, then select features. This required rewriting significant portions of the preprocessing code but resulted in a more robust pipeline.
Lesson: In machine learning pipelines, the order of operations matters critically. Always encode before feature selection when working with mixed data types.
Challenge 2: Multicollinearity in Clinical Features
The Problem: Clinical features are inherently correlated. For example, preoperative staging, final staging, and risk groups all measure similar concepts. High multicollinearity causes unstable coefficient estimates and makes interpretation difficult.
The Solution: We implemented VIF (Variance Inflation Factor) analysis to identify and remove highly correlated features. We set a threshold of VIF > 10 to flag problematic features and removed them iteratively.
Lesson: Domain knowledge helps, but quantitative methods like VIF are essential for detecting subtle multicollinearity that isn't obvious from correlation matrices.
Challenge 3: Small Sample Size
The Problem: With only 161 patients, the risk of overfitting was constant. Standard machine learning approaches (like complex neural networks) weren't viable.
The Solution:
- Used regularization (L2 penalization) aggressively
- Kept the feature set small (16 features for OS, 20 for RFS)
- Used cross-validation extensively to assess generalization
- Accepted that some uncertainty is inherent with small datasets
Lesson: Sometimes the best model is the simplest one that generalizes well. A well-regularized Cox model with 16 features outperformed more complex approaches on this dataset.
Challenge 4: Missing Data Strategy
The Problem: Missing data percentages ranged from 0% to 87%. Some important features (like CA-125, a tumor marker) had >80% missing values, making them unusable.
The Solution:
- Set a 50% missing threshold for feature inclusion
- Used complete-case analysis (dropping rows with missing values in selected features) rather than imputation, since imputation assumptions are hard to validate in survival analysis
- Documented missing data patterns to inform clinical interpretation
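The threshold-then-complete-case strategy above fits in a few lines of pandas; the column names are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ca125": [np.nan] * 5 + [35.0],          # >80% missing, like the tumor marker above
    "age": [61, 70, np.nan, 55, 66, 72],     # mostly complete
})

# Drop features above the 50% missing threshold, then keep complete-case rows.
keep = df.columns[df.isna().mean() < 0.5]
complete = df[keep].dropna()
print(list(keep), len(complete))  # ['age'] 5
```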
Lesson: In clinical data, missingness is often informative (e.g., a test wasn't ordered because it wasn't indicated). Imputation can introduce bias, so sometimes complete-case analysis is preferable.
Challenge 5: Building an Intuitive Clinical Interface
The Problem: Clinicians need to trust the tool, which requires both accuracy and interpretability. The interface needed to be fast, clear, and provide context for predictions.
The Solution:
- Added feature contribution visualizations showing which factors drive high/low risk
- Included confidence intervals and risk stratification (not just point estimates)
- Used familiar visualizations (Kaplan-Meier curves) that clinicians recognize
- Provided both numeric outputs (survival probabilities) and visual outputs (curves)
Lesson: A model is only as good as its usability. Spending time on the user interface and interpretability features is as important as model performance.
Challenge 6: Handling Censored Data in Predictions
The Problem: Survival models predict survival functions, not single values. For the web app, we needed to extract specific survival probabilities at clinically relevant time points (1, 3, 5 years).
The Solution: Used the predict_survival_function method from lifelines, which returns survival probabilities at specified time points. We then formatted these for display in the web interface.
Lesson: Survival analysis requires thinking in terms of functions and probabilities, not just binary outcomes. This is more complex but also more informative for clinical decision-making.
Reflection
This project taught us that building clinical decision support tools requires a unique combination of skills:
- Statistical rigor: Understanding survival analysis, regularization, and validation
- Data engineering: Handling messy, real-world clinical data
- Software engineering: Building robust, maintainable pipelines
- User experience design: Creating interfaces that clinicians will actually use
- Domain knowledge: Understanding the clinical context to make informed modeling decisions
The final tool represents a synthesis of these skills. While the models aren't perfect (C-index of 0.77-0.85), they provide valuable risk stratification that can complement clinical judgment. The interactive web interface makes these predictions accessible at the point of care.
Most importantly, this project reinforced that in healthcare applications, interpretability and usability are as important as raw performance. A slightly less accurate model that clinicians understand and trust is more valuable than a black-box model with marginally better metrics.
Future Directions
If we were to continue this project, we would:
- External validation: Test the models on an independent dataset to assess generalizability
- Temporal validation: Validate on more recent patients to ensure the models remain relevant as treatment practices evolve
- Feature updates: Incorporate new molecular markers as they become standard in clinical practice
- Model calibration: Improve calibration so that predicted probabilities match observed event rates more closely
- Clinical integration: Work with clinicians to integrate the tool into electronic health record systems
The journey from raw Excel data to a functional clinical calculator was challenging but deeply rewarding. Each obstacle taught us something new about both the technical aspects of survival modeling and the practical realities of working with clinical data.