Hospital-LOS-Predictor

Inspiration

In 2022, I began my machine learning journey with Data Science Nigeria (DSN) at my university. My mentors assigned this hospital length-of-stay prediction project as a learning exercise. Years later, I revisited it for this hackathon—but this time, I challenged myself to build it using real-world production standards, not just as an academic exercise.


Problem Statement

Hospital length of stay (LOS) prediction addresses critical needs from two perspectives:

1. Patient Perspective

When admitted to a hospital, patients and families need to know how long the stay will last. This information impacts:

  • Financial planning for medical bills
  • Work and family arrangements
  • Psychological preparation and peace of mind

2. Hospital Management Perspective

Healthcare administrators face constant pressure to optimize limited resources. Accurate LOS predictions enable:

  • Efficient bed allocation – Anticipate availability for incoming patients
  • Optimized staff scheduling – Align nursing and physician resources with demand
  • Cost management – Reduce operational waste and improve financial planning
  • Enhanced patient care – Better discharge planning and post-acute care coordination

In overcrowded hospitals, knowing when beds will become available can literally save lives by ensuring critical patients get timely care.


What It Does

Hospital-LOS-Predictor is a full-stack machine learning web application that predicts patient length of stay in New York State hospitals with 95% confidence intervals.

Core Features:

  • Interactive map-based workflow – Select from 62 NY counties and 200+ hospitals using Leaflet.js
  • Clinical assessment form – 13 easily accessible features (age, gender, admission type, diagnosis, severity, insurance type, etc.)
  • Real-time ML predictions – XGBoost model trained on 2.4M patient records from NY SPARCS dataset
  • Risk factor analysis – Identifies clinical factors contributing to predicted LOS with impact quantification
  • Professional healthcare UI – Clean, responsive design suitable for hospital staff use

Example Output:

Predicted LOS: 4 days, 5 hours (confidence interval: 3.1–5.4 days)
Contributing factors: High clinical severity (+2-4 days), Emergency admission (+1-3 days)
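
The "days, hours" display can be derived from a decimal-day prediction. A minimal sketch, assuming the model returns LOS in days as a float and that hours are rounded to the nearest whole hour. `format_los` is a hypothetical helper, not taken from the project code:

```python
def format_los(days: float) -> str:
    """Render a decimal-day length of stay as 'D days, H hours'."""
    if days < 0:
        raise ValueError("length of stay cannot be negative")
    # Round once on total hours so 2.99 days becomes 3 days, not 2 days 24 hours.
    total_hours = round(days * 24)
    whole_days, hours = divmod(total_hours, 24)
    day_word = "day" if whole_days == 1 else "days"
    hour_word = "hour" if hours == 1 else "hours"
    return f"{whole_days} {day_word}, {hours} {hour_word}"


print(format_los(4.2))  # 4 days, 5 hours
```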


How I Built It

1. Data Engineering & Model Development

  • Dataset: 2.4M patient discharge records from NY.gov SPARCS database
  • Data cleaning: Handled missing values, outliers (120+ day stays), and imbalanced classes
  • Feature engineering:
    • Selected 13 accessible features that patients/relatives would know at admission
    • Created target-encoded features (LOS_per_MDC, LOS_per_severity)
    • One-hot encoded categorical variables → 312 final features
  • Model training: XGBoost Regressor with hyperparameter tuning via GridSearchCV
  • Pipeline design: Built reusable scikit-learn preprocessing pipeline saved as .pkl for production use
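
Target encoding replaces a category with a statistic of the target for that category. The sketch below shows the general idea behind a feature like LOS_per_MDC using pandas. The toy data, the column names `mdc` and `los_days`, and the global-mean fallback for unseen groups are illustrative assumptions, not the project's actual code:

```python
import pandas as pd

# Toy stand-in for the training split; values are made up for illustration.
train = pd.DataFrame({
    "mdc": ["circulatory", "circulatory", "respiratory", "respiratory"],
    "los_days": [4.0, 8.0, 2.0, 2.0],
})

# Learn the mapping on training data only, to avoid leaking the target.
global_mean = train["los_days"].mean()
los_per_mdc = train.groupby("mdc")["los_days"].mean().to_dict()


def encode_mdc(values: pd.Series) -> pd.Series:
    """Map each MDC to its mean training LOS; unseen groups get the global mean."""
    return values.map(los_per_mdc).fillna(global_mean)


new_patients = pd.Series(["circulatory", "digestive"])
print(encode_mdc(new_patients).tolist())  # [6.0, 4.0]
```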

2. Web Application Development

Frontend (with Claude AI assistance):

  • Vanilla HTML5, CSS3, JavaScript (no frameworks for lightweight performance)
  • Leaflet.js for interactive county/hospital maps with GeoJSON data
  • Custom state management using URL parameters (bookmarkable, shareable links)
  • Client-side form validation with real-time error feedback

Backend:

  • Python Flask REST API
    • /api/predict – Handles feature preprocessing and model inference
    • /api/health – Health check endpoint
    • Risk factor identification logic based on clinical domain knowledge
  • Preprocessing integration: Custom HospitalDataCleaner class handles:
    • Diagnosis group mapping (MDC codes)
    • Target encoding for high-cardinality features
    • Column alignment to training set (312 features)
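
A minimal sketch of how the two endpoints could be laid out in Flask. Only the routes /api/predict and /api/health come from the description above. The required field names, the JSON response shape, and the `predict_los` placeholder (standing in for the real preprocessing and model call) are assumptions made for illustration:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical subset of the form fields; the real form has 13.
REQUIRED_FIELDS = ("age_group", "admission_type", "severity")


def predict_los(features: dict) -> float:
    """Placeholder for the real pipeline call; returns a fixed value."""
    return 4.2


@app.route("/api/health", methods=["GET"])
def health():
    return jsonify(status="ok")


@app.route("/api/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        # Reject incomplete requests before they reach the model.
        return jsonify(error="missing fields", fields=missing), 400
    return jsonify(predicted_los_days=predict_los(payload))


# Exercise the API in-process with Flask's built-in test client.
client = app.test_client()
print(client.get("/api/health").get_json())
```

Validating the payload before inference keeps malformed requests from surfacing as opaque model errors.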

3. Deployment Journey

  • Microsoft Azure App Service – Initial deployment using Gunicorn WSGI server
    • Configured App Service Plan, WSGI handlers, and environment variables
    • Learned Azure-specific quirks (Kudu deployment, Oryx build system)
    • Migrated away due to cost concerns (free credits expiring)
  • Render.com – Current production deployment (free tier, 750 hours/month)
    • Simpler deployment workflow with native Flask support
    • Better suited for ML apps with large model files

Challenges I Ran Into

1. Production-Ready ML Pipeline Design (Biggest Challenge)

Creating a flexible, maintainable pipeline that could:

  • Serialize the entire preprocessing workflow (not just the model)
  • Handle unknown categories in production (e.g., new diagnosis codes)
  • Be swappable – if I train a better model, just replace the .pkl file

Solution: Built a custom HospitalDataCleaner transformer that inherits from scikit-learn's BaseEstimator and TransformerMixin, so the full pipeline can be serialized with joblib.
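
The pattern can be sketched as follows. `MeanLOSEncoder` is a hypothetical, much smaller stand-in for HospitalDataCleaner: it learns a per-category mean at fit time, falls back to the global mean for categories it has never seen, and survives a joblib round trip. The column name and toy data are assumptions:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MeanLOSEncoder(BaseEstimator, TransformerMixin):
    """Replace a categorical column with the mean target seen during fit."""

    def __init__(self, column="mdc"):
        self.column = column

    def fit(self, X, y):
        targets = pd.Series(list(y), dtype=float)
        keys = X[self.column].to_numpy()
        self.global_mean_ = float(targets.mean())
        self.mapping_ = targets.groupby(keys).mean().to_dict()
        return self

    def transform(self, X):
        X = X.copy()
        # Categories unseen at fit time fall back to the global mean.
        X[self.column] = X[self.column].map(self.mapping_).fillna(self.global_mean_)
        return X


X_train = pd.DataFrame({"mdc": ["circulatory", "circulatory", "respiratory", "respiratory"]})
y_train = [4.0, 8.0, 2.0, 2.0]
encoder = MeanLOSEncoder().fit(X_train, y_train)

# Round-trip through joblib, as a deployed app would do at startup.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "encoder.pkl")
    joblib.dump(encoder, path)
    restored = joblib.load(path)

X_new = pd.DataFrame({"mdc": ["respiratory", "never_seen"]})
print(restored.transform(X_new)["mdc"].tolist())  # [2.0, 4.0]
```

One practical caveat: a pickled custom transformer can only be loaded where its class is importable under the same module path, so the class definition has to ship with the application.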

2. Debugging AI-Generated Code

Claude AI accelerated frontend development, but debugging required:

  • Reading through unfamiliar JavaScript patterns
  • Understanding Leaflet.js quirks with GeoJSON rendering
  • Tracing state management bugs across multiple HTML pages

Lesson learned: AI is a great accelerator, but you must deeply understand the code it generates.

3. Azure Deployment Configuration

  • Struggled with Kudu deployment failures (NullReferenceException)
  • Learned the difference between traditional Flask servers and serverless functions
  • Had to configure WSGI handlers, startup commands, and environment variables correctly

Solution: Switched to Azure's native Python support instead of custom deployment scripts.

4. Feature Engineering for Real-World Constraints

Unlike Kaggle competitions, where every feature is available, I restricted the model to inputs a patient or family would know at admission:

  • No lab results and no retrospective data
  • Predictive power balanced against practical usability
  • Creative feature engineering (target encoding, interaction features) to make up for the missing signals

Accomplishments That I'm Proud Of

  • Built a production-ready ML system – Not just a Jupyter notebook, but a full web app with API, frontend, and deployment
  • Successfully deployed to two cloud platforms – Learned Azure App Service and Render.com deployment workflows
  • Created reusable ML pipeline – Can swap models without touching application code
  • Designed professional healthcare UI – Looks like a real analytics dashboard, not a student project
  • Handled 312-feature encoding pipeline in production – Complex preprocessing works reliably in live environment
  • Real-world dataset – 2.4M records with messy, imbalanced data (not clean Kaggle data)

This project demonstrates the complete ML lifecycle: data cleaning → feature engineering → model training → API development → frontend design → production deployment.


What I Learned

Technical Skills

  • MLOps fundamentals: Serializing entire pipelines, versioning models, handling schema drift
  • Production ML considerations: Feature availability at inference time, latency optimization, error handling
  • Full-stack development: Flask backend, vanilla JS frontend, RESTful API design
  • Cloud deployment: Azure App Service, Render.com, WSGI servers, environment configuration
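
Handling schema drift at inference time largely comes down to forcing every request into the exact column order the model was trained on. A minimal pure-Python sketch, assuming one-hot columns are named `<field>_<value>` and numeric columns keep their own name. The naming scheme and the `align_features` helper are illustrative assumptions, and the three-column list stands in for the real 312 features:

```python
def align_features(record, feature_names):
    """Encode one record against a fixed training column order.

    Returns the feature row plus a list of one-hot columns that the
    training set never contained, so they can be logged.
    """
    index = {name: i for i, name in enumerate(feature_names)}
    row = [0.0] * len(feature_names)
    unseen = []
    for field, value in record.items():
        if field in index:
            # Numeric column passed through unchanged.
            row[index[field]] = float(value)
            continue
        column = f"{field}_{value}"
        if column in index:
            row[index[column]] = 1.0
        else:
            # Unknown category: leave all related columns at 0.
            unseen.append(column)
    return row, unseen


feature_names = ["LOS_per_MDC", "admission_type_Emergency", "admission_type_Elective"]
row, unseen = align_features(
    {"LOS_per_MDC": 5.1, "admission_type": "Trauma"}, feature_names
)
print(row, unseen)  # [5.1, 0.0, 0.0] ['admission_type_Trauma']
```

Returning the unseen columns instead of raising keeps the API responsive while still making drift visible in logs.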

Most Valuable Lesson: Business Context Matters More Than Model Metrics

In academic ML, we optimize for R² or RMSE. In production, I learned to prioritize:

  • Feature accessibility – Can the user actually provide this data?
  • Interpretability – Can hospital staff trust and understand predictions?
  • Operational impact – Does a 0.5-day prediction error actually matter for bed allocation?

Example: I could've achieved higher R² by including lab results and vital signs, but those aren't available at admission time. A slightly less accurate model that's actually usable is far more valuable.

Another insight: The confidence interval matters more than the point estimate. Telling staff "3.1–5.4 days" lets them plan conservatively, whereas "4.2 days" gives false precision.
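
One simple way to obtain such an interval, shown here only as an illustrative assumption and not necessarily what this project does, is to shift the point estimate by empirical quantiles of residuals measured on a held-out set. The residual values below are made up:

```python
import statistics


def interval_95(point, residuals):
    """Approximate 95% interval from held-out residuals (actual - predicted)."""
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(residuals, n=40, method="inclusive")
    lower = max(0.0, point + cuts[0])   # a stay cannot be negative
    upper = point + cuts[-1]
    return lower, upper


# Hypothetical held-out residuals, in days.
held_out_residuals = [-1.5, -1.0, -0.5, -0.2, 0.0, 0.1, 0.4, 0.8, 1.2, 2.0]
low, high = interval_95(4.2, held_out_residuals)
print(f"{low:.1f} to {high:.1f} days")
```

Because the quantiles are taken from the residuals directly, the interval can be asymmetric around the point estimate, which suits right-skewed quantities such as length of stay.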


What's Next for Hospital-LOS-Predictor

Short-term Improvements

  • [ ] Mobile responsiveness – Optimize UI for tablets/phones used by hospital staff
  • [ ] Model performance – Experiment with CatBoost, LightGBM, or ensemble methods
  • [ ] Feature importance visualization – Add SHAP values to explain individual predictions
  • [ ] Unit tests – Add pytest coverage for API endpoints and preprocessing pipeline

Medium-term Enhancements

  • [ ] User authentication – Add login system for real hospital use (HIPAA compliance considerations)
  • [ ] Prediction history tracking – Log predictions to database for monitoring and improvement
  • [ ] Admin dashboard – Monitor model performance over time, detect drift
  • [ ] Multi-state support – Expand beyond NY to other states' hospital data

Long-term Vision

  • [ ] Model retraining pipeline – Automated retraining with new SPARCS data releases
  • [ ] A/B testing framework – Compare model versions in production
  • [ ] Integration with EHR systems – Real hospital workflow integration
  • [ ] Explainable AI – Provide clinically interpretable feature importance for each prediction

Tech Stack

Machine Learning

  • Python 3.10
  • XGBoost
  • Scikit-learn 1.
  • Pandas
  • NumPy
  • Joblib

Backend

  • Flask
  • Flask-CORS
  • Gunicorn (WSGI server)

Frontend

  • HTML5
  • CSS3
  • Vanilla JavaScript
  • Leaflet.js (maps)
  • GeoJSON

Data

  • NY SPARCS dataset (2.4M patient records)

Deployment

  • Render.com (current)
  • Microsoft Azure App Service (previous)

Tools

  • Git
  • Claude AI (development assistant)
  • Jupyter Notebooks (EDA)

Project Links


Kaggle Notebooks

Exploratory data analysis Notebook

Model Tuning notebook

Pipeline Fitting notebook

Installation & Local Development

Prerequisites

  • Python 3.13+
  • pip

Setup

  • Clone the Repository
git clone https://github.com/metrosmash/Hospital_LOS_Predictor
cd Hospital_LOS_Predictor
  • Create Virtual Environment
    • Windows
python -m venv venv
venv\Scripts\activate
    • macOS/Linux
python3 -m venv venv
source venv/bin/activate
  • Install Dependencies
pip install -r requirements.txt
  • Verify Model Files

Ensure these files exist in assets/pkl_files/:

✅ xgb_modelv1.pkl
✅ xgb_hospital_full_pipeline.pkl
✅ feature_names.pkl
✅ mdc_mapping.pkl
✅ severity_mapping.pkl
✅ mdc_conversion_mapping.pkl

Note: If model files are missing, you'll need to train the model first. See Training the Model below.

  • Verify Data Files

Check assets/data/:

✅ ny_counties.geojson
✅ hospital_location_geojson1.geojson
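
Both checks can be automated with a short script. A minimal sketch, assuming it is run from the repository root. The file names are the ones listed above, and `missing_assets` is a hypothetical helper:

```python
from pathlib import Path

# Files the app expects, grouped by folder (taken from the lists above).
EXPECTED = {
    "assets/pkl_files": [
        "xgb_modelv1.pkl",
        "xgb_hospital_full_pipeline.pkl",
        "feature_names.pkl",
        "mdc_mapping.pkl",
        "severity_mapping.pkl",
        "mdc_conversion_mapping.pkl",
    ],
    "assets/data": [
        "ny_counties.geojson",
        "hospital_location_geojson1.geojson",
    ],
}


def missing_assets(root="."):
    """Return the relative paths of expected files that are not present."""
    root = Path(root)
    return [
        (Path(folder) / name).as_posix()
        for folder, names in EXPECTED.items()
        for name in names
        if not (root / folder / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_assets()
    print("All assets present." if not missing else "Missing: " + ", ".join(missing))
```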

Updates

Improvement

  • Updated the index, county_map, prediction_form, and prediction_result HTML pages to use Tailwind for a more responsive web app with better styling
  • Worked on the bug that shows "page not found" when clicking (counties > county_name)