Interactive map showing all 62 NY counties with click-to-select functionality
13-feature clinical assessment form with real-time validation
Detailed prediction with confidence interval and risk factor analysis
Zoom into selected county showing all hospitals as map markers

Hospital-LOS-Predictor

Inspiration

In 2022, I began my machine learning journey with Data Science Nigeria (DSN) at my university, where my mentors assigned this hospital length-of-stay prediction project as a learning exercise. I revisited it for this hackathon with a new goal: build it to real-world production standards rather than as an academic exercise.

Problem Statement

Hospital length of stay (LOS) prediction addresses critical needs from two perspectives:

1. Patient Perspective

When admitted to a hospital, patients and families need to know how long the stay will last. This information impacts:

Financial planning for medical bills
Work and family arrangements
Psychological preparation and peace of mind

2. Hospital Management Perspective

Healthcare administrators face constant pressure to optimize limited resources. Accurate LOS predictions enable:

Efficient bed allocation – Anticipate availability for incoming patients
Optimized staff scheduling – Align nursing and physician resources with demand
Cost management – Reduce operational waste and improve financial planning
Enhanced patient care – Better discharge planning and post-acute care coordination

In overcrowded hospitals, knowing when beds will become available can save lives by ensuring critical patients get timely care.

What It Does

Hospital-LOS-Predictor is a full-stack machine learning web application that predicts patient length of stay in New York State hospitals with 95% confidence intervals.

Core Features:

Interactive map-based workflow – Select from 62 NY counties and 200+ hospitals using Leaflet.js
Clinical assessment form – 13 easily accessible features (age, gender, admission type, diagnosis, severity, insurance type, etc.)
Real-time ML predictions – XGBoost model trained on 2.4M patient records from the NY SPARCS dataset
Risk factor analysis – Identifies clinical factors contributing to predicted LOS with impact quantification
Professional healthcare UI – Clean, responsive design suitable for hospital staff use

Example Output:

Predicted LOS: 4 days, 5 hours (confidence interval: 3.1–5.4 days)
Contributing factors: High clinical severity (+2-4 days), Emergency admission (+1-3 days)
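
A minimal sketch of how output like the example above could be produced, assuming the model returns a fractional number of days and the risk factors come from a simple rule table. The field names (`severity`, `admission_type`), the rule table, and both helper functions are hypothetical and are not taken from the project's code.

```python
def format_los(days: float) -> str:
    """Render a fractional-day prediction as whole days and hours."""
    total_hours = round(days * 24)
    whole_days, hours = divmod(total_hours, 24)
    return f"{whole_days} days, {hours} hours"


# Hypothetical rule table: (field, value) -> (label, estimated impact).
RISK_RULES = {
    ("severity", "Extreme"): ("High clinical severity", "+2-4 days"),
    ("admission_type", "Emergency"): ("Emergency admission", "+1-3 days"),
}


def contributing_factors(patient: dict) -> list[str]:
    """Return a label for every rule the patient record matches."""
    factors = []
    for (field, value), (label, impact) in RISK_RULES.items():
        if patient.get(field) == value:
            factors.append(f"{label} ({impact})")
    return factors


patient = {"severity": "Extreme", "admission_type": "Emergency"}
print("Predicted LOS:", format_los(4.2))
print("Contributing factors:", ", ".join(contributing_factors(patient)))
```

Rounding the total number of hours first, then splitting into days and hours, avoids ever displaying "24 hours" when the fractional part is close to a full day.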

How I Built It

1. Data Engineering & Model Development

Dataset: 2.4M patient discharge records from the NY.gov SPARCS database
Data cleaning: Handled missing values, outliers (120+ day stays), and imbalanced classes
Feature engineering:
- Selected 13 accessible features that patients/relatives would know at admission
- Created target-encoded features (LOS_per_MDC, LOS_per_severity)
- One-hot encoded categorical variables → 312 final features
Model training: XGBoost Regressor with hyperparameter tuning via GridSearchCV
Pipeline design: Built reusable scikit-learn preprocessing pipeline saved as .pkl for production use
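
To illustrate the target-encoded features mentioned above (such as LOS_per_MDC), here is a small library-free sketch. It assumes the encoding is the mean length of stay per category with a global-mean fallback for unseen categories; the project's actual encoding may differ, and the row keys `mdc` and `los` are illustrative.

```python
from collections import defaultdict


def fit_target_encoding(rows, key, target="los"):
    """Learn the mean target per category plus a global fallback mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row[key]] += row[target]
        counts[row[key]] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(sums.values()) / sum(counts.values())
    return means, global_mean


def apply_target_encoding(value, means, global_mean):
    """Encode one category, falling back to the global mean if unseen."""
    return means.get(value, global_mean)


# Toy training rows; the field names "mdc" and "los" are illustrative.
train = [
    {"mdc": "Circulatory", "los": 6.0},
    {"mdc": "Circulatory", "los": 4.0},
    {"mdc": "Pregnancy", "los": 2.0},
]
means, global_mean = fit_target_encoding(train, key="mdc")
print(apply_target_encoding("Circulatory", means, global_mean))  # 5.0
print(apply_target_encoding("Unknown", means, global_mean))      # 4.0
```

The means should be learned from training rows only, otherwise the encoded feature leaks the target into validation data.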

2. Web Application Development

Frontend (with Claude AI assistance):

Vanilla HTML5, CSS3, JavaScript (no frameworks for lightweight performance)
Leaflet.js for interactive county/hospital maps with GeoJSON data
Custom state management using URL parameters (bookmarkable, shareable links)
Client-side form validation with real-time error feedback

Backend:

Python Flask REST API
- /api/predict – Handles feature preprocessing and model inference
- /api/health – Health check endpoint
- Risk factor identification logic based on clinical domain knowledge
Preprocessing integration: Custom HospitalDataCleaner class handles:
- Diagnosis group mapping (MDC codes)
- Target encoding for high-cardinality features
- Column alignment to training set (312 features)
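
The two endpoints and the column-alignment step can be sketched roughly as follows. This is an assumption-laden outline, not the project's code: `FEATURE_NAMES` stands in for the 312 training columns, `predict_days` is a placeholder for the real pipeline call, and the response field `predicted_los_days` is a made-up name.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for the 312 training columns.
FEATURE_NAMES = ["age_group_70_or_older", "admission_type_Emergency", "LOS_per_MDC"]


def align_columns(encoded: dict) -> list:
    """Order features as in training: fill missing with 0, drop extras."""
    return [float(encoded.get(name, 0.0)) for name in FEATURE_NAMES]


def predict_days(vector: list) -> float:
    """Placeholder for the real model call; returns a dummy value."""
    return 1.0 + sum(vector)


@app.route("/api/health")
def health():
    return jsonify(status="ok")


@app.route("/api/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        return jsonify(error="Expected a JSON object"), 400
    try:
        vector = align_columns(payload)
    except (TypeError, ValueError):
        return jsonify(error="Feature values must be numeric"), 400
    return jsonify(predicted_los_days=predict_days(vector))
```

Aligning against a saved column list means a request that one-hot encodes to fewer columns than the training set still produces a vector of the shape the model expects.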

3. Deployment Journey

Microsoft Azure App Service – Initial deployment using Gunicorn WSGI server
- Configured App Service Plan, WSGI handlers, and environment variables
- Learned Azure-specific quirks (Kudu deployment, Oryx build system)
- Migrated away due to cost concerns (free credits expiring)
Render.com – Current production deployment (free tier, 750 hours/month)
- Simpler deployment workflow with native Flask support
- Better suited for ML apps with large model files

Challenges I Ran Into

1. Production-Ready ML Pipeline Design (Biggest Challenge)

Creating a flexible, maintainable pipeline that could:

Serialize the entire preprocessing workflow (not just the model)
Handle unknown categories in production (e.g., new diagnosis codes)
Be swappable – if I train a better model, just replace the .pkl file

Solution: Built a custom HospitalDataCleaner transformer that inherits from scikit-learn's BaseEstimator and TransformerMixin, so the full pipeline can be serialized with joblib.
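
The pattern can be sketched with a much smaller transformer. `MeanTargetEncoder` below is a hypothetical stand-in, not the project's HospitalDataCleaner, and a LinearRegression replaces XGBoost to keep the example light. It shows the three goals together: preprocessing and model serialized as one object, unseen categories handled at inference time, and a pipeline that can be swapped by replacing a single file.

```python
import io

import joblib
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class MeanTargetEncoder(BaseEstimator, TransformerMixin):
    """Replace a categorical column with the mean target seen in training."""

    def __init__(self, column="mdc"):
        self.column = column

    def fit(self, X, y):
        target = pd.Series(list(y), index=X.index)
        self.means_ = target.groupby(X[self.column]).mean().to_dict()
        self.global_mean_ = float(target.mean())
        return self

    def transform(self, X):
        # Categories never seen in training fall back to the global mean.
        encoded = X[self.column].map(self.means_).fillna(self.global_mean_)
        return encoded.to_frame(name=f"LOS_per_{self.column}")


X = pd.DataFrame({"mdc": ["A", "A", "B"]})
y = [6.0, 4.0, 2.0]
pipeline = Pipeline(
    [("encode", MeanTargetEncoder()), ("model", LinearRegression())]
).fit(X, y)

# Round-trip the whole pipeline through joblib (in memory here).
buffer = io.BytesIO()
joblib.dump(pipeline, buffer)
buffer.seek(0)
restored = joblib.load(buffer)

# "C" was never seen in training, yet prediction still works.
print(restored.predict(pd.DataFrame({"mdc": ["C"]})))
```

When loading from a real .pkl file, the transformer class must be importable under the same module path it had at dump time, which is one reason to keep it in a dedicated module rather than in a notebook.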

2. Debugging AI-Generated Code

Claude AI accelerated frontend development, but debugging required:

Reading through unfamiliar JavaScript patterns
Understanding Leaflet.js quirks with GeoJSON rendering
Tracing state management bugs across multiple HTML pages

Lesson learned: AI is a great accelerator, but you must deeply understand the code it generates

3. Azure Deployment Configuration

Struggled with Kudu deployment failures (NullReferenceException)
Learned the difference between traditional Flask servers and serverless functions
Had to configure WSGI handlers, startup commands, and environment variables correctly

Solution: Switched to Azure's native Python support instead of custom deployment scripts

4. Feature Engineering for Real-World Constraints

Unlike Kaggle competitions, where every feature is available, I set two constraints:

Use only features a patient/family would know at admission (no lab results, no retrospective data)
Balance predictive power with practical usability

These constraints forced creative feature engineering (target encoding, interaction features).

Accomplishments That I'm Proud Of

✅ Built a production-ready ML system – Not just a Jupyter notebook, but a full web app with API, frontend, and deployment
✅ Successfully deployed to two cloud platforms – Learned Azure App Service and Render.com deployment workflows
✅ Created reusable ML pipeline – Can swap models without touching application code
✅ Designed professional healthcare UI – Looks like a real analytics dashboard, not a student project
✅ Handled 312-feature encoding pipeline in production – Complex preprocessing works reliably in live environment
✅ Real-world dataset – 2.4M records with messy, imbalanced data (not clean Kaggle data)

This project demonstrates the complete ML lifecycle: data cleaning → feature engineering → model training → API development → frontend design → production deployment.

What I Learned

Technical Skills

MLOps fundamentals: Serializing entire pipelines, versioning models, handling schema drift
Production ML considerations: Feature availability at inference time, latency optimization, error handling
Full-stack development: Flask backend, vanilla JS frontend, RESTful API design
Cloud deployment: Azure App Service, Render.com, WSGI servers, environment configuration

Most Valuable Lesson: Business Context Matters More Than Model Metrics

In academic ML, we optimize for R² or RMSE. In production, I learned to prioritize:

Feature accessibility – Can the user actually provide this data?
Interpretability – Can hospital staff trust and understand predictions?
Operational impact – Does a 0.5-day prediction error actually matter for bed allocation?

Example: I could've achieved higher R² by including lab results and vital signs, but those aren't available at admission time. A slightly less accurate model that's actually usable is far more valuable.

Another insight: The confidence interval matters more than the point estimate. Telling staff "3.1–5.4 days" lets them plan conservatively, whereas "4.2 days" gives false precision.
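
The write-up does not say how the interval is computed. One common approach, shown here purely as an assumption, is to take empirical quantiles of residuals from a held-out validation set and add them to the point estimate. The function name and the residual values are made up for illustration.

```python
from statistics import quantiles


def residual_interval(point_estimate, residuals, floor=0.0):
    """Build an approximate 95% interval from held-out residuals.

    residuals are (actual - predicted) values from a validation set.
    """
    cuts = quantiles(residuals, n=40)  # 39 cut points, 2.5% apart
    lower = max(floor, point_estimate + cuts[0])   # 2.5th percentile
    upper = point_estimate + cuts[-1]              # 97.5th percentile
    return lower, upper


# Made-up validation residuals in days, for illustration only.
residuals = [-1.2, -0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.7, 1.1, 1.3]
low, high = residual_interval(4.2, residuals)
print(f"{low:.1f} to {high:.1f} days")
```

This gives every patient an interval of the same width. Methods such as quantile regression can produce wider intervals for harder cases, at the cost of training additional models.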

What's Next for Hospital-LOS-Predictor

Short-term Improvements

[ ] Mobile responsiveness – Optimize UI for tablets/phones used by hospital staff
[ ] Model performance – Experiment with CatBoost, LightGBM, or ensemble methods
[ ] Feature importance visualization – Add SHAP values to explain individual predictions
[ ] Unit tests – Add pytest coverage for API endpoints and preprocessing pipeline

Medium-term Enhancements

[ ] User authentication – Add login system for real hospital use (HIPAA compliance considerations)
[ ] Prediction history tracking – Log predictions to database for monitoring and improvement
[ ] Admin dashboard – Monitor model performance over time, detect drift
[ ] Multi-state support – Expand beyond NY to other states' hospital data

Long-term Vision

[ ] Model retraining pipeline – Automated retraining with new SPARCS data releases
[ ] A/B testing framework – Compare model versions in production
[ ] Integration with EHR systems – Real hospital workflow integration
[ ] Explainable AI – Provide clinically interpretable feature importance for each prediction

Tech Stack

Machine Learning

Python 3.10
XGBoost
Scikit-learn 1.
Pandas
NumPy
Joblib

Backend

Flask
Flask-CORS
Gunicorn (WSGI server)

Frontend

HTML5
CSS3
Vanilla JavaScript
Leaflet.js (maps)
GeoJSON

Data

NY SPARCS dataset (2.4M patient records)

Deployment

Render.com (current)
Microsoft Azure App Service (previous)

Tools

Git
Claude AI (development assistant)
Jupyter Notebooks (EDA)

Project Links

Live Demo: Live on Render
GitHub Repository: Github Repo
Dataset Source: NY.gov SPARCS Database

Kaggle Notebooks

Exploratory data analysis Notebook

Model Tuning notebook

Pipeline Fitting notebook

Installation & Local Development

Prerequisites

Python 3.13+
pip

Setup

Clone the Repository

git clone https://github.com/metrosmash/Hospital_LOS_Predictor
cd Hospital_LOS_Predictor

Create Virtual Environment
Windows

python -m venv venv
venv\Scripts\activate

macOS/Linux

python3 -m venv venv
source venv/bin/activate

Install Dependencies

pip install -r requirements.txt

Verify Model Files

Ensure these files exist in assets/pkl_files/:

✅ xgb_modelv1.pkl
✅ xgb_hospital_full_pipeline.pkl
✅ feature_names.pkl
✅ mdc_mapping.pkl
✅ severity_mapping.pkl
✅ mdc_conversion_mapping.pkl

Note: If model files are missing, you'll need to train the model first. See Training the Model below.

Verify Data Files

Check assets/data/:

✅ ny_counties.geojson
✅ hospital_location_geojson1.geojson
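
As a convenience, the two checks above can be scripted. This is a small sketch that assumes it is run from the repository root; the helper name `missing_files` is hypothetical, while the directory and file names are the ones listed above.

```python
from pathlib import Path

REQUIRED_FILES = {
    "assets/pkl_files": [
        "xgb_modelv1.pkl",
        "xgb_hospital_full_pipeline.pkl",
        "feature_names.pkl",
        "mdc_mapping.pkl",
        "severity_mapping.pkl",
        "mdc_conversion_mapping.pkl",
    ],
    "assets/data": [
        "ny_counties.geojson",
        "hospital_location_geojson1.geojson",
    ],
}


def missing_files(root="."):
    """Return the required files that are absent under root."""
    root = Path(root)
    return [
        str(Path(folder) / name)
        for folder, names in REQUIRED_FILES.items()
        for name in names
        if not (root / folder / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_files()
    print("All files present" if not missing else "Missing:\n" + "\n".join(missing))
```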

Built With

azure
css
flask
html
javascript
jupyter
kaggle
python
render
scikit-learn
xgboost

Updates

Ajiboye Toluwalase posted an update — Feb 13, 2026 04:37 AM EST

Improvement

Updated the index, county_map, prediction_form, and prediction_result HTML pages to use Tailwind for a more responsive web app with better styling
Worked on the bug that shows a "page not found" error when clicking (counties > county_name)

Ajiboye Toluwalase started this project — Feb 10, 2026 04:07 PM EST