Developer Post: Optimizing Patient Survival Prediction with Logistic Regression

Introduction

As part of a recent hackathon challenge, our team set out to build a model that predicts patient survival from a dataset provided by a hospital. The dataset included a variety of attributes such as race, age, and more. Among the algorithms we considered, we chose logistic regression because it suits binary classification tasks like ours: predicting whether a patient would "survive" or "not survive." Here's a deep dive into our approach and findings.

Why Logistic Regression?

Logistic regression models the probability of a binary outcome, which makes it a natural match for our "survive" or "not survive" scenario. Here's why we preferred it over other algorithms:

Linear Regression: Inappropriate for our case because it predicts continuous, unbounded values rather than the probability of a binary outcome.
Decision Trees: While effective, they can easily overfit, especially with numerous features.
Support Vector Machines (SVMs): SVMs are powerful but can be computationally intensive and harder to interpret, a real drawback for a team new to data science like ours.

Logistic regression, in contrast, provides a good balance of efficiency, interpretability, and suitability for binary outcomes.
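To make the "binary outcome" point concrete, here is a minimal sketch of how a logistic regression model turns a weighted sum of attributes into a probability and then into a label. The weights, bias, and feature values are hypothetical numbers chosen for illustration; they are not taken from our dataset or our trained model.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real-valued score into a probability between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_survival(features, weights, bias, threshold=0.5):
    """Return (probability of survival, label) for one patient."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    label = "survive" if probability >= threshold else "not survive"
    return probability, label


# Hypothetical weights for two made-up, already-scaled features.
weights = [-0.8, 1.2]
bias = 0.1

probability, label = predict_survival([0.5, 1.0], weights, bias)
print(f"{probability:.3f} -> {label}")  # prints "0.711 -> survive"
```

The linear score here is 0.9, which a linear regression would report as-is; the sigmoid is what turns it into a probability that can be compared against a threshold.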

Ensuring Model Robustness

Our primary objective was to develop a model that neither overfits nor underfits. To assess this, we split our dataset in half, training the model on the first half and testing it on the second. We then compared accuracy across the two sets: high accuracy on both, with little gap between them, indicates a model that has learned the underlying pattern rather than memorized the training data. Our results were promising:

Training Data Accuracy: 85.8%
Testing Data Accuracy: 86.23%

Accuracy is high on both sets and differs by less than half a percentage point, which suggests a balanced model that generalizes beyond the training data.
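The half-and-half check described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the hackathon: the data is synthetic, the two features are invented, and the model is a small hand-written logistic regression trained by batch gradient descent rather than any particular library's implementation.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(row, w, b):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))


def train_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias with plain batch gradient descent on log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = predict_proba(row, w, b) - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(X, y, w, b):
    """Fraction of rows whose thresholded prediction matches the label."""
    hits = sum(
        (predict_proba(row, w, b) >= 0.5) == (label == 1)
        for row, label in zip(X, y)
    )
    return hits / len(y)


# Synthetic stand-in for the hospital data (assumption: two numeric features).
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if x1 + x2 > 0 else 0 for x1, x2 in X]

# First half for training, second half for testing, as in the post.
half = len(X) // 2
X_train, y_train = X[:half], y[:half]
X_test, y_test = X[half:], y[half:]

w, b = train_logistic(X_train, y_train)
train_acc = accuracy(X_train, y_train, w, b)
test_acc = accuracy(X_test, y_test, w, b)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(f"gap: {abs(train_acc - test_acc):.3f}")
```

A large gap with training accuracy well above testing accuracy would point to overfitting, while low accuracy on both halves would point to underfitting.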

Model Optimization and Feature Analysis

In pursuit of even greater accuracy, we examined the impact of individual attributes on model performance. By removing one attribute at a time and observing the change in accuracy, we aimed to identify and exclude irrelevant features. Our strategy involved:

Excluding Low-Impact Features: Removing attributes whose removal changed accuracy by less than 1%. This strategy led to a small decrease in training data accuracy (by about 0.3%) but increased testing data accuracy by 0.2%.
Removing Detrimental Variables: Eliminating features whose removal improved model accuracy. This adjustment boosted training data accuracy by 0.4% and testing data accuracy by 0.7%.

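Notably, 16 attributes each had less than 1% impact on accuracy when removed, which suggests they were largely irrelevant to the prediction. It also hints at a more complex interplay among patient attributes, since removing features one at a time cannot reveal effects that only appear in combination. This is a potential area for future exploration.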
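The remove-one-attribute-at-a-time procedure described above can be sketched like this. Everything here is an assumption made for illustration: the data is synthetic, the feature names (signal_a, signal_b, noise) are invented, the model is a small hand-written logistic regression, and the accuracy change is measured on the held-out half.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(row, w, b):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))


def fit(X, y, lr=0.5, epochs=200):
    """Logistic regression via batch gradient descent (illustrative only)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = predict_proba(row, w, b) - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(X, y, w, b):
    hits = sum(
        (predict_proba(row, w, b) >= 0.5) == (label == 1)
        for row, label in zip(X, y)
    )
    return hits / len(y)


def select(X, keep):
    """Keep only the columns whose indices are listed in `keep`."""
    return [[row[j] for j in keep] for row in X]


def ablation_deltas(names, X_train, y_train, X_test, y_test):
    """Retrain without each feature; return baseline and accuracy changes."""
    def score(keep):
        w, b = fit(select(X_train, keep), y_train)
        return accuracy(select(X_test, keep), y_test, w, b)

    everything = list(range(len(names)))
    baseline = score(everything)
    deltas = {}
    for j, name in enumerate(names):
        deltas[name] = score([k for k in everything if k != j]) - baseline
    return baseline, deltas


# Synthetic data: two informative features and one pure-noise feature.
rng = random.Random(0)
names = ["signal_a", "signal_b", "noise"]
X = [[rng.gauss(0, 1) for _ in names] for _ in range(400)]
y = [1 if row[0] + row[1] > 0 else 0 for row in X]
half = len(X) // 2

baseline, deltas = ablation_deltas(names, X[:half], y[:half], X[half:], y[half:])

# The two strategies from the post: under 1% change, or accuracy improves.
low_impact = [n for n, d in deltas.items() if abs(d) < 0.01]
detrimental = [n for n, d in deltas.items() if d > 0]

print(f"baseline accuracy: {baseline:.3f}")
for name, delta in deltas.items():
    print(f"without {name}: {delta:+.3f}")
print("low impact:", low_impact, "| detrimental:", detrimental)
```

Because each feature is removed in isolation, two redundant features can both look low-impact even though dropping both together would hurt, which is one way the interplay among attributes can show up.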

Challenges and Learning

For our beginner-level team, the challenges were multifaceted:

Selecting the Right Algorithm: With limited experience in classification algorithms, determining the best fit for our problem was the first hurdle.
Data Parsing and Preparation: Even basic tasks like data cleaning and preparation were learning opportunities for us.
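As an illustration of the kind of preparation involved, here is a minimal sketch that turns raw text records into numeric rows a logistic regression can consume. The records, field values, and category labels are invented, and the specific choices (filling a missing age with the mean, standardizing, one-hot encoding with an "unknown" bucket) are assumptions for the example rather than a record of what we did.

```python
from statistics import mean, pstdev

# Hypothetical raw records; values are invented for illustration.
records = [
    {"age": "54", "race": "group_a"},
    {"age": "", "race": "group_b"},
    {"age": "71", "race": "group_a"},
    {"age": "38", "race": None},
]


def parse_number(text):
    """Return a float, or None when the field is blank or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


# Numeric attribute: parse, fill gaps with the mean of the known values.
ages = [parse_number(r["age"]) for r in records]
fill = mean(a for a in ages if a is not None)
ages = [fill if a is None else a for a in ages]

# Standardize so the feature has zero mean and unit variance.
mu = mean(ages)
sigma = pstdev(ages) or 1.0  # avoid dividing by zero for a constant column
scaled_ages = [(a - mu) / sigma for a in ages]

# Categorical attribute: one indicator column per category.
races = [r["race"] or "unknown" for r in records]
categories = sorted(set(races))
one_hot = [[1.0 if race == c else 0.0 for c in categories] for race in races]

# Final numeric rows: [scaled_age, indicator columns...]
rows = [[age] + flags for age, flags in zip(scaled_ages, one_hot)]
for row in rows:
    print([round(v, 3) for v in row])
```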

Conclusion

Logistic regression proved effective at predicting patient outcomes on our dataset. Through careful feature analysis and optimization, we improved prediction accuracy by around 1%, a gain that adds up to many more correct predictions when a dataset is large. As we continue to explore the nuances of predictive modeling in healthcare, we look forward to uncovering more insights and refining our approach. Our journey in this hackathon has been enlightening, providing a strong foundation for our future endeavors in data science and machine learning.