Analyzing Data from AIDS Study

Inspiration

This project was inspired by a desire to understand how early antiretroviral treatments influence immune recovery and long-term outcomes in patients with HIV/AIDS. Using data from the AIDS Clinical Trials Group (ACTG) Study 175, we explored how treatment choice, baseline immune health, and patient characteristics interact, and how data science can uncover clinically meaningful patterns in historical clinical trial data.

What We Built

We conducted an end-to-end analysis combining statistical methods, machine learning, and model interpretability to study treatment effectiveness and predict clinical failure.

Our workflow included:

Exploratory analysis of immune markers (CD4, CD8, and CD4/CD8 ratio)
Comparison of treatment regimens over time
Survival analysis using Kaplan–Meier curves
Predictive modeling of treatment failure risk
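
As an illustration of the survival-analysis step, here is a minimal sketch of the Kaplan–Meier estimator in plain Python. It is not the code used in the project: the function name `kaplan_meier` and the toy durations and event flags are assumptions made up for this example, and a survival library would normally handle this.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient
    events:    1 if the event was observed, 0 if the patient was censored
    """
    observed = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # everyone who exits the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Events and censored patients both leave the risk set after t.
        at_risk -= leaving[t]
    return curve


# Toy data (hypothetical, not taken from ACTG 175).
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]

for t, s in kaplan_meier(durations, events):
    print(f"t={t:>2}  S(t)={s:.3f}")
```

On this toy input the curve steps down at times 5, 8, 12, and 20. Censored patients (event flag 0) never lower the curve themselves; they only shrink the risk set for later event times.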

We trained two main models:

Logistic Regression as an interpretable baseline
Random Forest to capture non-linear relationships and feature interactions
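
A rough sketch of how the two models could be trained and compared with scikit-learn is shown below. The synthetic dataset, the hyperparameters, and the choice of ROC AUC as the comparison metric are all assumptions for illustration, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the trial data (about 25% positives).
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    weights=[0.75, 0.25],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # The scaler lives inside the pipeline, so it is fit on training data only.
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Score with predicted probabilities for the positive (failure) class.
    scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: ROC AUC = {scores[name]:.3f}")
```

Keeping preprocessing inside the pipeline is one simple way to avoid leaking test-set statistics into training.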

To ensure transparency, we used SHAP values to explain which features most strongly influenced model predictions.
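
To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a tiny toy function. This is not the `shap` library API and not the project's code; `shapley_values`, `toy_score`, and the input values are hypothetical names and numbers chosen for illustration. SHAP approximates this same quantity efficiently for real models.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Features outside a coalition are replaced by their baseline value.
    Cost grows exponentially with the number of features, so this is
    only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(z):
    # One additive term plus one interaction between features 1 and 2.
    return 2 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_score, x, baseline)
print(phi)
```

The attributions sum to `toy_score(x) - toy_score(baseline)`, and the interaction term is split evenly between the two features that produce it.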

Key Insights

Combination therapies (ZDV + ddI, ZDV + Zal) consistently outperformed monotherapies in immune recovery and survival probability.
CD8 inflammation levels were more predictable than CD4 recovery, suggesting different biological dynamics.
Prior treatment exposure significantly increased failure risk, a pattern consistent with the impact of drug resistance.
Counterfactual analysis showed that remaining on treatment could substantially reduce predicted failure risk for high-risk patients.
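
One way such a counterfactual comparison could be set up is sketched below, under stated assumptions: the data is simulated, the two features (an off-treatment flag and a standardized baseline CD4 value) and the 0.5 risk threshold are hypothetical, and the result is a model-based "what if" rather than a causal estimate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Simulated features (hypothetical stand-ins for the trial columns).
off_treatment = rng.integers(0, 2, size=n)  # 1 = stopped treatment early
cd4_z = rng.normal(size=n)                  # standardized baseline CD4

# Simulated outcome: stopping treatment raises risk, higher CD4 lowers it.
logit = -1.0 + 2.0 * off_treatment - 0.8 * cd4_z
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X = np.column_stack([off_treatment, cd4_z])
model = LogisticRegression().fit(X, y)

# Factual predictions for every patient.
risk = model.predict_proba(X)[:, 1]

# High-risk patients who went off treatment.
mask = (risk > 0.5) & (off_treatment == 1)

# Counterfactual: same patients, but flagged as remaining on treatment.
X_cf = X[mask].copy()
X_cf[:, 0] = 0
risk_cf = model.predict_proba(X_cf)[:, 1]

print(f"patients considered:      {mask.sum()}")
print(f"mean factual risk:        {risk[mask].mean():.3f}")
print(f"mean counterfactual risk: {risk_cf.mean():.3f}")
```

Only the treatment flag changes between the two predictions, so the gap reflects what the fitted model attributes to staying on treatment for these patients.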

What We Learned

Through this project, we learned how to:

Handle clinical datasets responsibly while avoiding data leakage
Balance interpretability and performance in healthcare models
Evaluate models beyond accuracy, focusing on minority-class risk
Use explainable AI tools to turn black-box models into actionable insights
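
The point about looking beyond accuracy can be shown with a small plain-Python sketch. The helper `minority_class_report` and the toy labels are assumptions made up for this example, not results from the project.

```python
def minority_class_report(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: 90 patients without failure, 10 with failure.
y_true = [0] * 90 + [1] * 10

# A model that never predicts failure.
always_negative = [0] * 100
# A model that raises 8 false alarms but catches 6 of the 10 failures.
imperfect = [1] * 8 + [0] * 82 + [1] * 6 + [0] * 4

report_negative = minority_class_report(y_true, always_negative)
report_imperfect = minority_class_report(y_true, imperfect)
print(report_negative)
print(report_imperfect)
```

The always-negative model reaches 90% accuracy while finding no failures at all, whereas the second model scores slightly lower on accuracy (88%) but recovers 60% of the failures. That gap is why recall and precision on the minority class are more informative here than accuracy alone.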

Challenges

Class imbalance made treatment failure prediction difficult and required careful metric selection.
Ensuring model outputs aligned with clinical intuition was non-trivial.
Feature engineering had to be done cautiously to preserve validity and interpretability.

Takeaway

This project demonstrates how data science can be used not just to predict outcomes, but to understand them. By combining statistical analysis, machine learning, and explainability, we show how historical clinical trial data can still inform better treatment insights and patient risk stratification.

Built With

matplotlib
numpy
pandas
preprocessing
python
random-forest-models
scikit-learn
seaborn
shap

Updates

Aiperi Akzholtoeva started this project — Jan 25, 2026 04:46 AM EST
