What it does

This project develops a machine learning model that predicts probable drug-binding sites in human proteins. The model is trained on a dataset provided by CYCLICA, which includes features derived exclusively from AlphaFold2 protein structures and labels indicating whether each residue is a 'drug binding' or 'non-drug binding' site. The final model is evaluated on how reliably it predicts drug-binding sites, and the features most important for drug binding are reported. The project is intended to contribute to drug discovery and the development of new medicines.

Technologies used

I tried AdaBoost, XGBoost, stochastic gradient boosting, and Random Forest models. In a first pass with randomly chosen hyperparameters, Random Forest outperformed all the other models, so I selected it for hyperparameter tuning. The final predictions come from the tuned Random Forest model.
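
A first-pass comparison of this kind can be sketched roughly as follows. This is an illustration only, not the code used in the project: the data is synthetic (generated with `make_classification`), the hyperparameters are arbitrary, and XGBoost is left out so that the sketch depends on scikit-learn alone (an `xgboost.XGBClassifier` could be added to the dictionary in the same way). Stochastic gradient boosting is approximated here by `GradientBoostingClassifier` with `subsample` below 1.0.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the real residue table (about 10% positives).
X, y = make_classification(
    n_samples=600,
    n_features=12,
    n_informative=5,
    weights=[0.9, 0.1],
    random_state=0,
)

# Arbitrary starting hyperparameters, chosen only for illustration.
candidates = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    # subsample < 1.0 fits each tree on a random subset of rows,
    # which is what makes gradient boosting "stochastic".
    "Stochastic Gradient Boosting": GradientBoostingClassifier(
        n_estimators=50, subsample=0.8, random_state=0
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score each candidate with 3-fold cross-validation on average precision,
# a metric that stays informative when positives are rare.
scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=3, scoring="average_precision")
    scores[name] = fold_scores.mean()

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Best first-pass model:", best)
```

On the real dataset the folds would also need to respect protein boundaries, which plain `cv=3` does not do.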

Challenges faced

  • The dataset is quite imbalanced, and the rows are not independent of each other: each group of rows belongs to a single protein. I had to work out how to split the data into training and test sets while keeping each protein's rows together.
  • Choosing the most suitable evaluation metrics for an imbalanced dataset.
  • Long run times for hyperparameter tuning when running Grid Search over a high-dimensional grid.
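
One way to approach the split described in the first point is to shuffle protein identifiers instead of individual rows, so that every protein lands entirely in one set. The sketch below uses only the standard library; `split_by_protein` and the `protein_ids` list are hypothetical names, and the toy identifiers stand in for whatever column identifies the protein in the real dataset.

```python
import random
from collections import defaultdict


def split_by_protein(protein_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that no protein appears in both sets.

    test_fraction is the share of proteins (not rows) placed in the test set.
    """
    # Collect the row indices that belong to each protein.
    rows_by_protein = defaultdict(list)
    for row, pid in enumerate(protein_ids):
        rows_by_protein[pid].append(row)

    # Shuffle whole proteins, then cut the shuffled list once.
    proteins = sorted(rows_by_protein)
    random.Random(seed).shuffle(proteins)
    n_test = max(1, round(len(proteins) * test_fraction))

    test_idx = [r for p in proteins[:n_test] for r in rows_by_protein[p]]
    train_idx = [r for p in proteins[n_test:] for r in rows_by_protein[p]]
    return sorted(train_idx), sorted(test_idx)


# Toy example: five hypothetical proteins with different numbers of residues.
protein_ids = ["P1"] * 4 + ["P2"] * 3 + ["P3"] * 5 + ["P4"] * 2 + ["P5"] * 6
train_idx, test_idx = split_by_protein(protein_ids, test_fraction=0.2)

train_proteins = {protein_ids[i] for i in train_idx}
test_proteins = {protein_ids[i] for i in test_idx}
print("train proteins:", sorted(train_proteins))
print("test proteins:", sorted(test_proteins))
```

scikit-learn provides `GroupShuffleSplit` and `GroupKFold` for the same purpose; the hand-written version is shown only to make the idea explicit.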

Overall outcomes

  • ROC AUC: 0.89
  • PR AUC: 0.47
  • Recall: 0.15
  • F1 score: 0.26
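
For readers less familiar with these metrics, the sketch below shows how recall, F1, and a step-wise estimate of PR AUC (average precision) can be computed from labels and predicted scores. The data is a made-up toy example, not the project's predictions, and the 0.5 threshold is an assumption. Recall and F1 depend on the chosen threshold, while average precision depends only on how the scores rank the residues.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = drug binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


def average_precision(y_true, scores):
    """Mean of the precision values at each rank where a positive is found.

    Tied scores are not treated specially in this simplified version.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos if n_pos else 0.0


# Toy, imbalanced example: 2 positives among 10 residues.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.15, 0.30, 0.10, 0.25, 0.90, 0.40, 0.35]

# Assumed decision threshold of 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
ap = average_precision(y_true, scores)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} AP={ap:.3f}")
```

In the toy example the second positive scores below the threshold, so recall is low even though both positives rank near the top. Lowering the threshold trades precision for recall without changing the ranking-based metrics.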

The most important features are feat_PSI, feat_SCSASA, feat_pLDDT, feat_THETA, coord_Z, feat_TAU, entry_index, coord_X, coord_Y, feat_BBSASA, and DSSP_6-13.

Accomplishments that I am proud of

This is the first datathon I have competed in alone. Going through every stage of a machine learning project gave me a more holistic view of what I need to improve, and exposed me to challenges I would not have faced on a team. Whatever the result, I am proud that I had the motivation and self-discipline to complete the project within a short two-week span.

What I learned

  • Leave sufficient or even more than sufficient time for hyperparameter tuning.
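
To see why an exhaustive grid search takes so long, it helps to count the model fits it requires and compare that with a fixed random-search budget. The sketch below is an assumption-laden illustration: the grid values, the 5-fold setting, and the budget of 20 configurations are all hypothetical, and only the parameter names are taken from scikit-learn's `RandomForestClassifier`.

```python
import itertools
import random

# Hypothetical search space for a Random Forest.
param_grid = {
    "n_estimators": [100, 200, 400, 800],
    "max_depth": [None, 10, 20, 40],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
}


def grid_size(grid):
    """Number of distinct configurations in a full grid."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size


def sample_configs(grid, n_iter, seed=0):
    """Draw n_iter distinct configurations uniformly at random from the grid."""
    rng = random.Random(seed)
    all_configs = [
        dict(zip(grid, combo)) for combo in itertools.product(*grid.values())
    ]
    return rng.sample(all_configs, min(n_iter, len(all_configs)))


cv_folds = 5  # assumed number of cross-validation folds
total = grid_size(param_grid)
configs = sample_configs(param_grid, n_iter=20)

print("grid search fits:", total * cv_folds)
print("random search fits:", len(configs) * cv_folds)
```

Each sampled configuration would then be cross-validated as usual. scikit-learn's `RandomizedSearchCV` packages this sampling together with the cross-validation loop.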

What's next for Predicting Drug Binding Sites in Human Proteins

Deploy the model behind a web API so that anyone can easily test it on new data without having to dive into my code.
