Our project predicts binding affinity as pIC50 (-log10(IC50)) from a drug molecule represented as a SMILES string and a protein amino acid sequence, allowing researchers to screen potential candidates before moving to costly experiments. We built a hybrid machine learning pipeline that combines cheminformatics with protein language models: RDKit generates Morgan fingerprints that capture molecular structure, and ProtT5 generates dense embeddings that encode biological information from protein sequences. We then applied PCA to reduce dimensionality while preserving most of the variance, and trained an XGBoost regressor to predict pIC50 values from the combined feature space.

Along the way we faced challenges such as the high computational cost of generating protein embeddings, handling invalid molecular inputs, keeping the training and testing pipelines consistent, resolving feature dimension mismatches, and managing complex environment dependencies. Despite these obstacles, we built an end-to-end pipeline from raw biochemical data to predictions and demonstrated that combining domain-specific chemical features with modern protein embeddings can produce meaningful results.

We learned the importance of strong feature engineering, careful data preprocessing, and robust debugging in real-world machine learning systems. Moving forward, we could improve performance by exploring more advanced models that directly learn drug-protein interactions, strengthen validation with cross-validation, incorporate richer structural information, and otherwise make the predictions more precise.
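To make the prediction target concrete, here is a minimal sketch of the pIC50 conversion. It assumes IC50 values are recorded in nanomolar, which is common in public bioactivity data but is not stated in our writeup; the function names are hypothetical and not taken from our codebase.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 to pIC50 = -log10(IC50 in molar).

    Assumption: the input is in nanomolar (1 nM = 1e-9 M), so
    -log10(ic50_nm * 1e-9) simplifies to 9 - log10(ic50_nm).
    """
    if ic50_nm <= 0:
        # log10 is undefined for zero or negative concentrations
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


def pic50_to_ic50_nm(pic50: float) -> float:
    """Inverse conversion: map a predicted pIC50 back to an IC50 in nanomolar."""
    return 10.0 ** (9.0 - pic50)


if __name__ == "__main__":
    # A higher pIC50 means a lower IC50, i.e. a more potent compound.
    for ic50 in (10.0, 1000.0, 100000.0):
        print(f"IC50 = {ic50:g} nM -> pIC50 = {ic50_nm_to_pic50(ic50):.2f}")
```

Working on the log scale means a one-unit difference in pIC50 corresponds to a tenfold difference in IC50, so a compound with an IC50 of 1000 nM (1 µM) has a pIC50 of 6.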
