Kinase-Inhibitor Binding Affinity Prediction Using XGBoost Regression
Abstract
Understanding the interaction between kinases and inhibitors is crucial for drug discovery and biomedical research. Our project aims to predict dissociation constants for kinase-inhibitor pairs using a regression-based machine learning approach. By leveraging feature extraction techniques and an optimized XGBoost model, we provide a predictive tool that enables researchers to assess kinase-inhibitor binding affinity with high precision.
Exploration & Data Understanding
The dataset consists of three primary tables:
- Table 1: 442 Kinases, including attributes such as Accession Number, Gene Symbol, Kinase Name, Mutations, and Kinase Group.
- Table 2: 60 Inhibitors, with details on chemical structures (SMILES), binding modes, and selectivity scores.
- Table 3: Dissociation Constants for all kinase-inhibitor pairs, forming a 442 x 60 matrix. A lower dissociation constant indicates a stronger binding affinity.
Total kinase-inhibitor interactions: 26,520.
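The three tables above can be joined into one training set with a row per kinase-inhibitor pair. A minimal sketch using pandas, with placeholder column names (`accession`, `inhibitor_id`, `kd`) and random stand-in values, since the real table headers and data are not shown here:

```python
import numpy as np
import pandas as pd

# Hypothetical schemas -- the real tables use their own headers.
kinases = pd.DataFrame({
    "accession": [f"K{i}" for i in range(442)],
    "group": np.random.randint(0, 9, 442),
})
inhibitors = pd.DataFrame({
    "inhibitor_id": [f"I{j}" for j in range(60)],
    "binding_mode": np.random.randint(0, 2, 60),
})
# The 442 x 60 dissociation-constant matrix, flattened to one row per pair.
kd = pd.DataFrame(
    np.random.rand(442, 60),
    index=pd.Index(kinases["accession"], name="accession"),
    columns=pd.Index(inhibitors["inhibitor_id"], name="inhibitor_id"),
)
pairs = kd.stack().rename("kd").reset_index()

# Join kinase and inhibitor features onto every pair.
data = pairs.merge(kinases, on="accession").merge(inhibitors, on="inhibitor_id")
print(len(data))  # 26520
```

Flattening the matrix this way is what yields the 26,520 training examples.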
Strategy & Solution
We opted for a regression-based approach rather than classification since binding affinity is continuous rather than categorical. Regression allows for precise dissociation constant predictions, enabling flexible threshold selection and easier integration of new kinase-inhibitor data.
Feature Engineering
A feature matrix was constructed combining kinase and inhibitor characteristics. Key features included:
- Kinase Amino Acid Sequences: Converted into numerical embeddings using ProteinBERT, capturing biologically meaningful sequence features.
- Kinase Mutations: Encoded numerically to account for structural and electrostatic changes affecting binding.
- Kinase Groups: Label-encoded to reflect shared binding patterns among kinases.
- Inhibitor Binding Modes: Categorized as Type 1 (ATP-competitive, binding the active conformation) or Type 2 (binding the inactive, DFG-out conformation) and label-encoded.
- SMILES Strings: Processed using RDKit to generate molecular fingerprints, representing chemical structures numerically.
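The SMILES-to-fingerprint step can be sketched with RDKit's Morgan (ECFP-style) fingerprints; the radius and bit-vector length below are illustrative choices, not the project's confirmed settings:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_fingerprint(smiles: str, n_bits: int = 2048) -> np.ndarray:
    """Convert a SMILES string to a binary Morgan fingerprint vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp, dtype=np.uint8)

fp = smiles_to_fingerprint("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(fp.shape)  # (2048,)
```

Each inhibitor thus becomes a fixed-length binary vector that can be concatenated with the kinase features.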
Dimensionality Reduction with PCA
ProteinBERT embeddings produced ~16,000 features per kinase, requiring dimensionality reduction for efficiency. Principal Component Analysis (PCA) reduced this to 300 components while retaining essentially all of the variance (with only 442 kinases, the embedding matrix has at most 442 informative components), optimizing performance for the machine learning model.
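A minimal sketch of this reduction with scikit-learn, using a random matrix as a stand-in for the real ProteinBERT embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(442, 16000))  # stand-in for ProteinBERT output

# Project the 16,000-dimensional embeddings onto 300 principal components.
pca = PCA(n_components=300)
reduced = pca.fit_transform(embeddings)
print(reduced.shape)  # (442, 300)
```

In practice, `pca.explained_variance_ratio_.sum()` reports how much variance the 300 components retain, which is how the claim above can be verified on the real embeddings.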
Machine Learning Model - XGBoost
We selected XGBoost due to its:
- High-speed training and inference.
- Compatibility with regression tasks using gradient-boosted trees.
- Robustness in handling structured data with high-dimensional features.
Evaluation Metrics
We utilized Mean Absolute Error (MAE, the L1 loss) as our primary evaluation metric. MAE measures the average absolute deviation between predicted and measured dissociation constants, is reported in the same units as the target, and is less sensitive to outliers than squared-error metrics.
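The metric itself is a one-liner; a small worked example on made-up values:

```python
import numpy as np

y_true = np.array([1.2, 0.5, 3.0, 2.1])  # measured constants (illustrative)
y_pred = np.array([1.0, 0.7, 2.5, 2.0])  # model predictions (illustrative)

# Mean Absolute Error: average of |prediction - truth|.
mae = np.mean(np.abs(y_true - y_pred))
print(float(mae))  # 0.25
```

Equivalently, `sklearn.metrics.mean_absolute_error(y_true, y_pred)` computes the same quantity.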
Impact & Future Applications
Our approach provides a scalable solution for kinase-inhibitor affinity prediction, aiding drug discovery efforts. The model can be expanded to new kinases and inhibitors, offering a valuable tool for pharmaceutical research and bioinformatics applications.
This project demonstrates the power of integrating biological data with machine learning for real-world applications in precision medicine.