🛰 Inspiration
Predicting molecular properties from SMILES strings remains a challenging task due to the complex structure-activity relationships in chemical compounds. We were inspired to combine the representational power of Graph Neural Networks (GNNs) with the robustness of ensemble machine learning to create a more accurate and generalizable solution.
⚙️ What it does
Our model takes SMILES (Simplified Molecular Input Line Entry System) strings and transforms them into graph representations using RDKit. These are then processed by a GNN to capture molecular-level information. The GNN outputs are combined with tabular features and passed to an ensemble of classifiers—Random Forest, XGBoost, LightGBM, and Logistic Regression—stacked using Scikit-learn’s StackingClassifier for final prediction.
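To make the SMILES-to-graph step concrete, here is a minimal sketch using RDKit. The function name `smiles_to_graph` and the choice of atomic number as the only node feature are illustrative assumptions, not the featurization our GNN actually used.

```python
from rdkit import Chem


def smiles_to_graph(smiles):
    """Return (node_features, edge_index) for one SMILES string.

    Hypothetical helper for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when the string cannot be parsed
        raise ValueError(f"Could not parse SMILES: {smiles!r}")

    # One feature per atom: its atomic number (an assumed, minimal choice).
    node_features = [atom.GetAtomicNum() for atom in mol.GetAtoms()]

    # Store every bond in both directions, as an undirected graph
    # is usually represented for message passing.
    edge_index = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edge_index.extend([(i, j), (j, i)])

    return node_features, edge_index


nodes, edges = smiles_to_graph("CCO")  # ethanol: two carbons and one oxygen
```

In a real pipeline the node features would typically also include properties such as degree, charge, or aromaticity, and the edge list would be converted to the tensor format of whichever GNN library is in use.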
🔧 How we built it
We used RDKit to convert SMILES strings into molecular graphs and trained a GNN to produce learned embeddings. These GNN-based features were concatenated with traditional tabular descriptors. The combined feature set was then fed into an ensemble of Random Forest, XGBoost, LightGBM, and Logistic Regression models, built with the Scikit-learn, XGBoost, and LightGBM libraries. We standardized features with StandardScaler and relied on cross-validated stacking for robust model fusion.
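The fusion step can be sketched roughly as follows. Everything here is an assumption for illustration: the data is random, the array sizes are arbitrary, Scikit-learn's `GradientBoostingClassifier` stands in for XGBoost and LightGBM so the example needs only one library, and Logistic Regression is placed as the meta-model.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 200

# Placeholder inputs: random values standing in for real features.
gnn_embeddings = rng.normal(size=(n_samples, 16))  # stand-in for GNN outputs
tabular = rng.normal(size=(n_samples, 8))          # stand-in for descriptors

# Concatenate both feature sources column-wise, one row per molecule.
X = np.hstack([gnn_embeddings, tabular])
y = (X[:, 0] + X[:, 16] > 0).astype(int)           # synthetic binary labels

base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    # Stand-in for the gradient-boosted models (XGBoost, LightGBM).
    ("gb", GradientBoostingClassifier(random_state=0)),
]

# Scale first, then stack; cv=5 means the meta-model is trained on
# out-of-fold predictions from the base models.
stack = make_pipeline(
    StandardScaler(),
    StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
)
stack.fit(X, y)
proba = stack.predict_proba(X)
```

Putting the scaler inside the pipeline means it is fitted only on the training data passed to `fit`, which avoids leaking statistics from held-out data when the whole pipeline is cross-validated.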
🚧 Challenges we ran into
- Training stability and convergence issues in logistic regression during stacking.
- Handling and aligning multiple feature sources (graph-based and tabular).
- Preventing overfitting caused by high-dimensional features across the different feature sets.
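As an illustration of the alignment challenge above, one way to keep graph-based and tabular features matched row for row is to join them on a shared molecule identifier rather than relying on row order. The column names (`mol_id`, `emb_0`, `emb_1`, `mol_weight`) and values below are hypothetical.

```python
import pandas as pd

# Hypothetical GNN embeddings, keyed by molecule ID.
gnn_df = pd.DataFrame(
    {
        "mol_id": ["m1", "m2", "m3"],
        "emb_0": [0.10, 0.20, 0.30],
        "emb_1": [1.00, 2.00, 3.00],
    }
)

# Hypothetical tabular descriptors, in a different order and
# covering a slightly different set of molecules.
tab_df = pd.DataFrame(
    {
        "mol_id": ["m3", "m1", "m4"],
        "mol_weight": [46.07, 18.02, 32.04],
    }
)

# An inner join keeps only molecules present in both sources;
# validate= raises an error if any ID is duplicated on either side.
merged = gnn_df.merge(tab_df, on="mol_id", how="inner", validate="one_to_one")
```

Molecules missing from either source (`m2` and `m4` here) are dropped, so it is worth logging how many rows survive the join before training.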
🏆 Accomplishments that we're proud of
We are particularly proud of our performance in Task 2, where our model achieved high predictive accuracy, showcasing the effectiveness of combining GNNs with classical ML ensembles. Integrating GNN outputs meaningfully into an ensemble was a key technical milestone.
📚 What we learned
- How to engineer and integrate deep graph representations with traditional ML models.
- The practical challenges and nuances of stacking models and ensemble design.
- The importance of feature scaling and overfitting control in hybrid systems.
🚀 What's next for Galactico Final
We aim to:
- Explore advanced regularization and dropout techniques to further reduce overfitting.
- Experiment with other meta-models beyond Logistic Regression for stacking.
- Investigate self-supervised or contrastive learning for better GNN representations.