Inspiration
For this project, we were inspired by Cyclica's results and the explanation on their GitHub of how to select more useful features and explore different techniques to transform the data.
What it does
It takes per-residue characteristics of different protein structures, such as the amino acid type, protein chain bond angles, the solvent-accessible surface area, the AlphaFold2 residue-level prediction confidence value, secondary structure assignments by DSSP, and other backbone structural features describing backbone hydrogen-bonding networks (also assigned by DSSP), and tries to predict whether each residue (row) belongs to a known binding site.
How we built it
We first checked for missing values: there were none except in the annotation_atomrec column, which we later handled with one-hot encoding.
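A minimal sketch of this encoding step, using a hypothetical toy frame in place of the real residue dataset (the column name annotation_atomrec is from the source; the values and the sasa column are illustrative):

```python
import pandas as pd

# Toy stand-in for the residue-level dataset; annotation_atomrec is the
# categorical column that contains missing values.
df = pd.DataFrame({
    "annotation_atomrec": ["ATOM", None, "HETATM", "ATOM"],
    "sasa": [0.1, 0.5, 0.3, 0.2],
})

# pd.get_dummies ignores NaN by default, so a missing value simply maps to
# all-zero indicator columns -- no separate imputation step is needed.
encoded = pd.get_dummies(df, columns=["annotation_atomrec"])
print(encoded.columns.tolist())
```

With this approach a missing annotation is represented implicitly (all indicator columns zero) rather than by inventing a category.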
We then examined the response variable and found that it is highly imbalanced, so we used downsampling to address this.
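Downsampling the majority class can be sketched in a few lines of pandas; the target name is_binding_site and the toy data here are assumptions for illustration:

```python
import pandas as pd

# Toy labeled frame; "is_binding_site" stands in for the imbalanced target.
df = pd.DataFrame({"x": range(10),
                   "is_binding_site": [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

pos = df[df["is_binding_site"] == 1]
neg = df[df["is_binding_site"] == 0]

# Randomly downsample the majority (negative) class to the minority size,
# then shuffle the combined frame.
neg_down = neg.sample(n=len(pos), random_state=42)
balanced = pd.concat([pos, neg_down]).sample(frac=1, random_state=42)
print(balanced["is_binding_site"].value_counts().to_dict())
```

The resulting frame has an equal number of positive and negative residues, at the cost of discarding most of the majority-class rows.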
After that, we looked for important variables using a correlation map and density plots. We did not find any highly correlated variables, so we used all variables except the categorical ones to predict the class. To model the data, we split the downsampled data into 80% training and 20% testing.
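The correlation check and the 80/20 split can be sketched as follows; the feature names (sasa, phi, psi) and the synthetic data are assumptions standing in for the real downsampled dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic numeric features standing in for the downsampled dataset.
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["sasa", "phi", "psi"])
df["is_binding_site"] = rng.integers(0, 2, size=100)

# Pairwise correlation of the numeric features (the matrix behind a
# correlation heatmap).
corr = df[["sasa", "phi", "psi"]].corr()

# Stratified 80/20 train/test split.
X = df[["sasa", "phi", "psi"]]
y = df["is_binding_site"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_train), len(X_test))
```

Stratifying on the label keeps the class ratio consistent between the training and test portions.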
The models we tried were: Logistic Regression, Decision Tree, SVM, XGBoost, Random Forest, and LightGBM.
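Comparing these models follows a single fit/predict loop; a dependency-free sketch using only the scikit-learn members of the list is below (XGBoost's and LightGBM's classifiers expose the same fit/predict_proba interface, so they slot into the same dictionary; the synthetic data is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the downsampled residue data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# Fit each model and score it by ROC_AUC on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = roc_auc_score(y_te, proba)
print(scores)
```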
We used the eval_model function provided by Cyclica, which reports ROC_AUC and PR_AUC, and chose Random Forest since it had the highest ROC_AUC score; we then retrained it on the entire dataset. The best score was ROC_AUC = 0.88 and PR_AUC = 0.37 on the 20% test set split from the labeled data.
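We do not reproduce Cyclica's eval_model here; the two metrics it reports can be computed directly with scikit-learn, as in this sketch (PR_AUC is approximated by average precision, and the imbalanced synthetic data is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Imbalanced synthetic data mimicking the binding-site class ratio.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=1)

clf = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

roc_auc = roc_auc_score(y_te, proba)           # ROC_AUC
pr_auc = average_precision_score(y_te, proba)  # PR_AUC (average precision)
print(round(roc_auc, 3), round(pr_auc, 3))
```

On imbalanced data PR_AUC is typically much lower than ROC_AUC, which is consistent with the 0.88 vs. 0.37 gap we observed.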
Challenges we ran into
There were not many informative variables to work with. Also, since the data is highly imbalanced, we could not use all of it to build the model, which resulted in a loss of information and probably lower performance than a more balanced dataset would have allowed.
Accomplishments that we're proud of
From the start, we had an ROC_AUC score of around 57% from the sample notebook. After our efforts to transform the data and try different models, we increased it by over 50% (relative) to more than 88%. We also explored how to build different models and tune hyperparameters, which demonstrated our ability to work on machine learning tasks and learn new technologies.
What we learned
We learned how to apply different models to the data and how they work, as well as how various types of algorithms perform on this kind of data, which adds to our experience of dealing with imbalanced and highly noisy data. It is a valuable addition to our skill set.
What's next for M3 biubiubiu Cyclica
We would like to research and explore different models, such as transforming the data and using Convolutional Neural Networks and Graph Neural Networks to model it.