What it does

This file takes 49 residue- and protein-related features as input and predicts whether each residue (row) belongs to a known binding site.

How we built it

  • built in Python; thanks to the University of Waterloo JupyterHub, my computer survived the training process
  • used scikit-learn and imbalanced-learn (imblearn) to process the data and train the model

Demo usage - hosted on a Gradio Space

  • used pickle to export and import the trained model
  • hosted the demo on a Gradio Space
  • set the input parameters through the web UI
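The export/import step can be sketched as below. This is a minimal stand-in: the model, the file name `model.pkl`, and the feature count in the Gradio wiring are illustrative assumptions, not necessarily what the project uses.

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model; the real one is trained on the 49 residue/protein features.
X, y = make_classification(n_samples=200, n_features=49, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Export with pickle so the Gradio Space can reload it without retraining.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

# The restored model makes identical predictions.
assert (restored.predict(X) == model.predict(X)).all()

# In the Space, the restored model is wrapped in a Gradio interface, roughly:
#   import gradio as gr
#   demo = gr.Interface(
#       fn=lambda *feats: bool(restored.predict([list(feats)])[0]),
#       inputs=[gr.Number(label=f"feature_{i}") for i in range(49)],
#       outputs="label")
#   demo.launch()
```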

Try it out here

Data Exploration

  • checked the shape of the dataset: relatively large, (497166, 50)
  • noticed the label is binary, so checked for imbalance: False: 479912, True: 17254
  • drew basic correlation graphs

  • searched for null values: only the annotation_atomrec column contains them

Model selection (10-fold, 3-repetition cross-validation)

  • first undersampled and oversampled the dataset at a rough initial ratio
  • then compared several models on f1, roc_auc, and pr_auc
| model                     | f1        | roc_auc    | pr_auc   |
|---------------------------|-----------|------------|----------|
| Logistic Regression       | 0.68414   | 0.85543    | NA       |
| XGBoost                   | 0.852539  | 0.956948   | 0.893542 |
| Random Forest             | 0.8700565 | 0.9664487  | 0.90847  |
| Neural Network (too slow) | NA        | 0.832424   | NA       |
| LightGBM                  | 0.772295  | 0.92824222 | 0.834910 |
| Voting Classifier         | 0.860123  | 0.9581390  | 0.90321  |
  • Random Forest performs best, so we selected it as the prediction model
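A sketch of the comparison loop, assuming `RepeatedStratifiedKFold` for the 10-fold, 3-repetition scheme and scikit-learn's `cross_validate` (shown on toy data with two of the models; pr_auc maps to scikit-learn's `average_precision` scorer):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Toy imbalanced data standing in for the resampled residue table.
X, y = make_classification(n_samples=500, n_features=20, weights=[0.8], random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scoring = {"f1": "f1", "roc_auc": "roc_auc", "pr_auc": "average_precision"}

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(n_estimators=50, random_state=0))]:
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(name, {k: round(scores[f"test_{k}"].mean(), 3) for k in scoring})
```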

Preprocessing

NA values

  • I found that the columns annotation_sequence and annotation_atomrec are identical, except that annotation_atomrec has NA values => dropped the annotation_atomrec column

Scaling

  • checked the numerical columns; most are not normally distributed => used MinMaxScaler()
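A minimal sketch of the scaling step, on a toy skewed feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A skewed, non-normally distributed feature (toy stand-in for the real columns).
X = np.random.default_rng(0).exponential(size=(100, 1))

scaler = MinMaxScaler()                # rescales each column onto [0, 1]
X_scaled = scaler.fit_transform(X)
print(round(float(X_scaled.min()), 3), round(float(X_scaled.max()), 3))  # -> 0.0 1.0
```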

Encoding categorical data

  • avoided one-hot encoding because the categorical variables have too many categories => used LabelEncoder()
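A minimal sketch of the encoding step; the residue names are illustrative. (Note: LabelEncoder is designed for target labels; scikit-learn's OrdinalEncoder performs the same integer mapping for feature columns.)

```python
from sklearn.preprocessing import LabelEncoder

# High-cardinality categorical column (toy residue names; real columns differ).
residues = ["ALA", "GLY", "SER", "ALA", "TRP", "GLY"]

enc = LabelEncoder()
codes = enc.fit_transform(residues)
print(codes)         # -> [0 1 2 0 3 1]: one integer per category, no one-hot columns
print(enc.classes_)  # -> ['ALA' 'GLY' 'SER' 'TRP'] (sorted order)
```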

Check outliers

  • the boolean covariates are heavily skewed toward False (no action taken)

Dealing with imbalanced data

  • performed SMOTE and RandomUnderSampler to rebalance the classes

  • tuned the sampling strategies considering runtime efficiency, overfitting, roc_auc, and pr_auc; in [a, b] below, a is the sampling strategy for SMOTE and b is the sampling strategy for RandomUnderSampler

| sample_strategy | roc_auc   | pr_auc    |
|-----------------|-----------|-----------|
| [0.1, 0.5]      | 0.96861   | 0.90295   |
| [0.1, 0.3]      | 0.972335  | 0.87756   |
| [0.2, 0.5]      | 0.9895    | 0.94928   |
| [0.3, 0.6]      | 0.99462   | 0.968687  |
| [0.5, 0.8]      | 0.9975907 | 0.9830687 |
| [0.6, 0.9]      | 0.998134  | 0.98615   |
| [0.8, 0.9]      | 0.99893   | 0.9900327 |
  • to avoid generating too many samples and slowing the runtime, chose [0.5, 0.8], giving Counter({False: 299945, True: 239956})

Model tuning

imbalanced = {False: 1, True: 1.2}
| parameters | roc_auc | pr_auc | f1 |
|---|---|---|---|
| class_weight=imbalanced, bootstrap=False | 0.9958320 | 0.985214 | NA |
| class_weight="balanced", bootstrap=False | 0.995803 | 0.985180 | NA |
| class_weight=imbalanced, bootstrap=False, max_features="log2" | 0.995837 | 0.985201 | NA |
| class_weight=imbalanced, bootstrap=False, n_estimators=50 | 0.995396 | 0.98509 | NA |
| class_weight=imbalanced, bootstrap=False, n_estimators=80, max_features="log2" | 0.99573 | 0.985180 | NA |
| class_weight=imbalanced, bootstrap=False, min_samples_split=4, max_features="log2" | 0.99576 | 0.98520 | NA |
| class_weight=imbalanced, bootstrap=False, min_samples_split=3, max_features="log2" | 0.995837 | 0.98531 | NA |
| class_weight=imbalanced, bootstrap=False, min_samples_split=3, max_features="log2", max_depth=20 | 0.99258643 | 0.9768451 | 0.965574 |
| class_weight=imbalanced, bootstrap=False, min_samples_split=3, max_features="log2", min_samples_leaf=5 | 0.995076 | 0.984155 | 0.97235 |
  • final choice
(screenshot of the final parameter settings)
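One of the tuned configurations from the table above can be sketched as follows (on toy data; the real model is trained on the resampled residue features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; class 1 plays the role of "binding site" (True).
X, y = make_classification(n_samples=400, n_features=20, weights=[0.6], random_state=0)

# Mirrors the write-up's weighting imbalanced = {False: 1, True: 1.2}.
clf = RandomForestClassifier(
    class_weight={0: 1, 1: 1.2},
    bootstrap=False,
    min_samples_split=3,
    max_features="log2",
    random_state=0,
).fit(X, y)

print(round(clf.score(X, y), 3))  # training accuracy of the fitted forest
```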

Feature importance

(feature importance plots)
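For a fitted random forest, the importances come from `feature_importances_`; a sketch on toy data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data with a few informative features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances: one value per input feature, summing to 1.
order = np.argsort(clf.feature_importances_)[::-1]
for i in order[:5]:
    print(f"feature_{i}: {clf.feature_importances_[i]:.3f}")
```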

Predictions

(screenshot of example predictions)

Challenges we ran into

  • figuring out how to set up a Gradio Space
  • the model's training runtime was very long, and my laptop died while running it

Accomplishments that we're proud of

  • I successfully finished the model with relatively high accuracy
  • I managed to host it on a Gradio Space

What we learned

  • I learned the data preprocessing workflow and how to tune hyperparameters
  • I learned how to deploy a model

Built With

  • huggingface
  • jupyter-notebook
  • python
  • randomforest
  • sklearn