Our feature-selection method chose features based on their correlation with the final CM content and on the class provided (sufficient/insufficient). The CM content given in the dataset was continuous (ranging from 50 to 100), so we set a threshold of 90 to distinguish the sufficient class from the insufficient one. The correlations were read off a correlation matrix, which helped us choose the most relevant features. In a second approach, we used Principal Component Analysis (PCA) to reduce the dimensionality of the input data from 102 features to 10, thereby reducing redundancy and improving training time, and we tested the models on both datasets (PCA and feature selection).

Our modeling approach passed the data through dense neural networks, a random forest regressor, and a logistic regression model. To evaluate performance, we considered precision, recall, accuracy, and F1 score for classification, and the R² score for regression. We found that logistic regression had the best metrics, with an accuracy of 72.2%. Additionally, since the data was imbalanced (there were more entries labeled "insufficient" than "sufficient"), we applied SMOTE (Synthetic Minority Over-sampling Technique) to oversample the minority class. We also experimented with variational autoencoders and general GANs to generate synthetic data (not fully functional yet). A minimal sketch of the classification pipeline follows.
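The sketch below illustrates the thresholding, feature selection, PCA, SMOTE, and logistic-regression steps described above, using sklearn plus imbalanced-learn for SMOTE. The random placeholder data and the 0.2 correlation cutoff are illustrative assumptions, not the project's actual code or parameters.

```python
# Minimal sketch of the classification pipeline, under illustrative assumptions:
# random placeholder data stands in for the real dataset, and the 0.2
# correlation cutoff is hypothetical.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# Placeholder for the real data: 102 features, continuous CM content in 50-100.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 102)))
y_cm = pd.Series(rng.uniform(50, 100, size=500))

# Threshold the continuous CM content at 90:
# 1 = sufficient (>= 90), 0 = insufficient (< 90).
y = (y_cm >= 90).astype(int)

# Approach 1: correlation-based feature selection -- keep features whose
# absolute correlation with the class label exceeds a cutoff.
corr = X.corrwith(y).abs()
X_fs = X[corr[corr > 0.2].index]

# Approach 2: PCA, reducing the 102 inputs to 10 components.
X_pca = PCA(n_components=10).fit_transform(X)

# Train and evaluate on either dataset; shown here for the PCA variant.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_pca, y, test_size=0.2, stratify=y, random_state=0
)

# Rebalance only the training split with SMOTE (the minority class is "sufficient").
X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# Logistic regression, the model that scored best in our experiments.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
for name, fn in [("accuracy", accuracy_score), ("precision", precision_score),
                 ("recall", recall_score), ("F1", f1_score)]:
    print(f"{name}: {fn(y_te, pred):.3f}")
```

Note that SMOTE is applied only to the training split, so the test metrics reflect the original class balance; the VAE/GAN synthetic-data experiments are omitted here since they were not yet functional.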
Built With
- sklearn
- tensorflow