In this study, we aimed to predict the content of cardiac myocytes (CMs) differentiated from human induced pluripotent stem cells (hiPSCs) using three tree-based ensemble algorithms: XGBoost and AdaBoost, which are boosting methods, and RandomForest, which is a bagging method. Our approach involved data processing, feature selection, classification, and evaluation.
For data processing, we binarized the labels, mapping values > 90 to 0 (sufficient) and values <= 90 to 1 (insufficient). We then used three feature selection techniques, namely SelectKBest, SelectFromModel, and SequentialFeatureSelector, to select the most informative features. Next, we ran each of the three classifiers with the selected features and evaluated their performance using accuracy, precision, recall, and Matthews correlation coefficient (MCC).
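The labeling rule and the four evaluation metrics can be illustrated with a minimal sketch. It is not the study's code: the CM content values and predictions below are hypothetical, the helper names `binarize` and `binary_metrics` are invented for illustration, and treating label 1 (insufficient) as the positive class is an assumption.

```python
from math import sqrt


def binarize(cm_content, threshold=90.0):
    """Map values > threshold to 0 (sufficient) and values <= threshold to 1 (insufficient)."""
    return [0 if value > threshold else 1 for value in cm_content]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and MCC from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no true positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mcc": mcc}


# Hypothetical CM content values and hypothetical classifier predictions.
cm_content = [95.2, 90.0, 42.7, 91.3, 88.9, 97.0]
y_true = binarize(cm_content)   # note that exactly 90 falls in the insufficient class
y_pred = [0, 1, 1, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(y_true, metrics)
```

Unlike accuracy, MCC uses all four confusion-matrix cells, so it remains informative when the sufficient and insufficient classes are imbalanced.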
To further validate our approach, we employed three cross-validation techniques: K-fold, Monte Carlo, and Leave-One-Out (LOO) cross-validation. We also plotted the individual cross-validated performance by the number of features selected to identify the subset of features with the highest cross-validated score.
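The cross-validation schemes differ only in how they partition sample indices into training and test sets. The sketch below shows that difference in plain Python. The function names, the fold count, the number of repeats, and the 20% test fraction are illustrative assumptions, not the study's settings.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Shuffle once, then use each of k disjoint folds as the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def monte_carlo_splits(n_samples, n_repeats, test_fraction=0.2, seed=0):
    """Draw an independent random train/test split on every repeat (test sets may overlap)."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def leave_one_out_splits(n_samples):
    """Hold out each sample once; equivalent to K-fold with k equal to n_samples."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


n = 10  # hypothetical number of samples
kfold = list(k_fold_splits(n, k=5))
monte_carlo = list(monte_carlo_splits(n, n_repeats=20))
loo = list(leave_one_out_splits(n))
print(len(kfold), len(monte_carlo), len(loo))
```

In K-fold and LOO every sample is tested exactly once, whereas Monte Carlo cross-validation trades that guarantee for a freely chosen number of repeats and test-set size.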
Our results showed that XGBoost had the highest overall performance, with an accuracy of 88% and an MCC of 0.69. RandomForest and AdaBoost also performed well, with accuracies of 84% and 79%, respectively. Our approach could aid in the prediction of CM content from hiPSCs, which could have implications for disease modeling and drug development.