Phyla Challenge #2 - RandomForest Approach

A multi-label classification model to classify different diseases based on themicroorganisms in the gut microbiome - with minimal code, minimal complexity, minimal performance cost.

Comment

Confusion matrix' heatmap

Performance

F1 Score - Micro: 0.780331373597007
F1 Score - Macro: 0.5981815091402464
F1 Score - Weighted: 0.7443181899051569
Cohen’s kappa score: 0.6355790715803891

Inspiration

Alexis Nolin-Lapalme's intro to Machine Learning

One of the biggest problems in machine learning is overfitting, but most of the time this won’t happen thanks to the random forest classifier. If there are enough trees in the forest, the classifier won’t overfit the model.

Source: https://builtin.com/data-science/random-forest-algorithm

[...] the Random Forest (RF) algorithm showed superior accuracy comparatively. Of the 17 studies where it was applied, RF showed the highest accuracy in 9 of them, i.e., 53%.

Source: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-019-1004-8

What it does

Classify different diseases using a multi-label classification model based on the microorganisms in the gut microbiome.

How we built it

Python
scikit-learn - Machine Learning
seaborn - Heatmaps
pandas - Data Analysis

Challenges we ran into

Separating 'Disease-2' from 'Healthy' samples

Accomplishments that we're proud of

Built a functional classifier with minimal code, minimal complexity, minimal performance cost

What we learned

Random Forest Classifier
Using scikit-learn for Machine Learning

What's next?

Use sklearn-onnx to export our model in a standard format
Include potential future q-values -> FDR
Tuning to separate disease-2 from our healthy samples

Built with

Python
scikit-learn - Machine Learning
seaborn - Heatmaps
pandas - Data Analysis

Built With

Updates

Soso S started this project — Mar 27, 2022 10:18 AM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.