one-algorithm-to-rule-them-all

Inspiration

There is an implicit paradigm when it comes to any machine learning classification task: first, try a variety of models (Logistic Regression, Support Vector Machine, Random Forest, Gradient Boosting, Artificial Neural Nets, etc.). Then, depending on your criteria (accuracy, speed, need for interpretability, etc.) you pick the best one. This feels extremely unsatisfying. The entire process seems bloated, arbitrary, and contrived. Shouldn’t there by a more theoretical, statistical way of determining which model would do best?

What it does

Through experimentation, analyzing model derivations, and reading research papers, we defined metrics that, when applied to the dataset, could determine which of Random Forest or Logistic Regression should be applied. We limited the possibilities to those two models because they are fundamentally quite different and possess different properties, and are used in the same context. These metrics used the distribution of data within features as well as the size of samples compared to that of the feature space. Our presentation contains more details (https://github.com/jmather625/one-algorithm-to-rule-them-all/blob/master/one-algorithm-to-rule-them-all.pdf).

How we built it

This project was built in the Jupyter Notebook. The majority of the time was spent going through different datasets and trying to find differences in the accuracies of Random Forest and Logistic Regression. Once this was identified, we tried to understand why these differences were occurring and how they corresponded to a failure within one of the Machine Learning Models to correctly classify.

Challenges we ran into

Coming up with a theoretically-grounded approach that can identify strengths/weaknesses for a model in a dataset is not easy. The existing literature we found was not substantial and was clearly an area of interest. A lot of our fixes felt very ad-hoc, but we believe the core integrity of our idea remains intact.

Accomplishments that we're proud of

The algorithm correctly identified the ideal model for use within many of the datasets we were working with throughout the hackathon.

What we learned

Many of these models we are using have such a deep fundamental backing, and our appreciation for the models' simple assumptions yet powerful predictive power was interesting to pursue.