TAMU-datathon-2019

Inspiration

Solving the ConocoPhillips sensor challenge

What it does

It takes in the data as a CSV and drops every column in which more than 79% of the entries are 'na'. The resulting data frame is then used to train a random forest model. We then check the accuracy using cross-validation and apply some feature visualization techniques.
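
The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the 5-fold setting, the number of trees, and the -1 fill for the remaining missing values are all assumptions made for the example. Only the 79% threshold and the choice of a random forest come from the description.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

NA_THRESHOLD = 0.79  # threshold taken from the description above


def drop_sparse_columns(df: pd.DataFrame, threshold: float = NA_THRESHOLD) -> pd.DataFrame:
    """Drop columns whose fraction of missing entries exceeds `threshold`."""
    na_ratio = df.isna().mean()  # per-column share of missing values
    return df.loc[:, na_ratio <= threshold]


def evaluate_random_forest(df: pd.DataFrame, target_col: str,
                           n_folds: int = 5, seed: int = 0) -> np.ndarray:
    """Return one cross-validated accuracy score per fold (hypothetical helper)."""
    features = drop_sparse_columns(df.drop(columns=[target_col]))
    # Assumption: remaining gaps are filled with a sentinel so the model can train.
    features = features.fillna(-1)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(model, features, df[target_col], cv=n_folds)
```

When loading the CSV, something like `pd.read_csv(path, na_values=["na"])` makes sure the literal string 'na' is parsed as a missing value; `path` here is a placeholder.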

How I built it

Mostly trial and error. We started out by trying to find intelligent methods for data cleaning, such as replacing all 'na' entries with -1, dropping all columns above a certain 'na' threshold, and applying standard and robust scaling to the set. We then tried many models and techniques, including random forest (with and without hyperparameter tuning), SVM, AdaBoost, and matrix factorization.
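
Two of the cleaning options mentioned above, the -1 fill and the standard versus robust scaling, might look like the sketch below. The helper names `fill_missing` and `scale` are hypothetical, and scikit-learn's `StandardScaler` and `RobustScaler` are assumed as the scaling implementations; the writeup does not say which library was used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler


def fill_missing(df: pd.DataFrame, sentinel: float = -1.0) -> pd.DataFrame:
    """Replace every missing entry with a sentinel value (-1 in the writeup)."""
    return df.fillna(sentinel)


def scale(df: pd.DataFrame, robust: bool = False) -> pd.DataFrame:
    """Scale each column; the robust variant is less sensitive to outliers."""
    # StandardScaler centers on the mean and divides by the standard deviation.
    # RobustScaler centers on the median and divides by the interquartile range.
    scaler = RobustScaler() if robust else StandardScaler()
    scaled = scaler.fit_transform(df)
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)
```

With sensor data that contains occasional spikes, the robust variant keeps a few extreme readings from dominating the scale of a column, which is one reason to compare both.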

Challenges I ran into

Typos and time constraints. The hardest issue we faced was class imbalance. There were only about 1000 samples of the minority class in the dataset, so finding a smart way around it was the bane of our challenge.
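
The writeup does not say which approach was finally used for the imbalance. As one common option, and purely as an assumption for illustration, a random forest can reweight classes with `class_weight="balanced"` and be scored with balanced accuracy on stratified folds, so that the minority class is not drowned out by the majority. The helper names below are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def minority_fraction(y) -> float:
    """Share of samples that belong to the rarest class."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return counts.min() / counts.sum()


def cv_balanced_accuracy(X, y, n_folds: int = 5, seed: int = 0) -> np.ndarray:
    """Cross-validated balanced accuracy for a class-weighted random forest."""
    # Assumed mitigation, not necessarily what the team used:
    # weight each class inversely to its frequency.
    model = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=seed
    )
    # Stratified folds keep the minority share roughly equal in every split.
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=folds, scoring="balanced_accuracy")
```

Plain accuracy can look high on an imbalanced set even when the minority class is mostly missed, which is why this sketch reports balanced accuracy instead.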

Accomplishments that I'm proud of

We were in the top 10 with our initial submission, and in our opinion we achieved pretty good accuracy relative to other submissions. We're also proud to have tried some novel implementations, and overall we feel good about learning some new models and methods.