Inspiration

Solving the ConocoPhillips sensor challenge

What it does

It reads the data from a CSV file and drops every column in which more than 79% of the entries are 'na'. The resulting data frame is then used to train a random forest model. We then estimate accuracy with cross-validation and apply some feature visualization techniques.
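As a rough illustration of the column-dropping step, here is a minimal sketch using only the Python standard library. The function name `drop_sparse_columns`, the set of strings treated as missing, and the sample CSV are all assumptions made up for this example; the actual project worked on a data frame and may have done this differently.

```python
import csv
import io

NA_TOKENS = {"na", "nan", ""}  # assumed spellings of a missing entry


def drop_sparse_columns(rows, max_na_ratio=0.79):
    """Return rows without the columns whose share of missing entries exceeds max_na_ratio."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    keep = []
    for col in columns:
        missing = sum(1 for row in rows if row[col].strip().lower() in NA_TOKENS)
        if missing / len(rows) <= max_na_ratio:
            keep.append(col)
    return [{col: row[col] for col in keep} for row in rows]


# Tiny made-up CSV: sensor_b is 'na' in 4 of 5 rows (80%), so it is dropped.
sample_csv = """target,sensor_a,sensor_b
0,1.5,na
0,na,na
1,2.0,na
0,3.1,7.0
0,0.4,na
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
cleaned = drop_sparse_columns(rows)
print(list(cleaned[0].keys()))
```

Columns at or below the threshold are kept even if they still contain some 'na' entries, so a later step has to decide how to fill those gaps.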

How I built it

Mostly trial and error. We started by looking for intelligent data-cleaning methods, such as replacing every 'na' with -1, dropping all columns above a certain 'na' threshold, and applying standard and robust scaling to the data set. We then tried many models and techniques, including random forest (with and without hyperparameter tuning), SVM, AdaBoost, and matrix factorization.
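To make the cleaning options concrete, the sketch below fills missing readings with -1 and then applies standard scaling (mean and standard deviation) and robust scaling (median and interquartile range) to a single column. The helper names and the sample readings are hypothetical, and the code uses only the standard library rather than whatever tooling the team actually used.

```python
import statistics


def fill_missing(values, sentinel=-1.0):
    """Replace None (a missing reading) with a sentinel value."""
    return [sentinel if v is None else v for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr if iqr else 0.0 for v in values]


# Made-up column with one gap and one outlier.
readings = [1.0, 2.0, None, 3.0, 4.0, 100.0]
filled = fill_missing(readings)
print(standard_scale(filled))
print(robust_scale(filled))
```

Robust scaling is less affected by the outlier because the median and interquartile range barely move when one extreme value is added. Note that the -1 sentinel takes part in the statistics here, which is one reason the order of filling and scaling is worth experimenting with.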

Challenges I ran into

Typos and time constraints. The hardest issue we faced was class imbalance: there were only about 1000 samples of the minority class in the dataset, so finding a smart way around this was the bane of our challenge.
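One common way to soften class imbalance is random up-sampling, which duplicates minority-class rows until the classes are the same size. The sketch below is an illustration under that assumption, not a record of what we submitted; the function name `upsample_minority` and the toy data are made up for the example.

```python
import random
from collections import Counter


def upsample_minority(samples, labels, seed=0):
    """Randomly duplicate smaller-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw with replacement from the existing rows of this class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Made-up data: 95 majority rows (label 0) and 5 minority rows (label 1).
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5
X_up, y_up = upsample_minority(X, y)
print(Counter(y_up))
```

Up-sampling should be applied to the training portion only. If duplicated rows end up in both the training and validation folds, the cross-validation score will look better than it really is.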

Accomplishments that I'm proud of

We were in the top 10 on our initial submission, and in my opinion our accuracy was pretty good relative to other submissions. We're also proud to have tried some novel implementations, and overall we feel good about learning some new models and methods.

What I learned

Data processing, feature visualization techniques, up-sampling, matrix factorization, embedding spaces, and hyperparameter tuning.

What's next for TAMU-datathon-2019

Hopefully more challenges and novel solutions!
