Inspiration
Solving the ConocoPhillips sensor challenge
What it does
It reads the data from a CSV and drops every column in which more than 79% of the entries are 'na'. The remaining data frame is then used to train a random forest model. We then check accuracy with cross-validation and apply some feature visualization techniques.
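The steps above can be sketched roughly as follows. This is a minimal illustration, not our actual notebook: the column names, the synthetic data, the `drop_sparse_columns` helper, the -1 fill, and the choice of 5 folds are all assumptions made for the example. Only the 79% threshold and the random forest with cross-validation come from the description.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

NA_RATIO_THRESHOLD = 0.79  # threshold quoted in the write-up


def drop_sparse_columns(df, threshold=NA_RATIO_THRESHOLD):
    """Hypothetical helper: drop columns whose share of missing values exceeds `threshold`."""
    na_ratio = df.isna().mean()  # fraction of NaN per column
    return df.loc[:, na_ratio <= threshold]


# Synthetic stand-in for the challenge CSV (assumed layout). With a real file,
# something like pd.read_csv(path, na_values="na") would turn the literal
# string 'na' into NaN first.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "sensor_a": rng.normal(size=n),
    "sensor_b": rng.normal(size=n),
    "sensor_mostly_na": [np.nan] * 180 + [1.0] * 20,  # 90% missing
})
df["target"] = (df["sensor_a"] + 0.5 * df["sensor_b"] > 0).astype(int)

cleaned = drop_sparse_columns(df)
X = cleaned.drop(columns="target").fillna(-1)  # remaining gaps filled with -1 (assumption)
y = cleaned["target"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("mean CV accuracy:", scores.mean())
```

Plain accuracy is used here only because that is the metric mentioned above. On a heavily imbalanced dataset it can look good even when the minority class is mostly missed.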
How I built it
Mostly trial and error. We started out by looking for intelligent data-cleaning methods, such as replacing every 'na' with -1, dropping all columns above a certain 'na' threshold, and normalizing the set with standard and robust scaling. We then tried a range of models and techniques, including random forest (with and without hyperparameter tuning), SVM, AdaBoost, and matrix factorization.
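To make the cleaning options concrete, here is a small sketch of the -1 fill and the two kinds of normalization. The toy columns and values are made up, and applying the fill before scaling is an assumption for the example, not necessarily the order we used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy sensor readings (hypothetical); 100.0 plays the role of an outlier.
raw = pd.DataFrame({
    "pressure": [1.0, 2.0, np.nan, 4.0, 100.0],
    "temperature": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Option 1: replace every missing entry with a sentinel value of -1.
filled = raw.fillna(-1)

# Option 2: standard scaling (zero mean, unit variance per column).
standard = pd.DataFrame(
    StandardScaler().fit_transform(filled), columns=filled.columns
)

# Option 3: robust scaling (centered on the median, scaled by the
# interquartile range), which is less sensitive to outliers.
robust = pd.DataFrame(
    RobustScaler().fit_transform(filled), columns=filled.columns
)

print(standard.round(2))
print(robust.round(2))
```

Tree-based models such as random forest do not depend on feature scale, so the scaling step matters mainly for models like SVM.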
Challenges I ran into
Typos and time constraints. The hardest issue we faced was class imbalance: there were only about 1000 samples of the minority class in the dataset, so finding a smart way around this was the bane of our challenge.
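One common way to deal with this kind of imbalance is random up-sampling of the minority class. The sketch below shows the general idea on a toy data frame. The `upsample_minority` helper, the column names, and the class sizes are hypothetical and are not taken from our notebook.

```python
import pandas as pd


def upsample_minority(df, label_col, random_state=0):
    """Hypothetical helper: resample smaller classes with replacement
    until every class has as many rows as the largest one."""
    counts = df[label_col].value_counts()
    majority_size = counts.max()
    parts = []
    for label, count in counts.items():
        group = df[df[label_col] == label]
        if count < majority_size:
            # Draw the missing rows from the existing ones, with replacement.
            extra = group.sample(
                n=majority_size - count, replace=True, random_state=random_state
            )
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so the classes are not stored in contiguous blocks.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Toy imbalanced set: 95 majority rows, 5 minority rows.
toy = pd.DataFrame({
    "sensor_a": range(100),
    "target": [0] * 95 + [1] * 5,
})
balanced = upsample_minority(toy, "target")
print(balanced["target"].value_counts())
```

Up-sampling should be applied to the training split only. If duplicated minority rows end up in both the training and validation folds, the cross-validation score will be overly optimistic.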
Accomplishments that I'm proud of
We were in the top 10 on our initial submission, and in my opinion we got pretty good accuracy relative to other submissions. We're also proud to have tried some novel implementations, and overall we feel good about learning some new models and methods.
What I learned
Data processing, feature visualization techniques, up-sampling, matrix factorization, embedding spaces, and hyperparameter tuning.
What's next for TAMU-datathon-2019
Hopefully more challenges and novel solutions!
Built With
- jupyter-notebook
- python