Inspiration
Our team’s model rests on the belief that careful, systematic data processing, whether filling in missing values or converting categorical variables to numerical ones, is the foundation of any machine learning model for sales prediction. We therefore focused on accurately cleaning and processing variables such as SIC number, parent country, and global ultimate country to keep our results as accurate and unbiased as possible.
What it does
After reviewing a range of machine learning tools, we concluded that gradient boosting was the best fit for predicting domestic sales. Given values for the variables present in the training data, our model predicts the sales for each record in the test data.
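To illustrate the idea behind gradient boosting, here is a minimal sketch for squared-error loss using one-split regression stumps on a single feature. It is not our actual model: the `employees` feature, the toy numbers, and the function names are all hypothetical, and a real project would normally rely on an established library implementation.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the single threshold split on xs that minimizes squared error."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left) + sum(
            (r - right_value) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Repeatedly fit a stump to the current residuals and add a shrunken copy."""
    base = mean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base prediction and every stump's shrunken contribution."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


# Hypothetical toy data: one numeric feature and a sales figure per company.
employees = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [10.0, 12.0, 11.0, 13.0, 30.0, 32.0, 31.0, 33.0]

model = fit_gradient_boosting(employees, sales)
print(predict(model, 2), predict(model, 7))
```

Each round corrects a fraction of the remaining error, which is why the number of rounds and the learning rate are usually tuned together.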
How we built it
First, we discarded variables that had too many missing values or had no logical bearing on the target variable (domestic sales). Then, we used linear regression to predict and fill in the missing values in the variables we kept. Next, we converted categorical data into numerical data via one-hot and label encoding. With the processed data, we ran several machine learning algorithms to find the best-performing one. Finally, we cross-validated our final model to get a more reliable estimate of its accuracy.
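The imputation and encoding steps can be sketched roughly as follows. This is an illustrative example under stated assumptions, not our project code: the column names `employees_total`, `employees_site`, and `country`, the toy rows, and the helper functions are all hypothetical, and the regression uses a single predictor for simplicity.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return y_mean - slope * x_mean, slope


def impute_with_regression(rows, predictor, target):
    """Fill missing `target` values using a line fitted on the complete rows."""
    complete = [r for r in rows if r[target] is not None]
    intercept, slope = fit_line(
        [r[predictor] for r in complete], [r[target] for r in complete]
    )
    return [
        {**r, target: intercept + slope * r[predictor]}
        if r[target] is None
        else dict(r)
        for r in rows
    ]


def label_encode(values):
    """Map each category to an integer code, assigned in sorted order."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values], categories


# Hypothetical toy rows; None marks a missing value.
rows = [
    {"employees_total": 10, "employees_site": 5, "country": "Singapore"},
    {"employees_total": 20, "employees_site": 10, "country": "China"},
    {"employees_total": 30, "employees_site": None, "country": "Singapore"},
    {"employees_total": 40, "employees_site": 20, "country": "Japan"},
]

imputed = impute_with_regression(rows, "employees_total", "employees_site")
countries = [r["country"] for r in imputed]
labels, codes = label_encode(countries)
one_hot, categories = one_hot_encode(countries)
print(imputed[2]["employees_site"], labels, one_hot)
```

Label encoding keeps a single column but implies an ordering between categories, while one-hot encoding avoids that ordering at the cost of one column per category.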
Challenges we ran into
One of the most challenging parts of constructing our model was converting categorical data to numerical data. For variables such as parent country and global ultimate country, our group was torn between label encoding and target encoding. We tested both and found that target encoding carried a risk of overfitting. After careful consideration, we chose label encoding for both variables.
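A small sketch can show where the overfitting risk in target encoding comes from. The country values and sales figures below are made up for illustration, and `target_encode` is a hypothetical helper using plain per-category means with no smoothing or out-of-fold estimation.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target of the rows in that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means


# Hypothetical toy data: Japan and China each appear only once.
countries = ["Singapore", "Singapore", "China", "Japan"]
sales = [100.0, 300.0, 50.0, 900.0]

# Label encoding depends only on the category names, never on the target.
codes = {country: i for i, country in enumerate(sorted(set(countries)))}
label_encoded = [codes[country] for country in countries]

# Target encoding is computed from the target itself.
target_encoded, means = target_encode(countries, sales)

print(label_encoded)
print(target_encoded)
```

In this toy data, the rows for China and Japan receive an encoded value equal to their own sales figure, so a model could simply read the answer from the feature. That leakage is most severe for rare categories, which is the overfitting risk described above.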
Accomplishments that we're proud of
Our team is generally satisfied with the R-squared and MSE values produced by our final model. We believe we chose the most appropriate machine learning tool for predicting domestic sales. Furthermore, the repeated trial and error in converting categorical data to numerical data means we paid extra attention to whether our model was overfitting or underfitting the training data.
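For reference, the two metrics mentioned above can be computed as in the following sketch. The `actual` and `predicted` values are invented for illustration and are not our results.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Hypothetical values, not taken from our model.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(mse, r2)
```

Comparing these metrics on the training data against held-out data is one common way to spot overfitting: a large gap suggests the model has memorized the training set.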
What we learned
Our most significant takeaway from this experience was mastering the art of collaboration. Throughout our discussions, each team member readily contributed creative ideas and engaged in thoughtful conversations on how these concepts could be integrated into our project. Despite our status as beginners, occasional frustrations stemming from limited knowledge were overcome through passionate and enthusiastic discussions that propelled us to successful outcomes.
What's next for Predicting Domestic Sales Utilizing Data Analytics
As the datathon concludes, our commitment to refining our models does not waver. Beyond this project, we are actively seeking new opportunities where we can apply the skill sets honed during this endeavor. Our journey is far from over; in fact, it's just beginning. We eagerly anticipate the next chapter in our ongoing pursuit of excellence!