Inspiration

I wanted to create a wine recommender system, but that was definitely out of reach for a first machine learning project. I still liked the idea of combining my interest in wine with natural language processing, so I set out to investigate whether expensive wine is really better than "cheap" wine.

What it does

The model takes the description of a wine (in the form of a review) as input and outputs one of six price categories for the wine ($0-20, $20-40, $40-150, $150-300, $300-1000, or more than $1000).
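To make the labels concrete, here is a minimal sketch of how a raw price could be mapped to one of these categories. It assumes the fifth range is $300-1000 and that a price sitting exactly on a boundary belongs to the lower category; the names `BOUNDS`, `LABELS`, and `price_category` are made up for this example and are not taken from the project code.

```python
from bisect import bisect_left

# Upper bounds (in dollars) of the first five categories.
# Assumption: a price equal to a bound falls into the lower category.
BOUNDS = [20, 40, 150, 300, 1000]
LABELS = ["$0-20", "$20-40", "$40-150", "$150-300", "$300-1000", "more than $1000"]


def price_category(price: float) -> str:
    """Return the price category label for a wine price in dollars."""
    if price < 0:
        raise ValueError("price must be non-negative")
    # bisect_left counts how many bounds are strictly below the price,
    # which is exactly the index of the matching label.
    return LABELS[bisect_left(BOUNDS, price)]
```

For example, `price_category(35)` falls in the `$20-40` category, and anything above $1000 ends up in the open-ended top category.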

How I built it

I wrote several functions to strip punctuation and stopwords from the text and to lemmatize it. I applied a TF-IDF vectorizer to the reviews in the dataset, then split the data into training and test sets. I fed the vectorized text into several models: a Naive Bayes classifier, a random forest classifier, a logistic regression classifier, a support vector machine (SVM), and a neural network. Of these, the SVM offered the best performance and ability to generalize.
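The steps above can be sketched roughly as follows with scikit-learn. This is an illustration under assumptions, not the project's actual code: the reviews and labels are invented toy data, `LinearSVC` is just one possible SVM variant, and the built-in lowercasing and English stopword list of `TfidfVectorizer` stand in for the custom cleaning functions (lemmatization is left out).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy reviews and price labels, invented for illustration only.
reviews = [
    "Simple fruity red with soft tannins and a short finish",
    "Light and easy white with citrus notes, good for everyday drinking",
    "Fresh rose with strawberry aromas and a crisp finish",
    "Pleasant table wine with ripe cherry and a hint of spice",
    "Complex layers of cassis, cedar and graphite with a very long finish",
    "Powerful structured red with velvety tannins and great aging potential",
    "Elegant and concentrated with truffle, leather and dark fruit",
    "Remarkable depth, silky texture and a finish that lasts for minutes",
]
labels = ["$0-20"] * 4 + ["$40-150"] * 4

# Hold out part of the data to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline turns raw text into TF-IDF features, then fits the classifier.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LinearSVC(),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(list(zip(y_test, predictions)))
```

One design choice in this sketch: the vectorizer sits inside the pipeline, so it is fitted on the training split only. That keeps word statistics from the test reviews out of the features, which is a slightly different order from vectorizing everything before splitting.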

Challenges I ran into

I had a lot of trouble implementing the neural network, although I think this approach could yield the best results in the long term. In the end, I was not able to use it effectively because I had not yet mastered PyTorch. I spent a lot of time on it rather than optimizing my other models.

Accomplishments that I'm proud of

I'm super happy to have built skills that will stick with me. Beyond specific tools, I now have a deeper understanding of the whole process of developing machine learning apps.

What I learned

I learned how to use sklearn models and how to create a PyTorch neural network, which is very cool in my view. I also learned the steps needed to process data before feeding it to a machine learning model, and gained experience in natural language processing. Finally, I learned that expensive wine might not be worth the price we pay for it (further experiments will confirm or refute this).

What's next for Wine Price Classifier

I will try to implement a recurrent neural network, which may be better suited to NLP. I will also try reducing the number of samples in the $0-20 and $20-40 price categories, so that the model is exposed to the categories in more balanced proportions.
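The undersampling idea could look something like the sketch below, which caps every category at a fixed number of randomly chosen samples. The function name `undersample` and the `max_per_class` parameter are hypothetical, chosen for this example; a library such as imbalanced-learn could be used instead.

```python
import random
from collections import defaultdict


def undersample(samples, labels, max_per_class, seed=0):
    """Keep at most max_per_class randomly chosen samples per label."""
    rng = random.Random(seed)

    # Group the samples by their price category.
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    balanced = []
    for label, group in by_label.items():
        # Only the over-represented categories are cut down.
        if len(group) > max_per_class:
            group = rng.sample(group, max_per_class)
        balanced.extend((sample, label) for sample in group)

    # Shuffle so that the categories are not grouped together.
    rng.shuffle(balanced)
    return balanced
```

If this is used, it should probably be applied to the training set only, so that the test set keeps the real price distribution.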
