Inspiration

Many Amazon product reviews receive little or no user feedback, so when a particularly insightful review is posted, it is not immediately featured. This makes it difficult for listings to emphasize helpful opinions, even though both Amazon and consumers would like to identify the reviews that provide the most insight. The quality of product reviews also varies widely, which makes useful feedback hard to find. We set out to find an automated method for identifying helpful reviews.

What it does

We classified Amazon reviews by helpfulness using logistic regression and achieved an accuracy of 0.76.

How we built it

We took the following steps to featurize our data:

- Filtered the data to reviews with more than 5 helpfulness votes.
- Removed capitalization, punctuation, stopwords, and numbers.
- Created unigrams and bigrams from the data.
- Applied the Lancaster stemmer.
- Used CountVectorizer to create feature vectors for the 1000 most common tokens.
- Calculated TF-IDF weights.
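
The text-processing steps can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual pipeline: the stopword list, the sample reviews, and all function names are made up for the example, the smoothed IDF formula is one common choice, and the Lancaster stemming step is left out because it needs a third-party library (NLTK provides one).

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the project's list is not given).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "of", "to", "i"}

def tokenize(text):
    """Lowercase, drop punctuation and numbers, and remove stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def ngrams(tokens):
    """Return the unigrams followed by the bigrams (adjacent token pairs)."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

def build_vocabulary(docs, size=1000):
    """Keep the `size` most frequent terms across all documents."""
    counts = Counter(term for doc in docs for term in doc)
    return [term for term, _ in counts.most_common(size)]

def tfidf_vectors(docs, vocab):
    """Weight raw term counts by a smoothed inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vectors

# Hypothetical sample reviews, used only to exercise the functions.
reviews = [
    "This blender is great, I use it every day!",
    "Great blender. Broke after 2 weeks of use.",
]
docs = [ngrams(tokenize(r)) for r in reviews]
vocab = build_vocabulary(docs, size=1000)
vectors = tfidf_vectors(docs, vocab)
```

With this weighting, a term that appears in every review gets a weight of zero, while a term that appears in only one review gets a positive weight in that review's vector.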

We labeled a review as helpful if more than 80% of its votes were favorable.
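
As a small sketch, the labeling rule might look like the function below. It assumes that the 80% threshold is the share of helpful votes among all votes on a review and that the 5-vote filter applies to the total vote count; the function and parameter names are hypothetical.

```python
def label_review(helpful_votes, total_votes, min_votes=5, threshold=0.8):
    """Return 1 (helpful), 0 (not helpful), or None if the review has too few votes."""
    if total_votes <= min_votes:
        # Reviews with 5 or fewer votes are filtered out before labeling.
        return None
    return 1 if helpful_votes / total_votes > threshold else 0
```

For example, a review with 9 helpful votes out of 10 is labeled helpful, while one with 8 out of 10 is not, because the ratio has to be strictly greater than 80%.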

We applied a logistic regression model and used a cross-validator to tune the hyperparameters. We then used the best hyperparameters to evaluate the model’s accuracy on our test set.
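
The tune-then-evaluate workflow can be illustrated with a minimal, dependency-free sketch. It is not the code used in the project: the logistic regression is a plain gradient-descent version, and the toy data, the regularization grid, the 3 folds, and every name below are assumptions chosen for the example.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    # The last entry of w is the bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])

def train(X, y, reg, epochs=200, lr=0.5):
    """Fit weights by batch gradient descent with an L2 penalty `reg`."""
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        for j in range(n_features + 1):
            penalty = reg * w[j] if j < n_features else 0.0  # bias is not penalized
            w[j] -= lr * (grad[j] / len(X) + penalty)
    return w

def accuracy(w, X, y):
    hits = sum((predict_proba(w, xi) >= 0.5) == bool(yi) for xi, yi in zip(X, y))
    return hits / len(X)

def cross_validate(X, y, reg, k=3):
    """Mean validation accuracy over k folds for one hyperparameter value."""
    scores = []
    for fold in range(k):
        tr = [i for i in range(len(X)) if i % k != fold]
        va = [i for i in range(len(X)) if i % k == fold]
        w = train([X[i] for i in tr], [y[i] for i in tr], reg)
        scores.append(accuracy(w, [X[i] for i in va], [y[i] for i in va]))
    return sum(scores) / k

# Toy two-feature data: the first feature carries the signal, the second is noise.
rng = random.Random(0)
X, y = [], []
for _ in range(60):
    label = rng.randint(0, 1)
    center = 1.0 if label else -1.0
    X.append([center + rng.gauss(0, 0.2), rng.gauss(0, 0.5)])
    y.append(label)
X_train, y_train, X_test, y_test = X[:45], y[:45], X[45:], y[45:]

# Pick the regularization strength with the best cross-validated accuracy,
# retrain on the full training set, and score once on the held-out test set.
grid = [0.0, 0.01, 0.1, 1.0]
best_reg = max(grid, key=lambda r: cross_validate(X_train, y_train, r))
model = train(X_train, y_train, best_reg)
test_accuracy = accuracy(model, X_test, y_test)
```

The point of the structure is that the test set is touched only once, after the hyperparameter has been chosen from the training folds.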

Challenges we ran into

We faced several challenges while developing the model.

- The helpfulness distribution was heavily skewed. Unskewing the data was crucial to prevent our model from being biased.
- We ran into memory issues and long execution times with Databricks Community Edition.
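
The write-up does not say how the data was unskewed. One common option is to downsample the majority class until the classes are the same size, sketched below. The technique, the row layout, and the function name are assumptions made for illustration.

```python
import random

def downsample_majority(rows, label_key="label", seed=42):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Hypothetical skewed dataset: 90 helpful reviews and 10 unhelpful ones.
rows = [{"id": i, "label": 1} for i in range(90)]
rows += [{"id": i, "label": 0} for i in range(90, 100)]
balanced = downsample_majority(rows)
```

Downsampling throws data away, so it trades some training signal for a balanced label distribution; class weighting is an alternative that keeps every row.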

- We tried both HashingTF and CountVectorizer.
- We tried increasing the vocabulary size.
- We tried training the model on more data.
- We tried the multilayer perceptron (neural network) classifier.
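
The difference between the two vectorizers can be shown with a short plain-Python sketch. It only mimics the idea behind each approach and does not reproduce any library's implementation: CRC32 stands in for whatever hash function a real library uses, and the bucket count and function names are assumptions.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=16):
    """Map each token to a bucket by hashing; no vocabulary has to be stored."""
    vec = [0] * num_features
    for tok in tokens:
        bucket = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[bucket] += 1  # different tokens may collide in the same bucket
    return vec

def count_vectorize(tokens, vocab):
    """Count only the tokens that appear in a fixed, precomputed vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

tokens = ["great", "blender", "great", "value"]
hashed = hashing_tf(tokens)
counted = count_vectorize(tokens, ["great", "blender", "broke"])
```

Hashing avoids a pass over the data to build the vocabulary and keeps memory fixed, at the cost of possible collisions and features that can no longer be mapped back to tokens. A vocabulary-based count keeps the features interpretable but has to hold the vocabulary in memory.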

What's next for Amazon Review Helpfulness Classifier

- Extend our work to the entire dataset. The model’s performance may improve given more data.
- Use Spark Streaming and streaming logistic regression to handle incoming reviews.
- Apply deep learning methods to improve accuracy.
