Inspiration
A large number of Amazon product reviews receive little to no user feedback, so when a particularly insightful review is shared, it is not immediately featured. This makes it difficult for listings to emphasize helpful opinions. Amazon and consumers would both like to identify the reviews that provide the most insight, but the variation in the quality of product reviews makes useful feedback hard to find. We seek an automated method to identify helpful reviews.
What it does
We classified Amazon reviews by helpfulness using Logistic Regression and achieved an accuracy of 0.76.
How we built it
We took the following steps to featurize our data:
Filtered for reviews with more than 5 helpfulness votes.
Removed capitalization, punctuation, stopwords and numbers.
Created unigrams and bigrams from the data.
Applied the Lancaster Stemmer.
Used CountVectorizer to create feature vectors for the 1000 most common tokens.
Calculated TF-IDF weights.
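The featurization steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's Spark pipeline: the stopword list, the sample reviews, and the smoothed IDF formula are all assumed, and the Lancaster stemming step is left out because it needs an external library (for example NLTK's `LancasterStemmer`).

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the project's list is not given).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "i"}

def tokenize(text):
    """Lowercase, drop punctuation and digits, then remove stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def ngrams(tokens):
    """Return unigrams followed by bigrams (joined with a space)."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def build_vocabulary(docs, size=1000):
    """Keep the `size` most common terms across all documents."""
    counts = Counter(term for doc in docs for term in doc)
    return [term for term, _ in counts.most_common(size)]

def tfidf_vectors(docs, vocab):
    """Count vectors reweighted by an assumed smoothed IDF, log((N + 1) / (df + 1))."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = [math.log((n_docs + 1) / (df[term] + 1)) for term in vocab]
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[term] * weight for term, weight in zip(vocab, idf)])
    return vectors

# Made-up sample reviews, for illustration only.
reviews = [
    "Great blender, crushes ice easily!",
    "The blender broke after 2 weeks.",
    "Great price and great quality.",
]
docs = [ngrams(tokenize(r)) for r in reviews]
vocab = build_vocabulary(docs, size=1000)
features = tfidf_vectors(docs, vocab)
```

Each row of `features` has one weight per vocabulary term, so terms that appear in most reviews are weighted down relative to rarer ones.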
We labeled a review as helpful if more than 80% of its votes were favorable.
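As a small sketch of this labeling rule combined with the vote filter, assuming each review carries a helpful-vote count and a total-vote count (the function and parameter names here are hypothetical):

```python
def label_review(helpful_votes, total_votes, min_votes=5, threshold=0.8):
    """Return 1 (helpful), 0 (not helpful), or None if the review has too few votes."""
    if total_votes <= min_votes:
        # Reviews with 5 or fewer votes are filtered out rather than labeled.
        return None
    return 1 if helpful_votes / total_votes > threshold else 0
```

Note that the comparison is strict, so a review with exactly 80% favorable votes is labeled as not helpful under this reading of the rule.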
We applied a Logistic Regression model and used a cross validator to tune the hyperparameters. We then used the best hyperparameters to evaluate the model’s accuracy on our test set.
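The tune-then-evaluate workflow can be illustrated with a minimal pure-Python sketch. It is not the project's code: the data is synthetic, the only hyperparameter searched is an L2 regularization strength, and the grid values, fold count, and learning rate are all assumptions. In Spark, the same roles would likely be played by `LogisticRegression`, `ParamGridBuilder`, and `CrossValidator`.

```python
import math
import random

def sigmoid(z):
    # Clamp the score to avoid overflow in math.exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, reg, epochs=200, lr=0.5):
    """Batch gradient descent on L2-regularized logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [reg * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def accuracy(model, X, y):
    """Fraction of rows whose predicted class matches the label."""
    w, b = model
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

def cross_validate(X, y, reg, k=3):
    """Mean validation accuracy over k folds (fold i holds out every k-th row)."""
    scores = []
    for fold in range(k):
        tr = [i for i in range(len(X)) if i % k != fold]
        va = [i for i in range(len(X)) if i % k == fold]
        model = train([X[i] for i in tr], [y[i] for i in tr], reg)
        scores.append(accuracy(model, [X[i] for i in va], [y[i] for i in va]))
    return sum(scores) / k

# Synthetic two-feature data standing in for the TF-IDF vectors.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(120)]
y = [1 if x0 + x1 > 0 else 0 for x0, x1 in X]
X_train, y_train, X_test, y_test = X[:90], y[:90], X[90:], y[90:]

# Pick the regularization strength by cross-validation on the training split only.
grid = [0.0, 0.01, 0.1, 1.0]
best_reg = max(grid, key=lambda reg: cross_validate(X_train, y_train, reg))

# Refit on the full training split and score once on the held-out test split.
model = train(X_train, y_train, best_reg)
test_accuracy = accuracy(model, X_test, y_test)
```

The test split is touched only once, after the hyperparameter has been chosen, so the reported accuracy is not inflated by the tuning step.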
Challenges we ran into
We faced several challenges when developing the model:
The helpfulness distribution was heavily skewed. Unskewing the data was crucial to prevent our model from being biased.
We ran into memory issues and long execution times with Databricks Community Edition.
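The write-up does not say how the data was unskewed. One common option is random undersampling of the larger class, sketched below under that assumption; the `downsample` helper and the sample rows are hypothetical.

```python
import random

def downsample(rows, label_of, seed=42):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Made-up skewed data: 90 helpful reviews (label 1) and 10 unhelpful ones (label 0).
rows = [("review %d" % i, 1) for i in range(90)]
rows += [("review %d" % i, 0) for i in range(90, 100)]
balanced = downsample(rows, label_of=lambda row: row[1])
```

Undersampling discards data, which also shortens training time; oversampling the smaller class or weighting the loss by class are alternatives that keep every row.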
We tried both HashingTF and CountVectorizer.
We tried increasing the vocabulary size.
We tried training the model with more data.
We tried the neural-network multilayer perceptron classifier.
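To illustrate the difference between the two vectorizers in general terms: a hashing vectorizer maps each token straight to a bucket index and stores no vocabulary, while a count vectorizer first fits a vocabulary and ignores tokens outside it. The sketch below is a simplified stand-in; CRC32 is an assumed hash chosen for determinism, not the one Spark uses, and the function names are hypothetical.

```python
import zlib

def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length count vector by hashing; no vocabulary is stored."""
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1  # Different tokens may collide in the same bucket.
    return vector

def count_vector(tokens, vocab):
    """Count only tokens found in a fitted vocabulary; unseen tokens are dropped."""
    index = {term: i for i, term in enumerate(vocab)}
    vector = [0] * len(vocab)
    for token in tokens:
        if token in index:
            vector[index[token]] += 1
    return vector

tokens = ["great", "blender", "great", "unseen"]
hashed = hashing_tf(tokens)
counted = count_vector(tokens, vocab=["great", "blender"])
```

Hashing avoids the memory cost of building a vocabulary but loses the mapping from a feature index back to a token, whereas a fitted vocabulary keeps features interpretable.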
What's next for Amazon Review Helpfulness Classifier
Extend our work to the entire dataset, since the model’s performance may improve given more data.
Use Spark Streaming and streaming logistic regression to handle incoming reviews.
Apply deep learning methods to improve accuracy.