About the Data
This dataset consists of nearly 3,000 Amazon customer reviews (input text), along with star ratings, review dates, product variants, and feedback for various Amazon Alexa products such as the Alexa Echo, Echo Dots, and Alexa Firesticks. It is intended for learning how to train a machine learning model for sentiment analysis.
The aim is to perform NLP and categorize the feedback as positive or negative.
Module Used for NLP
Natural Language Toolkit (NLTK): NLTK is a Python package for natural language processing. It is a suite of libraries and programs for symbolic and statistical NLP for English. First released in 2001, NLTK aims to support research and teaching in NLP and closely related areas, including artificial intelligence, empirical linguistics, cognitive science, information retrieval, and machine learning.
Steps Performed under NLP
Removing stop words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We do not want these words taking up space in our database or consuming valuable processing time. We can remove them easily by storing a list of the words we consider to be stop words and filtering them out of the text.
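The stop-word filtering described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the STOP_WORDS set and the sample review are hypothetical, and a real pipeline would likely use a much longer list.

```python
# Hypothetical, hand-picked stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "in", "is", "it", "to", "and", "of", "my"}


def remove_stop_words(words):
    """Return the words that are not in STOP_WORDS (case-insensitive)."""
    return [w for w in words if w.lower() not in STOP_WORDS]


# Made-up review text, split on whitespace for simplicity.
review = "The sound quality of the Echo Dot is great"
filtered = remove_stop_words(review.split())
print(filtered)  # ['sound', 'quality', 'Echo', 'Dot', 'great']
```

Storing the list as a set keeps each membership check fast, which matters when filtering thousands of reviews.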
Tokenization: Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
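As a rough sketch of word-level tokenization, the example below uses a regular expression from Python's standard re module. The tokenize helper and the sample sentence are assumptions made for illustration; NLTK ships its own tokenizers, some of which require downloading additional tokenizer data first.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    # The optional group keeps contractions such as "wasn't" as one token.
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text.lower())


# Made-up review text for illustration.
tokens = tokenize("Love my Echo! Works great, but the setup wasn't easy.")
print(tokens)
# ['love', 'my', 'echo', 'works', 'great', 'but', 'the', 'setup', "wasn't", 'easy']
```

Lowercasing before matching means "Echo" and "echo" end up as the same token, which is usually what a sentiment model wants.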
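Stemming: Stemming is the process of reducing a word to its word stem by stripping affixes (suffixes and prefixes). The stem need not be a valid dictionary word; the related technique of lemmatization instead reduces a word to its dictionary form, known as a lemma.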
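To make stemming concrete, here is a small sketch using NLTK's PorterStemmer. The choice of the Porter stemmer is an assumption (the text above does not say which stemmer was used, and NLTK offers several), and the stem_tokens helper and word list are hypothetical.

```python
from nltk.stem import PorterStemmer  # works without extra NLTK data downloads

# Assumption: Porter stemmer chosen for illustration only.
stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Return the Porter stem of each token."""
    return [stemmer.stem(t) for t in tokens]


# Inflected and derived forms of the same word collapse to one stem.
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = stem_tokens(words)
print(stems)  # ['connect', 'connect', 'connect', 'connect', 'connect']
```

Because every variant maps to the same stem, a classifier sees them as a single feature instead of five separate ones.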