Sentiment Analyzer

What it does

Submission for the 2020 Ignition Hacks Sigma Division. We created an AI model that predicts the sentiment of a given sentence, classifying it as positive (represented with a 1) or negative (represented with a 0). The code was written in Python using the scikit-learn machine learning library and the Natural Language Toolkit.

Methodology

The given data was first read into a pandas DataFrame in a Google Colab notebook. The following data cleaning techniques were implemented and tested in our pre-processing step. Each technique was run several times, and its usefulness was judged by the average F1 score across runs. To keep the comparison fair, the DataFrame size and the model were held constant each time, and the data was split into 80% for training and 20% for testing. See the submissions_extras.ipynb file in the repository for the specific implementations.
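The evaluation loop described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name average_f1, the number of runs, the use of the random seed to vary the split, and the toy sentences are all assumptions made for the example.

```python
from statistics import mean

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def average_f1(texts, labels, n_runs=5):
    """Average the F1 score over several 80/20 splits with a fixed model."""
    scores = []
    for seed in range(n_runs):
        # 80% training, 20% testing; the seed only changes which rows land where.
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.2, random_state=seed
        )
        # The model is held constant so that only the pre-processing varies.
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(X_train, y_train)
        scores.append(float(f1_score(y_test, model.predict(X_test), zero_division=0)))
    return mean(scores)


# Toy data for illustration only; this is not the competition dataset.
texts = ["i love this", "great day", "so happy", "what a nice surprise", "best ever",
         "i hate this", "awful day", "so sad", "what a terrible mess", "worst ever"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

print(average_f1(texts, labels))
```

To compare two cleaning techniques, the same function would be called once on the raw sentences and once on the cleaned sentences, and the two averages compared.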

String Processing

Removing stopwords and lemmatizing names and other words were unsuccessful, as the F1 scores decreased. Punctuation removal, however, improved the results.

Cleaning

Stopwords: Removing stopwords from the sentences made the model slightly less accurate.
Lemmatization of names: Replacing every name in the text (identified by a preceding ‘@’ sign) with the same name resulted in a slight loss in accuracy.
Lemmatization with part-of-speech tagging: After implementing NLTK’s WordNet lemmatization functions, we observed a noticeable decrease in the model’s accuracy.
Removal of punctuation: Removing punctuation led to a very slight increase in accuracy.

Since only punctuation removal improved F1 scores, it was the only cleaning technique we kept.
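As a rough sketch of two of the cleaning steps above, the snippet below strips punctuation and replaces ‘@’ names with one shared name using only the Python standard library. The function names and the "@user" placeholder are hypothetical, and the notebook may have implemented these steps differently.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(sentence):
    """Return the sentence with ASCII punctuation stripped."""
    return sentence.translate(_PUNCT_TABLE)


def normalize_names(sentence, placeholder="@user"):
    """Replace every @name with one shared placeholder name."""
    return re.sub(r"@\w+", placeholder, sentence)


print(remove_punctuation("Wow!!! This is great, isn't it?"))
# Wow This is great isnt it
print(normalize_names("thanks @alice_92 and @bob!"))
# thanks @user and @user!
```

Note that ‘@’ is itself in string.punctuation, so if both steps were combined, the name replacement would have to run before punctuation removal. On a DataFrame, either function could be applied to the text column with Series.map.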

Vectorizer

We tested CountVectorizer and TfidfVectorizer with different parameters and different classifiers to see which combination would yield the greatest accuracy. Since our focus was on accuracy rather than speed, we opted for TfidfVectorizer.
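To illustrate what the two vectorizers produce, here is a small sketch on made-up sentences (not the competition data) with default parameters. CountVectorizer outputs raw token counts, while TfidfVectorizer reweights those counts by how rare each token is across documents and, by default, scales each row to unit length.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents, used only to show the shape of the output.
docs = ["good movie", "bad movie", "good good plot"]

counts = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer().fit(docs)

# Both vectorizers learn the same vocabulary from the same documents.
print(sorted(counts.vocabulary_))  # ['bad', 'good', 'movie', 'plot']

# Raw counts: "good" appears twice, "plot" once.
print(counts.transform(["good good plot"]).toarray())  # [[0 2 0 1]]

# TF-IDF weights for the same sentence, as an L2-normalized row.
print(tfidf.transform(["good good plot"]).toarray().round(2))
```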

Classifier

The following classifiers were implemented and evaluated using the scikit-learn library:

Neural Network
Decision Tree
Logistic Regression
Support Vector Machine
Stochastic Gradient Descent

Logistic Regression yielded the highest average F1 score with the data size, punctuation removal, vectorization, and train-test split held constant.
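A comparison of this kind could look roughly like the sketch below. The specific scikit-learn classes (for example LinearSVC for the support vector machine), their settings, and the toy sentences are assumptions. The sketch also uses cross-validation through cross_val_score as a compact stand-in for averaging F1 over repeated 80/20 splits, which is a different procedure from the one described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# One candidate per classifier family listed above (class choices are assumptions).
candidates = {
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": LinearSVC(),
    "Stochastic Gradient Descent": SGDClassifier(random_state=0),
}


def compare_classifiers(texts, labels, cv=5):
    """Return the mean cross-validated F1 score for each candidate."""
    results = {}
    for name, clf in candidates.items():
        # Same vectorizer for every classifier so only the model varies.
        pipe = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1")
        results[name] = float(scores.mean())
    return results


# Toy data for illustration only; scores on it say nothing about the real task.
texts = ["i love this", "great day", "so happy", "what a nice surprise", "best ever",
         "i hate this", "awful day", "so sad", "what a terrible mess", "worst ever"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

for name, score in compare_classifiers(texts, labels, cv=2).items():
    print(f"{name}: {score:.3f}")
```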

We then used GridSearchCV to find the optimal parameters for each classifier and re-evaluated their performance.
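A minimal sketch of such a search for the logistic regression pipeline is shown below. The parameter grid, the step names "tfidf" and "clf", the fold count, and the toy sentences are illustrative assumptions, not the values used in the submission.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: "<step name>__<parameter>" addresses a step in the pipeline.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Toy data for illustration only; this is not the competition dataset.
texts = ["i love this", "great day", "so happy", "what a nice surprise", "best ever",
         "i hate this", "awful day", "so sad", "what a terrible mess", "worst ever"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Every parameter combination is scored by cross-validated F1.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, search.best_estimator_ holds the pipeline refit on all the data with the best parameter combination, so it can be used directly for prediction.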

Built With

google-colab
jupyter-notebook
nltk
pickle
pycharm
python
scikit-learn

Submitted to

Ignition Hacks 2020
- Winner Division Sigma: Second Place Accuracy

Created by

I worked on exploring and testing different classifiers, pre-processing the data, and tuning the models.

George Liu
I used the scikit-learn library to train, evaluate, and optimize different classifiers. I also worked on natural language processing (i.e. lemmatization with part-of-speech tagging using the NLTK library).

David Chen
Helped with the optimization and development of the sentiment analyser.

David Wang
I tested out various machine learning models and optimized the logistic regression using grid search.

Michael Yang