What it does
Our submission for the 2020 Ignition Hacks Sigma Division. We created an AI model that predicts the sentiment of a given sentence, classifying it as positive (represented by 1) or negative (represented by 0). The code was written in Python using the scikit-learn machine learning library and the Natural Language Toolkit (NLTK).
The provided data was first read into a pandas DataFrame in a Google Colab notebook. We then implemented and tested the following data cleaning techniques in our pre-processing step. Each technique was run several times, and we judged its usefulness by the average F1 score across those runs. To keep the comparison fair, the DataFrame size and the model were held constant for every run, and the data was always split into 80% for training and 20% for testing. See the specific implementations in the submissions_extras.ipynb file in the repository.
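The evaluation procedure described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the helper name `averaged_f1`, the toy sentences, the number of runs, the stratified split, and the choice of TfidfVectorizer with Logistic Regression as the fixed model are all assumptions made for the example.

```python
from statistics import mean

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def averaged_f1(texts, labels, clean=lambda s: s, runs=5):
    """Average the F1 score of one fixed model over several 80/20 splits.

    `clean` is a hypothetical hook: any function mapping a sentence to its
    cleaned version, so different pre-processing steps can be compared.
    """
    cleaned = [clean(t) for t in texts]
    scores = []
    for seed in range(runs):
        # 80% training, 20% testing; stratifying is an assumption that
        # keeps both classes present in the tiny test set.
        X_train, X_test, y_train, y_test = train_test_split(
            cleaned, labels, test_size=0.2, random_state=seed, stratify=labels
        )
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(X_train, y_train)
        scores.append(f1_score(y_test, model.predict(X_test)))
    return mean(scores)


# Tiny made-up dataset, for illustration only (1 = positive, 0 = negative).
texts = ["i love this", "great day", "so happy", "what a nice song", "best movie ever",
         "i hate this", "awful day", "so sad", "what a bad song", "worst movie ever"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

baseline = averaged_f1(texts, labels)
```

Passing a different `clean` function while leaving everything else unchanged gives a score that can be compared directly against `baseline`.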
Removing stopwords and lemmatizing names and other words did not help: in each case the F1 scores decreased slightly. Punctuation removal, however, improved the results.
- Removal of stopwords
  - Removing stopwords from the sentences made the model slightly less accurate
- Lemmatization of names
  - Replacing every name in the text (identified by a preceding '@' sign) with the same name resulted in a slight loss in accuracy
- Lemmatization with part-of-speech tagging
  - After applying the WordNet lemmatization functions from NLTK, we observed a noticeable decrease in the model's accuracy
- Removal of punctuation
  - Removing punctuation led to a very slight increase in accuracy
Since punctuation removal was the only technique that improved F1 scores, it was the only one we implemented.
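A punctuation-removal step of this kind might look like the sketch below. It uses only the Python standard library; the function name `remove_punctuation` and the sample sentence are hypothetical, and the sketch assumes that only ASCII punctuation is stripped (which also removes the '@' in front of names).

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(sentence: str) -> str:
    """Return the sentence with ASCII punctuation stripped out."""
    return sentence.translate(_PUNCT_TABLE)


cleaned = remove_punctuation("@sam I can't wait, it's going to be great!!!")
# cleaned == "sam I cant wait its going to be great"
```

On a pandas DataFrame, the same function could be applied to a text column with `Series.apply`; the column name would depend on the dataset.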
We tested CountVectorizer and TfidfVectorizer with different parameters and different classifiers to see which combination would yield the greatest accuracy. Since our focus was on accuracy rather than speed, we opted for TfidfVectorizer.
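To illustrate the difference between the two vectorizers, here is a small sketch on two made-up sentences, using default parameters (the parameters we actually tried are not reproduced here).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up sentences, for illustration only.
sentences = ["good good movie", "bad movie"]

counts = CountVectorizer().fit_transform(sentences)
tfidf = TfidfVectorizer().fit_transform(sentences)

# Both vectorizers build the same vocabulary (bad, good, movie), so the
# two sparse matrices have the same shape.
# CountVectorizer stores raw term counts; TfidfVectorizer down-weights
# terms that appear in many documents and, by default, L2-normalizes
# each row.
count_rows = counts.toarray()   # [[0, 2, 1], [1, 0, 1]]
tfidf_rows = tfidf.toarray()
```

In the TF-IDF matrix, "movie" appears in both sentences and therefore gets a lower weight than "bad" within the second sentence, whereas the raw counts treat them equally.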
The following classifiers were implemented and evaluated using the scikit-learn library:
- Neural Network
- Decision Tree
- Logistic Regression
- Support Vector Machine
- Stochastic Gradient Descent
Logistic Regression yielded the highest average F1 score when data size, punctuation removal, vectorization, and the train-test split were all held constant.
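A comparison like this could be set up along the following lines. This is only a sketch: the specific scikit-learn classes chosen for each classifier type, their settings, and the toy data are assumptions, and it uses 5-fold cross-validation as a stand-in for averaging over repeated 80/20 splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset, for illustration only (1 = positive, 0 = negative).
texts = ["i love this", "great day", "so happy", "what a nice song", "best movie ever",
         "i hate this", "awful day", "so sad", "what a bad song", "worst movie ever"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# One assumed scikit-learn estimator per classifier type in the list.
candidates = {
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": LinearSVC(),
    "Stochastic Gradient Descent": SGDClassifier(random_state=0),
}

mean_f1 = {}
for name, clf in candidates.items():
    # Same vectorizer for every classifier so only the model varies.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1")
    mean_f1[name] = scores.mean()

# Classifier names ordered from highest to lowest mean F1.
ranking = sorted(mean_f1, key=mean_f1.get, reverse=True)
```

On a dataset this small the ranking is meaningless; the point is only that every classifier sees the same vectorizer and the same folds.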
We then used GridSearchCV to find the optimal parameters for each classifier and reevaluated their usability.
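As a rough sketch of such a search for one classifier, the example below tunes a TfidfVectorizer plus Logistic Regression pipeline. The parameter grid, the step names `tfidf` and `clf`, and the toy data are assumptions for illustration; they are not the grids we actually searched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up dataset, for illustration only (1 = positive, 0 = negative).
texts = ["i love this", "great day", "so happy", "what a nice song", "best movie ever",
         "i hate this", "awful day", "so sad", "what a bad song", "worst movie ever"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: "<step name>__<parameter>" addresses a pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Every combination is scored by cross-validated F1.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(texts, labels)

best_params = search.best_params_   # best combination found in the grid
best_f1 = search.best_score_        # its mean cross-validated F1
```

Repeating this with a grid suited to each classifier gives one tuned model per classifier type, which can then be compared on the same footing.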