Submission for Ignition Hacks 2020, Division Sigma ( ∑ ).
The prompt for this division of the hackathon was to build an artificial intelligence model that predicts the sentiment of tweets in a dataset. Because our background is stronger in statistics, we decided to build our approach on that knowledge.
What it does
Our program determines whether a piece of text has positive (1) or negative (0) sentiment.
How we built it
We used Google Colaboratory to collaborate on a Jupyter notebook running remotely.
Our algorithm starts by fetching the data from a GitHub page, since Colab periodically clears its files. After downloading the data to the local environment and reading it with Pandas, we tokenize and cleanse it.
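The download-and-read step can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the function name load_remote_csv, its parameters, and the assumption that the dataset is a CSV file are all hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd


def load_remote_csv(url, local_path):
    """Download a CSV once, then read it with pandas (hypothetical helper)."""
    local_path = Path(local_path)
    # Colab wipes its local files periodically, so re-download only when
    # the copy is missing.
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, local_path)
    return pd.read_csv(local_path)
```

Calling the helper with the raw URL of the dataset and a local file name would return a DataFrame. Checking for an existing copy first avoids repeating the download within a single Colab session.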
We remove hyperlinks and then call a function, cleanse(), on each word. It decides whether to keep the word: stop words and mentions of other users (words starting with @) are discarded. For the remaining words, it removes punctuation and symbols, attempts to lemmatize the result, and returns either the lemmatized word or the original (with punctuation and symbols removed).
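A rough sketch of this cleaning step is shown below. It is an illustration under stated assumptions rather than our exact code: the stop-word list is a tiny placeholder, lowercasing is an added assumption, the preprocess() wrapper is hypothetical, and the lemmatizer is passed in as a plain function (an identity function by default) so the sketch runs without any NLP library.

```python
import re
import string
from typing import Callable, List, Optional

# Tiny illustrative stop-word list (assumption); a real run would use a
# fuller list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "it"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Translation table that deletes ASCII punctuation and symbols.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def cleanse(word: str,
            lemmatize: Callable[[str], str] = lambda w: w) -> Optional[str]:
    """Return the cleaned token, or None if the word should be ignored."""
    if word.startswith("@"):  # mention of another user
        return None
    token = word.lower().translate(STRIP_PUNCT)  # lowercasing is an assumption
    if not token or token in STOP_WORDS:
        return None
    # With the default identity lemmatizer this returns the stripped original.
    return lemmatize(token)


def preprocess(tweet: str,
               lemmatize: Callable[[str], str] = lambda w: w) -> List[str]:
    """Remove hyperlinks, split on whitespace, and cleanse each word."""
    without_links = URL_PATTERN.sub(" ", tweet)
    cleaned = (cleanse(word, lemmatize) for word in without_links.split())
    return [token for token in cleaned if token is not None]
```

A real lemmatizer, for example one from an NLP library, could be supplied through the lemmatize argument without changing the rest of the function.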
After preprocessing, we train a Naive Bayes classifier (with a Gaussian distribution). It applies Bayes' theorem from probability and statistics, using the probability that each word belongs to a given sentiment to determine the sentiment of a sentence.
In our testing, processing the training data (the section labelled # == Preprocessing) takes around 23-25 minutes.
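To make the idea concrete, here is a small from-scratch sketch of Gaussian Naive Bayes over word-count features. It is not the implementation in our notebook: the class name, the vectorize() helper, the word-count representation, and the smoothing constant are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from typing import List, Sequence


def vectorize(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Count how often each vocabulary word occurs (hypothetical helper)."""
    return [float(tokens.count(word)) for word in vocabulary]


class GaussianNaiveBayes:
    """Each feature is modelled as an independent Gaussian per class."""

    def fit(self, X, y):
        rows_by_class = defaultdict(list)
        for row, label in zip(X, y):
            rows_by_class[label].append(row)
        self.priors = {}  # P(class)
        self.stats = {}   # per class: list of (mean, variance) per feature
        for label, rows in rows_by_class.items():
            self.priors[label] = len(rows) / len(X)
            class_stats = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                var = sum((v - mean) ** 2 for v in column) / len(column)
                # Small constant (assumption) avoids division by zero.
                class_stats.append((mean, var + 1e-9))
            self.stats[label] = class_stats
        return self

    def predict(self, row):
        # Bayes' theorem in log space:
        # log P(class | x) = log P(class) + sum_i log P(x_i | class) + const
        best_label, best_score = None, -math.inf
        for label, prior in self.priors.items():
            score = math.log(prior)
            for value, (mean, var) in zip(row, self.stats[label]):
                score += (-0.5 * math.log(2 * math.pi * var)
                          - (value - mean) ** 2 / (2 * var))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

After fitting on vectorized training tweets with labels 1 (positive) and 0 (negative), predict() returns the label with the highest posterior score. Working in log space keeps the product of many small probabilities from underflowing.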
Challenges we ran into
At first we could not find an algorithm with the right balance of efficiency and accuracy, so we experimented with quite a few. After settling on one, we had a lot of trouble optimizing its speed, memory use, and accuracy. We also had difficulty working simultaneously in Google Colab, as doing so would overwrite (and erase) one another's work.
Accomplishments that we're proud of
We are proud of learning how to use Google Colaboratory in a matter of two days. We are also proud of learning so much about machine learning and how sentiment analysis works. Finally, we are proud of completing a machine learning project at our first machine learning hackathon.
What we learned
We learned how to work with Google Colaboratory and some machine learning algorithms and ideas such as Naive Bayes, neural networks, and vectorization.
Since this was our first time using machine learning, we spent most of our first day watching tutorials and grasping new AI concepts before focusing on NLP techniques. Lemmatization, morphological segmentation, part-of-speech tagging, and tokenization were some of the cleaning methods we learned and implemented.
What's next for Sentiment Analysis
- Improving the accuracy of the analysis. While about 80% accuracy is nice, we want to refine the algorithm further to increase it.
- Improving the efficiency of the algorithm.