Inspiration

This project is a collaboration between four people attending our first datathon and learning our craft in data science. It has two parts: decoding word embeddings to predict their secret themes, and building a sentiment classifier for IMDb reviews.

What it does

Part 1: We use the provided CNN data to compute the cosine similarity between each document and the embeddings that need a theme. We then go through the 10 documents with the highest cosine similarity and look for a theme they have in common.
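The ranking step can be sketched roughly as below. This is a minimal illustration and not our actual code: the function names `cosine_similarity` and `top_k_documents`, the document ids, and the toy 3-dimensional vectors are all made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k_documents(query, doc_vectors, k=10):
    """Return (doc_id, similarity) pairs for the k most similar documents."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors standing in for real document embeddings.
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
mystery = [2.0, 0.0, 0.0]  # the embedding that needs a theme
ranking = top_k_documents(mystery, docs, k=2)
```

Because cosine similarity only depends on direction, `mystery` matches `doc_a` perfectly even though its length is different. In the real task, the top 10 documents would then be read by hand to spot the shared theme.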

Part 2: The IMDb classifier performs sentiment analysis on a dataset of 5,000 movie reviews. To predict whether a custom review is good or bad, we applied a pretrained word embedding, GloVe, and reached 68% accuracy.

How we built it

As this was our first contact with data science, we spent a lot of time researching how word embeddings work. We decided to build a Word2Vec model to understand the correlations between words and how words can be translated into vectors, and then progressed to a Doc2Vec model to learn how documents of different lengths can still be represented as a 512-dimensional vector.

For the IMDb classifier, we ran sentiment analysis on the dataset of 5,000 movie reviews. To predict whether a custom review is good or bad, we applied the pretrained GloVe word embedding, which produced 68% accuracy. The GloVe (Global Vectors for Word Representation) vectors were obtained from Stanford NLP GloVe. GloVe is trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
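One common way to turn pretrained word vectors into a fixed-size input for a classifier is to average the vectors of the words in a review. The sketch below shows that idea only; it is an assumption for illustration, not a record of how our model consumed the GloVe vectors. The names `tokenize`, `review_vector`, and `toy_embeddings` are hypothetical, and the 2-dimensional table stands in for real GloVe vectors.

```python
def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def review_vector(text, embeddings, dim):
    """Average the vectors of known words; unknown words are skipped."""
    total = [0.0] * dim
    count = 0
    for token in tokenize(text):
        vec = embeddings.get(token)
        if vec is None:
            continue  # out-of-vocabulary word
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total  # no known words: fall back to the zero vector
    return [t / count for t in total]

# Hypothetical 2-dimensional table standing in for real GloVe vectors.
toy_embeddings = {
    "great": [1.0, 0.0],
    "boring": [-1.0, 0.0],
    "movie": [0.0, 1.0],
}
features = review_vector("Great movie", toy_embeddings, dim=2)
```

Reviews of any length map to a vector of the same size, so the result can be fed to any standard classifier. Averaging throws away word order, which is one plausible reason a simple setup like this can end up with modest accuracy.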

Challenges we ran into

P1: We ran into a lot of challenges learning how to tune the hyperparameters to get vector weights similar to those provided in the CNN sample, especially the window size and min_count, since texts of different lengths could bias those two variables. We ended up using cosine similarity to find the themes.
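To see why these two settings matter, here is a simplified sketch of how a minimum count and a window size shape the training data of a Word2Vec-style model. It is a toy illustration under our own assumptions, not the internals of any particular library: `build_vocab`, `context_pairs`, and the two-sentence corpus are invented for the example.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only the words that appear at least min_count times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}

def context_pairs(sentences, vocab, window):
    """List (center, context) pairs within `window` positions of each word."""
    pairs = []
    for sent in sentences:
        kept = [tok for tok in sent if tok in vocab]  # drop rare words first
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

# Hypothetical toy corpus.
corpus = [
    ["the", "film", "was", "great"],
    ["the", "film", "was", "dull"],
]
vocab = build_vocab(corpus, min_count=2)  # "great" and "dull" are dropped
narrow = context_pairs(corpus, vocab, window=1)
wide = context_pairs(corpus, vocab, window=2)
```

With `min_count=2` the two sentiment words vanish from the vocabulary entirely, and widening the window from 1 to 2 grows the number of training pairs from 8 to 12. Short texts lose words to the count threshold quickly, while long texts produce many more pairs per document.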

P2: The main problem we ran into was not having enough time and resources to find the best way to fit and evaluate the model, which resulted in low accuracy. Additional deep learning models should have been tested, along with more precise data preprocessing, to improve accuracy.

Accomplishments that we're proud of

We are proud of what we made: two models and a classifier. This was a steep learning curve that we believe we overcame in a very short time. As this is the first hackathon/datathon for some of us, we are really glad we could finish and submit our first commit.

What we learned

We learned how words and documents can be represented as embeddings, and how word embeddings and ML models can be used for sentiment analysis on reviews.
