Clustering Movie Scripts To Improve Recommendation Systems

Inspiration

Movies are often categorized into broad genres, such as horror or drama. However, these labels do not reflect the diversity within each genre, which limits the accuracy of recommendation systems that rely on them. By clustering movie scripts, we can create subgenres within each genre without human annotation, and the resulting improvements can be stacked on top of other recommendation methods.

What it does

Our project scrapes movie scripts from IMSDB and preprocesses them, reducing each script to a list of its unique words (a bag-of-words approach). We then use a pretrained FastText model to obtain word embeddings, which a Gaussian mixture model takes as input to create subgenres within each of the eighteen given genres. These subgenres are used to update similarity matrices built from simple user-movie ratings, yielding more accurate movie recommendations as measured by precision and recall.
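The preprocessing and embedding steps can be sketched roughly as follows. This is a minimal illustration, not our exact code: the helper names are hypothetical, the tiny lookup table stands in for a pretrained FastText model, and averaging word vectors into one script vector is an assumption about how the embeddings are combined.

```python
import re

def unique_words(script_text):
    """Lowercase the script and return its unique alphabetic tokens, sorted."""
    return sorted(set(re.findall(r"[a-z]+", script_text.lower())))

def mean_embedding(words, lookup, dim):
    """Average the vectors of the words found in `lookup`; zeros if none match."""
    vectors = [lookup[w] for w in words if w in lookup]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 2-dimensional vectors standing in for pretrained FastText embeddings.
toy_lookup = {"ghost": [1.0, 0.0], "house": [0.0, 1.0]}

words = unique_words("The ghost left the house. The GHOST returned.")
script_vector = mean_embedding(words, toy_lookup, dim=2)
```

With a real pretrained model, the lookup would be replaced by the model's own word-vector query, and each script vector would have the model's full dimensionality.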

How we built it

We used FastText to embed the scripts, which we obtained from IMSDB.com. We used a Gaussian mixture model to detect components (subgenres) within each genre. We then compounded similarity measures using genres and subgenres in an item-based recommendation system built on the MovieLens dataset.
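One way the compounding step might look is sketched below. It assumes cosine similarity between item rating columns and a multiplicative boost for movies that share a label; the function names, the toy ratings, the labels and the factor of 1.2 are all illustrative rather than the values we used.

```python
import numpy as np

def item_cosine_similarity(ratings):
    """Cosine similarity between the item columns of a users x items matrix."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for unrated items
    normalized = ratings / norms
    return normalized.T @ normalized

def boost_shared_labels(similarity, labels, factor):
    """Multiply similarity by `factor` wherever two items share a label."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return np.where(same, similarity * factor, similarity)

# Toy data: 3 users x 3 movies, 0 meaning "not rated".
ratings = np.array([[5.0, 4.0, 0.0],
                    [4.0, 5.0, 1.0],
                    [0.0, 1.0, 5.0]])
genres = ["horror", "horror", "drama"]
subgenres = [0, 1, 0]  # hypothetical cluster ids, meaningful only within a genre

base = item_cosine_similarity(ratings)
# First boost movies in the same genre, then boost again within a subgenre.
sim = boost_shared_labels(base, genres, factor=1.2)
genre_subgenre = [f"{g}:{s}" for g, s in zip(genres, subgenres)]
sim = boost_shared_labels(sim, genre_subgenre, factor=1.2)
```

Because each boost is applied on top of the previous matrix, a pair that shares both genre and subgenre is scaled twice, which is what makes the approach stackable with other similarity adjustments.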

Challenges we ran into

Differences between the IMSDB and MovieLens data meant we could only use a subset of the dataset. Additionally, finding a practical word embedding for the problem was difficult.
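As an illustration of the kind of matching involved, the sketch below normalizes titles before intersecting the two sources. It assumes MovieLens-style titles with a trailing year and a trailing article (for example "Usual Suspects, The (1995)"); the helper name and the sample titles are hypothetical, and this is not necessarily how we aligned the data.

```python
import re

def normalize_title(title):
    """Drop a trailing '(YYYY)', move a trailing article to the front,
    lowercase the result, and collapse punctuation into single spaces."""
    title = re.sub(r"\s*\(\d{4}\)\s*$", "", title)
    match = re.match(r"^(.*), (The|A|An)$", title)
    if match:
        title = f"{match.group(2)} {match.group(1)}"
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

# Illustrative titles only.
movielens_titles = ["Usual Suspects, The (1995)", "Toy Story (1995)", "Heat (1995)"]
script_titles = ["The Usual Suspects", "Heat", "Alien"]

script_keys = {normalize_title(t) for t in script_titles}
matched = [t for t in movielens_titles if normalize_title(t) in script_keys]
```

Only the movies in `matched` would have both ratings and a script, which is why the usable dataset shrinks to a subset.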

Accomplishments that we're proud of

A key aspect of our project is that it is compoundable (stackable): script-derived subgenres can be applied alongside other state-of-the-art recommendation methods and still yield additional improvements in precision and recall. The unsupervised nature of the approach is also important, especially in the current era of big data. Labeled data is hard to find and manual annotation is cumbersome, so removing the need for human input allows the process to be automated. Paired with current advances in speech recognition, an entire pipeline that generates scripts automatically does not seem too distant.
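For reference, precision and recall for a recommendation list can be computed as in the sketch below. Evaluating on the top k recommendations is an assumption made for the example, and the function name and movie ids are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations.

    recommended: ranked list of movie ids, best first.
    relevant: set of movie ids the user actually liked.
    """
    top_k = recommended[:k]
    hits = sum(1 for movie in top_k if movie in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative ids: one of the top 3 recommendations is relevant,
# and the user has 3 relevant movies in total.
precision, recall = precision_recall_at_k(
    ["m1", "m2", "m3", "m4"], {"m2", "m4", "m9"}, k=3
)
```

Comparing these two numbers before and after the subgenre boost is one way to check whether the added signal actually helps.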

What we learned

We learned the syntactic quirks and limitations of the algorithm implementations in the libraries we used, as well as a range of web-scraping and data-processing techniques that we tried while looking for the right one. We also learned the value of obtaining more training data and features for improved results.

What's next for Clustering Movie Scripts To Improve Recommendation Systems

Next steps include fine-tuning our hyperparameters (the number of subgenres and the factor used to alter the similarity value) and trying out different similarity metrics. Standard machine learning issues such as overfitting also have to be addressed through regularization. Another option is to add an earlier step to the pipeline: if audio files of a movie can be found, a speech transcriber can be used to generate a script. Sites that explain the plot of a movie could also be used as additional features.
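One possible way to choose the number of subgenres is to fit a Gaussian mixture for several candidate counts and keep the one with the lowest Bayesian information criterion (BIC). The sketch below uses scikit-learn's GaussianMixture on synthetic data standing in for the script embeddings of a single genre; the BIC criterion, the diagonal covariance, the candidate counts and the function name are assumptions for illustration, not settings we have validated.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_subgenres(embeddings, candidate_counts, seed=0):
    """Fit one GMM per candidate count and keep the one with the lowest BIC."""
    best_model, best_bic = None, np.inf
    for k in candidate_counts:
        model = GaussianMixture(
            n_components=k, covariance_type="diag", random_state=seed
        )
        model.fit(embeddings)
        bic = model.bic(embeddings)
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model, best_model.predict(embeddings)

# Synthetic stand-in for one genre: two well separated groups of 4-d vectors.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(30, 4)),
    rng.normal(3.0, 0.1, size=(30, 4)),
])

model, labels = fit_subgenres(embeddings, candidate_counts=[1, 2, 3])
```

The same loop could be run once per genre, with the resulting labels used as that genre's subgenre ids. The similarity factor would still need its own search, for example against held-out precision and recall.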

Built With

beautifulsoup4
fasttext
gaussian-mixture-models
machine-learning
natural-language-processing
numpy
pandas
python
scikit-learn
word-embeddings

Updates

Simon Zeng started this project — Sep 07, 2019 07:18 PM EDT
