Lit2Vec2

Inspiration

Most book recommenders tend to recommend only the most popular books; we wanted to create one that can recommend rarer books.

What it does

Lit2Vec2 is a book recommender designed to surface rarer books, in contrast to most book recommenders, which tend to favor the more popular ones.

How we built it

We obtained anonymized book ratings for 3 million users. For each user, we kept only the books they enjoyed, that is, books they rated 4/5 or 5/5.
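A minimal sketch of this filtering step, assuming the ratings arrive as (user, book, score) tuples. The tuple layout, the function name `liked_books_per_user`, and the sample data are illustrative assumptions, not taken from the training notebook.

```python
from collections import defaultdict


def liked_books_per_user(ratings, min_score=4):
    """Group book IDs by user, keeping only ratings of min_score or higher.

    `ratings` is assumed to be an iterable of (user_id, book_id, score)
    tuples with scores on a 1-5 scale (hypothetical layout).
    """
    liked = defaultdict(list)
    for user_id, book_id, score in ratings:
        if score >= min_score:
            liked[user_id].append(book_id)
    return dict(liked)


# Made-up example data: only the 4/5 and 5/5 ratings survive.
ratings = [
    ("u1", "b1", 5),
    ("u1", "b2", 2),
    ("u2", "b1", 4),
    ("u2", "b3", 3),
]
liked = liked_books_per_user(ratings)
```

Each user's resulting list is what the target and context books are later drawn from.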

For those who want to train on our data, here is a folder containing the training data:

https://drive.google.com/open?id=1FePFPkGk_cx8KwJZOZ5D14vpTVnGyphf

The 'GoodReadsUser4MS.h5' file is the main training file; each data point contains an anonymized userID and an embeddingID. The remaining files are dictionaries for converting embeddingIDs to titles and authors.

To see how the data was used in training, please see our training notebook 'Lit2Vec2TrainingPublic.ipynb'

The training algorithm is based on the skip-gram version of word2vec. A 'target' book is chosen at random from the list of books a user has rated, and the 'context' books are 4 other books chosen at random from the same list, excluding the target book. We trained the embeddings for 10 epochs over our data.
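The target/context selection can be sketched roughly as follows. This is an illustration under stated assumptions: the function name `sample_target_and_context` is hypothetical, each user's list is assumed to contain unique book IDs, and the context books are assumed to be drawn without replacement (the description above does not say either way).

```python
import random


def sample_target_and_context(user_books, num_context=4, rng=None):
    """Pick one target book and `num_context` distinct context books.

    `user_books` is one user's list of liked book IDs (assumed unique).
    Drawing num_context + 1 books without replacement and treating the
    first as the target guarantees the target is excluded from the context.
    """
    rng = rng or random.Random()
    if len(user_books) < num_context + 1:
        raise ValueError("user has too few books for one training example")
    picks = rng.sample(user_books, num_context + 1)
    return picks[0], picks[1:]


# Made-up embedding IDs for a single user.
rng = random.Random(0)
target, context = sample_target_and_context([10, 11, 12, 13, 14, 15], rng=rng)
```

Users with fewer than 5 liked books cannot produce an example under these assumptions; how such users are handled in the actual training run is not covered here.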

To make the recommender favor rarer books, we drew the negative samples from a log-uniform distribution [https://stats.stackexchange.com/questions/155552/what-does-log-uniformly-distribution-mean]. We ordered the books so that the most frequently occurring books sit at the head of the distribution and the least frequently occurring books sit at the tail. As a result, how often a particular book is selected as a negative sample is correlated with how popular the book is, which improves the embedding similarity properties of the rarer books.
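A small sketch of log-uniform negative sampling over popularity ranks, where rank 0 is the most frequently rated book. The formula used here is the common log-uniform (Zipfian) form, P(rank) = (log(rank + 2) - log(rank + 1)) / log(N + 1); it is an assumption that the training notebook uses this exact form, and the function names are illustrative.

```python
import math
import random


def log_uniform_prob(rank, num_books):
    """Probability of drawing `rank` (0 = most popular) out of num_books."""
    return (math.log(rank + 2) - math.log(rank + 1)) / math.log(num_books + 1)


def sample_negative_ranks(num_books, num_samples, rng):
    """Draw popularity ranks by inverse-transform sampling.

    With u uniform in [0, 1), exp(u * log(N + 1)) lies in [1, N + 1);
    truncating to an integer and subtracting 1 yields a rank in [0, N).
    """
    log_n = math.log(num_books + 1)
    ranks = []
    for _ in range(num_samples):
        rank = int(math.exp(rng.random() * log_n)) - 1
        # Guard against floating-point rounding at the upper edge.
        ranks.append(min(rank, num_books - 1))
    return ranks


rng = random.Random(0)
ranks = sample_negative_ranks(num_books=1000, num_samples=10000, rng=rng)
```

Under this distribution the head of the list is drawn far more often than the tail, so popular books serve as negatives much more frequently than rare ones.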

To see exactly how it was trained, please see our notebook 'Lit2Vec2TrainingPublic.ipynb'.

Challenges we ran into

We had a lot of embeddings, and it is very hard to train with a decent-sized embedding dimension because GPU RAM is limited. One of our members created a PyTorch library for this situation, SpeedTorch ( https://pypi.org/project/SpeedTorch/ ). Please see our repo for more details on how it was implemented.

What's next for Lit2Vec2

We are planning to incorporate different embedding training architectures, such as neural collaborative filtering, and to compare the results.

Built With

azure
python
pytorch
speedtorch

Created by

I contributed to the collection of user data from Goodreads, which expanded our ability to create meaningful recommendations.

Calen Robinette
Worked with Python notebooks collecting data from Goodreads.

Iker Nebot Amoriza
Set up data collection parameters and data cleaning/processing. Coded the training architecture in PyTorch, selecting the sampling and training parameters. Developed a library to increase the size of the embeddings and the training speed (SpeedTorch). Set up the embeddings analysis.

Santosh Gupta
I love Machine Learning / NLP, especially for analyzing scientific texts
A former data engineer who loves science more :-)

Stanislav Jirák
Edoardo Pona
Chris Vukin
Matthew K