LSTM networks on religious documents (LSTM NORD)

Updates

Justin Liao posted an update — Nov 25, 2020 08:56 PM EST

Introduction: We will be implementing an existing paper that attempted to generate text given a context. More specifically, the paper built an LSTM model trained on J.R.R. Tolkien’s The Lord of the Rings trilogy. The model was trained to predict the next word, given a sequence of words and a context vector. Once training was complete, the model was used to generate a sequence of words on its own by feeding its predictions back into itself, producing successive words that form sentences and passages.

We wish to change the dataset in order to investigate a new and possibly more interesting topic. Instead of training our model on The Lord of the Rings, we intend to train it on religious works such as the Bible, Quran, Hebrew Bible, etc. We expect that the model will eventually be able to generate passages that resemble those from these works. Thus, the problem we will be addressing is a prediction problem. In addition to generating new passages, we hope to discern differences between the models trained on each of the Bible, Quran, Hebrew Bible, etc., thereby identifying differences between the texts. A simple classifier will be used to attempt to classify each passage generated by our model as resembling a particular religious text.

Our interest in this paper stems from our interest in Natural Language Processing. The idea of generating completely new text using a computer fascinated us, and recently there have been more and more examples of models that generate text by studying existing text, including Botnik’s generated Harry Potter chapter and Microsoft’s chatbot, Tay, which was supposed to generate tweets. We believe it would be interesting to see whether a deep learning model could emulate religious texts since, as far as we know, nothing similar has been done for these specific texts.
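To make the generation step concrete, here is a minimal sketch of the feed-back loop described above. It is not the paper's model or our implementation: a toy bigram counter stands in for the trained LSTM, the context vector is left out entirely, and the names train_bigram, predict_next, and generate, as well as the tiny corpus, are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count next-word frequencies. Toy stand-in for a trained LSTM."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, history, rng):
    """Sample the next word given the history.

    A real LSTM would condition on the whole sequence (and a context
    vector); this stand-in only looks at the last word.
    """
    options = counts.get(history[-1])
    if not options:
        return None  # no known continuation for this word
    words = list(options)
    weights = [options[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(counts, seed, max_words, rng):
    """Autoregressive loop: each prediction is appended and fed back in."""
    output = list(seed)
    for _ in range(max_words):
        nxt = predict_next(counts, output, rng)
        if nxt is None:
            break
        output.append(nxt)
    return output


# Hypothetical miniature corpus, used only to exercise the loop.
corpus = "in the beginning was the word and the word was with the light".split()
model = train_bigram(corpus)
passage = generate(model, ["in"], max_words=8, rng=random.Random(0))
print(" ".join(passage))
```

Swapping predict_next for a call into a trained network would leave the loop in generate unchanged, which is the part the paper's generation procedure relies on.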

Challenges: Our team has been struggling to find time to complete parts of the project; however, we are in the midst of preprocessing and hope to finish it soon.

Insights: We don’t have concrete results yet.

Plan: Unfortunately, we are not on track with the project. However, with the removal of the final homework, we believe we can get back on track over the break. Some religious texts have been trickier to find than others. As a result, we intend to remove the Torah as one of the datasets due to the difficulty of obtaining an open-source English translation of the text in an appropriate format.

Justin Liao started this project — Nov 13, 2020 09:28 PM EST
