LSTM Based Poems Writing Style Recognition Analysis

Digital poster

Who Are We

Kejing Huang (khuang55)
Zi Yang (zyang105)
Qiubai Yu (qyu14)

Introduction

Poetry (derived from the Greek poiesis, "making") is a form of literature that uses aesthetic and often rhythmic qualities of language − such as phonaesthetics, sound symbolism, and metre − to evoke meanings in addition to, or in place of, a prosaic ostensible meaning. Poetry is important because writing it lets us get out our feelings and thoughts on a subject while reading it encourages us to connect and find meaning in our experiences. Every poet has a style. Part of the keys to becoming a well-known and respected poet is honing that style and creating something entirely unique to you and your life experience. As students that love reading and writing poetry, and have seen what neural networks have been able to do with text classification, we want to test the boundaries of such a network. Our goal through this Deep Learning project is to see if a Long short-term memory (LSTM) network model can find and reveal the right poet given a part of a poem. We hope that our model can identify each poet's writing style.

Related Work

One of the related papers we read is Generating rhyming poetry using LSTM recurrent neural networks. It compiles a new dataset of amateur poetry which allows rhyme to be learned without external constraints because of the dataset’s size and high frequency of rhymes, and this is the first time that a recurrent neural network has been shown to generate rhyming poetry a high percentage of the time.

Other open-source related work:

Automatic Generation Method of Ancient Poetry Based on LSTM

Creative GANs for generating poems, lyrics, and metaphors

Generating Chinese Classical Poetry with Quatrain Generation Model (QGM) Using Encoder-Decoder LSTM

Data

What data are you using (if any)?
For data generation, we used poems from Kaggle/Project Gutenberg (https://www.kaggle.com/nltkdata/gutenberg), PoemHunter (https://www.poemhunter.com/), and Project Gutenberg website (https://www.gutenberg.org/) to organize all poems from the same poet into one text file, and uploaded all text files to the Google Colab manually.
How big is it? Will you need to do significant preprocessing?
We selected 11 poets: George Gordon Byron, Khalil Gibran, Johann Wolfgang von Goethe, Thomas Hardy, Seamus Heaney, Homer, Victor Hugo, Alexander Sergeyevich Pushkin, Percy Bysshe Shelley, Rabindranath Tagore, Voltaire. For data preprocessing, for each poem, split paragraphs into sentences, remove punctuations and truncate the sentences. Then convert words into numbers, create input data and labels, and generate training set (90% of the dataset) and testing set (10% of the dataset).

Methodology

Our goal is to build a LSTM network model that can find and reveal the right poet given a part of a poem. More specifically, firstly split poems into sentences, remove punctuations and truncate the sentences. Then convert words into numbers, create input data and labels, and generate training set (90% of the dataset) and testing set (10% of the dataset). Lastly build a LSTM neural network model with a softmax function appended to the final layer to convert the scores to a normalized probability distribution and train the model that can identify who writes a piece of work. We hope that our model can identify each poet's writing style. Furthermore, given enough data, we expect our model to be trained on the poems of a poet, and be able to classify later inputs as the given poet.

Metrics

Text recognition is a hard problem and there are many researchers spending a lot of effort on it. For our project, we cannot guarantee that our model will 100% identify the right poet. However, LSTM model outperforms other ML/DL approaches due to its capability of generating poetry with coherency, semantic meaning, and fluency comparable to texts written by humans, etc. Methods to improve the model accuracy include adding more LSTM layers, increasing the number of epochs or batch size, etc. However, adding more epochs may also lead to overfitting the model, resulting in low testing accuracy. We need to keep this in mind during model developing and testing. As long as our model demonstrates that it has learnt some writing style of each poet and it can make reasonable recognition, we will consider it as "success".

Ethics

Why is Deep Learning a good approach to this problem?
Tradition neural networks suffer from short-term memory and vanishing gradient. LSTMs efficiently improves performance by memorizing the relevant information that is important and finds the pattern. If we look at other non-neural network classification techniques, they are trained on multiple word as separate inputs that are just word having no actual meaning as a sentence, and while predicting the writer it will give the output according to statistics and not according to meaning. On the other hand, in LSTM we can use a multiple word string to find out the class to which it belongs. This is very helpful while working with Natural language processing. If we use appropriate layers of embedding and encoding in LSTM, the model will be able to find out the actual meaning in input string and will give the most accurate poet.
Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?
This project is beneficial to everyone that loves reading and writing poetry, or just gets started in poetry. Mistake made by our model is mainly revealing the wrong poet, which is not very harmful, but we will try our best to improve the model accuracy.