Text-based author identification

Hongming Fu, Ruize Ma, Yi Sun, Yu Gu (names arranged alphabetically)

Important links:
GitHub link
Slides link
Final report link

Introduction: Our project aims to predict the author of a text. Predicting the author lets us analyze the writing styles, linguistic patterns, and stylistic features unique to each author. This can provide insights into authorial characteristics, preferences, and changes in writing style over time, and so give us a deeper understanding of literature. It can also be used in educational settings for tasks such as identifying plagiarism, assessing writing proficiency, and creating personalized learning experiences based on author-specific content.

Related Work: The dataset we found consists of 36 books from Project Gutenberg. We will preprocess the books into training data and set two training targets: the author, and the author's age when writing the book.

Data: For this project, we are going to use a dataset drawn from Project Gutenberg, a volunteer effort to digitize and archive cultural works. This source is valuable because its works are free of copyright issues, making it an ideal choice for academic and research purposes. Other NLP projects have also gathered their datasets from it. Our dataset includes 36 works from a variety of genres, ensuring diversity in literary style, theme, and period. Each book ranges in size from a few hundred KB to 4 MB. For preprocessing, we remove content that carries no information about the author's own writing, including the table of contents, preface, and epilogue. Each book is then tokenized into individual sentences. It is worth noting that we will regroup these sentences depending on the task we intend to implement.

https://www.kaggle.com/datasets/mateibejan/15000-gutenberg-books

https://www.gutenberg.org/
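The preprocessing described above (split each book into sentences, then regroup them per task) could look roughly like the sketch below. This is only an illustration under assumptions: the regex-based splitter, the function names `split_sentences` and `regroup`, and the `max_chars` limit are hypothetical and not taken from the project code.

```python
import re

def split_sentences(text):
    """Naively split text at '.', '!' or '?' followed by whitespace (an assumed rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def regroup(sentences, max_chars):
    """Greedily pack consecutive sentences into chunks.

    A chunk stays within max_chars unless a single sentence is longer than that,
    in which case the sentence becomes a chunk on its own.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk that is already full
            current = sentence       # start a new chunk with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "It was a dark night. The rain fell! Who was there? Nobody knew."
sentences = split_sentences(sample)
chunks = regroup(sentences, max_chars=40)
```

Using a larger `max_chars` would give paragraph-sized inputs, while a small value keeps inputs close to single sentences.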

Methodology: We will first apply word embedding to generate an embedding vector for each word that appears in the novels. We will then build CNN, RNN (LSTM or GRU), and transformer models on top of these embeddings. We will tune the hyperparameters and decide the number of layers based on the validation score. Depending on the accuracy of the final model, we will also try to build a generative model.

Metrics: For most of our assessment, we look at the accuracy of the model. We also use the cross-entropy loss as the training objective for gradient descent.
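To make the two metrics concrete, here is a minimal pure-Python sketch of accuracy and mean cross-entropy computed from raw model scores (logits). The function names and the toy numbers are assumptions for illustration. In practice a deep learning framework's built-in loss would be used instead.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(softmax(logits)[y]) for logits, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

def accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class is the true class."""
    correct = sum(
        1 for logits, y in zip(batch_logits, labels) if logits.index(max(logits)) == y
    )
    return correct / len(labels)

# Toy batch: 3 examples, 3 candidate authors (made-up numbers).
batch_logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 3.0], [1.0, 1.5, 0.2]]
labels = [0, 2, 0]
acc = accuracy(batch_logits, labels)
loss = cross_entropy(batch_logits, labels)
```

Accuracy only counts whether the top prediction is right, whereas cross-entropy also penalizes low confidence in the correct author, which is why it is the quantity minimized during training.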

Innovations: Multi-channel multi-model CNN structure and committee model.

Reason for using deep learning to solve this problem: Since we are predicting information about the author from the book itself, our model needs to be good at extracting features from large amounts of raw text. The features of a book can be complex, and deep learning can capture complex patterns and relationships in the data, leading to better performance on unseen examples. Our dataset also contains a very large number of words, and deep learning algorithms scale effectively with increasing amounts of data: they can leverage large datasets to improve performance and generalization.

Stakeholders:
Literary Scholars and Historians: To analyze texts and attribute authorship, particularly for historical documents where the authorship may be disputed or unknown.
Students and Educators: For educational purposes, such as studying writing styles, understanding genre evolution, or examining the body of work attributed to specific authors.
Marketing Teams: In book marketing and sales, understanding genre trends can help target the right audience for promotional campaigns.

Consequences of mistakes:
Misclassification: Mislabeling the genre of a text could lead to cultural misunderstandings or misrepresentation, particularly for texts that are culturally or historically significant.
Educational Impact: Inaccurate information can lead to the dissemination of false knowledge, which can be particularly damaging in educational settings.
Commercial Implications: For publishers and distributors, misclassifications can lead to poor decision-making in marketing, stock selection, and user recommendation systems.

Updates

Introduction: We will be predicting the genre and author of texts in this project, which is a classification problem. Predicting the author lets us analyze the writing styles, linguistic patterns, and stylistic features unique to each author. This can provide insights into authorial characteristics, preferences, and changes in writing style over time, and so give us a deeper understanding of literature. Additionally, it can be used in educational settings for tasks such as identifying plagiarism, assessing writing proficiency, and creating personalized learning experiences based on author-specific content.

Challenges: What has been the hardest part of the project you’ve encountered so far?

Writing the encoder and decoder for the transformer model, and improving the model's runtime (it takes several hours to run).

Converting the Conv2d models to Conv1d and adding multiple channels (which required calculating the input shape for the linear layer), and preprocessing the data so that the n_gram inputs fit the model well.
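The shape calculation mentioned above can be sketched as follows. The output-length formula is the standard one for 1-D convolution and pooling layers. The branch layout (parallel convolution channels with different kernel sizes, each followed by max pooling, then flattened and concatenated) is an assumption about the architecture, and the function names and example sizes are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution or pooling layer (floor mode)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def linear_in_features(seq_len, branches):
    """Size of the flattened, concatenated branch outputs fed to the linear layer.

    branches: list of (out_channels, conv_kernel, pool_kernel), one per parallel channel.
    Assumes each branch is conv -> max pool with stride equal to the pool kernel.
    """
    total = 0
    for out_channels, conv_kernel, pool_kernel in branches:
        after_conv = conv1d_out_len(seq_len, conv_kernel)
        after_pool = conv1d_out_len(after_conv, pool_kernel, stride=pool_kernel)
        total += out_channels * after_pool
    return total

# Three hypothetical channels with kernel sizes 3, 4 and 5 over a length-100 input.
features = linear_in_features(100, [(16, 3, 2), (16, 4, 2), (16, 5, 2)])
```

Computing this number once from the hyperparameters avoids hard-coding the linear layer's input size each time a kernel size or the sequence length changes.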

Because we are training CNN, RNN, GRU, LSTM, Bi-LSTM, and Transformer models, we found that it would be difficult to finish these computations without GPU acceleration. We ran into some difficulties setting up CUDA and cuDNN to speed up our training and tuning process, and we are still having trouble adjusting our functions and packages to match our hardware setup.

Insights: Are there any concrete results you can show at this point?

The models' accuracies:
CNN:
1_gram CNN with 3 channels: Loss: 1.3648, Accuracy: 58.71%
2_gram CNN with 3 channels: Loss: 1.6228, Accuracy: 51.17%
3_gram CNN with 3 channels: Loss: 1.7238, Accuracy: 48.35%
1_gram CNN with 4 channels: Loss: 1.1774, Accuracy: 64.01%
Transformer: Test Loss: 0.0864, Test Accuracy: 0.8543
LSTM: Validation statistics: Acc: 0.5232, Loss: 0.0331, F1 Score: 0.5122
GRU: Validation statistics: Acc: 0.8913, Loss: 0.0066, F1 Score: 0.8906
Bi_LSTM: Validation statistics: Acc: 0.5710, Loss: 0.0289, F1 Score: 0.5597

How is your model performing compared with expectations?

For the CNN model, I find that the 1_gram model performs better than the 2_gram and 3_gram models (in both training accuracy and testing accuracy), which is not what I expected.

It seems that LSTM and Bi-LSTM don't perform very well with our current settings, reaching validation accuracies of only about 52% and 57%. The GRU-based model, on the other hand, reached nearly 90% validation accuracy, the highest among our models. The Transformer model also reached an accuracy over 85%.

It is also worth mentioning that the transformer model excels at predicting from short text. Our models are trained on a paragraph-based dataset in which each text has around 3000 characters. When trying the models on texts outside our dataset, I find that GRU performs well only if the input text is long enough. The transformer model works much better with shorter snippets of only 400 to 500 characters.

Plan: Are you on track with your project?

Yes

What do you need to dedicate more time to?

The innovation part. We are trying to find algorithms that can optimize the weights in the committee so that the combined model performs better than any single model.
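One simple way to pick committee weights is a grid search over weighted averages of each model's predicted class probabilities, scored on validation data. The sketch below illustrates only that idea: grid search is just one candidate algorithm, and the function names and toy probabilities are made up for the example.

```python
from itertools import product

def combine(prob_sets, weights):
    """Weighted average of per-model class probabilities for each example."""
    n_examples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    return [
        [sum(w * probs[i][c] for w, probs in zip(weights, prob_sets))
         for c in range(n_classes)]
        for i in range(n_examples)
    ]

def accuracy(probs, labels):
    """Fraction of examples whose most probable class is the true class."""
    hits = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) == y)
    return hits / len(labels)

def search_weights(prob_sets, labels, steps=10):
    """Try every weight vector on a grid summing to 1 and keep the most accurate one."""
    best_weights, best_acc = None, -1.0
    for combo in product(range(steps + 1), repeat=len(prob_sets)):
        if sum(combo) != steps:
            continue
        weights = [c / steps for c in combo]
        acc = accuracy(combine(prob_sets, weights), labels)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc

# Toy validation outputs from two hypothetical models (3 examples, 2 authors).
model_a = [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]]
model_b = [[0.3, 0.7], [0.8, 0.2], [0.1, 0.9]]
labels = [0, 0, 1]
best_weights, best_acc = search_weights([model_a, model_b], labels)
```

In this toy case each model alone misclassifies at least one example, while a suitable weighting classifies all three correctly. The number of grid points grows quickly with the number of models, so a larger committee would likely need a smarter search.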

What are you thinking of changing, if anything?

We will keep fine-tuning the hyperparameters. We are also thinking about restructuring our dataset. The dataset we have right now is paragraph based, and we are considering a combined dataset of both sentences and paragraphs so that the model can learn from both sentence structure and paragraph structure. While that idea is perfectly feasible, it requires computational resources beyond our capabilities: it takes roughly 3 hours to train each model, not counting tuning.
