Text-based author identification
Hongming Fu, Ruize Ma, Yi Sun, Yu Gu (names arranged alphabetically)
Important links:
GitHub link
Slides link
Final report link
Introduction: Our project aims to predict the author of a text from the text itself. Building such a predictor lets us analyze the writing styles, linguistic patterns, and stylistic features unique to each author, which can provide insight into authorial characteristics, preferences, and changes in writing style over time, and in turn a deeper understanding of literature. The same approach can be used in educational settings for tasks such as identifying plagiarism, assessing writing proficiency, and creating personalized learning experiences based on author-specific content.
Related Work: The dataset we found consists of 36 books from Project Gutenberg. We will preprocess the books to produce our training data, and use as training targets the author of each book and the author's age when writing it.
Data: For this project, we use a dataset drawn from Project Gutenberg, a volunteer effort to digitize and archive cultural works. This source is valuable because it provides many works free of copyright issues, making it a good choice for academic and research purposes. Other NLP projects have also gathered their datasets from Project Gutenberg. Our dataset includes 36 works from a variety of genres, ensuring diversity of literary style, theme, and period. The size of each book ranges from a few hundred KB to 4 MB. For preprocessing, we remove content from each book that carries no information about the author's own writing, including the table of contents, preface, and epilogue. Each book is then split into individual sentences. Note that we regroup these sentences into samples according to the task we intend to run.
https://www.kaggle.com/datasets/mateibejan/15000-gutenberg-books
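The sentence splitting and regrouping described in the Data section can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it assumes the table of contents, preface, and epilogue have already been removed, uses a naive regex-based sentence splitter, and the function names and the group size of 5 are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    # Collapse line breaks and repeated spaces so wrapped lines do not split sentences.
    flat = re.sub(r"\s+", " ", text).strip()
    if not flat:
        return []
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]

def group_sentences(sentences, size):
    """Regroup sentences into consecutive, non-overlapping samples of `size` sentences."""
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]

def build_samples(body_text, author, size=5):
    """Pair every regrouped sample with its author label (size=5 is an assumed default)."""
    chunks = group_sentences(split_sentences(body_text), size)
    return [(chunk, author) for chunk in chunks]
```

Changing `size` is one way to regroup the same sentences for different tasks, for example short samples for sentence-level classification and longer ones for passage-level classification. A regex splitter will mis-split abbreviations such as "Mr.", so a dedicated tokenizer may be preferable in practice.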
Methodology: We will first apply word embedding to generate an embedding vector for each word that appears in the novels. We will then build CNN, RNN (LSTM or GRU), and transformer models on top of these embeddings. We will tune the hyperparameters and decide the number of layers based on the validation score. Depending on the accuracy of the final model, we will also try to build a generative model.
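To make the embedding-then-CNN idea concrete, here is a minimal forward pass written with NumPy rather than a deep learning framework. It is a sketch under stated assumptions, not the project's architecture: the weights are random and untrained, and the vocabulary size, embedding dimension, kernel widths, filter count, and number of authors are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
VOCAB_SIZE, EMBED_DIM, NUM_AUTHORS = 1000, 16, 4
KERNEL_WIDTHS, NUM_FILTERS = (3, 4, 5), 8

# Randomly initialized (untrained) parameters.
embedding = rng.normal(0, 0.1, (VOCAB_SIZE, EMBED_DIM))
filters = {w: rng.normal(0, 0.1, (NUM_FILTERS, w, EMBED_DIM)) for w in KERNEL_WIDTHS}
W_out = rng.normal(0, 0.1, (NUM_FILTERS * len(KERNEL_WIDTHS), NUM_AUTHORS))

def forward(token_ids):
    """Map a sequence of word ids to a probability distribution over authors."""
    x = embedding[token_ids]                      # (seq_len, EMBED_DIM)
    pooled = []
    for w, f in filters.items():
        # Every window of w consecutive word vectors: (seq_len - w + 1, w, EMBED_DIM)
        windows = np.stack([x[i:i + w] for i in range(len(x) - w + 1)])
        feature_maps = np.einsum("twd,fwd->tf", windows, f)  # 1D convolution
        feature_maps = np.maximum(feature_maps, 0)           # ReLU
        pooled.append(feature_maps.max(axis=0))              # max-over-time pooling
    features = np.concatenate(pooled)             # (NUM_FILTERS * number of widths,)
    logits = features @ W_out
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

probs = forward(rng.integers(0, VOCAB_SIZE, size=20))
```

The input must contain at least as many tokens as the widest kernel (5 here). In a real model the same structure would be expressed in a framework that provides automatic differentiation, so that the embedding, filters, and output layer can be trained jointly.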
Metrics: For most of our assessment, we look at the accuracy of the model. We also use the cross-entropy loss as the objective when training with gradient descent.
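As a small illustration of the two quantities named above, the sketch below computes accuracy and mean cross-entropy from predicted class probabilities in plain Python. The function names and the `eps` clamp are assumptions made for this example; a training framework would normally supply its own implementations.

```python
import math

def accuracy(probs, labels):
    """Fraction of samples whose highest-probability class equals the true label."""
    correct = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__)
        correct += int(predicted == y)
    return correct / len(labels)

def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class.

    eps guards against log(0) when a model assigns zero probability.
    """
    total = sum(-math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return total / len(labels)

# Three samples, three candidate authors (made-up numbers for illustration).
example_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3]]
example_labels = [0, 1, 2]
acc = accuracy(example_probs, example_labels)
loss = cross_entropy(example_probs, example_labels)
```

Accuracy only counts whether the top prediction is right, while cross-entropy also penalizes low confidence in the correct author, which is why it is the more useful signal for gradient-based training.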
Innovations: Multi-channel multi-model CNN structure and committee model.
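One common way to build a committee is to average the class probabilities produced by several member models and pick the highest average (often called soft voting). The write-up does not state how the committee combines its members, so the sketch below is an assumption, and `committee_predict` is a hypothetical name.

```python
def committee_predict(member_probs):
    """Combine per-model probability lists for one sample by simple averaging.

    member_probs: list of lists, one probability distribution per member model.
    Returns (predicted_class_index, averaged_distribution).
    """
    num_members = len(member_probs)
    num_classes = len(member_probs[0])
    averaged = [
        sum(member[c] for member in member_probs) / num_members
        for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=averaged.__getitem__)
    return predicted, averaged

# Made-up outputs of three member models for one sample with two candidate authors.
label, averaged = committee_predict([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
```

Averaging lets a confident member outweigh an uncertain one; majority voting over each member's top prediction is an alternative when the members' probabilities are poorly calibrated.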
Reason for using deep learning to solve this problem: Since we are predicting information about the author from the book, our model needs to be good at extracting features from large amounts of raw text. The features of a book can be complex, and deep learning can capture complex patterns and relationships in the data, leading to better performance on unseen examples. Our dataset also contains a very large number of words, and deep learning algorithms scale effectively with increasing amounts of data, leveraging large datasets to improve performance and generalization.
Stakeholders:
- Literary Scholars and Historians: To analyze texts and attribute authorship, particularly for historical documents where the authorship may be disputed or unknown.
- Students and Educators: For educational purposes, such as studying writing styles, understanding genre evolution, or examining the body of work attributed to specific authors.
- Marketing Teams: In book marketing and sales, understanding genre trends can help target the right audience for promotional campaigns.
Consequences of mistakes:
- Misclassification: Mislabeling the genre of a text could lead to cultural misunderstandings or misrepresentation, particularly for texts that are culturally or historically significant.
- Educational Impact: Inaccurate information can lead to the dissemination of false knowledge, which can be particularly damaging in educational settings.
- Commercial Implications: For publishers and distributors, misclassifications can lead to poor decision-making in marketing, stock selection, and user recommendation systems.
Built With
- kaggle
- machine-learning
- python