Author Detective: Texts, Traits, and Tea Leaves

Ruize Ma posted an update — May 03, 2024 06:34 PM EDT

Introduction: This can be copied from the proposal. We will be predicting the genre and author of texts in this project. By predicting the author, we can analyze writing styles, linguistic patterns, and stylistic features unique to each author. This can provide insights into authorial characteristics, preferences, and changes in writing style over time. By doing so, we will be able to gain a deeper understanding of literature. Additionally, it can be used in educational settings for tasks such as identifying plagiarism, assessing writing proficiency, and creating personalized learning experiences based on author-specific content. This is a classification problem.

Challenges: What has been the hardest part of the project you’ve encountered so far?

Writing the encoder and decoder for the transformer model, and improving the model's runtime (it takes several hours to run).

Converting the Conv2d models to Conv1d and adding multiple channels (which meant recalculating the input shape for the linear layer), and preprocessing the data so that the n_gram inputs fit the model well.
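The linear-layer shape calculation mentioned above can be sketched in plain Python. This is only an illustration under assumptions the post does not confirm: that "channels" means parallel convolution branches with different kernel sizes, each followed by max pooling and then concatenated. The function names, kernel sizes, and pool size are hypothetical. The length formula is the one given in the PyTorch Conv1d and MaxPool1d documentation.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution or pooling layer
    (the formula from the PyTorch Conv1d / MaxPool1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def flattened_features(seq_len, kernel_sizes, out_channels, pool_size):
    """Number of input features the linear layer needs, assuming one
    conv branch per kernel size, each followed by max pooling with
    stride equal to pool_size, then concatenation of all branches."""
    total = 0
    for k in kernel_sizes:
        conv_len = conv1d_out_len(seq_len, k)
        pooled_len = conv1d_out_len(conv_len, pool_size, stride=pool_size)
        total += out_channels * pooled_len
    return total


# Hypothetical setup: 100 tokens, three branches, 16 filters each, pool size 2.
in_features = flattened_features(100, kernel_sizes=(3, 4, 5),
                                 out_channels=16, pool_size=2)
print(in_features)
```

Computing the size this way, rather than hard-coding it, keeps the linear layer in sync when the sequence length or the number of branches changes.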

As we are training CNN, RNN, GRU, LSTM, Bi-LSTM, and Transformer models, we find that it would be difficult to finish these computations without our current hardware setup. We ran into some difficulties setting up CUDA and cuDNN to speed up our training and tuning process. We still have trouble adjusting our functions and packages to match our hardware configuration.

Insights: Are there any concrete results you can show at this point?

The models’ accuracies:

CNN:
1_gram CNN with 3 channels: Loss: 1.3648, Accuracy: 58.71%
2_gram CNN with 3 channels: Loss: 1.6228, Accuracy: 51.17%
3_gram CNN with 3 channels: Loss: 1.7238, Accuracy: 48.35%
1_gram CNN with 4 channels: Loss: 1.1774, Accuracy: 64.01%

Transformer: Test Loss: 0.0864, Test Accuracy: 0.8543

LSTM: Validation statistics: Acc: 0.5232, Loss: 0.0331, F1 Score: 0.5122

GRU: Validation statistics: Acc: 0.8913, Loss: 0.0066, F1 Score: 0.8906

Bi_LSTM: Validation statistics: Acc: 0.5710, Loss: 0.0289, F1 Score: 0.5597

How is your model performing compared with expectations?

For the CNN model, I find that the 1_gram model performs better than the 2_gram and 3_gram models (in both training and testing accuracy), which is not what I expected.

It seems that LSTM and Bi-LSTM don't perform very well with our current settings, reaching validation accuracies of only 52% to 57%. The GRU-based model, on the other hand, reached a validation accuracy of nearly 90% (89.13%), the highest among our models. The Transformer model also reached an accuracy over 85%.

It is also worth mentioning that the Transformer model excels at predicting from short text. Our models are trained on a paragraph-based dataset, with each text around 3000 characters long. When experimenting with texts from outside our dataset, I find that GRU performs well only if the input text is long enough. The Transformer model works much better with shorter snippets of only 400 to 500 characters.

Plan: Are you on track with your project?

Yes

What do you need to dedicate more time to?

The innovation part. We are trying to find algorithms that can optimize the weights in the committee so that the combined model performs better than any single model.
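One simple way to picture the committee-weight idea is a grid search over weight vectors that sum to 1, scored on validation data. The post does not say which algorithm the team is considering, so this is only a hedged sketch: the function names and the toy probabilities below are made up for illustration, and the inputs are assumed to be per-model class probabilities on a shared validation set.

```python
from itertools import product


def combine(prob_sets, weights):
    """Weighted average of per-model class probabilities for each sample."""
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    return [
        [sum(w * probs[i][c] for w, probs in zip(weights, prob_sets))
         for c in range(n_classes)]
        for i in range(n_samples)
    ]


def accuracy(probs, labels):
    """Fraction of samples whose highest-probability class matches the label."""
    correct = sum(1 for row, y in zip(probs, labels) if row.index(max(row)) == y)
    return correct / len(labels)


def grid_search_weights(prob_sets, labels, steps=10):
    """Try every weight vector on a grid that sums to 1 and keep the one
    with the best accuracy on the given (validation) labels."""
    best_w, best_acc = None, -1.0
    for combo in product(range(steps + 1), repeat=len(prob_sets)):
        if sum(combo) != steps:
            continue
        w = [c / steps for c in combo]
        acc = accuracy(combine(prob_sets, w), labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Toy validation data (invented): two models, three samples, two classes.
labels = [0, 1, 0]
model_a = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]
model_b = [[0.45, 0.55], [0.55, 0.45], [0.9, 0.1]]
best_w, best_acc = grid_search_weights([model_a, model_b], labels)
print(best_w, best_acc)
```

The grid grows quickly with the number of models, so with six committee members a coarser grid or a gradient-based or stacking approach may be more practical. The weights should be chosen on validation data and the final committee reported on a held-out test set.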

What are you thinking of changing, if anything?

We will keep fine-tuning the hyperparameters. We are also thinking about restructuring our dataset. The dataset we have right now is paragraph based, and we are considering a combined dataset of both sentences and paragraphs, so that the model can learn from both sentence structure and paragraph structure. While that idea is perfectly feasible, it requires computational resources beyond our capabilities: it would take roughly 3 hours to train each model, not counting tuning.
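As a rough illustration of the combined dataset idea, the sketch below keeps each paragraph and adds its sentences as extra samples with the same author label. It is not the team's pipeline: the function names, the `min_chars` threshold, and the regex-based sentence splitter are assumptions, and a real splitter would need to handle abbreviations and quotations.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def build_combined_dataset(paragraph_samples, min_chars=20):
    """Return paragraph samples plus sentence-level samples.
    Each input item is a (text, author_label) pair; every sentence
    inherits the label of the paragraph it came from."""
    combined = []
    for text, label in paragraph_samples:
        combined.append((text, label))
        for sentence in split_sentences(text):
            # Skip fragments that are too short to carry any style signal.
            if len(sentence) >= min_chars and sentence != text:
                combined.append((sentence, label))
    return combined


# Invented sample for illustration.
samples = [("It was a dark night. The rain fell in torrents over the town!",
            "author_a")]
combined = build_combined_dataset(samples, min_chars=10)
print(len(combined))
```

If a dataset is built this way, the train/validation split should be made at the paragraph level before sentences are extracted, so that sentences from a validation paragraph never appear in the training set.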
