Who Let the Cat Out? Siamese Network Authorship Verification
(Demo in the video at 3:06)
Final Writeup | Poster | Github
Who
Alyssa Loo / aloo1
Brendan Ho / bho15
Introduction
In an internet age inundated with fodder content produced by small groups of writers for political or financial gain, it is important to be able to tell whether articles from apparently different sources are in reality written by the same people. Other applications of authorship verification include fraud and plagiarism detection. In addition to these depressing uses, authorship verification can also help attribute authorship to newly discovered texts and manuscripts. It is a binary classification task: given two text inputs, the model should classify whether they were written by the same author based on stylometric features. We attempted to re-implement a PAN 2020 Authorship Verification contest submission (link), as it uses deep learning techniques in a field otherwise dominated by machine-learning feature engineering. It achieves stylometric comparison via a Siamese network.
Related Work
- We found a paper that, instead of framing authorship identification as a binary classification problem, approaches it through unsupervised clustering (link).
- Here is a blogpost that describes the logic behind Siamese networks. Instead of cross-entropy loss, the model learns that the squared difference between feature vectors should be large when the authors differ and small when they are the same; this is known as contrastive loss (see the sketch below). Additionally, the weights and parameters are identical across the two sub-networks, and parameter updates are mirrored across both.
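For concreteness, here is a minimal sketch of that contrastive loss in TensorFlow. The margin value and the convention that a label of 1 means "same author" are our assumptions, not details from the blogpost:

```python
import tensorflow as tf

def contrastive_loss(y_true, distance, margin=1.0):
    """Contrastive loss over pairwise distances.

    y_true: 1 if the two texts share an author, 0 otherwise (our convention).
    distance: Euclidean distance between the two feature vectors.
    margin: how far apart different-author pairs are pushed (assumed value).
    """
    y_true = tf.cast(y_true, distance.dtype)
    same = y_true * tf.square(distance)                                    # pull same-author pairs together
    diff = (1.0 - y_true) * tf.square(tf.maximum(margin - distance, 0.0))  # push different-author pairs apart
    return tf.reduce_mean(same + diff)
```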
Data
We retrieved the data from the PAN 2020 organizers. The small dataset that we used contains 53,000 text pairs scraped from fanfiction.net.
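The pairs and their gold labels come in parallel JSON-lines files. A loading sketch, assuming the `"pair"` and `"same"` field names from the organizers' documented format, might look like this:

```python
import json

def load_pairs(pairs_path, truth_path):
    # Each line of the pairs file holds {"id": ..., "pair": [text_a, text_b]};
    # the matching line of the truth file holds {"id": ..., "same": true/false}.
    # Field names are assumptions based on the PAN 2020 task description.
    texts, labels = [], []
    with open(pairs_path) as fp, open(truth_path) as ft:
        for pair_line, truth_line in zip(fp, ft):
            pair = json.loads(pair_line)
            truth = json.loads(truth_line)
            texts.append(pair["pair"])          # [text_a, text_b]
            labels.append(int(truth["same"]))   # 1 = same author
    return texts, labels
```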
Methodology
We trained a Siamese network consisting of two sub-networks followed by a final dense layer. The two sub-networks share the same weights and are updated concurrently, but they are fed different text inputs. Each sub-network outputs a feature vector representing its input text, and the final layer contrasts the two vectors, encoding the heuristic that if the gold label says the two texts are by the same author, the feature vectors should be similar, and if the label says they are by different authors, the vectors should differ.
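As a rough illustration, here is a minimal Keras sketch of the shared-encoder setup. The layer sizes, sequence length, and character-level encoding are placeholder assumptions, not the exact architecture from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN = 512  # placeholder sequence length

def build_encoder(vocab_size=256, embed_dim=32):
    # Sub-network: embeds a character-encoded text and compresses it
    # into a fixed-length stylometric feature vector.
    inp = layers.Input(shape=(SEQ_LEN,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inp)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    feat = layers.Dense(64, activation="relu")(x)
    return Model(inp, feat, name="encoder")

# Instantiating the encoder once and reusing it on both inputs makes the
# weights shared, so updates are automatically mirrored across both branches.
encoder = build_encoder()
text_a = layers.Input(shape=(SEQ_LEN,), dtype="int32")
text_b = layers.Input(shape=(SEQ_LEN,), dtype="int32")
feat_a, feat_b = encoder(text_a), encoder(text_b)

# Contrast the two feature vectors; a sigmoid head turns their element-wise
# absolute difference into a same-author probability.
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([feat_a, feat_b])
same_author = layers.Dense(1, activation="sigmoid")(diff)

siamese = Model([text_a, text_b], same_author)
siamese.compile(optimizer="adam", loss="binary_crossentropy",
                metrics=["binary_accuracy"])
```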
Metrics
We used binary cross-entropy loss, binary accuracy, and F1 score.
Base: > 50% F1 Score
Target: > 60% F1 Score
Stretch: > 70% F1 Score
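With scikit-learn, checking accuracy and F1 against these goals reduces to a few lines. The labels and probabilities below are made-up placeholder values, not our results:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical example: gold same-author labels and model probabilities.
labels = np.array([1, 0, 1, 1, 0, 0])
probs = np.array([0.81, 0.34, 0.55, 0.72, 0.48, 0.12])

preds = (probs > 0.5).astype(int)  # threshold sigmoid outputs at 0.5
print("accuracy:", accuracy_score(labels, preds))
print("F1:", f1_score(labels, preds))
```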
Ethics
As with any deep learning model, ours comes with a boatload of ethical questions. A big one stems from the "black-box" nature of all DL models. When comparing two feature vectors, why can we say that their entries mean anything? How do we know which subsets of entries encode what? Interpretability is a problem for all deep learning models, but it is still a valid one to probe: should we be okay with not knowing what our model is basing its decisions on?
Another question is why deep learning is the correct approach for this particular problem. Models for authorship verification have typically been machine-learning based, but manually feature-engineering stylometrics can be difficult because the syntactic and lexical elements of an author's 'style' vary so widely in scope within a text (i.e., within a sentence, globally across the text, etc.). Deep learning allows us to capture latent stylometric signifiers that we may not be able to engineer on our own. Tying this back to the first problem, however, there are definitely tradeoffs.
There are also massive implications to solving the authorship verification task with deep learning: could optimizing these sorts of algorithms create a privacy issue? Whistleblower testimony, anonymous reports, and the like often require anonymity for the safety of the testifier; will we be compromising that safety precisely when anonymity matters most? And if we optimize authorship verification, will it be enough to also optimize authorship obfuscation?
Division of Labor
Acquiring and preprocessing training/testing data (bho15)
Model design, model architecture and demo building (aloo1)
Engineering a model (bho15)
Training and testing the model (aloo1)
Citation
Araujo-Pino, Emir, Helena Gómez-Adorno, and Gibran Fuentes Pineda. "Siamese Network applied to Authorship Verification." CLEF (Working Notes). 2020.
Built With
- cpu
- keras
- numpy
- python
- scikit-learn
- tensorflow
