Cracking Arbitrary Substitution Ciphers

Important Links

Updated Methodology and Results

Please refer to our final report for updated methodology and results.

Introduction

The substitution cipher is a traditional encryption technique that substitutes units of the plaintext into other units thereby forming the ciphertext. Each substitution cipher is defined by a key, which is essentially a mapping function specifying how to substitute each plaintext unit. One key weakness of substitution ciphers is that they do not hide the statistics of units in the ciphertext thus prone to a type of attack called frequency analysis. While frequency analysis could already crack substitution ciphers, we are interested in whether modern neural networks could also crack substitution ciphers with less hard-coded components.

Concretely, we aim to build a deep sequence model that could recover ciphertexts and produce the underlying plaintexts. We focus on substitution ciphers that operate on characters and numbers but impose no constraints on the keys (in contrast to Ceaser Cipher, which has extra constraints on the key space). We are motivated by the failure case presented in one of the past deep learning final projects, which achieves only 37% on recovering ciphertexts encrypted with arbitrary substitution ciphers. We believe the use of one-hot encoding ignores essential information about the statistics of units from the ciphertext and can be improved with better choices. We adopt the approach proposed in Aldarrab & May, 2021, which includes the unigram frequency information of the ciphertext into the embedding. Furthermore, we explore the possibility of enhancing the transformer encoder blocks with the use of 1-D CNNs as feature extractors.

Related Works

Can Sequence-to-Sequence Models Crack Substitution Ciphers?

Aldarrab & May employ an end-to-end multilingual model to solve the simple substitution on ciphers. The model is trained on multilingual data so that it can obtain end-to-end decipherment without relying on a separate language ID step. To generalize the model’s ability to recover from that noise by inserting, deleting, or substituting characters while generating plaintext, they used Sequence-to-sequence model architecture with frequency encoding.

Character-Level Translation with Self-attention

Gao et al. proposed an architecture, conv-transformer. They adapted each encoder block to include an additional subblock, which they applied to the input representations before applying self-attention. The sub-block consists of three 1D convolutional layers, aimed to resemble character-level interactions of different levels of granularity.

CipherBusters: Automatically Busting Classical Ciphers

In this past deep learning final project, Chung et al. investigated using modern deep learning sequence models (RNNs and Transformers) to crack classical ciphers including Ceaser, Vignere, and arbitrary substitution cipher. While they could solve Ceaser ciphers to high accuracy, their models fail to generalize to Vignere and arbitrary substitution ciphers.

Decipherment of Substitution Ciphers with Neural Language Models

Kambhatla et al. uses large pre-trained neural LMs to crack substitution cipher. They modified the beam-search algorithm and used global scoring of the plaintext message. Compared to the traditional n-gram models, this method results in much lower error rates in cracking challenging ciphers.

Word Translation without Parallel Data

This paper showed a method to build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.

Data

We use the wikitext-103-raw-v1 split from wikitext for our training, validation, and testing data. Wikitext is a language corpus extracted from selected Wikipedia articles. For preprocessing, we only kept letters (all switched to lowercase), numbers, and spaces, while removing all other characters and symbols. The following table introduces the sizes of our training, validation, and testing datasets.

	Number of Characters
Training	500,978,592
Validation	1,065,888
Testing	1,198,238

Upon training, we plan to dynamically apply our substitution cipher function on the raw text inputs, and encrypt them with different substitution keys.

Methodology

We plan to use an encoder-only transformer model. The sequences will be split into characters and fed into the model. We may not need to use a decoder because we are guaranteed that the output sequence length is the same as the input sequence length, as all ciphertexts are encrypted on a character level.

We are training our model on taking ciphertexts as inputs (with the character frequency encoded in the embedding) and output plaintexts as outputs. To encode the frequency of characters in the ciphertext in the embedding, we design our embedding matrix in the following way, following an approach proposed in a previous work: Instead of having a dictionary from a specific letter to its embedding, we associate an embedding with the most frequent character in the ciphertext, the second most frequent character, and so on. Thus, the model is learning to recover the ciphertext to plaintext based on a frequency-sorted ciphertext, which helps a lot in our case because the relative frequency of characters in English is relatively stable, and could make the recovering process much easier and generalizable.

The challenging part of implementing our model might be to choose an appropriate architecture. We will need to run a hyperparameter search for the structure of the architecture (i.e. model dimension, number of blocks, number of heads per attention head, etc.) to determine what is an optimal architecture for the task. If the encoder-only model does not work, we will explore alternative architectures such as the encoder-decoder models, as well as tree decoders.

Metrics

We create test and training ciphers from plain text of the wikitext dataset using methods we implemented. We will use the character-level accuracy as an evaluation metric. Under this metric, the accuracy will be the result of dividing the number of correct characters in the predicted sentence by the number of characters in the original plain-text sentence. Since there are many algorithms and architectures aiming to solve this problem, we plan to use performances from some algorithms as our bases. For our base, we plan to achieve at least the accuracy of the model from the group CypherBusters which is about 38% on the substitution cipher. Our second base is the accuracy from cracking ciphers based only on character frequencies in our dataset. We should at least surpass this second base. Our target is to get at least 70% accuracy. Our stretch goal is to get above 90% accuracy.

Ethical Concerns

Cracking substitution ciphers, and cipher cracking in general, can raise concerns about data security and privacy issues. Even though the substitution cipher has not been in serious use for a long time, we still need to consider its social implications. Specifically, there needs to be a discussion about regulations of DL-empowered deciphering, and how such technology should be used.

Deep learning is particularly a good approach because it is excellent in pattern recognition, and has great adaptability and scalability, which is lacking in traditional cryptanalysis methods. However, we do recognize that using deep learning to crack substitution ciphers must be treated with careful consideration of the ethical and social issues behind it.

Division of Labor

Yiyang: Implement and improve the model architecture. Evaluating the model performance, conducting accuracy analysis, and graphing and interpreting the results. Updating the project documentation and final report. Flip the burger.
Xiangxi: Develop the model architecture and implement Sequence-to-Sequence model setups, particularly in the word encryption and character frequency embedding sections. Experiment with different hyperparameters for testing. Fry the chicken.
Tianze: Develop the model architecture and implement Sequence-to-Sequence model setups, particularly in encoding, statistical analysis, and decoding mapping sections. Conduct accuracy analysis. Test and compare with other state-of-art models. Peel the potatoes.
Yihan: Handle the data preprocessing. Help with group discussion and guide with project methodologies. Implementing word encryption, character frequency, and embedding sections. Build and finetune the model architecture and parameters. Grill the corn.

Challenges

It is difficult for us to find the latent relationship between the characters. Simply using an attention encoder is not enough, and we can hardly enhance the model's performance above the baseline and the model translates the ciphertexts by their appearance frequency directly. So, we incorporated a convolutional layer into our model to handle deciphering tasks, following approaches proposed in previous works Character-Level Translation with Self-attention and Can Sequence-to-Sequence Models Crack Substitution Ciphers? We have used different channel sizes to capture dependencies from short to long ranges surrounding each character.

Insights

We have successfully implemented the model and trained a preliminary version of it. Our current model achieved a character-level accuracy of ~70%. We tested our model with raw texts collected from CS1470’s ed posts, random Wikipedia articles, and news articles, and we will be able to demonstrate more test samples in the coming weeks.

Our model performance is as our expectation. Our target set at the beginning of this project is 70%, and our model now has accuracy about 70% and even above it. Our next goal would be achieving the 90% accuracy.

Plan

We are on track with the overall timeline. We want to conduct more analyses (especially with ciphertexts of different window sizes) on the outputs that our trained models produce and see if we can add new components to make it work better. We used to train our model with an additional masking objective, in accordance with the RoBERTa training objective (however, with a 5% probability for masking). However we found that the nature of our task is somewhat different from the masked language modeling task (since deciphering itself already requires the model to take contextual information into the calculation), thus we are planning to remove this MLM objective and replace it with a training procedure without introducing masking tokens. We are also interested in addressing the consistency issues in the model. Our current model decodes identical characters to different characters, which can be an issue in the deciphering process. We plan to incorporate an extra loss function that penalizes inconsistent deciphering into our current model, in hoping to resolve the inconsistency issue.

Reflections

Our project ultimately turned out as we expected. We even reached our stretch goals in terms of the accuracy of our final model. Originally, we did not implement the CNN-enhanced transformer encoder blocks and the performance was not as good as expected; but after doing a literature review and playing around with the model architecture, we managed to make it perform better. Our execution of the research plan was in accordance with our original plan - and if we were to do this project again we would have faced the same challenges and come up with the same breakthroughs as we did this time.

If we have more time, we wanted to try out including more characters to the substitution cipher (currently it includes 26 lowercase letters, 10 numbers, and a space character). We also thought about removing the space character or applying substitution ciphers at a subword token level (instead of at the character level).

Our main takeaway is to be open-minded about changing the plan - as we progress gradually through the project, we have to make adjustments and architecture changes in order to improve model performance. If we had stuck with the traditional transformer architecture (and only scaled it with more blocks or with a larger hidden size), we wouldn’t have achieved our final results.