What it does

Encrypting a message usually produces a ciphertext that is incomprehensible. That is not a technical problem, but all the special characters look odd, so anyone who sees it in something like an email can easily guess that an encrypted message is being transmitted.

NLS takes the ciphertext and outputs a natural-language text that contains enough information to reconstruct the ciphertext.

This way, the message is securely encrypted (our particular implementation uses AES-128), and because the output looks natural, it won't be obvious that one is transmitting an encrypted message.

How I built it

NLS is composed of four pieces:

1) Data: We used a previously downloaded copy of English Wikipedia as the base and, after cleaning it up, partially parsed it down to the sentence level. We also had or generated a list of frequently used whole words, along with the letters that most often come at the start of words or sentences.

2) Lower-level encryption: We used PyCrypto for AES-128 in ECB mode, with a block size of 16 bytes.

3) Numerical: We changed the base of the ciphertext (to 10 by default, but it can be any base up to 26) and made a new number system that represents digits with letters of the alphabet, chosen to match our letter data. So eventually a ciphertext is turned into a decimal number whose digits are the 10 letters most often used to start sentences and words.

4) Text generation: We used our Wikipedia data to train a hidden Markov model to generate text. Specifically, the generated text must consist of sentences whose starting letters are, in order, the letters of our base-changed ciphertext (derived in step (3)).
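Steps (3) and (4) can be sketched roughly as below. This is a toy illustration, not the project's code: the digit alphabet `LETTERS`, the ten-sentence `CORPUS`, and every function name are made up for the example, and a plain first-order word-level Markov chain stands in for the trained model described above.

```python
import random
from collections import defaultdict

# Hypothetical digit alphabet: digit i is written as LETTERS[i].
# The real project picks its letters from Wikipedia frequency data.
LETTERS = "TASIHCBMWP"

# Tiny stand-in corpus with one sentence per letter (assumption for the demo).
CORPUS = [
    "The river runs through the old town",
    "A small bridge crosses the river",
    "Several museums line the main street",
    "In summer the town hosts a festival",
    "Historians date the bridge to the old kingdom",
    "Critics praised the festival",
    "Both museums close in winter",
    "Many visitors walk through the main street",
    "Winter brings snow to the town",
    "People gather near the bridge",
]


def bytes_to_letters(ciphertext, letters=LETTERS):
    """Read the bytes as one big integer and write it in base len(letters)."""
    base = len(letters)
    number = int.from_bytes(ciphertext, "big")
    digits = []
    while True:
        number, digit = divmod(number, base)
        digits.append(letters[digit])
        if number == 0:
            break
    return "".join(reversed(digits))


def build_chain(sentences):
    """Build word -> next-words transitions and letter -> opening-words tables."""
    chain = defaultdict(list)
    starts = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        starts[words[0][0].upper()].append(words[0])
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain, starts


def generate_sentence(letter, chain, starts, rng, max_words=12):
    """Random walk through the chain from a word that begins with `letter`."""
    word = rng.choice(starts[letter])
    words = [word]
    while word in chain and len(words) < max_words:
        word = rng.choice(chain[word])
        words.append(word)
    return " ".join(words) + "."


def encode_as_text(ciphertext, sentences=CORPUS, seed=None):
    """Emit one generated sentence per letter of the base-changed ciphertext."""
    chain, starts = build_chain(sentences)
    rng = random.Random(seed)
    letters = bytes_to_letters(ciphertext)
    return " ".join(generate_sentence(l, chain, starts, rng) for l in letters)
```

Calling `encode_as_text(ciphertext)` on the AES output would yield one sentence per decimal digit, so a 16-byte block turns into up to 39 sentences in base 10. A larger base shortens the text at the cost of needing good sentence openers for more letters.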

These steps would be reversed for decoding/decryption.
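A minimal sketch of the reverse direction, under the same assumptions as any encoder it is paired with: `LETTERS` is a hypothetical digit alphabet, the function names are illustrative, and the sentence splitter is deliberately naive (it would be confused by abbreviations such as "e.g."). Because converting bytes to an integer drops leading zero bytes, the sketch pads the result back up to a whole number of 16-byte AES blocks.

```python
import re

# Hypothetical digit alphabet; it must match the one used for encoding.
LETTERS = "TASIHCBMWP"


def first_letters(text):
    """Collect the first letter of every sentence, in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "".join(s[0].upper() for s in sentences if s)


def letters_to_bytes(letter_string, letters=LETTERS, block_size=16):
    """Turn the letter digits back into an integer, then into padded bytes."""
    base = len(letters)
    number = 0
    for ch in letter_string:
        number = number * base + letters.index(ch)
    length = max(1, (number.bit_length() + 7) // 8)
    # Round up to whole cipher blocks to restore dropped leading zero bytes.
    length = -(-length // block_size) * block_size
    return number.to_bytes(length, "big")


def decode_text(text, block_size=16):
    """Natural-looking text -> ciphertext bytes, ready for AES decryption."""
    return letters_to_bytes(first_letters(text), block_size=block_size)
```

The bytes returned by `decode_text` would then go through AES decryption with the shared key to recover the original message.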

Challenges I ran into

1) Data wrangling. Converting a 15 GB text file to another format and parsing it into sentences is not pretty.

2) Some weird programming problems, e.g. a defective clipboard buffer on my system that changed "some" texts when I pasted them, but read them fine when I put them in a text file and read the file instead!

3) To avoid having the model relearn everything each time it is executed, we needed to export its state. The export functionality of the Markov library we used was slow, required too much memory, and didn't produce the form we needed. So instead of using it, we wrote the in-memory object to a binary file on the hard drive and reread it whenever we needed it, which turned out to be faster.

4) We tried a recurrent neural net at first, but it was simply much slower than the Markov model at producing anything comparably natural-looking.
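The caching workaround from challenge (3) might look something like the sketch below. The writeup only says the object was written to a binary file, so the use of Python's `pickle` module here is an assumption, and `load_or_train` and its parameters are hypothetical names.

```python
import pickle
from pathlib import Path


def load_or_train(cache_path, train):
    """Return the cached model if present; otherwise train once and cache it.

    `train` is any zero-argument callable that builds the model object.
    """
    path = Path(cache_path)
    if path.exists():
        # Fast path: reread the previously dumped object from disk.
        with path.open("rb") as handle:
            return pickle.load(handle)
    model = train()
    with path.open("wb") as handle:
        pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return model
```

Note that unpickling executes code embedded in the file, so a cache like this should only be loaded from a location you control.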

What I learned

This was my first time dealing with data wrangling, Markov models, and neural nets, and messing around with memory in Python.

What's next for NLS

I'll publish it as open source in the next couple of days. It has some bugs that show up when one directly changes the threshold of allowable letters and their word-production frequency, which I should fix later. The encryption implementation is not ideal. It wouldn't hurt for its interface to be pretty, either!
