I have always been inspired by procedural generation and machine learning, and I have always thought that natural language is a good platform for both, so this is my attempt at a sort of Turing test.

My hope was that by using a set of data which featured a healthy amount of "dubious" records, I might be able to smokescreen some of the distinctively non-human artifacts that come with generating natural language.

What it does

The first part of the project trains the RNN models and uses them to generate the text. The second part is a Django web application that presents a random record and asks the user whether they think it was written by a human or generated by a bot.
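The core of the second part can be sketched in plain Python, independent of Django. This is a minimal illustration rather than the project's actual code: the `Record` class, the `pick_record` and `check_guess` helpers, and the two sample texts are all hypothetical placeholders.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a stored sighting record."""
    text: str
    is_generated: bool  # True if produced by the RNN, False if a real report


def pick_record(records, rng=random):
    """Return one record chosen uniformly at random."""
    return rng.choice(records)


def check_guess(record, guessed_generated):
    """Return True when the user's guess matches the record's true origin."""
    return record.is_generated == guessed_generated


if __name__ == "__main__":
    # Placeholder texts, not taken from the real data set.
    sample = [
        Record("Bright light hovering over the lake for ten minutes.", False),
        Record("Three orange lights moving slowly in a triangle.", True),
    ]
    record = pick_record(sample)
    print(record.text)
    print("Correct!" if check_guess(record, guessed_generated=True) else "Wrong.")
```

In a web application, a view would presumably do the picking and a form submission would carry the guess; the sketch only shows the decision logic.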

How I built it

I used Python, both in scripts and interactively (with IPython and the Django shell), to shape my raw data into a set that is ready for training. I set up an environment for a library called TextGenRNN to produce fake UFO sighting records, which, along with a set of real records, are presented in a lightweight Django application.
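As a rough idea of what shaping the raw data can involve, here is a small sketch that normalizes whitespace, drops very short entries, and removes exact duplicates so that each record ends up as one line of text. The function names and the `min_chars` threshold are assumptions made for illustration, not the steps actually used in the project.

```python
import re


def clean_description(raw):
    """Collapse runs of whitespace (including newlines) and strip the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def build_training_lines(raw_records, min_chars=20):
    """Return cleaned, de-duplicated records, one string per record.

    min_chars is a hypothetical cutoff for discarding near-empty entries.
    """
    seen = set()
    lines = []
    for raw in raw_records:
        text = clean_description(raw)
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        lines.append(text)
    return lines
```

Writing the result out with one record per line gives a simple file layout that a line-oriented text trainer can consume.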

Challenges I ran into

Since I was working alone, a lot of the challenge came from the ground I needed to cover in the short period of time I had to work with. I coped by relying heavily on multitasking and by being very selective about which of the needed parts could become reusable and the order in which the project needed to be assembled.

Training an RNN to do anything in 24 hours can be a challenge, if only because of the sheer amount of time spent waiting for the training and testing processes to complete. An essential part of making the deadline was using faster runs of the training algorithm to work out the kinks early on. From there I just kept iterating in the direction of positive results.
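One common way to get faster runs is to train on a small random slice of the data first. The sketch below shows that idea only; whether the original runs were shortened this way or by other means (fewer epochs, a smaller model) is not stated, and `quick_subset` and its defaults are hypothetical.

```python
import random


def quick_subset(lines, fraction=0.1, seed=0):
    """Return a reproducible random sample of the training lines.

    fraction and seed are illustrative defaults, not values from the project.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not lines:
        return []
    rng = random.Random(seed)  # fixed seed so quick runs are comparable
    k = max(1, int(len(lines) * fraction))
    return rng.sample(lines, k)
```

Keeping the seed fixed means two quick runs with different parameters see the same data, which makes their results easier to compare.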

Accomplishments that I'm proud of

I am proud that I did not give up and was able to submit something after my first 3 project ideas failed during the first day of the event.

I am always proud whenever I have the chance to take technology and use it to feed my curiosity, but what makes me most proud is when my creations evoke that same curiosity in others.

What I learned

I learned a lot about RNNs and LSTMs by spending so much time using the TextGenRNN library. I understood a bit about RNNs and how they work before I started this project, but now I have a greater appreciation for the different ways one can use a trained model. After only a few iterations of testing, I started to develop a sense of how the parameters of the training algorithm would affect the final results.

I also learned how to selectively trade off application features and complexity so that I could deliver a working prototype in such a short amount of time. I am a fan of test-driven development, but for a project like this there is just not enough time for the TDD-goat. Picking a good design for the views and templates I needed for a skeleton prototype was essential in reducing the amount of Django code that I needed to write.

What's next for Inthesky

I think the next biggest step in terms of functionality is to parameterize the length of the generated chunks. As it stands, the fake records are easily discernible due to a number of identifiable characteristics, but perhaps the most glaring is the relatively static size of the generated text.
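One possible approach, sketched here as an assumption rather than a plan from the project, is to draw a target length from the lengths of the real records and trim each generated text to it at a word boundary. Both helper names are hypothetical.

```python
import random


def sample_target_length(real_texts, rng=random):
    """Draw a target length (in characters) from the real records' lengths."""
    return len(rng.choice(real_texts))


def trim_to_length(text, max_chars):
    """Cut text to at most max_chars, backing up to the last complete word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the cut landed in the middle of a word, drop the partial word.
    if not text[max_chars].isspace() and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip()
```

Because the target comes from the real data, the generated records would follow roughly the same length distribution as the human ones, at least for generated texts that are long enough to be trimmed.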

Visually, the web application could use a fuller set of features, along with a more robust system for gathering and analyzing results. Done properly, these would yield the data needed to inform the design of the next generation.
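As a starting point for analyzing results, the fraction of correct guesses could be computed separately for real and generated records. This is a minimal sketch that assumes each stored guess can be reduced to a pair of booleans; `summarize_guesses` and that data shape are hypothetical.

```python
from collections import Counter


def summarize_guesses(guesses):
    """Return the fraction of correct guesses per record source.

    guesses: iterable of (is_generated, guessed_generated) boolean pairs.
    """
    totals = Counter()
    correct = Counter()
    for is_generated, guessed_generated in guesses:
        source = "generated" if is_generated else "real"
        totals[source] += 1
        if is_generated == guessed_generated:
            correct[source] += 1
    return {source: correct[source] / totals[source] for source in totals}
```

A low score on the "generated" side would mean the fake records are fooling people, which is the number the next round of model changes would try to push down.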

Another possibility would be to set up the web application to generate records in real time as a user requests them, instead of serving static records from the database.
