Authorific: machine learning with dictators
Text data is rich with distinctive vocabulary and conceptual signals. These signals can be used for predictive tasks, such as classifying raw text into categories.
In theory, we should be able to analyse a piece of text and make quantitatively backed statements such as "this looks like an Adele song", or "seems like something Hitler would write", or "sounds like that one Taylor Swift song with the mercenary ladies".
If we can generate meaningful features from the text, then even unstructured, messy text data can be treated as just another type of input to a machine learning algorithm. Here, we use a two-pronged approach, combining deep learning and support vector machines (SVMs), to extract concept features and vocabulary from raw text.
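To make the idea of "features from text" concrete, here is a minimal sketch of one common way to turn raw text into vocabulary features: TF-IDF weighting, written with only the Python standard library. This is an assumption for illustration, not a description of our actual pipeline; the helper names (`tokenize`, `fit_idf`, `tfidf_vector`) and the toy corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every word seen."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word at most once per document.
        doc_freq.update(set(tokenize(doc)))
    return {
        word: math.log((1 + n_docs) / (1 + df)) + 1.0
        for word, df in doc_freq.items()
    }


def tfidf_vector(text, idf):
    """Map a text to a sparse {word: weight} vector with unit L2 norm."""
    # Words that never appeared in the fitted corpus are dropped.
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {word: count * idf[word] for word, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {word: v / norm for word, v in vec.items()} if norm else {}


# Toy corpus, invented for illustration only (not the project's data).
corpus = [
    "it was the best of times it was the worst of times",
    "hello from the other side",
    "we shall fight on the beaches",
]
idf = fit_idf(corpus)
features = tfidf_vector("the best of times", idf)
```

Words that appear in every document (like "the") get a lower weight than words that are distinctive to one document (like "best"). Vectors of this kind are the sort of numeric input that a classifier such as an SVM can be trained on.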
We train the models on the training set and achieve very high accuracy on the test set (over 95% for the deep neural nets and over 90% for the SVMs). Our four overarching categories are music, literature, Taylor Swift lyrics, and speeches. Within these, we have a number of artists. The models capture the specific style of individual authors (e.g. the writing style of Charles Dickens vs. Dostoevsky).
We are all interested in data science and machine learning, so we decided to build something in this area. The categories were chosen simply for fun! :)
Although we got excellent results from the models, we ran out of time when trying to integrate everything into the web app on Amazon Web Services (we will demo the site on the laptop; link below). A simple version works, but we did not get around to extras such as displaying a picture of the predicted category for the input text. This could be improved in the future.
Things we learned
So much! I did a lot more bash scripting than I normally do, and learned a lot about text processing and saving model objects (import pickle, anyone?). Max, in charge of the web app, learned a lot while trying to integrate our trained models with the online interface. Andrey got more practice running deep neural networks, and Alexey did plenty of difficult data processing.