hackLanguage

Create a Native Language Classifier to identify the "native tongue" of an author from a sample of their writing in English, leveraging Google Compute Engine processing power to train both a baseline Support Vector Machine and a novel Deep Learning approach for multimodal classification.

The Data:

Data scraped from English language Wikimedia comments, where users identify their native language in their user account, as well as their proficiency in English. The data set contains 20 different native languages with nearly 600K comments in total. nltk was used to tokenise the data prior to model classification.

The Model(s):

Two possible classifiers were built and trained:

A Linear Support Vector Machine (SVM) leveraging a set of bespoke natural language features (1 to 3-gram token language models) trained on each language corpus
A Long-Short Term Memory Recurrent Neural Network (LSTM) approach, using Google TensorFlow. The autoencoder uses a sequence-to-sequence methodology but where the output sequence is a label identifying the language for the input text.

Given the time limitations for constructing and training the models, only three "native" languages were trained for both models - English, Russian and Spanish for the LSTM and French, Russian and Spanish for the SVM.

The LSTM network gives good perplexity scores when evaluated during training against a separate test set (the system is still training...) however live testing has proved impossible to implement due to an issue with the decoder. As such the SVM has been implemented as the final model for today. (The LSTM is still damn cool, using a machine translation framework and treating classifier labels as effectively another "language" to be learned from the input representations; an entirely novel approach.)

The Output:

A web app built with Django, capable of taking in user text strings and then classifying the user's "native" tongue (provided they are French, Russian, or Spanish). The classifier has a 50% accuracy on our test data - starte-of-the-art approaches calibrated over many, many months reach maybe 75%.

The Future:

Given time to train, refine code and fix pesky decoding bugs, a model capable of classifying up to 20 native languages from English input text alone.

The Wikimedia comments data also includes user estimates of their own competence level in English. Expansion of the model could allow for multi-task learning not just of the native tongue, but the relative proficiency of the author compared to the self-evaluations of their peers.