Inspiration

We both have a passion for Natural Language Processing and believe it is an emerging field of importance. With NLP being used in healthcare (text classification), advertising (sentiment analysis), and much more, we wanted to build a model that can detect a language given a sample of words from that language. This is a classification problem: assigning inputs (words) to a language.

Related Work: https://nanonets.com/blog/deep-learning-ocr/

This article discusses solving Optical Character Recognition (OCR) for text recognition in natural scene images, using both unstructured and structured data. After preprocessing, it performs text detection with EAST (Efficient and Accurate Scene Text detector) and trains its recognition model with CRNNs (Convolutional Recurrent Neural Networks). We will be using their model as the starting point for ours.

Data:

We will be using some of the data from NanoNets (https://nanonets.com/blog/deep-learning-ocr/) and, if possible, other sources. The dataset will be fairly large, and we will need to preprocess it to structure the data properly.

Methodology:

We will implement our model similarly to the one described by NanoNets (https://nanonets.com/blog/deep-learning-ocr/). We will use an RNN, but instead of transcribing text, ours will solve a classification problem: predicting which language a sample belongs to.
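To make the classification setup concrete, here is a minimal sketch of a character-level RNN classifier in pure Python. The vocabulary, hidden size, language list, and weights are all hypothetical placeholders (a real model would learn the weights during training); the point is only the shape of the computation: each character updates a hidden state, and the final state is mapped to per-language scores and softmaxed into probabilities.

```python
import math

VOCAB = "abcdefghijklmnopqrstuvwxyz"
HIDDEN = 4
LANGS = ["english", "spanish"]  # placeholder language set

def one_hot(ch):
    """Encode a character as a one-hot vector over VOCAB (zeros if unknown)."""
    v = [0.0] * len(VOCAB)
    i = VOCAB.find(ch)
    if i >= 0:
        v[i] = 1.0
    return v

# Toy deterministic weights stand in for trained parameters.
W_xh = [[math.sin(i * len(VOCAB) + j) for j in range(len(VOCAB))] for i in range(HIDDEN)]
W_hh = [[math.cos(i * HIDDEN + j) for j in range(HIDDEN)] for i in range(HIDDEN)]
W_hy = [[math.sin(i * HIDDEN + j + 1) for j in range(HIDDEN)] for i in range(len(LANGS))]

def step(h, x):
    """One recurrence step: h' = tanh(W_xh x + W_hh h)."""
    return [math.tanh(sum(W_xh[i][j] * x[j] for j in range(len(VOCAB)))
                      + sum(W_hh[i][k] * h[k] for k in range(HIDDEN)))
            for i in range(HIDDEN)]

def classify(word):
    """Run the RNN over the word and softmax the final state into language probabilities."""
    h = [0.0] * HIDDEN
    for ch in word.lower():
        h = step(h, one_hot(ch))
    scores = [sum(W_hy[i][k] * h[k] for k in range(HIDDEN)) for i in range(len(LANGS))]
    exps = [math.exp(s - max(scores)) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return {lang: e / total for lang, e in zip(LANGS, exps)}

probs = classify("hello")
```

With random weights the predicted probabilities are meaningless; training would adjust `W_xh`, `W_hh`, and `W_hy` so that labeled words push probability mass toward their true language.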

Metrics: We will use accuracy as our main metric, aiming for at least 15 percentage points above random guessing (100% / number of languages tested). We plan to start with 2 languages and add more as we improve our model.
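The baseline arithmetic above can be made explicit; a quick sketch of the target as the number of languages grows (the 15-point margin is the figure from our metric):

```python
def target_accuracy(num_languages, margin=15.0):
    """Random-guessing baseline for an n-way classifier, plus our margin, in percent."""
    baseline = 100.0 / num_languages  # expected accuracy of uniform random guessing
    return baseline + margin

print(target_accuracy(2))  # 2 languages: 50% baseline -> 65.0 target
print(target_accuracy(4))  # 4 languages: 25% baseline -> 40.0 target
```

Note that the bar drops as languages are added: beating random guessing by 15 points is easiest with many classes, so absolute accuracy is still worth reporting alongside this relative target.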

Ethics:

Deep Learning is a good approach to this problem because the model can be trained to learn distinguishing features of a language's words and characters. Given labeled examples against which to judge itself, the model can learn over time which features tie certain words to their respective languages.

We plan to measure success by how accurately our model detects the correct language according to the labels. Our results will indicate how well the model can pick up distinguishing features across different languages.

Checkin 2:

The hardest part of this project so far has been finding a proper dataset to train our model on. Since we are solving a classification problem, we originally wanted to use a dataset of handwritten languages. However, such datasets proved very difficult to find, and even then they would require a lot of preprocessing to structure the data. We therefore decided to use a plain-text dataset, which is much easier to structure.
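Since we have not settled on a dataset yet, the exact preprocessing is still open; as a rough sketch, something like the following could turn raw text lines into (word, language) training pairs. The function name and the assumption of an ASCII-letter alphabet are ours, not tied to any particular dataset; languages with accented characters would need a broader character class.

```python
import re

def preprocess(lines, language):
    """Normalize raw text lines into (word, language) training pairs."""
    pairs = []
    for line in lines:
        # Lowercase and drop anything that isn't an ASCII letter or whitespace.
        cleaned = re.sub(r"[^a-z\s]", "", line.lower())
        for word in cleaned.split():
            pairs.append((word, language))
    return pairs

pairs = preprocess(["Hello, World!", "Deep learning."], "english")
# -> [('hello', 'english'), ('world', 'english'), ('deep', 'english'), ('learning', 'english')]
```

Stripping non-letters this aggressively is a simplification that suits English-like text; for the multilingual case we would likely keep Unicode letters instead.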

There are no concrete results to show at this point, but we expect to have much of the work done next week. Since the hardest parts of our project were choosing the right architecture and dataset, we believe preprocessing and building the model should go more smoothly now that we have a good understanding of how we are creating it.

We believe we are on track with our project. We have met frequently with our mentor and decided that settling the design of our model and finding a good dataset were the most important steps, since they make the rest of the project easier. We now need to dedicate most of our time to preprocessing and building the model. If we stay on schedule, we plan to add more than 2-3 languages to see how accurate our model remains.

Final Writeup : https://docs.google.com/document/d/1HWBXW1SKatn8nohcDXKhs3NRWNDOb_FJp_ZM-zq4NT0/edit
