I always wished for something like iNLTK to exist: a library that would make NLP more accessible and bring its benefits to non-English speakers. iNLTK grew organically. My vision is for it to be the go-to library for anyone working with low-resource languages.
What it does
The iNLTK library provides out-of-the-box support for various NLP tasks in 12 low-resource Indic languages. The library has 30,000+ downloads and 450+ stars on GitHub.
How I built it
• iNLTK provides data augmentation, sentence similarity, sentence encoding, word embedding, tokenization and text generation utilities for 12 low-resource Indic languages.
• The library is backed by ULMFiT language models that I trained using the fastai and PyTorch libraries, achieving state-of-the-art language-model perplexity and classification accuracy in 12 Indic languages.
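As a rough illustration of two quantities mentioned above: perplexity is the exponential of the average negative log-probability a language model assigns to each token, and cosine similarity is one common way to compare two sentence encodings. The sketch below uses only the Python standard library. It is not iNLTK's code; the function names `perplexity` and `cosine_similarity` and the toy inputs are assumptions made for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence, given the probability the model
    assigned to each actual token (hypothetical helper)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Mean negative log-likelihood per token, then exponentiate.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors
    (hypothetical helper for comparing sentence encodings)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy inputs: a model that gives every token probability 0.25 behaves
# like a uniform choice among four tokens, so its perplexity is about 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# Two made-up "encodings" pointing in the same direction score about 1.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Lower perplexity means the model is less "surprised" by held-out text, which is why it is a natural yardstick for comparing language models trained on the same corpus.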
Challenges I ran into
Dataset availability was one of the biggest challenges I faced. To overcome it, I ended up creating 20+ datasets for low-resource Indic languages, which are available on Kaggle for anyone to use and benchmark against.
Accomplishments that I'm proud of
I'm proud of the recognition iNLTK has received:
• Mentioned on Twitter by Jeremy Howard and Sebastian Ruder
• Shared widely by the community on LinkedIn
• A data augmentation post about iNLTK was trending on LinkedIn
• iNLTK was trending on GitHub in May 2019
• Shared by the community on Reddit, Facebook, Quora and elsewhere
What I learned
iNLTK was my first repository to pass 200 stars on GitHub. While building it, in addition to learning the fundamentals of NLP and deep learning, I learned about API design and how to make a library as user-friendly as possible. I also learned a great deal about the tools and frameworks I was using, such as PyTorch and fastai.
What's next for Natural Language Toolkit for Indic Languages (iNLTK)
The plan is to add named entity recognition (NER) and part-of-speech (POS) tagging to iNLTK for low-resource Indic languages.