Third-year project and dissertation. A number of part-of-speech (POS) taggers work well for a handful of the world's languages, but a large number of languages lack the language resources these taggers rely on. This project makes a few state-of-the-art POS taggers work collaboratively to annotate text in a poorly resourced language (in this case Hindi, the fourth-most spoken language in the world) with linguistic information. The initial input is a small sample of manually annotated data, which is used by all of the taggers. Text is fed into the available taggers, and the more probable or agreed-upon output is accepted by vote. The voted output is evaluated, and we attempt to build a larger corpus from it by bootstrapping. In addition, a POS tagger based on an artificial neural network (ANN) is built, tested on the available corpus, and compared against the state-of-the-art taggers. The ANN tagger is also expected to work on other languages, irrespective of their structure.
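
The voting step described above can be illustrated with a minimal sketch. It assumes simple per-token majority voting over the tag sequences produced by each tagger; the function names (`vote_tags`, `accept_sentence`), the `min_agreement` threshold, and the example tags are hypothetical and are not taken from the project's actual code.

```python
from collections import Counter


def vote_tags(predictions, min_agreement=2):
    """Per-token majority vote over several taggers' outputs.

    predictions: list of tag sequences, one per tagger, all covering
    the same tokens. Returns one tag per token, or None where fewer
    than `min_agreement` taggers agree (assumed threshold).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must tag the same tokens")
    voted = []
    for token_tags in zip(*predictions):
        tag, count = Counter(token_tags).most_common(1)[0]
        voted.append(tag if count >= min_agreement else None)
    return voted


def accept_sentence(voted):
    """A sentence joins the bootstrapped corpus only if every token has a voted tag."""
    return all(tag is not None for tag in voted)


# Hypothetical outputs of three taggers for a three-token sentence.
tagger_a = ["NNP", "NN", "VM"]
tagger_b = ["NNP", "NN", "VM"]
tagger_c = ["NN", "NN", "VAUX"]

voted = vote_tags([tagger_a, tagger_b, tagger_c])
print(voted, accept_sentence(voted))
```

In a bootstrapping loop of this kind, sentences that pass `accept_sentence` would be added to the training data before the taggers are retrained; sentences with unresolved tokens would be left for a later round or for manual annotation.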

