Multi Linguistic Document Classifier

Document Classifier

Inspiration

Document classification has many possible uses. A non-exhaustive list of what it can do:

Classification of file types, document types, document languages, countries of origin, merchants, line items, urgency, privacy-sensitive data, etc.

What it does

It classifies a given set of documents.

How we built it

We used PyPDF for text extraction, Doc2Vec for feature extraction, and XGBoost for classification. FastAPI serves the model for inference.

Challenges we ran into

We had trouble generating the tagged documents that Doc2Vec needs for training.
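For readers who hit the same problem, here is one possible way to build tagged documents, assuming gensim's TaggedDocument, which is a named tuple with a words field and a tags field. The to_tagged_docs helper, the regex tokenizer, and the sample texts are hypothetical, and the fallback definition exists only so the sketch runs where gensim is not installed.

```python
import re

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, only so the sketch runs without gensim.
    from collections import namedtuple
    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def to_tagged_docs(texts):
    """Turn raw strings into tagged documents, one unique integer tag each."""
    tagged = []
    for doc_id, text in enumerate(texts):
        # \w+ matches Unicode word characters, so non-English text is tokenized too.
        words = re.findall(r"\w+", text.lower())
        tagged.append(TaggedDocument(words=words, tags=[doc_id]))
    return tagged


# Hypothetical sample texts, not taken from the project's data.
samples = ["Invoice total: 120 EUR", "Rechnung Gesamtbetrag 120 EUR"]
tagged_docs = to_tagged_docs(samples)
print(tagged_docs[0].words, tagged_docs[0].tags)
```

The two details that matter are that words is a list of tokens rather than a raw string, and that tags is a list, even when each document has a single tag.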

Accomplishments that we're proud of

We extended this model to the next level: a basic implementation of the Multi Linguistic classifier is done.

What we learned

NLP techniques and the complete flow of an NLP project.

What's next for Multi Linguistic Document Classifier

So far, only a very basic model for the Multi Linguistic classifier is implemented.

Built With

anaconda
doc2vec
fastapi
gensim
python
xgboost

Updates

Dhanesh Dhanapalan started this project — Dec 18, 2021 11:32 AM EST
