Inspiration

The field of intelligent document extraction is a rapidly growing field with huge potential to create disruptions in various document processing industries from healthcare (insurance forms), finance (tax forms, invoices), consumer goods (receipts), etc. Unfortunately, because of the unique complexity of the problem and limited open-source data available, very little progress has been made in the open space. By building a modular, extensible end-to-end tool in PyTorch with a dedicated web interface, we hope to enable easier and faster adoption of the technology to both researchers and practitioners alike!

What it does

Very simply put, it "takes in a document, runs OCR (Optical Character Recognition) and uses those OCR results to either train new NER (Named Entity Recognizer) models or predict using existing pretrained ones for forms and receipts data".

How we built it

For OCR component, we took the best pretrained checkpoints, froze and optimized it, created a torchserve workflow to stitch together detection and recognition model and finally served it on the cloud in a serverless manner. For the Extraction part, we fine-tuned custom models using huggingface library, built an end-to-end training pipeline .

Challenges we ran into

  1. Creating workflow in torchserve was particularly challenging, as OCR had three separate models stitched together in a complex serial+parallel manner, each component having their own custom preprocessing and postprocessing handlers.
  2. Also for deployment to Vertex AI platform on GCP,we had to modify the torchserve’s source-code to make it compatible with Vertex AI deployment conditions.

Accomplishments that we're proud of

  1. Serving in the most efficient way possible: We froze the model-weights to torchscript format, optimized it for cpu deployment and served in a dedicated model-server by creating a workflow in torchserve. Given torchserve is a new space, we are proud to make some contribution there.
  2. Creating a Few-shot extraction tool: The pre-trained model is very accurate even in very low resource setting. The key to this success was the selection of a dedicated Document NER model architecture which takes in the 2d information of documents in the form of bounding boxes instead of just plain simple BERT based NER models.

What we learned

  1. Building an end-to-end system in PyTorch was truly an incredible learning experience. From training models to freezing its weights, optimizing it, serving it in torchserve and deploying as a docker container, we learnt a lot during this process.
  2. Experimented with a lot of pre-trained models for NER tasks and fine-tuned for specific datasets to achieve the best accuracy.

What's next for Document Extraction Tool

  1. Create a detailed documentation on it's core components, how to use and extend it.
  2. Add more models and architectures.

Built With

Share this project:

Updates