Inspiration

We were inspired by ML-based document segmentation and data extraction, and by NLP-based text analysis.

What it does

The project uses Amazon Textract to extract the raw text from a PDF file. Amazon Comprehend then discovers insights and relationships in that text, using custom Comprehend models trained on our data.

How we built it

  1. Amazon S3 -> Stores the training data, a CSV file that contains each category and its relevant document text.
  2. Amazon Textract -> Gets the text contained in a PDF document.
  3. Amazon Comprehend -> We provide the training data CSV and the input/output locations in Amazon S3, then create a job based on the custom classifier (the model built from the CSV file). The job reads the text files in the S3 bucket and produces a .jsonl file that lists the category of each document with a score value.
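
The two file formats in the steps above can be sketched in plain Python. This is a minimal illustration, not our exact code: it assumes a header-less two-column training CSV (category, then document text) and assumes each line of the output .jsonl looks like {"File": ..., "Classes": [{"Name": ..., "Score": ...}]}. The function names and the sample categories are hypothetical.

```python
import csv
import io
import json


def build_training_csv(rows):
    """Serialize (category, document_text) pairs as header-less CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for category, text in rows:
        # Collapse line breaks and repeated spaces so each document
        # stays on a single CSV row.
        writer.writerow([category, " ".join(text.split())])
    return buffer.getvalue()


def top_categories(jsonl_text):
    """Map each file name to its highest-scoring (category, score) pair.

    Assumes one JSON object per line with "File" and "Classes" keys.
    """
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        best = max(record["Classes"], key=lambda c: c["Score"])
        results[record["File"]] = (best["Name"], best["Score"])
    return results
```

The CSV text would be uploaded to the S3 bucket before training, and top_categories would be run on the .jsonl file downloaded from the job's output location.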

Challenges we ran into

Amazon Comprehend determines the category from text files, but our documents were PDFs. We used Amazon Textract to get the raw text of the PDF files first.
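
Turning a Textract result into a plain-text file can be sketched as follows. It assumes the response is a dictionary with a "Blocks" list, where blocks of type "LINE" carry a "Text" field; the function name is hypothetical and the call to Textract itself is left out.

```python
def textract_lines_to_text(response):
    """Join the text of every LINE block in a Textract-style response."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        # PAGE and WORD blocks are skipped; WORD text is already
        # contained in the LINE blocks.
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return "\n".join(lines)
```

The returned string can then be saved as a .txt file in S3 for the Comprehend job to read.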

Accomplishments that we're proud of

Solving the problem statement using AWS services.

What we learned

We learned about the need for document classification, ML-based document segmentation and data extraction, and NLP-based text analysis.

What's next for Intelligent Document Classification

We can look into integrating Amazon Augmented AI to build and manage human reviews for machine learning applications.
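
One possible way to decide which documents go to a human reviewer is a confidence threshold on the classifier's score. The sketch below only shows that decision; it does not call Amazon Augmented AI, and the function name and the 0.8 threshold are assumptions chosen for illustration.

```python
def needs_human_review(classes, threshold=0.8):
    """Return True when the best category score is below the threshold.

    `classes` is a list of {"Name": ..., "Score": ...} dictionaries,
    the shape assumed for one document's classification result.
    """
    best = max(classes, key=lambda c: c["Score"])
    return best["Score"] < threshold
```

Documents flagged this way would be sent for review, while confident predictions would be accepted as they are.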

