Inspiration
We were inspired by ML-based document segmentation and data extraction, and by NLP-based text analysis.
What it does
The project uses Amazon Textract to extract raw text from PDF files. Amazon Comprehend then discovers insights and relationships in that text, using custom Comprehend models trained on our data to assign each document a category.
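A minimal sketch of the text-extraction step, assuming boto3 and Textract's asynchronous text-detection API for PDFs stored in S3. The function names, the polling interval, and the bucket/key arguments are illustrative and not taken from the project.

```python
import time


def lines_from_blocks(blocks):
    """Join the text of all LINE blocks from a Textract response."""
    return "\n".join(b["Text"] for b in blocks if b.get("BlockType") == "LINE")


def extract_pdf_text(bucket, key, poll_seconds=5):
    """Run asynchronous text detection on a PDF in S3 and return its raw text.

    Hypothetical helper: bucket and key are placeholders for your own objects.
    """
    import boto3  # imported lazily so lines_from_blocks works without AWS access

    textract = boto3.client("textract")
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        page = textract.get_document_text_detection(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with status {page['JobStatus']}")

    # Results are paginated; follow NextToken until all blocks are collected.
    blocks = list(page.get("Blocks", []))
    while "NextToken" in page:
        page = textract.get_document_text_detection(
            JobId=job_id, NextToken=page["NextToken"]
        )
        blocks.extend(page.get("Blocks", []))
    return lines_from_blocks(blocks)
```

The returned string can be saved as a .txt file in S3 so that Comprehend can read it.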
How we built it
- Amazon S3 -> Stores the training data, a CSV file that pairs each category with its relevant document text.
- Amazon Textract -> Extracts the text contained in a PDF document.
- Amazon Comprehend -> We provide the training data CSV and input/output locations pointing to Amazon S3, then create a job based on the custom classifier (the model built from the CSV file). The job reads the text files in the S3 bucket and produces a .jsonl file that lists the category for each document along with a score value.
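The Comprehend steps above could be sketched roughly as follows, assuming boto3, a two-column training CSV (category, then text, with no header row), and a multi-class classifier whose output records carry a "Classes" list. All function names, ARNs, and S3 URIs here are hypothetical placeholders.

```python
import csv
import io
import json


def training_csv(examples):
    """Build the training CSV from (category, text) pairs, one document per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for category, text in examples:
        # Collapse internal whitespace so each document stays on a single row.
        writer.writerow([category, " ".join(text.split())])
    return buf.getvalue()


def top_categories(jsonl_text):
    """Map each document in the job output to its highest-scoring category."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        best = max(record["Classes"], key=lambda c: c["Score"])
        doc = record.get("File", record.get("Line"))
        results[doc] = (best["Name"], best["Score"])
    return results


def start_classification_job(classifier_arn, role_arn, input_uri, output_uri):
    """Start a classification job over the text files under input_uri."""
    import boto3  # imported lazily so the helpers above work without AWS access

    comprehend = boto3.client("comprehend")
    response = comprehend.start_document_classification_job(
        DocumentClassifierArn=classifier_arn,  # ARN of the trained custom model
        DataAccessRoleArn=role_arn,            # role Comprehend uses to reach S3
        InputDataConfig={"S3Uri": input_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        OutputDataConfig={"S3Uri": output_uri},
    )
    return response["JobId"]
```

Once the job finishes, its output archive in S3 can be downloaded and unpacked, and the .jsonl contents passed to top_categories to get one category and score per document.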
Challenges we ran into
Amazon Comprehend uses text files to determine the category, while our documents were PDF files. We used Textract to get the raw text of the PDF files.
Accomplishments that we're proud of
Using AWS services to solve the problem statement.
What we learned
The need for document classification, along with ML-based document segmentation and data extraction and NLP-based text analysis.
What's next for Intelligent Document Classification
We can look into integrating Amazon Augmented AI to build and manage human reviews for machine learning applications.
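One way such an integration might look is to send only low-confidence classifications to a human review loop. The sketch below assumes boto3's Augmented AI runtime client and predictions shaped as a mapping of document name to (category, score); the 0.8 threshold, the function names, and the flow definition ARN are hypothetical and not part of the current project.

```python
import json


def split_for_review(predictions, threshold=0.8):
    """Separate confident predictions from those that need a human check."""
    accepted, review = {}, {}
    for doc, (category, score) in predictions.items():
        target = accepted if score >= threshold else review
        target[doc] = (category, score)
    return accepted, review


def start_review(flow_definition_arn, loop_name, doc, category, score):
    """Start an Augmented AI human loop for one low-confidence document."""
    import boto3  # imported lazily so split_for_review works without AWS access

    a2i = boto3.client("sagemaker-a2i-runtime")
    response = a2i.start_human_loop(
        HumanLoopName=loop_name,
        FlowDefinitionArn=flow_definition_arn,  # placeholder for a real flow
        HumanLoopInput={
            "InputContent": json.dumps(
                {"document": doc, "category": category, "score": score}
            )
        },
    )
    return response["HumanLoopArn"]
```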
Built With
- amazon-web-services
- comprehend
- s3
- textract

