Inspiration

Image processing and Optical Character Recognition (OCR) are changing the way we do business through digitized services, such as information scanning and extraction for business documents and automatic data entry.

What it does

Our OCR model takes image files as input and extracts their text while retaining the sequence of the conversation in context. It parses each image into arrays, which then go through pre-processing steps: grayscaling, Otsu thresholding (which increases the contrast between light and dark regions), kerneling, dilation, median blurring, and contour detection.

How we built it

We took two approaches to the project: using pre-trained English OCR models, and training our own ground-truth Tesseract model. After processing all images, we used the Python Tesseract API (pytesseract) to properly extract and order characters. For the custom Tesseract model, we first created a training dataset from a large language-training text. We then downloaded a font similar to the one used by the author of the popular scientific comic xkcd, and edited it to contain only capital letters, since lowercase letters did not appear in any of the training images. Finally, we trained a model from scratch on the generated training-text images; this model can be used with Tesseract to extract text from images.

Challenges we ran into

Multiple parts of Tesseract were difficult, if not impossible, to work with on Windows, so we needed many workarounds, including running under WSL. The problems included font files not converting properly between encodings, specific libraries not being available on Windows, and compatibility conflicts between certain Python libraries.

Accomplishments that we're proud of

Our team learned how to build OCR models from scratch using ground-truth data. We are also proud that our model's accuracy comes very close to that of the Tesseract library's pre-trained model.

Built With
