Inspiration

Image processing and Optical Character Recognition (OCR) are changing the way we do business through digitized services, such as information scanning and extraction for business documents and automatic data entry.

What it does

Our OCR model takes image files as input and extracts their text while retaining the sequence of the conversation in context. It parses each image into arrays, which then go through pre-processing steps: grayscaling, Otsu thresholding (which increases the contrast between light and dark regions), kerneling, dilation, median blurring, and contour detection.

How we built it

We took two approaches to the project: using pre-trained English OCR models, and training our own ground-truth Tesseract model. After processing all images, we used the Python Tesseract API (pytesseract) to properly extract and order characters. For the custom Tesseract model, we first created a training dataset from a large language-training text. We then downloaded a font similar to the one used by the author of the popular scientific comic xkcd, and edited it to contain only capital letters, since lowercase letters did not appear in any of the training images. Finally, we trained a model from scratch on the generated training-text images; this model can be used with Tesseract to extract text from images.

Challenges we ran into

Multiple parts of Tesseract were difficult, if not impossible, to work with on Windows, so we needed many workarounds, including running under WSL. The problems included font files not converting properly between encodings, specific libraries not being available on Windows, and compatibility conflicts between certain Python libraries.

Accomplishments that we're proud of

Our team learned how to build OCR models from scratch using ground-truth data. We are also proud that our model's accuracy comes very close to that of the Tesseract library's pre-trained model.

Built With
