Inspiration
Matching images with manually entered user data can be a time-consuming process, but unfortunately, it is all but a necessity in many business applications. Bill.com challenged us to develop a model that would automate this task. Our goal was to save customers time and effort by quickly producing results that are as accurate as possible.
What it does
Meet and Receipt involves a multi-stage workflow. The main sections of our model are the OCR generator, the matching/scoring algorithm, the gamma weighting, and the bipartite matching algorithm.
- We used PaddleOCR to accurately extract transcriptions of receipts that did not originally come with a corresponding OCR file. The OCR files we generate are of the exact format of the provided OCR files, and are piped together with the provided OCR files into data pre-processing.
- For the main matching stage, we use fuzzy matching to compare every fragment of text in the OCR to every column in the user-entered data. We compute similarity scores for every possible (receipt, entry) tuple. We take the top-scoring connections and save them to process them further.
Challenges we ran into
The formatting for dates and prices were very inconsistent. To resolve this, we use Regular Expressions to identify and parse types of data. This wound up being one of the most challenging, but also the most important parts. We convert all dates to be of the YYYY-M-D format and prices to have 2 digits of precision before we attempt fuzzy matching. We also give dates, prices, and addresses confidence scores. For example, we can be more confident that a term is a date if it contains "January" or is of the form "//____". We also can be more sure that an OCR fragment is an address if it contains common features of Malaysian addresses such as jalan, meaning street, or contains large Malaysian cities.
What's next for Meet and Receipt
Steps we could take to improve the model’s real-world applicability are including data from other countries and optimizing the code to reduce time complexity. One way to do this is to index the user entry table by date. We could also prioritize text that is located in coordinates that correspond strongly with the four components.
Built With
- deepnote
- paddleocr
- python
Log in or sign up for Devpost to join the conversation.