We focused on preserving newspapers stored on microfilm. We learned a lot about the format and function of microfilm. We also learned that there are very few scanners capable of scanning such records at a fast rate. We struggled to gather the records properly because, while Mizzou has a ScanPro 3000 (a 25 MP scanner and a top-end unit), it did not function properly and would not complete automatic scans. This highlights how these materials are becoming less accessible throughout the world.

Within the software portion of development, we came across a lot of issues. We spent several hours trying to complete the project in UiPath, with little success. The UiPath user interface was not intuitive to pick up in a short time; I spent at least 2 hours before I began to feel comfortable with the software. Once we were comfortable with it, we realized that the OCR output it produced did not reach a high enough confidence for us to use. Our best result in UiPath came from Tesseract, at almost 55% accuracy, compared to almost 75% accuracy from the Python version of Tesseract on the same documents. Still, we wanted to see whether we could improve the results and keep using UiPath, as it had many features we thought would be helpful. We tried several ways to improve recognition: we increased the file size and tried several different file formats. Increasing the file size did help, but not enough to keep us working with UiPath. We also found that many of the featured packages were deprecated. For example, ABBYY, an OCR engine, was only updated to version 12 and did not come installed with UiPath, while the only version we could download from the ABBYY website was 15, which was not supported (although it gave us the highest accuracy when we ran it through the desktop app).
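This writeup does not say how the accuracy figures were measured. One common way to get a comparable number is to hand-type a short reference transcription and score each engine's output against it. The sketch below does that with Python's standard `difflib`; the function name `ocr_accuracy` and the whitespace normalization are assumptions for illustration, not necessarily the method we used.

```python
from difflib import SequenceMatcher


def ocr_accuracy(ocr_text: str, reference_text: str) -> float:
    """Return a 0.0-1.0 similarity between OCR output and a hand-typed reference."""
    # Collapse runs of whitespace so line-break differences are not counted as errors.
    ocr_norm = " ".join(ocr_text.split())
    ref_norm = " ".join(reference_text.split())
    if not ocr_norm and not ref_norm:
        return 1.0
    # autojunk=False keeps difflib from discarding very frequent characters in long texts.
    return SequenceMatcher(None, ocr_norm, ref_norm, autojunk=False).ratio()
```

Running the same scanned page through two engines and scoring both outputs against one reference gives figures that can be compared directly.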

Once we had moved on from UiPath, we felt like we were taking steps forward. Tyler had already built the Tesseract reader in Python, and we were well on our way to completing this step of the project. Madi had been working on the front end and was close to finishing it.

While we were able to read the files in with a fair amount of accuracy, we struggled to output the data in a usable manner. The standard output from pytesseract was in tabular format, which caused more issues than it probably should have.
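As a rough illustration of turning that table back into readable text, the sketch below assumes the tabular output is the tab-separated table that `pytesseract.image_to_data` returns as a string (one row per detected word, with `page_num`, `block_num`, `par_num`, `line_num`, `conf`, and `text` columns). The helper name `tsv_to_lines` and the `min_conf` threshold are hypothetical; this is one possible approach, not the code from our project.

```python
import csv
import io
from collections import defaultdict


def tsv_to_lines(tsv_text: str, min_conf: float = 0.0) -> list[str]:
    """Group word rows from Tesseract-style TSV output into lines of plain text."""
    # QUOTE_NONE: quote characters inside the OCR text are treated as literal text.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    lines = defaultdict(list)
    for row in reader:
        word = (row.get("text") or "").strip()
        conf = float(row.get("conf") or -1)
        # Structural rows (page, block, line) carry conf -1 and no text; skip them.
        if not word or conf < min_conf:
            continue
        key = (int(row["page_num"]), int(row["block_num"]),
               int(row["par_num"]), int(row["line_num"]))
        lines[key].append(word)
    # Sort by position in the page hierarchy; words keep their original order.
    return [" ".join(words) for _, words in sorted(lines.items())]
```

With pytesseract installed, this could be called as `tsv_to_lines(pytesseract.image_to_data(image))`. Raising `min_conf` drops low-confidence words, which trades completeness for cleaner text.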

Once we had a method to read and output a set of data, the next step was cloud integration. We set up Firebase, an easy-to-use database, to store and manage our data.
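The writeup does not record which Firebase product or data layout we used. As a sketch only, the example below assumes the Firebase Realtime Database and its REST interface, where writing JSON to a path ending in `.json` with `PUT` stores a record at that path. The base URL, the `pages/` path, the record fields, and the function name `build_upload_request` are all hypothetical.

```python
import json
import urllib.request


def build_upload_request(base_url: str, page_id: str, lines: list[str]) -> urllib.request.Request:
    """Build (but do not send) a PUT request that stores one page of OCR text."""
    # Hypothetical record layout: one entry per scanned page.
    record = {"page_id": page_id, "line_count": len(lines), "text": "\n".join(lines)}
    body = json.dumps(record).encode("utf-8")
    # Realtime Database REST paths end in ".json"; PUT writes the record at that path.
    url = f"{base_url.rstrip('/')}/pages/{page_id}.json"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

Sending the request would be a call to `urllib.request.urlopen(req)`. A real database would also need authentication and security rules, which this sketch leaves out.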

We spent a decent amount of time integrating Firebase with the frontend website (which we hosted through Amazon). We came across many more issues along the way, but overall we feel very happy about the state of our project.
