I have always wished that, when I received readings from my university professors as scanned PDF files, I could search them for particular terms with Ctrl+F. I am aware that there are services that run OCR on unsearchable PDF files, but I felt it would be more convenient to have this built into the PDF viewer in the browser.

What it does

In its current state, it can only search for one word at a time, and it only highlights the words that the OCR recognises. Navigation is yet to be implemented, and the search results are only as good as the Azure OCR API (Read API). I am using the free tier, so it will only search the first two pages.

How I built it

I adapted the PDF.js platform to incorporate this additional search functionality. Due to time constraints, instead of overlaying the whole PDF document with an OCR text layer, I created an API that handles the search part of the PDF search bar: it calls the Azure Read API, stores the results, and serves them in a form that the PDF application can easily manage.
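As a rough illustration of the search step, the sketch below looks up a single word in stored OCR output. It is not the project's actual code: the function names are hypothetical, and the input shape (pages containing `lines`, each with `words` that carry `text` and an eight-number `boundingBox`) is a simplified assumption about what the Read API returns.

```javascript
// Assumed, simplified shape of one stored OCR page:
// { page: 1, width: 8.5, height: 11,
//   lines: [{ words: [{ text: "Hello", boundingBox: [x1, y1, ..., x4, y4] }] }] }

// Lower-case the text and drop punctuation so "Search," matches "search".
function normalise(text) {
  return text.toLowerCase().replace(/[^\p{L}\p{N}]/gu, "");
}

// Return every OCR word that equals the search term (hypothetical helper).
function findWordMatches(readResults, term) {
  const needle = normalise(term);
  if (!needle) return [];
  const matches = [];
  for (const page of readResults) {
    for (const line of page.lines ?? []) {
      for (const word of line.words ?? []) {
        if (normalise(word.text) === needle) {
          matches.push({
            page: page.page,
            text: word.text,
            boundingBox: word.boundingBox,
          });
        }
      }
    }
  }
  return matches;
}
```

Matching whole words only mirrors the one-word-at-a-time limitation described above; phrase search would need to compare runs of adjacent words instead.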

Challenges I ran into

Initially, my intention was to create a Chrome extension that does the same thing. However, I ran into problems when trying to access the PDF viewer objects to get the information about the PDF file (scale, page size, etc.) that I needed to display the search highlights. I found out that reading a site's global objects from an extension (more specifically, from the content script) is not possible due to permissions.
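To show why the scale and page size matter, here is a minimal sketch of turning one OCR bounding box into a highlight rectangle in viewer pixels. The helper name and parameter shapes are hypothetical; it assumes the OCR coordinates and the rendered page both use a top-left origin, and that `viewport` holds the rendered page's `width` and `height` (for example, as returned by PDF.js's `page.getViewport({ scale })`).

```javascript
// boundingBox: [x1, y1, x2, y2, x3, y3, x4, y4] in OCR page units (assumed).
// ocrPage:  { width, height } of the page in the same OCR units.
// viewport: { width, height } of the rendered page in CSS pixels.
function toHighlightRect(boundingBox, ocrPage, viewport) {
  const xs = boundingBox.filter((_, i) => i % 2 === 0);
  const ys = boundingBox.filter((_, i) => i % 2 === 1);

  // Scale factors from OCR units to rendered pixels.
  const scaleX = viewport.width / ocrPage.width;
  const scaleY = viewport.height / ocrPage.height;

  // Use the axis-aligned box around the four corners.
  const minX = Math.min(...xs);
  const minY = Math.min(...ys);
  return {
    left: minX * scaleX,
    top: minY * scaleY,
    width: (Math.max(...xs) - minX) * scaleX,
    height: (Math.max(...ys) - minY) * scaleY,
  };
}
```

The returned values could be applied as the `left`, `top`, `width`, and `height` styles of an absolutely positioned element placed over the page. Rotated pages or skewed scans would need more than an axis-aligned box.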

Another challenge came when I pivoted to developing directly on top of the PDF.js platform. Mirroring the normal search bar took a lot of time because I was simply not familiar with the codebase.

Accomplishments that I'm proud of

I am mostly proud of completing my first hackathon!

What I learned

I have learnt a lot of JavaScript concepts (I usually use Python), parts of the JavaScript stack such as Node.js and Express, and some database basics (MongoDB). I have also learnt a lot about Chrome extension development.

What's next for pdfjs ocr search bar

This is definitely not a finished product. Using a custom API to serve search results was just a shortcut to make the deadline. I aim to mirror the normal search bar that is already implemented in the PDF.js platform, which means overlaying the OCR text as HTML on top of the PDF pages. This will give a more consistent end product, as it will fit seamlessly into the platform.
