Image-based Webscraper

Inspiration

Originally, I wanted to use OCR to scan images on Google Maps and extract menus for restaurants that didn't have actual menus posted. Unfortunately, I discovered about halfway through that Google Lens was already integrated into the mobile versions of Google Maps and basically did what I wanted to do. I did not expect to be able to improve on Google Lens meaningfully, so I decided to spend the time learning more about image processing and playing around with different OCR tools, and ended up making this.

What it does

Given a website URL, the script scrolls through the page and takes screenshots as it goes. The screenshots are then preprocessed and fed into an open source OCR engine, and the recognized text is written to a PDF.
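The scroll-and-screenshot step could look roughly like the sketch below. This is not the project's actual code: the names `scroll_offsets` and `capture_screenshots`, the headless Chrome setup, and the window size are all assumptions made for illustration.

```python
import os


def scroll_offsets(page_height, viewport_height, overlap=0):
    """Return the vertical scroll positions needed to cover the whole page.

    Hypothetical helper: the last offset is clamped so the final screenshot
    ends exactly at the bottom of the page.
    """
    if viewport_height <= 0:
        raise ValueError("viewport_height must be positive")
    if not 0 <= overlap < viewport_height:
        raise ValueError("overlap must be in [0, viewport_height)")
    step = viewport_height - overlap
    last = max(page_height - viewport_height, 0)
    offsets = list(range(0, last, step))
    offsets.append(last)
    return offsets


def capture_screenshots(url, out_dir, width=1280, height=1000):
    """Scroll through `url` and save one PNG per scroll position (assumed setup)."""
    # Imported lazily so scroll_offsets() can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    paths = []
    try:
        driver.set_window_size(width, height)
        driver.get(url)
        page_height = driver.execute_script("return document.body.scrollHeight")
        viewport = driver.execute_script("return window.innerHeight")
        for i, y in enumerate(scroll_offsets(page_height, viewport)):
            driver.execute_script("window.scrollTo(0, arguments[0]);", y)
            path = os.path.join(out_dir, f"shot_{i:03d}.png")
            driver.save_screenshot(path)
            paths.append(path)
    finally:
        driver.quit()
    return paths
```

When the page height is not a multiple of the viewport height, the last screenshot overlaps the one before it, so the same text may be recognized twice unless the overlap is cropped before OCR.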

How we built it

The script is written in Python and uses libraries such as PIL and cv2 for image processing. Selenium is used to load webpages and take screenshots. ReportLab is used to generate the PDF.
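As an illustration of the kind of image processing that usually helps OCR, here is a minimal sketch that converts a screenshot to grayscale, upscales it, and binarizes it with Otsu's method. The exact steps the script uses are not described here, so the function names, the choice of Otsu thresholding, and the scale factor are assumptions. The pure-Python `otsu_threshold` is only there to show what the thresholding step computes; `preprocess_for_ocr` uses cv2's built-in version.

```python
def otsu_threshold(histogram):
    """Pick the gray level that best separates dark and light pixels.

    `histogram[i]` is the number of pixels with gray value i. Pixels with a
    value <= the returned threshold form one class, the rest form the other.
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_level, best_variance = 0, -1.0
    weight_low = 0
    sum_low = 0
    for level, count in enumerate(histogram):
        weight_low += count
        sum_low += level * count
        if weight_low == 0:
            continue  # no pixels in the lower class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the lower class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Between-class variance (up to a constant factor).
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def preprocess_for_ocr(path, scale=2.0):
    """Grayscale, upscale, and binarize a screenshot (assumed pipeline)."""
    # Imported lazily so otsu_threshold() works without OpenCV installed.
    import cv2

    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC)
    # With THRESH_OTSU set, the threshold argument (0) is ignored and
    # OpenCV chooses it automatically.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```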

Challenges we ran into

I had very little experience going in and was not able to put in as much time as I wanted, because external factors made my spring break much busier than anticipated. To be totally honest, I realized early on that I wouldn't have as much time as I had hoped, which played a part in my not trying hard to find a team, because I didn't want to be dead weight. I also struggled for ideas after realizing Google Lens exists.

Accomplishments that we're proud of

The script works fairly accurately on websites that are mostly text, such as Wikipedia. Learning to use a bunch of different tools for image processing and how to sort out their dependencies was also pretty interesting.

What we learned

I overestimated what I would be able to do by myself, especially since I did not have too much experience.

What's next for Image-based Webscraper

I want to continue to improve the image processing side of things. I was working on a way to intelligently fill in outlines to make text clearer, but was not able to figure that out in time.
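One naive starting point for filling outlines, offered only as a sketch and not as the approach I was attempting, is hole filling by flood fill: any background pixel that cannot be reached from the image border is enclosed by an outline and gets filled. The `fill_holes` function below is a hypothetical pure-Python version on a grid of 0s and 1s; `scipy.ndimage.binary_fill_holes` does the same kind of operation on arrays.

```python
from collections import deque


def fill_holes(grid):
    """Return a copy of `grid` (rows of 0/1) with enclosed 0-regions set to 1."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    reachable = [[False] * cols for _ in range(rows)]
    queue = deque()

    # Seed the search with every background pixel on the image border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and grid[r][c] == 0:
                reachable[r][c] = True
                queue.append((r, c))

    # Breadth-first flood fill through 4-connected background pixels.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not reachable[nr][nc] and grid[nr][nc] == 0):
                reachable[nr][nc] = True
                queue.append((nr, nc))

    # Foreground stays foreground; unreachable background becomes foreground.
    return [
        [1 if grid[r][c] == 1 or not reachable[r][c] else 0
         for c in range(cols)]
        for r in range(rows)
    ]


# A hollow 3x3 square: the single enclosed pixel in the middle gets filled.
outline = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
filled = fill_holes(outline)
```

This fills every enclosed region, including the open centers of letters like "o" and "e", so on its own it is too blunt for real text. Making the fill "intelligent" would mean telling stroke interiors apart from those letter centers, which is the part that remains unsolved.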

Built With

cv2
json
numpy
ocr
pil
python
reportlab
scipy
selenium
tempfile

Updates

DEREK LEE started this project — Mar 28, 2021 01:49 AM EDT
