Inspiration
I was inspired by the Google search engine and the whole concept of web search in general.
What it does
It scrapes the URLs from a page and stores them in a database that you can interface with to build APIs and other products. Once it has finished scraping the URLs from a page, it iterates through each of them in turn, continuing until it runs out of links to find. I've also provided an API as an example of what you can build on top of it.
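The "scrape a page, then iterate through its links" loop above is essentially a breadth-first crawl over a frontier of unvisited URLs. Here is a minimal sketch of that loop; `fetchLinks` is a hypothetical stand-in for the real scraper (which uses goquery to extract `<a href>` values from fetched pages):

```go
package main

import "fmt"

// fetchLinks is a hypothetical stub standing in for the real scraper,
// which would fetch the page over HTTP and extract its <a href> URLs.
func fetchLinks(url string) []string {
	graph := map[string][]string{
		"https://example.com":   {"https://example.com/a", "https://example.com/b"},
		"https://example.com/a": {"https://example.com/b"},
	}
	return graph[url]
}

// crawl visits URLs breadth-first until no new links remain, mirroring
// the "scrape, then iterate through the results" loop described above.
func crawl(start string) []string {
	seen := map[string]bool{start: true}
	queue := []string{start}
	var visited []string
	for len(queue) > 0 {
		url := queue[0]
		queue = queue[1:]
		visited = append(visited, url)
		for _, link := range fetchLinks(url) {
			if !seen[link] {
				seen[link] = true
				queue = append(queue, link)
			}
		}
	}
	return visited
}

func main() {
	fmt.Println(crawl("https://example.com"))
}
```

The `seen` set is what makes the loop terminate: a URL is only enqueued the first time it is discovered, so the crawl runs out of links instead of cycling forever.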
How I built it
I built it using Go, BadgerDB, and goquery. I started off by building a basic proof of concept that could get all the URLs from a single page, then built on from there. Next I added concurrency to speed up the overall execution, which meant debugging a lot of annoying race conditions.
Challenges I ran into
I ran into a lot of race conditions. More notably, I started getting HTTP 429 (Too Many Requests) errors, which meant I was requesting a resource too many times. My workaround was to have the scraper retry the request, with a cap on the number of retries to prevent hangs.
The most notable race condition I resolved was concurrent reads and writes on a shared map. I initially solved it by guarding the map with mutex locks and unlocks, but then moved to the more idiomatic approach of using channels, which let goroutines send data instead of modifying a shared resource.
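The mutex-to-channels refactor can be sketched as follows. This is a simplified illustration of the pattern, not the project's actual code: workers send discovered links over a channel, and a single goroutine owns the map, so no concurrent writes ever happen:

```go
package main

import (
	"fmt"
	"sync"
)

// collect fans out one worker per page. Workers send discovered links
// over a channel instead of writing a shared map, and only the caller's
// goroutine touches the map — no data race, no mutex needed.
func collect(pages map[string][]string) map[string]bool {
	found := make(chan string)
	var wg sync.WaitGroup

	for _, links := range pages {
		wg.Add(1)
		go func(links []string) {
			defer wg.Done()
			for _, l := range links {
				found <- l // send data instead of modifying shared state
			}
		}(links)
	}

	// Close the channel once every worker has finished sending,
	// so the receive loop below terminates.
	go func() {
		wg.Wait()
		close(found)
	}()

	// Single owner of the map: serialized writes by construction.
	seen := make(map[string]bool)
	for link := range found {
		seen[link] = true
	}
	return seen
}

func main() {
	pages := map[string][]string{
		"a": {"x", "y"},
		"b": {"y", "z"},
	}
	fmt.Println(len(collect(pages))) // "y" is deduplicated, so 3
}
```

This is the usual Go trade: the mutex version protects the shared map at every access site, while the channel version removes sharing altogether, which `go run -race` can verify.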
Accomplishments that I'm proud of
Figuring out how to properly use a database and make it accessible from another app, such as an API. Also figuring out how to take advantage of Go's concurrency to speed up the execution of my code. Another thing I'm proud of is learning how to set up a web server in Go that receives JSON data and returns valid JSON.
What I learned
I learned a lot about how to use goroutines properly, and also how to use a database properly. It also gave me insight into using Go in a more production-like setting rather than just in small projects.
What's next for ScraperBoi
Support for distributed computing, to increase the overall number of URLs that can be processed, and the ability to interface with external backends such as Google's Firebase and AWS S3.
Built With
- badger
- database
- golang
- nosql
- scrape
- spider
- url