Making Datasets with Web Crawlers - BONAP Datasets

What it does

This project is a concise Python notebook on web crawling. I chose the Biota of North America Program (BONAP) website as the scraping target because its pages do not rely on JavaScript to load content and its HTML and CSS are easy to read.
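On a static site the data is already present in the HTML the server returns, so a plain parser can read it without a browser. Here is a minimal sketch of that idea using only Python's standard library. The HTML snippet, the URL paths, and the `LinkCollector` and `extract_links` names are hypothetical; they are not taken from BONAP's markup or from the notebook.

```python
from html.parser import HTMLParser

# Hypothetical static HTML; not copied from the BONAP site.
SAMPLE_HTML = """
<html><body>
  <ul class="genera">
    <li><a href="/genus/acer">Acer</a></li>
    <li><a href="/genus/quercus">Quercus</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return the links found in it."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
```

If a page built its list with JavaScript instead, the same parser would find nothing in the raw HTML, which is why a static site is a convenient first target.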

Challenges I ran into

Time was a real constraint, since I had to cut down how long it took to scrape BONAP's website. At the start I was waiting 30 minutes for a single process to finish. Because development began that slowly, I spent most of my time researching coding practices for optimizing web scraping and indexing performance.
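For readers using Scrapy, these are the kinds of settings that usually affect crawl time. The setting names are real Scrapy settings, but the values and the `FAST_CRAWL_SETTINGS` name are illustrative assumptions, not the configuration used in this notebook.

```python
# Illustrative values only; assumed, not the settings used in this project.
FAST_CRAWL_SETTINGS = {
    # Allow more requests in flight at once, overall and per domain.
    "CONCURRENT_REQUESTS": 32,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
    # No fixed pause between requests; AutoThrottle adjusts the delay instead.
    "DOWNLOAD_DELAY": 0,
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 8.0,
    # Cache responses on disk so repeated development runs skip the network.
    "HTTPCACHE_ENABLED": True,
    "HTTPCACHE_EXPIRATION_SECS": 24 * 60 * 60,
    # Less logging means less overhead on large crawls.
    "LOG_LEVEL": "INFO",
    # Keep honoring robots.txt while tuning for speed.
    "ROBOTSTXT_OBEY": True,
}
```

A dictionary like this can be assigned to a spider's `custom_settings` attribute or passed to `CrawlerProcess(settings=...)`. The HTTP cache tends to help most during development, when the same pages are fetched again on every run.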

What I learned

I learned a lot about the differences between web scraping and web crawling.
The most important knowledge I gained was how to work with Jupyter notebooks.
I learned how to use the Python package Scrapy.
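The scraping versus crawling distinction can be sketched in a few lines: scraping extracts data from one page, while crawling discovers pages by following links and scrapes each one. The sketch below uses only the standard library and a hypothetical three-page site held in memory (the `SITE` dictionary, its URLs, and the function names are all assumptions) so that it runs offline. A framework such as Scrapy takes care of the request queue and duplicate filtering that `crawl` does by hand here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site held in memory so the sketch runs offline.
SITE = {
    "https://example.org/": '<h1>Index</h1><a href="/a">A</a> <a href="/b">B</a>',
    "https://example.org/a": '<h1>Page A</h1><a href="/b">B</a>',
    "https://example.org/b": '<h1>Page B</h1><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Record the <h1> text and every href found on a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.hrefs = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data


def scrape(url):
    """Scraping: pull structured data (and outgoing links) from one page."""
    parser = PageParser()
    parser.feed(SITE[url])  # stands in for an HTTP fetch
    links = [urljoin(url, href) for href in parser.hrefs]
    return {"url": url, "title": parser.title.strip()}, links


def crawl(start_url):
    """Crawling: follow links breadth-first, scraping each page once."""
    seen = {start_url}
    queue = deque([start_url])
    items = []
    while queue:
        url = queue.popleft()
        item, links = scrape(url)
        items.append(item)
        for link in links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return items


for item in crawl("https://example.org/"):
    print(item)
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other, as the index and Page B do here.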

Built With

csv
datascience
datasets
jsonl
jupyter
pandas
python
scrapy
webcrawling

Updates

Brian Almaguer started this project — Aug 22, 2021 04:26 AM EDT
