What it does

This project is a to-the-point Python notebook on web crawling. I chose the Biota of North America Program (BONAP) website as the scraping target because none of its content is loaded by JavaScript and its HTML and CSS are easy to read.

Challenges I ran into

Time was a real constraint, since I had to cut down how long it took to scrape BONAP's website. At first, I was waiting 30 minutes for a single run to finish. Because development started out so slow, I spent most of my time researching coding practices for optimizing web scraping and indexing performance.
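
As a rough sketch of the kind of tuning this involves, Scrapy exposes settings for concurrency, throttling, and caching. The setting names below are real Scrapy settings, but the dictionary name and every value are illustrative assumptions, not the configuration used in the notebook.

```python
# Illustrative Scrapy settings for a faster crawl.
# FAST_CRAWL_SETTINGS is a hypothetical name and all values are assumptions.
FAST_CRAWL_SETTINGS = {
    # Upper bound on requests Scrapy keeps in flight at once.
    "CONCURRENT_REQUESTS": 32,
    # Upper bound on simultaneous requests to a single domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
    # Pause (in seconds) between requests to the same site.
    "DOWNLOAD_DELAY": 0.25,
    # Let Scrapy adapt the delay to the server's response times.
    "AUTOTHROTTLE_ENABLED": True,
    # Cache responses on disk so repeated development runs skip the network.
    "HTTPCACHE_ENABLED": True,
    # Keep honoring the site's robots.txt rules.
    "ROBOTSTXT_OBEY": True,
    # Less logging output means less overhead in a notebook.
    "LOG_LEVEL": "WARNING",
}
```

A dictionary like this can be assigned to a spider's `custom_settings` class attribute or passed to `CrawlerProcess(settings=...)`. Caching tends to help most during development, when the same pages are fetched over and over.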

What I learned

  • I learned a lot about the differences between web scraping and web crawling.
  • The most important knowledge I gained was how to work with Jupyter notebooks.
  • I learned how to use the Python package Scrapy.
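
To make the scraping versus crawling distinction concrete, here is a minimal sketch using only Python's standard library rather than Scrapy. Scraping pulls data out of one page (the headings), while crawling collects the links to visit next. The `PageParser` class, the sample HTML, and the `example.org` URL are all made up for illustration and do not reflect BONAP's real markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Hypothetical parser that both scrapes headings and gathers links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # crawling: URLs to visit next
        self.headings = []   # scraping: data extracted from this page
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "h2":
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading and data.strip():
            self.headings.append(data.strip())


# Made-up static page; no JavaScript is needed to see its content.
SAMPLE_HTML = """
<html><body>
  <h2>Acer rubrum</h2>
  <a href="/maps/acer.html">Maps</a>
  <a href="genus/quercus.html">Quercus</a>
</body></html>
"""

parser = PageParser("http://example.org/plants/")
parser.feed(SAMPLE_HTML)
print(parser.headings)
print(parser.links)
```

A crawler would repeat this step for every URL in `parser.links`, keeping track of pages it has already visited. A scraper only cares about what ends up in `parser.headings`.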

Built With

  • csv
  • datascience
  • datasets
  • jsonl
  • jupyter
  • pandas
  • python
  • scrapy
  • webcrawling