PubMed Web Scraper

Inspiration

One of the challenges in most projects is finding a good dataset for the goal at hand. This is why I think there is tremendous value in creating basic tools that make it easier to acquire data. When it comes to learning more about medical conditions like COVID-19, resources like PubMed are a valuable source of information that can be mined with NLP.

What it does

This script takes a search term and an integer for the desired number of results. Using these two arguments, the script returns and saves a comma-separated table with the columns: title, authors, journal, date, DOI, and abstract.
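
As a rough illustration of the two inputs and the output columns, here is a minimal sketch of a command-line interface. The names `parse_args`, `term`, `max_results`, and `COLUMNS` are hypothetical and chosen for this example; the actual script may read its arguments differently.

```python
import argparse

# Output columns described above; the exact header names are an assumption.
COLUMNS = ["title", "authors", "journal", "date", "doi", "abstract"]


def parse_args(argv=None):
    """Parse the search term and the desired number of results (hypothetical CLI)."""
    parser = argparse.ArgumentParser(description="Save PubMed search results as CSV.")
    parser.add_argument("term", help="PubMed search term, e.g. 'covid-19'")
    parser.add_argument("max_results", type=int, help="desired number of results")
    return parser.parse_args(argv)


# Example call with a fixed argument list instead of the real command line.
args = parse_args(["covid-19", "25"])
print(args.term, args.max_results)
```

Passing `type=int` lets argparse reject a non-numeric result count before any request is sent.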

How I built it

Using the two input arguments, I send a request to PubMed. I then carefully extract and compose the relevant information for each article, looping over the articles one at a time. The final data is put into a pandas DataFrame before it gets saved as a CSV file.
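
The steps above can be sketched roughly as follows. This is not the original code: it assumes the NCBI E-utilities endpoints and the XML layout returned by `efetch`, whereas the script may reach PubMed another way. All function names (`build_search_url`, `text_of`, `article_to_row`, `parse_articles`, `rows_to_csv`) are hypothetical, and `SAMPLE_XML` is a hand-written placeholder record, not real data. To stay dependency-free, the sketch writes the CSV with the standard library `csv` module rather than pandas.

```python
import csv
import io
import urllib.parse
import xml.etree.ElementTree as ET

COLUMNS = ["title", "authors", "journal", "date", "doi", "abstract"]
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hand-written placeholder in the shape of PubMed efetch XML (not real data).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Journal>
          <JournalIssue><PubDate><Year>2020</Year><Month>Apr</Month></PubDate></JournalIssue>
          <Title>Example Journal</Title>
        </Journal>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract><AbstractText>An example abstract.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="doi">10.0000/example</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""


def build_search_url(term, max_results):
    """Compose an esearch URL from the two input arguments."""
    query = urllib.parse.urlencode({"db": "pubmed", "term": term, "retmax": max_results})
    return f"{ESEARCH}?{query}"


def text_of(node, path):
    """Return the stripped text at `path`, or "" if the element is missing."""
    found = node.find(path)
    return "".join(found.itertext()).strip() if found is not None else ""


def article_to_row(article):
    """Extract the six output fields from one <PubmedArticle> element."""
    authors = [
        f"{text_of(a, 'ForeName')} {text_of(a, 'LastName')}".strip()
        for a in article.findall(".//AuthorList/Author")
    ]
    date_parts = (text_of(article, ".//PubDate/Year"), text_of(article, ".//PubDate/Month"))
    abstract_parts = article.findall(".//Abstract/AbstractText")
    return {
        "title": text_of(article, ".//ArticleTitle"),
        "authors": "; ".join(authors),
        "journal": text_of(article, ".//Journal/Title"),
        "date": " ".join(part for part in date_parts if part),
        "doi": text_of(article, "PubmedData/ArticleIdList/ArticleId[@IdType='doi']"),
        "abstract": " ".join("".join(p.itertext()).strip() for p in abstract_parts),
    }


def parse_articles(xml_text):
    """Loop over every article in the response and build one row per article."""
    root = ET.fromstring(xml_text)
    return [article_to_row(article) for article in root.iter("PubmedArticle")]


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


search_url = build_search_url("covid-19", 25)
rows = parse_articles(SAMPLE_XML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

With pandas installed, the last step could instead be `pandas.DataFrame(rows, columns=COLUMNS).to_csv("results.csv", index=False)`, which matches the DataFrame-then-CSV flow described above. Missing fields come back as empty strings, so an article without an abstract or DOI still yields a complete row.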

Challenges I ran into

The response contains a lot of different information, and it is tedious to make sure that only the relevant parts are extracted. Furthermore, HTTPS certificate handling required a workaround, in part because legacy Python versions do not verify HTTPS certificates by default.
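
The exact workaround is not shown here, so the following is only one possible approach, not the original one: build an explicit SSL context that verifies certificates and pass it to each request. The helper names `make_verified_context` and `fetch` are hypothetical. A common alternative is to disable verification globally, but that removes protection against tampered responses and is best treated as a last resort.

```python
import ssl
import urllib.request


def make_verified_context(cafile=None):
    """Build an SSL context that verifies certificates and hostnames.

    `cafile` may point at a CA bundle when the system store is incomplete.
    """
    return ssl.create_default_context(cafile=cafile)


def fetch(url, context=None, timeout=30):
    """Download `url` over HTTPS using an explicit, verifying SSL context."""
    context = context or make_verified_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()


# No network call is made here; this only checks how the context is configured.
context = make_verified_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

Passing the context explicitly keeps the behavior the same regardless of what a given Python version does by default.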