What is Pyparazzi?
Pyparazzi is a Python package that aims to answer the question: how do we gather meaningful data? It offers researchers and data scientists a modular, pipelined approach to data acquisition, processing, and visualization.
Web scraping made much easier!
Pyparazzi lets developers, data scientists, and researchers fetch data from different sources and build visualizations through a high-level API.
How to install it
pip install pyparazzi
Search Rooms
By defining rooms, you tell Pyparazzi where to go to grab the data.
Twitter
To generate the API keys, go to: https://developer.twitter.com/en/portal/dashboard
from pyparazzi.scrapy.rooms import TwitterRoom

API_KEY = "<TWITTER_API_KEY>"
API_KEY_SECRET = "<TWITTER_API_KEY_SECRET>"
ACCESS_TOKEN = "<TWITTER_ACCESS_TOKEN>"
ACCESS_TOKEN_SECRET = "<TWITTER_ACCESS_TOKEN_SECRET>"

if __name__ == '__main__':
    room = TwitterRoom(
        api_key=API_KEY,
        api_key_secret=API_KEY_SECRET,
        access_token=ACCESS_TOKEN,
        access_token_secret=ACCESS_TOKEN_SECRET,
    )
    # fetch tweets about COVID-19
    entities = room.fetch(q=["Covid"], num_results=10)
    for entity in entities:
        print(entity.text)
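Hard-coding keys in a script makes them easy to leak through version control. Below is a minimal sketch of reading the four Twitter credentials from environment variables instead. The variable names and the helper `load_twitter_credentials` are assumptions made for this illustration; they are not part of Pyparazzi.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = {
    "api_key": "TWITTER_API_KEY",
    "api_key_secret": "TWITTER_API_KEY_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_twitter_credentials(env=None):
    """Return a dict of credentials, or raise if any variable is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Keys of the returned dict match the keyword names used by TwitterRoom above.
    return {kwarg: env[name] for kwarg, name in REQUIRED_VARS.items()}
```

Because the returned dictionary uses the same keyword names as the TwitterRoom example above, it could be passed along as `TwitterRoom(**load_twitter_credentials())`.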
Flickr
To generate the API key, go to: https://www.flickr.com/services/api/
from pyparazzi.scrapy.rooms import FlickrRoom

FLICKR_API_KEY = "<FLICKR_API_KEY>"

if __name__ == '__main__':
    room = FlickrRoom(api_key=FLICKR_API_KEY)
    entities = room.fetch(q=["covid", "Dog"], num_results=50)
    for img in entities:
        print(img.url)
Bing
To generate the API key, go to: https://www.microsoft.com/en-us/bing/apis/bing-web-search-api
from pyparazzi.scrapy.rooms import BingRoom

BING_API_KEY = "<BING_SEARCH_API_KEY>"

if __name__ == "__main__":
    room = BingRoom(api_key=BING_API_KEY)
    entities = room.fetch(q=["mouse", "covid"], num_results=50)
    for img in entities:
        print(img.url)
Wikipedia
No API key is needed for Wikipedia; installing the Python package is enough.
from pyparazzi.scrapy.rooms import WikipediaRoom

if __name__ == "__main__":
    room = WikipediaRoom()
    results = room.fetch(q=["crypto"], num_results=10)
    for result in results:
        print(result.text)
Create a custom Room
from typing import List

from pyparazzi.scrapy.rooms import SearchRoom, DataEntity


class MyRoom(SearchRoom):
    def fetch(self, q, num_results=100) -> List[DataEntity]:
        # TODO: do something cool here to fetch your data
        return []


if __name__ == '__main__':
    room = MyRoom()
    # q is a list of search terms, as in the built-in rooms
    entities = room.fetch(q=["covid"], num_results=200)
    for e in entities:
        print(e)
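To make the custom room skeleton above concrete, here is a minimal sketch of a room that searches an in-memory list of strings. So that it runs without Pyparazzi installed, `SearchRoom` and `DataEntity` are replaced by simplified stand-ins. The real base classes may carry different fields and methods, and `InMemoryRoom` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataEntity:
    """Stand-in for Pyparazzi's DataEntity (assumed shape: a single text field)."""
    text: str


class SearchRoom:
    """Stand-in base class: subclasses are expected to implement fetch()."""

    def fetch(self, q, num_results=100) -> List[DataEntity]:
        raise NotImplementedError


class InMemoryRoom(SearchRoom):
    """Hypothetical room that searches a fixed list of documents."""

    def __init__(self, documents):
        self.documents = list(documents)

    def fetch(self, q, num_results=100) -> List[DataEntity]:
        keywords = [kw.lower() for kw in q]
        # Keep documents that mention at least one keyword (case-insensitive).
        hits = [
            DataEntity(text=doc)
            for doc in self.documents
            if any(kw in doc.lower() for kw in keywords)
        ]
        return hits[:num_results]


if __name__ == "__main__":
    room = InMemoryRoom(["Covid cases fall", "Dog wins show", "covid vaccine news"])
    for entity in room.fetch(q=["covid"], num_results=10):
        print(entity.text)
```

The same shape should carry over to any source: translate the query into whatever the backend understands, wrap each hit in an entity, and cap the list at `num_results`.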
Scraping Data
# import libraries
from pyparazzi.core import Dataset
from pyparazzi.scrapy.rooms import BingRoom, FlickrRoom, TwitterRoom
from pyparazzi import Paparazzi

BING_API_KEY = "<BING_API_KEY>"
FLICKR_API_KEY = "<FLICKR_API_KEY>"
TWITTER_API_KEY = "<TWITTER_API_KEY>"
TWITTER_API_KEY_SECRET = "<TWITTER_API_KEY_SECRET>"
TWITTER_ACCESS_TOKEN = "<TWITTER_ACCESS_TOKEN>"
TWITTER_ACCESS_TOKEN_SECRET = "<TWITTER_ACCESS_TOKEN_SECRET>"

if __name__ == "__main__":
    # create search rooms
    bing_room = BingRoom(api_key=BING_API_KEY)
    flickr_room = FlickrRoom(api_key=FLICKR_API_KEY)
    twitter_room = TwitterRoom(
        api_key=TWITTER_API_KEY,
        api_key_secret=TWITTER_API_KEY_SECRET,
        access_token=TWITTER_ACCESS_TOKEN,
        access_token_secret=TWITTER_ACCESS_TOKEN_SECRET,
    )

    # search parameters
    query = ["mouse", "covid"]
    n = 200

    # create paparazzi
    me = Paparazzi(
        rooms=[
            bing_room,
            flickr_room,
            twitter_room,
        ]
    )

    # fetch data from the multiple sources, and create dataset
    dataset = me.scrape(q=query, num_results=n)

    # export dataset to disk
    dataset.save("data.pkl")

    # load dataset from disk
    dataset = Dataset.from_file("data.pkl")
    for entry in dataset:
        print(f"{type(entry)}, {entry.content}")  # print entity content
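The flow above (query several rooms, merge the results, save, reload) can be sketched with the standard library alone. This illustrates the pattern and is not Pyparazzi's implementation: `StaticRoom`, `scrape_all`, `save` and `load` are hypothetical names, plain dictionaries stand in for entities, and pickle is assumed as the file format only because the example file is named data.pkl.

```python
import pickle
import tempfile
from pathlib import Path


class StaticRoom:
    """Hypothetical room that returns canned results tagged with its source."""

    def __init__(self, source, items):
        self.source = source
        self.items = items

    def fetch(self, q, num_results=100):
        # q is ignored here; a real room would use it to query its backend.
        return [
            {"source": self.source, "content": item}
            for item in self.items[:num_results]
        ]


def scrape_all(rooms, q, num_results=100):
    """Query every room with the same parameters and concatenate the results."""
    merged = []
    for room in rooms:
        merged.extend(room.fetch(q=q, num_results=num_results))
    return merged


def save(dataset, path):
    """Serialize the merged results to disk."""
    with open(path, "wb") as fh:
        pickle.dump(dataset, fh)


def load(path):
    """Read results back from a file written by save()."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    rooms = [StaticRoom("bing", ["img1", "img2"]), StaticRoom("twitter", ["tweet1"])]
    dataset = scrape_all(rooms, q=["mouse", "covid"], num_results=200)
    path = Path(tempfile.mkdtemp()) / "data.pkl"
    save(dataset, path)
    for entry in load(path):
        print(entry["source"], entry["content"])
```

Tagging each entry with its source keeps heterogeneous results distinguishable after merging. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.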
Utils
Filter only image entities
from pyparazzi.core import Dataset
from pyparazzi.scrapy.rooms import TextEntity, ImageEntity

dataset = Dataset.from_file("data.pkl")
dataset = dataset.select(mime_type=ImageEntity)
for entry in dataset:
    print(f"{type(entry)}, {entry.url}")
Filter only text entities
from pyparazzi.core import Dataset
from pyparazzi.scrapy.rooms import TextEntity, ImageEntity

dataset = Dataset.from_file("data.pkl")
dataset = dataset.select(mime_type=TextEntity)
for entry in dataset:
    print(f"{type(entry)}, {entry.text}")
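Selecting by entity type amounts to an `isinstance` filter over a mixed collection. Below is a minimal sketch with stand-in classes. `select_by_type` is a hypothetical helper, and how `Dataset.select` works internally is an open assumption here.

```python
from dataclasses import dataclass


@dataclass
class TextEntity:
    """Stand-in for a text result."""
    text: str


@dataclass
class ImageEntity:
    """Stand-in for an image result."""
    url: str


def select_by_type(entries, entity_type):
    """Return only the entries that are instances of entity_type, in order."""
    return [entry for entry in entries if isinstance(entry, entity_type)]


if __name__ == "__main__":
    mixed = [
        TextEntity("hello"),
        ImageEntity("https://example.com/a.png"),
        TextEntity("world"),
    ]
    for entry in select_by_type(mixed, TextEntity):
        print(entry.text)
```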
Visualize data
Text
- Get text entities:
from pyparazzi.core import Dataset
from pyparazzi.scrapy.rooms import TextEntity

if __name__ == '__main__':
    dataset = Dataset.from_file("data.pkl")
    dataset = dataset.select(mime_type=TextEntity)
- Plot a word cloud:
    dataset.plot(kind="wordcloud")
- Plot a scatter plot:
    dataset.plot(kind="scatterplot")
- Other utility functions:
    print(dataset.to_list())
    print(dataset.to_numpy())
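A word cloud sizes each word by how often it occurs, so the data behind it is a word-frequency table. The sketch below computes one with the standard library. The tokenizer, the stop-word list and the name `word_frequencies` are assumptions for illustration, and may differ from what `dataset.plot(kind="wordcloud")` does internally.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real pipeline would likely use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def word_frequencies(texts, top_n=10):
    """Count words across all texts and return the top_n (word, count) pairs."""
    counts = Counter()
    for text in texts:
        # Lowercase, then split into runs of letters, digits and apostrophes.
        words = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


if __name__ == "__main__":
    tweets = ["Covid is rising", "covid and the flu"]
    for word, count in word_frequencies(tweets, top_n=5):
        print(word, count)
```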
Images
Coming soon
How we built it
Pyparazzi is built on top of Python packages such as dask, numpy, scikit-learn, matplotlib, pandas, seaborn, and opencv-python.
Challenges we ran into
Merging and organizing data from different sources: Flickr, Bing, Twitter, and Wikipedia.
What we learned
We learned about web scraping, how to process heterogeneous datasets (images, text, etc.), and how to plot multidimensional data using the t-SNE (t-distributed stochastic neighbor embedding) algorithm.
What's next for Pyparazzi
Incorporate more visualization and data interpretability tools.