Inspiration
I wanted to strengthen my NLP skills, learn web scraping, and learn how to build a custom dataset from scratch. I also wanted to explore the features of the spaCy NLP library.
What it does
Scrapes news articles from the web.
Uses the spaCy library to parse the news text.
Performs NLP preprocessing such as tokenization and lemmatization.
Performs dependency parsing.
Performs named entity recognition (NER).
Visualizes both the dependency parse tree and the named entities.
Finally, builds a dataset from the unstructured, scraped text.
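As a rough illustration of the last step, here is a minimal sketch of how recognized entities could be flattened into dataset rows. The function names, the column layout, and the `en_core_web_sm` model are assumptions made for this example, not necessarily what the project uses.

```python
import csv
import io

# Column layout is an assumption for this sketch, not the project's schema.
FIELDNAMES = ["source", "entity", "label", "start_char", "end_char"]


def entity_rows(doc, source_url=""):
    """Turn the named entities of a parsed spaCy Doc into flat dataset rows."""
    return [
        {
            "source": source_url,
            "entity": ent.text,
            "label": ent.label_,
            "start_char": ent.start_char,
            "end_char": ent.end_char,
        }
        for ent in doc.ents
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text, header included."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def parse_article(text, model="en_core_web_sm"):
    """Parse text with spaCy (hypothetical helper).

    spaCy is imported lazily so the helpers above work without it.
    Assumes spaCy and the named model are installed.
    """
    import spacy

    nlp = spacy.load(model)
    return nlp(text)
```

With spaCy installed, `rows_to_csv(entity_rows(parse_article(text), source_url=url))` would produce one CSV row per entity. On the same parsed document, each token exposes `token.lemma_` and `token.dep_`, and `spacy.displacy.render(doc, style="dep")` or `style="ent"` draws the two visualizations mentioned above.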
How we built it
I used the Python requests library for the web scraping and the spaCy library for the NLP functionality.
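A minimal sketch of the scraping step might look like the following. It assumes the article body lives in `<p>` tags and uses the standard-library `html.parser` to pull the text out; the project itself may extract text differently. `ParagraphExtractor`, `extract_paragraphs`, and `fetch_article_text` are hypothetical names for this example.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags (assumed to hold the article body)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._chunks = []
        self.paragraphs = []

    def _flush(self):
        # Collapse runs of whitespace and keep only non-empty paragraphs.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_paragraph:
                # A new <p> implicitly closes an unclosed one.
                self._flush()
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML document, in order."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def fetch_article_text(url, timeout=10):
    """Download a page with requests and return its paragraph text.

    requests is imported lazily; this assumes it is installed.
    """
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return "\n".join(extract_paragraphs(response.text))
```

The text returned by `fetch_article_text` can then be passed straight to a spaCy pipeline for parsing.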
Challenges we ran into
I didn't run into any particular challenges.
Accomplishments that we're proud of
I am happy that I can build a custom dataset from unstructured text data such as news articles.
What we learned
Web scraping and custom dataset creation.
What's next for Web-Scraper-NER-Dataset-Builder
I would like to expand the functionality, in particular by enabling it to scrape any type of data and by building a UI with a framework such as Streamlit. That way, any developer or data scientist who wants to do web scraping can have an automated tool that empowers them to do so.