I really just wanted a data-set to try out some NLP tricks and run some experiments. I also like the news. Thus, I built Sheriff: a web-crawler that extracts stories from news-sites.
What it does
Sheriff can crawl news-sites in two ways: either by starting at the home-page and detecting news stories, or by crawling the RSS feeds of news-providers. During scraping, the title, date published, story content, and--where available--the author name are extracted.
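As a rough illustration of the RSS route, here is a minimal sketch that pulls the title, link, date published, and optional author out of an RSS 2.0 feed using only the Python standard library. It is not Sheriff's actual scraper: `SAMPLE_FEED` and `parse_feed` are hypothetical names, and a real crawler would download the feed over HTTP rather than read it from a string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real crawler would download this over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
      <pubDate>Mon, 06 Jan 2014 10:00:00 GMT</pubDate>
      <author>jane@example.com (Jane Doe)</author>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
      <pubDate>Mon, 06 Jan 2014 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per <item>, with None for fields the feed omits."""
    root = ET.fromstring(xml_text)
    stories = []
    for item in root.iter("item"):
        stories.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
            "author": item.findtext("author"),  # optional in RSS 2.0
        })
    return stories
```

The feed only gives a link and metadata; the story content itself would still have to be fetched from each `link` and extracted separately.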
Using the story-text we summarize the story into up to five main points. Additionally, we generate keywords from the text and use them to tag the story.
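This write-up doesn't say which summarization or keyword method is used, so the following is only a sketch of one common baseline, assuming simple word-frequency scoring: the most frequent non-stopword terms become tags, and the sentences with the highest average term frequency become the main points. The names `STOPWORDS`, `tokenize`, `keywords`, and `summarize` are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "is",
             "are", "was", "were", "it", "that", "this", "for", "with", "as"}


def tokenize(text):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def keywords(text, limit=5):
    """Most frequent non-stopword terms, used here as tags."""
    return [word for word, _ in Counter(tokenize(text)).most_common(limit)]


def summarize(text, max_points=5):
    """Pick up to max_points highest-scoring sentences, in original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        # Average frequency of the sentence's terms across the whole story.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                    reverse=True)
    keep = sorted(ranked[:max_points])  # restore document order
    return [sentences[i] for i in keep]
```

Averaging rather than summing the term frequencies keeps long sentences from winning purely on length, which matters when the goal is a handful of short points.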
Finally, there's a prototype web-app that presents scraped stories.
How I built it
- RSS scraper: self-built
- Generalized web-scraper: Scrapy
- Database management: MongoDB
- Web-app: Flask
Challenges I ran into
Scraping in a generalized fashion is hard. There is no standard for storing critical information: for instance, the date published may be stored in the content of a meta-tag, it may be a string inside an element with some class, or it may be wherever the developers decided to put it. There is no consistency, so you need to consider all the possible variations and detect them.
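To make the problem concrete, here is a minimal sketch of checking several places for the date published, using the standard library's `html.parser` rather than Scrapy. The meta keys in `META_KEYS` and the class-name fragments in `CLASS_HINTS` are assumed examples, not the list Sheriff checks, and `DateFinder` and `find_published` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed examples of where sites put the date; real sites use many more.
META_KEYS = {"article:published_time", "pubdate", "date", "dc.date"}
CLASS_HINTS = ("date", "published", "timestamp")


class DateFinder(HTMLParser):
    """Collect candidate publication dates from several common locations."""

    def __init__(self):
        super().__init__()
        self.candidates = []      # (source, value) pairs in document order
        self._capture = False     # True right after a date-like start tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Variation 1: <meta property=... content=...> or <meta name=...>
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            if key in META_KEYS and attrs.get("content"):
                self.candidates.append(("meta", attrs["content"]))
        elif tag == "time" and attrs.get("datetime"):
            # Variation 2: <time datetime=...>
            self.candidates.append(("time", attrs["datetime"]))
        else:
            # Variation 3: text directly inside an element with a date-like class.
            css = (attrs.get("class") or "").lower()
            self._capture = any(hint in css for hint in CLASS_HINTS)

    def handle_data(self, data):
        if self._capture and data.strip():
            self.candidates.append(("class", data.strip()))
            self._capture = False


def find_published(html):
    """Return the first candidate date string, or None if nothing matched."""
    finder = DateFinder()
    finder.feed(html)
    return finder.candidates[0][1] if finder.candidates else None
```

Even this sketch only returns raw strings: each site formats its dates differently, so a further parsing step would be needed before the values could be compared or stored consistently.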
Accomplishments that I'm proud of
I'm pretty pleased that I was able to cover as much ground as I did in a handful of hours, despite not having experience in many of the frameworks used. It's in a really rough, incomplete state; however, it gets the idea across while giving me a base to expand on.
What I learned
I did a MongoDB tutorial pre-hack, but beyond that I had no experience with NoSQL-based databases. Alongside that, while I've worked on HTML-scraping, I wasn't familiar with Scrapy.
While I'm pleased with the amount I did, I should have managed my time better. I sank too much time into working on a generalized parser, when I should've focused on RSS feeds. That would've been quicker and allowed me to get more features into the final submission.
What's next for Sheriff
- Better RSS crawler & messaging system
- Statistical crawling methods
- Clustering based on underlying events
- Massive refactoring