Inspiration

I really just wanted a data-set to try out some NLP tricks and run some experiments. I also like the news. Thus, I built Sheriff: a web-crawler that extracts stories from news-sites.

What it does

Sheriff can crawl news-sites in two ways: either by starting at the home-page and detecting news stories, or by crawling the RSS feeds of news-providers. During scraping, the title, date published, story content, and (where available) the author name are extracted.
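To give a feel for the RSS route, here's a minimal sketch using only the Python standard library. It is not the actual Sheriff code: the `parse_feed` function, the field names in the returned dicts, and the sample feed are all made up for illustration, and it assumes a plain RSS 2.0 feed where each story is an `<item>`.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; a real crawler would download this from a news-provider.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Council approves new budget</title>
      <link>https://example.com/budget</link>
      <pubDate>Mon, 06 Jan 2020 09:30:00 GMT</pubDate>
      <author>jane@example.com (Jane Doe)</author>
      <description>The council voted on Monday.</description>
    </item>
    <item>
      <title>Storm closes roads</title>
      <link>https://example.com/storm</link>
      <pubDate>Tue, 07 Jan 2020 14:00:00 GMT</pubDate>
      <description>Several roads were closed.</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per <item> with title, link, date, author and content."""
    root = ET.fromstring(xml_text)
    stories = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")  # RSS 2.0 dates use the RFC 822 format
        stories.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": parsedate_to_datetime(pub) if pub else None,
            "author": item.findtext("author"),  # None when the feed omits it
            "content": item.findtext("description"),
        })
    return stories


stories = parse_feed(SAMPLE_FEED)
```

The author field is simply `None` when a feed leaves it out, which matches the "where available" caveat above.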

Using the story-text we summarize the story into up to five main points. Additionally, we generate keywords based on the text with which we tag the story.
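This write-up doesn't pin down which summarization or keyword method is used, so the sketch below swaps in a simple stand-in: word-frequency scoring. Sentences are ranked by the average frequency of their words and the top five are kept in their original order; keywords are just the most frequent non-stopwords. The `STOPWORDS` set, `keywords`, and `summarize` are hypothetical names for illustration only.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "was", "were", "are", "too", "can", "say",
             "that", "this", "with", "has", "have", "will"}


def tokenize(text):
    """Lowercase words, dropping stopwords and very short tokens."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def keywords(text, n=5):
    """The n most frequent non-stopwords, used as tags."""
    return [word for word, _ in Counter(tokenize(text)).most_common(n)]


def summarize(text, max_points=5):
    """Pick up to max_points high-scoring sentences, kept in story order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_points])]


story = ("The council approved the budget. The budget funds new roads. "
         "Critics say the budget is too large. The mayor defended the budget. "
         "Weather was mild. A vote is planned next month. "
         "Residents can comment online.")
points = summarize(story)
tags = keywords(story, n=1)
```

Averaging by sentence length keeps long sentences from winning just because they contain more words.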

Finally, there's a prototype web-app that presents scraped stories.


How I built it

  • RSS scraper: self-built
  • Generalized web-scraper: Scrapy
  • Database management: MongoDB
  • Web-app: Flask

Challenges I ran into

Scraping in a generalized fashion is hard. There is no standard for storing critical information: for instance, the date published may be stored in the content of a meta-tag, it may be a string inside an element with a particular class, or whatever else the developers chose. There is no consistency, so you need to consider all the possible variations and detect them.
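As a rough illustration of what "consider all the possible variations" looks like, here's a sketch that tries a few places a publish date might hide, in priority order: known meta-tags, a `<time datetime="...">` attribute, then the text of any element whose class mentions "date". It uses only the standard library and is not Sheriff's actual parser; `META_KEYS`, `PublishedDateFinder`, and `find_published` are hypothetical names, and the list of meta keys is an assumption that a real crawler would need to extend.

```python
from html.parser import HTMLParser

# Assumed meta keys for illustration; real sites use many more.
META_KEYS = {"article:published_time", "og:published_time", "pubdate", "date"}


class PublishedDateFinder(HTMLParser):
    """Collects (priority, value) date candidates; lower priority wins."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._capture_text = False  # True right after a tag whose class has "date"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            if key in META_KEYS and attrs.get("content"):
                self.candidates.append((0, attrs["content"]))
        elif tag == "time" and attrs.get("datetime"):
            self.candidates.append((1, attrs["datetime"]))
        elif "date" in (attrs.get("class") or "").lower():
            self._capture_text = True

    def handle_data(self, data):
        # Crude: grabs the next non-empty text after a "date"-classed tag.
        if self._capture_text and data.strip():
            self.candidates.append((2, data.strip()))
            self._capture_text = False


def find_published(html):
    """Return the best raw date string found, or None."""
    finder = PublishedDateFinder()
    finder.feed(html)
    finder.close()
    if not finder.candidates:
        return None
    return min(finder.candidates, key=lambda c: c[0])[1]


page = ('<html><head><meta property="article:published_time" '
        'content="2020-01-06T09:30:00Z"></head>'
        '<body><span class="date">Jan 6</span></body></html>')
found = find_published(page)
```

The result is still a raw string; turning "Jan 6", "2020-01-06T09:30:00Z", and everything in between into one date format is a second problem on top of finding it.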

Accomplishments that I'm proud of

I'm pretty pleased that I was able to cover as much ground as I did in a handful of hours, despite not having experience in many of the frameworks used. It's in a really rough, incomplete state; however, it gets the idea across while giving me a base to expand on.

What I learned

Programmatic Learning

I did a MongoDB tutorial pre-hack, but beyond that I had no experience with NoSQL databases. Alongside that, while I've worked on HTML-scraping, I wasn't familiar with Scrapy.

Time-management

While I'm pleased with the amount I did, I should have managed my time better. I sank too much time into working on a generalized parser, when I should've focused on RSS feeds. This would've been quicker and allowed me to get more features into the final submission.


What's next for Sheriff

  • Better RSS crawler & messaging system
  • Statistical crawling methods
  • Clustering based on underlying events
  • Testing
  • Massive refactoring
