Hacked by Steven Feng (Rice '22), with mentorship from Albert Wang (NJIT '22). ScrapeTheLines is a Python script that uses the BeautifulSoup API and urllib2 to scrape headlines from several news websites and count repeated keywords. The script returns the top ten keywords with their relative frequencies, along with the frequencies of the less common keywords. The intention is for users to then search those keywords themselves on the news outlets they prefer, which should at least reduce the bias introduced by search algorithms on the Web.
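To make the scrape-and-count idea concrete, here is a minimal sketch, not the actual ScrapeTheLines code. It rests on several assumptions: it targets Python 3, so `urllib.request` stands in for `urllib2`; it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra installs; it treats text inside `<h1>` to `<h3>` tags as headlines; and it defines "relative frequency" as a keyword's share of all counted keywords. The names `fetch`, `extract_headlines`, `keyword_frequencies`, `STOPWORDS`, and `SAMPLE_HTML` are all hypothetical.

```python
import re
import urllib.request
from collections import Counter
from html.parser import HTMLParser

# Hypothetical stopword list; the real script's filtering rules may differ.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "for", "with", "is", "at", "as", "by"}


class HeadlineParser(HTMLParser):
    """Collects text found inside <h1>-<h3> tags (assumed to be headlines)."""

    HEADLINE_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside a headline tag
        self.headlines = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADLINE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADLINE_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse whitespace and store the finished headline.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.headlines.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def fetch(url):
    """Download a page as text (Python 3 counterpart of urllib2.urlopen)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


def keyword_frequencies(headlines, top_n=10):
    """Return (keyword, count, share of all counted keywords) tuples."""
    words = []
    for headline in headlines:
        for word in re.findall(r"[a-z']+", headline.lower()):
            if word not in STOPWORDS and len(word) > 2:
                words.append(word)
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(w, c, c / total) for w, c in counts.most_common(top_n)]


# Made-up sample page so the sketch can run without network access.
SAMPLE_HTML = """
<html><body>
<h1>Senate passes budget bill</h1>
<h2>Budget vote divides Senate</h2>
<h3>Storm hits coast as budget talks end</h3>
<p>Not a headline: budget budget budget</p>
</body></html>
"""

if __name__ == "__main__":
    for keyword, count, share in keyword_frequencies(extract_headlines(SAMPLE_HTML)):
        print(f"{keyword}: {count} ({share:.1%})")
```

To try this on a live site, pass the result of `fetch(url)` to `extract_headlines`. Real news pages mark up headlines in different ways, so the tag set would likely need adjusting for each outlet.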
Here are the articles I referenced for simple facts about the different news channels: https://en.wikipedia.org/wiki/BBC https://en.wikipedia.org/wiki/Fox_Broadcasting_Company https://en.wikipedia.org/wiki/CNN https://en.wikipedia.org/wiki/NBC https://en.wikipedia.org/wiki/NPR
Here is a neat graphic of the political leanings of various news outlets, courtesy of the Washington Post:
My first and foremost goal is to optimize the scraping algorithm so that it is more efficient and less biased. Next steps are to implement some form of data visualization and to find a way to export the scraped data to an Excel spreadsheet.
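One possible route to the spreadsheet export, sketched here only as an idea, is to write the results as a CSV file, which Excel can open directly. The function name `write_keyword_csv`, the column names, the output filename, and the sample rows below are all hypothetical; the sketch assumes each result is a (keyword, count, relative frequency) tuple.

```python
import csv
import io


def write_keyword_csv(rows, out):
    """Write (keyword, count, relative_frequency) rows to a text stream as CSV."""
    writer = csv.writer(out)
    writer.writerow(["keyword", "count", "relative_frequency"])
    for keyword, count, share in rows:
        writer.writerow([keyword, count, f"{share:.4f}"])


# Made-up rows standing in for real scraper output.
SAMPLE_ROWS = [("budget", 3, 0.5), ("senate", 2, 0.25)]

if __name__ == "__main__":
    # newline="" lets the csv module control line endings itself.
    with open("keywords.csv", "w", newline="", encoding="utf-8") as f:
        write_keyword_csv(SAMPLE_ROWS, f)
```

Writing a native `.xlsx` file would require a third-party library, so CSV is the simpler first step.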