There are so many sources of information online, and many of them put some spin on the content before you read it. To understand the full story behind an event, it's important to see multiple perspectives and understand how each one spins the story. We therefore thought it would be useful to have a website that aggregates sources on the most important current events and gives people an objective measure of the emotional tone each news source uses to cover an event.
What it does
Pair News pulls together the latest, most important news stories and groups articles from multiple news sources that cover the same event. It also shows each article's sentiment so the user can see whether that article frames the event in a more positive or negative light than other sources.
How we built it
The first step was to scrape the web for all relevant news articles. We used a free news API to find the sources along with metadata on them. Then we scraped the full text of the articles so we could compare them and find which ones are about the same story. We found a library to help with this, but it didn't work for some sources and left much to be desired for others. So we wrote custom scraping code for some websites, plus filtering code to exclude things we weren't interested in, such as sports match result announcements.
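A minimal sketch of this kind of filter, assuming each article is a dict with a `title` field. The keyword list, the score pattern, and the function names are hypothetical and only illustrate the idea; they are not the exact rules we used.

```python
import re

# Hypothetical keywords and score pattern, chosen only for illustration.
SPORTS_KEYWORDS = {"playoff", "playoffs", "fixture", "fixtures", "halftime", "touchdown"}
SCORE_PATTERN = re.compile(r"\b\d{1,3}\s*-\s*\d{1,3}\b")  # e.g. "3-1" or "102 - 98"


def looks_like_sports_result(title):
    """Return True if a headline resembles a sports result announcement."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    has_keyword = bool(words & SPORTS_KEYWORDS)
    has_score = bool(SCORE_PATTERN.search(title))
    return has_keyword or has_score


def filter_articles(articles):
    """Keep only the articles whose titles do not look like sports results."""
    return [a for a in articles if not looks_like_sports_result(a["title"])]
```

A simple heuristic like this will misfire on some headlines (any title containing a score-like number pair is dropped), so in practice it would likely need per-source tuning.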
Next, we ran Porter's stemmer and a punctuation remover over the text of each article to prepare it for similarity computation. We computed similarity by converting the texts into vectors and taking their cosine similarity. This similarity score was then used to group the articles that were about the same event.
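The steps above can be sketched roughly as follows, using only the standard library. Several details here are assumptions: the vectors are raw term counts (a TF-IDF weighting would work the same way), the `threshold` value is arbitrary, and the greedy grouping strategy is just one simple option. The `stem` parameter defaults to a no-op; a real Porter stemmer, such as NLTK's `PorterStemmer().stem`, could be passed in instead.

```python
import math
import string
from collections import Counter


def tokenize(text, stem=lambda word: word):
    """Lowercase, strip punctuation, split on whitespace, and stem each word.

    `stem` defaults to the identity function (an assumption for this sketch);
    pass a Porter stemmer's stem function to match the described pipeline.
    """
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [stem(word) for word in cleaned.split()]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_articles(texts, threshold=0.5):
    """Greedily group texts by similarity to the first member of each group.

    Returns a list of groups, each a list of indices into `texts`.
    `threshold` is a hypothetical cutoff, not a tuned value.
    """
    tokenized = [tokenize(t) for t in texts]
    groups = []
    for i, tokens in enumerate(tokenized):
        for group in groups:
            if cosine_similarity(tokens, tokenized[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([i])
    return groups
```

Comparing each article only against the first member of a group keeps the sketch short; comparing against every member, or against a group centroid, would be more robust at the cost of more computation.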
After that, we used the natural language processing capabilities of Google Cloud Platform to perform sentiment analysis, named entity extraction, and named entity sentiment analysis. The sentiment value was then used to determine which news sources spun an event positively and which spun it negatively.
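One way to turn per-article sentiment values into a relative comparison is sketched below. It assumes each source in a story group already has a sentiment score in the range -1.0 to 1.0, and it does not call the cloud service itself. The function name, the `margin` dead zone, and the label strings are hypothetical choices for illustration.

```python
from statistics import mean


def label_relative_sentiment(scores, margin=0.1):
    """Label each source relative to the mean sentiment of its story group.

    `scores` maps a source name to a sentiment score in [-1.0, 1.0].
    `margin` is a hypothetical dead zone around the mean within which
    a source is considered to cover the story similarly to the others.
    """
    average = mean(scores.values())
    labels = {}
    for source, score in scores.items():
        if score > average + margin:
            labels[source] = "more positive"
        elif score < average - margin:
            labels[source] = "more negative"
        else:
            labels[source] = "similar"
    return labels
```

Comparing against the group mean, rather than against zero, reflects the goal of showing how a source frames an event relative to other sources covering the same story.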
Finally, we used Django to bring these results to the web and Bootstrap to display everything to the user in an intuitive and visually pleasing manner.
We obtained the domains NewsComparing.tech and PairHeadlines.com from .tech and Domains.com. The second contains what we consider to be a subtle yet clever pun.
Challenges we ran into
There were many hurdles throughout this challenge, but none that we weren't able to overcome, or at least sidestep. The web scraping library we found initially had serious problems with some major sites like ABC News, which we found unacceptable, so we had to implement custom scraping code for some websites and create a variety of specific filters for others in order to get good results.
Grouping the texts and running sentiment analysis took longer than we expected on a data set as large as ours, so the bugs we ran into set us back significantly. This left us less time for front-end and UI development.
We also planned to build our site with Google's Material Design Lite, but we had many issues getting it to render correctly, so we switched to Bootstrap in the end.
Accomplishments that we're proud of
Creating a website that we would actually want to use has been an incredibly rewarding experience, especially since this was the first hackathon for most of us. It was really interesting to combine a variety of technologies, including web scraping, rule-based NLP, deep-learning NLP, and modern web technologies, into a single coherent project.
What we learned
Having people who are experienced with different parts of a project is extremely useful because everyone can teach the other members of the group about the things they know less about, which makes everyone more productive. It's also important to leave plenty of time for front-end design, since it can take longer than expected.
What's next for Pair News
We could integrate our completed entity analysis code into the UI, which would show the user how each news organization portrays specific entities, such as Saudi Arabia, Black Lives Matter, and Donald Trump.