According to Gallup, Americans trust in mass media is at an all time low. While this seems alarming, it’s not all that surprising. Recent revelations of Hillary Clinton working with Google to modify search results and Facebook content being unfavorable to conservatives highlight the dangers of consuming news from individual media organizations. While it is ultimately the responsibility of the user to consume information responsibly, the current digital environment has resulted in an abundance of information, which makes accomplishing this very difficult.
What is it?
N Combinator is a multi-document summarizer, which combines perspectives from various articles into a single summary. The overall goal is to reduce bias and provide users with more “complete” news.
How was it built?
N Combinator is a web app built with the Python Flask framework. It's hosted on AWS Elastic Beanstalk, and uses the Python newspaper package to extract and summarize individual news articles from the web. We use both the Google Cloud Natural Language API and the IBM Watson Tone Analyzer for sentiment analysis, and then our own homebrew algorithm to aggregate the summarized articles into a new, multi-perspective and well-rounded summary.
There were a lot of small obstacles along the way, but the most significant hurdle was actually deploying the web app on AWS EB. One of our dependencies was the
lxml package, which depended on some external C code and so installation wasn't nearly as simple as a
pip install lxml. The default Elastic Beanstalk behavior is to
pip install -r requirements.txt, but deployment kept failing because lxml needed its dependencies. After digging into the documentation, creating a configuration file called
.ebextensions/python.config with various YAML-formatted text instructing
yum to install the correct packages did the trick.
Once the web app was up and running, EB failed to load the
static folder which contained all our images and stylesheets, so the site looked like a bunch of plaintext. Modifying some more EB config files didn't fix anything, because it actually wasn't broken in the first place. It took about an hour to finally stumble upon an extremely obscure StackOverflow answer with 0 upvotes, which told us that the Chrome browser extension Ghostery does not agree with Elastic Beanstalk. After disabling Ghostery, we were able to finally view the full web app. Who knew that would be the problem?
Another key challenge we faced was combining the individual summaries to produce a single multi-document summary. We needed to determine a suitable method by studying literature and modifying and combining different techniques for our specific use case.
We’re fortunate to have come together as a team and address a real issue we all believed in, even if it was a project we knew would be challenging to complete within the time constraints. Each of us stepped outside our comfort zones and embraced the challenge of diving into a completely new field. We’re proud to have successfully delivered a proof of concept of our ultimate goal while pushing ourselves to the limit throughout this weekend.
Our future goals include creating a query based system so that users do not have to search for similar articles themselves. We plan to continue building off our proof of concept and polishing up the overall system by incorporating additional elements into the algorithm. In addition, we hope to incorporate this engine with other services and current news platforms.