In the current information age, characterized by the free flow of information, the quality of what is circulated has diminished. This is driven in part by the ease of reaching a large audience via the Internet, where sharing opinion as fact has never been easier. While the Internet era has plenty of benefits, this new paradigm requires a whole new set of tools to deal with large amounts of information that have not been subject to the "quality control" of most traditional media.
Technology has been tremendously successful at classification tasks in other domains; it has become essential to filtering spam, for example. However, discriminating between real and fake news is extremely challenging from a technical standpoint, and arguably an intractable problem given the current technological landscape and the state of research in NLP and ML. Designing a "magic classifier" that could perform as well as state-of-the-art spam filters therefore seemed unrealistic.
That said, we realized that the accuracy of a news-sorting classifier only matters if you are trying to draw an absolute line between fact and fiction. We think that addressing people's mentality when faced with information is the first step toward truly mitigating the effects of fake news. For this reason, at the outset of the weekend we aimed to create an application that questions news that seems implausible rather than categorically stating that a given article is non-factual.
What it does
Our project is a Google Chrome extension that parses the websites you browse while simultaneously querying the Bing Search API to build an "on-the-fly" cluster of related articles that serves as a "ground truth" metric. This addresses the lack of external context in current language-based approaches to detecting fake news. We then forwarded this cluster of articles on the same topic, along with the target article, to our server, where we used NLP and ML techniques to determine whether the target article was likely to contain fake news. For articles we considered "at risk" of being fake, we highlighted salient sentences in the article and provided related searches. The same warning appeared when we detected strong bias or opinion in an article, regardless of whether it was positive or negative.
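A minimal sketch of the cluster-building query is below. The endpoint and parameter names follow the Bing News Search v7 API, but the helper name `build_cluster_query`, the constant `BING_ENDPOINT`, and the default count are our own illustrative choices, not our production code.

```python
from urllib.parse import urlencode

# Bing News Search v7 endpoint; the subscription key is supplied via the
# Ocp-Apim-Subscription-Key header (elided here).
BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/news/search"

def build_cluster_query(headline, count=10):
    """Build the URL and headers for fetching a cluster of related articles."""
    params = urlencode({"q": headline, "count": count, "mkt": "en-US"})
    url = f"{BING_ENDPOINT}?{params}"
    headers = {"Ocp-Apim-Subscription-Key": "<YOUR_KEY>"}  # key stays elided
    return url, headers

url, headers = build_cluster_query("election results disputed")
```

The response's article list is what we treat as the on-the-fly cluster for the target page.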
How we built it
The Google Chrome extension parsed the website being visited and passed this information to the backend, which performed detailed analysis on the text and on the cluster of news articles.
We based this evaluation on three metrics:
We used a trigram language model along with a word-frequency distribution to determine which sentences in the group of articles were the most salient. This allowed us to reduce the amount of noise we had to deal with when making inferences. The algorithm we used was loosely based on SumBasic.
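The frequency-based part of that scoring can be sketched as follows. This is a simplified, illustrative version: our actual pipeline also used a trigram language model, and SumBasic proper re-weights word probabilities after each sentence is picked, which we omit here for brevity.

```python
import re
from collections import Counter

def salient_sentences(text, n=3):
    """Score sentences by the mean corpus probability of their words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w.lower() for w in re.findall(r"\w+", text)]
    freq = Counter(words)
    total = sum(freq.values())
    prob = {w: c / total for w, c in freq.items()}

    def score(sentence):
        toks = [w.lower() for w in re.findall(r"\w+", sentence)]
        return sum(prob[t] for t in toks) / len(toks) if toks else 0.0

    # Sentences built from frequent words across the cluster rank highest.
    return sorted(sentences, key=score, reverse=True)[:n]
```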
We then modelled these salient sentences using pretrained word2vec embeddings (trained on a Google News corpus of roughly 100 billion words). These word vectors allowed us to model semantic relationships between the articles. We compared the similarity between the most salient sentences in the articles of the cluster, using their cosine similarity as a baseline. If the target article was significantly more dissimilar than the differences observed within the cluster, we expected the target article to contain non-factual, or at least highly biased, content.
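The comparison can be sketched as below. Toy 3-dimensional vectors stand in for the 300-dimensional word2vec embeddings, and `sentence_vector` and `cosine` are our own illustrative helpers (a common choice is to average the word vectors of a sentence).

```python
import numpy as np

# Toy embeddings standing in for pretrained word2vec vectors.
EMBEDDINGS = {
    "markets": np.array([0.9, 0.1, 0.0]),
    "rallied": np.array([0.8, 0.2, 0.1]),
    "aliens":  np.array([0.0, 0.1, 0.9]),
    "landed":  np.array([0.1, 0.0, 0.8]),
}

def sentence_vector(tokens):
    """Average the word vectors of the in-vocabulary tokens."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

cluster = sentence_vector(["markets", "rallied"])
target = sentence_vector(["aliens", "landed"])
similarity = cosine(cluster, target)  # a low value flags the outlier
```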
To build on that last point, strong opinion in a news article that is not an editorial is often a sign of "fake news". To increase the robustness of our analysis, we used Google's Natural Language API to perform sentiment analysis on the target article. We were not so much interested in whether the article reflected positive or negative sentiment; rather, we cared about whether the sentiment was strong, regardless of its polarity.
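That "strong regardless of polarity" check reduces to looking at sentiment magnitude rather than sign. A sketch, assuming per-sentence scores in [-1.0, 1.0] (the range Google's Natural Language API uses); the function name and threshold are our own:

```python
def strongly_biased(sentence_scores, threshold=0.5):
    """Flag an article whose sentiment is strong in either direction.

    sentence_scores: per-sentence sentiment scores in [-1.0, 1.0].
    Taking the absolute value discards polarity and keeps only strength.
    """
    if not sentence_scores:
        return False
    mean_magnitude = sum(abs(s) for s in sentence_scores) / len(sentence_scores)
    return mean_magnitude >= threshold
```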
Combining these three metrics, we would highlight the salient sentences our algorithm suggested could be false and provide links to related articles. This follows our original idea of prompting readers to ask questions rather than painting everything black and white.
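The combination logic can be sketched as a single "at risk" flag. The threshold value and the OR-combination here are illustrative choices, not the ones we tuned during the weekend:

```python
def at_risk(cluster_similarity, biased, sim_threshold=0.4):
    """Flag an article as 'at risk' of being fake.

    An article is flagged if it diverges semantically from its cluster of
    related articles, or if strong opinion is detected; either signal alone
    triggers the warning, matching the behaviour described above.
    """
    return cluster_similarity < sim_threshold or biased
```

The flag gates the UI: flagged articles get their salient sentences highlighted and related searches attached, unflagged ones are left untouched.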
Given that the compute power required for this sort of analysis was beyond the scope of "normal" server instances, we had to use a distributed system to host our backend. In other words, one server was dedicated to the NLP analysis while the other serviced the extension's requests.
Challenges we ran into
The high latency of performing complex analysis on the text meant there was a long delay between the initial loading of a webpage and the results being returned to it.
The distributed system and the use of many different technologies created a heterogeneous system that was challenging to coordinate smoothly.
The heterogeneity of the internet made it difficult to standardize the web parsing, which increased the amount of noise and thus the difficulty of doing more advanced analysis.
Accomplishments that we're proud of
We are proud of the underlying premise: creating an app that makes the reader question what they read rather than imposing a decision on them.
We integrated current and relevant ML and NLP techniques to perform inference on a difficult task.
We managed to coordinate an extremely heterogeneous mix of technologies.
What we learned
Combining lots of technologies is not an easy task. It required many less-than-optimal workarounds that, while functional, are neither robust nor performant. Heterogeneity is hard to deal with in tech!
What's next for Fake News Detector
Optimizing the speed of inference, and increasing the sophistication of the classification with more advanced statistical models such as LSTMs, which require significant time, effort, and data to train.