Fake news is a problem that faces everyone: how do we tell what is real from what is not? Tim Cook, the CEO of Apple, recently said, “we have to give the consumer tools to help with this and we’ve got to filter out part of it before it ever gets there without losing the openness of the internet. It is not something that has a simple solution.” Normally, confirming an article’s validity means reading through its citations and tracing them back to reliable sources. That is a time-consuming, tedious process that almost no ordinary news consumer performs, so many people fall prey to heavily biased or flatly untrue news, leading to misinformed decisions. Tools like BS Detector have made it easier to flag sources of untrue news, but they typically rely on a curated database of biased or untrusted sites, which is less versatile and lets novel sources of fake news slip through the cracks. Other tools trace an article's sources back to their own sources and evaluate the validity of those, but this still requires manual upkeep. We wanted something different.
What it does and how we built it
As two programmers who have built a whole venture on natural language processing, we knew the power of NLP. We had seen it identify tweet authors from just 10 past tweets per author. We had seen it predict protein-protein interactions. We wanted to apply that power to this problem. We hypothesized that there is a fundamental difference in how real and fake news are written, specifically in their headlines: fake news headlines tend to grab attention more aggressively to drive clicks. We believed the article text itself would contain more noise, so we targeted headlines, where the perceived differences are strongest.

We used IBM Bluemix and IBM Watson to build the classifier, and we retrieved fake and real news from the popular crowdsourced data platform Kaggle. The fake news dataset contained over 12,000 fake headlines as flagged by BS Detector, and the real news dataset had over 420,000 headlines. Due to the limitations of Bluemix's Natural Language Classifier, we trained the model on only around 6,000 headlines from each set, keeping the training data balanced while leaving a large enough testing set. We then tested on roughly 13,000 real and fake headlines (around 6,000 fake and 7,000 real) and achieved a surprisingly high accuracy of 92.4%. Each prediction also came with a corresponding confidence level. We were so surprised by this that we went back to check for duplicates between the training and testing sets, and found none. We implemented this in a Python GUI application that takes a headline as input and outputs whether the algorithm believes it is real or fake.
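The data preparation and evaluation described above can be sketched roughly like this (a minimal offline sketch; the function names and the equal-per-class sampling are illustrative of the approach, not our exact code, and the Watson NLC training call itself is omitted):

```python
import random

def build_balanced_training_set(real_headlines, fake_headlines, per_class=6000, seed=42):
    """Sample an equal number of headlines from each class and shuffle them together,
    producing (headline, label) pairs suitable for a classifier training upload."""
    rng = random.Random(seed)
    sample = ([(h, "real") for h in rng.sample(real_headlines, per_class)] +
              [(h, "fake") for h in rng.sample(fake_headlines, per_class)])
    rng.shuffle(sample)
    return sample

def accuracy(predictions, labels):
    """Fraction of predicted labels that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)
```

Balancing matters here because the raw datasets are lopsided (12,000 fake vs. 420,000 real headlines); training on the raw mix would bias the classifier toward predicting "real".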
Challenges we ran into
The classifier took a long time to train, and Internet connectivity issues would often crash our Python programs, since they relied on the IBM Bluemix API to reach the Natural Language Classifier service we had created.
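One simple way to harden scripts like ours against flaky connectivity is to wrap the network call in a retry loop instead of letting a single failed request kill the program. A sketch, where `classify` stands in for whatever function actually hits the Bluemix API (a hypothetical placeholder, not the SDK's real signature):

```python
import time

def classify_with_retry(classify, headline, attempts=3, delay=2.0):
    """Call a flaky network function, retrying on connection errors
    with a short pause between attempts."""
    for attempt in range(attempts):
        try:
            return classify(headline)
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(delay)  # back off briefly before trying again
```

In practice the delay could grow exponentially between attempts, but even a fixed pause would have avoided most of the crashes we saw.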
Accomplishments that we're proud of
We achieved a high accuracy of 92.4%.
What we learned
Machine learning is surprisingly powerful.
What's next for Faux News
We want to implement this as a full web app that users can access quickly and easily to enter the headlines of articles they come across. It will output whether the article contains real or fake news based on our classifier, providing simple and fast information for our users. This has the potential to transform consumers' mindsets and empower them.
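A minimal sketch of what such a service could look like, using only the Python standard library (the endpoint shape and the `classify_headline` stub, with its hard-coded result, are illustrative assumptions; in a real deployment the stub would call our trained classifier):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify_headline(text):
    """Placeholder for the trained classifier call (illustrative only)."""
    return {"label": "real", "confidence": 0.92}

class HeadlineHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw headline from the request body and classify it.
        length = int(self.headers.get("Content-Length", 0))
        headline = self.rfile.read(length).decode("utf-8")
        body = json.dumps(classify_headline(headline)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo server quiet

def serve(port=0):
    """Start the server on a background thread; returns (server, bound_port)."""
    server = HTTPServer(("127.0.0.1", port), HeadlineHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]
```

A user (or front end) would POST a headline and get back a JSON label and confidence, mirroring what the GUI application already shows.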