We found a dataset from the news site FiveThirtyEight (538) containing tweets posted by troll farms before the election, and decided to analyze them and compare them to other bodies of text. Initially, we compared them against well-known bodies of text, such as the Brown Corpus, to see how frequently words appeared in each, and compared those frequencies directly with text entered by a user on the command line. This was done using Python, Flask, and SciPy.
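A minimal sketch of this kind of word-frequency comparison is below. The write-up does not say which metric we computed, so cosine similarity over relative word frequencies is an assumption here, and the function names (`word_frequencies`, `cosine_similarity`, `closest_corpus`) are hypothetical. It uses only the standard library to stay self-contained, whereas the original used SciPy.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return the relative frequency of each lowercase word in text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two word-frequency dictionaries."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_corpus(user_text, corpora):
    """Score user_text against each named corpus and return the best match.

    corpora maps a label (assumed, e.g. "troll" or "brown") to raw text.
    """
    user_freq = word_frequencies(user_text)
    scores = {
        name: cosine_similarity(user_freq, word_frequencies(text))
        for name, text in corpora.items()
    }
    return max(scores, key=scores.get), scores
```

In a command-line version, `user_text` would come from `input()` and `corpora` would hold the troll tweets alongside each reference corpus.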

While this seemed to work reasonably well, we then decided to train a model on IBM Watson using some of the troll-farm tweets along with some random tweets, so that we could tell whether someone's tweet was more closely aligned with those from a troll farm or those of a normal person. We cleaned the data with Unix tools like sed, as well as Excel, and then loaded it into IBM Watson.

This model is exposed on a site built with React and Next.js, with some simple CSS. The site also cleans the similar tweets it returns, because some of them contained links that could not be vetted for safety.
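The cleaning step could look roughly like the sketch below. It assumes that "cleaning" means removing the links outright, which the write-up does not spell out, and it is written in Python for illustration even though the site itself is built with React and Next.js. The pattern and the `strip_links` name are hypothetical.

```python
import re

# Assumed pattern: anything starting with http://, https://, or www.
# up to the next whitespace character is treated as a link.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_links(tweet):
    """Remove links from a tweet and collapse the leftover whitespace."""
    without_links = URL_PATTERN.sub("", tweet)
    return " ".join(without_links.split())
```

Removing links entirely is the conservative choice here: the tweet text is still readable, and nothing unvetted is ever rendered as a clickable link.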
