Bot or Not?

Bot or Not: Final Evaluation

Malique Bodie Github Link

Introduction

Twitter's impact on society and modern culture often goes unremarked, but the extent of our reliance on it is striking. Twitter has served both as a uniting force, as during the Arab Spring, and as a divisive one, as during the presidency of Donald Trump. Given its importance, it is no surprise that malicious actors have targeted Twitter and its users with bots: accounts operated autonomously by software through the Twitter APIs. The need to detect these bots and distinguish them from genuine users is clear, and many detection algorithms have been proposed. This project implements two of these proposed methods:

- BotOrNot: A System to Evaluate Social Bots
- Twitter Bot Detection Using BiDirectional Long Short-Term Memory Neural Networks and Word Embeddings

Methodology

Data

Many datasets describe different kinds of Twitter bots, but the most comprehensive is the Cresci-17 dataset, which has been shown to produce the best results among top-performing Twitter bot detection models. The dataset covers several bot types, including social spambots 1, 2, and 3, as well as genuine users. The data is divided into user account data and tweets. The account data includes the number of followers and accounts followed, number of likes, location, profile picture, header, number of tweets, account age, and so on. The tweet data is a dataframe of tweets along with the user ID, tweet ID, retweets, and likes. I also used pre-trained GloVe word embeddings for my NLP model.

Pre-Processing

Preprocessing for the BotOrNot model was straightforward. The model was trained on user features pulled almost directly from the data CSV, with a few binary columns derived from non-binary data, such as whether the user has a profile picture (regardless of what it is). Preprocessing for the natural language processing model was more involved. The tweets had to be tokenized and padded or truncated to a uniform length of 20 tokens. The word embeddings were loaded from pre-trained GloVe (Global Vectors for Word Representation) embeddings, which are available in several embedding sizes. The paper specified an embedding size of 200, but to reduce computational cost I decided to use embeddings of length 25. Tokens that were not present in this vocabulary were , including hashtags and URLs.
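
The padding and index lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the function name, the toy vocabulary, and the reserved padding and unknown-token indices are all assumptions, and mapping out-of-vocabulary tokens to a single reserved index is only one possible way to handle them.

```python
# Hypothetical names throughout; a sketch, not the project's actual code.
MAX_LEN = 20   # uniform tweet length stated in the text
PAD_ID = 0     # assumption: index reserved for padding
UNK_ID = 1     # assumption: index reserved for out-of-vocabulary tokens


def encode_tweet(tokens, vocab, max_len=MAX_LEN):
    """Map tokens to vocabulary indices, then truncate or right-pad to max_len."""
    ids = [vocab.get(tok.lower(), UNK_ID) for tok in tokens[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Toy vocabulary standing in for the GloVe word list.
vocab = {"follow": 2, "me": 3, "for": 4, "free": 5, "stuff": 6}

# A short tweet is padded; "#giveaway" is not in the vocabulary.
short = encode_tweet("Follow me for free stuff #giveaway".split(), vocab)
# A long tweet is cut off after 20 tokens.
long_tweet = encode_tweet(["free"] * 30, vocab)
```

Both `short` and `long_tweet` come out with exactly 20 entries, so a batch of tweets can be stacked into a rectangular array before it reaches the embedding layer.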

Architecture

Three model architectures were implemented. The first was a simple feed-forward neural network trained on user features, consisting of two dense layers with ReLU activations followed by a final dense layer with a softmax output. The second was also trained on user features, but used a Random Forest classifier from scikit-learn instead of a feed-forward network. The last was trained on user tweets and consisted of an embedding layer and three stacked bidirectional LSTMs, followed by one dense layer with a sigmoid activation.
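
To make the shape of the first architecture concrete, here is a dependency-free sketch of its forward pass: two dense layers with ReLU, then a dense layer with softmax. It is written in plain Python rather than a deep learning framework, the weights are random and untrained, and the layer sizes and feature count are assumptions chosen only for illustration.

```python
import math
import random


def dense(x, weights, biases):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for shape checking only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Two hidden dense+ReLU layers, then a dense layer with softmax."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = relu(dense(x, w1, b1))
    h = relu(dense(h, w2, b2))
    return softmax(dense(h, w3, b3))


rng = random.Random(0)
N_FEATURES, HIDDEN, N_CLASSES = 6, 8, 2  # assumed sizes, not from the text
layers = [init_layer(N_FEATURES, HIDDEN, rng),
          init_layer(HIDDEN, HIDDEN, rng),
          init_layer(HIDDEN, N_CLASSES, rng)]

# One made-up user feature vector; the output is [P(class 0), P(class 1)].
probs = forward([0.3, 1.0, 0.0, 0.5, 0.2, 1.0], layers)
```

The softmax output has one probability per class and sums to 1, which is what lets the final layer be read as a bot versus genuine-user prediction once the weights are trained.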

Results

All three models ultimately did well at classifying Twitter bots. The dense network reached 93% validation accuracy, and the Random Forest classifier outperformed it with 98% validation accuracy. The bidirectional network took so long to train that I was only able to finish a single epoch with one bidirectional LSTM and embeddings of size 200, whereas the paper uses a stack of three bidirectional LSTMs trained for 30 epochs.

Challenges

I faced many challenges with this project, primarily in preprocessing for the NLP model. It was difficult to tokenize the input and translate the tokens into indices for the pre-trained GloVe embeddings. I believe the NLP model's accuracy would have improved had I spent more time on preprocessing and created unique tokens for emojis, hashtags, and retweets. I used the TweetTokenizer from NLTK, but it seemed only to split on whitespace and separate out emojis. I think a pre-trained Twitter word embedding vocabulary could have improved this model even further. I also ran into compatibility problems between PyTorch and pandas in the preprocessing stage: after a lot of data type issues, I eventually decided to use TensorFlow for the NLP model. The NLP model was also very difficult to debug, because loading the data and GloVe embeddings took so long that I had to wait over 30 minutes to find bugs in the code (of which there were many).
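
For reference, GloVe's text files store one word per line followed by its vector components, separated by spaces. The sketch below shows one way such a file could be turned into a token-to-index dictionary and an embedding matrix. The function name, the reserved `<pad>` and `<unk>` rows, and the in-memory sample data are assumptions for illustration; with a real file, the open file handle would be passed in place of `sample`.

```python
import io


def load_glove(lines, dim):
    """Parse GloVe-style text lines ("word v1 ... v_dim") into a vocab and matrix.

    Rows 0 and 1 are reserved for padding and unknown tokens (an assumption).
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    matrix = [[0.0] * dim, [0.0] * dim]
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not have exactly `dim` values
        word, values = parts[0], parts[1:]
        vocab[word] = len(matrix)  # the row index doubles as the token index
        matrix.append([float(v) for v in values])
    return vocab, matrix


# Made-up 4-dimensional vectors standing in for a real GloVe file;
# the last row is deliberately malformed.
sample = io.StringIO(
    "bot 0.1 0.2 0.3 0.4\n"
    "follow -0.5 0.0 0.25 1.0\n"
    "broken 0.1 0.2\n"
)
vocab, matrix = load_glove(sample, dim=4)
```

Because the row index doubles as the token index, the same `vocab` can be used both to encode tweets and to initialize an embedding layer from `matrix`.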

Reflection

Ultimately, I believe the most efficient bot detection approach is the simpler models trained on user data. These models took less time to train, required less preprocessing, and are much more intuitive than the natural language processing model. They can also easily be improved by using data mining techniques to identify more user features and training on those as well. My NLP model could be improved with better tokenization and possibly by training an embedding matrix tailored specifically to Twitter. The paper discusses a tokenizer that identified URLs, hashtags, and special emoji characters and assigned unique tokens to them, but I was unfortunately unable to find it. Given more time, I would also like to build the NLP model the way the paper did, using three LSTMs instead of one. Because of the computational power required, I was only able to train one bidirectional LSTM and one dense layer, which still took an absurdly long time.
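
As a rough idea of what tokenization with special tokens could look like, the sketch below replaces URLs, user mentions, and hashtags with placeholder tokens before splitting on whitespace. It is not the tokenizer from the paper: the regular expressions and the placeholder names `<url>`, `<user>`, and `<hashtag>` are assumptions, and emoji handling is left out entirely.

```python
import re

# Deliberately simple patterns; real tweets would need more careful rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def normalize_tweet(text):
    """Replace URLs, mentions and hashtags with placeholder tokens, then split."""
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)
    # Keep the hashtag's word so it can still be looked up in the vocabulary.
    text = HASHTAG_RE.sub(r"<hashtag> \1", text)
    return text.lower().split()


tokens = normalize_tweet("RT @promo_bot Win FREE stuff https://t.co/abc #giveaway")
# tokens == ['rt', '<user>', 'win', 'free', 'stuff', '<url>', '<hashtag>', 'giveaway']
```

Collapsing every URL and mention into one shared token keeps the vocabulary small, at the cost of discarding which specific link or account was referenced.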