Villain be gone

Inspiration

We took on FINRA's YHack 2017 challenge: predicting and connecting possible future 'villains' in the modern world.

What it does

By analyzing the FINRA dataset, we attempt to draw conclusions about the connections suspicious entities may have. Going further, we show that it may also be possible to predict an individual's suspicious activity based on their relationships and history.

How we built it

Step 1: Data Analysis

Going through the initial dataset isn't easy, but understanding the data we are working with is a crucial first step. We went through the data manually and looked for correlations within it, so as not to waste time on useless information. This helped us understand the project and get comfortable with it.

- What data is relevant?
- Which pairs of fields seem to be correlated?
- How significant are these correlations?

These questions and more were investigated and answered during Step 1.
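
As a rough illustration of the correlation questions above, here is a minimal sketch that computes a Pearson correlation coefficient between two columns. The column names (`num_disclosures`, `was_barred`) and their values are hypothetical toy data, not fields or figures taken from the FINRA dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)

# Hypothetical toy columns, one value per broker (assumed for illustration only)
num_disclosures = [0, 1, 0, 3, 5, 2]
was_barred = [0, 0, 0, 1, 1, 0]

r = pearson(num_disclosures, was_barred)
print(f"correlation between disclosures and being barred: {r:.2f}")
```

A value close to 1 or -1 suggests the pair is worth a closer look, while a value near 0 suggests the pair can probably be set aside. With samples this small, the result says little about significance.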

Step 2: Inferring Additional Information

Data often hides valuable information that is not accessible through traditional means. We investigated this kind of hidden information further, most notably through entity resolution. Because the data comes from a wide variety of sources and formats, some records refer to the same entity and can be linked and combined, giving us more information about important entities. We used traditional NLP techniques such as tf-idf term weighting, and also explored more effective approaches to entity resolution using part-of-speech tagging and online sources of data.
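
To make the tf-idf idea concrete, here is a minimal sketch of matching entity names by comparing tf-idf vectors with cosine similarity. Using character trigrams as the terms, the smoothed idf formula, and the example names are all assumptions made for illustration; they are not necessarily what we used during the hackathon.

```python
from collections import Counter
from math import log, sqrt

def char_ngrams(name, n=3):
    """Lower-case a name, collapse whitespace, and split it into character n-grams."""
    text = " " + " ".join(name.lower().split()) + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(names, n=3):
    """Return one sparse tf-idf vector (dict: n-gram -> weight) per name."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    # Document frequency: in how many names does each n-gram appear?
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed idf, so n-grams present in every name keep a small weight
        vectors.append({gram: tf * (log((1 + total) / (1 + doc_freq[gram])) + 1)
                        for gram, tf in doc.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical entity names (not taken from the dataset)
names = ["Acme Capital LLC", "ACME Capital, L.L.C.", "Birchwood Securities Inc"]
vecs = tfidf_vectors(names)

same = cosine(vecs[0], vecs[1])       # two spellings of one firm
different = cosine(vecs[0], vecs[2])  # two unrelated firms
print(f"same entity: {same:.2f}, different entities: {different:.2f}")
```

Pairs whose similarity exceeds some chosen threshold become candidates for merging. The threshold itself would have to be tuned on real records.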

Step 3: Establishing Connections, and Testing the Limits

Now that the data has been properly treated and understood, we can already draw some conclusions. Are guilty companies more closely affiliated with banned brokers? Does employment history affect your credibility? Simple questions like these can already be answered to a certain extent with proper data treatment.
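
A question such as "are guilty companies more closely affiliated with banned brokers?" can be sketched as a simple grouping exercise. The records, broker IDs, and firm names below are hypothetical, and the share of banned brokers per firm is only one possible way to measure affiliation.

```python
from collections import defaultdict

# Hypothetical (broker, firm) employment records, assumed for illustration
employment = [
    ("b1", "FirmA"), ("b2", "FirmA"), ("b3", "FirmA"),
    ("b3", "FirmB"), ("b4", "FirmB"),
    ("b5", "FirmC"), ("b6", "FirmC"), ("b1", "FirmC"),
]
banned_brokers = {"b1", "b3"}  # hypothetical list of banned brokers
flagged_firms = {"FirmA"}      # hypothetical list of "guilty" firms

def banned_share_by_firm(records, banned):
    """For each firm, the fraction of its past or present brokers that are banned."""
    brokers_at = defaultdict(set)
    for broker, firm in records:
        brokers_at[firm].add(broker)
    return {firm: len(brokers & banned) / len(brokers)
            for firm, brokers in brokers_at.items()}

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

share = banned_share_by_firm(employment, banned_brokers)
flagged_mean = mean(s for firm, s in share.items() if firm in flagged_firms)
other_mean = mean(s for firm, s in share.items() if firm not in flagged_firms)
print(f"flagged firms: {flagged_mean:.2f}, other firms: {other_mean:.2f}")
```

If the flagged group shows a consistently higher share on real data, that supports the affiliation hypothesis, although it does not by itself show causation.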

What more can we infer that a human might not be able to, though? This is where machine learning plays a role. Now that we've seen that there are correlations within our dataset, what can a machine learn from them, and can it be used to predict "villainous" activity? Using a traditional support vector machine (SVM), we attempted to learn what we could in the limited time allotted.
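
For readers unfamiliar with SVMs, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the regularised hinge loss. It is not our hackathon code: the two features, the toy data, and the hyperparameters (`lam`, `lr`, `epochs`) are assumptions, and in practice a library implementation such as scikit-learn's `SVC` would be the usual choice.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Fit weights w and bias b by sub-gradient descent on the
    L2-regularised hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: hinge term is active
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only the regulariser shrinks w
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Return +1 (suspicious) or -1 (not suspicious) for one feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical features per broker, both scaled to [0, 1]:
# [share of banned colleagues, scaled number of prior disclosures]
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7],
     [0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
train_predictions = [predict(w, b, xi) for xi in X]
print(predict(w, b, [0.85, 0.9]), predict(w, b, [0.1, 0.1]))
```

The toy data is linearly separable by construction, so this only demonstrates the mechanics. Real broker records would need held-out data to judge how well such a model predicts anything.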

What we learned

Connections come in all shapes and sizes, some more obvious than others. With modern tools like machine learning and NLP, new conclusions can be drawn from data that was previously thought to be conclusive.