The FINRA challenge was to build a proof-of-concept platform that reveals new correlations in financial-industry data or new ways to visualize it.
What it does
Given a dataset of financial advisors and related information, we train a logistic regression model to classify whether an advisor is unethical. Using the employment history in the dataset, we build relationships between advisors. Using the ethics score and those relationships, we predict how ethical an advisor will be based on the influences around them.
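To make the "influences around them" idea concrete, here is a minimal sketch of one way a score could spread across the advisor graph. It is not our exact algorithm: the function name `propagate_scores`, the `influence` weight, and the rule of blending each score with the mean of its neighbors are all assumptions chosen for illustration.

```python
from collections import defaultdict


def propagate_scores(scores, edges, influence=0.5, rounds=1):
    """Blend each advisor's score with the mean score of their neighbors.

    scores:    dict of advisor id -> ethics score in [0, 1]
    edges:     iterable of (advisor_a, advisor_b) undirected relationships
    influence: hypothetical weight given to neighbors (0 ignores them)
    rounds:    how many times the blending step is repeated
    """
    # Build an undirected adjacency map from the relationship pairs.
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    current = dict(scores)
    for _ in range(rounds):
        updated = {}
        for advisor, own in current.items():
            peers = [current[p] for p in neighbors[advisor] if p in current]
            if peers:
                peer_mean = sum(peers) / len(peers)
                updated[advisor] = (1 - influence) * own + influence * peer_mean
            else:
                # Advisors with no known relationships keep their own score.
                updated[advisor] = own
        current = updated
    return current
```

Each round reads only the previous round's scores, so the result does not depend on the order in which advisors are visited. A larger `influence` or more `rounds` pulls connected advisors closer together.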
How we built it
Generating Mock Test Data
First, Python was used to process the IAPD XML files and generate test data. We took a guided approach by suggesting weights that might influence how "good" an advisor is. Then, using Python's sklearn, we fit a logistic regression model on a fuzzed version of the data to classify financial advisors; the fuzzing adds the kind of randomness that would appear in a real data set. The results up to this point were stored in a MySQL database for querying relations. From that database, we queried a subset of individuals that would be useful for display purposes.
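The pipeline above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual code: the three features, the hand-picked weights, the noise level, and the label flip rate are all hypothetical stand-ins for the values we derived from the IAPD data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def make_mock_advisors(n=1000, noise=1.0, flip_rate=0.05, seed=0):
    """Generate mock advisors with fuzzed labels (1 = unethical)."""
    rng = np.random.default_rng(seed)
    # Hypothetical features: years of experience, disclosure count, employer count.
    X = np.column_stack([
        rng.uniform(0, 40, n),
        rng.poisson(1.0, n),
        rng.integers(1, 10, n),
    ]).astype(float)
    # Hand-picked "guided" weights; these numbers are assumptions.
    weights = np.array([-0.05, 1.2, 0.15])
    # Fuzzing step 1: Gaussian noise on the score that decides the label.
    logits = X @ weights - 1.0 + rng.normal(0, noise, n)
    y = (logits > 0).astype(int)
    # Fuzzing step 2: flip a small fraction of labels at random.
    flip = rng.random(n) < flip_rate
    y[flip] = 1 - y[flip]
    return X, y


def train_classifier(X, y, seed=0):
    """Fit a logistic regression and report held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

Without the two fuzzing steps the labels are a deterministic function of the features, which is why an unfuzzed model scores unrealistically well. `model.predict_proba(X)[:, 1]` gives the predicted probability of the unethical class, which can serve as a continuous score.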
Manipulating and Displaying Data
Challenges we ran into
We were given an initial 35 GB file of court cases, but after a few hours of trying natural language processing on various parts of it, we found most of the data to be irrelevant.
After generating test data and training our model, we found that the accuracy was unrealistically high, so we had to implement fuzzing to ensure a robust model.
It was difficult at first to determine what defines a meaningful interaction between two advisers.
Visualizing data with D3 was not as simple as hiding or showing a data structure.
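On the question of what counts as a meaningful interaction between two advisers, one possible definition is time spent at the same firm during overlapping years. The sketch below assumes that definition; the function name `overlapping_relations`, the tuple layout of the employment history, and the `min_overlap_years` threshold are hypothetical.

```python
from itertools import combinations


def overlapping_relations(history, min_overlap_years=1):
    """Relate advisers who worked at the same firm at the same time.

    history: list of (adviser_id, firm, start_year, end_year) records
    Returns a dict mapping a sorted (adviser_a, adviser_b) pair to the
    total number of overlapping years across all shared firms.
    """
    # Group employment stints by firm.
    by_firm = {}
    for adviser, firm, start, end in history:
        by_firm.setdefault(firm, []).append((adviser, start, end))

    relations = {}
    for stints in by_firm.values():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(stints, 2):
            if a == b:
                continue  # same adviser returning to the same firm
            overlap = min(a_end, b_end) - max(a_start, b_start)
            if overlap >= min_overlap_years:
                pair = tuple(sorted((a, b)))
                relations[pair] = relations.get(pair, 0) + overlap
    return relations
```

The overlap total can double as an edge weight, so that advisers who worked together longer are treated as more strongly connected.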
Accomplishments that we're proud of
Looking at the whole project, we have created test data to train a model, we have created a robust model to give a score for ethics, and we have implemented an algorithm to predict a financial advisor's future behavior. We are all extremely happy to see a diverse range of algorithms being implemented successfully together.
What Our Team Learned
How to use sklearn to make a classifier
How to use D3 to create a great user interface
How to optimize SQL queries for large data sets
What's next for Financial Ethics Propagator
The driving force behind our Financial Ethics Propagator was that we could give a great user interface for non-obvious correlations between data. Currently we parse a specific IAPD file to make predictions specific to the financial industry. Looking forward, it would be trivial to use our model to predict things outside the financial industry given the relevant data, so we are looking to retool the input portion of our code.
Looking forward in the financial market, we feel that there are still correlations that could be easier to see. For example, we think it would be useful to show unusual stock buying history alongside our node graph. This would allow investigators to easily navigate gigabytes of data to determine whether someone has made a fraudulent purchase.