In one calendar year, approximately 1 in 6 children in the United States are sexually victimized. Unfortunately, the technology enabling instant messaging and social media has been identified as a major source to which these grave events trace back. Knowing this, we felt that supporting efforts to stop sexual predators was a must, and that machine learning and blockchain technologies, combined with an easy-to-use interface, could help.
What it does
The app takes a sentence, a phrase, or a few words from a conversation as input and uses text analysis and machine learning to determine whether the dialogue may be considered harassment. If so, the input transcript is stored on a blockchain, from which a report can be generated for authorities to review and sign in order to verify the harassment claim. The signed record can then serve as proof for any subsequent claim of abuse or harassment.
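The flag-then-store flow described above can be sketched as follows. This is a minimal illustration, not our actual implementation: it uses a plain in-memory hash chain instead of the ARK blockchain, and the names `TranscriptLedger`, `record_if_flagged` and the `classify` callback are hypothetical.

```python
import hashlib
import json
import time


def block_hash(block):
    """Hash every field of a block except the stored hash itself."""
    payload = {key: value for key, value in block.items() if key != "hash"}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class TranscriptLedger:
    """Append-only list of blocks, each linked to the previous block's hash.

    Assumption: a stand-in for a real blockchain, for illustration only.
    """

    GENESIS_HASH = "0" * 64

    def __init__(self):
        self.blocks = []

    def append(self, author, transcript, timestamp=None):
        previous_hash = self.blocks[-1]["hash"] if self.blocks else self.GENESIS_HASH
        block = {
            "index": len(self.blocks),
            "timestamp": time.time() if timestamp is None else timestamp,
            "author": author,
            "transcript": transcript,
            "previous_hash": previous_hash,
        }
        block["hash"] = block_hash(block)
        self.blocks.append(block)
        return block

    def is_valid(self):
        """Return False if any block was edited or the links are broken."""
        expected_previous = self.GENESIS_HASH
        for block in self.blocks:
            if block["previous_hash"] != expected_previous:
                return False
            if block["hash"] != block_hash(block):
                return False
            expected_previous = block["hash"]
        return True


def record_if_flagged(ledger, author, transcript, classify):
    """Store the transcript only when the (hypothetical) classifier flags it."""
    if classify(transcript):
        return ledger.append(author, transcript)
    return None
```

Because each block's hash covers the previous block's hash, editing a stored transcript afterwards makes `is_valid()` fail, which is the property that lets a stored record be checked later.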
How we built it
1: Scraping the web for dialogue and conversation data
2: Extracting raw chat logs, using STDLib, from the Perverted Justice archives (from the NBC series To Catch a Predator), which resulted in actual arrests and 600+ convictions
3: Curating the scraped and extracted data into a labelled dataset
4: Building a neural network (3 layers, 40 neurons)
5: Using the NLTK toolkit to extract keywords, stems and roots from the corpus
6: Sanitizing the input data
7: Training the neural network
8: Evaluating the neural network and retraining with modified hyperparameters
9: Curating and uploading the dataset to Google containers
10: Setting up an AutoML instance on Google Cloud
11: Training a batch of input corpora with AutoML
12: Evaluating the model, updating the overall corpus and retraining the AutoML model
13: Creating a blockchain to store immutable and verified copies of the transcript along with its author
14: Wrapping the machine learning classifiers with a Flask server
15: Attaching the endpoints of the blockchain service as pipelines from the classifiers
16: Setting up the frontend for communication and interfacing
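As a rough illustration of steps 5 and 6, the sketch below turns raw text into a binary bag-of-stems vector that a small neural network could take as input. It is an assumption-laden stand-in: a crude suffix stripper replaces NLTK's stemmers so that the example runs with only the standard library, and all function names here are hypothetical.

```python
import re

# Illustrative suffix list only; a real stemmer (for example from NLTK)
# handles far more cases and exceptions.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (basic sanitizing)."""
    return re.findall(r"[a-z']+", text.lower())


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_vocabulary(documents):
    """Collect the sorted set of stems seen across the training documents."""
    stems = {crude_stem(token) for doc in documents for token in tokenize(doc)}
    return sorted(stems)


def vectorize(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary stems occur in the text."""
    present = {crude_stem(token) for token in tokenize(text)}
    return [1 if stem in present else 0 for stem in vocabulary]
```

The vocabulary is built once from the labelled dataset, and every incoming phrase is vectorized against it before being passed to the classifier. Stems that never appeared in training are simply ignored.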
Challenges we ran into
Extracting and curating raw conversation data is slow, tedious and cumbersome. Doing it well requires a ton of patience.
The ARK blockchain does not have smart contracts fully implemented yet. We used some shortcuts and hacky tricks, but ideally the harassment reports would be generated using a Solidity-like contract on the blockchain.
Google's AutoML, although promising, takes a very long time to train a model (~7 hours for one model).
There is a serious paucity of publicly available social media dialogue corpora, especially for one-to-one conversations. Those that are publicly available often have many labeling, annotation and other errors that are challenging to sanitize.
Google Cloud SDK libraries, especially for newer products like AutoML, often conflict with earlier versions of the Google Cloud SDK (at least from what we saw using the Python SDK).
Accomplishments that we're proud of
Cross-validation gave our model a very high score on the test set. However, much more data from a generic (non-abuse, non-harassment) conversation corpus is needed, as the model seems "eagerly" biased towards the harassment label. TL;DR: the model works for almost all phrases we considered to be "harassment".
The scraper and curation code for the Perverted Justice transcripts are now publicly available functions on STDLib. These can be used for future research and development work.
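One way to make the "eager" bias visible is to report the false positive rate alongside the overall score. The sketch below is a minimal example under assumptions: the label strings "harassment" and "benign" and the function names are hypothetical, and the inputs are plain lists of true and predicted labels.

```python
def confusion_counts(y_true, y_pred, positive="harassment"):
    """Count true/false positives and negatives for the positive label."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def bias_report(y_true, y_pred, positive="harassment"):
    """Return accuracy, precision, recall and false positive rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

A model that catches almost every harassing phrase (high recall) but also flags many ordinary messages will show a high false positive rate, which is the pattern that adding more generic conversation data is meant to reduce.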
What we learned
Scraping, extracting and curating data actually consume most of the time in a machine learning project.
What's next for To Blockchain a Predator
Integration with current chat interfaces such as Facebook Messenger, WhatsApp and Instagram. An immutable record of potentially harassing messages, especially those sent to children on these platforms, is a very useful tool to have given the increasing prevalence of sexual predators using social media to interact with potential victims.