We are #teamthorn, a group of data scientists, engineers and devs devoted to helping the non-profit organization, Thorn. Thorn’s mission, to be digital defenders of children, resonates deep within us. We have been proud to work together to help geolocate and discover previously unnoticed prostitution rings within the United States.
Using soft text matching, we cluster adult posts by their peculiar semantics. By collecting sets of geographic locations, timestamps and even phone numbers, we identify “adult cells”: one or more individuals working together to proliferate prostitution.
You can see some of the final results here:
- Real-time API for identifying escort cells: http://22.214.171.124:8089/api/recent
- Visualization of historical escort cells: https://thorn.ngrok.com/#/
We truly hope to continue work on this endeavor and aid law enforcement in shutting down these horrific groups.
We took a two-fold approach:
Initially, we built quick visualizations of location and prices of Adult posts. Going deeper, we used a heuristic matching strategy to identify Adult cells.
Geolocation and price extraction
To geolocate adult posts, we first extracted city names from the given TSV. As the “location” column was structured enough for exact state / country resolution, we focused on backing out state names from cities. We built an Elasticsearch gazetteer (a geographic index) from Foursquare’s Quattroshapes and customized the scoring function to boost on both city name matches and population size.
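A scoring function of this kind could be sketched roughly as follows. This is an illustration in Python, not the query we ran: the field names `name` and `population`, the `log1p` modifier, and the helper `build_city_query` are all assumptions. It builds a standard Elasticsearch `function_score` search body that multiplies the text-match score by a dampened population value, so that a large city outranks a small town with the same name.

```python
def build_city_query(city_name, size=5):
    """Build an Elasticsearch search body that ranks gazetteer entries
    by name match, boosted by population.

    Assumes (hypothetically) that each gazetteer document has a text
    field "name" and a numeric field "population".
    """
    return {
        "size": size,
        "query": {
            "function_score": {
                # Base relevance: how well the city name matches.
                "query": {"match": {"name": city_name}},
                # Boost: dampened population, so large cities win ties
                # without completely swamping the text score.
                "field_value_factor": {
                    "field": "population",
                    "modifier": "log1p",
                    "missing": 1,
                },
                # Final score = text score * population factor.
                "boost_mode": "multiply",
            }
        },
    }


body = build_city_query("Springfield")
```

The resulting dictionary can be sent as the body of a search request against the gazetteer index; the state of the top hit is then taken as the resolved state.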
To extract prices from unstructured description fields, we tested two methods. The first was a heuristic set of regular expressions built from observation. The second used Stanford’s 7-class NER model to extract MONEY tags from descriptions. Because adult posts are brief and often unclear, the second method had lower recall.
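The regular-expression approach can be illustrated with a minimal sketch. The two patterns below are hypothetical examples written for this write-up, not the expressions we actually used, and the 2 to 4 digit limit is an assumption meant to keep phone numbers from matching.

```python
import re

# Hypothetical patterns for illustration only.
PRICE_PATTERNS = [
    # A dollar sign followed by 2-4 digits: "$200", "$ 150"
    re.compile(r"\$\s*(\d{2,4})"),
    # 2-4 digits followed by a currency word: "200 dollars", "150usd"
    re.compile(r"\b(\d{2,4})\s*(?:dollars|usd)\b", re.IGNORECASE),
]


def extract_prices(description):
    """Return the sorted, de-duplicated prices found in a description."""
    prices = set()
    for pattern in PRICE_PATTERNS:
        for match in pattern.finditer(description):
            prices.add(int(match.group(1)))
    return sorted(prices)


print(extract_prices("Special $150 today, 200 dollars per hour"))
```

Patterns like these are easy to extend as new phrasings are observed, which is the main reason a heuristic set can reach higher recall than a general-purpose tagger on this kind of text.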
Adult Cell identification
Our strategy was to pick a phone number at random, then use long, unique, exact substring matching to discover other phone numbers operated by the same people. This lets us iteratively explore the listings and phone numbers, generating an increasingly large set of networks over time.
The computation was quite intensive, so we ran it on a cluster of 10 n1-standard-1 VMs on Google Compute Engine.
The algorithm begins by getting a list of all phone numbers, here: https://github.com/reinpk/thorn/blob/master/scripts/network-expand.js#L35 https://github.com/reinpk/thorn/blob/master/src/parse/phones.js
A single phone number is expanded to a set of long (15 words), unique substrings extracted from all the listings associated with that telephone number: https://github.com/reinpk/thorn/blob/master/scripts/network-expand.js#L42 https://github.com/reinpk/thorn/blob/master/src/search/phrases.js
The phone number’s 15-word phrases are tested for exact matches against other listings: https://github.com/reinpk/thorn/blob/master/scripts/network-expand.js#L45 https://github.com/reinpk/thorn/blob/master/src/search/phones.js When a 15-word exact match is found, the two phone numbers become part of the same network.
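The expansion steps above can be sketched in a few lines. This is a simplified Python illustration, not a port of the linked node.js scripts: the function names, the whitespace tokenization, and the in-memory index are assumptions, and "unique" is interpreted here only as de-duplicating phrases.

```python
from collections import defaultdict, deque

PHRASE_WORDS = 15  # phrase length described above


def phrases(text, n=PHRASE_WORDS):
    """Return the set of every contiguous n-word phrase in a listing."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def expand_network(seed_phone, listings, n=PHRASE_WORDS):
    """Find all phones linked to seed_phone by shared n-word phrases.

    listings is a list of (phone, text) pairs. Two phones are linked
    when any of their listings share an exact n-word phrase; links are
    followed transitively, breadth-first.
    """
    phrase_to_phones = defaultdict(set)
    phone_to_phrases = defaultdict(set)
    for phone, text in listings:
        for phrase in phrases(text, n):
            phrase_to_phones[phrase].add(phone)
            phone_to_phrases[phone].add(phrase)

    network = {seed_phone}
    queue = deque([seed_phone])
    while queue:
        phone = queue.popleft()
        for phrase in phone_to_phrases[phone]:
            for other in phrase_to_phones[phrase]:
                if other not in network:
                    network.add(other)
                    queue.append(other)
    return network
```

Building the phrase index once and walking it breadth-first avoids rescanning every listing for each phone number, which is where most of the cost sits in a naive implementation.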
For final analysis, we flatten the graph of connections and output basic analysis: https://github.com/reinpk/thorn/blob/master/scripts/graph.js https://github.com/reinpk/thorn/blob/master/scripts/analyze.js
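Flattening the graph amounts to finding its connected components. A minimal sketch using union-find is shown below; it is an assumed Python stand-in for what graph.js and analyze.js do, with hypothetical function names and placeholder phone numbers.

```python
def connected_networks(edges):
    """Group phone numbers into networks, given (phone_a, phone_b) links.

    Returns a list of sets, largest network first.
    """
    parent = {}

    def find(x):
        # Register unseen phones as their own root, then walk to the
        # root while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for phone in list(parent):
        groups.setdefault(find(phone), set()).add(phone)
    return sorted(groups.values(), key=len, reverse=True)


def summarize(networks):
    """Return basic statistics: network count, largest size, mean size."""
    sizes = [len(network) for network in networks]
    return {
        "networks": len(sizes),
        "largest": max(sizes, default=0),
        "mean_size": sum(sizes) / len(sizes) if sizes else 0.0,
    }


links = [("555-0100", "555-0101"), ("555-0101", "555-0102"), ("555-0199", "555-0198")]
print(summarize(connected_networks(links)))
```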
- Our historical crawler indexed 38,600 phone numbers from the Thorn dataset. Within those listings we detected 143 separate networks of connected phone numbers.
- Our real-time API endpoint at http://126.96.36.199:8089/api/recent indexes new escort postings within roughly an hour, automatically detecting the relevant phone number network.
One of the networks we detected is very large, with nearly 38,200 phone numbers. The length of the matching strings, overlapping phone numbers, overlapping names, and extensive use of the same Unicode character styles indicate that it is a single network.
Beyond the huge network, we detected 142 smaller networks. For example, (408) 512-6428 was associated with a 19-number network centered here in the Bay Area, and (347) 363-6681 was associated with a 12-number network based in New York City. On average, each smaller network has 4 associated phone numbers.
We believe these computed networks represent real-world social networks, and perhaps trafficking rings, pimps, etc.
The impact of this project extends far beyond this hackathon. Building on our basic adult cell discovery algorithms, Bayes Impact has the opportunity to impact millions of women and children worldwide. Moving forward, we recommend a two-fold approach for improving cell detection.
For efficiency, adult advertisements should be indexed into Elasticsearch for fast, high-precision retrieval. The current implementation in node.js + csv files was good for a quick start, but it is neither scalable nor robust.
For automation, anomaly detection can be built on top of our cell discovery network. We often rediscover the same sets of individuals across the United States, so previously unseen sets of individuals and locations may indicate the growth of a new adult cell. Such anomalies can be surfaced via our geographic visualization.
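As a starting point, the simplest form of this anomaly detection is a novelty check against the networks already discovered. The sketch below is a hypothetical Python illustration of that idea, not something we built: the data layout, the function name `find_new_activity`, and the two anomaly labels are all assumptions.

```python
def find_new_activity(known_networks, observed_posts):
    """Flag posts that fall outside what is already known.

    known_networks: dict mapping a network id to a dict with a set of
        "phones" and a set of "locations" already seen for it.
    observed_posts: list of (phone, location) pairs from new postings.

    Returns (kind, phone, location) tuples where kind is "new_phone"
    (phone in no known network) or "new_location" (known phone posting
    from a location not yet seen for its network).
    """
    phone_to_network = {}
    for network_id, info in known_networks.items():
        for phone in info["phones"]:
            phone_to_network[phone] = network_id

    anomalies = []
    for phone, location in observed_posts:
        network_id = phone_to_network.get(phone)
        if network_id is None:
            anomalies.append(("new_phone", phone, location))
        elif location not in known_networks[network_id]["locations"]:
            anomalies.append(("new_location", phone, location))
    return anomalies
```

A production system would likely score anomalies rather than flag them outright, since a single unseen phone number is weak evidence on its own.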