The Anti-Social Network

Discovering pimps, burner phones and interstate syndicates

Data Prompt: Fight Child Sexual Exploitation

Partner: Thorn – Digital Defenders of Children

Data

7M+ rows of data collected from three escort-service sites, in TSV format

Impact

It is estimated that 100,000 to 300,000 children are at risk of commercial sexual exploitation in the United States and one million children are exploited by the global commercial sex trade each year.

Initial Work

Our initial exploration of the dataset sought to extract pricing information and to determine how individuals adjust their prices over time. To do this we needed to identify individuals within the dataset. We targeted the phone number as a potentially good way to uniquely track each individual over time, but what we found was surprising:

[Figure: burner phones]

This plot shows the number of ads posted by each individual against the duration of their involvement with the ad service. We expected to see a roughly linear trend, but found that most people had posted fewer than 25 ads before apparently either changing phone numbers or discontinuing their use of the site.

We began to suspect that people were using pre-paid cell phones and changing phone numbers after making them public on the ad website a certain number of times. This suggested that they were trying to avoid detection by the authorities. We therefore tried to identify the people using "burner phones", as they seemed the most likely to be trying to hide something.
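The ads-per-phone-number view described above can be sketched roughly as follows. This is a minimal illustration, not our actual analysis code: the (phone, date) input layout, the toy records, and the use of 25 ads as a cut-off are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

def phone_activity(ads):
    """Return {phone: (ad_count, active_days)} for (phone, post_date) pairs."""
    dates_by_phone = defaultdict(list)
    for phone, posted in ads:
        if phone:  # skip ads that list no phone number
            dates_by_phone[phone].append(posted)
    return {
        phone: (len(dates), (max(dates) - min(dates)).days)
        for phone, dates in dates_by_phone.items()
    }

def likely_burners(activity, max_ads=25):
    """Phones with fewer than max_ads ads (the threshold is an assumption)."""
    return sorted(p for p, (count, _days) in activity.items() if count < max_ads)

# Toy records, invented for illustration only
ads = [
    ("555-0100", date(2014, 1, 1)),
    ("555-0100", date(2014, 1, 8)),
    ("555-0199", date(2014, 2, 1)),
    (None, date(2014, 2, 2)),
]
activity = phone_activity(ads)
print(activity)
print(likely_burners(activity))
```

A low ad count on its own is only a weak signal, since it also describes people who simply stopped using the site.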

When we inspected posts from the same phone number, some looked very similar, with only a few words changed. It occurred to us that looking for extremely similar documents would be one way to identify people, as new ads were likely to be copies of old ones.

To do this we restricted the dataset to the three largest metro areas (Atlanta, New York City and Los Angeles) and filtered out any observations with no phone number provided. This yielded a dataset with 266,555 rows. From there we used a TF-IDF vectorizer to grab chunks of sentences to act as each post's "signature". We used a 3-gram representation to do this:

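from sklearn.feature_extraction.text import TfidfVectorizer
# Keep only word 3-grams that occur in at least 10 posts
vectorizer = TfidfVectorizer(min_df=10, ngram_range=(3, 3))
corpus = vectorizer.fit_transform(df.postText.values)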

For instance, a post with the content "We Cater to Upscale Gentleman and We Are very Open Minded" would be broken into "we cater to", "cater to upscale", "to upscale gentleman", etc.
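To make the 3-gram "signature" concrete, here is a small standard-library sketch of the splitting step. It only approximates what the vectorizer does internally: it lowercases the text and keeps tokens of two or more word characters, which matches scikit-learn's default tokenization as far as we know, and it ignores the min_df filter entirely. The function name word_trigrams is ours, chosen for the example.

```python
import re

def word_trigrams(text):
    """Split text into overlapping word 3-grams, lowercased."""
    # Tokens of two or more word characters, lowercased first
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

post = "We Cater to Upscale Gentleman and We Are very Open Minded"
grams = word_trigrams(post)
for gram in grams:
    print(gram)
```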

In order to identify similar posts, we used the cosine distance to measure how close other posts in the corpus were to a given ad. To fit this into the time available, we focused on a few ads whose copy we had noticed was very similar to that of other ads during our initial exploration of the dataset. One of these ads was from someone using the moniker "Stacy Hustles".

After choosing a "seed" ad, we sought all posts that were within a very close distance of it using the cosine metric discussed above.
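The seed search can be sketched as below. To keep the example dependency-free it swaps in raw 3-gram counts where the real pipeline used TF-IDF weights, so the distances will differ from ours. The function names, the toy posts, and the 0.5 distance cut-off are assumptions for illustration; we did not record the exact threshold used.

```python
import math
from collections import Counter

def trigram_counts(text):
    """Count word 3-grams in a post (raw counts, not TF-IDF weights)."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2))

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    if norm == 0:
        return 1.0  # treat posts with no 3-grams as maximally distant
    return 1.0 - dot / norm

def near_seed(seed_text, posts, max_distance=0.5):
    """Return (index, distance) for posts within max_distance, closest first."""
    seed = trigram_counts(seed_text)
    hits = []
    for i, text in enumerate(posts):
        d = cosine_distance(seed, trigram_counts(text))
        if d <= max_distance:
            hits.append((i, d))
    return sorted(hits, key=lambda hit: hit[1])

# Toy posts, invented for illustration only
seed = "we cater to upscale gentlemen call anna today"
posts = [
    "we cater to upscale gentlemen call bella today",
    "brand new in town available all week long",
]
hits = near_seed(seed, posts)
print(hits)
```

In the toy data only the first post shares most of its 3-grams with the seed, so it is the only one returned.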

At this point we noticed that the closest ads all shared the main copy, but had a different girl's name, and a different phone number. It was clear that we had stumbled upon a network of individuals who were in some way or other working together.

We then repeated the process using the new phone numbers we had discovered and grew the network further. At each iteration, new members were discovered. The final result for the "Stacy" network is shown below:
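The iterative expansion amounts to a breadth-first search over phone numbers, where two numbers are linked if any of their ads are near-duplicates. The sketch below is an assumption-laden illustration: it uses a simple Jaccard overlap of word 3-grams as a stand-in for the cosine test, and the names grow_network and similar, the 0.4 threshold, and the toy ads are all hypothetical.

```python
from collections import deque

def trigram_set(text):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def similar(a, b, threshold=0.4):
    """Jaccard overlap of word 3-grams; a stand-in for the cosine test."""
    ta, tb = trigram_set(a), trigram_set(b)
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold

def grow_network(ads, seed_phone, is_similar=similar):
    """ads: list of (phone, text). Returns (phones, edges) reachable from the seed."""
    phones = {seed_phone}
    edges = set()
    queue = deque([seed_phone])
    while queue:
        current = queue.popleft()
        own_texts = [text for phone, text in ads if phone == current]
        for phone, text in ads:
            if phone == current:
                continue
            if any(is_similar(text, own) for own in own_texts):
                edges.add(frozenset((current, phone)))
                if phone not in phones:  # newly discovered member
                    phones.add(phone)
                    queue.append(phone)
    return phones, edges

# Toy ads, invented for illustration only
ads = [
    ("555-0100", "we cater to upscale gentlemen call anna today"),
    ("555-0101", "we cater to upscale gentlemen call bella today"),
    ("555-0101", "new in town sweet and discreet ask for bella"),
    ("555-0102", "new in town sweet and discreet ask for carla"),
    ("555-0199", "brand new massage studio open all week long"),
]
phones, edges = grow_network(ads, "555-0100")
print(sorted(phones))
```

Here the third number is only reachable through the second one, which is the effect we saw when each iteration surfaced new members. This version rescans every ad for every phone number, so it would need an index or a nearest-neighbour search to run on the full dataset.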

[Figure: Stacy's graph]

Taking a new seed ad, we repeated the process and found a new cluster. Interestingly, this cluster contained almost identical posts from both Atlanta and Los Angeles. The ads use the same names and copy, with only a different phone number. We don't claim to understand the implications of this interstate relationship, but suspect that a larger organization may be involved.
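One way to surface this kind of cross-city overlap automatically is to mask the phone number in each ad and then group identical copy by metro area. The following is a rough sketch under assumptions: the phone-number pattern covers only common US formats, the (metro, text) input layout is hypothetical, and exact matching after masking is stricter than the near-duplicate search we actually ran.

```python
import re
from collections import defaultdict

# Common US formats such as 404-555-0100 or (213) 555-0101 (an assumption)
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def normalize_copy(text):
    """Lowercase, mask phone numbers, and collapse whitespace."""
    masked = PHONE_RE.sub("<phone>", text.lower())
    return " ".join(masked.split())

def cross_metro_copy(ads):
    """ads: list of (metro, text). Return copy that appears in more than one metro."""
    metros_by_copy = defaultdict(set)
    for metro, text in ads:
        metros_by_copy[normalize_copy(text)].add(metro)
    return {
        copy: sorted(metros)
        for copy, metros in metros_by_copy.items()
        if len(metros) > 1
    }

# Toy ads, invented for illustration only
ads = [
    ("Atlanta", "Ask for Mia, call 404-555-0100"),
    ("Los Angeles", "Ask for Mia, call (213) 555-0101"),
    ("New York City", "Ask for Zoe call 212-555-0102"),
]
shared = cross_metro_copy(ads)
print(shared)
```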

Impact

With more time and resources, we believe that this process can be scaled to identify small and large networks within the cities and nationwide.

It could be used to find individuals masking their identities with fake phone numbers, or to track criminals who are known to use these message boards to facilitate their illicit operations.

Trevor & Tom
