We noticed that there are many web scrapers and tools for scraping websites in general, and we wanted to go further and use a web scraper as a force for good.
What it does
Scrapes a URL and extracts specific content from the page's HTML. The extracted content is stored in a text file. The text is run through natural language processing to produce words and phrases, which are compared against a manually compiled text file of words that can indicate sexual exploitation. A frequency analysis is run over the words gathered from the HTML. Sentiment analysis, using a model from the flair library, is also applied to the same list of words.
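The keyword comparison and frequency analysis described above could look roughly like the sketch below. It is a minimal illustration using only the Python standard library, not the project's actual code. The sample text, the watchlist, and the function names `tokenize` and `flag_terms` are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def flag_terms(text, watchlist):
    """Return how often each watchlist term appears in the text.

    Terms that never appear are left out of the result.
    """
    counts = Counter(tokenize(text))
    return {term: counts[term] for term in watchlist if counts[term] > 0}


# Hypothetical sample text and watchlist, for illustration only.
sample = "New in town. Young, available now. Cash only, no questions. Available late."
watchlist = ["young", "available", "cash", "discreet"]

flags = flag_terms(sample, watchlist)
print(flags)
```

In a real run, the watchlist would be read from the manually compiled text file and the sample text from the scraped output.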
How we built it
We used multiple Python libraries, including the natural language processing libraries nltk and flair, Beautiful Soup for scraping, and sklearn for text feature extraction. We created two scripts: one where the user enters a URL, and another where the user supplies an HTML file directly. The scripts generate multiple text files and build term frequency-inverse document frequency (TF-IDF) vectors, which pick out the important words from the scraped text. Those terms are then compared to help determine the safety of a website.
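To illustrate how TF-IDF weighting picks out important words, here is a small plain-Python sketch of the idea. It is not the project's code and it does not reproduce sklearn's exact formula (sklearn's `TfidfVectorizer` applies smoothing and normalization by default). The sample documents and the names `tf_idf` and `top_terms` are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tf_idf(documents):
    """Score each term in each document as tf * log(N / df).

    tf is the term's share of the document's tokens, N is the number of
    documents, and df is the number of documents containing the term.
    """
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)

    # Count, for every term, how many documents contain it at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(score, k=3):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(score.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical scraped snippets, for illustration only.
docs = [
    "cash only tonight",
    "cash accepted at the store",
    "the store opens tonight",
]
scores = tf_idf(docs)
print(top_terms(scores[0], k=1))
```

A word that appears in only one document, such as "only" in the first snippet, scores higher than words shared across documents, which is why TF-IDF is useful for surfacing terms that distinguish one page from the rest.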
Challenges we ran into
We ran into problems with a suspicious Craigslist post. We attempted to scrape it with our scraper, but its HTML separates the text with line breaks (br tags) instead of separate lines or p tags. Another problem was that one of our teammates (Hebah Beg) was unable to set up the project's libraries on her device to run the programs.
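For markup like the br-separated post described above, one possible approach is to treat each br tag as a line boundary while extracting text. The sketch below uses Python's built-in `html.parser` rather than the project's Beautiful Soup scraper, and it is not something the team reported implementing. The class name, function name, and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class BreakAwareTextExtractor(HTMLParser):
    """Collect page text, turning every <br> tag into a newline."""

    def __init__(self):
        super().__init__()
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <br/> tags are also routed here by HTMLParser.
        if tag == "br":
            self._parts.append("\n")

    def handle_data(self, data):
        self._parts.append(data)

    def lines(self):
        """Return the non-empty lines of text, stripped of whitespace."""
        text = "".join(self._parts)
        return [line.strip() for line in text.splitlines() if line.strip()]


def extract_lines(html):
    """Parse an HTML string and return its br-separated lines of text."""
    parser = BreakAwareTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines()


# Hypothetical markup imitating a post that uses <br> instead of <p>.
sample_html = "<div>First line<br>Second line<br/><br>Third line</div>"
print(extract_lines(sample_html))
```

This simple version keeps all text nodes, including any script or style contents, so a fuller scraper would need to skip those elements.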
Accomplishments that we're proud of
We are proud of being able to scrape websites, apply natural language processing to the scraped text, and compare the scraped words with words that could indicate sexual exploitation. We are also proud of our teamwork and of debugging the code to solve a problem.
What we learned
We learned how web scraping can be enhanced by extending it with natural language processing.
What's next for Web Scraping Human Trafficking
Future representation, expansion of the code, further debugging, and exploration of other machine learning methodologies.