What it does
It's a web application (backed by Azure) that rates how clickbait-y a title is, giving it a score from 0 to 100 (and sometimes higher!). Titles from Buzzfeed receive high clickbait scores, while titles from actual news sites (CNN, PBS, etc.) receive low ones. Check out the live demo here.
How I built it
I first collected hundreds of sample titles from Buzzfeed (considered clickbait) and from other news sites (CNN, PBS, ABC, Fox, etc.; not considered clickbait). Then I came up with a variety of heuristics, such as "does it begin with a number?", "does it include a word in ALLCAPS?", and "how many words are in the title?", and generated a value for each heuristic for every title.
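As a rough illustration of the three heuristics quoted above, here is a minimal Python sketch. The function name `extract_features`, the feature names, and the exact rules (for example, an ALLCAPS word needs at least two letters) are assumptions made for this example, not the project's actual code.

```python
import re


def extract_features(title: str) -> dict:
    """Turn a title into a small dict of numeric heuristic values."""
    words = title.split()

    def is_allcaps(word: str) -> bool:
        # Drop punctuation so "WON'T" counts; require 2+ letters so "A" and "I" do not.
        letters = re.sub(r"[^A-Za-z]", "", word)
        return len(letters) > 1 and letters.isupper()

    return {
        # "does it begin with a number?"
        "starts_with_number": int(bool(words) and words[0][0].isdigit()),
        # "does it include a word in ALLCAPS?"
        "has_allcaps_word": int(any(is_allcaps(w) for w in words)),
        # "how many words are in the title?"
        "word_count": len(words),
    }


if __name__ == "__main__":
    print(extract_features("17 Things You WON'T Believe"))
```

Note that a rule this simple also flags acronyms such as "CNN" as ALLCAPS, which is one reason a single heuristic is not enough on its own.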
I then plugged the data into Microsoft Azure's Machine Learning platform and ran a regression to predict clickbait-ness from the heuristics. With some trial and error, I found a Boosted Decision Tree to be the best regression type for predicting clickbait, giving a mean error of ~23% when predicting the training data.
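The training itself ran inside Azure, so there is no project code for it here. To give a feel for what a boosted decision tree regression does, below is a small pure-Python sketch that boosts one-split trees ("stumps") on squared error. It is a stand-in for illustration only, not Azure's implementation; the function names, the learning rate, the number of rounds, and the six toy rows are all made up for this example.

```python
def fit_stump(X, residuals):
    """Find the single feature/threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    return best[1:] if best else None


def fit_boosted(X, y, rounds=50, learning_rate=0.3):
    """Start from the mean score, then repeatedly fit a stump to the remaining error."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [target - p for target, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, threshold, left_mean, right_mean = stump
        stumps.append(stump)
        predictions = [
            p + learning_rate * (left_mean if row[j] <= threshold else right_mean)
            for row, p in zip(X, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, row):
    """Sum the base score and every stump's scaled contribution for one row."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(
        left_mean if row[j] <= threshold else right_mean
        for j, threshold, left_mean, right_mean in stumps
    )


# Toy rows (made up): [starts_with_number, has_allcaps_word, word_count]
X = [[1, 1, 9], [1, 0, 11], [0, 1, 10], [0, 0, 6], [0, 0, 7], [0, 0, 5]]
y = [100, 100, 100, 0, 0, 0]  # 100 = clickbait, 0 = not clickbait
model = fit_boosted(X, y)

if __name__ == "__main__":
    print(round(predict(model, [1, 1, 12]), 1), round(predict(model, [0, 0, 4]), 1))
```

Because the output is a sum of tree contributions rather than a bounded probability, nothing forces a prediction to stay inside the range of the training labels, which fits with scores occasionally landing above 100.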
Challenges I ran into
Gathering article titles was difficult because few news sites have official APIs for pulling down their article titles. Some titles were simply copied and pasted one by one. On some webpages I was able to strip out elements so I could copy whole lists of titles at once, which made data collection much faster. My training data included almost 1,000 titles in total, and is available on GitHub.
Coming up with effective heuristics also took some work. I used some online word analysis tools to find the most commonly used words and turned that into a heuristic. Some clickbait titles can "look" like legitimate news, so having many heuristics enabled Azure to build an effective decision tree that could handle the tougher cases.
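The common-word heuristic could be reproduced offline along these lines. This is only a sketch: the project used online word analysis tools, and `tokenize`, `top_words`, `common_word_hits`, and the optional stopword set are hypothetical helpers introduced for this example.

```python
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into words (apostrophes kept)."""
    return re.findall(r"[a-z']+", title.lower())


def top_words(titles, n=20, stopwords=frozenset()):
    """Return the n most frequent words across a list of titles."""
    counts = Counter(
        word for title in titles for word in tokenize(title) if word not in stopwords
    )
    return {word for word, _ in counts.most_common(n)}


def common_word_hits(title, vocabulary):
    """Heuristic value: how many words in the title are in the common-word set."""
    return sum(1 for word in tokenize(title) if word in vocabulary)


if __name__ == "__main__":
    sample = ["You Won't Believe This", "This Will Change You", "Things You Need"]
    vocabulary = top_words(sample, n=2)
    print(common_word_hits("You Should See This", vocabulary))
```

Building the vocabulary from the clickbait titles only, rather than from all titles, is one possible design choice: it keeps the count focused on words that are typical of clickbait rather than of headlines in general.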
Accomplishments that I'm proud of
It actually works! Most titles ripped right from Buzzfeed get high clickbait scores, while pieces from other news sites score low.
What I learned
How to effectively train a prediction model with Azure and tie it to a web application service.
What's next for Clickbait Detector
The dream is to build an API and browser extensions that automatically block out clickbait titles while browsing the web.