What it does

It's a web application (backed by Azure) that tells you how clickbait-y a title is, giving it a score from 0 to 100 (and sometimes higher!). Titles from Buzzfeed receive high clickbait scores, while titles from actual news sites (CNN, PBS, etc.) receive low ones. Check out the live demo at https://www.ismyinternetworking.com/clickbait/index.html.

How I built it

I first collected hundreds of sample titles from Buzzfeed (considered clickbait) and from other news sites (CNN, PBS, ABC, Fox, etc.; not considered clickbait). Then, for each title, I generated values for a variety of heuristics, such as "does it begin with a number?", "does it include a word in ALLCAPS?", and "how many words are in the title?".
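To make the heuristics concrete, here is a minimal sketch of how the three examples above could be computed for one title. The function name `extract_heuristics` and the feature names are hypothetical, and the exact rules (for instance, ignoring one-letter words like "I" when looking for ALLCAPS) are assumptions rather than the rules used in the project.

```python
import re

def extract_heuristics(title: str) -> dict:
    """Compute a few simple numeric features for one title (illustrative only)."""
    words = title.split()
    # Strip surrounding punctuation so "WOW!" still counts as an all-caps word.
    stripped = [w.strip(".,!?:;\"'()") for w in words]
    return {
        # 1 if the title begins with a digit, else 0.
        "starts_with_number": int(bool(re.match(r"\d", title.strip()))),
        # 1 if any word longer than one letter is entirely upper case.
        "has_allcaps_word": int(
            any(len(w) > 1 and w.isalpha() and w.isupper() for w in stripped)
        ),
        # Number of whitespace-separated words.
        "word_count": len(words),
    }

if __name__ == "__main__":
    print(extract_heuristics("17 Things You NEED To Know Before Monday"))
    print(extract_heuristics("Senate passes budget bill after long debate"))
```

Note that a rule this simple would also flag legitimate acronyms such as "CNN" as ALLCAPS words, which is one reason a single heuristic is not enough on its own.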

I then plugged the data into Microsoft Azure's Machine Learning platform and ran a regression to predict clickbait-ness from the heuristics. With some trial and error, I found a Boosted Decision Tree to be the best regression type for predicting clickbait, giving a mean error of ~23% when predicting the training data.
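For readers unfamiliar with boosted tree regression, the idea can be sketched in plain Python. This is not Azure's implementation: it is a toy version that boosts one-split trees ("stumps") on a made-up six-row dataset, with clickbait labelled 100 and news labelled 0. All names, the data, and the settings (`rounds`, `lr`) are assumptions for illustration.

```python
from statistics import mean

def fit_stump(rows, residuals):
    """Find the single feature/threshold split that minimizes squared error."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [res for r, res in zip(rows, residuals) if r[f] <= t]
            right = [res for r, res in zip(rows, residuals) if r[f] > t]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            err = sum((x - lm) ** 2 for x in left) + sum((x - rm) ** 2 for x in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    return best[1:] if best else None

def fit_boosted(rows, targets, rounds=20, lr=0.5):
    """Start from the mean target, then repeatedly fit a stump to the residuals."""
    base = mean(targets)
    preds = [base] * len(rows)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(targets, preds)]
        stump = fit_stump(rows, residuals)
        if stump is None:
            break
        f, t, lm, rm = stump
        stumps.append(stump)
        preds = [p + lr * (lm if r[f] <= t else rm) for p, r in zip(preds, rows)]
    return base, stumps, lr

def predict(model, row):
    """Sum the base value and every stump's scaled contribution."""
    base, stumps, lr = model
    return base + sum(lr * (lm if row[f] <= t else rm) for f, t, lm, rm in stumps)

def mean_abs_error(model, rows, targets):
    return mean(abs(predict(model, r) - y) for r, y in zip(rows, targets))

# Made-up rows: [starts_with_number, has_allcaps_word, word_count]
ROWS = [[1, 1, 8], [1, 0, 10], [0, 1, 9], [0, 0, 7], [0, 0, 6], [0, 0, 12]]
TARGETS = [100, 100, 100, 0, 0, 0]

if __name__ == "__main__":
    model = fit_boosted(ROWS, TARGETS)
    print("training mean absolute error:", mean_abs_error(model, ROWS, TARGETS))
    print("score for a new title:", predict(model, [1, 1, 11]))
```

Because the output is a sum of real-valued contributions and nothing clamps it, a regression like this can return values above 100 or below 0. Also note that an error measured on the training data, as here, says little about how well the model handles unseen titles.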

The trained model was then deployed to Azure's web application platform, which generated an API for querying the model. I wrote a PHP middleman to abstract away API access (and hide the secret keys), and then an HTML/JavaScript frontend that makes requests to the PHP script. Try the live demo with your own titles or with the examples here:

https://www.ismyinternetworking.com/clickbait/index.html
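The middleman pattern can be sketched as follows. The real one is written in PHP; this Python version only illustrates the idea that the secret key lives on the server and is attached there, so the browser never sees it. The function name, the JSON payload shape, the bearer-token header, and the endpoint URL are all assumptions, not the actual schema of the deployed Azure service.

```python
import json
import urllib.request

def build_scoring_request(features: dict, endpoint: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to a model-scoring endpoint.

    The payload shape {"features": {...}} is hypothetical; the real schema
    is whatever the deployed service defines.
    """
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme: the key is added server-side only.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_scoring_request(
        {"starts_with_number": 1, "has_allcaps_word": 1, "word_count": 8},
        endpoint="https://example.invalid/score",  # placeholder URL
        api_key="SECRET-KEY-PLACEHOLDER",
    )
    print(req.get_method(), req.full_url)
    # Sending would be: urllib.request.urlopen(req), omitted here on purpose.
```

The frontend would call the middleman with just the title; the middleman computes or forwards the features, adds the key, and relays the score back.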

Challenges I ran into

Gathering article titles was difficult because few news sites have official APIs for pulling down their article titles. Some titles were simply copied and pasted one by one. On some webpages I was able to strip out elements so I could copy whole lists of titles at once, which made data collection much faster. My training data included almost 1,000 titles in total, and is available on GitHub.
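The collection described above was done by hand in the browser. As an automated alternative (not what was used here), a saved page could be parsed with Python's standard `html.parser`. This sketch assumes headlines sit inside `<h2>` elements, which varies from site to site; `HeadlineParser` and `extract_titles` are hypothetical names.

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the text inside every element with a given tag name."""

    def __init__(self, tag="h2"):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many matching tags are currently open
        self.buffer = []    # text pieces of the headline being read
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.titles.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

def extract_titles(html: str, tag: str = "h2") -> list:
    parser = HeadlineParser(tag)
    parser.feed(html)
    parser.close()
    return parser.titles

if __name__ == "__main__":
    page = '<div><h2><a href="#">First  headline</a></h2><p>teaser</p><h2>Second headline</h2></div>'
    print(extract_titles(page))
```

Before scraping any site this way, check its terms of use; the tag to look for also has to be found by inspecting each site's markup.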

Coming up with effective heuristics also took some work. I used some online word analysis tools to find the most commonly used words and turned them into a heuristic. Some clickbait titles could "look" like legitimate news, so having many heuristics enabled Azure to build an effective decision tree that could handle the tougher cases.
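The common-word heuristic could be reproduced offline along these lines. This is a sketch under assumptions: the project used online tools, so the tokenization, the minimum word length, and the function names `top_words` and `common_word_hits` are all illustrative choices.

```python
import re
from collections import Counter

def top_words(titles, n=5, min_length=3):
    """Return the n most frequent words across titles, skipping very short words."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z']+", title.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    return [word for word, _ in counts.most_common(n)]

def common_word_hits(title, vocabulary):
    """Heuristic value: how many distinct vocabulary words appear in the title."""
    words = set(re.findall(r"[a-z']+", title.lower()))
    return len(words & set(vocabulary))

if __name__ == "__main__":
    sample = [
        "You won't believe this",
        "You need this now",
        "This is why you care",
    ]
    vocab = top_words(sample, n=2)
    print(vocab)
    print(common_word_hits("Why you need this", vocab))
```

In practice the vocabulary would be built from the clickbait titles only, so that a high hit count points toward clickbait rather than toward common English in general.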

Accomplishments that I'm proud of

It actually works! Most titles taken straight from Buzzfeed receive high clickbait scores, while pieces from other news sites score low.

What I learned

How to effectively train a prediction model with Azure and tie it to a web application service.

What's next for Clickbait Detector

The dream is to build an API and browser extensions that automatically block out clickbait titles while browsing the web.
