Front page of Tagger News
Example of a tag-specific page: the Startups category
Word clusters in Hacker News posts, discovered with the tidytext package while developing the automatic article categorization.
Hacker News is widely used by software developers, but it has limitations. The conversation can be dominated by a few popular topics, and niche articles that appeal to a small audience can have a hard time gaining attention. In contrast, Reddit's support for "subreddits," communities where users can keep up to date on specific topics they're interested in, has led to an explosion of usage and variety of discussion.
What it does
Tagger News provides a new way to browse stories on Hacker News, letting you follow specific topics you're interested in, such as "Web Development", "Mobile", "Politics", and "AI / Machine Learning". Stories are tagged automatically based on the article's content, so Tagger News always has up-to-date news.
How we built it
We collected the full text of 20,000 Hacker News posts and used machine learning (specifically topic modeling and random forests) to train a classifier that assigns articles to topics. We first built a labeled test set through exploratory analysis in R (using packages such as tidytext), then trained machine learning models in Python with scikit-learn.
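As a rough sketch of the approach, a text classifier like this can be built as a scikit-learn pipeline that turns article text into TF-IDF features and feeds them to a random forest. The tiny dataset and the topic names below are illustrative stand-ins, not our actual training data or labels.

```python
# Minimal sketch: TF-IDF features into a random forest classifier.
# The articles and labels here are toy stand-ins for the real corpus.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

articles = [
    "raising a seed round for our startup",
    "training a neural network with gradient descent",
    "new javascript framework for building web apps",
    "venture capital funding trends this quarter",
    "convolutional networks for image classification",
    "css grid layout tips for responsive design",
]
topics = ["startups", "ai", "webdev", "startups", "ai", "webdev"]

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(articles, topics)

# Tag an unseen article.
print(model.predict(["training a convolutional neural network"]))
```

One appeal of the pipeline object is that the vectorizer and the classifier travel together, so the same preprocessing is applied at training and prediction time.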
We deployed this as a Django application hosted on Heroku, with a UI based on the Hacker News design. Scheduled cron jobs keep it synchronized with Hacker News. Our code is available on GitHub.
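The sync step can be sketched as a small function against the official Hacker News Firebase API (which serves top-story IDs and per-item JSON). The function and parameter names below are our illustration, not the actual Tagger News code; the fetcher is injectable so the logic can be exercised without hitting the network.

```python
# Hedged sketch of a cron-driven sync step: pull top-story IDs from the
# official Hacker News API, fetch any items we have not seen yet, and
# return them for tagging. Names here are illustrative.
import json
from urllib.request import urlopen

HN_API = "https://hacker-news.firebaseio.com/v0"

def fetch_json(url):
    """Fetch and decode a JSON document, with a socket-level timeout."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)

def sync_new_stories(known_ids, fetch=fetch_json):
    """Return items for top stories not already in the database."""
    top_ids = fetch(f"{HN_API}/topstories.json")
    new_items = []
    for story_id in top_ids:
        if story_id in known_ids:
            continue  # already stored and tagged
        item = fetch(f"{HN_API}/item/{story_id}.json")
        if item and item.get("type") == "story":
            new_items.append(item)
    return new_items
```

A scheduled job would call `sync_new_stories` with the set of IDs already in the database, then run each new item through the classifier.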
Challenges we ran into
The Hacker News API is slow, as is the library we used for article extraction. We kept the web scraping running in the background across three computers for about eight hours while we developed the product.
Keeping the database on Heroku in sync with new Hacker News articles was challenging, especially since some articles would time out while loading.
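One way to keep a slow, failure-prone scrape from stalling the whole sync is to fetch articles concurrently and simply record the ones that error or time out. This is a sketch of that pattern, not our exact code; `extract_text` is a stand-in for the real downloader-plus-extractor (in the real version, a socket timeout inside the fetch is what turns a hung page into a catchable error).

```python
# Hedged sketch: fetch article text concurrently and skip failures, so
# one slow or timing-out page cannot block the rest of the batch.
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_text(url):
    # Stand-in: the real version downloads the page (with a timeout)
    # and runs an article extractor such as goose on the HTML.
    return f"full text of {url}"

def scrape_all(urls, workers=8):
    """Return (url -> text) for successes and a list of failed URLs."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_text, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception:
                failed.append(url)  # timed out or failed to parse; skip it
    return results, failed
```

The failed list can be retried on the next cron run rather than blocking the current one.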
Accomplishments that we're proud of
We were happy with the performance of our machine learning algorithm: on our supervised training set it achieved an AUC of at least 0.85 for almost all topics, and as high as 0.95 for some. We were also able to move trained models seamlessly into production thanks to the scikit-learn library.
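The evaluate-then-ship loop can be sketched as follows: score a per-topic binary classifier with AUC on held-out examples, then persist the whole pipeline so the web app can load it. The tiny dataset, filename, and variable names are illustrative assumptions, not our actual setup.

```python
# Hedged sketch: per-topic AUC evaluation and model persistence.
# Toy data stands in for the real labeled corpus; labels are binary
# (1 = belongs to the "AI / Machine Learning" topic).
import joblib  # ships as a scikit-learn dependency
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

train_texts = [
    "neural networks explained",
    "deep learning models",
    "startup funding round",
    "venture capital deal",
    "gradient descent training",
    "seed stage investors",
]
train_labels = [1, 1, 0, 0, 1, 0]
test_texts = ["learning neural models", "startup investors"]
test_labels = [1, 0]

model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
model.fit(train_texts, train_labels)

# One-vs-rest AUC for this topic on held-out examples.
scores = model.predict_proba(test_texts)[:, 1]
auc = roc_auc_score(test_labels, scores)
print(f"AUC: {auc:.2f}")

# Persist the entire pipeline; the production app loads it with joblib.load.
joblib.dump(model, "topic_ai.joblib")
```

Because the vectorizer is serialized alongside the classifier, the production app can call `predict` on raw article text with no extra glue code.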
We constructed a dataset of full-text articles posted on Hacker News, now freely available, that could be used in many other machine learning and natural language processing investigations.
What we learned
We learned to use the goose package to extract text from a wide variety of articles. We had built machine learning systems for production before, but rarely under such time constraints, and we were impressed by how smoothly scikit-learn moves from exploratory analysis to production.
What's next for Tagger News
Putting it on Hacker News, of course! We expect it to be tagged under both "Web Development" and "AI / Machine Learning."