With this project we sought to solve the problem of having to pick out reading material for oneself. Starting with the sterile database of Wikipedia, we wanted to make something that would continually suggest new articles to a user based on their browsing history, with the chance to introduce them to new but related topics. Sometimes when we choose what we consume, we fall into a rut we cannot see out of; to quote To Kill a Mockingbird by Harper Lee: "People generally see what they look for, and hear what they listen for." We wanted to introduce users to interesting articles beyond their normal scope.
What it does
Serendipity uses machine learning and graph theory to curate a constant stream of Wikipedia articles relevant to your interests. On first install it goes through your history of previously visited Wikipedia articles and builds a model of your interests. From there, it recommends Wikipedia articles to read and measures your level of interest in each article you visit. Based on whether or not you enjoyed the article, it adjusts the model and recommends either similar or different articles.
How we built it
The backend is built in Python and the web server runs on Flask. For each recommendation, it uses lxml to parse over 100,000 Wikipedia articles on average. The graph processing is all written in custom Python code optimized for speed and cacheability. We wrote a custom Python decorator that enables automatic disk-based caching of function calls between runs. The machine learning is done with NLTK and scikit-learn.
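A decorator like that can be sketched roughly as follows. This is an illustrative version, not the project's actual code: the name `disk_cache` and the pickle-based cache keying are our assumptions.

```python
import functools
import hashlib
import os
import pickle

def disk_cache(cache_dir="cache"):
    """Cache a function's return values on disk so they survive between runs."""
    os.makedirs(cache_dir, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Key the cache entry on the function name and its arguments.
            key = hashlib.sha1(
                pickle.dumps((func.__name__, args, kwargs))
            ).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)  # cache hit: skip the expensive call
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)     # cache miss: compute and store
            return result
        return wrapper
    return decorator
```

Wrapping an expensive function (here a hypothetical `fetch_article_links`) with `@disk_cache()` then makes repeated runs pay the parsing cost only once.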
We find our top 10 suggested Wikipedia articles by treating Wikipedia as a graph whose nodes are individual articles and whose edges are inter-article links.
The first step is to find the specific category of articles we are in (ignoring Wikipedia's own categories because they are far too broad). We do a breadth-first search out to a distance of 2 (traversing over 100,000 links on average per search) and approximate the center of the resulting graph by finding the node we pass through the greatest number of times. The center of the graph is generally a larger category than the individual article we started with. For example, if we started with "Common Lisp", then "Programming Languages" would likely be the central category of the graph.
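A minimal sketch of that center-finding search, assuming a `neighbors(title)` function that returns the titles linked from an article (both names are illustrative, not the project's actual API):

```python
from collections import Counter, deque

def approximate_center(start, neighbors, max_depth=2):
    """BFS out to max_depth, counting how often each node is reached;
    the most-traversed node approximates the center of the graph."""
    visits = Counter()
    depth_of = {start: 0}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the search radius
        for nxt in neighbors(node):
            visits[nxt] += 1  # count every traversal into the node
            if nxt not in depth_of:
                depth_of[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return visits.most_common(1)[0][0]
```

On a tiny graph where two articles both link to the same page, that shared page accumulates the most traversals and is returned as the center.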
Then we want to find other nodes closely related to both the central node and the starting article. We do so with another breadth-first search, finding the set of nodes that minimize the combined distance to the central node and the starting article.
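That second search might look roughly like this sketch, where `closest_to_both`, the `neighbors` callback, and scoring by the plain sum of the two BFS distances are our illustrative choices:

```python
from collections import deque

def closest_to_both(start, center, neighbors, k=10, max_depth=2):
    """Rank nodes by combined BFS distance to the starting article and
    the central node; the k lowest sums are the candidate suggestions."""
    def bfs_distances(source):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if dist[node] >= max_depth:
                continue
            for nxt in neighbors(node):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return dist

    from_start = bfs_distances(start)
    from_center = bfs_distances(center)
    # Only nodes reachable from both endpoints can be scored.
    shared = (set(from_start) & set(from_center)) - {start, center}
    return sorted(shared, key=lambda n: from_start[n] + from_center[n])[:k]
```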
This list of the top 10 articles is then passed to the lexical analysis code to be ranked.
We use term frequency–inverse document frequency (tf-idf) to rank the list of neighboring articles against a user's past history. This gives us a relationship between the lexical content of the suggested Wikipedia articles and the user's history. From this we recommend the n highest-ranked articles.
We use the Python library scikit-learn for this, nltk.corpus for its collection of English stopwords, numpy for matrix processing, and a self-made library for parsing text out of Wikipedia markup.
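A rough sketch of the ranking step with scikit-learn; the function name and the cosine-similarity scoring against the averaged history are our illustrative choices, not necessarily the exact pipeline used:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(history_texts, candidate_texts, n=5):
    """Score candidate articles by tf-idf cosine similarity to the
    user's history and return the indices of the n best matches."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on everything so history and candidates share one vocabulary.
    matrix = vectorizer.fit_transform(history_texts + candidate_texts)
    history_vecs = matrix[: len(history_texts)]
    candidate_vecs = matrix[len(history_texts):]
    # Average each candidate's similarity across the whole history.
    scores = cosine_similarity(candidate_vecs, history_vecs).mean(axis=1)
    return list(scores.argsort()[::-1][:n])
```

Given a history of programming-related texts, a candidate about compilers will score well above one about, say, baking, since the latter shares no informative terms with the history.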
Challenges we ran into
One of the largest challenges we ran into was the sheer amount of data we have to process for every request. To build the graph of Wikipedia articles within distance 2 of a given article, we have to process over 100,000 Wikipedia articles on average.
Another difficulty was sanitizing Wikipedia articles into a clean format for use with nltk. We wrote a custom markup parser using regular expressions that reduces a Wikipedia article to a clean list of the words it contains.
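A simplified sketch of such a regex-based stripper; the real parser handles far more of MediaWiki's syntax, and these patterns only cover the most common constructs:

```python
import re

def strip_wiki_markup(text):
    """Strip common MediaWiki markup and return a plain list of words."""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)  # templates: {{...}}
    # Internal links: keep the display label after the pipe, if any.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)            # bold/italic quote runs
    text = re.sub(r"<[^>]+>", " ", text)         # inline HTML tags
    text = re.sub(r"={2,}", " ", text)           # heading markers
    return re.findall(r"[a-z]+", text.lower())
```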
Accomplishments that we're proud of
We're very proud of the relevance of the results. For example, based on a single visit to the Wikipedia article on compilers, it comes up with five highly relevant recommendations.
What we learned
This whole process has been a huge learning experience for all of us. We've learned a tremendous amount about graph processing and applying machine learning to texts via nltk.
What's next for Serendipity
The next step is to build out authentication. Currently the backend does not authenticate users, which presents a privacy concern since it reveals one's interests. After that, we plan to release it publicly on the Chrome Web Store.