Inspiration

Whenever almost anyone starts a research paper, the first place they go is usually Wikipedia. It's a good resource for getting the general idea of a topic and knowing where to start, yet more often than not you run into (This article needs improvement), (Original Research), or the dreaded [citation needed]. So can you trust the information you find on those pages? And once you do begin searching for primary sources, you often find that you need to perform multiple searches of the form (topic + related word) to locate the relevant information. We decided to automate this process.

What it does

Wiki Alexandria takes a Wikipedia page and distills it down to a list of keywords. Each keyword is then sent out as a (topic + related word) query to multiple databases of published academic sources (Google Scholar, RefSeek, arXiv), and the results are sorted to find the top 10 articles that appear most often across the combined searches.
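The fan-out-and-rank step can be sketched as follows. This is a minimal illustration, not the project's actual code: `search` is a hypothetical callable standing in for one query against one database, and the query format simply concatenates the topic with a related word as described above.

```python
from collections import Counter
from itertools import product

def rank_results(topic, keywords, databases, search):
    """Query every (database, keyword) pair and rank articles by how
    often they appear across the combined result lists.

    `search(db, query)` is a hypothetical callable that returns a list
    of article identifiers for one query against one database.
    """
    counts = Counter()
    for db, word in product(databases, keywords):
        # Each query has the (topic + related word) shape
        for article in search(db, f"{topic} {word}"):
            counts[article] += 1
    # Articles seen in the most result lists come first
    return [article for article, _ in counts.most_common(10)]
```

`Counter.most_common` handles the "appears most often" ranking directly, so the only bookkeeping needed is incrementing a count per article across all of the combined searches.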

How we built it

Challenges we ran into

1) There were no ready-made APIs for either RefSeek or Google Scholar; both had to be built from scratch as HTML parsers.

2) Due to the extremely high volume of API usage needed to perform all of the match-up queries, we ran into MANY rate-limit problems on both services. They kept detecting our trawler as a search-engine bot and shut us out before more than one search could be performed. This occurred for both the RefSeek and the Google Scholar parser.

3) Issues with enabling asynchronous requests.

4) Two of the APIs we wanted to use were not available during the 24 hours. Microsoft's Academic Search API required an app ID that had to be requested via email while it was in beta; given that it was a weekend, it is highly unlikely that the manager of the API ever saw our email. JSTOR's API is only available to research organizations.

5) Because of the limits on API request speed, searches take a long time to run the first time. To combat this, results are cached once searched. Some example searches already cached include Gorilla Glass, Spectral Geometry, and Web Crawler.
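The "built from scratch as HTML parsers" challenge looks roughly like the sketch below, using only the standard library. The tag and class names (`h3`, `result-title`) are hypothetical placeholders; the real Google Scholar and RefSeek markup differs and changes over time, which is exactly why one wrong string can silently break the whole parser.

```python
from html.parser import HTMLParser

class ResultTitleParser(HTMLParser):
    """Minimal sketch of a hand-rolled search-results parser: collect
    the text of result-title headings from a search page. The "h3" /
    "result-title" markup is an assumption, not the real page layout."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "h3" and ("class", "result-title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())
```

Note the failure mode: if the class name in `handle_starttag` doesn't match the page, the parser simply collects nothing rather than raising an error.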

Accomplishments that we're proud of

It works!

What we learned

  • Parsing HTML takes a lot of time, and if one string is wrong the whole thing fails silently instead of throwing an error
  • better Python skills
  • implementation of asynchronous requests
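A minimal sketch of the asynchronous fan-out, using only `asyncio`. This is illustrative, not the project's code: `asyncio.sleep(0)` stands in for the actual HTTP request, and the semaphore shows one way to cap concurrency so the volume of simultaneous queries doesn't immediately trip the rate limits described above.

```python
import asyncio

async def fetch(db, query, sem):
    """Hypothetical fetch: a real version would issue an HTTP request;
    here asyncio.sleep(0) stands in for network latency."""
    async with sem:              # cap in-flight requests per the semaphore
        await asyncio.sleep(0)   # placeholder for the actual request
        return (db, query)

async def fan_out(databases, queries, max_concurrent=3):
    """Launch one task per (database, query) pair and gather results
    in task-creation order."""
    sem = asyncio.Semaphore(max_concurrent)
    tasks = [fetch(db, q, sem) for db in databases for q in queries]
    return await asyncio.gather(*tasks)
```

`asyncio.gather` preserves the order tasks were passed in, which keeps matching results back to their (database, query) pair straightforward.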

What's next for Wiki Alexandria

  • implementation of Digital Public Library of America's API for primary source information
  • sort algorithm to weigh the top-ten results from each database to compile an overall top ten list
  • cache more pages
  • implement a more humanities-centered database
  • better website UI
