arxiver

(pronounced "archiver" as if X were the Greek letter)

arxiver is an API designed to extract relevant information about scientific publications from the public database arXiv.org, owned and operated by Cornell University. You can use arxiver to send search queries and find new publications in various academic areas. Results are returned in JSON in the form of "Papers".

Inspiration

I got the inspiration for creating an API for arXiv.org while I was trying to find new research papers in the field of Computer Science. I was tired of always being late to the party in terms of CS research. Someone would always have to tell me about the latest and greatest algorithm and bring me up to speed. How does everyone know about it before me? That's when I found that arXiv.org organizes its papers by submission date. The UI is a little clunky and it's only available on their site. There's no API for getting a list of NEW papers, only for searching papers. And even the search API is difficult to use. Data are returned in XML and tags seem to have a weird namespace. So I was also inspired to redesign their search API with a more intuitive and convenient JSON format that only displays relevant information.

How it works

It works like any other RESTful API. You send a request to a URL with query parameters and get JSON results back. You can either search using a string query or get a list of new papers from an academic field. The specifics are available on the homepage.

There is also a Python package available on PyPI, so Python developers can bypass the HTTP altogether. Details of how to install and use the package are also available on the homepage.

The response is a list of papers. Each paper is a dictionary containing the arXiv page, the title, the abstract, the authors and their links, and links to the PDF of the paper.
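To give a feel for consuming such a response, here is a minimal sketch in Python. The key names ("page", "title", "abstract", "authors", "pdf"), the sample values, and the summarize helper are all assumptions made up for illustration; the real field names are documented on the homepage and may differ.

```python
import json

# Hypothetical response body. The key names and values are placeholders
# chosen for illustration, not arxiver's documented schema.
sample_response = """
[
  {
    "page": "https://arxiv.org/abs/0000.00000",
    "title": "An Example Paper",
    "abstract": "A short placeholder abstract.",
    "authors": [{"name": "A. Author", "link": "https://arxiv.org/a/placeholder"}],
    "pdf": "https://arxiv.org/pdf/0000.00000"
  }
]
"""


def summarize(papers):
    """Return one 'title (authors) -> pdf' line per paper dictionary."""
    lines = []
    for paper in papers:
        names = ", ".join(author["name"] for author in paper["authors"])
        lines.append(f'{paper["title"]} ({names}) -> {paper["pdf"]}')
    return lines


# Decode the JSON text into a list of dictionaries and print a summary.
papers = json.loads(sample_response)
for line in summarize(papers):
    print(line)
```

Because every paper is a plain dictionary, the same loop should work whether the list came over HTTP or from the Python package, assuming both return the same shape.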

Challenges I ran into

The arXiv.org HTML design is absolutely confusing and doesn't play well with web scraping. I spent a lot of time trying to extract relevant information about each paper. Even arXiv's current search XML API is messy. All the tags have a strange prefix that makes the document hard to traverse in any language. It was very tedious, but now with arxiver, no one else will have to go through what I had to go through.
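The "strange prefix" is most likely the Atom XML namespace: arXiv's search API returns an Atom feed, so parsers report every tag with the namespace URI attached. The sketch below uses Python's standard xml.etree.ElementTree on a hand-written sample feed to show why a naive lookup fails and how a namespace map fixes it. The sample entry, the entry_to_paper helper, and the output key names are assumptions for illustration, not arxiver's actual implementation.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds; "atom" is just a local alias for lookups.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written sample in the general shape of an Atom feed (placeholder values).
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An Example Paper</title>
    <summary>A short placeholder abstract.</summary>
    <author><name>A. Author</name></author>
  </entry>
</feed>
"""

root = ET.fromstring(sample_feed)

# The parser stores the namespace as a prefix on every tag name.
print(root.tag)

# A lookup without the namespace finds nothing, which is the confusing part.
print(root.find("entry"))


def entry_to_paper(entry):
    """Convert one Atom <entry> element into a plain dictionary (hypothetical keys)."""
    return {
        "page": entry.findtext("atom:id", namespaces=ATOM_NS),
        "title": entry.findtext("atom:title", namespaces=ATOM_NS),
        "abstract": entry.findtext("atom:summary", namespaces=ATOM_NS),
        "authors": [
            name.text for name in entry.findall("atom:author/atom:name", ATOM_NS)
        ],
    }


# With the namespace map, the entries are found and flattened into dictionaries.
papers = [entry_to_paper(e) for e in root.findall("atom:entry", ATOM_NS)]
print(papers)
```

Flattening each entry into a dictionary like this is one way to get from the namespaced XML to the kind of JSON-friendly structure described above.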

Accomplishments that I'm proud of

I'm really proud of the Python package that I developed for this hackathon. Over 200 developers and counting have downloaded it. In fact, I used the package as the backbone for the HTTP version of the API.

What's next

More relevant information. For each paper that is submitted to arXiv.org, there is a whole page of information with references, citations, and alternate formats.
