Court opinions and journal articles form the foundation of new legal thinking in the United States, and URL references have become a critical part of those documents in recent years. Yet roughly 50% of the links cited in Supreme Court cases, and about 70% of those in some law journals, are dead. Perma.cc, a tool developed at the Harvard Library Innovation Lab, was built to mitigate this problem going forward, and it has since become the legal industry's standard for URL archiving. But there is one glaring gap: it can't do anything about the thousands of links already rotting in existing databases.
What it does
We built a web application that takes JSON and PDF input, scans the files for URLs, creates Perma.cc permalinks for each one, and inserts those permalinks into the documents. The user can then download the permalinked files!
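The scan-and-replace step can be sketched roughly as below. This is a minimal illustration, not the app's actual code: the function names and regex are our own, and it assumes a pre-built map from each original URL to its permalink.

```javascript
// Illustrative sketch of the URL scan/replace pass over extracted document text.
// Matches http(s) URLs up to whitespace or common closing punctuation.
const URL_PATTERN = /https?:\/\/[^\s"')\]]+/g;

// Find every distinct URL in a block of text (e.g., an opinion pulled from JSON).
function extractUrls(text) {
  return [...new Set(text.match(URL_PATTERN) || [])];
}

// Swap each URL for its permalink, leaving URLs with no mapping untouched.
function insertPermalinks(text, permalinkMap) {
  return text.replace(URL_PATTERN, (url) => permalinkMap[url] || url);
}
```

In the real pipeline the map is populated by the permalink-creation step described below, and the rewritten text is re-serialized into the original JSON or PDF structure.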
How we built it
The system is built on a Node.js/Express front end. On the back end, we used C to make our calls to the Perma.cc API.
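For brevity, here is the shape of that archive request sketched in JavaScript rather than C. Treat the endpoint, query parameter, and field names as assumptions based on Perma.cc's public API documentation; the API key is a placeholder.

```javascript
// Assumed Perma.cc archive endpoint (per its public API docs).
const PERMA_ENDPOINT = "https://api.perma.cc/v1/archives/";

// Build the request for archiving one URL. Perma.cc authenticates with an
// api_key query parameter and expects a JSON body naming the target URL.
function buildArchiveRequest(targetUrl, apiKey) {
  return {
    url: `${PERMA_ENDPOINT}?api_key=${encodeURIComponent(apiKey)}`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url: targetUrl }),
    },
  };
}

// Fire the request and turn the returned GUID into a https://perma.cc/ link.
async function createPermalink(targetUrl, apiKey) {
  const { url, options } = buildArchiveRequest(targetUrl, apiKey);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Perma.cc returned ${res.status}`);
  const { guid } = await res.json();
  return `https://perma.cc/${guid}`;
}
```

The C back end makes the equivalent HTTP POST; the permalink it returns feeds the insertion step described above.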
Challenges we ran into
One big challenge was handling the messy input data we scraped from CourtListener, a court opinion database with over 5 million files. The URLs were often malformed due to low-quality PDF OCR. We also hit problems on the web front end when serving converted files back to users for download; the separation between server- and client-side code in Node caused the confusion, which we eventually resolved by placing all converted files in a static directory.
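The OCR cleanup amounted to a handful of heuristics along these lines. The rules shown are illustrative, not an exhaustive list of what we handled:

```javascript
// Normalize an OCR-damaged URL candidate. Two common OCR artifacts:
// URLs split across lines (stray whitespace) and trailing sentence punctuation.
function cleanOcrUrl(raw) {
  return raw
    .replace(/\s+/g, "")          // rejoin URLs the OCR split with spaces/newlines
    .replace(/[.,;:)\]]+$/, ""); // drop punctuation picked up from the sentence
}

// Reject candidates that still don't parse after cleanup, using Node's URL class.
function isValidUrl(candidate) {
  try {
    new URL(candidate);
    return true;
  } catch {
    return false;
  }
}
```

Candidates that fail validation even after cleanup were the hard cases; a malformed URL is worse than a dead one if it gets archived verbatim, so filtering conservatively mattered.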
What's next for PermaCC++
We are in touch with some of Harvard's law journals about adding permalinked documents to their databases, and we will be discussing with Perma.cc the possibility of merging the two tools. The potential of our back-end tool for link archiving is almost limitless: paired with a proper web-scraping tool, it could begin permalinking virtually any JSON- or PDF-based database to halt link rot.