PDF Highlight Combiner DLX

Inspiration

One of our group members noticed that when we moved to online learning, comparing our thoughts and ideas about the texts we read became more difficult. We couldn't easily sit down and compare what we had each annotated, so we thought we might be able to create software that could analyze PDF files and display useful information about the group of files.

What it does

On startup, the program asks for an original PDF file that has no highlights. It then asks for a directory containing many PDF files that match the original except for the highlights made in each one. Once both paths are selected, the program collects all the highlights made in the files in the directory. It creates a new PDF in the same location as the original PDF and opens it. The resulting PDF is highlighted based on how often each passage was highlighted: a lighter highlight marks text that was highlighted at least once but not often, and the darker a highlight is, the more often that text was highlighted across the files in the directory.
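As a rough illustration of the frequency idea described above, here is a minimal sketch in plain Java. It is not the project's actual code: the class name HighlightFrequency, the methods countHighlights and alphaFor, the representation of highlights as strings, and the choice to express "darker" as a higher opacity with a floor of 0.2 are all assumptions made for this example. Reading and writing the PDFs themselves is left out.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not taken from the project.
class HighlightFrequency {

    // Counts in how many annotated copies each passage was highlighted.
    // Each inner list holds the passages highlighted in one copy of the PDF.
    static Map<String, Integer> countHighlights(List<List<String>> copies) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> copy : copies) {
            // De-duplicate within a copy so one reader counts at most once per passage.
            for (String passage : new LinkedHashSet<>(copy)) {
                counts.merge(passage, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Maps a count to an opacity: more copies highlighting a passage gives a darker mark.
    // The 0.2 floor is an assumed value so that a single highlight stays visible.
    static double alphaFor(int count, int totalCopies) {
        if (count <= 0 || totalCopies <= 0) {
            return 0.0; // never highlighted, so nothing is drawn
        }
        double minAlpha = 0.2;
        double fraction = Math.min(1.0, (double) count / totalCopies);
        return minAlpha + (1.0 - minAlpha) * fraction;
    }
}
```

With this sketch, a passage highlighted in every copy gets an opacity close to 1.0, while a passage highlighted in only one copy gets a value just above the floor. A linear scale is only one possible choice; a stepped scale with a few fixed shades would work equally well.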

How we built it

Given that our group has little practice with developing programs like this one, we started by deciding what kind of input/output we wanted and what kinds of classes we might need to get from the input to the output. We knew we didn't have a ton of code to write, but the code we needed to write did a lot of things we didn't know how to do, so there were always at least one or two people googling and experimenting with APIs during the afternoon/night.

Challenges we ran into

PDF libraries are not created equal! We spent a lot of time searching online for a library with an API that did what we needed for reading and editing the files, and we changed our minds a couple of times throughout the project. We eventually settled on PDFBox for getting data and PDF Clown for writing data.

Accomplishments that we're proud of

We are proud of the project as a whole! We are aware that the code is a bit messy and things don't work quite how we want them to, but when we run the program, our goal of distinguishing passages that were highlighted often from those highlighted only once is evident.

What we learned

For most of us in the group, this was our first time using GUIs, using external libraries in a project, working as a team on a project, and trying to read/edit PDF files. Needless to say, we learned a TON about project logistics and how powerful external libraries can be if we understand what they do. Likewise, we learned that external libraries can be confusing and difficult to work with, especially if they are not well documented or not widely used, like PDF Clown.

What's next for PDF Highlight Combiner DLX

Given our lack of experience, we decided to create something very basic. In the future, we would like to use JavaScript and expand this project on the web. We imagine that this tool could be used in something like Canvas, where a professor could view a summary of all of the annotations from their class on a particular reading. We would also like to expand the types of annotations to include notes, categories, tags, or who highlighted specific passages, to allow for even more information sharing.

Built With

java
pdfbox
pdfclown
swing

Submitted to

DubHacks 2020

Created by

I worked on the backend component that extracts highlighted words from the PDF, fixed bugs, and provided overall support for the APIs that we used.

Lam Mai
Worked on the Highlight object and helped with general problems.

yavuzalp Turkoglu
Worked on using the scanned PDF content to highlight the resulting PDF, and helped with the general backend.

treguv Vlad Tregubov
I built the basic GUI, worked on project structure to make the classes work together, and provided general backend support.

Austn Attaway

Updates

Austn Attaway started this project — Oct 18, 2020 11:34 AM EDT
