This project was built to match long, free-form text queries against research papers. If you are looking for papers similar to one you have seen before, our goal is to make that search effortless. We try to remove the guesswork from tailoring a search by using a multi-token comparison algorithm.
What it does
Our application takes a set of papers (primarily astronomy and space-related papers), converts each PDF to static images, runs text recognition on those images, removes garbage words, tokenizes the documents, and generates a vector over the shared vocabulary space. The frontend takes a multi-token query and matches it against the term frequency–inverse document frequency (tf-idf) scores of the documents we precomputed on the backend. At the end of this matching process, we return the top 30 scoring documents.
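The garbage-word removal and tokenization step can be sketched as follows. This is an illustrative stand-in, not the project's actual code: it assumes OCR noise shows up as non-alphabetic fragments and very short tokens, and the function name and length threshold are hypothetical.

```python
import re

def clean_tokens(ocr_text, min_len=3):
    """Tokenize OCR output and drop likely garbage tokens.

    Assumption (illustrative): garbage means non-alphabetic fragments
    and tokens shorter than min_len, which OCR noise tends to produce.
    """
    # Keep only runs of lowercase letters; numbers/punctuation split tokens.
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    return [t for t in tokens if len(t) >= min_len]

# OCR noise like "Th3" and "dist@nt" is broken apart and filtered.
print(clean_tokens("Th3 spectra of; dist@nt galaxies xx"))
# → ['spectra', 'dist', 'galaxies']
```

Note that this kind of filter is lossy (here "dist" survives as a fragment of "distant"); in practice one would tune the threshold or add a dictionary check.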
How we built it
We pulled papers from CORE, a database of academic papers, by selecting a number of astronomy journals and grabbing publications from those journals to build an astronomy-focused document set. To make use of older PDFs that lack an embedded text layer, we converted the PDFs to images and ran optical character recognition on the images to extract the text. We then removed all stop words, lemmatized to normalize tense and plurality, and uploaded the raw documents to our Google Cloud Storage bucket. Our goal was to maximize the document space in the time available, which yielded ~750 tokenized documents. The second round of precomputation aggregates the document space: it takes cumulative counts over the vocabulary space and calculates the overall term frequency–inverse document frequency vector. Once we have the cumulative count for each token in the vocabulary, we make a second pass over the documents to create, for each one, a vector of tf-idf weights aligned with the vocabulary space. Finally, (token, document, idf score) tuples are pushed to a PostgreSQL database to allow rapid computation on the backend. The technologies used in this piece are Python, Google Cloud Storage, Google Cloud SQL (PostgreSQL), Tesseract OCR, and NLTK.
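The aggregation pass above can be sketched in a few lines. This is a minimal illustration assuming the standard tf-idf weighting (term frequency times log inverse document frequency); the function name, the exact formula variant, and the toy data are assumptions, not the project's code.

```python
import math
from collections import Counter

def build_tfidf_rows(docs):
    """docs: {doc_id: [token, ...]} -> list of (token, doc_id, weight) rows,
    shaped like the tuples pushed to PostgreSQL."""
    n_docs = len(docs)
    # Document frequency: in how many documents each vocab token appears.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    # Inverse document frequency over the shared vocabulary space.
    idf = {t: math.log(n_docs / df[t]) for t in df}
    # Second pass: per-document tf-idf weights aligned with the vocab.
    rows = []
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        for t, count in tf.items():
            rows.append((t, doc_id, (count / len(tokens)) * idf[t]))
    return rows

docs = {
    "paper1": ["star", "galaxy", "star"],
    "paper2": ["galaxy", "nebula"],
}
rows = build_tfidf_rows(docs)
```

With this weighting, a token that appears in every document (like "galaxy" here) gets an idf of zero and contributes nothing to matching, which is exactly the behavior that makes stop-word-like terms harmless.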
The backend has three pieces: pullDocuments, searchSpaceDB, and searchDB. pullDocuments fetches information from the SQL database given an array of document IDs. searchSpaceDB takes a text query, generates a tf-idf vector for that input in the same shape as the vocabulary vector, runs the matching computation against the other documents in the table using cosine similarity, and returns the top 30 matches. searchDB returns information linking back to CORE, so that when we return a document ID, we can point users directly to the relevant paper. The technologies this piece uses are Python, Flask, Node.js/Express, Google Cloud Functions, Google Cloud Storage, and PostgreSQL.
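The matching step in searchSpaceDB can be sketched like this, assuming sparse {token: weight} vectors and a precomputed idf table; all names, the top-k default of 30, and the toy data are illustrative, not the service's actual code.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query_tokens, doc_vectors, idf, top_k=30):
    """Build a tf-idf vector for the query in the shared vocab space,
    then rank stored document vectors by cosine similarity."""
    tf = Counter(query_tokens)
    qvec = {t: (c / len(query_tokens)) * idf.get(t, 0.0) for t, c in tf.items()}
    scores = [(doc_id, cosine(qvec, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

# Toy precomputed state: idf per vocab token, one sparse vector per document.
idf = {"star": 0.7, "galaxy": 0.0, "nebula": 0.7}
doc_vectors = {
    "paper1": {"star": 0.46},
    "paper2": {"nebula": 0.35},
}
results = search(["star", "formation"], doc_vectors, idf)
```

Tokens absent from the vocabulary (like "formation" above) simply get zero weight, so out-of-vocabulary query terms degrade gracefully instead of breaking the match.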
The frontend was built with React/Node.js and uses Material-UI for styled components. It runs on Google App Engine.
Challenges we ran into
Our primary challenges were related to integrating with the Google Cloud platform. Initially, we wanted to accept PDFs as a submission format, but that required integrating Tesseract, which in turn required a custom installation on a Docker image. We also had some difficulty connecting all the working pieces, and spent a substantial amount of time setting up DNS redirects and tokens, as well as making sure our APIs allowed cross-origin requests.
Accomplishments that we're proud of
It works. We built a full-stack app with a functional search engine from the ground up in 24 hours.
What we learned
We need to allocate much more time to configuring Google Cloud, given that it is such a powerful but complicated tool.
What's next for Kerbal Search Program
We would really like to add PDF submission support, which requires integrating Tesseract. This will mean redesigning the backend so the content type is a PDF instead of plain text.