As students interested in Data Science and Machine Learning, we've found that a great way to stay up to date with this rapidly changing field is to read recently published academic papers. However, many of these papers are extremely long and tough to comprehend; abstracts are often missing, and even when present they aren't always informative. Furthermore, we often spent a lot of time going through papers that were similar to something we had read before, without knowing that beforehand due to the sheer length of each paper. We also often looked for the answer to a specific question but didn't know where to find it, as the answer was frequently buried in 80 pages of technical terminology. To solve this, we developed Simplitize, a web app that helps you understand academic papers via NLP question answering and document summarization.
What it does
Simplitize has two features. First, the user can copy and paste a paper into our webpage, and we'll summarize it in under 10 sentences. Second, the user can paste a paper into our software along with a question about the paper, and we'll provide the answer to their question (or "None" if the paper does not contain the answer). We go into how this works below.
How We built it
There are two types of document summarization: abstractive summarization and extractive summarization. Abstractive summarization generates new sentences that paraphrase the big ideas of the document, which reads naturally but is much harder to get right; extractive summarization selects the most important existing sentences verbatim, which can read choppily but preserves the original wording. We chose extractive summarization, as the purpose of our project was to make these papers easier to understand without distorting their content. We used Natural Language Processing to give each sentence a "rating" for how essential it was to the "big idea" of the paper, and then ranked the sentences. We then presented the user with the most important sentences in their paper, assembled into a holistic summary.
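The rate-then-rank pipeline above can be sketched roughly like this (a minimal, stdlib-only illustration that scores sentences by word frequency; our actual rating step was more involved, and the stopword list here is just a placeholder):

```python
import re
from collections import Counter

def summarize(text: str, max_sentences: int = 10) -> str:
    """Rate each sentence by how many high-frequency content words it
    contains, rank the sentences, and return the top ones in their
    original document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    stopwords = {"the", "a", "an", "of", "to", "in", "and", "is", "we", "that", "it"}
    freq = Counter(w for w in words if w not in stopwords)

    # Score each sentence by the total frequency of its content words.
    scored = []
    for i, sent in enumerate(sentences):
        score = sum(freq[w] for w in re.findall(r"[a-z']+", sent.lower()))
        scored.append((score, i, sent))

    # Keep the highest-scoring sentences, then restore document order.
    top = sorted(sorted(scored, reverse=True)[:max_sentences], key=lambda t: t[1])
    return " ".join(sent for _, _, sent in top)
```

Because the output is made of sentences copied verbatim from the input, this is extractive by construction: nothing in the summary is newly generated text.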
We wanted to go beyond the typical hackathon project of an article summarizer, so we integrated a relatively new deep learning architecture to provide question answering: the transformer. A transformer is a model architecture used for seq2seq modeling, and we wanted to train one that could extract information from text. We then found the Stanford Question Answering Dataset (SQuAD) and planned to train a transformer on it. However, training a model on SQuAD from scratch would have taken four days, and we only had 20 hours to go. To solve this, we applied transfer learning: we took a pre-trained transformer and performed hyperparameter tuning locally, which took ~4 hours. We then saved our model and integrated it with our Flask API to connect it to the frontend.
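The Flask layer that connects the saved model to the frontend looked roughly like this (a simplified sketch: the `/qa` route name is illustrative, and `answer_question` is a placeholder standing in for the fine-tuned transformer we saved):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def answer_question(context: str, question: str) -> str:
    """Placeholder for the saved QA model. In the real app this runs the
    fine-tuned transformer and returns "None" when no answer is found."""
    return "None"

@app.route("/qa", methods=["POST"])
def qa():
    # The frontend POSTs the pasted paper and the user's question as JSON.
    data = request.get_json(force=True)
    answer = answer_question(data["paper"], data["question"])
    return jsonify({"answer": answer})
```

Keeping the model behind a single JSON endpoint like this is what let us swap localhost + ngrok for a hosted deployment later without touching the frontend.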
Challenges We ran into
There were two major challenges we ran into (in addition to a gazillion bugs):
- As stated earlier, time constraints prevented us from training a transformer on SQuAD from scratch. We solved this through transfer learning, and learned that transfer learning is very commonly used in the field of Natural Language Processing.
- We had issues connecting our Flask API to Heroku, which is where we normally deploy the Flask APIs we write. Unfortunately, we didn't have time to debug this issue, so we ran the API on localhost and used ngrok tunneling to get an endpoint URL. When we put this into production, we plan on deploying it on either Heroku or PythonAnywhere, as those are more scalable.
Accomplishments that We're proud of
- We have a fully functional web application that can be deployed, which can help many students get a better understanding of deep learning.
- At the time of publishing this project, we believe ours is the only software that provides high-quality question answering specifically for academic papers; our project is novel!
- We learned about how we can apply transfer learning to natural language processing, something that neither of us had much experience with. This is applicable in the future, as most NLP algorithms are built via transfer learning.
What We learned
- We learned how to apply transfer learning to Natural Language Processing, which neither of us had done before. We plan on continuing to use transfer learning when working on other projects.
- We learned a lot about the applications of question answering, which is a relatively new field in NLP. We hope to apply this to other fields in the future.
What's next for Simplitize
We plan on hosting our API on PythonAnywhere, as it is far more scalable than running it on a local server. After that, we hope to deploy our website at simplitize.tech (we're in the process of buying the domain right now). We hope to get feedback on our project and then iterate.