Inspiration

We were inspired by the problem statement offered by Merck & Co. It invited us to help doctors and other medical professionals read and understand research articles. Such a tool would help them grasp the content of articles faster, since they currently struggle to keep up with the constantly updated information on PubMed. It would let them learn about the latest findings in the medical field more easily and efficiently.

What it does

Our tool extracts text from research articles in PDF format and then summarizes it. We also built an image generator and a QR code generator whose output goes into the infographic. A reader who wants to learn more can scan the QR code to be taken to the full research article.
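To give a feel for the summarization step, here is a minimal sketch of a frequency-based extractive summarizer written with the standard library only. It is an illustration under assumptions, not our actual pipeline: the function names (`split_sentences`, `summarize`), the tiny stop-word list, and the scoring rule are all hypothetical.

```python
import re
from collections import Counter

# A small, illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "that", "it",
    "for", "on", "with", "as", "was", "were", "are", "this", "by", "be",
}


def content_words(text):
    """Lower-case the text and return its words, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def split_sentences(text):
    """Split text into sentences on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Normalise by length so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Sentences that repeat the document's most frequent content words score highest, so off-topic sentences tend to drop out. An NLP library such as spaCy would give better sentence splitting and tokenization than these regular expressions.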

How we built it

We used Python and HTML. We also relied on a number of modules: spaCy, PyPDF2, Craiyon, Jinja2, PictureGenerator, Fitz (PyMuPDF), lxml, and re. In addition, we used the Crossref API.
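As an illustration of the Crossref lookup, the sketch below builds the request URL for the public Crossref REST endpoint (`https://api.crossref.org/works/<DOI>`) and pulls a few fields out of the JSON response. The helper names (`crossref_url`, `parse_work`, `fetch_work`) and the choice of fields are assumptions made for this example, as is using the `https://doi.org/` link as the QR code target.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi):
    """Build the Crossref REST URL for one DOI (unsafe characters are percent-encoded)."""
    return CROSSREF_WORKS + quote(doi)


def author_name(author):
    """Format one Crossref author entry; organisations carry 'name' instead of given/family."""
    return author.get("name") or " ".join(
        filter(None, [author.get("given"), author.get("family")])
    )


def parse_work(payload):
    """Pull the fields an infographic might need out of a Crossref response dict."""
    msg = payload.get("message", {})
    titles = msg.get("title") or [""]
    journals = msg.get("container-title") or [""]
    return {
        "title": titles[0],
        "journal": journals[0],
        "authors": [author_name(a) for a in msg.get("author", [])],
        # Assumption: the QR code would point at the DOI resolver link.
        "url": "https://doi.org/" + msg.get("DOI", ""),
    }


def fetch_work(doi, timeout=10):
    """Fetch and parse metadata for a DOI (performs a network request)."""
    with urlopen(crossref_url(doi), timeout=timeout) as resp:
        return parse_work(json.load(resp))
```

Keeping `parse_work` separate from `fetch_work` means the parsing can be checked against a saved response without touching the network.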

Challenges we ran into

Extracting text from the PDF proved extremely difficult because the extracted text was not properly structured. We were also unable to remove the footnotes and references from the extracted text, which further degraded our output. Another challenge was collecting the article's details using its DOI (Digital Object Identifier).
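One possible way to approach two of these problems is sketched below: finding the DOI in the extracted text with a regular expression, and cutting the text at the last heading that looks like a reference list. This is a heuristic sketch, not what we shipped. The function names are hypothetical, the DOI pattern follows the commonly recommended form for modern DOIs and will miss some legacy ones, and footnotes are not handled at all.

```python
import re

# DOI pattern covering most modern DOIs (assumption: legacy forms are out of scope).
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

# A line that contains nothing but a typical back-matter heading.
BACK_MATTER_RE = re.compile(
    r"^[ \t]*(references|bibliography|works cited)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_doi(text):
    """Return the first DOI-like string in the text, or None."""
    match = DOI_RE.search(text)
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)") if match else None


def strip_back_matter(text):
    """Cut the text at the last heading that looks like a reference list."""
    matches = list(BACK_MATTER_RE.finditer(text))
    return text[: matches[-1].start()].rstrip() if matches else text
```

Cutting at the last matching heading rather than the first avoids truncating an article that merely mentions "References" on a line of its own early on, such as in a table of contents.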

Accomplishments that we're proud of

After a grueling 24 hours of coding, we were finally able to extract structured text from the PDF and generate an adequate infographic. This was a big accomplishment and a huge relief for us. Generating the image and QR codes, by comparison, was more straightforward.

What we learned

We learned how to apply NLP and realized that extracting text from PDFs is a very lengthy process. Despite the struggle, we learned more about modules such as spaCy and PyPDF2. If we run into similar problems in the future, we will be able to solve them (or at least have an idea of how to) more easily.

What's next for MSD Infographic Generator

With more time, we would look for better and more efficient ways to extract the text from the PDF. We would also explore better ways to turn the extracted and summarized text into more appealing infographics that could be used in place of the research article itself.
