Sci-Search

Inspiration

As regular users of literature search sites, we know firsthand how difficult it is to find helpful literature. The bottleneck is especially severe in genomics: the sheer volume of publications makes progress in scientific research cumbersome. We hope to ease this with Sci-Search.

What it does

Sci-Search queries and parses four repositories of biological scientific literature, then uses our Paper Prioritization Index (PPI) to rank the results as they are retrieved. We produce metadata for each paper via multiple web scrapers, and we use natural language processing to generate relevant keywords and identify gene mentions.

How we built it

To build Sci-Search, we used Python and BeautifulSoup to create four scrapers that retrieve scientific articles from Google Scholar, PubMed, bioRxiv, and medRxiv. Next, we used the RAKE NLP algorithm for rapid keyword extraction and wrote regular expressions to extract mentioned genes. We then queried the UMLS CUI database to retrieve known diseases and wrote a prioritization algorithm. For the front end, we used the React JavaScript library and the Ant Design package.
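To illustrate the keyword extraction step, here is a minimal RAKE-style sketch using only the Python standard library. It is not our actual code: the tiny stopword list, the function names, and the tie-breaking rule are assumptions made for the example. The idea is to split text into candidate phrases at stopwords and punctuation, score each word by its co-occurrence degree divided by its frequency, and score each phrase by the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real RAKE setups use a much larger one).
STOPWORDS = {"a", "an", "and", "are", "as", "at", "by", "for",
             "in", "is", "of", "on", "or", "the", "to", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\[\]\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9\-]*", fragment):
            if word in STOPWORDS:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_keywords(text, top_n=5):
    """Return the top_n (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts every word in the phrase, including the word itself.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    sample = ("BRCA1 mutations are associated with hereditary breast cancer. "
              "Breast cancer risk is elevated in carriers of BRCA1 mutations.")
    for phrase, score in rake_keywords(sample, top_n=3):
        print(f"{score:5.1f}  {phrase}")
```

Longer phrases made of words that co-occur often score highest, which is why multi-word terms tend to outrank single words.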

Challenges we ran into

We ran into several challenges. One was the inconsistency of the data obtained from the databases. Google Scholar and medRxiv did not provide an API, so we had to build web scrapers to pull the data. PubMed and bioRxiv each returned data in a different format, which made normalizing the results frustrating. Another challenge was getting our prioritization model working. We had to use regular expressions to extract gene names from the text and then query multiple databases for diseases correlated with those genes. We then had to search back through the paper to see whether any of those diseases were mentioned.
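The gene-to-disease check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PPI itself: the regular expression is a hypothetical pattern for gene symbols that end in a digit (so it misses symbols such as APOE and can produce false positives), the in-memory `GENE_TO_DISEASES` dictionary is toy data standing in for the database lookups, and the score is simply the number of associated diseases found in the text.

```python
import re

# Hypothetical pattern: a capital letter, 1-5 capitals or digits, then a final digit.
# Matches symbols like BRCA1 or TP53 and skips plain acronyms such as DNA.
GENE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{1,5}\d\b")

# Toy stand-in for the database lookups (assumption for this example only).
GENE_TO_DISEASES = {
    "BRCA1": ["breast cancer", "ovarian cancer"],
    "TP53": ["li-fraumeni syndrome"],
}


def extract_genes(text):
    """Return the unique gene-like symbols found in the text, sorted."""
    return sorted(set(GENE_PATTERN.findall(text)))


def score_paper(text, gene_to_diseases=GENE_TO_DISEASES):
    """Count diseases linked to the paper's genes that the paper also mentions."""
    lowered = text.lower()
    hits = {}
    for gene in extract_genes(text):
        # Search back through the paper for each disease linked to this gene.
        mentioned = [d for d in gene_to_diseases.get(gene, []) if d in lowered]
        if mentioned:
            hits[gene] = mentioned
    score = sum(len(diseases) for diseases in hits.values())
    return score, hits


if __name__ == "__main__":
    abstract = ("We study BRCA1 and TP53 variants in DNA "
                "from patients with breast cancer.")
    print(extract_genes(abstract))
    print(score_paper(abstract))
```

A paper that names both a gene and a disease known to be linked to it gets a higher score, so it can be ranked above papers that mention the gene only in passing.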

Accomplishments that we're proud of

We overcame many of the challenges we ran into, and in the process gained valuable knowledge and experience. We were able to query data from four databases, prioritize search results using a novel algorithm, and generate several analytics for each source, including key terms and gene names. With our solution, the user can enter a search term or phrase and quickly get the most relevant results from four different databases.

What we learned

Coming into this challenge, we had little experience with APIs, web scraping, or React. All of us improved greatly in each of these areas during the hackathon. We had to build not one but two web scrapers, develop an entire React app, and both integrate and design many APIs for our solution.

What's next for Sci-Search

In the future, we plan to use machine learning to analyze our articles and improve our keyword and gene extraction. We will also supplement our prioritization index with machine-learning-based ranking (e.g., the TextRank algorithm). Finally, we plan to develop a user login system so that researchers can manage ongoing projects.