GeoScholar

Inspiration

As students and researchers, all of us have spent long hours in the library meticulously reading abstract after abstract, searching for gaps in the literature and for articles that support our research. We realized that many people face these problems, and we decided to build a web application to solve them. Additionally, we wanted to provide a way for researchers to publish articles about topics and places that have not been researched before.

What It Does

Our web application, GeoScholar, helps researchers identify neglected research areas and cuts down the time it takes to find articles. We give our users geographical control over their literature search by visualizing research articles as points on a map, allowing people to understand the "where" in their research. We believe that understanding the geographical location of studies prevents users from divorcing research from its geographic context and misapplying research conclusions to new studies. GeoScholar lets a user type a study topic into a search bar and returns points on a map. These points represent the study areas of the most influential research articles for the user-defined topic and the institutions these articles were published from. The map returned by our application highlights neglected research study areas and topics. It also helps users choose new study locations and publish research on them, increasing our knowledge by broadening our perspectives.

How We Built It

Libraries

scholarly: https://github.com/OrganicIrradiation/scholarly

Google Cloud Natural Language API: https://cloud.google.com/natural-language/

Getting data to create the Author Feature Layer: The first step is processing the search query. To get the information we need for the Author feature layer, we first use the scholarly API to create a generator object for the publications. From there we parse each publication to get the data we need to geocode. Parsing the publication was simple because it is a JSON-formatted object.

The difficulty in this part is processing the authors string into a format that increases the accuracy of the Google Cloud Natural Language API. First we used regex matching to strip the string of whitespace. This preserves the names in a "LastName,FirstName" format separated by the word "and." We then split the string so that we have a list of "LastName,FirstName" entries. Next we iterate through the list and split each name on the comma so that we can get the last name and first name individually. We then check whether the name has a length of one; if it does, the name also includes a middle initial or middle name. We then reconstruct each name as "FirstName MI LastName" and join the list into a single string of names separated by the word "and." This process significantly increased accuracy when using the Google Cloud Natural Language API: the salience of each term increased by 25%. Once the API finishes analyzing the string, we parse the entities and return a list of those classified as a Person or Organization.
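The name reordering described above can be sketched roughly as follows. This is a minimal illustration, not our original script: the function name `normalize_authors` is hypothetical, and it assumes the authors string looks like "Last, First M and Last, First". It splits on the word "and" before cleaning whitespace so that the separator is not lost.

```python
import re


def normalize_authors(author_string):
    """Turn 'Last, First M and Last, First' into 'First M Last and First Last'.

    Assumption: authors are separated by the word 'and' and each author
    is written as 'Last, Given names'.
    """
    # Split on the standalone word "and" that separates authors.
    raw_names = re.split(r"\s+and\s+", author_string.strip())
    rebuilt = []
    for raw in raw_names:
        # Collapse runs of whitespace inside the name.
        cleaned = re.sub(r"\s+", " ", raw).strip()
        if "," not in cleaned:
            # No comma: treat the name as already in natural order.
            rebuilt.append(cleaned)
            continue
        # Split "Last, Given" at the first comma only.
        last, given = (part.strip() for part in cleaned.split(",", 1))
        # Given names may include a middle initial or middle name.
        rebuilt.append(f"{given} {last}".strip())
    return " and ".join(rebuilt)


print(normalize_authors("Smith,  John A and Doe, Jane"))
# prints: John A Smith and Jane Doe
```

The resulting string reads like ordinary prose, which is the kind of input an entity analyzer is more likely to recognize as person names.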

Now that we have a list of authors, we can more accurately find author information such as their institution and the number of times they have been cited. After parsing through the list of authors, we have all the information we need about the authors and the publication. From here we put the data into a pandas data frame, query it for a list of institution names, batch geocode them, and append the X and Y coordinates of each institution to the data frame. Finally, the data is output as a CSV that is later used to create and upload the feature layer through the ArcGIS API for Python.
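The geocode-and-append step can be sketched as below. To keep the example free of dependencies, plain dictionaries and the standard `csv` module stand in for the pandas data frame, and `fake_batch_geocode` is a hypothetical stub in place of a real batch geocoding service. The column names and coordinates are made-up sample values.

```python
import csv
import io


def fake_batch_geocode(names):
    """Hypothetical stand-in for a batch geocoder: returns {name: (x, y)}."""
    lookup = {"Example University": (-117.19, 34.06)}  # made-up sample value
    return {name: lookup.get(name, (None, None)) for name in names}


def add_coordinates(rows, geocode=fake_batch_geocode):
    """Geocode each distinct institution once and copy X/Y onto every row."""
    names = sorted({row["institution"] for row in rows})
    coords = geocode(names)
    return [
        {**row, "X": coords[row["institution"]][0], "Y": coords[row["institution"]][1]}
        for row in rows
    ]


def to_csv(rows):
    """Serialize the rows to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


authors = [{"author": "Jane Doe", "institution": "Example University", "citations": 12}]
print(to_csv(add_coordinates(authors)))
```

Geocoding the distinct institution names in one batch, rather than row by row, avoids repeated lookups when several authors share an institution.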

Getting data to create the Publication Feature Layer: The majority of the data is obtained during the creation of the Author table. However, we used similar techniques to find the study location in the abstract of the publication. First we pass the abstract to the Google Cloud Natural Language API. We then parse the results and return the result with the highest salience. From there we add the study location information to a publication pandas table and then query it for all study locations. We then batch geocode the study locations, append the X and Y coordinates to the table, and output the data as a CSV.
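Selecting the study location from the analysis results might look like the sketch below. It assumes the entities have already been converted to dictionaries with `name`, `type`, and `salience` keys, mirroring the fields of an entity analysis response. Filtering to the `LOCATION` type is an assumption of this sketch, and `pick_study_location` is a hypothetical name. The sample entities are made up for illustration.

```python
def pick_study_location(entities):
    """Return the name of the most salient LOCATION entity, or None if absent."""
    # Assumption: only LOCATION entities are candidates for the study area.
    locations = [e for e in entities if e.get("type") == "LOCATION"]
    if not locations:
        return None
    # The entity with the highest salience is taken as the study location.
    return max(locations, key=lambda e: e.get("salience", 0.0))["name"]


sample_entities = [
    {"name": "researchers", "type": "PERSON", "salience": 0.5},
    {"name": "Lake Victoria", "type": "LOCATION", "salience": 0.3},
    {"name": "Kenya", "type": "LOCATION", "salience": 0.1},
]
print(pick_study_location(sample_entities))
# prints: Lake Victoria
```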

Challenges

Challenges included using machine learning to process author names, host institutions, and study locations; geocoding the institution names and study locations into feature layers; troubleshooting ArcGIS Enterprise; and finding a workaround for a geoprocessing service bug.

Accomplishments

Anna designed the web application and logo. Justin created a script that increased the accuracy of Google Cloud Natural Language analysis on the output of scholarly and batch geocoded the results to produce data that can be used to create feature layers. Molly ran Enterprise Builder, set up a federated ArcGIS Server, and wrote a Python script for a geoprocessing service. Thao managed the project and organized the presentation. Mudit wrote a Python script and set up the geoprocessing services.

What We Learned

We learned about APIs, user interface design, back-end development, how to work together as a team, and how to best leverage each other's strengths.

What's Next

The next version of GeoScholar would include a language translator. A translator would allow GeoScholar to translate the user’s search term into other languages and return search results in those languages. The point layer would then contain more articles and give researchers a better idea of where research on their topic has been done and which institutions have published related articles. We would also like to consult researchers who use our application and ask them which geoprocessing services they would like to see in the next update. We would sort through their ideas and build the most useful tools. Finally, we would like to include more citation-based analysis that identifies the locations of institutions that have cited an article and the study areas of the citing articles.