Inspiration

We were inspired by our own difficulties searching through university faculty lists to find professors whose research might align with our interests. This project aims to develop an extensible tool that makes discovering these connections more natural and meaningful. Our manual searches are often poorly organized, and it is hard to keep organized notes on them, so we hoped this tool would also let us add information to our knowledge base as we discover it.

What it does

The project takes a list of professors and a set of paper abstracts they have written and uses a word-embedding model (BERT) to map each word of those abstracts into a high-dimensional space, in the hope that position in this space indicates the type of work being done. These data points are used to create a summary statistic of each researcher's research interests, and pairwise distances between researchers are calculated in this space. The project then structures this information as a graph database, representing professors as nodes and similarity distances as edges.
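To make the summary-and-distance step concrete, here is a minimal sketch. It assumes the summary statistic is the mean of a researcher's word vectors and that distance is cosine distance; the write-up does not say which statistic or metric was used, so both are illustrative choices. The tiny 3-dimensional vectors stand in for real BERT embeddings, and the names `mean_vector`, `cosine_distance`, and `pairwise_distances` are hypothetical.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(word_vectors_by_researcher):
    """Summarise each researcher by a mean vector, then compare every pair."""
    summaries = {
        name: mean_vector(vectors)
        for name, vectors in word_vectors_by_researcher.items()
    }
    return {
        (a, b): cosine_distance(summaries[a], summaries[b])
        for a, b in combinations(sorted(summaries), 2)
    }


# Toy stand-ins for per-word embeddings (assumption: real ones come from BERT).
toy_vectors = {
    "Prof. A": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "Prof. B": [[0.9, 0.1, 0.0], [1.0, 0.1, 0.0]],
    "Prof. C": [[0.0, 0.0, 1.0], [0.0, 0.2, 0.9]],
}
distances = pairwise_distances(toy_vectors)
```

In this toy data, "Prof. A" and "Prof. B" end up much closer to each other than either is to "Prof. C", which is the kind of signal the edges in the graph are meant to carry.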

How I built it

We used a Kaggle dataset of paper metadata scraped from arXiv to derive a set of authors, their papers, and the corresponding abstracts. Each abstract was read into a Python string, split into sentences, and passed through a pretrained BERT embedding model to extract a vector representation for each word. The common stop words ("stop_words") were then removed to focus the representation on more research-relevant content. Neo4j was then used to represent the result, with professor names as nodes and semantic similarity as edges.
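The text-handling and graph-loading ends of this pipeline might look roughly like the sketch below. It is not our actual code: the sentence splitter is a naive regular expression, the stop-word set is a short illustrative list, and stop words are filtered from the raw words here rather than from the embedded output. The relationship type `SIMILAR_TO`, the `distance` property, and the helper names are hypothetical; only the `Professor` node idea comes from the description above. The embedding step itself is left out.

```python
import re

# Illustrative stop-word list (assumption: the real list was larger).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "we"}


def split_sentences(abstract):
    """Naively split on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]


def content_words(sentence):
    """Lowercase the sentence, extract words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


# Parameterised Cypher: create both professors if missing, then the edge.
EDGE_QUERY = (
    "MERGE (a:Professor {name: $source}) "
    "MERGE (b:Professor {name: $target}) "
    "MERGE (a)-[r:SIMILAR_TO]->(b) "
    "SET r.distance = $distance"
)


def edge_parameters(distances):
    """Turn {(name_a, name_b): distance} into one parameter dict per edge."""
    return [
        {"source": a, "target": b, "distance": d}
        for (a, b), d in sorted(distances.items())
    ]


abstract = "We study the dynamics of neural networks. An analysis of convergence follows."
sentences = split_sentences(abstract)
words = [content_words(s) for s in sentences]
rows = edge_parameters({("Prof. A", "Prof. B"): 0.12})
```

Each dict in `rows` could then be passed as the parameters for `EDGE_QUERY` through whichever Neo4j client is in use. Keeping the query parameterised avoids building Cypher strings from professor names by hand.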

Challenges I ran into

We were all previously unfamiliar with databases and ran into many issues getting data in and out of them. Because of these issues we switched datasets multiple times, each time restructuring the pipeline from scratch. We also struggled to load data into Neo4j, trying multiple APIs that led to errors.

Accomplishments that I'm proud of

What I learned

What's next for Researcher Similarity: A network view

We hope to extend this later to include additional measures of similarity between researchers. These might include incoming and outgoing citations, collaborator counts, and perhaps some measure of online behavior (e.g. Twitter scraping).

Built With
