Inspiration

Getting access to research opportunities is challenging. Knowing what to work on is even more difficult. As high school students ourselves, we experienced the difficulties of looking for research topics firsthand. The advice we received from others was simple: we had to just keep reading research papers until we stumbled upon something we found interesting. And although we agreed with the general premise of reading being the basis for scientific inquiry, we thought there had to be a more targeted approach. The current method was simply far too inefficient. We witnessed many of our peers attempting to Google a topic and being swamped with more papers than they could count—for someone just getting started with research, this can be quite intimidating. This was not only a problem of access, but also one of equity. We noticed that it was far easier to overcome this bottleneck of information overload if students could consult a mentor experienced in the research process. Without such a helping hand, we saw students become overwhelmed by the sheer abundance of options and opt to simply not embark on a research journey. Academatch was our attempt to solve a problem that we experienced: put simply, how can we help students find research areas that excite them?

What it does

Academatch is, at its core, an advanced recommendation system. It allows a student to input text that describes their existing research interests and subsequently provides a curated list of five professors and their papers (which might be good places to start reading). But how does this happen? The text that the student provides is converted into a vector of numerical values through a process called embedding. This is useful since it takes into account not only the words the student uses but also the context in which they use them. The same method is used to create vectors of abstracts that we scraped from a collection of published scientific papers. Storing all of the vectors in a vector database allows us to find the vectors that most closely resemble the vector of the student's inputted text. From start to finish, this means a user can simply type in a couple of broad areas they are interested in and receive five curated topics to further explore.
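The "find the closest vectors" step can be sketched in plain Python. The vectors, paper names, and dimensions below are toy placeholders (real embeddings have hundreds of dimensions), and cosine similarity is one common choice of closeness measure:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_matches(query_vec, paper_vecs, top_k=5):
    """Return the top_k paper ids whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), paper_id)
              for paper_id, vec in paper_vecs.items()]
    scored.sort(reverse=True)
    return [paper_id for _, paper_id in scored[:top_k]]

# Toy example: three "paper" embeddings and one "student interest" embedding.
papers = {
    "quantum-computing": [0.9, 0.1, 0.0],
    "marine-biology":    [0.0, 0.8, 0.6],
    "graph-theory":      [0.7, 0.3, 0.1],
}
student = [1.0, 0.2, 0.0]
print(top_matches(student, papers, top_k=2))
# → ['quantum-computing', 'graph-theory']
```

A vector database like Pinecone does essentially this comparison, but with indexing tricks that keep it fast over millions of stored vectors.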

How we built it

Without abstract data, it would be impossible for us to create vectors at all. As such, our first step was not only to find a source for academic papers but also to aggregate them into a database. We decided to draw these papers from the open-access repository ArXiv, which offers public API access to promote interoperability. Using the API, we created a list of abstracts from professors at three of North Carolina's public universities: UNC-Chapel Hill, NC State, and UNC-Charlotte. The main component of our project was implementing the embedding process to convert each of these abstracts into a vector. To decide which Large Language Model (LLM) to use for the embeddings, we tested many different models and judged (based on our own perceptions) whether they produced reasonable vectors—that is, whether vectors appeared similar when we knew the underlying abstracts were somewhat similar. After trying numerous BERT variants and Doc2vec, we decided that DistilBERT best served our needs for embedding academic papers specifically. Both DistilBERT and the ArXiv API were accessed via Python code. From there, we stored the vectors in a vector database called Pinecone. We also built the front end for the Academatch website using HTML and CSS. When a user submits text to the front end, we convert that text into a vector and find which of the vectors already stored in Pinecone most closely match it. To do this, we used Flask to pass the user's text to the backend Python code, which embeds it with DistilBERT and queries the Pinecone database.
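The abstract-collection step might look something like the sketch below, using only the Python standard library. The function names and the author-search query format are illustrative assumptions rather than our exact code; the live request is left as a comment, and the demo parses a minimal offline sample of ArXiv's Atom response format:

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def build_query_url(author, max_results=100):
    """Build an arXiv API query URL for papers by a given author."""
    params = urllib.parse.urlencode({
        "search_query": f'au:"{author}"',
        "start": 0,
        "max_results": max_results,
    })
    return f"http://export.arxiv.org/api/query?{params}"

def parse_abstracts(atom_xml):
    """Pull (title, abstract) pairs out of an arXiv Atom feed."""
    root = ET.fromstring(atom_xml)
    results = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS).strip()
        abstract = entry.findtext("atom:summary", default="", namespaces=ATOM_NS).strip()
        results.append((title, abstract))
    return results

# A live fetch would look like:
#   with urllib.request.urlopen(build_query_url("Jane Doe")) as resp:
#       papers = parse_abstracts(resp.read())
# Offline demo with a minimal Atom feed:
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Toy Paper</title><summary>A toy abstract.</summary></entry>
</feed>"""
print(parse_abstracts(sample))  # → [('Toy Paper', 'A toy abstract.')]
```

Looping `build_query_url` over the faculty list and feeding each `parse_abstracts` result into the embedding step yields the abstract database described above.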

Challenges we ran into

Chronologically, the first difficulty arose when using the ArXiv API: we discovered a limit on how many requests we could make per hour. Unfortunately, there was no better way to solve the problem than to make the maximum number of requests per hour and wait until we could make more, which took a while. This was also the reason we limited our professor and paper database to only three universities—although we wanted a larger dataset, it would have taken far too long to scrape the data from ArXiv. On a different note, we also had to deal with tradeoffs between speed and space when performing embeddings. We had to limit the size of our vectors to keep the program computationally fast, but also needed to determine the point at which the vectors were large enough to yield accurate results. Additionally, we had difficulty deploying Academatch to the web since we could not find a way to host dynamic web pages for free. There exist a myriad of free static hosting services, but since we needed Flask to tie together the front and back end, none of these options was feasible. The few dynamic hosting options available required upgrading to a paid plan in order to host anything memory-intensive (as is the case with Academatch). We ran into this when trying to host Academatch on Repl.it, a popular host for dynamic pages. To overcome this challenge, we had to resort to simply hosting the app locally on our laptops.
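The "request the maximum, then wait" workaround can be expressed as a small batching helper. The `per_window` and `window_seconds` values below are placeholders, not ArXiv's actual limits, and `do_fetch` stands in for whatever function performs one API call:

```python
import time

def chunked(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def fetch_all(requests, do_fetch, per_window=100, window_seconds=3600,
              sleep=time.sleep):
    """Issue requests in rate-limit-sized batches, pausing between windows."""
    results = []
    batches = chunked(requests, per_window)
    for i, batch in enumerate(batches):
        for req in batch:
            results.append(do_fetch(req))
        if i < len(batches) - 1:  # no need to wait after the final batch
            sleep(window_seconds)
    return results
```

Passing `sleep` in as a parameter also makes the throttling logic testable without actually waiting an hour between batches.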

Accomplishments that we're proud of

We are very proud of many aspects of our project, but a few stand out to us the most. Firstly, bridging between and interfacing with multiple technologies was something our team did not have much experience doing, so the wide array of technologies that Academatch uses is an achievement we hold in high regard. In particular, we had to spend a significant amount of time perusing documentation to figure out how to store user input in the Pinecone database in a way that is accessible through Flask. We were also very happy with how we were able to understand and use the ArXiv API to scrape the data and build the overall vector database. Finally, we were glad to see that the embeddings were robust enough to provide insight even for students who do not give very detailed responses to the initial prompt about their baseline interests.

What we learned

The development of Academatch has been a learning journey that has taught us web scraping skills, provided experience with large language models, and introduced us to word vectorization and vector databases. We have also built on our foundation of HTML and CSS skills while creating the Academatch website.

What's next for Academatch

First and foremost, we would like to develop the capability to host a dynamic page. Without such a platform, Academatch cannot scale: people can only use it if they run it locally on their personal devices. This might mean finding grants or sponsorships to cover hosting, or finding creative alternative methods for hosting dynamic pages. Along with that change, we want feedback on the usability of the webpage and therefore want to create more opportunities for user testing. We recognize that there are many use cases we did not anticipate that produce fairly inaccurate results—the only way to catch such edge cases is broad adoption of the technology, so we can see how different people use the product. We would also like to expand the database so that more than just North Carolina schools are represented. This would likely require us to collaborate with ArXiv directly to bypass the requests-per-hour limit. Simultaneously, however, we recognize that many people would like recommendations tailored to their geographical location, so we would like to implement filters that let students narrow down papers before Academatch searches through them.

Which large language model and datasets did you use within the project?

The dataset used in this project was built through web scraping. We first downloaded a list of all faculty members at UNC-Chapel Hill, NC State University, and UNC Charlotte from the UNC Salary Information Database. Using this list of faculty, we then queried the ArXiv API to find research abstracts written by professors at these specific institutions. The abstracts were then combined with the list of professors into one large dataset.
The large language model used in this project was DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers). DistilBERT encodings were applied to the abstracts in our dataset to obtain vector embeddings.
The vector embeddings were then uploaded to a vector database, which is another dataset we used in our project.
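A transformer encoder like DistilBERT outputs one vector per token, so the per-token vectors must be collapsed into a single abstract-level embedding before upload. Mean pooling is one common way to do this (an illustrative choice here, not necessarily our exact pooling), sketched below on a toy example:

```python
def mean_pool(token_embeddings):
    """Average per-token vectors into one fixed-size document vector."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(tok[i] for tok in token_embeddings) / n for i in range(dim)]

# Toy 3-token, 4-dimensional example (real DistilBERT vectors are 768-dimensional):
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 2.0, 0.0],
    [2.0, 4.0, 2.0, 2.0],
]
print(mean_pool(tokens))  # → [2.0, 2.0, 2.0, 2.0]
```

The pooled vector is what gets stored in the database and compared against the student's pooled query vector.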

Are there any specific testing instructions that the judges should know about?

We were unable to host Academatch as a website, so the program must be run locally.

Built With

css, distilbert, flask, html, pinecone, python
