Inspiration

Getting research experience is hard, and one of the most frustrating parts for me was navigating outdated academic directories. School directories typically list only a limited number of professors, with research interests and email addresses scattered across different sources. With modern tools, we can do much better than surfing the web for hours. That's why I created ProfMatch.

What it does

When you type a natural-language prompt into the search bar, ProfMatch matches you with the 10 professors whose work is most similar to the interests you described. You can keep searching and get unique results each time.

How I built it

The core app was built using React on the frontend and a Python + FastAPI web server on the backend. The app is hosted on Railway (https://railway.app/). The search functionality comes mainly from vector search in a Pinecone vector database, and the additional metadata that gets displayed (such as descriptions and emails) is stored in Supabase (PostgreSQL).

To get the information and store it in the database, I use a combination of Python, the Perplexity API and Selenium. I created a parallel web scraper that starts by scraping researcher profiles on Google Scholar and extracting information about their publications using Selenium. I then used this information to prompt the Perplexity API to verify the following:

  • Is the researcher primarily doing research, or are they moving on to other ventures?
  • Is the researcher retired or emeritus? (They most likely won't respond to emails.)
  • Is the researcher's listed affiliation still valid? (Websites rarely update this.)
  • Is the researcher a celebrity/famous researcher? (Andrej Karpathy or Andrew Ng probably won't respond to a high schooler.)

If the researcher is valid based on these Perplexity queries, and if they have publications from the current year (so as of now, 2025), I query Perplexity again to get their email. I then add their email, name and Google Scholar profile to a Supabase database.
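The filtering rule above can be sketched as a small predicate. This is only an illustration under assumptions: the `ResearcherCheck` class, its field names and `is_eligible` are hypothetical, and the real answers would come from the Perplexity responses rather than hand-filled booleans.

```python
from dataclasses import dataclass

CURRENT_YEAR = 2025  # the write-up treats 2025 as "the current year"


@dataclass
class ResearcherCheck:
    """Hypothetical container for the answers to the verification questions."""
    actively_researching: bool
    retired_or_emeritus: bool
    affiliation_confirmed: bool
    is_celebrity: bool
    latest_publication_year: int


def is_eligible(check: ResearcherCheck, current_year: int = CURRENT_YEAR) -> bool:
    """Return True only if every check passes and there is a publication this year."""
    return (
        check.actively_researching
        and not check.retired_or_emeritus
        and check.affiliation_confirmed
        and not check.is_celebrity
        and check.latest_publication_year >= current_year
    )
```

Only researchers for whom `is_eligible` returns True would go on to the email lookup step.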

Lastly, I have a script that scans the Supabase database for entries that haven't been added to the Pinecone database yet. For each one, I embed a combination of their publication information from Google Scholar and a description generated by the Perplexity API, and add the vector to Pinecone under their UUID. This way I can store metadata in Supabase while matching with Pinecone, avoiding Pinecone's storage restrictions.
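Here is a rough sketch of that split between the vector index and the metadata store. It does not call Pinecone or Supabase: `TinyVectorIndex` is an in-memory stand-in for the vector database, a plain dict plays the role of the metadata table, and `toy_embed` is a placeholder for a real embedding model. All of these names, and the `exclude` parameter for skipping already-shown results, are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorIndex:
    """In-memory stand-in for the vector database: it stores only id -> vector."""

    def __init__(self):
        self.vectors = {}

    def upsert(self, item_id, vector):
        self.vectors[item_id] = vector

    def query(self, vector, top_k=10, exclude=()):
        # Score every stored vector, skipping ids the caller has already seen.
        scored = [(cosine(vector, v), i) for i, v in self.vectors.items()
                  if i not in exclude]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [i for _, i in scored[:top_k]]


def toy_embed(text, vocab=("protein", "robot", "graph")):
    """Placeholder embedding: keyword counts instead of a real model."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in vocab]


def sync_missing(metadata_rows, index, embed=toy_embed):
    """Embed every row that is not in the index yet; return how many were added."""
    added = 0
    for row_id, row in metadata_rows.items():
        if row_id not in index.vectors:
            index.upsert(row_id, embed(row["publications"] + " " + row["description"]))
            added += 1
    return added


def search(query_text, index, metadata_rows, top_k=10, exclude=(), embed=toy_embed):
    """Match by vector, then look the full records up in the metadata store."""
    ids = index.query(embed(query_text), top_k=top_k, exclude=exclude)
    return [metadata_rows[i] for i in ids]
```

The index only ever holds ids and vectors, so anything shown to the user (name, email, description) is fetched from the metadata store after matching.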

Challenges I ran into

The biggest challenge I ran into was speeding up the data collection process with web scraping. Because Google uses CAPTCHAs, it is impossible to run generic linear web scraping with no delays. At first I considered using something like Anthropic's computer use, but the pricing for that is too expensive.

I then settled on this strategy:

  • For the time being, I had web scraping with no delays, but I did the processing and database insertions in parallel (using multiprocessing). This means that while I am getting new names, I am adding existing ones. This roughly doubled the number of researchers I could add, from around 400 to 849 total.
  • The way to scale this would be to add rotating proxies. My code includes simple integration for rotating proxies. However, the cost of implementing it would be too much for this hackathon, since I would have to set up VPSs through multiple cloud providers and use Squid to set up the proxies. With this implemented, it could potentially scrape 3000-5000 researchers a day. Running multiple instances of this scraper would also be a big speedup.
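The first point is a producer/consumer pipeline: one loop gathers names while workers process and store the ones already found. The sketch below shows that shape under assumptions. It uses threads and `queue.Queue` from the standard library to stay short and self-contained, whereas the project itself uses multiprocessing; `run_pipeline`, `process` and `store` are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel that tells a worker to stop


def run_pipeline(names, process, store, workers=4):
    """Feed names to a pool of workers that process and store them concurrently."""
    tasks = queue.Queue()
    lock = threading.Lock()

    def worker():
        while True:
            name = tasks.get()
            if name is _DONE:
                break
            record = process(name)  # e.g. verification and email lookup
            with lock:              # stand-in for the database insertion
                store.append(record)

    pool = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in pool:
        thread.start()

    # Producer: `names` can be a generator that yields as the scraper finds people,
    # so workers start on early names while later ones are still being gathered.
    for name in names:
        tasks.put(name)

    for _ in pool:
        tasks.put(_DONE)  # one stop signal per worker
    for thread in pool:
        thread.join()
    return store
```

For example, `run_pipeline(iter(["a", "b"]), str.upper, [])` returns a list containing "A" and "B" in whatever order the workers finished.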

What I learned

  • How to containerize and deploy applications. While I wasn't using a robust cloud provider like AWS, I still had to Dockerize my application.
  • How to web scrape using Python, and the speedups and tradeoffs behind it.
  • How to use a database to preprocess and store information, improving response times for users.

What's next for ProfMatch

Technical improvements:

  • Add proxy rotation to the web scraper
  • Deploy the web scraper to the cloud
  • Also deploy the database listener (which adds everything to Pinecone), and create a cron job that checks the database
  • Potentially rewrite the web scraper in Go for faster scraping
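As a rough idea of what the proxy rotation could look like, here is a minimal round-robin sketch. It is an assumption about the design, not the existing integration: `ProxyRotator` and its methods are hypothetical, and the proxy strings are whatever addresses the scraper would be configured with.

```python
import itertools


class ProxyRotator:
    """Hand out proxies in round-robin order, skipping any that were banned."""

    def __init__(self, proxies):
        self._all = list(proxies)
        if not self._all:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(self._all)
        self._banned = set()

    def ban(self, proxy):
        """Mark a proxy as unusable, e.g. after it starts hitting CAPTCHAs."""
        self._banned.add(proxy)

    def next_proxy(self):
        """Return the next usable proxy, or raise if every proxy is banned."""
        for _ in range(len(self._all)):
            proxy = next(self._cycle)
            if proxy not in self._banned:
                return proxy
        raise RuntimeError("no usable proxies left")
```

Each scraper request would call `next_proxy()` and, on a block, call `ban()` on the proxy that was just used.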

Apart from that, the main thing is reaching out to users and getting feedback. I will probably use something like LinkedIn to post about this, as well as reach out to the following demographics:

  • high school / undergrad students looking for research
  • grad students looking for advisors
  • researchers looking for collaborators
  • conferences / journals looking for people to peer review papers
  • R&D, biotech or deep tech companies who recruit mainly PhDs

I see a lot of potential applications for this project, but it all depends on user feedback and what the majority of people want from the app.
