Discovery Agent

Inspiration

The project is inspired by the 2020 CoronaWhy.org initiative, which aimed to revolutionize how we uncover and interact with information (in 2020 it was specific to COVID research). We wanted to do this back then for COVID researchers, but we didn't have LLMs to do the magic. Now we can! :)

What it does

It ingests 120,000 arXiv research papers, builds a knowledge graph, and matches your specific area of interest to the cluster of the graph that is most relevant to you. You then expand your area of research through the connectivity of the graph, and an agent performs ReAct task orchestration to research those topics for you. The end result is a creative synthesis of research topics relevant to your field but outside your core focus.
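
The matching and expansion steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's actual code: the cluster names, the three-dimensional toy vectors, and the helper names `closest_cluster` and `expand` are all hypothetical, and a real run would use embeddings from the vector database and edges from the knowledge graph.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_cluster(interest_vec, cluster_centroids):
    """Return the name of the cluster whose centroid is most similar."""
    return max(
        cluster_centroids,
        key=lambda name: cosine(interest_vec, cluster_centroids[name]),
    )


def expand(graph, start, max_hops=2):
    """Breadth-first walk: map each reachable topic to its hop distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not walk past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    seen.pop(start)  # the starting cluster is not an "adjacent" topic
    return seen


# Hypothetical toy data standing in for real embeddings and graph edges.
centroids = {
    "graph learning": [0.9, 0.1, 0.0],
    "protein folding": [0.1, 0.9, 0.2],
    "robotics": [0.0, 0.2, 0.9],
}
graph = {
    "graph learning": ["protein folding"],
    "protein folding": ["graph learning", "robotics"],
    "robotics": ["protein folding"],
}

home = closest_cluster([0.8, 0.2, 0.1], centroids)
adjacent = expand(graph, home, max_hops=2)
print(home, adjacent)
```

Topics at a larger hop distance are further from your core focus, so the hop limit is one simple knob for how adventurous the suggestions get.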

How we built it

We ingested 120k papers into MongoDB, ran indexing using Fireworks.ai, and loaded the embeddings into the Qdrant vector database. We ran entity extraction and triplet formulation using Mixtral hosted on Fireworks to build a knowledge graph, and we are ingesting this super graph into Neo4j using the LangChain adapter. We use LlamaIndex and LangChain for vector search, and a ReAct plugin performs the task orchestration, which is generated by Mixtral. We then visualize the result using a Mixtral triplet-to-graph ASCII pipeline, and we host it all as a Streamlit app.
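
To make the triplet-to-graph step concrete, here is a minimal sketch of turning triplets into an ASCII view. It assumes the model returns one `(subject, relation, object)` triplet per line, which is a hypothetical output format, and it swaps the Mixtral-driven rendering described above for a deterministic plain-Python renderer. The function names and the sample text are illustrative only.

```python
import re
from collections import defaultdict

# Matches "(subject, relation, object)"; fields may not contain commas or parentheses.
TRIPLET_RE = re.compile(
    r"\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)"
)


def parse_triplets(text):
    """Extract (subject, relation, object) tuples from raw model output."""
    return [tuple(match) for match in TRIPLET_RE.findall(text)]


def build_graph(triplets):
    """Group outgoing edges by subject."""
    graph = defaultdict(list)
    for subject, relation, obj in triplets:
        graph[subject].append((relation, obj))
    return dict(graph)


def render_ascii(graph):
    """Render each subject followed by its labelled outgoing edges."""
    lines = []
    for subject in sorted(graph):
        lines.append(subject)
        for relation, obj in graph[subject]:
            lines.append(f"  --[{relation}]--> {obj}")
    return "\n".join(lines)


# Hypothetical model output, not taken from the real pipeline.
sample_output = """
(graph neural network, applied to, protein folding)
(graph neural network, uses, message passing)
(protein folding, studied by, structural biology)
"""

triplets = parse_triplets(sample_output)
graph = build_graph(triplets)
print(render_ascii(graph))
```

The same list of triplets could also be handed to a graph database loader; the ASCII view is just the cheapest way to eyeball what was extracted.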

Challenges we ran into

MongoDB vector search is tough to use, while Qdrant doesn't have a great UI.
NER is kinda easy, but doing it for the triplet generation use case is really hard; disambiguation and consistency are the key challenges.
Neo4j is a whole different world of complexity when dealing with structured data and knowledge graphs.
Deploying Streamlit apps is always fun, but figuring out how to build a more interactive UI is challenging; we gave up on that due to time constraints.
We wanted to use Arize for evals but ran out of time; it wasn't as fast to set up as we hoped.
We wanted to fine-tune Mistral for the triplet generation use case but didn't have enough time.
We wanted to use OpenPipe for the graph generation use case but ran out of time.
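
The disambiguation and consistency problem mentioned in the list above shows up as the same entity appearing under several surface forms. One cheap mitigation, sketched here as an assumption rather than something the project shipped, is to normalize entities against a hand-made alias table before merging triplets. The `ALIASES` table, the helper names, and the sample triplets are all hypothetical.

```python
import re

# Hypothetical alias table; a real one would be curated or learned.
ALIASES = {
    "gnn": "graph neural network",
    "gnns": "graph neural network",
    "graph neural networks": "graph neural network",
}


def canonical(entity):
    """Lowercase, strip punctuation, collapse spaces, then apply aliases."""
    key = re.sub(r"[^a-z0-9 ]+", " ", entity.lower())
    key = " ".join(key.split())
    return ALIASES.get(key, key)


def normalize_triplets(triplets):
    """Canonicalize entities and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for subject, relation, obj in triplets:
        item = (
            canonical(subject),
            " ".join(relation.lower().split()),
            canonical(obj),
        )
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


raw = [
    ("GNNs", "applied to", "Protein Folding"),
    ("Graph Neural Networks", "Applied to", "protein-folding"),
    ("GNN", "uses", "message passing"),
]
clean = normalize_triplets(raw)
print(clean)
```

String normalization like this only catches spelling and casing variants; true disambiguation (the same name meaning different things in different papers) would still need context from the model or the embeddings.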