Inspiration
The project is inspired by CoronaWhy.org, a 2020 initiative aimed at revolutionizing how we uncover and interact with information (in 2020 it was specific to COVID research). We wanted to build this for COVID researchers back then, but didn't have LLMs to do the magic. Now we can! :)
What it does
It ingests 120,000 arXiv research papers, builds a knowledge graph, then matches your specific area of interest to the cluster of the graph that is most relevant to you. From there, you expand your area of research through the connectivity of the graph, and an agent performs ReAct task orchestration to research those topics for you. The end result is a creative synthesis of research topics relevant to your field but outside of your core focus.
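To make the matching and expansion steps concrete, here is a minimal sketch in plain Python. It is not the project's actual code: the cluster names, the toy 3-dimensional vectors, the `closest_cluster` and `expand` helpers, and the two-hop limit are all assumptions for illustration. The idea is to pick the cluster whose centroid is most similar to the interest embedding, then walk outward along graph edges to collect nearby topics.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_cluster(interest_vec, centroids):
    """Return the name of the cluster whose centroid best matches the interest."""
    return max(centroids, key=lambda name: cosine(interest_vec, centroids[name]))


def expand(graph, start, max_hops=2):
    """Breadth-first walk: map each topic within max_hops of start to its hop count."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not walk past the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return {node: hops for node, hops in seen.items() if hops > 0}


# Hypothetical toy data; real centroids would come from the paper embeddings.
centroids = {
    "graph learning": [0.9, 0.1, 0.0],
    "protein folding": [0.1, 0.9, 0.2],
    "robotics": [0.0, 0.2, 0.9],
}
graph = {
    "graph learning": ["protein folding", "recommender systems"],
    "protein folding": ["drug discovery"],
    "drug discovery": ["clinical trials"],
}

interest = [0.8, 0.2, 0.1]
home = closest_cluster(interest, centroids)
neighbours = expand(graph, home, max_hops=2)
print(home, neighbours)
```

The topics returned by `expand` are the candidates that sit near your field but outside your core focus; in a setup like the one described, they would be handed to the agent for further research.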
How we built it
We ingested 120k papers into MongoDB, ran indexing using Fireworks.ai, and loaded the embeddings into the Qdrant vector database. We ran entity extraction and triplet formulation using Mixtral hosted on Fireworks to build a knowledge graph, and we are ingesting this super graph into Neo4j using the LangChain adapter. We use LlamaIndex and LangChain for vector search, and a ReAct plugin to perform the task orchestration generated by Mixtral. Then we visualize the result using a Mixtral triplet-to-graph ASCII pipeline. The app is hosted on Streamlit.
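As a rough illustration of the triplet-to-graph step, the sketch below parses triplets from model output, deduplicates them into an adjacency map, and prints a simple ASCII view. It uses only the standard library and makes several assumptions that are not from the project: the `(subject | relation | object)` line format, the helper names, and the sample text are hypothetical, and the real pipeline relies on Mixtral and Neo4j rather than this code.

```python
from collections import defaultdict


def parse_triplets(text):
    """Extract (subject, relation, object) tuples, skipping malformed lines."""
    triplets = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.strip().strip("()").split("|")]
        if len(parts) == 3 and all(parts):
            triplets.append(tuple(parts))
    return triplets


def build_graph(triplets):
    """Group edges by subject, dropping exact duplicate edges."""
    graph = defaultdict(list)
    for subj, rel, obj in triplets:
        edge = (rel, obj)
        if edge not in graph[subj]:
            graph[subj].append(edge)
    return dict(graph)


def render_ascii(graph):
    """Render each subject followed by its outgoing edges as indented arrows."""
    lines = []
    for subj in sorted(graph):
        lines.append(subj)
        for rel, obj in graph[subj]:
            lines.append(f"  --[{rel}]--> {obj}")
    return "\n".join(lines)


# Hypothetical model output, including a junk line and a repeated triplet.
raw = """(graph neural network | applied to | molecule property prediction)
(graph neural network | extends | message passing)
not a triplet
(graph neural network | extends | message passing)
(message passing | used in | belief propagation)"""

triplets = parse_triplets(raw)
graph = build_graph(triplets)
ascii_art = render_ascii(graph)
print(ascii_art)
```

Dropping malformed lines and repeated edges at parse time is one simple way to keep noisy model output from inflating the graph. It does not address the harder problem of two different surface forms referring to the same entity.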
Challenges we ran into
- MongoDB vector search is tough to use, and Qdrant doesn't have a great UI
- NER is fairly easy, but doing it for the triplet generation use case is really hard; disambiguation and consistency are the key challenges
- Neo4j is a whole different world of complexity when dealing with structured data and knowledge graphs
- Deploying Streamlit apps is always fun, but it was challenging to figure out how to build a more interactive UI; we gave up on that due to time constraints
- We wanted to use Arize for evals but ran out of time; it wasn't quick to set up
- We wanted to fine-tune Mistral for the triplet generation use case but didn't have enough time
- We wanted to use OpenPipe for the graph generation use case but ran out of time
Accomplishments that we're proud of
- We built an end-to-end agentic workflow pipeline in less than 24 hours, and it works! :D
What we learned
- We need to upskill in graph-based DBs
- We need better eval tools
What's next for Discovery Agent
- Improving the NER and triplet pipeline
- Interactive mode of engagement on Streamlit
- Expanding MongoDB usage