Inspiration

ScholarGraph was inspired by the need to make academic research and analysis easier. Traditional research means manually searching through vast amounts of literature, identifying relevant papers, and analyzing their content. Understanding the collaboration networks among researchers and the impact of their work is challenging, and the sheer volume of scientific literature makes it difficult for researchers to stay current with the latest developments in their field. At that scale, an area of interest needs to be examined holistically to produce a comprehensive study and analysis.

This project therefore leverages graph theory to model the complex relationships between papers, authors, and topics, providing a comprehensive view of the research landscape. By utilizing the arXiv HEP-TH citation graph, which includes 27,770 papers and 352,807 edges, ScholarGraph aims to facilitate deeper insights into scientific contributions and collaborations.

What it does

ScholarGraph is a tool that uses GraphRAG to generate deep-research reports grounded in the underlying graph data. It analyzes the interconnectedness of academic papers, authors, and topics. It extracts relevant information from paper abstracts, models topics as nodes linked to papers, and constructs a citation network that reveals the impact and influence of each paper. The platform lets users identify key authors, topics, and papers, making it easier to navigate the vast literature. The key features are:

1. Deep Research Agent:

An agent that generates a plan of research strategies and executes its steps using Semantic RAG and GraphRAG. It also refines the plan based on the answers it gets and prepares comprehensive reports spanning multiple topics.

2. Research Papers Analysis:

An agent that can:

  • answer simple AQL-based queries,
  • answer complex queries that require NetworkX algorithms, and
  • answer hybrid queries that need both AQL queries and NetworkX algorithms run using cuGraph
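
As an illustration of the second kind of query, ranking papers by citation impact with PageRank (a query this writeup mentions under accomplishments) could look like the sketch below. It is a dependency-free power-iteration version written for this example, not the project's code; in practice NetworkX's `pagerank` function would typically be used instead. The paper IDs and the `pagerank` helper here are made up for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (citing, cited) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Papers that cite nothing ("dangling" nodes) spread their rank evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Toy citation graph: B, C and D all cite A; D also cites B.
citations = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")]
scores = pagerank(citations)
top = max(scores, key=scores.get)
print("Most influential paper:", top)
```

Because every paper's score depends on who cites it, the most-cited paper in the toy graph comes out on top, and the scores always sum to 1.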

3. Grounded Research:

Powered by GraphRAG and NetworkX algorithms, so that responses are grounded in the underlying graph data rather than hallucinated.

4. HybridRAG:

Semantic RAG provides context for AQL query generation and NetworkX code generation, while GraphRAG is used to provide accurate answers.

5. GraphRAG based Data Enrichment:

The dataset is enriched using GraphRAG to extract topics for better analysis.

How we built it

To build ScholarGraph, we employed a multi-step approach:

Step 0: Select Your Data

We use the arXiv HEP-TH (high energy physics theory) citation graph, which covers all the citations within a dataset of 27,770 papers with 352,807 edges. Source: https://snap.stanford.edu/data/cit-HepTh.html
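
A minimal sketch of reading such an edge list is shown below. It assumes the usual SNAP text layout (comment lines starting with `#`, then one whitespace-separated `FromNodeId ToNodeId` pair per line, where the first paper cites the second). The `parse_snap_edges` helper and the sample IDs are hypothetical, and the header text in the sample is only illustrative.

```python
import io

def parse_snap_edges(lines):
    """Parse SNAP-style edge-list lines into (citing, cited) string pairs."""
    edges = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' comment header.
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((src, dst))
    return edges

# A tiny in-memory sample in the same shape as the downloaded file.
sample = io.StringIO(
    "# Directed graph\n"
    "# FromNodeId\tToNodeId\n"
    "1001\t9304045\n"
    "1001\t9308122\n"
    "1002\t9304045\n"
)
edges = parse_snap_edges(sample)
nodes = {n for edge in edges for n in edge}
print(len(nodes), "papers,", len(edges), "citations")
```

To use it on the real file, pass an open file handle instead of the `io.StringIO` sample.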

Step 1: Data Preparation & Data Augmentation

We read the data and create lists of dictionaries for all nodes and edges, extracting the key information. We then use LLMs with structured output to extract the topics each paper is relevant to, and create a new collection of topics along with edges that link them to papers.
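
The shape of those dictionaries might look like the sketch below. ArangoDB documents are identified by `_key`, and edge documents reference their endpoints through `_from` and `_to` in `collection/key` form; everything else here is an assumption made for illustration: the collection names `papers` and `topics`, the helper functions, and the `extracted` mapping, which stands in for the LLM's topic output.

```python
def topic_key(name):
    """Normalize a topic name into a key without spaces."""
    return "_".join(name.lower().split())

def edge_doc(from_coll, from_key, to_coll, to_key):
    """Build an ArangoDB-style edge document."""
    return {"_from": f"{from_coll}/{from_key}", "_to": f"{to_coll}/{to_key}"}

# Hand-written stand-in for the topics an LLM might extract per paper.
extracted = {
    "p1": ["String Theory", "Black Holes"],
    "p2": ["String Theory"],
}

topic_docs = {}          # topic key -> topic node document
paper_topic_edges = []   # edges linking papers to topics
for paper_id, names in extracted.items():
    for name in names:
        key = topic_key(name)
        # Each distinct topic becomes exactly one node.
        topic_docs.setdefault(key, {"_key": key, "name": name})
        paper_topic_edges.append(edge_doc("papers", paper_id, "topics", key))

print(len(topic_docs), "topics,", len(paper_topic_edges), "paper-topic edges")
```

Deduplicating topics by key means two papers that share a topic end up linked to the same topic node, which is what makes topic-level queries possible.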

Step 2 & 3: Persist Data to ArangoDB and Load Graph to NetworkX

We insert the data directly into ArangoDB collections, create a graph, and then retrieve it as a NetworkX object.

Step 4: Vector Indexes

We build a semantic vector index over the paper abstracts and over all the nodes.
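
The idea behind such an index can be sketched with a toy example. The code below is not the project's vector store: it swaps in a bag-of-words count vector for a real embedding model and a plain dictionary for the index, purely to show ranking by cosine similarity. The abstracts and the `embed`, `cosine`, and `search` helpers are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

abstracts = {
    "p1": "black hole entropy in string theory",
    "p2": "supersymmetric gauge theory dualities",
    "p3": "entropy of extremal black hole solutions in supergravity",
}
# The "index": one vector per paper abstract.
index = {pid: embed(text) for pid, text in abstracts.items()}

def search(query, k=2):
    """Return the k paper IDs whose abstracts are most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pid: cosine(q, index[pid]), reverse=True)
    return ranked[:k]

print(search("black hole entropy"))
```

A real embedding model would also match paraphrases that share no words with the query, which this word-count stand-in cannot do.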

Step 5: Build the Agentic App

Using LangGraph, we create an agentic app with multiple tools for querying the graph data. Contextual data retrieved from the vector database is included in the prompt to improve AQL and NetworkX code generation.

The tools are as follows:

1. Semantic Search:

Uses the vector index to search for relevant nodes and abstracts.

2. Text to AQL:

Converts text to AQL for later use directly in code.

3. AQL to NetworkX Algorithm:

Uses the AQL output in Python code that runs a NetworkX algorithm using cuGraph.

4. Text to AQL to Text:

Returns a direct text answer using AQL generation.

5. Text to NetworkX Algorithm to Text:

Returns a direct text answer after running a NetworkX algorithm using cuGraph.
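
To make the "AQL to NetworkX Algorithm" flow more concrete, here is a rough sketch of a hybrid query: an AQL query narrows the graph down to papers on one topic, and a graph computation then runs on that subset only. Everything in it is an assumption for illustration: the collection names in the AQL string, the `run_aql_stub` function (which returns canned results instead of calling ArangoDB), and the simple in-degree count standing in for a NetworkX algorithm.

```python
# Hypothetical AQL that a text-to-AQL tool might generate.
AQL = 'FOR e IN paper_topics FILTER e._to == "topics/string_theory" RETURN e._from'

def run_aql_stub(query):
    """Stand-in for executing AQL; returns canned IDs for the toy data."""
    return ["papers/p1", "papers/p2", "papers/p3"]

def citation_counts(edges, keep):
    """Count citations received by each kept paper from other kept papers."""
    counts = {paper: 0 for paper in keep}
    for src, dst in edges:
        if src in keep and dst in keep:
            counts[dst] += 1
    return counts

# Toy (citing, cited) pairs; p4 is outside the topic and gets filtered out.
all_citations = [
    ("papers/p2", "papers/p1"),
    ("papers/p3", "papers/p1"),
    ("papers/p4", "papers/p1"),
    ("papers/p3", "papers/p2"),
]

subset = set(run_aql_stub(AQL))
counts = citation_counts(all_citations, subset)
# Most-cited first, ties broken by ID.
ranked = sorted(counts, key=lambda paper: (-counts[paper], paper))
print(ranked)
```

Doing the filtering in the database first keeps the graph handed to the algorithm small, which is the main appeal of combining the two steps.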

Challenges we ran into

The first challenge was getting cuGraph to run on a Windows machine with an RTX 4050. We ended up running it under WSL Ubuntu on the same machine, after which cuGraph and the NetworkX algorithms ran smoothly.

Accomplishments that we're proud of

We are proud of ScholarGraph's graph model, which integrates papers, authors, and topics. Our system extracts topics from paper abstracts and models them as nodes linked to papers. We also implemented a query system that lets users explore the citation network, identify influential papers and authors, and perform complex queries such as ranking papers by citation impact using PageRank. Together, these pieces let us create a Deep Research Agent that can use these tools and agents to research any topic the user is interested in.

What we learned

We learned how GraphRAG powers truth-grounded answers and how these can be enhanced through Hybrid RAG using vector databases. We also learned how cuGraph and ArangoDB enable efficient graph storage, querying, and on-the-fly execution of NetworkX algorithms, making an agent practical for real use cases.
