The corteXiv Journey
Inspiration
The inspiration for corteXiv came from a common challenge in academic research: efficiently exploring and understanding research papers. As researchers ourselves, we noticed three key pain points:
- The arXiv site itself offers only minimal features for exploring and navigating papers.
- Reading and understanding papers takes significant time.
- There's no easy way to have an interactive dialogue with papers.
We saw an opportunity to leverage Snowflake's Cortex capabilities and LLMs to create a more intuitive and intelligent way to interact with academic papers on arXiv, with the goal of making research exploration faster and deeper.
What it does
corteXiv transforms how researchers interact with arXiv papers through two main features:
Personal Library with Smart Features: Users can build their library and use Cortex's semantic search to find papers based on concepts rather than just keywords. The search provides an AI-generated overview of how the top results relate to their query. The library also includes an intelligent "You May Like This Paper" feature that analyzes your existing papers to suggest new relevant research.
Interactive Paper Chat: Users can have natural conversations with papers, asking questions and getting contextual answers. The system also generates deep-dive questions and key insights to help users better understand the paper.
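The "search results plus AI overview" flow described above can be sketched as a small helper. Here `search_fn` and `complete_fn` are hypothetical stand-ins for the Cortex Search query and the Cortex LLM completion call; the prompt wording is illustrative, not corteXiv's actual prompt.

```python
def search_with_overview(query, search_fn, complete_fn, top_k=3):
    """Run semantic search, then ask an LLM how the top hits relate to the query.

    search_fn(query) -> list of {"title": ..., "abstract": ...} dicts
    complete_fn(prompt) -> generated overview text
    (Both are injected stand-ins for Snowflake Cortex calls.)
    """
    hits = search_fn(query)[:top_k]
    context = "\n".join(f"- {h['title']}: {h['abstract']}" for h in hits)
    prompt = (
        f"User query: {query}\n"
        f"Top matching papers:\n{context}\n"
        "In 2-3 sentences, explain how these papers relate to the query."
    )
    return hits, complete_fn(prompt)
```

Keeping the search and completion calls injected like this also makes the flow easy to unit-test with stubs.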
How we built it
We built corteXiv as a modern web application with several key components:
Frontend: Built with Streamlit for rapid development and a clean user interface
Backend Architecture:
- Snowflake for secure, scalable storage of paper metadata and content chunks
- Snowflake Cortex Search for hybrid search and Mistral Large 2 for Cortex LLM capabilities
- Smart recommendation system using LLM-generated search phrases
RAG Implementation:
- Papers are automatically chunked upon addition to the library
- Hybrid search combines vector similarity with keyword matching
- Multi-step RAG for generating paper insights
- Context-aware chat system with conversation memory
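The chunking step above can be illustrated with a simple overlapping-window splitter. The chunk size and overlap values here are illustrative defaults, not corteXiv's actual parameters, and a production version might split on section or sentence boundaries instead of raw characters.

```python
def chunk_text(text, chunk_size=1500, overlap=200):
    """Split paper text into overlapping character chunks.

    The overlap keeps sentences that straddle a boundary retrievable
    from either side. (Illustrative parameters only.)
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```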
Challenges we ran into
Document Processing: Finding the right chunking strategy for academic papers was challenging. TruLens allowed fast and intuitive experiments in this area.
RAG Quality: Initially, our RAG responses weren't consistently high quality. We had to experiment with different:
- Chunking strategies
- Prompt engineering approaches
- Context retrieval methods
User Experience: Balancing functionality with simplicity was tricky. We went through several iterations to find the right mix of features while keeping the interface intuitive.
Performance Optimization: We had to carefully manage:
- LLM context windows
- Search result pagination
- Processing state management
- Response generation time
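Managing the LLM context window, for example, comes down to deciding which part of the conversation history still fits in the budget. A minimal sketch, using character counts as a rough proxy for tokens (a real implementation would use the model's tokenizer, and the limit here is illustrative):

```python
def trim_history(messages, max_chars=6000):
    """Keep the most recent messages that fit in the context budget.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    Walks backwards from the newest message so recent turns survive.
    (Character budget is a stand-in for a proper token count.)
    """
    kept, used = [], 0
    for msg in reversed(messages):
        if used + len(msg["content"]) > max_chars:
            break
        kept.append(msg)
        used += len(msg["content"])
    return list(reversed(kept))
```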
Accomplishments that we're proud of
Intelligent Search Overview: Our semantic search not only finds relevant papers but also generates an overview explaining why they're relevant and how they connect.
Smart Paper Discovery: We developed an LLM-powered recommendation system that analyzes random samples from your library to generate targeted search phrases, helping discover new relevant papers you might have missed.
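The discovery step above starts by sampling the library and building a prompt that asks the LLM for search phrases. A minimal sketch of that first stage, with hypothetical prompt wording (the production prompt is not shown):

```python
import random

def build_discovery_prompt(library_titles, sample_size=5, seed=None):
    """Sample random papers from the library and build a prompt asking
    the LLM for related search phrases. (Illustrative prompt text.)"""
    rng = random.Random(seed)
    sample = rng.sample(library_titles, min(sample_size, len(library_titles)))
    listing = "\n".join(f"- {t}" for t in sample)
    return (
        "These papers are in a researcher's library:\n"
        f"{listing}\n"
        "Suggest 3 short search phrases for related papers they may have missed."
    )
```

The returned prompt would then be sent to the LLM, and each generated phrase fed back into semantic search to surface recommendations.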
Multi-step RAG: We implemented an agentic RAG system where the LLM generates its own queries to create comprehensive paper insights.
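The agentic loop described above can be sketched in three steps: the LLM plans its own retrieval queries, each query pulls context, and a final call synthesizes the insights. The `generate_queries`, `retrieve`, and `complete` callables are assumed interfaces standing in for the Cortex LLM and search services, not corteXiv's actual code.

```python
def multi_step_insights(paper_title, generate_queries, retrieve, complete):
    """Agentic RAG sketch: the LLM writes its own retrieval queries,
    then the retrieved chunks feed a final synthesis prompt.
    All three callables are injected stand-ins (assumed interfaces)."""
    queries = generate_queries(paper_title)   # step 1: LLM plans queries
    context = []
    for q in queries:                         # step 2: retrieve per query
        context.extend(retrieve(q))
    prompt = (
        f"Paper: {paper_title}\n"
        "Context:\n" + "\n".join(context) +
        "\nWrite the key insights of this paper."
    )
    return complete(prompt)                   # step 3: synthesize insights
```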
Smooth User Experience: Despite the complex backend operations, we maintained a clean and responsive interface with clear feedback on all operations.
Robust Architecture: Our system efficiently handles paper processing, storage, and retrieval while maintaining context for chat interactions.
What we learned
RAG Development:
- The importance of chunking strategy in RAG quality
- How to balance context length with response quality
- Techniques for maintaining conversation coherence
LLM Integration:
- Effective prompt engineering strategies
- Managing LLM context efficiently
- Balancing response quality with generation speed
User Experience:
- The importance of clear feedback for AI operations
- How to present complex information simply
- Managing user expectations for AI interactions
RAG Experimentation:
- TruLens provides invaluable insights for RAG development
- Quick iteration on different chunking strategies
- Empirical feedback on retrieval quality
- Data-driven approach to improving response accuracy
- Importance of systematic testing in RAG development
What's next for corteXiv
Enhanced Discovery:
- Improved recommendation algorithms using citation networks
- Cross-paper relationship mapping
- More sophisticated content analysis
- Personalized suggestions based on reading patterns
Advanced RAG:
- Multi-paper synthesis
- Automated literature reviews
- Figure and equation understanding
Collaboration Features:
- Shared libraries
- Collaborative paper discussions
- Research team workspaces
Integration Possibilities:
- Reference manager integration
- Note-taking system connections
- Academic workflow tools
Our journey with corteXiv has shown us the potential of combining modern AI capabilities with academic research tools. We're excited to continue developing features that make research more efficient and insightful.
Built With
- docling
- langchain
- mistral
- python
- snowflake
- streamlit
- trulens