Inspiration
We were inspired by the inherent difficulty in fully understanding complex documents like corporate annual reports. These PDFs are rich with information, but insights are often fragmented between dense paragraphs of text and crucial visual elements like charts, diagrams, and tables. Standard search tools struggle with this mix. We wanted to create InsightLens – a tool that could "see" beyond just the text, allowing users to ask questions and receive answers synthesized from both the textual narrative and the visual data representations within a single document, leading to faster, deeper, and more holistic comprehension.
What it does
InsightLens is an AI-powered chatbot designed to analyze PDF documents using a multimodal Retrieval-Augmented Generation (RAG) approach. It allows users to ask complex natural language questions about a report (using the LTIMindtree Annual Report as a primary example) and receive synthesized answers. Key capabilities include:
- Answering questions based on the textual content (financials, strategy, ESG initiatives, etc.).
- Answering questions about the images and diagrams within the PDF by leveraging pre-generated descriptions obtained via the Gemini Vision API.
- Providing a conversational interface for exploring and understanding dense reports more efficiently.
How we built it
InsightLens leverages Python, LangChain, and Google Gemini models:
- Multimodal Pre-processing: The core innovation is a pre-processing step. We use PyMuPDF to extract images page by page from the input PDF, then pass each image to the Gemini Vision API (via the google.generativeai SDK) to generate a detailed text description.
- Text Extraction & Chunking: In parallel, PyPDFLoader extracts the document's text, which is split into manageable chunks using RecursiveCharacterTextSplitter.
- Unified Data Representation: We create LangChain Document objects for both the text chunks and the AI-generated image descriptions, attaching metadata such as page number and type ('text' or 'image_description').
- Embedding & Indexing: All of these Document objects are embedded using GoogleGenerativeAIEmbeddings (embedding-001) and indexed into a FAISS vector store, so a single index holds vectors for both text and image-content descriptions.
- RAG Implementation: A LangChain RetrievalQA chain uses the FAISS index to retrieve the most relevant context (text chunks, image descriptions, or a mix) for the user's query. A condensed code sketch of the full pipeline follows this list.
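The sketch below condenses this pipeline end to end. It is a minimal illustration, not our production code: the model identifiers (gemini-1.5-flash, gemini-1.5-pro), the description prompt, the chunk sizes, and k are assumptions, and the error handling and rate-limit management we actually needed are omitted.

```python
import io
import os

import fitz  # PyMuPDF
import google.generativeai as genai
from PIL import Image
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY"  # read by both the genai SDK and LangChain
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
PDF_PATH = "annual_report.pdf"

# 1. Multimodal pre-processing: describe every embedded image with Gemini Vision.
vision_model = genai.GenerativeModel("gemini-1.5-flash")  # model name is illustrative
image_docs = []
pdf = fitz.open(PDF_PATH)
for page_num, page in enumerate(pdf):
    for img in page.get_images(full=True):
        image_bytes = pdf.extract_image(img[0])["image"]
        response = vision_model.generate_content(
            ["Describe this chart, diagram, or table in detail.",
             Image.open(io.BytesIO(image_bytes))]
        )
        image_docs.append(Document(
            page_content=response.text,
            metadata={"page": page_num + 1, "type": "image_description"},
        ))

# 2. Text extraction and chunking.
pages = PyPDFLoader(PDF_PATH).load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
text_docs = splitter.split_documents(pages)
for doc in text_docs:
    doc.metadata["type"] = "text"

# 3. Unified representation: embed text chunks and image descriptions into one FAISS index.
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vector_store = FAISS.from_documents(text_docs + image_docs, embeddings)

# 4. RAG: retrieve mixed context and synthesize an answer.
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatGoogleGenerativeAI(model="gemini-1.5-pro"),  # also illustrative
    retriever=vector_store.as_retriever(search_kwargs={"k": 6}),
)
answer = qa_chain.invoke({"query": "Summarize the revenue trend shown in the charts."})
print(answer["result"])
```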
Challenges we ran into
- Initial Setup: Overcoming early hurdles with Google Cloud authentication and identifying correct, stable Gemini model API identifiers.
- API Limits: Hitting and managing API rate limits, which necessitated switching models during development.
- Retrieval Tuning: Debugging cases where the retriever failed to surface the correct text chunks or image descriptions, which required experimenting with chunking strategies, retrieval parameters (k), and search types (similarity vs. mmr); see the snippet after this list.
- Synthesis Accuracy: Ensuring the LLM accurately interpreted the retrieved context, especially distinguishing direct text from image descriptions, and dealing with summarization/truncation issues (particularly with Flash models).
- Multimodal Workflow: Implementing the image extraction and description-generation loop robustly, including handling errors during PDF parsing or Vision API calls.
- Visual Data Fidelity: Recognizing the limitations: even with descriptions, precisely querying complex table structures or nuanced diagram relationships purely through text remains challenging.
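To make the retrieval-tuning experiments concrete, the snippet below shows the two search types we compared, reusing the `vector_store` built in the sketch above; the parameter values are illustrative, not our final settings.

```python
# Plain similarity search: the k nearest neighbours, which can come back
# as near-duplicate text chunks from the same section of the report.
similarity_retriever = vector_store.as_retriever(
    search_type="similarity", search_kwargs={"k": 4}
)

# Maximal Marginal Relevance: fetch a wider candidate pool (fetch_k), then
# select k diverse results, which helps image descriptions surface alongside
# closely related text chunks instead of being crowded out.
mmr_retriever = vector_store.as_retriever(
    search_type="mmr", search_kwargs={"k": 6, "fetch_k": 20}
)
```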
Accomplishments that we're proud of
- Successfully building an end-to-end RAG pipeline using LangChain and Gemini.
- Integrating multimodal capabilities by generating textual descriptions of images with Gemini Vision and incorporating them effectively into the RAG index.
- Creating a chatbot that demonstrably answers questions requiring information from both text and image descriptions within the PDF.
- Iteratively improving the system's accuracy by debugging and tuning the retrieval and synthesis components.
- Demonstrating a practical approach to making the visual information within documents accessible to language models.
What we learned
- The practical workflow and components of building RAG systems with LangChain.
- Effective use of different Gemini models (text, embeddings, vision) and their respective APIs within a larger application.
- The crucial role of data pre-processing and structuring (chunking, metadata) in retrieval performance (see the filtering example below).
- Techniques for debugging complex LLM-based pipelines, especially telling retrieval bottlenecks apart from synthesis bottlenecks.
- A viable method (image description generation) for integrating multimodal information into predominantly text-based RAG systems, along with its limitations compared to true end-to-end multimodal models.
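As one illustration of how metadata pays off, LangChain's FAISS wrapper can filter retrieval on document metadata, which makes it easy to sanity-check whether image descriptions are being retrieved at all; the snippet reuses `vector_store` from the build sketch above, and the query and values are illustrative.

```python
# Restrict retrieval to image descriptions only, via the 'type' metadata
# field attached during pre-processing.
image_only_retriever = vector_store.as_retriever(
    search_kwargs={"k": 4, "filter": {"type": "image_description"}}
)
for doc in image_only_retriever.invoke("revenue growth chart"):
    print(doc.metadata["page"], doc.page_content[:80])
```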
What's next for InsightLens
- Enhanced Image Understanding: Explore generating more structured data (e.g., table extraction) from images instead of free-form descriptions, or experiment with multimodal embeddings for direct image retrieval.
- Improved Context Linking: Develop methods to better link user queries or retrieved text to specific images on a page for more targeted visual analysis.
- Advanced RAG Techniques: Investigate more sophisticated retrieval strategies and re-ranking methods.
- Table Handling: Improve extraction and querying specifically for tabular data within the PDF.
- User Interface: Build a web-based graphical user interface (GUI) for easier interaction.
- Evaluation: Implement more rigorous evaluation metrics to quantitatively measure performance improvements.