Inspiration

In today's digital-first world, individuals and organizations accumulate a vast number of digital documents: academic certificates, passports, immigration documents, office files, bills, medical records, contracts, and more. These documents are typically stored in cloud drives, email attachments, or local folders. While these storage solutions provide accessibility, the challenge of finding the right document at the right time remains unresolved.

What it does

The AI-Powered Personal Document Assistance System allows users to securely upload private documents (PDFs and images) and later retrieve relevant content using natural language queries.

How we built it

Built entirely on Google Cloud Platform (GCP), the system employs a combination of Document AI, Vertex AI, Cloud Run, and BigQuery to perform intelligent text extraction, embedding, and retrieval through vector similarity search.

This architecture ensures data privacy, scalability, and multilingual adaptability by leveraging Google's advanced AI models and serverless infrastructure.

Challenges we ran into

  1. One of the key challenges was designing the event-driven architecture that connects the entire pipeline: from the document upload web application, to the Google Cloud Storage bucket, through EventArc triggers, and finally to Spring Boot microservices deployed on Cloud Run for document processing. The microservice extracts text using Document AI OCR, performs context-preserving chunking with overlaps, and passes each chunk to the pretrained text-multilingual-embedding model for vector generation.

EventArc internally uses Pub/Sub for event transportation. When an event is published, the subscriber (Cloud Run service) is expected to acknowledge it within roughly 10 seconds. If no acknowledgment is received, Pub/Sub retries the event, which initially caused duplicate processing for the same document.

To solve this, we implemented a deduplication mechanism based on the Cloud Event ID (ce-id) found in the event header. Since EventArc assigns a unique ce-id per event, even during retries, each incoming request’s ID was stored in Firestore. If a retry with the same ce-id appeared, the system detected it as a duplicate and returned a success response immediately, preventing redundant document processing.
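The deduplication check described above can be sketched as follows. The class and method names are ours for illustration, and an in-memory map stands in for the Firestore lookup; the key idea is the atomic "insert if absent" test on the ce-id.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the ce-id deduplication check. In the real service the seen IDs
// live in Firestore; a concurrent in-memory map stands in here.
public class EventDeduplicator {
    private final Map<String, Boolean> seenEventIds = new ConcurrentHashMap<>();

    /**
     * Returns true if this ce-id has not been seen before (process the event),
     * false if it is a Pub/Sub retry of an already-handled event
     * (acknowledge immediately and skip processing).
     */
    public boolean markIfNew(String ceId) {
        return seenEventIds.putIfAbsent(ceId, Boolean.TRUE) == null;
    }
}
```

On a retry, `markIfNew` returns false and the handler can return a 2xx response right away, which acknowledges the Pub/Sub message without reprocessing the document.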

  2. While using the Document AI (Document OCR Processor) Inference API to extract raw text from PDFs, we discovered that each API call can process only up to 15 pages at a time. For larger PDFs, this required splitting the document into batches of 15 pages per request and then merging the extracted text results. To automate this, we used the iTextPDF API to split PDFs efficiently into 15-page chunks before processing.
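The batching logic can be sketched as below. The actual splitting of PDF bytes is done with iText; this helper (names are ours) only computes the 1-based page ranges, one per Document AI call.

```java
import java.util.ArrayList;
import java.util.List;

// Computes the 1-based page ranges for batching a PDF through Document AI OCR,
// which accepts at most 15 pages per request.
public class PageBatcher {
    public static List<int[]> batchRanges(int totalPages, int maxPagesPerCall) {
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += maxPagesPerCall) {
            int end = Math.min(start + maxPagesPerCall - 1, totalPages);
            ranges.add(new int[] { start, end }); // inclusive [start, end]
        }
        return ranges;
    }
}
```

For a 35-page PDF with a 15-page limit, this yields the ranges [1-15], [16-30], and [31-35], whose OCR results are then merged in order.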

  3. While working with the pre-published Vertex AI model text-multilingual-embedding, we encountered a request limit of 20,000 tokens per API call. To handle this efficiently, the raw text extracted from documents via Document OCR was divided into chunks of up to 1,024 characters, with a 200-character overlap to preserve context. Each API request processed 50 such chunks, ensuring we stayed within the token limit while maintaining contextual accuracy.
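The overlapping chunking step can be sketched as follows (class and method names are ours). Each chunk starts 824 characters after the previous one, so consecutive chunks share a 200-character overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Context-preserving chunking: chunks of up to `chunkSize` characters,
// with `overlap` characters shared between consecutive chunks.
public class TextChunker {
    public static List<String> chunk(String text, int chunkSize, int overlap) {
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap; // how far each new chunk advances
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break; // final chunk reached
        }
        return chunks;
    }
}
```

The resulting chunks would then be grouped into batches of 50 per embedding API request, as described above.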

  4. Storing the vector embeddings generated for each text chunk in BigQuery's vector-supported tables, and implementing cosine similarity queries to perform efficient semantic searches.
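The similarity measure behind those queries is plain cosine similarity over the embedding vectors (BigQuery reports it as a cosine distance, i.e. 1 minus the similarity). A minimal sketch of the math, with names of our choosing:

```java
// Cosine similarity between two embedding vectors:
// dot(a, b) / (||a|| * ||b||). Cosine distance = 1 - this value.
public class CosineSimilarity {
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Ranking chunks by ascending cosine distance (descending similarity) to the query embedding returns the most semantically relevant document chunks first.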

  5. We implemented a resumable upload for large files to a GCS bucket. When we initially attempted a direct client-side JavaScript implementation, we encountered Cross-Origin Resource Sharing (CORS) issues. The solution involved a hybrid approach: a Spring Boot backend service securely acquired the session URL (the upload URL), and the frontend then chunked the file data, sending 5 MB segments sequentially to this URL. We managed the upload flow by handling the 308 (Incomplete) and 200 (Success) response codes to ensure successful completion.
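The per-segment bookkeeping in the flow above boils down to the Content-Range header sent with each 5 MB PUT to the session URL: the server replies 308 while bytes remain and 200/201 once the final byte arrives. A sketch of that header computation (class and method names are ours):

```java
// Builds the Content-Range header for one segment of a GCS resumable upload:
// "bytes <first>-<last>/<total>", where <last> is the inclusive end byte.
public class ResumableUpload {
    public static String contentRange(long offset, long chunkSize, long totalSize) {
        long end = Math.min(offset + chunkSize, totalSize) - 1;
        return "bytes " + offset + "-" + end + "/" + totalSize;
    }
}
```

After each 308 response, the client advances the offset by the segment size and sends the next range, until the range covering the final byte is acknowledged with 200.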

Accomplishments that we're proud of

Delivered a complete, end-to-end application built on an event-driven architecture, fulfilling the initial design vision. The system provides fast and highly accurate retrieval of relevant documents from natural language queries, resulting in significant time savings for both individuals and organizations.

What we learned

1) Designing and implementing event-driven architectures utilizing various Google Cloud services.
2) Developing and integrating a Retrieval-Augmented Generation (RAG) model pipeline.
3) Using Java client libraries to interface with Google Cloud AI/ML APIs.
4) Implementing Single Sign-On (SSO) for secure user authentication.
5) Mastering cosine distance for performing highly effective semantic/similarity search.
6) Deploying containerized applications efficiently on Google Cloud Run.
7) Managing and querying vector data and file metadata using Google BigQuery's vector support capabilities.

What's next for AI-Powered Personal Document Assistance System

Moving forward, we plan to significantly enhance the system's intelligence and collaborative capabilities:

Enhancing User Accessibility & Efficiency:
- Generative AI Integration: Leverage the Gemini LLM to not only summarize retrieved documents but also offer these summaries in an audio format, catering to diverse user needs.
- Refining Search Accuracy: Introduce a final Gemini LLM processing step to deeply analyze the text of the top-ranked documents, ensuring the final answer is a highly contextual and refined response to the user's initial query.

Expanding System Scope:
- Secure User Collaboration: Implement a document sharing mechanism to foster secure knowledge exchange between users.
- Broader File Support: Increase the utility of the system by rapidly adding support for additional document types beyond PDFs and images.
