Inspiration
In industries like construction, insurance, and manufacturing, there is a critical need to verify "ground truth" against technical requirements. We identified a gap between what is written in documentation (blueprints, specs, contracts) and what is captured in visual evidence (site videos).
Standard tools often fail to bridge these modalities effectively. We were inspired to build VeraGate to act as an objective, AI-powered forensic auditor that can "see" the site and "read" the specs simultaneously, ensuring that truth is verified without the "blindness" caused by disconnected data sources.
What it does
VeraGate is a multimodal forensic audit engine that detects contradictions between video evidence and technical documentation in real time.
It operates through a streamlined workflow:
- Ingestion: Users upload a video file (evidence) and a PDF document (technical specs).
- Video Analysis: The "Watcher" agent performs OCR, transcription, and spatial analysis on the video footage.
- Forensic Audit: The "Auditor" agent cross-references the visual data against the full text of the PDF to detect discrepancies.
- Contradiction Alerts: It identifies specific issues such as Spatial (wrong position), Temporal (time mismatch), Factual (conflicting info), and Specification (technical violation) errors.
- Thinking Log: It displays the AI's real-time reasoning process, showing exactly how it reached its conclusions.
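The four alert categories above can be modeled as a small discriminated type. This is a hypothetical sketch of what such a data model might look like; the field names and the `triage` helper are illustrative, not VeraGate's actual schema:

```typescript
// Illustrative model of a contradiction alert, based on the four
// categories described above; field names are assumptions.
type ContradictionType = "spatial" | "temporal" | "factual" | "specification";

interface ContradictionAlert {
  type: ContradictionType;
  // What the Watcher saw in the video (e.g. "shadows suggest morning").
  videoEvidence: string;
  // What the Auditor found in the PDF (e.g. "work log says 4pm").
  documentClaim: string;
  // Model-reported confidence, 0..1.
  confidence: number;
}

// Keep only high-confidence alerts, most severe categories first.
function triage(
  alerts: ContradictionAlert[],
  minConfidence = 0.7
): ContradictionAlert[] {
  const order: ContradictionType[] = [
    "specification",
    "factual",
    "spatial",
    "temporal",
  ];
  return alerts
    .filter((a) => a.confidence >= minConfidence)
    .sort((a, b) => order.indexOf(a.type) - order.indexOf(b.type));
}
```

A UI could then render the triaged list directly, with the severity ordering baked into the sort.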
How we built it
We utilized a Two-Agent Architecture orchestrated within a Next.js 16 application.
- Agent 1: The Watcher (gemini-2.0-flash): We selected the Flash model for its speed and multimodal capabilities. It handles the heavy lifting of video processing, extracting text and spatial data from frames efficiently.
- Agent 2: The Auditor (gemini-2.0-pro-exp-02-05): We used the Pro model with Thinking Mode (thinkingLevel: HIGH) for the analysis. This agent ingests the entire PDF (up to 1M tokens) along with the video analysis to perform deep deductive reasoning.
- No-RAG Approach: Instead of using Retrieval-Augmented Generation (RAG), which breaks documents into chunks, we fed the full document context to the model to preserve the global context required for forensic accuracy.
- Tech Stack: The frontend is built with React, Tailwind CSS, and Framer Motion, communicating with the backend via Server-Sent Events (SSE) to stream the AI's "thinking" tokens to the UI.
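The SSE streaming mentioned above ultimately comes down to writing `data:` frames terminated by a blank line. A minimal, framework-agnostic sketch of the encoding step (the actual VeraGate route handler is not shown; the `sseFrame` helper is our illustration of the wire format defined by the SSE spec):

```typescript
// Encode one streamed "thinking" token as a Server-Sent Events frame.
// Per the SSE spec, a frame is one or more "data:" lines followed by a
// blank line; a multi-line payload becomes multiple "data:" lines.
function sseFrame(event: string, payload: string): string {
  const dataLines = payload
    .split("\n")
    .map((line) => `data: ${line}`)
    .join("\n");
  return `event: ${event}\n${dataLines}\n\n`;
}

// In a Next.js route handler this would feed a ReadableStream, e.g.:
//   controller.enqueue(encoder.encode(sseFrame("thinking", token)));
```

On the client, an `EventSource` listener for the `thinking` event type can then append each token to the Thinking Log as it arrives.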
Challenges we ran into
- Context Fragmentation: We initially struggled to feed large technical documents to the AI without losing nuance. We solved this by making the key technical decision to abandon RAG and rely on Gemini's massive context window for full-document ingestion.
- Transparency: In forensics, a simple "yes/no" isn't enough. We needed to show why a contradiction was flagged. We overcame this by implementing a Thinking Log that streams the model's internal decision process to the user in real time.
- Browser Hydration: We encountered hydration errors caused by browser extensions modifying the DOM, which required careful debugging and troubleshooting in the Next.js environment.
Accomplishments that we're proud of
- Multimodal Reasoning: Successfully combining video OCR/transcription with deep textual analysis to find complex contradictions (e.g., shadows indicating the wrong time of day).
- The Thinking Log: Implementing a visible "brain" for the application where users can watch the gemini-2.0-pro model reason through evidence step by step.
- Seamless Large File Handling: Integrating the Google Files API to handle large video uploads (up to 2GB) and PDF ingestion (up to 1M tokens) smoothly within the web interface.
What we learned
- The Power of Context: We learned that for audit tasks, providing the full document context is far superior to RAG, as it allows the model to understand the document as a cohesive whole.
- Specialized Agents: We discovered that separating concerns—using a fast model ("Flash") for perception and a deep model ("Pro") for reasoning—resulted in a more efficient and accurate system.
- Structured Output: We learned the importance of enforcing responseMimeType: "application/json" to ensure that the complex "thinking" process ultimately resolves into structured, actionable data for the UI.
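Even with responseMimeType: "application/json", the model's output is worth validating at runtime before it reaches the UI. A hedged sketch of such a check; the `AuditResult` shape and field names are illustrative assumptions, not VeraGate's actual schema:

```typescript
// Illustrative shape for the Auditor's final structured verdict.
interface AuditResult {
  verdict: "consistent" | "contradiction";
  issues: string[];
}

// Parse and validate raw model output; throws if the JSON does not
// match the expected shape, so malformed responses never reach the UI.
function parseAuditResult(raw: string): AuditResult {
  const data = JSON.parse(raw);
  if (data.verdict !== "consistent" && data.verdict !== "contradiction") {
    throw new Error(`unexpected verdict: ${String(data.verdict)}`);
  }
  if (
    !Array.isArray(data.issues) ||
    !data.issues.every((i: unknown) => typeof i === "string")
  ) {
    throw new Error("issues must be an array of strings");
  }
  return { verdict: data.verdict, issues: data.issues };
}
```

In a production setting a schema library (e.g. Zod) would replace the hand-rolled checks, but the principle is the same: never trust structured output blindly.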
What's next for Veragate
- Model Refinement: We aim to continue refining the gemini-2.0-pro-exp integration as the model moves from experimental to stable, potentially increasing the complexity of audits it can handle.
- Enhanced Forensic Types: We plan to expand the system to detect even more subtle contradiction types beyond the current Spatial, Temporal, Factual, and Specification categories.
- Real-World Deployment: Moving from the current prototype status to a production-ready tool that can accept live video feeds for on-site auditing.