Visual Intelligence OS (VI-OS)

Inspiration

The inspiration for Visual Intelligence OS came from a simple question: What if you could step inside any video and interact with its world? We were fascinated by sci-fi stories like The Matrix, Sword Art Online, and Cyberpunk 2077—worlds where you don’t just watch, but participate. We wanted to bring a piece of that vision to real-world video, making it explorable, searchable, and interactive for everyone.

What it does

Visual Intelligence OS transforms any video into an interactive dataset. Users can search, analyze, and interact with video content—extracting code, entities, and insights in real time.

  • Upload videos directly or provide a YouTube URL—both are processed and indexed.
  • Features include knowledge graph generation, overlays, quizzes, and deep-dive analysis.
  • Users can search for real-world items (e.g., “What is the person wearing and where can I get it?”), identify people in the video, analyze speech dynamics, and build node-based workflows.
  • The platform supports both local and YouTube videos, providing a seamless, interactive experience.
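Under the hood, a search like "what is the person wearing" comes down to comparing a query embedding against embeddings of indexed video segments (stored in pgvector on our stack). A minimal in-memory sketch of that ranking step, assuming each segment carries a precomputed `embedding` vector — the field names and pure-Python scoring are illustrative, since in production pgvector performs this comparison server-side:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search_segments(query_embedding, segments, top_k=3):
    """Rank indexed video segments by similarity to the query embedding.

    `segments` is a list of dicts with an "embedding" field; the schema
    here is a hypothetical stand-in for the real pgvector-backed table.
    """
    scored = [(cosine_similarity(query_embedding, s["embedding"]), s)
              for s in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [segment for _, segment in scored[:top_k]]
```

In the real pipeline the query embedding would come from the same model that embedded the segments, and the nearest-neighbor scan runs inside PostgreSQL rather than in Python.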

How we built it

  • Frontend: Next.js, React, Tailwind CSS, React Flow, D3.js, React Leaflet, React Force Graph
  • Backend: Python 3.11+, FastAPI, Celery, Redis, Supabase (PostgreSQL + pgvector), OpenCV, Pillow
  • AI/ML: Gemini 3 (via google-genai SDK) for multimodal analysis, entity extraction, claim verification, and overlay generation. Also uses Google Grounding and Google Maps tools.
  • Architecture: Modular controllers and pipelines for people, products, places, overlays, quizzes, and remixing. Google Cloud Storage (bucket) is used for video file storage.
  • Data: JSON video graphs, entity profiles, timeline metrics, overlays, quiz data
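The "modular controllers and pipelines" design above can be sketched as a small registry mapping each analysis type to its handler. The decorator-based registration and the handler bodies below are hypothetical simplifications of the actual FastAPI/Celery setup, where each pipeline runs as an async task:

```python
from typing import Callable, Dict, List

# Registry of analysis pipelines, keyed by name.
PIPELINES: Dict[str, Callable[[dict], dict]] = {}

def pipeline(name: str):
    """Decorator that registers a handler under a pipeline name."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        PIPELINES[name] = fn
        return fn
    return register

@pipeline("people")
def analyze_people(video_meta: dict) -> dict:
    # Placeholder: the real controller calls Gemini for face/person analysis.
    return {"type": "people", "video": video_meta["id"], "entities": []}

@pipeline("overlays")
def generate_overlays(video_meta: dict) -> dict:
    # Placeholder: the real controller produces timed overlay definitions.
    return {"type": "overlays", "video": video_meta["id"], "items": []}

def run(names: List[str], video_meta: dict) -> List[dict]:
    """Dispatch the requested pipelines against one video's metadata."""
    return [PIPELINES[name](video_meta) for name in names]
```

New analysis types (places, quizzes, remixing) plug in by registering another handler, which is what keeps the architecture extensible.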

Challenges we ran into

  • Processing Large-Scale Video Data: Handling long videos and extracting meaningful information in real time was a major technical challenge.
  • Integrating Gemini 3: Adapting to the large context window and multimodal capabilities of Gemini 3 required new data pipelines and careful prompt engineering.
  • Real-Time Streaming: Ensuring smooth, real-time updates between backend analysis and frontend visualization pushed us to optimize our WebSocket and REST communication.
  • User Experience: Balancing advanced features with a simple, intuitive UI was an ongoing design challenge.
  • YouTube Integration: We could process YouTube videos, but YouTube's embedded player restricts rendering interactive overlays on top of the video. We solved this by adding a custom video player layer.
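For the long-video challenge, one common mitigation is to sample a bounded number of evenly spaced frames before sending them to the model, trading temporal resolution for tractable cost and latency. A sketch of that sampling step — the function and its parameters are illustrative, not our exact pipeline code:

```python
from typing import List

def sample_frame_indices(total_frames: int, max_frames: int) -> List[int]:
    """Pick at most `max_frames` evenly spaced frame indices.

    Short clips are kept whole; long videos are thinned so the frame
    budget sent to the multimodal model stays constant regardless of length.
    """
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]
```

With OpenCV, each returned index would be fetched via `cv2.VideoCapture.set(cv2.CAP_PROP_POS_FRAMES, idx)` followed by a `read()`, then encoded and attached to the Gemini request.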

Accomplishments that we're proud of

  • End-to-end multimodal analysis and semantic graph generation
  • Real-time overlays and interactive features
  • Node-based workbench for custom analysis pipelines
  • Scalable, Google Cloud-based architecture
  • Modular, extensible design ready for AR/VR integration

What we learned

  • Multimodal AI: Leveraging Gemini 3 for deep multimodal analysis—combining video, audio, and text to extract rich, structured data.
  • Knowledge Graphs: Building semantic knowledge graphs from unstructured video content for entity extraction and relationship mapping.
  • User Experience: Designing a node-based, drag-and-drop interface (React Flow) for intuitive, visual programming.
  • Real-Time Processing: Optimizing backend/frontend communication for overlays and analysis pipelines.
  • Cloud Storage: Using Google Cloud Storage buckets for efficient video storage and retrieval.
  • Multimodal Agents: Coordinating multiple specialized AI agents (e.g., for people, products, and places) to improve analysis quality.
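The knowledge-graph step above can be illustrated by converting (subject, relation, object) triples, as extracted by the model, into the node/edge JSON that force-directed renderers like D3.js and React Force Graph consume. A minimal sketch with hypothetical field names:

```python
from typing import Iterable, Tuple

def build_graph(triples: Iterable[Tuple[str, str, str]]) -> dict:
    """Turn (subject, relation, object) triples into a node/edge structure.

    Field names ("id", "source", "target", "relation") are illustrative;
    they mirror what force-directed graph libraries typically expect.
    """
    nodes, edges = {}, []
    for subject, relation, obj in triples:
        for name in (subject, obj):
            nodes.setdefault(name, {"id": name})  # dedupe nodes by name
        edges.append({"source": subject, "target": obj, "relation": relation})
    return {"nodes": list(nodes.values()), "edges": edges}
```

In practice the triples come out of Gemini's structured-output mode, and entity profiles are attached to each node before the graph is persisted as JSON.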

What's next for Visual Intelligence OS

  • Bridge static video and interactive simulation: Use Genie 3 and similar models to transform any video into a living, explorable world.
  • Gamification and decision-making: Let users make choices, explore alternate realities, and experience “what if” scenarios inside any video.
  • AR/VR integration: Step into videos, interact with characters and objects, and experience immersive learning, storytelling, and creativity.
  • World simulation: Enable agents and users to reason, problem-solve, and act in photorealistic, consistent environments generated from video content.
  • Pushing new frontiers: From education to entertainment, the potential to convert passive media into interactive, gamified, and explorable worlds is within reach.

Visual Intelligence OS is more than a tool—it’s a new way to see the world through the lens of AI.
