Vigilant AI - From Footage to Intelligence
Inspiration
Security systems generate thousands of hours of footage, but when an incident happens, someone still has to manually scrub through timelines to find what matters. That bottleneck inspired us.
We asked:
What if surveillance footage could be searched like Google and queried like ChatGPT?
If each camera records \( h \) hours per day and there are \( n \) cameras, the total daily footage is:
$$ F = n \times h $$
As \( n \) grows, manual review becomes impossible: 50 cameras recording around the clock already produce 1,200 hours of footage every day.
With modern vision-language models and serverless GPUs becoming accessible, we saw an opportunity to transform raw video into structured, searchable intelligence — in minutes, not hours.
What it does
Vigilant AI turns raw CCTV footage into actionable intelligence.
Upload a video and our system:
- Detects people and activities frame-by-frame
- Scores threat severity (Stage A → C)
- Generates a structured incident report
- Embeds every clip as a semantic vector
- Makes the entire video searchable in plain English
Users can ask:
- “Show me fights.”
- “When was the entrance crowded?”
- “Have we seen this behavior before?”
And get instant answers — without watching a single frame.
How we built it
We built a full-stack AI pipeline in under 48 hours.
🔹 GPU Inference (Modal)
We run three neural networks simultaneously on a serverless A100 GPU:
- YOLO11-L → Object detection & person counting
- CLIP ViT-L/14 → 768-dimensional semantic embeddings
- Qwen2.5-VL-7B → On-GPU vision-language captioning
Total GPU footprint:
$$ \text{YOLO} + \text{CLIP} + \text{Qwen} \approx 19\,\text{GB VRAM} $$
Videos are split into overlapping chunks and processed in parallel using Modal’s .map() API.
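The chunking step above can be sketched as follows. The chunk length, overlap, and function names here are illustrative assumptions, not our exact production values; the Modal fan-out is shown in comments since it requires a deployed app:

```python
def make_chunks(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Split [0, duration_s] into overlapping (start, end) windows.

    The overlap ensures an event that straddles a chunk boundary is
    fully visible in at least one chunk. Parameters are assumed values.
    """
    step = chunk_s - overlap_s
    chunks = []
    start = 0.0
    while start < duration_s:
        chunks.append((start, min(start + chunk_s, duration_s)))
        start += step
    return chunks

# With Modal, each chunk is then fanned out across GPU containers, e.g.:
#   results = list(process_chunk.map(make_chunks(video_duration)))
# where process_chunk would be a modal.App function running on an A100.
```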
🔹 Hybrid Search (Actian VectorAI + SQLite)
Every clip embedding \( c \in \mathbb{R}^{768} \) is stored in Actian VectorAI.
For a query embedding \( q \), similarity is computed using cosine similarity:
$$ \text{sim}(q, c) = \frac{q \cdot c}{\|q\|\,\|c\|} $$
When a user types a query:
- We encode it using CLIP’s text encoder
- Run cosine similarity search in VectorAI
- Run a parallel keyword search in SQLite
- Merge and rank results in under 100ms
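A minimal sketch of the scoring and merging steps, in plain Python. The cosine function matches the formula above; the merge shown here uses reciprocal rank fusion, which is one common way to combine vector and keyword rankings — our exact ranking scheme is an assumption here:

```python
import math

def cosine_sim(q, c):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(q, c))
    nq = math.sqrt(sum(a * a for a in q))
    nc = math.sqrt(sum(b * b for b in c))
    return dot / (nq * nc)

def rrf_merge(vector_hits, keyword_hits, k=60):
    """Merge two ranked lists of clip ids via reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per clip; clips found by both
    searches accumulate score from both, so agreement ranks highest.
    """
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, clip_id in enumerate(hits, start=1):
            scores[clip_id] = scores.get(clip_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```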
Results stream instantly via Server-Sent Events.
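The SSE framing itself is simple to sketch. This generator formats results as `text/event-stream` frames; with FastAPI it would be wrapped in a `StreamingResponse` (the payload shapes are illustrative):

```python
import json

def sse_events(payloads):
    """Yield dicts as Server-Sent Events frames (data-only messages).

    With FastAPI, a sketch of the wiring would be:
        StreamingResponse(sse_events(results),
                          media_type="text/event-stream")
    """
    for payload in payloads:
        yield f"data: {json.dumps(payload)}\n\n"
```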
🔹 AI Synthesis & Memory
We use Gemini 2.5 Flash to generate:
- Overall risk level
- Behavioral intent
- Named key moments
To prevent hallucinations, we enforced a strict authority hierarchy:
$$ \text{Computer Vision Labels} > \text{VLM Flags} > \text{Caption Text} $$
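The hierarchy above can be sketched as a simple resolution function: the first non-empty signal in authority order wins. The signal names and fallback value are illustrative assumptions:

```python
def resolve_threat(cv_label, vlm_flag, caption_label):
    """Return the highest-authority available signal.

    Authority order: computer-vision labels > VLM flags > caption text.
    Lower-authority signals are only consulted when higher ones are absent.
    """
    for signal in (cv_label, vlm_flag, caption_label):
        if signal is not None:
            return signal
    return "unknown"
```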
After processing, structured intelligence is stored in Supermemory, enabling cross-video natural language queries.
Challenges we ran into
🔸 False Positives
Generic motion words triggered theft alerts during fights. We redesigned detection rules to use contextual multi-word phrases and confidence thresholds.
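The redesigned rules can be sketched like this; the phrase list and thresholds are made-up placeholders, not our actual rule set:

```python
# Hypothetical rule table: each entry is a contextual multi-word
# phrase paired with the minimum confidence needed to fire it.
THEFT_PHRASES = [
    ("grabs the bag and runs", 0.7),
    ("takes item from shelf", 0.6),
]

def matches_theft(caption: str, confidence: float, floor: float = 0.5) -> bool:
    """Flag theft only when a multi-word contextual phrase appears AND
    detection confidence clears the threshold, so single motion words
    like 'grab' alone no longer trigger an alert."""
    text = caption.lower()
    for phrase, phrase_thr in THEFT_PHRASES:
        if phrase in text and confidence >= max(floor, phrase_thr):
            return True
    return False
```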
🔸 Model Authority Conflicts
Sometimes language models under-called real CV detections. We implemented post-synthesis validation to ensure structured signals remain authoritative.
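A minimal sketch of that validation step, assuming a dict-shaped report and a list of CV detection labels (the severity set and field names are illustrative):

```python
def validate_synthesis(report: dict, cv_detections: list) -> dict:
    """Re-assert structured CV signals after LLM synthesis.

    If the vision pipeline detected a high-severity event that the
    synthesized narrative under-called, escalate the risk level so the
    structured signal remains authoritative.
    """
    severe = {"fight", "weapon"}  # assumed high-severity label set
    detected = severe.intersection(cv_detections)
    if detected and report.get("risk") not in ("high", "critical"):
        report = {**report, "risk": "high",
                  "escalated_by": sorted(detected)}
    return report
```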
🔸 Performance Constraints
Running three large models on one GPU while streaming chat responses required careful VRAM budgeting and parallel container orchestration.
Accomplishments that we're proud of
- Running three heavy neural networks simultaneously on a single A100
- Sub-100ms hybrid vector + keyword search
- Two-phase streaming chat (clips first, narrative second)
- A full production-style pipeline built in a weekend
- Persistent cross-video behavioral memory
Most importantly — it feels like a real product, not just a demo.
What we learned
- Open-weight models are powerful but require structured safeguards
- Hybrid search dramatically improves reliability over pure vector search
- Streaming UX makes AI feel fast and responsive
- Serverless GPUs remove massive infrastructure barriers
We learned that building intelligence systems isn’t just about models — it’s about orchestration.
What's next for Vigilant AI
- Real-time camera stream ingestion (not just uploads)
- Facial re-identification with privacy-preserving embeddings
- Multi-camera correlation and anomaly detection
- Role-based dashboards for security teams
- Edge-device deployment for low-latency environments
Our long-term vision:
$$ \text{Surveillance} \rightarrow \text{Proactive Intelligence} $$
Cameras shouldn’t just record. They should understand.
Built With
- actian
- fastapi
- gemini
- modal
- python
- qwen
- react
- supermemory
- typescript
- yolo11-l