Vigilant AI - From Footage to Intelligence


Inspiration

Security systems generate thousands of hours of footage - but when incidents happen, someone still has to manually scrub through timelines to find what matters. That bottleneck inspired us.

We asked:

What if surveillance footage could be searched like Google and queried like ChatGPT?

If each camera records $h$ hours per day and there are $n$ cameras, total daily footage becomes:

$$ F = n \times h $$

As $n$ grows, manual review becomes impossible.
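To make that concrete (the numbers are illustrative, not from our deployment):

```latex
% A modest site: 100 cameras recording around the clock
F = 100 \times 24 = 2400 \text{ hours of new footage per day}
```

At roughly 100 days of viewing time generated every single day, no human team can keep up.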

With modern vision-language models and serverless GPUs becoming accessible, we saw an opportunity to transform raw video into structured, searchable intelligence — in minutes, not hours.


What it does

Vigilant AI turns raw CCTV footage into actionable intelligence.

Upload a video and our system:

  • Detects people and activities frame-by-frame
  • Scores threat severity (Stage A → C)
  • Generates a structured incident report
  • Embeds every clip as a semantic vector
  • Makes the entire video searchable in plain English

Users can ask:

  • “Show me fights.”
  • “When was the entrance crowded?”
  • “Have we seen this behavior before?”

And get instant answers — without watching a single frame.


How we built it

We built a full-stack AI pipeline in under 48 hours.

🔹 GPU Inference (Modal)

We run three neural networks simultaneously on a serverless A100 GPU:

  • YOLO11-L → Object detection & person counting
  • CLIP ViT-L/14 → 768-dimensional semantic embeddings
  • Qwen2.5-VL-7B → On-GPU vision-language captioning

Total GPU footprint:

$$ \text{YOLO} + \text{CLIP} + \text{Qwen} \approx 19\,\text{GB of VRAM} $$

Videos are split into overlapping chunks and processed in parallel using Modal’s .map() API.
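The chunking step can be sketched as follows. The chunk length, overlap, and function names are our illustrative choices here, not the exact production values; in the real pipeline each `(start, end)` window fans out to a GPU worker via Modal's `.map()`:

```python
def split_into_chunks(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Split a video timeline into overlapping (start, end) windows.

    The overlap ensures an event straddling a chunk boundary is fully
    contained in at least one chunk.
    """
    chunks = []
    step = chunk_s - overlap_s
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start += step
    return chunks

# In production the windows are processed in parallel, roughly:
#   results = list(process_chunk.map(split_into_chunks(video_len)))
# where process_chunk is a Modal function pinned to the A100 container.
```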


🔹 Hybrid Search (Actian VectorAI + SQLite)

Every clip embedding $c \in \mathbb{R}^{768}$ is stored in Actian VectorAI.

For a query embedding $q$, similarity is computed using cosine similarity:

$$ \text{sim}(q, c) = \frac{q \cdot c}{\|q\|\,\|c\|} $$
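In code, the formula above looks like this (a pure-Python sketch for clarity; the real system computes it inside VectorAI over 768-dimensional CLIP vectors):

```python
import math

def cosine_similarity(q, c):
    """sim(q, c) = (q · c) / (||q|| ||c||); in [-1, 1] for nonzero vectors."""
    dot = sum(qi * ci for qi, ci in zip(q, c))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_c = math.sqrt(sum(ci * ci for ci in c))
    return dot / (norm_q * norm_c)
```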

When a user types a query:

  1. We encode it using CLIP’s text encoder
  2. Run cosine similarity search in VectorAI
  3. Run a parallel keyword search in SQLite
  4. Merge and rank results in under 100ms

Results stream instantly via Server-Sent Events.
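The merge in step 4 could be done several ways; here is one sketch using reciprocal rank fusion, which is an assumption on our part rather than the exact production ranking:

```python
def merge_results(vector_hits, keyword_hits, k: int = 60):
    """Fuse two best-first ranked lists of clip IDs via reciprocal rank fusion.

    A clip ranked highly in either list (or appearing in both) floats
    to the top of the fused ranking.
    """
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, clip_id in enumerate(hits):
            scores[clip_id] = scores.get(clip_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The appeal of rank fusion is that it needs no score normalization: cosine similarities and SQLite keyword scores live on different scales, but ranks are directly comparable.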


🔹 AI Synthesis & Memory

We use Gemini 2.5 Flash to generate:

  • Overall risk level
  • Behavioral intent
  • Named key moments

To prevent hallucinations, we enforced a strict authority hierarchy:

$$ \text{Computer Vision Labels} > \text{VLM Flags} > \text{Caption Text} $$
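That hierarchy can be enforced with a simple priority resolver; the source names and label shapes below are illustrative, not our exact schema:

```python
# Lower number = higher authority:
# computer-vision labels beat VLM flags, which beat caption text.
AUTHORITY = {"cv_label": 0, "vlm_flag": 1, "caption": 2}

def resolve(signals):
    """Return the assessment from the most authoritative source.

    `signals` is a list of (source, assessment) pairs; ties go to the
    first signal seen at the winning authority level.
    """
    return min(signals, key=lambda s: AUTHORITY[s[0]])[1]
```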

After processing, structured intelligence is stored in Supermemory, enabling cross-video natural language queries.


Challenges we ran into

🔸 False Positives

Generic motion words triggered theft alerts during fights. We redesigned detection rules to use contextual multi-word phrases and confidence thresholds.
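A minimal version of that rule change, with made-up phrases and thresholds standing in for our actual detection rules:

```python
# Before: single generic motion words fired theft alerts on their own.
# After: an alert needs a contextual multi-word phrase AND enough confidence.
THEFT_PHRASES = ["reaches into bag", "takes item from shelf", "conceals object"]

def is_theft_alert(caption: str, confidence: float, threshold: float = 0.7) -> bool:
    caption = caption.lower()
    return confidence >= threshold and any(p in caption for p in THEFT_PHRASES)
```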


🔸 Model Authority Conflicts

Sometimes the language model downplayed or omitted detections the CV models had clearly made. We implemented post-synthesis validation to ensure structured signals remain authoritative.


🔸 Performance Constraints

Running three large models on one GPU while streaming chat responses required careful VRAM budgeting and parallel container orchestration.


Accomplishments that we're proud of

  • Running three heavy neural networks simultaneously on a single A100
  • Sub-100ms hybrid vector + keyword search
  • Two-phase streaming chat (clips first, narrative second)
  • A full production-style pipeline built in a weekend
  • Persistent cross-video behavioral memory

Most importantly — it feels like a real product, not just a demo.


What we learned

  • Open-weight models are powerful but require structured safeguards
  • Hybrid search dramatically improves reliability over pure vector search
  • Streaming UX makes AI feel fast and responsive
  • Serverless GPUs remove massive infrastructure barriers

We learned that building intelligence systems isn’t just about models — it’s about orchestration.


What's next for Vigilant AI

  • Real-time camera stream ingestion (not just uploads)
  • Facial re-identification with privacy-preserving embeddings
  • Multi-camera correlation and anomaly detection
  • Role-based dashboards for security teams
  • Edge-device deployment for low-latency environments

Our long-term vision:

$$ \text{Surveillance} \rightarrow \text{Proactive Intelligence} $$

Cameras shouldn’t just record. They should understand.
