Vigilant AI - From Footage to Intelligence


Inspiration

Security systems generate thousands of hours of footage - but when incidents happen, someone still has to manually scrub through timelines to find what matters. That bottleneck inspired us.

We asked:

What if surveillance footage could be searched like Google and queried like ChatGPT?

If each camera records $h$ hours per day and there are $n$ cameras, total daily footage becomes:

$$ F = n \times h $$

As $n$ grows, manual review becomes impossible.
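To make that concrete (the numbers are illustrative, not from our deployment):

```latex
% A modest site: 100 cameras recording around the clock
F = 100 \times 24 = 2400 \text{ hours of new footage per day}
```

At roughly 100 days of viewing time generated every single day, no human team can keep up.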

With modern vision-language models and serverless GPUs becoming accessible, we saw an opportunity to transform raw video into structured, searchable intelligence — in minutes, not hours.


What it does

Vigilant AI turns raw CCTV footage into actionable intelligence.

Upload a video and our system:

  • Detects people and activities frame-by-frame
  • Scores threat severity (Stage A → C)
  • Generates a structured incident report
  • Embeds every clip as a semantic vector
  • Makes the entire video searchable in plain English

Users can ask:

  • “Show me fights.”
  • “When was the entrance crowded?”
  • “Have we seen this behavior before?”

And get instant answers — without watching a single frame.


How we built it

We built a full-stack AI pipeline in under 48 hours.

🔹 GPU Inference (Modal)

We run three neural networks simultaneously on a serverless A100 GPU:

  • YOLO11-L → Object detection & person counting
  • CLIP ViT-L/14 → 768-dimensional semantic embeddings
  • Qwen2.5-VL-7B → On-GPU vision-language captioning

Total GPU footprint:

$$ \text{YOLO} + \text{CLIP} + \text{Qwen} \approx 19\,\text{GB of VRAM} $$

Videos are split into overlapping chunks and processed in parallel using Modal’s .map() API.
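The chunking step can be sketched as follows. The chunk length, overlap, and function names are our illustrative choices here, not the exact production values; in the real pipeline each `(start, end)` window fans out to a GPU worker via Modal's `.map()`:

```python
def split_into_chunks(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Split a video timeline into overlapping (start, end) windows.

    The overlap ensures an event straddling a chunk boundary is fully
    contained in at least one chunk.
    """
    chunks = []
    step = chunk_s - overlap_s
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start += step
    return chunks

# In production the windows are processed in parallel, roughly:
#   results = list(process_chunk.map(split_into_chunks(video_len)))
# where process_chunk is a Modal function pinned to the A100 container.
```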


🔹 Hybrid Search (Actian VectorAI + SQLite)

Every clip embedding $c \in \mathbb{R}^{768}$ is stored in Actian VectorAI.

For a query embedding $q$, similarity is computed using cosine similarity:

$$ \text{sim}(q, c) = \frac{q \cdot c}{\|q\|\,\|c\|} $$
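In code, the formula above looks like this (a pure-Python sketch for clarity; the real system computes it inside VectorAI over 768-dimensional CLIP vectors):

```python
import math

def cosine_similarity(q, c):
    """sim(q, c) = (q · c) / (||q|| ||c||); in [-1, 1] for nonzero vectors."""
    dot = sum(qi * ci for qi, ci in zip(q, c))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_c = math.sqrt(sum(ci * ci for ci in c))
    return dot / (norm_q * norm_c)
```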

When a user types a query:

  1. We encode it using CLIP’s text encoder
  2. Run cosine similarity search in VectorAI
  3. Run a parallel keyword search in SQLite
  4. Merge and rank results in under 100ms

Results stream instantly via Server-Sent Events.
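The merge in step 4 could be done several ways; here is one sketch using reciprocal rank fusion, which is an assumption on our part rather than the exact production ranking:

```python
def merge_results(vector_hits, keyword_hits, k: int = 60):
    """Fuse two best-first ranked lists of clip IDs via reciprocal rank fusion.

    A clip ranked highly in either list (or appearing in both) floats
    to the top of the fused ranking.
    """
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, clip_id in enumerate(hits):
            scores[clip_id] = scores.get(clip_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The appeal of rank fusion is that it needs no score normalization: cosine similarities and SQLite keyword scores live on different scales, but ranks are directly comparable.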


🔹 AI Synthesis & Memory

We use Gemini 2.5 Flash to generate:

  • Overall risk level
  • Behavioral intent
  • Named key moments

To prevent hallucinations, we enforced a strict authority hierarchy:

$$ \text{Computer Vision Labels} > \text{VLM Flags} > \text{Caption Text} $$
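That hierarchy can be enforced with a simple priority resolver; the source names and label shapes below are illustrative, not our exact schema:

```python
# Lower number = higher authority:
# computer-vision labels beat VLM flags, which beat caption text.
AUTHORITY = {"cv_label": 0, "vlm_flag": 1, "caption": 2}

def resolve(signals):
    """Return the assessment from the most authoritative source.

    `signals` is a list of (source, assessment) pairs; ties go to the
    first signal seen at the winning authority level.
    """
    return min(signals, key=lambda s: AUTHORITY[s[0]])[1]
```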

After processing, structured intelligence is stored in Supermemory, enabling cross-video natural language queries.


Challenges we ran into

🔸 False Positives

Generic motion words triggered theft alerts during fights. We redesigned detection rules to use contextual multi-word phrases and confidence thresholds.
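A minimal version of that rule change, with made-up phrases and thresholds standing in for our actual detection rules:

```python
# Before: single generic motion words fired theft alerts on their own.
# After: an alert needs a contextual multi-word phrase AND enough confidence.
THEFT_PHRASES = ["reaches into bag", "takes item from shelf", "conceals object"]

def is_theft_alert(caption: str, confidence: float, threshold: float = 0.7) -> bool:
    caption = caption.lower()
    return confidence >= threshold and any(p in caption for p in THEFT_PHRASES)
```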


🔸 Model Authority Conflicts

Sometimes the language model downplayed or omitted detections the CV models had clearly made. We implemented post-synthesis validation to ensure structured signals remain authoritative.


🔸 Performance Constraints

Running three large models on one GPU while streaming chat responses required careful VRAM budgeting and parallel container orchestration.


Accomplishments that we're proud of

  • Running three heavy neural networks simultaneously on a single A100
  • Sub-100ms hybrid vector + keyword search
  • Two-phase streaming chat (clips first, narrative second)
  • A full production-style pipeline built in a weekend
  • Persistent cross-video behavioral memory

Most importantly — it feels like a real product, not just a demo.


What we learned

  • Open-weight models are powerful but require structured safeguards
  • Hybrid search dramatically improves reliability over pure vector search
  • Streaming UX makes AI feel fast and responsive
  • Serverless GPUs remove massive infrastructure barriers

We learned that building intelligence systems isn’t just about models — it’s about orchestration.


What's next for Vigilant AI

  • Real-time camera stream ingestion (not just uploads)
  • Facial re-identification with privacy-preserving embeddings
  • Multi-camera correlation and anomaly detection
  • Role-based dashboards for security teams
  • Edge-device deployment for low-latency environments

Our long-term vision:

$$ \text{Surveillance} \rightarrow \text{Proactive Intelligence} $$

Cameras shouldn’t just record. They should understand.
