Inspiration
The safety of our educational and corporate campuses is paramount, yet traditional security systems are often reactive, recording events rather than preventing them. We were inspired by the potential of modern AI to create a proactive security solution. We envisioned a system that doesn't just watch, but understands—a smart guardian that can distinguish between a student studying late in the library and a potential intruder in a restricted zone. The launch of powerful multi-modal models like Google's Gemini 2.5 Pro was the catalyst, giving us the tool to build a system that could analyze complex scenarios and provide genuine intelligence, creating a truly safer environment for everyone.
What it does
VisionGuard AI is a comprehensive, AI-powered security platform designed for smart campus management. It transforms standard video footage into actionable security intelligence.
- Enhanced Video Analysis: It ingests video files and uses YOLOv8 to detect objects like people, vehicles, and backpacks in real-time.
- Advanced Face Recognition: The system distinguishes between authorized personnel and unauthorized individuals, using a semi-supervised approach to automatically cluster and learn unknown faces over time.
- Zone-Based Contextual Alerting: We've pre-configured distinct campus zones (Main Gate, Library, Construction Site, etc.) with specific rules. The AI's response changes based on the context—loitering is acceptable in the library but a high-priority alert at the main gate after midnight.
- Multi-Modal Surveillance: It analyzes audio tracks for anomalies like glass breaking, shouting, or alarms, correlating them with video events for higher accuracy.
- Gemini-Powered Smart Alerts: At its core, VisionGuard uses Gemini 2.5 Pro to analyze the combined data streams. It assesses threat levels, provides a natural language summary of events, and generates prioritized alerts (from Low to Critical) to cut down on false positives and keep attention on real threats.
- Privacy-First Design: To comply with regulations like GDPR, the system automatically blurs the faces of all unauthorized or unidentified individuals.
- AI Assistant: Security personnel can interact with the system using natural language, asking questions like, "Summarize all high-priority incidents from this week" or "Generate an incident report for the gate climbing event."
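A minimal sketch of how the zone-based loitering rule described above could be expressed, using only the standard library. The names `ZoneRule`, `in_window`, and `loitering_priority`, the 00:00 to 05:00 window, and the "Low" daytime priority are illustrative assumptions, not the project's actual configuration.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass(frozen=True)
class ZoneRule:
    """Hypothetical per-zone rule; field names are assumptions."""
    name: str
    loitering_allowed: bool
    # Window in which loitering escalates to a high-priority alert.
    quiet_start: Optional[time] = None
    quiet_end: Optional[time] = None


def in_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls in [start, end), handling windows that wrap midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def loitering_priority(rule: ZoneRule, now: time) -> Optional[str]:
    """Return an alert priority for loitering, or None if no alert is needed."""
    if rule.loitering_allowed:
        return None
    if rule.quiet_start is not None and rule.quiet_end is not None:
        if in_window(now, rule.quiet_start, rule.quiet_end):
            return "High"
    # Assumed default: loitering outside the quiet window is only a low-priority note.
    return "Low"


# Assumed example zones; the real thresholds would be site-specific.
ZONES = {
    "Library": ZoneRule("Library", loitering_allowed=True),
    "Main Gate": ZoneRule("Main Gate", loitering_allowed=False,
                          quiet_start=time(0, 0), quiet_end=time(5, 0)),
}
```

With these assumed rules, `loitering_priority(ZONES["Library"], time(1, 0))` yields no alert, while the same behavior at the main gate at 1 AM is rated "High".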
How we built it
We built VisionGuard AI on a robust, modern Python stack, creating a modular pipeline to handle the complex workflow from video input to intelligent alert.
- Frontend: The user interface is a clean and interactive dashboard built with Streamlit, allowing for easy video uploads, configuration, and alert management.
- Video & Object Detection: We use OpenCV to process videos and extract frames. The powerful YOLOv8 model runs on these frames to perform initial object detection.
- Face Recognition & Clustering: The face_recognition library (built on dlib) handles face detection and creates embeddings. For efficiency, we use FAISS (Facebook AI Similarity Search) to create a highly optimized index of authorized faces, enabling near-instant lookups. For the semi-supervised learning component, we use the DBSCAN algorithm to automatically cluster embeddings of unknown faces without needing manual labels.
- AI Reasoning Engine: The cornerstone of our project is the Gemini 2.5 Pro API. All the processed metadata (object detections, face matches, zone information, and audio anomalies) is compiled into a detailed prompt. Gemini analyzes this context to determine the event's significance, generate a human-readable summary, and assign a precise threat level.
- Backend & Data: The application logic is written in Python. Alerts and logs are managed as structured data (like JSON or CSV), ensuring easy storage, retrieval, and analysis.
- Deployment: The entire application is containerized using Docker, ensuring that all dependencies are managed and it can be deployed consistently on any system.
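To illustrate the AI Reasoning Engine step, here is a minimal sketch of compiling detection metadata into a prompt. It deliberately stops short of calling the Gemini API. The function `build_prompt`, the event field names, the prompt wording, and the intermediate threat levels ("Medium", "High") are all assumptions made for illustration; the write-up only states that levels range from Low to Critical.

```python
import json

# Only "Low" and "Critical" come from the write-up; the middle levels are assumed.
THREAT_LEVELS = ("Low", "Medium", "High", "Critical")


def build_prompt(event: dict) -> str:
    """Serialize one event's metadata into a text prompt for a reasoning model."""
    context = json.dumps(event, indent=2, sort_keys=True)
    return (
        "You are a campus security analyst.\n"
        "Review the detection metadata below and reply with a JSON object "
        "containing 'summary' (one or two sentences) and 'threat_level' "
        f"(one of: {', '.join(THREAT_LEVELS)}).\n\n"
        f"Metadata:\n{context}\n"
    )


# Hypothetical event; every field name here is illustrative.
example_event = {
    "zone": "Main Gate",
    "timestamp": "2025-01-15T00:42:00",
    "objects": [{"label": "person", "count": 1}, {"label": "backpack", "count": 1}],
    "face_match": {"authorized": False, "cluster_id": 7},
    "audio_anomalies": ["glass_breaking"],
}

prompt = build_prompt(example_event)
```

Keeping the metadata as sorted JSON inside the prompt makes the input reproducible and easy to log alongside the model's reply.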
Challenges we ran into
Building a multi-faceted AI system came with its share of challenges.
- Performance Optimization: Processing video is resource-intensive. A 1-minute video at 30 fps has 1,800 frames, and analyzing every one of them was not feasible. We overcame this with a frame sampling strategy (e.g., analyzing every 30th frame, which is roughly one frame per second at 30 fps), giving us a good balance between performance and detection accuracy.
- Managing API Rate Limits: The free tier of the Gemini API has request limits. We engineered a graceful fallback and queueing system to manage our API calls, ensuring the system remains functional and responsive even during heavy processing loads.
- Reducing False Positives: Initially, our system would flag benign events, like a group of students gathering, as a "crowd formation" threat. This was a major challenge. We solved it by making Gemini the final arbiter. Instead of relying on raw detection counts, we feed the full context to Gemini, which can reason that a "crowd" in front of the library during the day is normal, but the same "crowd" near a dormitory at 3 AM is a potential security issue.
- Complex Dependency Management: The computer vision libraries, especially dlib and face_recognition, are notoriously difficult to install correctly across different operating systems. We spent significant time creating a robust requirements.txt and a Dockerfile, along with detailed troubleshooting steps in our README, to make the setup process as smooth as possible.
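The frame sampling strategy mentioned under Performance Optimization can be sketched in a few lines. This version is library-agnostic: it accepts any iterable of frames (for example, frames read in a loop from a video decoder), and the name `sample_frames` and its `step` parameter are assumptions for illustration.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

Frame = TypeVar("Frame")


def sample_frames(frames: Iterable[Frame], step: int = 30) -> Iterator[Tuple[int, Frame]]:
    """Yield (index, frame) for every `step`-th frame, starting with frame 0."""
    if step < 1:
        raise ValueError("step must be at least 1")
    for index, frame in enumerate(frames):
        if index % step == 0:
            yield index, frame


# Stand-in for a 1-minute, 30 fps clip: 1,800 frame placeholders.
sampled = list(sample_frames(range(1800), step=30))
print(len(sampled))  # 60 frames are passed on to detection instead of 1,800
```

A larger `step` lowers the compute cost further but increases the chance of missing short events, so the value is a tuning decision rather than a fixed constant.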
Accomplishments that we're proud of
We are incredibly proud of creating a system that is more than just a collection of AI models; it's a cohesive intelligence platform.
- True Multi-Modal Integration: We successfully merged video, audio, and spatial (zone) data into a single analytical pipeline. This holistic view provides a level of insight that a video-only system could never achieve.
- Semi-Supervised Learning for Scalability: Our automatic clustering of unknown faces is a game-changer. It means the system adapts and learns over time with minimal human intervention, making it practical for a real-world campus where new faces appear daily.
- Contextual Intelligence with Gemini: Our proudest accomplishment is how we leveraged Gemini 2.5 Pro. It elevates the system from a simple "detector" to a "reasoner." It understands the nuance of security, which drastically reduces false alarms and allows security staff to focus on what truly matters.
- Building a Privacy-Conscious Tool: We integrated face blurring from day one, demonstrating that powerful security can coexist with individual privacy.
What we learned
This project was a phenomenal learning experience. We learned how to architect a complex, end-to-end AI application, from data ingestion to a user-friendly frontend. We gained deep insights into the trade-offs between model accuracy and processing speed in computer vision tasks. Most importantly, we learned how to effectively use Large Language Models like Gemini as a reasoning engine to add a layer of contextual understanding on top of raw sensor data—a skill we believe is the future of applied AI.
What's next for VisionGuard AI
We see a bright future for VisionGuard AI and have a clear roadmap for its evolution.
- Real-Time Stream Processing: Our next major goal is to move from file uploads to processing live RTSP camera streams, enabling true real-time monitoring.
- Mobile Integration: We plan to develop a companion mobile app that sends push notifications for high-priority alerts directly to security personnel on the ground.
- Advanced Behavior Recognition: We want to train more sophisticated models to detect complex behaviors like fighting, medical emergencies (falls), or vandalism.
- Hardware Integration: We aim to integrate with existing campus security systems, allowing VisionGuard to automatically lock doors, turn on lights, or trigger alarms in response to critical threats.
Built With
- dlib
- face-recognition
- faiss
- googlegemini2.5proapi
- opencv
- pandas
- python
- scikit-learn
- streamlit
- yolov8