EyeSpy: AI-Powered Object Detection & Audio Transcription System

Inspiration

The idea for EyeSpy originated from the growing need for automated surveillance and real-time monitoring in environments where unauthorized mobile phone usage can lead to security risks, productivity issues, and policy violations.

With the increasing dependence on AI-driven automation, we wanted to create a smart monitoring tool that could:

Detect mobile phone usage in real-time in restricted environments (e.g., offices, classrooms, secure facilities).
Record video clips of detected activity for review and documentation.
Transcribe audio conversations for compliance, security, and analysis.
Enable playback and review through a user-friendly GUI.

By integrating computer vision, machine learning, and speech recognition, we envisioned EyeSpy as a cutting-edge AI-powered security tool capable of enhancing privacy enforcement, workplace efficiency, and compliance monitoring.

What It Does

EyeSpy is an AI-powered surveillance system that provides real-time object detection and audio transcription. The system has two primary functions:

1. Object Detection & Video Recording

Detects mobile phones in a live video stream using an advanced deep learning model (SSD MobileNet v3).
Draws bounding boxes around detected phones with confidence scores.
Automatically records a 15-second video clip upon detection.
Timestamps recorded events and stores them for later review.
Provides a GUI for easy video playback.

2. Audio Recording & Transcription

Continuously records audio in fixed intervals (default: 15 minutes per file).
Uses OpenAI Whisper for highly accurate speech-to-text conversion.
Generates time-stamped transcriptions for easy reference.
Stores transcribed text along with the original audio for further analysis.

How We Built It

EyeSpy was developed using a combination of AI-powered tools, computer vision, and speech recognition. We structured the system into separate modules, ensuring efficiency, scalability, and real-time performance.

1. Object Detection Module

Implemented SSD MobileNet v3 with OpenCV for real-time object detection.
Used a model trained on the COCO dataset, which includes a "cell phone" class, to recognize mobile phones.
Configured bounding boxes and confidence thresholds for optimal accuracy.
Used multithreading to ensure real-time detection and response.
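
The detection steps above can be sketched roughly as follows. This is a minimal illustration, not our exact code: the function names, the 320x320 input size, the normalization constants, and the 0.5 threshold are assumptions, and the weight and config file paths are left to the caller. It assumes the label list is ordered so that class id 1 maps to the first label, as in the common COCO label files shipped with this model.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height


def select_phones(class_ids: Sequence, confidences: Sequence, boxes: Sequence,
                  labels: Sequence[str], target: str = "cell phone",
                  min_conf: float = 0.5) -> List[Tuple[float, Box]]:
    """Keep only detections labelled `target` at or above `min_conf`."""
    hits = []
    for cid, conf, box in zip(class_ids, confidences, boxes):
        name = labels[int(cid) - 1]  # assumption: class ids are 1-based
        if name == target and float(conf) >= min_conf:
            hits.append((float(conf), tuple(int(v) for v in box)))
    return hits


def build_detector(weights_path: str, config_path: str):
    """Load SSD MobileNet v3 through OpenCV's DNN detection wrapper."""
    import cv2  # imported lazily so the filtering logic works without OpenCV
    model = cv2.dnn_DetectionModel(weights_path, config_path)
    model.setInputSize(320, 320)            # assumed input resolution
    model.setInputScale(1.0 / 127.5)        # assumed normalization
    model.setInputMean((127.5, 127.5, 127.5))
    model.setInputSwapRB(True)              # OpenCV frames are BGR
    return model


def detect_phones(model, frame, labels, min_conf: float = 0.5):
    """Run one frame through the model and return phone detections."""
    import numpy as np
    class_ids, confidences, boxes = model.detect(frame, confThreshold=min_conf)
    if len(class_ids) == 0:
        return []
    return select_phones(np.asarray(class_ids).flatten(),
                         np.asarray(confidences).flatten(),
                         boxes, labels, min_conf=min_conf)


def annotate(frame, hits):
    """Draw a box and confidence score for each detection."""
    import cv2
    for conf, (x, y, w, h) in hits:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, f"phone {conf:.2f}", (x, max(y - 8, 12)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame
```

Looking the class up by label name rather than by a hard-coded id keeps the sketch independent of which COCO label file is used.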

2. Video Recording Module

Integrated OpenCV’s VideoWriter to record video only when a phone is detected.
Optimized storage management by limiting video duration to 15 seconds per event.
Implemented timestamp overlays for each recorded clip.
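
One way to express the event-triggered, fixed-length recording described above is a small timer that tells the capture loop when to open and close a clip. The sketch below is illustrative only: `ClipTimer`, `clip_name`, the `phone_` file prefix, the `mp4v` codec, and the overlay position are assumptions.

```python
from datetime import datetime


class ClipTimer:
    """Decides when a fixed-length clip should start and stop."""

    def __init__(self, duration_s: float = 15.0):
        self.duration_s = duration_s
        self.started_at = None  # None means no clip is being recorded

    def update(self, now: float, phone_detected: bool) -> str:
        """Return 'start', 'continue', 'stop', or 'idle' for this frame."""
        if self.started_at is not None:
            if now - self.started_at >= self.duration_s:
                self.started_at = None
                return "stop"
            return "continue"
        if phone_detected:
            self.started_at = now
            return "start"
        return "idle"


def clip_name(when: datetime) -> str:
    """Timestamped file name so clips sort chronologically."""
    return f"phone_{when:%Y%m%d_%H%M%S}.mp4"


def open_writer(path: str, fps: float, width: int, height: int):
    """Open an OpenCV VideoWriter for a new clip."""
    import cv2  # imported lazily so the timer logic works without OpenCV
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")  # assumed codec
    return cv2.VideoWriter(path, fourcc, fps, (width, height))


def stamp(frame, when: datetime):
    """Overlay the capture time in the top-left corner of the frame."""
    import cv2
    cv2.putText(frame, when.strftime("%Y-%m-%d %H:%M:%S"), (10, 25),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
    return frame
```

In a capture loop, "start" would open a writer via `open_writer`, "start" and "continue" would write the stamped frame, and "stop" would call `release()` on the writer.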

3. Audio Transcription Module

Captured high-quality audio using Sounddevice.
Processed audio through OpenAI Whisper, ensuring high transcription accuracy.
Implemented automatic file naming and timestamping for organized data storage.
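
A rough sketch of the record-then-transcribe flow described above is shown here. The function names, the 16 kHz mono format, and the "base" model size are assumptions for illustration. Whisper also needs ffmpeg installed to read audio files.

```python
import wave
from typing import Dict, List, Sequence


def format_ts(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments: Sequence[Dict]) -> List[str]:
    """Turn Whisper segments into time-stamped transcript lines."""
    return [
        f"[{format_ts(seg['start'])} - {format_ts(seg['end'])}] {seg['text'].strip()}"
        for seg in segments
    ]


def record_chunk(path: str, seconds: int = 15 * 60, samplerate: int = 16000) -> None:
    """Record one fixed-length mono chunk and save it as a 16-bit WAV file."""
    import sounddevice as sd  # imported lazily; requires an input device
    audio = sd.rec(int(seconds * samplerate), samplerate=samplerate,
                   channels=1, dtype="int16")
    sd.wait()  # block until the recording is finished
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per int16 sample
        wav.setframerate(samplerate)
        wav.writeframes(audio.tobytes())


def transcribe(path: str, model_name: str = "base") -> List[str]:
    """Transcribe a recorded file and return time-stamped lines."""
    import whisper  # the openai-whisper package
    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return format_segments(result["segments"])
```

In a long-running process the model would normally be loaded once and reused across chunks rather than reloaded per file.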

4. Graphical User Interface (GUI)

Built a Tkinter-based interface for playing recorded videos.
Integrated status alerts and playback controls to improve usability.
Ensured a lightweight and intuitive user experience.

5. System Optimization

Implemented multithreading to handle video detection, audio recording, and GUI interactions without lag.
Designed error-handling mechanisms for hardware-related failures (e.g., missing microphone, disconnected camera).
Minimized resource consumption to allow the system to run efficiently on low-end hardware.

Challenges We Ran Into

Developing EyeSpy came with multiple challenges that required creative problem-solving and optimization techniques.

1. Real-Time Processing Bottlenecks

Problem: Running object detection, video recording, and transcription simultaneously led to high CPU/GPU usage.
Solution: Implemented threading and asynchronous execution so that detection, recording, and transcription run concurrently and no single task blocks the others.
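
A common pattern for this kind of pipeline, shown here as a hedged sketch rather than our exact implementation, is a bounded queue between the capture thread and the detection thread, where stale frames are dropped so the detector always works on the most recent one. The names `offer_latest` and `consume` are hypothetical.

```python
import queue
import threading


def offer_latest(q: queue.Queue, item) -> None:
    """Put `item` on a bounded queue, discarding the oldest entry if it is full."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()  # drop the stale item to make room
            except queue.Empty:
                pass  # a consumer took it first; retry the put


def consume(q: queue.Queue, handle, stop: threading.Event) -> None:
    """Process items until `stop` is set; the timeout keeps shutdown prompt."""
    while not stop.is_set():
        try:
            item = q.get(timeout=0.1)
        except queue.Empty:
            continue
        handle(item)
```

With `queue.Queue(maxsize=1)`, the capture loop calls `offer_latest(frames, frame)` on every frame, and a worker started with `threading.Thread(target=consume, args=(frames, detect, stop))` handles whatever is newest. OpenCV and Whisper do most of their work in native code, which is why threads help here despite Python's global interpreter lock.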

2. Object Detection Accuracy

Problem: The COCO model sometimes failed to detect phones in certain lighting conditions.
Solution: Tuned confidence thresholds, improved image preprocessing, and tested alternative detection models.

3. Handling Noisy Environments for Transcription

Problem: The speech recognition model struggled with background noise and multiple overlapping speakers.
Solution: Applied noise reduction techniques and optimized microphone input settings.

4. Storage Management for Continuous Recording

Problem: Continuous video and audio recording resulted in large file sizes.
Solution: Implemented automated file management, including compression techniques and file deletion policies.
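
As an illustration of a file deletion policy, the sketch below removes the oldest recordings until a folder fits within a size budget. The function name and the size-based policy are assumptions; an age-based policy would work similarly.

```python
from pathlib import Path
from typing import List


def prune_oldest(folder: str, max_bytes: int) -> List[str]:
    """Delete the oldest files in `folder` until its total size is <= max_bytes.

    Returns the names of the deleted files, oldest first.
    """
    files = sorted(
        (p for p in Path(folder).iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,  # oldest modification time first
    )
    total = sum(p.stat().st_size for p in files)
    removed = []
    for p in files:
        if total <= max_bytes:
            break
        total -= p.stat().st_size  # read the size before deleting the file
        p.unlink()
        removed.append(p.name)
    return removed
```

Calling `prune_oldest("recordings", 5 * 1024**3)` after each saved clip would cap a hypothetical `recordings` folder at roughly 5 GiB.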

5. GUI Freezing During Processing

Problem: The Tkinter GUI became unresponsive when handling large video and audio files.
Solution: Optimized event-driven UI interactions to improve responsiveness.
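
One widely used way to keep a Tkinter window responsive, sketched here under the assumption that slow work can be moved off the main thread, is to run the work in a background thread and have the GUI poll a queue with `after`. The helper names are hypothetical, and this is not necessarily the exact approach used in EyeSpy.

```python
import queue
import threading


def run_in_background(task, results: queue.Queue) -> threading.Thread:
    """Run `task()` in a daemon thread and post ('ok', value) or ('error', exc)."""
    def worker():
        try:
            results.put(("ok", task()))
        except Exception as exc:  # report the failure instead of losing it
            results.put(("error", exc))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def drain(results: queue.Queue) -> list:
    """Return every result currently waiting, without blocking."""
    items = []
    while True:
        try:
            items.append(results.get_nowait())
        except queue.Empty:
            return items


def poll(root, results: queue.Queue, on_result, interval_ms: int = 100) -> None:
    """Deliver finished results on the Tkinter main thread, then reschedule."""
    for status, value in drain(results):
        on_result(status, value)
    root.after(interval_ms, poll, root, results, on_result, interval_ms)
```

Only `poll` touches the widgets, so all GUI updates stay on the main thread while file loading or transcription runs elsewhere.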

Accomplishments That We’re Proud Of

Throughout the development of EyeSpy, we achieved several key milestones:

Successfully implemented real-time object detection with high accuracy.
Automated event-triggered video recording for surveillance and security.
Developed a reliable audio transcription system using state-of-the-art speech recognition.
Created a user-friendly GUI that enhances usability and interaction.
Optimized system performance to handle real-time detection, recording, and transcription efficiently.
Overcame hardware limitations by ensuring low memory and CPU consumption.
Designed a modular and scalable architecture, making it easy to add future enhancements.

What We Learned

Building EyeSpy was a valuable learning experience, allowing us to gain expertise in:

Deep Learning & Computer Vision: Implementing SSD MobileNet v3 for object detection.
Multithreading & Optimization: Balancing multiple real-time processes effectively.
Speech Recognition with OpenAI Whisper: Understanding audio processing and text transcription.
Building Interactive UIs: Creating a responsive GUI using Tkinter.
Storage Management: Handling large video and audio files efficiently.
AI-Driven Security Applications: Learning about real-world use cases for automated surveillance.

These skills will be invaluable for future AI-driven security and automation projects.

What's Next for EyeSpy

While EyeSpy is already a powerful AI-driven surveillance tool, we envision several future enhancements to make it even more robust and versatile.

Planned Features & Improvements

Enhanced AI Models: Fine-tune detection accuracy with custom-trained deep learning models.
Cloud Storage & Remote Access: Enable secure video & audio storage on the cloud for remote monitoring.
Mobile App Integration: Develop a companion mobile app for real-time alerts & playback.
Customization Settings: Allow users to adjust detection thresholds, recording duration, and UI preferences.
Automated Alerts & Notifications: Send email & SMS alerts when unauthorized phone usage is detected.
Live Audio Translation: Convert spoken words into multiple languages for global applications.

With these enhancements, EyeSpy can become a comprehensive AI-powered surveillance solution for businesses, educational institutions, and security agencies.

Final Thoughts

Developing EyeSpy was both challenging and rewarding. By combining real-time AI-powered object detection and speech recognition, we have created a versatile tool that enhances security, compliance, and automation.

We are excited to continue improving EyeSpy and exploring its potential applications in the fields of AI surveillance, workplace security, and digital forensics.