Inspiration

We live in a world full of dumb cameras. From the CCTV unit collecting dust in a warehouse to the webcam on an old laptop, billions of sensors record pixels but understand nothing. Security guards can't watch 100 screens at once, and small businesses can't afford expensive, proprietary AI hardware.

More importantly, we realized that many cameras already capture enough data to automate processes such as inventory management and project management. Gemini's accurate image processing supplied the other half of the equation.

We asked ourselves: What if we could turn any camera into an intelligent agent? What if a simple webcam could audit inventory, ensure safety compliance, or track construction progress, and report the results to an institution's existing resources (chat rooms, ERP, documentation, etc.), all by just looking and thinking like a human?

With the release of Gemini 3, we saw an opportunity to build a "Visual Operating System": a platform that decouples the "Eyes" (cheap cameras) from the "Brain" (multimodal AI), allowing anyone to spin up a custom visual agent in seconds.

What it does

Camai is a hardware-agnostic Visual Intelligence Platform. It connects to any image source (RTSP streams, USB webcams, or API uploads) and uses Gemini 3 to analyze the visual data against user-defined goals.

It operates in three distinct intelligence modes, each aligned with a different kind of user goal:

  1. The Quantifier (ideal for Business): Audits physical assets. It can count inventory on a shelf, estimate the fullness of a buffet tray, or track parking lot occupancy. It compares the current state against an ideal reference image to detect discrepancies (e.g., "Low Stock", "Misplaced Item"). Users can define the alert threshold and the update interval for each camera.

  2. The Detector (ideal for Safety & Security): Enforces compliance. Users define natural language rules (e.g., "Warn me if a worker is missing a helmet" or "Check for potholes on road"). The system flags violations instantly.

  3. The Process Monitor (ideal for Industrial): Tracks change over time. It monitors slow processes like biomass drying, fermentation, or construction, estimating percentage completion and detecting anomalies (e.g., "Mold detected").

Key features we implemented:

Universal "Bridge" Architecture: A downloadable Python script allows users to connect local, behind-firewall cameras to our cloud securely without complex networking.

Natural Language Logic: Users don't write code; they type rules like "WhatsApp me if Red Bull stock is < 8 cans" and Gemini understands.

Smart Scheduler: Polling intervals can be set from 30 seconds (Safety) to 24 hours (Inventory), optimizing costs.

Event-Driven Integration: The system isn't just passive. It exposes Webhooks that allow external systems (like a Billing POS) to trigger a visual scan, closing the loop between the digital and physical worlds.

How we built it

We adopted a Hybrid Cloud-Edge Architecture to balance performance and feasibility:

Cloud Backend: Built with Python (Flask) and hosted on Render. It orchestrates the logic, managing the Gemini 3 API calls for visual reasoning. We implemented a "Smart Scheduler" that handles thousands of monitors efficiently by checking intervals locally before making API calls.
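
The "check intervals locally before calling the API" idea can be sketched roughly as follows. This is a minimal illustration, not the actual Camai scheduler: the `Monitor` record, its field names, and the `analyze` callback (standing in for the Gemini call) are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class Monitor:
    """Hypothetical monitor record; field names are illustrative only."""
    name: str
    interval_s: float                  # polling interval chosen by the user
    last_run: float = float("-inf")    # time of the last analysis, if any


def due_monitors(monitors, now=None):
    """Return the monitors whose interval has elapsed. No API call is made here."""
    now = time.monotonic() if now is None else now
    return [m for m in monitors if now - m.last_run >= m.interval_s]


def run_cycle(monitors, analyze, now=None):
    """Call `analyze` only for due monitors and record when each one ran."""
    now = time.monotonic() if now is None else now
    ran = []
    for m in due_monitors(monitors, now):
        analyze(m)          # stand-in for the expensive visual-reasoning call
        m.last_run = now
        ran.append(m.name)
    return ran
```

Because the due check is a cheap in-memory comparison, a cycle over many monitors costs almost nothing unless a monitor's interval has actually elapsed.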

Frontend: A modern, responsive dashboard built with React (Vite) and Tailwind CSS, deployed on Vercel. It provides real-time logs, analytics charts, and a configuration wizard.

Edge Connectivity: We solved the "CORS/Firewall" issue by generating a custom "Bridge Script" for each monitor. This script uses OpenCV to capture frames locally on the user's device and pushes them securely to our cloud via a REST API. Users can download the script for each camera and run it on the device that hosts the camera.
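
As a rough sketch of the "push" half of such a bridge, the function below packages one JPEG frame into an HTTP POST using only the standard library. The route `/api/monitors/<id>/frames`, the bearer-token header, and the function name are hypothetical and chosen for illustration; the real Bridge Script's endpoint and authentication are not described here. The frame bytes themselves would typically come from OpenCV (for example, `cv2.VideoCapture.read()` followed by `cv2.imencode(".jpg", frame)`).

```python
import urllib.request


def build_frame_upload(base_url, monitor_id, jpeg_bytes, token):
    """Build (but do not send) a POST request carrying one JPEG frame."""
    # Hypothetical route; the actual API path is an assumption.
    url = f"{base_url.rstrip('/')}/api/monitors/{monitor_id}/frames"
    return urllib.request.Request(
        url,
        data=jpeg_bytes,
        method="POST",
        headers={
            "Content-Type": "image/jpeg",
            "Authorization": f"Bearer {token}",  # hypothetical auth scheme
        },
    )
```

Sending is then a single call such as `urllib.request.urlopen(request, timeout=10)`. Since the connection is outbound from the user's network, no inbound firewall rule or port forwarding is needed.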

Optimization: To prevent "API burn," we implemented a Motion Gate algorithm (using NumPy) that calculates the pixel difference between frames. If a scene hasn't changed, the system skips the expensive AI call, saving costs.
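
A pixel-difference gate of this kind might look like the sketch below. The function name and both thresholds (`pixel_delta`, `changed_fraction`) are illustrative assumptions, not the values Camai uses.

```python
import numpy as np


def frame_changed(prev, curr, pixel_delta=25, changed_fraction=0.01):
    """Return True when enough pixels differ between two same-shaped frames."""
    # Cast to a signed type first so uint8 subtraction cannot wrap around.
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    # A pixel counts as changed only if it moved by more than pixel_delta.
    changed = diff > pixel_delta
    # Gate opens when the share of changed pixels reaches the threshold.
    return bool(changed.mean() >= changed_fraction)
```

The caller would run the AI analysis only when `frame_changed(previous_frame, current_frame)` returns True. The per-pixel threshold absorbs sensor noise, while the fraction threshold ignores tiny localized flicker.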

Challenges we ran into

Early on, our local scripts would crash but leave the camera resource "locked" (light staying on). We had to implement strict context managers and error handling to ensure resources were released even after failures.
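
The release-on-failure pattern can be sketched with a small context manager. Here `factory` is a hypothetical parameter that creates the camera object; in a bridge script it could be something like `lambda: cv2.VideoCapture(0)`, whose `release()` method frees the device. This is an illustration of the pattern rather than our exact code.

```python
from contextlib import contextmanager


@contextmanager
def open_camera(factory):
    """Yield a camera created by `factory` and always release it afterwards."""
    camera = factory()
    try:
        yield camera
    finally:
        # Runs on normal exit and on exceptions, so the device is not left locked.
        camera.release()
```

Any capture loop wrapped in `with open_camera(...) as cam:` releases the device even if an exception is raised inside the loop.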

Deploying to Render broke our capture logic because cloud servers don't have webcams! We had to refactor the entire scheduler to become "Headless," relying on the Bridge Script for capture while the cloud handled pure logic.

Initially, the model would hallucinate counts in complex scenes. We improved accuracy by implementing a "Sectioning" strategy: prompting Gemini to break the image into named zones (e.g., "Top Shelf", "Bottom Bin") and count each zone before producing a total.
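
One way to express the sectioning idea in code is shown below: build a prompt that asks for one count per zone, then sum the per-zone counts from the reply. The prompt wording, the JSON reply format, and both function names are assumptions for illustration; they are not the prompts we shipped.

```python
import json


def build_sectioned_prompt(item, zones):
    """Compose a prompt that asks for one count per named zone."""
    zone_lines = "\n".join(f"- {z}" for z in zones)
    return (
        f"Count the {item} in the image one zone at a time.\n"
        f"Zones:\n{zone_lines}\n"
        "Reply with JSON mapping each zone name to an integer count."
    )


def total_from_zone_counts(reply_json, zones):
    """Sum per-zone counts, failing loudly if any zone is missing from the reply."""
    counts = json.loads(reply_json)
    missing = [z for z in zones if z not in counts]
    if missing:
        raise ValueError(f"missing zones: {missing}")
    return sum(int(counts[z]) for z in zones)
```

Asking for small per-zone counts and adding them up in code keeps the arithmetic out of the model's hands and makes an incomplete answer easy to detect.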

Accomplishments that we're proud of

The "One-Click" bridge was a big step. We are incredibly proud of the dynamic script generator. Seeing a user click "Download," run a script, and watch their local webcam feed appear on a cloud dashboard 500 miles away was a "magic moment."

By combining the Motion Gate with variable polling intervals, we reduced the theoretical operating cost by over 90% compared to traditional "always-on" video analytics.

True multimodality, with modes designed around real use cases. We aren't just doing object detection; we are doing visual reasoning, such as comparing an "Ideal State" image to a "Current State" image to derive context, which is only possible with models like Gemini 3.

What we learned

AI needs a reference state for better results. Visual analysis is useless in a vacuum. The model became far more useful when we fed it the "Ideal State" reference image alongside the current frame, allowing it to perform differential analysis.

An edge layer is essential for a scalable solution and for a sustainable business model. Pure cloud computer vision is expensive and latency-heavy. A thin edge layer (our Bridge Script) is critical for filtering data before it hits the LLM.

We used Gemini 3 to help generate the React UI components, allowing us to focus on the complex backend logic and architecture.

What's next for Camai

The immediate step is SaaS tiering: Firebase Auth, billing, and business logic. We plan to implement multi-tenancy and Stripe integration to charge businesses based on "Visual Checks" per month.

Actionable alerts are also a priority: integrating directly with Twilio for SMS and Slack for enterprise alerts, moving beyond simple logging.

Edge AI Integration: Moving the "Motion Gate" logic to a small local model (for example, one running on TensorFlow Lite) inside the Bridge Script to further reduce bandwidth.

