Inspiration

By moving beyond traditional GPS, we aim to transform the vehicle from a passive transport tool into an intelligent partner that enhances situational awareness and reduces cognitive load. Synthesizing real-time computer vision with deep semantic understanding, CLEO allows users to experience their surroundings—not just drive through them. The objective is to provide a curated, interactive narrative of the physical world, making every commute a discovery.

What it does

CLEO is a sophisticated Edge-to-Cloud Perceptual Engine that synchronizes computer vision with geospatial intelligence to deliver three core value propositions:

  1. Dynamic Landmark & Infrastructure Recognition: Performs real-time semantic segmentation of the visual field to identify historical markers, architectural styles, and complex road signage that traditional OCR misinterprets.

  2. Live Web-Grounded Spatial Intelligence: Bridges the gap between the physical world and the digital "pulse." It notifies users of ephemeral events (pop-up markets, festivals) and verifies operational statuses (wait times, menu highlights) within a 1-mile radius using live web data.

  3. Predictive Environmental Modeling: Analyzes vehicle heading and velocity to pre-fetch data for landmarks appearing within a mile, ensuring zero-latency information delivery as objects enter the user's field of view.

  4. Contextual Dialogue & Narrative Synthesis: Enables a multi-turn, "hands-free" conversation. Users can ask anaphoric questions (e.g., "Tell me more about that building") thanks to a rolling interaction memory buffer.

How we built it

CLEO utilizes a decoupled, high-performance architecture designed for low-latency mobile environments.

Tier 1: The Edge Layer (YOLOv26 / Python) High-frequency visual telemetry is processed locally using a custom-trained YOLOv26 model (exported as best.pt via PyTorch). This layer handles deterministic object detection and Object ID Persistence, ensuring a landmark is tracked across multiple frames without redundant cloud triggers.

Tier 2: The Logic Layer (Amazon Nova Lite VLM) High-entropy frames (Entry Events) are dispatched via the Amazon Bedrock Converse API. The Nova Lite Vision-Language Model (VLM) interprets complex visual nuances and architectural styles, providing high-reasoning semantic metadata.

Tier 3: The Interaction Layer (FastAPI & React) Backend: A FastAPI kernel manages asynchronous WebSocket streams for video and synchronous REST endpoints for location research. Frontend: A sleek, terminal-inspired React (TSX) HUD built with Vite and TailwindCSS, utilizing the Web Speech API for a seamless voice-to-voice loop.

Challenges we ran into

  1. Stateful Deduplication (Track-ID Gating): To optimize API egress and compute costs, cloud inference is only triggered upon the unique "Entry Event" of a tracking ID, preventing the re-analysis of the same object in a 30fps stream.

  2. Precision Stochastic Control: To filter environmental noise, we utilized strict system-level constraints and a Temperature setting of 0.1, ensuring deterministic, safety-optimized outputs.

  3. Asynchronous I/O Management: Managed the "Inference Gap" between local CV and cloud reasoning using asyncio and WebSocket protocols to ensure audio insights synchronize with the vehicle’s physical progression.

Accomplishments that we're proud of

  1. Seamless Multimodal Pipeline: Bridged local high-speed CV with cloud-based VLM reasoning in a unified, sub-second latency pipeline.

  2. Autonomous Tour Guide Synthesis: Developed a "Luxury AI Guide" persona that translates raw coordinates and visual tokens into vivid, concise narratives.

  3. Environmental Robustness: Achieved high-accuracy interpretation across varying weather conditions and complex urban geometries.

What we learned

Developing CLEO highlighted that the true value of Automotive AI lies in Semantic Relevance. We discovered that by offloading the "Search and Identification" task to a background agent, the driver is free to enjoy a more enriched, attentive experience. The harmony between high-speed edge detection and high-reasoning cloud logic is the blueprint for the next generation of ambient vehicle technology.

What's next for C.L.E.O.

  1. Nova ACT (Agentic Framework): Implementing a proactive memory layer that remembers user preferences (e.g., "Book my parking space") to build a personalized world model.

  2. Enterprise Fleet Integration: Scaling the stateless backend to support logistics and fleet management, providing drivers with "Intelligent Spotters" for site-specific instructions.

  3. Vehicle-Specific Logic: Expanding the perceptual stack to include heavy-vehicle alerts, such as Wide-Turn Analytics for trucks and height-clearance verification for industrial transport.

Built With

Share this project:

Updates