About the Project

Inspiration

Car dealerships spend days manually photographing, writing copy, and building web pages for every vehicle on their lot — and the result is still a static gallery with a spec table. We asked: what if a dealer could select a car from inventory, press one button, and get a fully produced interactive showroom experience — cinematic imagery, synchronized voiceover, interactive hotspots, 3D WebGL tech views, and an AI chatbot that can answer questions from the car's manual and book test drives by voice — all generated in under two minutes?

That vision required not just one AI model, but an entire production crew of specialized agents working together. Amazon Nova's combination of fast reasoning (Nova 2 Lite), multimodal embeddings, video generation (Nova Reel), and real-time speech-to-speech (Nova Sonic) made it possible to build the full pipeline on a single cloud platform.

What It Does

AutoSight is a seven-agent AI pipeline that turns a car database entry into a polished, interactive digital showroom:

  1. Analyst Agent (Nova 2 Lite) — profiles the car and identifies a buyer persona and marketing strategy
  2. Director Agent (Nova 2 Lite) — plans a cinematic storyboard with layouts, camera angles, and 3D tech view placements, locking all scenes into a consistent visual world
  3. Scriptwriter Agent (Nova 2 Lite) — writes headlines, body copy, voiceover scripts, and suggests interactive hotspots for each scene, constrained by camera perspective (no interior parts on exterior shots)
  4. Visualizer Agent — generates photorealistic scene images, places interactive hotspots via OWLv2 zero-shot object detection with six accuracy-boosting techniques, and produces intro videos via Amazon Nova Reel
  5. Badge Collector — verifies real-world certifications (NHTSA, IIHS, EPA, awards) from three parallel sources: rule-based logic, government APIs, and AI-analyzed web search
  6. Audio Agent — produces neural text-to-speech voiceover (Deepgram Aura-2) with word-level subtitle alignment, using Nova Micro for intelligent subtitle grouping
  7. QA Agent — validates and enriches the final story JSON with real car specs for 3D rendering

Beyond generation, AutoSight provides:

  • Live Dealer Editor + Copilot — dealers edit any scene in real time; the Copilot (Nova 2 Lite with tool calling) can add scenes, retheme the entire experience, translate to another language, or regenerate images through natural language
  • Multimodal RAG Chatbot — a four-tier retrieval pipeline (Nova 2 Lite + Nova Multimodal Embeddings) that answers buyer questions from uploaded PDF brochures and car manuals, with inline test-drive booking via tool calling
  • Speech-to-Speech Voice Agent — Amazon Nova Sonic streams bidirectional audio over LiveKit WebRTC, enabling buyers to ask questions and book test drives entirely by voice
  • RAG Document Ingestion — a Python PDF shredder extracts text, tables, and images from dealer-uploaded documents; Amazon Nova Multimodal Embeddings generates 1024-dim vectors stored in Supabase pgvector

How We Built It

The backend is a Node.js/Express 5 server with LangChain orchestrating all Amazon Nova calls through ChatBedrockConverse. Each of the seven pipeline agents is a separate module with its own Zod schema for structured output validation. The frontend is React 19 with Zustand state management, Framer Motion animations, and React Three Fiber for the 3D WebGL tech views.

Amazon Nova models used:

Model Role
Nova 2 Lite All 7 pipeline agents, Copilot tool calling, RAG chatbot tiers 0-3
Nova Micro Subtitle grouping (fast, cheap, short-context task)
Nova Reel Image-conditioned intro video generation (6s clips)
Nova Sonic Real-time speech-to-speech voice agent with tool calling
Nova Multimodal Embeddings 1024-dim vectors for RAG document retrieval

The pipeline runs asynchronously — the API returns immediately with a story ID, and the frontend polls for progress while showing which agent is currently working. Each agent's output feeds into the next, creating a true multi-agent orchestration chain rather than independent model calls.

Key architectural decisions:

  • World-locking — the Director picks one setting and lighting family, and a deterministic post-processor enforces it across all scenes, preventing the "stock photo rotation" problem
  • Camera perspective enforcement — the Scriptwriter is explicitly told whether the camera is EXTERIOR or INTERIOR, and interior car parts are forbidden on exterior shots (framed as physical impossibility, which LLMs respect more reliably than soft preferences)
  • Vision-enriched prompting — before generating images, Nova Lite analyzes the dealer's actual inventory photo to extract physical details (rim style, grille shape, trim accents), which are injected into every image prompt
  • Six-technique hotspot detection — OWLv2 object detection with caption-wrapped CLIP prompts, contrastive distractor labels, per-part confidence thresholds, bounding box area constraints, aspect-ratio padding, and two-pass interior zoom
  • Tiered RAG — the chatbot checks if the question can be answered from structured car JSON before hitting the vector store, saving 60% of embedding calls

Challenges We Faced

  • Structured output consistency — Nova 2 Lite occasionally returns empty {} for unused scene-type fields in the Scriptwriter output. We solved this with Zod's .optional().catch(undefined) pattern, which gracefully absorbs empty objects without crashing the pipeline.
  • Nova Sonic tool calling — the inputSchema.json field must be JSON.stringify(schema), not a plain object. This undocumented requirement caused "Unable to parse input chunk" errors until we discovered the correct serialization.
  • LiveKit WebRTC + Nova Sonic pacing — Nova Sonic sends TTS audio chunks as fast as the network delivers them. Without a dedicated playback queue with await captureFrame() pacing, all frames dump into the AudioSource buffer instantly, causing audio speedup and glitching.
  • Image format normalization — Supabase Storage serves many dealer photos as AVIF, but Amazon Bedrock only accepts JPEG/PNG/GIF/WebP. We use sharp metadata inspection to detect the actual byte format and transparently convert unsupported formats before sending to Nova Lite vision.
  • Hotspot label normalization — LLM-generated labels like "Front Bumper" need underscore-delimited keys (front_bumper) for detection threshold lookups. The original regex didn't convert spaces, causing all lookups to silently fall through to defaults.

What We Learned

  • Amazon Nova's tool-calling works best with MANDATORY-style prompt phrasing (physical obligation vs. soft preference) — this dramatically increases tool-call reliability
  • Structured output with Zod schemas is the most reliable way to get consistent JSON from Nova 2 Lite across hundreds of pipeline runs
  • Nova Multimodal Embeddings produce high-quality 1024-dim vectors that work well for cross-modal retrieval (text queries matching table and image content from PDFs)
  • Nova Sonic's bidirectional streaming protocol requires careful lifecycle management — AbortController, double-stop guards, and async producer-consumer queues are essential for clean teardown

Built With

  • amazon-bedrock
  • amazon-nova-2-lite
  • amazon-nova-micro
  • amazon-nova-multimodal-embeddings
  • amazon-nova-reel
  • amazon-nova-sonic
  • deepgram
  • express.js
  • framer-motion
  • javascript
  • langchain
  • livekit
  • node.js
  • owlv2
  • pdfplumber
  • pgvector
  • pollinations-ai
  • postgresql
  • pymupdf
  • python
  • react
  • react-three-fiber
  • sharp
  • supabase
  • three.js
  • transformers.js
  • vite
  • zod
  • zustand
Share this project:

Updates