Joy AI: Project Summary & Technical Description
1. Summary of Project's Features and Functionality
Joy is a multimodal "Happiness Assistant" designed to act as a real-time, low-latency co-pilot for commuters over hands-free Bluetooth voice calls. Delivered with a warm female Australian accent and relatable Aussie slang, Joy operates as a fully grounded positive-psychology coach equipped with deep memory, tool execution, and cellular-resilient audio streaming. You can call or text Joy at +1-903-CALLJOY (1-903-225-5569). Landing page: https://www.besomeone.vip/joy
Core Features:
- Ultra-Low Latency Voice (Gemini 2.5 Flash Live Native Audio): Full-duplex 16kHz PCM audio streaming with native interruption handling and autonomous self-recovery (`_reconnect()`). Joy restores state silently via Session Resumption handles, ensuring a "Zero-Failure UX" during cellular handoffs.
- Natural Interruption Handling (Barge-in): Joy supports full-duplex conversational flow. Using Gemini Live's native Voice Activity Detection (VAD), Joy instantly detects when a user speaks over her, triggers an asynchronous "clear" event to the Telnyx telephony buffer to stop audio spillover, and pivots her response immediately to the user's new input.
- Mobile-First Connectivity Resilience: Commuters experience spotty cell service. A 90-second "Reaper" holds the Vertex session state open, allowing cryptographic resumption handles to invisibly reconnect the AI session where it left off when the socket reconnects or when the user calls back.
- Multimodal Intelligence & Ingestion: Joy "sees" the world via SMS. Photos sent via MMS trigger a two-step vision pipeline: identification (e.g., an article, book cover, or bookshelf scan) followed by automated research. Beyond images, Joy ingests news articles and YouTube URLs, using Flash-Lite to scrape content or sample video frames and transcripts into high-density summaries. These results are tagged `(SUMMARIZED)` and instantly injected into the active voice briefing. To save context tokens, the detailed content summary is stored in SMS history, where it can be retrieved as needed via tool calling.
- The "Bounded Map" Memory Loop: To prevent context drift and latency spikes, a standalone LTM Map Worker uses Gemini Flash-Lite to compress transcripts into a high-density, 500-word "Current State Map." This map is injected into the System Instructions, achieving 100% recall of emotional states and goals without bloating the context window (which is compressed roughly every 1,500 tokens, about one minute of audio).
- Tool Calling & Autonomy: Joy uses per-turn tool caps and strict "Verbatim Title" anti-hallucination rules to manage Google Search and SMS-history retrieval.
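The barge-in flow above can be sketched in a few lines; the event shapes here are illustrative assumptions, not the exact Gemini Live or Telnyx wire formats:

```python
import json
from collections import deque

class BargeInHandler:
    """Sketch: on a VAD interruption from Gemini Live, drop queued TTS
    audio locally and tell Telnyx to clear its playback buffer."""

    def __init__(self, stream_id):
        self.stream_id = stream_id
        self.outbound_audio = deque()  # unsent 16 kHz PCM chunks

    def on_gemini_event(self, event):
        # Gemini Live reports barge-in when the caller talks over Joy.
        if event.get("interrupted"):
            self.outbound_audio.clear()  # stop local audio spillover
            # Hypothetical Telnyx media-stream "clear" event
            return json.dumps({"event": "clear", "stream_id": self.stream_id})
        return None  # normal events: keep streaming

handler = BargeInHandler("stream-123")
handler.outbound_audio.append(b"\x00" * 640)
clear_msg = handler.on_gemini_event({"interrupted": True})
```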
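The 90-second Reaper reduces to a small registry keyed by caller; the handle format and method names below are illustrative, not the Vertex SDK's:

```python
class SessionRegistry:
    """Sketch of the Reaper pattern: when a cellular socket drops, park the
    session's resumption handle instead of closing it, so a reconnect (or a
    call-back) within the TTL resumes mid-conversation."""

    TTL_SECONDS = 90

    def __init__(self):
        self._parked = {}  # caller -> (resumption_handle, expiry_deadline)

    def park(self, caller, handle, now):
        self._parked[caller] = (handle, now + self.TTL_SECONDS)

    def resume(self, caller, now):
        entry = self._parked.pop(caller, None)
        if entry and now <= entry[1]:
            return entry[0]  # hand the token back to a fresh Live connection
        return None  # expired: the reaper lets the session die

reg = SessionRegistry()
reg.park("+15551234567", "handle-abc", now=100.0)
fast = reg.resume("+15551234567", now=150.0)  # reconnect within 90 s
reg.park("+15551234567", "handle-def", now=200.0)
late = reg.resume("+15551234567", now=350.0)  # past the 90 s window
```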
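At its core, the Bounded Map loop is "summarize, then hard-cap"; here the Gemini Flash-Lite call is stubbed out as a plain callable for illustration:

```python
def compress_to_map(transcript, summarize, max_words=500):
    """Sketch of the LTM Map Worker: compress the running transcript into a
    bounded 'Current State Map' for injection into the System Instructions.
    `summarize` stands in for the real Gemini Flash-Lite call."""
    state_map = summarize(transcript)
    words = state_map.split()
    return " ".join(words[:max_words])  # hard bound: never bloat the context

# Stub summarizer for illustration; the real one is an LLM call.
def stub(text):
    return "goal: daily walks. mood: improving. " * 200

current_map = compress_to_map("full call transcript here", stub)
```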
2. Technologies Used
- AI Intelligence Stack (The "Brain Hierarchy"):
- Level 1 (Gemini 2.5 Flash Native Audio Live API): High-speed audio core for the primary user interaction loop.
- Level 2 (Gemini 2.5 Flash): High-fidelity multimodal reasoning for vision extraction (MMS) and complex context analysis.
- Level 3 (Gemini 2.5 Flash-Lite): The cost-effective background "Ops Brain." Handles SMS thread summarization, keyword extraction, native YouTube video sampling, and the Bounded Map compression.
- Level 4 (Google Search): Real-time web access for grounding Joy's advice in live data.
- Application & Transport Layer: Python and FastAPI handling asynchronous HTTP Webhooks (SMS) and high-concurrency WebSockets (Voice).
- Telephony Provider: Telnyx (Translates SIP/Cellular voice into 16kHz WebRTC/WebSockets and bridges SMS/MMS to REST APIs).
- Persistence: Google Firestore (NoSQL database managing User Profiles, long-term memory briefings, SMS threads, and call telemetry).
- Infrastructure: Google Cloud Run (Containerized, distroless, scale-to-zero compute) orchestrated by automated CLI deployment scripts.
- Development & Orchestration: Built using Antigravity, a powerful agentic AI coding assistant. Antigravity managed the end-to-end development lifecycle: from architecting the multi-model "Brain Hierarchy" and implementing the Bounded Map memory loop to performing complex real-time debugging, refactors, GCS cloud deployments, and ensuring cellular-resilient transport protocols.
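The transport layer above moves raw 16 kHz, 16-bit mono PCM over WebSockets; a small framing helper shows the arithmetic (the 20 ms frame duration is our assumption, not a documented Telnyx constant):

```python
SAMPLE_RATE = 16_000   # Hz (Gemini Live native-audio input format)
SAMPLE_WIDTH = 2       # bytes per sample (16-bit linear PCM)
FRAME_MS = 20          # assumed media frame duration

def frame_bytes(ms=FRAME_MS):
    """Bytes per mono PCM frame: 16,000 samples/s * 2 bytes * 0.020 s = 640."""
    return SAMPLE_RATE * SAMPLE_WIDTH * ms // 1000

def split_frames(pcm, ms=FRAME_MS):
    """Chunk a PCM buffer into fixed-size frames for the WebSocket."""
    n = frame_bytes(ms)
    return [pcm[i:i + n] for i in range(0, len(pcm), n)]

one_second = bytes(SAMPLE_RATE * SAMPLE_WIDTH)  # 32,000 bytes of silence
frames = split_frames(one_second)
```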
3. Information About Any Other Data Sources Used
- Internal Happiness Knowledge Base: Joy is grounded using custom, static Markdown documents outlining established positive psychology frameworks (PERMA, Atomic Habits, etc.).
- Google Search Grounding: Joy dynamically queries the live internet to resolve URLs deeply, answer factual queries (like current weather or event details), and pull summaries of specific psychological books or authors mentioned by the user.
- Dynamic User Briefings: The Firestore database maintains a living "Context Cache" for every user. This cache is a data source that is re-baked after every call, feeding Joy the user's name, past struggles, recent text messages, and follow-ups.
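Re-baking the Context Cache amounts to flattening the Firestore user document into a compact briefing string; the field names below are illustrative, not the actual schema:

```python
def bake_briefing(profile):
    """Sketch of re-baking the per-user Context Cache after a call.
    `profile` stands in for a Firestore document; fields are illustrative."""
    lines = ["User: " + profile.get("name", "Unknown")]
    if profile.get("struggles"):
        lines.append("Known struggles: " + "; ".join(profile["struggles"]))
    if profile.get("recent_sms"):
        lines.append("Recent SMS topics: " + "; ".join(profile["recent_sms"]))
    if profile.get("follow_ups"):
        lines.append("Open follow-ups: " + "; ".join(profile["follow_ups"]))
    return "\n".join(lines)

briefing = bake_briefing({
    "name": "Sam",
    "struggles": ["work stress"],
    "follow_ups": ["ask about the morning-walk habit"],
})
```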
4. Findings and Learnings
Building a scalable, production-grade voice AI uncovered significant engineering lessons:
- Operational Governance (The <$0.04/min Engine): We engineered a "Sliding Window" geometry for context-compression pruning (Trigger: Baseline + 3,000 tokens | Target: Baseline + 1,500 tokens). Combined with model tiering, this keeps total operational cost (Telnyx + Google Cloud + Gemini) consistently below $0.04 per minute, roughly one third the cost of unoptimized agents.
- Latency vs. Context (The "Bounded Map" Solution): To achieve cellular-speed response times, we architected a pattern where Flash-Lite continuously compresses the past into a 500-word, high-density state update, allowing the primary model to focus on fresh interaction without context bloat.
- Admin Observability: We built custom utilities like `call_stats.py` to provide per-minute cost transparency, auditing every model turn and token type to manage the "Cost of Intelligence."
- Taming the "Eager" Tool Caller: We learned to implement strict programmatic constraints, capping tool retries at 2 per turn and using system-prompt directives to prioritize conversational flow over exhaustive web searches or SMS-history retrieval.
- Cellular Reality is Messy: Real-world testing revealed that 4G/5G connections drop packets constantly. Building the active session registry and transparent socket resumption was a major breakthrough; it proved that the complexity of production AI lies as much in the transport layer as in the prompt engineering. At first we thought we needed to tweak the Gemini Live VAD (Voice Activity Detection) settings to account for car background noise, but the VAD was working perfectly; the network connectivity was the problem. Telnyx also helps by applying noise suppression to the incoming audio signal.
- The 10DLC Compliance Hurdle: Getting an AI agent approved for carrier messaging (10DLC) is a non-trivial compliance task. We learned to implement strict programmatic guardrails, such as carrier-mandated keyword handling (STOP, HELP, UNSUBSCRIBE) and proactive opt-out management, including a required landing page with full legal terms (Privacy Policy and Terms of Service) at https://www.besomeone.vip/joy#legal.
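The Sliding Window geometry reduces to two numbers; a minimal sketch of the trigger/target check (function names are illustrative):

```python
TRIGGER_MARGIN = 3_000  # fire compression past Baseline + 3,000 tokens
TARGET_MARGIN = 1_500   # prune back to Baseline + 1,500 tokens

def should_compress(total_tokens, baseline):
    """True once the live context grows past the trigger line."""
    return total_tokens > baseline + TRIGGER_MARGIN

def compression_target(baseline):
    """Token count to prune back to, leaving 1,500 tokens of headroom
    before the next trigger fires."""
    return baseline + TARGET_MARGIN
```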
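Taming the eager tool caller is essentially a per-turn counter; class and method names here are illustrative:

```python
class ToolCallBudget:
    """Sketch of the per-turn tool cap: after 2 tool calls, force the
    model back to conversation instead of another retry."""

    MAX_CALLS_PER_TURN = 2

    def __init__(self):
        self.calls_this_turn = 0

    def allow(self):
        if self.calls_this_turn >= self.MAX_CALLS_PER_TURN:
            return False  # budget spent: answer with what we have
        self.calls_this_turn += 1
        return True

    def new_turn(self):
        self.calls_this_turn = 0

budget = ToolCallBudget()
results = [budget.allow() for _ in range(3)]  # third call is refused
```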
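The keyword guardrail is deterministic routing before the LLM ever sees the message; the keywords beyond STOP/HELP/UNSUBSCRIBE are an assumption based on common carrier rules:

```python
CARRIER_KEYWORDS = {"STOP", "HELP", "UNSUBSCRIBE", "CANCEL", "QUIT"}

def route_inbound_sms(body):
    """Sketch of the 10DLC guardrail: carrier-mandated keywords get a
    canned compliance flow (opt-out update, help reply) and are never
    forwarded to the model."""
    if body.strip().upper() in CARRIER_KEYWORDS:
        return "carrier_flow"  # deterministic compliance handling
    return "llm"               # normal Joy conversation

lane_stop = route_inbound_sms("  stop ")
lane_chat = route_inbound_sms("G'day Joy!")
```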
Built With
- antigravity
- asyncio
- cloud-logging
- docker
- fastapi
- gemini-2.5-flash
- gemini-2.5-flash-lite
- gemini-2.5-flash-live
- github
- google-cloud
- google-cloud-run
- google-firestore
- google-search-grounding
- pydantic
- python
- telnyx
- vertex-ai-sdk
- websocket-resumption
- websockets

