Jemmie

Jemmie, Talk to the future. Not a text box.
Backend Architecture

AI shouldn't feel like a chatbot, it should feel like a friend. Jemmie is a warm, brilliant AI companion just a phone call away. Speak naturally, share your day, and never type prompts again.

About Jemmie

💡 What Inspired Us

Current AI companions feel like exactly that—machines. They live inside text boxes, require clunky “push-to-talk” buttons, and enforce rigid, robotic turn-taking. We realized that typing limits human expression and emotion. We wanted to build something for the post-text era of AI: an always-on, deeply empathetic companion that feels exactly like taking a phone call from a close friend. The goal with Jemmie was to break the fourth wall and make interacting with AI natively fluid, allowing you to speak naturally, interrupt, and share your environment effortlessly.

🛠️ How We Built It

We built Jemmie using a true voice-first, native architecture to minimize latency and maximize immersion.

Frontend (iOS): We built a native Swift application leveraging SwiftUI for fluid animations. We deeply integrated Apple's CallKit so that talking to Jemmie looks and acts exactly like receiving a standard VoIP call from the lock screen.
Backend: We created a Python-based asynchronous gateway using FastAPI, bridged by bi-directional WebSockets.
The Brains: At the core of Jemmie is the Gemini Live API (StreamingMode.BIDI). We stream raw PCM audio buffers directly from the iOS microphone to Gemini in real-time, receiving voice and transcript streams back. We also integrated Multimodal Vision, allowing users to share their camera feed so Jemmie can see and intelligently react to their physical environment.

⚠️ Challenges We Faced

Building real-time bi-directional voice applications is incredibly difficult, and we ran into two major engineering hurdles:

The Acoustic Echo Loop: Initially, when Jemmie spoke out of the iPhone’s speaker, the microphone would pick up her own voice, causing the AI to accidentally interrupt itself. We solved this by diving deep into AVFoundation, enforcing AVAudioSession.Mode.voiceChat, and unlocking Apple's strict hardware-level Acoustic Echo Cancellation (AEC).
Trigger-Happy Interruptions: Humans pause to take breaths when they speak. Early on, Gemini’s Voice Activity Detection (VAD) was too eager and would cut the user off mid-sentence. We overcame this by modifying the RealtimeInputConfig on the backend, increasing the silence_duration_ms threshold to 1200ms and lowering sensitivity, allowing for completely natural, relaxed conversation flows.

📚 What We Learned

This project was a masterclass in real-time audio engineering and human-computer interaction. We learned how to manipulate raw audio buffers in Swift, how to properly interface with the iOS audio hardware layer, and how to maintain persistent async WebSocket bridges. More significantly, we learned how radically prompt engineering changes when you are designing for voice instead of text. By curating a warm, exceptionally empathetic persona with polyglot capabilities, we discovered how subtle changes in an LLM's system prompt drastically shape the tone, cadence, and comfort of a spoken conversation.

Built With

avfoundation
callkit
cloud-firestore
docker
fastapi
github
google-cloud-run
google-gemini-multimodal-live-api
google-genai-sdk
python
swift
swiftui
websockets

Updates

Supachod Trakansirorut started this project — Mar 16, 2026 05:33 AM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.