Project Story: AI Phone — Your Autonomous Communication Agent

The Inspiration

In a world of constant digital noise, our phones have become sources of disruption. Between relentless marketing calls, promotional spam, and the anxiety of "unknown numbers," we are losing our most valuable asset: focus.

As a developer and an introvert, I realized that many of us face major hurdles:

  • The Noise: Endless random calls that break our deep work.
  • The Social Drain: The mental energy required for "small-talk" errands when we just want to stay focused on what matters.
  • The Broadcast Burden: The difficulty of sharing information with many people simultaneously without losing hours to manual calling.
  • The Memory Gap: Trying to recall exactly what was discussed in a call weeks or months ago.

I was inspired to build AI Phone — a "Communication Shield" that doesn't just transcribe, but acts as your professional double. It makes and receives calls on your behalf, remembers every detail, and resolves errands while you live your life.


What It Does

AI Phone is a full-stack mobile application that delegates phone calls to an intelligent AI agent. At its core, it offers:

Real-Time AI Voice Calls

  • Initiate outbound calls through natural language missions ("Call my dentist and reschedule my appointment")
  • AI speaks naturally using Gemini's Multimodal Live API with voice synthesis
  • Bidirectional audio streaming with real-time transcription
  • Automatic call completion with AI-generated summaries

Customizable AI Agents

  • Create multiple AI personas with distinct personalities
  • Choose from 10 unique voice profiles: Aoede (casual), Charon (professional), Kore (calm), Fenrir (energetic), Leda (youthful), Orus (authoritative), Puck (playful), Zephyr (breeze), Vale (warm), Sage (British accent)
  • Configure tone, behavior guidelines, and caller information
  • Set language restrictions (single or multi-language support)
  • Designate a primary agent for quick calls

Context-Aware Memory

  • Every call is transcribed and stored with a structured summary
  • AI can access previous conversations with the same contact
  • Ask your AI: "What did we discuss last time?" — it knows

Knowledge Base Chat

  • Chat interface to query your entire call history
  • Gemini-powered RAG (Retrieval-Augmented Generation) for intelligent answers
  • Referenced calls displayed alongside responses

Unified Call Log

  • Seamlessly merges AI calls with your device's native call history
  • Contact integration with fast Trie-based search
  • Filter by call type (AI vs. Device)

How I Built It

The project is built on an industrial-grade, full-stack architecture designed for real-time interaction:

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        Flutter Mobile App                       │
│   (Riverpod State Management • 12 Screens • Contact Integration)│
└─────────────────────────┬───────────────────────────────────────┘
                          │ REST + WebSocket
┌─────────────────────────▼───────────────────────────────────────┐
│                     Serverpod Backend                           │
│  (Dart-first ORM • Real-time Streams • FutureCalls Scheduling)  │
└───────────┬─────────────────────────────────────┬───────────────┘
            │                                     │
┌───────────▼───────────┐           ┌─────────────▼───────────────┐
│     Twilio Voice      │ μ-law↔PCM │    Gemini Multimodal Live   │
│  (Telephony + Media   │◀─────────▶│    (Speech-to-Speech AI)    │
│   Stream WebSocket)   │Transcoding│                             │
└───────────────────────┘           └─────────────────────────────┘

Backend (The Engine)

I used Serverpod as the backbone. Its Dart-first ORM and high-performance capabilities allowed me to build a seamless bridge between the database and the telephony logic. Key services include:

  • MediaStreamHandler: Bidirectional WebSocket bridge between Twilio and Gemini
  • AudioTranscoder: Real-time μ-law ↔ PCM conversion with upsampling/downsampling
  • GeminiLiveService: Manages WebSocket connections to Gemini's real-time API
  • CallSchedulerService: Handles scheduled calls via Serverpod's FutureCalls
  • CallEventService: WebSocket broadcasting for live UI updates
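The heart of MediaStreamHandler is two concurrent pump loops, one per direction. The backend itself is Dart/Serverpod; the sketch below uses Python asyncio queues as stand-ins for the Twilio and Gemini WebSockets, and the `pump`/`bridge` names and transform callbacks are illustrative, not the project's actual API:

```python
import asyncio

async def pump(src, transform, dst):
    # One direction of the bridge: read a frame, transcode it, forward it.
    while True:
        frame = await src.get()
        if frame is None:              # upstream closed
            await dst.put(None)
            return
        await dst.put(transform(frame))

async def bridge(twilio_in, to_gemini, gemini_in, to_twilio, upsample, downsample):
    # Both directions run concurrently, MediaStreamHandler-style:
    #   Twilio → Gemini: μ-law 8 kHz → PCM 16 kHz
    #   Gemini → Twilio: PCM 24 kHz → μ-law 8 kHz
    await asyncio.gather(
        pump(twilio_in, upsample, to_gemini),
        pump(gemini_in, downsample, to_twilio),
    )
```

Running both pumps under one `gather` means a stall in either direction surfaces in one place, which simplifies teardown when a call ends abruptly.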

The Intelligence

Gemini 2.5 Flash serves as the reasoning core via the Multimodal Live API. Using a RAG (Retrieval-Augmented Generation) system, I gave the AI a long-term memory by storing call histories in a PostgreSQL database. The AI has access to tool functions:

  • get_call_history() — Retrieve previous conversations with the same contact
  • end_call(reason, summary) — Autonomously terminate calls when mission complete
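As a sketch (in Python rather than the project's Dart, with the declaration schema assumed in the standard function-calling style rather than copied from the codebase), the two tools and a dispatcher might look like:

```python
# Assumed shape of the function declarations handed to the Gemini session
# at setup time; names match the tools above, schema details abbreviated.
TOOLS = [
    {
        "name": "get_call_history",
        "description": "Retrieve previous conversations with this contact.",
        "parameters": {"type": "object", "properties": {}},
    },
    {
        "name": "end_call",
        "description": "Terminate the call once the mission is complete.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string"},
                "summary": {"type": "string"},
            },
            "required": ["reason", "summary"],
        },
    },
]

def dispatch(name, args, fetch_history, close_call):
    """Route a model-issued tool call to the backend handlers."""
    if name == "get_call_history":
        return fetch_history()
    if name == "end_call":
        close_call(args["reason"], args["summary"])
        return {"status": "ended"}
    raise ValueError(f"unknown tool: {name}")
```

Keeping `end_call` a tool (rather than hanging up on silence) is what lets the AI terminate the call itself and hand back a structured reason and summary.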

Telephony

Twilio Programmable Voice handles the global telephony infrastructure:

  • REST API for call initiation with TwiML webhooks
  • Media Streams for bidirectional audio via WebSocket
  • Automatic call recording with MP3 storage
  • Status callbacks for real-time call state tracking
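When Twilio hits the webhook, the backend answers with TwiML whose `<Connect><Stream>` verb points Twilio's Media Streams at the server's WebSocket endpoint. A minimal sketch (Python for brevity; the URL is a placeholder):

```python
import xml.etree.ElementTree as ET

def media_stream_twiml(ws_url: str) -> str:
    # <Connect><Stream url=...> tells Twilio to open a bidirectional
    # Media Streams WebSocket to our server for the live call audio.
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", url=ws_url)
    return ET.tostring(response, encoding="unicode")
```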

Frontend

A clean Flutter interface with Riverpod state management across 12 screens:

  • DialerScreen: Contact integration with Trie-based search
  • CallAgentsScreen: Create and manage AI personas
  • ActiveCallMonitorScreen: Live transcript and status updates
  • CallHistoryScreen: Unified AI + device call log
  • ChatScreen: Knowledge base queries with referenced calls
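The Trie behind DialerScreen's search indexes every name prefix, so each keystroke is answered in time proportional to the query length rather than the contact count. A minimal sketch (Python for brevity; the app's Dart implementation will differ in detail):

```python
class TrieNode:
    __slots__ = ("children", "contacts")
    def __init__(self):
        self.children = {}   # next character → TrieNode
        self.contacts = []   # every contact passing through this prefix

class ContactTrie:
    """Prefix index over contact names for fast dialer search."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, name: str):
        node = self.root
        for ch in name.lower():
            node = node.children.setdefault(ch, TrieNode())
            node.contacts.append(name)

    def search(self, prefix: str):
        # Walk the prefix; the node reached already holds all matches.
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        return node.contacts
```

Storing the match list at every node trades memory for lookup speed, which is the right trade for an interactive dialer.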

Challenges I Faced: Mastering the Conversation

Challenge 1: The Audio Mismatch

The most significant hurdle was the Audio Format Gap. Telephony standards operate at 8,000 Hz μ-law (narrowband), while Gemini's Live API expects 16,000 Hz PCM input and produces 24,000 Hz PCM output.

Solution — Custom Audio Transcoder:

Twilio → Server:  μ-law 8kHz → decode → upsample 2x → PCM 16kHz → Gemini
Gemini → Twilio:  PCM 24kHz → downsample 3:1 → encode → μ-law 8kHz → Twilio

I built a real-time audio transcoding pipeline with:

  • Pre-computed μ-law decode/encode tables (256 entries) for O(1) conversion
  • Linear interpolation for upsampling (8kHz → 16kHz)
  • 3:1 averaging for downsampling (24kHz → 8kHz)
  • 20ms audio chunks (160 bytes of raw μ-law per frame) processed in real-time
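Concretely, the decode table and the two resampling steps look roughly like this (sketched in Python rather than the backend's Dart; the μ-law encode direction is the symmetric inverse and is omitted here):

```python
def _ulaw_decode_byte(b: int) -> int:
    # Standard G.711 μ-law expansion: 8-bit companded byte → 16-bit PCM.
    b = ~b & 0xFF
    sign = b & 0x80
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

# Pre-computed 256-entry table: every possible μ-law byte, decoded once,
# so per-sample conversion is a single O(1) list lookup.
ULAW_DECODE = [_ulaw_decode_byte(i) for i in range(256)]

def upsample_2x(pcm):
    # 8 kHz → 16 kHz by linear interpolation: insert the midpoint
    # between each pair of neighbouring samples.
    out = []
    for i, s in enumerate(pcm):
        nxt = pcm[i + 1] if i + 1 < len(pcm) else s
        out += [s, (s + nxt) // 2]
    return out

def downsample_3to1(pcm):
    # 24 kHz → 8 kHz by averaging each group of three samples.
    return [sum(pcm[i:i + 3]) // 3 for i in range(0, len(pcm) - 2, 3)]
```

Averaging on the way down doubles as a crude low-pass filter, which is good enough for narrowband telephone audio.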

Challenge 2: The Latency Paradox

For conversations to feel natural, the AI's first response must arrive in under 3 seconds. Initial tests showed 5+ second delays.

Solution — Gemini Pre-initialization: Instead of waiting for Twilio to connect before initializing Gemini, I start the Gemini WebSocket connection during call setup. By the time the recipient answers, the AI is ready to speak.

Metric            Before    After
First Response    ~5s       ~2s
Audio Roundtrip   ~800ms    ~300ms
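The overlap is easy to express with concurrent tasks. A Python asyncio sketch with stand-in delays (both coroutines are placeholders, not the project's code):

```python
import asyncio

async def connect_gemini():
    # Stand-in for the Gemini WebSocket handshake + session setup.
    await asyncio.sleep(0.05)
    return "gemini-session"

async def dial_twilio():
    # Stand-in for the Twilio REST call and the recipient answering.
    await asyncio.sleep(0.10)
    return "call-sid"

async def initiate_call():
    # Kick off the Gemini handshake immediately instead of waiting for
    # the answer; the two setups overlap, so the AI is ready to speak
    # by the time the recipient picks up.
    gemini_task = asyncio.create_task(connect_gemini())
    call_sid = await dial_twilio()
    session = await gemini_task  # normally already resolved by now
    return call_sid, session
```

Because dialing dominates, the effective setup time collapses to max(dial, handshake) instead of their sum.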

Challenge 3: Transcript Persistence

Real-time transcripts were being lost when calls ended abruptly. The solution involved:

  • Debounced database updates (500ms) to avoid excessive writes
  • Saving transcript before broadcasting status changes
  • FutureCall-based post-call analysis with 5-second delay
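A debounced writer of this shape is shown below in Python, with threading.Timer standing in for the backend's scheduling (class and method names are illustrative):

```python
import threading

class DebouncedSaver:
    """Coalesce rapid transcript updates into one DB write per quiet period."""
    def __init__(self, save_fn, delay=0.5):
        self.save_fn = save_fn
        self.delay = delay
        self._timer = None
        self._pending = None
        self._lock = threading.Lock()

    def update(self, transcript):
        # Each update resets the timer; only the last version within the
        # quiet window actually reaches the database.
        with self._lock:
            self._pending = transcript
            if self._timer:
                self._timer.cancel()
            self._timer = threading.Timer(self.delay, self.flush)
            self._timer.start()

    def flush(self):
        # Called on timer expiry AND on call teardown, so an abrupt
        # hangup never loses the tail of the transcript.
        with self._lock:
            if self._timer:
                self._timer.cancel()
                self._timer = None
            if self._pending is not None:
                self.save_fn(self._pending)
                self._pending = None
```

The explicit `flush()` on teardown is the fix for the abrupt-end data loss: the save happens before any status broadcast.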

Challenge 4: Memory Retrieval at Scale

The AI needs instant access to call history for context-aware conversations:

T(n) = O(log n)

By optimizing PostgreSQL indexes on (userId, phoneNumber, completedAt), the AI retrieves relevant history from thousands of calls in milliseconds.
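The composite B-tree index behaves like a binary search over rows kept sorted by (userId, phoneNumber, completedAt). An illustrative Python stand-in using `bisect` (the rows and timestamps here are invented for the example):

```python
import bisect

def history_for(rows, user_id, phone):
    # rows: list of (user_id, phone, completed_at) tuples, sorted
    # ascending — the same ordering the composite index maintains.
    # Two O(log n) binary searches bracket this contact's history.
    lo = bisect.bisect_left(rows, (user_id, phone, 0))
    hi = bisect.bisect_right(rows, (user_id, phone, float("inf")))
    return rows[lo:hi]
```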


What I Learned

I learned that the future of AI isn't about replacing human connection; it's about filtering the noise. Building this taught me:

Technical Insights

  • Real-time systems are unforgiving: Every millisecond matters. A 100ms delay in audio processing compounds into awkward pauses.
  • State synchronization is hard: Coordinating state between Flutter, Serverpod, Twilio, and Gemini required careful event-driven architecture.
  • Audio engineering is its own discipline: Understanding μ-law encoding, sample rates, and interpolation algorithms opened a new world.

Product Insights

  • Context is everything: An assistant that remembers previous conversations is infinitely more valuable than one that just talks.
  • Voice UX differs from chat UX: Users expect immediate responses. Silence feels like failure.
  • Customization breeds adoption: Letting users create their own AI personas with unique voices and behaviors dramatically increases engagement.

Architecture Insights

  • Dart everywhere works: Having Flutter, Serverpod, and shared models all in Dart eliminated entire categories of bugs.
  • WebSocket > Polling: Real-time updates via WebSocket broadcasting transformed the user experience.
  • Pre-computation pays off: Lookup tables for audio conversion, Trie structures for contact search — these optimizations compound.

The Call Flow

Here's how a typical AI call works end-to-end:

┌──────────────────────────────────────────────────────────────────────────┐
│ PHASE 1: INITIATION                                                      │
├──────────────────────────────────────────────────────────────────────────┤
│ User enters: Phone Number + Mission + Agent Selection                    │
│ → Flutter calls initiateCall endpoint                                    │
│ → Backend creates CallSession (status: pending)                          │
│ → Gemini connection pre-initialized in background                        │
│ → Twilio REST API initiates call → returns Call SID                      │
└──────────────────────────────────────────────────────────────────────────┘
                                    ↓ (~2 seconds)
┌──────────────────────────────────────────────────────────────────────────┐
│ PHASE 2: MEDIA STREAM SETUP                                              │
├──────────────────────────────────────────────────────────────────────────┤
│ Twilio dials recipient → Recipient answers                               │
│ → Twilio requests TwiML → Returns WebSocket URL                          │
│ → WebSocket upgrades at /media-stream                                    │
│ → MediaStreamHandler bridges Twilio ↔ Gemini                             │
│ → Status: active                                                         │
└──────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌──────────────────────────────────────────────────────────────────────────┐
│ PHASE 3: REAL-TIME CONVERSATION                                          │
├──────────────────────────────────────────────────────────────────────────┤
│ Loop until call ends:                                                    │
│   Recipient speaks → Twilio sends μ-law 8kHz                             │
│   → Transcode to PCM 16kHz → Send to Gemini                              │
│   → Gemini generates response + transcript                               │
│   → Receive PCM 24kHz → Transcode to μ-law 8kHz                          │
│   → Send back to Twilio → Recipient hears AI                             │
│   → Transcript saved (debounced 500ms)                                   │
│   → Flutter UI updates via WebSocket broadcast                           │
└──────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌──────────────────────────────────────────────────────────────────────────┐
│ PHASE 4: POST-CALL ANALYSIS                                              │
├──────────────────────────────────────────────────────────────────────────┤
│ Call ends (AI calls end_call or recipient hangs up)                      │
│ → Status: completed                                                      │
│ → FutureCall scheduled (5-second delay)                                  │
│ → GeminiService generates structured summary                             │
│ → Summary saved: Outcome, Key Info, Action Items                         │
│ → Flutter displays final summary                                         │
└──────────────────────────────────────────────────────────────────────────┘

Tech Stack Summary

Layer             Technology         Purpose
Mobile App        Flutter 3.32       Cross-platform iOS/Android
State Management  Riverpod 2.5       Reactive state with providers
Backend           Serverpod 3.2      Dart-first backend with ORM
Database          PostgreSQL         Persistent storage with indexes
AI Engine         Gemini 2.5 Flash   Multimodal Live API for speech
Telephony         Twilio Voice       Programmable voice + media streams
Real-time         WebSocket          Bidirectional audio + UI updates
Auth              Serverpod Auth     JWT with email/Google OAuth

Future Roadmap

AI Phone is just the beginning. The next 12 months will focus on:

Near-Term (Q1-Q2)

  • Inbound Call Handling: Let AI answer calls on your behalf with caller ID screening
  • Call Transfer: Seamless handoff to human operator with full context
  • Multi-party Broadcasts: Call multiple recipients with the same message

Mid-Term (Q3-Q4)

  • Emotional Intelligence: Adapt AI's tone based on caller's urgency or mood
  • Autonomous Scheduling: Calendar integration to resolve booking conflicts automatically
  • Advanced Analytics: Success metrics, failure analysis, conversation insights

Long-Term (Year 2)

  • Visual Context: AI can "see" documents or images shared during calls
  • Language Expansion: Localized voice profiles for any language or dialect
  • Barge-in Detection: Allow user to interrupt and take over mid-call

Project Statistics

Metric                Value
Flutter Screens       12 major screens
Backend Services      38+ Dart files
Database Models       11+ protocol definitions
Voice Options         10 unique AI voices
Supported Languages   Multi-language with restrictions
Target Latency        <3s first response
Achieved Latency      ~2s with pre-initialization

The Impact

AI Phone is for the busy professional, the introvert, and anyone tired of the noise. Whether you need to:

  • Handle a tedious customer service call while you focus on work
  • Follow up with leads without the mental drain of repetitive conversations
  • Schedule appointments while your AI remembers all the details
  • Query your call history — "What did the insurance company say last month?"

AI Phone ensures you can reclaim your time and stay focused on what actually matters.


Try It Yourself

The project demonstrates:

  • Full-stack Dart development (Flutter + Serverpod)
  • Real-time WebSocket communication
  • Audio engineering with format transcoding
  • AI integration with tool calling
  • Production-grade state management

Built with passion for developers who value their focus.
