Project Story: AI Phone — Your Autonomous Communication Agent

The Inspiration

In a world of constant digital noise, our phones have become sources of disruption. Between relentless marketing calls, promotional spam, and the anxiety of "unknown numbers," we are losing our most valuable asset: focus.

As a developer and an introvert, I realized that many of us face major hurdles:

  • The Noise: Endless random calls that break our deep work.
  • The Social Drain: The mental energy required for "small-talk" errands when we just want to stay focused on what matters.
  • The Broadcast Burden: The difficulty of sharing information with many people simultaneously without losing hours to manual calling.
  • The Memory Gap: Trying to recall exactly what was discussed in a call weeks or months ago.

I was inspired to build AI Phone — a "Communication Shield" that doesn't just transcribe, but acts as your professional double. It makes and receives calls on your behalf, remembers every detail, and resolves errands while you live your life.


What It Does

AI Phone is a full-stack mobile application that delegates phone calls to an intelligent AI agent. At its core, it offers:

Real-Time AI Voice Calls

  • Initiate outbound calls through natural language missions ("Call my dentist and reschedule my appointment")
  • AI speaks naturally using Gemini's Multimodal Live API with voice synthesis
  • Bidirectional audio streaming with real-time transcription
  • Automatic call completion with AI-generated summaries

Customizable AI Agents

  • Create multiple AI personas with distinct personalities
  • Choose from 10 unique voice profiles: Aoede (casual), Charon (professional), Kore (calm), Fenrir (energetic), Leda (youthful), Orus (authoritative), Puck (playful), Zephyr (breeze), Vale (warm), Sage (British accent)
  • Configure tone, behavior guidelines, and caller information
  • Set language restrictions (single or multi-language support)
  • Designate a primary agent for quick calls

Context-Aware Memory

  • Every call is transcribed and stored with a structured summary
  • AI can access previous conversations with the same contact
  • Ask your AI: "What did we discuss last time?" — it knows

Knowledge Base Chat

  • Chat interface to query your entire call history
  • Gemini-powered RAG (Retrieval-Augmented Generation) for intelligent answers
  • Referenced calls displayed alongside responses

Unified Call Log

  • Seamlessly merges AI calls with your device's native call history
  • Contact integration with fast Trie-based search
  • Filter by call type (AI vs. Device)

How I Built It

The project is built on an industrial-grade, full-stack architecture designed for real-time interaction:

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        Flutter Mobile App                       │
│   (Riverpod State Management • 12 Screens • Contact Integration)│
└─────────────────────────┬───────────────────────────────────────┘
                          │ REST + WebSocket
┌─────────────────────────▼───────────────────────────────────────┐
│                     Serverpod Backend                           │
│  (Dart-first ORM • Real-time Streams • FutureCalls Scheduling)  │
└───────────┬─────────────────────────────────────┬───────────────┘
            │                                     │
┌───────────▼───────────┐           ┌─────────────▼───────────────┐
│     Twilio Voice      │ μ-law↔PCM │    Gemini Multimodal Live   │
│  (Telephony + Media   │◀─────────▶│    (Speech-to-Speech AI)    │
│   Stream WebSocket)   │Transcoding│                             │
└───────────────────────┘           └─────────────────────────────┘

Backend (The Engine)

I used Serverpod as the backbone. Its Dart-first ORM and high-performance capabilities allowed me to build a seamless bridge between the database and the telephony logic. Key services include:

  • MediaStreamHandler: Bidirectional WebSocket bridge between Twilio and Gemini
  • AudioTranscoder: Real-time μ-law ↔ PCM conversion with upsampling/downsampling
  • GeminiLiveService: Manages WebSocket connections to Gemini's real-time API
  • CallSchedulerService: Handles scheduled calls via Serverpod's FutureCalls
  • CallEventService: WebSocket broadcasting for live UI updates
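The heart of MediaStreamHandler is two concurrent pump loops, one per direction. The backend itself is Dart/Serverpod; the sketch below uses Python asyncio queues as stand-ins for the Twilio and Gemini WebSockets, and the `pump`/`bridge` names and transform callbacks are illustrative, not the project's actual API:

```python
import asyncio

async def pump(src, transform, dst):
    # One direction of the bridge: read a frame, transcode it, forward it.
    while True:
        frame = await src.get()
        if frame is None:              # upstream closed
            await dst.put(None)
            return
        await dst.put(transform(frame))

async def bridge(twilio_in, to_gemini, gemini_in, to_twilio, upsample, downsample):
    # Both directions run concurrently, MediaStreamHandler-style:
    #   Twilio → Gemini: μ-law 8 kHz → PCM 16 kHz
    #   Gemini → Twilio: PCM 24 kHz → μ-law 8 kHz
    await asyncio.gather(
        pump(twilio_in, upsample, to_gemini),
        pump(gemini_in, downsample, to_twilio),
    )
```

Running both pumps under one `gather` means a stall in either direction surfaces in one place, which simplifies teardown when a call ends abruptly.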

The Intelligence

Gemini 2.5 Flash serves as the reasoning core via the Multimodal Live API. Using a RAG (Retrieval-Augmented Generation) system, I gave the AI a long-term memory by storing call histories in a PostgreSQL database. The AI has access to tool functions:

  • get_call_history() — Retrieve previous conversations with the same contact
  • end_call(reason, summary) — Autonomously terminate calls when mission complete
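As a sketch (in Python rather than the project's Dart, with the declaration schema assumed in the standard function-calling style rather than copied from the codebase), the two tools and a dispatcher might look like:

```python
# Assumed shape of the function declarations handed to the Gemini session
# at setup time; names match the tools above, schema details abbreviated.
TOOLS = [
    {
        "name": "get_call_history",
        "description": "Retrieve previous conversations with this contact.",
        "parameters": {"type": "object", "properties": {}},
    },
    {
        "name": "end_call",
        "description": "Terminate the call once the mission is complete.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string"},
                "summary": {"type": "string"},
            },
            "required": ["reason", "summary"],
        },
    },
]

def dispatch(name, args, fetch_history, close_call):
    """Route a model-issued tool call to the backend handlers."""
    if name == "get_call_history":
        return fetch_history()
    if name == "end_call":
        close_call(args["reason"], args["summary"])
        return {"status": "ended"}
    raise ValueError(f"unknown tool: {name}")
```

Keeping `end_call` a tool (rather than hanging up on silence) is what lets the AI terminate the call itself and hand back a structured reason and summary.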

Telephony

Twilio Programmable Voice handles the global telephony infrastructure:

  • REST API for call initiation with TwiML webhooks
  • Media Streams for bidirectional audio via WebSocket
  • Automatic call recording with MP3 storage
  • Status callbacks for real-time call state tracking
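When Twilio hits the webhook, the backend answers with TwiML whose `<Connect><Stream>` verb points Twilio's Media Streams at the server's WebSocket endpoint. A minimal sketch (Python for brevity; the URL is a placeholder):

```python
import xml.etree.ElementTree as ET

def media_stream_twiml(ws_url: str) -> str:
    # <Connect><Stream url=...> tells Twilio to open a bidirectional
    # Media Streams WebSocket to our server for the live call audio.
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", url=ws_url)
    return ET.tostring(response, encoding="unicode")
```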

Frontend

A clean Flutter interface with Riverpod state management across 12 screens:

  • DialerScreen: Contact integration with Trie-based search
  • CallAgentsScreen: Create and manage AI personas
  • ActiveCallMonitorScreen: Live transcript and status updates
  • CallHistoryScreen: Unified AI + device call log
  • ChatScreen: Knowledge base queries with referenced calls
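The Trie behind DialerScreen's search indexes every name prefix, so each keystroke is answered in time proportional to the query length rather than the contact count. A minimal sketch (Python for brevity; the app's Dart implementation will differ in detail):

```python
class TrieNode:
    __slots__ = ("children", "contacts")
    def __init__(self):
        self.children = {}   # next character → TrieNode
        self.contacts = []   # every contact passing through this prefix

class ContactTrie:
    """Prefix index over contact names for fast dialer search."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, name: str):
        node = self.root
        for ch in name.lower():
            node = node.children.setdefault(ch, TrieNode())
            node.contacts.append(name)

    def search(self, prefix: str):
        # Walk the prefix; the node reached already holds all matches.
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        return node.contacts
```

Storing the match list at every node trades memory for lookup speed, which is the right trade for an interactive dialer.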

Challenges I Faced: Mastering the Conversation

Challenge 1: The Audio Mismatch

The most significant hurdle was the Audio Format Gap. Telephony standards operate at 8,000 Hz μ-law (narrowband), while Gemini's Live API expects 16,000 Hz PCM input and produces 24,000 Hz PCM output.

Solution — Custom Audio Transcoder:

Twilio → Server:  μ-law 8kHz → decode → upsample 2x → PCM 16kHz → Gemini
Gemini → Twilio:  PCM 24kHz → downsample 3:1 → encode → μ-law 8kHz → Twilio

I built a real-time audio transcoding pipeline with:

  • Pre-computed μ-law decode/encode tables (256 entries) for O(1) conversion
  • Linear interpolation for upsampling (8kHz → 16kHz)
  • 3:1 averaging for downsampling (24kHz → 8kHz)
  • 20ms audio chunks (160 bytes of raw μ-law per frame) processed in real-time
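Concretely, the decode table and the two resampling steps look roughly like this (sketched in Python rather than the backend's Dart; the μ-law encode direction is the symmetric inverse and is omitted here):

```python
def _ulaw_decode_byte(b: int) -> int:
    # Standard G.711 μ-law expansion: 8-bit companded byte → 16-bit PCM.
    b = ~b & 0xFF
    sign = b & 0x80
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

# Pre-computed 256-entry table: every possible μ-law byte, decoded once,
# so per-sample conversion is a single O(1) list lookup.
ULAW_DECODE = [_ulaw_decode_byte(i) for i in range(256)]

def upsample_2x(pcm):
    # 8 kHz → 16 kHz by linear interpolation: insert the midpoint
    # between each pair of neighbouring samples.
    out = []
    for i, s in enumerate(pcm):
        nxt = pcm[i + 1] if i + 1 < len(pcm) else s
        out += [s, (s + nxt) // 2]
    return out

def downsample_3to1(pcm):
    # 24 kHz → 8 kHz by averaging each group of three samples.
    return [sum(pcm[i:i + 3]) // 3 for i in range(0, len(pcm) - 2, 3)]
```

Averaging on the way down doubles as a crude low-pass filter, which is good enough for narrowband telephone audio.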

Challenge 2: The Latency Paradox

For conversations to feel natural, the AI's first response must arrive in under 3 seconds. Initial tests showed 5+ second delays.

Solution — Gemini Pre-initialization: Instead of waiting for Twilio to connect before initializing Gemini, I start the Gemini WebSocket connection during call setup. By the time the recipient answers, the AI is ready to speak.

Metric            Before    After
First Response    ~5s       ~2s
Audio Roundtrip   ~800ms    ~300ms
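The overlap is easy to express with concurrent tasks. A Python asyncio sketch with stand-in delays (both coroutines are placeholders, not the project's code):

```python
import asyncio

async def connect_gemini():
    # Stand-in for the Gemini WebSocket handshake + session setup.
    await asyncio.sleep(0.05)
    return "gemini-session"

async def dial_twilio():
    # Stand-in for the Twilio REST call and the recipient answering.
    await asyncio.sleep(0.10)
    return "call-sid"

async def initiate_call():
    # Kick off the Gemini handshake immediately instead of waiting for
    # the answer; the two setups overlap, so the AI is ready to speak
    # by the time the recipient picks up.
    gemini_task = asyncio.create_task(connect_gemini())
    call_sid = await dial_twilio()
    session = await gemini_task  # normally already resolved by now
    return call_sid, session
```

Because dialing dominates, the effective setup time collapses to max(dial, handshake) instead of their sum.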

Challenge 3: Transcript Persistence

Real-time transcripts were being lost when calls ended abruptly. The solution involved:

  • Debounced database updates (500ms) to avoid excessive writes
  • Saving transcript before broadcasting status changes
  • FutureCall-based post-call analysis with 5-second delay
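A debounced writer of this shape is shown below in Python, with threading.Timer standing in for the backend's scheduling (class and method names are illustrative):

```python
import threading

class DebouncedSaver:
    """Coalesce rapid transcript updates into one DB write per quiet period."""
    def __init__(self, save_fn, delay=0.5):
        self.save_fn = save_fn
        self.delay = delay
        self._timer = None
        self._pending = None
        self._lock = threading.Lock()

    def update(self, transcript):
        # Each update resets the timer; only the last version within the
        # quiet window actually reaches the database.
        with self._lock:
            self._pending = transcript
            if self._timer:
                self._timer.cancel()
            self._timer = threading.Timer(self.delay, self.flush)
            self._timer.start()

    def flush(self):
        # Called on timer expiry AND on call teardown, so an abrupt
        # hangup never loses the tail of the transcript.
        with self._lock:
            if self._timer:
                self._timer.cancel()
                self._timer = None
            if self._pending is not None:
                self.save_fn(self._pending)
                self._pending = None
```

The explicit `flush()` on teardown is the fix for the abrupt-end data loss: the save happens before any status broadcast.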

Challenge 4: Memory Retrieval at Scale

The AI needs instant access to call history for context-aware conversations:

T(n) = O(log n)

By optimizing PostgreSQL indexes on (userId, phoneNumber, completedAt), the AI retrieves relevant history from thousands of calls in milliseconds.
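The composite B-tree index behaves like a binary search over rows kept sorted by (userId, phoneNumber, completedAt). An illustrative Python stand-in using `bisect` (the rows and timestamps here are invented for the example):

```python
import bisect

def history_for(rows, user_id, phone):
    # rows: list of (user_id, phone, completed_at) tuples, sorted
    # ascending — the same ordering the composite index maintains.
    # Two O(log n) binary searches bracket this contact's history.
    lo = bisect.bisect_left(rows, (user_id, phone, 0))
    hi = bisect.bisect_right(rows, (user_id, phone, float("inf")))
    return rows[lo:hi]
```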


What I Learned

I learned that the future of AI isn't about replacing human connection; it's about filtering the noise. Building this taught me:

Technical Insights

  • Real-time systems are unforgiving: Every millisecond matters. A 100ms delay in audio processing compounds into awkward pauses.
  • State synchronization is hard: Coordinating state between Flutter, Serverpod, Twilio, and Gemini required careful event-driven architecture.
  • Audio engineering is its own discipline: Understanding μ-law encoding, sample rates, and interpolation algorithms opened a new world.

Product Insights

  • Context is everything: An assistant that remembers previous conversations is infinitely more valuable than one that just talks.
  • Voice UX differs from chat UX: Users expect immediate responses. Silence feels like failure.
  • Customization breeds adoption: Letting users create their own AI personas with unique voices and behaviors dramatically increases engagement.

Architecture Insights

  • Dart everywhere works: Having Flutter, Serverpod, and shared models all in Dart eliminated entire categories of bugs.
  • WebSocket > Polling: Real-time updates via WebSocket broadcasting transformed the user experience.
  • Pre-computation pays off: Lookup tables for audio conversion, Trie structures for contact search — these optimizations compound.

The Call Flow

Here's how a typical AI call works end-to-end:

┌──────────────────────────────────────────────────────────────────────────┐
│ PHASE 1: INITIATION                                                      │
├──────────────────────────────────────────────────────────────────────────┤
│ User enters: Phone Number + Mission + Agent Selection                    │
│ → Flutter calls initiateCall endpoint                                    │
│ → Backend creates CallSession (status: pending)                          │
│ → Gemini connection pre-initialized in background                        │
│ → Twilio REST API initiates call → returns Call SID                      │
└──────────────────────────────────────────────────────────────────────────┘
                                    ↓ (~2 seconds)
┌──────────────────────────────────────────────────────────────────────────┐
│ PHASE 2: MEDIA STREAM SETUP                                              │
├──────────────────────────────────────────────────────────────────────────┤
│ Twilio dials recipient → Recipient answers                               │
│ → Twilio requests TwiML → Returns WebSocket URL                          │
│ → WebSocket upgrades at /media-stream                                    │
│ → MediaStreamHandler bridges Twilio ↔ Gemini                             │
│ → Status: active                                                         │
└──────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌──────────────────────────────────────────────────────────────────────────┐
│ PHASE 3: REAL-TIME CONVERSATION                                          │
├──────────────────────────────────────────────────────────────────────────┤
│ Loop until call ends:                                                    │
│   Recipient speaks → Twilio sends μ-law 8kHz                             │
│   → Transcode to PCM 16kHz → Send to Gemini                              │
│   → Gemini generates response + transcript                               │
│   → Receive PCM 24kHz → Transcode to μ-law 8kHz                          │
│   → Send back to Twilio → Recipient hears AI                             │
│   → Transcript saved (debounced 500ms)                                   │
│   → Flutter UI updates via WebSocket broadcast                           │
└──────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌──────────────────────────────────────────────────────────────────────────┐
│ PHASE 4: POST-CALL ANALYSIS                                              │
├──────────────────────────────────────────────────────────────────────────┤
│ Call ends (AI calls end_call or recipient hangs up)                      │
│ → Status: completed                                                      │
│ → FutureCall scheduled (5-second delay)                                  │
│ → GeminiService generates structured summary                             │
│ → Summary saved: Outcome, Key Info, Action Items                         │
│ → Flutter displays final summary                                         │
└──────────────────────────────────────────────────────────────────────────┘

Tech Stack Summary

Layer             Technology         Purpose
Mobile App        Flutter 3.32       Cross-platform iOS/Android
State Management  Riverpod 2.5       Reactive state with providers
Backend           Serverpod 3.2      Dart-first backend with ORM
Database          PostgreSQL         Persistent storage with indexes
AI Engine         Gemini 2.5 Flash   Multimodal Live API for speech
Telephony         Twilio Voice       Programmable voice + media streams
Real-time         WebSocket          Bidirectional audio + UI updates
Auth              Serverpod Auth     JWT with email/Google OAuth

Future Roadmap

AI Phone is just the beginning. The next 12 months will focus on:

Near-Term (Q1-Q2)

  • Inbound Call Handling: Let AI answer calls on your behalf with caller ID screening
  • Call Transfer: Seamless handoff to human operator with full context
  • Multi-party Broadcasts: Call multiple recipients with the same message

Mid-Term (Q3-Q4)

  • Emotional Intelligence: Adapt AI's tone based on caller's urgency or mood
  • Autonomous Scheduling: Calendar integration to resolve booking conflicts automatically
  • Advanced Analytics: Success metrics, failure analysis, conversation insights

Long-Term (Year 2)

  • Visual Context: AI can "see" documents or images shared during calls
  • Language Expansion: Localized voice profiles for any language or dialect
  • Barge-in Detection: Allow user to interrupt and take over mid-call

Project Statistics

Metric                Value
Flutter Screens       12 major screens
Backend Services      38+ Dart files
Database Models       11+ protocol definitions
Voice Options         10 unique AI voices
Supported Languages   Multi-language with restrictions
Target Latency        <3s first response
Achieved Latency      ~2s with pre-initialization

The Impact

AI Phone is for the busy professional, the introvert, and anyone tired of the noise. Whether you need to:

  • Handle a tedious customer service call while you focus on work
  • Follow up with leads without the mental drain of repetitive conversations
  • Schedule appointments while your AI remembers all the details
  • Query your call history — "What did the insurance company say last month?"

AI Phone ensures you can reclaim your time and stay focused on what actually matters.


Try It Yourself

The project demonstrates:

  • Full-stack Dart development (Flutter + Serverpod)
  • Real-time WebSocket communication
  • Audio engineering with format transcoding
  • AI integration with tool calling
  • Production-grade state management

Built with passion for developers who value their focus.
