VoiceHost: AI Phone Assistant for Restaurants

Try it out!

Call the Korean fried chicken restaurant BonChan at +1-669-201-5051 to reserve a table or place a pickup order.


===================================

💡 Inspiration

Walking into my favorite Korean fried chicken spot, I noticed something
frustrating: three people waiting on hold while one overwhelmed staff member
juggled the phone, cash register, and takeout orders. The owner later told me
they spend over $1,000/month on phone staff alone—and still miss 30% of calls
during dinner rush.

That's when it hit me: What if AI could handle every single call?

Restaurants don't need another app customers won't download. They need something that works with what customers already do: pick up the phone and call.

🎯 What It Does

VoiceHost is an AI phone receptionist that answers restaurant calls 24/7. When a customer calls, they hear a natural voice that:

Takes pickup orders: "I'd like medium wings with soy garlic sauce"
Books reservations: "Table for 4 tomorrow at 7 PM"
Answers questions: "What are your hours?" "What's on the menu?"
Confirms everything: Sends SMS via Square Bookings API

The magic? Customers don't know they're talking to AI. It sounds human, handles interruptions naturally, and never makes booking errors.

🛠️ How We Built It

Architecture

The system connects five technologies into one seamless voice pipeline:

Customer Call → Twilio (telephony)
      ↓
Deepgram STT (speech → text)
      ↓
OpenAI GPT-4o-mini (conversation logic + function calling)
      ↓
Square Bookings API (create reservations/orders)
      ↓
Deepgram TTS (text → speech)
      ↓
Twilio → Customer hears response

Tech Stack

┌──────────┬────────────────────┬──────────────────────────────────────────┐
│ Layer    │ Technology         │ Why We Chose It                          │
├──────────┼────────────────────┼──────────────────────────────────────────┤
│ Phone    │ Twilio             │ Industry standard, WebSocket streaming   │
├──────────┼────────────────────┼──────────────────────────────────────────┤
│ Voice    │ Deepgram           │ 95%+ accuracy, real-time STT/TTS         │
├──────────┼────────────────────┼──────────────────────────────────────────┤
│ AI       │ OpenAI GPT-4o-mini │ Function calling for API integration     │
├──────────┼────────────────────┼──────────────────────────────────────────┤
│ Bookings │ Square API         │ Production-ready, auto SMS confirmations │
├──────────┼────────────────────┼──────────────────────────────────────────┤
│ Backend  │ FastAPI + Python   │ Async WebSocket support                  │
└──────────┴────────────────────┴──────────────────────────────────────────┘

Key Implementation Details

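Real-time Audio Streaming

The WebSocket receives audio chunks from Twilio (mulaw, 8 kHz) and forwards them to Deepgram:

    async for message in websocket.iter_text():
        data = json.loads(message)  # Twilio sends JSON text frames
        if data.get("event") == "media":
            payload = data["media"]["payload"]  # base64-encoded audio
            audio_bytes = base64.b64decode(payload)
            await deepgram.send_audio(audio_bytes)  # → Speech recognition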

Function Calling for Bookings

The AI decides when to call APIs based on conversation context:

    tools = [
        "check_availability(date, time, party_size)",
        "create_booking(date, time, name, phone)",
        "create_pickup_order(items, pickup_time, name, phone)",
    ]
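
As a rough sketch, the three tool signatures above could be written in the JSON-schema shape that OpenAI's chat completions `tools` parameter expects, with a small router that hands a returned tool call to a Python handler. The parameter types, descriptions, and the `make_tool` and `dispatch` helpers are assumptions for illustration, not VoiceHost's actual code, and the handler is a stub that makes no Square API call.

```python
import json

STR = {"type": "string"}
INT = {"type": "integer"}
STR_LIST = {"type": "array", "items": {"type": "string"}}


def make_tool(name, description, params):
    """Build one entry in the chat-completions `tools` format (hypothetical helper)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": params,
                "required": list(params),
            },
        },
    }


TOOLS = [
    make_tool("check_availability", "Check open tables for a date and time.",
              {"date": STR, "time": STR, "party_size": INT}),
    make_tool("create_booking", "Create a table reservation.",
              {"date": STR, "time": STR, "name": STR, "phone": STR}),
    make_tool("create_pickup_order", "Create a pickup order.",
              {"items": STR_LIST, "pickup_time": STR, "name": STR, "phone": STR}),
]


def dispatch(name, arguments_json, handlers):
    """Route one tool call (name plus JSON-encoded arguments) to a handler."""
    if name not in handlers:
        return {"error": f"unknown tool: {name}"}
    return handlers[name](**json.loads(arguments_json))


def check_availability(date, time, party_size):
    # Stub for illustration: a real handler would query the Square Bookings API.
    return {"available": party_size <= 6, "date": date, "time": time}


HANDLERS = {"check_availability": check_availability}
result = dispatch(
    "check_availability",
    '{"date": "2026-02-16", "time": "18:00", "party_size": 4}',
    HANDLERS,
)
```

Returning an error dictionary for an unknown tool name, instead of raising, lets the model see the failure and recover within the conversation.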

Echo Suppression

Calculate the TTS audio duration to mute incoming audio while the agent speaks:

$$\text{speech\_duration} = \frac{\text{audio\_bytes}}{8000 \text{ bytes/sec}}$$

Then suppress transcripts for speech_duration + 0.5s buffer.
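
A minimal sketch of that gate, assuming 8-bit mulaw at 8 kHz (so 8000 bytes per second) and the 0.5 s buffer described above. The `EchoGate` class and its method names are hypothetical; the injectable clock exists only so the example can be run without waiting in real time.

```python
import time

MULAW_BYTES_PER_SEC = 8000  # 8 kHz sample rate, one byte per sample
ECHO_BUFFER_SEC = 0.5       # extra margin after the audio should have finished


class EchoGate:
    """Tracks when the agent's own TTS audio is playing (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._speaking_until = 0.0

    def agent_spoke(self, audio_bytes: bytes) -> float:
        """Record outgoing TTS audio and return its duration in seconds."""
        duration = len(audio_bytes) / MULAW_BYTES_PER_SEC
        until = self._clock() + duration + ECHO_BUFFER_SEC
        # Never shorten a window that is already open.
        self._speaking_until = max(self._speaking_until, until)
        return duration

    def should_ignore(self) -> bool:
        """True while incoming transcripts are probably the agent's own echo."""
        return self._clock() < self._speaking_until


# Usage with a fake clock: 16000 bytes is 2.0 s of audio, muted until t = 102.5.
now = [100.0]
gate = EchoGate(clock=lambda: now[0])
duration = gate.agent_spoke(b"\x00" * 16000)
muted_during_speech = gate.should_ignore()
now[0] = 102.6
muted_after_buffer = gate.should_ignore()
```

One trade-off of a purely time-based gate is that a caller who interrupts while the agent is speaking is also ignored until the window closes.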

🚧 Challenges We Faced

Challenge #1: Timezone Chaos

Problem: Square stores bookings in UTC, but users say "6 PM" meaning PST. Our first version compared UTC dates to PST dates—bookings were invisible!

Example Bug:

User books "today at 6 PM PST" (Feb 16, 18:00 PST)
Square stores: 2026-02-17T02:00:00Z (Feb 17, 2 AM UTC)
Our code checked: "Does 2026-02-17 == 2026-02-16?" → ❌ Not found

Solution: Convert UTC → PST before any date comparisons:

    utc_dt = datetime.strptime(start_at, "%Y-%m-%dT%H:%M:%SZ")
    local_dt = utc_dt - timedelta(hours=8)  # UTC → PST (fixed offset, so daylight saving time is not handled)
    booking_date = local_dt.strftime("%Y-%m-%d")  # Now compare
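
The fixed eight-hour offset matches PST but not PDT, so a summer booking near midnight could still land on the wrong date. A DST-aware variant might look like the sketch below, which uses the standard-library `zoneinfo` module. The `America/Los_Angeles` zone and the `local_booking_date` helper name are assumptions for illustration, not the code VoiceHost runs.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

RESTAURANT_TZ = ZoneInfo("America/Los_Angeles")  # assumed restaurant time zone


def local_booking_date(start_at: str) -> str:
    """Convert a UTC timestamp such as '2026-02-17T02:00:00Z' to the local date."""
    utc_dt = datetime.strptime(start_at, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return utc_dt.astimezone(RESTAURANT_TZ).strftime("%Y-%m-%d")


# The bug example from above: stored as Feb 17 UTC, but 6 PM on Feb 16 locally.
winter = local_booking_date("2026-02-17T02:00:00Z")

# During daylight saving time the offset is 7 hours: 07:30 UTC is 12:30 AM PDT
# on Jul 17, whereas a fixed 8-hour offset would report Jul 16.
summer = local_booking_date("2026-07-17T07:30:00Z")
```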

Challenge #2: The Echo Problem

Problem: Agent says "Your reservation is confirmed" → Twilio plays it → Phone mic picks it up → Deepgram transcribes "your reservation is confirmed" → AI responds again → Infinite loop! 😱

Solution: Track when the agent is speaking and suppress transcripts during that window:

    speech_duration = len(audio_bytes) / 8000.0  # mulaw 8 kHz = 8000 bytes/sec
    agent_speaking_until = now + speech_duration + 0.5
    # Ignore all transcripts until agent_speaking_until

Challenge #3: Phone Numbers Get Chopped

Problem: User says "669-290-9767" but pauses mid-number. With 300ms endpointing:

Transcript 1: "six six nine two nine zero" → AI: "Is 669290 correct?" ❌
Transcript 2: "nine seven six seven" → User confused

Solution:

Validate phone numbers have 10 digits before confirming
If len(digits) < 10, ask: "And the rest of the number?"
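
One way to sketch this check is to accumulate digits across consecutive transcripts and only read the number back once ten digits have arrived. The word-to-digit table, the `extract_digits` function, and the `PhoneCollector` class are hypothetical names for illustration, and the sketch assumes 10-digit US numbers.

```python
import re

WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def extract_digits(transcript: str) -> str:
    """Pull digits out of a transcript, accepting numerals and spoken digit words."""
    digits = []
    for token in re.findall(r"[a-z]+|\d", transcript.lower()):
        if token.isdigit():
            digits.append(token)
        elif token in WORD_TO_DIGIT:
            digits.append(WORD_TO_DIGIT[token])
    return "".join(digits)


class PhoneCollector:
    """Accumulates a phone number that may arrive split across transcripts."""

    def __init__(self, required: int = 10):
        self.required = required
        self.digits = ""

    def add(self, transcript: str) -> str:
        """Add a transcript and return what the agent should say next."""
        self.digits += extract_digits(transcript)
        if len(self.digits) < self.required:
            return "And the rest of the number?"
        if len(self.digits) > self.required:
            self.digits = ""  # too many digits: start over instead of guessing
            return "Sorry, could you repeat the full number?"
        d = self.digits
        return f"Is {d[:3]}-{d[3:6]}-{d[6:]} correct?"


collector = PhoneCollector()
first_reply = collector.add("six six nine two nine zero")
second_reply = collector.add("nine seven six seven")
```

Treating "oh" as zero helps with how people read numbers aloud, though it can misfire on a filler "oh", so a real system would probably apply it only while a phone number is expected.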

Challenge #4: Finding the Goldilocks Endpointing

Too short (200ms) = cuts users off mid-sentence
Too long (800ms) = slow, awkward pauses
Just right: 300ms ✨
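
For reference, Deepgram's live transcription endpoint takes the silence threshold as an `endpointing` query parameter in milliseconds, next to the audio format parameters. The sketch below only builds the connection URL. Whether VoiceHost connects through a raw WebSocket or the Deepgram SDK is not stated, so the `deepgram_listen_url` helper is an assumption for illustration.

```python
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def deepgram_listen_url(endpointing_ms: int = 300) -> str:
    """Build a live-transcription URL for Twilio's mulaw 8 kHz audio."""
    params = {
        "encoding": "mulaw",
        "sample_rate": 8000,
        "channels": 1,
        "endpointing": endpointing_ms,  # ms of silence before a transcript is finalized
    }
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


url = deepgram_listen_url(300)
```

Keeping the threshold as a function argument makes it easy to compare values such as 200, 300, and 800 ms on test calls.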

🏆 Accomplishments We're Proud Of

✅ Shipped a working MVP in one session – Real Twilio number, real Square API, real SMS confirmations

✅ Natural conversation flow – Handles "I want wings" → "What size?" → "Medium" → "Sauce?" without getting lost

✅ Zero booking errors – Double-confirmation before submitting, timezone-safe, phone validation

✅ 70% cost reduction – $1,000/month (human staff) → $299/month (VoiceHost)

✅ Solved the hardest problem – Echo suppression without expensive VAD hardware

📚 What We Learned

Voice UX ≠ Text UX

Text chatbot: "Here's a list of options: \n- Wings (Small: $16.55) \n- Boneless (Small: $16.95)..."

Voice AI: "Our most popular are wings or bulgogi. Which sounds good?"

Rules we discovered:

❌ No markdown (bold sounds like "asterisk asterisk bold")
❌ No bullet lists (people can't remember 5 options spoken aloud)
✅ One question at a time
✅ Max 1-2 sentences per response
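
These rules can be backed up with a small post-processing step on the model's reply before it reaches TTS. The sketch below strips markdown markers and bullets, then keeps only the first two sentences. The `to_speakable` function is a hypothetical helper, and its sentence splitter is deliberately naive (it would mis-split abbreviations such as "St. Louis").

```python
import re


def to_speakable(text: str, max_sentences: int = 2) -> str:
    """Strip markdown and list formatting, then keep the first few sentences."""
    text = re.sub(r"[*_`#]+", "", text)                  # emphasis, code, heading markers
    text = re.sub(r"^\s*[-•]\s+", "", text, flags=re.M)  # leading bullet markers
    text = re.sub(r"\s+", " ", text).strip()             # collapse newlines and extra spaces
    sentences = re.split(r"(?<=[.!?])\s+", text)         # split after ., ! or ?
    return " ".join(sentences[:max_sentences])


reply = to_speakable(
    "**Our most popular** are wings or bulgogi. Which sounds good? We also have fries."
)
```

A filter like this is a safety net; the system prompt should still ask the model for short, plain-spoken replies in the first place.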

Timezone Handling is Mission-Critical

Every datetime operation needs explicit timezone awareness. We fixed 3 separate timezone bugs before bookings worked reliably.

Modern AI APIs Are Production-Ready

Deepgram: 95%+ accuracy on real phone calls, even with background noise
OpenAI Function Calling: Reliably calls create_booking() at the right moment
Twilio: Rock-solid WebSocket streaming, handles reconnections gracefully

We went from idea → working phone number in under 8 hours. The infrastructure exists—you just have to wire it together.

Endpointing is an Art

The difference between 300ms and 500ms wait time changes the entire conversation feel. Too fast = interrupts; too slow = awkward silences. We A/B tested on real calls to find 300ms optimal.

🚀 What's Next for VoiceHost

Immediate (Next 2 Weeks)

☁️ Deploy to Railway/Render for 24/7 uptime (currently runs locally)
📊 Analytics dashboard (call volume, peak hours, conversion rate)
🌐 Multi-language support (Spanish for Latino communities)

Short-term (3 Months)

🔌 Integrate more POS systems (Toast, Clover, Lightspeed)
🤖 Upselling AI: "Would you like to add fries for $3?"
📱 SMS/WhatsApp ordering (ordering beyond phone calls)

Long-term (6-12 Months)

🏢 Expand to adjacent markets:
- Hair salons (300K+ in US)
- Dental offices (200K+)
- Fitness studios (40K+)
🧠 Sentiment analysis (detect angry customers → escalate to human)
🎯 Goal: 1,000 paying customers, $300K MRR

Business Model
┌─────────┬─────────┬─────────────┬─────────────────────────┐
│ Tier    │ Price   │ Calls/Month │ Target Customer         │
├─────────┼─────────┼─────────────┼─────────────────────────┤
│ Starter │ $99/mo  │ 500         │ Small restaurants       │
├─────────┼─────────┼─────────────┼─────────────────────────┤
│ Pro     │ $199/mo │ 1,500       │ Mid-size restaurants    │
├─────────┼─────────┼─────────────┼─────────────────────────┤
│ Premium │ $299/mo │ 3,000       │ High-volume restaurants │
└─────────┴─────────┴─────────────┴─────────────────────────┘

ROI Calculation

Annual ROI divides twelve months of savings by twelve months of subscription cost, against a baseline of $1,000/month for phone staff.

Starter Plan:

$$\text{Monthly Savings} = \$1{,}000 - \$99 = \$901$$

$$\text{Annual ROI} = \frac{\$10{,}812}{\$1{,}188} \times 100 \approx 910\%$$

Pro Plan:

$$\text{Monthly Savings} = \$1{,}000 - \$199 = \$801$$

$$\text{Annual ROI} = \frac{\$9{,}612}{\$2{,}388} \times 100 \approx 403\%$$

Premium Plan:

$$\text{Monthly Savings} = \$1{,}000 - \$299 = \$701$$

$$\text{Annual ROI} = \frac{\$8{,}412}{\$3{,}588} \times 100 \approx 234\%$$

Cost reduction: 70-90% vs hiring a receptionist
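
The same arithmetic can be reproduced in a few lines, taking the $1,000/month staffing figure and the three tier prices as given. The `roi` function and its field names are illustrative only.

```python
HUMAN_COST_PER_MONTH = 1000  # phone staffing figure quoted by the restaurant owner
TIERS = {"Starter": 99, "Pro": 199, "Premium": 299}


def roi(plan_price: int, baseline: int = HUMAN_COST_PER_MONTH) -> dict:
    """Monthly savings, annual ROI, and cost reduction for one plan."""
    monthly_savings = baseline - plan_price
    annual_savings = 12 * monthly_savings
    annual_cost = 12 * plan_price
    return {
        "monthly_savings": monthly_savings,
        "annual_roi_pct": 100 * annual_savings / annual_cost,
        "cost_reduction_pct": 100 * monthly_savings / baseline,
    }


results = {name: roi(price) for name, price in TIERS.items()}
for name, r in results.items():
    print(
        f"{name}: saves ${r['monthly_savings']}/mo, "
        f"ROI {r['annual_roi_pct']:.1f}%, "
        f"cost reduction {r['cost_reduction_pct']:.1f}%"
    )
```

All of these figures depend on the $1,000/month baseline, so a restaurant with different staffing costs would need to substitute its own number.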

🎬 Conclusion

VoiceHost proves that AI can handle real customer interactions today—not in 5 years, not after more research, but right now.

We built a system that:

Saves restaurants 70-90% on phone costs
Never misses a call
Books reservations with zero errors
Sounds indistinguishable from a human

The future of restaurant operations isn't hiring more staff—it's giving every restaurant an AI teammate that works 24/7, never calls in sick, and costs less than a part-time employee.