VoiceHost: AI Phone Assistant for Restaurants
Try it out!
Call Korean Fried Chicken restaurant BonChan +1-669-201-5051 for table reservation and order pick up
+1-669-201-5051
===================================
π‘ Inspiration
Walking into my favorite Korean fried chicken spot, I noticed something
frustrating: three people waiting on hold while one overwhelmed staff member
juggled the phone, cash register, and takeout orders. The owner later told me
they spend over $1,000/month on phone staff aloneβand still miss 30% of calls
during dinner rush.
That's when it hit me: What if AI could handle every single call?
Restaurants don't need another app customers won't download. They need something that works with what customers already do: pick up the phone and call.
π― What It Does
VoiceHost is an AI phone receptionist that answers restaurant calls 24/7. When a customer calls, they hear a natural voice that:
- Takes pickup orders: "I'd like medium wings with soy garlic sauce"
- Books reservations: "Table for 4 tomorrow at 7 PM"
- Answers questions: "What are your hours?" "What's on the menu?"
- Confirms everything: Sends SMS via Square Bookings API
The magic? Customers don't know they're talking to AI. It sounds human, handles interruptions naturally, and never makes booking errors.
π οΈ How We Built It
Architecture
The system connects five technologies into one seamless voice pipeline:
Customer Call β Twilio (telephony) β Deepgram STT (speech β text) β OpenAI GPT-4o-mini (conversation logic + function calling) β Square Bookings API (create reservations/orders) β Deepgram TTS (text β speech) β Twilio β Customer hears response
Tech Stack ββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββ β Layer β Technology β Why We Chose It β ββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββ€ β Phone β Twilio β Industry standard, WebSocket streaming β ββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββ€ β Voice β Deepgram β 95%+ accuracy, real-time STT/TTS β ββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββ€ β AI β OpenAI GPT-4o-mini β Function calling for API integration β ββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββ€ β Bookings β Square API β Production-ready, auto SMS confirmations β ββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββ€ β Backend β FastAPI + Python β Async WebSocket support β ββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββ Key Implementation Details
Real-time Audio Streaming WebSocket receives audio chunks from Twilio (mulaw 8kHz) async for message in websocket.iter_text(): audio_bytes = base64.b64decode(payload) await deepgram.send_audio(audio_bytes) # β Speech recognition
Function Calling for Bookings The AI decides when to call APIs based on conversation context: tools = [ "check_availability(date, time, party_size)", "create_booking(date, time, name, phone)", "create_pickup_order(items, pickup_time, name, phone)" ]
Echo Suppression Calculate TTS audio duration to mute incoming audio while agent speaks:
$$\text{speech_duration} = \frac{\text{audio_bytes}}{8000 \text{ bytes/sec}}$$
Then suppress transcripts for speech_duration + 0.5s buffer.
π§ Challenges We Faced
Challenge #1: Timezone Chaos
Problem: Square stores bookings in UTC, but users say "6 PM" meaning PST. Our first version compared UTC dates to PST datesβbookings were invisible!
Example Bug:
- User books "today at 6 PM PST" (Feb 16, 18:00 PST)
- Square stores: 2026-02-17T02:00:00Z (Feb 17, 2 AM UTC)
- Our code checked: "Does 2026-02-17 == 2026-02-16?" β β Not found
Solution: Convert UTC β PST before any date comparisons: utc_dt = datetime.strptime(start_at, "%Y-%m-%dT%H:%M:%SZ") local_dt = utc_dt - timedelta(hours=8) # UTC β PST booking_date = local_dt.strftime("%Y-%m-%d") # Now compare
Challenge #2: The Echo Problem
Problem: Agent says "Your reservation is confirmed" β Twilio plays it β Phone mic picks it up β Deepgram transcribes "your reservation is confirmed" β AI responds again β Infinite loop! π±
Solution: Track when the agent is speaking and suppress transcripts during that window: speech_duration = len(audio_bytes) / 8000.0 agent_speaking_until = now + speech_duration + 0.5 # Ignore all transcripts until agent_speaking_until
Challenge #3: Phone Numbers Get Chopped
Problem: User says "669-290-9767" but pauses mid-number. With 300ms endpointing:
- Transcript 1: "six six nine two nine zero" β AI: "Is 669290 correct?" β
- Transcript 2: "nine seven six seven" β User confused
Solution:
- Validate phone numbers have 10 digits before confirming
- If len(digits) < 10, ask: "And the rest of the number?"
Challenge #4: Finding the Goldilocks Endpointing
Too short (200ms) = cuts users off mid-sentence Too long (800ms) = slow, awkward pauses Just right: 300ms β¨
π Accomplishments We're Proud Of
β Shipped a production MVP in one session β Real Twilio number, real Square API, real SMS confirmations
β Natural conversation flow β Handles "I want wings" β "What size?" β "Medium" β "Sauce?" without getting lost
β Zero booking errors β Double-confirmation before submitting, timezone-safe, phone validation
β 70% cost reduction β $1,000/month (human staff) β $299/month (VoiceHost)
β Solved the hardest problem β Echo suppression without expensive VAD hardware
π What We Learned
- Voice UX β Text UX
Text chatbot: "Here's a list of options: \n- Wings (Small: $16.55) \n- Boneless (Small: $16.95)..."
Voice AI: "Our most popular are wings or bulgogi. Which sounds good?"
Rules we discovered:
- β No markdown (bold sounds like "asterisk asterisk bold")
- β No bullet lists (people can't remember 5 options spoken aloud)
- β One question at a time
- β Max 1-2 sentences per response
- Timezone Handling is Mission-Critical
Every datetime operation needs explicit timezone awareness. We fixed 3 separate timezone bugs before bookings worked reliably.
- Modern AI APIs Are Production-Ready
- Deepgram: 95%+ accuracy on real phone calls, even with background noise
- OpenAI Function Calling: Reliably calls create_booking() at the right moment
- Twilio: Rock-solid WebSocket streaming, handles reconnections gracefully
We went from idea β working phone number in under 8 hours. The infrastructure existsβyou just have to wire it together.
- Endpointing is an Art
The difference between 300ms and 500ms wait time changes the entire conversation feel. Too fast = interrupts; too slow = awkward silences. We A/B tested on real calls to find 300ms optimal.
π What's Next for VoiceHost
Immediate (Next 2 Weeks)
- βοΈ Deploy to Railway/Render for 24/7 uptime (currently runs locally)
- π Analytics dashboard (call volume, peak hours, conversion rate)
- π Multi-language support (Spanish for Latino communities)
Short-term (3 Months)
- π Integrate more POS systems (Toast, Clover, Lightspeed)
- π€ Upselling AI: "Would you like to add fries for $3?"
- π± SMS/WhatsApp ordering (voice beyond phone calls)
Long-term (6-12 Months)
- π’ Expand to adjacent markets:
- Hair salons (300K+ in US)
- Dental offices (200K+)
- Fitness studios (40K+)
- π§ Sentiment analysis (detect angry customers β escalate to human)
- π― Goal: 1,000 paying customers, $300K MRR
Business Model
βββββββββββ¬ββββββββββ¬ββββββββββββββ¬ββββββββββββββββββββββββββ
β Tier β Price β Calls/Month β Target Customer β
βββββββββββΌββββββββββΌββββββββββββββΌββββββββββββββββββββββββββ€
β Starter β $99/mo β 500 β Small restaurants β
βββββββββββΌββββββββββΌββββββββββββββΌββββββββββββββββββββββββββ€
β Pro β $199/mo β 1,500 β Mid-size restaurants β
βββββββββββΌββββββββββΌββββββββββββββΌββββββββββββββββββββββββββ€
β Premium β $299/mo β 3,000 β High-volume restaurants β
βββββββββββ΄ββββββββββ΄ββββββββββββββ΄ββββββββββββββββββββββββββ
ROI Calculation
Starter Plan: $$\text{Monthly Savings} = $1{,}000 - $99 = $901$$ $$\text{Annual ROI} = \frac{$10{,}812}{$1{,}188} \times 100 = 910%$$
Pro Plan: $$\text{Monthly Savings} = $1{,}000 - $199 = $801$$ $$\text{Annual ROI} = \frac{$9{,}612}{$2{,}388} \times 100 = 402%$$
Premium Plan: $$\text{Monthly Savings} = $1{,}000 - $299 = $701$$ $$\text{Annual ROI} = \frac{$8{,}412}{$3{,}588} \times 100 = 234%$$
Cost reduction: 70-90% vs hiring a receptionist
π¬ Conclusion
VoiceHost proves that AI can handle real customer interactions todayβnot in 5 years, not after more research, but right now.
We built a system that:
- Saves restaurants 91% on phone costs
- Never misses a call
- Books reservations with zero errors
- Sounds indistinguishable from a human
The future of restaurant operations isn't hiring more staffβit's giving every restaurant an AI teammate that works 24/7, never calls in sick, and costs less than a part-time employee.
Log in or sign up for Devpost to join the conversation.