Inspiration

I relocated to Spain and quickly ran into a real problem: I needed to make important local calls (schools, clinics, services), but I don’t speak Spanish yet. Many of these interactions still require a phone call, not an app or website. I built Habla to remove that immediate language barrier.

What it does

Habla has two modes:

  • Live Call Mode: real-time translation during a 1:1 phone call
  • Agent Mode: an AI phone agent that calls on my behalf and handles the conversation

It also includes:

  • Live transcription with running transcript updates
  • Critical-info detection and confirmation for things like names, dates, phone numbers, addresses, and amounts (see the schema sketch after this list)
  • A verified-facts summary during and after calls
  • Context memory that remembers caller preferences and past context for future calls
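
To make that concrete, here is a minimal sketch of how critical-info items and the verified-facts summary could be modeled. The Pydantic classes and field names are illustrative, not Habla's actual schema:

```python
# Illustrative only: a minimal schema for critical-info items.
# Class and field names are hypothetical, not Habla's data model.
from enum import Enum
from pydantic import BaseModel


class CriticalKind(str, Enum):
    NAME = "name"
    DATE = "date"
    PHONE = "phone"
    ADDRESS = "address"
    AMOUNT = "amount"


class CriticalItem(BaseModel):
    kind: CriticalKind
    heard_text: str          # what the model transcribed
    translated_text: str     # what was relayed to the user
    confirmed: bool = False  # flipped once the user explicitly confirms


class VerifiedFactsSummary(BaseModel):
    call_id: str
    items: list[CriticalItem]

    @property
    def unconfirmed(self) -> list[CriticalItem]:
        # Items that still need a read-back confirmation
        return [i for i in self.items if not i.confirmed]
```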

How we built it

I built it as a solo founder across three parts:

  • habla-core (main backend): FastAPI + Twilio Voice/Media Streams + Amazon Nova 2 Sonic
  • habla-ios (client): SwiftUI app with WebSocket audio streaming, call UX, history, summaries, and memory
  • habla-accounts (microservice): AWS Lambda + API Gateway + DynamoDB for secure per-device caller-ID ownership (a sketch of the ownership check follows)
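
As a rough illustration of the caller-ID ownership idea, a conditional DynamoDB write can make the claim atomic. The table name, key schema, and helper below are assumptions, not the actual habla-accounts code:

```python
# Hypothetical sketch of device-scoped caller-ID ownership in DynamoDB.
# Table name, key schema, and attribute names are assumptions.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("habla-caller-ids")


def claim_caller_id(caller_id: str, device_id: str) -> bool:
    """Atomically claim a caller ID for one device; False if already owned."""
    try:
        table.put_item(
            Item={"caller_id": caller_id, "device_id": device_id},
            # Condition fails if any device already owns this number
            ConditionExpression="attribute_not_exists(caller_id)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```

The conditional write keeps check-and-claim atomic, so two devices can't race for the same number.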

For Live Call Mode, the backend runs two streaming model sessions, one per translation direction, and bridges audio between the iOS client and the PSTN leg; a simplified sketch of that bridge follows. For Agent Mode, I built a dedicated call manager with real-time status, transcript events, instruction injection, critical-info tracking, and call-lifecycle handling.
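
Here is a simplified sketch of the Twilio Media Streams side of that bridge, assuming a FastAPI backend. The endpoint path and client registry are hypothetical, and the translation step is elided; Twilio really does deliver base64 mu-law audio as JSON "media" events over the WebSocket:

```python
# Simplified shape of a Twilio Media Streams bridge. Endpoint path and
# the ios_clients registry are hypothetical; translation is elided.
import base64
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
ios_clients: dict[str, WebSocket] = {}  # call_sid -> connected iOS socket


@app.websocket("/twilio/media/{call_sid}")
async def twilio_media(ws: WebSocket, call_sid: str):
    await ws.accept()
    stream_sid = None  # needed to send translated audio back to Twilio
    try:
        while True:
            msg = await ws.receive_json()
            if msg["event"] == "start":
                stream_sid = msg["start"]["streamSid"]
            elif msg["event"] == "media":
                # 8 kHz mu-law audio from the PSTN caller
                chunk = base64.b64decode(msg["media"]["payload"])
                ios = ios_clients.get(call_sid)
                if ios is not None:
                    # Real system: run this through the streaming
                    # translation session before forwarding
                    await ios.send_bytes(chunk)
            elif msg["event"] == "stop":
                break
    except WebSocketDisconnect:
        pass
```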

Challenges we ran into

  • 1:1 latency is still high. It is usable in practice, but model response time is the main bottleneck
  • Agent call endings were tricky. Early versions could linger or fail to close naturally
  • Telephony/audio bridging required careful handling of codecs, sampling rates, and streaming reliability (see the conversion sketch after this list)
  • Balancing speed with trust and safety features (critical confirmations plus verified summaries) added complexity
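
For context on the codec and sampling-rate work, the format glue looks roughly like this. Twilio delivers 8 kHz mu-law; that streaming speech models expect 16 kHz linear PCM is an assumption here:

```python
# Minimal format-conversion sketch: Twilio's 8 kHz mu-law <-> 16 kHz PCM.
# Uses the stdlib audioop module (deprecated in Python 3.11, removed in
# 3.13; pin audioop-lts or vendor an equivalent on newer versions).
import audioop


def ulaw8k_to_pcm16k(ulaw: bytes, state=None):
    """Decode mu-law to 16-bit PCM, then resample 8 kHz -> 16 kHz."""
    pcm8k = audioop.ulaw2lin(ulaw, 2)  # 2 bytes per sample (16-bit)
    pcm16k, state = audioop.ratecv(pcm8k, 2, 1, 8000, 16000, state)
    return pcm16k, state


def pcm16k_to_ulaw8k(pcm: bytes, state=None):
    """Downsample 16 kHz -> 8 kHz and re-encode as mu-law for Twilio."""
    pcm8k, state = audioop.ratecv(pcm, 2, 1, 16000, 8000, state)
    return audioop.lin2ulaw(pcm8k, 2), state
```

Threading the ratecv state through consecutive chunks matters: resampling each frame independently introduces audible clicks at chunk boundaries.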

Accomplishments that we're proud of

  • As a solo founder, I shipped a full working product with both Live Call Mode and Agent Mode
  • Agent Mode feels smooth in real usage
  • Live mode has noticeable latency, but it is still practical for real conversations
  • I implemented high-value trust features: transcription, critical info checks, verified facts, and context memory
  • I tested it myself end-to-end in realistic scenarios

What we learned

  • In real-time voice AI, system engineering matters as much as prompting
  • Model response time dominates user experience in live translation
  • Prompting alone is not enough for stable phone agents; runtime guardrails and explicit end-call logic are necessary (a minimal example follows this list)
  • For sensitive calls, users need structured outputs (transcript + verified facts), not only raw audio
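
As a minimal sketch of what explicit end-call logic can look like, assuming Twilio's REST API is used to close the call leg. The farewell markers, timeout, and helper name are illustrative:

```python
# Illustrative runtime guardrail: close the call when the agent signals
# completion or a hard timeout hits. Markers and threshold are made up.
import time
from twilio.rest import Client

twilio = Client()  # reads TWILIO_ACCOUNT_SID / TWILIO_AUTH_TOKEN from env

MAX_CALL_SECONDS = 600
FAREWELL_MARKERS = ("goodbye", "hasta luego", "have a nice day")


def maybe_end_call(call_sid: str, started_at: float, agent_utterance: str) -> bool:
    """End the call if the agent said goodbye or the hard timeout passed."""
    timed_out = time.time() - started_at > MAX_CALL_SECONDS
    said_farewell = any(m in agent_utterance.lower() for m in FAREWELL_MARKERS)
    if timed_out or said_farewell:
        # Forcibly complete the live call leg; never rely on the model
        # to hang up on its own
        twilio.calls(call_sid).update(status="completed")
        return True
    return False
```

The point is the hard fallback: the call closes even if the model never produces a clean goodbye.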

What's next for Habla

  • Reduce live-call latency further (especially model-response bottlenecks)
  • Improve agent completion reliability and closure behavior
  • Expand context memory so follow-up calls feel more personalized and efficient
  • Broaden language support and harden production reliability
  • Client-side: Contacts integration and iCloud data sync

To try this app, please use the following link: https://testflight.apple.com/join/PkUSuqZm
