Inspiration
I relocated to Spain and quickly ran into a real problem: I needed to make important local calls (schools, clinics, services), but I don’t speak Spanish yet. Many of these interactions still require a phone call, not an app or website. I built Habla to remove that immediate language barrier.
What it does
Habla has two modes:
- Live Call Mode: real-time translation during a 1:1 phone call
- Agent Mode: an AI phone agent that calls on my behalf and handles the conversation
It also includes:
- Live transcription and transcript updates
- Critical info detection and confirmation for details like names, dates, phone numbers, addresses, and amounts (sketched after this list)
- Verified-facts summaries, both during and after calls
- Context memory (remembers caller preferences and past context for future calls)
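To make the critical-info flow concrete, here is a simplified sketch of how confirmed items can be tracked so that only read-back-confirmed details reach the verified-facts summary. The types and field names are illustrative, not Habla's exact internals.

```python
# Illustrative model of critical-info tracking (names are assumptions,
# not the production code). Each detected item must be explicitly
# confirmed before it enters the verified-facts summary.
from dataclasses import dataclass, field
from enum import Enum


class InfoKind(Enum):
    NAME = "name"
    DATE = "date"
    PHONE = "phone"
    ADDRESS = "address"
    AMOUNT = "amount"


@dataclass
class CriticalItem:
    kind: InfoKind
    raw_text: str            # what was heard in the transcript
    confirmed: bool = False  # flipped after an explicit read-back


@dataclass
class CallFacts:
    items: list[CriticalItem] = field(default_factory=list)

    def verified(self) -> list[CriticalItem]:
        """Only confirmed items make it into the post-call summary."""
        return [i for i in self.items if i.confirmed]
```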
How we built it
I built it as a solo founder across three parts:
- habla-core (main backend): FastAPI + Twilio Voice/Media Streams + Amazon Nova 2 Sonic
- habla-ios (client): SwiftUI app with WebSocket audio streaming, call UX, history, summaries, and memory
- habla-accounts (microservice): AWS Lambda + API Gateway + DynamoDB for secure per-device caller-ID ownership (a handler sketch follows this list)
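For habla-accounts, the core idea is that a phone number can be claimed by exactly one device. Below is a simplified Lambda handler sketch; the table name, payload shape, and route are illustrative assumptions, not the exact production code.

```python
# Hypothetical Lambda handler behind API Gateway (table and field names
# are assumptions). A DynamoDB conditional write ensures a caller ID is
# owned by at most one device.
import json

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("CallerIdOwnership")  # assumed name


def lambda_handler(event, context):
    body = json.loads(event["body"])
    phone, device = body["phone_number"], body["device_id"]
    try:
        # Succeed only if the number is unclaimed or already owned
        # by this same device.
        table.put_item(
            Item={"phone_number": phone, "device_id": device},
            ConditionExpression=(
                "attribute_not_exists(phone_number) OR device_id = :d"
            ),
            ExpressionAttributeValues={":d": device},
        )
        return {"statusCode": 200, "body": json.dumps({"owned": True})}
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return {"statusCode": 409, "body": json.dumps({"owned": False})}
        raise
```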
For Live Call Mode, the backend runs dual streaming sessions (both directions) and bridges audio between iOS and PSTN. For Agent Mode, I built a dedicated call manager with real-time status, transcript events, instruction injection, critical-info tracking, and call lifecycle handling.
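To show the shape of the Twilio leg of that bridge, here is a stripped-down FastAPI sketch. The endpoint name and the translation hook are illustrative; the real pipeline also manages the iOS WebSocket and the dual translation sessions.

```python
# Simplified Twilio Media Streams leg of the audio bridge (endpoint name
# and the translation hook are illustrative assumptions).
import base64
import json

from fastapi import FastAPI, WebSocket

app = FastAPI()


async def forward_to_translation(chunk: bytes, stream_sid: str | None) -> None:
    """Stub: the real pipeline feeds this audio into a streaming
    speech-to-speech session and relays translated audio to the iOS leg."""


async def send_audio_to_twilio(ws: WebSocket, stream_sid: str, ulaw: bytes) -> None:
    # Translated audio goes back to the PSTN caller as a media event.
    await ws.send_text(json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(ulaw).decode()},
    }))


@app.websocket("/twilio-stream")
async def twilio_stream(ws: WebSocket) -> None:
    await ws.accept()
    stream_sid = None
    async for raw in ws.iter_text():
        msg = json.loads(raw)
        if msg["event"] == "start":
            stream_sid = msg["start"]["streamSid"]
        elif msg["event"] == "media":
            # Twilio sends 8 kHz mu-law audio, base64-encoded
            await forward_to_translation(
                base64.b64decode(msg["media"]["payload"]), stream_sid
            )
        elif msg["event"] == "stop":
            break
```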
Challenges we ran into
- Latency in 1:1 Live Call Mode is still high. It is usable in practice, but model response time remains the main bottleneck
- Agent call endings were tricky. Early versions could linger or fail to close naturally
- Telephony/audio bridging required careful handling of codecs, sampling rates, and streaming reliability (see the conversion sketch after this list)
- Balancing speed with trust/safety features (critical confirmations + verified summaries) added complexity
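On the codec point specifically: Twilio Media Streams carry 8 kHz mu-law audio, while speech models typically expect 16 kHz linear PCM, so every frame needs conversion in both directions. A minimal sketch using the standard-library audioop module (removed in Python 3.13, so this assumes an older runtime or the audioop-lts backport):

```python
# Converting Twilio's 8 kHz mu-law frames to 16 kHz 16-bit linear PCM
# and back. Keeping the ratecv state across chunks avoids audible seams.
import audioop


def ulaw8k_to_pcm16k(ulaw: bytes, state=None):
    pcm8k = audioop.ulaw2lin(ulaw, 2)  # mu-law -> 16-bit linear PCM
    return audioop.ratecv(pcm8k, 2, 1, 8000, 16000, state)


def pcm16k_to_ulaw8k(pcm16k: bytes, state=None):
    pcm8k, state = audioop.ratecv(pcm16k, 2, 1, 16000, 8000, state)
    return audioop.lin2ulaw(pcm8k, 2), state
```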
Accomplishments that we're proud of
- As a solo founder, I shipped a full working product with both Live Translation and Agent Mode
- Agent Mode feels smooth in real usage
- Live mode has noticeable latency, but it is still practical for real conversations
- I implemented high-value trust features: transcription, critical info checks, verified facts, and context memory
- I tested it myself end-to-end in realistic scenarios
What we learned
- In real-time voice AI, system engineering matters as much as prompting
- Model response time dominates user experience in live translation
- Prompting alone is not enough for stable phone agents; runtime guardrails and explicit end-call logic are necessary (a minimal sketch follows this list)
- For sensitive calls, users need structured outputs (transcript + verified facts), not only raw audio
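As one example of such a guardrail, the runtime rather than the model can own the hang-up decision. A minimal sketch; the farewell phrases and timeouts are illustrative assumptions:

```python
# Illustrative end-call guardrail (phrases and thresholds are assumptions).
# The agent is prompted to say goodbye, but the runtime makes the final call.
import time

FAREWELLS = ("goodbye", "adiós", "hasta luego")
SILENCE_TIMEOUT_S = 10.0


class EndCallGuard:
    def __init__(self) -> None:
        self.last_speech_at = time.monotonic()
        self.farewell_seen = False

    def on_transcript(self, text: str) -> None:
        self.last_speech_at = time.monotonic()
        if any(p in text.lower() for p in FAREWELLS):
            self.farewell_seen = True

    def should_hang_up(self) -> bool:
        silent_for = time.monotonic() - self.last_speech_at
        # Hang up shortly after a farewell, or after prolonged
        # silence regardless (fail-safe so calls never linger).
        return (self.farewell_seen and silent_for > 2.0) or (
            silent_for > SILENCE_TIMEOUT_S
        )
```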
What's next for Habla
- Reduce live-call latency further (especially model-response bottlenecks)
- Improve agent completion reliability and closure behavior
- Expand context memory so follow-up calls feel more personalized and efficient
- Broaden language/support coverage and harden production reliability
- Client-side: sync with contacts, sync data with iCloud
To try this app, please use the following link: https://testflight.apple.com/join/PkUSuqZm
Built With
- amazon-dynamodb
- amazon-api-gateway
- amazon-lambda
- amazon-web-services
- python
- redux
- swift
- swiftui
- twilio