Inspiration
We all know the frustration of calling a company, navigating endless automated menus, and then repeating our information several times once a human finally answers. We were inspired by the idea of reversing this paradigm: what if AI could do that tedious work for us? When we discovered the cross-modal capabilities of Amazon Nova 2 Sonic, which combine voice processing with the generation of text metadata representing DTMF tones (the beeps of a telephone keypad), we realized we could build an agent that not only talks to humans but also autonomously navigates, or "hacks", other companies' automated phone systems (IVRs). Our goal was to solve the "last mile problem" in outbound call automation.
What it does
Our project is a fully autonomous, real-time outbound call agent. The system can initiate a call to a provider, listen to and navigate their automated menus (e.g., "press 1 for support"), interpret the audio, and generate the corresponding DTMF tones. Once a human operator answers, the agent interacts with a fluent, expressive, and natural voice to complete complex tasks, such as scheduling an appointment or confirming a logistics status. Furthermore, it features multilingual voices, enabling code-switching; if the human operator suddenly switches from English to Spanish, our agent instantly changes language while maintaining the same identity and tone of voice.
How we built it
We implemented a two-part, full-stack architecture. The frontend is built with Next.js and provides the user interface. The backend is written in Python and uses the open-source LiveKit Agents framework to orchestrate communication. LiveKit acts as our WebRTC server, handling low-latency audio routing, voice activity detection (VAD), and noise suppression. The agent's intelligent core is Amazon Nova 2 Sonic, hosted on Amazon Bedrock and integrated through the LiveKit AWS plugin (livekit-plugins-aws[realtime]). We leveraged advanced features of the model such as asynchronous tool calling, which lets the agent query databases in the background without interrupting the conversation, so the exchange with the human stays fluid.
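The idea behind asynchronous tool calling can be illustrated with plain asyncio. This is a minimal sketch rather than our production code: `lookup_appointment_slots`, the `clinic-42` identifier, and the filler phrases are hypothetical stand-ins, and the real agent delegates this scheduling to the model and the LiveKit Agents framework.

```python
import asyncio


async def lookup_appointment_slots(provider_id: str) -> list[str]:
    """Hypothetical tool: stands in for a slow database query."""
    await asyncio.sleep(0.05)  # simulated query latency
    return [f"{provider_id}: Tue 10:00", f"{provider_id}: Wed 14:30"]


async def converse(transcript: list[str]) -> list[str]:
    """Keep the conversation going while the tool runs in the background."""
    # Start the tool call without awaiting it, so the turn loop is not blocked.
    pending = asyncio.create_task(lookup_appointment_slots("clinic-42"))

    # The agent keeps talking while the query is in flight.
    for filler in ("Sure, let me check that.", "One moment, please."):
        transcript.append(filler)
        await asyncio.sleep(0.01)

    # Fold the result into the conversation once it is ready.
    slots = await pending
    transcript.append("Available slots: " + ", ".join(slots))
    return transcript


def run_demo() -> list[str]:
    """Run the sketch end to end and return the resulting transcript."""
    return asyncio.run(converse([]))


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

The key design choice is `asyncio.create_task`: the query starts immediately, but the conversation loop only waits for it at the point where the result is actually needed.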
Challenges we ran into
Managing real-time audio infrastructure and bidirectional streaming is notoriously complex because of latency, codec optimization, and routing issues. We overcame this by integrating LiveKit, which eliminated the need to build WebRTC channels from scratch. Another major challenge was getting the AI to interact with outdated robotic telephone systems. We solved this with Nova 2 Sonic's cross-modal input/output feature, which allows the model to emit metadata that our system interprets as keypad (DTMF) input. Finally, making the agent feel natural without interrupting the other party required calibration: we used Nova 2 Sonic's turn-taking controllability to adjust how long a pause the agent tolerates (up to 2 seconds), giving the human enough time to hesitate or finish speaking without being cut off.
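To make the DTMF step concrete, here is a rough sketch of turning model metadata into keypad tones. The `[dtmf:...]` tag format is purely an assumption for illustration, since the actual event schema emitted by the model is not reproduced in this write-up. The frequency pairs are the standard DTMF keypad assignments, and `extract_keys` and `synthesize_key` are hypothetical helper names.

```python
import math
import re

# Standard DTMF keypad: each key is the sum of one row tone and one column tone (Hz).
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")

DTMF_FREQS = {
    key: (ROW_HZ[r], COL_HZ[c])
    for r, row in enumerate(KEYPAD)
    for c, key in enumerate(row)
}

# ASSUMPTION: a made-up tag format; the real metadata schema may look different.
DTMF_TAG = re.compile(r"\[dtmf:([0-9A-D*#]+)\]")


def extract_keys(model_text: str) -> str:
    """Collect every keypad symbol found in the (assumed) metadata tags."""
    return "".join(DTMF_TAG.findall(model_text))


def synthesize_key(key: str, duration_s: float = 0.1,
                   sample_rate: int = 8000) -> list[float]:
    """Return float samples in [-1, 1] for a single key press."""
    low, high = DTMF_FREQS[key]
    n_samples = int(duration_s * sample_rate)
    return [
        0.5 * math.sin(2 * math.pi * low * t / sample_rate)
        + 0.5 * math.sin(2 * math.pi * high * t / sample_rate)
        for t in range(n_samples)
    ]


if __name__ == "__main__":
    keys = extract_keys("For support I will press one. [dtmf:1]")
    for key in keys:
        samples = synthesize_key(key)
        print(key, DTMF_FREQS[key], len(samples))
```

In a telephony deployment the provider usually offers a way to send digits directly, so the synthesis step would only matter when the tones have to be played in-band as audio.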
Accomplishments that we're proud of
We are incredibly proud to have built a conversational flow that overcomes the limitations of traditional voice systems. Legacy bots use "cascaded" pipelines (Speech-to-Text -> LLM -> Text-to-Speech) in which each stage adds a bottleneck and extra latency. By using Nova 2 Sonic's unified speech-to-speech architecture, we achieved latency low enough to feel imperceptible in conversation and a natural flow of speech that preserves prosody and pitch. Watching our agent navigate an IVR, negotiate with a human, and invoke a background tool, all in a single session, was an incredible milestone.
What we learned
We learned that latency is the biggest enemy of conversational AI and that unified speech-to-speech foundation models represent a paradigm shift for the industry. We also learned how to manage WebRTC infrastructure with platforms like LiveKit, delegating the handling of audio sessions so we could focus on the AI's logic and the user experience. Finally, we learned how to get the model to use cross-modal context (simultaneous voice and text) to enrich interactions.
What's next for AI Telephone Agent
The next crucial step is to move the agent beyond a web interface by connecting it directly to the public switched telephone network (PSTN). We plan to leverage Nova 2 Sonic's native integrations with leading telecom platforms like Twilio, Vonage, and Amazon Connect so the agent can programmatically dial real phone numbers. We also plan to add a human-in-the-loop transfer feature: if a call becomes too complex, the agent will transfer it to a company employee and pass along all the context gathered so far to ensure a quick and seamless resolution.
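As a sketch of what the planned handoff could carry, the snippet below bundles the gathered context into a JSON payload. Nothing here is built yet: `HandoffContext`, its fields, and `build_handoff_payload` are hypothetical names, and the real payload would depend on the telephony platform we end up using.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffContext:
    """Hypothetical summary passed to the employee who takes over the call."""
    call_id: str
    task: str
    language: str
    ivr_path: list[str] = field(default_factory=list)    # keys pressed so far
    facts: dict[str, str] = field(default_factory=dict)  # details already gathered
    reason: str = "unspecified"                          # why the agent escalated


def build_handoff_payload(ctx: HandoffContext) -> str:
    """Serialize the context so it can travel with the transferred call."""
    return json.dumps(asdict(ctx), ensure_ascii=False, sort_keys=True)


if __name__ == "__main__":
    ctx = HandoffContext(
        call_id="demo-001",
        task="schedule an appointment",
        language="es",
        ivr_path=["1", "3"],
        facts={"preferred_day": "Tuesday"},
        reason="operator asked for account verification",
    )
    print(build_handoff_payload(ctx))
```

Keeping the payload as a flat, serializable record means the employee's tooling can display it without replaying the whole call.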