Inspiration

Y Combinator and other investors gave $20 million to a conversational AI startup called Vapi. Vapi started with a lightning-quick data-processing pipeline that enables conversational AI over the telephone, then built a diverse ecosystem of models and configurations for its users. Today Vapi is worth over $120 million. What's so special about conversational AI over telephony? First, thousands of jobs require human-level speech skills, and there may not be enough people to fill those roles; a single missed call can mean lost income. The technology could also serve educational or humanitarian purposes, acting as a teacher or a friend.

My goal throughout this project was to build an MVP of conversational AI over the phone while matching Vapi's speed.

What it does

This project is a website that lets you build phone agents. You can assign each agent a phone number, an ElevenLabs voice ID, and a prompt.
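An agent here is essentially a small configuration record. A minimal sketch of what that might look like (the class and field names are my own illustration, not the project's actual schema):

```python
from dataclasses import dataclass

@dataclass
class PhoneAgent:
    """Configuration for one phone agent (illustrative field names)."""
    phone_number: str  # Twilio number assigned to this agent
    voice_id: str      # ElevenLabs voice ID used for text-to-speech
    prompt: str        # system prompt defining the agent's persona

agent = PhoneAgent(
    phone_number="+15550001234",
    voice_id="example-elevenlabs-voice-id",
    prompt="You are a friendly receptionist for a dental clinic.",
)
print(agent.voice_id)
```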

How I built it

I went with PostgreSQL to store auth data and conversation history. The most interesting part was the speech-to-speech pipeline. It starts with Twilio, which forwards incoming phone calls via WebSocket to the FastAPI server. Twilio streams the caller's audio in real time; the server detects when they stop speaking, then sends the audio to Gemini 2.0 Flash, which handles both speech-to-text and response generation in a single streaming call. As Gemini streams text back, we pipe it directly into ElevenLabs TTS, which outputs audio in Twilio's native format. Each audio chunk is sent back through the WebSocket immediately, so the caller hears the AI respond in real time. End-to-end latency is about 2-3 seconds.

Challenges I ran into

Twilio will not let you use SMS unless you have a business with an actual business number that appears on your taxes. This was a huge headache, so for the presentation I will be using WhatsApp instead. Research was also a big part of this project. There were countless models I could have used, and I'm sure many would wonder why I am not using the Gemini 3.0 Flash model, which is supposedly faster. It turns out that Gemini 2.0 Flash is almost 2x faster than 3.0 at processing audio.

Latency was a big problem, and I had a couple of ideas to reduce it. The most effective was to stream one sentence at a time from Gemini and send it to ElevenLabs. While ElevenLabs is speaking, the rest of the text is generated and sent to ElevenLabs again without any pause.
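That sentence-streaming trick can be sketched as a small generator: accumulate the LLM's text chunks and yield each sentence as soon as it is complete, so TTS can begin speaking while generation continues. This is my own minimal illustration, not the project's implementation:

```python
import re
from typing import Iterable, Iterator

# A sentence is "anything up to . ! or ?" followed by whitespace
SENTENCE_END = re.compile(r"(.+?[.!?])\s+", re.S)

def stream_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token
    stream, so each one can be sent to TTS while the LLM keeps
    generating the rest of the reply."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        while (m := SENTENCE_END.match(pending)):
            yield m.group(1)
            pending = pending[m.end():]
    if pending.strip():  # flush whatever remains at end of stream
        yield pending.strip()

# Simulated Gemini stream arriving in small pieces
tokens = ["Hel", "lo there! How ", "can I help ", "you today? Goodb", "ye."]
print(list(stream_sentences(tokens)))
# → ['Hello there!', 'How can I help you today?', 'Goodbye.']
```

Each yielded sentence would be handed straight to the ElevenLabs streaming endpoint, so playback of sentence one overlaps with generation of sentence two.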

Accomplishments that I'm proud of

Honestly, I am proud that it works at all; this project was a nightmare, but I've been wanting to build it for a while. If I hadn't figured out the LLM pipeline, none of it would have been possible. I genuinely think this pipeline is faster than Gemini Live, and it has tool-calling capabilities as well.

What I learned

I learned a lot about Twilio, the Gemini API, and ElevenLabs.

What's next for ArchAgents

  • Persistent memory, potentially with GraphRAG
  • Hosted Website
  • SMS
  • Better user experience and ease of use
  • Tool calling support for Agents
