Our Story: Building Virgo's Whisper AI
The Inspiration
This project was inspired by the incredible cognitive load that first responders face every day. They are trained professionals, but they are also human. In a high-stress crisis, they have to manage their own stress, communicate with dispatch, remember complex, multi-step protocols, and take mental notes, all at the same time.
I wanted to build a tool not to replace their training, but to act as a hands-free "co-pilot" or an external "short-term memory." The goal was to build an AI that could passively listen, reduce cognitive load, and provide calm, clear, one-step-at-a-time guidance, allowing the officer to focus on the situation in front of them.
The Initial Plan & The "Brick Wall" (The Challenges)
My initial plan for the hackathon was to build on LiquidMetal AI, Vultr, and Cerebras. I immediately hit a wall.
The Credit Card Blocker: The Vultr services required a credit card, which I did not have.
The Tech Blocker: The LiquidMetal hosting required WSL (Windows Subsystem for Linux), which was broken on my system and, after hours of debugging, remained unfixable.
This project is the story of that pivot. How do you build a powerful, low-latency, real-time AI application with zero budget and critical technical blockers? Those blockers forced a complete redesign and led to the "No Credit Card" stack: PythonAnywhere, Firebase, AssemblyAI, Cerebras, and ElevenLabs.
How We Built It: The Technical Journey
The project was built in three main phases, with each new version solving a critical bug from the last.
v1.0: The "Brain" and "Ears"
I started by building the "brain" (a Flask app) on PythonAnywhere and the "memory" (Firestore) on Firebase. The first challenge was just getting them to talk to each other. After debugging file paths, CORS policies, and environment variables, the server was live.
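For the curious, the wiring between the "brain" and the "memory" looks roughly like this. It's a minimal sketch: the service-account filename, route, and collection name are illustrative rather than the exact production code.

```python
# app.py -- minimal sketch of the Flask "brain" talking to the Firestore "memory"
import firebase_admin
from firebase_admin import credentials, firestore
from flask import Flask, jsonify, request
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # the web demo lives on another origin, so CORS has to allow it

# On PythonAnywhere the service-account key sits next to the app; the filename is illustrative.
cred = credentials.Certificate("serviceAccountKey.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

@app.route("/log", methods=["POST"])
def log_comm():
    """Store one piece of radio chatter in a 'comms' collection (collection name is an assumption)."""
    entry = request.get_json()
    db.collection("comms").add({
        "text": entry["text"],
        "timestamp": firestore.SERVER_TIMESTAMP,
    })
    return jsonify({"status": "ok"})
```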
The next step was integrating the "ears" (AssemblyAI). We hit our second major blocker: the webhook system was unreliable. My server logs showed that AssemblyAI was transcribing the audio but wasn't sending the text back to our server.
The solution was to refactor the entire logic. Instead of the server passively listening for a webhook, I made it proactive. The final architecture has the client (our web demo) upload the audio file directly to our Flask server. The server then calls AssemblyAI and waits for the transcription. This "direct control" model was the first major breakthrough and proved far more reliable.
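Here is a sketch of that "direct control" flow, assuming the AssemblyAI Python SDK and a hypothetical /process_audio route. The client POSTs the recorded clip, and the server blocks until the transcript comes back, so no webhook is involved.

```python
import assemblyai as aai
from flask import Flask, jsonify, request

app = Flask(__name__)
aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"  # loaded from the environment in practice

@app.route("/process_audio", methods=["POST"])
def process_audio():
    # The web demo uploads the recorded clip as multipart form data.
    audio = request.files["audio"]
    path = "/tmp/latest_clip.wav"  # illustrative temp path
    audio.save(path)

    # transcribe() polls AssemblyAI until the job finishes, so no callback URL is needed.
    transcript = aai.Transcriber().transcribe(path)
    if transcript.status == aai.TranscriptStatus.error:
        return jsonify({"error": transcript.error}), 502

    return jsonify({"text": transcript.text})
```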
v2.0: The "Intelligence" and "Voice"
With a stable transcription pipeline, I added the "intelligence" (Cerebras) and the "voice" (ElevenLabs), along with our first features:
- Stress Detection: The AI would analyze any passive speech for stress and respond with "Deep breath. Focus."
- Summarization: The AI would listen for "Virgo, summarize comms," query Firebase for all recent chatter, and send it to Cerebras to be summarized (sketched below).
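The summarization path looks roughly like this, assuming the Cerebras Cloud SDK's OpenAI-style chat interface; the model name and prompt wording are placeholders:

```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

def summarize_comms(recent_messages: list[str]) -> str:
    """Condense recent radio chatter into a calm, two-sentence brief."""
    response = client.chat.completions.create(
        model="llama3.1-8b",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are Virgo, a calm co-pilot for first responders. "
                                          "Summarize the comms log in two short sentences."},
            {"role": "user", "content": "\n".join(recent_messages)},
        ],
    )
    # The summary text is then handed to ElevenLabs to be spoken aloud (not shown here).
    return response.choices[0].message.content
```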
This worked, but we quickly realized a "one-and-done" answer isn't a co-pilot. It's just a glorified search.
v3.0: The Breakthrough: Stateful Conversation
This was the true challenge and the core of the project. How do you have a stateful, multi-turn conversation on a stateless server (like PythonAnywhere's free tier)?
The solution was to use Firebase as our "short-term memory."
We created two new collections: protocols (the AI's "expert knowledge") and conversations (its "active memory").
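Roughly, the documents look like this (the field names are simplified and may differ from the deployed schema):

```python
# A document in the "protocols" collection: the AI's expert knowledge for one scenario.
shots_fired_protocol = {
    "keywords": ["shots fired", "i am hurt", "officer down"],
    "steps": [
        "Confirm where the officer is hurt.",
        "Instruct them to apply direct pressure to the wound.",
        "Confirm backup and EMS have been requested.",
    ],
}

# A document in the "conversations" collection: the active short-term memory for one session.
active_conversation = {
    "protocol_id": "shots_fired",
    "state": "active",
    "history": [
        {"role": "assistant", "content": "Where are you hurt?"},
        {"role": "user", "content": "On my leg!"},
    ],
}
```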
When a user triggers a protocol (e.g., "Shots fired, I am hurt!"), the server now (see the sketch after this list):
1. Checks protocols for the keywords.
2. Creates a new document in the conversations collection, saving the protocol_id and setting the state to "active."
3. Calls Cerebras with the protocol steps and the user's transcript.
4. Generates the first question (e.g., "Where are you hurt?") and sends it.
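A condensed sketch of that trigger path, reusing the db handle from the earlier snippet and a hypothetical ask_cerebras() helper that wraps the chat call shown above; the matching logic and document ids are simplified:

```python
def handle_transcript(unit_id: str, transcript: str) -> str | None:
    """If the transcript matches a protocol, open a conversation and return the first question."""
    text = transcript.lower()

    # Step 1: check every protocol document for a keyword hit.
    for doc in db.collection("protocols").stream():
        protocol = doc.to_dict()
        if any(keyword in text for keyword in protocol["keywords"]):
            # Step 2: create the "active memory" for this session, keyed by the unit's id.
            db.collection("conversations").document(unit_id).set({
                "protocol_id": doc.id,
                "state": "active",
                "history": [],
            })
            # Steps 3-4: ask Cerebras for the first question, given the steps and what was said.
            return ask_cerebras(protocol["steps"], history=[], user_text=transcript)

    return None  # no protocol matched; fall back to passive listening
```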
When the user replies ("On my leg!"), the server (second sketch below):
1. Checks conversations and finds the "active" session.
2. Retrieves the chat history and the protocol.
3. Calls Cerebras with the full context ("We are in a 'Shots Fired' protocol. We just asked 'Where are you hurt?' The user replied, 'On my leg.' What is the next step?").
4. Gets the next instruction back from the AI ("Okay, on your leg. Apply pressure.").
5. Updates the conversation document with this new history.
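And the follow-up turn, again simplified and assuming the same db handle and hypothetical ask_cerebras() helper:

```python
def handle_reply(unit_id: str, transcript: str) -> str | None:
    """Continue an active protocol conversation with the officer's latest reply."""
    convo_ref = db.collection("conversations").document(unit_id)
    convo = convo_ref.get()
    if not convo.exists or convo.get("state") != "active":
        return None  # nothing in progress; treat this as ordinary chatter

    data = convo.to_dict()
    protocol = db.collection("protocols").document(data["protocol_id"]).get().to_dict()

    # Rebuild the full context: protocol steps plus everything said so far.
    history = data["history"] + [{"role": "user", "content": transcript}]
    next_step = ask_cerebras(protocol["steps"], history=history, user_text=transcript)

    # Persist the new turn so the *next* stateless request starts from the same memory.
    history.append({"role": "assistant", "content": next_step})
    convo_ref.update({"history": history})
    return next_step
```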
This "state machine" architecture allows for a robust, multi-turn conversation on a completely stateless server.
What I Learned
The final 10% of the project was all about "prompt engineering"—the most significant challenge of all.
- The AI got stuck in repetitive loops, so we had to update the prompt to teach it to rephrase questions and acknowledge the user's answers.
- AssemblyAI confused "log" and "look," so we changed the command to the more distinct "take a note."
- The AI didn't know when to "hang up," so we added a master "over and out" command (see the prompt sketch below).
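Those lessons ended up encoded directly in the system prompt. This is an approximation of its final shape, not the exact wording we shipped:

```python
VIRGO_SYSTEM_PROMPT = """You are Virgo, a calm, hands-free co-pilot for a first responder.
Rules:
- Give exactly ONE short instruction or question per reply.
- Never repeat a question word-for-word; if the user seems stuck, rephrase it.
- Always acknowledge the user's last answer before giving the next step ("Okay, on your leg...").
- If the user says "take a note", store the note and confirm it; do not start a protocol.
- If the user says "over and out", close the conversation and stop responding.
"""
```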
I learned that a good "brain" (the AI model) is useless without a good "nervous system" (the architecture). The real challenge wasn't just calling an AI; it was building the state management, intent routing, and memory systems to make those AI calls intelligent and reliable.
Built With
- ai
- assemblyai
- cerebras
- cerebras-cloud-sdk
- css3
- elevenlabs
- es6
- firebase
- firebase-admin
- firestore
- flask-cors
- javascript
- python-3
- flask
- html5
- python-dotenv
- pythonanywhere
- transcription
- voice