CallShield: Passive Voice Authentication
Inspiration
We’re entering an era where you can no longer trust what you hear. In 2024, banking fraud using AI voice clones increased by over 200%, allowing scammers to impersonate customers perfectly and drain accounts in seconds. Witnessing how these clones can bypass biometric security and deceive even close family members, we realized the traditional "trust but verify" model is broken. We were inspired by this alarming rise in "vishing" to create a solution that evolves as fast as the attackers do. Moving beyond static passwords to dynamic, real-time AI analysis, we built CallShield to be a firewall for your phone calls.
What it does
CallShield is a real-time security platform that protects voice communications against deepfakes and social engineering. It acts as a secure layer between the caller and the agent/user, performing three critical checks simultaneously:
- Biometric Verification: It continuously verifies the speaker's identity against their enrolled voice fingerprint, ensuring the person speaking is who they claim to be.
- Deepfake Detection: It analyzes audio artifacts to detect synthetic or AI-generated speech, flagging potential voice clones instantly.
- Social Engineering Analysis: It listens to the context of the conversation to detect malicious intent, such as high-pressure tactics, requests for OTPs, or threats, alerting the user before they make a mistake.
How we built it
We built CallShield by combining modern web technologies with a multi-stage AI pipeline:
- Frontend: Built with Next.js 14, Tailwind CSS, and Shadcn UI for a responsive, mission-control style dashboard. We used WebSockets to stream audio and risk data in real-time, ensuring the UI updates instantly as the call progresses.
- Backend: A high-performance FastAPI server handles the audio streams. We implemented a custom session manager to handle audio buffering and windowing.
- The AI Pipeline:
  - Voice Biometrics: We utilized SpeechBrain's ECAPA-TDNN model to generate and compare speaker embeddings on the fly.
  - Deepfake Detection: We integrated with Aurigin.AI to provide state-of-the-art detection of synthetic speech patterns.
  - Social Engineering Detection: We leveraged Google Gemini and Fish Audio speech-to-text to analyze conversation transcripts for social engineering patterns, adding a semantic layer of security.
- Simulation: To test our system, we built a realistic call simulator using Fish Audio for high-quality text-to-speech, allowing us to simulate attacks and verify our defenses.
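The buffering, windowing, and embedding comparison described above can be sketched roughly as follows. This is a minimal illustration, not our actual session manager: the `WindowBuffer` class, the 16 kHz sample rate, the 2-second window with a 1-second hop, and the cosine-similarity comparison are all assumptions, and the embeddings are taken as plain NumPy vectors rather than produced by a real model call.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate in Hz (hypothetical)


class WindowBuffer:
    """Accumulates streamed audio and emits fixed-size, overlapping windows."""

    def __init__(self, window_s: float = 2.0, hop_s: float = 1.0) -> None:
        self.window = int(window_s * SAMPLE_RATE)  # samples per analysis window
        self.hop = int(hop_s * SAMPLE_RATE)        # samples to advance per window
        self._buf = np.zeros(0, dtype=np.float32)

    def push(self, chunk: np.ndarray) -> list:
        """Append a chunk and return every complete window now available."""
        self._buf = np.concatenate([self._buf, chunk.astype(np.float32)])
        windows = []
        while len(self._buf) >= self.window:
            windows.append(self._buf[: self.window].copy())
            self._buf = self._buf[self.hop:]  # slide forward by one hop
        return windows


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two speaker embeddings; 0.0 if either is all zeros."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0
```

Each window returned by `push` would be handed to the embedding model, and the resulting vector compared against the enrolled one with `cosine_similarity`. Overlapping windows trade extra compute for smoother score updates on the dashboard.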
Challenges we ran into
- Real-time Latency: Processing audio through multiple AI models while maintaining a conversational flow was our biggest hurdle. We had to optimize our audio chunking and run analysis asynchronously to prevent lag in the dashboard.
- AI Audio Detection: Finding an API that was both fast and robust enough to tell whether our audio chunks came from an AI voice model was difficult. We compared and tested several platforms before settling on one.
- Integration Complexity: Orchestrating three different AI services (voice fingerprinting, deepfake detection, and LLM analysis) to work in harmony required a robust state management system on the backend.
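One way to run the three checks without blocking the stream is to fan them out concurrently for each audio window. The sketch below assumes this shape rather than reproducing our backend: the three `check_*` coroutines are hypothetical stubs standing in for the real service calls, and their return values are placeholders.

```python
import asyncio
from typing import Dict, Optional


async def check_voice_match(window: bytes) -> float:
    await asyncio.sleep(0.01)  # stands in for embedding + comparison
    return 0.9                 # placeholder similarity score


async def check_deepfake(window: bytes) -> float:
    await asyncio.sleep(0.01)  # stands in for a remote detection API call
    return 0.1                 # placeholder synthetic-speech probability


async def check_social_engineering(window: bytes) -> float:
    await asyncio.sleep(0.01)  # stands in for speech-to-text + LLM analysis
    return 0.2                 # placeholder risk score


async def analyze_window(window: bytes) -> Dict[str, Optional[float]]:
    """Run all three checks concurrently; a failed check maps to None."""
    names = ("voice_match", "deepfake", "social_engineering")
    results = await asyncio.gather(
        check_voice_match(window),
        check_deepfake(window),
        check_social_engineering(window),
        return_exceptions=True,  # one failing service should not sink the others
    )
    return {
        name: (None if isinstance(res, BaseException) else res)
        for name, res in zip(names, results)
    }
```

With this layout the per-window latency is roughly that of the slowest check rather than the sum of all three, and the dashboard can still update when a single service times out or errors.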
Accomplishments that we're proud of
- Seamless Real-time Dashboard: We're incredibly proud of the UI. Seeing the "Voice Match" and "Fraud Risk" meters react live to a user's voice feels like magic, and the alert system is well suited to giving call center associates live security alerts.
- Defense in Depth: We didn't just build a deepfake detector; we built a holistic security system. Successfully integrating biometric, synthetic, and semantic analysis into a single risk score is a major achievement.
- The "Aha!" Moment: The first time we tested the system with a real voice vs. a recorded deepfake and saw the dashboard instantly flag the attack was a huge win for the team.
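As an illustration of how three signals can be folded into a single risk score, here is a minimal sketch using a weighted average. The function name `fuse_risk`, the weights, and the averaging scheme itself are assumptions chosen for the example and are not necessarily the formula CallShield uses.

```python
def fuse_risk(
    voice_match: float,
    deepfake_prob: float,
    social_risk: float,
    weights: tuple = (0.4, 0.4, 0.2),  # hypothetical weights, not tuned values
) -> float:
    """Combine three signals in [0, 1] into one risk score in [0, 1]."""
    # A high voice match lowers risk, so invert it before combining.
    signals = (1.0 - voice_match, deepfake_prob, social_risk)
    # Clamp each signal so an out-of-range input cannot skew the result.
    clipped = [min(max(s, 0.0), 1.0) for s in signals]
    return sum(w * s for w, s in zip(weights, clipped)) / sum(weights)
```

A weighted average keeps the score easy to explain to an agent watching the dashboard. A stricter alternative would be to take the maximum of the three signals, so that any single strong indicator raises an alert on its own.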
What we learned
- The Threat is Real: Working with voice cloning tools showed us just how easy it is to create convincing fakes, reinforcing the urgency of our solution.
- WebSockets are Powerful: We gained deep expertise in using WebSockets for bi-directional streaming, a skill that will be invaluable for future real-time apps.
What's next for CallShield
- Mobile App: Bringing voiceprint enrollment to mobile banking apps rather than our demo application.
- Secure Fingerprinting: Building secure storage for voice fingerprints so that we can scale beyond our local system without exposing PII (Personally Identifiable Information).
- Video Integration: Expanding our deepfake detection to include video calls, protecting users on platforms like Zoom and Teams.
- Enterprise API: Releasing our risk engine as an API so other developers can integrate CallShield's protection into their own communication platforms.
Built With
- aurigin.ai
- fastapi
- fish.audio
- framer
- gemini
- lucide
- next.js
- numpy
- python
- pytorch
- react
- shadcn
- speechbrain
- supabase
- tailwind
- typescript
- uvicorn
- websockets