DOSafe - AI safety platform

Voice AI Rating System

Inspiration

Deepfake-related losses hit $1.56 billion globally in 2025 - 3x the previous year - and Deloitte projects $40 billion by 2027. Voice cloning is the fastest-growing attack vector: a 3-second sample is enough to clone any voice. Banks are primary targets - impersonators call loan underwriting hotlines, bypass OTP and manual checks, and authorize fraudulent transactions before a human agent notices.

We saw three compounding problems:

AI-generated content is indistinguishable to the human eye (and ear) - cloned voices, face-swapped photos, machine-written articles, synthetic videos. People cannot tell anymore.
Online fraud scales faster than enforcement - millions of phishing domains, spoofed phone numbers, fake crypto wallets. Only 32% of victims even report it.
No single product covers all modalities - existing tools detect one thing. GPTZero does text. Hive does images. Pindrop does voice. Nobody combines multi-modal AI detection with threat intelligence and voice + face biometrics in one platform.

DOSafe combines all of the above - and for Shinhan's SB8 challenge, it directly addresses their need: real-time caller identity verification using voice biometric AI. Existing solutions like Pindrop handle voice biometrics but lack multi-modal AI detection and threat intelligence integration.

What it does

DOSafe is a multi-modal AI detection and online safety platform - detecting AI-generated content across text, image, video, and audio, backed by a 3.93M+ entry threat intelligence database from 19 global sources.

AI Detection runs four parallel pipelines:

Modality	What it detects	Key metrics
Text	ChatGPT, Gemini, Claude, 79+ LLMs	AUROC 0.99 - ModernBERT + E5-small/DivEye ensemble + Qwen LLM rubric judge
Image	DALL-E, Midjourney, Flux, SD3, 20+ generators	AUROC 0.99 - C2PA/SynthID + EXIF/DCT forensics + DINOv3/SPAI/CommFor ensemble
Video	Sora, Kling, Runway, synthetic video	7-layer pipeline - frame ensemble + BEATs/mHuBERT audio + Qwen visual judge
Audio	Cloned voices, TTS systems, 79 speech synthesizers	AUROC 0.97 - BEATs + mHuBERT + SSLAM + EAT ensemble, 100% on Kling 3.0

Shinhan Use Case Coverage

DOSafe addresses 5 Shinhan use cases across 3 entities:

Track	Use Case	DOSafe Solution	Coverage
SB8	AI Voice Biometrics for Fraud Prevention	Voice enroll → real-time streaming verify → deepfake detect → threat intel check	Primary - Full
SF3	Video Call eKYC Enhancement	Face enroll → liveness detection → deepfake detect → 1:N dedup matching	Full
SF7	AI for ICT Cyber Security	Threat intel (3.93M+ entries) + AI detection across all 4 modalities	Full
SL2	Voice AI Rating System	Upload recording → SSE streaming: speaker verify + AI detection + Qwen3-ASR transcription + speaker diarization + Qwen 3.5 script compliance with per-step LLM notes. Progressive rendering: verify scores appear in ~15s, compliance analysis follows ~9s later	Full
SF2	Falsified Document Detection	Image AI detection pipeline (extensible to document forensics)	Extensible

Call ID for Banking (SB8) - our audio + video pipeline extended with speaker and face verification for real-time call screening. This directly addresses Shinhan's SB8 use case: "Verify caller identity in real time during loan underwriting calls and hotline interactions using voice biometric AI."

Layer	What it does
Voice Enrollment	Bind a customer's voiceprint to their identity during onboarding. Anti-spoofing runs in parallel - AI-generated or cloned voices are rejected before enrollment.
Speaker Verification	Verify caller identity by matching live audio against enrolled 768-dimensional voiceprint (ERes2Net 512d + w2v-BERT 256d). Individual model similarities displayed for transparency.
Face Enrollment	Bind a customer's faceprint using Alibaba Tongyi Lab's TransFace ViT (512-dim embeddings). DamoFD detects the face, FLRGB checks liveness from a single RGB frame (99% interception rate, no IR camera needed).
Face Verification	Verify caller identity during video calls by matching live camera frame against enrolled 512-dimensional faceprint. Anti-spoof rejects photos, screens, and masks.
Alibaba Cloud eKYC Fallback	When self-hosted face verification is unavailable, the system auto-switches to Alibaba Cloud eKYC API (CompareFaces, DetectLivingFace) - zero downtime for banking operations.
Live Video Mode	Camera captures frame every 3 seconds → face detection → anti-spoof → similarity match → real-time overlay with match status
Real-time Voice Streaming	WebSocket-based continuous verification - browser streams audio chunks every 2s, sequential verify loop prevents request pileup. First cycle runs full ensemble + LLM analysis, subsequent cycles deliver scores only for minimal latency.
Progressive Verification	Scores arrive in under 3 seconds; Qwen 3.5 security narrative follows asynchronously. Users see results immediately without waiting for LLM.
1:N Face Dedup	Every enrollment triggers a cosine similarity search against all existing faceprints (pgvector HNSW index). Flags potential multi-account fraud when the same face enrolls under different identities.
Audio Anti-Spoofing	Detect AI-cloned or synthetic voices using MMS-300M-NDA (NII Yamagishi, 1107-language multilingual wav2vec2 backbone). Rejects if spoofing probability > 50%
Threat Intel Check	Cross-reference caller phone against 3.93M+ threat intel entries (scam, phishing, spam, fraud, malware) in real time
Compliance Check (SL2)	Upload call recording → SSE streaming with progressive rendering: partial results (~15s) show speaker verify + AI detection + transcript, complete results (~9s later) show diarized dialogue (Telegram-style bubbles labeled Tư vấn viên / Khách hàng, i.e. Advisor / Customer) + Qwen 3.5 script compliance analysis with per-step LLM notes → COMPLIANT/NON-COMPLIANT verdict (COMPLIANT requires both script compliance AND speaker match)
Identity Binding	Voiceprint/faceprint linked to DOS.Me identity via biometric stamps - increases user trust score
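
The enrollment rows above (anti-spoof gate, then 1:N dedup) can be sketched roughly as follows. This is a minimal in-memory illustration, not the production code: the `EnrollmentStore` class, its field names, and the dedup threshold of 0.6 are hypothetical, and reusing the 50% spoofing cutoff (stated above for audio) for faces is an assumption. In the real system the search runs against a pgvector HNSW index, where cosine distance is expressed with the `<=>` operator, rather than a Python loop.

```python
import math
from dataclasses import dataclass, field

SPOOF_REJECT_THRESHOLD = 0.5      # from the text: reject if spoofing probability > 50%
DEDUP_SIMILARITY_THRESHOLD = 0.6  # hypothetical; the source does not give a value


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class EnrollmentStore:
    """Hypothetical in-memory stand-in for the pgvector faceprint table."""
    faceprints: dict = field(default_factory=dict)  # identity_id -> embedding

    def find_duplicates(self, embedding, exclude_id=None):
        """Brute-force 1:N search; returns (identity_id, similarity), best first."""
        hits = []
        for identity_id, stored in self.faceprints.items():
            if identity_id == exclude_id:
                continue
            similarity = cosine_similarity(embedding, stored)
            if similarity >= DEDUP_SIMILARITY_THRESHOLD:
                hits.append((identity_id, similarity))
        return sorted(hits, key=lambda hit: hit[1], reverse=True)

    def enroll(self, identity_id, embedding, spoof_probability):
        # Gate 1: spoofed samples are rejected before anything is stored.
        if spoof_probability > SPOOF_REJECT_THRESHOLD:
            return {"enrolled": False, "reason": "spoof_detected", "duplicates": []}
        # Gate 2: duplicates are flagged for review, not blocked.
        duplicates = self.find_duplicates(embedding, exclude_id=identity_id)
        self.faceprints[identity_id] = list(embedding)
        return {"enrolled": True, "reason": None, "duplicates": duplicates}
```

Enrolling the same face under a second identity returns `enrolled: True` with a non-empty `duplicates` list, which mirrors the "flags potential multi-account fraud" behavior described above.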

No OTP. No security questions. Voice and face are the credentials - verified in real time, protected against deepfakes.

Why Alibaba Tongyi Lab models? The voice and face biometric models all come from Alibaba's research labs, available on ModelScope under MIT/Apache 2.0 licenses. The 3D-Speaker project (ERes2Net, w2v-BERT) provides state-of-the-art speaker verification. DamoFD (ICLR 2023), TransFace (ICCV 2023), and FLRGB provide a complete face verification pipeline with built-in liveness detection - no IR camera required, ideal for browser-based video calls.

Threat Intelligence aggregates 3.93M+ entries from 19 sources, scoring entities with a weighted multi-source model that considers source reliability, data freshness, and corroboration across independent sources. No single source can override the verdict.
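
One way to picture a scoring model with these three properties is the sketch below. It is an illustration under stated assumptions, not DOSafe's actual formula: the exponential half-life, the per-source cap, the noisy-OR combination, and every name (`Report`, `risk_score`, `HALF_LIFE_DAYS`, `SINGLE_SOURCE_CAP`) are hypothetical.

```python
from dataclasses import dataclass

HALF_LIFE_DAYS = 180.0   # hypothetical: a report loses half its weight every 180 days
SINGLE_SOURCE_CAP = 0.6  # hypothetical: the most any one source can contribute


@dataclass(frozen=True)
class Report:
    source: str         # name of the feed that reported the entity
    reliability: float  # 0..1, how much the feed is trusted
    age_days: float     # days since the report was made


def report_weight(report):
    """Reliability scaled by exponential freshness decay."""
    freshness = 0.5 ** (report.age_days / HALF_LIFE_DAYS)
    reliability = max(0.0, min(1.0, report.reliability))
    return reliability * freshness


def risk_score(reports):
    """Noisy-OR over distinct sources; returns a score in [0, 1)."""
    # Keep only the strongest report per source, so repeated reports from
    # one feed do not count as independent corroboration.
    best_per_source = {}
    for report in reports:
        weight = report_weight(report)
        best_per_source[report.source] = max(weight, best_per_source.get(report.source, 0.0))
    # Each source can remove at most SINGLE_SOURCE_CAP of the remaining doubt,
    # so a single feed alone can never push the score past the cap.
    remaining_doubt = 1.0
    for weight in best_per_source.values():
        remaining_doubt *= 1.0 - SINGLE_SOURCE_CAP * weight
    return 1.0 - remaining_doubt
```

Under these assumptions a single fully trusted, fresh report tops out at 0.6, while three independent fresh reports at reliability 0.8 reach roughly 0.86, and a two-year-old report contributes very little.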

Available everywhere: web app (dosafe.io), Chrome extension, Telegram bot (@DOSafeBot), mobile app with call screening, and a Partner API for banks and fintech platforms.

Pre-existing vs. hackathon-built

Pre-existing components (built before April 11):

Threat intelligence pipeline - 3.93M+ entries from 19 sources, auto-sync infrastructure
Self-hosted Qwen3.5-35B inference via vLLM on dedicated GPU
AI detection microservices - text, image, video, audio deepfake detection ensembles
Web platform skeleton (dosafe.io) and Go API gateway

Built during the hackathon (April 11–17):

Voice enrollment and speaker verification pipeline (ERes2Net + w2v-BERT ensemble, pgvector storage)
Face enrollment and verification pipeline (DamoFD + TransFace + FLRGB liveness)
Real-time WebSocket voice streaming with sequential verify loop
Progressive verification - fast scores first, async Qwen 3.5 analysis follows
1:N face dedup matching (pgvector HNSW cosine search across all faceprints)
Anti-spoofing at enrollment - reject AI voices and spoofed faces before biometric registration
Qwen 3.5 real-time call analysis integration (first cycle LLM, subsequent scores-only)
Live video call mode with face verification overlay
Call ID banking UI (enroll, verify, live call, video call, compliance check)
Individual model score display (ERes2Net + w2v-BERT shown separately)
Alibaba Cloud eKYC fallback for face verification
Compliance check pipeline - Qwen3-ASR transcription + speaker diarization + Qwen 3.5 script compliance analysis (6 criteria) with per-step LLM notes
Compliance SSE streaming - progressive rendering: partial results (~15s) then complete results (~9s later), diarized dialogue as Telegram-style chat bubbles
Compliance 4-box grid UI - upload recording + script template on top row, voice scores + script compliance checklist below after check
Kling 3.0 audio detection fix (retrained classifiers after accuracy dropped)

How we built it

Layer	Technology
Web app	Next.js 16 + React 19 + TypeScript, deployed on Vercel
API Gateway	Go on Google Cloud Run
Database	Supabase PostgreSQL with pgvector for voiceprint/faceprint storage + HNSW index for 1:N dedup search
AI Inference	Self-hosted Qwen3.5-35B via vLLM
Detection	FastAPI microservices for each modality
Speaker Verification	ERes2Net-large (512d, 0.52% EER) + w2v-BERT-2.0_SV (256d, 0.14% EER) - 768d ensemble, weighted cosine similarity, voiceprints stored in pgvector. Models from Alibaba Tongyi Lab / 3D-Speaker (ModelScope, Apache 2.0)
Face Verification	DamoFD (ICLR 2023, face detection) + TransFace ViT (ICCV 2023, 512d recognition) + FLRGB (anti-spoof, 99% interception). All from Alibaba Tongyi Lab (ModelScope, MIT). Faceprints stored in pgvector. Alibaba Cloud eKYC as automatic fallback
Bot platform	Supabase Edge Functions (Deno)
Mobile	Flutter + native call screening (Android/iOS)
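
The speaker verification row can be illustrated with a short sketch of weighted cosine fusion over the two sub-embeddings. The 512d + 256d layout comes from the table; the fusion weights, the match threshold, and the function names are assumptions made for illustration, since the source does not state them.

```python
import math

ERES2NET_DIM = 512  # from the table: ERes2Net-large embedding size
W2V_BERT_DIM = 256  # from the table: w2v-BERT-2.0_SV embedding size
VOICEPRINT_DIM = ERES2NET_DIM + W2V_BERT_DIM  # 768d ensemble

FUSION_WEIGHTS = {"eres2net": 0.4, "w2v_bert": 0.6}  # hypothetical weights
MATCH_THRESHOLD = 0.5                                # hypothetical threshold


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify_speaker(enrolled, live):
    """Compare two 768d voiceprints; returns fused and per-model similarities."""
    if len(enrolled) != VOICEPRINT_DIM or len(live) != VOICEPRINT_DIM:
        raise ValueError(f"expected {VOICEPRINT_DIM}-dimensional voiceprints")
    # Score each model on its own slice so both numbers can be shown separately.
    per_model = {
        "eres2net": cosine(enrolled[:ERES2NET_DIM], live[:ERES2NET_DIM]),
        "w2v_bert": cosine(enrolled[ERES2NET_DIM:], live[ERES2NET_DIM:]),
    }
    fused = sum(FUSION_WEIGHTS[name] * score for name, score in per_model.items())
    return {
        "similarity": fused,
        "per_model": per_model,
        "verified": fused >= MATCH_THRESHOLD,
    }
```

Scoring each slice separately, rather than taking one cosine over the concatenated vector, keeps the per-model similarities available for display.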

Qwen is core to the solution:

LLM Meta-Judge - Qwen3.5-35B acts as a final judge after neural ensembles produce scores, catching edge cases with structured reasoning.
Real-time Call Analysis - during live voice/video calls, Qwen3.5-35B generates security assessments asynchronously after each verification cycle. Scores arrive in under 3 seconds; the AI narrative follows seconds later without blocking.
Compliance Transcription + Diarization - Qwen3-ASR (DashScope) transcribes call recordings with speaker diarization (Tư vấn viên / Khách hàng, i.e. Advisor / Customer), then Qwen3.5-35B analyzes the transcript against 6 compliance criteria with per-step LLM notes. SSE streaming delivers results progressively: verify scores in ~15s, compliance analysis ~9s later.
Threat Intelligence Summarization - synthesizes multi-source threat reports into concise risk summaries for agents and end users.
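
The progressive compliance delivery described above could look something like this on the wire. The sketch uses the standard Server-Sent Events framing (`event:` and `data:` lines ending in a blank line); the event names `partial` and `complete`, the payload field names, and the two callback parameters are hypothetical. The verdict rule (script compliance AND speaker match) is the one stated in the Compliance Check description.

```python
import json


def sse_event(name, payload):
    """Format one Server-Sent Events message with a JSON body."""
    # json.dumps emits a single line, so the payload fits in one data: field.
    return f"event: {name}\ndata: {json.dumps(payload, ensure_ascii=False)}\n\n"


def compliance_stream(run_verification, run_compliance_analysis):
    """Yield SSE messages: fast verification results first, LLM analysis second.

    run_verification() -> dict with at least "speaker_match" and "transcript"
    run_compliance_analysis(transcript) -> dict with at least "script_compliant"
    Both callables are hypothetical stand-ins for the real pipeline stages.
    """
    # Stage 1: speaker verify + AI detection + transcript (the ~15s stage).
    partial = run_verification()
    yield sse_event("partial", partial)

    # Stage 2: script compliance analysis on the transcript (the slower LLM stage).
    analysis = run_compliance_analysis(partial["transcript"])
    compliant = bool(analysis["script_compliant"]) and bool(partial["speaker_match"])
    verdict = "COMPLIANT" if compliant else "NON-COMPLIANT"
    yield sse_event("complete", {**analysis, "verdict": verdict})
```

Because the generator yields the first message before starting the second stage, a streaming HTTP response built on it can render verification scores while the compliance analysis is still running.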

Open-weight, on-premise ready. All models are MIT/Apache 2.0 (Qwen, Alibaba Tongyi Lab). The full inference stack deploys on a bank's own GPU servers - no data leaves the network, no vendor lock-in. Vietnamese banking regulation requires biometric data to stay on-premise; DOSafe is built for this.

Why Qwen? Open weights, sub-second latency for real-time calls, zero per-query cost at scale, deployable anywhere. Alibaba Cloud Model Studio as optional fallback for zero downtime.

Challenges we ran into

Kling 3.0 broke our audio detection - accuracy dropped from 95% to 30%. Retrained classifiers, redesigned heuristic. Result: 100% on Kling 3.0.
Real-time streaming without request pileup - each verify takes 2-3s on GPU. Solved with sequential verify loop: first cycle full LLM analysis, subsequent cycles scores-only.
Image false positives on compressed photos - recalibrated thresholds and added forensic pre-filters.
Threat intel at scale - 3.93M+ entries from 19 sources with different formats required custom ETL with dedup, cluster linking, and corroboration-based scoring.
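
The sequential verify loop from the streaming challenge above can be sketched with asyncio. This is a simplified illustration under assumptions: the function and parameter names are hypothetical, `None` is used as an end-of-stream marker, and dropping stale chunks in favor of the newest one is one plausible way to avoid pileup that the source does not confirm.

```python
import asyncio


async def verify_loop(chunks, verify, on_result):
    """Verify audio chunks one at a time, never letting requests pile up.

    chunks:    asyncio.Queue of audio chunks; None marks the end of the stream
    verify:    async callable (chunk, full_analysis=bool) -> result
    on_result: plain callable invoked with each result
    Returns the number of verification cycles that ran.
    """
    cycle = 0
    ended = False
    while not ended:
        chunk = await chunks.get()
        # Anything that arrived while the previous verify was running is stale:
        # keep only the newest chunk (assumed policy, see lead-in).
        while not chunks.empty():
            newer = chunks.get_nowait()
            if newer is None:
                ended = True
                break
            chunk = newer
        if chunk is None:
            break
        # Only one verify is ever in flight. The first cycle requests the full
        # ensemble + LLM analysis; later cycles request scores only.
        result = await verify(chunk, full_analysis=(cycle == 0))
        on_result(result)
        cycle += 1
    return cycle
```

With chunks arriving every 2s and each verify taking 2-3s, a loop like this keeps the displayed score tied to the most recent audio instead of working through a growing backlog.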

Accomplishments that we're proud of

Four-modality AI detection - text, image, video, audio (AUROC 0.97–0.99). 100% Kling 3.0 detection - fixed from 30% in under a week.
Real-time voice + face biometrics - WebSocket streaming verification during live calls, RGB-only anti-spoof (no IR camera), 1:N face dedup catches multi-account fraud, anti-spoofing blocks deepfakes at enrollment.
Compliance SSE streaming with speaker diarization - progressive rendering shows verify scores in ~15s, then diarized dialogue (Telegram-style bubbles) + per-step LLM compliance notes ~9s later. No blocking: users see results as they arrive.
3.93M+ threat intel entries from 19 sources, auto-synced every 6 hours.
Fully open-weight, on-premise ready - all models are MIT/Apache 2.0. Self-hosted inference = near-zero per-query cost and full banking regulation compliance.
Six channels live - web app, Chrome extension, Telegram bot, mobile app, Partner API, DOS.Me identity.

What we learned

Ensemble diversity beats individual accuracy. No single model catches everything. Combining models that analyze different dimensions (temporal, spectral, semantic) creates robustness no single model achieves alone.
Open-weight models are non-negotiable for banking. Vietnamese financial regulation requires biometric data to stay on-premise. By using open-weight models (Qwen, ERes2Net, w2v-BERT, DamoFD, TransFace, FLRGB - all MIT/Apache 2.0), the inference stack deploys inside the bank's own infrastructure. No cloud API dependency, no data leaving the network, full regulatory compliance.
Freshness decay is essential for threat intel. A phone number reported 2 years ago is not the same threat as one reported yesterday by 3 sources. Time-weighted scoring turns a noisy blacklist into a calibrated risk engine.
New generators break old detectors. Every major model release can crater accuracy. Modular architecture where models can be swapped without rebuilding the system is essential.
Voice is the most natural biometric for banking. Customers already call - using their voice as the credential removes friction (no OTP, no security questions) while adding a layer that's harder to fake than knowledge-based authentication.

What's next for DOSafe

Speaker diarization in live calls - extend diarization from compliance (done) to real-time streaming calls, detect mid-call speaker handoff attacks
Upgraded face models - ArcFace/AdaFace recognition + multi-frame anti-spoof ensemble for production accuracy
Scam message classifier - paste SMS/Zalo/Facebook messages → AI classifies scam type + auto-checks entities
Multimodal scam detection - unified risk score combining image + text + entity signals in one analysis
On-premise deployment - Helm + Docker Compose package for banks to run the full inference stack on their own GPU servers
Bank integration SDK - drop-in library wrapping WebSocket streaming API with session management and compliance logging

Try it yourself

Web App - dosafe.io

Call ID (SB8 demo) - dosafe.io/call-id

Sign in with a DOS.Me account (free)
Voice Enroll - record 3+ seconds of speech → voiceprint created (768d embedding)
Voice Verify - record again → similarity score + anti-spoofing + Qwen 3.5 analysis
Live Call - start streaming → real-time scores update every 2-3s, LLM analysis on first cycle
Face Enroll - allow camera → liveness check → faceprint stored. Try enrolling a second account with the same face to see 1:N dedup alert
Video Call - live face verification with real-time overlay (VERIFIED / NOT MATCHED)
Compliance Check (SL2 tab) - upload a call recording (or use Sample A) → SSE streaming shows verify scores first (~15s), then diarized dialogue + script compliance checklist with per-step LLM notes (~9s later). 4-box grid: upload + script template on top, voice scores + compliance results below

AI Detection - dosafe.io

Text - paste any ChatGPT/Claude output → AI probability score + Qwen rubric analysis
Image - upload a Midjourney/DALL-E image → forensic + neural ensemble verdict
Audio - upload an AI-cloned voice sample → 4-model ensemble score
Video - upload a synthetic video → frame + audio analysis

Entity Check - dosafe.io/check

Enter a phone number, URL, email, or crypto wallet → multi-source risk score from 3.93M+ threat intel entries

Telegram Bot - @DOSafeBot

Open t.me/DOSafeBot and try:

/callid - open Call ID menu → voice enroll, voice verify, face enroll, face verify (send voice note or photo when prompted)
/callreset - reset your enrolled voiceprint/faceprint
/check +84xxxxxxxxx - check a phone number against threat intel
/check https://example.com - check a URL for phishing/scam reports
Send any text message → AI-generated text detection
Send an image → AI-generated image detection
Send a voice message → AI-generated audio detection

Partner API

POST https://api.dos.ai/v1/dosafe/voice/verify
Headers: X-Api-Key: dsk_xxx
Body: FormData { file: audio.wav, user_id: "customer_id" }
Response: { verified, similarity, spoofing_risk, threat_score, analysis }

API documentation and key provisioning at app.dosafe.io.
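
A minimal client for the endpoint above might look like the sketch below, using only the Python standard library. The URL, the `X-Api-Key` header, the two form fields, and the response keys come from the description above; everything else is an assumption, including the `audio/wav` part content type, the response being JSON, the timeout, and the helper names `build_verify_request` and `verify_voice`.

```python
import json
import urllib.request
import uuid

VERIFY_URL = "https://api.dos.ai/v1/dosafe/voice/verify"


def build_verify_request(api_key, user_id, audio_bytes, filename="audio.wav"):
    """Build (but do not send) a multipart/form-data POST for voice verification."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="user_id"\r\n\r\n'
        f"{user_id}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"  # assumed content type for the upload
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        VERIFY_URL,
        data=head + audio_bytes + tail,
        method="POST",
        headers={
            "X-Api-Key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def verify_voice(api_key, user_id, audio_bytes, timeout=10.0):
    """Send the request and decode the response, assumed to be JSON with the
    keys listed above (verified, similarity, spoofing_risk, threat_score, analysis)."""
    request = build_verify_request(api_key, user_id, audio_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `verify_voice("dsk_xxx", "customer_id", open("audio.wav", "rb").read())` would then return the parsed response, assuming a valid key.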

Built With

cloud-run
cloudflare
deno
dinov3
docker
fastapi
flutter
go
nestjs
nextjs
pgvector
postgresql
python
pytorch
qwen
spai
supabase
typescript
vercel
vllm
websocket

Updates

Anh Le started this project — Apr 09, 2026 10:32 PM EDT
