Inspiration

Deepfake voice scams are exploding: banking OTP theft, CEO fraud, and family-in-distress calls. We wanted a fast, practical check that anyone could run to answer one question in seconds: “Is this voice likely AI or human?” That became Voice Guard.

What it does

Classifies speech as Human vs AI and shows a confidence for each.

Generates an explanation heatmap over the spectrogram so users see why.

Works with microphone recordings or file uploads.

Clean web UI (record/upload → analyze) + transparent “why” line (rule, threshold, replay score).

How we built it

Data: Programmatically generated ~200 diverse sentences across multiple ElevenLabs voices; converted and normalized to 16 kHz WAV. Added real human clips.
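The conversion step can be sketched with the standard library alone. This is a minimal illustration, not the actual pipeline (a real one would use a polyphase resampler such as librosa or torchaudio); `resample_linear` and `save_wav_16k` are hypothetical names:

```python
import wave
import numpy as np

TARGET_SR = 16_000  # sample rate the models expect

def resample_linear(samples: np.ndarray, sr_in: int, sr_out: int = TARGET_SR) -> np.ndarray:
    """Resample mono float audio with linear interpolation (sketch only)."""
    if sr_in == sr_out:
        return samples
    duration = len(samples) / sr_in
    n_out = int(round(duration * sr_out))
    t_in = np.linspace(0.0, duration, num=len(samples), endpoint=False)
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, samples)

def save_wav_16k(path: str, samples: np.ndarray, sr_in: int) -> None:
    """Write mono 16-bit PCM WAV at 16 kHz using only the stdlib."""
    mono = resample_linear(samples.astype(np.float32), sr_in)
    pcm = (np.clip(mono, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit
        f.setframerate(TARGET_SR)
        f.writeframes(pcm.tobytes())
```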

Models: Started with a CNN on log-mel; moved to a wav2vec2-based classifier for better generalization. Trained on a 4GB RTX 3050 using AMP, grad-accumulation, and lightweight augmentations.
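Gradient accumulation is what makes the 4 GB card workable: several small micro-batches contribute gradients before one optimizer step, which is mathematically the same as one large batch. A minimal numpy demonstration of that equivalence (toy MSE loss, not our model):

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of mean squared error 0.5 * mean((Xw - y)^2) w.r.t. w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
y = rng.normal(size=32)
w = rng.normal(size=8)

# Full-batch gradient in one pass.
g_full = grad_mse(w, X, y)

# Accumulate over 4 micro-batches of 8 (what grad-accumulation does on-GPU),
# weighting each micro-batch gradient by its share of the samples.
g_accum = np.zeros_like(w)
for i in range(0, 32, 8):
    g_accum += grad_mse(w, X[i:i + 8], y[i:i + 8]) * (8 / 32)

assert np.allclose(g_full, g_accum)  # identical up to float error
```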

Inference: FastAPI backend; mic-aware preprocessing (band-pass, RMS/peak normalization, noise gate). Added a replay-attack heuristic (cepstrum + HF roll-off) to nudge AI scores when a clip sounds like speaker→mic re-recording.
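The replay heuristic rests on two observations: speaker→mic re-recordings lose high-frequency energy, and room reflections add comb structure visible in the cepstrum. A numpy sketch with illustrative (not our tuned) weights:

```python
import numpy as np

def hf_rolloff_ratio(x: np.ndarray, sr: int, cutoff_hz: float = 4000.0) -> float:
    """Fraction of spectral energy above cutoff_hz; low values suggest
    the bandwidth loss typical of speaker-to-mic re-recording."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return float(spec[freqs >= cutoff_hz].sum() / (spec.sum() + 1e-12))

def cepstral_peak(x: np.ndarray) -> float:
    """Crude cepstral peak: periodic comb structure in the log spectrum
    (e.g. room reflections) shows up as a peak away from the origin."""
    log_spec = np.log(np.abs(np.fft.rfft(x)) + 1e-12)
    ceps = np.abs(np.fft.irfft(log_spec))
    return float(ceps[20:len(ceps) // 2].max())

def replay_score(x, sr, w_hf=0.5, w_cep=0.5):
    # Higher = more replay-like (hypothetical weighting for illustration).
    return w_hf * (1.0 - hf_rolloff_ratio(x, sr)) + w_cep * min(cepstral_peak(x), 1.0)
```

The backend uses this score to nudge the AI probability upward rather than as a hard classifier.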

Deployment: We deployed a Streamlit implementation on Hugging Face Spaces.

Explainability: Saliency-style heatmap aligned to the spectrogram.
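Aligning a saliency track to the spectrogram reduces to resizing it to the frame axis and normalizing to [0, 1] for overlay. A small sketch (`align_saliency` is a hypothetical helper name):

```python
import numpy as np

def align_saliency(saliency: np.ndarray, n_frames: int) -> np.ndarray:
    """Resize a 1-D saliency track to the spectrogram's frame axis and
    min-max normalize to [0, 1] so it can be drawn as a heatmap overlay."""
    src = np.linspace(0.0, 1.0, num=len(saliency))
    dst = np.linspace(0.0, 1.0, num=n_frames)
    s = np.interp(dst, src, saliency)
    lo, hi = s.min(), s.max()
    return (s - lo) / (hi - lo + 1e-12)
```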

Challenges we ran into

Microphone access glitches in some browsers/OS setups.

Gradient/Grad-CAM pitfalls (no-grad contexts, wrong layer hooks) caused backprop errors; fixed by registering the correct hooks and enabling gradients only where needed.

Audio augmentation mismatches (library API changes, missing optional deps like fast_mp3_augment).

Threshold vs. intuition: Cases where the AI probability looked high but sat just below the mic threshold, producing a Human label. We exposed the decision rule and thresholds in the UI to avoid confusion.
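The exposed rule can be sketched as follows. Thresholds and the nudge weight here are illustrative placeholders, not the tuned values:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    label: str
    why: str  # the transparent "why" line shown in the UI

# Hypothetical thresholds; the real values are tuned per input source.
MIC_AI_THRESHOLD = 0.80
FILE_AI_THRESHOLD = 0.60

def decide(p_ai: float, source: str, replay_score: float = 0.0) -> Decision:
    """Threshold rule with a replay nudge: mic input uses a stricter cutoff,
    and a replay-like clip nudges the effective AI probability upward."""
    thr = MIC_AI_THRESHOLD if source == "mic" else FILE_AI_THRESHOLD
    p_eff = min(1.0, p_ai + 0.10 * replay_score)  # illustrative nudge weight
    label = "AI" if p_eff >= thr else "Human"
    why = (f"p_ai={p_ai:.2f}, replay={replay_score:.2f}, "
           f"effective={p_eff:.2f}, threshold[{source}]={thr:.2f} -> {label}")
    return Decision(label, why)
```

Showing the `why` string verbatim is what resolved the "looks AI but labeled Human" confusion.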

Unresolved during the hack: Robustly catching speaker→mic replayed AI across all rooms/devices. Heuristics helped, but we didn’t have time to gather and train on a full physical-replay dataset.

Accomplishments that we're proud of

End-to-end system: data → training → inference API → web app in a weekend.

Works on a 4GB GPU with mixed precision and gradient accumulation.

Transparent decisions: heatmaps + “why” line (probabilities, thresholds, replay score, rule).

Modular code: easy to swap models (CNN ↔ wav2vec2) and tune thresholds.

What we learned

Anti-spoofing is domain sensitive: mic recordings behave differently than direct files.

Decision policy matters: argmax vs threshold vs hybrid changes the UX more than expected.

Small model/infra tricks (RMS/peak norm, band-pass, VAD/trim, AMP) deliver big stability gains.
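The RMS/peak normalization trick, for instance, is a few lines: scale to a target loudness, then back off if the peak would clip. A sketch with assumed target values:

```python
import numpy as np

def rms_peak_normalize(x: np.ndarray, target_rms: float = 0.1,
                       peak_ceiling: float = 0.95) -> np.ndarray:
    """Scale audio to a target RMS, then back off if the peak would clip.
    Keeps quiet mic takes and hot uploads at a comparable level."""
    rms = np.sqrt(np.mean(x ** 2)) + 1e-12
    y = x * (target_rms / rms)
    peak = np.max(np.abs(y))
    if peak > peak_ceiling:
        y *= peak_ceiling / peak  # avoid clipping on spiky input
    return y
```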

Good telemetry (showing why) turns “black box” AI into a trustworthy tool.

Collecting the right data (physical replay, varied rooms/devices) is as important as architecture.

What's next for Voice Guard

Data & training: Build a physical-replay dataset (multiple rooms, speakers, devices, distances); train anti-spoof backbones (AASIST/ECAPA/RawNet-style) and a small learned replay head.

Stronger features: Add LFCC/phase/cepstral features and fuse with wav2vec2; longer context with VAD.

Calibration: Source-specific thresholds, temperature scaling, and risk bands (Likely AI / Uncertain / Likely Human).
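Both planned pieces are simple in isolation: temperature scaling divides logits by a fitted T > 1 to soften overconfidence, and risk bands replace the binary label with three buckets. A sketch with illustrative cutoffs:

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax: T > 1 softens overconfident logits."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def risk_band(p_ai: float, low: float = 0.35, high: float = 0.65) -> str:
    """Map a calibrated AI probability to a coarse band (cutoffs illustrative)."""
    if p_ai >= high:
        return "Likely AI"
    if p_ai <= low:
        return "Likely Human"
    return "Uncertain"
```

In practice T would be fitted on a held-out set per input source (mic vs. file), matching the source-specific thresholds above.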

Product: Prompt users to upload the original file when a mic clip scores high for AI but sits below the threshold; add an enterprise API and audit logging.

Accessibility & security: Multi-language support, on-device inference option, and clearer privacy controls.

Known limitation (tracked): Robust detection of replayed AI via mic remains challenging; we’ve documented it and scoped it as a post-hack upgrade with data collection and retraining.
