VoiceFuzz

Inspiration

Voice agents fail in ways text testing never catches. A negation bug, where the agent ignores "don't" and does the opposite of what the user said, is just the tip of the iceberg. What about slang, sarcasm, double negation, or spoken punctuation?
Manual QA doesn't scale. Traditional fuzzing is random noise. Neither understands language.

What it does

VoiceFuzz takes one observed bug and discovers how deep it goes. An LLM hypothesizes what other inputs might trigger similar failures and generates test cases for each hypothesis. The system runs those cases in parallel and reports which failure patterns exist, including bugs that hide themselves by accidentally producing correct output.

How we built it

Claude API — Generates creative test hypotheses from a seed failure. Not just obvious variants, but linguistic edge
cases humans wouldn't think to test.
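A minimal sketch of how a seed failure could be turned into a prompt and how the model's reply could be parsed into structured hypotheses. The `Hypothesis` fields, the prompt wording, and the JSON reply format are assumptions made for illustration, not the project's actual prompt. The call to the Claude API is left out, so `parse_hypotheses` works on whatever reply text comes back.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    # Hypothetical structure; the field names are illustrative only.
    name: str
    rationale: str
    utterances: List[str]


def build_prompt(seed_utterance: str, expected: str, actual: str) -> str:
    """Describe the seed failure and ask for hypotheses in a JSON format."""
    return (
        "A voice agent mishandled a spoken request.\n"
        f"Utterance: {seed_utterance}\n"
        f"Expected behavior: {expected}\n"
        f"Actual behavior: {actual}\n"
        "Propose hypotheses about other inputs that could trigger a related "
        "failure. Go beyond synonyms: consider slang, sarcasm, double "
        "negation, and spoken punctuation.\n"
        'Reply with only a JSON array of objects with keys "name", '
        '"rationale", and "utterances".'
    )


def parse_hypotheses(reply_text: str) -> List[Hypothesis]:
    """Extract the outermost JSON array from a reply; return [] if none parses."""
    start, end = reply_text.find("["), reply_text.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        items = json.loads(reply_text[start : end + 1])
    except json.JSONDecodeError:
        return []
    hypotheses = []
    for item in items:
        # Skip entries that lack a name or a list of utterances.
        if not isinstance(item, dict):
            continue
        if not item.get("name") or not isinstance(item.get("utterances"), list):
            continue
        hypotheses.append(
            Hypothesis(
                name=str(item["name"]),
                rationale=str(item.get("rationale", "")),
                utterances=[str(u) for u in item["utterances"]],
            )
        )
    return hypotheses
```

Asking for JSON keeps the reply machine-readable, and the tolerant parser drops malformed entries instead of failing the whole run.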

Daytona — Runs all hypotheses in parallel sandboxes. 7 sandboxes created in 0.3s, tests executed in 0.5s. 28x faster
than sequential.
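The fan-out pattern can be sketched with a thread pool. Here `run_in_sandbox` is a hypothetical stand-in that sleeps briefly and returns a canned result; it does not call Daytona, and the real sandbox creation and execution calls are not shown. The toy agent inside it fails on any utterance containing "don't", which is an assumption for the demo.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_in_sandbox(utterance: str) -> Dict[str, object]:
    """Hypothetical stand-in for one sandboxed test run (no real sandbox here)."""
    time.sleep(0.05)  # simulate the latency of a remote run
    # Toy agent: pretend it mishandles any utterance containing "don't".
    return {"utterance": utterance, "passed": "don't" not in utterance}


def run_all(utterances: List[str], max_workers: int = 7) -> List[Dict[str, object]]:
    """Run every test case concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_in_sandbox, utterances))


if __name__ == "__main__":
    cases = ["cancel my order", "don't cancel my order", "keep my order"]
    for result in run_all(cases):
        print(result)
```

Threads fit here because each run spends its time waiting on a remote service rather than on local CPU work.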

ElevenLabs — Speech-to-text in the voice agent. Real audio through the full voice pipeline catches bugs that only
exist in spoken language — punctuation becoming words, format changes through TTS/STT.
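One way to spot punctuation that turned into words is to compare the text sent to TTS with the transcript that came back from STT. The sketch below is an assumption about how such a check might look, not the project's implementation: it lists words that appear in the transcript but not in the original text.

```python
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def added_words(original: str, transcript: str) -> List[str]:
    """Return words the audio round trip introduced, in sorted order."""
    extra = Counter(tokens(transcript)) - Counter(tokens(original))
    return sorted(extra.elements())


if __name__ == "__main__":
    # A period inside the text was spoken aloud and transcribed as a word.
    print(added_words("Email me at j.doe", "email me at j dot doe"))
```

An empty list means the round trip added no words; a word such as "dot" suggests punctuation was read aloud and then transcribed as text.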

Sentry — Error tracking with full NLU context. Production failures become seeds for the fuzzer.

Pipeline: Seed failure → Claude hypothesizes → TTS generates audio → Daytona runs parallel sandboxes → ElevenLabs STT processes → Analyze results.
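The stages above can be sketched as a function that takes each stage as a callable, so the real services can be swapped for fakes in tests. All names here (`run_pipeline`, `CaseResult`, and the four stage parameters) are hypothetical, and the loop runs sequentially for clarity rather than in parallel sandboxes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CaseResult:
    utterance: str   # text of the generated test case
    transcript: str  # what STT heard
    response: str    # what the agent did
    passed: bool     # verdict from the judge


def run_pipeline(
    seed: str,
    hypothesize: Callable[[str], List[str]],        # LLM stage
    synthesize: Callable[[str], bytes],             # TTS stage
    run_agent: Callable[[bytes], Tuple[str, str]],  # sandboxed agent with STT
    judge: Callable[[str, str], bool],              # analysis stage
) -> List[CaseResult]:
    """Push each hypothesized utterance through audio, agent, and analysis."""
    results = []
    for utterance in hypothesize(seed):
        audio = synthesize(utterance)
        transcript, response = run_agent(audio)
        passed = judge(utterance, response)
        results.append(CaseResult(utterance, transcript, response, passed))
    return results


if __name__ == "__main__":
    # Fake stages: the "agent" always cancels, so negated requests fail.
    results = run_pipeline(
        "don't cancel my order",
        hypothesize=lambda seed: [seed, "please cancel my order"],
        synthesize=lambda text: text.encode("utf-8"),
        run_agent=lambda audio: (audio.decode("utf-8"), "cancelled"),
        judge=lambda utt, resp: (resp == "cancelled") == ("don't" not in utt),
    )
    for r in results:
        print(r)
```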

Challenges we ran into

The Daytona SDK needed Python 3.10+, but we had 3.9, so we went direct to the REST API for sandbox creation, toolbox proxy execution, and parallel teardown. It worked cleanly.

Getting the system to find genuinely non-obvious bugs, not just synonyms of the original failure, required careful prompt engineering to push Claude toward creative linguistic reasoning.

Accomplishments that we're proud of

The system finds bugs that hide themselves: cases where the agent gives the correct answer for the wrong reason, which a traditional pass/fail test marks as passing. Hypothesis-driven testing flags these as suspicious.
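One possible rule for flagging such cases, offered as an assumption rather than the project's actual logic: group results by the hypothesis that produced them, and treat a passing case as suspicious when other cases from the same hypothesis fail. The `Result` type and the label names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    hypothesis: str  # which hypothesis generated this case
    utterance: str
    passed: bool


def classify(results: List[Result]) -> Dict[str, str]:
    """Label each utterance as 'fail', 'suspicious_pass', or 'pass'."""
    by_hypothesis = defaultdict(list)
    for r in results:
        by_hypothesis[r.hypothesis].append(r)

    labels = {}
    for group in by_hypothesis.values():
        any_failed = any(not r.passed for r in group)
        for r in group:
            if not r.passed:
                labels[r.utterance] = "fail"
            elif any_failed:
                # Siblings under the same hypothesis fail, so this pass
                # may be accidental and deserves a closer look.
                labels[r.utterance] = "suspicious_pass"
            else:
                labels[r.utterance] = "pass"
    return labels


if __name__ == "__main__":
    demo = [
        Result("negation ignored", "don't cancel my order", False),
        Result("negation ignored", "don't keep my order", True),
        Result("slang", "nix the order", True),
    ]
    print(classify(demo))
```

A plain pass/fail report would count "don't keep my order" as healthy; grouping by hypothesis is what surfaces it for review.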

Full voice pipeline testing end-to-end with real audio, real STT, real agent responses. Found voice-specific bugs
invisible to text testing.

What we learned

LLMs make better hypothesis generators than traditional fuzzers do. They reason about language structure, not just character mutations.

The most dangerous bugs are the ones that accidentally pass: you need to understand why something passes, not just that it passes.

What's next for VoiceFuzz