Inspiration

Traditional static analysis tools are good at catching surface-level errors. Human testers excel at finding functional bugs. But there's a critical gap between the two: understanding intent and logic across complex systems. We were inspired by the idea that AI agents, powered by Gemini 3's advanced reasoning capabilities, could act as specialized "Red Teams"—not just checking if code runs, but actively trying to break, exploit, and optimize systems across multiple domains.

The Gemini 3 Hackathon challenged us to push boundaries, and we saw an opportunity to create something that turns dry code analysis into an engaging, multi-agent adventure that serves real-world needs.

What it does

Chaos Engine V3 is a universal autonomous QA system that deploys domain-specific AI agents to analyze your code:

  • 🎮 Game QA: Exploit hunters and performance optimizers for AAA or indie game logic
  • 💻 Software/Web: Security auditors and architecture reviewers for enterprise applications
  • 🎓 Learning/Education: Mentors and concept analyzers to help developers grow
  • 🎧 Customer Support: Bug reproducers and diagnostic experts to solve user issues

The system:

  1. Auto-detects your programming language (Python, JavaScript, TypeScript, C#, C++, Java)
  2. Selects appropriate AI agents based on your chosen domain
  3. Uses Gemini 3's thinking_budget feature to simulate complex logic transitions
  4. Provides live reasoning logs showing how agents think through edge cases
  5. Generates automated fix proposals with side-by-side code diffs
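The language auto-detection in step 1 can be sketched as a simple pattern-scoring pass. This is an illustrative approximation only—the signature patterns, weights, and function name here are our own placeholders, not the project's actual detection rules:

```python
import re

# Hypothetical sketch of language auto-detection: score each candidate
# language by counting matches of telltale patterns, then pick the best.
LANGUAGE_SIGNATURES = {
    "python":     [r"\bdef \w+\(", r"\bimport \w+", r":\n\s+", r"\bself\b"],
    "javascript": [r"\bconst \w+ =", r"\bfunction \w+\(", r"console\.log"],
    "typescript": [r": (string|number|boolean)\b", r"\binterface \w+"],
    "c#":         [r"\bnamespace \w+", r"\busing System", r"\bpublic (class|void)\b"],
    "c++":        [r"#include <\w+>", r"\bstd::", r"\bcout\b"],
    "java":       [r"\bpublic static void main\b", r"\bSystem\.out\.", r"\bpackage \w+;"],
}

def detect_language(source: str) -> str:
    """Return the best-scoring language, or 'unknown' if nothing matches."""
    scores = {
        lang: sum(len(re.findall(pattern, source)) for pattern in patterns)
        for lang, patterns in LANGUAGE_SIGNATURES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

A real implementation would need more signals (file extensions, shebangs, parser probes) to handle the edge cases discussed below, but the scoring idea is the same.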

How we built it

Frontend: Next.js 15 with Tailwind CSS, Framer Motion for animations, and a cyberpunk-inspired UI that dynamically responds to domain selection. We built a real-time log viewer to show Gemini 3's thinking process.

Backend: FastAPI with Google GenAI SDK 1.0+, leveraging Pydantic V2 for robust data validation. We engineered a sophisticated agent orchestration system that:

  • Routes analysis requests to domain-specific prompts
  • Manages Gemini 3 Pro/Flash model selection based on complexity
  • Streams thinking logs in real-time via WebSocket-like connections
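The routing and model-selection logic above can be sketched roughly as follows. The domain prompts, model identifiers, and complexity threshold are placeholders we invented for illustration, not the project's actual values:

```python
from dataclasses import dataclass

# Illustrative domain -> system-prompt table (placeholder wording).
DOMAIN_PROMPTS = {
    "game":     "You are an exploit hunter probing game logic for breakable states.",
    "software": "You are a security auditor reviewing enterprise application code.",
    "learning": "You are a patient mentor explaining concepts and pitfalls.",
    "support":  "You are a diagnostic expert reproducing reported user issues.",
}

@dataclass
class AnalysisRequest:
    domain: str
    source: str

def route(request: AnalysisRequest) -> dict:
    """Map a request to a system prompt and a model tier by rough complexity."""
    if request.domain not in DOMAIN_PROMPTS:
        raise ValueError(f"unknown domain: {request.domain}")
    # Crude complexity proxy: longer submissions get the deeper (slower) model.
    complex_input = len(request.source.splitlines()) > 200
    return {
        "system_prompt": DOMAIN_PROMPTS[request.domain],
        "model": "gemini-3-pro" if complex_input else "gemini-3-flash",
    }
```

In practice the complexity heuristic would look at more than line count, but keeping routing as a pure function makes it easy to test independently of the model API.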

AI Integration: We extensively experimented with Gemini 3's new thinking mode, tuning thinking_budget parameters to balance depth vs. speed. Each domain has custom system prompts that shape agent personalities and analysis approaches.
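The per-domain thinking_budget tuning can be pictured as a small config builder. The budget numbers below are illustrative guesses, and the real Google GenAI SDK uses its own typed config objects rather than plain dicts:

```python
# Hypothetical per-domain reasoning budgets (tokens), trading depth vs. speed.
THINKING_BUDGETS = {
    "game": 4096,      # deep search for exploitable edge cases
    "software": 8192,  # thorough security and architecture review
    "learning": 2048,  # lighter reasoning, faster feedback for mentoring
    "support": 2048,   # quick bug-reproduction loops
}

def build_generation_config(domain: str, max_budget: int = 8192) -> dict:
    """Return a generation config with a clamped per-domain thinking budget."""
    budget = min(THINKING_BUDGETS.get(domain, 1024), max_budget)
    return {
        "thinking_config": {"thinking_budget": budget, "include_thoughts": True},
        "temperature": 0.3,
    }
```

Centralizing the budgets this way makes the depth-vs-latency trade-off (discussed under Challenges below) a single table to tune rather than values scattered across prompts.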

DevOps: Created a zero-config deployment system (start.sh) that automatically handles Python virtual environments, Node.js dependencies, and server synchronization.

Challenges we ran into

  1. Thinking Budget Optimization: Finding the right balance between reasoning depth and response time was tricky. Too low and agents missed subtle issues; too high and responses became slow.

  2. Multi-Domain Prompt Engineering: Creating distinct agent "personalities" that felt authentic across Game QA, Security, Education, and Support required extensive iteration and testing.

  3. Real-time Log Streaming: Displaying Gemini 3's internal reasoning process without overwhelming the UI required careful UX design and data throttling.

  4. Language Detection Accuracy: Building a reliable auto-detection system that handles edge cases across Python, JavaScript, TypeScript, C#, C++, and Java proved harder than expected.

  5. Demo vs. Production Mode: Designing a seamless experience that works impressively in demo mode while gracefully upgrading to real Gemini 3 analysis when API keys are provided.
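The data throttling from challenge 3 can be illustrated with a simple chunk-coalescing generator. The batch size is a made-up example value; the real implementation would likely throttle by time as well as by count:

```python
from typing import Iterable, Iterator

def throttle_chunks(chunks: Iterable[str], batch_size: int = 5) -> Iterator[str]:
    """Coalesce a stream of log chunks into joined batches of at most batch_size.

    Emitting batches instead of individual tokens keeps the UI repaint rate
    manageable while still feeling "live" to the viewer.
    """
    batch: list[str] = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) >= batch_size:
            yield "".join(batch)
            batch = []
    if batch:  # flush any trailing partial batch
        yield "".join(batch)
```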

Accomplishments that we're proud of

  • Universal Platform: A single system that genuinely serves 4 distinct use cases with domain-specific intelligence
  • Thinking Transparency: Successfully exposing Gemini 3's reasoning process in an engaging, understandable way
  • Zero-Config Launch: One command (./start.sh) gets everything running—no manual setup required
  • Production-Ready UI: A polished, responsive interface that feels like a premium developer tool
  • Automated Fixes: Not just finding bugs, but showing exactly how to fix them with code diffs

What we learned

  • Gemini 3's Thinking Mode is Powerful: The ability to see intermediate reasoning dramatically improves trust and debuggability
  • Domain-Specific Prompts Matter: Generic AI agents are good; specialized agents with context are exceptional
  • Developer Experience is Critical: Even the most powerful AI is useless if the UX is confusing—we invested heavily in making the tool intuitive
  • Agent Orchestration is Complex: Managing multiple AI personalities, streaming responses, and maintaining context requires careful architecture
  • The Gap Between Demo and Production: Building something that impresses in 2 minutes AND delivers value over 2 months requires different design thinking

What's next for Chaos Engine V3

  1. Team Collaboration Features: Allow multiple developers to run coordinated Red Team analyses on shared codebases
  2. Custom Agent Training: Let users fine-tune agents with their own coding standards and domain knowledge
  3. CI/CD Integration: Automated Chaos Engine runs on every pull request with configurable severity thresholds
  4. Expanded Domain Support: Add agents for Mobile App QA, API Testing, Database Optimization, and DevOps auditing
  5. Multi-Model Orchestration: Combine Gemini 3 with specialized models for code generation, vulnerability detection, and performance profiling
  6. Historical Analysis: Track how code quality evolves over time with trend analytics and regression detection

Built With

  • bash
  • fastapi
  • framer-motion
  • gemini-3-flash
  • gemini-3-pro
  • google-genai-sdk
  • javascript/typescript
  • lucide
  • next.js
  • node.js
  • pydantic
  • react
  • tailwind-css