💡 InspirationAs we transition from passive chatbots to active "X-Agents" that can read emails and execute tools, the security surface area explodes. I was inspired by the subtle but dangerous threat of Indirect Prompt Injection—where an agent reads a malicious email and unknowingly executes harmful instructions. I wanted to prove that we don't just need better prompts; we need an autonomous "immune system" for our agents. AXIOM (Adversarial X-agent Intelligence & Optimization Manager) was born from the idea that the best defense against an AI agent is another, more specialized AI agent.

🛠️ How I built it I architected AXIOM as a closed-loop arena containing two distinct entities: The Victim (OfficeManager): A helpful assistant configured with intentional vulnerabilities (trusted access to a mock internal database). The Guardian (AXIOM): A "Red Team" agent that analyzes inputs, detects logic flaws, and generates defensive patches.

The core workflow relies heavily on Gemini 3 Pro's reasoning capabilities. Instead of just flagging a threat, AXIOM analyzes the reasoning trace of the victim agent to understand why it failed. I used a purely local state management system (JSON-based) to ensure the demo is deterministic and "Appleix-ready" for future mobile conversion.

🚧 Challenges I faced The biggest challenge was the "Safety Paradox." To demonstrate a security tool, I had to build an agent capable of generating attacks. The Safety Filter: Early versions of the Attacker Agent were constantly blocked by safety filters. I solved this by strictly framing the system prompt as a "Security Research Simulation" within a sandboxed environment. Structured Reasoning: Getting the model to output structured analysis (Assumptions vs. Decisions) without leaking the raw Chain-of-Thought was difficult. I implemented a strict parsing layer to separate the internal logic from the user-facing dashboard.

📚 What I learned I learned that defensive reasoning is harder than offensive generation. Generating a "patch" that blocks an attack without breaking legitimate functionality requires a high degree of nuance. I also formalized the metric for agent security using Mean Time to Exploit (MTTE), which we calculate as:$$MTTE = t_{exploit} - t_{start}$$Where $t_{exploit}$ is the timestamp of the secret leak and $t_{start}$ is the initialization of the attack vector. AXIOM effectively drives this value to infinity (blocked) after the auto-patch phase.

Built With

Share this project:

Updates