💡 Inspiration The rapid rise of LLM-powered agents like OpenAI's Operator, LangChain, and others brings immense capabilities — but also increased risk. As these systems become more autonomous and widely adopted, the potential for misuse through prompt injection or jailbreak attacks grows. Inspired by the AgentHacks AI Safety & Control track, we set out to red-team LLM agents and build a tool that proactively identifies vulnerabilities before they can be exploited in the real world.
🛠️ What We Built We built a jailbreak evaluation framework for LLM agents. This system:
Curates a set of adversarial prompts designed to bypass safety boundaries of LLMs.
Automatically evaluates responses from models like GPT-4o against these prompts.
Flags unsafe or inappropriate outputs using rule-based and optional semantic evaluation.
Is modular and built to support multi-LLM backends (e.g., Claude, Gemini).
Outputs logs and metrics that can be used to measure and visualize model safety performance.
🧠 What We Learned How different LLM architectures and providers handle adversarial prompts.
How to build red-teaming tools for generative models and agents.
Tradeoffs between flexibility, interpretability, and safety in prompt evaluation.
How to integrate LangChain, OpenAI APIs, and build scalable tools for evaluation.
🧱 How We Built It Built on Python, with OpenAI’s latest openai>=1.x SDK.
Used LangChain for agent simulation and orchestration.
Created a JSON/YAML-based prompt storage system for running batch tests.
Designed an evaluation engine that takes model outputs and applies rules to determine whether safety policies are violated.
Documented everything and made it plug-and-play for other LLMs.
🚧 Challenges We Faced Models evolve constantly, making jailbreaks inconsistent across versions.
LLMs often respond subtly to adversarial prompts, requiring nuanced evaluation logic.
Rate-limiting and API access made bulk evaluations slower than expected.
Some tools like LangChain and OpenAI SDK had breaking changes, which required frequent debugging and upgrades.
Built With
- api
- javascript
- llm
- openai
- python
Log in or sign up for Devpost to join the conversation.