Detecting Visual Adversarial Examples Jailbreaking LLMs

This study explores how visual adversarial examples can influence Large Language Models (LLMs), focusing on their ability to jailbreak the models, that is, to bypass their safeguards. Our work reimplements key ideas and methods presented in the original paper. In addition, we built a classification model that recognizes adversarial patterns in images and distinguishes clean examples from adversarial ones.
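The kind of classification model described above can be illustrated with a minimal PyTorch sketch. This is not the project's actual detector: the `AdvDetector` name, the layer sizes, the 32x32 input resolution, and the random stand-in data are all assumptions made for illustration. It only shows the general shape of a binary clean-versus-adversarial image classifier trained with a logistic loss.

```python
import torch
from torch import nn


class AdvDetector(nn.Module):
    """Hypothetical small CNN that outputs one logit per image.

    A positive logit means "adversarial", a negative one means "clean".
    The architecture is an illustrative assumption, not the project's model.
    """

    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool -> (N, 32, 1, 1)
        )
        self.head = nn.Linear(32, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Flatten pooled features to (N, 32), then map to one logit per image.
        return self.head(self.features(x).flatten(1)).squeeze(1)


def train_step(model, optimizer, images, labels):
    """Run one gradient step; labels are 1.0 (adversarial) or 0.0 (clean)."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = AdvDetector()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Random tensors stand in for a real dataset of clean/adversarial images.
    images = torch.rand(8, 3, 32, 32)
    labels = torch.tensor([0.0, 1.0] * 4)
    print("loss:", train_step(model, optimizer, images, labels))
```

In practice the detector would be trained on pairs of clean images and their adversarially perturbed counterparts, and a threshold on the sigmoid of the logit would decide whether an image is passed on to the LLM.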

Detecting adversarial examples is crucial for building more secure, reliable AI systems that can withstand attacks. As LLMs integrate more modalities, they inherit vulnerabilities from computer vision, adversarial examples in particular. This creates an opportunity for malicious parties to bypass the safety guardrails just by adding some noise to an image. A successful attack means a catastrophic failure of the targeted LLM's safety mechanisms, and the consequences could be severe: uncontrolled generation of harmful content enables misuse of the models and causes harm to society.

Built With

pytorch
tensorflow

Updates

Yasmin Kadyrova started this project — Dec 10, 2025 03:59 PM EST
