This study explores how adversarial examples can influence Large Language Models (LLMs), focusing on their ability to jailbreak a model, that is, to bypass its safeguards. Our work reimplements key ideas and methods from the original paper. In addition, we built a classification model that recognizes adversarial patterns in images and distinguishes clean examples from adversarial ones.
Detecting adversarial examples is crucial for building more secure, reliable AI systems that can withstand attacks. As LLMs integrate more modalities, they inherit vulnerabilities from computer vision, adversarial examples in particular. This creates an opportunity for malicious parties to bypass safety guardrails simply by adding noise to an image. A successful attack means a catastrophic failure of the targeted LLM's safety mechanisms, and the resulting uncontrolled generation of harmful content could lead to widespread misuse of the models and harm to society.
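To make "adding noise to an image" concrete, here is a minimal sketch of a single FGSM-style step (fast gradient sign method, a standard technique) against a toy logistic-regression model. It is an illustration only, not necessarily the attack used in this project or in the original paper. The 4-pixel "image", the weights, and the helper names (`predict`, `fgsm_step`) are all hypothetical, and the code uses only the Python standard library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, pixels):
    """Probability the toy linear model assigns to class 1."""
    return sigmoid(sum(w * p for w, p in zip(weights, pixels)) + bias)


def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)


def fgsm_step(weights, bias, pixels, label, epsilon):
    """One FGSM-style step: shift every pixel by epsilon in the direction
    that increases the cross-entropy loss, then clip back to [0, 1]."""
    p = predict(weights, bias, pixels)
    # For logistic regression, d(loss)/d(pixel_i) = (p - label) * w_i.
    grads = [(p - label) * w for w in weights]
    return [min(1.0, max(0.0, x + epsilon * sign(g)))
            for x, g in zip(pixels, grads)]


# Hypothetical values chosen for illustration; not data from the project.
weights = [2.0, -1.5, 1.0, -2.5]
bias = 0.0
clean = [0.6, 0.4, 0.5, 0.3]

adversarial = fgsm_step(weights, bias, clean, label=1, epsilon=0.1)

print("clean prediction:      ", predict(weights, bias, clean))
print("adversarial prediction:", predict(weights, bias, adversarial))
```

With these toy numbers, no pixel moves by more than 0.1, yet the class-1 probability drops from above 0.5 to below it. A detector like the classification model described above would be trained to tell such perturbed inputs apart from clean ones.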
Built With
- pytorch
- tensorflow