🤖 Poisoned Perception: Attacking VLA Models

💡 Inspiration: Why We Looked Deeper

The rapid advancement of Visual-Language-Action (VLA) models promises truly autonomous robotics. However, our initial question was: How vulnerable are these multi-modal systems when the attack is subtle, targeting meaning rather than just pixels? We realized existing adversarial attacks focused on low-level noise, ignoring the complex semantic alignment that links vision, language, and action. This led us to explore Harmful Semantic Visual Injection (HSVI), an attack designed to corrupt the robot's fundamental understanding of its goal, even when the image looks clean to a human. We were inspired to build a more robust future by first uncovering its deepest vulnerabilities. 🛡️

🏗️ How We Built It: The HSVI Mechanism

Our project centered on crafting the Harmful Semantic Visual Injection (HSVI) perturbation: a slight, calculated layer of visual noise optimized to minimize the semantic distance between the robot's perceived goal and the attacker's malicious target.

The VLA model operates in three stages: Vision, Language Mapping, and Action Planning. We specifically targeted the second stage, Language Mapping, to break the critical semantic alignment. We formulated the attack as an optimization problem: find the perturbation that minimizes our specially developed Semantic Loss function. This function measures how far the model's output, given the original clean image plus the perturbation, deviates from the output associated with the malicious target outcome.
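Below is a minimal PGD-style sketch of how this optimization could look. It assumes the model exposes a differentiable embedding from its language-mapping stage; the `encode_semantics` name and signature are placeholders we use for illustration, not part of any real VLA API.

```python
import torch

def hsvi_attack(vla_model, clean_image, instruction, malicious_goal,
                epsilon=8 / 255, alpha=1 / 255, steps=200):
    """PGD-style sketch of the HSVI optimization loop.

    `vla_model.encode_semantics(image, text)` is assumed to return the
    differentiable embedding produced by the language-mapping stage; the
    name and signature are illustrative placeholders.
    """
    # Embedding the attack steers the model toward.
    with torch.no_grad():
        target_emb = vla_model.encode_semantics(clean_image, malicious_goal)

    delta = torch.zeros_like(clean_image, requires_grad=True)

    for _ in range(steps):
        perturbed = (clean_image + delta).clamp(0.0, 1.0)
        emb = vla_model.encode_semantics(perturbed, instruction)

        # Semantic loss: cosine distance between the model's current
        # interpretation and the malicious target's embedding.
        loss = 1.0 - torch.nn.functional.cosine_similarity(
            emb.flatten(1), target_emb.flatten(1)).mean()
        loss.backward()

        with torch.no_grad():
            # Descend on the perturbation, then project it back into the
            # imperceptible L-infinity ball of radius epsilon.
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
            delta.grad.zero_()

    return (clean_image + delta).clamp(0.0, 1.0).detach()
```

The L-infinity projection is what keeps the noise imperceptible, while the cosine term pulls the semantic interpretation toward the attacker's goal.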

This meticulous approach allowed us to generate imperceptible visual noise that drastically shifted the robot's computed action trajectory, proving the attack's effectiveness. 🛠️

🎓 What We Learned: The Vulnerable Core

Our research revealed a critical insight: the most significant vulnerability in VLA models lies not in the raw pixel-level processing, but in the semantic interpretation layer. We learned that even a slight, calculated perturbation can push the robot's computed action trajectory far off its normal path on the learned action manifold (visualized clearly in our 3D action trajectory plots). Specifically, the Interpretability Heatmaps confirmed that HSVI successfully makes the model shift its attention or misinterpret the meaning of objects, leading it to plan a completely different, often dangerous, action. This deepened our understanding of the fragile link between perception and high-level reasoning in robotics. 🧠
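The drift visible in those trajectory plots can be summarized with a simple waypoint-deviation metric. The sketch below assumes the planner emits a (T, 3) array of end-effector waypoints, which is our own simplification of the output format.

```python
import numpy as np

def trajectory_deviation(clean_traj, adv_traj):
    """Mean and maximum Euclidean distance between corresponding
    waypoints of the clean and adversarial trajectories, each (T, 3)."""
    dists = np.linalg.norm(clean_traj - adv_traj, axis=1)
    return dists.mean(), dists.max()
```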

⛰️ Challenges We Faced: Bridging Modalities

The biggest challenge was the complexity of the VLA architecture itself. Optimizing a perturbation that must survive the Vision Encoder, pass through the Semantic Language Mapper, and ultimately influence the Action Planner proved difficult. We struggled initially with gradient calculation across the non-differentiable parts of the system. Another hurdle was achieving high attack success rates while keeping the perturbation visually imperceptible—a classic trade-off in adversarial attacks. Overcoming these challenges required careful tuning of the optimization process and precise formulation of the semantic loss function. 💪
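For the non-differentiable pieces, one standard workaround is a straight-through estimator, which keeps a hard operation (such as action-token binning) in the forward pass but lets gradients flow unchanged in the backward pass. The sketch below illustrates the idea; it is an assumption about the architecture, not our exact implementation.

```python
import torch

class StraightThroughQuantize(torch.autograd.Function):
    """Hard quantization forward, identity gradient backward, so the
    perturbation can still be optimized through a discretization step."""

    @staticmethod
    def forward(ctx, x, num_bins=256):
        # Non-differentiable binning in the forward pass.
        return torch.round(x * (num_bins - 1)) / (num_bins - 1)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass the gradient straight through; None covers num_bins.
        return grad_output, None

# Usage: quantized = StraightThroughQuantize.apply(continuous_actions)
```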
