Fireboy & Watergirl: Collaborative Reinforcement Learning

Inspiration

While the AI landscape has exploded with work on human-robot collaboration and reactive AI systems, a critical gap remains: true agent-to-agent collaboration. Most multi-agent systems operate independently or competitively, but what if AI agents could genuinely cooperate—predicting each other's movements, coordinating strategies, and maximizing collective efficiency?

We drew inspiration from the classic Fireboy & Watergirl puzzle game, where two characters must work together to solve environmental challenges. Unlike human-AI collaboration where one entity adapts to the other, we envisioned symmetric cooperation: two AI agents learning to trust, communicate, and synchronize their actions through pure environmental understanding. This mirrors real-world scenarios where autonomous systems must collaborate without explicit communication channels—like warehouse robots coordinating deliveries or drones executing search-and-rescue missions.

What it does

Our system trains two reinforcement learning agents to master cooperative puzzle-solving in a physics-based 2D environment inspired by Fireboy & Watergirl. The agents must:

Coordinate pressure plate activation: Both agents must simultaneously hold down separate switches to unlock pathways, requiring perfect timing and sustained cooperation
Perceive the environment in 360°: An 18-directional "LIDAR-like" clearance system gives agents spatial awareness, helping them navigate complex platformer geometry
Navigate hazards intelligently: Fire agent avoids water pools while Water agent can traverse them safely—each agent must understand their unique constraints and their partner's capabilities
Predict landing trajectories: Using physics-based simulation, agents forecast their jump trajectories, avoiding hazards before committing to risky moves
Learn emergent communication: Through a shared neural network backbone, agents develop implicit coordination strategies without explicit message-passing

How we built it

Architecture: Hybrid Parameter Sharing with Mutual Accountability

We implemented a novel cooperative architecture where agents share a common "world understanding" backbone but maintain individual decision-making heads. This creates an interesting dynamic: when one agent fails to cooperate, the shared weights penalize both agents, forcing them to internalize the importance of teamwork.

Key Components:

52-Dimensional Enhanced State Space
- 30D base state: position, velocity, plate/goal distances, directional vectors
- 6D safety predictions: per-action hazard forecasts using trajectory simulation
- 8D hazard proximity: directional clearance to deadly obstacles
- 8D partner awareness: real-time tracking of teammate state and progress
Two-Phase Reward Structure
- Phase 1 - Collaboration First: Zero goal progress rewards until both plates active
- Phase 2 - Goal Pursuit: Full distance rewards + maintained cooperation bonuses
- Sustained cooperation bonus: +20 points for holding plate 30+ consecutive frames
- No-jump stability bonus: +5 points/frame for stationary plate holding
Synchronized Training with Mutual Penalties
- Shared backbone optimizer updates both agents simultaneously
- Cooperation penalties flow through shared weights (L_total = L_fire + L_water + L_cooperation)
- If Water fails to activate its plate, the resulting penalty also updates the backbone features that Fire relies on
- Gradient clipping (max_norm=1.0) prevents training instability
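
The shared-backbone design and synchronized update described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual network: the hidden size, the choice of 6 actions (inferred from the 6D per-action safety predictions), and the names SharedBackbonePolicy and joint_update are all assumptions. Only the 52-D input, the summed loss, and max_norm=1.0 come from the description.

```python
import torch
import torch.nn as nn

STATE_DIM = 52  # 30 base + 6 safety + 8 hazard + 8 partner (from the list above)
N_ACTIONS = 6   # assumption: one action per entry of the 6D safety predictions


class SharedBackbonePolicy(nn.Module):
    """One shared feature extractor with a separate output head per agent."""

    def __init__(self, hidden: int = 128):  # hidden size is an assumption
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(STATE_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleDict({
            "fire": nn.Linear(hidden, N_ACTIONS),
            "water": nn.Linear(hidden, N_ACTIONS),
        })

    def forward(self, state: torch.Tensor, agent: str) -> torch.Tensor:
        # Both agents pass through the same backbone, then their own head.
        return self.heads[agent](self.backbone(state))


def joint_update(model, optimizer, loss_fire, loss_water, loss_coop, max_norm=1.0):
    """Sum the three losses and take a single gradient-clipped optimizer step."""
    total = loss_fire + loss_water + loss_coop
    optimizer.zero_grad()
    total.backward()  # gradients from every term reach the shared backbone
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return total.item(), float(grad_norm)
```

Because all three loss terms are backpropagated in one pass, a cooperation penalty caused by either agent changes the backbone weights that both heads read from, which is the "mutual accountability" effect described above.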

Technology Stack

PyTorch: Neural network architecture and training
Pygame: Real-time visualization and physics simulation
NumPy: Efficient state vector computations
Weights & Biases: Training metrics, loss curves, and cooperation analytics

Challenges we ran into

The Control Channel Odyssey

Choosing how to control the agents became our biggest architectural decision. We explored three radically different approaches:

Imitation Learning: We recorded human gameplay and attempted behavioral cloning. Problem: Humans don't play both characters simultaneously—our model learned human coordination patterns but failed at true multi-agent sync.
LLM Control: We experimented with Anthropic Claude 4 to generate action sequences based on game state descriptions. Problem: 500ms+ latency per decision made real-time gameplay impossible, and language models struggled with precise spatial reasoning about pixel-perfect jumps.
Reinforcement Learning: Deep Q-Learning evolved into Actor-Critic, then into our final hybrid shared-backbone system. Success, but: Training took 8-12 hours, requiring us to train overnight and iterate slowly. Our individual machines lacked the computing power to train efficiently, so we could not run enough episodes to fully refine the RL model.

The "Going Left When Should Go Right" Debugging

At episode 200, our Fire agent moved left toward Water's plate instead of right toward its own. The fix required:

Separating partner state from self-objectives, and weighting self-objectives higher than partner state
1.5x horizontal movement bonus toward assigned plates

Hazard Avoidance: The Jump-of-Death Problem

Early agents loved jumping into water pools. Physics-based trajectory prediction solved this, but introduced a new challenge: raycast intersection math for arbitrary rectangles. Our first_intersection_t_with_rect() function required careful handling of edge cases where rays graze corners vs. penetrate surfaces.
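
To illustrate the grazing-versus-penetrating distinction, here is a minimal sketch of a ray-rectangle test using the standard slab method. It is not the project's first_intersection_t_with_rect(); the function name ray_rect_entry_t, the (left, top, width, height) rectangle layout, and the decision to treat corner and edge touches as misses are assumptions made for the example.

```python
import math
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, width, height), assumed layout


def ray_rect_entry_t(origin: Tuple[float, float],
                     direction: Tuple[float, float],
                     rect: Rect,
                     eps: float = 1e-9) -> Optional[float]:
    """Smallest t >= 0 at which origin + t * direction enters rect, else None.

    A ray that only touches a corner or slides along an edge is a miss.
    """
    left, top, w, h = rect
    t_near, t_far = -math.inf, math.inf
    slabs = ((origin[0], direction[0], left, left + w),
             (origin[1], direction[1], top, top + h))
    for o, d, lo, hi in slabs:
        if abs(d) < eps:
            # Ray is parallel to this slab: it must start strictly inside it.
            if o <= lo or o >= hi:
                return None
            continue
        t1, t2 = (lo - o) / d, (hi - o) / d
        if t1 > t2:
            t1, t2 = t2, t1
        t_near, t_far = max(t_near, t1), min(t_far, t2)
    # Zero-length overlap means a corner graze; t_far < 0 means rect is behind.
    if t_far - t_near <= eps or t_far < 0:
        return None
    return max(t_near, 0.0)  # 0.0 when the origin is already inside


wall = (10.0, 0.0, 10.0, 10.0)
print(ray_rect_entry_t((0.0, 5.0), (1.0, 0.0), wall))  # penetrates: 10.0
print(ray_rect_entry_t((0.0, 0.0), (1.0, 1.0), wall))  # touches corner: None
```

Treating grazes as misses is a design choice: it keeps an agent walking flush along a wall or floor from registering a spurious hit on every frame.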

Accomplishments that we're proud of

Building RL from First Principles

We didn't use pre-built multi-agent libraries (no RLlib, no PettingZoo). Every component—from the physics engine to the safety system—was hand-crafted:

18-Directional "LIDAR" System: Casts 360° rays at 20° intervals, detecting obstacles up to 500px away. Agents use this for spatial awareness, open-space rewards, and hazard avoidance—all without explicit game knowledge.
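
A rough sketch of such a ray fan is shown below. For brevity it uses coarse ray marching (stepping along each ray and testing points) instead of exact intersection math, so it is a simplification rather than the project's implementation. The function name lidar_clearances, the rectangle layout, the 1 px step, and the normalization to [0, 1] are assumptions; the 18 rays, 20° spacing, and 500 px range come from the description.

```python
import math
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, width, height), assumed layout
N_RAYS = 18        # 360 degrees / 20 degrees per ray
MAX_RANGE = 500.0  # detection range in pixels


def point_in_rect(x: float, y: float, rect: Rect) -> bool:
    left, top, w, h = rect
    return left <= x <= left + w and top <= y <= top + h


def lidar_clearances(origin: Tuple[float, float],
                     obstacles: Sequence[Rect],
                     step: float = 1.0) -> List[float]:
    """Return one clearance per ray in [0, 1]; 1.0 means nothing within range."""
    ox, oy = origin
    n_steps = int(MAX_RANGE / step)
    readings = []
    for i in range(N_RAYS):
        angle = math.radians(i * 360.0 / N_RAYS)
        dx, dy = math.cos(angle), math.sin(angle)
        dist = MAX_RANGE
        for k in range(1, n_steps + 1):
            t = k * step
            if any(point_in_rect(ox + t * dx, oy + t * dy, r) for r in obstacles):
                dist = t  # first sampled point that lies inside an obstacle
                break
        readings.append(dist / MAX_RANGE)
    return readings


# A wall 100 px to the right of the agent shortens only the rays that face it.
readings = lidar_clearances((0.0, 0.0), [(100.0, -50.0, 20.0, 100.0)])
print(len(readings), readings[0], readings[9])
```

The resulting 18 values can be fed to the agent as-is; marching with a 1 px step trades precision for simplicity, and an exact intersection routine would replace the inner loop.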

Trajectory Prediction Engine: 100-frame lookahead simulation using real game physics (gravity, terminal velocity, collision detection). This is not a learned model—it's deterministic physics giving agents "mental simulation" abilities.
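
The lookahead idea can be sketched as a small deterministic rollout. This is an illustrative point-mass version under stated assumptions: the GRAVITY and TERMINAL_VY values, the function name predict_landing, and the rectangle layout are hypothetical, and the real game's collision handling is likely more detailed. Only the 100-frame horizon and the use of gravity, terminal velocity, and collision checks come from the description.

```python
from typing import Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, width, height), assumed layout
GRAVITY = 0.5        # px/frame^2, assumed value
TERMINAL_VY = 12.0   # px/frame, assumed value
LOOKAHEAD = 100      # frames, from the description


def contains(rect: Rect, x: float, y: float) -> bool:
    left, top, w, h = rect
    return left <= x <= left + w and top <= y <= top + h


def predict_landing(pos: Tuple[float, float],
                    vel: Tuple[float, float],
                    platforms: Sequence[Rect],
                    hazards: Sequence[Rect]):
    """Roll a point mass forward; return (outcome, frame, (x, y)).

    outcome is "hazard", "platform", or "airborne" (nothing hit in time).
    Thin rectangles can be skipped at high speed; this sketch ignores that.
    """
    x, y = pos
    vx, vy = vel
    for frame in range(1, LOOKAHEAD + 1):
        vy = min(vy + GRAVITY, TERMINAL_VY)  # y grows downward, as in Pygame
        x, y = x + vx, y + vy
        if any(contains(h, x, y) for h in hazards):
            return "hazard", frame, (x, y)
        if vy > 0 and any(contains(p, x, y) for p in platforms):
            return "platform", frame, (x, y)
    return "airborne", LOOKAHEAD, (x, y)


floor = (-200.0, 100.0, 1000.0, 20.0)
pool = (100.0, 100.0, 60.0, 20.0)  # hazard strip embedded in the floor
print(predict_landing((0.0, 100.0), (4.0, -8.0), [floor], [pool]))  # long jump
print(predict_landing((0.0, 100.0), (2.0, -8.0), [floor], [pool]))  # short jump
```

With these assumed constants the long jump comes down inside the pool while the short jump lands on solid floor, which is the kind of per-action hazard forecast an agent can check before committing to a move.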

Emergent Cooperation Behavior: By episode 500, agents exhibit sophisticated teamwork:

Water rushes to left plate, Fire to right plate (learned role assignment)
Both hold plates for 30+ frames without explicit "wait" commands
Fire avoids jumping on its plate (learned stability = better cooperation)
Success rate climbs from 0% → 90% without hardcoded coordination logic

What we learned

Technical Insights

Shared Backbones Are Powerful: Forcing agents to share world-understanding weights creates natural cooperation. The mutual penalty system acts like a "shared fate" mechanism—you can't succeed unless your partner does.
Physics Beats Learned Models (Sometimes): Our trajectory predictor is deterministic physics, not a learned neural network. Because it runs the same update rules as the game, it matches the game's behavior exactly, requires zero training, and carries over to any level built on the same physics. Lesson: Don't overcomplicate. Sometimes classical methods win.
Reward Shaping is Everything: Our two-phase reward structure (collaboration → goal pursuit) was the breakthrough. Early experiments with "always reward goal distance" created selfish agents that ignored teammates.
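
As a rough illustration of the two-phase idea, the gating can be written as a single per-frame reward function. This is a sketch under assumptions, not the project's reward code: the AgentStep fields, the choice to pay the +20 bonus once when a hold streak reaches 30 frames, and a goal reward equal to raw distance progress are all hypothetical. The +20, +5, and 30-frame figures come from the reward structure described earlier.

```python
from dataclasses import dataclass

SUSTAIN_FRAMES = 30     # consecutive frames on the plate
SUSTAIN_BONUS = 20.0    # paid for a sustained hold
STABILITY_BONUS = 5.0   # per frame of holding the plate without jumping


@dataclass
class AgentStep:
    on_plate: bool        # agent is standing on its assigned plate this frame
    jumped: bool          # agent chose a jump action this frame
    hold_streak: int      # consecutive frames on the plate, including this one
    goal_progress: float  # decrease in distance to the goal this frame (px)


def shaped_reward(me: AgentStep, both_plates_active: bool) -> float:
    """Per-frame reward: cooperation terms always, goal terms only in phase 2."""
    reward = 0.0
    if me.on_plate:
        if not me.jumped:
            reward += STABILITY_BONUS
        if me.hold_streak == SUSTAIN_FRAMES:  # assumption: paid once per streak
            reward += SUSTAIN_BONUS
    # Phase gate: goal progress is worth nothing until both plates are active.
    if both_plates_active:
        reward += me.goal_progress
    return reward


# Phase 1: moving toward the goal alone earns nothing.
print(shaped_reward(AgentStep(False, False, 0, 3.0), both_plates_active=False))  # 0.0
# Phase 1: the 30th steady frame on the plate earns 5 + 20.
print(shaped_reward(AgentStep(True, False, 30, 0.0), both_plates_active=False))  # 25.0
```

The gate is what removes the incentive for the "selfish" behavior: an agent that heads for the goal before its partner is in place collects no reward for doing so.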

What's next for Fireboy & Watergirl: Collaborative Reinforcement Learning

Immediate Extensions

Multi-Level Curriculum Learning: Train on progressively harder levels, transferring learned cooperation skills to more complex puzzles with elevators, teleporters, and timing-based challenges.

Heterogeneous Agent Teams: Extend beyond two agents—what about Fire, Water, and Ice agents with unique abilities? This mirrors real-world robot teams with specialized tools.

Human-AI Hybrid Mode: Let a human control one agent while the AI adapts to their play style in real-time. This tests whether our cooperative architecture generalizes to unpredictable partners.

Industry Applications

Our architecture directly applies to real-world multi-agent coordination problems:

Warehouse & Logistics Automation

Amazon Robotics-Style Systems: Multiple autonomous forklifts must coordinate to move pallets, avoid collisions, and optimize pathfinding. Our shared-backbone approach could enable implicit communication without bandwidth-heavy message-passing.
Delivery Drone Swarms: Coordinating battery levels, package priorities, and airspace conflicts requires agents that understand collective objectives.

Humanoid Robot Collaboration

Cargo Handling: Two Boston Dynamics Atlas robots carrying a heavy beam must synchronize gait, adjust grip based on partner stability, and navigate obstacles together. Our trajectory prediction + mutual penalties framework applies directly.
Assembly Line Coordination: Industrial robots must sequence tasks (e.g., "you weld while I hold") without explicit orchestration. Our phase-based rewards (collaboration → individual goals) mirror this workflow.

Search & Rescue Operations

Disaster Response: Drone teams must coordinate coverage areas, share discoveries (implicitly via shared world model), and adapt when teammates fail or get damaged.
Underwater Exploration: AUVs with limited communication must work together using only local environmental observations, which is exactly our agents' scenario.

Research Directions

Emergent Communication Protocols: Can we add a learned "communication channel" where agents develop their own signaling language through the shared backbone?

Adversarial Cooperation: Train against agents that sometimes defect. Can our mutual penalty system create robust cooperation even with unreliable partners?

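Transfer to Physical Robots: Because our agents already train against explicit physics (gravity, terminal velocity, collisions), we expect a smaller sim-to-real gap than in typical RL setups, though we have not yet tested this. Next step: deploy on wheeled robots solving physical puzzle tasks.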

Vision Statement

We believe the future of AI isn't just smarter individual agents—it's agents that can trust each other. In a world where autonomous systems will increasingly work together (self-driving cars coordinating at intersections, robots collaborating in factories, drones executing rescue missions), the ability to cooperate without explicit communication is essential.

Fireboy & Watergirl isn't just a game: it's a testbed for the collaborative intelligence that will power tomorrow's multi-agent systems. And we've shown it can be built from scratch, trained on consumer hardware, and brought to a 90% success rate by around episode 500.

The future is collaborative. We just taught AI how to hold the door open for each other.

Built With

claude
python
pytorch
q-learn
typescript
wandb

Updates

Mclaren Tsang started this project — Oct 12, 2025 04:26 PM EDT
