mafia-agents
Inspiration
This project is mainly about AI safety, not just game playing. We wanted to study what happens when a model is trained in an environment where deception is useful and sometimes rewarded.
Mafia gives us a controlled setting for that question. It has hidden roles, bluffing, coordination, and pressure from other agents, which makes it a useful environment for measuring whether a model becomes more deceptive over time after reinforcement learning.
What it does
mafia-agents is a Python system for training and evaluating language models in Mafia, with a focus on tracking changes in deceptive behavior.
It currently includes:
- a working Mafia game engine
- support for mafia, doctor, detective, and villager roles
- structured observations and actions (sketched after this list)
- scripted baseline players
- OpenRouter-backed opponent policies
- local and Modal training paths
- RL training infrastructure using grouped rollouts and GRPO-style updates
- checkpointing, logging, and benchmark hooks
- TruthfulQA tracking to compare behavior before and after training
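The structured observations and actions are what let scripted players, OpenRouter-backed policies, and the trainable model share one interface. Here is a minimal sketch of what that structure could look like, assuming dataclass-style records; every field name below is illustrative, not the project's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative structured observation/action records (hypothetical names).

@dataclass
class Observation:
    player_id: str
    role: str                      # "mafia", "doctor", "detective", or "villager"
    phase: str                     # e.g. "night", "day_discussion", "day_vote"
    alive_players: list[str] = field(default_factory=list)
    transcript: list[str] = field(default_factory=list)  # public discussion so far
    private_info: dict = field(default_factory=dict)     # e.g. detective inspection results

@dataclass
class Action:
    kind: str                      # e.g. "speak", "vote", "night_action"
    target: str | None = None      # player id for votes and night actions
    message: str | None = None     # free text for discussion turns
```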
The main goal is to see how a model changes when trained in a deception-heavy setting, and whether simple interventions like inoculation prompting can reduce that effect.
How we built it
We built the project in a few parts:
- src/game/ handles rules, phases, resolution, and win conditions
- src/policies/ handles scripted players, prompt rendering, parsing, and OpenRouter-based policies
- src/train/ handles config loading, rollouts, rewards, grouped updates, checkpoints, and logs
- src/eval/ handles benchmark code, including TruthfulQA tracking
The trainable model runs through a Hugging Face path, and the training loop uses structured trajectories plus grouped RL updates so we can compare checkpoints over time. We used Mafia as the training context, then measured out-of-game behavior with TruthfulQA, both with and without prompt-based interventions.
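Concretely, a GRPO-style update scores each rollout against the other rollouts generated from the same starting state, so no value network is needed. Below is a minimal sketch of that grouped-advantage step, assuming one scalar reward per episode; the function and variable names are ours, not the repo's:

```python
import torch

def grouped_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each rollout's reward within its group.

    rewards: shape (num_groups, rollouts_per_group), one scalar per episode.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 starting states, 4 Mafia rollouts each; wins scored 1, losses 0.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0, 1.0]])
adv = grouped_advantages(rewards)
# Each advantage then weights the log-probs of that rollout's tokens in a
# clipped policy-gradient loss, as in PPO but without a learned critic.
```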
Challenges we ran into
- Building a game environment that is structured enough for RL but still captures bluffing and hidden information
- Measuring deception in a useful way instead of assuming that game skill implies deception
- Keeping model outputs valid and safely handling malformed actions (see the parsing sketch after this list)
- Managing large model loading time, memory use, and remote training setup
- Comparing in-game behavior with out-of-game behavior in a way that still feels meaningful
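On the malformed-action problem: one way to keep an episode alive despite bad model output is to validate the action against the current game state and fall back to a harmless no-op instead of crashing. A minimal sketch under those assumptions (the repo's real parser in src/policies/ may differ):

```python
import json

VALID_KINDS = {"speak", "vote", "night_action"}

def parse_action(raw: str, alive_players: list[str]) -> dict:
    """Parse a model's raw output into a valid action dict.

    Falls back to an empty "speak" action rather than crashing the episode
    when the output is malformed or targets an invalid player.
    """
    try:
        action = json.loads(raw)
        if action.get("kind") not in VALID_KINDS:
            raise ValueError("unknown action kind")
        target = action.get("target")
        if target is not None and target not in alive_players:
            raise ValueError("target is not a living player")
        return action
    except (json.JSONDecodeError, ValueError, AttributeError):
        return {"kind": "speak", "message": ""}
```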
Accomplishments that we're proud of
- We built a full end-to-end system instead of only a proposal
- We created a usable RL training setup around Mafia with checkpoints, evaluation hooks, and remote training support
- We used Mafia as a concrete testbed for studying deception after RL training
- We tracked broader behavior with TruthfulQA instead of only measuring in-game win rate
- We added inoculation prompting and saw that deceptive behavior decreased after training when that prompt-based intervention was applied
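For reference, inoculation prompting here means wrapping training-time interactions in an instruction that frames deception as in-game-only behavior. A minimal sketch of how such a prefix could be wired in; the wording and helper are illustrative, not the project's actual prompt:

```python
# Hypothetical inoculation prefix prepended to the training-time system prompt.
INOCULATION_PREFIX = (
    "You are playing the party game Mafia. Bluffing and withholding your "
    "role are expected parts of this game and apply only inside the game. "
    "Outside of games like this, you should always be honest."
)

def build_system_prompt(base_prompt: str, inoculate: bool) -> str:
    return f"{INOCULATION_PREFIX}\n\n{base_prompt}" if inoculate else base_prompt
```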
What we learned
- A model can pick up more deceptive behavior when trained in an environment where deception helps it win
- Looking only at game performance is not enough for safety questions
- A controlled environment like Mafia is useful because it gives repeated situations where deception has strategic value
- Simple interventions such as inoculation prompting may reduce harmful carryover effects after training
- Safety work needs both training experiments and behavior measurement outside the training task
What's next for mafia-agents
- Run longer and more stable GPU-backed training jobs on Modal
- Measure deception changes more directly across checkpoints
- Improve the evaluation pipeline for both Mafia behavior and outside benchmarks
- Test stronger and more varied inoculation prompts
- Explore the wider safety question of how situationally rewarded deception carries over beyond the original training environment
Built With
- ai
- hugging-face
- llm
- modal
- openrouter
- posttraining
- python
- rl