mafia-agents
Inspiration
This project is mainly about AI safety, not just game playing. We wanted to study what happens when a model is trained in an environment where deception is useful and sometimes rewarded.
Mafia gives us a controlled setting for that question. It has hidden roles, bluffing, coordination, and pressure from other agents, which makes it a useful environment for measuring whether a model becomes more deceptive over time after reinforcement learning.
What it does
mafia-agents is a Python system for training and evaluating language models in Mafia, with a focus on tracking changes in deceptive behavior.
It currently includes:
- a working Mafia game engine
- support for mafia, doctor, detective, and villager roles
- structured observations and actions (sketched after this list)
- scripted baseline players
- OpenRouter-backed opponent policies
- local and Modal training paths
- RL training infrastructure using grouped rollouts and GRPO-style updates
- checkpointing, logging, and benchmark hooks
- TruthfulQA tracking to compare behavior before and after training
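The structured observations and actions are what let scripted players, OpenRouter-backed policies, and the trainable model share one interface. Here is a minimal sketch of what that structure could look like, assuming dataclass-style records; every field name below is illustrative, not the project's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative structured observation/action records (hypothetical names).

@dataclass
class Observation:
    player_id: str
    role: str                      # "mafia", "doctor", "detective", or "villager"
    phase: str                     # e.g. "night", "day_discussion", "day_vote"
    alive_players: list[str] = field(default_factory=list)
    transcript: list[str] = field(default_factory=list)  # public discussion so far
    private_info: dict = field(default_factory=dict)     # e.g. detective inspection results

@dataclass
class Action:
    kind: str                      # e.g. "speak", "vote", "night_action"
    target: str | None = None      # player id for votes and night actions
    message: str | None = None     # free text for discussion turns
```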
The main goal is to see how a model changes when trained in a deception-heavy setting, and whether simple interventions like inoculation prompting can reduce that effect.
How we built it
We built the project in a few parts:
- src/game/ handles rules, phases, resolution, and win conditions
- src/policies/ handles scripted players, prompt rendering, parsing, and OpenRouter-based policies
- src/train/ handles config loading, rollouts, rewards, grouped updates, checkpoints, and logs
- src/eval/ handles benchmark code, including TruthfulQA tracking
The trainable model runs through a Hugging Face path, and the training loop uses structured trajectories plus grouped RL updates so we can compare checkpoints over time. We used Mafia as the training context, then measured out-of-game behavior with TruthfulQA, both with and without prompt-based interventions.
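Concretely, a GRPO-style update scores each rollout against the other rollouts generated from the same starting state, so no value network is needed. Below is a minimal sketch of that grouped-advantage step, assuming one scalar reward per episode; the function and variable names are ours, not the repo's:

```python
import torch

def grouped_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each rollout's reward within its group.

    rewards: shape (num_groups, rollouts_per_group), one scalar per episode.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 starting states, 4 Mafia rollouts each; wins scored 1, losses 0.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0, 1.0]])
adv = grouped_advantages(rewards)
# Each advantage then weights the log-probs of that rollout's tokens in a
# clipped policy-gradient loss, as in PPO but without a learned critic.
```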
Challenges we ran into
- Building a game environment that is structured enough for RL but still captures bluffing and hidden information
- Measuring deception in a useful way instead of assuming that game skill implies deception
- Keeping model outputs valid and safely handling malformed actions (see the parsing sketch after this list)
- Managing large model loading time, memory use, and remote training setup
- Comparing in-game behavior with out-of-game behavior in a way that still feels meaningful
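On the malformed-action problem: one way to keep an episode alive despite bad model output is to validate the action against the current game state and fall back to a harmless no-op instead of crashing. A minimal sketch under those assumptions (the repo's real parser in src/policies/ may differ):

```python
import json

VALID_KINDS = {"speak", "vote", "night_action"}

def parse_action(raw: str, alive_players: list[str]) -> dict:
    """Parse a model's raw output into a valid action dict.

    Falls back to an empty "speak" action rather than crashing the episode
    when the output is malformed or targets an invalid player.
    """
    try:
        action = json.loads(raw)
        if action.get("kind") not in VALID_KINDS:
            raise ValueError("unknown action kind")
        target = action.get("target")
        if target is not None and target not in alive_players:
            raise ValueError("target is not a living player")
        return action
    except (json.JSONDecodeError, ValueError, AttributeError):
        return {"kind": "speak", "message": ""}
```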
Accomplishments that we're proud of
- We built a full end-to-end system instead of only a proposal
- We created a usable RL training setup around Mafia with checkpoints, evaluation hooks, and remote training support
- We used Mafia as a concrete testbed for studying deception after RL training
- We tracked broader behavior with TruthfulQA instead of only measuring in-game win rate
- We added inoculation prompting and saw that deceptive behavior decreased after training when that prompt-based intervention was applied
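For reference, inoculation prompting here means wrapping training-time interactions in an instruction that frames deception as in-game-only behavior. A minimal sketch of how such a prefix could be wired in; the wording and helper are illustrative, not the project's actual prompt:

```python
# Hypothetical inoculation prefix prepended to the training-time system prompt.
INOCULATION_PREFIX = (
    "You are playing the party game Mafia. Bluffing and withholding your "
    "role are expected parts of this game and apply only inside the game. "
    "Outside of games like this, you should always be honest."
)

def build_system_prompt(base_prompt: str, inoculate: bool) -> str:
    return f"{INOCULATION_PREFIX}\n\n{base_prompt}" if inoculate else base_prompt
```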
What we learned
- A model can pick up more deceptive behavior when trained in an environment where deception helps it win
- Looking only at game performance is not enough for safety questions
- A controlled environment like Mafia is useful because it gives repeated situations where deception has strategic value
- Simple interventions such as inoculation prompting may reduce harmful carryover effects after training
- Safety work needs both training experiments and behavior measurement outside the training task
What's next for mafia-agents
- Run longer and more stable GPU-backed training jobs on Modal
- Measure deception changes more directly across checkpoints
- Improve the evaluation pipeline for both Mafia behavior and outside benchmarks
- Test stronger and more varied inoculation prompts
- Explore the wider safety question of how situationally rewarded deception carries over beyond the original training environment
Built With
- ai
- hugging-face
- llm
- modal
- openrouter
- posttraining
- python
- rl