MSG (Multimodal Spatial Reasoning Gym) is an open framework for training vision-language models (VLMs) on construction-site spatial reasoning via reinforcement learning (RL). It targets questions whose answers can be verified automatically: How far is the worker from the edge? Is the scaffolding to the left or right of the ladder? Which object is closest to the person? Is there a guardrail visible? Ground truth is derived from Meta's SAM3 model.
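
To make the idea of a verifiable question concrete, here is a minimal sketch of how left/right and closest-object answers could be checked against segmentation output. It assumes each detected object has been reduced to a labeled bounding box in image coordinates; the `Box` class and the function names are hypothetical and are not MSG's actual API.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical labeled bounding box in pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def left_or_right(subject: Box, reference: Box) -> str:
    """Answer 'left' or 'right' by comparing horizontal box centers."""
    return "left" if subject.center[0] < reference.center[0] else "right"


def closest_to(target: Box, others: list[Box]) -> str:
    """Return the label of the box whose center is nearest to the target (2D image distance)."""
    tx, ty = target.center
    nearest = min(
        others,
        key=lambda b: math.hypot(b.center[0] - tx, b.center[1] - ty),
    )
    return nearest.label


def exact_match_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the normalized answer matches the ground truth, else 0.0."""
    return 1.0 if model_answer.strip().lower() == ground_truth.strip().lower() else 0.0


# Illustrative values only, not taken from the dataset.
scaffolding = Box("scaffolding", 10, 0, 50, 100)
ladder = Box("ladder", 200, 0, 240, 100)
person = Box("person", 180, 20, 200, 100)

truth = left_or_right(scaffolding, ladder)
reward = exact_match_reward(" Left ", truth)
nearest_label = closest_to(person, [scaffolding, ladder])
```

Comparing box centers in the image plane is a simplification: it ignores depth, so a real distance question such as "how far is the worker from the edge" would need extra information that this sketch does not model.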

The library is compatible with Prime Intellect's Verifiers library and Environments Hub, and will be uploaded there in the future (without the dataset).

MSG also supports adding a teacher model for richer reward signals.
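
One way a teacher signal could be combined with a verifiable reward is a simple weighted blend. The sketch below is an assumption about how such a combination might look, not a description of MSG's implementation: `blended_reward`, `teacher_score`, and `teacher_weight` are hypothetical names, and the teacher is stood in for by any callable that returns a score for an answer.

```python
from typing import Callable


def blended_reward(
    answer: str,
    ground_truth: str,
    teacher_score: Callable[[str], float],
    teacher_weight: float = 0.3,
) -> float:
    """Blend a binary verifiable reward with a teacher model's score.

    teacher_score is a hypothetical callable standing in for a teacher
    model; its output is clipped to [0, 1] before blending.
    """
    if not 0.0 <= teacher_weight <= 1.0:
        raise ValueError("teacher_weight must be in [0, 1]")

    # Verifiable part: exact match against the ground-truth answer.
    matches = answer.strip().lower() == ground_truth.strip().lower()
    verifiable = 1.0 if matches else 0.0

    # Teacher part: clip so a miscalibrated teacher cannot dominate.
    teacher = min(1.0, max(0.0, teacher_score(answer)))

    return (1.0 - teacher_weight) * verifiable + teacher_weight * teacher


# Illustrative stand-in for a teacher model that always returns 0.5.
example = blended_reward("left", "left", lambda a: 0.5, teacher_weight=0.3)
```

Keeping the verifiable term as the larger share means a correct answer always outscores an incorrect one, whatever the teacher says, as long as `teacher_weight` stays below 0.5.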

I worked alone on this and was able to extract almost 1 million VQA pairs from the dataset. I tried fine-tuning Qwen3VL on 12 GPUs; however, the compute credits ran out, and it took hours to get the runs started.
