RL Research Platform: Universal Multi-Agent Reinforcement Learning System

Inspiration

This project was born out of sheer frustration with conducting Multi-Agent Reinforcement Learning (MARL) research. We identified three critical "pain points" that every researcher faces but few solve systematically:

  1. Dependency Hell: Running a legacy baseline (requiring Gym 0.21) alongside a SOTA algorithm (requiring Gymnasium 1.0) on the same machine is a nightmare of conflicting libraries.
  2. The "Win-Rate" Illusion: In non-transitive games (like Rock-Paper-Scissors), a simple reward curve doesn't tell you if a strategy is actually good. We needed Payoff Matrices to see who beats whom, but generating them manually is tedious.
  3. The Reproducibility Crisis: "It works on my machine" is the standard in AI. Experiments often fail to reproduce due to uncommitted code changes, specific random seeds, or subtle driver mismatches.

We built the RL Research Platform to stop fighting configuration files and start doing real science.

What it does

The RL Research Platform is an RLOps (Reinforcement Learning Operations) infrastructure that manages the entire lifecycle of an experiment:

  • Universal Environment Management: It treats environments as versioned Docker containers. You can run StarCraft II, PettingZoo, and custom GridWorlds side-by-side without conflicts.
  • Automated Tournaments: Users can select N historical checkpoints, and the platform automatically schedules a round-robin tournament to generate a Payoff Matrix Heatmap, revealing the true game-theoretic landscape (a minimal sketch of the idea follows this list).
  • "Time-Travel" Reproducibility: It captures a forensic snapshot of every experiment—including uncommitted code (git diff), exact dependency trees (pip freeze), and hardware drivers—ensuring any result can be reproduced down to the bit.

How we built it

We adopted a Micro-kernel + Plugin architecture to ensure flexibility and scalability.

  • The Runner Protocol: Instead of hard-coding algorithms, we built a universal runner_main.py that dynamically loads user code as plugins. This lets the backend orchestrate any algorithm (SB3, RLlib, custom) via a standardized JSON protocol (a minimal sketch follows this list).
  • Automated Evaluation Pipeline: We implemented a dedicated EvalMatrixService that decomposes evaluation requests into parallel jobs. The results are aggregated into Elo ratings, fit as a logistic model of expected win rates (see the Elo sketch below).
  • Polyglot Interop: To support complex aerospace simulations written in Java (Orekit), we bypassed slow socket-based IPC and used JPype to launch a JVM inside the Python process, giving us zero-copy access to Java objects from Python (see the JPype sketch below).
  • Streaming Storage: We integrated directly with S3/MinIO for all artifact storage. Logs and checkpoints are streamed straight to object storage, keeping the backend stateless and lightweight (the streaming pattern is sketched below).
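
A stripped-down sketch of the plugin-loading idea, assuming an illustrative JSON job spec; the field names here are ours, not the platform's actual protocol:

```python
# runner_main.py -- minimal sketch of the plugin-loading idea.
import importlib
import json
import sys

def main(spec_path: str) -> None:
    with open(spec_path) as f:
        # e.g. {"entry_point": "my_pkg.train:run", "config": {"lr": 3e-4}}
        spec = json.load(f)
    module_name, func_name = spec["entry_point"].split(":")
    train_fn = getattr(importlib.import_module(module_name), func_name)
    # Any SB3/RLlib/custom training loop sits behind this one call.
    metrics = train_fn(**spec.get("config", {}))
    # Report back over stdout in JSON (metrics assumed JSON-serializable).
    print(json.dumps({"status": "done", "metrics": metrics}))

if __name__ == "__main__":
    main(sys.argv[1])
```

Because the backend only ever speaks this JSON contract, swapping one algorithm library for another is a configuration change, not a backend change.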
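For the aggregation step, the textbook Elo formulas are enough to illustrate the logistic link between ratings and expected win rates (this is the standard form, not necessarily the exact fitting procedure EvalMatrixService uses):

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected win rate of A against B under the Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32) -> float:
    """Shift A's rating toward its observed score (1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))
```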
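A minimal sketch of the in-process JVM pattern with JPype; the jar path is illustrative, and we use a plain java.util.ArrayList rather than an Orekit class to keep the example self-contained:

```python
import jpype

# Start the JVM inside the current Python process. Java objects are then
# reachable as direct references, with no socket serialization round-trip.
jpype.startJVM(classpath=["orekit.jar"])  # jar path is illustrative

ArrayList = jpype.JClass("java.util.ArrayList")
states = ArrayList()      # a real Java object, owned by the embedded JVM
states.add("step-0")      # called from Python without copying through IPC
print(states.size())      # -> 1

jpype.shutdownJVM()
```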
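And the streaming-upload pattern, assuming boto3 against a MinIO endpoint (the endpoint, bucket, and key below are made up):

```python
import boto3

# MinIO speaks the S3 API, so boto3 works with a custom endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.internal:9000",  # hypothetical endpoint
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

# upload_fileobj streams any file-like object in multipart chunks, so
# large checkpoints never have to be buffered fully in backend memory.
with open("checkpoint.pt", "rb") as f:
    s3.upload_fileobj(f, "experiments", "run-042/checkpoint.pt")
```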

Challenges we ran into

  • Handling "Dirty" Code: Researchers often run experiments without committing their code. To solve this, we built a Snapshot Recorder that forces the capture of git diff and uncommitted changes before any training starts.
  • Cross-Language Performance: Interfacing Python RL agents with Java environments initially caused massive serialization overhead. We solved this by sharing memory through JPype's in-process JVM (sketched above under "How we built it"), achieving 100x speedups over socket-based solutions.
  • Cluster Abstraction: Moving from local Docker to a GPU cluster usually requires code changes. We implemented an Executor Adapter pattern that seamlessly maps local paths to cluster volumes, allowing the same code to run anywhere (sketched below).
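
A minimal version of the Snapshot Recorder idea; the file names and exact command set are our assumptions, not the platform's:

```python
import subprocess
from pathlib import Path

def record_snapshot(out_dir: str) -> None:
    """Capture uncommitted changes and the exact dependency tree
    before training starts, so the run can be replayed later."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, cmd in {
        "commit.txt": ["git", "rev-parse", "HEAD"],
        "dirty.patch": ["git", "diff", "HEAD"],   # uncommitted edits
        "requirements.lock": ["pip", "freeze"],   # exact dependency tree
        "gpu.txt": ["nvidia-smi"],                # driver/hardware info
    }.items():
        (out / name).write_bytes(subprocess.run(cmd, capture_output=True).stdout)
```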
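And the Executor Adapter idea in miniature; the class names and mount mapping are hypothetical:

```python
class LocalExecutor:
    def resolve(self, path: str) -> str:
        return path  # local Docker bind mounts use host paths as-is

class ClusterExecutor:
    """Maps local workspace paths onto the cluster's shared volume,
    so training code never hard-codes where it runs."""
    def __init__(self, mounts: dict[str, str]):
        self.mounts = mounts  # e.g. {"/home/user/exp": "/mnt/shared/exp"}

    def resolve(self, path: str) -> str:
        for local, remote in self.mounts.items():
            if path.startswith(local):
                return path.replace(local, remote, 1)
        return path
```

Training code asks its executor to resolve every path, so the same job spec runs unchanged on a laptop or on the cluster.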

Accomplishments that we're proud of

  • Automated Insight: Turning a folder of checkpoints into a clear Heatmap that visualizes strategy cycles (A beats B, B beats C, C beats A) without writing a single line of glue code.
  • True Isolation: Successfully running legacy algorithms (2 years old) and modern algorithms on the same infrastructure simultaneously via strict Docker encapsulation.
  • Engineering Quality: We treated research code like production software—utilizing Pydantic for schema validation, SQLAlchemy for strict lineage tracking, and React for a responsive UI.

What we learned

  • Protocol > Implementation: Designing a robust Runner Protocol was more important than the implementation of any single algorithm. It allowed us to decouple the "What" (Algorithm) from the "Where" (Infrastructure).
  • Tools Shape Research: By reducing the friction of setting up environments, we found ourselves running more diverse and rigorous experiments.
  • The Power of Snapshots: You don't realize how important pip freeze is until you try to reproduce a paper from 6 months ago.

What's next for RL Research Platform

  • Distributed Training: Integrating with Kubernetes or Ray to support massive parallel sampling for large-scale environments.
  • LLM Analyst: Integrating Large Language Models to automatically interpret the Payoff Matrices and provide tactical suggestions (e.g., "Policy A is too passive in the late game").
  • Open Source: Polishing the documentation to release this as a general-purpose tool for the MARL community.
