Junqi

Introduction

This project studies Junqi, also known as Army Chess, a popular board game from China. The ultimate goal is to construct a deep reinforcement learning (DRL) model that can play the game to a satisfactory level. The premise of Junqi is simple: it is played on a 5x13 board with pieces of different ranks, where higher ranks take lower ranks and bombs take all ranks. The goal is to reach the other player's flag, located at the other end of the board. Two features add nuance: the opponent's pieces are unobservable during play, and each player chooses the starting position of their own pieces. We anticipate that Junqi will be a challenging objective for a DRL model, especially because of its imperfect-information states, variations in strategy, and delayed rewards. We will make one important simplification, which is to remove the board creation portion of the game. The complicated, strategic nature of Junqi is likely to be extremely challenging already, and adding this extra layer of complexity might make the project unrealistic for the scope of this course.
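The capture rule and the hidden-information aspect described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the project's environment: ranks are plain integers, `BOMB` is a hypothetical sentinel value, and the sketch assumes that equal ranks remove each other and that a bomb is removed together with the piece it meets. Special pieces such as landmines and engineers are left out.

```python
from enum import Enum
from typing import List, Optional, Tuple

BOMB = -1  # hypothetical sentinel rank for a bomb (assumption, not from the source)

# A piece is (owner, rank); in an observation the rank may be None (hidden).
Piece = Tuple[int, Optional[int]]
Board = List[List[Optional[Piece]]]


class Outcome(Enum):
    ATTACKER_WINS = "attacker_wins"
    DEFENDER_WINS = "defender_wins"
    BOTH_REMOVED = "both_removed"


def resolve_attack(attacker_rank: int, defender_rank: int) -> Outcome:
    """Resolve one capture attempt under the simplified rules above."""
    # Assumption: a bomb takes any rank and is removed along with it.
    if attacker_rank == BOMB or defender_rank == BOMB:
        return Outcome.BOTH_REMOVED
    if attacker_rank > defender_rank:
        return Outcome.ATTACKER_WINS
    if attacker_rank < defender_rank:
        return Outcome.DEFENDER_WINS
    # Assumption: equal ranks remove each other.
    return Outcome.BOTH_REMOVED


def observe(board: Board, player: int) -> Board:
    """Return the board as seen by `player`: opponent ranks are hidden."""
    def mask(cell: Optional[Piece]) -> Optional[Piece]:
        if cell is None:
            return None
        owner, rank = cell
        return (owner, rank if owner == player else None)

    return [[mask(cell) for cell in row] for row in board]
```

The `observe` function shows why the agent only sees a partial state: the positions of opposing pieces are visible, but their ranks are not.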

Methodology

This project will build its training environment with reference to the existing open-spiel-junqi package. This package includes smaller board variants that are beneficial for training, and we will also provide our own implementation of the full board to train the final model.

We will primarily investigate two algorithms and how to make them effective for Junqi: Deep Recurrent Q-Learning (DRQL) and Recurrent Proximal Policy Optimization (R-PPO). Both algorithms have been studied and shown to be effective for partially observable state spaces. The models will be trained through self-play. Evaluation will be win-rate-based, using the following (tentative) list of metrics:
- Self-play win rate: evaluate against earlier checkpoints
- Baseline win rate: evaluate against heuristic models and random models
- Alternate model win rate: evaluate DRQL models against R-PPO models
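As an illustration of how the win-rate metrics above could be computed, the following Python sketch plays a fixed number of games between two agents and alternates which side moves first. The `play_game` callback and its return convention (+1 if the first player wins, -1 if the second wins, 0 for a draw) are hypothetical and not part of any existing package. Counting a draw as half a win is also an assumption. `toy_game` is a stand-in used only to make the example runnable.

```python
import random
from typing import Any, Callable

# Hypothetical interface: play_game(first, second, rng) returns
# +1 if `first` wins, -1 if `second` wins, and 0 for a draw.
PlayGame = Callable[[Any, Any, random.Random], int]


def win_rate(agent: Any, opponent: Any, play_game: PlayGame,
             n_games: int = 100, seed: int = 0) -> float:
    """Estimate the win rate of `agent` against `opponent`."""
    rng = random.Random(seed)
    score = 0.0
    for i in range(n_games):
        # Alternate sides so that any first-move advantage cancels out.
        if i % 2 == 0:
            result = play_game(agent, opponent, rng)
        else:
            result = -play_game(opponent, agent, rng)
        if result > 0:
            score += 1.0
        elif result == 0:
            score += 0.5  # assumption: a draw counts as half a win
    return score / n_games


def toy_game(first: float, second: float, rng: random.Random) -> int:
    """Stand-in game: agents are 'strengths'; the stronger one wins more often."""
    return 1 if rng.random() < first / (first + second) else -1
```

The same function would cover all three metrics by changing the opponent: an earlier checkpoint, a heuristic or random baseline, or a model trained with the other algorithm. For example, `win_rate(2.0, 1.0, toy_game, n_games=1000)` estimates how often the stronger toy agent wins.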

Plan

Because the project is still at an early stage of development, we aim to start with proof-of-concept models that are quick to train. To this end, we will begin development on a 3x8 board, which has a significantly smaller state space and, consequently, a shorter training time.
After a successful proof-of-concept model has been created, we will compare different implementation details, evaluate the model's performance, and attempt to improve upon the existing structure. In parallel, we will start the more time-consuming training on bigger boards, which will let us make progress towards the final model for the full game of Junqi.
If time allows, we will investigate the part of the game we excluded, board creation, which is likely to be an extremely challenging task.