Inspiration

I love machine learning. I have built many models in both reinforcement learning and supervised learning. Currently, my personal project is a ResNet that evaluates chess positions. Solving board games through self-play and reinforcement learning is a passion of mine.

What it does

My algorithm plays 750 games against itself, storing each game's states, the moves made, and the reward (calculated at the end of the game). After the 750 games are played, the model trains on this data by minimizing the cross-entropy loss of the actor head and the MSE loss of the critic head. Finally, the newly trained model plays against the old version of itself to see whether the training helped or hurt its performance; if the new model wins more than 55% of the games, it replaces the old model. This process iterates 200 times.
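The loop below is a minimal sketch of that process, not my exact code: `make_model`, `play_self_play_games`, and `evaluate_head_to_head` are hypothetical helpers, and the optimizer settings and 1:1 loss weighting are assumptions.

```python
import copy

import torch
import torch.nn.functional as F


def train_on_buffer(model, batches, lr=1e-3, clip_norm=1.0):
    """One pass over the self-play data: cross-entropy for the actor, MSE for the critic."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for states, moves, rewards in batches:             # mini-batches of (state, move, reward)
        logits, values = model(states)                 # actor logits and critic value estimates
        actor_loss = F.cross_entropy(logits, moves)
        critic_loss = F.mse_loss(values.squeeze(-1), rewards)
        loss = actor_loss + critic_loss                # assumed 1:1 weighting of the two losses
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)  # gradient clipping for stability
        optimizer.step()


best_model = make_model()                              # hypothetical model constructor
for iteration in range(200):
    candidate = copy.deepcopy(best_model)
    # 1. Self-play: collect (state, move, reward) data from 750 games.
    batches = play_self_play_games(best_model, num_games=750)
    # 2. Train the candidate on the collected data.
    train_on_buffer(candidate, batches)
    # 3. Gate: the candidate replaces the best model only if it wins more than 55% of the evaluation games.
    if evaluate_head_to_head(candidate, best_model, num_games=100) > 0.55:
        best_model = candidate
```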

How we built it

I built it in Python using PyTorch and PyTorch Lightning. I started from a simple A2C algorithm (found on Medium.com) to act as the baseline. To improve stability, I added gradient clipping during training so that a single update can't change the model too much. I had the previous model compete with the new model in a best of 100 games to see whether the training helped or hurt the model. The network itself was a convolutional neural network with a residual tower of 15 blocks and 64 filters acting as its backbone; the actor head was a convolutional network, the critic head was a fully connected network, and ReLU was the activation function for each layer. The model and code will be uploaded below.
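Under my assumptions about input planes, board size, and move encoding (illustrative placeholders rather than the exact values I used), the network looks roughly like this:

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """One block of the residual tower: two 3x3 convolutions with a skip connection."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)


class ActorCriticNet(nn.Module):
    """Residual tower backbone with a convolutional actor head and a fully connected critic head."""

    def __init__(self, in_planes=3, board_size=8, num_moves=4096, channels=64, blocks=15):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        # Residual tower: 15 blocks, 64 filters each.
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        # Actor head: convolutional, outputs one logit per possible move.
        self.actor = nn.Sequential(
            nn.Conv2d(channels, 2, 1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(2 * board_size * board_size, num_moves),
        )
        # Critic head: fully connected, outputs a scalar value estimate for the position.
        self.critic = nn.Sequential(
            nn.Conv2d(channels, 1, 1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(board_size * board_size, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        features = self.tower(self.stem(x))
        return self.actor(features), self.critic(features)
```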

Accomplishments that we're proud of

The new model was able to beat the random agent over 95% of the time. Sometimes, however, the random agent got lucky.

What's next for A2C (Advantage Actor Critic) to solve board games

I intend to implement an ensemble of trained models to improve stability. I would take the average of the models' move probabilities to find the best average move. This would keep any one model from making a horrible mistake, since each model is held accountable by the other two. If all three models say a move is good, it is most likely good; if one model says a move is great and the other two say it is terrible, then the move is probably not good.
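A sketch of how that averaging could look, assuming each model returns actor logits and a value estimate as in the architecture above; the function name and the optional legal-move mask are illustrative:

```python
import torch


def ensemble_move(models, state, legal_mask=None):
    """Pick the move with the highest average probability across the ensemble."""
    with torch.no_grad():
        # Convert each model's actor logits into a probability distribution over moves.
        probs = [torch.softmax(model(state)[0], dim=-1) for model in models]
    avg = torch.stack(probs).mean(dim=0)      # average probability of each move across models
    if legal_mask is not None:                # optionally zero out illegal moves
        avg = avg * legal_mask
    return avg.argmax(dim=-1).item()          # a move only wins if the models agree it is good on average
```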

Built With

python, pytorch, pytorch-lightning