GoDeep: Implementing 9x9 Go with self-play only
Idil Cakmur (icakmur), Jason Singer (jsinger2)
GitHub link: https://github.com/idilcak/GoDeep
Introduction: We are going to reimplement a version of DeepMind’s AlphaGo Zero: “Mastering the Game of Go Without Human Knowledge.”
In 2016, DeepMind built the first artificial intelligence system that could defeat a champion Go player: AlphaGo. That system first learned to mimic the moves of human professionals by analyzing their games, then improved by playing against itself. After AlphaGo, DeepMind took the challenge a step further and built a system that masters the game without any human knowledge, using only self-play and reinforcement learning. We chose this paper because, as two Go enthusiasts, we were really interested in building our own AlphaGo Zero.
This is a reinforcement learning problem.
Related Works: We will draw on Minigo, a minimalist Go engine modeled after AlphaGo Zero and built on MuGo. Minigo will mainly provide the environment in which our program plays against other agents, as well as the interface for the game of Go itself. That is, our plan is to write the self-play and training sides ourselves and to test using Minigo. https://github.com/tensorflow/minigo
Data: We will not be using any data other than that generated during self-play, since the purpose is to only train using self-play and reinforcement learning.
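To make the self-play data concrete: each finished game yields one (state, outcome) training pair per position, where the outcome is scored from the perspective of the player to move at that position. A minimal sketch of that labeling step, with our own placeholder names and encoding (+1 for black, -1 for white):

```python
def label_self_play_states(states, winner):
    """Turn the states of one finished self-play game into (state, z)
    training pairs for the value network.

    states: board states in move order (black moves first).
    winner: +1 if black won the game, -1 if white won.
    Each state is labeled with the game outcome from the perspective
    of the player whose turn it is, which is what the value network
    will be trained to predict.
    """
    pairs = []
    for ply, state in enumerate(states):
        to_move = 1 if ply % 2 == 0 else -1  # +1 = black to move, -1 = white
        pairs.append((state, winner * to_move))
    return pairs
```

So if black wins, every black-to-move state is labeled +1 and every white-to-move state is labeled -1; a single game thus generates dozens of value-network training pairs for free.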
Methodology: The essential parts of the architecture we will implement are the Value Network, the Policy Network, and the Monte Carlo Tree Search (MCTS) algorithm. Although AlphaGo Zero combined the value and policy networks into a single architecture with shared parameters, we plan to build and train them separately, as was done for the original AlphaGo, because this simplifies the problem.

The Value Network takes a game state and outputs a number between -1 and 1 indicating how favorable the state is for the player whose turn it is: positive values favor the current player, negative values favor the opponent. It is trained on pairs of game states and game outcomes generated in self-play.

The Policy Network takes a game state and outputs a probability for each move, encoding how promising that move is. It is trained on pairs of board states and the move probabilities produced by MCTS.

MCTS has the same input and output as the Policy Network, but it generates its probabilities by searching the game tree: it evaluates each leaf node with the value network, backs those values up the tree, and decides which nodes to visit more often by computing upper confidence bounds, all according to the standard algorithm.

We have not yet decided on the inner architecture of each network. In particular, we need to decide whether to implement the computer-vision techniques DeepMind used, such as convolutions and residual networks, which might not be necessary for a 9x9 board. We are sure, however, that both the Policy and Value networks will have trainable parameters, such as dense layers with non-linearities between them.
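The node-selection step of MCTS described above can be sketched as follows. At each node, the search picks the child that maximizes a PUCT-style score combining the child's mean value (from backed-up value-network evaluations) with an exploration bonus weighted by the policy network's prior. The data layout and the c_puct constant here are our own placeholders, not Minigo's or the paper's exact hyperparameters:

```python
import math

def puct_select(children, priors, c_puct=1.5):
    """Pick the child move with the highest PUCT score.

    children: maps each move to (visit_count, total_backed_up_value).
    priors:   maps each move to the policy network's probability.
    Returns the move the search should descend into next.
    """
    total_visits = sum(n for n, _ in children.values())
    best_move, best_score = None, -float("inf")
    for move, (n, w) in children.items():
        q = w / n if n else 0.0  # mean value of this child so far
        # Exploration bonus: large for high-prior, rarely visited moves.
        u = c_puct * priors[move] * math.sqrt(total_visits + 1) / (1 + n)
        if q + u > best_score:
            best_move, best_score = move, q + u
    return best_move
```

The key behavior is that a move with a strong prior but few visits can outscore a well-explored move with a higher mean value, which is what drives the search to balance exploration against exploitation.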
We expect that our biggest challenge will be integrating our system with the Minigo environment.
Metrics: We plan to have our agent play against a Minigo-trained system on a 9x9 board, using the Minigo framework, and compare how each performs after different amounts of training. We would also like it to play against previous versions of itself to verify that it is improving (though this cannot be the only test, since it could learn to beat itself without playing the game well). Our main metric is how often it wins. We frame our goals in terms of the strength of our system at the game itself: since AI opponents' difficulty can be adjusted, our base goal is beating a 25-kyu AI, our target is 20 kyu, and our stretch goal is 15 kyu or stronger (in Go ranks, where 30 kyu is the weakest).
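For the self-play comparison against earlier versions, AlphaGo Zero promoted a new network checkpoint only if it won at least 55% of its evaluation games against the current best. A minimal sketch of that gating check, with our own function name and outcome encoding:

```python
def should_promote(results, threshold=0.55):
    """Decide whether a candidate network replaces the current best.

    results: outcomes of evaluation games between the candidate and the
             current best network, +1 if the candidate won, -1 if it lost.
    Returns True when the candidate's win rate exceeds the threshold
    (55% by default, following AlphaGo Zero's gating rule).
    """
    wins = sum(1 for r in results if r == 1)
    return wins / len(results) > threshold
```

The same win-rate computation applies unchanged when the opponent is a Minigo-trained model rather than an earlier checkpoint.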
Ethics: We think that there are two ethical implications that need to be considered while working on projects such as this one:
First, we should be aware of how much computing power systems like this require to train. Go can be played on a 19x19 board, which admits roughly 2 × 10^170 legal positions, so training a successful system demands a massive amount of compute. This means we have to be mindful of where that energy comes from. Many deep learning systems run on non-renewable energy sources, and this matters especially here, because there is no "positive impact" of this system that could justify the energy spent training it. It is therefore especially important to make sure the energy sources are renewable, to limit the environmental impact.
Another societal impact that must be kept in mind is that Go is not just any game: it is the oldest board game played continuously to the present day, more than 2500 years old. The game is so embedded in the cultures of East Asia that it means far more than just a game. For the Western engineers at DeepMind, AlphaGo may have been just a system they were building. But when AlphaGo beat the world champion, it was truly devastating for many of the people involved. Many people do not think of Go merely in terms of strategy; they think of it in terms of creativity and beauty, and when a computer program beats humans, it is almost as if humanity has lost something it held very dear. We strongly urge everyone to watch the documentary on AlphaGo on YouTube, and in particular minutes 50 to 55: it is truly heartbreaking to see the world champion defeated. It is heartbreaking for him as well, because when he talks about his loss, he is not just melancholic about losing to a computer; he feels that he let down humanity. When we build such systems, even though it might seem like just a "game" to us, we need to keep in mind what that game can mean for other cultures and approach it with the respect it deserves. We must also consider the serious implications that breakthroughs in this field have for our understanding of concepts such as creativity and, more broadly, what it means to be human.
Division of Labor: We will be doing everything together.