Arkanoid Breakout Game
Who
Mingye Zhang
Cathy Gao
Introduction
In normal Arkanoid game, the player will used keyboard to move the board left or right so that it can hit the ball back to hit the bricks. In this game, the only thing that matters is the position and movement of the board.
In our project, we want to train the board. We hope the board can move automatically which means it can precisely predicted where the ball will fall and move to that position, hit the ball back to hit bricks. Over and over again until all the bricks have been removed.
Related Work
In some paper, they use reinforcement learning to tackle this kind of problem. For example, in the paper link, they used deep Q learning to train their model to address this problem. They applied this single model to seven pixel games which with same hyper parameters with different rewards function based on the rules of the games. In the game atari breakout, the model perform very well after enough training.
Data
As we use reinforcement learning in this model, we do not need to prepare dataset before hand. When using the data, rather than using the dataset as we collect them, we put them into replay buffer, then sample data at random. Since the buffer size is fixed, new data will kick out old ones. Since we sample data at random, each data potentially reused. In this way, we can increase data efficiency. By running multiple episodes, we can accumulate more and more data in our storage. And we use these data to adjust out Q function.
DQN
Currently, we plan to use reinforcement learning or more specifically deep Q learning to tackle how to train our model.
Our goal is to connect a reinforcement learning algorithm to a deep neural network which operates directly on RGB images and efficiently process training data by using stochastic gradient updates. In this DQN model, our inputs are the raw pixels, and it outputs a value function estimating future rewards. We would like to obtain the Q-function that gives the maximum rewards. Our network is trained with a variant of the Q-learning algorithm, with stochastic gradient descent to update the weights. In this algorithm, we train a Q-function, which takes in a state s and an action a, and outputs what the reward is going to be in the future if we are in state s and performs action a. The equation is Q(s,a) -> R.
We also use the key idea of online reinforcement learning, which is to say, play the game at the same time we train the neural network, then use the improved neural networks to play more gains. And the key of the DQN algorithm, Bellman recurrence property of Q-function, which is to say we will take the action a, and for all future steps always take the best action that can maximize Q. Q of this state depends on Q of the future states. And if we iterate, eventually, Q-function will converge to the optimal Q-function.
Also, we utilize a technique known as experience replay where we store the agent’s experiences at each time step in a dataset, pooled over many episodes into a replay memory. During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, drawn at random from the pool of stored samples. After performing experience replay, the agent selects and executes an action according to a greedy policy. Since using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on fixed length representation of histories produced by a function which is deep Q learning network.
This approach has several advantages over standard online Q-learning. First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency. Second, learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically. By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
Model Architecture
The input to the neural network consists is an 84 × 84 × 4 image. The first hidden layer convolves 168 × 8 filters with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 324 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.
Training
In these experiments, we used the RMSProp algorithm with minibatches of size 32. The behavior policy during training was e-greedy with e annealed linearly from 1 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained for a total of 10 million frames and used a replay memory of one million most recent frames.
In supervised learning, one can easily track the performance of a model during training by evaluating it on the training and validation sets. In reinforcement learning, however, accurately evaluating the progress of an agent during training can be challenging. Since our evaluation metric is the total reward the agent collects in an episode or game averaged over a number of games, we periodically compute it during training. The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits. Another more stable metric is the policy’s estimated action-value function Q, which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state. We collect a fixed set of states by running a random policy before training starts and track the average of the maximum predicted Q for these states which is much more stable and smooth. In addition to seeing relatively smooth improvement to predicted Q during training we did not experience any divergence issues in any of our experiments. This suggests that, despite lacking any theoretical convergence guarantees, our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.
At the time when there are only a few blocks remained, maybe it is relatively easy to move to the right position and hit the ball back. But it is hard to guarantee that after being hitting back, the ball can hit one of the blocks left. In some cases, the ball may just hit the walls and came back. In this situation, the reward is zero and it may fall into a dead loop. We need to fine tune the model to try to hit all the blocks. We think it is a hard part.
Metrics
In our project, as our work is not classification, there is no specific accuracy. As we are training a model to play a pixel game, we can evaluate the performance by the scores every time it gets. And we can compare those scores with the scores that a real human being can get to evaluate the performance of the agent. We can also refer to other existing agents' performance on the game.
Reflection
Looking at the results of our agent's game, we can see that our agent is able to play the game successfully and get some rewards. However, it's only able to clear a few of the blocks before it dies in each episode and it only gains a reward at about 30 in each episode. This is definitely a lot less than what a human player can achieve, and also a lot less than some existing agents (like the ones from Deepmind), who can perform pretty well on games like breakout. We think this is due to the limitation of training time. Successful agents like Deepmind's are able to clear out all the blocks and obtain a reward of hundreds of points because they are trained for over thousands of episodes, while we only trained for less than a hundred episodes, for a million steps across all episodes in each training. If we have more time, we would train our agent on 5 million, 10 million, and maybe even more steps over hundreds of episodes. And we believe the agent will perform a lot better then.
Ethics
Is it meaningful to spend a lot of time and effort to design an "invincible'' AI in games? Well, we think that it is very cool that we are able to train an agent that can play the game successfully and get a certain amount of reward. It's so exciting to build powerful robots which can win the game or even defeat humans. But this also reminds us that some years ago, google robot alpha GO defeated many famous Go players. It is widely recognized that such AI is so unprecedented and amazing that many Go players are reluctant to admit such a fact.
But we believe that the ethics or the purpose of a game is to bring pleasure to people. By playing games, people can dive into them and relax. Winning is definitely not the only target of a game. So as some AI has indeed great power in some games, we still think not limited to games, there are more meaningful aspects to display and use this kind of power.
Conclusion
We built a deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. Although we are not able to train an agent that can successfully clear out all the blocks (like the agent by Deepmind), we are able to demonstrate a trained agent that can successfully play the game and clear some blocks, achieving a reasonable reward at an initial stage given our training time. Through this project, we are able to delve deeper into the field of Reinforcement Learning and to further explore this interesting topic. Specifically, we gained a much deeper understanding of the dynamic systems of Q-learning. We also gained valuable experience reading research papers, developing research skills, and training and evaluating models.
Division of labor:
Mingye: Look for some related works and resources. Design the architecture of the model, build loss functions. Fine tune Q functions. Collect resources and write reports.
Cathy: Look for some related works and resources. Design each individual part of the model. Set and tune hyper parameters. Collect resources and write reports.
Built With
- gym
- python
- torch

Log in or sign up for Devpost to join the conversation.