Artificial Unintelligence
Who:
Daniel Archer - darcher-1
Mathijs Deetman - mdeetman
Introduction:
Our group liked the idea of reinforcement learning and found the most recent course project interesting, so we set out to build our final project in that space. After researching and debating different papers, we decided to implement the paper by Lillicrap et al. titled “Continuous Control with Deep Reinforcement Learning”. However, after our first check-in with our TA mentor we learned that this paper would too closely mimic homework 6, so we had to go back to the drawing board. Staying on a similar subject, we then selected the paper by Mnih et al. titled “Playing Atari with Deep Reinforcement Learning”, in which a DQN with experience replay is used to play Atari 2600 games. Our implementation focuses on the Atari game Breakout. We chose this game because of its success in the paper relative to the average human, as well as the simplicity of its action space. The goal of the game is to move a paddle along the bottom of the screen left and right, preventing the ball from hitting the bottom of the screen while directing it to break the bricks at the top.
Methodology:
We use the Atari environments from the OpenAI Gym library to load Breakout. We use the deterministic version of the game, with a fixed frameskip of 4 frames, which is consistent with the implementation described in the paper.
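The environment setup looks roughly like the sketch below, assuming the Gym Atari extras are installed and an older Gym API in which `reset()` returns only the observation (the exact environment id and API may differ between Gym versions):

```python
import gym

# "Deterministic" Breakout variants repeat each chosen action for a fixed
# 4 emulator frames, rather than using a stochastic frameskip.
env = gym.make("BreakoutDeterministic-v4")

observation = env.reset()                                   # RGB frame, shape (210, 160, 3)
observation, reward, done, info = env.step(env.action_space.sample())
```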
As in the paper, we preprocess our observations into 84x84 grayscale images. The raw observation from the environment is an RGB image 210px tall and 160px wide. We crop 34 pixels from the top of the screen and 16 pixels from the bottom, making the input a square 160x160 image. The pixels removed at the top crop out the score, and the pixels removed at the bottom crop out empty space below the paddle, so the cropped image still contains the entire playable area and all the information the model could find useful. We then convert the image to grayscale, downsize it to 84x84 pixels, stack it with the previous 3 frames, and feed the 4-frame stack into our network.
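A sketch of this preprocessing, assuming scikit-image for the resize (OpenCV would work just as well); the crop indices follow the 34-pixel top and 16-pixel bottom crop described above:

```python
import numpy as np
from collections import deque
from skimage.transform import resize

def preprocess(frame):
    """Crop, grayscale, and downsample one raw (210, 160, 3) Atari frame."""
    cropped = frame[34:194, :, :]                             # drop score (top) and empty space (bottom) -> 160x160
    gray = np.dot(cropped[..., :3], [0.299, 0.587, 0.114])    # luminance-weighted grayscale
    return resize(gray, (84, 84), anti_aliasing=True).astype(np.float32)

# Keep the 4 most recent preprocessed frames and stack them as the network input.
frame_stack = deque(maxlen=4)

def stacked_state(frame):
    frame_stack.append(preprocess(frame))
    while len(frame_stack) < 4:                               # pad at the start of an episode
        frame_stack.append(frame_stack[-1])
    return np.stack(frame_stack, axis=-1)                     # shape (84, 84, 4)
```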
Our network consists of the following: a convolutional layer with 16 8x8 filters, stride 4, and ReLU activation; another convolutional layer with 32 4x4 filters, stride 2, and ReLU activation; a flatten layer; a fully connected layer with 256 rectified units; and finally a fully connected output layer with linear activation, producing one Q-value per action.
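In Keras this architecture can be written as in the sketch below (the input shape assumes the 4-frame stack of 84x84 images described above; Breakout exposes 4 actions, and the optimizer choice here is an assumption on our part):

```python
from tensorflow.keras import layers, models

def build_q_network(num_actions):
    """Two convolutional layers, one hidden dense layer, and a linear Q-value output."""
    return models.Sequential([
        layers.Input(shape=(84, 84, 4)),                       # 4 stacked grayscale frames
        layers.Conv2D(16, kernel_size=8, strides=4, activation="relu"),
        layers.Conv2D(32, kernel_size=4, strides=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(num_actions, activation="linear"),        # one Q-value per action
    ])

q_network = build_q_network(num_actions=4)
q_network.compile(optimizer="adam", loss="mse")                # optimizer is an assumption, loss is MSE as described below
target_network = build_q_network(num_actions=4)
target_network.set_weights(q_network.get_weights())
```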
After each step, the state, action, reward, next state, and terminal flag are added to a replay memory data structure, and the model then undergoes one training cycle. During training, a minibatch is randomly sampled from the replay memory. The states in the minibatch are passed through the Q-network and the next states are passed through the target network. The target network has the same structure as the Q-network but is not trained; its weights are instead copied from the Q-network every 5 episodes (about 2,500 training steps).
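A minimal sketch of the replay memory and the target-network sync, reusing the `q_network` and `target_network` defined above:

```python
import random
from collections import deque
import numpy as np

replay_memory = deque(maxlen=50_000)          # oldest transitions are discarded automatically

def remember(state, action, reward, next_state, done):
    replay_memory.append((state, action, reward, next_state, done))

def sample_minibatch(batch_size=64):
    batch = random.sample(replay_memory, batch_size)
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    return states, actions, rewards, next_states, dones

def update_target_network():
    # Called every 5 episodes so the bootstrap targets stay fixed between syncs.
    target_network.set_weights(q_network.get_weights())
```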
Our loss is a temporal-difference loss: the mean squared error between the Q-values predicted by the Q-network and the targets derived from the target network (the observed reward plus the discounted maximum Q-value of the next state). We use this loss to train our Q-network while keeping the target network fixed.
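One training step on a sampled minibatch might look like the sketch below; the target network provides the bootstrap values and only the Q-network's weights are updated (this assumes `q_network` was compiled with an MSE loss as in the earlier sketch):

```python
import numpy as np

GAMMA = 0.99

def train_on_minibatch(states, actions, rewards, next_states, dones):
    # Bootstrap target: r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal states.
    next_q = target_network.predict(next_states, verbose=0)
    targets = rewards + GAMMA * np.max(next_q, axis=1) * (1.0 - dones.astype(np.float32))

    # Only the chosen action's Q-value should contribute to the MSE loss,
    # so keep the network's own predictions for the other actions.
    q_values = q_network.predict(states, verbose=0)
    q_values[np.arange(len(actions)), actions] = targets

    q_network.fit(states, q_values, batch_size=len(states), epochs=1, verbose=0)
```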
We went with the following hyperparameters: epsilon (the probability that we select a random action, i.e., explore rather than exploit) is 0.05, gamma (the discount rate used to compute the present value of future rewards) is 0.99, the minibatch size is 64, the maximum replay memory size is 50,000, we train for 200 episodes, and the target network is updated every 5 episodes. A sketch collecting these settings is shown below.
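These settings, together with the epsilon-greedy action selection that epsilon drives, could be collected as in this sketch (the names are ours, not from the paper; `env` and `q_network` come from the earlier sketches):

```python
import random
import numpy as np

HYPERPARAMS = {
    "epsilon": 0.05,              # probability of picking a random (exploratory) action
    "gamma": 0.99,                # discount rate for future rewards
    "minibatch_size": 64,
    "replay_memory_size": 50_000,
    "num_episodes": 200,
    "target_update_episodes": 5,  # episodes between target-network syncs
}

def select_action(state):
    """Epsilon-greedy policy over the Q-network's outputs."""
    if random.random() < HYPERPARAMS["epsilon"]:
        return env.action_space.sample()                       # explore
    q = q_network.predict(state[np.newaxis], verbose=0)[0]
    return int(np.argmax(q))                                   # exploit
```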
Results:
We found that our model learned quickly at first - it was on par with Google and OpenAI after around 100 episodes, scoring around 4 points on average, and after 150 episodes it scored 8-13 consistently. We saved the model at this point and were able to render the emulator and visualize our model playing the game. Thereafter, we decided to train our model for only 200 episodes, mainly due to the computational resources available to us. The score in this range of episodes started to taper off and generally sat in the 10-15 range. As we trained the model past 200 episodes we started to notice a decrease in performance, which we attributed to overfitting. Nonetheless, we were able to get results we were happy with, and being able to visualize our model playing Breakout was really rewarding - excuse the pun!
Challenges:
We initially battled to get a good understanding of all the concepts outlined in the paper. As we began to code, more specific challenges came up. For instance, the concept of experience replay was a little foreign to us at first, so we had to spend some time researching it in order to understand and implement it. Thereafter, we battled with where and how to calculate the loss between the Q-network and target network - it was confusing to us at what exact step we needed to compute this value.
Computational power was another issue we ran into. Running the model on our local machines took forever, so we claimed some GCP credits and ran our project on Google Cloud infrastructure.
Reflection:
We feel our project was successful. We understood that we didn't have nearly the same computational power as the authors of the paper, so we were expecting much lower results. However, as we began to train our model we were ecstatic to find that it had results comparable to those of Google and OpenAI for the first 100 training episodes, scoring roughly 10 on average! We didn't set many quantitative goals going into the project, as we knew there would be a lot of variables affecting our success, but we were very happy with our results nonetheless. The model worked exactly how we expected it to... after we had fixed a couple of sneaky bugs! Our approach didn't change much over time, as we had a good idea of what needed to be done from the start, so it was more a question of getting the implementation correct. We did pivot early on from a paper on continuous Q-learning, as our mentor TA said it would be too close to the work we were completing in homework 6. If we had more time, we would have used it to further train and optimize the model rather than just getting it to work. Our biggest takeaways were: 1. the sheer amount of time it takes to train complicated algorithms like this, and 2. the need for real computational power in order to achieve success.
Built With
- keras
- python