Artificial Unintelligence

Who:
Daniel Archer - darcher-1
Mathijs Deetman - mdeetman

Introduction:
Our group liked the idea of reinforcement learning and found the most recent project interesting, so we set out to complete our final project on something within that space. After researching and debating different papers, we initially decided to implement a paper by T. Lillicrap et al. titled “Continuous Control with Deep Reinforcement Learning”. However, after our first check-in with our TA mentor we learned that this paper would too closely mimic homework 6, so we had to go back to the drawing board. Staying within a similar subject area, we then selected a paper by V. Mnih et al. titled “Playing Atari with Deep Reinforcement Learning”. In this paper, a DQN with experience replay is used to play Atari 2600 games. Our implementation of the paper focuses on the Atari game Breakout. We chose this game because of its success in the paper relative to the average human, as well as the simplicity of its action space. The goal of the game is to move a paddle along the bottom of the screen left and right, preventing the ball from hitting the bottom of the screen while directing it to break bricks at the top.

Methodology:
We use the Atari Gym library to load Breakout. We use a deterministic version of the game with a fixed frameskip of 4 frames, consistent with the implementation described in the paper.

As in the paper, we preprocess our images into 84x84 grayscale images. The observation from the environment is an RGB image 210px high and 160px wide. We crop 34 pixels from the top of the screen and 16 pixels from the bottom to make the input a square 160px by 160px image. The pixels removed at the top crop out the score, and the pixels removed at the bottom crop out the empty space below the paddle. The cropped images contain the entire playable area and all the information the model could find useful. We then convert the images to grayscale and downsize them to 84x84 pixels. Finally, we stack each input with the previous 3 frames and feed the resulting stack into our network.
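The preprocessing steps above can be sketched with plain NumPy. This is an illustrative stand-in, not our exact code: the nearest-neighbor index trick below approximates the 160x160 → 84x84 resize that a library like OpenCV or PIL would do with proper interpolation.

```python
from collections import deque

import numpy as np

def preprocess(frame):
    """Convert a (210, 160, 3) uint8 RGB frame to an (84, 84) float32 image in [0, 1]."""
    cropped = frame[34:194, :, :]          # drop score (top 34px) and empty space (bottom 16px) -> 160x160
    gray = cropped.mean(axis=2)            # average the RGB channels to get grayscale
    idx = np.arange(84) * 160 // 84        # nearest-neighbor indices for a crude 160 -> 84 resize
    small = gray[np.ix_(idx, idx)]
    return (small / 255.0).astype(np.float32)

def stack_frames(frames):
    """Stack the last 4 preprocessed frames into the (84, 84, 4) network input."""
    return np.stack(list(frames), axis=-1)

# Usage: keep a rolling window of the 4 most recent preprocessed frames.
history = deque(maxlen=4)
```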

Our network consists of the following: a convolutional layer with 16 8x8 filters with stride 4 and ReLU activation, another convolutional layer with 32 4x4 filters with stride 2 and ReLU activation, a flatten layer, a fully connected layer with 256 rectified units, and finally a fully connected output layer with a linear output (one Q-value per action).
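The architecture above could be expressed as follows; this is a sketch in Keras (the report doesn't specify our framework), with the number of actions as a parameter:

```python
import tensorflow as tf

def build_q_network(n_actions=4):
    """Q-network from the paper: two conv layers, a 256-unit dense layer, linear Q-value outputs."""
    inputs = tf.keras.Input(shape=(84, 84, 4))                                # 4 stacked 84x84 frames
    x = tf.keras.layers.Conv2D(16, 8, strides=4, activation="relu")(inputs)   # -> 20x20x16
    x = tf.keras.layers.Conv2D(32, 4, strides=2, activation="relu")(x)        # -> 9x9x32
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    outputs = tf.keras.layers.Dense(n_actions)(x)                             # linear: one Q-value per action
    return tf.keras.Model(inputs, outputs)
```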

After each step, the state, action, reward, next state, and terminal flag are added to a replay memory data structure, and the model then undergoes one training cycle. During training, a minibatch is randomly sampled from the replay memory. The states in the minibatch are passed through the Q-network, and the next states are passed through the target network. Our target network has the same structure as the Q-network but is not trained directly; its weights are instead copied from the Q-network every 5 episodes (about 2500 training steps).
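The replay memory can be sketched as a fixed-capacity buffer that discards its oldest transitions once full (names here are illustrative, not our exact code):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=50000):
        # deque with maxlen evicts the oldest transition automatically when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        """Uniformly sample a minibatch of transitions without replacement."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from this buffer breaks the correlation between consecutive frames, which is the key reason the paper uses experience replay.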

Our loss is a temporal-difference loss: the mean squared error between the Q-values predicted by the Q-network and the targets computed from the target network. We use this loss to train our Q-network while keeping the target network constant.
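Concretely, the target for each transition is the reward for terminal steps, and the reward plus the discounted maximum target-network Q-value otherwise. A NumPy sketch of this calculation (function names are illustrative):

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """y_i = r_i if terminal, else r_i + gamma * max_a' Q_target(s'_i, a')."""
    # (1 - dones) zeroes out the bootstrap term for terminal transitions
    return rewards + gamma * next_q_values.max(axis=1) * (1.0 - dones)

def td_loss(q_values, actions, targets):
    """MSE between Q(s_i, a_i) for the actions actually taken and the TD targets."""
    taken = q_values[np.arange(len(actions)), actions]
    return np.mean((taken - targets) ** 2)
```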

We have gone with the following hyperparameters: our epsilon (the probability that we select a random action, i.e. exploring rather than exploiting) is 0.05, our gamma (the discount rate used to compute the present value of future rewards) is 0.99, our minibatch size is 64, our maximum replay memory size is 50000, the number of episodes we train for is 200, and the number of episodes between each target network update is 5.
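These hyperparameters, and the epsilon-greedy action selection they drive, can be sketched as (constant and function names are illustrative):

```python
import numpy as np

EPSILON = 0.05            # probability of taking a random (exploratory) action
GAMMA = 0.99              # discount factor for future rewards
BATCH_SIZE = 64           # minibatch size sampled from replay memory
MEMORY_SIZE = 50000       # maximum replay memory capacity
NUM_EPISODES = 200        # total training episodes
TARGET_UPDATE_EVERY = 5   # episodes between target network weight copies

def select_action(q_values, n_actions, epsilon=EPSILON, rng=np.random):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.randint(n_actions))
    return int(np.argmax(q_values))
```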

Results:
We found that our model learned very quickly initially - it was on par with the Google and OpenAI results after around 100 episodes, scoring around 4 points on average, and after 150 episodes it was scoring 8-13 consistently. We saved the model at this point and were able to render the emulator and visualize our model playing the game. We decided to train our model for only 200 episodes, mainly due to the computational resources available to us. The score over this range of episodes started to taper off and generally sat in the 10-15 range. As we trained the model past 200 episodes we started to notice a decrease in performance, which we attributed to overfitting. Nonetheless, we got results we were happy with, and being able to visualize our model playing Breakout was really rewarding - excuse the pun!

Challenges:
We initially battled to get a solid understanding of all the concepts outlined in the paper, and as we began to code, more specific challenges came up. For instance, the concept of replay memory was a little foreign to us at first, so we had to spend some time researching it in order to understand and implement it. We also battled with where and how to calculate the loss between the Q-network and the target network - it was confusing at exactly which step we needed to compute this value.

Computational power was another issue we ran into. Running the model on our local machines took forever, so we claimed some GCP credits and ran our project on Google's infrastructure.

Reflection:
We feel that our project was successful. We knew we didn't have nearly the same computational power as the authors of the paper, so we were expecting much lower results. However, as we began to train our model we were ecstatic to find that it had results comparable to those of Google and OpenAI for the first 100 training episodes - scoring roughly 10 on average! We didn't set many quantitative goals going into the project, as we knew there would be a lot of variables affecting our success, but we were very happy with our results nonetheless. The model worked exactly how we expected it to... after we had fixed a couple of sneaky bugs! Our approach didn't change much over time, as we had a good idea of what needed to be done from the start, so it was more a question of getting the implementation correct. We did pivot away from a paper on continuous Q-learning early on, as our mentor TA said that it would be too close to the work we were completing in homework 6. If we had more time we would have used it to further train and optimize the model, as opposed to just getting it to work. Our biggest takeaways were: 1. the sheer amount of time it takes to train complicated algorithms like this, and 2. the need for real computational power in order to achieve success.


Updates


UPDATE: our paper has changed since our initial devpost introduction.

INTRODUCTION: We are implementing an existing paper by V. Mnih et al. titled “Playing Atari with Deep Reinforcement Learning”. The goal of this paper is to adapt a deep Q-learning algorithm to train on a set of 7 Atari 2600 games and compare the results to those obtained by two reinforcement learning algorithms, Sarsa and Contingency, as well as a human baseline. We will initially be training our DQN on the Atari Gym game Breakout - the paper's DQN with experience replay scored 168 on this game, and we hope to get close to this score.

CHALLENGES: We have battled to understand the concept of replay memory detailed in the paper. We also initially selected an algorithm which was very similar to the one completed in hw6, so we've had to change our approach to DQN and use the Atari game Breakout instead of the cart pole problem from the gym.

INSIGHTS: We’re still in the process of coding our model so we have not been able to provide any results up until this point.

PLAN: Yes, we're on track with our project. What we need to do next is: 1) finish preprocessing our images so that we can pass in batches of 4 images in the correct grayscale (84x84) format; 2) finish coding our DQN algorithm so that we can begin to test our model; 3) compare the results of our DQN to those of others and assess our success; 4) possibly train our model on other Atari games to see how well it fares against the results found in the paper. For comparative purposes, the paper's version of DQN scored 168 on the game.



Title: Continuous Q-learning

Group Members: Daniel Archer, Mathijs Deetman

Introduction: We are implementing an existing paper by T. Lillicrap et al. titled “Continuous Control with Deep Reinforcement Learning”. The goal of this paper is to adapt the Q-learning algorithm to create a deep, model-free reinforcement learning algorithm which can operate over continuous action spaces. In the paper, it is used to solve physics problems such as the cart pole swing-up problem. We chose this paper because we are both interested in, but have limited knowledge of, reinforcement learning and Q-learning, and we thought this paper would help increase our knowledge, allowing us to implement this kind of algorithm in different scenarios.

Related Work: The paper referenced related work but had no source code attached. It cites Silver et al.'s paper “Deterministic Policy Gradient Algorithms” as the most relevant piece of related work. That paper discusses how deterministic rather than stochastic policies allow the policy gradient to be estimated much more efficiently, and found that the deterministic actor-critic algorithm outperformed its stochastic counterparts by several orders of magnitude.

Data: We are going to train our algorithm in a physics-based environment, so there is no dataset required.

Methodology: We aren't learning a model of the environment, because Q-learning is a model-free algorithm. Rather, our algorithm will optimize its behavior based on the reward earned in a given scenario. We will be selecting a physics problem to test our algorithm on, as in the paper. One such problem is the cart pole swing-up system: a pole is attached to a cart which can move left and right along a track with varying force, and the goal is to manipulate the cart in order to keep the pole balanced upright for as long as possible. When implementing this paper, we feel that the hardest part will be translating the mathematics discussed in the paper into our algorithm in Python, as well as fully understanding the conceptual side of the algorithm.

Metrics: We plan to run the cart pole simulation with the goal of maximizing the amount of time the pole stays within a certain range of angles around the upright position, while the cart stays within a certain range of the center of the track. The notion of accuracy doesn't apply here, as we are trying to maximize the reward for a given scenario. A more appropriate metric would be - in the case of the cart pole scenario - how long we could keep the pole within 15 degrees of vertical, compared against other algorithms written to solve the same problem. Our base goal is to implement the algorithm and test it on the cart pole simulation. Our target goal is to test it on 1 or 2 slightly more complex tasks implemented in the paper, such as a puck-hitting task or a driving task. Our stretch goal would be to use our algorithm in a continuous action space of our own design.

Ethics: What broader societal issues are relevant to your chosen problem space? Given that our algorithm solves problems by maximizing a reward function, it is very possible that the results of this maximization have unintended side effects. In the real world, objectives are rarely well defined, so a reward function tries to quantify success on a numerical basis. An ethical issue that we need to be aware of is mapping the results of reward maximization into a real-world context and making sure that our algorithm isn't unintentionally creating harm in its attempt to earn maximum reward. Source: “The Societal Implications of Deep Reinforcement Learning” - Whittlestone et al. How are you planning to quantify or measure error or success? What implications does your quantification have? If our algorithm works, then it should be able to keep the pole upright. Our quantification is binary, and thus measuring success is very easy.

Division of labor: We plan to work on implementing the algorithm together. We will each try to create our own environment for the algorithm to be tested on (e.g. one does the cartpole simulation and the other does the driving simulation).
