
Title: Continuous Q-learning

Group Members: Daniel Archer, Mathijs Deetman

Introduction: We are implementing an existing paper by T. Lillicrap et al. titled “Continuous control with deep reinforcement learning”. The goal of the paper is to adapt Q-learning into a model-free deep reinforcement learning algorithm that can operate over continuous action spaces. The authors use it to solve physics problems such as the cart-pole swing-up problem. We chose this paper because we are both interested in reinforcement learning and Q-learning but have limited knowledge of them, and we thought that implementing it would deepen our understanding and let us apply this kind of algorithm in other scenarios.

Related Work: The paper references related work but has no source code attached. It cites Silver et al.’s paper “Deterministic Policy Gradient Algorithms” as the most relevant piece of related work. That paper discusses how deterministic, rather than stochastic, policies allow the policy gradient to be estimated much more efficiently. It found that the deterministic actor-critic algorithm outperformed its stochastic counterparts by several orders of magnitude.

Data: We are going to train our algorithm in a physics-based simulation environment, so no dataset is required.

Methodology: We are not learning a model of the environment, because Q-learning is a model-free algorithm. Rather, our algorithm will learn to optimize its actions based on the reward earned in a given scenario. We will select a physics problem to test our algorithm on, as in the paper. One such problem is the cart-pole swing-up system. In this example, a pole is attached to a cart that can move left and right along a track under a varying applied force. The goal is to manipulate the cart so as to keep the pole balanced upright for as long as possible.
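To give a flavour of the kind of update the paper describes, here is a minimal Python sketch of two pieces as we understand them: the target the critic is trained towards, and the slow "soft" update of the target networks. This is an illustration under assumptions, not our implementation. The function names are hypothetical, the parameters are plain lists rather than network weights, and the defaults gamma = 0.99 and tau = 0.001 are values we recall from the paper and should be checked against it.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Critic target y = r + gamma * Q'(s', mu'(s')).

    next_q stands in for the target critic evaluated at the action chosen
    by the target actor in the next state. No bootstrapping when the
    episode has ended.
    """
    return reward + (0.0 if done else gamma * next_q)


def soft_update(target_params, online_params, tau=0.001):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]


# Toy numbers only: one transition and a two-parameter "network".
y = td_target(reward=1.0, next_q=2.0, done=False)
new_target = soft_update([0.0, 1.0], [1.0, 1.0])
```

In a full implementation the critic would be regressed towards `y` over a minibatch of stored transitions, and `soft_update` would be applied to both the actor and critic target networks after each training step.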
When implementing this paper, we feel that the hardest part will be translating the mathematics in the paper into Python code, as well as fully understanding the conceptual side of the algorithm.

Metrics: We plan to run the cart-pole simulation with the goal of maximizing the amount of time the pole stays within a certain range of angles around the upright position while the cart stays within a certain range of the center of the track. The notion of accuracy doesn’t apply here, as we are trying to maximize the reward for a given scenario. A more appropriate metric, in the case of the cart-pole scenario, is how long we can get the pole to remain within 15 degrees of vertical, compared against other algorithms that have been written to solve the same problem.

Our base goal is to implement the algorithm and test it with the cart-pole simulation. Our target goal is to test it with 1 or 2 slightly more complex tasks from the paper, such as a puck-hitting task or a driving task. Our stretch goal would be to use our algorithm in a continuous action space of our own design.

Ethics: What broader societal issues are relevant to your chosen problem space?
- Given that our algorithm solves problems by maximizing a reward function, it is very possible that this maximization will have unintended side effects. In the real world it is not easy to have well-defined objectives, so a reward function tries to quantify success on a numerical basis.
- An ethical issue that we need to be aware of is how the results of reward maximization map into a real-world context, and making sure that our algorithm isn’t unintentionally causing harm in its attempt to earn maximum reward. Source: The Societal Implications of Deep Reinforcement Learning, Whittlestone et al.

How are you planning to quantify or measure error or success? What implications does your quantification have?
- If our algorithm works, it should be able to keep the pole upright. Our quantification is binary, so measuring success is very easy.
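As a rough sketch of how the upright-time metric described under Metrics could be measured, the function below counts how many consecutive time steps an episode stays within the limits. The function name, the input format (one pole angle in radians and one cart position per step) and the track limit of 2.4 are our own placeholder assumptions. Only the 15 degree threshold comes from the plan above.

```python
import math


def upright_steps(angles_rad, cart_positions, max_angle_deg=15.0, max_offset=2.4):
    """Count consecutive steps, from the start of an episode, in which the
    pole is within max_angle_deg of vertical and the cart is within
    max_offset of the track center. max_offset is a placeholder value.
    """
    limit = math.radians(max_angle_deg)
    steps = 0
    for angle, x in zip(angles_rad, cart_positions):
        if abs(angle) > limit or abs(x) > max_offset:
            break  # first violation ends the count
        steps += 1
    return steps


# Toy episode: the third step tilts past 15 degrees (about 0.262 rad).
episode_score = upright_steps([0.0, 0.1, 0.3, 0.0], [0.0, 0.0, 0.0, 0.0])
```

A binary success check can then be derived by comparing this count against a chosen episode length, and the raw count can be compared across algorithms.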

Division of labor: We plan to work on implementing the algorithm together. We will each try to create our own environment for the algorithm to be tested on (e.g. one does the cartpole simulation and the other does the driving simulation).
