We were inspired by recent advances in reinforcement learning (RL) from OpenAI and DeepMind. In particular, the recent hide-and-seek demonstration, along with the now-famous AlphaZero results, piqued our curiosity and motivated us to learn more about RL. We thought a simplified toy problem in a similarly ambitious field (space exploration) would be an interesting challenge for the team.
What it does
We have two simulations: a lunar and an earth simulation. The lunar simulation replicates the moon's gravity. Its reward function grants a large reward for landing on the central platform and a moderate reward for landing anywhere else on the map. This encourages safe descents regardless of position, which helps the agent discover sensible trajectories early in training. The earth simulation was far more challenging. In addition to earth's stronger gravity (and therefore more fuel required for a safe landing, which costs reward points), we added air resistance and random wind fluctuations and, crucially, made the platform much narrower. The platform in the earth environment was surrounded by water rather than the solid ground of the lunar environment, so reward was given only for near-perfect landings rather than for landing at all. In the end, DQN proved sufficient for both environments.
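The reward shaping described above can be sketched as a simple function. This is an illustrative sketch only: the names, pad width, and reward magnitudes are assumptions, not the actual Gym reward function.

```python
# Hypothetical sketch of the landing-reward idea: a large bonus for touching
# down on the central pad, a smaller bonus for any other safe landing.
# All values here are illustrative assumptions.

def landing_reward(x_position, landed_safely, pad_half_width=0.2,
                   pad_bonus=100.0, safe_bonus=30.0):
    """Return the terminal reward for one landing attempt."""
    if not landed_safely:
        return 0.0          # crashes earn nothing
    if abs(x_position) <= pad_half_width:
        return pad_bonus    # landed on the central platform
    return safe_bonus       # safe landing, but off the pad
```

Because any safe landing still pays something, early random policies that merely avoid crashing are rewarded, which speeds up the search for good trajectories.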
How we built it
Using the OpenAI Gym framework, we trained a DQN reinforcement learning agent from the RLlib library. We trained both the earth and the lunar lander on our local machines. Our exploration factor decreased linearly with time, so the agent avoided getting stuck in local optima while keeping the computational cost reasonable. We also tweaked the OpenAI lunar lander Gym environment to better simulate SpaceX-like landing conditions.
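The linearly decaying exploration factor amounts to annealing the epsilon of an epsilon-greedy policy from a high starting value down to a small floor. A minimal sketch, with illustrative start/end values and horizon (not our exact settings):

```python
def epsilon(step, start=1.0, end=0.02, decay_steps=100_000):
    """Linearly anneal the exploration rate from `start` to `end`.

    Early in training the agent acts almost entirely at random
    (epsilon near 1.0); after `decay_steps` environment steps it
    mostly exploits its learned Q-values (epsilon at `end`).
    """
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

At each step the agent picks a random action with probability `epsilon(step)` and the greedy (highest-Q) action otherwise.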
Challenges we ran into
Initially, the DQN implementation we were using (stable-baselines) was not training well in either environment. We switched to a different implementation of DQN, namely RLlib, which solved the problem.
Accomplishments that we're proud of
Both the earth and the lunar lander models converged to roughly the maximum possible Gym reward despite randomized initial conditions.
What we learned
The basic structure of DQN reinforcement learning algorithms. We also learned how difficult RL models can be to train: results vary with the initial random seed, and different library implementations of the same algorithm can differ greatly in performance.
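The core of the DQN structure we learned is the Bellman target used in the loss: the network's Q-value for the taken action is regressed toward the reward plus the discounted best Q-value of the next state. A minimal NumPy sketch of that target computation (batch shapes and names are illustrative):

```python
import numpy as np

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute Bellman targets y = r + gamma * max_a' Q(s', a').

    rewards:       shape (batch,) rewards for each transition
    next_q_values: shape (batch, num_actions) Q-values of the next state
                   from the target network
    dones:         shape (batch,) 1.0 where the episode ended, else 0.0
                   (terminal states contribute no future value)
    """
    return rewards + gamma * (1.0 - dones) * next_q_values.max(axis=1)
```

The DQN loss is then the squared error between these targets and the online network's Q-values for the actions actually taken.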
What's next for LunarLander-v2
Transfer learning. We want to take the trained model from the lunar lander and see how it fares in the earth lander environment. We also want to compare transferring the model as-is with briefly fine-tuning it after the transfer. This would show how much additional training a pretrained model actually needs to transfer effectively.