Six years ago, TwitchPlaysPokemon took the livestreaming world by storm by letting Twitch chat collectively play Pokemon Red. Twitch chat has been out of the game for a while, but now it's time for them to come out of retirement and train the next generation of gamers. Only this time, the player is a machine.
What it does
TwitchTrainsMario is a livestream where users in Twitch chat give feedback to an AI agent as it learns to play the original Super Mario Bros. for the Nintendo Entertainment System. Twitch users type "good bot" or "bad bot" to reward or penalize the agent for good or bad behavior. This feedback serves as the reward signal for a reinforcement learning algorithm, which learns from it how to play the game.
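As a rough illustration of how chat messages could become a reward, here is a minimal sketch. The function name `chat_reward` and the scoring rule (+1 per "good bot", -1 per "bad bot", summed over a batch) are assumptions for illustration, not necessarily what the stream uses.

```python
def chat_reward(messages):
    """Turn a batch of chat messages into a single scalar reward.

    Assumed scoring: +1 for each "good bot", -1 for each "bad bot",
    and 0 for anything else.
    """
    reward = 0
    for message in messages:
        text = message.strip().lower()  # tolerate case and stray whitespace
        if text == "good bot":
            reward += 1
        elif text == "bad bot":
            reward -= 1
    return reward
```

Summing the votes means a divided chat roughly cancels out, while a unanimous chat produces a strong signal.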
This may seem like a silly system, but there is a strong theoretical motivation for using Twitch chat as a reward signal. In reinforcement learning, there are two ubiquitous problems: sparse rewards and credit assignment. Chess illustrates the first: the typical reward structure gives a player 1 point for winning a game and -1 for losing, with no reward for any intermediate state. By incorporating Twitch chat, the agent can receive denser rewards, allowing it to learn faster.
The other common issue, credit assignment, is the problem of deciding which intermediate states contributed the most to a particular victory. Obviously the move where you checkmated your opponent was important, but what about the states leading up to that? Again, having Twitch chat continuously contribute rewards throughout the training process will greatly simplify this problem. You can think of this as crowdsourced human-in-the-loop training for AI.
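Discounted returns are one standard way to see both problems at once. The sketch below is a generic illustration, not code from this project; the function name and the discount factor `gamma=0.9` are assumptions.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute the discounted return G_t = r_t + gamma * G_{t+1} for each step."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):  # work backwards from the final step
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Sparse: a single reward at the end, like winning a chess game.
sparse = discounted_returns([0, 0, 0, 1])

# Dense: feedback at intermediate steps, like votes from chat.
dense = discounted_returns([0, 1, -1, 1])
```

In the sparse case every earlier step gets the same final reward, only scaled down by the discount, so nothing distinguishes a good intermediate move from a bad one. In the dense case the returns differ from step to step, which gives the learner more to work with.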
How I built it
There are a lot of components to this system. First, I wrote a simple IRC chatbot in Python that connects to Twitch chat and records input from the users. It runs asynchronously (using asyncio) during training so that feedback enters the training loop as quickly as possible. Second, I created a custom training script by modifying components of the Autonomous Learning Library (ALL) to incorporate the Twitch reward signal. This was a significant modification: to collect feedback while the agent is training, I had to create an entirely asynchronous training system and modify the internal state representation of the agents. Third, and less technical than everything else, I had never streamed anything before, so I had to learn how to set up a stream with OBS and design a stream layout.
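Twitch chat messages arrive over IRC as `PRIVMSG` lines, so the chatbot's core job is pulling the user and text out of each line. The following is a minimal sketch of that step, not the project's actual bot; the function name `parse_privmsg` and the example channel and user names are hypothetical.

```python
def parse_privmsg(line):
    """Extract (user, message) from a raw IRC PRIVMSG line.

    Returns None for anything that is not a PRIVMSG (for example PING).
    Expected shape: ":nick!nick@host PRIVMSG #channel :message text"
    """
    if not line.startswith(":"):
        return None
    prefix, _, rest = line[1:].partition(" ")
    command, _, params = rest.partition(" ")
    if command != "PRIVMSG":
        return None
    _channel, separator, message = params.partition(" :")
    if not separator:
        return None
    user = prefix.split("!", 1)[0]  # nick is everything before the "!"
    return user, message.rstrip("\r\n")
```

For example, `parse_privmsg(":alice!alice@alice.tmi.twitch.tv PRIVMSG #somechannel :good bot")` yields `("alice", "good bot")`, while a keepalive line such as `PING :tmi.twitch.tv` yields `None`.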
Challenges I ran into
Adding an outside reward signal into training is very much a hack, and as a result, I had to spend a long time messing around with the internals of ALL to figure out how to incorporate this information without breaking the training scheme.
The other challenge was writing all of the code asynchronously. Both the chatbot and the training system run in loops that block processing. This means that chat can't be collected while the agent is acting in the environment, and the agent can't act while chat is being collected. Due to the way Python implements concurrency, both of these components need to be written asynchronously in order to run together. Making an asynchronous chatbot server is pretty standard, but writing asynchronous training code was a unique challenge for me. Fortunately, I recently worked on a project that heavily used asyncio, so I had some practice with writing asynchronous Python code.
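The interleaving described above can be sketched with two coroutines that share an `asyncio.Queue`. This is a simplified stand-in, not the real training code: the chat is a fixed list, the environment step is a placeholder `asyncio.sleep(0)`, and all function names are hypothetical.

```python
import asyncio


async def chat_listener(queue, messages):
    """Stand-in for the IRC bot: pushes each chat message onto a shared queue."""
    for message in messages:
        await queue.put(message)
        await asyncio.sleep(0)  # yield so the training loop can run


async def training_loop(queue, steps):
    """Stand-in for training: one simulated environment step per iteration."""
    rewards = []
    for _ in range(steps):
        await asyncio.sleep(0)  # placeholder for acting in the environment
        feedback = 0
        while not queue.empty():  # drain whatever chat arrived since last step
            text = queue.get_nowait()
            if text == "good bot":
                feedback += 1
            elif text == "bad bot":
                feedback -= 1
        rewards.append(feedback)
    return rewards


async def main():
    queue = asyncio.Queue()
    chat = ["good bot", "bad bot", "good bot"]
    listener = asyncio.create_task(chat_listener(queue, chat))
    rewards = await training_loop(queue, steps=10)
    await listener
    return rewards


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because each loop hands control back at its `await` points, neither one blocks the other, and the training loop picks up feedback at the next step after it arrives.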
Accomplishments that I'm proud of
I've wanted to build this for a while, so I'm really happy that I was able to build it in a single weekend. This is some confirmation for me that I'm finally learning how to write reinforcement learning code, and getting better with some programming concepts that I used to struggle with.
What's next for Twitch Trains Mario
There is a ton of room for improvement for this project. There are some technical changes I could make that would give even more control to Twitch users. Similarly, I'd like to add some dynamic features to the stream layout to let Twitch chat know how much their input is affecting training.
The biggest next step for TwitchTrainsMario will be training on something other than Mario. Mario is actually a really good game for reinforcement learning. The default reward signal, which combines your x position on the screen, the time, and your score, is a really easy reward for modern RL algorithms to learn from. The problems of sparse rewards and credit assignment aren't really noticeable here. I'd like to try this with a much harder exploration problem to see how effective Twitch chat is as a teacher. Maybe I'll implement TwitchTrainsPokemon to really bring back the nostalgia from six years ago.
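To show why a reward built from x position, time, and score is already dense, here is a hypothetical sketch of such a signal. The function name, the dictionary keys, and the weights are all assumptions for illustration; the actual default reward in the Mario environment may be computed differently.

```python
def shaped_reward(prev, curr, w_x=1.0, w_time=0.1, w_score=0.01):
    """Combine per-step changes in x position, clock, and score into one reward.

    `prev` and `curr` are dicts with "x", "time", and "score" keys
    (a hypothetical state layout). The weights are illustrative only.
    """
    dx = curr["x"] - prev["x"]              # progress to the right
    dt = curr["time"] - prev["time"]        # clock counts down, so usually <= 0
    dscore = curr["score"] - prev["score"]  # points earned this step
    return w_x * dx + w_time * dt + w_score * dscore
```

Since almost every step changes at least one of these quantities, the agent gets a nonzero reward nearly every frame, which is why extra feedback from chat matters less here than it would in a sparse-reward game.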