Motion of the Ocean

Our poster!
Our humanoid agent attempts a backflip.

Team members

Lucas Schroeder, Peter Zubiago, Chris Luke

Final Write Up

Check out the Google Docs version of out final write-up at the link below! (Same content as this Devpost)

https://docs.google.com/document/d/1fAVw1Tmg5h6_0Iuz13tIqhkI46XcPP_qetUk0hGP22w/edit

Final Poster

https://drive.google.com/file/d/1Dnh30ErG-KltKt8Sw57V8rz8n-K8fPID/view?usp=sharing

Introduction:

Our group began this project with the task of better understanding how to animate the way that bodies move through space. We were interested in the capacity of reinforcement learning algorithms to capture how desirable any given state is when provided with a reference. By integrating these movements into a physics-based environment, we hoped to be able to create a helpful visualization of human motion and interaction as a way to better render realistic environments in digital settings.

Methodology:

Our model trains with Proximal Policy Optimization, which is a reinforcement learning algorithm that is both more accurate and (ostensibly) easier to implement than other RL algorithms. As a policy optimization algorithm, it shares some features with REINFORCE.

Actor/Critic Network

An action in this model is represented as a 36-dimensional list, where each dimension represents a motion for a joint. To generate an action, the actor network is called on the current state. The actor network consists of three linear layers, the first two of which have relu activation. The first later has 1024 parameters, the second has 512 parameters, and the third has 36 parameters. After calling the actor network, the action that is applied is found by sampling from Gaussian distributions, where the result of the actor network at a given index is the mean and the standard deviation of the actor network’s result is the standard deviation. The sampling gives us the ability to pick continuous actions that are necessarily definitive results from the actor network. This action is then applied to the model.

The critic network has 1024 parameters in its first layer and 512 in its second layer (both with relu activation), but its final layer has one parameter with linear activation.

Loss

The model uses Proximal Policy Optimization (PPO), which is a variant of REINFORCE. It utilizes an actor and critic network but limits the amount that the policy is able to change. By calculating the ratio between the updated policy’s action and the old policy’s action, we can decide how much of a change in policy we are willing to tolerate. To do this, actor loss is calculated as either the product of ratio of the new policy to the old policy and the advantages or the product of the clipped ratio and the advantages, depending on which one is less, allowing us to ignore trials where the new ratio was radically different from the previous one. This ensures that the new policy does not deviate too far from the previous, resulting in smoother training and less variance. Critic loss is the mean square error of the output of the critic network and the discounted rewards.

Ratio = [new π(state)]/ [old π(state)]
S1 = Ratio * Advantages
S2 = clip(ratio, 1 + ε, 1 - ε) * Advantages
Actor_loss = min(S1, S2)
Critic_loss = mean_square_error(result_of_critic_network, discounted_rewards)
Total_loss = Critic_loss - Actor_loss

Reward

The total rewards are calculated as the addition of four different criteria. There are 4 different reward terms, one for the pose, one for the velocity, one for the end-effectors (i.e. hands and feet), and one for the center of mass. Each reward is calculated by summing all of the square differences between each of the reference model’s features and the simulated model’s features for each joint in the model and multiplying said sum by a scale factor (-2 for pose, -0.1 for velocity, -40 for end-effectors, and -10 for center of mass). Each reward term is then given a different weight (.65 for pose, .1 for velocity, .15 for end-effectors, and .1 for center of mass), and are then added together to come up with the total reward.

Reference State Initialization

At the start of each episode, a state will be randomly sampled from the reference motion, and used to initialize the state of the agent. This allows the model to explore - if the initial state is fixed, the model cannot learn later poses without mastering early poses. Early Termination When the conditions for early termination are met, rewards for the remainder of the are set to zero. This disincentivizes learning unhelpful behaviors and prevents early iterations of the model from learning that these bad actions are reasonably fine. Our conditions for early termination are whenever a major part of the agent’s body (e.g. head or chest) hits the ground.

Results:

Our results are, unfortunately, not up to the level that we had hoped to achieve. The average rewards for our episodes neither increases nor decreases substantially. It is also very unstable, varying wildly from one episode to the next. After 557,056 testing episodes of training, the model had a nearly identical reward as it began with. Chart 1 shows trends over these first episodes, showing a slight downward trajectory of reward values and Table 1 gives more specific values for samples.

Link to Chart 1

Table 1. Specific rewards at a given episode.

# samples trained	avg. reward
4096	15.552906
81920	14.419824
163840	14.111729
245760	13.317774
327680	14.042795
409600	13.061647
491520	14.388458
557056	12.911989

We believe the error in our model lies in our loss, which affects how the gradients are being updated. This error is a result of how we find the ratios between the old policy and the new policy. Our model compares the action taken by the old policy with that from the new policy to compare the change in policy. We believed this would be a good measurement of the change in policy, however, we may have found better luck by actually computing a mean and standard deviation for the weights and parameters of the policy.

Challenges:

Most challenges in the project stemmed from our initial unfamiliarity with the learning algorithm. It was difficult to understand the Deep Mimic paper the first time through and sitting down to fully understand how PPO was working - and how the paper itself chose to quantify the parameters - was initially daunting. As we worked through it, however, we became significantly more comfortable with the concepts and feel confident in our understanding of the algorithm.

Additionally, the original source code (which contained the pybullet physics engine and other code that helped us render and interpret our model) was in the deprecated Tensorflow 1, which proved difficult to convert to Tensorflow 2. We had to strip a lot of unnecessary code and implement gradient tape. It is possible that something in conversion interfered with the model’s learning.

The reward function itself was also initially a challenge, as it depended on the graphical state of the visual model. With little experience in computer graphics, it was at first a little disorienting to be working with representations of these graphics (i.e. quaternions) to determine some of the reward terms, as was required to compute pose reward.

Reflections and Other Avenues to Pursue:

Coming out of this project, we feel we have a better grasp on policy-gradient methods, having now implemented both PPO and REINFORCE with baseline. Randomized initial state and early termination are important for training because they allow for exploration. Additionally, we learned how to implement more sophisticated reward functions compared to CartPole.

We believe that training the actor to perform a continuous sequence of actions, based on its current environment, is the next direction to take the model in. There is significantly more work to be done to improve the process by which a model responds to different environmental stimuli in order to trigger a more complex range of activities. For example, the agent could be trained to traverse a narrow, twisting path. This is possible because the learning policies from reinforcement learning are very robust.

In the future, other industries can use this work to aid in tasks such as animation or robotics. More research should be done to speed up the training process for models such as these, as it currently takes close to a full 24 hours.

Ethics

Our metric for success relies on a classification of what a normal body moves like. By normal, we mean able-bodied. One of the implications of our model is that it will most likely not be trained on frames with disabled bodies. While the model should be able to be trained on different body types because it takes physics into account, the data sets available for us to train on do not feature any of this type of movement. As a result, our model does end up perpetuating a lack of representation among disabled bodies.

Another ethical question to consider is the impact that these automated tools will have on artists. This model offers a compromise between artistic direction and adaptability. Will the advent of these tools drastically change the landscape of graphical animation, wiping out potential creativity for the creator?

Division of Labor

As a group, we all plan to work together to complete each part of the model. However, to simplify things, we've decided to each focus on different parts of processing the RL algorithm. Peter will focus on the a custom reward function, Lucas will focus on the training logic and visualizer, and Chris will take the helm on the loss function.