Introduction

Hello there! Thank you for stopping by.

This little project is a reimplementation of the 2019 paper by Georgios Pavlakos, Nikos Kolotouros, and Kostas Daniilidis, TexturePose: Supervising Human Mesh Estimation with Texture Consistency. Using this framework, we perform model-based human pose estimation (HPE) from natural images, which dramatically expands the applicability of HPE. We follow the "natural" supervision approach of Pavlakos et al., training on annotated images from both in-the-wild and controlled settings.

Check It Out

Updated Write-Up: https://docs.google.com/document/d/1s3FFmNTk4xW227E-kTY6AaEgpMwKQX1Wy4Fem8erRdM/edit?usp=sharing
Final Write-Up: https://docs.google.com/document/d/1m0d5hAAGqIySBEMhmgQj5fiVi94dTSnsCtGuNv-4Ars/edit?usp=sharing
Poster: https://drive.google.com/file/d/1XQkFgYg3RaDbc9ST6wteTbwB3VXtb3aa/view?usp=sharing
Presentation: https://docs.google.com/presentation/d/1gnn9icfZpsabVUY3bnNBr0XxBBhcz4RrUt_HurnsByw/edit?usp=sharing

Follow Our Journey

Outline: https://docs.google.com/document/d/1NsunpIJBlJWJmCdD55Kqc1Y6mNMfeNeVh5QgX2s9l_I/edit?usp=sharing
Reflection: https://docs.google.com/document/d/1dCLbGj62dyadDcnEmQLPjzIuL46VRRJ9_Xfw2z02YzE/edit?usp=sharing

Related Work

TexturePose builds upon End-to-end recovery of human shape and pose, which, in turn, relies heavily on both Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image and MoSh: Motion and shape capture from sparse markers.

[1] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.

[2] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In ECCV, 2016.

[3] M. Loper, N. Mahmood, and M. J. Black. MoSh: Motion and shape capture from sparse markers. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH Asia, 33(6):220:1–220:13, 2014.

Methodology

We use a Generative Adversarial Network supported by reprojection loss and texture consistency loss to generate realistic meshes.

The basic GAN has a generator that passes the input through a pre-trained ResNet to extract image features. STAR/SMPL parameters are then generated with an iterative regression loop. Next, the discriminator, trained on MoShed datasets, evaluates whether these parameters describe a plausible human body. The generator loss combines the discriminator's output with the reprojection error between the ground-truth 2D joint locations and the projected 3D joint locations.
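The iterative regression loop can be sketched as follows. This is a minimal illustration, not the original code: `regressor` stands in for the learned network that maps the image features plus the current parameter estimate to a parameter update, and all names here are hypothetical.

```python
import numpy as np

def iterative_regression(features, regressor, theta_init, n_iters=3):
    """HMR-style iterative error feedback: instead of predicting the
    SMPL/STAR parameters in one shot, predict a small correction to the
    current estimate and repeat for a few steps."""
    theta = theta_init
    for _ in range(n_iters):
        # The regressor sees both the image features and the current guess.
        delta = regressor(np.concatenate([features, theta]))
        theta = theta + delta  # additive refinement
    return theta

# Toy usage: a stand-in "regressor" that moves the parameters halfway
# toward a fixed target on every step.
target = np.array([1.0, 2.0, 3.0])
features = np.zeros(5)
regressor = lambda x: 0.5 * (target - x[5:])  # last 3 entries = current theta
theta = iterative_regression(features, regressor, np.zeros(3))
# After 3 half-steps: theta = 0.875 * target
```

In the actual networks, the update steps let the regressor correct its own mistakes, which is easier to learn than a single direct mapping to the full parameter vector.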

TexturePose improves on the above architecture by projecting the images onto the predicted meshes as textures and determining which portions of each image are visible. For two images of the same person, TexturePose defines a loss that penalizes differences between texel values at points visible in both images.

Given two images Inputi and Inputj :

  1. Inputi → CNNi → predicted shapei → texture mapi
  2. Inputj → CNNj → predicted shapej → texture mapj

Loss is calculated as follows:

1 & 2 ⇒ || Vij ⊙ (Ai − Aj) ||, where Ai and Aj are the two texture maps and Vij = Vi ⊙ Vj is the joint visibility mask.
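The loss above can be sketched in a few lines. This is an illustrative NumPy version under simplifying assumptions (an L1 penalty normalized by the number of jointly visible texels; the function and variable names are ours, not from the original implementation):

```python
import numpy as np

def texture_consistency_loss(A_i, A_j, V_i, V_j):
    """Penalize texel differences between two texture maps of the same person.

    A_i, A_j: (H, W, 3) texture maps unwrapped from two images.
    V_i, V_j: (H, W) binary visibility masks for each view.
    Only texels visible in BOTH views contribute, via V_ij = V_i * V_j,
    matching || V_ij ⊙ (A_i − A_j) || above.
    """
    V_ij = (V_i * V_j)[..., None]      # joint visibility mask, broadcast to RGB
    diff = V_ij * (A_i - A_j)          # masked texel difference
    n_visible = max(V_ij.sum() * A_i.shape[-1], 1)
    return np.sum(np.abs(diff)) / n_visible

# Toy usage: identical textures give zero loss.
A = np.ones((4, 4, 3))
V = np.ones((4, 4))
loss = texture_consistency_loss(A, A, V, V)  # → 0.0
```

Because the appearance of a person is (approximately) constant across frames, this term supervises the predicted shape and pose without any 3D annotations: a wrong mesh unwraps inconsistent textures, which raises the loss.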

Data

Human3.6M: 3.6 million video frames with corresponding 3D poses captured simultaneously by 4 high-resolution progressive scan cameras (http://vision.imar.ro/human3.6m)

Leeds Sports Pose Extended Training Dataset: 10,000 images, each with 14 corresponding 2D joint annotations (https://sam.johnson.io/research/lspet.html)

MPII Human Pose Dataset (MPII): 25,000 images (> 40,000 people, 410 activities) with 2D joint annotations (http://human-pose.mpi-inf.mpg.de/)

MoShed CMU Graphics Lab Motion Capture Database: shape and pose parameters extracted with MoSh from CMU motion capture data (https://drive.google.com/file/d/1b51RMzi_5DIHeYh2KNpgEs8LVaplZSRP/view)

Existing Implementations

TexturePose: https://github.com/geopavlakos/TexturePose (PyTorch)
End-to-End Implementation #1: https://github.com/akanazawa/hmr (Python)
End-to-End Implementation #2: https://github.com/MandyMo/pytorch_HMR (PyTorch)

Built With

  • gcp
  • mosh
  • python
  • smpl
  • star
  • tensorflow