Illustration of our human pose estimation pipeline. We generate synthetic depth images in Blender using an animated human 3D model. We train a domain adaptation model to translate synthetic images into realistic depth images, which we then use to train a human pose estimation model.

Unsupervised Human Pose Estimation on Depth Images

T. Blanc-Beyne, A. Carlier, S. Mouysset, V. Charvillat.
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2020

PDF Slides Slides+Comments

Abstract

Human pose estimation is a widely studied problem in computer vision that consists in regressing body joint coordinates from an image. Most state-of-the-art techniques rely on RGB or RGB-D data, but, driven by an industrial use case aimed at preventing musculoskeletal disorders, we focus on estimating human pose from depth images only. In this paper, we propose an approach for predicting 3D human pose in challenging depth images using an image-to-image translation mechanism. As our dataset consists only of unlabelled data, we generate an annotated set of synthetic depth images using a human 3D model that provides geometric features of the pose. To fit the challenging nature of our real depth images as closely as possible, we first refine the synthetic depth images with an image-to-image translation approach based on a modified CycleGAN. This architecture is trained to render realistic depth images from synthetic ones while preserving the human pose. We then pair the labels from our synthetic data with the realistic outputs of the CycleGAN to train a convolutional neural network for pose estimation. Our experiments show that the proposed unsupervised framework achieves good results on both standard and challenging datasets.
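The core idea, translating synthetic images into the real depth domain under a cycle-consistency constraint and reusing the synthetic labels to supervise the translated images, can be sketched as follows. This is a minimal toy sketch, not the paper's actual networks: the generators here are hypothetical linear maps standing in for the convolutional CycleGAN generators, and all names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generators": invertible linear maps standing in for the CycleGAN
# networks. G_s2r translates synthetic -> realistic depth; G_r2s the reverse.
W_s2r = np.eye(64) + rng.normal(scale=0.1, size=(64, 64))
W_r2s = np.linalg.inv(W_s2r)  # here G_r2s exactly inverts G_s2r

def G_s2r(x):
    return x @ W_s2r

def G_r2s(x):
    return x @ W_r2s

def cycle_loss(x_syn):
    """L1 cycle-consistency: translate to the real domain and back,
    and penalize any deviation from the original synthetic image."""
    return np.mean(np.abs(G_r2s(G_s2r(x_syn)) - x_syn))

# A batch of flattened synthetic depth patches with known pose labels.
x_syn = rng.normal(size=(8, 64))
y_pose = rng.normal(size=(8, 3))   # 3D joint coordinates from the 3D model

loss = cycle_loss(x_syn)           # near zero: G_r2s inverts G_s2r exactly

# Pair the CycleGAN outputs with the synthetic labels: these (image, pose)
# pairs would then supervise the pose estimation CNN.
x_real_like = G_s2r(x_syn)
train_pairs = list(zip(x_real_like, y_pose))
```

In the paper's full setup the translator is additionally constrained to preserve the human pose, so that the synthetic labels remain valid for the translated images; the toy inverse pair above only illustrates the cycle-consistency term.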