SCULPT
(Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes)
Soubhik Sanyal, Partha Ghosh, Jinlong Yang, Michael J. Black, Justus Thies, Timo Bolkart
Computer Vision and Pattern Recognition (CVPR) 2024, Seattle, USA
Code | Dataset | arXiv paper | YouTube | PS Project Page
Abstract
We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes of humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans, and that multiple appearances can be mapped to a single geometry. To effectively learn from these two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically, we first learn a pose-dependent geometry space from 3D scan data, represented as per-vertex displacements w.r.t. the SMPL model. Next, we train a geometry-conditioned texture generator in an unsupervised way using the 2D image data, conditioning it on intermediate activations of the learned geometry model. To alleviate entanglement between pose and clothing type, and between pose and clothing appearance, we condition both generators with attribute labels: clothing types for the geometry generator and clothing colors for the texture generator. We automatically generate these conditioning labels for the 2D images with the visual question answering model BLIP and with CLIP. We validate our method on the SCULPT dataset and compare it to state-of-the-art 3D generative models for clothed human bodies.
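The automatic labeling step described above rests on zero-shot classification: an image embedding is compared against text-prompt embeddings for each candidate label, and the most similar one wins. The sketch below shows only this cosine-similarity mechanism; the clothing labels are hypothetical examples, and random vectors stand in for real CLIP embeddings (the actual pipeline would use a pretrained CLIP model):

```python
import numpy as np

def zero_shot_label(image_emb, prompt_embs, labels):
    """Return the label whose text-prompt embedding has the highest
    cosine similarity to the image embedding (CLIP-style zero-shot)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return labels[int(np.argmax(txt @ img))]

# Hypothetical clothing-type labels; stand-in embeddings, not real CLIP output.
labels = ["shirt", "t-shirt", "jacket"]
rng = np.random.default_rng(0)
prompt_embs = rng.normal(size=(3, 512))
# Fake an image whose embedding lies near the "jacket" prompt.
image_emb = prompt_embs[2] + 0.01 * rng.normal(size=512)
print(zero_shot_label(image_emb, prompt_embs, labels))  # prints "jacket"
```

In the paper's setting, this kind of similarity lookup (plus BLIP question answering) replaces manual annotation of clothing type and color for the 2D training images.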
Overview of the method
SCULPT consists of two StyleGAN-based generators, Ggeo for geometry and Gtex for appearance, both operating in the UV space of the SMPL body model. The geometry network Ggeo outputs pose-dependent displacement maps that are added to the SMPL template mesh; it is trained on 3D scan data. Building on this geometry model, the appearance generator Gtex is trained in an unsupervised way using adversarial losses computed on rendered images of the generated synthetic humans, and it is conditioned on intermediate features of the geometry network. Besides the noise code, both generators receive additional attribute inputs: a clothing type cg for the geometry and a clothing appearance ct for the texture. This strengthens the coupling between appearance and geometry and offers user-friendly control over the generation.
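The core deformation step above (adding a generated UV-space displacement map to the SMPL template) can be sketched as follows. This is a minimal illustration, not the released code: the mesh size, UV coordinates, and map resolution are toy placeholders (real SMPL has 6890 vertices), and a nearest-neighbour UV lookup stands in for whatever sampling the actual pipeline uses:

```python
import numpy as np

def displace_template(template_verts, disp_map, vert_uvs):
    """Add per-vertex displacements, sampled from a UV-space
    displacement map, to template mesh vertices.

    template_verts: (V, 3) rest-pose template vertices
    disp_map:       (H, W, 3) displacement map in UV space
    vert_uvs:       (V, 2) per-vertex UV coordinates in [0, 1]
    """
    h, w, _ = disp_map.shape
    # Nearest-neighbour lookup of each vertex's displacement in UV space
    # (v axis flipped, since image rows grow downward).
    cols = np.clip((vert_uvs[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
    rows = np.clip(((1.0 - vert_uvs[:, 1]) * (h - 1)).round().astype(int), 0, h - 1)
    return template_verts + disp_map[rows, cols]

# Toy example: a 4-vertex "mesh" and a uniform 1 cm outward offset.
verts = np.zeros((4, 3))
uvs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
dmap = np.full((8, 8, 3), 0.01)
clothed = displace_template(verts, dmap, uvs)  # every vertex moved by 0.01
```

In SCULPT, Ggeo produces `disp_map` as a pose- and clothing-type-dependent output, so the same template yields differently clothed geometry per sample.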
Download Information
- arXiv paper
- The SCULPT dataset can be downloaded after accepting the license agreement and logging in on this website.
- Please note that the dataset may only be used to reproduce the results reported in the paper; any commercial use of the data is prohibited.
- The dataset consists of the resized images, their corresponding masks, and BLIP and CLIP annotations.
- The training and inference code can be downloaded from the GitHub repo.
- Please cite our paper if you use the codebase and/or the dataset.
Referencing SCULPT
@inproceedings{SCULPT:CVPR:2024,
  title     = {{SCULPT}: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes},
  author    = {Sanyal, Soubhik and Ghosh, Partha and Yang, Jinlong and Black, Michael J. and Thies, Justus and Bolkart, Timo},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},
}