iHuman: Instant Animatable Digital Humans From Monocular Videos

ECCV 2024


Pramish Paudel1, Anubhav Khanal1, Danda Pani Paudel2,3,4, Jyoti Tandukar1, Ajad Chhatkuli2,3,4

1Pulchowk Campus, IOE, Tribhuvan University    2ETH Zurich    3NAAMII    4INSAIT

Abstract


iHuman reconstructs the human body in motion from a monocular video and body poses, in both 3D Gaussian splat and mesh representations.

Personalized 3D avatars require an animatable representation of digital humans. Doing so instantly from monocular videos offers scalability to a broad class of users and wide-scale applications. In this paper, we present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos. Our method utilizes the efficiency of Gaussian splatting to model both 3D geometry and appearance. However, we observed that naively optimizing Gaussian splats results in inaccurate geometry, thereby leading to poor animations. This work illustrates the need for accurate 3D mesh-type modelling of the human body for animatable digitization, and achieves it through Gaussian splats. This is done by developing a novel pipeline that benefits from three key aspects: (a) implicit modelling of surface displacements and the colors' spherical harmonics; (b) binding of 3D Gaussians to the respective triangular faces of the body template; (c) a novel technique to render normals followed by their auxiliary supervision. Our exhaustive experiments on three different benchmark datasets demonstrate the state-of-the-art results of our method under limited time settings. In fact, our method is faster by an order of magnitude (in terms of training time) than its closest competitor. At the same time, we achieve superior rendering and 3D reconstruction performance under pose changes.
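Below is a minimal sketch of aspect (b), binding one 3D Gaussian to each triangular face of the body template. It assumes a PyTorch-style tensor layout; the function and argument names (`bind_gaussians_to_faces`, `normal_offset`, etc.) are illustrative, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def bind_gaussians_to_faces(verts, faces, normal_offset):
    """Place one Gaussian per triangle of the body template (illustrative sketch).

    verts:         (V, 3) mesh vertices (canonical or posed)
    faces:         (F, 3) vertex indices of each triangle
    normal_offset: (F, 1) learned displacement along the face normal

    Returns per-Gaussian centers (F, 3) and local frames (F, 3, 3).
    """
    tri = verts[faces]                        # (F, 3, 3) triangle corners
    center = tri.mean(dim=1)                  # barycenter of each face

    # Local tangent frame of the face: one edge direction, the face normal,
    # and their cross product form an orthonormal basis per Gaussian.
    e1 = F.normalize(tri[:, 1] - tri[:, 0], dim=-1)
    n = F.normalize(torch.cross(tri[:, 1] - tri[:, 0],
                                tri[:, 2] - tri[:, 0], dim=-1), dim=-1)
    e2 = torch.cross(n, e1, dim=-1)
    rot = torch.stack([e1, e2, n], dim=-1)    # (F, 3, 3) rotation per Gaussian

    # Implicit surface displacement: shift the center along the face normal.
    center = center + normal_offset * n
    return center, rot
```

Because the Gaussians are tied to triangle geometry, posing the mesh automatically carries the splats along, which is what makes the representation animatable.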



Comparison of geometric fidelity with other methods


Fig: iHuman produces a high-fidelity mesh, capturing even subtle facial details such as hair and ears, within a 15-second computational budget.


Pipeline



Our method represents the human body in canonical space with Gaussians parameterized by 3D Gaussian centers x, rotations q, scales S, opacity αo, colors SH, skinning weights w, and the associated parent triangle ix. Given the body pose (θt) of the t-th frame as input, forward linear blend skinning transforms the canonical vertices v' to the posed-space vertices vp. The Gaussian center x is computed from the posed vertices vp of the parent triangle ix. The normal of the parent triangle ix is encoded into SH and rasterized to obtain the normal map I. We then apply a photometric loss and a normal-map loss to recover both geometry and color. The ground-truth normal map is obtained from the monocular RGB image (It) using the pix2pixHD [Wang et al., 2018] network.
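The sketch below illustrates the forward linear blend skinning step and the combined photometric plus normal-map objective described above. It is written against assumed PyTorch tensors; identifiers such as `forward_lbs`, `training_loss`, and the weight `lambda_n` are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def forward_lbs(verts_canonical, skin_weights, joint_transforms):
    """Forward linear blend skinning: canonical vertices v' -> posed vertices vp.

    verts_canonical:  (V, 3) canonical template vertices
    skin_weights:     (V, J) skinning weights w (each row sums to 1)
    joint_transforms: (J, 4, 4) per-joint rigid transforms for pose theta_t
    """
    V = verts_canonical.shape[0]
    homo = torch.cat([verts_canonical,
                      verts_canonical.new_ones(V, 1)], dim=-1)       # (V, 4) homogeneous
    T = torch.einsum('vj,jab->vab', skin_weights, joint_transforms)  # blended transforms
    return torch.einsum('vab,vb->va', T, homo)[:, :3]                # posed vertices vp

def training_loss(rgb_pred, rgb_gt, normal_pred, normal_gt, lambda_n=0.1):
    """Photometric loss plus auxiliary normal-map supervision (lambda_n is illustrative)."""
    return F.l1_loss(rgb_pred, rgb_gt) + lambda_n * F.l1_loss(normal_pred, normal_gt)
```

Here `rgb_pred` and `normal_pred` stand in for the rasterized outputs of the Gaussian-splatting renderer, and `normal_gt` for the pix2pixHD-predicted normal map used as supervision.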


Novel Pose Synthesis




Citation


@inproceedings{pramishp2024iHuman,
  title={iHuman: Instant Animatable Digital Humans From Monocular Videos},
  author={Paudel, Pramish and Khanal, Anubhav and Chhatkuli, Ajad and Paudel, Danda Pani and Tandukar, Jyoti},
  booktitle={ECCV},
  year={2024}
}