DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars

Abstract

DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person, offering intuitive control over both pose and expression.

We propose a diffusion-based neural renderer that leverages generic 2D priors to produce compelling images of faces. For coarse guidance of the expression and head pose, we render a neural parametric head model (NPHM) from the target viewpoint, which acts as a proxy geometry of the person. Additionally, to enhance the modeling of intricate facial expressions, we condition DiffusionAvatars directly on the expression codes obtained from NPHM via cross-attention. Finally, to synthesize consistent surface details across different viewpoints and expressions, we rig learnable spatial features to the head’s surface via TriPlane lookup in NPHM’s canonical space.

We train DiffusionAvatars on RGB videos and corresponding tracked NPHM meshes of a person and test the obtained avatars in both self-reenactment and animation scenarios. Our experiments demonstrate that DiffusionAvatars generates temporally consistent and visually appealing videos for novel poses and expressions of a person, outperforming existing approaches.

Avatar Animation via Expression Transfer

A DiffusionAvatar can be animated by expression transfer. We obtain the expression code z_exp by fitting NPHM to a video sequence of a different person. Our diffusion-based neural renderer then transfers the expression to the avatar.

Videos: source sequence and animated DiffusionAvatar.

Avatar Animation via NPHM

A DiffusionAvatar can also be controlled via the underlying NPHM. We obtain expression codes z_exp by interpolating between several target expressions. Using rasterization and our diffusion-based neural renderer, the expression codes are translated into a realistic avatar with viewpoint control.
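
As a rough illustration of how a sequence of expression codes z_exp could be produced from a few target expressions, the sketch below blends consecutive targets. It assumes plain linear interpolation and represents each code as a list of floats. The function names (lerp, interpolate_expressions) and the steps_per_segment parameter are hypothetical and do not come from the DiffusionAvatars code base.

```python
from typing import List, Sequence


def lerp(a: Sequence[float], b: Sequence[float], t: float) -> List[float]:
    """Linearly blend two expression codes; t=0 returns a, t=1 returns b."""
    if len(a) != len(b):
        raise ValueError("expression codes must have the same dimension")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolate_expressions(
    keyframes: Sequence[Sequence[float]], steps_per_segment: int
) -> List[List[float]]:
    """Return one expression code per frame, passing through every keyframe.

    Assumption: linear blending between consecutive target expressions.
    """
    if len(keyframes) < 2:
        raise ValueError("need at least two target expressions")
    if steps_per_segment < 1:
        raise ValueError("steps_per_segment must be at least 1")
    frames: List[List[float]] = []
    for start, end in zip(keyframes, keyframes[1:]):
        for i in range(steps_per_segment):
            frames.append(lerp(start, end, i / steps_per_segment))
    # Close the sequence with the last target expression itself.
    frames.append([float(v) for v in keyframes[-1]])
    return frames
```

Each resulting code would then be decoded into an NPHM mesh, rasterized from the chosen viewpoint, and passed to the neural renderer, as described in the method overview.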

Method Overview

DiffusionAvatars decodes an NPHM expression code z_exp in two ways to obtain a realistic image:

We first extract an NPHM mesh and rasterize it from the desired viewpoint, giving us canonical coordinates, depths, and normal renderings for the head mesh.
The canonical coordinates x_can are used to look up spatial features in a TriPlanes structure, rigging the features onto the mesh surface. Together with the rasterizer output, these mapped features form the input for the ControlNet part of DiffusionAvatars.
The second route for the expression code goes through a linear layer. It yields expression tokens that are subsequently used in a newly added cross-attention layer inside the pre-trained latent diffusion model. Intuitively, the rasterized inputs should encode pose, shape and rough expression while the direct expression conditioning hints at more detailed facial expressions.
The final image is synthesized by iteratively denoising Gaussian noise.
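
The TriPlane lookup described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation. It assumes canonical coordinates x_can normalized to [-1, 1], three axis-aligned feature planes stored as nested lists indexed [row][col][channel], bilinear sampling, and summation of the three sampled features. The aggregation rule, the plane keys ("xy", "xz", "yz"), and the function names are assumptions made for this example.

```python
from typing import Dict, List, Sequence

Plane = List[List[List[float]]]  # indexed as [row][col][channel]


def bilinear_sample(plane: Plane, u: float, v: float) -> List[float]:
    """Sample a feature plane at (u, v) in [-1, 1]^2 with bilinear interpolation."""
    height, width = len(plane), len(plane[0])
    # Map [-1, 1] to continuous pixel coordinates and clamp to the grid.
    col = min(max((u + 1.0) * 0.5 * (width - 1), 0.0), width - 1.0)
    row = min(max((v + 1.0) * 0.5 * (height - 1), 0.0), height - 1.0)
    c0, r0 = int(col), int(row)
    c1, r1 = min(c0 + 1, width - 1), min(r0 + 1, height - 1)
    fc, fr = col - c0, row - r0
    return [
        (1 - fr) * ((1 - fc) * plane[r0][c0][k] + fc * plane[r0][c1][k])
        + fr * ((1 - fc) * plane[r1][c0][k] + fc * plane[r1][c1][k])
        for k in range(len(plane[0][0]))
    ]


def triplane_lookup(planes: Dict[str, Plane], x_can: Sequence[float]) -> List[float]:
    """Project a canonical point onto the three planes and sum the features.

    Assumption: features are aggregated by summation.
    """
    x, y, z = x_can
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]
```

Because the lookup is indexed by canonical coordinates rather than pixel positions, the same surface point retrieves the same feature regardless of viewpoint or expression, which is what rigs the features to the head's surface. In training, the plane entries would be learnable parameters in a tensor library rather than Python lists.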

BibTeX

@inproceedings{kirschstein2024diffusionavatars,
  title={Diffusionavatars: Deferred diffusion for high-fidelity 3d head avatars},
  author={Kirschstein, Tobias and Giebenhain, Simon and Nie{\ss}ner, Matthias},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5481--5492},
  year={2024}
}
