FlexAvatar
Learning Complete 3D Head Avatars with Partial Supervision

Technical University of Munich
FlexAvatar creates high-quality and complete 3D head avatars from as little as a single input image.

Interactive 3D Avatar Viewer


Abstract

We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations.

In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation, whereas FlexAvatar generates complete 3D head avatars with realistic facial animations.

Video

Method Overview

FlexAvatar uses an encoder-decoder architecture to first encode a single image into a latent avatar code, which is then decoded into a 3D Gaussian Splatting representation for rendering. Once trained, the model provides control over both viewpoint π and facial expression zexp.

  1. The transformer-based encoder E first produces a compressed avatar code A via cross-attention.
  2. The decoder D then incorporates the effect of the facial expression zexp into the avatar representation.
  3. Crucially, the corresponding bias sink is concatenated to the expression tokens: z2D if the input image I comes from a monocular dataset, and z3D if it comes from a multi-view dataset.
  4. Finally, the upsampled avatar code is decoded into the 3D Gaussian attributes for rendering.
During training, the bias sinks absorb data-modality-specific biases, such as the entanglement of driving expression and target viewpoint in monocular datasets. At inference time, only z3D is used, so the model inherits the disentangled behavior of multi-view data and yields head avatars that are both well-generalized and complete in 3D.
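
To make the data flow concrete, the sketch below shows a minimal PyTorch version of this forward pass. All module names, token counts, dimensions, and the per-Gaussian attribute layout (AvatarEncoder, AvatarDecoder, bias_sink_2d / bias_sink_3d, 14 attributes per Gaussian) are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a FlexAvatar-style forward pass (assumed sizes and names).
import torch
import torch.nn as nn

class AvatarEncoder(nn.Module):
    """Cross-attends learnable avatar queries onto image patch tokens (encoder E)."""
    def __init__(self, dim=256, num_avatar_tokens=64):
        super().__init__()
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)            # toy patch embedding
        self.avatar_queries = nn.Parameter(torch.randn(num_avatar_tokens, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, image_patches):                             # (B, N, 3*16*16)
        tokens = self.patch_embed(image_patches)                  # (B, N, dim)
        q = self.avatar_queries.expand(tokens.size(0), -1, -1)    # (B, T, dim)
        avatar_code, _ = self.cross_attn(q, tokens, tokens)       # compressed avatar code A
        return avatar_code

class AvatarDecoder(nn.Module):
    """Injects the expression code and a data-source bias sink, then predicts
    per-Gaussian attributes (decoder D); the attribute layout here is assumed."""
    def __init__(self, dim=256, expr_dim=64, gaussians_per_token=32):
        super().__init__()
        self.expr_proj = nn.Linear(expr_dim, dim)
        # Learnable data-source tokens: z2D for monocular, z3D for multi-view data.
        self.bias_sink_2d = nn.Parameter(torch.randn(1, 1, dim))
        self.bias_sink_3d = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # Assumed 14 attributes per Gaussian: 3 pos + 4 rot + 3 scale + 1 opacity + 3 color.
        self.to_gaussians = nn.Linear(dim, gaussians_per_token * 14)

    def forward(self, avatar_code, z_exp, source="3d"):
        B = avatar_code.size(0)
        expr_tok = self.expr_proj(z_exp).unsqueeze(1)             # (B, 1, dim)
        sink = self.bias_sink_3d if source == "3d" else self.bias_sink_2d
        # The bias sink is concatenated to the expression tokens as conditioning.
        memory = torch.cat([expr_tok, sink.expand(B, -1, -1)], dim=1)
        feats = self.decoder(avatar_code, memory)                 # (B, T, dim)
        return self.to_gaussians(feats).view(B, -1, 14)           # 3D Gaussian attributes

# Usage: training selects the sink matching the data source; inference always uses z3D.
enc, dec = AvatarEncoder(), AvatarDecoder()
patches = torch.randn(2, 196, 3 * 16 * 16)                        # dummy image patches
z_exp = torch.randn(2, 64)                                        # dummy expression code
gaussians = dec(enc(patches), z_exp, source="3d")                 # (2, 64*32, 14)

The Gaussian attributes would then be rasterized with a standard 3D Gaussian Splatting renderer under the chosen viewpoint π; that rendering step is omitted here.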

BibTeX

@article{kirschstein2025flexavatar,
  title={FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision},
  author={Kirschstein, Tobias and Giebenhain, Simon and Nie{\ss}ner, Matthias},
  journal={arXiv preprint arXiv:2512.15599},
  year={2025}
}