NeRSemble: Multi-view Radiance Field Reconstruction of Human Heads

Technical University of Munich

NeRSemble takes multi-view video recordings of a person and allows rendering from arbitrary viewpoints.

(Video panels: RGB, depth, and deformation renderings.)

Abstract

We focus on reconstructing high-fidelity radiance fields of human heads, capturing their animations over time, and synthesizing re-renderings from novel viewpoints at arbitrary time steps.

To this end, we propose a new multi-view capture setup composed of 16 calibrated machine vision cameras that record time-synchronized images at 7.1 MP resolution and 73 frames per second. With our setup, we collect a new dataset of over 4700 high-resolution, high-framerate sequences of more than 220 human heads, from which we introduce a new human head reconstruction benchmark. The recorded sequences cover a wide range of facial dynamics, including head motions, natural expressions, emotions, and spoken language.

In order to reconstruct high-fidelity human heads, we propose Dynamic Neural Radiance Fields using Hash Ensembles (NeRSemble). We represent scene dynamics by combining a deformation field and an ensemble of 3D multi-resolution hash encodings. The deformation field allows for precise modeling of simple scene movements, while the ensemble of hash encodings helps to represent complex dynamics. As a result, we obtain radiance field representations of human heads that capture motion over time and facilitate re-rendering from arbitrary novel viewpoints. In a series of experiments, we explore the design choices of our method and demonstrate that our approach outperforms state-of-the-art dynamic radiance field approaches by a significant margin.

Video

Method Overview

NeRSemble represents a spatio-temporal radiance field for dynamic novel view synthesis using volume rendering (left). On the right, we show how NeRSemble obtains a density 𝜎(x) and a color value c(x, d) for a point x on a ray at time 𝑡:

  1. Given the deformation code 𝝎𝑡, the point x is warped to x′ = D(x, 𝝎𝑡) in the canonical space.
  2. The resulting point is used to query features H𝑖(x′) from the 𝑖-th hash grid in our ensemble.
  3. These per-grid features are blended using weights 𝛽𝑡. Note that both 𝝎𝑡 and 𝛽𝑡 contribute to explaining temporal changes.
  4. We predict the density 𝜎(x) and the view-dependent color c(x, d) from the blended features using an efficient rendering head consisting of two small MLPs.
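The four steps above can be sketched as a single point query. This is a minimal illustrative sketch, not the paper's implementation: the deformation field, the hash grids, and the rendering MLPs are replaced by toy stand-ins (random projections and simple activations), and all sizes (ensemble size K, feature dimension F, code length) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): ensemble size K, feature dim F.
K, F = 4, 8

def deform(x, omega_t):
    # Stand-in for the deformation field D(x, omega_t): warps x into
    # the canonical space. Here a toy offset derived from the code.
    return x + 0.01 * omega_t[:3]

# Toy stand-ins for the K multi-resolution hash grids H_i: each grid is
# replaced by a random linear map from position to an F-dim feature.
W = rng.normal(size=(K, 3, F))

def query_ensemble(x_canon):
    # H_i(x') for each grid in the ensemble -> (K, F) feature matrix.
    return np.stack([x_canon @ W[i] for i in range(K)])

def blend(features, beta_t):
    # Blend the K per-grid features with the timestep weights beta_t.
    return beta_t @ features  # shape (F,)

def render_head(feat, d):
    # Stand-in for the two small MLPs of the rendering head:
    # density from the blended features, color also from view dir d.
    sigma = np.maximum(feat.sum(), 0.0)          # non-negative density
    color = 1.0 / (1.0 + np.exp(-(feat[:3] + d)))  # sigmoid -> [0, 1]
    return sigma, color

# One sample point x on a ray at time t:
x = np.array([0.1, -0.2, 0.3])
d = np.array([0.0, 0.0, 1.0])        # view direction
omega_t = rng.normal(size=8)         # per-timestep deformation code
beta_t = np.full(K, 1.0 / K)         # per-timestep blend weights

x_canon = deform(x, omega_t)                     # step 1
feat = blend(query_ensemble(x_canon), beta_t)    # steps 2-3
sigma, color = render_head(feat, d)              # step 4
print(sigma, color)
```

In the actual method, 𝝎𝑡 and 𝛽𝑡 are optimized per timestep together with the grids and MLPs; the sketch only shows how the pieces compose for one point query.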

Contributions of Individual Hashtables

(Figure: per-timestep blend weight curves 𝛽𝑡,1, 𝛽𝑡,2, 𝛽𝑡,3, 𝛽𝑡,4, one per hash table.)

During training, the blend weights 𝛽𝑡 are optimized for each timestep 𝑡 such that the combined hash tables explain the observed expression.

At inference time, choosing different blend weights 𝛽𝑡 allows a degree of control over the person's expression.
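One simple way to exercise this control is to interpolate between blend weights optimized for two observed timesteps. The weight vectors below are made-up illustrative values, not numbers from the paper; the point is only that new expressions can be obtained by mixing weights without retraining.

```python
import numpy as np

# Hypothetical blend weights recovered for two observed expressions
# (illustrative values only, for a K = 4 hash table ensemble).
beta_a = np.array([0.70, 0.20, 0.05, 0.05])
beta_b = np.array([0.60, 0.05, 0.30, 0.05])

# Linear interpolation between the two weight vectors yields an
# intermediate expression at inference time.
alpha = 0.5
beta_mix = (1 - alpha) * beta_a + alpha * beta_b
print(beta_mix)  # -> 0.65, 0.125, 0.175, 0.05
```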

While the first hash table H1 learns a representation similar to the mean face of a person, the remaining hash tables H𝑖 add further expression-dependent details to the scene.

BibTeX

@article{kirschstein2023nersemble,
    author = {Kirschstein, Tobias and Qian, Shenhan and Giebenhain, Simon and Walter, Tim and Nie\ss{}ner, Matthias},
    title = {NeRSemble: Multi-View Radiance Field Reconstruction of Human Heads},
    year = {2023},
    issue_date = {August 2023},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    volume = {42},
    number = {4},
    issn = {0730-0301},
    url = {https://doi.org/10.1145/3592455},
    doi = {10.1145/3592455},
    journal = {ACM Trans. Graph.},
    month = {jul},
    articleno = {161},
    numpages = {14},
}