NeRSemble: Multi-view Radiance Field Reconstruction of Human Heads

Technical University of Munich

NeRSemble takes multi-view video recordings of a person and allows rendering from arbitrary viewpoints.

(Video panels: RGB, depth, and deformation renderings.)

Abstract

We focus on reconstructing high-fidelity radiance fields of human heads, capturing their animations over time, and synthesizing re-renderings from novel viewpoints at arbitrary time steps.

To this end, we propose a new multi-view capture setup composed of 16 calibrated machine vision cameras that record time-synchronized images at 7.1 MP resolution and 73 frames per second. With our setup, we collect a new dataset of over 4700 high-resolution, high-framerate sequences of more than 220 human heads, from which we introduce a new human head reconstruction benchmark. The recorded sequences cover a wide range of facial dynamics, including head motions, natural expressions, emotions, and spoken language.

In order to reconstruct high-fidelity human heads, we propose Dynamic Neural Radiance Fields using Hash Ensembles (NeRSemble). We represent scene dynamics by combining a deformation field and an ensemble of 3D multi-resolution hash encodings. The deformation field allows for precise modeling of simple scene movements, while the ensemble of hash encodings helps to represent complex dynamics. As a result, we obtain radiance field representations of human heads that capture motion over time and facilitate re-rendering from arbitrary novel viewpoints. In a series of experiments, we explore the design choices of our method and demonstrate that our approach outperforms state-of-the-art dynamic radiance field approaches by a significant margin.

Video

Method Overview

NeRSemble represents a spatio-temporal radiance field for dynamic novel view synthesis using volume rendering (left). On the right, we show how NeRSemble obtains a density 𝜎(x) and a color value c(x, d) for a point x on a ray at time 𝑡:

  1. Given the deformation code 𝝎𝑡, the point x is warped to x′ = D(x, 𝝎𝑡) in the canonical space.
  2. The resulting point is used to query features H𝑖(x′) from the 𝑖-th hash grid in our ensemble.
  3. These per-grid features are blended using weights 𝛽𝑡. Note that both 𝝎𝑡 and 𝛽𝑡 contribute to explaining temporal changes.
  4. We predict the density 𝜎(x) and the view-dependent color c(x, d) from the blended features using an efficient rendering head consisting of two small MLPs.
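The four steps above can be sketched as a single point query. This is a minimal illustrative sketch, not the paper's implementation: the deformation field, the hash grids, and the rendering MLPs are replaced by toy stand-ins (random projections and simple activations), and all sizes (ensemble size K, feature dimension F, code length) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): ensemble size K, feature dim F.
K, F = 4, 8

def deform(x, omega_t):
    # Stand-in for the deformation field D(x, omega_t): warps x into
    # the canonical space. Here a toy offset derived from the code.
    return x + 0.01 * omega_t[:3]

# Toy stand-ins for the K multi-resolution hash grids H_i: each grid is
# replaced by a random linear map from position to an F-dim feature.
W = rng.normal(size=(K, 3, F))

def query_ensemble(x_canon):
    # H_i(x') for each grid in the ensemble -> (K, F) feature matrix.
    return np.stack([x_canon @ W[i] for i in range(K)])

def blend(features, beta_t):
    # Blend the K per-grid features with the timestep weights beta_t.
    return beta_t @ features  # shape (F,)

def render_head(feat, d):
    # Stand-in for the two small MLPs of the rendering head:
    # density from the blended features, color also from view dir d.
    sigma = np.maximum(feat.sum(), 0.0)          # non-negative density
    color = 1.0 / (1.0 + np.exp(-(feat[:3] + d)))  # sigmoid -> [0, 1]
    return sigma, color

# One sample point x on a ray at time t:
x = np.array([0.1, -0.2, 0.3])
d = np.array([0.0, 0.0, 1.0])        # view direction
omega_t = rng.normal(size=8)         # per-timestep deformation code
beta_t = np.full(K, 1.0 / K)         # per-timestep blend weights

x_canon = deform(x, omega_t)                     # step 1
feat = blend(query_ensemble(x_canon), beta_t)    # steps 2-3
sigma, color = render_head(feat, d)              # step 4
print(sigma, color)
```

In the actual method, 𝝎𝑡 and 𝛽𝑡 are optimized per timestep together with the grids and MLPs; the sketch only shows how the pieces compose for one point query.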

Contributions of Individual Hashtables

(Figure: per-timestep blend weight curves 𝛽𝑡,1, 𝛽𝑡,2, 𝛽𝑡,3, 𝛽𝑡,4, one per hash table.)

During training, the blend weights 𝛽𝑡 are optimized for each timestep 𝑡 such that the combined hash tables explain the observed expression.

At inference time, choosing different blend weights 𝛽𝑡 allows a degree of control over the person's expression.
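One simple way to exercise this control is to interpolate between blend weights optimized for two observed timesteps. The weight vectors below are made-up illustrative values, not numbers from the paper; the point is only that new expressions can be obtained by mixing weights without retraining.

```python
import numpy as np

# Hypothetical blend weights recovered for two observed expressions
# (illustrative values only, for a K = 4 hash table ensemble).
beta_a = np.array([0.70, 0.20, 0.05, 0.05])
beta_b = np.array([0.60, 0.05, 0.30, 0.05])

# Linear interpolation between the two weight vectors yields an
# intermediate expression at inference time.
alpha = 0.5
beta_mix = (1 - alpha) * beta_a + alpha * beta_b
print(beta_mix)  # -> 0.65, 0.125, 0.175, 0.05
```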

While the first hash table H1 learns a representation similar to the mean face of a person, the remaining hash tables H𝑖 add further expression-dependent details to the scene.

BibTeX

@article{kirschstein2023nersemble,
    author = {Kirschstein, Tobias and Qian, Shenhan and Giebenhain, Simon and Walter, Tim and Nie\ss{}ner, Matthias},
    title = {NeRSemble: Multi-View Radiance Field Reconstruction of Human Heads},
    year = {2023},
    issue_date = {August 2023},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    volume = {42},
    number = {4},
    issn = {0730-0301},
    url = {https://doi.org/10.1145/3592455},
    doi = {10.1145/3592455},
    journal = {ACM Trans. Graph.},
    month = {jul},
    articleno = {161},
    numpages = {14},
}