Ben Mildenhall et al., ECCV 2020
Reviewer: Kunho Kim
arXiv: https://arxiv.org/abs/2003.08934
NeRF achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. The algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (the spatial location $(x, y, z)$ and the view direction $(\theta, \phi)$) and whose output is the volume density and the view-dependent emitted radiance at that spatial location. NeRF synthesizes images from novel views by querying 5D coordinates along camera rays and leveraging classic volume rendering techniques to project the output colors and densities into an image.
Why NeRF?
Novel view image synthesis is a long-standing problem in computer vision and computer graphics, and its importance keeps growing as demand for interactive media applications increases.
Neural networks had already achieved great success in representing highly detailed 3D shapes, but these methods could not reproduce realistic images as well as discrete representations such as meshes or voxel grids. → In other words, there was no rendering technique for continuous geometry & radiance distributions.
What is NeRF?
NeRF represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (the spatial location ($x, y, z$) and the view direction ($\theta, \phi$)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
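A minimal sketch of such a network, assuming PyTorch; the class name `NeRFMLP` and the layer widths, depth, and activations are illustrative placeholders rather than the paper's exact architecture. One detail it does preserve from the paper: density depends only on the position $\bold{x}$, while the emitted color additionally conditions on the viewing direction $\bold{d}$.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Toy F_Theta: (x, d) -> (c, sigma). Widths and depth are illustrative."""
    def __init__(self, hidden=256):
        super().__init__()
        # Density depends only on position x, so the trunk sees x alone.
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)           # volume density
        # Color additionally conditions on the viewing direction d.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),     # RGB in [0, 1]
        )

    def forward(self, x, d):
        h = self.trunk(x)                                # (N, hidden)
        sigma = torch.relu(self.sigma_head(h))           # (N, 1), nonnegative
        c = self.color_head(torch.cat([h, d], dim=-1))   # (N, 3)
        return c, sigma
```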
To render a neural radiance field (NeRF) from a particular viewpoint, we:
1. March camera rays through the scene and sample a set of 3D points along each ray.
2. Feed those points and their corresponding viewing directions into the network to produce an output set of colors and densities.
3. Use classic volume rendering techniques to accumulate those colors and densities into a 2D image (the integral below).
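For reference, the volume rendering in step 3 corresponds to the expected-color integral from the paper, where $T(t)$ is the accumulated transmittance along the camera ray $\bold{r}(t) = \bold{o} + t\bold{d}$ between the near and far bounds $t_n$, $t_f$:

$$ C(\bold{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\bold{r}(t))\,\bold{c}(\bold{r}(t), \bold{d})\,dt, \quad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\bold{r}(s))\,ds\right) $$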
Because this process is naturally differentiable, we can use gradient descent to optimize the model by minimizing the error between each observed image and the corresponding view rendered from the NeRF representation.
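A sketch of that optimization signal, assuming PyTorch and the numerical quadrature the paper uses to approximate the integral above; `render_rays` is an illustrative name, `NeRFMLP` continues the toy network sketched earlier, and the ray tensors here are random placeholders standing in for real training data.

```python
import torch

def render_rays(model, pts, dirs, deltas):
    """Composite per-sample colors/densities into one color per ray.

    pts:    (R, S, 3) sample locations, R rays with S samples each
    dirs:   (R, 3)    unit viewing directions, shared by all samples on a ray
    deltas: (R, S)    distances between adjacent samples
    """
    R, S, _ = pts.shape
    d = dirs[:, None, :].expand(R, S, 3)
    c, sigma = model(pts.reshape(-1, 3), d.reshape(-1, 3))
    c, sigma = c.reshape(R, S, 3), sigma.reshape(R, S)

    alpha = 1.0 - torch.exp(-sigma * deltas)               # per-segment opacity
    # Exclusive cumulative product: T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], -1), -1
    )[:, :-1]
    weights = trans * alpha                                # (R, S)
    return (weights[..., None] * c).sum(dim=-2)            # (R, 3) expected color

# Every operation above is differentiable, so a plain photometric loss works:
model = NeRFMLP()                                          # toy network from above
pts = torch.randn(1024, 64, 3)                             # placeholder samples
dirs = torch.nn.functional.normalize(torch.randn(1024, 3), dim=-1)
deltas = torch.full((1024, 64), 0.01)
target_rgb = torch.rand(1024, 3)                           # observed pixel colors
loss = ((render_rays(model, pts, dirs, deltas) - target_rgb) ** 2).mean()
loss.backward()                                            # gradients flow into Theta
```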
Main Contribution
The continuous scene of interest can be represented as a 5D vector-valued function $F_\Theta$ whose input & output are:
- Input: a 3D location $\bold{x} = (x, y, z)$ and a 2D viewing direction $(\theta, \phi)$
- Output: an emitted color $\bold{c} = (r, g, b)$ and a volume density $\sigma$
Thus, the relation between input and output in terms of $F_\Theta$ is:
$$ F_\Theta: (\bold{x},\bold{d}) \rightarrow (\bold{c}, \sigma) $$
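In code, this relation is just the forward signature of the toy network sketched earlier, with the viewing direction already expressed as a 3D unit vector (as noted in the next section); the batch size is illustrative:

```python
import torch

model = NeRFMLP()                          # toy F_Theta sketched earlier
x = torch.randn(4096, 3)                   # batch of spatial locations
d = torch.nn.functional.normalize(torch.randn(4096, 3), dim=-1)  # unit view dirs
c, sigma = model(x, d)                     # colors (4096, 3), densities (4096, 1)
```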
NeRF Overview
The 5D input vector is obtained by casting a ray through the scene and sampling points along it. Also, in practice the viewing direction $(\theta, \phi)$ is replaced by a 3D unit vector $\bold{d}$ pointing in the same direction.
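A sketch of that ray casting and sampling, assuming a pinhole camera model; `get_rays`, `sample_along_rays`, `H`, `W`, `focal`, and the camera-to-world matrix `c2w` are placeholder names and inputs, and the paper's stratified sampling of depths is simplified here to uniform spacing.

```python
import torch

def get_rays(H, W, focal, c2w):
    """Cast one ray per pixel through a pinhole camera.

    c2w: (3, 4) camera-to-world matrix. Returns ray origins and
    unit directions d, each of shape (H*W, 3).
    """
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    # Pixel -> camera-space direction (y points down, camera looks along -z).
    dirs = torch.stack([(i - W * 0.5) / focal,
                        -(j - H * 0.5) / focal,
                        -torch.ones_like(i)], dim=-1)
    rays_d = dirs @ c2w[:3, :3].T                        # rotate into world space
    rays_d = torch.nn.functional.normalize(rays_d.reshape(-1, 3), dim=-1)
    rays_o = c2w[:3, 3].expand_as(rays_d)                # all rays share one origin
    return rays_o, rays_d

def sample_along_rays(rays_o, rays_d, near, far, n_samples):
    """Sample 3D points x = o + t*d at depths t in [near, far]."""
    t = torch.linspace(near, far, n_samples)             # (S,)
    pts = rays_o[:, None, :] + t[None, :, None] * rays_d[:, None, :]
    return pts, t                                        # (R, S, 3), (S,)

# Usage with an identity pose as a stand-in for a real camera:
c2w = torch.eye(4)[:3]
rays_o, rays_d = get_rays(100, 100, focal=50.0, c2w=c2w)
pts, t = sample_along_rays(rays_o, rays_d, near=2.0, far=6.0, n_samples=64)
```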