2017-06-26

Disney reference, weighted samples reconstruction

I implemented a simplified version of Disney's architecture (Kernel-Predicting CNN). By “simplified”, I mean: no albedo demodulation, no diffuse/specular separation, and no extra statistics in the input (e.g. local variance) or data preprocessing. This network works on data aggregated per pixel and predicts a local 21x21 kernel at each pixel. This kernel is used to estimate the pixel's RGB value from the local neighborhood's RGB values, with identical weights for R, G, and B. It produces reasonable outputs, and converges much more quickly than the direct-prediction version.
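
To make the reconstruction step concrete, here is a minimal sketch of how a predicted per-pixel kernel can be applied to the noisy RGB input. PyTorch, the function names, and the softmax normalization of the weights are my assumptions, not necessarily what KPCNN does internally.

```python
import torch
import torch.nn.functional as F

def apply_pixel_kernels(radiance, kernels, k=21):
    """Reconstruct each pixel as a weighted sum of its k x k RGB neighborhood.

    radiance: [B, 3, H, W] noisy per-pixel RGB.
    kernels:  [B, k*k, H, W] predicted weights, shared across R, G, B.
    """
    # Normalize the kernel at each pixel (softmax is one possible choice).
    weights = F.softmax(kernels, dim=1)                   # [B, k*k, H, W]
    # Gather the k x k neighborhood around every pixel.
    b, _, h, w = radiance.shape
    patches = F.unfold(radiance, k, padding=k // 2)       # [B, 3*k*k, H*W]
    patches = patches.view(b, 3, k * k, h, w)
    # Weighted sum over the neighborhood, identical weights for each channel.
    return (patches * weights.unsqueeze(1)).sum(dim=2)    # [B, 3, H, W]
```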

8spp input
KPCNN output

Datasets

All the results shown are trained on 19 scenes rendered at 4096 samples per pixel, direct-sampling only, with 5 bounces; visible emitters are not rendered. Materials are solid-color, diffuse only. Data augmentation includes independent random scaling of the RGB radiance components with scale in \([0,2]\), random flips, and random 90-degree rotations.
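
A minimal sketch of that augmentation on a single image, assuming numpy arrays; whether the flips are horizontal, vertical, or both is not recorded here, so both are shown.

```python
import numpy as np

def augment(radiance, rng):
    """Augment one [H, W, 3] radiance image (auxiliary buffers would presumably
    get the same flips/rotations, but not the color scaling)."""
    radiance = radiance * rng.uniform(0.0, 2.0, size=3)   # independent RGB scaling in [0, 2]
    if rng.random() < 0.5:
        radiance = radiance[:, ::-1]                      # horizontal flip
    if rng.random() < 0.5:
        radiance = radiance[::-1, :]                      # vertical flip
    return np.rot90(radiance, k=int(rng.integers(4)))     # random 90-degree rotation
```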

I use three types of input data:

  1. Basic: R, G, B per pixel value

  2. Extended: R, G, B + screen-space normals + depth

  3. ExtendedSamples: 8 samples per pixel, each with R, G, B + normal + depth at the first bounce, + the sampling probabilities at each bounce (5 numbers, log(1+p) transform); see the sketch below.
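
A sketch of how the per-sample features might be assembled; the array layout, names, and channel ordering are illustrative assumptions.

```python
import numpy as np

def sample_features(rgb, normal, depth, probs):
    """Stack per-sample features into a [H, W, S, 12] array.

    rgb:    [H, W, S, 3] sample radiance
    normal: [H, W, S, 3] normal at the first bounce
    depth:  [H, W, S, 1] depth at the first bounce
    probs:  [H, W, S, 5] sampling probabilities, one per bounce
    """
    # log(1 + p) transform on the probabilities, then concatenate along channels.
    return np.concatenate([rgb, normal, depth, np.log1p(probs)], axis=-1)
```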

I am still rendering more synthetic scenes; we have around 400 for now. Rendering takes 20 min/scene, mostly limited by disk access.

Baselines (per-pixel)

Direct prediction vs. KPCNN: KPCNN is much better than direct prediction around HDR boundaries.
RGB only vs. RGB+depth+normals: adding extra features helps with geometry discontinuities and prevents over-smoothing.
L1 (right) is slightly better and reduces low-frequency oscillations compared to L2 (left).

Sample-based networks (ours)

I experimented with a few ways to leverage the individual samples. Inspired by KPCNN, and by the issues of the direct networks, I try to predict weights that modulate sample contributions. That is, given an \([h, w, s, c]\) array of samples, I predict an \([h, w, s\times k\times k]\) array of weights (kernel size \(k=21\), \(s=8\) samples, \(c=3+3+1+5=12\) channels).

Instead of the usual Monte-Carlo averaging:

\[I=\frac{1}{N}\sum_{i=1}^{N}\frac{f_i}{p_i}\]

we reweight the samples and pool information from a local 21x21 neighborhood \(\mathcal{N}\):

\[I=\sum_{x\in\mathcal{N}}\sum_{i=1}^{N}w_{x,i}\frac{f_{x,i}}{p_{x,i}}\]

with the normalization constraint \[\sum_{x\in\mathcal{N}}\sum_{i=1}^{N}w_{x,i} = 1\]
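
A sketch of this reconstruction; PyTorch, the channel-first weight layout, and the softmax used to enforce the normalization constraint are my assumptions.

```python
import torch
import torch.nn.functional as F

def reweight_samples(fp, weights, k=21):
    """I = sum over neighbors x and samples i of w_{x,i} * f_{x,i} / p_{x,i}.

    fp:      [B, S, 3, H, W] per-sample radiance estimates f / p.
    weights: [B, S*k*k, H, W] predicted weights for each output pixel.
    """
    b, s, _, h, w = fp.shape
    # Normalize jointly over samples and the k x k window so the weights sum to 1.
    weights = F.softmax(weights, dim=1).view(b, s, 1, k * k, h, w)
    # Neighborhood of every sample's RGB estimate.
    patches = F.unfold(fp.reshape(b * s, 3, h, w), k, padding=k // 2)
    patches = patches.view(b, s, 3, k * k, h, w)
    # Weighted sum over samples and spatial neighbors.
    return (patches * weights).sum(dim=(1, 3))            # [B, 3, H, W]
```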

One issue: the output space is very high-dimensional. KPCNN already predicts \(21\times21 = 441\) numbers per pixel, and predicting one weight per sample multiplies this by 8. So it is possible that we need to drastically increase the network capacity.

Generally, convergence is much slower for our sample-based models. I describe a few attempts below. All nets use 5 layers of \(5\times5\) conv+relu, with 128 filters each.
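
A sketch of that shared trunk; PyTorch and the final 1x1 output head are assumptions (the notes only specify the five 5x5 conv+relu layers with 128 filters).

```python
import torch.nn as nn

def base_trunk(in_channels, out_channels, width=128, depth=5):
    """`depth` layers of 5x5 conv + ReLU with `width` filters, plus an assumed
    1x1 conv head mapping to the predicted weights."""
    layers, c = [], in_channels
    for _ in range(depth):
        layers += [nn.Conv2d(c, width, kernel_size=5, padding=2), nn.ReLU()]
        c = width
    layers.append(nn.Conv2d(c, out_channels, kernel_size=1))
    return nn.Sequential(*layers)
```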

The network with a fully-connected feature embedding at the sample level has not converged yet.

KPDumbSamples

This model stacks the samples and treats them as extra “channels”; that is, the input is an image with shape \([h, w, s\times c]\). The model predicts pixel-wise weights (only 441 numbers), meaning the RGB values of the samples are averaged per pixel and cannot be modulated independently.
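
In terms of the trunk and kernel application sketched above, the wiring would look roughly like this (variable names and shapes are assumptions):

```python
# KPDumbSamples (sketch): samples stacked along channels, one grayscale
# 21x21 kernel predicted per pixel and applied to the per-pixel mean RGB.
s, c, k = 8, 12, 21
net = base_trunk(in_channels=s * c, out_channels=k * k)   # input: [B, s*c, H, W]
# kernels = net(samples.reshape(B, s * c, H, W))          # [B, k*k, H, W]
# output  = apply_pixel_kernels(rgb_mean, kernels, k)     # rgb_mean: [B, 3, H, W]
```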

KPDumbSamples architecture

KPDumbSamples has a slightly lower MSE than the per-pixel network and produces better shadow boundaries.

BeliefSamples

This model computes the mean and variance of the sample features to obtain per-pixel values, applies convolutions to propagate this information spatially, then predicts a scalar quality measure per sample ("belief"). Beliefs are softmax-normalized over a small region: this corresponds to a 3D kernel, weighting each sample in a local window by its contribution to the final reconstruction of the center pixel.

The "belief" approach is an attempt to avoid the \(21\times21\times8=3528\) network outputs per pixel that would be needed to predict a weight per neighborhood sample directly. We should nonetheless add this expensive comparison.
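
A sketch of the belief-based reconstruction, under the same assumptions as the earlier snippets (PyTorch, softmax normalization, my own names); the network only has to output one scalar per sample instead of \(k\times k\) weights per sample.

```python
import torch
import torch.nn.functional as F

def belief_reconstruction(fp, beliefs, k=21):
    """Form a 3D kernel from per-sample beliefs gathered over a k x k window.

    fp:      [B, S, 3, H, W] per-sample radiance estimates f / p.
    beliefs: [B, S, H, W] predicted per-sample beliefs.
    """
    b, s, _, h, w = fp.shape
    # Beliefs and radiance of every sample in the neighborhood of each pixel.
    nb_beliefs = F.unfold(beliefs, k, padding=k // 2).view(b, s * k * k, h, w)
    nb_fp = F.unfold(fp.reshape(b * s, 3, h, w), k, padding=k // 2)
    nb_fp = nb_fp.view(b, s, 3, k * k, h, w)
    # Softmax over all (sample, neighbor) pairs contributing to the center pixel.
    weights = F.softmax(nb_beliefs, dim=1).view(b, s, 1, k * k, h, w)
    return (nb_fp * weights).sum(dim=(1, 3))              # [B, 3, H, W]
```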

BeliefSamples architecture
SeparableWeightsSamples
SeparableWeightsSamples architecture

This is another attempt to avoid the high-dimensional output. We compute weights for the samples within each pixel (using spatially-propagated information), together with a grayscale spatial kernel. This effectively enforces separability of the 3D sample-weighting kernel along the sample and spatial dimensions.
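
A sketch of the separable reconstruction, reusing the per-pixel kernel application from the first snippet; layouts and the softmax normalizations are again assumptions.

```python
import torch
import torch.nn.functional as F

def separable_reconstruction(fp, sample_w, spatial_w, k=21):
    """3D kernel = (weights over each pixel's samples) x (grayscale spatial kernel).

    fp:        [B, S, 3, H, W] per-sample radiance estimates f / p.
    sample_w:  [B, S, H, W]    predicted weights over the samples of each pixel.
    spatial_w: [B, k*k, H, W]  predicted spatial kernel at each pixel.
    """
    # Collapse the samples of every pixel with their normalized weights...
    sample_w = F.softmax(sample_w, dim=1).unsqueeze(2)    # [B, S, 1, H, W]
    per_pixel = (fp * sample_w).sum(dim=1)                # [B, 3, H, W]
    # ...then pool spatially with the grayscale kernel (apply_pixel_kernels
    # from the first sketch normalizes spatial_w internally).
    return apply_pixel_kernels(per_pixel, spatial_w, k)
```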

Results with this approach are not convincing: outputs are much noisier and the geometry is not captured well.

Reference 4096spp
KPCNN (per-pixel input)
Separable 3D weights on samples

RNN

I will revisit the RNN idea, as it could be a way to save some processing by building the final output incrementally.