I implemented a simplified version of Disney's architecture (Kernel-Predicting CNN). By “simplified”, I mean: no albedo demodulation, no diffuse/specular separation, and no extra statistics in the input (e.g. local variance) or data preprocessing. This network works on data aggregated per pixel and predicts a local 21x21 kernel at each pixel. This kernel is used to estimate the pixel's RGB value from the local neighborhood's RGB values, with identical weights for R, G, and B. It produces reasonable outputs and converges much quicker than the direct version.
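As a minimal numpy sketch of this reconstruction step (function name and layout are my own; the real network predicts the kernels, here they are simply given as input):

```python
import numpy as np

def apply_predicted_kernels(radiance, kernels):
    """Reconstruct each pixel from its local neighborhood using a
    predicted per-pixel kernel, with identical weights for R, G, B.

    radiance: [h, w, 3] noisy per-pixel RGB
    kernels:  [h, w, k*k] predicted weights, assumed normalized per pixel
    """
    h, w, _ = radiance.shape
    k = int(np.sqrt(kernels.shape[-1]))  # e.g. 21
    r = k // 2
    padded = np.pad(radiance, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.zeros_like(radiance)
    for dy in range(k):
        for dx in range(k):
            # weight of neighbor (dy, dx) for every pixel, shared across RGB
            wgt = kernels[:, :, dy * k + dx][:, :, None]
            out += wgt * padded[dy:dy + h, dx:dx + w, :]
    return out
```

In the actual model the kernel weights would come from the network's last layer, typically softmax-normalized so each pixel's weights sum to one.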

All the results shown are trained on 19 scenes with 4096 samples per pixel, direct sampling only, with 5 bounces; visible emitters are not rendered. Materials are solid-color, diffuse only. Data augmentation includes independent random scaling of the RGB radiance components with scale in \([0,2]\), random flips, and random 90° rotations.
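The augmentation pipeline described above can be sketched as follows (a numpy sketch under my own assumptions about layout; the actual training code may differ):

```python
import numpy as np

def augment(img, rng):
    """Augment one training image, assumed [h, w, 3]:
    independent random scaling of R, G, B in [0, 2],
    random flips, and a random multiple-of-90-degree rotation."""
    img = img * rng.uniform(0.0, 2.0, size=(1, 1, 3))  # per-channel scale
    if rng.random() < 0.5:
        img = img[:, ::-1, :]   # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :, :]   # vertical flip
    img = np.rot90(img, k=int(rng.integers(0, 4)), axes=(0, 1))
    return img
```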

I use three types of input data:

- Basic: R, G, B per-pixel values
- Extended: R, G, B + screen-space normals + depth
- ExtendedSamples: 8 samples per pixel, each with R, G, B + normal + depth at the first bounce, + probabilities at each bounce (5 numbers, log(1+p) transform)
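For the per-sample variant, one sample's feature vector can be assembled like this (a sketch with a hypothetical helper name; the channel counts follow the description above):

```python
import numpy as np

def sample_features(rgb, normal, depth, probs):
    """Assemble one sample's features, assuming the layout above:
    RGB (3) + first-bounce normal (3) + depth (1)
    + per-bounce probabilities (5, log(1+p) compressed) = 12 channels."""
    return np.concatenate([rgb, normal, [depth], np.log1p(probs)])
```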

I am still rendering more synthetic scenes; we have around 400 for now. Rendering takes 20 min/scene, mostly limited by disk access.

*KPCNN vs. Direct, pixel-based network*: KPCNN converges much faster. On boundaries with high dynamic range, KPCNN performs much better than a direct network. The direct network may be trying to apply distinct transformations to distinct intervals of the dynamic range, or the \(L_2\) loss may be to blame.

*RGB only vs. RGB + depth + screen-space normals*: As expected, providing depth and normals avoids over-smoothing and generally yields better geometric discontinuities.

*KPCNN, \(L_2\) vs. \(L_1\) loss*:

I experimented with a few ways to leverage the individual samples. Inspired by KPCNN, and by the issues of direct networks, I tried predicting weights to modulate sample contributions. That is, given an \([h, w, s, c]\) array of samples, I predict an \([h, w, s\times k\times k]\) array of weights (kernel size \(k=21\), \(s=8\) samples, \(c=3+3+1+5=12\) channels).

Instead of the usual Monte-Carlo averaging:

\[I=\frac{1}{N}\sum_{i=1}^{N}\frac{f_i}{p_i}\]

we reweight the samples and pool information from a local 21x21 neighborhood \(\mathcal{N}\):

\[I=\sum_{x\in\mathcal{N}}\sum_{i=1}^{N}w_{x,i}\frac{f_{x,i}}{p_{x,i}}\]

With \[\sum_{x\in\mathcal{N}}\sum_{i=1}^{N}w_{x, i} = 1\]
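This weighted reconstruction can be sketched in numpy as follows (a single-channel sketch with an assumed \([h, w, s]\) layout for the per-sample \(f/p\) ratios; real samples carry 12 feature channels):

```python
import numpy as np

def weighted_reconstruction(ratios, weights, k):
    """ratios:  [h, w, s] per-sample f/p values (one channel for brevity)
    weights: [h, w, s*k*k] predicted weights; normalized here so that,
    for each output pixel, the sum over the k x k neighborhood and
    all s samples is 1.
    """
    h, w, s = ratios.shape
    r = k // 2
    padded = np.pad(ratios, ((r, r), (r, r), (0, 0)), mode="edge")
    weights = weights.reshape(h, w, k, k, s)
    # enforce: sum over (neighborhood x samples) equals 1 per pixel
    weights = weights / weights.sum(axis=(2, 3, 4), keepdims=True)
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            out += (weights[:, :, dy, dx, :] *
                    padded[dy:dy + h, dx:dx + w, :]).sum(axis=-1)
    return out
```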

One issue: the output space is very high-dimensional. KPCNN already predicts \(21\times21 = 441\) numbers per pixel; we multiply this by 8. So it is possible that we need to drastically increase the network capacity.

Generally, convergence is much slower for our sample-based models. I describe a few attempts below. All nets use 5 layers of \(5\times5\) conv+ReLU, with 128 filters each.

The network with a fully-connected feature embedding at the sample level has not converged yet.

This model stacks samples and treats them as extra “channels”. That is, our input is an image with shape \([h, w, s\times c]\). This model predicts pixel-wise weights (only 441 numbers); that is, the RGB values of the samples are averaged per pixel and cannot be modulated independently.

KPDumbSamples has a slightly lower MSE (compared to the per-pixel net).

It also gives better shadow boundaries.

This model computes the mean and variance of the sample features to obtain per-pixel values, applies convolutions to propagate this information spatially, then predicts a scalar per-sample quality measure ("belief"). Beliefs are softmax-normalized over a small region: this corresponds to a 3D kernel, weighting each sample in a local window's contribution to the final reconstruction of the center pixel.
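A numpy sketch of the softmax normalization over the local 3D window (function name and layout are my own assumptions; the beliefs would come from the network):

```python
import numpy as np

def belief_kernel(beliefs, k):
    """beliefs: [h, w, s] scalar per-sample 'belief' logits.
    Returns, for each output pixel, softmax-normalized weights over the
    k x k x s window of neighborhood samples (the 3D kernel)."""
    h, w, s = beliefs.shape
    r = k // 2
    padded = np.pad(beliefs, ((r, r), (r, r), (0, 0)), mode="edge")
    # gather the k*k*s logits contributing to each output pixel
    logits = np.stack([padded[dy:dy + h, dx:dx + w, :]
                       for dy in range(k) for dx in range(k)], axis=2)
    logits = logits.reshape(h, w, k * k * s)
    logits = logits - logits.max(axis=-1, keepdims=True)  # stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)
```

Note the network only outputs one scalar per sample; the full 3D kernel is induced by the normalization, not predicted directly.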

The "belief" approach is an attempt to avoir the \(21*21*8=3538\) network output if we wanted to predict a weight per neighborhood sample directly. We should nonetheless add this expensive comparison.

This is another attempt to avoid the high-dimensional output. We compute the weights of the samples within each pixel (using spatially-propagated information), and a grayscale spatial kernel. This effectively enforces separability of the 3D sample-weighting kernel along the sample and spatial dimensions.
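One plausible reading of this factorization, sketched in numpy (names and normalization are my own assumptions): the outer product of per-pixel sample weights and a spatial kernel recovers a full 3D kernel from only \(s + k\times k\) predicted numbers per pixel (e.g. \(8 + 441 = 449\)) instead of \(s\times k\times k = 3528\).

```python
import numpy as np

def separable_kernel(sample_w, spatial_w):
    """sample_w:  [h, w, s]   per-pixel weights over that pixel's samples
    spatial_w: [h, w, k*k] grayscale spatial kernel.
    The outer product yields a rank-1 (separable) [h, w, k*k, s]
    3D sample-weighting kernel, normalized to sum to 1 per pixel."""
    full = spatial_w[:, :, :, None] * sample_w[:, :, None, :]
    return full / full.sum(axis=(2, 3), keepdims=True)
```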

Results are not convincing with this approach: outputs are much noisier, the geometry is not captured well.

I will revisit the RNN idea, as it could save some processing by building the final output incrementally.