- Predicting rotation, shear, and scale gives us better results than the per-pixel net, both visually:

*Figure: comparison of the 4spp input, ground truth, ours, ours without transform, and pixels only.*

and numerically:

| | pixel | no transform | ours |
|---|---|---|---|
| Loss (sum of abs. value + gradient) | 9.4 | 6.2 | 5.3 |

\(L_2\) loss was insufficient: uniformly blurring the image still gives good numbers. I used: \[\mathcal{L} = |I(x)|_1 + \lambda|\nabla I(x)|_1\] with \(\lambda=0.1\). The image gradients are computed by simple finite differences, after Gaussian smoothing with \(\sigma=1\).
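A minimal NumPy sketch of this loss, assuming \(I(x)\) is the per-pixel prediction error and the smoothing/finite-difference details (kernel truncation, boundary handling) are my own choices:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius=3):
    """Truncated 1D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth(img, sigma=1.0):
    """Separable Gaussian smoothing via 1D convolutions along rows, then columns."""
    k = gaussian_kernel1d(sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out

def l1_grad_loss(pred, target, lam=0.1, sigma=1.0):
    """L1 loss plus lambda * L1 of the image gradients.
    Gradients come from forward finite differences after Gaussian smoothing."""
    diff = pred - target
    loss = np.abs(diff).sum()
    s = smooth(diff, sigma)
    gx = np.diff(s, axis=1)  # horizontal finite differences
    gy = np.diff(s, axis=0)  # vertical finite differences
    return loss + lam * (np.abs(gx).sum() + np.abs(gy).sum())
```

Unlike plain \(L_2\), the gradient term penalizes a uniformly blurred prediction, since blurring flattens the gradients of the error image.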

Our predicted shear does not seem to align with the motion (but my 3am visualization might be broken too...):

- 25,000 64x64 patches (4 patches for each 128px rendering).
- Randomly rotated and translated textured quad in front of textured background.
- Repeated texture for HF content.
- Uniform translation over the shutter time, **max speed ~9px**
- 4spp input, 256spp ground truth
- We compute the screen-space motion vectors

- Predict per-pixel transformation parameters: rotation (parametrized by \((\cos, \sin)\)), scale, shear.
- Input: per-sample radiance and screen-space velocity, averaged over a pixel (box filter)
- 4-layer convnet with LeakyReLU, 3px filters, 32 filters/layer, batchnormed
- 9px receptive field \((1+4\times(3-1))\)
- 4D output (transform params), squashed with \(\tanh\)
  - \((\cos, \sin)\) normalized to unit length
  - scale normalized to \([0, \textit{max\_scale}]\) (learnable max)
  - shear normalized to \([-\textit{max\_shear}, \textit{max\_shear}]\) (learnable max)
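The output squashing/normalization above can be sketched as follows; this is a NumPy stand-in, with `max_scale` and `max_shear` passed as constants rather than learnable parameters:

```python
import numpy as np

def squash_params(raw, max_scale, max_shear):
    """Map raw 4D network outputs to transform parameters.

    raw: (..., 4) array of pre-activations, assumed ordered [cos, sin, scale, shear].
    max_scale / max_shear: constants standing in for the learnable maxima.
    """
    t = np.tanh(raw)                               # squash everything to [-1, 1]
    c, s = t[..., 0], t[..., 1]
    norm = np.sqrt(c**2 + s**2) + 1e-8             # normalize (cos, sin) to unit length
    cos, sin = c / norm, s / norm
    scale = 0.5 * (t[..., 2] + 1.0) * max_scale    # [-1, 1] -> [0, max_scale]
    shear = t[..., 3] * max_shear                  # [-1, 1] -> [-max_shear, max_shear]
    return cos, sin, scale, shear
```

The \((\cos, \sin)\) normalization guarantees a valid rotation regardless of the raw magnitudes the convnet emits.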

- Sample averaging:
- 1 3D kernel \(w\) covering a **9px** neighborhood, discretization: \([15, 15, 9]\)
- Kernel initialized to a \(9\times 9\) box blur
- weights are non-negative
- Output given by: \[O_{x} = \frac{1}{\sum_i w_i}\sum_i w_i I_{x+i}\]
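A minimal NumPy sketch of the normalized weighted average \(O_x = \frac{1}{\sum_i w_i}\sum_i w_i I_{x+i}\); the edge padding and the (samples, height, width) layout are my own assumptions:

```python
import numpy as np

def kernel_average(samples, w):
    """Normalized weighted average over samples and a pixel neighborhood.

    samples: (S, H, W) stack of per-sample radiance images.
    w: non-negative (S, k, k) kernel over the S samples and a k x k neighborhood.
    Returns an (H, W) image: sum_i w_i * I_{x+i} / sum_i w_i.
    """
    S, k, _ = w.shape
    r = k // 2
    H, W = samples.shape[1:]
    padded = np.pad(samples, ((0, 0), (r, r), (r, r)), mode="edge")
    out = np.zeros((H, W))
    for s in range(S):
        for dy in range(k):
            for dx in range(k):
                out += w[s, dy, dx] * padded[s, dy:dy + H, dx:dx + W]
    return out / w.sum()
```

Dividing by \(\sum_i w_i\) makes the filter energy-preserving: a constant input stays constant for any non-negative kernel.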

- Post-averaging:

- 4-layer convnet, same as (1.)

- Batchnorm in (1.) was a good idea for the radiance-only input, less so for the velocity measurements
- Is the receptive field in (1.) too small to properly "smooth out" or detect motion?
- Should we remove step 3.? (I made it for comparison with the pixel-only network)
- only allows the usual radiance mixing; should we add features?

- allow negative weights? \(L_2\) normalization?
- Fairness of the comparison?

- If we want to get rid of the *motion vectors* as a feature, we need some sub-sample precision in step 1.