Confocal Stereo

Samuel W. Hasinoff and Kiriakos N. Kutulakos

Publications

Samuel W. Hasinoff and Kiriakos N. Kutulakos, Confocal Stereo. International Journal of Computer Vision, 81(1), pp. 82-104, 2009 (invited paper). [pdf]

Samuel W. Hasinoff and Kiriakos N. Kutulakos, Confocal Stereo. Proc. 9th European Conference on Computer Vision, ECCV 2006, pp. 620-634. [pdf]
Longuet-Higgins Best Paper Award, Honorable Mention

Samuel W. Hasinoff, Variable-Aperture Photography. PhD Thesis, University of Toronto, Dept. of Computer Science, 2008. [pdf]
Alain Fournier Ph.D. Thesis Award

Journal abstract

We present confocal stereo, a new method for computing 3D shape by controlling the focus and aperture of a lens. The method is specifically designed for reconstructing scenes with high geometric complexity or fine-scale texture. To achieve this, we introduce the confocal constancy property, which states that as the lens aperture varies, the pixel intensity of a visible in-focus scene point will vary in a scene-independent way that can be predicted by prior radiometric lens calibration. The only requirement is that incoming radiance within the cone subtended by the largest aperture is nearly constant. First, we develop a detailed lens model that factors out the distortions in high-resolution SLR cameras (12MP or more) with large-aperture lenses (e.g., f1.2). This allows us to assemble an A×F aperture-focus image (AFI) for each pixel that collects the undistorted measurements over all A apertures and F focus settings. In the AFI representation, confocal constancy reduces to color comparisons within regions of the AFI, and leads to focus metrics that can be evaluated separately for each pixel. We propose two such metrics and present initial reconstruction results for complex scenes, as well as for a scene with known ground-truth shape.
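
As a rough illustration of how a per-pixel focus metric can follow from confocal constancy, the sketch below scores each focus setting by the variance of a pixel's corrected intensities across apertures and picks the setting with the lowest variance. This is a minimal sketch, not the released Matlab code: the function name `confocal_constancy_depth`, the `(A, F)` array layout, and the use of plain variance as the comparison are assumptions made for illustration, and the input is assumed to be already aligned and radiometrically corrected.

```python
import numpy as np

def confocal_constancy_depth(afi):
    """Score each focus setting of one pixel's aperture-focus image (AFI).

    afi: array of shape (A, F) holding corrected intensities for one pixel
    over A apertures and F focus settings (hypothetical layout).
    Returns (index of best focus setting, per-focus criterion).
    """
    afi = np.asarray(afi, dtype=float)
    # Under confocal constancy, an in-focus point keeps the same corrected
    # intensity at every aperture, so its variance over apertures is small.
    criterion = afi.var(axis=0)
    return int(np.argmin(criterion)), criterion

# Synthetic AFI: intensity drifts with aperture except at the true focus.
A, F = 5, 9
true_focus = 6
apertures = np.arange(A)[:, None]
focus = np.arange(F)[None, :]
afi = 0.5 + 0.05 * apertures * np.abs(focus - true_focus)

best, crit = confocal_constancy_depth(afi)
```

Because the criterion uses only the measurements of a single pixel, it involves no spatial window, which is what allows depth to be estimated independently at each pixel.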

Supplementary material

PowerPoint slides, presented at ECCV 2006 [zip]
Accompanying video of reconstruction results [divx avi, 115MB] and its transcript [txt]
Partial Matlab code [zip]

"hair" dataset

Our first test scene was a wig with a messy hairstyle, approximately 25 cm tall, surrounded by several artificial plants. Reconstruction results for this scene show that our confocal constancy criteria lead to very detailed depth maps, at the resolution of individual strands of hair, despite the scene's complex geometry and despite the fact that depths can vary greatly within small image neighborhoods (e.g., toward the silhouette of the hair). By comparison, the 3x3 variance operator produces uniformly lower-resolution results, and generates smooth "halos" around narrow geometric structures like individual strands of hair. In many cases, these "halos" are larger than the width of the spatial operator, as blurring causes distant points to influence the results.
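
For reference, the window-based baseline can be sketched as follows: compute the variance of each 3x3 neighborhood in every image of a focal stack, then pick, per pixel, the focus setting with the highest local variance. This is a minimal sketch of generic depth-from-focus, assuming a grayscale stack of shape `(F, H, W)` at a fixed aperture and edge-replicated borders; the function names are hypothetical and the exact operator used for the comparison may differ.

```python
import numpy as np

def local_variance_3x3(img):
    """Variance of each pixel's 3x3 neighborhood (edge-replicated borders)."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    # Stack the nine shifted copies so axis 0 indexes the window positions.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return windows.var(axis=0)

def depth_from_focus(stack):
    """stack: (F, H, W) focal stack. Returns the sharpest focus index per pixel."""
    sharpness = np.stack([local_variance_3x3(frame) for frame in stack])
    return sharpness.argmax(axis=0)

# Synthetic stack: a checkerboard whose contrast peaks at focus index 1.
checker = (np.indices((6, 6)).sum(axis=0) % 2) * 2 - 1
stack = np.stack([0.5 + c * checker for c in (0.1, 0.4, 0.2)])
depth = depth_from_focus(stack)
```

Since every estimate pools a 3x3 window, a single strand of hair shares its window with background pixels, which is consistent with the lower resolution and the "halos" described above.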

In low-texture regions, such as the cloth flower petals and leaves, fitting a model to the entire AFI allows us to exploit defocused texture from nearby scene points. Window-based methods like variance, however, generally yield even better results in such regions, because they propagate focus information from nearby texture more directly, by implicitly assuming a smooth scene geometry. Like all focus measures, those based on confocal constancy are uninformative in extremely untextured regions, i.e., when the AFI is constant. However, by using the proposed confidence measure, we can detect many of these low-texture pixels. To better visualize the result of filtering out these pixels, we replace them using a simple variant of PDE-based inpainting (Bertalmio et al., 2000).
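
To give a sense of how rejected pixels can be filled from their surroundings, the sketch below uses plain diffusion: each masked pixel is repeatedly replaced by the average of its four neighbors. This is a simpler stand-in chosen for illustration, not the Bertalmio et al. (2000) algorithm and not necessarily the variant used for the results here; the function name `diffusion_inpaint` and the iteration count are assumptions.

```python
import numpy as np

def diffusion_inpaint(depth, mask, iters=500):
    """Fill pixels where mask is True by iterated 4-neighbor averaging.

    depth: (H, W) depth map; mask: (H, W) boolean, True = low confidence.
    Pixels outside the mask are left unchanged.
    """
    out = np.asarray(depth, dtype=float).copy()
    mask = np.asarray(mask, dtype=bool)
    out[mask] = out[~mask].mean()  # neutral initial guess for the holes
    for _ in range(iters):
        p = np.pad(out, 1, mode="edge")
        avg = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
        out[mask] = avg[mask]  # update only the masked pixels
    return out

# Synthetic example: a linear depth ramp with a 2x2 hole in the middle.
ramp = np.tile(np.arange(8, dtype=float), (8, 1))
mask = np.zeros((8, 8), dtype=bool)
mask[3:5, 3:5] = True
corrupted = ramp.copy()
corrupted[mask] = -1.0
filled = diffusion_inpaint(corrupted, mask)
```

Diffusion of this kind produces smooth fills, so it is suited to visualization rather than to recovering fine structure inside the rejected regions.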

sample aligned image (gamma = 2.0) - focus = 41, widest aperture (f1.2)

sample aligned image (gamma = 2.0) - focus = 41, narrowest aperture (f16)

depthmap - 3x3 variance (depth-from-focus)

depthmap - direct confocal constancy

depthmap - AFI model fitting (old feature-based online alignment, ECCV'06)

depthmap - AFI model fitting (new Lucas-Kanade online alignment, plus global lighting correction)

depthmap - AFI model fitting - low-confidence pixels* marked

depthmap - AFI model fitting - low-confidence pixels filled using inpainting

Full dataset (50.16 GB) available on request

"box" dataset

To quantify reconstruction accuracy, we used a tilted planar scene consisting of a box wrapped in newsprint. The plane of the box was measured with a FaroArm Gold 3D touch probe whose single-point accuracy was ±0.05 mm in the camera's workspace. To relate probe coordinates to coordinates in the camera's reference frame, we used the Camera Calibration Toolbox for Matlab along with further correspondences between image features and 3D coordinates measured by the probe. A similar procedure was used to estimate the mapping between focal settings and the depth of in-focus points, i.e., the dist(.) function in Eq. (10).
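
One way to turn probe measurements into a ground-truth plane is a least-squares fit, sketched below with a singular value decomposition. This is a generic sketch under the assumption that the probe yields a set of 3D points on the box face already expressed in the camera's reference frame; the function names are hypothetical, and the sketch does not cover the probe-to-camera calibration or the dist(.) mapping.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through 3D points. Returns (centroid, unit normal)."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # direction of least variance, i.e. the plane normal.
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    return centroid, vt[-1]

def point_plane_distances(points, centroid, normal):
    """Signed distances of 3D points from the fitted plane."""
    return (np.asarray(points, dtype=float) - centroid) @ normal

# Synthetic probe points on the tilted plane z = 0.5 * x + 2.
xs, ys = np.meshgrid(np.arange(4.0), np.arange(4.0))
probe = np.column_stack([xs.ravel(), ys.ravel(), 0.5 * xs.ravel() + 2.0])
centroid, normal = fit_plane(probe)
residuals = point_plane_distances(probe, centroid, normal)
```

Reconstructed depths, once converted to 3D points in the same frame, could then be scored with `point_plane_distances` against the fitted plane.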

sample aligned image (gamma = 2.0) - focus = 46, widest aperture (f1.2)

sample aligned image (gamma = 2.0) - focus = 46, narrowest aperture (f16)

depthmap - 3x3 variance (depth-from-focus)

depthmap - direct confocal constancy

depthmap - AFI model fitting (old feature-based online alignment, ECCV'06)

depthmap - AFI model fitting (new Lucas-Kanade online alignment, plus global lighting correction)

depthmap - AFI model fitting - low-confidence pixels* marked

depthmap - AFI model fitting - low-confidence pixels filled using inpainting

depthmap - ground truth (front-most plane of the box)

"plastic" dataset

Our third test scene was a rigid, near-planar piece of transparent plastic, formerly used as packaging material, which was covered with dirt, scratches, and fingerprints. This plastic object was placed in front of a dark background and lit obliquely to enhance the contrast of its limited surface texture. Reconstruction results for this scene illustrate that at high resolution, even transparent objects may have enough fine-scale surface texture to be reconstructed using focus- or defocus-based techniques. In general, wider-baseline methods like standard stereo cannot exploit such surface texture easily because textured objects behind the transparent surface may interfere with matching.

sample aligned image (gamma = 2.0) - focus = 31, widest aperture (f1.2)

sample aligned image (gamma = 2.0) - focus = 31, narrowest aperture (f16)

depthmap - 3x3 variance (depth-from-focus)

depthmap - direct confocal constancy

depthmap - AFI model fitting (old feature-based online alignment, ECCV'06)

depthmap - AFI model fitting (new Lucas-Kanade online alignment, plus global lighting correction)

depthmap - AFI model fitting - low-confidence pixels* marked

depthmap - AFI model fitting - low-confidence pixels filled using inpainting

"teddy" dataset

Our final test scene was captured with low-quality camera equipment: one of the earliest digital SLR cameras (the Canon EOS 10D), fitted with a low-quality zoom lens. The scene consists of a teddy bear with coarse fur, seated in front of a hat and several cushions, with a variety of ropes in the foreground. Since little of this scene is composed of the fine pixel-level texture found in the previous scenes, this final dataset provides an additional test for low-texture areas.

sample aligned image (gamma = 2.0) - focus = 1, widest aperture (f3.5)

sample aligned image (gamma = 2.0) - focus = 1, narrowest aperture (f16)

depthmap - 3x3 variance (depth-from-focus)

depthmap - direct confocal constancy

depthmap - AFI model fitting (old feature-based online alignment, ECCV'06)

depthmap - AFI model fitting (new Lucas-Kanade online alignment, plus global lighting correction)

depthmap - AFI model fitting - low-confidence pixels* marked

depthmap - AFI model fitting - low-confidence pixels filled using inpainting

* We have experimented with a simple confidence measure computed as the second derivative at the minimum of the focus criterion. In practice, since computing second derivatives directly can be noisy, we instead compute the width of the valley that contains the minimum, at a level 10% above the minimum. For AFI model-fitting across all datasets, we reject pixels whose valley width exceeds 14 focus settings. Small adjustments to this threshold do not change the results significantly.
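
The valley-width test can be sketched as follows. This is a minimal sketch under stated assumptions: the focus criterion is a positive 1D sequence with one value per focus setting, "10% above the minimum" is read as 1.1 times the minimum value, and the width is counted in whole focus settings. The function names `valley_width` and `is_low_confidence` are hypothetical.

```python
def valley_width(criterion, rel_level=0.10):
    """Width, in focus settings, of the valley around the global minimum,
    measured at a level rel_level above the minimum value."""
    values = [float(v) for v in criterion]
    i_min = min(range(len(values)), key=values.__getitem__)
    # Assumption: the level is relative to the (positive) minimum value.
    level = values[i_min] * (1.0 + rel_level)
    left = i_min
    while left > 0 and values[left - 1] <= level:
        left -= 1
    right = i_min
    while right < len(values) - 1 and values[right + 1] <= level:
        right += 1
    return right - left + 1

def is_low_confidence(criterion, max_width=14):
    """Reject a pixel when its valley is wider than max_width focus settings."""
    return valley_width(criterion) > max_width

# A sharp valley (confident) versus a nearly flat criterion (rejected).
sharp = [(f - 20) ** 2 + 1.0 for f in range(41)]
flat = [1.0 + 0.0001 * (f - 20) ** 2 for f in range(41)]
```

A narrow valley corresponds to a large second derivative at the minimum, so the width acts as a noise-tolerant proxy for the curvature-based measure.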

Acknowledgements

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under the RGPIN and CGS-D programs, by a fellowship from the Alfred P. Sloan Foundation, by an Ontario Premier's Research Excellence Award and by Microsoft Research.