Multi-frame stereo matching with edges, planes, and superpixels
Tianfan Xue1 Andrew Owens2 Daniel Scharstein3 Michael Goesele4 Richard Szeliski5
1MIT CSAIL      2University of California, Berkeley      3Middlebury College      4Technische Universität Darmstadt      5Facebook

We present a multi-frame narrow-baseline stereo matching algorithm based on extracting and matching edges across multiple frames. Edge matching lets us focus on the most informative features from the start and handle occlusion boundaries as well as untextured regions. Given the initial sparse matches, we fit overlapping local planes to form a coarse, over-complete representation of the scene. After breaking the reference image of our sequence into superpixels, we perform a Markov random field optimization to assign each superpixel to one of the plane hypotheses. Finally, we refine our continuous depth map estimate using a piecewise-continuous variational optimization. Our approach successfully deals with depth discontinuities, occlusions, and large textureless regions, while also producing detailed and accurate depth maps. We show that our method outperforms competing methods on high-resolution multi-frame stereo benchmarks and is well-suited for view interpolation applications.
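To illustrate the plane-fitting stage described above, here is a minimal sketch that fits a disparity plane to a set of sparse edge matches. The affine parameterization d = a·x + b·y + c and the function name `fit_local_plane` are our illustrative assumptions, not the paper's exact formulation; one such fit per local region yields an over-complete set of plane hypotheses.

```python
import numpy as np

def fit_local_plane(matches):
    """Least-squares fit of a disparity plane d = a*x + b*y + c to sparse
    (x, y, d) edge matches. Repeating this over overlapping local regions
    produces an over-complete set of plane hypotheses (sketch only)."""
    matches = np.asarray(matches, dtype=float)
    # Design matrix [x, y, 1] for the affine disparity model.
    A = np.column_stack([matches[:, 0], matches[:, 1], np.ones(len(matches))])
    coeffs, *_ = np.linalg.lstsq(A, matches[:, 2], rcond=None)
    return coeffs  # (a, b, c)
```

For matches sampled from an exact plane, the fit recovers the plane parameters up to numerical precision.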

@article{xue2019multistereo,
  title     = {Multi-frame stereo matching with edges, planes, and superpixels},
  author    = {Xue, Tianfan and Owens, Andrew and Scharstein, Daniel and Goesele, Michael and Szeliski, Richard},
  journal   = {Image and Vision Computing},
  volume    = {91},
  year      = {2019},
  publisher = {Elsevier}
}

Downloads: PDF


1. View interpolation
2. Results on Midd-F
3. Results on Disney
4. Comparison with Kim et al. [3] on Midd-Q
5. Varying number of frames as input

1. View interpolation         Back to top
First, we show view interpolation results using depth maps computed by our algorithm. For each sequence, we recover the depth maps of both the left and right frames using 5 (for Teddy) or 9 (for Mansion) frames around these two frames. We then render viewpoints between the left and right frames by warping both color images with the recovered depth maps and combining the results using z-buffering and linear blending. To evaluate interpolation quality, we compare the interpolated view with the actual image captured at that viewpoint.
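The renderer itself is not part of this page, but the warping-and-blending step described above can be sketched roughly as follows. Grayscale images, purely horizontal disparities, and the function names are simplifying assumptions; the actual renderer may differ.

```python
import numpy as np

def forward_warp(img, disp, shift):
    """Forward-warp img horizontally by shift * disp, keeping the nearest
    surface (largest disparity) at each target pixel: z-buffering."""
    h, w = disp.shape
    out = np.zeros((h, w))
    zbuf = np.full((h, w), -np.inf)            # larger disparity = closer
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.round(xs + shift * disp).astype(int)
    valid = (xt >= 0) & (xt < w)
    for y, x, xn in zip(ys[valid], xs[valid], xt[valid]):
        if disp[y, x] > zbuf[y, xn]:           # closer point wins
            zbuf[y, xn] = disp[y, x]
            out[y, xn] = img[y, x]
    return out, np.isfinite(zbuf)              # warped image, coverage mask

def interpolate_view(left, right, disp_l, disp_r, alpha):
    """Render the view at fraction alpha between left (0) and right (1)
    by warping both images and linearly blending where both cover."""
    wl, ml = forward_warp(left, disp_l, -alpha)
    wr, mr = forward_warp(right, disp_r, 1.0 - alpha)
    return np.where(ml & mr, (1.0 - alpha) * wl + alpha * wr,
                    np.where(ml, wl, wr))
```

With constant disparity, the midpoint view is simply the half-shifted image, which both warped inputs agree on.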

Please mouse over or click on the labels beneath each image to switch between them (loading may take 1-2 seconds, as the images are very large). Two corresponding close-up views are shown beside each sequence.



In the interpolated view using SGM depth, some background pixels near depth boundaries are missing. For example, both the digit 2 in the top patch and the digit 4 in the bottom patch are missing (compare SGM with the ground truth). This is because the depth in these regions is incorrect due to foreground fattening. No such errors appear in our result.


In the interpolated view using depth computed by SGM, there are "halos" around the leaves (top patch) and the spike (bottom patch) due to foreground fattening, while the boundaries of these objects are much cleaner in our result.

Below we show synthesized videos using our depth maps. The first and the last frame of each sequence are used as input and the remaining frames are generated by view interpolation.



Back to top

2. Results on Midd-F         Back to top
We provide depth maps recovered by our technique on Midd-F, along with intermediate results (depth of edges and depth of patches), and a comparison with SGM. Please mouse over or click on the labels beneath each image to switch between them (loading may take 1-2 seconds, as the images are very large). Two corresponding close-up views are shown beside each sequence.



Our depth, error rate = 3.08%


Our depth, error rate = 4.28%

In the SGM depth, the leaf shown in the top patch is thicker than it should be, and there are also errors between two leaves shown in the bottom patches. These errors do not exist in our depth map.

Most errors in our depth occur at the leaf protruding towards the viewer and along other untextured leaf regions, in which few edges were detected (see edge depth).

The boundaries of the three pens shown in the top patch are mostly accurate in our depth map, but are thicker than the ground truth in the SGM depth map. Our algorithm also produces fewer errors on the background (see both patches).


Our depth, error rate = 1.08%


Our depth, error rate = 2.43%

The depth maps recovered by both our algorithm and SGM are mostly accurate in this sequence, since the input images are highly textured.

Several cones in this sequence are occluded by other cones, making matching challenging. Our edge matching algorithm can correctly estimate the depth of each cone (see edge depth), and therefore the estimated depth has clear boundaries between these cones, even when two cones have similar colors, e.g., the two red cones shown in the bottom patch. In contrast, the depth boundaries in the SGM depth map are much less clean.


Our depth, error rate = 7.35%


Our depth, error rate = 1.21%

The depth boundaries of the dolls are mostly accurate in our depth map, while SGM often results in foreground fattening (see the two zoomed patches).

Both algorithms have difficulty in untextured slanted and curved regions such as the ground plane and the black feet of the doll in the bottom right.

The depth maps recovered by both algorithms are mostly accurate, except for a few textureless holes between the rocks.


Our depth, error rate = 4.47%

This is a challenging sequence, as there are highly foreshortened regions (the newspaper at the bottom), curved surfaces (the teddy's face), and untextured regions (the background to the right of the teddy). Our technique correctly recovers the depth map in most of these regions, while SGM creates large errors in the textureless regions as can be seen in the two zoomed patches.

Back to top

3. Results on Disney         Back to top

Since ground truth is not available for Disney, we can only provide a qualitative comparison between our results and those of SGM and Kim et al. [3]. Note that Kim et al. [3] use 101 frames as input, while SGM and our algorithm use only 9 frames.



The depth map recovered by our algorithm is very similar to the one recovered by Kim et al. [3], except at very thin structures. The depth boundaries of the wires in our depth map are cleaner than those in the SGM depth map (see the top patch). Following Kim et al. [3] we remove the depth of the sky using the mask provided in [3].


This is a very challenging sequence since there are many thin structures, such as the fence and the leaves shown in the two zoomed patches. Our algorithm accurately recovers the depth of most of these objects, while SGM again tends to produce foreground fattening (see zoomed patches). The method by Kim et al. [3] is sometimes more accurate on thin structures than our method, but our method often does better in untextured regions such as the window behind the spike in the top patch. In addition, our method uses only 9 input frames while Kim et al. [3] use 101 frames.


All three algorithms work reasonably well on this sequence, but our algorithm is the only one that accurately recovers the leaves in front of the untextured car (top patch), and it does better than SGM recovering the thin car antenna (bottom patch).

Back to top

4. Comparison with Kim et al. [3] on Midd-Q         Back to top

To quantitatively compare our algorithm with Kim et al. [3], we test both algorithms on three sequences from the Middlebury 2001 dataset. The error rates (%) of the recovered depth maps are shown below (threshold = 1.0). Our algorithm performs much better than Kim et al. [3], which is not designed for small numbers of input images.

Sequence    No. of frames    Kim et al. [3]    Ours
Tsukuba     5                 8.42             5.83
Venus       9                10.59             0.64
Sawtooth    9                 6.25             1.46
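The metric in the table is the standard Middlebury bad-pixel percentage: the fraction of pixels whose disparity error exceeds the threshold. A minimal sketch (the function name and the NaN-based ground-truth mask are our assumptions):

```python
import numpy as np

def error_rate(disp, gt, threshold=1.0):
    """Percentage of pixels whose absolute disparity error exceeds
    `threshold`, evaluated only where ground truth is available
    (missing ground truth marked as NaN)."""
    valid = np.isfinite(gt)
    bad = np.abs(disp - gt) > threshold
    return 100.0 * np.count_nonzero(bad & valid) / np.count_nonzero(valid)
```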
Back to top

5. Varying number of frames as input         Back to top

This experiment shows how our algorithm performs with different numbers of input frames. The recovered depth map is shown on the left and the corresponding error map on the right. Please mouse over or click on the labels beneath each image to switch between them.

Input/Estimated depth map Error map

Although we use 7 or 9 frames as input in most experiments, our algorithm still works reasonably well when fewer frames are provided. Even with just 3 frames, clean depth maps can still be recovered, and errors decrease as more frames are used.
Back to top