An Efficient Hybrid Shadow Rendering Algorithm

Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

Proceedings of the Eurographics Symposium on Rendering 2004

Abstract

We present a hybrid algorithm for rendering hard shadows accurately and efficiently. Our method combines the strengths of shadow maps and shadow volumes. We first use a shadow map to identify the pixels in the image that lie near shadow discontinuities. Then, we perform the shadow volume algorithm only at these pixels to ensure accurate shadow edges. This approach simultaneously avoids the edge aliasing artifacts of standard shadow maps and avoids the high fillrate consumption of standard shadow volumes. The algorithm relies on a hardware mechanism for rapidly rejecting non-silhouette pixels during rasterization. Since current graphics hardware does not directly provide this mechanism, we simulate it using available features related to occlusion culling and show that dedicated hardware support requires minimal changes to existing technology.

Files

paper		pdf (8.2 MB)
submission video		mpg (65.1 MB)
bibtex		bib
slides		html

Images

The following images show the three test scenes that we study in our paper. We provide three different views of each scene. In the table below, rows correspond to different viewpoints, and columns show different renderings of a scene from a given viewpoint. Click on a thumbnail for a full-resolution (1024 x 768) image. All images were rendered using a NVIDIA GeForce 6800 (NV40) graphics card.

The visualization images in the last column require further explanation. Red and blue pixels are non-silhouette pixels. In our hybrid algorithm, shadow maps are used to determine which pixels are in shadow (red) and which ones are not (blue). Black and green pixels are shadow silhouette pixels. Shadow volumes are used to determine which ones are in shadow (black) and which ones are not (green). Note that this is the same color scheme used in our paper.


Shadow map	Shadow volumes	Hybrid algorithm	Shadow silhouette pixels	Shadow-volume polygons	Reconstruction visualization

Cubes
					N/A



Tree




Dragon Cage

Code

You can download some of our Cg shader code:

The following vertex and fragment programs are used to identify shadow silhouettes. These pixels are tagged by writing 1 to the color buffer.

Vertex program for finding shadow silhouettes: findsil_v.cg
Fragment program for finding shadow silhouettes: findsil_f.cg

The following fragment program is used to simulate the computation mask in conjunction with the EXT_depth_bounds_test OpenGL extension. The idea is to draw a screen-sized quad and use this shader to set the depth values of non-silhouette pixels to z = 0; the depth values of silhouette pixels are left alone. Then, enable the depth bounds test and set the bounds to [0.001, 1]. This tells the hardware that in all subsequent rendering, ignore all pixels whose depth values in the framebuffer are less than 0.001: these ignored pixels are exactly the non-silhouette pixels. Now we can draw the shadow volumes into the stencil buffer, and the hardware will perform rasterization and updates only at the silhouette pixels.

Fragment program for setting up computation mask: setmask_f.cg

Performance Issues

Our hybrid algorithm relies heavily on the hardware's ability to reject rapidly non-silhouette pixels early in the pipeline. Actual performance gains will depend on the effectiveness of hardware culling. In our tests on NVIDIA hardware, the GeForce 6 (NV40) can rasterize z/stencil (color writes disabled) at 32 pixels/clock, but it can reject pixels at 64 pixels/clock. Similarly, we observed that the GeForce FX 5950 (NV38) can rasterize z/stencil at 8 pixels/clock, but it can reject at 16 pixels/clock. Both measurements represent peak performance. In either case, the relative speedup is a factor of 2.

We have found that ATI's early culling mechanisms on the Radeon 9800 Pro are somewhat more effective than NVIDIA's, meaning that the relative performance gains for rejected pixels is higher in the ATI scenario. We have not performed detailed measurements, however. The purpose of this discussion is not to identify which vendors offer superior early culling performance. Instead, we simply want to point out that culling mechanisms can vary across different architectures; as these mechanisms improve in the future, so will the performance of our hybrid algorithm.

We mention in the paper that our algorithm requires the hardware to preserve early culling behavior when using a pixel shader to set depth values. Of the various NVIDIA hardware, only the GeForce 6 (NV40) architecture supports this feature. Previous NVIDIA architectures, such as the GeForce 3/4 (NV2x) and GeForce FX (NV3x) do not. On those earlier architectures, if you use a pixel shader to discard fragments (fragment kill) or modify depth values, early culling becomes disabled. Consequently, our hybrid algorithm is not efficient on those architectures because shadow volumes will be completely rasterized instead of being limited to the shadow silhouettes.

We have also discussed this issue with ATI, and they informed us that using a pixel shader to set depth values will preserve culling behavior for subsequent render passes on the Radeon 9700 (R300) or newer architectures. (Note that early culling will not apply to the current pass.) While this arrangement is suitable for our hybrid algorithm, unfortunately ATI hardware does not support the EXT_depth_bounds_test extension. Therefore, currently we cannot simulate a computation mask efficiently on ATI hardware.

Is it necessary to use the EXT_depth_bounds_test extension to simulate a computation mask? Not always. Consider the following image-processing scenario. We want to identify a small subset of the pixels in the image to work on; all other pixels are to be masked off. Then we want to run an expensive image-processing shader on the small working set of pixels. This can be accomplished as follows:

Clear the depth buffer so that z = 1.
Draw a screen-sized quad at z = 0 and use a pixel shader to discard the pixels in the working set. Make sure to have depth testing enabled and set the depth compare function to LESS. The end result is that all pixels to be masked-off will have depth values of 0, and all remaining pixels will have depth values of 1.
Draw another screen-sized quad, this time at z = 0.5, and apply the expensive image-processing pixel shader. Continue to use depth testing with the depth compare function set to LESS. A big performance win will result because the masked-off pixels will fail the depth test.

The above example does not use the EXT_depth_bounds_test; it only relies on standard z-buffering. Why can't our algorithm do the same? The reason is that we need to preserve the real z-values of the scenes' surfaces at the silhouette pixels in order to apply the shadow volume algorithm. Similarly, we can't do early stencil-based culling, because we also need the stencil buffer for rasterizing shadow volumes. These difficulties underscore the need for a user-programmable computation mask that is semantically independent of the existing depth and stencil buffers.

Related Research

Timo Aila and Tomas Akenine-Möller have written a paper titled "A Hierarchical Shadow Volume Algorithm" that will appear in Graphics Hardware 2004. Their idea is similar to ours: apply the shadow-volume computation only near the shadow silhouettes. The most important difference is that they identify tiles of pixels in object space by using testing for triangle intersections against the tiles' 3D bounding boxes, whereas we identify shadow silhouette pixels in image space using a shadow map. Since they do not use a discrete image representation anywhere in their algorithm, their method is not subject to undersampling artifacts such as missing shadows of fine geometry. Also, their method always identifies tiles of boundary pixels with respect to the observer's image plane, whereas our method uses a shadow map drawn from the light's point of view. This means that, as long as we're using only a single shadow map, our method will only provide performance speedups for parts of the observer's image that are covered by the shadow map. Another way of saying this is that the shadow-map method implicitly involves some sort of directional light source, like a spotlight, because one has to choose a field-of-view for the shadow camera. In contrast, the hierarchical shadow volume approach doesn't involve this shadow camera and naturally handles omnidirectional lights. The tradeoff, however, is that their method requires numerous changes to current hardware, whereas our method leverages existing early culling hardware.

Brandon Lloyd and his colleagues at UNC have written a related paper titled "CC Shadow Volumes" and published it in the same forum as our paper: the 2004 Eurographics Symposium on Rendering. Their method is also aimed at minimizing the fillrate consumed by shadow volumes. The main difference between their approach and ours is that they reduce the number and size of the shadow-volume polygons (by culling and clamping), whereas we send all the polygons (at full size) and rely on the early culling hardware.

It is interesting to consider both approaches because they incur different types of overhead. In our method, rasterization would ideally be performed only at the shadow silhouette pixels, but in reality there is the overhead of considering and then rejecting the non-silhouette pixels. Another way of thinking about this is that (a) rasterizing a quad that covers half of the viewport is faster than (b) drawing a full-screen quad and using early culling to throw away half of the pixels. Of course, their is also some overhead in the CC shadow-volume method because they have to clamp the shadow-volume polygons. The take-home message here is that the two methods are complementary: In a highly-optimized implementation of shadow volumes, it makes sense to spend some additional work on the CPU to perform coarse-grained culling of shadow-volume polygons and then to rely on the GPU for fine-grained culling. Maximizing rendering performance by balancing overhead vs. culling benefits is a difficult (but interesting!) problem.

Mark Harris gave a presentation on general-purpose GPU programming at Game Developer's Conference in March 2004. In the presentation he talks about computation masks (though he doesn't use that term) and describes one way to implement by exploiting the standard depth test. As mentioned above, we couldn't implement our mask this way because we need to preserve the depth values of silhouette pixels.

Tim Purcell et al. describe computation masks in the context of a GPU-based photon-mapping algorithm. Tim discusses the issue further in his PhD thesis.

Copyright Notice

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than the publisher must be honoured and acknowledged.

Back to paper index