Efficient Partitioning of Fragment Shaders for Multipass Rendering on Programmable Graphics Hardware

Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot, and Pat Hanrahan

Computer Science Department
Stanford University

Proceedings of the SIGGRAPH / Eurographics Workshop on Graphics Hardware 2002

Abstract

Real-time programmable graphics hardware has resource constraints that prevent complex shaders from rendering in a single pass. One way to virtualize these resources is to partition shading computations into multiple passes, each of which satisfies the given constraints. Many such partitions exist for a shader, but it is important to find one that renders efficiently. We present Recursive Dominator Split (RDS), a polynomial-time algorithm that uses a cost model to find near-optimal partitions of arbitrarily complex shaders. Using a simulator, we analyze partitions for architectures with different resource constraints and show that RDS performs well on different graphics architectures. We also demonstrate that shader partitions computed by RDS can run efficiently on programmable graphics hardware available today.  

Files

paper   pdf  (263 KB)
video   mov  (49.5 MB)
video   avi  (23.5 MB)
bibtex   bib
slides   html

Related Work

A variant of RDS is used in ATI's ASHLI rendering system.

RDS assumes that the fragment program provides only one vector output per pass. This was true of graphics hardware until the ATI Radeon 9700, which supports 4 vector outputs per pass, emerged in the summer of 2002. There are two cases where having multiple outputs per pass is significant. The first case is the obvious one where the shader itself computes multiple values, and here the original RDS offers no solution. The second case is more subtle: the shader computes only one value, but because of multiple subgraphs of the original computation that are referenced multiple times, it is possible to optimize the non-final rendering passes to take advantage of multiple-output hardware.

Other researchers have improved on RDS to support fragment-shading hardware with multiple outputs:

  • Tim Foley, Mike Houston, and Pat Hanrahan developed MRDS, an extension to RDS. Their paper appears in Graphics Hardware 2004. To support shaders that compute multiple output values, MRDS creates a single imaginary root whose children are all the shader output nodes. Next, the algorithm creates a dominator tree and applies the original RDS strategy: RDS marks a set of nodes as splits and renders one pass per split.

    To support hardware with multiple outputs, MRDS first proceeds as above, then attempts to merge intermediate passes. The idea is that some of the single-output subpasses can be placed together into a single pass with multiple outputs. This is accomplished by considering all possible pairs of merges and assigning each pair a score (using an extended cost metric). Finally, merges proceed in order of decreasing score and per-pass validity is enforced as usual.

    MRDS, like RDS, is a O(n^3) algorithm. The authors also propose another version of MRDS (called MRDS') that combines greedy pass merging with save/recompute searching. The idea is to perform local optimizations as part of the save/recompute decision. But this makes the algorithm O(n^4) and in practice the authors did not find this benefit to be worthwhile.

    As expected, MRDS outperforms RDS on MRT hardware, measured both in metric cost and execution time.

  • Andrew Riffel, Aaron Lefohn, Kiril Vidimce, Mark Leone, and John D. Owens developed a partitioning algorithm called Mio. Their algorithm was also published in Graphics Hardware 2004. Mio is designed to handle hardware with multiple render targets (MRT). Unlike RDS and MRDS, Mio is also designed to offer superior compile-time performance, with an expected running time of O(n log n) versus O(n^2) for RDSh, O(n^3) for MRDS and RDS, and O(n^4) for MRDS'.

    Mio takes a different approach to the multipass partitioning problem (MPP). The authors formulate MPP as a list scheduling problem, which is often studied in the compiler community. Each job represents an operation that needs to be scheduled. Mio uses two O(n) DAG walks to determine a traversal order. These prepasses are designed to find an order that minimizes register pressure. Essentially, Mio performs a reverse topological sort using a heuristic to decide which schedule minimizes the number of live variables. This helps to avoid hitting the register resource limit when scheduling.

    The speed increase of Mio comes partly from the fact that Mio performs incremental resource estimation, instead of repeatedly invoking a linear-time code generator to optimize the packing of operations into passes.

    Note that Mio is designed (1) to be efficient, and (2) to minimize the number of operations overall. In contrast, RDS tries to minimize the number of passes (with the hope of minimizing the overall cost). Note, for example, that Mio's partitions usually contain more texture operations than those generated by RDS or RDSh.

  • Alan Heirich has recently written a paper for Graphics Hardware 2005 that partitions shaders using dynamic programming. His algorithm, called DPMPP, provides an optimal solution to MPP, as opposed to the locally but not globally optimal solutions found by RDS, RDSh, MRDS, and Mio.

    DPMPP works by starting from the desired end state and searching backwards for an initial state whose path to the end state has the minimum cost. Intermediate steps along the path between the initial state and the end state represent actual machine states and operations, such as a memory fetch or the addition of two register contents. In each step of the algorithm, DPMPP considers valid state transitions that extend the current search path, but keeps only the minimum-cost paths.

    Note that Heirich's algorithm finds the optimal solution for expression trees, but not necessarily for DAGs whose nodes may have multiple parents. For his paper, Heirich tested his algorithm using a compiler that did combine common subexpressions, so his shaders were expressed as trees. Unfortunately, optimal code generation for DAGs is NP-complete.

Copyright Notice

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than the publisher must be honoured and acknowledged.

Back to paper index