Efficient Partitioning of Fragment Shaders for Multipass
Rendering on Programmable Graphics Hardware
Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot, and Pat Hanrahan
Computer Science Department
Stanford University
Proceedings of the SIGGRAPH / Eurographics Workshop on Graphics
Hardware 2002
Abstract
Real-time programmable graphics hardware has resource constraints that
prevent complex shaders from rendering in a single pass. One way to
virtualize these resources is to partition shading computations into
multiple passes, each of which satisfies the given constraints. Many
such partitions exist for a shader, but it is important to find one
that renders efficiently. We present Recursive Dominator Split (RDS),
a polynomial-time algorithm that uses a cost model to find
near-optimal partitions of arbitrarily complex shaders. Using a
simulator, we analyze partitions for architectures with different
resource constraints and show that RDS performs well on different
graphics architectures. We also demonstrate that shader partitions
computed by RDS can run efficiently on programmable graphics hardware
available today.
|
|
|
Files
Related Work
A variant of RDS is used in ATI's ASHLI
rendering system.
RDS assumes that the fragment program provides only one vector output
per pass. This was true of graphics hardware until the ATI Radeon
9700, which supports 4 vector outputs per pass, emerged in the summer
of 2002. There are two cases where having multiple outputs per pass
is significant. The first case is the obvious one where the shader
itself computes multiple values, and here the original RDS offers no
solution. The second case is more subtle: the shader computes only
one value, but because of multiple subgraphs of the original
computation that are referenced multiple times, it is possible to
optimize the non-final rendering passes to take advantage of
multiple-output hardware.
Other researchers have improved on RDS to support fragment-shading
hardware with multiple outputs:
-
Tim Foley, Mike
Houston, and Pat Hanrahan developed MRDS, an extension to RDS.
Their paper appears in Graphics Hardware 2004. To support shaders
that compute multiple output values, MRDS creates a single imaginary
root whose children are all the shader output nodes. Next, the
algorithm creates a dominator tree and applies the original RDS
strategy: RDS marks a set of nodes as splits and renders one pass
per split.
To support hardware with multiple outputs, MRDS first proceeds as
above, then attempts to merge intermediate passes. The idea is that
some of the single-output subpasses can be placed together into a
single pass with multiple outputs. This is accomplished by
considering all possible pairs of merges and assigning each pair a
score (using an extended cost metric). Finally, merges proceed in
order of decreasing score and per-pass validity is enforced as usual.
MRDS, like RDS, is a O(n^3) algorithm. The authors also propose
another version of MRDS (called MRDS') that combines greedy pass
merging with save/recompute searching. The idea is to perform local
optimizations as part of the save/recompute decision. But this makes
the algorithm O(n^4) and in practice the authors did not find this
benefit to be worthwhile.
As expected, MRDS outperforms RDS on MRT hardware, measured both
in metric cost and execution time.
- Andrew
Riffel, Aaron Lefohn, Kiril Vidimce, Mark Leone, and John D. Owens
developed a partitioning algorithm called Mio. Their algorithm
was also published in Graphics Hardware 2004. Mio is designed to
handle hardware with multiple render targets (MRT). Unlike RDS and
MRDS, Mio is also designed to offer superior compile-time performance,
with an expected running time of O(n log n) versus O(n^2) for RDSh,
O(n^3) for MRDS and RDS, and O(n^4) for MRDS'.
Mio takes a different approach to the multipass partitioning problem
(MPP). The authors formulate MPP as a list scheduling problem, which
is often studied in the compiler community. Each job represents an
operation that needs to be scheduled. Mio uses two O(n) DAG walks to
determine a traversal order. These prepasses are designed to find an
order that minimizes register pressure. Essentially, Mio performs a
reverse topological sort using a heuristic to decide which schedule
minimizes the number of live variables. This helps to avoid hitting
the register resource limit when scheduling.
The speed increase of Mio comes partly from the fact that Mio performs
incremental resource estimation, instead of repeatedly invoking a
linear-time code generator to optimize the packing of operations into
passes.
Note that Mio is designed (1) to be efficient, and (2) to
minimize the number of operations overall. In contrast, RDS tries
to minimize the number of passes (with the hope of minimizing the
overall cost). Note, for example, that Mio's partitions usually
contain more texture operations than those generated by RDS or RDSh.
-
Alan
Heirich has recently written a paper for Graphics Hardware 2005 that
partitions shaders using dynamic programming. His algorithm,
called DPMPP,
provides an optimal solution to MPP, as opposed to the
locally but not globally optimal solutions found by RDS, RDSh, MRDS,
and Mio.
DPMPP works by starting from the desired end state and searching
backwards for an initial state whose path to the end state has the
minimum cost. Intermediate steps along the path between the initial
state and the end state represent actual machine states and
operations, such as a memory fetch or the addition of two register
contents. In each step of the algorithm, DPMPP considers valid state
transitions that extend the current search path, but keeps only the
minimum-cost paths.
Note that Heirich's algorithm finds the optimal solution for
expression trees, but not necessarily for DAGs whose nodes may have
multiple parents. For his paper, Heirich tested his algorithm using
a compiler that did combine common subexpressions, so his shaders were
expressed as trees. Unfortunately, optimal code generation for DAGs
is NP-complete.
Copyright Notice
Permission to make digital or hard copies of part or all of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or direct commercial
advantage and that copies show this notice on the first page or
initial screen of a display along with the full citation. Copyrights
for components of this work owned by others than the publisher must be
honoured and acknowledged.
Back to paper index
|