Single-Cycle Collective Communication Over A Shared Network Fabric
Tushar Krishna and Li-Shiuan Peh
In the multicore era, on-chip network latency and throughput have a direct impact on system performance.
A highly important class of communication flows traversing the network is collective,
i.e., one-to-many and many-to-one. Scalable coherence protocols often
leverage imprecise tracking to lower the overhead of directory storage,
in turn leading to more collective communications on-chip.
Routers with support for message forking/aggregation have previously been demonstrated
to support such protocols.
However, even with the fastest possible designs today (1-cycle routers),
collective flows on a k×k mesh still incur delays proportional to k,
since all communication is across the entire chip. As k increases across technology generations,
the latency of these flows will also go up.
Yet the pure wire delay to cross the chip is just 1-2 cycles today, and is expected to remain roughly invariant.
The dependence of message delays on k arises due to the requirement to latch messages at every router.
In this work, we remove this requirement.
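The scaling argument above can be illustrated with a rough back-of-the-envelope model. The Python sketch below is not taken from the paper: the function names, the `cycles_per_hop` and `cycles_per_dimension` parameters, and the choice of a full broadcast as the collective flow are all assumptions made for illustration, and contention, serialization, and setup costs are ignored.

```python
def hop_by_hop_latency(k, src, cycles_per_hop=1):
    """Cycles for a broadcast from `src` to reach the farthest node of a
    k x k mesh when the message is latched at every router on the way.

    Assumption: each hop costs `cycles_per_hop` cycles (hypothetical knob).
    """
    sx, sy = src
    # Manhattan distance from the source to the farthest corner.
    farthest_hops = max(sx, k - 1 - sx) + max(sy, k - 1 - sy)
    return farthest_hops * cycles_per_hop


def per_dimension_latency(k, cycles_per_dimension=1):
    """Idealized cycles when a whole dimension is crossed without latching.

    Assumption: a message crosses an entire row or column in
    `cycles_per_dimension` cycles, so a 2D broadcast needs two traversals.
    """
    return 0 if k <= 1 else 2 * cycles_per_dimension


if __name__ == "__main__":
    for k in (4, 8, 16):
        print(k, hop_by_hop_latency(k, (0, 0)), per_dimension_latency(k))
```

Under these assumptions the hop-by-hop figure grows linearly with k (6, 14, and 30 cycles from a corner source for k = 4, 8, 16), while the per-dimension figure stays at 2 cycles.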
We design a network fabric that enables messages to (1) dynamically create virtual 1-to-Many (multicast) and
Many-to-1 (reduction) tree routes over a physical mesh,
(2) get forked/aggregated at nodes on the tree, and
(3) traverse the tree -
all within a single cycle across each dimension.
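To make the idea of a virtual multicast tree over a mesh concrete, here is a minimal Python sketch. It assumes dimension-ordered (X-then-Y) routing, a common textbook way to obtain a tree as the union of per-destination paths; the abstract does not state how the fabric actually builds its trees, and the names `xy_multicast_tree` and `fork_nodes` are hypothetical.

```python
def xy_multicast_tree(src, dests):
    """Return the set of directed mesh links (parent, child) used when every
    destination is reached by X-then-Y routing from `src`.

    Because each node has a unique XY path from the source, the union of
    these paths is a tree rooted at `src`.
    """
    sx, sy = src
    edges = set()
    for dx, dy in dests:
        x, y = sx, sy
        step = 1 if dx > x else -1
        while x != dx:  # travel along the source's row first
            edges.add(((x, y), (x + step, y)))
            x += step
        step = 1 if dy > y else -1
        while y != dy:  # then along the destination's column
            edges.add(((x, y), (x, y + step)))
            y += step
    return edges


def fork_nodes(edges):
    """Nodes with more than one child, i.e. where a message must be forked."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    return {node for node, kids in children.items() if len(kids) > 1}


if __name__ == "__main__":
    tree = xy_multicast_tree((0, 0), [(2, 0), (1, 2), (2, 1)])
    print(len(tree), fork_nodes(tree))
```

For this example the three destinations would need 8 link traversals as separate unicasts, whereas the shared tree uses 5 links with a single fork at node (1, 0). Reversing every edge gives the corresponding many-to-one reduction tree, with aggregation at the same node.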
For synthetic 1-to-Many/Many-to-1 flows, we demonstrate a 76/82% reduction in latency and a 1.6/2X improvement in throughput
over a state-of-the-art NoC with 1-cycle routers and support for collective communication.
Across a suite of SPLASH-2 and PARSEC benchmarks, full-system runtime and energy are reduced by 14% and 50%, respectively,
for a limited-directory protocol.