Breaking the On-Chip Latency Barrier Using SMART
Tushar Krishna, Chia-Hsin Owen Chen, Woo Cheol Kwon and Li-Shiuan Peh
As the number of on-chip cores increases, scalable on-chip
topologies such as meshes inevitably add multiple hops in
each network traversal. The best we can do today is to design 1-cycle routers, such that the low-load network latency
between a source and destination is equal to the number of
routers + links (i.e., hops × 2) between them. OS/compiler
and cache-coherence protocol designers often try to limit
communication to within a few hops, since on-chip latency
is critical for their scalability. In this work, we propose
an on-chip network called SMART (Single-cycle Multi-hop
Asynchronous Repeated Traversal) that aims to present a
single-cycle data-path all the way from the source to the destination. We do not add any fast physical express
links to the data-path; instead we drive the shared crossbars
and links asynchronously across multiple hops within a single
cycle. We design a router + link microarchitecture to achieve
such a traversal, and a flow-control technique to arbitrate
and set up multi-hop paths within a cycle. A place-and-routed
design at 45 nm achieves 11 hops within a 1 GHz cycle for
paths without turns (9 for paths with turns). We observe
a 5-8x reduction in low-load latencies across synthetic traffic
patterns on an 8x8 CMP, compared to a baseline 1-cycle
router. Full-system simulations with SPLASH-2 and PARSEC benchmarks demonstrate 27/52% and 20/59% reduction
in runtime and EDP for Private/Shared L2 designs.
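
The latency arithmetic above can be illustrated with a minimal sketch. It assumes an XY mesh with (x, y) node coordinates, no contention, and a deliberately idealized SMART model that ignores any path-setup cycles; the function names and the rule of applying the 9-hop limit to any path with a turn are assumptions for illustration, not the paper's model. Because setup and contention are ignored, it gives an optimistic bound and is not meant to reproduce the reported 5-8x figures.

```python
from math import ceil


def hop_count(src, dst):
    """Manhattan distance between two (x, y) mesh nodes."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def baseline_latency(src, dst):
    """Low-load latency with 1-cycle routers: one cycle per router plus
    one cycle per link, i.e. hops x 2."""
    return 2 * hop_count(src, dst)


def smart_ideal_latency(src, dst, hpc_straight=11, hpc_turn=9):
    """Idealized traversal cycles if up to `hpc_*` hops fit in one cycle.

    Assumption (hypothetical model): a path with a turn is limited to
    `hpc_turn` hops per cycle over its whole length; setup cycles and
    contention are ignored.
    """
    hops = hop_count(src, dst)
    if hops == 0:
        return 0
    is_straight = src[0] == dst[0] or src[1] == dst[1]
    hops_per_cycle = hpc_straight if is_straight else hpc_turn
    return ceil(hops / hops_per_cycle)


if __name__ == "__main__":
    # Corner-to-corner and edge-to-edge traversals on an 8x8 mesh.
    for src, dst in [((0, 0), (7, 7)), ((0, 0), (7, 0))]:
        print(src, dst, baseline_latency(src, dst),
              smart_ideal_latency(src, dst))
```

For a corner-to-corner traversal of an 8x8 mesh (14 hops with a turn), the baseline takes 28 cycles, while this idealized bound is 2 cycles; a 7-hop straight path takes 14 cycles versus 1.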