SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In-Network Ordering
Bhavya K. Daya, Chia-Hsin Owen Chen, Suvinay Subramanian, Woo-Cheol Kwon, Sunghyun Park, Tushar Krishna, Jim Holt, Anantha P. Chandrakasan, and Li-Shiuan Peh
In the many-core era, scalable coherence and on-chip interconnects
are crucial for shared memory processors. While
snoopy coherence is common in small multicore systems,
directory-based coherence is the de facto choice for scalability
to many cores, as snoopy coherence relies on ordered interconnects,
which do not scale. However, directory-based coherence does
not scale beyond tens of cores due to excessive directory area
overhead or inaccurate sharer tracking. Prior techniques
supporting ordering on arbitrary unordered networks are impractical
for full multicore chip designs.
We present SCORPIO, an ordered mesh Network-on-Chip
(NoC) architecture with a separate fixed-latency, bufferless network
to achieve distributed global ordering. Message delivery
is decoupled from the ordering, allowing messages to arrive
in any order and at any time, and still be correctly ordered.
The architecture is designed to plug and play with existing
multicore IP, with practicality, timing, area, and power
as top concerns. Full-system 36- and 64-core simulations on
SPLASH-2 and PARSEC benchmarks show an average application
runtime reduction of 24.1% and 12.9%, in comparison
to distributed directory and AMD HyperTransport coherence
protocols, respectively.
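
The decoupling of delivery from ordering can be illustrated with a minimal sketch. It is not the SCORPIO hardware design: it assumes that every node learns, per time window, the same set of sending sources (the role played here by the fixed-latency network) and that all nodes rank those sources identically. The names OrderingNode, notify, and receive, and the ascending-source-id priority rule, are hypothetical and chosen only for illustration.

```python
from collections import deque


class OrderingNode:
    """One receiver that consumes messages in a globally agreed order."""

    def __init__(self):
        self.expected = deque()  # (window, source) keys in the agreed order
        self.arrived = {}        # (window, source) -> payload, any arrival order
        self.delivered = []      # payloads in the order they were consumed

    def notify(self, window, sources):
        # Assumption: notifications reach every node in window order and
        # carry the same source set; sorting gives each node the same rank.
        self.expected.extend((window, s) for s in sorted(sources))
        self._drain()

    def receive(self, window, source, payload):
        # Payloads may show up at any time, before or after the notification.
        self.arrived[(window, source)] = payload
        self._drain()

    def _drain(self):
        # Consume only while the next message in the agreed order is present.
        while self.expected and self.expected[0] in self.arrived:
            key = self.expected.popleft()
            self.delivered.append(self.arrived.pop(key))


def demo():
    a, b = OrderingNode(), OrderingNode()
    for node in (a, b):
        node.notify(0, {3, 1})
        node.notify(1, {2})
    # The two nodes see the payloads in different arrival orders.
    a.receive(1, 2, "m2"); a.receive(0, 3, "m3"); a.receive(0, 1, "m1")
    b.receive(0, 1, "m1"); b.receive(1, 2, "m2"); b.receive(0, 3, "m3")
    return a.delivered, b.delivered
```

In this sketch both nodes consume the messages in the same sequence even though the payloads reach them in different orders, which is the property the ordered network is meant to provide.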
The SCORPIO architecture is incorporated in an 11 mm-by-13 mm
chip prototype, fabricated in IBM 45 nm SOI technology,
comprising 36 Freescale e200 Power Architecture™ cores with
private L1 and L2 caches interfacing with the NoC via ARM
AMBA, along with two Cadence on-chip DDR2 controllers.
The chip prototype achieves a post-synthesis operating frequency
of 1 GHz (833 MHz post-layout) with an estimated
power of 28.8 W (768 mW per tile), while the network consumes
only 10% of tile area and 19% of tile power.