#### SCORPIO: 36-Core Shared Memory Processor Demonstrating Snoopy Coherence on a Mesh Interconnect

#### **Chia-Hsin Owen Chen**

Collaborators: Sunghyun Park, Suvinay Subramanian, Tushar Krishna, Bhavya Daya, Woo Cheol Kwon, Brett Wilkerson, John Arends, Anantha Chandrakasan, Li-Shiuan Peh

Contributions:

- Core integration (Bhavya and Owen)
- Cache coherence protocol design (Bhavya and Woo Cheol)
- L2 cache controller implementation (Bhavya)
- Memory interface controller implementation (Owen)
- High-level idea of notification network (Woo Cheol)
- Network architecture (Woo Cheol, Bhavya, Owen, Tushar, Suvinay)
- Network implementation (Suvinay)



- DDR2 and PHY integration (Sunghyun and Owen)
- Backend of entire chip (Owen)
- FPGA interfaces, on-chip testers and scan chains (Tushar)
- RTL functional simulations (Bhavya, Owen, Suvinay)
- Full-system GEMS simulations (Woo Cheol)
- Board design (Sunghyun)
- Software stack (Bhavya and Owen)
- Package design (Freescale)



IBM 45nm SOI, 143mm<sup>2</sup>, 600M transistors



#### 36 cores with total 4.5MB L2



6×6 mesh on-chip network supporting snoopy coherence



Dual channel DDR2 memory controller



#### **Tile Architecture**



#### Core

- Freescale e200 z760n3
- In-order
- Dual-issue

#### **Private L1 cache**

- Split 16KB for Inst / Data
- 4-way set associative

#### **Private L2 cache**

- 128KB
- 4-way set associative
- Inclusive
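
As a consistency check, the per-tile L2 size accounts for the chip-wide L2 total of 4.5MB across 36 cores:

```latex
36 \times 128\,\mathrm{KB} = 4608\,\mathrm{KB} = 4.5\,\mathrm{MB}
```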



### **Snoopy Coherence**







#### 6×6 mesh interconnect

- 137b wide data-path
- One network node / tile









**Problem:** Broadcast messages are delivered to different nodes in different orders on unordered networks.

**We want:** Every node to see all messages in the same global order.

**Solution:** Decouple message delivery from ordering.
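
The idea of decoupling delivery from ordering can be illustrated with a minimal Python sketch. It assumes that every node learns which tiles sent a request in a given window (a 36-bit vector) and that all nodes apply the same deterministic rule to that vector. The rule used here (ascending source ID), the request strings, and the function names are assumptions for illustration, not details taken from the slides.

```python
import random

NUM_TILES = 36  # 6x6 mesh, one bit per tile

def global_order(notification_bits):
    """Ordering rule applied identically at every node.
    Ascending source ID is an assumption made for this sketch."""
    return [src for src, bit in enumerate(notification_bits) if bit]

def consume(arrivals, notification_bits):
    """Buffer requests as they arrive, then process them in the global order."""
    buffered = dict(arrivals)  # source tile -> request
    return [buffered[src] for src in global_order(notification_bits)]

# Hypothetical requests from three tiles within one ordering window.
requests = {3: "GETX A", 17: "GETS B", 30: "GETS A"}
bits = [1 if tile in requests else 0 for tile in range(NUM_TILES)]

# Two nodes receive the same broadcasts in different (random) orders.
node_a = random.sample(list(requests.items()), k=len(requests))
node_b = random.sample(list(requests.items()), k=len(requests))

# Both nodes process the requests in the same order regardless of arrival order.
print(consume(node_a, bits) == consume(node_b, bits))
```

The arrival order only affects buffering; the processing order depends solely on the shared bit vector, which is why an unordered data network suffices.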





#### **Notification Network**



#### Bounded latency (≤ 12 cycles)

- Non-blocking
- 1 cycle / hop broadcast mesh
- Dedicated 1 bit / tile

#### Low cost

- Only DFFs + OR gates
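
A rough way to see why the latency is bounded is to simulate a one-bit-per-tile broadcast on a 6×6 mesh. The sketch below assumes each node registers a 36-bit vector and, every cycle, ORs it with the vectors of its mesh neighbours. This flooding scheme and all names in the code are assumptions for illustration; the slides only state 1 cycle per hop, 1 bit per tile, and DFF + OR logic.

```python
from itertools import product

N = 6  # mesh dimension (6x6 tiles)

def neighbors(x, y):
    """Yield the mesh neighbours of tile (x, y)."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < N and 0 <= ny < N:
            yield nx, ny

def broadcast_cycles(senders):
    """Cycles until every tile holds the OR of all senders' bits."""
    state = {tile: 0 for tile in product(range(N), repeat=2)}
    target = 0
    for x, y in senders:
        bit = 1 << (y * N + x)   # dedicated bit for this tile
        state[(x, y)] |= bit
        target |= bit
    cycles = 0
    while any(vec != target for vec in state.values()):
        nxt = {}
        for tile, vec in state.items():
            merged = vec
            for nb in neighbors(*tile):
                merged |= state[nb]   # OR gate on each incoming link
            nxt[tile] = merged        # registered once per cycle (DFF)
        state = nxt
        cycles += 1
    return cycles

worst = broadcast_cycles([(0, 0)])   # corner sender: farthest tile is (5, 5)
print(worst, worst <= 12)
```

Because merging is a plain OR, simultaneous senders never contend for a link, so under these assumptions the propagation time depends only on hop distance and stays within the 12-cycle bound.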



























### **Synchronization Primitives**



#### lwarx, stwcx

- Link at L2 cache-line granularity
- Detect modifications after load-link using coherence protocol
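
The two bullets above can be sketched as a small Python model of load-linked / store-conditional with line-granularity reservations. The class name, the 32-byte line size, and the way a store clears other cores' reservations are assumptions for illustration; on the chip this detection is done by the coherence protocol.

```python
LINE_BYTES = 32  # assumed line size; not stated in the slides

class L2Model:
    """Toy model: one reservation per core, tracked per cache line."""

    def __init__(self):
        self.mem = {}          # address -> value
        self.reservation = {}  # core id -> reserved line number

    def _line(self, addr):
        return addr // LINE_BYTES

    def lwarx(self, core, addr):
        """Load and place a reservation on the whole line."""
        self.reservation[core] = self._line(addr)
        return self.mem.get(addr, 0)

    def store(self, core, addr, value):
        """Plain store; stands in for a coherence write that other cores observe."""
        self.mem[addr] = value
        line = self._line(addr)
        for other in [c for c, l in self.reservation.items()
                      if l == line and c != core]:
            del self.reservation[other]  # reservation lost: line was modified

    def stwcx(self, core, addr, value):
        """Store only if the reservation on this line is still held."""
        if self.reservation.get(core) != self._line(addr):
            return False
        del self.reservation[core]
        self.store(core, addr, value)
        return True

l2 = L2Model()
l2.lwarx(0, 0x00)               # core 0 links address 0x00
l2.store(1, 0x04, 7)            # core 1 writes a different word in the same line
print(l2.stwcx(0, 0x00, 1))     # fails: the linked line was modified
```

Tracking the link per line keeps the hardware simple, at the cost of spurious store-conditional failures when unrelated words in the same line are written.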



#### msync

- Broadcast sync requests
- Gather acks from all cores when they complete the sync request

### **Evaluation Setup**

| Parameter    | Value                                          |
|--------------|------------------------------------------------|
| Simulator    | GEMS + GARNET                                  |
| Access times | L1: 1 cycle; L2: 10 cycles; DRAM: 90 cycles    |
| LPD          | Limited Pointer Directory coherence            |
| HT           | AMD HyperTransport coherence                   |
| SCORPIO      | Snoopy coherence: MOSI                         |

|                      | LPD         | HT                   | SCORPIO              | Isolates               |
|----------------------|-------------|----------------------|----------------------|------------------------|
| What is tracked?     | Few sharers | Presence of<br>owner | Presence of<br>owner | Storage<br>overhead    |
| Who orders requests? | Directory   | Directory            | Network              | Indirection<br>latency |

#### **Runtime Comparison**



[Figure: runtime comparison of LPD, HT, and SCORPIO]

→ 24% better than Limited Pointer Directory

→ 13% better than HyperTransport

#### **Requests served by other caches**

[Figure: latency breakdown in cycles (0 to 120) for LPD, HT, and SCORPIO on barnes, fft, lu, blackscholes, canneal, fluidanimate, and the average. Legend: Network: Req to Dir; Dir Access; Network: Dir to Sharer; Network: Bcast Req; Req Ordering; Sharer Access; Network: Resp]

→ 19% lower than LPD
→ 18% lower than HT

#### 90% of requests served by other caches

#### **Requests served by directory -- MC**

→ 7.5% lower than LPD
→ 4.2% higher than HT

Owen Chen / MIT


#### **Network Cost**





### Contributions

- SCORPIO: a 36-core shared-memory processor demonstrating snoopy coherence on a mesh interconnect
  - Runtime: 24% better than LPD, 13% better than HT
  - Cost: 28.8W @ 833MHz
- Novel network-on-chip for scalable snoopy coherence. New ideas:
  - Distributed in-network ordering mechanism
  - Decoupling of message delivery from message ordering

### **Ongoing Work**



#### Software stack development

- Boot Linux
- Run PARSEC, SPLASH, etc.

#### Chip measurement

- Power, timing
- Performance





