# SCALING HARDWARE AND SOFTWARE FOR THOUSAND-CORE SYSTEMS

Daniel Sanchez

Electrical Engineering, Stanford University

## **Multicore Scalability**



- Multicore is key to future of computing
- Scaling performance is hard, even with a lot of parallelism

## Memory is Critical

- Memory limits performance and energy efficiency
- Basic indicators:
  - 64-bit FP op: ~1ns latency, ~20pJ energy
  - Shared cache access: ~10ns latency, ~1nJ energy
  - DRAM access: ~100ns latency, ~20nJ energy

#### HW & SW must optimize memory performance
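To make the indicators above concrete, the following sketch computes how many floating-point operations one memory access costs, and the average energy per operation for a given access mix. The costs are the approximate figures listed above; the access rates in the usage note are made-up illustrative values, not measurements from the talk.

```python
# Approximate costs from the indicators above (energy in picojoules).
FP_OP = {"latency_ns": 1.0, "energy_pj": 20.0}
SHARED_CACHE = {"latency_ns": 10.0, "energy_pj": 1_000.0}  # ~1 nJ
DRAM = {"latency_ns": 100.0, "energy_pj": 20_000.0}        # ~20 nJ


def cost_ratio(level, metric):
    """Number of FP operations that one access to `level` costs in `metric`."""
    return level[metric] / FP_OP[metric]


def avg_energy_per_op_pj(cache_accesses_per_op, dram_accesses_per_op):
    """Average energy per FP operation, including its share of memory accesses.

    Both rates are hypothetical inputs (accesses per FP operation).
    """
    return (FP_OP["energy_pj"]
            + cache_accesses_per_op * SHARED_CACHE["energy_pj"]
            + dram_accesses_per_op * DRAM["energy_pj"])
```

For example, `cost_ratio(DRAM, "energy_pj")` shows a DRAM access costing about as much energy as 1000 FP operations, and with an assumed 0.1 shared-cache accesses and 0.01 DRAM accesses per operation, `avg_energy_per_op_pj(0.1, 0.01)` is dominated by memory rather than by the arithmetic itself.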

## **Multicore Memory Hierarchy**



- Per-core private caches
  - Fast access to critical working set
  - Should satisfy most accesses
- Shared last-level cache
  - Increases utilization
  - Accelerates communication
  - Can be partitioned for isolation
- Coherence protocol
  - Makes caches transparent to SW
  - Uses directory to track sharers



#### Cache hierarchy is hard to scale




1. Directories scale poorly
2. Conflicts in caches & directory are more frequent
3. Shared cache cannot be partitioned efficiently
4. No isolation or QoS due to shared cache and directory

## Scaling Parallel Runtimes

- Parallel runtime maps application to hardware
  - Resource management
  - Scheduling
- Runtime is fundamental to scale with manageable complexity



## Scheduling Parallel Applications



#### Application $\rightarrow$ Parallel tasks

- Different requirements
- May have dependences

- Scheduler assigns tasks to cores



#### Scheduling problems:

- Constrained parallelism
  - Coarser tasks
  - Unneeded serialization
- Increased cache misses
- Load imbalance
- Scheduling overheads
- Excessive memory footprint (crash!)

## Contributions



- Scalable cache hierarchies:
  - Efficient highly-associative caches [MICRO 10]
  - Scalable cache partitioning [ISCA 11, Top Picks 12]
  - Scalable coherence directories [HPCA 12]

- Scalable scheduling:
  - Efficient dynamic scheduling by leveraging programming model information [PACT 11]
  - Hardware-accelerated scheduling [ASPLOS 10]

### This Talk

- Scalable cache hierarchies:
  - Efficient highly-associative caches [MICRO 10]
  - Scalable cache partitioning [ISCA 11, Top Picks 12]
  - Scalable coherence directories [HPCA 12]
- Scalable scheduling:
  - Efficient dynamic scheduling by leveraging programming model information [PACT 11]
  - Hardware-accelerated scheduling [ASPLOS 10]

## Rethinking Common-Case Design

- Conventional approach: Make the common case fast
  - Based on patterns of past and current workloads
  - Overprovision to mitigate worst case or for future workloads

- Multicore demands going beyond the common case

  - Overprovisioning alone is insufficient and wasteful
    - Some overprovisioning simplifies design
    - Must provide guarantees with minimal overprovisioning

## Solution: Analytical Design Approach

- Design basic components that are easily analyzable
  - Simple, accurate, workload-independent analytical models
  - Easy to understand, reason about behavior
- Use models to design systems that work well in all cases
  - Scalability and QoS guaranteed in all scenarios
  - Outperform conventional techniques in the common case
- Need to revisit fundamental aspects of our systems (associativity, coherence, ...)

### **Set-Associative Caches**

#### Basic building block of caches, directories




#### Problems:

- Reducing conflicts (higher associativity) $\rightarrow$ more ways
  - Higher energy, latency, area
- Conflicts depend on workload's access patterns

### ZCache


#### One hash function per way



- Hits require a single lookup $\rightarrow$ low hit energy and latency
- Misses exploit the multiple hash functions to obtain an arbitrarily large number of replacement candidates
  - Multi-step process, draws on prior research on Cuckoo hashing
  - Happens infrequently (on misses) and off the critical path







| Index | Way 1 | Way 2 | Way 3 |
|-------|-------|-------|-------|
| 0     | U     | V     | M     |
| 1     | F     | C     | X     |
| 2     | P     | K     | H     |
| 3     | B     | E     | R     |
| 4     | N     | D     | J     |
| 5     | A     | Z     | Q     |
| 6     | G     | T     | I     |
| 7     | L     | O     | S     |



- Instead of evicting A, can move it and evict K or X
- Similarly, can move K or X $\rightarrow$ more candidates









| Index | Way 1 | Way 2 | Way 3 |
|-------|-------|-------|-------|
| 0     | U     | V     | M     |
| 1     | F     | C     | A     |
| 2     | P     | K     | H     |
| 3     | B     | E     | R     |
| 4     | X     | D     | J     |
| 5     | Y     | Z     | Q     |
| 6     | G     | T     | I     |
| 7     | L     | O     | S     |

Hits always take a single lookup




Replacements do not affect hit latency, are simple to implement
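The replacement process described above can be sketched in software as a breadth-first walk over candidate slots followed by a chain of relocations. This is a toy model under stated assumptions, not the hardware design: the class name `ZCacheSketch`, the multiplicative per-way hash in `index`, the LRU victim choice, and the default parameters are all hypothetical choices made for illustration.

```python
from collections import deque


class ZCacheSketch:
    """Toy zcache: one hash function per way, multi-level replacement walk."""

    def __init__(self, num_ways=3, lines_per_way=8, max_candidates=16):
        self.num_ways = num_ways
        self.lines_per_way = lines_per_way
        self.max_candidates = max_candidates
        self.ways = [[None] * lines_per_way for _ in range(num_ways)]
        self.last_use = {}  # resident address -> time of last access (LRU)
        self.clock = 0

    def index(self, way, addr):
        # Hypothetical per-way hash (multiplicative mixing); an assumption.
        mixed = (addr * (2 * way + 3) * 2654435761 + way) & 0xFFFFFFFF
        return (mixed >> 16) % self.lines_per_way

    def access(self, addr):
        """Return (hit, evicted_address_or_None)."""
        self.clock += 1
        # A hit checks exactly one slot per way: a single lookup.
        hit = any(self.ways[w][self.index(w, addr)] == addr
                  for w in range(self.num_ways))
        evicted = None if hit else self._replace(addr)
        self.last_use[addr] = self.clock
        return hit, evicted

    def _replace(self, addr):
        # Candidate = (way, index, parent candidate id). First level: the
        # slots the incoming address hashes to, one per way.
        cands, seen, queue = [], set(), deque()

        def add(way, idx, parent):
            if (way, idx) not in seen and len(cands) < self.max_candidates:
                seen.add((way, idx))
                cands.append((way, idx, parent))
                queue.append(len(cands) - 1)

        for w in range(self.num_ways):
            add(w, self.index(w, addr), None)
        # Breadth-first expansion: a resident line could move to the slot it
        # hashes to in any other way, so those slots become candidates too.
        while queue:
            cid = queue.popleft()
            w, i, _ = cands[cid]
            line = self.ways[w][i]
            if line is None:
                continue
            for w2 in range(self.num_ways):
                if w2 != w:
                    add(w2, self.index(w2, line), cid)

        # Pick the victim: an empty slot if one exists, else the LRU candidate.
        def age(cid):
            line = self.ways[cands[cid][0]][cands[cid][1]]
            return -1 if line is None else self.last_use[line]

        w, i, parent = cands[min(range(len(cands)), key=age)]
        evicted = self.ways[w][i]
        self.last_use.pop(evicted, None)
        # Relocate lines along the path from the victim back to the first level.
        while parent is not None:
            pw, pi, pparent = cands[parent]
            self.ways[w][i] = self.ways[pw][pi]
            w, i, parent = pw, pi, pparent
        self.ways[w][i] = addr
        return evicted
```

Note that `access` never walks on a hit, which mirrors the point that replacements stay off the hit path; the walk depth is bounded only by `max_candidates`, the sketch's stand-in for the number of replacement candidates.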

## Methodology

- zsim: A fast, 1000-core, microarchitectural x86 simulator
  - Fast: Parallel, leverages dynamic binary translation (Pin)
    - 15-60 M instructions/s per host core, 600 M instructions/s on 12-core Xeon
  - Scalable: Phase-based sync, simulates thousands of cores
  - Validated: Within 10% of Atom and Nehalem systems
  - Simple: ~20 KLoC, used in research and courses at Stanford


Integrate zsim with existing area, energy, and latency models (McPAT, CACTI)

## **ZCache Benefits**

8MB shared LLC optimized for area · latency · energy, 32nm:




ZCache = Scalable associativity at low cost

- Cost of 4-way cache
- Associativity > 64-way cache

## ZCache Associativity

- ZCache associativity depends only on the number of replacement candidates (R)
  - Independent of ways, workload, and replacement policy
- Problems in defining associativity: Cache array + replacement policy
- Insight 1: With ZCache, replacement candidates are very close to uniformly distributed over the array
- Insight 2: All policies do the same thing: rank cache lines
  - Eviction priority: Rank of a line normalized to [0,1]
  - Example: With LRU policy, LRU line has 1.0 priority, MRU has 0.0

- Associativity: Probability distribution of eviction priorities of evicted lines
- ZCache associativity depends only on the number of replacement candidates (R):

$$F_A(x) = \Pr(A \le x) = x^R, x \in [0,1]$$
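The distribution above follows if each of the R candidates has an eviction priority drawn independently and uniformly from [0,1] and the policy evicts the candidate with the highest priority: the evicted priority is at most x only when all R candidates are, which happens with probability x^R. The sketch below checks this numerically under exactly those assumptions; the function names are illustrative.

```python
import random


def evicted_priorities(num_candidates, num_evictions, seed=0):
    """Simulate evictions: draw R uniform priorities, evict the maximum."""
    rng = random.Random(seed)
    return [max(rng.random() for _ in range(num_candidates))
            for _ in range(num_evictions)]


def empirical_cdf(samples, x):
    """Fraction of evicted lines whose eviction priority is <= x."""
    return sum(s <= x for s in samples) / len(samples)


def model_cdf(x, num_candidates):
    """Analytical associativity distribution F_A(x) = x^R."""
    return x ** num_candidates
```

Comparing `empirical_cdf(samples, x)` with `model_cdf(x, R)` for a few values of x shows the two agree closely, and that larger R pushes evictions toward priority 1.0, that is, toward the lines the policy most wants to evict.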



#### ZCache Analytical Models

#### Analytical models are accurate in practice:



14 workloads, 1024 cores


## **Cache Partitioning**




Cache partitioning techniques divide cache space explicitly

- Isolation: Virtualize cache among applications, VMs
- Efficiency: Improve performance, fairness
- Configurability: SW-controlled buffers (performance, security)

## Cache Partitioning Techniques

- Strict partitioning schemes: Based on restricting line placement
  - Way partitioning: Restrict insertions to specific ways
  - Strict, but supports few partitions and degrades associativity

(Figure: 8-way cache array, Way 1 through Way 8.)
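Way partitioning as described above can be sketched for a single cache set: hits may be served from any way, but a miss may only replace a line in the ways assigned to the requesting partition. The class name `WayPartitionedSet`, the dictionary-based way assignment, and the LRU victim choice are assumptions made for illustration.

```python
class WayPartitionedSet:
    """One set of a set-associative cache with way partitioning (toy model)."""

    def __init__(self, num_ways, ways_of_partition):
        self.lines = [None] * num_ways      # address stored in each way
        self.last_use = [0] * num_ways      # time of last access per way
        self.ways_of_partition = ways_of_partition  # partition -> list of ways
        self.clock = 0

    def access(self, addr, partition):
        """Return True on a hit, False on a miss (the line is then inserted)."""
        self.clock += 1
        if addr in self.lines:
            self.last_use[self.lines.index(addr)] = self.clock
            return True
        # Insertions are restricted to the partition's own ways, so a
        # partition can never evict another partition's lines.
        allowed = self.ways_of_partition[partition]
        victim = min(allowed, key=lambda way: self.last_use[way])
        self.lines[victim] = addr
        self.last_use[victim] = self.clock
        return False
```

With `WayPartitionedSet(8, {"A": [0, 1], "B": [2, 3, 4, 5, 6, 7]})`, partition A behaves like a 2-way cache no matter how large the array is, and the number of partitions cannot exceed the number of ways, which illustrates both drawbacks named above.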


- Soft partitioning schemes: Based on tweaking the replacement policy
  - PIPP: Insert and promote lines in LRU chain depending on their partition
  - Simple, but approximate partitioning and degrades replacement performance



#### Cache Partitioning with Vantage

- Previous partitioning techniques have major drawbacks
  - Not scalable, support few partitions
  - Degrade performance
- Vantage solves deficiencies of previous techniques
  - Scalable: Supports hundreds of fine-grain partitions
  - Maintains high associativity and strict isolation among partitions (QoS)
- Vantage partitions most of the cache logically by modifying the replacement process
  - No restrictions on line placement



## Vantage Design

Vantage partitions the managed region

- Incoming lines (misses) inserted in partition
- Each partition demotes least wanted lines to unmanaged region


Evict only from unmanaged region  $\rightarrow$  no interference



## **Controlling Demotions**



- Always demoting from inserting partition does not scale with number of partitions
- Instead, maintain sizes by matching demotion rate to miss rate

#### **Demoting with Apertures**

#### Aperture: Portion of candidates to demote from each partition



#### Managing Apertures

Partition apertures can be derived analytically:

$$A_i = \frac{M_i}{\sum_{k=1}^{P} M_k} \frac{\sum_{k=1}^{P} S_k}{S_i} \frac{1}{R \cdot m}$$


Intuition: Aperture ~ miss rate (M<sub>i</sub>) / size (S<sub>i</sub>)

- Apertures are also capped to A<sub>max</sub>
- Higher aperture ↔ lower partition associativity
- A<sub>max</sub> ensures high minimum associativity
  - e.g., A<sub>max</sub> = 40% ~ R = 16 associativity
- We just let partitions that need A<sub>i</sub> > A<sub>max</sub> grow
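The aperture formula above can be evaluated directly. In the sketch below, `managed_fraction` stands for the symbol m, taken here to be the fraction of the cache in the managed region; the slides do not define m, so that reading is an assumption, as are the function and parameter names.

```python
def apertures(miss_rates, sizes, candidates, managed_fraction, a_max=None):
    """Evaluate A_i = (M_i / sum M) * (sum S / S_i) / (R * m) per partition.

    miss_rates: M_i for each partition; sizes: S_i for each partition.
    candidates: R, the number of replacement candidates.
    managed_fraction: m (assumed meaning: managed share of the cache).
    a_max: optional cap applied to every aperture.
    """
    total_misses = sum(miss_rates)
    total_size = sum(sizes)
    result = []
    for m_i, s_i in zip(miss_rates, sizes):
        a_i = (m_i / total_misses) * (total_size / s_i)
        a_i /= candidates * managed_fraction
        if a_max is not None:
            a_i = min(a_i, a_max)
        result.append(a_i)
    return result
```

Two equally sized partitions with equal miss rates get the same aperture, 1/(R·m); a partition with three times the misses of its equally sized neighbor gets three times the aperture. A small partition with most of the misses would need a very large aperture, which is the case the A<sub>max</sub> cap handles by letting the partition grow instead.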

#### Bounds on Size and Interference

The worst-case total growth of all partitions over their target sizes is bounded and small:


$$\Delta = \frac{1}{A_{\max}} \frac{1}{R}$$

Intuition: A $\Delta$-sized partition is always stable, and multiple unstable partitions help each other demote

Independent of the number of partitions!

- Assign an extra $\Delta$ to unmanaged region
- With R=52 and A<sub>max</sub>=0.4, $\Delta$ ≈ 5% of the cache
- Bounded worst-case sizes & interference
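As a worked check of the bound, substituting the quoted values R = 52 and A<sub>max</sub> = 0.4 gives:

$$\Delta = \frac{1}{A_{\max}} \frac{1}{R} = \frac{1}{0.4 \times 52} = \frac{1}{20.8} \approx 0.048$$

which rounds to the 5% figure. For comparison, a purely illustrative configuration with only R = 16 candidates and the same A<sub>max</sub> would give $\Delta = 1/6.4 \approx 0.156$, so the bound tightens as the number of replacement candidates grows.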

## A Simple Vantage Controller

- Use negative feedback loop to derive apertures
- Use timestamps to determine lines within aperture
- Practical implementation that maintains analytical guarantees
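One way to picture the negative feedback loop is to derive each partition's aperture from how far its actual size exceeds its target size, so that an oversized partition demotes more and shrinks back. The linear ramp, the `slack` parameter, and the function name below are assumptions chosen for illustration; the slides do not specify the controller's exact rule.

```python
def feedback_aperture(actual_size, target_size, a_max=0.4, slack=0.1):
    """Aperture as a function of how much a partition exceeds its target.

    Hypothetical rule: no demotions at or below the target size, then a
    linear ramp that reaches a_max once the partition is `slack` (here 10%)
    above its target, and stays capped at a_max beyond that.
    """
    if actual_size <= target_size:
        return 0.0
    excess = (actual_size - target_size) / (slack * target_size)
    return min(a_max, a_max * excess)
```

Under this rule a partition at its target demotes nothing, one that is 5% over target uses half of `a_max`, and one far over target is held at `a_max`. The feedback is negative because growth raises the aperture, which raises the demotion rate, which in turn reduces the size.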



#### Vantage Evaluation



350 mixes on a 32-core CMP with a shared LLC (32 partitions)

- Partitions sized to maximize throughput (utility-based partitioning)
- Each line shows throughput vs unpartitioned 64-way baseline
- Way-partitioning, PIPP degrade throughput for most workloads


- Vantage improves throughput for most workloads using a 4-way/52-candidate ZCache
- Other schemes cannot scale beyond a few cores

## **Scaling Directories**




#### Scaling directories is hard:

- Excessive latency, energy, area overheads, or too complex
- Introduce invalidations $\rightarrow$ Interference

#### Scalable Coherence Directory

#### Insights:


- Use ZCache  $\rightarrow$  Efficient high associativity, analytical models
  - Negligible invalidations with minimal overprovisioning (~10%)

- SCD achieves scalability and performance guarantees
  - Area, energy grow with log(cores), constant latency
  - Simple: No modifications to coherence protocol
  - At 1024 cores, SCD is 13x smaller than a sparse directory, and 2x smaller, faster, and simpler than a hierarchical directory

## Scalable Scheduling



#### Scheduling requirements:

- Expose enough parallelism
- Locality-aware
- Load balancing
- Low overheads
- Bounded memory footprint



- Dynamic vs static schedulers:
  - Dynamic: Poor locality, footprint not bounded if non-trivial dependences
  - Static: Great compile-time schedules, but no load-balancing, only regular apps

## Insight: Leverage Programming Model

Solution: Dynamic fine-grain scheduling techniques that leverage programming model information to satisfy requirements

- Expose all parallelism through fine-grain tasks
- Locality-aware task queuing and load-balancing
- Bounded footprint
- Make dynamic scheduling practical in rich programming models (StreamIt, GRAMPS, Delite)
- Significant improvements over state-of-the-art schedulers on existing 12-core, 24-thread Xeon SMP:
  - Up to 17x over dynamic (more parallelism, locality-aware, footprint)
  - Up to 5.3x over static (no load imbalance)
- Scheduler choice becomes more critical as we scale up!

#### Hardware-Accelerated Schedulers

- Fine-grain scheduling with 100+ threads is slow in software
  - Hardware schedulers (e.g., GPUs): Fast but inflexible

- Insight: Software schedulers dominated by communication
- Solution: Accelerate communication with simple hardware
  - ADM: Asynchronous, register-register messages between threads
    - Small and scalable costs (~1KB buffers per core), virtualizable
  - ADM-accelerated fine-grain schedulers:
    - Achieve speed and scalability of HW + flexibility of SW
    - At 512 threads, 6.4x faster than SW and 70% faster than HW
  - ADM can accelerate other primitives (e.g., barriers, IPC)


#### Conclusions

Scaling to 1000 cores requires HW and SW techniques:

- Scale hardware with highly efficient caches with scalable partitioning and coherence
- Scale software with dynamic, fine-grain, HW-accelerated scheduling

#### Acknowledgements

- Christos

- Research group: Jacob, David, Richard, Christina, Woongki, Austen, Mike, Hari
- PPL faculty: Kunle, Bill, Mark, Pat, Mendel, John, Alex
- PPL students: George, Jeremy, ...
- Defense committee: Bill, Kunle, Nick
- Family & friends
  - Borja, Gemma, Idoia, Carlos, Felix, Dani, Manuel, Gonzalo, Adrian, Christina, George, Yiannis, Sotiria, Alexandros, Nadine, Martin, Elliot, Nick, Steph, Olivier, Leen, John, Sam, Mario, Nicole, Cristina, Kshipra, Robert, Erik, ...

# THANK YOU FOR YOUR ATTENTION. QUESTIONS?