Fiscal Year 2018



Course number: CSC.T433 School of Computing, Graduate major in Computer Science

# Advanced Computer Architecture

3. Memory Hierarchy Design

www.arch.cs.titech.ac.jp/lecture/ACA/ Room No.W936 Mon 13:20-14:50, Thu 13:20-14:50

Kenji Kise, Department of Computer Science kise \_at\_ c.titech.ac.jp

#### Datapath of processor supporting ADD and ADDI

[Figure: single-cycle datapath with PC, instruction memory, register file, sign extend (16 to 32 bits), and ALU; control signals PCSrc, RegWrite, RegDst, ALUSrc, MemRead. I format fields: op, rs = IR[25:21], rt = IR[20:16], 16 bit immediate. Example: addi \$t1, \$t0, 3 [addi \$9, \$8, 3] at PC 0x804, with \$8 = 7.]

### Machine Language - Load Instruction

• Load/Store Instruction Format (I format):
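The I format packs a 6-bit opcode, two 5-bit register numbers (rs, rt), and a 16-bit immediate into one 32-bit word. The following is a minimal Python sketch of that packing, not part of the course material; the helper name `encode_i_format` and the `OPCODES` table are hypothetical, while the opcode values are the standard MIPS ones.

```python
# Sketch of MIPS I-format encoding; encode_i_format is a hypothetical helper.
OPCODES = {"addi": 0x08, "lw": 0x23, "sw": 0x2B}  # standard MIPS opcodes


def encode_i_format(mnemonic, rs, rt, imm):
    """Pack op (6 bits), rs (5), rt (5), and a 16-bit immediate into a word."""
    op = OPCODES[mnemonic]
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# lw $t2, 4($t0): base register rs = $8, destination rt = $10, offset 4
word = encode_i_format("lw", rs=8, rt=10, imm=4)
print(f"{word:08x}")  # prints 8d0a0004
```

The same helper gives 0x20080003 for addi \$t0, \$zero, 3, which corresponds to the Verilog concatenation {6'h8, 5'd0, 5'd8, 16'd3}.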





#### Datapath of processor supporting ADD, ADDI, LW

[Figure: single-cycle datapath extended with a data memory (Address, Write data, Read data) and a MemtoReg multiplexer; control signals PCSrc, RegWrite, RegDst, ALUSrc, MemRead, MemtoReg. I format fields: op, rs = IR[25:21], rt = IR[20:16], 16 bit immediate. Example: lw \$t2, 4(\$t0) [ lw \$10, 4(\$8) ] at PC 0x808, with \$8 = 0x10 and mem[0x14] = 3.]

CSC.T433 Advanced Computer Architecture, Department of Computer Science, TOKYO TECH
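For the lw example in the datapath above, the ALU adds the base register to the sign-extended 16-bit offset, and the data memory returns the word at that address. The following Python sketch replays that step with the values given on the slide (\$8 = 0x10, mem[0x14] = 3); the function names and the dictionary-based register file and memory are assumptions made for illustration only.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def execute_lw(regs, mem, rs, rt, imm):
    """lw rt, imm(rs): compute the address with the ALU, then load into rt."""
    address = (regs[rs] + sign_extend16(imm)) & 0xFFFFFFFF
    regs[rt] = mem[address]
    return address


regs = {8: 0x10, 10: 0}   # $8 = 0x10 as on the slide
mem = {0x14: 3}           # mem[0x14] = 3 as on the slide
addr = execute_lw(regs, mem, rs=8, rt=10, imm=4)
print(hex(addr), regs[10])  # prints 0x14 3
```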

# A Typical Memory Hierarchy

By taking advantage of the principle of locality

- Present as much memory as is available in the cheapest technology
- at the speed offered by the fastest technology



TLB: Translation Lookaside Buffer

# MIPS Direct Mapped Cache Example

One word/block, cache size = 1K words (4KB)
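With one word per block and 1K blocks, a byte address splits into a 2-bit byte offset, a 10-bit index, and the remaining upper bits as the tag (20 bits for a 32-bit address). The following Python sketch shows that split; the 32-bit address width and the function name `split_address` are assumptions of this example.

```python
def split_address(addr, index_bits=10, byte_offset_bits=2):
    """Split a byte address into (tag, index, byte offset) for a
    direct mapped cache with one 4-byte word per block."""
    byte_offset = addr & ((1 << byte_offset_bits) - 1)
    index = (addr >> byte_offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (byte_offset_bits + index_bits)
    return tag, index, byte_offset


print(split_address(0x0014))  # prints (0, 5, 0)
print(split_address(0x1014))  # prints (1, 5, 0)
```

The two addresses map to the same index with different tags, so in a direct mapped cache they evict each other.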




What kind of locality are we taking advantage of?

### Multiword Block Direct Mapped Cache

• Four words/block, cache size = 1K words (4KB)
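The effect of the block size can be checked with a small miss counter. The following Python sketch models only the tag array of a direct mapped cache holding 1K words and compares one-word and four-word blocks on a sequential scan; the function name `count_misses` and the access pattern are assumptions chosen for illustration.

```python
def count_misses(word_addresses, words_per_block, total_words=1024):
    """Count misses in a direct mapped cache that holds total_words words."""
    num_blocks = total_words // words_per_block
    tags = [None] * num_blocks          # None marks an invalid entry
    misses = 0
    for word_addr in word_addresses:
        block_addr = word_addr // words_per_block
        index = block_addr % num_blocks
        tag = block_addr // num_blocks
        if tags[index] != tag:
            misses += 1
            tags[index] = tag           # fetch the whole block on a miss
    return misses


sequential = list(range(64))            # 64 consecutive word addresses
print(count_misses(sequential, 1))      # prints 64
print(count_misses(sequential, 4))      # prints 16
```

In this scan every word is touched once, so the one-word block misses on each access, while the four-word block misses only on the first word of each block and the other three accesses hit thanks to spatial locality.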



What kind of locality are we taking advantage of?

### Four-Way Set Associative Cache



# Cache Associativity & Replacement Policy



Bookshelf





### Costs of Set Associative Caches

- N-way set associative cache costs
  - N comparators (delay and area)
  - MUX delay (set selection) before data is available
  - Data available after set selection and Hit/Miss decision.
- When a miss occurs, which way's block do we pick for replacement?
  - Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time
    - Must have hardware to keep track of when each way's block was used
    - For 2-way set associative, takes one bit per set → set the bit when a block is referenced (and reset the other way's bit)
  - Random
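For the 2-way case, one bit per set is enough to implement LRU. The following Python sketch models a single set; it stores the bit as "the way to replace next", which is one possible encoding of the reference bit described above, and the class name `TwoWaySet` is hypothetical.

```python
class TwoWaySet:
    """One set of a 2-way set associative cache with a 1-bit LRU state."""

    def __init__(self):
        self.tags = [None, None]   # None marks an invalid way
        self.lru_way = 0           # the single LRU bit: way to replace next

    def access(self, tag):
        """Return True on a hit, False on a miss (the LRU way is refilled)."""
        if tag in self.tags:
            way = self.tags.index(tag)
            hit = True
        else:
            way = self.lru_way
            self.tags[way] = tag
            hit = False
        self.lru_way = 1 - way     # the other way is now least recently used
        return hit


cache_set = TwoWaySet()
results = [cache_set.access(tag) for tag in "ABACB"]
print(results)         # prints [False, False, True, False, False]
print(cache_set.tags)  # prints ['B', 'C']
```

The hit on A makes B the least recently used block, so C replaces B, and the following access to B then replaces A.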

# Recommended Reading

- Emulating Optimal Replacement with a Shepherd Cache
  - Kaushik Rajan, Govindarajan Ramaswamy, Indian Institute of Science
  - MICRO-40, pp. 445-454, 2007
  - Session 8: Cache Replacement Policies
- A quote:

"The inherent temporal locality in memory accesses is filtered out by the L1 cache. As a consequence, an L2 cache with LRU replacement incurs significantly higher misses than the optimal replacement policy (OPT). We propose to narrow this gap through a novel replacement strategy that mimics the replacement decisions of OPT."



# Memory Hierarchy Design

### Memory Hierarchy



### L2 and lower caches

- Objective : Need to reduce expensive memory accesses
- Design : Large size, Higher associativity, Complex design
- Problem : They do not interact with the program directly and observe only filtered temporal locality
- High Associativity  $\implies$  replacement policy crucial to performance
- The L1 cache services the accesses with temporal locality  $\implies$  LRU replacement at L2 is inefficient
- Replacement decisions are taken off the processor critical path

### LRU has room for improvement

#### LRU vs OPT



Emulating Optimal Replacement with a Shepherd Cache, MICRO-2007

## OPT: Optimal Replacement Policy

### The Optimal Replacement Policy

- Replacement Candidates : On a miss, a replacement policy can either replace one of the lines in the cache or choose not to place the miss-causing line in the cache at all.
- Self Replacement : The latter choice is referred to as a self-replacement or a cache bypass.

#### Optimal Replacement Policy

On a miss, replace the candidate to which an access is least imminent [Belady1966,Mattson1970,McFarling-thesis]

Lookahead Window : The window of accesses between the miss-causing access and the access to the least imminent replacement candidate. Single-pass simulation of OPT makes use of lookahead windows to identify replacement candidates and modify the current cache state [Sugumar-SIGMETRICS1993]



# Example of Optimal Replacement Policy

### Understanding OPT



- Consider a 4-way set associative cache with one set initially containing lines (A1,A2,A3,A4), and the access stream shown in the table
- Access A5 misses; the replacement decision proceeds as follows
  - Identify replacement candidates : (A1,A2,A3,A4,A5)
  - Look ahead and gather the imminence order : shown in the table, with the lookahead window circled
  - Make the replacement decision : A5 replaces A2
- A6 self-replaces; its lookahead window and imminence order are shown in the table
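The decision steps above can be sketched as a small offline simulation. The following Python code is a minimal model of OPT with self-replacement for one full set; the access stream is hypothetical (it is not the stream from the slide's table) and was chosen only so that A5 replaces A2 and A6 self-replaces, and the function names are assumptions.

```python
def next_use(stream, start, line):
    """Index of the next access to line at or after start, or infinity."""
    for i in range(start, len(stream)):
        if stream[i] == line:
            return i
    return float("inf")


def simulate_opt(initial_lines, stream):
    """Replay stream on one full set; return (decisions, final lines).

    Each decision is (missing line, victim); victim == missing line
    means a self-replacement (cache bypass)."""
    cache = list(initial_lines)
    decisions = []
    for t, line in enumerate(stream):
        if line in cache:
            continue                               # hit
        candidates = cache + [line]                # cached lines plus the new line
        victim = max(candidates, key=lambda c: next_use(stream, t + 1, c))
        decisions.append((line, victim))
        if victim != line:
            cache[cache.index(victim)] = line      # replace the least imminent line
    return decisions, cache


# Hypothetical access stream, not taken from the slide
stream = ["A5", "A3", "A1", "A4", "A5", "A6", "A1", "A5", "A3", "A4"]
decisions, final_lines = simulate_opt(["A1", "A2", "A3", "A4"], stream)
print(decisions)  # prints [('A5', 'A2'), ('A6', 'A6')]
```

OPT needs the future of the access stream, which is why it serves as a bound for comparison rather than as an implementable hardware policy.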

### Shepherd Cache Emulation of OPT

### Emulating OPT with a Shepherd Cache



- Split the cache into two logical parts
  - Main Cache (MC) for which optimal replacement is emulated
  - Shepherd Cache (SC) used to provide a lookahead and guide replacements from MC towards OPT

#### Operation

- Buffer lines temporarily in SC before moving them to MC; SC acts as a FIFO buffer
- While in SC, gather imminence information and emulate lookahead
- When forced out of SC, make an MC replacement based on the gathered imminence order



### Shepherd Cache Overview

### Overview of Shepherd Caching



- Example: emulating OPT for an MC with 4 ways per set, using 2 SC ways per set
- To gather the imminence order, add a counter matrix (CM)
- CM has one column per SC way to track the imminence order w.r.t. it
- CM has one row per SC and MC line, as any of them can be a replacement candidate
- Each column has one Next Value Counter (NVC) to track the next value to assign along the column
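The counter matrix can be pictured as a small table per set. The following Python sketch is a simplified reading of the bullets above, not the paper's exact mechanism: the class and method names are hypothetical, and the rule for choosing among lines that were never accessed in the window (first row wins) is an assumption of this example.

```python
EMPTY = None  # entry not yet assigned in this column


class CounterMatrix:
    """Per-set counter matrix: one row per line, one column per SC way."""

    def __init__(self, mc_ways=4, sc_ways=2):
        self.rows = mc_ways + sc_ways
        self.cols = sc_ways
        self.cm = [[EMPTY] * self.cols for _ in range(self.rows)]
        self.nvc = [0] * self.cols     # next value to assign along each column

    def start_tracking(self, sc_way):
        """A new line entered this SC way: clear its column and counter."""
        for row in range(self.rows):
            self.cm[row][sc_way] = EMPTY
        self.nvc[sc_way] = 0

    def record_access(self, row):
        """First access to a line since a column was cleared gets the next value."""
        for col in range(self.cols):
            if self.cm[row][col] is EMPTY:
                self.cm[row][col] = self.nvc[col]
                self.nvc[col] += 1

    def least_imminent(self, sc_way, candidate_rows):
        """Prefer a line never accessed in the window, else the last one accessed."""
        untouched = [r for r in candidate_rows if self.cm[r][sc_way] is EMPTY]
        if untouched:
            return untouched[0]        # assumption: simple tie-break by row order
        return max(candidate_rows, key=lambda r: self.cm[r][sc_way])


cm = CounterMatrix()
cm.start_tracking(0)                   # rows 0-3 are MC lines, row 4 is SC way 0
for row in [2, 0, 4, 3, 2]:
    cm.record_access(row)
victim_row = cm.least_imminent(0, [0, 1, 2, 3, 4])
print(victim_row)                      # prints 1
```

In this run MC line 1 is the only candidate not accessed while the SC line was being tracked, so it is chosen as the victim.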




### Shepherd cache bridges 32 - 52% of the gap

#### Bridging the performance gap




### Homework 3

- 1. Design a single-cycle processor supporting the MIPS add, addi, lw, and sw instructions in Verilog HDL. Please download proc03.v from the support page and refer to it.
- 2. Verify the behavior of the designed processor using the following assembly code
  - add \$0, \$0, \$0 # NOP {6'h0, 5'd0, 5'd0, 5'd0, 5'd0, 6'h20}
  - addi \$t0, \$zero, 8 # {6'h8, 5'd0, 5'd8, 16'd8}
  - sw \$t0, 4(\$t0) #
  - lw \$t1, 4(\$t0) #
  - addi \$t2, \$t1, 6 #
- 3. Submit a report printed on A4 paper at the beginning of the next lecture.
  - The report should include a block diagram, the Verilog HDL source code, and the waveforms obtained from your design.
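When checking the waveforms, it can help to know the architectural state the test program should produce. The following Python sketch is a behavioral reference model for the four instructions, not a Verilog solution; the tuple format for instructions, the function names, and the choice to read uninitialized memory as 0 are assumptions of this example.

```python
MASK32 = 0xFFFFFFFF


def write_reg(regs, index, value):
    """Write a 32-bit value; register $0 is hardwired to zero."""
    if index != 0:
        regs[index] = value & MASK32


def run(program):
    """Execute a list of instruction tuples; return (registers, memory)."""
    regs = [0] * 32
    mem = {}                                   # byte address -> 32-bit word
    for instr in program:
        op = instr[0]
        if op == "add":                        # ("add", rd, rs, rt)
            _, rd, rs, rt = instr
            write_reg(regs, rd, regs[rs] + regs[rt])
        elif op == "addi":                     # ("addi", rt, rs, imm)
            _, rt, rs, imm = instr
            write_reg(regs, rt, regs[rs] + imm)
        elif op == "lw":                       # ("lw", rt, imm, rs)
            _, rt, imm, rs = instr
            write_reg(regs, rt, mem.get((regs[rs] + imm) & MASK32, 0))
        elif op == "sw":                       # ("sw", rt, imm, rs)
            _, rt, imm, rs = instr
            mem[(regs[rs] + imm) & MASK32] = regs[rt]
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return regs, mem


program = [
    ("add", 0, 0, 0),     # add  $0, $0, $0   (NOP)
    ("addi", 8, 0, 8),    # addi $t0, $zero, 8
    ("sw", 8, 4, 8),      # sw   $t0, 4($t0)
    ("lw", 9, 4, 8),      # lw   $t1, 4($t0)
    ("addi", 10, 9, 6),   # addi $t2, $t1, 6
]
regs, mem = run(program)
print(regs[8], regs[9], regs[10], mem)  # prints 8 8 14 {12: 8}
```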

### Waveform of proc02

Signal values per instruction, read from the simulation waveform (time axis 0 to 500 ns; the CLK and RST_X traces are omitted). Values are hexadecimal; x marks an undefined value. Before the first instruction is fetched, the signals other than pc are undefined.

| Signal       | add \$0, \$0, \$0 | addi \$t0, \$zero, 3 | addi \$t1, \$zero, 5 | add \$t2, \$t0, \$t1 | After the last instruction |
|--------------|----------|----------|----------|----------|----------|
| pc[31:0]     | 00000000 | 00000004 | 00000008 | 0000000c | 00000010 |
| ir[31:0]     | 00000020 | 20080003 | 20090005 | 01095020 | xxxxxxxx |
| op[5:0]      | 00       | 08       | 08       | 00       | xx       |
| rs[4:0]      | 00       | 00       | 00       | 08       | xx       |
| rt[4:0]      | 00       | 08       | 09       | 09       | xx       |
| rd[4:0]      | 00       | 00       | 00       | 0A       | xx       |
| imm[31:0]    | 00000020 | 00000003 | 00000005 | 00005020 | xxxxxxxx |
| rrs[31:0]    | 00000000 | 00000000 | 00000000 | 00000003 | xxxxxxxx |
| rrt[31:0]    | 00000000 | xxxxxxxx | xxxxxxxx | 00000005 | xxxxxxxx |
| RRT[31:0]    | 00000000 | 00000003 | 00000005 | 00000005 | xxxxxxxx |
| rdst[4:0]    | 00       | 08       | 09       | 0A       | xx       |
| result[31:0] | 00000000 | 00000003 | 00000005 | 00000008 | xxxxxxxx |

- add \$0, \$0, \$0 # NOP {6'h0, 5'd0, 5'd0, 5'd0, 5'd0, 6'h20}
- addi \$t0, \$zero, 3 # {6'h8, 5'd0, 5'd8, 16'd3}
- addi \$t1, \$zero, 5 # {6'h8, 5'd0, 5'd9, 16'd5}
- add \$t2, \$t0, \$t1 # {6'h0, 5'd8, 5'd9, 5'd10, 5'd0, 6'h20}