Fiscal Year 2018



Course number: CSC.T433 School of Computing, Graduate major in Computer Science

### Advanced Computer Architecture

#### 6. Instruction Level Parallelism: Instruction Fetch and Branch Prediction

www.arch.cs.titech.ac.jp/lecture/ACA/ Room No.W936 Mon 13:20-14:50, Thu 13:20-14:50

Kenji Kise, Department of Computer Science kise \_at\_ c.titech.ac.jp

#### Scalar and Superscalar processors

- A scalar processor can execute at most one instruction per clock cycle using one ALU.
  - IPC (executed Instructions Per Cycle) is at most 1.
- A superscalar processor can execute more than one instruction per clock cycle by running multiple instructions in multiple pipelines.
  - IPC (executed Instructions Per Cycle) can be more than 1.
  - A processor with n pipelines is called n-way superscalar.
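The IPC bounds above can be checked with a small calculation. This is a minimal sketch that assumes an ideal pipeline with no stalls or hazards; the function name `ideal_ipc` and the four-stage default are illustrative choices, not part of the lecture material.

```python
import math


def ideal_ipc(num_instructions, ways, stages=4):
    """IPC of an ideal n-way pipeline with no stalls (an assumption)."""
    # The first group of instructions needs `stages` cycles to drain;
    # every later group of `ways` instructions finishes one cycle later.
    cycles = stages + math.ceil(num_instructions / ways) - 1
    return num_instructions / cycles


print(ideal_ipc(100, ways=1))  # scalar: stays below 1
print(ideal_ipc(100, ways=2))  # 2-way superscalar: between 1 and 2
```

Under these assumptions the scalar case never reaches 1 because of the pipeline fill time, while the 2-way case approaches 2 as the program gets longer.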



(b) pipeline diagram of 2-way superscalar processor

A four-stage pipelined 2-way superscalar processor that supports ADD and does not adopt data forwarding (proc10, Homework 5)



#### Waveform of Proc10

| Signal            | Values in time order (time axis from 0 ns to 700 ns, left to right) |                  |                  |                  |                  |          |      |
|-------------------|------------------------|------------------|------------------|------------------|------------------|----------|------|
| CLK               |                        |                  |                  |                  |                  |          |      |
| RST_X             |                        |                  |                  |                  |                  |          |      |
| PC[31:0]          | 00000000               | 00000008         | 00000010         | 00000018         | 00000020         | 00000028 | 0000 |
| If_IRS[63:0]      | x, then 0000002000000020 | 0042102000210820 | 0084202000631820 | xxxxxxxxxxxxxxxx |                  |          |      |
| IfId_IRS[63:0]    | 0000000000000000       | 0000002000000020 | 0042102000210820 | 0084202000631820 | xxxxxxxxxxxxxxxx |          |      |
| Id_IR1[31:0]      | x, then 00000000       | 00000020         | 00210820         | 00631820         | xxxxxxxx         |          |      |
| Id_IR2[31:0]      | x, then 00000000       | 00000020         | 00421020         | 00842020         | xxxxxxxx         |          |      |
| Id_RS1[4:0]       | x, then 00             |                  | 01               | 03               | xx               |          |      |
| Id_RT1[4:0]       | x, then 00             |                  | 01               | 03               | xx               |          |      |
| Id_RD1[4:0]       | x, then 00             |                  | 01               | 03               | xx               |          |      |
| Id_RS2[4:0]       | x, then 00             |                  | 02               | 04               | xx               |          |      |
| Id_RT2[4:0]       | x, then 00             |                  | 02               | 04               | xx               |          |      |
| Id_RD2[4:0]       | x, then 00             |                  | 02               | 04               | xx               |          |      |
| Id_RRS1[31:0]     | x, then 00000000       |                  | 00000016         | 0000002C         | xxxxxxxx         |          |      |
| Id_RRT1[31:0]     | x, then 00000000       |                  | 00000016         | 0000002C         | xxxxxxxx         |          |      |
| Id_RRS2[31:0]     | x, then 00000000       |                  | 00000021         | 00000037         | xxxxxxxx         |          |      |
| Id_RRT2[31:0]     | x, then 00000000       |                  | 00000021         | 00000037         | xxxxxxxx         |          |      |
| IdEx_RD1[4:0]     | 00                     |                  |                  | 01               | 03               | xx       |      |
| IdEx_RD2[4:0]     | 00                     |                  |                  | 02               | 04               | xx       |      |
| IdEx_RRT2[31:0]   | 00000000               |                  |                  | 00000021         | 00000037         | xxxxxxxx |      |
| IdEx_RRT1[31:0]   | 00000000               |                  |                  | 00000016         | 0000002C         | xxxxxxxx |      |
| IdEx_RRS1[31:0]   | 00000000               |                  |                  | 00000016         | 0000002C         | xxxxxxxx |      |
| IdEx_RRS2[31:0]   | 00000000               |                  |                  | 00000021         | 00000037         | xxxxxxxx |      |
| Ex_RSLT1[31:0]    | x, then 00000000       |                  |                  | 0000002C         | 00000058         | xxxxxxxx |      |
| Ex_RSLT2[31:0]    | x, then 00000000       |                  |                  | 00000042         | 0000006E         | xxxxxxxx |      |
| ExWb_RD1[4:0]     | 00                     |                  |                  |                  | 01               | 03       | xx   |
| ExWb_RD2[4:0]     | 00                     |                  |                  |                  | 02               | 04       | xx   |
| ExWb_RSLT1[31:0]  | 00000000               |                  |                  |                  | 0000002C         | 00000058 | xxxx |
| ExWb_RSLT2[31:0]  | 00000000               |                  |                  |                  | 00000042         | 0000006E | xxxx |



#### Sample program: vector add with two branches



#### Simple branch predictor: bimodal

- A program has many branch instructions, and each branch may behave differently. Use one counter per branch instruction.
- How to predict
  - Select one counter using the PC; predict 1 (taken) if the MSB of the counter is 1, otherwise predict 0 (not taken).
- How to update
  - Select one counter using the PC, then update it in the same way as a 2-bit counter.
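The predict and update steps above can be sketched as follows. This is a minimal model, not the lecture's implementation: the table size, the initial counter value, the word-aligned PC indexing, and the names `Bimodal` and `accuracy` are all assumptions made for illustration.

```python
class Bimodal:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # assumed start: weakly not-taken

    def _index(self, pc):
        return (pc >> 2) & self.mask  # drop the byte offset of 4-byte instructions

    def predict(self, pc):
        return self.table[self._index(pc)] >> 1  # MSB of the 2-bit counter

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def accuracy(predictor, trace):
    """Fraction of correct predictions over a list of (pc, outcome) pairs."""
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)


# One branch whose outcome repeats 1, 1, 1, 0 (a loop that exits every 4th time).
loop_branch = [(0x14, t) for t in [1, 1, 1, 0] * 25]
print(accuracy(Bimodal(), loop_branch))
```

On this repeating pattern the counter saturates toward taken, so the not-taken outcome at the end of every period is mispredicted. That is the weakness history-based predictors address.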



#### An innovation in branch predictors in 1993

- Using branch history
  - global branch history
  - local branch history
- 2-level branch predictor and Gshare
- Assume we are predicting the sequence 1110 1110 1110 1110 1110 ...
  - 11101110 ?
  - 111011101 ?
  - 1110111011 ?
  - 11101110111 ?
  - 111011101110 ?
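The benefit of branch history for the repeating sequence above can be checked directly. The sketch below is an illustration, not part of the lecture: the helper `next_after` is a hypothetical name, and it simply records which outcomes follow each history of a given length.

```python
from collections import defaultdict


def next_after(history_bits, sequence):
    """Map each history of the given length to the set of outcomes that follow it."""
    followers = defaultdict(set)
    for i in range(history_bits, len(sequence)):
        followers[sequence[i - history_bits:i]].add(sequence[i])
    return dict(followers)


seq = "1110" * 5
for h in (1, 2, 3):
    table = next_after(h, seq)
    # A history is ambiguous if both 0 and 1 have followed it.
    ambiguous = sorted(k for k, v in table.items() if len(v) > 1)
    print(h, "history bits, ambiguous histories:", ambiguous)
```

With one or two history bits some histories are followed by both outcomes, but with three bits every history has exactly one successor, so a table indexed by the last three outcomes can predict this sequence without error once it is trained.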



#### **Recommended Reading**

- Combining Branch Predictors
  - Scott McFarling, Digital Western Research Laboratory
  - WRL Technical Note TN-36, 1993

#### A quote:

"In this paper, we have presented two new methods for improving branch prediction performance. First, we showed that using the bitwise exclusive OR of the global branch history and the branch address to access predictor counters results in better performance for a given counter array size."



#### Gshare (TR-DEC 1993)

- How to predict
  - XOR the global branch history with the PC to index the PHT; the MSB of the selected counter is the prediction.
- How to update
  - Shift the BHR one bit left and set its LSB to the branch outcome, in the IF stage.
  - Update the counter that was used, in the same way as a 2-bit counter (2BC), in the WB stage.
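A behavioral sketch of the Gshare steps above follows. It is a simplified model under stated assumptions: there is no pipeline, so the BHR shift and the counter update happen back to back instead of in the IF and WB stages, and the table size, initial counter value, and the names `Gshare` and `run` are illustrative.

```python
class Gshare:
    """Gshare sketch: PHT of 2-bit counters indexed by PC XOR global history."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.pht = [1] * (1 << index_bits)  # assumed start: weakly not-taken
        self.bhr = 0                        # global branch history register

    def _index(self, pc):
        return ((pc >> 2) ^ self.bhr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >> 1  # MSB of the selected counter

    def update(self, pc, taken):
        i = self._index(pc)  # same index that was used for the prediction
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.bhr = ((self.bhr << 1) | taken) & self.mask  # shift outcome into the LSB


def run(predictor, trace):
    """Return one boolean per branch: was the prediction correct?"""
    results = []
    for pc, taken in trace:
        results.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return results


trace = [(0x14, t) for t in [1, 1, 1, 0] * 50]
hits = run(Gshare(), trace)
print(sum(hits[-100:]), "of the last 100 predictions correct")
```

Because each history value selects its own counter, the repeating 1110 pattern is learned after a short warm-up, unlike a single PC-indexed counter.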



#### Bi-Mode (MICRO 1997)

- A choice predictor (bimodal) is used as a meta-predictor.
- How to predict
  - As in Gshare, the Taken PHT and the Untaken PHT each make a prediction, giving two predictions.
  - The choice predictor, which tracks the global bias of a branch, selects one of them.
- How to update
  - The PHT that was used is updated in the same way as a 2BC.
  - The choice predictor is updated in the same way as bimodal.
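The Bi-Mode steps above can be sketched as below. This follows the update rule as stated on the slide; the published design includes refinements that this sketch omits. Table sizes, initial counter values, and the names `BiMode` and `bump` are assumptions for illustration.

```python
def bump(counter, taken):
    """Update a 2-bit saturating counter with a branch outcome."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


class BiMode:
    def __init__(self, index_bits=10):
        n = 1 << index_bits
        self.mask = n - 1
        self.choice = [1] * n    # bimodal choice predictor, indexed by PC
        self.pht = {0: [1] * n,  # Untaken PHT (assumed start: weakly not-taken)
                    1: [2] * n}  # Taken PHT (assumed start: weakly taken)
        self.bhr = 0

    def _lookup(self, pc):
        c = (pc >> 2) & self.mask               # choice index: PC only
        g = ((pc >> 2) ^ self.bhr) & self.mask  # PHT index: PC XOR history
        bank = self.choice[c] >> 1              # which PHT the choice selects
        return c, g, bank

    def predict(self, pc):
        _, g, bank = self._lookup(pc)
        return self.pht[bank][g] >> 1

    def update(self, pc, taken):
        c, g, bank = self._lookup(pc)
        self.pht[bank][g] = bump(self.pht[bank][g], taken)  # only the used PHT
        self.choice[c] = bump(self.choice[c], taken)        # as in bimodal
        self.bhr = ((self.bhr << 1) | taken) & self.mask


p = BiMode()
for _ in range(5):
    p.update(0x40, 1)  # a branch that is always taken
print(p.predict(0x40))
```

Splitting the counters by bias means two branches that alias to the same PHT index but have opposite biases tend to land in different tables, which is the motivation for the scheme.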



#### YAGS (Yet Another Global Scheme)

- Uses two tagged PHTs.
- When the tagged PHT lookup misses, the choice PHT makes the prediction.
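The two points above can be sketched as follows. This is a simplified reading of the scheme, not a faithful reproduction: the allocation policy (allocate an entry when the choice PHT is wrong), the plain bimodal update of the choice PHT, the table sizes, and the names `Yags` and `bump` are all assumptions.

```python
def bump(counter, taken):
    """Update a 2-bit saturating counter with a branch outcome."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


class Yags:
    def __init__(self, choice_bits=10, cache_bits=8, tag_bits=6):
        self.cmask = (1 << choice_bits) - 1
        self.imask = (1 << cache_bits) - 1
        self.tmask = (1 << tag_bits) - 1
        self.choice = [1] * (1 << choice_bits)  # bimodal choice PHT
        # cache[b] holds exceptions for branches whose bias is b;
        # each entry is None or [tag, 2-bit counter].
        self.cache = {0: [None] * (1 << cache_bits),
                      1: [None] * (1 << cache_bits)}
        self.bhr = 0

    def _lookup(self, pc):
        a = pc >> 2
        c = a & self.cmask
        bias = self.choice[c] >> 1
        i = (a ^ self.bhr) & self.imask
        tag = a & self.tmask
        entry = self.cache[bias][i]
        hit = entry is not None and entry[0] == tag
        return c, bias, i, tag, hit

    def predict(self, pc):
        _, bias, i, _, hit = self._lookup(pc)
        return self.cache[bias][i][1] >> 1 if hit else bias  # miss: use choice PHT

    def update(self, pc, taken):
        c, bias, i, tag, hit = self._lookup(pc)
        if hit:
            self.cache[bias][i][1] = bump(self.cache[bias][i][1], taken)
        elif bias != taken:
            # The choice PHT was wrong: remember this case as an exception.
            self.cache[bias][i] = [tag, 2 if taken else 1]
        self.choice[c] = bump(self.choice[c], taken)
        self.bhr = ((self.bhr << 1) | taken) & self.imask


y = Yags()
correct = []
for taken in [1, 1, 1, 0] * 100:
    correct.append(y.predict(0x14) == taken)
    y.update(0x14, taken)
print(sum(correct[-100:]), "of the last 100 predictions correct")
```

In this sketch the tagged tables store only the cases where a branch deviates from its bias, so most of their capacity is spent on the hard-to-predict instances.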



#### Alpha 21264's hybrid branch predictor

- A hybrid of local prediction and global prediction, implemented in the DEC Alpha 21264, which was the state-of-the-art commercial processor of its time.
- A choice predictor is used as a meta-predictor
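The structure described above, a local predictor and a global predictor arbitrated by a choice predictor, can be sketched as below. The table sizes and counter widths are illustrative and are not the 21264's actual parameters; training the chooser only when the two components disagree is a common convention assumed here, and the names `Tournament` and `bump` are hypothetical.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class Tournament:
    def __init__(self, bits=10):
        n = 1 << bits
        self.mask = n - 1
        self.local_hist = [0] * n  # per-branch history, indexed by PC
        self.local_pht = [1] * n   # counters indexed by local history
        self.global_pht = [1] * n  # counters indexed by global history
        self.choice = [1] * n      # MSB set means "use the global prediction"
        self.ghr = 0

    def _parts(self, pc):
        p = (pc >> 2) & self.mask
        lh = self.local_hist[p]
        local = self.local_pht[lh] >> 1
        glob = self.global_pht[self.ghr] >> 1
        return p, lh, local, glob

    def predict(self, pc):
        _, _, local, glob = self._parts(pc)
        return glob if self.choice[self.ghr] >> 1 else local

    def update(self, pc, taken):
        p, lh, local, glob = self._parts(pc)
        if local != glob:  # train the chooser only when the components disagree
            self.choice[self.ghr] = bump(self.choice[self.ghr], glob == taken)
        self.local_pht[lh] = bump(self.local_pht[lh], taken)
        self.global_pht[self.ghr] = bump(self.global_pht[self.ghr], taken)
        self.local_hist[p] = ((lh << 1) | taken) & self.mask
        self.ghr = ((self.ghr << 1) | taken) & self.mask


t = Tournament()
correct = []
for taken in [1, 1, 1, 0] * 100:
    correct.append(t.predict(0x14) == taken)
    t.update(0x14, taken)
print(sum(correct[-100:]), "of the last 100 predictions correct")
```

With a single branch the local and global histories coincide, so the chooser is never exercised in this small run; it matters when several branches interleave and one kind of history is more informative than the other.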





- Uses a majority vote instead of a meta-predictor.



#### Perceptron (HPCA 2001)

- How to predict
  - Select one perceptron using the PC.
  - Compute y using the equation. Predict 1 if y >= 0, and predict 0 if y < 0.
- How to update
  - Train the weights of the perceptron that was used when the prediction misses or when |y| < T.
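The perceptron steps above can be sketched as follows. The sketch assumes the usual perceptron-predictor formulation, y = w0 + sum of w_i * x_i, where each x_i is +1 for a taken and -1 for a not-taken history bit; this is taken to be the equation the slide refers to. The history length, table size, threshold value, and class name are illustrative, and weight saturation is omitted.

```python
class PerceptronPredictor:
    def __init__(self, history_len=8, num_perceptrons=64, threshold=20):
        self.n = num_perceptrons
        self.T = threshold  # illustrative value for the training threshold T
        self.w = [[0] * (history_len + 1) for _ in range(num_perceptrons)]
        self.hist = [-1] * history_len  # +1 = taken, -1 = not taken, newest first

    def _weights(self, pc):
        return self.w[(pc >> 2) % self.n]  # select one perceptron by PC

    def _y(self, pc):
        w = self._weights(pc)
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.hist))

    def predict(self, pc):
        return 1 if self._y(pc) >= 0 else 0

    def update(self, pc, taken):
        y = self._y(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not confident enough.
        if (y >= 0) != bool(taken) or abs(y) < self.T:
            w = self._weights(pc)
            w[0] += t
            for i, xi in enumerate(self.hist):
                w[i + 1] += t * xi
        self.hist = [t] + self.hist[:-1]


p = PerceptronPredictor()
correct = []
for taken in [1, 1, 1, 0] * 250:
    correct.append(p.predict(0x14) == taken)
    p.update(0x14, taken)
print(sum(correct[-100:]), "of the last 100 predictions correct")
```

The repeating 1110 pattern is linearly separable (the next outcome equals the outcome four branches earlier), so the weights settle and training stops once every output clears the threshold.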



#### Branch predictors based on pattern matching

- Find the longest matching pattern (green rectangle)
- Select a suitable matching length, or a long matching pattern (blue rectangle)
- Count the number of 0 and the number of 1 after the pattern (red rectangle), then predict.
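The three steps above can be sketched on a plain history string. This is a minimal illustration, assuming the longest match up to a fixed maximum length is always chosen and that ties go to 1; the function name `ppm_predict` and the fallback prediction are hypothetical.

```python
def ppm_predict(history, max_len=8):
    """Predict the next bit of a 0/1 string by longest-suffix matching.

    Returns (prediction, matched_length).
    """
    for length in range(min(max_len, len(history) - 1), 0, -1):
        suffix = history[-length:]
        counts = {"0": 0, "1": 0}
        # Visit every earlier occurrence of the suffix and count what followed it.
        start = history.find(suffix)
        while start != -1 and start + length < len(history):
            counts[history[start + length]] += 1
            start = history.find(suffix, start + 1)
        if counts["0"] or counts["1"]:
            return (1 if counts["1"] >= counts["0"] else 0), length
    return 1, 0  # no earlier match at any length: assumed default is "taken"


print(ppm_predict("1110" * 3))          # history ends in ...1110
print(ppm_predict("1110" * 3 + "111"))  # history ends in ...0111
```

Trying the longest pattern first and falling back to shorter ones mirrors the selection step: a long match is more specific, while a short match is available more often.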



#### Partial Pattern Matching (CBP 2004)



#### Some typical branch predictors until 2004



- ISCA (International Symposium on Computer Architecture)
- MICRO (International Symposium on Microarchitecture)
- PACT (International Conference on Parallel Architectures and Compilation Techniques)
- ASPLOS (International Conference on Architectural Support for Programming Languages and Operating Systems)

#### Prediction accuracy

- The accuracy of 4KB Gshare is about 93%.
- The accuracy of 4KB PPM is about 97%.



#### Mid-term report

1. For details of the assignment, please visit the lecture support page: http://www.arch.cs.titech.ac.jp/lecture/ACA/
2. Submit your report, printed on A4 paper, at the beginning of the next lecture on January 7, 2019.

#### Four-stage pipelined processor supporting ADD and BNE, which does not adopt data forwarding (proc07.v)

<pre>
initial begin /* initialize the instruction &amp; data memory &amp; regfile */
  p.imem.mem[0] = {6'h0, 5'd0, 5'd0, 5'd0, 5'd0, 6'h20}; // NOP
  p.imem.mem[1] = {6'h0, 5'd5, 5'd1, 5'd5, 5'd0, 6'h20}; // L1: add $5, $5, $1
  p.imem.mem[2] = {6'h0, 5'd0, 5'd0, 5'd0, 5'd0, 6'h20}; // NOP
  p.imem.mem[3] = {6'h0, 5'd0, 5'd0, 5'd0, 5'd0, 6'h20}; // NOP
  p.imem.mem[4] = {6'h0, 5'd0, 5'd0, 5'd0, 5'd0, 6'h20}; // NOP
  p.imem.mem[5] = {6'h5, 5'd4, 5'd5, 16'hfffb};          // bne $4, $5, L1
  p.imem.mem[6] = {6'h0, 5'd0, 5'd0, 5'd0, 5'd0, 6'h20}; // NOP
  // The remaining instruction memory entries hold NOPs and two more instructions:
  //   add $5, $0, $0
  //   bne $2, $0, L1
  p.regfile.r[1] = 1;
  p.regfile.r[2] = 22;
  p.regfile.r[3] = 0;
  p.regfile.r[4] = 4;
  p.regfile.r[5] = 0;
end
</pre>

<pre>
while(1){
    for(int i=1; i!=4; i++){
    }
}
</pre>

| RRS, RRT, TKN: | 4              | 1                        | 1                                 |
|----------------|----------------|--------------------------|-----------------------------------|
| DDC DDT TVN    |                | 5                        | 1                                 |
| RRS, RRT, TKN: | 4              | $\frac{1}{2}$            | 1                                 |
| RRS, RRT, TKN: | 4              | 3                        | 1                                 |
| RRS RRT TKN:   | 4              | 4                        | Û                                 |
| RRS, RRT, TKN: | 22             | Ō                        | 1                                 |
| DDC DDT TVN    |                |                          |                                   |
| RRS, RRT, TKN: | 4              | 1                        | 1                                 |
| RRS, RRT, TKN: | 4              | 2                        | 1                                 |
| RRS, RRT, TKN: | 4              | $\overline{2}$           | ī                                 |
| RRS, RRT, TKN: | 4              | Å.                       | Õ                                 |
| RRS, RRT, TKN: | 22             | Ō                        | 1                                 |
| RRO, RAI, IRN. |                |                          |                                   |
| RRS, RRT, TKN: | 4              | 1                        | 1                                 |
| RRS, RRT, TKN: | 4              | $\frac{\overline{2}}{3}$ | $\begin{array}{c}1\\1\end{array}$ |
| RRS, RRT, TKN: | 4              | 3                        | 1                                 |
| RRS, RRT, TKN: | 4              | -<br>Ă                   | Ô                                 |
| RRS, RRT, TKN: |                | 0                        | 1                                 |
|                | 22             | -                        |                                   |
| RRS, RRT, TKN: | 4              | 1                        | 1                                 |
| RRS, RRT, TKN: | 4              | $\overline{2}$           | 1                                 |
| RRS, RRT, TKN: | 4              |                          | ī                                 |
| RRS, RRT, TKN: | 4              | -                        | Ō                                 |
| DDC DDT TVN    |                |                          | <u> </u>                          |
| RRS, RRT, TKN: | 22             | Q                        | 1                                 |
| RRS, RRT, TKN: | 4              | $\frac{1}{2}$            | 1                                 |
| RRS, RRT, TKN: | 4              | 2                        | 1                                 |
| RRS, RRT, TKN: | -<br>4         | 3                        | ī                                 |
| RRS, RRT, TKN: | 4              | Ă                        | Ō                                 |
| RRS, RRT, TKN: | 22             | 0                        | 1                                 |
| RRS, RRT, TKN: |                |                          |                                   |
| RRS, RRT, TKN: | 4              | $\frac{1}{2}$            | 1                                 |
| RRS, RRT, TKN: | 4              | 2                        | 1                                 |
| RRS, RRT, TKN: | 4              |                          | ī                                 |
| RRS, RRT, TKN  | Ā              | Ă                        | ń                                 |
| DDC DDT TVN    |                |                          |                                   |
| RRS, RRT, TKN: | 22             | 0                        | 1                                 |
| RRS, RRT, TKN: | 4              | $\frac{1}{2}$            | 1                                 |
| RRS, RRT, TKN: | 4              | 2                        | 1                                 |
| RRS, RRT, TKN: | 4              | 3                        | ī                                 |
| RRS, RRT, TKN: | 4              | 4                        | Ō                                 |
| DDC DDT TVV    |                |                          |                                   |
| RRS, RRT, TKN: | 22             | 0                        | 1                                 |
| RRS, RRT, TKN: | 4              | 1                        | 1                                 |
| RRS, RRT, TKN: | 4              | 2                        | 1                                 |
| RRS, RRT, TKN: | 4              | 3                        | ī                                 |
| RRS, RRT, TKN: | $\overline{4}$ | 2<br>3<br>4              | Ō                                 |
| DDC DDT TVN    |                | 0                        |                                   |
| RRS, RRT, TKN: | 22             | U                        | 1                                 |
|                | 4              |                          | 1                                 |
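The branch trace above can be reproduced with a short behavioral model of the program. This is a sketch that models only the four non-NOP instructions named in the testbench comments and the register values it initializes; the function name `bne_trace` and the tuple layout `(RRS, RRT, TKN)` are illustrative choices.

```python
def bne_trace(num_branches):
    """Return the first `num_branches` (RRS, RRT, TKN) tuples of the program."""
    r = {0: 0, 1: 1, 2: 22, 3: 0, 4: 4, 5: 0}  # register values from the testbench
    out = []
    while len(out) < num_branches:
        r[5] += r[1]                   # L1: add $5, $5, $1
        taken = int(r[4] != r[5])      # bne $4, $5, L1
        out.append((r[4], r[5], taken))
        if taken:
            continue                   # branch back to L1
        r[5] = r[0] + r[0]             # add $5, $0, $0
        out.append((r[2], r[0], int(r[2] != r[0])))  # bne $2, $0, L1
    return out[:num_branches]


for rrs, rrt, tkn in bne_trace(6):
    print("RRS, RRT, TKN:", rrs, rrt, tkn)
```

The outcome column repeats 1, 1, 1, 0, 1, which makes this program a convenient input for comparing a PC-indexed counter against a history-based predictor.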

#### Exercise: how to update PHT and BHR of Gshare