Fiscal Year 2018



Course number: CSC.T433 School of Computing, Graduate major in Computer Science

# Advanced Computer Architecture

## 12. Thread Level Parallelism: Interconnection Network

www.arch.cs.titech.ac.jp/lecture/ACA/ Room No.W936 Mon 13:20-14:50, Thu 13:20-14:50

Kenji Kise, Department of Computer Science kise \_at\_ c.titech.ac.jp

CSC.T433 Advanced Computer Architecture, Department of Computer Science, TOKYO TECH

# Key components of many-core processors

- Main memory and caches
  - Caches are used to reduce latency and to lower network traffic
  - A parallel program has private data and shared data
  - New issues are cache coherence and memory consistency
- Interconnection network
  - Connects many modules on a chip with high throughput and low latency
- Core
  - High-performance superscalar processor providing a hardware mechanism to support thread synchronization



## Performance metrics of interconnection network

- Network cost
  - number of switches
  - number of links on a switch to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
- Network bandwidth (NB)
  - represents the best case
  - bandwidth of each link x number of links
- Bisection bandwidth (BB)
  - represents the worst case
  - divide the machine in two parts, each with half the nodes and sum the bandwidth of the links that cross the dividing line
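The bisection definition above can be checked by brute force on small networks. The following is a minimal sketch, assuming unit link bandwidth and reading "worst case" as the minimum over all splits into two equal halves; the helper names (`bisection_width`, `ring_links`, `line_links`) are hypothetical and only practical for small node counts.

```python
from itertools import combinations

def bisection_width(num_nodes, links):
    """Minimum number of links cut by any split into two equal halves.

    Assumes num_nodes is even and every link has unit bandwidth.
    """
    best = None
    # Fix node 0 in the first half so mirrored splits are not counted twice.
    for rest in combinations(range(1, num_nodes), num_nodes // 2 - 1):
        half = {0, *rest}
        crossing = sum(1 for a, b in links if (a in half) != (b in half))
        if best is None or crossing < best:
            best = crossing
    return best

def ring_links(n):
    """Links of an n-node ring: node i connects to node (i + 1) mod n."""
    return [(i, (i + 1) % n) for i in range(n)]

def line_links(n):
    """Links of an n-node chain (a ring with one link removed)."""
    return [(i, i + 1) for i in range(n - 1)]

print(bisection_width(8, ring_links(8)))  # 2
print(bisection_width(8, line_links(8)))  # 1
```

Multiplying the returned link count by the per-link bandwidth gives BB for the topology.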

#### **Bus Network**

- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time
  - NB (best case) = link (bus) bandwidth x 1
  - BB (worst case) = link (bus) bandwidth x 1
- All processors can snoop the bus





#### **Ring Network**

- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
  - NB (best case) = link bandwidth x N
  - BB (worst case) = link bandwidth x 2
- If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case
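As a small worked comparison, the bus and ring formulas from the two slides above can be written as functions. This is only a sketch: the function names and the unit link bandwidth `LINK_BW` are assumptions made for illustration.

```python
def bus_metrics(n, link_bw):
    """Cost and bandwidth of a bus with n processors (formulas from the bus slide)."""
    return {"switches": 1, "links": 1, "NB": link_bw * 1, "BB": link_bw * 1}

def ring_metrics(n, link_bw):
    """Cost and bandwidth of a ring with n processors (formulas from the ring slide)."""
    return {"switches": n, "links": n, "NB": link_bw * n, "BB": link_bw * 2}

LINK_BW = 1.0  # hypothetical unit bandwidth; a ring link is assumed as fast as the bus

for n in (4, 16, 64):
    bus, ring = bus_metrics(n, LINK_BW), ring_metrics(n, LINK_BW)
    # Best-case speedup grows with N; worst-case speedup stays at 2.
    print(n, ring["NB"] / bus["NB"], ring["BB"] / bus["BB"])
```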



# Cell Broadband Engine (2005)

- Cell Broadband Engine (2005)
  - 8 cores (SPEs) + 1 core (PPE)
    - each SPE has 256 KB of memory
  - Used in the PS3 and IBM Roadrunner (12k cores)



PlayStation3 from PlayStation.com (Japan)







Diagram created by IBM to promote the CBEP, ©2005 from WIKIPEDIA

#### Intel Xeon Phi (2012)



#### Table 2. Intel<sup>®</sup> Xeon Phi<sup>™</sup> Product Family Specifications

| Product number | Form factor & thermal solution | Board TDP (W) | Number of cores | Frequency (GHz) | Peak double-precision performance (GFLOP/s) | Peak memory bandwidth (GB/s) | Memory capacity (GB) | Intel® Turbo Boost Technology |
|---|---|---|---|---|---|---|---|---|
| 3120P | PCIe, Passive | 300 | 57 | 1.1 | 1003 | 240 | 6 | N/A |
| 3120A | PCIe, Active | 300 | 57 | 1.1 | 1003 | 240 | 6 | N/A |
| 5110P | PCIe, Passive | 225 | 60 | 1.053 | 1011 | 320 | 8 | N/A |
| 5120D | Dense form factor, None | 245 | 60 | 1.053 | 1011 | 352 | 8 | N/A |
| 7110P | PCIe, Passive | 300 | 61 | 1.238 | 1208 | 352 | 16 | Peak turbo frequency: 1.33 GHz |
| 7120X | PCIe, None | 300 | 61 | 1.238 | 1208 | 352 | 16 | Peak turbo frequency: 1.33 GHz |



#### Intel® Xeon Phi<sup>™</sup> Coprocessor Block Diagram



#### Crossbar (Xbar) Network

- N processors, N<sup>2</sup> switches (unidirectional), 2 links/switch, N<sup>2</sup> links
- N simultaneous transfers
  - NB = link bandwidth x N
  - BB = link bandwidth  $\times N/2$
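The "N simultaneous transfers" property holds when every transfer targets a different output port. A minimal sketch of that constraint follows; the function `crossbar_grant` and its first-come priority order are assumptions for illustration, not a description of any particular crossbar arbiter.

```python
def crossbar_grant(requests):
    """Grant at most one transfer per input port and per output port.

    requests: list of (src, dst) pairs, served in list order
    (this priority order is an assumption of the sketch).
    """
    busy_src, busy_dst, granted = set(), set(), []
    for src, dst in requests:
        if src not in busy_src and dst not in busy_dst:
            granted.append((src, dst))
            busy_src.add(src)
            busy_dst.add(dst)
    return granted

n = 4
permutation = [(i, (i + 1) % n) for i in range(n)]  # all destinations differ
hotspot = [(i, 0) for i in range(n)]                # everyone targets port 0

print(len(crossbar_grant(permutation)))  # 4
print(len(crossbar_grant(hotspot)))      # 1
```

With a permutation pattern all N transfers proceed at once; with a hotspot pattern the single shared output serializes them.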





A symbol of Xbar

#### Fat Tree (1)

- Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.
- Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
  - The bisection bandwidth of a tree is horrible: 1 link, at all times
- The solution is to 'thicken' the upper links.
  - Adding more links toward the root, where the tree gets thicker, increases the bisection bandwidth
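The contention described above can be made concrete with a tiny sketch. It assumes a complete binary tree whose four leaves are A, B, C, D from left to right (so A and B share one subtree, C and D the other), heap-style node numbering with the root as node 1, and links identified as (child, parent) pairs. All names are hypothetical.

```python
def path_links(src_leaf, dst_leaf, num_leaves):
    """Links used between two leaves of a complete binary tree.

    Heap numbering: root is node 1, leaves are num_leaves .. 2*num_leaves-1.
    Each link is identified by its (child, parent) pair.
    """
    a, b = src_leaf + num_leaves, dst_leaf + num_leaves
    links = set()
    while a != b:  # both leaves are at the same depth, so climb in lockstep
        links.add((a, a // 2))
        links.add((b, b // 2))
        a //= 2
        b //= 2
    return links

A, B, C, D = range(4)
shared = path_links(A, C, 4) & path_links(B, D, 4)
print(sorted(shared))  # [(2, 1), (3, 1)]
```

Both transfers need the two links attached to the root, which is why a plain tree can carry only one of them at a time.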



#### Fat Tree

- N processors, log(N-1) x logN switches, 2 up + 4 down = 6 links/switch, N x logN links
- N simultaneous transfers
  - NB = link bandwidth x N log N
  - BB = link bandwidth x 4




#### Mesh Network

- N processors, N switches, 4 links/switch, $2N^{1/2}(N^{1/2} - 1)$ links (about 2N for large N)
- N simultaneous transfers
  - NB = link bandwidth x 2N
  - BB = link bandwidth x $N^{1/2}$
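The mesh figures can be cross-checked by enumerating the links of a small mesh. This sketch assumes a square k x k mesh (N = k x k) without wraparound links, i.e. not a torus, and cuts it with a vertical line through the middle; the function names are hypothetical.

```python
def mesh_links(k):
    """All links of a k x k 2D mesh without wraparound, as coordinate pairs."""
    links = []
    for y in range(k):
        for x in range(k):
            if x + 1 < k:
                links.append(((x, y), (x + 1, y)))  # horizontal link
            if y + 1 < k:
                links.append(((x, y), (x, y + 1)))  # vertical link
    return links

def middle_cut(k):
    """Number of links crossing a vertical line through the middle of the mesh."""
    return sum(1 for (x0, _), (x1, _) in mesh_links(k) if x0 < k // 2 <= x1)

for k in (4, 8):
    n = k * k
    print(n, len(mesh_links(k)), middle_cut(k))
# 16 24 4
# 64 112 8
```

The enumerated link count equals 2k(k - 1), which approaches 2N as the mesh grows, and the middle cut crosses k links, the square root of N.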



#### Intel Single-Chip Cloud Computer (2009)

- A research chip for studying many-core processors and parallel processing





A many-core architecture with 2D Mesh NoC



Intel Single-Chip Cloud Computer (48 Core)


#### Intel Skylake-X, Core i9-7980XE, 2017

- 18 cores





#### Epiphany-V: A 1024 core 64-bit RISC system-on-chip



Summary of Epiphany-V features:

- 1024 64-bit RISC processors
- 64-bit memory architecture
- 64/32-bit IEEE floating point support
- 64MB of distributed on-chip memory
- 1024 programmable I/O signals
- Three 136-bit wide 2D mesh NOCs
- 2052 Independent Power Domains
- Support for up to 1 billion shared memory processors
- Binary compatibility with Epiphany III/IV chips
- Custom ISA extensions for deep learning, communication, and cryptography



Table 5: Epiphany-V Area Breakdown

#### 2D and 3D Mesh / Torus Network





#### Bus vs. Networks on Chip (NoC) of mesh topology





## Typical NoC architecture of mesh topology

- NoC requirements: low latency, high throughput, low cost
- Packet-based data transmission via NoC routers using XY dimension-order routing
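XY dimension-order routing first moves a packet along the X dimension until the column matches the destination, then along the Y dimension. A minimal sketch, assuming routers are addressed by (x, y) coordinates; the function name `xy_route` is hypothetical.

```python
def xy_route(src, dst):
    """Return the list of routers visited from src to dst under XY routing."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:          # first correct the X coordinate
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:          # then correct the Y coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every packet finishes all X moves before any Y move, the route is deterministic and the hop count equals the Manhattan distance between source and destination.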



#### Packet organization (Flit encoding)

- A flit (flow control unit or flow control digit) is a link-level atomic piece that forms a network packet.
  - A packet has one head flit and some body flits.
- Assume that each flit has three typical fields:
  - payload (data) or route information (tag)
  - type: head, body, tail, etc.
  - virtual channel identifier
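The three fields can be packed into a single word. The sketch below is only an illustration: the field widths (2-bit type, 2-bit virtual channel, 32-bit payload), the type codes, and the choice to mark the last data flit as a tail are all assumptions, not the encoding used in the lecture.

```python
from enum import IntEnum

class FlitType(IntEnum):  # hypothetical type codes
    HEAD = 0
    BODY = 1
    TAIL = 2

TYPE_BITS, VC_BITS, PAYLOAD_BITS = 2, 2, 32  # assumed field widths

def pack_flit(ftype, vc, payload):
    """Pack [type | vc | payload] into one integer, type in the top bits."""
    if not 0 <= vc < (1 << VC_BITS) or not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("field out of range")
    return (int(ftype) << (VC_BITS + PAYLOAD_BITS)) | (vc << PAYLOAD_BITS) | payload

def unpack_flit(flit):
    """Split a packed flit back into (type, vc, payload)."""
    payload = flit & ((1 << PAYLOAD_BITS) - 1)
    vc = (flit >> PAYLOAD_BITS) & ((1 << VC_BITS) - 1)
    return FlitType(flit >> (VC_BITS + PAYLOAD_BITS)), vc, payload

def make_packet(dest_tag, data_words, vc=0):
    """Head flit carries the route tag; the data words follow as body flits."""
    flits = [pack_flit(FlitType.HEAD, vc, dest_tag)]
    for i, word in enumerate(data_words):
        last = i == len(data_words) - 1
        flits.append(pack_flit(FlitType.TAIL if last else FlitType.BODY, vc, word))
    return flits

packet = make_packet(dest_tag=5, data_words=[0xAAAA, 0xBBBB, 0xCCCC], vc=1)
print([unpack_flit(f)[0].name for f in packet])  # ['HEAD', 'BODY', 'BODY', 'TAIL']
```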




Packet (tag + data)

#### Simple NoC router architecture

• Input buffer, routing, arbitration, flow control




## Datapath of Virtual Channel (VC) NoC router

- To mitigate head-of-line (HOL) blocking, virtual channels are used
- Pipelining
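Head-of-line blocking can be illustrated with a toy model of one input port. The sketch assumes single-flit packets, one flit leaving the port per cycle, and one output that stays blocked for a fixed number of cycles; the function name and the scenario are hypothetical, and a real router adds credit-based flow control and arbitration that are not modeled here.

```python
from collections import deque

def cycles_until_delivered(queues, blocked_output, blocked_cycles, target):
    """Cycle in which packet `target` leaves the input port, or None.

    queues: one list of (name, output) per buffer; a single list models a
    plain FIFO, several lists model virtual channels on the same port.
    """
    queues = [deque(q) for q in queues]
    cycle = 0
    while any(queues):
        cycle += 1
        for q in queues:
            if not q:
                continue
            name, out = q[0]
            if out == blocked_output and cycle <= blocked_cycles:
                continue  # the head of this buffer cannot move this cycle
            q.popleft()
            if name == target:
                return cycle
            break  # only one flit leaves the port per cycle
    return None

# p0 wants the East output, which is blocked for 5 cycles; p1 wants North.
fifo = [[("p0", "E"), ("p1", "N")]]
vcs = [[("p0", "E")], [("p1", "N")]]
print(cycles_until_delivered(fifo, "E", 5, "p1"))  # 7
print(cycles_until_delivered(vcs, "E", 5, "p1"))   # 1
```

In the single FIFO, p1 waits behind p0 even though its own output is free; with a second virtual channel it leaves immediately.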



## Pipelining the NoC router microarchitecture



"A Delay Model and Speculative Architecture for Pipelined Routers," L. S. Peh and W. J. Dally, *Proc. of the 7th Int'l Symposium on High Performance Computer Architecture*, January, 2001.


#### Average packet latency of mesh NoCs

- 5-stage router pipeline
- Uniform traffic (destination nodes are selected at random)
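For intuition about how latency grows with mesh size, the average hop count under uniform traffic can be computed by enumeration. This is a rough sketch, not the evaluation method of the cited paper: it assumes minimal (Manhattan-distance) routes, source and destination always different, and a zero-load estimate in which a packet spends the full router pipeline depth in each of the hops + 1 routers it visits, ignoring link delay, serialization, and contention.

```python
from itertools import product

def average_hops(k):
    """Mean Manhattan distance between distinct nodes of a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    total = pairs = 0
    for sx, sy in nodes:
        for dx, dy in nodes:
            if (sx, sy) != (dx, dy):
                total += abs(sx - dx) + abs(sy - dy)
                pairs += 1
    return total / pairs

def zero_load_latency(k, router_stages=5):
    """Assumed model: every visited router costs `router_stages` cycles."""
    return (average_hops(k) + 1) * router_stages

for k in (4, 8, 16):
    print(k * k, round(average_hops(k), 2), round(zero_load_latency(k), 1))
```

Measured latency under load sits above such an estimate, since the estimate leaves out queuing delay entirely.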




Thiem Van Chu, Myeonggu Kang, Shi FA and Kenji Kise: Enhanced Long Edge First Routing Algorithm and Evaluation in Large-Scale Networks-on-Chip, IEEE 11th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, (September 2017).
