Parallel and Reconfigurable VLSI Computing (1)

## **FPGA Introduction**

Hiroki Nakahara
Tokyo Institute of Technology

#### Outline

- Class guide
- FPGA Basis
  - FPGA Architecture
- Standard FPGA Design
  - RTL (Register Transfer Level)
- Summary

#### FY'19 Schedule

| Date | #  | Topic |
|------|----|-------|
| 6/14 | 1  | Tutorial & FPGA Basis |
| 6/18 | 2  | Hardware Preliminary |
| 6/21 | 3  | Walk Through FPGA Design |
| 6/25 | 4  | FPGA Architecture |
| 6/28 | 5  | RTL Design |
| 7/2  | 6  | FPGA Synthesis Flow |
| 7/5  | 7  | Practical RTL Design |
| 7/9  |    | Cancelled |
| 7/12 |    | Cancelled |
| 7/16 | 8  | RTL Design: Tiny Processor |
| 7/19 | 9  | High-Level Synthesis (HLS) Design: Introduction |
| 7/23 | 10 | HLS Optimizations |
| 7/26 | 11 | Practical HLS Design |
| 7/30 | 12 | Complexity of Logic Functions and Their Decomposition: Synthesis for an FPGA |
| 8/2  | 13 | Deep Neural Network on an FPGA |

#### Evaluation

- Report: TBD
- Exercises
  - Submit as a PDF file to OCW-i
- Lecture Slides:
  - Available on TOKYO TECH OCW

## FPGA Basis

# The Dilemma: Flexibility vs. Efficiency

- FPGAs often offer the best of both worlds, replacing MPUs, DSPs, and dedicated ASSPs or ASICs
- Their on-the-fly reconfigurability lets them realize in-system logic functions that CPUs can't



#### **FPGA**

- Reconfigurable LSI or Programmable Hardware
- Programmable Logic Array and Programmable Interconnection
- Programmed by Reconfigurable Data



## Island Style FPGA



#### Xilinx CLB



Source: The Design Warrior's Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043. Copyright © 2004 Mentor Graphics Corp. (www.mentor.com)

## Logic Cell (Xilinx Inc. XC2000)



## Realization of a Logic Function

| x0 | x1 | x2 | y |
|----|----|----|---|
| 0  | 0  | 0  | 0 |
| 0  | 0  | 1  | 0 |
| 0  | 1  | 0  | 0 |
| 0  | 1  | 1  | 1 |
| 1  | 0  | 0  | 0 |
| 1  | 0  | 1  | 0 |
| 1  | 1  | 0  | 0 |
| 1  | 1  | 1  | 1 |



| x0 | x1 | x2 | y |
|----|----|----|---|
| 0  | 0  | 0  | 0 |
| 0  | 0  | 1  | 1 |
| 0  | 1  | 0  | 1 |
| 0  | 1  | 1  | 1 |
| 1  | 0  | 0  | 1 |
| 1  | 0  | 1  | 1 |
| 1  | 1  | 0  | 1 |
| 1  | 1  | 1  | 1 |
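A minimal Python sketch of how a look-up table realizes a logic function: the truth table is stored in a small memory and the inputs form the read address. The stored contents below follow the second truth table above (output 0 only for input 000, i.e. a 3-input OR). The names `make_lut`, `or3`, and `and3` are illustrative assumptions, and the 3-input AND is added only to show that new memory contents change the function.

```python
def make_lut(truth_table):
    """Return a 3-input LUT whose behavior is defined only by its contents.

    truth_table[i] is the output for the input combination whose binary
    encoding (x0 is the most significant bit) equals i.
    """
    def lut(x0, x1, x2):
        address = (x0 << 2) | (x1 << 1) | x2  # inputs act as the memory address
        return truth_table[address]
    return lut

# Contents for a 3-input OR: only address 000 stores 0.
or3 = make_lut([0, 1, 1, 1, 1, 1, 1, 1])

# Same structure, different contents: a 3-input AND (assumed example).
and3 = make_lut([0, 0, 0, 0, 0, 0, 0, 1])

for address in range(8):
    x0, x1, x2 = (address >> 2) & 1, (address >> 1) & 1, address & 1
    print(x0, x1, x2, or3(x0, x1, x2), and3(x0, x1, x2))
```

Changing the function never touches `lut` itself, only the list passed to `make_lut`; that is the sense in which the logic is "programmed" by data.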



#### Channel and Switch Block



#### Memory-Based Configuration Realizes "Programmable"



## Realization of Logic Network

```
input ap_clk;
input [0:0] reset;
output [15:0] bmp_address0;
output [11:0] bmp_d0;
input [7:0] pad0;
input [7:0] pad1;
output [31:0] ap_return;
(* fsm_encoding = "none" *) reg [43:0] ap_CS_fsm = 44'b1;
      ap_sig_cseq_ST_st1_fsm_0;
     [11:0] ram_address0;
      [7:0] ram_q0;
     [7:0] sprram_address0;
     [7:0] sprram_d0;
      [7:0] sprram_q0;
```

CAD tools: Vivado (Xilinx), Quartus II (Intel)



Describe the logic in a hardware description language (VHDL or Verilog-HDL)



#### Product Type Segments

- SRAM
- Flash based
- Antifuse

Global FPGA market share, by technology, 2015 (USD Million)



Source: https://www.grandviewresearch.com/industry-analysis/fpga-market

#### **FPGA Growth Trend**

Over 20 years, FPGAs have been swallowing up system components (source: Altera, now a part of Intel)



## FPGA Mixed with CPUs: The Era of the Programmable SoC

 A generic example of an SoC FPGA, sometimes also known as an application services platform (ASP), shows a dual-core hard processor system with its complement of hard peripherals on the same die with an FPGA fabric



Xilinx ZYNQ Family

**Intel SoC Series** 

## Application Type Segments

- Industrial
- Automotive
- Consumer electronics
- Military & aerospace
- Telecom
- Data processing
- Others

#### iPhone7 Plus



Source: https://www.ifixit.com/Teardown/iPhone+7+Plus+Teardown/67384

#### Market by Application

U.S. FPGA Market by application, 2014 - 2024 (USD Billion)



Source: https://www.grandviewresearch.com/industry-analysis/fpga-market

## Market Share by Vendor

| Vendor         | 2015 FPGA total (USD million) | 2015 market share | 2016 FPGA total (USD million) | 2016 market share | Growth CY15-CY16 |
|----------------|-------------------------------|-------------------|-------------------------------|-------------------|------------------|
| Xilinx         | \$2,044                       | 53%               | \$2,167                       | 53%               | 6%               |
| Intel (Altera) | \$1,389                       | 36%               | \$1,486                       | 36%               | 7%               |
| Microsemi      | \$301                         | 8%                | \$297                         | 7%                | -1%              |
| Lattice        | \$124                         | 3%                | \$144                         | 3%                | 16%              |
| QuickLogic     | \$19                          | 0%                | \$11                          | 0%                | -40%             |
| Others         | \$2                           | 0%                | \$2                           | 0%                | 0%               |
| TOTAL          | \$3,879                       | 100%              | \$4,112                       | 100%              | 6%               |

Source: EE Times, 3/5/2017

#### FPGA vs. ASIC: Which is Best?



Source: https://anysilicon.com/fpga-vs-asic-choose/

#### Price Comparison

- ASIC→NRE: \$1.5M, Unit cost: \$4
- FPGA→NRE: \$0, Unit cost: \$8



Total cost of ASIC vs. FPGA, including NRE (million USD)
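The crossover point of the two cost curves follows directly from the numbers on this slide. The sketch below assumes a simple linear model (total cost = NRE + unit cost × volume); the function and constant names are illustrative.

```python
def total_cost(nre, unit_cost, units):
    """Linear cost model: one-time NRE plus a per-unit cost."""
    return nre + unit_cost * units

# Figures from the slide (USD).
ASIC_NRE, ASIC_UNIT = 1_500_000, 4
FPGA_NRE, FPGA_UNIT = 0, 8

# Both totals are equal when the NRE gap is paid off by the unit-cost gap.
break_even = (ASIC_NRE - FPGA_NRE) / (FPGA_UNIT - ASIC_UNIT)
print(break_even)  # 375000.0

for units in (100_000, 375_000, 1_000_000):
    asic = total_cost(ASIC_NRE, ASIC_UNIT, units)
    fpga = total_cost(FPGA_NRE, FPGA_UNIT, units)
    print(units, asic, fpga)
```

Under this model the FPGA is cheaper below 375,000 units and the ASIC is cheaper above it.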

### Performance Comparison

 The divergence between primary programmable logic device technology and that used for ASICs has continued to grow



Source: http://archive.rtcmagazine.com/articles/view/102503

# FPGA Programming

## Standard FPGA Design



### FPGA Design Flow



#### How to Write HDL?





#### Comparison of # Lines

```
Y = X.dot(W)+B
```

Python: single line!

```
int i, j, k, term;  // x, w, y: 2x2 arrays declared elsewhere

// Compute terms
for(i=0;i<2;i++){
  for(j=0;j<2;j++){
    term = 0;
    for(k=0;k<2;k++)
      term = term + x[i][k]*w[k][j];
    y[i][j] = term;
  }
}
```

C/C++: ten lines

```
module mat_add(
    input clk, rst,
    input [7:0] x[0:3],
    output [7:0] y[0:3]
);
reg [1:0] state;
reg [1:0] mux1, mux2;
reg [7:0] w0, w1;
reg [1:0] de_mux;

// mux() is not defined in the original listing;
// this 4-to-1 operand selector is an assumed definition.
function [7:0] mux(input [1:0] sel, input [7:0] a, b, c, d);
    case (sel)
        2'b00:   mux = a;
        2'b01:   mux = b;
        2'b10:   mux = c;
        default: mux = d;
    endcase
endfunction

// Control FSM: one state per output element
always @(posedge clk or posedge rst) begin
    if (rst == 1'b1) begin
        state <= 2'b00;
    end else begin
        case (state)
        2'b00: begin
            state <= 2'b01;
            mux1 <= 2'b10;
            mux2 <= 2'b11;
            w0 <= 8'b00101000;
            w1 <= 8'b11000101;
            de_mux <= 2'b11;
        end
        2'b01: begin
            state <= 2'b10;
            mux1 <= 2'b00;
            mux2 <= 2'b01;
            w0 <= 8'b00101000;
            w1 <= 8'b11000101;
            de_mux <= 2'b00;
        end
        2'b10: begin
            state <= 2'b11;
            mux1 <= 2'b00;
            mux2 <= 2'b01;
            w0 <= 8'b00101000;
            w1 <= 8'b11000101;
            de_mux <= 2'b01;
        end
        2'b11: begin
            state <= 2'b00;
            mux1 <= 2'b00;
            mux2 <= 2'b01;
            w0 <= 8'b00101000;
            w1 <= 8'b11000101;
            de_mux <= 2'b10;
        end
        endcase
    end
end

// Datapath: two multipliers and one adder shared across states
wire [15:0] mul1, mul2;
wire [16:0] w_add;
assign mul1 = w0 * mux(mux1, x[0], x[1], x[2], x[3]);
assign mul2 = w1 * mux(mux2, x[0], x[1], x[2], x[3]);
assign w_add = mul1 + mul2;
assign y[0] = (de_mux == 2'b00) ? w_add[7:0] : 8'bz;
assign y[1] = (de_mux == 2'b01) ? w_add[7:0] : 8'bz;
assign y[2] = (de_mux == 2'b10) ? w_add[7:0] : 8'bz;
assign y[3] = (de_mux == 2'b11) ? w_add[7:0] : 8'bz;
endmodule
```

Verilog-HDL: 66 lines

#### Boolean Network

- Representation of a combinational logic circuit as a directed acyclic graph (DAG)
- Vertex: logic gate; edge: gate input or output (signal)
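As an illustration of this representation, the sketch below stores a Boolean network as a dictionary (vertex → gate type and fanin edges) and evaluates it by following edges back to the primary inputs. The 1-bit full adder and all names (`GATES`, `NETWORK`, `evaluate`) are assumptions chosen for the example, not taken from the slides.

```python
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

# Each vertex is a gate; its two fanin names are the incoming edges.
# Assumed example circuit: a 1-bit full adder.
NETWORK = {
    "p": ("XOR", "a", "b"),
    "sum": ("XOR", "p", "cin"),
    "g": ("AND", "a", "b"),
    "t": ("AND", "p", "cin"),
    "cout": ("OR", "g", "t"),
}

def evaluate(network, inputs, outputs):
    """Evaluate the requested outputs; acyclicity guarantees termination."""
    values = dict(inputs)  # primary inputs are the sources of the DAG

    def value_of(node):
        if node not in values:
            op, left, right = network[node]
            values[node] = GATES[op](value_of(left), value_of(right))
        return values[node]

    return {name: value_of(name) for name in outputs}

result = evaluate(NETWORK, {"a": 1, "b": 1, "cin": 0}, ["sum", "cout"])
print(result)  # {'sum': 0, 'cout': 1}
```

Because the graph has no cycle, every vertex is computed once from already-known values, which is what makes the circuit combinational.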



## Logic Synthesis

- Synthesize a Boolean network from a given HDL specification



## Technology Mapping

- A kind of graph covering problem
- Goal: a depth-optimized covering, found by dynamic programming
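One way to picture the dynamic program is a cut-enumeration sketch: for each gate, list the sets of at most K signals that can feed a single LUT rooted at that gate, and keep the set whose deepest leaf is shallowest. This is a simplified illustration under a unit-delay model, not necessarily the algorithm used by any particular CAD tool; `map_for_depth`, the full-adder network, and the assumption that no gate has more than K fanins are all choices made for the example.

```python
from itertools import product

def map_for_depth(network, primary_inputs, k):
    """Return (depth, best_cut) for a K-input LUT covering.

    network maps each gate to its fanin names, listed in topological
    order (fanins first). Assumes every gate has at most k fanins.
    """
    cuts = {pi: [frozenset([pi])] for pi in primary_inputs}
    depth = {pi: 0 for pi in primary_inputs}
    best_cut = {}

    def cut_depth(cut):
        return max(depth[leaf] for leaf in cut)

    for node, fanins in network.items():
        candidates = set()
        # Merge one cut per fanin; keep merges that fit into one LUT.
        for combo in product(*(cuts[f] for f in fanins)):
            merged = frozenset().union(*combo)
            if len(merged) <= k:
                candidates.add(merged)
        # DP step: a LUT here sits one level above its deepest leaf.
        chosen = min(candidates, key=lambda c: (cut_depth(c), len(c), sorted(c)))
        depth[node] = 1 + cut_depth(chosen)
        best_cut[node] = chosen
        # The trivial cut {node} lets later gates treat this LUT as a leaf.
        cuts[node] = sorted(candidates, key=sorted) + [frozenset([node])]
    return depth, best_cut

# Assumed example: the gates of a 1-bit full adder (fanins only).
FULL_ADDER = {
    "p": ("a", "b"),
    "sum": ("p", "cin"),
    "g": ("a", "b"),
    "t": ("p", "cin"),
    "cout": ("g", "t"),
}
INPUTS = ("a", "b", "cin")

depth3, cut3 = map_for_depth(FULL_ADDER, INPUTS, k=3)
depth2, cut2 = map_for_depth(FULL_ADDER, INPUTS, k=2)
print(depth3["sum"], depth3["cout"])  # 1 1
print(depth2["sum"], depth2["cout"])  # 2 3
```

With 3-input LUTs each output collapses into a single LUT fed directly by `a`, `b`, and `cin`; with 2-input LUTs the carry needs three levels.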



#### Placement

- Assign each module (logic gate) to a slot (location)
  - 2D allocation problem → NP-complete
  - Solved approximately (simulated annealing or min-cut techniques)
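A minimal simulated-annealing sketch for the placement problem above. It assumes two-pin nets, Manhattan wirelength as the cost, a grid with exactly one slot per module, and a geometric cooling schedule; all names and parameter values are illustrative, and real placers use far richer cost models.

```python
import math
import random

def wirelength(placement, nets):
    """Total Manhattan length over two-pin nets."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total

def anneal(modules, nets, width, height, steps=5000,
           t_start=5.0, t_end=0.01, seed=0):
    """Swap-based annealing; returns the best placement seen and its cost."""
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(slots)
    placement = dict(zip(modules, slots))  # random initial placement
    cost = wirelength(placement, nets)
    best, best_cost = dict(placement), cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        a, b = rng.sample(modules, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / T) to escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        else:
            placement[a], placement[b] = placement[b], placement[a]  # undo
    return best, best_cost

# Assumed example: nine modules connected in a chain, on a 3x3 grid.
MODULES = list("ABCDEFGHI")
NETS = list(zip(MODULES, MODULES[1:]))
best, best_cost = anneal(MODULES, NETS, width=3, height=3)
print(best_cost)
```

The result is a heuristic answer: the returned cost is never worse than the random starting placement, but optimality is not guaranteed.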



#### Routing

- Global routing: determine the rough wiring path
- Detailed (local) routing: determine the wiring segments and switches
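To give a feel for the detailed (local) step, the sketch below finds a shortest wire between two points on a grid with blocked cells using breadth-first search, in the style of a classic Lee maze router. It is a textbook illustration only: the grid model, the `route` function, and the example are assumptions, and production FPGA routers must also resolve congestion between many nets.

```python
from collections import deque

def route(grid, source, target):
    """Shortest path over free cells (0) avoiding blocked cells (1).

    Returns the list of cells from source to target, or None if the
    target cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    previous = {source: None}  # also serves as the visited set
    frontier = deque([source])
    while frontier:
        cell = frontier.popleft()
        if cell == target:
            path = []
            while cell is not None:  # walk back to the source
                path.append(cell)
                cell = previous[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in previous:
                previous[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None

# Assumed example: a wall forces the wire to detour around the right edge.
GRID = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = route(GRID, (0, 0), (2, 0))
print(len(path) - 1)  # 8 wire segments
```

Each step of the returned path corresponds to choosing one wiring segment; on a real device the blocked cells would be segments already used by other nets.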



#### Pressure on Design Time

- Design time = #lines = cost (\$)
- Move to a higher-level description
  - High-level synthesis for C/C++


### High-Level Synthesis (HLS)





## System on Chip FPGA



Source: Xilinx Inc. Zynq-7000 All Programmable SoC

## Conventional Design Flow for the SoC FPGA



1. Behavior design
2. Profile analysis
3. IP core generation by HLS
4. Bitstream generation by FPGA CAD tool
5. Middleware generation

## System Design Tool for the SoC FPGA



1. Behavior design + pragmas
2. Profile analysis
3. IP core generation by HLS
4. Bitstream generation by FPGA CAD tool
5. Middleware generation



Automatically done

## Summary

- FPGA: Reconfigurable LSI or Programmable Hardware
  - Consists of a programmable logic array and a programmable interconnection
  - Programmable (Memory-based) switch
- Standard FPGA design is RTL-based
  - Shifting to high-level (C/C++) design
- Benefits: Productivity, lower non-recurring engineering costs, maintainability, faster time to market

# Exercise: Install Vivado and SDK

## Set Up Your FPGA Development Environment on Ubuntu 16.04 LTS



### Create Your Xilinx Account




\$ cd /Download

\$ chmod a+x Xilinx\_Vivado\_SDK\_Web\_2017.4\_1216\_1\_Lin64.bin

\$ sudo su (switch to the root user)

\# ./Xilinx\_Vivado\_SDK\_Web\_2017.4\_1216\_1\_Lin64.bin



Select "Continue" at "A Newer Version is Available" Dialog, then click "Next".

At "Select Install Type" window, enter your User ID and Password, and choose "Download and Install Now", then click "Next".

Next, check every "I Agree" box to accept the license agreements, then click "Next".



#### Select Edition to Install


Choose the Vivado HL WebPACK edition, which is the free version.

For the WebPACK installation, you do not need to customize "Design Tools", "Devices", or "Installation Options". Just click "Next".

Set your installation directory to "/opt/Xilinx".

Wait 3-4 hours...

### Run Vivado

```
$ sudo su    # switch to the root user
# source /opt/Xilinx/Vivado/2017.4/settings64.sh
# vivado &
```

## Run Xilinx SDK (Software Development Kit)

```
$ sudo su    # switch to the root user
# source /opt/Xilinx/Vivado/2017.4/settings64.sh
# xsdk &
```

### Run Vivado HLS

```
$ sudo su    # switch to the root user
# source /opt/Xilinx/Vivado/2017.4/settings64.sh
# vivado_hls &
```

### Exercise 1

1. (Mandatory) Install the Vivado HLx edition on your PC, and e-mail a PDF containing screenshots of the startup windows (Vivado, Vivado HLS, SDK).

If you run into any trouble, don't hesitate to contact me.

nakahara@ict.e.titech.ac.jp

2. (Mandatory) Why can an FPGA be manufactured with a high-end CMOS process?
3. (Mandatory) Why is RTL-based design necessary for FPGA implementation?
4. (Mandatory) Investigate the FPGA market over the past 10 years.

Deadline: June 18, 2019, 13:20 JST (at the beginning of the next lecture)