#### Parallel and Reconfigurable VLSI Computing (7)

# Practical RTL Design

Hiroki Nakahara
Tokyo Institute of Technology

#### Outline

- Practical RTL design methodology
  - From a behavioral description (C/C++ code) to an HDL description
- Interface co-design
  - Control hardware from an ARM processor
- RTL design optimization

# Practical RTL Design Methodology

# C/C++ to RTL

- Determine the specifications of the circuit
  - Timing chart, state transition diagram, performance, block diagram
- Consider the configuration of the modules
  - Design each IP core, then combine them
- Decide the function assigned to each core and the resources it needs
  - Including consideration of the interface
  - Often written in C/C++
  - This description later serves as a testbench for verification
- Convert the C/C++ description to RTL
- Optimize the behavior (pipelining and parallelization)
  - The remaining work is automated by CAD tools

# C/C++ Description of a Module Concept

- Approximately 300 lines for a single function
- Data input/output (Interface)
- Data processing



# Case Study: FIR Filter

• Let $x_n$ be the sampled input signal and $y_n$ the output signal of an $N$-tap filter. Then

$$y_n = \sum_{k=0}^{N-1} h_k x_{n-k}$$

where the filter coefficient $h_n$ is given by

$$h_n = \frac{\rho_n}{2\pi} \int_0^{2\pi} d\tilde{\omega}\, e^{i\tilde{\omega}(n-\tilde{\tau})} H_0(\tilde{\omega})$$

Here $\tilde{\omega}=2\pi\omega/\omega_s$ denotes the normalized frequency, $\omega_s$ the sampling frequency, $H_0(\tilde{\omega})\in \mathbf{R}$ the frequency characteristic, $\rho_n$ the window function, and $\tilde{\tau}=(N-1)/2$.

#### Cont'd

• Difference equation for an FIR filter (with coefficients $a_k = h_k$):

$$y[n] = a_0x[n] + a_1x[n-1] + a_2x[n-2] + a_3x[n-3] + a_4x[n-4] + \cdots + a_{N-1}x[n-(N-1)]$$

• Diagram for an FIR filter:



# FIR Filter Coefficient Design

- MathWorks Matlab with DSP System Toolbox
  - Sampling Freq.: 44.1 kHz
  - LPF for 20 kHz  $\rightarrow$  Normalized cut-off freq.
  - #Taps: 11
  - Window function: Hamming
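As a worked check of the coefficient formula, assume an ideal low-pass response ($H_0 = 1$ in the passband, $0$ elsewhere) with normalized cut-off $\tilde{\omega}_c = 2\pi \times 0.17 = 0.34\pi$ (the 0.17 quoted in the comment of the C listing) and the standard Hamming window. Both are assumptions made for illustration, not settings confirmed by the slides. With $N = 11$ and $\tilde{\tau} = 5$, the integral reduces to a windowed sinc:

$$\begin{aligned}
h_n &= \rho_n \,\frac{\sin\big(\tilde{\omega}_c (n-\tilde{\tau})\big)}{\pi (n-\tilde{\tau})}, \qquad
\rho_n = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right),\\
h_5 &= \rho_5\,\frac{\tilde{\omega}_c}{\pi} = 1 \times 0.34 = 0.34,\\
h_4 &= \rho_4\,\frac{\sin(0.34\pi)}{\pi} \approx 0.9121 \times 0.2789 \approx 0.2544 .
\end{aligned}$$

Under these assumptions the two values agree with the center coefficients 3.40e-01 and 2.54e-01 in the C listing that follows.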



#### C Behavior for a FIR Filter

https://github.com/HirokiNakahara/FPGA\_lecture/tree/master/Lec7\_Practical\_RTL\_design/fir.c

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 11

void fir(float *y, float x)
{
    float c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -4.120289718403869e-03, -1.208600321298122e-02,
        -2.650603053411641e-03,  9.166631627169690e-02,
         2.544318483405623e-01,  3.400000000000001e-01,
         2.544318483405623e-01,  9.166631627169690e-02,
        -2.650603053411641e-03, -1.208600321298122e-02,
        -4.120289718403869e-03, };

    static float shift_reg[N]; // delay line, kept between calls
    float acc;
    int i;

    acc = 0.0;
    for (i = N - 1; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

int main(void)
{
    float fs = 44100.0;
    int len = 1000;
    float f0 = 20000.0;
    float sin_wave;
    float fir_out;
    int i;

    for (i = 0; i < len; i++) {
        sin_wave = sin(2.0 * M_PI * f0 * i / fs);
        fir(&fir_out, sin_wave);
        printf("%d %f %f\n", i, sin_wave, fir_out);
        f0 = f0 - 10.0; // sweep the input frequency downward
    }
    return 0;
}
```

# Debug for C Description

- Confirm the operation of FIR
- The input/output data are reused as a testbench for the HDL simulation
- Note that parallel operation cannot be verified
- The area and speed of the circuit cannot be estimated
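Reusing the C input/output data as HDL test vectors usually means dumping them to a text file. As one possible approach (an assumption, not the method used in the lecture repository), each sample can be written as one hex word per line, the format read by Verilog's `$readmemh`. The helper name `format_sample_hex` and the 16-bit word width are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format one signed 16-bit sample as a 4-digit hex word (two's
 * complement). Returns the number of characters written. */
int format_sample_hex(char *buf, size_t len, int16_t sample)
{
    assert(buf != NULL && len >= 5);
    return snprintf(buf, len, "%04x", (unsigned)(uint16_t)sample);
}
```

Calling this once per input sample and once per expected output, and writing each string on its own line, gives two files that a Verilog testbench could load and compare against the RTL output.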



#### Convert to Fixed Point Precision

https://github.com/HirokiNakahara/FPGA\_lecture/tree/master/Lec7\_Practical\_RTL\_design/fir\_int.c

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 11
#define PREC 65536 // 2**16 sign + 15bit precision

void fir(int *y, int x)
{
    int c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -136, -397, -87, 3004, 8338, 11142, 8338,
        3004, -87, -397, -136, };

    static int shift_reg[N]; // delay line, kept between calls
    int acc;
    int i;

    acc = 0;
    for (i = N - 1; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

int main(void)
{
    float fs = 44100.0;
    int len = 1000;
    float f0 = 20000.0;
    float sin_wave;
    int fir_out;
    int i;

    for (i = 0; i < len; i++) {
        sin_wave = sin(2.0 * M_PI * f0 * i / fs);
        fir(&fir_out, (int)(sin_wave * PREC));
        printf("%d %f %f\n", i, sin_wave, (float)fir_out / PREC);
        f0 = f0 - 10.0; // sweep the input frequency downward
    }
    return 0;
}
```
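The integer coefficients above appear to be the floating-point coefficients of fir.c scaled by $2^{15}$, with each magnitude rounded up. That reading is inferred from the numbers and is not stated in the lecture. A minimal sketch that reproduces the table under this assumption (the names `Q15_ONE`, `quantize_q15`, and `quantize_all` are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

#define N 11
#define Q15_ONE 32768 /* 2^15: assumed coefficient scale */

/* Floating-point coefficients copied from fir.c. */
static const double c_float[N] = {
    -4.120289718403869e-03, -1.208600321298122e-02,
    -2.650603053411641e-03,  9.166631627169690e-02,
     2.544318483405623e-01,  3.400000000000001e-01,
     2.544318483405623e-01,  9.166631627169690e-02,
    -2.650603053411641e-03, -1.208600321298122e-02,
    -4.120289718403869e-03,
};

/* Scale v by 2^15 and round the magnitude up (away from zero). */
int quantize_q15(double v)
{
    double mag = (v < 0.0 ? -v : v) * Q15_ONE;
    int q = (int)mag;          /* truncate toward zero */
    if ((double)q < mag)
        q++;                   /* round the magnitude up */
    return v < 0.0 ? -q : q;
}

/* Quantize n coefficients from src into dst. */
void quantize_all(const double *src, int *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = quantize_q15(src[i]);
}
```

Calling `quantize_all(c_float, q, N)` fills `q` with -136, -397, -87, 3004, 8338, 11142, 8338, 3004, -87, -397, -136, which matches the integer table in fir\_int.c.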

#### C Behavior to RTL

- RTL → Data path + FSM
- Rewrite control statements (while, for, switch) as if-then and goto statements, then convert them to an FSM
- Assign a label to each statement → FSM state number



# Example

#### Write FSM

- Convert the if-then and goto statements to an FSM
  - Keep writing FSMs by hand until you get used to it!
- Add initialization processing (register values after reset)
- Make the whole process an infinite loop
  - Generally, return to the initial state after all processing has finished

### Example

```
L1:      acc = 0, i = N - 1;
L2:      if i == -1 then goto L4;
L3:      if i != 0 then goto L3_3;
L3_1:    acc += x * c[0];
L3_2:    shift_reg[0] = x;
EndIf:   goto Endloop;
L3_3:    shift_reg[i] = shift_reg[i - 1];
L3_4:    acc += shift_reg[i] * c[i];
Endloop: i = i - 1, goto L2;
L4:      *y = acc;
```
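The labelled listing maps directly onto a state machine: each label becomes a state, and each `goto` becomes a next-state assignment. The sketch below models this in C, one state per pass through the `switch`. It is an illustration rather than the lecture's fir\_1.v. The function name `fir_fsm` and the cycle counter are hypothetical, and the `EndIf` jump is folded into the transition out of `L3_2`:

```c
#include <assert.h>

#define N 11

/* State numbers follow the labels of the goto-style listing. */
enum state { L1, L2, L3, L3_1, L3_2, L3_3, L3_4, ENDLOOP, L4 };

static const int c[N] = { -136, -397, -87, 3004, 8338, 11142,
                          8338, 3004, -87, -397, -136 };

/* Run the FSM for one input sample x. Stores the filter output in *y
 * and returns the number of states visited ("clock cycles"). */
int fir_fsm(int *y, int x, int shift_reg[N])
{
    enum state s = L1;
    int acc = 0, i = 0, cycles = 0;

    assert(y != 0 && shift_reg != 0);
    for (;;) {
        cycles++;
        switch (s) {
        case L1:      acc = 0; i = N - 1;              s = L2;      break;
        case L2:      s = (i == -1) ? L4 : L3;                      break;
        case L3:      s = (i != 0) ? L3_3 : L3_1;                   break;
        case L3_1:    acc += x * c[0];                 s = L3_2;    break;
        case L3_2:    shift_reg[0] = x;                s = ENDLOOP; break;
        case L3_3:    shift_reg[i] = shift_reg[i - 1]; s = L3_4;    break;
        case L3_4:    acc += shift_reg[i] * c[i];      s = ENDLOOP; break;
        case ENDLOOP: i = i - 1;                       s = L2;      break;
        case L4:      *y = acc;                        return cycles;
        }
    }
}
```

With one statement per state, this model visits 58 states per output sample for N = 11, which is why merging statements into fewer states pays off.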



# Parallel Processing

Concurrent assignment

```
// Sequential (three states)
tmp = A;
A = B;
B = tmp;

// Concurrent (one state, non-blocking assignments)
A <= B;
B <= A;
```

Continuous assignments

$$A=B; B=C; \rightarrow A=C;$$

- Reduce the number of states by parallel processing
- Consider simultaneous assignment from the start of the FSM description
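To see why concurrent assignment saves states, the register update can be modelled in C by sampling all right-hand sides first and committing them together. This is a minimal sketch; the struct `regs` and both function names are hypothetical:

```c
#include <assert.h>

/* Register pair of a hypothetical datapath. */
struct regs { int a; int b; };

/* Sequential swap: three statements, i.e. three FSM states. */
int swap_sequential(struct regs *r)
{
    int tmp, states = 0;

    assert(r != 0);
    tmp = r->a;  states++;
    r->a = r->b; states++;
    r->b = tmp;  states++;
    return states;
}

/* Concurrent swap: every right-hand side is read from the current
 * register values, then all registers are updated together, which
 * mimics A <= B; B <= A; on one clock edge. */
int swap_concurrent(struct regs *r)
{
    struct regs next;

    assert(r != 0);
    next.a = r->b;
    next.b = r->a;
    *r = next;       /* the "clock edge" */
    return 1;
}
```

Both functions leave the registers swapped, but the concurrent version needs one state instead of three and no temporary register.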

# Further Simplification



#### RTL Simulation for an FIR Filter

See https://github.com/HirokiNakahara/FPGA\_lecture/tree/master/Lec7\_Practical\_RTL\_design/

- FIR Filter module: fir\_1.v
- Testbench for FIR Filter: testbench\_fir\_1.v



# Interface Co-Design

#### Interface

• Data transmission and reception between modules



# AXI4 Bus: General Interface of ARM-Embedded FPGAs

- Complex protocols
  - With high-level synthesis (HLS), the interface can be generated easily using a directive
  - The system design tool (SDSoC) automatically selects the best protocol

|                         | AXI4                                             | AXI4-Lite                                                       | AXI4-Stream                                      |
|-------------------------|--------------------------------------------------|-----------------------------------------------------------------|--------------------------------------------------|
| Dedicated for           | high-performance<br>and memory<br>mapped systems | register-style interfaces<br>(area efficient<br>implementation) | non-address<br>based IP<br>(PCIe, Filters, etc.) |
| Burst<br>(data beats)   | up to 256                                        | 1                                                               | Unlimited                                        |
| Data width              | 32 to 1024 bits                                  | 32 or 64 bits                                                   | any number of bytes                              |
| Applications (examples) | Embedded, memory                                 | Small footprint control logic                                   | DSP, video, communication                        |

# Case Study: AXI4 Bus Connection

LED blinking via the AXI4-Lite bus



### Create a New Project

Project location: C:\...

Target FPGA: Zybo-Z7-10 (or Z7-20)

Design Sources: None

Constraints: Zybo-Z7-Master.xdc

Simulation Sources: None

# Create AXI4 Peripheral

Select "Tools->Create and Package New IP", check "Create AXI4 Peripheral", then click "Next"



# Specify IP Location

Specify "ip\_repo" in your project directory as the IP location, then click "Next"



#### Edit Interface

• Keep the default "AXI4-Lite Slave (four 32-bit registers)" setting, click "OK", then "Finish"



# Edit "myip\_v1.0"

- Click Flow Navigator->PROJECT MANAGER -> IP Catalog
- Make sure "myip\_v1.0" appears under "User Repository" in the "IP Catalog"
- Right click on "myip\_v1.0", then select "Edit in IP Packager"
  - Click "OK" to save the project location



# Synthesize "myip" in the New Vivado Window

Make sure that "myip\_v1\_0.v" is listed as the wrapper and "myip\_v1\_0\_S00\_AXI.v" as the top module



# Edit "myip\_v1\_0\_S00\_AXI.v"

```
// File: c:/FPGA/test_ipgen/ip_repo/myip_1.0/hdl/myip_v1_0_S00_AXI.v

        // Width of S_AXI address bus
        parameter integer C_S_AXI_ADDR_WIDTH = 4
        // ...
        // Users to add ports here
        output wire [3:0] LED,

        // Do not modify the ports beyond this line

        // Global Clock Signal
        input wire S_AXI_ACLK,
        // Global Reset Signal. This Signal is Active LOW
        // ...

    // Add user logic here
    assign LED = slv_reg0[3:0];

    // User logic ends

    endmodule
```

# Edit "myip\_v1\_0.v"

```
        // Parameters of Axi Slave Bus Interface S00_AXI
        parameter integer C_S00_AXI_DATA_WIDTH
        parameter integer C_S00_AXI_ADDR_WIDTH
        // ...
        // Users to add ports here
        output wire [3:0] LED,

        // Do not modify the ports beyond this line
        // ...
        input wire s00_axi_rready
        // ...

    // Instantiation of Axi Bus Interface S00_AXI
    myip_v1_0_S00_AXI # (
        .C_S_AXI_DATA_WIDTH(C_S00_AXI_DATA_WIDTH),
        .C_S_AXI_ADDR_WIDTH(C_S00_AXI_ADDR_WIDTH)
    ) myip_v1_0_S00_AXI_inst (
        .LED(LED),
        .S_AXI_ACLK(s00_axi_aclk),
        .S_AXI_ARESETN(s00_axi_aresetn),
        .S_AXI_AWADDR(s00_axi_awaddr),
        // ...
```

# Re-Package IP

Switch to the "Package IP" tab, then click "Re-Package IP"



#### Add a ZYNQ Processor

• In the original Vivado window, select Flow Navigator -> IP INTEGRATOR -> Create Block Design, add a ZYNQ processor, and click "Run Block Automation"



# Add a "myip" IP

Place your "myip" in the Block Design view, click "Run Connection Automation", then "OK"



# Regenerate Layout



#### Make External



# Specify an External Port Name



# Write Software Code to Control "myip" from a ZYNQ Processor

- Click "Generate Bitstream", then "Export Hardware", and next, "Launch SDK"
- Create a new project as "myip\_test"

```
#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"
#include "xparameters.h"

// Register 0 of myip, mapped at the base address generated by Vivado
#define LED *((volatile unsigned int *) XPAR_MYIP_0_S00_AXI_BASEADDR)

int main()
{
    init_platform();

    print("Hello World\n\r");

    int i, j;
    while (1) {
        for (i = 0; i < 6; i++) {
            xil_printf("i=%d\n", i);
            switch (i) {
            case 0: LED = 0x1; break;
            case 1: LED = 0x2; break;
            case 2: LED = 0x3; break;
            case 3: LED = 0x4; break;
            case 4: LED = 0x5; break;
            case 5: LED = 0x6; break;
            default: LED = 0x0;
            }
            for (j = 0; j < 10000000; j++); // busy-wait delay
        }
    }

    cleanup_platform();
    return 0;
}
```

#### Source Code

The memory map is generated automatically by Vivado and written to "xparameters.h"

- Build the project, then select "Xilinx->Program FPGA".
- Connect the Zybo to the PC.
- Run terminal software (e.g., Tera Term for Windows, gtkterm for Unix).
- Connect to the "USB Serial Port" at 115200 bps.
- Select the project in the Project Explorer, then choose "Run As" -> "Launch on Hardware (System Debugger)" from the menu.
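The `LED` macro writes register 0 of the peripheral directly. An equivalent formulation with small accessor functions can be tried on a PC by pointing it at an ordinary array instead of the hardware. This is a sketch under assumptions: the names `myip_set_led`, `myip_read_reg`, `MYIP_NUM_REGS`, and `MYIP_LED_MASK` are hypothetical, and the layout (four 32-bit registers, LEDs on the low 4 bits of register 0) follows the settings shown earlier:

```c
#include <assert.h>
#include <stdint.h>

#define MYIP_NUM_REGS 4u   /* four 32-bit registers (AXI4-Lite slave) */
#define MYIP_LED_MASK 0xFu /* LED = slv_reg0[3:0] */

/* Write a 4-bit LED pattern to register 0 of the peripheral at base. */
void myip_set_led(volatile uint32_t *base, uint32_t pattern)
{
    assert(base != 0);
    base[0] = pattern & MYIP_LED_MASK;
}

/* Read register idx (0..3); consecutive registers are 4 bytes apart. */
uint32_t myip_read_reg(volatile uint32_t *base, unsigned idx)
{
    assert(base != 0 && idx < MYIP_NUM_REGS);
    return base[idx];
}
```

On the board, `base` would be `(volatile uint32_t *)XPAR_MYIP_0_S00_AXI_BASEADDR`; on a PC, a local `uint32_t` array of four elements serves as a stand-in for the register file.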

# RTL Design Optimization

### Pipelining

#### (a) Non-pipelining

Iteration 2 is processed only after iteration 1 has completed




#### (b) Pipelining (n = 3 stage)

Iteration 2 starts as soon as stage 1 of iteration 1 has completed

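|              | Slot 1  | Slot 2  | Slot 3  | Slot 4  | Slot 5  |
|--------------|---------|---------|---------|---------|---------|
| Processing 1 | Stage 1 | Stage 2 | Stage 3 |         |         |
| Processing 2 |         | Stage 1 | Stage 2 | Stage 3 |         |
| Processing 3 |         |         | Stage 1 | Stage 2 | Stage 3 |

Time runs from left to right. Each stage takes $L/n$, where $L$ is the time of one non-pipelined iteration, and the total time for all iterations is $T_{pipe}$.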

## Pipeline Efficiency

Speedup of pipelined over non-pipelined processing for $N$ iterations on an $n$-stage pipeline:

$$S_{pipe}(N) = \frac{T(N)}{T_{pipe}(N)} = \frac{nN}{n+N-1} = \frac{n}{1+\frac{n-1}{N}}$$

If $n \ll N$, then $S_{pipe}(N) \cong n$, i.e., the speedup over non-pipelined processing approaches the number of stages $n$.

Efficiency is the ratio of the actually achieved speedup to the maximum speedup $n$:

$$E_{pipe}(n,N) = \frac{S_{pipe}(N)}{n} = \frac{1}{1 + \frac{n-1}{N}} = \frac{N}{N+n-1}$$
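The two formulas are easy to evaluate numerically. A minimal sketch (the function names `pipe_speedup` and `pipe_efficiency` are hypothetical):

```c
#include <assert.h>

/* Speedup S_pipe(N) = n*N / (n + N - 1) for n stages and N iterations. */
double pipe_speedup(unsigned n, unsigned N)
{
    assert(n > 0 && N > 0);
    return (double)n * N / (double)(n + N - 1);
}

/* Efficiency E_pipe(n, N) = S_pipe(N) / n = N / (N + n - 1). */
double pipe_efficiency(unsigned n, unsigned N)
{
    return pipe_speedup(n, N) / n;
}
```

For the three-stage, three-iteration case in the table, the non-pipelined version needs 9 stage times and the pipelined one 5, so the speedup is 9/5 = 1.8 and the efficiency is 0.6. With N = 1000 iterations the speedup is about 2.99, close to the limit n = 3.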

# Parallel Processing and Flynn's Taxonomy



## Loop Unrolling

Without unrolling

```
for ( int i = 0; i < N; i++){
  op_Read[i];
  op_MAC;
  op_Write[i];
}
```



Throughput: 3 cycles

Latency: 3 cycles

Operation: 1/3 data/cycle

Loop Unrolling for 3 Operations

```
for ( int i = 0; i < N/3; i++){
  op_Read[i*3];
  op_MAC;
  op_Write[i*3];
  op_Read[i*3+1];
  op_MAC;
  op_Write[i*3+1];
  op_Read[i*3+2];
  op_MAC;
  op_Write[i*3+2];
}
```

```
RD MAC WR
RD MAC WR
RD MAC WR
```

Throughput: 3 cycles

Latency: 3 cycles

Operation: 1 data/cycle
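The pseudocode above assumes that N is a multiple of 3. In real C code an unrolled loop usually needs a short clean-up loop for the leftover elements. A minimal sketch, using a multiply-accumulate on arrays as a stand-in for the read/MAC/write operations (the function names and the choice of operation are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: one read, one multiply-accumulate, one write per element. */
void mac_rolled(int *dst, const int *src, int coef, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i] * coef;
}

/* Unrolled by 3: the three statements in the body are independent,
 * so hardware (or a compiler) can execute them in parallel. */
void mac_unrolled3(int *dst, const int *src, int coef, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);
    for (; i + 3 <= n; i += 3) {
        dst[i]     += src[i]     * coef;
        dst[i + 1] += src[i + 1] * coef;
        dst[i + 2] += src[i + 2] * coef;
    }
    for (; i < n; i++)           /* leftover elements when n % 3 != 0 */
        dst[i] += src[i] * coef;
}
```

Both functions produce the same result for any `n`; only the amount of work exposed per loop iteration differs.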

### Unrolling for a FIR Filter

```
int c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
    -136, -397, -87, 3004, 8338, 11142, 8338,
    3004, -87, -397, -136, };

static int shift_reg[N];
int acc;
int i;

acc = 0;
for (i = N - 1; i >= 0; i--) {
    if (i == 0) {
        acc += x * c[0];
        shift_reg[0] = x;
    } else {
        shift_reg[i] = shift_reg[i - 1];
        acc += shift_reg[i] * c[i];
    }
}
*y = acc;
```

```
static int shift_reg[N];

shift_reg[10] = shift_reg[9];
shift_reg[ 9] = shift_reg[8];
shift_reg[ 8] = shift_reg[7];
shift_reg[ 7] = shift_reg[6];
shift_reg[ 6] = shift_reg[5];
shift_reg[ 5] = shift_reg[4];
shift_reg[ 4] = shift_reg[3];
shift_reg[ 3] = shift_reg[2];
shift_reg[ 2] = shift_reg[1];
shift_reg[ 1] = shift_reg[0];
shift_reg[ 0] = x;

*y = shift_reg[10] *  -136 + shift_reg[9] *  -397
   + shift_reg[ 8] *   -87 + shift_reg[7] *  3004
   + shift_reg[ 6] *  8338 + shift_reg[5] * 11142
   + shift_reg[ 4] *  8338 + shift_reg[3] *  3004
   + shift_reg[ 2] *   -87 + shift_reg[1] *  -397
   + shift_reg[ 0] *  -136;
```

# Dataflow for the Unrolled FIR Filter



# Pipelined Dataflow

• Insert pipeline registers into the dataflow and implement the multiply-add stages with DSP blocks
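The effect of a pipeline register can be checked in C before writing RTL. The sketch below is a hypothetical two-stage partition (multiply, then add) and is not a model of the lecture's fir\_pipe\_1.v. The struct `fir_pipe` and the function `fir_pipe_clock` are illustrative names:

```c
#include <assert.h>

#define N 11

static const int c[N] = { -136, -397, -87, 3004, 8338, 11142,
                          8338, 3004, -87, -397, -136 };

/* Pipeline state: the tap delay line plus one register per product. */
struct fir_pipe {
    int shift_reg[N];
    int prod[N];    /* pipeline registers between multiply and add */
};

/* Advance the model by one clock with input x. Returns the output that
 * is valid in this cycle, i.e. the result for the previous input. */
int fir_pipe_clock(struct fir_pipe *s, int x)
{
    int y = 0, k;

    assert(s != 0);
    /* Stage 2: add the products registered on the previous clock. */
    for (k = 0; k < N; k++)
        y += s->prod[k];
    /* Stage 1: shift in the new sample and register the new products. */
    for (k = N - 1; k > 0; k--)
        s->shift_reg[k] = s->shift_reg[k - 1];
    s->shift_reg[0] = x;
    for (k = 0; k < N; k++)
        s->prod[k] = s->shift_reg[k] * c[k];
    return y;
}
```

Feeding an impulse shows the trade-off: the first call returns 0 because the result is still in the pipeline register, and the coefficients then appear one clock later than in the unpipelined version while a new sample is still accepted every clock.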



#### RTL Simulation

- See https://github.com/HirokiNakahara/FPGA\_lecture/tree/master/Lec7\_Practical\_RTL\_design/
- Source code: fir\_pipe\_1.v, simulation code: testbench\_fir\_pipe\_1.v



#### Conclusion

- Conversion from a C behavioral description to RTL
- Hardware control via the AXI4 bus
- Optimization method
  - Concurrent assignment
  - Parallel Processing
  - Unrolling
  - Pipelining

#### Exercise

- (Mandatory) Control the LED from the ARM processor via the AXI4 bus
- (Mandatory) For the FIR filter, discuss the pros and cons of the pipelined, unrolled, and sequential versions by comparing latency, throughput, and the number of multipliers
- (Optional 1) Reduce the number of multipliers by exploiting the symmetry of the FIR filter coefficients
- (Optional 2) Design the RTL for the above FIR filter and show the simulation results

Submit the report to OCW-I in PDF format

The deadline is July 9, 2019