### Parallel and Reconfigurable VLSI Computing (10)

# HLS Optimizations

Hiroki Nakahara

Tokyo Institute of Technology

**References:** 

[1] Michael Fingeroff, "High-Level Synthesis Blue Book," Xlibris, 2010.
[2] Ryan Kastner, Janarbek Matai, Stephen Neuendorffer, "Parallel Programming for FPGAs," arXiv:1805.03648, 2018. https://arxiv.org/abs/1805.03648



### Outline

- HLS Optimizations through FIR Design
- Code reconstruction
  - Useful coding guideline for debugging
  - Performance analysis on LLVM-IR
  - Area-time trade-off
  - Code hoisting
  - Loop fission
  - Loop unrolling
  - Array partition
  - Loop pipelining
  - Bitwidth optimization

HLS Optimizations through FIR Design

### FIR Filter Background

• Let $x_n$ be the sampled input signal and $y_n$ the output signal. For an $N$-tap filter,

$$y_n = \sum_{k=0}^{N-1} h_k x_{n-k},$$

where the filter coefficient $h_n$ is given by

$$h_n = \frac{\rho_n}{2\pi} \int_0^{2\pi} d\tilde{\omega}\, e^{i\tilde{\omega}(n-\tilde{\tau})} H_0(\tilde{\omega}).$$

Here $\tilde{\omega} = 2\pi\omega/\omega_s$ denotes the normalized frequency, $\omega_s$ the sampling frequency, $H_0(\tilde{\omega}) \in \mathbb{R}$ the frequency characteristic, $\rho_n$ the window function, and $\tilde{\tau} = (N-1)/2$.

### Cont'd

• Difference equation for an FIR filter:

 $y[n] = a_0 x[n] + a_1 x[n-1] + a_2 x[n-2] + a_3 x[n-3] + a_4 x[n-4] + \dots + a_{N-1} x[n-(N-1)]$ 

• Diagram for an FIR filter:



### C++ Behavior for a FIR Filter

https://github.com/HirokiNakahara/FPGA\_lecture/tree/master/Lec10\_HLS\_Design/fir.cpp

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #include "ap_int.h"

    #define N 11
    #define PREC 65536 // 2**16 sign + 15bit precision

    /* comment out for bitwidth optimization part */
    typedef ap_int<16> data_t;
    typedef ap_int<16> coef_t;
    typedef ap_int<24> acc_t;
    /*
    typedef int data_t;
    typedef int coef_t;
    typedef int acc_t;
    */
    void fir_array_partition(int *y, int x)
    {
        coef_t c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
            -136, -397, -87, 3004, 8338, 11142, 8338,
            3004, -87, -397, -136, };
    #pragma HLS array_partition variable=c complete

        static data_t shift_reg[N];

### Code Reconstruction

- Writing highly optimized synthesizable HLS code is often not a straightforward process.
- It involves a deep understanding of the application at hand and the ability to change the code so that the Vivado HLS tool creates optimized hardware structures and uses the directives effectively.
  - Experience with FSM-based RTL design helps in understanding this

### Convert to Fixed Point Precision

See, https://github.com/HirokiNakahara/FPGA\_lecture/tree/master/Lec7\_Practical\_RTL\_design/fir\_int.c

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 11
#define PREC 65536 // 2**16 sign + 15bit precision

void fir(int *y, int x)
{
    int c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -136, -397, -87, 3004, 8338, 11142, 8338,
        3004, -87, -397, -136, };

    static int shift_reg[N];
    int acc;
    int i;

    acc = 0;
    for (i = N - 1; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

int main(void)
{
    float fs = 44100.0;
    int len = 1000;
    float f0 = 20000.0;
    float sin_wave;
    int fir_out;
    int i;

    for (i = 0; i < len; i++) {
        sin_wave = sin(2.0 * M_PI * f0 * i / fs);
        fir(&fir_out, (int)(sin_wave * PREC));
        printf("%d %f %f\n", i, sin_wave, (float)fir_out / PREC);
        f0 = f0 - 10.0;
    }
    return 0;
}
```

## Debug for C Description

- Confirm the operation of the FIR filter in C before synthesis
- The inputs and outputs can be reused as test vectors in a testbench for HDL simulation
- Note that parallel operation cannot be verified at this level
- The area and speed of the circuit cannot be estimated either
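One way to confirm the operation in C is a self-checking test: feed a unit impulse and check that the successive outputs reproduce the coefficients. The sketch below is illustrative only. The helper `impulse_response_ok` and the file-scope array `coef` are hypothetical names introduced here, and `fir` is a local copy of the lecture's function with the coefficient table moved to file scope so the checker can read it.

```cpp
#include <cassert>

#define N 11

// Coefficients copied from the lecture's fir(); placed at file scope
// (an assumption of this sketch) so the checker can compare against them.
static const int coef[N] = {
    -136, -397, -87, 3004, 8338, 11142, 8338,
    3004, -87, -397, -136 };

// Same behavior as the lecture's fir(): shift in x and accumulate.
void fir(int *y, int x)
{
    static int shift_reg[N];
    int acc = 0;
    for (int i = N - 1; i >= 0; i--) {
        if (i == 0) {
            acc += x * coef[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * coef[i];
        }
    }
    *y = acc;
}

// Hypothetical self-check: after the delay line is flushed with zeros,
// a unit impulse must yield coef[0], coef[1], ..., coef[N-1] in order.
bool impulse_response_ok()
{
    int y;
    for (int n = 0; n < N; n++) {
        fir(&y, 0);                   // flush any previous state
    }
    for (int n = 0; n < N; n++) {
        fir(&y, n == 0 ? 1 : 0);      // impulse at n == 0, zeros afterwards
        if (y != coef[n]) {
            return false;
        }
    }
    return true;
}
```

Calling `impulse_response_ok()` from a testbench `main` and printing the result gives a pass/fail check, and the same input and output sequence can serve as test vectors later.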



## For Useful Coding

- Use typedef for each kind of variable so that the data types can be changed in one place (described later)
- Assign labels to loops for debugging

```
typedef int data_t;
typedef int coef_t;
typedef int acc_t;

void fir(int *y, int x)
{
    coef_t c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -136, -397, -87, 3004, 8338, 11142, 8338,
        3004, -87, -397, -136, };

    static data_t shift_reg[N];
    acc_t acc;
    int i;

    acc = 0;
    FIR_LOOP: for (i = N - 1; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}
```

### Performance Analysis on Vivado HLS

Click "Analysis", right-click an operation in the schedule viewer, then select "Goto Source" (the context menu also offers "Goto Verilog" and "Goto VHDL").

Schedule viewer for module `fir` (control steps C0 to C4):

| #  | Operation              |
|----|------------------------|
| 1  | `x_read(read)`         |
| 2  | `tmp(shl)`             |
| 3  | `p_neg(-)`             |
| 4  | `tmp_7(shl)`           |
| 5  | `tmp_2(-)`             |
| 6  | `FIR_LOOP`             |
| 7  | `acc(phi_mux)`         |
| 8  | `i(phi_mux)`           |
| 9  | `tmp_1(icmp)`          |
| 10 | `tmp_3(+)`             |
| 11 | `shift_reg_load(read)` |
| 12 | `cl_load(read)`        |
| 13 | `node_39(write)`       |
| 14 | `node_32(write)`       |
| 15 | `tmp_6(*)`             |
| 16 | `p_pn(phi_mux)`        |
| 17 | `acc_1(+)`             |
| 18 | `i_1(+)`               |
| 19 | `node_47(write)`       |

### Parallel Computation Manner

• As in RTL design, independent operations are executed in parallel

    void test(int a, int b, int c, int *x, int *y, int *z)
    {
        *x = a + b;
        *y = b * c;
        *z = c + a - b;
    }

Schedule viewer for module `test`: every operation falls in the single control step C0.

| #  | Operation        |
|----|------------------|
| 1  | `c_read(read)`   |
| 2  | `b_read(read)`   |
| 3  | `a_read(read)`   |
| 4  | `x_assign(+)`    |
| 5  | `node_18(write)` |
| 6  | `y_assign(*)`    |
| 7  | `node_20(write)` |
| 8  | `tmp(-)`         |
| 9  | `tmp_1(+)`       |
| 10 | `node_23(write)` |

    void loop_test(int x[100], int y[100], int z[100], int a[30], int b[30], int c[30])
    {
        int i, j;

        LOOP1: for (i = 0; i < 100; i++)
            z[i] = x[i] * y[i];

        LOOP2: for (j = 0; j < 30; j++)
            c[j] = a[j] * b[j];
    }

Schedule viewer for `loop_test` (control steps C0 to C5):

| #  | Operation         |
|----|-------------------|
| 1  | `LOOP1`           |
| 2  | `i(phi_mux)`      |
| 3  | `exitcond1(icmp)` |
| 4  | `tmp_2(+)`        |
| 5  | `x_load(read)`    |
| 6  | `y_load(read)`    |
| 7  | `tmp_1(*)`        |
| 8  | `node_30(write)`  |
| 9  | `LOOP2`           |
| 10 | `j(phi_mux)`      |
| 11 | `exitcond(icmp)`  |
| 12 | `j_1(+)`          |
| 13 | `a_load(read)`    |
| 14 | `b_load(read)`    |
| 15 | `tmp_3(*)`        |
| 16 | `node_48(write)`  |

## Low Level Virtual Machine (LLVM)

- Modularized, reusable compiler and toolchain technology
- Front ends for C, C++, Objective-C, etc.
- Source code is converted to LLVM-IR (Intermediate Representation)
- The IR is then optimized for hardware (FPGA)
- Used by Rust, Clang, LDC, Vivado HLS, Intel OpenCL



### LLVM-IR Example: FIR Filter

    fir:
        .frame  r1,0,r15        # vars= 0, regs= 0, args= 0
        .mask   0x00000000
        addik   r3,r0,delay_line.1450
        lwi     r4,r3,8         # Unrolled loop to shift the delay line
        swi     r4,r3,12
        lwi     r4,r3,4
        swi     r4,r3,8
        lwi     r4,r3,0
        swi     r4,r3,4
        swi     r5,r3,0         # Store the new input sample into the delay line
        addik   r5,r0,4         # Initialize the loop counter
        addk    r8,r0,r0        # Initialize accumulator to zero
        addk    r4,r8,r0        # Initialize index expression to zero
    $L2:
        muli    r3,r4,4         # Compute a byte offset into the delay_line array
        addik   r9,r3,delay_line.1450
        lw      r3,r3,r7        # Load filter tap
        lwi     r9,r9,0         # Load value from delay line
        mul     r3,r3,r9        # Filter Multiply
        addk    r8,r8,r3        # Filter Accumulate
        addik   r5,r5,-1        # update the loop counter
        bneid   r5,$L2
        addik   r4,r4,1         # branch delay slot, update index expression

        rtsd    r15, 8
        swi     r8,r6,0         # branch delay slot, store the output
        .end    fir

This code is generated using `microblazeel-xilinx-linux-gnu-gcc -O1 -mno-xl-soft-mul -S fir.c`

### **Different Architectures**

• Sequential manner



### Area-Time Trade-off

• Sequential manner



### Vertical (Area) Horizontal (Time)







### Loop with Conditional Bounds

- Having a variable as the loop upper or lower bound often results in the loop counter hardware being larger than needed
  - Having an unconstrained bit width on the loop exit condition results in control logic larger than needed



## Optimizing the Loop Counter

- In order for HLS to reduce the bit width of the loop counter, the loop upper bound should be set to a constant
- However, the number of iterations actually executed is still determined by the variable "ctrl"
- This is handled by a conditional break in the loop body
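The idea can be sketched as follows. This is an illustrative example, not the slide's original code: `MAX_ITER`, `din`, `sum_variable_bound`, and `sum_constant_bound` are hypothetical names, and `MAX_ITER = 8` is an assumed bound. Only `ctrl` comes from the slide.

```cpp
#include <cassert>

const int MAX_ITER = 8;   // assumed constant upper bound for this sketch

// Variable bound: the loop limit is only known at run time, so the tool
// has no constant from which to size the loop counter.
int sum_variable_bound(const int din[MAX_ITER], int ctrl)
{
    int acc = 0;
    for (int i = 0; i < ctrl; i++) {
        acc += din[i];
    }
    return acc;
}

// Constant bound plus conditional break: the counter never exceeds
// MAX_ITER, while ctrl still decides how many iterations do work.
int sum_constant_bound(const int din[MAX_ITER], int ctrl)
{
    assert(ctrl >= 0 && ctrl <= MAX_ITER);
    int acc = 0;
    for (int i = 0; i < MAX_ITER; i++) {
        if (i >= ctrl) {
            break;            // stop early once ctrl iterations are done
        }
        acc += din[i];
    }
    return acc;
}
```

Both functions return the same sum for any `ctrl` from 0 to `MAX_ITER`; the second form simply exposes the constant bound to the tool.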



## Calculating Performance

- Precise metrics must be defined
- What is a "fast" design?
  - Efficiency?
    - operations/sec
    - MACs/sec
    - bits/sec
  - Latency? Throughput? Computation time?
- High-level synthesis tools describe designs in terms of the number of cycles and the clock frequency
- Select a metric appropriate to the target application
- Compare designs using the same metric
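As a small worked example, cycle counts and a clock period can be turned into the metrics listed above. The function names below are hypothetical, and the model assumes one input sample is accepted every `interval_cycles` clock cycles.

```cpp
#include <cassert>

// Time spent per input sample, in nanoseconds.
double sample_time_ns(int interval_cycles, double clock_period_ns)
{
    assert(interval_cycles > 0 && clock_period_ns > 0.0);
    return interval_cycles * clock_period_ns;
}

// Throughput in samples per second (1e9 ns in one second).
double samples_per_sec(int interval_cycles, double clock_period_ns)
{
    return 1e9 / sample_time_ns(interval_cycles, clock_period_ns);
}

// MAC operations per second for a filter doing `taps` MACs per sample.
double macs_per_sec(int taps, int interval_cycles, double clock_period_ns)
{
    return taps * samples_per_sec(interval_cycles, clock_period_ns);
}
```

For instance, an interval of 41 cycles at a 10 ns clock gives 410 ns per sample, roughly 2.4 million samples per second, or about 27 million MACs per second for an 11-tap filter.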

### **Operation Chaining**

- Consider the multiply-accumulate operation that is done in an FIR filter tap
- Assume that the add operation takes 2 ns to complete, and a multiply operation takes 3 ns
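The effect of the clock period on this multiply-accumulate can be sketched with a simplified timing model. The model is an assumption made for illustration: it ignores register setup time and clock uncertainty, and the names `cycles_for` and `mac_cycles` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Cycles needed by a single operation; more than one if its delay
// exceeds the clock period (a multicycle operation).
int cycles_for(double delay_ns, double clock_ns)
{
    assert(delay_ns > 0.0 && clock_ns > 0.0);
    return static_cast<int>(std::ceil(delay_ns / clock_ns));
}

// Cycles for a multiply followed by an add. If the combined delay fits
// in one clock period the two operations are chained into one cycle;
// otherwise each is scheduled on its own.
int mac_cycles(double mul_ns, double add_ns, double clock_ns)
{
    if (mul_ns + add_ns <= clock_ns) {
        return 1;
    }
    return cycles_for(mul_ns, clock_ns) + cycles_for(add_ns, clock_ns);
}
```

With the 3 ns multiply and 2 ns add above, this model gives one cycle for a clock period of 5 ns or longer, two cycles for a 4 ns clock, and five cycles for a 1 ns clock.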



### Code Hoisting

- The if/else statement inside the for loop is inefficient
- For every control structure in the code, the Vivado HLS tool creates logic that checks whether the condition is met
- The if branch executes only in the last iteration (i == 0), so its statements can be "hoisted" out of the loop

```
// Original
void fir(int *y, int x)
{
    coef_t c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -136, -397, -87, 3004, 8338, 11142, 8338,
        3004, -87, -397, -136, };

    static data_t shift_reg[N];
    acc_t acc;
    int i;

    acc = 0;
    FIR_LOOP: for (i = N - 1; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}

// With the if branch hoisted out of the loop
void fir_hoisting(int *y, int x)
{
    coef_t c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -136, -397, -87, 3004, 8338, 11142, 8338,
        3004, -87, -397, -136, };

    static data_t shift_reg[N];
    acc_t acc;
    int i;

    acc = 0;
    FIR_LOOP_NOIF: for (i = N - 1; i > 0; i--) {
        shift_reg[i] = shift_reg[i - 1];
        acc += shift_reg[i] * c[i];
    }
    acc += x * c[0];
    shift_reg[0] = x;
    *y = acc;
}
```

### Comparison

### • Original

#### Timing (ns)

#### Summary

| Clock  | Target | Estimated | Uncertainty |
|--------|--------|-----------|-------------|
| ap_clk | 10.00  | 8.51      | 1.25        |

#### Latency (clock cycles)

#### Summary

| Latency min | Latency max | Interval min | Interval max | Type |
|-------------|-------------|--------------|--------------|------|
| 23          | 45          | 23           | 45           | none |

#### **Utilization Estimates**

#### Summary

| Name            | BRAM_18K | DSP48E | FF    | LUT   |
|-----------------|----------|--------|-------|-------|
| DSP             | -        | -      | -     | -     |
| Expression      | -        | 3      | 0     | 149   |
| FIFO            | -        | -      | -     | -     |
| Instance        | -        | -      | -     | -     |
| Memory          | 0        | -      | 79    | 9     |
| Multiplexer     | -        | -      | -     | 120   |
| Register        | -        | -      | 215   | -     |
| Total           | 0        | 3      | 294   | 278   |
| Available       | 120      | 80     | 35200 | 17600 |
| Utilization (%) | 0        | 3      | ~0    | 1     |

### with Hoisting of "if"

#### **Performance Estimates**

#### Timing (ns)

#### Summary

| Clock  | Target | Estimated | Uncertainty |
|--------|--------|-----------|-------------|
| ap_clk | 10.00  | 8.51      | 1.25        |

#### Latency (clock cycles)

#### Summary

| Latency min | Latency max | Interval min | Interval max | Type |
|-------------|-------------|--------------|--------------|------|
| 41          | 41          | 41           | 41           | none |

#### **Utilization Estimates**

#### Summary

| Name            | BRAM_18K | DSP48E | FF    | LUT   |
|-----------------|----------|--------|-------|-------|
| DSP             | -        | -      | -     | -     |
| Expression      | -        | 3      | 0     | 184   |
| FIFO            | -        | -      | -     | -     |
| Instance        | -        | -      | -     | -     |
| Memory          | 0        | -      | 79    | 9     |
| Multiplexer     | -        | -      | -     | 87    |
| Register        | -        | -      | 128   | -     |
| Total           | 0        | 3      | 207   | 280   |
| Available       | 120      | 80     | 35200 | 17600 |
| Utilization (%) | 0        | 3      | ~0    | 1     |

### Loop Fission

- The FIR filter has two fundamental operations: shifting the data through the shift\_reg array, and the MAC operations
- Loop fission takes these two operations and implements each of them in its own loop
  - Each loop is optimized independently, which amounts to decomposing the FSM

```
void fir(int *y, int x)
{
    coef_t c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -136, -397, -87, 3004, 8338, 11142, 8338,
        3004, -87, -397, -136, };

    static data_t shift_reg[N];
    acc_t acc;
    int i;

    acc = 0;
    FIR_LOOP: for (i = N - 1; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}
```

```
void fir_loop_fission(int *y, int x)
{
    coef_t c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -136, -397, -87, 3004, 8338, 11142, 8338,
        3004, -87, -397, -136, };

    static data_t shift_reg[N];
    acc_t acc;
    int i;

    acc = 0;
    // First loop: shift the delay line only
    SHIFT_REG: for (i = N - 1; i > 0; i--) {
        shift_reg[i] = shift_reg[i - 1];
    }
    shift_reg[0] = x;

    // Second loop: multiply-accumulate only
    MACs: for (i = N - 1; i >= 0; i--) {
        acc += shift_reg[i] * c[i];
    }
    *y = acc;
}
```

### Loop Unrolling

- By default, the Vivado HLS tool synthesizes for loops in a sequential manner
- The data path executes sequentially for each iteration of the loop
- The SHIFT\_REG loop can be unrolled manually, here by a factor of two:

```
for (i = N - 1; i > 1; i = i - 2) {
    shift_reg[i] = shift_reg[i - 1];
    shift_reg[i - 1] = shift_reg[i - 2];
}
if (i == 1) {
    shift_reg[1] = shift_reg[0];
}
shift_reg[0] = x;
```
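A quick way to gain confidence in a manual unrolling is to compare it against the rolled loop in plain C++. The sketch below does this for the factor-of-two version; the function names and the 16-element test buffers are hypothetical choices made for this check.

```cpp
#include <cassert>

// Rolled shift, as in the SHIFT_REG loop.
void shift_rolled(int *shift_reg, int n, int x)
{
    for (int i = n - 1; i > 0; i--) {
        shift_reg[i] = shift_reg[i - 1];
    }
    shift_reg[0] = x;
}

// Shift unrolled by two. The trailing if covers the one element that
// is left over when n is even (the loop then exits with i == 1).
void shift_unrolled2(int *shift_reg, int n, int x)
{
    int i;
    for (i = n - 1; i > 1; i = i - 2) {
        shift_reg[i] = shift_reg[i - 1];
        shift_reg[i - 1] = shift_reg[i - 2];
    }
    if (i == 1) {
        shift_reg[1] = shift_reg[0];
    }
    shift_reg[0] = x;
}

// Hypothetical checker: both versions must leave identical contents.
bool shifts_agree(int n)
{
    assert(n >= 1 && n <= 16);
    int a[16], b[16];
    for (int k = 0; k < n; k++) {
        a[k] = b[k] = 100 + k;        // distinct values expose any mix-up
    }
    shift_rolled(a, n, -1);
    shift_unrolled2(b, n, -1);
    for (int k = 0; k < n; k++) {
        if (a[k] != b[k]) {
            return false;
        }
    }
    return true;
}
```

Checking both odd and even lengths matters here: with N = 11 the loop ends at i == 0 and the trailing `if` is never taken, whereas an even length relies on it.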

### Unroll Pragma

```
#pragma HLS unroll factor=n
```

(If no factor is given, the HLS tool tries to unroll the loop completely!)

```
void fir_unroll(int *y, int x)
{
    coef_t c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -136, -397, -87, 3004, 8338, 11142, 8338,
        3004, -87, -397, -136, };

    static data_t shift_reg[N];
    acc_t acc;
    int i;

    acc = 0;
    FIR_LOOP: for (i = N - 1; i >= 0; i--) {
#pragma HLS unroll factor=4
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}
```

If your design does not synthesize in under 15 minutes, you should carefully consider the effect of your optimizations. Large designs can certainly take a significant amount of time for the Vivado HLS tool to synthesize.

### Partition BRAM into Smaller Ones?

• Use "#pragma HLS array\_partition"



### Array\_Partition

`#pragma HLS ARRAY_PARTITION variable=(variable name) (access pattern) factor=(# of partitions) dim=(array dimension)`

- access pattern
- dimension (dim=0 denotes that all dimensions are partitioned)
  - Partitioning dimension 1 of my\_array[10][6][4] yields ten arrays, my\_array\_0[6][4] through my\_array\_9[6][4]
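In Vivado HLS the access pattern is one of `block`, `cyclic`, or `complete`. The sketch below illustrates how a one-dimensional index could map to a bank and an offset under the first two patterns; `complete` simply gives every element its own register. The `Location` struct and both function names are hypothetical, introduced only to make the mapping concrete.

```cpp
#include <cassert>

struct Location {
    int bank;    // which of the smaller memories holds the element
    int offset;  // address of the element inside that memory
};

// cyclic: consecutive elements are interleaved across the banks.
Location cyclic_partition(int index, int factor)
{
    assert(index >= 0 && factor > 0);
    return Location{ index % factor, index / factor };
}

// block: the array is cut into `factor` contiguous chunks.
Location block_partition(int index, int size, int factor)
{
    assert(index >= 0 && index < size && factor > 0);
    int chunk = (size + factor - 1) / factor;   // ceil(size / factor)
    return Location{ index / chunk, index % chunk };
}
```

For an 8-element array with factor 2, element 5 lands in bank 1 at offset 2 under `cyclic` and in bank 1 at offset 1 under `block`. A loop that reads neighboring elements in the same cycle tends to benefit from `cyclic`, since neighbors sit in different banks.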

### Example

### • Good performance! But...

```
void fir_array_partition(int *y, int x)
{
    coef_t c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -136, -397, -87, 3004, 8338, 11142, 8338,
        3004, -87, -397, -136, };
#pragma HLS array_partition variable=c complete

    static data_t shift_reg[N];
    acc_t acc;
    int i;

    // ...
    shift_reg[0] = x;
    // ...
```

#### **Performance Estimates**

#### Timing (ns)

#### Summary

| Clock  | Target | Estimated | Uncertainty |
|--------|--------|-----------|-------------|
| ap_clk | 10.00  | 8.74      | 1.25        |

#### Latency (clock cycles)

#### Summary

| Latency min | Latency max | Interval min | Interval max | Type |
|-------------|-------------|--------------|--------------|------|
| 2           | 2           | 2            | 2            | none |

#### **Utilization Estimates**

#### Summary

| Name            | BRAM_18K | DSP48E | FF    | LUT   |
|-----------------|----------|--------|-------|-------|
| DSP             | -        | -      | -     | -     |
| Expression      | -        | 27     | 0     | 642   |
| FIFO            | -        | -      | -     | -     |
| Instance        | -        | -      | -     | -     |
| Memory          | -        | -      | -     | -     |
| Multiplexer     | -        | -      | -     | 21    |
| Register        | -        | -      | 764   | -     |
| Total           | 0        | 27     | 764   | 663   |
| Available       | 120      | 80     | 35200 | 17600 |
| Utilization (%) | 0        | 33     | 2     | 3     |

## Loop Pipelining

- Without pipelining, all of the statements in the second iteration start only after all of the statements from the first iteration are complete
- Schedule for three iterations of a pipelined version of the MAC for loop, in which successive iterations overlap



## Loop Initiation Interval (II)

- The initiation interval (II) is the number of clock cycles until the next iteration of the loop can start
- Note that a requested II may <u>not always be achievable</u> due to resource/timing constraints and/or dependencies in the code
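The benefit of a small II can be estimated with the usual idealized cycle-count formulas. The sketch below is a simplified model: it ignores any extra cycles a tool may add for loop entry and exit, and the function names are hypothetical.

```cpp
#include <cassert>

// Sequential loop: each iteration finishes before the next one starts.
int sequential_cycles(int trip_count, int iteration_latency)
{
    assert(trip_count > 0 && iteration_latency > 0);
    return trip_count * iteration_latency;
}

// Pipelined loop: a new iteration starts every `ii` cycles, and the
// last iteration still needs its full latency to drain.
int pipelined_cycles(int trip_count, int iteration_latency, int ii)
{
    assert(trip_count > 0 && iteration_latency > 0 && ii > 0);
    return (trip_count - 1) * ii + iteration_latency;
}
```

For an 11-iteration loop with an assumed iteration latency of 4 cycles, the model gives 44 cycles sequentially, 14 cycles with II = 1, and 24 cycles with II = 2.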







## Data Type in C-language

- The C language provides many different data types to describe different kinds of behavior
- The primary benefits of using these different data types in software revolve around the amount of storage that each data type requires
- All of these data types have a size that is a power of 2
  - (unsigned/signed) int
  - float
  - double
  - (unsigned/signed) char
  - short
  - long
  - long long

### Bitwidth Optimization

- The same benefits are seen in an FPGA implementation, but they are even more pronounced
  - This is because Vivado HLS supports custom (arbitrary precision) data types
- After `#include "ap_int.h"`, you can use
  - unsigned: `ap_uint<width>`, where width ranges from 1 to 1024
  - signed: `ap_int<width>`
### More Reduced!

```
/* comment out for bitwidth optimization part */
typedef ap_int<16> data_t;
typedef ap_int<16> coef_t;
typedef ap_int<24> acc_t;
/*
typedef int data_t;
typedef int coef_t;
typedef int acc_t;
*/
void fir_array_partition(int *y, int x)
{
    coef_t c[N] = { // 0.17 = 20KHz/44.1KHz, LPF, Hamming Window
        -136, -397, -87, 3004, 8338, 11142, 8338,
        3004, -87, -397, -136, };
#pragma HLS array_partition variable=c complete

    static data_t shift_reg[N];
    acc_t acc;
    int i;

    acc = 0;
    SHIFT_REG: for (i = N - 1; i > 0; i--) {
#pragma HLS unroll
        shift_reg[i] = shift_reg[i - 1];
    }
    shift_reg[0] = x;

    MACs: for (i = N - 1; i >= 0; i--) {
#pragma HLS unroll
        acc += shift_reg[i] * c[i];
    }
    *y = acc;
}
```
```

#### **Performance Estimates**

#### Timing (ns)

Summary

| Clock  | Target | Estimated | Uncertainty |
|--------|--------|-----------|-------------|
| ap_clk | 10.00  | 8.74      | 1.25        |

#### Latency (clock cycles)

Summary

| Latency min | Latency max | Interval min | Interval max | Type |
|-------------|-------------|--------------|--------------|------|
| 2           | 2           | 2            | 2            | none |

#### **Utilization Estimates**

#### Summary

| Name            | BRAM_18K | DSP48E | FF    | LUT   |
|-----------------|----------|--------|-------|-------|
| DSP             | -        | -      | -     | -     |
| Expression      | -        | 27     | 0     | 642   |
| FIFO            | -        | -      | -     | -     |
| Instance        | -        | -      | -     | -     |
| Memory          | -        | -      | -     | -     |
| Multiplexer     | -        | -      | -     | 21    |
| Register        | -        | -      | 764   | -     |
| Total           | 0        | 27     | 764   | 663   |
| Available       | 120      | 80     | 35200 | 17600 |
| Utilization (%) | 0        | 33     | 2     | 3     |
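The numbers in these reports can be turned into utilization percentages and a rough sample rate. The sketch below uses hypothetical helper names and assumes that the reported interval and the 10 ns target clock still hold after implementation.

```cpp
// Utilization in percent, truncated to an integer.
int utilization_percent(long used, long available) {
    return static_cast<int>(used * 100 / available);
}

// Input samples accepted per second when a new sample can be
// accepted every interval_cycles clock cycles.
double samples_per_second(int interval_cycles, double clock_period_ns) {
    return 1e9 / (interval_cycles * clock_period_ns);
}
```

With 27 of 80 DSP48E blocks, 764 of 35200 FFs, and 663 of 17600 LUTs, the truncated percentages are 33, 2, and 3, which match the report. An interval of 2 cycles at a 10 ns clock corresponds to 50 million samples per second under these assumptions.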


### Exercise

1. (Mandatory) Compare an unrolled FIR design with a pipelined one with respect to HW resources and performance
2. (Optional) Run an unrolled FIR design on your ZYBO board

If you run into any trouble, don't hesitate to contact me.

nakahara@ict.e.titech.ac.jp

The deadline is July 26, 2019, 13:20 JST

(At the beginning of the next lecture)