# Multiplication and Shift Circuits 

Shmuel Wimer<br>Bar Ilan University, Engineering Faculty<br>Technion, EE Faculty

## Shift/Add Unsigned Multiplication Algorithms



In right-shift multiplication the partial products $x_{j} a$, $0 \leq j \leq k-1$, are recursively accumulated from top to bottom.

$$
p^{(j+1)}=\underbrace{\left(p^{(j)}+x_{j} a 2^{k}\right)}_{\text {add }} \times 2^{-1} \text { with } p^{(0)}=0 \text { and } p^{(k)}=p
$$

$x_{0} a 2^{k}$ will be multiplied by $2^{-k}$ after $k$ iterations. $a$ is pre-multiplied by $2^{k}$ to offset the effect of the right shifts.

After $k$ iterations the recurrence yields $p^{(k)}=a x+p^{(0)} 2^{-k}=a x$. $a$ is aligned to the left (MSB) $k$ bits of a $2k$-bit word.

How to obtain $p=a x+y$? Initialize $p^{(0)}$ to $y 2^{k}$.

(a) Right-shift algorithm
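
The right-shift recurrence above can be sketched in Python as follows. This is a minimal illustration under the assumption that the operands are $k$-bit unsigned integers; the function name `right_shift_multiply` and its optional addend `y` are hypothetical names chosen here, not taken from the slides.

```python
def right_shift_multiply(a: int, x: int, k: int, y: int = 0) -> int:
    """Return a*x + y for k-bit unsigned a and x (sketch of the recurrence)."""
    p = y << k                        # p^(0) = y * 2^k (zero for a plain product)
    for j in range(k):
        x_j = (x >> j) & 1            # multiplier bits are consumed LSB first
        # add the partial product aligned to the left (x_j * a * 2^k),
        # then multiply by 2^-1 with a right shift
        p = (p + ((x_j * a) << k)) >> 1
    return p                          # p^(k) = a*x + y


print(right_shift_multiply(10, 11, 4))       # 110
print(right_shift_multiply(10, 11, 4, y=5))  # 115
```

The shift never discards a nonzero bit, because after $j$ steps the accumulated value is a multiple of $2^{k-j}$.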


Multiply by $2^{-1}$ by a right shift; add the partial product aligned to the left.


In left-shift multiplication the partial products $x_{k-1-j} a$, $0 \leq j \leq k-1$, are recursively accumulated from bottom to top.

$$
p^{(j+1)}=\underbrace{\underbrace{2 p^{(j)}}_{\text {shift left }}+x_{k-1-j} a}_{\text {add }} \text { with } p^{(0)}=0 \text { and } p^{(k)}=p
$$

After $k$ iterations the recurrence yields $p^{(k)}=a x+p^{(0)} 2^{k}=a x$.

How to obtain $p=a x+y$? Initialize $p^{(0)}$ to $y 2^{-k}$.

(b) Left-shift algorithm
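
A comparable Python sketch of the left-shift recurrence is given below. It assumes $k$-bit unsigned operands; the name `left_shift_multiply` is hypothetical, and `fractions.Fraction` is used only so that the fractional initial value $y 2^{-k}$ can be represented exactly in software.

```python
from fractions import Fraction


def left_shift_multiply(a: int, x: int, k: int, y: int = 0) -> int:
    """Return a*x + y for k-bit unsigned a and x (sketch of the recurrence)."""
    p = Fraction(y, 2 ** k)               # p^(0) = y * 2^-k (zero for a plain product)
    for j in range(k):
        x_bit = (x >> (k - 1 - j)) & 1    # multiplier bits are consumed MSB first
        p = 2 * p + x_bit * a             # shift left, then add the partial product
    return int(p)                         # p^(k) = a*x + y is an integer


print(left_shift_multiply(10, 11, 4))       # 110
print(left_shift_multiply(10, 11, 4, y=5))  # 115
```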


Serial multiplication by add and shift entails $k$ additions and $k$ shifts.

Right shift is favored since the addition of a partial product takes place at the MSB $k$-bit part of the $2k$-bit word. In left shift it takes place at the LSB $k$-bit part, and a carry can propagate into the MSB part.

The left-shift algorithm therefore requires a $2k$-bit adder, while for right shift a $k$-bit adder suffices.

## Hardware of right-shift multipliers (without control)


Reducing addition of partial product to one cycle


At each clock cycle adder's carry-out is written to MSB and LSB is used for multiplication (via a MUX)

## Hardware of left-shift multipliers (without control)

Next bit of multiplier

The adder is $2k$ bits wide, rather than $k$ bits as in right-shift.

Register sharing of multiplier and MSB of cumulative partial product is possible.

Shift registers

## Multiplication of Signed Numbers

- In sign-magnitude representation the magnitudes are multiplied as unsigned numbers, and the sign of the product requires only XORing the operands' sign bits.
- In 1's-complement, a negative operand is complemented and unsigned multiplication takes place. The result is complemented if the XOR of the operands' sign bits is 1.
- For 2's-complement, right-shift multiplication is proper for a negative multiplicand and a positive multiplier.


## Right-shift multiplication



## Negative multiplicand and multiplier



Negative multiplicand
Negative multiplier

Negative sign extensions

Handle correctly by subtracting $x_{k-1} a$ rather than adding

Hardware complements multiplicand and adds 1 via carry-in
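
The handling of a negative multiplier can be sketched in Python as follows. This is an illustrative model only, assuming $k$-bit 2's-complement operands passed as ordinary signed Python integers; the name `signed_right_shift_multiply` is hypothetical. Python's `>>` on negative integers is an arithmetic shift, which stands in for the sign extension performed by the hardware.

```python
def signed_right_shift_multiply(a: int, x: int, k: int) -> int:
    """Multiply k-bit 2's-complement values a and x with the right-shift scheme."""
    x_bits = x & ((1 << k) - 1)       # 2's-complement bit pattern of the multiplier
    p = 0
    for j in range(k):
        x_j = (x_bits >> j) & 1
        term = (x_j * a) << k         # partial product aligned to the left
        if j == k - 1:
            p -= term                 # the sign bit x_{k-1} has negative weight: subtract
        else:
            p += term
        p >>= 1                       # arithmetic right shift keeps the sign
    return p


print(signed_right_shift_multiply(-10, -11, 5))  # 110
print(signed_right_shift_multiply(-10, 11, 5))   # -110
```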

## Hardware implementation (control logic not shown)



## Parallel Multiplication Algorithms



Dot diagram is convenient to illustrate large array multiplication.


The most obvious way of adding $k$ $N$-bit numbers is by cascading $k-1$ CPAs.

This is slow and area consuming, taking $O(k N)$ time and area (not really).

Observation:
A Full-adder has three inputs $x, y$ and $z$.


It produces an output $s$ of weight 1 and an output $c$ of weight 2.

The inputs are symmetric with respect to $\boldsymbol{s}$ and $\boldsymbol{c}$.
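
The observation can be checked exhaustively with a few lines of Python. The function name `full_adder` is an illustrative choice; the Boolean expressions are the usual sum (XOR) and carry (majority) functions.

```python
from itertools import permutations, product


def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (s, c): s has weight 1 and c has weight 2."""
    s = x ^ y ^ z                        # sum bit
    c = (x & y) | (x & z) | (y & z)      # carry bit (majority of the inputs)
    return s, c


for bits in product((0, 1), repeat=3):
    s, c = full_adder(*bits)
    assert sum(bits) == s + 2 * c        # three bits of weight 1 become s + 2c
    # any reordering of the inputs gives the same outputs
    assert all(full_adder(*perm) == (s, c) for perm in permutations(bits))
```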

## Carry-Save Adder

The sum $X+Y+Z$ can therefore be obtained by first summing $x_{i}+y_{i}+z_{i}$ in parallel, producing $\boldsymbol{C}$ and $\boldsymbol{S}$.


Then $S$ and the left-shifted $C$ are summed by a CPA. This is called a Carry-Save Adder (CSA).


Summation of $k$ numbers requires stacking $k-2$ CSAs and a single CPA.
The resulting delay is $O(k+N)$ rather than $O(k N)$ if CPAs were used (not exactly...).
The CSA was invented by von Neumann for an early digital computer (1946).
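
A small Python model of the stacked CSAs is sketched below, assuming unsigned operands held in ordinary integers. The names `carry_save_add` and `multi_operand_sum` are hypothetical; Python's built-in addition stands in for the final CPA.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Bitwise full adders in parallel: returns (S, left-shifted C)."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1   # carries have weight 2
    return s, c


def multi_operand_sum(operands: list[int]) -> tuple[int, int]:
    """Sum the operands with a chain of CSAs; returns (total, number of CSAs)."""
    ops = list(operands)
    csa_count = 0
    while len(ops) > 2:
        s, c = carry_save_add(ops[0], ops[1], ops[2])
        ops = [s, c] + ops[3:]               # each CSA absorbs one more operand
        csa_count += 1
    return sum(ops), csa_count               # the last addition models the CPA


print(multi_operand_sum([3, 5, 7, 9, 11]))   # (35, 3): k = 5 operands, k - 2 = 3 CSAs
```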

## Unsigned Array Multiplication



The critical path has $N$ CSAs and an $M$-bit CPA, yielding $O(N+M)$ delay. The $N$ LSBs are obtained directly from the sum outputs of the CSAs.
The $M$ MSBs are obtained by the CPA. The array can be squashed in layout to occupy a rectangle.

## 2's Complement Array Multiplication

Same CSA array multiplication can be used.

$$
\begin{aligned}
P=y x & =\left(-y_{M-1} 2^{M-1}+\sum_{j=0}^{M-2} y_{j} 2^{j}\right)\left(-x_{N-1} 2^{N-1}+\sum_{i=0}^{N-2} x_{i} 2^{i}\right) \begin{array}{l}
\text { 2's } \\
\text { complement }
\end{array} \\
& =\sum_{i=0}^{N-2} \sum_{j=0}^{M-2} x_{i} y_{j} 2^{i+j}+x_{N-1} y_{M-1} 2^{M+N-2} \quad \text { positive } \\
& -\left(\sum_{i=0}^{N-2} x_{i} y_{M-1} 2^{i+M-1}+\sum_{j=0}^{M-2} x_{N-1} y_{j} 2^{j+N-1}\right) \text { negative }
\end{aligned}
$$

To handle the negative part, 2's complement will be used.
Recall that 2's complement equals 1's complement plus 1.
1's complement is obtained by bit complement.
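
The decomposition into a positive and a negative part can be checked numerically. The sketch below assumes $M$-bit and $N$-bit 2's-complement operands given as signed Python integers; the helper names are hypothetical. The second function applies the negative part as bit complement plus 1, modulo $2^{M+N}$.

```python
def signed_product_parts(y: int, x: int, M: int, N: int) -> tuple[int, int]:
    """Return (positive, negative) sums of the 2's-complement product y*x."""
    yb = [(y >> j) & 1 for j in range(M)]    # Python's >> sign-extends, so this
    xb = [(x >> i) & 1 for i in range(N)]    # yields the 2's-complement bits
    positive = sum((xb[i] & yb[j]) << (i + j)
                   for i in range(N - 1) for j in range(M - 1))
    positive += (xb[N - 1] & yb[M - 1]) << (M + N - 2)
    negative = sum((xb[i] & yb[M - 1]) << (i + M - 1) for i in range(N - 1))
    negative += sum((xb[N - 1] & yb[j]) << (j + N - 1) for j in range(M - 1))
    return positive, negative


def product_bits(y: int, x: int, M: int, N: int) -> int:
    """(M+N)-bit product pattern: add the 2's complement of the negative part."""
    mask = (1 << (M + N)) - 1
    positive, negative = signed_product_parts(y, x, M, N)
    return (positive + (~negative & mask) + 1) & mask   # bit complement plus 1


pos, neg = signed_product_parts(-3, 5, 4, 4)
print(pos - neg)                              # -15
print(format(product_bits(-3, 5, 4, 4), "08b"))  # 11110001
```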


## Acceleration of Serial Multiplication

Observation: $2^{j}+2^{j-1}+\ldots+2^{i+1}+2^{i}=2^{j+1}-2^{i}$

Consequently, the additions caused by a string of 1s in the multiplier can be replaced by one addition and one subtraction.

Bits $x_{i-1}$ and $x_{i}$ of the multiplier are encoded in $y_{i}$ as follows:

- $\left(x_{i}, x_{i-1}\right)=(0,0) \Rightarrow y_{i}=0$: no string of 1s in sight
- $\left(x_{i}, x_{i-1}\right)=(0,1) \Rightarrow y_{i}=1$: end of a string of 1s
- $\left(x_{i}, x_{i-1}\right)=(1,0) \Rightarrow y_{i}=-1$: beginning of a string of 1s
- $\left(x_{i}, x_{i-1}\right)=(1,1) \Rightarrow y_{i}=0$: continuation of a string of 1s

Example: radix-2 encoding of a 16-bit word. The rightmost 0 in the $x$ row is the artifact bit $x_{-1}=0$.

$$
\begin{array}{rrrrrrrrrrrrrrrrrl}
1 & 0 & 0 & 1 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & (0) & x \\
-1 & 0 & 1 & 0 & 0 & -1 & 1 & 0 & -1 & 1 & -1 & 1 & 0 & 0 & -1 & 0 & & y
\end{array}
$$


The above is a proper interpretation for signed (2's-complement) multiplication. An MSB string $11 \ldots 1$ of 1s is encoded into the string $0\,0 \ldots 0\,(-1)$, resulting in the appropriate subtraction.

Problem: Assume that the unsigned value of $X$ is intended.
Booth encoding results in $-2^{15}$ rather than $+2^{15}$ for the MSB.
Solution: Add $2^{16}$ by extending $y$ with a 1 at the MSB.
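
The encoding rule and the 16-bit example can be reproduced with a short Python sketch. The function name `booth_radix2` is hypothetical; the rule used is $y_i = x_{i-1} - x_i$ with $x_{-1}=0$, which is equivalent to the four cases listed above.

```python
def booth_radix2(x: int, n: int, unsigned: bool = False) -> list[int]:
    """Radix-2 Booth digits of the n-bit word x, least significant digit first."""
    digits = []
    prev = 0                          # artifact bit x_{-1} = 0
    for i in range(n):
        cur = (x >> i) & 1
        digits.append(prev - cur)     # 01 -> +1, 10 -> -1, 00 and 11 -> 0
        prev = cur
    if unsigned:
        digits.append(prev)           # extra MSB digit y_n = x_{n-1} (x_n = 0)
    return digits


def digits_value(digits: list[int]) -> int:
    return sum(d << i for i, d in enumerate(digits))


x = 0b1001110110101110
print(booth_radix2(x, 16)[::-1])      # MSB first, matches the y row above
print(digits_value(booth_radix2(x, 16)) == x - 2**16)           # True (signed reading)
print(digits_value(booth_radix2(x, 16, unsigned=True)) == x)    # True
```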


## Booth Encoding

Proposed by Booth in 1951 to accelerate serial multiplication (series of shift and add).

$$
P=Y \times X=Y \times 00111110 \text { requires } 5 \text { shifts and additions. }
$$

$$
\begin{aligned}
& Y \times 00111110=Y \times\left(2^{5}+2^{4}+2^{3}+2^{2}+2^{1}\right) \\
& =Y \times\left(2^{6}-2^{1}\right)=Y \times(01000000-00000010)
\end{aligned}
$$

requires 1 add, 1 subtract (add 2's complement) and 2 shifts.

$$
\begin{aligned}
& Y \times 00111010=Y \times\left(2^{5}+2^{4}+2^{3}+2^{1}\right)=Y \times\left[\left(2^{6}-2^{3}\right)+2^{1}\right] \\
& =Y \times(01000000-00001000+00000010)
\end{aligned}
$$

Multiplication can be considerably accelerated by turning sequences of 1s into leading and trailing 1s.

$$
P=\overbrace{011001}^{Y} \times \overbrace{100111}^{X}=\overbrace{011001}^{Y} \times \overbrace{\underbrace{10}_{2}\, \underbrace{01}_{1}\, \underbrace{11}_{3}}^{X}
$$

Instead of multiplying $Y$ by $X$ and adding bit by bit, we look at groups of 2 bits of $X$, hence working in radix-4.

In radix-4 each partial product has 4 times the weight of its predecessor.

Radix-4 multiplication will reduce to half the number of partial products, with 2-bit left shift at each one. The partial products are $\{0, Y, 2 Y, 3 Y\}$.
$3 Y$ is a problem since it cannot be obtained by a shift but rather requires addition $3 Y=2 Y+Y$.

Radix-4 algorithm implements $3 Y=4 Y-Y$ and $2 Y=4 Y-2 Y$.

Weight of LSB in current pair is twice the MSB in previous.
Weight of MSB in current pair is 4 times the MSB in previous.

- $X=3$: $\mathrm{PP}=-Y$. $4Y$ will be discovered in the next step.
- $X=1$: $\mathrm{PP}=2Y$. $4Y$ is carried from the previous pair, where it counts as 1 in the current pair.
- $X=2$: $\mathrm{PP}=-2Y$. $4Y$ will be discovered in the next step.
- $X=0$: $\mathrm{PP}=Y$. No need for a sign; this is always $Y$ or 0.


(Figure: worked radix-4 example for $Y=011001$ and $X=100111$, showing the artifact 0 appended below the LSB of $X$ and the sign-extended partial products.)

The PP table defines the appropriate encoding of the multiplicand: $0$, $Y$, $-Y$, $2Y$ or $-2Y$.

## Partial Product (PP) Selection Table

Multiplier Selection Explanations

Radix-4 modified Booth encoding values

| Input $X_{2i+1}$ | Input $X_{2i}$ | Input $X_{2i-1}$ | Partial product $PP_i$ | Booth select SINGLE$_i$ | Booth select DOUBLE$_i$ | Booth select NEG$_i$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | $Y$ | 1 | 0 | 0 |
| 0 | 1 | 0 | $Y$ | 1 | 0 | 0 |
| 0 | 1 | 1 | $2 Y$ | 0 | 1 | 0 |
| 1 | 0 | 0 | $-2 Y$ | 0 | 1 | 1 |
| 1 | 0 | 1 | $-Y$ | 1 | 0 | 1 |
| 1 | 1 | 0 | $-Y$ | 1 | 0 | 1 |
| 1 | 1 | 1 | $-0(=0)$ | 0 | 0 | 1 |
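
The selection table can be exercised with a small Python model. This is a sketch under the assumption of an unsigned $n$-bit multiplier that is zero-extended above its MSB; the function names are hypothetical, and the Boolean expressions for SINGLE, DOUBLE and NEG are simply read off the table rows rather than taken from a particular circuit.

```python
def booth_selects(x_hi: int, x_mid: int, x_lo: int) -> tuple[int, int, int]:
    """(SINGLE, DOUBLE, NEG) for the bit triple (X_{2i+1}, X_{2i}, X_{2i-1})."""
    single = x_mid ^ x_lo
    double = int((x_hi, x_mid, x_lo) in ((0, 1, 1), (1, 0, 0)))
    neg = x_hi
    return single, double, neg


def booth_radix4_digits(x: int, n: int) -> list[int]:
    """Radix-4 Booth digits in {-2,...,2} of unsigned n-bit x, LSD first."""
    padded = x << 1                          # bit 0 is the artifact bit x_{-1} = 0
    digits = []
    for i in range(n // 2 + 1):              # one extra digit covers the zero extension
        triple = (padded >> (2 * i)) & 0b111
        single, double, neg = booth_selects(triple >> 2, (triple >> 1) & 1, triple & 1)
        magnitude = single + 2 * double      # 0, Y or 2Y
        digits.append(-magnitude if neg else magnitude)
    return digits


def booth_multiply(y: int, x: int, n: int) -> int:
    """Sum the selected partial products, each weighted by 4**i."""
    return sum((d * y) << (2 * i) for i, d in enumerate(booth_radix4_digits(x, n)))


print(booth_radix4_digits(0b100111, 6))          # [-1, 2, -2, 1]
print(booth_multiply(0b011001, 0b100111, 6))     # 975
```

For $Y=011001$ and $X=100111$ the digits correspond to the partial products $-Y$, $2Y$, $-2Y$ and $Y$.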

Radix-4 Booth encoder and selector
 extra 1 is added in the next row.

Radix-4 Booth-encoded partial products with sign extension for unsigned multiplication


To squash the array into a rectangular floor plan, the triangle of sign-extension bits should be eliminated.

Suppose that all partial products are negative.

If a particular PP is positive, the negation can be reverted by adding 1 to the LSB of the original 1s string.



Critical path involves: Booth encoder, select line driver, Booth selector, $N / 2$ CSAs and final CPA.
Selector resides in every bit of the array, consuming significant area. Good area/power/performance tradeoff is to downsize it as much as possible. (why?)

## Booth Encoding Signed Multiplier


$\mathrm{PP}_{8}$ is therefore not required.

## Wallace Tree Multiplication

Consider the following 9-bit unsigned multiplication

Every dot of the array represents a partial product.

Partial products are vertically summed by half and full adders (CSA).


Multiplication time complexity is $O(n)$: there are $n-2$ sequential CSA additions.

Wallace tree accelerates CSAs time complexity to $O(\log n)$ by different organization of CSAs sums.

In each column of partial products, every three adjacent rows construct a group.

Reduction in each group is done by one of the following cases:

- Applying a full adder (CSA) to the 3-bit groups
- Applying a half adder to the 2-bit groups
- Passing any 1-bit group to the next stage without change

Sum of half adder stays in column, carry sent to next column.

Sum of full adder stays in column, carry sent to next column.




All the full-adder (CSA) and half-adder additions in a stage are performed simultaneously.

Every stage has its own adders.

Data is progressing through $\mathrm{O}\left(\log _{3 / 2} n\right)$ stages (proven below).

The final two rows are summed by CPA.
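
The staged reduction can be modeled at row level in Python, as sketched below. Each group of three rows is passed through a bitwise CSA; in columns where only two of the three rows carry a bit this behaves as a half adder, and where only one row carries a bit it is a pass-through. The names `wallace_sum` and `wallace_multiply` are hypothetical, and Python's addition stands in for the final CPA.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """Bitwise full adders: sum stays in its column, carry moves to the next."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def wallace_sum(rows: list[int]) -> tuple[int, int]:
    """Reduce the rows to two; returns (total, number of stages)."""
    rows = list(rows)
    stages = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        nxt = []
        for i in range(0, full, 3):          # all groups of a stage work in parallel
            nxt.extend(csa(rows[i], rows[i + 1], rows[i + 2]))
        nxt.extend(rows[full:])              # leftover rows pass unchanged
        rows = nxt
        stages += 1
    return sum(rows), stages                 # final two rows are summed by the CPA


def wallace_multiply(a: int, x: int, n: int) -> tuple[int, int]:
    """n-bit unsigned multiply: one partial-product row per multiplier bit."""
    rows = [((x >> i) & 1) * (a << i) for i in range(n)]
    return wallace_sum(rows)


print(wallace_multiply(0b101101101, 0b111010011, 9))  # (170455, 4)
```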

Other group organizations, called Modified Wallace and Dadda reductions, are possible and yield a slight area improvement (number of circuits). Asymptotically all are similar.

## Time and Area Complexity

At each stage of the computation each group of 3 bits is reduced to 2 bits, with at most 2 bits left over.

The depth of Wallace tree $D(n)$ satisfies

$$
D(n)=\left\{\begin{array}{cl}
0 & \text { if } n \leq 2 \\
1 & \text { if } n=3 \\
1+D(\lceil 2 n / 3\rceil) & \text { if } n \geq 4
\end{array}\right.
$$

This is a recursive equation solved to $D(n)=\Theta(\log n)$.
The final addition is implemented by CPA.
Carry-lookahead adder takes $\Theta(\log n)$, so using CLA for final addition yields $\Theta(\log n)$ overall time complexity.
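
The recurrence for $D(n)$ can be evaluated directly. The sketch below is a literal transcription of the three cases into Python; the name `wallace_depth` is hypothetical, and $\lceil 2n/3 \rceil$ is computed with integer arithmetic as `(2*n + 2) // 3`.

```python
def wallace_depth(n: int) -> int:
    """Number of CSA stages needed to reduce n rows to two."""
    if n <= 2:
        return 0
    if n == 3:
        return 1
    return 1 + wallace_depth((2 * n + 2) // 3)   # ceil(2n/3) rows remain


for n in (3, 4, 9, 16, 32, 64):
    print(n, wallace_depth(n))
# 3 1
# 4 2
# 9 4
# 16 6
# 32 8
# 64 10
```

Doubling $n$ adds roughly two stages, consistent with growth proportional to $\log_{3/2} n$.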

The number of adders $C(n)$ is $\Theta\left(n^{2}\right)$.
The number of bits in a row is between $n$ and $2 n$. There are $n$ rows so $2 / 3 n^{2}$ full and half adders are required in the first stage.

The number of rows is reducing by factor $2 / 3$ from stage to stage, hence the total sums to $\Theta\left(n^{2}\right)$ as well.

## Shifters

Logical shifter: Shifts the number to the left or right and fills the empty spots with 0s. Specified by << or >> in Verilog.

1011 LSR 1 = 0101; 1011 LSL 1 = 0110

Arithmetic shifter: Similar to logical, but on right shift fills empty spots with sign bit. Specified by <<< or >>> in Verilog.

1011 ASR 1 = 1101; 1011 ASL 1 = 0110

Barrel shifter (rotator): Rotates numbers cyclically.
1011 ROR 1 = 1101; 1011 ROL 1 = 0111
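
The three shifter types can be modeled on fixed-width words in Python as sketched below. The function names mirror the mnemonics used above; the word width `n` and the restriction of the shift amount to $0 \leq k < n$ are assumptions of this sketch.

```python
def lsr(a: int, k: int, n: int) -> int:
    """Logical shift right: vacated MSBs are filled with 0s."""
    return (a & ((1 << n) - 1)) >> k


def lsl(a: int, k: int, n: int) -> int:
    """Logical (and arithmetic) shift left: vacated LSBs are filled with 0s."""
    return (a << k) & ((1 << n) - 1)


def asr(a: int, k: int, n: int) -> int:
    """Arithmetic shift right: vacated MSBs are filled with the sign bit."""
    mask = (1 << n) - 1
    sign = (a >> (n - 1)) & 1
    fill = (mask << (n - k)) & mask if sign else 0
    return ((a & mask) >> k) | fill


def ror(a: int, k: int, n: int) -> int:
    """Rotate right: bits leaving at the LSB re-enter at the MSB."""
    mask = (1 << n) - 1
    return (((a & mask) >> k) | (a << (n - k))) & mask


def rol(a: int, k: int, n: int) -> int:
    """Rotate left: bits leaving at the MSB re-enter at the LSB."""
    mask = (1 << n) - 1
    return ((a << k) | ((a & mask) >> (n - k))) & mask


a = 0b1011
for f in (lsr, lsl, asr, ror, rol):
    print(f.__name__, format(f(a, 1, 4), "04b"))
# lsr 0101
# lsl 0110
# asr 1101
# ror 1101
# rol 0111
```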

Conceptually, rotation of an $N$-bit word involves an array of $N$ $N$-input MUXes that select each of the outputs from each of the possible input positions. This is called an array shifter.

The array shifter requires a decoder to produce the 1-of-$N$ shift selection.

MUXes of more than 8 inputs have excessive parasitic capacitance, so it is faster to construct shifters from $\log _{v} N$ levels of $v$-input MUXes. This is called a logarithmic shifter.

Left rotate by $k$ bits is equivalent to right rotate by $N-k$ bits.
Computing $N-k$ requires subtracter in the critical path.

We take advantage of 2's complement and the fact that rotation is cyclic modulo $N$: $N-k=N+\bar{k}+1 \equiv \bar{k}+1 \pmod{N}$.

A left rotate by $k$ can therefore be done by first pre-shifting right by 1 and then rotating right by the complement $\bar{k}$.
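
The identity can be checked with a short Python sketch. It assumes that $N$ is a power of two and that $k$ is a $\log_2 N$-bit shift amount, so that the bitwise complement $\bar{k}$ equals $N-1-k$; the function names are hypothetical.

```python
def ror(a: int, k: int, n: int) -> int:
    """Rotate the n-bit word a right by k positions (k taken modulo n)."""
    mask = (1 << n) - 1
    k %= n
    return (((a & mask) >> k) | (a << (n - k))) & mask


def rol_via_ror(a: int, k: int, n: int) -> int:
    """Left rotate by k using only right rotates; n must be a power of two."""
    k_bar = ~k & (n - 1)              # bitwise complement of the shift amount
    return ror(ror(a, 1, n), k_bar, n)   # pre-rotate by 1, then by the complement


print(format(rol_via_ror(0b1011, 1, 4), "04b"))          # 0111
print(format(rol_via_ror(0b10010110, 3, 8), "08b"))      # 10110100
```

No subtracter is needed: inverting the bits of $k$ replaces the computation of $N-k$.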

Logical and arithmetic shifts are similar to rotate except that the bits at one end or the other are replaced by 0 or sign bit.

## Funnel Shifter

Creates a $(2N-1)$-bit input word $Z$ from $A$ and kill variables. It then selects an $N$-bit field from $Z$ according to the shift amount.


| Shift Style | $Z_{2 N-2: N}$ | $Z_{N-1}$ | $Z_{N-2: 0}$ | Offset |
| :--- | :--- | :--- | :--- | :---: |
| Logical Right | 0 | $A_{N-1}$ | $A_{N-2: 0}$ | $k$ |
| Logical Left | $A_{N-1: 1}$ | $A_{0}$ | 0 | $\bar{k}$ |
| Arithmetic Right | $A_{N-1}$ | $A_{N-1}$ | $A_{N-2: 0}$ | $k$ |
| Arithmetic Left | $A_{N-1: 1}$ | $A_{0}$ | 0 | $\bar{k}$ |
| Rotate Right | $A_{N-2: 0}$ | $A_{N-1}$ | $A_{N-2: 0}$ | $k$ |
| Rotate Left | $A_{N-1: 1}$ | $A_{0}$ | $A_{N-1: 1}$ | $\bar{k}$ |
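
The table can be turned into a small Python model, sketched below. It assumes that the selected field is $Y = Z_{\text{offset}+N-1:\,\text{offset}}$, that $0 \leq k < N$, and that $\bar{k} = N-1-k$ (the bitwise complement when $N$ is a power of two). The function name `funnel_shift` and the style strings are hypothetical.

```python
def funnel_shift(a: int, k: int, n: int, style: str) -> int:
    """Build the (2n-1)-bit word Z per the table and select an n-bit field."""
    mask = (1 << n) - 1
    a &= mask
    low = a & (mask >> 1)             # A_{N-2:0}
    high = a >> 1                     # A_{N-1:1}
    msb, lsb = a >> (n - 1), a & 1    # A_{N-1}, A_0
    sign_fill = (mask >> 1) if msb else 0   # N-1 copies of the sign bit
    k_bar = n - 1 - k
    # style -> (Z_{2N-2:N}, Z_{N-1}, Z_{N-2:0}, offset)
    table = {
        "logical_right":    (0,         msb, low,  k),
        "logical_left":     (high,      lsb, 0,    k_bar),
        "arithmetic_right": (sign_fill, msb, low,  k),
        "arithmetic_left":  (high,      lsb, 0,    k_bar),
        "rotate_right":     (low,       msb, low,  k),
        "rotate_left":      (high,      lsb, high, k_bar),
    }
    upper, middle, lower, offset = table[style]
    z = (upper << n) | (middle << (n - 1)) | lower
    return (z >> offset) & mask


for style in ("logical_right", "logical_left", "arithmetic_right",
              "rotate_right", "rotate_left"):
    print(style, format(funnel_shift(0b1011, 1, 4, style), "04b"))
# logical_right 0101
# logical_left 0110
# arithmetic_right 1101
# rotate_right 1101
# rotate_left 0111
```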



