# Clock Distribution 

Shmuel Wimer<br>Bar Ilan Univ. Eng. Faculty<br>Technion, EE Faculty

## Clock System Architecture



Chip

Chip receives external clock through I/O pad.
Clock generator adjusts the global clock to the external clock.
Global clock is distributed across the chip.
Local drivers and gaters drive the physical clocks to clocked elements.

## Global Clock Generation

- Receives the external clock signal and produces the global clock distributed across the die.
- A large skew occurs between the external clock and the physical clocks at clocked elements due to the delay of the distribution network (wires, buffers, gaters).
- Therefore, data at clocked elements is no longer in sync with data at I/O pins.
- A Phase-Locked Loop (PLL) compensates for this delay.
- A PLL can also perform frequency multiplication to obtain the required on-chip frequencies.


## Synchronous Chip Interface with PLL

Chip A communicates synchronously with chip B.
Chip B uses the clock sent by chip A. Data in and out must be synchronized to the common clock.

A PLL produces the global clock of chip B such that it is in sync with the external clock.

Chip A
Chip B


## How PLL Works?



July 2010

## Phase - Frequency Detector (PFD)



The two flip-flops receive the signals at their clock inputs (one is usually the reference and the other is the sampled signal).

The output of the leading flip-flop is 1 for the lead duration.
Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero.
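The set/reset behavior described above can be sketched in a few lines of Python. This is an idealized model, not the circuit itself: the function name `pfd_pulse_widths` and the edge-time lists are hypothetical, and the AND gate and CLR-to-Q delays are assumed to be zero, so the short spikes on the lagging output do not appear.

```python
def pfd_pulse_widths(edges_a, edges_b):
    """Idealized PFD: return (Q_A pulse widths, Q_B pulse widths).

    edges_a / edges_b are sorted rising-edge times of the two inputs.
    Assumption: zero AND-gate and CLR-to-Q delay, so both outputs
    clear the instant the lagging edge arrives.
    """
    events = sorted([(t, "A") for t in edges_a] + [(t, "B") for t in edges_b])
    q = {"A": False, "B": False}      # current flip-flop outputs
    rise = {"A": 0, "B": 0}           # time each output last went to 1
    widths = {"A": [], "B": []}
    for t, name in events:
        other = "B" if name == "A" else "A"
        if q[other]:
            # Lagging edge arrived: the leading output ends, both reset.
            widths[other].append(t - rise[other])
            q[other] = False
        elif not q[name]:
            # Leading edge: set this flip-flop's output.
            q[name] = True
            rise[name] = t
        # A repeated edge on an already-set output leaves it at 1.
    return widths["A"], widths["B"]


# A leads B by 2 time units on every cycle.
shifted = pfd_pulse_widths([0, 10, 20], [2, 12, 22])
# B runs at twice the frequency of A.
faster_b = pfd_pulse_widths([0, 10, 20, 30], [1, 6, 11, 16, 21, 26, 31])
```

In the shifted case only the leading output pulses, with a width equal to the phase shift. In the second case the output of the faster input is high for most of the time.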

What happens when the reference and the sampled signals are a shift of each other?


The spikes at Q_B are a result of the delay of the AND gate driving the CLR input of the flip-flop and the internal delay from CLR to Q.

What happens when the reference and the sampled signals have different frequencies?


The output for the sampled signal is 1 more often than the output for the reference, since the rising edge of $B$ occurs more often than the rising edge of $A$.

## Charge Pump



Converts the PFD error (digital) to charge (analog), which then controls the PLL VCO.
Charge is proportional to the PFD pulse widths: $Q_{\mathrm{cp}}=I_{\mathrm{up}} \times t_{\text {faster }}-I_{\mathrm{dn}} \times t_{\text {slower }}$.

## Current Mirror



The charge pump consists of current mirrors, which are sources of constant current. Device $\mathrm{N}_{1}$ is in saturation since its gate is connected to high voltage. $I_{\mathrm{ds}}$ ($=I_{\mathrm{in}}$) depends only on $V_{\mathrm{gs}}$. $V_{\mathrm{gs}}$ is the same in $\mathrm{N}_{2}$, hence $I_{\mathrm{out}}=I_{\mathrm{in}}$.

This is an ideal current source with infinite output impedance, since $I_{\mathrm{out}}$ is independent of the load of $\mathrm{N}_{2}$; a change in output voltage doesn't affect $I_{\mathrm{out}}$.

Current mirror works similarly for P transistors.

## How Charge Pump Works?



## Faster Mode: $V_{\mathrm{out}} \rightarrow V_{\mathrm{CC}}$



## Slower Mode: $V_{\mathrm{out}} \rightarrow V_{\mathrm{SS}}$



## Loop Filter



Differential amplifier connected as a unity-gain follower is used.

## Voltage Controlled Oscillator (VCO)

A ring oscillator cascades an odd number of inverters and feeds the last output back to the first inverter (an even number of inverters would settle into a stable state). It starts to oscillate spontaneously.


If $t_{\text {inv }}$ is the delay of an inverter and $n$ is the number of inverters, the oscillation frequency is $f=1 /\left(2 n t_{\text {inv }}\right)$.

Frequency can be controlled by the number of inverters and by the supply voltage of the inverters (a higher voltage makes the inverters faster).
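As a quick numeric illustration of $f=1/(2 n t_{\text{inv}})$, here is a minimal sketch. The function name and the 20 ps inverter delay are assumptions chosen only for the example.

```python
def ring_oscillator_frequency(n_inverters, t_inv):
    """Oscillation frequency f = 1 / (2 * n * t_inv) of an n-stage ring.

    n_inverters must be odd, otherwise the ring is stable and does
    not oscillate. t_inv is the per-inverter delay in seconds.
    """
    if n_inverters < 1 or n_inverters % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverters")
    return 1.0 / (2 * n_inverters * t_inv)


# Hypothetical numbers: 5 inverters of 20 ps each give about 5 GHz.
f_5 = ring_oscillator_frequency(5, 20e-12)
# More stages give a lower frequency.
f_7 = ring_oscillator_frequency(7, 20e-12)
```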

## Components of VCO



## Delay Locked Loop (DLL)

- It is a variant of the PLL that uses a voltage-controlled delay line rather than an oscillator.
- It adjusts phase only. Frequency multiplication is impossible.
- It is simpler than a PLL, less sensitive to Vctrl noise, and requires a simpler loop filter.
- It is very difficult to design a PLL or DLL correctly. It requires expertise in control systems and analog circuit design.


## How DLL Works?




## Clock Distribution Networks



H-Tree


X-Tree


Grid


Tapered H-Tree

## Tree Clock Network (Unconstrained)



No constraints imposed on buffers and wires.
Used mostly by automatic tools in automatic synthesis flows.
Can be used for small blocks within large design.
Tools aim at minimizing the variance of clock delays.

If $T_{CLK_{i}}$ is the clock delay at leaf $i$, the variance expression

$$
\sum_{i=1}^{n}\left[T_{CLK_{i}}-\frac{1}{n} \sum_{j=1}^{n} T_{CLK_{j}}\right]^{2}
$$

should be minimized.

Serpentine routing or extra buffers may be introduced to obtain small variance.
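The objective above is easy to evaluate for a candidate tree. The sketch below computes it for a list of leaf delays; the function name and the delay values (in ps) are hypothetical.

```python
def clock_delay_spread(leaf_delays):
    """Sum of squared deviations of leaf clock delays from their mean."""
    mean = sum(leaf_delays) / len(leaf_delays)
    return sum((t - mean) ** 2 for t in leaf_delays)


# Hypothetical leaf delays in ps, before and after balancing one branch.
before = clock_delay_spread([100, 102, 98, 100])
after = clock_delay_spread([100, 100, 100, 100])
```

A perfectly balanced tree gives zero; any mismatch between leaves increases the value.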

Constraints on power can be imposed by limiting number and size of clock buffers and width of wires.

## Clock Distribution with Grids



Clock driver tree spans height of chip.
Internal levels shorted together.

Low skew but high power

(a)

Clock drivers are on perimeter

(b)

Clock drivers are on grid points

## Delay and Skew in Grid Distribution




## DEC's Alpha Microprocessor Clocking

| Product |
| :--- |
| Frequency |
| Transistors |
| Process |
| Power |
| Clock load |

Clock floorplan

Clock skew plot

## DEC Alpha 21264 Microprocessor Clock distribution



## Clock Distribution with Spines



## Intel's Pentium4 Clock Distribution




## Clock Distribution with Trees

## RC-Tree



Each branch is individually routed to balance RC delay

H-Tree


Recursive pattern to distribute signals uniformly with equal delay over area

More skew but less power

## Clock H-Tree

chip / functional block / IP


## IBM / Motorola PowerPC Clock Distribution

- $0.22\ \mu\mathrm{m}$ technology
- $17\ \mathrm{mm} \times 17\ \mathrm{mm}$ die size
- 19M transistors
- 6-level metal with copper interconnect technology
- Clock tree on top 2 metal levels
- 1 GHz clock frequency
- Almost symmetric H-tree
- Simulated clock skew under 15 ps


## Delay Calculation



We use the Elmore delay model. Subtrees are modeled as capacitive loads.
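A minimal sketch of the Elmore calculation on an RC tree follows. The tree encoding (a dictionary mapping each node to its parent, the resistance of the branch leading to it, and its own capacitance) and all values are assumptions made for illustration. Each branch resistance is multiplied by the total capacitance of the subtree it drives, which is how subtrees act as capacitive loads.

```python
def elmore_delays(tree, root):
    """Elmore delay from the source to every node of an RC tree.

    tree: {node: (parent, r_branch, c_node)}; the root has parent None
    and its r_branch is the driver resistance.
    """
    children = {node: [] for node in tree}
    for node, (parent, _, _) in tree.items():
        if parent is not None:
            children[parent].append(node)

    def subtree_cap(node):
        # Total capacitance hanging at and below this node.
        return tree[node][2] + sum(subtree_cap(c) for c in children[node])

    delays = {}

    def walk(node, delay_at_parent):
        r_branch = tree[node][1]
        # The branch resistance charges everything downstream of it.
        delays[node] = delay_at_parent + r_branch * subtree_cap(node)
        for child in children[node]:
            walk(child, delays[node])

    walk(root, 0.0)
    return delays


# Hypothetical symmetric two-leaf tree (arbitrary R and C units).
example_tree = {
    "root": (None, 1.0, 1.0),
    "a": ("root", 2.0, 3.0),
    "b": ("root", 2.0, 3.0),
}
example_delays = elmore_delays(example_tree, "root")
```

With identical branches the two leaves see the same delay, so the modeled skew between them is zero.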

## Clock Skew and Jitter

- In theory the clock should arrive at all sequential elements simultaneously.
- In practice it arrives at different times. The differences are called clock skews.
- Skews result from path mismatches, process variations and ambient conditions, resulting in different physical clocks.
- Most systems distribute a global clock and then use local clock gaters located near clocked elements.

Clock skew consists of the following components:

- Systematic skew is the portion existing under nominal conditions. It can be minimized by appropriate design.
- Random skew is caused by process variations such as device channel length, oxide thickness, threshold voltage, and wire thickness, width and spacing. It can be measured on silicon and adjusted by delay components.
- Drift is caused by time-dependent environmental variations, which occur relatively slowly. Compensation for these must take place periodically.
- Jitter is rapid clock variation caused by power supply noise and clock generator jitter. It cannot be compensated.

Factors affecting clock skew, Intel 1998, 0.25 µm.


## Skew, Clock Cycle and Design Margins



Clock jitter is of the same order as skew, but far more difficult to compensate.

## Skew Modeling


$C_{1}$: capacitive load, $I_{\mathrm{d}}$: drive current, $V_{\mathrm{CC}}$: voltage swing. The stage delay is $\tau=C_{1} V_{\mathrm{CC}} / I_{\mathrm{d}}$.

Consider a small change in the delay, taking the linear term.

$$
\begin{aligned}
\Delta \tau & =\frac{\partial \tau}{\partial V_{\mathrm{CC}}} \Delta V_{\mathrm{CC}}+\frac{\partial \tau}{\partial C_{1}} \Delta C_{1}+\frac{\partial \tau}{\partial I_{\mathrm{d}}} \Delta I_{\mathrm{d}}=\frac{C_{1}}{I_{\mathrm{d}}} \Delta V_{\mathrm{CC}}+\frac{V_{\mathrm{CC}}}{I_{\mathrm{d}}} \Delta C_{1}-\frac{C_{1} V_{\mathrm{CC}}}{I_{\mathrm{d}}^{2}} \Delta I_{\mathrm{d}} \\
& =\tau\left(\frac{\Delta V_{\mathrm{CC}}}{V_{\mathrm{CC}}}+\frac{\Delta C_{1}}{C_{1}}-\frac{\Delta I_{\mathrm{d}}}{I_{\mathrm{d}}}\right)=\alpha \tau
\end{aligned}
$$

The standard deviation of the stage delay is $\sigma(\tau)=\alpha \tau$, where $\alpha$ is around $5 \%$.

If the clock buffer delays are normally distributed and independent of each other, then $\sigma\left(T_{\mathrm{CLK} 1}\right)=\sqrt{m} \alpha \tau$ and $\sigma\left(T_{\mathrm{CLK} 2}\right)=\sqrt{n} \alpha \tau$, where $m$ and $n$ are the numbers of buffer stages on the two clock paths.

The skew is: $\sigma\left(T_{\text {skew }}\right)=\sigma\left(\left|T_{\text {CLK } 2}-T_{\text {CLK } 1}\right|\right)=(\sqrt{m+n}) \alpha_{\text {skew }} \tau$.
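To get a feel for the numbers, the expressions above can be evaluated directly. The sketch below is only an illustration: the function names, the stage delay of 50 ps and the path depths are assumptions, and the same 5% value is used for both $\alpha$ and $\alpha_{\text{skew}}$.

```python
import math


def path_sigma(stages, alpha, tau):
    """Standard deviation of a clock path of independent buffer stages."""
    return math.sqrt(stages) * alpha * tau


def skew_sigma(m, n, alpha_skew, tau):
    """Standard deviation of the skew between paths of m and n stages."""
    return math.sqrt(m + n) * alpha_skew * tau


# Hypothetical case: two 8-stage paths, tau = 50 ps, alpha = 5%.
sigma_path = path_sigma(8, 0.05, 50.0)
sigma_skew = skew_sigma(8, 8, 0.05, 50.0)
```

For these values the skew deviation is 10 ps, larger than the deviation of either path alone, because the variances of the two independent paths add.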

## Clock Distribution Switching Power

Consider an $m$-level clock tree.
Let $C_{\mathrm{l}\_m}$ be the total load of its far-end driven sequential elements.
Assuming a fixed fanout $k$ of each clock tree buffer, the dynamic power of the far-end level is $P_{\mathrm{CLK}\_m}=C_{\mathrm{l}\_m} V_{\mathrm{CC}}^{2} f$, and in general

$$
P_{\mathrm{CLK}\_(m-j)}=\frac{C_{\mathrm{l}\_m}}{k^{j}} V_{\mathrm{CC}}^{2} f, \quad 0 \leq j \leq m-1 .
$$

## Summing over whole clock tree:

$$
P_{\mathrm{CLK}}=\sum_{j=0}^{m-1} C_{\mathrm{l}\_(m-j)} V_{\mathrm{CC}}^{2} f=V_{\mathrm{CC}}^{2} f \sum_{j=0}^{m-1} \frac{C_{\mathrm{l}\_m}}{k^{j}}=V_{\mathrm{CC}}^{2} f C_{\mathrm{l}\_m}\left[\frac{1-(1 / k)^{m}}{1-(1 / k)}\right]
$$

How much of the power is consumed by the far end drivers of clock tree?

$$
\frac{P_{\mathrm{CLK}\_m}}{P_{\mathrm{CLK}}}=\frac{1-(1 / k)}{1-(1 / k)^{m}} .
$$

For a given number of sequential elements in a block, at least $50 \%$ of the switching power is consumed by the far-end drivers (the minimum occurs when the clock tree is binary, $k=2$). This fraction approaches 1 rapidly as $k$ grows.

Example: Assume a block with $2^{14}$ sequential elements and H-tree clock distribution. Then $k=4$ and $m=7$. The far-end drivers consume about $75 \%$ of the clock tree switching power, while adding the next upper level of drivers brings it to more than $90 \%$.
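The figures in the example can be checked numerically. The sketch below sums the geometric series directly; the function name and the `levels` parameter are introduced here for illustration only.

```python
def bottom_levels_power_fraction(k, m, levels=1):
    """Fraction of clock tree switching power in the lowest `levels` levels.

    Level m - j carries a load proportional to 1 / k**j, so the total
    is the geometric sum over j = 0 .. m-1.
    """
    total = sum(k ** -j for j in range(m))
    bottom = sum(k ** -j for j in range(levels))
    return bottom / total


# H-tree example from the text: k = 4, m = 7.
far_end = bottom_levels_power_fraction(4, 7)
two_levels = bottom_levels_power_fraction(4, 7, levels=2)
# Deep binary tree: the far-end share stays just above one half.
binary = bottom_levels_power_fraction(2, 20)
```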

## Active Clock De-Skewing

- Compensates for process variability, temperature gradients and imperfect design.
- Can be implemented for global fixes (small HW overhead) or local fixes (high HW overhead).
- Can be used at testing for a one-time fix (variability occurring during manufacturing), or dynamically, concurrently with chip operation.
- Its implementation is a difficult design challenge.


## Intel's Pentium2 De-Skewing System

1998, 450 MHz clock
0.25 µm process

60 ps skew without fix
15 ps skew with fix

Two clock spines for two clock regions.

A phase detector detects relative shifts.

Clock of a region is shifted by a delay line.



The delay line consists of two cascaded inverters.
Each has a programmable load consisting of eight parallel P-N gate capacitors.

The shift register stores a thermometer code for load programming in steps of 12 ps.

## Another Programmable Delay Line

The signal from input to output is delayed or not according to the control bit.


Connecting a series of $n$ delay elements, where element $i$ has delay $2^{i}$, $n-1 \geq i \geq 0$, makes it possible to set the delay of the line to any value from 0 to $2^{n}-1$ delay units by an $n$-bit control word $\left(D_{n-1}, D_{n-2}, \ldots, D_{0}\right)$.
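The binary-weighted line can be modeled in a few lines. This is a behavioral sketch only; the function name and the `unit_delay` parameter are assumptions for the example.

```python
def line_delay(control_word, unit_delay=1):
    """Total delay of a binary-weighted programmable delay line.

    control_word lists the bits (D_{n-1}, ..., D_0), most significant
    first. Element i adds 2**i delay units when D_i is 1 and is
    bypassed when D_i is 0.
    """
    n = len(control_word)
    total = 0
    for position, bit in enumerate(control_word):
        weight = 2 ** (n - 1 - position)
        if bit:
            total += weight * unit_delay
    return total


# A 4-bit line covers every delay from 0 to 15 units.
all_delays = sorted(
    line_delay(((w >> 3) & 1, (w >> 2) & 1, (w >> 1) & 1, w & 1))
    for w in range(16)
)
```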

## Intel's IA64 Itanium1 De-Skewing System

2000, 800 MHz clock
0.18 µm process

28 ps skew with fix
4× increase without fix

30 independent de-skew regions.

Each cluster is driven from a global H-tree.

Delay circuits in the de-skew regions are similar to Pentium3, with 20-bit registers.



## Proposal for H-Tree Clock De-Skew Hierarchical Approach



If a phase detector (PD) has a skew guard band $g$, then guard bands may accumulate along tree paths.

For example, if a logic stage is shared between regions $B$ and $C$, it may add $7g$ time units to the path delay.

## Proposal for H-Tree Clock De-Skew Mesh Approach

The clock is distributed by an H-tree, but de-skew takes place by phase detection between neighboring leaves.

A delay buffer accepts phase inputs from its 4 neighbors and then decides whether to increase, decrease or not change its delay.


## Clock Characteristics of Commercial Processors

| Name | Frequency (MHz) | Skew (ps) | Technology (nm) | Clock dist. style | Deskew |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Merom | 3,000 | 18 | 65 | Tree/Grid | Yes |
| Power6® | 5,000 | 8 | 65 | Sym. H-Tree/Grid | Yes |
| Quad-Core Opteron™ | 2,800 | 12 | 65 | Tree/Grid |  |
| Xeon® processor | 3,400 | 11 | 65 | Tree/Grid | Yes |
| Itanium® 2 processor | >2,000 | 10 | 90 | Asymmetric tree | Yes |
| Power5® | >1,500 | 27 | 130 | Sym. H-Tree/Grid | No |
| Pentium® 4 processor | 3,600 | 7 | 90 | Recombinant tile | Yes |
| Itanium® 2 processor | 1,500 | 24 | 130 | Asymmetric tree | Yes |
| Power4® | >1,000 | 25 | 180 | Tree/Grid | No |
| Itanium® 2 processor | 1,000 | 52 | 180 | Asymmetric tree | No |
| Pentium® 4 processor | >2,000 | 16 | 180 | Spine/Grid | Yes |
| Itanium® processor | 800 | 28 | 180 | H-Tree/Grid | Yes |

## Power Consumption in Chips

- Clock power may reach 50\% of total (dynamic + static).
- Clock gating is a very useful and standard design practice.
- Four gating methods:
  - Synthesis-based, automated by EDA tools (RTL compilers), inserted into clock tree
  - Clock enable signals manually defined by designer, inserted into clock tree, FFs' clock input
  - Data-driven clock gating, inserted at FF level
  - Auto-gated FF, inserted at latch level


## FF Data Toggling (DSP core)



## FF Data Toggling

63 control blocks of MP, 200k FFs


## Data-Driven CLK Gating




## How Many FFs To Group?

$k$: number of flip-flops in a group; $q$: probability that a FF has $D=Q$, $q=1-p$.
Worst case: all FFs are toggling independently of each other.

Net saving per FF:

$$
c_{\text{saving/FF}} \geq q^{k}\left(c_{\mathrm{FF}}+c_{\mathrm{w}}\right)-\left[c_{\text{latch}} / k+(1-q)\left(c_{\mathrm{w}}+c_{\mathrm{OR}}\right)\right]
$$

Here $q^{k}$ is the gater's disabling probability, $c_{\text{latch}} / k$ is the latch overhead amortized over $k$ FFs, and $1-q$ is the probability of enabling a FF.

Differentiating with respect to $k$ and equating to zero: $\quad q^{k} \ln q\left(c_{\mathrm{FF}}+c_{\mathrm{w}}\right)+c_{\text{latch}} / k^{2}=0$
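Since $k$ must be an integer, one simple way to find the best group size is to evaluate the saving bound for each candidate $k$ and keep the largest. The sketch below does that; the function names, the search limit `k_max`, and all cost values are hypothetical and are not taken from the slides.

```python
def saving_per_ff(k, q, c_ff, c_w, c_latch, c_or):
    """Lower bound on the net saving per FF for a group of k flip-flops."""
    gain = q ** k * (c_ff + c_w)                     # clock pulses disabled
    overhead = c_latch / k + (1 - q) * (c_w + c_or)  # gating logic cost
    return gain - overhead


def best_group_size(q, c_ff, c_w, c_latch, c_or, k_max=64):
    """Integer k in 1..k_max that maximizes the saving bound."""
    return max(
        range(1, k_max + 1),
        key=lambda k: saving_per_ff(k, q, c_ff, c_w, c_latch, c_or),
    )


# Hypothetical costs (arbitrary units) and q = 0.95.
params = dict(q=0.95, c_ff=10.0, c_w=1.0, c_latch=8.0, c_or=1.0)
k_opt = best_group_size(**params)
```

A larger $q$ (less toggling) pushes the optimum toward bigger groups, because the chance that the whole group is idle decays more slowly with $k$.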


## Optimal Flip-flop $k$-size Grouping

Given $n$ flip-flops and $m+1$ clock cycles,
$\mathbf{a}=\left(a_{1}, \ldots, a_{m}\right)$ is the activity (toggling) vector of a flip-flop.

$$
\begin{aligned}
& \mathrm{FF}_{1}: 01001110001010110 \\
& \mathrm{FF}_{2}: 01101010101100111
\end{aligned}
$$

$\left\|\mathbf{a}_{i} \oplus \mathbf{a}_{j}\right\|$ is the number of redundant clock pulses occurring when $\mathrm{FF}_{i}$ and $\mathrm{FF}_{j}$ are clocked jointly.
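The count of redundant pulses can be computed directly from the two traces above. This sketch assumes the listed bit strings are the flip-flop values over $m+1=17$ cycles, so that each activity vector has $m=16$ entries; if the strings are already activity vectors, the `toggles` step should be skipped. All function names are hypothetical.

```python
def toggles(values):
    """Activity vector: 1 where the value changes between adjacent cycles."""
    return [int(prev != cur) for prev, cur in zip(values, values[1:])]


def redundant_pulses(a_i, a_j):
    """Number of cycles where exactly one of the two flip-flops toggles."""
    return sum(x ^ y for x, y in zip(a_i, a_j))


# Traces from the text, assumed to be value sequences (see lead-in).
ff1 = [int(c) for c in "01001110001010110"]
ff2 = [int(c) for c in "01101010101100111"]

a1 = toggles(ff1)
a2 = toggles(ff2)
wasted = redundant_pulses(a1, a2)
```

Under that assumption FF1 toggles 10 times, FF2 toggles 11 times, and sharing one gated clock between them produces 9 redundant pulses.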

## FF Pairwise Activity Model

- $G(V, E, w)$: FF pairwise activity graph.
- $v_{i} \in V$ corresponds to $\mathrm{FF}_{i}$.
- $e_{i j}=\left(v_{i}, v_{j}\right) \in E$ is a FF pairing.
- $\mathbf{a}_{i} \mid \mathbf{a}_{j}$ (bitwise OR) is the joint toggling.
- $w\left(e_{i j}\right)=\left\|\mathbf{a}_{i} \oplus \mathbf{a}_{j}\right\|$ is the number of redundant clock pulses, hence a waste.
- $E^{\prime} \subset E$: a perfect matching of the vertices.

## Total power:

$$
\begin{aligned}
P & =2 \sum_{e_{i j} \in E^{\prime}}\left\|\mathbf{a}_{i} \mid \mathbf{a}_{j}\right\| \\
& =\sum_{v_{i} \in V}\left\|\mathbf{a}_{i}\right\|+\sum_{e_{i j} \in E^{\prime}}\left[\left\|\mathbf{a}_{i} \oplus\left(\mathbf{a}_{i} \mid \mathbf{a}_{j}\right)\right\|+\left\|\mathbf{a}_{j} \oplus\left(\mathbf{a}_{i} \mid \mathbf{a}_{j}\right)\right\|\right] \\
& =\underbrace{\sum_{v_{i} \in V}\left\|\mathbf{a}_{i}\right\|}_{\text {essential }}+\underbrace{\sum_{e_{i j} \in E^{\prime}}\left\|\mathbf{a}_{i} \oplus \mathbf{a}_{j}\right\|}_{\text {waste }}=\sum_{v_{i} \in V}\left\|\mathbf{a}_{i}\right\|+\sum_{e_{i j} \in E^{\prime}} w\left(e_{i j}\right)
\end{aligned}
$$

## Flip-Flop Grouping Algorithm

Minimize: $\sum\left\|\mathrm{FF}_{i} \oplus \mathrm{FF}_{j}\right\|$

Optimal FF pairing can be solved by minimum-cost perfect matching in the graph.
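For a handful of flip-flops the matching can be found by exhaustive search, which is enough to illustrate the objective. The sketch below is not an efficient matching algorithm (its running time grows exponentially, and a production flow would use a polynomial-time matching algorithm instead); the function names and the four toy activity vectors are hypothetical.

```python
def hamming(a, b):
    """||a XOR b||: number of positions where the activity vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def min_cost_pairing(vectors):
    """Exhaustive minimum-cost perfect matching.

    vectors: {ff_name: activity vector}. Returns (cost, list of pairs).
    Only practical for a small, even number of flip-flops.
    """
    if len(vectors) % 2:
        raise ValueError("a perfect matching needs an even number of FFs")

    def solve(remaining):
        if not remaining:
            return 0, []
        first, rest = remaining[0], remaining[1:]
        best = None
        for partner in rest:
            others = [name for name in rest if name != partner]
            cost, pairs = solve(others)
            cost += hamming(vectors[first], vectors[partner])
            if best is None or cost < best[0]:
                best = (cost, [(first, partner)] + pairs)
        return best

    return solve(sorted(vectors))


# Toy activity vectors: a/b toggle together, c/d almost together.
toy = {
    "a": [0, 0, 1, 1],
    "b": [0, 0, 1, 1],
    "c": [1, 1, 0, 0],
    "d": [1, 1, 0, 1],
}
toy_cost, toy_pairs = min_cost_pairing(toy)
```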


The matching can be repeated for groups of size $4, 8, 16, \ldots$

Is repeated perfect matching optimal?


No! Here is the optimal 4-size grouping:

| FF1: | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| FF2: | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| FF6: | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| FF7: | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |

$\begin{array}{lllllllllllll}\text { FF3: } & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 \\ \text { FF4: } & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\ \text { FF5: } & 1 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\ \text { FF8: } & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0\end{array}$

## Multi-Bit Flip-Flop

Saving the power of internal CLK drivers



MBFF should be combined with data-driven CG to maximize energy savings.

Toggling vectors (VCD) are unfortunately not always available.

Data-to-clock toggling ratio (probability) is more often available.

How to utilize it for MBFF optimal grouping?

What is the energy waste in a 2-bit MBFF?
$p_{j}\left(1-p_{i}\right)+p_{i}\left(1-p_{j}\right)=p_{i}+p_{j}-2 p_{i} p_{j}$

For $n$ FFs grouped in $n/2$ MBFFs, where the $i$-th MBFF holds $\mathrm{FF}_{s_{i}}$ and $\mathrm{FF}_{t_{i}}$, it is

$$
\sum_{j=1}^{n} p_{j}-2 \sum_{i=1}^{n / 2} p_{s_{i}} p_{t_{i}}
$$

What is the optimal MBFF grouping?

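Sort the FFs such that $p_{1} \leq p_{2} \leq \cdots \leq p_{n}$ and group neighbors in that order, $(p_{1}, p_{2}), (p_{3}, p_{4}), \ldots$, which maximizes the sum of products and hence minimizes the waste.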
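A minimal sketch of probability-based 2-bit grouping follows: sort the flip-flops by toggling probability and pair neighbors, then evaluate the expected waste $p_i + p_j - 2 p_i p_j$ per pair. The function names and the probability values are hypothetical.

```python
def pair_by_sorted_probability(p):
    """Pair FF indices after sorting by toggling probability.

    With an odd number of FFs the most active one is left unpaired.
    """
    order = sorted(range(len(p)), key=lambda i: p[i])
    return [(order[i], order[i + 1]) for i in range(0, len(order) - 1, 2)]


def expected_waste(p, pairs):
    """Sum over pairs of p_i + p_j - 2 * p_i * p_j."""
    return sum(p[i] + p[j] - 2 * p[i] * p[j] for i, j in pairs)


# Hypothetical data-to-clock toggling probabilities of four FFs.
probs = [0.9, 0.1, 0.8, 0.2]
sorted_pairs = pair_by_sorted_probability(probs)
waste_sorted = expected_waste(probs, sorted_pairs)
# Pairing in the original order, for comparison.
waste_naive = expected_waste(probs, [(0, 1), (2, 3)])
```

Pairing flip-flops with similar activity keeps the waste low: here the sorted pairing gives roughly a third of the waste of the naive one.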

## Auto-Gated FF



## Look-Ahead CLK Gating




## Power Saving Per FF

$$
\begin{aligned}
& c_{\mathrm{dyn}}^{\mathrm{save}}= \\
& (1-p)^{k}\left(c_{\mathrm{FF}+\mathrm{CLK}}+c_{\mathrm{FF}}+c_{\mathrm{O}}\right)- \\
& p\left(c_{\mathrm{X}}+k c_{\mathrm{O}}\right)-\left(\frac{c_{\mathrm{FF}+\mathrm{CLK}}}{3}+c_{\mathrm{A}_{\mathrm{int}}}+c_{\mathrm{FF}}+c_{\mathrm{O}}\right)
\end{aligned}
$$

| $C_{\mathrm{FF}}$ | $C_{\mathrm{CLK}}$ | $C_{\mathrm{FF+CLK}}$ | $C_{\mathrm{X}}$ | $C_{\mathrm{O}}$ | $C_{\mathrm{A_{int}}}$ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| 25.7 | 33.5 | 36.9 | 2.9 | 3.1 | 1.7 |

## Which FF to Gate?



## Results





