In synchronous digital designs, the clock signal defines the reference for transferring data from one flop to another. The launch and capture of data at each flop is controlled by the clock, so the functionality of the whole design depends on precise clocking. As frequencies and flop counts increase, the clock tree forms the backbone of the design and consumes a growing share of design resources such as power and routing.
As applications multiply, more and more features are integrated on the same chip, increasing the total gate count. At the same time, performance is raised by increasing the design frequency, which drives up power consumption. The clock network alone accounts for about 35 per cent of that power, although it makes up only around two per cent of the whole design.
The clock tree network is also responsible for the robustness of the design. Chip failures during testing are most often caused by poor clock design. The whole design relies on clock tree balancing. Uneven scaling of the clock network delays to the flops, due to on-chip variation or delay variation across corners, can disturb this balance and result in timing failures. Further, at lower nodes the interconnect delay increases significantly. These effects create reliability issues and cause timing yield loss. A carefully designed clock distribution network can reduce design-inherent clock skew, improving timing yield and reducing design cycle iterations.
In this paper, we present the significance of clock net routing at lower nodes and explain mathematically how the wire delay component grows as technology shrinks. We then compare resistance, capacitance and wire delay across technology nodes, and illustrate the impact of net delay on skew with an example. Lastly, we explain various clock net topologies, analyze them from all aspects and conclude which is best.
Clock net problem formulation
The flops are spread all over the design according to the placement of their modules. The clock tree network distributes the clock signal and clocks each flop at almost the same time. This network is composed of buffers and nets, as shown in Figure 1:
Figure 1: Clock network in design
Clock buffers and clock wires (also called interconnects) thus form the basic building blocks of a clock tree. The nets form the capacitive load of each buffer and so determine its cell delay. At lower nodes, these interconnects also have a significant effect on clock propagation through the design. As technology shrinks, the proportion of net delay increases further, as explained below.
Figure 2: Clock interconnect
A major timing issue in deep submicron technology is wire delay: because of wire load capacitance, the RC delay can be much larger than the intrinsic cell delays. For a given wire in a particular technology, as shown in Figure 2, the wire delay is defined as the product of its resistance and capacitance. The resistance and capacitance of the net can be described as follows:
Parasitic resistance: $R = \rho \frac{L}{A} = \frac{\rho L}{W T}$
Ground capacitance: $C = \varepsilon \frac{A}{H} = \frac{\varepsilon L W}{H}$
Thus, assuming zero coupling for the net, the delay can be written as:
Wire delay: $D = R C = \frac{\rho L}{W T} \cdot \frac{\varepsilon L W}{H} = \frac{\rho \varepsilon L^2}{T H}$ ... (i)
Thus the wire delay is directly proportional to the square of the length and inversely proportional to the thickness of the net and its height above ground. If the technology shrinks and the length, height and thickness all scale by the same factor, then from equation (i) the net delay remains the same. In practice, however, physical limitations prevent these dimensions from scaling evenly, so the proportion of net delay increases even more. Typical values of R and C (per unit length of wire) for each layer in different technologies are as follows:
Table 1: Values of resistance, capacitance and RC Delay per unit length for different technologies
Figure 3: RC delay per unit length for metal M2
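The behavior of equation (i) can be checked numerically. The following is a minimal sketch, not process data: the function name `wire_delay` and all numeric values are illustrative assumptions, and the quantities are unitless. It shows the quadratic dependence on length, the unchanged delay under uniform scaling, and the increase in delay when only the cross-section shrinks.

```python
def wire_delay(rho, eps, length, width, thickness, height):
    """First-order RC delay of a wire, ignoring coupling capacitance."""
    resistance = rho * length / (width * thickness)  # R = rho * L / (W * T)
    capacitance = eps * length * width / height      # C = eps * L * W / H
    return resistance * capacitance                  # D = rho * eps * L^2 / (T * H)


# Illustrative, unitless values (assumptions, not taken from any technology file).
base = wire_delay(rho=1.0, eps=1.0, length=100.0, width=1.0, thickness=2.0, height=4.0)

# Doubling the length quadruples the delay.
doubled_length = wire_delay(1.0, 1.0, 200.0, 1.0, 2.0, 4.0)

# Scaling L, W, T and H by the same factor leaves the delay unchanged.
s = 0.5
uniform_shrink = wire_delay(1.0, 1.0, 100.0 * s, 1.0 * s, 2.0 * s, 4.0 * s)

# Shrinking only W, T and H while the length stays the same increases the delay.
uneven_shrink = wire_delay(1.0, 1.0, 100.0, 1.0 * s, 2.0 * s, 4.0 * s)

print(base, doubled_length, uniform_shrink, uneven_shrink)
```

The last case is one hypothetical form of uneven scaling, chosen only to show the direction of the effect.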
Different net topologies
For clock tree synthesis, it is necessary to choose a wire routing rule for the clock nets that keeps clock power consumption, net delays, routing resources, crosstalk, electromigration (EM) and buffer delays low. Each topology is defined by its width and spacing. The net topologies considered for clock tree synthesis are SWSS, SWDS, SWTS, DWDS and single width with shielding.
a) Single width single space net topology: In SWSS, the clock net is routed at minimum width, and its spacing to adjacent metal wires is the minimum spacing. Because of the small spacing, coupling capacitance is higher and the crosstalk impact is greater. The overall net capacitance also increases, which increases the net delays. Each routed metal wire requires 3 tracks.
Figure 4: Single width single space net topology
Figure 5: Single width double space net topology
c) Single width triple space net topology: In SWTS, the clock net is again single width, and the spacing becomes three times the width of the clock net. Coupling capacitance is lowest in this case, so the noise effect on the clock tree decreases. Congestion increases, however, because this topology consumes the largest number of tracks.
Figure 6: Single width triple space net topology
Figure 7: Double width double space net topology
e) Single width net with shielding topology: In this topology the clock is routed at single width and shielded by a power/ground wire on both sides, as shown in the figure. This makes the clock signal immune to crosstalk from adjacent signals. On the other hand, it takes more routing resources and increases the capacitance of the clock net.
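The effect of width and spacing on capacitance can be sketched with a simple parallel-plate approximation. This is a rough illustration under stated assumptions, not an extraction model: ground capacitance per unit length is taken as proportional to wire width, coupling capacitance to one neighbor as inversely proportional to spacing, and fringing effects are ignored. The function name `relative_capacitances` and the normalization to SWSS are hypothetical choices made for this example.

```python
def relative_capacitances(width_multiple, spacing_multiple):
    """Return (ground, coupling) capacitance per unit length relative to SWSS.

    Assumption: parallel-plate behavior only, so ground capacitance scales
    with width and coupling capacitance scales with 1 / spacing.
    """
    ground = float(width_multiple)
    coupling = 1.0 / spacing_multiple
    return ground, coupling


# (width, spacing) expressed as multiples of the minimum width and spacing.
TOPOLOGIES = {
    "SWSS": (1, 1),
    "SWDS": (1, 2),
    "SWTS": (1, 3),
    "DWDS": (2, 2),
}

for name, (width, spacing) in TOPOLOGIES.items():
    ground, coupling = relative_capacitances(width, spacing)
    print(f"{name}: ground x{ground:.2f}, coupling x{coupling:.2f}")
```

Under these assumptions, wider spacing lowers only the coupling term, while doubling the width raises the ground term. Real values depend on the process and should come from parasitic extraction.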
Apart from the net topology, the clock nets are routed in the minimum number of adjacent layers with similar properties. These layers should also be as close as possible to the cells, to minimize interconnect length between the cells. For instance, in an 8-metal-layer design where layers 2 to 5 have the same properties and the standard cell pins are in layer 2, the clock nets are routed in layer 4 (horizontal) and layer 5 (vertical), leaving the immediately adjacent layer (layer 3) for data nets.
Clock Net Routing
Moreover, as technology shrinks and utilization targets rise, more nets have to be packed into less area. At congested places in the design, nets have to detour to avoid shorts and spacing violations. To avoid detours in the clock, clock nets are given maximum weight and are routed before the signal nets. Otherwise the clock nets may be detoured, as shown in Figure 10.
Figure 9: Sequential clock net routing
Figure 10: Concurrent clock net routing
Cell delay variation vs. Net delay variation
The clock gate ratio measures how the clock network delay is divided between cells and nets. It is defined as the ratio of cell delay to total clock path delay from the clock root to the register.
Therefore, in a robust design the net delay should be minimal, and so the gate ratio should be close to 1.
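The definition of the gate ratio can be written as a short calculation. This is a minimal sketch: the function name `clock_gate_ratio` and the delay values are assumptions chosen for illustration, with delays expressed in picoseconds.

```python
def clock_gate_ratio(cell_delays_ps, net_delays_ps):
    """Ratio of cell delay to total clock path delay (clock root to register)."""
    cell_total = sum(cell_delays_ps)
    path_total = cell_total + sum(net_delays_ps)
    if path_total <= 0:
        raise ValueError("total clock path delay must be positive")
    return cell_total / path_total


# Hypothetical clock path: two buffers and three net segments.
ratio = clock_gate_ratio(cell_delays_ps=[4000, 3000], net_delays_ps=[1000, 1000, 1000])
print(f"gate ratio = {ratio:.2f}")
```

A path with no net delay at all gives a ratio of exactly 1, which is the ideal the text describes.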
Figure 11: Clock path for two flip flops
The motivation for minimizing net delay is that cells and nets respond differently to temperature, process and voltage. Cells and nets behave in opposite ways because of the inherent difference in the properties of metals and semiconductors. Moreover, the percentage variation differs between them. A change in operating conditions therefore disturbs the balancing of the flops and results in timing failures, as explained below.
The clock path delay of FF1 is balanced with the clock path delay of FF2. The clock network delay for FF1 can be written as:
FF1 clock latency = cell1 delay + cell2 delay + FF1 net delay
Similarly for the FF2,
At a particular operating condition, FF1 clock latency = FF2 clock latency
At a different operating condition, the cell delay and the net delay both change, but by different amounts. Because the proportion of cell delay to net delay differs between FF1 and FF2, the two latencies no longer match once the operating point changes.
For example, assume the two flops shown in the figure above each have 10ns of latency at one operating point and are thus balanced. FF1 has 7ns of cell delay and 3ns of net delay in its clock path, whereas FF2 has 6ns of cell delay and 4ns of net delay. At some other operating point where, say, cell delay decreases by 30% and net delay increases by 20%, the latency to FF1 becomes 8.5ns and to FF2 9ns, skewing the flops by 500ps as shown in the figure below.
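The arithmetic of this example can be reproduced in a few lines. This sketch assumes that cell delay decreases by 30% while net delay increases by 20%, which is the combination that yields the 8.5ns and 9ns figures quoted above. The helper names are hypothetical, and delays are kept as integer picoseconds to avoid floating-point rounding.

```python
def scaled(delay_ps, percent_change):
    """Apply a percentage change to a delay given in integer picoseconds."""
    return delay_ps * (100 + percent_change) // 100


def latency_ps(cell_ps, net_ps, cell_change_pct, net_change_pct):
    """Clock latency as the sum of scaled cell delay and scaled net delay."""
    return scaled(cell_ps, cell_change_pct) + scaled(net_ps, net_change_pct)


# Nominal operating point: both flops see 10 ns of clock latency.
ff1_nominal = latency_ps(7000, 3000, 0, 0)
ff2_nominal = latency_ps(6000, 4000, 0, 0)

# Assumed shift: cell delay -30%, net delay +20%.
ff1_shifted = latency_ps(7000, 3000, -30, 20)
ff2_shifted = latency_ps(6000, 4000, -30, 20)

skew_ps = abs(ff1_shifted - ff2_shifted)
print(ff1_nominal, ff2_nominal, ff1_shifted, ff2_shifted, skew_ps)
```

The skew arises only because the two paths split their 10ns differently between cells and nets. Two paths with identical cell-to-net proportions would stay balanced under the same shift.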
Analysis and results:
To validate our results, various net topology experiments were run and each parameter was evaluated on a 6M c40 design, as shown below.
The above results are derived from a 40nm design. Let us analyze these results one by one for each routing topology.
In the case of SWSS, we can see that there is not too much congestion due to clock routes, but all the other parameters show unfavorable behavior. There are more clock buffers which in turn leads to higher clock power. The gate ratio is not very good, and understandably, the clock is prone to both EM as well as noise.
Coming to SWDS, we see that the buffer count is much reduced, and hence the clock power too. Congestion increases because of the double spacing requirement, and the clock nets are still prone to EM. Noise is reduced because the increased spacing decreases the coupling capacitance, but since the ground capacitance per unit length remains the same as in SWSS, the noise reduction is not large.
With the SWTS strategy, the buffer count is similar to that achieved with SWDS, so the power also tracks SWDS. Congestion increases further because of the extra spacing. We see a good gate ratio, as in SWDS, but EM susceptibility is also the same. With SWTS we achieve significant noise reduction, because the triple spacing between wires greatly reduces the coupling capacitance.
On routing the clock using DWDS, we end up with a reduced buffer count but a slight increase in power due to higher drive strengths (net capacitances are larger because of the increased wire area). Congestion is slightly lower than with SWDS, and the gate ratio is very good. Although the spacing is the same as in SWDS, noise is much lower because the doubled wire width increases the ground capacitance. The wider wires are also far less prone to EM failure.
On the basis of the above discussion, it is clear that for this technology DWDS is the clock routing topology of choice. It mitigates noise and EM at a reasonable congestion cost. Another advantage of DWDS is that clock nets can be designed with double-cut vias, which helps increase manufacturing yield.
At lower nodes, where the proportion of net delay in the overall clock network delay rises significantly, the clock net topology determines the clock tree network latency, timing, power consumption, EM and crosstalk. The skew variation across corners, and thus the design robustness, also depends on it. Hence, special net topologies have to be used for the clock nets to keep them free of noise and to optimize net delay and power consumption. A smart selection of the net topology can further decrease power, delay and crosstalk, and for each technology node it is possible to identify an optimal routing topology based on the parameters discussed in this paper.
About the authors:
Ravi Chhabra (email@example.com) is a senior design engineer at Freescale Semiconductors, Noida, India. He has 3 years of industry experience in the fields of SoC placement, CTS, routing, noise, timing, DRC and DFM closure. He has worked on complex low-power automotive SoCs in multiple nodes and has done research to drive the CTS guidelines for lower nodes.
Srijith Nair (firstname.lastname@example.org) is a senior design engineer at Freescale Semiconductors, Noida, India. He has 3 years of experience in logical and physical synthesis, STA and static low-power verification.
Ekta Gujral (email@example.com) is an intern at Freescale Semiconductors, Noida, India. She has worked on clock tree synthesis and analysis across multiple technology nodes.