Hierarchical partitioning is a physical design methodology in which a complete SoC is divided into sub-chips that are closed independently and in parallel; later, all sub-chips are assembled to achieve complete SoC closure.
Need for partitioning
* Feasibility: If the design is so big that the tools cannot handle it, we lose quality of results and pay a huge runtime penalty. Partitioning helps achieve good results and improved runtimes.
* Flexibility: With independent partition closure, the partitioning flow gives the flexibility to take decisions per partition, such as using SVT cells or applying different latencies during CTS, and thereby provides better manageability.
* Early closure: It allows early closure of parts of the design that will not change with new RTL releases. Critical IPs such as DDR, which need special attention, can also be closed independently.
* IP reuse: Some partitions, once hardened, can be reused directly in multiple projects.
Drawbacks of Partitioning
Partitioning a design also brings a lot of challenges during execution. The following points should be considered carefully before deciding to go for partitioning.
* Partitioning results in over-pessimistic interface closure, because load and transition at the interface are unpredictable and because of improper budgeting (the division of time on interface paths).
* Integration effort increases.
* Because boundary optimization is not possible, the performance (frequency) of interface paths in the design is reduced.
* Resource management: more resources are needed to work on each partition. Managing them and controlling their versions before the merge is very important.
It is very important to decide whether partitioning can really pull in the tape-out date or not.
Need of a block model and usage
When an SoC is implemented independently of its blocks (partitions), we need a model that represents the timing inside each block in order to take care of the interface timing. Time budgeting takes care of the distribution of timing and generates the required model.
Consider the block in Figure 1, with the constraints applied for it to be closed independently.
A model (Figure 2, below) can then be generated to represent this block in the top-level implementation.
(The picture shows setup arcs in blue, combinational arcs in red and sequential arcs in green.)
The following flowchart represents the partitioning flow starting from synthesis to the reassembly of design.
Synthesizing a partitioned block separately is optional. The complete SoC can also be synthesized flat, with some important considerations (explained later).
However, blocks that must start early in the SoC cycle, or independently of the development of the top-level constraints, need to be synthesized separately. Synthesis constraints for an individual block can consist of clock definitions, exceptions, and input/output delays of 50% of the clock period. Here the main focus is to optimize reg-to-reg paths, with some constraints on the IOs.
As synthesis optimization of all blocks progresses, all the netlists can be combined and first-cut IO constraints can be generated using Encounter PrePlace Time Budgeting (explained later). This can be done without a floorplan.
Once we have IO constraints, we can reoptimize the individual blocks for their IOs if needed, and with improved slack on the IOs we can redo time budgeting for a better distribution of budgets.
Like the individual blocks, the top-level netlist can also be optimized before time budgeting, using a black-box model (.lib) of each block.
This stage ensures prePlace timing closure (Stage 1 in the flowchart) and gives a proper distribution of timing across the interface between each block and the SoC top. Meanwhile, the floorplanning team delivers the flat floorplan with a FENCE defined for every block. The fence must be honored by every partition during flat placement.
The floorplanning team should consider some points before delivering the floorplan (explained later).
To save runtime, the Encounter virtual prototyping flow can also be used, in which only boundary paths are considered for trial routing and extraction. This is also explained later in detail.
This step (PreCTS time budgeting) involves timing-driven pin placement for every block and fixing of those pins. With the partitioning of the design, the multiple databases required for implementing every individual block and the top level are generated.
Every block and the top-level database can now be picked up for placement and preCTS timing closure. An important point to note is that clock latencies are not yet considered; the .lib models generated also do not include any latency in their setup and sequential arcs.
During preCTS closure, based on CTS experiments on each block, the expected latency can be added for all clocks, including virtual clocks. If required, flops can be pulled or pushed inside the block.
If, after placement and optimization, a block pin needs to be moved, this can be done, but the block LEF file must be updated as well.
After every possible optimization at the interfaces, if a redistribution of slack is still required, the design should be reassembled and a new set of IO constraints and .lib models should be generated with time budgeting.
This stage (Stage 2 in the flowchart) ensures preCTS timing closure of the design.
IO timing for every block can be closed at a different value of virtual latency, but with a single value per block.
Clock tree planning is done by balancing the virtual clock latency of each block with its source latency. This ensures proper balancing of the interfacing flops with the useful skew used during block closure. It is explained later in detail.
After this, the design can be recombined and, with the help of budgeting, postCTS .lib models can be generated that include latency information in the setup and sequential arcs for further use.
Now every block can be closed independently, with OCV derates and noise considered for that block only. Adequate margins should be kept for top-level skew and noise effects. Selecting design margins is very important for successful interface closure. Various considerations for defining proper design margins are explained later.
Later, after removing the extra margins and considering the real impact of inter-clock skew and flat noise, the remaining violations should be fixed based on STA to achieve FINAL CLOSURE.
The primary role of this activity in partitioning is to optimize the interfaces of the blocks well, to enable better judgment on IO margins. Depending on the design size and the runtimes of timing optimization, we can follow one of two approaches.
Approach 1 (meant for bigger designs):
Every block can be synthesized with constraints that use IO delays of 50% of the clock period, along with the clock definitions and exceptions. Similarly, the top can be synthesized with a 50% IO delay .lib model, which can easily be generated with a black-box bottom-up approach. These synthesis runs will work more aggressively on reg-to-reg timing.
Later, all the block netlists and the top netlist can be combined for preplace budgeting, followed by optimization with the improved IO delays, after which preplace budgeting becomes more accurate.
Approach 2 (meant for small designs):
In this approach the complete design is synthesized at once. It is meant for small designs, or designs where we can live with a huge runtime and heavy memory requirements. This approach is obviously less iterative in optimizing the IO paths to be partitioned. After preplace budgeting on the flat netlist, one optimization run on the individual netlists (blocks and top) can be done to improve IO delays, followed by preplace budgeting again.
However, when fresh IPs are delivered late, Approach 1 is the better choice. We can close the corresponding IOs of the late IP with extra pessimism, say 70-80% IO delays.
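The percentage-of-period IO constraints used in both approaches can be illustrated with a small Python sketch that emits SDC-style lines. This is only an illustration: the clock and port names are hypothetical, and the generated text uses the standard SDC commands create_clock, set_input_delay and set_output_delay rather than anything specific to the flow described here.

```python
def io_delay_constraints(clock_name, period_ns, inputs, outputs, io_fraction=0.5):
    """Return SDC-style lines that constrain block IOs at a fraction of the period.

    io_fraction=0.5 models the 50% IO delay starting point; a larger value
    (for example 0.8) models the extra pessimism applied to a late IP.
    Port and clock names are hypothetical placeholders.
    """
    if not 0.0 < io_fraction < 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    delay = period_ns * io_fraction
    lines = [
        f"create_clock -name {clock_name} -period {period_ns:g} "
        f"[get_ports {clock_name}]"
    ]
    for port in inputs:
        lines.append(
            f"set_input_delay {delay:g} -clock {clock_name} [get_ports {port}]"
        )
    for port in outputs:
        lines.append(
            f"set_output_delay {delay:g} -clock {clock_name} [get_ports {port}]"
        )
    return lines


if __name__ == "__main__":
    # First-cut constraints for a block with one input and one output port.
    for line in io_delay_constraints("clk", 10.0, ["din"], ["dout"]):
        print(line)
```

Calling the function again with io_fraction=0.8 would give the more pessimistic constraints for the IOs of a late-arriving IP.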
DFT requirements, such as scan chain formation and the introduction of IO ports at the partition boundary, should be planned intelligently so that the fewest possible ports are needed at the partition boundaries.
Floorplan for Partitioning
While delivering the floorplan, it is important to consider the following:
* Fences are created for the individual partitions after timing experiments.
* All placement and routing blockages required for the top and the partitioned blocks are in place.
* There should not be any placement blockage across the boundary of a partition; this can lead to increased length (load) on boundary nets.
* If needed, some space can be left with partial blockage along the boundaries for hold fixing on partition IOs.
* No pin placement is defined for the blocks (that is done after timing-driven placement).
It is very important to keep the block floorplan and LEF in sync. A review of the floorplan after budgeting, with pin placement done (and fixed), is also recommended.
For partitioning, constraints are needed at the very beginning to budget the interfaces of the partitions. Top-level constraints with clock definitions, case analysis and interface-related exceptions are enough to obtain correct budgeting constraints for the partitions.
The following are the considerations for top-level constraints:
Clock definition: The interface should be budgeted only with respect to the fastest clock. Otherwise the library model has arcs with respect to every clock, with varied time periods, and when the model is used at the top level the tool always picks the pessimistic (greater) value from the library, giving huge violations for faster clocks (CCR No. 662262). As a workaround, only the fastest of all clocks should be defined. The budgets generated with it will also be applicable to the slower clocks.
Reg2Reg uncertainty: During budgeting, the reg2reg uncertainty applied at the top level also gets budgeted as per the budgeting algorithm. So, for accurate budgeting, the reg2reg uncertainty should be made equal to the in2reg and reg2out uncertainties decided as per the margins (discussed in the next section, Design margins).
Design margins for partitioned block
In designs where blocks are implemented separately, it is very important to consider an extra inter-block margin along with the block uncertainty inside the block.
There will be 4 types of paths across a block:
1) Input to reg path
2) Intra clock reg to reg path ( b/w B & C)
3) Inter clock reg to reg path ( b/w A & B)
4) Register to output path.
Type 1 and 4, will have virtual clock uncertainty, Type 2 will have intraclock uncertainty, and Type 3 will have interclock uncertainty.
Consider the following parameters:
X – Allowed intra clock skew inside partition (for clk1 and clk2)
Y – Allowed top level skew (between clk1 and clk2)
D – % of OCV derates.
U – Noise effect (margin) for block.
V – Noise effect (margin) on block from top.
L – Latency of clock inside block (Network latency)
S – Top level latency of clock (source latency)
J – Jitter of clock
crpr – Average block crpr effect.
CRPR – Average top level crpr effect.
Then, at the preCTS stage, the total uncertainty values are as follows.
Intraclock uncertainty = J + X + D * 2 * L + U + V - crpr
Interclock uncertainty = J + X + Y + D * 2 * (L + S) + U + V - CRPR
Virtual clock uncertainty = J + X + Y + D * L + U + V - CRPR
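As a worked illustration, the three uncertainty expressions can be evaluated with a short Python sketch. It assumes that the interclock expression reads D * 2 * (L + S) + U + V, that the top-level pessimism term in the last two expressions is the CRPR parameter defined above, and that D is given as a fraction (0.05 for 5%). All numeric values are made up for the example and are not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Margins:
    """Margin parameters, all in nsec except D (a fraction)."""
    J: float           # clock jitter
    X: float           # allowed intra-clock skew inside the partition
    Y: float           # allowed top-level skew
    D: float           # OCV derate as a fraction, e.g. 0.05 for 5%
    U: float           # noise margin for the block
    V: float           # noise margin on the block from top
    L: float           # network latency of the clock inside the block
    S: float           # source latency of the clock at top level
    crpr_block: float  # average block-level crpr effect
    crpr_top: float    # average top-level CRPR effect


def intraclock_uncertainty(m):
    return m.J + m.X + m.D * 2 * m.L + m.U + m.V - m.crpr_block


def interclock_uncertainty(m):
    return m.J + m.X + m.Y + m.D * 2 * (m.L + m.S) + m.U + m.V - m.crpr_top


def virtual_clock_uncertainty(m):
    return m.J + m.X + m.Y + m.D * m.L + m.U + m.V - m.crpr_top


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the formulas.
    m = Margins(J=0.05, X=0.10, Y=0.15, D=0.05, U=0.02, V=0.03,
                L=2.0, S=3.0, crpr_block=0.04, crpr_top=0.06)
    print(f"intraclock: {intraclock_uncertainty(m):.2f} nsec")
    print(f"interclock: {interclock_uncertainty(m):.2f} nsec")
    print(f"virtual:    {virtual_clock_uncertainty(m):.2f} nsec")
```

With these sample numbers the interclock value is the largest of the three, because the derate is applied to both the network and the source latency.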
PrePlace budgeting is useful for estimating budgets purely on the basis of the logic in the netlist. The netlist can be loaded without placement information and, based on the preplace timing summary, the budgeting flow can be used to obtain budgets; later, after placement, they can be improved with postPlace budgeting.
1. Load the design using the conf file with the netlist.
2. Apply the partitioning constraints.
3. Update timing at the placement stage and use the following commands:
deriveTimingBudget -noTrialIPO -constantModel -noIncludeLatency -inst <instance name>
saveTimingBudget -dir DIR -pt -lib -inputTransition -pinLoad -inst <instance name>
NoTrialIPO – the tool does not perform any optimization and budgets purely on the existing timing status of the design.
Constant model – the generated .lib has a single constant value, which helps in debugging if some budget comes out wrong.
Inst – instance name of the partitioned block.
NoIncludeLatency – the tool assumes all clocks are ideal and does not include latency in the setup and sequential arcs during lib generation.
Flat Placement and Virtual IPO
With the combined netlist of all partitions and top, along with the floorplan delivered by the floorplanning team (which includes all the required blockages and, most importantly, fences for every partition), a flat placement and optimization is done.
The fences ensure that the instances of every partition are placed inside the location defined for that partition. We should also mark all the boundary nets and pins of the partition blocks as don't-touch, to prevent the tool from adding extra ports at the boundary during budgeting.
Budgets generated at this stage will not be accurate because of DRVs (for example, fanout violations that are not yet optimized). When such a violation is fixed later, it will demand extra delay that was not counted in budgeting. To account for this, the option of virtual optimization is available.
The tool does not actually perform the optimization; rather, it estimates the additional delay required to fix the DRV (e.g. to absorb a fanout of 2000 we may need 6 stages of bufx8, giving an extra data path delay of 1 nsec). While estimating budgets, that extra delay is assumed in the data path.
As in the figure, the total data path considered for budgeting would be 3 + 2 + 1 (from virtual IPO) = 6 nsec.
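The arithmetic behind this estimate can be sketched in Python. The sketch is not how the tool computes virtual IPO delay: the per-stage fanout limit of 4 is a hypothetical value, picked here only because it yields the 6 buffer stages quoted above for a fanout of 2000, and the 1 nsec extra delay is taken directly from the example.

```python
def buffer_stages(fanout, max_fanout_per_stage):
    """Count buffer-tree levels needed so each driver stays within the limit."""
    if fanout < 1 or max_fanout_per_stage < 2:
        raise ValueError("fanout must be >= 1 and the per-stage limit >= 2")
    stages = 0
    reach = 1  # number of sinks reachable with the stages counted so far
    while reach < fanout:
        reach *= max_fanout_per_stage
        stages += 1
    return stages


def budgeted_path_delay(segment_delays_ns, virtual_ipo_delay_ns):
    """Data path delay seen by budgeting: real segments plus the estimated
    delay of the buffering that has not been inserted yet."""
    return sum(segment_delays_ns) + virtual_ipo_delay_ns


if __name__ == "__main__":
    print(buffer_stages(2000, 4))                # hypothetical limit of 4 per stage
    print(budgeted_path_delay([3.0, 2.0], 1.0))  # the 3 + 2 + 1 example
```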
Another option one can think of is to optimize boundary DRVs before budgeting, but that is not feasible, as we may not want to introduce extra ports at the partition boundary.
After placement, we may not be able to run optimization because of the huge design size. In that case we can use Active Logic Views, in which the tool is guided to optimize only the interface paths of the partitions. This will be discussed in detail later.
After placement and optimization on the flat design (or using Active Logic Views), the following steps can be used to partition the design.
Step 1: deriveTimingBudget -trialIPO -noConstantModel -noIncludeLatency
Step 2: assignPtnPin -markFixed
Step 3: partition
Step 4: savePartition -def -pt -lib -inputTransition -pinLoad
The first command derives budgets considering all special cases (discussed later in detail) across the partitions; the second command freezes all partition pin locations and marks them fixed. The partition command cuts every partition around its pins. The last command saves the complete design with the following information.
For Block: Floorplan, constraints, cpf (if used), conf and netlist.
For TOP: Floorplan, top constraints, cpf, lef and lib of partitioned block, conf and netlist
The unique advantage of partitioning with constraints is that we can avoid iterations on partition pin placement. Every pin is placed where it best suits the interface timing, giving timing-aware pin placement.
We can use the top-level CPF while partitioning and obtain individual CPFs for the top level and the block level. Also, if we have some latency information (pulling or pushing) that improves timing, we can apply it through the constraints.
When we use deriveTimingBudget, the tool uses the following basic algorithm to derive the budget: it divides up the existing logic and normalizes it to the time period, while considering all other basic parameters such as setup arcs and uncertainty.
Consider the basic examples of budgeting shown in the figure below:
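The idea of dividing the existing logic and normalizing it to the time period can be sketched as a simple proportional split. This is an assumed simplification for illustration, not the tool's actual algorithm: each side of the interface receives a share of the available time in proportion to its current delay. The segment names and delay values are hypothetical.

```python
def normalize_budgets(segment_delays_ns, period_ns, setup_ns=0.0, uncertainty_ns=0.0):
    """Scale the current segment delays so that they add up to the time
    available in one clock period (period minus setup and uncertainty)."""
    available = period_ns - setup_ns - uncertainty_ns
    total = sum(segment_delays_ns.values())
    if total <= 0 or available <= 0:
        raise ValueError("need positive path delay and positive available time")
    return {name: delay * available / total
            for name, delay in segment_delays_ns.items()}


if __name__ == "__main__":
    # A path with 3 nsec of logic at top level and 2 nsec inside the block,
    # budgeted against a 10 nsec clock with no setup or uncertainty.
    budgets = normalize_budgets({"top": 3.0, "block": 2.0}, period_ns=10.0)
    for name, value in budgets.items():
        print(f"{name}: {value:.2f} nsec")
```

In this example the 5 nsec of existing logic is stretched to the 10 nsec period, so the top side is budgeted 6 nsec and the block side 4 nsec.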
Justifying every budget
We can also justify the budgeting of any port and report the behavior of the path during budgeting. The command justifyTimingBudget helps us verify the generated budget and timing arc. It shows the timing report in which the path with the worst slack is budgeted with respect to every clock domain. In addition, deriveTimingBudget generates a warning report that tells us about discrepancies during budget generation.
PreCTS Design Closure
Individual designs (partitions, and top with block lib/lef) can be processed independently for re-placement (if required) and timing optimization. Register-to-register timing must be met. If the IOs cannot be closed despite repeated optimization efforts, there are three ways to proceed, listed in priority order.
1. Assemble the design and redistribute the time budgets (constraints and lib); these will be more accurate.
2. Adjust the virtual clock latency (inter-block boundary useful skew), explained below in detail. Special clock balancing will be required to balance the blocks from the top level, as discussed later under clock tree synthesis planning.
3. Pull or push boundary flops to gain a larger share of the time period (intra-block boundary useful skew). We should avoid this if possible, as it introduces extra hold violations in the design.
Inter-block boundary useful skew technique: the IOs of every design are closed with respect to a virtual clock.
Generally its latency is set to the same value as the latency of the actual clocks built inside the block. Virtual latency can be defined as the latency of the flops in the other partition that talk to the interface. Consider the situation shown in the figure: the input port to register paths violate by -1 nsec (WNS), while the register to output port paths are met by 0.5 nsec (WNS), at a virtual latency of 5 nsec. If we change the virtual latency to 4 nsec, the violations shift towards the output ports: the in-to-reg paths will be met and the reg-to-out paths will start violating by -0.5 nsec. We can then look for optimization scope on the reg-to-out paths, which may not have been fully optimized earlier because of their positive slack.
Similarly, we can close every individual partition and the top-level design at a particular virtual clock latency value.
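The slack shift in this example can be reproduced with a few lines of Python. The sketch assumes a simple first-order model in which lowering the virtual clock latency by some amount improves input-to-register setup slack by that amount and degrades register-to-output setup slack by the same amount; the function name is hypothetical.

```python
def shift_virtual_latency(in2reg_wns, reg2out_wns, old_latency, new_latency):
    """Return (in2reg, reg2out) worst slack after changing the virtual clock
    latency, assuming slack moves one-for-one between the two IO path groups."""
    delta = old_latency - new_latency
    return in2reg_wns + delta, reg2out_wns - delta


if __name__ == "__main__":
    # Values from the example: -1 nsec and +0.5 nsec at a virtual latency of 5 nsec.
    in2reg, reg2out = shift_virtual_latency(-1.0, 0.5, old_latency=5.0, new_latency=4.0)
    print(in2reg, reg2out)
```

The model only moves slack from one side of the block to the other; the remaining -0.5 nsec on the output side still has to be recovered by optimization.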
Important: The inter-block useful skew technique described above is not meant to be applied to the block interfaces of the top-level design. It is useful for the individual partitions.
Clock Tree Synthesis
The CTS methodology is a little different for the top-level design and for the individual block partitions.
Block-level CTS: every partition clock is built to the stated latency target. PostCTS IO timing, with the intra-block skew margin removed, should not degrade much, as the IOs have already been optimized and timing-closed with the same latency on the clocks.
Top-level CTS: this is a little trickier. All the registers at the top level are built using the normal methodology, but the sink points of the partition clocks are built at their individual source latencies. How these individual source latency values are derived is explained below.
“For every partition, the sum of source latency and virtual clock latency should be made equal to the top-level latency, to avoid any timing violation across the partition because of inter-block skew.”
This technique ensures postCTS timing correlation between all partitions and the top level. There will then be no setup or hold violations due to inter-block skew when we check flat timing after merging all designs.
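The balancing rule quoted above can be written as a small Python helper that derives the source latency target for each partition clock pin. The block names and latency values are hypothetical; the only relation used is source latency = top-level latency - virtual clock latency.

```python
def source_latency_targets(top_level_latency, virtual_latencies):
    """Derive the source latency for each partition so that
    source latency + virtual clock latency equals the top-level latency."""
    targets = {}
    for block, virtual in virtual_latencies.items():
        if virtual > top_level_latency:
            raise ValueError(
                f"{block}: virtual latency exceeds the top-level latency"
            )
        targets[block] = top_level_latency - virtual
    return targets


if __name__ == "__main__":
    # Two hypothetical partitions closed at different virtual latencies (nsec).
    print(source_latency_targets(6.0, {"blk_a": 4.0, "blk_b": 5.0}))
```

A partition closed at a larger virtual latency therefore gets a shorter source latency from the top-level clock tree.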
Assembling and Re-evaluating.
At any stage of the design, we can assemble the design to refine the budgeting constraints and, if required, we can also repartition the complete design to improve the location of boundary pins.
If we only want to improve the timing budgets, we can simply assemble the design in timer mode, with the .v and .spef of the top level and the partitions. If we also want to improve pin locations, we can use the assembleDesign command to reassemble the design, followed by one optimization run using Active Logic Views. The design is then repartitioned to get better pin placement.
These activities can be performed independently, and the improved constraints, lib models and new partition DEF (with improved pin locations), along with the new LEF, can be fed in incrementally at any stage of the design.
The lib model and constraints should always be kept in sync for successful flat design closure.
Note: While re-evaluating the budgets, use -noTrialIPO (with deriveTimingBudget) to avoid optimization.
Post-CTS Lib Generation:
After CTS, the previously used partition .lib models cannot be used at the top level, as they do not contain clock latency arcs. To generate models that do, postCTS time budgeting has to be done again after combining the postCTS .v and .spef of all the designs, and the extra option “-includeLatency” has to be used with the deriveTimingBudget command.
At the same time, if the partitions are timing-closed and no more changes are expected, ILM models can also be used to see more accurate timing. This is discussed in detail later under “Type of models and Usage”.
Hold Closure Strategy
A hold strategy is important for closing boundary hold violations, as we should not end up inserting too many unwanted buffers at the boundaries.
The following are progressively improved ways of fixing boundary hold, in the order in which they are listed.
A) One of the early strategies for closing hold was to apply zero input and output delays at the boundary ports, so that all the interfaces are closed without considering the logic.
B) Later this was improved, and either the input or the output of every partition is used for hold fixing. For example, if the input-to-register violations of the first partition are fixed with zero input delay, then the complete register-to-register path, seen from the top level, will be hold-timing closed.
C) Hold budgeting: the most accurate and least pessimistic approach.
Using the time budgeting flow, a min library for every partition, along with min IO delay constraints for every corner, can be developed with accurate information that depends on the logic. This is certainly more realistic, and it saves unnecessary buffer additions at the interface.
The flow usage is as follows:
The option “-setupHold” should be used with the deriveTimingBudget command.
All the min delays are also updated in the constraints, and separate max and min libraries are developed, to be used for setup and hold respectively.
Usually, inter-block interfaces are not hold-timing critical, so a judgment must be made before starting this activity. In that case, the flat top-level design can be analyzed for hold, and manual hold closure can also be advantageous.
There can be two types of noise on an individual partition: intra-block noise and inter-block noise. Noise depends heavily on the routing strategy and is not very predictable from one design to another. Even if the same design is implemented with different approaches to placement and congestion-aware routing, noise results can differ and are tough to predict. Normally, we speak about noise and conclude results only after deciding all the placement and routing parameters.
With noise, a register-to-register timing path can also be affected by an aggressor sitting outside the partition. But “how often?” is the question we are more interested in.
Flat noise estimation = Intra block noise (X %) + Inter block noise (Y %)
From the partition layer information, we can estimate what percentage of the noise effect will be inter-block and what percentage intra-block.
For example, a block partition consuming 5 metal layers out of 7 will have more inter-block noise compared to a partition that consumes all 7 metal layers.
Depending on the partition information, we can distribute the noise margins between inter-block and intra-block noise. The partition can be closed independently considering intra-block noise; later, flat noise runs can be done to resolve inter-block noise.
We also have the option of using XILMs, which are crosstalk-aware partition models that can be used to consider inter-block noise at the top level, but the use of such options is still questionable given their development status.
Also, if the designs are not big and running flat noise analysis is not a problem, the complete noise can be resolved in one go, subject to all partitions being available with routing at the same time.
Type of models and Usage:
A model, as we have understood, is a replica of the interface timing of a partition, which provides timing information to the top-level design for independent closure.
We have different choices of model depending on availability and accuracy. But we should understand their usage, as every model has its own strengths and cannot be replaced by another.
Two terms, “Target” and “Result”, help define the models. Target-type models are generated based on the proposed timing for the partition, whereas Result-type models are generated once the blocks are closed to that target.
The available models are:
Quick Timing Model (also known as prototype model): These are target-based models, aligned with the proposed timing conditions of the interface. For example, if the time period of a simple interface path is 10 nsec and the input delay on the interface is 6 nsec, the targeted arc for the partition would be 4 nsec (ignoring setup and other margins). These predicted models are used for early closure of the top level, independently of the other partitions.
Interface Logic Model (ILM): These are result-based timing models, which consist of the .v and .spef of the interface of the design. These models neither budget the interface nor generate any target; they simply reflect the current state of the interface timing to the top level.
They are certainly more accurate, but should be used only when the block (partition) level interface timing is closed. For now, the scope of ILMs is limited to timing evaluation, and support for optimization is still in progress.
Extracted Timing Models (ETM): To support top-level optimization with timing-closed partition timing models, ETMs are generated. An ETM is an exact copy of the existing interface timing, but in the form of a .lib model. It can be used for optimization.
With the increasing development of ILM-based optimization, the usage of ETMs will reduce with time, and ILMs will be the models of greatest use.
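The target arc of a quick timing model, as in the 10 nsec and 6 nsec example above, is simple arithmetic that can be sketched in Python. The setup and margin parameters are optional additions for illustration and default to zero, matching the example's "ignoring setup and other margins".

```python
def target_arc(period_ns, io_delay_ns, setup_ns=0.0, margin_ns=0.0):
    """Time left for the partition's interface arc after the external IO
    delay (and, optionally, setup time and extra margin) is subtracted."""
    arc = period_ns - io_delay_ns - setup_ns - margin_ns
    if arc < 0:
        raise ValueError("IO delay and margins exceed the clock period")
    return arc


if __name__ == "__main__":
    print(target_arc(10.0, 6.0))  # the example from the text
```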
With the increasing size of SoCs and the many functionalities they cover, the need for partitioning has grown greatly. But opting for partitioning is not always beneficial; complete analysis and planning are required to justify it. The partitioning strategy, which claims to “divide and rule”, also brings many complexities into the design flow. It is very important to understand the technique and handle all the special cases, in order to reduce turnaround time and achieve efficient timing closure. In this article we have described the complete flow, including all the intermediate steps. With this, it should become easier for all domains working together to resolve the partitioning challenges.
About the author:
Ateet Mishra is a Senior Design Engineer at Freescale Semiconductor, India. He has 6 years of industry experience in various fields of VLSI, such as Static Timing Analysis, Physical design and Synthesis. He has been associated with Freescale since the beginning of his career and has successfully taped out multiple SoCs. Contact him at R65850@freescale.com