Hyper-Registers can achieve 2X or more core performance compared to previous generations of high-end FPGAs. To achieve this enhanced performance, you must optimize your designs with Hyper-Retiming, Hyper-Pipelining, and Hyper-Optimization.
In this application note, retiming refers to moving the physical location of existing registers in a design to balance the propagation delay between the registers. Retiming also performs sequential optimizations by moving registers backwards and forwards across combinatorial logic. Retiming across node splits and merges may involve register duplications or merges. By balancing the propagation delays between each stage in a series of registers, the retiming process shortens the critical paths, reduces the clock period, and increases the frequency of operation.
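As a rough illustration of the balancing idea, the following Python sketch models one register chain as a list of element delays and searches for the register positions that minimize the longest stage. The delay values, the function names, and the exhaustive search are assumptions made for illustration; they do not describe how the Quartus II retimer is implemented.

```python
from itertools import combinations


def stage_delays(delays, cuts):
    """Sum the element delays between consecutive register positions."""
    bounds = [0, *cuts, len(delays)]
    return [sum(delays[a:b]) for a, b in zip(bounds, bounds[1:])]


def best_retiming(delays, stages):
    """Try every placement of the interior registers and return
    (cuts, worst_stage_delay) for the placement with the smallest worst stage."""
    best = None
    for cuts in combinations(range(1, len(delays)), stages - 1):
        worst = max(stage_delays(delays, cuts))
        if best is None or worst < best[1]:
            best = (cuts, worst)
    return best


# Hypothetical delays (in ps) of the logic and routing elements along one chain.
delays_ps = [2000, 3000, 1000, 1000, 1000, 1000, 2000, 2000]
original_cuts = (2, 4, 6)  # registers as first placed: stages of 5, 2, 2 and 4 ns

before = max(stage_delays(delays_ps, original_cuts))
cuts, after = best_retiming(delays_ps, stages=4)

# A period in ps converts to a frequency in MHz as 1e6 / period.
print(f"before retiming: {before} ps -> {1e6 / before:.0f} MHz")
print(f"after retiming:  {after} ps -> {1e6 / after:.0f} MHz")
```

With these made-up delays, the worst stage drops from 5 ns to 4 ns, which corresponds to 200 MHz and 250 MHz. The register count stays the same; only the register positions change.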
In the HyperFlex architecture, Hyper-Retiming uses the Hyper-Registers that are available in both the interconnect routing and at the inputs of all major functional blocks. There are a few restrictions that can prevent registers from being moved during Hyper-Retiming, such as asynchronous clears, cross-clock boundaries, and I/O ports. To achieve the maximum performance gain from Hyper-Retiming, you must modify your RTL to allow register movement. The Quartus® II software includes tools to easily identify the restrictions that must be removed to take maximum advantage of the performance gains available from the HyperFlex architecture.
In conventional retiming, the design software tries to improve timing by rerouting to an unused ALM that is near the ALM being used. Conventional retiming is limited by the availability of an unused ALM nearby and the additional delay incurred by routing to the unused ALM. Because additional ALMs and routing resources are used, conventional retiming requires an incremental placement and routing process.
In the new Hyper-Aware CAD flow in the Quartus II software, the Fitter is aware of the retiming optimizations that the Hyper-Retimer can perform later. This retiming prediction allows the Fitter to prioritize the critical paths to reflect the future retiming step. The critical paths that can be fixed by retiming are de-emphasized in favor of paths that cannot be retimed. Initially, Place and Route focuses on the paths that cannot be retimed. The Hyper-Retimer then re-allocates ALM registers to Hyper-Registers that are in optimal locations for balancing the slack in a register chain. Because the Hyper-Registers are present in the interconnect routing, routing congestion is reduced, which makes timing closure easier.
In both circuits, the longest path is the distance between the two registers on the right. The path is highlighted in red. Traditional retimers would consider the paths between the two registers on the right similarly critical in both circuits. However, the Hyper-Aware CAD flow considers the entire series of registers. The Fitter understands that the Hyper-Retimer will balance the delays between the series of registers in the top circuit, exposing the more critical circuit on the bottom.
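The difference between the two views can be sketched in a few lines of Python. The stage delays below are hypothetical, and ranking a chain by its average stage delay is an assumed stand-in for the Fitter's retiming prediction, not the actual cost function.

```python
def traditional_criticality(stage_delays_ns):
    """A traditional flow ranks a chain by its single longest
    register-to-register path."""
    return max(stage_delays_ns)


def hyper_aware_criticality(stage_delays_ns):
    """Assumed model: the retimer can balance the chain, so rank it by the
    stage delay it would have after perfect balancing."""
    return sum(stage_delays_ns) / len(stage_delays_ns)


# Hypothetical stage delays in ns; both chains share a 4 ns longest path.
top_chain = [1.0, 1.0, 4.0]     # neighbouring stages have plenty of slack
bottom_chain = [3.5, 3.5, 4.0]  # neighbouring stages are almost as slow

for name, chain in (("top", top_chain), ("bottom", bottom_chain)):
    print(name,
          traditional_criticality(chain),
          round(hyper_aware_criticality(chain), 2))
```

Both chains look equally critical when only the longest path is considered, but the balanced view ranks the bottom chain as the one that needs the Fitter's attention.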
In the conventional flow, because the retimer operates before the Fitter, it predicts the placement and routing without knowing the critical paths:
- Static Timing Analyzer
In Hyper-Retiming, the retimer operates after the Fitter:
- Static Timing Analyzer
Because the Hyper-Retimer operates after the Fitter, it knows all the circuit delays and critical paths. The Hyper-Retimer places the Hyper-Registers in better locations to increase performance on the placed and routed design. It also disables the registers previously allocated by the Fitter. In Hyper-Retiming, there are no place and route fix-up changes and no issues with delay or congestion prediction as there are with traditional retiming.
Use the following baseline Quartus II Settings File (QSF) assignments to add
the Stratix 10 family (the S10_EARLY_ACCESS device is an Arria 10 periphery with
a Stratix 10-like core) to your project:
- set_global_assignment -name FAMILY arria10
- set_global_assignment -name DEVICE S10_EARLY_ACCESS
The fast corner timing model is currently not available. To bypass the fast corner timing
model, disable multicorner timing analysis with the following QSF assignments:
- set_global_assignment -name OPTIMIZE_MULTI_CORNER_TIMING OFF
- set_global_assignment -name TIMEQUEST_MULTICORNER_ANALYSIS OFF
Use the following QSF assignment to enable the Hyper-Retimer:
- set_global_assignment -name HYPER_RETIMER ON
Before enabling the Hyper-Aware CAD flow, you must enable the Hyper-Retimer. In the Hyper-Aware CAD flow, the place and route tool is aware of the retiming optimizations, so it can better prioritize critical paths. For more information, refer to the Hyper-Aware CAD Flow section. Use the following QSF assignment to enable the Hyper-Aware CAD flow:
- set_global_assignment -name HYPER_AWARE_OPTIMIZE_TIMING ON (default)
The following options are useful for the Hyper-Aware CAD flow. Refer to the Hyper-Aware CAD Flow section for details.
A one-time setup process is required to enable the HyperFlex compilation flow. If you cannot see the Hyper-Retimer option in the Tasks window, perform the following:
- In the Tasks window, click Customize.
Figure 3. Tasks Window with Customize Button
- Click New.
- Create a new custom flow named HyperFlex Compilation based on the existing Compilation flow.
- In the resulting pop-up window, check HyperRetimer as well as its sub-task.
Figure 4. HyperRetimer Option
- Click OK on the next two windows. The HyperRetimer option appears under compilation tasks.
To compile your design, perform one of the following tasks:
- In the Compilation Flow window, click the compile icon to run the Mapper, Fitter, Hyper-Retimer, and TimeQuest Timing Analyzer.
- Enter the following command in the command line:
  - quartus_sh --flow compile <project>
The HyperFlex Compilation flow runs an extra “Hyper-Retiming” step. To run only the Hyper-Retimer step, enter the following command:
- quartus_sta -retime <project>
After compilation, a <project>.rtm.rpt file is generated. You can also view this report in the Compilation Report window, under Table of Contents > Hyper-Retimer.
CAD Flow Stage | Major QSF Assignments Affecting the Flow
Analysis and Synthesis | HYPER_RETIMER assignment enables the following features:
Place and Route | HYPER_RETIMER assignment enables the following features:
The Hyper-Retimer provides details about fMAX and the Retiming Limit as shown in the following figures:
The Retiming Limit Details report provides the following information for each clock domain:
- Reason why the retiming process cannot further improve performance
- Technical details about the critical chain limiting the retiming process
- Recommendations for resolving the critical chain
The fMAX seen in the results can be used as a baseline to compare the performance improvements that the Hyper-Retimer is theoretically capable of performing. To realize these performance improvements, you will likely need to make the recommended RTL changes in your design to remove retiming restrictions. Before making any changes, you can explore the performance capabilities of a design by using Fast Forward Compile.
The Place and Route tool in the Hyper-Aware CAD flow takes into account that the Hyper-Retimer will run later, and prioritizes the critical paths on that assumption. The Fitter de-emphasizes paths that it predicts the retimer can fix and focuses on paths that can’t be retimed (such as loops). The Hyper-Aware CAD flow uses the register delays predicted by the retimer to adjust timing in the place and route tool and to expose the real critical paths that cannot be fixed by the retimer. You can modify your RTL design to add more registers after exploring the areas where registers can help improve performance.
Retiming registers that are close to each other can potentially trigger hold violations at higher speeds. This situation is reported as a short path in the retiming report under Path Info. Short paths are also reported if not enough Hyper-Registers are available. When nodes involve both a short path and a long path, adding pipeline registers to both paths can help with retiming. The Hyper-Aware CAD flow provides additional options to balance the latency of longer register chains that result from added pipeline registers and to avoid short path issues.
The QSF setting to enable the Hyper-Aware CAD flow is:
set_global_assignment -name HYPER_AWARE_OPTIMIZE_TIMING ON
To make the place and route tool aware of these short paths, use the following QSF assignment to enable short path optimization while realizing the results:
set_global_assignment -name HYPER_AWARE_OPTIMIZE_SHORT_PATHS ON
The following figure shows how a short path can limit retiming. In this example, forward retiming pushes a register onto two paths, but one path has an available register for retiming, while the other does not.
In the circuit on the left, if register #1 is to be retimed forward, the top path has an available slot. However, the lower path can’t accept a retimed register because it would be too close to an adjacent register already being used, causing hold time violations. If the place and route tool is aware of these short paths (HYPER_AWARE_OPTIMIZE_SHORT_PATHS ON), then it routes the registers to longer paths, as shown in the circuit on the right. This practice ensures that sufficient slots are available for retiming. This feature interacts with the Fast Forward Compile feature.
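The short path limitation can be pictured with a toy model. In the Python sketch below, each branch after the split is reduced to a single delay value, and a branch can accept a retimed register only if enough delay remains between the new register position and the next register in use. The constants, the branch delays, and the function names are all hypothetical; the real router and retimer work on the placed and routed netlist.

```python
HOP_PS = 100       # assumed delay of one routing wire
MIN_HOLD_PS = 100  # assumed minimum delay required after the moved register


def branch_accepts_register(branch_delay_ps):
    """True if a register moved one hop forward still leaves enough delay
    before the next in-use register to avoid a hold violation."""
    return branch_delay_ps - HOP_PS >= MIN_HOLD_PS


def can_retime_forward(branch_delays_ps):
    """Forward retiming across a split needs a legal slot on every branch."""
    return all(branch_accepts_register(d) for d in branch_delays_ps)


def lengthen_short_paths(branch_delays_ps):
    """Mimic short-path-aware routing: detour any branch that is too short."""
    needed = HOP_PS + MIN_HOLD_PS
    return [max(d, needed) for d in branch_delays_ps]


original = [600, 100]  # top path is long, lower path is short
rerouted = lengthen_short_paths(original)

print("original routing can retime:", can_retime_forward(original))
print("rerouted paths can retime:  ", can_retime_forward(rerouted))
```

In this model the single short branch blocks the move for the whole split, which is why routing the short branch onto a longer path frees the register to move.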
The following two examples use the QSF assignment to address short paths:
Case 1: A design works at 400 MHz and the Fast Forward Compile report recommends adding a pipeline stage to reach 500 MHz and a second pipeline stage to achieve 600 MHz performance.
At that point, the limiting reason is the short path / long path. Using the assignment HYPER_AWARE_OPTIMIZE_SHORT_PATHS before adding the two stages does not help. In this case, first add the recommended two-stage pipelining to reach 600 MHz performance. Then if the limiting reason is short path / long path, setting HYPER_AWARE_OPTIMIZE_SHORT_PATHS ON triggers the router to fix the short paths in the design.
Case 2: A design works at 400 MHz and the Fast Forward Compile report does not make any recommendations to add pipeline stages.
If the short path / long path is the immediate limiting reason for retiming, turn on HYPER_AWARE_OPTIMIZE_SHORT_PATHS to fix the short path / long path issue. The router will fix the short paths in the design.
By default, the Fitter tries to optimize the existing design. In the default mode, the Fitter is not aware of any recommendations from Fast Forward Compile. If the Fast Forward Compile recommends adding pipeline registers, you should add the registers first (or alternatively, add assignments to indicate that the registers will be added). Then, enabling HYPER_AWARE_OPTIMIZE_SHORT_PATHS helps the design. If the short path is due to a structure in the design, the design might need to be optimized.
Alternatively, turn on HYPER_AWARE_OPTIMIZE_SHORT_PATHS when using the following QSF assignment in Fast Forward Compile:
This setting allows registers to be added so that the limiting paths at 400 MHz and 500 MHz can be optimized to enable the Fitter to see the short path / long path limit at 600 MHz.
Before retiming, registers occupy ALM locations. When these registers are retimed into Hyper-Registers, there are density and performance implications on the previously occupied ALM registers. The flipflop location cannot be used by other logic, and a routing penalty is incurred to route to the flipflop. To address this, there is a setting that temporarily removes the ALM registers that are to be placed in the Hyper-Registers prior to Place & Route, and restores them after retiming. This is particularly useful after registers have been added (such as for pipelining). Turn this setting ON for pipelined designs to see the area benefit from the Hyper-Registers, and a potential performance benefit.
set_global_assignment -name HYPER_AWARE_OPTIMIZE_REGISTER_CHAINS ON
The Summary section in the retimer report shows how many ALM registers removed from the Fitter are in Hyper-Registers. The number removed from the Fitter is usually different than the number added in the retimer. A register removed that was driving multiple blocks may be added back to the design as multiple separate registers driving multiple blocks in the retimer.
Initially, the Fitter checks whether the number of registers in the design is more than the physical ALM registers. If HYPER_AWARE_OPTIMIZE_REGISTER_CHAINS is ON, the Fitter takes into account registers that are good candidates for Hyper-Registers. Next, the remaining registers are placed in the physical ALMs. Therefore, a design can have more RTL registers than there are ALM registers and still fit into a device.
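The capacity check described above can be summarized in a short Python sketch. The register counts are hypothetical, and the single subtraction is an assumed simplification of what the Fitter does.

```python
def design_fits(rtl_registers, alm_registers, hyper_register_candidates,
                optimize_register_chains):
    """Assumed simplification of the Fitter's register capacity check.

    With HYPER_AWARE_OPTIMIZE_REGISTER_CHAINS ON, registers that are good
    Hyper-Register candidates are set aside before the rest go into ALMs.
    """
    absorbed = hyper_register_candidates if optimize_register_chains else 0
    return rtl_registers - absorbed <= alm_registers


# Hypothetical counts: more RTL registers than physical ALM registers.
print(design_fits(120_000, 100_000, 30_000, optimize_register_chains=False))
print(design_fits(120_000, 100_000, 30_000, optimize_register_chains=True))
```

The second call succeeds because 30,000 of the 120,000 registers are assumed to land in Hyper-Registers, leaving 90,000 for the 100,000 available ALM registers.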
The Hyper-Aware CAD flow assumes that the Hyper-Retimer is going to run. If the Hyper-Aware CAD flow is enabled (it is enabled by default), run the Hyper-Retimer. The compilation results may be poor if the Hyper-Retimer is not run, because the early assumptions made by place and route are then never realized.
Similarly, if the HYPER_AWARE_OPTIMIZE_SHORT_PATHS and HYPER_AWARE_OPTIMIZE_REGISTER_CHAINS settings are enabled, run the Hyper-Retimer. Ensure that HYPER_AWARE_OPTIMIZE_TIMING is turned on while running the Hyper-Retimer.
By accurately predicting retimer delays early on, the Hyper-Aware CAD flow improves fMAX at the cost of an increase in Fitter compile time. With the default basic HYPER_AWARE_OPTIMIZE_TIMING setting, current results show an average fMAX improvement of 7% for about a 30% increase in Fitter compile time.
The Hyper-Retimer has a Fast Forward Compile feature that is based on the Hyper-Aware CAD flow.
Fast Forward Compile allows you to explore the performance of your design. Fast Forward Compile analyzes how fast a design can run with the HyperFlex architecture by virtually implementing Hyper-Retiming and Hyper-Pipelining. It also identifies bottlenecks for Hyper-Optimization. Fast Forward Compile assumes that the changes required for optimization can be made without making any assumptions about their impact on the design. These changes are suggested in the retiming report. Based on the report, you can decide how to implement these changes in a functionally valid way.
With Fast Forward Compile, you can see where design changes can be made to improve performance before any actual modifications are made.
The QSF assignment to enable Fast Forward Compile is:
set_global_assignment -name HYPER_RETIMER_FAST_FORWARD ON
If a design has not met its target speed, run the retimer with this setting to see the potential performance improvement. The Mapper and Fitter optimize the existing design. The retimer provides guidelines for performance improvements and detailed recommendations, such as retiming, pipelining, or RTL optimizations, that can be made to the design to realize speed improvements. Note that some Fast Forward options, namely those that assume retiming restrictions can be removed or pipeline stages can be added, do require rerunning the Mapper and Fitter. Refer to the annotations in Table 1.
Use options to control Fast Forward Compile assumptions, such as those for asynchronous clears and user-preserve pragmas. The HYPER_RETIMER_FAST_FORWARD_ASYNCH_CLEAR setting has options that allow the retimer to handle asynchronous clears in different ways. The HYPER_RETIMER_FAST_FORWARD_USER_PRESERVE_RESTRICTION setting has the options REMOVE and PRESERVE, which tell the retimer whether to assume that all your preserves are removed or kept. Accepted recommendations are executed, while rejected ones are not. If no new recommendations are accepted, you do not need to rerun the Mapper and Fitter; just run the retimer with Fast Forward enabled to show the potential speed improvements.
A QSF assignment in the Fast Forward Compile directs the Fitter to optimize a design that assumes all asynchronous clears can be removed and unlimited pipeline stages can be added. This allows the retimer to provide the best estimated fMAX and to report the final limiting critical path (shown as the Hyper-Optimization step). If this QSF assignment is used, the retimer bypasses incremental step by step limitations and recommendations. This is useful to get a high level measure of the ultimate fMAX so that you can see how much faster the design can be without any advanced optimizations. The optional QSF assignment for this type of compilation is:
set_global_assignment -name HYPER_RETIMER_FAST_FORWARD_TARGET_MAX_PERFORMANCE ON
Use this QSF assignment together with HYPER_AWARE_OPTIMIZE_SHORT_PATHS set to ON. Targeting maximum performance enables the Fitter to virtually add pipeline registers, which can create short paths. These short paths are offset by the HYPER_AWARE_OPTIMIZE_SHORT_PATHS QSF assignment.
set_global_assignment -name HYPER_AWARE_OPTIMIZE_SHORT_PATHS ON
Compile the design to rerun the Mapper and the Fitter with these two settings to get the ultimate fMAX of your design. Although the step-by-step recommendations to achieve incremental results are not reported, the Fast Forward Details still reports the Fast Forward Optimizations Applied to provide information on what it virtually implemented (asynchronous clears removed and pipe stages added). You must realize these recommendations to reach the target fMAX.
Not all recommendations can be implemented. Fast Forward Compile allows you to perform what-if analyses such as:
- What if certain resets were converted from asynchronous to synchronous?
- What if the resets were removed?
- What if certain asynchronous clears were left untouched and pipeline stages were added somewhere?
The results are available from the GUI and in the <project>.rtm.rpt file. The report provides information about Hyper-Register usage and detailed recommendations to achieve incremental improvements.
In this case, 813 (about 31%) of the registers were moved from ALMs to Hyper-Registers. The number of Hyper-Registers actually used is greater than the number of registers moved from the ALMs. This occurs because a register that drives multiple blocks in Place and Route may be removed from the ALM and added back to the design as multiple separate registers driving the multiple blocks during retiming.
The clock fMAX summary report shows the fMAX increase that can be achieved with Hyper-Retiming. In this design, the fMAX has increased from 416.32 MHz to 616.14 MHz.
The fMAX report also shows what fMAX can be achieved with Hyper-Retiming and lists recommendations to optimize the design. For information on the Hyper-Pipelining and Hyper-Optimizing entries, refer to AN715: Hyper-Pipelining for Stratix 10 Designs and AN716: Hyper-Optimization for Stratix 10 Designs. With Fast Forward ON, the fMAX represents the netlist at its highest performance, for each step (or type of design change). The restricted fMAX is 1 GHz, which is the maximum frequency supported by Stratix 10 devices, even if the report shows a higher achievable speed.
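The arithmetic behind these two report values can be checked with a short Python sketch. The 416.32 MHz and 616.14 MHz figures and the 1 GHz ceiling come from the text; the 1250 MHz value is a hypothetical report entry used only to show the cap.

```python
STRATIX10_MAX_MHZ = 1000.0  # 1 GHz ceiling stated in the text


def restricted_fmax(reported_mhz):
    """Cap a reported fMAX at the maximum frequency the device supports."""
    return min(reported_mhz, STRATIX10_MAX_MHZ)


def gain_percent(base_mhz, new_mhz):
    """Relative fMAX improvement over the base performance, in percent."""
    return (new_mhz / base_mhz - 1.0) * 100.0


print(f"Hyper-Retiming gain: {gain_percent(416.32, 616.14):.1f}%")
print("restricted fMAX for 616.14 MHz:", restricted_fmax(616.14))
print("restricted fMAX for a hypothetical 1250 MHz:", restricted_fmax(1250.0))
```

For the example design, the step from 416.32 MHz to 616.14 MHz is an improvement of about 48%.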
The next section of the report shows the Fast Forward Summary for all clock domains in a design (in this example, there is only one), and the breakdown of the steps needed for each stage of performance increase per clock domain. This example focuses on step #1, removing asynchronous clears and restrictions to retiming existing registers only.
These results can also be examined in the Quartus II Compilation Report, organized in various windows. You can open the Compilation Report by clicking the Compilation Report icon in the toolbar. Go to Hyper-Retimer > Contents Panel Table and select Fast Forward Details for Clock Domain <clk>. In the Step column, select Base Performance to bring up the following window:
The Recommendation tab gives information about the changes needed to achieve the next incremental step in performance. The information addresses the register chain with the worst slack. In this example, the report lists asynchronous clear ports at both dout_reg and din_reg as restrictions to retiming. Note that these nodes are part of a bus and only the nodes in a chain of registers with the worst slack are listed here.
To analyze more detailed results and recommendations of a Fast Forward Compile, in the Fast Forward Summary for a given Clock Domain window, select the next steps in the table. For the topic of retiming, the focus is mostly on Fast Forward Step #1. This first step lists asynchronous clears that are hypothetically removed to achieve an fMAX of 616.14 MHz. The Fast Forward Optimizations Applied in Clock Domain column shows how many changes the Hyper-Retimer assumes it can make to the design to achieve the fMAX. In this example, a detailed list of 1341 registers with their asynchronous clears hypothetically removed is listed below. It provides detailed recommendations about design changes to be made to achieve the stated performance:
After virtually removing the asynchronous resets on these registers, the Recommendations for Critical Chain tab shows the next incremental gating item to retiming. The next recommendation is to add a pipeline stage in specific paths.
The Fast Forward Summary for a Clock Domain always ends with Hyper-Optimization as the last step.
The Fast Forward compilation shows how much performance you can gain if the Hyper-Retimer is free to move registers. The Base Performance fMAX is the result with the Fast Forward QSF assignment turned OFF.
For more information about the Fast Forward steps after #1, refer to the Hyper-Pipelining for Stratix 10 Designs and Hyper-Optimization for Stratix 10 Designs application notes.
A Post-Hyper-Retimer Netlist Viewer displays the design as if all Fast Forward compilation recommendations were followed. This feature is currently enabled by the following .ini in the Early Access release:
The Viewer shows the Hyper-Registers and bypassed ALMs used in a design.
For comparison, refer to the following screenshots of an example design. The screenshot on the left in the first figure is before retiming in the RTL viewer. The screenshot on the right shows the Post-Fit Viewer. The next screenshot is after retiming with the Post-Hyper-Retimer Viewer.
This design has four banks of registers, a wide AND gate, and two output register stages. The bottom screenshot shows the second stage of the register banks retimed forward into the first bank of Hyper-Registers. The third and fourth stages are retimed forward across AND gates (shown in purple) into Hyper-Registers. The first output stage is moved to a Hyper-Register (the right-most register shown in pink).
The following assignments apply different scenarios to asynchronous clears when using Fast Forward Compile:
set_global_assignment -name HYPER_RETIMER_FAST_FORWARD_ASYNCH_CLEAR REMOVE
set_global_assignment -name HYPER_RETIMER_FAST_FORWARD_ASYNCH_CLEAR PRESERVE
set_global_assignment -name HYPER_RETIMER_FAST_FORWARD_ASYNCH_CLEAR CONVERT
set_global_assignment -name HYPER_RETIMER_FAST_FORWARD_ASYNCH_CLEAR AUTO (default)
The REMOVE option forces the Hyper-Aware CAD flow to assume that the asynchronous clears are removed during place and route. It also forces the retimer to automatically assume that the registers can be moved in Base performance mode.
The PRESERVE option forces the Hyper-Aware CAD flow to not assume any movement of the asynchronous clears during place and route. It also forces the retimer to assume that the registers will not be removed and that it should explore alternative ways to improve fMAX.
The CONVERT option forces the Hyper-Aware CAD flow to convert the asynchronous clears to synchronous clears.
The AUTO option preserves asynchronous clears until a performance limit is reached. It then removes asynchronous clears. This is the default option.
If none of the REMOVE, PRESERVE, or CONVERT options is specified, the default AUTO behavior applies: the Hyper-Aware CAD flow does not assume that the asynchronous clears can move during Place & Route. However, the retimer explores these moves in Fast Forward Compile. The assignments can be applied on a global, entity, or instance basis. Refer to the Reset Requirements for Retiming section for reset strategies for maximizing performance.
The REMOVE and CONVERT options affect how Place & Route sees the asynchronous clears, and slightly changes how the recommendations are reported. An example showing a first (baseline) compile is shown in the Baseline Results section. The first recommendation is to fix the retiming restriction at Register dout_reg because it uses an asynchronous clear.
The Fast Forward Optimizations Applied section in the report lists all the removed asynchronous clears. The next step is to add pipeline stages. Use the REMOVE or CONVERT option to compare the fMAX results before making RTL changes. This is the ideal fMAX. In reality, although many asynchronous clears can be converted, especially the ones around the critical paths, some must be preserved and can’t be removed or converted.
Compiling with the PRESERVE option assumes that asynchronous clears cannot be removed even in a Fast Forward Compile. The Fast Forward Compile report will not recommend removing the asynchronous clears that cannot be removed.
To realize the exploration results of Fast Forward Compile, you must change the design according to the recommendations in the report. The amount and type of changes required depend on the target speed. If the reports show that the target fMAX can be achieved by just removing or converting the asynchronous clears, then that may be sufficient. Asynchronous clear modifications can be done in the following ways:
- Tying the clears to inactive
- Converting the clears to synchronous
- Removing the clears altogether
Refer to the Reset Requirements for Retiming section for details about reset strategies for HyperFlex-friendly architecture. After the RTL source code edits are made, turn Fast Forward OFF with the following QSF assignment.
set_global_assignment -name HYPER_RETIMER_FAST_FORWARD OFF
Recompile the design and analyze the results. Check the Clock fMAX Summary report, and note the new fMAX with the code changes. With Fast Forward OFF, the fMAX is the result with the netlist as-is, and there is no speculation. If the target fMAX is not reached, then re-enable the Fast Forward compiler to find other opportunities for optimization:
set_global_assignment -name HYPER_RETIMER_FAST_FORWARD ON
Apply the recommended changes at a broader level. For example, if a recommendation is to remove the asynchronous clear to a register in a multi-bit bus, make the changes across the width of the bus. Doing so can reduce the need for multiple iterations.
The process of exploration followed by realization can be repeated if the target fMAX is not yet achieved. The final revision of the design should have the Fast Forward Compile turned OFF when generating device programming files. Programmer Object Files (POFs) cannot be generated with Fast Forward Compile ON.
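The explore and realize phases differ only in a few QSF lines, which the following Python sketch makes explicit. The assignment names are the ones used in this application note; the helper function, its phase names, and the choice to emit the asynchronous clear option only while exploring are assumptions made for illustration.

```python
VALID_ASYNCH_CLEAR = ("REMOVE", "PRESERVE", "CONVERT", "AUTO")


def qsf_lines(phase, asynch_clear="AUTO"):
    """Return the QSF assignments for one phase of the iteration.

    phase is "explore" (Fast Forward ON) or "realize" (Fast Forward OFF).
    """
    if phase not in ("explore", "realize"):
        raise ValueError(f"unknown phase: {phase}")
    if asynch_clear not in VALID_ASYNCH_CLEAR:
        raise ValueError(f"unknown asynchronous clear option: {asynch_clear}")

    lines = ["set_global_assignment -name HYPER_RETIMER ON"]
    if phase == "explore":
        lines.append("set_global_assignment -name HYPER_RETIMER_FAST_FORWARD ON")
        lines.append(
            "set_global_assignment -name "
            f"HYPER_RETIMER_FAST_FORWARD_ASYNCH_CLEAR {asynch_clear}"
        )
    else:
        # Programming files cannot be generated with Fast Forward ON.
        lines.append("set_global_assignment -name HYPER_RETIMER_FAST_FORWARD OFF")
    return lines


for phase in ("explore", "realize"):
    print(f"# {phase}")
    print("\n".join(qsf_lines(phase, "REMOVE")))
```

A script like this can regenerate the settings on each pass, so the final revision is always compiled with Fast Forward turned OFF.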
To load the reports into the TimeQuest Report pane, in the Tasks pane, in the HyperRetimer section, select Load Hyper-Retimer Reports.
The retimer can also be launched from inside TimeQuest, under the HyperRetimer task.
Viewing the Hyper-Retimer reports in TimeQuest has the following benefits:
- Integration with netlist viewers
- Easier to perform timing analysis
- More drill-down capabilities at each Fast Forward step
For example, after Hyper-Retiming you can locate a node or paths in the Hyper-Retimer Viewer the same as you would in the Technology Map Viewer (after fitting).
Even with the retimer’s focus on critical chains or series of registers, TimeQuest always reports one critical path. This is the register-to-register path that can’t be made to run any faster. Traditionally, report timing looks like this in TimeQuest:
The worst slack in this example design is -1.233 ns and includes a path from din_reg. This path corresponds to a path in the critical chain as reported in the Fast Forward details for clock domain X, from the last Fast Forward step, before Hyper-Optimization:
The critical path is indicated with Long Path (Critical). The Long Path (Critical) part of a chain is a good candidate for timing optimization because it is longer than other segments in the chain.
For more information on critical chains or Hyper-Optimization, refer to AN716: Hyper-Optimization for Stratix 10 Designs.
The HyperFlex architecture includes registers everywhere. This application note does not describe the architecture details. The basic idea is that a register is available whenever a signal gets onto a new net or enters or exits an ALM, DSP block, or memory block. Hyper-Registers have only a clock and a data port, so they are limited in functionality. However, there are a number of ways the software can emulate control signals.
In the above figure, the top register chain is a normal register chain, with ALM registers used for each location in the chain. The performance of this chain is dictated by the longest single register-to-register path, which is 5 ns. The bottom picture shows this same register chain in the HyperFlex architecture. The endpoints of the chain are in the same location, and the logic and interconnects are all in identical locations, but the registers inside the chain have been moved from ALM registers (rectangles) to Hyper-Registers (circles). Because of the abundance of Hyper-Registers, retiming pushes the registers to a better location, thereby balancing the slack across the entire register chain.
In both cases, it takes five registers to go from input to output. The raw delay across the register chain in both cases is 13 ns. However, the upper register chain runs at 200 MHz while the bottom register chain runs at 307 MHz. With Hyper-Registers, the retiming granularity is reduced to the delay of an individual routing wire, or approximately 100 ps.
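The two frequencies follow directly from the delays quoted above, as the following Python sketch shows. It assumes that five registers bound four register-to-register stages and that the retimer balances the 13 ns perfectly across them; both assumptions are inferred from the numbers, not stated in the text.

```python
def fmax_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


TOTAL_DELAY_NS = 13.0    # raw delay across the chain, from the text
REGISTERS = 5            # registers from input to output, from the text
LONGEST_STAGE_NS = 5.0   # longest path in the ALM-register chain, from the text

stages = REGISTERS - 1                       # assumed: 5 registers bound 4 stages
balanced_stage_ns = TOTAL_DELAY_NS / stages  # assumed: perfect balancing

print(f"ALM registers:   {LONGEST_STAGE_NS} ns -> {fmax_mhz(LONGEST_STAGE_NS):.0f} MHz")
print(f"Hyper-Registers: {balanced_stage_ns} ns -> {fmax_mhz(balanced_stage_ns):.1f} MHz")
```

A balanced stage of 3.25 ns gives about 307.7 MHz, which matches the 307 MHz quoted for the bottom register chain.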
Register chains can vary greatly in size. For example, a design can have about 20 levels of logic, with registers to balance across the entire chain. There are other situations where the chains are very small. Then there are complicated combinations of long chains connected to short chains originating or ending in common nodes. These situations are discussed in more detail in the Hyper-Optimization for Stratix 10 Designs application note.
The Hyper-Retimer reports the critical register chain per clock domain. This contrasts with traditional timing analysis, which reports the critical path between two registers.
The following restrictions limit a register’s movement during retiming to improve slack on a register chain:
- Register has an asynchronous clear
- Register drives an asynchronous signal, such as an asynchronous clear
- Register is marked “don’t touch”
- Register is marked “preserve”
- Register is a clock source
- Register is a partition boundary for Quartus II Incremental Compilation
- Register is in a block type modified by an ECO operation
- Register location is in an unknown block (only flipflops, DDIOs, single-port RAMs, and DSP blocks are currently supported)
- Register is described in the RTL as a latch
- Register location is at an I/O boundary
- The combinatorial node is fed by a special source such as PLL control loop, JTAG, etc.
- Register is driven by a locally routed clock
- Register is an end-point of a timing exception (such as multicycle or false path constraint)
- Register can be retimed around and within a loop, but cannot be pushed into or pulled out of a loop
- Register has either an inverted input or output
- Register is part of a synchronizer chain
- There are multiple period requirements for paths that start or end at the register (that is, cross-clock boundaries) as shown in the following figure
The more complex the timing requirements, the more difficult it is to retime and verify retiming. Some basic examples of these restrictions are:
- SDC constraints such as false paths and multicycle paths
- Asynchronous resets
To address synchronizers, Fast Forward Compile can virtually add pipeline stages at clock domain boundaries. For metastability, the QSF setting ADV_NETLIST_OPT_METASTABLE_REGS determines the number of registers that make up a synchronizer chain. Each device family has a default value, and the assignment can be applied globally or to individual instances. The retimer does not retime these registers. Instead, the Fitter tries to optimize them to reduce delay and so increase the mean time between failures (MTBF).
SDC constraints that limit retiming can often be modified. However, too many SDC constraints can hurt Fast Forward results.
There are techniques that can help you handle asynchronous resets. Refer to the Reset Requirements for Retiming section for more details. Also, when migrating from an existing device to one with a higher speed capability, raising the target clock speed can help you get a better fMAX if the originally required fMAX is easily achieved.
In the following figures, Hyper-Registers are indicated by circles and ALM registers are indicated by rectangles. Used Hyper-Registers and ALM registers are blue, while unused registers are empty.
The following figure has four registers along its register chain: two ALM registers and two Hyper-Registers.
A Hyper-Register contains only data and clock inputs. It does not have any additional control signals such as asynchronous clears or clock enables. An ALM register has access to all the normal logic control signals.
There are places in designs, apart from the register chains, where the logic merges and splits, as shown in the following figure.
A merge is any place where multiple signals come together to create fewer output signals (usually one). A merge signifies a LUT or multiple LUTs. For example, if 10 signals merge, multiple LUTs are required to create that logic. The merge could be drawn with all 10 signals coming to a common point, or with multiple merge points, one for each LUT.
In the following figure, a signal is pushed across a point where multiple signals join into one. There is a critical path coming in from the left and another one going out to the right, so the retimer tries to push these registers out to balance the slack. When it pushes the registers on the left side back through a split point, it must pull a register from every branch of that split and merge them. Likewise, when it pushes the registers on the right forward through the merge point, it must pull a register from every signal branch and merge them. Because merge points are created through LUTs, the design decreases the number of registers used from eight to two.
In the following figure, a split is represented by a single source that has multiple sinks. This usually signifies routing, where a wire segment splits into multiple wire segments to drive multiple end locations. Because Hyper-Registers are abundant in the HyperFlex architecture, pushing registers from one net to many is easy.
In the path on the left, the two registers are very close to each other with plenty of slack. A path coming in from the left, long path A, and a path going out to the right, long path B, both have critical timing. The retimer pushes these two registers out and spreads the extra slack it has in the short path. Pushing the left register back across the merge into long path A requires the register to be duplicated into the other branch. Likewise, pushing the right register forward across the split into long path B requires the register to be duplicated into the two other branches. Retiming, therefore, takes the two original registers and replaces them with five. The Hyper-Retimer can easily do this because of the abundance of Hyper-Registers.
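The register counts in the merge and split examples follow a simple rule, sketched below in Python. The function names and the `branches` parameter are illustrative. The branch counts used (a two-input merge and a three-way split) are taken from the description above; the two four-branch join points used for the eight-to-two case are an assumption, because the text gives only the totals.

```python
def push_forward(kind, branches):
    """(registers before, registers after) for a forward move across a join point."""
    if kind == "merge":
        return (branches, 1)   # one register taken from every input, merged into one
    if kind == "split":
        return (1, branches)   # the register is duplicated onto every output branch
    raise ValueError(f"unknown join point kind: {kind}")


def push_backward(kind, branches):
    """(registers before, registers after) for a backward move across a join point."""
    before, after = push_forward(kind, branches)
    return (after, before)     # the backward move mirrors the forward move


# Two registers become five: left register goes back across a 2-input merge,
# right register goes forward across a 3-way split.
left = push_backward("merge", 2)
right = push_forward("split", 3)
print(left[0] + right[0], "->", left[1] + right[1])

# Eight registers become two, assuming two join points with four branches each.
back = push_backward("split", 4)
fwd = push_forward("merge", 4)
print(back[0] + fwd[0], "->", back[1] + fwd[1])
```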
During retiming, paths can affect each other through their join points. A join point can be either a merge or a split point.
The critical path is from A to B. The retimer can fix this by moving either of the registers closer to the other one. In both cases there is an available unused Hyper-Register. However, the retimer cannot move either of these registers due to related paths. For A to move right to the open register slot, it must be duplicated into the two side branches. The two lower branches are fine, but the top branch has a register called “Status” that cannot move because it crosses clock domains and there is a False Path on it. The retimer cannot push registers across false paths.
The other option is to push B left to the open slot. To do that, it must merge with a register from every other branch. In this example, it needs one from C. However, C would be pushed left into two open slots (one of which is merged with B). The problem is that the path from C to D is almost critical too. It is slightly better than A to B, but if C gets pushed left, then C to D becomes worse and the overall design timing gets worse. As a result, the retimer can’t push B left into the open slot. So even though paths A to B and C to D aren’t directly related and have no points in common, their common join point (the open slot B and C would be merged into) determines what can be done with retiming.
Join points can be very complex. For example, it may be that moving B left makes C to D worse than the original A to B slack. Of course, the retimer can just move D left to make up for that, but it may have a join point with another path, which joins with another path, and so on. This can continue until you finally reach something that cannot be moved and this final path would be worse than the original A to B. In the end, the retimer cannot move B left because some path would get worse that is much further down the chain. It may not be obvious how the two are related.
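The decision described above can be reduced to a comparison of worst-case slack across every path that a move touches. The following Python sketch is a simplified model, not the Hyper-Retimer's actual cost function, and the slack values (in picoseconds) are hypothetical numbers chosen to match the A to B and C to D discussion.

```python
def worst_slack(slacks):
    """Worst (smallest) slack over all affected paths."""
    return min(slacks.values())


def move_improves_design(before, after):
    """Accept a retiming move only if the worst slack gets better."""
    return worst_slack(after) > worst_slack(before)


# Hypothetical slacks in ps. A->B is critical; C->D is almost critical.
before = {"A->B": -300, "C->D": -200}

# Pushing B left fixes A->B, but C must move too, which makes C->D worse.
after_push_b_left = {"A->B": 100, "C->D": -500}

print(move_improves_design(before, after_push_b_left))
```

The move is rejected in this model because the worst slack falls from -300 ps to -500 ps, even though the path that motivated the move improves.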
The Hyper-Retimer understands long paths and all their interactions, and provides an ideal retiming solution. Deciphering a long register chain report can indicate the relation between the related chains. Finally, a very complex series of relationships between paths can be optimized by the retimer.
The naming convention is to append a resource used with _dff if a Hyper-Register is used, or with ~.comb if an ALM register is bypassed.
Hyper-Registers used on routing wires have the following naming convention:
<Original Name>_<Routing Wire Name>_dff
Hyper-Registers at inputs to LUTs, flipflops, and DSPs have the following naming convention:
The following example shows a register in an input:
The following example shows a register at a RAM input:
Bypassed flipflops in ALMs are appended with ~.comb:
The Hyper-Retimer assigns register names based on the physical location occupied by a register after retiming. If multiple registers with different names are merged together and those registers are then duplicated into other multiple registers, it may not be possible to trace a register back to its original location. For example, a design that begins with 9K registers can have more than 60K registers after retiming. With such large changes, it becomes nearly impossible to keep track of those registers with conventional naming schemes. The design does not increase in ALM or routing resources, because the extra registers were all Hyper-Registers that were originally bypassed.
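When post-processing reports, the routing-wire and bypass suffixes described above can be applied or recognized with simple string handling. The following Python sketch assumes only the `<Original Name>_<Routing Wire Name>_dff` and `~.comb` conventions stated in this section. The register and wire names in the example are hypothetical, and the suffix test is a heuristic that would misclassify a user signal whose name happens to end in `_dff`.

```python
def hyper_register_name(original_name, routing_wire_name):
    """Name of a Hyper-Register used on a routing wire."""
    return f"{original_name}_{routing_wire_name}_dff"


def bypassed_alm_register_name(original_name):
    """Name of an ALM flipflop that retiming has bypassed."""
    return f"{original_name}~.comb"


def classify(name):
    """Guess the resource type from the suffix alone (heuristic)."""
    if name.endswith("~.comb"):
        return "bypassed ALM register"
    if name.endswith("_dff"):
        return "Hyper-Register"
    return "other"


# Hypothetical names, for illustration only.
print(hyper_register_name("data_pipe[3]", "WIRE_A"))
print(classify(bypassed_alm_register_name("data_pipe[3]")))
```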
This section describes asynchronous and synchronous resets and their impact on retiming. When implementing these common structures in RTL, you must understand the effects of retiming in evaluating your designs.
Hyper-Registers do not have asynchronous resets. One of the fundamental requirements for retiming a design is therefore to remove the ubiquitous asynchronous resets. To take advantage of the HyperFlex architecture and the Hyper-Retimer, you should remove asynchronous resets and clears wherever possible, or convert them to synchronous. In the following example, a design has an asynchronous reset on the din and dout registers, and there are multiple extra_regs registers next to them.
The retimer cannot push through the two registers with asynchronous resets. This applies even if there are open slots available or if retiming opportunities exist that would normally allow the retimer to push the extra_regs either forward on the din side or backward on the dout side. Therefore, in this situation there cannot be any performance improvement.
Converting asynchronous resets to synchronous is sufficient in most cases. However, to get good retiming performance, also remove the asynchronous clears, except for the minimum required to achieve reset state, and to protect outputs from an undesirable state. If the asynchronous clears cannot be removed, the next best thing is to convert them to synchronous clears. There are several ways to minimize the number of asynchronous clears required in a design.
Asynchronous clears can be removed if, when the reset is held long enough, the circuit naturally settles to a steady state equivalent to a full reset. For example, the following circuit uses a full asynchronous reset, and all registers are 0 upon reset:
In the following figure, some of the asynchronous clears are removed from the middle of the circuit. After a partial reset, if the modified circuit settles to the same steady state as the original circuit, then the modification may be considered functionally equivalent.
Cases involving inverting logic generally require additional synchronous clears to remain in the pipeline.
After the reset is removed and the clock is applied, the register outputs do not settle to the reset state as in the circuit above. In this case, the inverting register cannot have its asynchronous clear removed to be equivalent to the above circuit after settling out of reset.
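Both cases can be checked with a small behavioral model. The following Python sketch assumes a single-bit pipeline in which only the first register keeps its clear, the clock keeps running while reset is held, and each later stage either copies or inverts the previous stage. The function name and the `inverting` parameter are illustrative.

```python
def settle_under_reset(initial, inverting=(), cycles=None):
    """Clock the pipeline while reset holds stage 0 at 0; return the final state.

    initial   -- arbitrary power-up contents of each stage (0 or 1)
    inverting -- indices of stages whose input logic inverts the previous stage
    cycles    -- clocks applied during reset (defaults to the pipeline depth)
    """
    state = list(initial)
    depth = len(state)
    for _ in range(depth if cycles is None else cycles):
        nxt = [0] * depth                      # stage 0 is held in reset
        for i in range(1, depth):
            prev = state[i - 1]
            nxt[i] = prev ^ 1 if i in inverting else prev
        state = nxt
    return state


garbage = [1, 0, 1, 1]                              # arbitrary power-up state
print(settle_under_reset(garbage))                  # plain pipeline
print(settle_under_reset(garbage, inverting={2}))   # inverter before stage 2
```

In this model the plain pipeline flushes to all zeros, which matches a full reset, while the pipeline with an inverting stage settles to a state that differs from all zeros.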
A solution to non-naturally resetting logic because of inverting functions is to validate the output to synchronize with reset removal. Then, as long as the validating pipeline can enable the output when the computational pipeline is actually valid, the behavior is equivalent with reset removal. This process works even if the computation portion of the circuit does not naturally reset.
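The validating pipeline can be modeled as a one-bit shift register that is reset, running alongside a data pipeline that is not. The following Python sketch is a behavioral illustration under those assumptions; the function name, the `"X"` marker for uninitialized contents, and the rule that the output is only sampled when valid is high are choices made for this example.

```python
def run_validated_pipeline(depth, inputs, garbage="X"):
    """Return the words seen at the output while the valid bit is high."""
    data = [garbage] * depth    # computational pipeline: no reset, unknown contents
    valid = [0] * depth         # validating pipeline: cleared by reset
    observed = []
    for word in inputs:
        # Clock edge after reset removal: both pipelines shift by one stage.
        data = [word] + data[:-1]
        valid = [1] + valid[:-1]
        if valid[-1]:           # output is qualified by the validating pipeline
            observed.append(data[-1])
    return observed


print(run_validated_pipeline(3, [10, 11, 12, 13, 14]))
```

Because the valid bit takes as many cycles to reach the output as the first real word does, none of the uninitialized contents are ever observed in this model.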
Converting the resets helps retiming, but there are still restrictions. The ALM has a dedicated LAB-wide signal which is often used for synchronous clears. Using the signal is determined by synthesis, but is usually dependent on the signal’s fan-out. A synchronous clear with a small fan-out is usually done in logic, while larger fan-outs use this dedicated signal. Even if the dedicated synchronous clear is used, the register can still be pushed into Hyper-Registers. This process is achieved through the bypass mode of the ALM register, where a signal can go right up to the register and still bypass it. When the register is bypassed, the sclr signal and other control signals can still be accessed.
In the following example, the LAB-wide synchronous clear feeds multiple ALM registers. A Hyper-Register is available along the synchronous clear path for every register.
During retiming, the top register in row (a) is pushed right into a Hyper-Register. This is achieved by bypassing the ALM register, but still using the SCLR logic that feeds that register. When the LAB-wide SCLR signal is used, an ALM register must exist on the data path, but it does not have to be used.
The retimer pushes the register in row (b) left into its data path. The register is pushed through a signal split of the data path and synchronous clear, and so the register must be pushed onto both nets, one in the data path and one in the synchronous clear path. This can be implemented because each path has a Hyper-Register.
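The two moves described above preserve behavior, which can be checked with a cycle-based model. The following Python sketch assumes an ALM whose LUT function is `f` and whose register has a synchronous clear. It compares the original register, the forward move (ALM register bypassed, Hyper-Register on the output), and the backward move (Hyper-Registers on both the data input and the synchronous clear path). The function names and the choice of `f` are illustrative.

```python
import random


def original(x, sclr, f):
    """ALM register with synchronous clear after the LUT function f."""
    out = []
    for xi, si in zip(x, sclr):
        q = 0 if si else f(xi)          # clock edge on the ALM register
        out.append(q)
    return out


def retimed_forward(x, sclr, f):
    """ALM register bypassed; SCLR logic kept; Hyper-Register on the output."""
    out = []
    for xi, si in zip(x, sclr):
        bypass = 0 if si else f(xi)     # combinational path through the ALM
        hyper_reg = bypass              # clock edge on the output Hyper-Register
        out.append(hyper_reg)
    return out


def retimed_backward(x, sclr, f):
    """Register duplicated onto the data net and the SCLR net."""
    out = []
    for xi, si in zip(x, sclr):
        data_hr, sclr_hr = xi, si       # clock edge on both Hyper-Registers
        out.append(0 if sclr_hr else f(data_hr))
    return out


random.seed(1)
x = [random.randrange(256) for _ in range(50)]
sclr = [random.random() < 0.2 for _ in range(50)]
f = lambda v: (v * 3 + 1) % 256         # stand-in for the LUT logic
print(original(x, sclr, f) == retimed_forward(x, sclr, f) == retimed_backward(x, sclr, f))
```

The backward move needs a register on the synchronous clear path as well as on the data path, because the clear value must arrive in the same cycle as the data it qualifies.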
Retiming becomes complicated if another register is pushed forward into the ALM. As shown in the following figure, a register from the synchronous clear port and a register from the data path must be merged together.
Because the register on the synchronous clear path is shared with other registers, the register splits on the path to other synchronous clear ports as well.
In the following figure, the Hyper-Register at a synchronous clear is already being used and cannot accept another register. In this case, you cannot retime this register for the second time through the ALM.
There are two key architectural components that make it easy to move an ALM register with a synchronous clear forward or backward:
- The ability to bypass the ALM register
- A Hyper-Register on the synchronous clear path
If you want to push more registers through, retiming becomes difficult. Because of this, performance improvement is expected to be better with asynchronous reset removal than conversion to synchronous resets. Synchronous clears are often difficult to retime because of their wide broadcast nature.
The Hyper-Retimer does not retime registers driving an output port or being driven by an input port. If a synchronous clear is on one of these I/O registers, it cannot be retimed. This restriction is not typical of practical designs where resets are driven by logic, but may come up as an issue in benchmarking a smaller piece of logic where the reset may come from an I/O port. In this case, all the logic driven by that reset is stuck and can’t be retimed. Adding some registers to the synchronous reset path can fix this.
If a synchronous clear signal causes timing issues, duplicating it between the source and destination registers can help. The registers pushed forward do not need to contend for Hyper-Register locations with registers being pushed back. For small logic blocks of a design, this is a valid strategy to improve timing.
Synchronous clears can limit the amount of retiming. There are two issues with synchronous clears that cause problems for retiming:
- A short path, usually going directly from the source register to the destination register without any logic between them. This is not a problem by itself. Short paths are normally good, because their positive slack can be retimed out to longer paths, making the whole design run faster. But short paths are typically connected to long data paths that need to be retimed. When many registers are retimed up and down these long paths, registers are pushed down or pulled up the short path. This issue isn’t a big problem in normal logic, but is aggravated because synchronous clears typically have large fan-outs.
- Synchronous clears have large fan-outs. When aggressive retiming requires registers to be pushed up or down the synchronous clear paths, the paths can get cluttered until they can no longer accept more registers. This situation results in path length imbalances (also referred to as short path / long path). Alternatively, no more registers can be pulled from the synchronous clear paths, and retiming reports insufficient registers as the Limiting Reason.
Aggressive retiming is when a second register must be retimed through the ALM register.
Consider an ALM register that has a synchronous clear signal, as shown in the picture on the left. The middle picture shows that register has been retimed forward and the ALM register is bypassed. The picture on the right shows the register being retimed backwards, in which case a register must be pushed up the SCLR path. Because the HyperFlex hardware has these special features, a dedicated Hyper-Register on the SCLR path, and the ability to put the ALM register into bypass mode, you can push and pull this register. If pushed forward, then you must pull a register down the SCLR path and merge the two. If pushed back, then you must push a duplicate register up the SCLR path. You can use both of these options. However, bottlenecks can be created when multiple registers are pushing and pulling registers up and down the synchronous clear routing.
In summary, be practical about where to use resets. Control logic mostly requires a synchronous reset. Leaving the synchronous reset out of logic that does not require one helps with timing.
- When writing new code that needs to run at high speeds, avoid synchronous resets wherever possible. This is generally in data path logic that either flushes out while the system is in reset, or its values are ignored when the system comes out of reset, until new, valid logic filters through.
- Control logic often requires a synchronous reset, so there is no avoiding it in that situation.
- For existing logic that runs at high speeds, remove the resets wherever possible. When you reach a point where you do not understand the logic well enough or aren’t confident with how it behaves when reset, leave the synchronous reset in. Only if it becomes a timing issue in your design should you spend time analyzing if and how the synchronous clear can be removed.
- Pipeline the synchronous clear. This will not help if registers need to be pushed back, but can help when registers need to be pulled forward into the data path.
- Duplicate synchronous clear logic for different hierarchies. This limits the fan-out of the synchronous clear so that it can be retimed with the local logic. Again, this may be done only after you determine the existing synchronous clear with a large fan-out is limiting how the design can be retimed. This is not difficult to do on the back-end because it does not change the design functionality.
- Duplicate the synchronous clear for different clock domains and inverted clocks. This can overcome some retiming restrictions due to boundary or multiple period requirement issues.
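Before duplicating a synchronous clear per hierarchy, it can be useful to estimate how the fan-out would divide. The following Python sketch groups the sinks of one clear signal by their leading hierarchy levels. The sink names are hypothetical, and the sketch assumes hierarchical names separated by `|`; obtaining the real sink list from your design is outside the scope of this example.

```python
from collections import defaultdict


def fanout_per_hierarchy(sink_paths, depth=1):
    """Count sinks of one clear signal per hierarchy prefix of the given depth."""
    groups = defaultdict(int)
    for path in sink_paths:
        prefix = "|".join(path.split("|")[:depth])
        groups[prefix] += 1
    return dict(groups)


# Hypothetical sinks of a single broadcast synchronous clear.
sinks = [
    "core|alu|acc[0]", "core|alu|acc[1]", "core|ctrl|state[0]",
    "dma|rd|ptr[0]", "dma|wr|ptr[0]",
]
print(len(sinks))                         # fan-out of the single broadcast clear
print(fanout_per_hierarchy(sinks))        # fan-out if duplicated per top-level block
print(fanout_per_hierarchy(sinks, 2))     # fan-out if duplicated one level deeper
```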
Like synchronous resets, clock enables use a dedicated LAB-wide resource that feeds a specific function in the ALM register. Similarly, the HyperFlex architecture has some special logic that makes retiming logic with clock enables easier. However, wide broadcast control signals such as clock enables (and synchronous clears) are difficult to retime.
The following figure shows that the sequence of retiming moves for the synchronous clears in the Synchronous Resets and Limitations section also applies to the clock enable control signals.
In the top circuit, there is a dedicated Hyper-Register on the clock enable path. If the register needs to be pushed back, it must be split so that another register is pushed up the clock enable path. Here, the Hyper-Register location can absorb it without problem. These features allow an ALM register with a clock enable to be easily retimed backward or forward (middle circuit), to improve timing. A useful feature of a clock enable is that its logic is usually generated by synchronous signals, so that the clock enable path can be retimed alongside the data path.
The figure shows how the clock enable signal clken, which is a typical broadcast type of control signal, gets retimed. In the top circuit, before retiming, an ALM register is used. The Hyper-Registers on the clock enable and data paths are also used. In the middle circuit, the ALM register has been retimed forward into a Hyper-Register outside the ALM, into the routing fabric. The ALM register is still being used, but it is not on the data path through the ALM. It is used to hold the previous value of the register. The clock enable mux now selects between this previous value and the new value based on the clock enable. The bottom diagram shows when a second register is retimed forward from the clock enable and data paths into the ALM register. The ALM register is now used in the path. This process can be repeated and multiple registers can be iteratively retimed across an enabled ALM register.
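The middle circuit can be checked against the top circuit with a cycle-based model. The following Python sketch assumes an enabled ALM register (top circuit) and a retimed version in which the ALM register only holds the previous value for the clock enable mux while a Hyper-Register in the routing carries the data (middle circuit). The function and variable names are illustrative.

```python
import random


def enabled_register(d, en):
    """Top circuit: ALM register with a clock enable on the data path."""
    q, out = 0, []
    for di, ei in zip(d, en):
        q = di if ei else q             # clock edge: load when enabled, else hold
        out.append(q)
    return out


def enabled_register_retimed_forward(d, en):
    """Middle circuit: ALM register off the data path, Hyper-Register in routing."""
    hold, out = 0, []
    for di, ei in zip(d, en):
        mux = di if ei else hold        # clock enable mux: new value or previous value
        hold = mux                      # ALM register keeps the previous value
        hyper_reg = mux                 # Hyper-Register outside the ALM captures the data
        out.append(hyper_reg)
    return out


random.seed(3)
d = [random.randrange(256) for _ in range(50)]
en = [random.random() < 0.5 for _ in range(50)]
print(enabled_register(d, en) == enabled_register_retimed_forward(d, en))
```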
The clock enable structure can be divided into the following three categories.
The localized clock enable has a small fan-out. It often occurs in a clocked process or always block where the signal’s behavior is undefined under a particular branch of a conditional case or if statement. As a result, the signal retains its previous value, which infers a clock enable. To check whether a design has clock enables, go to Fitter Report > Resource Section > Control Signals and check the Usage column. Because the localized clock enable has a small fan-out, retiming it is quite easy and usually does not cause any timing issues.
The high fan-out clock enable feeds a large amount of logic, enough that registers being retimed push or pull registers up and down the clock enable path for their specific needs, resulting in conflicts along the clock enable line. This is similar to the discussion on aggressive retiming in the Synchronous Resets Summary section. Some of the methods discussed there, like duplicating the enable logic, can be beneficial here.
These high-fan-out signals are typically used to disable a large amount of logic from running, and might occur when a FIFO’s full flag goes high. These signals can often be designed around, such as having the FIFO specify that it is almost full a few clock cycles earlier, and giving the clock enable a few clock cycles to propagate back to the logic it is disabling. These extra registers can be retimed into the logic if necessary. Altera recommends avoiding a high-fan-out signal whenever possible.
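The almost-full approach can be sized with a short worst-case simulation. The following Python sketch assumes a producer that writes one word per cycle unless stopped, a consumer that never reads (the worst case), and a stop flag that passes through `latency` pipeline registers before it reaches the producer. The function name and parameters are illustrative.

```python
from collections import deque


def peak_fill(depth, latency, threshold, cycles=100):
    """Highest FIFO fill level reached when the stop flag is pipelined.

    depth     -- FIFO capacity in words (a peak above this means overflow)
    latency   -- pipeline registers between the flag and the producer
    threshold -- fill level at which the flag asserts (depth means a plain full flag)
    """
    count = peak = 0
    pipe = deque([False] * latency)         # flag values still in flight
    for _ in range(cycles):
        pipe.append(count >= threshold)     # flag computed from the current fill level
        stop = pipe.popleft()               # flag value that reaches the producer now
        if not stop:
            count += 1                      # producer writes one word
        peak = max(peak, count)
    return peak


depth, latency = 8, 2
print(peak_fill(depth, latency, threshold=depth))            # pipelined full flag
print(peak_fill(depth, latency, threshold=depth - latency))  # almost-full flag
```

In this model a full flag delayed by two registers lets the fill level pass the FIFO depth, while an almost-full threshold of depth minus latency stops the producer exactly at the depth. The registers on the flag path are then available to the retimer.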
The third category is clock enable logic that is accompanied by Multicycle and, occasionally, False Path timing constraints. Clock enables are sometimes used to create a sub-domain that runs at half or quarter the rate of the main clock. Sometimes they’re used to control a single path whose logic changes every other cycle. However, the Hyper-Retimer does not retime registers that are endpoints of these timing exceptions. Because timing exceptions are generally used to relax timing, this case is less of an issue. If a clock enable is used to validate a long and slow data path, and the path still has trouble meeting timing, consider adding a register stage to the data path and removing the multicycle timing constraint on the path. The Hyper-Aware CAD flow then allows the retimer to retime the path to improve timing.
- Avoid asynchronous resets. If they are needed, use them only at logic boundaries, for control signals, or where functionality requires them.
- Avoid synchronous clears. They are usually broadcast signals that aren't retimer friendly.
- Avoid single cycle (stop/start) flow control. Examples are clock enables and FIFO full/empty signals. Consider using valid signals and almost full/empty signals, respectively.
- The retimer runs as part of quartus_sta. Avoid using Fitter-only overconstraints since they would not be visible to the retimer.
For information about adding pipeline registers, refer to AN715: Hyper-Pipelining for Stratix 10 Designs.
For information about addressing loops and other RTL restrictions to retiming, refer to AN716: Hyper-Optimization for Stratix 10 Designs.
|December 2014||2014.12.15||Initial release on MOLSON.|