This document assumes that you are familiar with OpenCL concepts as described in the OpenCL Specification version 1.0 by the Khronos Group.
- Flat compile [--bsp-flow flat]: Performs a flat compilation of the entire design (BSP along with kernel generated hardware).
- Base compile [--bsp-flow base]: Performs a base compilation by using the LogicLock restrictions from the base.qsf file. The kernel clock target is relaxed so that the BSP hardware has more freedom to meet timing. A base.qar database is created to preserve the BSP hardware, which is the static region.
- Import compile [<default>]: Restores the timing-closed static region from the base.qar database and compiles only the kernel generated hardware. It also increases the kernel clock target to obtain the best kernel maximum operating frequency (fmax).
- Static region: Represents the region having BSP related hardware that remains static. The timing is closed for this region during base compilation. In general, the goal is to minimize the chip resources used by this region to close timing.
- Kernel region: Represents the partial reconfiguration (PR) region that is reserved for the freeze_wrapper_inst|kernel_system_inst module, which contains the kernel. In general, the goal is to reserve as many chip resources as possible for this region.
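As a rough illustration of the three compile flows, the following Python sketch builds a compiler command line for each one. It assumes the offline compiler is invoked as `aoc` and uses a hypothetical kernel file name, `my_kernel.cl`. Only the `--bsp-flow` values come from the flow descriptions, and any board or output options that your platform needs are omitted.

```python
def aoc_command(kernel_file, flow="import"):
    """Return an argument list for one of the three compile flows."""
    if flow not in ("flat", "base", "import"):
        raise ValueError(f"unknown compile flow: {flow}")
    command = ["aoc"]  # assumed name of the offline compiler executable
    if flow != "import":
        # Import compile is the default flow, so it takes no --bsp-flow option.
        command += ["--bsp-flow", flow]
    command.append(kernel_file)
    return command


# my_kernel.cl is a placeholder file name used only for illustration.
for flow in ("flat", "base", "import"):
    print(" ".join(aoc_command("my_kernel.cl", flow)))
```

Returning a list rather than a single string keeps the sketch usable with `subprocess.run` without shell quoting concerns.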
- Begin with flat compilation to understand where the main components of the BSP are placed naturally (especially the IP blocks with I/O connections, such as PCIe® or DDR). While designing the BSP, you might have to establish pipeline stages between the IPs to close timing. First run a flat compile seed sweep to identify the recurrent failing paths, and then attempt to fix them.
- A good timing closure rate across flat compile seed sweeps improves the chances of closing timing in the base compile.
- If you observe consistent failures in mm_interconnect* (a component added by Qsys), open the System with Qsys Interconnect viewer and observe the complexity of the failing interconnect. You can add pipelining flip-flops in the viewer to improve timing. If you still cannot address the issue, you might have to break down the mm_interconnect* critical path by adding Avalon® pipeline bridges.
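One way to spot recurrent failing paths in a seed sweep is to tally how many seeds each failing path appears in. The sketch below assumes you have already noted the failing path groups for each seed by hand from the timing reports. The dictionary layout, the `min_fraction` threshold, and the path names other than mm_interconnect are hypothetical.

```python
from collections import Counter


def recurrent_failures(failing_paths_by_seed, min_fraction=0.5):
    """Return (path, seed_count) pairs for paths that fail in at least
    min_fraction of the seeds, most frequent first."""
    counts = Counter()
    for paths in failing_paths_by_seed.values():
        counts.update(set(paths))  # count each path at most once per seed
    threshold = min_fraction * len(failing_paths_by_seed)
    return [(path, n) for path, n in counts.most_common() if n >= threshold]


# Hypothetical sweep results: seed -> failing path groups (empty = timing met).
sweep = {
    1: ["mm_interconnect_0", "ddr_bridge"],
    2: ["mm_interconnect_0"],
    3: [],
    4: ["mm_interconnect_0", "pcie_bridge"],
}
print(recurrent_failures(sweep))  # [('mm_interconnect_0', 3)]
```

Paths that fail in only one seed are likely placement noise, while a path that recurs across most seeds is a better candidate for added pipelining.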
- During base compilation, start with a LogicLock® region that contains freeze_wrapper_inst|kernel_system_inst. With no other restrictions, Intel Quartus® Prime can place the BSP hardware freely in the remaining static region of the chip. Use the flat compile and Chip Planner to identify the size and location of the BSP hardware, such as PCIe® and DDR. Then, reserve the kernel region by using LogicLock® while avoiding the main clustered areas of the BSP hardware.
Tip: If the chip family is the same as that of the reference platform and the BSP components are similar, it might be faster to start with the LogicLock® regions for freeze_wrapper_inst|kernel_system_inst that are shipped with the OpenCL reference BSP and work through the failures.
- You might add the following additional components to your BSP:
- Memory banks: If you add more memory banks, you should identify the I/O bank location since you may need to add pipeline bridges to meet timing.
- I/O channels: You can add I/O channels such as video, Ethernet, or serial interface. If you add I/O channels, you should identify the I/O bank location since you might need to apply new LogicLock® regions for pipelining if closing timing is difficult.
Tip: If you need to add pipeline bridges (for example, due to large routing delays causing timing failures), then consider the routing distance from source to destination logic in the chip and release some space reserved for the kernel region.
- Follow these general guidelines when reserving LogicLock® regions:
- Attempt to place all DSP columns in the kernel_system unless required by the BSP.
- Attempt to reserve more resources for the kernel_system.
- Attempt to keep the number of notches in the kernel region to a minimum. The following figure illustrates a notch that was added to place a pipeline bridge between PCIe® and a DDR bank.
- Perform a seed sweep on the base compilation instead of selecting the first base seed that meets the timing.
- Perform import compilation (using a few kernels from the example designs) on all the passing base seeds.
- Compute the average fmax for each base seed.
- Select the base seed that yields the highest average fmax.
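The seed selection steps above can be sketched as follows. The dictionary layout (base seed mapped to the import-compile fmax of each test kernel, in MHz) and all values are hypothetical. In practice you would collect the numbers from your own import compilations.

```python
from statistics import mean


def best_base_seed(import_fmax_by_seed):
    """Return (seed, average_fmax) for the base seed whose import
    compilations yield the highest average kernel fmax."""
    averages = {
        seed: mean(fmaxes)
        for seed, fmaxes in import_fmax_by_seed.items()
        if fmaxes  # skip seeds with no import results
    }
    if not averages:
        raise ValueError("no import compilation results to compare")
    best = max(averages, key=averages.get)
    return best, averages[best]


# Hypothetical fmax values (MHz) for three kernels on each passing base seed.
results = {
    1: [240.0, 250.0, 230.0],
    3: [260.0, 245.0, 251.0],
    4: [255.0, 235.0, 242.0],
}
print(best_base_seed(results))  # (3, 252.0)
```

Averaging over several kernels avoids choosing a base seed that happens to suit only one kernel.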
- To understand how fast the kernel can run without floorplan restrictions:
- Perform a flat compilation of the kernel and observe the fmax.
- Perform an import compilation on the same kernel and observe the fmax.
- Compare fmax results.
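One simple way to express this comparison is as the percentage of flat-compile fmax that is lost in the import compile. This metric, the function name, and the example values are assumptions made for illustration.

```python
def floorplan_fmax_loss(flat_fmax_mhz, import_fmax_mhz):
    """Return the fmax lost to the floorplan, as a percentage of the
    flat-compile fmax for the same kernel."""
    if flat_fmax_mhz <= 0:
        raise ValueError("flat-compile fmax must be positive")
    return 100.0 * (flat_fmax_mhz - import_fmax_mhz) / flat_fmax_mhz


# Hypothetical results for one kernel: 300 MHz flat, 285 MHz import.
print(floorplan_fmax_loss(300.0, 285.0))  # 5.0
```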
- Never compare the kernel fmax from a base compilation with that from a flat or an import compilation. Because kernel clock targets are relaxed during base compilation, it never yields a good kernel fmax.
- Observe the kernel clock critical path in base or import compilation. If the critical path crosses from the kernel to the static region in the floorplan, change the floorplan or run a few more base seeds to avoid this critical path.
- Obtain the values for all FPGA resources from the Partition Statistics section of the Fitter report (top.fit.rpt or base.fit.rpt).
- Deduct the value for "freeze_wrapper_inst|kernel_system_inst" (kernel region) from each total.
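The deduction above amounts to a per-resource subtraction. The sketch below assumes the Partition Statistics values have been copied by hand into dictionaries. The resource names and all numbers are placeholders, not values from a real Fitter report.

```python
def deduct_kernel_region(top_level, kernel_region):
    """Subtract the kernel region's value from the top-level value for each
    resource, leaving what is attributable to the rest of the design (BSP)."""
    return {
        resource: value - kernel_region.get(resource, 0)
        for resource, value in top_level.items()
    }


# Placeholder values; replace with numbers from your own Fitter report.
top_level = {"ALMs": 100000, "M20Ks": 500, "DSPs": 300}
kernel_region = {"ALMs": 80000, "M20Ks": 400, "DSPs": 300}
print(deduct_kernel_region(top_level, kernel_region))
# {'ALMs': 20000, 'M20Ks': 100, 'DSPs': 0}
```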
The following table reflects the OpenCL BSP resource utilization of Arria® 10 GX devices in the 17.0 release.
| Total Available | Reserved for Kernel | Available for BSP | Used by BSP | % |
Observe that the floorplanning is executed in such a way that the static region will not have any DSP blocks.
|August 2017||Initial release.|