Welcome to the Intel® FPGA SDK for OpenCL™ support page! Here you will find information on how to emulate, compile, and profile your kernel. There are also guidelines on how to optimize your kernel, as well as information on how to debug your system while running the host application. This page is organized into two major categories based on the development platform: kernel development for FPGAs and host code development for CPUs.
Enjoy your journey!
Intel® FPGA SDK for OpenCL™ enables software developers to accelerate their applications by targeting heterogeneous platforms with Intel CPUs and FPGAs. You can also download the Intel FPGA SDK for OpenCL separately from the Intel Quartus® software.
Intel® FPGA SDK for OpenCL™ provides two modes of development experience for users. For code builders, all the tools are integrated into the GUI, which allows them to design, compile, and debug the kernel. On the other hand, the command-line options are for conventional users.
- GUI/code builder: not available at the moment
- Command-line options
Here are some useful commands for kernel developers:
aoc kernel.cl -o bin/kernel.aocx --board <board_type>
- Compiles the kernel.cl source into an FPGA .aocx kernel. The -o option names the target output file. <board_type> points to a specific BSP folder (run aoc --list-boards to see the available boards), for example a10gx, c5soc, or a10soc.
aoc kernel.cl -o bin/kernel.aocx --board <board_type> -march=emulator
- Builds an emulation .aocx kernel. The -march=emulator parameter instructs the compiler to generate an emulation kernel.
- To verify the BSP that is being used
- To check the current SDK version that has been installed
- To install the PCIe driver to the host computer
- To proceed with an internal board self-health check
- To program the kernel-only file into the FPGA flash memory
- To flash the complete flash memory through the JTAG connection
Intel® FPGA SDK for OpenCL™ is based on a published Khronos Specification and is supported by many vendors who are part of the Khronos group. Intel FPGA SDK for OpenCL has passed the Khronos Conformance Testing Process. It conforms to the OpenCL 1.0 standard and provides both the OpenCL 1.0 and OpenCL 2.0 headers by the Khronos Group.
Attention: The SDK currently does not support all OpenCL 2.0 application programming interfaces (APIs). If you use the OpenCL 2.0 headers and make a call to an unsupported API, the call will return an error code to indicate that the API is not fully supported.
The Intel FPGA SDK for OpenCL host runtime conforms with the OpenCL platform layer and API with some clarifications and exceptions, which can be found at the Support Statuses of OpenCL Features section of the Intel FPGA SDK for OpenCL Programming Guide.
Other Related Links:
Channels (I/Os or Kernel)
The Intel® FPGA SDK for OpenCL™ channel extension provides a mechanism for passing data to kernels and synchronizing kernels with high efficiency and low latency. Use the following links for more information on how to implement, use, and emulate channels:
- Implementing the Intel FPGA SDK for OpenCL Channels Extension
- Using Channels with Kernel Copies
- HTML Report: Kernel Design Concepts - Channels
- Transferring Data Via Intel FPGA SDK for OpenCL Channels or OpenCL Pipes
- Requirement for Multiple Command Queues in Channels or Pipes Implementation
Note: If you want to leverage the capabilities of channels but also keep the ability to run your kernel program using other SDKs, implement OpenCL pipes. For more information on pipes, see the following section on pipes.
Intel FPGA SDK for OpenCL provides preliminary support for OpenCL pipe functions, which are part of the OpenCL Specification version 2.0. They provide a mechanism for passing data to kernels and synchronizing kernels with high efficiency and low latency.
The Intel FPGA SDK for OpenCL implementation of pipes is not fully conformant to the OpenCL Specification version 2.0. The goal of the SDK's pipe implementation is to provide a solution that works seamlessly on a different OpenCL 2.0-conformant device. To enable pipes for Intel FPGA products, your design must meet certain requirements.
See the following links for more information on how to implement OpenCL pipes:
In a multistep design flow, you can assess the functionality of your OpenCL™ kernel by executing it on one or multiple emulation devices on an x86-64 Windows* or Linux* host. Compiling the design for emulation generates an .aocx file in seconds, so you can iterate on your design without waiting the hours that a full hardware compilation takes.
For Linux systems, the emulator offers symbolic debug support. Symbolic debug allows you to locate the origins of functional errors in your kernel code.
The link below has an overview of the design flow for OpenCL kernels and illustrates the different stages for which you can emulate your kernel.
The Emulating and Debugging Your OpenCL Kernel section from the Programming Guide contains more details on the differences between kernel operation on hardware and emulation.
Other Related Links:
- Emulating and Debugging Your OpenCL Kernel
- Emulating I/O channels
- Verifying Host Runtime Functionality via Emulation (Windows)
- Verifying Host Runtime Functionality via Emulation (Linux)
With the Intel® FPGA SDK for OpenCL™ Offline Compiler technology, you do not need to change your kernel to fit it optimally into a fixed hardware architecture. Instead, the offline compiler customizes the hardware architecture automatically to accommodate your kernel requirements.
In general, you should optimize a kernel that targets a single compute unit first. After you optimize this compute unit, increase the performance by scaling the hardware to fill the remainder of the FPGA. The hardware footprint of the kernel correlates with the time it takes for hardware compilation. Therefore, the more optimizations you can perform with a smaller footprint (that is, a single compute unit), the more hardware compilations you can perform in a given amount of time.
OpenCL Optimization for Intel FPGAs
To optimize the implementation of your design and get the maximum performance, understand your theoretical maximum performance and understand what your limitations are. Follow these steps:
- Start with a simple known good functional implementation.
- Use an emulator to validate the functionality.
- Remove or minimize the pipeline stalls that are reported with the optimization report.
- Plan memory access for optimal memory bandwidth.
- Use a profiler to debug performance issues.
The Profiler gives more insight into the system performance, which helps you decide where to further optimize the algorithm and its use of memory.
Remember that for FPGAs, the more resources that can be allocated, the more unrolling, parallelization, and higher performance can be attained.
Helpful Reports and Resources for Optimization
A number of system-generated reports are available to users. These reports give insight into the code and resource usage, and offer hints on where to focus to further improve performance:
- Loop Analysis Report of an OpenCL Design Example
- Verifying Information on Memory Replication and Stalls
- Reviewing Area Information
- HTML Report: Area Report Messages
Understanding memory systems is crucial to efficiently implement an application using OpenCL.
Global Memory Interconnect
Unlike a GPU, an FPGA can build any custom load-store unit (LSU) that is most optimal for your application. As a result, your ability to write OpenCL code that selects the ideal LSU types for your application might help improve the performance of your design significantly.
For more information, refer to the Global Memory Interconnect section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Local memory is a complex system. Unlike typical GPU architecture where there are different levels of caches, an FPGA implements local memory in dedicated memory blocks inside the FPGA. For more information, refer to the Local Memory section of the Intel FPGA SDK for OpenCL Best Practices Guide.
There are a number of ways to optimize memory usage to improve overall performance. For more information on some of the key techniques, refer to the Allocating Aligned Memory section of the Intel FPGA SDK for OpenCL Best Practices Guide.
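As a rough illustration of aligned allocation on the host side, the sketch below allocates a buffer whose address is a multiple of a fixed alignment. The 64-byte value, the helper names `alloc_host_buffer` and `is_host_buffer_aligned`, and the use of POSIX `posix_memalign` (so a Linux host) are assumptions made for this example; consult the Best Practices Guide for the alignment your platform actually requires.

```c
#define _POSIX_C_SOURCE 200112L /* exposes posix_memalign on POSIX systems */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment for host buffers that are transferred to or from the
   FPGA. This value is an assumption of the sketch, not taken from this page. */
#define HOST_BUFFER_ALIGNMENT 64

/* Hypothetical helper: returns a buffer of `bytes` bytes whose address is a
   multiple of HOST_BUFFER_ALIGNMENT, or NULL on failure. Release with free(). */
void *alloc_host_buffer(size_t bytes)
{
    void *ptr = NULL;
    assert(bytes > 0); /* a zero-sized transfer buffer is treated as a caller bug */
    if (posix_memalign(&ptr, HOST_BUFFER_ALIGNMENT, bytes) != 0) {
        return NULL;
    }
    return ptr;
}

/* Returns 1 if `ptr` satisfies the assumed alignment, 0 otherwise. */
int is_host_buffer_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % HOST_BUFFER_ALIGNMENT) == 0;
}
```

A host program would typically pass such a buffer to its OpenCL read and write calls instead of memory obtained from plain `malloc`.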
For more information on the strategies to improve memory access efficiency, refer to the Strategies for Improving Memory Access Efficiency section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Understanding pipelines is crucial for leveraging the best performance of your implementation. Efficient use of pipelines directly improves the performance throughput. For more details, refer to the Pipelines section of the Intel FPGA SDK for OpenCL Best Practices Guide.
For more information on data transfer, refer to the Transferring Data Via Intel FPGA SDK for OpenCL Channels or OpenCL Pipes section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Stall, Occupancy, Bandwidth
Profile your kernel to identify performance bottlenecks. For more information on how profiling information helps you identify poor memory or channel behaviors that lead to unsatisfactory kernel performance, refer to the Profiling Your Kernel to Identify Performance Bottlenecks section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Some techniques for optimizing the loops are:
For some tips on removing loop-carried dependencies in various scenarios for a single work item kernel, refer to the Removing Loop-Carried Dependency section of the Intel FPGA SDK for OpenCL Best Practices Guide.
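To make the idea of a loop-carried dependency concrete, here is a minimal sketch in plain C (not OpenCL kernel code) of one common restructuring: accumulating into a variable that is local to each outer iteration rather than into a value shared across all iterations. The function names and the computation are hypothetical, and whether the restructured form pipelines better for a given kernel should be confirmed in the offline compiler's reports.

```c
#include <assert.h>

/* Direct form: every inner-loop iteration updates `sum`, so the value is
   carried through both loops. */
int sum_direct(const int *a, const int *b, int n)
{
    int sum = 0;
    assert(a && b && n >= 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            sum += a[i * n + j];
        }
        sum += b[i];
    }
    return sum;
}

/* Restructured form: the inner loop accumulates into `row_sum`, which is
   private to one outer iteration. `sum` is updated once per outer iteration. */
int sum_restructured(const int *a, const int *b, int n)
{
    int sum = 0;
    assert(a && b && n >= 0);
    for (int i = 0; i < n; i++) {
        int row_sum = 0;
        for (int j = 0; j < n; j++) {
            row_sum += a[i * n + j];
        }
        sum += row_sum + b[i];
    }
    return sum;
}
```

Both functions compute the same result; only the dependency structure differs.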
For more information on optimizing floating-point operations, refer to the Optimizing Floating-Point Operations section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Area usage is an important design consideration if your OpenCL kernels are executable on FPGAs of different sizes. When you design your OpenCL application, Intel recommends that you follow certain design strategies for optimizing hardware area usage.
Optimizing kernel performance generally requires additional FPGA resources. In contrast, area optimization often results in decreased performance. During kernel optimization, Intel recommends that you run multiple versions of the kernel on the FPGA board to determine the kernel programming strategy that generates the best size versus performance trade-off.
For more information on strategies for optimizing FPGA area usage, refer to the Strategies for Optimizing FPGA Area Usage section of the Intel FPGA SDK for OpenCL Best Practices Guide.
Reference Design Examples
Some design examples that illustrate the optimization techniques are as follows:
This example shows the optimization of the fundamental matrix multiplication operation using loop tiling to take advantage of the data reuse inherent in the computation.
This example illustrates:
- Single-precision floating-point optimizations
- Local memory buffering
- Compile optimizations (loop unrolling, num_simd_work_items attribute)
- Floating-point optimizations
- Multiple device execution
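The loop tiling idea behind the matrix multiplication example can be sketched in plain C as follows. This is not the design example's kernel code: the tile size, the function name `matmul_tiled`, and the row-major square-matrix layout are assumptions chosen for illustration. The small `a_tile` and `b_tile` arrays stand in for the local memory buffers that a kernel would use to reuse data.

```c
#include <assert.h>
#include <string.h>

#define TILE 4 /* assumed tile edge; the matrix size must be a multiple of it */

/* Computes c = a * b for n x n row-major matrices, one TILE x TILE block at a
   time. Each loaded tile of a and b is reused for TILE * TILE multiply-adds. */
void matmul_tiled(const float *a, const float *b, float *c, int n)
{
    assert(n > 0 && n % TILE == 0);
    memset(c, 0, (size_t)n * (size_t)n * sizeof(float));

    for (int ii = 0; ii < n; ii += TILE) {
        for (int jj = 0; jj < n; jj += TILE) {
            for (int kk = 0; kk < n; kk += TILE) {
                float a_tile[TILE][TILE];
                float b_tile[TILE][TILE];

                /* Copy the current tiles into small local buffers. */
                for (int i = 0; i < TILE; i++) {
                    for (int k = 0; k < TILE; k++) {
                        a_tile[i][k] = a[(ii + i) * n + (kk + k)];
                        b_tile[i][k] = b[(kk + i) * n + (jj + k)];
                    }
                }

                /* Accumulate this tile pair's contribution into c. */
                for (int i = 0; i < TILE; i++) {
                    for (int j = 0; j < TILE; j++) {
                        float acc = c[(ii + i) * n + (jj + j)];
                        for (int k = 0; k < TILE; k++) {
                            acc += a_tile[i][k] * b_tile[k][j];
                        }
                        c[(ii + i) * n + (jj + j)] = acc;
                    }
                }
            }
        }
    }
}
```

In a kernel, the innermost loops over a tile are natural candidates for unrolling, since their trip counts are fixed by the tile size.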
This design example implements the time-domain finite impulse response (FIR) filter benchmark from the HPEC Challenge Benchmark Suite. For more information, refer to the Time-Domain Finite Impulse Response Filter Bank page.
This design is a great example of how FPGAs can provide far better performance than a GPU architecture for floating-point FIR filters.
This example illustrates:
- Single-precision floating-point optimizations
- Efficient 1D sliding window buffer implementation
- Single work-item kernel optimization methods
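A 1D sliding window buffer can be pictured as a small shift register that holds the most recent samples. The sketch below is a simplified, real-valued FIR filter in plain C rather than the design example itself; the tap count and the function name `fir_sliding_window` are assumptions made for illustration.

```c
#include <assert.h>

#define NUM_TAPS 4 /* assumed filter length for this sketch */

/* Filters n input samples with NUM_TAPS coefficients. `window` holds the
   NUM_TAPS most recent samples, newest at index 0, and starts out as zeros. */
void fir_sliding_window(const float *in, float *out, int n,
                        const float coeff[NUM_TAPS])
{
    float window[NUM_TAPS] = {0.0f};
    assert(in && out && coeff && n >= 0);

    for (int i = 0; i < n; i++) {
        /* Shift the window by one position and insert the new sample. */
        for (int t = NUM_TAPS - 1; t > 0; t--) {
            window[t] = window[t - 1];
        }
        window[0] = in[i];

        /* Multiply-accumulate across the whole window. */
        float acc = 0.0f;
        for (int t = 0; t < NUM_TAPS; t++) {
            acc += coeff[t] * window[t];
        }
        out[i] = acc;
    }
}
```

Because the shift and multiply-accumulate loops have a fixed trip count, a single work-item kernel written in this style reads each input sample only once and keeps the window in on-chip storage.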
This design example implements a video downscaler that takes 1080p input video and outputs 720p video at 110 frames per second. This example uses multiple kernels to efficiently read from and write to global memory.
This example illustrates:
- Kernel channels
- Multiple simultaneous kernels
- Kernel-to-kernel channels
- Sliding window design pattern
- Memory access pattern optimizations
This design example is an OpenCL implementation of the Lucas Kanade optical flow algorithm. A dense, non-iterative, and non-pyramidal version with a window size of 52x52 is shown to run at over 80 frames per second on the Cyclone® V SoC Development Kit.
This example illustrates:
- Single work-item kernel
- Sliding window design pattern
- Resource usage reduction techniques
- Visual output
Online training specific to OpenCL optimization with design examples is available at:
- OpenCL Optimization Techniques: Image Processing Algorithm Example
- OpenCL Optimization Techniques: Secure Hash Algorithm Example
In a multistep design flow, if the estimated kernel performance from emulation is acceptable, you can choose to collect information about how your design performs while executing on the FPGA.
You can instruct the Intel® FPGA SDK for OpenCL™ Offline Compiler to instrument performance counters in the Verilog code in the .aocx file with the --profile option. During execution, the Intel FPGA SDK for OpenCL Profiler measures and reports performance data that are collected during the OpenCL kernel execution on the FPGA. You can then review the performance data in the Profiler GUI.
The Profiling Your OpenCL Kernel section of the Intel FPGA SDK for OpenCL Programming Guide contains more information on how to profile your kernel.
Profiling information helps you identify poor memory or channel behaviors that lead to unsatisfactory kernel performance. The Profile Your Kernel to Identify Performance Bottlenecks section of the Intel FPGA SDK for OpenCL Best Practices Guide contains more in-depth information on the Dynamic Profiler GUI and how to interpret profiling data such as stall, bandwidth, cache hits, and so on. It also contains Profiler analysis of several OpenCL design example scenarios.
Intel® FPGA SDK for OpenCL™ provides a compiler and tools for you to build and run OpenCL applications that target Intel FPGA products.
If you only require the Intel FPGA SDK for OpenCL's kernel deployment functionality, download and install the Intel FPGA Runtime Environment (RTE) for OpenCL.
The RTE is a subset of the Intel FPGA SDK for OpenCL. Unlike the SDK, which provides an environment that enables the development and deployment of OpenCL kernel programs, the RTE provides tools and runtime components that enable you to build and execute a host program, and execute precompiled OpenCL kernel programs on target accelerator boards.
Do not install the SDK and the RTE on the same host system. The SDK already contains the RTE.
Utilities and Host Runtime Libraries
The RTE for OpenCL provides utilities, host runtime libraries, drivers, and RTE-specific libraries and files.
The RTE utility includes commands you can invoke to perform high-level tasks. The RTE utilities are a subset of the Intel FPGA SDK for OpenCL utilities.
The host runtime provides the OpenCL host platform API and runtime API for your OpenCL host application.
The host runtime consists of the following libraries:
- Statically-linked libraries provide OpenCL host APIs, hardware abstractions, and helper libraries
- Dynamic link libraries (DLLs) provide hardware abstractions and helper libraries
For more information on utilities and host runtime libraries, refer to the Contents of the Intel FPGA RTE for OpenCL section of the Intel FPGA RTE for OpenCL Getting Started Guide.
You can now significantly reduce the latency of your system by using host channels, which allow data from the host to stream directly into the FPGA kernel through the PCIe* interface while bypassing the memory controller. The FPGA kernel can begin processing the data immediately and does not have to wait for the data transfer to complete. Host channels are supported in the OpenCL runtime application programming interfaces (APIs) and include emulation support.
For more details on host channels and emulation support, refer to the Emulating I/O Channels section of the Intel® FPGA SDK for OpenCL™ Programming Guide.
Profiling allows you to learn where your program spends its time and which functions it calls. This information shows you which parts of your program run slower than you expected and might need a rewrite for faster execution. It can also tell you which functions are being called more or less often than you expected.
gprof is an open-source tool available on Linux* operating systems for profiling source code. It uses time-based sampling: at regular intervals, the program counter is interrogated to determine which point in the code execution has reached.
To use gprof, recompile the source code using the compiler profiling flag -pg.
Run the executable to generate the file containing the profiling information.
The run generates a file named "gmon.out", which contains all the information that the gprof tool requires to produce human-readable profiling data. Then use the gprof tool in the following way:
$ gprof <executable> gmon.out > profile_data.txt
profile_data.txt is the file that contains the human-readable profiling data that gprof produces from gmon.out. It has two parts: the flat profile and the call graph.
The flat profile shows how much time your program spent in each function, and how many times that function was called.
The call graph shows, for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function.
More information on the usage of gprof for profiling is available on the GNU website.
Intel® VTune™ Amplifier
The Intel® VTune™ Amplifier used for profiling helps you speed up and optimize execution of your code on Linux embedded platforms, Android*, or Windows* systems providing the following types of analysis:
- Performance analysis: Find serial and parallel code bottlenecks, analyze algorithm choices, and GPU engine usage, and understand where and how your application can benefit from available hardware resources
- Intel Energy Profiler analysis: Analyze power events and identify those that waste energy
For more information on the Intel VTune Amplifier, visit the Getting Started with Intel VTune Amplifier 2018 for Linux OS website.
OpenCL™ host pipelined multithread provides a framework to achieve high throughput for algorithms where a large amount of input data needs to be processed and the processing of each data item must be done in sequential order. One of the best applications of this framework is in heterogeneous platforms where high-throughput hardware is used to accelerate the most time-consuming part of the application. The remaining parts of the algorithm must run in sequential order on other platforms such as CPUs, either to prepare the input data for the accelerated task or to use the output of that task to prepare the final output. In this scenario, although part of the algorithm is accelerated, the total system throughput is much lower because of the sequential nature of the original algorithm.
AN 831: Intel FPGA SDK for OpenCL Host Pipelined Multithread Application Note proposes a new pipelined framework for high-throughput design. This framework is optimal for processing large input data through algorithms where data dependency forces sequential execution of all stages or tasks of the algorithm.
You must have administrator privileges on the development system to install the necessary packages and drivers required for the host software development.
The host system must be running one of the supported Windows* or Linux* operating systems listed on the Operating System Support page.
Develop your host application for the Intel® FPGA SDK for OpenCL™ using one of the following development environments:
Windows OS systems
- Intel FPGA SDK for OpenCL
- Board support package (BSP)
- Microsoft* Visual Studio Professional version 2010 or later.
Linux OS systems
- Intel FPGA SDK for OpenCL
- RPM (RPM Package Manager; originally Red Hat Package Manager)
- C compiler included with GCC
- Perl command version 5 or later
You can use the Intel® FPGA SDK for OpenCL™ Emulator to check the functionality of the kernel. On Linux* systems, the debugging feature provided with the Emulator also allows you to debug OpenCL kernel functionality as part of the host application.
For more information, refer to these sections in the Intel FPGA SDK for OpenCL Programming Guide:
There are certain environment variables that can be set to get more debug information while running the host application. These are Intel® FPGA SDK for OpenCL™ specific environment variables, which can help diagnose problems with custom platform designs. The following table lists all of these environment variables as well as describes them in detail.
|ACL_HAL_DEBUG||Set this variable to a value of 1 to 5 to increase debug output from the hardware abstraction layer (HAL), which interfaces directly with the MMD layer.|
|ACL_PCIE_DEBUG||Set this variable to a value of 1 to 10,000 to increase debug output from the MMD. This variable setting is useful for confirming that the version ID register was read correctly and the UniPHY IP cores are calibrated.|
|ACL_PCIE_JTAG_CABLE||Set this variable to override the default quartus_pgm argument that specifies the cable number. The default is cable 1. If there are multiple Intel® FPGA Download Cables, you can specify a particular cable by setting this variable.|
|ACL_PCIE_JTAG_DEVICE_INDEX||Set this variable to override the default quartus_pgm argument that specifies the FPGA device index. By default, this variable has a value of 1. If the FPGA is not the first device in the JTAG chain, you can customize the value.|
|ACL_PCIE_USE_JTAG_PROGRAMMING||Set this variable to force the MMD to reprogram the FPGA using the JTAG cable instead of partial reconfiguration.|
|ACL_PCIE_DMA_USE_MSI||Set this variable if you want to use MSI for direct memory access (DMA) transfers on Windows* OS.|
|CL_CONTEXT_COMPILER_MODE_INTELFPGA||Unset this variable or set it to a value of 3. The OpenCL™ host runtime reprograms the FPGA as needed, which it does at least once during initialization. To prevent the host application from programming the FPGA, set this variable to a value of 3.|
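These variables are usually exported in the shell before launching the host application. As an alternative, the sketch below sets two of them from inside a host program using POSIX `setenv`. The helper name `enable_fpga_debug_output` is hypothetical, and the sketch assumes that the runtime reads the variables when it initializes, so the helper would need to run before any OpenCL call; that ordering is an assumption, not something this page states.

```c
#define _POSIX_C_SOURCE 200112L /* exposes setenv on POSIX systems */
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: sets the HAL and MMD debug levels for this process.
   Levels are passed as strings, for example "1". Returns 0 on success and
   -1 if either variable could not be set. */
int enable_fpga_debug_output(const char *hal_level, const char *pcie_level)
{
    assert(hal_level && pcie_level);
    if (setenv("ACL_HAL_DEBUG", hal_level, 1) != 0) {
        return -1;
    }
    if (setenv("ACL_PCIE_DEBUG", pcie_level, 1) != 0) {
        return -1;
    }
    return 0;
}
```

Calling `enable_fpga_debug_output("1", "1")` at the start of the program would request the lowest debug level listed in the table for both layers.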
Because of a loop in the host program, you may find that the OpenCL™ system gradually slows down while it runs. For more details about this scenario, refer to the Debugging Your OpenCL System That is Gradually Slowing Down section of the Intel® FPGA SDK for OpenCL Programming Guide.
The Intel Code Builder for OpenCL is a software development tool available as part of the Intel FPGA SDK for OpenCL. It provides a set of Microsoft* Visual Studio and Eclipse plug-ins that enable capabilities for creating, building, debugging, and analyzing Windows* and Linux* applications accelerated with OpenCL. For more information, refer to the Developing/Debugging OpenCL Applications Using Intel Code Builder for OpenCL section of the Intel FPGA SDK for OpenCL Programming Guide.
There are numerous design examples available to describe various applications in OpenCL™. Users can compile and execute these designs on a host with an FPGA board that supports Intel® FPGA SDK for OpenCL.
For more OpenCL design examples, refer to the OpenCL Developer Zone website.
- Introduction to OpenCL SDK for Intel FPGA SDK
- Introduction to Parallel Computing with OpenCL™ on Intel® FPGAs
- Writing OpenCL on Intel FPGAs
- Running OpenCL on Intel FPGAs
- OpenCL Optimization Techniques: Secure Hash Algorithm (SHA-1) Example
- OpenCL Optimization Techniques: Image Processing Algorithm Example
- Single-Threaded vs. Multi-Threaded Kernels
- Optimization and Emulation Flow in Altera® SDK for OpenCL
- Other OpenCL Training Courses
|How to Run Hello World and (Other Programs) with OpenCL™ on Cyclone® V SoC Using Windows* Part 1||This video describes the out-of-box procedure for running two applications, OpenCL™ HelloWorld and OpenCL fast Fourier transform (FFT) on the Cyclone® V SoC using a Windows* machine.|
|How to Run Hello World and (Other Programs) with OpenCL on Cyclone V SoC Using Windows Part 2||This video describes the out-of-box procedure for running two applications, OpenCL HelloWorld and OpenCL FFT on the Cyclone V SoC using a Windows machine.|
|How to Run Hello World and (Other Programs) with OpenCL on Cyclone V SoC Using Windows Part 3||This video describes the out-of-box procedure for running two applications, OpenCL HelloWorld and OpenCL FFT on the Cyclone V SoC using a Windows machine.|
|How to Run Hello World and (Other Programs) with OpenCL on Cyclone V SoC Using Windows Part 4||This video describes the out-of-box procedure for running two applications, OpenCL HelloWorld and OpenCL FFT on the Cyclone V SoC using a Windows machine.|
|How to Run Hello World and (Other Programs) with OpenCL on Cyclone V SoC Using Windows Part 5|
|How to Package Custom Verilog Modules/Designs as OpenCL Libraries||The video discusses why customers could potentially use this feature to have their custom processing blocks (RTL) in OpenCL kernel code. The video explains the design example, such as the makefiles and config files, and explains the compilation flow. The video also shows a demo of the design example.|
|OpenCL on Altera® SoC FPGA (Linux* Host) – Part 1 – Tools Download and Setup||This video shows you how to download, install, and configure the tools required to develop OpenCL kernels and host code targeting Altera® SoC FPGAs.|
|OpenCL on Altera SoC FPGA (Linux Host) – Part 2 – Running the Vector Add Example with the Emulator||This video shows you how to download and compile an example OpenCL application targeting the emulator that is built into the OpenCL.|
|OpenCL on Altera SoC FPGA (Linux Host) – Part 3 – Kernel and Host Code Compilation for SoC FPGA||This video shows you how to compile the OpenCL kernel and host code targeting the FPGA and processor of the Cyclone V SoC FPGA.|
|OpenCL on Altera SoC FPGA (Linux Host) – Part 4 – Setup of the Runtime Environment||This video shows you how to set up the Cyclone V SoC board to run the OpenCL example and execute the host code and kernel on the board.|