Intel® FPGA SDK for OpenCL™ provides the tools you need to get started with developing your designs. Browse through the developer zone to find examples, development platforms, and partners that can make your design a success.

Videos

Optical Flow and Pedestrian Detection Implemented with OpenCL


Get started with OpenCL™ and FPGAs with our low-cost, low-power SoCs to accelerate motion estimation and pedestrian detection algorithms.

 

 

Object Detection and Recognition with Neural Networks


See how our partner iAbra demonstrates machine learning with convolutional neural networks on FPGAs using OpenCL™ to accelerate object detection and recognition scenarios.


 

Achieve Power Efficiency in a Single Chip Solution Using OpenCL on Our SoCs


Watch how OpenCL™ accelerates the performance of a ray tracing algorithm. In this demo, we compare the performance of code written to run only on the embedded ARM* Cortex*-A9 processor with that of the FPGA-accelerated version in a Cyclone® V FPGA, which uses the ARM processor as the host.

 

Unified Heterogeneous Programmability of OpenCL

 

Watch how OpenCL™ provides a unified platform for heterogeneous computing. In this demo, we retarget NVIDIA code written for a graphics processing unit (GPU) to a Stratix® V FPGA.




 

Accelerating Algorithm Performance with OpenCL by Offloading to an FPGA

 

Watch how OpenCL™ accelerates the performance of the Mandelbrot algorithm, an iterative, arithmetically intensive floating-point algorithm.

The Intel FPGA OpenCL Application Developers provide services, software, and other intellectual property (IP), which can enhance your OpenCL-based product development. Intel's development partners have expertise in a wide range of application areas including:

  • High-performance computing
  • Financial trading, simulations, and analysis
  • Defense, intelligence, and aerospace
  • Image processing, medical systems, and bio-informatics
  • Computing for oil and gas exploration
 

Design Examples

The following examples demonstrate how to describe various applications in OpenCL along with their respective host applications, which you can compile and execute on a host with an FPGA board that supports the Intel FPGA SDK for OpenCL.

Basic Examples

Design Example Features Benefits Description

Hello World

  • OpenCL application programming interface (API) to initialize a device and run a kernel
  • Getting started
This simple design example demonstrates a basic OpenCL kernel containing a printf call and its corresponding host program.

Vector Addition

  • OpenCL API
  • Partition a large problem across multiple devices
  • OpenCL events and event profiling
  • Getting started
This simple design example demonstrates a basic vector addition OpenCL kernel and its corresponding host program.
Multithread Vector Operation
  • Multithreaded host
  • Advanced host code
Two host threads launch two simultaneous kernels.
OpenCL Library
  • OpenCL Library
  • Advanced kernel code
Example designs that use OpenCL libraries containing Verilog and VHDL code to implement custom functions.
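
The Vector Addition example above lists partitioning a large problem across multiple devices as one of its features. The following is a minimal sketch of that idea in plain Python, not the SDK's host code: the function names `partition`, `vector_add_kernel`, and `vector_add` are hypothetical, and the "devices" are simulated by ordinary loops rather than real OpenCL command queues.

```python
def partition(n, num_devices):
    """Split n elements into contiguous (offset, length) chunks, one per device."""
    base, extra = divmod(n, num_devices)
    chunks, offset = [], 0
    for d in range(num_devices):
        # The first `extra` devices take one additional element each.
        length = base + (1 if d < extra else 0)
        chunks.append((offset, length))
        offset += length
    return chunks


def vector_add_kernel(a, b, out, offset, length):
    """Stand-in for one device's kernel launch: each gid plays the role of a work-item."""
    for gid in range(length):
        i = offset + gid
        out[i] = a[i] + b[i]


def vector_add(a, b, num_devices=2):
    """Add two equal-length vectors, giving each simulated device its own chunk."""
    out = [0] * len(a)
    for offset, length in partition(len(a), num_devices):
        vector_add_kernel(a, b, out, offset, length)
    return out
```

In a real host program, each chunk would typically map to its own buffers and command queue so that the devices can run concurrently; the sketch only shows how the index space is divided.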

Network Platform Examples

Design Example Features Benefits Description
OPRA FAST Parser
  • Single work-item kernel
  • I/O channels
  • Low latency
  • 10G link saturation
This design example demonstrates a streaming parser commonly used in high-frequency trading algorithms. The parser accepts an OPRA FAST data stream and decompresses the fields for use upstream. It illustrates how you can process streaming messages efficiently to achieve 10G link saturation.
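
As a rough illustration of what decompressing fields from a FAST-encoded stream involves, the sketch below decodes stop-bit encoded unsigned integers, the basic variable-length encoding used by the FAST protocol: each byte carries seven data bits, and a set high bit marks the last byte of a field. This is an assumption-laden sketch in plain Python, not the design example's kernel; the function names are hypothetical, and real OPRA FAST parsing also involves presence maps and field operators that are not shown here.

```python
def decode_stop_bit_uint(data, pos=0):
    """Decode one stop-bit encoded unsigned integer starting at data[pos].

    Returns (value, next_pos). Raises IndexError if the stream ends
    before a byte with the stop bit set is found.
    """
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        # Append the seven data bits of this byte to the value.
        value = (value << 7) | (byte & 0x7F)
        # A set high bit marks the final byte of the field.
        if byte & 0x80:
            return value, pos


def decode_fields(data):
    """Decode consecutive stop-bit encoded integers until the input is consumed."""
    fields, pos = [], 0
    while pos < len(data):
        value, pos = decode_stop_bit_uint(data, pos)
        fields.append(value)
    return fields
```

Because each byte is handled in a fixed amount of work, a byte-at-a-time loop of this shape is the kind of computation that fits a single work-item, streaming kernel.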

HPC Platform Examples

Design Example Features Benefits Description
Channelizer
  • Kernel channels
  • Multiple simultaneous kernels
  • Single work-item kernels
  • Performance
  • Getting started with kernel channels
This design example demonstrates a high-performance channelizer design using OpenCL. The channelizer combines a polyphase filter bank (PFB) with a fast Fourier transform to reduce the effects of spectral leakage on the resulting frequency spectrum.
Document Filtering
  • Working with 24-bit integers
  • Performance
This design example demonstrates the use of a Bloom filter for high-performance document filtering.

Finite Difference Computation (3D)

  • Single-precision floating-point optimizations
  • Single work-item kernel
  • Optimizations to minimize redundant memory use
  • Performance
This design example demonstrates a high-performance 3D finite-difference stencil-only computation using OpenCL. It shows how to efficiently describe a sliding window data reuse pattern.

FFT (1D)

  • Single-precision floating-point optimizations
  • Single work-item kernel
  • Performance
This design example demonstrates a high-performance 1D radix-4 complex fast Fourier transform (FFT) or inverse fast Fourier transform (IFFT) engine using OpenCL. This example takes advantage of the efficient sliding window data reuse pattern.

FFT Off-Chip (1D)

  • Single-precision floating-point optimizations
  • Kernel channels
  • Optimized memory accesses
  • Performance
  • Getting started with kernel channels
This design example is a high-performance implementation of a one-million-point FFT. Such large FFTs cannot be done completely on the FPGA, so this example demonstrates how to manage the memory accesses efficiently.
FFT (2D)
  • Single-precision floating-point optimizations
  • Kernel channels
  • Memory access pattern optimizations
  • Multiple simultaneous kernels
  • Mix of single work-item and NDRange kernels
  • Performance
  • Getting started with kernel channels
This design example demonstrates a high-performance 2D radix-4 complex FFT/IFFT engine using OpenCL. This engine is targeted at large problem sizes (1024x1024 by default) and uses global memory to store the intermediate transposition. One aspect highlighted by this example is how to efficiently perform matrix transposition in global memory.
Gzip Compression
  • Single work-item kernel
  • Stream-like processing
  • Published paper with implementation and results
  • High performance (vs. CPU, RTL, ASIC)
  • Parameterizable performance and compression quality
This design example showcases a high-performance Gzip compression implementation using OpenCL for Intel FPGAs.
JPEG Decoder
  • Single work-item kernels
  • Kernel channels
  • Overlapping memory transfers and kernel invocations
  • Visual output
  • Scalable performance
  • Getting started with kernel channels
This design example showcases a higher-performance JPEG decoding solution.
Mandelbrot Fractal Rendering
  • Double-precision floating-point optimizations
  • Multiple device partitioning
  • Visual output
  • Scalable performance
This design example includes a kernel that implements the Mandelbrot fractal convergence algorithm and displays the results to the screen.
Matrix Multiplication
  • Single-precision floating-point optimizations
  • Local memory buffering
  • Compiler optimizations
  • Multiple device execution
  • Scalable performance
  • Getting started with optimization methods
This example shows the optimization of the fundamental matrix multiplication operation using loop tiling to take advantage of the data reuse inherent in the computation.
Monte Carlo Black-Scholes Asian Options Pricing
  • Double-precision floating-point optimizations
  • Kernel channels
  • Multiple device execution
  • Multiple simultaneous kernels
  • Scalable
  • Power-efficient performance
  • Getting started with kernel channels
This design example implements the Monte Carlo Black-Scholes simulation for Asian option pricing. This example shows how to run multiple kernels simultaneously, with each performing different parts of the simulation (random number generation, path simulation, and accumulation) and communicating using our channels vendor extension.
Sobel Filter
  • Integer arithmetic
  • Single work-item kernel
  • Efficient 2D sliding window line buffer
  • Visual output
  • Scalable performance
This design example demonstrates a seamless software solution of a Sobel filter in OpenCL to perform edge detection on an image and display the resulting filtered image on the screen.

Time-Domain FIR Filter

  • Single-precision floating-point optimizations
  • Efficient 1D sliding window buffer implementation
  • Single work-item kernel
  • Optimization methods
  • Performance
  • Getting started with optimization methods
This design implements the time-domain finite impulse response (FIR) filter benchmark from the HPEC Challenge Benchmark Suite. It illustrates how FPGAs can provide far better performance than a GPU architecture for floating-point FIR filters.

Video Downscaling

  • Kernel channels
  • Multiple simultaneous kernels
  • Memory access pattern optimizations
  • Performance
  • Getting started with kernel channels
This design example implements a video downscaler that takes 1080p input video and outputs 720p video at 110 frames per second. This example uses multiple kernels to efficiently read from and write to global memory.
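
Several of the examples above (finite difference, Sobel filter, time-domain FIR filter) rely on a sliding window buffer for data reuse. The sketch below shows the 1D form of that pattern for an FIR filter in plain Python. It is only an illustration of the pattern, not code from the design examples: the function name `fir_sliding_window` is hypothetical, and the Python list stands in for what would be a shift register in a single work-item kernel.

```python
def fir_sliding_window(samples, coeffs):
    """Apply an FIR filter using a sliding window (shift register) of past samples.

    Each input sample is read once, shifted into the window, and reused
    for the next len(coeffs) - 1 outputs instead of being re-read.
    """
    taps = len(coeffs)
    window = [0.0] * taps  # newest sample at index 0, zero-initialized history
    out = []
    for x in samples:
        # Shift every element one slot and insert the new sample at the front.
        window = [x] + window[:-1]
        # One output per input: dot product of coefficients and window.
        out.append(sum(c * w for c, w in zip(coeffs, window)))
    return out
```

The point of the pattern is that each sample is fetched from memory exactly once; every later use comes from the window, which is what keeps memory traffic low when the loop is pipelined in hardware.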

Cyclone SoC Platform Examples

Design Example Features Benefits Description

Multifunction Printer Error Diffusion

  • Single work-item kernel
  • Sliding window design pattern
  • Part of a Multifunction Printer system
  • Performance
This design is part of a core printer pipeline. It implements a variant of the Floyd-Steinberg error diffusion algorithm. The kernel takes a CMYK image and produces an equivalent image with every pixel half-toned. Such an output is the final stage of image processing inside a printer before it is sent to the laser system. A whitepaper, “FPGA Acceleration of Multifunction Printer Image Processing Using OpenCL”, is also available for this example.

Optical Flow

  • Single work-item kernel
  • Sliding window design pattern
  • Resource usage reduction techniques
  • Visual output
  • Performance
This design example is an OpenCL implementation of the Lucas-Kanade optical flow algorithm. A dense, non-iterative, and non-pyramidal version with a window size of 52x52 is shown to run at over 80 frames per second on the Cyclone V SoC Development Kit.
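
For readers unfamiliar with error diffusion, the sketch below shows the classic Floyd-Steinberg algorithm on a single 8-bit channel in plain Python. It is not the variant used in the Multifunction Printer design example, which is not described here; the function name `floyd_steinberg`, the threshold of 128, and the single-channel input are assumptions made for illustration (a CMYK image would be processed one channel at a time in this simplified view).

```python
def floyd_steinberg(image, threshold=128):
    """Half-tone a 2D list of 0..255 values to 0 or 255 with classic error diffusion."""
    h, w = len(image), len(image[0])
    # Work on a float copy so accumulated error is not rounded away.
    work = [[float(p) for p in row] for row in image]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = work[y][x]
            new = 255 if old >= threshold else 0
            out[y][x] = new
            err = old - new
            # Push the quantization error onto unprocessed neighbors
            # with the classic 7/16, 3/16, 5/16, 1/16 weights.
            if x + 1 < w:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    work[y + 1][x + 1] += err * 1 / 16
    return out
```

Each pixel depends only on itself and on error from the previous pixel and the previous row, which is why the algorithm maps naturally onto the sliding window design pattern listed among the example's features.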

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

 

1 Product is based on a published Khronos Specification, and has passed the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/conformance.

 

To leverage the flexibility of FPGAs and the variety of customizable hardware architectures that can be implemented, we have introduced the concept of reference platforms to provide a better out-of-the-box experience for performing OpenCL evaluations, creating custom OpenCL applications, and creating custom boards with an FPGA accelerator on them. The reference platforms contain both the hardware and software layers for our reference designs to communicate with the board.

The traditional OpenCL model has a host that passes data to the accelerator system over PCI Express* (PCIe*). For the High-Performance Computing (HPC) platform, the system requires a large amount of local bulk storage for processing the data that the host sends to the accelerator. Applications on this platform require large amounts of memory bandwidth, and computing power is their primary concern. This platform is the standard platform for OpenCL accelerators.

To get started evaluating the standard HPC platform architecture, you can:

  • Download a reference design that runs on the HPC platform and learn the OpenCL application development flow
  • Download an HPC platform for one of our Intel FPGA Preferred Board vendors' boards
  • Purchase a commercial off-the-shelf (COTS) board that supports the HPC platform from one of our Intel FPGA Preferred Board vendors listed below
BittWare, GIDEL, Nallatech, Terasic

 

To learn how to use the reference platforms to create your own custom board, refer to the "Custom" tab below.

The Network platform deviates from the traditional OpenCL model by separating the datapath from the PCIe command and status path. Data is streamed into the kernels through I/O channels over two 10 Gb User Datagram Protocol (UDP) ports, without host interaction. This streaming architecture allows the host to configure the datapath pipeline and then step out of the picture, giving the kind of much lower-latency data processing path that traditional FPGA developers are used to. Applications using this platform are much more concerned with achieving low-latency results.

To start evaluating the low-latency Network platform architecture, you can:

  • Download a reference design that runs on the Network platform and learn the OpenCL application development flow
  • Download a Network platform for one of our Intel FPGA Preferred Board vendors' boards
  • Purchase a COTS board that supports the Network platform from one of our Intel FPGA Preferred Board vendors listed below
BittWare, Nallatech

The SoC platform resembles the traditional OpenCL model, with a shared global memory that is used to pass data between the ARM host and the FPGA accelerator, which in this case are in the same package. There is also an optional version of the architecture that adds a scratch DDR3 SDRAM interface on the FPGA accelerator side.

To start evaluating the Cyclone V SoC platform, you can:

  • Download a reference design that runs on the SoC platform and learn the OpenCL application development flow
  • Download the Intel FPGA SDK for OpenCL, which ships with the SoC reference platform for the Cyclone V SoC
  • Purchase a COTS board that supports the SoC platform from one of our Intel FPGA Preferred Board vendors listed below
Terasic  

While it is convenient if the architecture of the FPGA accelerator you want falls into one of these existing categories, it is not required. These reference platforms are a starting point to aid in building your own custom FPGA board. Start with the existing SoC or Network platform, replace or modify the component interfaces to match the ones you want, and rebuild it. This uses traditional FPGA design to create the “I/O ring” through which the OpenCL kernels communicate with the I/O interfaces on your custom board.

In order to build your own custom FPGA accelerator board, you will need a few things. To start building a custom board support package from a blank template, start with the custom platform toolkit.

 

Documentation

 

Custom Platform Toolkit:  Windows or Linux downloads

  • Raw template for a platform
  • Board Test kernels to exercise the I/O interfaces
  • MMD header file to get started building drivers
  • HPC platform migration text file (from version 13.1)

To start with an existing platform and modify it, here are the current reference platforms available.

 

Arria 10 GX FPGA Development Kit Reference Platform

 

Stratix V Network Reference Platform: s5_net (w/ PLDA UDP stack):

 

Cyclone V SoC Reference Platform

Arria 10 Custom Platform for OpenCL

Several of Intel's partners have already created boards and ported the reference platforms to their boards for purchase, in either evaluation mode or full production. These third-party production boards are tested according to the strict requirements of the Intel FPGA Preferred Board for OpenCL Partner Program. Only preferred boards are optimized for the most current Intel FPGA device architectures and design software. Intel works closely with these selected partners to ensure that their boards continually meet these standards by running over 9,000 regression tests. An Intel FPGA Preferred Board for OpenCL typically contains the following items:

  • Specific OpenCL board design files (including the platform)
  • Quartus® Prime Development Kit Edition software (one-year evaluation license)
  • A license for Intel FPGA SDK for OpenCL
  • Reference designs
  • Documentation

Third-Party Boards

Third-party preferred boards are purchased directly from Intel FPGA Preferred Board partners. The terms and conditions for the license of each certified board may vary from partner to partner. Intel FPGA Preferred Boards for OpenCL are carefully developed by third-party partners to ensure the highest possible quality. If a problem is traced to a preferred board, the partner is responsible for resolving the problem. If a problem arises in Intel FPGA SDK for OpenCL, Intel will provide the appropriate engineering support.

 

Warranty

Third-party preferred boards for Intel FPGA SDK for OpenCL are provided without warranty from Intel. Intel disclaims all warranties, express and implied, with respect to the board supplied by the partner, including, but not limited to, implied warranties of merchantability, fitness for a particular purpose, title and non-infringement. The Intel FPGA Preferred Board partners may offer guarantees or warranties for design performance or functionality. Please contact the individual partners for details.

Optimization Training

OpenCL Optimization Techniques: Secure Hash Algorithm (SHA-1) Example (7 minutes)

This training provides a simple overview of an optimization methodology for an OpenCL implementation on an FPGA, using the Secure Hash Algorithm (SHA-1) as an example.

 

OpenCL Optimization Techniques: Image Processing Algorithm Example (8 minutes)

This training provides a simple overview of an architectural optimization approach for targeting OpenCL on an FPGA for image processing algorithms.

 

Single-Threaded vs. Multi-Threaded Kernels (17 minutes)

Understand the differences between loop pipelining and parallel threads, and know when to use single-threaded (Task) and multi-threaded (NDRange) pipelining.

 

Optimization and Emulation Flow in Altera SDK for OpenCL (6 minutes)

See how you can optimize your FPGA-accelerated applications with the emulator and detailed optimization report features.

 

How to Do Reductions (PDF)

 

Being Careful with Memory Access Part 1 (PDF)

 

Being Careful with Memory Access Part 2 (PDF)

 

Optimizing OpenCL for Altera FPGAs (2 day)

This instructor-led training focuses on writing kernel functions that are optimized for Intel FPGAs, including hands-on exercises.

 

OpenCL Training Courses

Introduction to FPGA Acceleration for Software Programmers Using OpenCL

This training describes ways that you can use OpenCL to target an FPGA to create custom accelerated systems with an average of one fifth the power of competing accelerators, trends that are making FPGAs an important resource for accelerating software execution, and how OpenCL makes them accessible to software developers.

 

FPGA vs GPGPU (21 minutes)

Watch this short video to learn how FPGAs provide power-efficient acceleration with far fewer restrictions and far more flexibility than GPGPUs. We compare and contrast the approach to solving problems that leverages this flexibility with the fixed architecture of the GPGPU.

 

OpenCL on Altera SoC FPGA (Linux Host)

Part 1 – Tools Download and Setup (5 minutes)

Part 2 – Running the Vector Add Example with the Emulator (4 minutes)

Part 3 – Kernel and Host Code Compilation for SoC FPGA (4 minutes)

Part 4 – Setup of the Runtime Environment (7 minutes)

These training courses walk you through getting started with OpenCL on an SoC in a Linux environment.

 

Introduction to Parallel Computing with OpenCL (30 minutes)

Get an overview of the OpenCL standard and the advantages of using Intel's OpenCL solution.

 

Writing OpenCL Programs for Altera FPGAs (1 hour)

Understand the basics of the OpenCL standard and learn to write simple programs.

 

Running OpenCL on Altera FPGAs (30 minutes)

Get to know the Intel FPGA SDK for OpenCL and learn to compile and run OpenCL programs on Intel FPGAs.

 

Building Custom Platforms for Altera SDK for OpenCL (1 hour)

Learn how to create a custom board support package for use with your board and the Intel FPGA SDK for OpenCL.

 

Introduction to OpenCL for Altera FPGAs  (1 day)

Get an overview of parallel computing, the OpenCL standard, and the OpenCL for FPGA design flow in this instructor-led training. The training focuses not on writing kernels, but on the FPGA-specific portion of creating an OpenCL environment for hardware acceleration.