When you have a set of data where each data point represents a d-dimensional observation, clustering divides that data into groups (or clusters) that are more similar to each other.
The Intel® FPGA SDK for OpenCL™ can help software engineers to develop and accelerate algorithms on higher efficiency platforms such as FPGAs much faster then coding in RTL.
In the following sections, we explain the k-mean clustering algorithm. Then, we discuss the OpenCL implementation of k-mean clustering. Finally, we provide some performance results comparing how the algorithm can be accelerated on an Intel® Arria® 10 FPGA versus an Intel Xeon CPU.
If you have a set of observations or data represented with , and each data includes features and no labels, k-mean clustering provides a simple method to group the whole data set into k clusters , where data in the same cluster are more similar to each other.
Each data () is labeled with the cluster , where the distance of the data point to the centroid of cluster is minimum compare to all other clusters.
The objective is , where is the mean of the data points in cluster .
Different methods for initializing this unsupervised clustering algorithm exist, and this initialization is important for reducing the complexity of the algorithm (the number of iterations required) and the results of clustering (local optimum).
Typically, if the initial centroids are closer to the final centroid, you can achieve better clustering results and lower complexity. This application note does not cover initialization methods.
- Pick data points from the input data set as centroids for clusters.
- Find the distance of each data into all centroids and label the data with the cluster that has the minimum distance to its centroid.
- Calculate the new centroid of all clusters as a mean of the labeled data from the previous step.
- Check the error.
If the centroids are no longer changing, the algorithm has converged, and you can exit the algorithm. Otherwise, repeat steps 2- 4.
K-mean clustering algorithm is very time consuming algorithm because it goes through all of the data set and finds the distance of each data to centroid of all data. For calculations like these, FPGAs are great target. The Intel® FPGA SDK for OpenCL™ provides a faster solution for you to benefit the flexibility of FPGA platforms and accelerate the clustering process for huge number of unlabeled input data.
In this section, we take the previously described algorithm, and implement it as a series of OpenCL™ kernels. Initial centroids for k clusters are chosen on the host code and then passed into an OpenCL™ kernel running on an FPGA. Therefore, you can modify the host code with a method that you like to use to initialize the centroids. All the data is passed to the initial kernel using global memory transfer.
The following diagram shows the data flow between the kernels.
The first kernel (Read_data) reads input data from global memory and saves it locally for the iterations of the algorithm that follow. Through an internal channel, this data is passed to next kernel, which is an autorun kernel, in packets of 16 input data.
The next kernel (dist_calc) calculates the distance of each data point to all centroids and labels the data in the cluster with the minimum distance. It also calculates the sum of all the data labeled in the same cluster.
The generated data at this level is passed to the next kernel (centroid_calc) through an internal channel. In this kernel, the sum of data in each cluster is read from the internal channel and added until all of the data is processed. Then, the new centroids are calculated as an average of all of the data categorized in each cluster.
If the distance of the new centroids to the previous one is less than the acceptable error, the data was successfully clustered and the kernel exits.
If the error is not acceptable, the new centroids are sent to the dist_calc kernel again through an internal channel for the next iteration.
We compared the performance of this implementation on an FPGA with the optimized k-mean implementation on a CPU. For both FPGA and CPU runs, the same data set was used.
The FPGA used for performance comparison was an Intel® Arria® 10 GX FPGA Development Kit. The FPGA was programmed with Intel® FPGA SDK for OpenCL™ Version 17.1 Update 1. During testing, the FPGA had an fMAX of 320 MHz.
The CPU used for performance comparison was an Intel® Xeon® E5-2680 (24 cores, no hyperthreading).
|Data Size (bytes)||FPGA||CPU|
|Time with initialization method 1 (ms)||Time with initialization method 2 (ms)||Time (ms)|
In this experiment, the number of clusters are set to 10.
Each data set includes 2 features of floating type and different numbers of input data sets (512 to 8192) are used to compare the performance of FPGA and CPU.
For the FPGA runs, we tried two initialization methods. In the first method, we used the first k-data as the centroids of the clusters. In the second method, we chose centroids randomly. With randomly-chosen initial centroids, the algorithm required fewer iterations and therefore achieved faster times.