How do banks identify risky loan applications or credit card companies detect fraudulent credit-card transactions? How do universities monitor students’ academic performance? How do you make it easier to analyze digital images? These are just a few situations that many companies face when dealing with huge amounts of data.
To deal with risky loan or credit card problems, we can divide data into clusters of similar characteristics and look for abnormal behaviors when comparing this to other data in the same cluster. For monitoring students’ performance, school faculties can base on the students’ score range to sort them into different groups. Using a similar concept, we can partition a digital image into many segments based on a set of pixels that closely resemble each other. The idea is to simplify the representation of a digital image making it easier to identify objects and boundaries in an image.
Dividing data into clusters or sorting students’ scores into different groups can be done manually if the amount of data is not large. However, it would be impossible to do it manually if data is in the range of terabytes and petabytes. Therefore machine-learning1 approach is a good way to solve these types of problems when dealing with large amount of data.
This article discusses an unsupervised2 machine-learning algorithm called K-means3 that can be used to solve the above problems. It also describes how Intel® Data Analytics Acceleration Library (Intel® DAAL)4 help optimize this algorithm to improve the performance when running it on systems equipped with Intel® Xeon® processors.
In the case of supervised learning,5 the algorithm is exposed to a set of data in which the outcome is known so that the algorithm is trained to look for similar patterns in new data sets. In contrast, an unsupervised learning algorithm explores the data set with an unknown outcome. Further, input samples are not labeled and the system has to label them by itself. The system will scan the data and group the ones with similar characteristics/behaviors into what we call clusters. Basically, the system partitions the data set into clusters of similar characteristics and behaviors.
K-means is an unsupervised machine-learning algorithm.
In order to create clusters, K-means first assigns an initial value of centroids, normally by randomly selecting it. These are the centers for the clusters. Ideally these centers should be as far apart from each other as possible. Next, take the objects in the data set and associate them to the nearest centers to form the initial clusters. Then calculate the new centroids based on the newly formed clusters. Again, re-associate the objects base on these new centroids. Keep repeating the steps of recalculating new centroids and re-associating the objects until the centroid locations are no longer changed or the algorithm goes through all the specified iterations.
The goal of the K-means algorithm is also to minimize the cost function J below. Sometimes, J is also called the objective or square-error function:
J = Square-error function
xi = Object i
cj = Centroid for cluster j
k = Number of clusters
nj = Number of objects in jth cluster
|| xi – cj || = Euclidean distance8 between xi and cj
Figures 1–5 show how the K-means algorithm works:
Figure 1. Data set layout showing the objects of the data set are all over the space.
Figure 2. Shows the initial positions of the centroids. In general, these initial positions are chosen randomly, preferably as far apart from each other as possible..
Figure 3. Shows new positions of the centroids after one iteration. Note that the two lower centroids are re-adjusted to be closer to the two lower chunks of objects
Figure 4. Shows the new positions of the centroids after many iterations. Note that the positions of the centroids don’t vary too much compared to those in Figure 3. Since the positions of the centroids are stabilized, the algorithm will stop running and consider those positions final.
Figure 5. Shows that the data set has been grouped into three separate clusters.
K-means can be used for the following:
The following lists some of the advantages and disadvantages of K-means.
Intel DAAL is a library consisting of many basic building blocks that are optimized for data analytics and machine learning. These basic building blocks are highly optimized for the latest features of the latest Intel® processors. More about Intel DAAL can be found at 4. The K-means algorithm is supported in Intel DAAL. In this article, we use PyDAAL, the Python* API of Intel DAAL, to invoke K-means algorithm,. To install PyDAAL, follow the instructions in 6.
This section shows how step-by-step how to use the K-means algorithm in Python7 with Intel DAAL.
Do the following steps to invoke the K-means algorithm from Intel DAAL:
from daal.data_management import FileDataSource, DataSourceIface
import daal.algorithms.kmeans as kmeans
from daal.algorithms.kmeans import init
dataSet = FileDataSource(
init_alg = init.Batch_Float64RandomDense(nclusters)
initCentroids_ = init_alg.compute().get(init.centroids)
initCentroidsis the initial value of centroids.
initCentroidsvalue is randomly calculated using the above function, Batch_Float64RandomDense. Users can also assign a value to it.
cluster_alg = kmeans.Batch_Float64LloydDense(nclusters, nIterations)
result = cluster_alg.compute()
self.centroids_ = result.get(kmeans.centroids)
self.assignments_ = result.get(kmeans.assignments)
self.goalfunction_ = result.get(kmeans.goalFunction)
self.niterations_ = result.get(kmeans.nIterations)
K-means is one of the simplest unsupervised machine-learning algorithms that is used to solve the clustering problem. Intel DAAL contains an optimized version of the K-means algorithm. With Intel DAAL, you don’t have to worry about whether your applications will run well on systems equipped with future generations of Intel Xeon processors. Intel DAAL will automatically take advantage of new features in new Intel Xeon processors. What you need to do is link your applications to the latest version of Intel DAAL.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804