By Khang T Nguyen, Published: 09/07/2016, Last Updated: 09/07/2016
With the wide availability of the internet, text categorization has become an important way to handle and organize text data. Text categorization is used to classify news stories and find information on the Web. Similarly, searching for a photo on the web, or distinguishing a horse from a lion, requires some kind of mechanism to recognize and classify pictures. Classifying text or pictures by hand is time consuming, which makes this type of classification a good candidate for machine learning^{1}.
This article describes a classification machine learning algorithm called support vector machine^{2} and how the Intel® Data Analytics Acceleration Library (Intel® DAAL)^{3} helps optimize this algorithm when running it on systems equipped with Intel® Xeon® processors.
A support vector machine (SVM) is a supervised machine learning algorithm. It can be used for classification and regression.
An SVM performs classification by finding the hyperplane^{4} that separates objects belonging to different classes. The hyperplane is chosen to maximize the margin between the two classes, which reduces sensitivity to noise and increases the accuracy of the results. The data points that lie on the margins are called support vectors.
Figure 1 shows how an SVM classifies objects:
Figure 1: Classifying objects with a support vector machine.
There are two classes: green and purple. The hyperplane separates the two classes. If an object lies on the left side of the hyperplane, it is classified as belonging to the green class. Similarly, an object lying on the right side of the hyperplane belongs to the purple class.
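As a minimal sketch of this decision rule, the snippet below classifies a point by the sign of w·x + b. The weight vector `w`, the offset `b`, and the choice to map the negative side to green and the positive side to purple are illustrative assumptions, not values taken from Figure 1.

```python
def classify(point, w, b):
    """Return the class of `point` from the side of the hyperplane w.x + b = 0 it lies on."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return "purple" if score >= 0 else "green"


# Hypothetical hyperplane: the vertical line x = 2 in a two-dimensional space.
w = (1.0, 0.0)
b = -2.0

left_label = classify((0.5, 3.0), w, b)   # point to the left of x = 2
right_label = classify((4.0, 1.0), w, b)  # point to the right of x = 2
print(left_label, right_label)
```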
As mentioned above, we need to maximize the margin H (the distance between the two margins) to reduce noise, thus improving the accuracy of the prediction.
Because $H = 2 / \|w\|$, maximizing the margin H is equivalent to minimizing $\|w\|$, the norm of the weight vector w.
We also need to make sure that there are no data points lying between the two margins. To do that, the following conditions need to be met:
$$x_i \cdot w + b \ge +1 \quad \text{when } y_i = +1$$
$$x_i \cdot w + b \le -1 \quad \text{when } y_i = -1$$
The above conditions can be combined into a single inequality:
$$y_i (x_i \cdot w + b) \ge 1$$
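The conditions above can be checked numerically. The sketch below tests whether every point satisfies y_i (x_i · w + b) ≥ 1 and computes the distance between the two margins as 2 / ‖w‖. The sample points, labels, and the values of `w` and `b` are made up for illustration, and the helper names are hypothetical.

```python
import math


def satisfies_margin(points, labels, w, b):
    """True if every point meets y_i * (x_i . w + b) >= 1."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in zip(points, labels)
    )


def margin_width(w):
    """Distance between the two margins, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical data: two points per class, separated along the x axis.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 1.0)]
labels = [-1, -1, +1, +1]
w, b = (1.0, 0.0), -2.0

ok = satisfies_margin(points, labels, w, b)
width = margin_width(w)

# A point between the margins violates the condition.
violated = not satisfies_margin([(2.5, 0.0)], [+1], w, b)
print(ok, width, violated)
```

Doubling `w` in this example halves the value returned by `margin_width`, which is why a smaller ‖w‖ gives a wider margin.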
So far we have described the hyperplane as a flat plane, or as a straight line in a two-dimensional space. In real-life situations, however, the classes often cannot be separated this way, and the separating boundary is curved rather than straight, as shown in Figure 2.
Figure 2: The hyperplane as a curved line.
For simplicity, assume that we are working in a two-dimensional space, where the separating boundary is a curved line. To turn the curved line into a straight one, we can lift the data into a higher dimension, for example into a three-dimensional space by introducing a third dimension, called z.
Figure 3: Introducing a third dimension, z.
The technique of raising the data to a higher-dimensional space, so that the classes can be separated there by a straight line or a flat plane, is called the kernel trick^{5}.
Figure 4: Using a kernel trick to create a straight line or flat plane in a higher dimension.
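To make the lifting idea concrete, here is a small sketch that maps two-dimensional points into three dimensions with z = x² + y². This particular mapping, the sample points, and the threshold are assumptions chosen for illustration. The sketch builds the lifted coordinates explicitly, whereas a kernel function obtains the same effect by computing inner products in the lifted space without constructing those coordinates.

```python
def lift(point):
    """Map a 2-D point (x, y) to 3-D by adding z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)


# Hypothetical classes that no straight line separates in 2-D:
# one class sits near the origin, the other surrounds it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

# After lifting, the flat plane z = 1 separates the two classes.
threshold = 1.0
inner_below = all(lift(p)[2] < threshold for p in inner)
outer_above = all(lift(p)[2] > threshold for p in outer)
print(inner_below, outer_above)
```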
SVMs can be used to:
Using SVM has advantages and disadvantages:
Intel DAAL is a library of basic building blocks that are optimized for data analytics and machine learning. These building blocks are highly optimized for the latest features of the latest Intel® processors. The SVM classifier is one of the classification algorithms that Intel DAAL provides. In this article, we use the Python* API of Intel DAAL to build a basic SVM classifier. To install it, follow the instructions^{7}.
This section shows how to invoke the SVM algorithm in Python^{6} using Intel DAAL.
Do the following steps to invoke the SVM algorithm from Intel DAAL:
Import the necessary packages using the from and import commands:
import numpy as np
from daal.data_management import HomogenNumericTable
from daal.algorithms.svm import training, prediction
from daal.algorithms import classifier, kernel_function
import daal.algorithms.kernel_function.linear
Create a function to split the input dataset into the training data, label, and test data.
Split the input dataset array into two arrays. For example, split a dataset with 100 rows 80/20: 80 percent of the data for training and 20 percent for testing. The training data then contains the first 80 rows of the input dataset, and the testing data contains the remaining 20 rows.
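The split described above might look like the following sketch. It assumes, hypothetically, that the dataset is a two-dimensional NumPy array whose last column holds the label. The function name `split_dataset` and the `train_fraction` parameter are illustrative and are not part of Intel DAAL.

```python
import numpy as np


def split_dataset(dataset, train_fraction=0.8):
    """Split rows into training and testing parts; the last column is assumed to be the label."""
    n_train = int(round(dataset.shape[0] * train_fraction))
    train, test = dataset[:n_train], dataset[n_train:]
    # Column slices are non-contiguous views; copy them into contiguous
    # arrays in case the code that consumes them requires contiguous memory.
    trainingData = np.ascontiguousarray(train[:, :-1])
    trainGroundTruth = np.ascontiguousarray(train[:, -1])
    testingData = np.ascontiguousarray(test[:, :-1])
    testGroundTruth = np.ascontiguousarray(test[:, -1])
    return trainingData, trainGroundTruth, testingData, testGroundTruth


# Hypothetical dataset: 100 rows, 3 feature columns plus 1 label column.
dataset = np.arange(400, dtype=np.float64).reshape(100, 4)
trainingData, trainGroundTruth, testingData, testGroundTruth = split_dataset(dataset)
print(trainingData.shape, testingData.shape)
```

The split keeps the original row order, as in the text. Shuffling the rows before splitting is a common alternative when the dataset is sorted by class.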
Restructure the training and testing dataset so Intel DAAL can read it.
Use the commands to reformat the data as follows (we treat trainLabels and testLabels as n-by-1 tables, where n is the number of lines in the corresponding datasets):
trainInput = HomogenNumericTable(trainingData)
trainLabels = HomogenNumericTable(trainGroundTruth.reshape(trainGroundTruth.shape[0],1))
testInput = HomogenNumericTable(testingData)
testLabels = HomogenNumericTable(testGroundTruth.reshape(testGroundTruth.shape[0],1))
where
trainInput: Training data reformatted as an Intel DAAL numeric table.
trainLabels: Training labels reformatted as an Intel DAAL numeric table.
testInput: Testing data reformatted as an Intel DAAL numeric table.
testLabels: Testing labels reformatted as an Intel DAAL numeric table.
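The n-by-1 reshaping applied to the labels can be checked with NumPy alone, before any Intel DAAL table is created. In this sketch the label values are made up for illustration.

```python
import numpy as np

# Hypothetical labels stored as a flat array of shape (4,).
trainGroundTruth = np.array([1.0, -1.0, 1.0, -1.0])

# Same expression as in the text: one row per sample, a single column.
column = trainGroundTruth.reshape(trainGroundTruth.shape[0], 1)
print(trainGroundTruth.shape, column.shape)
```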
Create a function to train the model.
algorithm = training.Batch_Float64DefaultDense(nClasses)
algorithm.input.set(classifier.training.data,trainInput)
algorithm.input.set(classifier.training.labels,trainLabels)
where
algorithm: The algorithm object as defined in step a above.
trainInput: Training data.
trainLabels: Training labels.
model = algorithm.compute()
where
algorithm: The algorithm object as defined in step a above.
Create a function to test the model.
algorithm = prediction.Batch_Float64DefaultDense(nClasses)
algorithm.input.setTable(classifier.prediction.data, testInput)
algorithm.input.setModel(classifier.prediction.model, model.get(classifier.training.model))
where
algorithm: The algorithm object as defined in step a above.
testInput: Testing data.
model: Name of the model object.
Prediction = algorithm.compute()
where
algorithm: The algorithm object as defined in step a above.
Prediction: Prediction result that contains the predicted labels for the test data.
SVM is a powerful classification algorithm that works well when there is a clear margin of separation between the classes. Intel DAAL provides an optimized implementation of the SVM algorithm. By using Intel DAAL, developers can take advantage of new features in future generations of Intel® Xeon® processors without having to modify their applications. They only need to link their applications to the latest version of Intel DAAL.