How does a company forecast how much it can spend on future advertising in order to increase sales? Similarly, how much should a company spend in future training programs in order to improve productivities?
In both cases, those companies rely on the historical relationship between variables like, in the former case, the relationship between the sale value and advertising spending to predict the future outcome.
This is a typical regression problem that can be best solved using machine learning1.
This article describes a common type of regression analysis called linear regression2 and how the Intel® Data Analytics Acceleration Library (Intel® DAAL)3 helps optimize this algorithm when running it on systems equipped with Intel® Xeon® processors.
Linear regression (LR) is the most basic type of regression that is used for predictive analysis. LR shows a linear relationship between variables and how one variable can be affected by one or more variables. The variable which is impacted by the others is called a dependent, response, or outcome variable and the others are called independent, explanatory, or predictor variables.
In order to use LR analysis, we need to examine whether LR is suitable for this set of data. To do that we need to observe how the data is distributed. Let's look at the following two graphs:
Figure 1. A straight line can be fitted through the data points
Figure 2. A straight line cannot be fitted through these data points
In figure 1, we can fit a straight line through the data points; however, there is no way to do that with the data points in figure 2. Only the curved (non-linear) line can be fitted through the data points in figure 2. Therefore, linear regression analysis can be done on the dataset in figure 1, but not on that in figure 2.
Depending on the number of independent variables, LR is divided into two types: simple linear regression (SLR) and multiple linear regression (MLR).
LR is called SLR when there exists only one independent variable, whereas MLR has more than one independent variable.
The simplest form of the equation with one dependent and one independent variable is defined as:
y = Ax + B (1)
y: Dependent variable
x: Independent variable
A: Regression coefficient or slope of the line
The question is how to find the best-fit line so that the difference between the observed and predicted values of the dependent variable (y) is minimum. In other words, find A and B in equation (1) such that |yobserved – ypredicted| is minimum.
The task to find the best-fit line can be done using the least squares method4. It calculates the best-fit line by minimizing the sum of the squares of the vertical differences (difference between the observed and predicted values of y) from each data point to the line. The vertical difference can also be called residual.
Figure 3. Linear regression graph
From figure 3 the green dots represent the actual data points. The black dots show the vertical (not perpendicular) projection of the data points onto the regression line (red line). The black dots are also called the predicted data. The vertical difference between the actual data and the predicted data is called residual.
Some of the applications that can make good use of linear regression:
Some of the pros and cons of LR:
Intel DAAL is a library consisting of many basic building blocks that are optimized for data analytics and machine learning. These basic building blocks are highly optimized for the latest features of the latest Intel® processors. LR is one of the predictive algorithms that Intel DAAL provides. In this article, we use the Python* API of Intel DAAL to build a basic LR predictor. To install DAAL, follow the instructions in How to install the Python* Version of Intel® DAAL in Linux*5.
This section shows how to invoke the linear regression method in Python6 using Intel DAAL.
The reference section provides a link7 to the free datasets that can be used to test the application.
Do the following steps to invoke the algorithm from Intel DAAL:
import numpy as np
from daal.data_management import HomogenNumericTable
from daal.data_management import ( DataSourceIface, FileDataSource, HomogenNumericTable, MergeNumericTable, NumericTableIface)
from daal.algorithms.linear_regression import training, prediction
trainDataSet = FileDataSource( trainDatasetFileName, DataSourceIface.notAllocateNumericTable, DataSourceIface.doDictionaryFromContext )
trainInput = HomogenNumericTable(nIndependentVar, 0, NumericTableIface.notAllocate) trainDependentVariables = HomogenNumericTable(nDependentVariables, 0, NumericTableIface.notAllocate) mergedData = MergedNumericTable(trainData, trainDependentVariables)
Note: This algorithm uses the normal equation to solve the linear least squares problem. DAAL also supports QR decomposition/factorization.
algorithm = training.Batch_Float64NormEqDense()
algorithm.input.set(training.data, trainInput) algorithm.input.set(training.dependentVariables, trainDependentVariables)
algorithm: The algorithm object as defined in step a above. train
Input: Training data. trainDependent
Variables: Training dependent variables
trainResult = algorithm.compute()
algorithm: The algorithm object as defined in step a above.
testDataSet = FileDataSource( testDatasetFileName, DataSourceIface.doAllocateNumericTable, DataSourceIface.doDictionaryFromContext )
testInput = HomogenNumericTable(nIndependentVar, 0, NumericTableIface.notAllocate) testTruthValues = HomogenNumericTable(nDependentVariables, 0, NumericTableIface.notAllocate) mergedData = MergedNumericTable(testDataSet, testTruthValues)
algorithm = prediction.Batch()
algorithm.input.setTable(prediction.data, testInput) algorithm.input.setModel(prediction.model, trainResult.get(training.model))
algorithm: The algorithm object as defined in step a above.
testInput: Testing data.
Prediction = algorithm.compute()
Linear regression is a very common predictive algorithm. Intel DAAL optimized the linear regression algorithm. By using Intel DAAL, developers can take advantage of new features in future generations of Intel Xeon processors without having to modify their applications. They only need to link their applications to the latest version of Intel DAAL.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804