How do credit card companies detect fraud or abuse? How do network administrators discover intrusions? How do scientists know whether or not their experiments run correctly?
To do these things, they analyze a data set and look for data points that deviate from the norm. For example, credit card companies look for unusually high charges in certain transactions or strange buying behaviors. These actions might indicate that a credit card has been stolen. Network administrators search log files for irregular activity on the network, such as an unusual load from some locations or network access from a foreign IP address, which are good indications of a potential network intrusion. Similarly, scientists check whether data falls outside the normal or expected ranges as an indicator that an experiment is not running correctly.
These types of unusual or irregular activities are called outliers or anomalies. This article describes different methods to detect outliers1 in the data and how the Intel® Data Analytics Acceleration Library (Intel® DAAL)2 helps optimize outlier detection when running it on systems equipped with Intel® Xeon® processors.
An outlier is a data point that is significantly different (abnormal or irregular) or deviates from the remaining data (see Figure 1).
Figure 1: Outlier case #1.
Each purple dot represents a data point in a data set. In the graph, the two isolated data points are considered outliers because they lie very far from the rest of the data points.
Figure 2: Outlier case #2.
Figure 2 shows another case of outliers. In this case a data set is grouped into three groups (clusters). Any data points that lie outside the groups are considered outliers.
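The cluster-based case can be sketched in a few lines of Python. This is only an illustration, not a method from Intel DAAL: the cluster centers, the `radius` cutoff, and the sample points are all hypothetical values chosen for the example, and a point is treated as an outlier when it lies farther than `radius` from every center.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def nearest_center_distance(point, centers):
    """Return the distance from point to the closest cluster center."""
    return min(dist(point, c) for c in centers)

# Hypothetical centers of three clusters and a hypothetical cutoff radius.
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
radius = 2.0

points = [(0.5, -0.5), (9.0, 1.0), (5.0, 7.0), (5.0, 3.0), (15.0, 9.0)]

# A point that is not within `radius` of any center lies outside all groups.
outliers = [p for p in points if nearest_center_distance(p, centers) > radius]
print(outliers)
```

Here the first three points fall inside a cluster, while the last two are far from all three centers and are reported as outliers.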
Figure 3: Outlier case #3.
Figure 3 shows another case of outliers. Although the data set is again grouped into clusters, this case differs from that in Figure 2 because of the density of the data points. In Figure 2, data points are almost uniformly distributed within the groups, while the groups in Figure 3 have different densities.
Outliers can be both good and bad. By detecting irregular activities (outliers) in the network, network administrators can potentially discover and prevent intrusions. On the other hand, outliers can distort calculation results, so detecting and eliminating them helps minimize that impact. Outliers can skew and mislead the training process of machine learning3 algorithms, resulting in longer training times and less accurate models. For example, in the case of a K-means clustering algorithm, outliers in the data set will pull the centroid of a cluster away from its intended location.
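The effect on a centroid can be seen with a small sketch. The points below are made-up values, and the centroid is computed as the plain coordinate-wise mean of a cluster's points, which is how K-means updates a centroid.

```python
def centroid(points):
    """Coordinate-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

# Hypothetical tight cluster and a single distant outlier.
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
outlier = (11.5, 11.5)

clean = centroid(cluster)               # centroid of the cluster alone
skewed = centroid(cluster + [outlier])  # centroid once the outlier is included

print(clean, skewed)
```

A single distant point moves the centroid from the middle of the cluster to a location outside it, which is why removing outliers before clustering can matter.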
Common outlier causes include the following:
A common way to detect outliers is to plot the data set and then inspect the graph, as in Figures 1–3.
In the second edition of the book Outlier Analysis,4 Charu C. Aggarwal provides the following outlier detection methods:
Because outlier detection methods can detect strange or abnormal data, they can be used to:
These are just some of the applications of outlier detection methods; there are many more.
Intel DAAL is a library of basic building blocks for data analytics and machine learning. These building blocks are highly optimized for the latest features of the latest Intel® processors. In this article, we use the Python* API of Intel DAAL to illustrate how to invoke outlier detection functions. To install it, follow the instructions in the Intel DAAL documentation.5
From the Intel® DAAL manual, the following paragraph describes what a univariate outlier is and the formula to define the outlier region:
“Given a set $X$ of $n$ feature vectors $x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})$ of dimension $p$, the problem is to identify the vectors that do not belong to the underlying distribution. The algorithm for univariate outlier detection considers each feature independently. The univariate outlier detection method can be parametric, assumes a known underlying distribution for the data set, and defines an outlier region such that if an observation belongs to the region, it is marked as an outlier. Definition of the outlier region is connected to the assumed underlying data distribution. The following is an example of an outlier region for the univariate outlier detection:
$$\mathrm{Outlier}(\alpha_n, m_n, \sigma_n) = \left\{ x : \frac{|x - m_n|}{\sigma_n} > g(n, \alpha_n) \right\}$$
where $m_n$ and $\sigma_n$ are (robust) estimates of the mean and standard deviation computed for a given data set, $\alpha_n$ is the confidence coefficient, and $g(n, \alpha_n)$ defines the limits of the region and should be adjusted to the number of observations.”
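To make the idea of an outlier region concrete, here is a minimal sketch in plain Python (it does not use Intel DAAL). It assumes a region of the form |x − m| / σ > g, with a fixed limit g = 3 standing in for g(n, α_n); that limit, the sample readings, and the choice of the median and scaled median absolute deviation (MAD) as the robust estimates are all assumptions made for illustration.

```python
from statistics import mean, median, pstdev

def outlier_flags(values, center, scale, threshold=3.0):
    """Flag x as an outlier when |x - center| / scale > threshold."""
    return [abs(x - center) / scale > threshold for x in values]

def mad_scale(values):
    """Robust scale estimate: median absolute deviation, multiplied by
    1.4826 so it is comparable to the standard deviation for normal data."""
    med = median(values)
    return 1.4826 * median(abs(x - med) for x in values)

# Hypothetical measurements of one feature; the last value is suspicious.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0]

# Non-robust estimates: the outlier inflates both the mean and the deviation.
classic = outlier_flags(readings, mean(readings), pstdev(readings))

# Robust estimates: the median and MAD are barely affected by the outlier.
robust = outlier_flags(readings, median(readings), mad_scale(readings))

print(classic)
print(robust)
```

With the non-robust estimates, the extreme reading inflates the standard deviation enough to stay just inside the limit, so nothing is flagged. With the robust estimates, only the last reading is flagged, which suggests why the quoted definition calls for robust estimates.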
This section shows how to invoke the outlier method in Python6 using Intel DAAL.
The following steps are used to invoke the univariate outlier detection algorithm from Intel DAAL:
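from daal.data_management import FileDataSource, writeOnly, DataSourceIface, BlockDescriptor_Float64
from daal.algorithms.univariate_outlier_detection import InitIface, Batch_Float64DefaultDense, data, weights
from utils import printNumericTable  # helper from the Intel DAAL Python examples (assumed to be on the path)
# trainDatasetFileName must hold the path to the input data file
# Load the data set from the file into a numeric table
DataSet = FileDataSource(trainDatasetFileName, DataSourceIface.doAllocateNumericTable, DataSourceIface.doDictionaryFromContext)
DataSet.loadDataBlock()
nFeatures = DataSet.getNumberOfColumns()
# Create the algorithm object and pass it the input data
algorithm = Batch_Float64DefaultDense()
algorithm.input.set(data, DataSet.getNumericTable())
# Compute the outlier detection results and print them
results = algorithm.compute()
printNumericTable(results.get(weights), "outlier results")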
Note: some common data sets can be found at the UCI Machine Learning Repository.7
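For a quick experiment without downloading a data set, a small input file can also be generated locally. The sketch below is an assumption-laden convenience, not part of Intel DAAL: the file name `outlier_sample.csv`, the helper `write_sample_csv`, and the planted value are hypothetical, and it assumes the data source accepts headerless, comma-separated numeric rows.

```python
import csv
import os
import random
import tempfile

def write_sample_csv(path, n_rows=100, n_features=3, seed=0):
    """Write normally distributed rows to a CSV file, with one planted outlier."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_rows)]
    rows[0][0] = 25.0  # planted outlier in the first feature of the first row
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows

# Hypothetical output location in the system temporary directory.
sample_path = os.path.join(tempfile.gettempdir(), "outlier_sample.csv")
sample_rows = write_sample_csv(sample_path)
```

The resulting path could then be used as the input file name in the steps above, and the planted value gives a known outlier to look for in the results.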
Outlier detection plays an important role in fraud detection, network security, and more. Intel DAAL optimizes the outlier detection methods by taking advantage of new features in future generations of Intel Xeon processors when running the methods on computer systems equipped with these processors.