# Intel® Summary Statistics Library: how to detect outliers in datasets?

Earlier I computed various statistical estimates, such as the mean or the variance-covariance matrix, using Intel® Summary Statistics Library. In those cases I knew for sure that my datasets did not contain outliers, that is, “bad” observations that do not belong to the distribution I observed. However, sometimes we have to deal with datasets that are contaminated with outliers. Sometimes this happens because the data collection process is not very reliable, as with microarray technologies for measuring gene expression levels. In other cases outliers are the result of intentional actions such as network intrusion. Either way, outliers can lead to biased estimates and wrong conclusions about the object under study. How do we deal with such datasets? We can use the outlier detection tool from Intel® Summary Statistics Library.

To see how this tool works, I generated data from a multivariate Gaussian distribution using the corresponding generator available in Intel® Math Kernel Library. Then I replaced some of the observations with points from a multivariate Gaussian distribution that had a significantly higher mathematical expectation. The share of outliers was ~20%. Let us see how the BACON algorithm for outlier detection, available in Intel® Summary Statistics Library, identifies the outliers.

I will not repeat the four-stage model of library usage, as I described it in more detail earlier. Instead, I am going to consider the most important elements of the outlier detection method. Before using the algorithm I should initialize its parameters. First, I need to define the initialization scheme of the algorithm. The library gives me two options: median-based and Mahalanobis-distance-based schemes. I also need to define the rejection level alpha and the stopping criterion level beta (details on the parameters are provided in the library manual). This can be done as shown in the code below:

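    init_method = VSL_SS_BACON_MEDIAN_INIT_METHOD; /* median-based initialization scheme */
    alpha = 0.05;                                  /* rejection level */
    beta  = 0.005;                                 /* stopping criterion level */
    BaconN = 3;                                    /* number of BACON parameters */
    BaconParams[0] = init_method;
    BaconParams[1] = alpha;
    BaconParams[2] = beta;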

As usual, I need to pass the parameters of interest to the library using the suitable editor, as shown below: