# Intel® Summary Statistics Library: how to detect outliers in datasets?

By Dmitry Kabaev (Intel), published on August 29, 2008

Earlier I computed various statistical estimates, such as the mean and the variance-covariance matrix, using Intel® Summary Statistics Library. In those cases I knew for sure that my datasets contained no outliers, that is, no “bad” observations that do not belong to the distribution I was observing. In some cases, however, we have to deal with datasets that are contaminated with outliers. Sometimes this happens because the data collection process is not very reliable, as with microarray technologies for measuring gene expression levels. In other cases outliers are the result of intentional actions, such as network intrusion. Either way, outliers can lead to biased estimates and wrong conclusions about the object under study. How do we deal with such datasets? One answer is the outlier detection tool in Intel® Summary Statistics Library.

To see how this tool works, I generated data from a multivariate Gaussian distribution using the corresponding generator available in Intel® Math Kernel Library. Then I replaced some of the observations with points from a multivariate Gaussian distribution with a significantly higher mean. About 20% of the observations were outliers. Let us see how the BACON outlier detection algorithm available in Intel® Summary Statistics Library identifies them.
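The contamination setup described above can be imitated without any library. The following is only a rough sketch under several assumptions: it uses the C standard `rand()` and a sum-of-12-uniforms approximation to the normal distribution instead of the Intel® Math Kernel Library generator, it draws independent coordinates (identity covariance), and the names `make_contaminated`, `n_out`, and `shift` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Approximate standard normal variate: the sum of 12 uniform(0,1)
   draws minus 6 has mean 0 and variance 1. This is a crude stand-in
   for a proper Gaussian generator. */
static double approx_std_normal(void)
{
    double s = 0.0;
    for (int k = 0; k < 12; ++k) {
        s += rand() / ((double)RAND_MAX + 1.0);
    }
    return s - 6.0;
}

/* Fill x (n observations of dimension p, stored row by row) with
   approximately N(0,1) coordinates, then add `shift` to every
   coordinate of the first n_out observations so that they act as
   outliers. is_outlier[i] is 1 for shifted rows and 0 otherwise. */
void make_contaminated(double *x, int *is_outlier, int n, int p,
                       int n_out, double shift, unsigned seed)
{
    assert(x != NULL && is_outlier != NULL);
    assert(n > 0 && p > 0 && n_out >= 0 && n_out <= n);
    srand(seed);
    for (int i = 0; i < n; ++i) {
        is_outlier[i] = (i < n_out) ? 1 : 0;
        for (int j = 0; j < p; ++j) {
            x[i * p + j] = approx_std_normal()
                         + (is_outlier[i] ? shift : 0.0);
        }
    }
}
```

Calling `make_contaminated(x, labels, 1000, 2, 200, 10.0, 42u)` would give a dataset with 20% of the rows shifted, roughly matching the proportion used in the experiment. Keeping the `labels` array makes it possible to check the detector's output later.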

I will not repeat the four-stage model of library usage, as I described it in more detail earlier. Instead, I will focus on the elements that matter most for the outlier detection method. Before using the algorithm I need to initialize its parameters. First, I need to choose the initialization scheme of the algorithm. The library gives me two options: median-based and Mahalanobis-distance-based schemes. I also need to set the rejection level alpha and the stopping criterion level beta (details on the parameters are provided in the library manual). This can be done as shown in the code below:

init_method = VSL_SS_BACON_MEDIAN_INIT_METHOD;

alpha = 0.05;

beta = 0.005;

BaconN = 3;

BaconParams[0] = init_method;

BaconParams[1] = alpha;

BaconParams[2] = beta;
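For intuition about the Mahalanobis-distance-based scheme mentioned above, here is the textbook squared Mahalanobis distance for the two-dimensional case. This is only an illustration of the distance itself; how the library uses it internally is not described here, and the function name `mahalanobis_sq_2d` is hypothetical.

```c
#include <assert.h>

/* Squared Mahalanobis distance of a 2-D point x from mean m under the
   covariance matrix S = [[s11, s12], [s12, s22]]:
       d2 = (x - m)^T * inverse(S) * (x - m)
   The 2x2 inverse is written out explicitly. */
double mahalanobis_sq_2d(const double x[2], const double m[2],
                         double s11, double s12, double s22)
{
    double det = s11 * s22 - s12 * s12;
    /* S must be positive definite for the distance to be defined. */
    assert(s11 > 0.0 && det > 0.0);
    double dx = x[0] - m[0];
    double dy = x[1] - m[1];
    /* inverse(S) = (1/det) * [[s22, -s12], [-s12, s11]] */
    return (s22 * dx * dx - 2.0 * s12 * dx * dy + s11 * dy * dy) / det;
}
```

With an identity covariance this reduces to the squared Euclidean distance, so the point (3, 4) is at squared distance 25 from the origin. Points far from the bulk of the data in this metric are natural outlier candidates.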

As usual, I pass these parameters to the library using the corresponding editor, as shown below:

errcode = vsldSSEditOutliersDetection( task, &BaconN, BaconParams, BaconWeights );

What is the *BaconWeights* parameter? It is an array of weights that holds the results of the algorithm and points at suspicious observations. The size of the array equals the number of observations. A zero in the *i*-th position indicates that the *i*-th observation requires special attention, and a one indicates that the observation is “good”.
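Reading the result array is straightforward. As a small sketch, assuming the weights are stored as `double` values equal to exactly 0 or 1, a helper like the hypothetical `collect_flagged` below gathers the indices of the suspicious observations:

```c
#include <assert.h>
#include <stddef.h>

/* Collect the indices of observations whose weight is zero, i.e. the
   observations flagged as suspicious. Returns the number of flagged
   observations; flagged[] must have room for up to n indices. */
size_t collect_flagged(const double *weights, size_t n, size_t *flagged)
{
    assert(weights != NULL && flagged != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (weights[i] == 0.0) {
            flagged[count++] = i;
        }
    }
    return count;
}
```

For example, for weights `{1, 0, 1, 0, 0}` the helper reports three flagged observations at positions 1, 3, and 4 (counting from zero).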

The next step is to call the *Compute* function:

errcode = vsldSSCompute( task, VSL_SS_OUTLIERS_DETECTION, VSL_SS_BACON_METHOD );

Once the algorithm completes, the array *BaconWeights* contains the weights of the observations and needs to be analyzed. In my experiment I checked the contents of the array and found the results to be 100% correct: all the outliers were correctly identified, and there were no false alarms. The array can be used in further analysis: it is enough to register it as the array of observation weights and use it in the usual manner. As expected, after removal of the outliers the statistical estimates of interest were no longer biased.
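To illustrate why zero weights remove the influence of outliers, and how the result can be checked against known labels, here is a library-independent sketch. It assumes row-by-row storage of the observations and 0/1 weights; the names `weighted_mean` and `score_detection` are hypothetical and are not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Weighted mean of each coordinate:
       mean[j] = sum_i w[i] * x[i][j] / sum_i w[i]
   x holds n observations of dimension p, stored row by row.
   Returns 0 on success and -1 if the weights sum to zero. */
int weighted_mean(const double *x, const double *w,
                  size_t n, size_t p, double *mean)
{
    assert(x != NULL && w != NULL && mean != NULL && p > 0);
    double wsum = 0.0;
    for (size_t j = 0; j < p; ++j) {
        mean[j] = 0.0;
    }
    for (size_t i = 0; i < n; ++i) {
        wsum += w[i];
        for (size_t j = 0; j < p; ++j) {
            mean[j] += w[i] * x[i * p + j];
        }
    }
    if (wsum <= 0.0) {
        return -1;
    }
    for (size_t j = 0; j < p; ++j) {
        mean[j] /= wsum;
    }
    return 0;
}

/* Compare 0/1 weights with known labels (1 = true outlier).
   *missed counts true outliers that kept weight 1;
   *false_alarms counts good observations that received weight 0. */
void score_detection(const double *w, const int *is_outlier, size_t n,
                     size_t *missed, size_t *false_alarms)
{
    assert(w != NULL && is_outlier != NULL);
    assert(missed != NULL && false_alarms != NULL);
    *missed = 0;
    *false_alarms = 0;
    for (size_t i = 0; i < n; ++i) {
        if (is_outlier[i] && w[i] != 0.0) {
            ++*missed;
        }
        if (!is_outlier[i] && w[i] == 0.0) {
            ++*false_alarms;
        }
    }
}
```

For the one-dimensional data `{1, 2, 100, 3}` with weights `{1, 1, 0, 1}`, the weighted mean is 2, whereas the unweighted mean would be 26.5. A detection result with zero missed outliers and zero false alarms corresponds to the 100% correctness reported above.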