The BACON algorithm is an outlier detection tool: it flags "suspicious" observations so that statistical estimates for a contaminated dataset can be computed without the bias those observations introduce.
Consider a dataset generated from a multivariate Gaussian distribution using the corresponding generator in Intel® Math Kernel Library (Intel® MKL). Some of the observations are then replaced with outliers drawn from a multivariate Gaussian distribution with a significantly larger mean. Approximately 20% of the observations are outliers.
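The contaminated dataset described above can be imitated in plain C. This is only an illustrative stand-in and does not use the Intel MKL generator: it assumes independent coordinates with unit variance, approximates each Gaussian deviate by a sum of 12 uniform values, and the names make_contaminated, eps, and shift are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Uniform deviate in [0, 1) built on the C library rand(). */
static double uniform01(void)
{
    return (double)rand() / ((double)RAND_MAX + 1.0);
}

/*
 * Approximate standard normal deviate: the sum of 12 uniform values
 * minus 6 has mean 0 and variance 1 (an approximation, not an exact
 * Gaussian generator).
 */
static double approx_std_normal(void)
{
    double s = 0.0;
    int k;
    for (k = 0; k < 12; k++)
        s += uniform01();
    return s - 6.0;
}

/*
 * Fill x (n observations by p variables, row-major). Each observation is
 * drawn around mean `shift` with probability eps (an outlier) and around
 * mean 0 otherwise. is_outlier[i] is set to 1 for outlier rows, else 0.
 * Returns the number of outlier rows.
 */
int make_contaminated(double *x, int *is_outlier, int n, int p,
                      double eps, double shift, unsigned seed)
{
    int i, j, count = 0;
    assert(n > 0 && p > 0 && eps >= 0.0 && eps <= 1.0);
    srand(seed);
    for (i = 0; i < n; i++) {
        is_outlier[i] = (uniform01() < eps);
        for (j = 0; j < p; j++)
            x[i * p + j] = approx_std_normal() + (is_outlier[i] ? shift : 0.0);
        count += is_outlier[i];
    }
    return count;
}
```

Calling make_contaminated with eps = 0.2 gives roughly 20% outliers, and the is_outlier array can later be compared with the weights reported by the detection step.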
To detect the outliers using the BACON algorithm, do the following:
Initialize the algorithm parameters:
Define the initialization scheme of the algorithm. You can choose between Median- and Mahalanobis distance-based schemes.
Define the rejection level alpha and the stopping criterion level beta.
The parameters are initialized as follows:
init_method = VSL_SS_METHOD_BACON_MEDIAN_INIT;
alpha = 0.05;
beta = 0.005;
BaconN = VSL_SS_BACON_PARAMS_N;
BaconParams[0] = init_method;
BaconParams[1] = alpha;
BaconParams[2] = beta;
For details on the parameters, see the table "Structure of the Array of BACON Parameters" in the Summary Statistics section of [MKLMan].
Pass the parameters into the library using a suitable editor:
errcode = vsldSSEditOutliersDetection( task, &BaconN, BaconParams, BaconWeights );
The BaconWeights parameter is an array of weights that holds the output of the algorithm and marks suspicious observations. The size of the array equals the number of observations. A value of 0 in the i-th position indicates that the i-th observation is suspicious and requires special attention. A value of 1 indicates that the observation is not flagged as an outlier.
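Reading the weights array can be sketched as follows. The helper name collect_flagged is hypothetical and is not part of Intel MKL. The sketch assumes the weights are stored as double values equal to 0 or 1, as described above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collect the indices of observations whose weight is 0, that is, the
 * observations flagged as suspicious. Weights are expected to be exactly
 * 0.0 or 1.0; the 0.5 threshold only avoids testing floating-point
 * values for equality. flagged_idx must have room for n entries.
 * Returns the number of flagged observations.
 */
int collect_flagged(const double *weights, int n, int *flagged_idx)
{
    int i, count = 0;
    assert(weights != NULL && flagged_idx != NULL && n >= 0);
    for (i = 0; i < n; i++)
        if (weights[i] < 0.5)
            flagged_idx[count++] = i;
    return count;
}
```

For the weights {1, 0, 1, 1, 0, 1}, the function returns 2 and stores the indices 1 and 4.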
Call the Compute routine:
errcode = vsldSSCompute( task, VSL_SS_OUTLIERS, VSL_SS_METHOD_BACON );
When the computation completes, the BaconWeights array holds a weight for each observation, with 0 marking the observations that have to be analyzed. You can use this array in further data processing: register it as an array of observation weights and use it in the usual manner. As expected, once all outliers are excluded, the statistical estimates for the contaminated dataset are no longer biased.
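The effect of using the 0/1 weights as observation weights can be illustrated with a weighted mean written in plain C. This is a sketch of the idea only and not the library's implementation. The function name weighted_mean and the row-major layout are assumptions.

```c
#include <assert.h>

/*
 * Weighted column means of x (n observations by p variables, row-major).
 * With 0/1 weights this is the mean over the observations that were kept.
 * Returns the sum of the weights; mean[] is left untouched if that sum
 * is zero.
 */
double weighted_mean(const double *x, const double *w, int n, int p,
                     double *mean)
{
    double wsum = 0.0;
    int i, j;
    assert(n > 0 && p > 0);
    for (i = 0; i < n; i++)
        wsum += w[i];
    if (wsum == 0.0)
        return 0.0;
    for (j = 0; j < p; j++) {
        double acc = 0.0;
        for (i = 0; i < n; i++)
            acc += w[i] * x[i * p + j];
        mean[j] = acc / wsum;
    }
    return wsum;
}
```

For the one-dimensional sample {1, 2, 3, 100, 2}, unit weights give a mean of 21.6, while setting the weight of the fourth observation to 0 gives a mean of 2.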