Intel® Summary Statistics Library: how to use the robust methods?

Intel® Summary Statistics Library provides several ways to process datasets “contaminated” with outliers. Earlier I demonstrated how to detect “suspicious” observations in a dataset, and some notes on the performance of the outlier detection algorithm are provided here. Another way to treat outliers is to use the robust methods available in the library. In this post I discuss how to use the algorithms for robust estimation of the mean and the variance-covariance matrix, as they deserve special attention.


The robust methods in Intel® Summary Statistics Library form a single solution built from two algorithms: Maronna [1] and TBS (translated biweight S-estimator) [2]. The first algorithm computes the starting “point” (covariance and mean) for the second one. The TBS algorithm iterates until the required accuracy is achieved or the maximal number of iterations is completed. In addition to these two parameters, you can pass to the library the maximal breakdown point (the fraction of outliers the algorithm can withstand) and the asymptotic rejection probability (ARP) [2]. To skip the TBS iterations and compute the robust estimates of mean and covariance with the Maronna algorithm alone, it is enough to set the number of iterations to zero.


As usual, using Intel® Summary Statistics Library involves four stages: creating the task, editing the task parameters, computing the statistical estimates, and de-allocating the resources. As all these steps are described here, I only show how to use the editor for robust methods and the Compute routine. The parameters of the algorithms (breakdown point, ARP, accuracy, and maximal number of TBS iterations) are passed as an array:




breakdown_point = 0.2;
arp             = 0.001;
method_accuracy = 0.001;
iter_num        = 5;

robust_method_params[0] = breakdown_point;
robust_method_params[1] = arp;
robust_method_params[2] = method_accuracy;
robust_method_params[3] = iter_num;
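Since the four settings travel as a plain array, it is easy to mix up their positions. The sketch below wraps the packing in a small helper. The helper name, the enum names, and the accepted value ranges are my assumptions for illustration and are not part of the library; only the order of the elements comes from the snippet above. I also assume the array has type double, as the "d" in the vsldSS routine names suggests, so the iteration count is stored as a double.

```c
#include <assert.h>
#include <stddef.h>

/* Positions in the parameter array, in the order used in this article.
 * The enum names are hypothetical, not library constants. */
enum {
    TBS_BREAKDOWN_POINT = 0,
    TBS_ARP             = 1,
    TBS_ACCURACY        = 2,
    TBS_MAX_ITER        = 3,
    TBS_PARAMS_COUNT    = 4
};

/* Hypothetical helper: checks the four settings and writes them into
 * params[0..3]. The accepted ranges are assumptions of this sketch. */
void pack_tbs_params(double breakdown_point, double arp,
                     double accuracy, int max_iter,
                     double params[TBS_PARAMS_COUNT])
{
    assert(params != NULL);
    assert(breakdown_point > 0.0 && breakdown_point <= 0.5);
    assert(arp > 0.0 && arp < 1.0);
    assert(accuracy > 0.0);
    assert(max_iter >= 0);  /* 0 means: Maronna estimate only */

    params[TBS_BREAKDOWN_POINT] = breakdown_point;
    params[TBS_ARP]             = arp;
    params[TBS_ACCURACY]        = accuracy;
    params[TBS_MAX_ITER]        = (double)max_iter;  /* stored as a double */
}
```

With the values from this article the call would be pack_tbs_params(0.2, 0.001, 0.001, 5, robust_method_params).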



We also need memory, t_est and cov_est, where the robust estimates of the mean and the covariance will be stored. In the example below the covariance matrix is stored in full format, as specified in the variable robust_cov_storage; robust_params_n holds the number of elements in the parameter array:
errcode = vsldSSEditRobustCovariance( task, &robust_cov_storage, &robust_params_n, robust_method_params, t_est, cov_est );
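Full format means that all p*p entries of the covariance matrix are kept, so cov_est needs room for p*p values and t_est for p values. The sketch below shows one way to read such a buffer and to print its lower triangle in the same layout as the listings later in this post. Both function names are hypothetical helpers, not library routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Element (i, j) of a p-by-p matrix held in full format (all p*p entries).
 * A covariance matrix is symmetric, so row-major and column-major
 * layouts return the same value here. */
double cov_at(const double *cov, int p, int i, int j)
{
    assert(cov != NULL && p > 0);
    assert(i >= 0 && i < p && j >= 0 && j < p);
    return cov[(size_t)i * (size_t)p + (size_t)j];
}

/* Prints the lower triangle, one matrix row per line.
 * Returns the number of values printed, which is p*(p+1)/2. */
int print_lower_triangle(FILE *out, const double *cov, int p)
{
    int printed = 0;
    for (int i = 0; i < p; ++i) {
        for (int j = 0; j <= i; ++j) {
            fprintf(out, j ? " %.4f" : "%.4f", cov_at(cov, p, i, j));
            ++printed;
        }
        fputc('\n', out);
    }
    return printed;
}
```

After the Compute routine has filled cov_est, print_lower_triangle(stdout, cov_est, p) would produce a listing of the kind shown below.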

The estimates are computed by the Compute routine:
errcode = vsldSSCompute( task, VSL_SS_ROBUST_COVARIANCE, VSL_SS_ROBUST_TBS_METHOD );

To give an idea of how Intel® Summary Statistics Library copes with outliers, I create a task with dimension p = 10 and number of observations n = 10,000. The dataset is generated from a multivariate Gaussian distribution with zero mean and a covariance matrix that has 1 on the main diagonal and 0.05 in all other entries. I then contaminate the dataset with shift outliers that follow a multivariate Gaussian distribution with the same covariance matrix and a mean vector whose components are all equal to 5.



Using the non-robust algorithm to estimate the covariance and mean for this dataset results in biased estimates. It is no surprise that the p-values for these estimates are zero.

Means:
0.2566 0.2583 0.2576 0.2633 0.2439 0.2556 0.2530 0.2716 0.2535 0.2519

Covariance (lower triangle):
2.2540
1.2715 2.1819
1.2852 1.2462 2.2046
1.2885 1.2684 1.2553 2.2310
1.2850 1.2581 1.2571 1.2526 2.2112
1.2650 1.2284 1.2419 1.2820 1.2430 2.1929
1.2789 1.2435 1.2550 1.2555 1.2574 1.2478 2.2113
1.2773 1.2692 1.2676 1.2751 1.2725 1.2733 1.2739 2.2448
1.2813 1.2579 1.2688 1.2723 1.2670 1.2713 1.2839 1.3061 2.2246
1.2696 1.2631 1.2515 1.2701 1.2597 1.2686 1.2554 1.2638 1.2780 2.1893
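The size of this bias agrees with a simple mixture calculation. Assume that a fraction ε of the observations are shift outliers with mean μ = 5 in every coordinate, and that both groups share the covariance matrix described above. The fraction ε is not stated in this post, so it is an assumption of this sketch. The non-robust estimates then approach:

```latex
\begin{aligned}
\mathrm{E}[x_j] &= \varepsilon\mu, \\
\operatorname{Var}(x_j) &= 1 + \varepsilon(1-\varepsilon)\mu^2, \\
\operatorname{Cov}(x_j, x_k) &= 0.05 + \varepsilon(1-\varepsilon)\mu^2, \qquad j \ne k.
\end{aligned}
```

With ε = 0.05 this gives a mean of 0.25, a variance of 1 + 0.0475 · 25 = 2.1875 and a covariance of 0.05 + 1.1875 = 1.2375. These values are close to the estimates listed above, which suggests a contamination level of about 5%, although that is an inference from the output and not a figure given here.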



Using the Maronna algorithm alone (that is, iter_num = 0) gives the following estimates:




Means:
-0.0022 0.0081 -0.0075 0.0049 -0.0054 0.0012 -0.0087 0.0194 -0.0073 0.0022

p-values for means:
0.1792 0.6077 0.5640 0.3869 0.4281 0.1014 0.6375 0.9570 0.5602 0.1846

Covariance (lower triangle):
0.9164
0.0605 0.8945
0.0617 0.0374 0.9269
0.0602 0.0570 0.0472 0.9294
0.0584 0.0469 0.0599 0.0443 0.9183
0.0552 0.0394 0.0395 0.0655 0.0484 0.9049
0.0487 0.0449 0.0471 0.0451 0.0564 0.0461 0.9186
0.0293 0.0555 0.0539 0.0456 0.0450 0.0574 0.0501 0.9149
0.0507 0.0339 0.0433 0.0504 0.0429 0.0603 0.0597 0.0696 0.8962
0.0375 0.0573 0.0470 0.0472 0.0502 0.0607 0.0420 0.0381 0.0484 0.8848

p-values for covariance (lower triangle):
0.0000
0.2989 0.0000
0.2966 0.5842 0.0000
0.3471 0.4395 0.9592 0.0000
0.3994 0.9148 0.3590 0.8993 0.0000
0.5128 0.7023 0.6708 0.1869 0.8510 0.0000
0.8508 0.9752 0.9515 0.9411 0.4812 0.9714 0.0000
0.2669 0.4841 0.6001 0.9729 0.9530 0.4207 0.7751 0.0000
0.7151 0.4529 0.8765 0.7468 0.8689 0.2968 0.3317 0.0984 0.0000
0.6082 0.3734 0.9088 0.8997 0.7250 0.2720 0.8321 0.6358 0.7895 0.0000



These estimates are much better; however, the entries on the main diagonal of the covariance matrix still have zero p-values. To improve the estimate I run 5 iterations of the TBS algorithm (my quick experiments show that iterating longer does not make sense, as the estimates do not change significantly):




Means:
-0.0018 0.0034 0.0026 0.0067 -0.0108 0.0012 -0.0024 0.0122 -0.0057 -0.0044

p-values for means:
0.1412 0.2612 0.2025 0.4860 0.7098 0.0943 0.1882 0.7693 0.4263 0.3381

Covariance (lower triangle):
1.0524
0.0583 1.0172
0.0757 0.0426 1.0403
0.0653 0.0630 0.0490 1.0538
0.0672 0.0604 0.0559 0.0462 1.0367
0.0493 0.0295 0.0434 0.0784 0.0442 1.0261
0.0620 0.0429 0.0509 0.0453 0.0491 0.0488 1.0397
0.0410 0.0503 0.0476 0.0507 0.0497 0.0514 0.0497 1.0367
0.0450 0.0370 0.0486 0.0464 0.0430 0.0526 0.0622 0.0719 1.0179
0.0477 0.0587 0.0461 0.0562 0.0514 0.0645 0.0443 0.0346 0.0485 1.0070

p-values for covariance (lower triangle):
0.0002
0.6951 0.2249
0.1676 0.5972 0.0044
0.4613 0.5057 0.8450 0.0001
0.3761 0.5862 0.8152 0.7231 0.0095
0.8726 0.1942 0.6233 0.1170 0.6604 0.0646
0.5690 0.6118 0.9464 0.6795 0.8671 0.8653 0.0050
0.5092 0.9507 0.7992 0.9266 0.9002 0.9932 0.8944 0.0094
0.6867 0.4013 0.8656 0.7504 0.6147 0.9305 0.5185 0.2177 0.2065
0.8205 0.6243 0.7594 0.7800 0.9869 0.4071 0.6776 0.3207 0.8961 0.6185
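Whether extra iterations still change the result can be checked by comparing two estimates entry by entry, for example the covariance after k and after k + 1 iterations. The sketch below is a minimal helper for that comparison. The function name is hypothetical, it is not a library routine, and the threshold that counts as "no significant change" is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute difference between matching entries of two arrays
 * of equal length, e.g. two covariance estimates in the same storage. */
double max_abs_diff(const double *a, const double *b, size_t count)
{
    assert(a != NULL && b != NULL);
    double worst = 0.0;
    for (size_t k = 0; k < count; ++k) {
        double d = a[k] - b[k];
        if (d < 0.0) {
            d = -d;  /* absolute value without needing math.h */
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

For two full-format covariance buffers the call would be max_abs_diff(cov_a, cov_b, (size_t)p * p), and for two mean vectors max_abs_diff(t_a, t_b, p).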




References


1. R.A. Maronna and R.H. Zamar. Robust Estimates of Location and Dispersion for High-Dimensional Datasets. Technometrics, 44(4), 307–317, 2002.

2. D.M. Rocke. Robustness Properties of S-Estimators of Multivariate Location and Shape in High Dimension. The Annals of Statistics, 24(3), 1327–1345, 1996.