Intel® Summary Statistics Library: how to deal with missing observations?

Real life datasets can have missing values. Sociological surveys and measurement of complex biological systems are two examples where the researcher can arrive at the point in which he should do something with missing observations. One can also treat outliers in datasets as samples which are also lost. Intel® Summary Statistics Library already contains functionality to detect outliers or get robust estimates in presence of “suspicious” observations. In my previous posts I described how to use those algorithms. Now I’d like to tell about other functionality, solution for dealing with missing values which is part of Intel® Summary Statistics Library. 


The method for accurate processing of the datasets with missing points is based on the approach in [1]. Expectation-Maximization (EM) and Data Augmentation (DA) algorithms are integral components of this solution (let me call it EMDA solution). Output of the algorithm is m sets of simulated missing points which can be imputed into dataset, thus, producing m complete data copies. For each dataset one can compute a stat estimate of his interest; the final estimate is a combination of these m estimates.


The usage model for EMDA method is similar to that for other algorithms of the library and described here: creation of the task, editing its parameters, computing, and de-allocation of the resources. I show how to pass parameters of the EMDA method into the library and to call the Compute routine.  



EM algorithm does em_iter_num iterations to compute an initial estimate for mean and covariance which are then used to start DA algorithm. EM algorithm can terminate earlier if given accuracy em_accuracy is achieved. In its turn DA algorithm iterates da_iter_num times. It also uses Gaussian random numbers undeneath. For this reason Intel® Summary Statistics Library uses Vector Statistical Library (VSL) component of Intel® Math Kernel Library. BRNG, SEED, and METHOD are the parameters used to initialize VSL generators. Values of the parameters should follow VSL requirements. The EMDA algorithm also requires number of missing values missing_value_num. To compute it I pre-process the dataset and also mark missing values using macro defined in the library VSL_SS_DNAN (if I work in single precision floating point arithmetic I use VSL_SS_SNAN macro). Parameters of the algorithm are passed into the library as the array: 

    em_iter_num       = 10;


    da_iter_num        = 5;


    em_accuracy       = 0.001;


    copy_num           = m;


    miss_value_num   = miss_num;


    brng                     = BRNG;


    seed                     = SEED;


    method                = METHOD;


 


    mi_method_params[0] = em_iter_num;


    mi_method_params[1] = da_iter_num;        


    mi_method_params[2] = em_accuracy;       


    mi_method_params[3] = copy_num;          


    mi_method_params[4] = missing_value_num;


    mi_method_params[5] = brng;


    mi_method_params[6] = seed;             


    mi_method_params[7] = method; 



Editor for EMDA algorithm accepts set of parameters:




errcode=vsldSSEditMissingValues(task, &mi_method_params_n, mi_method_params, &initial_estimates_n, initial_estimates, &prior_n, prior, &simulated_missing_values_n, simulated_missing_values, &estimates_n, estimates);



Array mi_method_params is described above. Array of initial estimates initial_estimates is used to start EM algorithm; its first p positions are occupied with vector of mean and the rest p*(p+1)/2 entries hold upper-triangular part of variance-covariance matrix (p is dimension of the task). Array prior is intended to hold prior parameters for EMDA algorithm; current version of the library does not support user-defined priors, and default values are used. Sets of simulated missing points will be returned in array simulated_missing_values, in total m x missing_value_num values. Missing values are packed one by one, first all missing points for 1st variable of random vector, then missing values for 2nd variable, and so forth. To estimate convergence of DA algorithm one passes array estimates which will hold mean/covariance for all iterations and all sets of simulated missing points, in total m x da_iter_num x ( p + 0.5 x (p2 + p) ). Each set of estimates is packed as above: first p positions are intended for mean, and the rest 0.5 x (p2 + p) entries hold upper-triangular part of the covariance matrix.


Finally, EMDA algorithm is run using Compute routine of the library:




errcode=vsldSSCompute(task,VSL_SS_MISSING_VALUES,SL_SS_MULTIPLE_IMPUTATION);



To have an idea how this solution works I created the task with dimension p = 10 and number of observations n = 10,000. The dataset is generated from multivariate Gaussian distribution with zero mean and covariance matrix which holds 1 on the main diagonal and 0.05 in the rest entries of the matrix. Ratio of missing values in my dataset is 10%; each observation may have one lost point in any position. I’m going to generate m=100 sets of simulated lost points. Start “point” for EM algorithm is vector of zero means and identity covariance matrix; pointer to array prior is set to 0 (size of this array prior_n is also 0).


First, I did trial run of the algorithm with da_iter_num = 10 and analyzed estimates in array estimates. It turned out that number of iterations for DA algorithm can be decreased to 5. I then simulated 100 sets of lost values, imputed simulated values in the dataset and formed 100 complete arrays. For each complete dataset I computed means and variance using the algorithms of Intel® Summary Statistics Library, in total 100 sets of estimates for mean and covariance: 




Set:           Mean:                                


    1         0.013687  0.005529  0.004011 ...  0.008066


    2         0.012054  0.003741  0.006907 ...  0.003721


    3         0.013236  0.008314  0.008033 ...  0.011987


  ...                                                                                              


   99        0.013350  0.012816  0.012942 ...  0.004076


  100       0.014677  0.011909  0.005399 ...  0.006457


___________________________________________________                                                  


Average  0.012353  0.005676  0.007586 ...  0.006004 




Set:        Variance:                            


    1        0.989609  0.993073  1.007031 ... 1.000655   


    2        0.994033  0.986132  0.997705 ... 1.003134   


    3        1.003835  0.991947  0.997933 ... 0.997069   


   ...                                                                                               


    99      0.991922  0.988661  1.012045 ... 1.005406   


   100     0.987327  0.989517  1.009951 ... 0.998941   


________________________________________________                                                     


Average  0.99241   0.992136  1.007225 ... 1.000804




Between-imputation variance:


0.000007 0.000008 0.000008 ... 0.000007




Within-imputation variance:


0.000099 0.000099 0.000101 ... 0.000100   




Total variance:


0.000106 0.000107 0.000108 ... 0.000108   



 I also computed 95% confidence intervals for vector of means:




95% confidence interval:                          


Right boundary of interval: +0.032939 +0.026372 +0.028406 ... +0.026744


Left boundary of interval:   -0.008234   -0.015020 -0.013233 ...  -0.014736



I was curious to know how well the EMDA algorithm performs and repeated the same experiment 20 times. All 20 95% confidence intervals contained true value of mean. Sounds good, is not it?  


References


1. J.L. Schafer. Analysis of Incomplete Multivariate Data, Chapman & Hall/CRC, 1997.

Per informazioni complete sulle ottimizzazioni del compilatore, consultare l'Avviso sull'ottimizzazione