Dealing with Missing Observations


Real-life datasets can have missing values. For example, sociological surveys and measurement of complex biological systems have to deal with missing observations. Outliers in datasets can also be treated as lost samples. Intel® Math Kernel Library (Intel® MKL) provides the Expectation-Maximization and Data Augmentation (EMDA) method for accurate processing of datasets with missing points.

The EMDA method is based on the approach described in [Schafer1997], comprising the Expectation-Maximization (EM) and Data Augmentation (DA) algorithms. The EMDA method outputs m sets of simulated missing points that can be imputed into the dataset producing m complete data copies. For each dataset, you can compute a specific statistical estimate. The final estimate is a combination of such m estimates. For details on computational aspects and usage model of the algorithm, see Support of Missing Values in Matrices of Observations.

The parameters of the EMDA method are passed into the library as follows:

  1. The EM algorithm iterates em_iter_num  times to compute the initial estimate for the mean and variance-covariance used as the start point of the DA algorithm. The EM algorithm can terminate earlier if it achieves the given accuracy em_accuracy.

  2. The DA algorithm iterates da_iter_num  times. This algorithm uses Gaussian random numbers underneath. For this reason, EMDA algorithm uses VSL_BRNG_MCG59 basic random number generator with the pre-defined seed = 250and Gaussian distribution generator (ICDF method) available in Intel® MKL.

As the EMDA algorithm requires a number of missing values missing_value_num, you need to pre-process the dataset and mark all missing values using the VSL_SS_DNAN macro defined in the library. For a single-precision dataset, use the VSL_SS_SNAN macro. The algorithm parameters are passed into the library as the params array:

em_iter_num     = 10;
da_iter_num     = 5;
em_accuracy     = 0.001;
copy_num        = m; 
miss_value_num  = miss_num;
params[0] = em_iter_num;
params[1] = da_iter_num;         
params[2] = em_accuracy;        
params[3] = copy_num;           
params[4] = missing_value_num;

The editor for the EMDA method accepts the following set of parameters:

errcode = vsldSSEditMissingValues( task, &nparams, params, &init_estimates_n,
                                  init_estimates, &prior_n, prior,
                                  &simul_missing_vals_n, simul_missing_vals, 
                                  &estimates_n, estimates );

The EM algorithm starts using the array of initial estimates init_estimates. The vector of means occupies the first p positions of the array. The upper-triangular part of the variance-covariance matrix occupies the rest p*(p+1)/2 entries, where p is the dimension of the task. The prior array holds prior parameters for the EMDA algorithm.

The algorithm returns the sets of simulated missing points in the simul_missing_vals array. In total, m*missing_value_num values are returned. Missing values are packed one by one, starting from the missing points for the first variable of the random vector.

To estimate convergence of the DA algorithm, pass the estimates array holding the mean/variance-covariance for all iterations and all sets of simulated missing points, da_iter_num * ( p + 0.5 * (p2 + p) ) in total. In each set of the estimates, first p entries hold the mean, and the rest 0.5 * (p2 + p) entries hold the upper-triangular part of the variance-covariance matrix.

For the description of parameters passed into the EMDA algorithm using an editor and the requirements for the size of the arrays, see Support of Missing Values in Matrices of Observations.

To start the EMDA algorithm, call the Compute routine:

errcode = vsldSSCompute( task, VSL_SS_MISSING_VALS, SL_SS_METHOD_MI );


Consider a task with the dimension p = 10 and the number of observations n = 10,000. The dataset is generated from a multivariate Gaussian distribution with the zero mean and a variance-covariance matrix that holds 1 on the main diagonal and 0.05 in other entries. The ratio of missing values in the dataset is 10%. Each observation may have one missing point in any position. The goal is to generate m=100 sets of lost observations. The start point for the EM algorithm is the vector of zero means and the identity variance-covariance matrix. The pointer to the prior array is set to 0. The size of this array prior_n is also 0.

The workflow is as follows:

  1. A trial run of the algorithm with da_iter_num = 10 is performed.
    The analysis of the estimates in the estimates array shows that five iterations are sufficient for the DA algorithm.  

  2. 100 sets of missing values are simulated and imputed into the dataset, producing 100 complete data arrays.

  3. For each complete dataset, means and variance are computed using Summary Statistics algorithms:

Set:           Mean:                                 
    1        0.013687  0.005529  0.004011 ...  0.008066
    2        0.012054  0.003741  0.006907 ...  0.003721
    3        0.013236  0.008314  0.008033 ...  0.011987
   99        0.013350  0.012816  0.012942 ...  0.004076
  100        0.014677  0.011909  0.005399 ...  0.006457
Average      0.012353  0.005676  0.007586 ...  0.006004 
Set:         Variance:                             
    1        0.989609  0.993073  1.007031 ... 1.000655   
    2        0.994033  0.986132  0.997705 ... 1.003134   
    3        1.003835  0.991947  0.997933 ... 0.997069   
    99       0.991922  0.988661  1.012045 ... 1.005406   
   100       0.987327  0.989517  1.009951 ... 0.998941   
Average      0.99241   0.992136  1.007225 ... 1.000804
Between-imputation variance:
0.000007 0.000008 0.000008 ... 0.000007
Within-imputation variance:
0.000099 0.000099 0.000101 ... 0.000100   
Total variance:
0.000106 0.000107 0.000108 ... 0.000108   

For the vector of means, 95% confidence intervals are computed:

95% confidence interval:
Left boundary of interval:  -0.008234 -0.015020 -0.013233 ...  -0.014736
Right boundary of interval: +0.032939 +0.026372 +0.028406 ...  +0.026744

To test the output of the algorithm, the whole experiment is repeated 20 times. In all iterations, 95% confidence intervals contain the true value of mean.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

For more complete information about compiler optimizations, see our Optimization Notice.