Developer Reference

Dealing with Missing Observations

Real-life datasets can have missing values. For example, sociological surveys and measurements of complex biological systems have to deal with missing observations. Outliers in datasets can also be treated as lost samples. Intel® Math Kernel Library (Intel® MKL) provides the Expectation-Maximization and Data Augmentation (EMDA) method for accurate processing of datasets with missing points.
The EMDA method is based on the approach described in [Schafer1997], comprising the Expectation-Maximization (EM) and Data Augmentation (DA) algorithms. The EMDA method outputs m sets of simulated missing points that can be imputed into the dataset, producing m complete data copies. For each dataset, you can compute a specific statistical estimate. The final estimate is a combination of such m estimates. For details on computational aspects and the usage model of the algorithm, see Support of Missing Values in Matrices of Observations.
The parameters of the EMDA method are passed into the library as follows:
  1. The EM algorithm iterates em_iter_num times to compute the initial estimate for the mean and variance-covariance used as the start point of the DA algorithm. The EM algorithm can terminate earlier if it achieves the given accuracy em_accuracy.
  2. The DA algorithm iterates da_iter_num times. This algorithm uses Gaussian random numbers underneath. For this reason, the EMDA algorithm uses the VSL_BRNG_MCG59 basic random number generator with the pre-defined seed = 250 and the Gaussian distribution generator (ICDF method) available in Intel® MKL.
Because the EMDA algorithm requires the number of missing values missing_value_num, you need to pre-process the dataset and mark all missing values using the VSL_SS_DNAN macro defined in the library. For a single-precision dataset, use the VSL_SS_SNAN macro. The algorithm parameters are passed into the library as the params array:
em_iter_num = 10;
da_iter_num = 5;
em_accuracy = 0.001;
copy_num = m;
missing_value_num = miss_num;

params[0] = em_iter_num;
params[1] = da_iter_num;
params[2] = em_accuracy;
params[3] = copy_num;
params[4] = missing_value_num;
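For illustration, the pre-processing step could look as follows. This is a minimal sketch: the dataset x, the index array miss_idx, and the helper name are hypothetical, and it assumes that VSL_SS_DNAN expands to a value that can be assigned directly to a double element (check the definition in mkl_vsl_defines.h for your MKL version).

#include "mkl_vsl.h"

/* Hypothetical helper: mark missing entries of a double-precision dataset
   with the VSL_SS_DNAN marker. x holds the p-by-n matrix of observations,
   and miss_idx holds the linear indices of the miss_num missing entries. */
static void mark_missing_entries( double* x, const MKL_INT* miss_idx,
                                  MKL_INT miss_num )
{
    MKL_INT i;
    for( i = 0; i < miss_num; i++ )
    {
        x[ miss_idx[i] ] = VSL_SS_DNAN;   /* assumed to be directly assignable */
    }
}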
The editor for the EMDA method accepts the following set of parameters:
errcode = vsldSSEditMissingValues( task, &nparams, params, &init_estimates_n, init_estimates, &prior_n, prior, &simul_missing_vals_n, simul_missing_vals, &estimates_n, estimates );
The EM algorithm starts using the array of initial estimates init_estimates. The vector of means occupies the first p positions of the array, and the upper-triangular part of the variance-covariance matrix occupies the remaining p*(p+1)/2 entries, where p is the dimension of the task. The prior array holds prior parameters for the EMDA algorithm.
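For instance, the start point used in the example below (zero means and an identity variance-covariance matrix) could be packed into init_estimates along the following lines. The helper name and the row-by-row packing order of the upper triangle are assumptions of this sketch.

#include "mkl_vsl.h"

/* Illustrative packing of init_estimates for a task of dimension p:
   the first p entries hold the mean vector and the remaining p*(p+1)/2
   entries hold the upper-triangular part of the variance-covariance matrix
   (packed row by row here, which is an assumption of this sketch). */
static void fill_init_estimates( double* init_estimates, MKL_INT p )
{
    MKL_INT i, j, idx = 0;
    for( i = 0; i < p; i++ )
        init_estimates[idx++] = 0.0;                        /* zero means          */
    for( i = 0; i < p; i++ )
        for( j = i; j < p; j++ )
            init_estimates[idx++] = ( i == j ) ? 1.0 : 0.0; /* identity covariance */
}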
The algorithm returns the sets of simulated missing points in the simul_missing_vals array. In total, m*missing_value_num values are returned. Missing values are packed one by one, starting from the missing points for the first variable of the random vector.
To estimate convergence of the DA algorithm, pass the estimates array holding the mean/variance-covariance estimates for all iterations and all sets of simulated missing points, da_iter_num * (p + 0.5*(p² + p)) values in total. In each set of the estimates, the first p entries hold the mean, and the remaining 0.5*(p² + p) entries hold the upper-triangular part of the variance-covariance matrix.
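Assuming that layout, the per-iteration estimates could be inspected roughly as follows to judge how many DA iterations are needed; the helper name and the printed diagnostic are illustrative only.

#include <stdio.h>
#include "mkl_vsl.h"

/* Illustrative traversal of the estimates array, following the layout
   described above: each block of p + 0.5*(p*p + p) values holds the mean
   (first p entries) followed by the packed upper-triangular
   variance-covariance matrix. */
static void print_da_means( const double* estimates, MKL_INT p,
                            MKL_INT da_iter_num )
{
    MKL_INT block = p + ( p * p + p ) / 2;
    MKL_INT iter;
    for( iter = 0; iter < da_iter_num; iter++ )
    {
        const double* mean = &estimates[ iter * block ]; /* covariance follows at mean + p */
        printf( "DA iteration %ld: mean[0] = %lf\n", (long)iter, mean[0] );
    }
}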
For the description of parameters passed into the EMDA algorithm using an editor and the requirements for the size of the arrays, see Support of Missing Values in Matrices of Observations.
To start the EMDA algorithm, call the Compute routine:
errcode = vsldSSCompute( task, VSL_SS_MISSING_VALS, VSL_SS_METHOD_MI );
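Putting the calls together, a minimal end-to-end sketch of the usage model might look as follows. The helper function, the column-wise storage assumption for the dataset, and the dynamically allocated work arrays are assumptions of this example; the array sizes follow the descriptions given earlier in this section, and x is expected to already contain the VSL_SS_DNAN markers.

#include <stdlib.h>
#include "mkl_vsl.h"

/* Minimal sketch of the EMDA usage model for a double-precision dataset x
   (p variables, n observations) whose missing entries are already marked
   with VSL_SS_DNAN; miss_num is the number of such entries. */
static int emda_sketch( double* x, MKL_INT p, MKL_INT n, MKL_INT miss_num )
{
    VSLSSTaskPtr task;
    MKL_INT xstorage = VSL_SS_MATRIX_STORAGE_COLS;    /* assumed storage layout of x */
    int errcode;

    /* EMDA parameters (same values as in the params array above) */
    MKL_INT nparams = 5;
    double params[5];
    double em_iter_num = 10, da_iter_num = 5, em_accuracy = 0.001;
    double copy_num = 100;                            /* m data copies */

    /* initial estimates: p means + p*(p+1)/2 covariance entries */
    MKL_INT cov_len = p * ( p + 1 ) / 2;
    MKL_INT init_estimates_n = p + cov_len;
    double* init_estimates = (double*)malloc( init_estimates_n * sizeof( double ) );

    /* no prior information */
    MKL_INT prior_n = 0;
    double* prior = 0;

    /* output arrays sized as described above */
    MKL_INT simul_missing_vals_n = (MKL_INT)copy_num * miss_num;
    double* simul_missing_vals = (double*)malloc( simul_missing_vals_n * sizeof( double ) );
    MKL_INT estimates_n = (MKL_INT)da_iter_num * ( p + cov_len );
    double* estimates = (double*)malloc( estimates_n * sizeof( double ) );

    MKL_INT i, j, idx;

    params[0] = em_iter_num;
    params[1] = da_iter_num;
    params[2] = em_accuracy;
    params[3] = copy_num;
    params[4] = (double)miss_num;

    /* start point: zero means and identity variance-covariance matrix */
    for( i = 0; i < p; i++ ) init_estimates[i] = 0.0;
    idx = p;
    for( i = 0; i < p; i++ )
        for( j = i; j < p; j++ )
            init_estimates[idx++] = ( i == j ) ? 1.0 : 0.0;

    errcode = vsldSSNewTask( &task, &p, &n, &xstorage, x, 0, 0 );
    errcode = vsldSSEditMissingValues( task, &nparams, params,
                                       &init_estimates_n, init_estimates,
                                       &prior_n, prior,
                                       &simul_missing_vals_n, simul_missing_vals,
                                       &estimates_n, estimates );
    errcode = vsldSSCompute( task, VSL_SS_MISSING_VALS, VSL_SS_METHOD_MI );
    errcode = vslSSDeleteTask( &task );

    /* simul_missing_vals now holds copy_num * miss_num simulated values that
       can be imputed into copies of x; estimates can be used to check
       convergence of the DA algorithm. */

    free( estimates );
    free( simul_missing_vals );
    free( init_estimates );
    return errcode;
}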
Example:
Consider a task with the dimension p = 10 and the number of observations n = 10,000. The dataset is generated from a multivariate Gaussian distribution with the zero mean and a variance-covariance matrix that holds 1 on the main diagonal and 0.05 in the other entries. The ratio of missing values in the dataset is 10%. Each observation may have one missing point in any position. The goal is to generate m = 100 sets of lost observations. The start point for the EM algorithm is the vector of zero means and the identity variance-covariance matrix. The pointer to the prior array is set to 0, and the size of this array, prior_n, is also 0.
The workflow is as follows:
  1. A trial run of the algorithm with da_iter_num = 10 is performed. The analysis of the estimates in the estimates array shows that five iterations are sufficient for the DA algorithm.
  2. 100 sets of missing values are simulated and imputed into the dataset, producing 100 complete data arrays.
  3. For each complete dataset, means and variances are computed using Summary Statistics algorithms:
Set       Mean:
1         0.013687  0.005529  0.004011  ...  0.008066
2         0.012054  0.003741  0.006907  ...  0.003721
3         0.013236  0.008314  0.008033  ...  0.011987
...
99        0.013350  0.012816  0.012942  ...  0.004076
100       0.014677  0.011909  0.005399  ...  0.006457
___________________________________________________
Average   0.012353  0.005676  0.007586  ...  0.006004

Set       Variance:
1         0.989609  0.993073  1.007031  ...  1.000655
2         0.994033  0.986132  0.997705  ...  1.003134
3         1.003835  0.991947  0.997933  ...  0.997069
...
99        0.991922  0.988661  1.012045  ...  1.005406
100       0.987327  0.989517  1.009951  ...  0.998941
___________________________________________________
Average   0.99241   0.992136  1.007225  ...  1.000804

Between-imputation variance:  0.000007  0.000008  0.000008  ...  0.000007
Within-imputation variance:   0.000099  0.000099  0.000101  ...  0.000100
Total variance:               0.000106  0.000107  0.000108  ...  0.000108
For the vector of means, 95% confidence intervals are computed:
95% confidence interval:
Left boundary of interval:   -0.008234  -0.015020  -0.013233  ...  -0.014736
Right boundary of interval:  +0.032939  +0.026372  +0.028406  ...  +0.026744
To test the output of the algorithm, the whole experiment is repeated 20 times. In all iterations, the 95% confidence intervals contain the true value of the mean.
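The between-imputation, within-imputation, and total variances reported above are consistent with the standard multiple-imputation combination rule described in [Schafer1997]: the within-imputation variance W is the average of the per-copy variances of the estimator, the between-imputation variance B is the sample variance of the m per-copy estimates, and the total variance is T = W + (1 + 1/m)*B. A minimal sketch of this combination for a single mean component follows; the factor of 2 used for the approximate 95% interval is a simplification (the exact method uses a t reference distribution), and all names are illustrative.

#include <math.h>

/* Illustrative combination of m per-copy estimates of one mean component:
   q[i] is the estimate from the i-th completed dataset and u[i] is its
   variance (for a sample mean, the sample variance divided by n). */
typedef struct { double estimate, total_var, ci_left, ci_right; } mi_result;

static mi_result combine_mi( const double* q, const double* u, int m )
{
    mi_result r;
    double qbar = 0.0, w = 0.0, b = 0.0;
    int i;
    for( i = 0; i < m; i++ ) { qbar += q[i]; w += u[i]; }
    qbar /= m;                                   /* combined estimate           */
    w    /= m;                                   /* within-imputation variance  */
    for( i = 0; i < m; i++ ) b += ( q[i] - qbar ) * ( q[i] - qbar );
    b /= ( m - 1 );                              /* between-imputation variance */
    r.estimate  = qbar;
    r.total_var = w + ( 1.0 + 1.0 / m ) * b;     /* total variance              */
    r.ci_left   = qbar - 2.0 * sqrt( r.total_var );  /* approximate 95% bounds  */
    r.ci_right  = qbar + 2.0 * sqrt( r.total_var );
    return r;
}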
