Developer Reference

Basic Components of the MI Method

Summary Statistics MI method comprises two components:
  1. Expectation Maximization (EM) algorithm that computes the start point for the Data Augmentation (DA) algorithm
  2. Simulation-based DA function that uses Intel® MKL random number generators
The parameters of the MI method are packed into the
array. See Table Structure of the Array of MI Parameters in the Summary Statistics section of [MKLMan] for the description of the parameters and their location in the array.
Beside the array of MI parameters, you can pass initial estimates of the mean and variance-covariance matrix (start point) into the EM algorithm. The array of means and variance-covariance matrix are packed as a one-dimensional array
. The size of the array should be at least
. The package format is as follows:
  1. For
    contains the start estimate of means.
  2. The remaining positions of the array store the upper-triangular part of the variance-covariance matrix.
If you do not provide the start point for the EM algorithm, it uses the default start point, which is the vector of zero means, and the unitary matrix as a variance-covariance matrix.
You can pass into the library prior parameters for
, and
. As the DA function uses inverted matrix
, the MI algorithm expects the inverse of
. These parameters are packed as a one-dimensional array
. The size of the array should be at least
(p2 + 3p +4)/2
to hold all the parameters. The storage format is as follows:
  1. prior[0],..., prior[p-1]
    contain elements of vector
  2. prior[p]
    contains parameter
  3. prior[p+1]
    contains parameter
  4. The remaining positions contain the upper-triangular part of the inverted matrix
If the prior parameters are not provided, the algorithm uses their default values:
  1. μ0
     is set to an array of
  2. τ
     is set to 0.
  3. m
     is set to
  4.  for the initial approximate of
    , the zero matrix is used.
Proper processing of the default parameter values ensures correct computation.
The MI algorithm returns
sets of imputed values and/or a sequence of parameter estimates drawn during the DA procedure. The imputed values are returned as a single array
. The size of the array should be sufficient to hold
sets, each of size
, that is,
in total. Imputed values are packed one by one in the order of their appearance in the matrix of observations.
Consider a task of dimension 4 with the total number of observations
=10, where the second vector misses values for the first and the second variables, and the seventh observation misses the first point. The number of sets to impute is
=2. Thus,
contain the first and the second points for the second observation vector,
holds the first point for the seventh observation. Similarly, positions 3, 4, and 5 are reserved for the second set of simulated values.
To estimate convergence of the DA algorithm and choose a proper value for DA iterations, you may generate a sequence of parameter estimates produced during the DA procedure. Elements of the sequence can then be analyzed to estimate convergence of the algorithm. For example, see [Schafer1998].
The sequence of the parameters is returned as a single array. The size of the array should be
m*da_iter_num*(p + (p2 + p)/2)
  1. m
    is the number of sets of values to impute
  2. da_iter_num
    is the number of DA iterations
  3. p + (p2 + p)/2
     is the size of the memory to hold one set of the parameter estimates.
Each set of parameters is packed as follows:
  1. The vector of means occupies the first
  2. The upper-triangular part of the variance-covariance matrix occupies the remaining
    (p2 + p)/2
Before the call to the MI algorithm, the dataset to be analyzed should be pre-processed and all missing observations should be marked with a predefined parameter, as follows:
    , if the dataset is stored in double precision floating-point arithmetic
    , if the dataset is stored in single precision floating-point arithmetic
Upon successful generation of
sets of imputed values, you can place them in cells of the data matrix with missing values and use other Summary Statistics algorithms to analyze and compute estimates for each of the
complete datasets.
Intel® MKL implementation of the MI algorithm rewrites cells of the dataset that are marked with the
If the combination of
and any other estimate parameter are passed into the
routine, only the algorithm for processing missing values is called.
The example below shows a typical MI usage scenario:
#include "mkl_vsl.h" #define DIM 3 /* dimension of the task */ #define N 10000 /* number of observations */ #define M VSL_SS_MI_PARAMS_SIZE /* number of MI parameters */ #define M_VALUE 9 /* total number of missing values */ #define M_COPIES 5 /* number of sets of imputed values */ int main() { int status; VSLSSTaskPtr task; MKL_INT i; double x[DIM * N]; /* matrix of observations */ double W[2]; double mean[DIM], r2m[DIM], c2m[DIM]; MKL_INT p, n, xstorage; double em_iter_num, da_iter_num, em_accuracy, copy_num, missing_value_num; double params[M], simul_missing_vals[M_VALUE * M_COPIES]; MKL_INT nparams = M, simul_missing_val_n = M_VALUE * M_COPIES; /* Pre-process the dataset and mark entries of missing values with VSL_SS_DNAN */ ... /* Parameters of the task and initialization */ p = DIM; n = N; xstorage = VSL_SS_MATRIX_STORAGE_ROWS; /* Parameters of the MI algorithm */ em_iter_num = 100; da_iter_num = 10; em_accuracy = 0.001; copy_num = M_COPIES; missing_value_num = M_VALUE; params[0] = em_iter_num; params[1] = da_iter_num; params[2] = em_accuracy; params[3] = copy_num; params[4] = missing_value_num; /* Create a task */ status = vsldSSNewTask( &task, &p, &n, &xstorage, x, 0, 0 ); /* Initialize the task parameters */ status = vsldSSEditMissingValues( task, &nparams, params, 0, 0, 0, 0, &simul_missing_val_n, simul_missing_vals, 0, 0 ); /* Generate m_values copies of missing value sets */ status = vsldSSCompute(task, VSL_SS_MISSING_VALS, VSL_SS_METHOD_MI ); /* Use the task to analyze the complete datasets */ W[0] = 0.0; W[1] = 0.0; for ( i = 0; i < p; i++ ) { mean[i] = 0.0; r2m[i] = 0; c2m[i] = 0.0; } status = vsldSSEditTask( task, VSL_SS_ED_ACCUM_WEIGHT, W ); status = vsldSSEditMoments( task, mean, r2m, 0,0, c2m, 0, 0 ); for ( i = 0; i < M_COPIES; i++ ) { /* Perform imputation of the next set of simulated values into x */ ... /* Compute the mean and the variance using the fast method */ errcode = vsldSSCompute(task, VSL_SS_MEAN|VSL_SS_2C_MOM, VSL_SS_METHOD_FAST ); /* Analyze the computed estimates */ ... } /* Deallocate the task resources */ status = vslSSDeleteTask( &task ); return 0; }

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804