# Dealing with Missing Observations

Real-life datasets can have missing values. For example, sociological surveys and measurement of complex biological systems have to deal with missing observations. Outliers in datasets can also be treated as lost samples. Intel® Math Kernel Library (Intel® MKL) provides the Expectation-Maximization and Data Augmentation (EMDA) method for accurate processing of datasets with missing points.
The EMDA method is based on the approach described in [Schafer1997] and comprises the Expectation-Maximization (EM) and Data Augmentation (DA) algorithms. The EMDA method outputs `m` sets of simulated missing points that can be imputed into the dataset, producing `m` complete data copies. For each complete dataset, you can compute a specific statistical estimate. The final estimate is a combination of these `m` estimates. For details on computational aspects and the usage model of the algorithm, see Support of Missing Values in Matrices of Observations.
The parameters of the EMDA method are passed into the library as follows:
1. The EM algorithm iterates `em_iter_num` times to compute the initial estimates of the mean and the variance-covariance matrix, which serve as the starting point of the DA algorithm. The EM algorithm can terminate earlier if it achieves the given accuracy `em_accuracy`.
2. The DA algorithm iterates `da_iter_num` times. This algorithm relies on Gaussian random numbers. For this reason, the EMDA algorithm uses the `VSL_BRNG_MCG59` basic random number generator with the predefined `seed = 250` and the Gaussian distribution generator (ICDF method) available in Intel® MKL.
As the EMDA algorithm requires the number of missing values `missing_value_num`, you need to pre-process the dataset and mark all missing values using the `VSL_SS_DNAN` macro defined in the library. For a single-precision dataset, use the `VSL_SS_SNAN` macro. The algorithm parameters are passed into the library as the `params` array:
```
em_iter_num = 10;
da_iter_num = 5;
em_accuracy = 0.001;
copy_num = m;
missing_value_num = miss_num;

params[0] = em_iter_num;
params[1] = da_iter_num;
params[2] = em_accuracy;
params[3] = copy_num;
params[4] = missing_value_num;
```
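Because `missing_value_num` must be known before the `params` array is filled, the dataset has to be scanned once during pre-processing. The following is a minimal sketch of that step, not library code. The helper names `mark_missing` and `count_missing` and the sentinel value are hypothetical, and the standard `NAN` macro stands in for `VSL_SS_DNAN` on the assumption that the library marker behaves as an IEEE NaN.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: replaces every entry equal to `sentinel`
   (a placeholder code chosen by whoever produced the data) with NaN
   and returns the number of replaced entries.
   Assumption: NAN from <math.h> stands in for VSL_SS_DNAN. */
size_t mark_missing(double *x, size_t len, double sentinel)
{
    size_t replaced = 0;
    assert(x != NULL || len == 0);
    for (size_t i = 0; i < len; ++i) {
        if (x[i] == sentinel) {
            x[i] = NAN;
            ++replaced;
        }
    }
    return replaced;
}

/* Hypothetical helper: counts the NaN entries of x[0..len-1].
   The result is the value to store as missing_value_num. */
size_t count_missing(const double *x, size_t len)
{
    size_t count = 0;
    assert(x != NULL || len == 0);
    for (size_t i = 0; i < len; ++i) {
        if (isnan(x[i])) {
            ++count;
        }
    }
    return count;
}
```

A caller would run `mark_missing` over the whole observation matrix and then pass the result of `count_missing` as `missing_value_num`.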
The editor for the EMDA method accepts the following set of parameters:
```
errcode = vsldSSEditMissingValues( task, &nparams, params,
                                   &init_estimates_n, init_estimates,
                                   &prior_n, prior,
                                   &simul_missing_vals_n, simul_missing_vals,
                                   &estimates_n, estimates );
```
The EM algorithm starts from the array of initial estimates `init_estimates`. The vector of means occupies the first `p` positions of the array. The upper-triangular part of the variance-covariance matrix occupies the remaining `p*(p+1)/2` entries, where `p` is the dimension of the task. The `prior` array holds prior parameters for the EMDA algorithm.
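The layout of `init_estimates` can be illustrated with a short sketch that fills the array with zero means and an identity variance-covariance matrix. The helper names are hypothetical, and the row-by-row order of the packed upper triangle is an assumption made for illustration; the text above fixes only the sizes of the two parts.

```c
#include <assert.h>
#include <stddef.h>

/* Number of entries: p means followed by p*(p+1)/2 packed
   upper-triangular variance-covariance entries. */
size_t init_estimates_size(size_t p)
{
    return p + p * (p + 1) / 2;
}

/* Hypothetical helper: fills est with zero means and an identity
   variance-covariance matrix.
   Assumption: the upper triangle is packed row by row, that is
   (0,0), (0,1), ..., (0,p-1), (1,1), ..., (p-1,p-1). */
void fill_init_estimates(double *est, size_t p)
{
    size_t k = p; /* first position after the vector of means */
    assert(est != NULL && p > 0);
    for (size_t i = 0; i < p; ++i) {
        est[i] = 0.0;
    }
    for (size_t i = 0; i < p; ++i) {
        for (size_t j = i; j < p; ++j) {
            est[k++] = (i == j) ? 1.0 : 0.0;
        }
    }
}
```

For `p = 10`, `init_estimates_size` gives 10 + 55 = 65 entries.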
The algorithm returns the sets of simulated missing points in the `simul_missing_vals` array. In total, `m*missing_value_num` values are returned. Missing values are packed one by one, starting from the missing points for the first variable of the random vector.
To estimate convergence of the DA algorithm, pass the `estimates` array holding the mean/variance-covariance estimates for all iterations and all sets of simulated missing points, `da_iter_num * (p + 0.5 * (p^2 + p))` values in total. In each set of estimates, the first `p` entries hold the mean, and the remaining `0.5 * (p^2 + p)` entries hold the upper-triangular part of the variance-covariance matrix.
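One simple way to inspect the `estimates` array for convergence is to track how much the mean vector changes from one set of estimates to the next. The sketch below is an illustration only. The helper names are hypothetical, and it assumes that consecutive sets are stored back to back, each occupying `p + 0.5 * (p^2 + p)` entries; the actual ordering should be checked against the library reference.

```c
#include <assert.h>
#include <stddef.h>

/* Entries in one set of estimates: p means plus the packed
   upper-triangular variance-covariance matrix. */
size_t estimate_set_size(size_t p)
{
    return p + (p * p + p) / 2;
}

/* Hypothetical helper: largest absolute change of any mean component
   between set k-1 and set k.
   Assumption: sets are stored contiguously, set after set. */
double max_mean_change(const double *estimates, size_t p, size_t k)
{
    size_t stride = estimate_set_size(p);
    const double *prev;
    const double *curr;
    double worst = 0.0;
    assert(estimates != NULL && k >= 1);
    prev = estimates + (k - 1) * stride;
    curr = estimates + k * stride;
    for (size_t i = 0; i < p; ++i) {
        double d = curr[i] - prev[i];
        if (d < 0.0) {
            d = -d;
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

When the returned value stays small over several consecutive sets, the chain can be regarded as having settled for the purposes of choosing `da_iter_num`.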
For the description of parameters passed into the EMDA algorithm using an editor and the requirements for the size of the arrays, see Support of Missing Values in Matrices of Observations.
To start the EMDA algorithm, call the `Compute` routine:
```
errcode = vsldSSCompute( task, VSL_SS_MISSING_VALS, VSL_SS_METHOD_MI );
```
Example:
Consider a task with the dimension `p = 10` and the number of observations `n = 10,000`. The dataset is generated from a multivariate Gaussian distribution with zero mean and a variance-covariance matrix that holds 1 on the main diagonal and 0.05 in all other entries. The ratio of missing values in the dataset is 10%. Each observation may have one missing point in any position. The goal is to generate `m = 100` sets of lost observations. The starting point for the EM algorithm is the vector of zero means and the identity variance-covariance matrix. The pointer to the `prior` array is set to 0. The size of this array, `prior_n`, is also 0.
The workflow is as follows:
1. A trial run of the algorithm with `da_iter_num = 10` is performed. The analysis of the estimates in the `estimates` array shows that five iterations are sufficient for the DA algorithm.
2. 100 sets of missing values are simulated and imputed into the dataset, producing 100 complete data arrays.
3. For each complete dataset, means and variances are computed using Summary Statistics algorithms:
```
Set: Mean:
1 0.013687 0.005529 0.004011 ... 0.008066
2 0.012054 0.003741 0.006907 ... 0.003721
3 0.013236 0.008314 0.008033 ... 0.011987
...
99 0.013350 0.012816 0.012942 ... 0.004076
100 0.014677 0.011909 0.005399 ... 0.006457
___________________________________________________

Average 0.012353 0.005676 0.007586 ... 0.006004

Set: Variance:
1 0.989609 0.993073 1.007031 ... 1.000655
2 0.994033 0.986132 0.997705 ... 1.003134
3 1.003835 0.991947 0.997933 ... 0.997069
...
99 0.991922 0.988661 1.012045 ... 1.005406
100 0.987327 0.989517 1.009951 ... 0.998941
________________________________________________

Average 0.99241 0.992136 1.007225 ... 1.000804

Between-imputation variance:
0.000007 0.000008 0.000008 ... 0.000007

Within-imputation variance:
0.000099 0.000099 0.000101 ... 0.000100

Total variance:
0.000106 0.000107 0.000108 ... 0.000108
```
For the vector of means, 95% confidence intervals are computed:
```
95% confidence interval:
Left boundary of interval: -0.008234 -0.015020 -0.013233 ... -0.014736
Right boundary of interval: +0.032939 +0.026372 +0.028406 ... +0.026744
```
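The between-imputation, within-imputation, and total variances listed above are consistent with the standard multiple-imputation combining rules (often called Rubin's rules), in which the total variance equals the within-imputation variance plus `(1 + 1/m)` times the between-imputation variance. The text does not spell out the formula, so the sketch below is an assumption about how such figures can be obtained rather than a description of the library. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Pooled result for one scalar quantity (for example, one mean component). */
typedef struct {
    double estimate; /* average of the m per-set estimates */
    double within;   /* average of the m per-set variances of the estimate */
    double between;  /* sample variance of the m per-set estimates */
    double total;    /* within + (1 + 1/m) * between */
} pooled_t;

/* Hypothetical helper: combines m per-set estimates q[i] and the
   variances u[i] of those estimates using the standard combining rules.
   For a sample mean, u[i] is the sample variance of set i divided by n. */
pooled_t pool_estimates(const double *q, const double *u, size_t m)
{
    pooled_t r = {0.0, 0.0, 0.0, 0.0};
    assert(q != NULL && u != NULL && m >= 2);
    for (size_t i = 0; i < m; ++i) {
        r.estimate += q[i];
        r.within += u[i];
    }
    r.estimate /= (double)m;
    r.within /= (double)m;
    for (size_t i = 0; i < m; ++i) {
        double d = q[i] - r.estimate;
        r.between += d * d;
    }
    r.between /= (double)(m - 1);
    r.total = r.within + (1.0 + 1.0 / (double)m) * r.between;
    return r;
}
```

With the rounded figures for the first variable (within 0.000099, between 0.000007, `m = 100`), the rule gives 0.000099 + 1.01 * 0.000007, or about 0.000106, which matches the reported total variance. A confidence interval is then formed as the pooled estimate plus or minus a quantile multiplied by the square root of the total variance; which quantile was used for the intervals above is not stated.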
To test the output of the algorithm, the whole experiment is repeated 20 times. In all 20 repetitions, the 95% confidence intervals contain the true value of the mean.
