# Overview of Summary Statistics (SS) in Intel® MKL

Por Hsin-ying Lin (Intel), Gennady F. (Intel), atualizado

**Introduction : **

The data volumes which researchers and engineers analyze nowadays in their daily tasks are huge and continue to increase. The size of datasets which are generated each day for solution of research or engineering problems can achieve hundred gigabytes, while the whole data array sizes have crossed terabyte threshold. To analyze the data in timely fashion and to enable faster and more accurate decision making, one needs a better tool stack. It includes not only the most powerful computer, but also the best software tools to effectively explore all capacities of the computer.

Intel® Math Kernel Library (MKL) team identified a set of statistical computationally intensive algorithms which are used on the different stages of the data analysis, and optimized those algorithms for the x86 multi-core platforms. Those statistical functions are available in Intel® MKL under Vector Statistical Library (VSL) chapter and are referred as Summary Statistics. VSL Summary Statistics provides the instruments for the fast, accurate and reliable data analysis on multi-core CPU. **Functionality : **

VSL Summary Statistics contains the set of the computational blocks that helps to get insight into the structure of the dataset from the different perspectives. This dataset is store in matrix form. It contains number of observations, sequence of n vectors; and each observation has the value of p variables (or p dimension of the task). Using VSL Summary Statistics algorithms, you can compute the basic statistical estimates like moments or quantiles for your observation matrix, estimate dependencies in the dataset, or even process the incomplete or noised data which are related to imperfect measurement process.

Based on the type of the estimates and the structure of the dataset we classify the Summary Statistics algorithms in the following way:

Computation of the basic statistical estimates for observations matrix:

- Moments, skewness, kurtosis, variation coefficient, quantiles and order statistics

Analysis of the datasets that have missing points

- Restoring the basic statistical characteristics in presence of missed observations

Analysis of the noised dataset, the one that contains artifacts (outliers)

- Detection of outliers, robust (to noise) estimates of the covariance matrix and mean

Analysis of the dependencies

- Variance-covariance/correlation matrix, partial variance-covariance/correlation matrix,

pooled/group variance-covariance/correlation matrix

- Parameterization (stabilization) of the correlation matrix

Typically, the algorithms for support of missing values and detection of outliers can be used for so called data cleaning when the initial analysis of the raw data that became available as result of a research experiment or a measurement of an object, or a sociological survey.

As soon as you clean the raw data, complete the initial analysis of the raw data, and apply some other steps like the data transformation using other MKL routines, you can use the rich set of the Summary Statistics estimators to learn the structure of the data array. With basic statistical estimates like mean or standard deviation you will obtain the first impression about your data. The advanced analysis would involve computation of the quantiles for understanding the distribution properties of the data, variance-covariance matrices - for revealing the dependencies between the components of the observed random vector, or computation of the robust version of the basic estimates.

The algorithms for estimation of dependencies are the important building blocks in the algorithms for signal processing, gene analysis, real-time analysis of the financial data, and portfolio management problems. In addition to “classical” covariance you will find more covariance algorithms in the library like pooled/group covariance which is used in the linear discriminant analysis for face recognition or gene expression level estimation and partial covariance used underneath filtering problems. In the real-life problems which involve intensive real-data based computation of the covariance matrix you may run in the situation when the matrix losses the property of positive semi-definiteness (PSD) due to spurious observations. In some problems you may also want to modify the entries of the matrix to integrate additional information about the object, and thus can impact on PSD. To help you to quickly restore PSD property of the correlation matrix the library provides the tool for parameterization of correlation matrix.

While more and more advanced methodologies for data collection (like microarray approach for the gene expression) are designed and become available, the process for object measurement is not free of spurious data and artifacts. To minimize impact of the noise on the final statistical conclusions you may use the robust methods available in the library.

There is a number of the problems in which the size of the raw data is huge and can’t fit into memory of the computer or the observation matrix become available in blocks only (like in astronomy or in real-time trading systems). Some of Summary Statistics routines are designed in the way that helps to easily address progressive, block-based analysis. In particular, you may provide the data, block-by block, to the algorithm for computation of covariance and, eventually, after processing the last data block will have the covariance estimate for the whole dataset. For another estimate, quantile, the library provides the version for streaming data: for price of pre-defined estimation error and one pass over the data you will obtain the distribution structure of the whole dataset without a need to return back to the processed chunks of the data.

Another important feature of the VSL Summary Statistics - the opportunity to store the observation matrix and results of the computations in one of the predefined formats: all algorithms of the library process the dataset held in raw-major and column-major format; based on the request of the user, the covariance estimators return the result in full or packed storage format.**Usage Model : **

API of VSL Summary Statistics is similar to that of other MKL features (Convolution / Correlation, FFT, RNGs). The minimal set of entry points into the library, scalability, and support of Object-Oriented Programming – the major drivers behind the library API.

The key element of Summary Statistics API is task object which is a data structure that holds the parameters related computation of Summary Statistics. All Summary Statistics operations are done in the context of this task. The typical Summary Statistics usage model consists of four steps shown on the scheme below:

Step 1: task construction

status = vsldSSNewTask(&task, &p, &n, &x_storage, x, w, indices);

Step 2: Editing the task parameters

status = vsldSSEditTask(task, VSL_SS_ED_MEAN, mean );

Step 3: Computation of statistical estimates

status = vsldSSCompute(task, VSL_SS_MEAN, VSL_SS_METHOD_1PASS);

Step 4: Destroying the task

status = vslSSDeleteTask(&task );

Here are three examples* on how to use VSL Summary Statistics functionality in MKL:**Example 1 : In-memory datasets **

This example for estimation of the mean using one-pass method shows the typical four-stage approach of using VSL Summary Statistics functionality.

First, it creates a new task by passing the task descriptor, problem size (dimension of the random vector p and number of the observed vectors n), observation matrix, and its matrix storage format. Summary Statistics usage model admits assigning weights to each observation in the dataset, which are not negative numbers; in the example below, the null pointer to the array of weights is passed to the constructor, thus, the default weights equal to 1 are assigned to the observations and are used in the computations. Second, it initializes the task parameter, necessary for the solution of the specific problem of sample mean estimation by means of registering the array of result VSL_SS_ED_MEAN. Third, it computes the mean using one pass method VSL_SS_METHOD_1PASS. Finally, it calls the task destructor to release the memory resources.

#define DIM 1 /* Task dimension */

#define N 1000 /* Number of observations */

int main()

{

VSLSSTaskPtr task; /* SS task descriptor */

double x[N]; /* Array for dataset */

double mean; /* Array for mean estimate */

double* w = 0; /* Null pointer to array of weights, default weight equal to one will be used in the computation */

MKL_INT p, n, xstorage;

int status;

/* Initialize variables used in the computation of mean */

p = DIM;

n = N;

xstorage = VSL_SS_MATRIX_STORAGE_ROWS;

mean = 0.0;

/* Step 1 - Create task */

status = vsldSSNewTask( &task, &p, &n, &xstorage, x, w, 0 );

/* Step 2- Initialize task parameters */

status = vsldSSEditTask( task, VSL_SS_ED_MEAN, &mean );

/* Step 3 - Compute the mean estimate using SS one-pass method */

status = vsldSSCompute(task, VSL_SS_MEAN, VSL_SS_METHOD_1PASS );

/* Step 4 - deallocate task resources */

status = vslSSDeleteTask( &task );

return 0;

}**Example 2 : Out-of-memory datasets **

The example below is extension of the previous scheme to the case of out-of-memory datasets. While the example follows the same four-stage scheme, there are few modifications to the previous example done to support processing of out-of-memory dataset

- First, the example registers the auxiliary array in the library intended for holding accumulated weight and accumulated squared weight which are just a sum of the observation weights and sum of the squares of the observation weights processed so far. It is necessary for the correct processing of the next available data block.
- Second, the example contains call to the function responsible for obtaining the next data block

The rest computations remain the same. As soon as the last data block is processed you will have the mean estimate for the whole dataset.

#define DIM 1 /* Task dimension */

#define N 1000 /* Number of observations */

#define BLOCKN 100 /* Number of blocks */

int main()

{

VSLSSTaskPtr task; /* SS task descriptor */

double x[N]; /* Array for dataset */

double mean; /* Array for mean estimate */

double W[2]; /* Array of accumulated weights */

MKL_INT p, n, xstorage;

int status, block_idx;

/* Initialize variables used in the computation of mean */

p = DIM; n = N;

mean = 0.0;

W[0] = 0.0; W[1] = 0.0;

xstorage = VSL_SS_MATRIX_STORAGE_ROWS;

/* Create task */

status = vsldSSNewTask( &task, &p, &n, &xstorage, x, 0, 0 );

/* Initialize task parameters */

status = vsldSSEditTask( task, VSL_SS_ED_MEAN, &mean );

status = vsldSSEditTask( task, VSL_SS_ED_ACCUM_WEIGHT, W );

/* Compute the mean estimate for block-based data using SS one-pass method */

for( block_idx = 0; block_idx++; )

{

status = vsldSSCompute( task, VSL_SS_MEAN, VSL_SS_METHOD_1PASS );

if ( block_idx >= BLOCKN )

break;

else

; //GetNextDataBlock( x, N );

}

/* De-allocate task resources */

status = vslSSDeleteTask( &task );

return 0;

}**Example 3 : Do several estimations simultanenously **

This example shows how to compute several statistical estimates for same data set. Let us assume that we need mean, variance, variation coefficient, and covariance in the statistical analysis. The computational scheme is the same and similar to those considered in the previous examples. We just need to modify those parameters in the task which are associated with the estimates we are interested in, that is we need to register the pointers to the memory which would hold the results and storage format for covariance matrix. In the code below, we use the editor for Moments vsldSSEditMoments, the editor for covariance vsldSSEditCovCor , and the generic editor for variation vsldSSEditTask to specify task parameters. We also use statement consisted four kind of estimations that will be use in the compute stage.

#define DIM 2 /* Task dimension */

#define N 1000 /* Number of observations */

int main()

{

VSLSSTaskPtr task; /* SS task descriptor */

double x[DIM*N]; /* Dataset array */

double mean[DIM],variation[DIM], r2m[DIM], c2m[DIM];

/* mean/variation/2nd raw/2nd central moments */

double cov[DIM*DIM];

double* w = 0;

MKL_INT p, n, xstorage, covstorage;

int status;

/* Initialize variables */

p = DIM; n = N;

xstorage = VSL_SS_MATRIX_STORAGE_ROWS;

covstorage = VSL_SS_MATRIX_STORAGE_FULL;

/* Create task */

int errcode = vsldSSNewTask( &task, &p, &n, &xstorage, x, w, 0 );

/* Edit task parameters */

errcode = vsldSSEditTask( task, VSL_SS_ED_VARIATION, variation );

errcode = vsldSSEditMoments( task, mean, r2m, 0, 0, c2m, 0, 0 );

errcode = vsldSSEditCovCor( task, mean, cov, &covstorage, 0, 0 );

/* Computation of several estimates using 1PASS method */

int estimates = VSL_SS_MEAN | VSL_SS_2C_MOM | VSL_SS_COV | VSL_SS_VARIATION;

status = vsldSSCompute( task, estimates, VSL_SS_METHOD_1PASS );

/* De-allocate task resources */

status = vslSSDeleteTask( &task );

return 0;

}

Note that some statistical estimators compute additional estimates if they are not explicitly requested by the user, e.g., the algorithm for covariance estimation also computes mean estimate. In order to provide the correct processing of such estimates you need to allocate the memory for those additional estimates, and to register pointers to this memory in the library. See [SSApp] for more details.**Summary : **

In this article, we describe the Summary Statistics Library functionality, usage model and examples under Intel® MKL VSL. The examples provided in the article demonstrate the ease of use of VSL Summary Statistics in different cases, e.g., processing of out-of-memory data, and simultaneous estimation of several statistical estimates. The additional information about API, usage models are available in [MKLMan] and [SSApp].*** note: **you may find out 18 Summary Statistic Examples both into <MKL_ROOT>\examples\vsl**c**\source\ for C and <MKL_ROOT>\examples\vsl**f**\source\ with Fortran API.

**Bibliography : **

[MKLMan] Intel® MKL Manual, /en-us/

[SSApp]Summary Statistics Application Notes,

/en-us/