# Statistical Sums/Cross-products Support Streaming Computation for Large Datasets

The Summary Statistics Library (SSL) is a subcomponent of the Vector Statistical Library (VSL) included in the Intel® Math Kernel Library (Intel® MKL). The library provides a rich set of functions for computing statistical estimates for multi-dimensional datasets. It contains algorithms for computing the mean, raw and central moments and sums, skewness, kurtosis, variation, covariance/correlation/cross-product matrices, quantiles, order statistics, and more. Please find more details about Intel MKL SSL in the Intel MKL SSL overview document.

In Intel MKL 11.1, the SSL introduces two new methods, a cross-product conversion and a sum-to-moment conversion, which are suitable for calculating covariance matrices and moments from cross-products and sums in streaming or distributed mode.

Brief details about the two features are given below:

1. Method VSL_SS_METHOD_CP_TO_COVCOR computes a variance-covariance and/or correlation matrix from the corresponding cross-product matrix.
2. Method VSL_SS_METHOD_SUM_TO_MOM computes raw/central statistical moments, as well as kurtosis/skewness/variation, from the corresponding raw/central sums.
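To make the second conversion concrete, here is a minimal plain-C sketch of the arithmetic behind turning raw sums into moments for a single variable. It does not call Intel MKL: the type `moments_t` and the function `sums_to_moments` are hypothetical names used only for illustration, and the unbiased (n - 1) normalization of the variance is an assumption that may differ from what the library reports.

```c
#include <assert.h>
#include <stddef.h>

/* Moments of one variable recovered from its raw sums.
   Hypothetical helper type for illustration, not part of Intel MKL. */
typedef struct {
    double mean;      /* first raw moment:  S1 / n */
    double raw2;      /* second raw moment: S2 / n */
    double variance;  /* central estimate:  (S2 - S1*S1/n) / (n - 1) */
} moments_t;

/* Convert raw sums S1 = sum(x) and S2 = sum(x*x) over n observations
   into moments. Assumes n >= 2 and the unbiased (n - 1) normalization. */
moments_t sums_to_moments(double s1, double s2, size_t n)
{
    moments_t m;
    assert(n >= 2);
    m.mean = s1 / (double)n;
    m.raw2 = s2 / (double)n;
    m.variance = (s2 - s1 * s1 / (double)n) / (double)(n - 1);
    return m;
}
```

For the observations 1, 2, 3, 4 the sums are S1 = 10 and S2 = 30, so `sums_to_moments(10.0, 30.0, 4)` gives a mean of 2.5 and a variance of 5/3. Because only the sums are needed, they can be accumulated across blocks or nodes and converted once at the end. Note that raw-sum formulas can lose precision to cancellation when the mean is large relative to the spread of the data.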

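The new methods make it possible to compute a variance-covariance and/or correlation matrix from the corresponding cross-product matrix, and likewise moments from sums. This is especially convenient when computing variance-covariance matrices for large datasets, because cross-products and sums can be accumulated piece by piece and converted once at the end.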

The corresponding examples are located in the <MKL installation folder>/examples/vslc folder:

- vsldcp.c: computes a cross-product matrix and converts it into a covariance matrix
- vsldstatsum.c: computes statistical sums and converts them into the corresponding statistical moments

Sample code:

    /***** Include the cross-product matrix in the list of estimates to compute *****/
    estimate = VSL_SS_CP;

    /***** Compute the estimates using FAST method *****/
    errcode = vsldSSCompute( task, estimate, VSL_SS_METHOD_FAST );

    /***** Edit task parameters for computation of covariance *****/

    /***** Convert cross-product matrix into covariance matrix *****/
    errcode = vsldSSCompute( task, VSL_SS_COV, VSL_SS_METHOD_CP_TO_COVCOR );

When a dataset is large, the data generally arrives as a stream or in blocks. Previously, handling each new block required the following loop of computation:

1. Convert the variance-covariance and/or correlation matrix into a cross-product matrix.
2. Update the cross-product matrix.
3. Update the variance-covariance and/or correlation matrix from the cross-product matrix.

With the new method, variance-covariance and/or correlation matrices for a dataset available as n blocks can be computed in two simple steps:

1. For each of blocks 1, 2, …, n of the dataset, compute the VSL_SS_CP estimate to update the cross-product matrix using one of the supported methods.
2. Convert the final cross-product matrix into a variance-covariance and/or correlation matrix by applying VSL_SS_METHOD_CP_TO_COVCOR.
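The two steps above can be mimicked in plain C to show why only the cross-product matrix (plus the running means and observation count) has to be carried from block to block. This sketch does not use Intel MKL and is not a description of how VSL_SS_METHOD_1PASS works internally. It uses a Welford-style update as a stand-in, fixes the number of variables at compile time for brevity, and assumes the unbiased (n - 1) normalization. The names `cp_state_t`, `cp_update`, and `cp_to_cov` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define P 2  /* number of variables; fixed to keep the sketch short */

/* Running state carried from block to block (hypothetical type).
   Must be zero-initialized before the first block. */
typedef struct {
    size_t n;          /* observations seen so far */
    double mean[P];    /* running means */
    double cp[P][P];   /* running centered cross-product matrix */
} cp_state_t;

/* Step 1: fold one block of nobs observations (row-major, nobs x P)
   into the running state, one observation at a time. */
void cp_update(cp_state_t *s, const double *block, size_t nobs)
{
    size_t k, i, j;
    double delta[P];
    for (k = 0; k < nobs; k++) {
        const double *x = block + k * P;
        s->n++;
        for (i = 0; i < P; i++) {
            delta[i] = x[i] - s->mean[i];           /* deviation from old mean */
            s->mean[i] += delta[i] / (double)s->n;  /* move mean to new value */
        }
        for (i = 0; i < P; i++)
            for (j = 0; j < P; j++)
                /* old-mean deviation times new-mean deviation */
                s->cp[i][j] += delta[i] * (x[j] - s->mean[j]);
    }
}

/* Step 2: convert the final cross-product matrix into a covariance
   matrix, once, after the last block. Assumes at least 2 observations. */
void cp_to_cov(const cp_state_t *s, double cov[P][P])
{
    size_t i, j;
    assert(s->n >= 2);
    for (i = 0; i < P; i++)
        for (j = 0; j < P; j++)
            cov[i][j] = s->cp[i][j] / (double)(s->n - 1);
}
```

A caller would zero-initialize a `cp_state_t`, call `cp_update` once per incoming block, and call `cp_to_cov` a single time at the end. No intermediate covariance matrix is ever formed, which is the saving the two-step scheme offers over the three-step loop.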

Sample code:

    /* Compute a cross-product matrix over blocks 0..NBLOCKS-1 */
    for( i = 0; i < NBLOCKS; i++ )
    {
        /* Get i-th data block to array x */
        status = GetBlock( i, x, p, n );

        /* Update cross-product matrix using latest block */
        status = vsldSSCompute( task, VSL_SS_CP, VSL_SS_METHOD_1PASS );
    }

/* Convert cross-product matrix into correlation matrix */