Intel® Summary Statistics Library: how to process data in chunks?

In my previous post I considered computation of statistical estimates for in-memory datasets using the tools available in Intel® Summary Statistics Library. New days bring new problems: today I need to compute the same estimates for data that cannot fit into the memory of a computer. To process the data I am going to split the dataset into chunks (processing data in portions is also useful for in-memory data that are not available all at once). Can Intel® Summary Statistics Library help with this problem too? The answer is yes, and I do not think significant effort is required to adapt the application to support out-of-memory data.

As before, I follow the same four stages of library usage. However, I need additional initializations to support my huge dataset. First of all, I set the estimates of interest to zero (or any other value that makes sense to me):

for( i = 0; i < p; i++ )
{
    Xmean[i]       = 0.0;
    Raw2Mom[i]     = 0.0;
    Central2Mom[i] = 0.0;
    for( j = 0; j < p; j++ )
    {
        Cov[i][j] = 0.0;
    }
}


Then I initialize the array W of size 2 that will hold the accumulated weights; they are important for correct computation of the estimates:

W[0] = 0.0; W[1] = 0.0;


My next step is to read the first portion of the dataset into the array X, together with the weights of the observations:

GetNextDataChunk( X, weights );


My further steps are similar to those discussed in the previous post: creating the task, editing its parameters, computing the necessary estimates, and de-allocating the task resources:

/* Creation of task */
storage_format_x = VSL_SS_MATRIX_COLUMNS_STORAGE;
errcode = vsldSSNewTask( &task, &p, &n_portion, &storage_format_x, X, weights, indices );

/* Editing of task parameters */
errcode = vsldSSEditTask( task, VSL_SS_ACCUMULATED_WEIGHT, W );
errcode = vsldSSEditTask( task, VSL_SS_VARIATION_ARRAY, Variation );
errcode = vsldSSEditMoments( task, Xmean, Raw2Mom, 0, 0, Central2Mom, 0, 0 );

Cov_storage = VSL_SS_MATRIX_FULL_STORAGE;
errcode = vsldSSEditCovCor( task, Xmean, (double*)Cov, &Cov_storage, 0, 0 );

/* Computation of estimates for the dataset split into N chunks */
for( nchunk = 0; ; nchunk++ )
{
    errcode = vsldSSCompute( task,
                             VSL_SS_MEAN | VSL_SS_2CENTRAL_MOMENT |
                             VSL_SS_COVARIANCE_MATRIX | VSL_SS_VARIATION,
                             VSL_SS_1PASS_METHOD );
    /* The first chunk was read before the loop, so stop once
       chunk number N has been processed */
    if( nchunk >= N - 1 ) break;
    GetNextDataChunk( X, weights );
}

/* De-allocation of task resources */
errcode = vslSSDeleteTask( &task );



I also have the opportunity to place the next portion of the data in another array. The overall computation scheme stays the same; I just need to "communicate" the address of the new chunk to the library, as below:

for( nchunk = 0; ; nchunk++ )
{
    errcode = vsldSSCompute( task,
                             VSL_SS_MEAN | VSL_SS_2CENTRAL_MOMENT |
                             VSL_SS_COVARIANCE_MATRIX | VSL_SS_VARIATION,
                             VSL_SS_1PASS_METHOD );
    if( nchunk >= N - 1 ) break;
    GetNextDataChunk( NextXChunk, weights );
    errcode = vsldSSEditTask( task, VSL_SS_OBSERVATIONS, NextXChunk );
}


Nothing else is required. Easy? I believe so.
For more complete information about compiler optimizations, see the Optimization Notice.