Intel® Summary Statistics Library: how to process data in chunks?

By Dmitry Kabaev (Intel) (11 posts) on August 22, 2008 at 5:51 am

In my previous post I looked at computing statistical estimates for in-memory datasets with the tools available in the Intel® Summary Statistics Library. New days bring new problems, and today I need to compute the same estimates for data that cannot fit into the computer's memory. To process the data I am going to split the dataset into chunks (processing in portions is also useful for in-memory data that is not all available at once). Can the Intel® Summary Statistics Library help here as well? The answer is yes, and I do not think significant effort is required to adapt the application to out-of-memory data.

As before, I will follow the same four stages of library usage. However, I need some additional initialization to support my huge dataset. First of all, I set the estimates I am interested in to zero (or to any other value that makes sense to me):

for( i = 0; i < p; i++ )
{
     Xmean[i] = 0.0;
     Raw2Mom[i] = 0.0;
     Central2Mom[i] = 0.0;
     for(j = 0; j < p; j++)
     {
         Cov[i][j] = 0.0;
     }
}

Then I initialize the array W of size 2, which will hold the accumulated weights; they are important for correct computation of the estimates:

W[0] = 0.0; W[1] = 0.0;
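
To see why accumulated weights matter, here is a minimal sketch in plain C. It is not the library's internal algorithm. It assumes W[0] holds the sum of the weights seen so far and W[1] the sum of their squares, and the helper name update_weighted_mean is hypothetical. With W[0] carried over, the mean of the old data and the new chunk can be combined without revisiting the old data:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a library call: folds one chunk of n weighted
   observations into a running weighted mean.
   Assumption: W[0] = sum of weights so far, W[1] = sum of squared weights. */
static void update_weighted_mean(double *mean, double W[2],
                                 const double *x, const double *w, size_t n)
{
    double chunk_w = 0.0, chunk_w2 = 0.0, chunk_wx = 0.0;
    size_t i;

    assert(mean != NULL && W != NULL);

    /* Sums over the new chunk only. */
    for (i = 0; i < n; i++) {
        chunk_w  += w[i];
        chunk_w2 += w[i] * w[i];
        chunk_wx += w[i] * x[i];
    }

    /* Old mean is re-weighted by the old total weight W[0]. */
    if (W[0] + chunk_w > 0.0) {
        *mean = (W[0] * (*mean) + chunk_wx) / (W[0] + chunk_w);
    }

    W[0] += chunk_w;
    W[1] += chunk_w2;
}
```

Starting from mean = 0 and W = {0, 0}, calling the helper once per chunk gives the same weighted mean as a single pass over the whole dataset, which is why the estimates and W are zeroed before the first chunk.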

My next step is to read the first portion of the dataset into the array X, together with the weights of the observations:

GetNextDataChunk( X, weights );

The remaining steps are similar to those discussed in the previous post: creating the task, editing its parameters, computing the necessary estimates, and de-allocating the task resources:

/* Creation of task */
storage_format_x   = VSL_SS_MATRIX_COLUMNS_STORAGE;
errcode = vsldSSNewTask( &task, &p, &n_portion,  &storage_format_x, X, weights, indices );

/* Editing of task parameters */
errcode = vsldSSEditTask( task, VSL_SS_ACCUMULATED_WEIGHT, W );
errcode = vsldSSEditTask( task, VSL_SS_VARIATION_ARRAY, Variation );
errcode = vsldSSEditMoments( task, Xmean, Raw2Mom, 0, 0, Central2Mom, 0, 0 );

Cov_storage = VSL_SS_MATRIX_FULL_STORAGE;
errcode = vsldSSEditCovCor( task, Xmean, (double*)Cov, &Cov_storage, 0, 0 );

/* Computation of estimates for dataset split in chunks */
for( nchunk = 0; ; nchunk++ )
{
     errcode = vsldSSCompute( task,
                              VSL_SS_MEAN | VSL_SS_2CENTRAL_MOMENT |
                              VSL_SS_COVARIANCE_MATRIX | VSL_SS_VARIATION,
                              VSL_SS_1PASS_METHOD );
     if ( nchunk >= N ) break;
     GetNextDataChunk( X, weights );
}

/* De-allocation of task resources */
errcode = vslSSDeleteTask( &task );
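
For readers who want to see what a chunk-by-chunk update of the mean and the second central moment can look like, here is a minimal sketch in plain C. It uses the well-known pairwise merge of running statistics and is only an illustration, not the method the library uses internally. The names RunningStats, add_chunk and sample_variance are hypothetical, observations are unweighted, and the division by n - 1 is an assumption about the normalization:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double n;     /* number of observations seen so far */
    double mean;  /* running mean */
    double m2;    /* running sum of squared deviations from the mean */
} RunningStats;

/* Merges one chunk of n observations into the running statistics. */
static void add_chunk(RunningStats *s, const double *x, size_t n)
{
    double chunk_mean = 0.0, chunk_m2 = 0.0, delta, total;
    size_t i;

    assert(s != NULL);
    if (n == 0) return;

    /* Statistics of the new chunk alone. */
    for (i = 0; i < n; i++) chunk_mean += x[i];
    chunk_mean /= (double)n;
    for (i = 0; i < n; i++) {
        double d = x[i] - chunk_mean;
        chunk_m2 += d * d;
    }

    /* Combine with what was accumulated from earlier chunks. */
    delta = chunk_mean - s->mean;
    total = s->n + (double)n;
    s->mean += delta * (double)n / total;
    s->m2   += chunk_m2 + delta * delta * s->n * (double)n / total;
    s->n     = total;
}

/* Assumption: unbiased normalization by n - 1. */
static double sample_variance(const RunningStats *s)
{
    return s->n > 1.0 ? s->m2 / (s->n - 1.0) : 0.0;
}
```

A RunningStats value initialized to all zeros plays the same role as the zeroed estimates and accumulated weights above: each call to add_chunk needs only the current chunk and the accumulated state.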

I also have the option of placing the next portion of the data in another array. The overall computation scheme stays the same; I just need to "communicate" the address of the new chunk to the library, as shown below:

for( nchunk = 0; ; nchunk++ )
{
     errcode = vsldSSCompute( task,
                              VSL_SS_MEAN | VSL_SS_2CENTRAL_MOMENT |
                              VSL_SS_COVARIANCE_MATRIX | VSL_SS_VARIATION,
                              VSL_SS_1PASS_METHOD );
     if ( nchunk >= N ) break;
     GetNextDataChunk( NextXChunk, weights );
     errcode = vsldSSEditTask( task, VSL_SS_OBSERVATIONS, NextXChunk );
}

Nothing else is required. Easy? I believe so.
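
The idea of pointing the task at a different array for each chunk can be pictured with a toy example in plain C. Everything here is hypothetical: ToyTask stands in for a real task and only sums its observations, toy_set_observations plays the role of re-registering the observations address, and the chunk size of 4 is arbitrary. The sketch alternates between two buffers, so one buffer could in principle be refilled while the other is still in use:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* illustrative chunk size */

/* Hypothetical stand-in for a task: it remembers where the current
   observations live and accumulates their sum across chunks. */
typedef struct {
    const double *obs;
    size_t n;
    double sum;
} ToyTask;

/* Tells the toy task where the next chunk is stored. */
static void toy_set_observations(ToyTask *t, const double *obs, size_t n)
{
    t->obs = obs;
    t->n = n;
}

/* Processes whatever chunk the task currently points at. */
static void toy_compute(ToyTask *t)
{
    size_t i;
    for (i = 0; i < t->n; i++) t->sum += t->obs[i];
}

/* Streams `total` values through two alternating buffers. */
static double sum_with_two_buffers(const double *data, size_t total)
{
    double buf[2][CHUNK];
    ToyTask task = { NULL, 0, 0.0 };
    size_t done = 0, k = 0;

    assert(data != NULL || total == 0);
    while (done < total) {
        size_t n = total - done < CHUNK ? total - done : CHUNK;
        double *next = buf[k % 2];          /* pick the other buffer */

        memcpy(next, data + done, n * sizeof *next);
        toy_set_observations(&task, next, n);
        toy_compute(&task);
        done += n;
        k++;
    }
    return task.sum;
}
```

The accumulated state lives in the task, not in the buffers, so switching the address between chunks does not disturb the running result.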

Categories: Financial Services Industry, Parallel Programming, Software Engineering, What If Software

Comments (4)

August 22, 2008 7:01 AM PDT


mr_duck
Interesting... Why do I need array W? And what are W[0] and W[1]?
August 22, 2008 7:21 AM PDT

Ilya Burylov (Intel)
mr_duck, thank you for your question!
We need this array W in order to support data in chunks. It is used internally to combine previously collected values with new data.
W[0] is the sum of all weights of the observations and W[1] is the sum of squares of the weights.
Does that answer your question?
August 22, 2008 10:49 AM PDT


mr_duck
Thanks a lot, Ilya!
I just wanted to ask: is this the best place for follow-up questions, or is there a better place to ask usage-model questions? In the forum or somewhere else?
Thanks again!
August 25, 2008 12:00 AM PDT

Dmitry Kabaev (Intel)
We will be grateful for your questions and suggestions on the development of the library functionality. Please do not hesitate to ask me directly here. I will try to share some data processing methods developed with multicore processors in mind. Thanks.
