Intel® Summary Statistics Library: Several Estimates at One Stroke

Today I needed to compute statistical estimates for a dataset. The observations are weighted, and only some components of the random vector had to be analyzed. How often do we face such tasks, and how do we solve them in everyday work? If such problems occur rarely, or their size is small, then a popular statistical package or a hand-written data processing program is a proper way to address them. But what if I need to process huge data arrays regularly, analyzing gene expression levels, for example? What is a fast way to calculate the estimates? Intel® Summary Statistics Library looks like a suitable approach to such problems.


A typical application that uses the library passes through four stages. Below I consider each stage in more detail, using a simple example that computes the mean, variance, covariance, and variation coefficient.


First, I need to create a new task and pass the parameters of my problem into the library: the dimension p, the number of observations n, and a pointer to the memory where the dataset X is stored:


storage_format_x = VSL_SS_MATRIX_COLUMNS_STORAGE;
errcode = vsldSSNewTask( &task, &p, &n, &storage_format_x, X, weights, indices );



The array weights contains the weight assigned to each observation, and the array indices determines which components of the random vector I work with. For example, indices can be initialized as follows:




indices[] = {0, 1, 1, 0, 1, ...};



That is, observations for the zeroth and third components of the random vector are excluded from the analysis. The dataset can be stored by columns or by rows; its storage format is passed into the library via the variable storage_format_x. By the way, I can pass null pointers instead of weights and indices if I want to set all weights to one and process all components of the random vector.


Next, I need to register in the task descriptor the arrays that will hold the computation results, as well as other parameters. Intel® Summary Statistics Library provides a set of editors for this purpose; in my application I use several of them:


errcode = vsldSSEditTask( task, VSL_SS_ACCUMULATED_WEIGHT, W );
errcode = vsldSSEditTask( task, VSL_SS_VARIATION_ARRAY, Variation );
errcode = vsldSSEditMoments( task, Xmean, Raw2Mom, 0, 0, Central2Mom, 0, 0 );
cov_storage = VSL_SS_MATRIX_FULL_STORAGE;
errcode = vsldSSEditCovCor( task, Xmean, Cov, &cov_storage, 0, 0 );



Estimates for the mean, the 2nd algebraic moment, the variance, and the variation coefficient will be stored in the arrays Xmean, Raw2Mom, Central2Mom, and Variation; the covariance estimate will be placed in the array Cov. Please do not forget to set the storage format for the covariance matrix: the library supports full and packed formats and needs to “know” how to store the computation results for you. Registering an array to hold the mean estimate is required in most cases, even if I do not need the estimate itself, because many other statistical estimates rely on the mean value (the User Manual provides the necessary details).


And now we can compute the estimates of interest. It is enough to call the computation routine just once:




errcode = vsldSSCompute( task, VSL_SS_MEAN | VSL_SS_2CENTRAL_MOMENT | VSL_SS_COVARIANCE_MATRIX | VSL_SS_VARIATION, VSL_SS_FAST_METHOD );



Not bad, is it? A specific feature of the library is that I pass in pointers to the memory where the data are stored. This lets me place new data in the same memory and call the Compute routine again, without having to edit the task descriptor once more.


Finally, I free the system resources allocated for the task:




errcode = vslSSDeleteTask( &task );



Solving the problem described above with p = 500 and n = 100,000 took 1.42 seconds on my two-way quad-core Intel® Xeon® E5440 2.8 GHz system (8 cores total). Running the same application on the same machine in serial mode (using only one core) took 9.09 seconds.
