English | 中文 | Русский | Français
2,598 Posts served
8,348 Conversations started
Algorithms of Intel® Summary Statistics Library provide support for huge datasets which can not fit into memory of a computer. Earlier I described how analysis of the dataset which is available in blocks/chunks can be done for case of statistical moments. The Update to Intel® Summary Statistics Library 1.0 which was recently released contains the algorithm for computations of quantiles, case of streaming data. Let me have a closer look at the opportunities that are possible with this feature.
The theory and properties of this algorithm are described in [1]. In a nutshell, this method allows getting eps-approximate quantile (that is, the element in the dataset whose rank is within [r-eps*n, r+esp*n] interval for user-provided error eps, size of dataset n and any rank r=1,…,n) for price of one pass over the dataset. Another important aspect of the algorithm is in general case absence of the requirement to know the total size of the dataset in advance.
To understand how the algorithm works, I developed a simple application. As usual, it passes through four basic important stages: creation of the task, edition of necessary task parameters, getting stat estimates, and de-allocation of the resources. I already described this model here and now demonstrate usage of the editor and computation of the estimate.
For goals of my experiment I set dimension of the task p to 1 and chose total number of observations n=10,000,000. The dataset “arrives” in blocks of 10,000 elements each. My nearest task is to compute deciles with pre-defined error eps = 0.00001 (that is, I expect to get elements as decile estimates whose rank deviate from the “accurate” one by no more than 100 positions).
The library contains special editor for this algorithm:
status = vsldSSEditStreamQuantiles ( task, &quant_order_n, quant_order, quantiles, ¶ms_n, ¶ms );
Array quant_order is initialized with order of quantiles: 0.1, 0.2, …, 0.9; total number of quantiles I would like to compute is quant_order_n=9. Results of the computations will be placed into array quantiles. Finally, the algorithm for computation of streaming quantiles accepts array of parameters. It contains one element, user-defined error eps which is 0.00001 in my case. To initialize size of array which contains parameters of the algorithm I use the macro defined in the library: params_n = VSL_SS_STREAM_QUANTILES_ZW_PARAMS_N.
The loop for computation of the deciles is as follows:
for ( block_index = 0; block_index < max_block_num; block_index++ )
{
// Get the next data block of size block_n
…
status = vsldSSCompute( task, VSL_SS_STREAM_QUANTILES, VSL_SS_STREAM_QUANTILES_ZW_METHOD );
// Process computation results
…
}
Immediately after the processing of the next block I get intermediate estimates of deciles. In my case (my dataset contains Gaussian numbers with mean equal to 0 and variance 1) I have the sequence of the estimates:
Block Streaming deciles:
index D1 D2 D3 D4 D5 D6 D7 D8 D9
1 -1.267162 -0.844286 -0.525791 -0.266775 -0.011528 0.252491 0.539124 0.849691 1.269578
2 -1.288032 -0.847845 -0.537461 -0.276688 -0.019232 0.240052 0.513117 0.832769 1.269036
3 -1.284862 -0.838628 -0.526169 -0.265614 -0.011068 0.242837 0.516336 0.836675 1.270469
…
…
998 -1.281560 -0.841417 -0.524175 -0.253100 0.000917 0.253695 0.524858 0.841297 1.281417
999 -1.281570 -0.841419 -0.524176 -0.253106 0.000914 0.253691 0.524859 0.841312 1.281418
1000 -1.281543 -0.841419 -0.524180 -0.253124 0.000888 0.253677 0.524839 0.841297 1.281397
Well, what if I do not need to know intermediate estimate of quantiles, and my goal - the estimate for the whole dataset? In this case the library gives me opportunity to use so called “Fast Mode” which allows me just to update the library internal data structures without actual computation of the intermediate estimates.
for ( block_index = 0; block_index < max_block_num; block_index++ )
{
// Get the next data block of size block_n
…
status = vsldSSCompute( task, VSL_SS_STREAM_QUANTILES_FMODE, VSL_SS_STREAM_QUANTILES_ZW_METHOD );
}
As soon as this loop completes I set block_n variable to zero (please, make sure that this variable was registered in the library during creation of the task or by means of the suitable editor) and call Compute procedure to get the estimates:
block_n = 0;
status = vsldSSCompute( task, VSL_SS_STREAM_QUANTILES, VSL_SS_STREAM_QUANTILES_ZW_METHOD );
The result of this experiment is identical to the last line of the previous table:
Streaming deciles:
D1 D2 D3 D4 D5 D6 D7 D8 D9
-1.281543 -0.841419 -0.524180 -0.253124 0.000888 0.253677 0.524839 0.841297 1.281397
Finally, I should check the accuracy of the estimates. For this reason I compute “accurate” deciles for the same dataset using the algorithm in the library intended for processing of in-memory datasets. The algorithm can return order statistics as well. Its output turned out to be as follows:
“Accurate” deciles:
D1 D2 D3 D4 D5 D6 D7 D8 D9
-1.281545 -0.841421 -0.524179 -0.253124 0.000889 0.253675 0.524842 0.841302 1.281399
This is not enough, now I should compute maximal difference between ranks of “in-memory” and “out-of-memory” deciles. As you remember it should not be bigger than 100:
Rank difference
D1 D2 D3 D4 D5 D6 D7 D8 D9
4 5 3 1 3 7 8 7 4
So, I’m absolutely satisfied with these estimates.
References
[1] Q. Zhang and W. Wang. A Fast Algorithm for Approximate Quantiles in High Speed Data Streams. SSDBM, p.29, 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), 2007.
