Computing Quantiles for Streaming Data

 

Summary Statistics provides a method for computing quantiles that supports out-of-memory data.

In a nutshell, this method permits getting an e-approximate quantile in one pass over the dataset, without knowing the total size of the dataset in advance. The e-approximate quantile is an element in the dataset with the rank within the interval [r- en, r+en] for a user-provided error e, size of dataset n and any rank r=1,...,n.

[Zhang2007] describes the theory and properties of this algorithm. For details on computational aspects, requirements, and usage model of the algorithm, see Using VSL_SS_METHOD_SQUANTS_ZW for Quantiles Computation chapter of this document.

Example:

Consider a simple application that uses this algorithm. The dimension of the task is set to p=1. The total number of observations is set to n=10,000,000. The dataset is returned in blocks of 10,000 elements each. The goal is to compute deciles with a pre-defined error e = 0.00001, that is, the array elements with ranks deviating from the rank of accurate deciles by not more than 100 positions.

The library contains a special editor for this algorithm:

status = vsldSSEditStreamQuantiles ( task, &quant_order_n, quant_order, 
                                     quant, &n_params, &params );

where

  1. quant_order_n is the total number of quantiles, set to 9 in this example.

  2. quant_order is an array initialized with quantile orders 0.1, 0.2,..., 0.9.

  3. &params is an array of parameters. The only element in the array is the user-defined error e, set to 0.00001 in this example.

The computation results are placed into the quants array.

To initialize the size of the array that contains parameters of the algorithm, you can use the macro defined in the library:

n_params = VSL_SS_SQUANTS_ZW_PARAMS_N

The loop for computing the deciles is as follows:

for ( block_index = 0; block_index < max_block_num; block_index++ )
{
    // Get the next data block of size block_n
    ...
    status = vsldSSCompute( task, VSL_SS_STREAM_QUANTS,
                                  VSL_SS_METHOD_SQUANTS_ZW );
    // Process computation results
    ...
}

Intermediate estimates of deciles are obtained immediately after processing the next block. As the dataset contains Gaussian numbers with the mean equal to 0 and the variance equal to 1, the sequence of the estimates is as follows:

Block   Streaming deciles:
index        D1      D2      D3      D4      D5     D6     D7     D8    D9
1        -1.2671 -0.8442 -0.5257 -0.2667 -0.0115 0.2524 0.5391 0.8496 1.2695 
2        -1.2880 -0.8478 -0.5374 -0.2766 -0.0192 0.2400 0.5131 0.8327 1.2690 
3        -1.2848 -0.8386 -0.5261 -0.2656 -0.0110 0.2428 0.5163 0.8366 1.2704
...
...
998      -1.2815 -0.8414 -0.5241 -0.2531  0.0009 0.2536 0.5248 0.8412 1.2814 
999      -1.2815 -0.8414 -0.5241 -0.2531  0.0009 0.2536 0.5248 0.8413 1.2814 
1000     -1.2815 -0.8414 -0.5241 -0.2531  0.0008 0.2536 0.5248 0.8412 1.2813

If you need to compute the estimate for the whole dataset only, you can use the fast version of the same method. It permits you to update the internal data structure in the library without actual computation of the intermediate estimates. 

for ( block_index = 0; block_index < max_block_num; block_index++ )
{
    // Get the next data block of size block_n
    ...
    status = vsldSSCompute( task, VSL_SS_STREAM_QUANTS,
                                  VSL_SS_METHOD_SQUANTS_ZW_FAST );
} 

To get the estimate, set the block_n variable to zero, make sure it is registered in the library, and call the Compute routine:

block_n = 0;
status = vsldSSCompute( task, VSL_SS_STREAM_QUANTS,
                              VSL_SS_METHOD_SQUANTS_ZW );

The output of this application is identical to the last line of the previous table:

Streaming deciles:
   D1        D2       D3       D4      D5      D6      D7      D8     D9  
-1.28154 -0.84141 -0.52418 -0.25312 0.00088 0.25367 0.52483 0.84129 1.28139

To check the estimates, the in-memory version of the quantile algorithm calculates accurate deciles for the same dataset. This algorithm returns the following output:

"Accurate" deciles:
   D1        D2       D3       D4      D5       D6      D7      D8     D9  
 -1.28155 -0.84142 -0.52417 -0.25311 0.00089 0.25368 0.52484 0.84130 1.28140 

The maximal difference between ranks of in-memory and out-of-memory deciles does not exceed 100, which fully aligns with the theory:

Rank difference
   D1        D2       D3       D4      D5      D6      D7      D8     D9  
   4         5        3        1       3       7       8       7      4
Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Para obter mais informações sobre otimizações de compiladores, consulte Aviso sobre otimizações.