Gentle Introduction to PyDAAL: Vol 2 Basic Operations on Numeric Tables

By PREETHI VENKATESH, Nathan G Greeneltch, Published: 09/19/2017, Last Updated: 09/19/2017

Previous: Vol 1: Data Structures

A wide range of classes is available in the Intel® Data Analytics Acceleration Library (Intel® DAAL) to create numeric tables that accommodate various data layouts, dtypes, and frequent access methods. Volume 1 of this series covers numeric table creation under different scenarios. Once a table is created, Intel® DAAL provides operational methods for visualizing and mutating it. Volume 2 covers the usage of these operational methods. Volume 3 then gives a brief introduction to the Algorithm section of PyDAAL. Table 1 can be used as a quick reference for basic operations on Intel® DAAL’s numeric tables.

Volumes in Gentle Introduction Series

  • Vol 1: Data Structures - Introduces the Data Management component of Intel® DAAL and the available custom data structures (Numeric Table and Data Dictionary) with code examples.
  • Vol 2: Basic Operations on Numeric Tables - Introduces the operations that can be performed on Intel® DAAL's custom data structures (Numeric Table and Data Dictionary) with code examples.
  • Vol 3: Analytics Model Building and Deployment - Introduces the analytics modeling and evaluation process in Intel® DAAL, with serialized deployment in batch processing.
  • Vol 4: Distributed and Online Processing - Introduces Intel® DAAL’s advanced processing modes (distributed and online) that support data analysis and model fitting on large and streaming data.

IDP and Intel® DAAL Installation

The demonstrations in this article require the Intel® Distribution for Python (IDP), Intel® DAAL, and mpi4py, all of which are available for free on Anaconda Cloud.

1. Create the full IDP environment, which installs all the required packages

conda create -n IDP -c intel intelpython3_full python=3.6

2. Activate IDP environment

source activate IDP
(or)
activate IDP

Table 1. Quick reference table on available methods

| Method Description | Usage Syntax |
|---|---|
| *Print a numeric table as stored in memory to show its data layout. Requires ‘nT’ as input argument. | printNumericTable(nT) |
| *Quick visualization of multiple numeric tables. | printNumericTables(nT1, nT2) |
| Check the shape of a numeric table: number of rows. | nT.getNumberOfRows() |
| Check the shape of a numeric table: number of columns. | nT.getNumberOfColumns() |
| Allocate a buffer (here a memory block with double dtype) to load a block of the numeric table for access and manipulation operations. | block = BlockDescriptor_Float64() |
| Retrieve a block of column values from the numeric table into the Block Descriptor for visualization. (Setting rwflag to ‘readOnly’ enables only read access to the buffer.) | nT.getBlockOfColumnValues(colIndex, firstRowIndex, numRows, rwflag, block) |
| Retrieve a block of rows from the numeric table into the Block Descriptor for visualization. (Setting rwflag to ‘readOnly’ enables only read access to the buffer.) | nT.getBlockOfRows(firstRowIndex, numRows, rwflag, block) |
| Extract the numpy array from a Block Descriptor object once it is loaded with a block of values. | block.getArray() |
| Release a block of rows from the buffer. | nT.releaseBlockOfRows(block) |
| *Print the underlying array of a numeric table. Requires ‘np.array’ as input argument. | printArray(block.getArray(), num_printed_cols, num_printed_rows, num_cols, message) |
| Check the feature type of each column in the numeric table's data dictionary. | dict[colIndex].featureType |

* denotes functions included in the ‘utils’ folder, which can be found in <install_root>/share/pydaal_examples/examples/python/source/.

Different phases of Numeric Table life cycle

1. Initiate

Let’s begin by constructing a numeric table (nT) directly from a NumPy array. We will use this nT throughout the rest of the code examples in this volume.

import numpy as np
from daal.data_management import HomogenNumericTable
array =np.array([[1,2,3,4],
                [5,6,7,8]])
nT= HomogenNumericTable(array)

2. Operate

Once initialized, numeric tables provide various classes and member functions to access and manipulate data similar to a pandas DataFrame. We will dive next into Intel DAAL’s operational methods, after an important note about Intel DAAL’s bookkeeping object called Data Dictionary.

Data Dictionary:

As mentioned in Vol 1 of this series, which covers the creation of Intel DAAL’s numeric tables, these custom data structures must be accompanied by a Data Dictionary to perform operations. When raw data streams into memory to populate the numeric table structure, the table’s Data Dictionary concurrently records metadata. The dictionary is created automatically unless the user specifies that it should not be allocated. Various Data Dictionary methods are available to access and manipulate feature types, dtypes, etc. If a user creates a numeric table without memory allocation, the Data Dictionary values have to be set explicitly with feature types. Note that Intel DAAL’s Data Dictionary is a custom data structure, not a Python dictionary.

More details on working with Intel DAAL Data Dictionaries

2.1 Data Mutation in Numeric Table:

2.1.1 Standardization and Normalization:

Data analysis work is usually preceded by a data preprocessing stage that includes data wrangling, quality checks, and handling of null values, outliers, etc. An important preprocessing activity is normalizing the input data. Intel DAAL offers routines for two popular normalization techniques on numeric tables: Z-score standardization and Min-Max normalization.

Currently, Intel DAAL only supports rescaling for descriptive analytics. In the future, support will be added for predictive analytics with the addition of a “transform()” method to be applied to new data.

  • Z-score Standardization: Rescales numeric table values feature-wise to the number of standard deviations each value deviates from the mean. Below are the steps to use Intel DAAL’s z-score standardization.

    import daal.algorithms.normalization.zscore as zscore
    
    # Create an algorithm
    algorithm = zscore.Batch(method=zscore.sumDense)
    
    # Set input object for the algorithm to nT
    algorithm.input.set(zscore.data, nT)
    
    # Compute Z-score normalization function
    res = algorithm.compute()
    
    #Retrieve normalized nT
    Norm_nT= res.get(zscore.normalizedData)
    
  • Min-Max Normalization: Rescales numeric table values feature-wise and linearly to fit a target range such as [0, 1] or [-1, 1]. Below are the steps to use Intel DAAL’s Min-Max normalization.

    import daal.algorithms.normalization.minmax as minmax
    
    # Create an algorithm
    algorithm = minmax.Batch(method=minmax.defaultDense)
    
    # Set lower and upper bounds for the algorithm
    algorithm.parameter.lowerBound = -1.0
    algorithm.parameter.upperBound = 1.0
    
    # Set input object for the algorithm to nT
    algorithm.input.set(minmax.data, nT)
    
    # Compute Min-max normalization function
    res = algorithm.compute()
    
    # Get normalized numeric table
    Norm_nT = res.get(minmax.normalizedData)
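
To sanity-check rescaled values, the two transformations described above can be written in a few lines of NumPy. This is a minimal reference sketch, not Intel DAAL code: the function names zscore_reference and minmax_reference are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption that may need adjusting to match the library's output. Constant columns (zero spread) are not handled.

```python
import numpy as np


def zscore_reference(data, ddof=1):
    """Column-wise z-score: (x - mean) / std.

    ddof=1 (sample standard deviation) is an assumption of this sketch.
    """
    data = np.asarray(data, dtype=np.float64)
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=ddof)
    return (data - mean) / std


def minmax_reference(data, lower=0.0, upper=1.0):
    """Column-wise linear rescaling of each feature into [lower, upper]."""
    data = np.asarray(data, dtype=np.float64)
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    scale = (upper - lower) / (col_max - col_min)
    return lower + (data - col_min) * scale


# Same values as the nT used throughout this volume
sample = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8]])

z = zscore_reference(sample)
mm = minmax_reference(sample, lower=-1.0, upper=1.0)
print(z)
print(mm)
```

With only two rows per column, each column of the Min-Max result maps its smaller value to the lower bound and its larger value to the upper bound.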

2.1.2 Block Descriptor for Visualization and Mutation:

The contents of a numeric table cannot be accessed directly for visualization or manipulation. Instead, a user must first move the requested block of data values to a memory buffer, which is housed in an object called a BlockDescriptor. An Intel DAAL numeric table object has member functions that retrieve blocks of rows or columns and load them into the BlockDescriptor. The rwflag argument sets “readOnly” or “readWrite” mode, depending on whether the user intends to update values in the numeric table when the block is released. Conveniently, block retrieval can be restricted to specific rows and/or columns. Note: the dtype of the data in the BlockDescriptor buffer is not required to match that of the numeric table that sourced the block.

Access Modes:
  • “readOnly” argument sets rwflag to provide read only access to numeric table contents, thus performing no updates to the table when the block is released from buffer memory.

    Syntax and Usage:

    from daal.data_management import BlockDescriptor_Float64, readOnly
    #Allocate a readOnly memory block with double dtype 
    block = BlockDescriptor_Float64()
    nT.getBlockOfRows(0,1, readOnly, block)
    
  • “readWrite” argument sets rwflag to write back any changes from block descriptor object to the numeric table when the block is released from buffer memory, thus enabling numeric table mutation with the help of block descriptor.

    Syntax and Usage:

    from daal.data_management import BlockDescriptor_Float64, readWrite
    #Allocate a readWrite memory block with double dtype 
    block = BlockDescriptor_Float64()
    nT.getBlockOfRows(0,1, readWrite, block)
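
The difference between the two access modes can be pictured with a small toy model in plain NumPy. This sketch only mimics the behaviour described above (copy out on retrieval, optional write-back on release). ToyTable and ToyBlock are hypothetical classes invented for illustration and say nothing about how Intel DAAL manages memory internally.

```python
import numpy as np


class ToyBlock:
    """Hypothetical buffer object: a copy of some rows plus bookkeeping."""

    def __init__(self, array, first_row, writable):
        self.array = array
        self.first_row = first_row
        self.writable = writable


class ToyTable:
    """Hypothetical stand-in for a numeric table (not an Intel DAAL class)."""

    def __init__(self, data):
        self.data = np.array(data, dtype=np.float64)

    def get_block_of_rows(self, first_row, n_rows, writable):
        # Hand out a copy, so edits do not touch the table directly
        rows = self.data[first_row:first_row + n_rows].copy()
        return ToyBlock(rows, first_row, writable)

    def release_block_of_rows(self, block):
        # Only a writable block is written back on release
        if block.writable:
            n_rows = block.array.shape[0]
            self.data[block.first_row:block.first_row + n_rows] = block.array


table = ToyTable([[1, 2, 3, 4], [5, 6, 7, 8]])

# Read-only style access: the edit is discarded on release
ro_block = table.get_block_of_rows(0, 2, writable=False)
ro_block.array[0] = 0
table.release_block_of_rows(ro_block)
after_read_only = table.data.copy()

# Read-write style access: the edit is written back on release
rw_block = table.get_block_of_rows(0, 2, writable=True)
rw_block.array[0] = 0
table.release_block_of_rows(rw_block)
after_read_write = table.data.copy()

print(after_read_only)
print(after_read_write)
```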
    

2.1.3 BlockDescriptor() in “readWrite” mode:

When the rwflag argument is set to “readWrite” in getBlockOfRows() or getBlockOfColumnValues(), the contents of the BlockDescriptor object are written back to the numeric table when the block is released, making it possible to edit existing rows and columns of the numeric table.

Let’s create a basic numeric table to explain BlockDescriptor in “readWrite” mode in detail.

import numpy as np
from daal.data_management import HomogenNumericTable, readWrite, BlockDescriptor
from utils import printNumericTable
array =np.array([[1,2,3,4],
                [5,6,7,8]])
nT= HomogenNumericTable(array)
  • Edit numeric table Row-wise:
    printNumericTable(nT,"Original nT: ")
    #Create buffer object with ntype "double"
    doubleBlock = BlockDescriptor(ntype=np.float64)
    
    firstRow = 0
    lastRow = nT.getNumberOfRows()
    
    #getBlockOfRows() member function in "readWrite" mode to retrieve numeric table contents and populate "doubleBlock" object
    nT.getBlockOfRows(firstRow,lastRow, readWrite, doubleBlock)
    #Access array contents from "doubleBlock" object
    array = doubleBlock.getArray()
    #Mutate 1st row of array to reflect on "doubleBlock" object
    array[0] = [0,0,0,0]
    #Release buffer object and write changes back to numeric table
    nT.releaseBlockOfRows(doubleBlock)
    printNumericTable(nT,"Updated nT: ")
    

    nT was originally created with data [[1,2,3,4],[5,6,7,8]]. After row mutation the first row is now replaced using buffer memory. Updated nT has data [[0,0,0,0],[5,6,7,8]].

  • Edit numeric table Column-wise:
    printNumericTable(nT,"Original nT: ")
    #Create buffer object with ntype "double"
    doubleBlock = BlockDescriptor(ntype=np.float64)
    ColIndex = 2
    firstRow = 0
    lastRow = nT.getNumberOfRows()
    
    #getBlockOfColumnValues() member function in "readWrite" mode to retrieve numeric table ColIndex contents and populate "doubleBlock" object
    nT.getBlockOfColumnValues(ColIndex,firstRow,lastRow,readWrite,doubleBlock)
    
    #Access array contents from "doubleBlock" object
    array = doubleBlock.getArray()
    
    #Mutate array to reflect on "doubleBlock" object
    array[:] = 0
    
    #Release buffer object and write changes back to numeric table
    nT.releaseBlockOfColumnValues(doubleBlock)
    printNumericTable(nT, "Updated nT: ")

    nT was originally created with data [[1,2,3,4],[5,6,7,8]]. After the column mutation, the third column is replaced with [0,0] using buffer memory. Updated nT has data [[1,2,0,4],[5,6,0,8]].

2.1.4 Merge numeric table:

Numeric tables can be appended along rows or columns, provided they share the same size along the axis being merged. Two classes are available to merge numeric tables: RowMergedNumericTable() merges along rows, and MergedNumericTable() merges along columns.

  • Merge Row-wise:

    Syntax:

    mnT = RowMergedNumericTable()
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2); mnT.addNumericTable(nT3)
         

    Code Example:

    from daal.data_management import HomogenNumericTable, RowMergedNumericTable
    import numpy as np
    from utils import printNumericTable
    
    
    #nT1 and nT2 are 2 numeric tables having equal number of COLUMNS
    array =np.array([[1,2,3,4],
                     [5,6,7,8]], dtype = np.intc)
    nT1= HomogenNumericTable(array)
    array =np.array([[9,10,11,12],
                     [13,14,15,16]],dtype = np.intc)
    nT2= HomogenNumericTable(array)
    
    #Create merge numeric table object
    mnT = RowMergedNumericTable()
    
    #add numeric tables to merged numeric table object
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2)
    printNumericTable(nT1, "Numeric Table nT1: ")
    printNumericTable(nT2, "Numeric Table nT2: ")
    printNumericTable(mnT, "Merged Numeric Table nT1 and nT2: ")

     Output:

    1.000     2.000     3.000     4.000    
    5.000     6.000     7.000     8.000    
    9.000     10.000    11.000    12.000   
    13.000    14.000    15.000    16.000  

  • Merge Column-wise:

    Syntax:

    mnT = MergedNumericTable()
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2); mnT.addNumericTable(nT3)

    Code Example:

    from daal.data_management import HomogenNumericTable, MergedNumericTable
    import numpy as np
    from utils import printNumericTable
    
    #nT1 and nT2 are 2 numeric tables having equal number of ROWS
    array =np.array([[1,2,3,4],
                     [5,6,7,8]], dtype = np.intc)
    nT1= HomogenNumericTable(array)
    
    array =np.array([[9,10,11,12],
                     [13,14,15,16]],dtype = np.intc)
    nT2= HomogenNumericTable(array)
    
    #Create merge numeric table object
    mnT = MergedNumericTable()
    
    #add numeric tables to merged numeric table object
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2)
    
    printNumericTable(nT1, "Numeric Table nT1: ")
    printNumericTable(nT2, "Numeric Table nT2: ")
    printNumericTable(mnT, "Merged Numeric Table nT1 & nT2: ")


    Output:

    1.000     2.000     3.000     4.000     9.000     10.000    11.000    12.000    
    5.000     6.000     7.000     8.000     13.000    14.000    15.000    16.000
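
For readers more familiar with NumPy, the two merge directions correspond roughly to np.vstack and np.hstack on the underlying arrays. The sketch below is only an analogy for checking expected shapes and values. It does not use Intel DAAL, and the variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]], dtype=np.intc)
b = np.array([[9, 10, 11, 12],
              [13, 14, 15, 16]], dtype=np.intc)

# Row-wise merge needs equal numbers of columns
if a.shape[1] != b.shape[1]:
    raise ValueError("row-wise merge requires the same number of columns")
row_merged = np.vstack([a, b])      # analogous to RowMergedNumericTable

# Column-wise merge needs equal numbers of rows
if a.shape[0] != b.shape[0]:
    raise ValueError("column-wise merge requires the same number of rows")
col_merged = np.hstack([a, b])      # analogous to MergedNumericTable

print(row_merged.shape)
print(col_merged.shape)
```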

2.1.5 Split Numeric table:

Table 1 lists the getBlockOfRows() and getBlockOfColumnValues() methods, which extract sections of a numeric table by row or column values. Additionally, the helper function getBlockOfNumericTable() provided below extracts a contiguous subset of the table for a selected range of rows and columns. getBlockOfNumericTable() accepts int or list keyword arguments for the ranges of rows and columns, using conventional Python 0-based indexing.

Syntax and Usage: getBlockOfNumericTable(nT, Rows = 'All', Columns = 'All')

Helper Function:
def getBlockOfNumericTable(nT, Rows='All', Columns='All'):
    import warnings
    from daal.data_management import HomogenNumericTable, \
        HomogenNumericTable_Float64, MergedNumericTable, readOnly, BlockDescriptor
    import numpy as np
#------------------------------------------------------
    # Get first row index and (exclusive) last row index
    lastRow = nT.getNumberOfRows()
    if type(Rows) != str:
        if type(Rows) == list:
            firstRow = Rows[0]
            if len(Rows) == 2: lastRow = min(Rows[1], lastRow)
        else: firstRow = 0; lastRow = Rows
    elif Rows == 'All': firstRow = 0
    else:
        warnings.warn('Type error in "Rows" arguments, Can be only int/list type')
        raise SystemExit
#------------------------------------------------------
    # Get first column index and (exclusive) last column index
    nEndDim = nT.getNumberOfColumns()
    if type(Columns) != str:
        if type(Columns) == list:
            nStartDim = Columns[0]
            if len(Columns) == 2: nEndDim = min(Columns[1], nEndDim)
        else: nStartDim = 0; nEndDim = Columns
    elif Columns == 'All': nStartDim = 0
    else:
        warnings.warn('Type error in "Columns" arguments, Can be only int/list type')
        raise SystemExit
#------------------------------------------------------
    # Retrieve block of column values within first and last rows
    # Merge all the retrieved blocks of column values
    # Return merged numeric table
    mnT = MergedNumericTable()
    for idx in range(nStartDim, nEndDim):
        block = BlockDescriptor()
        # The third argument is the number of rows to read, starting at firstRow
        nT.getBlockOfColumnValues(idx, firstRow, (lastRow - firstRow), readOnly, block)
        mnT.addNumericTable(HomogenNumericTable_Float64(block.getArray()))
        nT.releaseBlockOfColumnValues(block)
    block = BlockDescriptor()
    mnT.getBlockOfRows(0, mnT.getNumberOfRows(), readOnly, block)
    mnT = HomogenNumericTable(block.getArray())
    return mnT



There are 4 different ways of passing arguments to this function:

  1. getBlockOfNumericTable(nT) - Extracts a block of the numeric table containing all rows and columns of nT.
  2. getBlockOfNumericTable(nT, Rows = 4, Columns = 5) - Retrieves the first 4 rows and first 5 columns of nT.
  3. getBlockOfNumericTable(nT, Rows=[2,4], Columns = [1,3]) - Slices the numeric table along the row and column directions using the lower and upper bounds passed in each list. As in Python slicing, the upper bound is exclusive, so this call returns rows 2 to 3 and columns 1 to 2.
  4. getBlockOfNumericTable(nT, Rows=[1,], Columns = [1,]) - Extracts all rows and columns from the lower bound through the last index.
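
The four calling conventions can be tried out without Intel DAAL by applying the same argument handling to a plain NumPy array. The sketch below is an illustration only: resolve_range and block_of_array are hypothetical names, and unlike the helper above this version raises TypeError for an invalid string instead of issuing a warning and exiting.

```python
import numpy as np


def resolve_range(arg, size):
    """Turn 'All', an int, or a one/two element list into (first, last).

    `last` is exclusive, mirroring the helper function above.
    """
    if isinstance(arg, str):
        if arg != 'All':
            raise TypeError("argument must be 'All', an int, or a list")
        return 0, size
    if isinstance(arg, list):
        first = arg[0]
        last = min(arg[1], size) if len(arg) == 2 else size
        return first, last
    return 0, arg


def block_of_array(data, Rows='All', Columns='All'):
    """Apply the resolved row and column ranges to a 2-D array."""
    first_row, last_row = resolve_range(Rows, data.shape[0])
    first_col, last_col = resolve_range(Columns, data.shape[1])
    return data[first_row:last_row, first_col:last_col]


data = np.arange(30).reshape(5, 6)   # 5 rows, 6 columns

everything = block_of_array(data)
top_left = block_of_array(data, Rows=4, Columns=5)
window = block_of_array(data, Rows=[2, 4], Columns=[1, 3])
tail = block_of_array(data, Rows=[1, ], Columns=[1, ])

print(everything.shape, top_left.shape, window.shape, tail.shape)
```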

2.1.6 Change feature type:

Numeric table objects have dictionary manipulation methods to get and set feature types in the Data Dictionary for each column. Categorical(0), Ordinal(1), and Continuous(2) are available feature types in Data Dictionary supported by Intel DAAL.

  • Get dictionary object associated with nT :

    Syntax:  nT.getDictionary()

    Code Example:

    # data_feature_utils is assumed here to be importable from daal.data_management
    from daal.data_management import data_feature_utils

    dict = nT.getDictionary() # nT is numeric table created in section 1
    # 'dict' holds the Data Dictionary of numeric table nT. It can be used to update
    # metadata about the data. The most common use case is to modify the default
    # feature type of nT columns.

    # Print default feature type of 3rd feature (example feature is continuous):
    print(dict[2].featureType) #outputs "2" (denotes Continuous feature type)
    
    # Modify feature type from Continuous to Categorical:
    dict[2].featureType = data_feature_utils.DAAL_CATEGORICAL
    print(dict[2].featureType) #outputs "0" (denotes Categorical feature type)
         
  • Set dictionary object associated with nT:

    setDictionary() replaces the current Data Dictionary of a numeric table, or attaches a new one if needed. It is also useful for batch updates, since an existing Data Dictionary can be overwritten in full.

    When tables are created without allocating memory for the Data Dictionary, the setDictionary() method must be used to construct metadata for the features in the table. Let us again consider nT created in section 1, which has 4 features.

    Syntax:  nT.setDictionary(dict)

    Code Example:

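    # Both names are assumed here to be importable from daal.data_management
    from daal.data_management import NumericTableDictionary, data_feature_utils

    # Release the block retrieved in section 2.1.2, if it is still held
    nT.releaseBlockOfRows(block)
    
    #Create a dictionary object using Numeric table dictionary class with the number of features
    nFeatures = nT.getNumberOfColumns()
    dict = NumericTableDictionary(nFeatures)
    #Allocate a feature type for each feature
    dict[0].featureType = data_feature_utils.DAAL_CONTINUOUS
    dict[1].featureType = data_feature_utils.DAAL_CATEGORICAL
    dict[2].featureType = data_feature_utils.DAAL_CONTINUOUS
    dict[3].featureType = data_feature_utils.DAAL_CATEGORICAL
    
    #set the nT numeric table dictionary with "dict"
    nT.setDictionary(dict)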
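
Since printing a featureType shows only its integer code, a small lookup can make the output easier to read. The sketch below uses the codes listed at the start of this section (Categorical 0, Ordinal 1, Continuous 2). The FeatureType enum and describe_feature_types function are hypothetical helpers written for illustration and are not part of Intel DAAL.

```python
from enum import IntEnum


class FeatureType(IntEnum):
    """Integer codes as listed in this section (hypothetical helper enum)."""
    CATEGORICAL = 0
    ORDINAL = 1
    CONTINUOUS = 2


def describe_feature_types(codes):
    """Translate a sequence of integer feature type codes into readable names."""
    return [FeatureType(code).name.title() for code in codes]


# Codes matching the four-feature dictionary assigned in the example above
labels = describe_feature_types([2, 0, 2, 0])
print(labels)
```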
    
    

2.2 Export Numeric Table to disk:

Numeric tables can be exported and saved to disk as a numpy binary (.npy) file. The following two sections contain helper functions for saving a table in binary form and for compressing the data on disk.

2.2.1 Serialization

Intel DAAL provides interfaces to serialize numeric table objects into a data archive that can be converted to a numpy array object. The resulting Numpy array, which houses the serialized form of the data, can be saved to disk and subsequently reloaded in the future to reconstruct the source numeric table.

To automate this process, the following helper function can be used to serialize and save to disk.

Helper Function:
def Serialize(nT):
    # Construct input data archive object
    # Serialize nT contents into data archive object
    # Copy data archive contents to numpy array
    # Return the numpy array, which can then be saved as .npy
    from daal.data_management import InputDataArchive
    import numpy as np

    dataArch = InputDataArchive()

    nT.serialize(dataArch)

    length = dataArch.getSizeOfArchive()
    buffer_array = np.zeros(length, dtype=np.ubyte)
    dataArch.copyArchiveToArray(buffer_array)

    return buffer_array

buffer_array = Serialize(nT) # call helper function
#np.save(<path>, buffer_array) # This step is optional
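
The optional save step is ordinary NumPy file I/O on a byte array. The sketch below shows that round trip with a stand-in buffer, since producing a real serialized table requires Intel DAAL. The byte content and the file name serialized_table.npy are placeholders chosen for illustration.

```python
import os
import tempfile

import numpy as np

# Stand-in for the array returned by Serialize(nT): any 1-D array of bytes
buffer_array = np.frombuffer(b"stand-in for serialized table bytes", dtype=np.ubyte)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "serialized_table.npy")
    np.save(path, buffer_array)   # write the byte buffer to disk
    restored = np.load(path)      # read it back, e.g. in a later session

print(restored.dtype, restored.shape)
```

The restored array has the same dtype and contents as the original buffer, which is what the deserialization step expects as input.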

2.2.2 Compression

Compressor methods are also available in Intel DAAL to achieve reduced memory footprint when large datasets must be stored to disk. A serialized array representation of an Intel DAAL numeric table can be compressed before saving it to disk, hence achieving optimal storage.

To automate this process, the following helper function can be used to serialize, then compress the resulting serialized array.

Incorporate helper functions Serialize(nT) and CompressToDisk (nT, path) to compress and write numeric tables to disk.

Helper Function:
def CompressToDisk(nT, path):
    # Serialize nT contents
    # Create a compressor object
    # Create a stream for compression
    # Write numeric table to the compression stream
    # Allocate memory to store the compressed data
    # Store compressed data
    # Save compressed data to disk
    from daal.data_management import Compressor_Zlib, level9, CompressionStream
    import numpy as np

    buffer = Serialize(nT)
    compressor = Compressor_Zlib()
    compressor.parameter.gzHeader = True
    compressor.parameter.level = level9
    comprStream = CompressionStream(compressor)
    comprStream.push_back(buffer)
    compressedData = np.empty(comprStream.getCompressedDataSize(), dtype=np.uint8)
    comprStream.copyCompressedArray(compressedData)
    np.save(path, compressedData)

CompressToDisk(nT, <path>) # <path> is a placeholder for the target '.npy' file

2.3 Import Numeric Table from disk:

As mentioned in the previous sections, numeric tables can be stored in the form of either serialized or compressed numpy files. Decompression/ Deserialization methods are available to reconstruct the numeric table.

2.3.1 Deserialization

The helper function below is available to reconstruct a numeric table from serialized array objects.

Helper Function:
def DeSerialize(buffer_array):
    from daal.data_management import OutputDataArchive, HomogenNumericTable
    #Load serialized contents to construct output data archive object
    #De-serialize into nT object and return nT

    dataArch = OutputDataArchive(buffer_array)
    nT = HomogenNumericTable()
    nT.deserialize(dataArch)
    return nT
#buffer_array = np.load(path) # this step is optional, used only when the serialized contents were saved to disk
nT = DeSerialize(buffer_array)

2.3.2 Decompression

Because the compression stage involves serialization of the numeric table object, the decompression stage includes deserialization. The DeSerialize helper function is used to recover the numeric table. A quick decompression helper function is given below.

Incorporate helper functions DeSerialize(buffer_array) and DeCompressFromDisk(path) to decompress and read numeric tables from disk.

Helper Function:
def DeCompressFromDisk(path):
    from daal.data_management import Decompressor_Zlib, DecompressionStream
    import numpy as np
    # Create a decompressor
    decompressor = Decompressor_Zlib()
    decompressor.parameter.gzHeader = True 

    # Create a stream for decompression
    deComprStream = DecompressionStream(decompressor)

    # Write the compressed data to the decompression stream and decompress it
    deComprStream.push_back(np.load(path))

    # Allocate memory to store the decompressed data
    deCompressedData = np.empty(deComprStream.getDecompressedDataSize(), dtype=np.uint8)

    # Store the decompressed data
    deComprStream.copyDecompressedArray(deCompressedData)

    #Deserialize
    return DeSerialize(deCompressedData)

nT = DeCompressFromDisk(<path>) # <path> is a placeholder and must point to a '.npy' file
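
The compress-then-decompress round trip can be illustrated with Python's standard zlib module acting on a byte array. This is a sketch of the general idea only: it uses the standard library rather than Intel DAAL's compressor classes, the input bytes are a stand-in for a serialized table, and the standard-library stream format is not assumed to be interchangeable with files written by CompressToDisk.

```python
import zlib

import numpy as np

# Stand-in for a serialized numeric table: a repetitive byte buffer
serialized = np.frombuffer(bytes(range(256)) * 64, dtype=np.uint8)

# Compress at the highest zlib level (9) and keep the result as a byte array
compressed = np.frombuffer(zlib.compress(serialized.tobytes(), 9), dtype=np.uint8)

# Decompress and view the bytes as an array again
decompressed = np.frombuffer(zlib.decompress(compressed.tobytes()), dtype=np.uint8)

print(serialized.size, compressed.size, decompressed.size)
```

The decompressed array matches the original byte for byte, which is the property the deserialization step relies on.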

Besides ZLIB, Intel DAAL implements several other generic compression and decompression methods, including LZO, RLE, and BZIP2.

Conclusion

Intel® DAAL’s data management component provides classes and methods to perform common operations on numeric table contents. The basic numeric table operations include access, mutation, export to disk, and import from disk. The helper functions covered in this document help automate the creation of numeric table subsets, as well as the serialization and compression processes.

The next volume (Volume 3) in the Gentle Introduction series gives a brief introduction to the Algorithm section of PyDAAL. Volume 3 focuses on the workflow of important descriptive and predictive algorithms available in Intel® DAAL. Advanced features such as setting hyperparameters, distributing fit calculations, and deploying models as serialized objects will all be covered.

 Other Related Links:

Next: Vol 3: Analytics Model Building and Deployment
