Gentle Introduction to PyDAAL: Vol 1 Data Structures

The Intel® Data Analytics Acceleration Library (Intel® DAAL) is built on building blocks optimized for Intel® architecture and supports all stages of data analytics. Intel® DAAL provides foundations for data acquisition, preprocessing, transformation, data mining, modeling, and validation, enabling data-driven decision making. Python users can access these foundations through the Python API for Intel® DAAL (named PyDAAL), a simple scripting API that brings this performance to machine learning with Python. Furthermore, PyDAAL makes it easy to extend Python scripted batch analytics to online (streaming) data acquisition and/or distributed math processing. To achieve best performance on a range of Intel® processors, Intel® DAAL uses optimized algorithms from the Intel® Math Kernel Library and Intel® Integrated Performance Primitives. Intel® DAAL provides APIs for C++, Java, and Python. In this Gentle Introduction series, we will cover the basics of PyDAAL from the ground up. This first installment introduces Intel® DAAL’s custom data structure, the Numeric Table, and data management in the world of PyDAAL.

Volumes in Gentle Introduction Series

  • Vol 1: Data Structures - Introduces the Data Management component of Intel® DAAL and the available custom Data Structures (Numeric Table and Data Dictionary) with code examples
  • Vol 2: Basic Operations on Numeric Tables - Introduces the operations that can be performed on Intel® DAAL's custom Data Structures (Numeric Table and Data Dictionary) with code examples
  • Vol 3: Analytics Model Building and Deployment - Introduces analytics model building and evaluation in Intel® DAAL with serialized deployment in batch processing
  • Vol 4: Distributed and Online Processing - Introduces Intel® DAAL’s advanced processing modes (distributed and online) that support data analysis and model fitting on large and streaming data

IDP and Intel® DAAL Installation

The demonstrations in this article require the Intel® Distribution for Python (IDP) and Intel® DAAL, both of which are available for free on Anaconda Cloud.

1.    Install IDP full environment to install all the required packages

conda create -n IDP -c intel intelpython3_full python=3.6

2.    Activate IDP environment

source activate IDP
(or)
activate IDP

Refer to installation options and a complete list of Intel packages for more information

1. Introduction to PyDAAL Data Management

Intel DAAL supports the following ways of processing data:

  • Batch processing
  • Online processing
  • Distributed processing
  • Complex processing (combination of online and distributed processing)

This document’s primary focus will be on batch processing. Online and distributed data management will be discussed in subsequent volumes in the Gentle Introduction series.

Programming Considerations:

Strong Typing: Python scripting heavily utilizes the concept of dynamic (duck) typing, relying on Python’s interpreter to infer type at run time. However, this practice can cause problems when memory footprint requires attention, or when mixed code is deployed. The PyDAAL API calls libraries written in C++ and assembly language, forcing a mixed code environment on the user. Thus, PyDAAL requires consistent typing, conveniently supporting numpy types “np.float32”, “np.float64”, and “np.intc”. Static/Strong typing not only allows explicit declaration of datatypes for optimal memory management but also enforces a type check during compiling, significantly reducing run time.
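
As a small illustration of why explicit dtypes matter for memory footprint, the sketch below stores the same values as float64 and as float32 and compares the bytes used. It relies on plain numpy only (no PyDAAL call is involved), and the variable names are illustrative.

```python
import numpy as np

values = [[0.5, -1.3],
          [2.5, -3.3],
          [4.5, -5.3]]

# Same six values, two explicitly declared dtypes
as_float64 = np.array(values, dtype=np.float64)
as_float32 = np.array(values, dtype=np.float32)

# nbytes = number of elements * bytes per element
print(as_float64.dtype, as_float64.nbytes)  # 6 elements * 8 bytes each
print(as_float32.dtype, as_float32.nbytes)  # 6 elements * 4 bytes each
```

Halving the element size halves the footprint of the array, which becomes significant as datasets grow.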

Memory Access Patterns: Multidimensional arrays are stored as contiguous data in memory, and their elements can be arranged within that memory in more than one way. One option is to store one row vector after another, called “row major”. An equally valid approach is to store columns one after another, known as “column major”. Both layouts are supported by Intel DAAL’s numeric table data structures, and the choice should be based on the expected memory access patterns of the program being written. The former is the default in C programming and in Intel DAAL’s standard numeric table. The latter is the default in Fortran programming and is achieved with Intel DAAL’s Structure of Arrays (SOA) numeric table.

Numpy provides the np.ascontiguousarray() function for converting a numpy array to row-major storage in memory. Furthermore, PyDAAL will attempt to convert any passed input array to contiguous automatically.
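
The two layouts can be inspected directly in numpy. The following is a minimal sketch (plain numpy, illustrative variable names) that shows the order in which the elements of a small table are visited under each layout, and converts a column-major array back to row-major storage.

```python
import numpy as np

table = np.array([[0, 1, 2],
                  [3, 4, 5]], dtype=np.float64)

# Order in which elements are visited under each layout
row_major_order = table.ravel(order='C').tolist()  # one row after another
col_major_order = table.ravel(order='F').tolist()  # one column after another
print(row_major_order)
print(col_major_order)

# A column-major (Fortran-ordered) copy of the same table
col_major = np.asfortranarray(table)
print(col_major.flags['C_CONTIGUOUS'], col_major.flags['F_CONTIGUOUS'])

# Convert back to row-major storage
row_major = np.ascontiguousarray(col_major)
print(row_major.flags['C_CONTIGUOUS'])
```

The logical contents of the array are identical in all cases; only the arrangement in memory changes.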

SWIG Interface Objects: An important component of Intel DAAL is SWIG (Simplified Wrapper and Interface Generator), an interface generator for C/C++ programs. PyDAAL uses SWIG to enable Python scripting control of the Intel DAAL C++ libraries. Importantly, this allows PyDAAL to, in effect, escape Python’s global interpreter lock (GIL) and dispatch processing/threading with compiled C++ code. Intel’s PyDAAL API team has exposed Intel DAAL’s C++ member functions to the Python user as familiar class methods, visible in Python’s interactive console through the dir(DAAL_object) call.

Data Structure Overview and Flow:

Numeric Table: The primary data structure utilized by Intel DAAL is the Numeric Table. Raw data is streamed into the numeric table structure and stored in memory for later access when constructing analytics models and fitting machine learning algorithms. An Intel DAAL numeric table defines an in-memory tabular view of a data set where rows represent observations and columns represent features. Numeric data held in the numeric table is accessed through the numeric table interface.

Intel DAAL Components and Data Flow:

Components dataflow image

2. Numeric Tables

2.1 Types of Numeric Tables:

Numeric tables are constructed according to two initialization preferences: data type and data layout.

  • Data Types: Intel DAAL supports 3 dtypes during numeric table creation: intc (the C-compatible int type), float32, and float64.

In numpy, these dtypes are called “np.intc”, “np.float32”, and “np.float64”.

  • Data Layout: Dense matrix data can be laid out in row-major or column-major order, depending on the desired access pattern. Sparse and triangular matrices can also be stored in compact forms that reduce memory footprint.

Below are the types of Numeric Tables that Intel DAAL supports: homogeneous, heterogeneous, and memory saving numeric tables for dense and sparse data.

image

image
image

2.2 Numeric table Initializations Reference (more details for each included in following sections)

Homogeneous Numeric Table:

dtype   | Class name                                    | Alias
intc    | HomogenNumericTable(ndarray, ntype = intc)    | HomogenNumericTable_Intc(numpy_array)
float32 | HomogenNumericTable(ndarray, ntype = float32) | HomogenNumericTable_Float32(numpy_array)
float64 | HomogenNumericTable(ndarray, ntype = float64) | HomogenNumericTable_Float64(numpy_array)

Heterogeneous Numeric Table:

1. Array of Structures: heterogen_AOS_nT = AOSNumericTable(ndarray)
2. Structure of Arrays: heterogen_SOA_nT = SOANumericTable(nColumns, nRows)

Memory saving Numeric Table:

1. Condensed Sparse Matrix:
homogenCSR_nT = CSRNumericTable(values, colIndices, rowOffsets, nFeatures, nObservations)
2. Packed Matrix

2.3 Loading Data into various types of Numeric Table:

As mentioned earlier, PyDAAL is a strongly-typed library. Intel DAAL’s math has efficient handling of data with common (homogeneous) typing, and separate handling for data of mixed (heterogeneous) types. To this end, Intel DAAL’s numeric tables have multiple flavors to store and serve data in both typing conditions, as well as memory-saving versions for sparse matrices.

2.3.1 Homogeneous Numeric Table and different ways of loading data

Homogeneous numeric tables are Intel DAAL's data structure for storing features that are of the same basic data type. Values of the features are laid out in memory as one contiguous block in row-major order, that is, Observation 1, Observation 2, and so on. While creating the numeric table, Intel DAAL creates a data dictionary that stores assignments of feature type (Continuous, Ordinal, and Categorical) and can be accessed at any time to modify the assignments. Below are some code snippets written to help demonstrate creation of numeric tables using Numpy Array, Pandas DataFrame and PyDAAL’s FileDataSource (csv loading) class as inputArray.

i. Data load and numeric table creation through Numpy Array:

PyDAAL supports direct and easy integration with Numpy. The code snippet below creates a HomogenNumericTable from a Numpy array.

The 3 dtypes supported by numeric table creation are float32, float64, and intc. When an integer numpy array is created without declaring the dtype, numpy infers a platform-dependent default integer dtype (int32 on some platforms, int64 on others), which is not guaranteed to match the C int type. Hence, when creating an integer type numeric table, it is mandatory to declare the dtype (np.intc) during initialization of the input numpy array.
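
One way to guard against an unintended dtype is to check an array before it is handed to a numeric table. The sketch below is a hypothetical helper written for this article (ensure_supported_dtype is not part of PyDAAL); it assumes that integer values fit in a C int and that any other unsupported data should become float64.

```python
import numpy as np

# dtypes listed above as accepted by numeric table creation
SUPPORTED_DTYPES = (np.dtype(np.float32), np.dtype(np.float64), np.dtype(np.intc))

def ensure_supported_dtype(array):
    """Hypothetical helper: return `array` with one of the three supported dtypes."""
    array = np.asarray(array)
    if array.dtype in SUPPORTED_DTYPES:
        return array
    if np.issubdtype(array.dtype, np.integer):
        # Assumption: the values fit in a C int; larger values would overflow
        return array.astype(np.intc)
    return array.astype(np.float64)

inferred = np.array([1, 2, 3])               # dtype chosen by numpy, platform dependent
declared = ensure_supported_dtype(inferred)  # always np.intc after the check
print(inferred.dtype, '->', declared.dtype)
```

Declaring dtype=np.intc when the array is first created, as the steps below do, avoids the extra copy that such a conversion may require.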

Steps for creating Homogeneous nT from Numpy ndarray

  1. Create a numpy ndarray with declared dtype
    array = np.array([], dtype=type)

    Intel DAAL accepts np.float64, np.float32, np.intc

  2. Create a nT from the created numpy array
    nT = HomogenNumericTable( array , ntype=dtype)

Code Snippet

import numpy as np
from daal.data_management import HomogenNumericTable

# Declare the dtype explicitly so that it matches the ntype of the numeric table
array = np.array([[0.5, -1.3],
                  [2.5, -3.3],
                  [4.5, -5.3],
                  [6.5, -7.3],
                  [8.5, -9.3]],
                 dtype=np.float32)

# Create the homogeneous numeric table from the numpy array
nT = HomogenNumericTable(array, ntype=np.float32)

ii. Data load and numeric table creation through Pandas DataFrame:

Pandas is a widely used library to prepare and manipulate datasets in spreadsheet form. Its ease of use and overall breadth have made the library seemingly ubiquitous in Python machine learning work. PyDAAL fully supports data input from Pandas DataFrames, either as homogeneous (through an intermediate Numpy array) or as heterogeneous (directly from the DataFrame). See the dedicated SOA section for details on heterogeneous numeric table creation.

Steps for creating Homogeneous nT from Pandas DataFrame

  1. Create a pandas DataFrame with declared dtype
    df = pd.DataFrame(values, dtype=type)

    Intel DAAL accepts np.float64, np.float32, np.intc

  2. Convert pandas df to Numpy array (ndarray)
    array = df.values
  3. Create a nT from the created numpy array
    nT = HomogenNumericTable(array , ntype=dtype)

Code Snippet

import pandas as pd
import numpy as np
from daal.data_management import HomogenNumericTable

# Initialize the columns with values
Col1 = [1, 2, 3, 4, 5]
Col2 = [6, 7, 8, 9, 10]

# Create a pandas DataFrame of dtype integer (C int)
df_int = pd.DataFrame({'Col1': Col1,
                       'Col2': Col2},
                      dtype=np.intc)

# DataFrame.as_matrix() was removed in pandas 1.0; .values works across versions
array = df_int.values

nT = HomogenNumericTable(array, ntype=np.intc)

iii. Data load and numeric table creation through CSV files

One of the prominent features offered by Pandas is reading data from a CSV file and loading it into a DataFrame for use in machine learning algorithms. Intel DAAL provides the “FileDataSource” class, which can be used to read a CSV file in a similar way. PyDAAL’s FileDataSource creates an empty data source object and loads the requested blocks of rows from a CSV file using Intel DAAL’s CSVFeatureManager class; the getNumericTable() method then returns the numeric table. Currently only the float64 dtype is available when using FileDataSource.

Steps for creating Homogeneous nT from CSV file source

  1. Create a FileDataSource object with csv path as arg, allocate memory for numeric table, and create Data Dictionary from the data
    dataSource = FileDataSource(path, DataSource.doAllocateNumericTable, DataSource.doDictionaryFromContext)

    DataSource.doAllocateNumericTable creates Homogeneous numeric table by default (so choose “notAllocateNumericTable” for AOS creation)

  2. Load required data block(rows) into the FileDataSource object
    dataSource.loadDataBlock(nRows)
  3. Create a HomogenNumericTable from the data Source with data loaded in it
    nT = dataSource.getNumericTable()
  4. Default dtype is float64

Code Snippet

from daal.data_management import FileDataSource, DataSource

# Replace 'path' with the location of the CSV file
dataSource = FileDataSource(
    r'path', DataSource.doAllocateNumericTable, DataSource.doDictionaryFromContext)

# Load the first 30 rows; calling loadDataBlock() with no argument loads all rows
dataSource.loadDataBlock(30)

nT = dataSource.getNumericTable()

2.3.2 Heterogeneous Numeric Table and different ways of loading data

Python is dynamically typed and infers dtypes at run time when they are not explicitly declared. Because of this “duck” typing, memory footprint is often overlooked. As datasets grow large and memory usage becomes a major concern, explicit declaration of dtypes becomes beneficial, and it is a key functionality supported by the PyDAAL APIs that connect to the C++ libraries.

When columns of incoming data contain different numeric data types (intc, float, or double) it becomes necessary to declare dtypes on respective columns in order to reduce memory footprint. Intel DAAL’s Heterogeneous numeric table delivers the capability of declaring dtypes on individual columns of the array, hence saving significant memory through static typing. AOSNumericTable (Array of Structures) and SOANumericTable (Structure of Arrays) are the 2 Heterogeneous structures available in the current version of PyDAAL.

a. Introduction to Memory Layout Patterns and AOSNumericTable/SOANumericTable

Depending on access patterns, a practitioner can choose to lay out a dataset in memory in row-major or column-major form. The resulting data structures are called Array of Structures (AOS) and Structure of Arrays (SOA), respectively. If downstream access is likely to be row by row (sequential observations), then AOS is the better layout choice. If column by column access (sequential features) is required, then SOA is the better choice. Below is a straightforward representation to visualize the difference between an AOS and an SOA data structure:

Example Incoming Data:

image

Memory Layout: AOS (Array of Structures):

image

Memory Layout: SOA (Structure of Arrays):

image
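
The same distinction can be reproduced in plain numpy. In this minimal sketch (illustrative names and values, no PyDAAL call), a structured array plays the role of the AOS layout and a dictionary of per-column arrays plays the role of the SOA layout; the strides show how far apart two consecutive values of the same feature sit in memory.

```python
import numpy as np

rows = [(0.5, 1), (4.5, 2), (6.5, 0)]

# AOS: one record per observation, the fields of a record sit next to each other
aos = np.array(rows, dtype=[('x', np.float64), ('categ', np.intc)])

# SOA: one contiguous array per feature
soa = {name: np.ascontiguousarray(aos[name]) for name in aos.dtype.names}

# Bytes to step from one 'x' value to the next:
# a whole record in AOS, a single element in SOA
print(aos['x'].strides, soa['x'].strides)
```

Row-by-row access touches neighboring bytes in the AOS layout, while column-by-column access touches neighboring bytes in the SOA layout.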

b. AOSNumericTable (Array of Structures data structure):

If the desired data access is row major and the input data features have heterogeneous dtypes, Intel DAAL’s AOSNumericTable provides the corresponding memory pattern for faster access. It does so by allocating contiguous memory along “observations”. Below are code snippets to illustrate creation of an AOSNumericTable using a Numpy array, a Pandas DataFrame, and PyDAAL’s FileDataSource class.

i. Data load and numeric table creation through Numpy Array:

Unlike Intel DAAL’s HomogenNumericTable, AOSNumericTable is created with a 1D Numpy array having tuples of elements with declared dtypes on each tuple. Each tuple is a complete row of the dataset. The resulting shape of the input Numpy array is therefore (nRows,).

Steps for creating AOS nT from Numpy array

  1. Create a Numpy 1D array with declared dtype on each column
    array = np.array([], dtype=[(column, dtype)])

    Intel DAAL accepts numpy supported dtypes - np.float64, np.float32, np.intc

  2. Create AOS nT from Numpy array
    nT = AOSNumericTable(array)

Code Snippet

import numpy as np
from daal.data_management import AOSNumericTable

# Each tuple is one complete row; the field order in dtype must match the tuple order
array = np.array([(0.5, -1.3, 1),
                  (4.5, -5.3, 2),
                  (6.5, -7.3, 0)],
                 dtype=[('x', np.float32),
                        ('value', np.float64),
                        ('categ', np.intc)])

nT = AOSNumericTable(array)

ii. Data load and numeric table creation through Pandas DataFrame:

Steps for creating AOS nT from Pandas DataFrame

  1. Create a pandas DataFrame
    df = pd.DataFrame(values)

    Intel DAAL accepts numpy supported dtypes - np.float64, np.float32, np.intc

  2. Convert Pandas DataFrame to 1D structured Numpy array of shape(nRows,). Use helper function: get_StructArray() to:
    • create a list of tuples
    • zip with dtypes
    • convert to numpy array
    array = get_StructArray(df, [dtype1, dtype2, etc.])
  3. Create a nT from the structured Numpy array
    nT = AOSNumericTable(array)

Code Snippet

import pandas as pd
import numpy as np
from daal.data_management import AOSNumericTable

df = pd.DataFrame(columns=['c', 'f'])
df['c'] = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
df['f'] = [3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0]

def get_StructArray(df, dtypes):
    """Convert a DataFrame to a 1D structured Numpy array with the given column dtypes."""
    # One tuple per row of the DataFrame
    dataList = [tuple(row) for row in df.itertuples(index=False)]
    # Pair each column name with its declared dtype
    decDtype = list(zip(df.columns.tolist(), dtypes))
    return np.array(dataList, dtype=decDtype)

array = get_StructArray(df, [np.intc, np.float64])

nT = AOSNumericTable(array)
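
When the columns are already available as separate numpy arrays, a structured array of shape (nRows,) can also be assembled without looping over rows. This is a minimal sketch in plain numpy; the column names and values are illustrative, and the result has the same form as the array passed to AOSNumericTable above.

```python
import numpy as np

c = np.array([1, 2, 3, 4, 5], dtype=np.intc)
f = np.array([3.1, 3.2, 3.3, 3.4, 3.5], dtype=np.float64)

# Allocate the structured array, then fill it one field (column) at a time
struct_array = np.empty(len(c), dtype=[('c', np.intc), ('f', np.float64)])
struct_array['c'] = c
struct_array['f'] = f

print(struct_array.shape)  # 1D: one record per row
print(struct_array[0])     # first record, holding both fields
```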

iii. Data load and numeric table creation through CSV file:

Steps for creating AOS nT from CSV file source

  1. Create a FileDataSource object with csv path as arg. Then create Data Dictionary from the data and DO NOT allocate corresponding numeric table.
    dataSource = FileDataSource(path, DataSource.notAllocateNumericTable, DataSource.doDictionaryFromContext)

    DataSource.doAllocateNumericTable creates Homogeneous numeric table by default (so choose “notAllocateNumericTable” for AOS creation)

  2. Initialize an empty 1D numpy array with declared dtypes on every column of input data source
    array = np.empty([nRows,], dtype=[(column,dtype)])

    Intel DAAL accepts numpy supported dtypes- “np.intc”, “np.float32”, “np.float64”

  3. Allocate memory block for AOSNumericTable for empty array initialized in step 2
    nT = AOSNumericTable(array)
  4. Load data block(number of rows) from the FileDataSource object to AOSNumericTable layout
    dataSource.loadDataBlock(nRows,nT)

Code Snippet

import numpy as np
from daal.data_management import FileDataSource, AOSNumericTable, DataSource

# Replace 'path' with the location of the CSV file
dataSource = FileDataSource(
    r'path', DataSource.notAllocateNumericTable, DataSource.doDictionaryFromContext)

# Empty structured array with one field per CSV column
array = np.empty([10,], dtype=[('x', np.intc), ('y', np.float64)])

nT = AOSNumericTable(array)

# Load 10 rows from the data source into the AOS numeric table
dataSource.loadDataBlock(10, nT)

c. SOANumericTable (Structure of Arrays data structure):

If the desired data access is column major and the input data features have heterogeneous dtypes, Intel DAAL’s SOANumericTable provides the corresponding memory pattern for faster access. It does so by allocating contiguous memory along “features”. This data structure is a more natural conversion for a pandas DataFrame, as pandas stores data in a similar pattern in memory.

In contrast to Intel DAAL’s AOSNumericTable, which can be populated with array values at initialization, SOANumericTable requires the number of rows and columns to be defined first (allocating a contiguous block) at initialization, followed by setting array values one column at a time. In other words, a practitioner must create a SOANumericTable structure with the proper data dimensions, then subsequently fill the table with data.
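
Because the dimensions must be declared before any column is set, it can help to derive them from the column arrays themselves rather than typing them by hand. The sketch below uses a hypothetical helper (soa_dimensions is not part of PyDAAL) that returns the number of features and observations and rejects columns of unequal length.

```python
import numpy as np

def soa_dimensions(columns):
    """Hypothetical helper: return (nFeatures, nObservations) for a list of 1D column arrays."""
    lengths = {len(col) for col in columns}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of rows")
    return len(columns), lengths.pop()

columns = [np.array([1.0, 1.2, 1.4], dtype=np.float64),
           np.array([3.1, 3.2, 3.3], dtype=np.float32),
           np.array([-10, -20, -30], dtype=np.intc)]

nFeatures, nObservations = soa_dimensions(columns)
print(nFeatures, nObservations)
```

The two values would then be passed to the table constructor in the order used throughout this section, features first and observations second.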

Below are code snippets to illustrate creation of a SOANumericTable using a Numpy array, a Pandas DataFrame, and PyDAAL’s FileDataSource class.

i. Data load and numeric table creation through Numpy Array:

Steps for creating SOA nT from Numpy array

  1. Create a numpy ndarray for every column with declared dtype
    array = np.array([], dtype=type)

    Intel DAAL accepts numpy supported dtypes - np.float64, np.float32, np.intc

  2. Create SOA nT template with nRows and nColumns
    nT = SOANumericTable(nColumns, nRows)
  3. Set one column at a time in SOA nT array
    nT.setArray(array,column index)

Code Snippet

import numpy as np
from daal.data_management import SOANumericTable

Col1 = np.array([1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8],
                dtype=np.float64)
Col2 = np.array([3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0],
                dtype=np.float32)
Col3 = np.array([-10, -20, -30, -40, -50, -60, -70, -80, -90, -100],
                dtype=np.intc)

nObservations = 10
nFeatures = 3  # one feature per column array set below

nT = SOANumericTable(nFeatures, nObservations)

# Set one column at a time
nT.setArray(Col1, 0)
nT.setArray(Col2, 1)
nT.setArray(Col3, 2)

ii. Data load and numeric table creation through Pandas DataFrame:

Steps for creating SOA nT from Pandas DataFrame

  1. Create a pandas DataFrame with a different dtype on each column
    df = pd.DataFrame(values)

    Intel DAAL accepts np.float64, np.float32, np.intc

  2. Create SOA nT template with nRows and nColumns from the df
    nT = SOANumericTable(nColumns, nRows)
  3. Convert every column of the df to a Numpy array (if the df has 3 columns, convert it into 3 numpy arrays) and set each SOA nT column with the corresponding numpy array
    nT.setArray(array,column index)

Code Snippet

import pandas as pd
import numpy as np
from daal.data_management import SOANumericTable

# Initialize the columns with values
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
df['b'] = [3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0]
df['c'] = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]

# Declare a different dtype on each column
df = df.astype(dtype={'a': np.intc,
                      'b': np.float32,
                      'c': np.float64})

# Allocate the SOA nT with (nColumns, nRows)
nT = SOANumericTable(df.shape[1], df.shape[0])

# Set one column at a time
for idx in range(len(df.columns)):
    nT.setArray(df[df.columns[idx]].values, idx)

An SOA nT can also be initialized without the number of rows and columns, which can be set at a later stage using the methods setNumberOfRows(N) and setNumberOfColumns(N). However, setting the number of rows and columns on an existing SOA nT that already holds values recreates an empty SOA nT, deleting its previous values.

iii. Data load and numeric table creation through csv file:

Steps for creating SOA nT from CSV file source

  1. Create a FileDataSource object with csv path as arg. Then create Data Dictionary from the data and DO NOT allocate corresponding numeric table
    dataSource = FileDataSource(path, DataSource.notAllocateNumericTable, DataSource.doDictionaryFromContext)

    DataSource.doAllocateNumericTable creates Homogeneous numeric table by default (so choose “notAllocateNumericTable” for SOA creation)

  2. Initialize empty 1D numpy arrays with declared dtypes for every column of input data source
    Col1_array = np.empty([nRows,],dtype=dtypes)

    Intel DAAL supports numpy dtypes “np.intc”, “np.float32”, “np.float64”

  3. Allocate memory block for SOANumericTable with number of features and observations
    nT = SOANumericTable(nFeatures,nObservations)
  4. Set all columns of SOANumericTable with respective empty 1D numpy arrays created in step 2
    nT.setArray(array,idx)
  5. Load data block(rows) from the FileDataSource object to SOANumericTable layout
    dataSource.loadDataBlock(nRows,nT)

Code Snippet

import numpy as np
from daal.data_management import FileDataSource, SOANumericTable, DataSource

# CSV file 'path' with 10 rows and 2 columns
dataSource = FileDataSource(
    r'path', DataSource.notAllocateNumericTable, DataSource.doDictionaryFromContext)

# One empty 1D array per column of the data source
Col1_array = np.empty([10,], dtype=np.float64)
Col2_array = np.empty([10,], dtype=np.float64)

# 2 features (columns), 10 observations (rows)
nT = SOANumericTable(2, 10)

nT.setArray(Col1_array, 0)
nT.setArray(Col2_array, 1)

# Load 10 rows from the data source into the SOA numeric table
dataSource.loadDataBlock(10, nT)

2.3.3 Homogeneous Memory saving Numeric Table and different ways of loading data:

As discussed in the section on Homogeneous nT, memory is allocated as one contiguous block across all observations, holding values of the same data type. Homogeneous memory saving nT provides special classes that store values with a reduced memory footprint when the input data (matrix/numpy array) is sparse or close to sparse.

Homogeneous memory saving nT offers classes with special data layouts to represent packed matrices and condensed sparse row matrices. A memory saving numeric table stores matrices that are sparse, close to sparse, or symmetric while eliminating memory footprint inefficiencies, yet retaining their primitive data layout. Values are stored in memory according to one of the matrix types discussed in the following sections.

Homogeneous Memory saving Numeric Table presents classes to handle 3 types of matrices: symmetric, triangular, and sparse.

a. Symmetric and Triangular Numeric Table

image

Symmetric Matrix: A matrix whose values above and below the diagonal mirror each other. Because the upper and lower parts are identical, saving the entire matrix is near duplication. The memory saving numeric table provides options to save only the upper or the lower part (including the diagonal) to reduce redundant storage and memory footprint. The numeric table stores the incoming symmetric matrix as an upper or lower packed symmetric matrix, depending on the layout the user prefers.

Triangular Matrix: A matrix whose values above or below the diagonal are all 0. In a regular Homogeneous numeric table, saving those zeroes is not an efficient use of memory. The packed triangular matrix numeric table reduces memory footprint by storing only the values on and above (or on and below) the diagonal, depending on the type of triangular matrix. The numeric table stores the incoming packed triangular matrix, allowing upper or lower triangular matrix representation.
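
To make the saving concrete, the sketch below packs a 4x4 lower triangular matrix into a 1D array in plain numpy. An n x n triangular or symmetric matrix needs only n(n+1)/2 stored values, which is why a 10-element array can describe a 4x4 matrix. The row-by-row ordering used here is an assumption chosen for illustration; the ordering Intel DAAL expects for packed input should be confirmed in its documentation.

```python
import numpy as np

n = 4
full = np.array([[1, 0, 0,  0],
                 [2, 3, 0,  0],
                 [4, 5, 6,  0],
                 [7, 8, 9, 10]], dtype=np.intc)

# Keep only the entries on and below the diagonal, read row by row
rows, cols = np.tril_indices(n)
packed_lower = full[rows, cols]

print(packed_lower)                        # 1D array of n*(n+1)/2 values
print(full.size, '->', packed_lower.size)  # stored elements before and after packing
```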

Below are code snippets to illustrate creation of Packed Triangular Matrix and Packed Symmetric Matrix tables using a Numpy array and a Pandas DataFrame.

i. Data load and numeric table creation through Numpy Array:

Steps for creating Packed nT from Numpy array

  1. Create a numpy 1Darray representing packed matrix with declared dtype
    array = np.array([], dtype=type)

    Intel DAAL accepts numpy supported dtypes - np.float64, np.float32, np.intc

  2. Create Lower and Upper Packed Triangular Matrix Numeric Table
  3. Create Lower and Upper Packed Symmetric Matrix Numeric Table

Code Snippet

import numpy as np
from daal.data_management import PackedTriangularMatrix, PackedSymmetricMatrix, NumericTableIface

# 1D array holding the packed matrix values
array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.intc)

# Lower and upper packed triangular matrix numeric tables
nT_LowerPTM = PackedTriangularMatrix(NumericTableIface.lowerPackedTriangularMatrix,
                                     array, DataType=np.intc)

nT_UpperPTM = PackedTriangularMatrix(NumericTableIface.upperPackedTriangularMatrix,
                                     array, DataType=np.intc)

# Lower and upper packed symmetric matrix numeric tables
nT_LowerPSM = PackedSymmetricMatrix(NumericTableIface.lowerPackedSymmetricMatrix,
                                    array, DataType=np.intc)

nT_UpperPSM = PackedSymmetricMatrix(NumericTableIface.upperPackedSymmetricMatrix,
                                    array, DataType=np.intc)

ii. Data load and numeric table creation through Pandas DataFrame:

Steps for creating Packed nT from Pandas DataFrame

  1. Create a pandas DataFrame with declared dtypes
    df = pd.DataFrame(values)

    Intel DAAL accepts np.float64, np.float32, np.intc

  2. Convert Pandas df to 1D numpy array
  3. Create Lower and Upper Packed Triangular Matrix Numeric Table
  4. Create Lower and Upper Packed Symmetric Matrix Numeric Table

Code Snippet

import numpy as np
import pandas as pd
from daal.data_management import PackedTriangularMatrix, PackedSymmetricMatrix, NumericTableIface

# DataFrame holding the packed matrix values in a single column
df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
                  dtype=np.intc)

# Flatten the single-column DataFrame to a 1D numpy array
array = df.values.ravel()

# Lower and upper packed triangular matrix numeric tables
nT_LowerPTM = PackedTriangularMatrix(NumericTableIface.lowerPackedTriangularMatrix,
                                     array, DataType=np.intc)

nT_UpperPTM = PackedTriangularMatrix(NumericTableIface.upperPackedTriangularMatrix,
                                     array, DataType=np.intc)

# Lower and upper packed symmetric matrix numeric tables
nT_LowerPSM = PackedSymmetricMatrix(NumericTableIface.lowerPackedSymmetricMatrix,
                                    array, DataType=np.intc)

nT_UpperPSM = PackedSymmetricMatrix(NumericTableIface.upperPackedSymmetricMatrix,
                                    array, DataType=np.intc)

b. Condensed Sparse Row Numeric Table

Whereas the basic Homogeneous Numeric Table allocates contiguous memory for every element of the matrix, CSRNumericTable, another memory saving numeric table, stores only the non-zero values of an array, reducing memory footprint while retaining the sparse matrix data format.

Intel DAAL offers the CSRNumericTable class for a special version of a homogeneous numeric table that encodes sparse data (data with a significant number of zero elements). The library uses Condensed Sparse Row (CSR) format for encoding:

image of data
image of data

Prior to creating CSRNumeric Table, the typical sparse matrix “M” is represented by three 1D arrays as follows:

  • Values: The array values contain non-zero elements of input matrix row-by-row.
  • Columns: The j-th element of the array columns encodes the column index in matrix M for j-th element of array values.
  • RowIndex: The i-th element of array rowIndex encodes index in array values corresponding to the first non-zero element in rows indexed i or greater. The last element in the array rowIndex encodes the number of non-zero elements in the matrix M.

The CSRNumericTable created from the 3 arrays described above delivers a numeric table in sparse data format without having to feed in a typical sparse dataset full of zeroes. This representation allows faster row access because the row indices are compressed.
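
The relationship between a dense matrix and the three arrays can be sketched in plain numpy. The helper below is hypothetical (dense_to_csr_one_based is not part of PyDAAL), and the dense matrix is reconstructed from the arrays used in this article's CSR example, which use one-based column indices and row offsets.

```python
import numpy as np

def dense_to_csr_one_based(dense):
    """Hypothetical helper: build CSR arrays with one-based indices from a dense 2D array."""
    dense = np.asarray(dense)
    values, col_indices, row_offsets = [], [], [1]
    for row in dense:
        nz = np.flatnonzero(row)                # zero-based positions of non-zero entries
        values.extend(row[nz].tolist())
        col_indices.extend((nz + 1).tolist())   # shift to one-based column indices
        row_offsets.append(row_offsets[-1] + len(nz))
    return (np.array(values, dtype=dense.dtype),
            np.array(col_indices, dtype=np.uint64),
            np.array(row_offsets, dtype=np.uint64))

dense = np.array([[ 1, -1, 0, -3,  0],
                  [-2,  5, 0,  0,  0],
                  [ 0,  0, 4,  6,  4],
                  [-4,  0, 2,  7,  0],
                  [ 0,  8, 0,  0, -5]], dtype=np.intc)

values, colIndices, rowOffsets = dense_to_csr_one_based(dense)
print(values)
print(colIndices)
print(rowOffsets)
```

Only 13 of the 25 matrix entries are stored, and the final row offset equals the number of non-zero elements plus one because the indexing starts at 1.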

Below are code snippets that illustrate creation of the numeric table using a Numpy array and a Pandas DataFrame.

i. Data load and numeric table creation through Numpy Array:

Steps for creating CSR nT from Numpy array

  1. Create 3 arrays: non-zero values, column indices and row offsets
    array = np.array([], dtype=type)

    Intel DAAL accepts np.float64, np.float32, np.intc for the values to be loaded into the numeric table. Also, for CSRNumericTable the column indices and row offsets have to be 64-bit unsigned integers (np.uint64); any other dtype for the indices leads to a non-implementation error while creating the numeric table.

  2. Declare the number of rows and columns of the Sparse matrix
  3. Create the CSR nT
    nT = CSRNumericTable(non-zero Values, Column indices, Row Offsets, nColumns, nRows)

Code Snippet

import numpy as np
from daal.data_management import CSRNumericTable

# Non-zero elements of the matrix, row by row
values = np.array([1, -1, -3, -2, 5, 4, 6, 4, -4, 2, 7, 8, -5],
                  dtype=np.intc)
# Column indices (one-based) corresponding to each element in "values"
colIndices = np.array([1, 2, 4, 1, 2, 3, 4, 5, 1, 3, 4, 2, 5],
                      dtype=np.uint64)
# Row offsets (one-based): position in "values" of the first non-zero element of each row
rowOffsets = np.array([1, 4, 6, 9, 12, 14],
                      dtype=np.uint64)

nObservations = 5  # Number of rows in the sparse matrix
nFeatures = 5      # Number of columns in the sparse matrix

# Create the CSR numeric table with the arguments discussed above
CSR_nT = CSRNumericTable(values, colIndices, rowOffsets, nFeatures, nObservations)

ii. Data load and numeric table creation through Pandas DataFrame:

Steps for creating CSR nT from Pandas DataFrame

  1. Create pandas DataFrames and initialize them with the 3 arrays: values, column indices, and row offsets
    df_values = pd.DataFrame(columns=["values", "colIndices"])
    df_rowOffsets = pd.DataFrame(columns=["rowOffsets"])

    Note that if the row offset values are not of the same size as column indices/array values, create a different pandas df

  2. Convert the DataFrame values into numpy array to load into numeric table.

    Intel DAAL accepts np.float64, np.float32, np.intc for the values to be loaded into the numeric table. Also, for CSRNumericTable the column indices and row offsets have to be 64-bit unsigned integers (np.uint64); any other dtype for the indices leads to a non-implementation error while creating the numeric table.

    df[column_name].values.astype(dtype)
  3. Set the number of observations and features
  4. Create the CSR numeric table
    CSR_nT = CSRNumericTable(values, colIndices, rowOffsets, nFeatures, nObservations)
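
The note in step 1 about array sizes follows from the CSR layout itself, and it can be checked before the table is created. The sketch below is plain Python and does not call Intel DAAL; check_csr_arrays is a hypothetical helper, and it assumes the one-based indexing used by the example arrays in this section.

```python
def check_csr_arrays(values, col_indices, row_offsets, n_features, n_observations):
    """Return a list of problems found in one-based CSR arrays (empty if none).

    Hypothetical helper for illustration only; it is not part of PyDAAL.
    """
    problems = []
    # One column index per non-zero value
    if len(values) != len(col_indices):
        problems.append("values and col_indices differ in length")
    # One offset per row, plus a final end marker
    if len(row_offsets) != n_observations + 1:
        problems.append("row_offsets must have n_observations + 1 entries")
    elif row_offsets[0] != 1 or row_offsets[-1] != len(values) + 1:
        problems.append("row_offsets must start at 1 and end at len(values) + 1")
    elif any(a > b for a, b in zip(row_offsets, row_offsets[1:])):
        problems.append("row_offsets must be non-decreasing")
    # Column indices are one-based and bounded by the number of columns
    if any(c < 1 or c > n_features for c in col_indices):
        problems.append("col_indices must lie in 1..n_features")
    return problems


values = [1, -1, -3, -2, 5, 4, 6, 4, -4, 2, 7, 8, -5]
col_indices = [1, 2, 4, 1, 2, 3, 4, 5, 1, 3, 4, 2, 5]
row_offsets = [1, 4, 6, 9, 12, 14]

ok_report = check_csr_arrays(values, col_indices, row_offsets, 5, 5)
bad_report = check_csr_arrays(values, col_indices, row_offsets[:-1], 5, 5)
print(ok_report)
print(bad_report)
```

For the example data the first report is empty, while dropping the last row offset is reported as a size problem. This is why the row offsets live in their own DataFrame: they have 6 entries, whereas the values and column indices have 13.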

Code Snippet

import pandas as pd
import numpy as np
from daal.data_management import CSRNumericTable

# Create DataFrames for values, column indices and row offsets
df_Cols_Values = pd.DataFrame(columns=["values", "colIndices"])
df_RowOffsets = pd.DataFrame(columns=["rowOffsets"])
# Non-zero elements of the matrix, listed row by row
df_Cols_Values['values'] = [1, -1, -3, -2, 5, 4, 6, 4, -4, 2, 7, 8, -5]
# Column index (one-based here) of each element in the "values" column
df_Cols_Values['colIndices'] = [1, 2, 4, 1, 2, 3, 4, 5, 1, 3, 4, 2, 5]
# Row offsets: one-based position in "values" of the first non-zero element
# of each row, followed by a final entry equal to the number of non-zeros + 1
df_RowOffsets['rowOffsets'] = [1, 4, 6, 9, 12, 14]

# Convert the DataFrame columns to numpy arrays with PyDAAL standard dtypes.
# .values is used instead of .as_matrix(), which newer pandas releases
# no longer provide.
values = df_Cols_Values['values'].values.astype(np.intc)
colIndices = df_Cols_Values['colIndices'].values.astype(np.uint64)
rowOffsets = df_RowOffsets['rowOffsets'].values.astype(np.uint64)

nObservations = 5  # Number of rows in the sparse matrix
nFeatures = 5      # Number of columns in the sparse matrix

# Create the CSR numeric table with the arguments discussed above
CSR_nT = CSRNumericTable(values, colIndices, rowOffsets, nFeatures, nObservations)
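
As a sanity check on the CSR arrays themselves, they can be expanded back into a dense matrix. The sketch below is plain Python and independent of Intel DAAL; csr_to_dense is a hypothetical helper that assumes one-based column indices and row offsets, as in the example data.

```python
def csr_to_dense(values, col_indices, row_offsets, n_features):
    """Expand one-based CSR arrays into a dense list of rows.

    Hypothetical helper for illustration only; it is not part of PyDAAL.
    """
    dense = []
    # Consecutive offsets delimit the non-zeros that belong to one row
    for start, end in zip(row_offsets, row_offsets[1:]):
        row = [0] * n_features
        for k in range(start - 1, end - 1):  # shift to zero-based positions
            row[col_indices[k] - 1] = values[k]
        dense.append(row)
    return dense


values = [1, -1, -3, -2, 5, 4, 6, 4, -4, 2, 7, 8, -5]
col_indices = [1, 2, 4, 1, 2, 3, 4, 5, 1, 3, 4, 2, 5]
row_offsets = [1, 4, 6, 9, 12, 14]

dense = csr_to_dense(values, col_indices, row_offsets, n_features=5)
for row in dense:
    print(row)
```

Each printed row has 5 entries, and the 13 non-zero entries sit at the positions given by the column indices, which confirms that the three arrays describe a 5 x 5 matrix.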

3. Conclusion:

Intel® DAAL uses the high-performance Intel® Math Kernel Library and Intel® Integrated Performance Primitives to accelerate the data analysis process. The Data Management component introduced here is an integral part of Intel® DAAL and operates primarily on Numeric Table data structures, which support various data types and data layouts. Popular data analysis libraries such as numpy can be easily interfaced with Intel® DAAL to create numeric tables in the desired data layout, which reduces memory footprint and allows efficient processing.

This document explained the creation of numeric tables using numpy, pandas and Intel® DAAL’s data source object. As an extension, the next volume in this “Gentle Introduction to PyDAAL” series (Volume 2) introduces the numeric table life cycle and basic operations on numeric tables. As part of the data preprocessing stage, numeric tables offer a wide range of methods for sanity checks, data retrieval and manipulation, and serialization with compression. Additionally, this document has introduced a wide range of helper functions to execute common numeric table operations in data processing stages.

4. Other Related Links:

Appendix

Packed Matrix Numeric Table Initializations:

Packed Triangular Matrix:
  • Upper Triangular Matrix
    • intc
      PackedTriangularMatrix(packedLayout=NumericTableIface.upperPackedTriangularMatrix, DataType=intc)
      PackedTriangularMatrix_UpperPackedTriangularMatrixIntc
    • float32
      PackedTriangularMatrix(packedLayout=NumericTableIface.upperPackedTriangularMatrix, DataType=float32)
      PackedTriangularMatrix_UpperPackedTriangularMatrixFloat32
    • float64
      PackedTriangularMatrix(packedLayout=NumericTableIface.upperPackedTriangularMatrix, DataType=float64)
      PackedTriangularMatrix_UpperPackedTriangularMatrixFloat64
  • Lower Triangular Matrix
    • intc
      PackedTriangularMatrix(packedLayout=NumericTableIface.lowerPackedTriangularMatrix, DataType=intc)
      PackedTriangularMatrix_LowerPackedTriangularMatrixIntc
    • float32
      PackedTriangularMatrix(packedLayout=NumericTableIface.lowerPackedTriangularMatrix, DataType=float32)
      PackedTriangularMatrix_LowerPackedTriangularMatrixFloat32
    • float64
      PackedTriangularMatrix(packedLayout=NumericTableIface.lowerPackedTriangularMatrix, DataType=float64)
      PackedTriangularMatrix_LowerPackedTriangularMatrixFloat64
Packed Symmetric Matrix:
  • Upper Symmetric Matrix
    • intc
      PackedSymmetricMatrix(packedLayout=NumericTableIface.upperPackedSymmetricMatrix, DataType=intc)
      PackedSymmetricMatrix_UpperPackedSymmetricMatrixIntc
    • float32
      PackedSymmetricMatrix(packedLayout=NumericTableIface.upperPackedSymmetricMatrix, DataType=float32)
      PackedSymmetricMatrix_UpperPackedSymmetricMatrixFloat32
    • float64
      PackedSymmetricMatrix(packedLayout=NumericTableIface.upperPackedSymmetricMatrix, DataType=float64)
      PackedSymmetricMatrix_UpperPackedSymmetricMatrixFloat64
  • Lower Symmetric Matrix
    • intc
      PackedSymmetricMatrix(packedLayout=NumericTableIface.lowerPackedSymmetricMatrix, DataType=intc)
      PackedSymmetricMatrix_LowerPackedSymmetricMatrixIntc
    • float32
      PackedSymmetricMatrix(packedLayout=NumericTableIface.lowerPackedSymmetricMatrix, DataType=float32)
      PackedSymmetricMatrix_LowerPackedSymmetricMatrixFloat32
    • float64
      PackedSymmetricMatrix(packedLayout=NumericTableIface.lowerPackedSymmetricMatrix, DataType=float64)
      PackedSymmetricMatrix_LowerPackedSymmetricMatrixFloat64

Next: Vol 2: Basic Operations on Numeric Tables
