By PREETHI VENKATESH, Nathan G Greeneltch
Published:09/11/2017 Last Updated:09/11/2017
The Intel® Data Analytics Acceleration Library (Intel® DAAL) is built on building blocks optimized for Intel® architecture and includes support for all data analytics stages. Intel® DAAL empowers data-driven decision making with foundations for data acquisition, preprocessing, transformation, data mining, modeling and validation. Python users can access these foundations with the Python API for Intel® DAAL (named PyDAAL). Machine learning with Python gets an injection of power with PyDAAL, accessed via a simple scripting API. Furthermore, PyDAAL provides the unique capability to easily extend Python scripted batch analytics to online (streaming) data acquisition and/or distributed math processing. To achieve best performance on a range of Intel® processors, Intel® DAAL uses optimized algorithms from the Intel® Math Kernel Library and Intel® Integrated Performance Primitives. Intel® DAAL provides APIs for C++, Java, and Python. In this Gentle Introduction series, we will cover the basics of PyDAAL from the ground up. The first installment introduces Intel® DAAL’s custom data structure, the Numeric Table, and data management in the world of PyDAAL.
The demonstrations in this article require the Intel® Distribution for Python (IDP) and Intel® DAAL, both of which are available for free on Anaconda cloud.
1. Install IDP full environment to install all the required packages
conda create -n IDP -c intel intelpython3_full python=3.6
2. Activate IDP environment
source activate IDP
(or, on Windows)
activate IDP
Refer to installation options and a complete list of Intel packages for more information
Intel DAAL supports the following ways of processing data: batch processing, online (streaming) processing, and distributed processing.
This document’s primary focus will be on batch processing. Online and distributed data management will be discussed in subsequent volumes in the Gentle Introduction series.
Strong Typing: Python scripting heavily utilizes the concept of dynamic (duck) typing, relying on Python’s interpreter to infer type at run time. However, this practice can cause problems when memory footprint requires attention, or when mixed code is deployed. The PyDAAL API calls libraries written in C++ and assembly language, forcing a mixed code environment on the user. Thus, PyDAAL requires consistent typing, conveniently supporting numpy types “np.float32”, “np.float64”, and “np.intc”. Static/Strong typing not only allows explicit declaration of datatypes for optimal memory management but also enforces a type check during compiling, significantly reducing run time.
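The effect of explicit typing on memory footprint can be seen with plain Numpy, without PyDAAL installed. The sketch below is only an illustration: the array size `n_values` is a hypothetical number chosen for the example, and the size of np.intc depends on the platform's C int.

```python
import numpy as np

n_values = 1_000_000  # hypothetical dataset size, chosen only for illustration

# The same number of elements stored with the three dtypes PyDAAL supports
as_float64 = np.zeros(n_values, dtype=np.float64)
as_float32 = np.zeros(n_values, dtype=np.float32)
as_intc = np.zeros(n_values, dtype=np.intc)

# nbytes reports the size of each data buffer in bytes
for name, arr in [("float64", as_float64),
                  ("float32", as_float32),
                  ("intc", as_intc)]:
    print(name, arr.dtype.itemsize, "bytes per value,", arr.nbytes, "bytes total")
```

Declaring float32 instead of float64 halves the buffer size for the same number of values, which is the kind of saving explicit dtype declaration makes possible.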
Memory Access Patterns: Multidimensional arrays are stored as contiguous data in memory, and there is more than one way to arrange the array elements within that memory segment. One option is to store one row vector after another, called “row major”. An equally valid option is to store columns one after another, known as “column major”. Both data layout patterns are supported by Intel DAAL’s numeric table data structures, and the choice should be based on the expected memory access patterns of the program being written. Row major is the default in C programming and in Intel DAAL’s standard numeric table. Column major is the default in Fortran programming and is achieved with Intel DAAL’s Structure of Arrays (SOA) numeric table.
Numpy provides the ascontiguousarray() function for converting a numpy array to row-major (C-contiguous) storage in memory. Furthermore, PyDAAL will attempt to convert any passed input array to contiguous automatically.
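As a minimal sketch of the two layouts, the following Numpy-only example builds the same small matrix in row-major and column-major order, inspects the contiguity flags, and converts the column-major copy back with np.ascontiguousarray(). The variable names are illustrative only.

```python
import numpy as np

# The same 2 x 3 matrix in both memory layouts
row_major = np.arange(6, dtype=np.float64).reshape(2, 3)  # C order (Numpy default)
col_major = np.asfortranarray(row_major)                  # Fortran order, same values

print(row_major.flags['C_CONTIGUOUS'])  # True
print(col_major.flags['C_CONTIGUOUS'])  # False
print(col_major.flags['F_CONTIGUOUS'])  # True

# Convert the column-major array to row-major storage before passing it on
converted = np.ascontiguousarray(col_major)
print(converted.flags['C_CONTIGUOUS'])  # True
print(np.array_equal(converted, row_major))  # True: only the layout changed
```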
SWIG Interface Objects: An important component of Intel DAAL is SWIG (Simplified Wrapper and Interface Generator), a tool that connects C/C++ programs to scripting languages. PyDAAL uses SWIG to give Python scripts control of the Intel DAAL C++ libraries. Importantly, this allows PyDAAL, in effect, to escape Python’s global interpreter lock (GIL) and dispatch processing/threading in compiled C++ code. Intel’s PyDAAL API team has exposed Intel DAAL’s C++ member functions to the Python user as familiar class methods, visible in Python’s interactive console through the dir(DAAL_object) call.
Numeric Table: The primary data structure utilized by Intel DAAL is a Numeric Table. Raw data is streamed into the numeric table structure, and stored in-memory for further access to construct analytics model and fit machine learning algorithms. An Intel DAAL numeric table defines a tabular view of a data set where rows represent observations and columns represent features in-memory. Numeric data presented in the numeric table is accessed through numeric table interface.
Intel DAAL Components and Data Flow:
Numeric Tables can be constructed on the basis of data type and storage preferences: Initialization preferences can be branched out into data types and data layout.
In numpy, these dtypes are called “np.float32”, “np.float64”, and “np.intc”.
Below are the types of Numeric Tables that Intel DAAL supports: homogeneous, heterogeneous and memory saving numeric tables for dense and sparse data
Homogeneous Numeric Table

dtype | Class name | Alias |
---|---|---|
intc | HomogenNumericTable(ndarray, ntype = intc) | HomogenNumericTable_Intc(numpy_array) |
float32 | HomogenNumericTable(ndarray, ntype = float32) | HomogenNumericTable_Float32(numpy_array) |
float64 | HomogenNumericTable(ndarray, ntype = float64) | HomogenNumericTable_Float64(numpy_array) |

Heterogeneous Numeric Table
1. Array of Structures: heterogen_AOS_nT = AOSNumericTable(ndarray)
2. Structure of Arrays: heterogen_SOA_nT = SOANumericTable(nColumns, nRows)

Memory Saving Numeric Table
1. Condensed Sparse Row (CSR) Matrix:
homogenCSR_nT = CSRNumericTable(values, colIndices, rowOffsets, nFeatures, nObservations)
2. Packed Matrix: PackedTriangularMatrix and PackedSymmetricMatrix
As mentioned earlier, PyDAAL is a strongly-typed library. Intel DAAL’s math has efficient handling of data with common (homogeneous) typing, and separate handling for data of mixed (heterogeneous) types. To this end, Intel DAAL’s numeric tables have multiple flavors to store and serve data in both typing conditions, as well as memory-saving versions for sparse matrices.
Homogeneous numeric tables are Intel DAAL's data structure for storing features that are of the same basic data type. Values of the features are laid out in memory as one contiguous block in row-major order, that is, Observation 1, Observation 2, and so on. While creating the numeric table, Intel DAAL creates a data dictionary that stores assignments of feature type (Continuous, Ordinal, and Categorical) and can be accessed at any time to modify the assignments. Below are some code snippets written to help demonstrate creation of numeric tables using Numpy Array, Pandas DataFrame and PyDAAL’s FileDataSource (csv loading) class as inputArray.
PyDAAL supports direct and easy integration with Numpy. The code snippet below creates a HomogenNumericTable from a Numpy array.
The 3 dtypes supported by numeric table creation are float32, float64, and intc. When an integer numpy array is created without declaring the dtype, numpy infers the dtype and falls back to its platform-dependent default integer type, which is not guaranteed to match the C int type that Intel DAAL expects. Hence, when creating an integer type numeric table, it is mandatory to declare the dtype (np.intc) during initialization of the input numpy array.
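The difference between an inferred and a declared integer dtype can be checked with Numpy alone. This is a small sketch; the inferred dtype it prints varies by platform and Numpy version, so no specific value is assumed here.

```python
import numpy as np

# dtype left to Numpy: the platform's default integer type is used
inferred = np.array([[1, 2], [3, 4]])

# dtype declared explicitly: matches the C int type
declared = np.array([[1, 2], [3, 4]], dtype=np.intc)

print("inferred:", inferred.dtype, "| declared:", declared.dtype)

# An existing integer array can be converted before building a numeric table
converted = inferred.astype(np.intc)
print("converted:", converted.dtype)
```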
Steps for creating Homogeneous nT from Numpy ndarray
array = np.array([], dtype=type)
Intel DAAL accepts np.float64, np.float32, np.intc
nT = HomogenNumericTable( array , ntype=dtype)
Code Snippet
import numpy as np
array = np.array([[0.5, -1.3],
[2.5, -3.3],
[4.5, -5.3],
[6.5, -7.3],
[8.5, -9.3]],
dtype = np.float32)
# import Available Modules for Homogen numeric table
from daal.data_management import(HomogenNumericTable)
nT = HomogenNumericTable(array, ntype = np.float32)
Pandas is a widely used library to prepare and manipulate datasets in spreadsheet form. Its ease-of-use and overall breadth have made the library seemingly ubiquitous in Python machine learning work. PyDAAL fully supports data input from Pandas DataFrames, both as homogeneous (through intermediate Numpy array) or heterogeneous (directly from DataFrame). See SOA dedicated section for details on the heterogeneous numeric table creation.
Steps for creating Homogeneous nT from Pandas DataFrame
df = pd.DataFrame(values, dtype=type)
Intel DAAL accepts np.float64, np.float32, np.intc
array = df.values
nT = HomogenNumericTable(array , ntype=dtype)
Code Snippet
import pandas as pd
import numpy as np
#Initialize the columns with values
Col1 = [1,2,3,4,5]
Col2 = [6,7,8,9,10]
# Create a pandas DataFrame of dtype integer
df_int = pd.DataFrame({'Col1':Col1,
'Col2':Col2},
dtype=np.intc)
# .values returns the underlying Numpy array (as_matrix() is not available in recent pandas releases)
array = df_int.values
from daal.data_management import(HomogenNumericTable)
nT = HomogenNumericTable(array, ntype = np.intc)
One of the prominent features offered by Pandas is to read data from a CSV file and load them up into a DataFrame for use in Machine Learning algorithms. Intel DAAL provides a class “FileDataSource” that can be leveraged to behave similar to the way pandas operates to read a csv file. PyDAAL’s FileDataSource creates an empty data source object and loads preferred blocks of rows from a csv file using Intel DAAL’s CSVFeatureManager class, followed by getNumericTable() method to create a numeric table. Currently only float64 dtype is available when using FileDataSource.
Steps for creating Homogeneous nT from CSV file source
dataSource = FileDataSource(path, DataSource.doAllocateNumericTable, DataSource.doDictionaryFromContext)
DataSource.doAllocateNumericTable creates Homogeneous numeric table by default (so choose “notAllocateNumericTable” for AOS creation)
dataSource.loadDataBlock(nRows)
nT = dataSource.getNumericTable()
Default dtype is float64
Code Snippet
from daal.data_management \
import(FileDataSource, DataSource)
dataSource = FileDataSource(
    r'path', DataSource.doAllocateNumericTable, DataSource.doDictionaryFromContext)
# boilerplate method to load nRows in the csv file, default loads all rows if no argument passed
dataSource.loadDataBlock(30) # load first 30 rows
#dataSource.loadDataBlock() to load all rows
nT = dataSource.getNumericTable()
Python is dynamically typed and infers dtypes at run time when they are not explicitly declared. Because of this “duck” typing, memory footprint is often overlooked. As datasets grow large and memory usage becomes a major concern, explicit declaration of dtypes becomes beneficial, and PyDAAL’s APIs, which connect to C++ libraries, support it directly.
When columns of incoming data contain different numeric data types (intc, float, or double) it becomes necessary to declare dtypes on respective columns in order to reduce memory footprint. Intel DAAL’s Heterogeneous numeric table delivers the capability of declaring dtypes on individual columns of the array, hence saving significant memory through static typing. AOSNumericTable (Array of Structures) and SOANumericTable (Structure of Arrays) are the 2 Heterogeneous structures available in the current version of PyDAAL.
Depending on access patterns, a practitioner can choose to lay out a dataset in memory in row-major or column-major form. The resultant data structures are called Array of Structures (AOS) and Structure of Arrays (SOA), respectively. If downstream access is likely to be row by row (sequential observations), then AOS is the best layout choice. If column by column access (sequential features) is required, then SOA is the better choice. Below is a straightforward representation to visualize the difference between an AOS and SOA data structure:
Example Incoming Data:
Memory Layout: AOS (Array of Structures):
Memory Layout: SOA (Structure of Arrays):
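The two layouts can also be illustrated with plain Numpy. The sketch below is an analogy rather than PyDAAL code: a structured array stands in for AOS (one record per observation), and a dictionary of per-feature arrays stands in for SOA. The field names 'c' and 'f' and the sample values are made up for the example.

```python
import numpy as np

# AOS: one structured record per observation, records stored back to back
aos = np.array([(1, 3.1), (2, 3.2), (3, 3.3)],
               dtype=[('c', np.intc), ('f', np.float64)])

# SOA: one contiguous array per feature
soa = {'c': np.array([1, 2, 3], dtype=np.intc),
       'f': np.array([3.1, 3.2, 3.3], dtype=np.float64)}

print(aos.shape)           # (3,): one entry per row
print(aos.dtype.itemsize)  # bytes occupied by one full record

# In AOS, consecutive values of feature 'f' are one whole record apart;
# in SOA they sit next to each other.
print(aos['f'].strides)
print(soa['f'].strides)    # (8,)
```

Reading a whole observation touches one contiguous record in the AOS form, while scanning one feature touches one contiguous array in the SOA form.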
If the desired data access is row major and the input data features have heterogeneous dtypes, Intel DAAL’s AOSNumericTable provides the corresponding memory pattern for faster access. It does so by allocating contiguous memory along “observations”. Below are code snippets to illustrate creation of an AOSNumericTable using Numpy Array, Pandas and FileDataSource class of PyDAAL.
i. Data load and numeric table creation through Numpy Array:
Unlike Intel DAAL’s HomogenNumericTable, AOSNumericTable is created with a 1D Numpy array having tuples of elements with declared dtypes on each tuple. Each tuple is a complete row of the dataset. The resulting shape of the input Numpy array is therefore (nRows,).
Steps for creating AOS nT from Numpy array
array = np.array([], dtype=[(column, dtype)])
Intel DAAL accepts numpy supported dtypes - np.float64, np.float32, np.intc
nT = AOSNumericTable(array)
Code Snippet
import numpy as np
from daal.data_management import AOSNumericTable
# Each tuple is one row; the field order in dtype matches the order of values in each tuple
array = np.array([(0.5, -1.3, 1),
                  (4.5, -5.3, 2),
                  (6.5, -7.3, 0)],
                 dtype=[('x', np.float32),
                        ('value', np.float64),
                        ('categ', np.intc)])
nT = AOSNumericTable(array)
ii. Data load and numeric table creation through Pandas DataFrame:
Steps for creating AOS nT from Pandas DataFrame
df = pd.DataFrame(values)
Intel DAAL accepts numpy supported dtypes - np.float64, np.float32, np.intc
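Use the helper function get_StructArray() (defined in the code snippet below) to convert the DataFrame to a structured Numpy array:
array = get_StructArray(df, [dtype1, dtype2, etc.])
nT = AOSNumericTable(array)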
Code Snippet
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['c','f'])
df ['c']=[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
df ['f']=[3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0]
def get_StructArray(df, dtypes):
    #*** inputs: df, [dtypes], output: structured Numpy array ***
    dataList = []
    for idx in range(df.shape[0]):
        dataList.append(tuple(df.loc[idx]))
    decDtype = list(zip(df.columns.tolist(), dtypes))
    array = np.array(dataList, dtype=decDtype)
    return array
array = get_StructArray(df, [np.intc,np.float64] )
from daal.data_management import AOSNumericTable
nT = AOSNumericTable(array)
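The row-by-row loop in get_StructArray() can become slow on large inputs. One possible alternative, sketched below, allocates the structured array once and fills it one column at a time. The function name `columns_to_struct_array` is hypothetical and not part of PyDAAL; it accepts any mapping from column name to values (a plain dict here, and a DataFrame should work the same way since it supports the same column lookup).

```python
import numpy as np

def columns_to_struct_array(columns, dtypes):
    """Build a 1D structured array from a mapping of column name -> values.

    `columns` may be a dict of sequences or a pandas DataFrame;
    `dtypes` lists one Numpy dtype per column, in column order.
    """
    names = list(columns)
    n_rows = len(columns[names[0]])
    out = np.empty(n_rows, dtype=list(zip(names, dtypes)))
    for name in names:
        out[name] = columns[name]  # fill one field (column) at a time
    return out

# Hypothetical sample data with the same column names as the snippet above
example_columns = {'c': [1, 2, 3, 4, 5],
                   'f': [3.1, 3.2, 3.3, 3.4, 3.5]}
struct_array = columns_to_struct_array(example_columns, [np.intc, np.float64])
print(struct_array.shape, struct_array.dtype)
```

The result has shape (nRows,) with one declared dtype per field, which is the input form described above for AOSNumericTable.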
iii. Data load and numeric table creation through CSV file:
Steps for creating AOS nT from CSV file source
dataSource = FileDataSource(path, DataSource.notAllocateNumericTable, DataSource.doDictionaryFromContext)
DataSource.doAllocateNumericTable creates Homogeneous numeric table by default (so choose “notAllocateNumericTable” for AOS creation)
array = np.empty([nRows,], dtype=[(column,dtype)])
Intel DAAL accepts numpy supported dtypes- “np.intc”, “np.float32”, “np.float64”
nT = AOSNumericTable(array)
dataSource.loadDataBlock(nRows,nT)
Code Snippet
import numpy as np
from daal.data_management import (FileDataSource, AOSNumericTable, DataSource)
dataSource = FileDataSource(
    r'path', DataSource.notAllocateNumericTable, DataSource.doDictionaryFromContext)
array = np.empty([10,],dtype=[('x','i4'),('y','f8')])
nT = AOSNumericTable(array)
dataSource.loadDataBlock(10,nT)
If the desired data access is column major and the input data features have heterogeneous dtypes, Intel DAAL’s SOANumericTable provides the corresponding memory pattern for faster access. It does so by allocating contiguous memory along “features”. This data structure is a more natural conversion for a pandas DataFrame, as pandas stores data in a similar pattern in memory.
In contrast to Intel DAAL’s AOSNumericTable, which can be populated with array values at initialization, SOANumericTable requires the number of rows and columns to be defined first (allocating a contiguous block) at initialization, followed by setting array values one column at a time. In other words, a practitioner must create a SOANumericTable structure with proper data dimensions, then subsequently fill the table with data.
Below are code snippets to illustrate creation of SOANumericTable using Numpy Array, Pandas and FileDataSource class of PyDAAL.
Steps for creating SOA nT from Numpy array
array = np.array([], dtype=type)
Intel DAAL accepts numpy supported dtypes - np.float64, np.float32, np.intc
nT = SOANumericTable(nColumns, nRows)
nT.setArray(array,column index)
Code Snippet
import numpy as np
from daal.data_management import SOANumericTable
Col1 = np.array([1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6,2.8], dtype=np.float64)
Col2 = np.array([3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0], dtype=np.float32)
Col3 = np.array([-10, -20, -30, -40, -50, -60, -70, -80, -90, -100],
dtype=np.intc)
nObservations = 10
nFeatures = 3  # one feature per column array set below
nT = SOANumericTable(nFeatures, nObservations)
nT.setArray(Col1, 0)
nT.setArray(Col2, 1)
nT.setArray(Col3, 2)
Steps for creating SOA nT from Pandas DataFrame
df = pd.DataFrame(values)
Intel DAAL accepts np.float64, np.float32, np.intc
nT = SOANumericTable(nColumns, nRows)
nT.setArray(array,column index)
Code Snippet
import pandas as pd
import numpy as np
#Initialize the columns with values
from daal.data_management import SOANumericTable
df = pd.DataFrame()
df['a']=[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
df['b']=[3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0]
df['c']=[1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]
df = df.astype(dtype={'a' : np.intc,
'b' : np.float32,
'c' : np.float64})
nT = SOANumericTable(df.shape[1],df.shape[0])
for idx in range(len(df.columns)):
    nT.setArray(df[df.columns[idx]].values, idx)
An SOA nT can also be initialized without the number of rows and columns, which can then be set at a later stage using the methods setNumberOfRows(N) and setNumberOfColumns(N). However, setting the number of rows and columns on an existing SOA nT that already holds values recreates an empty SOA nT, deleting its previous values.
Steps for creating SOA nT from CSV file source
dataSource = FileDataSource(path, DataSource.notAllocateNumericTable, DataSource.doDictionaryFromContext)
DataSource.doAllocateNumericTable creates Homogeneous numeric table by default (so choose “notAllocateNumericTable” for SOA creation)
Col1_array = np.empty([nRows,],dtype=dtypes)
Intel DAAL supports numpy dtypes “np.intc”, “np.float32”, “np.float64”
nT = SOANumericTable(nColumns, nRows)
nT.setArray(array,idx)
dataSource.loadDataBlock(nRows,nT)
Code Snippet
from daal.data_management import(FileDataSource,SOANumericTable, DataSource)
import numpy as np
# CSV file 'path' with 10 rows and 2 columns
dataSource = FileDataSource(
    r'path', DataSource.notAllocateNumericTable, DataSource.doDictionaryFromContext)
#if data source has 2 columns
Col1_array = np.empty([10,],dtype=np.float64)
Col2_array = np.empty([10,],dtype=np.float64)
nT = SOANumericTable(2,10)
nT.setArray(Col1_array,0)
nT.setArray(Col2_array,1)
dataSource.loadDataBlock(10,nT)
As discussed in the Homogeneous section of this document, a Homogeneous nT allocates memory as one contiguous block across all observations, holding values of the same data type. Homogeneous memory saving nT provides special classes that store values with a reduced memory footprint when the input data (matrix/numpy array) is sparse or close to sparse.
Homogeneous memory saving nT offers classes with special data layouts to represent packed matrices and Condensed Sparse Row (CSR) matrices. A memory saving numeric table stores a matrix that is sparse, close to sparse, or symmetric without the redundant entries, yet retains the original data layout. Values are stored in memory according to one of the matrix types discussed below.
Homogeneous memory saving numeric tables present classes to handle 3 types of matrices: symmetric, triangular, and sparse. The first two are stored as packed matrices and are described here; sparse matrices are covered in the CSR section that follows.
Symmetric Matrix: A matrix whose values above and below the diagonal mirror each other. Storing the entire matrix would nearly duplicate the data, so the memory saving numeric table stores only the upper or lower part (including the diagonal), depending on the layout the user prefers, and presents it as an upper or lower packed symmetric matrix.
Triangular Matrix: A matrix whose values above or below the diagonal are all 0. In a regular Homogeneous numeric table, storing those zeroes is not an efficient use of memory. The packed triangular matrix numeric table reduces memory footprint by storing only the values on and above (upper triangular) or on and below (lower triangular) the diagonal.
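The idea of packing can be sketched with Numpy alone: an n x n symmetric matrix has only n(n+1)/2 independent values, so a 4 x 4 matrix packs into a 1D array of 10 entries. The example below walks the lower triangle row by row. That ordering is an assumption made for illustration; the element order Intel DAAL expects for each packed layout should be confirmed in its documentation.

```python
import numpy as np

n = 4
# A symmetric 4 x 4 matrix (values chosen only for illustration)
full = np.array([[1, 2, 4, 7],
                 [2, 3, 5, 8],
                 [4, 5, 6, 9],
                 [7, 8, 9, 10]], dtype=np.intc)

# Keep only the lower triangle (diagonal included), walking row by row
lower_rows, lower_cols = np.tril_indices(n)
packed = full[lower_rows, lower_cols]
print(packed.size, "values stored instead of", full.size)

# The full matrix can be rebuilt by mirroring the packed values
restored = np.zeros((n, n), dtype=packed.dtype)
restored[lower_rows, lower_cols] = packed
restored[lower_cols, lower_rows] = packed
print(np.array_equal(restored, full))  # True
```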
Below are code snippets to illustrate creation of Packed Triangular Matrix and Packed Symmetric Matrix table using Numpy Array, Pandas and FileDataSource class of PyDAAL.
Steps for creating Packed nT from Numpy array
array = np.array([], dtype=type)
Intel DAAL accepts numpy supported dtypes - np.float64, np.float32, np.intc
Code Snippet
from daal.data_management import PackedTriangularMatrix, PackedSymmetricMatrix, NumericTableIface
import numpy as np
array = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.intc)
nT_LowerPTM = PackedTriangularMatrix( NumericTableIface.lowerPackedTriangularMatrix,
array, DataType = np.intc)
nT_UpperPTM = PackedTriangularMatrix( NumericTableIface.upperPackedTriangularMatrix,
array, DataType = np.intc)
nT_LowerPSM = PackedSymmetricMatrix( NumericTableIface.lowerPackedSymmetricMatrix,
array, DataType = np.intc)
nT_UpperPSM = PackedSymmetricMatrix( NumericTableIface.upperPackedSymmetricMatrix,
array, DataType = np.intc)
Steps for creating Packed nT from Pandas DataFrame
df = pd.DataFrame(values)
Intel DAAL accepts np.float64, np.float32, np.intc
Code Snippet
from daal.data_management import PackedTriangularMatrix,PackedSymmetricMatrix, NumericTableIface
import numpy as np
import pandas as pd
# Packed values for a 4 x 4 matrix: n(n+1)/2 = 10 entries
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,0],
                  dtype=np.intc)
array = df.values.ravel()
nT_LowerPTM = PackedTriangularMatrix( NumericTableIface.lowerPackedTriangularMatrix,
array, DataType = np.intc)
nT_UpperPTM = PackedTriangularMatrix( NumericTableIface.upperPackedTriangularMatrix,
array, DataType = np.intc)
nT_LowerPSM = PackedSymmetricMatrix( NumericTableIface.lowerPackedSymmetricMatrix,
array, DataType = np.intc)
nT_UpperPSM = PackedSymmetricMatrix( NumericTableIface.upperPackedSymmetricMatrix,
array, DataType = np.intc)
Whereas the basic Homogeneous Numeric Table allocates contiguous memory for every element of a matrix, CSRNumericTable, another memory saving numeric table, stores only the non-zero values of an array, reducing memory footprint while retaining the sparse matrix data format.
Intel DAAL offers the CSRNumericTable class for a special version of a homogeneous numeric table that encodes sparse data (data with a significant number of zero elements). The library uses Condensed Sparse Row (CSR) format for encoding.
Prior to creating a CSRNumericTable, the sparse matrix “M” is represented by three 1D arrays: the non-zero values, the column index of each of those values, and the row offsets, which give the position in the values array of the first non-zero element of each row. As in the code snippets below, indexing is one-based and the row offsets array has one more entry than the number of rows.
The CSRNumericTable created from the 3 arrays mentioned above delivers a numeric table in sparse data format without having to store the zeroes of the original matrix. This representation allows faster row access, as row indices are compressed.
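To make the three arrays concrete, the sketch below derives them from a dense matrix using Numpy only. The helper name `dense_to_csr_one_based` is hypothetical and not part of PyDAAL, and the dense matrix was reconstructed from the values, column indices, and row offsets used in the CSR code snippets in this section, with one-based indexing assumed to match them.

```python
import numpy as np

def dense_to_csr_one_based(dense):
    """Return (values, colIndices, rowOffsets) for a dense 2D array, one-based."""
    dense = np.asarray(dense)
    rows, cols = np.nonzero(dense)               # non-zeros in row-major order
    values = dense[rows, cols]
    col_indices = (cols + 1).astype(np.uint64)   # shift to one-based columns
    # Count non-zeros per row, then accumulate to get each row's start position
    counts = np.bincount(rows, minlength=dense.shape[0])
    row_offsets = (np.concatenate(([0], np.cumsum(counts))) + 1).astype(np.uint64)
    return values, col_indices, row_offsets

dense = np.array([[ 1, -1,  0, -3,  0],
                  [-2,  5,  0,  0,  0],
                  [ 0,  0,  4,  6,  4],
                  [-4,  0,  2,  7,  0],
                  [ 0,  8,  0,  0, -5]], dtype=np.intc)

values, colIndices, rowOffsets = dense_to_csr_one_based(dense)
print(values)
print(colIndices)
print(rowOffsets)
```

The last row offset is one past the final stored value, which is why the array has nRows + 1 entries.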
Below are the code snippets used to better understand creation of numeric table using Numpy Array, Pandas and FileDataSource class of PyDAAL.
Steps for creating CSR nT from Numpy array
array = np.array([], dtype=type)
Intel DAAL accepts np.float64, np.float32, np.intc for the values to be loaded into the numeric table. For CSRNumericTable, the column indices and row offsets must be unsigned 64-bit integers (np.uint64); any other dtype for the indices leads to a not-implemented error while creating the numeric table.
nT = CSRNumericTable(non-zero Values, Column indices, Row Offsets, nColumns, nRows)
Code Snippet
import numpy as np
### import Available Modules for CSRNumericTable###
from daal.data_management import CSRNumericTable
# Non zero elements of the matrix
values = np.array([1, -1, -3, -2, 5, 4, 6, 4,-4, 2, 7,8, -5],
dtype=np.intc)
# Column indices "colIndices" corresponding to each element in "values" array
colIndices = np.array([1, 2, 4, 1, 2, 3, 4, 5,1, 3, 4,2, 5],
dtype=np.uint64)
# Row offsets for every first non zero element encountered in each row
rowOffsets = np.array([1,4,6,9,12,14],
dtype=np.uint64)
# Creation of CSR numeric table with the arguments dicussed above
nObservations = 5 # Number of rows in the numpy array
nFeatures = 5# Number of columns in numpy array
CSR_nT = CSRNumericTable(values, colIndices, rowOffsets, nFeatures, nObservations)
Steps for creating CSR nT from Pandas DataFrame
df_Cols_Values = pd.DataFrame(columns = ["values", "colIndices"])
df_RowOffsets = pd.DataFrame(columns = ["rowOffsets"])
Note that the row offsets generally differ in length from the column indices/array values, so they are kept in a separate pandas DataFrame
Intel DAAL accepts np.float64, np.float32, np.intc for the values to be loaded into the numeric table. For CSRNumericTable, the column indices and row offsets must be unsigned 64-bit integers (np.uint64); any other dtype for the indices leads to a not-implemented error while creating the numeric table.
df[column name].values.astype(dtype)
CSR_nT = CSRNumericTable(values, colIndices, rowOffsets, nFeatures, nObservations)
Code Snippet
import pandas as pd
import numpy as np
from daal.data_management import CSRNumericTable
#Create DataFrames for values, column indices and rowoffsets
df_Cols_Values = pd.DataFrame(columns = ["values","colIndices"])
df_RowOffsets= pd.DataFrame(columns = ["rowOffsets"])
df_Cols_Values['values'] = [1, -1, -3, -2, 5, 4, 6, 4, -4, 2, 7, 8, -5]
# Column indices "colIndices" corresponding to each element in "values" array
df_Cols_Values['colIndices'] =[1, 2, 4, 1, 2, 3, 4, 5, 1, 3, 4, 2,5]
# Row offsets for every first non zero element encountered in each row
df_RowOffsets['rowOffsets'] = [1, 4, 6, 9, 12, 14]
# Creation of CSR numeric table with the arguments discussed above
#Convert df to numpy arrays with PyDAAL standard dtypes
values = df_Cols_Values['values'].values.astype(np.intc)
colIndices = df_Cols_Values['colIndices'].values.astype(np.uint64)
rowOffsets = df_RowOffsets['rowOffsets'].values.astype(np.uint64)
nObservations = 5 # Number of rows in the numpy array
nFeatures = 5# Number of columns in numpy array
# Pass the parameters for CSR numeric table creation
CSR_nT = CSRNumericTable(values, colIndices, rowOffsets, nFeatures, nObservations)
Intel® DAAL uses the high performance Intel® Math Kernel Library and Intel® Integrated Performance Primitives to accelerate the data analysis process. The Data Management system introduced here is an integral part of Intel® DAAL and operates primarily on Numeric Table data structures, which support various data types and data layouts. Popular data analysis libraries such as numpy can be easily interfaced with Intel® DAAL to create numeric tables with the desired data layout, allowing reduced memory footprint and efficient processing.
This document has explained the creation of numeric tables using numpy, pandas and Intel® DAAL’s data source object. As an extension, the next volume in this “Gentle Introduction to PyDAAL” series (Volume 2 of 3) introduces the numeric table life cycle and basic operations on numeric tables. As part of the data preprocessing stage, numeric tables present a wide range of methods to perform sanity checks, data retrieval or manipulation operations, and serialization with compression techniques. Additionally, this document has introduced a wide range of helper functions to execute common numeric table operations in data processing stages.