Gentle Introduction to PyDAAL: Vol 3 Analytics Model Building and Deployment

Previous: Vol 2: Basic Operations on Numeric Tables

Earlier in the Gentle Introduction series (Volume 1 and Volume 2), we covered the fundamentals of the Intel® Data Analytics Acceleration Library (Intel® DAAL): its custom data structures and the basic operations that can be performed on them. Volume 3 focuses on the algorithm component of Intel® DAAL, where the data management element is leveraged to drive analysis and build machine learning models.

Intel® DAAL has classes available to construct a wide range of popular machine learning algorithms for analytics model building, including classification, regression, recommender systems, and neural networks. Training and prediction are separated into two stages in Intel® DAAL model building. This separation allows the user to store and transfer only what’s needed for prediction when it comes time for model deployment. A typical machine learning workflow involves:

  • A training stage that identifies patterns in the input data, mapping the behavior of the data features to a target variable.
  • A prediction stage that applies the trained model to a new data set.

Additionally, Intel® DAAL contains on-board model scoring, in the form of separate classes that evaluate trained model performance and compute standard quality metrics. The set of quality metrics reported depends on the type of analytics model built.

Volumes in Gentle Introduction Series

  • Vol 1: Data Structures - Introduces the Data Management component of Intel® DAAL and the available custom data structures (Numeric Table and Data Dictionary), with code examples
  • Vol 2: Basic Operations on Numeric Tables - Introduces operations that can be performed on Intel® DAAL's custom data structures (Numeric Table and Data Dictionary), with code examples
  • Vol 3: Analytics Model Building and Deployment - Introduces analytics model building and evaluation in Intel® DAAL, with serialized deployment in batch processing
  • Vol 4: Distributed and Online Processing - Introduces Intel® DAAL's advanced processing modes (distributed and online) that support data analysis and model fitting on large and streaming data

IDP and Intel® DAAL Installation

The demonstrations in this article require the Intel® Distribution for Python (IDP) and Intel® DAAL, both available for free on Anaconda Cloud.

1.    Install the full IDP environment, which includes all the required packages

conda create -n IDP -c intel intelpython3_full python=3.6

2.    Activate IDP environment

source activate IDP   # Linux/macOS
(or)
activate IDP          # Windows
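
To confirm that the activated environment provides PyDAAL, a quick import check can be run (a minimal sanity test; the printed message is arbitrary):

python -c "import daal; print('PyDAAL is available')"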

Analytics Modelling:

1. Batch Learning with PyDAAL

Intel DAAL includes classes that support the following stages of the analytics model building and deployment process:

  1. Training
  2. Prediction
  3. Model Evaluation and Quality Metric
  4. Trained Model Storage and Portability

1.1 Analytics Modelling Training and Prediction Workflow:

1.2 Build and Predict with PyDAAL Analytics Models:

As described earlier, Intel DAAL model building is separated into two different stages, with two associated classes (“training” and “prediction”).

The training stage usually involves complex computations on possibly very large datasets, requiring an extensive memory footprint. DAAL’s two separate classes allow users to perform the training stage on a powerful machine and, optionally, the subsequent prediction stage on a simpler machine. Furthermore, this lets the user store and transmit only the training stage results that are required for the prediction stage.

Four numeric tables are created at the beginning of the model building process, two in each stage (training and prediction), as listed below:

Stage        Numeric Table              Description
Training     trainData                  Feature values/predictors of the training data
Training     trainDependentVariables    Target values (i.e., labels/responses) of the training data
Prediction   testData                   Feature values/predictors of the test data
Prediction   testGroundTruth            Target values (i.e., labels/responses) of the test data

Note: See Volume 2 for details on creating and working with numeric tables

The following gives a high-level overview of the training and prediction stages of the analytics model building process.

Helper Functions: Linear Regression

The next section can be copy/pasted into a user’s script or adapted to a specific use case. The helper function block provided below can be used directly to automate the training and prediction stages of DAAL’s Linear Regression algorithm. The helper functions are followed by a full usage code example.

'''
training() function
-----------------
Arguments:
        train data of type numeric table, train dependent values of type numeric table
Returns:
        training result object
'''
def training(trainData, trainDependentVariables):
    from daal.algorithms.linear_regression import training
    algorithm = training.Batch()
    # Pass a training data set and dependent values to the algorithm
    algorithm.input.set(training.data, trainData)
    algorithm.input.set(training.dependentVariables, trainDependentVariables)
    trainingResult = algorithm.compute()
    return trainingResult

'''
prediction() function
-----------------
Arguments:
        training result object, test data of type numeric table
Returns:
        predicted responses of type numeric table
'''
def prediction(trainingResult, testData):
    from daal.algorithms.linear_regression import prediction, training
    algorithm = prediction.Batch()
    # Pass a testing data set and the trained model to the algorithm
    algorithm.input.setTable(prediction.data, testData)
    algorithm.input.setModel(prediction.model, trainingResult.get(training.model))
    predictionResult = algorithm.compute()
    predictedResponses = predictionResult.get(prediction.prediction)
    return predictedResponses

To use: copy the complete block of helper functions and call the training() and prediction() methods.

Usage Example: Linear Regression

Below is a code example implementing the provided training and prediction helper functions:

#import required modules
from daal.data_management import HomogenNumericTable
import numpy as np
from utils import printNumericTable
seeded = np.random.RandomState(42)

#set up train and test numeric tables
trainData = HomogenNumericTable(seeded.rand(200, 10))
trainDependentVariables = HomogenNumericTable(seeded.rand(200, 2))
testData = HomogenNumericTable(seeded.rand(50, 10))
testGroundTruth = HomogenNumericTable(seeded.rand(50, 2))

#--------------
#Training Stage
#--------------
trainingResult = training(trainData, trainDependentVariables)
#--------------
#Prediction Stage
#--------------
predictionResult = prediction(trainingResult, testData)

#Print and compare results
printNumericTable(predictionResult, "Linear Regression prediction results: (first 10 rows):", 10)
printNumericTable(testGroundTruth, "Ground truth (first 10 rows):", 10)
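
To examine predictions outside of DAAL (for example, with numpy), a numeric table can be copied back into a numpy array using the block-descriptor access pattern covered in Volume 2. Below is a minimal sketch; the helper name getArrayFromNT is our own, not part of the PyDAAL API:

#convert a numeric table to a numpy array (hypothetical helper)
from daal.data_management import BlockDescriptor_Float64, readOnly

def getArrayFromNT(nT):
    # Acquire a read-only block spanning all rows, copy it out, then release it
    block = BlockDescriptor_Float64()
    nT.getBlockOfRows(0, nT.getNumberOfRows(), readOnly, block)
    array = block.getArray().copy()
    nT.releaseBlockOfRows(block)
    return array

predictedArray = getArrayFromNT(predictionResult)
print(predictedArray[:10])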

Notes: Helper function classes have been created using Intel DAAL’s low-level API for popular algorithms to perform various stages of the model building and deployment process. These classes contain the training and prediction stages as methods and are available in daaltces’s GitHub repository. These functions require only input arguments to be passed in each stage, as shown in the usage example.

1.3 Trained Model Evaluation and Quality Metrics:

Intel DAAL offers quality metrics classes for binary classifiers, multi-class classifiers, and regression algorithms to measure the quality of a trained model. The standard metrics computed by the Intel DAAL quality metrics library depend on the type of analytics model, as listed below.

Binary Classification:

Accuracy, Precision, Recall, F1-score, Specificity, AUC

Refer to the Intel® DAAL documentation for more details on notation and definitions.

Multi-class Classification:

Average accuracy, Error rate, Micro precision (Precisionμ), Micro recall (Recallμ), Micro F-score (F-scoreμ), Macro precision (PrecisionM), Macro recall (RecallM), Macro F-score (F-scoreM)

Refer to the Intel® DAAL documentation for more details on notation and definitions.

Regression:

For regression models, Intel DAAL computes metrics in two groups:

  • Single Beta: Computes metrics based on the individual beta coefficients of the trained model:

RMSE, vector of variances, variance-covariance matrices, Z-score statistics

  • Group Beta: Computes metrics based on the group of beta coefficients of the trained model, considered together:

Mean and variance of expected responses, regression sum of squares, sum of squares of residuals, total sum of squares, coefficient of determination, F-statistic

Refer to the Intel® DAAL documentation for more details on notation and definitions.
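
To make a few of the group-beta metrics concrete, the sketch below computes RMSE, the sums of squares, and the coefficient of determination by hand with numpy, reusing the getArrayFromNT helper sketched in section 1.2. This is a manual cross-check, not a call into DAAL’s quality metrics classes:

import numpy as np

#assumes predictionResult and testGroundTruth from the section 1.2 usage example
yTrue = getArrayFromNT(testGroundTruth)
yPred = getArrayFromNT(predictionResult)

rmse = np.sqrt(np.mean((yTrue - yPred) ** 2))        #root mean squared error
ssRes = np.sum((yTrue - yPred) ** 2)                 #sum of squares of residuals
ssTot = np.sum((yTrue - yTrue.mean(axis=0)) ** 2)    #total sum of squares
r2 = 1.0 - ssRes / ssTot                             #coefficient of determination
print("RMSE:", rmse, " R^2:", r2)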

Notes: Helper function classes have been created using Intel DAAL’s low-level API for popular algorithms to perform various stages of the model building and deployment process. These classes contain quality metrics methods and are available in daaltces’s GitHub repository. These functions require only input arguments to be passed in each stage, as shown in the usage example.

1.4 Trained Model Storage and Portability:

Trained models can be serialized into byte-type numpy arrays and later deserialized using Intel DAAL’s data archive classes to:

  • Support data transmission between devices.
  • Save to disk and restore at a later date, either to predict responses for incoming observations or to re-train the model on a set of new observations.

Optionally, to reduce network traffic and memory footprint, serialized models can be further compressed, and later decompressed during deserialization.

Steps to attain model portability in Intel DAAL:

  1. Serialization:
    a. Serialize the training stage results (trainingResult) into Intel DAAL’s input data archive object
    b. Create an empty byte-type numpy array object (bufferArray) of the size of the input data archive object
    c. Populate bufferArray with the input data archive contents
    d. Compress bufferArray into a numpy array object (optional)
    e. Save bufferArray as a .npy file to disk (optional)
  2. Deserialization:
    a. Load the .npy file from disk into a numpy array object (if serialization step 1e was performed)
    b. Decompress the numpy array object into bufferArray (if serialization step 1d was performed)
    c. Create Intel DAAL’s output data archive object from the bufferArray contents
    d. Create an empty training stage results object (trainingResult)
    e. Deserialize the output data archive contents into trainingResult

Note: As mentioned in deserialization step 2d, an empty training results object of the original type is required for Intel DAAL’s data archive methods to deserialize the serialized training results into.
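
For the uncompressed path, the steps above reduce to a short round trip. The sketch below mirrors what the helper functions in the next section automate, assuming trainingResult from the section 1.2 usage example:

import numpy as np
from daal.data_management import InputDataArchive, OutputDataArchive
from daal.algorithms.linear_regression import training

#serialization: archive the training result and copy it into a byte array (steps 1a-1c)
dataArch = InputDataArchive()
trainingResult.serialize(dataArch)
bufferArray = np.zeros(dataArch.getSizeOfArchive(), dtype=np.ubyte)
dataArch.copyArchiveToArray(bufferArray)

#deserialization: rebuild an archive and restore into an empty result object (steps 2c-2e)
dataArch = OutputDataArchive(bufferArray)
restoredResult = training.Result()   #empty training results object of the original type
restoredResult.deserialize(dataArch)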

Helper Functions: Linear Regression

The next section can be copy/pasted into a user’s script or adapted to a specific use case. The helper function block provided below can be used directly to automate model storage and portability of DAAL’s Linear Regression algorithm. The helper functions are followed by a full usage code example.

import numpy as np
import warnings
from daal.data_management import (HomogenNumericTable, InputDataArchive, OutputDataArchive,
                                  Compressor_Zlib, Decompressor_Zlib, level9,
                                  DecompressionStream, CompressionStream)
'''
Arguments: serialized numpy array
Returns: compressed numpy array
'''
def compress(arrayData):
    compressor = Compressor_Zlib()
    compressor.parameter.gzHeader = True
    compressor.parameter.level = level9
    comprStream = CompressionStream(compressor)
    comprStream.push_back(arrayData)
    compressedData = np.empty(comprStream.getCompressedDataSize(), dtype=np.uint8)
    comprStream.copyCompressedArray(compressedData)
    return compressedData

'''
Arguments: compressed numpy array
Returns: decompressed (serialized) numpy array
'''
def decompress(arrayData):
    decompressor = Decompressor_Zlib()
    decompressor.parameter.gzHeader = True
    # Create a stream for decompression
    deComprStream = DecompressionStream(decompressor)
    # Write the compressed data to the decompression stream and decompress it
    deComprStream.push_back(arrayData)
    # Allocate memory to store the decompressed data
    bufferArray = np.empty(deComprStream.getDecompressedDataSize(), dtype=np.uint8)
    # Store the decompressed data
    deComprStream.copyDecompressedArray(bufferArray)
    return bufferArray

#-------------------
#***Serialization***
#-------------------
'''
Method 1:
    Arguments: data (type nT/model)
    Returns a dictionary with the serialized array (type object) and object information (type string)
Method 2:
    Arguments: data (type nT/model), fileName (.npy file to save the serialized array to disk)
    Saves the serialized numpy array as the "fileName" argument
    Saves the object information as "fileName.txt"
Method 3:
    Arguments: data (type nT/model), useCompression = True
    Returns a dictionary with the compressed array (type object) and object information (type string)
Method 4:
    Arguments: data (type nT/model), fileName (.npy file to save the serialized array to disk), useCompression = True
    Saves the compressed numpy array as the "fileName" argument
    Saves the object information as "fileName.txt"
'''

def serialize(data, fileName=None, useCompression=False):
    buffArrObjName = (str(type(data)).split()[1].split('>')[0] + "()").replace("'", '')
    dataArch = InputDataArchive()
    data.serialize(dataArch)
    length = dataArch.getSizeOfArchive()
    bufferArray = np.zeros(length, dtype=np.ubyte)
    dataArch.copyArchiveToArray(bufferArray)
    if useCompression:
        if fileName is not None:
            if len(fileName.rsplit(".", 1)) == 2:
                fileName = fileName.rsplit(".", 1)[0]
            compressedData = compress(bufferArray)
            np.save(fileName, compressedData)
        else:
            comBufferArray = compress(bufferArray)
            serialObjectDict = {"Array Object": comBufferArray,
                                "Object Information": buffArrObjName}
            return serialObjectDict
    else:
        if fileName is not None:
            if len(fileName.rsplit(".", 1)) == 2:
                fileName = fileName.rsplit(".", 1)[0]
            np.save(fileName, bufferArray)
        else:
            serialObjectDict = {"Array Object": bufferArray,
                                "Object Information": buffArrObjName}
            return serialObjectDict
    infoFile = open(fileName + ".txt", "w")
    infoFile.write(buffArrObjName)
    infoFile.close()
#---------------------
#***Deserialization***
#---------------------
'''
Returns the deserialized/decompressed numeric table or model
Input can be a serialized/compressed numpy array or a serialized/compressed .npy file saved to disk
'''
def deserialize(serialObjectDict=None, fileName=None, useCompression=False):
    import daal
    if fileName is not None and serialObjectDict is None:
        bufferArray = np.load(fileName)
        buffArrObjName = open(fileName.rsplit(".", 1)[0] + ".txt", "r").read()
    elif fileName is None and serialObjectDict is not None:
        bufferArray = serialObjectDict["Array Object"]
        buffArrObjName = serialObjectDict["Object Information"]
    else:
        warnings.warn('Expecting exactly one of the "serialObjectDict" or "fileName" arguments')
        raise SystemExit
    if useCompression:
        bufferArray = decompress(bufferArray)
    dataArch = OutputDataArchive(bufferArray)
    try:
        # Recreate an empty object of the original type to deserialize into
        deSerialObj = eval(buffArrObjName)
    except AttributeError:
        deSerialObj = HomogenNumericTable()
    deSerialObj.deserialize(dataArch)
    return deSerialObj

To use: copy the complete block of helper functions and call the serialize() and deserialize() methods.

Usage Example: Linear Regression

The example below implements the serialize() and deserialize() functions on the Linear Regression trainingResult. (Refer to the Linear Regression usage example in the section Build and Predict with PyDAAL Analytics Models to compute trainingResult.)

#Serialize
serialTrainingResultArray = serialize(trainingResult) # Run Usage Example: Linear Regression from section 1.2
#Deserialize
deserialTrainingResult = deserialize(serialTrainingResultArray)

#Predict
predictionResult = prediction(deserialTrainingResult, testData)

#Print and compare results
printNumericTable(predictionResult, "Linear Regression deserialized prediction results: (first 10 rows):", 10)
printNumericTable(testGroundTruth, "Ground truth (first 10 rows):", 10)

The examples below implement other combinations of the serialize() and deserialize() methods with different input arguments:

#---compress and serialize
serialTrainingResultArray = serialize(trainingResult, useCompression=True)
#---decompress and deserialize 
deserialTrainingResult = deserialize(serialTrainingResultArray, useCompression=True)

#---serialize and save to disk as numpy array
serialize(trainingResult,fileName="trainingResult")

#---deserialize file from disk
deserialTrainingResult = deserialize(fileName="trainingResult.npy")
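
The remaining combination (Method 4 in the serialize() docstring) compresses and saves to disk, then reverses both steps on load:

#---compress, serialize, and save to disk as a numpy array
serialize(trainingResult, fileName="trainingResult", useCompression=True)

#---load from disk, decompress, and deserialize
deserialTrainingResult = deserialize(fileName="trainingResult.npy", useCompression=True)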

Notes: Helper function classes have been created using Intel DAAL’s low-level API for popular algorithms to perform various stages of the model building and deployment process. These classes contain the model storage and portability stages as methods and are available in daaltces’s GitHub repository. These functions require only input arguments to be passed in each stage, as shown in the usage example.

Conclusion

Previous volumes (Volume 1 and Volume 2) demonstrated the Intel® Data Analytics Acceleration Library’s (Intel® DAAL) Numeric Table data structure and basic operations on Numeric Tables. Volume 3 discussed Intel® DAAL’s algorithm component and performing analytics modeling through its different stages in batch processing. Volume 3 also demonstrated how to achieve model portability (serialization) and perform model evaluation (quality metrics). Furthermore, this volume used Intel® DAAL classes to provide helper functions that deliver a standalone solution for the model fitting and deployment process.
