Gentle Introduction to PyDAAL: Vol 3 Analytics Model Building and Deployment
By Preethi Venkatesh and Nathan Greeneltch, published October 16, 2017
Previous: Vol 2: Basic Operations on Numeric Tables
Earlier in the Gentle Introduction Series (Volume 1 and Volume 2), we covered the fundamentals of the Intel® Data Analytics Acceleration Library (Intel® DAAL) custom data structures and the basic operations that can be performed on them. Volume 3 focuses on the algorithm component of Intel® DAAL, where the data management element is leveraged to drive analysis and build machine learning models.
Intel® DAAL provides classes for constructing a wide range of popular machine learning algorithms for analytics model building, including classification, regression, recommender systems, and neural networks. Model building in Intel® DAAL is separated into two pieces: training and prediction. This separation allows the user to store and transfer only what is needed for prediction when it comes time for model deployment. A typical machine learning workflow involves:
- A training stage, which identifies patterns in the input data that map the behavior of the data features to a target variable.
- A prediction stage, which applies the trained model to a new data set.
Additionally, Intel® DAAL contains on-board model scoring, in the form of separate classes that evaluate trained model performance and compute standard quality metrics. Different sets of quality metrics can be reported depending on the type of analytics model built.
Volumes in Gentle Introduction Series
- Vol 1: Data Structures - Introduces the Data Management component of Intel® DAAL and the available custom data structures (Numeric Table and Data Dictionary) with code examples
- Vol 2: Basic Operations on Numeric Tables - Introduces possible operations that can be performed on Intel® DAAL's custom Data Structure (Numeric Table and Data Dictionary) with code examples
- Vol 3: Analytics Model Building and Deployment – Introduces analytics model building and evaluation in Intel® DAAL with serialized deployment in batch processing
- Vol 4: Distributed and Online Processing - Introduces Intel DAAL's advanced processing modes (distributed and online) that support data analysis and model fitting on large and streaming data.
IDP and Intel® DAAL Installation
The demonstrations in this article require the Intel® Distribution for Python (IDP) and Intel® DAAL, both available for free on the Anaconda cloud.
1. Create the full IDP environment, which installs all the required packages:
conda create -n IDP -c intel intelpython3_full python=3.6
2. Activate the IDP environment:
source activate IDP (Linux/macOS) or activate IDP (Windows)
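3. Optionally, verify the installation (a quick sanity check, not part of the official setup steps):

python -c "import daal"

If this command completes without an ImportError, PyDAAL is ready to use.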
Analytics Modelling:
1. Batch Learning with PyDAAL
Intel DAAL includes classes that support the following stages of the analytics model building and deployment process:
1.1 Analytics Modelling Training and Prediction Workflow:
1.2 Build and Predict with PyDAAL Analytics Models:
As described earlier, Intel DAAL model building is separated into two stages with two associated classes ("training" and "prediction").
The training stage usually involves complex computations on possibly very large datasets, calling for an extensive memory footprint. DAAL's two separate classes allow users to perform the training stage on a powerful machine and, optionally, the subsequent prediction stage on a simpler machine. Furthermore, this enables the user to store and transmit only the training stage results that are required for the prediction stage.
Four numeric tables are created at the beginning of the model building process, two in each stage (training and prediction), as listed below:
Stage | Numeric Tables | Description |
---|---|---|
Training | trainData | This includes the feature values/predictors |
Training | trainDependentVariables | This includes the target values (i.e., labels/responses) |
Prediction | testData | This includes the feature values/predictors of test data |
Prediction | testGroundTruth | This includes the target values (i.e., labels/responses) of the test data |
Note: See Volume 2 for details on creating and working with numeric tables
Below is a high-level overview of the training and prediction stages of the analytics model building process:
Helper Functions: Linear Regression
The next section can be copy/pasted into a user's script or adapted to a specific use case. The helper function block provided below can be used directly to automate the training and prediction stages of DAAL's Linear Regression algorithm. The helper functions are followed by a full usage code example.
```python
'''
training() function
-------------------
Arguments: train data of type numeric table, train dependent values of type numeric table
Returns: training results object
'''
def training(trainData, trainDependentVariables):
    from daal.algorithms.linear_regression import training
    algorithm = training.Batch()
    # Pass the training data set and dependent values to the algorithm
    algorithm.input.set(training.data, trainData)
    algorithm.input.set(training.dependentVariables, trainDependentVariables)
    trainingResult = algorithm.compute()
    return trainingResult

'''
prediction() function
---------------------
Arguments: training result object, test data of type numeric table
Returns: predicted responses of type numeric table
'''
def prediction(trainingResult, testData):
    from daal.algorithms.linear_regression import prediction, training
    algorithm = prediction.Batch()
    # Pass the testing data set and the trained model to the algorithm
    algorithm.input.setTable(prediction.data, testData)
    algorithm.input.setModel(prediction.model, trainingResult.get(training.model))
    predictionResult = algorithm.compute()
    predictedResponses = predictionResult.get(prediction.prediction)
    return predictedResponses
```
To use: copy the complete block of helper functions and call the training() and prediction() methods.
Usage Example: Linear Regression
Below is a code example implementing the provided training and prediction helper functions:
```python
# Import required modules
from daal.data_management import HomogenNumericTable
import numpy as np
from utils import printNumericTable

seeded = np.random.RandomState(42)

# Set up train and test numeric tables
trainData = HomogenNumericTable(seeded.rand(200, 10))
trainDependentVariables = HomogenNumericTable(seeded.rand(200, 2))
testData = HomogenNumericTable(seeded.rand(50, 10))
testGroundTruth = HomogenNumericTable(seeded.rand(50, 2))

# --------------
# Training Stage
# --------------
trainingResult = training(trainData, trainDependentVariables)

# ----------------
# Prediction Stage
# ----------------
predictionResult = prediction(trainingResult, testData)

# Print and compare results
printNumericTable(predictionResult, "Linear Regression prediction results: (first 10 rows):", 10)
printNumericTable(testGroundTruth, "Ground truth (first 10 rows):", 10)
```
Notes: Helper function classes built with Intel DAAL's low-level API are available for popular algorithms to perform the various stages of the model building and deployment process. These classes contain the training and prediction stages as methods and are available in daaltces's GitHub repository. They require only the input arguments to be passed in each stage, as shown in the usage example.
1.3 Trained Model Evaluation and Quality Metrics:
Intel DAAL offers quality metrics classes for binary classifiers, multi-class classifiers, and regression algorithms to measure the quality of a trained model. The Intel DAAL quality metrics library computes various standard metrics for the different types of analytics models.
Binary Classification:
Accuracy, Precision, Recall, F1-score, Specificity, AUC
See the Intel DAAL documentation for more details on notations and definitions.
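To make these definitions concrete, below is a minimal NumPy-only sketch (not the Intel DAAL quality-metrics API) that computes the listed metrics from hard 0/1 labels. The label arrays are made-up illustrative values, and AUC is omitted because it requires predicted scores rather than hard labels:

```python
import numpy as np

# Made-up ground-truth and predicted 0/1 labels, for illustration only
yTrue = np.array([1, 0, 1, 1, 0, 1, 0, 0])
yPred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts
tp = np.sum((yPred == 1) & (yTrue == 1))  # true positives
tn = np.sum((yPred == 0) & (yTrue == 0))  # true negatives
fp = np.sum((yPred == 1) & (yTrue == 0))  # false positives
fn = np.sum((yPred == 0) & (yTrue == 1))  # false negatives

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)              # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)
```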
Multi-class Classification:
Average accuracy, Error rate, Micro precision (Precision_μ), Micro recall (Recall_μ), Micro F-score (F-score_μ), Macro precision (Precision_M), Macro recall (Recall_M), Macro F-score (F-score_M)
See the Intel DAAL documentation for more details on notations and definitions.
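To illustrate the micro/macro distinction, here is a NumPy-only sketch (again, not the DAAL quality-metrics API) computed from a hypothetical confusion matrix C, where C[i, j] counts samples of true class i predicted as class j:

```python
import numpy as np

# Hypothetical 3-class confusion matrix, for illustration only
C = np.array([[5, 1, 0],
              [2, 6, 1],
              [0, 1, 4]])
total = C.sum()

tp = np.diag(C).astype(float)  # per-class true positives
fp = C.sum(axis=0) - tp        # per-class false positives (column sums minus diagonal)
fn = C.sum(axis=1) - tp        # per-class false negatives (row sums minus diagonal)
tn = total - tp - fp - fn      # per-class true negatives

# Micro averaging pools counts over all classes before computing the metric
microPrecision = tp.sum() / (tp.sum() + fp.sum())
microRecall    = tp.sum() / (tp.sum() + fn.sum())

# Macro averaging computes the metric per class, then averages
macroPrecision = np.mean(tp / (tp + fp))
macroRecall    = np.mean(tp / (tp + fn))

# One common definition of average accuracy / error rate over classes
averageAccuracy = np.mean((tp + tn) / total)
errorRate       = np.mean((fp + fn) / total)
```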
Regression:
For regression models, Intel DAAL computes metrics in two groups:
- Single Beta: Computes metrics based on the individual beta coefficients of the trained model:
RMSE, vector of variances, variance-covariance matrices, Z-score statistics
- Group Beta: Computes metrics based on the group of beta coefficients of the trained model:
Mean and variance of expected responses, Regression Sum of Squares, Sum of Squares of Residuals, Total Sum of Squares, Coefficient of Determination, F-statistic
See the Intel DAAL documentation for more details on notations and definitions.
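For reference, the sums of squares and the statistics derived from them can be written out directly. The sketch below uses NumPy with made-up response vectors; it illustrates the standard formulas and is not the DAAL quality-metrics API:

```python
import numpy as np

# Made-up observed and predicted responses, for illustration only
y    = np.array([3.1, 2.4, 4.0, 5.2, 3.8])   # observed responses
yHat = np.array([3.0, 2.7, 3.8, 5.0, 4.1])   # model predictions
n = y.size
p = 2                                         # number of beta coefficients (excluding intercept)

ssTotal      = np.sum((y - y.mean()) ** 2)     # Total Sum of Squares
ssResidual   = np.sum((y - yHat) ** 2)         # Sum of Squares of Residuals
ssRegression = np.sum((yHat - y.mean()) ** 2)  # Regression Sum of Squares

rSquared = 1 - ssResidual / ssTotal            # Coefficient of Determination
# F-statistic for the overall regression (assumes an OLS fit with intercept)
fStatistic = (ssRegression / p) / (ssResidual / (n - p - 1))
rmse = np.sqrt(ssResidual / n)                 # RMSE, from the single-beta group
```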
Notes: Helper function classes built with Intel DAAL's low-level API are available for popular algorithms to perform the various stages of the model building and deployment process. These classes contain the quality metrics stages as methods and are available in daaltces's GitHub repository. They require only the input arguments to be passed in each stage, as shown in the usage example.
1.4 Trained Model Storage and Portability:
Trained models can be serialized into byte-type numpy arrays and deserialized using Intel DAAL’s data archive classes to:
- Support data transmission between devices.
- Save the model to disk and restore it at a later date, to predict the response for an incoming observation or to re-train the model with a set of new observations.
Optionally, to reduce network traffic and memory footprint, serialized models can further be compressed and later decompressed using the deserialization method.
Steps to attain model portability in Intel DAAL:
1. Serialization:
   a. Serialize the training stage results (trainingResult) into Intel DAAL's InputDataArchive object
   b. Create an empty byte-type numpy array object (bufferArray) of the size of the InputDataArchive object
   c. Populate bufferArray with the InputDataArchive contents
   d. Compress bufferArray to a numpy array object (optional)
   e. Save bufferArray as a .npy file to disk (optional)
2. Deserialization:
   a. Load the .npy file from disk into a numpy array object (if serialization step 1e was performed)
   b. Decompress the numpy array object to bufferArray (if serialization step 1d was performed)
   c. Create Intel DAAL's OutputDataArchive object from the bufferArray contents
   d. Create an empty original training stage results object (trainingResult)
   e. Deserialize the OutputDataArchive contents into trainingResult
Note: As mentioned in deserialization step 2d, an empty original training results object is required for Intel DAAL’s data archive methods to deserialize the serialized training results object.
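The steps above map onto only a few PyDAAL calls. Below is a minimal sketch of the uncompressed round trip (serialization steps 1a-1c and deserialization steps 2c-2e), assuming the linear regression trainingResult from section 1.2 is in scope; the full helper functions that follow add the optional compression and disk-storage steps:

```python
import numpy as np
from daal.data_management import InputDataArchive, OutputDataArchive
from daal.algorithms.linear_regression import training

# --- Serialization: trainingResult -> byte-type numpy array ---
dataArch = InputDataArchive()                                        # step 1a
trainingResult.serialize(dataArch)
bufferArray = np.zeros(dataArch.getSizeOfArchive(), dtype=np.ubyte)  # step 1b
dataArch.copyArchiveToArray(bufferArray)                             # step 1c

# --- Deserialization: byte-type numpy array -> training results object ---
outArch = OutputDataArchive(bufferArray)     # step 2c
restoredTrainingResult = training.Result()   # step 2d: empty original results object
restoredTrainingResult.deserialize(outArch)  # step 2e
```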
Helper Functions: Linear Regression
The next section can be copy/pasted into a user's script or adapted to a specific use case. The helper function block provided below can be used directly to automate model storage and portability for DAAL's Linear Regression algorithm. The helper functions are followed by a full usage code example.
```python
import numpy as np
import warnings
from daal.data_management import (HomogenNumericTable, InputDataArchive, OutputDataArchive,
                                  Compressor_Zlib, Decompressor_Zlib, level9,
                                  DecompressionStream, CompressionStream)

'''
Arguments: serialized numpy array
Returns: compressed numpy array
'''
def compress(arrayData):
    compressor = Compressor_Zlib()
    compressor.parameter.gzHeader = True
    compressor.parameter.level = level9
    comprStream = CompressionStream(compressor)
    comprStream.push_back(arrayData)
    compressedData = np.empty(comprStream.getCompressedDataSize(), dtype=np.uint8)
    comprStream.copyCompressedArray(compressedData)
    return compressedData

'''
Arguments: compressed numpy array
Returns: decompressed numpy array
'''
def decompress(arrayData):
    decompressor = Decompressor_Zlib()
    decompressor.parameter.gzHeader = True
    # Create a stream for decompression
    deComprStream = DecompressionStream(decompressor)
    # Write the compressed data to the decompression stream and decompress it
    deComprStream.push_back(arrayData)
    # Allocate memory to store the decompressed data
    bufferArray = np.empty(deComprStream.getDecompressedDataSize(), dtype=np.uint8)
    # Store the decompressed data
    deComprStream.copyDecompressedArray(bufferArray)
    return bufferArray

#-------------------
#***Serialization***
#-------------------
'''
Method 1:
    Arguments: data (type nT/model)
    Returns: dictionary with serialized array (type object) and object information (type string)
Method 2:
    Arguments: data (type nT/model), fileName (.npy file to save serialized array to disk)
    Saves serialized numpy array as "fileName" argument
    Saves object information as "fileName.txt"
Method 3:
    Arguments: data (type nT/model), useCompression = True
    Returns: dictionary with compressed array (type object) and object information (type string)
Method 4:
    Arguments: data (type nT/model), fileName (.npy file to save serialized array to disk), useCompression = True
    Saves compressed numpy array as "fileName" argument
    Saves object information as "fileName.txt"
'''
def serialize(data, fileName=None, useCompression=False):
    buffArrObjName = (str(type(data)).split()[1].split('>')[0] + "()").replace("'", '')
    dataArch = InputDataArchive()
    data.serialize(dataArch)
    length = dataArch.getSizeOfArchive()
    bufferArray = np.zeros(length, dtype=np.ubyte)
    dataArch.copyArchiveToArray(bufferArray)
    if useCompression == True:
        if fileName != None:
            if len(fileName.rsplit(".", 1)) == 2:
                fileName = fileName.rsplit(".", 1)[0]
            compressedData = compress(bufferArray)
            np.save(fileName, compressedData)
        else:
            comBufferArray = compress(bufferArray)
            serialObjectDict = {"Array Object": comBufferArray,
                                "Object Information": buffArrObjName}
            return serialObjectDict
    else:
        if fileName != None:
            if len(fileName.rsplit(".", 1)) == 2:
                fileName = fileName.rsplit(".", 1)[0]
            np.save(fileName, bufferArray)
        else:
            serialObjectDict = {"Array Object": bufferArray,
                                "Object Information": buffArrObjName}
            return serialObjectDict
    # Save the object information alongside the .npy file
    infoFile = open(fileName + ".txt", "w")
    infoFile.write(buffArrObjName)
    infoFile.close()

#---------------------
#***Deserialization***
#---------------------
'''
Returns: deserialized/decompressed numeric table/model
Input can be a serialized/compressed numpy array or a serialized/compressed .npy file saved to disk
'''
def deserialize(serialObjectDict=None, fileName=None, useCompression=False):
    import daal
    if fileName != None and serialObjectDict == None:
        bufferArray = np.load(fileName)
        buffArrObjName = open(fileName.rsplit(".", 1)[0] + ".txt", "r").read()
    elif fileName == None and serialObjectDict != None:
        bufferArray = serialObjectDict["Array Object"]
        buffArrObjName = serialObjectDict["Object Information"]
    else:
        warnings.warn('Expecting "serialObjectDict" or "fileName" argument, NOT both')
        raise SystemExit
    if useCompression == True:
        bufferArray = decompress(bufferArray)
    dataArch = OutputDataArchive(bufferArray)
    try:
        deSerialObj = eval(buffArrObjName)
    except AttributeError:
        deSerialObj = HomogenNumericTable()
    deSerialObj.deserialize(dataArch)
    return deSerialObj
```
To use: copy the complete block of helper functions and call the serialize() and deserialize() methods.
Usage Example: Linear Regression
The example below implements the serialize() and deserialize() functions on the Linear Regression trainingResult. (Refer to the Linear Regression usage example in the section Build and Predict with PyDAAL Analytics Models to compute trainingResult.)
```python
# Run the Usage Example: Linear Regression from section 1.2 first to obtain trainingResult

# Serialize
serialTrainingResultArray = serialize(trainingResult)

# Deserialize
deserialTrainingResult = deserialize(serialTrainingResultArray)

# Predict
predictionResult = prediction(deserialTrainingResult, testData)

# Print and compare results
printNumericTable(predictionResult, "Linear Regression deserialized prediction results: (first 10 rows):", 10)
printNumericTable(testGroundTruth, "Ground truth (first 10 rows):", 10)
```
The examples below implement other combinations of the serialize() and deserialize() methods with different input arguments:
```python
# ---Compress and serialize
serialTrainingResultArray = serialize(trainingResult, useCompression=True)

# ---Decompress and deserialize
deserialTrainingResult = deserialize(serialTrainingResultArray, useCompression=True)

# ---Serialize and save to disk as a numpy array
serialize(trainingResult, fileName="trainingResult")

# ---Deserialize the file from disk
deserialTrainingResult = deserialize(fileName="trainingResult.npy")
```
Notes: Helper function classes built with Intel DAAL's low-level API are available for popular algorithms to perform the various stages of the model building and deployment process. These classes contain the model storage and portability stages as methods and are available in daaltces's GitHub repository. They require only the input arguments to be passed in each stage, as shown in the usage example.
Conclusion
Previous volumes (Volume 1 and Volume 2) demonstrated the Intel® Data Analytics Acceleration Library's (Intel® DAAL) Numeric Table data structure and basic operations on Numeric Tables. Volume 3 discussed Intel® DAAL's algorithm component and walked through the stages of analytics model building in batch processing. It also demonstrated how to achieve model portability (serialization) and perform model evaluation (quality metrics). Furthermore, this volume used Intel® DAAL classes to provide helper functions that deliver a standalone solution for the model fitting and deployment process.