Performance Boosting in Seldon

Co-authored with: Clive Cox, CTO Seldon*

Introduction

Many AI applications have data pipelines that include several processing steps when executing inference operations. These steps can include data transformation, inference execution, and conditional processing. The Seldon* Core open source machine learning deployment platform facilitates management of inference pipelines using preconfigured and reusable components. Intel and Seldon data scientists have worked together to improve the performance of inference pipelines. In this paper, we present our results as well as guidelines on how to apply these optimizations in various use cases.

Improvements

Network Communication

In our performance analysis, we focused on visual data processing, which typically involves large data sets and high traffic volumes. Our first observation was that we should rule out the Representational State Transfer (REST) API and switch to gRPC: the REST API carries data in JavaScript* Object Notation (JSON) format, which increases message size, and its serialization and deserialization are slow. For input data containing an ImageNet picture (a 224x224x3 array), the REST API communication overhead exceeded one second.

We also compared the data types supported in the Seldon API message content. Of the two subtypes of the DefaultData type, we verified that Tensor is faster than google.protobuf.ListValue. The reason is that ListValue deserialization is implemented as a Python* loop iterating over all elements of the input array; for image data, which includes many thousands of elements, this is very slow. The Tensor data type passes data as a vector of double elements. For input data with ImageNet pictures, it resulted in about 50 ms of overhead, with about 15 ms related to data deserialization in the function numpy.array(datadef.get("tensor").get("values")). The deserialization impact was further improved by employing the Python buffer protocol and the numpy.frombuffer function, reducing communication latency by another 13 ms, to approximately 35 ms.
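To illustrate the difference, here is a minimal sketch using plain NumPy rather than the actual Seldon protobuf objects; the array size and dtype are assumptions matching an ImageNet-style input.

import numpy as np

# Stand-in for the values carried in the Seldon message payload.
values = [0.5] * (224 * 224 * 3)

# Slower path: numpy.array() walks the Python sequence element by element,
# which dominates deserialization time for large images.
arr_from_list = np.array(values, dtype=np.float64)

# Faster path: when the payload is exposed as a contiguous buffer of doubles,
# numpy.frombuffer wraps the memory directly, avoiding the Python-level loop.
raw = arr_from_list.tobytes()
arr_from_buffer = np.frombuffer(raw, dtype=np.float64)

assert np.array_equal(arr_from_list, arr_from_buffer)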

Our next optimization was to add a new data type to the Seldon message, TensorFlow* Tensor, which uses an even more efficient mechanism for data serialization and transfer. TensorFlow Tensor relies primarily on the numpy.tostring function on the sender side and the numpy.fromstring function on the receiver side. The full array is reconstructed via the function tf.make_ndarray(datadef.tftensor). Besides the image content, the message also includes extra metadata.
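The sender/receiver pattern can be sketched with public TensorFlow utility functions as shown below; a random array stands in for real image data, and in the actual Seldon message the resulting TensorProto travels in the datadef.tftensor field.

import numpy as np
import tensorflow as tf

# Sender side: pack the image array into a TensorProto. The conversion works
# on the raw array buffer instead of iterating over individual elements.
image = np.random.rand(1, 224, 224, 3).astype(np.float32)
tensor_proto = tf.make_tensor_proto(image, shape=image.shape)

# Receiver side: rebuild the NumPy array from the TensorProto.
restored = tf.make_ndarray(tensor_proto)
assert restored.shape == (1, 224, 224, 3)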

Figure 1. Visualization of performance improvements related to communication between the microservices.

This implementation with gRPC and TensorFlow Tensor data type is state of the art for data ingestion in an array format.

In high-throughput systems, the array data format can become a bottleneck on the network layer. A single ResNet50 input (1x224x224x3 float32) is about 0.6 MB, so at 1000 inference requests per second the incoming traffic might exceed 4.8 gigabits per second.

To address such scenarios, Seldon introduced the byte data type, which sends input data in binary formats such as JPEG files. The pipeline presented below includes a Seldon Transformer component that receives the compressed image and converts it to a NumPy array using the OpenCV library. The model server component then executes the inference operations.

We found JPEG compression so effective that it reduced network traffic by a factor of approximately 50. The smaller message size also sped up data transfer and serialization, and JPEG decompression with OpenCV is very fast (approximately 2 ms). As a result, overall latency and network utilization are lower, increasing system scalability with a minimal increase in CPU workload.
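For illustration, a minimal input transformer along these lines might look like the sketch below. The class and method names follow the Seldon Python wrapper convention for transformer components; the float conversion and added batch dimension are assumptions about what the downstream model expects.

import cv2
import numpy as np

class JpegInputTransformer:
    # Hypothetical Seldon transformer component: decodes JPEG bytes into a NumPy array.

    def transform_input(self, X, features_names=None):
        # X is assumed to carry the raw JPEG payload as bytes (Seldon binary data).
        jpeg = np.frombuffer(X, dtype=np.uint8)
        image = cv2.imdecode(jpeg, cv2.IMREAD_COLOR)   # decoded as HWC, BGR, uint8
        image = image.astype(np.float32)
        return np.expand_dims(image, axis=0)           # add a batch dimension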

The final enhancement was tuning the Ambassador configuration to ensure proper load balancing of gRPC calls at the L7 (application) level. The goal was to distribute Seldon inference requests evenly across all of the pods. We applied a fix to disable client IP session affinity by using a headless Kubernetes* service.

Inference Execution

Having addressed the communication challenges, we focused on inference execution improvements.

Intel offers a range of solutions to speed up inference processing, from framework optimizations to software libraries, available through the Intel® AI Developer Program. In this article, we present optimizations for image classification using the Intel® Distribution of OpenVINO™ toolkit (a developer tool suite whose name stands for Open Visual Inference and Neural Network Optimization). For convolutional neural network-based workloads, the toolkit supports a range of computer vision accelerators, including CPUs, GPUs, Intel® Movidius™ Vision Processing Units (VPUs), and Intel® FPGAs.

The Intel® Distribution of OpenVINO™ toolkit includes two key components: the Model Optimizer and the Inference Engine.

The toolkit’s Model Optimizer converts trained models from any supported framework to the Intermediate Representation format. It can be used with TensorFlow*, Caffe*, Apache MXNet*, Kaldi*, and Open Neural Network Exchange (ONNX*) models. The Inference Engine then consumes the optimized model and executes inference operations with improved performance. Model optimization is a one-time task.
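As a sketch of that flow, the snippet below loads an Intermediate Representation model and runs a single inference through the Inference Engine Python API. The file names are placeholders, and the class names (IECore, IENetwork) follow the 2019-era API, which may differ in later toolkit releases.

import numpy as np
from openvino.inference_engine import IECore, IENetwork

# Model Optimizer is run once, offline, for example:
#   python mo_tf.py --input_model resnet_v1_50.pb --input_shape [1,224,224,3]
# producing resnet_v1_50.xml (topology) and resnet_v1_50.bin (weights).

ie = IECore()
net = IENetwork(model="resnet_v1_50.xml", weights="resnet_v1_50.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

# The input and output tensor names can be read from the IR itself.
input_blob = next(iter(net.inputs))
output_blob = next(iter(net.outputs))

image = np.zeros((1, 3, 224, 224), dtype=np.float32)   # NCHW layout after conversion
result = exec_net.infer(inputs={input_blob: image})
probabilities = result[output_blob]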

Figure 2. Model optimization and inference execution pipeline (source: Model Optimizer Developer Guide).

The following tables show benchmark results comparing inference latency of TensorFlow and the Intel Distribution of OpenVINO toolkit Inference Engine on CPU:

Table 1. Inference latency of Inference Engine execution on models with float32 precision with batch size 1.

Model | Intel® Xeon® Platinum 8180 processor, TensorFlow* Eigen 1.13.1 | Intel Xeon Platinum 8180 processor, Intel Distribution of OpenVINO toolkit 2019.1 with OpenMP*
ResNet v1.50 | 26.9 ms | 4.35 ms
Inception v3 | 34.7 ms | 7.17 ms

Table 2. Inference latency of Inference Engine execution on models with Int8 precision with batch size 1 (TensorFlow Eigen not included as it does not support Int8 precision models).

Model | Intel® Xeon® Platinum 8280 processor, Intel Distribution of OpenVINO toolkit 2019.1 with OpenMP*
ResNet v1.50 | 1.60 ms
Inception v3 | 3.23 ms

Model Quantization

Model quantization is a commonly used method to increase inference efficiency by reducing the precision of graph calculations. Learn more in the blog post Introducing int8 Quantization for Fast CPU Inference Using OpenVINO.

Performance improvements from reduced precision can be noticeable on any hardware. They are particularly visible on second-generation Intel® Xeon® Scalable processors with Intel® Deep Learning Boost (Intel® DL Boost). The gain comes from the new Vector Neural Network Instructions (VNNI), which accelerate the matrix multiplications at the core of convolutional neural network-based algorithms.

The following are performance results captured on both previous-generation and second-generation Intel Xeon Scalable processors. We compared latency and accuracy for the models in float32 and int8 precision. The execution was performed using the Intel Distribution of OpenVINO toolkit R5 Inference Engine.

Table 3. Inference throughput, latency, and accuracy for Inference Engine execution on Intel Xeon Platinum 8180 processor and Intel Xeon Platinum 8280 processor-based systems, both using the Intel Distribution of OpenVINO toolkit, with batch size 1 and the ResNet v1.50 model. Load for the throughput measurements was generated with 28 parallel clients.

Server | Float32 Throughput (28 clients) | Int8 Throughput (28 clients) | Float32 Latency (1 client) | Int8 Latency (1 client) | Float32 Accuracy | Int8 Accuracy
Intel Xeon Platinum 8180 processor | 763 fps | 1330 fps | 4.35 ms | 1.98 ms | 75.17% | 75.08%
Intel Xeon Platinum 8280 processor | 839 fps | 3117 fps | 4.25 ms | 1.60 ms | n/a | n/a

Demo Environment

The goal of the pipeline presented here is to demonstrate the enhancements listed in the previous section. The pipeline includes the following components:

  • Input transformer, converting JPEG-compressed content into a NumPy array
  • Two model components that execute inference requests using ResNet and DenseNet models
  • A combiner component implementing the ensemble of models (see the sketch after this list)
  • Output transformer that converts an array of classification probabilities to a human-readable top-1 class name
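For illustration, the combiner could be as simple as the sketch below, which averages the class probabilities returned by the upstream models. The method name follows the Seldon Python wrapper convention for combiner components; the plain averaging strategy is an assumption, and the actual demo component may merge the predictions differently.

import numpy as np

class EnsembleCombiner:
    # Hypothetical Seldon combiner component: merges predictions from several models.

    def aggregate(self, Xs, features_names=None):
        # Xs is expected to be a list of probability arrays, one per upstream
        # model, each shaped (batch_size, num_classes).
        return np.mean(np.stack(Xs), axis=0)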

We also tested Seldon scalability by replicating the pipeline with Kubernetes replicas behind the Ambassador load balancer.

In the tests we used two models with identical input and output shapes and comparable accuracy results and complexity:

  • ResNet v1.50
  • DenseNet 169

Figure 3. Seldon pipeline with ensemble of models.

As a side note, a similar pipeline can be implemented with the TensorFlow Serving Proxy component from Seldon. It translates the Seldon API into the TensorFlow Serving API, so the inference execution can be delegated to a dedicated service.

Figure 4. Seldon pipeline with ensemble of models using TensorFlow* Serving Proxy component.

In the pipeline presented above, the inference requests are handled by OpenVINO model server containers. They can run in the local Kubernetes pod or can be externally hosted on a separate infrastructure.

The components in the demo are created from a Seldon base Docker image that includes a compiled OpenVINO toolkit Inference Engine with the Python API, which can be used to implement the code for executing inference operations. Refer to the documentation in the Inference Engine Developer Guide or the DLDT Inference Engine content on GitHub*. This base image also includes Intel® Optimization for TensorFlow* with Intel® Math Kernel Library (Intel® MKL) and the OpenCV Python packages. It can be used to improve inference performance with standard TensorFlow models and image pre- or post-processing.

Note that the prediction model component with the Intel Distribution of OpenVINO toolkit used in this pipeline was created to be generic. It runs prediction operations using the toolkit's Inference Engine. It imports models in Intermediate Representation format from local storage, Google Cloud Storage, or Amazon S3*. Model input and output tensors are determined automatically, so minimal configuration is needed.

The demo pipeline can be reproduced in any Kubernetes cluster including Minikube. The deployment and testing steps are described in this Jupyter* Notebook.

Results

The pipeline was tested in Google Cloud Platform (GCP) infrastructure using Google Kubernetes Engine (GKE) service. Each Kubernetes node used 32 virtual cores of Intel Xeon Scalable processors.

Accuracy Impact

In the first step, we tested the accuracy of individual models used in the pipeline. This verification was completed for models with both float32 and int8 precision.

Tests were done by connecting directly to OpenVINO Model Server endpoints via a gRPC client using TF Serving API calls. We used a sample of 5000 images from the ImageNet dataset to estimate the results. More accurate results could be collected with the complete ImageNet dataset.
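A minimal client of this kind might look like the sketch below; the endpoint address, model name, and tensor names are assumptions, and a zero-filled array stands in for a preprocessed ImageNet image.

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")        # assumed model server endpoint
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "resnet"                       # assumed model name
image = np.zeros((1, 224, 224, 3), dtype=np.float32)     # preprocessed image goes here
request.inputs["data"].CopyFrom(tf.make_tensor_proto(image, shape=image.shape))

response = stub.Predict(request, timeout=10.0)
probabilities = tf.make_ndarray(response.outputs["prob"])  # output name is an assumption
predicted_class = int(np.argmax(probabilities))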

Finally, the accuracy was collected using the Seldon pipeline with an ensemble of models with reduced int8 precision. The measurement was done with a similar gRPC client using the Seldon API and the identical sample dataset.

Table 4. Accuracy observed on the models and their ensemble collected with a sample of ImageNet data.

Model | Accuracy Top1 (%)
ResNet v1.50 float32 | 74.91
ResNet v1.50 int8 | 74.87
DenseNet169 float32 | 75.65
DenseNet169 int8 | 75.59
Ensemble of ResNet50 and DenseNet169 int8 models | 77.37

Latency

It can be difficult to analyze execution timelines in distributed systems due to the complexity of tracking correlations between events. We employed OpenTracing and Jaeger components to help visualize pipeline execution. Seldon has built-in integration with Jaeger, which can easily be enabled in the pipeline.

Following are execution timelines captured in Jaeger for the pipeline presented in Figure 3.

Figure 5. Processing timelines for the pipeline with an ensemble of two models.

In Figure 5, we can observe that communication between the Seldon microservices adds fairly low latency. Depending on the message size in our pipeline, it might be in the range of 2 to 4 ms. The actual inference execution in the OpenVINO toolkit Inference Engine was about 5 ms shorter than the Jaeger results for the Predict1 and Predict2 components. The difference is due to the gRPC communication overhead mentioned previously.

The inference execution times, based on the model component logs, were:

  • approximately 28 ms for DenseNet 169
  • approximately 10 ms for ResNet 50

Those results were captured with multiple clients in parallel on every node.

We also measured that, on the gRPC client side, the pipeline execution had about 5 to 10 ms of extra latency, a consequence of the communication hops between the client, the Ambassador load balancer, and the Seldon service orchestrator. All of these small delays depend on the message size, which is also why data compression can have a positive impact on the overall latency.

In the following figure, we present an execution timeline for an alternative pipeline using a TF Serving Proxy component, which delegates inference execution to the OpenVINO model server. It uses the TensorFlow Serving API but can still take advantage of the OpenVINO toolkit model and execution optimizations.

Figure 6. Processing timelines for the pipeline with TF Serving Proxy and OpenVINO model server.

This pipeline has similar latency results, but an extra 7 to 10 ms is consumed by the additional communication hop and the Seldon API to TFS API conversion.

The advantage of such a workflow might be a possible infrastructure separation between the Seldon pipeline and the inference execution service running TensorFlow Serving or OpenVINO model server.

Scalability

Seldon scalability was tested by horizontally expanding the Kubernetes cluster in the GKE service while generating requests from multiple clients. A simple bash script was used to clone the gRPC clients submitting sequential Seldon requests:

export CLIENTS=32
# seq is used because bash brace expansion ({1..$CLIENTS}) does not expand variables
time seq 1 $CLIENTS | xargs -n 1 -P $CLIENTS python grpc_client.py

The throughput was measured from the time it took for all clients to receive a given number of responses:

throughput = (number of clients × responses per client) / total duration

Figure 7. Seldon horizontal scalability.

We observed linear scaling up to 800 requests per second (1600 predictions per second) without identifying a bottleneck. Thanks to JPEG compression, network traffic remained low despite the heavy load: we measured approximately 50 Mb/second at a throughput of about 900 images/second.

We also noticed that increasing the number of parallel clients improved system throughput. In our tests, we used four times more clients than Kubernetes pods. Beyond this ratio, the throughput increase was minimal and latency became irregular, as some requests were delayed waiting in the processing queue.

Threading Configuration

When running multiple workloads on a single node, proper threading configuration can have a significant impact on overall performance. The OpenVINO toolkit Inference Engine can be compiled with the Threading Building Blocks (TBB) or OpenMP* libraries.

In scenarios where the load is constrained to isolated CPU resources, OpenMP usually gives better results. The pipelines included in the experiments above were all based on OpenMP. It typically requires tuning with extra environment variables that define thread scheduling parameters.

In Kubernetes, it is also beneficial to constrain the CPU cores for individual containers by adding resource limits. This helps reduce context switching between threads across multiple cores.

The parameters are usually set experimentally, depending on the workload type. In the example pipeline, review the use of the following settings:

KMP_AFFINITY
KMP_BLOCKTIME
OMP_NUM_THREADS
"resources": {"requests": {"cpu": ...}, "limits": {"cpu": ...}}

In most cases, the CPU resource limit should be set to the value that gives acceptable latency. OMP_NUM_THREADS is usually set to the number of physical cores allocated to the execution (virtual cores divided by 2 when Hyper-Threading is enabled); for example, a container limited to 8 vCPUs would typically run with OMP_NUM_THREADS=4.

For OpenVINO toolkit workloads running multiple models on the same CPU cores, it might be beneficial to switch to the TBB library. In such cases, it can give better results without explicitly setting the number of threads to be used.

You can read more about the features of OpenMP versus TBB in the article, Intel® Threading Building Blocks, OpenMP, or native threads?.

Conclusions and Recommendations

As we have shown above, network and serialization overhead in Seldon can be reduced to about 5 ms for image classification data. The Intel Distribution of OpenVINO toolkit speeds up Seldon inference execution, and combining multiple models in an ensemble can boost accuracy without adding latency. Seldon can be used to effectively build complex execution pipelines with no scaling bottleneck. The examples we have included can be easily reused and adapted, and the components and base images presented in this blog can simplify adoption of the Intel Distribution of OpenVINO toolkit, OpenCV, and Intel® Distribution for Python*. We invite you to follow us on @IntelAIDev and @Seldon_IO for future updates on machine learning framework optimizations.

About the Authors

Clive Cox is the CTO at Seldon. He has a research background in computational linguistics and studied speech and language processing at Cambridge University. Clive has spent the last few years helping to develop Seldon’s platform combining devops and machine learning into MLOps.

Dariusz Trawinski is a senior software engineer at Intel. After receiving his MSc degree from Technical University in Gdansk, Poland, Dariusz gained 19 years of experience in system and data center administration, devops, and software development. In the last four years he designed and developed AI-related solutions with a focus on performance optimization and the user experience.

References

Introducing int8 Quantization for Fast CPU Inference Using OpenVINO

Develop Multiplatform Computer Vision Solutions

Seldon

Acknowledgements

Kudos to the AIBT benchmarking team in Poland for their assistance with performance testing.

Appendix

Test Configuration Details

Intel® technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit Performance Benchmark Test Disclosure.

Performance results are based on internal testing done on 16th May 2019 and may not reflect all publicly available security updates. No product can be absolutely secure. Test configurations are described below:

Following is the hardware specification for the Intel Xeon Scalable processor-based server for local Python inference execution:

BIOS
  Release Date: 07/09/2018
  Vendor: Intel Corporation
  Version: SE5C620.86B.00.01.0014.070920180847
Memory
  Mem Mode: Flat mode
  Total Memory: 376.41 GB
  Mem Node 0: 190.71 GB
  Mem Node 1: 192.0 GB
Environment
  OS: Ubuntu*-16.04-xenial
  Host Name: dkr-aipg-ra-skx-23.ra.intel.com
  Kernel Version: 3.10.0-957.1.3.el7.x86_64
  Mac: 02:42:AC:11:00:02
Disks
  0: sda CT500MX500SSD1 SSD 465.8 GB
  1: nvme0n1 INTEL SSDPE2KX020T7 SSD 1.8 TB
GPUs
  0: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
Processor
  AIBT Name: SKX_8180
  Processor Name: Intel Xeon Platinum 8180
  Microcode: 0x200004d
  Boost: ON
  HT: ON
  Sockets: 2
  Socket 0
    Core Count: 28
    Core Enabled: 28
    Thread Count: 56
    Current Speed: 2500 MHz
    Max Speed: 4000 MHz
    Version: Intel Xeon Platinum 8180 CPU @ 2.50 GHz
    Family: Xeon
    Manufacturer: Intel Corporation
    Signature: Type 0, Family 6, Model 85, Stepping 4
  Socket 1
    Core Count: 28
    Core Enabled: 28
    Thread Count: 56
    Current Speed: 2500 MHz
    Max Speed: 4000 MHz
    Version: Intel Xeon Platinum 8180 CPU @ 2.50 GHz
    Family: Xeon
    Manufacturer: Intel Corporation
    Signature: Type 0, Family 6, Model 85, Stepping 4
Memory device
  Configured Clock Speed: 2666 MHz | Manufacturer: Micron | Rank: 2 | Size: 32 GB | Speed: 2666 MHz | Type: DDR4

Following is the hardware specification for the Cascade Lake-based server for local Python inference execution:

BIOS
  Release Date: 12/07/2018
  Vendor: Intel Corporation
  Version: SE5C620.86B.0D.01.0271.120720180605
Memory
  Mem Mode: Flat mode
  Total Memory: 376.6 GB
  Mem Node 0: 187.61 GB
  Mem Node 1: 188.99 GB
Environment
  OS: Ubuntu*-16.04-xenial
  Host Name: dkr-ai1
  Kernel Version: 4.15.0-47-generic
  Mac: 02:42:AC:11:00:02
Disks
  0: nvme0n1 INTEL SSDPE2KX040T7 SSD 3.7 TB
  1: nvme2n1 INTEL SSDPE2KX040T7 SSD 3.7 TB
  2: nvme1n1 INTEL SSDPE2KX040T7 SSD 3.7 TB
  3: sda INTEL SSDSC2BA80 SSD 745.2 GB
GPUs
  0: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
Processor
  AIBT Name: CLX_8280_B0
  Processor Name: Intel Xeon Platinum 8280
  Microcode: 0x4000013
  Boost: ON
  HT: ON
  Sockets: 2
  Socket 0
    Core Count: 28
    Core Enabled: 28
    Thread Count: 56
    Current Speed: 2700 MHz
    Max Speed: 4000 MHz
    Version: Intel Xeon Platinum 8280 CPU @ 2.70 GHz
    Family: Xeon
    Manufacturer: Intel Corporation
    Signature: Type 0, Family 6, Model 85, Stepping 6
  Socket 1
    Core Count: 28
    Core Enabled: 28
    Thread Count: 56
    Current Speed: 2700 MHz
    Max Speed: 4000 MHz
    Version: Intel Xeon Platinum 8280 CPU @ 2.70 GHz
    Family: Xeon
    Manufacturer: Intel Corporation
    Signature: Type 0, Family 6, Model 85, Stepping 6
Memory device
  Configured Clock Speed: 2934 MHz | Manufacturer: Samsung | Rank: 2 | Size: 32 GB | Speed: 2933 MHz | Type: DDR4

Following is the hardware specification for GKE cluster nodes for Seldon pipeline tests:

CPU: 32 vCPUs
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel Xeon CPU @ 2.00 GHz
Stepping: 3
CPU MHz: 2000.166
BogoMIPS: 4000.33
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32 K
L1i cache: 32 K
L2 cache: 256 K
L3 cache: 56320 K
Memory capacity: 64 GB
Platform: SkyLake
Host operating system: Container-Optimized OS (cos)

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit Performance Benchmark Test Disclosure.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel® microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel® microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notices

Intel® technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting Resource & Design Center.

This sample source code is released under the Intel Sample Source Code License Agreement.

For more complete information about compiler optimizations, see our Optimization Notice.