Industrial Inspection Platform in Midea* and KUKA*: Using Distributed TensorFlow* on Analytics Zoo

Background

Industrial inspection for product defect detection is an essential part of the modern manufacturing industry. With the recent development of artificial intelligence (AI), computer vision and big data technologies, advanced industrial inspection systems can be built to achieve human level accuracy, with much higher efficiency and at a much lower cost. In this article, we will share our experience in building a deep learning-based industrial inspection platform in Midea* and KUKA* using distributed TensorFlow* on Analytics Zoo, a unified analytics and AI platform open sourced by Intel.

End-to-end Solution on Analytics Zoo

To make it easy to build and productionize deep learning applications for Big Data, Analytics Zoo provides a unified analytics and AI platform that seamlessly unites Apache Spark*, TensorFlow* and BigDL programs into an integrated pipeline; the entire pipeline can then transparently scale out to large, standard Intel® Xeon® processor-based Apache Hadoop*/Apache Spark* clusters for distributed training or inference.

[Figure: flowchart of the end-to-end industrial inspection pipeline on Analytics Zoo]

As illustrated in the figure above, the industrial inspection platform in Midea and KUKA is an end-to-end pipeline built on top of Analytics Zoo, including

  1. Processing the large number of images taken from the manufacturing pipelines in a distributed fashion using Spark.
  2. Constructing the object detection model (e.g., the SSDLite single-shot detector with a MobileNet V2 backbone) directly using the TensorFlow* Object Detection API.
  3. Training (or fine-tuning) the object detection model directly on the resilient distributed dataset (RDD) of images (preprocessed in the first step), on the Spark cluster in a distributed fashion.
  4. Evaluating (i.e., running inference with) the trained model directly on the RDD of the evaluation image set, on the Spark cluster in a distributed fashion.
  5. Serving the entire pipeline online with low latency using web services (with a plain old Java* object (POJO)-style serving API in Analytics Zoo).

During detection, an industrial robot with cameras automatically takes pictures of the product and sends the images over HTTP to the web service to detect various defects (e.g., missing labels or bolts), as illustrated below; a sketch of such a client request follows the figures.

[Figures: an industrial camera inspecting a product; a missing bolt detected on an industrial object]
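As a concrete illustration, the snippet below is a minimal, hypothetical sketch of such a camera-side client; the service URL, port and JSON payload format are assumptions and depend on how the web service is actually deployed.

<<<<
# Hypothetical camera-side client: POST a captured image to the inspection
# web service. The endpoint URL and the payload format are assumptions.
import base64
import requests

with open("product_image.jpg", "rb") as f:
    payload = {"image": base64.b64encode(f.read()).decode("utf-8")}

resp = requests.post("http://inspection-service:8080/predict", json=payload)
print(resp.json())  # e.g., detected defect classes, bounding boxes and scores
>>>>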

Putting It All Together: Apache Spark*, TensorFlow* and BigDL

As mentioned before, Analytics Zoo provides a "data-analytics integrated" deep learning programming model, so that users can easily develop end-to-end analytics and AI pipelines (using Spark, TensorFlow, Keras*, etc.), which can then transparently run on large-scale Hadoop/Spark clusters for distributed training and inference (using BigDL and Spark). In addition, users can also easily deploy the end-to-end pipeline for low latency online serving (using the POJO-style serving API provided by Analytics Zoo).

For instance, to process the training data for the defect detection pipeline in a distributed fashion, one can simply read the raw image data into an RDD using the Spark* Python* API (PySpark) and then apply a few transformations to decode the images and extract bounding boxes and class labels, as illustrated below.

<<<<
train_rdd = sc.parallelize(examples_list) \
    .map(lambda x: read_image_and_label(x)) \
    .map(lambda image: decode_to_ndarrays(image))
>>>>

Each record in the resulting RDD (train_rdd) consists of a list of Python* NumPy ndarrays (namely, image, bounding boxes, classes, and number of detected boxes), which can then be directly used in TensorFlow models for distributed training on Analytics Zoo; this is accomplished by creating a TFDataset from the resulting RDD (as shown below).

<<<<
dataset = TFDataset.from_rdd(train_rdd,
            names=["images", "bbox", "classes", "num_detections"],
            shapes=[[300, 300, 3], [None, 4], [None], [1]],
            types=[tf.float32, tf.float32, tf.int32, tf.int32],
            batch_size=BATCH_SIZE,
            hard_code_batch_size=True)
>>>>

In Analytics Zoo, TFDataset represents a distributed set of elements, in which each element contains one or more TensorFlow Tensor objects. We can then directly use these Tensors (as inputs) to build TensorFlow models; for instance, we can use the TensorFlow Object Detection API to construct an SSDLite+MobileNet V2 model (as illustrated below):

<<<<
# using tensorflow object detection api to construct model
# https://github.com/tensorflow/models/tree/master/research/object_detection
from object_detection.builders import model_builder

images, bbox, classes, num_detections = dataset.tensors

detection_model = model_builder.build(model_config, is_training=True)
resized_images, true_image_shapes = detection_model.preprocess(images)
detection_model.provide_groundtruth(bbox, classes)
prediction_dict = detection_model.predict(resized_images, true_image_shapes)
losses = detection_model.loss(prediction_dict, true_image_shapes)
total_loss = tf.add_n(list(losses.values()))
>>>>

After constructing the model, we first load a pre-trained TensorFlow model and then fine-tune it using TFOptimizer in Analytics Zoo (as illustrated below); the fine-tuned model achieves 0.97 mAP@0.5 on the validation dataset.

<<<<
with tf.Session() as sess:
    init_from_checkpoint(sess, CHECKPOINT_PATH)
    optimizer = TFOptimizer(total_loss, RMSprop(LR), sess)
    optimizer.optimize(end_trigger=MaxEpoch(20))
    save_to_new_checkpoint(sess, NEW_CHECKPOINT_PATH)
>>>>

Under the hood, the input data is read from disk and preprocessed to generate an RDD of TensorFlow Tensors using PySpark; then the TensorFlow model is trained in a distributed fashion on top of BigDL and Spark (as described in the BigDL Technical Report). The entire training pipeline can automatically scale out from a single node to a large Intel® Xeon® processor-based Hadoop/Spark cluster (without code modifications or manual configuration), as sketched below.
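The sketch below (assuming the Analytics Zoo Python API; the script name and the spark-submit options are illustrative, not taken from the original pipeline) shows how little the code itself is concerned with the cluster: the script obtains its SparkContext through init_nncontext, and the cluster resources are chosen at submission time.

<<<<
# A minimal sketch: init_nncontext creates a SparkContext with the
# BigDL/Analytics Zoo configuration, so the same pipeline code runs on a
# single node or on a YARN cluster; only the spark-submit options change.
#
#   spark-submit --master yarn --deploy-mode client \
#       --num-executors 8 --executor-cores 16 train_defect_detection.py
from zoo.common.nncontext import init_nncontext

sc = init_nncontext("defect detection training")
# ... build train_rdd, TFDataset and TFOptimizer as shown above ...
>>>>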

Once the model is trained, we can also perform large-scale, distributed evaluation and inference on Analytics Zoo using PySpark, TensorFlow and BigDL (similar to the training pipeline above).
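As an illustration, the sketch below shows what such a distributed inference pass could look like, assuming the TFDataset/TFPredictor API described in the Analytics Zoo documentation; eval_rdd, preprocess_for_eval and the checkpoint path are hypothetical placeholders rather than code from the actual pipeline.

<<<<
# A sketch of distributed inference on Analytics Zoo (TFPredictor API assumed);
# eval_rdd, preprocess_for_eval and NEW_CHECKPOINT_PATH are placeholders.
import tensorflow as tf
from zoo.pipeline.api.net import TFDataset, TFPredictor

eval_dataset = TFDataset.from_rdd(eval_rdd.map(preprocess_for_eval),
                                  names=["images"],
                                  shapes=[[300, 300, 3]],
                                  types=[tf.float32],
                                  batch_per_thread=BATCH_SIZE)

images = eval_dataset.tensors[0]
resized_images, true_image_shapes = detection_model.preprocess(images)
prediction_dict = detection_model.predict(resized_images, true_image_shapes)
# postprocess() turns the raw predictions into boxes, classes and scores
detections = detection_model.postprocess(prediction_dict, true_image_shapes)

with tf.Session() as sess:
    init_from_checkpoint(sess, NEW_CHECKPOINT_PATH)
    predictor = TFPredictor(sess, [detections["detection_boxes"],
                                   detections["detection_classes"],
                                   detections["detection_scores"]])
    predictions_rdd = predictor.predict()  # one prediction per input image
>>>>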

Low Latency, Online Serving

The inference pipeline can also be easily deployed for low latency, online serving (in, for instance, web services, Apache Storm*, Apache Flink*, etc.) using the POJO-style serving API provided by Analytics Zoo (illustrated below). You may refer to the inference programming guide for more details.

<<<<
AbstractInferenceModel model = new AbstractInferenceModel(){};
model.loadTF(modelPath, 0, 0, false);
List<List<JTensor>> output = model.predict(inputs);
>>>>

Conclusion

By combining artificial intelligence, computer vision and big data technologies, Midea and KUKA have successfully built an advanced industrial inspection system on top of Analytics Zoo, which can automatically detect various product defects using industrial robots, cameras, Intel® Xeon® platforms, etc. In particular, Analytics Zoo provides a unified analytics and AI platform that seamlessly unites Spark, BigDL and TensorFlow programs into an integrated pipeline, which makes it easy to build and productionize deep learning applications for Big Data (including distributed training and inference, as well as low latency online serving); you may refer to the online documentation for more details.
