Scaling Object Detection with Kubernetes* and Kubeflow

Introduction

Object detection is a key task in computer vision that involves localizing and categorizing objects, such as pedestrians or cars, in images or video. Advances in Machine Learning have led to significant breakthroughs in object detection applications, ranging from detecting faces in popular social networks to perception in autonomous vehicles. In the past, training and deploying these models at scale required significant manual effort and hand-rolled solutions. However, with the rise of cloud platforms and orchestrators like Kubernetes* and Kubeflow, training, serving, and scaling Machine Learning models has become less complex and more declarative.

Kubernetes* and Kubeflow

Kubernetes is a container orchestration platform built to ease the deployment and management of containerized applications at scale.

Kubeflow is a Machine Learning toolkit that runs on top of Kubernetes. Its mission is to simplify the deployment of Machine Learning workflows by providing a scalable and extensible stack of services that can run in diverse environments.

Based on the TensorFlow* Pets tutorial for detecting various breeds of cats and dogs, we will go through the steps of preparing the data, running a distributed object detection training job, and serving the resulting model.

Prerequisites

Step 1: Setup Kubeflow

Refer to the getting started guide for instructions on how to set up Kubeflow on your Kubernetes cluster. Specifically, look at the quick-start guide on deploying Kubeflow. For this example, we will be using Kubeflow version v0.2.2 and a ksonnet default environment. If you plan to use a cloud ksonnet environment, please make sure you follow the corresponding instructions in the Kubeflow getting started guide.

After completing the steps in the Kubeflow user guide, you will have the following:

  • A ksonnet app directory called kubeflow_ks_app
  • A new namespace in your K8s cluster called kubeflow
  • The following pods in your Kubernetes cluster under the kubeflow namespace:
kubectl -n kubeflow get pods
NAME                                        READY     STATUS    RESTARTS   AGE
ambassador-7987df44b9-4pht8                 2/2       Running   0          1m
ambassador-7987df44b9-dh5h6                 2/2       Running   0          1m
ambassador-7987df44b9-qrgsm                 2/2       Running   0          1m
tf-hub-0                                    1/1       Running   0          1m
tf-job-operator-v1alpha2-75bcb7f5f7-wvpdd   1/1       Running   0          6d
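As a quick sanity check, you can also confirm that your ksonnet environment exists and that the TFJob custom resource is registered. The commands below are a small verification sketch; the exact CRD name depends on your Kubeflow version (v0.2 ships the v1alpha2 operator).

cd kubeflow_ks_app
#List the ksonnet environments; you should see the "default" environment
ks env list
#Confirm the TFJob CRD is installed (name may vary by Kubeflow version)
kubectl get crd | grep kubeflow.org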

Step 2: Prepare the training dataset

At its core, this step uses Kubernetes Jobs to create a persistent volume and copy data to it. Rather than writing these from scratch, we will use the components of the ksonnet app provided in the kubeflow/examples repository.

First, we need to clone the kubeflow/examples repository:

git clone https://github.com/kubeflow/examples.git
cd examples/object_detection/ks-app

In the ks-app directory you will find a components directory, which contains a set of previously created ksonnet components that you can customize if needed. These components have a set of pre-configured arguments that can be found in the components/params.libsonnet file.
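Before overriding anything, you can inspect a component's current parameter values with ksonnet, for example:

#List the parameters of the get-data-job component
ks param list get-data-job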

Now let's set up our environment and create the persistent volume where our dataset and the configuration for the training pipeline will be stored:

ENV=default
ks param set pets-pvc accessMode "ReadWriteMany"
ks param set pets-pvc storage "20Gi"
ks apply ${ENV} -c pets-pvc
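You can verify that the claim was created and bound before moving on:

#The STATUS column should eventually show "Bound"
kubectl -n kubeflow get pvc pets-pvc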

After creating our persistent volume claim, the next step is to fetch our training dataset, annotations, pipeline configuration file, and a pre-trained model checkpoint so we don't have to start training from scratch.

#The name of the pvc we just created
PVC="pets-pvc" 

#The root mount path that will be used by the containers 
MOUNT_PATH="/pets_data"

#The remote URL of the dataset we will be using
DATASET_URL="http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz"

#The remote URL of the dataset annotations 
ANNOTATIONS_URL="http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz"

#The remote URL to the pre-trained model we will be using as training start point
MODEL_URL="http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_coco_2018_01_28.tar.gz"

#The remote url to the training pipeline config file 
PIPELINE_CONFIG_URL="https://raw.githubusercontent.com/kubeflow/examples/master/object_detection/conf/faster_rcnn_resnet101_pets.config"

# Setting up the arguments for our get-data-job component
ks param set get-data-job mountPath ${MOUNT_PATH}
ks param set get-data-job pvc ${PVC}
ks param set get-data-job urlData ${DATASET_URL}
ks param set get-data-job urlAnnotations ${ANNOTATIONS_URL}
ks param set get-data-job urlModel ${MODEL_URL}
ks param set get-data-job urlPipelineConfig ${PIPELINE_CONFIG_URL}

#Applying the component to the K8s cluster
ks apply ${ENV} -c get-data-job

The get-data-job component will create four Kubernetes batch jobs that will download the data from the internet and store it in the pets-pvc volume. All of these jobs need to complete before moving to the next step. To make sure the jobs have completed, run:

kubectl -n kubeflow get pods | grep get-data-job

And make sure all the listed pods whose name starts with "get-data-job" are in "Completed" status.
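Alternatively, you can check the Jobs themselves and confirm each one reports a successful completion (this assumes the Job names are prefixed with the component name, as the pod names are):

kubectl -n kubeflow get jobs | grep get-data-job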

The jobs we just executed downloaded a few compressed files; the next step is to decompress them so they can be used. For that, we will use the decompress-data-job component in our ks-app.

ANNOTATIONS_PATH="${MOUNT_PATH}/annotations.tar.gz"
DATASET_PATH="${MOUNT_PATH}/images.tar.gz"
PRE_TRAINED_MODEL_PATH="${MOUNT_PATH}/faster_rcnn_resnet101_coco_2018_01_28.tar.gz"

ks param set decompress-data-job mountPath ${MOUNT_PATH}
ks param set decompress-data-job pvc ${PVC}
ks param set decompress-data-job pathToAnnotations ${ANNOTATIONS_PATH}
ks param set decompress-data-job pathToDataset ${DATASET_PATH}
ks param set decompress-data-job pathToModel ${PRE_TRAINED_MODEL_PATH}

ks apply ${ENV} -c decompress-data-job

The decompress-data-job component will create a set of Kubernetes Jobs that will decompress the files we just downloaded. As with the previous set of jobs, make sure the decompress-data-job jobs have completed before moving to the next step. You can run:

kubectl -n kubeflow get pods | grep decompress-data-job

And make sure the pods are in "Completed" status.

Finally, since the TensorFlow Object Detection API uses the TFRecord format, we need to create the pet TFRecords. For that, we will configure and apply the create-pet-record-job component:

#The docker image to use
OBJ_DETECTION_IMAGE="lcastell/pets_object_detection"

#The path to our training dataset
DATA_DIR_PATH="${MOUNT_PATH}/images"

#The output directory where the pet records will be stored
OUTPUT_DIR_PATH="${MOUNT_PATH}"

# setting the parameters to our component
ks param set create-pet-record-job image ${OBJ_DETECTION_IMAGE}
ks param set create-pet-record-job dataDirPath ${DATA_DIR_PATH}
ks param set create-pet-record-job outputDirPath ${OUTPUT_DIR_PATH}
ks param set create-pet-record-job mountPath ${MOUNT_PATH}
ks param set create-pet-record-job pvc ${PVC}

ks apply ${ENV} -c create-pet-record-job

The command above will create a batch job that generates the pet TFRecords and writes them into the ${MOUNT_PATH} directory we defined at the beginning of the tutorial.
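As with the earlier jobs, wait for this one to finish before moving on, and check its logs if something looks off:

kubectl -n kubeflow get pods | grep create-pet-record-job
#Inspect the logs of the record-creation pod if needed
kubectl -n kubeflow logs <name_of_create_pet_record_pod>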

Now that we are done with data preparation, let's launch the training job.

Step 3: Launch a distributed object detection training job

Distributed training speeds up the training of large Machine Learning models by partitioning work across multiple machines. During distributed training, TensorFlow uses workers to perform the bulk of the computation, parameter servers to store the model parameters, and a master to coordinate the training process. A rule of thumb for resource allocation during training is to allocate two workers per CPU socket. More information on boosting training performance on Intel® Xeon® is available here.

To launch a distributed training job, we first configure the tf-training-job component in our ksonnet app. For this example, we will be using one worker and one parameter server. To scale your job, you can simply change the number of workers and/or parameter servers:

#The path to our training pipeline configuration file, which is stored in the pets-pvc volume
PIPELINE_CONFIG_PATH="${MOUNT_PATH}/faster_rcnn_resnet101_pets.config"

#The directory where our training job will save its checkpoints
TRAINING_DIR="${MOUNT_PATH}/train"

ks param set tf-training-job image ${OBJ_DETECTION_IMAGE}
ks param set tf-training-job mountPath ${MOUNT_PATH}
ks param set tf-training-job pvc ${PVC}
ks param set tf-training-job numPs 1
ks param set tf-training-job numWorkers 1
ks param set tf-training-job pipelineConfigPath ${PIPELINE_CONFIG_PATH}
ks param set tf-training-job trainDir ${TRAINING_DIR}

#Submit the training job
ks apply ${ENV} -c tf-training-job

And with that, we have submitted the object detection training job. Now you can start monitoring it.

Step 4: Monitor your job

To view your tf-job status, execute:

kubectl -n kubeflow describe tfjobs tf-training-job
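The TFJob operator creates one pod per replica (a master plus the workers and parameter servers you configured), and the replica type appears in the pod name, as in the master pod shown later in this step. You can list the replica pods with:

kubectl -n kubeflow get pods | grep tf-training-job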

To view individual logs for each pod, execute:

kubectl -n kubeflow get pods -a
kubectl -n kubeflow logs <name_of_master_pod>
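To stream the logs as training progresses, add the -f flag:

kubectl -n kubeflow logs -f <name_of_master_pod>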

While the job is still running, you should see output similar to:

INFO:tensorflow:Saving checkpoint to path /pets_data/train/model.ckpt
INFO:tensorflow:Recording summary at step 819.
INFO:tensorflow:global step 819: loss = 0.8603 (19.898 sec/step)
INFO:tensorflow:global step 822: loss = 1.9421 (18.507 sec/step)
INFO:tensorflow:global step 825: loss = 0.7147 (17.088 sec/step)
INFO:tensorflow:global step 828: loss = 1.7722 (18.033 sec/step)
INFO:tensorflow:global step 831: loss = 1.3933 (17.739 sec/step)
INFO:tensorflow:global step 834: loss = 0.2307 (16.493 sec/step)
INFO:tensorflow:Recording summary at step 839

And once the job finishes, you should see:

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /pets_data/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 200006.
INFO:tensorflow:global step 200006: loss = 0.0091 (9.854 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

Now you have a trained model. You can find it at /pets_data/train inside the pets-pvc persistent volume.

Step 5: Export the TensorFlow* graph

Before serving the model and exporting the graph, we first need to identify a checkpoint candidate in the pets-pvc persistent volume under the ${MOUNT_PATH}/train directory.

To see what's being saved in ${MOUNT_PATH}/train, you can use:

kubectl -n kubeflow exec -it tf-training-job-master-r1hv-0-i6k7c sh

Note: The command above will only work if the job is still running.

This will open an interactive shell in your container; you can then run ls /pets_data/train and look for a checkpoint candidate.
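For example, inside the container you can list the most recent checkpoints first:

#Newest checkpoint files appear at the top
ls -1t /pets_data/train/model.ckpt-* | head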

Once you have identified the checkpoint, the next step is to configure it in the export-tf-graph-job component and apply the job:

#The path to the checkpoint candidate (replace <number> with your checkpoint number)
CHECKPOINT="${TRAINING_DIR}/model.ckpt-<number>" 

#The model input type
INPUT_TYPE="image_tensor"

#The directory where our exported model graph will be stored
EXPORT_OUTPUT_DIR="${MOUNT_PATH}/exported_graphs"

ks param set export-tf-graph-job mountPath ${MOUNT_PATH}
ks param set export-tf-graph-job pvc ${PVC}
ks param set export-tf-graph-job image ${OBJ_DETECTION_IMAGE}
ks param set export-tf-graph-job pipelineConfigPath ${PIPELINE_CONFIG_PATH}
ks param set export-tf-graph-job trainedCheckpoint ${CHECKPOINT}
ks param set export-tf-graph-job outputDir ${EXPORT_OUTPUT_DIR}
ks param set export-tf-graph-job inputType ${INPUT_TYPE}

Now let's apply the job:

ks apply ${ENV} -c export-tf-graph-job

Once the job has completed, a new directory called exported_graphs, containing the saved model and the frozen graph, will be created under /pets_data in the pets-pvc persistent volume.

Before serving the model, we need to perform a quick hack, since the object detection export Python* API does not generate a "version" folder for the saved model. This hack involves creating a version directory and copying the saved model files into it.

One way of doing this is to open an interactive shell in one of your running containers and copy the data yourself:

kubectl -n kubeflow exec -it tf-training-job-master-r1hv-0-i6k7c sh
mkdir /pets_data/exported_graphs/saved_model/1  
cp /pets_data/exported_graphs/saved_model/* /pets_data/exported_graphs/saved_model/1
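You can then check that the version folder contains the saved_model.pb file TensorFlow Serving expects to load:

ls /pets_data/exported_graphs/saved_model/1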

Step 6: Serve your model using TensorFlow Serving

Configure and apply the pets-model component in our ksonnet app:

MODEL_PATH=/mnt/exported_graphs/saved_model
MODEL_STORAGE_TYPE=nfs
NFS_PVC_NAME=pets-pvc

ks param set pets-model modelPath ${MODEL_PATH}
ks param set pets-model modelStorageType ${MODEL_STORAGE_TYPE}
ks param set pets-model nfsPVC ${NFS_PVC_NAME}

ks apply ${ENV} -c pets-model

After applying the pets-model component you should be able to see the "pets-model" pod:

kubectl -n kubeflow get pods | grep pets-model

That will output:

pets-model-v1-57674c8f76-4qrqp 1/1 Running 0 4h

Take a look at the pod logs:

kubectl -n kubeflow logs pets-model-v1-57674c8f76-4qrqp

And you should see:

2018-06-21 19:20:32.325406: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: pets-model version: 1}  
E0621 19:20:34.134165172       7 ev_epoll1_linux.c:1051]     grpc epoll fd: 3  
2018-06-21 19:20:34.135354: I tensorflow_serving/model_servers/main.cc:288] Running ModelServer at 0.0.0.0:9000 ...  

Step 7: Running inference using your model

We will use the TensorFlow Serving API and an object detection Python* client to run inference against our trained model.

First, we need to install the dependencies (Ubuntu* 16.04):

sudo apt-get install protobuf-compiler python-pil python-lxml python-tk
pip install tensorflow
pip install matplotlib
pip install tensorflow-serving-api
pip install numpy
pip install grpcio

After installing the dependencies, we need to clone the TensorFlow models repository, since we will be using some of its object detection modules.

#From your $HOME directory
git clone https://github.com/tensorflow/models.git
cd models/research

The TensorFlow Object Detection API uses Protobufs to configure model and training parameters. Before the framework can be used, the Protobuf libraries must be compiled. This should be done by running the following command from the TensorFlow models/research directory.

# from models/research/
protoc object_detection/protos/*.proto --python_out=.

Add the TensorFlow models research and slim directories to your PYTHONPATH:

export PYTHONPATH=${PYTHONPATH}:${HOME}:${HOME}/models/research:${HOME}/models/research/slim
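A quick way to confirm the paths are set correctly is to import one of the object detection modules:

#This should exit silently if PYTHONPATH and the compiled protos are set up correctly
python -c "from object_detection.utils import label_map_util"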

Now we need to port-forward to our model server. In a different terminal session, run:

kubectl -n kubeflow port-forward pets-model-v1-57674c8f76-4qrqp 9000:9000

The final step is to run the object detection client. We have prepared a Python client script that you can find in this gist. Download the object_detection_client.py and utils.py files to your $HOME directory and run the client like this:

python object_detection_client.py \
--server=localhost:9000 \
--input_image=data/path/to/some/pet/image.jpg \
--output_directory=. \
--label_map=models/research/object_detection/data/pet_label_map.pbtxt \
--model_name=pets-model \
--input_type=image_tensor

After that, you should see an image file in your $HOME directory with the results of the inference.

Summary

In this article, we learned how to leverage current cloud-based platforms and tools to prepare a dataset, train and serve an object detection model, and run inference against the served model. These steps can be replicated to train other models and run further experiments.

About the Authors

Daniel Castellanos is a senior software engineer in the Automotive Solutions Group at Intel. He helps deliver the next generation of cloud-based Machine Learning and AI platforms for the autonomous vehicles data center. In his spare time, Daniel enjoys contributing to the latest cloud-based open-source projects and reading about the latest trends in tech. He also enjoys playing guitar and video games.

Soila Kavulya is a senior research scientist in the Automotive Solutions Group at Intel. Her interests include distributed systems, big data analytics, and machine learning. She holds a PhD from Carnegie Mellon University where she specialized in fault-tolerant distributed systems. She is also passionate about open-source projects, and contributes to the Apache Spark* and Apache Hadoop* libraries in the TensorFlow ecosystem.

Related Resources

Let's Flow Within Kubeflow
