Model Downloader Essentials

There are lots of useful features included with the Intel® Distribution of OpenVINO™ toolkit for computer vision. One of the key utilities is known as the "Model Downloader", which provides a command line interface for downloading various open-source pre-trained Deep Neural Network (DNN) models. This article explains how the "Model Downloader" works, which models you can download with this tool, and how you can use a downloaded model with the Intel® Distribution of OpenVINO™ toolkit and OpenCV to create computer vision applications.

About the Model Downloader

As already mentioned, the "Model Downloader" provides developers with a command line interface for downloading various publicly available open-source pre-trained Deep Neural Network (DNN) models covering a variety of problem domains. The "Model Downloader" contains a collection of pre-trained models along with the download location for each one, so you do not have to go searching to find commonly used models. Note that although the "Model Downloader" is written in Python*, you can use the models that you download with any of the programming languages supported by the Intel® Distribution of OpenVINO™ toolkit.

How to use

The "Model Downloader" code is installed automatically when you install the Intel® Distribution of OpenVINO™ toolkit itself. To run it, you first need to go to the deployment_tools/model_downloader/ folder inside your Intel® Distribution of OpenVINO™ toolkit install folder.

If the Intel® Distribution of OpenVINO™ toolkit was installed using the default installation paths, you can go to the model downloader folder by running:

cd /opt/intel/computer_vision_sdk/deployment_tools/model_downloader/

If you have chosen the system-wide installation of the Intel® Distribution of OpenVINO™ toolkit, you will need to use sudo to run the command. To download a model, for example the densenet-121 model, you use the --name flag, as follows:

sudo ./downloader.py --name densenet-121

This command will download the densenet-121 model files from the internet, and will output the details about where the files have been downloaded to:

###############|| Start downloading models ||###############

...100%, 74 KB, 62336 KB/s, 0 seconds passed ========= densenet-121.prototxt ====> /opt/intel/computer_vision_sdk_2018.3.343/deployment_tools/model_downloader/classification/densenet/121/caffe/densenet-121.prototxt

###############|| Start downloading weights ||###############

...100%, 31546 KB, 16704 KB/s, 1 seconds passed ========= densenet-121.caffemodel ====> /opt/intel/computer_vision_sdk_2018.3.343/deployment_tools/model_downloader/classification/densenet/121/caffe/densenet-121.caffemodel

###############|| Start downloading topologies in tarballs ||###############


###############|| Post processing ||###############

As you can see, the "Model Downloader" has downloaded both the pre-trained model data (densenet-121.caffemodel) and also the model configuration file (densenet-121.prototxt). Now you can copy these files to whatever directory you wish to use them from.
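
For instance, here is a minimal sketch of loading these two files from Python using OpenCV's dnn module, which can read Caffe models directly; the paths assume you have copied the files into your working directory:

import cv2

# Load the network definition (.prototxt) and the pre-trained
# weights (.caffemodel) downloaded by the "Model Downloader".
net = cv2.dnn.readNetFromCaffe("densenet-121.prototxt",
                               "densenet-121.caffemodel")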

Object classification output format

The typical model output for an object classifier consists of one value for each possible classification. The blob shape is expressed as a multi-dimensional array of [1, 1, N, 1], where N is the number of classifications, and each value is the confidence for the corresponding classification. The largest value in the blob therefore identifies the classification the model is most confident about.

Note that the classification descriptions themselves are normally stored in a separate text file, with one entry per classification. Be sure to use the classification descriptions file that matches the classifier you are using.
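
As an illustration, here is a hedged sketch in Python of reading the top classification from such a blob; "labels.txt" is a hypothetical descriptions file with one entry per line, matching the classifier in use:

import numpy as np

def top_classification(out_blob, labels_path="labels.txt"):
    # Flatten the [1, 1, N, 1] blob into a vector of N confidences.
    scores = out_blob.reshape(-1)
    class_id = int(np.argmax(scores))      # index of the highest confidence
    # "labels.txt" is a hypothetical file of class descriptions,
    # one per line, in the same order as the classifier's outputs.
    with open(labels_path) as f:
        labels = [line.strip() for line in f]
    return labels[class_id], float(scores[class_id])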

Object detection output format

The most common output format when using a DNN for object detection is a blob that contains one row of seven values for each detected item. The blob shape is expressed as a multi-dimensional array of [1, 1, N, 7], where N is the number of detected items.

Each detected item is returned as a seven-value array with the information for that item. The values returned for each individual detected object have the following format:

[image_id, label, conf, x_min, y_min, x_max, y_max]
  • image_id - ID of the image in the batch
  • label - predicted class ID
  • conf - confidence for the predicted class
  • (x_min, y_min) - coordinates of the top left bounding box corner
  • (x_max, y_max) - coordinates of the bottom right bounding box corner.

Note that each of these values is returned as a floating point number, so values such as image_id and label will probably require conversion to an integer type in order to be useful.

For example, if a detector is able to locate 2 items, the data returned will look something like this:

[[0.0, 0.0, 0.75, 25.0, 30.0, 75.0, 45.0], [1.0, 0.0, 0.83, 150.0, 45.0, 210.0, 65.0]]
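
Here is a sketch of how you might unpack such a blob in Python; the coordinate handling assumes pixel values, as in the example data above (some detectors instead return coordinates normalized to the 0..1 range):

def parse_detections(out_blob, min_confidence=0.5):
    results = []
    # Reshape the [1, 1, N, 7] blob into one seven-value row per item.
    for detection in out_blob.reshape(-1, 7):
        image_id, label, conf, x_min, y_min, x_max, y_max = detection
        if conf < min_confidence:
            continue
        # Convert the floating point values into more useful types.
        results.append((int(image_id), int(label), float(conf),
                        int(x_min), int(y_min), int(x_max), int(y_max)))
    return results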

Available models

There are a number of different models available:

  • densenet-121
  • densenet-161
  • densenet-169
  • densenet-201
  • squeezenet1.0
  • squeezenet1.1
  • mtcnn-p
  • mtcnn-r
  • mtcnn-o
  • mobilenet-ssd
  • vgg19
  • vgg16
  • ssd512
  • ssd300
  • inception-resnet-v2
  • dilation
  • googlenet-v1
  • googlenet-v2
  • googlenet-v4
  • alexnet
  • ssd_mobilenet_v2_coco

Here is a brief summary of each, along with some notes on how to use it.

densenet-121

The "densenet-121" model is one of the "DenseNet" group of models designed to perform image classification. Originally trained on Torch, the authors converted them into Caffe* format. All the DenseNet models have been pretrained on the ImageNet image database. For details about this family of models, check out the repository. 

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order. The BGR mean values need to be subtracted as follows: [103.94, 116.78, 123.68] before passing the image blob into the network. In addition, values must be scaled by 0.017.

The model output for "densenet-121" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

densenet-161

The "densenet-161" model is also one of the "DenseNet" group of models designed to perform image classification. The main difference with the "densenet-121" model is the size and accuracy of the model. The "densenet-161" is much larger at 100MB in size vs the "densenet-121" model's roughly 31MB size.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order, just like the "densenet-121" model. The BGR mean values need to be subtracted as follows: [103.94, 116.78, 123.68] before passing the image blob into the network. In addition, values must be scaled by 0.017.

The model output for "densenet-161" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

densenet-169

The "densenet-169" model is also one of the "DenseNet" group of models designed to perform image classification. Again, the main difference with the "densenet-121" model is the size and accuracy of the model. The "densenet-169" is larger at just about 55MB in size vs the "densenet-121" model's roughly 31MB size.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order, also like the "densenet-121" model. The BGR mean values need to be subtracted as follows: [103.94, 116.78, 123.68] before passing the image blob into the network. In addition, values must be scaled by 0.017.

The model output for "densenet-169" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

densenet-201

The "densenet-201" model is the last of the "DenseNet" group of models designed to perform image classification. Just like the other variations the main difference with the "densenet-121" model is the size and accuracy of the model. The "densenet-201" is larger at over 77MB in size vs the "densenet-121" model's roughly 31MB size.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order, also like the "densenet-121" model. The BGR mean values need to be subtracted as follows: [103.94, 116.78, 123.68] before passing the image blob into the network. In addition, values must be scaled by 0.017.

The model output for "densenet-201" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

squeezenet1.0

The "squeezenet1.0" model is one of the "SqueezeNet" topology models. Like the DenseNet pre-trained models, it is also designed to perform image classification. The SqueezeNet models have been pre-trained on the ImageNet image database. For details about this family of models, check out the repository.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order. The BGR mean values need to be subtracted as follows: [104, 117, 123] before passing the image blob into the network.

The model output for "squeezenet1.0" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

squeezenet1.1

The "squeezenet1.1" model is an updated version of the "SqueezeNet" topology. It requires 2.4x less computation than SqueezeNet v1.0 without diminishing accuracy.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order. The BGR mean values need to be subtracted as follows: [104, 117, 123] before passing the image blob into the network.

The model output for "squeezenet1.1" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

mtcnn-p

The "mtcnn-p" model is one of the "mtcnn" group of models designed to perform face detection. Short for "Multi-task Cascaded Convolutional Neural Network", it is implemented using the Caffe framework. The "p" designation indicates that this model is the "proposal" network intended to find the initial set of faces. For details about this family of models, check out the repository.

The model input is an image containing the data to be analyzed.

The model output is a blob with a vector containing the first pass of face data. If there are no faces detected, no further processing is needed. Otherwise, you will typically use this output as input to the "mtcnn-r" model.

mtcnn-r

The "mtcnn-r" model is one of the "mtcnn" group of models designed to perform face detection. The "r" designation indicates that this model is the "refine" network intended to refine the data returned as output from the "proposal" network described above.

The model input is a blob with a vector containing the first pass of face data, as returned by the "mtcnn-p" model.

The model output is a blob with a vector containing the refined face data. If there are no faces detected by the refine pass, no further processing is needed. Otherwise, you will typically use this output as input to the "mtcnn-o" model.

mtcnn-o

The "mtcnn-o" model is the third of the "mtcnn" group of models designed to perform face detection. The "o" designation indicates that this model is the "output" network intended to take the data returned from the "refine" network described above, and transform it into the final output data.

The model input is a blob with a vector containing the refined face data, as returned by the "mtcnn-r" model.

The model output is a blob with a vector containing the output face data.
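
To make the three-stage flow concrete, here is a high-level sketch of chaining the networks together; run_pnet, run_rnet and run_onet are hypothetical helpers that each wrap inference on the corresponding network:

# run_pnet, run_rnet and run_onet are hypothetical helpers wrapping
# inference on the "mtcnn-p", "mtcnn-r" and "mtcnn-o" networks.
def detect_faces(image):
    candidates = run_pnet(image)            # proposal: initial face candidates
    if not candidates:
        return []                           # no faces found, stop early
    refined = run_rnet(image, candidates)   # refine: filter and adjust candidates
    if not refined:
        return []
    return run_onet(image, refined)         # output: final face data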

mobilenet-ssd

The "mobilenet-ssd" model is a Single-Shot multibox Detection (SSD) network intended to perform object detection. This model is implemented using the Caffe framework. For details about this model, check out the repository.

The model input is a blob that consists of a single image of "1x3x300x300" in BGR order, also like the "densenet-121" model. The BGR mean values need to be subtracted as follows: [127.5, 127.5, 127.5] before passing the image blob into the network. In addition, values must be scaled by 0.007843.

The model output is a typical vector containing the detected object data, as previously described.
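
Here is a sketch of a complete detection pass with "mobilenet-ssd" using OpenCV's dnn module; the file names are placeholders for wherever you copied the downloaded files:

import cv2

net = cv2.dnn.readNetFromCaffe("mobilenet-ssd.prototxt",
                               "mobilenet-ssd.caffemodel")
image = cv2.imread("input.jpg")              # placeholder path
blob = cv2.dnn.blobFromImage(image, scalefactor=0.007843, size=(300, 300),
                             mean=(127.5, 127.5, 127.5), swapRB=False)
net.setInput(blob)
detections = net.forward()                   # [1, 1, N, 7] blob, parsed as above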

vgg19

The "vgg19" model is one of the "vgg" models designed to perform image classification in Caffe format. For details about this model, check out the paper.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order. The BGR mean values need to be subtracted as follows: [103.939, 116.779, 123.68] before passing the image blob into the network.

The model output for "vgg19" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

vgg16

The "vgg16" model is another of the "vgg" models designed to perform image classification in Caffe format. For details about this model, check out the paper.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order. The BGR mean values need to be subtracted as follows: [103.939, 116.779, 123.68] before passing the image blob into the network.

The model output for "vgg16" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

ssd512

The "ssd512" model is a Single-Shot multibox Detection (SSD) network intended to perform face detection. This model is implemented using the Caffe framework. For details about this model, check out the repository.

The model input is a blob that consists of a single image of "1x3x512x512" in BGR order.

The model output is a typical vector containing the detected object data, as previously described.

ssd300

The "ssd300" model is a Single-Shot multibox Detection (SSD) network intended to perform face detection. This model is implemented using the Caffe framework. For details about this model, check out the repository at https://github.com/weiliu89/caffe/tree/ssd

The model input is a blob that consists of a single image of "1x3x300x300" in BGR order.

The model output is a typical vector containing the detected object data, as previously described.

inception-resnet-v2

The "inception-resnet-v2" model is one of the "Inception" family of models designed to perform image classification.1 Like the other Inception models, the "inception-resnet-v2" model has been pretrained on the ImageNet image database. For details about this family of models, check out the paper.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order.

The model output for "inception-resnet-v2" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

dilation

The "dilation" model is designed to perform semantic segmentation to label or categorize each pixel in an image. For details about this model, check out the repository.

The model input is a blob that consists of a single image of "1x3x1396x1396" in BGR order.

The model output is a blob with the predicted values for each pixel in the input image.
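
As a sketch, assuming the output blob holds one channel of per-pixel scores for each class in the shape [1, C, H, W] (a typical layout for semantic segmentation networks), the predicted label map can be recovered with a per-pixel argmax:

import numpy as np

def to_label_map(out_blob):
    scores = out_blob[0]                 # drop the batch dimension: [C, H, W]
    return np.argmax(scores, axis=0)     # per-pixel index of the best class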

googlenet-v1

The "googlenet-v1" model is the first of the "Inception" family of models designed to perform image classification. Like the other Inception models, the "googlenet-v1" model has been pretrained on the ImageNet image database. For details about this family of models, check out the paper.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order.

The model output for "googlenet-v1" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

googlenet-v2

The "googlenet-v2" model is the second of the "Inception" family of models designed to perform image classification. Like all of the other Inception models, the "googlenet-v2" model has been pretrained on the ImageNet image database. For details about this family of models, check out the paper.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order.

The model output for "googlenet-v2" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

googlenet-v4

The "googlenet-v4" model is the most recent of the "Inception" family of models designed to perform image classification. Just like the other Inception models, the "googlenet-v4" model has been pretrained on the ImageNet image database. For details about this family of models, check out the paper.

The model input is a blob that consists of a single image of "1x3x224x224" in BGR order.

The model output for "googlenet-v4" is the typical object classifier output for the 1000 different classifications matching those in the ImageNet database.

alexnet

The "alexnet" model is designed to perform image classification. Just like other common classification models, the "alexnet" model has been pretrained on the ImageNet image database. For details about this model, check out the paper.

The model input is a blob that consists of a single image of "1x3x227x227" in BGR order.

The model output for "alexnet" is the usual object classifier output for the 1000 different classifications matching those in the ImageNet database.

ssd_mobilenet_v2_coco

The "ssd_mobilenet_v2_coco" model is a Single-Shot multibox Detection (SSD) network intended to perform object detection. The differnce bewteen this model and the "mobilenet-ssd" described previously is that there the "mobilenet-ssd" can only detect face, the "ssd_mobilenet_v2_coco" model can detect objects as it has been trained from the Common Objects in COntext (COCO) image dataset. For details about this model, check out the paper.

The model input is a blob that consists of a single image of "1x3x300x300" in BGR order.

The model output is a typical vector containing the detected object data, as previously described. Note that the "label" (predicted class ID) value is now significant and should be used to determine the classification for any detected object.
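
For example, here is a hedged sketch of mapping the label value from each detection row to a class name; "coco_labels.txt" is a hypothetical file with one class name per line, ordered to match the model's class IDs:

def label_name(label_id, labels_path="coco_labels.txt"):
    # "coco_labels.txt" is a hypothetical descriptions file; make sure
    # the line order matches the class IDs produced by the model.
    with open(labels_path) as f:
        labels = [line.strip() for line in f]
    return labels[int(label_id)]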

Conclusion

There are many pre-trained DNN models that are easy to obtain via the Intel® Distribution of OpenVINO™ toolkit "Model Downloader", and we have now looked at some of the possibilities and how to use them. Using one of these open-source models with the Intel® Distribution of OpenVINO™ toolkit can really help speed up the development of computer vision applications.
