OpenMP Threads - BVLC AlexNet vs Intel AlexNet Timing

Hi All,

I am comparing the performance of "models/bvlc_alexnet/train_val.prototxt" under BVLC Caffe and Intel Caffe. Although Intel Caffe has much better performance, I fail to understand why, when I run the same model first with BVLC Caffe and then with Intel Caffe, I see the number of threads being 64 in both cases.

As far as I know, BVLC Caffe is not supposed to make use of OpenMP threads, and that is where Intel Caffe has its edge. Is that right?

Thanks.

Chetan Arvind Patil

Dear Chetan,

I think you have tried running both Intel Caffe and BVLC Caffe with OpenMP enabled (by building Caffe with OpenMP support). Is that correct?

Could you kindly share the commands you use to train/benchmark your caffe?

Thanks

Anand

Hi Anand,

BVLC Caffe was built following the steps given on its website, with only one change in Makefile.config: BLAS := mkl.

Command to benchmark AlexNet for both BVLC Caffe and Intel Caffe:

./build/tools/caffe time --model=models/bvlc_alexnet/train_val.prototxt

Please note that the caffe binaries for BVLC and Intel Caffe are different, and the above command was executed in each respective build folder. Also, I do see Intel Caffe outperforming BVLC. I am just not sure why the number of threads for BVLC is 64.
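For reference, this is how I am counting the threads (a rough check; the pgrep lookup assumes only one caffe process is running):

```shell
# In a second terminal, while "caffe time" is executing,
# look up the newest caffe process and read its thread count
# from the kernel's status file.
CAFFE_PID=$(pgrep -n caffe)
grep '^Threads:' /proc/"$CAFFE_PID"/status   # e.g. "Threads: 64"

# Alternatively, ask ps for the number of lightweight processes:
ps -o nlwp= -p "$CAFFE_PID"
```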

Thanks.

Chetan Arvind Patil

Dear Chetan,

For Intel Caffe please use

numactl -m 1 ./build/tools/caffe time --model=models/bvlc_alexnet/train_val.prototxt --engine=MKL2017
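You can first check how MCDRAM is exposed on your system (a sketch; on Xeon Phi in flat mode, MCDRAM typically appears as a NUMA node with memory but no CPUs, usually node 1, which is why -m 1 is used above):

```shell
# List NUMA nodes and their memory. A CPU-less node with memory
# is the MCDRAM in flat mode (commonly node 1 on KNL).
numactl -H

# Then bind allocations to that node while benchmarking:
numactl -m 1 ./build/tools/caffe time \
    --model=models/bvlc_alexnet/train_val.prototxt --engine=MKL2017
```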

Let me know how it goes.

Thanks

Anand

Hi Anand,

Why numactl? I see the same performance as without numactl. Also, for now I don't want to test memory allocation to MCDRAM vs. DDR4. My question is: why does BVLC Caffe show 64 threads?

Below is the output for your reference, using the command you shared above:

I0928 00:10:58.824509 25409 caffe.cpp:603] Average time per layer:
I0928 00:10:58.824573 25409 caffe.cpp:606]       data   forward: 8.50112 ms.
I0928 00:10:58.824654 25409 caffe.cpp:610]       data   backward: 0.00436 ms.
I0928 00:10:58.824795 25409 caffe.cpp:606]      conv1   forward: 30.3416 ms.
I0928 00:10:58.824898 25409 caffe.cpp:610]      conv1   backward: 23.5538 ms.
I0928 00:10:58.825004 25409 caffe.cpp:606]      relu1   forward: 2.19164 ms.
I0928 00:10:58.825426 25409 caffe.cpp:610]      relu1   backward: 4.05764 ms.
I0928 00:10:58.825538 25409 caffe.cpp:606]      norm1   forward: 8.70628 ms.
I0928 00:10:58.825636 25409 caffe.cpp:610]      norm1   backward: 8.84686 ms.
I0928 00:10:58.825886 25409 caffe.cpp:606]      pool1   forward: 3.22062 ms.
I0928 00:10:58.826004 25409 caffe.cpp:610]      pool1   backward: 16.0964 ms.
I0928 00:10:58.826361 25409 caffe.cpp:606]      conv2   forward: 31.826 ms.
I0928 00:10:58.826483 25409 caffe.cpp:610]      conv2   backward: 55.4061 ms.
I0928 00:10:58.826597 25409 caffe.cpp:606]      relu2   forward: 1.70506 ms.
I0928 00:10:58.826704 25409 caffe.cpp:610]      relu2   backward: 2.61486 ms.
I0928 00:10:58.827031 25409 caffe.cpp:606]      norm2   forward: 9.43468 ms.
I0928 00:10:58.827119 25409 caffe.cpp:610]      norm2   backward: 12.4072 ms.
I0928 00:10:58.827200 25409 caffe.cpp:606]      pool2   forward: 2.5184 ms.
I0928 00:10:58.827281 25409 caffe.cpp:610]      pool2   backward: 10.8106 ms.
I0928 00:10:58.827363 25409 caffe.cpp:606]      conv3   forward: 18.3688 ms.
I0928 00:10:58.827445 25409 caffe.cpp:610]      conv3   backward: 43.8893 ms.
I0928 00:10:58.827527 25409 caffe.cpp:606]      relu3   forward: 1.41792 ms.
I0928 00:10:58.827607 25409 caffe.cpp:610]      relu3   backward: 2.32604 ms.
I0928 00:10:58.827700 25409 caffe.cpp:606]      conv4   forward: 13.827 ms.
I0928 00:10:58.828032 25409 caffe.cpp:610]      conv4   backward: 33.6535 ms.
I0928 00:10:58.828114 25409 caffe.cpp:606]      relu4   forward: 1.43952 ms.
I0928 00:10:58.828193 25409 caffe.cpp:610]      relu4   backward: 2.33948 ms.
I0928 00:10:58.828274 25409 caffe.cpp:606]      conv5   forward: 9.31652 ms.
I0928 00:10:58.828353 25409 caffe.cpp:610]      conv5   backward: 22.5894 ms.
I0928 00:10:58.828433 25409 caffe.cpp:606]      relu5   forward: 0.86036 ms.
I0928 00:10:58.828775 25409 caffe.cpp:610]      relu5   backward: 1.74002 ms.
I0928 00:10:58.828855 25409 caffe.cpp:606]      pool5   forward: 1.25254 ms.
I0928 00:10:58.828934 25409 caffe.cpp:610]      pool5   backward: 2.698 ms.
I0928 00:10:58.829279 25409 caffe.cpp:606]        fc6   forward: 23.1727 ms.
I0928 00:10:58.829361 25409 caffe.cpp:610]        fc6   backward: 34.9388 ms.
I0928 00:10:58.829442 25409 caffe.cpp:606]      relu6   forward: 0.26536 ms.
I0928 00:10:58.829792 25409 caffe.cpp:610]      relu6   backward: 0.08718 ms.
I0928 00:10:58.829892 25409 caffe.cpp:606]      drop6   forward: 0.38286 ms.
I0928 00:10:58.829974 25409 caffe.cpp:610]      drop6   backward: 0.22284 ms.
I0928 00:10:58.830148 25409 caffe.cpp:606]        fc7   forward: 9.46336 ms.
I0928 00:10:58.830237 25409 caffe.cpp:610]        fc7   backward: 26.8225 ms.
I0928 00:10:58.830319 25409 caffe.cpp:606]      relu7   forward: 0.27676 ms.
I0928 00:10:58.830670 25409 caffe.cpp:610]      relu7   backward: 0.11186 ms.
I0928 00:10:58.830752 25409 caffe.cpp:606]      drop7   forward: 0.37444 ms.
I0928 00:10:58.830832 25409 caffe.cpp:610]      drop7   backward: 0.2322 ms.
I0928 00:10:58.830914 25409 caffe.cpp:606]        fc8   forward: 2.23664 ms.
I0928 00:10:58.830993 25409 caffe.cpp:610]        fc8   backward: 7.10434 ms.
I0928 00:10:58.831090 25409 caffe.cpp:606]       loss   forward: 1.28372 ms.
I0928 00:10:58.831171 25409 caffe.cpp:610]       loss   backward: 0.28512 ms.
I0928 00:10:58.831264 25409 caffe.cpp:616] Average Forward pass: 182.93 ms.
I0928 00:10:58.831329 25409 caffe.cpp:619] Average Backward pass: 313.349 ms.
I0928 00:10:58.831394 25409 caffe.cpp:621] Average Forward-Backward: 496.66 ms.
I0928 00:10:58.831459 25409 caffe.cpp:624] Total Time: 24833 ms.
I0928 00:10:58.831522 25409 caffe.cpp:625] *** Benchmark ends ***

Thanks.

Chetan Arvind Patil

Dear Chetan,

MCDRAM is much more powerful and gives better performance; you can see this with a larger batch size. (But for this, please check that MCDRAM is enabled in the BIOS.)

For the threads, please export OMP_NUM_THREADS=<number of physical cores - 2> and check whether your problem is resolved.
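For example, on a 64-core part it would look like this (the core count here is an assumption; substitute your own machine's count):

```shell
# Leave two physical cores free for the OS and framework I/O threads.
# 64 physical cores assumed; adjust to your hardware.
export OMP_NUM_THREADS=62

./build/tools/caffe time --model=models/bvlc_alexnet/train_val.prototxt
```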

Thanks

Anand

Hi Anand,

I think you haven't understood my question. I am asking why BVLC Caffe (not Intel Caffe) is using OpenMP threads; it is not supposed to. I can see this when I run the following command, which leads to 64 threads. I validated this by logging data with turbostat. Or is it that BVLC Caffe can also make use of OpenMP threads?

./build/tools/caffe time --model=models/bvlc_alexnet/train_val.prototxt

Thanks.

Chetan Arvind Patil

Dear Chetan,

Ideally, BVLC Caffe is not supposed to use OpenMP, since it has no CPU threading/vectorization support of its own. But if you have built BVLC Caffe with BLAS := mkl, it will use OpenMP threads, because OpenMP support is enabled inside MKL itself. Please check which BLAS you built BVLC Caffe with.
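A quick way to confirm this is to check what the BVLC binary is linked against (a sketch; exact library names vary by MKL version):

```shell
# If the build used BLAS := mkl, the binary links against MKL and an
# OpenMP runtime (libiomp5 or libgomp), which spawns the extra threads.
ldd ./build/tools/caffe | grep -E 'mkl|iomp|gomp'

# MKL's internal threading can also be capped independently of Caffe:
export MKL_NUM_THREADS=1
```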

Also, apart from the MKL level, Intel has made code changes in several Caffe layers such as pooling, ReLU, etc. You can easily see this difference if you look at the source code in the Intel Caffe branch; the BVLC Caffe source contains no "#pragma omp parallel". A few further modifications, such as more efficient looping and variable usage, are also part of Intel Caffe.
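You can verify this yourself by grepping both source trees (the directory names here are just placeholders for wherever you cloned each repo):

```shell
# Intel Caffe's layer code is explicitly parallelized:
grep -rn "pragma omp parallel" intel-caffe/src/caffe/layers/ | head

# The equivalent search in the BVLC tree finds no matches:
grep -rn "pragma omp parallel" bvlc-caffe/src/caffe/layers/
```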

I think that is what Intel Caffe brings, apart from the integration with MKL. Hope your queries are cleared up.

Thanks

Hi Anand,

I already shared above that BLAS was set to MKL in Makefile.config while building BVLC Caffe; the reason was compilation issues with BVLC not finding the OpenBLAS libraries.

However, even with MKL, apart from the threads I don't see any performance improvement for BVLC. I was just curious why threading occurs for BVLC.

Thanks.

Chetan Arvind Patil

So I hope it is now clear why you are seeing a lot of threads even with BVLC Caffe: it is simply because you built Caffe with MKL.

I also don't understand how you are trying to check the performance improvement of BVLC Caffe vs. Intel Caffe.

Kindly note that for BVLC Caffe, please fork https://github.com/BVLC/caffe

and for Intel Caffe use https://github.com/intel/caffe

and make sure that for Intel Caffe you use the option --engine=MKL2017 for training or benchmarking. You should see a considerable difference in the caffe time results.

Thanks

Anand

Best Reply

Hi Anand,

The threading was indeed due to "mkl" usage in BVLC Caffe. I validated this by compiling BVLC Caffe with BLAS set to "open" and then to "mkl": "mkl" leads to threading while "open" doesn't.

Lack of documentation led to this. BVLC gives three BLAS options: mkl/open/atlas. On some Linux systems (at least on the CentOS I have with Xeon Phi), "open" (which I got working now) and "atlas" (which just won't find the required libraries) don't work out of the box, but "mkl" does, and its libraries are found by the build system without hassle. I didn't know that using "mkl" with BVLC Caffe would introduce OpenMP threading.

Thanks.

Chetan Arvind Patil

Dear Chetan,

So, I believe this thread can be closed. Kindly let me know if you have any further questions on this.

Thanks

Anand

Hi Anand,

Yes. Please close.

Thanks.

Chetan Arvind Patil

Hi Anand,

Is there a specific algorithm or approach Intel Caffe follows for deciding how many threads to spawn and how and where to map them onto the architecture? Does it depend on the network architecture being run? How is the decision affected by running in scatter vs. compact affinity mode?

Is there any documentation I can read apart from the code? As far as I understand, only the layer code is threaded. Is that right?

Thanks.

Chetan Arvind Patil

Dear Chetan,

There is no specific documentation available on the algorithms used for threading. And yes, the threading parameters change based on the network architecture as well as the hardware architecture. Intel publishes these parameters once the layers are optimized, and we take them as the standard. If you really want more insight into these, I will have to discuss with an SME and let you know.
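As a general note on the scatter vs. compact question: with the Intel OpenMP runtime, thread placement is controlled through KMP_AFFINITY. A sketch (general runtime settings, not an Intel Caffe-specific recommendation):

```shell
# Spread threads across physical cores first (scatter):
export KMP_AFFINITY=granularity=fine,scatter

# Or pack threads onto as few cores as possible (compact):
# export KMP_AFFINITY=granularity=fine,compact

# Adding 'verbose' makes the runtime print the actual
# thread-to-core mapping at startup:
# export KMP_AFFINITY=verbose,scatter

./build/tools/caffe time \
    --model=models/bvlc_alexnet/train_val.prototxt --engine=MKL2017
```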

BTW, please start a new thread for discussing this topic.

Thanks

Anand

Hi Anand,

If you can discuss this, or connect me with someone to understand the specific optimizations Intel made in Caffe in terms of thread mapping, that would help. I will use a new thread for future reference.

Thanks.

Chetan Arvind Patil
