Tensorflow not working properly on Xeon Phi

I have installed tensorflow on my Xeon Phi (Knights Landing with 288 logical cores) as per these links:

https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture?language=en-us&https=1

https://software.intel.com/en-us/articles/intel-optimized-tensorflow-wheel-now-available#

When I run tf_cnn_benchmarks for the Inception v3 model, it gives me warnings:

(tf) $:~/benchmarks/scripts/tf_cnn_benchmarks# python tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=inception3 --variable_update=parameter_server --device=cpu
TensorFlow: 1.2
Model: inception3
Mode: training
Batch size: 32 global
32 per device
Devices: ['/cpu:0']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating model
2017-08-11 14:46:58.609446: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-11 14:46:58.609600: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-11 14:46:58.609660: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-11 14:46:58.609708: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-11 14:46:58.609757: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX512F instructions, but these are available on your machine and could speed up CPU computations.
2017-08-11 14:46:58.609802: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-11 14:47:07.065491: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-08-11 14:47:07.065590: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-08-11 14:47:07.065653: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr

and gives me 6 images/sec. What parameters are needed for this benchmark? How do I run the Inception v3 model on the Xeon Phi? There is no proper documentation for this, and I would be grateful if someone could publish an article with how-tos for the benchmarks claimed by Intel.

Thank You

Krishna Sheth

Hi, some BKMs (best-known methods) need to be set before running these topologies on Xeon Phi.

For Inception v3, please use the command below:

numactl -m 1 python tf_cnn_benchmarks.py  --model inception3 --cpu knl --batch_size 32 --data_format NCHW --num_intra_threads 66 --num_inter_threads 3 --data_dir /location/imagenet-data/ --data_name imagenet

Please also export these environment variables:

export KMP_BLOCKTIME=0
export OMP_NUM_THREADS=66
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
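If you drive the run from Python rather than a shell, the same variables can be set with `os.environ`; a minimal sketch, assuming they are set before TensorFlow is imported (the OpenMP runtime reads them at initialization):

```python
import os

# Set the OpenMP/KMP tuning variables before importing TensorFlow,
# otherwise the OpenMP runtime may already have initialized with defaults.
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["OMP_NUM_THREADS"] = "66"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

# import tensorflow as tf  # import only after the environment is configured
```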

Hi Jing,

From the command you gave, the tf_cnn_benchmarks.py script does not have a --cpu flag, so the script will not pick it up. What script are you using that has the --cpu flag? Also, what about the --device=cpu and --local_parameter_device=cpu flags? Don't I have to set those too?

The Xeon Phi that I am using has:

CPU(s):                288
On-line CPU(s) list:   0-287
Thread(s) per core:    4
Core(s) per socket:    72
Socket(s):             1
NUMA node(s):          2

With your command I am getting 23.57 images/sec.
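For reference, the 66/3 thread split in the suggested command can be related to the lscpu numbers above; a sketch, where the choice of how many cores to reserve is an assumption on my part, not something documented:

```python
# Values from the lscpu output above.
logical_cpus = 288
threads_per_core = 4
physical_cores = logical_cpus // threads_per_core  # 72 physical cores

# The suggested run uses one OpenMP thread per core for compute and
# reserves the remainder for TensorFlow's own coordination threads:
intra_op_threads = 66  # assumption: leave ~6 cores free for other work
inter_op_threads = 3   # independent ops scheduled in parallel
```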

Hi Krishna,

We tried the bazel build with the following flags:
bazel build --config=mkl --copt="-DEIGEN_USE_VML" --copt="-mfma" --copt="-mavx2" --copt="-O3" -s -c opt //tensorflow/tools/pip_package:build_pip_package
Then we ran tf_cnn_benchmarks.py. This removed all the warnings except the AVX512F one.
We are further working on resolving this. Will update you as we make progress.
Pasting the results of the benchmark run below:
$ python tf_cnn_benchmarks/tf_cnn_benchmarks.py

TensorFlow:  1.2
Model:       trivial
Mode:        training
Batch size:  32 global
              32 per device

Devices:     ['/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating model
2017-08-22 07:24:43.494236: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX512F instructions, but these are available on your machine and could speed up CPU computations.
2017-08-22 07:24:45.845980: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-08-22 07:24:45.846223: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-08-22 07:24:45.846461: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
Running warm up
Done warm up
Step    Img/sec loss
Starting real work at step 10 at time Tue Aug 22 07:24:56 2017
1       images/sec: 41.2 +/- 0.0 (jitter = 0.0) 7.089
10      images/sec: 38.9 +/- 1.3 (jitter = 3.8) 7.088
20      images/sec: 40.5 +/- 0.8 (jitter = 1.7) 7.086
30      images/sec: 40.8 +/- 0.6 (jitter = 1.6) 7.084
40      images/sec: 41.0 +/- 0.5 (jitter = 2.1) 7.082
50      images/sec: 41.1 +/- 0.4 (jitter = 2.0) 7.080
60      images/sec: 41.1 +/- 0.3 (jitter = 2.0) 7.079
70      images/sec: 41.1 +/- 0.3 (jitter = 2.1) 7.077
80      images/sec: 41.2 +/- 0.3 (jitter = 2.2) 7.075
90      images/sec: 41.4 +/- 0.3 (jitter = 2.2) 7.073
Finishing real work at step 109 at time Tue Aug 22 07:26:13 2017
100     images/sec: 41.3 +/- 0.3 (jitter = 2.1) 7.071
----------------------------------------------------------------
total images/sec: 41.17

Regards
Ravi Keron

Hello Ravi,

The benchmark you ran is without any parameters or real data. Please run your benchmark with real data and proper parameters. I got 28 images/sec in my latest test. If I run the script the way you did, I get 48.5 images/sec.

Regards,

Krishna Sheth

Hi Krishna,
      We attempted to eliminate the warnings so that the optimizations work better. We could eliminate all but one; we are working on eliminating that one, running with data, and getting back to you.

Regards
Ravi Keron 

Hi Krishna,
        We tried to eliminate the warnings and also ran with a larger data set. The results are not as expected, so the issue is being referred to the product SME. Will keep you posted.

Regards
Ravi Keron N

Hi Ravi,

Can you send me the link to the product SME? Also, did you try a different cluster mode on the Phi, and did you make the run numactl-aware? Please also post the output you got, so that we have a real idea of what's happening.

Regards,

Krishna Sheth

Hi Krishna, 

The Inception V3 issue on Xeon Phi has been referred to the product SME. Once we get a response, we can discuss and clarify the solution options they provide. To check performance, we tried runs on a different topology (Alexnet) with 50K images, setting the following compilation and environment flags:
Compiler flags: -mfma and -march=knl
OMP_NUM_THREADS=136
KMP_BLOCKTIME=30
KMP_SETTINGS=1
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
inter_op=2
intra_op=136
On 50K images this gave a better throughput of 326 images/sec.
The Alexnet benchmark run gave 630 images/sec.
Below are the Bazel build options used for the above runs:
bazel build --config=mkl --copt="-DEIGEN_USE_VML" --copt="-mfma" --copt="-mavx2"  --copt="-O3" -s -c opt //tensorflow/tools/pip_package:build_pip_package
The numactl option was tried along with the above flags, and benchmark performance improved by 34%:
$ numactl -m 1 python benchmark_alexnet_Phi.py
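If you are writing your own benchmark script rather than using tf_cnn_benchmarks, the inter_op/intra_op values listed above map to TensorFlow session settings; a hedged sketch, assuming the TF 1.x tf.ConfigProto API and the 136/2 split used in the Alexnet run:

```python
import os

# Same environment settings as the Alexnet run above; set them before
# TensorFlow is imported so the OpenMP runtime picks them up.
os.environ["OMP_NUM_THREADS"] = "136"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

# Thread-pool sizes corresponding to the intra_op/inter_op values above.
# In TF 1.x these would be passed to the session as:
#   config = tf.ConfigProto(intra_op_parallelism_threads=136,
#                           inter_op_parallelism_threads=2)
#   sess = tf.Session(config=config)
intra_op = 136  # threads used inside a single op (matches OMP_NUM_THREADS)
inter_op = 2    # ops executed concurrently
```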

We still need to get a solution on the Inception V3 topology. Will keep you posted as we receive the response.

Regards
Ravi Keron

Hi Krishna,

We could replicate the problem with ImageNet data and have shared the issue details with the product team; we will get back once a response is received.

Regards,

Rajeswari Ponnuru.

Following up with the product team. Will get back once we get a response.

Thanks

Ravi Keron N

Hey Ravi,

I have sent a mail to Rajeswari P. and cc'd you on it. Please have a look at it.

Regards

Krishna Sheth
