Running int8 model on Intel-Optimized-Tensorflow

TCE Open Date: 

Wednesday, January 15, 2020 - 09:15

I read the article

It says that 2nd-generation instructions such as AVX512_VNNI are optimized for neural networks.

I ran one of the INT8 models from IntelAI.

Here is my environment

- Docker:

- CPU info

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             1838.080
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
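
For readers who want to check the flags line programmatically, here is a minimal sketch. The function names are made up for illustration, and it assumes a Linux system where /proc/cpuinfo exposes a "flags" line. Note that it only tells you whether the hardware advertises avx512_vnni, not whether a given library actually uses it.

```python
def has_cpu_flag(flags_line: str, flag: str = "avx512_vnni") -> bool:
    """Return True if `flag` appears as a whole token in a flags line."""
    # Drop an optional "Flags:" / "flags :" label before splitting on whitespace.
    _, _, rest = flags_line.rpartition(":")
    return flag.lower() in rest.lower().split()


def read_proc_cpuinfo_flags(path: str = "/proc/cpuinfo") -> str:
    """Return the first 'flags' line of /proc/cpuinfo, or '' if unavailable."""
    try:
        with open(path) as fh:
            for line in fh:
                if line.lower().startswith("flags"):
                    return line
    except OSError:
        pass  # Not Linux, or the file is not readable.
    return ""


if __name__ == "__main__":
    print("avx512_vnni advertised:", has_cpu_flag(read_proc_cpuinfo_flags()))
```

Matching whole tokens avoids a false positive from a longer flag name that merely contains the one you are looking for.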

I expected the neural network to run with the 2nd-generation instructions (AVX512_VNNI),

but the log shows that the following optimized instructions are used:


Is this Docker image the optimized version for running neural networks?

How can I find out whether AVX512_VNNI is actually used?

How can I compile the code provided by IntelAI with the 2nd-generation Intel instructions?

Which Docker image should I use to run the program?

Thanks in advance



May I know which model you tested? Please also let me know the steps you followed.

1. If you run the benchmark with the environment variable DNNL_VERBOSE set to 1, you will see messages like the following at the beginning of the verbose output. If VNNI is supported, it will be shown in these messages.

dnnl_verbose,info,DNNL v1.1.0 (commit 5be2cfea21ec6d1d29f52600553baff53e30aedb)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost

2. You don't need to compile any code. DNNL dispatches code automatically at runtime.

3. Please use the docker images mentioned in the github page, like




I ran the "wide & deep" model.

The INT8 model can't run on docker ""

Some errors occur.


I chose docker "", which I think is the latest version of optimized TensorFlow with MKL-DNN.

It shows some messages:

I tensorflow/core/platform/] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 AVX512F FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.

I think this means the instruction optimizations are active.


But I can't see the messages you mentioned.

dnnl_verbose,info,DNNL v1.1.0 (commit 5be2cfea21ec6d1d29f52600553baff53e30aedb)

dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost


I also tried "export DNNL_VERBOSE=1",

but it doesn't work.


Is anything wrong?

Could you provide a Docker image that runs "wide & deep" with VNNI?

Thank you very much~


The dataset is large, and it takes time to download.

Could you please try MKLDNN_VERBOSE=1 instead?

I'll investigate this issue once I have the dataset downloaded.

Thank you.

Hi Lin ChiungLiang,

Two items that may help.

1) The message that was output by the CPU feature guard is helpful. It means that the binary was compiled with GCC flags that used AVX instructions, but to allow the container to work on the greatest number of systems possible, it was not compiled with *static* AVX2, AVX512, or AVX512_VNNI instructions in the Eigen library, which would cause TensorFlow in that container to crash when run on older systems.

However, MKL-DNN detects CPU features at run-time and adjusts accordingly. Thus, when TensorFlow loads the MKL-DNN library, AVX512_VNNI instructions will be used if they are available on that system.

2) The TensorFlow version in the container is TensorFlow 1.15, which uses MKL-DNN version 0.x. If you want to see the verbose output in that version, as suggested above, you need to set MKLDNN_VERBOSE=1. DNNL_VERBOSE=1 will only work once MKL-DNN 1.x has been integrated into TensorFlow.
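
If you capture the verbose output to a file, the header lines can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes the header lines look like the "info" lines quoted in this thread, with either the mkldnn_verbose or the dnnl_verbose prefix.

```python
import re

# Matches the "info" header lines quoted in this thread, for either prefix.
_INFO_RE = re.compile(r"^(mkldnn_verbose|dnnl_verbose),info,(.*)$")
_ISA_PREFIX = "Detected ISA is "


def summarize_verbose_header(log_text: str) -> dict:
    """Extract the prefix, version line, and detected ISA from verbose output."""
    summary = {"prefix": None, "version": None, "isa": None}
    for line in log_text.splitlines():
        match = _INFO_RE.match(line.strip())
        if not match:
            continue  # Not a header line (for example an "exec" line).
        prefix, payload = match.groups()
        summary["prefix"] = prefix
        if payload.startswith(_ISA_PREFIX):
            summary["isa"] = payload[len(_ISA_PREFIX):]
        elif summary["version"] is None:
            summary["version"] = payload
    return summary


def reports_dl_boost(log_text: str) -> bool:
    """True if the detected ISA string mentions Intel DL Boost (VNNI)."""
    isa = summarize_verbose_header(log_text)["isa"]
    return isa is not None and "DL Boost" in isa
```

The prefix field tells you which variable took effect: mkldnn_verbose means the 0.x library answered to MKLDNN_VERBOSE, dnnl_verbose means a 1.x library answered to DNNL_VERBOSE.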


Hi Robison,

Thanks for your information.

1) Is there a command to check the MKL-DNN version?

2) As I mentioned, I'd like to run the int8 "wide & deep" model. Could you please tell me which Docker image I should use?

3) I found that the model can't fully utilize the CPUs when it runs.

I am sure all CPUs are used, but I don't know why CPU utilization stays low, at about 30%~40%.

Many thanks


I'm still checking with the dev team about the CPU usage. The workload of this task is probably not large enough.

Alternatively, you may wish to try environment variables like KMP_AFFINITY.

1) Please check the first MKLDNN verbose message. MKLDNN versions newer than 0.18 print their version information as the first verbose message.

2) The docker image mentioned in the github page works for int8 with this model. Please just use that docker image.


Any update?

I'm still waiting for your reply.

I tried the docker image you mentioned on the GitHub page, but it isn't the optimized one.

It only used the optimized instruction set AVX512F.

The model's performance on that docker image is even worse (longer computation time) than on the docker image I mentioned.

That one used the instructions AVX512F, AVX2, and FMA.

Please help me check which docker image is the best.

Many thanks

Best Reply


I tensorflow/core/platform/] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 AVX512F FMA

The message shown above is not meaningful for Intel Optimization for TensorFlow, since either MKL-DNN or MKL dispatches dynamically at runtime to take advantage of the latest instruction set supported by your hardware.

For MKL-DNN, it will show

dnnl_verbose,info,DNNL v1.1.0 (commit 5be2cfea21ec6d1d29f52600553baff53e30aedb)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost

For MKL, it will show

MKL_VERBOSE Intel(R) MKL 2019.0 Update 3 Product build 20190125 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.10GHz lp64 intel_thread

If you see the above messages, VNNI will be used at runtime.

Use the following commands to show verbose messages.

export MKLDNN_VERBOSE=1 [or] export DNNL_VERBOSE=1
export MKL_VERBOSE=1

As for performance, have you tried environment variables like KMP_AFFINITY and/or OMP_NUM_THREADS?

These environment variables affect performance. Some docker images set them automatically; in others you have to set them yourself.
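
As a rough illustration of setting these from Python rather than the shell, here is a sketch. The helper names are hypothetical, and the specific values (one OpenMP thread per physical core, a fine-grained compact affinity) are assumed starting points for experimentation, not settings confirmed in this thread. The variables need to be in the environment before TensorFlow is imported, because the OpenMP runtime reads them when it initializes.

```python
import os


def intel_openmp_env(physical_cores: int) -> dict:
    """Build OpenMP-related environment variables for one process.

    The values are assumptions to tune from, not verified recommendations.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be positive")
    return {
        # One OpenMP thread per physical core (hyper-threads left idle).
        "OMP_NUM_THREADS": str(physical_cores),
        # Pin threads to cores, packing them close together.
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }


def apply_env(settings: dict) -> None:
    """Export the settings without overriding values that are already set."""
    for name, value in settings.items():
        os.environ.setdefault(name, value)


if __name__ == "__main__":
    apply_env(intel_openmp_env(physical_cores=48))
    # import tensorflow only after this point so the settings take effect.
    print({k: os.environ[k] for k in ("OMP_NUM_THREADS", "KMP_AFFINITY")})
```

Using setdefault keeps any value a docker image has already exported, which matters for the images that set these variables automatically.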

All docker images released in should all have been enabled for VNNI support.


Thanks for your response.

I finally saw the messages after enabling the flags you mentioned.

mkldnn_verbose,info,Intel MKL-DNN v0.20.3 (commit N/A)
mkldnn_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost


I have some questions.

1) If the flags are not set to 1, does the program run without VNNI? Or does it run the same way, just without printing the information?

The results are interesting: when I enable the flags, the computation time increases a little.

I think the program simply runs without printing when the flags are disabled, so running with the information printed makes the results slightly worse.


2) About CPU utilization

I set the number of threads equal to the number of cores, and the number of inter-op threads to 2.

The utilization rate can't reach 100%.
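
For reference, the thread counts described above can be derived from the lscpu output rather than typed by hand. This is a sketch with hypothetical helper names; it assumes lscpu prints "Socket(s)" and "Core(s) per socket" lines as in the first post, and that "number of cores" means physical cores rather than logical CPUs.

```python
import re


def physical_cores_from_lscpu(lscpu_text: str) -> int:
    """Return sockets * cores-per-socket parsed from `lscpu` output."""

    def field(label: str) -> int:
        pattern = rf"^{re.escape(label)}\s*:\s*(\d+)\s*$"
        match = re.search(pattern, lscpu_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"missing lscpu field: {label}")
        return int(match.group(1))

    return field("Socket(s)") * field("Core(s) per socket")


def thread_settings(lscpu_text: str, inter_op: int = 2) -> dict:
    """Map the core count to intra-op / inter-op thread settings."""
    return {
        "intra_op_parallelism_threads": physical_cores_from_lscpu(lscpu_text),
        "inter_op_parallelism_threads": inter_op,
    }
```

The dictionary keys match the field names of TensorFlow's ConfigProto, so in TensorFlow 1.15 the result could be passed as tf.compat.v1.ConfigProto(**settings). For the machine in the first post this gives 48 intra-op threads, not 96, since each core has two hardware threads.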

If you have any comments, please let me know.


Many thanks


1) Yes, your understanding is correct. The environment variable only controls whether the messages are printed.

2) CPU usage may not reach 100%, depending on the use case. Are you satisfied with the performance?


The performance is good.

I just want to know how to get the best performance.

Thanks for your help.


For simple methods, you can refer to the following article.

For more advanced tuning, you need to profile the execution to see which parts take the longest and improve them accordingly.
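
One low-effort way to start profiling is to total the times in the verbose log you already have. The sketch below rests on two assumptions about the log format that you should verify against your own output: each "exec" line ends with the execution time in milliseconds, and the primitive kind is the field right after "exec" (MKL-DNN 0.x) or right after an engine field such as "cpu" (DNNL 1.x). The function name is hypothetical.

```python
from collections import defaultdict


def time_per_primitive(log_text: str) -> dict:
    """Sum the last field of each verbose 'exec' line, grouped by primitive kind."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        fields = line.strip().split(",")
        if len(fields) < 4 or fields[0] not in ("mkldnn_verbose", "dnnl_verbose"):
            continue
        if fields[1] != "exec":
            continue  # Skip "info" header lines and anything else.
        # Assumed layout: DNNL 1.x puts an engine field before the primitive kind.
        kind = fields[3] if fields[2] in ("cpu", "gpu") else fields[2]
        try:
            elapsed_ms = float(fields[-1])
        except ValueError:
            continue  # Last field was not a number; ignore the line.
        totals[kind] += elapsed_ms
    # Largest total first, so the hot spots are at the top.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

Keep in mind that printing the verbose lines itself adds a little overhead, as noted earlier in the thread, so treat the totals as relative rather than absolute.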


Could you please confirm whether the solution provided was helpful?


Yes, thanks for your help


Thanks for the confirmation. We are closing this thread. Feel free to open a new thread if you have any further queries.

