Intel® oneAPI Deep Neural Network Library Release Notes

By Hung-Ju Tsai, Jennifer L Jiang

Published: 12/05/2020   Last Updated: 06/28/2021

This document provides a summary of new and changed product features.

Where to Find the Release

Follow the steps to download the Intel® oneAPI Base Toolkit, which contains oneDNN, from the Web Configurator, and then follow the installation instructions to install it.

2021.3

Performance Optimizations

  • Extended primitive cache to improve primitive descriptor creation performance.
  • Improved primitive cache performance in multithreaded configurations.
  • Intel® Processor Graphics and Xe architecture-based Graphics:
    • Improved performance of binary primitive and binary post-op for cases with broadcast and mixed source and destination formats.
    • Improved performance of reduction primitive.
    • Improved performance of depthwise convolution primitive with NHWC activations for training cases.
  • Intel® Architecture processors:
    • Introduced initial optimizations for bfloat16 compute functionality for future Intel® Xeon Scalable processors with Intel® AMX support (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
    • Improved performance of int8 compute functionality for future Intel® Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control. 
    • Introduced initial performance optimizations for future Intel® Core processor with Intel® AVX2 and Intel® DL Boost instructions support (code name Alder Lake).
    • Improved performance of int8 primitives for processors with Intel® SSE4.1 instruction set support.
    • Improved performance of int8 and bfloat16 RNN and inner product primitives.
    • Introduced CPU ISA hints environment variable and API. The new API dispatches function implementations that use YMM registers to improve performance on processors with a single Intel® AVX-512 compute unit.
    • Improved forward convolution performance for Intel® AVX-512 systems.
    • Improved convolution and batch normalization performance with threadpool.
    • Improved performance of bfloat16 shuffle primitive.
    • Improved performance of `dnnl_gemm` and functionality relying on this implementation for cases with `n=1` on all supported processors.
       

New Functionality

Usability

  • Introduced support for DPC++ debug configuration on Windows.

Breaking changes

  • Updated the minimal supported CMake version to 2.8.12 (was 2.8.11).

Known issues and limitations

  • Backward inner product primitive may produce incorrect results for shapes where the number of output channels is not a multiple of 16 on future Intel Xeon Scalable processors (code name Sapphire Rapids).
  • Convolution with binary post-op may produce incorrect results for formats with channel padding.
  • Pooling and batch normalization primitives may hang on Windows GEN9 and DG1 in DPC++/L0 configuration.
  • Pooling and batch normalization primitives with 4D double blocked memory formats may produce NaNs or hang on Linux DG1 platforms.
  • See DPC++ limitations that impact the library as well.

2021.2

Performance Optimizations

  • Reduced overheads associated with primitive cache.
  • Intel® Processor Graphics and Xe architecture-based Graphics:
    • Improved performance of int8 primitives with NHWC activations format.
    • Improved functionality performance for padded memory formats.
    • Improved performance of reorder and shuffle primitives for multiple formats and all dimensions.
    • Improved performance of fp16 pooling primitive.
    • Improved performance of layer normalization (lnorm) primitive for plain memory formats.
    • Improved performance of resampling primitive for blocked memory formats.
    • Improved performance of Winograd convolution.
  • Intel® Architecture processors:
    • Introduced initial optimizations for bfloat16 functionality for future Intel® Xeon Scalable processor with Intel® AMX support (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
    • Improved performance of int8 compute functionality for future Intel® Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control. 
    • Introduced initial performance optimizations for future Intel® Core processor with Intel® AVX2 and Intel® DL Boost instructions support (code name Alder Lake).
    • Improved performance of int8 primitives for processors with Intel® SSE4.1 instruction set support.
    • Improved performance of int8 and bfloat16 RNN and inner product primitives.
    • Introduced CPU ISA hints environment variable and API. The new API dispatches function implementations that use YMM registers to improve performance on processors with a single Intel® AVX-512 compute unit.
    • Improved forward convolution performance for Intel® AVX-512 systems.
    • Improved convolution and batch normalization performance with threadpool.
    • Improved performance of bfloat16 shuffle primitive.
    • Improved performance of `dnnl_gemm` and functionality relying on this implementation for cases with `n=1` on all supported processors.
       

New Functionality

  • Introduced binary post-op for (de)convolution, pooling, eltwise, binary, inner product, matmul and reduction (GPU only) along with performance optimizations for CPUs and GPUs. Extended the maximum number of supported post-ops per primitive to 20.
  • Extended eltwise primitive with support for `logsigmoid`, `mish`, `hardswish`, and `clip_v2` algorithms.
  • Introduced support for PReLU primitive.
  • Introduced int8 support for LSTM primitive with projection for CPU.
  • Introduced asymmetric quantization support for int8 deconvolution.
  • Extended matmul implementation with support for per-output channel zero-points for quantization.
  • Extended support for broadcasting in binary primitive to both inputs for CPU.
  • Extended binary primitive with support for comparison operators.
  • Introduced float16 support in reduction primitive for GPU.
  • Introduced support for mixed input and output types in binary primitive for GPU.
  • Introduced support for post-ops in GPU resampling implementation.

Usability

  • Added API to enable displaying timestamps in oneDNN verbose mode. Timestamps allow oneDNN verbose output to be correlated with profiling tools.
  • Improved presentation of oneDNN primitives in Intel® VTune™ Profiler.

Validation

  • Extended benchdnn to report operation bandwidth.
  • Added ability to choose target GPU in benchdnn.

Known issues and limitations

  • When using a driver version older than 27.20.100.9316 for Intel® UHD Graphics for 9th Gen Intel® Processors on Windows, convolution/deconvolution functions may sporadically hang or produce incorrect results in the DPC++ configuration with Level Zero. Upgrade your driver to fix the issue. An alternative solution is to use the DPC++ compiler with the OpenCL backend.
  • Reorder, prelu, softmax, and pooling primitives on GPUs may be slower for zero-padded memory formats than in Intel oneDNN 2021.1.
  • Reorder operation for 5D tensor with two dimensions equal to 16 and one uneven dimension can produce incorrect results on Intel® Iris® Xe Max Graphics.
  • Eltwise primitive may produce incorrect results in the oneDNN DPC++ configuration with the Level Zero runtime. To avoid this, use the DPC++ compiler with the OpenCL backend.
  • Deconvolution primitive may segfault with int8 data for cases with non-trivial padding on processors with Intel AVX-512 support.
  • Deconvolution primitive may segfault with int8 data when used with post-ops and per_oc broadcast on processors with Intel AVX2 support.
  • Pooling, batch normalization, and binary primitives may segfault when executed on Xe architecture-based graphics. No workaround available.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check that the GPU device is an Intel device. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly.
  • When running GPU kernels that take longer than a certain time (depending on OS and system settings), the application may appear to hang. Driver or system settings can be configured to disable this timeout and avoid hanging of DPC++ or OpenCL programs, including oneDNN examples.
  • See DPC++ limitations that impact the library as well.

2021.1

New Functionality

Known issues and limitations

  • Pooling, batch normalization, and binary primitives may segfault when executed on Xe architecture-based graphics. No workaround available.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check that the GPU device is an Intel device. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly.
  • When running GPU kernels that take longer than a certain time (depending on OS and system settings), the application may appear to hang. Driver or system settings can be configured to disable this timeout and avoid hanging of DPC++ or OpenCL programs, including oneDNN examples.
  • See DPC++ limitations that impact the library as well.

 

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Product and Performance Information

1. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.