Intel® C++ Compiler Classic Developer Guide and Reference

ID 767249
Date 12/16/2022
Public

Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) BF16 Instructions

The prototypes for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) BF16 instruction intrinsics are located in the zmmintrin.h header file.

To use these intrinsics, include the immintrin.h file as follows:

#include <immintrin.h>

Variable    Definition
a           a source vector element
b           a second source vector element
k           mask used as a selector; depending on the intrinsic, it may be a writemask or a zeromask

_mm_cvtne2ps_pbh

__m128bh _mm_cvtne2ps_pbh (__m128 a, __m128 b)

Instructions: vcvtne2ps2bf16 xmm, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst.
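
The description above does not spell out the rounding rule or the element order. The scalar C sketch below models both, assuming the behavior given in the published instruction pseudocode: round to nearest even (the "ne" in the name), zero and denormal inputs converted to signed zero, NaNs quieted, and the elements of b placed in the lower half of dst with the elements of a in the upper half. The names fp32_to_bf16, model_cvtne2ps_pbh, and example_cvtne2ps are hypothetical helpers, not part of the compiler's API; this is a reference model, not the intrinsic itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one FP32 -> BF16 conversion (round to nearest even).
   Special-case handling is an assumption based on the instruction pseudocode. */
static uint16_t fp32_to_bf16(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);              /* reinterpret the float bits */

    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;

    if (exponent == 0)                           /* zero or denormal -> signed zero */
        return (uint16_t)((bits >> 16) & 0x8000u);
    if (exponent == 0xFFu) {                     /* infinity or NaN */
        uint16_t top = (uint16_t)(bits >> 16);
        return mantissa ? (uint16_t)(top | 0x0040u) : top;  /* force a quiet NaN */
    }
    /* Normal number: add a bias so that exact ties round to the even result. */
    uint32_t bias = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + bias) >> 16);
}

/* Hypothetical model of _mm_cvtne2ps_pbh: b fills dst[0..3], a fills dst[4..7]. */
void model_cvtne2ps_pbh(uint16_t dst[8], const float a[4], const float b[4])
{
    for (int j = 0; j < 4; ++j) {
        dst[j]     = fp32_to_bf16(b[j]);
        dst[j + 4] = fp32_to_bf16(a[j]);
    }
}

/* Usage: two tie cases show the round-to-nearest-even behavior. */
void example_cvtne2ps(void)
{
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {-1.0f, 0.5f,
                        1.0f + 1.0f / 256.0f,    /* tie, rounds down to 0x3F80 */
                        1.0f + 3.0f / 256.0f};   /* tie, rounds up to 0x3F82 */
    uint16_t dst[8];

    model_cvtne2ps_pbh(dst, a, b);
    assert(dst[0] == 0xBF80 && dst[1] == 0x3F00);   /* from b: -1.0, 0.5 */
    assert(dst[2] == 0x3F80 && dst[3] == 0x3F82);   /* from b: the two ties */
    assert(dst[4] == 0x3F80 && dst[7] == 0x4080);   /* from a: 1.0 and 4.0 */
}
```

The 256-bit and 512-bit forms follow the same pattern with 8 and 16 elements per source vector.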



_mm_mask_cvtne2ps_pbh

__m128bh _mm_mask_cvtne2ps_pbh (__m128bh src, __mmask8 k, __m128 a, __m128 b)

Instructions: vcvtne2ps2bf16 xmm {k}, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.



_mm_maskz_cvtne2ps_pbh

__m128bh _mm_maskz_cvtne2ps_pbh (__mmask8 k, __m128 a, __m128 b)

Instructions: vcvtne2ps2bf16 xmm {k}{z}, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.
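
The difference between the writemask and zeromask forms can be sketched in scalar C. The example assumes 8 BF16 lanes (the 128-bit form) and that bit j of k controls lane j; model_mask_select, model_maskz_select, and example_masks are hypothetical names used only for illustration, and result stands for the already converted values.

```c
#include <assert.h>
#include <stdint.h>

/* Writemask model: lane j takes the new result when bit j of k is set,
   otherwise it is copied from src. */
void model_mask_select(uint16_t dst[8], const uint16_t src[8],
                       uint8_t k, const uint16_t result[8])
{
    for (int j = 0; j < 8; ++j)
        dst[j] = ((k >> j) & 1u) ? result[j] : src[j];
}

/* Zeromask model: lane j takes the new result when bit j of k is set,
   otherwise it is zeroed. */
void model_maskz_select(uint16_t dst[8], uint8_t k, const uint16_t result[8])
{
    for (int j = 0; j < 8; ++j)
        dst[j] = ((k >> j) & 1u) ? result[j] : 0;
}

/* Usage: the same result vector filtered through both mask styles. */
void example_masks(void)
{
    const uint16_t result[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    const uint16_t src[8]    = {90, 91, 92, 93, 94, 95, 96, 97};
    uint16_t dst[8];

    model_mask_select(dst, src, 0x0F, result);   /* lanes 0..3 from result */
    assert(dst[0] == 10 && dst[3] == 13);
    assert(dst[4] == 94 && dst[7] == 97);        /* lanes 4..7 kept from src */

    model_maskz_select(dst, 0xA5, result);       /* bits 0, 2, 5, 7 are set */
    assert(dst[0] == 10 && dst[2] == 12 && dst[5] == 15 && dst[7] == 17);
    assert(dst[1] == 0 && dst[3] == 0 && dst[4] == 0 && dst[6] == 0);
}
```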



_mm256_cvtne2ps_pbh

__m256bh _mm256_cvtne2ps_pbh (__m256 a, __m256 b)

Instructions: vcvtne2ps2bf16 ymm, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst.



_mm256_mask_cvtne2ps_pbh

__m256bh _mm256_mask_cvtne2ps_pbh (__m256bh src, __mmask16 k, __m256 a, __m256 b)

Instructions: vcvtne2ps2bf16 ymm {k}, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.



_mm256_maskz_cvtne2ps_pbh

__m256bh _mm256_maskz_cvtne2ps_pbh (__mmask16 k, __m256 a, __m256 b)

Instructions: vcvtne2ps2bf16 ymm {k}{z}, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.



_mm512_cvtne2ps_pbh

__m512bh _mm512_cvtne2ps_pbh (__m512 a, __m512 b)

Instructions: vcvtne2ps2bf16 zmm, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst.



_mm512_mask_cvtne2ps_pbh

__m512bh _mm512_mask_cvtne2ps_pbh (__m512bh src, __mmask32 k, __m512 a, __m512 b)

Instructions: vcvtne2ps2bf16 zmm {k}, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.



_mm512_maskz_cvtne2ps_pbh

__m512bh _mm512_maskz_cvtne2ps_pbh (__mmask32 k, __m512 a, __m512 b)

Instructions: vcvtne2ps2bf16 zmm {k}{z}, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.



_mm_cvtneps_pbh

__m128bh _mm_cvtneps_pbh (__m128 a)

Instructions: vcvtneps2bf16 xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst.
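
Because the single-source form produces only 4 BF16 values in a 128-bit result, it helps to see where they land and how much precision survives. The scalar C sketch below assumes that the results occupy the low four 16-bit lanes and that the upper lanes are zeroed; that layout is an assumption taken from the published pseudocode, not stated in this section. The rounding helper handles normal finite inputs only, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round-to-nearest-even FP32 -> BF16 for normal finite inputs only
   (NaN, infinity, and denormal handling omitted for brevity). */
static uint16_t fp32_to_bf16_normal(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)(bits >> 16);
}

/* Widening BF16 -> FP32 is exact: the 16 bits become the top half. */
static float bf16_to_fp32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Hypothetical model of _mm_cvtneps_pbh: lanes 0..3 hold the results,
   lanes 4..7 are zeroed (assumed layout). */
void model_cvtneps_pbh(uint16_t dst[8], const float a[4])
{
    for (int j = 0; j < 4; ++j)
        dst[j] = fp32_to_bf16_normal(a[j]);
    for (int j = 4; j < 8; ++j)
        dst[j] = 0;
}

/* Usage: BF16 keeps 8 significant bits, so the round-trip error of a
   normal value is at most |x| / 256. */
void example_round_trip(void)
{
    const float a[4] = {3.14159265f, -0.1f, 1000.0f, 1.0e-3f};
    uint16_t dst[8];

    model_cvtneps_pbh(dst, a);
    for (int j = 0; j < 4; ++j) {
        float back  = bf16_to_fp32(dst[j]);
        float err   = back > a[j] ? back - a[j] : a[j] - back;
        float bound = (a[j] < 0.0f ? -a[j] : a[j]) / 256.0f;
        assert(err <= bound);
    }
    for (int j = 4; j < 8; ++j)
        assert(dst[j] == 0);
}
```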



_mm_mask_cvtneps_pbh

__m128bh _mm_mask_cvtneps_pbh (__m128bh src, __mmask8 k, __m128 a)

Instructions: vcvtneps2bf16 xmm {k}, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.



_mm_maskz_cvtneps_pbh

__m128bh _mm_maskz_cvtneps_pbh (__mmask8 k, __m128 a)

Instructions: vcvtneps2bf16 xmm {k}{z}, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.



_mm256_cvtneps_pbh

__m128bh _mm256_cvtneps_pbh (__m256 a)

Instructions: vcvtneps2bf16 xmm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst.



_mm256_mask_cvtneps_pbh

__m128bh _mm256_mask_cvtneps_pbh (__m128bh src, __mmask8 k, __m256 a)

Instructions: vcvtneps2bf16 xmm {k}, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.



_mm256_maskz_cvtneps_pbh

__m128bh _mm256_maskz_cvtneps_pbh (__mmask8 k, __m256 a)

Instructions: vcvtneps2bf16 xmm {k}{z}, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.



_mm512_cvtneps_pbh

__m256bh _mm512_cvtneps_pbh (__m512 a)

Instructions: vcvtneps2bf16 ymm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst.



_mm512_mask_cvtneps_pbh

__m256bh _mm512_mask_cvtneps_pbh (__m256bh src, __mmask16 k, __m512 a)

Instructions: vcvtneps2bf16 ymm {k}, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.



_mm512_maskz_cvtneps_pbh

__m256bh _mm512_maskz_cvtneps_pbh (__mmask16 k, __m512 a)

Instructions: vcvtneps2bf16 ymm {k}{z}, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.



_mm_dpbf16_ps

__m128 _mm_dpbf16_ps (__m128 src, __m128bh a, __m128bh b)

Instructions: vdpbf16ps xmm, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst.
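
The pairwise accumulation can be modeled in scalar C as follows. Each 32-bit result lane j consumes the BF16 pair at positions 2j and 2j+1 of a and b and adds both products to src[j]. The sketch assumes ordinary float arithmetic; the hardware's exact rounding and denormal behavior may differ, so treat it as an approximation. model_dpbf16_ps and the other names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen one BF16 bit pattern to FP32 (exact). */
static float bf16_to_fp32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Hypothetical model of _mm_dpbf16_ps: 4 FP32 lanes, 2 BF16 products each. */
void model_dpbf16_ps(float dst[4], const float src[4],
                     const uint16_t a[8], const uint16_t b[8])
{
    for (int j = 0; j < 4; ++j) {
        float acc = src[j];
        acc += bf16_to_fp32(a[2 * j + 1]) * bf16_to_fp32(b[2 * j + 1]);
        acc += bf16_to_fp32(a[2 * j])     * bf16_to_fp32(b[2 * j]);
        dst[j] = acc;
    }
}

/* BF16 bit patterns for a few exactly representable values. */
enum { BF_HALF = 0x3F00, BF_1 = 0x3F80, BF_2 = 0x4000,
       BF_3 = 0x4040, BF_4 = 0x4080 };

/* Usage: every product and sum here is exact, so the results are too. */
void example_dpbf16(void)
{
    const uint16_t a[8] = {BF_1, BF_2,  BF_3, BF_4,  BF_HALF, BF_2,  BF_4, BF_4};
    const uint16_t b[8] = {BF_1, BF_1,  BF_2, BF_2,  BF_2,    BF_3,  BF_1, BF_HALF};
    const float src[4]  = {10.0f, 0.0f, -1.0f, 100.0f};
    float dst[4];

    model_dpbf16_ps(dst, src, a, b);
    assert(dst[0] == 13.0f);    /* 10 + 1*1 + 2*1     */
    assert(dst[1] == 14.0f);    /* 0 + 3*2 + 4*2      */
    assert(dst[2] == 6.0f);     /* -1 + 0.5*2 + 2*3   */
    assert(dst[3] == 106.0f);   /* 100 + 4*1 + 4*0.5  */
}
```

In the masked forms, each mask bit corresponds to one 32-bit result lane rather than to one BF16 element, which is why the 256-bit form takes an __mmask8.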



_mm_mask_dpbf16_ps

__m128 _mm_mask_dpbf16_ps (__m128 src, __mmask8 k, __m128bh a, __m128bh b)

Instructions: vdpbf16ps xmm {k}, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.



_mm_maskz_dpbf16_ps

__m128 _mm_maskz_dpbf16_ps (__mmask8 k, __m128 src, __m128bh a, __m128bh b)

Instructions: vdpbf16ps xmm {k}{z}, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.



_mm256_dpbf16_ps

__m256 _mm256_dpbf16_ps (__m256 src, __m256bh a, __m256bh b)

Instructions: vdpbf16ps ymm, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst.



_mm256_mask_dpbf16_ps

__m256 _mm256_mask_dpbf16_ps (__m256 src, __mmask8 k, __m256bh a, __m256bh b)

Instructions: vdpbf16ps ymm {k}, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.



_mm256_maskz_dpbf16_ps

__m256 _mm256_maskz_dpbf16_ps (__mmask8 k, __m256 src, __m256bh a, __m256bh b)

Instructions: vdpbf16ps ymm {k}{z}, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.



_mm512_dpbf16_ps

__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)

Instructions: vdpbf16ps zmm, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst.
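
A typical use of the 512-bit form is a dot product over float arrays: convert 32 floats from each input to BF16, then accumulate with the dot-product intrinsic. The sketch below is one possible arrangement, not a recommended implementation. It uses the intrinsics only when the compiler defines __AVX512BF16__ and __AVX512F__ (an assumption about how the target is enabled) and otherwise falls back to a portable scalar loop. dot_bf16, round_to_bf16, and example_dot are hypothetical names, and the fallback handles zero and normal finite inputs only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__AVX512BF16__) && defined(__AVX512F__)
#include <immintrin.h>

/* Intrinsic path: n must be a multiple of 32 in this sketch. */
float dot_bf16(const float *x, const float *y, size_t n)
{
    assert(n % 32 == 0);
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        /* The second argument supplies the lower half of the result. */
        __m512bh xb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(x + i + 16),
                                          _mm512_loadu_ps(x + i));
        __m512bh yb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(y + i + 16),
                                          _mm512_loadu_ps(y + i));
        acc = _mm512_dpbf16_ps(acc, xb, yb);   /* 16 lanes, 2 products each */
    }
    return _mm512_reduce_add_ps(acc);          /* horizontal sum of the lanes */
}
#else
/* Round a float to BF16 precision, keeping it in a float container. */
static float round_to_bf16(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);     /* round to nearest even */
    bits &= 0xFFFF0000u;
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* Portable fallback with the same interface. The summation order differs
   from the intrinsic path, so results can differ in the last bits. */
float dot_bf16(const float *x, const float *y, size_t n)
{
    assert(n % 32 == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += round_to_bf16(x[i]) * round_to_bf16(y[i]);
    return acc;
}
#endif

/* Usage: small integers are exact in BF16, so both paths agree here. */
float example_dot(void)
{
    float x[64], y[64];
    for (int i = 0; i < 64; ++i) {
        x[i] = (float)(i % 8);
        y[i] = 2.0f;
    }
    return dot_bf16(x, y, 64);   /* 8 groups * (0+1+...+7) * 2 */
}
```

Inputs that are not exactly representable in BF16 lose precision in the conversion step, so this pattern suits workloads that tolerate roughly 8 significant bits per operand.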



_mm512_mask_dpbf16_ps

__m512 _mm512_mask_dpbf16_ps (__m512 src, __mmask16 k, __m512bh a, __m512bh b)

Instructions: vdpbf16ps zmm {k}, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.



_mm512_maskz_dpbf16_ps

__m512 _mm512_maskz_dpbf16_ps (__mmask16 k, __m512 src, __m512bh a, __m512bh b)

Instructions: vdpbf16ps zmm {k}{z}, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.