Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) BF16 Instructions

The prototypes for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) BF16 instruction intrinsics are located in the zmmintrin.h header file.

To use these intrinsics, include the immintrin.h file as follows:

#include <immintrin.h>

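Variable	Definition
`a`	first source vector
`b`	second source vector
`src`	source vector that supplies pass-through elements in writemask forms, and the accumulator in dot-product forms
`k`	mask used as a selector; depending on the intrinsic, it may be a writemask or a zeromask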

_mm_cvtne2ps_pbh

__m128bh _mm_cvtne2ps_pbh (__m128 a, __m128 b)

Instructions: vcvtne2ps2bf16 xmm, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst.

_mm_mask_cvtne2ps_pbh

__m128bh _mm_mask_cvtne2ps_pbh (__m128bh src, __mmask8 k, __m128 a, __m128 b)

Instructions: vcvtne2ps2bf16 xmm {k}, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.

_mm_maskz_cvtne2ps_pbh

__m128bh _mm_maskz_cvtne2ps_pbh (__mmask8 k, __m128 a, __m128 b)

Instructions: vcvtne2ps2bf16 xmm {k}{z}, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.

_mm256_cvtne2ps_pbh

__m256bh _mm256_cvtne2ps_pbh (__m256 a, __m256 b)

Instructions: vcvtne2ps2bf16 ymm, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst.

_mm256_mask_cvtne2ps_pbh

__m256bh _mm256_mask_cvtne2ps_pbh (__m256bh src, __mmask16 k, __m256 a, __m256 b)

Instructions: vcvtne2ps2bf16 ymm {k}, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.

_mm256_maskz_cvtne2ps_pbh

__m256bh _mm256_maskz_cvtne2ps_pbh (__mmask16 k, __m256 a, __m256 b)

Instructions: vcvtne2ps2bf16 ymm {k}{z}, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.

_mm512_cvtne2ps_pbh

__m512bh _mm512_cvtne2ps_pbh (__m512 a, __m512 b)

Instructions: vcvtne2ps2bf16 zmm, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst.

_mm512_mask_cvtne2ps_pbh

__m512bh _mm512_mask_cvtne2ps_pbh (__m512bh src, __mmask32 k, __m512 a, __m512 b)

Instructions: vcvtne2ps2bf16 zmm {k}, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.

_mm512_maskz_cvtne2ps_pbh

__m512bh _mm512_maskz_cvtne2ps_pbh (__mmask32 k, __m512 a, __m512 b)

Instructions: vcvtne2ps2bf16 zmm {k}{z}, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in two vectors a and b to packed BF16 (16-bit) floating-point elements, and stores the results in a single vector dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.

_mm_cvtneps_pbh

__m128bh _mm_cvtneps_pbh (__m128 a)

Instructions: vcvtneps2bf16 xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst.

_mm_mask_cvtneps_pbh

__m128bh _mm_mask_cvtneps_pbh (__m128bh src, __mmask8 k, __m128 a)

Instructions: vcvtneps2bf16 xmm {k}, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.

_mm_maskz_cvtneps_pbh

__m128bh _mm_maskz_cvtneps_pbh (__mmask8 k, __m128 a)

Instructions: vcvtneps2bf16 xmm {k}{z}, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.

_mm256_cvtneps_pbh

__m128bh _mm256_cvtneps_pbh (__m256 a)

Instructions: vcvtneps2bf16 xmm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst.

_mm256_mask_cvtneps_pbh

__m128bh _mm256_mask_cvtneps_pbh (__m128bh src, __mmask8 k, __m256 a)

Instructions: vcvtneps2bf16 xmm {k}, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.

_mm256_maskz_cvtneps_pbh

__m128bh _mm256_maskz_cvtneps_pbh (__mmask8 k, __m256 a)

Instructions: vcvtneps2bf16 xmm {k}{z}, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.

_mm512_cvtneps_pbh

__m256bh _mm512_cvtneps_pbh (__m512 a)

Instructions: vcvtneps2bf16 ymm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst.

_mm512_mask_cvtneps_pbh

__m256bh _mm512_mask_cvtneps_pbh (__m256bh src, __mmask16 k, __m512 a)

Instructions: vcvtneps2bf16 ymm {k}, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.

_mm512_maskz_cvtneps_pbh

__m256bh _mm512_maskz_cvtneps_pbh (__mmask16 k, __m512 a)

Instructions: vcvtneps2bf16 ymm {k}{z}, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Converts packed single-precision (32-bit) floating-point elements in a to packed BF16 (16-bit) floating-point elements, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.

_mm_dpbf16_ps

__m128 _mm_dpbf16_ps (__m128 src, __m128bh a, __m128bh b)

Instructions: vdpbf16ps xmm, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst.

_mm_mask_dpbf16_ps

__m128 _mm_mask_dpbf16_ps (__m128 src, __mmask8 k, __m128bh a, __m128bh b)

Instructions: vdpbf16ps xmm {k}, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.

_mm_maskz_dpbf16_ps

__m128 _mm_maskz_dpbf16_ps (__mmask8 k, __m128 src, __m128bh a, __m128bh b)

Instructions: vdpbf16ps xmm {k}{z}, xmm, xmm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.

_mm256_dpbf16_ps

__m256 _mm256_dpbf16_ps (__m256 src, __m256bh a, __m256bh b)

Instructions: vdpbf16ps ymm, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst.

_mm256_mask_dpbf16_ps

__m256 _mm256_mask_dpbf16_ps (__m256 src, __mmask8 k, __m256bh a, __m256bh b)

Instructions: vdpbf16ps ymm {k}, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.

_mm256_maskz_dpbf16_ps

__m256 _mm256_maskz_dpbf16_ps (__mmask8 k, __m256 src, __m256bh a, __m256bh b)

Instructions: vdpbf16ps ymm {k}{z}, ymm, ymm

CPUID Flags: AVX512_BF16 + AVX512VL

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.

_mm512_dpbf16_ps

__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)

Instructions: vdpbf16ps zmm, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst.

_mm512_mask_dpbf16_ps

__m512 _mm512_mask_dpbf16_ps (__m512 src, __mmask16 k, __m512bh a, __m512bh b)

Instructions: vdpbf16ps zmm {k}, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using writemask k. Elements are copied from src when the corresponding mask bit is not set.

_mm512_maskz_dpbf16_ps

__m512 _mm512_maskz_dpbf16_ps (__mmask16 k, __m512 src, __m512bh a, __m512bh b)

Instructions: vdpbf16ps zmm {k}{z}, zmm, zmm

CPUID Flags: AVX512_BF16 + AVX512F

Computes the dot-product of BF16 (16-bit) floating-point pairs in a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in src, and stores the results in dst using zeromask k. Elements are zeroed out when the corresponding mask bit is not set.

Intel® C++ Compiler Classic Developer Guide and Reference

Intrinsics for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) BF16 Instructions