More granularity in __declspec(cpu_dispatch())?

cpu_dispatch is a nice feature.

How can I use cpu_dispatch to differentiate architectures where the shld instruction (1 cycle on SNB only) is faster than rotl (which is faster on all previous and following CPU types, like NHM and IVB)?

How can I get the list of supported identifiers, which seem to follow an exotic naming convention ("core_2nd_gen_avx", "core_i7_sse4_2", ...)?

Is it possible to use documented CPUID flags in the cpu_dispatch clause (like sse4.2, aes, pclmul, rtm) rather than undocumented names?

+1 from me

+1 for implementation under Windows too.

-- With best regards, VooDooMan - If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.

iliyapolak wrote:

Does it help?

Thank you for the great article and for your effort. Just out of curiosity, how did you find this? Which Google keywords did you use, or was it from memory?

iliyapolak

You are welcome.

I simply entered this keyword in the Google search bar: __declspec(cpu_dispatch())

Btw, I have never used the __declspec() keyword with the cpu_dispatch() modifier. I simply query CPUID for the existence of a specific technology (SSEn, AVX).
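
For readers who have not done this by hand, here is a minimal sketch of such a CPUID query. It assumes GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid`; the struct and function names are made up for illustration, and on other targets the sketch simply reports nothing as available. Note that the AVX bit only says the CPU implements AVX; real code also has to check that the OS saves the YMM state (OSXSAVE plus XGETBV), which is left out here.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>      /* GCC/Clang helper; assumed toolchain for this sketch */
#define SKETCH_HAVE_CPUID 1
#else
#define SKETCH_HAVE_CPUID 0
#endif

/* Hypothetical container for the flags mentioned in this thread. */
struct cpu_features {
    int pclmul;
    int sse4_2;
    int aes;
    int avx;    /* CPU flag only, OS support is not checked here */
};

/* Reads CPUID leaf 1 and extracts a few documented ECX feature bits. */
static struct cpu_features query_cpu_features(void)
{
    struct cpu_features f = {0, 0, 0, 0};
#if SKETCH_HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.pclmul = (ecx >> 1)  & 1;   /* CPUID.1:ECX bit 1  = PCLMULQDQ */
        f.sse4_2 = (ecx >> 20) & 1;   /* CPUID.1:ECX bit 20 = SSE4.2    */
        f.aes    = (ecx >> 25) & 1;   /* CPUID.1:ECX bit 25 = AES-NI    */
        f.avx    = (ecx >> 28) & 1;   /* CPUID.1:ECX bit 28 = AVX       */
    }
#endif
    return f;
}
```

Calling `query_cpu_features()` once at startup and keeping the result avoids executing CPUID again on every call.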

Thanks, this is the right link for cpu_dispatch(figure_out_the_list).

However, it does not help to figure out at run time whether the processor has a fast shld or a fast rotl (word rotation is heavily used in cryptography).
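
As background on why shld competes with rotl at all: a double shift with the same register as both operands behaves as a rotate. The sketch below models that in portable C; `rotl32` and `shld32_model` are illustrative names, and the model only covers shift counts 0..31 of the 32-bit form.

```c
#include <stdint.h>

/* Portable rotate-left. Masking keeps both shift counts in 0..31,
   so n == 0 is well defined. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Software model of "SHLD dst, src, n": shift dst left by n and fill
   the vacated low bits with the top n bits of src. */
static uint32_t shld32_model(uint32_t dst, uint32_t src, unsigned n)
{
    n &= 31;
    if (n == 0)
        return dst;             /* a zero count leaves dst unchanged */
    return (dst << n) | (src >> (32 - n));
}
```

With `dst == src`, `shld32_model(x, x, n)` gives the same value as `rotl32(x, n)`, which is why the choice between the two instructions is purely a question of per-microarchitecture timing.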

It does not help either with 4th-generation processors where AES has been disabled for export: there is no way to use the AVX2 instructions.

More generally, there is a complete lack of granularity for options not enabled in virtualized environments, options fused out, options disabled in the BIOS, and so on.

Using CPUID every time is a killer for performance. When we use highly optimized, architecture-specific code to save cycles, it is not a good idea to waste them again on a serializing instruction.

Looking at the implementation, the code generated by the ICC compiler is suboptimal: it is based on testing a constant that is initialized once, and when there are 5 or 6 versions for IA architectures (generic C, assembler, SSE2, AVX, AVX2, with and without AES-NI), the last architecture in the list pays for all the tests. Something like the following, written in C, generates "better" logic.

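// in a C file

#include "fn.h"   // the header shown below; the file name is a placeholder

// int(int) stands in for the real argument list
static int fn_sse(int arg) {
    // SSE version
    return arg;
}

static int fn_generic(int arg) {
    // generic version
    return arg;
}

// Initial target: picks the right pointer at first use, then forwards the call.
// cpu_has_sse() is a placeholder for a CPUID-based test.
static int fn_check_and_generic(int arg) {
    fn_pointer = cpu_has_sse() ? fn_sse : fn_generic;
    return fn_pointer(arg);
}

int (*fn_pointer)(int) = fn_check_and_generic;

And in a header file

extern int (*fn_pointer)(int);

static inline int fn(int arg) {
    return fn_pointer(arg);    // dispatch here; the first call changes the pointer
}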

Jennifer J. (Intel)
Best Reply

There are several new intrinsics:

extern void __ICL_INTRINCC _allow_cpu_features(unsigned __int64);
extern int __ICL_INTRINCC _may_i_use_cpu_feature(unsigned __int64);

Please see immintrin.h for details about the possible parameter values, and see whether it is detailed enough for your needs. Also see this article about "_allow_cpu_features()".
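
A minimal sketch of how such a query might look, not taken from the article: on the classic Intel compiler it calls `_may_i_use_cpu_feature` with `_FEATURE_*` masks from immintrin.h, and on GCC or Clang it falls back to `__builtin_cpu_supports` as an assumed rough equivalent. The function name `can_use_avx2_and_aes` is made up for illustration.

```c
#if defined(__INTEL_COMPILER)
#include <immintrin.h>

/* Classic Intel compiler: feature masks can be ORed together; the
   result is normalized to 0 or 1 here. */
static int can_use_avx2_and_aes(void)
{
    return _may_i_use_cpu_feature(_FEATURE_AVX2 | _FEATURE_AES) != 0;
}

#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))

/* GCC/Clang fallback (an assumption of this sketch, not an Intel API). */
static int can_use_avx2_and_aes(void)
{
    return __builtin_cpu_supports("avx2") && __builtin_cpu_supports("aes");
}

#else

/* Unknown toolchain or non-x86 target: report the features as missing. */
static int can_use_avx2_and_aes(void)
{
    return 0;
}

#endif
```

Because AVX2 and AES are tested as separate features, a part with AES disabled could still take an AVX2 path by testing `_FEATURE_AVX2` alone, which is the granularity asked for above.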


iliyapolak

Thanks Jennifer for this valuable information.
